| title | tag | source |
| --- | --- | --- |
| Sentencizer | class | `spacy/pipeline/pipes.pyx` |
A simple pipeline component to allow custom sentence boundary detection logic
that doesn't require the dependency parse. By default, sentence segmentation is
performed by the `DependencyParser`, so the `Sentencizer` lets you implement a
simpler, rule-based strategy that doesn't require a statistical model to be
loaded. The component is also available via the string name `"sentencizer"`.
After initialization, it is typically added to the processing pipeline using
`nlp.add_pipe`.
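To make the rule-based strategy concrete, here is a rough pure-Python sketch of
the underlying idea: a token that follows sentence-final punctuation is marked
as a sentence start. This is an illustration only, not spaCy's actual
implementation; the function name and token representation are hypothetical.

```python
# Sketch of a rule-based sentence boundary strategy (illustrative only).
PUNCT_CHARS = [".", "!", "?"]  # the documented default punctuation characters

def mark_sentence_starts(tokens):
    """Return one boolean per token: True where a new sentence starts."""
    starts = [True] + [False] * (len(tokens) - 1)  # first token starts a sentence
    seen_punct = False
    for i, token in enumerate(tokens):
        is_punct = token in PUNCT_CHARS
        if seen_punct and not is_punct:
            # First non-punctuation token after sentence-final punctuation
            starts[i] = True
            seen_punct = False
        elif is_punct:
            seen_punct = True
    return starts

tokens = ["This", "is", "a", "sentence", ".", "This", "is", "another", "."]
assert mark_sentence_starts(tokens)[5] is True  # "This" after "." starts a sentence
```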
Compared to the previous `SentenceSegmenter` class, the `Sentencizer` component
doesn't add a hook to `doc.user_hooks["sents"]`. Instead, it iterates over the
tokens in the `Doc` and sets the `Token.is_sent_start` property. The
`SentenceSegmenter` is still available if you import it directly:

```python
from spacy.pipeline import SentenceSegmenter
```
## Sentencizer.\_\_init\_\_

Initialize the sentencizer.

Example:

```python
# Construction via create_pipe
sentencizer = nlp.create_pipe("sentencizer")

# Construction from class
from spacy.pipeline import Sentencizer
sentencizer = Sentencizer()
```

| Name | Type | Description |
| --- | --- | --- |
| `punct_chars` | list | Optional custom list of punctuation characters that mark sentence ends. Defaults to `[".", "!", "?"]`. |
| **RETURNS** | `Sentencizer` | The newly constructed object. |
## Sentencizer.\_\_call\_\_

Apply the sentencizer on a `Doc`. Typically, this happens automatically after
the component has been added to the pipeline using `nlp.add_pipe`.

Example:

```python
from spacy.lang.en import English

nlp = English()
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
doc = nlp(u"This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2
```

| Name | Type | Description |
| --- | --- | --- |
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| **RETURNS** | `Doc` | The modified `Doc` with added sentence boundaries. |
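The sentence boundaries exposed via `doc.sents` follow directly from the
per-token `is_sent_start` flags. A minimal pure-Python sketch (not spaCy's
API; the helper name is hypothetical) of how boundary flags translate into
sentence groups:

```python
def group_sentences(tokens, starts):
    """Group tokens into sentences using per-token sentence-start flags."""
    sentences = []
    for token, is_start in zip(tokens, starts):
        if is_start:
            sentences.append([])  # a True flag opens a new sentence
        sentences[-1].append(token)
    return sentences

tokens = ["This", "is", "a", "sentence", ".", "This", "is", "another", "sentence", "."]
starts = [True, False, False, False, False, True, False, False, False, False]
assert len(group_sentences(tokens, starts)) == 2  # two sentences, as in the example above
```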
## Sentencizer.to_disk

Save the sentencizer settings (punctuation characters) to a directory. Will
create a file `sentencizer.json`. This also happens automatically when you save
an `nlp` object with a sentencizer added to its pipeline.

Example:

```python
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer.to_disk("/path/to/sentencizer.json")
```

| Name | Type | Description |
| --- | --- | --- |
| `path` | unicode / `Path` | A path to a file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
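Since the settings file is plain JSON, the round trip can be sketched with the
standard library alone. The exact key layout inside `sentencizer.json` is an
assumption here (inferred from the `punct_chars` constructor argument), so
treat this as an illustration of the idea rather than the confirmed file
format:

```python
import json
import tempfile
from pathlib import Path

# Assumed on-disk layout: a JSON object holding the punctuation characters.
settings = {"punct_chars": [".", "?", "!", "。"]}

path = Path(tempfile.mkdtemp()) / "sentencizer.json"
path.write_text(json.dumps(settings), encoding="utf8")   # roughly what to_disk persists

loaded = json.loads(path.read_text(encoding="utf8"))     # roughly what from_disk reads back
assert loaded["punct_chars"] == settings["punct_chars"]
```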
## Sentencizer.from_disk

Load the sentencizer settings from a file. Expects a JSON file. This also
happens automatically when you load an `nlp` object or model with a
sentencizer added to its pipeline.

Example:

```python
sentencizer = Sentencizer()
sentencizer.from_disk("/path/to/sentencizer.json")
```

| Name | Type | Description |
| --- | --- | --- |
| `path` | unicode / `Path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. |
| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
## Sentencizer.to_bytes

Serialize the sentencizer settings to a bytestring.

Example:

```python
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer_bytes = sentencizer.to_bytes()
```

| Name | Type | Description |
| --- | --- | --- |
| **RETURNS** | bytes | The serialized data. |
## Sentencizer.from_bytes

Load the pipe from a bytestring. Modifies the object in place and returns it.

Example:

```python
sentencizer_bytes = sentencizer.to_bytes()
sentencizer = Sentencizer()
sentencizer.from_bytes(sentencizer_bytes)
```

| Name | Type | Description |
| --- | --- | --- |
| `bytes_data` | bytes | The bytestring to load. |
| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
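Conceptually, the bytes round trip is the same settings payload as the disk
round trip, just held in memory. A hedged sketch using the standard library
(the payload layout is an assumption, mirroring the `punct_chars` setting, not
the confirmed wire format):

```python
import json

punct_chars = [".", "?", "!", "。"]

# to_bytes: serialize the settings to a bytestring
data = json.dumps({"punct_chars": punct_chars}).encode("utf8")
assert isinstance(data, bytes)

# from_bytes: restore the settings from the bytestring
restored = json.loads(data.decode("utf8"))
assert restored["punct_chars"] == punct_chars
```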