| title | tag | source |
|---|---|---|
| Sentencizer | class | spacy/pipeline/pipes.pyx |
A simple pipeline component to allow custom sentence boundary detection logic
that doesn't require the dependency parse. By default, sentence segmentation is
performed by the `DependencyParser`, so the `Sentencizer` lets you implement a
simpler, rule-based strategy that doesn't require a statistical model to be
loaded. The component is also available via the string name `"sentencizer"`.
After initialization, it is typically added to the processing pipeline using
`nlp.add_pipe`.
Compared to the previous `SentenceSegmenter` class, the `Sentencizer` component
doesn't add a hook to `doc.user_hooks["sents"]`. Instead, it iterates over the
tokens in the `Doc` and sets the `Token.is_sent_start` property. The
`SentenceSegmenter` is still available if you import it directly:

```python
from spacy.pipeline import SentenceSegmenter
```
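As a rough illustration of the rule-based strategy described above, here is a plain-Python sketch (the `sentence_starts` helper and token list are hypothetical stand-ins, not spaCy's API): a token following sentence-final punctuation is marked as a sentence start, mirroring how `Token.is_sent_start` is set.

```python
# Plain-Python sketch of the Sentencizer's rule-based idea (hypothetical
# helper, not spaCy's implementation): a token that follows sentence-final
# punctuation starts a new sentence.
def sentence_starts(tokens, punct_chars=(".", "!", "?")):
    starts = [True] + [False] * (len(tokens) - 1)  # first token starts a sentence
    seen_punct = False
    for i, tok in enumerate(tokens):
        if seen_punct and tok not in punct_chars:
            starts[i] = True  # first non-punctuation token after punctuation
            seen_punct = False
        elif tok in punct_chars:
            seen_punct = True
    return starts

tokens = ["This", "is", "a", "sentence", ".", "This", "is", "another", "."]
print(sentence_starts(tokens))
```

The real component works on a `Doc` and handles more edge cases, but the core idea is the same flag-per-token pass.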
Sentencizer.__init__
Initialize the sentencizer.
Example
```python
# Construction via create_pipe
sentencizer = nlp.create_pipe("sentencizer")

# Construction from class
from spacy.pipeline import Sentencizer
sentencizer = Sentencizer()
```
| Name | Type | Description |
|---|---|---|
| `punct_chars` | list | Optional custom list of punctuation characters that mark sentence ends. Defaults to `[".", "!", "?"]`. |
| **RETURNS** | `Sentencizer` | The newly constructed object. |
Sentencizer.__call__
Apply the sentencizer on a `Doc`. Typically, this happens automatically after
the component has been added to the pipeline using `nlp.add_pipe`.
Example
```python
from spacy.lang.en import English

nlp = English()
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2
```
| Name | Type | Description |
|---|---|---|
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| **RETURNS** | `Doc` | The modified `Doc` with added sentence boundaries. |
Sentencizer.to_disk
Save the sentencizer settings (punctuation characters) to a directory. Will
create a file `sentencizer.json`. This also happens automatically when you save
an `nlp` object with a sentencizer added to its pipeline.
Example
```python
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer.to_disk("/path/to/sentencizer.json")
```
| Name | Type | Description |
|---|---|---|
| `path` | unicode / `Path` | A path to a file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
Sentencizer.from_disk
Load the sentencizer settings from a file. Expects a JSON file. This also
happens automatically when you load an `nlp` object or model with a sentencizer
added to its pipeline.
Example
```python
sentencizer = Sentencizer()
sentencizer.from_disk("/path/to/sentencizer.json")
```
| Name | Type | Description |
|---|---|---|
| `path` | unicode / `Path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. |
| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
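To give a feel for what such a settings round-trip looks like, here is a minimal standard-library sketch. The `sentencizer.json` file name follows the text above, but the `"punct_chars"` key is an assumption about the schema, not spaCy's documented format.

```python
# Minimal sketch of a JSON settings round-trip like to_disk/from_disk.
# The "punct_chars" key is an assumed schema, not spaCy's wire format.
import json
import pathlib
import tempfile

settings = {"punct_chars": [".", "?", "!", "。"]}
path = pathlib.Path(tempfile.mkdtemp()) / "sentencizer.json"
path.write_text(json.dumps(settings, ensure_ascii=False), encoding="utf8")

loaded = json.loads(path.read_text(encoding="utf8"))
assert loaded == settings
```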
Sentencizer.to_bytes
Serialize the sentencizer settings to a bytestring.
Example
```python
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer_bytes = sentencizer.to_bytes()
```
| Name | Type | Description |
|---|---|---|
| **RETURNS** | bytes | The serialized data. |
Sentencizer.from_bytes
Load the pipe from a bytestring. Modifies the object in place and returns it.
Example
```python
sentencizer_bytes = sentencizer.to_bytes()
sentencizer = Sentencizer()
sentencizer.from_bytes(sentencizer_bytes)
```
| Name | Type | Description |
|---|---|---|
| `bytes_data` | bytes | The bytestring to load. |
| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
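For intuition, the same kind of round-trip can be sketched with plain bytes; the JSON payload here is an assumed stand-in, not spaCy's actual serialized format.

```python
# Sketch of a bytes round-trip analogous to to_bytes/from_bytes.
# The JSON payload is an assumption, not spaCy's actual wire format.
import json

settings = {"punct_chars": [".", "!", "?"]}
data = json.dumps(settings).encode("utf8")    # cf. sentencizer.to_bytes()
restored = json.loads(data.decode("utf8"))    # cf. Sentencizer().from_bytes(data)
assert restored == settings
```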