title | tag | source |
---|---|---|
Sentencizer | class | spacy/pipeline/pipes.pyx |
A simple pipeline component that allows custom sentence boundary detection
logic without requiring the dependency parse. By default, sentence segmentation
is performed by the DependencyParser, so the Sentencizer lets you implement a
simpler, rule-based strategy that doesn't require a statistical model to be
loaded. The component is also available via the string name "sentencizer".
Sentencizer.__init__
Initialize the sentencizer.
Example
# Construction via add_pipe
sentencizer = nlp.add_pipe("sentencizer")
Name | Type | Description |
---|---|---|
punct_chars | list | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. |
RETURNS | Sentencizer | The newly constructed object. |
### punct_chars defaults
['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።',
'፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫',
'᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉',
'⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈',
'꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '！', '．', '？',
'𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅',
'𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂',
'𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓',
'𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛',
'𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '｡', '。']
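To make the role of punct_chars concrete, here is a minimal pure-Python sketch of rule-based splitting on sentence-final characters. It is an illustration only and not spaCy's implementation: the function name split_on_punct is hypothetical, it scans raw characters rather than tokens, and its default set is a three-character subset of the list above.

```python
def split_on_punct(text, punct_chars=(".", "?", "!")):
    """Split text after each run of sentence-final characters (hypothetical helper)."""
    punct = set(punct_chars)
    sentences = []
    start = 0
    i = 0
    n = len(text)
    while i < n:
        if text[i] in punct:
            # Consume the whole run, so "?!" ends a single sentence.
            while i < n and text[i] in punct:
                i += 1
            sentences.append(text[start:i].strip())
            start = i
        else:
            i += 1
    # Keep trailing text that has no closing punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_on_punct("This is a sentence. This is another sentence."))
# ['This is a sentence.', 'This is another sentence.']
```

Passing a different collection, for example punct_chars={"。"}, changes which characters close a sentence, which mirrors the purpose of the punct_chars argument.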
Sentencizer.__call__
Apply the sentencizer on a Doc. Typically, this happens automatically after
the component has been added to the pipeline using nlp.add_pipe.
Example
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2
Name | Type | Description |
---|---|---|
doc | Doc | The Doc object to process, e.g. the Doc in the pipeline. |
RETURNS | Doc | The modified Doc with added sentence boundaries. |
Sentencizer.to_disk
Save the sentencizer settings (punctuation characters) to a directory. Will
create a file sentencizer.json. This also happens automatically when you save
an nlp object with a sentencizer added to its pipeline.
Example
from spacy.pipeline import Sentencizer

sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer.to_disk("/path/to/sentencizer.json")
Name | Type | Description |
---|---|---|
path | str / Path | A path to a file, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. |
Sentencizer.from_disk
Load the sentencizer settings from a file. Expects a JSON file. This also
happens automatically when you load an nlp object or model with a sentencizer
added to its pipeline.
Example
from spacy.pipeline import Sentencizer

sentencizer = Sentencizer()
sentencizer.from_disk("/path/to/sentencizer.json")
Name | Type | Description |
---|---|---|
path | str / Path | A path to a JSON file. Paths may be either strings or Path-like objects. |
RETURNS | Sentencizer | The modified Sentencizer object. |
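The save and load behaviour described for to_disk and from_disk amounts to a JSON round trip of the punctuation characters. The sketch below shows that idea with the standard library only. The helper names save_settings and load_settings and the "punct_chars" key are assumptions made for illustration; the exact layout of the file spaCy writes may differ.

```python
import json
import tempfile
from pathlib import Path


def save_settings(path, punct_chars):
    """Write the punctuation characters to a JSON file (hypothetical helper)."""
    # Sorting gives a stable file; ensure_ascii=False keeps "。" readable.
    payload = {"punct_chars": sorted(punct_chars)}
    Path(path).write_text(json.dumps(payload, ensure_ascii=False), encoding="utf8")


def load_settings(path):
    """Read the punctuation characters back as a set (hypothetical helper)."""
    data = json.loads(Path(path).read_text(encoding="utf8"))
    return set(data["punct_chars"])


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    settings_path = Path(tmp_dir) / "sentencizer.json"
    save_settings(settings_path, [".", "?", "!", "。"])
    restored = load_settings(settings_path)

assert restored == {".", "?", "!", "。"}
```

Accepting either a string or a Path and converting with Path(path) matches the "str / Path" type given in the tables above.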
Sentencizer.to_bytes
Serialize the sentencizer settings to a bytestring.
Example
from spacy.pipeline import Sentencizer

sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer_bytes = sentencizer.to_bytes()
Name | Type | Description |
---|---|---|
RETURNS | bytes | The serialized data. |
Sentencizer.from_bytes
Load the pipe from a bytestring. Modifies the object in place and returns it.
Example
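from spacy.pipeline import Sentencizer

# Serialize an existing sentencizer, then restore it into a fresh object
sentencizer_bytes = sentencizer.to_bytes()
sentencizer = Sentencizer()
sentencizer.from_bytes(sentencizer_bytes)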
Name | Type | Description |
---|---|---|
bytes_data | bytes | The bytestring to load. |
RETURNS | Sentencizer | The modified Sentencizer object. |
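The "modifies the object in place and returns it" contract can be illustrated with a small stand-in class. TinySentencizer is hypothetical and stores only the punctuation characters. JSON is used here as a stand-in encoding and is not claimed to be the format spaCy uses for its bytestrings.

```python
import json


class TinySentencizer:
    """Hypothetical stand-in that only stores punctuation characters."""

    def __init__(self, punct_chars=(".", "?", "!")):
        self.punct_chars = set(punct_chars)

    def to_bytes(self):
        # Encode the settings as UTF-8 JSON (stand-in serialization format).
        return json.dumps(sorted(self.punct_chars)).encode("utf8")

    def from_bytes(self, bytes_data):
        # Overwrite this object's settings, then return the same object.
        self.punct_chars = set(json.loads(bytes_data.decode("utf8")))
        return self


source = TinySentencizer(punct_chars=[".", "?", "!", "。"])
target = TinySentencizer()
result = target.from_bytes(source.to_bytes())

assert result is target  # same object, modified in place
assert target.punct_chars == source.punct_chars
```

Returning the object itself is what makes one-line chaining such as TinySentencizer().from_bytes(data) possible.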