title | tag | source |
SentenceSegmenter | class | spacy/pipeline.pyx |
A simple spaCy hook to allow custom sentence boundary detection logic that
doesn't require the dependency parse. By default, sentence segmentation is
performed by the DependencyParser, so the SentenceSegmenter lets you implement
a simpler, rule-based strategy that doesn't require a statistical model to be
loaded. The component is also available via the string name "sentencizer".
After initialization, it is typically added to the processing pipeline using
nlp.add_pipe.
SentenceSegmenter.__init__
Initialize the sentence segmenter. To change the sentence boundary detection
strategy, pass a generator function strategy on initialization, or assign a
new strategy to the .strategy attribute. Sentence detection strategies should
be generators that take Doc objects and yield Span objects for each sentence.
Example
from spacy.lang.en import English
from spacy.pipeline import SentenceSegmenter

nlp = English()
# Construction via create_pipe
sentencizer = nlp.create_pipe("sentencizer")
# Construction from class
sentencizer = SentenceSegmenter(nlp.vocab)
Name | Type | Description |
vocab | Vocab | The shared vocabulary. |
strategy | unicode / callable | The segmentation strategy to use. Defaults to "on_punct". |
RETURNS | SentenceSegmenter | The newly constructed object. |
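As a rough illustration of the strategy contract described above (a generator that takes a document and yields one slice per sentence), here is a minimal sketch of a custom strategy that ends a sentence at each newline token. The function name split_on_newlines and the Token stand-in are hypothetical and not part of spaCy; the stand-in is only there so the sketch can run without loading the library.

```python
from collections import namedtuple

# Hypothetical stand-in for a token; only the `.text` attribute is used.
Token = namedtuple("Token", ["text"])


def split_on_newlines(doc):
    """Yield one slice of `doc` per line, ending a sentence at each newline token."""
    start = 0
    for i, token in enumerate(doc):
        if token.text.startswith("\n"):
            # Close the current sentence, including the newline token itself.
            yield doc[start:i + 1]
            start = i + 1
    if start < len(doc):
        # Emit any tokens left over after the last newline.
        yield doc[start:len(doc)]


# Exercise the strategy on a plain list of stand-in tokens.
tokens = [Token(t) for t in ["First", "line", "\n", "Second", "line"]]
sentences = [[t.text for t in sent] for sent in split_on_newlines(tokens)]
```

Because the function only iterates over and slices its argument, it should in principle also accept a real Doc, where slicing yields Span objects. Under that assumption it could be passed as strategy on initialization or assigned to the .strategy attribute.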
SentenceSegmenter.__call__
Apply the sentence segmenter on a Doc. Typically, this happens automatically
after the component has been added to the pipeline using nlp.add_pipe.
Example
from spacy.lang.en import English
nlp = English()
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
doc = nlp(u"This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2
Name | Type | Description |
doc | Doc | The Doc object to process, e.g. the Doc in the pipeline. |
RETURNS | Doc | The modified Doc with added sentence boundaries. |
SentenceSegmenter.split_on_punct
Split the Doc on the punctuation characters ".", "!" and "?". This is the
default strategy used by the SentenceSegmenter.
Name | Type | Description |
doc | Doc | The Doc object to process. |
YIELDS | Span | The sentences in the document. |
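To make the idea of punctuation-based splitting concrete, the following is a simplified sketch that operates on a plain list of token strings rather than a Doc. It is not the library's implementation: the function name split_on_punct_sketch and the crude punctuation test (a token with no alphanumeric characters) are assumptions made for illustration.

```python
def split_on_punct_sketch(tokens):
    """Yield lists of tokens, closing a sentence after '.', '!' or '?'."""
    start = 0
    seen_terminator = False
    for i, text in enumerate(tokens):
        # Assumption: a token counts as punctuation if it has no alphanumeric characters.
        is_punct = not any(ch.isalnum() for ch in text)
        if seen_terminator and not is_punct:
            # The first non-punctuation token after a terminator starts a new sentence.
            yield tokens[start:i]
            start = i
            seen_terminator = False
        elif text in (".", "!", "?"):
            seen_terminator = True
    if start < len(tokens):
        # Emit the final sentence, whether or not it ended with a terminator.
        yield tokens[start:]


tokens = ["This", "is", "a", "sentence", ".", "Is", "this", "one", "?", "!", "Yes"]
sentences = list(split_on_punct_sketch(tokens))
```

Waiting for the next non-punctuation token before closing a sentence keeps runs such as "?!" attached to the sentence they end, instead of producing a one-token sentence for each mark.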
Attributes
Name | Type | Description |
strategy | callable | The segmentation strategy. Can be overwritten after initialization. |