| title | tag | source | teaser | api_string_name | api_trainable |
|---|---|---|---|---|---|
| Sentencizer | class | spacy/pipeline/sentencizer.pyx | Pipeline component for rule-based sentence boundary detection | sentencizer | false | 
A simple pipeline component to allow custom sentence boundary detection logic
that doesn't require the dependency parse. By default, sentence segmentation is
performed by the DependencyParser, so the
Sentencizer lets you implement a simpler, rule-based strategy that doesn't
require a statistical model to be loaded.
Assigned Attributes
Calculated values will be assigned to Token.is_sent_start. The resulting
sentences can be accessed using Doc.sents.
| Location | Value | 
|---|---|
| Token.is_sent_start | A boolean value indicating whether the token starts a sentence. This will be either True or False for all tokens. |
| Doc.sents | An iterator over sentences in the Doc, determined by Token.is_sent_start values. |
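
As a minimal sketch of how the two attributes relate, the snippet below runs the sentencizer in a blank English pipeline (tokenizer only, an assumption chosen so that no trained model is needed) and reads both Token.is_sent_start and Doc.sents.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical model is loaded.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another sentence.")

# Token.is_sent_start is True for the first token of each sentence.
sent_starts = [token.text for token in doc if token.is_sent_start]

# Doc.sents groups the tokens into spans based on those flags.
sentences = [sent.text for sent in doc.sents]

print(sent_starts)
print(sentences)
```
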
Config and implementation
The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
config argument on nlp.add_pipe or in your
config.cfg for training.
Example

    config = {"punct_chars": None}
    nlp.add_pipe("sentencizer", config=config)
| Setting | Description | 
|---|---|
| punct_chars | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to None. | 
| overwrite (v3.2+) | Whether existing annotation is overwritten. Defaults to False. |
| scorer (v3.2+) | The scoring method. Defaults to Scorer.score_spans for the attribute "sents". |
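
To make the effect of punct_chars concrete, here is a small sketch that compares the default setting with a custom list. The example text and the choice of ";" as the only sentence-ending character are illustrative assumptions, not recommended settings.

```python
import spacy

text = "First clause; second clause."

# Default settings: ";" is not among the default sentence-ending characters.
nlp_default = spacy.blank("en")
nlp_default.add_pipe("sentencizer")
default_sents = [sent.text for sent in nlp_default(text).sents]

# Custom settings: treat only ";" as a sentence-ending character.
nlp_custom = spacy.blank("en")
nlp_custom.add_pipe("sentencizer", config={"punct_chars": [";"]})
custom_sents = [sent.text for sent in nlp_custom(text).sents]

print(default_sents)
print(custom_sents)
```

Note that a custom list replaces the defaults rather than extending them, which is why the final "." no longer matters in the second pipeline.
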
Sentencizer.__init__
Initialize the sentencizer.
Example

    # Construction via add_pipe
    sentencizer = nlp.add_pipe("sentencizer")

    # Construction from class
    from spacy.pipeline import Sentencizer
    sentencizer = Sentencizer()
| Name | Description | 
|---|---|
| keyword-only | |
| punct_chars | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. | 
| overwrite (v3.2+) | Whether existing annotation is overwritten. Defaults to False. |
| scorer (v3.2+) | The scoring method. Defaults to Scorer.score_spans for the attribute "sents". |
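
The following sketch illustrates the overwrite setting, assuming spaCy v3.2 or later. A sentence start is set by hand on one token before the component runs; the helper name segment and the example text are hypothetical and only serve the illustration.

```python
import spacy
from spacy.pipeline import Sentencizer

nlp = spacy.blank("en")
text = "Hello world again. Bye."


def segment(overwrite):
    """Tokenize the text, pre-set one boundary by hand, then run the sentencizer."""
    doc = nlp.make_doc(text)
    doc[1].is_sent_start = True  # manual annotation on "world"
    sentencizer = Sentencizer(overwrite=overwrite)
    return [sent.text for sent in sentencizer(doc).sents]


kept = segment(overwrite=False)     # manual boundary is preserved
replaced = segment(overwrite=True)  # manual boundary is replaced by the rules

print(kept)
print(replaced)
```
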
### punct_chars defaults
['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።',
 '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫',
 '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉',
 '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈',
 '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '！', '．', '？',
 '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅',
 '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂',
 '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓',
 '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛',
 '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '｡', '。']
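
To illustrate how a list of sentence-ending characters like the one above can drive segmentation, here is a simplified pure-Python sketch. It is not spaCy's implementation: the function name sentence_starts is hypothetical, it works on plain strings instead of Token objects, and the real component may differ in details.

```python
def sentence_starts(tokens, punct_chars=(".", "!", "?")):
    """Return one boolean per token: True where a new sentence begins."""
    starts = [False] * len(tokens)
    seen_end = False
    for i, tok in enumerate(tokens):
        if tok in punct_chars:
            # Remember that the current sentence has reached its end.
            # Runs like "?" "!" stay attached to that sentence.
            seen_end = True
        elif seen_end:
            # First non-terminator token after a terminator opens a sentence.
            starts[i] = True
            seen_end = False
    if tokens:
        starts[0] = True  # the first token always starts a sentence
    return starts


flags = sentence_starts(["Is", "it", "?", "!", "Yes", "."])
print(flags)
```
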
Sentencizer.__call__
Apply the sentencizer on a Doc. Typically, this happens automatically after
the component has been added to the pipeline using
nlp.add_pipe.
Example

    from spacy.lang.en import English

    nlp = English()
    nlp.add_pipe("sentencizer")
    doc = nlp("This is a sentence. This is another sentence.")
    assert len(list(doc.sents)) == 2
| Name | Description | 
|---|---|
| doc | The Doc object to process, e.g. the Doc in the pipeline. |
| RETURNS | The modified Doc with added sentence boundaries. |
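
For contrast with the automatic case, this sketch calls the component directly on a Doc without adding it to the pipeline. The use of a blank English pipeline and of Doc.has_annotation to inspect the state before the call are choices made for this illustration.

```python
import spacy
from spacy.pipeline import Sentencizer

nlp = spacy.blank("en")
sentencizer = Sentencizer()  # standalone component, not part of nlp

# The blank pipeline only tokenizes, so no sentence boundaries are set yet.
doc = nlp("This is a sentence. This is another sentence.")
had_boundaries_before = doc.has_annotation("SENT_START")

# Calling the component sets the boundaries and returns the same Doc.
doc = sentencizer(doc)
n_sents = len(list(doc.sents))

print(had_boundaries_before, n_sents)
```
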
Sentencizer.pipe
Apply the pipe to a stream of documents. This usually happens under the hood
when the nlp object is called on a text and all pipeline components are
applied to the Doc in order.
Example

    sentencizer = nlp.add_pipe("sentencizer")
    # docs is assumed to be an iterable of Doc objects
    for doc in sentencizer.pipe(docs, batch_size=50):
        pass
| Name | Description | 
|---|---|
| stream | A stream of documents. | 
| keyword-only | |
| batch_size | The number of documents to buffer. Defaults to 128. | 
| YIELDS | The processed documents in order. | 
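
A self-contained variant of the example, sketched under the assumption that the input Doc objects are created with nlp.make_doc (tokenization only). The texts are made up for the illustration.

```python
import spacy

nlp = spacy.blank("en")
sentencizer = nlp.add_pipe("sentencizer")

texts = ["One. Two.", "Just one sentence.", "Really? Yes! Done."]

# Tokenize only, so that the sentencizer is the component doing the work.
docs = [nlp.make_doc(text) for text in texts]

# pipe() yields the processed documents in their original order.
counts = [len(list(doc.sents)) for doc in sentencizer.pipe(docs, batch_size=2)]

print(counts)
```
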
Sentencizer.to_disk
Save the sentencizer settings (punctuation characters) to a JSON file, which is
created if it doesn't exist. This also happens automatically when you save an
nlp object with a sentencizer added to its pipeline; in that case the settings
are written to a file sentencizer.json in the pipeline directory.
Example

    config = {"punct_chars": [".", "?", "!", "。"]}
    sentencizer = nlp.add_pipe("sentencizer", config=config)
    sentencizer.to_disk("/path/to/sentencizer.json")
| Name | Description | 
|---|---|
| path | A path to a JSON file, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. | 
Sentencizer.from_disk
Load the sentencizer settings from a file. Expects a JSON file. This also
happens automatically when you load an nlp object or model with a sentencizer
added to its pipeline.
Example

    sentencizer = nlp.add_pipe("sentencizer")
    sentencizer.from_disk("/path/to/sentencizer.json")
| Name | Description | 
|---|---|
| path | A path to a JSON file. Paths may be either strings or Path-like objects. | 
| RETURNS | The modified Sentencizer object. |
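
The round trip through to_disk and from_disk can be sketched as follows. The temporary directory stands in for a real path, and reading the settings back via the punct_chars attribute assumes the component exposes them under that name.

```python
import tempfile
from pathlib import Path

from spacy.pipeline import Sentencizer

original = Sentencizer(punct_chars=[".", "?", "!", "。"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sentencizer.json"
    original.to_disk(path)
    file_was_created = path.exists()

    # A fresh component starts with the defaults and then loads the saved settings.
    restored = Sentencizer().from_disk(path)

restored_chars = set(restored.punct_chars)
print(file_was_created, sorted(restored_chars))
```
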
Sentencizer.to_bytes
Serialize the sentencizer settings to a bytestring.
Example

    config = {"punct_chars": [".", "?", "!", "。"]}
    sentencizer = nlp.add_pipe("sentencizer", config=config)
    sentencizer_bytes = sentencizer.to_bytes()
| Name | Description | 
|---|---|
| RETURNS | The serialized data. | 
Sentencizer.from_bytes
Load the pipe from a bytestring. Modifies the object in place and returns it.
Example

    sentencizer_bytes = sentencizer.to_bytes()
    sentencizer = nlp.add_pipe("sentencizer")
    sentencizer.from_bytes(sentencizer_bytes)
| Name | Description | 
|---|---|
| bytes_data | The bytestring to load. | 
| RETURNS | The modified Sentencizer object. |
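
A minimal sketch of the bytes round trip, using two standalone components so that the example does not depend on an existing nlp object. As above, inspecting punct_chars on the restored component is an assumption about how the settings are exposed.

```python
from spacy.pipeline import Sentencizer

original = Sentencizer(punct_chars=[".", "?", "!", "。"])
data = original.to_bytes()

# from_bytes modifies the receiving object in place and also returns it.
restored = Sentencizer()
returned = restored.from_bytes(data)

print(type(data).__name__, sorted(restored.punct_chars))
```
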