	Update sentence segmentation usage docs
Update sentence segmentation usage docs to incorporate `senter`.
parent e1e1760fd6
commit 48df50533d

@@ -10,7 +10,7 @@ api_trainable: true
 ---
 
 A trainable pipeline component for sentence segmentation. For a simpler,
-ruse-based strategy, see the [`Sentencizer`](/api/sentencizer).
+rule-based strategy, see the [`Sentencizer`](/api/sentencizer).
 
 ## Config and implementation {#config}
 
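The hunk above fixes a typo on the [`SentenceRecognizer`](/api/sentencerecognizer) API page, which describes a trainable sentence-segmentation component. As a minimal sketch of what "trainable pipeline component" means in practice (assuming spaCy v3's string-based `nlp.add_pipe`, which is not part of this commit):

```python
import spacy

# Adding the trainable SentenceRecognizer ("senter") to a blank pipeline.
# Unlike the rule-based Sentencizer, it must be trained before it can
# predict sentence boundaries.
nlp = spacy.blank("en")
senter = nlp.add_pipe("senter")
print(nlp.pipe_names)  # ['senter']
```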
@@ -1472,28 +1472,45 @@ print("After:", [(token.text, token._.is_musician) for token in doc])
 
 ## Sentence Segmentation {#sbd}
 
-<!-- TODO: include senter -->
-
 A [`Doc`](/api/doc) object's sentences are available via the `Doc.sents`
-property. Unlike other libraries, spaCy uses the dependency parse to determine
-sentence boundaries. This is usually more accurate than a rule-based approach,
-but it also means you'll need a **statistical model** and accurate predictions.
-If your texts are closer to general-purpose news or web text, this should work
-well out-of-the-box. For social media or conversational text that doesn't follow
-the same rules, your application may benefit from a custom rule-based
-implementation. You can either use the built-in
-[`Sentencizer`](/api/sentencizer) or plug an entirely custom rule-based function
-into your [processing pipeline](/usage/processing-pipelines).
+property. To view a `Doc`'s sentences, you can iterate over the `Doc.sents`, a
+generator that yields [`Span`](/api/span) objects. You can check whether a `Doc`
+has sentence boundaries with the `doc.is_sentenced` attribute.
 
-spaCy's dependency parser respects already set boundaries, so you can preprocess
-your `Doc` using custom rules _before_ it's parsed. Depending on your text, this
-may also improve accuracy, since the parser is constrained to predict parses
-consistent with the sentence boundaries.
+```python
+### {executable="true"}
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+doc = nlp("This is a sentence. This is another sentence.")
+assert doc.is_sentenced
+for sent in doc.sents:
+    print(sent.text)
+```
+
+spaCy provides three alternatives for sentence segmentation:
+
+1. [Dependency parser](#sbd-parser): the statistical `parser` provides the most
+   accurate sentence boundaries based on full dependency parses.
+2. [Statistical sentence segmenter](#sbd-senter): the statistical `senter` is a
+   simpler and faster alternative to the parser that only sets sentence
+   boundaries.
+3. [Rule-based pipeline component](#sbd-component): the rule-based `sentencizer`
+   sets sentence boundaries using a customizable list of sentence-final
+   punctuation.
+
+You can also plug an entirely custom [rule-based function](#sbd-custom) into
+your [processing pipeline](/usage/processing-pipelines).
 
 ### Default: Using the dependency parse {#sbd-parser model="parser"}
 
-To view a `Doc`'s sentences, you can iterate over the `Doc.sents`, a generator
-that yields [`Span`](/api/span) objects.
+Unlike other libraries, spaCy uses the dependency parse to determine sentence
+boundaries. This is usually the most accurate approach, but it requires a
+**statistical model** that provides accurate predictions. If your texts are
+closer to general-purpose news or web text, this should work well out-of-the-box
+with spaCy's provided models. For social media or conversational text that
+doesn't follow the same rules, your application may benefit from a custom model
+or rule-based component.
 
 ```python
 ### {executable="true"}
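Option 3 in the new list, the rule-based `sentencizer`, needs no trained model at all. A minimal sketch of that alternative, assuming spaCy v3's string-based `nlp.add_pipe` API:

```python
import spacy

# Rule-based segmentation with the built-in sentencizer: no statistical
# model required, so a blank pipeline is enough.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # splits on sentence-final punctuation
doc = nlp("This is a sentence. This is another sentence.")
print([sent.text for sent in doc.sents])
```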
@@ -1505,12 +1522,41 @@ for sent in doc.sents:
     print(sent.text)
 ```
 
+spaCy's dependency parser respects already set boundaries, so you can preprocess
+your `Doc` using custom components _before_ it's parsed. Depending on your text,
+this may also improve parse accuracy, since the parser is constrained to predict
+parses consistent with the sentence boundaries.
+
+### Statistical sentence segmenter {#sbd-senter model="senter" new="3"}
+
+The [`SentenceRecognizer`](/api/sentencerecognizer) is a simple statistical
+component that only provides sentence boundaries. Along with being faster and
+smaller than the parser, its primary advantage is that it's easier to train
+custom models because it only requires annotated sentence boundaries rather than
+full dependency parses.
+
+<!-- TODO: correct senter loading -->
+
+```python
+### {executable="true"}
+import spacy
+
+nlp = spacy.load("en_core_web_sm", enable=["senter"], disable=["parser"])
+doc = nlp("This is a sentence. This is another sentence.")
+for sent in doc.sents:
+    print(sent.text)
+```
+
+The recall for the `senter` is typically slightly lower than for the parser,
+which is better at predicting sentence boundaries when punctuation is not
+present.
+
 ### Rule-based pipeline component {#sbd-component}
 
 The [`Sentencizer`](/api/sentencizer) component is a
 [pipeline component](/usage/processing-pipelines) that splits sentences on
 punctuation like `.`, `!` or `?`. You can plug it into your pipeline if you only
-need sentence boundaries without the dependency parse.
+need sentence boundaries without dependency parses.
 
 ```python
 ### {executable="true"}
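The `<!-- TODO: correct senter loading -->` comment in the hunk above flags that the `enable=`/`disable=` call may not be the final loading pattern. One possible pattern in spaCy v3, assuming the trained pipeline ships `senter` disabled by default (`exclude=` and `nlp.enable_pipe` are v3 APIs, not confirmed by this commit):

```python
import spacy

# A sketch of senter-only segmentation: exclude the parser entirely and
# re-enable the senter component so it sets sentence boundaries.
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")
doc = nlp("This is a sentence. This is another sentence.")
print([sent.text for sent in doc.sents])
```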
@@ -1537,7 +1583,7 @@ and can still be overwritten by the parser.
 <Infobox title="Important note" variant="warning">
 
 To prevent inconsistent state, you can only set boundaries **before** a document
-is parsed (and `Doc.is_parsed` is `False`). To ensure that your component is
+is parsed (and `doc.is_parsed` is `False`). To ensure that your component is
 added in the right place, you can set `before='parser'` or `first=True` when
 adding it to the pipeline using [`nlp.add_pipe`](/api/language#add_pipe).
 
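The infobox in the final hunk assumes a custom boundary-setting component added before the parser. A minimal sketch of such a component, assuming spaCy v3's `@Language.component` registration; the name `set_custom_boundaries` is illustrative:

```python
import spacy
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Mark the token after each "..." as a sentence start. Boundaries must
    # be set before the document is parsed, hence before="parser" below.
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("set_custom_boundaries", before="parser")
doc = nlp("this is a sentence...hello...and another sentence.")
print([sent.text for sent in doc.sents])
```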