Merge pull request #5991 from adrianeboyd/docs/sent-usage-v3
Update sentence segmentation usage docs
This commit is contained in:
commit 450bf806b0

@@ -10,7 +10,7 @@ api_trainable: true

A trainable pipeline component for sentence segmentation. For a simpler,
rule-based strategy, see the [`Sentencizer`](/api/sentencizer).

## Config and implementation {#config}
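
The config section that follows shows how the component is registered with its
model. As a rough sketch of adding the `senter` with an explicit config (the
`DEFAULT_SENTER_MODEL` import path and the `"model"` config key are assumed
from spaCy v3 conventions, not taken from this diff):

```python
# A sketch, not canonical defaults: the import path and config key
# are assumed from spaCy v3 conventions
import spacy
from spacy.pipeline.senter import DEFAULT_SENTER_MODEL

nlp = spacy.blank("en")
config = {"model": DEFAULT_SENTER_MODEL}
nlp.add_pipe("senter", config=config)
print(nlp.pipe_names)  # ['senter']
```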

@@ -1472,28 +1472,45 @@ print("After:", [(token.text, token._.is_musician) for token in doc])

## Sentence Segmentation {#sbd}

A [`Doc`](/api/doc) object's sentences are available via the `Doc.sents`
property, a generator that yields [`Span`](/api/span) objects. You can check
whether a `Doc` has sentence boundaries with the `doc.is_sentenced` attribute.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
assert doc.is_sentenced
for sent in doc.sents:
    print(sent.text)
```
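
Conversely, a pipeline with no boundary-setting component leaves the flag
unset. A minimal sketch using a blank pipeline illustrates the check:

```python
### {executable="true"}
import spacy

# A blank pipeline has no parser, senter or sentencizer, so it never
# sets sentence boundaries on the docs it produces
nlp = spacy.blank("en")
doc = nlp("This is a sentence. This is another sentence.")
print(doc.is_sentenced)  # False
```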

spaCy provides three alternatives for sentence segmentation:

1. [Dependency parser](#sbd-parser): the statistical `parser` provides the most
   accurate sentence boundaries based on full dependency parses.
2. [Statistical sentence segmenter](#sbd-senter): the statistical `senter` is a
   simpler and faster alternative to the parser that only sets sentence
   boundaries.
3. [Rule-based pipeline component](#sbd-component): the rule-based
   `sentencizer` sets sentence boundaries using a customizable list of
   sentence-final punctuation.

You can also plug an entirely custom [rule-based function](#sbd-custom) into
your [processing pipeline](/usage/processing-pipelines), as sketched below.
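
As a rough sketch of what such a custom function can look like (assuming spaCy
v3's `@Language.component` registration; the component name and the ellipsis
rule are purely illustrative):

```python
### {executable="true"}
import spacy
from spacy.language import Language

@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    # Illustrative rule: start a new sentence after each "..." token
    for i, token in enumerate(doc[:-1]):
        if token.text == "...":
            doc[i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
# Run before the parser, which respects preset boundaries
nlp.add_pipe("custom_sentencizer", before="parser")
doc = nlp("Maybe... or maybe not... who knows.")
for sent in doc.sents:
    print(sent.text)
```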

### Default: Using the dependency parse {#sbd-parser model="parser"}

Unlike other libraries, spaCy uses the dependency parse to determine sentence
boundaries. This is usually the most accurate approach, but it requires a
**statistical model** that provides accurate predictions. If your texts are
closer to general-purpose news or web text, this should work well
out-of-the-box with spaCy's provided models. For social media or conversational
text that doesn't follow the same rules, your application may benefit from a
custom model or rule-based component.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```

spaCy's dependency parser respects already set boundaries, so you can
preprocess your `Doc` using custom components _before_ it's parsed. Depending
on your text, this may also improve parse accuracy, since the parser is
constrained to predict parses consistent with the sentence boundaries.
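
A minimal sketch of such preprocessing, assuming you tokenize first and then
apply the remaining pipeline components yourself (the boundary chosen here is
arbitrary):

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
# Tokenize only, then set a boundary by hand before anything is parsed
doc = nlp.make_doc("this is. a sentence.")
doc[3].is_sent_start = True
# Now run the pipeline components on the pre-annotated doc
for name, proc in nlp.pipeline:
    doc = proc(doc)
print([sent.text for sent in doc.sents])
```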

### Statistical sentence segmenter {#sbd-senter model="senter" new="3"}

The [`SentenceRecognizer`](/api/sentencerecognizer) is a simple statistical
component that only provides sentence boundaries. Along with being faster and
smaller than the parser, its primary advantage is that it's easier to train
custom models because it only requires annotated sentence boundaries rather
than full dependency parses.

<!-- TODO: correct senter loading -->

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm", enable=["senter"], disable=["parser"])
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```

The recall for the `senter` is typically slightly lower than for the parser,
which is better at predicting sentence boundaries when punctuation is not
present.

### Rule-based pipeline component {#sbd-component}

The [`Sentencizer`](/api/sentencizer) component is a
[pipeline component](/usage/processing-pipelines) that splits sentences on
punctuation like `.`, `!` or `?`. You can plug it into your pipeline if you
only need sentence boundaries without dependency parses.

```python
### {executable="true"}

@@ -1537,7 +1583,7 @@ and can still be overwritten by the parser.

<Infobox title="Important note" variant="warning">

To prevent inconsistent state, you can only set boundaries **before** a
document is parsed (and `doc.is_parsed` is `False`). To ensure that your
component is added in the right place, you can set `before='parser'` or
`first=True` when adding it to the pipeline using
[`nlp.add_pipe`](/api/language#add_pipe).
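
For example, a minimal sketch that pins the built-in `sentencizer` ahead of
the parser (the printout just confirms the resulting order):

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
# The boundary-setting component must precede the parser
nlp.add_pipe("sentencizer", before="parser")
print(nlp.pipe_names)
```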