Merge pull request #5991 from adrianeboyd/docs/sent-usage-v3

Update sentence segmentation usage docs
Ines Montani 2020-08-29 12:40:06 +02:00 committed by GitHub
commit 450bf806b0
2 changed files with 66 additions and 20 deletions


@@ -10,7 +10,7 @@ api_trainable: true
---
A trainable pipeline component for sentence segmentation. For a simpler,
-ruse-based strategy, see the [`Sentencizer`](/api/sentencizer).
+rule-based strategy, see the [`Sentencizer`](/api/sentencizer).

## Config and implementation {#config}


@@ -1472,28 +1472,45 @@ print("After:", [(token.text, token._.is_musician) for token in doc])
## Sentence Segmentation {#sbd}
-<!-- TODO: include senter -->
A [`Doc`](/api/doc) object's sentences are available via the `Doc.sents`
-property. Unlike other libraries, spaCy uses the dependency parse to determine
-sentence boundaries. This is usually more accurate than a rule-based approach,
-but it also means you'll need a **statistical model** and accurate predictions.
-If your texts are closer to general-purpose news or web text, this should work
-well out-of-the-box. For social media or conversational text that doesn't follow
-the same rules, your application may benefit from a custom rule-based
-implementation. You can either use the built-in
-[`Sentencizer`](/api/sentencizer) or plug an entirely custom rule-based function
-into your [processing pipeline](/usage/processing-pipelines).
+property. To view a `Doc`'s sentences, you can iterate over the `Doc.sents`, a
+generator that yields [`Span`](/api/span) objects. You can check whether a `Doc`
+has sentence boundaries with the `doc.is_sentenced` attribute.
-spaCy's dependency parser respects already set boundaries, so you can preprocess
-your `Doc` using custom rules _before_ it's parsed. Depending on your text, this
-may also improve accuracy, since the parser is constrained to predict parses
-consistent with the sentence boundaries.
```python
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
+assert doc.is_sentenced
for sent in doc.sents:
    print(sent.text)
```
+spaCy provides three alternatives for sentence segmentation:
+
+1. [Dependency parser](#sbd-parser): the statistical `parser` provides the most
+   accurate sentence boundaries based on full dependency parses.
+2. [Statistical sentence segmenter](#sbd-senter): the statistical `senter` is a
+   simpler and faster alternative to the parser that only sets sentence
+   boundaries.
+3. [Rule-based pipeline component](#sbd-component): the rule-based `sentencizer`
+   sets sentence boundaries using a customizable list of sentence-final
+   punctuation.
+
+You can also plug an entirely custom [rule-based function](#sbd-custom) into
+your [processing pipeline](/usage/processing-pipelines).

### Default: Using the dependency parse {#sbd-parser model="parser"}
-To view a `Doc`'s sentences, you can iterate over the `Doc.sents`, a generator
-that yields [`Span`](/api/span) objects.
+Unlike other libraries, spaCy uses the dependency parse to determine sentence
+boundaries. This is usually the most accurate approach, but it requires a
+**statistical model** that provides accurate predictions. If your texts are
+closer to general-purpose news or web text, this should work well out-of-the-box
+with spaCy's provided models. For social media or conversational text that
+doesn't follow the same rules, your application may benefit from a custom model
+or rule-based component.

```python
### {executable="true"}
@@ -1505,12 +1522,41 @@ for sent in doc.sents:
    print(sent.text)
```
+spaCy's dependency parser respects already set boundaries, so you can preprocess
+your `Doc` using custom components _before_ it's parsed. Depending on your text,
+this may also improve parse accuracy, since the parser is constrained to predict
+parses consistent with the sentence boundaries.
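
An illustrative aside, not part of the diff: a minimal sketch of such a
preprocessing component, using spaCy v3's `@Language.component` registration.
The component name and the ellipsis rule are invented for the example.

```python
import spacy
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Mark the token after an ellipsis as a sentence start. Boundaries
    # set here are respected, not overwritten, by the parser.
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
# Must run before the parser, while boundaries can still be set.
nlp.add_pipe("set_custom_boundaries", before="parser")
doc = nlp("this is a sentence...hello...and another sentence.")
for sent in doc.sents:
    print(sent.text)
```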
+### Statistical sentence segmenter {#sbd-senter model="senter" new="3"}
+
+The [`SentenceRecognizer`](/api/sentencerecognizer) is a simple statistical
+component that only provides sentence boundaries. Along with being faster and
+smaller than the parser, its primary advantage is that it's easier to train
+custom models because it only requires annotated sentence boundaries rather than
+full dependency parses.
+
+<!-- TODO: correct senter loading -->
+
+```python
+### {executable="true"}
+import spacy
+
+nlp = spacy.load("en_core_web_sm", enable=["senter"], disable=["parser"])
+doc = nlp("This is a sentence. This is another sentence.")
+for sent in doc.sents:
+    print(sent.text)
+```
+
+The recall for the `senter` is typically slightly lower than for the parser,
+which is better at predicting sentence boundaries when punctuation is not
+present.
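
An aside on the `<!-- TODO: correct senter loading -->` note above: the
`enable` keyword in that `spacy.load` call may not match the loader's actual
signature. A hedged alternative, assuming the packaged pipeline ships `senter`
as a disabled component, is to exclude the parser and re-enable `senter`:

```python
import spacy

# Assumption: en_core_web_sm includes a "senter" component that is
# disabled by default when the parser is present.
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```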
### Rule-based pipeline component {#sbd-component}
The [`Sentencizer`](/api/sentencizer) component is a
[pipeline component](/usage/processing-pipelines) that splits sentences on
punctuation like `.`, `!` or `?`. You can plug it into your pipeline if you only
-need sentence boundaries without the dependency parse.
+need sentence boundaries without dependency parses.
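
The executable example that belongs here is cut off by the next hunk. As a
stop-gap sketch, not the diff's own code, assuming a blank English pipeline
and v3's string-name `add_pipe`:

```python
from spacy.lang.en import English

nlp = English()  # blank pipeline: no parser or other trained components
nlp.add_pipe("sentencizer")  # rule-based splitting on ., ! and ?
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```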
```python
### {executable="true"}
@@ -1537,7 +1583,7 @@ and can still be overwritten by the parser.
<Infobox title="Important note" variant="warning">
To prevent inconsistent state, you can only set boundaries **before** a document
-is parsed (and `Doc.is_parsed` is `False`). To ensure that your component is
+is parsed (and `doc.is_parsed` is `False`). To ensure that your component is
added in the right place, you can set `before='parser'` or `first=True` when
adding it to the pipeline using [`nlp.add_pipe`](/api/language#add_pipe).
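
As a closing aside, not part of the diff: a minimal sketch of the placement
advice above, using `first=True` so the component runs while `doc.is_parsed`
is still `False`. The component name is invented for the example.

```python
import spacy
from spacy.language import Language

@Language.component("assert_not_parsed")
def assert_not_parsed(doc):
    # Placed first in the pipeline, so the parser has not run yet and
    # sentence boundaries may still be set safely.
    assert not doc.is_parsed
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("assert_not_parsed", first=True)
doc = nlp("This is a sentence.")
```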