Merge pull request #5991 from adrianeboyd/docs/sent-usage-v3
Update sentence segmentation usage docs
This commit is contained in:
commit 450bf806b0

@@ -10,7 +10,7 @@ api_trainable: true

A trainable pipeline component for sentence segmentation. For a simpler,
rule-based strategy, see the [`Sentencizer`](/api/sentencizer).

## Config and implementation {#config}
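
The config section that follows shows how the component is registered with its
model. As a rough sketch of adding the `senter` with an explicit config (the
`DEFAULT_SENTER_MODEL` import path and the `"model"` config key are assumed
from spaCy v3 conventions, not taken from this diff):

```python
# A sketch, not canonical defaults: the import path and config key
# are assumed from spaCy v3 conventions
import spacy
from spacy.pipeline.senter import DEFAULT_SENTER_MODEL

nlp = spacy.blank("en")
config = {"model": DEFAULT_SENTER_MODEL}
nlp.add_pipe("senter", config=config)
print(nlp.pipe_names)  # ['senter']
```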

@@ -1472,28 +1472,45 @@ print("After:", [(token.text, token._.is_musician) for token in doc])

## Sentence Segmentation {#sbd}

A [`Doc`](/api/doc) object's sentences are available via the `Doc.sents`
property, a generator that yields [`Span`](/api/span) objects. You can check
whether a `Doc` has sentence boundaries with the `doc.is_sentenced` attribute.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
assert doc.is_sentenced
for sent in doc.sents:
    print(sent.text)
```
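
Conversely, a pipeline with no boundary-setting component leaves the flag
unset. A minimal sketch using a blank pipeline illustrates the check:

```python
### {executable="true"}
import spacy

# A blank pipeline has no parser, senter or sentencizer, so it never
# sets sentence boundaries on the docs it produces
nlp = spacy.blank("en")
doc = nlp("This is a sentence. This is another sentence.")
print(doc.is_sentenced)  # False
```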

spaCy provides three alternatives for sentence segmentation:

1. [Dependency parser](#sbd-parser): the statistical `parser` provides the most
   accurate sentence boundaries based on full dependency parses.
2. [Statistical sentence segmenter](#sbd-senter): the statistical `senter` is a
   simpler and faster alternative to the parser that only sets sentence
   boundaries.
3. [Rule-based pipeline component](#sbd-component): the rule-based
   `sentencizer` sets sentence boundaries using a customizable list of
   sentence-final punctuation.

You can also plug an entirely custom [rule-based function](#sbd-custom) into
your [processing pipeline](/usage/processing-pipelines), as sketched below.
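
As a rough sketch of what such a custom function can look like (assuming spaCy
v3's `@Language.component` registration; the component name and the ellipsis
rule are purely illustrative):

```python
### {executable="true"}
import spacy
from spacy.language import Language

@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    # Illustrative rule: start a new sentence after each "..." token
    for i, token in enumerate(doc[:-1]):
        if token.text == "...":
            doc[i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
# Run before the parser, which respects preset boundaries
nlp.add_pipe("custom_sentencizer", before="parser")
doc = nlp("Maybe... or maybe not... who knows.")
for sent in doc.sents:
    print(sent.text)
```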

### Default: Using the dependency parse {#sbd-parser model="parser"}

Unlike other libraries, spaCy uses the dependency parse to determine sentence
boundaries. This is usually the most accurate approach, but it requires a
**statistical model** that provides accurate predictions. If your texts are
closer to general-purpose news or web text, this should work well
out-of-the-box with spaCy's provided models. For social media or conversational
text that doesn't follow the same rules, your application may benefit from a
custom model or rule-based component.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```

spaCy's dependency parser respects already set boundaries, so you can
preprocess your `Doc` using custom components _before_ it's parsed. Depending
on your text, this may also improve parse accuracy, since the parser is
constrained to predict parses consistent with the sentence boundaries.
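
A minimal sketch of such preprocessing, assuming you tokenize first and then
apply the remaining pipeline components yourself (the boundary chosen here is
arbitrary):

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
# Tokenize only, then set a boundary by hand before anything is parsed
doc = nlp.make_doc("this is. a sentence.")
doc[3].is_sent_start = True
# Now run the pipeline components on the pre-annotated doc
for name, proc in nlp.pipeline:
    doc = proc(doc)
print([sent.text for sent in doc.sents])
```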

### Statistical sentence segmenter {#sbd-senter model="senter" new="3"}

The [`SentenceRecognizer`](/api/sentencerecognizer) is a simple statistical
component that only provides sentence boundaries. Along with being faster and
smaller than the parser, its primary advantage is that it's easier to train
custom models because it only requires annotated sentence boundaries rather
than full dependency parses.

<!-- TODO: correct senter loading -->

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm", enable=["senter"], disable=["parser"])
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```

The recall for the `senter` is typically slightly lower than for the parser,
which is better at predicting sentence boundaries when punctuation is not
present.

### Rule-based pipeline component {#sbd-component}

The [`Sentencizer`](/api/sentencizer) component is a
[pipeline component](/usage/processing-pipelines) that splits sentences on
punctuation like `.`, `!` or `?`. You can plug it into your pipeline if you
only need sentence boundaries without dependency parses.

```python
### {executable="true"}

@@ -1537,7 +1583,7 @@ and can still be overwritten by the parser.

<Infobox title="Important note" variant="warning">

To prevent inconsistent state, you can only set boundaries **before** a
document is parsed (and `doc.is_parsed` is `False`). To ensure that your
component is added in the right place, you can set `before='parser'` or
`first=True` when adding it to the pipeline using
[`nlp.add_pipe`](/api/language#add_pipe).
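
For example, a minimal sketch that pins the built-in `sentencizer` ahead of
the parser (the printout just confirms the resulting order):

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
# The boundary-setting component must precede the parser
nlp.add_pipe("sentencizer", before="parser")
print(nlp.pipe_names)
```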