mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-14 13:47:13 +03:00
e597110d31
<!--- Provide a general summary of your changes in the title. --> ## Description The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on. This PR also includes various new docs pages and content. Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837. ### Types of change enhancement ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
1190 lines
46 KiB
Markdown
1190 lines
46 KiB
Markdown
---
|
||
title: Linguistic Features
|
||
next: /usage/rule-based-matching
|
||
menu:
|
||
- ['POS Tagging', 'pos-tagging']
|
||
- ['Dependency Parse', 'dependency-parse']
|
||
- ['Named Entities', 'named-entities']
|
||
- ['Tokenization', 'tokenization']
|
||
- ['Merging & Splitting', 'retokenization']
|
||
- ['Sentence Segmentation', 'sbd']
|
||
---
|
||
|
||
Processing raw text intelligently is difficult: most words are rare, and it's
|
||
common for words that look completely different to mean almost the same thing.
|
||
The same words in a different order can mean something completely different.
|
||
Even splitting text into useful word-like units can be difficult in many
|
||
languages. While it's possible to solve some problems starting from only the raw
|
||
characters, it's usually better to use linguistic knowledge to add useful
|
||
information. That's exactly what spaCy is designed to do: you put in raw text,
|
||
and get back a [`Doc`](/api/doc) object, that comes with a variety of
|
||
annotations.
|
||
|
||
## Part-of-speech tagging {#pos-tagging model="tagger, parser"}
|
||
|
||
import PosDeps101 from 'usage/101/\_pos-deps.md'
|
||
|
||
<PosDeps101 />
|
||
|
||
### Rule-based morphology {#rule-based-morphology}
|
||
|
||
Inflectional morphology is the process by which a root form of a word is
|
||
modified by adding prefixes or suffixes that specify its grammatical function
|
||
but do not changes its part-of-speech. We say that a **lemma** (root form) is
|
||
**inflected** (modified/combined) with one or more **morphological features** to
|
||
create a surface form. Here are some examples:
|
||
|
||
| Context | Surface | Lemma | POS | Morphological Features |
|
||
| ---------------------------------------- | ------- | ----- | ---- | ---------------------------------------- |
|
||
| I was reading the paper | reading | read | verb | `VerbForm=Ger` |
|
||
| I don't watch the news, I read the paper | read | read | verb | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
|
||
| I read the paper yesterday | read | read | verb | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` |
|
||
|
||
English has a relatively simple morphological system, which spaCy handles using
|
||
rules that can be keyed by the token, the part-of-speech tag, or the combination
|
||
of the two. The system works as follows:
|
||
|
||
1. The tokenizer consults a
|
||
[mapping table](/usage/adding-languages#tokenizer-exceptions)
|
||
`TOKENIZER_EXCEPTIONS`, which allows sequences of characters to be mapped to
|
||
multiple tokens. Each token may be assigned a part of speech and one or more
|
||
morphological features.
|
||
2. The part-of-speech tagger then assigns each token an **extended POS tag**. In
|
||
the API, these tags are known as `Token.tag`. They express the part-of-speech
|
||
(e.g. `VERB`) and some amount of morphological information, e.g. that the
|
||
verb is past tense.
|
||
3. For words whose POS is not set by a prior process, a
|
||
[mapping table](/usage/adding-languages#tag-map) `TAG_MAP` maps the tags to a
|
||
part-of-speech and a set of morphological features.
|
||
4. Finally, a **rule-based deterministic lemmatizer** maps the surface form, to
|
||
a lemma in light of the previously assigned extended part-of-speech and
|
||
morphological information, without consulting the context of the token. The
|
||
lemmatizer also accepts list-based exception files, acquired from
|
||
[WordNet](https://wordnet.princeton.edu/).
|
||
|
||
<Infobox title="📖 Part-of-speech tag scheme">
|
||
|
||
For a list of the fine-grained and coarse-grained part-of-speech tags assigned
|
||
by spaCy's models across different languages, see the
|
||
[POS tag scheme documentation](/api/annotation#pos-tagging).
|
||
|
||
</Infobox>
|
||
|
||
## Dependency Parsing {#dependency-parse model="parser"}
|
||
|
||
spaCy features a fast and accurate syntactic dependency parser, and has a rich
|
||
API for navigating the tree. The parser also powers the sentence boundary
|
||
detection, and lets you iterate over base noun phrases, or "chunks". You can
|
||
check whether a [`Doc`](/api/doc) object has been parsed with the
|
||
`doc.is_parsed` attribute, which returns a boolean value. If this attribute is
|
||
`False`, the default sentence iterator will raise an exception.
|
||
|
||
### Noun chunks {#noun-chunks}
|
||
|
||
Noun chunks are "base noun phrases" – flat phrases that have a noun as their
|
||
head. You can think of noun chunks as a noun plus the words describing the noun
|
||
– for example, "the lavish green grass" or "the world’s largest tech fund". To
|
||
get the noun chunks in a document, simply iterate over
|
||
[`Doc.noun_chunks`](/api/doc#noun_chunks)
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
|
||
for chunk in doc.noun_chunks:
|
||
print(chunk.text, chunk.root.text, chunk.root.dep_,
|
||
chunk.root.head.text)
|
||
```
|
||
|
||
> - **Text:** The original noun chunk text.
|
||
> - **Root text:** The original text of the word connecting the noun chunk to
|
||
> the rest of the parse.
|
||
> - **Root dep:** Dependency relation connecting the root to its head.
|
||
> - **Root head text:** The text of the root token's head.
|
||
|
||
| Text | root.text | root.dep\_ | root.head.text |
|
||
| ------------------- | ------------- | ---------- | -------------- |
|
||
| Autonomous cars | cars | `nsubj` | shift |
|
||
| insurance liability | liability | `dobj` | shift |
|
||
| manufacturers | manufacturers | `pobj` | toward |
|
||
|
||
### Navigating the parse tree {#navigating}
|
||
|
||
spaCy uses the terms **head** and **child** to describe the words **connected by
|
||
a single arc** in the dependency tree. The term **dep** is used for the arc
|
||
label, which describes the type of syntactic relation that connects the child to
|
||
the head. As with other attributes, the value of `.dep` is a hash value. You can
|
||
get the string value with `.dep_`.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
|
||
for token in doc:
|
||
print(token.text, token.dep_, token.head.text, token.head.pos_,
|
||
[child for child in token.children])
|
||
```
|
||
|
||
> - **Text:** The original token text.
|
||
> - **Dep:** The syntactic relation connecting child to head.
|
||
> - **Head text:** The original text of the token head.
|
||
> - **Head POS:** The part-of-speech tag of the token head.
|
||
> - **Children:** The immediate syntactic dependents of the token.
|
||
|
||
| Text | Dep | Head text | Head POS | Children |
|
||
| ------------- | ---------- | --------- | -------- | ----------------------- |
|
||
| Autonomous | `amod` | cars | `NOUN` | |
|
||
| cars | `nsubj` | shift | `VERB` | Autonomous |
|
||
| shift | `ROOT` | shift | `VERB` | cars, liability, toward |
|
||
| insurance | `compound` | liability | `NOUN` | |
|
||
| liability | `dobj` | shift | `VERB` | insurance |
|
||
| toward | `prep` | shift | `NOUN` | manufacturers |
|
||
| manufacturers | `pobj` | toward | `ADP` | |
|
||
|
||
import DisplaCyLong2Html from 'images/displacy-long2.html'
|
||
|
||
<Iframe title="displaCy visualization of dependencies and entities 2" html={DisplaCyLong2Html} height={450} />
|
||
|
||
Because the syntactic relations form a tree, every word has **exactly one
|
||
head**. You can therefore iterate over the arcs in the tree by iterating over
|
||
the words in the sentence. This is usually the best way to match an arc of
|
||
interest — from below:
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
from spacy.symbols import nsubj, VERB
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
|
||
|
||
# Finding a verb with a subject from below — good
|
||
verbs = set()
|
||
for possible_subject in doc:
|
||
if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
|
||
verbs.add(possible_subject.head)
|
||
print(verbs)
|
||
```
|
||
|
||
If you try to match from above, you'll have to iterate twice. Once for the head,
|
||
and then again through the children:
|
||
|
||
```python
|
||
# Finding a verb with a subject from above — less good
|
||
verbs = []
|
||
for possible_verb in doc:
|
||
if possible_verb.pos == VERB:
|
||
for possible_subject in possible_verb.children:
|
||
if possible_subject.dep == nsubj:
|
||
verbs.append(possible_verb)
|
||
break
|
||
```
|
||
|
||
To iterate through the children, use the `token.children` attribute, which
|
||
provides a sequence of [`Token`](/api/token) objects.
|
||
|
||
#### Iterating around the local tree {#navigating-around}
|
||
|
||
A few more convenience attributes are provided for iterating around the local
|
||
tree from the token. [`Token.lefts`](/api/token#lefts) and
|
||
[`Token.rights`](/api/token#rights) attributes provide sequences of syntactic
|
||
children that occur before and after the token. Both sequences are in sentence
|
||
order. There are also two integer-typed attributes,
|
||
[`Token.n_lefts`](/api/token#n_lefts) and
|
||
[`Token.n_rights`](/api/token#n_rights) that give the number of left and right
|
||
children.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(u"bright red apples on the tree")
|
||
print([token.text for token in doc[2].lefts]) # ['bright', 'red']
|
||
print([token.text for token in doc[2].rights]) # ['on']
|
||
print(doc[2].n_lefts) # 2
|
||
print(doc[2].n_rights) # 1
|
||
```
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
nlp = spacy.load("de_core_news_sm")
|
||
doc = nlp(u"schöne rote Äpfel auf dem Baum")
|
||
print([token.text for token in doc[2].lefts]) # ['schöne', 'rote']
|
||
print([token.text for token in doc[2].rights]) # ['auf']
|
||
```
|
||
|
||
You can get a whole phrase by its syntactic head using the
|
||
[`Token.subtree`](/api/token#subtree) attribute. This returns an ordered
|
||
sequence of tokens. You can walk up the tree with the
|
||
[`Token.ancestors`](/api/token#ancestors) attribute, and check dominance with
|
||
[`Token.is_ancestor`](/api/token#is_ancestor)
|
||
|
||
> #### Projective vs. non-projective
|
||
>
|
||
> For the [default English model](/models/en), the parse tree is **projective**,
|
||
> which means that there are no crossing brackets. The tokens returned by
|
||
> `.subtree` are therefore guaranteed to be contiguous. This is not true for the
|
||
> German model, which has many
|
||
> [non-projective dependencies](https://explosion.ai/blog/german-model#word-order).
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(u"Credit and mortgage account holders must submit their requests")
|
||
|
||
root = [token for token in doc if token.head == token][0]
|
||
subject = list(root.lefts)[0]
|
||
for descendant in subject.subtree:
|
||
assert subject is descendant or subject.is_ancestor(descendant)
|
||
print(descendant.text, descendant.dep_, descendant.n_lefts,
|
||
descendant.n_rights,
|
||
[ancestor.text for ancestor in descendant.ancestors])
|
||
```
|
||
|
||
| Text | Dep | n_lefts | n_rights | ancestors |
|
||
| -------- | ---------- | ------- | -------- | -------------------------------- |
|
||
| Credit | `nmod` | `0` | `2` | holders, submit |
|
||
| and | `cc` | `0` | `0` | holders, submit |
|
||
| mortgage | `compound` | `0` | `0` | account, Credit, holders, submit |
|
||
| account | `conj` | `1` | `0` | Credit, holders, submit |
|
||
| holders | `nsubj` | `1` | `0` | submit |
|
||
|
||
Finally, the `.left_edge` and `.right_edge` attributes can be especially useful,
|
||
because they give you the first and last token of the subtree. This is the
|
||
easiest way to create a `Span` object for a syntactic phrase. Note that
|
||
`.right_edge` gives a token **within** the subtree — so if you use it as the
|
||
end-point of a range, don't forget to `+1`!
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(u"Credit and mortgage account holders must submit their requests")
|
||
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
|
||
with doc.retokenize() as retokenizer:
|
||
retokenizer.merge(span)
|
||
for token in doc:
|
||
print(token.text, token.pos_, token.dep_, token.head.text)
|
||
```
|
||
|
||
| Text | POS | Dep | Head text |
|
||
| ----------------------------------- | ------ | ------- | --------- |
|
||
| Credit and mortgage account holders | `NOUN` | `nsubj` | submit |
|
||
| must | `VERB` | `aux` | submit |
|
||
| submit | `VERB` | `ROOT` | submit |
|
||
| their | `ADJ` | `poss` | requests |
|
||
| requests | `NOUN` | `dobj` | submit |
|
||
|
||
<Infobox title="📖 Dependency label scheme">
|
||
|
||
For a list of the syntactic dependency labels assigned by spaCy's models across
|
||
different languages, see the
|
||
[dependency label scheme documentation](/api/annotation#pos-tagging).
|
||
|
||
</Infobox>
|
||
|
||
### Visualizing dependencies {#displacy}
|
||
|
||
The best way to understand spaCy's dependency parser is interactively. To make
|
||
this easier, spaCy v2.0+ comes with a visualization module. You can pass a `Doc`
|
||
or a list of `Doc` objects to displaCy and run
|
||
[`displacy.serve`](top-level#displacy.serve) to run the web server, or
|
||
[`displacy.render`](top-level#displacy.render) to generate the raw markup. If
|
||
you want to know how to write rules that hook into some type of syntactic
|
||
construction, just plug the sentence into the visualizer and see how spaCy
|
||
annotates it.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
from spacy import displacy
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
|
||
# Since this is an interactive Jupyter environment, we can use displacy.render here
|
||
displacy.render(doc, style='dep')
|
||
```
|
||
|
||
<Infobox>
|
||
|
||
For more details and examples, see the
|
||
[usage guide on visualizing spaCy](/usage/visualizers). You can also test
|
||
displaCy in our [online demo](https://explosion.ai/demos/displacy)..
|
||
|
||
</Infobox>
|
||
|
||
### Disabling the parser {#disabling}
|
||
|
||
In the [default models](/models), the parser is loaded and enabled as part of
|
||
the [standard processing pipeline](/usage/processing-pipelin). If you don't need
|
||
any of the syntactic information, you should disable the parser. Disabling the
|
||
parser will make spaCy load and run much faster. If you want to load the parser,
|
||
but need to disable it for specific documents, you can also control its use on
|
||
the `nlp` object.
|
||
|
||
```python
|
||
nlp = spacy.load("en_core_web_sm", disable=["parser"])
|
||
nlp = English().from_disk("/model", disable=["parser"])
|
||
doc = nlp(u"I don't want parsed", disable=["parser"])
|
||
```
|
||
|
||
<Infobox title="Important note: disabling pipeline components" variant="warning">
|
||
|
||
Since spaCy v2.0 comes with better support for customizing the processing
|
||
pipeline components, the `parser` keyword argument has been replaced with
|
||
`disable`, which takes a list of
|
||
[pipeline component names](/usage/processing-pipelines). This lets you disable
|
||
both default and custom components when loading a model, or initializing a
|
||
Language class via [`from_disk`](/api/language#from_disk).
|
||
|
||
```diff
|
||
+ nlp = spacy.load("en_core_web_sm", disable=["parser"])
|
||
+ doc = nlp(u"I don't want parsed", disable=["parser"])
|
||
|
||
- nlp = spacy.load("en_core_web_sm", parser=False)
|
||
- doc = nlp(u"I don't want parsed", parse=False)
|
||
```
|
||
|
||
</Infobox>
|
||
|
||
## Named Entity Recognition {#named-entities}
|
||
|
||
spaCy features an extremely fast statistical entity recognition system, that
|
||
assigns labels to contiguous spans of tokens. The default model identifies a
|
||
variety of named and numeric entities, including companies, locations,
|
||
organizations and products. You can add arbitrary classes to the entity
|
||
recognition system, and update the model with new examples.
|
||
|
||
### Named Entity Recognition 101 {#named-entities-101}
|
||
|
||
import NER101 from 'usage/101/\_named-entities.md'
|
||
|
||
<NER101 />
|
||
|
||
### Accessing entity annotations {#accessing}
|
||
|
||
The standard way to access entity annotations is the [`doc.ents`](/api/doc#ents)
|
||
property, which produces a sequence of [`Span`](/api/span) objects. The entity
|
||
type is accessible either as a hash value or as a string, using the attributes
|
||
`ent.label` and `ent.label_`. The `Span` object acts as a sequence of tokens, so
|
||
you can iterate over the entity or index into it. You can also get the text form
|
||
of the whole entity, as though it were a single token.
|
||
|
||
You can also access token entity annotations using the
|
||
[`token.ent_iob`](/api/token#attributes) and
|
||
[`token.ent_type`](/api/token#attributes) attributes. `token.ent_iob` indicates
|
||
whether an entity starts, continues or ends on the tag. If no entity type is set
|
||
on a token, it will return an empty string.
|
||
|
||
> #### IOB Scheme
|
||
>
|
||
> - `I` – Token is inside an entity.
|
||
> - `O` – Token is outside an entity.
|
||
> - `B` – Token is the beginning of an entity.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(u"San Francisco considers banning sidewalk delivery robots")
|
||
|
||
# document level
|
||
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
|
||
print(ents)
|
||
|
||
# token level
|
||
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
|
||
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
|
||
print(ent_san) # [u'San', u'B', u'GPE']
|
||
print(ent_francisco) # [u'Francisco', u'I', u'GPE']
|
||
```
|
||
|
||
| Text | ent_iob | ent_iob\_ | ent_type\_ | Description |
|
||
| --------- | ------- | --------- | ---------- | ---------------------- |
|
||
| San | `3` | `B` | `"GPE"` | beginning of an entity |
|
||
| Francisco | `1` | `I` | `"GPE"` | inside an entity |
|
||
| considers | `2` | `O` | `""` | outside an entity |
|
||
| banning | `2` | `O` | `""` | outside an entity |
|
||
| sidewalk | `2` | `O` | `""` | outside an entity |
|
||
| delivery | `2` | `O` | `""` | outside an entity |
|
||
| robots | `2` | `O` | `""` | outside an entity |
|
||
|
||
### Setting entity annotations {#setting-entities}
|
||
|
||
To ensure that the sequence of token annotations remains consistent, you have to
|
||
set entity annotations **at the document level**. However, you can't write
|
||
directly to the `token.ent_iob` or `token.ent_type` attributes, so the easiest
|
||
way to set entities is to assign to the [`doc.ents`](/api/doc#ents) attribute
|
||
and create the new entity as a [`Span`](/api/span).
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
from spacy.tokens import Span
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(u"FB is hiring a new Vice President of global policy")
|
||
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
|
||
print('Before', ents)
|
||
# the model didn't recognise "FB" as an entity :(
|
||
|
||
ORG = doc.vocab.strings[u"ORG"] # get hash value of entity label
|
||
fb_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity
|
||
doc.ents = list(doc.ents) + [fb_ent]
|
||
|
||
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
|
||
print('After', ents)
|
||
# [(u'FB', 0, 2, 'ORG')] 🎉
|
||
```
|
||
|
||
Keep in mind that you need to create a `Span` with the start and end index of
|
||
the **token**, not the start and end index of the entity in the document. In
|
||
this case, "FB" is token `(0, 1)` – but at the document level, the entity will
|
||
have the start and end indices `(0, 2)`.
|
||
|
||
#### Setting entity annotations from array {#setting-from-array}
|
||
|
||
You can also assign entity annotations using the
|
||
[`doc.from_array`](/api/doc#from_array) method. To do this, you should include
|
||
both the `ENT_TYPE` and the `ENT_IOB` attributes in the array you're importing
|
||
from.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import numpy
|
||
import spacy
|
||
from spacy.attrs import ENT_IOB, ENT_TYPE
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp.make_doc(u"London is a big city in the United Kingdom.")
|
||
print("Before", doc.ents) # []
|
||
|
||
header = [ENT_IOB, ENT_TYPE]
|
||
attr_array = numpy.zeros((len(doc), len(header)))
|
||
attr_array[0, 0] = 3 # B
|
||
attr_array[0, 1] = doc.vocab.strings[u"GPE"]
|
||
doc.from_array(header, attr_array)
|
||
print("After", doc.ents) # [London]
|
||
```
|
||
|
||
#### Setting entity annotations in Cython {#setting-cython}
|
||
|
||
Finally, you can always write to the underlying struct, if you compile a
|
||
[Cython](http://cython.org/) function. This is easy to do, and allows you to
|
||
write efficient native code.
|
||
|
||
```python
|
||
# cython: infer_types=True
|
||
from spacy.tokens.doc cimport Doc
|
||
|
||
cpdef set_entity(Doc doc, int start, int end, int ent_type):
|
||
for i in range(start, end):
|
||
doc.c[i].ent_type = ent_type
|
||
doc.c[start].ent_iob = 3
|
||
for i in range(start+1, end):
|
||
doc.c[i].ent_iob = 2
|
||
```
|
||
|
||
Obviously, if you write directly to the array of `TokenC*` structs, you'll have
|
||
responsibility for ensuring that the data is left in a consistent state.
|
||
|
||
### Built-in entity types {#entity-types}
|
||
|
||
> #### Tip: Understanding entity types
|
||
>
|
||
> You can also use `spacy.explain()` to get the description for the string
|
||
> representation of an entity label. For example, `spacy.explain("LANGUAGE")`
|
||
> will return "any named language".
|
||
|
||
<Infobox title="Annotation scheme">
|
||
|
||
For details on the entity types available in spaCy's pre-trained models, see the
|
||
[NER annotation scheme](/api/annotation#named-entities).
|
||
|
||
</Infobox>
|
||
|
||
### Training and updating {#updating}
|
||
|
||
To provide training examples to the entity recognizer, you'll first need to
|
||
create an instance of the [`GoldParse`](/api/goldparse) class. You can specify
|
||
your annotations in a stand-off format or as token tags. If a character offset
|
||
in your entity annotations doesn't fall on a token boundary, the `GoldParse`
|
||
class will treat that annotation as a missing value. This allows for more
|
||
realistic training, because the entity recognizer is allowed to learn from
|
||
examples that may feature tokenizer errors.
|
||
|
||
```python
|
||
train_data = [
|
||
("Who is Chaka Khan?", [(7, 17, "PERSON")]),
|
||
("I like London and Berlin.", [(7, 13, "LOC"), (18, 24, "LOC")]),
|
||
]
|
||
```
|
||
|
||
```python
|
||
doc = Doc(nlp.vocab, [u"rats", u"make", u"good", u"pets"])
|
||
gold = GoldParse(doc, entities=[u"U-ANIMAL", u"O", u"O", u"O"])
|
||
```
|
||
|
||
<Infobox>
|
||
|
||
For more details on **training and updating** the named entity recognizer, see
|
||
the usage guides on [training](/usage/training) or check out the runnable
|
||
[training script](https://github.com/explosion/spaCy/tree/master/examples/training/train_ner.py)
|
||
on GitHub.
|
||
|
||
</Infobox>
|
||
|
||
### Visualizing named entities {#displacy}
|
||
|
||
The
|
||
[displaCy <sup>ENT</sup> visualizer](https://explosion.ai/demos/displacy-ent)
|
||
lets you explore an entity recognition model's behavior interactively. If you're
|
||
training a model, it's very useful to run the visualization yourself. To help
|
||
you do that, spaCy v2.0+ comes with a visualization module. You can pass a `Doc`
|
||
or a list of `Doc` objects to displaCy and run
|
||
[`displacy.serve`](/api/top-level#displacy.serve) to run the web server, or
|
||
[`displacy.render`](/api/top-level#displacy.render) to generate the raw markup.
|
||
|
||
For more details and examples, see the
|
||
[usage guide on visualizing spaCy](/usage/visualizers).
|
||
|
||
```python
|
||
### Named Entity example
|
||
import spacy
|
||
from spacy import displacy
|
||
|
||
text = """But Google is starting from behind. The company made a late push
|
||
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa
|
||
software, which runs on its Echo and Dot devices, have clear leads in
|
||
consumer adoption."""
|
||
|
||
nlp = spacy.load("custom_ner_model")
|
||
doc = nlp(text)
|
||
displacy.serve(doc, style="ent")
|
||
```
|
||
|
||
import DisplacyEntHtml from 'images/displacy-ent.html'
|
||
|
||
<Iframe title="displaCy visualizer for entities" html={DisplacyEntHtml} height={275} />
|
||
|
||
## Tokenization {#tokenization}
|
||
|
||
Tokenization is the task of splitting a text into meaningful segments, called
|
||
_tokens_. The input to the tokenizer is a unicode text, and the output is a
|
||
[`Doc`](/api/doc) object. To construct a `Doc` object, you need a
|
||
[`Vocab`](/api/vocab) instance, a sequence of `word` strings, and optionally a
|
||
sequence of `spaces` booleans, which allow you to maintain alignment of the
|
||
tokens into the original string.
|
||
|
||
<Infobox title="Important note" variant="warning">
|
||
|
||
spaCy's tokenization is **non-destructive**, which means that you'll always be
|
||
able to reconstruct the original input from the tokenized output. Whitespace
|
||
information is preserved in the tokens and no information is added or removed
|
||
during tokenization. This is kind of a core principle of spaCy's `Doc` object:
|
||
`doc.text == input_text` should always hold true.
|
||
|
||
</Infobox>
|
||
|
||
import Tokenization101 from 'usage/101/\_tokenization.md'
|
||
|
||
<Tokenization101 />
|
||
|
||
### Tokenizer data {#101-data}
|
||
|
||
**Global** and **language-specific** tokenizer data is supplied via the language
|
||
data in
|
||
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang). The
|
||
tokenizer exceptions define special cases like "don't" in English, which needs
|
||
to be split into two tokens: `{ORTH: "do"}` and `{ORTH: "n't", LEMMA: "not"}`.
|
||
The prefixes, suffixes and infixes mostly define punctuation rules – for
|
||
example, when to split off periods (at the end of a sentence), and when to leave
|
||
tokens containing periods intact (abbreviations like "U.S.").
|
||
|
||
![Language data architecture](../images/language_data.svg)
|
||
|
||
<Infobox title="📖 Language data">
|
||
|
||
For more details on the language-specific data, see the usage guide on
|
||
[adding languages](/usage/adding-languages).
|
||
|
||
</Infobox>
|
||
|
||
<Accordion title="Should I change the language data or add custom tokenizer rules?">
|
||
|
||
Tokenization rules that are specific to one language, but can be **generalized
|
||
across that language** should ideally live in the language data in
|
||
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) – we
|
||
always appreciate pull requests! Anything that's specific to a domain or text
|
||
type – like financial trading abbreviations, or Bavarian youth slang – should be
|
||
added as a special case rule to your tokenizer instance. If you're dealing with
|
||
a lot of customizations, it might make sense to create an entirely custom
|
||
subclass.
|
||
|
||
</Accordion>
|
||
|
||
---
|
||
|
||
### Adding special case tokenization rules {#special-cases}
|
||
|
||
Most domains have at least some idiosyncrasies that require custom tokenization
|
||
rules. This could be very certain expressions, or abbreviations only used in
|
||
this specific field. Here's how to add a special case rule to an existing
|
||
[`Tokenizer`](/api/tokenizer) instance:
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
from spacy.symbols import ORTH, LEMMA, POS, TAG
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(u"gimme that") # phrase to tokenize
|
||
print([w.text for w in doc]) # ['gimme', 'that']
|
||
|
||
# add special case rule
|
||
special_case = [{ORTH: u"gim", LEMMA: u"give", POS: u"VERB"}, {ORTH: u"me"}]
|
||
nlp.tokenizer.add_special_case(u"gimme", special_case)
|
||
|
||
# check new tokenization
|
||
print([w.text for w in nlp(u"gimme that")]) # ['gim', 'me', 'that']
|
||
|
||
# Pronoun lemma is returned as -PRON-!
|
||
print([w.lemma_ for w in nlp(u"gimme that")]) # ['give', '-PRON-', 'that']
|
||
```
|
||
|
||
<Infobox title="Why -PRON-?" variant="warning">
|
||
|
||
For details on spaCy's custom pronoun lemma `-PRON-`,
|
||
[see here](/usage/#pron-lemma).
|
||
|
||
</Infobox>
|
||
|
||
The special case doesn't have to match an entire whitespace-delimited substring.
|
||
The tokenizer will incrementally split off punctuation, and keep looking up the
|
||
remaining substring:
|
||
|
||
```python
|
||
assert "gimme" not in [w.text for w in nlp(u"gimme!")]
|
||
assert "gimme" not in [w.text for w in nlp(u'("...gimme...?")')]
|
||
```
|
||
|
||
The special case rules have precedence over the punctuation splitting:
|
||
|
||
```python
|
||
special_case = [{ORTH: u"...gimme...?", LEMMA: u"give", TAG: u"VB"}]
|
||
nlp.tokenizer.add_special_case(u"...gimme...?", special_case)
|
||
assert len(nlp(u"...gimme...?")) == 1
|
||
```
|
||
|
||
Because the special-case rules allow you to set arbitrary token attributes, such
|
||
as the part-of-speech, lemma, etc, they make a good mechanism for arbitrary
|
||
fix-up rules. Having this logic live in the tokenizer isn't very satisfying from
|
||
a design perspective, however, so the API may eventually be exposed on the
|
||
[`Language`](/api/language) class itself.
|
||
|
||
### How spaCy's tokenizer works {#how-tokenizer-works}
|
||
|
||
spaCy introduces a novel tokenization algorithm, that gives a better balance
|
||
between performance, ease of definition, and ease of alignment into the original
|
||
string.
|
||
|
||
After consuming a prefix or infix, we consult the special cases again. We want
|
||
the special cases to handle things like "don't" in English, and we want the same
|
||
rule to work for "(don't)!". We do this by splitting off the open bracket, then
|
||
the exclamation, then the close bracket, and finally matching the special-case.
|
||
Here's an implementation of the algorithm in Python, optimized for readability
|
||
rather than performance:
|
||
|
||
```python
|
||
def tokenizer_pseudo_code(text, special_cases,
|
||
find_prefix, find_suffix, find_infixes):
|
||
tokens = []
|
||
for substring in text.split(' '):
|
||
suffixes = []
|
||
while substring:
|
||
if substring in special_cases:
|
||
tokens.extend(special_cases[substring])
|
||
substring = ''
|
||
elif find_prefix(substring) is not None:
|
||
split = find_prefix(substring)
|
||
tokens.append(substring[:split])
|
||
substring = substring[split:]
|
||
elif find_suffix(substring) is not None:
|
||
split = find_suffix(substring)
|
||
suffixes.append(substring[-split:])
|
||
substring = substring[:-split]
|
||
elif find_infixes(substring):
|
||
infixes = find_infixes(substring)
|
||
offset = 0
|
||
for match in infixes:
|
||
tokens.append(substring[offset : match.start()])
|
||
tokens.append(substring[match.start() : match.end()])
|
||
offset = match.end()
|
||
substring = substring[offset:]
|
||
else:
|
||
tokens.append(substring)
|
||
substring = ''
|
||
tokens.extend(reversed(suffixes))
|
||
return tokens
|
||
```
|
||
|
||
The algorithm can be summarized as follows:
|
||
|
||
1. Iterate over space-separated substrings
|
||
2. Check whether we have an explicitly defined rule for this substring. If we
|
||
do, use it.
|
||
3. Otherwise, try to consume a prefix.
|
||
4. If we consumed a prefix, go back to the beginning of the loop, so that
|
||
special-cases always get priority.
|
||
5. If we didn't consume a prefix, try to consume a suffix.
|
||
6. If we can't consume a prefix or suffix, look for "infixes" — stuff like
|
||
hyphens etc.
|
||
7. Once we can't consume any more of the string, handle it as a single token.
|
||
|
||
### Customizing spaCy's Tokenizer class {#native-tokenizers}
|
||
|
||
Let's imagine you wanted to create a tokenizer for a new language or specific
|
||
domain. There are five things you would need to define:
|
||
|
||
1. A dictionary of **special cases**. This handles things like contractions,
|
||
units of measurement, emoticons, certain abbreviations, etc.
|
||
2. A function `prefix_search`, to handle **preceding punctuation**, such as open
|
||
quotes, open brackets, etc.
|
||
3. A function `suffix_search`, to handle **succeeding punctuation**, such as
|
||
commas, periods, close quotes, etc.
|
||
4. A function `infixes_finditer`, to handle non-whitespace separators, such as
|
||
hyphens etc.
|
||
5. An optional boolean function `token_match` matching strings that should never
|
||
be split, overriding the previous rules. Useful for things like URLs or
|
||
numbers.
|
||
|
||
You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is
|
||
to use `re.compile()` to build a regular expression object, and pass its
|
||
`.search()` and `.finditer()` methods:
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import re
|
||
import spacy
|
||
from spacy.tokenizer import Tokenizer
|
||
|
||
prefix_re = re.compile(r'''^[\[\("']''')
|
||
suffix_re = re.compile(r'''[\]\)"']$''')
|
||
infix_re = re.compile(r'''[-~]''')
|
||
simple_url_re = re.compile(r'''^https?://''')
|
||
|
||
def custom_tokenizer(nlp):
|
||
return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
|
||
suffix_search=suffix_re.search,
|
||
infix_finditer=infix_re.finditer,
|
||
token_match=simple_url_re.match)
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
nlp.tokenizer = custom_tokenizer(nlp)
|
||
doc = nlp(u"hello-world.")
|
||
print([t.text for t in doc])
|
||
```
|
||
|
||
If you need to subclass the tokenizer instead, the relevant methods to
|
||
specialize are `find_prefix`, `find_suffix` and `find_infix`.
|
||
|
||
<Infobox title="Important note" variant="warning">
|
||
|
||
When customizing the prefix, suffix and infix handling, remember that you're
|
||
passing in **functions** for spaCy to execute, e.g. `prefix_re.search` – not
|
||
just the regular expressions. This means that your functions also need to define
|
||
how the rules should be applied. For example, if you're adding your own prefix
|
||
rules, you need to make sure they're only applied to characters at the
|
||
**beginning of a token**, e.g. by adding `^`. Similarly, suffix rules should
|
||
only be applied at the **end of a token**, so your expression should end with a
|
||
`$`.
|
||
|
||
</Infobox>
|
||
|
||
### Hooking an arbitrary tokenizer into the pipeline {#custom-tokenizer}
|
||
|
||
The tokenizer is the first component of the processing pipeline and the only one
|
||
that can't be replaced by writing to `nlp.pipeline`. This is because it has a
|
||
different signature from all the other components: it takes a text and returns a
|
||
`Doc`, whereas all other components expect to already receive a tokenized `Doc`.
|
||
|
||
![The processing pipeline](../images/pipeline.svg)
|
||
|
||
To overwrite the existing tokenizer, you need to replace `nlp.tokenizer` with a
|
||
custom function that takes a text, and returns a `Doc`.
|
||
|
||
```python
|
||
nlp = spacy.load("en_core_web_sm")
|
||
nlp.tokenizer = my_tokenizer
|
||
```
|
||
|
||
| Argument | Type | Description |
|
||
| ----------- | ------- | ------------------------- |
|
||
| `text` | unicode | The raw text to tokenize. |
|
||
| **RETURNS** | `Doc` | The tokenized document. |
|
||
|
||
<Infobox title="Important note: using a custom tokenizer" variant="warning">
|
||
|
||
In spaCy v1.x, you had to add a custom tokenizer by passing it to the `make_doc`
|
||
keyword argument, or by passing a tokenizer "factory" to `create_make_doc`. This
|
||
was unnecessarily complicated. Since spaCy v2.0, you can write to
|
||
`nlp.tokenizer` instead. If your tokenizer needs the vocab, you can write a
|
||
function and use `nlp.vocab`.
|
||
|
||
```diff
|
||
- nlp = spacy.load("en_core_web_sm", make_doc=my_tokenizer)
|
||
- nlp = spacy.load("en_core_web_sm", create_make_doc=my_tokenizer_factory)
|
||
|
||
+ nlp.tokenizer = my_tokenizer
|
||
+ nlp.tokenizer = my_tokenizer_factory(nlp.vocab)
|
||
```
|
||
|
||
</Infobox>
|
||
|
||
### Example: A custom whitespace tokenizer {#custom-tokenizer-example}
|
||
|
||
To construct the tokenizer, we usually want attributes of the `nlp` pipeline.
|
||
Specifically, we want the tokenizer to hold a reference to the vocabulary
|
||
object. Let's say we have the following class as our tokenizer:
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
from spacy.tokens import Doc
|
||
|
||
class WhitespaceTokenizer(object):
|
||
def __init__(self, vocab):
|
||
self.vocab = vocab
|
||
|
||
def __call__(self, text):
|
||
words = text.split(' ')
|
||
# All tokens 'own' a subsequent space character in this tokenizer
|
||
spaces = [True] * len(words)
|
||
return Doc(self.vocab, words=words, spaces=spaces)
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
|
||
doc = nlp(u"What's happened to me? he thought. It wasn't a dream.")
|
||
print([t.text for t in doc])
|
||
```
|
||
|
||
As you can see, we need a `Vocab` instance to construct this — but we won't have
|
||
it until we get back the loaded `nlp` object. The simplest solution is to build
|
||
the tokenizer in two steps. This also means that you can reuse the "tokenizer
|
||
factory" and initialize it with different instances of `Vocab`.
|
||
|
||
### Bringing your own annotations {#own-annotations}
|
||
|
||
spaCy generally assumes by default that your data is raw text. However,
|
||
sometimes your data is partially annotated, e.g. with pre-existing tokenization,
|
||
part-of-speech tags, etc. The most common situation is that you have pre-defined
|
||
tokenization. If you have a list of strings, you can create a `Doc` object
|
||
directly. Optionally, you can also specify a list of boolean values, indicating
|
||
whether each word has a subsequent space.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
from spacy.tokens import Doc
|
||
from spacy.lang.en import English
|
||
|
||
nlp = English()
|
||
doc = Doc(nlp.vocab, words=[u"Hello", u",", u"world", u"!"],
|
||
spaces=[False, True, False, False])
|
||
print([(t.text, t.text_with_ws, t.whitespace_) for t in doc])
|
||
```
|
||
|
||
If provided, the spaces list must be the same length as the words list. The
|
||
spaces list affects the `doc.text`, `span.text`, `token.idx`, `span.start_char`
|
||
and `span.end_char` attributes. If you don't provide a `spaces` sequence, spaCy
|
||
will assume that all words are whitespace delimited.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
from spacy.tokens import Doc
|
||
from spacy.lang.en import English
|
||
|
||
nlp = English()
|
||
bad_spaces = Doc(nlp.vocab, words=[u"Hello", u",", u"world", u"!"])
|
||
good_spaces = Doc(nlp.vocab, words=[u"Hello", u",", u"world", u"!"],
|
||
spaces=[False, True, False, False])
|
||
|
||
print(bad_spaces.text) # 'Hello , world !'
|
||
print(good_spaces.text) # 'Hello, world!'
|
||
```
|
||
|
||
Once you have a [`Doc`](/api/doc) object, you can write to its attributes to set
|
||
the part-of-speech tags, syntactic dependencies, named entities and other
|
||
attributes. For details, see the respective usage pages.
|
||
|
||
## Merging and splitting {#retokenization new="2.1"}
|
||
|
||
The [`Doc.retokenize`](/api/doc#retokenize) context manager lets you merge and
|
||
split tokens. Modifications to the tokenization are stored and performed all at
|
||
once when the context manager exits. To merge several tokens into one single
|
||
token, pass a `Span` to [`retokenizer.merge`](/api/doc#retokenizer.merge). An
|
||
optional dictionary of `attrs` lets you set attributes that will be assigned to
|
||
the merged token – for example, the lemma, part-of-speech tag or entity type. By
|
||
default, the merged token will receive the same attributes as the merged span's
|
||
root.
|
||
|
||
> #### ✏️ Things to try
|
||
>
|
||
> 1. Inspect the `token.lemma_` attribute with and without setting the `attrs`.
|
||
> You'll see that the lemma defaults to "New", the lemma of the span's root.
|
||
> 2. Overwrite other attributes like the `"ENT_TYPE"`. Since "New York" is also
|
||
> recognized as a named entity, this change will also be reflected in the
|
||
> `doc.ents`.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp("I live in New York")
|
||
print("Before:", [token.text for token in doc])
|
||
|
||
with doc.retokenize() as retokenizer:
|
||
retokenizer.merge(doc[3:5], attrs={"LEMMA": "new york"})
|
||
print("After:", [token.text for token in doc])
|
||
```
|
||
|
||
<Infobox title="Tip: merging entities and noun phrases">
|
||
|
||
If you need to merge named entities or noun chunks, check out the built-in
|
||
[`merge_entities`](/api/pipeline-functions#merge_entities) and
|
||
[`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) pipeline
|
||
components. When added to your pipeline using `nlp.add_pipe`, they'll take care
|
||
of merging the spans automatically.
|
||
|
||
</Infobox>
|
||
|
||
The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting
|
||
one token into two or more tokens. This can be useful for cases where
|
||
tokenization rules alone aren't sufficient. For example, you might want to split
|
||
"its" into the tokens "it" and "is" — but not the possessive pronoun "its". You
|
||
can write rule-based logic that can find only the correct "its" to split, but by
|
||
that time, the `Doc` will already be tokenized.
|
||
|
||
This process of splitting a token requires more settings, because you need to
|
||
specify the text of the individual tokens, optional per-token attributes and how
|
||
the should be attached to the existing syntax tree. This can be done by
|
||
supplying a list of `heads` – either the token to attach the newly split token
|
||
to, or a `(token, subtoken)` tuple if the newly split token should be attached
|
||
to another subtoken. In this case, "New" should be attached to "York" (the
|
||
second split subtoken) and "York" should be attached to "in".
|
||
|
||
> #### ✏️ Things to try
|
||
>
|
||
> 1. Assign different attributes to the subtokens and compare the result.
|
||
> 2. Change the heads so that "New" is attached to "in" and "York" is attached
|
||
> to "New".
|
||
> 3. Split the token into three tokens instead of two – for example,
|
||
> `["New", "Yo", "rk"]`.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
from spacy import displacy
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp("I live in NewYork")
|
||
print("Before:", [token.text for token in doc])
|
||
displacy.render(doc) # displacy.serve if you're not in a Jupyter environment
|
||
|
||
with doc.retokenize() as retokenizer:
|
||
heads = [(doc[3], 1), doc[2]]
|
||
attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
|
||
retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
|
||
print("After:", [token.text for token in doc])
|
||
displacy.render(doc) # displacy.serve if you're not in a Jupyter environment
|
||
```
|
||
|
||
Specifying the heads as a list of `token` or `(token, subtoken)` tuples allows
|
||
attaching split subtokens to other subtokens, without having to keep track of
|
||
the token indices after splitting.
|
||
|
||
| Token | Head | Description |
|
||
| -------- | ------------- | --------------------------------------------------------------------------------------------------- |
|
||
| `"New"` | `(doc[3], 1)` | Attach this token to the second subtoken (index `1`) that `doc[3]` will be split into, i.e. "York". |
|
||
| `"York"` | `doc[2]` | Attach this token to `doc[1]` in the original `Doc`, i.e. "in". |
|
||
|
||
If you don't care about the heads (for example, if you're only running the
|
||
tokenizer and not the parser), you can each subtoken to itself:
|
||
|
||
```python
|
||
### {highlight="3"}
|
||
doc = nlp("I live in NewYorkCity")
|
||
with doc.retokenize() as retokenizer:
|
||
heads = [(doc[3], 0), (doc[3], 1), (doc[3], 2)]
|
||
retokenizer.split(doc[3], ["New", "York", "City"], heads=heads)
|
||
```
|
||
|
||
<Infobox title="Important note" variant="warning">
|
||
|
||
When splitting tokens, the subtoken texts always have to match the original
|
||
token text – or, put differently `''.join(subtokens) == token.text` always needs
|
||
to hold true. If this wasn't the case, splitting tokens could easily end up
|
||
producing confusing and unexpected results that would contradict spaCy's
|
||
non-destructive tokenization policy.
|
||
|
||
```diff
|
||
doc = nlp("I live in L.A.")
|
||
with doc.retokenize() as retokenizer:
|
||
- retokenizer.split(doc[3], ["Los", "Angeles"], heads=[(doc[3], 1), doc[2]])
|
||
+ retokenizer.split(doc[3], ["L.", "A."], heads=[(doc[3], 1), doc[2]])
|
||
```
|
||
|
||
</Infobox>
|
||
|
||
## Sentence Segmentation {#sbd}
|
||
|
||
A [`Doc`](/api/doc) object's sentences are available via the `Doc.sents`
|
||
property. Unlike other libraries, spaCy uses the dependency parse to determine
|
||
sentence boundaries. This is usually more accurate than a rule-based approach,
|
||
but it also means you'll need a **statistical model** and accurate predictions.
|
||
If your texts are closer to general-purpose news or web text, this should work
|
||
well out-of-the-box. For social media or conversational text that doesn't follow
|
||
the same rules, your application may benefit from a custom rule-based
|
||
implementation. You can either plug a rule-based component into your
|
||
[processing pipeline](/usage/processing-pipelines) or use the
|
||
`SentenceSegmenter` component with a custom strategy.
|
||
|
||
### Default: Using the dependency parse {#sbd-parser model="parser"}
|
||
|
||
To view a `Doc`'s sentences, you can iterate over the `Doc.sents`, a generator
|
||
that yields [`Span`](/api/span) objects.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(u"This is a sentence. This is another sentence.")
|
||
for sent in doc.sents:
|
||
print(sent.text)
|
||
```
|
||
|
||
### Setting boundaries manually {#sbd-manual}
|
||
|
||
spaCy's dependency parser respects already set boundaries, so you can preprocess
|
||
your `Doc` using custom rules _before_ it's parsed. This can be done by adding a
|
||
[custom pipeline component](/usage/processing-pipelines). Depending on your
|
||
text, this may also improve accuracy, since the parser is constrained to predict
|
||
parses consistent with the sentence boundaries.
|
||
|
||
<Infobox title="Important note" variant="warning">
|
||
|
||
To prevent inconsistent state, you can only set boundaries **before** a document
|
||
is parsed (and `Doc.is_parsed` is `False`). To ensure that your component is
|
||
added in the right place, you can set `before='parser'` or `first=True` when
|
||
adding it to the pipeline using [`nlp.add_pipe`](/api/language#add_pipe).
|
||
|
||
</Infobox>
|
||
|
||
Here's an example of a component that implements a pre-processing rule for
|
||
splitting on `'...'` tokens. The component is added before the parser, which is
|
||
then used to further segment the text. This approach can be useful if you want
|
||
to implement **additional** rules specific to your data, while still being able
|
||
to take advantage of dependency-based sentence segmentation.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
text = u"this is a sentence...hello...and another sentence."
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
doc = nlp(text)
|
||
print("Before:", [sent.text for sent in doc.sents])
|
||
|
||
def set_custom_boundaries(doc):
|
||
for token in doc[:-1]:
|
||
if token.text == "...":
|
||
doc[token.i+1].is_sent_start = True
|
||
return doc
|
||
|
||
nlp.add_pipe(set_custom_boundaries, before="parser")
|
||
doc = nlp(text)
|
||
print("After:", [sent.text for sent in doc.sents])
|
||
```
|
||
|
||
### Rule-based pipeline component {#sbd-component}
|
||
|
||
The `sentencizer` component is a
|
||
[pipeline component](/usage/processing-pipelines) that splits sentences on
|
||
punctuation like `.`, `!` or `?`. You can plug it into your pipeline if you only
|
||
need sentence boundaries without the dependency parse. Note that `Doc.sents`
|
||
will **raise an error** if no sentence boundaries are set.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
from spacy.lang.en import English
|
||
|
||
nlp = English() # just the language with no model
|
||
sentencizer = nlp.create_pipe("sentencizer")
|
||
nlp.add_pipe(sentencizer)
|
||
doc = nlp(u"This is a sentence. This is another sentence.")
|
||
for sent in doc.sents:
|
||
print(sent.text)
|
||
```
|
||
|
||
### Custom rule-based strategy {#sbd-custom}
|
||
|
||
If you want to implement your own strategy that differs from the default
|
||
rule-based approach of splitting on sentences, you can also instantiate the
|
||
`SentenceSegmenter` directly and pass in your own strategy. The strategy should
|
||
be a function that takes a `Doc` object and yields a `Span` for each sentence.
|
||
Here's an example of a custom segmentation strategy for splitting on newlines
|
||
only:
|
||
|
||
```python
|
||
### {executable="true"}
|
||
from spacy.lang.en import English
|
||
from spacy.pipeline import SentenceSegmenter
|
||
|
||
def split_on_newlines(doc):
|
||
start = 0
|
||
seen_newline = False
|
||
for word in doc:
|
||
if seen_newline and not word.is_space:
|
||
yield doc[start:word.i]
|
||
start = word.i
|
||
seen_newline = False
|
||
elif word.text == '\\n':
|
||
seen_newline = True
|
||
if start < len(doc):
|
||
yield doc[start:len(doc)]
|
||
|
||
nlp = English() # Just the language with no model
|
||
sentencizer = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
|
||
nlp.add_pipe(sentencizer)
|
||
doc = nlp(u"This is a sentence\\n\\nThis is another sentence\\nAnd more")
|
||
for sent in doc.sents:
|
||
print([token.text for token in sent])
|
||
```
|
||
|
||
## Rule-based matching {#rule-based-matching hidden="true"}
|
||
|
||
<div id="rule-based-matching">
|
||
<Infobox title="📖 Rule-based matching" id="rule-based-matching">
|
||
|
||
The documentation on rule-based matching
|
||
[has moved to its own page](/usage/rule-based-matching).
|
||
|
||
</Infobox>
|
||
</div>
|