mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-14 13:47:13 +03:00
923 lines
41 KiB
Markdown
923 lines
41 KiB
Markdown
---
|
||
title: Language Processing Pipelines
|
||
next: vectors-similarity
|
||
menu:
|
||
- ['Processing Text', 'processing']
|
||
- ['How Pipelines Work', 'pipelines']
|
||
- ['Custom Components', 'custom-components']
|
||
- ['Extension Attributes', 'custom-components-attributes']
|
||
- ['Plugins & Wrappers', 'plugins']
|
||
---
|
||
|
||
import Pipelines101 from 'usage/101/\_pipelines.md'
|
||
|
||
<Pipelines101 />
|
||
|
||
## Processing text {#processing}
|
||
|
||
When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
|
||
component** on the `Doc`, in order. It then returns the processed `Doc` that you
|
||
can work with.
|
||
|
||
```python
|
||
doc = nlp("This is a text")
|
||
```
|
||
|
||
When processing large volumes of text, the statistical models are usually more
|
||
efficient if you let them work on batches of texts. spaCy's
|
||
[`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields
|
||
processed `Doc` objects. The batching is done internally.
|
||
|
||
```diff
|
||
texts = ["This is a text", "These are lots of texts", "..."]
|
||
- docs = [nlp(text) for text in texts]
|
||
+ docs = list(nlp.pipe(texts))
|
||
```
|
||
|
||
<Infobox title="Tips for efficient processing">
|
||
|
||
- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
|
||
buffer them in batches, instead of one-by-one. This is usually much more
|
||
efficient.
|
||
- Only apply the **pipeline components you need**. Getting predictions from the
|
||
model that you don't actually need adds up and becomes very inefficient at
|
||
scale. To prevent this, use the `disable` keyword argument to disable
|
||
components you don't need – either when loading a model, or during processing
|
||
with `nlp.pipe`. See the section on
|
||
[disabling pipeline components](#disabling) for more details and examples.
|
||
|
||
</Infobox>
|
||
|
||
In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
|
||
(potentially very large) iterable of texts as a stream. Because we're only
|
||
accessing the named entities in `doc.ents` (set by the `ner` component), we'll
|
||
disable all other statistical components (the `tagger` and `parser`) during
|
||
processing. `nlp.pipe` yields `Doc` objects, so we can iterate over them and
|
||
access the named entity predictions:
|
||
|
||
> #### ✏️ Things to try
|
||
>
|
||
> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now
|
||
> empty, because the entity recognizer didn't run.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
texts = [
|
||
"Net income was $9.4 million compared to the prior year of $2.7 million.",
|
||
"Revenue exceeded twelve billion dollars, with a loss of $1b.",
|
||
]
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
|
||
# Do something with the doc here
|
||
print([(ent.text, ent.label_) for ent in doc.ents])
|
||
```
|
||
|
||
<Infobox title="Important note" variant="warning">
|
||
|
||
When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a
|
||
[generator](https://realpython.com/introduction-to-python-generators/) that
|
||
yields `Doc` objects – not a list. So if you want to use it like a list, you'll
|
||
have to call `list()` on it first:
|
||
|
||
```diff
|
||
- docs = nlp.pipe(texts)[0] # will raise an error
|
||
+ docs = list(nlp.pipe(texts))[0] # works as expected
|
||
```
|
||
|
||
</Infobox>
|
||
|
||
## How pipelines work {#pipelines}
|
||
|
||
spaCy makes it very easy to create your own pipelines consisting of reusable
|
||
components – this includes spaCy's default tagger, parser and entity recognizer,
|
||
but also your own custom processing functions. A pipeline component can be added
|
||
to an already existing `nlp` object, specified when initializing a `Language`
|
||
class, or defined within a [model package](/usage/saving-loading#models).
|
||
|
||
When you load a model, spaCy first consults the model's
|
||
[`meta.json`](/usage/saving-loading#models). The meta typically includes the
|
||
model details, the ID of a language class, and an optional list of pipeline
|
||
components. spaCy then does the following:
|
||
|
||
> #### meta.json (excerpt)
|
||
>
|
||
> ```json
|
||
> {
|
||
> "lang": "en",
|
||
> "name": "core_web_sm",
|
||
> "description": "Example model for spaCy",
|
||
> "pipeline": ["tagger", "parser", "ner"]
|
||
> }
|
||
> ```
|
||
|
||
1. Load the **language class and data** for the given ID via
|
||
[`get_lang_class`](/api/top-level#util.get_lang_class) and initialize it. The
|
||
`Language` class contains the shared vocabulary, tokenization rules and the
|
||
language-specific annotation scheme.
|
||
2. Iterate over the **pipeline names** and create each component using
|
||
[`create_pipe`](/api/language#create_pipe), which looks them up in
|
||
`Language.factories`.
|
||
3. Add each pipeline component to the pipeline in order, using
|
||
[`add_pipe`](/api/language#add_pipe).
|
||
4. Make the **model data** available to the `Language` class by calling
|
||
[`from_disk`](/api/language#from_disk) with the path to the model data
|
||
directory.
|
||
|
||
So when you call this...
|
||
|
||
```python
|
||
nlp = spacy.load("en_core_web_sm")
|
||
```
|
||
|
||
... the model's `meta.json` tells spaCy to use the language `"en"` and the
|
||
pipeline `["tagger", "parser", "ner"]`. spaCy will then initialize
|
||
`spacy.lang.en.English`, and create each pipeline component and add it to the
|
||
processing pipeline. It'll then load in the model's data from its data directory
|
||
and return the modified `Language` class for you to use as the `nlp` object.
|
||
|
||
Fundamentally, a [spaCy model](/models) consists of three components: **the
|
||
weights**, i.e. binary data loaded in from a directory, a **pipeline** of
|
||
functions called in order, and **language data** like the tokenization rules and
|
||
annotation scheme. All of this is specific to each model, and defined in the
|
||
model's `meta.json` – for example, a Spanish NER model requires different
|
||
weights, language data and pipeline components than an English parsing and
|
||
tagging model. This is also why the pipeline state is always held by the
|
||
`Language` class. [`spacy.load`](/api/top-level#spacy.load) puts this all
|
||
together and returns an instance of `Language` with a pipeline set and access to
|
||
the binary data:
|
||
|
||
```python
|
||
### spacy.load under the hood
|
||
lang = "en"
|
||
pipeline = ["tagger", "parser", "ner"]
|
||
data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"
|
||
|
||
cls = spacy.util.get_lang_class(lang) # 1. Get Language instance, e.g. English()
|
||
nlp = cls() # 2. Initialize it
|
||
for name in pipeline:
|
||
component = nlp.create_pipe(name) # 3. Create the pipeline components
|
||
nlp.add_pipe(component) # 4. Add the component to the pipeline
|
||
nlp.from_disk(model_data_path) # 5. Load in the binary data
|
||
```
|
||
|
||
When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
|
||
component** on the `Doc`, in order. Since the model data is loaded, the
|
||
components can access it to assign annotations to the `Doc` object, and
|
||
subsequently to the `Token` and `Span` which are only views of the `Doc`, and
|
||
don't own any data themselves. All components return the modified document,
|
||
which is then processed by the component next in the pipeline.
|
||
|
||
```python
|
||
### The pipeline under the hood
|
||
doc = nlp.make_doc("This is a sentence") # create a Doc from raw text
|
||
for name, proc in nlp.pipeline: # iterate over components in order
|
||
doc = proc(doc) # apply each component
|
||
```
|
||
|
||
The current processing pipeline is available as `nlp.pipeline`, which returns a
|
||
list of `(name, component)` tuples, or `nlp.pipe_names`, which only returns a
|
||
list of human-readable component names.
|
||
|
||
```python
|
||
print(nlp.pipeline)
|
||
# [('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>)]
|
||
print(nlp.pipe_names)
|
||
# ['tagger', 'parser', 'ner']
|
||
```
|
||
|
||
### Built-in pipeline components {#built-in}
|
||
|
||
spaCy ships with several built-in pipeline components that are also available in
|
||
the `Language.factories`. This means that you can initialize them by calling
|
||
[`nlp.create_pipe`](/api/language#create_pipe) with their string names and
|
||
require them in the pipeline settings in your model's `meta.json`.
|
||
|
||
> #### Usage
|
||
>
|
||
> ```python
|
||
> # Option 1: Import and initialize
|
||
> from spacy.pipeline import EntityRuler
|
||
> ruler = EntityRuler(nlp)
|
||
> nlp.add_pipe(ruler)
|
||
>
|
||
> # Option 2: Using nlp.create_pipe
|
||
> sentencizer = nlp.create_pipe("sentencizer")
|
||
> nlp.add_pipe(sentencizer)
|
||
> ```
|
||
|
||
| String name | Component | Description |
|
||
| ------------------- | ---------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
|
||
| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech-tags. |
|
||
| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. |
|
||
| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. |
|
||
| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
|
||
| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. |
|
||
| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules. |
|
||
| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. |
|
||
| `merge_noun_chunks` | [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) | Merge all noun chunks into a single token. Should be added after the tagger and parser. |
|
||
| `merge_entities` | [`merge_entities`](/api/pipeline-functions#merge_entities) | Merge all entities into a single token. Should be added after the entity recognizer. |
|
||
| `merge_subtokens` | [`merge_subtokens`](/api/pipeline-functions#merge_subtokens) | Merge subtokens predicted by the parser into single tokens. Should be added after the parser. |
|
||
|
||
### Disabling and modifying pipeline components {#disabling}
|
||
|
||
If you don't need a particular component of the pipeline – for example, the
|
||
tagger or the parser, you can **disable loading** it. This can sometimes make a
|
||
big difference and improve loading speed. Disabled component names can be
|
||
provided to [`spacy.load`](/api/top-level#spacy.load),
|
||
[`Language.from_disk`](/api/language#from_disk) or the `nlp` object itself as a
|
||
list:
|
||
|
||
```python
|
||
### Disable loading
|
||
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
|
||
nlp = English().from_disk("/model", disable=["ner"])
|
||
```
|
||
|
||
In some cases, you do want to load all pipeline components and their weights,
|
||
because you need them at different points in your application. However, if you
|
||
only need a `Doc` object with named entities, there's no need to run all
|
||
pipeline components on it – that can potentially make processing much slower.
|
||
Instead, you can use the `disable` keyword argument on
|
||
[`nlp.pipe`](/api/language#pipe) to temporarily disable the components **during
|
||
processing**:
|
||
|
||
```python
|
||
### Disable for processing
|
||
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
|
||
# Do something with the doc here
|
||
```
|
||
|
||
If you need to **execute more code** with components disabled – e.g. to reset
|
||
the weights or update only some components during training – you can use the
|
||
[`nlp.select_pipes`](/api/language#select_pipes) contextmanager. At the end of
|
||
the `with` block, the disabled pipeline components will be restored
|
||
automatically. Alternatively, `select_pipes` returns an object that lets you
|
||
call its `restore()` method to restore the disabled components when needed. This
|
||
can be useful if you want to prevent unnecessary code indentation of large
|
||
blocks.
|
||
|
||
```python
|
||
### Disable for block
|
||
# 1. Use as a contextmanager
|
||
with nlp.select_pipes(disable=["tagger", "parser"]):
|
||
doc = nlp("I won't be tagged and parsed")
|
||
doc = nlp("I will be tagged and parsed")
|
||
|
||
# 2. Restore manually
|
||
disabled = nlp.select_pipes(disable="ner")
|
||
doc = nlp("I won't have named entities")
|
||
disabled.restore()
|
||
```
|
||
|
||
If you want to disable all pipes except for one or a few, you can use the
|
||
`enable` keyword. Just like the `disable` keyword, it takes a list of pipe
|
||
names, or a string defining just one pipe.
|
||
|
||
```python
|
||
# Enable only the parser
|
||
with nlp.select_pipes(enable="parser"):
|
||
doc = nlp("I will only be parsed")
|
||
```
|
||
|
||
Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
|
||
to remove pipeline components from an existing pipeline, the
|
||
[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
|
||
[`replace_pipe`](/api/language#replace_pipe) method to replace them with a
|
||
custom component entirely (more details on this in the section on
|
||
[custom components](#custom-components).
|
||
|
||
```python
|
||
nlp.remove_pipe("parser")
|
||
nlp.rename_pipe("ner", "entityrecognizer")
|
||
nlp.replace_pipe("tagger", my_custom_tagger)
|
||
```
|
||
|
||
## Creating custom pipeline components {#custom-components}
|
||
|
||
A component receives a `Doc` object and can modify it – for example, by using
|
||
the current weights to make a prediction and set some annotation on the
|
||
document. By adding a component to the pipeline, you'll get access to the `Doc`
|
||
at any point **during processing** – instead of only being able to modify it
|
||
afterwards.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> def my_component(doc):
|
||
> # do something to the doc here
|
||
> return doc
|
||
> ```
|
||
|
||
| Argument | Type | Description |
|
||
| ----------- | ----- | ------------------------------------------------------ |
|
||
| `doc` | `Doc` | The `Doc` object processed by the previous component. |
|
||
| **RETURNS** | `Doc` | The `Doc` object processed by this pipeline component. |
|
||
|
||
Custom components can be added to the pipeline using the
|
||
[`add_pipe`](/api/language#add_pipe) method. Optionally, you can either specify
|
||
a component to add it **before or after**, tell spaCy to add it **first or
|
||
last** in the pipeline, or define a **custom name**. If no name is set and no
|
||
`name` attribute is present on your component, the function name is used.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> nlp.add_pipe(my_component)
|
||
> nlp.add_pipe(my_component, first=True)
|
||
> nlp.add_pipe(my_component, before="parser")
|
||
> ```
|
||
|
||
| Argument | Type | Description |
|
||
| -------- | ---- | ------------------------------------------------------------------------ |
|
||
| `last` | bool | If set to `True`, component is added **last** in the pipeline (default). |
|
||
| `first` | bool | If set to `True`, component is added **first** in the pipeline. |
|
||
| `before` | str | String name of component to add the new component **before**. |
|
||
| `after` | str | String name of component to add the new component **after**. |
|
||
|
||
### Example: A simple pipeline component {#custom-components-simple}
|
||
|
||
The following component receives the `Doc` in the pipeline and prints some
|
||
information about it: the number of tokens, the part-of-speech tags of the
|
||
tokens and a conditional message based on the document length.
|
||
|
||
> #### ✏️ Things to try
|
||
>
|
||
> 1. Add the component first in the pipeline by setting `first=True`. You'll see
|
||
> that the part-of-speech tags are empty, because the component now runs
|
||
> before the tagger and the tags aren't available yet.
|
||
> 2. Change the component `name` or remove the `name` argument. You should see
|
||
> this change reflected in `nlp.pipe_names`.
|
||
> 3. Print `nlp.pipeline`. You'll see a list of tuples describing the component
|
||
> name and the function that's called on the `Doc` object in the pipeline.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
def my_component(doc):
|
||
print(f"After tokenization, this doc has {len(doc)} tokens.")
|
||
print("The part-of-speech tags are:", [token.pos_ for token in doc])
|
||
if len(doc) < 10:
|
||
print("This is a pretty short document.")
|
||
return doc
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
nlp.add_pipe(my_component, name="print_info", last=True)
|
||
print(nlp.pipe_names) # ['tagger', 'parser', 'ner', 'print_info']
|
||
doc = nlp("This is a sentence.")
|
||
|
||
```
|
||
|
||
Of course, you can also wrap your component as a class to allow initializing it
|
||
with custom settings and hold state within the component. This is useful for
|
||
**stateful components**, especially ones which **depend on shared data**. In the
|
||
following example, the custom component `EntityMatcher` can be initialized with
|
||
`nlp` object, a terminology list and an entity label. Using the
|
||
[`PhraseMatcher`](/api/phrasematcher), it then matches the terms in the `Doc`
|
||
and adds them to the existing entities.
|
||
|
||
<Infobox title="Important note" variant="warning">
|
||
|
||
As of v2.1.0, spaCy ships with the [`EntityRuler`](/api/entityruler), a pipeline
|
||
component for easy, rule-based named entity recognition. Its implementation is
|
||
similar to the `EntityMatcher` code shown below, but it includes some additional
|
||
features like support for phrase patterns and token patterns, handling overlaps
|
||
with existing entities and pattern export as JSONL.
|
||
|
||
We'll still keep the pipeline component example below, as it works well to
|
||
illustrate complex components. But if you're planning on using this type of
|
||
component in your application, you might find the `EntityRuler` more convenient.
|
||
[See here](/usage/rule-based-matching#entityruler) for more details and
|
||
examples.
|
||
|
||
</Infobox>
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
from spacy.matcher import PhraseMatcher
|
||
from spacy.tokens import Span
|
||
|
||
class EntityMatcher(object):
|
||
name = "entity_matcher"
|
||
|
||
def __init__(self, nlp, terms, label):
|
||
patterns = [nlp.make_doc(text) for text in terms]
|
||
self.matcher = PhraseMatcher(nlp.vocab)
|
||
self.matcher.add(label, patterns)
|
||
|
||
def __call__(self, doc):
|
||
matches = self.matcher(doc)
|
||
for match_id, start, end in matches:
|
||
span = Span(doc, start, end, label=match_id)
|
||
doc.ents = list(doc.ents) + [span]
|
||
return doc
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
terms = ("cat", "dog", "tree kangaroo", "giant sea spider")
|
||
entity_matcher = EntityMatcher(nlp, terms, "ANIMAL")
|
||
|
||
nlp.add_pipe(entity_matcher, after="ner")
|
||
|
||
print(nlp.pipe_names) # The components in the pipeline
|
||
|
||
doc = nlp("This is a text about Barack Obama and a tree kangaroo")
|
||
print([(ent.text, ent.label_) for ent in doc.ents])
|
||
```
|
||
|
||
### Example: Custom sentence segmentation logic {#component-example1}
|
||
|
||
Let's say you want to implement custom logic to improve spaCy's sentence
|
||
boundary detection. Currently, sentence segmentation is based on the dependency
|
||
parse, which doesn't always produce ideal results. The custom logic should
|
||
therefore be applied **after** tokenization, but _before_ the dependency parsing
|
||
– this way, the parser can also take advantage of the sentence boundaries.
|
||
|
||
> #### ✏️ Things to try
|
||
>
|
||
> 1. Print `[token.dep_ for token in doc]` with and without the custom pipeline
|
||
> component. You'll see that the predicted dependency parse changes to match
|
||
> the sentence boundaries.
|
||
> 2. Remove the `else` block. All other tokens will now have `is_sent_start` set
|
||
> to `None` (missing value), the parser will assign sentence boundaries in
|
||
> between.
|
||
|
||
```python
|
||
### {executable="true"}
|
||
import spacy
|
||
|
||
def custom_sentencizer(doc):
|
||
for i, token in enumerate(doc[:-2]):
|
||
# Define sentence start if pipe + titlecase token
|
||
if token.text == "|" and doc[i+1].is_title:
|
||
doc[i+1].is_sent_start = True
|
||
else:
|
||
# Explicitly set sentence start to False otherwise, to tell
|
||
# the parser to leave those tokens alone
|
||
doc[i+1].is_sent_start = False
|
||
return doc
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
nlp.add_pipe(custom_sentencizer, before="parser") # Insert before the parser
|
||
doc = nlp("This is. A sentence. | This is. Another sentence.")
|
||
for sent in doc.sents:
|
||
print(sent.text)
|
||
```
|
||
|
||
### Example: Pipeline component for entity matching and tagging with custom attributes {#component-example2}
|
||
|
||
This example shows how to create a spaCy extension that takes a terminology list
|
||
(in this case, single- and multi-word company names), matches the occurrences in
|
||
a document, labels them as `ORG` entities, merges the tokens and sets custom
|
||
`is_tech_org` and `has_tech_org` attributes. For efficient matching, the example
|
||
uses the [`PhraseMatcher`](/api/phrasematcher) which accepts `Doc` objects as
|
||
match patterns and works well for large terminology lists. It also ensures your
|
||
patterns will always match, even when you customize spaCy's tokenization rules.
|
||
When you call `nlp` on a text, the custom pipeline component is applied to the
|
||
`Doc`.
|
||
|
||
```python
|
||
https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_entities.py
|
||
```
|
||
|
||
Wrapping this functionality in a pipeline component allows you to reuse the
|
||
module with different settings, and have all pre-processing taken care of when
|
||
you call `nlp` on your text and receive a `Doc` object.
|
||
|
||
### Adding factories {#custom-components-factories}
|
||
|
||
When spaCy loads a model via its `meta.json`, it will iterate over the
|
||
`"pipeline"` setting, look up every component name in the internal factories and
|
||
call [`nlp.create_pipe`](/api/language#create_pipe) to initialize the individual
|
||
components, like the tagger, parser or entity recognizer. If your model uses
|
||
custom components, this won't work – so you'll have to tell spaCy **where to
|
||
find your component**. You can do this by writing to the `Language.factories`:
|
||
|
||
```python
|
||
from spacy.language import Language
|
||
Language.factories["entity_matcher"] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
|
||
```
|
||
|
||
You can also ship the above code and your custom component in your packaged
|
||
model's `__init__.py`, so it's executed when you load your model. The `**cfg`
|
||
config parameters are passed all the way down from
|
||
[`spacy.load`](/api/top-level#spacy.load), so you can load the model and its
|
||
components with custom settings:
|
||
|
||
```python
|
||
nlp = spacy.load("your_custom_model", terms=["tree kangaroo"], label="ANIMAL")
|
||
```
|
||
|
||
<Infobox title="Important note" variant="warning">
|
||
|
||
When you load a model via its package name, like `en_core_web_sm`, spaCy will
|
||
import the package and then call its `load()` method. This means that custom
|
||
code in the model's `__init__.py` will be executed, too. This is **not the
|
||
case** if you're loading a model from a path containing the model data. Here,
|
||
spaCy will only read in the `meta.json`. If you want to use custom factories
|
||
with a model loaded from a path, you need to add them to `Language.factories`
|
||
_before_ you load the model.
|
||
|
||
</Infobox>
|
||
|
||
## Extension attributes {#custom-components-attributes new="2"}
|
||
|
||
As of v2.0, spaCy allows you to set any custom attributes and methods on the
|
||
`Doc`, `Span` and `Token`, which become available as `Doc._`, `Span._` and
|
||
`Token._` – for example, `Token._.my_attr`. This lets you store additional
|
||
information relevant to your application, add new features and functionality to
|
||
spaCy, and implement your own models trained with other machine learning
|
||
libraries. It also lets you take advantage of spaCy's data structures and the
|
||
`Doc` object as the "single source of truth".
|
||
|
||
<Accordion title="Why ._ and not just a top-level attribute?" id="why-dot-underscore">
|
||
|
||
Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
|
||
separation and makes it easier to ensure backwards compatibility. For example,
|
||
if you've implemented your own `.coref` property and spaCy claims it one day,
|
||
it'll break your code. Similarly, just by looking at the code, you'll
|
||
immediately know what's built-in and what's custom – for example,
|
||
`doc.sentiment` is spaCy, while `doc._.sent_score` isn't.
|
||
|
||
</Accordion>
|
||
|
||
<Accordion title="How is the ._ implemented?" id="dot-underscore-implementation">
|
||
|
||
Extension definitions – the defaults, methods, getters and setters you pass in
|
||
to `set_extension` – are stored in class attributes on the `Underscore` class.
|
||
If you write to an extension attribute, e.g. `doc._.hello = True`, the data is
|
||
stored within the [`Doc.user_data`](/api/doc#attributes) dictionary. To keep the
|
||
underscore data separate from your other dictionary entries, the string `"._."`
|
||
is placed before the name, in a tuple.
|
||
|
||
</Accordion>
|
||
|
||
---
|
||
|
||
There are three main types of extensions, which can be defined using the
|
||
[`Doc.set_extension`](/api/doc#set_extension),
|
||
[`Span.set_extension`](/api/span#set_extension) and
|
||
[`Token.set_extension`](/api/token#set_extension) methods.
|
||
|
||
1. **Attribute extensions.** Set a default value for an attribute, which can be
|
||
overwritten manually at any time. Attribute extensions work like "normal"
|
||
variables and are the quickest way to store arbitrary information on a `Doc`,
|
||
`Span` or `Token`.
|
||
|
||
```python
|
||
Doc.set_extension("hello", default=True)
|
||
assert doc._.hello
|
||
doc._.hello = False
|
||
```
|
||
|
||
2. **Property extensions.** Define a getter and an optional setter function. If
|
||
no setter is provided, the extension is immutable. Since the getter and
|
||
setter functions are only called when you _retrieve_ the attribute, you can
|
||
also access values of previously added attribute extensions. For example, a
|
||
`Doc` getter can average over `Token` attributes. For `Span` extensions,
|
||
you'll almost always want to use a property – otherwise, you'd have to write
|
||
to _every possible_ `Span` in the `Doc` to set up the values correctly.
|
||
|
||
```python
|
||
Doc.set_extension("hello", getter=get_hello_value, setter=set_hello_value)
|
||
assert doc._.hello
|
||
doc._.hello = "Hi!"
|
||
```
|
||
|
||
3. **Method extensions.** Assign a function that becomes available as an object
|
||
method. Method extensions are always immutable. For more details and
|
||
implementation ideas, see
|
||
[these examples](/usage/examples#custom-components-attr-methods).
|
||
|
||
```python
|
||
Doc.set_extension("hello", method=lambda doc, name: f"Hi {name}!")
|
||
assert doc._.hello("Bob") == "Hi Bob!"
|
||
```
|
||
|
||
Before you can access a custom extension, you need to register it using the
|
||
`set_extension` method on the object you want to add it to, e.g. the `Doc`. Keep
|
||
in mind that extensions are always **added globally** and not just on a
|
||
particular instance. If an attribute of the same name already exists, or if
|
||
you're trying to access an attribute that hasn't been registered, spaCy will
|
||
raise an `AttributeError`.
|
||
|
||
```python
|
||
### Example
|
||
from spacy.tokens import Doc, Span, Token
|
||
|
||
fruits = ["apple", "pear", "banana", "orange", "strawberry"]
|
||
is_fruit_getter = lambda token: token.text in fruits
|
||
has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])
|
||
|
||
Token.set_extension("is_fruit", getter=is_fruit_getter)
|
||
Doc.set_extension("has_fruit", getter=has_fruit_getter)
|
||
Span.set_extension("has_fruit", getter=has_fruit_getter)
|
||
```
|
||
|
||
> #### Usage example
|
||
>
|
||
> ```python
|
||
> doc = nlp("I have an apple and a melon")
|
||
> assert doc[3]._.is_fruit # get Token attributes
|
||
> assert not doc[0]._.is_fruit
|
||
> assert doc._.has_fruit # get Doc attributes
|
||
> assert doc[1:4]._.has_fruit # get Span attributes
|
||
> ```
|
||
|
||
Once you've registered your custom attribute, you can also use the built-in
|
||
`set`, `get` and `has` methods to modify and retrieve the attributes. This is
|
||
especially useful it you want to pass in a string instead of calling
|
||
`doc._.my_attr`.
|
||
|
||
### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}
|
||
|
||
This example shows the implementation of a pipeline component that fetches
|
||
country meta data via the [REST Countries API](https://restcountries.eu), sets
|
||
entity annotations for countries, merges entities into one token and sets custom
|
||
attributes on the `Doc`, `Span` and `Token` – for example, the capital,
|
||
latitude/longitude coordinates and even the country flag.
|
||
|
||
```python
|
||
https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_countries_api.py
|
||
```
|
||
|
||
In this case, all data can be fetched on initialization in one request. However,
|
||
if you're working with text that contains incomplete country names, spelling
|
||
mistakes or foreign-language versions, you could also implement a
|
||
`like_country`-style getter function that makes a request to the search API
|
||
endpoint and returns the best-matching result.
|
||
|
||
### User hooks {#custom-components-user-hooks}
|
||
|
||
While it's generally recommended to use the `Doc._`, `Span._` and `Token._`
|
||
proxies to add your own custom attributes, spaCy offers a few exceptions to
|
||
allow **customizing the built-in methods** like
|
||
[`Doc.similarity`](/api/doc#similarity) or [`Doc.vector`](/api/doc#vector) with
|
||
your own hooks, which can rely on statistical models you train yourself. For
|
||
instance, you can provide your own on-the-fly sentence segmentation algorithm or
|
||
document similarity method.
|
||
|
||
Hooks let you customize some of the behaviors of the `Doc`, `Span` or `Token`
|
||
objects by adding a component to the pipeline. For instance, to customize the
|
||
[`Doc.similarity`](/api/doc#similarity) method, you can add a component that
|
||
sets a custom function to `doc.user_hooks['similarity']`. The built-in
|
||
`Doc.similarity` method will check the `user_hooks` dict, and delegate to your
|
||
function if you've set one. Similar results can be achieved by setting functions
|
||
to `Doc.user_span_hooks` and `Doc.user_token_hooks`.
|
||
|
||
> #### Implementation note
|
||
>
|
||
> The hooks live on the `Doc` object because the `Span` and `Token` objects are
|
||
> created lazily, and don't own any data. They just proxy to their parent `Doc`.
|
||
> This turns out to be convenient here — we only have to worry about installing
|
||
> hooks in one place.
|
||
|
||
| Name | Customizes |
|
||
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||
| `user_hooks` | [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents) |
|
||
| `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) |
|
||
| `user_span_hooks` | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root) |
|
||
|
||
```python
|
||
### Add custom similarity hooks
|
||
class SimilarityModel(object):
|
||
def __init__(self, model):
|
||
self._model = model
|
||
|
||
def __call__(self, doc):
|
||
doc.user_hooks["similarity"] = self.similarity
|
||
doc.user_span_hooks["similarity"] = self.similarity
|
||
doc.user_token_hooks["similarity"] = self.similarity
|
||
|
||
def similarity(self, obj1, obj2):
|
||
y = self._model([obj1.vector, obj2.vector])
|
||
return float(y[0])
|
||
```
|
||
|
||
## Developing plugins and wrappers {#plugins}
|
||
|
||
We're very excited about all the new possibilities for community extensions and
|
||
plugins in spaCy, and we can't wait to see what you build with it! To get you
|
||
started, here are a few tips, tricks and best
|
||
practices. [See here](/universe/?category=pipeline) for examples of other spaCy
|
||
extensions.
|
||
|
||
### Usage ideas {#custom-components-usage-ideas}
|
||
|
||
- **Adding new features and hooking in models.** For example, a sentiment
|
||
analysis model, or your preferred solution for lemmatization or sentiment
|
||
analysis. spaCy's built-in tagger, parser and entity recognizer respect
|
||
annotations that were already set on the `Doc` in a previous step of the
|
||
pipeline.
|
||
- **Integrating other libraries and APIs.** For example, your pipeline component
|
||
can write additional information and data directly to the `Doc` or `Token` as
|
||
custom attributes, while making sure no information is lost in the process.
|
||
This can be output generated by other libraries and models, or an external
|
||
service with a REST API.
|
||
- **Debugging and logging.** For example, a component which stores and/or
|
||
exports relevant information about the current state of the processed
|
||
document, and insert it at any point of your pipeline.
|
||
|
||
### Best practices {#custom-components-best-practices}
|
||
|
||
Extensions can claim their own `._` namespace and exist as standalone packages.
|
||
If you're developing a tool or library and want to make it easy for others to
|
||
use it with spaCy and add it to their pipeline, all you have to do is expose a
|
||
function that takes a `Doc`, modifies it and returns it.
|
||
|
||
- Make sure to choose a **descriptive and specific name** for your pipeline
|
||
component class, and set it as its `name` attribute. Avoid names that are too
|
||
common or likely to clash with built-in or a user's other custom components.
|
||
While it's fine to call your package `"spacy_my_extension"`, avoid component
|
||
names including `"spacy"`, since this can easily lead to confusion.
|
||
|
||
```diff
|
||
+ name = "myapp_lemmatizer"
|
||
- name = "lemmatizer"
|
||
```
|
||
|
||
- When writing to `Doc`, `Token` or `Span` objects, **use getter functions**
|
||
wherever possible, and avoid setting values explicitly. Tokens and spans don't
|
||
own any data themselves, and they're implemented as C extension classes – so
|
||
you can't usually add new attributes to them like you could with most pure
|
||
Python objects.
|
||
|
||
```diff
|
||
+ is_fruit = lambda token: token.text in ("apple", "orange")
|
||
+ Token.set_extension("is_fruit", getter=is_fruit)
|
||
|
||
- token._.set_extension("is_fruit", default=False)
|
||
- if token.text in ('"apple", "orange"):
|
||
- token._.set("is_fruit", True)
|
||
```
|
||
|
||
- Always add your custom attributes to the **global** `Doc`, `Token` or `Span`
|
||
objects, not a particular instance of them. Add the attributes **as early as
|
||
possible**, e.g. in your extension's `__init__` method or in the global scope
|
||
of your module. This means that in the case of namespace collisions, the user
|
||
will see an error immediately, not just when they run their pipeline.
|
||
|
||
```diff
|
||
+ from spacy.tokens import Doc
|
||
+ def __init__(attr="my_attr"):
|
||
+ Doc.set_extension(attr, getter=self.get_doc_attr)
|
||
|
||
- def __call__(doc):
|
||
- doc.set_extension("my_attr", getter=self.get_doc_attr)
|
||
```
|
||
|
||
- If your extension is setting properties on the `Doc`, `Token` or `Span`,
|
||
include an option to **let the user to change those attribute names**. This
|
||
makes it easier to avoid namespace collisions and accommodate users with
|
||
different naming preferences. We recommend adding an `attrs` argument to the
|
||
`__init__` method of your class so you can write the names to class attributes
|
||
and reuse them across your component.
|
||
|
||
```diff
|
||
+ Doc.set_extension(self.doc_attr, default="some value")
|
||
- Doc.set_extension("my_doc_attr", default="some value")
|
||
```
|
||
|
||
- Ideally, extensions should be **standalone packages** with spaCy and
|
||
optionally, other packages specified as a dependency. They can freely assign
|
||
to their own `._` namespace, but should stick to that. If your extension's
|
||
only job is to provide a better `.similarity` implementation, and your docs
|
||
state this explicitly, there's no problem with writing to the
|
||
[`user_hooks`](#custom-components-user-hooks) and overwriting spaCy's built-in
|
||
method. However, a third-party extension should **never silently overwrite
|
||
built-ins**, or attributes set by other extensions.
|
||
|
||
- If you're looking to publish a model that depends on a custom pipeline
|
||
component, you can either **require it** in the model package's dependencies,
|
||
or – if the component is specific and lightweight – choose to **ship it with
|
||
your model package** and add it to the `Language` instance returned by the
|
||
model's `load()` method. For examples of this, check out the implementations
|
||
of spaCy's
|
||
[`load_model_from_init_py`](/api/top-level#util.load_model_from_init_py)
|
||
[`load_model_from_path`](/api/top-level#util.load_model_from_path) utility
|
||
functions.
|
||
|
||
```diff
|
||
+ nlp.add_pipe(my_custom_component)
|
||
+ return nlp.from_disk(model_path)
|
||
```
|
||
|
||
- Once you're ready to share your extension with others, make sure to **add docs
|
||
and installation instructions** (you can always link to this page for more
|
||
info). Make it easy for others to install and use your extension, for example
|
||
by uploading it to [PyPi](https://pypi.python.org). If you're sharing your
|
||
code on GitHub, don't forget to tag it with
|
||
[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
|
||
[`spacy-extension`](https://github.com/topics/spacy-extension?o=desc&s=stars)
|
||
to help people find it. If you post it on Twitter, feel free to tag
|
||
[@spacy_io](https://twitter.com/spacy_io) so we can check it out.
|
||
|
||
### Wrapping other models and libraries {#wrapping-models-libraries}
|
||
|
||
Let's say you have a custom entity recognizer that takes a list of strings and
|
||
returns their [BILUO tags](/api/annotation#biluo). Given an input like
|
||
`["A", "text", "about", "Facebook"]`, it will predict and return
|
||
`["O", "O", "O", "U-ORG"]`. To integrate it into your spaCy pipeline and make it
|
||
add those entities to the `doc.ents`, you can wrap it in a custom pipeline
|
||
component function and pass it the token texts from the `Doc` object received by
|
||
the component.
|
||
|
||
The [`gold.spans_from_biluo_tags`](/api/goldparse#spans_from_biluo_tags) is very
|
||
helpful here, because it takes a `Doc` object and token-based BILUO tags and
|
||
returns a sequence of `Span` objects in the `Doc` with added labels. So all your
|
||
wrapper has to do is compute the entity spans and overwrite the `doc.ents`.
|
||
|
||
> #### How the doc.ents work
|
||
>
|
||
> When you add spans to the `doc.ents`, spaCy will automatically resolve them
|
||
> back to the underlying tokens and set the `Token.ent_type` and `Token.ent_iob`
|
||
> attributes. By definition, each token can only be part of one entity, so
|
||
> overlapping entity spans are not allowed.
|
||
|
||
```python
|
||
### {highlight="1,6-7"}
|
||
import your_custom_entity_recognizer
|
||
from spacy.gold import offsets_from_biluo_tags
|
||
|
||
def custom_ner_wrapper(doc):
|
||
words = [token.text for token in doc]
|
||
custom_entities = your_custom_entity_recognizer(words)
|
||
doc.ents = spans_from_biluo_tags(doc, custom_entities)
|
||
return doc
|
||
```
|
||
|
||
The `custom_ner_wrapper` can then be added to the pipeline of a blank model
|
||
using [`nlp.add_pipe`](/api/language#add_pipe). You can also replace the
|
||
existing entity recognizer of a pretrained model with
|
||
[`nlp.replace_pipe`](/api/language#replace_pipe).
|
||
|
||
Here's another example of a custom model, `your_custom_model`, that takes a list
|
||
of tokens and returns lists of fine-grained part-of-speech tags, coarse-grained
|
||
part-of-speech tags, dependency labels and head token indices. Here, we can use
|
||
the [`Doc.from_array`](/api/doc#from_array) to create a new `Doc` object using
|
||
those values. To create a numpy array we need integers, so we can look up the
|
||
string labels in the [`StringStore`](/api/stringstore). The
|
||
[`doc.vocab.strings.add`](/api/stringstore#add) method comes in handy here,
|
||
because it returns the integer ID of the string _and_ makes sure it's added to
|
||
the vocab. This is especially important if the custom model uses a different
|
||
label scheme than spaCy's default models.
|
||
|
||
> #### Example: spacy-stanfordnlp
|
||
>
|
||
> For an example of an end-to-end wrapper for statistical tokenization, tagging
|
||
> and parsing, check out
|
||
> [`spacy-stanfordnlp`](https://github.com/explosion/spacy-stanfordnlp). It uses
|
||
> a very similar approach to the example in this section – the only difference
|
||
> is that it fully replaces the `nlp` object instead of providing a pipeline
|
||
> component, since it also needs to handle tokenization.
|
||
|
||
```python
|
||
### {highlight="1,9,15-17"}
|
||
import your_custom_model
|
||
from spacy.symbols import POS, TAG, DEP, HEAD
|
||
from spacy.tokens import Doc
|
||
import numpy
|
||
|
||
def custom_model_wrapper(doc):
|
||
words = [token.text for token in doc]
|
||
spaces = [token.whitespace for token in doc]
|
||
pos, tags, deps, heads = your_custom_model(words)
|
||
# Convert the strings to integers and add them to the string store
|
||
pos = [doc.vocab.strings.add(label) for label in pos]
|
||
tags = [doc.vocab.strings.add(label) for label in tags]
|
||
deps = [doc.vocab.strings.add(label) for label in deps]
|
||
# Create a new Doc from a numpy array
|
||
attrs = [POS, TAG, DEP, HEAD]
|
||
arr = numpy.array(list(zip(pos, tags, deps, heads)), dtype="uint64")
|
||
new_doc = Doc(doc.vocab, words=words, spaces=spaces).from_array(attrs, arr)
|
||
return new_doc
|
||
```
|
||
|
||
<Infobox title="Sentence boundaries and heads" variant="warning">
|
||
|
||
If you create a `Doc` object with dependencies and heads, spaCy is able to
|
||
resolve the sentence boundaries automatically. However, note that the `HEAD`
|
||
value used to construct a `Doc` is the token index **relative** to the current
|
||
token – e.g. `-1` for the previous token. The CoNLL format typically annotates
|
||
heads as `1`-indexed absolute indices with `0` indicating the root. If that's
|
||
the case in your annotations, you need to convert them first:
|
||
|
||
```python
|
||
heads = [2, 0, 4, 2, 2]
|
||
new_heads = [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]
|
||
```
|
||
|
||
</Infobox>
|
||
|
||
<Infobox title="📖 Advanced usage, serialization and entry points">
|
||
|
||
For more details on how to write and package custom components, make them
|
||
available to spaCy via entry points and implement your own serialization
|
||
methods, check out the usage guide on
|
||
[saving and loading](/usage/saving-loading).
|
||
|
||
</Infobox>
|