---
title: Language Processing Pipelines
next: /usage/embeddings-transformers
menu:
  - ['Processing Text', 'processing']
  - ['Pipelines & Components', 'pipelines']
  - ['Custom Components', 'custom-components']
  - ['Component Data', 'component-data']
  - ['Type Hints & Validation', 'type-hints']
  - ['Trainable Components', 'trainable-components']
  - ['Extension Attributes', 'custom-components-attributes']
  - ['Plugins & Wrappers', 'plugins']
---

import Pipelines101 from 'usage/101/\_pipelines.md'

<Pipelines101 />

## Processing text {#processing}

When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
component** on the `Doc`, in order. It then returns the processed `Doc` that you
can work with.

```python
doc = nlp("This is a text")
```

When processing large volumes of text, the statistical models are usually more
efficient if you let them work on batches of texts. spaCy's
[`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields
processed `Doc` objects. The batching is done internally.

```diff
texts = ["This is a text", "These are lots of texts", "..."]
- docs = [nlp(text) for text in texts]
+ docs = list(nlp.pipe(texts))
```

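Because `nlp.pipe` accepts any iterable, the texts don't have to be loaded into
memory up front. The following is a minimal sketch, assuming a blank English
pipeline (tokenizer only) so that no trained package is required. The
`read_texts` generator is a hypothetical stand-in for your own data source, and
`batch_size=2` is an arbitrary value chosen for illustration.

```python
import spacy


def read_texts():
    """Hypothetical text source that yields one text at a time."""
    for i in range(5):
        yield f"This is text number {i}."


# Blank pipeline: only the tokenizer runs, no trained components are needed
nlp = spacy.blank("en")

# nlp.pipe consumes the generator lazily and groups the texts into batches
token_counts = [len(doc) for doc in nlp.pipe(read_texts(), batch_size=2)]
print(token_counts)
```

The same pattern should work with a trained pipeline: only the `nlp` object
changes, the streaming loop stays the same.
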
<Infobox title="Tips for efficient processing" emoji="💡">

- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
  buffer them in batches, instead of one-by-one. This is usually much more
  efficient.
- Only apply the **pipeline components you need**. Getting predictions from the
  model that you don't actually need adds up and becomes very inefficient at
  scale. To prevent this, use the `disable` keyword argument to disable
  components you don't need – either when loading a pipeline, or during
  processing with `nlp.pipe`. See the section on
  [disabling pipeline components](#disabling) for more details and examples.

</Infobox>

In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
(potentially very large) iterable of texts as a stream. Because we're only
accessing the named entities in `doc.ents` (set by the `ner` component), we'll
disable all other components during processing. `nlp.pipe` yields `Doc` objects,
so we can iterate over them and access the named entity predictions:

> #### ✏️ Things to try
>
> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now
>    empty, because the entity recognizer didn't run.

```python
### {executable="true"}
import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
```

<Infobox title="Important note" variant="warning">

When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a
[generator](https://realpython.com/introduction-to-python-generators/) that
yields `Doc` objects – not a list. So if you want to use it like a list, you'll
have to call `list()` on it first:

```diff
- docs = nlp.pipe(texts)[0]         # will raise an error
+ docs = list(nlp.pipe(texts))[0]   # works as expected
```

</Infobox>

You can use the `as_tuples` option to pass additional context along with each
doc when using [`nlp.pipe`](/api/language#pipe). If `as_tuples` is `True`, then
the input should be a sequence of `(text, context)` tuples and the output will
be a sequence of `(doc, context)` tuples. For example, you can pass metadata in
the context and save it in a [custom attribute](#custom-components-attributes):

```python
### {executable="true"}
import spacy
from spacy.tokens import Doc

if not Doc.has_extension("text_id"):
    Doc.set_extension("text_id", default=None)

text_tuples = [
    ("This is the first text.", {"text_id": "text1"}),
    ("This is the second text.", {"text_id": "text2"})
]

nlp = spacy.load("en_core_web_sm")
doc_tuples = nlp.pipe(text_tuples, as_tuples=True)

docs = []
for doc, context in doc_tuples:
    doc._.text_id = context["text_id"]
    docs.append(doc)

for doc in docs:
    print(f"{doc._.text_id}: {doc.text}")
```

### Multiprocessing {#multiprocessing}

spaCy includes built-in support for multiprocessing with
[`nlp.pipe`](/api/language#pipe) using the `n_process` option:

```python
# Multiprocessing with 4 processes
docs = nlp.pipe(texts, n_process=4)

# With as many processes as CPUs (use with caution!)
docs = nlp.pipe(texts, n_process=-1)
```

Depending on your platform, starting many processes with multiprocessing can add
a lot of overhead. In particular, the default start method `spawn` used in
macOS/OS X (as of Python 3.8) and in Windows can be slow for larger models
because the model data is copied in memory for each new process. See the
[Python docs on multiprocessing](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods)
for further details.

For shorter tasks and in particular with `spawn`, it can be faster to use a
smaller number of processes with a larger batch size. The optimal `batch_size`
setting will depend on the pipeline components, the length of your documents,
the number of processes and how much memory is available.

```python
# Default batch size is `nlp.batch_size` (typically 1000)
docs = nlp.pipe(texts, n_process=2, batch_size=2000)
```

<Infobox title="Multiprocessing on GPU" variant="warning">

Multiprocessing is not generally recommended on GPU because RAM is too limited.
If you want to try it out, be aware that it is only possible using `spawn` due
to limitations in CUDA.

</Infobox>

<Infobox title="Multiprocessing with transformer models" variant="warning">

In Linux, transformer models may hang or deadlock with multiprocessing due to an
[issue in PyTorch](https://github.com/pytorch/pytorch/issues/17199). One
suggested workaround is to use `spawn` instead of `fork` and another is to limit
the number of threads before loading any models using
`torch.set_num_threads(1)`.

</Infobox>

## Pipelines and built-in components {#pipelines}

spaCy makes it very easy to create your own pipelines consisting of reusable
components – this includes spaCy's default tagger, parser and entity recognizer,
but also your own custom processing functions. A pipeline component can be added
to an already existing `nlp` object, specified when initializing a
[`Language`](/api/language) class, or defined within a
[pipeline package](/usage/saving-loading#models).

> #### config.cfg (excerpt)
>
> ```ini
> [nlp]
> lang = "en"
> pipeline = ["tok2vec", "parser"]
>
> [components]
>
> [components.tok2vec]
> factory = "tok2vec"
> # Settings for the tok2vec component
>
> [components.parser]
> factory = "parser"
> # Settings for the parser component
> ```

When you load a pipeline, spaCy first consults the
[`meta.json`](/usage/saving-loading#models) and
[`config.cfg`](/usage/training#config). The config tells spaCy what language
class to use, which components are in the pipeline, and how those components
should be created. spaCy will then do the following:

1. Load the **language class and data** for the given ID via
   [`get_lang_class`](/api/top-level#util.get_lang_class) and initialize it. The
   `Language` class contains the shared vocabulary, tokenization rules and the
   language-specific settings.
2. Iterate over the **pipeline names** and look up each component name in the
   `[components]` block. The `factory` tells spaCy which
   [component factory](#custom-components-factories) to use for adding the
   component with [`add_pipe`](/api/language#add_pipe). The settings are passed
   into the factory.
3. Make the **model data** available to the `Language` class by calling
   [`from_disk`](/api/language#from_disk) with the path to the data directory.

So when you call this...

```python
nlp = spacy.load("en_core_web_sm")
```

... the pipeline's `config.cfg` tells spaCy to use the language `"en"` and the
pipeline
`["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]`. spaCy
will then initialize `spacy.lang.en.English`, and create each pipeline component
and add it to the processing pipeline. It'll then load in the model data from
the data directory and return the modified `Language` class for you to use as
the `nlp` object.

<Infobox title="Changed in v3.0" variant="warning">

spaCy v3.0 introduces a `config.cfg`, which includes more detailed settings for
the pipeline, its components and the [training process](/usage/training#config).
You can export the config of your current `nlp` object by calling
[`nlp.config.to_disk`](/api/language#config).

</Infobox>

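To make the relationship between the `nlp` object and its config more concrete,
here is a minimal sketch that inspects and exports the config of a small
pipeline. It assumes a blank English pipeline with only the rule-based
`sentencizer`, so no trained package is needed. The temporary directory and the
`config.cfg` file name are illustrative choices, not requirements.

```python
import tempfile
from pathlib import Path

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# The [nlp] block holds the language code and the component names, in order
print(nlp.config["nlp"]["lang"])
print(nlp.config["nlp"]["pipeline"])
# Each component block records the factory that was used to create it
print(nlp.config["components"]["sentencizer"]["factory"])

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.cfg"  # illustrative output location
    nlp.config.to_disk(config_path)
    exported = config_path.read_text(encoding="utf8")

# The exported file contains one section per component
print("[components.sentencizer]" in exported)
```
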
Fundamentally, a [spaCy pipeline package](/models) consists of three components:
**the weights**, i.e. binary data loaded in from a directory, a **pipeline** of
functions called in order, and **language data** like the tokenization rules and
language-specific settings. For example, a Spanish NER pipeline requires
different weights, language data and components than an English parsing and
tagging pipeline. This is also why the pipeline state is always held by the
`Language` class. [`spacy.load`](/api/top-level#spacy.load) puts this all
together and returns an instance of `Language` with a pipeline set and access to
the binary data:

```python
### spacy.load under the hood
lang = "en"
pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-3.0.0"

cls = spacy.util.get_lang_class(lang)  # 1. Get Language class, e.g. English
nlp = cls()                            # 2. Initialize it
for name in pipeline:
    nlp.add_pipe(name)                 # 3. Add the component to the pipeline
nlp.from_disk(data_path)               # 4. Load in the binary data
```

When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
component** on the `Doc`, in order. Since the model data is loaded, the
components can access it to assign annotations to the `Doc` object, and
subsequently to the `Token` and `Span` which are only views of the `Doc`, and
don't own any data themselves. All components return the modified document,
which is then processed by the next component in the pipeline.

```python
### The pipeline under the hood
doc = nlp.make_doc("This is a sentence")  # Create a Doc from raw text
for name, proc in nlp.pipeline:           # Iterate over components in order
    doc = proc(doc)                       # Apply each component
```

The current processing pipeline is available as `nlp.pipeline`, which returns a
list of `(name, component)` tuples, or `nlp.pipe_names`, which only returns a
list of human-readable component names.

```python
print(nlp.pipeline)
# [('tok2vec', <spacy.pipeline.Tok2Vec>), ('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>), ('attribute_ruler', <spacy.pipeline.AttributeRuler>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer>)]
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
```

### Built-in pipeline components {#built-in}

spaCy ships with several built-in pipeline components that are registered with
string names. This means that you can initialize them by calling
[`nlp.add_pipe`](/api/language#add_pipe) with their names and spaCy will know
how to create them. See the [API documentation](/api) for a full list of
available pipeline components and component functions.

> #### Usage
>
> ```python
> nlp = spacy.blank("en")
> nlp.add_pipe("sentencizer")
> # add_pipe returns the added component
> ruler = nlp.add_pipe("entity_ruler")
> ```

| String name            | Component                                            | Description                                                                               |
| ---------------------- | ---------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `tagger`               | [`Tagger`](/api/tagger)                              | Assign part-of-speech tags.                                                               |
| `parser`               | [`DependencyParser`](/api/dependencyparser)          | Assign dependency labels.                                                                 |
| `ner`                  | [`EntityRecognizer`](/api/entityrecognizer)          | Assign named entities.                                                                    |
| `entity_linker`        | [`EntityLinker`](/api/entitylinker)                  | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
| `entity_ruler`         | [`EntityRuler`](/api/entityruler)                    | Assign named entities based on pattern rules and dictionaries.                            |
| `textcat`              | [`TextCategorizer`](/api/textcategorizer)            | Assign text categories: exactly one category is predicted per document.                   |
| `textcat_multilabel`   | [`MultiLabel_TextCategorizer`](/api/textcategorizer) | Assign text categories in a multi-label setting: zero, one or more labels per document.   |
| `lemmatizer`           | [`Lemmatizer`](/api/lemmatizer)                      | Assign base forms to words using rules and lookups.                                       |
| `trainable_lemmatizer` | [`EditTreeLemmatizer`](/api/edittreelemmatizer)      | Assign base forms to words.                                                               |
| `morphologizer`        | [`Morphologizer`](/api/morphologizer)                | Assign morphological features and coarse-grained POS tags.                                |
| `attribute_ruler`      | [`AttributeRuler`](/api/attributeruler)              | Assign token attribute mappings and rule-based exceptions.                                |
| `senter`               | [`SentenceRecognizer`](/api/sentencerecognizer)      | Assign sentence boundaries.                                                               |
| `sentencizer`          | [`Sentencizer`](/api/sentencizer)                    | Add rule-based sentence segmentation without the dependency parse.                        |
| `tok2vec`              | [`Tok2Vec`](/api/tok2vec)                            | Assign token-to-vector embeddings.                                                        |
| `transformer`          | [`Transformer`](/api/transformer)                    | Assign the tokens and outputs of a transformer model.                                     |

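As a runnable illustration of adding built-in components by their string names,
the sketch below combines two rule-based components from the table on a blank
pipeline, so it doesn't depend on any trained weights. The example text and the
single `ORG` pattern are made up for demonstration purposes.

```python
import spacy

nlp = spacy.blank("en")
# Rule-based components can be used without trained weights
nlp.add_pipe("sentencizer")
# add_pipe returns the component, so it can be configured right away
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion"}])

doc = nlp("Explosion makes spaCy. It is based in Berlin.")
sentences = [sent.text for sent in doc.sents]
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(nlp.pipe_names)
print(sentences)
print(entities)
```

Trainable components like the `tagger` or `ner` can be added the same way, but
they need to be trained or loaded with weights before they can make useful
predictions.
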
### Disabling, excluding and modifying components {#disabling}

If you don't need a particular component of the pipeline – for example, the
tagger or the parser – you can **disable or exclude** it. This can sometimes
make a big difference and improve loading and inference speed. There are two
different mechanisms you can use:

1. **Disable:** The component and its data will be loaded with the pipeline, but
   it will be disabled by default and not run as part of the processing
   pipeline. To run it, you can explicitly enable it by calling
   [`nlp.enable_pipe`](/api/language#enable_pipe). When you save out the `nlp`
   object, the disabled component will be included but disabled by default.
2. **Exclude:** Don't load the component and its data with the pipeline. Once
   the pipeline is loaded, there will be no reference to the excluded component.

Disabled and excluded component names can be provided to
[`spacy.load`](/api/top-level#spacy.load) as a list.

> #### 💡 Optional pipeline components
>
> The `disable` mechanism makes it easy to distribute pipeline packages with
> optional components that you can enable or disable at runtime. For instance,
> your pipeline may include a statistical _and_ a rule-based component for
> sentence segmentation, and you can choose which one to run depending on your
> use case.
>
> For example, spaCy's [trained pipelines](/models) like
> [`en_core_web_sm`](/models/en#en_core_web_sm) contain both a `parser` and
> `senter` that perform sentence segmentation, but the `senter` is disabled by
> default.

```python
# Load the pipeline without the entity recognizer
nlp = spacy.load("en_core_web_sm", exclude=["ner"])

# Load the tagger and parser but don't enable them
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
# Explicitly enable the tagger later on
nlp.enable_pipe("tagger")
```

<Infobox variant="warning" title="Changed in v3.0">

As of v3.0, the `disable` keyword argument specifies components to load but
disable, instead of components to not load at all. Those components can now be
specified separately using the new `exclude` keyword argument.

</Infobox>

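The difference between the two mechanisms can be checked on a small pipeline
that is saved and loaded again. This is a sketch under a few assumptions: it
uses a blank English pipeline with two rule-based components instead of a
trained package, and it writes to a temporary directory that only exists for
the duration of the example.

```python
import tempfile

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")
# Disable a component before saving: it is still part of the pipeline data
nlp.disable_pipe("entity_ruler")

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)
    # Plain load: the disabled component is loaded but stays disabled
    reloaded = spacy.load(tmp_dir)
    # Exclude: the component is not loaded at all
    without_ruler = spacy.load(tmp_dir, exclude=["entity_ruler"])

print(reloaded.pipe_names, reloaded.disabled)
print(without_ruler.component_names)
```
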
As a shortcut, you can use the [`nlp.select_pipes`](/api/language#select_pipes)
context manager to temporarily disable certain components for a given block. At
the end of the `with` block, the disabled pipeline components will be restored
automatically. Alternatively, `select_pipes` returns an object that lets you
call its `restore()` method to restore the disabled components when needed. This
can be useful if you want to prevent unnecessary code indentation of large
blocks.

```python
### Disable for block
# 1. Use as a context manager
with nlp.select_pipes(disable=["tagger", "parser", "lemmatizer"]):
    doc = nlp("I won't be tagged and parsed")
doc = nlp("I will be tagged and parsed")

# 2. Restore manually
disabled = nlp.select_pipes(disable="ner")
doc = nlp("I won't have named entities")
disabled.restore()
```

If you want to disable all pipes except for one or a few, you can use the
`enable` keyword. Just like the `disable` keyword, it takes a list of pipe
names, or a string defining just one pipe.

```python
# Enable only the parser
with nlp.select_pipes(enable="parser"):
    doc = nlp("I will only be parsed")
```

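To see what `select_pipes` does to the pipeline state, you can record
`nlp.pipe_names` inside and outside of the block. The following sketch assumes
a blank English pipeline with two rule-based components, so it runs without a
trained package. The variable names are only illustrative.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

# Inside the block, the disabled component is hidden from the active pipeline
with nlp.select_pipes(disable=["entity_ruler"]):
    names_inside = list(nlp.pipe_names)
    disabled_inside = list(nlp.disabled)
# After the block, the component is restored automatically
names_after = list(nlp.pipe_names)

# enable= is the inverse: everything except the listed pipes is disabled
with nlp.select_pipes(enable="sentencizer"):
    names_enable_only = list(nlp.pipe_names)

print(names_inside, disabled_inside)
print(names_after)
print(names_enable_only)
```
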
The [`nlp.pipe`](/api/language#pipe) method also supports a `disable` keyword
argument if you only want to disable components during processing:

```python
for doc in nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer"]):
    # Do something with the doc here
    pass
```

Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
to remove pipeline components from an existing pipeline, the
[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
[`replace_pipe`](/api/language#replace_pipe) method to replace them with a
custom component entirely (more details on this in the section on
[custom components](#custom-components)).

```python
nlp.remove_pipe("parser")
nlp.rename_pipe("ner", "entityrecognizer")
nlp.replace_pipe("tagger", "my_custom_tagger")
```

The `Language` object exposes different [attributes](/api/language#attributes)
that let you inspect all available components and the components that currently
run as part of the pipeline.

> #### Example
>
> ```python
> nlp = spacy.blank("en")
> nlp.add_pipe("ner")
> nlp.add_pipe("textcat")
> assert nlp.pipe_names == ["ner", "textcat"]
> nlp.disable_pipe("ner")
> assert nlp.pipe_names == ["textcat"]
> assert nlp.component_names == ["ner", "textcat"]
> assert nlp.disabled == ["ner"]
> ```

| Name                  | Description                                                      |
| --------------------- | ---------------------------------------------------------------- |
| `nlp.pipeline`        | `(name, component)` tuples of the processing pipeline, in order. |
| `nlp.pipe_names`      | Pipeline component names, in order.                              |
| `nlp.components`      | All `(name, component)` tuples, including disabled components.   |
| `nlp.component_names` | All component names, including disabled components.              |
| `nlp.disabled`        | Names of components that are currently disabled.                 |

### Sourcing components from existing pipelines {#sourced-components new="3"}

Pipeline components that are independent can also be reused across pipelines.
Instead of adding a new blank component, you can also copy an existing component
from a trained pipeline by setting the `source` argument on
[`nlp.add_pipe`](/api/language#add_pipe). The first argument will then be
interpreted as the name of the component in the source pipeline – for instance,
`"ner"`. This is especially useful for
[training a pipeline](/usage/training#config-components) because it lets you mix
and match components and create fully custom pipeline packages with updated
trained components and new components trained on your data.

<Infobox variant="warning" title="Important note for trained components">

When reusing components across pipelines, keep in mind that the **vocabulary**,
**vectors** and model settings **must match**. If a trained pipeline includes
[word vectors](/usage/linguistic-features#vectors-similarity) and the component
uses them as features, the pipeline you copy it to needs to have the _same_
vectors available – otherwise, it won't be able to make the same predictions.

</Infobox>

> #### In training config
>
> Instead of providing a `factory`, component blocks in the training
> [config](/usage/training#config) can also define a `source`. The string needs
> to be a loadable spaCy pipeline package or path.
>
> ```ini
> [components.ner]
> source = "en_core_web_sm"
> component = "ner"
> ```
>
> By default, sourced components will be updated with your data during training.
> If you want to preserve the component as-is, you can "freeze" it if the
> pipeline is not using a shared `Tok2Vec` layer:
>
> ```ini
> [training]
> frozen_components = ["ner"]
> ```

```python
### {executable="true"}
import spacy

# The source pipeline with different components
source_nlp = spacy.load("en_core_web_sm")
print(source_nlp.pipe_names)

# Add only the entity recognizer to the new blank pipeline
nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
print(nlp.pipe_names)
```

### Analyzing pipeline components {#analysis new="3"}

The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the
components in the current pipeline and outputs information about them like the
attributes they set on the [`Doc`](/api/doc) and [`Token`](/api/token), whether
they retokenize the `Doc` and which scores they produce during training. It will
also show warnings if components require values that aren't set by a previous
component – for instance, if the entity linker is used but no component that
runs before it sets named entities. Setting `pretty=True` will pretty-print a
table instead of only returning the structured data.

> #### ✏️ Things to try
>
> 1. Add the components `"ner"` and `"sentencizer"` _before_ the
>    `"entity_linker"`. The analysis should now show no problems, because
>    requirements are met.

```python
### {executable="true"}
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# This is a problem because it needs entities and sentence boundaries
nlp.add_pipe("entity_linker")
analysis = nlp.analyze_pipes(pretty=True)
```

<Accordion title="Example output">

```json
### Structured
{
  "summary": {
    "tagger": {
      "assigns": ["token.tag"],
      "requires": [],
      "scores": ["tag_acc", "pos_acc", "lemma_acc"],
      "retokenizes": false
    },
    "entity_linker": {
      "assigns": ["token.ent_kb_id"],
      "requires": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"],
      "scores": [],
      "retokenizes": false
    }
  },
  "problems": {
    "tagger": [],
    "entity_linker": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"]
  },
  "attrs": {
    "token.ent_iob": { "assigns": [], "requires": ["entity_linker"] },
    "doc.ents": { "assigns": [], "requires": ["entity_linker"] },
    "token.ent_kb_id": { "assigns": ["entity_linker"], "requires": [] },
    "doc.sents": { "assigns": [], "requires": ["entity_linker"] },
    "token.tag": { "assigns": ["tagger"], "requires": [] },
    "token.ent_type": { "assigns": [], "requires": ["entity_linker"] }
  }
}
```

```
### Pretty
============================= Pipeline Overview =============================

#   Component       Assigns           Requires         Scores        Retokenizes
-   -------------   ---------------   --------------   -----------   -----------
0   tagger          token.tag                          tag_acc       False

1   entity_linker   token.ent_kb_id   doc.ents         nel_micro_f   False
                                      doc.sents        nel_micro_r
                                      token.ent_iob    nel_micro_p
                                      token.ent_type


================================ Problems (4) ================================
⚠ 'entity_linker' requirements not met: doc.ents, doc.sents,
token.ent_iob, token.ent_type
```

</Accordion>

<Infobox variant="warning" title="Important note">

The pipeline analysis is static and does **not actually run the components**.
This means that it relies on the information provided by the components
themselves. If a custom component declares that it assigns an attribute but it
doesn't, the pipeline analysis won't catch that.

</Infobox>

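The static nature of the analysis can be illustrated with two toy components
that only _declare_ metadata. This is a sketch, not a recommended pattern: the
component names `needs_ents_demo` and `claims_ents_demo` are hypothetical, both
functions return the `Doc` unchanged, and the
[`@Language.component`](/api/language#component) decorator used to register
them is covered in the section on [custom components](#custom-components).

```python
import spacy
from spacy.language import Language


@Language.component("needs_ents_demo", requires=["doc.ents"])
def needs_ents_demo(doc):
    # Hypothetical component that declares it requires doc.ents
    return doc


@Language.component("claims_ents_demo", assigns=["doc.ents"])
def claims_ents_demo(doc):
    # Declares that it assigns doc.ents, but never actually sets it
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("needs_ents_demo")
# Nothing that runs earlier declares doc.ents, so a problem is reported
problems_before = nlp.analyze_pipes()["problems"]

nlp.add_pipe("claims_ents_demo", first=True)
# The declaration alone satisfies the check, although no entities are set
problems_after = nlp.analyze_pipes()["problems"]

print(problems_before)
print(problems_after)
```
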
## Creating custom pipeline components {#custom-components}

A pipeline component is a function that receives a `Doc` object, modifies it and
returns it – for example, by using the current weights to make a prediction and
set some annotation on the document. By adding a component to the pipeline,
you'll get access to the `Doc` at any point **during processing** – instead of
only being able to modify it afterwards.

> #### Example
>
> ```python
> from spacy.language import Language
>
> @Language.component("my_component")
> def my_component(doc):
>     # Do something to the doc here
>     return doc
> ```

| Argument    | Type              | Description                                            |
| ----------- | ----------------- | ------------------------------------------------------ |
| `doc`       | [`Doc`](/api/doc) | The `Doc` object processed by the previous component.  |
| **RETURNS** | [`Doc`](/api/doc) | The `Doc` object processed by this pipeline component. |

The [`@Language.component`](/api/language#component) decorator lets you turn a
simple function into a pipeline component. It takes at least one argument, the
**name** of the component factory. You can use this name to add an instance of
your component to the pipeline. It can also be listed in your pipeline config,
so you can save, load and train pipelines using your component.

Custom components can be added to the pipeline using the
[`add_pipe`](/api/language#add_pipe) method. Optionally, you can specify a
component to add it **before or after**, tell spaCy to add it **first or
last** in the pipeline, or define a **custom name**. If no name is set and no
`name` attribute is present on your component, the function name is used.

> #### Example
>
> ```python
> nlp.add_pipe("my_component")
> nlp.add_pipe("my_component", first=True)
> nlp.add_pipe("my_component", before="parser")
> ```

| Argument | Description                                                                       |
| -------- | --------------------------------------------------------------------------------- |
| `last`   | If set to `True`, component is added **last** in the pipeline (default). ~~bool~~ |
| `first`  | If set to `True`, component is added **first** in the pipeline. ~~bool~~          |
| `before` | String name or index to add the new component **before**. ~~Union[str, int]~~     |
| `after`  | String name or index to add the new component **after**. ~~Union[str, int]~~      |

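The effect of the placement arguments in the table can be checked by adding the
same component several times under different names. This is a minimal sketch:
the `noop_demo` component and the instance names are hypothetical, and the
blank pipeline with a `sentencizer` only serves as a reference point for
`before` and `after`.

```python
import spacy
from spacy.language import Language


@Language.component("noop_demo")
def noop_demo(doc):
    # Hypothetical component that returns the Doc unchanged
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# Each instance needs a unique name within the pipeline
nlp.add_pipe("noop_demo", name="first_noop", first=True)
nlp.add_pipe("noop_demo", name="after_sentencizer", after="sentencizer")
nlp.add_pipe("noop_demo", name="before_sentencizer", before="sentencizer")

print(nlp.pipe_names)
```
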
<Infobox title="Changed in v3.0" variant="warning">

As of v3.0, components need to be registered using the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator so spaCy knows that a
function is a component. [`nlp.add_pipe`](/api/language#add_pipe) now takes the
**string name** of the component factory instead of the component function. This
not only saves you lines of code, it also allows spaCy to validate and track
your custom components, and make sure they can be saved and loaded.

```diff
- ruler = nlp.create_pipe("entity_ruler")
- nlp.add_pipe(ruler)
+ ruler = nlp.add_pipe("entity_ruler")
```

</Infobox>

### Examples: Simple stateless pipeline components {#custom-components-simple}
|
||
|
||
The following component receives the `Doc` in the pipeline and prints some
|
||
information about it: the number of tokens, the part-of-speech tags of the
|
||
tokens and a conditional message based on the document length. The
|
||
[`@Language.component`](/api/language#component) decorator lets you register the
|
||
component under the name `"info_component"`.
|
||
|
||
> #### ✏️ Things to try
|
||
>
|
||
> 1. Add the component first in the pipeline by setting `first=True`. You'll see
|
||
> that the part-of-speech tags are empty, because the component now runs
|
||
> before the tagger and the tags aren't available yet.
|
||
> 2. Change the component `name` or remove the `name` argument. You should see
|
||
> this change reflected in `nlp.pipe_names`.
|
||
> 3. Print `nlp.pipeline`. You'll see a list of tuples describing the component
|
||
> name and the function that's called on the `Doc` object in the pipeline.
|
||
> 4. Change the first argument to `@Language.component`, the name, to something
|
||
> else. spaCy should now complain that it doesn't know a component of the
|
||
> name `"info_component"`.
|
||
|
||
```python
### {executable="true"}
import spacy
from spacy.language import Language

@Language.component("info_component")
def my_component(doc):
    print(f"After tokenization, this doc has {len(doc)} tokens.")
    print("The part-of-speech tags are:", [token.pos_ for token in doc])
    if len(doc) < 10:
        print("This is a pretty short document.")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("info_component", name="print_info", last=True)
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'print_info']
doc = nlp("This is a sentence.")
```
|
||
|
||
Here's another example of a pipeline component that implements custom logic to
|
||
improve the sentence boundaries set by the dependency parser. The custom logic
|
||
should therefore be applied **after** tokenization, but _before_ the dependency
|
||
parsing – this way, the parser can also take advantage of the sentence
|
||
boundaries.
|
||
|
||
> #### ✏️ Things to try
|
||
>
|
||
> 1. Print `[token.dep_ for token in doc]` with and without the custom pipeline
|
||
> component. You'll see that the predicted dependency parse changes to match
|
||
> the sentence boundaries.
|
||
> 2. Remove the `else` block. All other tokens will now have `is_sent_start` set
|
||
> to `None` (missing value), the parser will assign sentence boundaries in
|
||
> between.
|
||
|
||
```python
### {executable="true"}
import spacy
from spacy.language import Language

@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Define sentence start if pipe + titlecase token
        if token.text == "|" and doc[i + 1].is_title:
            doc[i + 1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise, to tell
            # the parser to leave those tokens alone
            doc[i + 1].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentencizer", before="parser")  # Insert before the parser
doc = nlp("This is. A sentence. | This is. Another sentence.")
for sent in doc.sents:
    print(sent.text)
```
|
||
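
To see what a boundary-setting component does on its own, it can also be run in a blank pipeline without a parser. This is a simplified sketch under that assumption: the component name `"demo_pipe_splitter"` is hypothetical, and the rule (a token starts a sentence only if the previous token is `|`) is deliberately cruder than the example above.

```python
import spacy
from spacy.language import Language

# Hypothetical component name, used only for this illustration
@Language.component("demo_pipe_splitter")
def demo_pipe_splitter(doc):
    for i, token in enumerate(doc[:-1]):
        # The next token starts a sentence only if this token is "|"
        doc[i + 1].is_sent_start = token.text == "|"
    return doc

nlp = spacy.blank("en")  # tokenizer only, no parser
nlp.add_pipe("demo_pipe_splitter")
doc = nlp("First part | Second part | Third")
sentences = [sent.text for sent in doc.sents]
print(sentences)  # ['First part |', 'Second part |', 'Third']
```

Because every token after the first receives an explicit `True` or `False`, `doc.sents` is available even though no parser or sentencizer has run.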
|
||
### Component factories and stateful components {#custom-components-factories}
|
||
|
||
Component factories are callables that take settings and return a **pipeline
|
||
component function**. This is useful if your component is stateful and if you
|
||
need to customize its creation, or if you need access to the current `nlp`
|
||
object or the shared vocab. Component factories can be registered using the
|
||
[`@Language.factory`](/api/language#factory) decorator and they need at least
|
||
**two named arguments** that are filled in automatically when the component is
|
||
added to the pipeline:
|
||
|
||
> #### Example
>
> ```python
> from spacy.language import Language
>
> @Language.factory("my_component")
> def my_component(nlp, name):
>     return MyComponent()
> ```
|
||
|
||
| Argument | Description |
| -------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `nlp` | The current `nlp` object. Can be used to access the shared vocab. ~~Language~~ |
| `name` | The **instance name** of the component in the pipeline. This lets you identify different instances of the same component. ~~str~~ |
|
||
|
||
All other settings can be passed in by the user via the `config` argument on
|
||
[`nlp.add_pipe`](/api/language#add_pipe). The
|
||
[`@Language.factory`](/api/language#factory) decorator also lets you define a
|
||
`default_config` that's used as a fallback.
|
||
|
||
```python
### With config {highlight="4,9"}
import spacy
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": True})
def my_component(nlp, name, some_setting: bool):
    return MyComponent(some_setting=some_setting)

nlp = spacy.blank("en")
nlp.add_pipe("my_component", config={"some_setting": False})
```
|
||
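
Since `MyComponent` is only a placeholder, here is a small runnable sketch of the same idea. The class `CaseRecorder`, the factory name `"demo_case_recorder"` and the `uppercase` setting are invented for this illustration. It shows the `default_config` acting as a fallback for one instance and being overridden via `config` for another.

```python
import spacy
from spacy.language import Language

class CaseRecorder:
    """Hypothetical component: stores the doc text, optionally upper-cased."""

    def __init__(self, name: str, uppercase: bool):
        self.name = name
        self.uppercase = uppercase

    def __call__(self, doc):
        # Store the result under the instance name so instances don't clash
        doc.user_data[self.name] = doc.text.upper() if self.uppercase else doc.text
        return doc

@Language.factory("demo_case_recorder", default_config={"uppercase": True})
def create_case_recorder(nlp, name, uppercase: bool):
    return CaseRecorder(name, uppercase)

nlp = spacy.blank("en")
nlp.add_pipe("demo_case_recorder", name="with_default")
nlp.add_pipe("demo_case_recorder", name="with_override", config={"uppercase": False})
doc = nlp("Hello world")
print(doc.user_data["with_default"])   # HELLO WORLD
print(doc.user_data["with_override"])  # Hello world
```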
|
||
<Accordion title="How is @Language.factory different from @Language.component?" id="factories-decorator-component">
|
||
|
||
The [`@Language.component`](/api/language#component) decorator is essentially a
|
||
**shortcut** for stateless pipeline components that don't need any settings.
|
||
This means you don't have to always write a function that returns your function
|
||
if there's no state to be passed through – spaCy can just take care of this for
|
||
you. The following two code examples are equivalent:
|
||
|
||
```python
# Stateless component with @Language.factory
@Language.factory("my_component")
def create_my_component(nlp, name):
    def my_component(doc):
        # Do something to the doc
        return doc

    return my_component

# Stateless component with @Language.component
@Language.component("my_component")
def my_component(doc):
    # Do something to the doc
    return doc
```
|
||
|
||
</Accordion>
|
||
|
||
<Accordion title="Can I add the @Language.factory decorator to a class?" id="factories-class-decorator" spaced>
|
||
|
||
Yes, the [`@Language.factory`](/api/language#factory) decorator can be added to
|
||
a function or a class. If it's added to a class, it expects the `__init__`
|
||
method to take the arguments `nlp` and `name`, and will populate all other
|
||
arguments from the config. That said, it's often cleaner and more intuitive to
|
||
make your factory a separate function. That's also how spaCy does it internally.
|
||
|
||
</Accordion>
|
||
|
||
### Language-specific factories {#factories-language new="3"}
|
||
|
||
There are many use cases where you might want your pipeline components to be
|
||
language-specific. Sometimes this requires an entirely different implementation per
|
||
language, sometimes the only difference is in the settings or data. spaCy allows
|
||
you to register factories of the **same name** on both the `Language` base
|
||
class, as well as its **subclasses** like `English` or `German`. Factories are
|
||
resolved starting with the specific subclass. If the subclass doesn't define a
|
||
component of that name, spaCy will check the `Language` base class.
|
||
|
||
Here's an example of a pipeline component that overwrites the normalized form of
|
||
a token, `Token.norm_`, with an entry from a language-specific lookup table.
|
||
It's registered twice under the name `"token_normalizer"` – once using
|
||
`@English.factory` and once using `@German.factory`:
|
||
|
||
```python
### {executable="true"}
from spacy.lang.en import English
from spacy.lang.de import German

class TokenNormalizer:
    def __init__(self, norm_table):
        self.norm_table = norm_table

    def __call__(self, doc):
        for token in doc:
            # Overwrite the token.norm_ if there's an entry in the data
            token.norm_ = self.norm_table.get(token.text, token.norm_)
        return doc

@English.factory("token_normalizer")
def create_en_normalizer(nlp, name):
    return TokenNormalizer({"realise": "realize", "colour": "color"})

@German.factory("token_normalizer")
def create_de_normalizer(nlp, name):
    return TokenNormalizer({"daß": "dass", "wußte": "wusste"})

nlp_en = English()
nlp_en.add_pipe("token_normalizer")  # uses the English factory
print([token.norm_ for token in nlp_en("realise colour daß wußte")])

nlp_de = German()
nlp_de.add_pipe("token_normalizer")  # uses the German factory
print([token.norm_ for token in nlp_de("realise colour daß wußte")])
```
|
||
|
||
<Infobox title="Implementation details">
|
||
|
||
Under the hood, language-specific factories are added to the
|
||
[`factories` registry](/api/top-level#registry) prefixed with the language code,
|
||
e.g. `"en.token_normalizer"`. When resolving the factory in
|
||
[`nlp.add_pipe`](/api/language#add_pipe), spaCy first checks for a
|
||
language-specific version of the factory using `nlp.lang` and if none is
|
||
available, falls back to looking up the regular factory name.
|
||
|
||
</Infobox>
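
The fallback from a language-specific factory to the generic one can be observed directly. The sketch below assumes a made-up factory name, `"demo_marker"`, registered once on `Language` and once on `English`; the helper `make_marker` is also hypothetical.

```python
from spacy.language import Language
from spacy.lang.en import English
from spacy.lang.de import German

def make_marker(label):
    # Hypothetical helper: returns a component that records which factory built it
    def marker(doc):
        doc.user_data["factory_used"] = label
        return doc
    return marker

@Language.factory("demo_marker")
def create_generic_marker(nlp, name):
    return make_marker("generic")

@English.factory("demo_marker")
def create_en_marker(nlp, name):
    return make_marker("en")

nlp_en = English()
nlp_en.add_pipe("demo_marker")  # resolved to the English-specific factory
nlp_de = German()
nlp_de.add_pipe("demo_marker")  # no German factory, falls back to the generic one

print(English.get_factory_name("demo_marker"))     # en.demo_marker
print(nlp_en("Hello").user_data["factory_used"])   # en
print(nlp_de("Hallo").user_data["factory_used"])   # generic
```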
|
||
|
||
### Example: Stateful component with settings {#example-stateful-components}
|
||
|
||
This example shows a **stateful** pipeline component for handling acronyms:
|
||
based on a dictionary, it will detect acronyms and their expanded forms in both
|
||
directions and add them to a list as the custom `doc._.acronyms`
|
||
[extension attribute](#custom-components-attributes). Under the hood, it uses
|
||
the [`PhraseMatcher`](/api/phrasematcher) to find instances of the phrases.
|
||
|
||
The factory function takes three arguments: the shared `nlp` object and
|
||
component instance `name`, which are passed in automatically by spaCy, and a
|
||
`case_sensitive` config setting that makes the matching and acronym detection
|
||
case-sensitive.
|
||
|
||
> #### ✏️ Things to try
|
||
>
|
||
> 1. Change the `config` passed to `nlp.add_pipe` and set `"case_sensitive"` to
|
||
> `True`. You should see that the expanded acronym for "LOL" isn't detected
|
||
> anymore.
|
||
> 2. Add some more terms to the `DICTIONARY` and update the processed text so
|
||
> they're detected.
|
||
> 3. Add a `name` argument to `nlp.add_pipe` to change the component name. Print
|
||
> `nlp.pipe_names` to see the change reflected in the pipeline.
|
||
> 4. Print the config of the current `nlp` object with
|
||
> `print(nlp.config.to_str())` and inspect the `[components]` block. You
|
||
> should see an entry for the acronyms component, referencing the factory
|
||
> `acronyms` and the config settings.
|
||
|
||
```python
### {executable="true"}
from spacy.language import Language
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
import spacy

DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
DICTIONARY.update({value: key for key, value in DICTIONARY.items()})

@Language.factory("acronyms", default_config={"case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, case_sensitive: bool):
    return AcronymComponent(nlp, case_sensitive)

class AcronymComponent:
    def __init__(self, nlp: Language, case_sensitive: bool):
        # Create the matcher and match on Token.lower if case-insensitive
        matcher_attr = "TEXT" if case_sensitive else "LOWER"
        self.matcher = PhraseMatcher(nlp.vocab, attr=matcher_attr)
        self.matcher.add("ACRONYMS", [nlp.make_doc(term) for term in DICTIONARY])
        self.case_sensitive = case_sensitive
        # Register custom extension on the Doc
        if not Doc.has_extension("acronyms"):
            Doc.set_extension("acronyms", default=[])

    def __call__(self, doc: Doc) -> Doc:
        # Add the matched spans when doc is processed
        for _, start, end in self.matcher(doc):
            span = doc[start:end]
            acronym = DICTIONARY.get(span.text if self.case_sensitive else span.text.lower())
            doc._.acronyms.append((span, acronym))
        return doc

# Add the component to the pipeline and configure it
nlp = spacy.blank("en")
nlp.add_pipe("acronyms", config={"case_sensitive": False})

# Process a doc and see the results
doc = nlp("LOL, be right back")
print(doc._.acronyms)
```
|
||
|
||
## Initializing and serializing component data {#component-data}
|
||
|
||
Many stateful components depend on **data resources** like dictionaries and
|
||
lookup tables that should ideally be **configurable**. For example, it makes
|
||
sense to make the `DICTIONARY` in the above example an argument of the
|
||
registered function, so the `AcronymComponent` can be re-used with different
|
||
data. One logical solution would be to make it an argument of the component
|
||
factory, and allow it to be initialized with different dictionaries.
|
||
|
||
> #### config.cfg
>
> ```ini
> [components.acronyms.data]
> # 🚨 Problem: you don't want the data in the config
> lol = "laughing out loud"
> brb = "be right back"
> ```
|
||
|
||
```python
@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
    # 🚨 Problem: data ends up in the config file
    return AcronymComponent(nlp, data, case_sensitive)
```
|
||
|
||
However, passing in the dictionary directly is problematic, because it means
|
||
that if a component saves out its config and settings, the
|
||
[`config.cfg`](/usage/training#config) will include a dump of the entire data,
|
||
since that's the config the component was created with. It will also fail if the
|
||
data is not JSON-serializable.
|
||
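
The problem is easy to reproduce. In the sketch below, the factory name `"demo_inline_data"` is hypothetical and the component does nothing. The only point is that whatever is passed in as `data` shows up again in `nlp.config`, which is what gets written to `config.cfg` when the pipeline is saved.

```python
from typing import Dict

import spacy
from spacy.language import Language

# Hypothetical factory that takes its data directly as a config setting
@Language.factory("demo_inline_data", default_config={"data": {}})
def create_inline_data_component(nlp, name, data: Dict[str, str]):
    def component(doc):
        return doc
    return component

nlp = spacy.blank("en")
nlp.add_pipe("demo_inline_data", config={"data": {"lol": "laughing out loud"}})

# The full dictionary is now part of the component's config
component_config = nlp.config["components"]["demo_inline_data"]
print(component_config["data"])
```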
|
||
### Option 1: Using a registered function {#component-data-function}
|
||
|
||
<Infobox>
|
||
|
||
- ✅ **Pros:** can load anything in Python, easy to add to and configure via
|
||
config
|
||
- ❌ **Cons:** requires the function and its dependencies to be available at
|
||
runtime
|
||
|
||
</Infobox>
|
||
|
||
If what you're passing in isn't JSON-serializable – e.g. a custom object like a
|
||
[model](#trainable-components) – saving out the component config becomes
|
||
impossible because there's no way for spaCy to know _how_ that object was
|
||
created, and what to do to create it again. This makes it much harder to save,
|
||
load and train custom pipelines with custom components. A simple solution is to
|
||
**register a function** that returns your resources. The
|
||
[registry](/api/top-level#registry) lets you **map string names to functions**
|
||
that create objects, so given a name and optional arguments, spaCy will know how
|
||
to recreate the object. To register a function that returns your custom
|
||
dictionary, you can use the `@spacy.registry.misc` decorator with a single
|
||
argument, the name:
|
||
|
||
> #### What's the misc registry?
|
||
>
|
||
> The [`registry`](/api/top-level#registry) provides different categories for
|
||
> different types of functions – for example, model architectures, tokenizers or
|
||
> batchers. `misc` is intended for miscellaneous functions that don't fit
|
||
> anywhere else.
|
||
|
||
```python
### Registered function for assets {highlight="1"}
@spacy.registry.misc("acronyms.slang_dict.v1")
def create_acronyms_slang_dict():
    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
    dictionary.update({value: key for key, value in dictionary.items()})
    return dictionary
```
|
||
|
||
In your `default_config` (and later in your
|
||
[training config](/usage/training#config)), you can now refer to the function
|
||
registered under the name `"acronyms.slang_dict.v1"` using the `@misc` key. This
|
||
tells spaCy how to create the value, and when your component is created, the
|
||
result of the registered function is passed in as the key `"dictionary"`.
|
||
|
||
> #### config.cfg
>
> ```ini
> [components.acronyms]
> factory = "acronyms"
>
> [components.acronyms.data]
> @misc = "acronyms.slang_dict.v1"
> ```
|
||
|
||
```diff
- default_config = {"dictionary": DICTIONARY}
+ default_config = {"dictionary": {"@misc": "acronyms.slang_dict.v1"}}
```
|
||
|
||
Using a registered function also means that you can easily include your custom
|
||
components in pipelines that you [train](/usage/training). To make sure spaCy
|
||
knows where to find your custom `@misc` function, you can pass in a Python file
|
||
via the argument `--code`. If someone else is using your component, all they
|
||
have to do to customize the data is to register their own function and swap out
|
||
the name. Registered functions can also take **arguments**, by the way, that can
|
||
be defined in the config as well – you can read more about this in the docs on
|
||
[training with custom code](/usage/training#custom-code).
|
||
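
Put together, the registered-function approach could look like the sketch below. All names (`"demo_acronyms.slang_dict.v1"`, `"demo_acronym_lookup"`, the `data` argument and the `"expansions"` key) are assumptions made for this illustration. Note that the config only stores the reference to the function, not the dictionary itself.

```python
from typing import Dict

import spacy
from spacy.language import Language

# Hypothetical registered function that returns the data resource
@spacy.registry.misc("demo_acronyms.slang_dict.v1")
def create_demo_slang_dict() -> Dict[str, str]:
    return {"lol": "laughing out loud", "brb": "be right back"}

# The default config refers to the function by its registered name
@Language.factory(
    "demo_acronym_lookup",
    default_config={"data": {"@misc": "demo_acronyms.slang_dict.v1"}},
)
def create_acronym_lookup(nlp, name, data: Dict[str, str]):
    def lookup(doc):
        # Collect the expansion of every token found in the dictionary
        doc.user_data["expansions"] = [data[t.lower_] for t in doc if t.lower_ in data]
        return doc
    return lookup

nlp = spacy.blank("en")
nlp.add_pipe("demo_acronym_lookup")
doc = nlp("LOL see you, brb")
print(doc.user_data["expansions"])  # ['laughing out loud', 'be right back']

# Only the function reference ends up in the config
data_config = nlp.config["components"]["demo_acronym_lookup"]["data"]
print(data_config)
```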
|
||
### Option 2: Save data with the pipeline and load it in once on initialization {#component-data-initialization}
|
||
|
||
<Infobox>
|
||
|
||
- ✅ **Pros:** lets components save and load their own data and reflect user
|
||
changes, load in data assets before training without depending on them at
|
||
runtime
|
||
- ❌ **Cons:** requires more component methods, more complex config and data
|
||
flow
|
||
|
||
</Infobox>
|
||
|
||
Just like models save out their binary weights when you call
|
||
[`nlp.to_disk`](/api/language#to_disk), components can also **serialize** any
|
||
other data assets – for instance, an acronym dictionary. If a pipeline component
|
||
implements its own `to_disk` and `from_disk` methods, those will be called
|
||
automatically by `nlp.to_disk` and will receive the path to the directory to
|
||
save to or load from. The component can then perform any custom saving or
|
||
loading. If a user makes changes to the component data, they will be reflected
|
||
when the `nlp` object is saved. For more examples of this, see the usage guide
|
||
on [serialization methods](/usage/saving-loading/#serialization-methods).
|
||
|
||
> #### About the data path
|
||
>
|
||
> The `path` argument spaCy passes to the serialization methods consists of the
|
||
> path provided by the user, plus a directory of the component name. This means
|
||
> that when you call `nlp.to_disk("/path")`, the `acronyms` component will
|
||
> receive the directory path `/path/acronyms` and can then create files in this
|
||
> directory.
|
||
|
||
```python
### Custom serialization methods {highlight="7-11,13-15"}
import srsly
from spacy.util import ensure_path

class AcronymComponent:
    # other methods here...

    def to_disk(self, path, exclude=tuple()):
        path = ensure_path(path)
        if not path.exists():
            path.mkdir()
        srsly.write_json(path / "data.json", self.data)

    def from_disk(self, path, exclude=tuple()):
        self.data = srsly.read_json(path / "data.json")
        return self
```
|
||
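
A full round trip can be sketched with a temporary directory. The component class `DemoStore` and the factory name `"demo_store"` are hypothetical. The factory has to be registered in the process that loads the pipeline, which is the case here because everything runs in one script.

```python
import tempfile

import spacy
import srsly
from spacy.language import Language
from spacy.util import ensure_path

class DemoStore:
    """Hypothetical component that only carries a dictionary around."""

    def __init__(self):
        self.data = {}

    def __call__(self, doc):
        return doc

    def to_disk(self, path, exclude=tuple()):
        path = ensure_path(path)
        path.mkdir(parents=True, exist_ok=True)
        srsly.write_json(path / "data.json", self.data)

    def from_disk(self, path, exclude=tuple()):
        self.data = srsly.read_json(ensure_path(path) / "data.json")
        return self

@Language.factory("demo_store")
def create_demo_store(nlp, name):
    return DemoStore()

nlp = spacy.blank("en")
store = nlp.add_pipe("demo_store")
store.data = {"lol": "laughing out loud"}  # change made by the user

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)            # calls DemoStore.to_disk with tmp_dir/demo_store
    reloaded = spacy.load(tmp_dir)  # recreates the component, then calls from_disk
    restored = reloaded.get_pipe("demo_store").data

print(restored)  # {'lol': 'laughing out loud'}
```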
|
||
Now the component can save to and load from a directory. The only remaining
|
||
question: How do you **load in the initial data**? In Python, you could just
|
||
call the pipe's `from_disk` method yourself. But if you're adding the component
|
||
to your [training config](/usage/training#config), spaCy will need to know how
|
||
to set it up, from start to finish, including the data to initialize it with.
|
||
|
||
While you could use a registered function or a file loader like
|
||
[`srsly.read_json.v1`](/api/top-level#file_readers) as an argument of the
|
||
component factory, this approach is problematic: the component factory runs
|
||
**every time the component is created**. This means it will run when creating
|
||
the `nlp` object before training, but also every time a user loads your
|
||
pipeline. So your runtime pipeline would either depend on a local path on your
|
||
file system, or the data is loaded twice: once when the component is created, and then
|
||
again when the data is loaded by `from_disk`.
|
||
|
||
> ```ini
> ### config.cfg
> [components.acronyms.data]
> # 🚨 Problem: Runtime pipeline depends on local path
> @readers = "srsly.read_json.v1"
> path = "/path/to/slang_dict.json"
> ```
>
> ```ini
> ### config.cfg
> [components.acronyms.data]
> # 🚨 Problem: this always runs
> @misc = "acronyms.slang_dict.v1"
> ```
|
||
|
||
```python
@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
    # 🚨 Problem: data will be loaded every time component is created
    return AcronymComponent(nlp, data, case_sensitive)
```
|
||
|
||
To solve this, your component can implement a separate method, `initialize`,
|
||
which will be called by [`nlp.initialize`](/api/language#initialize) if
|
||
available. This typically happens before training, but not at runtime when the
|
||
pipeline is loaded. For more background on this, see the usage guides on the
|
||
[config lifecycle](/usage/training#config-lifecycle) and
|
||
[custom initialization](/usage/training#initialization).
|
||
|
||

|
||
|
||
A component's `initialize` method needs to take at least **two named
|
||
arguments**: a `get_examples` callback that gives it access to the training
|
||
examples, and the current `nlp` object. This is mostly used by trainable
|
||
components so they can initialize their models and label schemes from the data,
|
||
so we can ignore those arguments here. All **other arguments** on the method can
|
||
be defined via the config – in this case a dictionary `data`.
|
||
|
||
> #### config.cfg
>
> ```ini
> [initialize.components.my_component]
>
> [initialize.components.my_component.data]
> # ✅ This only runs on initialization
> @readers = "srsly.read_json.v1"
> path = "/path/to/slang_dict.json"
> ```
|
||
|
||
```python
### Custom initialize method {highlight="5-6"}
class AcronymComponent:
    def __init__(self):
        self.data = {}

    def initialize(self, get_examples=None, nlp=None, data={}):
        self.data = data
```
|
||
|
||
When [`nlp.initialize`](/api/language#initialize) runs before training (or when
|
||
you call it in your own code), the
|
||
[`[initialize]`](/api/data-formats#config-initialize) block of the config is
|
||
loaded and used to construct the `nlp` object. The custom acronym component will
|
||
then be passed the data loaded from the JSON file. After training, the `nlp`
|
||
object is saved to disk, which will run the component's `to_disk` method. When
|
||
the pipeline is loaded back into spaCy later to use it, the `from_disk` method
|
||
will load the data back in.
|
||
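
The same flow can be sketched in plain Python by filling in the `[initialize]` settings on `nlp.config` and calling `nlp.initialize` yourself. The class `DemoInitStore` and the factory name `"demo_init_store"` are hypothetical, and the data is given inline here rather than read from a JSON file.

```python
from typing import Dict, Optional

import spacy
from spacy.language import Language

class DemoInitStore:
    """Hypothetical component whose data is only set on initialization."""

    def __init__(self):
        self.data = {}

    def __call__(self, doc):
        return doc

    def initialize(self, get_examples=None, nlp=None, data: Optional[Dict[str, str]] = None):
        # Called by nlp.initialize with the settings from [initialize.components]
        self.data = dict(data or {})

@Language.factory("demo_init_store")
def create_demo_init_store(nlp, name):
    return DemoInitStore()

nlp = spacy.blank("en")
store = nlp.add_pipe("demo_init_store")
data_before = dict(store.data)  # still empty: the factory doesn't load anything

# Equivalent of an [initialize.components.demo_init_store] block in config.cfg
nlp.config["initialize"]["components"]["demo_init_store"] = {
    "data": {"lol": "laughing out loud"}
}
nlp.initialize()
print(data_before)  # {}
print(store.data)   # {'lol': 'laughing out loud'}
```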
|
||
## Python type hints and validation {#type-hints new="3"}
|
||
|
||
spaCy's configs are powered by our machine learning library Thinc's
|
||
[configuration system](https://thinc.ai/docs/usage-config), which supports
|
||
[type hints](https://docs.python.org/3/library/typing.html) and even
|
||
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
|
||
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your component
|
||
factory provides type hints, the values that are passed in will be **checked
|
||
against the expected types**. For example, if an argument is annotated as `int` and the value can't be cast to an integer, spaCy
|
||
will raise an error. `pydantic` also provides strict types like `StrictInt`,
|
||
which will force the value to be an integer and raise an error if it's not – for
|
||
instance, if your config defines a float.
|
||
|
||
<Infobox variant="warning">
|
||
|
||
If you're not using
|
||
[strict types](https://pydantic-docs.helpmanual.io/usage/types/#strict-types),
|
||
values that can be **cast to** the given type will still be accepted. For
|
||
example, `1` can be cast to a `float` or a `bool` type, but not to a
|
||
`List[str]`. However, if the type is
|
||
[`StrictFloat`](https://pydantic-docs.helpmanual.io/usage/types/#strict-types),
|
||
only a float will be accepted.
|
||
|
||
</Infobox>
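
As a rough illustration of the validation, the sketch below registers a hypothetical factory, `"demo_windowed"`, with a `window: int` argument and then passes in a value that can't be cast to an integer. It assumes the resulting error is the `ConfigValidationError` exposed by `thinc.api`.

```python
import spacy
from spacy.language import Language
from thinc.api import ConfigValidationError

# Hypothetical factory with a type-hinted setting
@Language.factory("demo_windowed", default_config={"window": 2})
def create_windowed(nlp, name, window: int):
    def component(doc):
        return doc
    return component

nlp = spacy.blank("en")
nlp.add_pipe("demo_windowed", name="ok", config={"window": 3})

try:
    # "wide" can't be cast to an integer, so validation should fail
    nlp.add_pipe("demo_windowed", name="broken", config={"window": "wide"})
    error_raised = False
except ConfigValidationError:
    error_raised = True

print(error_raised)    # True
print(nlp.pipe_names)  # ['ok']
```

The invalid component is never created, so the pipeline still only contains the first instance.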
|
||
|
||
The following example shows a custom pipeline component for debugging. It can be
|
||
added anywhere in the pipeline and logs information about the `nlp` object and
|
||
the `Doc` that passes through. The `log_level` config setting lets the user
|
||
customize what log statements are shown – for instance, `"INFO"` will show info
|
||
logs and more critical logging statements, whereas `"DEBUG"` will show
|
||
everything. The value is annotated as a `StrictStr`, so it will only accept a
|
||
string value.
|
||
|
||
> #### ✏️ Things to try
|
||
>
|
||
> 1. Change the `config` passed to `nlp.add_pipe` to use the log level `"INFO"`.
|
||
> You should see that only the statement logged with `logger.info` is shown.
|
||
> 2. Change the `config` passed to `nlp.add_pipe` so that it contains unexpected
|
||
> values – for example, a boolean instead of a string: `"log_level": False`.
|
||
> You should see a validation error.
|
||
> 3. Check out the docs on `pydantic`'s
|
||
> [constrained types](https://pydantic-docs.helpmanual.io/usage/types/#constrained-types)
|
||
> and write a type hint for `log_level` that only accepts the exact string
|
||
> values `"DEBUG"`, `"INFO"` or `"CRITICAL"`.
|
||
|
||
```python
### {executable="true"}
import spacy
from spacy.language import Language
from spacy.tokens import Doc
from pydantic import StrictStr
import logging

@Language.factory("debug", default_config={"log_level": "DEBUG"})
class DebugComponent:
    def __init__(self, nlp: Language, name: str, log_level: StrictStr):
        self.logger = logging.getLogger(f"spacy.{name}")
        self.logger.setLevel(log_level)
        self.logger.info(f"Pipeline: {nlp.pipe_names}")

    def __call__(self, doc: Doc) -> Doc:
        is_tagged = doc.has_annotation("TAG")
        self.logger.debug(f"Doc: {len(doc)} tokens, is tagged: {is_tagged}")
        return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("debug", config={"log_level": "DEBUG"})
doc = nlp("This is a text...")
```
|
||
|
||
## Trainable components {#trainable-components new="3"}
|
||
|
||
spaCy's [`TrainablePipe`](/api/pipe) class helps you implement your own
|
||
trainable components that have their own model instance, make predictions over
|
||
`Doc` objects and can be updated using [`spacy train`](/api/cli#train). This
|
||
lets you plug fully custom machine learning components into your pipeline.
|
||
|
||

|
||
|
||
You'll need the following:
|
||
|
||
1. **Model:** A Thinc [`Model`](https://thinc.ai/docs/api-model) instance. This
|
||
can be a model implemented in [Thinc](/usage/layers-architectures#thinc), or
|
||
a [wrapped model](/usage/layers-architectures#frameworks) implemented in
|
||
PyTorch, TensorFlow, MXNet or a fully custom solution. The model must take a
|
||
list of [`Doc`](/api/doc) objects as input and can have any type of output.
|
||
2. **TrainablePipe subclass:** A subclass of [`TrainablePipe`](/api/pipe) that
|
||
implements at least two methods: [`TrainablePipe.predict`](/api/pipe#predict)
|
||
and [`TrainablePipe.set_annotations`](/api/pipe#set_annotations).
|
||
3. **Component factory:** A component factory registered with
|
||
[`@Language.factory`](/api/language#factory) that takes the `nlp` object and
|
||
component `name` and optional settings provided by the config and returns an
|
||
instance of your trainable component.
|
||
|
||
> #### Example
>
> ```python
> from spacy.pipeline import TrainablePipe
> from spacy.language import Language
>
> class TrainableComponent(TrainablePipe):
>     def predict(self, docs):
>         ...
>
>     def set_annotations(self, docs, scores):
>         ...
>
> @Language.factory("my_trainable_component")
> def make_component(nlp, name, model):
>     return TrainableComponent(nlp.vocab, model, name=name)
> ```
|
||
|
||
| Name | Description |
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| [`predict`](/api/pipe#predict) | Apply the component's model to a batch of [`Doc`](/api/doc) objects (without modifying them) and return the scores. |
| [`set_annotations`](/api/pipe#set_annotations) | Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores generated by `predict`. |
|
||
|
||
By default, [`TrainablePipe.__init__`](/api/pipe#init) takes the shared vocab,
|
||
the [`Model`](https://thinc.ai/docs/api-model) and the name of the component
|
||
instance in the pipeline, which you can use as a key in the losses. All other
|
||
keyword arguments will become available as [`TrainablePipe.cfg`](/api/pipe#cfg)
|
||
and will also be serialized with the component.
|
||
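
To make the three ingredients concrete, here is a deliberately trivial sketch. The "model" is a toy Thinc `Model` whose forward function just counts tokens and has nothing to learn, so it only illustrates how `predict`, `set_annotations` and the factory fit together. The names `demo_token_counter_model.v1`, `TokenCounter` and `"demo_token_counter"` are hypothetical.

```python
import spacy
from spacy.language import Language
from spacy.pipeline import TrainablePipe
from thinc.api import Model

def forward(model, docs, is_train):
    # Toy "model": the output for each Doc is its number of tokens
    lengths = [len(doc) for doc in docs]

    def backprop(d_lengths):
        return []  # nothing to backpropagate into the Doc inputs

    return lengths, backprop

@spacy.registry.architectures("demo_token_counter_model.v1")
def create_token_counter_model() -> Model:
    return Model("demo_token_counter", forward)

class TokenCounter(TrainablePipe):
    def predict(self, docs):
        # Apply the model without modifying the docs
        return self.model.predict(docs)

    def set_annotations(self, docs, scores):
        # Write the pre-computed scores to the docs
        for doc, n_tokens in zip(docs, scores):
            doc.user_data["n_tokens"] = n_tokens

@Language.factory(
    "demo_token_counter",
    default_config={"model": {"@architectures": "demo_token_counter_model.v1"}},
)
def make_token_counter(nlp, name, model: Model):
    return TokenCounter(nlp.vocab, model, name=name)

nlp = spacy.blank("en")
nlp.add_pipe("demo_token_counter")
doc = nlp("This is a sentence.")
print(doc.user_data["n_tokens"])  # 5
```

The factory receives the `Model` as an argument instead of creating it, so a different architecture can be swapped in via the config.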
|
||
<Accordion title="Why components should be passed a Model instance, not create it" spaced>
|
||
|
||
spaCy's [config system](/usage/training#config) resolves the config describing
|
||
the pipeline components and models **bottom-up**. This means that it will
|
||
_first_ create a `Model` from a [registered architecture](/api/architectures),
|
||
validate its arguments and _then_ pass the object forward to the component. This
|
||
means that the config can express very complex, nested trees of objects – but
|
||
the objects don't have to pass the model settings all the way down to the
|
||
components. It also makes the components more **modular** and lets you
|
||
[swap](/usage/layers-architectures#swap-architectures) different architectures
|
||
in your config, and re-use model definitions.
|
||
|
||
```ini
### config.cfg (excerpt)
[components]

[components.textcat]
factory = "textcat"
labels = []

# This model is created and then passed to the "textcat" component as
# the argument "model"
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false

[components.other_textcat]
factory = "textcat"
# This references the [components.textcat.model] block above
model = ${components.textcat.model}
labels = []
```
|
||
|
||
Your trainable pipeline component factories should therefore always take a
|
||
`model` argument instead of instantiating the
|
||
[`Model`](https://thinc.ai/docs/api-model) inside the component. To register
|
||
custom architectures, you can use the
|
||
[`@spacy.registry.architectures`](/api/top-level#registry) decorator. Also see
|
||
the [training guide](/usage/training#config) for details.
|
||
|
||
</Accordion>
|
||
|
||
For some use cases, it makes sense to also overwrite additional methods to
|
||
customize how the model is updated from examples, how it's initialized, how the
|
||
loss is calculated and to add evaluation scores to the training output.
|
||
|
||
| Name | Description |
|
||
| ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||
| [`update`](/api/pipe#update) | Learn from a batch of [`Example`](/api/example) objects containing the predictions and gold-standard annotations, and update the component's model. |
|
||
| [`initialize`](/api/pipe#initialize) | Initialize the model. Typically calls into [`Model.initialize`](https://thinc.ai/docs/api-model#initialize) and can be passed custom arguments via the [`[initialize]`](/api/data-formats#config-initialize) config block that are only loaded during training or when you call [`nlp.initialize`](/api/language#initialize), not at runtime. |
|
||
| [`get_loss`](/api/pipe#get_loss) | Return a tuple of the loss and the gradient for a batch of [`Example`](/api/example) objects. |
|
||
| [`score`](/api/pipe#score) | Score a batch of [`Example`](/api/example) objects and return a dictionary of scores. The [`@Language.factory`](/api/language#factory) decorator can define the `default_score_weights` of the component to decide which keys of the scores to display during training and how they count towards the final score. |
|
||
|
||
<Infobox title="Custom trainable components and models" emoji="📖">
|
||
|
||
For more details on how to implement your own trainable components and model
|
||
architectures, and plug existing models implemented in PyTorch or TensorFlow
|
||
into your spaCy pipeline, see the usage guide on
|
||
[layers and model architectures](/usage/layers-architectures#components).
|
||
|
||
</Infobox>
|
||
|
||
## Extension attributes {#custom-components-attributes new="2"}
|
||
|
||
spaCy allows you to set any custom attributes and methods on the `Doc`, `Span`
|
||
and `Token`, which become available as `Doc._`, `Span._` and `Token._` – for
|
||
example, `Token._.my_attr`. This lets you store additional information relevant
|
||
to your application, add new features and functionality to spaCy, and implement
|
||
your own models trained with other machine learning libraries. It also lets you
|
||
take advantage of spaCy's data structures and the `Doc` object as the "single
|
||
source of truth".
|
||
|
||
<Accordion title="Why ._ and not just a top-level attribute?" id="why-dot-underscore">
|
||
|
||
Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
|
||
separation and makes it easier to ensure backwards compatibility. For example,
|
||
if you've implemented your own `.coref` property and spaCy claims it one day,
|
||
it'll break your code. Similarly, just by looking at the code, you'll
|
||
immediately know what's built-in and what's custom – for example,
|
||
`doc.sentiment` is spaCy, while `doc._.sent_score` isn't.
|
||
|
||
</Accordion>
|
||
|
||
<Accordion title="How is the ._ implemented?" id="dot-underscore-implementation">
|
||
|
||
Extension definitions – the defaults, methods, getters and setters you pass in
|
||
to `set_extension` – are stored in class attributes on the `Underscore` class.
|
||
If you write to an extension attribute, e.g. `doc._.hello = True`, the data is
|
||
stored within the [`Doc.user_data`](/api/doc#attributes) dictionary. To keep the
|
||
underscore data separate from your other dictionary entries, the string `"._."`
|
||
is placed before the name, in a tuple.
|
||
|
||
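As a rough illustration of the storage described above, the following sketch writes to an attribute extension and then inspects `doc.user_data`. It assumes only that the key is a tuple starting with `"._."` followed by the extension name; the remaining tuple elements are an implementation detail and are not relied on here. The `force=True` argument is only there so the snippet can be re-run.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Register the extension (force=True so the snippet can be run repeatedly)
Doc.set_extension("hello", default=False, force=True)
doc = Doc(Vocab(), words=["hello", "world"])
doc._.hello = True

# Collect the user_data keys that were written via the underscore proxy
underscore_keys = [
    key for key in doc.user_data
    if isinstance(key, tuple) and key[0] == "._."
]
print(underscore_keys)
print(doc.user_data[underscore_keys[0]])
```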
</Accordion>
|
||
|
||
---
|
||
|
||
There are three main types of extensions, which can be defined using the
|
||
[`Doc.set_extension`](/api/doc#set_extension),
|
||
[`Span.set_extension`](/api/span#set_extension) and
|
||
[`Token.set_extension`](/api/token#set_extension) methods.
|
||
|
||
## Description
|
||
|
||
1. **Attribute extensions.** Set a default value for an attribute, which can be
   overwritten manually at any time. Attribute extensions work like "normal"
   variables and are the quickest way to store arbitrary information on a `Doc`,
   `Span` or `Token`.

   ```python
   Doc.set_extension("hello", default=True)
   assert doc._.hello
   doc._.hello = False
   ```

2. **Property extensions.** Define a getter and an optional setter function. If
   no setter is provided, the extension is immutable. Since the getter and
   setter functions are only called when you _retrieve_ the attribute, you can
   also access values of previously added attribute extensions. For example, a
   `Doc` getter can average over `Token` attributes. For `Span` extensions,
   you'll almost always want to use a property – otherwise, you'd have to write
   to _every possible_ `Span` in the `Doc` to set up the values correctly.

   ```python
   Doc.set_extension("hello", getter=get_hello_value, setter=set_hello_value)
   assert doc._.hello
   doc._.hello = "Hi!"
   ```

3. **Method extensions.** Assign a function that becomes available as an object
   method. Method extensions are always immutable. For more details and
   implementation ideas, see
   [these examples](/usage/examples#custom-components-attr-methods).

   ```python
   Doc.set_extension("hello", method=lambda doc, name: f"Hi {name}!")
   assert doc._.hello("Bob") == "Hi Bob!"
   ```
|
||
|
||
Before you can access a custom extension, you need to register it using the
`set_extension` method on the object you want to add it to, e.g. the `Doc`. Keep
in mind that extensions are always **added globally** and not just on a
particular instance. If you're trying to access an attribute that hasn't been
registered, spaCy will raise an `AttributeError`. Registering a name that
already exists raises a `ValueError`, unless you pass `force=True` to
`set_extension` to overwrite the existing definition.
|
||
|
||
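The registration rules above can be tried out with a short sketch. The extension names `demo_flag` and `never_registered` are made up for this illustration, and the `Doc` is built directly from a `Vocab` so no trained pipeline is needed. `has_extension` and `force=True` are two ways to deal with a name that may already be taken.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# "demo_flag" is a hypothetical extension name used only for this sketch.
if not Doc.has_extension("demo_flag"):
    Doc.set_extension("demo_flag", default=False)

# The extension is registered on the class, so any Doc instance exposes it.
doc = Doc(Vocab(), words=["hello", "world"])
default_value = doc._.demo_flag

# force=True replaces an existing definition instead of raising an error.
Doc.set_extension("demo_flag", default=True, force=True)
overwritten_value = Doc(Vocab(), words=["hi"])._.demo_flag

# Accessing a name that was never registered raises an AttributeError.
try:
    doc._.never_registered
    raised = False
except AttributeError:
    raised = True
```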
```python
### Example
from spacy.tokens import Doc, Span, Token

fruits = ["apple", "pear", "banana", "orange", "strawberry"]
is_fruit_getter = lambda token: token.text in fruits
has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])

Token.set_extension("is_fruit", getter=is_fruit_getter)
Doc.set_extension("has_fruit", getter=has_fruit_getter)
Span.set_extension("has_fruit", getter=has_fruit_getter)
```
|
||
|
||
> #### Usage example
>
> ```python
> doc = nlp("I have an apple and a melon")
> assert doc[3]._.is_fruit # get Token attributes
> assert not doc[0]._.is_fruit
> assert doc._.has_fruit # get Doc attributes
> assert doc[1:4]._.has_fruit # get Span attributes
> ```
|
||
|
||
Once you've registered your custom attribute, you can also use the built-in
|
||
`set`, `get` and `has` methods to modify and retrieve the attributes. This is
|
||
especially useful if you want to pass in a string instead of calling
|
||
`doc._.my_attr`.
|
||
|
||
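As a small sketch of the `set`, `get` and `has` methods mentioned above, the attribute name can be kept in a variable, for example when it comes from a config. The name `my_attr` and the stored value are placeholders, and `force=True` is only used so the snippet can be re-run.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

Doc.set_extension("my_attr", default=None, force=True)
doc = Doc(Vocab(), words=["hello", "world"])

# The attribute name is a plain string, e.g. read from a config file
attr_name = "my_attr"
if doc._.has(attr_name):
    doc._.set(attr_name, "some value")
value = doc._.get(attr_name)
```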
### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}
|
||
|
||
This example shows the implementation of a pipeline component that fetches
|
||
country meta data via the [REST Countries API](https://restcountries.com), sets
|
||
entity annotations for countries and sets custom attributes on the `Doc` and
|
||
`Span` – for example, the capital, latitude/longitude coordinates and even the
|
||
country flag.
|
||
|
||
```python
### {executable="true"}
import requests
from spacy.lang.en import English
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token

@Language.factory("rest_countries")
class RESTCountriesComponent:
    def __init__(self, nlp, name, label="GPE"):
        r = requests.get("https://restcountries.com/v2/all")
        r.raise_for_status()  # make sure requests raises an error if it fails
        countries = r.json()
        # Convert API response to dict keyed by country name for easy lookup
        self.countries = {c["name"]: c for c in countries}
        self.label = label
        # Set up the PhraseMatcher with Doc patterns for each country name
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add("COUNTRIES", [nlp.make_doc(c) for c in self.countries.keys()])
        # Register attributes on the Span. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Span.set_extension("is_country", default=None)
        Span.set_extension("country_capital", default=None)
        Span.set_extension("country_latlng", default=None)
        Span.set_extension("country_flag", default=None)
        # Register attribute on Doc via a getter that checks if the Doc
        # contains a country entity
        Doc.set_extension("has_country", getter=self.has_country)

    def __call__(self, doc):
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in self.matcher(doc):
            # Generate Span representing the entity & set label
            entity = Span(doc, start, end, label=self.label)
            # Set custom attributes on entity. Can be extended with other data
            # returned by the API, like currencies, country code, calling code etc.
            entity._.set("is_country", True)
            entity._.set("country_capital", self.countries[entity.text]["capital"])
            entity._.set("country_latlng", self.countries[entity.text]["latlng"])
            entity._.set("country_flag", self.countries[entity.text]["flag"])
            spans.append(entity)
        # Overwrite doc.ents and add entity – be careful not to replace!
        doc.ents = list(doc.ents) + spans
        return doc  # don't forget to return the Doc!

    def has_country(self, doc):
        """Getter for Doc attributes. Since the getter is only called
        when we access the attribute, we can refer to the Span's 'is_country'
        attribute here, which is already set in the processing step."""
        return any([entity._.get("is_country") for entity in doc.ents])

nlp = English()
nlp.add_pipe("rest_countries", config={"label": "GPE"})
doc = nlp("Some text about Colombia and the Czech Republic")
print("Pipeline", nlp.pipe_names)  # pipeline contains component name
print("Doc has countries", doc._.has_country)  # Doc contains countries
for ent in doc.ents:
    if ent._.is_country:
        print(ent.text, ent.label_, ent._.country_capital, ent._.country_latlng, ent._.country_flag)
```
|
||
|
||
In this case, all data can be fetched on initialization in one request. However,
|
||
if you're working with text that contains incomplete country names, spelling
|
||
mistakes or foreign-language versions, you could also implement a
|
||
`like_country`-style getter function that makes a request to the search API
|
||
endpoint and returns the best-matching result.
|
||
|
||
### User hooks {#custom-components-user-hooks}
|
||
|
||
While it's generally recommended to use the `Doc._`, `Span._` and `Token._`
|
||
proxies to add your own custom attributes, spaCy offers a few exceptions to
|
||
allow **customizing the built-in methods** like
|
||
[`Doc.similarity`](/api/doc#similarity) or [`Doc.vector`](/api/doc#vector) with
|
||
your own hooks, which can rely on components you train yourself. For instance,
|
||
you can provide your own on-the-fly sentence segmentation algorithm or document
|
||
similarity method.
|
||
|
||
Hooks let you customize some of the behaviors of the `Doc`, `Span` or `Token`
|
||
objects by adding a component to the pipeline. For instance, to customize the
|
||
[`Doc.similarity`](/api/doc#similarity) method, you can add a component that
|
||
sets a custom function to `doc.user_hooks["similarity"]`. The built-in
|
||
`Doc.similarity` method will check the `user_hooks` dict, and delegate to your
|
||
function if you've set one. Similar results can be achieved by setting functions
|
||
to `Doc.user_span_hooks` and `Doc.user_token_hooks`.
|
||
|
||
> #### Implementation note
|
||
>
|
||
> The hooks live on the `Doc` object because the `Span` and `Token` objects are
|
||
> created lazily, and don't own any data. They just proxy to their parent `Doc`.
|
||
> This turns out to be convenient here – we only have to worry about installing
|
||
> hooks in one place.
|
||
|
||
| Name | Customizes |
| ------------------ | ---------- |
| `user_hooks` | [`Doc.similarity`](/api/doc#similarity), [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents) |
| `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) |
| `user_span_hooks` | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root) |
|
||
|
||
```python
### Add custom similarity hooks
from spacy.language import Language


class SimilarityModel:
    def __init__(self, name: str, index: int):
        self.name = name
        self.index = index

    def __call__(self, doc):
        doc.user_hooks["similarity"] = self.similarity
        doc.user_span_hooks["similarity"] = self.similarity
        doc.user_token_hooks["similarity"] = self.similarity
        return doc

    def similarity(self, obj1, obj2):
        return obj1.vector[self.index] + obj2.vector[self.index]


@Language.factory("similarity_component", default_config={"index": 0})
def create_similarity_component(nlp, name, index: int):
    return SimilarityModel(name, index)
```
|
||
|
||
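To see the delegation in action without word vectors, here is a minimal sketch that installs a toy hook on a blank English pipeline. The Jaccard overlap of lowercased token texts is only a stand-in chosen for this illustration, not a recommended similarity measure, and the component name `jaccard_similarity_hook` is made up. The hook is only installed for `Doc` and `Span`, because the function iterates over tokens.

```python
import spacy
from spacy.language import Language


def jaccard_similarity(obj1, obj2):
    # Toy measure: overlap of lowercased token texts (works for Doc and Span)
    a = {token.lower_ for token in obj1}
    b = {token.lower_ for token in obj2}
    return len(a & b) / len(a | b) if (a | b) else 0.0


@Language.component("jaccard_similarity_hook")
def jaccard_similarity_hook(doc):
    # The built-in similarity methods check these dicts and delegate to the hook
    doc.user_hooks["similarity"] = jaccard_similarity
    doc.user_span_hooks["similarity"] = jaccard_similarity
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("jaccard_similarity_hook")
doc1 = nlp("I like apples")
doc2 = nlp("I like pears")
doc_score = doc1.similarity(doc2)
span_score = doc1[0:2].similarity(doc2[0:2])
print(doc_score, span_score)
```

Because the hooks are set on each `Doc` as it passes through the pipeline, only documents processed by this `nlp` object use the custom function.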
## Developing plugins and wrappers {#plugins}
|
||
|
||
We're very excited about all the new possibilities for community extensions and
|
||
plugins in spaCy, and we can't wait to see what you build with it! To get you
|
||
started, here are a few tips, tricks and best
|
||
practices. [See here](/universe/?category=pipeline) for examples of other spaCy
|
||
extensions.
|
||
|
||
### Usage ideas {#custom-components-usage-ideas}
|
||
|
||
- **Adding new features and hooking in models.** For example, a sentiment
  analysis model, or your preferred solution for lemmatization. spaCy's
  built-in tagger, parser and entity recognizer respect annotations that were
  already set on the `Doc` in a previous step of the pipeline.
- **Integrating other libraries and APIs.** For example, your pipeline component
  can write additional information and data directly to the `Doc` or `Token` as
  custom attributes, while making sure no information is lost in the process.
  This can be output generated by other libraries and models, or an external
  service with a REST API.
- **Debugging and logging.** For example, a component that stores and/or
  exports relevant information about the current state of the processed
  document, which you can insert at any point of your pipeline.
|
||
|
||
### Best practices {#custom-components-best-practices}
|
||
|
||
Extensions can claim their own `._` namespace and exist as standalone packages.
|
||
If you're developing a tool or library and want to make it easy for others to
|
||
use it with spaCy and add it to their pipeline, all you have to do is expose a
|
||
function that takes a `Doc`, modifies it and returns it.
|
||
|
||
- Make sure to choose a **descriptive and specific name** for your pipeline
|
||
component class, and set it as its `name` attribute. Avoid names that are too
|
||
common or likely to clash with built-in or a user's other custom components.
|
||
While it's fine to call your package `"spacy_my_extension"`, avoid component
|
||
names including `"spacy"`, since this can easily lead to confusion.
|
||
|
||
  ```diff
  + name = "myapp_lemmatizer"
  - name = "lemmatizer"
  ```
|
||
|
||
- When writing to `Doc`, `Token` or `Span` objects, **use getter functions**
|
||
wherever possible, and avoid setting values explicitly. Tokens and spans don't
|
||
own any data themselves, and they're implemented as C extension classes – so
|
||
you can't usually add new attributes to them like you could with most pure
|
||
Python objects.
|
||
|
||
  ```diff
  + is_fruit = lambda token: token.text in ("apple", "orange")
  + Token.set_extension("is_fruit", getter=is_fruit)

  - Token.set_extension("is_fruit", default=False)
  - if token.text in ("apple", "orange"):
  -     token._.set("is_fruit", True)
  ```
|
||
|
||
- Always add your custom attributes to the **global** `Doc`, `Token` or `Span`
|
||
objects, not a particular instance of them. Add the attributes **as early as
|
||
possible**, e.g. in your extension's `__init__` method or in the global scope
|
||
of your module. This means that in the case of namespace collisions, the user
|
||
will see an error immediately, not just when they run their pipeline.
|
||
|
||
  ```diff
  + from spacy.tokens import Doc
  + def __init__(self, attr="my_attr"):
  +     Doc.set_extension(attr, getter=self.get_doc_attr)

  - def __call__(self, doc):
  -     doc.set_extension("my_attr", getter=self.get_doc_attr)
  ```
|
||
|
||
- If your extension is setting properties on the `Doc`, `Token` or `Span`,
|
||
include an option to **let the user change those attribute names**. This
|
||
makes it easier to avoid namespace collisions and accommodate users with
|
||
different naming preferences. We recommend adding an `attrs` argument to the
|
||
`__init__` method of your class so you can write the names to class attributes
|
||
and reuse them across your component.
|
||
|
||
  ```diff
  + Doc.set_extension(self.doc_attr, default="some value")
  - Doc.set_extension("my_doc_attr", default="some value")
  ```
|
||
|
||
- Ideally, extensions should be **standalone packages** with spaCy and
|
||
optionally, other packages specified as a dependency. They can freely assign
|
||
to their own `._` namespace, but should stick to that. If your extension's
|
||
only job is to provide a better `.similarity` implementation, and your docs
|
||
state this explicitly, there's no problem with writing to the
|
||
[`user_hooks`](#custom-components-user-hooks) and overwriting spaCy's built-in
|
||
method. However, a third-party extension should **never silently overwrite
|
||
built-ins**, or attributes set by other extensions.
|
||
|
||
- If you're looking to publish a pipeline package that depends on a custom
|
||
pipeline component, you can either **require it** in the package's
|
||
dependencies, or – if the component is specific and lightweight – choose to
|
||
**ship it with your pipeline package**. Just make sure the
|
||
[`@Language.component`](/api/language#component) or
|
||
[`@Language.factory`](/api/language#factory) decorator that registers the
|
||
custom component runs in your package's `__init__.py` or is exposed via an
|
||
[entry point](/usage/saving-loading#entry-points).
|
||
|
||
- Once you're ready to share your extension with others, make sure to **add docs
|
||
and installation instructions** (you can always link to this page for more
|
||
info). Make it easy for others to install and use your extension, for example
|
||
by uploading it to [PyPI](https://pypi.python.org). If you're sharing your
|
||
code on GitHub, don't forget to tag it with
|
||
[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
|
||
[`spacy-extension`](https://github.com/topics/spacy-extension?o=desc&s=stars)
|
||
to help people find it. If you post it on Twitter, feel free to tag
|
||
[@spacy_io](https://twitter.com/spacy_io) so we can check it out.
|
||
|
||
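Several of the recommendations above (a specific component name, a getter instead of explicit writes, early global registration and a user-configurable attribute name) can be combined in one small component. This is only a sketch: the names `TokenCounter`, `myapp_token_counter`, `doc_attr` and `myapp_n_tokens` are invented for the illustration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


class TokenCounter:
    def __init__(self, doc_attr: str):
        self.doc_attr = doc_attr
        # Register globally and early, so a name clash surfaces at setup time
        Doc.set_extension(self.doc_attr, getter=self.count_tokens)

    def __call__(self, doc):
        # Nothing to write: the getter computes the value on access
        return doc

    def count_tokens(self, doc):
        return len(doc)


@Language.factory("myapp_token_counter", default_config={"doc_attr": "myapp_n_tokens"})
def create_token_counter(nlp, name, doc_attr: str):
    return TokenCounter(doc_attr)


nlp = spacy.blank("en")
# The user overrides the default attribute name via the component config
nlp.add_pipe("myapp_token_counter", config={"doc_attr": "n_tokens"})
doc = nlp("This is a sentence")
n_tokens = doc._.n_tokens
```

Since the attribute name is passed in through the config, two users with different naming preferences can use the same component without touching its code.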
### Wrapping other models and libraries {#wrapping-models-libraries}
|
||
|
||
Let's say you have a custom entity recognizer that takes a list of strings and
|
||
returns their [BILUO tags](/usage/linguistic-features#accessing-ner). Given an
|
||
input like `["A", "text", "about", "Facebook"]`, it will predict and return
|
||
`["O", "O", "O", "U-ORG"]`. To integrate it into your spaCy pipeline and make it
|
||
add those entities to the `doc.ents`, you can wrap it in a custom pipeline
|
||
component function and pass it the token texts from the `Doc` object received by
|
||
the component.
|
||
|
||
The [`training.biluo_tags_to_spans`](/api/top-level#biluo_tags_to_spans) function is very
|
||
helpful here, because it takes a `Doc` object and token-based BILUO tags and
|
||
returns a sequence of `Span` objects in the `Doc` with added labels. So all your
|
||
wrapper has to do is compute the entity spans and overwrite the `doc.ents`.
|
||
|
||
> #### How the doc.ents work
|
||
>
|
||
> When you add spans to the `doc.ents`, spaCy will automatically resolve them
|
||
> back to the underlying tokens and set the `Token.ent_type` and `Token.ent_iob`
|
||
> attributes. By definition, each token can only be part of one entity, so
|
||
> overlapping entity spans are not allowed.
|
||
|
||
```python
### {highlight="1,8-9"}
import your_custom_entity_recognizer
from spacy.training import biluo_tags_to_spans
from spacy.language import Language

@Language.component("custom_ner_wrapper")
def custom_ner_wrapper(doc):
    words = [token.text for token in doc]
    custom_entities = your_custom_entity_recognizer(words)
    doc.ents = biluo_tags_to_spans(doc, custom_entities)
    return doc
```
|
||
|
||
The `custom_ner_wrapper` can then be added to a blank pipeline using
|
||
[`nlp.add_pipe`](/api/language#add_pipe). You can also replace the existing
|
||
entity recognizer of a trained pipeline with
|
||
[`nlp.replace_pipe`](/api/language#replace_pipe).
|
||
|
||
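To make the wrapper pattern concrete, here is a runnable sketch in which a toy function stands in for the external recognizer. `toy_entity_recognizer` and the component name `toy_ner_wrapper` are hypothetical; a real model would return BILUO tags in the same shape.

```python
import spacy
from spacy.language import Language
from spacy.training import biluo_tags_to_spans


def toy_entity_recognizer(words):
    # Hypothetical stand-in for an external model: one BILUO tag per token
    return ["U-ORG" if word == "Facebook" else "O" for word in words]


@Language.component("toy_ner_wrapper")
def toy_ner_wrapper(doc):
    words = [token.text for token in doc]
    tags = toy_entity_recognizer(words)
    # Convert the token-based tags to Span objects and overwrite doc.ents
    doc.ents = biluo_tags_to_spans(doc, tags)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("toy_ner_wrapper")
doc = nlp("A text about Facebook")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```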
Here's another example of a custom model, `your_custom_model`, that takes a list
|
||
of tokens and returns lists of fine-grained part-of-speech tags, coarse-grained
|
||
part-of-speech tags, dependency labels and head token indices. Here, we can use
|
||
the [`Doc.from_array`](/api/doc#from_array) method to create a new `Doc` object using
|
||
those values. To create a numpy array we need integers, so we can look up the
|
||
string labels in the [`StringStore`](/api/stringstore). The
|
||
[`doc.vocab.strings.add`](/api/stringstore#add) method comes in handy here,
|
||
because it returns the integer ID of the string _and_ makes sure it's added to
|
||
the vocab. This is especially important if the custom model uses a different
|
||
label scheme than spaCy's default models.
|
||
|
||
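The string-to-integer lookup described above can be checked in isolation. In this sketch, `MY_CUSTOM_TAG` is a made-up label that is not part of any default label scheme, and an empty `Vocab` is used instead of a loaded pipeline.

```python
from spacy.vocab import Vocab

vocab = Vocab()
# add() returns the integer ID and makes sure the string is in the store
label_id = vocab.strings.add("MY_CUSTOM_TAG")

# The store can now be queried in both directions
label_text = vocab.strings[label_id]
same_id = vocab.strings["MY_CUSTOM_TAG"]
```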
> #### Example: spacy-stanza
|
||
>
|
||
> For an example of an end-to-end wrapper for statistical tokenization, tagging
|
||
> and parsing, check out
|
||
> [`spacy-stanza`](https://github.com/explosion/spacy-stanza). It uses a very
|
||
> similar approach to the example in this section – the only difference is that
|
||
> it fully replaces the `nlp` object instead of providing a pipeline component,
|
||
> since it also needs to handle tokenization.
|
||
|
||
```python
### {highlight="1,11,17-19"}
import your_custom_model
from spacy.language import Language
from spacy.symbols import POS, TAG, DEP, HEAD
from spacy.tokens import Doc
import numpy

@Language.component("custom_model_wrapper")
def custom_model_wrapper(doc):
    words = [token.text for token in doc]
    spaces = [bool(token.whitespace_) for token in doc]
    pos, tags, deps, heads = your_custom_model(words)
    # Convert the strings to integers and add them to the string store
    pos = [doc.vocab.strings.add(label) for label in pos]
    tags = [doc.vocab.strings.add(label) for label in tags]
    deps = [doc.vocab.strings.add(label) for label in deps]
    # Create a new Doc from a numpy array
    attrs = [POS, TAG, DEP, HEAD]
    arr = numpy.array(list(zip(pos, tags, deps, heads)), dtype="uint64")
    new_doc = Doc(doc.vocab, words=words, spaces=spaces).from_array(attrs, arr)
    return new_doc
```
|
||
|
||
<Infobox title="Sentence boundaries and heads" variant="warning">
|
||
|
||
If you create a `Doc` object with dependencies and heads, spaCy is able to
|
||
resolve the sentence boundaries automatically. However, note that the `HEAD`
|
||
value used to construct a `Doc` is the token index **relative** to the current
|
||
token – e.g. `-1` for the previous token. The CoNLL format typically annotates
|
||
heads as `1`-indexed absolute indices with `0` indicating the root. If that's
|
||
the case in your annotations, you need to convert them first:
|
||
|
||
```python
heads = [2, 0, 4, 2, 2]
new_heads = [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]
```
|
||
|
||
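The conversion can be verified by hand with a small round trip. The helper names below are made up for this sketch. It treats an offset of `0` as the root pointing at itself and maps the offsets back to 0-indexed absolute positions as a sanity check.

```python
def conll_heads_to_relative(heads):
    # 1-indexed absolute heads (0 = root) -> offsets relative to each token
    return [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]


def relative_heads_to_absolute(offsets):
    # Relative offsets -> 0-indexed absolute head positions
    return [i + offset for i, offset in enumerate(offsets)]


conll_heads = [2, 0, 4, 2, 2]
relative = conll_heads_to_relative(conll_heads)
absolute = relative_heads_to_absolute(relative)
print(relative)  # [1, 0, 1, -2, -3]
print(absolute)  # [1, 1, 3, 1, 1]
```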
</Infobox>
|
||
|
||
<Infobox title="Advanced usage, serialization and entry points" emoji="📖">
|
||
|
||
For more details on how to write and package custom components, make them
|
||
available to spaCy via entry points and implement your own serialization
|
||
methods, check out the usage guide on
|
||
[saving and loading](/usage/saving-loading).
|
||
|
||
</Infobox>
|