---
title: Language Processing Pipelines
next: vectors-similarity
menu:
  - ['How Pipelines Work', 'pipelines']
  - ['Custom Components', 'custom-components']
  - ['Extension Attributes', 'custom-components-attributes']
  - ['Plugins & Wrappers', 'plugins']
---

import Pipelines101 from 'usage/101/_pipelines.md'

<Pipelines101 />

## How pipelines work {#pipelines}

spaCy makes it very easy to create your own pipelines consisting of reusable
components – this includes spaCy's default tagger, parser and entity
recognizer, but also your own custom processing functions. A pipeline component
can be added to an already existing `nlp` object, specified when initializing a
`Language` class, or defined within a [model package](/usage/training#saving-loading).

When you load a model, spaCy first consults the model's
[`meta.json`](/usage/training#saving-loading). The meta typically includes the
model details, the ID of a language class, and an optional list of pipeline
components. spaCy then does the following:

> #### meta.json (excerpt)
>
> ```json
> {
>   "name": "example_model",
>   "lang": "en",
>   "description": "Example model for spaCy",
>   "pipeline": ["tagger", "parser"]
> }
> ```

1. Load the **language class and data** for the given ID via
   [`get_lang_class`](/api/top-level#util.get_lang_class) and initialize it.
   The `Language` class contains the shared vocabulary, tokenization rules and
   the language-specific annotation scheme.
2. Iterate over the **pipeline names** and create each component using
   [`create_pipe`](/api/language#create_pipe), which looks them up in
   `Language.factories`.
3. Add each pipeline component to the pipeline in order, using
   [`add_pipe`](/api/language#add_pipe).
4. Make the **model data** available to the `Language` class by calling
   [`from_disk`](/api/language#from_disk) with the path to the model data
   directory.

So when you call this...

```python
nlp = spacy.load("en")
```

... the model tells spaCy to use the language `"en"` and the pipeline
`["tagger", "parser", "ner"]`. spaCy will then initialize
`spacy.lang.en.English`, create each pipeline component and add it to the
processing pipeline. It'll then load in the model's data from its data
directory and return the modified `Language` class for you to use as the `nlp`
object.

Fundamentally, a [spaCy model](/models) consists of three components: **the
weights**, i.e. binary data loaded in from a directory, a **pipeline** of
functions called in order, and **language data** like the tokenization rules
and annotation scheme. All of this is specific to each model, and defined in
the model's `meta.json` – for example, a Spanish NER model requires different
weights, language data and pipeline components than an English parsing and
tagging model. This is also why the pipeline state is always held by the
`Language` class. [`spacy.load`](/api/top-level#spacy.load) puts this all
together and returns an instance of `Language` with a pipeline set and access
to the binary data:

```python
### spacy.load under the hood
lang = "en"
pipeline = ["tagger", "parser", "ner"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"

cls = spacy.util.get_lang_class(lang)  # 1. Get the Language class, e.g. English
nlp = cls()                            # 2. Initialize it
for name in pipeline:
    component = nlp.create_pipe(name)  # 3. Create the pipeline components
    nlp.add_pipe(component)            # 4. Add the component to the pipeline
nlp.from_disk(data_path)               # 5. Load in the binary data
```

When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
component** on the `Doc`, in order. Since the model data is loaded, the
components can access it to assign annotations to the `Doc` object, and
subsequently to the `Token` and `Span` which are only views of the `Doc`, and
don't own any data themselves. All components return the modified document,
which is then processed by the next component in the pipeline.

```python
### The pipeline under the hood
doc = nlp.make_doc(u"This is a sentence")  # create a Doc from raw text
for name, proc in nlp.pipeline:            # iterate over components in order
    doc = proc(doc)                        # apply each component
```

The current processing pipeline is available as `nlp.pipeline`, which returns a
list of `(name, component)` tuples, or `nlp.pipe_names`, which only returns a
list of human-readable component names.

```python
print(nlp.pipeline)
# [('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>)]
print(nlp.pipe_names)
# ['tagger', 'parser', 'ner']
```

### Built-in pipeline components

spaCy ships with several built-in pipeline components that are also available
via `Language.factories`. This means that you can initialize them by calling
[`nlp.create_pipe`](/api/language#create_pipe) with their string names and
require them in the pipeline settings in your model's `meta.json`.

> #### Usage
>
> ```python
> # Option 1: Import and initialize
> from spacy.pipeline import EntityRuler
> ruler = EntityRuler(nlp)
> nlp.add_pipe(ruler)
>
> # Option 2: Using nlp.create_pipe
> sentencizer = nlp.create_pipe("sentencizer")
> nlp.add_pipe(sentencizer)
> ```

| String name         | Component                                                        | Description                                                                                    |
| ------------------- | ---------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| `tagger`            | [`Tagger`](/api/tagger)                                          | Assign part-of-speech tags.                                                                    |
| `parser`            | [`DependencyParser`](/api/dependencyparser)                      | Assign dependency labels.                                                                      |
| `ner`               | [`EntityRecognizer`](/api/entityrecognizer)                      | Assign named entities.                                                                         |
| `textcat`           | [`TextCategorizer`](/api/textcategorizer)                        | Assign text categories.                                                                        |
| `entity_ruler`      | [`EntityRuler`](/api/entityruler)                                | Assign named entities based on pattern rules.                                                  |
| `sentencizer`       | [`SentenceSegmenter`](/api/sentencesegmenter)                    | Add rule-based sentence segmentation without the dependency parse.                            |
| `merge_noun_chunks` | [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) | Merge all noun chunks into a single token. Should be added after the tagger and parser.       |
| `merge_entities`    | [`merge_entities`](/api/pipeline-functions#merge_entities)       | Merge all entities into a single token. Should be added after the entity recognizer.          |
| `merge_subtokens`   | [`merge_subtokens`](/api/pipeline-functions#merge_subtokens)     | Merge subtokens predicted by the parser into single tokens. Should be added after the parser. |

### Disabling and modifying pipeline components {#disabling}

If you don't need a particular component of the pipeline – for example, the
tagger or the parser – you can disable loading it. This can sometimes make a
big difference and improve loading speed. Disabled component names can be
provided to [`spacy.load`](/api/top-level#spacy.load),
[`Language.from_disk`](/api/language#from_disk) or the `nlp` object itself as a
list:

```python
nlp = spacy.load("en", disable=["parser", "tagger"])
nlp = English().from_disk("/model", disable=["ner"])
```

You can also use the [`remove_pipe`](/api/language#remove_pipe) method to
remove pipeline components from an existing pipeline, the
[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
[`replace_pipe`](/api/language#replace_pipe) method to replace them with a
custom component entirely (more details on this in the section on
[custom components](#custom-components)).

```python
nlp.remove_pipe("parser")
nlp.rename_pipe("ner", "entityrecognizer")
nlp.replace_pipe("tagger", my_custom_tagger)
```

<Infobox title="Important note: disabling pipeline components" variant="warning">

Since spaCy v2.0 comes with better support for customizing the processing
pipeline components, the `parser`, `tagger` and `entity` keyword arguments have
been replaced with `disable`, which takes a list of pipeline component names.
This lets you disable pre-defined components when loading a model, or when
initializing a `Language` class via [`from_disk`](/api/language#from_disk).

```diff
- nlp = spacy.load('en', tagger=False, entity=False)
- doc = nlp(u"I don't want parsed", parse=False)

+ nlp = spacy.load('en', disable=['ner'])
+ nlp.remove_pipe('parser')
+ doc = nlp(u"I don't want parsed")
```

</Infobox>

## Creating custom pipeline components {#custom-components}

A component receives a `Doc` object and can modify it – for example, by using
the current weights to make a prediction and set some annotation on the
document. By adding a component to the pipeline, you'll get access to the `Doc`
at any point **during processing** – instead of only being able to modify it
afterwards.

> #### Example
>
> ```python
> def my_component(doc):
>     # do something to the doc here
>     return doc
> ```

| Argument    | Type  | Description                                             |
| ----------- | ----- | ------------------------------------------------------- |
| `doc`       | `Doc` | The `Doc` object processed by the previous component.   |
| **RETURNS** | `Doc` | The `Doc` object processed by this pipeline component.  |

Custom components can be added to the pipeline using the
[`add_pipe`](/api/language#add_pipe) method. Optionally, you can either specify
a component to add it **before or after**, tell spaCy to add it **first or
last** in the pipeline, or define a **custom name**. If no name is set and no
`name` attribute is present on your component, the function name is used.

> #### Example
>
> ```python
> nlp.add_pipe(my_component)
> nlp.add_pipe(my_component, first=True)
> nlp.add_pipe(my_component, before="parser")
> ```

| Argument | Type    | Description                                                               |
| -------- | ------- | ------------------------------------------------------------------------- |
| `last`   | bool    | If set to `True`, component is added **last** in the pipeline (default).  |
| `first`  | bool    | If set to `True`, component is added **first** in the pipeline.           |
| `before` | unicode | String name of component to add the new component **before**.             |
| `after`  | unicode | String name of component to add the new component **after**.              |

### Example: A simple pipeline component {#custom-components-simple}

The following component receives the `Doc` in the pipeline and prints some
information about it: the number of tokens, the part-of-speech tags of the
tokens and a conditional message based on the document length.

> #### ✏️ Things to try
>
> 1. Add the component first in the pipeline by setting `first=True`. You'll
>    see that the part-of-speech tags are empty, because the component now runs
>    before the tagger and the tags aren't available yet.
> 2. Change the component `name` or remove the `name` argument. You should see
>    this change reflected in `nlp.pipe_names`.
> 3. Print `nlp.pipeline`. You'll see a list of tuples describing the component
>    name and the function that's called on the `Doc` object in the pipeline.

```python
### {executable="true"}
import spacy

def my_component(doc):
    print("After tokenization, this doc has {} tokens.".format(len(doc)))
    print("The part-of-speech tags are:", [token.pos_ for token in doc])
    if len(doc) < 10:
        print("This is a pretty short document.")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(my_component, name="print_info", last=True)
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner', 'print_info']
doc = nlp(u"This is a sentence.")
```

Of course, you can also wrap your component as a class to allow initializing it
with custom settings and hold state within the component. This is useful for
**stateful components**, especially ones which **depend on shared data**. In
the following example, the custom component `EntityMatcher` can be initialized
with an `nlp` object, a terminology list and an entity label. Using the
[`PhraseMatcher`](/api/phrasematcher), it then matches the terms in the `Doc`
and adds them to the existing entities.

<Infobox title="Important note" variant="warning">

As of v2.1.0, spaCy ships with the [`EntityRuler`](/api/entityruler), a
pipeline component for easy, rule-based named entity recognition. Its
implementation is similar to the `EntityMatcher` code shown below, but it
includes some additional features like support for phrase patterns and token
patterns, handling overlaps with existing entities and pattern export as JSONL.

We'll still keep the pipeline component example below, as it works well to
illustrate complex components. But if you're planning on using this type of
component in your application, you might find the `EntityRuler` more
convenient. [See here](/usage/rule-based-matching#entityruler) for more details
and examples.

</Infobox>

```python
### {executable="true"}
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    name = "entity_matcher"

    def __init__(self, nlp, terms, label):
        patterns = [nlp.make_doc(text) for text in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            doc.ents = list(doc.ents) + [span]
        return doc

nlp = spacy.load("en_core_web_sm")
terms = (u"cat", u"dog", u"tree kangaroo", u"giant sea spider")
entity_matcher = EntityMatcher(nlp, terms, "ANIMAL")

nlp.add_pipe(entity_matcher, after="ner")

print(nlp.pipe_names)  # The components in the pipeline

doc = nlp(u"This is a text about Barack Obama and a tree kangaroo")
print([(ent.text, ent.label_) for ent in doc.ents])
```

### Example: Custom sentence segmentation logic {#component-example1}

Let's say you want to implement custom logic to improve spaCy's sentence
boundary detection. Currently, sentence segmentation is based on the dependency
parse, which doesn't always produce ideal results. The custom logic should
therefore be applied **after** tokenization, but _before_ the dependency
parsing – this way, the parser can also take advantage of the sentence
boundaries.

> #### ✏️ Things to try
>
> 1. Print `[token.dep_ for token in doc]` with and without the custom pipeline
>    component. You'll see that the predicted dependency parse changes to match
>    the sentence boundaries.
> 2. Remove the `else` block. All other tokens will now have `is_sent_start`
>    set to `None` (missing value), and the parser will assign sentence
>    boundaries in between.

```python
### {executable="true"}
import spacy

def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Define sentence start if pipe + titlecase token
        if token.text == "|" and doc[i+1].is_title:
            doc[i+1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise, to tell
            # the parser to leave those tokens alone
            doc[i+1].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(custom_sentencizer, before="parser")  # Insert before the parser
doc = nlp(u"This is. A sentence. | This is. Another sentence.")
for sent in doc.sents:
    print(sent.text)
```

### Example: Pipeline component for entity matching and tagging with custom attributes {#component-example2}

This example shows how to create a spaCy extension that takes a terminology
list (in this case, single- and multi-word company names), matches the
occurrences in a document, labels them as `ORG` entities, merges the tokens and
sets custom `is_tech_org` and `has_tech_org` attributes. For efficient
matching, the example uses the [`PhraseMatcher`](/api/phrasematcher), which
accepts `Doc` objects as match patterns and works well for large terminology
lists. It also ensures your patterns will always match, even when you customize
spaCy's tokenization rules. When you call `nlp` on a text, the custom pipeline
component is applied to the `Doc`.

```python
https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_entities.py
```

Wrapping this functionality in a pipeline component allows you to reuse the
module with different settings, and have all pre-processing taken care of when
you call `nlp` on your text and receive a `Doc` object.

### Adding factories {#custom-components-factories}

When spaCy loads a model via its `meta.json`, it will iterate over the
`"pipeline"` setting, look up every component name in the internal factories
and call [`nlp.create_pipe`](/api/language#create_pipe) to initialize the
individual components, like the tagger, parser or entity recognizer. If your
model uses custom components, this won't work – so you'll have to tell spaCy
**where to find your component**. You can do this by writing to
`Language.factories`:

```python
from spacy.language import Language
Language.factories["entity_matcher"] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
```

You can also ship the above code and your custom component in your packaged
model's `__init__.py`, so it's executed when you load your model. The `**cfg`
config parameters are passed all the way down from
[`spacy.load`](/api/top-level#spacy.load), so you can load the model and its
components with custom settings:

```python
nlp = spacy.load("your_custom_model", terms=(u"tree kangaroo",), label="ANIMAL")
```

<Infobox title="Important note" variant="warning">

When you load a model via its shortcut or package name, like `en_core_web_sm`,
spaCy will import the package and then call its `load()` method. This means
that custom code in the model's `__init__.py` will be executed, too. This is
**not the case** if you're loading a model from a path containing the model
data. Here, spaCy will only read in the `meta.json`. If you want to use custom
factories with a model loaded from a path, you need to add them to
`Language.factories` _before_ you load the model.
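
For example, a minimal sketch of this pattern might look like the following –
the factory name `"entity_matcher"` and the model path are purely illustrative:

```python
import spacy
from spacy.language import Language

# Register the custom factory first ...
Language.factories["entity_matcher"] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
# ... and only then load the model from a path, so create_pipe can find it
nlp = spacy.load("/path/to/your_custom_model")
```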

</Infobox>

## Extension attributes {#custom-components-attributes new="2"}

As of v2.0, spaCy allows you to set any custom attributes and methods on the
`Doc`, `Span` and `Token`, which become available as `Doc._`, `Span._` and
`Token._` – for example, `Token._.my_attr`. This lets you store additional
information relevant to your application, add new features and functionality
to spaCy, and implement your own models trained with other machine learning
libraries. It also lets you take advantage of spaCy's data structures and the
`Doc` object as the "single source of truth".

<Accordion title="Why ._ and not just a top-level attribute?">

Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
separation and makes it easier to ensure backwards compatibility. For example,
if you've implemented your own `.coref` property and spaCy claims it one day,
it'll break your code. Similarly, just by looking at the code, you'll
immediately know what's built-in and what's custom – for example,
`doc.sentiment` is spaCy, while `doc._.sent_score` isn't.

</Accordion>

<Accordion title="How is the ._ implemented?">

Extension definitions – the defaults, methods, getters and setters you pass in
to `set_extension` – are stored in class attributes on the `Underscore` class.
If you write to an extension attribute, e.g. `doc._.hello = True`, the data is
stored within the [`Doc.user_data`](/api/doc#attributes) dictionary. To keep
the underscore data separate from your other dictionary entries, the string
`"._."` is placed before the name, in a tuple.
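
As a small illustration – the exact key layout is an implementation detail and
may change – you can see this by inspecting `Doc.user_data` after writing to an
extension:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
Doc.set_extension("hello", default=False)
doc = nlp(u"Hello world")
doc._.hello = True
# The value is stored under a tuple key that includes the string "._."
# and the attribute name
print([key for key in doc.user_data.keys() if "._." in key])
```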

</Accordion>

---

There are three main types of extensions, which can be defined using the
[`Doc.set_extension`](/api/doc#set_extension),
[`Span.set_extension`](/api/span#set_extension) and
[`Token.set_extension`](/api/token#set_extension) methods.

1. **Attribute extensions.** Set a default value for an attribute, which can be
   overwritten manually at any time. Attribute extensions work like "normal"
   variables and are the quickest way to store arbitrary information on a
   `Doc`, `Span` or `Token`. Attribute defaults behave just like argument
   defaults
   [in Python functions](http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments),
   and should not be used for mutable values like dictionaries or lists.

   ```python
   Doc.set_extension("hello", default=True)
   assert doc._.hello
   doc._.hello = False
   ```

2. **Property extensions.** Define a getter and an optional setter function. If
   no setter is provided, the extension is immutable. Since the getter and
   setter functions are only called when you _retrieve_ the attribute, you can
   also access values of previously added attribute extensions. For example, a
   `Doc` getter can average over `Token` attributes. For `Span` extensions,
   you'll almost always want to use a property – otherwise, you'd have to write
   to _every possible_ `Span` in the `Doc` to set up the values correctly.

   ```python
   Doc.set_extension("hello", getter=get_hello_value, setter=set_hello_value)
   assert doc._.hello
   doc._.hello = "Hi!"
   ```

3. **Method extensions.** Assign a function that becomes available as an object
   method. Method extensions are always immutable. For more details and
   implementation ideas, see
   [these examples](/usage/examples#custom-components-attr-methods).

   ```python
   Doc.set_extension("hello", method=lambda doc, name: "Hi {}!".format(name))
   assert doc._.hello("Bob") == "Hi Bob!"
   ```

Before you can access a custom extension, you need to register it using the
`set_extension` method on the object you want to add it to, e.g. the `Doc`.
Keep in mind that extensions are always **added globally** and not just on a
particular instance. If an attribute of the same name already exists, or if
you're trying to access an attribute that hasn't been registered, spaCy will
raise an `AttributeError`.

```python
### Example
from spacy.tokens import Doc, Span, Token

fruits = [u"apple", u"pear", u"banana", u"orange", u"strawberry"]
is_fruit_getter = lambda token: token.text in fruits
has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])

Token.set_extension("is_fruit", getter=is_fruit_getter)
Doc.set_extension("has_fruit", getter=has_fruit_getter)
Span.set_extension("has_fruit", getter=has_fruit_getter)
```

> #### Usage example
>
> ```python
> doc = nlp(u"I have an apple and a melon")
> assert doc[3]._.is_fruit      # get Token attributes
> assert not doc[0]._.is_fruit
> assert doc._.has_fruit        # get Doc attributes
> assert doc[1:4]._.has_fruit   # get Span attributes
> ```

Once you've registered your custom attribute, you can also use the built-in
`set`, `get` and `has` methods to modify and retrieve the attributes. This is
especially useful if you want to pass in a string instead of calling
`doc._.my_attr`.
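
A quick sketch of what that looks like, reusing the `"hello"` attribute
registered in the earlier examples:

```python
attr_name = "hello"                  # the attribute name as a string
if doc._.has(attr_name):             # True if the extension has been registered
    value = doc._.get(attr_name)     # same as doc._.hello
    doc._.set(attr_name, not value)  # same as doc._.hello = not value
```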

<Infobox title="Using mutable default values" variant="danger">

When using **mutable values** like dictionaries or lists as the `default`
argument, keep in mind that they behave just like mutable default arguments
[in Python functions](http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments).
This can easily cause unintended results, like the same value being set on
_all_ objects instead of only one particular instance. In most cases, it's
better to use **getters and setters**, and only set the `default` for boolean
or string values.

```diff
+ Doc.set_extension('fruits', getter=get_fruits, setter=set_fruits)

- Doc.set_extension('fruits', default={})
- doc._.fruits['apple'] = u'🍎'  # all docs now have {'apple': u'🍎'}
```

</Infobox>

### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}

This example shows the implementation of a pipeline component that fetches
country meta data via the [REST Countries API](https://restcountries.eu), sets
entity annotations for countries, merges entities into one token and sets
custom attributes on the `Doc`, `Span` and `Token` – for example, the capital,
latitude/longitude coordinates and even the country flag.

```python
https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_countries_api.py
```

In this case, all data can be fetched on initialization in one request.
However, if you're working with text that contains incomplete country names,
spelling mistakes or foreign-language versions, you could also implement a
`like_country`-style getter function that makes a request to the search API
endpoint and returns the best-matching result.
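
A rough sketch of what such a getter could look like – the endpoint URL and
matching logic are only illustrative, and you'd probably want to add caching
and error handling:

```python
import requests
from spacy.tokens import Token

def like_country(token):
    # Query the name search endpoint and treat any hit as a match
    response = requests.get("https://restcountries.eu/rest/v2/name/" + token.text)
    return response.status_code == 200 and len(response.json()) > 0

Token.set_extension("like_country", getter=like_country)
```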

### User hooks {#custom-components-user-hooks}

While it's generally recommended to use the `Doc._`, `Span._` and `Token._`
proxies to add your own custom attributes, spaCy offers a few exceptions to
allow **customizing the built-in methods** like
[`Doc.similarity`](/api/doc#similarity) or [`Doc.vector`](/api/doc#vector) with
your own hooks, which can rely on statistical models you train yourself. For
instance, you can provide your own on-the-fly sentence segmentation algorithm
or document similarity method.

Hooks let you customize some of the behaviors of the `Doc`, `Span` or `Token`
objects by adding a component to the pipeline. For instance, to customize the
[`Doc.similarity`](/api/doc#similarity) method, you can add a component that
sets a custom function to `doc.user_hooks['similarity']`. The built-in
`Doc.similarity` method will check the `user_hooks` dict, and delegate to your
function if you've set one. Similar results can be achieved by setting
functions to `Doc.user_span_hooks` and `Doc.user_token_hooks`.

> #### Implementation note
>
> The hooks live on the `Doc` object because the `Span` and `Token` objects
> are created lazily, and don't own any data. They just proxy to their parent
> `Doc`. This turns out to be convenient here – we only have to worry about
> installing hooks in one place.

| Name               | Customizes                                                                                                                                                                                                                |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `user_hooks`       | [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents)                                                                         |
| `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts)    |
| `user_span_hooks`  | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root)                        |

```python
### Add custom similarity hooks
class SimilarityModel(object):
    def __init__(self, model):
        self._model = model

    def __call__(self, doc):
        doc.user_hooks["similarity"] = self.similarity
        doc.user_span_hooks["similarity"] = self.similarity
        doc.user_token_hooks["similarity"] = self.similarity
        return doc

    def similarity(self, obj1, obj2):
        y = self._model([obj1.vector, obj2.vector])
        return float(y[0])
```
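
Hypothetical usage could then look like this, where `my_similarity_model`
stands in for whatever model object you load yourself:

```python
similarity_hooks = SimilarityModel(my_similarity_model)
nlp.add_pipe(similarity_hooks, first=True)  # run early so the hooks are set on every Doc

doc1 = nlp(u"The quick brown fox")
doc2 = nlp(u"A fast auburn fox")
print(doc1.similarity(doc2))  # now delegates to SimilarityModel.similarity
```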

## Developing plugins and wrappers {#plugins}

We're very excited about all the new possibilities for community extensions
and plugins in spaCy v2.0, and we can't wait to see what you build with it! To
get you started, here are a few tips, tricks and best practices.
[See here](/universe/?category=pipeline) for examples of other spaCy
extensions.

### Usage ideas {#custom-components-usage-ideas}

- **Adding new features and hooking in models.** For example, a sentiment
  analysis model, or your preferred solution for lemmatization. spaCy's
  built-in tagger, parser and entity recognizer respect annotations that were
  already set on the `Doc` in a previous step of the pipeline.
- **Integrating other libraries and APIs.** For example, your pipeline
  component can write additional information and data directly to the `Doc` or
  `Token` as custom attributes, while making sure no information is lost in the
  process. This can be output generated by other libraries and models, or an
  external service with a REST API.
- **Debugging and logging.** For example, a component which stores and/or
  exports relevant information about the current state of the processed
  document, and which can be inserted at any point of your pipeline.

### Best practices {#custom-components-best-practices}

Extensions can claim their own `._` namespace and exist as standalone packages.
If you're developing a tool or library and want to make it easy for others to
use it with spaCy and add it to their pipeline, all you have to do is expose a
function that takes a `Doc`, modifies it and returns it.

- Make sure to choose a **descriptive and specific name** for your pipeline
  component class, and set it as its `name` attribute. Avoid names that are too
  common or likely to clash with built-in or a user's other custom components.
  While it's fine to call your package `"spacy_my_extension"`, avoid component
  names including `"spacy"`, since this can easily lead to confusion.

  ```diff
  + name = "myapp_lemmatizer"
  - name = "lemmatizer"
  ```

- When writing to `Doc`, `Token` or `Span` objects, **use getter functions**
  wherever possible, and avoid setting values explicitly. Tokens and spans
  don't own any data themselves, and they're implemented as C extension classes
  – so you can't usually add new attributes to them like you could with most
  pure Python objects.

  ```diff
  + is_fruit = lambda token: token.text in ("apple", "orange")
  + Token.set_extension("is_fruit", getter=is_fruit)

  - token._.set_extension("is_fruit", default=False)
  - if token.text in ("apple", "orange"):
  -     token._.set("is_fruit", True)
  ```

- Always add your custom attributes to the **global** `Doc`, `Token` or `Span`
  objects, not a particular instance of them. Add the attributes **as early as
  possible**, e.g. in your extension's `__init__` method or in the global scope
  of your module. This means that in the case of namespace collisions, the user
  will see an error immediately, not just when they run their pipeline.

  ```diff
  + from spacy.tokens import Doc
  + def __init__(attr="my_attr"):
  +     Doc.set_extension(attr, getter=self.get_doc_attr)

  - def __call__(doc):
  -     doc.set_extension("my_attr", getter=self.get_doc_attr)
  ```

- If your extension is setting properties on the `Doc`, `Token` or `Span`,
  include an option to **let the user change those attribute names**. This
  makes it easier to avoid namespace collisions and accommodate users with
  different naming preferences. We recommend adding an `attrs` argument to the
  `__init__` method of your class so you can write the names to class
  attributes and reuse them across your component.

  ```diff
  + Doc.set_extension(self.doc_attr, default="some value")
  - Doc.set_extension("my_doc_attr", default="some value")
  ```

- Ideally, extensions should be **standalone packages** with spaCy and
  optionally, other packages specified as a dependency. They can freely assign
  to their own `._` namespace, but should stick to that. If your extension's
  only job is to provide a better `.similarity` implementation, and your docs
  state this explicitly, there's no problem with writing to the
  [`user_hooks`](#custom-components-user-hooks) and overwriting spaCy's
  built-in method. However, a third-party extension should **never silently
  overwrite built-ins**, or attributes set by other extensions.

- If you're looking to publish a model that depends on a custom pipeline
  component, you can either **require it** in the model package's dependencies,
  or – if the component is specific and lightweight – choose to **ship it with
  your model package** and add it to the `Language` instance returned by the
  model's `load()` method. For examples of this, check out the implementations
  of spaCy's
  [`load_model_from_init_py`](/api/top-level#util.load_model_from_init_py) and
  [`load_model_from_path`](/api/top-level#util.load_model_from_path) utility
  functions.

  ```diff
  + nlp.add_pipe(my_custom_component)
  + return nlp.from_disk(model_path)
  ```

- Once you're ready to share your extension with others, make sure to **add
  docs and installation instructions** (you can always link to this page for
  more info). Make it easy for others to install and use your extension, for
  example by uploading it to [PyPI](https://pypi.python.org). If you're sharing
  your code on GitHub, don't forget to tag it with
  [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
  [`spacy-extension`](https://github.com/topics/spacy-extension?o=desc&s=stars)
  to help people find it. If you post it on Twitter, feel free to tag
  [@spacy_io](https://twitter.com/spacy_io) so we can check it out.

### Wrapping other models and libraries {#wrapping-models-libraries}

Let's say you have a custom entity recognizer that takes a list of strings and
returns their [BILUO tags](/api/annotation#biluo). Given an input like
`["A", "text", "about", "Facebook"]`, it will predict and return
`["O", "O", "O", "U-ORG"]`. To integrate it into your spaCy pipeline and make
it add those entities to the `doc.ents`, you can wrap it in a custom pipeline
component function and pass it the token texts from the `Doc` object received
by the component.

The [`gold.spans_from_biluo_tags`](/api/goldparse#spans_from_biluo_tags) helper
is very useful here, because it takes a `Doc` object and token-based BILUO tags
and returns a sequence of `Span` objects in the `Doc` with added labels. So all
your wrapper has to do is compute the entity spans and overwrite the
`doc.ents`.

> #### How the doc.ents work
>
> When you add spans to the `doc.ents`, spaCy will automatically resolve them
> back to the underlying tokens and set the `Token.ent_type` and
> `Token.ent_iob` attributes. By definition, each token can only be part of one
> entity, so overlapping entity spans are not allowed.

```python
### {highlight="1,6-7"}
import your_custom_entity_recognizer
from spacy.gold import spans_from_biluo_tags

def custom_ner_wrapper(doc):
    words = [token.text for token in doc]
    custom_entities = your_custom_entity_recognizer(words)
    doc.ents = spans_from_biluo_tags(doc, custom_entities)
    return doc
```

The `custom_ner_wrapper` can then be added to the pipeline of a blank model
using [`nlp.add_pipe`](/api/language#add_pipe). You can also replace the
existing entity recognizer of a pre-trained model with
[`nlp.replace_pipe`](/api/language#replace_pipe).
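
For example, sketching both options – the model name is just an example:

```python
import spacy

# Add the wrapper to a blank pipeline ...
nlp = spacy.blank("en")
nlp.add_pipe(custom_ner_wrapper)

# ... or swap out the statistical entity recognizer of a pre-trained model
nlp = spacy.load("en_core_web_sm")
nlp.replace_pipe("ner", custom_ner_wrapper)
```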

Here's another example of a custom model, `your_custom_model`, that takes a
list of tokens and returns lists of fine-grained part-of-speech tags,
coarse-grained part-of-speech tags, dependency labels and head token indices.
Here, we can use the [`Doc.from_array`](/api/doc#from_array) method to create a
new `Doc` object using those values. To create a numpy array we need integers,
so we can look up the string labels in the [`StringStore`](/api/stringstore).
The [`doc.vocab.strings.add`](/api/stringstore#add) method comes in handy here,
because it returns the integer ID of the string _and_ makes sure it's added to
the vocab. This is especially important if the custom model uses a different
label scheme than spaCy's default models.

> #### Example: spacy-stanfordnlp
>
> For an example of an end-to-end wrapper for statistical tokenization, tagging
> and parsing, check out
> [`spacy-stanfordnlp`](https://github.com/explosion/spacy-stanfordnlp). It
> uses a very similar approach to the example in this section – the only
> difference is that it fully replaces the `nlp` object instead of providing a
> pipeline component, since it also needs to handle tokenization.

```python
### {highlight="1,9,15-17"}
import your_custom_model
from spacy.symbols import POS, TAG, DEP, HEAD
from spacy.tokens import Doc
import numpy

def custom_model_wrapper(doc):
    words = [token.text for token in doc]
    spaces = [token.whitespace_ for token in doc]
    pos, tags, deps, heads = your_custom_model(words)
    # Convert the strings to integers and add them to the string store
    pos = [doc.vocab.strings.add(label) for label in pos]
    tags = [doc.vocab.strings.add(label) for label in tags]
    deps = [doc.vocab.strings.add(label) for label in deps]
    # Create a new Doc from a numpy array
    attrs = [POS, TAG, DEP, HEAD]
    arr = numpy.array(list(zip(pos, tags, deps, heads)), dtype="uint64")
    new_doc = Doc(doc.vocab, words=words, spaces=spaces).from_array(attrs, arr)
    return new_doc
```

<Infobox title="Sentence boundaries and heads" variant="warning">

If you create a `Doc` object with dependencies and heads, spaCy is able to
resolve the sentence boundaries automatically. However, note that the `HEAD`
value used to construct a `Doc` is the token index **relative** to the current
token – e.g. `-1` for the previous token. The CoNLL format typically annotates
heads as `1`-indexed absolute indices with `0` indicating the root. If that's
the case in your annotations, you need to convert them first:

```python
heads = [2, 0, 4, 2, 2]
new_heads = [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]
```

</Infobox>

<Infobox title="📖 Advanced usage, serialization and entry points">

For more details on how to write and package custom components, make them
available to spaCy via entry points and implement your own serialization
methods, check out the usage guide on
[saving and loading](/usage/saving-loading).

</Infobox>