Update docs [ci skip]

Ines Montani 2020-10-03 14:47:02 +02:00
parent 5413358ba1
commit 5fb776556a
3 changed files with 250 additions and 94 deletions


@ -3,8 +3,11 @@ title: Language Processing Pipelines
next: /usage/embeddings-transformers
menu:
- ['Processing Text', 'processing']
- ['How Pipelines Work', 'pipelines']
- ['Pipelines & Components', 'pipelines']
- ['Custom Components', 'custom-components']
- ['Component Data', 'component-data']
- ['Type Hints & Validation', 'type-hints']
- ['Trainable Components', 'trainable-components']
- ['Extension Attributes', 'custom-components-attributes']
- ['Plugins & Wrappers', 'plugins']
---
@ -89,26 +92,27 @@ have to call `list()` on it first:
</Infobox>
## How pipelines work {#pipelines}
## Pipelines and built-in components {#pipelines}
spaCy makes it very easy to create your own pipelines consisting of reusable
components: this includes spaCy's default tagger, parser and entity recognizer,
but also your own custom processing functions. A pipeline component can be added
to an already existing `nlp` object, specified when initializing a `Language`
class, or defined within a [pipeline package](/usage/saving-loading#models).
to an already existing `nlp` object, specified when initializing a
[`Language`](/api/language) class, or defined within a
[pipeline package](/usage/saving-loading#models).
> #### config.cfg (excerpt)
>
> ```ini
> [nlp]
> lang = "en"
> pipeline = ["tagger", "parser"]
> pipeline = ["tok2vec", "parser"]
>
> [components]
>
> [components.tagger]
> factory = "tagger"
> # Settings for the tagger component
> [components.tok2vec]
> factory = "tok2vec"
> # Settings for the tok2vec component
>
> [components.parser]
> factory = "parser"
@ -140,7 +144,7 @@ nlp = spacy.load("en_core_web_sm")
```
... the pipeline's `config.cfg` tells spaCy to use the language `"en"` and the
pipeline `["tagger", "parser", "ner"]`. spaCy will then initialize
pipeline `["tok2vec", "tagger", "parser", "ner"]`. spaCy will then initialize
`spacy.lang.en.English`, and create each pipeline component and add it to the
processing pipeline. It'll then load in the model data from the data directory
and return the modified `Language` class for you to use as the `nlp` object.
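Put as a rough sketch in code, the loading steps described above look roughly like this (the data path is a placeholder, and the real loading logic also validates and merges the config):

```python
import spacy

lang_cls = spacy.util.get_lang_class("en")   # 1. Get the Language subclass, e.g. English
nlp = lang_cls()                             # 2. Initialize it
for name in ["tok2vec", "tagger", "parser", "ner"]:
    nlp.add_pipe(name)                       # 3. Create each component and add it to the pipeline
nlp.from_disk("/path/to/en_core_web_sm")     # 4. Load in the binary model data
```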
@ -739,6 +743,64 @@ make your factory a separate function. That's also how spaCy does it internally.
</Accordion>
### Language-specific factories {#factories-language new="3"}
There are many use cases where you might want your pipeline components to be
language-specific. Sometimes this requires an entirely different implementation per
language; sometimes the only difference is in the settings or data. spaCy allows
you to register factories of the **same name** on both the `Language` base
class, as well as its **subclasses** like `English` or `German`. Factories are
resolved starting with the specific subclass. If the subclass doesn't define a
component of that name, spaCy will check the `Language` base class.
Here's an example of a pipeline component that overwrites the normalized form of
a token, the `Token.norm_`, with an entry from a language-specific lookup table.
It's registered twice under the name `"token_normalizer"`: once using
`@English.factory` and once using `@German.factory`:
```python
### {executable="true"}
from spacy.lang.en import English
from spacy.lang.de import German

class TokenNormalizer:
    def __init__(self, norm_table):
        self.norm_table = norm_table

    def __call__(self, doc):
        for token in doc:
            # Overwrite the token.norm_ if there's an entry in the data
            token.norm_ = self.norm_table.get(token.text, token.norm_)
        return doc

@English.factory("token_normalizer")
def create_en_normalizer(nlp, name):
    return TokenNormalizer({"realise": "realize", "colour": "color"})

@German.factory("token_normalizer")
def create_de_normalizer(nlp, name):
    return TokenNormalizer({"daß": "dass", "wußte": "wusste"})

nlp_en = English()
nlp_en.add_pipe("token_normalizer")  # uses the English factory
print([token.norm_ for token in nlp_en("realise colour daß wußte")])

nlp_de = German()
nlp_de.add_pipe("token_normalizer")  # uses the German factory
print([token.norm_ for token in nlp_de("realise colour daß wußte")])
```
<Infobox title="Implementation details">
Under the hood, language-specific factories are added to the
[`factories` registry](/api/top-level#registry) prefixed with the language code,
e.g. `"en.token_normalizer"`. When resolving the factory in
[`nlp.add_pipe`](/api/language#add_pipe), spaCy first checks for a
language-specific version of the factory using `nlp.lang` and if none is
available, falls back to looking up the regular factory name.
</Infobox>
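As a quick illustration of that lookup, you could inspect the registry yourself. This is only a sketch: it assumes the factories from the example above have been registered in the current session and that the `factories` registry exposes its entries via `get_all()`:

```python
from spacy import registry

# Per the implementation note above, the language-specific factories should
# appear under their prefixed names
factories = registry.factories.get_all()
print("en.token_normalizer" in factories)
print("de.token_normalizer" in factories)
```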
### Example: Stateful component with settings {#example-stateful-components}
This example shows a **stateful** pipeline component for handling acronyms:
@ -808,34 +870,47 @@ doc = nlp("LOL, be right back")
print(doc._.acronyms)
```
## Initializing and serializing component data {#component-data}
Many stateful components depend on **data resources** like dictionaries and
lookup tables that should ideally be **configurable**. For example, it makes
sense to make the `DICTIONARY` and argument of the registered function, so the
`AcronymComponent` can be re-used with different data. One logical solution
would be to make it an argument of the component factory, and allow it to be
initialized with different dictionaries.
sense to make the `DICTIONARY` in the above example an argument of the
registered function, so the `AcronymComponent` can be re-used with different
data. One logical solution would be to make it an argument of the component
factory, and allow it to be initialized with different dictionaries.
> #### Example
>
> Making the data an argument of the registered function would result in output
> like this in your `config.cfg`, which is typically not what you want (and only
> works for JSON-serializable data).
> #### config.cfg
>
> ```ini
> [components.acronyms.dictionary]
> [components.acronyms.data]
> # 🚨 Problem: you don't want the data in the config
> lol = "laugh out loud"
> brb = "be right back"
> ```
```python
@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
    # 🚨 Problem: data ends up in the config file
    return AcronymComponent(nlp, data, case_sensitive)
```
However, passing in the dictionary directly is problematic, because it means
that if a component saves out its config and settings, the
[`config.cfg`](/usage/training#config) will include a dump of the entire data,
since that's the config the component was created with.
since that's the config the component was created with. It will also fail if the
data is not JSON-serializable.
```diff
DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
- default_config = {"dictionary": DICTIONARY}
```
### Option 1: Using a registered function {#component-data-function}
<Infobox>
- ✅ **Pros:** can load anything in Python, easy to add to and configure via
config
- ❌ **Cons:** requires the function and its dependencies to be available at
runtime
</Infobox>
If what you're passing in isn't JSON-serializable, e.g. a custom object like a
[model](#trainable-components), saving out the component config becomes
@ -877,7 +952,7 @@ result of the registered function is passed in as the key `"dictionary"`.
> [components.acronyms]
> factory = "acronyms"
>
> [components.acronyms.dictionary]
> [components.acronyms.data]
> @misc = "acronyms.slang_dict.v1"
> ```
@ -895,11 +970,135 @@ the name. Registered functions can also take **arguments** by the way that can
be defined in the config as well; you can read more about this in the docs on
[training with custom code](/usage/training#custom-code).
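For reference, a registered function matching the `@misc = "acronyms.slang_dict.v1"` entry in the config above could look roughly like this. The function name and data are illustrative:

```python
import spacy

@spacy.registry.misc("acronyms.slang_dict.v1")
def create_acronym_dict():
    # The data could also be loaded from a file or an external source here
    return {"lol": "laughing out loud", "brb": "be right back"}
```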
### Initializing components with data {#initialization}
### Option 2: Save data with the pipeline and load it in once on initialization {#component-data-initialization}
<!-- TODO: -->
<Infobox>
### Python type hints and pydantic validation {#type-hints new="3"}
- ✅ **Pros:** lets components save and load their own data and reflect user
changes, load in data assets before training without depending on them at
runtime
- ❌ **Cons:** requires more component methods, more complex config and data
flow
</Infobox>
Just like models save out their binary weights when you call
[`nlp.to_disk`](/api/language#to_disk), components can also **serialize** any
other data assets, for instance an acronym dictionary. If a pipeline component
implements its own `to_disk` and `from_disk` methods, those will be called
automatically by `nlp.to_disk` and will receive the path to the directory to
save to or load from. The component can then perform any custom saving or
loading. If a user makes changes to the component data, they will be reflected
when the `nlp` object is saved. For more examples of this, see the usage guide
on [serialization methods](/usage/saving-loading/#serialization-methods).
> #### About the data path
>
> The `path` argument spaCy passes to the serialization methods consists of the
> path provided by the user, plus a directory named after the component. This means
> that when you call `nlp.to_disk("/path")`, the `acronyms` component will
> receive the directory path `/path/acronyms` and can then create files in this
> directory.
```python
### Custom serialization methods {highlight="6-7,9-11"}
import srsly

class AcronymComponent:
    # other methods here...
    def to_disk(self, path, exclude=tuple()):
        srsly.write_json(path / "data.json", self.data)

    def from_disk(self, path, exclude=tuple()):
        self.data = srsly.read_json(path / "data.json")
        return self
```
Now the component can save to and load from a directory. The only remaining
question: How do you **load in the initial data**? In Python, you could just
call the pipe's `from_disk` method yourself. But if you're adding the component
to your [training config](/usage/training#config), spaCy will need to know how
to set it up, from start to finish, including the data to initialize it with.
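Calling the method yourself is straightforward. A minimal sketch, assuming the `"acronyms"` factory from the earlier examples is registered and using a placeholder path:

```python
import spacy
from pathlib import Path

nlp = spacy.blank("en")
acronyms = nlp.add_pipe("acronyms")
# Load the previously saved data into the component directly
acronyms.from_disk(Path("/path/to/saved/acronyms"))
```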
While you could use a registered function or a file loader like
[`srsly.read_json.v1`](/api/top-level#file_readers) as an argument of the
component factory, this approach is problematic: the component factory runs
**every time the component is created**. This means it will run when creating
the `nlp` object before training, but also every time a user loads your pipeline. So
your runtime pipeline would either depend on a local path on your file system,
or the data would be loaded twice: once when the component is created, and then
again when it's loaded back in by `from_disk`.
> ```ini
> ### config.cfg
> [components.acronyms.data]
> # 🚨 Problem: Runtime pipeline depends on local path
> @readers = "srsly.read_json.v1"
> path = "/path/to/slang_dict.json"
> ```
>
> ```ini
> ### config.cfg
> [components.acronyms.data]
> # 🚨 Problem: this always runs
> @misc = "acronyms.slang_dict.v1"
> ```
```python
@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
    # 🚨 Problem: data will be loaded every time component is created
    return AcronymComponent(nlp, data, case_sensitive)
```
To solve this, your component can implement a separate method, `initialize`,
which will be called by [`nlp.initialize`](/api/language#initialize) if
available. This typically happens before training, but not at runtime when the
pipeline is loaded. For more background on this, see the usage guides on the
[config lifecycle](/usage/training#config-lifecycle) and
[custom initialization](/usage/training#initialization).
![Illustration of pipeline lifecycle](../images/lifecycle.svg)
A component's `initialize` method needs to take at least **two named
arguments**: a `get_examples` callback that gives it access to the training
examples, and the current `nlp` object. This is mostly used by trainable
components so they can initialize their models and label schemes from the data,
so we can ignore those arguments here. All **other arguments** on the method can
be defined via the config, in this case a dictionary `data`.
> #### config.cfg
>
> ```ini
> [initialize.components.my_component]
>
> [initialize.components.my_component.data]
> # ✅ This only runs on initialization
> @readers = "srsly.read_json.v1"
> path = "/path/to/slang_dict.json"
> ```
```python
### Custom initialize method {highlight="5-6"}
class AcronymComponent:
    def __init__(self):
        self.data = {}

    def initialize(self, get_examples=None, nlp=None, data={}):
        self.data = data
```
When [`nlp.initialize`](/api/language#initialize) runs before training (or when
you call it in your own code), the
[`[initialize]`](/api/data-formats#config-initialize) block of the config is
loaded and used to construct the `nlp` object. The custom acronym component will
then be passed the data loaded from the JSON file. After training, the `nlp`
object is saved to disk, which will run the component's `to_disk` method. When
the pipeline is loaded back into spaCy later to use it, the `from_disk` method
will load the data back in.
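Putting the pieces together, a rough end-to-end sketch of this lifecycle could look like the following. It assumes the `"acronyms"` factory and the `AcronymComponent` methods from the earlier examples are registered in the current session, and uses a placeholder output path:

```python
import spacy

nlp = spacy.blank("en")
acronyms = nlp.add_pipe("acronyms")
# Roughly what nlp.initialize would do via the [initialize] block
acronyms.initialize(data={"lol": "laughing out loud", "brb": "be right back"})
nlp.to_disk("./my_pipeline")        # runs the component's to_disk
nlp2 = spacy.load("./my_pipeline")  # runs the component's from_disk
print(nlp2.get_pipe("acronyms").data)
```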
## Python type hints and validation {#type-hints new="3"}
spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
@ -968,65 +1167,7 @@ nlp.add_pipe("debug", config={"log_level": "DEBUG"})
doc = nlp("This is a text...")
```
### Language-specific factories {#factories-language new="3"}
There are many use cases where you might want your pipeline components to be
language-specific. Sometimes this requires an entirely different implementation per
language; sometimes the only difference is in the settings or data. spaCy allows
you to register factories of the **same name** on both the `Language` base
class, as well as its **subclasses** like `English` or `German`. Factories are
resolved starting with the specific subclass. If the subclass doesn't define a
component of that name, spaCy will check the `Language` base class.
Here's an example of a pipeline component that overwrites the normalized form of
a token, the `Token.norm_`, with an entry from a language-specific lookup table.
It's registered twice under the name `"token_normalizer"`: once using
`@English.factory` and once using `@German.factory`:
```python
### {executable="true"}
from spacy.lang.en import English
from spacy.lang.de import German

class TokenNormalizer:
    def __init__(self, norm_table):
        self.norm_table = norm_table

    def __call__(self, doc):
        for token in doc:
            # Overwrite the token.norm_ if there's an entry in the data
            token.norm_ = self.norm_table.get(token.text, token.norm_)
        return doc

@English.factory("token_normalizer")
def create_en_normalizer(nlp, name):
    return TokenNormalizer({"realise": "realize", "colour": "color"})

@German.factory("token_normalizer")
def create_de_normalizer(nlp, name):
    return TokenNormalizer({"daß": "dass", "wußte": "wusste"})

nlp_en = English()
nlp_en.add_pipe("token_normalizer")  # uses the English factory
print([token.norm_ for token in nlp_en("realise colour daß wußte")])

nlp_de = German()
nlp_de.add_pipe("token_normalizer")  # uses the German factory
print([token.norm_ for token in nlp_de("realise colour daß wußte")])
```
<Infobox title="Implementation details">
Under the hood, language-specific factories are added to the
[`factories` registry](/api/top-level#registry) prefixed with the language code,
e.g. `"en.token_normalizer"`. When resolving the factory in
[`nlp.add_pipe`](/api/language#add_pipe), spaCy first checks for a
language-specific version of the factory using `nlp.lang` and if none is
available, falls back to looking up the regular factory name.
</Infobox>
### Trainable components {#trainable-components new="3"}
## Trainable components {#trainable-components new="3"}
spaCy's [`Pipe`](/api/pipe) class helps you implement your own trainable
components that have their own model instance, make predictions over `Doc`


@ -2,6 +2,7 @@
title: Saving and Loading
menu:
- ['Basics', 'basics']
- ['Serializing Docs', 'docs']
- ['Serialization Methods', 'serialization-methods']
- ['Entry Points', 'entry-points']
- ['Trained Pipelines', 'models']
@ -52,7 +53,7 @@ defined [factories](/usage/processing-pipeline#custom-components-factories) and
_then_ loads in the binary data. You can read more about this process
[here](/usage/processing-pipelines#pipelines).
### Serializing Doc objects efficiently {#docs new="2.2"}
## Serializing Doc objects efficiently {#docs new="2.2"}
If you're working with lots of data, you'll probably need to pass analyses
between machines, either to use something like [Dask](https://dask.org) or
@ -179,9 +180,20 @@ example, model weights or terminology lists you can take advantage of spaCy'
built-in component serialization by making your custom component expose its own
`to_disk` and `from_disk` or `to_bytes` and `from_bytes` methods. When an `nlp`
object with the component in its pipeline is saved or loaded, the component will
then be able to serialize and deserialize itself. The following example shows a
custom component that keeps arbitrary JSON-serializable data, allows the user to
add to that data and saves and loads the data to and from a JSON file.
then be able to serialize and deserialize itself.
<Infobox title="Custom components and data" emoji="📖">
For more details on how to work with pipeline components that depend on data
resources and manage data loading and initialization at training and runtime,
see the usage guide on initializing and serializing
[component data](/usage/processing-pipelines#component-data).
</Infobox>
The following example shows a custom component that keeps arbitrary
JSON-serializable data, allows the user to add to that data and saves and loads
the data to and from a JSON file.
> #### Real-world example
>
@ -208,13 +220,13 @@ class CustomComponent:
# Add something to the component's data
self.data.append(data)
def to_disk(self, path, **kwargs):
def to_disk(self, path, exclude=tuple()):
# This will receive the directory path + /my_component
data_path = path / "data.json"
with data_path.open("w", encoding="utf8") as f:
f.write(json.dumps(self.data))
def from_disk(self, path, **cfg):
def from_disk(self, path, exclude=tuple()):
# This will receive the directory path + /my_component
data_path = path / "data.json"
with data_path.open("r", encoding="utf8") as f:
@ -276,6 +288,8 @@ custom components to spaCy automatically.
</Infobox>
<!-- ## Initializing components with data {#initialization new="3"} -->
## Using entry points {#entry-points new="2.1"}
Entry points let you expose parts of a Python package you write to other Python


@ -819,7 +819,8 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:
### Customizing the initialization {#initialization}
<!-- TODO: -->
<Infobox title="This section is still under construction" emoji="🚧" variant="warning">
</Infobox>
## Data utilities {#data}