diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index c98bd08bc..3d0c7b7e9 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -3,8 +3,11 @@ title: Language Processing Pipelines
next: /usage/embeddings-transformers
menu:
- ['Processing Text', 'processing']
- - ['How Pipelines Work', 'pipelines']
+ - ['Pipelines & Components', 'pipelines']
- ['Custom Components', 'custom-components']
+ - ['Component Data', 'component-data']
+ - ['Type Hints & Validation', 'type-hints']
+ - ['Trainable Components', 'trainable-components']
- ['Extension Attributes', 'custom-components-attributes']
- ['Plugins & Wrappers', 'plugins']
---
@@ -89,26 +92,27 @@ have to call `list()` on it first:
-## How pipelines work {#pipelines}
+## Pipelines and built-in components {#pipelines}
spaCy makes it very easy to create your own pipelines consisting of reusable
components – this includes spaCy's default tagger, parser and entity recognizer,
but also your own custom processing functions. A pipeline component can be added
-to an already existing `nlp` object, specified when initializing a `Language`
-class, or defined within a [pipeline package](/usage/saving-loading#models).
+to an already existing `nlp` object, specified when initializing a
+[`Language`](/api/language) class, or defined within a
+[pipeline package](/usage/saving-loading#models).
> #### config.cfg (excerpt)
>
> ```ini
> [nlp]
> lang = "en"
-> pipeline = ["tagger", "parser"]
+> pipeline = ["tok2vec", "parser"]
>
> [components]
>
-> [components.tagger]
-> factory = "tagger"
-> # Settings for the tagger component
+> [components.tok2vec]
+> factory = "tok2vec"
+> # Settings for the tok2vec component
>
> [components.parser]
> factory = "parser"
@@ -140,7 +144,7 @@ nlp = spacy.load("en_core_web_sm")
```
... the pipeline's `config.cfg` tells spaCy to use the language `"en"` and the
-pipeline `["tagger", "parser", "ner"]`. spaCy will then initialize
+pipeline `["tok2vec", "tagger", "parser", "ner"]`. spaCy will then initialize
`spacy.lang.en.English`, and create each pipeline component and add it to the
processing pipeline. It'll then load in the model data from the data directory
and return the modified `Language` class for you to use as the `nlp` object.
@@ -739,6 +743,64 @@ make your factory a separate function. That's also how spaCy does it internally.
+### Language-specific factories {#factories-language new="3"}
+
+There are many use cases where you might want your pipeline components to be
+language-specific. Sometimes this requires an entirely different implementation
+per language, sometimes the only difference is in the settings or data. spaCy allows
+you to register factories of the **same name** on both the `Language` base
+class, as well as its **subclasses** like `English` or `German`. Factories are
+resolved starting with the specific subclass. If the subclass doesn't define a
+component of that name, spaCy will check the `Language` base class.
+
+Here's an example of a pipeline component that overwrites the normalized form of
+a token, the `Token.norm_`, with an entry from a language-specific lookup table.
+It's registered twice under the name `"token_normalizer"` – once using
+`@English.factory` and once using `@German.factory`:
+
+```python
+### {executable="true"}
+from spacy.lang.en import English
+from spacy.lang.de import German
+
+class TokenNormalizer:
+ def __init__(self, norm_table):
+ self.norm_table = norm_table
+
+ def __call__(self, doc):
+ for token in doc:
+ # Overwrite the token.norm_ if there's an entry in the data
+ token.norm_ = self.norm_table.get(token.text, token.norm_)
+ return doc
+
+@English.factory("token_normalizer")
+def create_en_normalizer(nlp, name):
+ return TokenNormalizer({"realise": "realize", "colour": "color"})
+
+@German.factory("token_normalizer")
+def create_de_normalizer(nlp, name):
+ return TokenNormalizer({"daß": "dass", "wußte": "wusste"})
+
+nlp_en = English()
+nlp_en.add_pipe("token_normalizer") # uses the English factory
+print([token.norm_ for token in nlp_en("realise colour daß wußte")])
+
+nlp_de = German()
+nlp_de.add_pipe("token_normalizer") # uses the German factory
+print([token.norm_ for token in nlp_de("realise colour daß wußte")])
+```
+
+
+
+Under the hood, language-specific factories are added to the
+[`factories` registry](/api/top-level#registry) prefixed with the language code,
+e.g. `"en.token_normalizer"`. When resolving the factory in
+[`nlp.add_pipe`](/api/language#add_pipe), spaCy first checks for a
+language-specific version of the factory using `nlp.lang` and if none is
+available, falls back to looking up the regular factory name.
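+
+For illustration, here's a minimal sketch of how you could check this yourself.
+It assumes the `token_normalizer` factories from the example above have been
+registered in the same session, and the exact registry contents will vary:
+
+```python
+### Inspecting the factories registry (sketch)
+import spacy
+
+# Language-specific factories are stored under language-prefixed names
+factory_names = spacy.registry.factories.get_all().keys()
+print("en.token_normalizer" in factory_names)  # True if registered above
+print("de.token_normalizer" in factory_names)  # True if registered above
+```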
+
+
+
### Example: Stateful component with settings {#example-stateful-components}
This example shows a **stateful** pipeline component for handling acronyms:
@@ -808,34 +870,47 @@ doc = nlp("LOL, be right back")
print(doc._.acronyms)
```
+## Initializing and serializing component data {#component-data}
+
Many stateful components depend on **data resources** like dictionaries and
lookup tables that should ideally be **configurable**. For example, it makes
-sense to make the `DICTIONARY` and argument of the registered function, so the
-`AcronymComponent` can be re-used with different data. One logical solution
-would be to make it an argument of the component factory, and allow it to be
-initialized with different dictionaries.
+sense to make the `DICTIONARY` in the above example an argument of the
+registered function, so the `AcronymComponent` can be re-used with different
+data. One logical solution would be to make it an argument of the component
+factory, and allow it to be initialized with different dictionaries.
-> #### Example
->
-> Making the data an argument of the registered function would result in output
-> like this in your `config.cfg`, which is typically not what you want (and only
-> works for JSON-serializable data).
+> #### config.cfg
>
> ```ini
-> [components.acronyms.dictionary]
+> [components.acronyms.data]
+> # 🚨 Problem: you don't want the data in the config
> lol = "laugh out loud"
> brb = "be right back"
> ```
+```python
+@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
+def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
+ # 🚨 Problem: data ends up in the config file
+ return AcronymComponent(nlp, data, case_sensitive)
+```
+
However, passing in the dictionary directly is problematic, because it means
that if a component saves out its config and settings, the
[`config.cfg`](/usage/training#config) will include a dump of the entire data,
-since that's the config the component was created with.
+since that's the config the component was created with. It will also fail if the
+data is not JSON-serializable.
-```diff
-DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
-- default_config = {"dictionary:" DICTIONARY}
-```
+### Option 1: Using a registered function {#component-data-function}
+
+
+
+- ✅ **Pros:** can load anything in Python, easy to add to and configure via
+ config
+- ❌ **Cons:** requires the function and its dependencies to be available at
+ runtime
+
+
If what you're passing in isn't JSON-serializable – e.g. a custom object like a
[model](#trainable-components) – saving out the component config becomes
@@ -877,7 +952,7 @@ result of the registered function is passed in as the key `"dictionary"`.
> [components.acronyms]
> factory = "acronyms"
>
-> [components.acronyms.dictionary]
+> [components.acronyms.data]
> @misc = "acronyms.slang_dict.v1"
> ```
@@ -895,11 +970,135 @@ the name. Registered functions can also take **arguments** by the way that can
be defined in the config as well – you can read more about this in the docs on
[training with custom code](/usage/training#custom-code).
-### Initializing components with data {#initialization}
+### Option 2: Save data with the pipeline and load it in once on initialization {#component-data-initialization}
-
+
-### Python type hints and pydantic validation {#type-hints new="3"}
+- ✅ **Pros:** lets components save and load their own data and reflect user
+ changes, load in data assets before training without depending on them at
+ runtime
+- ❌ **Cons:** requires more component methods, more complex config and data
+ flow
+
+
+
+Just like models save out their binary weights when you call
+[`nlp.to_disk`](/api/language#to_disk), components can also **serialize** any
+other data assets – for instance, an acronym dictionary. If a pipeline component
+implements its own `to_disk` and `from_disk` methods, those will be called
+automatically by `nlp.to_disk` and will receive the path to the directory to
+save to or load from. The component can then perform any custom saving or
+loading. If a user makes changes to the component data, they will be reflected
+when the `nlp` object is saved. For more examples of this, see the usage guide
+on [serialization methods](/usage/saving-loading/#serialization-methods).
+
+> #### About the data path
+>
+> The `path` argument spaCy passes to the serialization methods consists of the
+> path provided by the user, plus a directory named after the component. This means
+> that when you call `nlp.to_disk("/path")`, the `acronyms` component will
+> receive the directory path `/path/acronyms` and can then create files in this
+> directory.
+
+```python
+### Custom serialization methods {highlight="7-12,14-16"}
+import srsly
+from spacy.util import ensure_path
+
+class AcronymComponent:
+    # other methods here...
+
+    def to_disk(self, path, exclude=tuple()):
+        # Create the component's data directory if it doesn't exist yet
+        path = ensure_path(path)
+        if not path.exists():
+            path.mkdir()
+        srsly.write_json(path / "data.json", self.data)
+
+ def from_disk(self, path, exclude=tuple()):
+ self.data = srsly.read_json(path / "data.json")
+ return self
+```
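+
+As a quick illustration of the data path described above, here's a hypothetical
+usage sketch. It assumes an `acronyms` factory that creates the component is
+registered, as in the earlier examples:
+
+```python
+### Resulting directory layout (sketch)
+from spacy.lang.en import English
+
+nlp = English()
+nlp.add_pipe("acronyms")
+nlp.to_disk("./my_pipeline")
+# The component's to_disk method received the path ./my_pipeline/acronyms
+# and wrote ./my_pipeline/acronyms/data.json
+```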
+
+Now the component can save to and load from a directory. The only remaining
+question: How do you **load in the initial data**? In Python, you could just
+call the pipe's `from_disk` method yourself. But if you're adding the component
+to your [training config](/usage/training#config), spaCy will need to know how
+to set it up, from start to finish, including the data to initialize it with.
+
+While you could use a registered function or a file loader like
+[`srsly.read_json.v1`](/api/top-level#file_readers) as an argument of the
+component factory, this approach is problematic: the component factory runs
+**every time the component is created**. This means it will run when creating
+the `nlp` object before training, but also every time a user loads your
+pipeline. So your runtime pipeline would either depend on a local path on your
+file system, or the data would be loaded twice: once when the component is
+created, and then again when it's loaded back in by `from_disk`.
+
+> ```ini
+> ### config.cfg
+> [components.acronyms.data]
+> # 🚨 Problem: Runtime pipeline depends on local path
+> @readers = "srsly.read_json.v1"
+> path = "/path/to/slang_dict.json"
+> ```
+>
+> ```ini
+> ### config.cfg
+> [components.acronyms.data]
+> # 🚨 Problem: this always runs
+> @misc = "acronyms.slang_dict.v1"
+> ```
+
+```python
+@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
+def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
+ # 🚨 Problem: data will be loaded every time component is created
+ return AcronymComponent(nlp, data, case_sensitive)
+```
+
+To solve this, your component can implement a separate method, `initialize`,
+which will be called by [`nlp.initialize`](/api/language#initialize) if
+available. This typically happens before training, but not at runtime when the
+pipeline is loaded. For more background on this, see the usage guides on the
+[config lifecycle](/usage/training#config-lifecycle) and
+[custom initialization](/usage/training#initialization).
+
+
+
+A component's `initialize` method needs to take at least **two named
+arguments**: a `get_examples` callback that gives it access to the training
+examples, and the current `nlp` object. This is mostly used by trainable
+components so they can initialize their models and label schemes from the data,
+so we can ignore those arguments here. All **other arguments** on the method can
+be defined via the config – in this case a dictionary `data`.
+
+> #### config.cfg
+>
+> ```ini
+> [initialize.components.my_component]
+>
+> [initialize.components.my_component.data]
+> # ✅ This only runs on initialization
+> @readers = "srsly.read_json.v1"
+> path = "/path/to/slang_dict.json"
+> ```
+
+```python
+### Custom initialize method {highlight="5-6"}
+class AcronymComponent:
+ def __init__(self):
+ self.data = {}
+
+ def initialize(self, get_examples=None, nlp=None, data={}):
+ self.data = data
+```
+
+When [`nlp.initialize`](/api/language#initialize) runs before training (or when
+you call it in your own code), the
+[`[initialize]`](/api/data-formats#config-initialize) block of the config is
+loaded and used to initialize the `nlp` object. The custom acronym component will
+then be passed the data loaded from the JSON file. After training, the `nlp`
+object is saved to disk, which will run the component's `to_disk` method. When
+the pipeline is later loaded back into spaCy for use, the `from_disk` method
+will load the data back in.
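+
+To make this flow concrete, here's a condensed sketch in plain Python. It
+assumes an `acronyms` factory that creates an `AcronymComponent` with the
+`initialize`, `to_disk` and `from_disk` methods shown above, and it calls
+`initialize` directly instead of going through the `[initialize]` block of a
+training config:
+
+```python
+### Lifecycle sketch
+import spacy
+from spacy.lang.en import English
+
+nlp = English()
+acronyms = nlp.add_pipe("acronyms")
+# Stand-in for nlp.initialize reading the [initialize] block: pass the data directly
+acronyms.initialize(nlp=nlp, data={"lol": "laughing out loud", "brb": "be right back"})
+nlp.to_disk("./my_pipeline")        # runs the component's to_disk
+nlp2 = spacy.load("./my_pipeline")  # runs the component's from_disk
+print(nlp2.get_pipe("acronyms").data)
+```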
+
+## Python type hints and validation {#type-hints new="3"}
spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
@@ -968,65 +1167,7 @@ nlp.add_pipe("debug", config={"log_level": "DEBUG"})
doc = nlp("This is a text...")
```
-### Language-specific factories {#factories-language new="3"}
-
-There are many use case where you might want your pipeline components to be
-language-specific. Sometimes this requires entirely different implementation per
-language, sometimes the only difference is in the settings or data. spaCy allows
-you to register factories of the **same name** on both the `Language` base
-class, as well as its **subclasses** like `English` or `German`. Factories are
-resolved starting with the specific subclass. If the subclass doesn't define a
-component of that name, spaCy will check the `Language` base class.
-
-Here's an example of a pipeline component that overwrites the normalized form of
-a token, the `Token.norm_` with an entry from a language-specific lookup table.
-It's registered twice under the name `"token_normalizer"` – once using
-`@English.factory` and once using `@German.factory`:
-
-```python
-### {executable="true"}
-from spacy.lang.en import English
-from spacy.lang.de import German
-
-class TokenNormalizer:
- def __init__(self, norm_table):
- self.norm_table = norm_table
-
- def __call__(self, doc):
- for token in doc:
- # Overwrite the token.norm_ if there's an entry in the data
- token.norm_ = self.norm_table.get(token.text, token.norm_)
- return doc
-
-@English.factory("token_normalizer")
-def create_en_normalizer(nlp, name):
- return TokenNormalizer({"realise": "realize", "colour": "color"})
-
-@German.factory("token_normalizer")
-def create_de_normalizer(nlp, name):
- return TokenNormalizer({"daß": "dass", "wußte": "wusste"})
-
-nlp_en = English()
-nlp_en.add_pipe("token_normalizer") # uses the English factory
-print([token.norm_ for token in nlp_en("realise colour daß wußte")])
-
-nlp_de = German()
-nlp_de.add_pipe("token_normalizer") # uses the German factory
-print([token.norm_ for token in nlp_de("realise colour daß wußte")])
-```
-
-
-
-Under the hood, language-specific factories are added to the
-[`factories` registry](/api/top-level#registry) prefixed with the language code,
-e.g. `"en.token_normalizer"`. When resolving the factory in
-[`nlp.add_pipe`](/api/language#add_pipe), spaCy first checks for a
-language-specific version of the factory using `nlp.lang` and if none is
-available, falls back to looking up the regular factory name.
-
-
-
-### Trainable components {#trainable-components new="3"}
+## Trainable components {#trainable-components new="3"}
spaCy's [`Pipe`](/api/pipe) class helps you implement your own trainable
components that have their own model instance, make predictions over `Doc`
diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md
index f8a5eea2a..c19ff39eb 100644
--- a/website/docs/usage/saving-loading.md
+++ b/website/docs/usage/saving-loading.md
@@ -2,6 +2,7 @@
title: Saving and Loading
menu:
- ['Basics', 'basics']
+ - ['Serializing Docs', 'docs']
- ['Serialization Methods', 'serialization-methods']
- ['Entry Points', 'entry-points']
- ['Trained Pipelines', 'models']
@@ -52,7 +53,7 @@ defined [factories](/usage/processing-pipeline#custom-components-factories) and
_then_ loads in the binary data. You can read more about this process
[here](/usage/processing-pipelines#pipelines).
-### Serializing Doc objects efficiently {#docs new="2.2"}
+## Serializing Doc objects efficiently {#docs new="2.2"}
If you're working with lots of data, you'll probably need to pass analyses
between machines, either to use something like [Dask](https://dask.org) or
@@ -179,9 +180,20 @@ example, model weights or terminology lists – you can take advantage of spaCy'
built-in component serialization by making your custom component expose its own
`to_disk` and `from_disk` or `to_bytes` and `from_bytes` methods. When an `nlp`
object with the component in its pipeline is saved or loaded, the component will
-then be able to serialize and deserialize itself. The following example shows a
-custom component that keeps arbitrary JSON-serializable data, allows the user to
-add to that data and saves and loads the data to and from a JSON file.
+then be able to serialize and deserialize itself.
+
+
+
+For more details on how to work with pipeline components that depend on data
+resources and manage data loading and initialization at training and runtime,
+see the usage guide on initializing and serializing
+[component data](/usage/processing-pipelines#component-data).
+
+
+
+The following example shows a custom component that keeps arbitrary
+JSON-serializable data, allows the user to add to that data, and saves and
+loads the data to and from a JSON file.
> #### Real-world example
>
@@ -208,13 +220,13 @@ class CustomComponent:
# Add something to the component's data
self.data.append(data)
- def to_disk(self, path, **kwargs):
+ def to_disk(self, path, exclude=tuple()):
# This will receive the directory path + /my_component
data_path = path / "data.json"
with data_path.open("w", encoding="utf8") as f:
f.write(json.dumps(self.data))
- def from_disk(self, path, **cfg):
+ def from_disk(self, path, exclude=tuple()):
# This will receive the directory path + /my_component
data_path = path / "data.json"
with data_path.open("r", encoding="utf8") as f:
@@ -276,6 +288,8 @@ custom components to spaCy automatically.
+
+
## Using entry points {#entry-points new="2.1"}
Entry points let you expose parts of a Python package you write to other Python
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index 1dd57fd4a..74d2f6de5 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -819,7 +819,8 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:
### Customizing the initialization {#initialization}
-
+
+
## Data utilities {#data}