Mirror of https://github.com/explosion/spaCy.git, synced 2025-02-04 21:50:35 +03:00

Update docs [ci skip]

parent 5413358ba1 · commit 5fb776556a
@@ -3,8 +3,11 @@ title: Language Processing Pipelines
next: /usage/embeddings-transformers
menu:
  - ['Processing Text', 'processing']
  - ['Pipelines & Components', 'pipelines']
  - ['Custom Components', 'custom-components']
  - ['Component Data', 'component-data']
  - ['Type Hints & Validation', 'type-hints']
  - ['Trainable Components', 'trainable-components']
  - ['Extension Attributes', 'custom-components-attributes']
  - ['Plugins & Wrappers', 'plugins']
---
@@ -89,26 +92,27 @@ have to call `list()` on it first:

</Infobox>

## Pipelines and built-in components {#pipelines}

spaCy makes it very easy to create your own pipelines consisting of reusable
components – this includes spaCy's default tagger, parser and entity recognizer,
but also your own custom processing functions. A pipeline component can be added
to an already existing `nlp` object, specified when initializing a
[`Language`](/api/language) class, or defined within a
[pipeline package](/usage/saving-loading#models).

> #### config.cfg (excerpt)
>
> ```ini
> [nlp]
> lang = "en"
> pipeline = ["tok2vec", "parser"]
>
> [components]
>
> [components.tok2vec]
> factory = "tok2vec"
> # Settings for the tok2vec component
>
> [components.parser]
> factory = "parser"
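To make the relationship between the config and the Python API concrete, here's
a minimal sketch of assembling the same pipeline in code (the components stay
uninitialized until training or `nlp.initialize`):

```python
import spacy

# A rough sketch: build the pipeline the config excerpt above describes.
nlp = spacy.blank("en")   # [nlp] lang = "en"
nlp.add_pipe("tok2vec")   # [components.tok2vec] factory = "tok2vec"
nlp.add_pipe("parser")    # [components.parser] factory = "parser"
print(nlp.pipe_names)     # ['tok2vec', 'parser']
```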
@@ -140,7 +144,7 @@ nlp = spacy.load("en_core_web_sm")
```

... the pipeline's `config.cfg` tells spaCy to use the language `"en"` and the
pipeline `["tok2vec", "tagger", "parser", "ner"]`. spaCy will then initialize
`spacy.lang.en.English`, and create each pipeline component and add it to the
processing pipeline. It'll then load in the model data from the data directory
and return the modified `Language` class for you to use as the `nlp` object.
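A quick way to verify this, assuming the `en_core_web_sm` package is installed,
is to inspect the loaded `nlp` object:

```python
import spacy

# Sketch: inspect what spacy.load assembled from the pipeline's config.
nlp = spacy.load("en_core_web_sm")
print(nlp.lang)        # 'en'
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', ..., 'ner']
```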
@@ -739,6 +743,64 @@ make your factory a separate function. That's also how spaCy does it internally.

</Accordion>

### Language-specific factories {#factories-language new="3"}

There are many use cases where you might want your pipeline components to be
language-specific. Sometimes this requires entirely different implementations per
language, sometimes the only difference is in the settings or data. spaCy allows
you to register factories of the **same name** on both the `Language` base
class, as well as its **subclasses** like `English` or `German`. Factories are
resolved starting with the specific subclass. If the subclass doesn't define a
component of that name, spaCy will check the `Language` base class.

Here's an example of a pipeline component that overwrites the normalized form of
a token, the `Token.norm_`, with an entry from a language-specific lookup table.
It's registered twice under the name `"token_normalizer"` – once using
`@English.factory` and once using `@German.factory`:

```python
### {executable="true"}
from spacy.lang.en import English
from spacy.lang.de import German

class TokenNormalizer:
    def __init__(self, norm_table):
        self.norm_table = norm_table

    def __call__(self, doc):
        for token in doc:
            # Overwrite the token.norm_ if there's an entry in the data
            token.norm_ = self.norm_table.get(token.text, token.norm_)
        return doc

@English.factory("token_normalizer")
def create_en_normalizer(nlp, name):
    return TokenNormalizer({"realise": "realize", "colour": "color"})

@German.factory("token_normalizer")
def create_de_normalizer(nlp, name):
    return TokenNormalizer({"daß": "dass", "wußte": "wusste"})

nlp_en = English()
nlp_en.add_pipe("token_normalizer")  # uses the English factory
print([token.norm_ for token in nlp_en("realise colour daß wußte")])

nlp_de = German()
nlp_de.add_pipe("token_normalizer")  # uses the German factory
print([token.norm_ for token in nlp_de("realise colour daß wußte")])
```

<Infobox title="Implementation details">

Under the hood, language-specific factories are added to the
[`factories` registry](/api/top-level#registry) prefixed with the language code,
e.g. `"en.token_normalizer"`. When resolving the factory in
[`nlp.add_pipe`](/api/language#add_pipe), spaCy first checks for a
language-specific version of the factory using `nlp.lang` and if none is
available, falls back to looking up the regular factory name.

</Infobox>
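The same mechanism can also provide a default. Here's a rough sketch – assuming
the `TokenNormalizer` class from the example above – of registering a fallback
factory on the `Language` base class, so languages without their own version
still get the component:

```python
from spacy.language import Language
from spacy.lang.fr import French

# Sketch: a factory on the Language base class acts as the default.
@Language.factory("token_normalizer")
def create_default_normalizer(nlp, name):
    return TokenNormalizer({})  # no language-specific entries

nlp_fr = French()
# No French-specific factory is registered, so spaCy falls back to the
# factory registered on the Language base class.
nlp_fr.add_pipe("token_normalizer")
```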
### Example: Stateful component with settings {#example-stateful-components}

This example shows a **stateful** pipeline component for handling acronyms:
@@ -808,34 +870,47 @@ doc = nlp("LOL, be right back")
print(doc._.acronyms)
```

## Initializing and serializing component data {#component-data}

Many stateful components depend on **data resources** like dictionaries and
lookup tables that should ideally be **configurable**. For example, it makes
sense to make the `DICTIONARY` in the above example an argument of the
registered function, so the `AcronymComponent` can be re-used with different
data. One logical solution would be to make it an argument of the component
factory, and allow it to be initialized with different dictionaries.

> #### config.cfg
>
> ```ini
> [components.acronyms.data]
> # 🚨 Problem: you don't want the data in the config
> lol = "laugh out loud"
> brb = "be right back"
> ```

```python
@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
    # 🚨 Problem: data ends up in the config file
    return AcronymComponent(nlp, data, case_sensitive)
```
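With a factory like this, a different dictionary can be passed in when the
component is added – a minimal sketch, assuming the `AcronymComponent` from the
example above has been registered:

```python
import spacy

# Sketch: the data is passed to the factory via the config.
nlp = spacy.blank("en")
nlp.add_pipe("acronyms", config={"data": {"lol": "laugh out loud"}})
```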
However, passing in the dictionary directly is problematic, because it means
that if a component saves out its config and settings, the
[`config.cfg`](/usage/training#config) will include a dump of the entire data,
since that's the config the component was created with. It will also fail if the
data is not JSON-serializable.

### Option 1: Using a registered function {#component-data-function}

<Infobox>

- ✅ **Pros:** can load anything in Python, easy to add to and configure via
  config
- ❌ **Cons:** requires the function and its dependencies to be available at
  runtime

</Infobox>

If what you're passing in isn't JSON-serializable – e.g. a custom object like a
[model](#trainable-components) – saving out the component config becomes
@@ -877,7 +952,7 @@ result of the registered function is passed in as the key `"dictionary"`.
> [components.acronyms]
> factory = "acronyms"
>
> [components.acronyms.data]
> @misc = "acronyms.slang_dict.v1"
> ```
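The registered function the `@misc` reference resolves to could look like the
following sketch (the dictionary contents are illustrative):

```python
import spacy

# Sketch: the function the config resolves via @misc. Its return value
# is passed in to the component factory.
@spacy.registry.misc("acronyms.slang_dict.v1")
def create_acronyms_slang_dict():
    return {"lol": "laughing out loud", "brb": "be right back"}
```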
@@ -895,11 +970,135 @@ the name. Registered functions can also take **arguments** by the way that can
be defined in the config as well – you can read more about this in the docs on
[training with custom code](/usage/training#custom-code).

### Option 2: Save data with the pipeline and load it in once on initialization {#component-data-initialization}

<Infobox>

- ✅ **Pros:** lets components save and load their own data and reflect user
  changes, load in data assets before training without depending on them at
  runtime
- ❌ **Cons:** requires more component methods, more complex config and data
  flow

</Infobox>
Just like models save out their binary weights when you call
[`nlp.to_disk`](/api/language#to_disk), components can also **serialize** any
other data assets – for instance, an acronym dictionary. If a pipeline component
implements its own `to_disk` and `from_disk` methods, those will be called
automatically by `nlp.to_disk` and will receive the path to the directory to
save to or load from. The component can then perform any custom saving or
loading. If a user makes changes to the component data, they will be reflected
when the `nlp` object is saved. For more examples of this, see the usage guide
on [serialization methods](/usage/saving-loading/#serialization-methods).

> #### About the data path
>
> The `path` argument spaCy passes to the serialization methods consists of the
> path provided by the user, plus a directory of the component name. This means
> that when you call `nlp.to_disk("/path")`, the `acronyms` component will
> receive the directory path `/path/acronyms` and can then create files in this
> directory.

```python
### Custom serialization methods {highlight="6-7,9-11"}
import srsly

class AcronymComponent:
    # other methods here...

    def to_disk(self, path, exclude=tuple()):
        srsly.write_json(path / "data.json", self.data)

    def from_disk(self, path, exclude=tuple()):
        self.data = srsly.read_json(path / "data.json")
        return self
```
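In practice this means the data now travels with the saved pipeline – a sketch
with illustrative paths, assuming an `nlp` object with the `acronyms` component:

```python
import spacy

# Sketch: saving the pipeline writes the component's data file, and
# loading it back calls from_disk automatically.
nlp.to_disk("/path/to/pipeline")       # writes /path/to/pipeline/acronyms/data.json
nlp = spacy.load("/path/to/pipeline")  # from_disk reads the JSON back in
```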
Now the component can save to and load from a directory. The only remaining
question: How do you **load in the initial data**? In Python, you could just
call the pipe's `from_disk` method yourself. But if you're adding the component
to your [training config](/usage/training#config), spaCy will need to know how
to set it up, from start to finish, including the data to initialize it with.

While you could use a registered function or a file loader like
[`srsly.read_json.v1`](/api/top-level#file_readers) as an argument of the
component factory, this approach is problematic: the component factory runs
**every time the component is created**. This means it will run when creating
the `nlp` object before training, but also every time a user loads your
pipeline. So your runtime pipeline would either depend on a local path on your
file system, or the data would be loaded twice: once when the component is
created, and then again when `from_disk` loads it in.

> ```ini
> ### config.cfg
> [components.acronyms.data]
> # 🚨 Problem: Runtime pipeline depends on local path
> @readers = "srsly.read_json.v1"
> path = "/path/to/slang_dict.json"
> ```
>
> ```ini
> ### config.cfg
> [components.acronyms.data]
> # 🚨 Problem: this always runs
> @misc = "acronyms.slang_dict.v1"
> ```

```python
@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
    # 🚨 Problem: data will be loaded every time component is created
    return AcronymComponent(nlp, data, case_sensitive)
```
To solve this, your component can implement a separate method, `initialize`,
which will be called by [`nlp.initialize`](/api/language#initialize) if
available. This typically happens before training, but not at runtime when the
pipeline is loaded. For more background on this, see the usage guides on the
[config lifecycle](/usage/training#config-lifecycle) and
[custom initialization](/usage/training#initialization).

![Illustration of pipeline lifecycle](../images/lifecycle.svg)

A component's `initialize` method needs to take at least **two named
arguments**: a `get_examples` callback that gives it access to the training
examples, and the current `nlp` object. This is mostly used by trainable
components so they can initialize their models and label schemes from the data,
so we can ignore those arguments here. All **other arguments** on the method can
be defined via the config – in this case a dictionary `data`.

> #### config.cfg
>
> ```ini
> [initialize.components.my_component]
>
> [initialize.components.my_component.data]
> # ✅ This only runs on initialization
> @readers = "srsly.read_json.v1"
> path = "/path/to/slang_dict.json"
> ```

```python
### Custom initialize method {highlight="5-6"}
class AcronymComponent:
    def __init__(self):
        self.data = {}

    def initialize(self, get_examples=None, nlp=None, data={}):
        self.data = data
```
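Outside of a training config, you can also trigger this step yourself – a
sketch assuming the `acronyms` factory from this section; the component ignores
`get_examples`, so a stub callback is fine:

```python
import spacy

# Sketch: initialize the component directly in Python.
nlp = spacy.blank("en")
acronyms = nlp.add_pipe("acronyms")
acronyms.initialize(lambda: [], nlp=nlp, data={"lol": "laugh out loud"})
```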
When [`nlp.initialize`](/api/language#initialize) runs before training (or when
you call it in your own code), the
[`[initialize]`](/api/data-formats#config-initialize) block of the config is
loaded and used to construct the `nlp` object. The custom acronym component will
then be passed the data loaded from the JSON file. After training, the `nlp`
object is saved to disk, which will run the component's `to_disk` method. When
the pipeline is loaded back into spaCy later to use it, the `from_disk` method
will load the data back in.

## Python type hints and validation {#type-hints new="3"}

spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
@@ -968,65 +1167,7 @@ nlp.add_pipe("debug", config={"log_level": "DEBUG"})
doc = nlp("This is a text...")
```

## Trainable components {#trainable-components new="3"}

spaCy's [`Pipe`](/api/pipe) class helps you implement your own trainable
components that have their own model instance, make predictions over `Doc`
@@ -2,6 +2,7 @@
title: Saving and Loading
menu:
  - ['Basics', 'basics']
  - ['Serializing Docs', 'docs']
  - ['Serialization Methods', 'serialization-methods']
  - ['Entry Points', 'entry-points']
  - ['Trained Pipelines', 'models']
@@ -52,7 +53,7 @@ defined [factories](/usage/processing-pipeline#custom-components-factories) and
_then_ loads in the binary data. You can read more about this process
[here](/usage/processing-pipelines#pipelines).

## Serializing Doc objects efficiently {#docs new="2.2"}

If you're working with lots of data, you'll probably need to pass analyses
between machines, either to use something like [Dask](https://dask.org) or
@ -179,9 +180,20 @@ example, model weights or terminology lists – you can take advantage of spaCy'
|
||||||
built-in component serialization by making your custom component expose its own
|
built-in component serialization by making your custom component expose its own
|
||||||
`to_disk` and `from_disk` or `to_bytes` and `from_bytes` methods. When an `nlp`
|
`to_disk` and `from_disk` or `to_bytes` and `from_bytes` methods. When an `nlp`
|
||||||
object with the component in its pipeline is saved or loaded, the component will
|
object with the component in its pipeline is saved or loaded, the component will
|
||||||
then be able to serialize and deserialize itself. The following example shows a
|
then be able to serialize and deserialize itself.
|
||||||
custom component that keeps arbitrary JSON-serializable data, allows the user to
|
|
||||||
add to that data and saves and loads the data to and from a JSON file.
|
<Infobox title="Custom components and data" emoji="📖">
|
||||||
|
|
||||||
|
For more details on how to work with pipeline components that depend on data
|
||||||
|
resources and manage data loading and initialization at training and runtime,
|
||||||
|
see the usage guide on initializing and serializing
|
||||||
|
[component data](/usage/processing-pipelines#component-data).
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
|
The following example shows a custom component that keeps arbitrary
|
||||||
|
JSON-serializable data, allows the user to add to that data and saves and loads
|
||||||
|
the data to and from a JSON file.
|
||||||
|
|
||||||
> #### Real-world example
|
> #### Real-world example
|
||||||
>
|
>
|
||||||
|
@@ -208,13 +220,13 @@ class CustomComponent:
        # Add something to the component's data
        self.data.append(data)

    def to_disk(self, path, exclude=tuple()):
        # This will receive the directory path + /my_component
        data_path = path / "data.json"
        with data_path.open("w", encoding="utf8") as f:
            f.write(json.dumps(self.data))

    def from_disk(self, path, exclude=tuple()):
        # This will receive the directory path + /my_component
        data_path = path / "data.json"
        with data_path.open("r", encoding="utf8") as f:
@@ -276,6 +288,8 @@ custom components to spaCy automatically.

</Infobox>

<!-- ## Initializing components with data {#initialization new="3"} -->

## Using entry points {#entry-points new="2.1"}

Entry points let you expose parts of a Python package you write to other Python
@@ -819,7 +819,8 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:

### Customizing the initialization {#initialization}

<Infobox title="This section is still under construction" emoji="🚧" variant="warning">
</Infobox>

## Data utilities {#data}