---
title: Saving and Loading
menu:
  - ['Basics', 'basics']
  - ['Serializing Docs', 'docs']
  - ['Serialization Methods', 'serialization-methods']
  - ['Entry Points', 'entry-points']
  - ['Trained Pipelines', 'models']
---

## Basics {id="basics",hidden="true"}

<Serialization101 />

### Serializing the pipeline {id="pipeline"}

When serializing the pipeline, keep in mind that this will only save out the
**binary data for the individual components** to allow spaCy to restore them –
not the entire objects. This is a good thing, because it makes serialization
safe. But it also means that you have to take care of storing the config, which
contains the pipeline configuration and all the relevant settings.

> #### Saving the meta and config
>
> The [`nlp.meta`](/api/language#meta) attribute is a JSON-serializable
> dictionary and contains all pipeline meta information like the author and
> license information. The [`nlp.config`](/api/language#config) attribute is a
> dictionary containing the training configuration, pipeline component factories
> and other settings. It is saved out with a pipeline as the `config.cfg`.

```python {title="Serialize"}
config = nlp.config
bytes_data = nlp.to_bytes()
```

```python {title="Deserialize"}
lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
nlp = lang_cls.from_config(config)
nlp.from_bytes(bytes_data)
```

This is also how spaCy does it under the hood when loading a pipeline: it loads
the `config.cfg` containing the language and pipeline information, initializes
the language class, creates and adds the pipeline components based on the config
and _then_ loads in the binary data. You can read more about this process
[here](/usage/processing-pipelines#pipelines).
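
The same round trip also works on disk: [`nlp.to_disk`](/api/language#to_disk)
writes out the `config.cfg` together with the binary component data, and
[`spacy.load`](/api/top-level#spacy.load) then performs the steps above for you.
A minimal sketch (the path is just a placeholder):

```python
nlp.to_disk("/path/to/pipeline")       # writes config.cfg + binary data
nlp = spacy.load("/path/to/pipeline")  # re-creates nlp and loads the data
```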

## Serializing Doc objects efficiently {id="docs",version="2.2"}

If you're working with lots of data, you'll probably need to pass analyses
between machines, either to use something like [Dask](https://dask.org) or
[Spark](https://spark.apache.org), or even just to save out work to disk. Often
it's sufficient to use the [`Doc.to_array`](/api/doc#to_array) functionality for
this, and just serialize the numpy arrays – but other times you want a more
general way to save and restore `Doc` objects.

The [`DocBin`](/api/docbin) class makes it easy to serialize and deserialize a
collection of `Doc` objects together, and is much more efficient than calling
[`Doc.to_bytes`](/api/doc#to_bytes) on each individual `Doc` object. You can
also control what data gets saved, and you can merge pallets together for easy
map/reduce-style processing.

```python {highlight="4,8,9,13,14"}
import spacy
from spacy.tokens import DocBin

doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
texts = ["Some text", "Lots of texts...", "..."]
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()

# Deserialize later, e.g. in a new process
nlp = spacy.blank("en")
doc_bin = DocBin().from_bytes(bytes_data)
docs = list(doc_bin.get_docs(nlp.vocab))
```

If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as
well, which includes the values of
[extension attributes](/usage/processing-pipelines#custom-components-attributes)
(if they're serializable with msgpack).

<Infobox title="Important note on serializing extension attributes" variant="warning">

Including the `Doc.user_data` and extension attributes will only serialize the
**values** of the attributes. To restore the values and access them via the
`doc._.` property, you need to register the global attribute on the `Doc` again.

```python
from spacy.tokens import Doc

docs = list(doc_bin.get_docs(nlp.vocab))
Doc.set_extension("my_custom_attr", default=None)
print([doc._.my_custom_attr for doc in docs])
```

</Infobox>
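
The map/reduce-style merging mentioned above is handled by
[`DocBin.merge`](/api/docbin#merge). A minimal sketch, assuming two bytestrings
produced by separate workers as shown above (both `DocBin`s need to use the same
`attrs`):

```python
doc_bin1 = DocBin().from_bytes(bytes_data1)  # e.g. produced by worker 1
doc_bin2 = DocBin().from_bytes(bytes_data2)  # e.g. produced by worker 2
doc_bin1.merge(doc_bin2)                     # doc_bin1 now contains all docs
docs = list(doc_bin1.get_docs(nlp.vocab))
```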

### Using Pickle {id="pickle"}

> #### Example
>
> ```python
> doc = nlp("This is a text.")
> data = pickle.dumps(doc)
> ```

When pickling spaCy's objects like the [`Doc`](/api/doc) or the
[`EntityRecognizer`](/api/entityrecognizer), keep in mind that they all require
the shared [`Vocab`](/api/vocab) (which includes the string-to-hash mappings,
label schemes and optional vectors). This means that their pickled
representations can become very large, especially if you have word vectors
loaded, because it won't only include the object itself, but also the entire
shared vocab it depends on.

If you need to pickle multiple objects, try to pickle them **together** instead
of separately. For instance, instead of pickling all pipeline components, pickle
the entire pipeline once. And instead of pickling several `Doc` objects
separately, pickle a list of `Doc` objects. Since they all share a reference to
the _same_ `Vocab` object, it will only be included once.

```python {title="Pickling objects with shared data",highlight="8-9"}
doc1 = nlp("Hello world")
doc2 = nlp("This is a test")

doc1_data = pickle.dumps(doc1)
doc2_data = pickle.dumps(doc2)
print(len(doc1_data) + len(doc2_data))  # 6636116 😞

doc_data = pickle.dumps([doc1, doc2])
print(len(doc_data))  # 3319761 😃
```

<Infobox title="Pickling spans and tokens" variant="warning">

Pickling `Token` and `Span` objects isn't supported. They're only views of the
`Doc` and can't exist on their own. Pickling them would always mean pulling in
the parent document and its vocabulary, which has practically no advantage over
pickling the parent `Doc`.

```diff
- data = pickle.dumps(doc[10:20])
+ data = pickle.dumps(doc)
```

If you really only need a span – for example, a particular sentence – you can
use [`Span.as_doc`](/api/span#as_doc) to make a copy of it and convert it to a
`Doc` object. However, note that this will not let you recover contextual
information from _outside_ the span.

```diff
+ span_doc = doc[10:20].as_doc()
data = pickle.dumps(span_doc)
```

</Infobox>

## Implementing serialization methods {id="serialization-methods"}

When you call [`nlp.to_disk`](/api/language#to_disk),
[`nlp.from_disk`](/api/language#from_disk) or load a pipeline package, spaCy
will iterate over the components in the pipeline, check if they expose a
`to_disk` or `from_disk` method and if so, call it with the path to the pipeline
directory plus the string name of the component. For example, if you're calling
`nlp.to_disk("/path")`, the data for the named entity recognizer will be saved
in `/path/ner`.

If you're using custom pipeline components that depend on external data – for
example, model weights or terminology lists – you can take advantage of spaCy's
built-in component serialization by making your custom component expose its own
`to_disk` and `from_disk` or `to_bytes` and `from_bytes` methods. When an `nlp`
object with the component in its pipeline is saved or loaded, the component will
then be able to serialize and deserialize itself.

<Infobox title="Custom components and data" emoji="📖">

For more details on how to work with pipeline components that depend on data
resources and manage data loading and initialization at training and runtime,
see the usage guide on initializing and serializing
[component data](/usage/processing-pipelines#component-data).

</Infobox>

The following example shows a custom component that keeps arbitrary
JSON-serializable data, allows the user to add to that data and saves and loads
the data to and from a JSON file.

> #### Real-world example
>
> To see custom serialization methods in action, check out the new
> [`EntityRuler`](/api/entityruler) component and its
> [source](%%GITHUB_SPACY/spacy/pipeline/entityruler.py). Patterns added to the
> component will be saved to a `.jsonl` file if the pipeline is serialized to
> disk, and to a bytestring if the pipeline is serialized to bytes. This allows
> saving out a pipeline with a rule-based entity recognizer and including all
> rules _with_ the component data.

```python {highlight="16-23,25-30"}
import json
from spacy import Language
from spacy.util import ensure_path

@Language.factory("my_component")
class CustomComponent:
    def __init__(self, nlp: Language, name: str = "my_component"):
        self.name = name
        self.data = []

    def __call__(self, doc):
        # Do something to the doc here
        return doc

    def add(self, data):
        # Add something to the component's data
        self.data.append(data)

    def to_disk(self, path, exclude=tuple()):
        # This will receive the directory path + /my_component
        path = ensure_path(path)
        if not path.exists():
            path.mkdir()
        data_path = path / "data.json"
        with data_path.open("w", encoding="utf8") as f:
            f.write(json.dumps(self.data))

    def from_disk(self, path, exclude=tuple()):
        # This will receive the directory path + /my_component
        data_path = path / "data.json"
        with data_path.open("r", encoding="utf8") as f:
            self.data = json.load(f)
        return self
```
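
The byte-based protocol follows the same pattern. A minimal sketch of matching
`to_bytes` and `from_bytes` methods for the component above (not part of the
original example, but using the same JSON-serializable data):

```python
import json

def to_bytes(self, exclude=tuple()):
    # Serialize the component's data to a bytestring
    return json.dumps(self.data).encode("utf8")

def from_bytes(self, bytes_data, exclude=tuple()):
    # Restore the component's data from a bytestring
    self.data = json.loads(bytes_data)
    return self
```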

After adding the component to the pipeline and adding some data to it, we can
serialize the `nlp` object to a directory, which will call the custom
component's `to_disk` method.

```python {highlight="2-4"}
nlp = spacy.load("en_core_web_sm")
my_component = nlp.add_pipe("my_component")
my_component.add({"hello": "world"})
nlp.to_disk("/path/to/pipeline")
```

The contents of the directory would then look like this.
`CustomComponent.to_disk` converted the data to a JSON string and saved it to a
file `data.json` in its subdirectory:

```yaml {title="Directory structure",highlight="2-3"}
└── /path/to/pipeline
    ├── my_component   # data serialized by "my_component"
    │   └── data.json
    ├── ner            # data for "ner" component
    ├── parser         # data for "parser" component
    ├── tagger         # data for "tagger" component
    ├── vocab          # pipeline vocabulary
    ├── meta.json      # pipeline meta.json
    ├── config.cfg     # pipeline config
    └── tokenizer      # tokenization rules
```

When you load the data back in, spaCy will call the custom component's
`from_disk` method with the given file path, and the component can then load the
contents of `data.json`, convert them to a Python object and restore the
component state. The same works for other types of data, of course – for
instance, you could add a
[wrapper for a model](/usage/layers-architectures#frameworks) trained with a
different library like TensorFlow or PyTorch and make spaCy load its weights
automatically when you load the pipeline package.
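
As a quick sketch of the round trip – assuming the `CustomComponent` code above
lives in a module you can import, here a hypothetical `custom_components.py` –
loading the pipeline restores the data saved in `data.json`:

```python
import spacy
import custom_components  # noqa: F401 – runs @Language.factory("my_component")

nlp = spacy.load("/path/to/pipeline")
print(nlp.get_pipe("my_component").data)  # [{"hello": "world"}]
```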

<Infobox title="Important note on loading custom components" variant="warning">

When you load back a pipeline with custom components, make sure that the
components are **available** and that the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorators are executed _before_
your pipeline is loaded back. Otherwise, spaCy won't know how to resolve the
string name of a component factory like `"my_component"` back to a function. For
more details, see the documentation on
[adding factories](/usage/processing-pipelines#custom-components-factories) or
use [entry points](#entry-points) to make your extension package expose your
custom components to spaCy automatically.

</Infobox>

{/* ## Initializing components with data {id="initialization",version="3"} */}

## Using entry points {id="entry-points",version="2.1"}

Entry points let you expose parts of a Python package you write to other Python
packages. This lets one application easily customize the behavior of another, by
exposing an entry point in its `setup.py`. For a quick and fun intro to entry
points in Python, check out
[this excellent blog post](https://amir.rachum.com/blog/2017/07/28/python-entry-points/).
spaCy can load custom functions from several different entry points to add
pipeline component factories, language classes and other settings. To make spaCy
use your entry points, your package needs to expose them and it needs to be
installed in the same environment – that's it.
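
If you're curious what a given group exposes in your environment, you can
inspect the installed entry points yourself. A small sketch using the standard
library (the `group` keyword requires Python 3.10+ or the `importlib_metadata`
backport):

```python
from importlib.metadata import entry_points

# List all pipeline component factories that installed packages expose to spaCy
for ep in entry_points(group="spacy_factories"):
    print(ep.name, "->", ep.value)
```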

| Entry point                                       | Description                                                                                                                                                                                                                                                |
| ------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`spacy_factories`](#entry-points-components)     | Group of entry points for pipeline component factories, keyed by component name. Can be used to expose custom components defined by another package.                                                                                                      |
| [`spacy_languages`](#entry-points-languages)      | Group of entry points for custom [`Language` subclasses](/usage/linguistic-features#language-data), keyed by language shortcut.                                                                                                                            |
| `spacy_lookups`                                   | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package.                                                                   |
| [`spacy_displacy_colors`](#entry-points-displacy) | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |

### Custom components via entry points {id="entry-points-components"}

When you load a pipeline, spaCy will generally use its `config.cfg` to set up
the language class and construct the pipeline. The pipeline is specified as a
list of strings, e.g. `pipeline = ["tagger", "parser", "ner"]`. For each of
those strings, spaCy will call `nlp.add_pipe` and look up the name in all
factories defined by the decorators
[`@Language.component`](/api/language#component) and
[`@Language.factory`](/api/language#factory). This means that you have to import
your custom components _before_ loading the pipeline.

Using entry points, pipeline packages and extension packages can define their
own `"spacy_factories"`, which will be loaded automatically in the background
when the `Language` class is initialized. So if a user has your package
installed, they'll be able to use your components – even if they **don't import
them**!

To stick with the theme of
[this entry points blog post](https://amir.rachum.com/blog/2017/07/28/python-entry-points/),
consider the following custom spaCy
[pipeline component](/usage/processing-pipelines#custom-components) that prints a
snake when it's called:

> #### Package directory structure
>
> ```yaml
> ├── snek.py   # the extension code
> └── setup.py  # setup file for pip installation
> ```

```python {title="snek.py"}
from spacy.language import Language

snek = """
    --..,_                     _,.--.
       `'.'.                .'`__ o  `;__. {text}
          '.'.            .'.'`  '---'`  `
            '.`'--....--'`.'
              `'--....--'`
"""

@Language.component("snek")
def snek_component(doc):
    print(snek.format(text=doc.text))
    return doc
```

Since it's a very complex and sophisticated module, you want to split it off
into its own package so you can version it and upload it to PyPI. You also want
your custom package to be able to define `pipeline = ["snek"]` in its
`config.cfg`. For that, you need to be able to tell spaCy where to find the
component `"snek"`. If you don't do this, spaCy will raise an error when you try
to load the pipeline because there's no built-in `"snek"` component. To add an
entry to the factories, you can now expose it in your `setup.py` via the
`entry_points` dictionary:

> #### Entry point syntax
>
> Python entry points for a group are formatted as a **list of strings**, with
> each string following the syntax of `name = module:object`. In this example,
> the created entry point is named `snek` and points to the function
> `snek_component` in the module `snek`, i.e. `snek.py`.

```python {title="setup.py",highlight="5-7"}
from setuptools import setup

setup(
    name="snek",
    entry_points={
        "spacy_factories": ["snek = snek:snek_component"]
    }
)
```

The same package can expose multiple entry points, by the way. To make them
available to spaCy, all you need to do is install the package in your
environment:

```bash
$ python setup.py develop
```

spaCy is now able to create the pipeline component `"snek"` – even though you
never imported `snek_component`. When you save the
[`nlp.config`](/api/language#config) to disk, it includes an entry for your
`"snek"` component and any pipeline you train with this config will include the
component and know how to load it – if your `snek` package is installed.

> #### config.cfg (excerpt)
>
> ```diff
> [nlp]
> lang = "en"
> + pipeline = ["snek"]
>
> [components]
>
> + [components.snek]
> + factory = "snek"
> ```

```
>>> from spacy.lang.en import English
>>> nlp = English()
>>> nlp.add_pipe("snek")  # this now works! 🐍🎉
>>> doc = nlp("I am snek")
    --..,_                     _,.--.
       `'.'.                .'`__ o  `;__. I am snek
          '.'.            .'.'`  '---'`  `
            '.`'--....--'`.'
              `'--....--'`
```

Instead of making your snek component a simple
[stateless component](/usage/processing-pipelines#custom-components-simple), you
could also make it a
[factory](/usage/processing-pipelines#custom-components-factories) that takes
settings. Your users can then pass in an optional `config` when they add your
component to the pipeline and customize its appearance – for example, the
`snek_style`.

> #### config.cfg (excerpt)
>
> ```diff
> [components.snek]
> factory = "snek"
> + snek_style = "basic"
> ```

```python
SNEKS = {"basic": snek, "cute": cute_snek}  # collection of sneks

@Language.factory("snek", default_config={"snek_style": "basic"})
class SnekFactory:
    def __init__(self, nlp: Language, name: str, snek_style: str):
        self.nlp = nlp
        self.snek_style = snek_style
        self.snek = SNEKS[self.snek_style]

    def __call__(self, doc):
        print(self.snek)
        return doc
```

```diff {title="setup.py"}
entry_points={
-   "spacy_factories": ["snek = snek:snek_component"]
+   "spacy_factories": ["snek = snek:SnekFactory"]
}
```
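
A user of your package could then pick a snek style when adding the component –
the `config` passed to `nlp.add_pipe` is merged with the factory's
`default_config`. A minimal sketch (the `"cute"` style assumes `cute_snek` is
defined alongside `snek`):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("snek", config={"snek_style": "cute"})
```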

The factory can also implement other pipeline component methods like `to_disk`
and `from_disk` for serialization, or even `update` to make the component
trainable. If a component exposes a `from_disk` method and is included in a
pipeline, spaCy will call it on load. This lets you ship custom data with your
pipeline package. When you save out a pipeline using `nlp.to_disk` and the
component exposes a `to_disk` method, it will be called with the disk path.

```python
from spacy.util import ensure_path

def to_disk(self, path, exclude=tuple()):
    path = ensure_path(path)
    if not path.exists():
        path.mkdir()
    snek_path = path / "snek.txt"
    with snek_path.open("w", encoding="utf8") as snek_file:
        snek_file.write(self.snek)

def from_disk(self, path, exclude=tuple()):
    snek_path = path / "snek.txt"
    with snek_path.open("r", encoding="utf8") as snek_file:
        self.snek = snek_file.read()
    return self
```

The above example will serialize the current snake in a `snek.txt` in the data
directory. When a pipeline using the `snek` component is loaded, it will open
the `snek.txt` and make it available to the component.

### Custom language classes via entry points {id="entry-points-languages"}

To stay with the theme of the previous example and
[this blog post on entry points](https://amir.rachum.com/blog/2017/07/28/python-entry-points/),
let's imagine you wanted to implement your own `SnekLanguage` class for your
custom pipeline – but you don't necessarily want to modify spaCy's code to add a
language. In your package, you could then implement the following
[custom language subclass](/usage/linguistic-features#language-subclass):

```python {title="snek.py"}
from spacy.language import Language

class SnekDefaults(Language.Defaults):
    stop_words = set(["sss", "hiss"])

class SnekLanguage(Language):
    lang = "snk"
    Defaults = SnekDefaults
```

Alongside the `spacy_factories`, there's also an entry point option for
`spacy_languages`, which maps language codes to language-specific `Language`
subclasses:

```diff {title="setup.py"}
from setuptools import setup

setup(
    name="snek",
    entry_points={
        "spacy_factories": ["snek = snek:SnekFactory"],
+       "spacy_languages": ["snk = snek:SnekLanguage"]
    }
)
```

In spaCy, you can then load the custom `snk` language and it will be resolved to
`SnekLanguage` via the custom entry point. This is especially relevant for
pipeline packages you [train](/usage/training), which could then specify
`lang = snk` in their `config.cfg` without spaCy raising an error because the
language is not available in the core library.
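
In code, the easiest way to check this is
[`spacy.blank`](/api/top-level#spacy.blank), which resolves the language code via
the registry and installed entry points. A minimal sketch, assuming the `snek`
package from above is installed:

```python
import spacy

# "snk" resolves to SnekLanguage via the spacy_languages entry point,
# without importing anything from the snek package explicitly
nlp = spacy.blank("snk")
assert nlp.lang == "snk"
```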

### Custom displaCy colors via entry points {id="entry-points-displacy",version="2.2"}

If you're training a named entity recognition model for a custom domain, you may
end up training different labels that don't have pre-defined colors in the
[`displacy` visualizer](/usage/visualizers#ent). The `spacy_displacy_colors`
entry point lets you define a dictionary of entity labels mapped to their color
values. It's added to the pre-defined colors and can also overwrite existing
values.

> #### Domain-specific NER labels
>
> Good examples of pipelines with domain-specific label schemes are
> [scispaCy](/universe/project/scispacy) and
> [Blackstone](/universe/project/blackstone).

```python {title="snek.py"}
displacy_colors = {"SNEK": "#3dff74", "HUMAN": "#cfc5ff"}
```

Given the above colors, the entry point can be defined as follows. Entry points
need to have a name, so we use the key `colors`. However, the name doesn't
matter and whatever is defined in the entry point group will be used.

```diff {title="setup.py"}
from setuptools import setup

setup(
    name="snek",
    entry_points={
+       "spacy_displacy_colors": ["colors = snek:displacy_colors"]
    }
)
```

After installing the package, the custom colors will be used when visualizing
text with `displacy`. Whenever the label `SNEK` is assigned, it will be
displayed in `#3dff74`.
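
For a quick check without a trained model, you can set the entity manually and
render it – a minimal sketch, assuming the package defining `displacy_colors` is
installed:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("I am snek")
doc.ents = [Span(doc, 2, 3, label="SNEK")]  # mark the token "snek" as SNEK
html = displacy.render(doc, style="ent")    # markup highlights "snek" in #3dff74
```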

<Iframe
  title="displaCy visualization of entities"
  src="/images/displacy-ent-snek.html"
  height={100}
/>

## Saving, loading and distributing trained pipelines {id="models"}

After training your pipeline, you'll usually want to save its state, and load it
back later. You can do this with the [`Language.to_disk`](/api/language#to_disk)
method:

```python
nlp.to_disk("./en_example_pipeline")
```

The directory will be created if it doesn't exist, and the whole pipeline data,
meta and configuration will be written out. To make the pipeline more convenient
to deploy, we recommend wrapping it as a [Python package](/api/cli#package).

<Accordion title="What’s the difference between the config.cfg and meta.json?" spaced id="models-meta-vs-config">

When you save a pipeline in spaCy v3.0+, two files will be exported: a
[`config.cfg`](/api/data-formats#config) based on
[`nlp.config`](/api/language#config) and a [`meta.json`](/api/data-formats#meta)
based on [`nlp.meta`](/api/language#meta).

- **config**: Configuration used to create the current `nlp` object, its
  pipeline components and models, as well as training settings and
  hyperparameters. Can include references to registered functions like
  [pipeline components](/usage/processing-pipelines#custom-components) or
  [model architectures](/api/architectures). Given a config, spaCy is able to
  reconstruct the whole tree of objects and the `nlp` object. An exported config
  can also be used to [train a pipeline](/usage/training#config) with the same
  settings.
- **meta**: Meta information about the pipeline and the Python package, such as
  the author information, license, version, data sources and label scheme. This
  is mostly used for documentation purposes and for packaging pipelines. It has
  no impact on the functionality of the `nlp` object.

</Accordion>

<Project id="pipelines/tagger_parser_ud">

The easiest way to get started with an end-to-end workflow is to clone a
[project template](/usage/projects) and run it – for example, this template that
lets you train a **part-of-speech tagger** and **dependency parser** on a
Universal Dependencies treebank and generates an installable Python package.

</Project>

### Generating a pipeline package {id="models-generating"}

<Infobox title="Important note" variant="warning">

Pipeline packages are typically **not suitable** for the public
[pypi.python.org](https://pypi.python.org) directory, which is not designed for
binary data and files over 50 MB. However, if your company is running an
**internal installation** of PyPI, publishing your pipeline packages on there
can be a convenient way to share them with your team.

</Infobox>

spaCy comes with a handy CLI command that will create all required files, and
walk you through generating the meta data. You can also create the
[`meta.json`](/api/data-formats#meta) manually and place it in the data
directory, or supply a path to it using the `--meta` flag. For more info on
this, see the [`package`](/api/cli#package) docs.

> #### meta.json (example)
>
> ```json
> {
>   "name": "example_pipeline",
>   "lang": "en",
>   "version": "1.0.0",
>   "spacy_version": ">=3.0.0,<4.0.0",
>   "description": "Example pipeline for spaCy",
>   "author": "You",
>   "email": "you@example.com",
>   "license": "CC BY-SA 3.0"
> }
> ```

```bash
$ python -m spacy package ./en_example_pipeline ./packages
```

This command will create a pipeline package directory and will run
`python setup.py sdist` in that directory to create a `.tar.gz` archive of your
package that can be installed using `pip install`. You can also use the
command's `--build` option to build a binary `.whl` file instead. Installing the
binary wheel is usually more efficient.

```yaml {title="Directory structure"}
└── /
    ├── MANIFEST.in                           # to include meta.json
    ├── meta.json                             # pipeline meta data
    ├── setup.py                              # setup file for pip installation
    ├── en_example_pipeline                   # pipeline directory
    │   ├── __init__.py                       # init for pip installation
    │   └── en_example_pipeline-1.0.0         # pipeline data
    │       ├── config.cfg                    # pipeline config
    │       ├── meta.json                     # pipeline meta
    │       └── ...                           # directories with component data
    └── dist
        └── en_example_pipeline-1.0.0.tar.gz  # installable package
```

You can also find templates for all files in the
[`cli/package.py` source](https://github.com/explosion/spacy/tree/master/spacy/cli/package.py).
If you're creating the package manually, keep in mind that the directories need
to be named according to the naming conventions of `lang_name` and
`lang_name-version`.

### Including custom functions and components {id="models-custom"}

If your pipeline includes
[custom components](/usage/processing-pipelines#custom-components), model
architectures or other [code](/usage/training#custom-code), those functions need
to be registered **before** your pipeline is loaded. Otherwise, spaCy won't know
how to create the objects referenced in the config. The
[`spacy package`](/api/cli#package) command lets you provide one or more paths
to Python files containing custom registered functions using the `--code`
argument.

> #### \_\_init\_\_.py (excerpt)
>
> ```python
> from . import functions
>
> def load(**overrides):
>     ...
> ```

```bash
$ python -m spacy package ./en_example_pipeline ./packages --code functions.py
```
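
What goes into such a file is up to you – any code that registers the functions
referenced in your config. A minimal sketch of a hypothetical `functions.py`:

```python
from spacy.language import Language

@Language.component("my_custom_component")
def my_custom_component(doc):
    # Custom processing referenced in the pipeline config
    return doc
```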

The Python files will be copied over into the root of the package, and the
package's `__init__.py` will import them as modules. This ensures that functions
are registered when the pipeline is imported, e.g. when you call `spacy.load`. A
simple import is all that's needed to make registered functions available.

Make sure to include **all Python files** that are referenced in your custom
code, including modules imported by others. If your custom code depends on
**external packages**, make sure they're listed in the list of `"requirements"`
in your [`meta.json`](/api/data-formats#meta). For the majority of use cases,
registered functions should provide you with all customizations you need, from
custom components to custom model architectures and lifecycle hooks. However, if
you do want to customize the setup in more detail, you can edit the package's
`__init__.py` and the package's `load` function that's called by
[`spacy.load`](/api/top-level#spacy.load).

<Infobox variant="warning" title="Important note on making manual edits">

While it's no problem to edit the package code or meta information, avoid making
edits to the `config.cfg` **after** training, as this can easily lead to data
incompatibility. For instance, changing an architecture or hyperparameter can
mean that the trained weights are now incompatible. If you want to make
adjustments, you can do so before training. Otherwise, you should always trust
spaCy to export the current state of its `nlp` objects via
[`nlp.config`](/api/language#config).

</Infobox>

### Loading a custom pipeline package {id="loading"}

To load a pipeline from a data directory, you can use
[`spacy.load()`](/api/top-level#spacy.load) with the local path. This will look
for a `config.cfg` in the directory and use the `lang` and `pipeline` settings
to initialize a `Language` class with a processing pipeline and load in the
model data.

```python
nlp = spacy.load("/path/to/pipeline")
```

If you want to **load only the binary data**, you'll have to create a `Language`
class and call [`from_disk`](/api/language#from_disk) instead.

```python
nlp = spacy.blank("en").from_disk("/path/to/data")
```