--- title: Saving and Loading menu: - ['Basics', 'basics'] - ['Serialization Methods', 'serialization-methods'] - ['Entry Points', 'entry-points'] - ['Models', 'models'] --- ## Basics {#basics hidden="true"} import Serialization101 from 'usage/101/\_serialization.md' ### Serializing the pipeline {#pipeline} When serializing the pipeline, keep in mind that this will only save out the **binary data for the individual components** to allow spaCy to restore them – not the entire objects. This is a good thing, because it makes serialization safe. But it also means that you have to take care of storing the language name and pipeline component names as well, and restoring them separately before you can load in the data. > #### Saving the model meta > > The `nlp.meta` attribute is a JSON-serializable dictionary and contains all > model meta information, like the language and pipeline, but also author and > license information. ```python ### Serialize bytes_data = nlp.to_bytes() lang = nlp.meta["lang"] # "en" pipeline = nlp.meta["pipeline"] # ["tagger", "parser", "ner"] ``` ```python ### Deserialize nlp = spacy.blank(lang) for pipe_name in pipeline: pipe = nlp.create_pipe(pipe_name) nlp.add_pipe(pipe) nlp.from_bytes(bytes_data) ``` This is also how spaCy does it under the hood when loading a model: it loads the model's `meta.json` containing the language and pipeline information, initializes the language class, creates and adds the pipeline components and _then_ loads in the binary data. You can read more about this process [here](/usage/processing-pipelines#pipelines). ### Serializing Doc objects efficiently {#docs new="2.2"} If you're working with lots of data, you'll probably need to pass analyses between machines, either to use something like [Dask](https://dask.org) or [Spark](https://spark.apache.org), or even just to save out work to disk. Often it's sufficient to use the [`Doc.to_array`](/api/doc#to_array) functionality for this, and just serialize the numpy arrays – but other times you want a more general way to save and restore `Doc` objects. The [`DocBin`](/api/docbin) class makes it easy to serialize and deserialize a collection of `Doc` objects together, and is much more efficient than calling [`Doc.to_bytes`](/api/doc#to_bytes) on each individual `Doc` object. You can also control what data gets saved, and you can merge pallets together for easy map/reduce-style processing. ```python ### {highlight="4,8,9,13,14"} import spacy from spacy.tokens import DocBin doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True) texts = ["Some text", "Lots of texts...", "..."] nlp = spacy.load("en_core_web_sm") for doc in nlp.pipe(texts): doc_bin.add(doc) bytes_data = doc_bin.to_bytes() # Deserialize later, e.g. in a new process nlp = spacy.blank("en") doc_bin = DocBin().from_bytes(bytes_data) docs = list(doc_bin.get_docs(nlp.vocab)) ``` If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as well, which includes the values of [extension attributes](/usage/processing-pipelines#custom-components-attributes) (if they're serializable with msgpack). Including the `Doc.user_data` and extension attributes will only serialize the **values** of the attributes. To restore the values and access them via the `doc._.` property, you need to register the global attribute on the `Doc` again. ```python docs = list(doc_bin.get_docs(nlp.vocab)) Doc.set_extension("my_custom_attr", default=None) print([doc._.my_custom_attr for doc in docs]) ``` ### Using Pickle {#pickle} > #### Example > > ```python > doc = nlp("This is a text.") > data = pickle.dumps(doc) > ``` When pickling spaCy's objects like the [`Doc`](/api/doc) or the [`EntityRecognizer`](/api/entityrecognizer), keep in mind that they all require the shared [`Vocab`](/api/vocab) (which includes the string to hash mappings, label schemes and optional vectors). This means that their pickled representations can become very large, especially if you have word vectors loaded, because it won't only include the object itself, but also the entire shared vocab it depends on. If you need to pickle multiple objects, try to pickle them **together** instead of separately. For instance, instead of pickling all pipeline components, pickle the entire pipeline once. And instead of pickling several `Doc` objects separately, pickle a list of `Doc` objects. Since they all share a reference to the _same_ `Vocab` object, it will only be included once. ```python ### Pickling objects with shared data {highlight="8-9"} doc1 = nlp("Hello world") doc2 = nlp("This is a test") doc1_data = pickle.dumps(doc1) doc2_data = pickle.dumps(doc2) print(len(doc1_data) + len(doc2_data)) # 6636116 😞 doc_data = pickle.dumps([doc1, doc2]) print(len(doc_data)) # 3319761 😃 ``` Pickling `Token` and `Span` objects isn't supported. They're only views of the `Doc` and can't exist on their own. Pickling them would always mean pulling in the parent document and its vocabulary, which has practically no advantage over pickling the parent `Doc`. ```diff - data = pickle.dumps(doc[10:20]) + data = pickle.dumps(doc) ``` If you really only need a span – for example, a particular sentence – you can use [`Span.as_doc`](/api/span#as_doc) to make a copy of it and convert it to a `Doc` object. However, note that this will not let you recover contextual information from _outside_ the span. ```diff + span_doc = doc[10:20].as_doc() data = pickle.dumps(span_doc) ``` ## Implementing serialization methods {#serialization-methods} When you call [`nlp.to_disk`](/api/language#to_disk), [`nlp.from_disk`](/api/language#from_disk) or load a model package, spaCy will iterate over the components in the pipeline, check if they expose a `to_disk` or `from_disk` method and if so, call it with the path to the model directory plus the string name of the component. For example, if you're calling `nlp.to_disk("/path")`, the data for the named entity recognizer will be saved in `/path/ner`. If you're using custom pipeline components that depend on external data – for example, model weights or terminology lists – you can take advantage of spaCy's built-in component serialization by making your custom component expose its own `to_disk` and `from_disk` or `to_bytes` and `from_bytes` methods. When an `nlp` object with the component in its pipeline is saved or loaded, the component will then be able to serialize and deserialize itself. The following example shows a custom component that keeps arbitrary JSON-serializable data, allows the user to add to that data and saves and loads the data to and from a JSON file. > #### Real-world example > > To see custom serialization methods in action, check out the new > [`EntityRuler`](/api/entityruler) component and its > [source](https://github.com/explosion/spaCy/tree/master/spacy/pipeline/entityruler.py). > Patterns added to the component will be saved to a `.jsonl` file if the > pipeline is serialized to disk, and to a bytestring if the pipeline is > serialized to bytes. This allows saving out a model with a rule-based entity > recognizer and including all rules _with_ the model data. ```python ### {highlight="15-19,21-26"} class CustomComponent(object): name = "my_component" def __init__(self): self.data = [] def __call__(self, doc): # Do something to the doc here return doc def add(self, data): # Add something to the component's data self.data.append(data) def to_disk(self, path, **kwargs): # This will receive the directory path + /my_component data_path = path / "data.json" with data_path.open("w", encoding="utf8") as f: f.write(json.dumps(self.data)) def from_disk(self, path, **cfg): # This will receive the directory path + /my_component data_path = path / "data.json" with data_path.open("r", encoding="utf8") as f: self.data = json.loads(f) return self ``` After adding the component to the pipeline and adding some data to it, we can serialize the `nlp` object to a directory, which will call the custom component's `to_disk` method. ```python ### {highlight="2-4"} nlp = spacy.load("en_core_web_sm") my_component = CustomComponent() my_component.add({"hello": "world"}) nlp.add_pipe(my_component) nlp.to_disk("/path/to/model") ``` The contents of the directory would then look like this. `CustomComponent.to_disk` converted the data to a JSON string and saved it to a file `data.json` in its subdirectory: ```yaml ### Directory structure {highlight="2-3"} └── /path/to/model ├── my_component # data serialized by "my_component" | └── data.json ├── ner # data for "ner" component ├── parser # data for "parser" component ├── tagger # data for "tagger" component ├── vocab # model vocabulary ├── meta.json # model meta.json with name, language and pipeline └── tokenizer # tokenization rules ``` When you load the data back in, spaCy will call the custom component's `from_disk` method with the given file path, and the component can then load the contents of `data.json`, convert them to a Python object and restore the component state. The same works for other types of data, of course – for instance, you could add a [wrapper for a model](/usage/processing-pipelines#wrapping-models-libraries) trained with a different library like TensorFlow or PyTorch and make spaCy load its weights automatically when you load the model package. When you load a model from disk, spaCy will check the `"pipeline"` in the model's `meta.json` and look up the component name in the internal factories. To make sure spaCy knows how to initialize `"my_component"`, you'll need to add it to the factories: ```python from spacy.language import Language Language.factories["my_component"] = lambda nlp, **cfg: CustomComponent() ``` For more details, see the documentation on [adding factories](/usage/processing-pipelines#custom-components-factories) or use [entry points](#entry-points) to make your extension package expose your custom components to spaCy automatically. ## Using entry points {#entry-points new="2.1"} Entry points let you expose parts of a Python package you write to other Python packages. This lets one application easily customize the behavior of another, by exposing an entry point in its `setup.py`. For a quick and fun intro to entry points in Python, check out [this excellent blog post](https://amir.rachum.com/blog/2017/07/28/python-entry-points/). spaCy can load custom function from several different entry points to add pipeline component factories, language classes and other settings. To make spaCy use your entry points, your package needs to expose them and it needs to be installed in the same environment – that's it. | Entry point | Description | | ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories to add to [`Language.factories`](/usage/processing-pipelines#custom-components-factories), keyed by component name. | | [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. | | `spacy_lookups` 2.2 | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. | | [`spacy_displacy_colors`](#entry-points-displacy) 2.2 | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. | ### Custom components via entry points {#entry-points-components} When you load a model, spaCy will generally use the model's `meta.json` to set up the language class and construct the pipeline. The pipeline is specified as a list of strings, e.g. `"pipeline": ["tagger", "paser", "ner"]`. For each of those strings, spaCy will call `nlp.create_pipe` and look up the name in the [built-in factories](/usage/processing-pipelines#custom-components-factories). If your model wanted to specify its own custom components, you usually have to write to `Language.factories` _before_ loading the model. ```python pipe = nlp.create_pipe("custom_component") # fails 👎 Language.factories["custom_component"] = CustomComponentFactory pipe = nlp.create_pipe("custom_component") # works 👍 ``` This is inconvenient and usually required shipping a bunch of component initialization code with the model. Using entry points, model packages and extension packages can now define their own `"spacy_factories"`, which will be added to the built-in factories when the `Language` class is initialized. If a package in the same environment exposes spaCy entry points, all of this happens automatically and no further user action is required. To stick with the theme of [this entry points blog post](https://amir.rachum.com/blog/2017/07/28/python-entry-points/), consider the following custom spaCy extension which is initialized with the shared `nlp` object and will print a snake when it's called as a pipeline component. > #### Package directory structure > > ```yaml > ├── snek.py # the extension code > └── setup.py # setup file for pip installation > ``` ```python ### snek.py snek = """ --..,_ _,.--. `'.'. .'`__ o `;__. '.'. .'.'` '---'` ` '.`'--....--'`.' `'--....--'` """ class SnekFactory(object): def __init__(self, nlp, **cfg): self.nlp = nlp def __call__(self, doc): print(snek) return doc ``` Since it's a very complex and sophisticated module, you want to split it off into its own package so you can version it and upload it to PyPi. You also want your custom model to be able to define `"pipeline": ["snek"]` in its `meta.json`. For that, you need to be able to tell spaCy where to find the factory for `"snek"`. If you don't do this, spaCy will raise an error when you try to load the model because there's no built-in `"snek"` factory. To add an entry to the factories, you can now expose it in your `setup.py` via the `entry_points` dictionary: ```python ### setup.py {highlight="5-7"} from setuptools import setup setup( name="snek", entry_points={ "spacy_factories": ["snek = snek:SnekFactory"] } ) ``` The entry point definition tells spaCy that the name `snek` can be found in the module `snek` (i.e. `snek.py`) as `SnekFactory`. The same package can expose multiple entry points. To make them available to spaCy, all you need to do is install the package: ```bash $ python setup.py develop ``` spaCy is now able to create the pipeline component `'snek'`: ``` >>> from spacy.lang.en import English >>> nlp = English() >>> snek = nlp.create_pipe("snek") # this now works! 🐍🎉 >>> nlp.add_pipe(snek) >>> doc = nlp("I am snek") --..,_ _,.--. `'.'. .'`__ o `;__. '.'. .'.'` '---'` ` '.`'--....--'`.' `'--....--'` ``` Arguably, this gets even more exciting when you train your `en_core_snek_sm` model. To make sure `snek` is installed with the model, you can add it to the model's `setup.py`. You can then tell spaCy to construct the model pipeline with the `snek` component by setting `"pipeline": ["snek"]` in the `meta.json`. > #### meta.json > > ```diff > { > "lang": "en", > "name": "core_snek_sm", > "version": "1.0.0", > + "pipeline": ["snek"] > } > ``` In theory, the entry point mechanism also lets you overwrite built-in factories – including the tokenizer. By default, spaCy will output a warning in these cases, to prevent accidental overwrites and unintended results. #### Advanced components with settings {#advanced-cfg} The `**cfg` keyword arguments that the factory receives are passed down all the way from `spacy.load`. This means that the factory can respond to custom settings defined when loading the model – for example, the style of the snake to load: ```python nlp = spacy.load("en_core_snek_sm", snek_style="cute") ``` ```python SNEKS = {"basic": snek, "cute": cute_snek} # collection of sneks class SnekFactory(object): def __init__(self, nlp, **cfg): self.nlp = nlp self.snek_style = cfg.get("snek_style", "basic") self.snek = SNEKS[self.snek_style] def __call__(self, doc): print(self.snek) return doc ``` The factory can also implement other pipeline component like `to_disk` and `from_disk` for serialization, or even `update` to make the component trainable. If a component exposes a `from_disk` method and is included in a model's pipeline, spaCy will call it on load. This lets you ship custom data with your model. When you save out a model using `nlp.to_disk` and the component exposes a `to_disk` method, it will be called with the disk path. ```python def to_disk(self, path, **kwargs): snek_path = path / "snek.txt" with snek_path.open("w", encoding="utf8") as snek_file: snek_file.write(self.snek) def from_disk(self, path, **cfg): snek_path = path / "snek.txt" with snek_path.open("r", encoding="utf8") as snek_file: self.snek = snek_file.read() return self ``` The above example will serialize the current snake in a `snek.txt` in the model data directory. When a model using the `snek` component is loaded, it will open the `snek.txt` and make it available to the component. ### Custom language classes via entry points {#entry-points-languages} To stay with the theme of the previous example and [this blog post on entry points](https://amir.rachum.com/blog/2017/07/28/python-entry-points/), let's imagine you wanted to implement your own `SnekLanguage` class for your custom model – but you don't necessarily want to modify spaCy's code to [add a language](/usage/adding-languages). In your package, you could then implement the following: ```python ### snek.py from spacy.language import Language from spacy.attrs import LANG class SnekDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters[LANG] = lambda text: "snk" class SnekLanguage(Language): lang = "snk" Defaults = SnekDefaults # Some custom snek language stuff here ``` Alongside the `spacy_factories`, there's also an entry point option for `spacy_languages`, which maps language codes to language-specific `Language` subclasses: ```diff ### setup.py from setuptools import setup setup( name="snek", entry_points={ "spacy_factories": ["snek = snek:SnekFactory"], + "spacy_languages": ["snk = snek:SnekLanguage"] } ) ``` In spaCy, you can then load the custom `sk` language and it will be resolved to `SnekLanguage` via the custom entry point. This is especially relevant for model packages, which could then specify `"lang": "snk"` in their `meta.json` without spaCy raising an error because the language is not available in the core library. > #### meta.json > > ```diff > { > - "lang": "en", > + "lang": "snk", > "name": "core_snek_sm", > "version": "1.0.0", > "pipeline": ["snek"] > } > ``` ```python from spacy.util import get_lang_class SnekLanguage = get_lang_class("snk") nlp = SnekLanguage() ``` ### Custom displaCy colors via entry points {#entry-points-displacy new="2.2"} If you're training a named entity recognition model for a custom domain, you may end up training different labels that don't have pre-defined colors in the [`displacy` visualizer](/usage/visualizers#ent). The `spacy_displacy_colors` entry point lets you define a dictionary of entity labels mapped to their color values. It's added to the pre-defined colors and can also overwrite existing values. > #### Domain-specific NER labels > > Good examples of models with domain-specific label schemes are > [scispaCy](/universe/project/scispacy) and > [Blackstone](/universe/project/blackstone). ```python ### snek.py displacy_colors = {"SNEK": "#3dff74", "HUMAN": "#cfc5ff"} ``` Given the above colors, the entry point can be defined as follows. Entry points need to have a name, so we use the key `colors`. However, the name doesn't matter and whatever is defined in the entry point group will be used. ```diff ### setup.py from setuptools import setup setup( name="snek", entry_points={ + "spacy_displacy_colors": ["colors = snek:displacy_colors"] } ) ``` After installing the package, the the custom colors will be used when visualizing text with `displacy`. Whenever the label `SNEK` is assigned, it will be displayed in `#3dff74`. import DisplaCyEntSnekHtml from 'images/displacy-ent-snek.html'