spaCy/website/usage/_training/_saving-loading.jade

//- 💫 DOCS > USAGE > TRAINING > SAVING & LOADING
p
| After training your model, you'll usually want to save its state, and load
| it back later. You can do this with the
| #[+api("language#to_disk") #[code Language.to_disk()]]
| method:
+code.
nlp.to_disk('/home/me/data/en_example_model')
p
| The directory will be created if it doesn't exist, and the whole pipeline
| will be written out. To make the model more convenient to deploy, we
| recommend wrapping it as a Python package.
+h(3, "models-generating") Generating a model package
+infobox("Important note")
| The model packages are #[strong not suitable] for the public
| #[+a("https://pypi.python.org") pypi.python.org] directory, which is not
| designed for binary data and files over 50 MB. However, if your company
| is running an #[strong internal installation] of PyPI, publishing your
| models on there can be a convenient way to share them with your team.
p
| spaCy comes with a handy CLI command that will create all required files,
| and walk you through generating the meta data. You can also create the
| meta.json manually and place it in the model data directory, or supply a
| path to it using the #[code --meta] flag. For more info on this, see
| the #[+api("cli#package") #[code package]] docs.
+aside-code("meta.json", "json").
{
    "name": "example_model",
    "lang": "en",
    "version": "1.0.0",
    "spacy_version": ">=2.0.0,<3.0.0",
    "description": "Example model for spaCy",
    "author": "You",
    "email": "you@example.com",
    "license": "CC BY-SA 3.0",
    "pipeline": ["token_vectors", "tagger"]
}
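p
| If you prefer to create the meta.json manually, a small script can write
| it into the model data directory. The following is only a minimal sketch:
| the #[code write_meta] helper and the temporary directory are
| hypothetical, and the field values are the placeholders from the example.

```python
import json
import tempfile
from pathlib import Path

# Fields mirror the meta.json example; the values are placeholders.
meta = {
    "name": "example_model",
    "lang": "en",
    "version": "1.0.0",
    "spacy_version": ">=2.0.0,<3.0.0",
    "description": "Example model for spaCy",
    "author": "You",
    "email": "you@example.com",
    "license": "CC BY-SA 3.0",
    "pipeline": ["token_vectors", "tagger"],
}


def write_meta(model_dir, meta):
    """Write meta as meta.json into model_dir and return the file path.

    Hypothetical helper for illustration, not part of spaCy.
    """
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    meta_path = model_dir / "meta.json"
    with meta_path.open("w", encoding="utf8") as f:
        json.dump(meta, f, indent=4)
    return meta_path


# Assumption: a temporary directory stands in for the model data directory.
model_dir = Path(tempfile.mkdtemp()) / "en_example_model"
meta_path = write_meta(model_dir, meta)
```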
+code(false, "bash").
spacy package /home/me/data/en_example_model /home/me/my_models
p This command will create a model package directory that should look like this:
+code("Directory structure", "yaml").
└── /
    ├── MANIFEST.in                   # to include meta.json
    ├── meta.json                     # model meta data
    ├── setup.py                      # setup file for pip installation
    └── en_example_model              # model directory
        ├── __init__.py               # init for pip installation
        └── en_example_model-1.0.0    # model data
p
| You can also find templates for all files in our
| #[+src(gh("spacy-dev-resources", "templates/model")) spaCy dev resources].
| If you're creating the package manually, keep in mind that the directories
| need to be named according to the naming conventions of
| #[code lang_name] and #[code lang_name-version].
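p
| As a rough illustration of the naming conventions, the two directory
| names can be derived from the #[code lang], #[code name] and
| #[code version] fields of the meta.json. The helper
| #[code package_dir_names] is hypothetical and not part of spaCy.

```python
def package_dir_names(meta):
    """Return (package_dir, data_dir) for a model's meta data.

    Hypothetical helper: package_dir follows lang_name and
    data_dir follows lang_name-version.
    """
    package_dir = "{}_{}".format(meta["lang"], meta["name"])
    data_dir = "{}-{}".format(package_dir, meta["version"])
    return package_dir, data_dir


meta = {"lang": "en", "name": "example_model", "version": "1.0.0"}
package_dir, data_dir = package_dir_names(meta)
print(package_dir)
print(data_dir)
```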
+h(3, "models-custom") Customising the model setup
p
| The meta.json includes the model details, like name, requirements and
| license, and lets you customise how the model should be initialised and
| loaded. You can define the language data to be loaded and the
| #[+a("/usage/processing-pipelines") processing pipeline] to
| execute.
+table(["Setting", "Type", "Description"])
+row
+cell #[code lang]
+cell unicode
+cell ID of the language class to initialise.
+row
+cell #[code pipeline]
+cell list
+cell
| A list of strings mapping to the IDs of pipeline factories to
| apply in that order. If not set, spaCy's
| #[+a("/usage/processing-pipelines") default pipeline]
| will be used.
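p
| To make the two settings concrete, here is a sketch of how they could be
| read from a meta.json. It does not reproduce spaCy's own loading code:
| #[code read_model_settings] is a hypothetical helper, and returning
| #[code None] for a missing #[code pipeline] is simply one way to signal
| that the default pipeline should be used.

```python
import json


def read_model_settings(meta_text):
    """Return (lang, pipeline) from the text of a meta.json file.

    Hypothetical helper: pipeline is None if the setting is absent,
    meaning the caller should fall back to the default pipeline.
    """
    meta = json.loads(meta_text)
    lang = meta["lang"]
    pipeline = meta.get("pipeline")
    if pipeline is not None:
        # The setting is a list of strings, applied in the given order.
        if not all(isinstance(name, str) for name in pipeline):
            raise ValueError("pipeline must be a list of strings")
    return lang, pipeline


with_pipeline = read_model_settings(
    '{"lang": "en", "pipeline": ["token_vectors", "tagger"]}'
)
without_pipeline = read_model_settings('{"lang": "en"}')
```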
p
| The #[code load()] method that comes with our model package
| templates will take care of putting all this together and returning a
| #[code Language] object with the loaded pipeline and data. If your model
| requires custom pipeline components, you should
| #[strong ship them with your model] and register their
| #[+a("/usage/processing-pipelines#creating-factory") factories]
| via #[+api("spacy#set_factory") #[code set_factory()]].
+aside-code("Factory example").
def my_factory(vocab):
    # load some state
    def my_component(doc):
        # process the doc
        return doc
    return my_component
+code.
spacy.set_factory('custom_component', custom_component_factory)
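p
| The factory pattern itself is a plain closure: the factory runs once and
| loads state, and the component it returns is called on every document.
| The sketch below runs without spaCy, so everything in it is a stand-in:
| the #[code FACTORIES] dict and the local #[code set_factory] function
| only imitate a registry, the vocab is a list of strings and the doc is a
| plain dict.

```python
# Hypothetical stand-in for a factory registry, so the sketch runs
# without spaCy installed. It is not spaCy's implementation.
FACTORIES = {}


def set_factory(name, factory):
    """Register a factory under a string ID (illustrative only)."""
    FACTORIES[name] = factory


def custom_component_factory(vocab):
    # Load some state once, when the pipeline is built.
    known_words = set(vocab)

    def custom_component(doc):
        # Process the doc. Here the doc is a plain dict, not a spaCy Doc.
        doc["n_known"] = sum(1 for word in doc["words"] if word in known_words)
        return doc

    return custom_component


set_factory("custom_component", custom_component_factory)

# Build the component from its registered ID, then apply it to a doc.
component = FACTORIES["custom_component"](["hello", "world"])
doc = component({"words": ["hello", "there", "world"]})
```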
+infobox("Custom models with pipeline components")
| For more details and an example of how to package a sentiment model
| with a custom pipeline component, see the usage guide on
| #[+a("/usage/processing-pipelines#example2") language processing pipelines].
+h(3, "models-building") Building the model package
p
| To build the package, run the following command from within the package
| directory. For more information on building Python packages, see the
| docs on Python's
| #[+a("https://setuptools.readthedocs.io/en/latest/") Setuptools].
+code(false, "bash").
python setup.py sdist
p
| This will create a #[code .tar.gz] archive in a directory #[code /dist].
| The model can be installed by pointing pip to the path of the archive:
+code(false, "bash").
pip install /path/to/en_example_model-1.0.0.tar.gz
p
| You can then load the model via its name, #[code en_example_model], or
| import it directly as a module and then call its #[code load()] method.
+h(3, "loading") Loading a custom model package
p
| To load a model from a data directory, you can use
| #[+api("spacy#load") #[code spacy.load()]] with the local path. This will
| look for a meta.json in the directory and use the #[code lang] and
| #[code pipeline] settings to initialise a #[code Language] class with a
| processing pipeline and load in the model data.
+code.
import spacy
nlp = spacy.load('/path/to/model')
p
| If you want to #[strong load only the binary data], you'll have to create
| a #[code Language] class and call
| #[+api("language#from_disk") #[code from_disk]] instead.
+code.
from spacy.lang.en import English
nlp = English().from_disk('/path/to/data')
+infobox("Important note: Loading data in v2.x")
.o-block
| In spaCy 1.x, the distinction between #[code spacy.load()] and the
| #[code Language] class constructor was quite unclear. You could call
| #[code spacy.load()] when no model was present, and it would silently
| return an empty object. Likewise, you could pass a path to
| #[code English], even if the model required a different language.
| spaCy v2.0 solves this with a clear distinction between setting up
| the instance and loading the data.
+code-new nlp = English().from_disk('/path/to/data')
+code-old nlp = spacy.load('en', path='/path/to/data')
+h(3, "example-training-spacy") Example: How we're training and packaging models for spaCy
p
| Publishing a new version of spaCy often means re-training all available
| models. Currently, that's #{MODEL_COUNT} models for #{MODEL_LANG_COUNT}
| languages. To make this run smoothly, we're using an automated build
| process and a #[+api("cli#train") #[code spacy train]] template that
| looks like this:
+code(false, "bash", "$", false, false, true).
spacy train {lang} {models_dir}/{name} {train_data} {dev_data} -m meta/{name}.json -V {version} -g {gpu_id} -n {n_epoch} -ns {n_sents}
+aside-code("meta.json template", "json").
{
    "lang": "en",
    "name": "core_web_sm",
    "license": "CC BY-SA 3.0",
    "author": "Explosion AI",
    "url": "https://explosion.ai",
    "email": "contact@explosion.ai",
    "sources": ["OntoNotes 5", "Common Crawl"],
    "description": "English multi-task CNN trained on OntoNotes, with GloVe vectors trained on common crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities."
}
p In a directory #[code meta], we keep #[code meta.json] templates for the individual models, containing all relevant information that doesn't change across versions, like the name, description, author info and training data sources. When we train the model, we pass in the path to the meta template as the #[code --meta] argument, and specify the current model version as the #[code --version] argument.
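p
| As an illustration, the command template shown earlier can be filled in
| with Python's string formatting. This is only a sketch of the idea, not
| our actual build script, and all values in #[code settings] are
| hypothetical.

```python
import shlex

# The command template from the text above.
TRAIN_TEMPLATE = (
    "spacy train {lang} {models_dir}/{name} {train_data} {dev_data} "
    "-m meta/{name}.json -V {version} -g {gpu_id} -n {n_epoch} -ns {n_sents}"
)

# Hypothetical values for a single model. A real build process would
# supply its own paths, version and training settings.
settings = {
    "lang": "en",
    "models_dir": "models",
    "name": "core_web_sm",
    "train_data": "train.json",
    "dev_data": "dev.json",
    "version": "2.0.0",
    "gpu_id": 0,
    "n_epoch": 30,
    "n_sents": 0,
}

command = TRAIN_TEMPLATE.format(**settings)
# Split into an argument list, e.g. for passing to subprocess.run().
args = shlex.split(command)
print(command)
```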
p On each epoch, the model is saved out with a #[code meta.json] using our template and added properties, like the #[code pipeline], #[code accuracy] scores and the #[code spacy_version] used to train the model. After training completion, the best model is selected automatically and packaged using the #[+api("cli#package") #[code package]] command. Since a full meta file is already present on the trained model, no further setup is required to build a valid model package.
+code(false, "bash").
spacy package -f {best_model} dist/
cd dist/{model_name}
python setup.py sdist
p This process allows us to quickly trigger the model training and build process for all available models and languages, and generate the correct meta data automatically.
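p
| To sketch how a per-epoch meta file and the model selection could fit
| together: the template is copied and extended with the properties that
| change per run, and the best epoch is picked by a score. Both helpers
| are hypothetical, and the selection criterion (the highest value of an
| assumed #[code uas] score) is an assumption, since the text above does
| not state how the best model is chosen.

```python
def build_epoch_meta(template, version, pipeline, accuracy, spacy_version):
    """Combine the static template with properties that change per run."""
    meta = dict(template)  # copy, so the template itself stays unchanged
    meta["version"] = version
    meta["pipeline"] = pipeline
    meta["accuracy"] = accuracy
    meta["spacy_version"] = spacy_version
    return meta


def select_best(epoch_metas, score_key):
    """Return the meta with the highest accuracy[score_key] (assumed criterion)."""
    return max(epoch_metas, key=lambda meta: meta["accuracy"][score_key])


template = {"lang": "en", "name": "core_web_sm", "license": "CC BY-SA 3.0"}

# Hypothetical scores for two epochs. "uas" is an assumed key name.
epochs = [
    build_epoch_meta(template, "2.0.0", ["tagger", "parser"],
                     {"uas": 88.1}, ">=2.0.0,<3.0.0"),
    build_epoch_meta(template, "2.0.0", ["tagger", "parser"],
                     {"uas": 89.4}, ">=2.0.0,<3.0.0"),
]
best = select_best(epochs, "uas")
```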