Models & Languages
spaCy's models can be installed as Python packages. This means that they're
a component of your application, just like any other module. They're versioned
and can be defined as a dependency in your requirements.txt. Models can be
installed from a download URL or a local directory, manually or via pip. Their
data can be located anywhere on your file system.
Important note
If you're upgrading to spaCy v3.x, you need to download the new models. If you've trained statistical models that use spaCy's annotations, you should retrain your models after updating spaCy. If you don't retrain, you may suffer train/test skew, which might decrease your accuracy.
Language support
spaCy currently provides support for the following languages. You can help by improving the existing language data and extending the tokenization patterns. See here for details on how to contribute to model development.
Usage note
If a model is available for a language, you can download it using the
spacy download command. In order to use languages that don't yet come with a
model, you have to import them directly, or use spacy.blank:
import spacy
from spacy.lang.fi import Finnish
nlp = Finnish()  # use directly
nlp = spacy.blank("fi")  # blank instance
If lemmatization rules are available for your language, make sure to install
spaCy with the lookups option, or install spacy-lookups-data separately in the
same environment:
$ pip install spacy[lookups]
Multi-language support
# Standard import
from spacy.lang.xx import MultiLanguage
nlp = MultiLanguage()
# With lazy-loading: get_lang_class returns the class, so instantiate it
from spacy.util import get_lang_class
nlp = get_lang_class("xx")()
spaCy also supports models trained on more than one language. This is especially
useful for named entity recognition. The language ID used for multi-language or
language-neutral models is xx. The language class, a generic subclass
containing only the base language data, can be found in lang/xx.
To load your model with the neutral, multi-language class, simply set
"language": "xx" in your model package's meta.json. You can also import the
class directly, or call util.get_lang_class() for lazy-loading.
Chinese language support
The Chinese language class supports three word segmentation options:
from spacy.lang.zh import Chinese
# Disable jieba to use character segmentation
Chinese.Defaults.use_jieba = False
nlp = Chinese()
# Disable jieba through tokenizer config options
cfg = {"use_jieba": False}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})
# Load with "default" model provided by pkuseg
cfg = {"pkuseg_model": "default", "require_pkuseg": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})
- Jieba: Chinese uses Jieba for word segmentation by default. It's enabled when you create a new Chinese language class or call spacy.blank("zh").
- Character segmentation: Character segmentation is supported by disabling jieba: set Chinese.Defaults.use_jieba = False before initializing the language class. As of spaCy v2.3.0, the meta tokenizer config options can also be used to configure use_jieba.
- PKUSeg: spaCy v2.3.0 added support for PKUSeg, which provides better segmentation for Chinese OntoNotes and the new Chinese models.
Note that pkuseg doesn't yet ship with pre-compiled wheels for Python 3.8. If
you're running Python 3.8, you can install it from our fork and compile it
locally:
$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
The meta argument of the Chinese language class supports the following
tokenizer config settings:
Name | Type | Description |
---|---|---|
pkuseg_model | str | Required: Name of a model provided by pkuseg or the path to a local model directory. |
pkuseg_user_dict | str | Optional path to a file with one word per line which overrides the default pkuseg user dictionary. |
require_pkuseg | bool | Overrides all jieba settings (optional but strongly recommended). |
### Examples
from spacy.lang.zh import Chinese
# Load "default" model
cfg = {"pkuseg_model": "default", "require_pkuseg": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})
# Load local model
cfg = {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})
# Override the user dictionary
cfg = {"pkuseg_model": "default", "require_pkuseg": True, "pkuseg_user_dict": "/path"}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})
You can also modify the user dictionary on-the-fly:
# Append words to user dict
nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
# Remove all words from user dict and replace with new words
nlp.tokenizer.pkuseg_update_user_dict(["中国"], reset=True)
# Remove all words from user dict
nlp.tokenizer.pkuseg_update_user_dict([], reset=True)
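The tokenizer settings from the table above all share the same nested shape, so a small helper can keep them consistent. The following is a minimal sketch: the function name chinese_tokenizer_meta is hypothetical and not part of spaCy, and it only builds the dictionary that would be passed as the meta argument.

```python
from typing import Optional


def chinese_tokenizer_meta(pkuseg_model: str = "default",
                           pkuseg_user_dict: Optional[str] = None) -> dict:
    """Build the meta dict for the Chinese class (hypothetical helper)."""
    # require_pkuseg overrides all jieba settings, as recommended in the table
    cfg = {"pkuseg_model": pkuseg_model, "require_pkuseg": True}
    if pkuseg_user_dict is not None:
        # Only include the optional user dictionary path when one is given
        cfg["pkuseg_user_dict"] = pkuseg_user_dict
    return {"tokenizer": {"config": cfg}}


meta = chinese_tokenizer_meta("/path/to/pkuseg_model")
# nlp = Chinese(meta=meta)  # requires spaCy and pkuseg to be installed
```

Building the dictionary in one place makes it harder to forget require_pkuseg when switching between the "default" model and a local one.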
The Chinese models provided by spaCy include a custom pkuseg model trained only
on Chinese OntoNotes 5.0, since the models provided by pkuseg include data
restricted to research use. For research use, pkuseg provides models for several
different domains ("default", "news", "web", "medicine", "tourism") and for
other uses, pkuseg provides a simple training API:
import pkuseg
from spacy.lang.zh import Chinese
# Train pkuseg model
pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")
# Load pkuseg model in spaCy Chinese tokenizer
nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True}}})
Japanese language support
from spacy.lang.ja import Japanese
# Load SudachiPy with split mode A (default)
nlp = Japanese()
# Load SudachiPy with split mode B
cfg = {"split_mode": "B"}
nlp = Japanese(meta={"tokenizer": {"config": cfg}})
The Japanese language class uses SudachiPy for word segmentation and
part-of-speech tagging. The default Japanese language class and the provided
Japanese models use SudachiPy split mode A. The meta argument of the Japanese
language class can be used to configure the split mode to A, B or C.
If you run into errors related to sudachipy, which is currently under active
development, we suggest downgrading to sudachipy==0.4.5, which is the version
used for training the current Japanese models.
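Since only the split modes A, B and C are accepted, it can help to validate the value before creating the language class. This is a small sketch under that assumption; japanese_tokenizer_meta is a hypothetical helper name, not a spaCy function, and it only builds the meta dictionary.

```python
VALID_SPLIT_MODES = ("A", "B", "C")


def japanese_tokenizer_meta(split_mode: str = "A") -> dict:
    """Build the meta dict for the Japanese class (hypothetical helper)."""
    if split_mode not in VALID_SPLIT_MODES:
        # Fail early with a clear message instead of deep inside the tokenizer
        raise ValueError(
            f"split_mode must be one of {VALID_SPLIT_MODES}, got {split_mode!r}"
        )
    return {"tokenizer": {"config": {"split_mode": split_mode}}}


meta = japanese_tokenizer_meta("B")
# nlp = Japanese(meta=meta)  # requires spaCy and sudachipy to be installed
```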
Installing and using models
The easiest way to download a model is via spaCy's download command. It takes
care of finding the best-matching model compatible with your spaCy installation.
Important note for v3.0
Note that as of spaCy v3.0, model shortcut links that create (potentially brittle) symlinks in your spaCy installation are deprecated. To download and load an installed model, use its full name:
- python -m spacy download en
+ python -m spacy download en_core_web_sm
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
# Download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm
# Download exact model version
python -m spacy download en_core_web_sm-2.2.0 --direct
The download command will install the model via pip and place the package in
your site-packages directory.
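If you want to trigger the download from a script, one option is to build the same command line and hand it to subprocess. The sketch below only constructs the argument list; download_command is a hypothetical helper, and the actual call is left commented out so nothing is downloaded by accident.

```python
import sys
from typing import List, Optional


def download_command(model: str, version: Optional[str] = None) -> List[str]:
    """Build the argv for `python -m spacy download` (hypothetical helper)."""
    # Use the current interpreter so the model lands in the same environment
    cmd = [sys.executable, "-m", "spacy", "download"]
    if version is None:
        # Let spaCy pick the best-matching version for this installation
        cmd.append(model)
    else:
        # Exact version: full "name-version" string plus the --direct flag
        cmd.extend([f"{model}-{version}", "--direct"])
    return cmd


cmd = download_command("en_core_web_sm")
# import subprocess; subprocess.check_call(cmd)  # uncomment to run the download
```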
pip install spacy
python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
Installation via pip
To download a model directly using pip, point pip install to the URL or local
path of the archive file. To find the direct link to a model, head over to the
model releases, right click on the archive link and copy it to your clipboard.
# With external URL
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
# With local file
pip install /Users/you/en_core_web_sm-3.0.0.tar.gz
By default, this will install the model into your site-packages directory. You
can then use spacy.load() to load it via its package name or import it
explicitly as a module. If you need to download models as part of an automated
process, we recommend using pip with a direct link, instead of relying on
spaCy's download command.
You can also add the direct download link to your application's
requirements.txt. For more details, see the section on working with models in
production.
Manual download and installation
In some cases, you might prefer downloading the data manually, for example to place it into a custom directory. You can download the model via your browser from the latest releases, or configure your own download script using the URL of the archive file. The archive consists of a model directory that contains another directory with the model data.
### Directory structure
└── en_core_web_md-3.0.0.tar.gz       # downloaded archive
    ├── meta.json                     # model meta data
    ├── setup.py                      # setup file for pip installation
    └── en_core_web_md                # 📦 model package
        ├── __init__.py               # init for pip installation
        ├── meta.json                 # model meta data
        └── en_core_web_md-3.0.0      # model data
You can place the model package directory anywhere on your local file system.
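When working with a manually unpacked archive, you may want to locate the versioned data directory inside the model package before loading it. The following sketch assumes the layout shown in the directory structure above, where the data directory is named after the package plus a version suffix; find_model_data_dir is a hypothetical helper, not part of spaCy.

```python
from pathlib import Path
from typing import Optional


def find_model_data_dir(package_dir: Path) -> Optional[Path]:
    """Return the versioned data directory inside a model package, if any."""
    # The data directory is assumed to be named "<package name>-<version>"
    prefix = package_dir.name + "-"
    for child in sorted(package_dir.iterdir()):
        if child.is_dir() and child.name.startswith(prefix):
            return child
    return None


# Example (the path is a placeholder):
# data_dir = find_model_data_dir(Path("/path/to/en_core_web_md"))
# nlp = spacy.load(data_dir)  # requires spaCy to be installed
```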
Using models with spaCy
To load a model, use spacy.load with the model's package name or a path to the
data directory:
Important note for v3.0
Note that as of spaCy v3.0, model shortcut links that create (potentially brittle) symlinks in your spaCy installation are deprecated. To load an installed model, use its full name:
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
import spacy
nlp = spacy.load("en_core_web_sm") # load model package "en_core_web_sm"
nlp = spacy.load("/path/to/en_core_web_sm") # load package from a directory
doc = nlp("This is a sentence.")
You can use the info command or spacy.info() method to print a model's meta data
before loading it. Each Language object with a loaded model also exposes the
model's meta data as the attribute meta. For example, nlp.meta['version'] will
return the model's version.
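As an illustration of working with the meta data, the sketch below rebuilds a model's full versioned name from a meta dictionary. It assumes the dictionary contains the keys lang, name and version; the example dictionary is made up for demonstration, and full_model_name is a hypothetical helper.

```python
def full_model_name(meta: dict) -> str:
    """Combine lang, name and version into the versioned package name."""
    return f"{meta['lang']}_{meta['name']}-{meta['version']}"


# Hand-written example data; with a loaded model you would pass nlp.meta
example_meta = {"lang": "en", "name": "core_web_sm", "version": "3.0.0"}
print(full_model_name(example_meta))  # en_core_web_sm-3.0.0
```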
Importing models as modules
If you've installed a model via spaCy's downloader, or directly via pip, you can
also import it and then call its load() method with no arguments:
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")
How you choose to load your models ultimately depends on personal preference.
However, for larger code bases, we usually recommend native imports, as this
will make it easier to integrate models with your existing build process,
continuous integration workflow and testing framework. It'll also prevent you
from ever trying to load a model that is not installed, as your code will raise
an ImportError immediately, instead of failing somewhere down the line when
calling spacy.load().
For more details, see the section on working with models in production.
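When the model name is only known at runtime, the same fail-fast behavior can be approximated with importlib. This is a sketch using only the standard library; both function names are hypothetical, and load_model_package assumes the imported package exposes a load() function as described above.

```python
import importlib
import importlib.util


def is_model_installed(package_name: str) -> bool:
    """Check whether a top-level package can be found, without importing it."""
    return importlib.util.find_spec(package_name) is not None


def load_model_package(package_name: str):
    """Import a model package by name and call its load() function."""
    # import_module raises ImportError right away if the package is missing
    module = importlib.import_module(package_name)
    return module.load()


# Example: only attempt to load the model if its package is available
# if is_model_installed("en_core_web_sm"):
#     nlp = load_model_package("en_core_web_sm")
```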
Using your own models
If you've trained your own model, for example for additional languages or
custom named entities, you can save its state using the Language.to_disk()
method. To make the model more convenient to deploy, we recommend wrapping it as
a Python package.
For more information and a detailed guide on how to package your model, see the documentation on saving and loading models.
Using models in production
If your application depends on one or more models, you'll usually want to integrate them into your continuous integration workflow and build process. While spaCy provides a range of useful helpers for downloading, linking and loading models, the underlying functionality is entirely based on native Python packages. This allows your application to handle a model like any other package dependency.
Downloading and requiring model dependencies
spaCy's built-in download command is mostly intended as a convenient,
interactive wrapper. It performs compatibility checks and prints detailed error
messages and warnings. However, if you're downloading models as part of an
automated build process, this only adds an unnecessary layer of complexity. If
you know which models your application needs, you should specify them directly.
Because all models are valid Python packages, you can add them to your
application's requirements.txt. If you're running your own internal PyPI
installation, you can upload the models there. pip's requirements file format
supports both package names to download via a PyPI server and direct URLs.
### requirements.txt
spacy>=2.2.0,<3.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en_core_web_sm
Specifying #egg= with the package name tells pip which package to expect from
the download URL. This way, the package won't be re-downloaded and overwritten
if it's already installed, just like when you're downloading a package from
PyPI.
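If you pin several models, generating the requirement lines from a name and version can avoid copy-paste mistakes. The sketch below follows the URL pattern of the example above; model_requirement is a hypothetical helper, and the pattern is an assumption based on that single example rather than a documented guarantee.

```python
BASE_URL = "https://github.com/explosion/spacy-models/releases/download"


def model_requirement(name: str, version: str) -> str:
    """Build a requirements.txt line for a model release (hypothetical helper)."""
    archive = f"{name}-{version}"
    # The #egg= fragment tells pip which package to expect from the URL
    return f"{BASE_URL}/{archive}/{archive}.tar.gz#egg={name}"


print(model_requirement("en_core_web_sm", "2.2.0"))
```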
All models are versioned and specify their spaCy dependency. This ensures
cross-compatibility and lets you specify exact version requirements for each
model. If you've trained your own model, you can use the package command to
generate the required meta data and turn it into a loadable package.
Loading and testing models
Models are regular Python packages, so you can also import them as a package
using Python's native import syntax, and then call the load method to load the
model data and return an nlp object:
import en_core_web_sm
nlp = en_core_web_sm.load()
In general, this approach is recommended for larger code bases, as it's more
"native", and doesn't depend on symlinks or rely on spaCy's loader to resolve
string names to model packages. If a model can't be imported, Python will raise
an ImportError immediately. And if a model is imported but not used, any linter
will catch that.
Similarly, it'll give you more flexibility when writing tests that require
loading models. For example, instead of writing your own try and except logic
around spaCy's loader, you can use pytest's importorskip() method to only run a
test if a specific model or model version is installed. Each model package
exposes a __version__ attribute which you can also use to perform your own
version compatibility checks before loading a model.
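A hand-rolled compatibility check on __version__ could look like the sketch below. It assumes plain numeric versions such as "3.0.0" and does not handle pre-release or development suffixes; both function names are hypothetical.

```python
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum."""
    # Tuples compare element by element, so (2, 10, 0) > (2, 2, 0)
    return version_tuple(installed) >= version_tuple(minimum)


# Example: guard a load call on the model's version
# import en_core_web_sm
# if meets_minimum(en_core_web_sm.__version__, "3.0.0"):
#     nlp = en_core_web_sm.load()
```

In a pytest suite, pytest.importorskip("en_core_web_sm", minversion="3.0.0") covers a similar case by skipping the test when the package is missing or too old.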