* Switch to mecab-ko as default Korean tokenizer
Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.
Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.
* Temporarily run tests with mecab-ko tokenizer
* Fix types
* Fix duplicate test names
* Update requirements test
* Revert "Temporarily run tests with mecab-ko tokenizer"
This reverts commit d2083e7044.
* Add mecab_args setting, fix pickle for KoreanNattoTokenizer
* Fix length check
* Update docs
* Formatting
* Update natto-py error message
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Models & Languages
spaCy's trained pipelines can be installed as Python packages. This means
that they're a component of your application, just like any other module.
They're versioned and can be defined as a dependency in your requirements.txt.
Trained pipelines can be installed from a download URL or a local directory,
manually or via pip. Their data can be located anywhere on your file system.
Important note
If you're upgrading to spaCy v3.x, you need to download the new pipeline packages. If you've trained your own pipelines, you need to retrain them after updating spaCy.
Quickstart
import QuickstartModels from 'widgets/quickstart-models.js'
Usage note
If lemmatization rules are available for your language, make sure to install spaCy with the lookups option, or install spacy-lookups-data separately in the same environment:
$ pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS
If a trained pipeline is available for a language, you can download it using the
spacy download command as shown above. In order to use languages that don't yet
come with a trained pipeline, you have to import them directly, or use
spacy.blank:
import spacy
from spacy.lang.yo import Yoruba
nlp = Yoruba()  # use directly
nlp = spacy.blank("yo")  # blank instance
A blank pipeline is typically just a tokenizer. You might want to create a blank
pipeline when you only need a tokenizer, when you want to add more components
from scratch, or for testing purposes. Initializing the language object directly
yields the same result as generating it using spacy.blank(). In both cases the
default configuration for the chosen language is loaded, and no pretrained
components will be available.
Language support
spaCy currently provides support for the following languages. You can help by improving the existing language data and extending the tokenization patterns. See here for details on how to contribute to development. Also see the training documentation for how to train your own pipelines on your data.
import Languages from 'widgets/languages.js'
Multi-language support
# Standard import
from spacy.lang.xx import MultiLanguage
nlp = MultiLanguage()
# With lazy-loading
nlp = spacy.blank("xx")
spaCy also supports pipelines trained on more than one language. This is
especially useful for named entity recognition. The language ID used for
multi-language or language-neutral pipelines is xx. The language class, a
generic subclass containing only the base language data, can be found in
lang/xx.
To train a pipeline using the neutral multi-language class, you can set
lang = "xx" in your training config. You can also import the MultiLanguage
class directly, or call spacy.blank("xx") for lazy-loading.
Chinese language support
The Chinese language class supports three word segmentation options, char,
jieba and pkuseg.
Manual setup
from spacy.lang.zh import Chinese
# Character segmentation (default)
nlp = Chinese()
# Jieba
cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
# PKUSeg with "mixed" model provided by pkuseg
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="mixed")
### config.cfg
[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "char"
Segmenter | Description |
---|---|
char | Character segmentation: Character segmentation is the default segmentation option. It's enabled when you create a new Chinese language class or call spacy.blank("zh"). |
jieba | Jieba: To use Jieba for word segmentation, you can set the option segmenter to "jieba". |
pkuseg | PKUSeg: As of spaCy v2.3.0, support for PKUSeg has been added to support better segmentation for Chinese OntoNotes and the provided Chinese pipelines. Enable PKUSeg by setting tokenizer option segmenter to "pkuseg". |
In v3.0, the default word segmenter has switched from Jieba to character
segmentation. Because the pkuseg segmenter depends on a model that can be
loaded from a file, the model is loaded on initialization (typically before
training). This ensures that your packaged Chinese model doesn't depend on a
local path at runtime.
The initialize method for the Chinese tokenizer class supports the following
config settings for loading pkuseg models:
Name | Description |
---|---|
pkuseg_model | Name of a model provided by spacy-pkuseg or the path to a local model directory. |
pkuseg_user_dict | Optional path to a file with one word per line which overrides the default pkuseg user dictionary. Defaults to "default", the default provided dictionary. |
The initialization settings are typically provided in the training config and the data is loaded in before training and serialized with the model. This allows you to load the data from a local path and save out your pipeline and config, without requiring the same local path at runtime. See the usage guide on the config lifecycle for more background on this.
### config.cfg
[initialize]
[initialize.tokenizer]
pkuseg_model = "/path/to/model"
pkuseg_user_dict = "default"
You can also initialize the tokenizer for a blank language class by calling its
initialize method:
### Examples
# Initialize the pkuseg tokenizer
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
# Load spaCy's OntoNotes model
nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes")
# Load pkuseg's "news" model
nlp.tokenizer.initialize(pkuseg_model="news")
# Load local model
nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
# Override the user dictionary
nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes", pkuseg_user_dict="/path/to/user_dict")
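The user dictionary passed as pkuseg_user_dict is described above as a file with one word per line. A minimal sketch of producing such a file is shown below. The helper name write_user_dict is hypothetical and not part of spaCy or pkuseg, and UTF-8 encoding is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_user_dict(path, words: Iterable[str]) -> Path:
    """Write a user dictionary file with one word per line (hypothetical helper)."""
    path = Path(path)
    # Drop empty entries and surrounding whitespace, keep the original order.
    cleaned = [word.strip() for word in words if word.strip()]
    # Assumption: the file is read as UTF-8.
    path.write_text("\n".join(cleaned) + "\n", encoding="utf8")
    return path


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    dict_path = write_user_dict(Path(tmp) / "user_dict.txt", ["中国", "ABC", " "])
    lines = dict_path.read_text(encoding="utf8").splitlines()

print(len(lines), "words written")
```

The resulting path could then be passed as the pkuseg_user_dict argument, as in the example above.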
You can also modify the user dictionary on-the-fly:
# Append words to user dict
nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
# Remove all words from user dict and replace with new words
nlp.tokenizer.pkuseg_update_user_dict(["中国"], reset=True)
# Remove all words from user dict
nlp.tokenizer.pkuseg_update_user_dict([], reset=True)
The Chinese pipelines provided by spaCy include a custom pkuseg model trained
only on Chinese OntoNotes 5.0, since the models provided by pkuseg include data
restricted to research use. For research use, pkuseg provides models for
several different domains ("mixed" (equivalent to "default" from pkuseg
packages), "news", "web", "medicine", "tourism") and for other uses, pkuseg
provides a simple training API:
import spacy_pkuseg as pkuseg
from spacy.lang.zh import Chinese
# Train pkuseg model
pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")
# Load pkuseg model in spaCy Chinese tokenizer
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
Japanese language support
Manual setup
from spacy.lang.ja import Japanese
# Load SudachiPy with split mode A (default)
nlp = Japanese()
# Load SudachiPy with split mode B
cfg = {"split_mode": "B"}
nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
The Japanese language class uses SudachiPy for word segmentation and
part-of-speech tagging. The default Japanese language class and the provided
Japanese pipelines use SudachiPy split mode A. The tokenizer config can be used
to configure the split mode to A, B or C.
### config.cfg
[nlp.tokenizer]
@tokenizers = "spacy.ja.JapaneseTokenizer"
split_mode = "A"
Extra information, such as reading, inflection form, and the SudachiPy
normalized form, is available in Token.morph. For B or C split modes,
subtokens are stored in Doc.user_data["sub_tokens"].
If you run into errors related to sudachipy, which is currently under active
development, we suggest downgrading to sudachipy==0.4.9, which is the version
used for training the current Japanese pipelines.
Korean language support
There are currently three built-in options for Korean tokenization, two based on mecab-ko and one using the rule-based tokenizer.
Default mecab-ko tokenizer
# uses mecab-ko-dic
nlp = spacy.blank("ko")
# with custom mecab args
mecab_args = "-d /path/to/dicdir -u /path/to/userdic"
config = {"nlp": {"tokenizer": {"mecab_args": mecab_args}}}
nlp = spacy.blank("ko", config=config)
The default MeCab-based Korean tokenizer requires the Python package mecab-ko
and no further system requirements.
The natto-py MeCab-based tokenizer (the previous default for spaCy v3.4 and
earlier) is available as spacy.KoreanNattoTokenizer.v1. It requires mecab-ko,
mecab-ko-dic and natto-py.
To use this tokenizer, edit [nlp.tokenizer] in your config:
natto-py MeCab-ko tokenizer
config = {"nlp": {"tokenizer": {"@tokenizers": "spacy.KoreanNattoTokenizer.v1"}}}
nlp = spacy.blank("ko", config=config)
### config.cfg
[nlp]
lang = "ko"
tokenizer = {"@tokenizers": "spacy.KoreanNattoTokenizer.v1"}
For some Korean datasets and tasks, the rule-based tokenizer is better-suited than MeCab. To configure a Korean pipeline with the rule-based tokenizer:
Rule-based tokenizer
config = {"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}}
nlp = spacy.blank("ko", config=config)
### config.cfg
[nlp]
lang = "ko"
tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}
The Korean trained pipelines use the rule-based tokenizer, so no additional dependencies are required.
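The three Korean tokenizer options differ only in the config override passed to spacy.blank. As a rough sketch, the override could be built by a small helper like the one below. The function name korean_config and the option labels "mecab-ko", "natto-py" and "rule-based" are hypothetical. Only the registered tokenizer names and the mecab_args setting come from the examples above.

```python
from typing import Any, Dict, Optional

# Registered tokenizer names taken from the config examples above.
# The dictionary keys are hypothetical labels chosen for this sketch.
TOKENIZERS = {
    "natto-py": "spacy.KoreanNattoTokenizer.v1",
    "rule-based": "spacy.Tokenizer.v1",
}


def korean_config(option: str = "mecab-ko", mecab_args: Optional[str] = None) -> Dict[str, Any]:
    """Build the config override for one of the three Korean tokenizer options."""
    if option == "mecab-ko":
        # The default tokenizer needs no override unless custom MeCab args are given.
        tokenizer = {"mecab_args": mecab_args} if mecab_args else {}
    elif option in TOKENIZERS:
        tokenizer = {"@tokenizers": TOKENIZERS[option]}
    else:
        raise ValueError(f"Unknown tokenizer option: {option!r}")
    return {"nlp": {"tokenizer": tokenizer}} if tokenizer else {}


config = korean_config("rule-based")
print(config)
# With spaCy installed: nlp = spacy.blank("ko", config=config)
```

Keeping the override in one place makes it easier to switch tokenizers per dataset without editing each call site.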
Installing and using trained pipelines
The easiest way to download a trained pipeline is via spaCy's download command.
It takes care of finding the best-matching package compatible with your spaCy
installation.
Important note for v3.0
Note that as of spaCy v3.0, shortcut links like en that create (potentially brittle) symlinks in your spaCy installation are deprecated. To download and load an installed pipeline package, use its full name:
- python -m spacy download en
+ python -m spacy download en_core_web_sm
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
# Download best-matching version of a package for your spaCy installation
$ python -m spacy download en_core_web_sm
# Download exact package version
$ python -m spacy download en_core_web_sm-3.0.0 --direct
The download command will install the package via pip and place the package in
your site-packages directory.
$ pip install -U %%SPACY_PKG_NAME%%SPACY_PKG_FLAGS
$ python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
If you're in a Jupyter notebook or similar environment, you can use the !
prefix to execute commands. Make sure to restart your kernel or runtime after
installation (just like you would when installing other Python packages) to
make sure that the installed pipeline package can be found.
!python -m spacy download en_core_web_sm
Installation via pip
To download a trained pipeline directly using pip, point pip install to the URL
or local path of the wheel file or archive. Installing the wheel is usually more
efficient. To find the direct link to a package, head over to the releases,
right click on the archive link and copy it to your clipboard.
# With external URL
$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl
$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
# With local file
$ pip install /Users/you/en_core_web_sm-3.0.0-py3-none-any.whl
$ pip install /Users/you/en_core_web_sm-3.0.0.tar.gz
By default, this will install the pipeline package into your site-packages
directory. You can then use spacy.load to load it via its package name or
import it explicitly as a module. If you need to download pipeline packages as
part of an automated process, we recommend using pip with a direct link,
instead of relying on spaCy's download command.
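For automated installs, the direct link can be assembled from the package name and version. The sketch below follows the URL pattern of the example links above and assumes that the same pattern holds for other releases. The helper names pipeline_url and pip_install_command are hypothetical and not part of spaCy.

```python
import sys
from typing import List

# Base URL taken from the example links above.
RELEASES_URL = "https://github.com/explosion/spacy-models/releases/download"


def pipeline_url(name: str, version: str, wheel: bool = True) -> str:
    """Build a direct download URL for a pipeline package release."""
    tag = f"{name}-{version}"
    filename = f"{tag}-py3-none-any.whl" if wheel else f"{tag}.tar.gz"
    return f"{RELEASES_URL}/{tag}/{filename}"


def pip_install_command(target: str) -> List[str]:
    """Return the argv for installing a URL or local path with this interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", target]


url = pipeline_url("en_core_web_sm", "3.0.0")
command = pip_install_command(url)
print(" ".join(command))
# To actually install, run: subprocess.run(command, check=True)
```

Using sys.executable keeps the install in the same environment as the running interpreter, which matters when several Python installations are present.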
You can also add the direct download link to your application's
requirements.txt. For more details, see the section on working with pipeline
packages in production.
Manual download and installation
In some cases, you might prefer downloading the data manually, for example to place it into a custom directory. You can download the package via your browser from the latest releases, or configure your own download script using the URL of the archive file. The archive consists of a package directory that contains another directory with the pipeline data.
### Directory structure
└── en_core_web_md-3.0.0.tar.gz       # downloaded archive
    ├── setup.py                      # setup file for pip installation
    ├── meta.json                     # copy of pipeline meta
    └── en_core_web_md                # 📦 pipeline package
        ├── __init__.py               # init for pip installation
        └── en_core_web_md-3.0.0      # pipeline data
            ├── config.cfg            # pipeline config
            ├── meta.json             # pipeline meta
            └── ...                   # directories with component data
You can place the pipeline package directory anywhere on your local file system.
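After unpacking an archive manually, you may want to locate the pipeline data directory inside the package directory. The sketch below assumes the layout shown above, where the data directory is the subdirectory containing both config.cfg and meta.json. The helper name find_pipeline_data_dir is hypothetical, and the demo recreates the layout with empty stand-in files.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_pipeline_data_dir(package_dir) -> Optional[Path]:
    """Return the first subdirectory that holds both config.cfg and meta.json."""
    for child in sorted(Path(package_dir).iterdir()):
        has_config = (child / "config.cfg").is_file()
        has_meta = (child / "meta.json").is_file()
        if child.is_dir() and has_config and has_meta:
            return child
    return None


# Demo: recreate the layout shown above with empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "en_core_web_md"
    data_dir = package_dir / "en_core_web_md-3.0.0"
    data_dir.mkdir(parents=True)
    (package_dir / "__init__.py").touch()
    (data_dir / "config.cfg").touch()
    (data_dir / "meta.json").touch()
    found = find_pipeline_data_dir(package_dir)
    found_name = found.name if found else None

print(found_name)  # en_core_web_md-3.0.0
```

The path returned for a real package is the kind of directory path that spacy.load accepts.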
Installation from Python
Since the spacy download command installs the pipeline as a Python package, we
always recommend running it from the command line, just like you install other
Python packages with pip install. However, if you need to, or if you want to
integrate the download process into another CLI command, you can also import
and call the download function used by the CLI via Python.
Keep in mind that the download command installs a Python package into your
environment. In order for it to be found after installation, you will need to
restart or reload your Python process so that new packages are recognized.
import spacy
spacy.cli.download("en_core_web_sm")
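Before relying on a freshly installed package, a script can check whether the current process is able to find it. The following is a minimal sketch using only the standard library. The helper name is_installed is hypothetical, and clearing the import caches may not be enough in every environment, so restarting the process remains the reliable option.

```python
import importlib
import importlib.util


def is_installed(package_name: str) -> bool:
    """Check whether a top-level package can be found by the current process."""
    # Clear the import system's cached directory listings so that packages
    # installed after startup have a chance of being picked up.
    importlib.invalidate_caches()
    return importlib.util.find_spec(package_name) is not None


print(is_installed("json"))  # True (standard library module)
print(is_installed("en_core_web_sm"))  # depends on your environment
```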
Using trained pipelines with spaCy
To load a pipeline package, use spacy.load with the package name or a path to
the data directory:
import spacy
nlp = spacy.load("en_core_web_sm") # load package "en_core_web_sm"
nlp = spacy.load("/path/to/en_core_web_sm") # load package from a directory
doc = nlp("This is a sentence.")
You can use the info command or spacy.info() method to print a pipeline
package's meta data before loading it. Each Language object with a loaded
pipeline also exposes the pipeline's meta data as the attribute meta. For
example, nlp.meta['version'] will return the package version.
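If you only have a pipeline data directory on disk, its meta.json can also be read directly with the standard library. This is a sketch rather than spaCy's own mechanism. The helper name read_pipeline_meta is hypothetical, and the demo writes a stand-in meta.json with made-up values.

```python
import json
import tempfile
from pathlib import Path


def read_pipeline_meta(data_dir) -> dict:
    """Load meta.json from a pipeline data directory without loading the pipeline."""
    with (Path(data_dir) / "meta.json").open(encoding="utf8") as file_:
        return json.load(file_)


# Demo with a stand-in directory; the values below are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_meta = {"lang": "en", "name": "core_web_sm", "version": "3.0.0"}
    (Path(tmp) / "meta.json").write_text(json.dumps(demo_meta), encoding="utf8")
    meta = read_pipeline_meta(tmp)

print(meta["version"])  # 3.0.0
```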
Importing pipeline packages as modules
If you've installed a trained pipeline via spacy download or directly via pip,
you can also import it and then call its load() method with no arguments:
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")
How you choose to load your trained pipelines ultimately depends on personal
preference. However, for larger code bases, we usually recommend native
imports, as this will make it easier to integrate pipeline packages with your
existing build process, continuous integration workflow and testing framework.
It'll also prevent you from ever trying to load a package that is not installed,
as your code will raise an ImportError immediately, instead of failing
somewhere down the line when calling spacy.load(). For more details, see the
section on working with pipeline packages in production.
Using trained pipelines in production
If your application depends on one or more trained pipeline packages, you'll usually want to integrate them into your continuous integration workflow and build process. While spaCy provides a range of useful helpers for downloading and loading pipeline packages, the underlying functionality is entirely based on native Python packaging. This allows your application to handle a spaCy pipeline like any other package dependency.
Downloading and requiring package dependencies
spaCy's built-in download command is mostly intended as a convenient,
interactive wrapper. It performs compatibility checks and prints detailed error
messages and warnings. However, if you're downloading pipeline packages as part
of an automated build process, this only adds an unnecessary layer of
complexity. If you know which packages your application needs, you should be
specifying them directly.
Because pipeline packages are valid Python packages, you can add them to your
application's requirements.txt. If you're running your own internal PyPI
installation, you can upload the pipeline packages there. pip's requirements
file format supports both package names to download via a PyPI server, as well
as direct URLs.
### requirements.txt
spacy>=3.0.0,<4.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz#egg=en_core_web_sm
Specifying #egg= with the package name tells pip which package to expect from
the download URL. This way, the package won't be re-downloaded and overwritten
if it's already installed, just like when you're downloading a package from
PyPI.
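To illustrate what the #egg= fragment carries, the sketch below extracts the package name from a requirement line using the standard library. It only shows where the name sits in the URL and is not how pip itself parses requirements. The helper name egg_name is hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def egg_name(requirement: str) -> Optional[str]:
    """Return the value of the #egg= fragment of a requirement URL, if any."""
    fragment = urlparse(requirement.strip()).fragment
    for part in fragment.split("&"):
        key, _, value = part.partition("=")
        if key == "egg" and value:
            return value
    return None


line = (
    "https://github.com/explosion/spacy-models/releases/download/"
    "en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz#egg=en_core_web_sm"
)
print(egg_name(line))  # en_core_web_sm
print(egg_name("spacy>=3.0.0,<4.0.0"))  # None
```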
All pipeline packages are versioned and specify their spaCy dependency. This
ensures cross-compatibility and lets you specify exact version requirements for
each pipeline. If you've trained your own pipeline, you can use the spacy
package command to generate the required meta data and turn it into a loadable
package.
Loading and testing pipeline packages
Pipeline packages are regular Python packages, so you can also import them as a
package using Python's native import syntax, and then call the load method to
load the data and return an nlp object:
import en_core_web_sm
nlp = en_core_web_sm.load()
In general, this approach is recommended for larger code bases, as it's more
"native", and doesn't rely on spaCy's loader to resolve string names to
packages. If a package can't be imported, Python will raise an ImportError
immediately. And if a package is imported but not used, any linter will catch
that.
Similarly, it'll give you more flexibility when writing tests that require
loading pipelines. For example, instead of writing your own try and except
logic around spaCy's loader, you can use pytest's importorskip() method to only
run a test if a specific pipeline package or version is installed. Each pipeline
package exposes a __version__ attribute which you can also use to perform your
own version compatibility checks before loading it.
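A version compatibility check based on __version__ could look roughly like the sketch below. The helper names version_tuple and import_pipeline are hypothetical, and the version parsing is deliberately simplified (it stops at the first non-numeric part). A real project would probably rely on a dedicated version-parsing library instead.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Convert '3.0.0' to (3, 0, 0), stopping at the first non-numeric part."""
    parts = []
    for part in version.split("."):
        if not part.isdigit():
            break
        parts.append(int(part))
    # Pad to three components so that "3.0" compares equal to "3.0.0".
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def import_pipeline(name: str, min_version: str = "0") -> Optional[ModuleType]:
    """Import a pipeline package, or return None if it is missing or too old."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    installed = version_tuple(getattr(module, "__version__", "0"))
    if installed < version_tuple(min_version):
        return None
    return module


module = import_pipeline("en_core_web_sm", min_version="3.0.0")
if module is None:
    print("en_core_web_sm >= 3.0.0 is not available")
else:
    print("found en_core_web_sm", module.__version__)
    # The pipeline could now be loaded with: nlp = module.load()
```

Returning None instead of raising lets the caller decide whether a missing pipeline should skip a test, fall back to another package, or abort.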