17 KiB
title | teaser | menu | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
What's New in v3.0 | New features, backwards incompatibilities and migration guide |
|
Summary
New Features
New training workflow and config system
Transformer-based pipelines
Custom models using any framework
Manage end-to-end workflows with projects
New built-in pipeline components
Name | Description |
---|---|
SentenceRecognizer |
Trainable component for sentence segmentation. |
Morphologizer |
Trainable component to predict morphological features. |
Lemmatizer |
Standalone component for rule-based and lookup lemmatization. |
AttributeRuler |
Component for setting token attributes using match patterns. |
Transformer |
Component for using transformer models in your pipeline, accessing outputs and aligning tokens. Provided via spacy-transformers . |
New and improved pipeline component APIs
Language.factory
,Language.component
Language.analyze_pipes
- Adding components from other models
Type hints and type-based data validation
spaCy v3.0 officially drops support for Python 2 and now requires Python
3.6+. This also means that the code base can take full advantage of
type hints. spaCy's user-facing
API that's implemented in pure Python (as opposed to Cython) now comes with type
hints. The new version of spaCy's machine learning library
Thinc also features extensive
type support, including custom
types for models and arrays, and a custom mypy
plugin that can be used to
type-check model definitions.
For data validation, spacy v3.0 adopts
pydantic
. It also powers the data
validation of Thinc's config system, which
lets you to register custom functions with typed arguments, reference them
in your config and see validation errors if the argument values don't match.
CLI
Name | Description |
---|---|
init config |
Initialize a training config file for a blank language or auto-fill a partial config. |
debug config |
Debug a training config file and show validation errors. |
project |
Subcommand for cloning and running spaCy projects. |
Backwards Incompatibilities
As always, we've tried to keep the breaking changes to a minimum and focus on changes that were necessary to support the new features, fix problems or improve usability. The following section lists the relevant changes to the user-facing API. For specific examples of how to rewrite your code, check out the migration guide.
Compatibility
- spaCy now requires Python 3.6+.
API changes
Language.add_pipe
now takes the string name of the component factory instead of the component function.- Custom pipeline components now needs to be decorated with the
@Language.component
or@Language.factory
decorator. Language.update
now takes a batch ofExample
objects instead of raw texts and annotations, orDoc
andGoldParse
objects.- The
Language.disable_pipes
contextmanager has been replaced byLanguage.select_pipes
, which can explicitly disable or enable components.
Removed or renamed API
Removed | Replacement |
---|---|
Language.disable_pipes |
Language.select_pipes |
GoldParse |
Example |
GoldCorpus |
Corpus |
spacy debug-data |
spacy debug data |
spacy link , util.set_data_path , util.get_data_path |
not needed, model symlinks are deprecated |
The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been deprecated for a while and many would previously raise errors. Many of them were also mostly internals. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.
Removed | Replacement |
---|---|
Doc.tokens_from_list |
Doc.__init__ |
Doc.merge , Span.merge |
Doc.retokenize |
Token.string , Span.string , Span.upper , Span.lower |
Span.text , Token.text |
Language.tagger , Language.parser , Language.entity |
Language.get_pipe |
keyword-arguments like vocab=False on to_disk , from_disk , to_bytes , from_bytes |
exclude=["vocab"] |
n_threads argument on Tokenizer , Matcher , PhraseMatcher |
n_process |
SentenceSegmenter hook, SimilarityHook |
user hooks, Sentencizer , SentenceRecognizer |
Migrating from v2.x
Downloading and loading models
Model symlinks and shortcuts like en
are now officially deprecated. There are
many different models with different capabilities and not just one
"English model". In order to download and load a model, you should always use
its full name – for instance, en_core_web_sm
.
- python -m spacy download en
+ python -m spacy download en_core_web_sm
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
Custom pipeline components and factories
Custom pipeline components now have to be registered explicitly using the
@Language.component
or
@Language.factory
decorator. For simple functions
that take a Doc
and return it, all you have to do is add the
@Language.component
decorator to it and assign it a name:
### Stateless function components
+ from spacy.language import Language
+ @Language.component("my_component")
def my_component(doc):
return doc
For class components that are initialized with settings and/or the shared nlp
object, you can use the @Language.factory
decorator. Also make sure that that
the method used to initialize the factory has two named arguments: nlp
(the current nlp
object) and name
(the string name of the component
instance).
### Stateful class components
+ from spacy.language import Language
+ @Language.factory("my_component")
class MyComponent:
- def __init__(self, nlp):
+ def __init__(self, nlp, name):
self.nlp = nlp
def __call__(self, doc):
return doc
Instead of decorating your class, you could also add a factory function that
takes the arguments nlp
and name
and returns an instance of your component:
### Stateful class components with factory function
+ from spacy.language import Language
+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+ return MyComponent(nlp)
class MyComponent:
def __init__(self, nlp):
self.nlp = nlp
def __call__(self, doc):
return doc
The @Language.component
and @Language.factory
decorators now take care of
adding an entry to the component factories, so spaCy knows how to load a
component back in from its string name. You won't have to write to
Language.factories
manually anymore.
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
Adding components to the pipeline
The nlp.add_pipe
method now takes the string
name of the component factory instead of a callable component. This allows
spaCy to track and serialize components that have been added and their settings.
+ @Language.component("my_component")
def my_component(doc):
return doc
- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")
nlp.add_pipe
now also returns the pipeline component
itself, so you can access its attributes. The
nlp.create_pipe
method is now mostly internals
and you typically shouldn't have to use it in your code.
- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")
Training models
To train your models, you should now pretty much always use the
spacy train
CLI. You shouldn't have to put together your own
training scripts anymore, unless you really want to. The training commands now
use a flexible config file that describes all training
settings and hyperparameters, as well as your pipeline, model components and
architectures to use. The --code
argument lets you pass in code containing
custom registered functions that you can
reference in your config.
Binary .spacy training data format
spaCy now uses a new
binary training data format, which is much
smaller and consists of Doc
objects, serialized via the
DocBin
. You can convert your existing JSON-formatted data using
the spacy convert
command, which outputs .spacy
files:
$ python -m spacy convert ./training.json ./output
Training config
### {wrap="true"}
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./config.cfg --output ./output
The easiest way to get started with an end-to-end training process is to clone a project template. Projects let you manage multi-step workflows, from data preprocessing to training and packaging your model.
Migrating training scripts to CLI command and config
Training via the Python API
Packaging models
The spacy package
command now automatically builds the
installable .tar.gz
sdist of the Python package, so you don't have to run this
step manually anymore. You can disable the behavior by setting the --no-sdist
flag.
python -m spacy package ./model ./packages
- cd /output/en_model-0.0.0
- python setup.py sdist
Migration notes for plugin maintainers
Thanks to everyone who's been contributing to the spaCy ecosystem by developing and maintaining one of the many awesome plugins and extensions. We've tried to make it as easy as possible for you to upgrade your packages for spaCy v3. The most common use case for plugins is providing pipeline components and extension attributes. When migrating your plugin, double-check the following:
- Use the
@Language.factory
decorator to register your component and assign it a name. This allows users to refer to your components by name and serialize pipelines referencing them. Remove all manual entries to theLanguage.factories
. - Make sure your component factories take at least two named arguments:
nlp
(the currentnlp
object) andname
(the instance name of the added component so you can identify multiple instances of the same component). - Update all references to
nlp.add_pipe
in your docs to use string names instead of the component functions.
### {highlight="1-5"}
from spacy.language import Language
@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
return MyCoolComponent(some_setting=some_setting)
class MyCoolComponent:
def __init__(self, some_setting):
self.some_setting = some_setting
def __call__(self, doc):
# Do something to the doc
return doc
Result in config.cfg
[components.my_component] factory = "my_component" some_setting = true
import spacy
from your_plugin import MyCoolComponent
nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})
The @Language.factory
decorator takes care of letting
spaCy know that a component of that name is available. This means that your
users can add it to the pipeline using its string name. However, this
requires the decorator to be executed – so users will still have to import
your plugin. Alternatively, your plugin could expose an
entry point, which spaCy can read from.
This means that spaCy knows how to initialize my_component
, even if your
package isn't imported.