What's New in v3.0
New features, backwards incompatibilities and migration guide
Summary
New Features
Backwards Incompatibilities
Removed or renamed objects, methods, attributes and arguments
| Removed | Replacement |
| --- | --- |
| `GoldParse` | `Example` |
| `GoldCorpus` | `Corpus` |
| `spacy debug-data` | `spacy debug data` |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
Removed deprecated methods, attributes and arguments
The following deprecated methods, attributes and arguments were removed in v3.0. Most of them had been deprecated for a while, many previously raised errors and many were mostly internal. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.
| Removed | Replacement |
| --- | --- |
| `Doc.tokens_from_list` | `Doc.__init__` |
| `Doc.merge`, `Span.merge` | `Doc.retokenize` |
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower` | `Span.text`, `Token.text` |
| `Language.tagger`, `Language.parser`, `Language.entity` | `Language.get_pipe` |
| keyword arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
| `n_threads` argument on `Tokenizer`, `Matcher`, `PhraseMatcher` | `n_process` |
| `SentenceSegmenter` hook, `SimilarityHook` | user hooks, `Sentencizer`, `SentenceRecognizer` |
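For example, code that merged spans in place with `Span.merge` can be rewritten with the `Doc.retokenize` context manager. A minimal sketch (the text and span indices are just illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in New York City")

# Before (v2.x): doc[3:6].merge()
# After (v3.0): merge inside the retokenizer context manager
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:6])  # merge "New York City" into one token

print([token.text for token in doc])
```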
Migrating from v2.x
Downloading and loading models
Model symlinks and shortcuts like `en` are now officially deprecated. There are many different models with different capabilities and not just one "English model". In order to download and load a model, you should always use its full name – for instance, `en_core_web_sm`.
```diff
- python -m spacy download en
+ python -m spacy download en_core_web_sm
```

```diff
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
```
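If the package is installed in your environment, you can also import it directly and call its `load()` method. A small sketch, assuming `en_core_web_sm` has already been downloaded:

```python
# The downloaded model is a regular Python package and can be imported directly
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")
```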
Custom pipeline components and factories
Custom pipeline components now have to be registered explicitly using the `@Language.component` or `@Language.factory` decorator. For simple functions that take a `Doc` and return it, all you have to do is add the `@Language.component` decorator to it and assign it a name:
### Stateless function components
```diff
+ from spacy.language import Language

+ @Language.component("my_component")
  def my_component(doc):
      return doc
```
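Once registered, the component can be added to any pipeline by its string name. A minimal, self-contained sketch using a blank English pipeline (chosen purely for illustration):

```python
import spacy
from spacy.language import Language

@Language.component("my_component")
def my_component(doc):
    # Stateless component: receives the Doc and returns it
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("my_component")
print(nlp.pipe_names)  # ['my_component']
```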
For class components that are initialized with settings and/or the shared `nlp` object, you can use the `@Language.factory` decorator. Also make sure that the method used to initialize the factory has two named arguments: `nlp` (the current `nlp` object) and `name` (the string name of the component instance).
### Stateful class components
```diff
+ from spacy.language import Language

+ @Language.factory("my_component")
  class MyComponent:
-     def __init__(self, nlp):
+     def __init__(self, nlp, name):
          self.nlp = nlp

      def __call__(self, doc):
          return doc
```
Instead of decorating your class, you could also add a factory function that takes the arguments `nlp` and `name` and returns an instance of your component:
### Stateful class components with factory function
```diff
+ from spacy.language import Language

+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+     return MyComponent(nlp)

  class MyComponent:
      def __init__(self, nlp):
          self.nlp = nlp

      def __call__(self, doc):
          return doc
```
The `@Language.component` and `@Language.factory` decorators now take care of adding an entry to the component factories, so spaCy knows how to load a component back in from its string name. You won't have to write to `Language.factories` manually anymore.
```diff
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
```
Adding components to the pipeline
The `nlp.add_pipe` method now takes the string name of the component factory instead of a callable component. This allows spaCy to track and serialize components that have been added and their settings.
```diff
+ @Language.component("my_component")
  def my_component(doc):
      return doc

- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")
```
`nlp.add_pipe` now also returns the pipeline component itself, so you can access its attributes. The `nlp.create_pipe` method is now mostly internal and you typically shouldn't have to use it in your code.
```diff
- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")
```
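Because `nlp.add_pipe` returns the component, you can work with the instance right away. A small sketch using the built-in `sentencizer` as an example:

```python
import spacy

nlp = spacy.blank("en")
# add_pipe returns the component instance, so you can access it directly
sentencizer = nlp.add_pipe("sentencizer")
print(type(sentencizer).__name__)   # Sentencizer

doc = nlp("This is a sentence. This is another one.")
print(len(list(doc.sents)))         # 2
```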
Training models
To train your models, you should now pretty much always use the `spacy train` CLI. You shouldn't have to put together your own training scripts anymore, unless you really want to. The training commands now use a flexible config file that describes all training settings and hyperparameters, as well as your pipeline, model components and architectures to use. The `--code` argument lets you pass in code containing custom registered functions that you can reference in your config.
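For example, a custom component referenced in your config can live in a plain Python file that you pass to `--code`. A sketch, assuming a hypothetical `functions.py` and a component name that matches the one used in your config:

```python
# functions.py (hypothetical file passed via --code)
from spacy.language import Language

@Language.component("custom_cleaner")
def custom_cleaner(doc):
    # Registered function that the config can reference by its string name
    return doc

# CLI usage (sketch):
# python -m spacy train ./config.cfg --output ./output --code ./functions.py
```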
Binary .spacy training data format
spaCy now uses a new binary training data format, which is much smaller and consists of `Doc` objects, serialized via the `DocBin`. You can convert your existing JSON-formatted data using the `spacy convert` command, which outputs `.spacy` files:
```bash
$ python -m spacy convert ./training.json ./output
```
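You can also create `.spacy` files directly from `Doc` objects using `DocBin`. A minimal sketch with made-up example texts:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()
for text in ["A made-up example sentence.", "Another one for illustration."]:
    doc_bin.add(nlp(text))

# Write the binary training data to disk
doc_bin.to_disk("./train.spacy")
```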
Training config
```diff
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./config.cfg --output ./output
```
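If you don't have a config yet, the `init config` command can generate a starter config for you. A sketch; the language and pipeline components shown are just examples:

```bash
# Generate a starter config.cfg for an English tagger + parser pipeline
python -m spacy init config ./config.cfg --lang en --pipeline tagger,parser
```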
The easiest way to get started with an end-to-end training process is to clone a project template. Projects let you manage multi-step workflows, from data preprocessing to training and packaging your model.
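A sketch of what cloning and running a template could look like; the template name used here is just an example from the projects repo:

```bash
# Clone an example template from the spaCy projects repo (name is illustrative)
python -m spacy project clone pipelines/tagger_parser_ud
cd tagger_parser_ud
# Run the workflow defined in the template's project.yml
python -m spacy project run all
```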
Migrating training scripts to CLI command and config
Training via the Python API
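If you do need to drive training from Python, the v3 API works with `Example` objects instead of the removed `GoldParse`. A minimal sketch with made-up annotations, not a full training loop:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("GPE")

# Build Example objects from (text, annotations) pairs
train_data = [("I live in Berlin", {"entities": [(10, 16, "GPE")]})]
examples = [Example.from_dict(nlp.make_doc(t), ann) for t, ann in train_data]

optimizer = nlp.initialize(lambda: examples)
losses = {}
nlp.update(examples, sgd=optimizer, losses=losses)
print(losses)
```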
Packaging models
The `spacy package` command now automatically builds the installable `.tar.gz` sdist of the Python package, so you don't have to run this step manually anymore. You can disable the behavior by setting the `--no-sdist` flag.
```diff
  python -m spacy package ./model ./packages
- cd /output/en_model-0.0.0
- python setup.py sdist
```
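The built sdist can then be installed with pip and loaded like any other package. A sketch, reusing the package name from the example above; the exact path depends on your package name and version:

```bash
# Install the built sdist (path depends on your package name and version)
pip install ./packages/en_model-0.0.0/dist/en_model-0.0.0.tar.gz

# The installed package can then be loaded by name
python -c "import spacy; nlp = spacy.load('en_model')"
```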
Migration notes for plugin maintainers
Thanks to everyone who's been contributing to the spaCy ecosystem by developing and maintaining one of the many awesome plugins and extensions. We've tried to keep breaking changes to a minimum and make it as easy as possible for you to upgrade your packages for spaCy v3.
Custom pipeline components
The most common use case for plugins is providing pipeline components and extension attributes.
- Use the `@Language.factory` decorator to register your component and assign it a name. This allows users to refer to your components by name and serialize pipelines referencing them. Remove all manual entries to `Language.factories`.
- Make sure your component factories take at least two named arguments: `nlp` (the current `nlp` object) and `name` (the instance name of the added component so you can identify multiple instances of the same component).
- Update all references to `nlp.add_pipe` in your docs to use string names instead of the component functions.
```python
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
    return MyCoolComponent(some_setting=some_setting)

class MyCoolComponent:
    def __init__(self, some_setting):
        self.some_setting = some_setting

    def __call__(self, doc):
        # Do something to the doc
        return doc
```
Result in `config.cfg`:
```ini
[components.my_component]
factory = "my_component"
some_setting = true
```
```diff
  import spacy
  from your_plugin import MyCoolComponent

  nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})
```
The `@Language.factory` decorator takes care of letting spaCy know that a component of that name is available. This means that your users can add it to the pipeline using its string name. However, this requires the decorator to be executed – so users will still have to import your plugin. Alternatively, your plugin could expose an entry point, which spaCy can read from. This means that spaCy knows how to initialize `my_component`, even if your package isn't imported.
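A sketch of what such an entry point could look like in your package's `setup.py`; the package and function names are hypothetical and should match your own plugin:

```python
# setup.py (sketch): expose the factory via the "spacy_factories" entry point
from setuptools import setup

setup(
    name="your_plugin",
    entry_points={
        "spacy_factories": [
            # "<factory name> = <module>:<factory function>"
            "my_component = your_plugin:create_component",
        ],
    },
)
```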