spaCy/website/docs/usage/v3.md
2020-08-06 13:10:15 +02:00

12 KiB
Raw Blame History

title teaser menu
What's New in v3.0 New features, backwards incompatibilities and migration guide
Summary
summary
New Features
features
Backwards Incompatibilities
incompat
Migrating from v2.x
migrating
Migrating plugins
plugins

Summary

New Features

Backwards Incompatibilities

Removed or renamed objects, methods, attributes and arguments

Removed Replacement
GoldParse Example
GoldCorpus Corpus
spacy debug-data spacy debug data
spacy link, util.set_data_path, util.get_data_path not needed, model symlinks are deprecated

Removed deprecated methods, attributes and arguments

The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been deprecated for a while and many would previously raise errors. Many of them were also mostly internals. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.

Removed Replacement
Doc.tokens_from_list Doc.__init__
Doc.merge, Span.merge Doc.retokenize
Token.string, Span.string, Span.upper, Span.lower Span.text, Token.text
Language.tagger, Language.parser, Language.entity Language.get_pipe
keyword-arguments like vocab=False on to_disk, from_disk, to_bytes, from_bytes exclude=["vocab"]
n_threads argument on Tokenizer, Matcher, PhraseMatcher n_process
SentenceSegmenter hook, SimilarityHook user hooks, Sentencizer, SentenceRecognizer

Migrating from v2.x

Downloading and loading models

Model symlinks and shortcuts like en are now officially deprecated. There are many different models with different capabilities and not just one "English model". In order to download and load a model, you should always use its full name for instance, en_core_web_sm.

- python -m spacy download en
+ python -m spacy download en_core_web_sm
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")

Custom pipeline components and factories

Custom pipeline components now have to be registered explicitly using the @Language.component or @Language.factory decorator. For simple functions that take a Doc and return it, all you have to do is add the @Language.component decorator to it and assign it a name:

### Stateless function components
+ from spacy.language import Language

+ @Language.component("my_component")
def my_component(doc):
    return doc

For class components that are initialized with settings and/or the shared nlp object, you can use the @Language.factory decorator. Also make sure that that the method used to initialize the factory has two named arguments: nlp (the current nlp object) and name (the string name of the component instance).

### Stateful class components
+ from spacy.language import Language

+ @Language.factory("my_component")
class MyComponent:
-   def __init__(self, nlp):
+   def __init__(self, nlp, name):
        self.nlp = nlp

    def __call__(self, doc):
        return doc

Instead of decorating your class, you could also add a factory function that takes the arguments nlp and name and returns an instance of your component:

### Stateful class components with factory function
+ from spacy.language import Language

+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+     return MyComponent(nlp)

class MyComponent:
    def __init__(self, nlp):
        self.nlp = nlp

    def __call__(self, doc):
        return doc

The @Language.component and @Language.factory decorators now take care of adding an entry to the component factories, so spaCy knows how to load a component back in from its string name. You won't have to write to Language.factories manually anymore.

- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)

Adding components to the pipeline

The nlp.add_pipe method now takes the string name of the component factory instead of a callable component. This allows spaCy to track and serialize components that have been added and their settings.

+ @Language.component("my_component")
def my_component(doc):
    return doc

- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")

nlp.add_pipe now also returns the pipeline component itself, so you can access its attributes. The nlp.create_pipe method is now mostly internals and you typically shouldn't have to use it in your code.

- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")

Training models

To train your models, you should now pretty much always use the spacy train CLI. You shouldn't have to put together your own training scripts anymore, unless you really want to. The training commands now use a flexible config file that describes all training settings and hyperparameters, as well as your pipeline, model components and architectures to use. The --code argument lets you pass in code containing custom registered functions that you can reference in your config.

Binary .spacy training data format

spaCy now uses a new binary training data format, which is much smaller and consists of Doc objects, serialized via the DocBin. You can convert your existing JSON-formatted data using the spacy convert command, which outputs .spacy files:

$ python -m spacy convert ./training.json ./output

Training config

### {wrap="true"}
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./train.spacy ./dev.spacy ./config.cfg --output ./output

The easiest way to get started with an end-to-end training process is to clone a project template. Projects let you manage multi-step workflows, from data preprocessing to training and packaging your model.

Migrating training scripts to CLI command and config

Training via the Python API

Packaging models

The spacy package command now automatically builds the installable .tar.gz sdist of the Python package, so you don't have to run this step manually anymore. You can disable the behavior by setting the --no-sdist flag.

python -m spacy package ./model ./packages
- cd /output/en_model-0.0.0
- python setup.py sdist

Migration notes for plugin maintainers

Thanks to everyone who's been contributing to the spaCy ecosystem by developing and maintaining one of the many awesome plugins and extensions. We've tried to keep breaking changes to a minimum and make it as easy as possible for you to upgrade your packages for spaCy v3.

Custom pipeline components

The most common use case for plugins is providing pipeline components and extension attributes.

  • Use the @Language.factory decorator to register your component and assign it a name. This allows users to refer to your components by name and serialize pipelines referencing them. Remove all manual entries to the Language.factories.
  • Make sure your component factories take at least two named arguments: nlp (the current nlp object) and name (the instance name of the added component so you can identify multiple instances of the same component).
  • Update all references to nlp.add_pipe in your docs to use string names instead of the component functions.
### {highlight="1-5"}
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
    return MyCoolComponent(some_setting=some_setting)


class MyCoolComponent:
    def __init__(self, some_setting):
        self.some_setting = some_setting

    def __call__(self, doc):
        # Do something to the doc
        return doc

Result in config.cfg

[components.my_component]
factory = "my_component"
some_setting = true
import spacy
from your_plugin import MyCoolComponent

nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})

The @Language.factory decorator takes care of letting spaCy know that a component of that name is available. This means that your users can add it to the pipeline using its string name. However, this requires the decorator to be executed so users will still have to import your plugin. Alternatively, your plugin could expose an entry point, which spaCy can read from. This means that spaCy knows how to initialize my_component, even if your package isn't imported.