---
title: What's New in v3.0
teaser: New features, backwards incompatibilities and migration guide
menu:
  - ['Summary', 'summary']
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v2.x', 'migrating']
  - ['Migrating plugins', 'plugins']
---

## Summary {#summary}

## New Features {#features}

## Backwards Incompatibilities {#incompat}

### Removed or renamed objects, methods, attributes and arguments {#incompat-removed}

| Removed | Replacement |
| -------------------------------------------------------- | ----------------------------------------- |
| `GoldParse` | [`Example`](/api/example) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |

### Removed deprecated methods, attributes and arguments {#incompat-removed-deprecated}

The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while** and many previously raised
errors. Many of them were also mostly internal. If you've been working with
more recent versions of spaCy v2.x, it's **unlikely** that your code relied on
them.

| Removed | Replacement |
| ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Doc.tokens_from_list` | [`Doc.__init__`](/api/doc#init) |
| `Doc.merge`, `Span.merge` | [`Doc.retokenize`](/api/doc#retokenize) |
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower` | [`Span.text`](/api/span#attributes), [`Token.text`](/api/token#attributes) |
| `Language.tagger`, `Language.parser`, `Language.entity` | [`Language.get_pipe`](/api/language#get_pipe) |
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` |
| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |

## Migrating from v2.x {#migrating}

### Downloading and loading models {#migrating-downloading-models}

Model symlinks and shortcuts like `en` are now officially deprecated. There are
[many different models](/models) with different capabilities and not just one
"English model". In order to download and load a model, you should always use
its full name – for instance, `en_core_web_sm`.

```diff
- python -m spacy download en
+ python -m spacy download en_core_web_sm
```

```diff
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
```

### Custom pipeline components and factories {#migrating-pipeline-components}

Custom pipeline components now have to be registered explicitly using the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator. For simple functions
that take a `Doc` and return it, all you have to do is add the
`@Language.component` decorator to it and assign it a name:

```diff
### Stateless function components
+ from spacy.language import Language

+ @Language.component("my_component")
def my_component(doc):
    return doc
```

For class components that are initialized with settings and/or the shared `nlp`
object, you can use the `@Language.factory` decorator. Also make sure that the
method used to initialize the factory has **two named arguments**: `nlp` (the
current `nlp` object) and `name` (the string name of the component instance).

```diff
### Stateful class components
+ from spacy.language import Language

+ @Language.factory("my_component")
class MyComponent:
-   def __init__(self, nlp):
+   def __init__(self, nlp, name):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

Instead of decorating your class, you could also add a factory function that
takes the arguments `nlp` and `name` and returns an instance of your component:

```diff
### Stateful class components with factory function
+ from spacy.language import Language

+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+     return MyComponent(nlp)

class MyComponent:
    def __init__(self, nlp):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

The `@Language.component` and `@Language.factory` decorators now take care of
adding an entry to the component factories, so spaCy knows how to load a
component back in from its string name. You won't have to write to
`Language.factories` manually anymore.

```diff
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
```

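To make the idea of registering by string name more concrete, here is a deliberately simplified, plain-Python sketch of a decorator-based registry. It is only an illustration of the pattern and not how spaCy implements it: `FACTORIES`, `factory`, `add_pipe` and the `shouter` component are all hypothetical names invented for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a factory registry; not spaCy's real data structure.
FACTORIES: Dict[str, Callable] = {}


def factory(name: str) -> Callable:
    """Decorator that records a factory function under a string name."""

    def register(func: Callable) -> Callable:
        FACTORIES[name] = func
        return func

    return register


@factory("shouter")
def create_shouter(nlp, name):
    # A "component" here is just a callable that transforms a string.
    return lambda text: text.upper()


def add_pipe(pipeline: List[Tuple[str, Callable]], factory_name: str) -> Callable:
    """Look up the factory by name, build the component and store it."""
    component = FACTORIES[factory_name](nlp=None, name=factory_name)
    pipeline.append((factory_name, component))
    return component


pipeline: List[Tuple[str, Callable]] = []
shouter = add_pipe(pipeline, "shouter")
print([name for name, _ in pipeline], shouter("hello"))
```

Because the pipeline only needs to remember the string `"shouter"`, it can be rebuilt later from that name alone, which is the property the decorators provide.
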
#### Adding components to the pipeline {#migrating-add-pipe}

The [`nlp.add_pipe`](/api/language#add_pipe) method now takes the **string
name** of the component factory instead of a callable component. This allows
spaCy to track and serialize components that have been added and their settings.

```diff
+ @Language.component("my_component")
def my_component(doc):
    return doc

- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")
```

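Put together, the registration and `add_pipe` steps might look like the following runnable sketch. It assumes spaCy v3 is installed; the component name `token_counter` and the `user_data` key are made up for illustration, and `spacy.blank("en")` is used so that no trained model has to be downloaded.

```python
import spacy
from spacy.language import Language


# Hypothetical component name, chosen only for this sketch.
@Language.component("token_counter")
def token_counter(doc):
    # Stateless component: receives a Doc, modifies it and returns it.
    doc.user_data["n_tokens"] = len(doc)
    return doc


# A blank English pipeline only contains the tokenizer.
nlp = spacy.blank("en")
# Add the component via its registered string name, not the function itself.
nlp.add_pipe("token_counter")

doc = nlp("Hello world")
print(nlp.pipe_names)
print(doc.user_data["n_tokens"])
```
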
[`nlp.add_pipe`](/api/language#add_pipe) now also returns the pipeline component
itself, so you can access its attributes. The
[`nlp.create_pipe`](/api/language#create_pipe) method is now mostly internal
and you typically shouldn't have to use it in your code.

```diff
- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")
```

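As a small illustration of the return value, the sketch below adds the built-in `sentencizer` to a blank pipeline and keeps a reference to it. It assumes spaCy v3 is installed; the example text is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
# add_pipe creates the built-in component from its factory name and returns it.
sentencizer = nlp.add_pipe("sentencizer")

# The returned object is the same instance that lives in the pipeline.
same_instance = nlp.get_pipe("sentencizer") is sentencizer

doc = nlp("This is a sentence. This is another one.")
sentences = [sent.text for sent in doc.sents]
print(same_instance, sentences)
```
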
### Training models {#migrating-training}

To train your models, you should now pretty much always use the
[`spacy train`](/api/cli#train) CLI. You shouldn't have to put together your own
training scripts anymore, unless you _really_ want to. The training commands now
use a [flexible config file](/usage/training#config) that describes all training
settings and hyperparameters, as well as your pipeline, model components and
architectures to use. The `--code` argument lets you pass in code containing
[custom registered functions](/usage/training#custom-code) that you can
reference in your config.

#### Binary .spacy training data format {#migrating-training-format}

spaCy now uses a new
[binary training data format](/api/data-formats#binary-training), which is much
smaller and consists of `Doc` objects, serialized via the
[`DocBin`](/api/docbin). You can convert your existing JSON-formatted data using
the [`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:

```bash
$ python -m spacy convert ./training.json ./output
```

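If your data already lives in Python, a `.spacy` file can also be written directly with `DocBin`. The following is a minimal sketch that assumes spaCy v3 is installed; the texts and the file name `train.spacy` are placeholders, and real training data would additionally carry annotations such as entities or tags on the `Doc` objects.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
texts = ["First training text.", "Second training text."]

# Collect Doc objects in a DocBin; these docs are unannotated placeholders.
doc_bin = DocBin()
for doc in nlp.pipe(texts):
    doc_bin.add(doc)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"  # hypothetical file name
    doc_bin.to_disk(path)
    # Load the file back and restore the Doc objects using a shared vocab.
    loaded = DocBin().from_disk(path)
    docs = list(loaded.get_docs(nlp.vocab))

print([doc.text for doc in docs])
```
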
#### Training config {#migrating-training-config}

<!-- TODO: update once we have recommended "getting started with a new config" workflow -->

```diff
### {wrap="true"}
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./train.spacy ./dev.spacy ./config.cfg --output ./output
```

<Project id="some_example_project">

The easiest way to get started with an end-to-end training process is to clone a
[project](/usage/projects) template. Projects let you manage multi-step
workflows, from data preprocessing to training and packaging your model.

</Project>

#### Migrating training scripts to CLI command and config {#migrating-training-scripts}

<!-- TODO: write -->

#### Training via the Python API {#migrating-training-python}

<!-- TODO: this should explain the GoldParse -> Example stuff -->

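Until this section is written, the following sketch hints at how an [`Example`](/api/example) can take the place of a `GoldParse`. It assumes the released spaCy v3 API, where `Example` is importable from `spacy.training`; the text and entity offsets are invented for illustration.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
text = "Apple is opening a store in Berlin"
# Gold-standard annotations as a dict; entity offsets are character positions.
annotations = {"entities": [(0, 5, "ORG"), (28, 34, "GPE")]}

# The Example pairs an unannotated "predicted" Doc with an annotated "reference" Doc.
example = Example.from_dict(nlp.make_doc(text), annotations)
gold_ents = [(ent.text, ent.label_) for ent in example.reference.ents]
print(gold_ents)
```

In a custom training loop, a list of such `Example` objects would then be passed to `nlp.update` instead of separate docs and gold parses.
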
#### Packaging models {#migrating-training-packaging}

The [`spacy package`](/api/cli#package) command now automatically builds the
installable `.tar.gz` sdist of the Python package, so you don't have to run this
step manually anymore. You can disable the behavior by setting the `--no-sdist`
flag.

```diff
python -m spacy package ./model ./packages
- cd /output/en_model-0.0.0
- python setup.py sdist
```

## Migration notes for plugin maintainers {#plugins}

Thanks to everyone who's been contributing to the spaCy ecosystem by developing
and maintaining one of the many awesome [plugins and extensions](/universe).
We've tried to keep breaking changes to a minimum and make it as easy as
possible for you to upgrade your packages for spaCy v3.

### Custom pipeline components

The most common use case for plugins is providing pipeline components and
extension attributes.

- Use the [`@Language.factory`](/api/language#factory) decorator to register
  your component and assign it a name. This allows users to refer to your
  components by name and serialize pipelines referencing them. Remove all manual
  entries in `Language.factories`.
- Make sure your component factories take at least two **named arguments**:
  `nlp` (the current `nlp` object) and `name` (the instance name of the added
  component so you can identify multiple instances of the same component).
- Update all references to [`nlp.add_pipe`](/api/language#add_pipe) in your docs
  to use **string names** instead of the component functions.

```python
### {highlight="1-5"}
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
    return MyCoolComponent(some_setting=some_setting)


class MyCoolComponent:
    def __init__(self, some_setting):
        self.some_setting = some_setting

    def __call__(self, doc):
        # Do something to the doc
        return doc
```

> #### Result in config.cfg
>
> ```ini
> [components.my_component]
> factory = "my_component"
> some_setting = true
> ```

```diff
import spacy
from your_plugin import MyCoolComponent

nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})
```

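For a version of this pattern that runs on its own, the sketch below registers a factory with a default config and overrides the setting when adding the component. It assumes spaCy v3 is installed; `LengthFlagger`, the factory name `length_flagger` and the `max_tokens` setting are hypothetical and only stand in for a real plugin.

```python
import spacy
from spacy.language import Language


class LengthFlagger:
    """Hypothetical plugin component that flags docs above a token limit."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens

    def __call__(self, doc):
        doc.user_data["too_long"] = len(doc) > self.max_tokens
        return doc


@Language.factory("length_flagger", default_config={"max_tokens": 100})
def create_length_flagger(nlp: Language, name: str, max_tokens: int):
    # nlp and name are supplied by spaCy; max_tokens comes from the config.
    return LengthFlagger(max_tokens=max_tokens)


nlp = spacy.blank("en")
# Override the default setting via the config argument.
flagger = nlp.add_pipe("length_flagger", config={"max_tokens": 3})
doc = nlp("This text has more than three tokens")
print(flagger.max_tokens, doc.user_data["too_long"])
```
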
<Infobox title="Important note on registering factories" variant="warning">

The [`@Language.factory`](/api/language#factory) decorator takes care of letting
spaCy know that a component of that name is available. This means that your
users can add it to the pipeline using its **string name**. However, this
requires the decorator to be executed – so users will still have to **import
your plugin**. Alternatively, your plugin could expose an
[entry point](/usage/saving-loading#entry-points), which spaCy can read from.
This means that spaCy knows how to initialize `my_component`, even if your
package isn't imported.

</Infobox>

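The entry point option could be declared in the plugin's `setup.py` roughly as follows. This is a packaging fragment rather than a complete setup script: the package name `your_plugin` and the function `create_component` are taken from the examples on this page and stand in for your own names, while `spacy_factories` is the entry point group that spaCy reads factories from.

```python
from setuptools import setup

setup(
    name="your_plugin",  # placeholder package name
    entry_points={
        # "<factory name> = <module>:<factory function>"
        "spacy_factories": ["my_component = your_plugin:create_component"],
    },
)
```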