spaCy/website/docs/usage/v3.md

364 lines
18 KiB
Markdown
Raw Normal View History

2020-07-01 14:02:17 +03:00
---
title: What's New in v3.0
teaser: New features, backwards incompatibilities and migration guide
menu:
- ['Summary', 'summary']
- ['New Features', 'features']
- ['Backwards Incompatibilities', 'incompat']
- ['Migrating from v2.x', 'migrating']
---
## Summary {#summary}
## New Features {#features}
2020-08-10 01:01:38 +03:00
### New training workflow and config system {#features-training}
### Transformer-based pipelines {#features-transformers}
### Custom models using any framework {#feautres-custom-models}
### Manage end-to-end workflows with projects {#features-projects}
### New built-in pipeline components {#features-pipeline-components}
| Name | Description |
| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. |
| [`Morphologizer`](/api/morphologizer) | Trainable component to predict morphological features. |
| [`Lemmatizer`](/api/lemmatizer) | Standalone component for rule-based and lookup lemmatization. |
| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
### New and improved pipeline component APIs {#features-components}
- `Language.factory`, `Language.component`
- `Language.analyze_pipes`
- Adding components from other models
### Type hints and type-based data validation {#features-types}
spaCy v3.0 officially drops support for Python 2 and now requires **Python
3.6+**. This also means that the code base can take full advantage of
[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing
API that's implemented in pure Python (as opposed to Cython) now comes with type
hints. The new version of spaCy's machine learning library
[Thinc](https://thinc.ai) also features extensive
[type support](https://thinc.ai/docs/usage-type-checking/), including custom
types for models and arrays, and a custom `mypy` plugin that can be used to
type-check model definitions.
For data validation, spacy v3.0 adopts
[`pydantic`](https://github.com/samuelcolvin/pydantic). It also powers the data
validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which
lets you to register **custom functions with typed arguments**, reference them
in your config and see validation errors if the argument values don't match.
### CLI
| Name | Description |
| --------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| [`init config`](/api/cli#init-config) | Initialize a [training config](/usage/training) file for a blank language or auto-fill a partial config. |
| [`debug config`](/api/cli#debug-config) | Debug a [training config](/usage/training) file and show validation errors. |
| [`project`](/api/cli#project) | Subcommand for cloning and running [spaCy projects](/usage/projects). |
2020-07-01 14:02:17 +03:00
## Backwards Incompatibilities {#incompat}
2020-08-10 01:01:38 +03:00
As always, we've tried to keep the breaking changes to a minimum and focus on
changes that were necessary to support the new features, fix problems or improve
usability. The following section lists the relevant changes to the user-facing
API. For specific examples of how to rewrite your code, check out the
[migration guide](#migrating).
### Compatibility {#incompat-compat}
- spaCy now requires **Python 3.6+**.
2020-07-25 19:51:12 +03:00
2020-08-10 01:01:38 +03:00
### API changes {#incompat-api}
2020-08-10 01:01:38 +03:00
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
the component factory instead of the component function.
- **Custom pipeline components** now needs to be decorated with the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator.
- [`Language.update`](/api/language#update) now takes a batch of
[`Example`](/api/example) objects instead of raw texts and annotations, or
`Doc` and `GoldParse` objects.
- The `Language.disable_pipes` contextmanager has been replaced by
[`Language.select_pipes`](/api/language#select_pipes), which can explicitly
disable or enable components.
### Removed or renamed API {#incompat-removed}
| Removed | Replacement |
| -------------------------------------------------------- | ----------------------------------------------------- |
| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) |
| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
2020-07-25 19:51:12 +03:00
The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while** and many would previously
raise errors. Many of them were also mostly internals. If you've been working
with more recent versions of spaCy v2.x, it's **unlikely** that your code relied
on them.
2020-07-25 19:51:12 +03:00
2020-07-29 12:36:42 +03:00
| Removed | Replacement |
| ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Doc.tokens_from_list` | [`Doc.__init__`](/api/doc#init) |
| `Doc.merge`, `Span.merge` | [`Doc.retokenize`](/api/doc#retokenize) |
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower` | [`Span.text`](/api/span#attributes), [`Token.text`](/api/token#attributes) |
| `Language.tagger`, `Language.parser`, `Language.entity` | [`Language.get_pipe`](/api/language#get_pipe) |
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` |
| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentenceregognizer) |
2020-07-25 19:51:12 +03:00
2020-07-01 14:02:17 +03:00
## Migrating from v2.x {#migrating}
2020-07-27 01:29:45 +03:00
2020-07-29 12:36:42 +03:00
### Downloading and loading models {#migrating-downloading-models}
Model symlinks and shortcuts like `en` are now officially deprecated. There are
[many different models](/models) with different capabilities and not just one
"English model". In order to download and load a model, you should always use
its full name for instance, `en_core_web_sm`.
```diff
- python -m spacy download en
+ python -m spacy download en_core_web_sm
```
```diff
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
```
### Custom pipeline components and factories {#migrating-pipeline-components}
Custom pipeline components now have to be registered explicitly using the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator. For simple functions
that take a `Doc` and return it, all you have to do is add the
`@Language.component` decorator to it and assign it a name:
```diff
### Stateless function components
+ from spacy.language import Language
+ @Language.component("my_component")
def my_component(doc):
return doc
```
For class components that are initialized with settings and/or the shared `nlp`
object, you can use the `@Language.factory` decorator. Also make sure that that
the method used to initialize the factory has **two named arguments**: `nlp`
(the current `nlp` object) and `name` (the string name of the component
instance).
```diff
### Stateful class components
+ from spacy.language import Language
+ @Language.factory("my_component")
class MyComponent:
- def __init__(self, nlp):
+ def __init__(self, nlp, name):
self.nlp = nlp
def __call__(self, doc):
return doc
```
Instead of decorating your class, you could also add a factory function that
takes the arguments `nlp` and `name` and returns an instance of your component:
```diff
### Stateful class components with factory function
+ from spacy.language import Language
+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+ return MyComponent(nlp)
class MyComponent:
def __init__(self, nlp):
self.nlp = nlp
def __call__(self, doc):
return doc
```
The `@Language.component` and `@Language.factory` decorators now take care of
adding an entry to the component factories, so spaCy knows how to load a
component back in from its string name. You won't have to write to
`Language.factories` manually anymore.
```diff
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
```
#### Adding components to the pipeline {#migrating-add-pipe}
The [`nlp.add_pipe`](/api/language#add_pipe) method now takes the **string
name** of the component factory instead of a callable component. This allows
spaCy to track and serialize components that have been added and their settings.
```diff
+ @Language.component("my_component")
def my_component(doc):
return doc
- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")
```
[`nlp.add_pipe`](/api/language#add_pipe) now also returns the pipeline component
itself, so you can access its attributes. The
[`nlp.create_pipe`](/api/language#create_pipe) method is now mostly internals
and you typically shouldn't have to use it in your code.
```diff
- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")
```
### Training models {#migrating-training}
To train your models, you should now pretty much always use the
[`spacy train`](/api/cli#train) CLI. You shouldn't have to put together your own
training scripts anymore, unless you _really_ want to. The training commands now
use a [flexible config file](/usage/training#config) that describes all training
settings and hyperparameters, as well as your pipeline, model components and
architectures to use. The `--code` argument lets you pass in code containing
[custom registered functions](/usage/training#custom-code) that you can
reference in your config.
#### Binary .spacy training data format {#migrating-training-format}
spaCy now uses a new
[binary training data format](/api/data-formats#binary-training), which is much
smaller and consists of `Doc` objects, serialized via the
[`DocBin`](/api/docbin). You can convert your existing JSON-formatted data using
the [`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:
```bash
$ python -m spacy convert ./training.json ./output
```
#### Training config {#migrating-training-config}
2020-08-10 02:20:10 +03:00
The easiest way to get started with a training config is to use the
[`init config`](/api/cli#init-config) command. You can start off with a blank
config for a new model, copy the config from an existing model, or auto-fill a
partial config like a starter config generated by our
[quickstart widget](/usage/training#quickstart).
```bash
python -m spacy init-config ./config.cfg --lang en --pipeline tagger,parser
```
```diff
### {wrap="true"}
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
2020-08-06 20:30:43 +03:00
+ python -m spacy train ./config.cfg --output ./output
```
<Project id="some_example_project">
The easiest way to get started with an end-to-end training process is to clone a
[project](/usage/projects) template. Projects let you manage multi-step
workflows, from data preprocessing to training and packaging your model.
</Project>
#### Migrating training scripts to CLI command and config {#migrating-training-scripts}
<!-- TODO: write -->
2020-07-29 12:36:42 +03:00
#### Training via the Python API {#migrating-training-python}
<!-- TODO: this should explain the GoldParse -> Example stuff -->
#### Packaging models {#migrating-training-packaging}
The [`spacy package`](/api/cli#package) command now automatically builds the
installable `.tar.gz` sdist of the Python package, so you don't have to run this
step manually anymore. You can disable the behavior by setting the `--no-sdist`
flag.
```diff
python -m spacy package ./model ./packages
- cd /output/en_model-0.0.0
- python setup.py sdist
```
2020-08-10 01:01:38 +03:00
#### Migration notes for plugin maintainers {#migrating-plugins}
2020-07-27 01:29:45 +03:00
Thanks to everyone who's been contributing to the spaCy ecosystem by developing
and maintaining one of the many awesome [plugins and extensions](/universe).
2020-08-10 01:01:38 +03:00
We've tried to make it as easy as possible for you to upgrade your packages for
spaCy v3. The most common use case for plugins is providing pipeline components
and extension attributes. When migrating your plugin, double-check the
following:
2020-07-27 01:29:45 +03:00
- Use the [`@Language.factory`](/api/language#factory) decorator to register
your component and assign it a name. This allows users to refer to your
components by name and serialize pipelines referencing them. Remove all manual
entries to the `Language.factories`.
- Make sure your component factories take at least two **named arguments**:
`nlp` (the current `nlp` object) and `name` (the instance name of the added
component so you can identify multiple instances of the same component).
- Update all references to [`nlp.add_pipe`](/api/language#add_pipe) in your docs
to use **string names** instead of the component functions.
```python
### {highlight="1-5"}
from spacy.language import Language
@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
return MyCoolComponent(some_setting=some_setting)
class MyCoolComponent:
def __init__(self, some_setting):
self.some_setting = some_setting
def __call__(self, doc):
# Do something to the doc
return doc
```
> #### Result in config.cfg
>
> ```ini
> [components.my_component]
> factory = "my_component"
> some_setting = true
> ```
```diff
import spacy
from your_plugin import MyCoolComponent
nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})
```
<Infobox title="Important note on registering factories" variant="warning">
The [`@Language.factory`](/api/language#factory) decorator takes care of letting
spaCy know that a component of that name is available. This means that your
users can add it to the pipeline using its **string name**. However, this
requires the decorator to be executed so users will still have to **import
your plugin**. Alternatively, your plugin could expose an
[entry point](/usage/saving-loading#entry-points), which spaCy can read from.
This means that spaCy knows how to initialize `my_component`, even if your
package isn't imported.
</Infobox>