mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-16 06:37:04 +03:00
617 lines
28 KiB
Markdown
617 lines
28 KiB
Markdown
---
|
||
title: What's New in v2.0
|
||
teaser: New features, backwards incompatibilities and migration guide
|
||
menu:
|
||
- ['Summary', 'summary']
|
||
- ['New Features', 'features']
|
||
- ['Backwards Incompatibilities', 'incompat']
|
||
- ['Migrating from v1.x', 'migrating']
|
||
---
|
||
|
||
We're very excited to finally introduce spaCy v2.0! On this page, you'll find a
|
||
summary of the new features, information on the backwards incompatibilities,
|
||
including a handy overview of what's been renamed or deprecated. To help you
|
||
make the most of v2.0, we also **re-wrote almost all of the usage guides and API
|
||
docs**, and added more [real-world examples](/usage/examples). If you're new to
|
||
spaCy, or just want to brush up on some NLP basics and the details of the
|
||
library, check out the [spaCy 101 guide](/usage/spacy-101) that explains the
|
||
most important concepts with examples and illustrations.
|
||
|
||
## Summary {#summary}
|
||
|
||
<Grid cols={2}>
|
||
|
||
<div>
|
||
|
||
This release features entirely new **deep learning-powered models** for spaCy's
|
||
tagger, parser and entity recognizer. The new models are **10× smaller**, **20%
|
||
more accurate** and **even cheaper to run** than the previous generation.
|
||
|
||
We've also made several usability improvements that are particularly helpful for
|
||
**production deployments**. spaCy v2 now fully supports the Pickle protocol,
|
||
making it easy to use spaCy with [Apache Spark](https://spark.apache.org/). The
|
||
string-to-integer mapping is **no longer stateful**, making it easy to reconcile
|
||
annotations made in different processes. Models are smaller and use less memory,
|
||
and the APIs for serialization are now much more consistent. Custom pipeline
|
||
components let you modify the `Doc` at any stage in the pipeline. You can now
|
||
also add your own custom attributes, properties and methods to the `Doc`,
|
||
`Token` and `Span`.
|
||
|
||
</div>
|
||
|
||
<Infobox title="Table of Contents" id="toc">
|
||
|
||
- [Summary](#summary)
|
||
- [New features](#features)
|
||
- [Neural network models](#features-models)
|
||
- [Improved processing pipelines](#features-pipelines)
|
||
- [Text classification](#features-text-classification)
|
||
- [Hash values as IDs](#features-hash-ids)
|
||
- [Improved word vectors support](#features-vectors)
|
||
- [Saving, loading and serialization](#features-serializer)
|
||
- [displaCy visualizer](#features-displacy)
|
||
- [Language data and lazy loading](#features-language)
|
||
- [Revised matcher API and phrase matcher](#features-matcher)
|
||
- [Backwards incompatibilities](#incompat)
|
||
- [Migrating from spaCy v1.x](#migrating)
|
||
|
||
</Infobox>
|
||
|
||
</Grid>
|
||
|
||
The main usability improvements you'll notice in spaCy v2.0 are around
|
||
**defining, training and loading your own models** and components. The new
|
||
neural network models make it much easier to train a model from scratch, or
|
||
update an existing model with a few examples. In v1.x, the statistical models
|
||
depended on the state of the `Vocab`. If you taught the model a new word, you
|
||
would have to save and load a lot of data — otherwise the model wouldn't
|
||
correctly recall the features of your new example. That's no longer the case.
|
||
|
||
Due to some clever use of hashing, the statistical models **never change size**,
|
||
even as they learn new vocabulary items. The whole pipeline is also now fully
|
||
differentiable. Even if you don't have explicitly annotated data, you can update
|
||
spaCy using all the **latest deep learning tricks** like adversarial training,
|
||
noise contrastive estimation or reinforcement learning.
|
||
|
||
## New features {#features}
|
||
|
||
This section contains an overview of the most important **new features and
|
||
improvements**. The [API docs](/api) include additional deprecation notes. New
|
||
methods and functions that were introduced in this version are marked with the
|
||
tag <Tag variant="new">2</Tag>.
|
||
|
||
### Convolutional neural network models {#features-models}
|
||
|
||
> #### Example
|
||
>
|
||
> ```bash
|
||
> python -m spacy download en_core_web_sm
|
||
> python -m spacy download de_core_news_sm
|
||
> python -m spacy download xx_ent_wiki_sm
|
||
> ```
|
||
|
||
spaCy v2.0 features new neural models for tagging, parsing and entity
|
||
recognition. The models have been designed and implemented from scratch
|
||
specifically for spaCy, to give you an unmatched balance of speed, size and
|
||
accuracy. The new models are **10× smaller**, **20% more accurate**, and **even
|
||
cheaper to run** than the previous generation.
|
||
|
||
spaCy v2.0's new neural network models bring significant improvements in
|
||
accuracy, especially for English Named Entity Recognition. The new
|
||
[`en_core_web_lg`](/models/en#en_core_web_lg) model makes about **25% fewer
|
||
mistakes** than the corresponding v1.x model and is within **1% of the current
|
||
state-of-the-art**
|
||
([Strubell et al., 2017](https://arxiv.org/pdf/1702.02098.pdf)). The v2.0 models
|
||
are also cheaper to run at scale, as they require **under 1 GB of memory** per
|
||
process.
|
||
|
||
<Infobox>
|
||
|
||
**Usage:** [Models directory](/models)
|
||
|
||
</Infobox>
|
||
|
||
### Improved processing pipelines {#features-pipelines}
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> # Set custom attributes
|
||
> Doc.set_extension("my_attr", default=False)
|
||
> Token.set_extension("my_attr", getter=my_token_getter)
|
||
> assert doc._.my_attr, token._.my_attr
|
||
>
|
||
> # Add components to the pipeline
|
||
> my_component = lambda doc: doc
|
||
> nlp.add_pipe(my_component)
|
||
> ```
|
||
|
||
It's now much easier to **customize the pipeline** with your own components:
|
||
functions that receive a `Doc` object, modify and return it. Extensions let you
|
||
write any **attributes, properties and methods** to the `Doc`, `Token` and
|
||
`Span`. You can add data, implement new features, integrate other libraries with
|
||
spaCy or plug in your own machine learning models.
|
||
|
||
![The processing pipeline](../images/pipeline.svg)
|
||
|
||
<Infobox>
|
||
|
||
**API:** [`Language`](/api/language),
|
||
[`Doc.set_extension`](/api/doc#set_extension),
|
||
[`Span.set_extension`](/api/span#set_extension),
|
||
[`Token.set_extension`](/api/token#set_extension) **Usage:**
|
||
[Processing pipelines](/usage/processing-pipelines) **Code:**
|
||
[Pipeline examples](/usage/examples#section-pipeline)
|
||
|
||
</Infobox>
|
||
|
||
### Text classification {#features-text-classification}
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> textcat = nlp.create_pipe("textcat")
|
||
> nlp.add_pipe(textcat, last=True)
|
||
> nlp.begin_training()
|
||
> for itn in range(100):
|
||
> for doc, gold in train_data:
|
||
> nlp.update([doc], [gold])
|
||
> doc = nlp("This is a text.")
|
||
> print(doc.cats)
|
||
> ```
|
||
|
||
spaCy v2.0 lets you add text categorization models to spaCy pipelines. The model
|
||
supports classification with multiple, non-mutually exclusive labels – so
|
||
multiple labels can apply at once. You can change the model architecture rather
|
||
easily, but by default, the `TextCategorizer` class uses a convolutional neural
|
||
network to assign position-sensitive vectors to each word in the document.
|
||
|
||
<Infobox>
|
||
|
||
**API:** [`TextCategorizer`](/api/textcategorizer),
|
||
[`Doc.cats`](/api/doc#attributes), [`GoldParse.cats`](/api/goldparse#attributes)
|
||
**Usage:** [Training a text classification model](/usage/training#textcat)
|
||
|
||
</Infobox>
|
||
|
||
### Hash values instead of integer IDs {#features-hash-ids}
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp("I love coffee")
|
||
> assert doc.vocab.strings["coffee"] == 3197928453018144401
|
||
> assert doc.vocab.strings[3197928453018144401] == "coffee"
|
||
>
|
||
> beer_hash = doc.vocab.strings.add("beer")
|
||
> assert doc.vocab.strings["beer"] == beer_hash
|
||
> assert doc.vocab.strings[beer_hash] == "beer"
|
||
> ```
|
||
|
||
The [`StringStore`](/api/stringstore) now resolves all strings to hash values
|
||
instead of integer IDs. This means that the string-to-int mapping **no longer
|
||
depends on the vocabulary state**, making a lot of workflows much simpler,
|
||
especially during training. Unlike integer IDs in spaCy v1.x, hash values will
|
||
**always match** – even across models. Strings can now be added explicitly using
|
||
the new [`Stringstore.add`](/api/stringstore#add) method. A token's hash is
|
||
available via `token.orth`.
|
||
|
||
<Infobox>
|
||
|
||
**API:** [`StringStore`](/api/stringstore) **Usage:**
|
||
[Vocab, hashes and lexemes 101](/usage/spacy-101#vocab)
|
||
|
||
</Infobox>
|
||
|
||
### Improved word vectors support {#features-vectors}
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> for word, vector in vector_data:
|
||
> nlp.vocab.set_vector(word, vector)
|
||
> nlp.vocab.vectors.from_glove("/path/to/vectors")
|
||
> # Keep 10000 unique vectors and remap the rest
|
||
> nlp.vocab.prune_vectors(10000)
|
||
> nlp.to_disk("/model")
|
||
> ```
|
||
|
||
The new [`Vectors`](/api/vectors) class helps the `Vocab` manage the vectors
|
||
assigned to strings, and lets you assign vectors individually, or
|
||
[load in GloVe vectors](/usage/vectors-similarity#custom-loading-glove) from a
|
||
directory. To help you strike a good balance between coverage and memory usage,
|
||
the `Vectors` class lets you map **multiple keys** to the **same row** of the
|
||
table. If you're using the [`spacy init-model`](/api/cli#init-model) command to
|
||
create a vocabulary, pruning the vectors will be taken care of automatically if
|
||
you set the `--prune-vectors` flag. Otherwise, you can use the new
|
||
[`Vocab.prune_vectors`](/api/vocab#prune_vectors).
|
||
|
||
<Infobox>
|
||
|
||
**API:** [`Vectors`](/api/vectors), [`Vocab`](/api/vocab) **Usage:**
|
||
[Word vectors and semantic similarity](/usage/vectors-similarity)
|
||
|
||
</Infobox>
|
||
|
||
### Saving, loading and serialization {#features-serializer}
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> nlp = spacy.load("en") # shortcut link
|
||
> nlp = spacy.load("en_core_web_sm") # package
|
||
> nlp = spacy.load("/path/to/en") # unicode path
|
||
> nlp = spacy.load(Path("/path/to/en")) # pathlib Path
|
||
>
|
||
> nlp.to_disk("/path/to/nlp")
|
||
> nlp = English().from_disk("/path/to/nlp")
|
||
> ```
|
||
|
||
spaCy's serialization API has been made consistent across classes and objects.
|
||
All container classes, i.e. `Language`, `Doc`, `Vocab` and `StringStore` now
|
||
have a `to_bytes()`, `from_bytes()`, `to_disk()` and `from_disk()` method that
|
||
supports the Pickle protocol.
|
||
|
||
The improved `spacy.load` makes loading models easier and more transparent. You
|
||
can load a model by supplying its [shortcut link](/usage/models#usage), the name
|
||
of an installed [model package](/models) or a path. The `Language` class to
|
||
initialize will be determined based on the model's settings. For a blank
|
||
language, you can import the class directly, e.g.
|
||
`from spacy.lang.en import English` or use
|
||
[`spacy.blank()`](/api/top-level#spacy.blank).
|
||
|
||
<Infobox>
|
||
|
||
**API:** [`spacy.load`](/api/top-level#spacy.load),
|
||
[`Language.to_disk`](/api/language#to_disk) **Usage:**
|
||
[Models](/usage/models#usage),
|
||
[Saving and loading](/usage/saving-loading#models)
|
||
|
||
</Infobox>
|
||
|
||
### displaCy visualizer with Jupyter support {#features-displacy}
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy import displacy
|
||
> doc = nlp("This is a sentence about Facebook.")
|
||
> displacy.serve(doc, style="dep") # run the web server
|
||
> html = displacy.render(doc, style="ent") # generate HTML
|
||
> ```
|
||
|
||
Our popular dependency and named entity visualizers are now an official part of
|
||
the spaCy library. displaCy can run a simple web server, or generate raw HTML
|
||
markup or SVG files to be exported. You can pass in one or more docs, and
|
||
customize the style. displaCy also auto-detects whether you're running
|
||
[Jupyter](https://jupyter.org) and will render the visualizations in your
|
||
notebook.
|
||
|
||
<Infobox>
|
||
|
||
**API:** [`displacy`](/api/top-level#displacy) **Usage:**
|
||
[Visualizing spaCy](/usage/visualizers)
|
||
|
||
</Infobox>
|
||
|
||
### Improved language data and lazy loading {#features-language}
|
||
|
||
Language-specific data now lives in its own submodule, `spacy.lang`. Languages
|
||
are lazy-loaded, i.e. only loaded when you import a `Language` class, or load a
|
||
model that initializes one. This allows languages to contain more custom data,
|
||
e.g. lemmatizer lookup tables, or complex regular expressions. The language data
|
||
has also been tidied up and simplified. spaCy now also supports simple
|
||
lookup-based lemmatization – and **many new languages**!
|
||
|
||
<Infobox>
|
||
|
||
**API:** [`Language`](/api/language) **Code:**
|
||
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang)
|
||
**Usage:** [Adding languages](/usage/adding-languages)
|
||
|
||
</Infobox>
|
||
|
||
### Revised matcher API and phrase matcher {#features-matcher}
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.matcher import Matcher, PhraseMatcher
|
||
>
|
||
> matcher = Matcher(nlp.vocab)
|
||
> matcher.add('HEARTS', None, [{"ORTH": "❤️", "OP": '+'}])
|
||
>
|
||
> phrasematcher = PhraseMatcher(nlp.vocab)
|
||
> phrasematcher.add("OBAMA", None, nlp("Barack Obama"))
|
||
> ```
|
||
|
||
Patterns can now be added to the matcher by calling
|
||
[`matcher.add()`](/api/matcher#add) with a match ID, an optional callback
|
||
function to be invoked on each match, and one or more patterns. This allows you
|
||
to write powerful, pattern-specific logic using only one matcher. For example,
|
||
you might only want to merge some entity types, and set custom flags for other
|
||
matched patterns. The new [`PhraseMatcher`](/api/phrasematcher) lets you
|
||
efficiently match very large terminology lists using `Doc` objects as match
|
||
patterns.
|
||
|
||
<Infobox>
|
||
|
||
**API:** [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher)
|
||
**Usage:** [Rule-based matching](/usage/rule-based-matching)
|
||
|
||
</Infobox>
|
||
|
||
## Backwards incompatibilities {#incompat}
|
||
|
||
The following modules, classes and methods have changed between v1.x and v2.0.
|
||
|
||
| Old | New |
|
||
| ------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||
| `spacy.download.en`, `spacy.download.de` | [`cli.download`](/api/cli#download) |
|
||
| `spacy.en` etc. | `spacy.lang.en` etc. |
|
||
| `spacy.en.word_sets` | `spacy.lang.en.stop_words` |
|
||
| `spacy.orth` | `spacy.lang.xx.lex_attrs` |
|
||
| `spacy.syntax.iterators` | `spacy.lang.xx.syntax_iterators` |
|
||
| `spacy.tagger.Tagger` | `spacy.pipeline.Tagger` |
|
||
| `spacy.cli.model` | [`spacy.cli.vocab`](/api/cli#vocab) |
|
||
| `Language.save_to_directory` | [`Language.to_disk`](/api/language#to_disk) |
|
||
| `Language.end_training` | [`Language.begin_training`](/api/language#begin_training) |
|
||
| `Language.create_make_doc` | [`Language.tokenizer`](/api/language#attributes) |
|
||
| `Vocab.resize_vectors` | [`Vectors.resize`](/api/vectors#resize) |
|
||
| `Vocab.load` `Vocab.load_lexemes` | [`Vocab.from_disk`](/api/vocab#from_disk) [`Vocab.from_bytes`](/api/vocab#from_bytes) |
|
||
| `Vocab.dump` | [`Vocab.to_disk`](/api/vocab#to_disk) [`Vocab.to_bytes`](/api/vocab#to_bytes) |
|
||
| `Vocab.load_vectors` `Vocab.load_vectors_from_bin_loc` | [`Vectors.from_disk`](/api/vectors#from_disk) [`Vectors.from_bytes`](/api/vectors#from_bytes) [`Vectors.from_glove`](/api/vectors#from_glove) |
|
||
| `Vocab.dump_vectors` | [`Vectors.to_disk`](/api/vectors#to_disk) [`Vectors.to_bytes`](/api/vectors#to_bytes) |
|
||
| `StringStore.load` | [`StringStore.from_disk`](/api/stringstore#from_disk) [`StringStore.from_bytes`](/api/stringstore#from_bytes) |
|
||
| `StringStore.dump` | [`StringStore.to_disk`](/api/stringstore#to_disk) [`StringStore.to_bytes`](/api/stringstore#to_bytes) |
|
||
| `Tokenizer.load` | [`Tokenizer.from_disk`](/api/tokenizer#from_disk) [`Tokenizer.from_bytes`](/api/tokenizer#from_bytes) |
|
||
| `Tagger.load` | [`Tagger.from_disk`](/api/tagger#from_disk) [`Tagger.from_bytes`](/api/tagger#from_bytes) |
|
||
| `Tagger.tag_names` | `Tagger.labels` |
|
||
| `DependencyParser.load` | [`DependencyParser.from_disk`](/api/dependencyparser#from_disk) [`DependencyParser.from_bytes`](/api/dependencyparser#from_bytes) |
|
||
| `EntityRecognizer.load` | [`EntityRecognizer.from_disk`](/api/entityrecognizer#from_disk) [`EntityRecognizer.from_bytes`](/api/entityrecognizer#from_bytes) |
|
||
| `Matcher.load` | - |
|
||
| `Matcher.add_pattern` `Matcher.add_entity` | [`Matcher.add`](/api/matcher#add) [`PhraseMatcher.add`](/api/phrasematcher#add) |
|
||
| `Matcher.get_entity` | [`Matcher.get`](/api/matcher#get) |
|
||
| `Matcher.has_entity` | [`Matcher.has_key`](/api/matcher#has_key) |
|
||
| `Doc.read_bytes` | [`Doc.to_bytes`](/api/doc#to_bytes) [`Doc.from_bytes`](/api/doc#from_bytes) [`Doc.to_disk`](/api/doc#to_disk) [`Doc.from_disk`](/api/doc#from_disk) |
|
||
| `Token.is_ancestor_of` | [`Token.is_ancestor`](/api/token#is_ancestor) |
|
||
|
||
### Deprecated {#deprecated}
|
||
|
||
The following methods are deprecated. They can still be used, but should be
|
||
replaced.
|
||
|
||
| Old | New |
|
||
| ---------------------------- | ----------------------------------------------- |
|
||
| `Tokenizer.tokens_from_list` | [`Doc`](/api/doc) |
|
||
| `Span.sent_start` | [`Span.is_sent_start`](/api/span#is_sent_start) |
|
||
|
||
## Migrating from spaCy 1.x {#migrating}
|
||
|
||
Because we'e made so many architectural changes to the library, we've tried to
|
||
**keep breaking changes to a minimum**. A lot of projects follow the philosophy
|
||
that if you're going to break anything, you may as well break everything. We
|
||
think migration is easier if there's a logic to what has changed. We've
|
||
therefore followed a policy of avoiding breaking changes to the `Doc`, `Span`
|
||
and `Token` objects. This way, you can focus on only migrating the code that
|
||
does training, loading and serialization — in other words, code that works with
|
||
the `nlp` object directly. Code that uses the annotations should continue to
|
||
work.
|
||
|
||
<Infobox title="Important note" variant="warning">
|
||
|
||
If you've trained your own models, keep in mind that your train and runtime
|
||
inputs must match. This means you'll have to **retrain your models** with spaCy
|
||
v2.0.
|
||
|
||
</Infobox>
|
||
|
||
### Document processing {#migrating-document-processing}
|
||
|
||
The [`Language.pipe`](/api/language#pipe) method allows spaCy to batch
|
||
documents, which brings a **significant performance advantage** in v2.0. The new
|
||
neural networks introduce some overhead per batch, so if you're processing a
|
||
number of documents in a row, you should use `nlp.pipe` and process the texts as
|
||
a stream.
|
||
|
||
```diff
|
||
- docs = (nlp(text) for text in texts)
|
||
|
||
+ docs = nlp.pipe(texts)
|
||
```
|
||
|
||
To make usage easier, there's now a boolean `as_tuples` keyword argument, that
|
||
lets you pass in an iterator of `(text, context)` pairs, so you can get back an
|
||
iterator of `(doc, context)` tuples.
|
||
|
||
### Saving, loading and serialization {#migrating-saving-loading}
|
||
|
||
Double-check all calls to `spacy.load()` and make sure they don't use the `path`
|
||
keyword argument. If you're only loading in binary data and not a model package
|
||
that can construct its own `Language` class and pipeline, you should now use the
|
||
[`Language.from_disk`](/api/language#from_disk) method.
|
||
|
||
```diff
|
||
- nlp = spacy.load("en", path="/model")
|
||
|
||
+ nlp = spacy.load("/model")
|
||
+ nlp = spacy.blank("en").from_disk("/model/data")
|
||
```
|
||
|
||
Review all other code that writes state to disk or bytes. All containers, now
|
||
share the same, consistent API for saving and loading. Replace saving with
|
||
`to_disk()` or `to_bytes()`, and loading with `from_disk()` and `from_bytes()`.
|
||
|
||
```diff
|
||
- nlp.save_to_directory("/model")
|
||
- nlp.vocab.dump("/vocab")
|
||
|
||
+ nlp.to_disk("/model")
|
||
+ nlp.vocab.to_disk("/vocab")
|
||
```
|
||
|
||
If you've trained models with input from v1.x, you'll need to **retrain them**
|
||
with spaCy v2.0. All previous models will not be compatible with the new
|
||
version.
|
||
|
||
### Processing pipelines and language data {#migrating-languages}
|
||
|
||
If you're importing language data or `Language` classes, make sure to change
|
||
your import statements to import from `spacy.lang`. If you've added your own
|
||
custom language, it needs to be moved to `spacy/lang/xx` and adjusted
|
||
accordingly.
|
||
|
||
```diff
|
||
- from spacy.en import English
|
||
|
||
+ from spacy.lang.en import English
|
||
```
|
||
|
||
If you've been using custom pipeline components, check out the new guide on
|
||
[processing pipelines](/usage/processing-pipelines). Pipeline components are now
|
||
`(name, func)` tuples. Appending them to the pipeline still works – but the
|
||
[`add_pipe`](/api/language#add_pipe) method now makes this much more convenient.
|
||
Methods for removing, renaming, replacing and retrieving components have been
|
||
added as well. Components can now be disabled by passing a list of their names
|
||
to the `disable` keyword argument on load, or by using
|
||
[`disable_pipes`](/api/language#disable_pipes) as a method or context manager:
|
||
|
||
```diff
|
||
- nlp = spacy.load("en_core_web_sm", tagger=False, entity=False)
|
||
- doc = nlp("I don't want parsed", parse=False)
|
||
|
||
+ nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"])
|
||
+ with nlp.disable_pipes("parser"):
|
||
+ doc = nlp("I don't want parsed")
|
||
```
|
||
|
||
To add spaCy's built-in pipeline components to your pipeline, you can still
|
||
import and instantiate them directly – but it's more convenient to use the new
|
||
[`create_pipe`](/api/language#create_pipe) method with the component name, i.e.
|
||
`'tagger'`, `'parser'`, `'ner'` or `'textcat'`.
|
||
|
||
```diff
|
||
- from spacy.pipeline import Tagger
|
||
- tagger = Tagger(nlp.vocab)
|
||
- nlp.pipeline.insert(0, tagger)
|
||
|
||
+ tagger = nlp.create_pipe("tagger")
|
||
+ nlp.add_pipe(tagger, first=True)
|
||
```
|
||
|
||
### Training {#migrating-training}
|
||
|
||
All built-in pipeline components are now subclasses of [`Pipe`](/api/pipe),
|
||
fully trainable and serializable, and follow the same API. Instead of updating
|
||
the model and telling spaCy when to _stop_, you can now explicitly call
|
||
[`begin_training`](/api/language#begin_training), which returns an optimizer you
|
||
can pass into the [`update`](/api/language#update) function. While `update`
|
||
still accepts sequences of `Doc` and `GoldParse` objects, you can now also pass
|
||
in a list of strings and dictionaries describing the annotations. We call this
|
||
the ["simple training style"](/usage/training#training-simple-style). This is
|
||
also the recommended usage, as it removes one layer of abstraction from the
|
||
training.
|
||
|
||
```diff
|
||
- for itn in range(1000):
|
||
- for text, entities in train_data:
|
||
- doc = Doc(text)
|
||
- gold = GoldParse(doc, entities=entities)
|
||
- nlp.update(doc, gold)
|
||
- nlp.end_training()
|
||
- nlp.save_to_directory("/model")
|
||
|
||
+ nlp.begin_training()
|
||
+ for itn in range(1000):
|
||
+ for texts, annotations in train_data:
|
||
+ nlp.update(texts, annotations)
|
||
+ nlp.to_disk("/model")
|
||
```
|
||
|
||
### Attaching custom data to the Doc {#migrating-doc}
|
||
|
||
Previously, you had to create a new container in order to attach custom data to
|
||
a `Doc` object. This often required converting the `Doc` objects to and from
|
||
arrays. In spaCy v2.0, you can set your own attributes, properties and methods
|
||
on the `Doc`, `Token` and `Span` via
|
||
[custom extensions](/usage/processing-pipelines#custom-components-attributes).
|
||
This means that your application can – and should – only pass around `Doc`
|
||
objects and refer to them as the single source of truth.
|
||
|
||
```diff
|
||
- doc = nlp("This is a regular doc")
|
||
- doc_array = doc.to_array(["ORTH", "POS"])
|
||
- doc_with_meta = {"doc_array": doc_array, "meta": get_doc_meta(doc_array)}
|
||
|
||
+ Doc.set_extension("meta", getter=get_doc_meta)
|
||
+ doc_with_meta = nlp(u'This is a doc with meta data')
|
||
+ meta = doc._.meta
|
||
```
|
||
|
||
If you wrap your extension attributes in a
|
||
[custom pipeline component](/usage/processing-pipelines#custom-components), they
|
||
will be assigned automatically when you call `nlp` on a text. If your
|
||
application assigns custom data to spaCy's container objects, or includes other
|
||
utilities that interact with the pipeline, consider moving this logic into its
|
||
own extension module.
|
||
|
||
```diff
|
||
- doc = nlp("Doc with a standard pipeline")
|
||
- meta = get_meta(doc)
|
||
|
||
+ nlp.add_pipe(meta_component)
|
||
+ doc = nlp("Doc with a custom pipeline that assigns meta")
|
||
+ meta = doc._.meta
|
||
```
|
||
|
||
### Strings and hash values {#migrating-strings}
|
||
|
||
The change from integer IDs to hash values may not actually affect your code
|
||
very much. However, if you're adding strings to the vocab manually, you now need
|
||
to call [`StringStore.add`](/api/stringstore#add) explicitly. You can also now
|
||
be sure that the string-to-hash mapping will always match across vocabularies.
|
||
|
||
```diff
|
||
- nlp.vocab.strings["coffee"] # 3672
|
||
- other_nlp.vocab.strings["coffee"] # 40259
|
||
|
||
+ nlp.vocab.strings.add("coffee")
|
||
+ nlp.vocab.strings["coffee"] # 3197928453018144401
|
||
+ other_nlp.vocab.strings["coffee"] # 3197928453018144401
|
||
```
|
||
|
||
### Adding patterns and callbacks to the matcher {#migrating-matcher}
|
||
|
||
If you're using the matcher, you can now add patterns in one step. This should
|
||
be easy to update – simply merge the ID, callback and patterns into one call to
|
||
[`Matcher.add()`](/api/matcher#add). The matcher now also supports string keys,
|
||
which saves you an extra import. If you've been using **acceptor functions**,
|
||
you'll need to move this logic into the
|
||
[`on_match` callbacks](/usage/linguistic-features#on_match). The callback
|
||
function is invoked on every match and will give you access to the doc, the
|
||
index of the current match and all total matches. This lets you both accept or
|
||
reject the match, and define the actions to be triggered.
|
||
|
||
```diff
|
||
- matcher.add_entity("GoogleNow", on_match=merge_phrases)
|
||
- matcher.add_pattern("GoogleNow", [{ORTH: "Google"}, {ORTH: "Now"}])
|
||
|
||
+ matcher.add("GoogleNow", merge_phrases, [{"ORTH": "Google"}, {"ORTH": "Now"}])
|
||
```
|
||
|
||
If you need to match large terminology lists, you can now also use the
|
||
[`PhraseMatcher`](/api/phrasematcher), which accepts `Doc` objects as match
|
||
patterns and is more efficient than the regular, rule-based matcher.
|
||
|
||
```diff
|
||
- matcher = Matcher(nlp.vocab)
|
||
- matcher.add_entity("PRODUCT")
|
||
- for text in large_terminology_list
|
||
- matcher.add_pattern("PRODUCT", [{ORTH: text}])
|
||
|
||
+ from spacy.matcher import PhraseMatcher
|
||
+ matcher = PhraseMatcher(nlp.vocab)
|
||
+ patterns = [nlp.make_doc(text) for text in large_terminology_list]
|
||
+ matcher.add("PRODUCT", None, *patterns)
|
||
```
|