title | teaser |
---|---|
What's New in v2.0 | New features, backwards incompatibilities and migration guide |
We're very excited to finally introduce spaCy v2.0! On this page, you'll find a summary of the new features, information on the backwards incompatibilities, including a handy overview of what's been renamed or deprecated. To help you make the most of v2.0, we also re-wrote almost all of the usage guides and API docs, and added more real-world examples. If you're new to spaCy, or just want to brush up on some NLP basics and the details of the library, check out the spaCy 101 guide that explains the most important concepts with examples and illustrations.
Summary
This release features entirely new deep learning-powered models for spaCy's tagger, parser and entity recognizer. The new models are 10× smaller, 20% more accurate and even cheaper to run than the previous generation.
We've also made several usability improvements that are particularly helpful for
production deployments. spaCy v2 now fully supports the Pickle protocol,
making it easy to use spaCy with Apache Spark. The
string-to-integer mapping is no longer stateful, making it easy to reconcile
annotations made in different processes. Models are smaller and use less memory,
and the APIs for serialization are now much more consistent. Custom pipeline
components let you modify the Doc at any stage in the pipeline. You can now
also add your own custom attributes, properties and methods to the Doc, Token
and Span.
- Summary
- New features
- Neural network models
- Improved processing pipelines
- Text classification
- Hash values as IDs
- Improved word vectors support
- Saving, loading and serialization
- displaCy visualizer
- Language data and lazy loading
- Revised matcher API and phrase matcher
- Backwards incompatibilities
- Migrating from spaCy v1.x
The main usability improvements you'll notice in spaCy v2.0 are around
defining, training and loading your own models and components. The new
neural network models make it much easier to train a model from scratch, or
update an existing model with a few examples. In v1.x, the statistical models
depended on the state of the Vocab. If you taught the model a new word, you
would have to save and load a lot of data; otherwise the model wouldn't
correctly recall the features of your new example. That's no longer the case.
Due to some clever use of hashing, the statistical models never change size, even as they learn new vocabulary items. The whole pipeline is also now fully differentiable. Even if you don't have explicitly annotated data, you can update spaCy using all the latest deep learning tricks like adversarial training, noise contrastive estimation or reinforcement learning.
New features
This section contains an overview of the most important new features and improvements. The API docs include additional deprecation notes. New methods and functions that were introduced in this version are marked with the tag 2.
Convolutional neural network models
Example
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
python -m spacy download xx_ent_wiki_sm
spaCy v2.0 features new neural models for tagging, parsing and entity recognition. The models have been designed and implemented from scratch specifically for spaCy, to give you an unmatched balance of speed, size and accuracy. The new models are 10× smaller, 20% more accurate, and even cheaper to run than the previous generation.
spaCy v2.0's new neural network models bring significant improvements in
accuracy, especially for English Named Entity Recognition. The new
en_core_web_lg model makes about 25% fewer mistakes than the corresponding
v1.x model and is within 1% of the current state-of-the-art
(Strubell et al., 2017). The v2.0 models are also cheaper to run at scale, as
they require under 1 GB of memory per process.
Usage: Models directory
Improved processing pipelines
Example
# Set custom attributes
Doc.set_extension("my_attr", default=False)
Token.set_extension("my_attr", getter=my_token_getter)
assert doc._.my_attr is False  # the default value
value = token._.my_attr  # computed by my_token_getter
# Add components to the pipeline
my_component = lambda doc: doc
nlp.add_pipe(my_component)
It's now much easier to customize the pipeline with your own components:
functions that receive a Doc object, modify and return it. Extensions let you
write any attributes, properties and methods to the Doc, Token and Span. You
can add data, implement new features, integrate other libraries with spaCy or
plug in your own machine learning models.
API: Language, Doc.set_extension, Span.set_extension, Token.set_extension
Usage: Processing pipelines
Code: Pipeline examples
Text classification
Example
textcat = nlp.create_pipe("textcat")
nlp.add_pipe(textcat, last=True)
nlp.begin_training()
for itn in range(100):
    for doc, gold in train_data:
        nlp.update([doc], [gold])
doc = nlp(u"This is a text.")
print(doc.cats)
spaCy v2.0 lets you add text categorization models to spaCy pipelines. The model
supports classification with multiple, non-mutually exclusive labels – so
multiple labels can apply at once. You can change the model architecture rather
easily, but by default, the TextCategorizer class uses a convolutional neural
network to assign position-sensitive vectors to each word in the document.
API: TextCategorizer, Doc.cats, GoldParse.cats
Usage: Training a text classification model
Hash values instead of integer IDs
Example
doc = nlp(u"I love coffee")
assert doc.vocab.strings[u"coffee"] == 3197928453018144401
assert doc.vocab.strings[3197928453018144401] == u"coffee"
beer_hash = doc.vocab.strings.add(u"beer")
assert doc.vocab.strings[u"beer"] == beer_hash
assert doc.vocab.strings[beer_hash] == u"beer"
The StringStore now resolves all strings to hash values
instead of integer IDs. This means that the string-to-int mapping no longer
depends on the vocabulary state, making a lot of workflows much simpler,
especially during training. Unlike integer IDs in spaCy v1.x, hash values will
always match – even across models. Strings can now be added explicitly using
the new StringStore.add method. A token's hash is available via token.orth.
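The practical consequence is that two stores that have never seen each other agree on the ID of any string. The sketch below imitates that behaviour with a toy class. `ToyStringStore` and its `hashlib`-based `hash_string` are stand-ins invented for illustration, so the numbers it produces will not match spaCy's hash values.

```python
import hashlib


def hash_string(text):
    """Deterministic 64-bit hash; an assumed stand-in, not spaCy's hash function."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Maps strings to hashes and hashes back to strings."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            # String -> hash needs no lookup, so it never depends on state.
            return hash_string(item)
        # Hash -> string only works for strings that were added.
        return self._by_hash[item]


store_a, store_b = ToyStringStore(), ToyStringStore()
store_a.add("tea")  # different contents and insertion order than store_b
coffee_a = store_a.add("coffee")
coffee_b = store_b.add("coffee")
```

Both stores return the same value for "coffee", while looking up the hash of "tea" in `store_b` raises a `KeyError` because that string was never added there. This is the reason strings have to be added explicitly before a hash can be resolved back to text.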
API: StringStore
Usage: Vocab, hashes and lexemes 101
Improved word vectors support
Example
for word, vector in vector_data:
    nlp.vocab.set_vector(word, vector)
nlp.vocab.vectors.from_glove("/path/to/vectors")
# Keep 10000 unique vectors and remap the rest
nlp.vocab.prune_vectors(10000)
nlp.to_disk("/model")
The new Vectors class helps the Vocab manage the vectors
assigned to strings, and lets you assign vectors individually, or
load in GloVe vectors from a directory. To help you strike a good balance
between coverage and memory usage, the Vectors class lets you map multiple keys
to the same row of the table. If you're using the spacy init-model command to
create a vocabulary, pruning the vectors will be taken care of automatically if
you set the --prune-vectors flag. Otherwise, you can use the new
Vocab.prune_vectors.
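To illustrate what mapping multiple keys to the same row means, the following sketch prunes a tiny vector table in plain Python: the first `n_keep` entries keep their own row, and every other key is pointed at the most similar remaining row. The helper `prune`, the sample data and the use of cosine similarity for the remapping are assumptions made for this example, not a description of how `Vocab.prune_vectors` works internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune(vectors, n_keep):
    """Keep the first n_keep vectors; remap all other keys to the closest kept row.

    vectors: dict of key -> list of floats, ordered by priority (e.g. frequency).
    Returns (rows, key2row), where several keys may share one row index.
    """
    keys = list(vectors)
    kept = keys[:n_keep]
    rows = [vectors[key] for key in kept]
    key2row = {key: i for i, key in enumerate(kept)}
    for key in keys[n_keep:]:
        scores = [cosine(vectors[key], row) for row in rows]
        key2row[key] = scores.index(max(scores))
    return rows, key2row


vectors = {
    "coffee": [1.0, 0.1, 0.0],
    "dog": [0.0, 1.0, 0.2],
    "espresso": [0.9, 0.2, 0.0],  # close to "coffee"
    "puppy": [0.1, 0.9, 0.3],  # close to "dog"
}
rows, key2row = prune(vectors, n_keep=2)
```

After pruning, only two rows are stored, yet all four keys still resolve to a vector: "espresso" shares the row of "coffee" and "puppy" shares the row of "dog".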
API: Vectors, Vocab
Usage: Word vectors and semantic similarity
Saving, loading and serialization
Example
nlp = spacy.load("en")  # shortcut link
nlp = spacy.load("en_core_web_sm")  # package
nlp = spacy.load("/path/to/en")  # unicode path
nlp = spacy.load(Path("/path/to/en"))  # pathlib Path
nlp.to_disk("/path/to/nlp")
nlp = English().from_disk("/path/to/nlp")
spaCy's serialization API has been made consistent across classes and objects.
All container classes, i.e. Language, Doc, Vocab and StringStore, now have
to_bytes(), from_bytes(), to_disk() and from_disk() methods and support the
Pickle protocol.
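One benefit of a consistent API is that helper code no longer needs to know which class it is handling. The sketch below shows this with a hypothetical `ToyContainer` that follows the same four-method convention. The class, its JSON payload and the two copy helpers are illustrative assumptions rather than spaCy code, and pickling is left out of the example.

```python
import json
import tempfile
from pathlib import Path


class ToyContainer:
    """Hypothetical container exposing the four-method serialization API."""

    def __init__(self, data=None):
        self.data = dict(data or {})

    def to_bytes(self):
        return json.dumps(self.data, sort_keys=True).encode("utf8")

    def from_bytes(self, bytes_data):
        self.data = json.loads(bytes_data.decode("utf8"))
        return self  # returning self allows ToyContainer().from_bytes(...)

    def to_disk(self, path):
        Path(path).write_bytes(self.to_bytes())

    def from_disk(self, path):
        return self.from_bytes(Path(path).read_bytes())


def copy_via_bytes(obj, blank):
    """Works for any object that follows the to_bytes()/from_bytes() convention."""
    return blank.from_bytes(obj.to_bytes())


def copy_via_disk(obj, blank, path):
    """Works for any object that follows the to_disk()/from_disk() convention."""
    obj.to_disk(path)
    return blank.from_disk(path)


original = ToyContainer({"lang": "en", "pipeline": ["tagger", "parser"]})
bytes_copy = copy_via_bytes(original, ToyContainer())
with tempfile.TemporaryDirectory() as tmp_dir:
    disk_copy = copy_via_disk(original, ToyContainer(), Path(tmp_dir) / "container")
```

The `from_*` methods load into an existing blank object and return it, which is the same shape as `English().from_disk("/path/to/nlp")` in the example above.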
The improved spacy.load makes loading models easier and more transparent. You
can load a model by supplying its shortcut link, the name
of an installed model package or a path. The Language class to
initialize will be determined based on the model's settings. For a blank
language, you can import the class directly, e.g.
from spacy.lang.en import English, or use spacy.blank().
API: spacy.load, Language.to_disk
Usage: Models, Saving and loading
displaCy visualizer with Jupyter support
Example
from spacy import displacy
doc = nlp(u"This is a sentence about Facebook.")
displacy.serve(doc, style="dep")  # run the web server
html = displacy.render(doc, style="ent")  # generate HTML
Our popular dependency and named entity visualizers are now an official part of the spaCy library. displaCy can run a simple web server, or generate raw HTML markup or SVG files to be exported. You can pass in one or more docs, and customize the style. displaCy also auto-detects whether you're running Jupyter and will render the visualizations in your notebook.
API: displacy
Usage: Visualizing spaCy
Improved language data and lazy loading
Language-specific data now lives in its own submodule, spacy.lang. Languages
are lazy-loaded, i.e. only loaded when you import a Language class, or load a
model that initializes one. This allows languages to contain more custom data,
e.g. lemmatizer lookup tables, or complex regular expressions. The language data
has also been tidied up and simplified. spaCy now also supports simple
lookup-based lemmatization – and many new languages!
API: Language
Code: spacy/lang
Usage: Adding languages
Revised matcher API and phrase matcher
Example
from spacy.matcher import Matcher, PhraseMatcher
matcher = Matcher(nlp.vocab)
matcher.add("HEARTS", None, [{"ORTH": "❤️", "OP": "+"}])
phrasematcher = PhraseMatcher(nlp.vocab)
phrasematcher.add("OBAMA", None, nlp(u"Barack Obama"))
Patterns can now be added to the matcher by calling
matcher.add() with a match ID, an optional callback
function to be invoked on each match, and one or more patterns. This allows you
to write powerful, pattern-specific logic using only one matcher. For example,
you might only want to merge some entity types, and set custom flags for other
matched patterns. The new PhraseMatcher lets you
efficiently match very large terminology lists using Doc objects as match
patterns.
API: Matcher, PhraseMatcher
Usage: Rule-based matching
Backwards incompatibilities
The following modules, classes and methods have changed between v1.x and v2.0.
Old | New |
---|---|
spacy.download.en, spacy.download.de | cli.download |
spacy.en etc. | spacy.lang.en etc. |
spacy.en.word_sets | spacy.lang.en.stop_words |
spacy.orth | spacy.lang.xx.lex_attrs |
spacy.syntax.iterators | spacy.lang.xx.syntax_iterators |
spacy.tagger.Tagger | spacy.pipeline.Tagger |
spacy.cli.model | spacy.cli.vocab |
Language.save_to_directory | Language.to_disk |
Language.end_training | Language.begin_training |
Language.create_make_doc | Language.tokenizer |
Vocab.resize_vectors | Vectors.resize |
Vocab.load, Vocab.load_lexemes | Vocab.from_disk, Vocab.from_bytes |
Vocab.dump | Vocab.to_disk, Vocab.to_bytes |
Vocab.load_vectors, Vocab.load_vectors_from_bin_loc | Vectors.from_disk, Vectors.from_bytes, Vectors.from_glove |
Vocab.dump_vectors | Vectors.to_disk, Vectors.to_bytes |
StringStore.load | StringStore.from_disk, StringStore.from_bytes |
StringStore.dump | StringStore.to_disk, StringStore.to_bytes |
Tokenizer.load | Tokenizer.from_disk, Tokenizer.from_bytes |
Tagger.load | Tagger.from_disk, Tagger.from_bytes |
Tagger.tag_names | Tagger.labels |
DependencyParser.load | DependencyParser.from_disk, DependencyParser.from_bytes |
EntityRecognizer.load | EntityRecognizer.from_disk, EntityRecognizer.from_bytes |
Matcher.load | - |
Matcher.add_pattern, Matcher.add_entity | Matcher.add, PhraseMatcher.add |
Matcher.get_entity | Matcher.get |
Matcher.has_entity | Matcher.has_key |
Doc.read_bytes | Doc.to_bytes, Doc.from_bytes, Doc.to_disk, Doc.from_disk |
Token.is_ancestor_of | Token.is_ancestor |
Deprecated
The following methods are deprecated. They can still be used, but should be replaced.
Old | New |
---|---|
Tokenizer.tokens_from_list | Doc |
Span.sent_start | Span.is_sent_start |
Migrating from spaCy 1.x
Because we've made so many architectural changes to the library, we've tried to
keep breaking changes to a minimum. A lot of projects follow the philosophy
that if you're going to break anything, you may as well break everything. We
think migration is easier if there's a logic to what has changed. We've
therefore followed a policy of avoiding breaking changes to the Doc, Span
and Token objects. This way, you can focus on only migrating the code that
does training, loading and serialization, in other words, code that works with
the nlp object directly. Code that uses the annotations should continue to
work.
If you've trained your own models, keep in mind that your train and runtime inputs must match. This means you'll have to retrain your models with spaCy v2.0.
Document processing
The Language.pipe method allows spaCy to batch
documents, which brings a significant performance advantage in v2.0. The new
neural networks introduce some overhead per batch, so if you're processing a
number of documents in a row, you should use nlp.pipe and process the texts as
a stream.
- docs = (nlp(text) for text in texts)
+ docs = nlp.pipe(texts)
To make usage easier, there's now a boolean as_tuples keyword argument that
lets you pass in an iterator of (text, context) pairs, so you can get back an
iterator of (doc, context) tuples.
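Conceptually, `as_tuples` splits the stream of pairs, processes the texts as an ordinary stream, and zips the contexts back on. The sketch below shows that idea with a hypothetical `pipe` function and a trivial `process` stand-in in place of a real `nlp` object. It mirrors the behaviour described above and makes no claim about spaCy's source code.

```python
from itertools import tee


def process(text):
    """Stand-in for nlp(text): here a 'doc' is simply a list of tokens."""
    return text.split()


def pipe(items, as_tuples=False):
    """Lazily process a stream of texts, or of (text, context) pairs."""
    if as_tuples:
        pairs_for_text, pairs_for_context = tee(items)
        texts = (text for text, _ in pairs_for_text)
        contexts = (context for _, context in pairs_for_context)
        # Process the texts as a plain stream, then re-attach each context.
        yield from zip(pipe(texts), contexts)
        return
    for text in items:
        yield process(text)


data = [("I like coffee", {"id": 1}), ("Tea is fine too", {"id": 2})]
results = list(pipe(data, as_tuples=True))
```

With spaCy itself, the equivalent loop would be `for doc, context in nlp.pipe(data, as_tuples=True)`, which keeps metadata such as database IDs attached to each document without giving up streaming.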
Saving, loading and serialization
Double-check all calls to spacy.load() and make sure they don't use the path
keyword argument. If you're only loading in binary data and not a model package
that can construct its own Language class and pipeline, you should now use the
Language.from_disk method.
- nlp = spacy.load("en", path="/model")
+ nlp = spacy.load("/model")
+ nlp = spacy.blank("en").from_disk("/model/data")
Review all other code that writes state to disk or bytes. All containers now
share the same, consistent API for saving and loading. Replace saving with
to_disk() or to_bytes(), and loading with from_disk() and from_bytes().
- nlp.save_to_directory("/model")
- nlp.vocab.dump("/vocab")
+ nlp.to_disk("/model")
+ nlp.vocab.to_disk("/vocab")
If you've trained models with input from v1.x, you'll need to retrain them with spaCy v2.0. Models trained with v1.x are not compatible with the new version.
Processing pipelines and language data
If you're importing language data or Language classes, make sure to change
your import statements to import from spacy.lang. If you've added your own
custom language, it needs to be moved to spacy/lang/xx and adjusted
accordingly.
- from spacy.en import English
+ from spacy.lang.en import English
If you've been using custom pipeline components, check out the new guide on
processing pipelines. Pipeline components are now (name, func) tuples.
Appending them to the pipeline still works – but the add_pipe method now makes
this much more convenient. Methods for removing, renaming, replacing and
retrieving components have been added as well. Components can now be disabled
by passing a list of their names to the disable keyword argument on load, or by
using disable_pipes as a method or context manager:
- nlp = spacy.load("en", tagger=False, entity=False)
- doc = nlp(u"I don't want parsed", parse=False)
+ nlp = spacy.load("en", disable=["tagger", "ner"])
+ with nlp.disable_pipes("parser"):
+     doc = nlp(u"I don't want parsed")
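The model behind these methods is simple: the pipeline is an ordered list of `(name, func)` tuples, and disabling a component means temporarily taking it out of that list. The `ToyPipeline` class below is a hypothetical stand-in written for illustration, with plain dicts as documents. It covers only the context manager usage and is not how `Language` is implemented.

```python
from contextlib import contextmanager


class ToyPipeline:
    """Minimal pipeline: an ordered list of (name, func) tuples."""

    def __init__(self):
        self.pipeline = []

    def add_pipe(self, func, name=None, first=False):
        entry = (name or func.__name__, func)
        if first:
            self.pipeline.insert(0, entry)
        else:
            self.pipeline.append(entry)

    @contextmanager
    def disable_pipes(self, *names):
        saved = list(self.pipeline)
        self.pipeline = [(n, f) for n, f in saved if n not in names]
        try:
            yield
        finally:
            self.pipeline = saved  # restore the full pipeline on exit

    def __call__(self, text):
        doc = {"text": text, "applied": []}
        for name, func in self.pipeline:
            doc = func(doc)
        return doc


def tagger(doc):
    doc["applied"].append("tagger")
    return doc


def parser(doc):
    doc["applied"].append("parser")
    return doc


nlp = ToyPipeline()
nlp.add_pipe(parser)
nlp.add_pipe(tagger, first=True)

with nlp.disable_pipes("parser"):
    unparsed = nlp("I don't want parsed")
parsed = nlp("Parse me")
```

Inside the `with` block only the tagger runs. Once the block exits, the parser is back in place, so later calls are unaffected.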
To add spaCy's built-in pipeline components to your pipeline, you can still
import and instantiate them directly – but it's more convenient to use the new
create_pipe method with the component name, i.e. 'tagger', 'parser', 'ner'
or 'textcat'.
- from spacy.pipeline import Tagger
- tagger = Tagger(nlp.vocab)
- nlp.pipeline.insert(0, tagger)
+ tagger = nlp.create_pipe("tagger")
+ nlp.add_pipe(tagger, first=True)
Training
All built-in pipeline components are now subclasses of Pipe,
fully trainable and serializable, and follow the same API. Instead of updating
the model and telling spaCy when to stop, you can now explicitly call
begin_training, which returns an optimizer you
can pass into the update function. While update still accepts sequences of Doc
and GoldParse objects, you can now also pass
in a list of strings and dictionaries describing the annotations. We call this
the "simple training style". This is
also the recommended usage, as it removes one layer of abstraction from the
training.
- for itn in range(1000):
-     for text, entities in train_data:
-         doc = Doc(text)
-         gold = GoldParse(doc, entities=entities)
-         nlp.update(doc, gold)
- nlp.end_training()
- nlp.save_to_directory("/model")
+ nlp.begin_training()
+ for itn in range(1000):
+     for texts, annotations in train_data:
+         nlp.update(texts, annotations)
+ nlp.to_disk("/model")
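In the simple training style, each example is a text paired with a dictionary of annotations, and a batch is split into parallel `texts` and `annotations` sequences before it reaches `nlp.update`. The sketch below shows one way to prepare such batches in plain Python. The sample sentences, the local `minibatch` helper and the batch size are assumptions for illustration, and the `nlp.update` call is left as a comment because it needs a loaded pipeline.

```python
# Each example: (text, {"entities": [(start_char, end_char, label), ...]})
TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
    ("I like London", {"entities": [(7, 13, "GPE")]}),
]


def minibatch(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


batches = []
for batch in minibatch(TRAIN_DATA, size=2):
    # Split the (text, annotations) pairs into two parallel tuples.
    texts, annotations = zip(*batch)
    batches.append((texts, annotations))
    # With spaCy v2 and an optimizer returned by nlp.begin_training():
    # nlp.update(texts, annotations, sgd=optimizer)
```

Entity offsets are character positions into the raw text, so it is worth checking that `text[start:end]` really yields the intended span before training.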
Attaching custom data to the Doc
Previously, you had to create a new container in order to attach custom data to
a Doc object. This often required converting the Doc objects to and from
arrays. In spaCy v2.0, you can set your own attributes, properties and methods
on the Doc, Token and Span via custom extensions.
This means that your application can – and should – only pass around Doc
objects and refer to them as the single source of truth.
- doc = nlp(u"This is a regular doc")
- doc_array = doc.to_array(["ORTH", "POS"])
- doc_with_meta = {"doc_array": doc_array, "meta": get_doc_meta(doc_array)}
+ Doc.set_extension("meta", getter=get_doc_meta)
+ doc_with_meta = nlp(u"This is a doc with meta data")
+ meta = doc_with_meta._.meta
If you wrap your extension attributes in a custom pipeline component, they
will be assigned automatically when you call nlp on a text. If your
application assigns custom data to spaCy's container objects, or includes other
utilities that interact with the pipeline, consider moving this logic into its
own extension module.
- doc = nlp(u"Doc with a standard pipeline")
- meta = get_meta(doc)
+ nlp.add_pipe(meta_component)
+ doc = nlp(u"Doc with a custom pipeline that assigns meta")
+ meta = doc._.meta
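A component like `meta_component` is just a function that receives the `Doc`, sets the extension attribute and returns the `Doc`. The sketch below is one hypothetical way to write it. `FakeDoc` is a stand-in used only to exercise the function without loading a model, and `get_meta` computes a made-up token count. With spaCy, the attribute would first be registered with `Doc.set_extension("meta", default=None)`.

```python
from types import SimpleNamespace


def get_meta(doc):
    """Hypothetical helper: derive some metadata from a document."""
    return {"n_tokens": len(doc)}


def meta_component(doc):
    """Pipeline component: store the metadata on the doc and return it."""
    doc._.meta = get_meta(doc)
    return doc


class FakeDoc(list):
    """Stand-in for Doc: a list of tokens with a `._` extension namespace."""

    def __init__(self, tokens):
        super().__init__(tokens)
        self._ = SimpleNamespace()


doc = meta_component(FakeDoc(["Doc", "with", "custom", "meta"]))
meta = doc._.meta
```

Because the component returns the same object it received, it can sit anywhere in the pipeline and later components can rely on `doc._.meta` being set.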
Strings and hash values
The change from integer IDs to hash values may not actually affect your code
very much. However, if you're adding strings to the vocab manually, you now need
to call StringStore.add explicitly. You can also now
be sure that the string-to-hash mapping will always match across vocabularies.
- nlp.vocab.strings[u"coffee"] # 3672
- other_nlp.vocab.strings[u"coffee"] # 40259
+ nlp.vocab.strings.add(u"coffee")
+ nlp.vocab.strings[u"coffee"] # 3197928453018144401
+ other_nlp.vocab.strings[u"coffee"] # 3197928453018144401
Adding patterns and callbacks to the matcher
If you're using the matcher, you can now add patterns in one step. This should
be easy to update – simply merge the ID, callback and patterns into one call to
Matcher.add(). The matcher now also supports string keys,
which saves you an extra import. If you've been using acceptor functions,
you'll need to move this logic into the on_match callbacks. The callback
function is invoked on every match and will give you access to the doc, the
index of the current match and the list of all matches. This lets you accept or
reject the match, and define the actions to be triggered.
- matcher.add_entity("GoogleNow", on_match=merge_phrases)
- matcher.add_pattern("GoogleNow", [{ORTH: "Google"}, {ORTH: "Now"}])
+ matcher.add("GoogleNow", merge_phrases, [{"ORTH": "Google"}, {"ORTH": "Now"}])
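As a sketch of moving acceptor logic into a callback, the function below takes the arguments described above, preceded by the matcher itself, and only acts on matches it accepts. The name `accept_capitalized`, the `accepted` list and the stand-in data used to call it are hypothetical. With spaCy, the function would be passed as the second argument to `matcher.add()`, and match IDs would be hash values instead of strings.

```python
accepted = []


def accept_capitalized(matcher, doc, i, matches):
    """on_match callback: keep a match only if its first token is capitalized."""
    match_id, start, end = matches[i]
    first_token = str(doc[start])
    if not first_token[:1].isupper():
        return  # reject: do nothing for this match
    # Accept: trigger whatever action the match should cause.
    accepted.append((match_id, start, end))


# Stand-in data so the callback can be exercised without a model:
# a list of strings plays the role of the Doc, tuples the role of matches.
doc = ["I", "use", "Google", "Now", "and", "google", "now"]
matches = [("GoogleNow", 2, 4), ("GoogleNow", 5, 7)]
for i in range(len(matches)):
    accept_capitalized(None, doc, i, matches)
```

Only the capitalized occurrence ends up in `accepted`. The lowercase one is matched but ignored, which is the job an acceptor function did in v1.x.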
If you need to match large terminology lists, you can now also use the
PhraseMatcher, which accepts Doc objects as match
patterns and is more efficient than the regular, rule-based matcher.
- matcher = Matcher(nlp.vocab)
- matcher.add_entity("PRODUCT")
- for text in large_terminology_list:
-     matcher.add_pattern("PRODUCT", [{ORTH: text}])
+ from spacy.matcher import PhraseMatcher
+ matcher = PhraseMatcher(nlp.vocab)
+ patterns = [nlp.make_doc(text) for text in large_terminology_list]
+ matcher.add("PRODUCT", None, *patterns)