---
title: What's New in v2.0
teaser: New features, backwards incompatibilities and migration guide
menu:
  - ['Summary', 'summary']
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v1.x', 'migrating']
---

We're very excited to finally introduce spaCy v2.0! On this page, you'll find a
summary of the new features and information on the backwards incompatibilities,
including a handy overview of what's been renamed or deprecated. To help you
make the most of v2.0, we also **re-wrote almost all of the usage guides and API
docs**, and added more [real-world examples](/usage/examples). If you're new to
spaCy, or just want to brush up on some NLP basics and the details of the
library, check out the [spaCy 101 guide](/usage/spacy-101) that explains the
most important concepts with examples and illustrations.

## Summary {#summary}

<Grid cols={2}>

<div>

This release features entirely new **deep learning-powered models** for spaCy's
tagger, parser and entity recognizer. The new models are **10× smaller**, **20%
more accurate** and **even cheaper to run** than the previous generation.

We've also made several usability improvements that are particularly helpful for
**production deployments**. spaCy v2 now fully supports the Pickle protocol,
making it easy to use spaCy with [Apache Spark](https://spark.apache.org/). The
string-to-integer mapping is **no longer stateful**, making it easy to reconcile
annotations made in different processes. Models are smaller and use less memory,
and the APIs for serialization are now much more consistent. Custom pipeline
components let you modify the `Doc` at any stage in the pipeline. You can now
also add your own custom attributes, properties and methods to the `Doc`,
`Token` and `Span`.

</div>

<Infobox title="Table of Contents" id="toc">

- [Summary](#summary)
- [New features](#features)
- [Neural network models](#features-models)
- [Improved processing pipelines](#features-pipelines)
- [Text classification](#features-text-classification)
- [Hash values as IDs](#features-hash-ids)
- [Improved word vectors support](#features-vectors)
- [Saving, loading and serialization](#features-serializer)
- [displaCy visualizer](#features-displacy)
- [Language data and lazy loading](#features-language)
- [Revised matcher API and phrase matcher](#features-matcher)
- [Backwards incompatibilities](#incompat)
- [Migrating from spaCy v1.x](#migrating)

</Infobox>

</Grid>

The main usability improvements you'll notice in spaCy v2.0 are around
**defining, training and loading your own models** and components. The new
neural network models make it much easier to train a model from scratch, or
update an existing model with a few examples. In v1.x, the statistical models
depended on the state of the `Vocab`. If you taught the model a new word, you
would have to save and load a lot of data – otherwise the model wouldn't
correctly recall the features of your new example. That's no longer the case.

Due to some clever use of hashing, the statistical models **never change size**,
even as they learn new vocabulary items. The whole pipeline is also now fully
differentiable. Even if you don't have explicitly annotated data, you can update
spaCy using all the **latest deep learning tricks** like adversarial training,
noise contrastive estimation or reinforcement learning.


## New features {#features}

This section contains an overview of the most important **new features and
improvements**. The [API docs](/api) include additional deprecation notes.

### Convolutional neural network models {#features-models}

> #### Example
>
> ```bash
> python -m spacy download en_core_web_sm
> python -m spacy download de_core_news_sm
> python -m spacy download xx_ent_wiki_sm
> ```

spaCy v2.0 features new neural models for tagging, parsing and entity
recognition. The models have been designed and implemented from scratch
specifically for spaCy, to give you an unmatched balance of speed, size and
accuracy. The new models are **10× smaller**, **20% more accurate**, and **even
cheaper to run** than the previous generation.

spaCy v2.0's new neural network models bring significant improvements in
accuracy, especially for English Named Entity Recognition. The new
[`en_core_web_lg`](/models/en#en_core_web_lg) model makes about **25% fewer
mistakes** than the corresponding v1.x model and is within **1% of the current
state-of-the-art**
([Strubell et al., 2017](https://arxiv.org/pdf/1702.02098.pdf)). The v2.0 models
are also cheaper to run at scale, as they require **under 1 GB of memory** per
process.

<Infobox>

**Usage:** [Models directory](/models)

</Infobox>

### Improved processing pipelines {#features-pipelines}

> #### Example
>
> ```python
> # Set custom attributes
> Doc.set_extension("my_attr", default=False)
> Token.set_extension("my_attr", getter=my_token_getter)
> assert doc._.my_attr, token._.my_attr
>
> # Add components to the pipeline
> my_component = lambda doc: doc
> nlp.add_pipe(my_component)
> ```

It's now much easier to **customize the pipeline** with your own components:
functions that receive a `Doc` object, modify and return it. Extensions let you
write any **attributes, properties and methods** to the `Doc`, `Token` and
`Span`. You can add data, implement new features, integrate other libraries with
spaCy or plug in your own machine learning models.

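
To make the "receive a `Doc`, modify it, return it" contract concrete, here is a minimal plain-Python sketch. It does not use spaCy: `ToyDoc`, `count_words`, `flag_questions` and `run_pipeline` are hypothetical names invented for illustration, and the real `Doc` and `Language` classes do considerably more.

```python
class ToyDoc:
    """Hypothetical stand-in for a Doc: a list of words plus custom data."""

    def __init__(self, text):
        self.words = text.split()
        self.user_data = {}


def count_words(doc):
    # A component receives the doc, modifies it and returns it.
    doc.user_data["n_words"] = len(doc.words)
    return doc


def flag_questions(doc):
    # A later component can rely on the same doc object being passed along.
    doc.user_data["is_question"] = bool(doc.words) and doc.words[-1].endswith("?")
    return doc


def run_pipeline(text, pipeline):
    """Apply each (name, component) pair in order, passing the doc along."""
    doc = ToyDoc(text)
    for name, component in pipeline:
        doc = component(doc)
    return doc


pipeline = [("count_words", count_words), ("flag_questions", flag_questions)]
doc = run_pipeline("Is this a pipeline?", pipeline)
print(doc.user_data)  # {'n_words': 4, 'is_question': True}
```

Because every component has the same signature, components can be reordered, replaced or removed without touching the others.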

<Infobox>

**API:** [`Language`](/api/language),
[`Doc.set_extension`](/api/doc#set_extension),
[`Span.set_extension`](/api/span#set_extension),
[`Token.set_extension`](/api/token#set_extension) **Usage:**
[Processing pipelines](/usage/processing-pipelines) **Code:**
[Pipeline examples](/usage/examples#section-pipeline)

</Infobox>

### Text classification {#features-text-classification}

> #### Example
>
> ```python
> textcat = nlp.create_pipe("textcat")
> nlp.add_pipe(textcat, last=True)
> nlp.begin_training()
> for itn in range(100):
>     for doc, gold in train_data:
>         nlp.update([doc], [gold])
> doc = nlp("This is a text.")
> print(doc.cats)
> ```

spaCy v2.0 lets you add text categorization models to spaCy pipelines. The model
supports classification with multiple, non-mutually exclusive labels – so
multiple labels can apply at once. You can change the model architecture rather
easily, but by default, the `TextCategorizer` class uses a convolutional neural
network to assign position-sensitive vectors to each word in the document.

<Infobox>

**API:** [`TextCategorizer`](/api/textcategorizer),
[`Doc.cats`](/api/doc#attributes), `GoldParse.cats` **Usage:**
[Training a text classification model](/usage/training#textcat)

</Infobox>

### Hash values instead of integer IDs {#features-hash-ids}

> #### Example
>
> ```python
> doc = nlp("I love coffee")
> assert doc.vocab.strings["coffee"] == 3197928453018144401
> assert doc.vocab.strings[3197928453018144401] == "coffee"
>
> beer_hash = doc.vocab.strings.add("beer")
> assert doc.vocab.strings["beer"] == beer_hash
> assert doc.vocab.strings[beer_hash] == "beer"
> ```

The [`StringStore`](/api/stringstore) now resolves all strings to hash values
instead of integer IDs. This means that the string-to-int mapping **no longer
depends on the vocabulary state**, making a lot of workflows much simpler,
especially during training. Unlike integer IDs in spaCy v1.x, hash values will
**always match** – even across models. Strings can now be added explicitly using
the new [`StringStore.add`](/api/stringstore#add) method. A token's hash is
available via `token.orth`.

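
The reason hash values match across models is that the ID is computed from the string alone, not from the order in which strings were added. The sketch below illustrates this in plain Python. It is not spaCy's implementation: `ToyStringStore` and `hash_string` are hypothetical names, and the standard library's BLAKE2b is swapped in as the hash function purely for illustration, so the values it produces will differ from spaCy's.

```python
import hashlib


def hash_string(string):
    """Map a string to a 64-bit integer that depends only on the string."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Hypothetical store: string -> hash is stateless, hash -> string is a lookup."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        key = hash_string(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # Hashing needs no stored state at all.
            return hash_string(key)
        # Going back from hash to string requires the string to have been added.
        return self._by_hash[key]


store_a = ToyStringStore()
store_b = ToyStringStore()
store_b.add("tea")  # different insertion history than store_a

coffee_a = store_a.add("coffee")
coffee_b = store_b.add("coffee")
assert coffee_a == coffee_b  # same ID regardless of what was added before
assert store_a[coffee_a] == "coffee"
```

With sequential integer IDs, `"coffee"` would have received a different ID in each store, since `store_b` had already seen another string.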

<Infobox>

**API:** [`StringStore`](/api/stringstore) **Usage:**
[Vocab, hashes and lexemes 101](/usage/spacy-101#vocab)

</Infobox>

### Improved word vectors support {#features-vectors}

> #### Example
>
> ```python
> for word, vector in vector_data:
>     nlp.vocab.set_vector(word, vector)
> nlp.vocab.vectors.from_glove("/path/to/vectors")
> # Keep 10000 unique vectors and remap the rest
> nlp.vocab.prune_vectors(10000)
> nlp.to_disk("/model")
> ```

The new [`Vectors`](/api/vectors) class helps the `Vocab` manage the vectors
assigned to strings, and lets you assign vectors individually, or
[load in GloVe vectors](/usage/linguistic-features#adding-vectors) from a
directory. To help you strike a good balance between coverage and memory usage,
the `Vectors` class lets you map **multiple keys** to the **same row** of the
table. If you're using the [`spacy init-model`](/api/cli#init-model) command to
create a vocabulary, pruning the vectors will be taken care of automatically if
you set the `--prune-vectors` flag. Otherwise, you can use the new
[`Vocab.prune_vectors`](/api/vocab#prune_vectors).

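
The idea of mapping multiple keys to the same row can be sketched with a dictionary that points into a shared table. This is a plain-Python illustration under simplifying assumptions, not the `Vectors` class itself: `ToyVectors` is a hypothetical name, keys are strings rather than hashes, and rows are plain lists instead of an array.

```python
class ToyVectors:
    """Hypothetical vector table where several keys may share one row."""

    def __init__(self):
        self.rows = []     # the table: one list of floats per row
        self.key2row = {}  # maps a key to the index of its row

    def add(self, key, vector=None, row=None):
        if row is None:
            # No existing row given: store the vector in a new row.
            self.rows.append(list(vector))
            row = len(self.rows) - 1
        self.key2row[key] = row
        return row

    def __getitem__(self, key):
        return self.rows[self.key2row[key]]


vectors = ToyVectors()
cat_row = vectors.add("cat", vector=[0.1, 0.9])
vectors.add("kitty", row=cat_row)  # reuse the row instead of storing a new one
vectors.add("dog", vector=[0.8, 0.2])

print(len(vectors.key2row), "keys share", len(vectors.rows), "rows")
```

Pruning follows the same principle: rare keys are remapped to the row of a similar, more frequent key, so coverage stays high while the table shrinks.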

<Infobox>

**API:** [`Vectors`](/api/vectors), [`Vocab`](/api/vocab) **Usage:**
[Word vectors and semantic similarity](/usage/vectors-similarity)

</Infobox>

### Saving, loading and serialization {#features-serializer}

> #### Example
>
> ```python
> nlp = spacy.load("en") # shortcut link
> nlp = spacy.load("en_core_web_sm") # package
> nlp = spacy.load("/path/to/en") # unicode path
> nlp = spacy.load(Path("/path/to/en")) # pathlib Path
>
> nlp.to_disk("/path/to/nlp")
> nlp = English().from_disk("/path/to/nlp")
> ```

spaCy's serialization API has been made consistent across classes and objects.
All container classes, i.e. `Language`, `Doc`, `Vocab` and `StringStore`, now
have `to_bytes()`, `from_bytes()`, `to_disk()` and `from_disk()` methods and
support the Pickle protocol.

The improved `spacy.load` makes loading models easier and more transparent. You
can load a model by supplying its shortcut link, the name of an installed
[model package](/models) or a path. The `Language` class to initialize will be
determined based on the model's settings. For a blank language, you can import
the class directly, e.g. `from spacy.lang.en import English` or use
[`spacy.blank()`](/api/top-level#spacy.blank).

<Infobox>

**API:** [`spacy.load`](/api/top-level#spacy.load),
[`Language.to_disk`](/api/language#to_disk) **Usage:**
[Models](/usage/models#usage),
[Saving and loading](/usage/saving-loading#models)

</Infobox>


### displaCy visualizer with Jupyter support {#features-displacy}

> #### Example
>
> ```python
> from spacy import displacy
> doc = nlp("This is a sentence about Facebook.")
> displacy.serve(doc, style="dep") # run the web server
> html = displacy.render(doc, style="ent") # generate HTML
> ```

Our popular dependency and named entity visualizers are now an official part of
the spaCy library. displaCy can run a simple web server, or generate raw HTML
markup or SVG files to be exported. You can pass in one or more docs, and
customize the style. displaCy also auto-detects whether you're running
[Jupyter](https://jupyter.org) and will render the visualizations in your
notebook.

<Infobox>

**API:** [`displacy`](/api/top-level#displacy) **Usage:**
[Visualizing spaCy](/usage/visualizers)

</Infobox>

### Improved language data and lazy loading {#features-language}

Language-specific data now lives in its own submodule, `spacy.lang`. Languages
are lazy-loaded, i.e. only loaded when you import a `Language` class, or load a
model that initializes one. This allows languages to contain more custom data,
e.g. lemmatizer lookup tables, or complex regular expressions. The language data
has also been tidied up and simplified. spaCy now also supports simple
lookup-based lemmatization – and **many new languages**!

<Infobox>

**API:** [`Language`](/api/language) **Code:**
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang)
**Usage:** [Adding languages](/usage/adding-languages)

</Infobox>

### Revised matcher API and phrase matcher {#features-matcher}

> #### Example
>
> ```python
> from spacy.matcher import Matcher, PhraseMatcher
>
> matcher = Matcher(nlp.vocab)
> matcher.add("HEARTS", None, [{"ORTH": "❤️", "OP": "+"}])
>
> phrasematcher = PhraseMatcher(nlp.vocab)
> phrasematcher.add("OBAMA", None, nlp("Barack Obama"))
> ```

Patterns can now be added to the matcher by calling
[`matcher.add()`](/api/matcher#add) with a match ID, an optional callback
function to be invoked on each match, and one or more patterns. This allows you
to write powerful, pattern-specific logic using only one matcher. For example,
you might only want to merge some entity types, and set custom flags for other
matched patterns. The new [`PhraseMatcher`](/api/phrasematcher) lets you
efficiently match very large terminology lists using `Doc` objects as match
patterns.

<Infobox>

**API:** [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher)
**Usage:** [Rule-based matching](/usage/rule-based-matching)

</Infobox>


## Backwards incompatibilities {#incompat}

The following modules, classes and methods have changed between v1.x and v2.0.

| Old                                                    | New                                                                                                                                                 |
| ------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `spacy.download.en`, `spacy.download.de`               | [`cli.download`](/api/cli#download)                                                                                                                 |
| `spacy.en` etc.                                        | `spacy.lang.en` etc.                                                                                                                                |
| `spacy.en.word_sets`                                   | `spacy.lang.en.stop_words`                                                                                                                          |
| `spacy.orth`                                           | `spacy.lang.xx.lex_attrs`                                                                                                                           |
| `spacy.syntax.iterators`                               | `spacy.lang.xx.syntax_iterators`                                                                                                                    |
| `spacy.tagger.Tagger`                                  | `spacy.pipeline.Tagger`                                                                                                                             |
| `spacy.cli.model`                                      | [`spacy.cli.vocab`](/api/cli#vocab)                                                                                                                 |
| `Language.save_to_directory`                           | [`Language.to_disk`](/api/language#to_disk)                                                                                                         |
| `Language.end_training`                                | [`Language.begin_training`](/api/language#begin_training)                                                                                           |
| `Language.create_make_doc`                             | [`Language.tokenizer`](/api/language#attributes)                                                                                                    |
| `Vocab.resize_vectors`                                 | [`Vectors.resize`](/api/vectors#resize)                                                                                                             |
| `Vocab.load` `Vocab.load_lexemes`                      | [`Vocab.from_disk`](/api/vocab#from_disk) [`Vocab.from_bytes`](/api/vocab#from_bytes)                                                               |
| `Vocab.dump`                                           | [`Vocab.to_disk`](/api/vocab#to_disk) [`Vocab.to_bytes`](/api/vocab#to_bytes)                                                                       |
| `Vocab.load_vectors` `Vocab.load_vectors_from_bin_loc` | [`Vectors.from_disk`](/api/vectors#from_disk) [`Vectors.from_bytes`](/api/vectors#from_bytes) [`Vectors.from_glove`](/api/vectors#from_glove)       |
| `Vocab.dump_vectors`                                   | [`Vectors.to_disk`](/api/vectors#to_disk) [`Vectors.to_bytes`](/api/vectors#to_bytes)                                                               |
| `StringStore.load`                                     | [`StringStore.from_disk`](/api/stringstore#from_disk) [`StringStore.from_bytes`](/api/stringstore#from_bytes)                                       |
| `StringStore.dump`                                     | [`StringStore.to_disk`](/api/stringstore#to_disk) [`StringStore.to_bytes`](/api/stringstore#to_bytes)                                               |
| `Tokenizer.load`                                       | [`Tokenizer.from_disk`](/api/tokenizer#from_disk) [`Tokenizer.from_bytes`](/api/tokenizer#from_bytes)                                               |
| `Tagger.load`                                          | [`Tagger.from_disk`](/api/tagger#from_disk) [`Tagger.from_bytes`](/api/tagger#from_bytes)                                                           |
| `Tagger.tag_names`                                     | `Tagger.labels`                                                                                                                                     |
| `DependencyParser.load`                                | [`DependencyParser.from_disk`](/api/dependencyparser#from_disk) [`DependencyParser.from_bytes`](/api/dependencyparser#from_bytes)                   |
| `EntityRecognizer.load`                                | [`EntityRecognizer.from_disk`](/api/entityrecognizer#from_disk) [`EntityRecognizer.from_bytes`](/api/entityrecognizer#from_bytes)                   |
| `Matcher.load`                                         | -                                                                                                                                                   |
| `Matcher.add_pattern` `Matcher.add_entity`             | [`Matcher.add`](/api/matcher#add) [`PhraseMatcher.add`](/api/phrasematcher#add)                                                                     |
| `Matcher.get_entity`                                   | [`Matcher.get`](/api/matcher#get)                                                                                                                   |
| `Matcher.has_entity`                                   | [`Matcher.has_key`](/api/matcher#has_key)                                                                                                           |
| `Doc.read_bytes`                                       | [`Doc.to_bytes`](/api/doc#to_bytes) [`Doc.from_bytes`](/api/doc#from_bytes) [`Doc.to_disk`](/api/doc#to_disk) [`Doc.from_disk`](/api/doc#from_disk) |
| `Token.is_ancestor_of`                                 | [`Token.is_ancestor`](/api/token#is_ancestor)                                                                                                       |

### Deprecated {#deprecated}

The following methods are deprecated. They can still be used, but should be
replaced.

| Old                          | New                                             |
| ---------------------------- | ----------------------------------------------- |
| `Tokenizer.tokens_from_list` | [`Doc`](/api/doc)                               |
| `Span.sent_start`            | [`Span.is_sent_start`](/api/span#is_sent_start) |


## Migrating from spaCy 1.x {#migrating}

Because we've made so many architectural changes to the library, we've tried to
**keep breaking changes to a minimum**. A lot of projects follow the philosophy
that if you're going to break anything, you may as well break everything. We
think migration is easier if there's a logic to what has changed. We've
therefore followed a policy of avoiding breaking changes to the `Doc`, `Span`
and `Token` objects. This way, you can focus on only migrating the code that
does training, loading and serialization – in other words, code that works with
the `nlp` object directly. Code that uses the annotations should continue to
work.

<Infobox title="Important note" variant="warning">

If you've trained your own models, keep in mind that your train and runtime
inputs must match. This means you'll have to **retrain your models** with spaCy
v2.0.

</Infobox>

### Document processing {#migrating-document-processing}

The [`Language.pipe`](/api/language#pipe) method allows spaCy to batch
documents, which brings a **significant performance advantage** in v2.0. The new
neural networks introduce some overhead per batch, so if you're processing a
number of documents in a row, you should use `nlp.pipe` and process the texts as
a stream.

```diff
- docs = (nlp(text) for text in texts)

+ docs = nlp.pipe(texts)
```

To make usage easier, there's now a boolean `as_tuples` keyword argument that
lets you pass in an iterator of `(text, context)` pairs, so you can get back an
iterator of `(doc, context)` tuples.

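
The behavior of `as_tuples` can be pictured as a generator that processes the first element of each pair and passes the second one through untouched. The following is a plain-Python sketch of that idea, not the implementation of `Language.pipe`: `toy_pipe` and `process` are hypothetical stand-ins, and `process` simply splits on whitespace instead of producing a `Doc`.

```python
def process(text):
    # Hypothetical stand-in for processing one text into a Doc.
    return text.split()


def toy_pipe(texts, as_tuples=False):
    """Lazily process a stream, optionally carrying a context object along."""
    if as_tuples:
        for text, context in texts:
            yield process(text), context
    else:
        for text in texts:
            yield process(text)


data = [("A short text", {"id": 1}), ("Another one", {"id": 2})]
results = list(toy_pipe(data, as_tuples=True))
for doc, context in results:
    print(context["id"], doc)
```

Carrying the context through the stream means you don't have to re-align processed documents with their metadata (IDs, labels, database rows) afterwards.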

### Saving, loading and serialization {#migrating-saving-loading}

Double-check all calls to `spacy.load()` and make sure they don't use the `path`
keyword argument. If you're only loading in binary data and not a model package
that can construct its own `Language` class and pipeline, you should now use the
[`Language.from_disk`](/api/language#from_disk) method.

```diff
- nlp = spacy.load("en", path="/model")

+ nlp = spacy.load("/model")
+ nlp = spacy.blank("en").from_disk("/model/data")
```

Review all other code that writes state to disk or bytes. All containers now
share the same, consistent API for saving and loading. Replace saving with
`to_disk()` or `to_bytes()`, and loading with `from_disk()` or `from_bytes()`.

```diff
- nlp.save_to_directory("/model")
- nlp.vocab.dump("/vocab")

+ nlp.to_disk("/model")
+ nlp.vocab.to_disk("/vocab")
```

If you've trained models with input from v1.x, you'll need to **retrain them**
with spaCy v2.0. Models trained with v1.x are not compatible with the new
version.

### Processing pipelines and language data {#migrating-languages}

If you're importing language data or `Language` classes, make sure to change
your import statements to import from `spacy.lang`. If you've added your own
custom language, it needs to be moved to `spacy/lang/xx` and adjusted
accordingly.

```diff
- from spacy.en import English

+ from spacy.lang.en import English
```

If you've been using custom pipeline components, check out the new guide on
[processing pipelines](/usage/processing-pipelines). Pipeline components are now
`(name, func)` tuples. Appending them to the pipeline still works – but the
[`add_pipe`](/api/language#add_pipe) method now makes this much more convenient.
Methods for removing, renaming, replacing and retrieving components have been
added as well. Components can now be disabled by passing a list of their names
to the `disable` keyword argument on load, or by using
[`disable_pipes`](/api/language#disable_pipes) as a method or context manager:

```diff
- nlp = spacy.load("en_core_web_sm", tagger=False, entity=False)
- doc = nlp("I don't want parsed", parse=False)

+ nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"])
+ with nlp.disable_pipes("parser"):
+     doc = nlp("I don't want parsed")
```
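
Since the pipeline is a list of `(name, func)` tuples, temporarily disabling components amounts to filtering that list and restoring it afterwards. The sketch below shows the pattern in plain Python. It is an illustration, not spaCy's code: this standalone `disable_pipes` function and the string methods used as placeholder components are assumptions made for the example.

```python
from contextlib import contextmanager


@contextmanager
def disable_pipes(pipeline, *names):
    """Temporarily remove the named components from a list of (name, func) tuples."""
    original = list(pipeline)
    pipeline[:] = [(name, func) for name, func in pipeline if name not in names]
    try:
        yield pipeline
    finally:
        # Restore the full pipeline even if the block raised an exception.
        pipeline[:] = original


# Placeholder components: any callables would do here.
pipeline = [("tagger", str.lower), ("parser", str.strip), ("ner", str.title)]

with disable_pipes(pipeline, "parser"):
    active_inside = [name for name, _ in pipeline]
active_after = [name for name, _ in pipeline]

print(active_inside)  # ['tagger', 'ner']
print(active_after)   # ['tagger', 'parser', 'ner']
```

The `try`/`finally` is the important part of the design: the components come back when the block exits, whether it finished normally or not.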

To add spaCy's built-in pipeline components to your pipeline, you can still
import and instantiate them directly – but it's more convenient to use the new
[`create_pipe`](/api/language#create_pipe) method with the component name, i.e.
`'tagger'`, `'parser'`, `'ner'` or `'textcat'`.

```diff
- from spacy.pipeline import Tagger
- tagger = Tagger(nlp.vocab)
- nlp.pipeline.insert(0, tagger)

+ tagger = nlp.create_pipe("tagger")
+ nlp.add_pipe(tagger, first=True)
```

### Training {#migrating-training}

All built-in pipeline components are now subclasses of [`Pipe`](/api/pipe),
fully trainable and serializable, and follow the same API. Instead of updating
the model and telling spaCy when to _stop_, you can now explicitly call
[`begin_training`](/api/language#begin_training), which returns an optimizer you
can pass into the [`update`](/api/language#update) function. While `update`
still accepts sequences of `Doc` and `GoldParse` objects, you can now also pass
in a list of strings and dictionaries describing the annotations. We call this
the ["simple training style"](/usage/training#training-simple-style). This is
also the recommended usage, as it removes one layer of abstraction from the
training.

```diff
- for itn in range(1000):
-     for text, entities in train_data:
-         doc = Doc(text)
-         gold = GoldParse(doc, entities=entities)
-         nlp.update(doc, gold)
- nlp.end_training()
- nlp.save_to_directory("/model")

+ nlp.begin_training()
+ for itn in range(1000):
+     for texts, annotations in train_data:
+         nlp.update(texts, annotations)
+ nlp.to_disk("/model")
```
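
In the simple training style, each update receives two parallel sequences: the texts and the dictionaries describing their annotations. The plain-Python sketch below shows one way to get from a list of `(text, annotations)` pairs to such batches. It runs without spaCy, so `update` is a hypothetical stand-in that only records its arguments, `minibatch` is a local helper written for this example, and the three training examples are made up.

```python
import random


def minibatch(items, size):
    """Yield consecutive slices of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up examples: raw text plus a dict of annotations (character offsets).
train_data = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
    ("I like London", {"entities": [(7, 13, "GPE")]}),
]

calls = []


def update(texts, annotations):
    # Hypothetical stand-in for nlp.update: just record what it was given.
    calls.append((list(texts), list(annotations)))


random.seed(0)
for itn in range(2):
    random.shuffle(train_data)  # avoid presenting examples in a fixed order
    for batch in minibatch(train_data, size=2):
        texts, annotations = zip(*batch)  # split pairs into parallel sequences
        update(texts, annotations)

print(len(calls), "updates")
```

Keeping the data as plain strings and dictionaries until the update call is what removes the layer of abstraction mentioned above: no `Doc` or `GoldParse` objects need to be constructed by hand.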

### Attaching custom data to the Doc {#migrating-doc}

Previously, you had to create a new container in order to attach custom data to
a `Doc` object. This often required converting the `Doc` objects to and from
arrays. In spaCy v2.0, you can set your own attributes, properties and methods
on the `Doc`, `Token` and `Span` via
[custom extensions](/usage/processing-pipelines#custom-components-attributes).
This means that your application can – and should – only pass around `Doc`
objects and refer to them as the single source of truth.

```diff
- doc = nlp("This is a regular doc")
- doc_array = doc.to_array(["ORTH", "POS"])
- doc_with_meta = {"doc_array": doc_array, "meta": get_doc_meta(doc_array)}

+ Doc.set_extension("meta", getter=get_doc_meta)
+ doc_with_meta = nlp("This is a doc with meta data")
+ meta = doc._.meta
```

If you wrap your extension attributes in a
[custom pipeline component](/usage/processing-pipelines#custom-components), they
will be assigned automatically when you call `nlp` on a text. If your
application assigns custom data to spaCy's container objects, or includes other
utilities that interact with the pipeline, consider moving this logic into its
own extension module.

```diff
- doc = nlp("Doc with a standard pipeline")
- meta = get_meta(doc)

+ nlp.add_pipe(meta_component)
+ doc = nlp("Doc with a custom pipeline that assigns meta")
+ meta = doc._.meta
```

### Strings and hash values {#migrating-strings}

The change from integer IDs to hash values may not actually affect your code
very much. However, if you're adding strings to the vocab manually, you now need
to call [`StringStore.add`](/api/stringstore#add) explicitly. You can also now
be sure that the string-to-hash mapping will always match across vocabularies.

```diff
- nlp.vocab.strings["coffee"]       # 3672
- other_nlp.vocab.strings["coffee"] # 40259

+ nlp.vocab.strings.add("coffee")
+ nlp.vocab.strings["coffee"]       # 3197928453018144401
+ other_nlp.vocab.strings["coffee"] # 3197928453018144401
```

### Adding patterns and callbacks to the matcher {#migrating-matcher}

If you're using the matcher, you can now add patterns in one step. This should
be easy to update – simply merge the ID, callback and patterns into one call to
[`Matcher.add()`](/api/matcher#add). The matcher now also supports string keys,
which saves you an extra import. If you've been using **acceptor functions**,
you'll need to move this logic into the
[`on_match` callbacks](/usage/linguistic-features#on_match). The callback
function is invoked on every match and will give you access to the doc, the
index of the current match and all total matches. This lets you both accept or
reject the match, and define the actions to be triggered.

```diff
- matcher.add_entity("GoogleNow", on_match=merge_phrases)
- matcher.add_pattern("GoogleNow", [{ORTH: "Google"}, {ORTH: "Now"}])

+ matcher.add("GoogleNow", merge_phrases, [{"ORTH": "Google"}, {"ORTH": "Now"}])
```
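
To show how a per-pattern callback receives the doc, the index of the current match and the list of all matches, here is a deliberately simplified plain-Python matcher. It is a sketch, not the `Matcher` class: `ToyMatcher` and `record_match` are hypothetical names, patterns are exact word sequences rather than token attribute dictionaries, and the doc is a plain list of words.

```python
class ToyMatcher:
    """Hypothetical matcher: exact word sequences, one optional callback per key."""

    def __init__(self):
        self._patterns = {}

    def add(self, key, on_match, *patterns):
        # ID, callback and patterns are registered in a single call.
        self._patterns[key] = (on_match, patterns)

    def __call__(self, words):
        matches = []
        for key, (on_match, patterns) in self._patterns.items():
            for pattern in patterns:
                n = len(pattern)
                for start in range(len(words) - n + 1):
                    if words[start:start + n] == pattern:
                        matches.append((key, start, start + n))
        # Invoke each match's callback with the doc, the index and all matches.
        for i, (key, start, end) in enumerate(matches):
            on_match = self._patterns[key][0]
            if on_match is not None:
                on_match(self, words, i, matches)
        return matches


seen = []


def record_match(matcher, doc, i, matches):
    key, start, end = matches[i]
    seen.append((key, doc[start:end]))


matcher = ToyMatcher()
matcher.add("GoogleNow", record_match, ["Google", "Now"])
matcher.add("HELLO", None, ["hello"])  # no callback for this key
matches = matcher("hello Google Now".split())
print(matches)
print(seen)
```

Because the callback sees the full list of matches as well as its own index, it can decide to accept or reject a match in context, which is the role acceptor functions used to play.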

If you need to match large terminology lists, you can now also use the
[`PhraseMatcher`](/api/phrasematcher), which accepts `Doc` objects as match
patterns and is more efficient than the regular, rule-based matcher.

```diff
- matcher = Matcher(nlp.vocab)
- matcher.add_entity("PRODUCT")
- for text in large_terminology_list:
-     matcher.add_pattern("PRODUCT", [{ORTH: text}])

+ from spacy.matcher import PhraseMatcher
+ matcher = PhraseMatcher(nlp.vocab)
+ patterns = [nlp.make_doc(text) for text in large_terminology_list]
+ matcher.add("PRODUCT", None, *patterns)
```