---
title: What's New in v3.0
teaser: New features, backwards incompatibilities and migration guide
menu:
  - ['Summary', 'summary']
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v2.x', 'migrating']
---

## Summary {#summary hidden="true"}

<Grid cols={2} gutterBottom={false}>

<div>

spaCy v3.0 features all-new **transformer-based pipelines** that bring spaCy's
accuracy right up to the current **state-of-the-art**. You can use any
pretrained transformer to train your own pipelines, and even share one
transformer between multiple components with **multi-task learning**. Training
is now fully configurable and extensible, and you can define your own custom
models using **PyTorch**, **TensorFlow** and other frameworks. The new spaCy
projects system lets you describe whole **end-to-end workflows** in a single
file, giving you an easy path from prototype to production, and making it easy
to clone and adapt best-practice projects for your own use cases.

</div>

<Infobox title="Table of Contents" id="toc">

- [Summary](#summary)
- [New features](#features)
- [Transformer-based pipelines](#features-transformers)
- [Training & config system](#features-training)
- [Custom models](#features-custom-models)
- [End-to-end project workflows](#features-projects)
- [Parallel training with Ray](#features-parallel-training)
- [New built-in components](#features-pipeline-components)
- [New custom component API](#features-components)
- [Dependency matching](#features-dep-matcher)
- [Python type hints](#features-types)
- [New methods & attributes](#new-methods)
- [New & updated documentation](#new-docs)
- [Backwards incompatibilities](#incompat)
- [Migrating from spaCy v2.x](#migrating)

</Infobox>

</Grid>

## New Features {#features}

This section contains an overview of the most important **new features and
improvements**. The [API docs](/api) include additional deprecation notes. New
methods and functions that were introduced in this version are marked with the
tag <Tag variant="new">3</Tag>.

### Transformer-based pipelines {#features-transformers}

> #### Example
>
> ```cli
> $ python -m spacy download en_core_web_trf
> ```

spaCy v3.0 features all-new transformer-based pipelines that bring spaCy's
accuracy right up to the current **state-of-the-art**. You can use any
pretrained transformer to train your own pipelines, and even share one
transformer between multiple components with **multi-task learning**. spaCy's
transformer support interoperates with [PyTorch](https://pytorch.org) and the
[HuggingFace `transformers`](https://huggingface.co/transformers/) library,
giving you access to thousands of pretrained models for your pipelines.

![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg)

import Benchmarks from 'usage/\_benchmarks-models.md'

<Benchmarks />

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
  [Training pipelines and models](/usage/training),
  [Benchmarks](/usage/facts-figures#benchmarks)
- **API:** [`Transformer`](/api/transformer),
  [`TransformerData`](/api/transformer#transformerdata),
  [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
- **Architectures:** [TransformerModel](/api/architectures#TransformerModel),
  [TransformerListener](/api/architectures#TransformerListener),
  [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
- **Trained Pipelines:** [`en_core_web_trf`](/models/en#en_core_web_trf)
- **Implementation:**
  [`spacy-transformers`](https://github.com/explosion/spacy-transformers)

</Infobox>

### New training workflow and config system {#features-training}

> #### Example
>
> ```ini
> [training]
> vectors = null
> accumulate_gradient = 3
>
> [training.optimizer]
> @optimizers = "Adam.v1"
>
> [training.optimizer.learn_rate]
> @schedules = "warmup_linear.v1"
> warmup_steps = 250
> total_steps = 20000
> initial_rate = 0.01
> ```

spaCy v3.0 introduces a comprehensive and extensible system for **configuring
your training runs**. A single configuration file describes every detail of your
training run, with no hidden defaults, making it easy to rerun your experiments
and track changes. You can use the
[quickstart widget](/usage/training#quickstart) or the `init config` command to
get started. Instead of providing lots of arguments on the command line, you
only need to pass your `config.cfg` file to `spacy train`.

Training config files include all **settings and hyperparameters** for training
your pipeline. Some settings can also be registered **functions** that you can
swap out and customize, making it easy to implement your own custom models and
architectures.
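
As a rough illustration of how a registered function can be referenced from a
config, the following sketch registers a small helper and resolves a config
block that points to it. The function name `example_stop_words.v1`, the
`[stop_words]` section and the `extra` setting are hypothetical and only used
for demonstration. They're not part of spaCy's default training config.

```python
from typing import List

from spacy import registry
from thinc.api import Config


# Hypothetical helper, registered under a made-up name for illustration
@registry.misc("example_stop_words.v1")
def create_stop_words(extra: List[str]):
    return {"the", "a", *extra}


# A config block that references the registered function by its string name
CONFIG = """
[stop_words]
@misc = "example_stop_words.v1"
extra = ["foo", "bar"]
"""

config = Config().from_str(CONFIG)
# Resolving the config calls the registered function with the given settings
resolved = registry.resolve(config)
stop_words = resolved["stop_words"]
```

Swapping in a different implementation then only requires changing the string
name in the config, not the code that consumes the resolved value.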

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:** [Training pipelines and models](/usage/training)
- **Thinc:** [Thinc's config system](https://thinc.ai/docs/usage-config),
  [`Config`](https://thinc.ai/docs/api-config#config)
- **CLI:** [`init config`](/api/cli#init-config),
  [`init fill-config`](/api/cli#init-fill-config), [`train`](/api/cli#train),
  [`pretrain`](/api/cli#pretrain), [`evaluate`](/api/cli#evaluate)
- **API:** [Config format](/api/data-formats#config),
  [`registry`](/api/top-level#registry)

</Infobox>

### Custom models using any framework {#features-custom-models}

> #### Example
>
> ```python
> from torch import nn
> from thinc.api import PyTorchWrapper
>
> torch_model = nn.Sequential(
>     nn.Linear(32, 32),
>     nn.ReLU(),
>     nn.Softmax(dim=1)
> )
> model = PyTorchWrapper(torch_model)
> ```

spaCy's new configuration system makes it easy to customize the neural network
models used by the different pipeline components. You can also implement your
own architectures via spaCy's machine learning library [Thinc](https://thinc.ai)
that provides various layers and utilities, as well as thin wrappers around
frameworks like **PyTorch**, **TensorFlow** and **MXNet**. Component models all
follow the same unified [`Model`](https://thinc.ai/docs/api-model) API and each
`Model` can also be used as a sublayer of a larger network, allowing you to
freely combine implementations from different frameworks into a single model.
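
The following is a minimal sketch of the unified `Model` API using only Thinc's
own layers, so it runs without PyTorch or TensorFlow installed. The layer sizes
are arbitrary and chosen for illustration. A wrapped model, such as one created
with `PyTorchWrapper`, is also a `Model` and could in principle take the place
of one of the layers passed to `chain`.

```python
import numpy
from thinc.api import Relu, Softmax, chain

# Compose two layers into a single model; each layer is itself a Model
model = chain(Relu(nO=8, nI=4), Softmax(nO=3, nI=8))

# Initialize the parameters with a sample batch, then run a forward pass
X = numpy.zeros((2, 4), dtype="float32")
model.initialize(X=X)
Y = model.predict(X)
```

The output `Y` has one row per input row, and each row contains a probability
distribution over the three output classes.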

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:** [Layers and architectures](/usage/layers-architectures)
- **Thinc:**
  [Wrapping PyTorch, TensorFlow & MXNet](https://thinc.ai/docs/usage-frameworks),
  [`Model` API](https://thinc.ai/docs/api-model)
- **API:** [Model architectures](/api/architectures), [`Pipe`](/api/pipe)

</Infobox>

### Manage end-to-end workflows with projects {#features-projects}

<!-- TODO: update example -->

> #### Example
>
> ```cli
> # Clone a project template
> $ python -m spacy project clone example
> $ cd example
> # Download data assets
> $ python -m spacy project assets
> # Run a workflow
> $ python -m spacy project run train
> ```

spaCy projects let you manage and share **end-to-end spaCy workflows** for
different **use cases and domains**, and orchestrate training, packaging and
serving your custom pipelines. You can start off by cloning a pre-defined
project template, adjust it to fit your needs, load in your data, train a
pipeline, export it as a Python package, upload your outputs to a remote storage
and share your results with your team.

![Illustration of project workflow and commands](../images/projects.svg)

spaCy projects also make it easy to **integrate with other tools** in the data
science and machine learning ecosystem, including [DVC](/usage/projects#dvc) for
data version control, [Prodigy](/usage/projects#prodigy) for creating labelled
data, [Streamlit](/usage/projects#streamlit) for building interactive apps,
[FastAPI](/usage/projects#fastapi) for serving models in production,
[Ray](/usage/projects#ray) for parallel training,
[Weights & Biases](/usage/projects#wandb) for experiment tracking, and more!

<!-- <Project id="some_example_project">

The easiest way to get started with an end-to-end training process is to clone a
[project](/usage/projects) template. Projects let you manage multi-step
workflows, from data preprocessing to training and packaging your pipeline.

</Project>-->

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:** [spaCy projects](/usage/projects),
  [Training pipelines and models](/usage/training)
- **CLI:** [`project`](/api/cli#project), [`train`](/api/cli#train)
- **Templates:** [`projects`](https://github.com/explosion/projects)

</Infobox>

### Parallel and distributed training with Ray {#features-parallel-training}

> #### Example
>
> ```cli
> $ pip install spacy-ray
> # Check that the CLI is registered
> $ python -m spacy ray --help
> # Train a pipeline
> $ python -m spacy ray train config.cfg --n-workers 2
> ```

[Ray](https://ray.io/) is a fast and simple framework for building and running
**distributed applications**. You can use Ray to train spaCy on one or more
remote machines, potentially speeding up your training process. The Ray
integration is powered by a lightweight extension package,
[`spacy-ray`](https://github.com/explosion/spacy-ray), that automatically adds
the [`ray`](/api/cli#ray) command to your spaCy CLI if it's installed in the
same environment. You can then run [`spacy ray train`](/api/cli#ray-train) for
parallel training.

![Illustration of setup](../images/spacy-ray.svg)

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:**
  [Parallel and distributed training](/usage/training#parallel-training),
  [spaCy Projects integration](/usage/projects#ray)
- **CLI:** [`ray`](/api/cli#ray), [`ray train`](/api/cli#ray-train)
- **Implementation:** [`spacy-ray`](https://github.com/explosion/spacy-ray)

</Infobox>

### New built-in pipeline components {#features-pipeline-components}

spaCy v3.0 includes several new trainable and rule-based components that you can
add to your pipeline and customize for your use case:

> #### Example
>
> ```python
> # pip install spacy-lookups-data
> import spacy
>
> nlp = spacy.blank("en")
> nlp.add_pipe("lemmatizer")
> ```

| Name                                            | Description                                                                                                                                                                                                             |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation.                                                                                                                                                                          |
| [`Morphologizer`](/api/morphologizer)           | Trainable component to predict morphological features.                                                                                                                                                                  |
| [`Lemmatizer`](/api/lemmatizer)                 | Standalone component for rule-based and lookup lemmatization.                                                                                                                                                           |
| [`AttributeRuler`](/api/attributeruler)         | Component for setting token attributes using match patterns.                                                                                                                                                            |
| [`Transformer`](/api/transformer)               | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
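
As one example of a rule-based component from the table above, here's a small
sketch that adds an `AttributeRuler` to a blank English pipeline and uses a
match pattern to set the tag and part-of-speech of a token. The pattern and
attribute values are illustrative only.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Match "The Who" and set attributes on the second token of the match
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
attrs = {"TAG": "NNP", "POS": "PROPN"}
ruler.add(patterns=patterns, attrs=attrs, index=1)

doc = nlp("I saw The Who perform.")
who = doc[3]
```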

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:** [Processing pipelines](/usage/processing-pipelines)
- **API:** [Built-in pipeline components](/api#architecture-pipeline)
- **Implementation:** [`spacy/pipeline`](%%GITHUB_SPACY/spacy/pipeline)

</Infobox>

### New and improved pipeline component APIs {#features-components}

> #### Example
>
> ```python
> @Language.component("my_component")
> def my_component(doc):
>     return doc
>
> nlp.add_pipe("my_component")
> nlp.add_pipe("ner", source=other_nlp)
> nlp.analyze_pipes(pretty=True)
> ```

Defining, configuring, reusing, training and analyzing pipeline components is
now easier and more convenient. The `@Language.component` and
`@Language.factory` decorators let you register your component, define its
default configuration and metadata, like the attribute values it assigns and
requires. Any custom component can be included during training, and sourcing
components from existing trained pipelines lets you **mix and match custom
pipelines**. The `nlp.analyze_pipes` method outputs structured information about
the current pipeline and its components, including the attributes they assign,
the scores they compute during training and whether any required attributes
aren't set.
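
To make this more concrete, here's a minimal sketch of a component factory with
a default config and metadata. The component name `long_token_counter`, its
`min_length` setting and the `n_long_tokens` extension attribute are
hypothetical and only serve as an example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute that the component writes to
Doc.set_extension("n_long_tokens", default=0, force=True)


@Language.factory(
    "long_token_counter",
    default_config={"min_length": 5},
    assigns=["doc._.n_long_tokens"],
)
def create_long_token_counter(nlp: Language, name: str, min_length: int):
    def count_long_tokens(doc: Doc) -> Doc:
        doc._.n_long_tokens = sum(len(token.text) >= min_length for token in doc)
        return doc

    return count_long_tokens


nlp = spacy.blank("en")
# Settings passed in via the config override the factory's defaults
nlp.add_pipe("long_token_counter", config={"min_length": 4})
doc = nlp("spaCy makes pipelines configurable")
meta = nlp.get_pipe_meta("long_token_counter")
```

Because the factory is registered under a string name, the same component can
also be referenced from a training config.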

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:** [Custom components](/usage/processing-pipelines#custom-components),
  [Defining components for training](/usage/training#config-components)
- **API:** [`@Language.component`](/api/language#component),
  [`@Language.factory`](/api/language#factory),
  [`Language.add_pipe`](/api/language#add_pipe),
  [`Language.analyze_pipes`](/api/language#analyze_pipes)
- **Implementation:** [`spacy/language.py`](%%GITHUB_SPACY/spacy/language.py)

</Infobox>

### Dependency matching {#features-dep-matcher}

> #### Example
>
> ```python
> from spacy.matcher import DependencyMatcher
>
> matcher = DependencyMatcher(nlp.vocab)
> pattern = [
>     {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
>     {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}}
> ]
> matcher.add("FOUNDED", [pattern])
> ```

The new [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns
within the dependency parse using
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
operators. It follows the same API as the token-based [`Matcher`](/api/matcher).
A pattern added to the dependency matcher consists of a **list of
dictionaries**, with each dictionary describing a **token to match** and its
**relation to an existing token** in the pattern.
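
The sketch below runs the pattern from the example on a hand-constructed `Doc`,
so it doesn't require a trained parser. The sentence, heads and dependency
labels are made up for illustration. In practice, they would be predicted by
the parser in your pipeline.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Manually annotated parse: "Smith" is the subject of "founded"
words = ["Smith", "founded", "a", "company"]
heads = [1, 1, 3, 1]
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

matcher = DependencyMatcher(nlp.vocab)
pattern = [
    # Anchor token: the verb "founded"
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    # A direct dependent of the anchor with the label "nsubj"
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
matcher.add("FOUNDED", [pattern])

matches = matcher(doc)
match_id, token_ids = matches[0]
matched_words = sorted(doc[i].text for i in token_ids)
```

Each match consists of the match ID and a list of token indices, one for each
dictionary in the pattern.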

![Dependency matcher pattern](../images/dep-match-diagram.svg)

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:**
  [Dependency matching](/usage/rule-based-matching#dependencymatcher)
- **API:** [`DependencyMatcher`](/api/dependencymatcher)
- **Implementation:**
  [`spacy/matcher/dependencymatcher.pyx`](%%GITHUB_SPACY/spacy/matcher/dependencymatcher.pyx)

</Infobox>

### Type hints and type-based data validation {#features-types}

> #### Example
>
> ```python
> from spacy.language import Language
> from pydantic import StrictBool
>
> @Language.factory("my_component")
> def create_my_component(
>     nlp: Language,
>     name: str,
>     custom: StrictBool
> ):
>     ...
> ```

spaCy v3.0 officially drops support for Python 2 and now requires **Python
3.6+**. This also means that the code base can take full advantage of
[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing
API that's implemented in pure Python (as opposed to Cython) now comes with type
hints. The new version of spaCy's machine learning library
[Thinc](https://thinc.ai) also features extensive
[type support](https://thinc.ai/docs/usage-type-checking/), including custom
types for models and arrays, and a custom `mypy` plugin that can be used to
type-check model definitions.

For data validation, spaCy v3.0 adopts
[`pydantic`](https://github.com/samuelcolvin/pydantic). It also powers the data
validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which
lets you register **custom functions with typed arguments**, reference them
in your config and see validation errors if the argument values don't match.
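
Here's a small sketch of what type-based validation can look like for a
component factory. The component name `threshold_flagger` and its `threshold`
setting are hypothetical. The example assumes that an invalid setting surfaces
as Thinc's `ConfigValidationError` when the component is added.

```python
import spacy
from spacy.language import Language
from thinc.api import ConfigValidationError


# Hypothetical factory with a typed setting
@Language.factory("threshold_flagger", default_config={"threshold": 0.5})
def create_threshold_flagger(nlp: Language, name: str, threshold: float):
    def flag(doc):
        return doc

    return flag


nlp = spacy.blank("en")
try:
    # "high" can't be interpreted as a float, so validation should fail
    nlp.add_pipe("threshold_flagger", config={"threshold": "high"})
    validation_failed = False
except ConfigValidationError:
    validation_failed = True
```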

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:**
  [Component type hints and validation](/usage/processing-pipelines#type-hints),
  [Training with custom code](/usage/training#custom-code)
- **Thinc:**
  [Type checking in Thinc](https://thinc.ai/docs/usage-type-checking),
  [Thinc's config system](https://thinc.ai/docs/usage-config)

</Infobox>

### New methods, attributes and commands {#new-methods}

The following methods, attributes and commands are new in spaCy v3.0.

| Name                                                                                                                            | Description                                                                                                                                                                                      |
| ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`Token.lex`](/api/token#attributes)                                                                                            | Access a token's [`Lexeme`](/api/lexeme).                                                                                                                                                        |
| [`Token.morph`](/api/token#attributes), [`Token.morph_`](/api/token#attributes)                                                 | Access a token's morphological analysis.                                                                                                                                                         |
| [`Language.select_pipes`](/api/language#select_pipes)                                                                           | Context manager for enabling or disabling specific pipeline components for a block.                                                                                                              |
| [`Language.disable_pipe`](/api/language#disable_pipe), [`Language.enable_pipe`](/api/language#enable_pipe)                      | Disable or enable a loaded pipeline component (but don't remove it).                                                                                                                             |
| [`Language.analyze_pipes`](/api/language#analyze_pipes)                                                                         | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies.                                                                                                          |
| [`Language.resume_training`](/api/language#resume_training)                                                                     | Experimental: continue training a trained pipeline and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting.                              |
| [`@Language.factory`](/api/language#factory), [`@Language.component`](/api/language#component)                                  | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions.                                               |
| [`Language.has_factory`](/api/language#has_factory)                                                                             | Check whether a component factory is registered on a language class.                                                                                                                             |
| [`Language.get_factory_meta`](/api/language#get_factory_meta), [`Language.get_pipe_meta`](/api/language#get_pipe_meta)          | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name.                                                                                       |
| [`Language.config`](/api/language#config)                                                                                       | The [config](/usage/training#config) used to create the current `nlp` object. It's an instance of [`Config`](https://thinc.ai/docs/api-config#config) that can be saved to disk and used for training. |
| [`Language.components`](/api/language#attributes), [`Language.component_names`](/api/language#attributes)                       | All available components and component names, including disabled components that are not run as part of the pipeline.                                                                            |
| [`Language.disabled`](/api/language#attributes)                                                                                 | Names of disabled components that are not run as part of the pipeline.                                                                                                                           |
| [`Pipe.score`](/api/pipe#score)                                                                                                 | Method on pipeline components that returns a dictionary of evaluation scores.                                                                                                                    |
| [`registry`](/api/top-level#registry)                                                                                           | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config).                                                                                  |
| [`util.load_meta`](/api/top-level#util.load_meta), [`util.load_config`](/api/top-level#util.load_config)                        | Updated helpers for loading a pipeline's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config).                                                                     |
| [`util.get_installed_models`](/api/top-level#util.get_installed_models)                                                         | Names of all pipeline packages installed in the environment.                                                                                                                                     |
| [`init config`](/api/cli#init-config), [`init fill-config`](/api/cli#init-fill-config), [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training).                                                                                                   |
| [`project`](/api/cli#project)                                                                                                   | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects).                                                                                                       |
| [`ray`](/api/cli#ray)                                                                                                           | Suite of CLI commands for parallel training with [Ray](https://ray.io/), provided by the [`spacy-ray`](https://github.com/explosion/spacy-ray) extension package.                                |
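
The following sketch shows how some of the new pipeline-related methods and
attributes from the table above fit together, using a blank English pipeline
with two built-in components. The choice of components is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("attribute_ruler")

# Disable a component without removing it from the pipeline
nlp.disable_pipe("attribute_ruler")
active = list(nlp.pipe_names)
all_components = list(nlp.component_names)
disabled = list(nlp.disabled)

# Re-enable it so it runs again
nlp.enable_pipe("attribute_ruler")

# Temporarily run only selected components within a block
with nlp.select_pipes(enable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)
active_after = list(nlp.pipe_names)
```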

### New and updated documentation {#new-docs}

<Grid cols={2} gutterBottom={false}>

<div>

To help you get started with spaCy v3.0 and the new features, we've added
several new or rewritten documentation pages, including a new usage guide on
[embeddings, transformers and transfer learning](/usage/embeddings-transformers),
a guide on [training pipelines and models](/usage/training) rewritten from
scratch, a page explaining the new [spaCy projects](/usage/projects) and updated
usage documentation on
[custom pipeline components](/usage/processing-pipelines#custom-components).
We've also added a bunch of new illustrations and new API reference pages
documenting spaCy's machine learning [model architectures](/api/architectures)
and the expected [data formats](/api/data-formats). API pages about
[pipeline components](/api/#architecture-pipeline) now include more information,
like the default config and implementation, and we've adopted a more detailed
format for documenting argument and return types.

</div>

[![Library architecture](../images/architecture.svg)](/api)

</Grid>

<Infobox title="New or reworked documentation" emoji="📖" list>

- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
  [Training models](/usage/training),
  [Layers & Architectures](/usage/layers-architectures),
  [Projects](/usage/projects),
  [Custom pipeline components](/usage/processing-pipelines#custom-components),
  [Custom tokenizers](/usage/linguistic-features#custom-tokenizer),
  [Morphology](/usage/linguistic-features#morphology),
  [Lemmatization](/usage/linguistic-features#lemmatization),
  [Mapping & Exceptions](/usage/linguistic-features#mappings-exceptions),
  [Dependency matching](/usage/rule-based-matching#dependencymatcher)
- **API Reference:** [Library architecture](/api),
  [Model architectures](/api/architectures), [Data formats](/api/data-formats)
- **New Classes:** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
  [`Transformer`](/api/transformer), [`Lemmatizer`](/api/lemmatizer),
  [`Morphologizer`](/api/morphologizer),
  [`AttributeRuler`](/api/attributeruler),
  [`SentenceRecognizer`](/api/sentencerecognizer),
  [`DependencyMatcher`](/api/dependencymatcher), [`Pipe`](/api/pipe),
  [`Corpus`](/api/corpus)

</Infobox>

## Backwards Incompatibilities {#incompat}

As always, we've tried to keep the breaking changes to a minimum and focus on
changes that were necessary to support the new features, fix problems or improve
usability. The following section lists the relevant changes to the user-facing
API. For specific examples of how to rewrite your code, check out the
[migration guide](#migrating).

<Infobox variant="warning">

Note that spaCy v3.0 now requires **Python 3.6+**.

</Infobox>

### API changes {#incompat-api}

- Pipeline package symlinks, the `link` command and shortcut names are now
  deprecated. There can be many [different trained pipelines](/models) and not
  just one "English model", so you should always use the full package name like
  [`en_core_web_sm`](/models/en) explicitly.
- A pipeline's [`meta.json`](/api/data-formats#meta) is now only used to provide
  meta information like the package name, author, license and labels. It's
  **not** used to construct the processing pipeline anymore. This is all defined
  in the [`config.cfg`](/api/data-formats#config), which also includes all
  settings used to train the pipeline.
- The [`train`](/api/cli#train) and [`pretrain`](/api/cli#pretrain) commands now
  only take a `config.cfg` file containing the full
  [training config](/usage/training#config).
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
  the component factory instead of the component function.
- **Custom pipeline components** now need to be decorated with the
  [`@Language.component`](/api/language#component) or
  [`@Language.factory`](/api/language#factory) decorator.
- [`Language.update`](/api/language#update) now takes a batch of
  [`Example`](/api/example) objects instead of raw texts and annotations, or
  `Doc` and `GoldParse` objects.
- The `Language.disable_pipes` context manager has been replaced by
  [`Language.select_pipes`](/api/language#select_pipes), which can explicitly
  disable or enable components.
- The [`Language.update`](/api/language#update),
  [`Language.evaluate`](/api/language#evaluate) and
  [`Pipe.update`](/api/pipe#update) methods now all take batches of
  [`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
  raw text and a dictionary of annotations.
  [`Language.begin_training`](/api/language#begin_training) and
  [`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
  returns a sequence of `Example` objects to initialize the model instead of a
  list of tuples.
- [`Matcher.add`](/api/matcher#add) and
  [`PhraseMatcher.add`](/api/phrasematcher#add) now only accept a list of
  patterns as the second argument (instead of a variable number of arguments).
  The `on_match` callback becomes an optional keyword argument.
- The `spacy.gold` module has been renamed to
  [`spacy.training`](%%GITHUB_SPACY/spacy/training).
- The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have
  been removed.
- The `TAG_MAP` and `MORPH_RULES` in the language data have been replaced by the
  more flexible [`AttributeRuler`](/api/attributeruler).
- The [`Lemmatizer`](/api/lemmatizer) is now a standalone pipeline component and
  doesn't provide lemmas by default or switch automatically between lookup and
  rule-based lemmas. You can now add it to your pipeline explicitly and set its
  mode on initialization.
- Various keyword arguments across functions and methods are now explicitly
  declared as _keyword-only_ arguments. Those arguments are documented
  accordingly across the API reference.
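
A few of the changes listed above can be sketched as follows. The comments
marked "v2" show the older calling style for comparison. The pattern, text and
entity annotation are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.training import Example

nlp = spacy.blank("en")

# v2: nlp.add_pipe(nlp.create_pipe("sentencizer"))
# v3: pass the string name of the component factory
nlp.add_pipe("sentencizer")

# v2: matcher.add("GREETING", None, pattern)
# v3: a list of patterns as the second argument, on_match as a keyword argument
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
matcher.add("GREETING", [pattern], on_match=None)
doc = nlp("Hello world. Apple is hiring.")
matches = matcher(doc)

# v2: raw text and annotations, or Doc and GoldParse objects
# v3: Example objects holding a predicted and a reference Doc
predicted = nlp.make_doc("Apple is hiring")
example = Example.from_dict(predicted, {"entities": [(0, 5, "ORG")]})
reference_ents = [(ent.text, ent.label_) for ent in example.reference.ents]
```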

### Removed or renamed API {#incompat-removed}

| Removed                                                  | Replacement                                                                                                  |
| -------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| `Language.disable_pipes`                                 | [`Language.select_pipes`](/api/language#select_pipes), [`Language.disable_pipe`](/api/language#disable_pipe) |
| `GoldParse`                                              | [`Example`](/api/example)                                                                                    |
| `GoldCorpus`                                             | [`Corpus`](/api/corpus)                                                                                      |
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump`          | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk)                   |
| `Matcher.pipe`, `PhraseMatcher.pipe`                     | not needed                                                                                                   |
| `spacy init-model`                                       | [`spacy init vocab`](/api/cli#init-vocab)                                                                    |
| `spacy debug-data`                                       | [`spacy debug data`](/api/cli#debug-data)                                                                    |
| `spacy profile`                                          | [`spacy debug profile`](/api/cli#debug-profile)                                                              |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, symlinks are deprecated                                                                          |

The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while** and many would previously
raise errors. Many were also mostly internal. If you've been working with more
recent versions of spaCy v2.x, it's **unlikely** that your code relied on them.

| Removed                                                                                                                 | Replacement                                                                                                                                                |
| ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Doc.tokens_from_list`                                                                                                  | [`Doc.__init__`](/api/doc#init)                                                                                                                            |
| `Doc.merge`, `Span.merge`                                                                                               | [`Doc.retokenize`](/api/doc#retokenize)                                                                                                                    |
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower`                                                               | [`Span.text`](/api/span#attributes), [`Token.text`](/api/token#attributes)                                                                                 |
| `Language.tagger`, `Language.parser`, `Language.entity`                                                                 | [`Language.get_pipe`](/api/language#get_pipe)                                                                                                              |
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes`                                | `exclude=["vocab"]`                                                                                                                                        |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process`                                                                                                                                                |
| `verbose` argument on [`Language.evaluate`](/api/language#evaluate)                                                     | logging (`DEBUG`)                                                                                                                                          |
| `SentenceSegmenter` hook, `SimilarityHook`                                                                              | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |

## Migrating from v2.x {#migrating}

### Downloading and loading trained pipelines {#migrating-downloading-models}

Symlinks and shortcuts like `en` are now officially deprecated. There are
[many different trained pipelines](/models) with different capabilities and not
just one "English model". In order to download and load a package, you should
always use its full name – for instance,
[`en_core_web_sm`](/models/en#en_core_web_sm).

```diff
- python -m spacy download en
+ python -m spacy download en_core_web_sm
```

```diff
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
```

### Custom pipeline components and factories {#migrating-pipeline-components}

Custom pipeline components now have to be registered explicitly using the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator. For simple functions
that take a `Doc` and return it, all you have to do is add the
`@Language.component` decorator to it and assign it a name:

```diff
### Stateless function components
+ from spacy.language import Language

+ @Language.component("my_component")
def my_component(doc):
    return doc
```

For class components that are initialized with settings and/or the shared `nlp`
object, you can use the `@Language.factory` decorator. Also make sure that the
method used to initialize the factory has **two named arguments**: `nlp` (the
current `nlp` object) and `name` (the string name of the component instance).

```diff
### Stateful class components
+ from spacy.language import Language

+ @Language.factory("my_component")
class MyComponent:
-   def __init__(self, nlp):
+   def __init__(self, nlp, name):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

Instead of decorating your class, you could also add a factory function that
takes the arguments `nlp` and `name` and returns an instance of your component:

```diff
### Stateful class components with factory function
+ from spacy.language import Language

+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+     return MyComponent(nlp)

class MyComponent:
    def __init__(self, nlp):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

The `@Language.component` and `@Language.factory` decorators now take care of
adding an entry to the component factories, so spaCy knows how to load a
component back in from its string name. You won't have to write to
`Language.factories` manually anymore.

```diff
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
```
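As a minimal sketch of why the `name` argument matters, the example below registers one factory and adds it twice under different instance names. The `token_counter` factory, its `min_length` setting and the use of `doc.user_data` are hypothetical and only serve as illustration.

```python
import spacy
from spacy.language import Language


# Hypothetical component: counts tokens with at least `min_length` characters
# and stores the result under the name of the component instance.
@Language.factory("token_counter", default_config={"min_length": 1})
class TokenCounter:
    def __init__(self, nlp, name, min_length):
        self.name = name
        self.min_length = min_length

    def __call__(self, doc):
        count = sum(1 for token in doc if len(token.text) >= self.min_length)
        doc.user_data[self.name] = count
        return doc


nlp = spacy.blank("en")
# Same factory, two instances with different names and settings
nlp.add_pipe("token_counter", name="all_tokens")
nlp.add_pipe("token_counter", name="long_tokens", config={"min_length": 5})

doc = nlp("I like London very much")
print(doc.user_data)
```

Because each instance receives its own `name`, the two counters can store their results side by side without overwriting each other.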

#### Adding components to the pipeline {#migrating-add-pipe}

The [`nlp.add_pipe`](/api/language#add_pipe) method now takes the **string
name** of the component factory instead of a callable component. This allows
spaCy to track and serialize components that have been added and their settings.

```diff
+ @Language.component("my_component")
def my_component(doc):
    return doc

- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")
```

[`nlp.add_pipe`](/api/language#add_pipe) now also returns the pipeline component
itself, so you can access its attributes. The
[`nlp.create_pipe`](/api/language#create_pipe) method is now mostly used
internally and you typically shouldn't have to call it in your code.

```diff
- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")
```
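Put together, a short end-to-end sketch could look like this. It uses a blank English pipeline and the built-in `sentencizer`. The `mark_questions` component is a made-up example.

```python
import spacy
from spacy.language import Language


# Hypothetical stateless component, registered under a string name
@Language.component("mark_questions")
def mark_questions(doc):
    doc.user_data["is_question"] = doc.text.strip().endswith("?")
    return doc


nlp = spacy.blank("en")
# add_pipe takes the string name and returns the component it created
sentencizer = nlp.add_pipe("sentencizer")
nlp.add_pipe("mark_questions")

doc = nlp("Who is Shaka Khan? I like London.")
print(nlp.pipe_names)
print([sent.text for sent in doc.sents])
```

The object returned by `add_pipe` is the same one you get from `nlp.get_pipe("sentencizer")`, so you can inspect or adjust it right after adding it.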

If you need to add a component from an existing trained pipeline, you can now
use the `source` argument on [`nlp.add_pipe`](/api/language#add_pipe). This will
check that the component is compatible, and take care of porting over all
config. During training, you can also reference existing trained components in
your [config](/usage/training#config-components) and decide whether or not they
should be updated with more data.

> #### config.cfg (excerpt)
>
> ```ini
> [components.ner]
> source = "en_core_web_sm"
> component = "ner"
> ```

```diff
source_nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
- ner = source_nlp.get_pipe("ner")
- nlp.add_pipe(ner)
+ nlp.add_pipe("ner", source=source_nlp)
```
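If you want to try the `source` argument without downloading a trained pipeline, the following sketch uses two blank pipelines instead. The source pipeline here stands in for something like `en_core_web_sm`, and the custom `punct_chars` setting is only there to show that the sourced component keeps its settings.

```python
import spacy

# Stand-in for a trained pipeline: a blank pipeline with a customized component
source_nlp = spacy.blank("en")
source_nlp.add_pipe("sentencizer", config={"punct_chars": ["!"]})

# Copy the component over by name instead of passing the object itself
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer", source=source_nlp)

doc = nlp("Hello! This is a test. Still the same sentence")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```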

### Adding match patterns {#migrating-matcher}

The [`Matcher.add`](/api/matcher#add),
[`PhraseMatcher.add`](/api/phrasematcher#add) and
[`DependencyMatcher.add`](/api/dependencymatcher#add) methods now only accept a
**list of patterns** as the second argument (instead of a variable number of
arguments). The `on_match` callback becomes an optional keyword argument.

```diff
matcher = Matcher(nlp.vocab)
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
- matcher.add("GoogleNow", on_match, *patterns)
+ matcher.add("GoogleNow", patterns, on_match=on_match)
```

```diff
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp("health care reform"), nlp("healthcare reform")]
- matcher.add("HEALTH", on_match, *patterns)
+ matcher.add("HEALTH", patterns, on_match=on_match)
```
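For reference, here is a self-contained sketch of the new call signature with a callback, run on a blank English pipeline. The example sentence and the `matched_texts` list are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matched_texts = []


def on_match(matcher, doc, i, matches):
    # Called once per match; matches[i] is the current (match_id, start, end)
    match_id, start, end = matches[i]
    matched_texts.append(doc[start:end].text)


patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
# Patterns are passed as one list, the callback as a keyword argument
matcher.add("GoogleNow", patterns, on_match=on_match)

doc = nlp("I tried Google Now and GoogleNow today")
matches = matcher(doc)
print(matched_texts)
```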

### Migrating attributes in tokenizer exceptions {#migrating-tokenizer-exceptions}

Tokenizer exceptions are now only allowed to set `ORTH` and `NORM` values as
part of the token attributes. Exceptions for other attributes such as `TAG` and
`LEMMA` should be moved to an [`AttributeRuler`](/api/attributeruler) component:

```diff
nlp = spacy.blank("en")
- nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't", "LEMMA": "not"}])
+ nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't"}])
+ ruler = nlp.add_pipe("attribute_ruler")
+ ruler.add(patterns=[[{"ORTH": "n't"}]], attrs={"LEMMA": "not"})
```
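The following sketch runs the migrated version end to end on a blank English pipeline and checks the resulting lemma. The example sentence is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
# The tokenizer exception only splits the string and sets ORTH
nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't"}])
# The lemma is assigned by the attribute ruler instead
ruler = nlp.add_pipe("attribute_ruler")
ruler.add(patterns=[[{"ORTH": "n't"}]], attrs={"LEMMA": "not"})

doc = nlp("I don't know")
print([(token.text, token.lemma_) for token in doc])
```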

### Migrating tag maps and morph rules {#migrating-training-mappings-exceptions}

Instead of defining a `tag_map` and `morph_rules` in the language data, spaCy
v3.0 now manages mappings and exceptions with a separate and more flexible
pipeline component, the [`AttributeRuler`](/api/attributeruler). See the
[usage guide](/usage/linguistic-features#mappings-exceptions) for examples. The
`AttributeRuler` provides two handy helper methods
[`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
[`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules) that let
you load in your existing tag map or morph rules:

```diff
nlp = spacy.blank("en")
- nlp.vocab.morphology.load_tag_map(YOUR_TAG_MAP)
+ ruler = nlp.add_pipe("attribute_ruler")
+ ruler.load_from_tag_map(YOUR_TAG_MAP)
```
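As a rough illustration, the sketch below loads a tiny tag map and applies the ruler to a `Doc` with manually assigned fine-grained tags. The two-entry `TAG_MAP` and the example words are made up. In a real pipeline, the tags would typically come from a tagger that runs before the attribute ruler.

```python
import spacy
from spacy.tokens import Doc

# Hypothetical, heavily shortened tag map
TAG_MAP = {"NN": {"POS": "NOUN"}, "VBZ": {"POS": "VERB"}}

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
ruler.load_from_tag_map(TAG_MAP)

# Build a Doc with fine-grained tags and let the ruler map them to POS
doc = Doc(nlp.vocab, words=["spaCy", "rocks"], tags=["NN", "VBZ"])
doc = ruler(doc)
pos_tags = [token.pos_ for token in doc]
print(pos_tags)
```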

### Training pipelines and models {#migrating-training}

To train your pipelines, you should now pretty much always use the
[`spacy train`](/api/cli#train) CLI. You shouldn't have to put together your own
training scripts anymore, unless you _really_ want to. The training commands now
use a [flexible config file](/usage/training#config) that describes all training
settings and hyperparameters, as well as your pipeline, components and
architectures to use. The `--code` argument lets you pass in code containing
[custom registered functions](/usage/training#custom-code) that you can
reference in your config. To get started, check out the
[quickstart widget](/usage/training#quickstart).

#### Binary .spacy training data format {#migrating-training-format}

spaCy v3.0 uses a new
[binary training data format](/api/data-formats#binary-training) created by
serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc`
objects. This means that you can train spaCy pipelines using the same format it
outputs: annotated `Doc` objects. The binary format is extremely **efficient in
storage**, especially when packing multiple documents together. You can convert
your existing JSON-formatted data using the [`spacy convert`](/api/cli#convert)
command, which outputs `.spacy` files:

```cli
$ python -m spacy convert ./training.json ./output
```
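You can also create the binary data directly from Python. The sketch below, which assumes a blank English pipeline and hand-written entity spans, serializes one annotated `Doc` with a `DocBin` and reads it back. It round-trips through bytes to avoid touching the file system; `DocBin.to_disk` writes the same data to a `.spacy` file.

```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
# Annotate the Doc by hand using token-based spans
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 6, 7, label="ORG")]

# Pack the annotated Doc and serialize it
doc_bin = DocBin(docs=[doc])
data = doc_bin.to_bytes()

# Load it back, as a training corpus reader would
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
ents = [(ent.text, ent.label_) for ent in restored[0].ents]
print(ents)
```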

#### Training config {#migrating-training-config}

The easiest way to get started with a training config is to use the
[`init config`](/api/cli#init-config) command or the
[quickstart widget](/usage/training#quickstart). You can define your
requirements, and it will auto-generate a starter config with the best-matching
default settings.

```cli
$ python -m spacy init config ./config.cfg --lang en --pipeline tagger,parser
```

If you've exported a starter config from our
[quickstart widget](/usage/training#quickstart), you can use the
[`init fill-config`](/api/cli#init-fill-config) command to fill it with all
default values. You can then use the auto-generated `config.cfg` for training:

```diff
- python -m spacy train en ./output ./train.json ./dev.json
-   --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./config.cfg --output ./output
```
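To get a feel for how a config file is structured, you can parse one in Python with Thinc's `Config` class. This is only a sketch: the config text below is a made-up excerpt, not a complete training config, and parsing it does not resolve the registered function.

```python
from thinc.api import Config

# Hypothetical excerpt of a training config
CONFIG_TEXT = """
[training]
dropout = 0.1
accumulate_gradient = 3

[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001
"""

config = Config().from_str(CONFIG_TEXT)
# Sections become nested dictionaries and values are parsed into Python types
print(config["training"]["dropout"])
print(config["training"]["optimizer"]["@optimizers"])
```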

<!-- TODO: project template -->

#### Training via the Python API {#migrating-training-python}

For most use cases, you **shouldn't** have to write your own training scripts
anymore. Instead, you can use [`spacy train`](/api/cli#train) with a
[config file](/usage/training#config) and custom
[registered functions](/usage/training#custom-code) if needed. You can even
register callbacks that can modify the `nlp` object at different stages of its
lifecycle to fully customize it before training.

If you do decide to use the [internal training API](/usage/training#api) from
Python, you should only need a few small modifications to convert your scripts
from spaCy v2.x to v3.x. The [`Example.from_dict`](/api/example#from_dict)
classmethod takes a `Doc` (typically created with `nlp.make_doc`) and a
[dictionary of annotations](/api/data-formats#dict-input) describing the gold
standard, similar to the "simple training style" in spaCy v2.x:

```diff
### Migrating Doc and GoldParse
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
entities = [(0, 15, "PERSON"), (30, 38, "ORG")]
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})
```

```diff
### Migrating simple training style
text = "Mark Zuckerberg is the CEO of Facebook"
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
+ doc = nlp.make_doc(text)
+ example = Example.from_dict(doc, annotations)
```
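To see what an `Example` holds, the following sketch builds one on a blank English pipeline and inspects both sides: the `predicted` document, which is the `Doc` you passed in, and the `reference` document, which carries the annotations from the dictionary.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
example = Example.from_dict(doc, annotations)

# The reference Doc holds the gold-standard entities
ref_ents = [(ent.text, ent.label_) for ent in example.reference.ents]
# The predicted Doc is still unannotated at this point
pred_ents = list(example.predicted.ents)
print(ref_ents, pred_ents)
```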

The [`Language.update`](/api/language#update),
[`Language.evaluate`](/api/language#evaluate) and
[`Pipe.update`](/api/pipe#update) methods now all take batches of
[`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
raw text and a dictionary of annotations.

```python
### Training loop {highlight="15"}
import random
from spacy.training import Example
from spacy.util import minibatch

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]
nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for batch in minibatch(TRAIN_DATA, size=8):
        examples = []
        for text, annots in batch:
            examples.append(Example.from_dict(nlp.make_doc(text), annots))
        nlp.update(examples)
```

[`Language.begin_training`](/api/language#begin_training) and
[`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
returns a sequence of `Example` objects to initialize the model instead of a
list of tuples. The data examples are used to **initialize the models** of
trainable pipeline components, which includes validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme.

```diff
- nlp.begin_training(examples)
+ nlp.begin_training(lambda: examples)
```

#### Packaging trained pipelines {#migrating-training-packaging}

The [`spacy package`](/api/cli#package) command now automatically builds the
installable `.tar.gz` sdist of the Python package, so you don't have to run this
step manually anymore. You can disable the behavior by setting the `--no-sdist`
flag.

```diff
python -m spacy package ./output ./packages
- cd ./packages/en_pipeline-0.0.0
- python setup.py sdist
```

#### Data utilities and gold module {#migrating-gold}

The `spacy.gold` module has been renamed to `spacy.training`. This mostly
affects internals, but if you've been using the span offset conversion utilities
[`biluo_tags_from_offsets`](/api/top-level#biluo_tags_from_offsets),
[`offsets_from_biluo_tags`](/api/top-level#offsets_from_biluo_tags) or
[`spans_from_biluo_tags`](/api/top-level#spans_from_biluo_tags), you'll have to
change your imports:

```diff
- from spacy.gold import biluo_tags_from_offsets, spans_from_biluo_tags
+ from spacy.training import biluo_tags_from_offsets, spans_from_biluo_tags
```

#### Migration notes for plugin maintainers {#migrating-plugins}

Thanks to everyone who's been contributing to the spaCy ecosystem by developing
and maintaining one of the many awesome [plugins and extensions](/universe).
We've tried to make it as easy as possible for you to upgrade your packages for
spaCy v3. The most common use case for plugins is providing pipeline components
and extension attributes. When migrating your plugin, double-check the
following:

- Use the [`@Language.factory`](/api/language#factory) decorator to register
  your component and assign it a name. This allows users to refer to your
  components by name and serialize pipelines referencing them. Remove any code
  that writes to `Language.factories` manually.
- Make sure your component factories take at least two **named arguments**:
  `nlp` (the current `nlp` object) and `name` (the instance name of the added
  component so you can identify multiple instances of the same component).
- Update all references to [`nlp.add_pipe`](/api/language#add_pipe) in your docs
  to use **string names** instead of the component functions.

```python
### {highlight="1-5"}
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
    return MyCoolComponent(some_setting=some_setting)


class MyCoolComponent:
    def __init__(self, some_setting):
        self.some_setting = some_setting

    def __call__(self, doc):
        # Do something to the doc
        return doc
```

> #### Result in config.cfg
>
> ```ini
> [components.my_component]
> factory = "my_component"
> some_setting = true
> ```

```diff
import spacy
from your_plugin import MyCoolComponent

nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})
```

<Infobox title="Important note on registering factories" variant="warning">

The [`@Language.factory`](/api/language#factory) decorator takes care of letting
spaCy know that a component of that name is available. This means that your
users can add it to the pipeline using its **string name**. However, this
requires the decorator to be executed – so users will still have to **import
your plugin**. Alternatively, your plugin could expose an
[entry point](/usage/saving-loading#entry-points), which spaCy can read from.
This means that spaCy knows how to initialize `my_component`, even if your
package isn't imported.

</Infobox>
-												Add new in v3.0

											
										
										
											2020-07-01 14:02:17 +03:00
+								---
 								title: What's New in v3.0
 								teaser: New features, backwards incompatibilities and migration guide
 								menu:
 								  - ['Summary', 'summary']
 								  - ['New Features', 'features']
 								  - ['Backwards Incompatibilities', 'incompat']
 								  - ['Migrating from v2.x', 'migrating']
 								---
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								## Summary {#summary hidden="true"}
-												Add new in v3.0

											
										
										
											2020-07-01 14:02:17 +03:00
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								<Grid cols={2} gutterBottom={false}>
-												Update docs [ci skip]

											
										
										
											2020-08-21 14:22:59 +03:00
 								<div>
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								spaCy v3.0 features all new **transformer-based pipelines** that bring spaCy's
 								accuracy right up to the current **state-of-the-art**. You can use any
 								pretrained transformer to train your own pipelines, and even share one
 								transformer between multiple components with **multi-task learning**. Training
 								is now fully configurable and extensible, and you can define your own custom
 								models using **PyTorch**, **TensorFlow** and other frameworks. The new spaCy
 								projects system lets you describe whole **end-to-end workflows** in a single
 								file, giving you an easy path from prototype to production, and making it easy
 								to clone and adapt best-practice projects for your own use cases.
-												Update docs [ci skip]

											
										
										
											2020-08-21 14:22:59 +03:00
+								</div>
 								<Infobox title="Table of Contents" id="toc">
 								- [Summary](#summary)
 								- [New features](#features)
 								- [Transformer-based pipelines](#features-transformers)
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								- [Training & config system](#features-training)
-												Update docs [ci skip]

											
										
										
											2020-08-21 14:22:59 +03:00
+								- [Custom models](#features-custom-models)
 								- [End-to-end project workflows](#features-projects)
-												Update docs [ci skip]

											
										
										
											2020-09-13 23:30:33 +03:00
+								- [Parallel training with Ray](#features-parallel-training)
-												Update docs [ci skip]

											
										
										
											2020-08-21 14:22:59 +03:00
+								- [New built-in components](#features-pipeline-components)
 								- [New custom component API](#features-components)
-												WIP: update docs [ci skip]

											
										
										
											2020-09-04 17:30:31 +03:00
+								- [Dependency matching](#features-dep-matcher)
-												Update docs [ci skip]

											
										
										
											2020-08-21 14:22:59 +03:00
+								- [Python type hints](#features-types)
 								- [New methods & attributes](#new-methods)
 								- [New & updated documentation](#new-docs)
 								- [Backwards incompatibilities](#incompat)
 								- [Migrating from spaCy v2.x](#migrating)
 								</Infobox>
 								</Grid>
-												Add new in v3.0

											
										
										
											2020-07-01 14:02:17 +03:00
+								## New Features {#features}
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								This section contains an overview of the most important **new features and
 								improvements**. The [API docs](/api) include additional deprecation notes. New
 								methods and functions that were introduced in this version are marked with the
 								tag <Tag variant="new">3</Tag>.
-												Update docs [ci skip]

											
										
										
											2020-08-11 21:57:23 +03:00
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								### Transformer-based pipelines {#features-transformers}
-												Update docs [ci skip]

											
										
										
											2020-08-11 21:57:23 +03:00
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								> #### Example
 								>
 								> ```cli
 								> $ python -m spacy download en_core_web_trf
 								> ```
-												Update docs [ci skip]

											
										
										
											2020-08-11 21:57:23 +03:00
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								spaCy v3.0 features all new transformer-based pipelines that bring spaCy's
 								accuracy right up to the current **state-of-the-art**. You can use any
 								pretrained transformer to train your own pipelines, and even share one
 								transformer between multiple components with **multi-task learning**. spaCy's
 								transformer support interoperates with [PyTorch](https://pytorch.org) and the
 								[HuggingFace `transformers`](https://huggingface.co/transformers/) library,
 								giving you access to thousands of pretrained models for your pipelines.
-												Update docs [ci skip]

											
										
										
											2020-08-10 01:01:38 +03:00
-												Update docs [ci skip]

											
										
										
											2020-08-21 14:22:59 +03:00
+								![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg)
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								import Benchmarks from 'usage/\_benchmarks-models.md'
 								<Benchmarks />
-												Update docs [ci skip]

											
										
										
											2020-08-11 21:57:23 +03:00
+								<Infobox title="Details & Documentation" emoji="📖" list>
-												Update docs [ci skip]

											
										
										
											2020-08-18 01:49:19 +03:00
+								- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								  [Training pipelines and models](/usage/training),
 								  [Benchmarks](/usage/facts-figures#benchmarks)
-												Update docs [ci skip]

											
										
										
											2020-08-11 21:57:23 +03:00
+								- **API:** [`Transformer`](/api/transformer),
 								  [`TransformerData`](/api/transformer#transformerdata),
 								  [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
 								- **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
-												Rename Transformer listener (#6001)

* rename to spacy-transformers.TransformerListener

* add some more tok2vec tests

* use select_pipes

* fix docs - annotation setter was not changed in the end
											
										
										
											2020-08-31 13:41:39 +03:00
+								  [TransformerListener](/api/architectures#TransformerListener),
-												Update docs [ci skip]

											
										
										
											2020-08-11 21:57:23 +03:00
+								  [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								- **Trained Pipelines:** [`en_core_web_trf`](/models/en#en_core_web_trf)
-												Update docs [ci skip]

											
										
										
											2020-08-11 21:57:23 +03:00
+								- **Implementation:**
 								  [`spacy-transformers`](https://github.com/explosion/spacy-transformers)
 								</Infobox>
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								### New training workflow and config system {#features-training}
 								> #### Example
 								>
 								> ```ini
 								> [training]
 								> vectors = null
 								> accumulate_gradient = 3
 								>
 								> [training.optimizer]
 								> @optimizers = "Adam.v1"
 								>
 								> [training.optimizer.learn_rate]
 								> @schedules = "warmup_linear.v1"
 								> warmup_steps = 250
 								> total_steps = 20000
 								> initial_rate = 0.01
 								> ```
 								spaCy v3.0 introduces a comprehensive and extensible system for **configuring
 								your training runs**. A single configuration file describes every detail of your
 								training run, with no hidden defaults, making it easy to rerun your experiments
 								and track changes. You can use the
 								[quickstart widget](/usage/training#quickstart) or the `init config` command to
 								get started. Instead of providing lots of arguments on the command line, you
 								only need to pass your `config.cfg` file to `spacy train`.
 								Training config files include all **settings and hyperparameters** for training
 								your pipeline. Some settings can also be registered **functions** that you can
 								swap out and customize, making it easy to implement your own custom models and
 								architectures.
 								<Infobox title="Details & Documentation" emoji="📖" list>
 								- **Usage:** [Training pipelines and models](/usage/training)
 								- **Thinc:** [Thinc's config system](https://thinc.ai/docs/usage-config),
 								  [`Config`](https://thinc.ai/docs/api-config#config)
 								- **CLI:** [`init config`](/api/cli#init-config),
 								  [`init fill-config`](/api/cli#init-fill-config), [`train`](/api/cli#train),
 								  [`pretrain`](/api/cli#pretrain), [`evaluate`](/api/cli#evaluate)
 								- **API:** [Config format](/api/data-formats#config),
 								  [`registry`](/api/top-level#registry)
 								</Infobox>
-												fix typo

											
										
										
											2020-08-19 17:54:16 +03:00
+								### Custom models using any framework {#features-custom-models}
-												Update docs [ci skip]

											
										
										
											2020-08-10 01:01:38 +03:00
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								> #### Example
 								>
 								> ```python
 								> from torch import nn
 								> from thinc.api import PyTorchWrapper
 								>
 								> torch_model = nn.Sequential(
 								>     nn.Linear(32, 32),
 								>     nn.ReLU(),
 								>     nn.Softmax(dim=1)
 								> )
 								> model = PyTorchWrapper(torch_model)
 								> ```
 								spaCy's new configuration system makes it easy to customize the neural network
 								models used by the different pipeline components. You can also implement your
 								own architectures via spaCy's machine learning library [Thinc](https://thinc.ai)
 								that provides various layers and utilities, as well as thin wrappers around
 								frameworks like **PyTorch**, **TensorFlow** and **MXNet**. Component models all
 								follow the same unified [`Model`](https://thinc.ai/docs/api-model) API and each
 								`Model` can also be used as a sublayer of a larger network, allowing you to
 								freely combine implementations from different frameworks into a single model.
-												Update docs [ci skip]

											
										
										
											2020-08-21 14:22:59 +03:00
+								<Infobox title="Details & Documentation" emoji="📖" list>
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								- **Usage: ** [Layers and architectures](/usage/layers-architectures)
-												Update docs [ci skip]

											
										
										
											2020-08-21 14:22:59 +03:00
+								- **Thinc: **
-												Update docs [ci skip]

											
										
										
											2020-09-12 18:05:10 +03:00
+								  [Wrapping PyTorch, TensorFlow & MXNet](https://thinc.ai/docs/usage-frameworks),
 								  [`Model` API](https://thinc.ai/docs/api-model)
-												Update docs [ci skip]

											
										
										
											2020-08-21 14:22:59 +03:00
+								- **API:** [Model architectures](/api/architectures), [`Pipe`](/api/pipe)
 								</Infobox>
-												Update docs [ci skip]

											
										
										
											2020-08-10 01:01:38 +03:00
+								### Manage end-to-end workflows with projects {#features-projects}
-												Update docs [ci skip]

											
										
										
											2020-08-21 14:22:59 +03:00
+								<!-- TODO: update example -->
 								> #### Example
 								>
 								> ```cli
 								> # Clone a project template
 								> $ python -m spacy project clone example
 								> $ cd example
 								> # Download data assets
 								> $ python -m spacy project assets
 								> # Run a workflow
 								> $ python -m spacy project run train
 								> ```

spaCy projects let you manage and share **end-to-end spaCy workflows** for
different **use cases and domains**, and orchestrate training, packaging and
serving your custom pipelines. You can start off by cloning a pre-defined
project template, adjust it to fit your needs, load in your data, train a
pipeline, export it as a Python package, upload your outputs to remote storage
and share your results with your team.

 								![Illustration of project workflow and commands](../images/projects.svg)

spaCy projects also make it easy to **integrate with other tools** in the data
science and machine learning ecosystem, including [DVC](/usage/projects#dvc) for
data version control, [Prodigy](/usage/projects#prodigy) for creating labelled
data, [Streamlit](/usage/projects#streamlit) for building interactive apps,
[FastAPI](/usage/projects#fastapi) for serving models in production,
[Ray](/usage/projects#ray) for parallel training,
[Weights & Biases](/usage/projects#wandb) for experiment tracking, and more!

<!-- <Project id="some_example_project">

The easiest way to get started with an end-to-end training process is to clone a
[project](/usage/projects) template. Projects let you manage multi-step
workflows, from data preprocessing to training and packaging your pipeline.

</Project>-->

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:** [spaCy projects](/usage/projects),
  [Training pipelines and models](/usage/training)
- **CLI:** [`project`](/api/cli#project), [`train`](/api/cli#train)
- **Templates:** [`projects`](https://github.com/explosion/projects)

</Infobox>

### Parallel and distributed training with Ray {#features-parallel-training}

 								> #### Example
 								>
 								> ```cli
 								> $ pip install spacy-ray
 								> # Check that the CLI is registered
 								> $ python -m spacy ray --help
 								> # Train a pipeline
 								> $ python -m spacy ray train config.cfg --n-workers 2
 								> ```
 								[Ray](https://ray.io/) is a fast and simple framework for building and running
 								**distributed applications**. You can use Ray to train spaCy on one or more
 								remote machines, potentially speeding up your training process. The Ray
 								integration is powered by a lightweight extension package,
 								[`spacy-ray`](https://github.com/explosion/spacy-ray), that automatically adds
 								the [`ray`](/api/cli#ray) command to your spaCy CLI if it's installed in the
 								same environment. You can then run [`spacy ray train`](/api/cli#ray-train) for
 								parallel training.
 								![Illustration of setup](../images/spacy-ray.svg)
 								<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:**
  [Parallel and distributed training](/usage/training#parallel-training),
  [spaCy Projects integration](/usage/projects#ray)
- **CLI:** [`ray`](/api/cli#ray), [`ray train`](/api/cli#ray-train)
- **Implementation:** [`spacy-ray`](https://github.com/explosion/spacy-ray)

</Infobox>

### New built-in pipeline components {#features-pipeline-components}

spaCy v3.0 includes several new trainable and rule-based components that you can
add to your pipeline and customize for your use case:

> #### Example
>
> ```python
> # pip install spacy-lookups-data
> import spacy
>
> nlp = spacy.blank("en")
> nlp.add_pipe("lemmatizer")
> ```

| Name | Description |
| --- | --- |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. |
| [`Morphologizer`](/api/morphologizer) | Trainable component to predict morphological features. |
| [`Lemmatizer`](/api/lemmatizer) | Standalone component for rule-based and lookup lemmatization. |
| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:** [Processing pipelines](/usage/processing-pipelines)
- **API:** [Built-in pipeline components](/api#architecture-pipeline)
- **Implementation:** [`spacy/pipeline`](%%GITHUB_SPACY/spacy/pipeline)

</Infobox>
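
As a rough illustration of adding one of these components by its string name,
the sketch below puts an `AttributeRuler` into a blank English pipeline and
gives it a single rule. The rule itself (tagging "who" as a pronoun) and the
example sentence are made up for illustration and are not taken from spaCy's
trained pipelines.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

# Built-in components are added via their registered string name
ruler = nlp.add_pipe("attribute_ruler")

# Hypothetical rule for illustration: set TAG and POS on the token "who"
ruler.add(patterns=[[{"LOWER": "who"}]], attrs={"TAG": "WP", "POS": "PRON"})

doc = nlp("Who founded the company?")
print(nlp.pipe_names)
print(doc[0].text, doc[0].tag_, doc[0].pos_)
```

Because the ruler is rule-based, it should not need any training data here. The
trainable components in the table would have to be trained or sourced from an
existing pipeline before they can make predictions.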

### New and improved pipeline component APIs {#features-components}

> #### Example
 								>
 								> ```python
 								> @Language.component("my_component")
 								> def my_component(doc):
 								>     return doc
 								>
 								> nlp.add_pipe("my_component")
 								> nlp.add_pipe("ner", source=other_nlp)
 								> nlp.analyze_pipes(pretty=True)
 								> ```

Defining, configuring, reusing, training and analyzing pipeline components is
now easier and more convenient. The `@Language.component` and
`@Language.factory` decorators let you register your component and define its
default configuration and metadata, like the attribute values it assigns and
requires. Any custom component can be included during training, and sourcing
components from existing trained pipelines lets you **mix and match custom
pipelines**. The `nlp.analyze_pipes` method outputs structured information about
the current pipeline and its components, including the attributes they assign,
the scores they compute during training and whether any required attributes
aren't set.

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:** [Custom components](/usage/processing-pipelines#custom-components),
  [Defining components for training](/usage/training#config-components)
- **API:** [`@Language.component`](/api/language#component),
  [`@Language.factory`](/api/language#factory),
  [`Language.add_pipe`](/api/language#add_pipe),
  [`Language.analyze_pipes`](/api/language#analyze_pipes)
- **Implementation:** [`spacy/language.py`](%%GITHUB_SPACY/spacy/language.py)

</Infobox>
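
The sidebar example covers the stateless `@Language.component` case. For a
component that carries settings, a minimal sketch with `@Language.factory` and
a default config could look like the following. The `SuffixFlagger` class, the
`suffix_flagger` name and the `suffix` setting are hypothetical and only serve
the illustration.

```python
import spacy
from spacy.language import Language


class SuffixFlagger:
    """Hypothetical stateful component: records tokens ending in a suffix."""

    def __init__(self, suffix: str):
        self.suffix = suffix

    def __call__(self, doc):
        doc.user_data["flagged"] = [
            token.text for token in doc if token.text.endswith(self.suffix)
        ]
        return doc


# Register the factory under a string name, with a default config value
@Language.factory("suffix_flagger", default_config={"suffix": "ing"})
def create_suffix_flagger(nlp: Language, name: str, suffix: str):
    return SuffixFlagger(suffix)


nlp = spacy.blank("en")
# Components are added by name; the config overrides the registered default
nlp.add_pipe("suffix_flagger", config={"suffix": "ed"})

doc = nlp("She founded and funded a growing company")
print(doc.user_data["flagged"])

# Structured overview of the pipeline, keyed by component name
analysis = nlp.analyze_pipes()
print(analysis["summary"])
```

Registering a factory rather than passing in a function object means the
component and its settings can be written to the pipeline's config and
recreated from it later.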

### Dependency matching {#features-dep-matcher}

> #### Example
>
> ```python
> from spacy.matcher import DependencyMatcher
>
> matcher = DependencyMatcher(nlp.vocab)
> pattern = [
>     {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
>     {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}}
> ]
> matcher.add("FOUNDED", [pattern])
> ```

The new [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns
within the dependency parse using
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
operators. It follows the same API as the token-based [`Matcher`](/api/matcher).
A pattern added to the dependency matcher consists of a **list of
dictionaries**, with each dictionary describing a **token to match** and its
**relation to an existing token** in the pattern.

![Dependency matcher pattern](../images/dep-match-diagram.svg)

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:**
  [Dependency matching](/usage/rule-based-matching#dependencymatcher)
- **API:** [`DependencyMatcher`](/api/dependencymatcher)
- **Implementation:**
  [`spacy/matcher/dependencymatcher.pyx`](%%GITHUB_SPACY/spacy/matcher/dependencymatcher.pyx)

</Infobox>
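
To see what the matcher returns, the following sketch runs the pattern from the
example against a tiny hand-annotated parse, so no trained pipeline is needed.
The sentence, heads and dependency labels are illustrative assumptions supplied
by hand; in practice they would come from a parser.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-annotated parse (illustrative data, not parser output)
words = ["Smith", "founded", "a", "company"]
heads = [1, 1, 3, 1]  # index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

matcher = DependencyMatcher(doc.vocab)
pattern = [
    # Anchor token: the verb "founded"
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    # A direct dependent of the anchor with the dependency label "nsubj"
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
matcher.add("FOUNDED", [pattern])

matches = matcher(doc)
for match_id, token_ids in matches:
    # token_ids are listed in the order of the pattern dictionaries
    anchor, subject = (doc[i] for i in token_ids)
    print(doc.vocab.strings[match_id], anchor.text, subject.text)
```

Each match pairs the match ID with one token index per pattern dictionary,
rather than the start and end offsets that the token-based `Matcher` yields.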

### Type hints and type-based data validation {#features-types}

> #### Example
 								>
> ```python
> from spacy.language import Language
> from pydantic import StrictBool
>
> @Language.factory("my_component")
> def create_my_component(
>     nlp: Language,
>     name: str,
>     custom: StrictBool
> ):
>     ...
> ```

spaCy v3.0 officially drops support for Python 2 and now requires **Python
3.6+**. This also means that the code base can take full advantage of
[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing
API that's implemented in pure Python (as opposed to Cython) now comes with type
hints. The new version of spaCy's machine learning library
[Thinc](https://thinc.ai) also features extensive
[type support](https://thinc.ai/docs/usage-type-checking/), including custom
types for models and arrays, and a custom `mypy` plugin that can be used to
type-check model definitions.

For data validation, spaCy v3.0 adopts
[`pydantic`](https://github.com/samuelcolvin/pydantic). It also powers the data
validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which
lets you register **custom functions with typed arguments**, reference them
in your config and see validation errors if the argument values don't match.

<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:**
  [Component type hints and validation](/usage/processing-pipelines#type-hints),
  [Training with custom code](/usage/training#custom-code)
- **Thinc:**
  [Type checking in Thinc](https://thinc.ai/docs/usage-type-checking),
  [Thinc's config system](https://thinc.ai/docs/usage-config)

</Infobox>
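
The following sketch shows roughly how type-based validation can surface when a
component is added with a badly typed setting. The `token_limiter` factory and
its `max_tokens` argument are hypothetical, and the example uses a plain `int`
hint instead of `StrictBool`. It also assumes that the validation error is
raised as Thinc's `ConfigValidationError`.

```python
import spacy
from spacy.language import Language
from thinc.api import ConfigValidationError


# Hypothetical factory: the type hints describe the expected config values
@Language.factory("token_limiter", default_config={"max_tokens": 10})
def create_token_limiter(nlp: Language, name: str, max_tokens: int):
    def token_limiter(doc):
        # Record whether the document exceeds the configured limit
        doc.user_data["too_long"] = len(doc) > max_tokens
        return doc

    return token_limiter


nlp = spacy.blank("en")

try:
    # "many" cannot be interpreted as an integer, so validation should fail
    nlp.add_pipe("token_limiter", config={"max_tokens": "many"})
    validation_failed = False
except ConfigValidationError as err:
    validation_failed = True
    print(err)

# A correctly typed value is accepted
nlp.add_pipe("token_limiter", config={"max_tokens": 3})
doc = nlp("This sentence has more than three tokens")
print(doc.user_data["too_long"])
```

Strict types such as `StrictBool` in the sidebar example are meant for cases
where a value like `"yes"` should be rejected instead of silently converted.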

### New methods, attributes and commands {#new-methods}

The following methods, attributes and commands are new in spaCy v3.0.

| Name | Description |
| --- | --- |
| [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). |
| [`Token.morph`](/api/token#attributes), [`Token.morph_`](/api/token#attributes) | Access a token's morphological analysis. |
| [`Language.select_pipes`](/api/language#select_pipes) | Context manager for enabling or disabling specific pipeline components for a block. |
| [`Language.disable_pipe`](/api/language#disable_pipe), [`Language.enable_pipe`](/api/language#enable_pipe) | Disable or enable a loaded pipeline component (but don't remove it). |
| [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. |
| [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a trained pipeline and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. |
| [`@Language.factory`](/api/language#factory), [`@Language.component`](/api/language#component) | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions. |
| [`Language.has_factory`](/api/language#has_factory) | Check whether a component factory is registered on a language class. |
| [`Language.get_factory_meta`](/api/language#get_factory_meta), [`Language.get_pipe_meta`](/api/language#get_pipe_meta) | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name. |
| [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. It's an instance of [`Config`](https://thinc.ai/docs/api-config#config) and can be saved to disk and used for training. |
| [`Language.components`](/api/language#attributes), [`Language.component_names`](/api/language#attributes) | All available components and component names, including disabled components that are not run as part of the pipeline. |
| [`Language.disabled`](/api/language#attributes) | Names of disabled components that are not run as part of the pipeline. |
| [`Pipe.score`](/api/pipe#score) | Method on pipeline components that returns a dictionary of evaluation scores. |
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
| [`util.load_meta`](/api/top-level#util.load_meta), [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a pipeline's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). |
| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all pipeline packages installed in the environment. |
| [`init config`](/api/cli#init-config), [`init fill-config`](/api/cli#init-fill-config), [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training). |
| [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). |
| [`ray`](/api/cli#ray) | Suite of CLI commands for parallel training with [Ray](https://ray.io/), provided by the [`spacy-ray`](https://github.com/explosion/spacy-ray) extension package. |
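
A few of the pipeline-related additions from the table can be tried on a blank
pipeline. This is a small sketch that assumes the built-in `sentencizer` and
`attribute_ruler` factories as stand-in components. It is not a recommended
pipeline setup.

```python
import spacy
from spacy.language import Language

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("attribute_ruler")

# Check whether a factory is registered on the language class
has_sentencizer = Language.has_factory("sentencizer")

# Temporarily disable a component for the duration of the block
with nlp.select_pipes(disable=["attribute_ruler"]):
    active_in_block = list(nlp.pipe_names)

# Disable a component without removing it from the pipeline
nlp.disable_pipe("sentencizer")
active = list(nlp.pipe_names)         # components that currently run
available = list(nlp.component_names)  # all components, including disabled
disabled = list(nlp.disabled)

# Re-enable it and inspect the config describing the nlp object
nlp.enable_pipe("sentencizer")
print(nlp.config["nlp"]["lang"])
```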

### New and updated documentation {#new-docs}

 								<Grid cols={2} gutterBottom={false}>
 								<div>

To help you get started with spaCy v3.0 and the new features, we've added
several new or rewritten documentation pages, including a new usage guide on
[embeddings, transformers and transfer learning](/usage/embeddings-transformers),
a guide on [training pipelines and models](/usage/training) rewritten from
scratch, a page explaining the new [spaCy projects](/usage/projects) and updated
usage documentation on
[custom pipeline components](/usage/processing-pipelines#custom-components).
We've also added a bunch of new illustrations and new API reference pages
documenting spaCy's machine learning [model architectures](/api/architectures)
and the expected [data formats](/api/data-formats). API pages about
[pipeline components](/api#architecture-pipeline) now include more information,
like the default config and implementation, and we've adopted a more detailed
format for documenting argument and return types.

 								</div>
 								[![Library architecture](../images/architecture.svg)](/api)
 								</Grid>
 								<Infobox title="New or reworked documentation" emoji="📖" list>

- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
  [Training models](/usage/training),
  [Layers & Architectures](/usage/layers-architectures),
  [Projects](/usage/projects),
  [Custom pipeline components](/usage/processing-pipelines#custom-components),
  [Custom tokenizers](/usage/linguistic-features#custom-tokenizer),
  [Morphology](/usage/linguistic-features#morphology),
  [Lemmatization](/usage/linguistic-features#lemmatization),
  [Mapping & Exceptions](/usage/linguistic-features#mappings-exceptions),
  [Dependency matching](/usage/rule-based-matching#dependencymatcher)
- **API Reference:** [Library architecture](/api),
  [Model architectures](/api/architectures), [Data formats](/api/data-formats)
- **New Classes:** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
  [`Transformer`](/api/transformer), [`Lemmatizer`](/api/lemmatizer),
  [`Morphologizer`](/api/morphologizer),
  [`AttributeRuler`](/api/attributeruler),
  [`SentenceRecognizer`](/api/sentencerecognizer),
  [`DependencyMatcher`](/api/dependencymatcher), [`Pipe`](/api/pipe),
  [`Corpus`](/api/corpus)

</Infobox>

## Backwards Incompatibilities {#incompat}

As always, we've tried to keep the breaking changes to a minimum and focus on
changes that were necessary to support the new features, fix problems or improve
usability. The following section lists the relevant changes to the user-facing
API. For specific examples of how to rewrite your code, check out the
[migration guide](#migrating).

<Infobox variant="warning">

Note that spaCy v3.0 now requires **Python 3.6+**.

</Infobox>

### API changes {#incompat-api}

-												API docs, docstrings and argument consistency

											
										
										
											2020-07-27 19:11:45 +03:00
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								- Pipeline package symlinks, the `link` command and shortcut names are now
 								  deprecated. There can be many [different trained pipelines](/models) and not
 								  just one "English model", so you should always use the full package name like
-												Update docs [ci skip]

											
										
										
											2020-08-11 21:57:23 +03:00
+								  [`en_core_web_sm`](/models/en) explicitly.
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								- A pipeline's [`meta.json`](/api/data-formats#meta) is now only used to provide
 								  meta information like the package name, author, license and labels. It's
 								  **not** used to construct the processing pipeline anymore. This is all defined
 								  in the [`config.cfg`](/api/data-formats#config), which also includes all
 								  settings used to train the pipeline.
-												Update docs [ci skip]

											
										
										
											2020-08-11 21:57:23 +03:00
+								- The [`train`](/api/cli#train) and [`pretrain`](/api/cli#pretrain) commands now
 								  only take a `config.cfg` file containing the full
 								  [training config](/usage/training#config).
-												Update docs [ci skip]

											
										
										
											2020-08-10 01:01:38 +03:00
+								- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
 								  the component factory instead of the component function.
- **Custom pipeline components** now need to be decorated with the
  [`@Language.component`](/api/language#component) or
 								  [`@Language.factory`](/api/language#factory) decorator.
 								- [`Language.update`](/api/language#update) now takes a batch of
 								  [`Example`](/api/example) objects instead of raw texts and annotations, or
 								  `Doc` and `GoldParse` objects.
- The `Language.disable_pipes` context manager has been replaced by
  [`Language.select_pipes`](/api/language#select_pipes), which can explicitly
 								  disable or enable components.
- The [`Language.update`](/api/language#update),
 								  [`Language.evaluate`](/api/language#evaluate) and
 								  [`Pipe.update`](/api/pipe#update) methods now all take batches of
 								  [`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
 								  raw text and a dictionary of annotations.
 								  [`Language.begin_training`](/api/language#begin_training) and
 								  [`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
 								  returns a sequence of `Example` objects to initialize the model instead of a
 								  list of tuples.
- [`Matcher.add`](/api/matcher#add) and
 								  [`PhraseMatcher.add`](/api/phrasematcher#add) now only accept a list of
 								  patterns as the second argument (instead of a variable number of arguments).
 								  The `on_match` callback becomes an optional keyword argument.
 								- The `spacy.gold` module has been renamed to
 								  [`spacy.training`](%%GITHUB_SPACY/spacy/training).
- The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have
  been removed.
- The `TAG_MAP` and `MORPH_RULES` in the language data have been replaced by the
 								  more flexible [`AttributeRuler`](/api/attributeruler).
 								- The [`Lemmatizer`](/api/lemmatizer) is now a standalone pipeline component and
 								  doesn't provide lemmas by default or switch automatically between lookup and
 								  rule-based lemmas. You can now add it to your pipeline explicitly and set its
 								  mode on initialization.
- Various keyword arguments across functions and methods are now explicitly
 								  declared as _keyword-only_ arguments. Those arguments are documented
 								  accordingly across the API reference.
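
To make the `select_pipes` change from the list above more concrete, here's a
minimal sketch. It assumes a blank English pipeline with a `sentencizer`, picked
only because it doesn't need a trained package. The variable names are
illustrative and not part of the API.

```python
import spacy

# Blank pipeline with one rule-based component, no trained package required
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# v2.x equivalent: with nlp.disable_pipes("sentencizer"): ...
with nlp.select_pipes(disable=["sentencizer"]):
    # Inside the block, the component is not part of the active pipeline
    names_inside = list(nlp.pipe_names)

# After the block, the component is restored
names_after = list(nlp.pipe_names)
print(names_inside, names_after)
```

Passing `enable` instead of `disable` should work the other way around and keep
only the listed components active.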

### Removed or renamed API {#incompat-removed}

| Removed                                                  | Replacement                                                                                                  |
| -------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| `Language.disable_pipes`                                 | [`Language.select_pipes`](/api/language#select_pipes), [`Language.disable_pipe`](/api/language#disable_pipe) |
| `GoldParse`                                              | [`Example`](/api/example)                                                                                    |
| `GoldCorpus`                                             | [`Corpus`](/api/corpus)                                                                                      |
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump`          | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk)                   |
| `Matcher.pipe`, `PhraseMatcher.pipe`                     | not needed                                                                                                   |
| `spacy init-model`                                       | [`spacy init vocab`](/api/cli#init-vocab)                                                                    |
| `spacy debug-data`                                       | [`spacy debug data`](/api/cli#debug-data)                                                                    |
| `spacy profile`                                          | [`spacy debug profile`](/api/cli#debug-profile)                                                              |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, symlinks are deprecated                                                                          |

The following deprecated methods, attributes and arguments were removed in v3.0.
 								Most of them have been **deprecated for a while** and many would previously
 								raise errors. Many of them were also mostly internals. If you've been working
 								with more recent versions of spaCy v2.x, it's **unlikely** that your code relied
 								on them.

| Removed                                                                                                                 | Replacement                                                                                                                                                |
| ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Doc.tokens_from_list`                                                                                                  | [`Doc.__init__`](/api/doc#init)                                                                                                                            |
| `Doc.merge`, `Span.merge`                                                                                               | [`Doc.retokenize`](/api/doc#retokenize)                                                                                                                    |
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower`                                                               | [`Span.text`](/api/span#attributes), [`Token.text`](/api/token#attributes)                                                                                 |
| `Language.tagger`, `Language.parser`, `Language.entity`                                                                 | [`Language.get_pipe`](/api/language#get_pipe)                                                                                                              |
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes`                                | `exclude=["vocab"]`                                                                                                                                        |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process`                                                                                                                                                |
| `verbose` argument on [`Language.evaluate`](/api/language#evaluate)                                                     | logging (`DEBUG`)                                                                                                                                          |
| `SentenceSegmenter` hook, `SimilarityHook`                                                                              | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |

## Migrating from v2.x {#migrating}

### Downloading and loading trained pipelines {#migrating-downloading-models}

Symlinks and shortcuts like `en` are now officially deprecated. There are
 								[many different trained pipelines](/models) with different capabilities and not
 								just one "English model". In order to download and load a package, you should
 								always use its full name – for instance,
 								[`en_core_web_sm`](/models/en#en_core_web_sm).
 								```diff
 								- python -m spacy download en
 								+ python -m spacy download en_core_web_sm
 								```
 								```diff
 								- nlp = spacy.load("en")
 								+ nlp = spacy.load("en_core_web_sm")
 								```

### Custom pipeline components and factories {#migrating-pipeline-components}
 								Custom pipeline components now have to be registered explicitly using the
 								[`@Language.component`](/api/language#component) or
 								[`@Language.factory`](/api/language#factory) decorator. For simple functions
 								that take a `Doc` and return it, all you have to do is add the
 								`@Language.component` decorator to it and assign it a name:
 								```diff
 								### Stateless function components
 								+ from spacy.language import Language
 								+ @Language.component("my_component")
 								def my_component(doc):
 								    return doc
 								```
For class components that are initialized with settings and/or the shared `nlp`
object, you can use the `@Language.factory` decorator. Also make sure that the
method used to initialize the factory has **two named arguments**: `nlp` (the
current `nlp` object) and `name` (the string name of the component instance).
 								```diff
 								### Stateful class components
 								+ from spacy.language import Language
 								+ @Language.factory("my_component")
 								class MyComponent:
 								-   def __init__(self, nlp):
 								+   def __init__(self, nlp, name):
 								        self.nlp = nlp
 								    def __call__(self, doc):
 								        return doc
 								```
 								Instead of decorating your class, you could also add a factory function that
 								takes the arguments `nlp` and `name` and returns an instance of your component:
 								```diff
 								### Stateful class components with factory function
 								+ from spacy.language import Language
 								+ @Language.factory("my_component")
 								+ def create_my_component(nlp, name):
 								+     return MyComponent(nlp)
 								class MyComponent:
 								    def __init__(self, nlp):
 								        self.nlp = nlp
 								    def __call__(self, doc):
 								        return doc
 								```
 								The `@Language.component` and `@Language.factory` decorators now take care of
 								adding an entry to the component factories, so spaCy knows how to load a
 								component back in from its string name. You won't have to write to
 								`Language.factories` manually anymore.
 								```diff
 								- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
 								```
 								#### Adding components to the pipeline {#migrating-add-pipe}
 								The [`nlp.add_pipe`](/api/language#add_pipe) method now takes the **string
 								name** of the component factory instead of a callable component. This allows
 								spaCy to track and serialize components that have been added and their settings.
 								```diff
 								+ @Language.component("my_component")
 								def my_component(doc):
 								    return doc
 								- nlp.add_pipe(my_component)
 								+ nlp.add_pipe("my_component")
 								```
 								[`nlp.add_pipe`](/api/language#add_pipe) now also returns the pipeline component
 								itself, so you can access its attributes. The
 								[`nlp.create_pipe`](/api/language#create_pipe) method is now mostly internals
 								and you typically shouldn't have to use it in your code.
 								```diff
 								- parser = nlp.create_pipe("parser")
 								- nlp.add_pipe(parser)
 								+ parser = nlp.add_pipe("parser")
 								```
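
Putting the two steps together, here's a minimal end-to-end sketch of
registering a stateless component and adding it by name. The component name
`token_counter_demo` and the `n_tokens` key are made up for this example, and a
blank English pipeline is assumed so that no trained package is needed.

```python
import spacy
from spacy.language import Language


@Language.component("token_counter_demo")
def token_counter_demo(doc):
    # Stateless component: takes a Doc, modifies it and returns it
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
# add_pipe takes the registered string name and returns the component
component = nlp.add_pipe("token_counter_demo")

doc = nlp("This is a sentence.")
print(nlp.pipe_names, doc.user_data["n_tokens"])
```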

If you need to add a component from an existing trained pipeline, you can now
use the `source` argument on [`nlp.add_pipe`](/api/language#add_pipe). This will
check that the component is compatible, and take care of porting over all
config. During training, you can also reference existing trained components in
 								your [config](/usage/training#config-components) and decide whether or not they
 								should be updated with more data.
 								> #### config.cfg (excerpt)
 								>
 								> ```ini
 								> [components.ner]
 								> source = "en_core_web_sm"
 								> component = "ner"
 								> ```
 								```diff
 								source_nlp = spacy.load("en_core_web_sm")
 								nlp = spacy.blank("en")
 								- ner = source_nlp.get_pipe("ner")
 								- nlp.add_pipe(ner)
 								+ nlp.add_pipe("ner", source=source_nlp)
 								```
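
If you want to try the `source` argument without downloading a trained pipeline,
the following sketch should behave the same way. It assumes a second blank
pipeline with a rule-based `sentencizer` as a stand-in for `en_core_web_sm`,
which is an illustrative substitution and not a typical setup.

```python
import spacy

# Stand-in for a trained pipeline: a blank pipeline with one component
source_nlp = spacy.blank("en")
source_nlp.add_pipe("sentencizer")

nlp = spacy.blank("en")
# Copy the component and its config over from the source pipeline by name
nlp.add_pipe("sentencizer", source=source_nlp)

doc = nlp("This is one sentence. This is another.")
sentences = [sent.text for sent in doc.sents]
print(nlp.pipe_names, sentences)
```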
 								### Adding match patterns {#migrating-matcher}
 								The [`Matcher.add`](/api/matcher#add),
 								[`PhraseMatcher.add`](/api/phrasematcher#add) and
 								[`DependencyMatcher.add`](/api/dependencymatcher#add) methods now only accept a
 								**list of patterns** as the second argument (instead of a variable number of
 								arguments). The `on_match` callback becomes an optional keyword argument.
 								```diff
 								matcher = Matcher(nlp.vocab)
 								patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
 								- matcher.add("GoogleNow", on_match, *patterns)
 								+ matcher.add("GoogleNow", patterns, on_match=on_match)
 								```
 								```diff
 								matcher = PhraseMatcher(nlp.vocab)
 								patterns = [nlp("health care reform"), nlp("healthcare reform")]
 								- matcher.add("HEALTH", on_match, *patterns)
 								+ matcher.add("HEALTH", patterns, on_match=on_match)
 								```
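
For reference, here's a self-contained sketch of the new `Matcher.add` call with
an `on_match` callback. The example sentence and the `matched_texts` list are
assumptions added for illustration. The patterns are the ones used above.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matched_texts = []


def on_match(matcher, doc, i, matches):
    # Called once per match with the index i of the current match
    match_id, start, end = matches[i]
    matched_texts.append(doc[start:end].text)


patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
# Patterns are passed as one list, the callback as a keyword argument
matcher.add("GoogleNow", patterns, on_match=on_match)

doc = nlp("I tried Google Now and GoogleNow yesterday")
matches = matcher(doc)
print(matched_texts)
```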

### Migrating attributes in tokenizer exceptions {#migrating-tokenizer-exceptions}
 								Tokenizer exceptions are now only allowed to set `ORTH` and `NORM` values as
 								part of the token attributes. Exceptions for other attributes such as `TAG` and
 								`LEMMA` should be moved to an [`AttributeRuler`](/api/attributeruler) component:
 								```diff
 								nlp = spacy.blank("en")
 								- nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't", "LEMMA": "not"}])
 								+ nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't"}])
 								+ ruler = nlp.add_pipe("attribute_ruler")
 								+ ruler.add(patterns=[[{"ORTH": "n't"}]], attrs={"LEMMA": "not"})
 								```

### Migrating tag maps and morph rules {#migrating-training-mappings-exceptions}
 								Instead of defining a `tag_map` and `morph_rules` in the language data, spaCy
 								v3.0 now manages mappings and exceptions with a separate and more flexible
 								pipeline component, the [`AttributeRuler`](/api/attributeruler). See the
 								[usage guide](/usage/linguistic-features#mappings-exceptions) for examples. The
 								`AttributeRuler` provides two handy helper methods
 								[`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
 								[`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules) that let
 								you load in your existing tag map or morph rules:
 								```diff
 								nlp = spacy.blank("en")
 								- nlp.vocab.morphology.load_tag_map(YOUR_TAG_MAP)
 								+ ruler = nlp.add_pipe("attribute_ruler")
 								+ ruler.load_from_tag_map(YOUR_TAG_MAP)
 								```

### Training pipelines and models {#migrating-training}

To train your pipelines, you should now pretty much always use the
[`spacy train`](/api/cli#train) CLI. You shouldn't have to put together your own
training scripts anymore, unless you _really_ want to. The training commands now
use a [flexible config file](/usage/training#config) that describes all training
settings and hyperparameters, as well as your pipeline, components and
architectures to use. The `--code` argument lets you pass in code containing
[custom registered functions](/usage/training#custom-code) that you can
reference in your config. To get started, check out the
[quickstart widget](/usage/training#quickstart).

#### Binary .spacy training data format {#migrating-training-format}

spaCy v3.0 uses a new
[binary training data format](/api/data-formats#binary-training) created by
serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc`
objects. This means that you can train spaCy pipelines using the same format it
outputs: annotated `Doc` objects. The binary format is extremely **efficient in
storage**, especially when packing multiple documents together. You can convert
your existing JSON-formatted data using the [`spacy convert`](/api/cli#convert)
command, which outputs `.spacy` files:

```cli
$ python -m spacy convert ./training.json ./output
```
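
If your data lives in Python rather than in a JSON file, you can also build the
`DocBin` yourself. The following is a rough sketch that assumes toy data with
character offsets for entities and a blank English pipeline for tokenization.
`TRAIN_DATA` and the output path in the comment are hypothetical.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
# Hypothetical toy data: text plus (start, end, label) entity offsets
TRAIN_DATA = [
    ("Who is Shaka Khan?", [(7, 17, "PERSON")]),
    ("I like London.", [(7, 13, "LOC")]),
]

doc_bin = DocBin()
for text, spans in TRAIN_DATA:
    doc = nlp.make_doc(text)
    # char_span returns None if the offsets don't align with token boundaries
    ents = [doc.char_span(start, end, label=label) for start, end, label in spans]
    doc.ents = [ent for ent in ents if ent is not None]
    doc_bin.add(doc)

# doc_bin.to_disk("./train.spacy") would write the collection to a file.
# Here we round-trip via bytes to check that the annotations survive.
data = doc_bin.to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
restored_ents = [(ent.text, ent.label_) for doc in restored for ent in doc.ents]
print(restored_ents)
```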
 								#### Training config {#migrating-training-config}

The easiest way to get started with a training config is to use the
[`init config`](/api/cli#init-config) command or the
[quickstart widget](/usage/training#quickstart). You can define your
requirements, and it will auto-generate a starter config with the best-matching
default settings.

```cli
$ python -m spacy init config ./config.cfg --lang en --pipeline tagger,parser
```

If you've exported a starter config from our
[quickstart widget](/usage/training#quickstart), you can use the
[`init fill-config`](/api/cli#init-fill-config) command to fill it with all
default values. You can then use the auto-generated `config.cfg` for training:

```diff
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./config.cfg --output ./output
```

#### Training via the Python API {#migrating-training-python}

For most use cases, you **shouldn't** have to write your own training scripts
anymore. Instead, you can use [`spacy train`](/api/cli#train) with a
[config file](/usage/training#config) and custom
[registered functions](/usage/training#custom-code) if needed. You can even
register callbacks that can modify the `nlp` object at different stages of its
lifecycle to fully customize it before training.

If you do decide to use the [internal training API](/usage/training#api) from
Python, you should only need a few small modifications to convert your scripts
from spaCy v2.x to v3.x. The [`Example.from_dict`](/api/example#from_dict)
classmethod takes a reference `Doc` and a
[dictionary of annotations](/api/data-formats#dict-input), similar to the
"simple training style" in spaCy v2.x:

```diff
### Migrating Doc and GoldParse
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
entities = [(0, 15, "PERSON"), (30, 38, "ORG")]
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})
```
 								```diff
 								### Migrating simple training style
 								text = "Mark Zuckerberg is the CEO of Facebook"
 								annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
 								+ doc = nlp.make_doc(text)
 								+ example = Example.from_dict(doc, annotations)
 								```
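
To see what an `Example` holds after the conversion, here's a small runnable
sketch based on the snippets above. It assumes a blank English pipeline, so the
predicted side carries no annotations yet. The `gold_ents` variable is only
there for illustration.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
text = "Mark Zuckerberg is the CEO of Facebook"
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}

doc = nlp.make_doc(text)
example = Example.from_dict(doc, annotations)

# example.reference holds the gold-standard annotations,
# example.predicted is the doc the pipeline will make predictions on
gold_ents = [(ent.text, ent.label_) for ent in example.reference.ents]
print(gold_ents, len(example.predicted.ents))
```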
 								The [`Language.update`](/api/language#update),
 								[`Language.evaluate`](/api/language#evaluate) and
 								[`Pipe.update`](/api/pipe#update) methods now all take batches of
 								[`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
 								raw text and a dictionary of annotations.
```python
### Training loop {highlight="15"}
import random
from spacy.training import Example
from spacy.util import minibatch

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]
nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for batch in minibatch(TRAIN_DATA):
        examples = []
        for text, annots in batch:
            examples.append(Example.from_dict(nlp.make_doc(text), annots))
        nlp.update(examples)
```
 								[`Language.begin_training`](/api/language#begin_training) and
 								[`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
 								returns a sequence of `Example` objects to initialize the model instead of a
 								list of tuples. The data examples are used to **initialize the models** of
 								trainable pipeline components, which includes validating the network,
 								[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
 								setting up the label scheme.
 								```diff
 								- nlp.begin_training(examples)
 								+ nlp.begin_training(lambda: examples)
 								```

#### Packaging trained pipelines {#migrating-training-packaging}

 								The [`spacy package`](/api/cli#package) command now automatically builds the
 								installable `.tar.gz` sdist of the Python package, so you don't have to run this
 								step manually anymore. You can disable the behavior by setting the `--no-sdist`
 								flag.
```diff
python -m spacy package ./output ./packages
- cd /output/en_pipeline-0.0.0
- python setup.py sdist
```

#### Data utilities and gold module {#migrating-gold}
 								The `spacy.gold` module has been renamed to `spacy.training`. This mostly
 								affects internals, but if you've been using the span offset conversion utilities
 								[`biluo_tags_from_offsets`](/api/top-level#biluo_tags_from_offsets),
 								[`offsets_from_biluo_tags`](/api/top-level#offsets_from_biluo_tags) or
 								[`spans_from_biluo_tags`](/api/top-level#spans_from_biluo_tags), you'll have to
 								change your imports:
 								```diff
 								- from spacy.gold import biluo_tags_from_offsets, spans_from_biluo_tags
 								+ from spacy.training import biluo_tags_from_offsets, spans_from_biluo_tags
 								```

#### Migration notes for plugin maintainers {#migrating-plugins}

 								Thanks to everyone who's been contributing to the spaCy ecosystem by developing
 								and maintaining one of the many awesome [plugins and extensions](/universe).
We've tried to make it as easy as possible for you to upgrade your packages for
 								spaCy v3. The most common use case for plugins is providing pipeline components
 								and extension attributes. When migrating your plugin, double-check the
 								following:
 								- Use the [`@Language.factory`](/api/language#factory) decorator to register
 								  your component and assign it a name. This allows users to refer to your
 								  components by name and serialize pipelines that reference them. Remove any
 								  entries you previously added to `Language.factories` manually.
 								- Make sure your component factories take at least two **named arguments**:
 								  `nlp` (the current `nlp` object) and `name` (the instance name of the added
 								  component so you can identify multiple instances of the same component).
 								- Update all references to [`nlp.add_pipe`](/api/language#add_pipe) in your docs
 								  to use **string names** instead of the component functions.
 								```python
 								### {highlight="1-5"}
 								from spacy.language import Language
 								
 								@Language.factory("my_component", default_config={"some_setting": False})
 								def create_component(nlp: Language, name: str, some_setting: bool):
 								    return MyCoolComponent(some_setting=some_setting)
 								
 								class MyCoolComponent:
 								    def __init__(self, some_setting):
 								        self.some_setting = some_setting
 								
 								    def __call__(self, doc):
 								        # Do something to the doc
 								        return doc
 								```
 								> #### Result in config.cfg
 								>
 								> ```ini
 								> [components.my_component]
 								> factory = "my_component"
 								> some_setting = true
 								> ```
 								```diff
 								import spacy
 								from your_plugin import MyCoolComponent
 								nlp = spacy.load("en_core_web_sm")
 								- component = MyCoolComponent(some_setting=True)
 								- nlp.add_pipe(component)
 								+ nlp.add_pipe("my_component", config={"some_setting": True})
 								```
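To see why string names make a pipeline serializable, here is a toy sketch of the pattern that does not use spaCy at all. `FACTORIES`, `factory` and `ToyPipeline` are hypothetical stand-ins invented for this illustration. spaCy's real `Language.factory` and `nlp.add_pipe` do considerably more.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for a factory registry (not spaCy's implementation).
FACTORIES: Dict[str, Tuple[Callable[..., Any], Dict[str, Any]]] = {}


def factory(factory_name: str, default_config: Optional[Dict[str, Any]] = None):
    """Register a component factory under a string name."""
    def decorator(func):
        FACTORIES[factory_name] = (func, dict(default_config or {}))
        return func
    return decorator


class ToyPipeline:
    """Minimal pipeline that tracks components by instance name."""

    def __init__(self):
        self.components: List[Tuple[str, Callable]] = []
        # Plain data describing the pipeline, i.e. what could be saved to a config.
        self.meta: List[Dict[str, Any]] = []

    def add_pipe(self, factory_name, name=None, config=None):
        func, defaults = FACTORIES[factory_name]
        settings = {**defaults, **(config or {})}  # user config overrides defaults
        name = name or factory_name
        component = func(self, name, **settings)
        self.components.append((name, component))
        self.meta.append({"name": name, "factory": factory_name, **settings})
        return component


@factory("my_component", default_config={"some_setting": False})
def create_component(nlp, name, some_setting):
    # The toy "doc" is a list, and the component appends its name and setting.
    return lambda doc: doc + [(name, some_setting)]


nlp = ToyPipeline()
nlp.add_pipe("my_component")
nlp.add_pipe("my_component", name="second", config={"some_setting": True})
```

Because each component is added by factory name plus settings, `nlp.meta` contains only strings and JSON-like values, so the pipeline can be rebuilt from it. The `name` argument is what tells the two instances of the same factory apart.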
 								<Infobox title="Important note on registering factories" variant="warning">
 								The [`@Language.factory`](/api/language#factory) decorator takes care of letting
 								spaCy know that a component of that name is available. This means that your
 								users can add it to the pipeline using its **string name**. However, this
 								requires the decorator to be executed – so users will still have to **import
 								your plugin**. Alternatively, your plugin could expose an
 								[entry point](/usage/saving-loading#entry-points), which spaCy can read from.
 								This means that spaCy knows how to initialize `my_component`, even if your
 								package isn't imported.
 								</Infobox>
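As a rough sketch of the entry point approach, the plugin's `setup.py` could declare the factory as shown below. This assumes the `spacy_factories` entry point group, so check the linked entry points page for the exact group name. The package name `your_plugin` and the `SETUP_KWARGS` variable are placeholders for illustration.

```python
# Keyword arguments for setuptools.setup() in a hypothetical plugin's setup.py.
SETUP_KWARGS = {
    "name": "your_plugin",
    "entry_points": {
        # Assumed format: "<factory name> = <module>:<factory function>"
        "spacy_factories": ["my_component = your_plugin:create_component"],
    },
}

# In the real setup.py you would finish with:
#     from setuptools import setup
#     setup(**SETUP_KWARGS)
```

Once the package is installed with this entry point, the factory function can be found by name without users importing `your_plugin` first.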