mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-11-04 01:48:04 +03:00 
			
		
		
		
	* Add utils for working with remote storage * WIP add remote_cache for project * WIP add push and pull commands * Use pathy in remote_cache * Updarte util * Update remote_cache * Update util * Update project assets * Update pull script * Update push script * Fix type annotation in util * Work on remote storage * Remove site and env hash * Fix imports * Fix type annotation * Require pathy * Require pathy * Fix import * Add a util to handle project variable substitution * Import push and pull commands * Fix pull command * Fix push command * Fix tarfile in remote_storage * Improve printing * Fiddle with status messages * Set version to v3.0.0a9 * Draft docs for spacy project remote storages * Update docs [ci skip] * Use Thinc config to simplify and unify template variables * Auto-format * Don't import Pathy globally for now Causes slow and annoying Google Cloud warning * Tidy up test * Tidy up and update tests * Update to latest Thinc * Update docs * variables -> vars * Update docs [ci skip] * Update docs [ci skip] Co-authored-by: Ines Montani <ines@ines.io>
		
			
				
	
	
		
			763 lines
		
	
	
		
			38 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			763 lines
		
	
	
		
			38 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
---
 | 
						||
title: What's New in v3.0
 | 
						||
teaser: New features, backwards incompatibilities and migration guide
 | 
						||
menu:
 | 
						||
  - ['Summary', 'summary']
 | 
						||
  - ['New Features', 'features']
 | 
						||
  - ['Backwards Incompatibilities', 'incompat']
 | 
						||
  - ['Migrating from v2.x', 'migrating']
 | 
						||
---
 | 
						||
 | 
						||
## Summary {#summary}
 | 
						||
 | 
						||
<Grid cols={2}>
 | 
						||
 | 
						||
<div>
 | 
						||
 | 
						||
</div>
 | 
						||
 | 
						||
<Infobox title="Table of Contents" id="toc">
 | 
						||
 | 
						||
- [Summary](#summary)
 | 
						||
- [New features](#features)
 | 
						||
- [Training & config system](#features-training)
 | 
						||
- [Transformer-based pipelines](#features-transformers)
 | 
						||
- [Custom models](#features-custom-models)
 | 
						||
- [End-to-end project workflows](#features-projects)
 | 
						||
- [New built-in components](#features-pipeline-components)
 | 
						||
- [New custom component API](#features-components)
 | 
						||
- [Python type hints](#features-types)
 | 
						||
- [New methods & attributes](#new-methods)
 | 
						||
- [New & updated documentation](#new-docs)
 | 
						||
- [Backwards incompatibilities](#incompat)
 | 
						||
- [Migrating from spaCy v2.x](#migrating)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
</Grid>
 | 
						||
 | 
						||
## New Features {#features}
 | 
						||
 | 
						||
### New training workflow and config system {#features-training}
 | 
						||
 | 
						||
<Infobox title="Details & Documentation" emoji="📖" list>
 | 
						||
 | 
						||
- **Usage:** [Training models](/usage/training)
 | 
						||
- **Thinc:** [Thinc's config system](https://thinc.ai/docs/usage-config),
 | 
						||
  [`Config`](https://thinc.ai/docs/api-config#config)
 | 
						||
- **CLI:** [`train`](/api/cli#train), [`pretrain`](/api/cli#pretrain),
 | 
						||
  [`evaluate`](/api/cli#evaluate)
 | 
						||
- **API:** [Config format](/api/data-formats#config),
 | 
						||
  [`registry`](/api/top-level#registry)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Transformer-based pipelines {#features-transformers}
 | 
						||
 | 
						||

 | 
						||
 | 
						||
<Infobox title="Details & Documentation" emoji="📖" list>
 | 
						||
 | 
						||
- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
 | 
						||
  [Training models](/usage/training)
 | 
						||
- **API:** [`Transformer`](/api/transformer),
 | 
						||
  [`TransformerData`](/api/transformer#transformerdata),
 | 
						||
  [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
 | 
						||
- **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
 | 
						||
  [Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
 | 
						||
  [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
 | 
						||
- **Models:** [`en_core_trf_lg_sm`](/models/en)
 | 
						||
- **Implementation:**
 | 
						||
  [`spacy-transformers`](https://github.com/explosion/spacy-transformers)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Custom models using any framework {#features-custom-models}
 | 
						||
 | 
						||
<Infobox title="Details & Documentation" emoji="📖" list>
 | 
						||
 | 
						||
<!-- TODO: link to new custom models page -->
 | 
						||
 | 
						||
- **Thinc: **
 | 
						||
  [Wrapping PyTorch, TensorFlow & MXNet](https://thinc.ai/docs/usage-frameworks)
 | 
						||
- **API:** [Model architectures](/api/architectures), [`Pipe`](/api/pipe)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Manage end-to-end workflows with projects {#features-projects}
 | 
						||
 | 
						||
<!-- TODO: update example -->
 | 
						||
 | 
						||
> #### Example
 | 
						||
>
 | 
						||
> ```cli
 | 
						||
> # Clone a project template
 | 
						||
> $ python -m spacy project clone example
 | 
						||
> $ cd example
 | 
						||
> # Download data assets
 | 
						||
> $ python -m spacy project assets
 | 
						||
> # Run a workflow
 | 
						||
> $ python -m spacy project run train
 | 
						||
> ```
 | 
						||
 | 
						||
spaCy projects let you manage and share **end-to-end spaCy workflows** for
 | 
						||
different **use cases and domains**, and orchestrate training, packaging and
 | 
						||
serving your custom models. You can start off by cloning a pre-defined project
 | 
						||
template, adjust it to fit your needs, load in your data, train a model, export
 | 
						||
it as a Python package, upload your outputs to a remote storage and share your
 | 
						||
results with your team.
 | 
						||
 | 
						||

 | 
						||
 | 
						||
spaCy projects also make it easy to **integrate with other tools** in the data
 | 
						||
science and machine learning ecosystem, including [DVC](/usage/projects#dvc) for
 | 
						||
data version control, [Prodigy](/usage/projects#prodigy) for creating labelled
 | 
						||
data, [Streamlit](/usage/projects#streamlit) for building interactive apps,
 | 
						||
[FastAPI](/usage/projects#fastapi) for serving models in production,
 | 
						||
[Ray](/usage/projects#ray) for parallel training,
 | 
						||
[Weights & Biases](/usage/projects#wandb) for experiment tracking, and more!
 | 
						||
 | 
						||
<!-- <Project id="some_example_project">
 | 
						||
 | 
						||
The easiest way to get started with an end-to-end training process is to clone a
 | 
						||
[project](/usage/projects) template. Projects let you manage multi-step
 | 
						||
workflows, from data preprocessing to training and packaging your model.
 | 
						||
 | 
						||
</Project>-->
 | 
						||
 | 
						||
<Infobox title="Details & Documentation" emoji="📖" list>
 | 
						||
 | 
						||
- **Usage:** [spaCy projects](/usage/projects),
 | 
						||
  [Training models](/usage/training)
 | 
						||
- **CLI:** [`project`](/api/cli#project), [`train`](/api/cli#train)
 | 
						||
- **Templates:** [`projects`](https://github.com/explosion/projects)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### New built-in pipeline components {#features-pipeline-components}
 | 
						||
 | 
						||
spaCy v3.0 includes several new trainable and rule-based components that you can
 | 
						||
add to your pipeline and customize for your use case:
 | 
						||
 | 
						||
> #### Example
 | 
						||
>
 | 
						||
> ```python
 | 
						||
> nlp = spacy.blank("en")
 | 
						||
> nlp.add_pipe("lemmatizer")
 | 
						||
> ```
 | 
						||
 | 
						||
| Name                                            | Description                                                                                                                                                                                                             |
 | 
						||
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | 
						||
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation.                                                                                                                                                                          |
 | 
						||
| [`Morphologizer`](/api/morphologizer)           | Trainable component to predict morphological features.                                                                                                                                                                  |
 | 
						||
| [`Lemmatizer`](/api/lemmatizer)                 | Standalone component for rule-based and lookup lemmatization.                                                                                                                                                           |
 | 
						||
| [`AttributeRuler`](/api/attributeruler)         | Component for setting token attributes using match patterns.                                                                                                                                                            |
 | 
						||
| [`Transformer`](/api/transformer)               | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
 | 
						||
 | 
						||
<Infobox title="Details & Documentation" emoji="📖" list>
 | 
						||
 | 
						||
- **Usage:** [Processing pipelines](/usage/processing-pipelines)
 | 
						||
- **API:** [Built-in pipeline components](/api#architecture-pipeline)
 | 
						||
- **Implementation:**
 | 
						||
  [`spacy/pipeline`](https://github.com/explosion/spaCy/tree/develop/spacy/pipeline)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### New and improved pipeline component APIs {#features-components}
 | 
						||
 | 
						||
> #### Example
 | 
						||
>
 | 
						||
> ```python
 | 
						||
> @Language.component("my_component")
 | 
						||
> def my_component(doc):
 | 
						||
>     return doc
 | 
						||
>
 | 
						||
> nlp.add_pipe("my_component")
 | 
						||
> nlp.add_pipe("ner", source=other_nlp)
 | 
						||
> nlp.analyze_pipes(pretty=True)
 | 
						||
> ```
 | 
						||
 | 
						||
Defining, configuring, reusing, training and analyzing pipeline components is
 | 
						||
now easier and more convenient. The `@Language.component` and
 | 
						||
`@Language.factory` decorators let you register your component, define its
 | 
						||
default configuration and meta data, like the attribute values it assigns and
 | 
						||
requires. Any custom component can be included during training, and sourcing
 | 
						||
components from existing pretrained models lets you **mix and match custom
 | 
						||
pipelines**. The `nlp.analyze_pipes` method outputs structured information about
 | 
						||
the current pipeline and its components, including the attributes they assign,
 | 
						||
the scores they compute during training and whether any required attributes
 | 
						||
aren't set.
 | 
						||
 | 
						||
<Infobox title="Details & Documentation" emoji="📖" list>
 | 
						||
 | 
						||
- **Usage:** [Custom components](/usage/processing-pipelines#custom_components),
 | 
						||
  [Defining components for training](/usage/training#config-components)
 | 
						||
- **API:** [`@Language.component`](/api/language#component),
 | 
						||
  [`@Language.factory`](/api/language#factory),
 | 
						||
  [`Language.add_pipe`](/api/language#add_pipe),
 | 
						||
  [`Language.analyze_pipes`](/api/language#analyze_pipes)
 | 
						||
- **Implementation:**
 | 
						||
  [`spacy/language.py`](https://github.com/explosion/spaCy/tree/develop/spacy/language.py)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Type hints and type-based data validation {#features-types}
 | 
						||
 | 
						||
> #### Example
 | 
						||
>
 | 
						||
> ```python
 | 
						||
> from spacy.language import Language
 | 
						||
> from pydantic import StrictBool
 | 
						||
>
 | 
						||
> @Language.factory("my_component")
 | 
						||
> def create_my_component(
 | 
						||
>     nlp: Language,
 | 
						||
>     name: str,
 | 
						||
>     custom: StrictBool
 | 
						||
> ):
 | 
						||
>    ...
 | 
						||
> ```
 | 
						||
 | 
						||
spaCy v3.0 officially drops support for Python 2 and now requires **Python
 | 
						||
3.6+**. This also means that the code base can take full advantage of
 | 
						||
[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing
 | 
						||
API that's implemented in pure Python (as opposed to Cython) now comes with type
 | 
						||
hints. The new version of spaCy's machine learning library
 | 
						||
[Thinc](https://thinc.ai) also features extensive
 | 
						||
[type support](https://thinc.ai/docs/usage-type-checking/), including custom
 | 
						||
types for models and arrays, and a custom `mypy` plugin that can be used to
 | 
						||
type-check model definitions.
 | 
						||
 | 
						||
For data validation, spacy v3.0 adopts
 | 
						||
[`pydantic`](https://github.com/samuelcolvin/pydantic). It also powers the data
 | 
						||
validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which
 | 
						||
lets you to register **custom functions with typed arguments**, reference them
 | 
						||
in your config and see validation errors if the argument values don't match.
 | 
						||
 | 
						||
<Infobox title="Details & Documentation" emoji="📖" list>
 | 
						||
 | 
						||
- **Usage: **
 | 
						||
  [Component type hints and validation](/usage/processing-pipelines#type-hints),
 | 
						||
  [Training with custom code](/usage/training#custom-code)
 | 
						||
- **Thinc: **
 | 
						||
  [Type checking in Thinc](https://thinc.ai/docs/usage-type-checking),
 | 
						||
  [Thinc's config system](https://thinc.ai/docs/usage-config)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### New methods, attributes and commands {#new-methods}
 | 
						||
 | 
						||
The following methods, attributes and commands are new in spaCy v3.0.
 | 
						||
 | 
						||
| Name                                                                                                                          | Description                                                                                                                                                                                      |
 | 
						||
| ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | 
						||
| [`Token.lex`](/api/token#attributes)                                                                                          | Access a token's [`Lexeme`](/api/lexeme).                                                                                                                                                        |
 | 
						||
| [`Token.morph`](/api/token#attributes) [`Token.morph_`](/api/token#attributes)                                                | Access a token's morphological analysis.                                                                                                                                                         |
 | 
						||
| [`Language.select_pipes`](/api/language#select_pipes)                                                                         | Context manager for enabling or disabling specific pipeline components for a block.                                                                                                              |
 | 
						||
| [`Language.analyze_pipes`](/api/language#analyze_pipes)                                                                       | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies.                                                                                                          |
 | 
						||
| [`Language.resume_training`](/api/language#resume_training)                                                                   | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting.                              |
 | 
						||
| [`@Language.factory`](/api/language#factory) [`@Language.component`](/api/language#component)                                 | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions.                                               |
 | 
						||
| [`Language.has_factory`](/api/language#has_factory)                                                                           | Check whether a component factory is registered on a language class.s                                                                                                                            |
 | 
						||
| [`Language.get_factory_meta`](/api/language#get_factory_meta) [`Language.get_pipe_meta`](/api/language#get_factory_meta)      | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name.                                                                                       |
 | 
						||
| [`Language.config`](/api/language#config)                                                                                     | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) and can be saved to disk and used for training. |
 | 
						||
| [`Pipe.score`](/api/pipe#score)                                                                                               | Method on trainable pipeline components that returns a dictionary of evaluation scores.                                                                                                          |
 | 
						||
| [`registry`](/api/top-level#registry)                                                                                         | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config).                                                                                  |
 | 
						||
| [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config)                       | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config).                                                                        |
 | 
						||
| [`util.get_installed_models`](/api/top-level#util.get_installed_models)                                                       | Names of all models installed in the environment.                                                                                                                                                |
 | 
						||
| [`init config`](/api/cli#init-config) [`init fill-config`](/api/cli#init-fill-config) [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training).                                                                                                   |
 | 
						||
| [`project`](/api/cli#project)                                                                                                 | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects).                                                                                                       |
 | 
						||
 | 
						||
### New and updated documentation {#new-docs}
 | 
						||
 | 
						||
<Grid cols={2} gutterBottom={false}>
 | 
						||
 | 
						||
<div>
 | 
						||
 | 
						||
To help you get started with spaCy v3.0 and the new features, we've added
 | 
						||
several new or rewritten documentation pages, including a new usage guide on
 | 
						||
[embeddings, transformers and transfer learning](/usage/embeddings-transformers),
 | 
						||
a guide on [training models](/usage/training) rewritten from scratch, a page
 | 
						||
explaining the new [spaCy projects](/usage/projects) and updated usage
 | 
						||
documentation on
 | 
						||
[custom pipeline components](/usage/processing-pipelines#custom-components).
 | 
						||
We've also added a bunch of new illustrations and new API reference pages
 | 
						||
documenting spaCy's machine learning [model architectures](/api/architectures)
 | 
						||
and the expected [data formats](/api/data-formats). API pages about
 | 
						||
[pipeline components](/api/#architecture-pipeline) now include more information,
 | 
						||
like the default config and implementation, and we've adopted a more detailed
 | 
						||
format for documenting argument and return types.
 | 
						||
 | 
						||
</div>
 | 
						||
 | 
						||
[](/api)
 | 
						||
 | 
						||
</Grid>
 | 
						||
 | 
						||
<Infobox title="New or reworked documentation" emoji="📖" list>
 | 
						||
 | 
						||
- **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers),
 | 
						||
  [Training models](/usage/training),
 | 
						||
  [Layers & Architectures](/usage/layers-architectures),
 | 
						||
  [Projects](/usage/projects),
 | 
						||
  [Custom pipeline components](/usage/processing-pipelines#custom-components),
 | 
						||
  [Custom tokenizers](/usage/linguistic-features#custom-tokenizer)
 | 
						||
- **API Reference: ** [Library architecture](/api),
 | 
						||
  [Model architectures](/api/architectures), [Data formats](/api/data-formats)
 | 
						||
- **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
 | 
						||
  [`Transformer`](/api/transformer), [`Lemmatizer`](/api/lemmatizer),
 | 
						||
  [`Morphologizer`](/api/morphologizer),
 | 
						||
  [`AttributeRuler`](/api/attributeruler),
 | 
						||
  [`SentenceRecognizer`](/api/sentencerecognizer), [`Pipe`](/api/pipe),
 | 
						||
  [`Corpus`](/api/corpus)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
## Backwards Incompatibilities {#incompat}
 | 
						||
 | 
						||
As always, we've tried to keep the breaking changes to a minimum and focus on
 | 
						||
changes that were necessary to support the new features, fix problems or improve
 | 
						||
usability. The following section lists the relevant changes to the user-facing
 | 
						||
API. For specific examples of how to rewrite your code, check out the
 | 
						||
[migration guide](#migrating).
 | 
						||
 | 
						||
<Infobox variant="warning">
 | 
						||
 | 
						||
Note that spaCy v3.0 now requires **Python 3.6+**.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### API changes {#incompat-api}
 | 
						||
 | 
						||
- Model symlinks, the `link` command and shortcut names are now deprecated.
 | 
						||
  There can be many [different models](/models) and not just one "English
 | 
						||
  model", so you should always use the full model name like
 | 
						||
  [`en_core_web_sm`](/models/en) explicitly.
 | 
						||
- A model's [`meta.json`](/api/data-formats#meta) is now only used to provide
 | 
						||
  meta information like the model name, author, license and labels. It's **not**
 | 
						||
  used to construct the processing pipeline anymore. This is all defined in the
 | 
						||
  [`config.cfg`](/api/data-formats#config), which also includes all settings
 | 
						||
  used to train the model.
 | 
						||
- The [`train`](/api/cli#train) and [`pretrain`](/api/cli#pretrain) commands now
 | 
						||
  only take a `config.cfg` file containing the full
 | 
						||
  [training config](/usage/training#config).
 | 
						||
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
 | 
						||
  the component factory instead of the component function.
 | 
						||
- **Custom pipeline components** now need to be decorated with the
 | 
						||
  [`@Language.component`](/api/language#component) or
 | 
						||
  [`@Language.factory`](/api/language#factory) decorator.
 | 
						||
- [`Language.update`](/api/language#update) now takes a batch of
 | 
						||
  [`Example`](/api/example) objects instead of raw texts and annotations, or
 | 
						||
  `Doc` and `GoldParse` objects.
 | 
						||
- The `Language.disable_pipes` context manager has been replaced by
 | 
						||
  [`Language.select_pipes`](/api/language#select_pipes), which can explicitly
 | 
						||
  disable or enable components.
 | 
						||
- The [`Language.update`](/api/language#update),
 | 
						||
  [`Language.evaluate`](/api/language#evaluate) and
 | 
						||
  [`Pipe.update`](/api/pipe#update) methods now all take batches of
 | 
						||
  [`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
 | 
						||
  raw text and a dictionary of annotations.
 | 
						||
  [`Language.begin_training`](/api/language#begin_training) and
 | 
						||
  [`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
 | 
						||
  returns a sequence of `Example` objects to initialize the model instead of a
 | 
						||
  list of tuples.
 | 
						||
- [`Matcher.add`](/api/matcher#add),
 | 
						||
  [`PhraseMatcher.add`](/api/phrasematcher#add) and
 | 
						||
  [`DependencyMatcher.add`](/api/dependencymatcher#add) now only accept a list
 | 
						||
  of patterns as the second argument (instead of a variable number of
 | 
						||
  arguments). The `on_match` callback becomes an optional keyword argument.
 | 
						||
 | 
						||
### Removed or renamed API {#incompat-removed}
 | 
						||
 | 
						||
| Removed                                                  | Replacement                                                                                |
 | 
						||
| -------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
 | 
						||
| `Language.disable_pipes`                                 | [`Language.select_pipes`](/api/language#select_pipes)                                      |
 | 
						||
| `GoldParse`                                              | [`Example`](/api/example)                                                                  |
 | 
						||
| `GoldCorpus`                                             | [`Corpus`](/api/corpus)                                                                    |
 | 
						||
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump`          | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
 | 
						||
| `spacy init-model`                                       | [`spacy init model`](/api/cli#init-model)                                                  |
 | 
						||
| `spacy debug-data`                                       | [`spacy debug data`](/api/cli#debug-data)                                                  |
 | 
						||
| `spacy profile`                                          | [`spacy debug profile`](/api/cli#debug-profile)                                            |
 | 
						||
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated                                                  |
 | 
						||
 | 
						||
The following deprecated methods, attributes and arguments were removed in v3.0.
 | 
						||
Most of them have been **deprecated for a while** and many would previously
 | 
						||
raise errors. Many of them were also mostly internals. If you've been working
 | 
						||
with more recent versions of spaCy v2.x, it's **unlikely** that your code relied
 | 
						||
on them.
 | 
						||
 | 
						||
| Removed                                                                                                                 | Replacement                                                                                                                                                |
 | 
						||
| ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | 
						||
| `Doc.tokens_from_list`                                                                                                  | [`Doc.__init__`](/api/doc#init)                                                                                                                            |
 | 
						||
| `Doc.merge`, `Span.merge`                                                                                               | [`Doc.retokenize`](/api/doc#retokenize)                                                                                                                    |
 | 
						||
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower`                                                               | [`Span.text`](/api/span#attributes), [`Token.text`](/api/token#attributes)                                                                                 |
 | 
						||
| `Language.tagger`, `Language.parser`, `Language.entity`                                                                 | [`Language.get_pipe`](/api/language#get_pipe)                                                                                                              |
 | 
						||
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes`                                | `exclude=["vocab"]`                                                                                                                                        |
 | 
						||
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process`                                                                                                                                                |
 | 
						||
| `verbose` argument on [`Language.evaluate`](/api/language#evaluate)                                                     | logging (`DEBUG`)                                                                                                                                          |
 | 
						||
| `SentenceSegmenter` hook, `SimilarityHook`                                                                              | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentenceregognizer) |
 | 
						||
 | 
						||
## Migrating from v2.x {#migrating}
 | 
						||
 | 
						||
### Downloading and loading models {#migrating-downloading-models}
 | 
						||
 | 
						||
Model symlinks and shortcuts like `en` are now officially deprecated. There are
 | 
						||
[many different models](/models) with different capabilities and not just one
 | 
						||
"English model". In order to download and load a model, you should always use
 | 
						||
its full name – for instance, [`en_core_web_sm`](/models/en#en_core_web_sm).
 | 
						||
 | 
						||
```diff
 | 
						||
- python -m spacy download en
 | 
						||
+ python -m spacy download en_core_web_sm
 | 
						||
```
 | 
						||
 | 
						||
```diff
 | 
						||
- nlp = spacy.load("en")
 | 
						||
+ nlp = spacy.load("en_core_web_sm")
 | 
						||
```
 | 
						||
 | 
						||
### Custom pipeline components and factories {#migrating-pipeline-components}
 | 
						||
 | 
						||
Custom pipeline components now have to be registered explicitly using the
 | 
						||
[`@Language.component`](/api/language#component) or
 | 
						||
[`@Language.factory`](/api/language#factory) decorator. For simple functions
 | 
						||
that take a `Doc` and return it, all you have to do is add the
 | 
						||
`@Language.component` decorator to it and assign it a name:
 | 
						||
 | 
						||
```diff
 | 
						||
### Stateless function components
 | 
						||
+ from spacy.language import Language
 | 
						||
 | 
						||
+ @Language.component("my_component")
 | 
						||
def my_component(doc):
 | 
						||
    return doc
 | 
						||
```
 | 
						||
 | 
						||
For class components that are initialized with settings and/or the shared `nlp`
 | 
						||
object, you can use the `@Language.factory` decorator. Also make sure that that
 | 
						||
the method used to initialize the factory has **two named arguments**: `nlp`
 | 
						||
(the current `nlp` object) and `name` (the string name of the component
 | 
						||
instance).
 | 
						||
 | 
						||
```diff
 | 
						||
### Stateful class components
 | 
						||
+ from spacy.language import Language
 | 
						||
 | 
						||
+ @Language.factory("my_component")
 | 
						||
class MyComponent:
 | 
						||
-   def __init__(self, nlp):
 | 
						||
+   def __init__(self, nlp, name):
 | 
						||
        self.nlp = nlp
 | 
						||
 | 
						||
    def __call__(self, doc):
 | 
						||
        return doc
 | 
						||
```
 | 
						||
 | 
						||
Instead of decorating your class, you could also add a factory function that
 | 
						||
takes the arguments `nlp` and `name` and returns an instance of your component:
 | 
						||
 | 
						||
```diff
 | 
						||
### Stateful class components with factory function
 | 
						||
+ from spacy.language import Language
 | 
						||
 | 
						||
+ @Language.factory("my_component")
 | 
						||
+ def create_my_component(nlp, name):
 | 
						||
+     return MyComponent(nlp)
 | 
						||
 | 
						||
class MyComponent:
 | 
						||
    def __init__(self, nlp):
 | 
						||
        self.nlp = nlp
 | 
						||
 | 
						||
    def __call__(self, doc):
 | 
						||
        return doc
 | 
						||
```
 | 
						||
 | 
						||
The `@Language.component` and `@Language.factory` decorators now take care of
 | 
						||
adding an entry to the component factories, so spaCy knows how to load a
 | 
						||
component back in from its string name. You won't have to write to
 | 
						||
`Language.factories` manually anymore.
 | 
						||
 | 
						||
```diff
 | 
						||
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
 | 
						||
```
 | 
						||
 | 
						||
#### Adding components to the pipeline {#migrating-add-pipe}
 | 
						||
 | 
						||
The [`nlp.add_pipe`](/api/language#add_pipe) method now takes the **string
 | 
						||
name** of the component factory instead of a callable component. This allows
 | 
						||
spaCy to track and serialize components that have been added and their settings.
 | 
						||
 | 
						||
```diff
 | 
						||
+ @Language.component("my_component")
 | 
						||
def my_component(doc):
 | 
						||
    return doc
 | 
						||
 | 
						||
- nlp.add_pipe(my_component)
 | 
						||
+ nlp.add_pipe("my_component")
 | 
						||
```
 | 
						||
 | 
						||
[`nlp.add_pipe`](/api/language#add_pipe) now also returns the pipeline component
 | 
						||
itself, so you can access its attributes. The
 | 
						||
[`nlp.create_pipe`](/api/language#create_pipe) method is now mostly internals
 | 
						||
and you typically shouldn't have to use it in your code.
 | 
						||
 | 
						||
```diff
 | 
						||
- parser = nlp.create_pipe("parser")
 | 
						||
- nlp.add_pipe(parser)
 | 
						||
+ parser = nlp.add_pipe("parser")
 | 
						||
```
 | 
						||
 | 
						||
If you need to add a component from an existing pretrained model, you can now
 | 
						||
use the `source` argument on [`nlp.add_pipe`](/api/language#add_pipe). This will
 | 
						||
check that the component is compatible, and take care of porting over all
 | 
						||
config. During training, you can also reference existing pretrained components
 | 
						||
in your [config](/usage/training#config-components) and decide whether or not
 | 
						||
they should be updated with more data.
 | 
						||
 | 
						||
> #### config.cfg (excerpt)
 | 
						||
>
 | 
						||
> ```ini
 | 
						||
> [components.ner]
 | 
						||
> source = "en_core_web_sm"
 | 
						||
> component = "ner"
 | 
						||
> ```
 | 
						||
 | 
						||
```diff
 | 
						||
source_nlp = spacy.load("en_core_web_sm")
 | 
						||
nlp = spacy.blank("en")
 | 
						||
- ner = source_nlp.get_pipe("ner")
 | 
						||
- nlp.add_pipe(ner)
 | 
						||
+ nlp.add_pipe("ner", source=source_nlp)
 | 
						||
```
 | 
						||
 | 
						||
### Adding match patterns {#migrating-matcher}
 | 
						||
 | 
						||
The [`Matcher.add`](/api/matcher#add),
 | 
						||
[`PhraseMatcher.add`](/api/phrasematcher#add) and
 | 
						||
[`DependencyMatcher.add`](/api/dependencymatcher#add) methods now only accept a
 | 
						||
**list of patterns** as the second argument (instead of a variable number of
 | 
						||
arguments). The `on_match` callback becomes an optional keyword argument.
 | 
						||
 | 
						||
```diff
 | 
						||
matcher = Matcher(nlp.vocab)
 | 
						||
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
 | 
						||
- matcher.add("GoogleNow", on_match, *patterns)
 | 
						||
+ matcher.add("GoogleNow", patterns, on_match=on_match)
 | 
						||
```
 | 
						||
 | 
						||
```diff
 | 
						||
matcher = PhraseMatcher(nlp.vocab)
 | 
						||
patterns = [nlp("health care reform"), nlp("healthcare reform")]
 | 
						||
- matcher.add("HEALTH", on_match, *patterns)
 | 
						||
+ matcher.add("HEALTH", patterns, on_match=on_match)
 | 
						||
```
 | 
						||
 | 
						||
### Training models {#migrating-training}
 | 
						||
 | 
						||
To train your models, you should now pretty much always use the
 | 
						||
[`spacy train`](/api/cli#train) CLI. You shouldn't have to put together your own
 | 
						||
training scripts anymore, unless you _really_ want to. The training commands now
 | 
						||
use a [flexible config file](/usage/training#config) that describes all training
 | 
						||
settings and hyperparameters, as well as your pipeline, model components and
 | 
						||
architectures to use. The `--code` argument lets you pass in code containing
 | 
						||
[custom registered functions](/usage/training#custom-code) that you can
 | 
						||
reference in your config. To get started, check out the
 | 
						||
[quickstart widget](/usage/training#quickstart).
 | 
						||
 | 
						||
#### Binary .spacy training data format {#migrating-training-format}
 | 
						||
 | 
						||
spaCy v3.0 uses a new
 | 
						||
[binary training data format](/api/data-formats#binary-training) created by
 | 
						||
serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc`
 | 
						||
objects. This means that you can train spaCy models using the same format it
 | 
						||
outputs: annotated `Doc` objects. The binary format is extremely **efficient in
 | 
						||
storage**, especially when packing multiple documents together. You can convert
 | 
						||
your existing JSON-formatted data using the [`spacy convert`](/api/cli#convert)
 | 
						||
command, which outputs `.spacy` files:
 | 
						||
 | 
						||
```cli
 | 
						||
$ python -m spacy convert ./training.json ./output
 | 
						||
```
 | 
						||
 | 
						||
#### Training config {#migrating-training-config}
 | 
						||
 | 
						||
The easiest way to get started with a training config is to use the
 | 
						||
[`init config`](/api/cli#init-config) command or the
 | 
						||
[quickstart widget](/usage/training#quickstart). You can define your
 | 
						||
requirements, and it will auto-generate a starter config with the best-matching
 | 
						||
default settings.
 | 
						||
 | 
						||
```cli
 | 
						||
$ python -m spacy init config ./config.cfg --lang en --pipeline tagger,parser
 | 
						||
```
 | 
						||
 | 
						||
If you've exported a starter config from our
 | 
						||
[quickstart widget](/usage/training#quickstart), you can use the
 | 
						||
[`init fill-config`](/api/cli#init-fill-config) to fill it with all default
 | 
						||
values. You can then use the auto-generated `config.cfg` for training:
 | 
						||
 | 
						||
```diff
 | 
						||
### {wrap="true"}
 | 
						||
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
 | 
						||
+ python -m spacy train ./config.cfg --output ./output
 | 
						||
```
 | 
						||
 | 
						||
<!-- TODO:
 | 
						||
 | 
						||
<Project id="some_example_project">
 | 
						||
 | 
						||
The easiest way to get started with an end-to-end training process is to clone a
 | 
						||
[project](/usage/projects) template. Projects let you manage multi-step
 | 
						||
workflows, from data preprocessing to training and packaging your model.
 | 
						||
 | 
						||
</Project>
 | 
						||
 | 
						||
-->
 | 
						||
 | 
						||
#### Training via the Python API {#migrating-training-python}
 | 
						||
 | 
						||
For most use cases, you **shouldn't** have to write your own training scripts
 | 
						||
anymore. Instead, you can use [`spacy train`](/api/cli#train) with a
 | 
						||
[config file](/usage/training#config) and custom
 | 
						||
[registered functions](/usage/training#custom-code) if needed. You can even
 | 
						||
register callbacks that can modify the `nlp` object at different stages of its
 | 
						||
lifecycle to fully customize it before training.
 | 
						||
 | 
						||
If you do decide to use the [internal training API](/usage/training#api) from
 | 
						||
Python, you should only need a few small modifications to convert your scripts
 | 
						||
from spaCy v2.x to v3.x. The [`Example.from_dict`](/api/example#from_dict)
 | 
						||
classmethod takes a reference `Doc` and a
 | 
						||
[dictionary of annotations](/api/data-formats#dict-input), similar to the
 | 
						||
"simple training style" in spaCy v2.x:
 | 
						||
 | 
						||
```diff
 | 
						||
### Migrating Doc and GoldParse
 | 
						||
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
 | 
						||
entities = [(0, 15, "PERSON"), (30, 38, "ORG")]
 | 
						||
- gold = GoldParse(doc, entities=entities)
 | 
						||
+ example = Example.from_dict(doc, {"entities": entities})
 | 
						||
```
 | 
						||
 | 
						||
```diff
 | 
						||
### Migrating simple training style
 | 
						||
text = "Mark Zuckerberg is the CEO of Facebook"
 | 
						||
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
 | 
						||
+ doc = nlp.make_doc(text)
 | 
						||
+ example = Example.from_dict(doc, annotations)
 | 
						||
```
 | 
						||
 | 
						||
The [`Language.update`](/api/language#update),
 | 
						||
[`Language.evaluate`](/api/language#evaluate) and
 | 
						||
[`Pipe.update`](/api/pipe#update) methods now all take batches of
 | 
						||
[`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
 | 
						||
raw text and a dictionary of annotations.
 | 
						||
 | 
						||
```python
 | 
						||
### Training loop {highlight="11"}
 | 
						||
TRAIN_DATA = [
 | 
						||
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
 | 
						||
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
 | 
						||
]
 | 
						||
nlp.begin_training()
 | 
						||
for i in range(20):
 | 
						||
    random.shuffle(TRAIN_DATA)
 | 
						||
    for batch in minibatch(TRAIN_DATA):
 | 
						||
        examples = []
 | 
						||
        for text, annots in batch:
 | 
						||
            examples.append(Example.from_dict(nlp.make_doc(text), annots))
 | 
						||
        nlp.update(examples)
 | 
						||
```
 | 
						||
 | 
						||
[`Language.begin_training`](/api/language#begin_training) and
 | 
						||
[`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
 | 
						||
returns a sequence of `Example` objects to initialize the model instead of a
 | 
						||
list of tuples. The data examples are used to **initialize the models** of
 | 
						||
trainable pipeline components, which includes validating the network,
 | 
						||
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
 | 
						||
setting up the label scheme.
 | 
						||
 | 
						||
```diff
 | 
						||
- nlp.begin_training(examples)
 | 
						||
+ nlp.begin_training(lambda: examples)
 | 
						||
```
 | 
						||
 | 
						||
#### Packaging models {#migrating-training-packaging}
 | 
						||
 | 
						||
The [`spacy package`](/api/cli#package) command now automatically builds the
 | 
						||
installable `.tar.gz` sdist of the Python package, so you don't have to run this
 | 
						||
step manually anymore. You can disable the behavior by setting the `--no-sdist`
 | 
						||
flag.
 | 
						||
 | 
						||
```diff
 | 
						||
python -m spacy package ./model ./packages
 | 
						||
- cd /output/en_model-0.0.0
 | 
						||
- python setup.py sdist
 | 
						||
```
 | 
						||
 | 
						||
#### Migration notes for plugin maintainers {#migrating-plugins}
 | 
						||
 | 
						||
Thanks to everyone who's been contributing to the spaCy ecosystem by developing
 | 
						||
and maintaining one of the many awesome [plugins and extensions](/universe).
 | 
						||
We've tried to make it as easy as possible for you to upgrade your packages for
 | 
						||
spaCy v3. The most common use case for plugins is providing pipeline components
 | 
						||
and extension attributes. When migrating your plugin, double-check the
 | 
						||
following:
 | 
						||
 | 
						||
- Use the [`@Language.factory`](/api/language#factory) decorator to register
 | 
						||
  your component and assign it a name. This allows users to refer to your
 | 
						||
  components by name and serialize pipelines referencing them. Remove all manual
 | 
						||
  entries to the `Language.factories`.
 | 
						||
- Make sure your component factories take at least two **named arguments**:
 | 
						||
  `nlp` (the current `nlp` object) and `name` (the instance name of the added
 | 
						||
  component so you can identify multiple instances of the same component).
 | 
						||
- Update all references to [`nlp.add_pipe`](/api/language#add_pipe) in your docs
 | 
						||
  to use **string names** instead of the component functions.
 | 
						||
 | 
						||
```python
 | 
						||
### {highlight="1-5"}
 | 
						||
from spacy.language import Language
 | 
						||
 | 
						||
@Language.factory("my_component", default_config={"some_setting": False})
 | 
						||
def create_component(nlp: Language, name: str, some_setting: bool):
 | 
						||
    return MyCoolComponent(some_setting=some_setting)
 | 
						||
 | 
						||
 | 
						||
class MyCoolComponent:
 | 
						||
    def __init__(self, some_setting):
 | 
						||
        self.some_setting = some_setting
 | 
						||
 | 
						||
    def __call__(self, doc):
 | 
						||
        # Do something to the doc
 | 
						||
        return doc
 | 
						||
```
 | 
						||
 | 
						||
> #### Result in config.cfg
 | 
						||
>
 | 
						||
> ```ini
 | 
						||
> [components.my_component]
 | 
						||
> factory = "my_component"
 | 
						||
> some_setting = true
 | 
						||
> ```
 | 
						||
 | 
						||
```diff
 | 
						||
import spacy
 | 
						||
from your_plugin import MyCoolComponent
 | 
						||
 | 
						||
nlp = spacy.load("en_core_web_sm")
 | 
						||
- component = MyCoolComponent(some_setting=True)
 | 
						||
- nlp.add_pipe(component)
 | 
						||
+ nlp.add_pipe("my_component", config={"some_setting": True})
 | 
						||
```
 | 
						||
 | 
						||
<Infobox title="Important note on registering factories" variant="warning">
 | 
						||
 | 
						||
The [`@Language.factory`](/api/language#factory) decorator takes care of letting
 | 
						||
spaCy know that a component of that name is available. This means that your
 | 
						||
users can add it to the pipeline using its **string name**. However, this
 | 
						||
requires the decorator to be executed – so users will still have to **import
 | 
						||
your plugin**. Alternatively, your plugin could expose an
 | 
						||
[entry point](/usage/saving-loading#entry-points), which spaCy can read from.
 | 
						||
This means that spaCy knows how to initialize `my_component`, even if your
 | 
						||
package isn't imported.
 | 
						||
 | 
						||
</Infobox>
 |