Update docs [ci skip]

This commit is contained in:
Ines Montani 2020-08-10 00:01:38 +02:00
parent 3eaeb73342
commit c044460823
5 changed files with 237 additions and 77 deletions

View File

@ -3,7 +3,7 @@ title: Model Architectures
teaser: Pre-defined model architectures included with the core library teaser: Pre-defined model architectures included with the core library
source: spacy/ml/models source: spacy/ml/models
menu: menu:
- ['Tok2Vec', 'tok2vec'] - ['Tok2Vec', 'tok2vec-arch']
- ['Transformers', 'transformers'] - ['Transformers', 'transformers']
- ['Parser & NER', 'parser'] - ['Parser & NER', 'parser']
- ['Tagging', 'tagger'] - ['Tagging', 'tagger']
@ -236,7 +236,7 @@ and residual connections.
> depth = 4 > depth = 4
> ``` > ```
Encode context using bidirectonal LSTM layers. Requires Encode context using bidirectional LSTM layers. Requires
[PyTorch](https://pytorch.org). [PyTorch](https://pytorch.org).
| Name | Type | Description | | Name | Type | Description |
@ -278,8 +278,6 @@ architectures into your training config.
### spacy-transformers.Tok2VecListener.v1 {#Tok2VecListener} ### spacy-transformers.Tok2VecListener.v1 {#Tok2VecListener}
<!-- TODO: description -->
> #### Example Config > #### Example Config
> >
> ```ini > ```ini
@ -291,10 +289,41 @@ architectures into your training config.
> @layers = "reduce_mean.v1" > @layers = "reduce_mean.v1"
> ``` > ```
| Name | Type | Description | Create a `TransformerListener` layer, which will connect to a
| ------------- | ------------------------- | ---------------------------------------------------------------------------------------------- | [`Transformer`](/api/transformer) component earlier in the pipeline. The layer
| `grad_factor` | float | Factor for weighting the gradient if multiple components listen to the same transformer model. | takes a list of [`Doc`](/api/doc) objects as input, and produces a list of
| `pooling` | `Model[Ragged, Floats2d]` | Pooling layer to determine how the vector for each spaCy token will be computed. | 2-dimensional arrays as output, with each array having one row per token. Most
spaCy models expect a sublayer with this signature, making it easy to connect
them to a transformer model via this sublayer. Transformer models usually
operate over wordpieces, which usually don't align one-to-one against spaCy
tokens. The layer therefore requires a reduction operation in order to calculate
a single token vector given zero or more wordpiece vectors.
| Name | Type | Description |
| ------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `pooling`     | [`Model`](https://thinc.ai/docs/api-model) | **Input:** [`Ragged`](https://thinc.ai/docs/api-types#ragged). **Output:** [`Floats2d`](https://thinc.ai/docs/api-types#types). A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean)) is usually a good choice. |
| `grad_factor` | float | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. |
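The reduction step described above can be sketched in plain Python. This is only an illustration of mean pooling over a ragged layout (one flat list of wordpiece vectors plus a per-token length), not the actual `reduce_mean` implementation. The function name `mean_pool` and the choice to emit a zero vector for tokens with no aligned wordpieces are assumptions made for this example.

```python
from typing import List

Vector = List[float]


def mean_pool(wordpiece_vectors: List[Vector], lengths: List[int], width: int) -> List[Vector]:
    """Average each token's wordpiece vectors into one row per token.

    lengths[i] is the number of wordpieces aligned to token i (may be 0).
    """
    token_vectors = []
    start = 0
    for length in lengths:
        group = wordpiece_vectors[start : start + length]
        if group:
            # Column-wise mean over the wordpieces aligned to this token
            pooled = [sum(column) / length for column in zip(*group)]
        else:
            # Assumption for this sketch: unaligned tokens get a zero vector
            pooled = [0.0] * width
        token_vectors.append(pooled)
        start += length
    return token_vectors


# Three wordpieces: two for the first token, none for the second, one for the third
wordpieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
lengths = [2, 0, 1]
token_rows = mean_pool(wordpieces, lengths, width=2)
```

The output has one row per token regardless of how many wordpieces each token received, which is the signature downstream components rely on.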
### spacy-transformers.Tok2VecTransformer.v1 {#Tok2VecTransformer}
> #### Example Config
>
> ```ini
> # TODO:
> ```
Use a transformer as a [`Tok2Vec`](/api/tok2vec) layer directly. This does
**not** allow multiple components to share the transformer weights, and does
**not** allow the transformer to set annotations into the [`Doc`](/api/doc)
object, but it's a **simpler solution** if you only need the transformer within
one component.
| Name | Type | Description |
| ------------------ | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_spans`        | callable                                   | Function that takes a batch of [`Doc`](/api/doc) objects and returns lists of [`Span`](/api/span) objects to be processed by the transformer. [See here](/api/transformer#span_getters) for built-in options and examples. |
| `tokenizer_config` | `Dict[str, Any]` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). |
| `pooling`          | [`Model`](https://thinc.ai/docs/api-model) | **Input:** [`Ragged`](https://thinc.ai/docs/api-types#ragged). **Output:** [`Floats2d`](https://thinc.ai/docs/api-types#types). A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean)) is usually a good choice. |
| `grad_factor` | float | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. |
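The effect of `grad_factor` can be illustrated with a small sketch. It assumes the reweighting is a plain scalar multiplication of the gradient before it is passed upstream; the helper `reweight_gradient` is hypothetical and not part of spacy-transformers.

```python
from typing import List


def reweight_gradient(gradient: List[float], grad_factor: float) -> List[float]:
    """Scale a component's gradient before passing it upstream (toy version)."""
    return [grad_factor * g for g in gradient]


d_output = [0.5, -1.0, 2.0]

# grad_factor = 0 "freezes" the transformer with respect to this component
frozen = reweight_gradient(d_output, 0.0)
# grad_factor = 1.0 passes the gradient through unchanged
unchanged = reweight_gradient(d_output, 1.0)
# Values in between make this component count for less than the others
halved = reweight_gradient(d_output, 0.5)
```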
## Parser & NER architectures {#parser} ## Parser & NER architectures {#parser}
@ -595,8 +624,6 @@ A function that creates a default, empty `KnowledgeBase` from a
A function that takes as input a [`KnowledgeBase`](/api/kb) and a A function that takes as input a [`KnowledgeBase`](/api/kb) and a
[`Span`](/api/span) object denoting a named entity, and returns a list of [`Span`](/api/span) object denoting a named entity, and returns a list of
plausible [`Candidate` objects](/api/kb/#candidate_init). plausible [`Candidate` objects](/api/kb/#candidate_init). The default
`CandidateGenerator` simply uses the text of a mention to find its potential
The default `CandidateGenerator` simply uses the text of a mention to find its aliases in the `KnowledgeBase`. Note that this function is case-dependent.
potential aliases in the Knowledgebase. Note that this function is
case-dependent.

View File

@ -242,6 +242,21 @@ a batch of [Example](/api/example) objects.
Update the models in the pipeline. Update the models in the pipeline.
<Infobox variant="warning" title="Changed in v3.0">
The `Language.update` method now takes a batch of [`Example`](/api/example)
objects instead of the raw texts and annotations or `Doc` and `GoldParse`
objects. An [`Example`](/api/example) streamlines how data is passed around. It
stores two `Doc` objects: one for holding the gold-standard reference data, and
one for holding the predictions of the pipeline.
For most use cases, you shouldn't have to write your own training scripts
anymore. Instead, you can use [`spacy train`](/api/cli#train) with a config file
and custom registered functions if needed. See the
[training documentation](/usage/training) for details.
</Infobox>
> #### Example > #### Example
> >
> ```python > ```python
@ -253,7 +268,7 @@ Update the models in the pipeline.
| Name | Type | Description | | Name | Type | Description |
| --------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | | --------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| `examples` | `Iterable[Example]` | A batch of `Example` objects to learn from. | | `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
| _keyword-only_ | | | | _keyword-only_ | | |
| `drop` | float | The dropout rate. | | `drop` | float | The dropout rate. |
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. | | `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |

View File

@ -9,6 +9,28 @@ api_string_name: lemmatizer
api_trainable: false api_trainable: false
--- ---
Component for assigning base forms to tokens using rules based on part-of-speech
tags, or lookup tables. Functionality to train the component is coming soon.
Different [`Language`](/api/language) subclasses can implement their own
lemmatizer components via
[language-specific factories](/usage/processing-pipelines#factories-language).
The default data used is provided by the
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
extension package.
<Infobox variant="warning" title="New in v3.0">
As of v3.0, the `Lemmatizer` is a **standalone pipeline component** that can be
added to your pipeline, and not a hidden part of the vocab that runs behind the
scenes. This makes it easier to customize how lemmas should be assigned in your
pipeline.
If the lemmatization mode is set to `"rule"` and requires part-of-speech tags to
be assigned, make sure a [`Tagger`](/api/tagger) or another component assigning
tags is available in the pipeline and runs _before_ the lemmatizer.
</Infobox>
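To see why the `"rule"` mode depends on tags while `"lookup"` does not, consider the toy sketch below. The tables, the suffix rules and the `lemmatize` function are all hypothetical and are not spaCy's algorithm or the data shipped in `spacy-lookups-data`; they only illustrate the difference between the two modes.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy data for illustration only
LEMMA_LOOKUP: Dict[str, str] = {"mice": "mouse", "was": "be"}
LEMMA_RULES: Dict[str, List[Tuple[str, str]]] = {
    "NOUN": [("ies", "y"), ("s", "")],
    "VERB": [("ing", "")],
}


def lemmatize(word: str, mode: str = "lookup", pos: Optional[str] = None) -> str:
    if mode == "lookup":
        # Lookup mode only needs the word form itself
        return LEMMA_LOOKUP.get(word, word)
    if mode == "rule":
        # Rule mode picks its suffix rules by part-of-speech tag,
        # so a tag must already have been assigned
        if pos is None:
            raise ValueError("rule mode needs a part-of-speech tag")
        for suffix, replacement in LEMMA_RULES.get(pos, []):
            if word.endswith(suffix):
                return word[: len(word) - len(suffix)] + replacement
        return word
    raise ValueError(f"unknown mode: {mode}")


lookup_lemma = lemmatize("mice")
rule_lemma = lemmatize("ponies", mode="rule", pos="NOUN")
```

Calling the rule mode without a tag fails in this sketch, which mirrors the requirement that a tagging component runs before the lemmatizer.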
## Config and implementation ## Config and implementation
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes
@ -29,7 +51,7 @@ lemmatizers, see the
| Setting | Type | Description | Default | | Setting | Type | Description | Default |
| ----------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | | ----------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- |
| `mode` | str | The lemmatizer mode, e.g. "lookup" or "rule". | `"lookup"` | | `mode` | str | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. | `"lookup"` |
| `lookups` | [`Lookups`](/api/lookups) | The lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. If `None`, default tables are loaded from `spacy-lookups-data`. | `None` | | `lookups` | [`Lookups`](/api/lookups) | The lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. If `None`, default tables are loaded from `spacy-lookups-data`. | `None` |
| `overwrite` | bool | Whether to overwrite existing lemmas. | `False` | | `overwrite` | bool | Whether to overwrite existing lemmas. | `False` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Not yet implemented:** the model to use. | `None` | | `model` | [`Model`](https://thinc.ai/docs/api-model) | **Not yet implemented:** the model to use. | `None` |
@ -55,15 +77,15 @@ Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe). [`nlp.add_pipe`](/api/language#add_pipe).
| Name | Type | Description | | Name | Type | Description |
| -------------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------- | | -------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | [`Vocab`](/api/vocab) | The vocab. | | `vocab` | [`Vocab`](/api/vocab) | The vocab. |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model (not yet implemented). | | `model` | [`Model`](https://thinc.ai/docs/api-model) | A model (not yet implemented). |
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. | | `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
| _keyword-only_ | | | | _keyword-only_ | | |
| mode | str | The lemmatizer mode, e.g. "lookup" or "rule". Defaults to "lookup". | | mode | str | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. |
| lookups | [`Lookups`](/api/lookups) | A lookups object containing the tables such as "lemma_rules", "lemma_index", "lemma_exc" and "lemma_lookup". Defaults to `None`. | | lookups | [`Lookups`](/api/lookups) | A lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. Defaults to `None`. |
| overwrite | bool | Whether to overwrite existing lemmas. | | overwrite | bool | Whether to overwrite existing lemmas. |
## Lemmatizer.\_\_call\_\_ {#call tag="method"} ## Lemmatizer.\_\_call\_\_ {#call tag="method"}

View File

@ -25,8 +25,15 @@ work out-of-the-box.
</Infobox> </Infobox>
This pipeline component lets you use transformer models in your pipeline. The This pipeline component lets you use transformer models in your pipeline, using
component assigns the output of the transformer to the Doc's extension the [HuggingFace `transformers`](https://huggingface.co/transformers) library
under the hood. Usually you will connect subsequent components to the shared
transformer using the
[TransformerListener](/api/architectures#TransformerListener) layer. This works
similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
[Tok2VecListener](/api/architectures#Tok2VecListener) sublayer.
The component assigns the output of the transformer to the `Doc`'s extension
attributes. We also calculate an alignment between the word-piece tokens and the attributes. We also calculate an alignment between the word-piece tokens and the
spaCy tokenization, so that we can use the last hidden states to set the spaCy tokenization, so that we can use the last hidden states to set the
`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy `Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
@ -53,11 +60,11 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG) > nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
> ``` > ```
| Setting | Type | Description | Default | | Setting | Type | Description | Default |
| ------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- | | ------------------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| `max_batch_items` | int | Maximum size of a padded batch. | `4096` | | `max_batch_items` | int | Maximum size of a padded batch. | `4096` |
| `annotation_setter` | Callable | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | `null_annotation_setter` | | `annotation_setter` | Callable | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set. | `null_annotation_setter` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransformerModel](/api/architectures#TransformerModel) | | `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** [`FullTransformerBatch`](/api/transformer#fulltransformerbatch). The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. | [TransformerModel](/api/architectures#TransformerModel) |
```python ```python
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
@ -86,18 +93,22 @@ https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/p
> trf = Transformer(nlp.vocab, model) > trf = Transformer(nlp.vocab, model)
> ``` > ```
Create a new pipeline instance. In your application, you would normally use a Construct a `Transformer` component. One or more subsequent spaCy components can
shortcut for this and instantiate the component using its string name and use the transformer outputs as features in their models, with gradients
[`nlp.add_pipe`](/api/language#create_pipe). backpropagated to the single shared weights. The activations from the
transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
attribute. You can also provide a callback to set additional annotations. In
your application, you would normally use a shortcut for this and instantiate the
component using its string name and [`nlp.add_pipe`](/api/language#add_pipe).
| Name | Type | Description | | Name | Type | Description |
| ------------------- | ------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. | | `vocab` | `Vocab` | The shared vocabulary. |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. | | `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** [`FullTransformerBatch`](/api/transformer#fulltransformerbatch). The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. |
| `annotation_setter` | `Callable` | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. Defaults to `null_annotation_setter`, a function that does nothing. | | `annotation_setter` | `Callable` | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set. |
| _keyword-only_ | | | | _keyword-only_ | | |
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. | | `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
| `max_batch_items` | int | Maximum size of a padded batch. Defaults to `128*32`. | | `max_batch_items` | int | Maximum size of a padded batch. Defaults to `128*32`. |
## Transformer.\_\_call\_\_ {#call tag="method"} ## Transformer.\_\_call\_\_ {#call tag="method"}
@ -184,7 +195,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.
## Transformer.set_annotations {#set_annotations tag="method"} ## Transformer.set_annotations {#set_annotations tag="method"}
Modify a batch of documents, using pre-computed scores. Assign the extracted features to the Doc objects. By default, the
[`TransformerData`](/api/transformer#transformerdata) object is written to the
[`Doc._.trf_data`](#custom-attributes) attribute. Your `annotation_setter`
callback is then called, if provided.
> #### Example > #### Example
> >
@ -201,8 +215,19 @@ Modify a batch of documents, using pre-computed scores.
## Transformer.update {#update tag="method"} ## Transformer.update {#update tag="method"}
Learn from a batch of documents and gold-standard information, updating the Prepare for an update to the transformer. Like the [`Tok2Vec`](/api/tok2vec)
pipe's model. Delegates to [`predict`](/api/transformer#predict). component, the `Transformer` component is unusual in that it does not receive
"gold standard" annotations to calculate a weight update. The optimal output of
the transformer data is unknown; it's a hidden layer inside the network that is
updated by backpropagating from output layers.
The `Transformer` component therefore does **not** perform a weight update
during its own `update` method. Instead, it runs its transformer model and
communicates the output and the backpropagation callback to any **downstream
components** that have been connected to it via the
[TransformerListener](/api/architectures#TransformerListener) sublayer. If there
are multiple listeners, the last layer will actually backprop to the transformer
and call the optimizer, while the others simply increment the gradients.
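The accumulate-then-update behavior described above can be sketched with a toy scalar model. This is not the spacy-transformers implementation: the `SharedEncoder` class, its single `weight` and the fixed learning rate are assumptions made purely to show how gradients from several listeners are summed and applied once.

```python
class SharedEncoder:
    """Toy stand-in for a shared transformer with several listeners."""

    def __init__(self, n_listeners: int) -> None:
        self.n_listeners = n_listeners
        self.weight = 1.0
        self.accumulated = 0.0
        self.received = 0
        self.updates = 0
        self._x = 0.0

    def begin_update(self, x: float) -> float:
        # Forward pass: reset per-batch state and cache the input for backprop
        self.accumulated = 0.0
        self.received = 0
        self._x = x
        return self.weight * x

    def receive_gradient(self, d_output: float, learn_rate: float = 0.1) -> None:
        # Every listener adds its gradient to the running total
        self.accumulated += d_output
        self.received += 1
        if self.received == self.n_listeners:
            # Only the last listener triggers the actual weight update;
            # d(weight * x) / d(weight) = x
            self.weight -= learn_rate * self.accumulated * self._x
            self.updates += 1


encoder = SharedEncoder(n_listeners=2)
output = encoder.begin_update(2.0)
encoder.receive_gradient(0.5)   # first listener: gradient is only accumulated
updates_after_first = encoder.updates
encoder.receive_gradient(1.5)   # last listener: summed gradient is applied
updates_after_last = encoder.updates
```

The shared weights change exactly once per batch, using the sum of all listener gradients, rather than once per downstream component.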
> #### Example > #### Example
> >
@ -212,15 +237,15 @@ pipe's model. Delegates to [`predict`](/api/transformer#predict).
> losses = trf.update(examples, sgd=optimizer) > losses = trf.update(examples, sgd=optimizer)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | | ----------------- | --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. | | `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects. Only the [`Example.predicted`](/api/example#predicted) `Doc` object is used, the reference `Doc` is ignored. |
| _keyword-only_ | | | | _keyword-only_ | | |
| `drop` | float | The dropout rate. | | `drop` | float | The dropout rate. |
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/transformer#set_annotations). | | `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/transformer#set_annotations). |
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. | | `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. | | `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. | | **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
## Transformer.create_optimizer {#create_optimizer tag="method"} ## Transformer.create_optimizer {#create_optimizer tag="method"}
@ -396,14 +421,16 @@ Split a `TransformerData` object that represents a batch into a list with one
## Span getters {#span_getters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"} ## Span getters {#span_getters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}
<!-- TODO: details on what this is for -->
Span getters are functions that take a batch of [`Doc`](/api/doc) objects and Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
return a list of [`Span`](/api/span) objects for each doc, to be processed by return a list of [`Span`](/api/span) objects for each doc, to be processed by
the transformer. The returned spans can overlap. Span getters can be referenced the transformer. This is used to manage long documents, by cutting them into
in the config's `[components.transformer.model.get_spans]` block to customize smaller sequences before running the transformer. The spans are allowed to
the sequences processed by the transformer. You can also register custom span overlap, and you can also omit sections of the Doc if they are not relevant.
getters using the `@registry.span_getters` decorator.
Span getters can be referenced in the config's
`[components.transformer.model.get_spans]` block to customize the sequences
processed by the transformer. You can also register custom span getters using
the `@registry.span_getters` decorator.
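The idea of cutting a long document into smaller, possibly overlapping sequences can be sketched as follows. The helper `strided_windows` is hypothetical and works on plain token offsets; a real span getter would return `Span` objects (for example `doc[start:end]`) and would be registered with the `@registry.span_getters` decorator mentioned above.

```python
from typing import List, Tuple


def strided_windows(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets that cover a doc of n_tokens tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            # The final window reached the end of the doc
            break
        start += stride
    return spans


# A stride smaller than the window produces overlapping spans
offsets = strided_windows(10, window=4, stride=3)
```

With a stride equal to the window size the spans tile the document without overlap. A stride larger than the window would leave gaps, which corresponds to omitting sections of the `Doc`.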
> #### Example > #### Example
> >

View File

@ -6,25 +6,97 @@ menu:
- ['New Features', 'features'] - ['New Features', 'features']
- ['Backwards Incompatibilities', 'incompat'] - ['Backwards Incompatibilities', 'incompat']
- ['Migrating from v2.x', 'migrating'] - ['Migrating from v2.x', 'migrating']
- ['Migrating plugins', 'plugins']
--- ---
## Summary {#summary} ## Summary {#summary}
## New Features {#features} ## New Features {#features}
### New training workflow and config system {#features-training}
### Transformer-based pipelines {#features-transformers}
### Custom models using any framework {#feautres-custom-models}
### Manage end-to-end workflows with projects {#features-projects}
### New built-in pipeline components {#features-pipeline-components}
| Name | Description |
| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. |
| [`Morphologizer`](/api/morphologizer) | Trainable component to predict morphological features. |
| [`Lemmatizer`](/api/lemmatizer) | Standalone component for rule-based and lookup lemmatization. |
| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
### New and improved pipeline component APIs {#features-components}
- `Language.factory`, `Language.component`
- `Language.analyze_pipes`
- Adding components from other models
### Type hints and type-based data validation {#features-types}
spaCy v3.0 officially drops support for Python 2 and now requires **Python
3.6+**. This also means that the code base can take full advantage of
[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing
API that's implemented in pure Python (as opposed to Cython) now comes with type
hints. The new version of spaCy's machine learning library
[Thinc](https://thinc.ai) also features extensive
[type support](https://thinc.ai/docs/usage-type-checking/), including custom
types for models and arrays, and a custom `mypy` plugin that can be used to
type-check model definitions.
For data validation, spaCy v3.0 adopts
[`pydantic`](https://github.com/samuelcolvin/pydantic). It also powers the data
validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which
lets you register **custom functions with typed arguments**, reference them
in your config and see validation errors if the argument values don't match.
### CLI
| Name | Description |
| --------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| [`init config`](/api/cli#init-config) | Initialize a [training config](/usage/training) file for a blank language or auto-fill a partial config. |
| [`debug config`](/api/cli#debug-config) | Debug a [training config](/usage/training) file and show validation errors. |
| [`project`](/api/cli#project) | Subcommand for cloning and running [spaCy projects](/usage/projects). |
## Backwards Incompatibilities {#incompat} ## Backwards Incompatibilities {#incompat}
### Removed or renamed objects, methods, attributes and arguments {#incompat-removed} As always, we've tried to keep the breaking changes to a minimum and focus on
changes that were necessary to support the new features, fix problems or improve
usability. The following section lists the relevant changes to the user-facing
API. For specific examples of how to rewrite your code, check out the
[migration guide](#migrating).
| Removed | Replacement | ### Compatibility {#incompat-compat}
| -------------------------------------------------------- | ----------------------------------------- |
| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
### Removed deprecated methods, attributes and arguments {#incompat-removed-deprecated} - spaCy now requires **Python 3.6+**.
### API changes {#incompat-api}
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
the component factory instead of the component function.
- **Custom pipeline components** now need to be decorated with the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator.
- [`Language.update`](/api/language#update) now takes a batch of
[`Example`](/api/example) objects instead of raw texts and annotations, or
`Doc` and `GoldParse` objects.
- The `Language.disable_pipes` context manager has been replaced by
[`Language.select_pipes`](/api/language#select_pipes), which can explicitly
disable or enable components.
### Removed or renamed API {#incompat-removed}
| Removed | Replacement |
| -------------------------------------------------------- | ----------------------------------------------------- |
| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) |
| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
The following deprecated methods, attributes and arguments were removed in v3.0. The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while** and many would previously Most of them have been **deprecated for a while** and many would previously
@ -214,17 +286,14 @@ python -m spacy package ./model ./packages
- python setup.py sdist - python setup.py sdist
``` ```
## Migration notes for plugin maintainers {#plugins} #### Migration notes for plugin maintainers {#migrating-plugins}
Thanks to everyone who's been contributing to the spaCy ecosystem by developing Thanks to everyone who's been contributing to the spaCy ecosystem by developing
and maintaining one of the many awesome [plugins and extensions](/universe). and maintaining one of the many awesome [plugins and extensions](/universe).
We've tried to keep breaking changes to a minimum and make it as easy as We've tried to make it as easy as possible for you to upgrade your packages for
possible for you to upgrade your packages for spaCy v3. spaCy v3. The most common use case for plugins is providing pipeline components
and extension attributes. When migrating your plugin, double-check the
### Custom pipeline components following:
The most common use case for plugins is providing pipeline components and
extension attributes.
- Use the [`@Language.factory`](/api/language#factory) decorator to register - Use the [`@Language.factory`](/api/language#factory) decorator to register
your component and assign it a name. This allows users to refer to your your component and assign it a name. This allows users to refer to your