Update docs [ci skip]

2025-07-15 18:52:29 +03:00 · 2020-08-10 00:01:38 +02:00 · 2020-08-10 00:01:38 +02:00 · c044460823
commit c044460823
parent 3eaeb73342
5 changed files with 237 additions and 77 deletions
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@ -3,7 +3,7 @@ title: Model Architectures
 teaser: Pre-defined model architectures included with the core library
 source: spacy/ml/models
 menu:
-  - ['Tok2Vec', 'tok2vec']
+  - ['Tok2Vec', 'tok2vec-arch']
  - ['Transformers', 'transformers']
  - ['Parser & NER', 'parser']
  - ['Tagging', 'tagger']
@ -236,7 +236,7 @@ and residual connections.
 > depth = 4
 > ```

-Encode context using bidirectonal LSTM layers. Requires
+Encode context using bidirectional LSTM layers. Requires
 [PyTorch](https://pytorch.org).

 | Name          | Type | Description                                                                                                                                                                                            |
@ -278,8 +278,6 @@ architectures into your training config.

 ### spacy-transformers.Tok2VecListener.v1 {#Tok2VecListener}

-<!-- TODO: description -->
-
 > #### Example Config
 >
 > ```ini
@ -291,10 +289,41 @@ architectures into your training config.
 > @layers = "reduce_mean.v1"
 > ```

-| Name          | Type                      | Description                                                                                    |
-| ------------- | ------------------------- | ---------------------------------------------------------------------------------------------- |
-| `grad_factor` | float                     | Factor for weighting the gradient if multiple components listen to the same transformer model. |
-| `pooling`     | `Model[Ragged, Floats2d]` | Pooling layer to determine how the vector for each spaCy token will be computed.               |
+Create a `TransformerListener` layer, which will connect to a
+[`Transformer`](/api/transformer) component earlier in the pipeline. The layer
+takes a list of [`Doc`](/api/doc) objects as input, and produces a list of
+2-dimensional arrays as output, with each array having one row per token. Most
+spaCy models expect a sublayer with this signature, making it easy to connect
+them to a transformer model via this sublayer. Transformer models usually
+operate over wordpieces, which usually don't align one-to-one against spaCy
+tokens. The layer therefore requires a reduction operation in order to calculate
+a single token vector given zero or more wordpiece vectors.
+
+| Name          | Type                                       | Description                                                                                                                                                                                                                                                         |
+| ------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `pooling`     | [`Model`](https://thinc.ai/docs/api-model) | **Input:** [`Ragged`](https://thinc.ai/docs/api-types#ragged). **Output:** [`Floats2d`](https://thinc.ai/docs/api-types#types). A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean)) is usually a good choice. |
+| `grad_factor` | float                                      | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. |
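
To make the reduction step concrete, here is a minimal pure-Python sketch of mean pooling over wordpiece vectors. It is not the Thinc `reduce_mean` implementation: the function name `mean_pool_wordpieces`, the `lengths` argument and the choice to return a zero vector for a token with no wordpieces are assumptions made for illustration only.

```python
from typing import List, Sequence


def mean_pool_wordpieces(
    wordpiece_vectors: Sequence[Sequence[float]],
    lengths: Sequence[int],
    width: int,
) -> List[List[float]]:
    """Average the wordpiece vectors aligned to each token.

    lengths[i] is the number of consecutive rows in wordpiece_vectors that
    belong to token i. A token with zero wordpieces gets a zero vector
    (an assumption of this sketch).
    """
    token_vectors: List[List[float]] = []
    start = 0
    for n in lengths:
        rows = wordpiece_vectors[start : start + n]
        if n == 0:
            token_vectors.append([0.0] * width)
        else:
            # zip(*rows) iterates column-wise over the aligned wordpieces
            token_vectors.append([sum(col) / n for col in zip(*rows)])
        start += n
    return token_vectors


# Three wordpiece vectors: two for the first token, none for the second,
# one for the third.
wordpieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool_wordpieces(wordpieces, lengths=[2, 0, 1], width=2)
print(pooled)
```

The output has one row per token regardless of how many wordpieces each token was split into, which is the signature downstream layers rely on.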
+
+### spacy-transformers.Tok2VecTransformer.v1 {#Tok2VecTransformer}
+
+> #### Example Config
+>
+> ```ini
+> # TODO:
+> ```
+
+Use a transformer as a [`Tok2Vec`](/api/tok2vec) layer directly. This does
+**not** allow multiple components to share the transformer weights, and does
+**not** allow the transformer to set annotations into the [`Doc`](/api/doc)
+object, but it's a **simpler solution** if you only need the transformer within
+one component.
+
+| Name               | Type                                       | Description                                                                                                                                                                                                                                                         |
+| ------------------ | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_spans`        | callable                                   | Function that takes a batch of [`Doc`](/api/doc) objects and returns lists of [`Span`](/api/span) objects to be processed by the transformer. [See here](/api/transformer#span_getters) for built-in options and examples.                                          |
+| `tokenizer_config` | `Dict[str, Any]`                           | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer).                                                                                                                    |
+| `pooling`          | [`Model`](https://thinc.ai/docs/api-model) | **Input:** [`Ragged`](https://thinc.ai/docs/api-types#ragged). **Output:** [`Floats2d`](https://thinc.ai/docs/api-types#types). A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean)) is usually a good choice. |
+| `grad_factor`      | float                                      | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. |

 ## Parser & NER architectures {#parser}

@ -595,8 +624,6 @@ A function that creates a default, empty `KnowledgeBase` from a

 A function that takes as input a [`KnowledgeBase`](/api/kb) and a
 [`Span`](/api/span) object denoting a named entity, and returns a list of
-plausible [`Candidate` objects](/api/kb/#candidate_init).
-
-The default `CandidateGenerator` simply uses the text of a mention to find its
-potential aliases in the Knowledgebase. Note that this function is
-case-dependent.
+plausible [`Candidate` objects](/api/kb/#candidate_init). The default
+`CandidateGenerator` simply uses the text of a mention to find its potential
+aliases in the `KnowledgeBase`. Note that this function is case-dependent.
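
The case-dependence noted above can be illustrated with a toy alias table. This is only a sketch of the lookup behavior, not the `KnowledgeBase` API: the `ALIASES` dictionary, the `get_candidates` helper and the entity IDs are hypothetical.

```python
from typing import Dict, List

# Hypothetical alias table mapping mention text to entity IDs.
ALIASES: Dict[str, List[str]] = {
    "Douglas Adams": ["ID_1"],
    "Adams": ["ID_1", "ID_2"],
}


def get_candidates(alias_table: Dict[str, List[str]], mention_text: str) -> List[str]:
    """Return candidate entity IDs for the exact mention text (case-sensitive)."""
    return alias_table.get(mention_text, [])


exact = get_candidates(ALIASES, "Adams")
lowercased = get_candidates(ALIASES, "adams")
print(exact, lowercased)
```

Because the lookup uses the mention text as-is, a lowercased mention finds no aliases unless the table also stores a lowercased form.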
--- a/website/docs/api/language.md
+++ b/website/docs/api/language.md
@ -242,6 +242,21 @@ a batch of [Example](/api/example) objects.

 Update the models in the pipeline.

+<Infobox variant="warning" title="Changed in v3.0">
+
+The `Language.update` method now takes a batch of [`Example`](/api/example)
+objects instead of the raw texts and annotations or `Doc` and `GoldParse`
+objects. An [`Example`](/api/example) streamlines how data is passed around. It
+stores two `Doc` objects: one for holding the gold-standard reference data, and
+one for holding the predictions of the pipeline.
+
+For most use cases, you shouldn't have to write your own training scripts
+anymore. Instead, you can use [`spacy train`](/api/cli#train) with a config file
+and custom registered functions if needed. See the
+[training documentation](/usage/training) for details.
+
+</Infobox>
+
 > #### Example
 >
 > ```python
@ -253,7 +268,7 @@ Update the models in the pipeline.

 | Name            | Type                                                | Description                                                                                            |
 | --------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
-| `examples`      | `Iterable[Example]`                                 | A batch of `Example` objects to learn from.                                                            |
+| `examples`      | `Iterable[Example]`                                 | A batch of [`Example`](/api/example) objects to learn from.                                            |
 | _keyword-only_  |                                                     |                                                                                                        |
 | `drop`          | float                                               | The dropout rate.                                                                                      |
 | `sgd`           | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer.                                                                                         |
--- a/website/docs/api/lemmatizer.md
+++ b/website/docs/api/lemmatizer.md
@ -9,6 +9,28 @@ api_string_name: lemmatizer
 api_trainable: false
 ---

+Component for assigning base forms to tokens using rules based on part-of-speech
+tags, or lookup tables. Functionality to train the component is coming soon.
+Different [`Language`](/api/language) subclasses can implement their own
+lemmatizer components via
+[language-specific factories](/usage/processing-pipelines#factories-language).
+The default data used is provided by the
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+extension package.
+
+<Infobox variant="warning" title="New in v3.0">
+
+As of v3.0, the `Lemmatizer` is a **standalone pipeline component** that can be
+added to your pipeline, and not a hidden part of the vocab that runs behind the
+scenes. This makes it easier to customize how lemmas should be assigned in your
+pipeline.
+
+If the lemmatization mode is set to `"rule"` and requires part-of-speech tags to
+be assigned, make sure a [`Tagger`](/api/tagger) or another component assigning
+tags is available in the pipeline and runs _before_ the lemmatizer.
+
+</Infobox>
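
The difference between the two modes, and why rule-based lemmatization needs tags to be assigned first, can be sketched in plain Python. This is a toy model rather than the spaCy `Lemmatizer`: the tables `LOOKUP_TABLE` and `SUFFIX_RULES` and the `lemmatize` function are hypothetical and far simpler than the data in `spacy-lookups-data`.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical tables for illustration only.
LOOKUP_TABLE: Dict[str, str] = {"mice": "mouse", "went": "go"}
SUFFIX_RULES: Dict[str, List[Tuple[str, str]]] = {
    "NOUN": [("ies", "y"), ("s", "")],
    "VERB": [("ing", ""), ("ed", "")],
}


def lemmatize(word: str, pos: Optional[str] = None, mode: str = "lookup") -> str:
    """Return a lemma using either a lookup table or POS-specific suffix rules."""
    if mode == "lookup":
        # Lookup mode ignores the part-of-speech tag entirely.
        return LOOKUP_TABLE.get(word, word)
    if mode == "rule":
        if pos is None:
            # Mirrors the requirement that a tagger runs before the lemmatizer.
            raise ValueError("rule mode needs a part-of-speech tag")
        for old, new in SUFFIX_RULES.get(pos, []):
            if word.endswith(old):
                return word[: len(word) - len(old)] + new
        return word
    raise ValueError(f"unknown mode: {mode}")


print(lemmatize("mice"))
print(lemmatize("ponies", pos="NOUN", mode="rule"))
print(lemmatize("walked", pos="VERB", mode="rule"))
```

In this sketch, calling `lemmatize("walked", mode="rule")` without a tag raises an error, which is the toy equivalent of running the lemmatizer before any component has assigned tags.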
+
 ## Config and implementation

 The default config is defined by the pipeline component factory and describes
@ -29,7 +51,7 @@ lemmatizers, see the

 | Setting     | Type                                       | Description                                                                                                                                                                            | Default    |
 | ----------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- |
-| `mode`      | str                                        | The lemmatizer mode, e.g. "lookup" or "rule".                                                                                                                                          | `"lookup"` |
+| `mode`      | str                                        | The lemmatizer mode, e.g. `"lookup"` or `"rule"`.                                                                                                                                      | `"lookup"` |
 | `lookups`   | [`Lookups`](/api/lookups)                  | The lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. If `None`, default tables are loaded from `spacy-lookups-data`. | `None`     |
 | `overwrite` | bool                                       | Whether to overwrite existing lemmas.                                                                                                                                                  | `False`    |
 | `model`     | [`Model`](https://thinc.ai/docs/api-model) | **Not yet implemented:** the model to use.                                                                                                                                             | `None`     |
@ -55,15 +77,15 @@ Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#add_pipe).

-| Name           | Type                                       | Description                                                                                                                      |
-| -------------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab`        | [`Vocab`](/api/vocab)                      | The vocab.                                                                                                                       |
-| `model`        | [`Model`](https://thinc.ai/docs/api-model) | A model (not yet implemented).                                                                                                   |
-| `name`         | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                      |
-| _keyword-only_ |                                            |                                                                                                                                  |
-| mode           | str                                        | The lemmatizer mode, e.g. "lookup" or "rule". Defaults to "lookup".                                                              |
-| lookups        | [`Lookups`](/api/lookups)                  | A lookups object containing the tables such as "lemma_rules", "lemma_index", "lemma_exc" and "lemma_lookup". Defaults to `None`. |
-| overwrite      | bool                                       | Whether to overwrite existing lemmas.                                                                                            |
+| Name           | Type                                       | Description                                                                                                                              |
+| -------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab`        | [`Vocab`](/api/vocab)                      | The vocab.                                                                                                                               |
+| `model`        | [`Model`](https://thinc.ai/docs/api-model) | A model (not yet implemented).                                                                                                           |
+| `name`         | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                              |
+| _keyword-only_ |                                            |                                                                                                                                          |
+| `mode`         | str                                        | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`.                                                                |
+| `lookups`      | [`Lookups`](/api/lookups)                  | A lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. Defaults to `None`. |
+| `overwrite`    | bool                                       | Whether to overwrite existing lemmas.                                                                                                    |

 ## Lemmatizer.\_\_call\_\_ {#call tag="method"}

--- a/website/docs/api/transformer.md
+++ b/website/docs/api/transformer.md
@ -25,8 +25,15 @@ work out-of-the-box.

 </Infobox>

-This pipeline component lets you use transformer models in your pipeline. The
-component assigns the output of the transformer to the Doc's extension
+This pipeline component lets you use transformer models in your pipeline, using
+the [HuggingFace `transformers`](https://huggingface.co/transformers) library
+under the hood. Usually you will connect subsequent components to the shared
+transformer using the
+[TransformerListener](/api/architectures#TransformerListener) layer. This works
+similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
+[Tok2VecListener](/api/architectures#Tok2VecListener) sublayer.
+
+The component assigns the output of the transformer to the `Doc`'s extension
 attributes. We also calculate an alignment between the word-piece tokens and the
 spaCy tokenization, so that we can use the last hidden states to set the
 `Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
@ -53,11 +60,11 @@ architectures and their arguments and hyperparameters.
 > nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
 > ```

-| Setting             | Type                                       | Description                                                                                                                                                         | Default                                                 |
-| ------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
-| `max_batch_items`   | int                                        | Maximum size of a padded batch.                                                                                                                                     | `4096`                                                  |
-| `annotation_setter` | Callable                                   | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | `null_annotation_setter`                                |
-| `model`             | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                   | [TransformerModel](/api/architectures#TransformerModel) |
+| Setting             | Type                                       | Description                                                                                                                                                                                                                                                                                     | Default                                                 |
+| ------------------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
+| `max_batch_items`   | int                                        | Maximum size of a padded batch.                                                                                                                                                                                                                                                                 | `4096`                                                  |
+| `annotation_setter` | Callable                                   | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set.         | `null_annotation_setter`                                |
+| `model`             | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** [`FullTransformerBatch`](/api/transformer#fulltransformerbatch). The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer.                                                                                                             | [TransformerModel](/api/architectures#TransformerModel) |

 ```python
 https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
@ -86,18 +93,22 @@ https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/p
 > trf = Transformer(nlp.vocab, model)
 > ```

-Create a new pipeline instance. In your application, you would normally use a
-shortcut for this and instantiate the component using its string name and
-[`nlp.add_pipe`](/api/language#create_pipe).
+Construct a `Transformer` component. One or more subsequent spaCy components can
+use the transformer outputs as features in their models, with gradients
+backpropagated to the single shared weights. The activations from the
+transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
+attribute. You can also provide a callback to set additional annotations. In
+your application, you would normally use a shortcut for this and instantiate the
+component using its string name and [`nlp.add_pipe`](/api/language#add_pipe).

-| Name                | Type                                       | Description                                                                                                                                                                                                                             |
-| ------------------- | ------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab`             | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                  |
-| `model`             | [`Model`](https://thinc.ai/docs/api-model) | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                   |
-| `annotation_setter` | `Callable`                                 | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. Defaults to `null_annotation_setter`, a function that does nothing. |
-| _keyword-only_      |                                            |                                                                                                                                                                                                                                         |
-| `name`              | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                             |
-| `max_batch_items`   | int                                        | Maximum size of a padded batch. Defaults to `128*32`.                                                                                                                                                                                   |
+| Name                | Type                                       | Description                                                                                                                                                                                                                                                                                     |
+| ------------------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab`             | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                                                                          |
+| `model`             | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** [`FullTransformerBatch`](/api/transformer#fulltransformerbatch). The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this.    |
+| `annotation_setter` | `Callable`                                 | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set.         |
+| _keyword-only_      |                                            |                                                                                                                                                                                                                                                                                                 |
+| `name`              | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                                                                     |
+| `max_batch_items`   | int                                        | Maximum size of a padded batch. Defaults to `128*32`.                                                                                                                                                                                                                                           |

 ## Transformer.\_\_call\_\_ {#call tag="method"}

@ -184,7 +195,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.

 ## Transformer.set_annotations {#set_annotations tag="method"}

-Modify a batch of documents, using pre-computed scores.
+Assign the extracted features to the `Doc` objects. By default, the
+[`TransformerData`](/api/transformer#transformerdata) object is written to the
+[`Doc._.trf_data`](#custom-attributes) attribute. Your `annotation_setter`
+callback is then called, if provided.

 > #### Example
 >
@ -201,8 +215,19 @@ Modify a batch of documents, using pre-computed scores.

 ## Transformer.update {#update tag="method"}

-Learn from a batch of documents and gold-standard information, updating the
-pipe's model. Delegates to [`predict`](/api/transformer#predict).
+Prepare for an update to the transformer. Like the [`Tok2Vec`](/api/tok2vec)
+component, the `Transformer` component is unusual in that it does not receive
+"gold standard" annotations to calculate a weight update. The optimal output of
+the transformer is unknown – it's a hidden layer inside the network that is
+updated by backpropagating from output layers.
+
+The `Transformer` component therefore does **not** perform a weight update
+during its own `update` method. Instead, it runs its transformer model and
+communicates the output and the backpropagation callback to any **downstream
+components** that have been connected to it via the
+[TransformerListener](/api/architectures#TransformerListener) sublayer. If there
+are multiple listeners, the last listener will actually backprop to the
+transformer and call the optimizer, while the others simply increment the
+gradients.
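
The accumulate-then-backprop behavior described above can be sketched with a toy class. This is not the spacy-transformers implementation: `ToySharedEncoder`, its attributes and the way `grad_factor` is applied are assumptions chosen to show the idea that earlier listeners only add to the gradient while the final one triggers the actual backward pass.

```python
from typing import List


class ToySharedEncoder:
    """Hypothetical stand-in for a shared transformer with several listeners."""

    def __init__(self, n_listeners: int) -> None:
        self.n_listeners = n_listeners
        self.accumulated: List[float] = []
        self.received = 0
        self.backprop_calls = 0
        self.last_total: List[float] = []

    def receive_gradient(self, d_output: List[float], grad_factor: float = 1.0) -> None:
        """Add one listener's gradient, scaled by its grad_factor."""
        scaled = [g * grad_factor for g in d_output]
        if self.received == 0:
            self.accumulated = scaled
        else:
            self.accumulated = [a + g for a, g in zip(self.accumulated, scaled)]
        self.received += 1
        # Only the final listener triggers the (toy) backward pass.
        if self.received == self.n_listeners:
            self._backprop()

    def _backprop(self) -> None:
        """Pretend to backpropagate the summed gradient and step the optimizer."""
        self.backprop_calls += 1
        self.last_total = list(self.accumulated)
        self.accumulated = []
        self.received = 0


encoder = ToySharedEncoder(n_listeners=2)
encoder.receive_gradient([1.0, 2.0])                      # first listener: accumulate only
calls_after_first = encoder.backprop_calls
encoder.receive_gradient([10.0, 20.0], grad_factor=0.5)   # last listener: backprop
print(calls_after_first, encoder.backprop_calls, encoder.last_total)
```

Setting `grad_factor` to `0` for a listener in this sketch means its gradient contributes nothing to the sum, which corresponds to "freezing" the shared weights with respect to that component.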

 > #### Example
 >
@ -212,15 +237,15 @@ pipe's model. Delegates to [`predict`](/api/transformer#predict).
 > losses = trf.update(examples, sgd=optimizer)
 > ```

-| Name              | Type                                                | Description                                                                                                                               |
-| ----------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
-| `examples`        | `Iterable[Example]`                                 | A batch of [`Example`](/api/example) objects to learn from.                                                                               |
-| _keyword-only_    |                                                     |                                                                                                                                           |
-| `drop`            | float                                               | The dropout rate.                                                                                                                         |
-| `set_annotations` | bool                                                | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/transformer#set_annotations). |
-| `sgd`             | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer.                                                                                                                            |
-| `losses`          | `Dict[str, float]`                                  | Optional record of the loss during training. Updated using the component name as the key.                                                 |
-| **RETURNS**       | `Dict[str, float]`                                  | The updated `losses` dictionary.                                                                                                          |
+| Name              | Type                                                | Description                                                                                                                                                |
+| ----------------- | --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `examples`        | `Iterable[Example]`                                 | A batch of [`Example`](/api/example) objects. Only the [`Example.predicted`](/api/example#predicted) `Doc` object is used; the reference `Doc` is ignored. |
+| _keyword-only_    |                                                     |                                                                                                                                                            |
+| `drop`            | float                                               | The dropout rate.                                                                                                                                          |
+| `set_annotations` | bool                                                | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/transformer#set_annotations).                  |
+| `sgd`             | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer.                                                                                                                                             |
+| `losses`          | `Dict[str, float]`                                  | Optional record of the loss during training. Updated using the component name as the key.                                                                  |
+| **RETURNS**       | `Dict[str, float]`                                  | The updated `losses` dictionary.                                                                                                                           |

 ## Transformer.create_optimizer {#create_optimizer tag="method"}

@ -396,14 +421,16 @@ Split a `TransformerData` object that represents a batch into a list with one

 ## Span getters {#span_getters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}

-<!-- TODO: details on what this is for -->
-
 Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
 return a list of [`Span`](/api/span) objects for each doc, to be processed by
-the transformer. The returned spans can overlap. Span getters can be referenced
-in the config's `[components.transformer.model.get_spans]` block to customize
-the sequences processed by the transformer. You can also register custom span
-getters using the `@registry.span_getters` decorator.
+the transformer. This is used to manage long documents by cutting them into
+smaller sequences before running the transformer. The spans are allowed to
+overlap, and you can also omit sections of the `Doc` if they are not relevant.
+
+Span getters can be referenced in the config's
+`[components.transformer.model.get_spans]` block to customize the sequences
+processed by the transformer. You can also register custom span getters using
+the `@registry.span_getters` decorator.
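
To make the idea of overlapping or partial coverage concrete, here is a minimal sketch in plain Python. It is not the spacy-transformers implementation: documents are modelled as plain lists of token strings instead of `Doc` objects, and the names `strided_windows`, `get_spans`, `window` and `stride` are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def strided_windows(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets of fixed-size windows over a document.

    If stride < window the windows overlap; if stride > window, the tokens
    between two windows are skipped entirely.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            # The last window already reaches the end of the document.
            break
        start += stride
    return spans


def get_spans(
    docs: Sequence[Sequence[str]], window: int = 4, stride: int = 3
) -> List[List[Sequence[str]]]:
    """For each document in the batch, return its list of token slices."""
    return [
        [doc[start:end] for start, end in strided_windows(len(doc), window, stride)]
        for doc in docs
    ]


batch = [["a", "b", "c", "d", "e"]]
spans_per_doc = get_spans(batch, window=3, stride=2)
```

With `window=3` and `stride=2`, the token `"c"` appears in both slices, which mirrors the overlap described above. A real span getter would return `Span` objects sliced from each `Doc` instead of list slices.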

 > #### Example
 >
--- a/website/docs/usage/v3.md
+++ b/website/docs/usage/v3.md
@ -6,25 +6,97 @@ menu:
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v2.x', 'migrating']
-  - ['Migrating plugins', 'plugins']
 ---

 ## Summary {#summary}

 ## New Features {#features}

+### New training workflow and config system {#features-training}
+
+### Transformer-based pipelines {#features-transformers}
+
+### Custom models using any framework {#features-custom-models}
+
+### Manage end-to-end workflows with projects {#features-projects}
+
+### New built-in pipeline components {#features-pipeline-components}
+
+| Name                                            | Description                                                                                                                                                                                                  |
+| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation.                                                                                                                                                               |
+| [`Morphologizer`](/api/morphologizer)           | Trainable component to predict morphological features.                                                                                                                                                       |
+| [`Lemmatizer`](/api/lemmatizer)                 | Standalone component for rule-based and lookup lemmatization.                                                                                                                                                |
+| [`AttributeRuler`](/api/attributeruler)         | Component for setting token attributes using match patterns.                                                                                                                                                 |
+| [`Transformer`](/api/transformer)               | Component for using [transformer models](/usage/transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
+
+### New and improved pipeline component APIs {#features-components}
+
+- `Language.factory`, `Language.component`
+- `Language.analyze_pipes`
+- Adding components from other models
+
+### Type hints and type-based data validation {#features-types}
+
+spaCy v3.0 officially drops support for Python 2 and now requires **Python
+3.6+**. This also means that the code base can take full advantage of
+[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing
+API that's implemented in pure Python (as opposed to Cython) now comes with type
+hints. The new version of spaCy's machine learning library
+[Thinc](https://thinc.ai) also features extensive
+[type support](https://thinc.ai/docs/usage-type-checking/), including custom
+types for models and arrays, and a custom `mypy` plugin that can be used to
+type-check model definitions.
+
+For data validation, spaCy v3.0 adopts
+[`pydantic`](https://github.com/samuelcolvin/pydantic). It also powers the data
+validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which
+lets you register **custom functions with typed arguments**, reference them
+in your config and see validation errors if the argument values don't match.
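
The following standard-library sketch illustrates the general idea of checking config values against a function's type hints. It is not how `pydantic` or Thinc implement validation, and it only handles plain classes such as `int` and `str`. The names `validate_args` and `make_window` are hypothetical.

```python
from typing import Any, Callable, Dict, get_type_hints


def validate_args(func: Callable[..., Any], config: Dict[str, Any]) -> None:
    """Raise an error if a config value doesn't match the function's type hints.

    Simplified on purpose: only plain classes are supported, not generics
    like List[int] or Optional[str].
    """
    hints = get_type_hints(func)
    hints.pop("return", None)
    for name, value in config.items():
        if name not in hints:
            raise ValueError(f"Unexpected argument: {name!r}")
        expected = hints[name]
        if not isinstance(value, expected):
            raise TypeError(
                f"Argument {name!r} should be {expected.__name__}, "
                f"got {type(value).__name__}"
            )


def make_window(size: int, label: str) -> str:
    """Hypothetical function whose arguments would come from a config file."""
    return f"{label}:{size}"


# Values as they might be read from a config block.
good_config = {"size": 3, "label": "context"}
bad_config = {"size": "three", "label": "context"}

validate_args(make_window, good_config)  # passes silently
try:
    validate_args(make_window, bad_config)
except TypeError as err:
    error_message = str(err)
```

The point of validating before calling the function is that a mistyped config value surfaces as a readable error up front, not as a failure somewhere inside the model code.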
+
+### CLI
+
+| Name                                    | Description                                                                                              |
+| --------------------------------------- | -------------------------------------------------------------------------------------------------------- |
+| [`init config`](/api/cli#init-config)   | Initialize a [training config](/usage/training) file for a blank language or auto-fill a partial config. |
+| [`debug config`](/api/cli#debug-config) | Debug a [training config](/usage/training) file and show validation errors.                              |
+| [`project`](/api/cli#project)           | Subcommand for cloning and running [spaCy projects](/usage/projects).                                    |
+
 ## Backwards Incompatibilities {#incompat}

-### Removed or renamed objects, methods, attributes and arguments {#incompat-removed}
+As always, we've tried to keep the breaking changes to a minimum and focus on
+changes that were necessary to support the new features, fix problems or improve
+usability. The following section lists the relevant changes to the user-facing
+API. For specific examples of how to rewrite your code, check out the
+[migration guide](#migrating).

-| Removed                                                  | Replacement                               |
-| -------------------------------------------------------- | ----------------------------------------- |
-| `GoldParse`                                              | [`Example`](/api/example)                 |
-| `GoldCorpus`                                             | [`Corpus`](/api/corpus)                   |
-| `spacy debug-data`                                       | [`spacy debug data`](/api/cli#debug-data) |
-| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
+### Compatibility {#incompat-compat}

-### Removed deprecated methods, attributes and arguments {#incompat-removed-deprecated}
+- spaCy now requires **Python 3.6+**.
+
+### API changes {#incompat-api}
+
+- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
+  the component factory instead of the component function.
+- **Custom pipeline components** now need to be decorated with the
+  [`@Language.component`](/api/language#component) or
+  [`@Language.factory`](/api/language#factory) decorator.
+- [`Language.update`](/api/language#update) now takes a batch of
+  [`Example`](/api/example) objects instead of raw texts and annotations, or
+  `Doc` and `GoldParse` objects.
+- The `Language.disable_pipes` context manager has been replaced by
+  [`Language.select_pipes`](/api/language#select_pipes), which can explicitly
+  disable or enable components.
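
As a rough illustration of the changes listed above, the sketch below registers a custom component, adds it by its string name and temporarily disables it. It assumes spaCy v3 is installed. The component name `mark_processed` and the `user_data` key are made up for this example.

```python
import spacy
from spacy.language import Language


# Register the function under a string name so the pipeline can refer to it.
@Language.component("mark_processed")
def mark_processed(doc):
    # Hypothetical behavior: leave a marker so we can see the component ran.
    doc.user_data["processed"] = True
    return doc


nlp = spacy.blank("en")
# v3: pass the registered string name, not the function itself.
nlp.add_pipe("mark_processed")
doc = nlp("This is a sentence.")

# select_pipes takes over from disable_pipes and can disable or enable components.
with nlp.select_pipes(disable=["mark_processed"]):
    skipped = nlp("This text skips the component.")
```

Referring to components by name is what allows a pipeline to be described in a config file and re-created later, since a string can be serialized while a function object cannot.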
+
+### Removed or renamed API {#incompat-removed}
+
+| Removed                                                  | Replacement                                           |
+| -------------------------------------------------------- | ----------------------------------------------------- |
+| `Language.disable_pipes`                                 | [`Language.select_pipes`](/api/language#select_pipes) |
+| `GoldParse`                                              | [`Example`](/api/example)                             |
+| `GoldCorpus`                                             | [`Corpus`](/api/corpus)                               |
+| `spacy debug-data`                                       | [`spacy debug data`](/api/cli#debug-data)             |
+| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated             |

 The following deprecated methods, attributes and arguments were removed in v3.0.
 Most of them have been **deprecated for a while** and many would previously
@ -214,17 +286,14 @@ python -m spacy package ./model ./packages
 - python setup.py sdist
 ```

-## Migration notes for plugin maintainers {#plugins}
+#### Migration notes for plugin maintainers {#migrating-plugins}

 Thanks to everyone who's been contributing to the spaCy ecosystem by developing
 and maintaining one of the many awesome [plugins and extensions](/universe).
-We've tried to keep breaking changes to a minimum and make it as easy as
-possible for you to upgrade your packages for spaCy v3.
-
-### Custom pipeline components
-
-The most common use case for plugins is providing pipeline components and
-extension attributes.
+We've tried to make it as easy as possible for you to upgrade your packages for
+spaCy v3. The most common use case for plugins is providing pipeline components
+and extension attributes. When migrating your plugin, double-check the
+following:

 - Use the [`@Language.factory`](/api/language#factory) decorator to register
  your component and assign it a name. This allows users to refer to your