Update docs [ci skip]

Ines Montani 2020-08-11 20:57:23 +02:00
parent 10f42e3a39
commit b7ec06e331
22 changed files with 429 additions and 82 deletions

View File

@@ -274,7 +274,7 @@ architectures into your training config.

| `get_spans` | `Callable` | Function that takes a batch of [`Doc`](/api/doc) objects and returns lists of [`Span`](/api) objects to process by the transformer. [See here](/api/transformer#span_getters) for built-in options and examples. |
| `tokenizer_config` | `Dict[str, Any]` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). |

-### spacy-transformers.Tok2VecListener.v1 {#Tok2VecListener}
+### spacy-transformers.Tok2VecListener.v1 {#transformers-Tok2VecListener}

> #### Example Config
>

View File

@@ -43,7 +43,7 @@ $ python -m spacy download [model] [--direct] [pip args]

| Argument | Type | Description |
| ------------------------------------- | ---------- | ----------- |
-| `model` | positional | Model name, e.g. `en_core_web_sm`.. |
+| `model` | positional | Model name, e.g. [`en_core_web_sm`](/models/en#en_core_web_sm). |
| `--direct`, `-d` | flag | Force direct download of exact model version. |
| pip args <Tag variant="new">2.1</Tag> | - | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. |
| `--help`, `-h` | flag | Show help message and available arguments. |

View File

@@ -182,10 +182,10 @@ run [`spacy pretrain`](/api/cli#pretrain).

> ```

The main data format used in spaCy v3.0 is a **binary format** created by
-serializing a [`DocBin`](/api/docbin) object, which represents a collection of
-`Doc` objects. This means that you can train spaCy models using the same format
-it outputs: annotated `Doc` objects. The binary format is extremely **efficient
-in storage**, especially when packing multiple documents together.
+serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc`
+objects. This means that you can train spaCy models using the same format it
+outputs: annotated `Doc` objects. The binary format is extremely **efficient in
+storage**, especially when packing multiple documents together.

Typically, the extension for these binary files is `.spacy`, and they are used
as input format for specifying a [training corpus](/api/corpus) and for spaCy's
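
For readers following along: a minimal sketch of producing such a `.spacy` file by serializing a [`DocBin`](/api/docbin) — the example texts and output path are placeholders:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
# In practice these would be fully annotated Doc objects
docs = [nlp.make_doc("I like London."), nlp.make_doc("Berlin is nice.")]
doc_bin = DocBin(docs=docs)
doc_bin.to_disk("./train.spacy")  # binary format used as the training corpus
```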

View File

@@ -142,14 +142,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/dependencyparser#call) and

## DependencyParser.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. Returns an
-[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
+Initialize the component for training and return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
+function that returns an iterable of [`Example`](/api/example) objects. The data
+examples are used to **initialize the model** of the component and can either be
+the full training data or a representative sample. Initialization includes
+validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data.

> #### Example
>
> ```python
> parser = nlp.add_pipe("parser")
-> optimizer = parser.begin_training(pipeline=nlp.pipeline)
+> optimizer = parser.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

| Name | Type | Description |
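
The `lambda: []` above initializes with no data; a fuller sketch, assuming the v3 `spacy.training.Example` import path and made-up dependency annotations, might look like this:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
parser = nlp.add_pipe("parser")
# Hypothetical gold-standard heads and dependency labels
doc = nlp.make_doc("I like stuff")
annots = {"heads": [1, 1, 1], "deps": ["nsubj", "ROOT", "dobj"]}
examples = [Example.from_dict(doc, annots)]
# get_examples is a zero-argument callable returning the examples
optimizer = parser.begin_training(lambda: examples, pipeline=nlp.pipeline)
```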

View File

@@ -142,14 +142,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/entitylinker#call) and

## EntityLinker.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. Returns an
-[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
+Initialize the component for training and return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
+function that returns an iterable of [`Example`](/api/example) objects. The data
+examples are used to **initialize the model** of the component and can either be
+the full training data or a representative sample. Initialization includes
+validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data.

> #### Example
>
> ```python
> entity_linker = nlp.add_pipe("entity_linker", last=True)
-> optimizer = entity_linker.begin_training(pipeline=nlp.pipeline)
+> optimizer = entity_linker.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

| Name | Type | Description |

View File

@@ -131,14 +131,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/entityrecognizer#call) and

## EntityRecognizer.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. Returns an
-[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
+Initialize the component for training and return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
+function that returns an iterable of [`Example`](/api/example) objects. The data
+examples are used to **initialize the model** of the component and can either be
+the full training data or a representative sample. Initialization includes
+validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data.

> #### Example
>
> ```python
> ner = nlp.add_pipe("ner")
-> optimizer = ner.begin_training(pipeline=nlp.pipeline)
+> optimizer = ner.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

| Name | Type | Description |

View File

@@ -200,12 +200,28 @@ more efficient than processing texts one-by-one.

## Language.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. Returns an
-[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
+Initialize the pipeline for training and return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
+function that returns an iterable of [`Example`](/api/example) objects. The data
+examples can either be the full training data or a representative sample. They
+are used to **initialize the models** of trainable pipeline components and are
+passed to each component's [`begin_training`](/api/pipe#begin_training) method, if
+available. Initialization includes validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data.
+
+<Infobox variant="warning" title="Changed in v3.0">
+
+The `Language.begin_training` method now takes a **function** that is called with no
+arguments and returns a sequence of [`Example`](/api/example) objects instead of
+tuples of `Doc` and `GoldParse` objects.
+
+</Infobox>

> #### Example
>
> ```python
+> get_examples = lambda: examples
> optimizer = nlp.begin_training(get_examples)
> ```
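
Putting the pieces together — a small sketch of building the `examples` list from annotated dictionaries before calling `begin_training` (the entity offsets reuse data from the migration guide later in this commit; the `spacy.training` import path is an assumption):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("ner")
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]
examples = [
    Example.from_dict(nlp.make_doc(text), annots) for text, annots in TRAIN_DATA
]
# The function is called with no arguments and returns the examples
optimizer = nlp.begin_training(lambda: examples)
```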
@@ -276,7 +292,7 @@ and custom registered functions if needed. See the

| `component_cfg` | `Dict[str, dict]` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |

-## Language.rehearse {#rehearse tag="method,experimental"}
+## Language.rehearse {#rehearse tag="method,experimental" new="3"}

Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address
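
A rough sketch of where rehearsal fits in a training loop — the example lists and batch size are hypothetical, and since the method is experimental the exact keyword arguments may differ:

```python
import spacy
from spacy.util import minibatch

nlp = spacy.load("en_core_web_sm")  # pretrained pipeline to continue training
# train_examples and rehearsal_examples: lists of Example objects (assumed)
optimizer = nlp.resume_training()
losses = {}
for batch in minibatch(train_examples, size=8):
    nlp.update(batch, sgd=optimizer, losses=losses)
    # Nudge predictions back toward the initial model to limit forgetting
    nlp.rehearse(rehearsal_examples, sgd=optimizer, losses=losses)
```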
@@ -302,6 +318,13 @@ the "catastrophic forgetting" problem. This feature is experimental.

Evaluate a model's pipeline components.

+<Infobox variant="warning" title="Changed in v3.0">
+
+The `Language.evaluate` method now takes a batch of [`Example`](/api/example)
+objects instead of tuples of `Doc` and `GoldParse` objects.
+
+</Infobox>

> #### Example
>
> ```python

View File

@@ -121,15 +121,21 @@ applied to the `Doc` in order. Both [`__call__`](/api/morphologizer#call) and

## Morphologizer.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. Returns an
-[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
+Initialize the component for training and return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
+function that returns an iterable of [`Example`](/api/example) objects. The data
+examples are used to **initialize the model** of the component and can either be
+the full training data or a representative sample. Initialization includes
+validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data.

> #### Example
>
> ```python
> morphologizer = nlp.add_pipe("morphologizer")
> nlp.pipeline.append(morphologizer)
-> optimizer = morphologizer.begin_training(pipeline=nlp.pipeline)
+> optimizer = morphologizer.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

| Name | Type | Description |

View File

@@ -9,8 +9,8 @@ components like the [`EntityRecognizer`](/api/entityrecognizer) or

[`TextCategorizer`](/api/textcategorizer) inherit from it and it defines the
interface that components should follow to function as trainable components in a
spaCy pipeline. See the docs on
-[writing trainable components](/usage/processing-pipelines#trainable) for how to
-use the `Pipe` base class to implement custom components.
+[writing trainable components](/usage/processing-pipelines#trainable-components)
+for how to use the `Pipe` base class to implement custom components.

> #### Why is Pipe implemented in Cython?
>

@@ -106,14 +106,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/pipe#call) and

## Pipe.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. Returns an
-[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
+Initialize the component for training and return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
+function that returns an iterable of [`Example`](/api/example) objects. The data
+examples are used to **initialize the model** of the component and can either be
+the full training data or a representative sample. Initialization includes
+validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data.

> #### Example
>
> ```python
> pipe = nlp.add_pipe("your_custom_pipe")
-> optimizer = pipe.begin_training(pipeline=nlp.pipeline)
+> optimizer = pipe.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

| Name | Type | Description |

@@ -200,7 +206,7 @@ This method needs to be overwritten with your own custom `update` method.

| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |

-## Pipe.rehearse {#rehearse tag="method,experimental"}
+## Pipe.rehearse {#rehearse tag="method,experimental" new="3"}

Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address

View File

@@ -116,14 +116,20 @@ and [`pipe`](/api/sentencerecognizer#pipe) delegate to the

## SentenceRecognizer.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. Returns an
-[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
+Initialize the component for training and return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
+function that returns an iterable of [`Example`](/api/example) objects. The data
+examples are used to **initialize the model** of the component and can either be
+the full training data or a representative sample. Initialization includes
+validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data.

> #### Example
>
> ```python
> senter = nlp.add_pipe("senter")
-> optimizer = senter.begin_training(pipeline=nlp.pipeline)
+> optimizer = senter.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

| Name | Type | Description |

@@ -193,7 +199,7 @@ Delegates to [`predict`](/api/sentencerecognizer#predict) and

| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |

-## SentenceRecognizer.rehearse {#rehearse tag="method,experimental"}
+## SentenceRecognizer.rehearse {#rehearse tag="method,experimental" new="3"}

Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address

View File

@@ -114,14 +114,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/tagger#call) and

## Tagger.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. Returns an
-[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
+Initialize the component for training and return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
+function that returns an iterable of [`Example`](/api/example) objects. The data
+examples are used to **initialize the model** of the component and can either be
+the full training data or a representative sample. Initialization includes
+validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data.

> #### Example
>
> ```python
> tagger = nlp.add_pipe("tagger")
-> optimizer = tagger.begin_training(pipeline=nlp.pipeline)
+> optimizer = tagger.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

| Name | Type | Description |

@@ -191,7 +197,7 @@ Delegates to [`predict`](/api/tagger#predict) and

| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |

-## Tagger.rehearse {#rehearse tag="method,experimental"}
+## Tagger.rehearse {#rehearse tag="method,experimental" new="3"}

Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address

View File

@@ -122,14 +122,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/textcategorizer#call) and

## TextCategorizer.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. Returns an
-[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
+Initialize the component for training and return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
+function that returns an iterable of [`Example`](/api/example) objects. The data
+examples are used to **initialize the model** of the component and can either be
+the full training data or a representative sample. Initialization includes
+validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat")
-> optimizer = textcat.begin_training(pipeline=nlp.pipeline)
+> optimizer = textcat.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

| Name | Type | Description |

@@ -199,7 +205,7 @@ Delegates to [`predict`](/api/textcategorizer#predict) and

| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |

-## TextCategorizer.rehearse {#rehearse tag="method,experimental"}
+## TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}

Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address

View File

@@ -125,14 +125,20 @@ and [`set_annotations`](/api/tok2vec#set_annotations) methods.

## Tok2Vec.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. Returns an
-[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
+Initialize the component for training and return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
+function that returns an iterable of [`Example`](/api/example) objects. The data
+examples are used to **initialize the model** of the component and can either be
+the full training data or a representative sample. Initialization includes
+validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data.

> #### Example
>
> ```python
> tok2vec = nlp.add_pipe("tok2vec")
-> optimizer = tok2vec.begin_training(pipeline=nlp.pipeline)
+> optimizer = tok2vec.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

| Name | Type | Description |

View File

@@ -159,14 +159,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/transformer#call) and

## Transformer.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. Returns an
-[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
+Initialize the component for training and return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
+function that returns an iterable of [`Example`](/api/example) objects. The data
+examples are used to **initialize the model** of the component and can either be
+the full training data or a representative sample. Initialization includes
+validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data.

> #### Example
>
> ```python
> trf = nlp.add_pipe("transformer")
-> optimizer = trf.begin_training(pipeline=nlp.pipeline)
+> optimizer = trf.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

| Name | Type | Description |

View File

@@ -45,9 +45,9 @@ three components:

2. **Genre:** Type of text the model is trained on, e.g. `web` or `news`.
3. **Size:** Model size indicator, `sm`, `md` or `lg`.

-For example, `en_core_web_sm` is a small English model trained on written web
-text (blogs, news, comments), that includes vocabulary, vectors, syntax and
-entities.
+For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English
+model trained on written web text (blogs, news, comments), that includes
+vocabulary, vectors, syntax and entities.

### Model versioning {#model-versioning}

View File

@@ -687,13 +687,13 @@ give you everything you need to train fully custom models with

</Infobox>

-<!-- TODO: maybe add something about why the Example class is great and its benefits, and how it's passed around, holds the alignment etc -->
-
The [`Example`](/api/example) object contains annotated training data, also
called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
that will hold the predictions, and another `Doc` object that holds the
-gold-standard annotations. Here's an example of a simple `Example` for
-part-of-speech tags:
+gold-standard annotations. It also includes the **alignment** between those two
+documents if they differ in tokenization. The `Example` class ensures that spaCy
+can rely on one **standardized format** that's passed through the pipeline.
+Here's an example of a simple `Example` for part-of-speech tags:

```python
words = ["I", "like", "stuff"]

@@ -744,7 +744,8 @@ example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O"

As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class.
It can be constructed in a very similar way, from a `Doc` and a dictionary of
-annotations:
+annotations. For more details, see the
+[migration guide](/usage/v3#migrating-training).

```diff
- gold = GoldParse(doc, entities=entities)

View File

@@ -14,12 +14,49 @@ menu:

### New training workflow and config system {#features-training}

+<Infobox title="Details & Documentation" emoji="📖" list>
+
+- **Usage:** [Training models](/usage/training)
+- **Thinc:** [Thinc's config system](https://thinc.ai/docs/usage-config),
+  [`Config`](https://thinc.ai/docs/api-config#config)
+- **CLI:** [`train`](/api/cli#train), [`pretrain`](/api/cli#pretrain),
+  [`evaluate`](/api/cli#evaluate)
+- **API:** [Config format](/api/data-formats#config),
+  [`registry`](/api/top-level#registry)
+
+</Infobox>

### Transformer-based pipelines {#features-transformers}

+<Infobox title="Details & Documentation" emoji="📖" list>
+
+- **Usage:** [Transformers](/usage/transformers),
+  [Training models](/usage/training)
+- **API:** [`Transformer`](/api/transformer),
+  [`TransformerData`](/api/transformer#transformerdata),
+  [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
+- **Architectures:** [TransformerModel](/api/architectures#TransformerModel),
+  [Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
+  [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
+- **Models:** [`en_core_bert_sm`](/models/en)
+- **Implementation:**
+  [`spacy-transformers`](https://github.com/explosion/spacy-transformers)
+
+</Infobox>

### Custom models using any framework {#feautres-custom-models}

### Manage end-to-end workflows with projects {#features-projects}

+<Infobox title="Details & Documentation" emoji="📖" list>
+
+- **Usage:** [spaCy projects](/usage/projects),
+  [Training models](/usage/training)
+- **CLI:** [`project`](/api/cli#project), [`train`](/api/cli#train)
+- **Templates:** [`projects`](https://github.com/explosion/projects)
+
+</Infobox>

### New built-in pipeline components {#features-pipeline-components}

| Name | Description |
@@ -30,14 +67,48 @@ menu:

| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |

+<Infobox title="Details & Documentation" emoji="📖" list>
+
+- **Usage:** [Processing pipelines](/usage/processing-pipelines)
+- **API:** [Built-in pipeline components](/api#architecture-pipeline)
+- **Implementation:**
+  [`spacy/pipeline`](https://github.com/explosion/spaCy/tree/develop/spacy/pipeline)
+
+</Infobox>

### New and improved pipeline component APIs {#features-components}

- `Language.factory`, `Language.component`
- `Language.analyze_pipes`
- Adding components from other models

+<Infobox title="Details & Documentation" emoji="📖" list>
+
+- **Usage:** [Custom components](/usage/processing-pipelines#custom_components),
+  [Defining components during training](/usage/training#config-components)
+- **API:** [`Language`](/api/language)
+- **Implementation:**
+  [`spacy/language.py`](https://github.com/explosion/spaCy/tree/develop/spacy/language.py)
+
+</Infobox>
### Type hints and type-based data validation {#features-types}

+> #### Example
+>
+> ```python
+> from spacy.language import Language
+> from pydantic import StrictBool
+>
+> @Language.factory("my_component")
+> def create_my_component(
+>     nlp: Language,
+>     name: str,
+>     custom: StrictBool
+> ):
+>     ...
+> ```

spaCy v3.0 officially drops support for Python 2 and now requires **Python
3.6+**. This also means that the code base can take full advantage of
[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing

@@ -54,13 +125,36 @@ validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which

lets you register **custom functions with typed arguments**, reference them
in your config and see validation errors if the argument values don't match.

-### CLI
-
-| Name | Description |
-| --------------------------------------- | ----------- |
-| [`init config`](/api/cli#init-config) | Initialize a [training config](/usage/training) file for a blank language or auto-fill a partial config. |
-| [`debug config`](/api/cli#debug-config) | Debug a [training config](/usage/training) file and show validation errors. |
-| [`project`](/api/cli#project) | Subcommand for cloning and running [spaCy projects](/usage/projects). |
+<Infobox title="Details & Documentation" emoji="📖" list>
+
+- **Usage:**
+  [Component type hints and validation](/usage/processing-pipelines#type-hints),
+  [Training with custom code](/usage/training#custom-code)
+- **Thinc:**
+  [Type checking in Thinc](https://thinc.ai/docs/usage-type-checking),
+  [Thinc's config system](https://thinc.ai/docs/usage-config)
+
+</Infobox>
+### New methods, attributes and commands
+
+The following methods, attributes and commands are new in spaCy v3.0.
+
+| Name | Description |
+| --- | --- |
+| [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). |
+| [`Language.select_pipes`](/api/language#select_pipes) | Contextmanager for enabling or disabling specific pipeline components for a block. |
+| [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. |
+| [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. |
+| [`@Language.factory`](/api/language#factory) [`@Language.component`](/api/language#component) | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions. |
+| [`Language.has_factory`](/api/language#has_factory) | Check whether a component factory is registered on a language class. |
+| [`Language.get_factory_meta`](/api/language#get_factory_meta) [`Language.get_pipe_meta`](/api/language#get_pipe_meta) | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name. |
+| [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) that can be saved to disk and used for training. |
+| [`Pipe.score`](/api/pipe#score) | Method on trainable pipeline components that returns a dictionary of evaluation scores. |
+| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
+| [`init config`](/api/cli#init-config) | CLI command for initializing a [training config](/usage/training) file for a blank language or auto-filling a partial config. |
+| [`debug config`](/api/cli#debug-config) | CLI command for debugging a [training config](/usage/training) file and showing validation errors. |
+| [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). |
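
A few of these additions in action — a quick sketch, assuming a trained pipeline such as `en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.analyze_pipes(pretty=True)  # print components and their interdependencies

# Temporarily run only the tagger, re-enabling the rest afterwards
with nlp.select_pipes(enable="tagger"):
    doc = nlp("Tagged without parsing or NER")

nlp.config.to_disk("./config.cfg")  # config used to create this nlp object
```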
## Backwards Incompatibilities {#incompat}

@@ -70,12 +164,21 @@ usability. The following section lists the relevant changes to the user-facing

API. For specific examples of how to rewrite your code, check out the
[migration guide](#migrating).

-### Compatibility {#incompat-compat}
-
-- spaCy now requires **Python 3.6+**.
+<Infobox variant="warning">
+
+Note that spaCy v3.0 now requires **Python 3.6+**.
+
+</Infobox>

### API changes {#incompat-api}

+- Model symlinks, the `link` command and shortcut names are now deprecated.
+  There can be many [different models](/models) and not just one "English
+  model", so you should always use the full model name like
+  [`en_core_web_sm`](/models/en) explicitly.
+- The [`train`](/api/cli#train) and [`pretrain`](/api/cli#pretrain) commands now
+  only take a `config.cfg` file containing the full
+  [training config](/usage/training#config).
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
  the component factory instead of the component function.
- **Custom pipeline components** now need to be decorated with the
@@ -87,6 +190,20 @@ API. For specific examples of how to rewrite your code, check out the

- The `Language.disable_pipes` contextmanager has been replaced by
  [`Language.select_pipes`](/api/language#select_pipes), which can explicitly
  disable or enable components.
+- The [`Language.update`](/api/language#update),
+  [`Language.evaluate`](/api/language#evaluate) and
+  [`Pipe.update`](/api/pipe#update) methods now all take batches of
+  [`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
+  raw text and a dictionary of annotations.
+  [`Language.begin_training`](/api/language#begin_training) and
+  [`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
+  returns a sequence of `Example` objects to initialize the model instead of a
+  list of tuples.
+- [`Matcher.add`](/api/matcher#add),
+  [`PhraseMatcher.add`](/api/phrasematcher#add) and
+  [`DependencyMatcher.add`](/api/dependencymatcher#add) now only accept a list
+  of patterns as the second argument (instead of a variable number of
+  arguments). The `on_match` callback becomes an optional keyword argument.
### Removed or renamed API {#incompat-removed}

@@ -96,6 +213,7 @@ API. For specific examples of how to rewrite your code, check out the

| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
+| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |

The following deprecated methods, attributes and arguments were removed in v3.0.

@@ -121,7 +239,7 @@ on them.

Model symlinks and shortcuts like `en` are now officially deprecated. There are
[many different models](/models) with different capabilities and not just one
"English model". In order to download and load a model, you should always use
-its full name, for instance `en_core_web_sm`.
+its full name, for instance [`en_core_web_sm`](/models/en#en_core_web_sm).

```diff
- python -m spacy download en
@@ -224,6 +342,51 @@ and you typically shouldn't have to use it in your code.

+ parser = nlp.add_pipe("parser")
```

+If you need to add a component from an existing pretrained model, you can now
+use the `source` argument on [`nlp.add_pipe`](/api/language#add_pipe). This will
+check that the component is compatible, and take care of porting over all
+config. During training, you can also reference existing pretrained components
+in your [config](/usage/training#config-components) and decide whether or not
+they should be updated with more data.
+
+> #### config.cfg (excerpt)
+>
+> ```ini
+> [components.ner]
+> source = "en_core_web_sm"
+> component = "ner"
+> ```
+
+```diff
+source_nlp = spacy.load("en_core_web_sm")
+nlp = spacy.blank("en")
+- ner = source_nlp.get_pipe("ner")
+- nlp.add_pipe(ner)
++ nlp.add_pipe("ner", source=source_nlp)
+```
+
+### Adding match patterns {#migrating-matcher}
+
+The [`Matcher.add`](/api/matcher#add),
+[`PhraseMatcher.add`](/api/phrasematcher#add) and
+[`DependencyMatcher.add`](/api/dependencymatcher#add) methods now only accept a
+**list of patterns** as the second argument (instead of a variable number of
+arguments). The `on_match` callback becomes an optional keyword argument.
+
+```diff
+matcher = Matcher(nlp.vocab)
+patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
+- matcher.add("GoogleNow", on_match, *patterns)
++ matcher.add("GoogleNow", patterns, on_match=on_match)
+```
+
+```diff
+matcher = PhraseMatcher(nlp.vocab)
+patterns = [nlp("health care reform"), nlp("healthcare reform")]
+- matcher.add("HEALTH", on_match, *patterns)
++ matcher.add("HEALTH", patterns, on_match=on_match)
+```
### Training models {#migrating-training}

To train your models, you should now pretty much always use the

@@ -233,15 +396,20 @@ use a [flexible config file](/usage/training#config) that describes all training

settings and hyperparameters, as well as your pipeline, model components and
architectures to use. The `--code` argument lets you pass in code containing
[custom registered functions](/usage/training#custom-code) that you can
-reference in your config.
+reference in your config. To get started, check out the
+[quickstart widget](/usage/training#quickstart).

#### Binary .spacy training data format {#migrating-training-format}

-spaCy now uses a new
-[binary training data format](/api/data-formats#binary-training), which is much
-smaller and consists of `Doc` objects, serialized via the
-[`DocBin`](/api/docbin). You can convert your existing JSON-formatted data using
-the [`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:
+spaCy v3.0 uses a new
+[binary training data format](/api/data-formats#binary-training) created by
+serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc`
+objects. This means that you can train spaCy models using the same format it
+outputs: annotated `Doc` objects. The binary format is extremely **efficient in
+storage**, especially when packing multiple documents together.
+
+You can convert your existing JSON-formatted data using the
+[`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:

```bash
$ python -m spacy convert ./training.json ./output
@@ -273,13 +441,72 @@ workflows, from data preprocessing to training and packaging your model.

</Project>

-#### Migrating training scripts to CLI command and config {#migrating-training-scripts}
-
-<!-- TODO: write -->

#### Training via the Python API {#migrating-training-python}

-<!-- TODO: this should explain the GoldParse -> Example stuff -->
+For most use cases, you **shouldn't** have to write your own training scripts
+anymore. Instead, you can use [`spacy train`](/api/cli#train) with a
+[config file](/usage/training#config) and custom
+[registered functions](/usage/training#custom-code) if needed. You can even
+register callbacks that can modify the `nlp` object at different stages of its
+lifecycle to fully customize it before training.
+
+If you do decide to use the [internal training API](/usage/training#api) from
+Python, you should only need a few small modifications to convert your scripts
+from spaCy v2.x to v3.x. The [`Example.from_dict`](/api/example#from_dict)
+classmethod takes a reference `Doc` and a
+[dictionary of annotations](/api/data-formats#dict-input), similar to the
+"simple training style" in spaCy v2.x:
+
+```diff
+### Migrating Doc and GoldParse
+doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
+entities = [(0, 15, "PERSON"), (30, 38, "ORG")]
+- gold = GoldParse(doc, entities=entities)
++ example = Example.from_dict(doc, {"entities": entities})
+```
+
+```diff
+### Migrating simple training style
+text = "Mark Zuckerberg is the CEO of Facebook"
+annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
++ doc = nlp.make_doc(text)
++ example = Example.from_dict(doc, annotations)
+```
+
+The [`Language.update`](/api/language#update),
+[`Language.evaluate`](/api/language#evaluate) and
+[`Pipe.update`](/api/pipe#update) methods now all take batches of
+[`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
+raw text and a dictionary of annotations.
+
+```python
+### Training loop {highlight="11"}
+TRAIN_DATA = [
+    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
+    ("I like London.", {"entities": [(7, 13, "LOC")]}),
+]
+nlp.begin_training()
+for i in range(20):
+    random.shuffle(TRAIN_DATA)
+    for batch in minibatch(TRAIN_DATA):
+        examples = []
+        for text, annots in batch:
+            examples.append(Example.from_dict(nlp.make_doc(text), annots))
+        nlp.update(examples)
+```
+
+[`Language.begin_training`](/api/language#begin_training) and
+[`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
+returns a sequence of `Example` objects to initialize the model instead of a
+list of tuples. The data examples are used to **initialize the models** of
+trainable pipeline components, which includes validating the network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme.
+
+```diff
+- nlp.begin_training(examples)
++ nlp.begin_training(lambda: examples)
+```

#### Packaging models {#migrating-training-packaging}

View File

@@ -23,6 +23,7 @@ import { ReactComponent as MoonIcon } from '../images/icons/moon.svg'

import { ReactComponent as ClipboardIcon } from '../images/icons/clipboard.svg'
import { ReactComponent as NetworkIcon } from '../images/icons/network.svg'
import { ReactComponent as DownloadIcon } from '../images/icons/download.svg'
+import { ReactComponent as PackageIcon } from '../images/icons/package.svg'

import classes from '../styles/icon.module.sass'

@@ -49,6 +50,7 @@ const icons = {

    clipboard: ClipboardIcon,
    network: NetworkIcon,
    download: DownloadIcon,
+    package: PackageIcon,
}

export default function Icon({ name, width = 20, height, inline = false, variant, className }) {

View File

@@ -5,8 +5,17 @@ import classNames from 'classnames'

import Icon from './icon'
import classes from '../styles/infobox.module.sass'

-export default function Infobox({ title, emoji, id, variant = 'default', className, children }) {
+export default function Infobox({
+    title,
+    emoji,
+    id,
+    variant = 'default',
+    list = false,
+    className,
+    children,
+}) {
    const infoboxClassNames = classNames(classes.root, className, {
+        [classes.list]: !!list,
        [classes.warning]: variant === 'warning',
        [classes.danger]: variant === 'danger',
    })

View File

@@ -8,13 +8,21 @@ import Icon from './icon'

import classes from '../styles/link.module.sass'
import { isString } from './util'

-const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io)/gi
+const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io|explosion.ai|course.spacy.io)/gi

const Whitespace = ({ children }) => (
    // Ensure that links are always wrapped in spaces
    <> {children} </>
)

+function getIcon(dest) {
+    if (/(github.com)/.test(dest)) return 'code'
+    if (/^\/?api\/architectures#/.test(dest)) return 'network'
+    if (/^\/?api/.test(dest)) return 'docs'
+    if (/^\/?models\/(.+)/.test(dest)) return 'package'
+    return null
+}
+
export default function Link({
    children,
    to,

@@ -30,22 +38,19 @@ export default function Link({

}) {
    const dest = to || href
    const external = forceExternal || /(http(s?)):\/\//gi.test(dest)
-    const isApi = !external && !hidden && !hideIcon && /^\/?api/.test(dest)
-    const isArch = !external && !hidden && !hideIcon && /^\/?api\/architectures#/.test(dest)
-    const isSource = external && !hidden && !hideIcon && /(github.com)/.test(dest)
-    const withIcon = isApi || isArch || isSource
+    const icon = getIcon(dest)
+    const withIcon = !hidden && !hideIcon && !!icon
    const sourceWithText = withIcon && isString(children)
    const linkClassNames = classNames(classes.root, className, {
        [classes.hidden]: hidden,
-        [classes.nowrap]: (withIcon && !sourceWithText) || isArch,
+        [classes.nowrap]: (withIcon && !sourceWithText) || icon === 'network',
        [classes.withIcon]: withIcon,
    })
    const Wrapper = ws ? Whitespace : Fragment
-    const icon = isArch ? 'network' : isApi ? 'docs' : isSource ? 'code' : null
    const content = (
        <>
            {sourceWithText ? <span className={classes.sourceText}>{children}</span> : children}
-            {icon && <Icon name={icon} width={16} inline className={classes.icon} />}
+            {withIcon && <Icon name={icon} width={16} inline className={classes.icon} />}
        </>
    )

View File

@@ -0,0 +1,5 @@

+<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
+    <path fill="none" d="M21 16V8a2 2 0 0 0-1-1.73l-7-4a2 2 0 0 0-2 0l-7 4A2 2 0 0 0 3 8v8a2 2 0 0 0 1 1.73l7 4a2 2 0 0 0 2 0l7-4A2 2 0 0 0 21 16z"></path>
+    <polyline fill="none" points="3.27 6.96 12 12.01 20.73 6.96"></polyline>
+    <line fill="none" x1="12" y1="22.08" x2="12" y2="12"></line>
+</svg>


View File

@@ -14,6 +14,21 @@

    font-size: inherit
    line-height: inherit

+    ul li
+        padding-left: 0.75em
+
+.list ul li
+    font-size: var(--font-size-sm)
+    list-style: none
+    padding: 0
+    margin: 0 0 0.35rem 0
+
+    &:before
+        all: initial
+
+    a, a span
+        border-bottom: 0 !important

.title
    font-weight: bold
    color: var(--color-theme)