mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-27 02:16:32 +03:00
Update docs [ci skip]
parent
10f42e3a39
commit
b7ec06e331
|
@ -274,7 +274,7 @@ architectures into your training config.
|
||||||
| `get_spans` | `Callable` | Function that takes a batch of [`Doc`](/api/doc) objects and returns lists of [`Span`](/api) objects to process by the transformer. [See here](/api/transformer#span_getters) for built-in options and examples. |
|
| `get_spans` | `Callable` | Function that takes a batch of [`Doc`](/api/doc) objects and returns lists of [`Span`](/api) objects to process by the transformer. [See here](/api/transformer#span_getters) for built-in options and examples. |
|
||||||
| `tokenizer_config` | `Dict[str, Any]` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). |
|
| `tokenizer_config` | `Dict[str, Any]` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). |
|
||||||
|
|
||||||
### spacy-transformers.Tok2VecListener.v1 {#Tok2VecListener}
|
### spacy-transformers.Tok2VecListener.v1 {#transformers-Tok2VecListener}
|
||||||
|
|
||||||
> #### Example Config
|
> #### Example Config
|
||||||
>
|
>
|
||||||
|
|
|
@ -43,7 +43,7 @@ $ python -m spacy download [model] [--direct] [pip args]
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Type | Description |
|
||||||
| ------------------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ------------------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `model` | positional | Model name, e.g. `en_core_web_sm`.. |
|
| `model` | positional | Model name, e.g. [`en_core_web_sm`](/models/en#en_core_web_sm). |
|
||||||
| `--direct`, `-d` | flag | Force direct download of exact model version. |
|
| `--direct`, `-d` | flag | Force direct download of exact model version. |
|
||||||
| pip args <Tag variant="new">2.1</Tag> | - | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. |
|
| pip args <Tag variant="new">2.1</Tag> | - | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. |
|
||||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||||
|
|
|
@ -182,10 +182,10 @@ run [`spacy pretrain`](/api/cli#pretrain).
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
The main data format used in spaCy v3.0 is a **binary format** created by
|
The main data format used in spaCy v3.0 is a **binary format** created by
|
||||||
serializing a [`DocBin`](/api/docbin) object, which represents a collection of
|
serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc`
|
||||||
`Doc` objects. This means that you can train spaCy models using the same format
|
objects. This means that you can train spaCy models using the same format it
|
||||||
it outputs: annotated `Doc` objects. The binary format is extremely **efficient
|
outputs: annotated `Doc` objects. The binary format is extremely **efficient in
|
||||||
in storage**, especially when packing multiple documents together.
|
storage**, especially when packing multiple documents together.
|
||||||
|
|
||||||
Typically, the extension for these binary files is `.spacy`, and they are used
|
Typically, the extension for these binary files is `.spacy`, and they are used
|
||||||
as input format for specifying a [training corpus](/api/corpus) and for spaCy's
|
as input format for specifying a [training corpus](/api/corpus) and for spaCy's
|
||||||
|
|
|
@ -142,14 +142,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/dependencyparser#call) and
|
||||||
|
|
||||||
## DependencyParser.begin_training {#begin_training tag="method"}
|
## DependencyParser.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
Initialize the pipe for training, using data examples if available. Returns an
|
Initialize the component for training and return an
|
||||||
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
|
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
|
||||||
|
function that returns an iterable of [`Example`](/api/example) objects. The data
|
||||||
|
examples are used to **initialize the model** of the component and can either be
|
||||||
|
the full training data or a representative sample. Initialization includes
|
||||||
|
validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme based on the data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> parser = nlp.add_pipe("parser")
|
> parser = nlp.add_pipe("parser")
|
||||||
> optimizer = parser.begin_training(pipeline=nlp.pipeline)
|
> optimizer = parser.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
|
|
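The updated `begin_training` description says that `get_examples` is a function returning an iterable of examples, and that the data is used to set up the label scheme. The following is a minimal sketch of that calling convention in plain Python, not spaCy's implementation: `Sample`, `collect_labels` and `count_examples` are hypothetical names, and `(text, label)` tuples stand in for real `Example` objects. One plausible benefit of passing a function rather than a list is that the data can be iterated more than once, even when it is produced by a generator.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical stand-in for spaCy's Example objects: (text, label) pairs.
Sample = Tuple[str, str]


def collect_labels(get_examples: Callable[[], Iterable[Sample]]) -> List[str]:
    """Hypothetical helper: derive a sorted label scheme from the data."""
    labels: Set[str] = set()
    for _text, label in get_examples():
        labels.add(label)
    return sorted(labels)


def count_examples(get_examples: Callable[[], Iterable[Sample]]) -> int:
    """Hypothetical helper: a second pass over the same data."""
    return sum(1 for _ in get_examples())


def get_examples() -> Iterable[Sample]:
    # A generator is exhausted after one pass, but calling the function
    # again creates a fresh one, so several passes are possible.
    yield ("I like stuff", "POSITIVE")
    yield ("I dislike stuff", "NEGATIVE")
    yield ("Stuff is fine", "POSITIVE")


labels = collect_labels(get_examples)
n_examples = count_examples(get_examples)
print(labels, n_examples)
```

The same shape applies to the `lambda: []` used in the component examples: a callable that takes no arguments and returns an (here empty) iterable.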
|
@ -142,14 +142,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/entitylinker#call) and
|
||||||
|
|
||||||
## EntityLinker.begin_training {#begin_training tag="method"}
|
## EntityLinker.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
Initialize the pipe for training, using data examples if available. Returns an
|
Initialize the component for training and return an
|
||||||
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
|
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
|
||||||
|
function that returns an iterable of [`Example`](/api/example) objects. The data
|
||||||
|
examples are used to **initialize the model** of the component and can either be
|
||||||
|
the full training data or a representative sample. Initialization includes
|
||||||
|
validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme based on the data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> entity_linker = nlp.add_pipe("entity_linker", last=True)
|
> entity_linker = nlp.add_pipe("entity_linker", last=True)
|
||||||
> optimizer = entity_linker.begin_training(pipeline=nlp.pipeline)
|
> optimizer = entity_linker.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
|
|
|
@ -131,14 +131,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/entityrecognizer#call) and
|
||||||
|
|
||||||
## EntityRecognizer.begin_training {#begin_training tag="method"}
|
## EntityRecognizer.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
Initialize the pipe for training, using data examples if available. Returns an
|
Initialize the component for training and return an
|
||||||
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
|
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
|
||||||
|
function that returns an iterable of [`Example`](/api/example) objects. The data
|
||||||
|
examples are used to **initialize the model** of the component and can either be
|
||||||
|
the full training data or a representative sample. Initialization includes
|
||||||
|
validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme based on the data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> ner = nlp.add_pipe("ner")
|
> ner = nlp.add_pipe("ner")
|
||||||
> optimizer = ner.begin_training(pipeline=nlp.pipeline)
|
> optimizer = ner.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
|
|
|
@ -200,12 +200,28 @@ more efficient than processing texts one-by-one.
|
||||||
|
|
||||||
## Language.begin_training {#begin_training tag="method"}
|
## Language.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
Initialize the pipe for training, using data examples if available. Returns an
|
Initialize the pipeline for training and return an
|
||||||
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
|
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
|
||||||
|
function that returns an iterable of [`Example`](/api/example) objects. The data
|
||||||
|
examples can either be the full training data or a representative sample. They
|
||||||
|
are used to **initialize the models** of trainable pipeline components and are
|
||||||
|
passed to each component's [`begin_training`](/api/pipe#begin_training) method, if
|
||||||
|
available. Initialization includes validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme based on the data.
|
||||||
|
|
||||||
|
<Infobox variant="warning" title="Changed in v3.0">
|
||||||
|
|
||||||
|
The `Language.begin_training` method now takes a **function** that is called with no
|
||||||
|
arguments and returns a sequence of [`Example`](/api/example) objects instead of
|
||||||
|
tuples of `Doc` and `GoldParse` objects.
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
|
> get_examples = lambda: examples
|
||||||
> optimizer = nlp.begin_training(get_examples)
|
> optimizer = nlp.begin_training(get_examples)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
|
@ -276,7 +292,7 @@ and custom registered functions if needed. See the
|
||||||
| `component_cfg` | `Dict[str, dict]` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. |
|
| `component_cfg` | `Dict[str, dict]` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. |
|
||||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
||||||
|
|
||||||
## Language.rehearse {#rehearse tag="method,experimental"}
|
## Language.rehearse {#rehearse tag="method,experimental" new="3"}
|
||||||
|
|
||||||
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
|
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
|
||||||
current model to make predictions similar to an initial model, to try to address
|
current model to make predictions similar to an initial model, to try to address
|
||||||
|
@ -302,6 +318,13 @@ the "catastrophic forgetting" problem. This feature is experimental.
|
||||||
|
|
||||||
Evaluate a model's pipeline components.
|
Evaluate a model's pipeline components.
|
||||||
|
|
||||||
|
<Infobox variant="warning" title="Changed in v3.0">
|
||||||
|
|
||||||
|
The `Language.evaluate` method now takes a batch of [`Example`](/api/example)
|
||||||
|
objects instead of tuples of `Doc` and `GoldParse` objects.
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
|
|
|
@ -121,15 +121,21 @@ applied to the `Doc` in order. Both [`__call__`](/api/morphologizer#call) and
|
||||||
|
|
||||||
## Morphologizer.begin_training {#begin_training tag="method"}
|
## Morphologizer.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
Initialize the pipe for training, using data examples if available. Returns an
|
Initialize the component for training and return an
|
||||||
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
|
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
|
||||||
|
function that returns an iterable of [`Example`](/api/example) objects. The data
|
||||||
|
examples are used to **initialize the model** of the component and can either be
|
||||||
|
the full training data or a representative sample. Initialization includes
|
||||||
|
validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme based on the data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> morphologizer = nlp.add_pipe("morphologizer")
|
> morphologizer = nlp.add_pipe("morphologizer")
|
||||||
> nlp.pipeline.append(morphologizer)
|
> nlp.pipeline.append(morphologizer)
|
||||||
> optimizer = morphologizer.begin_training(pipeline=nlp.pipeline)
|
> optimizer = morphologizer.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
|
|
|
@ -9,8 +9,8 @@ components like the [`EntityRecognizer`](/api/entityrecognizer) or
|
||||||
[`TextCategorizer`](/api/textcategorizer) inherit from it and it defines the
|
[`TextCategorizer`](/api/textcategorizer) inherit from it and it defines the
|
||||||
interface that components should follow to function as trainable components in a
|
interface that components should follow to function as trainable components in a
|
||||||
spaCy pipeline. See the docs on
|
spaCy pipeline. See the docs on
|
||||||
[writing trainable components](/usage/processing-pipelines#trainable) for how to
|
[writing trainable components](/usage/processing-pipelines#trainable-components)
|
||||||
use the `Pipe` base class to implement custom components.
|
for how to use the `Pipe` base class to implement custom components.
|
||||||
|
|
||||||
> #### Why is Pipe implemented in Cython?
|
> #### Why is Pipe implemented in Cython?
|
||||||
>
|
>
|
||||||
|
@ -106,14 +106,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/pipe#call) and
|
||||||
|
|
||||||
## Pipe.begin_training {#begin_training tag="method"}
|
## Pipe.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
Initialize the pipe for training, using data examples if available. Returns an
|
Initialize the component for training and return an
|
||||||
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
|
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
|
||||||
|
function that returns an iterable of [`Example`](/api/example) objects. The data
|
||||||
|
examples are used to **initialize the model** of the component and can either be
|
||||||
|
the full training data or a representative sample. Initialization includes
|
||||||
|
validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme based on the data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> pipe = nlp.add_pipe("your_custom_pipe")
|
> pipe = nlp.add_pipe("your_custom_pipe")
|
||||||
> optimizer = pipe.begin_training(pipeline=nlp.pipeline)
|
> optimizer = pipe.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
|
@ -200,7 +206,7 @@ This method needs to be overwritten with your own custom `update` method.
|
||||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
|
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
|
||||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
||||||
|
|
||||||
## Pipe.rehearse {#rehearse tag="method,experimental"}
|
## Pipe.rehearse {#rehearse tag="method,experimental" new="3"}
|
||||||
|
|
||||||
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
|
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
|
||||||
current model to make predictions similar to an initial model, to try to address
|
current model to make predictions similar to an initial model, to try to address
|
||||||
|
|
|
@ -116,14 +116,20 @@ and [`pipe`](/api/sentencerecognizer#pipe) delegate to the
|
||||||
|
|
||||||
## SentenceRecognizer.begin_training {#begin_training tag="method"}
|
## SentenceRecognizer.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
Initialize the pipe for training, using data examples if available. Returns an
|
Initialize the component for training and return an
|
||||||
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
|
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
|
||||||
|
function that returns an iterable of [`Example`](/api/example) objects. The data
|
||||||
|
examples are used to **initialize the model** of the component and can either be
|
||||||
|
the full training data or a representative sample. Initialization includes
|
||||||
|
validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme based on the data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> senter = nlp.add_pipe("senter")
|
> senter = nlp.add_pipe("senter")
|
||||||
> optimizer = senter.begin_training(pipeline=nlp.pipeline)
|
> optimizer = senter.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
|
@ -193,7 +199,7 @@ Delegates to [`predict`](/api/sentencerecognizer#predict) and
|
||||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
|
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
|
||||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
||||||
|
|
||||||
## SentenceRecognizer.rehearse {#rehearse tag="method,experimental"}
|
## SentenceRecognizer.rehearse {#rehearse tag="method,experimental" new="3"}
|
||||||
|
|
||||||
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
|
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
|
||||||
current model to make predictions similar to an initial model, to try to address
|
current model to make predictions similar to an initial model, to try to address
|
||||||
|
|
|
@ -114,14 +114,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/tagger#call) and
|
||||||
|
|
||||||
## Tagger.begin_training {#begin_training tag="method"}
|
## Tagger.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
Initialize the pipe for training, using data examples if available. Returns an
|
Initialize the component for training and return an
|
||||||
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
|
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
|
||||||
|
function that returns an iterable of [`Example`](/api/example) objects. The data
|
||||||
|
examples are used to **initialize the model** of the component and can either be
|
||||||
|
the full training data or a representative sample. Initialization includes
|
||||||
|
validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme based on the data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> tagger = nlp.add_pipe("tagger")
|
> tagger = nlp.add_pipe("tagger")
|
||||||
> optimizer = tagger.begin_training(pipeline=nlp.pipeline)
|
> optimizer = tagger.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
|
@ -191,7 +197,7 @@ Delegates to [`predict`](/api/tagger#predict) and
|
||||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
|
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
|
||||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
||||||
|
|
||||||
## Tagger.rehearse {#rehearse tag="method,experimental"}
|
## Tagger.rehearse {#rehearse tag="method,experimental" new="3"}
|
||||||
|
|
||||||
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
|
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
|
||||||
current model to make predictions similar to an initial model, to try to address
|
current model to make predictions similar to an initial model, to try to address
|
||||||
|
|
|
@ -122,14 +122,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/textcategorizer#call) and
|
||||||
|
|
||||||
## TextCategorizer.begin_training {#begin_training tag="method"}
|
## TextCategorizer.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
Initialize the pipe for training, using data examples if available. Returns an
|
Initialize the component for training and return an
|
||||||
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
|
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
|
||||||
|
function that returns an iterable of [`Example`](/api/example) objects. The data
|
||||||
|
examples are used to **initialize the model** of the component and can either be
|
||||||
|
the full training data or a representative sample. Initialization includes
|
||||||
|
validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme based on the data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> textcat = nlp.add_pipe("textcat")
|
> textcat = nlp.add_pipe("textcat")
|
||||||
> optimizer = textcat.begin_training(pipeline=nlp.pipeline)
|
> optimizer = textcat.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
|
@ -199,7 +205,7 @@ Delegates to [`predict`](/api/textcategorizer#predict) and
|
||||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
|
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
|
||||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
||||||
|
|
||||||
## TextCategorizer.rehearse {#rehearse tag="method,experimental"}
|
## TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}
|
||||||
|
|
||||||
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
|
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
|
||||||
current model to make predictions similar to an initial model, to try to address
|
current model to make predictions similar to an initial model, to try to address
|
||||||
|
|
|
@ -125,14 +125,20 @@ and [`set_annotations`](/api/tok2vec#set_annotations) methods.
|
||||||
|
|
||||||
## Tok2Vec.begin_training {#begin_training tag="method"}
|
## Tok2Vec.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
Initialize the pipe for training, using data examples if available. Returns an
|
Initialize the component for training and return an
|
||||||
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
|
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
|
||||||
|
function that returns an iterable of [`Example`](/api/example) objects. The data
|
||||||
|
examples are used to **initialize the model** of the component and can either be
|
||||||
|
the full training data or a representative sample. Initialization includes
|
||||||
|
validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme based on the data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> tok2vec = nlp.add_pipe("tok2vec")
|
> tok2vec = nlp.add_pipe("tok2vec")
|
||||||
> optimizer = tok2vec.begin_training(pipeline=nlp.pipeline)
|
> optimizer = tok2vec.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
|
|
|
@ -159,14 +159,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/transformer#call) and
|
||||||
|
|
||||||
## Transformer.begin_training {#begin_training tag="method"}
|
## Transformer.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
Initialize the pipe for training, using data examples if available. Returns an
|
Initialize the component for training and return an
|
||||||
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
|
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
|
||||||
|
function that returns an iterable of [`Example`](/api/example) objects. The data
|
||||||
|
examples are used to **initialize the model** of the component and can either be
|
||||||
|
the full training data or a representative sample. Initialization includes
|
||||||
|
validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme based on the data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> trf = nlp.add_pipe("transformer")
|
> trf = nlp.add_pipe("transformer")
|
||||||
> optimizer = trf.begin_training(pipeline=nlp.pipeline)
|
> optimizer = trf.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
|
|
|
@ -45,9 +45,9 @@ three components:
|
||||||
2. **Genre:** Type of text the model is trained on, e.g. `web` or `news`.
|
2. **Genre:** Type of text the model is trained on, e.g. `web` or `news`.
|
||||||
3. **Size:** Model size indicator, `sm`, `md` or `lg`.
|
3. **Size:** Model size indicator, `sm`, `md` or `lg`.
|
||||||
|
|
||||||
For example, `en_core_web_sm` is a small English model trained on written web
|
For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English
|
||||||
text (blogs, news, comments), that includes vocabulary, vectors, syntax and
|
model trained on written web text (blogs, news, comments), that includes
|
||||||
entities.
|
vocabulary, vectors, syntax and entities.
|
||||||
|
|
||||||
### Model versioning {#model-versioning}
|
### Model versioning {#model-versioning}
|
||||||
|
|
||||||
|
|
|
@ -687,13 +687,13 @@ give you everything you need to train fully custom models with
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
<!-- TODO: maybe add something about why the Example class is great and its benefits, and how it's passed around, holds the alignment etc -->
|
|
||||||
|
|
||||||
The [`Example`](/api/example) object contains annotated training data, also
|
The [`Example`](/api/example) object contains annotated training data, also
|
||||||
called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
|
called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
|
||||||
that will hold the predictions, and another `Doc` object that holds the
|
that will hold the predictions, and another `Doc` object that holds the
|
||||||
gold-standard annotations. Here's an example of a simple `Example` for
|
gold-standard annotations. It also includes the **alignment** between those two
|
||||||
part-of-speech tags:
|
documents if they differ in tokenization. The `Example` class ensures that spaCy
|
||||||
|
can rely on one **standardized format** that's passed through the pipeline.
|
||||||
|
Here's an example of a simple `Example` for part-of-speech tags:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
words = ["I", "like", "stuff"]
|
words = ["I", "like", "stuff"]
|
||||||
|
@ -744,7 +744,8 @@ example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O"
|
||||||
|
|
||||||
As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class.
|
As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class.
|
||||||
It can be constructed in a very similar way, from a `Doc` and a dictionary of
|
It can be constructed in a very similar way, from a `Doc` and a dictionary of
|
||||||
annotations:
|
annotations. For more details, see the
|
||||||
|
[migration guide](/usage/v3#migrating-training).
|
||||||
|
|
||||||
```diff
|
```diff
|
||||||
- gold = GoldParse(doc, entities=entities)
|
- gold = GoldParse(doc, entities=entities)
|
||||||
|
|
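The revised paragraph on `Example` mentions an **alignment** between the predicted and the gold-standard document when their tokenization differs. As a rough illustration of what such an alignment records, here is a small character-offset sketch in plain Python. It is not spaCy's alignment code: the function names `char_to_token` and `align` are hypothetical, and the sketch assumes both tokenizations cover exactly the same characters with whitespace left out.

```python
from typing import List


def char_to_token(tokens: List[str]) -> List[int]:
    """Map every character position to the index of the token containing it."""
    mapping: List[int] = []
    for i, token in enumerate(tokens):
        mapping.extend([i] * len(token))
    return mapping


def align(predicted: List[str], gold: List[str]) -> List[List[int]]:
    """For each predicted token, list the gold token indices it overlaps."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("Both tokenizations must cover the same characters")
    gold_map = char_to_token(gold)
    alignment: List[List[int]] = []
    start = 0
    for token in predicted:
        end = start + len(token)
        # Deduplicate the gold indices seen under this token's characters.
        alignment.append(sorted(set(gold_map[start:end])))
        start = end
    return alignment


predicted_tokens = ["I", "ca", "n't", "go"]
gold_tokens = ["I", "can't", "go"]
print(align(predicted_tokens, gold_tokens))
```

Here the two predicted tokens `ca` and `n't` both map to the single gold token `can't`, which is the kind of many-to-one relation an alignment has to capture before annotations can be compared across the two documents.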
|
@ -14,12 +14,49 @@ menu:
|
||||||
|
|
||||||
### New training workflow and config system {#features-training}
|
### New training workflow and config system {#features-training}
|
||||||
|
|
||||||
|
<Infobox title="Details & Documentation" emoji="📖" list>
|
||||||
|
|
||||||
|
- **Usage:** [Training models](/usage/training)
|
||||||
|
- **Thinc:** [Thinc's config system](https://thinc.ai/docs/usage-config),
|
||||||
|
[`Config`](https://thinc.ai/docs/api-config#config)
|
||||||
|
- **CLI:** [`train`](/api/cli#train), [`pretrain`](/api/cli#pretrain),
|
||||||
|
[`evaluate`](/api/cli#evaluate)
|
||||||
|
- **API:** [Config format](/api/data-formats#config),
|
||||||
|
[`registry`](/api/top-level#registry)
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
### Transformer-based pipelines {#features-transformers}
|
### Transformer-based pipelines {#features-transformers}
|
||||||
|
|
||||||
|
<Infobox title="Details & Documentation" emoji="📖" list>
|
||||||
|
|
||||||
|
- **Usage:** [Transformers](/usage/transformers),
|
||||||
|
[Training models](/usage/training)
|
||||||
|
- **API:** [`Transformer`](/api/transformer),
|
||||||
|
[`TransformerData`](/api/transformer#transformerdata),
|
||||||
|
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
|
||||||
|
- **Architectures:** [TransformerModel](/api/architectures#TransformerModel),
|
||||||
|
[Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
|
||||||
|
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
|
||||||
|
- **Models:** [`en_core_bert_sm`](/models/en)
|
||||||
|
- **Implementation:**
|
||||||
|
[`spacy-transformers`](https://github.com/explosion/spacy-transformers)
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
### Custom models using any framework {#feautres-custom-models}
|
### Custom models using any framework {#feautres-custom-models}
|
||||||
|
|
||||||
### Manage end-to-end workflows with projects {#features-projects}
|
### Manage end-to-end workflows with projects {#features-projects}
|
||||||
|
|
||||||
|
<Infobox title="Details & Documentation" emoji="📖" list>
|
||||||
|
|
||||||
|
- **Usage:** [spaCy projects](/usage/projects),
|
||||||
|
[Training models](/usage/training)
|
||||||
|
- **CLI:** [`project`](/api/cli#project), [`train`](/api/cli#train)
|
||||||
|
- **Templates:** [`projects`](https://github.com/explosion/projects)
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
### New built-in pipeline components {#features-pipeline-components}
|
### New built-in pipeline components {#features-pipeline-components}
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
|
@ -30,14 +67,48 @@ menu:
|
||||||
| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
|
| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
|
||||||
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
|
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
|
||||||
|
|
||||||
|
<Infobox title="Details & Documentation" emoji="📖" list>
|
||||||
|
|
||||||
|
- **Usage:** [Processing pipelines](/usage/processing-pipelines)
|
||||||
|
- **API:** [Built-in pipeline components](/api#architecture-pipeline)
|
||||||
|
- **Implementation:**
|
||||||
|
[`spacy/pipeline`](https://github.com/explosion/spaCy/tree/develop/spacy/pipeline)
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
### New and improved pipeline component APIs {#features-components}
|
### New and improved pipeline component APIs {#features-components}
|
||||||
|
|
||||||
- `Language.factory`, `Language.component`
|
- `Language.factory`, `Language.component`
|
||||||
- `Language.analyze_pipes`
|
- `Language.analyze_pipes`
|
||||||
- Adding components from other models
|
- Adding components from other models
|
||||||
|
|
||||||
|
<Infobox title="Details & Documentation" emoji="📖" list>
|
||||||
|
|
||||||
|
- **Usage:** [Custom components](/usage/processing-pipelines#custom_components),
|
||||||
|
[Defining components during training](/usage/training#config-components)
|
||||||
|
- **API:** [`Language`](/api/language)
|
||||||
|
- **Implementation:**
|
||||||
|
[`spacy/language.py`](https://github.com/explosion/spaCy/tree/develop/spacy/language.py)
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
### Type hints and type-based data validation {#features-types}
|
### Type hints and type-based data validation {#features-types}
|
||||||
|
|
||||||
|
> #### Example
>
> ```python
> from spacy.language import Language
> from pydantic import StrictBool
>
> @Language.factory("my_component")
> def create_my_component(
>     nlp: Language,
>     name: str,
>     custom: StrictBool
> ):
>     ...
> ```
|
||||||
|
|
||||||
spaCy v3.0 officially drops support for Python 2 and now requires **Python
|
spaCy v3.0 officially drops support for Python 2 and now requires **Python
|
||||||
3.6+**. This also means that the code base can take full advantage of
|
3.6+**. This also means that the code base can take full advantage of
|
||||||
[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing
|
[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing
|
||||||
|
@ -54,13 +125,36 @@ validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which
|
||||||
lets you register **custom functions with typed arguments**, reference them
|
lets you register **custom functions with typed arguments**, reference them
|
||||||
in your config and see validation errors if the argument values don't match.
|
in your config and see validation errors if the argument values don't match.
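
To make the idea of type-based validation concrete, here is a minimal sketch that uses only the standard library. It is not how spaCy or Thinc implement validation (the example above imports `StrictBool` from pydantic); `validate_call` and the simplified `create_my_component` are hypothetical names introduced for illustration, and the check only handles plain classes such as `bool` or `str`.

```python
from typing import get_type_hints


def validate_call(func, **kwargs):
    """Check keyword arguments against func's type hints, then call it.

    Illustrative only: supports plain classes (bool, int, str),
    not generics such as List[int].
    """
    hints = get_type_hints(func)
    for name, value in kwargs.items():
        expected = hints.get(name)
        # Compare exact types so that "yes" or 1 is rejected for a bool
        # argument, roughly mimicking a strict check.
        if expected is not None and type(value) is not expected:
            raise TypeError(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return func(**kwargs)


def create_my_component(name: str, custom: bool):
    # Hypothetical stand-in for a registered component factory.
    return {"name": name, "custom": custom}


component = validate_call(create_my_component, name="my_component", custom=True)

try:
    validate_call(create_my_component, name="my_component", custom="yes")
except TypeError as err:
    error_message = str(err)
```

The second call fails before the factory runs, which is the behavior the config validation aims for: a mismatched value is reported up front rather than surfacing later during training.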
|
||||||
|
|
||||||
### CLI
|
<Infobox title="Details & Documentation" emoji="📖" list>
|
||||||
|
|
||||||
| Name | Description |
|
- **Usage:**
|
||||||
| --------------------------------------- | -------------------------------------------------------------------------------------------------------- |
|
[Component type hints and validation](/usage/processing-pipelines#type-hints),
|
||||||
| [`init config`](/api/cli#init-config) | Initialize a [training config](/usage/training) file for a blank language or auto-fill a partial config. |
|
[Training with custom code](/usage/training#custom-code)
|
||||||
| [`debug config`](/api/cli#debug-config) | Debug a [training config](/usage/training) file and show validation errors. |
|
- **Thinc:**
|
||||||
| [`project`](/api/cli#project) | Subcommand for cloning and running [spaCy projects](/usage/projects). |
|
[Type checking in Thinc](https://thinc.ai/docs/usage-type-checking),
|
||||||
|
[Thinc's config system](https://thinc.ai/docs/usage-config)
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
|
### New methods, attributes and commands
|
||||||
|
|
||||||
|
The following methods, attributes and commands are new in spaCy v3.0.
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
|
| [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). |
|
||||||
|
| [`Language.select_pipes`](/api/language#select_pipes) | Contextmanager for enabling or disabling specific pipeline components for a block. |
|
||||||
|
| [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. |
|
||||||
|
| [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. |
|
||||||
|
| [`@Language.factory`](/api/language#factory) [`@Language.component`](/api/language#component) | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions. |
|
||||||
|
| [`Language.has_factory`](/api/language#has_factory) | Check whether a component factory is registered on a language class. |
|
||||||
|
| [`Language.get_factory_meta`](/api/language#get_factory_meta) [`Language.get_pipe_meta`](/api/language#get_pipe_meta) | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name. |
|
||||||
|
| [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. It is an instance of [`Config`](https://thinc.ai/docs/api-config#config) that can be saved to disk and used for training. |
|
||||||
|
| [`Pipe.score`](/api/pipe#score) | Method on trainable pipeline components that returns a dictionary of evaluation scores. |
|
||||||
|
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
|
||||||
|
| [`init config`](/api/cli#init-config) | CLI command for initializing a [training config](/usage/training) file for a blank language or auto-filling a partial config. |
|
||||||
|
| [`debug config`](/api/cli#debug-config) | CLI command for debugging a [training config](/usage/training) file and showing validation errors. |
|
||||||
|
| [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). |
|
||||||
|
|
||||||
## Backwards Incompatibilities {#incompat}
|
## Backwards Incompatibilities {#incompat}
|
||||||
|
|
||||||
|
@ -70,12 +164,21 @@ usability. The following section lists the relevant changes to the user-facing
|
||||||
API. For specific examples of how to rewrite your code, check out the
|
API. For specific examples of how to rewrite your code, check out the
|
||||||
[migration guide](#migrating).
|
[migration guide](#migrating).
|
||||||
|
|
||||||
### Compatibility {#incompat-compat}
|
<Infobox variant="warning">
|
||||||
|
|
||||||
- spaCy now requires **Python 3.6+**.
|
Note that spaCy v3.0 now requires **Python 3.6+**.
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
### API changes {#incompat-api}
|
### API changes {#incompat-api}
|
||||||
|
|
||||||
|
- Model symlinks, the `link` command and shortcut names are now deprecated.
|
||||||
|
There can be many [different models](/models) and not just one "English
|
||||||
|
model", so you should always use the full model name like
|
||||||
|
[`en_core_web_sm`](/models/en) explicitly.
|
||||||
|
- The [`train`](/api/cli#train) and [`pretrain`](/api/cli#pretrain) commands now
|
||||||
|
only take a `config.cfg` file containing the full
|
||||||
|
[training config](/usage/training#config).
|
||||||
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
|
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
|
||||||
the component factory instead of the component function.
|
the component factory instead of the component function.
|
||||||
- **Custom pipeline components** now need to be decorated with the
|
- **Custom pipeline components** now need to be decorated with the
|
||||||
|
@ -87,6 +190,20 @@ API. For specific examples of how to rewrite your code, check out the
|
||||||
- The `Language.disable_pipes` contextmanager has been replaced by
|
- The `Language.disable_pipes` contextmanager has been replaced by
|
||||||
[`Language.select_pipes`](/api/language#select_pipes), which can explicitly
|
[`Language.select_pipes`](/api/language#select_pipes), which can explicitly
|
||||||
disable or enable components.
|
disable or enable components.
|
||||||
|
- The [`Language.update`](/api/language#update),
|
||||||
|
[`Language.evaluate`](/api/language#evaluate) and
|
||||||
|
[`Pipe.update`](/api/pipe#update) methods now all take batches of
|
||||||
|
[`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
|
||||||
|
raw text and a dictionary of annotations.
|
||||||
|
[`Language.begin_training`](/api/language#begin_training) and
|
||||||
|
[`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
|
||||||
|
returns a sequence of `Example` objects to initialize the model instead of a
|
||||||
|
list of tuples.
|
||||||
|
- [`Matcher.add`](/api/matcher#add),
|
||||||
|
[`PhraseMatcher.add`](/api/phrasematcher#add) and
|
||||||
|
[`DependencyMatcher.add`](/api/dependencymatcher#add) now only accept a list
|
||||||
|
of patterns as the second argument (instead of a variable number of
|
||||||
|
arguments). The `on_match` callback becomes an optional keyword argument.
|
||||||
|
|
||||||
### Removed or renamed API {#incompat-removed}
|
### Removed or renamed API {#incompat-removed}
|
||||||
|
|
||||||
|
@ -96,6 +213,7 @@ API. For specific examples of how to rewrite your code, check out the
|
||||||
| `GoldParse` | [`Example`](/api/example) |
|
| `GoldParse` | [`Example`](/api/example) |
|
||||||
| `GoldCorpus` | [`Corpus`](/api/corpus) |
|
| `GoldCorpus` | [`Corpus`](/api/corpus) |
|
||||||
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
|
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
|
||||||
|
| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
|
||||||
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
|
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
|
||||||
|
|
||||||
The following deprecated methods, attributes and arguments were removed in v3.0.
|
The following deprecated methods, attributes and arguments were removed in v3.0.
|
||||||
|
@ -121,7 +239,7 @@ on them.
|
||||||
Model symlinks and shortcuts like `en` are now officially deprecated. There are
|
Model symlinks and shortcuts like `en` are now officially deprecated. There are
|
||||||
[many different models](/models) with different capabilities and not just one
|
[many different models](/models) with different capabilities and not just one
|
||||||
"English model". In order to download and load a model, you should always use
|
"English model". In order to download and load a model, you should always use
|
||||||
its full name – for instance, `en_core_web_sm`.
|
its full name – for instance, [`en_core_web_sm`](/models/en#en_core_web_sm).
|
||||||
|
|
||||||
```diff
|
```diff
|
||||||
- python -m spacy download en
|
- python -m spacy download en
|
||||||
|
@ -224,6 +342,51 @@ and you typically shouldn't have to use it in your code.
|
||||||
+ parser = nlp.add_pipe("parser")
|
+ parser = nlp.add_pipe("parser")
|
||||||
```
|
```
|
||||||
|
|
||||||
|
If you need to add a component from an existing pretrained model, you can now
|
||||||
|
use the `source` argument on [`nlp.add_pipe`](/api/language#add_pipe). This will
|
||||||
|
check that the component is compatible, and take care of porting over all
|
||||||
|
config. During training, you can also reference existing pretrained components
|
||||||
|
in your [config](/usage/training#config-components) and decide whether or not
|
||||||
|
they should be updated with more data.
|
||||||
|
|
||||||
|
> #### config.cfg (excerpt)
|
||||||
|
>
|
||||||
|
> ```ini
|
||||||
|
> [components.ner]
|
||||||
|
> source = "en_core_web_sm"
|
||||||
|
> component = "ner"
|
||||||
|
> ```
|
||||||
|
|
||||||
|
```diff
|
||||||
|
source_nlp = spacy.load("en_core_web_sm")
|
||||||
|
nlp = spacy.blank("en")
|
||||||
|
- ner = source_nlp.get_pipe("ner")
|
||||||
|
- nlp.add_pipe(ner)
|
||||||
|
+ nlp.add_pipe("ner", source=source_nlp)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Adding match patterns {#migrating-matcher}
|
||||||
|
|
||||||
|
The [`Matcher.add`](/api/matcher#add),
|
||||||
|
[`PhraseMatcher.add`](/api/phrasematcher#add) and
|
||||||
|
[`DependencyMatcher.add`](/api/dependencymatcher#add) methods now only accept a
|
||||||
|
**list of patterns** as the second argument (instead of a variable number of
|
||||||
|
arguments). The `on_match` callback becomes an optional keyword argument.
|
||||||
|
|
||||||
|
```diff
|
||||||
|
matcher = Matcher(nlp.vocab)
|
||||||
|
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
|
||||||
|
- matcher.add("GoogleNow", on_match, *patterns)
|
||||||
|
+ matcher.add("GoogleNow", patterns, on_match=on_match)
|
||||||
|
```
|
||||||
|
|
||||||
|
```diff
|
||||||
|
matcher = PhraseMatcher(nlp.vocab)
|
||||||
|
patterns = [nlp("health care reform"), nlp("healthcare reform")]
|
||||||
|
- matcher.add("HEALTH", on_match, *patterns)
|
||||||
|
+ matcher.add("HEALTH", patterns, on_match=on_match)
|
||||||
|
```
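
If a code base has many v2-style calls, the argument reshuffle shown in the diffs above can be expressed as a small adapter. This is only a sketch: `to_v3_add_args` is a hypothetical helper, not part of spaCy, and it assumes the old positional order `(key, on_match, *patterns)` shown in the removed lines.

```python
def to_v3_add_args(key, on_match, *patterns):
    """Convert v2-style Matcher.add arguments to the v3 form.

    Hypothetical helper, not part of spaCy. Returns (args, kwargs) so the
    new call becomes matcher.add(*args, **kwargs).
    """
    kwargs = {}
    if on_match is not None:
        # The callback is now an optional keyword argument.
        kwargs["on_match"] = on_match
    # All patterns are passed together as one list in second position.
    return (key, list(patterns)), kwargs


patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
args, kwargs = to_v3_add_args("GoogleNow", None, *patterns)
# In real code: matcher.add(*args, **kwargs)
```

Passing `None` as the callback yields no `on_match` keyword at all, which matches the callback being optional in the new signature.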
|
||||||
|
|
||||||
### Training models {#migrating-training}
|
### Training models {#migrating-training}
|
||||||
|
|
||||||
To train your models, you should now pretty much always use the
|
To train your models, you should now pretty much always use the
|
||||||
|
@ -233,15 +396,20 @@ use a [flexible config file](/usage/training#config) that describes all training
|
||||||
settings and hyperparameters, as well as your pipeline, model components and
|
settings and hyperparameters, as well as your pipeline, model components and
|
||||||
architectures to use. The `--code` argument lets you pass in code containing
|
architectures to use. The `--code` argument lets you pass in code containing
|
||||||
[custom registered functions](/usage/training#custom-code) that you can
|
[custom registered functions](/usage/training#custom-code) that you can
|
||||||
reference in your config.
|
reference in your config. To get started, check out the
|
||||||
|
[quickstart widget](/usage/training#quickstart).
|
||||||
|
|
||||||
#### Binary .spacy training data format {#migrating-training-format}
|
#### Binary .spacy training data format {#migrating-training-format}
|
||||||
|
|
||||||
spaCy now uses a new
|
spaCy v3.0 uses a new
|
||||||
[binary training data format](/api/data-formats#binary-training), which is much
|
[binary training data format](/api/data-formats#binary-training) created by
|
||||||
smaller and consists of `Doc` objects, serialized via the
|
serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc`
|
||||||
[`DocBin`](/api/docbin). You can convert your existing JSON-formatted data using
|
objects. This means that you can train spaCy models using the same format it
|
||||||
the [`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:
|
outputs: annotated `Doc` objects. The binary format is extremely **efficient in
|
||||||
|
storage**, especially when packing multiple documents together.
|
||||||
|
|
||||||
|
You can convert your existing JSON-formatted data using the
|
||||||
|
[`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
$ python -m spacy convert ./training.json ./output
|
$ python -m spacy convert ./training.json ./output
|
||||||
|
@ -273,13 +441,72 @@ workflows, from data preprocessing to training and packaging your model.
|
||||||
|
|
||||||
</Project>
|
</Project>
|
||||||
|
|
||||||
#### Migrating training scripts to CLI command and config {#migrating-training-scripts}
|
|
||||||
|
|
||||||
<!-- TODO: write -->
|
|
||||||
|
|
||||||
#### Training via the Python API {#migrating-training-python}
|
#### Training via the Python API {#migrating-training-python}
|
||||||
|
|
||||||
<!-- TODO: this should explain the GoldParse -> Example stuff -->
|
For most use cases, you **shouldn't** have to write your own training scripts
|
||||||
|
anymore. Instead, you can use [`spacy train`](/api/cli#train) with a
|
||||||
|
[config file](/usage/training#config) and custom
|
||||||
|
[registered functions](/usage/training#custom-code) if needed. You can even
|
||||||
|
register callbacks that can modify the `nlp` object at different stages of its
|
||||||
|
lifecycle to fully customize it before training.
|
||||||
|
|
||||||
|
If you do decide to use the [internal training API](/usage/training#api) from
|
||||||
|
Python, you should only need a few small modifications to convert your scripts
|
||||||
|
from spaCy v2.x to v3.x. The [`Example.from_dict`](/api/example#from_dict)
|
||||||
|
classmethod takes a reference `Doc` and a
|
||||||
|
[dictionary of annotations](/api/data-formats#dict-input), similar to the
|
||||||
|
"simple training style" in spaCy v2.x:
|
||||||
|
|
||||||
|
```diff
|
||||||
|
### Migrating Doc and GoldParse
|
||||||
|
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
|
||||||
|
entities = [(0, 15, "PERSON"), (30, 38, "ORG")]
|
||||||
|
- gold = GoldParse(doc, entities=entities)
|
||||||
|
+ example = Example.from_dict(doc, {"entities": entities})
|
||||||
|
```
|
||||||
|
|
||||||
|
```diff
|
||||||
|
### Migrating simple training style
|
||||||
|
text = "Mark Zuckerberg is the CEO of Facebook"
|
||||||
|
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
|
||||||
|
+ doc = nlp.make_doc(text)
|
||||||
|
+ example = Example.from_dict(doc, annotations)
|
||||||
|
```
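
The `"entities"` values in the annotation dictionaries above are `(start, end, label)` tuples of character offsets into the text, with an exclusive end. Before building `Example` objects, it can help to confirm that the offsets cover the intended substrings. The following is a standard-library sketch; `check_entity_offsets` is a hypothetical helper and does not reproduce any validation spaCy itself performs.

```python
def check_entity_offsets(text, annotations):
    """Return the (substring, label) pair covered by each entity offset."""
    spans = []
    for start, end, label in annotations.get("entities", []):
        if not 0 <= start < end <= len(text):
            raise ValueError(f"Offsets ({start}, {end}) fall outside the text")
        # The end offset is exclusive, as in Python slicing.
        spans.append((text[start:end], label))
    return spans


text = "Mark Zuckerberg is the CEO of Facebook"
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
spans = check_entity_offsets(text, annotations)
print(spans)  # [('Mark Zuckerberg', 'PERSON'), ('Facebook', 'ORG')]
```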
|
||||||
|
|
||||||
|
The [`Language.update`](/api/language#update),
|
||||||
|
[`Language.evaluate`](/api/language#evaluate) and
|
||||||
|
[`Pipe.update`](/api/pipe#update) methods now all take batches of
|
||||||
|
[`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
|
||||||
|
raw text and a dictionary of annotations.
|
||||||
|
|
||||||
|
```python
### Training loop {highlight="11"}
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]
nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for batch in minibatch(TRAIN_DATA):
        examples = []
        for text, annots in batch:
            examples.append(Example.from_dict(nlp.make_doc(text), annots))
        nlp.update(examples)
```
|
||||||
|
|
||||||
|
[`Language.begin_training`](/api/language#begin_training) and
|
||||||
|
[`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
|
||||||
|
returns a sequence of `Example` objects to initialize the model instead of a
|
||||||
|
list of tuples. The data examples are used to **initialize the models** of
|
||||||
|
trainable pipeline components, which includes validating the network,
|
||||||
|
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||||||
|
setting up the label scheme.
|
||||||
|
|
||||||
|
```diff
|
||||||
|
- nlp.begin_training(examples)
|
||||||
|
+ nlp.begin_training(lambda: examples)
|
||||||
|
```
|
||||||
|
|
||||||
#### Packaging models {#migrating-training-packaging}
|
#### Packaging models {#migrating-training-packaging}
|
||||||
|
|
||||||
|
|
|
@ -23,6 +23,7 @@ import { ReactComponent as MoonIcon } from '../images/icons/moon.svg'
|
||||||
import { ReactComponent as ClipboardIcon } from '../images/icons/clipboard.svg'
|
import { ReactComponent as ClipboardIcon } from '../images/icons/clipboard.svg'
|
||||||
import { ReactComponent as NetworkIcon } from '../images/icons/network.svg'
|
import { ReactComponent as NetworkIcon } from '../images/icons/network.svg'
|
||||||
import { ReactComponent as DownloadIcon } from '../images/icons/download.svg'
|
import { ReactComponent as DownloadIcon } from '../images/icons/download.svg'
|
||||||
|
import { ReactComponent as PackageIcon } from '../images/icons/package.svg'
|
||||||
|
|
||||||
import classes from '../styles/icon.module.sass'
|
import classes from '../styles/icon.module.sass'
|
||||||
|
|
||||||
|
@ -49,6 +50,7 @@ const icons = {
|
||||||
clipboard: ClipboardIcon,
|
clipboard: ClipboardIcon,
|
||||||
network: NetworkIcon,
|
network: NetworkIcon,
|
||||||
download: DownloadIcon,
|
download: DownloadIcon,
|
||||||
|
package: PackageIcon,
|
||||||
}
|
}
|
||||||
|
|
||||||
export default function Icon({ name, width = 20, height, inline = false, variant, className }) {
|
export default function Icon({ name, width = 20, height, inline = false, variant, className }) {
|
||||||
|
|
|
@ -5,8 +5,17 @@ import classNames from 'classnames'
|
||||||
import Icon from './icon'
|
import Icon from './icon'
|
||||||
import classes from '../styles/infobox.module.sass'
|
import classes from '../styles/infobox.module.sass'
|
||||||
|
|
||||||
export default function Infobox({ title, emoji, id, variant = 'default', className, children }) {
|
export default function Infobox({
|
||||||
|
title,
|
||||||
|
emoji,
|
||||||
|
id,
|
||||||
|
variant = 'default',
|
||||||
|
list = false,
|
||||||
|
className,
|
||||||
|
children,
|
||||||
|
}) {
|
||||||
const infoboxClassNames = classNames(classes.root, className, {
|
const infoboxClassNames = classNames(classes.root, className, {
|
||||||
|
[classes.list]: !!list,
|
||||||
[classes.warning]: variant === 'warning',
|
[classes.warning]: variant === 'warning',
|
||||||
[classes.danger]: variant === 'danger',
|
[classes.danger]: variant === 'danger',
|
||||||
})
|
})
|
||||||
|
|
|
@ -8,13 +8,21 @@ import Icon from './icon'
|
||||||
import classes from '../styles/link.module.sass'
|
import classes from '../styles/link.module.sass'
|
||||||
import { isString } from './util'
|
import { isString } from './util'
|
||||||
|
|
||||||
const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io)/gi
|
const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io|explosion.ai|course.spacy.io)/gi
|
||||||
|
|
||||||
const Whitespace = ({ children }) => (
|
const Whitespace = ({ children }) => (
|
||||||
// Ensure that links are always wrapped in spaces
|
// Ensure that links are always wrapped in spaces
|
||||||
<> {children} </>
|
<> {children} </>
|
||||||
)
|
)
|
||||||
|
|
||||||
|
function getIcon(dest) {
|
||||||
|
if (/(github.com)/.test(dest)) return 'code'
|
||||||
|
if (/^\/?api\/architectures#/.test(dest)) return 'network'
|
||||||
|
if (/^\/?api/.test(dest)) return 'docs'
|
||||||
|
if (/^\/?models\/(.+)/.test(dest)) return 'package'
|
||||||
|
return null
|
||||||
|
}
|
||||||
|
|
||||||
export default function Link({
|
export default function Link({
|
||||||
children,
|
children,
|
||||||
to,
|
to,
|
||||||
|
@ -30,22 +38,19 @@ export default function Link({
|
||||||
}) {
|
}) {
|
||||||
const dest = to || href
|
const dest = to || href
|
||||||
const external = forceExternal || /(http(s?)):\/\//gi.test(dest)
|
const external = forceExternal || /(http(s?)):\/\//gi.test(dest)
|
||||||
const isApi = !external && !hidden && !hideIcon && /^\/?api/.test(dest)
|
const icon = getIcon(dest)
|
||||||
const isArch = !external && !hidden && !hideIcon && /^\/?api\/architectures#/.test(dest)
|
const withIcon = !hidden && !hideIcon && !!icon
|
||||||
const isSource = external && !hidden && !hideIcon && /(github.com)/.test(dest)
|
|
||||||
const withIcon = isApi || isArch || isSource
|
|
||||||
const sourceWithText = withIcon && isString(children)
|
const sourceWithText = withIcon && isString(children)
|
||||||
const linkClassNames = classNames(classes.root, className, {
|
const linkClassNames = classNames(classes.root, className, {
|
||||||
[classes.hidden]: hidden,
|
[classes.hidden]: hidden,
|
||||||
[classes.nowrap]: (withIcon && !sourceWithText) || isArch,
|
[classes.nowrap]: (withIcon && !sourceWithText) || icon === 'network',
|
||||||
[classes.withIcon]: withIcon,
|
[classes.withIcon]: withIcon,
|
||||||
})
|
})
|
||||||
const Wrapper = ws ? Whitespace : Fragment
|
const Wrapper = ws ? Whitespace : Fragment
|
||||||
const icon = isArch ? 'network' : isApi ? 'docs' : isSource ? 'code' : null
|
|
||||||
const content = (
|
const content = (
|
||||||
<>
|
<>
|
||||||
{sourceWithText ? <span className={classes.sourceText}>{children}</span> : children}
|
{sourceWithText ? <span className={classes.sourceText}>{children}</span> : children}
|
||||||
{icon && <Icon name={icon} width={16} inline className={classes.icon} />}
|
{withIcon && <Icon name={icon} width={16} inline className={classes.icon} />}
|
||||||
</>
|
</>
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
5
website/src/images/icons/package.svg
Normal file
|
@ -0,0 +1,5 @@
|
||||||
|
<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
|
||||||
|
<path fill="none" d="M21 16V8a2 2 0 0 0-1-1.73l-7-4a2 2 0 0 0-2 0l-7 4A2 2 0 0 0 3 8v8a2 2 0 0 0 1 1.73l7 4a2 2 0 0 0 2 0l7-4A2 2 0 0 0 21 16z"></path>
|
||||||
|
<polyline fill="none" points="3.27 6.96 12 12.01 20.73 6.96"></polyline>
|
||||||
|
<line fill="none" x1="12" y1="22.08" x2="12" y2="12"></line>
|
||||||
|
</svg>
|
|
@ -14,6 +14,21 @@
|
||||||
font-size: inherit
|
font-size: inherit
|
||||||
line-height: inherit
|
line-height: inherit
|
||||||
|
|
||||||
|
ul li
|
||||||
|
padding-left: 0.75em
|
||||||
|
|
||||||
|
.list ul li
|
||||||
|
font-size: var(--font-size-sm)
|
||||||
|
list-style: none
|
||||||
|
padding: 0
|
||||||
|
margin: 0 0 0.35rem 0
|
||||||
|
|
||||||
|
&:before
|
||||||
|
all: initial
|
||||||
|
|
||||||
|
a, a span
|
||||||
|
border-bottom: 0 !important
|
||||||
|
|
||||||
.title
|
.title
|
||||||
font-weight: bold
|
font-weight: bold
|
||||||
color: var(--color-theme)
|
color: var(--color-theme)
|
||||||
|
|