Update docs [ci skip]

Ines Montani 2020-08-11 20:57:23 +02:00
parent 10f42e3a39
commit b7ec06e331
22 changed files with 429 additions and 82 deletions


@ -274,7 +274,7 @@ architectures into your training config.
| `get_spans` | `Callable` | Function that takes a batch of [`Doc`](/api/doc) objects and returns lists of [`Span`](/api/span) objects for the transformer to process. [See here](/api/transformer#span_getters) for built-in options and examples. |
| `tokenizer_config` | `Dict[str, Any]` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). |
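A custom span getter only has to follow this signature. For illustration, here's a minimal sketch (the function name is hypothetical, not a built-in):

```python
from typing import List
from spacy.tokens import Doc, Span

def get_custom_spans(docs: List[Doc]) -> List[List[Span]]:
    # Sketch: treat each doc as a single span covering the full text
    return [[doc[:]] for doc in docs]
```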
### spacy-transformers.Tok2VecListener.v1 {#Tok2VecListener}
### spacy-transformers.Tok2VecListener.v1 {#transformers-Tok2VecListener}
> #### Example Config
>


@ -43,7 +43,7 @@ $ python -m spacy download [model] [--direct] [pip args]
| Argument | Type | Description |
| ------------------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model` | positional | Model name, e.g. `en_core_web_sm`.. |
| `model` | positional | Model name, e.g. [`en_core_web_sm`](/models/en#en_core_web_sm). |
| `--direct`, `-d` | flag | Force direct download of exact model version. |
| pip args <Tag variant="new">2.1</Tag> | - | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. |
| `--help`, `-h` | flag | Show help message and available arguments. |


@ -182,10 +182,10 @@ run [`spacy pretrain`](/api/cli#pretrain).
> ```
The main data format used in spaCy v3.0 is a **binary format** created by
serializing a [`DocBin`](/api/docbin) object, which represents a collection of
`Doc` objects. This means that you can train spaCy models using the same format
it outputs: annotated `Doc` objects. The binary format is extremely **efficient
in storage**, especially when packing multiple documents together.
serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc`
objects. This means that you can train spaCy models using the same format it
outputs: annotated `Doc` objects. The binary format is extremely **efficient in
storage**, especially when packing multiple documents together.
Typically, the extension for these binary files is `.spacy`, and they are used
as the input format for specifying a [training corpus](/api/corpus) and for spaCy's
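For illustration, here's a minimal sketch of how such a file can be created from `Doc` objects (file path and texts are placeholders):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
# In practice, these would be Doc objects with gold-standard annotations
docs = [nlp.make_doc("I like stuff"), nlp.make_doc("spaCy is great")]
doc_bin = DocBin()
for doc in docs:
    doc_bin.add(doc)
doc_bin.to_disk("./train.spacy")  # writes the serialized DocBin
```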


@ -142,14 +142,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/dependencyparser#call) and
## DependencyParser.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
Initialize the component for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
function that returns an iterable of [`Example`](/api/example) objects. The data
examples are used to **initialize the model** of the component and can either be
the full training data or a representative sample. Initialization includes
validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data.
> #### Example
>
> ```python
> parser = nlp.add_pipe("parser")
> optimizer = parser.begin_training(pipeline=nlp.pipeline)
> optimizer = parser.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```
| Name | Type | Description |


@ -142,14 +142,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/entitylinker#call) and
## EntityLinker.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
Initialize the component for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
function that returns an iterable of [`Example`](/api/example) objects. The data
examples are used to **initialize the model** of the component and can either be
the full training data or a representative sample. Initialization includes
validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data.
> #### Example
>
> ```python
> entity_linker = nlp.add_pipe("entity_linker", last=True)
> optimizer = entity_linker.begin_training(pipeline=nlp.pipeline)
> optimizer = entity_linker.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```
| Name | Type | Description |


@ -131,14 +131,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/entityrecognizer#call) and
## EntityRecognizer.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
Initialize the component for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
function that returns an iterable of [`Example`](/api/example) objects. The data
examples are used to **initialize the model** of the component and can either be
the full training data or a representative sample. Initialization includes
validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data.
> #### Example
>
> ```python
> ner = nlp.add_pipe("ner")
> optimizer = ner.begin_training(pipeline=nlp.pipeline)
> optimizer = ner.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```
| Name | Type | Description |


@ -200,12 +200,28 @@ more efficient than processing texts one-by-one.
## Language.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
Initialize the pipeline for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
function that returns an iterable of [`Example`](/api/example) objects. The data
examples can either be the full training data or a representative sample. They
are used to **initialize the models** of trainable pipeline components and are
passed to each component's [`begin_training`](/api/pipe#begin_training) method, if
available. Initialization includes validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data.
<Infobox variant="warning" title="Changed in v3.0">
The `Language.begin_training` method now takes a **function** that is called
with no arguments and returns a sequence of [`Example`](/api/example) objects
to initialize the pipeline instead of a list of tuples.
</Infobox>
> #### Example
>
> ```python
> get_examples = lambda: examples
> optimizer = nlp.begin_training(get_examples)
> ```
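For illustration, a minimal initialization could look like this sketch (the training data is made up, and `Example` is imported as in spaCy v3):

```python
from spacy.training import Example

train_data = [("I like London.", {"entities": [(7, 13, "LOC")]})]
examples = [Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data]
# begin_training calls the function to get the examples used for initialization
optimizer = nlp.begin_training(lambda: examples)
```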
@ -276,7 +292,7 @@ and custom registered functions if needed. See the
| `component_cfg` | `Dict[str, dict]` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
## Language.rehearse {#rehearse tag="method,experimental"}
## Language.rehearse {#rehearse tag="method,experimental" new="3"}
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address
@ -302,6 +318,13 @@ the "catastrophic forgetting" problem. This feature is experimental.
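For illustration, a rehearsal step could look like this sketch (assuming `nlp` is a pretrained pipeline and `examples` is a batch of [`Example`](/api/example) objects):

```python
# Resume training and rehearse on a batch of examples
optimizer = nlp.resume_training()
losses = nlp.rehearse(examples, sgd=optimizer)
```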
Evaluate a model's pipeline components.
<Infobox variant="warning" title="Changed in v3.0">
The `Language.evaluate` method now takes a batch of [`Example`](/api/example)
objects instead of tuples of `Doc` and `GoldParse` objects.
</Infobox>
> #### Example
>
> ```python


@ -121,15 +121,21 @@ applied to the `Doc` in order. Both [`__call__`](/api/morphologizer#call) and
## Morphologizer.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
Initialize the component for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
function that returns an iterable of [`Example`](/api/example) objects. The data
examples are used to **initialize the model** of the component and can either be
the full training data or a representative sample. Initialization includes
validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data.
> #### Example
>
> ```python
> morphologizer = nlp.add_pipe("morphologizer")
> nlp.pipeline.append(morphologizer)
> optimizer = morphologizer.begin_training(pipeline=nlp.pipeline)
> optimizer = morphologizer.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```
| Name | Type | Description |


@ -9,8 +9,8 @@ components like the [`EntityRecognizer`](/api/entityrecognizer) or
[`TextCategorizer`](/api/textcategorizer) inherit from it and it defines the
interface that components should follow to function as trainable components in a
spaCy pipeline. See the docs on
[writing trainable components](/usage/processing-pipelines#trainable) for how to
use the `Pipe` base class to implement custom components.
[writing trainable components](/usage/processing-pipelines#trainable-components)
for how to use the `Pipe` base class to implement custom components.
> #### Why is Pipe implemented in Cython?
>
@ -106,14 +106,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/pipe#call) and
## Pipe.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
Initialize the component for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
function that returns an iterable of [`Example`](/api/example) objects. The data
examples are used to **initialize the model** of the component and can either be
the full training data or a representative sample. Initialization includes
validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data.
> #### Example
>
> ```python
> pipe = nlp.add_pipe("your_custom_pipe")
> optimizer = pipe.begin_training(pipeline=nlp.pipeline)
> optimizer = pipe.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```
| Name | Type | Description |
@ -200,7 +206,7 @@ This method needs to be overwritten with your own custom `update` method.
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
## Pipe.rehearse {#rehearse tag="method,experimental"}
## Pipe.rehearse {#rehearse tag="method,experimental" new="3"}
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address


@ -116,14 +116,20 @@ and [`pipe`](/api/sentencerecognizer#pipe) delegate to the
## SentenceRecognizer.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
Initialize the component for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
function that returns an iterable of [`Example`](/api/example) objects. The data
examples are used to **initialize the model** of the component and can either be
the full training data or a representative sample. Initialization includes
validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data.
> #### Example
>
> ```python
> senter = nlp.add_pipe("senter")
> optimizer = senter.begin_training(pipeline=nlp.pipeline)
> optimizer = senter.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```
| Name | Type | Description |
@ -193,7 +199,7 @@ Delegates to [`predict`](/api/sentencerecognizer#predict) and
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
## SentenceRecognizer.rehearse {#rehearse tag="method,experimental"}
## SentenceRecognizer.rehearse {#rehearse tag="method,experimental" new="3"}
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address


@ -114,14 +114,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/tagger#call) and
## Tagger.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
Initialize the component for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
function that returns an iterable of [`Example`](/api/example) objects. The data
examples are used to **initialize the model** of the component and can either be
the full training data or a representative sample. Initialization includes
validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data.
> #### Example
>
> ```python
> tagger = nlp.add_pipe("tagger")
> optimizer = tagger.begin_training(pipeline=nlp.pipeline)
> optimizer = tagger.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```
| Name | Type | Description |
@ -191,7 +197,7 @@ Delegates to [`predict`](/api/tagger#predict) and
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
## Tagger.rehearse {#rehearse tag="method,experimental"}
## Tagger.rehearse {#rehearse tag="method,experimental" new="3"}
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address


@ -122,14 +122,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/textcategorizer#call) and
## TextCategorizer.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
Initialize the component for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
function that returns an iterable of [`Example`](/api/example) objects. The data
examples are used to **initialize the model** of the component and can either be
the full training data or a representative sample. Initialization includes
validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat")
> optimizer = textcat.begin_training(pipeline=nlp.pipeline)
> optimizer = textcat.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```
| Name | Type | Description |
@ -199,7 +205,7 @@ Delegates to [`predict`](/api/textcategorizer#predict) and
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
## TextCategorizer.rehearse {#rehearse tag="method,experimental"}
## TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address


@ -125,14 +125,20 @@ and [`set_annotations`](/api/tok2vec#set_annotations) methods.
## Tok2Vec.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
Initialize the component for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
function that returns an iterable of [`Example`](/api/example) objects. The data
examples are used to **initialize the model** of the component and can either be
the full training data or a representative sample. Initialization includes
validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data.
> #### Example
>
> ```python
> tok2vec = nlp.add_pipe("tok2vec")
> optimizer = tok2vec.begin_training(pipeline=nlp.pipeline)
> optimizer = tok2vec.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```
| Name | Type | Description |


@ -159,14 +159,20 @@ applied to the `Doc` in order. Both [`__call__`](/api/transformer#call) and
## Transformer.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
Initialize the component for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a
function that returns an iterable of [`Example`](/api/example) objects. The data
examples are used to **initialize the model** of the component and can either be
the full training data or a representative sample. Initialization includes
validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data.
> #### Example
>
> ```python
> trf = nlp.add_pipe("transformer")
> optimizer = trf.begin_training(pipeline=nlp.pipeline)
> optimizer = trf.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```
| Name | Type | Description |


@ -45,9 +45,9 @@ three components:
2. **Genre:** Type of text the model is trained on, e.g. `web` or `news`.
3. **Size:** Model size indicator, `sm`, `md` or `lg`.
For example, `en_core_web_sm` is a small English model trained on written web
text (blogs, news, comments), that includes vocabulary, vectors, syntax and
entities.
For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English
model trained on written web text (blogs, news, comments) that includes
vocabulary, vectors, syntax and entities.
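Once installed, a model package is loaded by its full name:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
```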
### Model versioning {#model-versioning}


@ -687,13 +687,13 @@ give you everything you need to train fully custom models with
</Infobox>
<!-- TODO: maybe add something about why the Example class is great and its benefits, and how it's passed around, holds the alignment etc -->
The [`Example`](/api/example) object contains annotated training data, also
called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
that will hold the predictions, and another `Doc` object that holds the
gold-standard annotations. Here's an example of a simple `Example` for
part-of-speech tags:
gold-standard annotations. It also includes the **alignment** between those two
documents if they differ in tokenization. The `Example` class ensures that spaCy
can rely on one **standardized format** that's passed through the pipeline.
Here's an example of a simple `Example` for part-of-speech tags:
```python
words = ["I", "like", "stuff"]
@ -744,7 +744,8 @@ example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O"
As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class.
It can be constructed in a very similar way, from a `Doc` and a dictionary of
annotations:
annotations. For more details, see the
[migration guide](/usage/v3#migrating-training).
```diff
- gold = GoldParse(doc, entities=entities)


@ -14,12 +14,49 @@ menu:
### New training workflow and config system {#features-training}
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [Training models](/usage/training)
- **Thinc:** [Thinc's config system](https://thinc.ai/docs/usage-config),
[`Config`](https://thinc.ai/docs/api-config#config)
- **CLI:** [`train`](/api/cli#train), [`pretrain`](/api/cli#pretrain),
[`evaluate`](/api/cli#evaluate)
- **API:** [Config format](/api/data-formats#config),
[`registry`](/api/top-level#registry)
</Infobox>
### Transformer-based pipelines {#features-transformers}
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [Transformers](/usage/transformers),
[Training models](/usage/training)
- **API:** [`Transformer`](/api/transformer),
[`TransformerData`](/api/transformer#transformerdata),
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
- **Architectures:** [TransformerModel](/api/architectures#TransformerModel),
[Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
- **Models:** [`en_core_bert_sm`](/models/en)
- **Implementation:**
[`spacy-transformers`](https://github.com/explosion/spacy-transformers)
</Infobox>
### Custom models using any framework {#features-custom-models}
### Manage end-to-end workflows with projects {#features-projects}
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [spaCy projects](/usage/projects),
[Training models](/usage/training)
- **CLI:** [`project`](/api/cli#project), [`train`](/api/cli#train)
- **Templates:** [`projects`](https://github.com/explosion/projects)
</Infobox>
### New built-in pipeline components {#features-pipeline-components}
| Name | Description |
@ -30,14 +67,48 @@ menu:
| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [Processing pipelines](/usage/processing-pipelines)
- **API:** [Built-in pipeline components](/api#architecture-pipeline)
- **Implementation:**
[`spacy/pipeline`](https://github.com/explosion/spaCy/tree/develop/spacy/pipeline)
</Infobox>
### New and improved pipeline component APIs {#features-components}
- `Language.factory`, `Language.component`
- `Language.analyze_pipes`
- Adding components from other models
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [Custom components](/usage/processing-pipelines#custom-components),
[Defining components during training](/usage/training#config-components)
- **API:** [`Language`](/api/language)
- **Implementation:**
[`spacy/language.py`](https://github.com/explosion/spaCy/tree/develop/spacy/language.py)
</Infobox>
### Type hints and type-based data validation {#features-types}
> #### Example
>
> ```python
> from spacy.language import Language
> from pydantic import StrictBool
>
> @Language.factory("my_component")
> def create_my_component(
> nlp: Language,
> name: str,
> custom: StrictBool
> ):
> ...
> ```
spaCy v3.0 officially drops support for Python 2 and now requires **Python
3.6+**. This also means that the code base can take full advantage of
[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing
@ -54,13 +125,36 @@ validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which
lets you register **custom functions with typed arguments**, reference them
in your config and see validation errors if the argument values don't match.
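For illustration, here's a sketch of a registered function with typed arguments (the registry name `"my_whitelist.v1"` is made up):

```python
from typing import List
import spacy

@spacy.registry.misc("my_whitelist.v1")
def create_whitelist(words: List[str]) -> List[str]:
    # If the config provides a value that's not a list of strings,
    # resolving the config raises a validation error
    return [word.lower() for word in words]
```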
### CLI
<Infobox title="Details & Documentation" emoji="📖" list>
| Name | Description |
| --------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| [`init config`](/api/cli#init-config) | Initialize a [training config](/usage/training) file for a blank language or auto-fill a partial config. |
| [`debug config`](/api/cli#debug-config) | Debug a [training config](/usage/training) file and show validation errors. |
| [`project`](/api/cli#project) | Subcommand for cloning and running [spaCy projects](/usage/projects). |
- **Usage:**
[Component type hints and validation](/usage/processing-pipelines#type-hints),
[Training with custom code](/usage/training#custom-code)
- **Thinc:**
[Type checking in Thinc](https://thinc.ai/docs/usage-type-checking),
[Thinc's config system](https://thinc.ai/docs/usage-config)
</Infobox>
### New methods, attributes and commands
The following methods, attributes and commands are new in spaCy v3.0.
| Name | Description |
| ------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). |
| [`Language.select_pipes`](/api/language#select_pipes) | Contextmanager for enabling or disabling specific pipeline components for a block. |
| [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. |
| [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. |
| [`@Language.factory`](/api/language#factory) [`@Language.component`](/api/language#component) | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions. |
| [`Language.has_factory`](/api/language#has_factory) | Check whether a component factory is registered on a language class. |
| [`Language.get_factory_meta`](/api/language#get_factory_meta) [`Language.get_pipe_meta`](/api/language#get_pipe_meta) | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name. |
| [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) that can be saved to disk and used for training. |
| [`Pipe.score`](/api/pipe#score) | Method on trainable pipeline components that returns a dictionary of evaluation scores. |
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
| [`init config`](/api/cli#init-config) | CLI command for initializing a [training config](/usage/training) file for a blank language or auto-filling a partial config. |
| [`debug config`](/api/cli#debug-config) | CLI command for debugging a [training config](/usage/training) file and showing validation errors. |
| [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). |
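For example, the new [`Language.select_pipes`](/api/language#select_pipes) context manager makes it easy to run only some components:

```python
# Disable all pipeline components except the parser within the block
with nlp.select_pipes(enable="parser"):
    doc = nlp("This sentence will only be parsed")
```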
## Backwards Incompatibilities {#incompat}
@ -70,12 +164,21 @@ usability. The following section lists the relevant changes to the user-facing
API. For specific examples of how to rewrite your code, check out the
[migration guide](#migrating).
### Compatibility {#incompat-compat}
<Infobox variant="warning">
- spaCy now requires **Python 3.6+**.
Note that spaCy v3.0 now requires **Python 3.6+**.
</Infobox>
### API changes {#incompat-api}
- Model symlinks, the `link` command and shortcut names are now deprecated.
There can be many [different models](/models) and not just one "English
model", so you should always use the full model name like
[`en_core_web_sm`](/models/en) explicitly.
- The [`train`](/api/cli#train) and [`pretrain`](/api/cli#pretrain) commands now
only take a `config.cfg` file containing the full
[training config](/usage/training#config).
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
the component factory instead of the component function.
- **Custom pipeline components** now need to be decorated with the
@ -87,6 +190,20 @@ API. For specific examples of how to rewrite your code, check out the
- The `Language.disable_pipes` contextmanager has been replaced by
[`Language.select_pipes`](/api/language#select_pipes), which can explicitly
disable or enable components.
- The [`Language.update`](/api/language#update),
[`Language.evaluate`](/api/language#evaluate) and
[`Pipe.update`](/api/pipe#update) methods now all take batches of
[`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
raw text and a dictionary of annotations.
- [`Language.begin_training`](/api/language#begin_training) and
[`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
returns a sequence of `Example` objects to initialize the model instead of a
list of tuples.
- [`Matcher.add`](/api/matcher#add),
[`PhraseMatcher.add`](/api/phrasematcher#add) and
[`DependencyMatcher.add`](/api/dependencymatcher#add) now only accept a list
of patterns as the second argument (instead of a variable number of
arguments). The `on_match` callback becomes an optional keyword argument.
### Removed or renamed API {#incompat-removed}
@ -96,6 +213,7 @@ API. For specific examples of how to rewrite your code, check out the
| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
The following deprecated methods, attributes and arguments were removed in v3.0.
@ -121,7 +239,7 @@ on them.
Model symlinks and shortcuts like `en` are now officially deprecated. There are
[many different models](/models) with different capabilities and not just one
"English model". In order to download and load a model, you should always use
its full name for instance, `en_core_web_sm`.
its full name, for instance [`en_core_web_sm`](/models/en#en_core_web_sm).
```diff
- python -m spacy download en
@ -224,6 +342,51 @@ and you typically shouldn't have to use it in your code.
+ parser = nlp.add_pipe("parser")
```
If you need to add a component from an existing pretrained model, you can now
use the `source` argument on [`nlp.add_pipe`](/api/language#add_pipe). This will
check that the component is compatible, and take care of porting over all the
config. During training, you can also reference existing pretrained components
in your [config](/usage/training#config-components) and decide whether or not
they should be updated with more data.
> #### config.cfg (excerpt)
>
> ```ini
> [components.ner]
> source = "en_core_web_sm"
> component = "ner"
> ```
```diff
source_nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
- ner = source_nlp.get_pipe("ner")
- nlp.add_pipe(ner)
+ nlp.add_pipe("ner", source=source_nlp)
```
### Adding match patterns {#migrating-matcher}
The [`Matcher.add`](/api/matcher#add),
[`PhraseMatcher.add`](/api/phrasematcher#add) and
[`DependencyMatcher.add`](/api/dependencymatcher#add) methods now only accept a
**list of patterns** as the second argument (instead of a variable number of
arguments). The `on_match` callback becomes an optional keyword argument.
```diff
matcher = Matcher(nlp.vocab)
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
- matcher.add("GoogleNow", on_match, *patterns)
+ matcher.add("GoogleNow", patterns, on_match=on_match)
```
```diff
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp("health care reform"), nlp("healthcare reform")]
- matcher.add("HEALTH", on_match, *patterns)
+ matcher.add("HEALTH", patterns, on_match=on_match)
```
### Training models {#migrating-training}
To train your models, you should now pretty much always use the
@ -233,15 +396,20 @@ use a [flexible config file](/usage/training#config) that describes all training
settings and hyperparameters, as well as your pipeline, model components and
architectures to use. The `--code` argument lets you pass in code containing
[custom registered functions](/usage/training#custom-code) that you can
reference in your config.
reference in your config. To get started, check out the
[quickstart widget](/usage/training#quickstart).
#### Binary .spacy training data format {#migrating-training-format}
spaCy now uses a new
[binary training data format](/api/data-formats#binary-training), which is much
smaller and consists of `Doc` objects, serialized via the
[`DocBin`](/api/docbin). You can convert your existing JSON-formatted data using
the [`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:
spaCy v3.0 uses a new
[binary training data format](/api/data-formats#binary-training) created by
serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc`
objects. This means that you can train spaCy models using the same format it
outputs: annotated `Doc` objects. The binary format is extremely **efficient in
storage**, especially when packing multiple documents together.
You can convert your existing JSON-formatted data using the
[`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:
```bash
$ python -m spacy convert ./training.json ./output
@ -273,13 +441,72 @@ workflows, from data preprocessing to training and packaging your model.
</Project>
#### Migrating training scripts to CLI command and config {#migrating-training-scripts}
<!-- TODO: write -->
#### Training via the Python API {#migrating-training-python}
<!-- TODO: this should explain the GoldParse -> Example stuff -->
For most use cases, you **shouldn't** have to write your own training scripts
anymore. Instead, you can use [`spacy train`](/api/cli#train) with a
[config file](/usage/training#config) and custom
[registered functions](/usage/training#custom-code) if needed. You can even
register callbacks that can modify the `nlp` object at different stages of its
lifecycle to fully customize it before training.
If you do decide to use the [internal training API](/usage/training#api) from
Python, you should only need a few small modifications to convert your scripts
from spaCy v2.x to v3.x. The [`Example.from_dict`](/api/example#from_dict)
classmethod takes a reference `Doc` and a
[dictionary of annotations](/api/data-formats#dict-input), similar to the
"simple training style" in spaCy v2.x:
```diff
### Migrating Doc and GoldParse
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
entities = [(0, 15, "PERSON"), (30, 38, "ORG")]
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})
```
```diff
### Migrating simple training style
text = "Mark Zuckerberg is the CEO of Facebook"
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
+ doc = nlp.make_doc(text)
+ example = Example.from_dict(doc, annotations)
```
The [`Language.update`](/api/language#update),
[`Language.evaluate`](/api/language#evaluate) and
[`Pipe.update`](/api/pipe#update) methods now all take batches of
[`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
raw text and a dictionary of annotations.
```python
### Training loop {highlight="11"}
TRAIN_DATA = [
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London.", {"entities": [(7, 13, "LOC")]}),
]
nlp.begin_training()
for i in range(20):
random.shuffle(TRAIN_DATA)
for batch in minibatch(TRAIN_DATA):
examples = []
for text, annots in batch:
examples.append(Example.from_dict(nlp.make_doc(text), annots))
nlp.update(examples)
```
[`Language.begin_training`](/api/language#begin_training) and
[`Pipe.begin_training`](/api/pipe#begin_training) now take a function that
returns a sequence of `Example` objects to initialize the model instead of a
list of tuples. The data examples are used to **initialize the models** of
trainable pipeline components, which includes validating the network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme.
```diff
- nlp.begin_training(examples)
+ nlp.begin_training(lambda: examples)
```
#### Packaging models {#migrating-training-packaging}


@ -23,6 +23,7 @@ import { ReactComponent as MoonIcon } from '../images/icons/moon.svg'
import { ReactComponent as ClipboardIcon } from '../images/icons/clipboard.svg'
import { ReactComponent as NetworkIcon } from '../images/icons/network.svg'
import { ReactComponent as DownloadIcon } from '../images/icons/download.svg'
import { ReactComponent as PackageIcon } from '../images/icons/package.svg'
import classes from '../styles/icon.module.sass'
@ -49,6 +50,7 @@ const icons = {
    clipboard: ClipboardIcon,
    network: NetworkIcon,
    download: DownloadIcon,
    package: PackageIcon,
}
export default function Icon({ name, width = 20, height, inline = false, variant, className }) {


@ -5,8 +5,17 @@ import classNames from 'classnames'
import Icon from './icon'
import classes from '../styles/infobox.module.sass'
export default function Infobox({ title, emoji, id, variant = 'default', className, children }) {
export default function Infobox({
    title,
    emoji,
    id,
    variant = 'default',
    list = false,
    className,
    children,
}) {
    const infoboxClassNames = classNames(classes.root, className, {
        [classes.list]: !!list,
        [classes.warning]: variant === 'warning',
        [classes.danger]: variant === 'danger',
    })


@ -8,13 +8,21 @@ import Icon from './icon'
import classes from '../styles/link.module.sass'
import { isString } from './util'
const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io)/gi
const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io|explosion.ai|course.spacy.io)/gi
const Whitespace = ({ children }) => (
    // Ensure that links are always wrapped in spaces
    <> {children} </>
)

function getIcon(dest) {
    if (/(github.com)/.test(dest)) return 'code'
    if (/^\/?api\/architectures#/.test(dest)) return 'network'
    if (/^\/?api/.test(dest)) return 'docs'
    if (/^\/?models\/(.+)/.test(dest)) return 'package'
    return null
}
export default function Link({
    children,
    to,
@ -30,22 +38,19 @@ export default function Link({
}) {
    const dest = to || href
    const external = forceExternal || /(http(s?)):\/\//gi.test(dest)
    const isApi = !external && !hidden && !hideIcon && /^\/?api/.test(dest)
    const isArch = !external && !hidden && !hideIcon && /^\/?api\/architectures#/.test(dest)
    const isSource = external && !hidden && !hideIcon && /(github.com)/.test(dest)
    const withIcon = isApi || isArch || isSource
    const icon = getIcon(dest)
    const withIcon = !hidden && !hideIcon && !!icon
    const sourceWithText = withIcon && isString(children)
    const linkClassNames = classNames(classes.root, className, {
        [classes.hidden]: hidden,
        [classes.nowrap]: (withIcon && !sourceWithText) || isArch,
        [classes.nowrap]: (withIcon && !sourceWithText) || icon === 'network',
        [classes.withIcon]: withIcon,
    })
    const Wrapper = ws ? Whitespace : Fragment
    const icon = isArch ? 'network' : isApi ? 'docs' : isSource ? 'code' : null
    const content = (
        <>
            {sourceWithText ? <span className={classes.sourceText}>{children}</span> : children}
            {icon && <Icon name={icon} width={16} inline className={classes.icon} />}
            {withIcon && <Icon name={icon} width={16} inline className={classes.icon} />}
        </>
    )


@ -0,0 +1,5 @@
<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
<path fill="none" d="M21 16V8a2 2 0 0 0-1-1.73l-7-4a2 2 0 0 0-2 0l-7 4A2 2 0 0 0 3 8v8a2 2 0 0 0 1 1.73l7 4a2 2 0 0 0 2 0l7-4A2 2 0 0 0 21 16z"></path>
<polyline fill="none" points="3.27 6.96 12 12.01 20.73 6.96"></polyline>
<line fill="none" x1="12" y1="22.08" x2="12" y2="12"></line>
</svg>



@ -14,6 +14,21 @@
font-size: inherit
line-height: inherit
    ul li
        padding-left: 0.75em

.list ul li
    font-size: var(--font-size-sm)
    list-style: none
    padding: 0
    margin: 0 0 0.35rem 0

    &:before
        all: initial

    a, a span
        border-bottom: 0 !important

.title
    font-weight: bold
    color: var(--color-theme)