Update docs and formatting

Ines Montani 2020-09-09 21:26:10 +02:00
parent aa27e3f1f2
commit 2e567a47c2
8 changed files with 165 additions and 151 deletions

View File

@ -293,7 +293,11 @@ context, the original parameters are restored.
## DependencyParser.add_label {#add_label tag="method"}
Add a new label to the pipe.
Add a new label to the pipe. Note that you don't have to call this method if you
provide a **representative data sample** to the
[`begin_training`](#begin_training) method. In this case, all labels found in
the sample will be automatically added to the model, and the output dimension
will be [inferred](/usage/layers-architectures#shape-inference) automatically.
> #### Example
>
@ -307,17 +311,13 @@ Add a new label to the pipe.
| `label` | The label to add. ~~str~~ |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
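The return value and the sample-based shortcut can be illustrated with a small stand-in class. This is a plain-Python sketch and not spaCy's implementation: `ToyPipe`, its `labels` list and the simplified `begin_training` signature (an iterable of label strings instead of real training examples) are assumptions made only for illustration.

```python
class ToyPipe:
    """Hypothetical stand-in for a trainable pipe; not a spaCy class."""

    def __init__(self):
        self.labels = []

    def add_label(self, label):
        # Mirror the documented contract: 0 if already present, otherwise 1.
        if label in self.labels:
            return 0
        self.labels.append(label)
        return 1

    def begin_training(self, sample_labels):
        # Assumption: the data sample is reduced to an iterable of label strings.
        for label in sample_labels:
            self.add_label(label)
        # The output dimension follows from the number of distinct labels seen.
        return len(self.labels)


pipe = ToyPipe()
first = pipe.add_label("MY_LABEL")   # label is new
second = pipe.add_label("MY_LABEL")  # label is already present
n_out = pipe.begin_training(["nsubj", "dobj", "nsubj"])
```

In this sketch, labels found in the sample are added without any explicit `add_label` call by the user, which is the shortcut the note on `begin_training` describes.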
Note that you don't have to call `pipe.add_label` if you provide a
representative data sample to the [`begin_training`](#begin_training) method. In
this case, all labels found in the sample will be automatically added to the
model, and the output dimension will be
[inferred](/usage/layers-architectures#shape-inference) automatically.
## DependencyParser.set_output {#set_output tag="method"}
Change the output dimension of the component's model by calling the model's
attribute `resize_output`. This is a function that takes the original model and
the new output dimension `nO`, and changes the model in place.
the new output dimension `nO`, and changes the model in place. When resizing an
already trained model, care should be taken to avoid the "catastrophic
forgetting" problem.
> #### Example
>
@ -330,9 +330,6 @@ the new output dimension `nO`, and changes the model in place.
| ---- | --------------------------------- |
| `nO` | The new output dimension. ~~int~~ |
When resizing an already trained model, care should be taken to avoid the
"catastrophic forgetting" problem.
## DependencyParser.to_disk {#to_disk tag="method"}
Serialize the pipe to disk.

View File

@ -281,7 +281,11 @@ context, the original parameters are restored.
## EntityRecognizer.add_label {#add_label tag="method"}
Add a new label to the pipe.
Add a new label to the pipe. Note that you don't have to call this method if you
provide a **representative data sample** to the
[`begin_training`](#begin_training) method. In this case, all labels found in
the sample will be automatically added to the model, and the output dimension
will be [inferred](/usage/layers-architectures#shape-inference) automatically.
> #### Example
>
@ -295,17 +299,13 @@ Add a new label to the pipe.
| `label` | The label to add. ~~str~~ |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
Note that you don't have to call `pipe.add_label` if you provide a
representative data sample to the [`begin_training`](#begin_training) method. In
this case, all labels found in the sample will be automatically added to the
model, and the output dimension will be
[inferred](/usage/layers-architectures#shape-inference) automatically.
## EntityRecognizer.set_output {#set_output tag="method"}
Change the output dimension of the component's model by calling the model's
attribute `resize_output`. This is a function that takes the original model and
the new output dimension `nO`, and changes the model in place.
the new output dimension `nO`, and changes the model in place. When resizing an
already trained model, care should be taken to avoid the "catastrophic
forgetting" problem.
> #### Example
>
@ -318,9 +318,6 @@ the new output dimension `nO`, and changes the model in place.
| ---- | --------------------------------- |
| `nO` | The new output dimension. ~~int~~ |
When resizing an already trained model, care should be taken to avoid the
"catastrophic forgetting" problem.
## EntityRecognizer.to_disk {#to_disk tag="method"}
Serialize the pipe to disk.

View File

@ -259,7 +259,11 @@ context, the original parameters are restored.
Add a new label to the pipe. If the `Morphologizer` should set annotations for
both `pos` and `morph`, the label should include the UPOS as the feature `POS`.
Raises an error if the output dimension is already set, or if the model has
already been fully [initialized](#begin_training).
already been fully [initialized](#begin_training). Note that you don't have to
call this method if you provide a **representative data sample** to the
[`begin_training`](#begin_training) method. In this case, all labels found in
the sample will be automatically added to the model, and the output dimension
will be [inferred](/usage/layers-architectures#shape-inference) automatically.
> #### Example
>
@ -273,12 +277,6 @@ already been fully [initialized](#begin_training).
| `label` | The label to add. ~~str~~ |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
Note that you don't have to call `pipe.add_label` if you provide a
representative data sample to the [`begin_training`](#begin_training) method. In
this case, all labels found in the sample will be automatically added to the
model, and the output dimension will be
[inferred](/usage/layers-architectures#shape-inference) automatically.
## Morphologizer.to_disk {#to_disk tag="method"}
Serialize the pipe to disk.

View File

@ -293,12 +293,6 @@ context, the original parameters are restored.
> pipe.add_label("MY_LABEL")
> ```
<Infobox variant="danger">
This method needs to be overwritten with your own custom `add_label` method.
</Infobox>
Add a new label to the pipe, to be predicted by the model. The actual
implementation depends on the specific component, but in general `add_label`
shouldn't be called if the output dimension is already set, or if the model has
@ -308,6 +302,12 @@ the component is [resizable](#is_resizable), in which case
[`set_output`](#set_output) should be called to ensure that the model is
properly resized.
<Infobox variant="danger">
This method needs to be overwritten with your own custom `add_label` method.
</Infobox>
| Name | Description |
| ----------- | ------------------------------------------------------- |
| `label` | The label to add. ~~str~~ |
@ -326,41 +326,37 @@ model, and the output dimension will be
> ```python
> can_resize = pipe.is_resizable()
> ```
>
> ```python
> ### Custom resizing
> def custom_resize(model, new_nO):
>     # adjust model
>     return model
>
> custom_model.attrs["resize_output"] = custom_resize
> ```
Check whether or not the output dimension of the component's model can be
resized. If this method returns `True`, [`set_output`](#set_output) can be
called to change the model's output dimension.
For built-in components that are not resizable, you have to create and train a
new model from scratch with the appropriate architecture and output dimension.
For custom components, you can implement a `resize_output` function and add it
as an attribute to the component's model.
| Name | Description |
| ----------- | ---------------------------------------------------------------------------------------------- |
| **RETURNS** | Whether or not the output dimension of the model can be changed after initialization. ~~bool~~ |
> #### Example
>
> ```python
> def custom_resize(model, new_nO):
>     # adjust model
>     return model
> custom_model.attrs["resize_output"] = custom_resize
> ```
For built-in components that are not resizable, you have to create and train a
new model from scratch with the appropriate architecture and output dimension.
For custom components, you can implement a `resize_output` function and add it
as an attribute to the component's model.
## Pipe.set_output {#set_output tag="method"}
Change the output dimension of the component's model. If the component is not
[resizable](#is_resizable), this method will throw a `NotImplementedError`.
If a component is resizable, the model's attribute `resize_output` will be
called. This is a function that takes the original model and the new output
dimension `nO`, and changes the model in place.
When resizing an already trained model, care should be taken to avoid the
"catastrophic forgetting" problem.
[resizable](#is_resizable), this method will raise a `NotImplementedError`. If a
component is resizable, the model's attribute `resize_output` will be called.
This is a function that takes the original model and the new output dimension
`nO`, and changes the model in place. When resizing an already trained model,
care should be taken to avoid the "catastrophic forgetting" problem.
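How `is_resizable` and `set_output` fit together can be sketched in plain Python. The classes below are hypothetical stand-ins, not spaCy or Thinc classes: `ToyModel` only assumes a model with an `attrs` dictionary and an output size, and `ToyPipe` only mirrors the behavior described here (call `resize_output` if it is registered, otherwise raise `NotImplementedError`).

```python
class ToyModel:
    """Hypothetical model with an `attrs` dict and an output dimension."""

    def __init__(self, nO):
        self.nO = nO
        self.attrs = {}


class ToyPipe:
    """Hypothetical pipe; only mirrors the documented behavior."""

    def __init__(self, model):
        self.model = model

    def is_resizable(self):
        # Assumption: resizable means a `resize_output` function is registered.
        return "resize_output" in self.model.attrs

    def set_output(self, nO):
        if not self.is_resizable():
            raise NotImplementedError("model has no 'resize_output' attribute")
        # The function receives the original model and the new dimension
        # and changes the model in place.
        self.model.attrs["resize_output"](self.model, nO)


def custom_resize(model, new_nO):
    # adjust model in place (a real model would also resize its weights)
    model.nO = new_nO
    return model


fixed = ToyPipe(ToyModel(nO=3))
resizable = ToyPipe(ToyModel(nO=3))
resizable.model.attrs["resize_output"] = custom_resize
resizable.set_output(5)

try:
    fixed.set_output(5)
    raised = False
except NotImplementedError:
    raised = True
```

The sketch leaves out the hard part of resizing a trained model, namely how to extend the weights without losing what was already learned.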
> #### Example
>

View File

@ -289,7 +289,12 @@ context, the original parameters are restored.
## Tagger.add_label {#add_label tag="method"}
Add a new label to the pipe. Raises an error if the output dimension is already
set, or if the model has already been fully [initialized](#begin_training).
set, or if the model has already been fully [initialized](#begin_training). Note
that you don't have to call this method if you provide a **representative data
sample** to the [`begin_training`](#begin_training) method. In this case, all
labels found in the sample will be automatically added to the model, and the
output dimension will be [inferred](/usage/layers-architectures#shape-inference)
automatically.
> #### Example
>
@ -303,12 +308,6 @@ set, or if the model has already been fully [initialized](#begin_training).
| `label` | The label to add. ~~str~~ |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
Note that you don't have to call `pipe.add_label` if you provide a
representative data sample to the [`begin_training`](#begin_training) method. In
this case, all labels found in the sample will be automatically added to the
model, and the output dimension will be
[inferred](/usage/layers-architectures#shape-inference) automatically.
## Tagger.to_disk {#to_disk tag="method"}
Serialize the pipe to disk.

View File

@ -298,7 +298,12 @@ Modify the pipe's model, to use the given parameter values.
## TextCategorizer.add_label {#add_label tag="method"}
Add a new label to the pipe. Raises an error if the output dimension is already
set, or if the model has already been fully [initialized](#begin_training).
set, or if the model has already been fully [initialized](#begin_training). Note
that you don't have to call this method if you provide a **representative data
sample** to the [`begin_training`](#begin_training) method. In this case, all
labels found in the sample will be automatically added to the model, and the
output dimension will be [inferred](/usage/layers-architectures#shape-inference)
automatically.
> #### Example
>
@ -312,12 +317,6 @@ set, or if the model has already been fully [initialized](#begin_training).
| `label` | The label to add. ~~str~~ |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
Note that you don't have to call `pipe.add_label` if you provide a
representative data sample to the [`begin_training`](#begin_training) method. In
this case, all labels found in the sample will be automatically added to the
model, and the output dimension will be
[inferred](/usage/layers-architectures#shape-inference) automatically.
## TextCategorizer.to_disk {#to_disk tag="method"}
Serialize the pipe to disk.

View File

@ -5,8 +5,7 @@ menu:
- ['Type Signatures', 'type-sigs']
- ['Swapping Architectures', 'swap-architectures']
- ['PyTorch & TensorFlow', 'frameworks']
- ['Custom Models', 'custom-models']
- ['Thinc implementation', 'thinc']
- ['Custom Thinc Models', 'thinc']
- ['Trainable Components', 'components']
next: /usage/projects
---
@ -226,13 +225,24 @@ you'll be able to try it out in any of the spaCy components.
Thinc allows you to [wrap models](https://thinc.ai/docs/usage-frameworks)
written in other machine learning frameworks like PyTorch, TensorFlow and MXNet
using a unified [`Model`](https://thinc.ai/docs/api-model) API.
For example, let's use PyTorch to define a very simple Neural network consisting
of two hidden `Linear` layers with `ReLU` activation and dropout, and a
softmax-activated output layer.
using a unified [`Model`](https://thinc.ai/docs/api-model) API. This makes it
easy to use a model implemented in a different framework to power a component in
your spaCy pipeline. For example, to wrap a PyTorch model as a Thinc `Model`,
you can use Thinc's
[`PyTorchWrapper`](https://thinc.ai/docs/api-layers#pytorchwrapper):
```python
from thinc.api import PyTorchWrapper
wrapped_pt_model = PyTorchWrapper(torch_model)
```
Let's use PyTorch to define a very simple neural network consisting of two
hidden `Linear` layers with `ReLU` activation and dropout, and a
softmax-activated output layer:
```python
### PyTorch model
from torch import nn
torch_model = nn.Sequential(
@ -246,15 +256,6 @@ torch_model = nn.Sequential(
)
```
This PyTorch model can be wrapped as a Thinc `Model` by using Thinc's
`PyTorchWrapper`:
```python
from thinc.api import PyTorchWrapper
wrapped_pt_model = PyTorchWrapper(torch_model)
```
The resulting wrapped `Model` can be used as a **custom architecture** as such,
or can be a **subcomponent of a larger model**. For instance, we can use Thinc's
[`chain`](https://thinc.ai/docs/api-layers#chain) combinator, which works like
@ -273,21 +274,26 @@ model = chain(char_embed, with_array(wrapped_pt_model))
In the above example, we have combined our custom PyTorch model with a character
embedding layer defined by spaCy.
[CharacterEmbed](/api/architectures#CharacterEmbed) returns a `Model` that takes
a `List[Doc]` as input, and outputs a `List[Floats2d]`. To make sure that the
wrapped PyTorch model receives valid inputs, we use Thinc's
a ~~List[Doc]~~ as input, and outputs a ~~List[Floats2d]~~. To make sure that
the wrapped PyTorch model receives valid inputs, we use Thinc's
[`with_array`](https://thinc.ai/docs/api-layers#with_array) helper.
As another example, you could have a model where you use PyTorch just for the
transformer layers, and use "native" Thinc layers to do fiddly input and output
transformations and add on task-specific "heads", as efficiency is less of a
consideration for those parts of the network.
You could also implement a model that only uses PyTorch for the transformer
layers, and "native" Thinc layers to do fiddly input and output transformations
and add on task-specific "heads", as efficiency is less of a consideration for
those parts of the network.
## Custom models for trainable components {#custom-models}
### Using wrapped models {#frameworks-usage}
To use our custom model including the PyTorch subnetwork, all we need to do is
register the architecture. The full example then becomes:
register the architecture using the
[`architectures` registry](/api/top-level#registry). This will assign the
architecture a name so spaCy knows how to find it, and allows passing in
arguments like hyperparameters via the [config](/usage/training#config). The
full example then becomes:
```python
### Registering the architecture {highlight="9"}
from typing import List
from thinc.types import Floats2d
from thinc.api import Model, PyTorchWrapper, chain, with_array
@ -297,7 +303,7 @@ from spacy.ml import CharacterEmbed
from torch import nn
@spacy.registry.architectures("CustomTorchModel.v1")
def TorchModel(
def create_torch_model(
    nO: int,
    width: int,
    hidden_width: int,
@ -321,8 +327,10 @@ def TorchModel(
    return model
```
Now you can use this model definition in any existing trainable spaCy component,
by specifying it in the config file:
The model definition can now be used in any existing trainable spaCy component,
by specifying it in the config file. In this configuration, all required
parameters for the various subcomponents of the custom architecture are passed
in as settings via the config.
```ini
### config.cfg (excerpt) {highlight="5-5"}
@ -340,29 +348,28 @@ nC = 8
dropout = 0.2
```
In this configuration, we pass all required parameters for the various
subcomponents of the custom architecture as settings in the training config
file. Remember that it is best not to rely on any (hidden) default values, to
ensure that training configs are complete and experiments fully reproducible.
<Infobox variant="warning">
## Thinc implemention details {#thinc}
Remember that it is best not to rely on any (hidden) default values, to ensure
that training configs are complete and experiments fully reproducible.
Ofcourse it's also possible to define the `Model` from the previous section
</Infobox>
## Custom models with Thinc {#thinc}
Of course it's also possible to define the `Model` from the previous section
entirely in Thinc. The Thinc documentation provides details on the
[various layers](https://thinc.ai/docs/api-layers) and helper functions
available.
The combinators often used in Thinc can be used to
[overload operators](https://thinc.ai/docs/usage-models#operators). A common
usage is to bind `chain` to `>>`. The "native" Thinc version of our simple
neural network would then become:
available. Combinators can also be used to
[overload operators](https://thinc.ai/docs/usage-models#operators) and a common
usage pattern is to bind `chain` to `>>`. The "native" Thinc version of our
simple neural network would then become:
```python
from thinc.api import chain, with_array, Model, Relu, Dropout, Softmax
from spacy.ml import CharacterEmbed
char_embed = CharacterEmbed(width, embed_size, nM, nC)
with Model.define_operators({">>": chain}):
    layers = (
        Relu(hidden_width, width)
@ -374,20 +381,37 @@ with Model.define_operators({">>": chain}):
    model = char_embed >> with_array(layers)
```
**⚠️ Note that Thinc layers define the output dimension (`nO`) as the first
argument, followed (optionally) by the input dimension (`nI`). This is in
contrast to how the PyTorch layers are defined, where `in_features` precedes
`out_features`.**
<Infobox variant="warning" title="Important note on inputs and outputs">
### Shape inference in thinc {#shape-inference}
Note that Thinc layers define the output dimension (`nO`) as the first argument,
followed (optionally) by the input dimension (`nI`). This is in contrast to how
the PyTorch layers are defined, where `in_features` precedes `out_features`.
It is not strictly necessary to define all the input and output dimensions for
each layer, as Thinc can perform
</Infobox>
### Shape inference in Thinc {#thinc-shape-inference}
It is **not** strictly necessary to define all the input and output dimensions
for each layer, as Thinc can perform
[shape inference](https://thinc.ai/docs/usage-models#validation) between
sequential layers by matching up the output dimensionality of one layer to the
input dimensionality of the next. This means that we can simplify the `layers`
definition:
> #### Diff
>
> ```diff
> layers = (
>     Relu(hidden_width, width)
>     >> Dropout(dropout)
> -   >> Relu(hidden_width, hidden_width)
> +   >> Relu(hidden_width)
>     >> Dropout(dropout)
> -   >> Softmax(nO, hidden_width)
> +   >> Softmax(nO)
> )
> ```
```python
with Model.define_operators({">>": chain}):
    layers = (
@ -399,12 +423,14 @@ with Model.define_operators({">>": chain}):
    )
```
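The idea of matching one layer's output dimensionality to the next layer's input can be shown with a toy model in plain Python. This is only an illustration of the principle and not how Thinc implements it: `ToyLayer`, `infer_shapes` and the concrete sizes (`width=32`, `hidden_width=64`, `nO=10`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyLayer:
    """Hypothetical layer description; output dimension first, as in Thinc."""

    name: str
    nO: Optional[int] = None
    nI: Optional[int] = None
    passthrough: bool = False  # e.g. dropout keeps the dimension unchanged


def infer_shapes(layers: List[ToyLayer]) -> List[ToyLayer]:
    """Fill in missing dimensions by walking the layers in order."""
    prev_nO = None
    for layer in layers:
        if layer.nI is None:
            # Input width is whatever the previous layer produced.
            layer.nI = prev_nO
        if layer.passthrough and layer.nO is None:
            layer.nO = layer.nI
        if layer.nO is not None:
            prev_nO = layer.nO
    return layers


# Mirrors the simplified `layers` definition: only the first layer gets nI.
layers = infer_shapes(
    [
        ToyLayer("relu", nO=64, nI=32),
        ToyLayer("dropout", passthrough=True),
        ToyLayer("relu", nO=64),
        ToyLayer("dropout", passthrough=True),
        ToyLayer("softmax", nO=10),
    ]
)
inferred_inputs = [layer.nI for layer in layers]
inferred_outputs = [layer.nO for layer in layers]
```

Only the input width of the first layer and the output widths of the weighted layers are given; every other dimension follows from its neighbor.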
Thinc can go one step further and deduce the correct input dimension of the
first layer, and output dimension of the last. To enable this functionality, you
have to call [`model.initialize`](https://thinc.ai/docs/api-model#initialize)
with an input sample `X` and an output sample `Y` with the correct dimensions.
Thinc can even go one step further and **deduce the correct input dimension** of
the first layer, and output dimension of the last. To enable this functionality,
you have to call
[`Model.initialize`](https://thinc.ai/docs/api-model#initialize) with an **input
sample** `X` and an **output sample** `Y` with the correct dimensions:
```python
### Shape inference with initialization {highlight="3,7,10"}
with Model.define_operators({">>": chain}):
    layers = (
        Relu(hidden_width)
@ -418,21 +444,21 @@ with Model.define_operators({">>": chain}):
```
The built-in [pipeline components](/usage/processing-pipelines) in spaCy ensure
that their internal models are always initialized with appropriate sample data.
In this case, `X` is typically a `List` of `Doc` objects, while `Y` is a `List`
of 1D or 2D arrays, depending on the specific task. This functionality is
triggered when [`nlp.begin_training`](/api/language#begin_training) is called.
that their internal models are **always initialized** with appropriate sample
data. In this case, `X` is typically a ~~List[Doc]~~, while `Y` is typically a
~~List[Array1d]~~ or ~~List[Array2d]~~, depending on the specific task. This
functionality is triggered when
[`nlp.begin_training`](/api/language#begin_training) is called.
### Dropout and normalization {#drop-norm}
### Dropout and normalization in Thinc {#thinc-dropout-norm}
Many of the `Thinc` layers allow you to define a `dropout` argument that will
result in "chaining" an additional
Many of the available Thinc [layers](https://thinc.ai/docs/api-layers) allow you
to define a `dropout` argument that will result in "chaining" an additional
[`Dropout`](https://thinc.ai/docs/api-layers#dropout) layer. Optionally, you can
often specify whether or not you want to add layer normalization, which would
result in an additional
[`LayerNorm`](https://thinc.ai/docs/api-layers#layernorm) layer.
That means that the following `layers` definition is equivalent to the previous:
[`LayerNorm`](https://thinc.ai/docs/api-layers#layernorm) layer. That means that
the following `layers` definition is equivalent to the previous:
```python
with Model.define_operators({">>": chain}):

View File

@ -34,6 +34,8 @@
"Floats2d": "https://thinc.ai/docs/api-types#types",
"Floats3d": "https://thinc.ai/docs/api-types#types",
"FloatsXd": "https://thinc.ai/docs/api-types#types",
"Array1d": "https://thinc.ai/docs/api-types#types",
"Array2d": "https://thinc.ai/docs/api-types#types",
"Ops": "https://thinc.ai/docs/api-backends#ops",
"cymem.Pool": "https://github.com/explosion/cymem",
"preshed.BloomFilter": "https://github.com/explosion/preshed",