Merge branch 'feature/docs-layers' of https://github.com/svlandeg/spaCy into feature/docs-layers

svlandeg 2020-09-02 17:44:00 +02:00
commit ab909a3f68
3 changed files with 187 additions and 82 deletions


@ -25,36 +25,6 @@ usage documentation on
## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"}
### spacy.HashEmbedCNN.v1 {#HashEmbedCNN}
> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.HashEmbedCNN.v1"
> pretrained_vectors = null
> width = 96
> depth = 4
> embed_size = 2000
> window_size = 1
> maxout_pieces = 3
> subword_features = true
> ```
Build spaCy's "standard" embedding layer, which uses hash embedding with subword
features and a CNN with layer-normalized maxout.
| Name | Description |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `width` | The width of the input and output. These are required to be the same, so that residual connections can be used. Recommended values are `96`, `128` or `300`. ~~int~~ |
| `depth` | The number of convolutional layers to use. Recommended values are between `2` and `8`. ~~int~~ |
| `embed_size` | The number of rows in the hash embedding tables. This can be surprisingly small, due to the use of the hash embeddings. Recommended values are between `2000` and `10000`. ~~int~~ |
| `window_size` | The number of tokens on either side to concatenate during the convolutions. The receptive field of the CNN will be `depth * window_size * 2 + 1`, so a 4-layer network with a window size of `2` will be sensitive to 17 words at a time. Recommended value is `1`. ~~int~~   |
| `maxout_pieces` | The number of pieces to use in the maxout non-linearity. If `1`, the [`Mish`](https://thinc.ai/docs/api-layers#mish) non-linearity is used instead. Recommended values are `1`-`3`. ~~int~~ |
| `subword_features` | Whether to also embed subword features, specifically the prefix, suffix and word shape. This is recommended for alphabetic languages like English, but not if single-character tokens are used for a language such as Chinese. ~~bool~~ |
| `pretrained_vectors` | Whether to also use static vectors. ~~bool~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.Tok2Vec.v1 {#Tok2Vec}
> #### Example config
@ -72,7 +42,8 @@ features and a CNN with layer-normalized maxout.
> # ...
> ```
Construct a tok2vec model out of embedding and encoding subnetworks. See the
Construct a tok2vec model out of two subnetworks: one for embedding and one for
encoding. See the
["Embed, Encode, Attend, Predict"](https://explosion.ai/blog/deep-learning-formula-nlp)
blog post for background.
@ -82,6 +53,39 @@ blog post for background.
| `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). ~~Model[List[Floats2d], List[Floats2d]]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.HashEmbedCNN.v1 {#HashEmbedCNN}
> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.HashEmbedCNN.v1"
> pretrained_vectors = null
> width = 96
> depth = 4
> embed_size = 2000
> window_size = 1
> maxout_pieces = 3
> subword_features = true
> ```
Build spaCy's "standard" tok2vec layer. This layer is defined by a
[MultiHashEmbed](/api/architectures#MultiHashEmbed) embedding layer that uses
subword features, and a
[MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder) encoding layer
consisting of a CNN and a layer-normalized maxout activation function.
| Name | Description |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `width` | The width of the input and output. These are required to be the same, so that residual connections can be used. Recommended values are `96`, `128` or `300`. ~~int~~ |
| `depth` | The number of convolutional layers to use. Recommended values are between `2` and `8`. ~~int~~ |
| `embed_size` | The number of rows in the hash embedding tables. This can be surprisingly small, due to the use of the hash embeddings. Recommended values are between `2000` and `10000`. ~~int~~ |
| `window_size` | The number of tokens on either side to concatenate during the convolutions. The receptive field of the CNN will be `depth * window_size * 2 + 1`, so a 4-layer network with a window size of `2` will be sensitive to 17 words at a time. Recommended value is `1`. ~~int~~   |
| `maxout_pieces` | The number of pieces to use in the maxout non-linearity. If `1`, the [`Mish`](https://thinc.ai/docs/api-layers#mish) non-linearity is used instead. Recommended values are `1`-`3`. ~~int~~ |
| `subword_features` | Whether to also embed subword features, specifically the prefix, suffix and word shape. This is recommended for alphabetic languages like English, but not if single-character tokens are used for a language such as Chinese. ~~bool~~ |
| `pretrained_vectors` | Whether to also use static vectors. ~~bool~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
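
As a quick check of the 17-word figure quoted for `window_size`, the sketch below assumes that each convolutional layer widens the visible context by `window_size` tokens on either side of the current token. The helper name `receptive_field` is hypothetical and not part of spaCy.

```python
def receptive_field(depth: int, window_size: int) -> int:
    """Number of tokens visible to one output position.

    Assumes `depth` stacked window convolutions, each of which adds
    `window_size` tokens of context on either side of the centre token.
    """
    return depth * window_size * 2 + 1


# The example from the table: 4 layers with a window size of 2.
print(receptive_field(depth=4, window_size=2))  # 17
# The example config: depth = 4, window_size = 1.
print(receptive_field(depth=4, window_size=1))  # 9
```
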
### spacy.Tok2VecListener.v1 {#Tok2VecListener}
> #### Example config


@ -10,49 +10,72 @@ menu:
next: /usage/projects
---
A **model architecture** is a function that wires up a
[Thinc `Model`](https://thinc.ai/docs/api-model) instance, which you can then
use in a component or as a layer of a larger network. You can use Thinc as a
thin wrapper around frameworks such as PyTorch, TensorFlow or MXNet, or you can
implement your logic in Thinc directly. spaCy's built-in components will never
construct their `Model` instances themselves, so you won't have to subclass the
component to change its model architecture. You can just **update the config**
so that it refers to a different registered function. Once the component has
been created, its model instance has already been assigned, so you cannot change
its model architecture. The architecture is like a recipe for the network, and
you can't change the recipe once the dish has already been prepared. You have to
make a new one.
> #### Example
>
> ```python
> import spacy
> from thinc.api import Model, chain
>
> @spacy.registry.architectures.register("model.v1")
> def build_model(width: int, classes: int) -> Model:
>     tok2vec = build_tok2vec(width)
>     output_layer = build_output_layer(width, classes)
>     model = chain(tok2vec, output_layer)
>     return model
> ```
A **model architecture** is a function that wires up a
[Thinc `Model`](https://thinc.ai/docs/api-model) instance. It describes the
neural network that is run internally as part of a component in a spaCy
pipeline. To define the actual architecture, you can implement your logic in
Thinc directly, or you can use Thinc as a thin wrapper around frameworks such as
PyTorch, TensorFlow and MXNet. Each Model can also be used as a sublayer of a
larger network, allowing you to freely combine implementations from different
frameworks into one `Thinc` Model.
spaCy's built-in components require a `Model` instance to be passed to them via
the config system. To change the model architecture of an existing component,
you just need to [**update the config**](#swap-architectures) so that it refers
to a different registered function. Once the component has been created from
this config, you won't be able to change it anymore. The architecture is like a
recipe for the network, and you can't change the recipe once the dish has
already been prepared. You have to make a new one.
```ini
### config.cfg (excerpt)
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "model.v1"
width = 512
classes = 16
```
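
To make the link between the config block and the registered function concrete, here is a minimal, self-contained sketch of how a block such as `[components.tagger.model]` could be resolved into a function call. It uses a plain dictionary as a stand-in registry. The names `ARCHITECTURES`, `register` and `resolve`, and the tuple returned by `build_model`, are illustrative assumptions and not spaCy's or Thinc's actual implementation.

```python
from typing import Any, Callable, Dict

# Toy stand-in for a function registry: maps registered names to functions.
ARCHITECTURES: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that stores a function in the toy registry under `name`."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        ARCHITECTURES[name] = func
        return func
    return decorator


@register("model.v1")
def build_model(width: int, classes: int) -> tuple:
    # A real architecture would return a Thinc Model; a tuple keeps this runnable.
    return ("model.v1", width, classes)


def resolve(block: Dict[str, Any]) -> Any:
    """Look up the function named by "@architectures" and call it with the
    remaining settings of the block as keyword arguments."""
    settings = dict(block)
    func = ARCHITECTURES[settings.pop("@architectures")]
    return func(**settings)


# Mirrors the [components.tagger.model] block from the config excerpt.
model_block = {"@architectures": "model.v1", "width": 512, "classes": 16}
print(resolve(model_block))  # ('model.v1', 512, 16)
```

In this picture, swapping the architecture amounts to changing the `@architectures` value and supplying the settings that the newly named function expects.
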
## Type signatures {#type-sigs}
<!-- TODO: update example, maybe simplify definition? -->
> #### Example
>
> ```python
> @spacy.registry.architectures.register("spacy.Tagger.v1")
> def build_tagger_model(
>     tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None
> ) -> Model[List[Doc], List[Floats2d]]:
>     t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None
>     output_layer = Softmax(nO, t2v_width, init_W=zero_init)
>     softmax = with_array(output_layer)
>     model = chain(tok2vec, softmax)
>     model.set_ref("tok2vec", tok2vec)
>     model.set_ref("softmax", output_layer)
>     model.set_ref("output_layer", output_layer)
>     return model
> ```

> #### Example
>
> ```python
> from typing import List
> from spacy.tokens import Doc
> from thinc.api import Model, chain
> from thinc.types import Floats2d
>
> def chain_model(
>     tok2vec: Model[List[Doc], List[Floats2d]],
>     layer1: Model[List[Floats2d], Floats2d],
>     layer2: Model[Floats2d, Floats2d]
> ) -> Model[List[Doc], Floats2d]:
>     model = chain(tok2vec, layer1, layer2)
>     return model
> ```
The Thinc `Model` class is a **generic type** that can specify its input and
output types. Python uses a square-bracket notation for this, so the type
~~Model[List, Dict]~~ says that each batch of inputs to the model will be a
list, and the outputs will be a dictionary. Both `typing.List` and `typing.Dict`
are also generics, allowing you to be more specific about the data. For
instance, you can write ~~Model[List[Doc], Dict[str, float]]~~ to specify that
the model expects a list of [`Doc`](/api/doc) objects as input, and returns a
dictionary mapping strings to floats. Some of the most common types you'll see
are:
list, and the outputs will be a dictionary. You can be even more specific and
write, for instance, ~~Model[List[Doc], Dict[str, float]]~~ to specify that the
model expects a list of [`Doc`](/api/doc) objects as input, and returns a
dictionary mapping strings to floats. Some of the most common types you'll
see are:
| Type | Description |
| ------------------ | ---------------------------------------------------------------------------------------------------- |
@ -77,10 +100,10 @@ interchangeably. There are many other ways they could be incompatible. However,
if the types don't match, they almost surely _won't_ be compatible. This little
bit of validation goes a long way, especially if you
[configure your editor](https://thinc.ai/docs/usage-type-checking) or other
tools to highlight these errors early. Thinc will also verify that your types
match correctly when your config file is processed at the beginning of training.
tools to highlight these errors early. The config file is also validated at the
beginning of training, to verify that all the types match correctly.
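
The idea that mismatched input and output types signal incompatible layers can be illustrated with a small, self-contained sketch. The `Layer` class and `chain2` function below are hypothetical stand-ins written with the standard `typing` module only; they are not Thinc's `Model` or `chain`, and they only mimic the way generic type parameters let a static checker compare one layer's output type with the next layer's input type.

```python
from typing import Callable, Generic, List, TypeVar

InT = TypeVar("InT")
MidT = TypeVar("MidT")
OutT = TypeVar("OutT")


class Layer(Generic[InT, OutT]):
    """Toy stand-in for a generic model type: wraps a function InT -> OutT."""

    def __init__(self, func: Callable[[InT], OutT]) -> None:
        self.func = func

    def __call__(self, inputs: InT) -> OutT:
        return self.func(inputs)


def chain2(first: Layer[InT, MidT], second: Layer[MidT, OutT]) -> Layer[InT, OutT]:
    """Compose two layers; the output type of `first` must match the input type of `second`."""
    return Layer(lambda inputs: second(first(inputs)))


# Layer[List[str], List[int]]: maps each string to its length.
lengths: Layer[List[str], List[int]] = Layer(lambda words: [len(w) for w in words])
# Layer[List[int], int]: sums a list of integers.
total: Layer[List[int], int] = Layer(lambda values: sum(values))

# The annotations line up: List[int] out of `lengths`, List[int] into `total`.
pipeline = chain2(lengths, total)
print(pipeline(["spaCy", "Thinc"]))  # 10

# chain2(total, lengths) is the kind of call a static type checker such as mypy
# is expected to report: `total` outputs an int, `lengths` expects a list of strings.
```
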
<Infobox title="Tip: Static type checking in your editor" emoji="💡"> <Accordion title="Tip: Static type checking in your editor" emoji="💡">
If you're using a modern editor like Visual Studio Code, you can
[set up `mypy`](https://thinc.ai/docs/usage-type-checking#install) with the
@ -89,35 +112,113 @@ code.
[![](../images/thinc_mypy.jpg)](https://thinc.ai/docs/usage-type-checking#linting)
</Infobox> </Accordion>
## Swapping model architectures {#swap-architectures}
<!-- TODO: textcat example, using different architecture in the config -->
If no model is specified for the [`TextCategorizer`](/api/textcategorizer), the
[TextCatEnsemble](/api/architectures#TextCatEnsemble) architecture is used by
default. This architecture combines a simple bag-of-words model with a neural
network, usually resulting in the most accurate results, but at the cost of
speed. The config file for this model would look something like this:
```ini
### config.cfg (excerpt)
[components.textcat]
factory = "textcat"
labels = []
[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v1"
exclusive_classes = false
pretrained_vectors = null
width = 64
conv_depth = 2
embed_size = 2000
window_size = 1
ngram_size = 1
dropout = 0
nO = null
```
spaCy has two additional built-in `textcat` architectures, and you can easily
use those by swapping out the definition of the textcat's model. For instance,
to use the simple and fast [bag-of-words model](/api/architectures#TextCatBOW),
you can change the config to:
```ini
### config.cfg (excerpt)
[components.textcat]
factory = "textcat"
labels = []
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null
```
The details of all prebuilt architectures and their parameters can be consulted
on the [API page for model architectures](/api/architectures).
### Defining sublayers {#sublayers}
Model architecture functions often accept **sublayers as arguments**, so that
you can try **substituting a different layer** into the network. Depending on
how the architecture function is structured, you might be able to define your
network structure entirely through the [config system](/usage/training#config),
using layers that have already been defined. The
[transformers documentation](/usage/embeddings-transformers#transformers)
section shows a common example of swapping in a different sublayer.
using layers that have already been defined.
In most neural network models for NLP, the most important parts of the network
are what we refer to as the
[embed and encode](https://explosion.ai/blog/embed-encode-attend-predict) steps.
[embed and encode](https://explosion.ai/blog/deep-learning-formula-nlp) steps.
These steps together compute dense, context-sensitive representations of the
tokens. Most of spaCy's default architectures accept a
[`tok2vec` embedding layer](/api/architectures#tok2vec-arch) as an argument, so
you can control this important part of the network separately. This makes it
easy to **switch between** transformer, CNN, BiLSTM or other feature extraction
approaches. And if you want to define your own solution, all you need to do is
register a ~~Model[List[Doc], List[Floats2d]]~~ architecture function, and
you'll be able to try it out in any of the spaCy components.
<!-- TODO: example of swapping sublayers -->
tokens, and their combination forms a typical
[`Tok2Vec`](/api/architectures#Tok2Vec) layer:
```ini
### config.cfg (excerpt)
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
# ...
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
# ...
```
By defining these sublayers specifically, it becomes straightforward to swap out
a sublayer for another one, for instance changing the first sublayer to a
character embedding with the [CharacterEmbed](/api/architectures#CharacterEmbed)
architecture:
```ini
### config.cfg (excerpt)
[components.tok2vec.model.embed]
@architectures = "spacy.CharacterEmbed.v1"
# ...
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
# ...
```
Most of spaCy's default architectures accept a `tok2vec` layer as a sublayer
within the larger task-specific neural network. This makes it easy to **switch
between** transformer, CNN, BiLSTM or other feature extraction approaches. The
[transformers documentation](/usage/embeddings-transformers#training-custom-model)
section shows an example of swapping out a model's standard `tok2vec` layer with
a transformer. And if you want to define your own solution, all you need to do
is register a ~~Model[List[Doc], List[Floats2d]]~~ architecture function, and
you'll be able to try it out in any of the spaCy components.
## Wrapping PyTorch, TensorFlow and other frameworks {#frameworks}


@ -825,7 +825,7 @@ from spacy.tokens import Doc
@spacy.registry.architectures("custom_neural_network.v1")
def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:
    # ...
    return create_model(output_width)
```
```ini