From 2e567a47c261b3eac22f9cd37abd737d1f48fdfb Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Wed, 9 Sep 2020 21:26:10 +0200 Subject: [PATCH] Update docs and formatting --- website/docs/api/dependencyparser.md | 19 +-- website/docs/api/entityrecognizer.md | 19 +-- website/docs/api/morphologizer.md | 12 +- website/docs/api/pipe.md | 54 +++--- website/docs/api/tagger.md | 13 +- website/docs/api/textcategorizer.md | 13 +- website/docs/usage/layers-architectures.md | 184 ++++++++++++--------- website/meta/type-annotations.json | 2 + 8 files changed, 165 insertions(+), 151 deletions(-) diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md index 9fd8f60d2..ed5e8bdb2 100644 --- a/website/docs/api/dependencyparser.md +++ b/website/docs/api/dependencyparser.md @@ -293,7 +293,11 @@ context, the original parameters are restored. ## DependencyParser.add_label {#add_label tag="method"} -Add a new label to the pipe. +Add a new label to the pipe. Note that you don't have to call this method if you +provide a **representative data sample** to the +[`begin_training`](#begin_training) method. In this case, all labels found in +the sample will be automatically added to the model, and the output dimension +will be [inferred](/usage/layers-architectures#shape-inference) automatically. > #### Example > @@ -307,17 +311,13 @@ Add a new label to the pipe. | `label` | The label to add. ~~str~~ | | **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ | -Note that you don't have to call `pipe.add_label` if you provide a -representative data sample to the [`begin_training`](#begin_training) method. In -this case, all labels found in the sample will be automatically added to the -model, and the output dimension will be -[inferred](/usage/layers-architectures#shape-inference) automatically. - ## DependencyParser.set_output {#set_output tag="method"} Change the output dimension of the component's model by calling the model's attribute `resize_output`. This is a function that takes the original model and -the new output dimension `nO`, and changes the model in place. +the new output dimension `nO`, and changes the model in place. When resizing an +already trained model, care should be taken to avoid the "catastrophic +forgetting" problem. > #### Example > @@ -330,9 +330,6 @@ the new output dimension `nO`, and changes the model in place. | ---- | --------------------------------- | | `nO` | The new output dimension. ~~int~~ | -When resizing an already trained model, care should be taken to avoid the -"catastrophic forgetting" problem. - ## DependencyParser.to_disk {#to_disk tag="method"} Serialize the pipe to disk. diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md index 51ad984ee..fc6904824 100644 --- a/website/docs/api/entityrecognizer.md +++ b/website/docs/api/entityrecognizer.md @@ -281,7 +281,11 @@ context, the original parameters are restored. ## EntityRecognizer.add_label {#add_label tag="method"} -Add a new label to the pipe. +Add a new label to the pipe. Note that you don't have to call this method if you +provide a **representative data sample** to the +[`begin_training`](#begin_training) method. In this case, all labels found in +the sample will be automatically added to the model, and the output dimension +will be [inferred](/usage/layers-architectures#shape-inference) automatically. > #### Example > @@ -295,17 +299,13 @@ Add a new label to the pipe. | `label` | The label to add. 
~~str~~ | | **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ | -Note that you don't have to call `pipe.add_label` if you provide a -representative data sample to the [`begin_training`](#begin_training) method. In -this case, all labels found in the sample will be automatically added to the -model, and the output dimension will be -[inferred](/usage/layers-architectures#shape-inference) automatically. - ## EntityRecognizer.set_output {#set_output tag="method"} Change the output dimension of the component's model by calling the model's attribute `resize_output`. This is a function that takes the original model and -the new output dimension `nO`, and changes the model in place. +the new output dimension `nO`, and changes the model in place. When resizing an +already trained model, care should be taken to avoid the "catastrophic +forgetting" problem. > #### Example > @@ -318,9 +318,6 @@ the new output dimension `nO`, and changes the model in place. | ---- | --------------------------------- | | `nO` | The new output dimension. ~~int~~ | -When resizing an already trained model, care should be taken to avoid the -"catastrophic forgetting" problem. - ## EntityRecognizer.to_disk {#to_disk tag="method"} Serialize the pipe to disk. diff --git a/website/docs/api/morphologizer.md b/website/docs/api/morphologizer.md index 120b62b2f..c83d3d9fd 100644 --- a/website/docs/api/morphologizer.md +++ b/website/docs/api/morphologizer.md @@ -259,7 +259,11 @@ context, the original parameters are restored. Add a new label to the pipe. If the `Morphologizer` should set annotations for both `pos` and `morph`, the label should include the UPOS as the feature `POS`. Raises an error if the output dimension is already set, or if the model has -already been fully [initialized](#begin_training). +already been fully [initialized](#begin_training). Note that you don't have to +call this method if you provide a **representative data sample** to the +[`begin_training`](#begin_training) method. In this case, all labels found in +the sample will be automatically added to the model, and the output dimension +will be [inferred](/usage/layers-architectures#shape-inference) automatically. > #### Example > @@ -273,12 +277,6 @@ already been fully [initialized](#begin_training). | `label` | The label to add. ~~str~~ | | **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ | -Note that you don't have to call `pipe.add_label` if you provide a -representative data sample to the [`begin_training`](#begin_training) method. In -this case, all labels found in the sample will be automatically added to the -model, and the output dimension will be -[inferred](/usage/layers-architectures#shape-inference) automatically. - ## Morphologizer.to_disk {#to_disk tag="method"} Serialize the pipe to disk. diff --git a/website/docs/api/pipe.md b/website/docs/api/pipe.md index 7b77141fa..be1279553 100644 --- a/website/docs/api/pipe.md +++ b/website/docs/api/pipe.md @@ -293,12 +293,6 @@ context, the original parameters are restored. > pipe.add_label("MY_LABEL") > ``` - - -This method needs to be overwritten with your own custom `add_label` method. - - - Add a new label to the pipe, to be predicted by the model. 
The actual implementation depends on the specific component, but in general `add_label` shouldn't be called if the output dimension is already set, or if the model has @@ -308,6 +302,12 @@ the component is [resizable](#is_resizable), in which case [`set_output`](#set_output) should be called to ensure that the model is properly resized. + + +This method needs to be overwritten with your own custom `add_label` method. + + + | Name | Description | | ----------- | ------------------------------------------------------- | | `label` | The label to add. ~~str~~ | @@ -326,41 +326,37 @@ model, and the output dimension will be > ```python > can_resize = pipe.is_resizable() > ``` +> +> ```python +> ### Custom resizing +> def custom_resize(model, new_nO): +> # adjust model +> return model +> +> custom_model.attrs["resize_output"] = custom_resize +> ``` Check whether or not the output dimension of the component's model can be resized. If this method returns `True`, [`set_output`](#set_output) can be called to change the model's output dimension. +For built-in components that are not resizable, you have to create and train a +new model from scratch with the appropriate architecture and output dimension. +For custom components, you can implement a `resize_output` function and add it +as an attribute to the component's model. + | Name | Description | | ----------- | ---------------------------------------------------------------------------------------------- | | **RETURNS** | Whether or not the output dimension of the model can be changed after initialization. ~~bool~~ | -> #### Example -> -> ```python -> def custom_resize(model, new_nO): -> # adjust model -> return model -> custom_model.attrs["resize_output"] = custom_resize -> ``` - -For built-in components that are not resizable, you have to create and train a -new model from scratch with the appropriate architecture and output dimension. - -For custom components, you can implement a `resize_output` function and add it -as an attribute to the component's model. - ## Pipe.set_output {#set_output tag="method"} Change the output dimension of the component's model. If the component is not -[resizable](#is_resizable), this method will throw a `NotImplementedError`. - -If a component is resizable, the model's attribute `resize_output` will be -called. This is a function that takes the original model and the new output -dimension `nO`, and changes the model in place. - -When resizing an already trained model, care should be taken to avoid the -"catastrophic forgetting" problem. +[resizable](#is_resizable), this method will raise a `NotImplementedError`. If a +component is resizable, the model's attribute `resize_output` will be called. +This is a function that takes the original model and the new output dimension +`nO`, and changes the model in place. When resizing an already trained model, +care should be taken to avoid the "catastrophic forgetting" problem. > #### Example > diff --git a/website/docs/api/tagger.md b/website/docs/api/tagger.md index 0e929a6ab..eceb28b19 100644 --- a/website/docs/api/tagger.md +++ b/website/docs/api/tagger.md @@ -289,7 +289,12 @@ context, the original parameters are restored. ## Tagger.add_label {#add_label tag="method"} Add a new label to the pipe. Raises an error if the output dimension is already -set, or if the model has already been fully [initialized](#begin_training). +set, or if the model has already been fully [initialized](#begin_training). 
Note +that you don't have to call this method if you provide a **representative data +sample** to the [`begin_training`](#begin_training) method. In this case, all +labels found in the sample will be automatically added to the model, and the +output dimension will be [inferred](/usage/layers-architectures#shape-inference) +automatically. > #### Example > @@ -303,12 +308,6 @@ set, or if the model has already been fully [initialized](#begin_training). | `label` | The label to add. ~~str~~ | | **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ | -Note that you don't have to call `pipe.add_label` if you provide a -representative data sample to the [`begin_training`](#begin_training) method. In -this case, all labels found in the sample will be automatically added to the -model, and the output dimension will be -[inferred](/usage/layers-architectures#shape-inference) automatically. - ## Tagger.to_disk {#to_disk tag="method"} Serialize the pipe to disk. diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md index e0c7c2f79..0d71655c6 100644 --- a/website/docs/api/textcategorizer.md +++ b/website/docs/api/textcategorizer.md @@ -298,7 +298,12 @@ Modify the pipe's model, to use the given parameter values. ## TextCategorizer.add_label {#add_label tag="method"} Add a new label to the pipe. Raises an error if the output dimension is already -set, or if the model has already been fully [initialized](#begin_training). +set, or if the model has already been fully [initialized](#begin_training). Note +that you don't have to call this method if you provide a **representative data +sample** to the [`begin_training`](#begin_training) method. In this case, all +labels found in the sample will be automatically added to the model, and the +output dimension will be [inferred](/usage/layers-architectures#shape-inference) +automatically. > #### Example > @@ -312,12 +317,6 @@ set, or if the model has already been fully [initialized](#begin_training). | `label` | The label to add. ~~str~~ | | **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ | -Note that you don't have to call `pipe.add_label` if you provide a -representative data sample to the [`begin_training`](#begin_training) method. In -this case, all labels found in the sample will be automatically added to the -model, and the output dimension will be -[inferred](/usage/layers-architectures#shape-inference) automatically. - ## TextCategorizer.to_disk {#to_disk tag="method"} Serialize the pipe to disk. diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md index 894cccc26..6783f2b7f 100644 --- a/website/docs/usage/layers-architectures.md +++ b/website/docs/usage/layers-architectures.md @@ -5,8 +5,7 @@ menu: - ['Type Signatures', 'type-sigs'] - ['Swapping Architectures', 'swap-architectures'] - ['PyTorch & TensorFlow', 'frameworks'] - - ['Custom Models', 'custom-models'] - - ['Thinc implementation', 'thinc'] + - ['Custom Thinc Models', 'thinc'] - ['Trainable Components', 'components'] next: /usage/projects --- @@ -226,13 +225,24 @@ you'll be able to try it out in any of the spaCy components. ​ Thinc allows you to [wrap models](https://thinc.ai/docs/usage-frameworks) written in other machine learning frameworks like PyTorch, TensorFlow and MXNet -using a unified [`Model`](https://thinc.ai/docs/api-model) API. 
- -For example, let's use PyTorch to define a very simple Neural network consisting -of two hidden `Linear` layers with `ReLU` activation and dropout, and a -softmax-activated output layer. +using a unified [`Model`](https://thinc.ai/docs/api-model) API. This makes it +easy to use a model implemented in a different framework to power a component in +your spaCy pipeline. For example, to wrap a PyTorch model as a Thinc `Model`, +you can use Thinc's +[`PyTorchWrapper`](https://thinc.ai/docs/api-layers#pytorchwrapper): ```python +from thinc.api import PyTorchWrapper + +wrapped_pt_model = PyTorchWrapper(torch_model) +``` + +Let's use PyTorch to define a very simple neural network consisting of two +hidden `Linear` layers with `ReLU` activation and dropout, and a +softmax-activated output layer: + +```python +### PyTorch model from torch import nn torch_model = nn.Sequential( @@ -246,15 +256,6 @@ torch_model = nn.Sequential( ) ``` -This PyTorch model can be wrapped as a Thinc `Model` by using Thinc's -`PyTorchWrapper`: - -```python -from thinc.api import PyTorchWrapper - -wrapped_pt_model = PyTorchWrapper(torch_model) -``` - The resulting wrapped `Model` can be used as a **custom architecture** as such, or can be a **subcomponent of a larger model**. For instance, we can use Thinc's [`chain`](https://thinc.ai/docs/api-layers#chain) combinator, which works like @@ -273,21 +274,26 @@ model = chain(char_embed, with_array(wrapped_pt_model)) In the above example, we have combined our custom PyTorch model with a character embedding layer defined by spaCy. [CharacterEmbed](/api/architectures#CharacterEmbed) returns a `Model` that takes -a `List[Doc]` as input, and outputs a `List[Floats2d]`. To make sure that the -wrapped PyTorch model receives valid inputs, we use Thinc's +a ~~List[Doc]~~ as input, and outputs a ~~List[Floats2d]~~. To make sure that +the wrapped PyTorch model receives valid inputs, we use Thinc's [`with_array`](https://thinc.ai/docs/api-layers#with_array) helper. -As another example, you could have a model where you use PyTorch just for the -transformer layers, and use "native" Thinc layers to do fiddly input and output -transformations and add on task-specific "heads", as efficiency is less of a -consideration for those parts of the network. +You could also implement a model that only uses PyTorch for the transformer +layers, and "native" Thinc layers to do fiddly input and output transformations +and add on task-specific "heads", as efficiency is less of a consideration for +those parts of the network. -## Custom models for trainable components {#custom-models} +### Using wrapped models {#frameworks-usage} To use our custom model including the PyTorch subnetwork, all we need to do is -register the architecture. The full example then becomes: +register the architecture using the +[`architectures` registry](/api/top-level#registry). This will assign the +architecture a name so spaCy knows how to find it, and allows passing in +arguments like hyperparameters via the [config](/usage/training#config). 
The +full example then becomes: ```python +### Registering the architecture {highlight="9"} from typing import List from thinc.types import Floats2d from thinc.api import Model, PyTorchWrapper, chain, with_array @@ -297,7 +303,7 @@ from spacy.ml import CharacterEmbed from torch import nn @spacy.registry.architectures("CustomTorchModel.v1") -def TorchModel( +def create_torch_model( nO: int, width: int, hidden_width: int, @@ -321,8 +327,10 @@ def TorchModel( return model ``` -Now you can use this model definition in any existing trainable spaCy component, -by specifying it in the config file: +The model definition can now be used in any existing trainable spaCy component, +by specifying it in the config file. In this configuration, all required +parameters for the various subcomponents of the custom architecture are passed +in as settings via the config. ```ini ### config.cfg (excerpt) {highlight="5-5"} @@ -340,106 +348,124 @@ nC = 8 dropout = 0.2 ``` -In this configuration, we pass all required parameters for the various -subcomponents of the custom architecture as settings in the training config -file. Remember that it is best not to rely on any (hidden) default values, to -ensure that training configs are complete and experiments fully reproducible. + -## Thinc implemention details {#thinc} +Remember that it is best not to rely on any (hidden) default values, to ensure +that training configs are complete and experiments fully reproducible. -Ofcourse it's also possible to define the `Model` from the previous section + + +## Custom models with Thinc {#thinc} + +Of course it's also possible to define the `Model` from the previous section entirely in Thinc. The Thinc documentation provides details on the [various layers](https://thinc.ai/docs/api-layers) and helper functions -available. - -The combinators often used in Thinc can be used to -[overload operators](https://thinc.ai/docs/usage-models#operators). A common -usage is to bind `chain` to `>>`. The "native" Thinc version of our simple -neural network would then become: +available. Combinators can also be used to +[overload operators](https://thinc.ai/docs/usage-models#operators) and a common +usage pattern is to bind `chain` to `>>`. The "native" Thinc version of our +simple neural network would then become: ```python from thinc.api import chain, with_array, Model, Relu, Dropout, Softmax from spacy.ml import CharacterEmbed char_embed = CharacterEmbed(width, embed_size, nM, nC) - with Model.define_operators({">>": chain}): layers = ( - Relu(hidden_width, width) - >> Dropout(dropout) - >> Relu(hidden_width, hidden_width) - >> Dropout(dropout) - >> Softmax(nO, hidden_width) + Relu(hidden_width, width) + >> Dropout(dropout) + >> Relu(hidden_width, hidden_width) + >> Dropout(dropout) + >> Softmax(nO, hidden_width) ) model = char_embed >> with_array(layers) ``` -**⚠️ Note that Thinc layers define the output dimension (`nO`) as the first -argument, followed (optionally) by the input dimension (`nI`). This is in -contrast to how the PyTorch layers are defined, where `in_features` precedes -`out_features`.** + -### Shape inference in thinc {#shape-inference} +Note that Thinc layers define the output dimension (`nO`) as the first argument, +followed (optionally) by the input dimension (`nI`). This is in contrast to how +the PyTorch layers are defined, where `in_features` precedes `out_features`. 
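+
+To make the ordering concrete, here is a minimal illustrative sketch (the
+dimensions are arbitrary) of declaring a layer that maps 64 inputs to 128
+outputs in each framework:
+
+```python
+### Argument order: Thinc vs. PyTorch
+from thinc.api import Relu
+from torch import nn
+
+thinc_layer = Relu(nO=128, nI=64)                          # output dimension first
+torch_layer = nn.Linear(in_features=64, out_features=128)  # input dimension first
+```
+
+(Thinc's `Relu` also applies the activation, while `nn.Linear` is a bare affine
+layer, so the comparison above concerns the argument order only.)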
-It is not strictly necessary to define all the input and output dimensions for -each layer, as Thinc can perform + + +### Shape inference in Thinc {#thinc-shape-inference} + +It is **not** strictly necessary to define all the input and output dimensions +for each layer, as Thinc can perform [shape inference](https://thinc.ai/docs/usage-models#validation) between sequential layers by matching up the output dimensionality of one layer to the input dimensionality of the next. This means that we can simplify the `layers` definition: +> #### Diff +> +> ```diff +> layers = ( +> Relu(hidden_width, width) +> >> Dropout(dropout) +> - >> Relu(hidden_width, hidden_width) +> + >> Relu(hidden_width) +> >> Dropout(dropout) +> - >> Softmax(nO, hidden_width) +> + >> Softmax(nO) +> ) +> ``` + ```python with Model.define_operators({">>": chain}): layers = ( - Relu(hidden_width, width) - >> Dropout(dropout) - >> Relu(hidden_width) - >> Dropout(dropout) - >> Softmax(nO) + Relu(hidden_width, width) + >> Dropout(dropout) + >> Relu(hidden_width) + >> Dropout(dropout) + >> Softmax(nO) ) ``` -Thinc can go one step further and deduce the correct input dimension of the -first layer, and output dimension of the last. To enable this functionality, you -have to call [`model.initialize`](https://thinc.ai/docs/api-model#initialize) -with an input sample `X` and an output sample `Y` with the correct dimensions. +Thinc can even go one step further and **deduce the correct input dimension** of +the first layer, and output dimension of the last. To enable this functionality, +you have to call +[`Model.initialize`](https://thinc.ai/docs/api-model#initialize) with an **input +sample** `X` and an **output sample** `Y` with the correct dimensions: ```python +### Shape inference with initialization {highlight="3,7,10"} with Model.define_operators({">>": chain}): layers = ( - Relu(hidden_width) - >> Dropout(dropout) - >> Relu(hidden_width) - >> Dropout(dropout) - >> Softmax() + Relu(hidden_width) + >> Dropout(dropout) + >> Relu(hidden_width) + >> Dropout(dropout) + >> Softmax() ) model = char_embed >> with_array(layers) model.initialize(X=input_sample, Y=output_sample) ``` The built-in [pipeline components](/usage/processing-pipelines) in spaCy ensure -that their internal models are always initialized with appropriate sample data. -In this case, `X` is typically a `List` of `Doc` objects, while `Y` is a `List` -of 1D or 2D arrays, depending on the specific task. This functionality is -triggered when [`nlp.begin_training`](/api/language#begin_training) is called. +that their internal models are **always initialized** with appropriate sample +data. In this case, `X` is typically a ~~List[Doc]~~, while `Y` is typically a +~~List[Array1d]~~ or ~~List[Array2d]~~, depending on the specific task. This +functionality is triggered when +[`nlp.begin_training`](/api/language#begin_training) is called. -### Dropout and normalization {#drop-norm} +### Dropout and normalization in Thinc {#thinc-dropout-norm} -Many of the `Thinc` layers allow you to define a `dropout` argument that will -result in "chaining" an additional +Many of the available Thinc [layers](https://thinc.ai/docs/api-layers) allow you +to define a `dropout` argument that will result in "chaining" an additional [`Dropout`](https://thinc.ai/docs/api-layers#dropout) layer. Optionally, you can often specify whether or not you want to add layer normalization, which would result in an additional -[`LayerNorm`](https://thinc.ai/docs/api-layers#layernorm) layer. 
That means that
+the following `layers` definition is equivalent to the previous:

```python
with Model.define_operators({">>": chain}):
    layers = (
-        Relu(hidden_width, dropout=dropout, normalize=False)
-        >> Relu(hidden_width, dropout=dropout, normalize=False)
-        >> Softmax()
+        Relu(hidden_width, dropout=dropout, normalize=False)
+        >> Relu(hidden_width, dropout=dropout, normalize=False)
+        >> Softmax()
    )
    model = char_embed >> with_array(layers)
    model.initialize(X=input_sample, Y=output_sample)
```
diff --git a/website/meta/type-annotations.json b/website/meta/type-annotations.json
index b1d94403d..79d4d357d 100644
--- a/website/meta/type-annotations.json
+++ b/website/meta/type-annotations.json
@@ -34,6 +34,8 @@
  "Floats2d": "https://thinc.ai/docs/api-types#types",
  "Floats3d": "https://thinc.ai/docs/api-types#types",
  "FloatsXd": "https://thinc.ai/docs/api-types#types",
+ "Array1d": "https://thinc.ai/docs/api-types#types",
+ "Array2d": "https://thinc.ai/docs/api-types#types",
  "Ops": "https://thinc.ai/docs/api-backends#ops",
  "cymem.Pool": "https://github.com/explosion/cymem",
  "preshed.BloomFilter": "https://github.com/explosion/preshed",
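
The `Array1d` and `Array2d` types registered above are the same numpy-style
arrays used as output samples in the shape-inference section. As a minimal
illustrative sketch (the five-token text, the 10-class output shape and the
`model` variable from the earlier Thinc example are assumptions for
demonstration only):

```python
### Illustrative initialization samples
import numpy
from spacy.lang.en import English

nlp = English()
# X: a list of Doc objects, here a single five-token document
input_sample = [nlp("This is a text.")]
# Y: one 2d float array per doc, one row per token and one column
# per output class (10 classes chosen arbitrarily)
output_sample = [numpy.zeros((5, 10), dtype="f")]
# "model" is the chained model from the earlier Thinc example;
# Thinc infers the remaining input/output dimensions from the samples
model.initialize(X=input_sample, Y=output_sample)
```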