From a15c5fb1912af7599eadf1bc9ef2547397b95c9e Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Sun, 9 Aug 2020 16:10:48 +0200 Subject: [PATCH] Update docstrings and docs --- spacy/pipeline/dep_parser.pyx | 5 +++ website/docs/api/architectures.md | 41 ++++++++++++++++++++ website/docs/api/dependencyparser.md | 57 ++++++++++++++++++---------- website/docs/api/entityrecognizer.md | 46 ++++++++++++---------- website/docs/api/tagger.md | 22 +++++------ website/docs/api/textcategorizer.md | 31 +++++---------- website/docs/api/tok2vec.md | 29 ++++++++++---- 7 files changed, 152 insertions(+), 79 deletions(-) diff --git a/spacy/pipeline/dep_parser.pyx b/spacy/pipeline/dep_parser.pyx index 072c334e2..801229af5 100644 --- a/spacy/pipeline/dep_parser.pyx +++ b/spacy/pipeline/dep_parser.pyx @@ -71,6 +71,11 @@ def make_parser( actions are decreased. Note that more than one action may be optimal for a given state. + model (Model): The model for the transition-based parser. The model needs + to have a specific substructure of named components --- see the + spacy.ml.tb_framework.TransitionModel for details. + moves (List[str]): A list of transition names. Inferred from the data if not + provided. update_with_oracle_cut_size (int): During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 0050b53a5..db996caa0 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -70,6 +70,47 @@ blog post for background. | `embed` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. | | `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. 
Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. | +### spacy.Tok2VecListener.v1 {#Tok2VecListener} + +> #### Example config +> +> ```ini +> [components.tok2vec] +> factory = "tok2vec" +> +> [components.tok2vec.model] +> @architectures = "spacy.HashEmbedCNN.v1" +> width = 342 +> +> [components.tagger] +> factory = "tagger" +> +> [components.tagger.model] +> @architectures = "spacy.Tagger.v1" +> +> [components.tagger.model.tok2vec] +> @architectures = "spacy.Tok2VecListener.v1" +> width = ${components.tok2vec.model:width} +> ``` + +A listener is used as a sublayer within a component such as a +[`DependencyParser`](/api/dependencyparser), +[`EntityRecognizer`](/api/entityrecognizer) or +[`TextCategorizer`](/api/textcategorizer). Usually you'll have multiple +listeners connecting to a single upstream [`Tok2Vec`](/api/tok2vec) component +that's earlier in the pipeline. The listener layers act as **proxies**, passing +the predictions from the `Tok2Vec` component into downstream components, and +communicating gradients back upstream. + +Instead of defining its own `Tok2Vec` instance, a model architecture like +[Tagger](/api/architectures#tagger) can define a listener as its `tok2vec` +argument that connects to the shared `tok2vec` component in the pipeline. + +| Name | Type | Description | +| ---------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `width` | int | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. | +| `upstream` | str | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. 
You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. | + ### spacy.MultiHashEmbed.v1 {#MultiHashEmbed} diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md index e56e85e64..6c9222781 100644 --- a/website/docs/api/dependencyparser.md +++ b/website/docs/api/dependencyparser.md @@ -8,6 +8,23 @@ api_string_name: parser api_trainable: true --- +A transition-based dependency parser component. The dependency parser jointly +learns sentence segmentation and labelled dependency parsing, and can optionally +learn to merge tokens that had been over-segmented by the tokenizer. The parser +uses a variant of the **non-monotonic arc-eager transition-system** described by +[Honnibal and Johnson (2014)](https://www.aclweb.org/anthology/D15-1162/), with +the addition of a "break" transition to perform the sentence segmentation. +[Nivre (2005)](https://www.aclweb.org/anthology/P05-1013/)'s **pseudo-projective +dependency transformation** is used to allow the parser to predict +non-projective parses. + +The parser is trained using an **imitation learning objective**. It follows the +actions predicted by the current weights, and at each state, determines which +actions are compatible with the optimal parse that could be reached from the +current state. The weights are updated such that the scores assigned to the set of optimal +actions are increased, while scores assigned to other actions are decreased. Note +that more than one action may be optimal for a given state. + ## Config and implementation {#config} The default config is defined by the pipeline component factory and describes @@ -23,18 +40,21 @@ architectures and their arguments and hyperparameters. 
> from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL > config = { > "moves": None, -> # TODO: rest +> "update_with_oracle_cut_size": 100, +> "learn_tokens": False, +> "min_action_freq": 30, > "model": DEFAULT_PARSER_MODEL, > } > nlp.add_pipe("parser", config=config) > ``` - - -| Setting | Type | Description | Default | -| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- | -| `moves` | list | | `None` | -| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) | +| Setting | Type | Description | Default | +| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- | +| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. | `None` | +| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100` | +| `learn_tokens` | bool | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. | `False` | +| `min_action_freq` | int | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. 
| `30` | +| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) | ```python https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/dep_parser.pyx @@ -61,19 +81,16 @@ Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and [`nlp.add_pipe`](/api/language#add_pipe). - - -| Name | Type | Description | -| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The shared vocabulary. | -| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. | -| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. | -| `moves` | list | | -| _keyword-only_ | | | -| `update_with_oracle_cut_size` | int | | -| `multitasks` | `Iterable` | | -| `learn_tokens` | bool | | -| `min_action_freq` | int | | +| Name | Type | Description | +| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | `Vocab` | The shared vocabulary. | +| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. | +| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. | +| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. 
| +| _keyword-only_ | | | +| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. | +| `learn_tokens` | bool | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. | +| `min_action_freq` | int | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. | ## DependencyParser.\_\_call\_\_ {#call tag="method"} diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md index 0ab17f953..a6368e62b 100644 --- a/website/docs/api/entityrecognizer.md +++ b/website/docs/api/entityrecognizer.md @@ -8,6 +8,18 @@ api_string_name: ner api_trainable: true --- +A transition-based named entity recognition component. The entity recognizer +identifies **non-overlapping labelled spans** of tokens. The transition-based +algorithm used encodes certain assumptions that are effective for "traditional" +named entity recognition tasks, but may not be a good fit for every span +identification problem. Specifically, the loss function optimizes for **whole +entity accuracy**, so if your inter-annotator agreement on boundary tokens is +low, the component will likely perform poorly on your problem. The +transition-based algorithm also assumes that the most decisive information about +your entities will be close to their initial tokens. If your entities are long +and characterized by tokens in their middle, the component will likely not be a +good fit for your task. 
+ ## Config and implementation {#config} The default config is defined by the pipeline component factory and describes @@ -23,18 +35,17 @@ architectures and their arguments and hyperparameters. > from spacy.pipeline.ner import DEFAULT_NER_MODEL > config = { > "moves": None, -> # TODO: rest +> "update_with_oracle_cut_size": 100, > "model": DEFAULT_NER_MODEL, > } > nlp.add_pipe("ner", config=config) > ``` - - -| Setting | Type | Description | Default | -| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- | -| `moves` | list | | `None` | -| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) | +| Setting | Type | Description | Default | +| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | +| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. | `None` | +| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100` | +| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) | ```python https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/ner.pyx @@ -61,19 +72,14 @@ Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and [`nlp.add_pipe`](/api/language#add_pipe). 
- - -| Name | Type | Description | -| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The shared vocabulary. | -| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. | -| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. | -| `moves` | list | | -| _keyword-only_ | | | -| `update_with_oracle_cut_size` | int | | -| `multitasks` | `Iterable` | | -| `learn_tokens` | bool | | -| `min_action_freq` | int | | +| Name | Type | Description | +| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | `Vocab` | The shared vocabulary. | +| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. | +| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. | +| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. | +| _keyword-only_ | | | +| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. 
| ## EntityRecognizer.\_\_call\_\_ {#call tag="method"} diff --git a/website/docs/api/tagger.md b/website/docs/api/tagger.md index d9b8f4caf..233171779 100644 --- a/website/docs/api/tagger.md +++ b/website/docs/api/tagger.md @@ -28,10 +28,10 @@ architectures and their arguments and hyperparameters. > nlp.add_pipe("tagger", config=config) > ``` -| Setting | Type | Description | Default | -| ---------------- | ------------------------------------------ | -------------------------------------- | ----------------------------------- | -| `set_morphology` | bool | Whether to set morphological features. | `False` | -| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [Tagger](/api/architectures#Tagger) | +| Setting | Type | Description | Default | +| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------- | +| `set_morphology` | bool | Whether to set morphological features. | `False` | +| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). | [Tagger](/api/architectures#Tagger) | ```python https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tagger.pyx @@ -58,13 +58,13 @@ Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and [`nlp.add_pipe`](/api/language#add_pipe). -| Name | Type | Description | -| ---------------- | ------- | ------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The shared vocabulary. 
| -| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. | -| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. | -| _keyword-only_ | | | -| `set_morphology` | bool | Whether to set morphological features. | +| Name | Type | Description | +| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | `Vocab` | The shared vocabulary. | +| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). | +| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. | +| _keyword-only_ | | | +| `set_morphology` | bool | Whether to set morphological features. | ## Tagger.\_\_call\_\_ {#call tag="method"} diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md index 1efd5831c..5af540828 100644 --- a/website/docs/api/textcategorizer.md +++ b/website/docs/api/textcategorizer.md @@ -9,6 +9,12 @@ api_string_name: textcat api_trainable: true --- +The text categorizer predicts **categories over a whole document**. It can learn +one or more labels, and the labels can be mutually exclusive (i.e. one true +label per document) or non-mutually exclusive (i.e. zero or more labels may be +true per document). The multi-label setting is controlled by the model instance +that's provided. + ## Config and implementation {#config} The default config is defined by the pipeline component factory and describes @@ -29,10 +35,10 @@ architectures and their arguments and hyperparameters. 
> nlp.add_pipe("textcat", config=config) > ``` -| Setting | Type | Description | Default | -| -------- | ------------------------------------------ | ------------------ | ----------------------------------------------------- | -| `labels` | `Iterable[str]` | The labels to use. | `[]` | -| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TextCatEnsemble](/api/architectures#TextCatEnsemble) | +| Setting | Type | Description | Default | +| -------- | ------------------------------------------ | --------------------------------------------------------------------------------------- | ----------------------------------------------------- | +| `labels` | `List[str]` | A list of categories to learn. If empty, the model infers the categories from the data. | `[]` | +| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts scores for each category. | [TextCatEnsemble](/api/architectures#TextCatEnsemble) | ```python https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/textcat.py @@ -67,23 +73,6 @@ shortcut for this and instantiate the component using its string name and | _keyword-only_ | | | | `labels` | `Iterable[str]` | The labels to use. | - - ## TextCategorizer.\_\_call\_\_ {#call tag="method"} Apply the pipe to one document. The document is modified in place, and returned. diff --git a/website/docs/api/tok2vec.md b/website/docs/api/tok2vec.md index f810793ce..6727b43bf 100644 --- a/website/docs/api/tok2vec.md +++ b/website/docs/api/tok2vec.md @@ -8,7 +8,20 @@ api_string_name: tok2vec api_trainable: true --- - +Apply a "token-to-vector" model and set its outputs in the doc.tensor attribute. +This is mostly useful to **share a single subnetwork** between multiple +components, e.g. to have one embedding and CNN network shared between a +[`DependencyParser`](/api/dependencyparser), [`Tagger`](/api/tagger) and +[`EntityRecognizer`](/api/entityrecognizer). 
+ +In order to use the `Tok2Vec` predictions, subsequent components should use the +[Tok2VecListener](/api/architectures#Tok2VecListener) layer as the tok2vec +subnetwork of their model. This layer will read data from the `doc.tensor` +attribute during prediction. During training, the `Tok2Vec` component will save +its prediction and backprop callback for each batch, so that the subsequent +components can backpropagate to the shared weights. This implementation is used +because it allows us to avoid relying on object identity within the models to +achieve the parameter sharing. ## Config and implementation {#config} @@ -27,9 +40,9 @@ architectures and their arguments and hyperparameters. > nlp.add_pipe("tok2vec", config=config) > ``` -| Setting | Type | Description | Default | -| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------- | -| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) | +| Setting | Type | Description | Default | +| ------- | ------------------------------------------ | ----------------------------------------------------------------------- | ----------------------------------------------- | +| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) | ```python https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tok2vec.py @@ -64,9 +77,11 @@ shortcut for this and instantiate the component using its string name and ## Tok2Vec.\_\_call\_\_ {#call tag="method"} -Apply the pipe to one document. The document is modified in place, and returned. -This usually happens under the hood when the `nlp` object is called on a text -and all pipeline components are applied to the `Doc` in order. 
Both +Apply the pipe to one document and add context-sensitive embeddings to the +`Doc.tensor` attribute, allowing them to be used as features by downstream +components. The document is modified in place, and returned. This usually +happens under the hood when the `nlp` object is called on a text and all +pipeline components are applied to the `Doc` in order. Both [`__call__`](/api/tok2vec#call) and [`pipe`](/api/tok2vec#pipe) delegate to the [`predict`](/api/tok2vec#predict) and [`set_annotations`](/api/tok2vec#set_annotations) methods.
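
Reviewer note: the delegation described for `Tok2Vec.__call__` and `pipe` above can be illustrated with a minimal plain-Python sketch. It is not spaCy's implementation. `ToyDoc` and `ToyTok2Vec` are hypothetical stand-ins, and the "model" simply emits the token length as every vector component, so the flow from `predict` to `set_annotations` is easy to follow.

```python
from typing import Iterable, Iterator, List

Vectors = List[List[float]]  # one row of floats per token


class ToyDoc:
    """Hypothetical stand-in for a Doc: a list of words plus a `tensor` slot."""

    def __init__(self, words: List[str]) -> None:
        self.words = words
        self.tensor: Vectors = []


class ToyTok2Vec:
    """Hypothetical pipe showing __call__ and pipe delegating to
    predict and set_annotations (assumed structure, not spaCy code)."""

    def __init__(self, width: int) -> None:
        self.width = width

    def predict(self, docs: List[ToyDoc]) -> List[Vectors]:
        # A real component would run its network here. The toy version
        # fills each token vector with the length of the word.
        return [[[float(len(w))] * self.width for w in doc.words] for doc in docs]

    def set_annotations(self, docs: List[ToyDoc], tokvecses: List[Vectors]) -> None:
        # Store the predictions on each document, modifying it in place.
        for doc, tokvecs in zip(docs, tokvecses):
            doc.tensor = tokvecs

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        # Single-document path: predict, annotate, return the same object.
        self.set_annotations([doc], self.predict([doc]))
        return doc

    def pipe(self, docs: Iterable[ToyDoc]) -> Iterator[ToyDoc]:
        # Stream path: treat the incoming documents as one batch.
        batch = list(docs)
        self.set_annotations(batch, self.predict(batch))
        yield from batch


tok2vec = ToyTok2Vec(width=3)
doc = tok2vec(ToyDoc(["a", "bb"]))
```

Because both entry points go through the same two methods, a downstream component only needs to read `doc.tensor` and does not care whether the document arrived through `__call__` or `pipe`.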