Update docstrings and docs

2025-11-04 09:57:26 +03:00 · 2020-08-09 16:10:48 +02:00 · 2020-08-09 16:10:48 +02:00 · a15c5fb191
commit a15c5fb191
parent 8d2baa153d
7 changed files with 152 additions and 79 deletions
--- a/spacy/pipeline/dep_parser.pyx
+++ b/spacy/pipeline/dep_parser.pyx
@ -71,6 +71,11 @@ def make_parser(
    actions are decreased. Note that more than one action may be optimal for
    a given state.

+    model (Model): The model for the transition-based parser. The model needs
+        to have a specific substructure of named components --- see the
+        spacy.ml.tb_framework.TransitionModel for details.
+    moves (List[str]): A list of transition names. Inferred from the data if not
+        provided.
    update_with_oracle_cut_size (int):
        During training, cut long sequences into shorter segments by creating
        intermediate states based on the gold-standard history. The model is
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@ -70,6 +70,47 @@ blog post for background.
 | `embed`  | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations.                                   |
 | `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |

+### spacy.Tok2VecListener.v1 {#Tok2VecListener}
+
+> #### Example config
+>
+> ```ini
+> [components.tok2vec]
+> factory = "tok2vec"
+>
+> [components.tok2vec.model]
+> @architectures = "spacy.HashEmbedCNN.v1"
+> width = 342
+>
+> [components.tagger]
+> factory = "tagger"
+>
+> [components.tagger.model]
+> @architectures = "spacy.Tagger.v1"
+>
+> [components.tagger.model.tok2vec]
+> @architectures = "spacy.Tok2VecListener.v1"
+> width = ${components.tok2vec.model:width}
+> ```
+
+A listener is used as a sublayer within a component such as a
+[`DependencyParser`](/api/dependencyparser),
+[`EntityRecognizer`](/api/entityrecognizer) or
+[`TextCategorizer`](/api/textcategorizer). Usually you'll have multiple
+listeners connecting to a single upstream [`Tok2Vec`](/api/tok2vec) component
+that's earlier in the pipeline. The listener layers act as **proxies**, passing
+the predictions from the `Tok2Vec` component into downstream components, and
+communicating gradients back upstream.
+
+Instead of defining its own `Tok2Vec` instance, a model architecture like
+[Tagger](/api/architectures#Tagger) can define a listener as its `tok2vec`
+argument that connects to the shared `tok2vec` component in the pipeline.
+
+| Name       | Type | Description                                                                                                                                                                                                                                                                                            |
+| ---------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `width`    | int  | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component.                                                                                                                                                                                                               |
+| `upstream` | str  | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. |
+
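The `upstream` matching rule in the table above can be illustrated with a small sketch. The helper name `listens_to` is hypothetical and not part of spaCy's API; it only mirrors the stated rule that a listener accepts either the wildcard string `"*"` or the exact name of a `Tok2Vec` component.

```python
def listens_to(upstream: str, component_name: str) -> bool:
    """Return True if a listener configured with `upstream` should connect
    to the Tok2Vec component called `component_name` (hypothetical helper)."""
    # The wildcard matches any upstream component; otherwise names must be equal.
    return upstream == "*" or upstream == component_name


# A wildcard listener connects to whichever Tok2Vec component is present.
print(listens_to("*", "tok2vec"))
# A named listener only connects to the component with that exact name.
print(listens_to("tok2vec", "tok2vec"))
print(listens_to("tok2vec", "other_tok2vec"))
```
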
 ### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}

 <!-- TODO: check example config -->
--- a/website/docs/api/dependencyparser.md
+++ b/website/docs/api/dependencyparser.md
@ -8,6 +8,23 @@ api_string_name: parser
 api_trainable: true
 ---

+A transition-based dependency parser component. The dependency parser jointly
+learns sentence segmentation and labelled dependency parsing, and can optionally
+learn to merge tokens that had been over-segmented by the tokenizer. The parser
+uses a variant of the **non-monotonic arc-eager transition system** described by
+[Honnibal and Johnson (2015)](https://www.aclweb.org/anthology/D15-1162/), with
+the addition of a "break" transition to perform the sentence segmentation.
+[Nivre and Nilsson (2005)](https://www.aclweb.org/anthology/P05-1013/)'s
+**pseudo-projective dependency transformation** is used to allow the parser to
+predict non-projective parses.
+
+The parser is trained using an **imitation learning objective**. It follows the
+actions predicted by the current weights, and at each state, determines which
+actions are compatible with the optimal parse that could be reached from the
+current state. The weights are updated such that the scores assigned to the set
+of optimal actions are increased, while scores assigned to other actions are
+decreased. Note that more than one action may be optimal for a given state.
+
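As a rough illustration of the objective described above, the sketch below applies one gradient step directly to a list of action scores. It assumes a loss of the form "negative log of the total softmax probability assigned to the optimal actions", which is one common way to handle several equally good actions. The function names and the learning rate are made up for this example, and it is not a copy of spaCy's implementation, which updates model weights rather than raw scores.

```python
import math
from typing import List


def optimal_set_gradient(scores: List[float], is_optimal: List[bool]) -> List[float]:
    """Gradient of -log(sum of softmax probabilities of the optimal actions)."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    optimal_total = sum(e for e, ok in zip(exps, is_optimal) if ok)
    gradient = []
    for e, ok in zip(exps, is_optimal):
        predicted = e / total
        # Target mass is spread over the optimal actions only.
        target = e / optimal_total if ok else 0.0
        gradient.append(predicted - target)
    return gradient


def sgd_step(scores: List[float], is_optimal: List[bool], learn_rate: float = 0.5) -> List[float]:
    """Move the scores one step against the gradient."""
    gradient = optimal_set_gradient(scores, is_optimal)
    return [s - learn_rate * g for s, g in zip(scores, gradient)]


# Four candidate actions; two of them are compatible with the best reachable parse.
scores = [0.2, 1.0, -0.3, 0.5]
is_optimal = [True, False, True, False]
new_scores = sgd_step(scores, is_optimal)
for old, new, ok in zip(scores, new_scores, is_optimal):
    print("optimal" if ok else "other  ", round(old, 3), "->", round(new, 3))
```

Every optimal action has its score raised and every other action has its score lowered, without forcing the model to prefer one optimal action over another.
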
 ## Config and implementation {#config}

 The default config is defined by the pipeline component factory and describes
@ -23,17 +40,20 @@ architectures and their arguments and hyperparameters.
 > from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
 > config = {
 >    "moves": None,
->   # TODO: rest
+>    "update_with_oracle_cut_size": 100,
+>    "learn_tokens": False,
+>    "min_action_freq": 30,
 >    "model": DEFAULT_PARSER_MODEL,
 > }
 > nlp.add_pipe("parser", config=config)
 > ```

-<!-- TODO: finish API docs -->
-
 | Setting                       | Type                                       | Description                                                                                                                                                                                                                                                                                 | Default                                                           |
-| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
-| `moves` | list                                       |                   | `None`                                                            |
+| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
+| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                                                                         | `None`                                                            |
+| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it.                                                                    | `100`                                                             |
+| `learn_tokens`                | bool                                       | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental.                                                                                                                                                                                             | `False`                                                           |
+| `min_action_freq`             | int                                        | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. | `30`                                                              |
 | `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                                                                                                                                           | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |

 ```python
@ -61,19 +81,16 @@ Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#add_pipe).

-<!-- TODO: finish API docs -->
-
 | Name                          | Type                                       | Description                                                                                                                                                                                                                                                                                 |
-| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
+| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `vocab`                       | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                                                                      |
 | `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                                                                             |
 | `name`                        | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                                                                 |
-| `moves`                       | list                                       |                                                                                             |
+| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                                                                         |
 | _keyword-only_                |                                            |                                                                                                                                                                                                                                                                                             |
-| `update_with_oracle_cut_size` | int                                        |                                                                                             |
-| `multitasks`                  | `Iterable`                                 |                                                                                             |
-| `learn_tokens`                | bool                                       |                                                                                             |
-| `min_action_freq`             | int                                        |                                                                                             |
+| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default.                                           |
+| `learn_tokens`                | bool                                       | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental.                                                                                                                                                                                             |
+| `min_action_freq`             | int                                        | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. |

 ## DependencyParser.\_\_call\_\_ {#call tag="method"}

--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@ -8,6 +8,18 @@ api_string_name: ner
 api_trainable: true
 ---

+A transition-based named entity recognition component. The entity recognizer
+identifies **non-overlapping labelled spans** of tokens. The transition-based
+algorithm used encodes certain assumptions that are effective for "traditional"
+named entity recognition tasks, but may not be a good fit for every span
+identification problem. Specifically, the loss function optimizes for **whole
+entity accuracy**, so if your inter-annotator agreement on boundary tokens is
+low, the component will likely perform poorly on your problem. The
+transition-based algorithm also assumes that the most decisive information about
+your entities will be close to their initial tokens. If your entities are long
+and characterized by tokens in their middle, the component will likely not be a
+good fit for your task.
+
 ## Config and implementation {#config}

 The default config is defined by the pipeline component factory and describes
@ -23,17 +35,16 @@ architectures and their arguments and hyperparameters.
 > from spacy.pipeline.ner import DEFAULT_NER_MODEL
 > config = {
 >    "moves": None,
->   # TODO: rest
+>    "update_with_oracle_cut_size": 100,
 >    "model": DEFAULT_NER_MODEL,
 > }
 > nlp.add_pipe("ner", config=config)
 > ```

-<!-- TODO: finish API docs -->
-
 | Setting                       | Type                                       | Description                                                                                                                                                                                                              | Default                                                           |
-| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
-| `moves` | list                                       |                   | `None`                                                            |
+| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |
+| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                      |
+| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100`                                                             |
 | `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                                                                        | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |

 ```python
@ -61,19 +72,14 @@ Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#add_pipe).

-<!-- TODO: finish API docs -->
-
 | Name                          | Type                                       | Description                                                                                                                                                                                                                                       |
-| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
+| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `vocab`                       | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                            |
 | `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                                   |
 | `name`                        | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                       |
-| `moves`                       | list                                       |                                                                                             |
+| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                               |
 | _keyword-only_                |                                            |                                                                                                                                                                                                                                                   |
-| `update_with_oracle_cut_size` | int                                        |                                                                                             |
-| `multitasks`                  | `Iterable`                                 |                                                                                             |
-| `learn_tokens`                | bool                                       |                                                                                             |
-| `min_action_freq`             | int                                        |                                                                                             |
+| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. |
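
The effect of `update_with_oracle_cut_size` can be pictured with the sketch below. It is only an approximation of the idea in the description: a long gold-standard action sequence is split into segments of at most `cut_size` actions, and each segment starts from the state reached by replaying the gold history before it. The function name and the transition names are hypothetical.

```python
from typing import List, Tuple


def cut_with_oracle(
    gold_actions: List[str], cut_size: int
) -> List[Tuple[List[str], List[str]]]:
    """Split one gold action sequence into (history, segment) pairs.

    `history` is the gold-standard prefix that would be replayed to build the
    intermediate start state; `segment` holds at most `cut_size` actions to
    train on from that state.
    """
    if cut_size < 1:
        raise ValueError("cut_size must be at least 1")
    pieces = []
    for start in range(0, len(gold_actions), cut_size):
        history = gold_actions[:start]
        segment = gold_actions[start : start + cut_size]
        pieces.append((history, segment))
    return pieces


# Hypothetical gold history of seven NER-style transitions, cut every three.
gold = ["O", "B-PER", "L-PER", "O", "U-ORG", "O", "O"]
for history, segment in cut_with_oracle(gold, cut_size=3):
    print(len(history), segment)
```
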

 ## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

--- a/website/docs/api/tagger.md
+++ b/website/docs/api/tagger.md
@ -29,9 +29,9 @@ architectures and their arguments and hyperparameters.
 > ```

 | Setting          | Type                                       | Description                                                                                                                                                                                                      | Default                             |
-| ---------------- | ------------------------------------------ | -------------------------------------- | ----------------------------------- |
+| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------- |
 | `set_morphology` | bool                                       | Whether to set morphological features.                                                                                                                                                                           | `False`                             |
-| `model`          | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                      | [Tagger](/api/architectures#Tagger) |
+| `model`          | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). | [Tagger](/api/architectures#Tagger) |

 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tagger.pyx
@ -59,9 +59,9 @@ shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#add_pipe).

 | Name             | Type                                       | Description                                                                                                                                                                                                      |
-| ---------------- | ------- | ------------------------------------------------------------------------------------------- |
+| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `vocab`          | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                           |
-| `model`          | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.             |
+| `model`          | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). |
 | `name`           | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                      |
 | _keyword-only_   |                                            |                                                                                                                                                                                                                  |
 | `set_morphology` | bool                                       | Whether to set morphological features.                                                                                                                                                                           |
--- a/website/docs/api/textcategorizer.md
+++ b/website/docs/api/textcategorizer.md
@ -9,6 +9,12 @@ api_string_name: textcat
 api_trainable: true
 ---

+The text categorizer predicts **categories over a whole document**. It can learn
+one or more labels, and the labels can be mutually exclusive (i.e. one true
+label per document) or non-mutually exclusive (i.e. zero or more labels may be
+true per document). The multi-label setting is controlled by the model instance
+that's provided.
+
 ## Config and implementation {#config}

 The default config is defined by the pipeline component factory and describes
@ -30,9 +36,9 @@ architectures and their arguments and hyperparameters.
 > ```

 | Setting  | Type                                       | Description                                                                             | Default                                               |
-| -------- | ------------------------------------------ | ------------------ | ----------------------------------------------------- |
-| `labels` | `Iterable[str]`                            | The labels to use. | `[]`                                                  |
-| `model`  | [`Model`](https://thinc.ai/docs/api-model) | The model to use.  | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
+| -------- | ------------------------------------------ | --------------------------------------------------------------------------------------- | ----------------------------------------------------- |
+| `labels` | `List[str]`                                | A list of categories to learn. If empty, the model infers the categories from the data. | `[]`                                                  |
+| `model`  | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts scores for each category.                                | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |

 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/textcat.py
@ -67,23 +73,6 @@ shortcut for this and instantiate the component using its string name and
 | _keyword-only_ |                                            |                                                                                             |
 | `labels`       | `Iterable[str]`                            | The labels to use.                                                                          |

-<!-- TODO move to config page
-### Architectures {#architectures new="2.1"}
-
-Text classification models can be used to solve a wide variety of problems.
-Differences in text length, number of labels, difficulty, and runtime
-performance constraints mean that no single algorithm performs well on all types
-of problems. To handle a wider variety of problems, the `TextCategorizer` object
-allows configuration of its model architecture, using the `architecture` keyword
-argument.
-
-| Name           | Description                                                                                                                                                                                                                                                                                                                                                                                                      |
-| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `"ensemble"`   | **Default:** Stacked ensemble of a bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. The "ngram_size" and "attr" arguments can be used to configure the feature extraction for the bag-of-words model.                                                                                                                                               |
-| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster.                                                                                                                                                                                |
-| `"bow"`        | An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. The features extracted can be controlled using the keyword arguments `ngram_size` and `attr`. For instance, `ngram_size=3` and `attr="lower"` would give lower-cased unigram, trigram and bigram features. 2, 3 or 4 are usually good choices of ngram size. |
-->
-
 ## TextCategorizer.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
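
The difference between the mutually exclusive and non-mutually exclusive settings described in the introduction to the text categorizer can be sketched in plain Python. This is an illustration only, under the assumption that exclusive labels are scored with a softmax and non-exclusive labels with independent sigmoids. The function names and raw scores are made up, and the actual behaviour depends on the model instance that is provided.

```python
import math
from typing import Dict


def exclusive_scores(logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax: scores compete and sum to 1, so one label wins per document."""
    peak = max(logits.values())
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def multilabel_scores(logits: Dict[str, float]) -> Dict[str, float]:
    """Sigmoid: each label is scored independently, so zero or more may be true."""
    return {label: 1.0 / (1.0 + math.exp(-value)) for label, value in logits.items()}


logits = {"SPORTS": 2.0, "POLITICS": 1.5, "TECH": -3.0}  # made-up raw scores
print(exclusive_scores(logits))
print(multilabel_scores(logits))
```

With the same raw scores, the exclusive version has to divide one unit of probability between the labels, while the multi-label version can rate both `SPORTS` and `POLITICS` above 0.5.
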
--- a/website/docs/api/tok2vec.md
+++ b/website/docs/api/tok2vec.md
@ -8,7 +8,20 @@ api_string_name: tok2vec
 api_trainable: true
 ---

-<!-- TODO: intro describing component -->
+Apply a "token-to-vector" model and set its outputs in the `Doc.tensor`
+attribute. This is mostly useful to **share a single subnetwork** between
+multiple components, e.g. to have one embedding and CNN network shared between a
+[`DependencyParser`](/api/dependencyparser), [`Tagger`](/api/tagger) and
+[`EntityRecognizer`](/api/entityrecognizer).
+
+In order to use the `Tok2Vec` predictions, subsequent components should use the
+[Tok2VecListener](/api/architectures#Tok2VecListener) layer as the `tok2vec`
+subnetwork of their model. This layer will read data from the `Doc.tensor`
+attribute during prediction. During training, the `Tok2Vec` component will save
+its prediction and backprop callback for each batch, so that the subsequent
+components can backpropagate to the shared weights. This implementation is used
+because it allows us to avoid relying on object identity within the models to
+achieve the parameter sharing.
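
The sharing mechanism described above, where the upstream component saves its prediction and a backprop callback for each batch and hands both to its listeners, can be sketched with toy classes. `ToyTok2Vec` and `ToyListener` are hypothetical stand-ins rather than spaCy classes, and the "embedding" is just the token length, so the sketch shows only the flow of outputs and gradients.

```python
from typing import Callable, List

Vectors = List[List[float]]  # one list of per-token values per document


class ToyListener:
    """Proxy layer: holds the upstream output and backprop callback for a batch."""

    def __init__(self, upstream: str = "*") -> None:
        self.upstream = upstream
        self.outputs: Vectors = []
        self.backprop: Callable[[Vectors], None] = lambda d_outputs: None

    def receive(self, outputs: Vectors, backprop: Callable[[Vectors], None]) -> None:
        self.outputs = outputs
        self.backprop = backprop


class ToyTok2Vec:
    """Shared component: predicts once per batch and notifies its listeners."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.listeners: List[ToyListener] = []
        self.gradient: Vectors = []

    def add_listener(self, listener: ToyListener) -> None:
        if listener.upstream in ("*", self.name):
            self.listeners.append(listener)

    def update(self, docs: List[List[str]]) -> None:
        # Stand-in for a real embedding: one number per token.
        outputs = [[float(len(token)) for token in doc] for doc in docs]
        gradient = [[0.0 for _ in doc] for doc in docs]

        def backprop(d_outputs: Vectors) -> None:
            # Gradients from every downstream component are summed.
            for doc_grad, doc_delta in zip(gradient, d_outputs):
                for i, delta in enumerate(doc_delta):
                    doc_grad[i] += delta

        self.gradient = gradient
        for listener in self.listeners:
            listener.receive(outputs, backprop)


tok2vec = ToyTok2Vec("tok2vec")
tagger_layer, parser_layer = ToyListener(), ToyListener("tok2vec")
tok2vec.add_listener(tagger_layer)
tok2vec.add_listener(parser_layer)
tok2vec.update([["shared", "weights"]])
for layer in (tagger_layer, parser_layer):
    layer.backprop([[1.0, 1.0]])  # each component sends its own gradient upstream
print(tagger_layer.outputs, tok2vec.gradient)
```

Both listeners see the same saved outputs, and their gradients accumulate in one place, so no downstream model needs to hold a reference to the shared network itself.
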

 ## Config and implementation {#config}

@ -28,8 +41,8 @@ architectures and their arguments and hyperparameters.
 > ```

 | Setting | Type                                       | Description                                                             | Default                                         |
-| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------- |
-| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |
+| ------- | ------------------------------------------ | ----------------------------------------------------------------------- | ----------------------------------------------- |
+| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |

 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tok2vec.py
@ -64,9 +77,11 @@ shortcut for this and instantiate the component using its string name and

 ## Tok2Vec.\_\_call\_\_ {#call tag="method"}

-Apply the pipe to one document. The document is modified in place, and returned.
-This usually happens under the hood when the `nlp` object is called on a text
-and all pipeline components are applied to the `Doc` in order. Both
+Apply the pipe to one document and add context-sensitive embeddings to the
+`Doc.tensor` attribute, allowing them to be used as features by downstream
+components. The document is modified in place, and returned. This usually
+happens under the hood when the `nlp` object is called on a text and all
+pipeline components are applied to the `Doc` in order. Both
 [`__call__`](/api/tok2vec#call) and [`pipe`](/api/tok2vec#pipe) delegate to the
 [`predict`](/api/tok2vec#predict) and
 [`set_annotations`](/api/tok2vec#set_annotations) methods.