Update docstrings and docs

2025-11-04 01:48:04 +03:00 · 2020-08-09 16:10:48 +02:00 · a15c5fb191
commit a15c5fb191
parent 8d2baa153d
7 changed files with 152 additions and 79 deletions
--- a/spacy/pipeline/dep_parser.pyx
+++ b/spacy/pipeline/dep_parser.pyx
@ -71,6 +71,11 @@ def make_parser(
    actions are decreased. Note that more than one action may be optimal for
    a given state.
    model (Model): The model for the transition-based parser. The model needs
        to have a specific substructure of named components --- see the
        spacy.ml.tb_framework.TransitionModel for details.
    moves (List[str]): A list of transition names. Inferred from the data if not
        provided.
    update_with_oracle_cut_size (int):
        During training, cut long sequences into shorter segments by creating
        intermediate states based on the gold-standard history. The model is
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@ -70,6 +70,47 @@ blog post for background.
 | `embed`  | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations.                                   |
 | `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |
 ### spacy.Tok2VecListener.v1 {#Tok2VecListener}
 > #### Example config
 >
 > ```ini
 > [components.tok2vec]
 > factory = "tok2vec"
 >
 > [components.tok2vec.model]
 > @architectures = "spacy.HashEmbedCNN.v1"
 > width = 342
 >
 > [components.tagger]
 > factory = "tagger"
 >
 > [components.tagger.model]
 > @architectures = "spacy.Tagger.v1"
 >
 > [components.tagger.model.tok2vec]
 > @architectures = "spacy.Tok2VecListener.v1"
 > width = ${components.tok2vec.model:width}
 > ```
 A listener is used as a sublayer within a component such as a
 [`DependencyParser`](/api/dependencyparser),
 [`EntityRecognizer`](/api/entityrecognizer) or
 [`TextCategorizer`](/api/textcategorizer). Usually you'll have multiple
 listeners connecting to a single upstream [`Tok2Vec`](/api/tok2vec) component
 that's earlier in the pipeline. The listener layers act as **proxies**, passing
 the predictions from the `Tok2Vec` component into downstream components, and
 communicating gradients back upstream.
 Instead of defining its own `Tok2Vec` instance, a model architecture like
 [Tagger](/api/architectures#Tagger) can define a listener as its `tok2vec`
 argument that connects to the shared `tok2vec` component in the pipeline.
 | Name       | Type | Description                                                                                                                                                                                                                                                                                            |
 | ---------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `width`    | int  | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component.                                                                                                                                                                                                               |
 | `upstream` | str  | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. |
 ### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}
 <!-- TODO: check example config -->
--- a/website/docs/api/dependencyparser.md
+++ b/website/docs/api/dependencyparser.md
@ -8,6 +8,23 @@ api_string_name: parser
 api_trainable: true
 ---
 A transition-based dependency parser component. The dependency parser jointly
 learns sentence segmentation and labelled dependency parsing, and can optionally
 learn to merge tokens that had been over-segmented by the tokenizer. The parser
 uses a variant of the **non-monotonic arc-eager transition-system** described by
 [Honnibal and Johnson (2015)](https://www.aclweb.org/anthology/D15-1162/), with
 the addition of a "break" transition to perform the sentence segmentation.
 [Nivre and Nilsson (2005)](https://www.aclweb.org/anthology/P05-1013/)'s **pseudo-projective
 dependency transformation** is used to allow the parser to predict
 non-projective parses.
 The parser is trained using an **imitation learning objective**. It follows the
 actions predicted by the current weights, and at each state, determines which
 actions are compatible with the optimal parse that could be reached from the
 current state. The weights are updated such that the scores assigned to the set
 of optimal actions are increased, while scores assigned to other actions are
 decreased. Note that more than one action may be optimal for a given state.
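
As a rough illustration of the update described above, the sketch below computes a gradient over the action scores for a single state. It assumes a softmax over the scores and a target distribution that renormalizes the predicted probabilities over the zero-cost (optimal) actions. The function name `oracle_gradient` and the `costs` list are hypothetical and are not part of spaCy's API.

```python
import math
from typing import List


def oracle_gradient(scores: List[float], costs: List[float]) -> List[float]:
    """Gradient of a set-valued log loss for one parser state.

    scores: model scores, one per action.
    costs: oracle costs, one per action; 0 marks an optimal action.
    Assumes at least one action has cost 0.
    """
    # Softmax over all actions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Renormalize the predicted probabilities over the optimal actions only.
    gold_mass = sum(p for p, c in zip(probs, costs) if c == 0)
    target = [p / gold_mass if c == 0 else 0.0 for p, c in zip(probs, costs)]

    # Negative entries push a score up under gradient descent,
    # positive entries push it down.
    return [p - t for p, t in zip(probs, target)]


# Two optimal actions (cost 0) and one suboptimal action (cost 1).
grad = oracle_gradient(scores=[1.0, 0.5, -0.2], costs=[0, 0, 1])
```

With this target, both optimal actions receive a negative gradient and the remaining action a positive one, which matches the note that more than one action may be optimal for a given state.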
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
@ -23,17 +40,20 @@ architectures and their arguments and hyperparameters.
 > from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
 > config = {
 >    "moves": None,
->   # TODO: rest
+>    "update_with_oracle_cut_size": 100,
 >    "learn_tokens": False,
 >    "min_action_freq": 30,
 >    "model": DEFAULT_PARSER_MODEL,
 > }
 > nlp.add_pipe("parser", config=config)
 > ```
 <!-- TODO: finish API docs -->
 | Setting                       | Type                                       | Description                                                                                                                                                                                                                                                                                 | Default                                                           |
-| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
+| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
-| `moves` | list                                       |                   | `None`                                                            |
+| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                                                                         | `None`                                                            |
 | `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it.                                                                    | `100`                                                             |
 | `learn_tokens`                | bool                                       | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental.                                                                                                                                                                                             | `False`                                                           |
 | `min_action_freq`             | int                                        | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. | `30`                                                              |
 | `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                                                                                                                                           | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
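
To make the `min_action_freq` setting more concrete, here is a minimal sketch of label back-off. It assumes actions are named `<move>-<label>` (for example `L-nsubj`), which is an illustrative convention here, and the helper `back_off_rare_labels` is hypothetical rather than a spaCy function.

```python
from collections import Counter
from typing import List


def back_off_rare_labels(actions: List[str], min_freq: int) -> List[str]:
    """Replace the label of rare labelled actions with the generic "dep"."""
    counts = Counter(actions)
    backed_off = []
    for action in actions:
        # Assumed naming scheme: "<move>-<label>"; unlabelled moves have no "-".
        move, sep, _label = action.partition("-")
        if sep and counts[action] < min_freq:
            backed_off.append(f"{move}-dep")
        else:
            backed_off.append(action)
    return backed_off


actions = ["L-nsubj"] * 3 + ["R-dobj"] * 3 + ["R-vocative", "S"]
result = back_off_rare_labels(actions, min_freq=2)
```

Here only `R-vocative` falls below the threshold, so it becomes `R-dep`, while the unlabelled `S` move is left alone.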
 ```python
@ -61,19 +81,16 @@ Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#add_pipe).
 <!-- TODO: finish API docs -->
 | Name                          | Type                                       | Description                                                                                                                                                                                                                                                                                 |
-| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
+| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `vocab`                       | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                                                                      |
 | `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                                                                             |
 | `name`                        | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                                                                 |
-| `moves`                       | list                                       |                                                                                             |
+| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                                                                         |
 | _keyword-only_                |                                            |                                                                                                                                                                                                                                                                                             |
-| `update_with_oracle_cut_size` | int                                        |                                                                                             |
+| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default.                                           |
-| `multitasks`                  | `Iterable`                                 |                                                                                             |
+| `learn_tokens`                | bool                                       | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental.                                                                                                                                                                                             |
-| `learn_tokens`                | bool                                       |                                                                                             |
+| `min_action_freq`             | int                                        | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. |
 | `min_action_freq`             | int                                        |                                                                                             |
 ## DependencyParser.\_\_call\_\_ {#call tag="method"}
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@ -8,6 +8,18 @@ api_string_name: ner
 api_trainable: true
 ---
 A transition-based named entity recognition component. The entity recognizer
 identifies **non-overlapping labelled spans** of tokens. The transition-based
 algorithm used encodes certain assumptions that are effective for "traditional"
 named entity recognition tasks, but may not be a good fit for every span
 identification problem. Specifically, the loss function optimizes for **whole
 entity accuracy**, so if your inter-annotator agreement on boundary tokens is
 low, the component will likely perform poorly on your problem. The
 transition-based algorithm also assumes that the most decisive information about
 your entities will be close to their initial tokens. If your entities are long
 and characterized by tokens in their middle, the component will likely not be a
 good fit for your task.
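
The effect of optimizing for whole-entity accuracy can be illustrated with exact-match span scoring. The sketch below is an evaluation-style example, not the component's loss function, and the `Span` tuple layout and `exact_match_prf` helper are assumptions made for illustration.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token, label)


def exact_match_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F-score where only exact span matches count."""
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {(0, 2, "ORG"), (5, 6, "GPE")}
pred = {(0, 3, "ORG"), (5, 6, "GPE")}  # first span is one token too long
precision, recall, f_score = exact_match_prf(gold, pred)
```

A single boundary disagreement turns an otherwise correct entity into both a false positive and a false negative, which is why low agreement on boundary tokens hurts so much under this kind of objective.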
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
@ -23,17 +35,16 @@ architectures and their arguments and hyperparameters.
 > from spacy.pipeline.ner import DEFAULT_NER_MODEL
 > config = {
 >    "moves": None,
->   # TODO: rest
+>    "update_with_oracle_cut_size": 100,
 >    "model": DEFAULT_NER_MODEL,
 > }
 > nlp.add_pipe("ner", config=config)
 > ```
 <!-- TODO: finish API docs -->
 | Setting                       | Type                                       | Description                                                                                                                                                                                                              | Default                                                           |
-| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
+| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |
-| `moves` | list                                       |                   | `None`                                                            |
+| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                      |
 | `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100`                                                             |
 | `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                                                                        | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
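
The idea behind `update_with_oracle_cut_size` can be sketched on plain lists of action names. This is a simplification: it assumes the intermediate state is fully determined by replaying the gold-standard prefix, and the helper `cut_gold_history` is hypothetical. spaCy itself operates on parser states rather than lists of strings.

```python
from typing import List, Tuple


def cut_gold_history(
    gold_actions: List[str], cut_size: int
) -> List[Tuple[List[str], List[str]]]:
    """Split one gold action sequence into (history, segment) pairs.

    The history is the gold-standard prefix that would be replayed to build
    the intermediate start state; the segment holds at most `cut_size`
    actions to train on from that state.
    """
    if cut_size < 1:
        raise ValueError("cut_size must be at least 1")
    pieces = []
    for start in range(0, len(gold_actions), cut_size):
        history = gold_actions[:start]
        segment = gold_actions[start : start + cut_size]
        pieces.append((history, segment))
    return pieces


pieces = cut_gold_history(["S", "S", "L-nsubj", "S", "R-dobj"], cut_size=2)
```

Each training segment stays short regardless of document length, and the final segment may be shorter than `cut_size`.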
 ```python
@ -61,19 +72,14 @@ Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#add_pipe).
 <!-- TODO: finish API docs -->
 | Name                          | Type                                       | Description                                                                                                                                                                                                                                       |
-| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
+| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `vocab`                       | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                            |
 | `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                                   |
 | `name`                        | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                       |
-| `moves`                       | list                                       |                                                                                             |
+| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                               |
 | _keyword-only_                |                                            |                                                                                                                                                                                                                                                   |
-| `update_with_oracle_cut_size` | int                                        |                                                                                             |
+| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. |
 | `multitasks`                  | `Iterable`                                 |                                                                                             |
 | `learn_tokens`                | bool                                       |                                                                                             |
 | `min_action_freq`             | int                                        |                                                                                             |
 ## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
--- a/website/docs/api/tagger.md
+++ b/website/docs/api/tagger.md
@ -29,9 +29,9 @@ architectures and their arguments and hyperparameters.
 > ```
 | Setting          | Type                                       | Description                                                                                                                                                                                                      | Default                             |
-| ---------------- | ------------------------------------------ | -------------------------------------- | ----------------------------------- |
+| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------- |
 | `set_morphology` | bool                                       | Whether to set morphological features.                                                                                                                                                                           | `False`                             |
-| `model`          | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                      | [Tagger](/api/architectures#Tagger) |
+| `model`          | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). | [Tagger](/api/architectures#Tagger) |
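
The output contract described for `model` (one row per token, one column per tag, rows summing to `1`) can be checked with a small sketch. It assumes a softmax is used for the normalization and uses plain Python lists instead of Thinc arrays; `softmax_rows` is a hypothetical helper.

```python
import math
from typing import List


def softmax_rows(scores: List[List[float]]) -> List[List[float]]:
    """Normalize each row of raw scores into a probability distribution."""
    normalized = []
    for row in scores:
        top = max(row)  # shift by the max for numerical stability
        exps = [math.exp(value - top) for value in row]
        total = sum(exps)
        normalized.append([e / total for e in exps])
    return normalized


# Two tokens, three tags (hypothetical raw scores).
probs = softmax_rows([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
```

Every value ends up between 0 and 1, and a row of equal raw scores becomes a uniform distribution over the tags.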
 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tagger.pyx
@ -59,9 +59,9 @@ shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#add_pipe).
 | Name             | Type                                       | Description                                                                                                                                                                                                      |
-| ---------------- | ------- | ------------------------------------------------------------------------------------------- |
+| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `vocab`          | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                           |
-| `model`          | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.             |
+| `model`          | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). |
 | `name`           | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                      |
 | _keyword-only_   |                                            |                                                                                                                                                                                                                  |
 | `set_morphology` | bool                                       | Whether to set morphological features.                                                                                                                                                                           |
--- a/website/docs/api/textcategorizer.md
+++ b/website/docs/api/textcategorizer.md
@ -9,6 +9,12 @@ api_string_name: textcat
 api_trainable: true
 ---
 The text categorizer predicts **categories over a whole document**. It can learn
 one or more labels, and the labels can be mutually exclusive (i.e. one true
 label per document) or non-mutually exclusive (i.e. zero or more labels may be
 true per document). The multi-label setting is controlled by the model instance
 that's provided.
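
The difference between the two settings can be sketched with plain Python. The example assumes the common pairing of a softmax for mutually exclusive labels and an independent sigmoid per label for the multi-label case; the function names and raw scores are hypothetical and not taken from spaCy.

```python
import math
from typing import Dict


def exclusive_scores(logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax: labels compete and the scores sum to 1."""
    top = max(logits.values())
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def multilabel_scores(logits: Dict[str, float]) -> Dict[str, float]:
    """Sigmoid per label: each score is independent of the others."""
    return {label: 1.0 / (1.0 + math.exp(-value)) for label, value in logits.items()}


logits = {"SPORTS": 2.0, "POLITICS": 1.5, "TECH": -3.0}
exclusive = exclusive_scores(logits)
multilabel = multilabel_scores(logits)

# Exclusive: pick the single best label. Multi-label: keep all above 0.5.
best_label = max(exclusive, key=exclusive.get)
accepted = sorted(label for label, score in multilabel.items() if score > 0.5)
```

In the exclusive setting exactly one label is chosen, while the multi-label setting can accept zero or more labels for the same document.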
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
@ -30,9 +36,9 @@ architectures and their arguments and hyperparameters.
 > ```
 | Setting  | Type                                       | Description                                                                             | Default                                               |
-| -------- | ------------------------------------------ | ------------------ | ----------------------------------------------------- |
+| -------- | ------------------------------------------ | --------------------------------------------------------------------------------------- | ----------------------------------------------------- |
-| `labels` | `Iterable[str]`                            | The labels to use. | `[]`                                                  |
+| `labels` | `List[str]`                                | A list of categories to learn. If empty, the model infers the categories from the data. | `[]`                                                  |
-| `model`  | [`Model`](https://thinc.ai/docs/api-model) | The model to use.  | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
+| `model`  | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts scores for each category.                                | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/textcat.py
@ -67,23 +73,6 @@ shortcut for this and instantiate the component using its string name and
 | _keyword-only_ |                                            |                                                                                             |
 | `labels`       | `Iterable[str]`                            | The labels to use.                                                                          |
 <!-- TODO move to config page
 ### Architectures {#architectures new="2.1"}
 Text classification models can be used to solve a wide variety of problems.
 Differences in text length, number of labels, difficulty, and runtime
 performance constraints mean that no single algorithm performs well on all types
 of problems. To handle a wider variety of problems, the `TextCategorizer` object
 allows configuration of its model architecture, using the `architecture` keyword
 argument.
 | Name           | Description                                                                                                                                                                                                                                                                                                                                                                                                      |
 | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `"ensemble"`   | **Default:** Stacked ensemble of a bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. The "ngram_size" and "attr" arguments can be used to configure the feature extraction for the bag-of-words model.                                                                                                                                               |
 | `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster.                                                                                                                                                                                |
 | `"bow"`        | An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. The features extracted can be controlled using the keyword arguments `ngram_size` and `attr`. For instance, `ngram_size=3` and `attr="lower"` would give lower-cased unigram, trigram and bigram features. 2, 3 or 4 are usually good choices of ngram size. |
 -->
 ## TextCategorizer.\_\_call\_\_ {#call tag="method"}
 Apply the pipe to one document. The document is modified in place, and returned.
--- a/website/docs/api/tok2vec.md
+++ b/website/docs/api/tok2vec.md
@ -8,7 +8,20 @@ api_string_name: tok2vec
 api_trainable: true
 ---
-<!-- TODO: intro describing component -->
+Apply a "token-to-vector" model and set its outputs in the `Doc.tensor` attribute.
 This is mostly useful to **share a single subnetwork** between multiple
 components, e.g. to have one embedding and CNN network shared between a
 [`DependencyParser`](/api/dependencyparser), [`Tagger`](/api/tagger) and
 [`EntityRecognizer`](/api/entityrecognizer).
 In order to use the `Tok2Vec` predictions, subsequent components should use the
 [Tok2VecListener](/api/architectures#Tok2VecListener) layer as the tok2vec
 subnetwork of their model. This layer will read data from the `doc.tensor`
 attribute during prediction. During training, the `Tok2Vec` component will save
 its prediction and backprop callback for each batch, so that the subsequent
 components can backpropagate to the shared weights. This implementation is used
 because it allows us to avoid relying on object identity within the models to
 achieve the parameter sharing.
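
The sharing mechanism can be pictured with a toy sketch in plain Python. It is a simplified stand-in, not spaCy's implementation: `ToyTok2Vec`, `ToyListener` and their methods are hypothetical names, the "embedding" is just the text length, and gradients are merely collected rather than applied.

```python
from typing import Callable, List, Optional, Tuple

Vectors = List[List[float]]
Backprop = Callable[[Vectors], None]


class ToyListener:
    """Proxy that hands cached upstream outputs to a downstream component."""

    def __init__(self) -> None:
        self._outputs: Optional[Vectors] = None
        self._backprop: Optional[Backprop] = None

    def receive(self, outputs: Vectors, backprop: Backprop) -> None:
        # Called by the upstream component once per batch.
        self._outputs = outputs
        self._backprop = backprop

    def forward(self) -> Tuple[Vectors, Backprop]:
        if self._outputs is None or self._backprop is None:
            raise RuntimeError("no batch received from the upstream component")
        return self._outputs, self._backprop


class ToyTok2Vec:
    """Shared upstream component that broadcasts each batch to its listeners."""

    def __init__(self) -> None:
        self.listeners: List[ToyListener] = []
        self.accumulated: List[Vectors] = []  # gradients sent back upstream

    def update(self, batch: List[str]) -> None:
        # Stand-in for a real embedding: one 1-d "vector" per text.
        outputs = [[float(len(text))] for text in batch]
        for listener in self.listeners:
            listener.receive(outputs, self.accumulated.append)


tok2vec = ToyTok2Vec()
tagger_listener, parser_listener = ToyListener(), ToyListener()
tok2vec.listeners.extend([tagger_listener, parser_listener])
tok2vec.update(["a short text", "another"])

for listener in (tagger_listener, parser_listener):
    vectors, backprop = listener.forward()
    backprop([[0.1] for _ in vectors])  # hypothetical downstream gradient
```

Both downstream listeners read the same cached outputs and send their gradients to the same upstream object, without the downstream models holding a reference to the shared model itself.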
 ## Config and implementation {#config}
@ -28,8 +41,8 @@ architectures and their arguments and hyperparameters.
 > ```
 | Setting | Type                                       | Description                                                             | Default                                         |
-| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------- |
+| ------- | ------------------------------------------ | ----------------------------------------------------------------------- | ----------------------------------------------- |
-| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |
+| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |
 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tok2vec.py
@ -64,9 +77,11 @@ shortcut for this and instantiate the component using its string name and
 ## Tok2Vec.\_\_call\_\_ {#call tag="method"}
-Apply the pipe to one document. The document is modified in place, and returned.
+Apply the pipe to one document and add context-sensitive embeddings to the
-This usually happens under the hood when the `nlp` object is called on a text
+`Doc.tensor` attribute, allowing them to be used as features by downstream
-and all pipeline components are applied to the `Doc` in order. Both
+components. The document is modified in place, and returned. This usually
 happens under the hood when the `nlp` object is called on a text and all
 pipeline components are applied to the `Doc` in order. Both
 [`__call__`](/api/tok2vec#call) and [`pipe`](/api/tok2vec#pipe) delegate to the
 [`predict`](/api/tok2vec#predict) and
 [`set_annotations`](/api/tok2vec#set_annotations) methods.