Update docstrings and docs

Ines Montani 2020-08-09 16:10:48 +02:00
parent 8d2baa153d
commit a15c5fb191
7 changed files with 152 additions and 79 deletions

@ -71,6 +71,11 @@ def make_parser(
actions are decreased. Note that more than one action may be optimal for
a given state.
model (Model): The model for the transition-based parser. The model needs
to have a specific substructure of named components --- see the
spacy.ml.tb_framework.TransitionModel for details.
moves (List[str]): A list of transition names. Inferred from the data if not
provided.
update_with_oracle_cut_size (int):
During training, cut long sequences into shorter segments by creating
intermediate states based on the gold-standard history. The model is

@ -70,6 +70,47 @@ blog post for background.
| `embed` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. |
| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |
### spacy.Tok2VecListener.v1 {#Tok2VecListener}
> #### Example config
>
> ```ini
> [components.tok2vec]
> factory = "tok2vec"
>
> [components.tok2vec.model]
> @architectures = "spacy.HashEmbedCNN.v1"
> width = 342
>
> [components.tagger]
> factory = "tagger"
>
> [components.tagger.model]
> @architectures = "spacy.Tagger.v1"
>
> [components.tagger.model.tok2vec]
> @architectures = "spacy.Tok2VecListener.v1"
> width = ${components.tok2vec.model:width}
> ```
A listener is used as a sublayer within a component such as a
[`DependencyParser`](/api/dependencyparser),
[`EntityRecognizer`](/api/entityrecognizer) or
[`TextCategorizer`](/api/textcategorizer). Usually you'll have multiple
listeners connecting to a single upstream [`Tok2Vec`](/api/tok2vec) component
that's earlier in the pipeline. The listener layers act as **proxies**, passing
the predictions from the `Tok2Vec` component into downstream components, and
communicating gradients back upstream.
Instead of defining its own `Tok2Vec` instance, a model architecture like
[Tagger](/api/architectures#tagger) can define a listener as its `tok2vec`
argument that connects to the shared `tok2vec` component in the pipeline.
| Name | Type | Description |
| ---------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `width` | int | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. |
| `upstream` | str | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. |
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}
<!-- TODO: check example config -->

@ -8,6 +8,23 @@ api_string_name: parser
api_trainable: true
---
A transition-based dependency parser component. The dependency parser jointly
learns sentence segmentation and labelled dependency parsing, and can optionally
learn to merge tokens that had been over-segmented by the tokenizer. The parser
uses a variant of the **non-monotonic arc-eager transition system** described by
[Honnibal and Johnson (2015)](https://www.aclweb.org/anthology/D15-1162/), with
the addition of a "break" transition to perform the sentence segmentation.
[Nivre (2005)](https://www.aclweb.org/anthology/P05-1013/)'s **pseudo-projective
dependency transformation** is used to allow the parser to predict
non-projective parses.
The parser is trained using an **imitation learning objective**. It follows the
actions predicted by the current weights, and at each state, determines which
actions are compatible with the optimal parse that could be reached from the
current state. The weights are updated such that the scores assigned to the set
of optimal actions are increased, while scores assigned to other actions are decreased. Note
that more than one action may be optimal for a given state.
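The update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the parser's actual loss: it assumes the objective is the negative log of the total probability assigned to the optimal actions, and the names `softmax`, `oracle_gradient`, `scores` and `optimal` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw action scores into probabilities."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def oracle_gradient(scores, optimal):
    """Gradient of -log(sum of probabilities of the optimal actions).

    Assumption: this loss is one common way to handle several optimal
    actions per state. It is not claimed to be the library's own objective.
    """
    probs = softmax(scores)
    gold_mass = sum(probs[i] for i in optimal)
    return [
        p - (p / gold_mass if i in optimal else 0.0)
        for i, p in enumerate(probs)
    ]


scores = [1.0, 0.5, -0.2, 0.1]  # one score per action in the current state
optimal = {0, 2}                # more than one action may be optimal
grad = oracle_gradient(scores, optimal)

# A gradient descent step raises optimal scores and lowers the others.
updated = [s - 0.5 * g for s, g in zip(scores, grad)]
```

The gradient is negative for every optimal action and positive for every other action, so a descent step moves the scores in the directions the paragraph describes.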
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes
@ -23,18 +40,21 @@ architectures and their arguments and hyperparameters.
> from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
> config = {
> "moves": None,
> # TODO: rest
> "update_with_oracle_cut_size": 100,
> "learn_tokens": False,
> "min_action_freq": 30,
> "model": DEFAULT_PARSER_MODEL,
> }
> nlp.add_pipe("parser", config=config)
> ```
<!-- TODO: finish API docs -->
| Setting | Type | Description | Default |
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
| `moves` | list | | `None` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
| Setting | Type | Description | Default |
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. | `None` |
| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100` |
| `learn_tokens` | bool | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. | `False` |
| `min_action_freq` | int | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. | `30` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
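To make the effect of `min_action_freq` more concrete, here is a rough sketch of frequency-based back-off. It simplifies labelled actions to bare dependency labels, assumes that "retain" means a count of at least `min_freq`, and uses a hypothetical helper name, `back_off_rare_labels`. The real component applies this logic to its transition inventory.

```python
from collections import Counter


def back_off_rare_labels(labels, min_freq, fallback="dep"):
    """Replace labels seen fewer than `min_freq` times with `fallback`."""
    counts = Counter(labels)
    return [label if counts[label] >= min_freq else fallback for label in labels]


# Hypothetical training labels: two frequent labels and one rare one.
labels = ["nsubj"] * 40 + ["dobj"] * 35 + ["dislocated"] * 3
backed_off = back_off_rare_labels(labels, min_freq=30)
label_counts = Counter(backed_off)
```

With these made-up counts, `nsubj` and `dobj` survive, while the three `dislocated` arcs are relabelled as `dep`.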
```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/dep_parser.pyx
@ -61,19 +81,16 @@ Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).
<!-- TODO: finish API docs -->
| Name | Type | Description |
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
| `moves` | list | |
| _keyword-only_ | | |
| `update_with_oracle_cut_size` | int | |
| `multitasks` | `Iterable` | |
| `learn_tokens` | bool | |
| `min_action_freq` | int | |
| Name | Type | Description |
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. |
| _keyword-only_ | | |
| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. |
| `learn_tokens` | bool | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. |
| `min_action_freq` | int | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. |
## DependencyParser.\_\_call\_\_ {#call tag="method"}

@ -8,6 +8,18 @@ api_string_name: ner
api_trainable: true
---
A transition-based named entity recognition component. The entity recognizer
identifies **non-overlapping labelled spans** of tokens. The transition-based
algorithm used encodes certain assumptions that are effective for "traditional"
named entity recognition tasks, but may not be a good fit for every span
identification problem. Specifically, the loss function optimizes for **whole
entity accuracy**, so if your inter-annotator agreement on boundary tokens is
low, the component will likely perform poorly on your problem. The
transition-based algorithm also assumes that the most decisive information about
your entities will be close to their initial tokens. If your entities are long
and characterized by tokens in their middle, the component will likely not be a
good fit for your task.
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes
@ -23,18 +35,17 @@ architectures and their arguments and hyperparameters.
> from spacy.pipeline.ner import DEFAULT_NER_MODEL
> config = {
> "moves": None,
> # TODO: rest
> "update_with_oracle_cut_size": 100,
> "model": DEFAULT_NER_MODEL,
> }
> nlp.add_pipe("ner", config=config)
> ```
<!-- TODO: finish API docs -->
| Setting | Type | Description | Default |
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
| `moves` | list | | `None` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
| Setting | Type | Description | Default |
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |
| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. | `None` |
| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
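The idea behind `update_with_oracle_cut_size` can be illustrated with a toy sketch. It represents a training example as a plain list of gold-standard actions and uses a hypothetical helper, `cut_by_oracle`. The actual component works on parser states, so treat this only as an illustration of the segmentation.

```python
def cut_by_oracle(gold_actions, max_length):
    """Split a gold action sequence into segments of at most `max_length`.

    Each segment is paired with the gold history that precedes it. Replaying
    that history is what would produce the intermediate starting state.
    """
    segments = []
    for start in range(0, len(gold_actions), max_length):
        history = gold_actions[:start]
        segment = gold_actions[start:start + max_length]
        segments.append((history, segment))
    return segments


# Hypothetical gold sequence of 250 actions, cut with the default size of 100.
gold = [f"action_{i}" for i in range(250)]
segments = cut_by_oracle(gold, max_length=100)
segment_lengths = [len(segment) for _, segment in segments]
```

Here the 250 actions become three segments of 100, 100 and 50 actions, and the second segment starts from the state reached after the first 100 gold actions.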
```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/ner.pyx
@ -61,19 +72,14 @@ Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).
<!-- TODO: finish API docs -->
| Name | Type | Description |
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
| `moves` | list | |
| _keyword-only_ | | |
| `update_with_oracle_cut_size` | int | |
| `multitasks` | `Iterable` | |
| `learn_tokens` | bool | |
| `min_action_freq` | int | |
| Name | Type | Description |
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. |
| _keyword-only_ | | |
| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. |
## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

@ -28,10 +28,10 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("tagger", config=config)
> ```
| Setting | Type | Description | Default |
| ---------------- | ------------------------------------------ | -------------------------------------- | ----------------------------------- |
| `set_morphology` | bool | Whether to set morphological features. | `False` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [Tagger](/api/architectures#Tagger) |
| Setting | Type | Description | Default |
| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------- |
| `set_morphology` | bool | Whether to set morphological features. | `False` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). | [Tagger](/api/architectures#Tagger) |
```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tagger.pyx
@ -58,13 +58,13 @@ Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).
| Name | Type | Description |
| ---------------- | ------- | ------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
| _keyword-only_ | | |
| `set_morphology` | bool | Whether to set morphological features. |
| Name | Type | Description |
| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). |
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
| _keyword-only_ | | |
| `set_morphology` | bool | Whether to set morphological features. |
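The output requirement on `model` can be checked with a small helper. This is a sketch in plain Python using nested lists instead of arrays, and `softmax_rows` and `is_probability_matrix` are hypothetical names rather than part of any library.

```python
import math


def softmax_rows(rows):
    """Normalize each row of raw scores into probabilities."""
    result = []
    for row in rows:
        highest = max(row)
        exps = [math.exp(x - highest) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def is_probability_matrix(rows, n_tags, tol=1e-6):
    """True if every row has `n_tags` entries in [0, 1] that sum to 1."""
    return all(
        len(row) == n_tags
        and all(0.0 <= p <= 1.0 for p in row)
        and abs(sum(row) - 1.0) <= tol
        for row in rows
    )


raw = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]  # one row of scores per token
probs = softmax_rows(raw)
```

The raw scores fail the check, while the softmax-normalized rows pass it for a tag set of size 3.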
## Tagger.\_\_call\_\_ {#call tag="method"}

@ -9,6 +9,12 @@ api_string_name: textcat
api_trainable: true
---
The text categorizer predicts **categories over a whole document**. It can learn
one or more labels, and the labels can be mutually exclusive (i.e. one true
label per document) or non-mutually exclusive (i.e. zero or more labels may be
true per document). The multi-label setting is controlled by the model instance
that's provided.
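The difference between the two settings shows up in how raw scores are normalized. The sketch below assumes a softmax output for mutually exclusive labels and an independent sigmoid per label otherwise. This is a typical design, but the text above only states that the model instance controls the setting, so treat the function names and the choice of output layers as assumptions.

```python
import math


def exclusive_scores(logits):
    """Softmax: scores compete, so exactly one label is expected to be true."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multilabel_scores(logits):
    """Sigmoid per label: each label is scored independently of the others."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


logits = [2.0, 1.0, -1.0]  # hypothetical raw scores for three categories
exclusive = exclusive_scores(logits)
multilabel = multilabel_scores(logits)
```

The exclusive scores sum to 1, while the multi-label scores are independent values between 0 and 1, so zero or several labels can score highly for the same document.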
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes
@ -29,10 +35,10 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("textcat", config=config)
> ```
| Setting | Type | Description | Default |
| -------- | ------------------------------------------ | ------------------ | ----------------------------------------------------- |
| `labels` | `Iterable[str]` | The labels to use. | `[]` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
| Setting | Type | Description | Default |
| -------- | ------------------------------------------ | --------------------------------------------------------------------------------------- | ----------------------------------------------------- |
| `labels` | `List[str]` | A list of categories to learn. If empty, the model infers the categories from the data. | `[]` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts scores for each category. | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/textcat.py
@ -67,23 +73,6 @@ shortcut for this and instantiate the component using its string name and
| _keyword-only_ | | |
| `labels` | `Iterable[str]` | The labels to use. |
<!-- TODO move to config page
### Architectures {#architectures new="2.1"}
Text classification models can be used to solve a wide variety of problems.
Differences in text length, number of labels, difficulty, and runtime
performance constraints mean that no single algorithm performs well on all types
of problems. To handle a wider variety of problems, the `TextCategorizer` object
allows configuration of its model architecture, using the `architecture` keyword
argument.
| Name | Description |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `"ensemble"` | **Default:** Stacked ensemble of a bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. The "ngram_size" and "attr" arguments can be used to configure the feature extraction for the bag-of-words model. |
| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster. |
| `"bow"` | An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. The features extracted can be controlled using the keyword arguments `ngram_size` and `attr`. For instance, `ngram_size=3` and `attr="lower"` would give lower-cased unigram, trigram and bigram features. 2, 3 or 4 are usually good choices of ngram size. |
-->
## TextCategorizer.\_\_call\_\_ {#call tag="method"}
Apply the pipe to one document. The document is modified in place, and returned.

@ -8,7 +8,20 @@ api_string_name: tok2vec
api_trainable: true
---
<!-- TODO: intro describing component -->
Apply a "token-to-vector" model and set its outputs in the `doc.tensor` attribute.
This is mostly useful to **share a single subnetwork** between multiple
components, e.g. to have one embedding and CNN network shared between a
[`DependencyParser`](/api/dependencyparser), [`Tagger`](/api/tagger) and
[`EntityRecognizer`](/api/entityrecognizer).
In order to use the `Tok2Vec` predictions, subsequent components should use the
[Tok2VecListener](/api/architectures#Tok2VecListener) layer as the tok2vec
subnetwork of their model. This layer will read data from the `doc.tensor`
attribute during prediction. During training, the `Tok2Vec` component will save
its prediction and backprop callback for each batch, so that the subsequent
components can backpropagate to the shared weights. This implementation is used
because it allows us to avoid relying on object identity within the models to
achieve the parameter sharing.
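The sharing mechanism can be pictured with a toy example. The classes below are hypothetical stand-ins, not the library's implementation: the upstream object computes one set of vectors per batch and hands each listener the same outputs together with a backprop callback, so downstream components never need to hold a reference to the shared model itself.

```python
from typing import Callable, List, Tuple

Vectors = List[List[List[float]]]  # docs -> tokens -> vector


class ToyListener:
    """Hypothetical proxy used by a downstream component."""

    def __init__(self, width: int):
        self.width = width
        self._outputs = None
        self._backprop = None

    def receive(self, outputs: Vectors, backprop: Callable[[Vectors], None]) -> None:
        self._outputs = outputs
        self._backprop = backprop

    def forward(self) -> Tuple[Vectors, Callable[[Vectors], None]]:
        return self._outputs, self._backprop


class ToyTok2Vec:
    """Hypothetical stand-in for the shared upstream component."""

    def __init__(self, width: int):
        self.width = width
        self.listeners: List[ToyListener] = []
        self.received_gradients: List[Vectors] = []

    def add_listener(self, listener: ToyListener) -> None:
        if listener.width != self.width:
            raise ValueError("listener width must match the upstream width")
        self.listeners.append(listener)

    def forward(self, docs: List[List[str]]) -> Vectors:
        # Fake "embedding": one constant vector of length `width` per token.
        outputs = [[[float(len(tok))] * self.width for tok in doc] for doc in docs]

        def backprop(d_outputs: Vectors) -> None:
            # A real component would update the shared weights here.
            self.received_gradients.append(d_outputs)

        # Save the prediction and callback for this batch on every listener.
        for listener in self.listeners:
            listener.receive(outputs, backprop)
        return outputs


tok2vec = ToyTok2Vec(width=4)
tagger_listener = ToyListener(width=4)
parser_listener = ToyListener(width=4)
tok2vec.add_listener(tagger_listener)
tok2vec.add_listener(parser_listener)
tok2vec.forward([["hello", "world"]])

# Each downstream component reads the shared vectors and sends gradients back.
for listener in (tagger_listener, parser_listener):
    vectors, backprop = listener.forward()
    d_vectors = [[[0.1] * listener.width for _ in doc] for doc in vectors]
    backprop(d_vectors)
```

Both listeners see the same output object, and the upstream component receives one gradient per downstream component, which mirrors the "save its prediction and backprop callback for each batch" behavior described above.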
## Config and implementation {#config}
@ -27,9 +40,9 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("tok2vec", config=config)
> ```
| Setting | Type | Description | Default |
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------- |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |
| Setting | Type | Description | Default |
| ------- | ------------------------------------------ | ----------------------------------------------------------------------- | ----------------------------------------------- |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |
```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tok2vec.py
@ -64,9 +77,11 @@ shortcut for this and instantiate the component using its string name and
## Tok2Vec.\_\_call\_\_ {#call tag="method"}
Apply the pipe to one document. The document is modified in place, and returned.
This usually happens under the hood when the `nlp` object is called on a text
and all pipeline components are applied to the `Doc` in order. Both
Apply the pipe to one document and add context-sensitive embeddings to the
`Doc.tensor` attribute, allowing them to be used as features by downstream
components. The document is modified in place, and returned. This usually
happens under the hood when the `nlp` object is called on a text and all
pipeline components are applied to the `Doc` in order. Both
[`__call__`](/api/tok2vec#call) and [`pipe`](/api/tok2vec#pipe) delegate to the
[`predict`](/api/tok2vec#predict) and
[`set_annotations`](/api/tok2vec#set_annotations) methods.