mirror of https://github.com/explosion/spaCy.git (synced 2024-12-24 17:06:29 +03:00)

Update docstrings and docs

parent 8d2baa153d
commit a15c5fb191
|
@@ -71,6 +71,11 @@ def make_parser(
|
|||
actions are decreased. Note that more than one action may be optimal for
|
||||
a given state.
|
||||
|
||||
model (Model): The model for the transition-based parser. The model needs
|
||||
to have a specific substructure of named components --- see the
|
||||
spacy.ml.tb_framework.TransitionModel for details.
|
||||
moves (List[str]): A list of transition names. Inferred from the data if not
|
||||
provided.
|
||||
update_with_oracle_cut_size (int):
|
||||
During training, cut long sequences into shorter segments by creating
|
||||
intermediate states based on the gold-standard history. The model is
|
||||
|
|
|
@@ -70,6 +70,47 @@ blog post for background.
|
|||
| `embed` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. |
|
||||
| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |
|
||||
|
||||
### spacy.Tok2VecListener.v1 {#Tok2VecListener}
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [components.tok2vec]
|
||||
> factory = "tok2vec"
|
||||
>
|
||||
> [components.tok2vec.model]
|
||||
> @architectures = "spacy.HashEmbedCNN.v1"
|
||||
> width = 342
|
||||
>
|
||||
> [components.tagger]
|
||||
> factory = "tagger"
|
||||
>
|
||||
> [components.tagger.model]
|
||||
> @architectures = "spacy.Tagger.v1"
|
||||
>
|
||||
> [components.tagger.model.tok2vec]
|
||||
> @architectures = "spacy.Tok2VecListener.v1"
|
||||
> width = ${components.tok2vec.model:width}
|
||||
> ```
|
||||
|
||||
A listener is used as a sublayer within a component such as a
|
||||
[`DependencyParser`](/api/dependencyparser),
|
||||
[`EntityRecognizer`](/api/entityrecognizer) or
|
||||
[`TextCategorizer`](/api/textcategorizer). Usually you'll have multiple
|
||||
listeners connecting to a single upstream [`Tok2Vec`](/api/tok2vec) component
|
||||
that's earlier in the pipeline. The listener layers act as **proxies**, passing
|
||||
the predictions from the `Tok2Vec` component into downstream components, and
|
||||
communicating gradients back upstream.
|
||||
|
||||
Instead of defining its own `Tok2Vec` instance, a model architecture like
|
||||
[Tagger](/api/architectures#tagger) can define a listener as its `tok2vec`
|
||||
argument that connects to the shared `tok2vec` component in the pipeline.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `width` | int | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. |
|
||||
| `upstream` | str | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. |
|
||||
|
||||
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}
|
||||
|
||||
<!-- TODO: check example config -->
|
||||
|
|
|
@@ -8,6 +8,23 @@ api_string_name: parser
|
|||
api_trainable: true
|
||||
---
|
||||
|
||||
A transition-based dependency parser component. The dependency parser jointly
|
||||
learns sentence segmentation and labelled dependency parsing, and can optionally
|
||||
learn to merge tokens that had been over-segmented by the tokenizer. The parser
|
||||
uses a variant of the **non-monotonic arc-eager transition-system** described by
|
||||
[Honnibal and Johnson (2015)](https://www.aclweb.org/anthology/D15-1162/), with
|
||||
the addition of a "break" transition to perform the sentence segmentation.
|
||||
[Nivre and Nilsson (2005)](https://www.aclweb.org/anthology/P05-1013/)'s **pseudo-projective
|
||||
dependency transformation** is used to allow the parser to predict
|
||||
non-projective parses.
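To make "non-projective" concrete, here is a minimal sketch that checks whether a parse contains crossing arcs. It is not spaCy's implementation: the function name `has_crossing_arcs` is hypothetical, and the sketch assumes a parse is given as a list `heads` in which `heads[i]` is the index of the head of token `i` and the root is its own head.

```python
from typing import List, Tuple


def has_crossing_arcs(heads: List[int]) -> bool:
    """Return True if any two dependency arcs cross (a non-projective parse).

    heads[i] is the index of the head of token i; the root is its own head.
    """
    # Represent each arc as a (left, right) span, skipping the root self-loop.
    arcs: List[Tuple[int, int]] = [
        (min(child, head), max(child, head))
        for child, head in enumerate(heads)
        if child != head
    ]
    for left1, right1 in arcs:
        for left2, right2 in arcs:
            # Two arcs cross if the second starts strictly inside the first
            # and ends strictly outside it.
            if left1 < left2 < right1 < right2:
                return True
    return False


projective = [1, 1, 1]         # tokens 0 and 2 both attach to token 1
non_projective = [2, 3, 2, 2]  # arc 0-2 crosses arc 1-3
print(has_crossing_arcs(projective), has_crossing_arcs(non_projective))
```

An arc-eager transition system on its own can only build parses for which this check returns `False`, which is why a transformation is needed to recover the crossing arcs.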
|
||||
|
||||
The parser is trained using an **imitation learning objective**. It follows the
|
||||
actions predicted by the current weights, and at each state, determines which
|
||||
actions are compatible with the optimal parse that could be reached from the
|
||||
current state. The weights are updated such that the scores assigned to the set of optimal
actions are increased, while scores assigned to other actions are decreased. Note
|
||||
that more than one action may be optimal for a given state.
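The update described above can be sketched for a single state as follows. This is a simplified illustration under stated assumptions, not the parser's actual code: the names `softmax` and `oracle_gradient` are hypothetical, the scores are plain Python floats, and the target distribution is assumed to be the model's own probabilities renormalized over the optimal actions.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw action scores into probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def oracle_gradient(scores: List[float], is_optimal: List[bool]) -> List[float]:
    """Gradient of the loss with respect to the action scores for one state."""
    if not any(is_optimal):
        raise ValueError("At least one action must be optimal")
    probs = softmax(scores)
    # Target: the model's probabilities renormalized over the optimal actions
    # only, so several optimal actions can share the probability mass.
    optimal_mass = sum(p for p, ok in zip(probs, is_optimal) if ok)
    target = [p / optimal_mass if ok else 0.0 for p, ok in zip(probs, is_optimal)]
    return [p - t for p, t in zip(probs, target)]


gradient = oracle_gradient([1.0, 2.0, 0.5], [True, True, False])
print(gradient)
```

With gradient descent, a negative entry raises the corresponding score and a positive entry lowers it, so both optimal actions are pushed up and the remaining action is pushed down.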
|
||||
|
||||
## Config and implementation {#config}
|
||||
|
||||
The default config is defined by the pipeline component factory and describes
|
||||
|
@@ -23,18 +40,21 @@ architectures and their arguments and hyperparameters.
|
|||
> from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
|
||||
> config = {
|
||||
> "moves": None,
|
||||
> # TODO: rest
|
||||
> "update_with_oracle_cut_size": 100,
|
||||
> "learn_tokens": False,
|
||||
> "min_action_freq": 30,
|
||||
> "model": DEFAULT_PARSER_MODEL,
|
||||
> }
|
||||
> nlp.add_pipe("parser", config=config)
|
||||
> ```
|
||||
|
||||
<!-- TODO: finish API docs -->
|
||||
|
||||
| Setting | Type | Description | Default |
|
||||
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
|
||||
| `moves` | list | | `None` |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
|
||||
| Setting | Type | Description | Default |
|
||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
|
||||
| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. | `None` |
|
||||
| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100` |
|
||||
| `learn_tokens` | bool | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. | `False` |
|
||||
| `min_action_freq` | int | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. | `30` |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
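As a rough illustration of what `update_with_oracle_cut_size` controls, the sketch below splits a gold-standard action sequence into segments of bounded length. The helper `cut_oracle_sequence` and the action names are hypothetical, and the real component works on parser states rather than plain lists; the sketch only shows the segmentation idea.

```python
from typing import List, Tuple


def cut_oracle_sequence(
    gold_actions: List[str], cut_size: int
) -> List[Tuple[int, List[str]]]:
    """Split a gold action sequence into segments of at most `cut_size` actions.

    Each segment is paired with the number of gold actions that must be
    replayed to reach the intermediate state the segment starts from.
    """
    if cut_size < 1:
        raise ValueError("cut_size must be a positive integer")
    return [
        (start, gold_actions[start : start + cut_size])
        for start in range(0, len(gold_actions), cut_size)
    ]


# Seven gold actions cut into segments of at most three actions each.
for n_replayed, segment in cut_oracle_sequence(list("ABCDEFG"), cut_size=3):
    print(n_replayed, segment)
```

Shorter segments limit how far the model can drift from the gold-standard history within one training example, at the cost of more intermediate states to create.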
|
||||
|
||||
```python
|
||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/dep_parser.pyx
|
||||
|
@@ -61,19 +81,16 @@ Create a new pipeline instance. In your application, you would normally use a
|
|||
shortcut for this and instantiate the component using its string name and
|
||||
[`nlp.add_pipe`](/api/language#add_pipe).
|
||||
|
||||
<!-- TODO: finish API docs -->
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
|
||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
||||
| `moves` | list | |
|
||||
| _keyword-only_ | | |
|
||||
| `update_with_oracle_cut_size` | int | |
|
||||
| `multitasks` | `Iterable` | |
|
||||
| `learn_tokens` | bool | |
|
||||
| `min_action_freq` | int | |
|
||||
| Name | Type | Description |
|
||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
|
||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
||||
| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. |
|
||||
| _keyword-only_ | | |
|
||||
| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. |
|
||||
| `learn_tokens` | bool | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. |
|
||||
| `min_action_freq` | int | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. |
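The back-off behaviour of `min_action_freq` can be pictured with a small sketch. The function `back_off_rare_labels` is hypothetical, and the `"MOVE-label"` naming of actions (for example `"L-nsubj"`) is an assumption made for illustration only.

```python
from collections import Counter
from typing import List


def back_off_rare_labels(
    actions: List[str], min_action_freq: int, backoff: str = "dep"
) -> List[str]:
    """Replace the label of rare labelled actions with a generic label.

    Actions are assumed to look like "MOVE-label"; actions without a "-"
    are treated as unlabelled and left untouched.
    """
    counts = Counter(actions)
    result = []
    for action in actions:
        move, separator, _label = action.partition("-")
        if separator and counts[action] < min_action_freq:
            result.append(f"{move}-{backoff}")
        else:
            result.append(action)
    return result


actions = ["L-nsubj"] * 3 + ["R-dobj"] * 3 + ["L-oprd", "S"]
print(back_off_rare_labels(actions, min_action_freq=2))
```

In this toy data only `"L-oprd"` occurs fewer than two times, so it is the only action whose label is replaced by `"dep"`.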
|
||||
|
||||
## DependencyParser.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
|
|
@@ -8,6 +8,18 @@ api_string_name: ner
|
|||
api_trainable: true
|
||||
---
|
||||
|
||||
A transition-based named entity recognition component. The entity recognizer
|
||||
identifies **non-overlapping labelled spans** of tokens. The transition-based
|
||||
algorithm used encodes certain assumptions that are effective for "traditional"
|
||||
named entity recognition tasks, but may not be a good fit for every span
|
||||
identification problem. Specifically, the loss function optimizes for **whole
|
||||
entity accuracy**, so if your inter-annotator agreement on boundary tokens is
|
||||
low, the component will likely perform poorly on your problem. The
|
||||
transition-based algorithm also assumes that the most decisive information about
|
||||
your entities will be close to their initial tokens. If your entities are long
|
||||
and characterized by tokens in their middle, the component will likely not be a
|
||||
good fit for your task.
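To see why low agreement on boundary tokens hurts, it helps to look at what "whole entity" means when counting matches. The sketch below is an evaluation-style illustration, not the component's loss function; the name `whole_entity_prf` and the `(start, end, label)` span format are assumptions.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token, label)


def whole_entity_prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F-score where only exact span matches count."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


gold = {(0, 2, "PERSON"), (5, 6, "ORG")}
predicted = {(0, 2, "PERSON"), (4, 6, "ORG")}  # second span is off by one token
print(whole_entity_prf(gold, predicted))
```

The second prediction has the right label and overlaps the gold span, but it counts as entirely wrong because one boundary differs.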
|
||||
|
||||
## Config and implementation {#config}
|
||||
|
||||
The default config is defined by the pipeline component factory and describes
|
||||
|
@@ -23,18 +35,17 @@ architectures and their arguments and hyperparameters.
|
|||
> from spacy.pipeline.ner import DEFAULT_NER_MODEL
|
||||
> config = {
|
||||
> "moves": None,
|
||||
> # TODO: rest
|
||||
> "update_with_oracle_cut_size": 100,
|
||||
> "model": DEFAULT_NER_MODEL,
|
||||
> }
|
||||
> nlp.add_pipe("ner", config=config)
|
||||
> ```
|
||||
|
||||
<!-- TODO: finish API docs -->
|
||||
|
||||
| Setting | Type | Description | Default |
|
||||
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
|
||||
| `moves` | list | | `None` |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
|
||||
| Setting | Type | Description | Default |
|
||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |
|
||||
| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. | `None` |
|
||||
| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100` |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
|
||||
|
||||
```python
|
||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/ner.pyx
|
||||
|
@@ -61,19 +72,14 @@ Create a new pipeline instance. In your application, you would normally use a
|
|||
shortcut for this and instantiate the component using its string name and
|
||||
[`nlp.add_pipe`](/api/language#add_pipe).
|
||||
|
||||
<!-- TODO: finish API docs -->
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
|
||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
||||
| `moves` | list | |
|
||||
| _keyword-only_ | | |
|
||||
| `update_with_oracle_cut_size` | int | |
|
||||
| `multitasks` | `Iterable` | |
|
||||
| `learn_tokens` | bool | |
|
||||
| `min_action_freq` | int | |
|
||||
| Name | Type | Description |
|
||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
|
||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
||||
| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. |
|
||||
| _keyword-only_ | | |
|
||||
| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. |
|
||||
|
||||
## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
|
|
@@ -28,10 +28,10 @@ architectures and their arguments and hyperparameters.
|
|||
> nlp.add_pipe("tagger", config=config)
|
||||
> ```
|
||||
|
||||
| Setting | Type | Description | Default |
|
||||
| ---------------- | ------------------------------------------ | -------------------------------------- | ----------------------------------- |
|
||||
| `set_morphology` | bool | Whether to set morphological features. | `False` |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [Tagger](/api/architectures#Tagger) |
|
||||
| Setting | Type | Description | Default |
|
||||
| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------- |
|
||||
| `set_morphology` | bool | Whether to set morphological features. | `False` |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). | [Tagger](/api/architectures#Tagger) |
|
||||
|
||||
```python
|
||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tagger.pyx
|
||||
|
@@ -58,13 +58,13 @@ Create a new pipeline instance. In your application, you would normally use a
|
|||
shortcut for this and instantiate the component using its string name and
|
||||
[`nlp.add_pipe`](/api/language#add_pipe).
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ------- | ------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
||||
| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
|
||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
||||
| _keyword-only_ | | |
|
||||
| `set_morphology` | bool | Whether to set morphological features. |
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). |
|
||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
||||
| _keyword-only_ | | |
|
||||
| `set_morphology` | bool | Whether to set morphological features. |
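The output contract for the tagger `model` can be illustrated in plain Python. This is only a sketch of the expected shape and normalization, with one row per token and one column per tag; the function name `softmax_rows` is hypothetical and nothing here is taken from the tagger's source.

```python
import math
from typing import List


def softmax_rows(scores: List[List[float]]) -> List[List[float]]:
    """Normalize each row of raw tag scores into probabilities."""
    normalized = []
    for row in scores:
        peak = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        normalized.append([value / total for value in exps])
    return normalized


# Two tokens, three tags: every output row has three entries that sum to 1.
tag_probs = softmax_rows([[2.0, 0.5, -1.0], [0.0, 0.0, 3.0]])
for row in tag_probs:
    print(row, sum(row))
```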
|
||||
|
||||
## Tagger.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
|
|
@@ -9,6 +9,12 @@ api_string_name: textcat
|
|||
api_trainable: true
|
||||
---
|
||||
|
||||
The text categorizer predicts **categories over a whole document**. It can learn
|
||||
one or more labels, and the labels can be mutually exclusive (i.e. one true
|
||||
label per document) or non-mutually exclusive (i.e. zero or more labels may be
|
||||
true per document). The multi-label setting is controlled by the model instance
|
||||
that's provided.
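The difference between the two settings shows up in how raw scores are normalized. The sketch below assumes the common design in which mutually exclusive labels use a softmax and non-mutually exclusive labels use an independent logistic function per label; the function names and example labels are hypothetical.

```python
import math
from typing import Dict


def score_exclusive(logits: Dict[str, float]) -> Dict[str, float]:
    """Mutually exclusive labels: softmax, so the scores sum to 1."""
    peak = max(logits.values())
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def score_multilabel(logits: Dict[str, float]) -> Dict[str, float]:
    """Non-mutually exclusive labels: each label is scored independently."""
    return {label: 1.0 / (1.0 + math.exp(-value)) for label, value in logits.items()}


logits = {"SPORTS": 2.0, "POLITICS": 1.5}
print(score_exclusive(logits))   # scores compete with each other
print(score_multilabel(logits))  # both labels can be likely at once
```

In the exclusive setting, raising one score necessarily lowers the others, whereas in the multi-label setting every label may score close to 1 or close to 0.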
|
||||
|
||||
## Config and implementation {#config}
|
||||
|
||||
The default config is defined by the pipeline component factory and describes
|
||||
|
@@ -29,10 +35,10 @@ architectures and their arguments and hyperparameters.
|
|||
> nlp.add_pipe("textcat", config=config)
|
||||
> ```
|
||||
|
||||
| Setting | Type | Description | Default |
|
||||
| -------- | ------------------------------------------ | ------------------ | ----------------------------------------------------- |
|
||||
| `labels` | `Iterable[str]` | The labels to use. | `[]` |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
|
||||
| Setting | Type | Description | Default |
|
||||
| -------- | ------------------------------------------ | --------------------------------------------------------------------------------------- | ----------------------------------------------------- |
|
||||
| `labels` | `List[str]` | A list of categories to learn. If empty, the model infers the categories from the data. | `[]` |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts scores for each category. | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
|
||||
|
||||
```python
|
||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/textcat.py
|
||||
|
@@ -67,23 +73,6 @@ shortcut for this and instantiate the component using its string name and
|
|||
| _keyword-only_ | | |
|
||||
| `labels` | `Iterable[str]` | The labels to use. |
|
||||
|
||||
<!-- TODO move to config page
|
||||
### Architectures {#architectures new="2.1"}
|
||||
|
||||
Text classification models can be used to solve a wide variety of problems.
|
||||
Differences in text length, number of labels, difficulty, and runtime
|
||||
performance constraints mean that no single algorithm performs well on all types
|
||||
of problems. To handle a wider variety of problems, the `TextCategorizer` object
|
||||
allows configuration of its model architecture, using the `architecture` keyword
|
||||
argument.
|
||||
|
||||
| Name | Description |
|
||||
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `"ensemble"` | **Default:** Stacked ensemble of a bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. The "ngram_size" and "attr" arguments can be used to configure the feature extraction for the bag-of-words model. |
|
||||
| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster. |
|
||||
| `"bow"` | An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. The features extracted can be controlled using the keyword arguments `ngram_size` and `attr`. For instance, `ngram_size=3` and `attr="lower"` would give lower-cased unigram, trigram and bigram features. 2, 3 or 4 are usually good choices of ngram size. |
|
||||
-->
|
||||
|
||||
## TextCategorizer.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
Apply the pipe to one document. The document is modified in place, and returned.
|
||||
|
|
|
@@ -8,7 +8,20 @@ api_string_name: tok2vec
|
|||
api_trainable: true
|
||||
---
|
||||
|
||||
<!-- TODO: intro describing component -->
|
||||
Apply a "token-to-vector" model and set its outputs in the `Doc.tensor` attribute.
|
||||
This is mostly useful to **share a single subnetwork** between multiple
|
||||
components, e.g. to have one embedding and CNN network shared between a
|
||||
[`DependencyParser`](/api/dependencyparser), [`Tagger`](/api/tagger) and
|
||||
[`EntityRecognizer`](/api/entityrecognizer).
|
||||
|
||||
In order to use the `Tok2Vec` predictions, subsequent components should use the
|
||||
[Tok2VecListener](/api/architectures#Tok2VecListener) layer as the tok2vec
|
||||
subnetwork of their model. This layer will read data from the `doc.tensor`
|
||||
attribute during prediction. During training, the `Tok2Vec` component will save
|
||||
its prediction and backprop callback for each batch, so that the subsequent
|
||||
components can backpropagate to the shared weights. This implementation is used
|
||||
because it allows us to avoid relying on object identity within the models to
|
||||
achieve the parameter sharing.
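The training-time hand-off described above can be sketched with two toy classes. Everything here is hypothetical (`ToyTok2Vec`, `ToyListener`, their methods and the fake vectors) and only mirrors the idea: the shared component stores its output and a backprop callback, and each listener passes both on to its downstream component.

```python
from typing import Callable, List, Optional, Tuple

Vectors = List[List[float]]
Backprop = Callable[[Vectors], None]


class ToyListener:
    """Proxy used by a downstream component instead of its own tok2vec."""

    def __init__(self, upstream: str = "*") -> None:
        self.upstream = upstream
        self._outputs: Optional[Vectors] = None
        self._backprop: Optional[Backprop] = None

    def receive(self, outputs: Vectors, backprop: Backprop) -> None:
        # Called by the shared component once per batch.
        self._outputs = outputs
        self._backprop = backprop

    def forward(self) -> Tuple[Vectors, Backprop]:
        if self._outputs is None or self._backprop is None:
            raise RuntimeError("The upstream component has not run yet")
        return self._outputs, self._backprop


class ToyTok2Vec:
    """Shared component that feeds every listener connected to it."""

    def __init__(self, name: str, width: int) -> None:
        self.name = name
        self.width = width
        self.listeners: List[ToyListener] = []
        self.accumulated_gradient: Vectors = []

    def add_listener(self, listener: ToyListener) -> None:
        # The wildcard "*" connects to any upstream component.
        if listener.upstream in ("*", self.name):
            self.listeners.append(listener)

    def update(self, n_tokens: int) -> None:
        # Stand-in for a real forward pass: one fake vector per token.
        outputs = [[0.1 * (i + 1)] * self.width for i in range(n_tokens)]
        gradient = [[0.0] * self.width for _ in range(n_tokens)]
        self.accumulated_gradient = gradient

        def backprop(d_outputs: Vectors) -> None:
            # Gradients from all downstream components are summed here.
            for row, d_row in zip(gradient, d_outputs):
                for j, value in enumerate(d_row):
                    row[j] += value

        for listener in self.listeners:
            listener.receive(outputs, backprop)


tok2vec = ToyTok2Vec("tok2vec", width=2)
tagger_listener = ToyListener()
parser_listener = ToyListener(upstream="tok2vec")
tok2vec.add_listener(tagger_listener)
tok2vec.add_listener(parser_listener)
tok2vec.update(n_tokens=3)
```

Because the listeners are connected by name rather than by holding a reference to the same model object, the sharing does not depend on object identity. At prediction time, the real listener reads from `doc.tensor` instead, which this sketch leaves out.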
|
||||
|
||||
## Config and implementation {#config}
|
||||
|
||||
|
@@ -27,9 +40,9 @@ architectures and their arguments and hyperparameters.
|
|||
> nlp.add_pipe("tok2vec", config=config)
|
||||
> ```
|
||||
|
||||
| Setting | Type | Description | Default |
|
||||
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------- |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |
|
||||
| Setting | Type | Description | Default |
|
||||
| ------- | ------------------------------------------ | ----------------------------------------------------------------------- | ----------------------------------------------- |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |
|
||||
|
||||
```python
|
||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tok2vec.py
|
||||
|
@@ -64,9 +77,11 @@ shortcut for this and instantiate the component using its string name and
|
|||
|
||||
## Tok2Vec.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
Apply the pipe to one document. The document is modified in place, and returned.
|
||||
This usually happens under the hood when the `nlp` object is called on a text
|
||||
and all pipeline components are applied to the `Doc` in order. Both
|
||||
Apply the pipe to one document and add context-sensitive embeddings to the
|
||||
`Doc.tensor` attribute, allowing them to be used as features by downstream
|
||||
components. The document is modified in place, and returned. This usually
|
||||
happens under the hood when the `nlp` object is called on a text and all
|
||||
pipeline components are applied to the `Doc` in order. Both
|
||||
[`__call__`](/api/tok2vec#call) and [`pipe`](/api/tok2vec#pipe) delegate to the
|
||||
[`predict`](/api/tok2vec#predict) and
|
||||
[`set_annotations`](/api/tok2vec#set_annotations) methods.
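The delegation pattern can be sketched with a toy component. The classes `ToyDoc` and `ToyTok2VecPipe` are hypothetical stand-ins, and the "vectors" are just word lengths; the point is only that both `__call__` and `pipe` go through `predict` and `set_annotations`.

```python
from typing import Iterable, Iterator, List

Vectors = List[List[float]]


class ToyDoc:
    """Minimal stand-in for a Doc with a tensor attribute."""

    def __init__(self, words: List[str]) -> None:
        self.words = words
        self.tensor: Vectors = []


class ToyTok2VecPipe:
    def predict(self, docs: List[ToyDoc]) -> List[Vectors]:
        # Stand-in for the model: one 1-dimensional vector per token.
        return [[[float(len(word))] for word in doc.words] for doc in docs]

    def set_annotations(self, docs: List[ToyDoc], tokvecs: List[Vectors]) -> None:
        # Store the predictions on the documents, modifying them in place.
        for doc, vectors in zip(docs, tokvecs):
            doc.tensor = vectors

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        self.set_annotations([doc], self.predict([doc]))
        return doc

    def pipe(self, stream: Iterable[ToyDoc], batch_size: int = 2) -> Iterator[ToyDoc]:
        batch: List[ToyDoc] = []
        for doc in stream:
            batch.append(doc)
            if len(batch) == batch_size:
                self.set_annotations(batch, self.predict(batch))
                yield from batch
                batch = []
        if batch:  # process the final, possibly smaller batch
            self.set_annotations(batch, self.predict(batch))
            yield from batch


component = ToyTok2VecPipe()
doc = component(ToyDoc(["a", "bb"]))
print(doc.tensor)
```

Keeping the prediction and annotation steps separate means the single-document and batched code paths cannot drift apart.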
|
||||
|
|