mirror of https://github.com/explosion/spaCy.git, synced 2025-11-04 09:57:26 +03:00

Update docstrings and docs

parent 8d2baa153d
commit a15c5fb191

@@ -71,6 +71,11 @@ def make_parser(
 | 
			
		|||
    actions are decreased. Note that more than one action may be optimal for
 | 
			
		||||
    a given state.
 | 
			
		||||
 | 
			
		||||
    model (Model): The model for the transition-based parser. The model needs
 | 
			
		||||
        to have a specific substructure of named components --- see the
 | 
			
		||||
        spacy.ml.tb_framework.TransitionModel for details.
 | 
			
		||||
    moves (List[str]): A list of transition names. Inferred from the data if not
 | 
			
		||||
        provided.
 | 
			
		||||
    update_with_oracle_cut_size (int):
 | 
			
		||||
        During training, cut long sequences into shorter segments by creating
 | 
			
		||||
        intermediate states based on the gold-standard history. The model is
 | 
			
		||||
| 
						 | 
				
			
			
 | 
			
		|||
| 
						 | 
				
			
@@ -70,6 +70,47 @@ blog post for background.
 | 
			
		|||
| `embed`  | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations.                                   |
 | 
			
		||||
| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |
 | 
			
		||||
 | 
			
		||||
### spacy.Tok2VecListener.v1 {#Tok2VecListener}
 | 
			
		||||
 | 
			
		||||
> #### Example config
 | 
			
		||||
>
 | 
			
		||||
> ```ini
 | 
			
		||||
> [components.tok2vec]
 | 
			
		||||
> factory = "tok2vec"
 | 
			
		||||
>
 | 
			
		||||
> [components.tok2vec.model]
 | 
			
		||||
> @architectures = "spacy.HashEmbedCNN.v1"
 | 
			
		||||
> width = 342
 | 
			
		||||
>
 | 
			
		||||
> [components.tagger]
 | 
			
		||||
> factory = "tagger"
 | 
			
		||||
>
 | 
			
		||||
> [components.tagger.model]
 | 
			
		||||
> @architectures = "spacy.Tagger.v1"
 | 
			
		||||
>
 | 
			
		||||
> [components.tagger.model.tok2vec]
 | 
			
		||||
> @architectures = "spacy.Tok2VecListener.v1"
 | 
			
		||||
> width = ${components.tok2vec.model:width}
 | 
			
		||||
> ```
 | 
			
		||||
 | 
			
		||||
A listener is used as a sublayer within a component such as a
[`DependencyParser`](/api/dependencyparser),
[`EntityRecognizer`](/api/entityrecognizer) or
[`TextCategorizer`](/api/textcategorizer). Usually you'll have multiple
listeners connecting to a single upstream [`Tok2Vec`](/api/tok2vec) component
that's earlier in the pipeline. The listener layers act as **proxies**, passing
the predictions from the `Tok2Vec` component into downstream components, and
communicating gradients back upstream.

Instead of defining its own `Tok2Vec` instance, a model architecture like
[Tagger](/api/architectures#tagger) can define a listener as its `tok2vec`
argument that connects to the shared `tok2vec` component in the pipeline.
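
The proxy behaviour can be illustrated with a small sketch in plain Python. This is not spaCy's implementation: `Listener`, `UpstreamTok2Vec` and all of their methods are hypothetical names, and the "vectors" are plain lists of floats. It only shows the idea of several listeners receiving the same upstream output and sending gradients back, plus the `"*"` wildcard for the upstream name.

```python
from typing import Callable, List, Optional, Tuple

Vectors = List[List[float]]


class Listener:
    """Hypothetical proxy: holds no weights, only what the upstream sent."""

    def __init__(self, width: int, upstream: str = "*") -> None:
        self.width = width
        self.upstream = upstream
        self._outputs: Optional[Vectors] = None
        self._backprop: Optional[Callable[[Vectors], None]] = None

    def accepts(self, component_name: str) -> bool:
        # "*" matches any upstream component; otherwise names must be equal.
        return self.upstream == "*" or self.upstream == component_name

    def receive(self, outputs: Vectors, backprop: Callable[[Vectors], None]) -> None:
        assert all(len(row) == self.width for row in outputs), "width mismatch"
        self._outputs = outputs
        self._backprop = backprop

    def forward(self) -> Tuple[Vectors, Callable[[Vectors], None]]:
        # Pass the upstream predictions through unchanged, together with the
        # callback that sends gradients back upstream.
        assert self._outputs is not None and self._backprop is not None
        return self._outputs, self._backprop


class UpstreamTok2Vec:
    """Hypothetical shared component that broadcasts to its listeners."""

    def __init__(self, name: str, width: int) -> None:
        self.name = name
        self.width = width
        self.listeners: List[Listener] = []
        self.grad_accumulator: Vectors = []

    def add_listener(self, listener: Listener) -> None:
        if listener.accepts(self.name):
            self.listeners.append(listener)

    def predict(self, n_tokens: int) -> None:
        outputs = [[0.0] * self.width for _ in range(n_tokens)]  # dummy vectors
        self.grad_accumulator = [[0.0] * self.width for _ in range(n_tokens)]
        for listener in self.listeners:
            listener.receive(outputs, self._accumulate)

    def _accumulate(self, d_outputs: Vectors) -> None:
        # Gradients from every listener are summed into one accumulator.
        for acc_row, grad_row in zip(self.grad_accumulator, d_outputs):
            for i, g in enumerate(grad_row):
                acc_row[i] += g


tok2vec = UpstreamTok2Vec("tok2vec", width=4)
tagger_listener = Listener(width=4)  # wildcard upstream
parser_listener = Listener(width=4, upstream="tok2vec")
other_listener = Listener(width=4, upstream="other_tok2vec")  # never connected
for listener in (tagger_listener, parser_listener, other_listener):
    tok2vec.add_listener(listener)

tok2vec.predict(n_tokens=3)
for listener in tok2vec.listeners:
    vectors, backprop = listener.forward()
    backprop([[1.0] * 4 for _ in vectors])
```

In this sketch the two connected listeners each send a gradient of ones, so the upstream accumulator ends up holding their sum.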
 | 
			
		||||
 | 
			
		||||
| Name       | Type | Description                                                                                                                                                                                                                                                                                            |
 | 
			
		||||
| ---------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | 
			
		||||
| `width`    | int  | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component.                                                                                                                                                                                                               |
 | 
			
		||||
| `upstream` | str  | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. |
 | 
			
		||||
 | 
			
		||||
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}
 | 
			
		||||
 | 
			
		||||
<!-- TODO: check example config -->
 | 
			
		||||
| 
						 | 
				
			
			
 | 
			
		|||
| 
						 | 
				
			
@@ -8,6 +8,23 @@ api_string_name: parser
 | 
			
		|||
api_trainable: true
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
A transition-based dependency parser component. The dependency parser jointly
learns sentence segmentation and labelled dependency parsing, and can optionally
learn to merge tokens that had been over-segmented by the tokenizer. The parser
uses a variant of the **non-monotonic arc-eager transition-system** described by
[Honnibal and Johnson (2015)](https://www.aclweb.org/anthology/D15-1162/), with
the addition of a "break" transition to perform the sentence segmentation.
[Nivre and Nilsson (2005)](https://www.aclweb.org/anthology/P05-1013/)'s
**pseudo-projective dependency transformation** is used to allow the parser to
predict non-projective parses.

The parser is trained using an **imitation learning objective**. It follows the
actions predicted by the current weights, and at each state, determines which
actions are compatible with the optimal parse that could be reached from the
current state. The weights are then updated such that the scores assigned to the
set of optimal actions are increased, while scores assigned to other actions are
decreased. Note that more than one action may be optimal for a given state.
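
One way to picture the update is the sketch below. It assumes a specific loss that the text does not name: the negative log of the total softmax probability assigned to the set of optimal actions. The function names and the example scores are hypothetical; spaCy's actual objective may differ in detail.

```python
import math
from typing import List, Set


def softmax(scores: List[float]) -> List[float]:
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def oracle_gradient(scores: List[float], optimal: Set[int]) -> List[float]:
    """Gradient of -log(sum of probabilities of optimal actions) w.r.t. scores."""
    probs = softmax(scores)
    gold_mass = sum(probs[i] for i in optimal)
    grad = []
    for i, p in enumerate(probs):
        # Optimal actions share the target mass in proportion to their
        # current probability; all other actions have a target of zero.
        target = p / gold_mass if i in optimal else 0.0
        grad.append(p - target)
    return grad


scores = [1.0, 0.5, -0.2, 0.1]  # one score per action in the current state
optimal_actions = {0, 2}        # more than one action may be optimal
grad = oracle_gradient(scores, optimal_actions)

learning_rate = 0.1
updated = [s - learning_rate * g for s, g in zip(scores, grad)]
```

Under this assumed loss, a gradient step raises the scores of both optimal actions and lowers the scores of the others, without forcing the model to prefer one optimal action over another.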
 | 
			
		||||
 | 
			
		||||
## Config and implementation {#config}
 | 
			
		||||
 | 
			
		||||
The default config is defined by the pipeline component factory and describes
 | 
			
		||||
| 
						 | 
				
			
@@ -23,17 +40,20 @@ architectures and their arguments and hyperparameters.
 | 
			
		|||
> from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
 | 
			
		||||
> config = {
 | 
			
		||||
>    "moves": None,
 | 
			
		||||
>   # TODO: rest
 | 
			
		||||
>    "update_with_oracle_cut_size": 100,
 | 
			
		||||
>    "learn_tokens": False,
 | 
			
		||||
>    "min_action_freq": 30,
 | 
			
		||||
>    "model": DEFAULT_PARSER_MODEL,
 | 
			
		||||
> }
 | 
			
		||||
> nlp.add_pipe("parser", config=config)
 | 
			
		||||
> ```
 | 
			
		||||
 | 
			
		||||
<!-- TODO: finish API docs -->
 | 
			
		||||
 | 
			
		||||
| Setting                       | Type                                       | Description                                                                                                                                                                                                                                                                                 | Default                                                           |
 | 
			
		||||
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
 | 
			
		||||
| `moves` | list                                       |                   | `None`                                                            |
 | 
			
		||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
 | 
			
		||||
| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                                                                         | `None`                                                            |
 | 
			
		||||
| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it.                                                                    | `100`                                                             |
 | 
			
		||||
| `learn_tokens`                | bool                                       | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental.                                                                                                                                                                                             | `False`                                                           |
 | 
			
		||||
| `min_action_freq`             | int                                        | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. | `30`                                                              |
 | 
			
		||||
| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                                                                                                                                           | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
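
The effect of `min_action_freq` can be sketched in a few lines. This is an illustration only: `back_off_rare_labels` is a hypothetical helper, it counts labels rather than full labelled actions, and the decorated label `"nmod||dobj"` is a made-up example of a label carrying pseudo-projective information.

```python
from collections import Counter
from typing import List


def back_off_rare_labels(
    labels: List[str], min_freq: int, fallback: str = "dep"
) -> List[str]:
    """Replace every label seen fewer than min_freq times with the fallback."""
    counts = Counter(labels)
    return [label if counts[label] >= min_freq else fallback for label in labels]


gold_labels = ["nsubj"] * 40 + ["dobj"] * 35 + ["nmod||dobj"] * 3
backed_off = back_off_rare_labels(gold_labels, min_freq=30)
label_counts = Counter(backed_off)
```

Because the rare decorated label is collapsed into `"dep"`, the information it carried about the non-projective attachment is lost, which is why the setting can affect structure as well as label accuracy.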
 | 
			
		||||
 | 
			
		||||
```python
 | 
			
		||||
| 
						 | 
				
			
@@ -61,19 +81,16 @@ Create a new pipeline instance. In your application, you would normally use a
 | 
			
		|||
shortcut for this and instantiate the component using its string name and
 | 
			
		||||
[`nlp.add_pipe`](/api/language#add_pipe).
 | 
			
		||||
 | 
			
		||||
<!-- TODO: finish API docs -->
 | 
			
		||||
 | 
			
		||||
| Name                          | Type                                       | Description                                                                                                                                                                                                                                                                                 |
 | 
			
		||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
 | 
			
		||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | 
			
		||||
| `vocab`                       | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                                                                      |
 | 
			
		||||
| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                                                                             |
 | 
			
		||||
| `name`                        | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                                                                 |
 | 
			
		||||
| `moves`                       | list                                       |                                                                                             |
 | 
			
		||||
| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                                                                         |
 | 
			
		||||
| _keyword-only_                |                                            |                                                                                                                                                                                                                                                                                             |
 | 
			
		||||
| `update_with_oracle_cut_size` | int                                        |                                                                                             |
 | 
			
		||||
| `multitasks`                  | `Iterable`                                 |                                                                                             |
 | 
			
		||||
| `learn_tokens`                | bool                                       |                                                                                             |
 | 
			
		||||
| `min_action_freq`             | int                                        |                                                                                             |
 | 
			
		||||
| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default.                                           |
 | 
			
		||||
| `learn_tokens`                | bool                                       | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental.                                                                                                                                                                                             |
 | 
			
		||||
| `min_action_freq`             | int                                        | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. |
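
As a rough picture of what `update_with_oracle_cut_size` does, the sketch below splits one gold-standard action history into segments of at most `cut_size` actions. `cut_gold_history` is a hypothetical helper and the action names are placeholders; in a real parser, the start state of each segment would be reached by replaying the gold actions up to the segment's offset.

```python
from typing import List, Tuple


def cut_gold_history(actions: List[str], cut_size: int) -> List[Tuple[int, List[str]]]:
    """Split one gold action sequence into (start_offset, segment) pairs."""
    if cut_size < 1:
        raise ValueError("cut_size must be positive")
    return [
        (start, actions[start : start + cut_size])
        for start in range(0, len(actions), cut_size)
    ]


history = [f"action_{i}" for i in range(250)]  # placeholder gold history
segments = cut_gold_history(history, cut_size=100)
segment_lengths = [len(segment) for _, segment in segments]
segment_starts = [start for start, _ in segments]
```

Each segment can then be trained on as if it were a short sequence of its own, which keeps the work per update bounded for very long documents.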
 | 
			
		||||
 | 
			
		||||
## DependencyParser.\_\_call\_\_ {#call tag="method"}
 | 
			
		||||
 | 
			
		||||
| 
						 | 
				
			
			
 | 
			
		|||
| 
						 | 
				
			
@@ -8,6 +8,18 @@ api_string_name: ner
 | 
			
		|||
api_trainable: true
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
A transition-based named entity recognition component. The entity recognizer
identifies **non-overlapping labelled spans** of tokens. The transition-based
algorithm used encodes certain assumptions that are effective for "traditional"
named entity recognition tasks, but may not be a good fit for every span
identification problem. Specifically, the loss function optimizes for **whole
entity accuracy**, so if your inter-annotator agreement on boundary tokens is
low, the component will likely perform poorly on your problem. The
transition-based algorithm also assumes that the most decisive information about
your entities will be close to their initial tokens. If your entities are long
and characterized by tokens in their middle, the component will likely not be a
good fit for your task.
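
To make "non-overlapping" and "whole entity accuracy" concrete, here is a small sketch that checks a set of spans for overlap and scores predictions by exact span match. It illustrates an evaluation under that criterion, not the component's loss function; `has_overlap`, `whole_entity_prf` and the example spans are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token (exclusive), label)


def has_overlap(spans: List[Span]) -> bool:
    """True if any two spans share a token."""
    ordered = sorted(spans)
    # After sorting by start, any overlap shows up between neighbours.
    return any(prev[1] > cur[0] for prev, cur in zip(ordered, ordered[1:]))


def whole_entity_prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F-score where only exact span matches count."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_score = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f_score


gold = {(0, 2, "PERSON"), (5, 8, "ORG")}
predicted = {(0, 2, "PERSON"), (5, 7, "ORG")}  # second span is one token short
precision, recall, f_score = whole_entity_prf(gold, predicted)
```

The second prediction misses a single boundary token and earns no credit at all, which is why low annotator agreement on boundaries hurts under this criterion.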
 | 
			
		||||
 | 
			
		||||
## Config and implementation {#config}
 | 
			
		||||
 | 
			
		||||
The default config is defined by the pipeline component factory and describes
 | 
			
		||||
| 
						 | 
				
			
@@ -23,17 +35,16 @@ architectures and their arguments and hyperparameters.
 | 
			
		|||
> from spacy.pipeline.ner import DEFAULT_NER_MODEL
 | 
			
		||||
> config = {
 | 
			
		||||
>    "moves": None,
 | 
			
		||||
>   # TODO: rest
 | 
			
		||||
>    "update_with_oracle_cut_size": 100,
 | 
			
		||||
>    "model": DEFAULT_NER_MODEL,
 | 
			
		||||
> }
 | 
			
		||||
> nlp.add_pipe("ner", config=config)
 | 
			
		||||
> ```
 | 
			
		||||
 | 
			
		||||
<!-- TODO: finish API docs -->
 | 
			
		||||
 | 
			
		||||
| Setting                       | Type                                       | Description                                                                                                                                                                                                              | Default                                                           |
 | 
			
		||||
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
 | 
			
		||||
| `moves` | list                                       |                   | `None`                                                            |
 | 
			
		||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |
 | 
			
		||||
| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                      |
 | 
			
		||||
| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100`                                                             |
 | 
			
		||||
| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                                                                        | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
 | 
			
		||||
 | 
			
		||||
```python
 | 
			
		||||
| 
						 | 
				
			
@@ -61,19 +72,14 @@ Create a new pipeline instance. In your application, you would normally use a
 | 
			
		|||
shortcut for this and instantiate the component using its string name and
 | 
			
		||||
[`nlp.add_pipe`](/api/language#add_pipe).
 | 
			
		||||
 | 
			
		||||
<!-- TODO: finish API docs -->
 | 
			
		||||
 | 
			
		||||
| Name                          | Type                                       | Description                                                                                                                                                                                                                                       |
 | 
			
		||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
 | 
			
		||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | 
			
		||||
| `vocab`                       | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                            |
 | 
			
		||||
| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                                   |
 | 
			
		||||
| `name`                        | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                       |
 | 
			
		||||
| `moves`                       | list                                       |                                                                                             |
 | 
			
		||||
| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                               |
 | 
			
		||||
| _keyword-only_                |                                            |                                                                                                                                                                                                                                                   |
 | 
			
		||||
| `update_with_oracle_cut_size` | int                                        |                                                                                             |
 | 
			
		||||
| `multitasks`                  | `Iterable`                                 |                                                                                             |
 | 
			
		||||
| `learn_tokens`                | bool                                       |                                                                                             |
 | 
			
		||||
| `min_action_freq`             | int                                        |                                                                                             |
 | 
			
		||||
| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. |
 | 
			
		||||
 | 
			
		||||
## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
 | 
			
		||||
 | 
			
		||||
| 
						 | 
				
			
			
 | 
			
		|||
| 
						 | 
				
			
@@ -29,9 +29,9 @@ architectures and their arguments and hyperparameters.
 | 
			
		|||
> ```
 | 
			
		||||
 | 
			
		||||
| Setting          | Type                                       | Description                                                                                                                                                                                                      | Default                             |
 | 
			
		||||
| ---------------- | ------------------------------------------ | -------------------------------------- | ----------------------------------- |
 | 
			
		||||
| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------- |
 | 
			
		||||
| `set_morphology` | bool                                       | Whether to set morphological features.                                                                                                                                                                           | `False`                             |
 | 
			
		||||
| `model`          | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                      | [Tagger](/api/architectures#Tagger) |
 | 
			
		||||
| `model`          | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). | [Tagger](/api/architectures#Tagger) |
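
The requirement on the model output can be checked with a short sketch. It uses plain Python lists instead of Thinc arrays, and `softmax_rows`, `check_tag_probabilities` and the three-tag label set are hypothetical names chosen for illustration.

```python
import math
from typing import List


def softmax_rows(scores: List[List[float]]) -> List[List[float]]:
    """Normalize each row of raw scores into a probability distribution."""
    result = []
    for row in scores:
        shift = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - shift) for s in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def check_tag_probabilities(probs: List[List[float]], n_tags: int, tol: float = 1e-6) -> bool:
    """One column per tag, every value in [0, 1], every row summing to 1."""
    return all(
        len(row) == n_tags
        and all(0.0 <= p <= 1.0 for p in row)
        and abs(sum(row) - 1.0) <= tol
        for row in probs
    )


tags = ["NOUN", "VERB", "ADJ"]
raw_scores = [[2.0, 0.5, -1.0], [0.1, 0.1, 0.1]]  # one row per token
probs = softmax_rows(raw_scores)
predicted = [tags[row.index(max(row))] for row in probs]
```

Here the raw scores fail the check while the softmax output passes it, and the predicted tag for each token is simply the highest-probability column.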
 | 
			
		||||
 | 
			
		||||
```python
 | 
			
		||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tagger.pyx
 | 
			
		||||
| 
						 | 
				
			
@@ -59,9 +59,9 @@ shortcut for this and instantiate the component using its string name and
 | 
			
		|||
[`nlp.add_pipe`](/api/language#add_pipe).
 | 
			
		||||
 | 
			
		||||
| Name             | Type                                       | Description                                                                                                                                                                                                      |
 | 
			
		||||
| ---------------- | ------- | ------------------------------------------------------------------------------------------- |
 | 
			
		||||
| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | 
			
		||||
| `vocab`          | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                           |
 | 
			
		||||
| `model`          | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.             |
 | 
			
		||||
| `model`          | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). |
 | 
			
		||||
| `name`           | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                      |
 | 
			
		||||
| _keyword-only_   |                                            |                                                                                                                                                                                                                  |
 | 
			
		||||
| `set_morphology` | bool                                       | Whether to set morphological features.                                                                                                                                                                           |
 | 
			
		||||
| 
						 | 
				
			
			
 | 
			
		|||
| 
						 | 
				
			
@@ -9,6 +9,12 @@ api_string_name: textcat
 | 
			
		|||
api_trainable: true
 | 
			
		||||
---
 | 
			
		||||
 | 
			
		||||
The text categorizer predicts **categories over a whole document**. It can learn
one or more labels, and the labels can be mutually exclusive (i.e. one true
label per document) or non-mutually exclusive (i.e. zero or more labels may be
true per document). The multi-label setting is controlled by the model instance
that's provided.
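
The difference between the two settings can be sketched as follows. This assumes the common convention of a softmax over labels for the mutually exclusive case and an independent sigmoid per label otherwise; the text only says the model instance controls the setting, so treat both functions and the example logits as hypothetical.

```python
import math
from typing import Dict


def exclusive_scores(logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax: scores compete and sum to 1, so exactly one label wins."""
    shift = max(logits.values())
    exps = {label: math.exp(value - shift) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def non_exclusive_scores(logits: Dict[str, float]) -> Dict[str, float]:
    """Sigmoid per label: each score is independent of the others."""
    return {label: 1.0 / (1.0 + math.exp(-value)) for label, value in logits.items()}


logits = {"SPORTS": 2.0, "POLITICS": 1.5, "TECH": -3.0}

exclusive = exclusive_scores(logits)
best_label = max(exclusive, key=exclusive.get)

non_exclusive = non_exclusive_scores(logits)
true_labels = sorted(label for label, score in non_exclusive.items() if score > 0.5)
```

With the same logits, the exclusive reading picks a single label, while the non-exclusive reading accepts every label whose own score clears the threshold, which may be none, one, or several.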
 | 
			
		||||
 | 
			
		||||
## Config and implementation {#config}
 | 
			
		||||
 | 
			
		||||
The default config is defined by the pipeline component factory and describes
 | 
			
		||||
| 
						 | 
				
			
@@ -30,9 +36,9 @@ architectures and their arguments and hyperparameters.
 | 
			
		|||
> ```
 | 
			
		||||
 | 
			
		||||
| Setting  | Type                                       | Description                                                                             | Default                                               |
 | 
			
		||||
| -------- | ------------------------------------------ | ------------------ | ----------------------------------------------------- |
 | 
			
		||||
| `labels` | `Iterable[str]`                            | The labels to use. | `[]`                                                  |
 | 
			
		||||
| `model`  | [`Model`](https://thinc.ai/docs/api-model) | The model to use.  | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
 | 
			
		||||
| -------- | ------------------------------------------ | --------------------------------------------------------------------------------------- | ----------------------------------------------------- |
 | 
			
		||||
| `labels` | `List[str]`                                | A list of categories to learn. If empty, the model infers the categories from the data. | `[]`                                                  |
 | 
			
		||||
| `model`  | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts scores for each category.                                | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
 | 
			
		||||
 | 
			
		||||
```python
 | 
			
		||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/textcat.py
 | 
			
		||||
| 
						 | 
				
			
@@ -67,23 +73,6 @@ shortcut for this and instantiate the component using its string name and
 | 
			
		|||
| _keyword-only_ |                                            |                                                                                             |
 | 
			
		||||
| `labels`       | `Iterable[str]`                            | The labels to use.                                                                          |
 | 
			
		||||
 | 
			
		||||
<!-- TODO move to config page
 | 
			
		||||
### Architectures {#architectures new="2.1"}
 | 
			
		||||
 | 
			
		||||
Text classification models can be used to solve a wide variety of problems.
 | 
			
		||||
Differences in text length, number of labels, difficulty, and runtime
 | 
			
		||||
performance constraints mean that no single algorithm performs well on all types
 | 
			
		||||
of problems. To handle a wider variety of problems, the `TextCategorizer` object
 | 
			
		||||
allows configuration of its model architecture, using the `architecture` keyword
 | 
			
		||||
argument.
 | 
			
		||||
 | 
			
		||||
| Name           | Description                                                                                                                                                                                                                                                                                                                                                                                                      |
 | 
			
		||||
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | 
			
		||||
| `"ensemble"`   | **Default:** Stacked ensemble of a bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. The "ngram_size" and "attr" arguments can be used to configure the feature extraction for the bag-of-words model.                                                                                                                                               |
 | 
			
		||||
| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster.                                                                                                                                                                                |
 | 
			
		||||
| `"bow"`        | An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. The features extracted can be controlled using the keyword arguments `ngram_size` and `attr`. For instance, `ngram_size=3` and `attr="lower"` would give lower-cased unigram, trigram and bigram features. 2, 3 or 4 are usually good choices of ngram size. |
 | 
			
		||||
-->
 | 
			
		||||
 | 
			
		||||
## TextCategorizer.\_\_call\_\_ {#call tag="method"}
 | 
			
		||||
 | 
			
		||||
Apply the pipe to one document. The document is modified in place, and returned.
 | 
			
		||||
@@ -8,7 +8,20 @@ api_string_name: tok2vec
api_trainable: true
---

<!-- TODO: intro describing component -->
Apply a "token-to-vector" model and set its outputs in the doc.tensor attribute.
This is mostly useful to **share a single subnetwork** between multiple
components, e.g. to have one embedding and CNN network shared between a
[`DependencyParser`](/api/dependencyparser), [`Tagger`](/api/tagger) and
[`EntityRecognizer`](/api/entityrecognizer).

In order to use the `Tok2Vec` predictions, subsequent components should use the
[Tok2VecListener](/api/architectures#Tok2VecListener) layer as the tok2vec
subnetwork of their model. This layer will read data from the `doc.tensor`
attribute during prediction. During training, the `Tok2Vec` component will save
its prediction and backprop callback for each batch, so that the subsequent
components can backpropagate to the shared weights. This implementation is used
because it allows us to avoid relying on object identity within the models to
achieve the parameter sharing.
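
The listener mechanism described above can be illustrated with a minimal, dependency-free sketch. It is not spaCy's implementation: `SharedEncoder`, `Listener`, `begin_batch` and `receive` are hypothetical names, and a single scalar weight stands in for the embedding and CNN layers. It only shows the idea of one upstream component handing its batch output and backprop callback to several downstream listeners, so that the gradients from all of them accumulate on the same weights.

```python
from typing import Callable, List, Optional, Tuple

Backprop = Callable[[List[float]], None]


class Listener:
    """Hypothetical stand-in for a listener layer inside a downstream model."""

    def __init__(self) -> None:
        self._outputs: Optional[List[float]] = None
        self._backprop: Optional[Backprop] = None

    def receive(self, outputs: List[float], backprop: Backprop) -> None:
        # Called by the upstream component once per batch.
        self._outputs = outputs
        self._backprop = backprop

    def forward(self) -> Tuple[List[float], Backprop]:
        # Reuse the stored prediction instead of recomputing it.
        if self._outputs is None or self._backprop is None:
            raise ValueError("No batch received from the upstream component yet.")
        return self._outputs, self._backprop


class SharedEncoder:
    """Hypothetical stand-in for the shared component: one scalar weight."""

    def __init__(self, weight: float) -> None:
        self.weight = weight
        self.grad = 0.0
        self.listeners: List[Listener] = []

    def add_listener(self, listener: Listener) -> None:
        self.listeners.append(listener)

    def begin_batch(self, inputs: List[float]) -> None:
        outputs = [self.weight * x for x in inputs]

        def backprop(d_outputs: List[float]) -> None:
            # d(output_i)/d(weight) = input_i, so gradients from every
            # listener accumulate on the same shared weight.
            self.grad += sum(d * x for d, x in zip(d_outputs, inputs))

        for listener in self.listeners:
            listener.receive(outputs, backprop)


encoder = SharedEncoder(weight=2.0)
tagger_listener, parser_listener = Listener(), Listener()
encoder.add_listener(tagger_listener)
encoder.add_listener(parser_listener)

encoder.begin_batch([1.0, 3.0])
for listener in (tagger_listener, parser_listener):
    outputs, backprop = listener.forward()
    # Pretend each downstream loss has a gradient of 1.0 per output.
    backprop([1.0, 1.0])

print(encoder.grad)
```

Neither listener holds a reference to the encoder's weights in this sketch. They only receive outputs and a callback, which is one way to share parameters without depending on object identity.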

## Config and implementation {#config}

@@ -28,8 +41,8 @@ architectures and their arguments and hyperparameters.
> ```

| Setting | Type                                       | Description                                                             | Default                                         |
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------- |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |
| ------- | ------------------------------------------ | ----------------------------------------------------------------------- | ----------------------------------------------- |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |

```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tok2vec.py
@@ -64,9 +77,11 @@ shortcut for this and instantiate the component using its string name and

## Tok2Vec.\_\_call\_\_ {#call tag="method"}

Apply the pipe to one document. The document is modified in place, and returned.
This usually happens under the hood when the `nlp` object is called on a text
and all pipeline components are applied to the `Doc` in order. Both
Apply the pipe to one document and add context-sensitive embeddings to the
`Doc.tensor` attribute, allowing them to be used as features by downstream
components. The document is modified in place, and returned. This usually
happens under the hood when the `nlp` object is called on a text and all
pipeline components are applied to the `Doc` in order. Both
[`__call__`](/api/tok2vec#call) and [`pipe`](/api/tok2vec#pipe) delegate to the
[`predict`](/api/tok2vec#predict) and
[`set_annotations`](/api/tok2vec#set_annotations) methods.
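
The delegation described above can be sketched in plain Python. This is only an illustration under stated assumptions: `MiniDoc` and `MiniTok2Vec` are hypothetical stand-ins rather than spaCy classes, the "vector" for each token is just its length, and the batching in `pipe` is a simplification.

```python
from typing import Iterable, Iterator, List

Vectors = List[List[float]]


class MiniDoc:
    """Hypothetical stand-in for a Doc: a list of words plus a tensor slot."""

    def __init__(self, words: List[str]) -> None:
        self.words = words
        self.tensor: Vectors = []


class MiniTok2Vec:
    """Hypothetical component mirroring the call structure described above."""

    def predict(self, docs: List[MiniDoc]) -> List[Vectors]:
        # Compute one toy vector per token without modifying the documents.
        return [[[float(len(word))] for word in doc.words] for doc in docs]

    def set_annotations(self, docs: List[MiniDoc], tokvecs: List[Vectors]) -> None:
        # Write the predictions into each document, in place.
        for doc, vectors in zip(docs, tokvecs):
            doc.tensor = vectors

    def __call__(self, doc: MiniDoc) -> MiniDoc:
        # A single document is treated as a batch of one.
        self.set_annotations([doc], self.predict([doc]))
        return doc

    def pipe(self, stream: Iterable[MiniDoc], batch_size: int = 2) -> Iterator[MiniDoc]:
        batch: List[MiniDoc] = []
        for doc in stream:
            batch.append(doc)
            if len(batch) == batch_size:
                yield from self._process(batch)
                batch = []
        if batch:
            yield from self._process(batch)

    def _process(self, batch: List[MiniDoc]) -> List[MiniDoc]:
        self.set_annotations(batch, self.predict(batch))
        return batch


tok2vec = MiniTok2Vec()
doc = tok2vec(MiniDoc(["shared", "weights"]))
piped = list(tok2vec.pipe([MiniDoc(["a"]), MiniDoc(["bc", "def"]), MiniDoc(["g"])]))
print(doc.tensor)
```

Keeping `predict` free of side effects and confining all writes to `set_annotations` lets both entry points share the same two steps, whether they handle one document or a batch.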