mirror of https://github.com/explosion/spaCy.git, synced 2025-11-04 01:48:04 +03:00

Update docstrings and docs

parent 8d2baa153d
commit a15c5fb191

					@ -71,6 +71,11 @@ def make_parser(
 | 
				
			||||||
    actions are decreased. Note that more than one action may be optimal for
 | 
					    actions are decreased. Note that more than one action may be optimal for
 | 
				
			||||||
    a given state.
 | 
					    a given state.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    model (Model): The model for the transition-based parser. The model needs
 | 
				
			||||||
 | 
					        to have a specific substructure of named components --- see the
 | 
				
			||||||
 | 
					        spacy.ml.tb_framework.TransitionModel for details.
 | 
				
			||||||
 | 
					    moves (List[str]): A list of transition names. Inferred from the data if not
 | 
				
			||||||
 | 
					        provided.
 | 
				
			||||||
    update_with_oracle_cut_size (int):
 | 
					    update_with_oracle_cut_size (int):
 | 
				
			||||||
        During training, cut long sequences into shorter segments by creating
 | 
					        During training, cut long sequences into shorter segments by creating
 | 
				
			||||||
        intermediate states based on the gold-standard history. The model is
 | 
					        intermediate states based on the gold-standard history. The model is
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
| 
						 | 
					@ -70,6 +70,47 @@ blog post for background.
| `embed`  | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations.                                   |
| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |

### spacy.Tok2VecListener.v1 {#Tok2VecListener}

> #### Example config
>
> ```ini
> [components.tok2vec]
> factory = "tok2vec"
>
> [components.tok2vec.model]
> @architectures = "spacy.HashEmbedCNN.v1"
> width = 342
>
> [components.tagger]
> factory = "tagger"
>
> [components.tagger.model]
> @architectures = "spacy.Tagger.v1"
>
> [components.tagger.model.tok2vec]
> @architectures = "spacy.Tok2VecListener.v1"
> width = ${components.tok2vec.model:width}
> ```

A listener is used as a sublayer within a component such as a
[`DependencyParser`](/api/dependencyparser),
[`EntityRecognizer`](/api/entityrecognizer) or
[`TextCategorizer`](/api/textcategorizer). Usually you'll have multiple
listeners connecting to a single upstream [`Tok2Vec`](/api/tok2vec) component
that's earlier in the pipeline. The listener layers act as **proxies**, passing
the predictions from the `Tok2Vec` component into downstream components, and
communicating gradients back upstream.

Instead of defining its own `Tok2Vec` instance, a model architecture like
[Tagger](/api/architectures#tagger) can define a listener as its `tok2vec`
argument that connects to the shared `tok2vec` component in the pipeline.

| Name       | Type | Description |
| ---------- | ---- | ----------- |
| `width`    | int  | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. |
| `upstream` | str  | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. |
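
The proxy behaviour described for the listener can be illustrated with a toy model in plain Python. This is only a sketch of the idea and does not use or reproduce spaCy's implementation: the classes `ToyUpstream` and `ToyListener`, their methods, and the single `scale` weight are all hypothetical names invented for this example.

```python
from typing import Callable, List

Vector = List[float]


class ToyUpstream:
    """Stand-in for a shared embedding component (hypothetical, not spaCy's Tok2Vec)."""

    def __init__(self, scale: float) -> None:
        self.scale = scale      # the only "weight" of this toy model
        self.grad_scale = 0.0   # gradient accumulated from all listeners
        self.listeners: List["ToyListener"] = []

    def add_listener(self, listener: "ToyListener") -> None:
        self.listeners.append(listener)

    def predict(self, inputs: Vector) -> Vector:
        outputs = [self.scale * x for x in inputs]

        def backprop(d_outputs: Vector) -> None:
            # d(scale * x) / d(scale) = x, summed over all positions.
            self.grad_scale += sum(d * x for d, x in zip(d_outputs, inputs))

        # Hand the same outputs and the same callback to every listener.
        for listener in self.listeners:
            listener.receive(outputs, backprop)
        return outputs


class ToyListener:
    """Proxy layer: returns the upstream outputs and forwards gradients back."""

    def __init__(self) -> None:
        self._outputs: Vector = []
        self._backprop: Callable[[Vector], None] = lambda d_outputs: None

    def receive(self, outputs: Vector, backprop: Callable[[Vector], None]) -> None:
        self._outputs = outputs
        self._backprop = backprop

    def begin_update(self):
        # No computation happens here; the listener only relays cached values.
        return self._outputs, self._backprop


upstream = ToyUpstream(scale=2.0)
tagger_listener, parser_listener = ToyListener(), ToyListener()
upstream.add_listener(tagger_listener)
upstream.add_listener(parser_listener)

upstream.predict([1.0, 3.0])
tagger_out, tagger_backprop = tagger_listener.begin_update()
parser_out, parser_backprop = parser_listener.begin_update()
tagger_backprop([0.5, 0.5])  # gradient from the first downstream component
parser_backprop([1.0, 0.0])  # gradient from the second downstream component
```

Both listeners see the same upstream output, and the upstream weight receives the sum of the gradients sent back by the two downstream components.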

### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}

<!-- TODO: check example config -->
| 
						 | 
					
 | 
				
			||||||
| 
						 | 
					@ -8,6 +8,23 @@ api_string_name: parser
api_trainable: true
---

A transition-based dependency parser component. The dependency parser jointly
learns sentence segmentation and labelled dependency parsing, and can optionally
learn to merge tokens that had been over-segmented by the tokenizer. The parser
uses a variant of the **non-monotonic arc-eager transition-system** described by
[Honnibal and Johnson (2015)](https://www.aclweb.org/anthology/D15-1162/), with
the addition of a "break" transition to perform the sentence segmentation.
[Nivre (2005)](https://www.aclweb.org/anthology/P05-1013/)'s **pseudo-projective
dependency transformation** is used to allow the parser to predict
non-projective parses.

The parser is trained using an **imitation learning objective**. It follows the
actions predicted by the current weights, and at each state, determines which
actions are compatible with the optimal parse that could be reached from the
current state. The weights are updated such that the scores assigned to the set
of optimal actions are increased, while scores assigned to other actions are
decreased. Note that more than one action may be optimal for a given state.
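
One way to picture this kind of update is with a softmax over action scores and a loss that rewards the total probability of the optimal set. The sketch below is an assumption made for illustration, not a claim about the exact loss the parser uses; the function names and the action names in the example are hypothetical.

```python
import math
from typing import Dict, Set


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn raw action scores into probabilities (numerically stable)."""
    top = max(scores.values())
    exps = {action: math.exp(score - top) for action, score in scores.items()}
    total = sum(exps.values())
    return {action: value / total for action, value in exps.items()}


def oracle_gradient(scores: Dict[str, float], optimal: Set[str]) -> Dict[str, float]:
    """Gradient of -log(sum of probabilities of the optimal actions) w.r.t. the scores."""
    probs = softmax(scores)
    optimal_mass = sum(probs[action] for action in optimal)
    gradient = {}
    for action, prob in probs.items():
        # Target distribution: the model's own distribution restricted to the optimal set.
        target = prob / optimal_mass if action in optimal else 0.0
        gradient[action] = prob - target
    return gradient


# Illustrative state in which two actions are both optimal.
scores = {"SHIFT": 1.0, "LEFT-ARC": 0.5, "RIGHT-ARC": -0.2, "REDUCE": 0.0}
optimal = {"SHIFT", "RIGHT-ARC"}
gradient = oracle_gradient(scores, optimal)

# One gradient-descent step: optimal actions go up, the others go down.
learning_rate = 0.1
updated = {action: scores[action] - learning_rate * gradient[action] for action in scores}
```

Because the target spreads its mass over every optimal action, the update does not force the model to prefer one optimal action over another.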

## Config and implementation {#config}

The default config is defined by the pipeline component factory and describes
| 
						 | 
					@ -23,17 +40,20 @@ architectures and their arguments and hyperparameters.
 | 
				
			||||||
> from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
 | 
					> from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
 | 
				
			||||||
> config = {
 | 
					> config = {
 | 
				
			||||||
>    "moves": None,
 | 
					>    "moves": None,
 | 
				
			||||||
>   # TODO: rest
 | 
					>    "update_with_oracle_cut_size": 100,
 | 
				
			||||||
 | 
					>    "learn_tokens": False,
 | 
				
			||||||
 | 
					>    "min_action_freq": 30,
 | 
				
			||||||
>    "model": DEFAULT_PARSER_MODEL,
 | 
					>    "model": DEFAULT_PARSER_MODEL,
 | 
				
			||||||
> }
 | 
					> }
 | 
				
			||||||
> nlp.add_pipe("parser", config=config)
 | 
					> nlp.add_pipe("parser", config=config)
 | 
				
			||||||
> ```
 | 
					> ```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
<!-- TODO: finish API docs -->
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
| Setting                       | Type                                       | Description                                                                                                                                                                                                                                                                                 | Default                                                           |
 | 
					| Setting                       | Type                                       | Description                                                                                                                                                                                                                                                                                 | Default                                                           |
 | 
				
			||||||
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
 | 
					| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
 | 
				
			||||||
| `moves` | list                                       |                   | `None`                                                            |
 | 
					| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                                                                         | `None`                                                            |
 | 
				
			||||||
 | 
					| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it.                                                                    | `100`                                                             |
 | 
				
			||||||
 | 
					| `learn_tokens`                | bool                                       | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental.                                                                                                                                                                                             | `False`                                                           |
 | 
				
			||||||
 | 
					| `min_action_freq`             | int                                        | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. | `30`                                                              |
 | 
				
			||||||
| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                                                                                                                                           | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
 | 
					| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                                                                                                                                           | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
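
The effect of `update_with_oracle_cut_size` can be sketched as splitting a gold-standard action history into segments, where each segment starts from the intermediate state reached by replaying the preceding gold actions. The helper below is a simplified illustration under that reading; `cut_gold_history` is a hypothetical name, and treating a cut size below 1 as "no cutting" is an assumption.

```python
from typing import List, Tuple


def cut_gold_history(gold_actions: List[str], cut_size: int) -> List[Tuple[int, List[str]]]:
    """Split a gold action sequence into (start_offset, segment) pairs.

    start_offset is the number of gold actions that would be replayed to
    build the intermediate state from which the segment begins.
    """
    if cut_size < 1:
        # Assumption for this sketch: a non-positive size disables cutting.
        return [(0, list(gold_actions))]
    segments = []
    for start in range(0, len(gold_actions), cut_size):
        segments.append((start, gold_actions[start:start + cut_size]))
    return segments


history = [f"action_{i}" for i in range(250)]
segments = cut_gold_history(history, cut_size=100)
starts = [start for start, _ in segments]
lengths = [len(segment) for _, segment in segments]
```

With the default of `100`, a 250-action history yields three training segments instead of one long sequence.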
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```python
 | 
					```python
 | 
				
			||||||
| 
						 | 
					@ -61,19 +81,16 @@ Create a new pipeline instance. In your application, you would normally use a
 | 
				
			||||||
shortcut for this and instantiate the component using its string name and
 | 
					shortcut for this and instantiate the component using its string name and
 | 
				
			||||||
[`nlp.add_pipe`](/api/language#add_pipe).
 | 
					[`nlp.add_pipe`](/api/language#add_pipe).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
<!-- TODO: finish API docs -->
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
| Name                          | Type                                       | Description                                                                                                                                                                                                                                                                                 |
 | 
					| Name                          | Type                                       | Description                                                                                                                                                                                                                                                                                 |
 | 
				
			||||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
 | 
					| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | 
				
			||||||
| `vocab`                       | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                                                                      |
 | 
					| `vocab`                       | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                                                                      |
 | 
				
			||||||
| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                                                                             |
 | 
					| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                                                                             |
 | 
				
			||||||
| `name`                        | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                                                                 |
 | 
					| `name`                        | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                                                                 |
 | 
				
			||||||
| `moves`                       | list                                       |                                                                                             |
 | 
					| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                                                                         |
 | 
				
			||||||
| _keyword-only_                |                                            |                                                                                                                                                                                                                                                                                             |
 | 
					| _keyword-only_                |                                            |                                                                                                                                                                                                                                                                                             |
 | 
				
			||||||
| `update_with_oracle_cut_size` | int                                        |                                                                                             |
 | 
					| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default.                                           |
 | 
				
			||||||
| `multitasks`                  | `Iterable`                                 |                                                                                             |
 | 
					| `learn_tokens`                | bool                                       | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental.                                                                                                                                                                                             |
 | 
				
			||||||
| `learn_tokens`                | bool                                       |                                                                                             |
 | 
					| `min_action_freq`             | int                                        | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. |
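
The label back-off described for `min_action_freq` can be sketched as a frequency filter over labelled actions. This is an illustration only: representing actions as `(move, label)` tuples, and the helper name `back_off_rare_actions`, are assumptions made for the example.

```python
from collections import Counter
from typing import List, Tuple

Action = Tuple[str, str]  # (move name, dependency label); illustrative representation


def back_off_rare_actions(
    actions: List[Action], min_freq: int, backoff_label: str = "dep"
) -> List[Action]:
    """Replace the label of any labelled action seen fewer than min_freq times."""
    counts = Counter(actions)
    return [
        (move, label) if counts[(move, label)] >= min_freq else (move, backoff_label)
        for move, label in actions
    ]


observed = [("L", "nsubj")] * 40 + [("R", "dobj")] * 35 + [("R", "csubjpass")] * 2
cleaned = back_off_rare_actions(observed, min_freq=30)
label_counts = Counter(cleaned)
```

Here the two rare `csubjpass` actions keep their move but fall back to the generic `"dep"` label, while the frequent actions are unchanged.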
 | 
				
			||||||
| `min_action_freq`             | int                                        |                                                                                             |
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
## DependencyParser.\_\_call\_\_ {#call tag="method"}
 | 
					## DependencyParser.\_\_call\_\_ {#call tag="method"}
 | 
				
			||||||
 | 
					
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
| 
						 | 
					@ -8,6 +8,18 @@ api_string_name: ner
api_trainable: true
---

A transition-based named entity recognition component. The entity recognizer
identifies **non-overlapping labelled spans** of tokens. The transition-based
algorithm used encodes certain assumptions that are effective for "traditional"
named entity recognition tasks, but may not be a good fit for every span
identification problem. Specifically, the loss function optimizes for **whole
entity accuracy**, so if your inter-annotator agreement on boundary tokens is
low, the component will likely perform poorly on your problem. The
transition-based algorithm also assumes that the most decisive information about
your entities will be close to their initial tokens. If your entities are long
and characterized by tokens in their middle, the component will likely not be a
good fit for your task.
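
To see why boundary disagreement hurts under a whole-entity criterion, consider an exact-match comparison of spans. The sketch below is an evaluation-style illustration of that criterion, not the component's loss function; the `(start, end, label)` tuple representation and the function name are assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token, label); illustrative representation


def whole_entity_scores(gold: List[Span], predicted: List[Span]) -> Tuple[float, float]:
    """Precision and recall where a span counts only if boundaries and label all match."""
    gold_set, predicted_set = set(gold), set(predicted)
    correct = len(gold_set & predicted_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


gold = [(0, 2, "ORG"), (5, 6, "PERSON")]
# The first prediction is off by one token at its right boundary.
predicted = [(0, 3, "ORG"), (5, 6, "PERSON")]
precision, recall = whole_entity_scores(gold, predicted)
```

A single boundary token is enough to make the first entity count as entirely wrong, which is why noisy boundary annotations are costly under this criterion.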

## Config and implementation {#config}

The default config is defined by the pipeline component factory and describes
| 
						 | 
					@ -23,17 +35,16 @@ architectures and their arguments and hyperparameters.
 | 
				
			||||||
> from spacy.pipeline.ner import DEFAULT_NER_MODEL
 | 
					> from spacy.pipeline.ner import DEFAULT_NER_MODEL
 | 
				
			||||||
> config = {
 | 
					> config = {
 | 
				
			||||||
>    "moves": None,
 | 
					>    "moves": None,
 | 
				
			||||||
>   # TODO: rest
 | 
					>    "update_with_oracle_cut_size": 100,
 | 
				
			||||||
>    "model": DEFAULT_NER_MODEL,
 | 
					>    "model": DEFAULT_NER_MODEL,
 | 
				
			||||||
> }
 | 
					> }
 | 
				
			||||||
> nlp.add_pipe("ner", config=config)
 | 
					> nlp.add_pipe("ner", config=config)
 | 
				
			||||||
> ```
 | 
					> ```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
<!-- TODO: finish API docs -->
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
| Setting                       | Type                                       | Description                                                                                                                                                                                                              | Default                                                           |
 | 
					| Setting                       | Type                                       | Description                                                                                                                                                                                                              | Default                                                           |
 | 
				
			||||||
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------------------------- |
 | 
					| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |
 | 
				
			||||||
| `moves` | list                                       |                   | `None`                                                            |
 | 
					| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                      |
 | 
				
			||||||
 | 
					| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100`                                                             |
 | 
				
			||||||
| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                                                                        | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
 | 
					| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                                                                        | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```python
 | 
					```python
 | 
				
			||||||
| 
						 | 
					@ -61,19 +72,14 @@ Create a new pipeline instance. In your application, you would normally use a
 | 
				
			||||||
shortcut for this and instantiate the component using its string name and
 | 
					shortcut for this and instantiate the component using its string name and
 | 
				
			||||||
[`nlp.add_pipe`](/api/language#add_pipe).
 | 
					[`nlp.add_pipe`](/api/language#add_pipe).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
<!-- TODO: finish API docs -->
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
| Name                          | Type                                       | Description                                                                                                                                                                                                                                       |
 | 
					| Name                          | Type                                       | Description                                                                                                                                                                                                                                       |
 | 
				
			||||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
 | 
					| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | 
				
			||||||
| `vocab`                       | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                            |
 | 
					| `vocab`                       | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                            |
 | 
				
			||||||
| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                                   |
 | 
					| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                                   |
 | 
				
			||||||
| `name`                        | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                       |
 | 
					| `name`                        | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                       |
 | 
				
			||||||
| `moves`                       | list                                       |                                                                                             |
 | 
					| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                               |
 | 
				
			||||||
| _keyword-only_                |                                            |                                                                                                                                                                                                                                                   |
 | 
					| _keyword-only_                |                                            |                                                                                                                                                                                                                                                   |
 | 
				
			||||||
| `update_with_oracle_cut_size` | int                                        |                                                                                             |
 | 
					| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. |
 | 
				
			||||||
| `multitasks`                  | `Iterable`                                 |                                                                                             |
 | 
					 | 
				
			||||||
| `learn_tokens`                | bool                                       |                                                                                             |
 | 
					 | 
				
			||||||
| `min_action_freq`             | int                                        |                                                                                             |
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
 | 
					## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
 | 
				
			||||||
 | 
					
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
| 
						 | 
					@ -29,9 +29,9 @@ architectures and their arguments and hyperparameters.
 | 
				
			||||||
> ```
 | 
					> ```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
| Setting          | Type                                       | Description                                                                                                                                                                                                      | Default                             |
 | 
					| Setting          | Type                                       | Description                                                                                                                                                                                                      | Default                             |
 | 
				
			||||||
| ---------------- | ------------------------------------------ | -------------------------------------- | ----------------------------------- |
 | 
					| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------- |
 | 
				
			||||||
| `set_morphology` | bool                                       | Whether to set morphological features.                                                                                                                                                                           | `False`                             |
 | 
					| `set_morphology` | bool                                       | Whether to set morphological features.                                                                                                                                                                           | `False`                             |
 | 
				
			||||||
| `model`          | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                      | [Tagger](/api/architectures#Tagger) |
 | 
					| `model`          | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). | [Tagger](/api/architectures#Tagger) |
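
The normalization requirement in the `model` row above can be checked mechanically. The following is a minimal sketch in plain Python with no spaCy or Thinc dependency; `softmax` and `is_normalized` are hypothetical helper names used only for illustration and are not part of either library.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn the raw scores for one token into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def is_normalized(rows: List[List[float]], n_tags: int, tol: float = 1e-6) -> bool:
    """Check the stated properties: one column per tag, every score in
    [0, 1], and every row summing to 1 (within a tolerance)."""
    for row in rows:
        if len(row) != n_tags:
            return False
        if any(p < 0.0 or p > 1.0 for p in row):
            return False
        if abs(sum(row) - 1.0) > tol:
            return False
    return True


# Hypothetical raw scores for two tokens over three tags.
raw = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
probs = [softmax(row) for row in raw]
```

Here `is_normalized(probs, 3)` holds, while the raw scores fail the check because they are neither bounded by 1 nor row-normalized.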
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```python
 | 
					```python
 | 
				
			||||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tagger.pyx
 | 
					https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tagger.pyx
 | 
				
			||||||
| 
						 | 
					@ -59,9 +59,9 @@ shortcut for this and instantiate the component using its string name and
 | 
				
			||||||
[`nlp.add_pipe`](/api/language#add_pipe).
 | 
					[`nlp.add_pipe`](/api/language#add_pipe).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
| Name             | Type                                       | Description                                                                                                                                                                                                      |
 | 
					| Name             | Type                                       | Description                                                                                                                                                                                                      |
 | 
				
			||||||
| ---------------- | ------- | ------------------------------------------------------------------------------------------- |
 | 
					| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | 
				
			||||||
| `vocab`          | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                           |
 | 
					| `vocab`          | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                           |
 | 
				
			||||||
| `model`          | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.             |
 | 
					| `model`          | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). |
 | 
				
			||||||
| `name`           | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                      |
 | 
					| `name`           | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                      |
 | 
				
			||||||
| _keyword-only_   |                                            |                                                                                                                                                                                                                  |
 | 
					| _keyword-only_   |                                            |                                                                                                                                                                                                                  |
 | 
				
			||||||
| `set_morphology` | bool                                       | Whether to set morphological features.                                                                                                                                                                           |
 | 
					| `set_morphology` | bool                                       | Whether to set morphological features.                                                                                                                                                                           |
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
| 
						 | 
					@ -9,6 +9,12 @@ api_string_name: textcat
 | 
				
			||||||
api_trainable: true
 | 
					api_trainable: true
 | 
				
			||||||
---
 | 
					---
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					The text categorizer predicts **categories over a whole document**. It can learn
 | 
				
			||||||
 | 
					one or more labels, and the labels can be mutually exclusive (i.e. one true
 | 
				
			||||||
 | 
					label per document) or non-mutually exclusive (i.e. zero or more labels may be
 | 
				
			||||||
 | 
					true per document). The multi-label setting is controlled by the model instance
 | 
				
			||||||
 | 
					that's provided.
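
The difference between the two label settings described above can be illustrated with a small sketch. It assumes that mutually exclusive labels are scored with a softmax over all categories and non-mutually exclusive labels with an independent sigmoid per category; this is a common convention, not a statement of what any particular spaCy architecture does internally. The function names are hypothetical.

```python
import math
from typing import Dict, List


def score_exclusive(labels: List[str], raw: List[float]) -> Dict[str, float]:
    """Softmax: scores compete, so they sum to 1 across labels."""
    shift = max(raw)
    exps = [math.exp(r - shift) for r in raw]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


def score_non_exclusive(labels: List[str], raw: List[float]) -> Dict[str, float]:
    """Sigmoid: each label is scored independently of the others."""
    return {label: 1.0 / (1.0 + math.exp(-r)) for label, r in zip(labels, raw)}


# Hypothetical labels and raw model outputs for one document.
labels = ["SPORTS", "POLITICS", "WEATHER"]
raw_scores = [2.0, 1.0, -1.0]

exclusive = score_exclusive(labels, raw_scores)
non_exclusive = score_non_exclusive(labels, raw_scores)
```

In the exclusive case exactly one label is expected to be true, so the scores form a distribution. In the non-exclusive case several labels, or none, can score highly at once.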
 | 
				
			||||||
 | 
					
 | 
				
			||||||
## Config and implementation {#config}
 | 
					## Config and implementation {#config}
 | 
				
			||||||
 | 
					
 | 
				
			||||||
The default config is defined by the pipeline component factory and describes
 | 
					The default config is defined by the pipeline component factory and describes
 | 
				
			||||||
| 
						 | 
					@ -30,9 +36,9 @@ architectures and their arguments and hyperparameters.
 | 
				
			||||||
> ```
 | 
					> ```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
| Setting  | Type                                       | Description                                                                             | Default                                               |
 | 
					| Setting  | Type                                       | Description                                                                             | Default                                               |
 | 
				
			||||||
| -------- | ------------------------------------------ | ------------------ | ----------------------------------------------------- |
 | 
					| -------- | ------------------------------------------ | --------------------------------------------------------------------------------------- | ----------------------------------------------------- |
 | 
				
			||||||
| `labels` | `Iterable[str]`                            | The labels to use. | `[]`                                                  |
 | 
					| `labels` | `List[str]`                                | A list of categories to learn. If empty, the model infers the categories from the data. | `[]`                                                  |
 | 
				
			||||||
| `model`  | [`Model`](https://thinc.ai/docs/api-model) | The model to use.  | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
 | 
					| `model`  | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts scores for each category.                                | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```python
 | 
					```python
 | 
				
			||||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/textcat.py
 | 
					https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/textcat.py
 | 
				
			||||||
| 
						 | 
					@ -67,23 +73,6 @@ shortcut for this and instantiate the component using its string name and
 | 
				
			||||||
| _keyword-only_ |                                            |                                                                                             |
 | 
					| _keyword-only_ |                                            |                                                                                             |
 | 
				
			||||||
| `labels`       | `Iterable[str]`                            | The labels to use.                                                                          |
 | 
					| `labels`       | `Iterable[str]`                            | The labels to use.                                                                          |
 | 
				
			||||||
 | 
					
 | 
				
			||||||
<!-- TODO move to config page
 | 
					 | 
				
			||||||
### Architectures {#architectures new="2.1"}
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
Text classification models can be used to solve a wide variety of problems.
 | 
					 | 
				
			||||||
Differences in text length, number of labels, difficulty, and runtime
 | 
					 | 
				
			||||||
performance constraints mean that no single algorithm performs well on all types
 | 
					 | 
				
			||||||
of problems. To handle a wider variety of problems, the `TextCategorizer` object
 | 
					 | 
				
			||||||
allows configuration of its model architecture, using the `architecture` keyword
 | 
					 | 
				
			||||||
argument.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
| Name           | Description                                                                                                                                                                                                                                                                                                                                                                                                      |
 | 
					 | 
				
			||||||
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | 
					 | 
				
			||||||
| `"ensemble"`   | **Default:** Stacked ensemble of a bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. The "ngram_size" and "attr" arguments can be used to configure the feature extraction for the bag-of-words model.                                                                                                                                               |
 | 
					 | 
				
			||||||
| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster.                                                                                                                                                                                |
 | 
					 | 
				
			||||||
| `"bow"`        | An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. The features extracted can be controlled using the keyword arguments `ngram_size` and `attr`. For instance, `ngram_size=3` and `attr="lower"` would give lower-cased unigram, trigram and bigram features. 2, 3 or 4 are usually good choices of ngram size. |
 | 
					 | 
				
			||||||
-->
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
## TextCategorizer.\_\_call\_\_ {#call tag="method"}
 | 
					## TextCategorizer.\_\_call\_\_ {#call tag="method"}
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Apply the pipe to one document. The document is modified in place, and returned.
 | 
					Apply the pipe to one document. The document is modified in place, and returned.
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
| 
						 | 
					@ -8,7 +8,20 @@ api_string_name: tok2vec
 | 
				
			||||||
api_trainable: true
 | 
					api_trainable: true
 | 
				
			||||||
---
 | 
					---
 | 
				
			||||||
 | 
					
 | 
				
			||||||
<!-- TODO: intro describing component -->
 | 
					Apply a "token-to-vector" model and set its outputs in the doc.tensor attribute.
 | 
				
			||||||
 | 
					This is mostly useful to **share a single subnetwork** between multiple
 | 
				
			||||||
 | 
					components, e.g. to have one embedding and CNN network shared between a
 | 
				
			||||||
 | 
					[`DependencyParser`](/api/dependencyparser), [`Tagger`](/api/tagger) and
 | 
				
			||||||
 | 
					[`EntityRecognizer`](/api/entityrecognizer).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					In order to use the `Tok2Vec` predictions, subsequent components should use the
 | 
				
			||||||
 | 
					[Tok2VecListener](/api/architectures#Tok2VecListener) layer as the tok2vec
 | 
				
			||||||
 | 
					subnetwork of their model. This layer will read data from the `doc.tensor`
 | 
				
			||||||
 | 
					attribute during prediction. During training, the `Tok2Vec` component will save
 | 
				
			||||||
 | 
					its prediction and backprop callback for each batch, so that the subsequent
 | 
				
			||||||
 | 
					components can backpropagate to the shared weights. This implementation is used
 | 
				
			||||||
 | 
					because it allows us to avoid relying on object identity within the models to
 | 
				
			||||||
 | 
					achieve the parameter sharing.
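
The sharing mechanism described above can be sketched in plain Python. This is a toy model under stated assumptions, not spaCy's implementation: `ToyDoc`, `ToyTok2Vec`, `ToyListener` and all their methods are hypothetical names, and the "network" is a single scalar weight multiplied by the word length.

```python
from typing import Callable, List, Optional, Tuple

Tensor = List[List[float]]  # one single-element vector per token


class ToyDoc:
    """Stand-in for a Doc: a list of words plus a tensor slot."""

    def __init__(self, words: List[str]) -> None:
        self.words = words
        self.tensor: Tensor = []


class ToyListener:
    """Stand-in for a listener layer used by a downstream component."""

    def __init__(self) -> None:
        self._outputs: Optional[List[Tensor]] = None
        self._backprop: Optional[Callable[[List[Tensor]], None]] = None

    def receive(self, outputs: List[Tensor], backprop: Callable) -> None:
        # Called by the shared component once per training batch.
        self._outputs, self._backprop = outputs, backprop

    def predict(self, docs: List[ToyDoc]) -> List[Tensor]:
        # At prediction time, read what was stored on each doc.
        return [doc.tensor for doc in docs]

    def begin_update(self) -> Tuple[Optional[List[Tensor]], Optional[Callable]]:
        # At training time, hand back the saved outputs and callback.
        return self._outputs, self._backprop


class ToyTok2Vec:
    """Owns the shared weight and publishes its outputs to listeners."""

    def __init__(self, weight: float = 1.0) -> None:
        self.weight = weight
        self.listeners: List[ToyListener] = []

    def _embed(self, doc: ToyDoc) -> Tensor:
        return [[self.weight * len(word)] for word in doc.words]

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        doc.tensor = self._embed(doc)
        return doc

    def begin_batch(self, docs: List[ToyDoc], lr: float = 0.1) -> None:
        outputs = [self._embed(doc) for doc in docs]

        def backprop(d_outputs: List[Tensor]) -> None:
            # d(output)/d(weight) is len(word) for each token.
            grad = sum(
                d_vec[0] * len(word)
                for doc, d_doc in zip(docs, d_outputs)
                for word, d_vec in zip(doc.words, d_doc)
            )
            self.weight -= lr * grad

        for listener in self.listeners:
            listener.receive(outputs, backprop)


tok2vec = ToyTok2Vec(weight=1.0)
listener = ToyListener()
tok2vec.listeners.append(listener)

doc = tok2vec(ToyDoc(["a", "bcd"]))        # prediction path: sets doc.tensor
tok2vec.begin_batch([doc])                 # training path: saves outputs + callback
outputs, backprop = listener.begin_update()
backprop([[[1.0], [1.0]]])                 # downstream gradient reaches the shared weight
```

The listener never holds a reference to the shared weight itself. It only receives outputs and a callback, which is the property the paragraph above refers to as not relying on object identity.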
 | 
				
			||||||
 | 
					
 | 
				
			||||||
## Config and implementation {#config}
 | 
					## Config and implementation {#config}
 | 
				
			||||||
 | 
					
 | 
				
			||||||
| 
						 | 
					@ -28,8 +41,8 @@ architectures and their arguments and hyperparameters.
 | 
				
			||||||
> ```
 | 
					> ```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
| Setting | Type                                       | Description                                                             | Default                                         |
 | 
					| Setting | Type                                       | Description                                                             | Default                                         |
 | 
				
			||||||
| ------- | ------------------------------------------ | ----------------- | ----------------------------------------------- |
 | 
					| ------- | ------------------------------------------ | ----------------------------------------------------------------------- | ----------------------------------------------- |
 | 
				
			||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |
 | 
					| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```python
 | 
					```python
 | 
				
			||||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tok2vec.py
 | 
					https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tok2vec.py
 | 
				
			||||||
| 
						 | 
					@ -64,9 +77,11 @@ shortcut for this and instantiate the component using its string name and
 | 
				
			||||||
 | 
					
 | 
				
			||||||
## Tok2Vec.\_\_call\_\_ {#call tag="method"}
 | 
					## Tok2Vec.\_\_call\_\_ {#call tag="method"}
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Apply the pipe to one document. The document is modified in place, and returned.
 | 
					Apply the pipe to one document and add context-sensitive embeddings to the
 | 
				
			||||||
This usually happens under the hood when the `nlp` object is called on a text
 | 
					`Doc.tensor` attribute, allowing them to be used as features by downstream
 | 
				
			||||||
and all pipeline components are applied to the `Doc` in order. Both
 | 
					components. The document is modified in place, and returned. This usually
 | 
				
			||||||
 | 
					happens under the hood when the `nlp` object is called on a text and all
 | 
				
			||||||
 | 
					pipeline components are applied to the `Doc` in order. Both
 | 
				
			||||||
[`__call__`](/api/tok2vec#call) and [`pipe`](/api/tok2vec#pipe) delegate to the
 | 
					[`__call__`](/api/tok2vec#call) and [`pipe`](/api/tok2vec#pipe) delegate to the
 | 
				
			||||||
[`predict`](/api/tok2vec#predict) and
 | 
					[`predict`](/api/tok2vec#predict) and
 | 
				
			||||||
[`set_annotations`](/api/tok2vec#set_annotations) methods.
 | 
					[`set_annotations`](/api/tok2vec#set_annotations) methods.
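
The delegation described above can be illustrated with a toy component. This is a sketch under assumptions, not the actual `Tok2Vec` source: `ToyDoc` and `ToyPipe` are hypothetical, and the "embedding" is simply the length of each word.

```python
from typing import Iterable, Iterator, List

Tensor = List[List[float]]


class ToyDoc:
    """Stand-in for a Doc: a list of words plus a tensor slot."""

    def __init__(self, words: List[str]) -> None:
        self.words = words
        self.tensor: Tensor = []


class ToyPipe:
    def predict(self, docs: List[ToyDoc]) -> List[Tensor]:
        # Compute outputs without modifying the docs.
        return [[[float(len(word))] for word in doc.words] for doc in docs]

    def set_annotations(self, docs: List[ToyDoc], outputs: List[Tensor]) -> None:
        # Write the precomputed outputs onto the docs, in place.
        for doc, tensor in zip(docs, outputs):
            doc.tensor = tensor

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        # One document: predict, annotate, return the same object.
        self.set_annotations([doc], self.predict([doc]))
        return doc

    def pipe(self, stream: Iterable[ToyDoc], batch_size: int = 2) -> Iterator[ToyDoc]:
        # A stream of documents: same two steps, applied batch by batch.
        batch: List[ToyDoc] = []
        for doc in stream:
            batch.append(doc)
            if len(batch) == batch_size:
                self.set_annotations(batch, self.predict(batch))
                yield from batch
                batch = []
        if batch:
            self.set_annotations(batch, self.predict(batch))
            yield from batch


component = ToyPipe()
single = component(ToyDoc(["hi", "there"]))
streamed = list(component.pipe([ToyDoc(["a"]), ToyDoc(["bb"]), ToyDoc(["ccc"])]))
```

Keeping `predict` free of side effects and concentrating all mutation in `set_annotations` means both entry points share the same two steps and differ only in batching.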
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||