Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-27 09:44:36 +03:00)

Docs: update trf_data examples and pipeline design info (#13164)

parent da7ad97519
commit e467573550

@@ -400,6 +400,14 @@ identifiers are grouped by token. Instances of this class are typically assigned
 to the [`Doc._.trf_data`](/api/curatedtransformer#assigned-attributes) extension
 attribute.
 
+> #### Example
+>
+> ```python
+> # Get the last hidden layer output for "is" (token index 1)
+> doc = nlp("This is a text.")
+> tensors = doc._.trf_data.last_hidden_layer_state[1]
+> ```
+
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `all_outputs` | List of `Ragged` tensors that correspond to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~ |

@ -397,6 +397,17 @@ are wrapped into the
|
||||||
by this class. Instances of this class are typically assigned to the
|
by this class. Instances of this class are typically assigned to the
|
||||||
[`Doc._.trf_data`](/api/transformer#assigned-attributes) extension attribute.
|
[`Doc._.trf_data`](/api/transformer#assigned-attributes) extension attribute.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> # Get the last hidden layer output for "is" (token index 1)
|
||||||
|
> doc = nlp("This is a text.")
|
||||||
|
> indices = doc._.trf_data.align[1].data.flatten()
|
||||||
|
> last_hidden_state = doc._.trf_data.model_output.last_hidden_state
|
||||||
|
> dim = last_hidden_state.shape[-1]
|
||||||
|
> tensors = last_hidden_state.reshape(-1, dim)[indices]
|
||||||
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ |
|
| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ |
|
||||||
|
|
|
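The flatten-then-index step in the `TransformerData` example can be tried without spaCy or a trained pipeline. The following is a minimal sketch using nested lists in place of the real arrays; `select_piece_rows`, `toy_state` and `toy_indices` are hypothetical names, and the toy values only stand in for `model_output.last_hidden_state` and `align[1].data.flatten()`.

```python
def select_piece_rows(last_hidden_state, indices):
    """Flatten [n_spans][span_len][dim] to [n_spans * span_len][dim], then pick rows.

    Mirrors `last_hidden_state.reshape(-1, dim)[indices]` for nested lists.
    """
    flat = [piece for span in last_hidden_state for piece in span]
    return [flat[i] for i in indices]


# Toy stand-in for model_output.last_hidden_state: 2 spans x 3 wordpieces x 2 dims.
toy_state = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    [[3.0, 3.1], [4.0, 4.1], [5.0, 5.1]],
]

# Toy stand-in for align[1].data.flatten(): rows of the flattened array
# that belong to the token of interest (assumed values, not real alignment).
toy_indices = [1, 4]

rows = select_piece_rows(toy_state, toy_indices)
print(rows)
```

The result has one row per aligned wordpiece, so a token that maps to several pieces yields several vectors rather than one.
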
@@ -108,12 +108,12 @@ In the `sm`/`md`/`lg` models:
 
 #### CNN/CPU pipelines with floret vectors
 
-The Finnish, Korean and Swedish `md` and `lg` pipelines use
-[floret vectors](/usage/v3-2#vectors) instead of default vectors. If you're
-running a trained pipeline on texts and working with [`Doc`](/api/doc) objects,
-you shouldn't notice any difference with floret vectors. With floret vectors no
-tokens are out-of-vocabulary, so [`Token.is_oov`](/api/token#attributes) will
-return `False` for all tokens.
+The Croatian, Finnish, Korean, Slovenian, Swedish and Ukrainian `md` and `lg`
+pipelines use [floret vectors](/usage/v3-2#vectors) instead of default vectors.
+If you're running a trained pipeline on texts and working with [`Doc`](/api/doc)
+objects, you shouldn't notice any difference with floret vectors. With floret
+vectors no tokens are out-of-vocabulary, so
+[`Token.is_oov`](/api/token#attributes) will return `False` for all tokens.
 
 If you access vectors directly for similarity comparisons, there are a few
 differences because floret vectors don't include a fixed word list like the

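To illustrate why no token is out-of-vocabulary when vectors have no fixed word list, here is a toy sketch that builds a vector for any string from hashed character n-grams. It is not floret's actual algorithm or API: the helper names, the CRC32 hash, the n-gram size and the tiny table are all assumptions made for illustration only.

```python
import zlib


def char_ngrams(word, n=3):
    """Return character n-grams of the word padded with boundary markers."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams or [padded]  # very short inputs fall back to the padded string


def toy_subword_vector(word, table):
    """Average the table rows that the word's n-grams hash into (toy scheme)."""
    rows = [table[zlib.crc32(g.encode("utf8")) % len(table)] for g in char_ngrams(word)]
    dim = len(table[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


# Hypothetical 8-row, 2-dimensional table standing in for a trained vector table.
TABLE = [[float(i), float(i) * 0.5] for i in range(8)]

# Any string gets a vector, including one that no word list would contain.
vector = toy_subword_vector("qwertyuiop", TABLE)
print(len(vector))
```

Because every string hashes to some rows of the table, there is no lookup that can miss, which is the property the paragraph above describes for `Token.is_oov`.
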
@@ -132,10 +132,20 @@ vector keys for default vectors.
 
 ### Transformer pipeline design {id="design-trf"}
 
-In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
-all listen to the `transformer` component. The `attribute_ruler` and
+In the transformer (`trf`) pipelines, the `tagger`, `parser` and `ner` (if
+present) all listen to the `transformer` component. The `attribute_ruler` and
 `lemmatizer` have the same configuration as in the CNN models.
 
+For spaCy v3.0-v3.6, `trf` pipelines use
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers) and the
+transformer output in `doc._.trf_data` is a
+[`TransformerData`](/api/transformer#transformerdata) object.
+
+For spaCy v3.7+, `trf` pipelines use
+[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers)
+and `doc._.trf_data` is a
+[`DocTransformerOutput`](/api/curatedtransformer#doctransformeroutput) object.
+
 ### Modifying the default pipeline {id="design-modify"}
 
 For faster processing, you may only want to run a subset of the components in a

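Code that has to work across both pipeline generations needs to know which kind of object `doc._.trf_data` holds. One possible approach, sketched below, is to check for the attributes used in the two examples in this commit (`last_hidden_layer_state` and `model_output`). The helper `trf_data_kind` is hypothetical and not part of spaCy, and the `SimpleNamespace` objects are stand-ins so the sketch runs without spaCy installed.

```python
from types import SimpleNamespace


def trf_data_kind(trf_data):
    """Return which output class `doc._.trf_data` looks like, judged by its attributes."""
    if hasattr(trf_data, "last_hidden_layer_state"):
        return "DocTransformerOutput"  # spaCy v3.7+, spacy-curated-transformers
    if hasattr(trf_data, "model_output"):
        return "TransformerData"  # spaCy v3.0-v3.6, spacy-transformers
    raise TypeError(f"unrecognized trf_data object: {type(trf_data).__name__}")


# Stand-ins that only mimic the attribute names from the examples (assumed shapes).
curated_like = SimpleNamespace(last_hidden_layer_state=[])
legacy_like = SimpleNamespace(model_output=SimpleNamespace(last_hidden_state=[]))

print(trf_data_kind(curated_like))
print(trf_data_kind(legacy_like))
```

In real code the argument would be `doc._.trf_data` from a loaded `trf` pipeline; checking attributes rather than importing either package avoids requiring both to be installed.
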