Docs: update trf_data examples and pipeline design info (#13164)

2025-10-26 13:41:21 +03:00 · 2023-12-04 15:15:54 +01:00 · 2023-12-04 15:15:54 +01:00 · e467573550
commit e467573550
parent da7ad97519
3 changed files with 37 additions and 8 deletions
--- a/website/docs/api/curatedtransformer.mdx
+++ b/website/docs/api/curatedtransformer.mdx
@@ -400,6 +400,14 @@ identifiers are grouped by token. Instances of this class are typically assigned
 to the [`Doc._.trf_data`](/api/curatedtransformer#assigned-attributes) extension
 attribute.
+> #### Example
+>
+> ```python
+> # Get the last hidden layer output for "is" (token index 1)
+> doc = nlp("This is a text.")
+> tensors = doc._.trf_data.last_hidden_layer_state[1]
+> ```
 | Name              | Description                                                                                                                                                                        |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `all_outputs`     | List of `Ragged` tensors that correspond to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~  |
--- a/website/docs/api/transformer.mdx
+++ b/website/docs/api/transformer.mdx
@@ -397,6 +397,17 @@ are wrapped into the
 by this class. Instances of this class are typically assigned to the
 [`Doc._.trf_data`](/api/transformer#assigned-attributes) extension attribute.
+> #### Example
+>
+> ```python
+> # Get the last hidden layer output for "is" (token index 1)
+> doc = nlp("This is a text.")
+> indices = doc._.trf_data.align[1].data.flatten()
+> last_hidden_state = doc._.trf_data.model_output.last_hidden_state
+> dim = last_hidden_state.shape[-1]
+> tensors = last_hidden_state.reshape(-1, dim)[indices]
+> ```
 | Name           | Description                                                                                                                                                                                                                                                                                                                          |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `tokens`       | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~                       |
--- a/website/docs/models/index.mdx
+++ b/website/docs/models/index.mdx
@@ -108,12 +108,12 @@ In the `sm`/`md`/`lg` models:
 #### CNN/CPU pipelines with floret vectors
-The Finnish, Korean and Swedish `md` and `lg` pipelines use
+The Croatian, Finnish, Korean, Slovenian, Swedish and Ukrainian `md` and `lg`
-[floret vectors](/usage/v3-2#vectors) instead of default vectors. If you're
+pipelines use [floret vectors](/usage/v3-2#vectors) instead of default vectors.
-running a trained pipeline on texts and working with [`Doc`](/api/doc) objects,
+If you're running a trained pipeline on texts and working with [`Doc`](/api/doc)
-you shouldn't notice any difference with floret vectors. With floret vectors no
+objects, you shouldn't notice any difference with floret vectors. With floret
-tokens are out-of-vocabulary, so [`Token.is_oov`](/api/token#attributes) will
+vectors no tokens are out-of-vocabulary, so
-return `False` for all tokens.
+[`Token.is_oov`](/api/token#attributes) will return `False` for all tokens.
 If you access vectors directly for similarity comparisons, there are a few
 differences because floret vectors don't include a fixed word list like the
@@ -132,10 +132,20 @@ vector keys for default vectors.
 ### Transformer pipeline design {id="design-trf"}
-In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
+In the transformer (`trf`) pipelines, the `tagger`, `parser` and `ner` (if
-all listen to the `transformer` component. The `attribute_ruler` and
+present) all listen to the `transformer` component. The `attribute_ruler` and
 `lemmatizer` have the same configuration as in the CNN models.
+For spaCy v3.0-v3.6, `trf` pipelines use
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers) and the
+transformer output in `doc._.trf_data` is a
+[`TransformerData`](/api/transformer#transformerdata) object.
+For spaCy v3.7+, `trf` pipelines use
+[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers)
+and `doc._.trf_data` is a
+[`DocTransformerOutput`](/api/curatedtransformer#doctransformeroutput) object.
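Reviewer note: since `doc._.trf_data` changes type between the two version ranges, code that must run on both may want to branch on the object it receives. The following is only a minimal sketch built from the two examples added in this commit. The helper name `last_hidden_for_token` is hypothetical, and the duck-typing check on `last_hidden_layer_state` is an assumption, not a documented way to distinguish the two classes.

```python
def last_hidden_for_token(trf_data, token_index):
    """Return the last hidden layer output aligned to one token.

    Hypothetical helper: it mirrors the two documented examples and picks
    a branch by checking which attributes the object exposes.
    """
    if hasattr(trf_data, "last_hidden_layer_state"):
        # Assumed to be a DocTransformerOutput (spacy-curated-transformers).
        return trf_data.last_hidden_layer_state[token_index]

    # Otherwise assumed to be a TransformerData (spacy-transformers).
    indices = trf_data.align[token_index].data.flatten()
    last_hidden_state = trf_data.model_output.last_hidden_state
    dim = last_hidden_state.shape[-1]
    return last_hidden_state.reshape(-1, dim)[indices]
```

With a loaded `trf` pipeline this would be called as `last_hidden_for_token(nlp("This is a text.")._.trf_data, 1)`. The two branches return whatever the respective examples return, so the result type is not guaranteed to match across versions.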
 ### Modifying the default pipeline {id="design-modify"}
 For faster processing, you may only want to run a subset of the components in a