diff --git a/website/docs/api/curatedtransformer.mdx b/website/docs/api/curatedtransformer.mdx
index 5fdbd86cb..3e63ef7c2 100644
--- a/website/docs/api/curatedtransformer.mdx
+++ b/website/docs/api/curatedtransformer.mdx
@@ -400,6 +400,14 @@ identifiers are grouped by token. Instances of this class are typically assigned
 to the [`Doc._.trf_data`](/api/curatedtransformer#assigned-attributes) extension
 attribute.
 
+> #### Example
+>
+> ```python
+> # Get the last hidden layer output for "is" (token index 1)
+> doc = nlp("This is a text.")
+> tensors = doc._.trf_data.last_hidden_layer_state[1]
+> ```
+
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `all_outputs` | List of `Ragged` tensors that correspends to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~ |
diff --git a/website/docs/api/transformer.mdx b/website/docs/api/transformer.mdx
index ad8ecce54..8f024553d 100644
--- a/website/docs/api/transformer.mdx
+++ b/website/docs/api/transformer.mdx
@@ -397,6 +397,17 @@ are wrapped into the by this class. Instances of this class are typically
 assigned to the [`Doc._.trf_data`](/api/transformer#assigned-attributes)
 extension attribute.
 
+> #### Example +> +> ```python +> # Get the last hidden layer output for "is" (token index 1) +> doc = nlp("This is a text.") +> indices = doc._.trf_data.align[1].data.flatten() +> last_hidden_state = doc._.trf_data.model_output.last_hidden_state +> dim = last_hidden_state.shape[-1] +> tensors = last_hidden_state.reshape(-1, dim)[indices] +> ``` + | Name | Description | | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ | diff --git a/website/docs/models/index.mdx b/website/docs/models/index.mdx index 366d44f0e..54f3c4906 100644 --- a/website/docs/models/index.mdx +++ b/website/docs/models/index.mdx @@ -108,12 +108,12 @@ In the `sm`/`md`/`lg` models: #### CNN/CPU pipelines with floret vectors -The Finnish, Korean and Swedish `md` and `lg` pipelines use -[floret vectors](/usage/v3-2#vectors) instead of default vectors. If you're -running a trained pipeline on texts and working with [`Doc`](/api/doc) objects, -you shouldn't notice any difference with floret vectors. With floret vectors no -tokens are out-of-vocabulary, so [`Token.is_oov`](/api/token#attributes) will -return `False` for all tokens. +The Croatian, Finnish, Korean, Slovenian, Swedish and Ukrainian `md` and `lg` +pipelines use [floret vectors](/usage/v3-2#vectors) instead of default vectors. 
+If you're running a trained pipeline on texts and working with [`Doc`](/api/doc)
+objects, you shouldn't notice any difference with floret vectors. With floret
+vectors no tokens are out-of-vocabulary, so
+[`Token.is_oov`](/api/token#attributes) will return `False` for all tokens.
 
 If you access vectors directly for similarity comparisons, there are a few
 differences because floret vectors don't include a fixed word list like the
@@ -132,10 +132,20 @@ vector keys for default vectors.
 
 ### Transformer pipeline design {id="design-trf"}
 
-In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
-all listen to the `transformer` component. The `attribute_ruler` and
+In the transformer (`trf`) pipelines, the `tagger`, `parser` and `ner` (if
+present) all listen to the `transformer` component. The `attribute_ruler` and
 `lemmatizer` have the same configuration as in the CNN models.
 
+For spaCy v3.0-v3.6, `trf` pipelines use
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers) and the
+transformer output in `doc._.trf_data` is a
+[`TransformerData`](/api/transformer#transformerdata) object.
+
+For spaCy v3.7+, `trf` pipelines use
+[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers)
+and `doc._.trf_data` is a
+[`DocTransformerOutput`](/api/curatedtransformer#doctransformeroutput) object.
+
 ### Modifying the default pipeline {id="design-modify"}
 
 For faster processing, you may only want to run a subset of the components in a