mirror of https://github.com/explosion/spaCy.git, synced 2025-10-26 13:41:21 +03:00
	Docs: update trf_data examples and pipeline design info (#13164)
parent da7ad97519
commit e467573550
|  | @@ -400,6 +400,14 @@ identifiers are grouped by token. Instances of this class are typically assigned | |||
| to the [`Doc._.trf_data`](/api/curatedtransformer#assigned-attributes) extension | ||||
| attribute. | ||||
| 
 | ||||
| > #### Example | ||||
| > | ||||
| > ```python | ||||
| > # Get the last hidden layer output for "is" (token index 1) | ||||
| > doc = nlp("This is a text.") | ||||
| > tensors = doc._.trf_data.last_hidden_layer_state[1] | ||||
| > ``` | ||||
| 
 | ||||
| | Name              | Description                                                                                                                                                                        | | ||||
| | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||||
| | `all_outputs`     | List of `Ragged` tensors that correspond to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~  | | ||||
|  |  | |||
|  | @@ -397,6 +397,17 @@ are wrapped into the | |||
| by this class. Instances of this class are typically assigned to the | ||||
| [`Doc._.trf_data`](/api/transformer#assigned-attributes) extension attribute. | ||||
| 
 | ||||
| > #### Example | ||||
| > | ||||
| > ```python | ||||
| > # Get the last hidden layer output for "is" (token index 1) | ||||
| > doc = nlp("This is a text.") | ||||
| > indices = doc._.trf_data.align[1].data.flatten() | ||||
| > last_hidden_state = doc._.trf_data.model_output.last_hidden_state | ||||
| > dim = last_hidden_state.shape[-1] | ||||
| > tensors = last_hidden_state.reshape(-1, dim)[indices] | ||||
| > ``` | ||||
| 
 | ||||
| | Name           | Description                                                                                                                                                                                                                                                                                                                          | | ||||
| | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ||||
| | `tokens`       | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~                       | | ||||
|  |  | |||
|  | @@ -108,12 +108,12 @@ In the `sm`/`md`/`lg` models: | |||
| 
 | ||||
| #### CNN/CPU pipelines with floret vectors | ||||
| 
 | ||||
| The Finnish, Korean and Swedish `md` and `lg` pipelines use | ||||
| [floret vectors](/usage/v3-2#vectors) instead of default vectors. If you're | ||||
| running a trained pipeline on texts and working with [`Doc`](/api/doc) objects, | ||||
| you shouldn't notice any difference with floret vectors. With floret vectors no | ||||
| tokens are out-of-vocabulary, so [`Token.is_oov`](/api/token#attributes) will | ||||
| return `False` for all tokens. | ||||
| The Croatian, Finnish, Korean, Slovenian, Swedish and Ukrainian `md` and `lg` | ||||
| pipelines use [floret vectors](/usage/v3-2#vectors) instead of default vectors. | ||||
| If you're running a trained pipeline on texts and working with [`Doc`](/api/doc) | ||||
| objects, you shouldn't notice any difference with floret vectors. With floret | ||||
| vectors no tokens are out-of-vocabulary, so | ||||
| [`Token.is_oov`](/api/token#attributes) will return `False` for all tokens. | ||||
| 
 | ||||
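The `Token.is_oov` behaviour described above can be checked with a small helper. This is a minimal sketch, not part of the commit: the helper name `oov_tokens` is hypothetical, and the commented usage assumes a floret-based pipeline such as `fi_core_news_md` is installed.

```python
def oov_tokens(doc):
    """Return the texts of all tokens that the pipeline flags as out-of-vocabulary."""
    return [token.text for token in doc if token.is_oov]


# Usage (assumes spaCy and a floret-based pipeline are installed):
# import spacy
# nlp = spacy.load("fi_core_news_md")
# doc = nlp("Tämä on teksti.")
# oov_tokens(doc)  # per the text above, floret pipelines should flag no tokens
```

With default (non-floret) vectors the same helper may return a non-empty list, since `is_oov` then depends on the fixed word list.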
| If you access vectors directly for similarity comparisons, there are a few | ||||
| differences because floret vectors don't include a fixed word list like the | ||||
|  | @@ -132,10 +132,20 @@ vector keys for default vectors. | |||
| 
 | ||||
| ### Transformer pipeline design {id="design-trf"} | ||||
| 
 | ||||
| In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present) | ||||
| all listen to the `transformer` component. The `attribute_ruler` and | ||||
| In the transformer (`trf`) pipelines, the `tagger`, `parser` and `ner` (if | ||||
| present) all listen to the `transformer` component. The `attribute_ruler` and | ||||
| `lemmatizer` have the same configuration as in the CNN models. | ||||
| 
 | ||||
| For spaCy v3.0-v3.6, `trf` pipelines use | ||||
| [`spacy-transformers`](https://github.com/explosion/spacy-transformers) and the | ||||
| transformer output in `doc._.trf_data` is a | ||||
| [`TransformerData`](/api/transformer#transformerdata) object. | ||||
| 
 | ||||
| For spaCy v3.7+, `trf` pipelines use | ||||
| [`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers) | ||||
| and `doc._.trf_data` is a | ||||
| [`DocTransformerOutput`](/api/curatedtransformer#doctransformeroutput) object. | ||||
| 
 | ||||
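Code that has to run against both generations of `trf` pipelines needs to handle both `doc._.trf_data` types. The sketch below is one possible approach, not an official API: the helper name `token_trf_output` is hypothetical, and telling the two types apart by checking for the `last_hidden_layer_state` attribute is an assumption. The two branches reuse the access patterns from the `DocTransformerOutput` and `TransformerData` examples in this commit.

```python
def token_trf_output(doc, i):
    """Return the last hidden layer output for the token at index i.

    Assumption: the object type is detected by duck typing on
    `last_hidden_layer_state`, which only DocTransformerOutput provides.
    """
    trf_data = doc._.trf_data
    if hasattr(trf_data, "last_hidden_layer_state"):
        # spaCy v3.7+ (spacy-curated-transformers): DocTransformerOutput
        return trf_data.last_hidden_layer_state[i]
    # spaCy v3.0-v3.6 (spacy-transformers): TransformerData
    indices = trf_data.align[i].data.flatten()
    last_hidden_state = trf_data.model_output.last_hidden_state
    dim = last_hidden_state.shape[-1]
    return last_hidden_state.reshape(-1, dim)[indices]


# Usage (assumes spaCy and a trf pipeline such as en_core_web_trf are installed):
# import spacy
# nlp = spacy.load("en_core_web_trf")
# doc = nlp("This is a text.")
# tensors = token_trf_output(doc, 1)  # output for "is"
```

Note that the two branches return different container types (a `Ragged` slice versus a plain array), so downstream code may still need to branch on the result.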
| ### Modifying the default pipeline {id="design-modify"} | ||||
| 
 | ||||
| For faster processing, you may only want to run a subset of the components in a | ||||
|  |  | |||