Update docs [ci skip]

parent b0f57a0cac
commit 158d8c1e48
@@ -26,6 +26,8 @@ TODO: intro and how architectures work, link to
 
 ### spacy-transformers.TransformerModel.v1 {#TransformerModel}
 
+### spacy-transformers.Tok2VecListener.v1 {#spacy-transformers.Tok2VecListener.v1}
+
 ## Parser & NER architectures {#parser source="spacy/ml/models/parser.py"}
 
 ### spacy.TransitionBasedParser.v1 {#TransitionBasedParser}
@@ -304,6 +304,31 @@ factories.
 | `losses`          | Registry for functions that create [losses](https://thinc.ai/docs/api-loss).               |
 | `initializers`    | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
 
+### spacy-transformers registry {#registry-transformers}
+
+The following registries are added by the
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package.
+See the [`Transformer`](/api/transformer) API reference and
+[usage docs](/usage/transformers) for details.
+
+> #### Example
+>
+> ```python
+> import spacy_transformers
+>
+> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
+> def configure_custom_annotation_setter():
+>     def annotation_setter(docs, trf_data) -> None:
+>         # Set annotations on the docs
+>         ...
+>
+>     return annotation_setter
+> ```
+
+| Registry name                                                | Description                                                                                                                                                                                                                                        |
+| ------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [`span_getters`](/api/transformer#span_getters)              | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences.                                                                                                      |
+| [`annotation_setters`](/api/transformer#annotation_setters)  | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
+
 ## Training data and alignment {#gold source="spacy/gold"}
 
 ### gold.docs_to_json {#docs_to_json tag="function"}
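Because these registries are built on [`catalogue`](https://github.com/explosion/catalogue), a function registered this way can also be looked up again by name. A minimal sketch, reusing the example name registered above:

```python
import spacy_transformers

# Look up the factory registered in the example above (this assumes the
# registration has already run) and call it to create the actual setter
factory = spacy_transformers.registry.annotation_setters.get("my_annotation_setter.v1")
annotation_setter = factory()
```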
@@ -31,8 +31,10 @@ attributes. We also calculate an alignment between the word-piece tokens and the
 spaCy tokenization, so that we can use the last hidden states to set the
 `Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
 token, the spaCy token receives the sum of their values. To access the values,
-you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. For
-more details, see the [usage documentation](/usage/transformers).
+you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
+package also adds the function registries [`@span_getters`](#span_getters) and
+[`@annotation_setters`](#annotation_setters) with several built-in registered
+functions. For more details, see the [usage documentation](/usage/transformers).
 
 ## Config and implementation {#config}
@@ -51,11 +53,11 @@ architectures and their arguments and hyperparameters.
 > nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
 > ```
 
 | Setting             | Type                                       | Description                     | Default                                                  |
 | ------------------- | ------------------------------------------ | ------------------------------- | -------------------------------------------------------- |
 | `max_batch_items`   | int                                        | Maximum size of a padded batch. | `4096`                                                   |
-| `annotation_setter` | Callable                                   | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](#fulltransformerbatch) and can set additional annotations on the `Doc`.                 | `null_annotation_setter` |
+| `annotation_setter` | Callable                                   | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | `null_annotation_setter` |
 | `model`             | [`Model`](https://thinc.ai/docs/api-model) | The model to use.               | [TransformerModel](/api/architectures#TransformerModel)  |
 
 ```python
 https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
@@ -390,6 +392,72 @@ Split a `TransformerData` object that represents a batch into a list with one
 | ----------- | ----------------------- | -------------- |
 | **RETURNS** | `List[TransformerData]` | <!-- TODO: --> |
 
+## Span getters {#span_getters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}
+
+Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
+return a list of [`Span`](/api/span) objects for each doc, to be processed by
+the transformer. The returned spans can overlap.
+
+<!-- TODO: details on what this is for -->
+
+Span getters can be referenced in the config's
+`[components.transformer.model.get_spans]` block to customize the sequences
+processed by the transformer. You can also register custom span getters using
+the `@registry.span_getters` decorator.
+
+> #### Example
+>
+> ```python
+> @registry.span_getters("sent_spans.v1")
+> def configure_get_sent_spans() -> Callable:
+>     def get_sent_spans(docs: Iterable[Doc]) -> List[List[Span]]:
+>         return [list(doc.sents) for doc in docs]
+>
+>     return get_sent_spans
+> ```
+
+| Name        | Type               | Description                                                  |
+| ----------- | ------------------ | ------------------------------------------------------------ |
+| `docs`      | `Iterable[Doc]`    | A batch of `Doc` objects.                                    |
+| **RETURNS** | `List[List[Span]]` | The spans to process by the transformer, one list per `Doc`. |
+
+The following built-in functions are available:
+
+| Name               | Description                                                        |
+| ------------------ | ------------------------------------------------------------------ |
+| `doc_spans.v1`     | Create a span for each doc (no transformation, process each text). |
+| `sent_spans.v1`    | Create a span for each sentence if sentence boundaries are set.    |
+| `strided_spans.v1` | <!-- TODO: -->                                                     |
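The `strided_spans.v1` description above is still marked TODO in the docs source. To give a feel for what a windowed getter looks like, here is a hypothetical strided span getter; the registered name, the `window`/`stride` parameters and the logic are illustrative assumptions, not the library's implementation:

```python
import spacy_transformers

@spacy_transformers.registry.span_getters("my_strided_spans.v1")  # hypothetical name
def configure_strided_spans(window: int = 128, stride: int = 96):
    def get_strided_spans(docs):
        # One list of (possibly overlapping) spans per doc: windows of
        # `window` tokens starting every `stride` tokens
        spans = []
        for doc in docs:
            doc_spans = []
            for start in range(0, max(len(doc), 1), stride):
                doc_spans.append(doc[start : start + window])
            spans.append(doc_spans)
        return spans

    return get_strided_spans
```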
+## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}
+
+Annotation setters are functions that take a batch of `Doc` objects and a
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
+additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
+You can register custom annotation setters using the
+`@registry.annotation_setters` decorator.
+
+> #### Example
+>
+> ```python
+> @registry.annotation_setters("spacy-transformer.null_annotation_setter.v1")
+> def configure_null_annotation_setter() -> Callable:
+>     def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
+>         pass
+>
+>     return setter
+> ```
+
+| Name       | Type                   | Description                          |
+| ---------- | ---------------------- | ------------------------------------ |
+| `docs`     | `List[Doc]`            | A batch of `Doc` objects.            |
+| `trf_data` | `FullTransformerBatch` | The transformers data for the batch. |
+
+The following built-in functions are available:
+
+| Name                                          | Description                           |
+| --------------------------------------------- | ------------------------------------- |
+| `spacy-transformer.null_annotation_setter.v1` | Don't set any additional annotations. |
+
 ## Custom attributes {#custom-attributes}
 
 The component sets the following
website/docs/images/pipeline_transformer.svg (new file, 37 lines, 14 KiB)
@@ -0,0 +1,37 @@
+[SVG source of the transformer pipeline diagram omitted]
website/docs/usage/transformers.md
@@ -1,10 +1,17 @@
 ---
 title: Transformers
 teaser: Using transformer models like BERT in spaCy
+menu:
+  - ['Installation', 'install']
+  - ['Runtime Usage', 'runtime']
+  - ['Training Usage', 'training']
 ---
 
+## Installation {#install hidden="true"}
+
 spaCy v3.0 lets you use almost **any statistical model** to power your pipeline.
-You can use models implemented in a variety of frameworks, including TensorFlow,
+You can use models implemented in a variety of
+[frameworks](https://thinc.ai/docs/usage-frameworks), including TensorFlow,
 PyTorch and MXNet. To keep things sane, spaCy expects models from these
 frameworks to be wrapped with a common interface, using our machine learning
 library [Thinc](https://thinc.ai). A transformer model is just a statistical
@@ -15,34 +22,110 @@ that do the required plumbing. We also provide a pipeline component,
 [`Transformer`](/api/transformer), that lets you do multi-task learning and lets
 you save the transformer outputs for later use.
 
-<Project id="en_core_bert">
-
-Try out a BERT-based model pipeline using this project template: swap in your
-data, edit the settings and hyperparameters and train, evaluate, package and
-visualize your model.
-
-</Project>
-
-<!-- TODO: the text below has been copied from the spacy-transformers repo and needs to be updated and adjusted
-
-### Training usage
+To use transformers with spaCy, you need the
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
+installed. It takes care of all the setup behind the scenes, and makes sure the
+transformer pipeline component is available to spaCy.
+
+```bash
+$ pip install spacy-transformers
+```
+
+<!-- TODO: the text below has been copied from the spacy-transformers repo and needs to be updated and adjusted -->
+
+## Runtime usage {#runtime}
+
+Transformer models can be used as **drop-in replacements** for other types of
+neural networks, so your spaCy pipeline can include them in a way that's
+completely invisible to the user. Users will download, load and use the model in
+the standard way, like any other spaCy pipeline. Instead of using the
+transformers as subnetworks directly, you can also use them via the
+[`Transformer`](/api/transformer) pipeline component.
+
+The `Transformer` component sets the
+[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
+which lets you access the transformer outputs at runtime.
+
+```bash
+$ python -m spacy download en_core_trf_lg
+```
+
+```python
+### Example
+import spacy
+
+nlp = spacy.load("en_core_trf_lg")
+for doc in nlp.pipe(["some text", "some other text"]):
+    tokvecs = doc._.trf_data.tensors[-1]
+```
+
+You can also customize how the [`Transformer`](/api/transformer) component sets
+annotations onto the [`Doc`](/api/doc) by providing a custom
+`annotation_setter`. This callback is called with a batch of [`Doc`](/api/doc)
+objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
+containing the raw input and output data for the whole batch, allowing you to
+implement whatever you need.
+
+```python
+def custom_annotation_setter(docs, trf_data):
+    # TODO:
+    ...
+
+nlp = spacy.load("en_core_trf_lg")
+nlp.get_pipe("transformer").annotation_setter = custom_annotation_setter
+doc = nlp("This is a text")
+print()  # TODO:
+```
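The `TODO` markers in the snippet above are in the docs source. Purely as an illustration of the callback's shape, a setter could stash the batch output on a custom extension attribute; the `trf_last_hidden` name and the logic are assumptions for this sketch, and `trf_data.tensors` is assumed to expose the output tensors as in the runtime example above:

```python
import spacy
from spacy.tokens import Doc

# Hypothetical extension attribute, registered once at import time
Doc.set_extension("trf_last_hidden", default=None)

def custom_annotation_setter(docs, trf_data):
    # trf_data covers the whole batch, so every doc gets the same tensor here
    for doc in docs:
        doc._.trf_last_hidden = trf_data.tensors[-1]

nlp = spacy.load("en_core_trf_lg")
nlp.get_pipe("transformer").annotation_setter = custom_annotation_setter
doc = nlp("This is a text")
print(doc._.trf_last_hidden.shape)
```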
 
+## Training usage {#training}
 
 The recommended workflow for training is to use spaCy's
 [config system](/usage/training#config), usually via the
-[`spacy train`](/api/cli#train) command. The config system lets you describe a
-tree of objects by referring to creation functions, including functions you
-register yourself. Here's a config snippet for the `Transformer` component,
-along with matching Python code.
+[`spacy train`](/api/cli#train) command. The training config defines all
+component settings and hyperparameters in one place and lets you describe a tree
+of objects by referring to creation functions, including functions you register
+yourself.
+
+<Project id="en_core_bert">
+
+The easiest way to get started is to clone a transformers-based project
+template. Swap in your data, edit the settings and hyperparameters and train,
+evaluate, package and visualize your model.
+
+</Project>
+
+The `[components]` section in the [`config.cfg`](#TODO:) describes the pipeline
+components and the settings used to construct them, including their model
+implementation. Here's a config snippet for the
+[`Transformer`](/api/transformer) component, along with matching Python code:
+
+> #### Python equivalent
+>
+> ```python
+> from spacy_transformers import Transformer, TransformerModel
+> from spacy_transformers.annotation_setters import null_annotation_setter
+> from spacy_transformers.span_getters import get_doc_spans
+>
+> trf = Transformer(
+>     nlp.vocab,
+>     TransformerModel(
+>         "bert-base-cased",
+>         get_spans=get_doc_spans,
+>         tokenizer_config={"use_fast": True},
+>     ),
+>     annotation_setter=null_annotation_setter,
+>     max_batch_items=4096,
+> )
+> ```
 
 ```ini
-[nlp]
-lang = "en"
-pipeline = ["transformer"]
-
+### config.cfg (excerpt)
 [components.transformer]
 factory = "transformer"
-extra_annotation_setter = null
-max_batch_size = 32
+max_batch_items = 4096
 
 [components.transformer.model]
 @architectures = "spacy-transformers.TransformerModel.v1"
@@ -50,46 +133,110 @@ name = "bert-base-cased"
 tokenizer_config = {"use_fast": true}
 
 [components.transformer.model.get_spans]
-@span_getters = "get_doc_spans.v1"
+@span_getters = "doc_spans.v1"
+
+[components.transformer.annotation_setter]
+@annotation_setters = "spacy-transformer.null_annotation_setter.v1"
 ```
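The same settings can also be supplied in Python when adding the pipe, mirroring the config excerpt above. A sketch, assuming nested `@`-references in the `config` dict are resolved the same way as in the config file:

```python
import spacy
import spacy_transformers  # noqa: F401 - importing registers the "transformer" factory

config = {
    "max_batch_items": 4096,
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v1",
        "name": "bert-base-cased",
        "tokenizer_config": {"use_fast": True},
        "get_spans": {"@span_getters": "doc_spans.v1"},
    },
}

nlp = spacy.blank("en")
trf = nlp.add_pipe("transformer", config=config)
```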
 
+The `[components.transformer.model]` block describes the `model` argument
+passed to the transformer component: a Thinc
+[`Model`](https://thinc.ai/docs/api-model) object. Here, it references the
+function
+[spacy-transformers.TransformerModel.v1](/api/architectures#TransformerModel)
+registered in the [`architectures` registry](/api/top-level#registry). If a key
+in a block starts with `@`, it's **resolved to a function** and all other
+settings are passed to the function as arguments – in this case, `name`,
+`tokenizer_config` and `get_spans`.
+
+`get_spans` is a function that takes a batch of `Doc` objects and returns lists
+of potentially overlapping `Span` objects to process by the transformer.
+Several [built-in functions](/api/transformer#span-getters) are available – for
+example, to process the whole document or individual sentences. When the config
+is resolved, the function is created and passed into the model as an argument.
+
+<Infobox variant="warning">
+
+Remember that the `config.cfg` used for training should contain **no missing
+values** and requires all settings to be defined. You don't want any hidden
+defaults creeping in and changing your results! spaCy will tell you if settings
+are missing, and you can run [`spacy debug config`](/api/cli#debug-config) with
+`--auto-fill` to automatically fill in all defaults.
+
+<!-- TODO: update with details on getting started with a config -->
+
+</Infobox>
+
+### Customizing the settings {#training-custom-settings}
+
+To change any of the settings, you can edit the `config.cfg` and re-run the
+training. To change any of the functions, like the span getter, you can replace
+the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to
+process sentences. You can also register your own functions using the
+`span_getters` registry:
+
+> #### config.cfg
+>
+> ```ini
+> [components.transformer.model.get_spans]
+> @span_getters = "custom_sent_spans"
+> ```
 
 ```python
-from spacy_transformers import Transformer
-
-trf = Transformer(
-    nlp.vocab,
-    TransformerModel(
-        "bert-base-cased",
-        get_spans=get_doc_spans,
-        tokenizer_config={"use_fast": True},
-    ),
-    annotation_setter=null_annotation_setter,
-    max_batch_size=32,
-)
+### code.py
+import spacy_transformers
+
+@spacy_transformers.registry.span_getters("custom_sent_spans")
+def configure_custom_sent_spans():
+    # TODO: write custom example
+    def get_sent_spans(docs):
+        return [list(doc.sents) for doc in docs]
+
+    return get_sent_spans
 ```
 
-The `components.transformer` block adds the `transformer` component to the
-pipeline, and the `components.transformer.model` block describes the creation of
-a Thinc [`Model`](https://thinc.ai/docs/api-model) object that will be passed
-into the component. The block names a function registered in the
-`@architectures` registry. This function will be looked up and called using the
-provided arguments. You're not limited to just that function --- you can write
-your own or use someone else's. The only limitation is that it must return an
-object of type `Model[List[Doc], FullTransformerBatch]`: that is, a Thinc model
-that takes a list of `Doc` objects, and returns a `FullTransformerBatch` object
-with the transformer data.
+To resolve the config during training, spaCy needs to know about your custom
+function. You can make it available via the `--code` argument, which can point
+to a Python file:
+
+```bash
+$ python -m spacy train ./train.spacy ./dev.spacy ./config.cfg --code ./code.py
+```
+
+### Customizing the model implementations {#training-custom-model}
+
+The [`Transformer`](/api/transformer) component expects a Thinc
+[`Model`](https://thinc.ai/docs/api-model) object to be passed in as its
+`model` argument. You're not limited to the implementation provided by
+`spacy-transformers` – the only requirement is that your registered function
+must return an object of type `Model[List[Doc], FullTransformerBatch]`: that
+is, a Thinc model that takes a list of [`Doc`](/api/doc) objects, and returns a
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) object with the
+transformer data.
+
+> #### Model type annotations
+>
+> In the documentation and code base, you may come across type annotations and
+> descriptions of [Thinc](https://thinc.ai) model types, like
+> `Model[List[Doc], List[Floats2d]]`. This so-called generic type describes the
+> layer and its input and output type – in this case, it takes a list of `Doc`
+> objects as the input and a list of 2-dimensional arrays of floats as the
+> output. You can read more about defining Thinc models
+> [here](https://thinc.ai/docs/usage-models). Also see the docs on
+> [type checking](https://thinc.ai/docs/usage-type-checking) for how to enable
+> linting in your editor to see live feedback if your inputs and outputs don't
+> match.
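To make the requirement concrete, here is a minimal sketch of such a registered function. It simply delegates to the built-in `TransformerModel`; the name `"CustomTransformer.v1"` is made up for illustration, and registering via `spacy.util.registry` and importing `FullTransformerBatch` from `spacy_transformers.data_classes` are assumptions about the package layout:

```python
from typing import List

from spacy.tokens import Doc
from spacy.util import registry
from spacy_transformers import TransformerModel
from spacy_transformers.data_classes import FullTransformerBatch
from spacy_transformers.span_getters import get_doc_spans
from thinc.api import Model

@registry.architectures("CustomTransformer.v1")  # hypothetical name
def create_custom_transformer() -> Model[List[Doc], FullTransformerBatch]:
    # Any Model[List[Doc], FullTransformerBatch] works here - this sketch
    # wraps the stock implementation instead of building one from scratch
    return TransformerModel(
        "bert-base-cased",
        get_spans=get_doc_spans,
        tokenizer_config={"use_fast": True},
    )
```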
-The same idea applies to task models that power the downstream components. Most
-of spaCy's built-in model creation functions support a `tok2vec` argument, which
-should be a Thinc layer of type `Model[List[Doc], List[Floats2d]]`. This is
-where we'll plug in our transformer model, using the `Tok2VecTransformer` layer,
-which sneakily delegates to the `Transformer` pipeline component.
+The same idea applies to task models that power the **downstream components**.
+Most of spaCy's built-in model creation functions support a `tok2vec` argument,
+which should be a Thinc layer of type `Model[List[Doc], List[Floats2d]]`. This
+is where we'll plug in our transformer model, using the
+[Tok2VecListener](/api/architectures#Tok2VecListener) layer, which sneakily
+delegates to the `Transformer` pipeline component.
 
 ```ini
-[nlp]
-lang = "en"
-pipeline = ["ner"]
-
+### config.cfg (excerpt) {highlight="12"}
 [components.ner]
 factory = "ner"
@@ -108,49 +255,24 @@ grad_factor = 1.0
 @layers = "reduce_mean.v1"
 ```
 
-The `Tok2VecListener` layer expects a `pooling` layer, which needs to be of type
-`Model[Ragged, Floats2d]`. This layer determines how the vector for each spaCy
-token will be computed from the zero or more source rows the token is aligned
-against. Here we use the `reduce_mean` layer, which averages the wordpiece rows.
-We could instead use `reduce_last`, `reduce_max`, or a custom function you write
-yourself.
+The [Tok2VecListener](/api/architectures#Tok2VecListener) layer expects a
+[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops), which needs to
+be of type `Model[Ragged, Floats2d]`. This layer determines how the vector for
+each spaCy token will be computed from the zero or more source rows the token is
+aligned against. Here we use the
+[`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which
+averages the wordpiece rows. We could instead use `reduce_last`,
+[`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom
+function you write yourself.
+
+<!-- TODO: reduce_last: undocumented? -->
 
 You can have multiple components all listening to the same transformer model,
 and all passing gradients back to it. By default, all of the gradients will be
-equally weighted. You can control this with the `grad_factor` setting, which
+**equally weighted**. You can control this with the `grad_factor` setting, which
 lets you reweight the gradients from the different listeners. For instance,
 setting `grad_factor = 0` would disable gradients from one of the listeners,
 while `grad_factor = 2.0` would multiply them by 2. This is similar to having a
 custom learning rate for each component. Instead of a constant, you can also
 provide a schedule, allowing you to freeze the shared parameters at the start of
 training.
-
-### Runtime usage
-
-Transformer models can be used as drop-in replacements for other types of neural
-networks, so your spaCy pipeline can include them in a way that's completely
-invisible to the user. Users will download, load and use the model in the
-standard way, like any other spaCy pipeline.
-
-Instead of using the transformers as subnetworks directly, you can also use them
-via the [`Transformer`](/api/transformer) pipeline component. This sets the
-[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
-which lets you access the transformers outputs at runtime via the
-`doc._.trf_data` extension attribute. You can also customize how the
-`Transformer` object sets annotations onto the `Doc`, by customizing the
-`Transformer.annotation_setter` object. This callback will be called with the
-raw input and output data for the whole batch, along with the batch of `Doc`
-objects, allowing you to implement whatever you need.
-
-```python
-import spacy
-
-nlp = spacy.load("en_core_trf_lg")
-for doc in nlp.pipe(["some text", "some other text"]):
-    doc._.trf_data.tensors
-    tokvecs = doc._.trf_data.tensors[-1]
-```
-
-The `nlp` object in this example is just like any other spaCy pipeline
-
--->
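As a small companion to the pooling discussion above, the config's `reduce_mean.v1` corresponds to Thinc's `reduce_mean` layer, which maps a `Ragged` batch to one row per group. A sketch using toy data:

```python
import numpy
from thinc.api import reduce_mean
from thinc.types import Ragged

# Pooling layer of type Model[Ragged, Floats2d]: one output row per spaCy
# token, averaged over that token's aligned wordpiece rows
pooling = reduce_mean()

# Toy input: 3 wordpiece rows of width 4, grouped into 2 tokens (2 rows + 1 row)
data = Ragged(numpy.ones((3, 4), dtype="f"), numpy.array([2, 1], dtype="i"))
vectors = pooling.predict(data)
print(vectors.shape)  # (2, 4)
```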