Update docs [ci skip]

Commit 158d8c1e48 (parent b0f57a0cac) in https://github.com/explosion/spaCy.git

@@ -26,6 +26,8 @@ TODO: intro and how architectures work, link to

### spacy-transformers.TransformerModel.v1 {#TransformerModel}

### spacy-transformers.Tok2VecListener.v1 {#spacy-transformers.Tok2VecListener.v1}

## Parser & NER architectures {#parser source="spacy/ml/models/parser.py"}

### spacy.TransitionBasedParser.v1 {#TransitionBasedParser}

@@ -304,6 +304,31 @@ factories.

| `losses`       | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |

### spacy-transformers registry {#registry-transformers}

The following registries are added by the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package.
See the [`Transformer`](/api/transformer) API reference and
[usage docs](/usage/transformers) for details.

> #### Example
>
> ```python
> import spacy_transformers
>
> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
> def configure_custom_annotation_setter():
>     def annotation_setter(docs, trf_data) -> None:
>         # Set annotations on the docs
>         ...
>
>     return annotation_setter
> ```

| Registry name                                                | Description |
| ------------------------------------------------------------ | ----------- |
| [`span_getters`](/api/transformer#span_getters)              | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
| [`annotation_setters`](/api/transformer#annotation_setters)  | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |

## Training data and alignment {#gold source="spacy/gold"}

### gold.docs_to_json {#docs_to_json tag="function"}

@@ -31,8 +31,10 @@ attributes. We also calculate an alignment between the word-piece tokens and the

spaCy tokenization, so that we can use the last hidden states to set the
`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
token, the spaCy token receives the sum of their values. To access the values,
you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
package also adds the function registries [`@span_getters`](#span_getters) and
[`@annotation_setters`](#annotation_setters) with several built-in registered
functions. For more details, see the [usage documentation](/usage/transformers).
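
A minimal sketch of accessing these values, assuming a loaded pipeline that
includes the transformer component (the `en_core_trf_lg` package name follows
the usage examples):

```python
import spacy

nlp = spacy.load("en_core_trf_lg")
doc = nlp("This is a text.")
# Full transformer output for this doc, set by the component
trf_data = doc._.trf_data
# Last hidden states; rows of aligned word-piece tokens are summed into
# each token's row of doc.tensor
tokvecs = trf_data.tensors[-1]
print(doc.tensor.shape)
```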

## Config and implementation {#config}

@@ -52,9 +54,9 @@ architectures and their arguments and hyperparameters.

> ```

| Setting             | Type                                       | Description | Default |
| ------------------- | ------------------------------------------ | ----------- | ------- |
| `max_batch_items`   | int                                        | Maximum size of a padded batch. | `4096` |
| `annotation_setter` | Callable                                   | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | `null_annotation_setter` |
| `model`             | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransformerModel](/api/architectures#TransformerModel) |

```python

@@ -390,6 +392,72 @@ Split a `TransformerData` object that represents a batch into a list with one

| ----------- | ----------------------- | -------------- |
| **RETURNS** | `List[TransformerData]` | <!-- TODO: --> |

## Span getters {#span_getters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}

Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
return a list of [`Span`](/api/span) objects for each doc, to be processed by
the transformer. The returned spans can overlap.

<!-- TODO: details on what this is for --> Span getters can be referenced in the
config's `[components.transformer.model.get_spans]` block to customize the
sequences processed by the transformer. You can also register custom span
getters using the `@registry.span_getters` decorator.

> #### Example
>
> ```python
> @registry.span_getters("sent_spans.v1")
> def configure_get_sent_spans() -> Callable:
>     def get_sent_spans(docs: Iterable[Doc]) -> List[List[Span]]:
>         return [list(doc.sents) for doc in docs]
>
>     return get_sent_spans
> ```

| Name        | Type               | Description |
| ----------- | ------------------ | ------------------------------------------------------------ |
| `docs`      | `Iterable[Doc]`    | A batch of `Doc` objects. |
| **RETURNS** | `List[List[Span]]` | The spans to process by the transformer, one list per `Doc`. |

The following built-in functions are available:

| Name               | Description |
| ------------------ | ------------------------------------------------------------------ |
| `doc_spans.v1`     | Create a span for each doc (no transformation, process each text). |
| `sent_spans.v1`    | Create a span for each sentence if sentence boundaries are set. |
| `strided_spans.v1` | <!-- TODO: --> |

## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}

Annotation setters are functions that take a batch of `Doc` objects and a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
You can register custom annotation setters using the
`@registry.annotation_setters` decorator.

> #### Example
>
> ```python
> @registry.annotation_setters("spacy-transformer.null_annotation_setter.v1")
> def configure_null_annotation_setter() -> Callable:
>     def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
>         pass
>
>     return setter
> ```

| Name       | Type                   | Description |
| ---------- | ---------------------- | ------------------------------------ |
| `docs`     | `List[Doc]`            | A batch of `Doc` objects. |
| `trf_data` | `FullTransformerBatch` | The transformers data for the batch. |

The following built-in functions are available:

| Name                                          | Description |
| --------------------------------------------- | ------------------------------------- |
| `spacy-transformer.null_annotation_setter.v1` | Don't set any additional annotations. |
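
For example, the built-in no-op setter can be plugged into the
[`Transformer`](/api/transformer) component from the config, matching the
snippet shown in the [usage docs](/usage/transformers):

```ini
[components.transformer.annotation_setter]
@annotation_setters = "spacy-transformer.null_annotation_setter.v1"
```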

## Custom attributes {#custom-attributes}

The component sets the following

37  website/docs/images/pipeline_transformer.svg  (new file, 14 KiB)

[SVG diagram: "The processing pipeline with the transformer component", referenced below from the usage page as ../images/pipeline_transformer.svg. Path data omitted.]

@@ -1,10 +1,17 @@

---
title: Transformers
teaser: Using transformer models like BERT in spaCy
menu:
  - ['Installation', 'install']
  - ['Runtime Usage', 'runtime']
  - ['Training Usage', 'training']
---

## Installation {#install hidden="true"}

spaCy v3.0 lets you use almost **any statistical model** to power your pipeline.
You can use models implemented in a variety of
[frameworks](https://thinc.ai/docs/usage-frameworks), including TensorFlow,
PyTorch and MXNet. To keep things sane, spaCy expects models from these
frameworks to be wrapped with a common interface, using our machine learning
library [Thinc](https://thinc.ai). A transformer model is just a statistical

@@ -15,34 +22,110 @@ that do the required plumbing. We also provide a pipeline component,

[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
you save the transformer outputs for later use.

To use transformers with spaCy, you need the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
installed. It takes care of all the setup behind the scenes, and makes sure the
transformer pipeline component is available to spaCy.

```bash
$ pip install spacy-transformers
```

<!-- TODO: the text below has been copied from the spacy-transformers repo and needs to be updated and adjusted -->

## Runtime usage {#runtime}

Transformer models can be used as **drop-in replacements** for other types of
neural networks, so your spaCy pipeline can include them in a way that's
completely invisible to the user. Users will download, load and use the model in
the standard way, like any other spaCy pipeline. Instead of using the
transformers as subnetworks directly, you can also use them via the
[`Transformer`](/api/transformer) pipeline component.

![The processing pipeline with the transformer component](../images/pipeline_transformer.svg)

The `Transformer` component sets the
[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
which lets you access the transformers outputs at runtime.

```bash
$ python -m spacy download en_core_trf_lg
```

```python
### Example
import spacy

nlp = spacy.load("en_core_trf_lg")
for doc in nlp.pipe(["some text", "some other text"]):
    tokvecs = doc._.trf_data.tensors[-1]
```

You can also customize how the [`Transformer`](/api/transformer) component sets
annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`.
This callback will be called with the raw input and output data for the whole
batch, along with the batch of `Doc` objects, allowing you to implement whatever
you need. The annotation setter is called with a batch of [`Doc`](/api/doc)
objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
containing the transformers data for the batch.

```python
def custom_annotation_setter(docs, trf_data):
    # TODO:
    ...

nlp = spacy.load("en_core_trf_lg")
nlp.get_pipe("transformer").annotation_setter = custom_annotation_setter
doc = nlp("This is a text")
print()  # TODO:
```
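
As a hedged sketch of what such a setter might do, the example below stores
each doc's share of the batch output on a custom extension attribute. The
attribute name is made up for illustration, and `FullTransformerBatch.doc_data`
is assumed here to split the batch into one `TransformerData` per `Doc`:

```python
from spacy.tokens import Doc

# Hypothetical attribute name, registered once at import time
Doc.set_extension("trf_tensors", default=None)

def tensor_annotation_setter(docs, trf_data):
    # trf_data is the FullTransformerBatch for the whole batch;
    # doc_data is assumed to yield one TransformerData per Doc
    for doc, data in zip(docs, trf_data.doc_data):
        doc._.trf_tensors = data.tensors
```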

## Training usage {#training}

The recommended workflow for training is to use spaCy's
[config system](/usage/training#config), usually via the
[`spacy train`](/api/cli#train) command. The training config defines all
component settings and hyperparameters in one place and lets you describe a tree
of objects by referring to creation functions, including functions you register
yourself.

<Project id="en_core_bert">

The easiest way to get started is to clone a transformers-based project
template. Swap in your data, edit the settings and hyperparameters and train,
evaluate, package and visualize your model.

</Project>

The `[components]` section in the [`config.cfg`](#TODO:) describes the pipeline
components and the settings used to construct them, including their model
implementation. Here's a config snippet for the
[`Transformer`](/api/transformer) component, along with matching Python code:

> #### Python equivalent
>
> ```python
> from spacy_transformers import Transformer, TransformerModel
> from spacy_transformers.annotation_setters import null_annotation_setter
> from spacy_transformers.span_getters import get_doc_spans
>
> trf = Transformer(
>     nlp.vocab,
>     TransformerModel(
>         "bert-base-cased",
>         get_spans=get_doc_spans,
>         tokenizer_config={"use_fast": True},
>     ),
>     annotation_setter=null_annotation_setter,
>     max_batch_items=4096,
> )
> ```

```ini
### config.cfg (excerpt)
[components.transformer]
factory = "transformer"
max_batch_items = 4096

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"

@@ -50,46 +133,110 @@ name = "bert-base-cased"

tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "doc_spans.v1"

[components.transformer.annotation_setter]
@annotation_setters = "spacy-transformer.null_annotation_setter.v1"
```

The `[components.transformer.model]` block describes the `model` argument passed
to the transformer component. It's a Thinc
[`Model`](https://thinc.ai/docs/api-model) object that will be passed into the
component. Here, it references the function
[spacy-transformers.TransformerModel.v1](/api/architectures#TransformerModel)
registered in the [`architectures` registry](/api/top-level#registry). If a key
in a block starts with `@`, it's **resolved to a function** and all other
settings are passed to the function as arguments. In this case, `name`,
`tokenizer_config` and `get_spans`.

`get_spans` is a function that takes a batch of `Doc` objects and returns lists
of potentially overlapping `Span` objects to process by the transformer. Several
[built-in functions](/api/transformer#span-getters) are available – for example,
to process the whole document or individual sentences. When the config is
resolved, the function is created and passed into the model as an argument.

<Infobox variant="warning">

Remember that the `config.cfg` used for training should contain **no missing
values** and requires all settings to be defined. You don't want any hidden
defaults creeping in and changing your results! spaCy will tell you if settings
are missing, and you can run [`spacy debug config`](/api/cli#debug-config) with
`--auto-fill` to automatically fill in all defaults.

<!-- TODO: update with details on getting started with a config -->

</Infobox>
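
For example, auto-filling a partial config might look like this (a sketch; check
the CLI help for the exact arguments and output options):

```bash
$ python -m spacy debug config ./config.cfg --auto-fill
```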

### Customizing the settings {#training-custom-settings}

To change any of the settings, you can edit the `config.cfg` and re-run the
training. To change any of the functions, like the span getter, you can replace
the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to
process sentences. You can also register your own functions using the
`span_getters` registry:

> #### config.cfg
>
> ```ini
> [components.transformer.model.get_spans]
> @span_getters = "custom_sent_spans"
> ```

```python
### code.py
import spacy_transformers

@spacy_transformers.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans():
    # TODO: write custom example
    def get_sent_spans(docs):
        return [list(doc.sents) for doc in docs]

    return get_sent_spans
```

To resolve the config during training, spaCy needs to know about your custom
function. You can make it available via the `--code` argument that can point to
a Python file:

```bash
$ python -m spacy train ./train.spacy ./dev.spacy ./config.cfg --code ./code.py
```

### Customizing the model implementations {#training-custom-model}

The [`Transformer`](/api/transformer) component expects a Thinc
[`Model`](https://thinc.ai/docs/api-model) object to be passed in as its `model`
argument. You're not limited to the implementation provided by
`spacy-transformers` – the only requirement is that your registered function
must return an object of type `Model[List[Doc], FullTransformerBatch]`: that is,
a Thinc model that takes a list of [`Doc`](/api/doc) objects, and returns a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) object with the
transformer data.
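
A skeleton of such a registered function might look as follows. The registry
name `CustomTransformerModel.v1` is hypothetical, the import path for
`FullTransformerBatch` is assumed, and the model-building logic is elided; the
point is the signature:

```python
from typing import List

from spacy import registry
from spacy.tokens import Doc
from spacy_transformers.data_classes import FullTransformerBatch  # assumed path
from thinc.api import Model

@registry.architectures("CustomTransformerModel.v1")  # hypothetical name
def create_custom_transformer() -> Model[List[Doc], FullTransformerBatch]:
    # Build and return a Thinc model that maps a list of Docs to a
    # FullTransformerBatch, e.g. by wrapping a PyTorch transformer
    ...
```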

> #### Model type annotations
>
> In the documentation and code base, you may come across type annotations and
> descriptions of [Thinc](https://thinc.ai) model types, like
> `Model[List[Doc], List[Floats2d]]`. This so-called generic type describes the
> layer and its input and output type – in this case, it takes a list of `Doc`
> objects as the input and a list of 2-dimensional arrays of floats as the
> output. You can read more about defining Thinc models
> [here](https://thinc.ai/docs/usage-models). Also see the
> [type checking](https://thinc.ai/docs/usage-type-checking) docs for how to
> enable linting in your editor to see live feedback if your inputs and outputs
> don't match.

The same idea applies to task models that power the **downstream components**.
Most of spaCy's built-in model creation functions support a `tok2vec` argument,
which should be a Thinc layer of type `Model[List[Doc], List[Floats2d]]`. This
is where we'll plug in our transformer model, using the
[Tok2VecListener](/api/architectures#Tok2VecListener) layer, which sneakily
delegates to the `Transformer` pipeline component.

```ini
### config.cfg (excerpt) {highlight="12"}
[components.ner]
factory = "ner"

@@ -108,49 +255,24 @@ grad_factor = 1.0

@layers = "reduce_mean.v1"
```

The [Tok2VecListener](/api/architectures#Tok2VecListener) layer expects a
[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops), which needs to
be of type `Model[Ragged, Floats2d]`. This layer determines how the vector for
each spaCy token will be computed from the zero or more source rows the token is
aligned against. Here we use the
[`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which
averages the wordpiece rows. We could instead use `reduce_last`,
[`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom
function you write yourself.

<!--TODO: reduce_last: undocumented? -->
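
Swapping the pooling strategy is then just a config change. A sketch, assuming
the pooling block sits at `[components.ner.model.tok2vec.pooling]` (the full
block path is elided in the excerpt above):

```ini
[components.ner.model.tok2vec.pooling]
@layers = "reduce_max.v1"
```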
You can have multiple components all listening to the same transformer model,
|
You can have multiple components all listening to the same transformer model,
|
||||||
and all passing gradients back to it. By default, all of the gradients will be
|
and all passing gradients back to it. By default, all of the gradients will be
|
||||||
equally weighted. You can control this with the `grad_factor` setting, which
|
**equally weighted**. You can control this with the `grad_factor` setting, which
|
||||||
lets you reweight the gradients from the different listeners. For instance,
|
lets you reweight the gradients from the different listeners. For instance,
|
||||||
setting `grad_factor = 0` would disable gradients from one of the listeners,
|
setting `grad_factor = 0` would disable gradients from one of the listeners,
|
||||||
while `grad_factor = 2.0` would multiply them by 2. This is similar to having a
|
while `grad_factor = 2.0` would multiply them by 2. This is similar to having a
|
||||||
custom learning rate for each component. Instead of a constant, you can also
|
custom learning rate for each component. Instead of a constant, you can also
|
||||||
provide a schedule, allowing you to freeze the shared parameters at the start of
|
provide a schedule, allowing you to freeze the shared parameters at the start of
|
||||||
training.
|
training.
|
||||||
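
For example, the NER listener's gradients could be upweighted like this (an
excerpt only; the surrounding blocks follow the config snippet above):

```ini
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecListener.v1"
grad_factor = 2.0
```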