Update intro section of the pipeline component docs

shadeMe 2023-08-08 14:21:18 +02:00
parent 13e1d8ca90
commit 6e0f537c04


---
title: CuratedTransformer
teaser:
  Pipeline component for multi-task learning with curated transformer models
tag: class
source: github.com/explosion/spacy-transformers/blob/master/spacy_curated_transformers/pipeline_component.py
version: 3.7
api_base_class: /api/pipe
api_string_name: curated_transformer
---

> #### Installation
>
> ```bash
> $ pip install -U spacy-curated-transformers
> ```

<Infobox title="Important note" variant="warning">

This component is available via the extension package
[`spacy-curated-transformers`](https://github.com/explosion/spacy-curated-transformers).
It exposes the component via entry points, so if you have the package installed,
using `factory = "curated_transformer"` in your
[training config](/usage/training#config) will work out-of-the-box.

</Infobox>

This pipeline component lets you use a curated set of transformer models in
your pipeline. spaCy Curated Transformers currently supports the following
model types:

- ALBERT
- BERT
- CamemBERT
- RoBERTa
- XLM-RoBERTa

If you want to use another type of model, use
[spacy-transformers](/api/spacy-transformers), which allows you to use all
Hugging Face transformer models with spaCy.

You will usually connect downstream components to a shared curated transformer
using one of the curated transformer listener layers. This works similarly to
spaCy's [Tok2Vec](/api/tok2vec), and the
[Tok2VecListener](/api/architectures/#Tok2VecListener) sublayer. The component
assigns the output of the transformer to the `Doc`'s extension attributes. To
access the values, you can use the custom
[`Doc._.trf_data`](#assigned-attributes) attribute.

For more details, see the [usage documentation](/usage/embeddings-transformers).
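
For example, once a trained pipeline containing this component is available,
the transformer output can be read straight off a processed `Doc`. This is a
minimal sketch; the package name `my_trf_pipeline` is a hypothetical
placeholder:

```python
# A minimal sketch: assumes a trained pipeline package that already contains
# a "curated_transformer" component; "my_trf_pipeline" is a hypothetical name.
import spacy

nlp = spacy.load("my_trf_pipeline")
doc = nlp("spaCy Curated Transformers supports BERT and XLM-RoBERTa.")

# The component stores its output on the custom extension attribute:
trf_data = doc._.trf_data  # a DocTransformerOutput object
print(type(trf_data).__name__)
```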

## Assigned Attributes {id="assigned-attributes"}

The component sets the following
[custom extension attribute](/usage/processing-pipeline#custom-components-attributes):

| Location         | Value                                                                       |
| ---------------- | --------------------------------------------------------------------------- |
| `Doc._.trf_data` | Curated transformer outputs for the `Doc` object. ~~DocTransformerOutput~~  |

## Config and implementation {id="config"}

The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
[`config.cfg` for training](/usage/training#config). See the
[model architectures](/api/architectures#transformers) documentation for details
on the transformer architectures and their arguments and hyperparameters.

Note that the default config does not include the mandatory `vocab_size`
hyperparameter, as this value can differ between models. You will need to
specify it explicitly before adding the pipe, as shown in the example below.

> #### Example
>
> ```python
> from spacy_curated_transformers.pipeline.transformer import DEFAULT_CONFIG
>
> config = DEFAULT_CONFIG.copy()
> config["transformer"]["model"]["vocab_size"] = 250002
> nlp.add_pipe("curated_transformer", config=config["transformer"])
> ```
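
The snippet above assumes an existing `nlp` object. A self-contained version
might look like the following sketch; everything here is carried over from
this page's examples except the `spacy.blank("en")` starting point, which is
an assumption about typical usage:

```python
# A sketch assuming spacy and spacy-curated-transformers are installed.
import spacy
from spacy_curated_transformers.pipeline.transformer import DEFAULT_CONFIG

nlp = spacy.blank("en")

# Start from the default config and set the mandatory, model-specific
# vocab_size hyperparameter before adding the pipe.
config = DEFAULT_CONFIG.copy()
config["transformer"]["model"]["vocab_size"] = 250002
nlp.add_pipe("curated_transformer", config=config["transformer"])

print(nlp.pipe_names)  # ["curated_transformer"]
```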

| Setting | Description |