From 158d8c1e48961f8c962df01f72e5818f3ec2651d Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Wed, 29 Jul 2020 18:44:10 +0200
Subject: [PATCH] Update docs [ci skip]
---
 website/docs/api/architectures.md            |   2 +
 website/docs/api/top-level.md                |  25 ++
 website/docs/api/transformer.md              |  82 +++++-
 website/docs/images/pipeline_transformer.svg |  37 +++
 website/docs/usage/transformers.md           | 294 +++++++++++++------
 5 files changed, 347 insertions(+), 93 deletions(-)
 create mode 100644 website/docs/images/pipeline_transformer.svg

diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md
index a87c2a1e8..43387b8ca 100644
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@@ -26,6 +26,8 @@ TODO: intro and how architectures work, link to

### spacy-transformers.TransformerModel.v1 {#TransformerModel}

+### spacy-transformers.Tok2VecListener.v1 {#spacy-transformers.Tok2VecListener.v1}
+
## Parser & NER architectures {#parser source="spacy/ml/models/parser.py"}

### spacy.TransitionBasedParser.v1 {#TransitionBasedParser}

diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index a463441c7..ede7f9e21 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -304,6 +304,31 @@ factories.
| `losses`       | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |

+### spacy-transformers registry {#registry-transformers}
+
+The following registries are added by the
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package.
+See the [`Transformer`](/api/transformer) API reference and
+[usage docs](/usage/transformers) for details.
+
+> #### Example
+>
+> ```python
+> import spacy_transformers
+>
+> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
+> def configure_custom_annotation_setter():
+>     def annotation_setter(docs, trf_data) -> None:
+>         ...  # Set annotations on the docs
+>
+>     return annotation_setter
+> ```
+
+| Registry name                                                | Description |
+| ------------------------------------------------------------ | ----------- |
+| [`span_getters`](/api/transformer#span_getters)              | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to be processed by the transformer, e.g. sentences. |
+| [`annotation_setters`](/api/transformer#annotation_setters)  | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
+
## Training data and alignment {#gold source="spacy/gold"}

### gold.docs_to_json {#docs_to_json tag="function"}

diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md
index e89ecb6b7..386f65a0a 100644
--- a/website/docs/api/transformer.md
+++ b/website/docs/api/transformer.md
@@ -31,8 +31,10 @@ attributes. We also calculate an alignment between the word-piece tokens and
the spaCy tokenization, so that we can use the last hidden states to set the
`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
token, the spaCy token receives the sum of their values. To access the values,
-you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. For
-more details, see the [usage documentation](/usage/transformers).
+you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
+package also adds the function registries [`@span_getters`](#span_getters) and
+[`@annotation_setters`](#annotation_setters) with several built-in registered
+functions. For more details, see the [usage documentation](/usage/transformers).
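+
+For example, once a pipeline containing a transformer component has processed a
+text, you can inspect the data like this (a minimal sketch – it assumes `nlp`
+is a loaded pipeline whose pipeline includes a `transformer` component):
+
+```python
+doc = nlp("This is a text.")
+# TransformerData for this Doc, set by the transformer component
+trf_data = doc._.trf_data
+tokvecs = trf_data.tensors[-1]  # the last hidden states
+```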

## Config and implementation {#config}

@@ -51,11 +53,11 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
> ```

-| Setting             | Type                                       | Description | Default |
-| ------------------- | ------------------------------------------ | ----------- | ------- |
-| `max_batch_items`   | int                                        | Maximum size of a padded batch. | `4096` |
-| `annotation_setter` | Callable                                   | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](#fulltransformerbatch) and can set additional annotations on the `Doc`. | `null_annotation_setter` |
-| `model`             | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransformerModel](/api/architectures#TransformerModel) |
+| Setting             | Type                                       | Description | Default |
+| ------------------- | ------------------------------------------ | ----------- | ------- |
+| `max_batch_items`   | int                                        | Maximum size of a padded batch. | `4096` |
+| `annotation_setter` | Callable                                   | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | `null_annotation_setter` |
+| `model`             | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransformerModel](/api/architectures#TransformerModel) |

```python
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
```

@@ -390,6 +392,72 @@ Split a `TransformerData` object that represents a batch into a list with one

| ----------- | ----------------------- | -------------- |
| **RETURNS** | `List[TransformerData]` |                |

+## Span getters {#span_getters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}
+
+Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
+return a list of [`Span`](/api/span) objects for each doc, to be processed by
+the transformer. The returned spans can overlap. Span getters can be referenced
+in the config's `[components.transformer.model.get_spans]` block to customize
+the sequences processed by the transformer. You can also register custom span
+getters using the `@registry.span_getters` decorator.
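+
+As an illustration, here is a sketch of a custom getter that produces fixed-size,
+potentially overlapping windows. The registered name `my_strided_spans.v1` and
+its `window`/`stride` settings are illustrative, not part of the built-in API:
+
+```python
+import spacy_transformers
+
+@spacy_transformers.registry.span_getters("my_strided_spans.v1")
+def configure_my_strided_spans(window: int = 128, stride: int = 96):
+    def get_strided_spans(docs):
+        # One list of spans per doc; spans overlap when stride < window
+        return [
+            [doc[i : i + window] for i in range(0, len(doc), stride)]
+            for doc in docs
+        ]
+
+    return get_strided_spans
+```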
+
+> #### Example
+>
+> ```python
+> @registry.span_getters("sent_spans.v1")
+> def configure_get_sent_spans() -> Callable:
+>     def get_sent_spans(docs: Iterable[Doc]) -> List[List[Span]]:
+>         return [list(doc.sents) for doc in docs]
+>
+>     return get_sent_spans
+> ```
+
+| Name        | Type               | Description |
+| ----------- | ------------------ | ----------- |
+| `docs`      | `Iterable[Doc]`    | A batch of `Doc` objects. |
+| **RETURNS** | `List[List[Span]]` | The spans to be processed by the transformer, one list per `Doc`. |
+
+The following built-in functions are available:
+
+| Name               | Description |
+| ------------------ | ----------- |
+| `doc_spans.v1`     | Create a span for each doc (no transformation, process each text). |
+| `sent_spans.v1`    | Create a span for each sentence if sentence boundaries are set. |
+| `strided_spans.v1` | Create spans using a sliding window and stride; spans can overlap if the stride is smaller than the window. |
+
+## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}
+
+Annotation setters are functions that take a batch of `Doc` objects and a
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
+additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
+You can register custom annotation setters using the
+`@registry.annotation_setters` decorator.
+
+> #### Example
+>
+> ```python
+> @registry.annotation_setters("spacy-transformer.null_annotation_setter.v1")
+> def configure_null_annotation_setter() -> Callable:
+>     def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
+>         pass
+>
+>     return setter
+> ```
+
+| Name       | Type                   | Description |
+| ---------- | ---------------------- | ----------- |
+| `docs`     | `List[Doc]`            | A batch of `Doc` objects. |
+| `trf_data` | `FullTransformerBatch` | The transformers data for the batch. |
+
+The following built-in functions are available:
+
+| Name                                          | Description |
+| --------------------------------------------- | ----------- |
+| `spacy-transformer.null_annotation_setter.v1` | Don't set any additional annotations. |
+
## Custom attributes {#custom-attributes}

The component sets the following

diff --git a/website/docs/images/pipeline_transformer.svg b/website/docs/images/pipeline_transformer.svg
new file mode 100644
index 000000000..cfbf470cc
--- /dev/null
+++ b/website/docs/images/pipeline_transformer.svg
@@ -0,0 +1,37 @@
+[37 lines of SVG markup not shown: diagram of the processing pipeline with a transformer component]

diff --git a/website/docs/usage/transformers.md b/website/docs/usage/transformers.md
index d5ce4e891..791eaac37 100644
--- a/website/docs/usage/transformers.md
+++ b/website/docs/usage/transformers.md
@@ -1,10 +1,17 @@
---
title: Transformers
teaser: Using transformer models like BERT in spaCy
+menu:
+  - ['Installation', 'install']
+  - ['Runtime Usage', 'runtime']
+  - ['Training Usage', 'training']
---

+## Installation {#install hidden="true"}
+
spaCy v3.0 lets you use almost **any statistical model** to power your pipeline.
-You can use models implemented in a variety of frameworks, including TensorFlow,
+You can use models implemented in a variety of
+[frameworks](https://thinc.ai/docs/usage-frameworks), including TensorFlow,
PyTorch and MXNet. To keep things sane, spaCy expects models from these
frameworks to be wrapped with a common interface, using our machine learning
library [Thinc](https://thinc.ai).
A transformer model is just a statistical
@@ -15,34 +22,110 @@ that do the required plumbing. We also provide a pipeline component,
[`Transformer`](/api/transformer), that lets you do multi-task learning and
lets you save the transformer outputs for later use.

+To use transformers with spaCy, you need the
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
+installed. It takes care of all the setup behind the scenes, and makes sure the
+transformer pipeline component is available to spaCy.
+
-Try out a BERT-based model pipeline using this project template: swap in your
-data, edit the settings and hyperparameters and train, evaluate, package and
-visualize your model.
-
+```bash
+$ pip install spacy-transformers
+```
+
+### Customizing the settings {#training-custom-settings}
+
+To change any of the settings, you can edit the `config.cfg` and re-run the
+training. To change any of the functions, like the span getter, you can replace
+the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to
+process sentences. You can also register your own functions using the
+`span_getters` registry:
+
+> #### config.cfg
+>
+> ```ini
+> [components.transformer.model.get_spans]
+> @span_getters = "custom_sent_spans"
+> ```

```python
-from spacy_transformers import Transformer
+### code.py
+import spacy_transformers

-trf = Transformer(
-    nlp.vocab,
-    TransformerModel(
-        "bert-base-cased",
-        get_spans=get_doc_spans,
-        tokenizer_config={"use_fast": True},
-    ),
-    annotation_setter=null_annotation_setter,
-    max_batch_size=32,
-)
+@spacy_transformers.registry.span_getters("custom_sent_spans")
+def configure_custom_sent_spans():
+    # A simple example: return one span per sentence (requires sentence
+    # boundaries to be set)
+    def get_sent_spans(docs):
+        return [list(doc.sents) for doc in docs]
+
+    return get_sent_spans
```

-The `components.transformer` block adds the `transformer` component to the
-pipeline, and the `components.transformer.model` block describes the creation of
-a Thinc [`Model`](https://thinc.ai/docs/api-model) object that will be passed
-into the component. The block names a function registered in the
-`@architectures` registry. This function will be looked up and called using the
-provided arguments. You're not limited to just that function --- you can write
-your own or use someone else's. The only limitation is that it must return an
-object of type `Model[List[Doc], FullTransformerBatch]`: that is, a Thinc model
-that takes a list of `Doc` objects, and returns a `FullTransformerBatch` object
-with the transformer data.
+To resolve the config during training, spaCy needs to know about your custom
+function. You can make it available via the `--code` argument, which can point
+to a Python file:

-The same idea applies to task models that power the downstream components. Most
-of spaCy's built-in model creation functions support a `tok2vec` argument, which
-should be a Thinc layer of type `Model[List[Doc], List[Floats2d]]`. This is
-where we'll plug in our transformer model, using the `Tok2VecTransformer` layer,
-which sneakily delegates to the `Transformer` pipeline component.
+```bash
+$ python -m spacy train ./train.spacy ./dev.spacy ./config.cfg --code ./code.py
+```

+### Customizing the model implementations {#training-custom-model}
+
+The [`Transformer`](/api/transformer) component expects a Thinc
+[`Model`](https://thinc.ai/docs/api-model) object to be passed in as its `model`
+argument. You're not limited to the implementation provided by
+`spacy-transformers` – the only requirement is that your registered function
+must return an object of type `Model[List[Doc], FullTransformerBatch]`: that is,
+a Thinc model that takes a list of [`Doc`](/api/doc) objects, and returns a
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) object with the
+transformer data.
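+
+For example, assuming you've registered your own creation function under the
+hypothetical name `my_transformer_model.v1` (e.g. via the `@architectures`
+registry), you could swap it in by pointing the model block at it:
+
+```ini
+### config.cfg (excerpt)
+[components.transformer.model]
+@architectures = "my_transformer_model.v1"
+```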
+
+> #### Model type annotations
+>
+> In the documentation and code base, you may come across type annotations and
+> descriptions of [Thinc](https://thinc.ai) model types, like
+> `Model[List[Doc], List[Floats2d]]`. This so-called generic type describes the
+> layer and its input and output type – in this case, it takes a list of `Doc`
+> objects as the input and a list of 2-dimensional arrays of floats as the
+> output. You can read more about defining Thinc models
+> [here](https://thinc.ai/docs/usage-models). Also see the
+> [type checking](https://thinc.ai/docs/usage-type-checking) docs for how to
+> enable linting in your editor, so you get live feedback if your inputs and
+> outputs don't match.
+
+The same idea applies to task models that power the **downstream components**.
+Most of spaCy's built-in model creation functions support a `tok2vec` argument,
+which should be a Thinc layer of type `Model[List[Doc], List[Floats2d]]`. This
+is where we'll plug in our transformer model, using the
+[Tok2VecListener](/api/architectures#Tok2VecListener) layer, which sneakily
+delegates to the `Transformer` pipeline component.

```ini
-[nlp]
-lang = "en"
-pipeline = ["ner"]
-
+### config.cfg (excerpt) {highlight="12"}
[components.ner]
factory = "ner"
@@ -108,49 +255,24 @@ grad_factor = 1.0
@layers = "reduce_mean.v1"
```

-The `Tok2VecListener` layer expects a `pooling` layer, which needs to be of type
-`Model[Ragged, Floats2d]`. This layer determines how the vector for each spaCy
-token will be computed from the zero or more source rows the token is aligned
-against. Here we use the `reduce_mean` layer, which averages the wordpiece rows.
-We could instead use `reduce_last`, `reduce_max`, or a custom function you write
-yourself.
+The [Tok2VecListener](/api/architectures#Tok2VecListener) layer expects a
+[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops), which needs to
+be of type `Model[Ragged, Floats2d]`. This layer determines how the vector for
+each spaCy token will be computed from the zero or more source rows the token
+is aligned against. Here we use the
+[`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which
+averages the wordpiece rows. We could instead use `reduce_last`,
+[`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom
+function you write yourself.

You can have multiple components all listening to the same transformer model,
and all passing gradients back to it. By default, all of the gradients will be
-equally weighted. You can control this with the `grad_factor` setting, which
+**equally weighted**. You can control this with the `grad_factor` setting, which
lets you reweight the gradients from the different listeners. For instance,
setting `grad_factor = 0` would disable gradients from one of the listeners,
while `grad_factor = 2.0` would multiply them by 2. This is similar to having a
custom learning rate for each component. Instead of a constant, you can also
provide a schedule, allowing you to freeze the shared parameters at the start of
training.
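+
+As a sketch, here's how you could down-weight the gradients sent back by the
+NER listener from the config excerpt above (the value `0.5` is arbitrary;
+`1.0` is the default):
+
+```ini
+### config.cfg (excerpt)
+[components.ner.model.tok2vec]
+@architectures = "spacy-transformers.Tok2VecListener.v1"
+grad_factor = 0.5
+```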
- -### Runtime usage - -Transformer models can be used as drop-in replacements for other types of neural -networks, so your spaCy pipeline can include them in a way that's completely -invisible to the user. Users will download, load and use the model in the -standard way, like any other spaCy pipeline. - -Instead of using the transformers as subnetworks directly, you can also use them -via the [`Transformer`](/api/transformer) pipeline component. This sets the -[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute, -which lets you access the transformers outputs at runtime via the -`doc._.trf_data` extension attribute. You can also customize how the -`Transformer` object sets annotations onto the `Doc`, by customizing the -`Transformer.annotation_setter` object. This callback will be called with the -raw input and output data for the whole batch, along with the batch of `Doc` -objects, allowing you to implement whatever you need. - -```python -import spacy - -nlp = spacy.load("en_core_trf_lg") -for doc in nlp.pipe(["some text", "some other text"]): - doc._.trf_data.tensors - tokvecs = doc._.trf_data.tensors[-1] -``` - -The `nlp` object in this example is just like any other spaCy pipeline - - -->