From feb86d52066da8d53ca08b050621b1dd15ab2c09 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 26 Aug 2020 11:21:30 +0200 Subject: [PATCH 01/27] clarify default --- website/docs/api/architectures.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 3089fa1b3..55b456656 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -118,11 +118,11 @@ Instead of defining its own `Tok2Vec` instance, a model architecture like [Tagger](/api/architectures#tagger) can define a listener as its `tok2vec` argument that connects to the shared `tok2vec` component in the pipeline. -| Name | Description | -| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ | -| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ | -| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ | +| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | ### spacy.MultiHashEmbed.v1 {#MultiHashEmbed} From 15902c5aa27d18e4c6e9aff2a20e54a83216d0c7 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 26 Aug 2020 11:51:57 +0200 Subject: [PATCH 02/27] fix link --- website/docs/api/transformer.md | 4 ++-- website/docs/usage/embeddings-transformers.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index c32651e02..b09455b41 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -29,7 +29,7 @@ This pipeline component lets you use transformer models in your pipeline. Supports all models that are available via the [HuggingFace `transformers`](https://huggingface.co/transformers) library. Usually you will connect subsequent components to the shared transformer using -the [TransformerListener](/api/architectures#TransformerListener) layer. This +the [TransformerListener](/api/architectures##transformers-Tok2VecListener) layer. 
This works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer. @@ -233,7 +233,7 @@ The `Transformer` component therefore does **not** perform a weight update during its own `update` method. Instead, it runs its transformer model and communicates the output and the backpropagation callback to any **downstream components** that have been connected to it via the -[TransformerListener](/api/architectures#TransformerListener) sublayer. If there +[TransformerListener](/api/architectures##transformers-Tok2VecListener) sublayer. If there are multiple listeners, the last layer will actually backprop to the transformer and call the optimizer, while the others simply increment the gradients. diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index e2c1a6fd0..b5f58927a 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -101,7 +101,7 @@ it processes a batch of documents, it will pass forward its predictions to the listeners, allowing the listeners to **reuse the predictions** when they are eventually called. A similar mechanism is used to pass gradients from the listeners back to the model. The [`Transformer`](/api/transformer) component and -[TransformerListener](/api/architectures#TransformerListener) layer do the same +[TransformerListener](/api/architectures#transformers-Tok2VecListener) layer do the same thing for transformer models, but the `Transformer` component will also save the transformer outputs to the [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute, @@ -179,7 +179,7 @@ interoperates with [PyTorch](https://pytorch.org) and the giving you access to thousands of pretrained models for your pipelines. There are many [great guides](http://jalammar.github.io/illustrated-transformer/) to transformer models, but for practical purposes, you can simply think of them as -a drop-in replacement that let you achieve **higher accuracy** in exchange for +drop-in replacements that let you achieve **higher accuracy** in exchange for **higher training and runtime costs**. ### Setup and installation {#transformers-installation} From ec069627febd542cf1741e8be5e88e03cdda43ae Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 26 Aug 2020 13:31:01 +0200 Subject: [PATCH 03/27] rename to TransformerListener --- website/docs/api/architectures.md | 4 ++-- website/docs/api/transformer.md | 4 ++-- website/docs/usage/embeddings-transformers.md | 2 +- website/docs/usage/v3.md | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 55b456656..374c133ff 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -346,13 +346,13 @@ in other components, see | `tokenizer_config` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). ~~Dict[str, Any]~~ | | **CREATES** | The model using the architecture. 
~~Model[List[Doc], FullTransformerBatch]~~ | -### spacy-transformers.Tok2VecListener.v1 {#transformers-Tok2VecListener} +### spacy-transformers.TransformerListener.v1 {#TransformerListener} > #### Example Config > > ```ini > [model] -> @architectures = "spacy-transformers.Tok2VecListener.v1" +> @architectures = "spacy-transformers.TransformerListener.v1" > grad_factor = 1.0 > > [model.pooling] diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index b09455b41..c32651e02 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -29,7 +29,7 @@ This pipeline component lets you use transformer models in your pipeline. Supports all models that are available via the [HuggingFace `transformers`](https://huggingface.co/transformers) library. Usually you will connect subsequent components to the shared transformer using -the [TransformerListener](/api/architectures##transformers-Tok2VecListener) layer. This +the [TransformerListener](/api/architectures#TransformerListener) layer. This works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer. @@ -233,7 +233,7 @@ The `Transformer` component therefore does **not** perform a weight update during its own `update` method. Instead, it runs its transformer model and communicates the output and the backpropagation callback to any **downstream components** that have been connected to it via the -[TransformerListener](/api/architectures##transformers-Tok2VecListener) sublayer. If there +[TransformerListener](/api/architectures#TransformerListener) sublayer. If there are multiple listeners, the last layer will actually backprop to the transformer and call the optimizer, while the others simply increment the gradients. diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index b5f58927a..62336a826 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -101,7 +101,7 @@ it processes a batch of documents, it will pass forward its predictions to the listeners, allowing the listeners to **reuse the predictions** when they are eventually called. A similar mechanism is used to pass gradients from the listeners back to the model. 
The [`Transformer`](/api/transformer) component and -[TransformerListener](/api/architectures#transformers-Tok2VecListener) layer do the same +[TransformerListener](/api/architectures#TransformerListener) layer do the same thing for transformer models, but the `Transformer` component will also save the transformer outputs to the [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute, diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index bf0c13b68..5d55a788f 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -64,7 +64,7 @@ menu: [`TransformerData`](/api/transformer#transformerdata), [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel), - [Tok2VecListener](/api/architectures#transformers-Tok2VecListener), + [TransformerListener](/api/architectures#TransformerListener), [Tok2VecTransformer](/api/architectures#Tok2VecTransformer) - **Models:** [`en_core_trf_lg_sm`](/models/en) - **Implementation:** From 559b65f2e08ca3d4ed3c04bc0c58e241aef2b1a6 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 09:43:32 +0200 Subject: [PATCH 04/27] adjust references to null_annotation_setter to trfdata_setter --- website/docs/api/transformer.md | 59 ++++++++++--------- website/docs/usage/embeddings-transformers.md | 6 +- 2 files changed, 34 insertions(+), 31 deletions(-) diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index c32651e02..0b51487ed 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -25,24 +25,23 @@ work out-of-the-box. -This pipeline component lets you use transformer models in your pipeline. -Supports all models that are available via the +This pipeline component lets you use transformer models in your pipeline. It +supports all models that are available via the [HuggingFace `transformers`](https://huggingface.co/transformers) library. Usually you will connect subsequent components to the shared transformer using the [TransformerListener](/api/architectures#TransformerListener) layer. This works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer. -The component assigns the output of the transformer to the `Doc`'s extension -attributes. We also calculate an alignment between the word-piece tokens and the -spaCy tokenization, so that we can use the last hidden states to set the -`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy -token, the spaCy token receives the sum of their values. To access the values, -you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The -package also adds the function registries [`@span_getters`](#span_getters) and -[`@annotation_setters`](#annotation_setters) with several built-in registered -functions. For more details, see the -[usage documentation](/usage/embeddings-transformers). +We calculate an alignment between the word-piece tokens and the spaCy +tokenization, so that we can use the last hidden states to store the information +on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the +spaCy token receives the sum of their values. By default, the information is +written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but +you can implement a custom [`@annotation_setter`](#annotation_setters) to change +this behaviour. 
The package also adds the function registry +[`@span_getters`](#span_getters) with several built-in registered functions. For +more details, see the [usage documentation](/usage/embeddings-transformers). ## Config and implementation {#config} @@ -61,11 +60,11 @@ architectures and their arguments and hyperparameters. > nlp.add_pipe("transformer", config=DEFAULT_CONFIG) > ``` -| Setting | Description | -| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | -| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | -| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | +| Setting | Description | +| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | +| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.transformer_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | ```python https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py @@ -518,19 +517,23 @@ right context. ## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"} -Annotation setters are functions that that take a batch of `Doc` objects and a -[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set -additional annotations on the `Doc`, e.g. to set custom or built-in attributes. -You can register custom annotation setters using the -`@registry.annotation_setters` decorator. +Annotation setters are functions that take a batch of `Doc` objects and a +[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the +annotations on the `Doc`, e.g. to set custom or built-in attributes. You can +register custom annotation setters using the `@registry.annotation_setters` +decorator. The default annotation setter used by the `Transformer` pipeline +component is `trfdata_setter`, which sets the custom `Doc._.transformer_data` +attribute. 
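As a quick runtime sketch of what this default setter stores (assuming a transformer-based pipeline such as the `en_core_trf_lg` model referenced elsewhere in these docs is installed):

```python
import spacy

nlp = spacy.load("en_core_trf_lg")
doc = nlp("This is a text")
# The default setter has stored the transformer output for this Doc;
# the last tensor holds the model's final hidden states
print(doc._.trf_data.tensors[-1].shape)
```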
> #### Example > > ```python -> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1") -> def configure_null_annotation_setter() -> Callable: +> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1") +> def configure_trfdata_setter() -> Callable: > def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None: -> pass +> doc_data = list(trf_data.doc_data) +> for doc, data in zip(docs, doc_data): +> doc._.trf_data = data > > return setter > ``` @@ -542,9 +545,9 @@ You can register custom annotation setters using the The following built-in functions are available: -| Name | Description | -| ---------------------------------------------- | ------------------------------------- | -| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. | +| Name | Description | +| -------------------------------------- | ------------------------------------------------------------- | +| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. | ## Custom attributes {#custom-attributes} diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 62336a826..fbae1da82 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -299,7 +299,7 @@ component: > > ```python > from spacy_transformers import Transformer, TransformerModel -> from spacy_transformers.annotation_setters import null_annotation_setter +> from spacy_transformers.annotation_setters import configure_trfdata_setter > from spacy_transformers.span_getters import get_doc_spans > > trf = Transformer( @@ -309,7 +309,7 @@ component: > get_spans=get_doc_spans, > tokenizer_config={"use_fast": True}, > ), -> annotation_setter=null_annotation_setter, +> annotation_setter=configure_trfdata_setter(), > max_batch_items=4096, > ) > ``` @@ -329,7 +329,7 @@ tokenizer_config = {"use_fast": true} @span_getters = "doc_spans.v1" [components.transformer.annotation_setter] -@annotation_setters = "spacy-transformers.null_annotation_setter.v1" +@annotation_setters = "spacy-transformers.trfdata_setter.v1" ``` From acc794c97525b32bac54cb8e7f900460eba6789d Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 10:10:10 +0200 Subject: [PATCH 05/27] example of writing to other custom attribute --- website/docs/usage/embeddings-transformers.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index fbae1da82..3e95114f0 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -225,7 +225,7 @@ transformers as subnetworks directly, you can also use them via the ![The processing pipeline with the transformer component](../images/pipeline_transformer.svg) -The `Transformer` component sets the +By default, the `Transformer` component sets the [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute, which lets you access the transformers outputs at runtime. @@ -249,8 +249,8 @@ for doc in nlp.pipe(["some text", "some other text"]): tokvecs = doc._.trf_data.tensors[-1] ``` -You can also customize how the [`Transformer`](/api/transformer) component sets -annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`. 
+You can customize how the [`Transformer`](/api/transformer) component sets +annotations onto the [`Doc`](/api/doc), by changing the `annotation_setter`. This callback will be called with the raw input and output data for the whole batch, along with the batch of `Doc` objects, allowing you to implement whatever you need. The annotation setter is called with a batch of [`Doc`](/api/doc) @@ -259,13 +259,15 @@ containing the transformers data for the batch. ```python def custom_annotation_setter(docs, trf_data): - # TODO: - ... + doc_data = list(trf_data.doc_data) + for doc, data in zip(docs, doc_data): + doc._.custom_attr = data nlp = spacy.load("en_core_trf_lg") nlp.get_pipe("transformer").annotation_setter = custom_annotation_setter doc = nlp("This is a text") -print() # TODO: +assert isinstance(doc._.custom_attr, TransformerData) +print(doc._.custom_attr.tensors) ``` ### Training usage {#transformers-training} From c68169f83f267c4ad614bec1a7a914c90c46843f Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 10:19:43 +0200 Subject: [PATCH 06/27] fix link --- website/docs/usage/embeddings-transformers.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 3e95114f0..c1d7ee333 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -345,9 +345,9 @@ in a block starts with `@`, it's **resolved to a function** and all other settings are passed to the function as arguments. In this case, `name`, `tokenizer_config` and `get_spans`. -`get_spans` is a function that takes a batch of `Doc` object and returns lists +`get_spans` is a function that takes a batch of `Doc` objects and returns lists of potentially overlapping `Span` objects to process by the transformer. Several -[built-in functions](/api/transformer#span-getters) are available – for example, +[built-in functions](/api/transformer#span_getters) are available – for example, to process the whole document or individual sentences. When the config is resolved, the function is created and passed into the model as an argument. From 4d37ac3f33834d8a189bfbe66d31f53b4306ac3a Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 14:14:16 +0200 Subject: [PATCH 07/27] configure_custom_sent_spans example --- website/docs/usage/embeddings-transformers.md | 27 ++++++++++++++----- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index c1d7ee333..21bedc3d3 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -368,13 +368,17 @@ To change any of the settings, you can edit the `config.cfg` and re-run the training. To change any of the functions, like the span getter, you can replace the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to process sentences. You can also register your own functions using the -`span_getters` registry: +`span_getters` registry. For instance, the following custom function returns +`Span` objects following sentence boundaries, unless a sentence succeeds a +certain amount of tokens, in which case subsentences of at most `max_length` +tokens are returned. > #### config.cfg > > ```ini > [components.transformer.model.get_spans] > @span_getters = "custom_sent_spans" +> max_length = 25 > ``` ```python @@ -382,12 +386,23 @@ process sentences. 
You can also register your own functions using the import spacy_transformers @spacy_transformers.registry.span_getters("custom_sent_spans") -def configure_custom_sent_spans(): - # TODO: write custom example - def get_sent_spans(docs): - return [list(doc.sents) for doc in docs] +def configure_custom_sent_spans(max_length: int): + def get_custom_sent_spans(docs): + spans = [] + for doc in docs: + spans.append([]) + for sent in doc.sents: + start = 0 + end = max_length + while end <= len(sent): + spans[-1].append(sent[start:end]) + start += max_length + end += max_length + if start < len(sent): + spans[-1].append(sent[start : len(sent)]) + return spans - return get_sent_spans + return get_custom_sent_spans ``` To resolve the config during training, spaCy needs to know about your custom From 28e4ba72702485d5379309967bc751b5d6f48a9f Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 14:33:28 +0200 Subject: [PATCH 08/27] fix references to TransformerListener --- website/docs/usage/embeddings-transformers.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 21bedc3d3..e78baeb67 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -399,7 +399,7 @@ def configure_custom_sent_spans(max_length: int): start += max_length end += max_length if start < len(sent): - spans[-1].append(sent[start : len(sent)]) + spans[-1].append(sent[start:len(sent)]) return spans return get_custom_sent_spans @@ -429,7 +429,7 @@ The same idea applies to task models that power the **downstream components**. Most of spaCy's built-in model creation functions support a `tok2vec` argument, which should be a Thinc layer of type ~~Model[List[Doc], List[Floats2d]]~~. This is where we'll plug in our transformer model, using the -[Tok2VecListener](/api/architectures#Tok2VecListener) layer, which sneakily +[TransformerListener](/api/architectures#TransformerListener) layer, which sneakily delegates to the `Transformer` pipeline component. ```ini @@ -445,14 +445,14 @@ maxout_pieces = 3 use_upper = false [nlp.pipeline.ner.model.tok2vec] -@architectures = "spacy-transformers.Tok2VecListener.v1" +@architectures = "spacy-transformers.TransformerListener.v1" grad_factor = 1.0 [nlp.pipeline.ner.model.tok2vec.pooling] @layers = "reduce_mean.v1" ``` -The [Tok2VecListener](/api/architectures#Tok2VecListener) layer expects a +The [TransformerListener](/api/architectures#TransformerListener) layer expects a [pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the argument `pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This layer determines how the vector for each spaCy token will be computed from the zero or From 329e49056008680b428ea0112ff5f52d5cdc2de7 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 14:50:43 +0200 Subject: [PATCH 09/27] small import fixes --- website/docs/usage/embeddings-transformers.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index e78baeb67..96ae1978d 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -552,8 +552,9 @@ vectors, but combines them via summation with a smaller table of learned embeddings. 
```python -from thinc.api import add, chain, remap_ids, Embed +from thinc.api import add, chain, remap_ids, Embed, FeatureExtractor from spacy.ml.staticvectors import StaticVectors +from spacy.util import registry @registry.architectures("my_example.MyEmbedding.v1") def MyCustomVectors( From 556e975a30fb7d9589ed9375e4e92008fc008848 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 19:24:44 +0200 Subject: [PATCH 10/27] various fixes --- website/docs/api/transformer.md | 61 ++++++++++--------- website/docs/usage/embeddings-transformers.md | 14 ++--- 2 files changed, 38 insertions(+), 37 deletions(-) diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index 0b51487ed..a3f6deb7d 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -49,8 +49,8 @@ The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the `config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your [`config.cfg` for training](/usage/training#config). See the -[model architectures](/api/architectures) documentation for details on the -architectures and their arguments and hyperparameters. +[model architectures](/api/architectures#transformers) documentation for details +on the transformer architectures and their arguments and hyperparameters. > #### Example > @@ -60,11 +60,11 @@ architectures and their arguments and hyperparameters. > nlp.add_pipe("transformer", config=DEFAULT_CONFIG) > ``` -| Setting | Description | -| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | -| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.transformer_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | -| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | +| Setting | Description | +| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | +| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | ```python https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py @@ -97,18 +97,19 @@ Construct a `Transformer` component. 
One or more subsequent spaCy components can use the transformer outputs as features in its model, with gradients backpropagated to the single shared weights. The activations from the transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension -attribute. You can also provide a callback to set additional annotations. In -your application, you would normally use a shortcut for this and instantiate the -component using its string name and [`nlp.add_pipe`](/api/language#create_pipe). +attribute by default, but you can provide a different `annotation_setter` to +customize this behaviour. In your application, you would normally use a shortcut +and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#create_pipe). -| Name | Description | -| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | The shared vocabulary. ~~Vocab~~ | -| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ | -| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. By default, no annotations are set. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | -| _keyword-only_ | | -| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | -| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ | +| Name | Description | +| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ | +| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. By default, the function `trfdata_setter` sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| _keyword-only_ | | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ | ## Transformer.\_\_call\_\_ {#call tag="method"} @@ -204,8 +205,9 @@ modifying them. Assign the extracted features to the Doc objects. By default, the [`TransformerData`](/api/transformer#transformerdata) object is written to the -[`Doc._.trf_data`](#custom-attributes) attribute. Your annotation_setter -callback is then called, if provided. +[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be +customized by providing a different `annotation_setter` argument upon +construction. 
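For instance, a minimal construction sketch (assuming `spacy-transformers` is installed so that the `"transformer"` factory and the registered `trfdata_setter` are available; this simply wires up the documented default explicitly):

```python
import spacy

nlp = spacy.blank("en")
# Instantiate the component by its string name; the annotation_setter
# can be any function registered in the annotation_setters registry
config = {
    "annotation_setter": {
        "@annotation_setters": "spacy-transformers.trfdata_setter.v1"
    }
}
trf = nlp.add_pipe("transformer", config=config)
```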
> #### Example > @@ -382,9 +384,8 @@ return tensors that refer to a whole padded batch of documents. These tensors are wrapped into the [FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The `FullTransformerBatch` then splits out the per-document data, which is handled -by this class. Instances of this class -are`typically assigned to the [Doc._.trf_data`](/api/transformer#custom-attributes) -extension attribute. +by this class. Instances of this class are typically assigned to the +[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute. | Name | Description | | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -446,8 +447,9 @@ overlap, and you can also omit sections of the Doc if they are not relevant. Span getters can be referenced in the `[components.transformer.model.get_spans]` block of the config to customize the sequences processed by the transformer. You -can also register custom span getters using the `@spacy.registry.span_getters` -decorator. +can also register +[custom span getters](/usage/embeddings-transformers#transformers-training-custom-settings) +using the `@spacy.registry.span_getters` decorator. > #### Example > @@ -522,8 +524,7 @@ Annotation setters are functions that take a batch of `Doc` objects and a annotations on the `Doc`, e.g. to set custom or built-in attributes. You can register custom annotation setters using the `@registry.annotation_setters` decorator. The default annotation setter used by the `Transformer` pipeline -component is `trfdata_setter`, which sets the custom `Doc._.transformer_data` -attribute. +component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute. > #### Example > @@ -554,6 +555,6 @@ The following built-in functions are available: The component sets the following [custom extension attributes](/usage/processing-pipeline#custom-components-attributes): -| Name | Description | -| -------------- | ------------------------------------------------------------------------ | -| `Doc.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ | +| Name | Description | +| ---------------- | ------------------------------------------------------------------------ | +| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ | diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 96ae1978d..751cff6a5 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -429,8 +429,8 @@ The same idea applies to task models that power the **downstream components**. Most of spaCy's built-in model creation functions support a `tok2vec` argument, which should be a Thinc layer of type ~~Model[List[Doc], List[Floats2d]]~~. This is where we'll plug in our transformer model, using the -[TransformerListener](/api/architectures#TransformerListener) layer, which sneakily -delegates to the `Transformer` pipeline component. +[TransformerListener](/api/architectures#TransformerListener) layer, which +sneakily delegates to the `Transformer` pipeline component. 
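The config excerpt below expresses this wiring declaratively. As a hedged sketch, the same nesting can be written in Python as a config dict (some parser settings omitted for brevity; the names are the ones used elsewhere on this page):

```python
# The listener fills the tok2vec slot of the NER model and delegates to
# the shared "transformer" pipeline component
ner_model_config = {
    "@architectures": "spacy.TransitionBasedParser.v1",
    "state_type": "ner",
    "tok2vec": {
        "@architectures": "spacy-transformers.TransformerListener.v1",
        "grad_factor": 1.0,
        "pooling": {"@layers": "reduce_mean.v1"},
    },
}
```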
```ini ### config.cfg (excerpt) {highlight="12"} @@ -452,11 +452,11 @@ grad_factor = 1.0 @layers = "reduce_mean.v1" ``` -The [TransformerListener](/api/architectures#TransformerListener) layer expects a -[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the argument -`pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This layer -determines how the vector for each spaCy token will be computed from the zero or -more source rows the token is aligned against. Here we use the +The [TransformerListener](/api/architectures#TransformerListener) layer expects +a [pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the +argument `pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This +layer determines how the vector for each spaCy token will be computed from the +zero or more source rows the token is aligned against. Here we use the [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which averages the wordpiece rows. We could instead use [`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom From aa9e0c9c39ccb00ed211aa2b718b8308258fa1df Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 19:56:52 +0200 Subject: [PATCH 11/27] small fix --- website/docs/api/architectures.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 374c133ff..b55027356 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -323,11 +323,11 @@ for details and system requirements. Load and wrap a transformer model from the [HuggingFace `transformers`](https://huggingface.co/transformers) library. You -can any transformer that has pretrained weights and a PyTorch implementation. -The `name` variable is passed through to the underlying library, so it can be -either a string or a path. If it's a string, the pretrained weights will be -downloaded via the transformers library if they are not already available -locally. +can use any transformer that has pretrained weights and a PyTorch +implementation. The `name` variable is passed through to the underlying library, +so it can be either a string or a path. If it's a string, the pretrained weights +will be downloaded via the transformers library if they are not already +available locally. In order to support longer documents, the [TransformerModel](/api/architectures#TransformerModel) layer allows you to pass From 72a87095d98c4b4e70b174122a0854e6f8812b23 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 20:26:28 +0200 Subject: [PATCH 12/27] add loggers registry --- website/docs/api/top-level.md | 1 + 1 file changed, 1 insertion(+) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 797fa0191..73474a81b 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -307,6 +307,7 @@ factories. | `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | | `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | | `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | +| `loggers` | Registry for functions that log [training results](/usage/training). | | `lookups` | Registry for large lookup tables available via `vocab.lookups`. 
| | `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | | `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | From 5230529de2edd89869800a7c19bf3890d003b5bb Mon Sep 17 00:00:00 2001 From: svlandeg Date: Fri, 28 Aug 2020 21:44:04 +0200 Subject: [PATCH 13/27] add loggers registry & logger docs sections --- spacy/cli/train.py | 2 +- website/docs/api/top-level.md | 106 ++++++++++++++++++++++++--------- website/docs/usage/training.md | 18 ++++++ 3 files changed, 98 insertions(+), 28 deletions(-) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index d9ab8eca5..655a5ae58 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -272,7 +272,7 @@ def train_while_improving( step (int): How many steps have been completed. score (float): The main score form the last evaluation. other_scores: : The other scores from the last evaluation. - loss: The accumulated losses throughout training. + losses: The accumulated losses throughout training. checkpoints: A list of previous results, where each result is a (score, step, epoch) tuple. """ diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 73474a81b..011716060 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -296,24 +296,25 @@ factories. > i += 1 > ``` -| Registry name | Description | -| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | -| `assets` | Registry for data assets, knowledge bases etc. | -| `batchers` | Registry for training and evaluation [data batchers](#batchers). | -| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | -| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | -| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | -| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | -| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | -| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | -| `loggers` | Registry for functions that log [training results](/usage/training). | -| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | -| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | -| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | -| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | -| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). 
| -| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. | +| Registry name | Description | +| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `annotation_setters` | Registry for functions that store Tok2Vec annotations on `Doc` objects. | +| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | +| `assets` | Registry for data assets, knowledge bases etc. | +| `batchers` | Registry for training and evaluation [data batchers](#batchers). | +| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | +| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | +| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | +| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | +| `loggers` | Registry for functions that log [training results](/usage/training). | +| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | +| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | +| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | +| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | +| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | +| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. 
| ### spacy-transformers registry {#registry-transformers} @@ -327,18 +328,69 @@ See the [`Transformer`](/api/transformer) API reference and > ```python > import spacy_transformers > -> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1") -> def configure_custom_annotation_setter(): -> def annotation_setter(docs, trf_data) -> None: -> # Set annotations on the docs +> @spacy_transformers.registry.span_getters("my_span_getter.v1") +> def configure_custom_span_getter() -> Callable: +> def span_getter(docs: List[Doc]) -> List[List[Span]]: +> # Transform each Doc into a List of Span objects > -> return annotation_sette +> return span_getter > ``` -| Registry name | Description | -| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | -| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | + +| Registry name | Description | +| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | +| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | + +## Loggers {#loggers source="spacy/gold/loggers.py" new="3"} + +A logger records the training results for each step. When a logger is created, +it returns a `log_step` function and a `finalize` function. The `log_step` +function is called by the [training script](/api/cli#train) and receives a +dictionary of information, including + +# TODO + +> #### Example config +> +> ```ini +> [training.logger] +> @loggers = "spacy.ConsoleLogger.v1" +> ``` + +Instead of using one of the built-in batchers listed here, you can also +[implement your own](/usage/training#custom-code-readers-batchers), which may or +may not use a custom schedule. + +#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"} + +Writes the results of a training step to the console in a tabular format. + +#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"} + +> #### Installation +> +> ```bash +> $ pip install wandb +> $ wandb login +> ``` + +Built-in logger that sends the results of each training step to the dashboard of +the [Weights & Biases`](https://www.wandb.com/) dashboard. To use this logger, +Weights & Biases should be installed, and you should be logged in. The logger +will send the full config file to W&B, as well as various system information +such as GPU + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- | +| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. 
~~str~~ | + +> #### Example config +> +> ```ini +> [training.logger] +> @loggers = "spacy.WandbLogger.v1" +> project_name = "monitor_spacy_training" +> ``` ## Batchers {#batchers source="spacy/gold/batchers.py" new="3"} diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 59766bada..878161b1b 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -605,6 +605,24 @@ to your Python file. Before loading the config, spaCy will import the $ python -m spacy train config.cfg --output ./output --code ./functions.py ``` +#### Example: Custom logging function {#custom-logging} + +During training, the results of each step are passed to a logger function in a +dictionary providing the following information: + +| Key | Value | +| -------------- | ---------------------------------------------------------------------------------------------- | +| `epoch` | How many passes over the data have been completed. ~~int~~ | +| `step` | How many steps have been completed. ~~int~~ | +| `score` | The main score form the last evaluation, measured on the dev set. ~~float~~ | +| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ | +| `losses` | The accumulated training losses. ~~Dict[str, float]~~ | +| `checkpoints` | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ | + +By default, these results are written to the console with the [`ConsoleLogger`](/api/top-level#ConsoleLogger) + +# TODO + #### Example: Custom batch size schedule {#custom-code-schedule} For example, let's say you've implemented your own batch size schedule to use From 2c90a06fee86128a504d95e5caf0e15ad439ebac Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 13:43:17 +0200 Subject: [PATCH 14/27] some more information about the loggers --- website/docs/api/top-level.md | 48 ++++++++++++++++++++++------------- 1 file changed, 31 insertions(+), 17 deletions(-) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 6fbb1c821..518711a8a 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -4,6 +4,7 @@ menu: - ['spacy', 'spacy'] - ['displacy', 'displacy'] - ['registry', 'registry'] + - ['Loggers', 'loggers'] - ['Batchers', 'batchers'] - ['Data & Alignment', 'gold'] - ['Utility Functions', 'util'] @@ -345,19 +346,26 @@ See the [`Transformer`](/api/transformer) API reference and > return span_getter > ``` - | Registry name | Description | | ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | | [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | ## Loggers {#loggers source="spacy/gold/loggers.py" new="3"} -A logger records the training results for each step. When a logger is created, -it returns a `log_step` function and a `finalize` function. The `log_step` -function is called by the [training script](/api/cli#train) and receives a -dictionary of information, including +A logger records the training results. When a logger is created, two functions +are returned: one for logging the information for each training step, and a +second function that is called to finalize the logging when the training is +finished. 
To log each training step, a [dictionary](/usage/training#custom-logging) is
passed on from the [training script](/api/cli#train), including information
such as the training loss and the accuracy scores on the development set.

There are two built-in logging functions: a logger printing results to the
console in tabular format (which is the default), and one that also sends the
results to a [Weights & Biases](https://www.wandb.com/) dashboard. Instead of
using one of the built-in loggers listed here, you can also
[implement your own](/usage/training#custom-logging).

> #### Example config
>
> ```ini
> [training.logger]
> @loggers = "spacy.ConsoleLogger.v1"
> ```

#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"}

Writes the results of a training step to the console in a tabular format.

#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"}

> #### Installation
>
> ```bash
> $ pip install wandb
> $ wandb login
> ```

Built-in logger that sends the results of each training step to the dashboard
of the [Weights & Biases](https://www.wandb.com/) tool. To use this logger,
Weights & Biases should be installed, and you should be logged in. The logger
will send the full config file to W&B, as well as various system information
such as memory utilization, network traffic, disk IO, GPU statistics, etc. This
will also include information such as your hostname and operating system, as
well as the location of your Python executable.

Note that by default, the full (interpolated) training config file is sent over
to the W&B dashboard. If you prefer to exclude certain information such as path
names, you can list those fields in "dot notation" in the
`remove_config_values` parameter. These fields will then be removed from the
config before uploading, but will otherwise remain in the config file stored on
your local system.

> #### Example config
>
> ```ini
> [training.logger]
> @loggers = "spacy.WandbLogger.v1"
> project_name = "monitor_spacy_training"
> remove_config_values = ["paths.train", "paths.dev", "training.dev_corpus.path", "training.train_corpus.path"]
> ```

| Name                   | Description                                                                                                                            |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `project_name`         | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
| `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty).
~~List[str]~~ | + ## Batchers {#batchers source="spacy/gold/batchers.py" new="3"} A data batcher implements a batching strategy that essentially turns a stream of From 13ee742fb47466bf2b61e7c56094607a3bce815e Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 14:24:41 +0200 Subject: [PATCH 15/27] example of custom logger --- spacy/cli/train.py | 2 +- website/docs/usage/training.md | 49 +++++++++++++++++++++++++++++++--- 2 files changed, 46 insertions(+), 5 deletions(-) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index 655a5ae58..075b29c30 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -270,7 +270,7 @@ def train_while_improving( epoch (int): How many passes over the data have been completed. step (int): How many steps have been completed. - score (float): The main score form the last evaluation. + score (float): The main score from the last evaluation. other_scores: : The other scores from the last evaluation. losses: The accumulated losses throughout training. checkpoints: A list of previous results, where each result is a diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 878161b1b..069e7c00a 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -614,14 +614,55 @@ dictionary providing the following information: | -------------- | ---------------------------------------------------------------------------------------------- | | `epoch` | How many passes over the data have been completed. ~~int~~ | | `step` | How many steps have been completed. ~~int~~ | -| `score` | The main score form the last evaluation, measured on the dev set. ~~float~~ | +| `score` | The main score from the last evaluation, measured on the dev set. ~~float~~ | | `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ | -| `losses` | The accumulated training losses. ~~Dict[str, float]~~ | +| `losses` | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ | | `checkpoints` | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ | -By default, these results are written to the console with the [`ConsoleLogger`](/api/top-level#ConsoleLogger) +By default, these results are written to the console with the +[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support +for writing the log files to [Weights & Biases](https://www.wandb.com/) with +the [`WandbLogger`](/api/top-level#WandbLogger). 
But you can easily implement +your own logger as well, for instance to write the tabular results to file: -# TODO +```python +### functions.py +from typing import Tuple, Callable, Dict, Any +import spacy +from pathlib import Path + +@spacy.registry.loggers("my_custom_logger.v1") +def custom_logger(log_path): + def setup_logger(nlp: "Language") -> Tuple[Callable, Callable]: + with Path(log_path).open("w") as file_: + file_.write("step\t") + file_.write("score\t") + for pipe in nlp.pipe_names: + file_.write(f"loss_{pipe}\t") + file_.write("\n") + + def log_step(info: Dict[str, Any]): + with Path(log_path).open("a") as file_: + file_.write(f"{info['step']}\t") + file_.write(f"{info['score']}\t") + for pipe in nlp.pipe_names: + file_.write(f"{info['losses'][pipe]}\t") + file_.write("\n") + + def finalize(): + pass + + return log_step, finalize + + return setup_logger +``` + +```ini +### config.cfg (excerpt) +[training.logger] +@loggers = "my_custom_logger.v1" +file_path = "my_file.tab" +``` #### Example: Custom batch size schedule {#custom-code-schedule} From e47ea88aeb8f07465b3a46c320e4ab7d11acb482 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 14:40:55 +0200 Subject: [PATCH 16/27] revert annotations refactor --- website/docs/api/top-level.md | 54 +++++++-------- website/docs/api/transformer.md | 66 +++++++++---------- website/docs/usage/embeddings-transformers.md | 19 +++--- 3 files changed, 68 insertions(+), 71 deletions(-) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 518711a8a..2643c3c02 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -306,25 +306,24 @@ factories. > i += 1 > ``` -| Registry name | Description | -| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `annotation_setters` | Registry for functions that store Tok2Vec annotations on `Doc` objects. | -| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | -| `assets` | Registry for data assets, knowledge bases etc. | -| `batchers` | Registry for training and evaluation [data batchers](#batchers). | -| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | -| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | -| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | -| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | -| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | -| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | -| `loggers` | Registry for functions that log [training results](/usage/training). | -| `lookups` | Registry for large lookup tables available via `vocab.lookups`. 
| -| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | -| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | -| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | -| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | -| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. | +| Registry name | Description | +| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | +| `assets` | Registry for data assets, knowledge bases etc. | +| `batchers` | Registry for training and evaluation [data batchers](#batchers). | +| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | +| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | +| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | +| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | +| `loggers` | Registry for functions that log [training results](/usage/training). | +| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | +| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | +| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | +| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | +| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | +| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. 
| ### spacy-transformers registry {#registry-transformers} @@ -338,17 +337,18 @@ See the [`Transformer`](/api/transformer) API reference and > ```python > import spacy_transformers > -> @spacy_transformers.registry.span_getters("my_span_getter.v1") -> def configure_custom_span_getter() -> Callable: -> def span_getter(docs: List[Doc]) -> List[List[Span]]: -> # Transform each Doc into a List of Span objects +> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1") +> def configure_custom_annotation_setter(): +> def annotation_setter(docs, trf_data) -> None: +> # Set annotations on the docs > -> return span_getter +> return annotation_setter > ``` -| Registry name | Description | -| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | -| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | +| Registry name | Description | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | +| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | ## Loggers {#loggers source="spacy/gold/loggers.py" new="3"} diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index a3f6deb7d..0b38c2e8d 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -33,15 +33,16 @@ the [TransformerListener](/api/architectures#TransformerListener) layer. This works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer. -We calculate an alignment between the word-piece tokens and the spaCy -tokenization, so that we can use the last hidden states to store the information -on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the -spaCy token receives the sum of their values. By default, the information is -written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but -you can implement a custom [`@annotation_setter`](#annotation_setters) to change -this behaviour. The package also adds the function registry -[`@span_getters`](#span_getters) with several built-in registered functions. For -more details, see the [usage documentation](/usage/embeddings-transformers). +The component assigns the output of the transformer to the `Doc`'s extension +attributes. We also calculate an alignment between the word-piece tokens and the +spaCy tokenization, so that we can use the last hidden states to set the +`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy +token, the spaCy token receives the sum of their values. 
To access the values,
+you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
+package also adds the function registries [`@span_getters`](#span_getters) and
+[`@annotation_setters`](#annotation_setters) with several built-in registered
+functions. For more details, see the
+[usage documentation](/usage/embeddings-transformers).

 ## Config and implementation {#config}

@@ -60,11 +61,11 @@ on the transformer architectures and their arguments and hyperparameters.
 > nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
 > ```

-| Setting             | Description |
-| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `max_batch_items`   | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
-| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
-| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
+| Setting             | Description |
+| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `max_batch_items`   | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
+| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
+| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |

 ```python
 https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
@@ -97,10 +98,9 @@ Construct a `Transformer` component. One or more subsequent spaCy components
 can use the transformer outputs as features in its model, with gradients
 backpropagated to the single shared weights. The activations from the
 transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
-attribute by default, but you can provide a different `annotation_setter` to
-customize this behaviour. In your application, you would normally use a shortcut
-and instantiate the component using its string name and
-[`nlp.add_pipe`](/api/language#create_pipe).
+attribute. You can also provide a callback to set additional annotations. In
+your application, you would normally use a shortcut for this and instantiate the
+component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
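+
+A minimal sketch of this shortcut and of reading back the stored activations —
+it assumes an existing `nlp` pipeline with a transformer model installed, and
+the example text is arbitrary:
+
+```python
+# Shortcut: instantiate the component via its registered string name
+trf = nlp.add_pipe("transformer")
+# Once the pipeline has been initialized and run, the activations are
+# available on each Doc via the custom extension attribute:
+doc = nlp("This is a sentence.")
+trf_output = doc._.trf_data
+```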
| Name | Description | | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -205,9 +205,8 @@ modifying them. Assign the extracted features to the Doc objects. By default, the [`TransformerData`](/api/transformer#transformerdata) object is written to the -[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be -customized by providing a different `annotation_setter` argument upon -construction. +[`Doc._.trf_data`](#custom-attributes) attribute. Your annotation_setter +callback is then called, if provided. > #### Example > @@ -520,23 +519,20 @@ right context. ## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"} Annotation setters are functions that take a batch of `Doc` objects and a -[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the -annotations on the `Doc`, e.g. to set custom or built-in attributes. You can -register custom annotation setters using the `@registry.annotation_setters` -decorator. The default annotation setter used by the `Transformer` pipeline -component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute. +[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set +additional annotations on the `Doc`, e.g. to set custom or built-in attributes. +You can register custom annotation setters using the +`@registry.annotation_setters` decorator. > #### Example > > ```python -> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1") -> def configure_trfdata_setter() -> Callable: +> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1") +> def configure_null_annotation_setter() -> Callable: > def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None: -> doc_data = list(trf_data.doc_data) -> for doc, data in zip(docs, doc_data): -> doc._.trf_data = data +> pass > -> return setter +> return setter > ``` | Name | Description | @@ -546,9 +542,9 @@ component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute. The following built-in functions are available: -| Name | Description | -| -------------------------------------- | ------------------------------------------------------------- | -| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. | +| Name | Description | +| ---------------------------------------------- | ------------------------------------- | +| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. | ## Custom attributes {#custom-attributes} diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index cc49a86c2..aaa5fde10 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -252,12 +252,13 @@ for doc in nlp.pipe(["some text", "some other text"]): ``` You can also customize how the [`Transformer`](/api/transformer) component sets -annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`. -This callback will be called with the raw input and output data for the whole -batch, along with the batch of `Doc` objects, allowing you to implement whatever -you need. 
The annotation setter is called with a batch of [`Doc`](/api/doc)
-objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
-containing the transformers data for the batch.
+annotations onto the [`Doc`](/api/doc), by specifying a custom
+`annotation_setter`. This callback will be called with the raw input and output
+data for the whole batch, along with the batch of `Doc` objects, allowing you to
+implement whatever you need. The annotation setter is called with a batch of
+[`Doc`](/api/doc) objects and a
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
+transformers data for the batch.

 ```python
 def custom_annotation_setter(docs, trf_data):
@@ -370,9 +371,9 @@ To change any of the settings, you can edit the `config.cfg` and re-run the
 training. To change any of the functions, like the span getter, you can replace
 the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to
 process sentences. You can also register your own functions using the
-[`span_getters` registry](/api/top-level#registry). For instance, the following
-custom function returns [`Span`](/api/span) objects following sentence
-boundaries, unless a sentence succeeds a certain amount of tokens, in which case
+[`span_getters` registry](/api/top-level#registry). For instance, the following
+custom function returns [`Span`](/api/span) objects following sentence
+boundaries, unless a sentence exceeds a certain number of tokens, in which case
 subsentences of at most `max_length` tokens are returned.

> #### config.cfg

From 56ba691ecd628a73d3f21ed8cbce3fc6feec598f Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Mon, 31 Aug 2020 14:46:00 +0200
Subject: [PATCH 17/27] small fixes

---
 website/docs/api/transformer.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md
index 0b38c2e8d..5ac95cb29 100644
--- a/website/docs/api/transformer.md
+++ b/website/docs/api/transformer.md
@@ -102,14 +102,14 @@ attribute. You can also provide a callback to set additional annotations. In
 your application, you would normally use a shortcut for this and instantiate the
 component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).

-| Name                | Description |
-| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab`             | The shared vocabulary. ~~Vocab~~ |
-| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ |
-| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. By default, the function `trfdata_setter` sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
-| _keyword-only_      | |
-| `name`              | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
-| `max_batch_items`   | Maximum size of a padded batch. Defaults to `128*32`. 
~~int~~ | +| Name | Description | +| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ | +| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| _keyword-only_ | | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ | ## Transformer.\_\_call\_\_ {#call tag="method"} @@ -532,7 +532,7 @@ You can register custom annotation setters using the > def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None: > pass > -> return setter +> return setter > ``` | Name | Description | From 0e0abb03785d087b14716679efa6f543d1ceda91 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 14:50:29 +0200 Subject: [PATCH 18/27] fix --- website/docs/api/top-level.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 2643c3c02..834b71701 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -362,7 +362,7 @@ loss and the accuracy scores on the development set. There are two built-in logging functions: a logger printing results to the console in tabular format (which is the default), and one that also sends the -results to a [Weights & Biases`](https://www.wandb.com/) dashboard dashboard. +results to a [Weights & Biases](https://www.wandb.com/) dashboard dashboard. Instead of using one of the built-in batchers listed here, you can also [implement your own](/usage/training#custom-code-readers-batchers), which may or may not use a custom schedule. From fe6c08218e0e57d9b9f936634e64330fa224323f Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 14:51:49 +0200 Subject: [PATCH 19/27] fixes --- website/docs/api/top-level.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 834b71701..bde64b77b 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -362,10 +362,9 @@ loss and the accuracy scores on the development set. There are two built-in logging functions: a logger printing results to the console in tabular format (which is the default), and one that also sends the -results to a [Weights & Biases](https://www.wandb.com/) dashboard dashboard. -Instead of using one of the built-in batchers listed here, you can also -[implement your own](/usage/training#custom-code-readers-batchers), which may or -may not use a custom schedule. +results to a [Weights & Biases](https://www.wandb.com/) dashboard. +Instead of using one of the built-in loggers listed here, you can also +[implement your own](/usage/training#custom-logging). 
> #### Example config > From 6340d1c63d1d29e603ff8d84eaf23032e2687698 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Mon, 31 Aug 2020 14:53:22 +0200 Subject: [PATCH 20/27] Add as_spans to Matcher/PhraseMatcher --- spacy/matcher/matcher.pyx | 14 ++++++++++---- spacy/matcher/phrasematcher.pyx | 15 +++++++++++---- spacy/tests/matcher/test_matcher_api.py | 17 ++++++++++++++++- spacy/tests/matcher/test_phrase_matcher.py | 18 +++++++++++++++++- 4 files changed, 54 insertions(+), 10 deletions(-) diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index 16ab73735..fdce7e9fa 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -203,13 +203,16 @@ cdef class Matcher: else: yield doc - def __call__(self, object doclike): + def __call__(self, object doclike, as_spans=False): """Find all token sequences matching the supplied pattern. doclike (Doc or Span): The document to match over. - RETURNS (list): A list of `(key, start, end)` tuples, + as_spans (bool): Return Span objects with labels instead of (match_id, + start, end) tuples. + RETURNS (list): A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span - `doc[start:end]`. The `label_id` and `key` are both integers. + `doc[start:end]`. The `match_id` is an integer. If as_spans is set + to True, a list of Span objects is returned. """ if isinstance(doclike, Doc): doc = doclike @@ -262,7 +265,10 @@ cdef class Matcher: on_match = self._callbacks.get(key, None) if on_match is not None: on_match(self, doc, i, final_matches) - return final_matches + if as_spans: + return [Span(doc, start, end, label=key) for key, start, end in final_matches] + else: + return final_matches def _normalize_key(self, key): if isinstance(key, basestring): diff --git a/spacy/matcher/phrasematcher.pyx b/spacy/matcher/phrasematcher.pyx index 060c4d37f..6658c713e 100644 --- a/spacy/matcher/phrasematcher.pyx +++ b/spacy/matcher/phrasematcher.pyx @@ -7,6 +7,7 @@ import warnings from ..attrs cimport ORTH, POS, TAG, DEP, LEMMA from ..structs cimport TokenC from ..tokens.token cimport Token +from ..tokens.span cimport Span from ..typedefs cimport attr_t from ..schemas import TokenPattern @@ -216,13 +217,16 @@ cdef class PhraseMatcher: result = internal_node map_set(self.mem, result, self.vocab.strings[key], NULL) - def __call__(self, doc): + def __call__(self, doc, as_spans=False): """Find all sequences matching the supplied patterns on the `Doc`. doc (Doc): The document to match over. - RETURNS (list): A list of `(key, start, end)` tuples, + as_spans (bool): Return Span objects with labels instead of (match_id, + start, end) tuples. + RETURNS (list): A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span - `doc[start:end]`. The `label_id` and `key` are both integers. + `doc[start:end]`. The `match_id` is an integer. If as_spans is set + to True, a list of Span objects is returned. 
DOCS: https://spacy.io/api/phrasematcher#call """ @@ -239,7 +243,10 @@ cdef class PhraseMatcher: on_match = self._callbacks.get(self.vocab.strings[ent_id]) if on_match is not None: on_match(self, doc, i, matches) - return matches + if as_spans: + return [Span(doc, start, end, label=key) for key, start, end in matches] + else: + return matches cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil: cdef MapStruct* current_node = self.c_map diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index bcb224bd3..aeac509b5 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -2,7 +2,8 @@ import pytest import re from mock import Mock from spacy.matcher import Matcher, DependencyMatcher -from spacy.tokens import Doc, Token +from spacy.tokens import Doc, Token, Span + from ..doc.test_underscore import clean_underscore # noqa: F401 @@ -469,3 +470,17 @@ def test_matcher_span(matcher): assert len(matcher(doc)) == 2 assert len(matcher(span_js)) == 1 assert len(matcher(span_java)) == 1 + + +def test_matcher_as_spans(matcher): + """Test the new as_spans=True API.""" + text = "JavaScript is good but Java is better" + doc = Doc(matcher.vocab, words=text.split()) + matches = matcher(doc, as_spans=True) + assert len(matches) == 2 + assert isinstance(matches[0], Span) + assert matches[0].text == "JavaScript" + assert matches[0].label_ == "JS" + assert isinstance(matches[1], Span) + assert matches[1].text == "Java" + assert matches[1].label_ == "Java" diff --git a/spacy/tests/matcher/test_phrase_matcher.py b/spacy/tests/matcher/test_phrase_matcher.py index 2a3c7d693..edffaa900 100644 --- a/spacy/tests/matcher/test_phrase_matcher.py +++ b/spacy/tests/matcher/test_phrase_matcher.py @@ -2,7 +2,7 @@ import pytest import srsly from mock import Mock from spacy.matcher import PhraseMatcher -from spacy.tokens import Doc +from spacy.tokens import Doc, Span from ..util import get_doc @@ -287,3 +287,19 @@ def test_phrase_matcher_pickle(en_vocab): # clunky way to vaguely check that callback is unpickled (vocab, docs, callbacks, attr) = matcher_unpickled.__reduce__()[1] assert isinstance(callbacks.get("TEST2"), Mock) + + +def test_phrase_matcher_as_spans(en_vocab): + """Test the new as_spans=True API.""" + matcher = PhraseMatcher(en_vocab) + matcher.add("A", [Doc(en_vocab, words=["hello", "world"])]) + matcher.add("B", [Doc(en_vocab, words=["test"])]) + doc = Doc(en_vocab, words=["...", "hello", "world", "this", "is", "a", "test"]) + matches = matcher(doc, as_spans=True) + assert len(matches) == 2 + assert isinstance(matches[0], Span) + assert matches[0].text == "hello world" + assert matches[0].label_ == "A" + assert isinstance(matches[1], Span) + assert matches[1].text == "test" + assert matches[1].label_ == "B" From 83aff38c59a638a795a154c51a25de3f98558a31 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Mon, 31 Aug 2020 15:39:03 +0200 Subject: [PATCH 21/27] Make argument keyword-only Co-authored-by: Matthew Honnibal --- spacy/matcher/matcher.pyx | 2 +- spacy/matcher/phrasematcher.pyx | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index fdce7e9fa..ee8efd688 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -203,7 +203,7 @@ cdef class Matcher: else: yield doc - def __call__(self, object doclike, as_spans=False): + def __call__(self, object doclike, *, as_spans=False): """Find all token sequences matching the supplied 
pattern.

        doclike (Doc or Span): The document to match over.
diff --git a/spacy/matcher/phrasematcher.pyx b/spacy/matcher/phrasematcher.pyx
index 6658c713e..44dda115b 100644
--- a/spacy/matcher/phrasematcher.pyx
+++ b/spacy/matcher/phrasematcher.pyx
@@ -217,7 +217,7 @@ cdef class PhraseMatcher:
             result = internal_node
         map_set(self.mem, result, self.vocab.strings[key], NULL)

-    def __call__(self, doc, as_spans=False):
+    def __call__(self, doc, *, as_spans=False):
        """Find all sequences matching the supplied patterns on the `Doc`.

        doc (Doc): The document to match over.

From db9f8896f5f8e4c6bde1839a4f04a06babbf64fc Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Mon, 31 Aug 2020 16:10:41 +0200
Subject: [PATCH 22/27] Add docs [ci skip]

---
 website/docs/api/matcher.md               | 10 ++++---
 website/docs/api/phrasematcher.md         | 10 ++++---
 website/docs/usage/rule-based-matching.md | 33 +++++++++++++++++++++++
 3 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md
index f259174e2..136bac3c8 100644
--- a/website/docs/api/matcher.md
+++ b/website/docs/api/matcher.md
@@ -116,10 +116,12 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
 > matches = matcher(doc)
 > ```

-| Name        | Description |
-| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `doclike`   | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
-| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. ~~List[Tuple[int, int, int]]~~ |
+| Name                                  | Description |
+| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `doclike`                             | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
+| _keyword-only_                        | |
+| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
+| **RETURNS**                           | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |

diff --git a/website/docs/api/phrasematcher.md b/website/docs/api/phrasematcher.md
index 143eb9edf..8064a621e 100644
--- a/website/docs/api/phrasematcher.md
+++ b/website/docs/api/phrasematcher.md
@@ -57,10 +57,12 @@ Find all token sequences matching the supplied patterns on the `Doc`.
 > matches = matcher(doc)
 > ```

-| Name        | Description |
-| ----------- | ----------------------------------- |
-| `doc`       | The document to match over. ~~Doc~~ |
-| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. ~~List[Tuple[int, int, int]]~~ |
+| Name                                  | Description |
+| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `doc`                                 | The document to match over. ~~Doc~~ |
+| _keyword-only_                        | |
+| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
+| **RETURNS**                           | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |

diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index 7fdce032e..e3e0f2c19 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -493,6 +493,39 @@ you prefer.
 | `i`       | Index of the current match (`matches[i]`). ~~int~~ |
 | `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. ~~List[Tuple[int, int, int]]~~ |

+### Creating spans from matches {#matcher-spans}
+
+Creating [`Span`](/api/span) objects from the returned matches is a very common
+use case. spaCy makes this easy by giving you access to the `start` and `end`
+token of each match, which you can use to construct a new span with an optional
+label. As of spaCy v3.0, you can also set `as_spans=True` when calling the
+matcher on a `Doc`, which will return a list of [`Span`](/api/span) objects
+using the `match_id` as the span label.
+
+```python
+### {executable="true"}
+import spacy
+from spacy.matcher import Matcher
+from spacy.tokens import Span
+
+nlp = spacy.blank("en")
+matcher = Matcher(nlp.vocab)
+matcher.add("PERSON", [[{"lower": "barack"}, {"lower": "obama"}]])
+doc = nlp("Barack Obama was the 44th president of the United States")
+
+# 1. Return (match_id, start, end) tuples
+matches = matcher(doc)
+for match_id, start, end in matches:
+    # Create the matched span and assign the match_id as a label
+    span = Span(doc, start, end, label=match_id)
+    print(span.text, span.label_)
+
+# 2. Return Span objects directly
+matches = matcher(doc, as_spans=True)
+for span in matches:
+    print(span.text, span.label_)
+```
+
 ### Using custom pipeline components {#matcher-pipeline}

 Let's say your data also contains some annoying pre-processing artifacts, like

From 3929431af1991d76f3594f89c8dda2c87ee8d5e6 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Mon, 31 Aug 2020 16:39:53 +0200
Subject: [PATCH 23/27] Update docs [ci skip]

---
 website/docs/api/top-level.md  | 19 +++++++++-----
 website/docs/usage/projects.md |  2 +-
 website/docs/usage/training.md | 47 ++++++++++++++++++----------------
 3 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index bde64b77b..d437ecc07 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -357,8 +357,8 @@ are returned: one for logging the information for each training step, and a
 second function that is called to finalize the logging when the training is
 finished. To log each training step, a
 [dictionary](/usage/training#custom-logging) is passed on from the
-[training script](/api/cli#train), including information such as the training
-loss and the accuracy scores on the development set.
+[`spacy train`](/api/cli#train), including information such as the training loss
+and the accuracy scores on the development set.
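+
+A minimal sketch of this contract — the registered name
+`"my_minimal_logger.v1"` is hypothetical, and the setup function receives the
+`nlp` object and returns the two callables described above:
+
+```python
+from typing import Any, Callable, Dict, Tuple
+import spacy
+
+@spacy.registry.loggers("my_minimal_logger.v1")
+def configure_minimal_logger() -> Callable:
+    def setup_logger(nlp: "Language") -> Tuple[Callable, Callable]:
+        def log_step(info: Dict[str, Any]) -> None:
+            # Called once per training step with the dictionary of results
+            print(info["step"], info["score"])
+
+        def finalize() -> None:
+            # Called once when training is finished
+            pass
+
+        return log_step, finalize
+
+    return setup_logger
+```
+
+The fuller `functions.py` example in the training docs shows the same pattern
+with a file-based logger.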
There are two built-in logging functions: a logger printing results to the console in tabular format (which is the default), and one that also sends the -results to a [Weights & Biases](https://www.wandb.com/) dashboard. -Instead of using one of the built-in loggers listed here, you can also +results to a [Weights & Biases](https://www.wandb.com/) dashboard. Instead of +using one of the built-in loggers listed here, you can also [implement your own](/usage/training#custom-logging). > #### Example config @@ -394,11 +394,16 @@ memory utilization, network traffic, disk IO, GPU statistics, etc. This will also include information such as your hostname and operating system, as well as the location of your Python executable. -Note that by default, the full (interpolated) training config file is sent over -to the W&B dashboard. If you prefer to exclude certain information such as path -names, you can list those fields in "dot notation" in the `remove_config_values` -parameter. These fields will then be removed from the config before uploading, -but will otherwise remain in the config file stored on your local system. + + +Note that by default, the full (interpolated) +[training config](/usage/training#config) is sent over to the W&B dashboard. If +you prefer to **exclude certain information** such as path names, you can list +those fields in "dot notation" in the `remove_config_values` parameter. These +fields will then be removed from the config before uploading, but will otherwise +remain in the config file stored on your local system. + + > #### Example config > diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md index 620526280..ef895195c 100644 --- a/website/docs/usage/projects.md +++ b/website/docs/usage/projects.md @@ -914,4 +914,4 @@ mattis pretium. ### Weights & Biases {#wandb} - + diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 069e7c00a..20f25924e 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -607,8 +607,12 @@ $ python -m spacy train config.cfg --output ./output --code ./functions.py #### Example: Custom logging function {#custom-logging} -During training, the results of each step are passed to a logger function in a -dictionary providing the following information: +During training, the results of each step are passed to a logger function. By +default, these results are written to the console with the +[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support +for writing the log files to [Weights & Biases](https://www.wandb.com/) with the +[`WandbLogger`](/api/top-level#WandbLogger). The logger function receives a +**dictionary** with the following keys: | Key | Value | | -------------- | ---------------------------------------------------------------------------------------------- | @@ -619,11 +623,17 @@ dictionary providing the following information: | `losses` | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ | | `checkpoints` | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ | -By default, these results are written to the console with the -[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support -for writing the log files to [Weights & Biases](https://www.wandb.com/) with -the [`WandbLogger`](/api/top-level#WandbLogger). 
But you can easily implement -your own logger as well, for instance to write the tabular results to file: +You can easily implement and plug in your own logger that records the training +results in a custom way, or sends them to an experiment management tracker of +your choice. In this example, the function `my_custom_logger.v1` writes the +tabular results to a file: + +> ```ini +> ### config.cfg (excerpt) +> [training.logger] +> @loggers = "my_custom_logger.v1" +> file_path = "my_file.tab" +> ``` ```python ### functions.py @@ -635,19 +645,19 @@ from pathlib import Path def custom_logger(log_path): def setup_logger(nlp: "Language") -> Tuple[Callable, Callable]: with Path(log_path).open("w") as file_: - file_.write("step\t") - file_.write("score\t") + file_.write("step\\t") + file_.write("score\\t") for pipe in nlp.pipe_names: - file_.write(f"loss_{pipe}\t") - file_.write("\n") + file_.write(f"loss_{pipe}\\t") + file_.write("\\n") def log_step(info: Dict[str, Any]): with Path(log_path).open("a") as file_: - file_.write(f"{info['step']}\t") - file_.write(f"{info['score']}\t") + file_.write(f"{info['step']}\\t") + file_.write(f"{info['score']}\\t") for pipe in nlp.pipe_names: - file_.write(f"{info['losses'][pipe]}\t") - file_.write("\n") + file_.write(f"{info['losses'][pipe]}\\t") + file_.write("\\n") def finalize(): pass @@ -657,13 +667,6 @@ def custom_logger(log_path): return setup_logger ``` -```ini -### config.cfg (excerpt) -[training.logger] -@loggers = "my_custom_logger.v1" -file_path = "my_file.tab" -``` - #### Example: Custom batch size schedule {#custom-code-schedule} For example, let's say you've implemented your own batch size schedule to use From 2c3b64a567b2afb54edef14dbd8b87f2adb5e7e6 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 16:56:13 +0200 Subject: [PATCH 24/27] console logging example --- website/docs/api/top-level.md | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index b1a2d9532..b747007b0 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -377,6 +377,37 @@ using one of the built-in loggers listed here, you can also Writes the results of a training step to the console in a tabular format. + + +``` +$ python -m spacy train config.cfg +ℹ Using CPU +ℹ Loading config and nlp from: config.cfg +ℹ Pipeline: ['tok2vec', 'tagger'] +ℹ Start training +ℹ Training. Initial learn rate: 0.0 +E # LOSS TOK2VEC LOSS TAGGER TAG_ACC SCORE +--- ------ ------------ ----------- ------- ------ + 1 0 0.00 86.20 0.22 0.00 + 1 200 3.08 18968.78 34.00 0.34 + 1 400 31.81 22539.06 33.64 0.34 + 1 600 92.13 22794.91 43.80 0.44 + 1 800 183.62 21541.39 56.05 0.56 + 1 1000 352.49 25461.82 65.15 0.65 + 1 1200 422.87 23708.82 71.84 0.72 + 1 1400 601.92 24994.79 76.57 0.77 + 1 1600 662.57 22268.02 80.20 0.80 + 1 1800 1101.50 28413.77 82.56 0.83 + 1 2000 1253.43 28736.36 85.00 0.85 + 1 2200 1411.02 28237.53 87.42 0.87 + 1 2400 1605.35 28439.95 88.70 0.89 +``` + +Note that the cumulative loss keeps increasing within one epoch, but should +start decreasing across epochs. 
+ + + #### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"} > #### Installation From add9de548717f793b6dc5fa577aae34e1a24b874 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Mon, 31 Aug 2020 17:01:24 +0200 Subject: [PATCH 25/27] Deprecate (Phrase)Matcher.pipe --- spacy/errors.py | 3 +++ spacy/matcher/matcher.pyx | 14 +++----------- spacy/matcher/phrasematcher.pyx | 16 +++------------- spacy/tests/matcher/test_matcher_api.py | 9 +++++++++ spacy/tests/matcher/test_phrase_matcher.py | 11 +++++++++++ website/docs/api/matcher.md | 21 --------------------- website/docs/api/phrasematcher.md | 21 --------------------- website/docs/usage/rule-based-matching.md | 9 --------- website/docs/usage/v3.md | 1 + 9 files changed, 30 insertions(+), 75 deletions(-) diff --git a/spacy/errors.py b/spacy/errors.py index e53aaef07..b99c99959 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -112,6 +112,9 @@ class Warnings: "word segmenters: {supported}. Defaulting to {default}.") W104 = ("Skipping modifications for '{target}' segmenter. The current " "segmenter is '{current}'.") + W105 = ("As of spaCy v3.0, the {matcher}.pipe method is deprecated. If you " + "need to match on a stream of documents, you can use nlp.pipe and " + "call the {matcher} on each Doc object.") @add_codes diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index ee8efd688..d3a8fa539 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -176,18 +176,10 @@ cdef class Matcher: return (self._callbacks[key], self._patterns[key]) def pipe(self, docs, batch_size=1000, return_matches=False, as_tuples=False): - """Match a stream of documents, yielding them in turn. - - docs (Iterable[Union[Doc, Span]]): A stream of documents or spans. - batch_size (int): Number of documents to accumulate into a working set. - return_matches (bool): Yield the match lists along with the docs, making - results (doc, matches) tuples. - as_tuples (bool): Interpret the input stream as (doc, context) tuples, - and yield (result, context) tuples out. - If both return_matches and as_tuples are True, the output will - be a sequence of ((doc, matches), context) tuples. - YIELDS (Doc): Documents, in order. + """Match a stream of documents, yielding them in turn. Deprecated as of + spaCy v3.0. """ + warnings.warn(Warnings.W105.format(matcher="Matcher"), DeprecationWarning) if as_tuples: for doc, context in docs: matches = self(doc) diff --git a/spacy/matcher/phrasematcher.pyx b/spacy/matcher/phrasematcher.pyx index 44dda115b..ba0f515b5 100644 --- a/spacy/matcher/phrasematcher.pyx +++ b/spacy/matcher/phrasematcher.pyx @@ -292,20 +292,10 @@ cdef class PhraseMatcher: idx += 1 def pipe(self, stream, batch_size=1000, return_matches=False, as_tuples=False): - """Match a stream of documents, yielding them in turn. - - docs (iterable): A stream of documents. - batch_size (int): Number of documents to accumulate into a working set. - return_matches (bool): Yield the match lists along with the docs, making - results (doc, matches) tuples. - as_tuples (bool): Interpret the input stream as (doc, context) tuples, - and yield (result, context) tuples out. - If both return_matches and as_tuples are True, the output will - be a sequence of ((doc, matches), context) tuples. - YIELDS (Doc): Documents, in order. - - DOCS: https://spacy.io/api/phrasematcher#pipe + """Match a stream of documents, yielding them in turn. Deprecated as of + spaCy v3.0. 
""" + warnings.warn(Warnings.W105.format(matcher="PhraseMatcher"), DeprecationWarning) if as_tuples: for doc, context in stream: matches = self(doc) diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index aeac509b5..8310c4466 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -484,3 +484,12 @@ def test_matcher_as_spans(matcher): assert isinstance(matches[1], Span) assert matches[1].text == "Java" assert matches[1].label_ == "Java" + + +def test_matcher_deprecated(matcher): + doc = Doc(matcher.vocab, words=["hello", "world"]) + with pytest.warns(DeprecationWarning) as record: + for _ in matcher.pipe([doc]): + pass + assert record.list + assert "spaCy v3.0" in str(record.list[0].message) diff --git a/spacy/tests/matcher/test_phrase_matcher.py b/spacy/tests/matcher/test_phrase_matcher.py index edffaa900..4b7027f87 100644 --- a/spacy/tests/matcher/test_phrase_matcher.py +++ b/spacy/tests/matcher/test_phrase_matcher.py @@ -303,3 +303,14 @@ def test_phrase_matcher_as_spans(en_vocab): assert isinstance(matches[1], Span) assert matches[1].text == "test" assert matches[1].label_ == "B" + + +def test_phrase_matcher_deprecated(en_vocab): + matcher = PhraseMatcher(en_vocab) + matcher.add("TEST", [Doc(en_vocab, words=["helllo"])]) + doc = Doc(en_vocab, words=["hello", "world"]) + with pytest.warns(DeprecationWarning) as record: + for _ in matcher.pipe([doc]): + pass + assert record.list + assert "spaCy v3.0" in str(record.list[0].message) diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md index 136bac3c8..1f1946be5 100644 --- a/website/docs/api/matcher.md +++ b/website/docs/api/matcher.md @@ -123,27 +123,6 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`. | `as_spans` 3 | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ | | **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | -## Matcher.pipe {#pipe tag="method"} - -Match a stream of documents, yielding them in turn. - -> #### Example -> -> ```python -> from spacy.matcher import Matcher -> matcher = Matcher(nlp.vocab) -> for doc in matcher.pipe(docs, batch_size=50): -> pass -> ``` - -| Name | Description | -| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `docs` | A stream of documents or spans. ~~Iterable[Union[Doc, Span]]~~ | -| `batch_size` | The number of documents to accumulate into a working set. ~~int~~ | -| `return_matches` 2.1 | Yield the match lists along with the docs, making results `(doc, matches)` tuples. ~~bool~~ | -| `as_tuples` | Interpret the input stream as `(doc, context)` tuples, and yield `(result, context)` tuples out. If both `return_matches` and `as_tuples` are `True`, the output will be a sequence of `((doc, matches), context)` tuples. ~~bool~~ | -| **YIELDS** | Documents, in order. 
~~Union[Doc, Tuple[Doc, Any], Tuple[Tuple[Doc, Any], Any]]~~ | - ## Matcher.\_\_len\_\_ {#len tag="method" new="2"} Get the number of rules added to the matcher. Note that this only returns the diff --git a/website/docs/api/phrasematcher.md b/website/docs/api/phrasematcher.md index 8064a621e..39e3a298b 100644 --- a/website/docs/api/phrasematcher.md +++ b/website/docs/api/phrasematcher.md @@ -76,27 +76,6 @@ match_id_string = nlp.vocab.strings[match_id] -## PhraseMatcher.pipe {#pipe tag="method"} - -Match a stream of documents, yielding them in turn. - -> #### Example -> -> ```python -> from spacy.matcher import PhraseMatcher -> matcher = PhraseMatcher(nlp.vocab) -> for doc in matcher.pipe(docs, batch_size=50): -> pass -> ``` - -| Name | Description | -| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `docs` | A stream of documents. ~~Iterable[Doc]~~ | -| `batch_size` | The number of documents to accumulate into a working set. ~~int~~ | -| `return_matches` 2.1 | Yield the match lists along with the docs, making results `(doc, matches)` tuples. ~~bool~~ | -| `as_tuples` | Interpret the input stream as `(doc, context)` tuples, and yield `(result, context)` tuples out. If both `return_matches` and `as_tuples` are `True`, the output will be a sequence of `((doc, matches), context)` tuples. ~~bool~~ | -| **YIELDS** | Documents and optional matches or context in order. ~~Union[Doc, Tuple[Doc, Any], Tuple[Tuple[Doc, Any], Any]]~~ | - ## PhraseMatcher.\_\_len\_\_ {#len tag="method"} Get the number of rules added to the matcher. Note that this only returns the diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index e3e0f2c19..a589c556e 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -856,15 +856,6 @@ for token in doc: print(token.text, token._.is_hashtag) ``` -To process a stream of social media posts, we can use -[`Language.pipe`](/api/language#pipe), which will return a stream of `Doc` -objects that we can pass to [`Matcher.pipe`](/api/matcher#pipe). - -```python -docs = nlp.pipe(LOTS_OF_TWEETS) -matches = matcher.pipe(docs) -``` - ## Efficient phrase matching {#phrasematcher} If you need to match large terminology lists, you can also use the diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index de3d7ce33..6a1499bdf 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -389,6 +389,7 @@ Note that spaCy v3.0 now requires **Python 3.6+**. 
| `GoldParse` | [`Example`](/api/example) | | `GoldCorpus` | [`Corpus`](/api/corpus) | | `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) | +| `Matcher.pipe`, `PhraseMatcher.pipe` | not needed | | `spacy init-model` | [`spacy init model`](/api/cli#init-model) | | `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) | | `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) | From 3929431af1991d76f3594f89c8dda2c87ee8d5e6 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Mon, 31 Aug 2020 17:06:33 +0200 Subject: [PATCH 26/27] Update docs [ci skip] --- website/docs/api/top-level.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index b747007b0..d437ecc07 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -357,8 +357,8 @@ are returned: one for logging the information for each training step, and a second function that is called to finalize the logging when the training is finished. To log each training step, a [dictionary](/usage/training#custom-logging) is passed on from the -[training script](/api/cli#train), including information such as the training -loss and the accuracy scores on the development set. +[`spacy train`](/api/cli#train), including information such as the training loss +and the accuracy scores on the development set. There are two built-in logging functions: a logger printing results to the console in tabular format (which is the default), and one that also sends the @@ -366,6 +366,8 @@ results to a [Weights & Biases](https://www.wandb.com/) dashboard. Instead of using one of the built-in loggers listed here, you can also [implement your own](/usage/training#custom-logging). +#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"} + > #### Example config > > ```ini @@ -373,19 +375,21 @@ using one of the built-in loggers listed here, you can also > @loggers = "spacy.ConsoleLogger.v1" > ``` -#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"} - Writes the results of a training step to the console in a tabular format. - + + +```cli +$ python -m spacy train config.cfg +``` ``` -$ python -m spacy train config.cfg ℹ Using CPU ℹ Loading config and nlp from: config.cfg ℹ Pipeline: ['tok2vec', 'tagger'] ℹ Start training ℹ Training. Initial learn rate: 0.0 + E # LOSS TOK2VEC LOSS TAGGER TAG_ACC SCORE --- ------ ------------ ----------- ------- ------ 1 0 0.00 86.20 0.22 0.00 From 3ac620f09d2d18dcc4e347f610eeb3aba32875a0 Mon Sep 17 00:00:00 2001 From: Sofie Van Landeghem Date: Mon, 31 Aug 2020 17:40:04 +0200 Subject: [PATCH 27/27] fix config example [ci skip] --- website/docs/usage/training.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 20f25924e..2d7905230 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -632,7 +632,7 @@ tabular results to a file: > ### config.cfg (excerpt) > [training.logger] > @loggers = "my_custom_logger.v1" -> file_path = "my_file.tab" +> log_path = "my_file.tab" > ``` ```python