From feb86d52066da8d53ca08b050621b1dd15ab2c09 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 26 Aug 2020 11:21:30 +0200 Subject: [PATCH 01/84] clarify default --- website/docs/api/architectures.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 3089fa1b3..55b456656 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -118,11 +118,11 @@ Instead of defining its own `Tok2Vec` instance, a model architecture like [Tagger](/api/architectures#tagger) can define a listener as its `tok2vec` argument that connects to the shared `tok2vec` component in the pipeline. -| Name | Description | -| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ | -| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ | -| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ | +| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | ### spacy.MultiHashEmbed.v1 {#MultiHashEmbed} From 15902c5aa27d18e4c6e9aff2a20e54a83216d0c7 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 26 Aug 2020 11:51:57 +0200 Subject: [PATCH 02/84] fix link --- website/docs/api/transformer.md | 4 ++-- website/docs/usage/embeddings-transformers.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index c32651e02..b09455b41 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -29,7 +29,7 @@ This pipeline component lets you use transformer models in your pipeline. Supports all models that are available via the [HuggingFace `transformers`](https://huggingface.co/transformers) library. Usually you will connect subsequent components to the shared transformer using -the [TransformerListener](/api/architectures#TransformerListener) layer. This +the [TransformerListener](/api/architectures##transformers-Tok2VecListener) layer. 
This works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer. @@ -233,7 +233,7 @@ The `Transformer` component therefore does **not** perform a weight update during its own `update` method. Instead, it runs its transformer model and communicates the output and the backpropagation callback to any **downstream components** that have been connected to it via the -[TransformerListener](/api/architectures#TransformerListener) sublayer. If there +[TransformerListener](/api/architectures##transformers-Tok2VecListener) sublayer. If there are multiple listeners, the last layer will actually backprop to the transformer and call the optimizer, while the others simply increment the gradients. diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index e2c1a6fd0..b5f58927a 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -101,7 +101,7 @@ it processes a batch of documents, it will pass forward its predictions to the listeners, allowing the listeners to **reuse the predictions** when they are eventually called. A similar mechanism is used to pass gradients from the listeners back to the model. The [`Transformer`](/api/transformer) component and -[TransformerListener](/api/architectures#TransformerListener) layer do the same +[TransformerListener](/api/architectures#transformers-Tok2VecListener) layer do the same thing for transformer models, but the `Transformer` component will also save the transformer outputs to the [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute, @@ -179,7 +179,7 @@ interoperates with [PyTorch](https://pytorch.org) and the giving you access to thousands of pretrained models for your pipelines. There are many [great guides](http://jalammar.github.io/illustrated-transformer/) to transformer models, but for practical purposes, you can simply think of them as -a drop-in replacement that let you achieve **higher accuracy** in exchange for +drop-in replacements that let you achieve **higher accuracy** in exchange for **higher training and runtime costs**. ### Setup and installation {#transformers-installation} From ec069627febd542cf1741e8be5e88e03cdda43ae Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 26 Aug 2020 13:31:01 +0200 Subject: [PATCH 03/84] rename to TransformerListener --- website/docs/api/architectures.md | 4 ++-- website/docs/api/transformer.md | 4 ++-- website/docs/usage/embeddings-transformers.md | 2 +- website/docs/usage/v3.md | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 55b456656..374c133ff 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -346,13 +346,13 @@ in other components, see | `tokenizer_config` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). ~~Dict[str, Any]~~ | | **CREATES** | The model using the architecture. 
~~Model[List[Doc], FullTransformerBatch]~~ | -### spacy-transformers.Tok2VecListener.v1 {#transformers-Tok2VecListener} +### spacy-transformers.TransformerListener.v1 {#TransformerListener} > #### Example Config > > ```ini > [model] -> @architectures = "spacy-transformers.Tok2VecListener.v1" +> @architectures = "spacy-transformers.TransformerListener.v1" > grad_factor = 1.0 > > [model.pooling] diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index b09455b41..c32651e02 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -29,7 +29,7 @@ This pipeline component lets you use transformer models in your pipeline. Supports all models that are available via the [HuggingFace `transformers`](https://huggingface.co/transformers) library. Usually you will connect subsequent components to the shared transformer using -the [TransformerListener](/api/architectures##transformers-Tok2VecListener) layer. This +the [TransformerListener](/api/architectures#TransformerListener) layer. This works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer. @@ -233,7 +233,7 @@ The `Transformer` component therefore does **not** perform a weight update during its own `update` method. Instead, it runs its transformer model and communicates the output and the backpropagation callback to any **downstream components** that have been connected to it via the -[TransformerListener](/api/architectures##transformers-Tok2VecListener) sublayer. If there +[TransformerListener](/api/architectures#TransformerListener) sublayer. If there are multiple listeners, the last layer will actually backprop to the transformer and call the optimizer, while the others simply increment the gradients. diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index b5f58927a..62336a826 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -101,7 +101,7 @@ it processes a batch of documents, it will pass forward its predictions to the listeners, allowing the listeners to **reuse the predictions** when they are eventually called. A similar mechanism is used to pass gradients from the listeners back to the model. 
The [`Transformer`](/api/transformer) component and -[TransformerListener](/api/architectures#transformers-Tok2VecListener) layer do the same +[TransformerListener](/api/architectures#TransformerListener) layer do the same thing for transformer models, but the `Transformer` component will also save the transformer outputs to the [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute, diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index bf0c13b68..5d55a788f 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -64,7 +64,7 @@ menu: [`TransformerData`](/api/transformer#transformerdata), [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel), - [Tok2VecListener](/api/architectures#transformers-Tok2VecListener), + [TransformerListener](/api/architectures#TransformerListener), [Tok2VecTransformer](/api/architectures#Tok2VecTransformer) - **Models:** [`en_core_trf_lg_sm`](/models/en) - **Implementation:** From 559b65f2e08ca3d4ed3c04bc0c58e241aef2b1a6 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 09:43:32 +0200 Subject: [PATCH 04/84] adjust references to null_annotation_setter to trfdata_setter --- website/docs/api/transformer.md | 59 ++++++++++--------- website/docs/usage/embeddings-transformers.md | 6 +- 2 files changed, 34 insertions(+), 31 deletions(-) diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index c32651e02..0b51487ed 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -25,24 +25,23 @@ work out-of-the-box. -This pipeline component lets you use transformer models in your pipeline. -Supports all models that are available via the +This pipeline component lets you use transformer models in your pipeline. It +supports all models that are available via the [HuggingFace `transformers`](https://huggingface.co/transformers) library. Usually you will connect subsequent components to the shared transformer using the [TransformerListener](/api/architectures#TransformerListener) layer. This works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer. -The component assigns the output of the transformer to the `Doc`'s extension -attributes. We also calculate an alignment between the word-piece tokens and the -spaCy tokenization, so that we can use the last hidden states to set the -`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy -token, the spaCy token receives the sum of their values. To access the values, -you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The -package also adds the function registries [`@span_getters`](#span_getters) and -[`@annotation_setters`](#annotation_setters) with several built-in registered -functions. For more details, see the -[usage documentation](/usage/embeddings-transformers). +We calculate an alignment between the word-piece tokens and the spaCy +tokenization, so that we can use the last hidden states to store the information +on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the +spaCy token receives the sum of their values. By default, the information is +written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but +you can implement a custom [`@annotation_setter`](#annotation_setters) to change +this behaviour. 
The package also adds the function registry +[`@span_getters`](#span_getters) with several built-in registered functions. For +more details, see the [usage documentation](/usage/embeddings-transformers). ## Config and implementation {#config} @@ -61,11 +60,11 @@ architectures and their arguments and hyperparameters. > nlp.add_pipe("transformer", config=DEFAULT_CONFIG) > ``` -| Setting | Description | -| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | -| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | -| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | +| Setting | Description | +| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | +| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.transformer_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | ```python https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py @@ -518,19 +517,23 @@ right context. ## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"} -Annotation setters are functions that that take a batch of `Doc` objects and a -[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set -additional annotations on the `Doc`, e.g. to set custom or built-in attributes. -You can register custom annotation setters using the -`@registry.annotation_setters` decorator. +Annotation setters are functions that take a batch of `Doc` objects and a +[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the +annotations on the `Doc`, e.g. to set custom or built-in attributes. You can +register custom annotation setters using the `@registry.annotation_setters` +decorator. The default annotation setter used by the `Transformer` pipeline +component is `trfdata_setter`, which sets the custom `Doc._.transformer_data` +attribute. 
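For instance, a custom setter can route the output to a different extension
attribute instead of the default one. The snippet below is only an illustrative
sketch: the registered name `"custom_attr_setter.v1"` and the
`Doc._.custom_attr` extension are placeholders, not functions shipped with the
package.

```python
from typing import Callable, List
import spacy_transformers
from spacy.tokens import Doc

# Placeholder extension to hold the per-document transformer output
Doc.set_extension("custom_attr", default=None)

@spacy_transformers.registry.annotation_setters("custom_attr_setter.v1")
def configure_custom_attr_setter() -> Callable:
    def setter(docs: List[Doc], trf_data) -> None:
        # trf_data is a FullTransformerBatch; doc_data yields one
        # TransformerData object per Doc in the batch
        for doc, data in zip(docs, trf_data.doc_data):
            doc._.custom_attr = data

    return setter
```

Such a setter can then be referenced from the training config, e.g. with
`@annotation_setters = "custom_attr_setter.v1"` in the
`[components.transformer.annotation_setter]` block.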
> #### Example > > ```python -> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1") -> def configure_null_annotation_setter() -> Callable: +> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1") +> def configure_trfdata_setter() -> Callable: > def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None: -> pass +> doc_data = list(trf_data.doc_data) +> for doc, data in zip(docs, doc_data): +> doc._.trf_data = data > > return setter > ``` @@ -542,9 +545,9 @@ You can register custom annotation setters using the The following built-in functions are available: -| Name | Description | -| ---------------------------------------------- | ------------------------------------- | -| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. | +| Name | Description | +| -------------------------------------- | ------------------------------------------------------------- | +| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. | ## Custom attributes {#custom-attributes} diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 62336a826..fbae1da82 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -299,7 +299,7 @@ component: > > ```python > from spacy_transformers import Transformer, TransformerModel -> from spacy_transformers.annotation_setters import null_annotation_setter +> from spacy_transformers.annotation_setters import configure_trfdata_setter > from spacy_transformers.span_getters import get_doc_spans > > trf = Transformer( @@ -309,7 +309,7 @@ component: > get_spans=get_doc_spans, > tokenizer_config={"use_fast": True}, > ), -> annotation_setter=null_annotation_setter, +> annotation_setter=configure_trfdata_setter(), > max_batch_items=4096, > ) > ``` @@ -329,7 +329,7 @@ tokenizer_config = {"use_fast": true} @span_getters = "doc_spans.v1" [components.transformer.annotation_setter] -@annotation_setters = "spacy-transformers.null_annotation_setter.v1" +@annotation_setters = "spacy-transformers.trfdata_setter.v1" ``` From acc794c97525b32bac54cb8e7f900460eba6789d Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 10:10:10 +0200 Subject: [PATCH 05/84] example of writing to other custom attribute --- website/docs/usage/embeddings-transformers.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index fbae1da82..3e95114f0 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -225,7 +225,7 @@ transformers as subnetworks directly, you can also use them via the ![The processing pipeline with the transformer component](../images/pipeline_transformer.svg) -The `Transformer` component sets the +By default, the `Transformer` component sets the [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute, which lets you access the transformers outputs at runtime. @@ -249,8 +249,8 @@ for doc in nlp.pipe(["some text", "some other text"]): tokvecs = doc._.trf_data.tensors[-1] ``` -You can also customize how the [`Transformer`](/api/transformer) component sets -annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`. 
+You can customize how the [`Transformer`](/api/transformer) component sets +annotations onto the [`Doc`](/api/doc), by changing the `annotation_setter`. This callback will be called with the raw input and output data for the whole batch, along with the batch of `Doc` objects, allowing you to implement whatever you need. The annotation setter is called with a batch of [`Doc`](/api/doc) @@ -259,13 +259,15 @@ containing the transformers data for the batch. ```python def custom_annotation_setter(docs, trf_data): - # TODO: - ... + doc_data = list(trf_data.doc_data) + for doc, data in zip(docs, doc_data): + doc._.custom_attr = data nlp = spacy.load("en_core_trf_lg") nlp.get_pipe("transformer").annotation_setter = custom_annotation_setter doc = nlp("This is a text") -print() # TODO: +assert isinstance(doc._.custom_attr, TransformerData) +print(doc._.custom_attr.tensors) ``` ### Training usage {#transformers-training} From c68169f83f267c4ad614bec1a7a914c90c46843f Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 10:19:43 +0200 Subject: [PATCH 06/84] fix link --- website/docs/usage/embeddings-transformers.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 3e95114f0..c1d7ee333 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -345,9 +345,9 @@ in a block starts with `@`, it's **resolved to a function** and all other settings are passed to the function as arguments. In this case, `name`, `tokenizer_config` and `get_spans`. -`get_spans` is a function that takes a batch of `Doc` object and returns lists +`get_spans` is a function that takes a batch of `Doc` objects and returns lists of potentially overlapping `Span` objects to process by the transformer. Several -[built-in functions](/api/transformer#span-getters) are available – for example, +[built-in functions](/api/transformer#span_getters) are available – for example, to process the whole document or individual sentences. When the config is resolved, the function is created and passed into the model as an argument. From 4d37ac3f33834d8a189bfbe66d31f53b4306ac3a Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 14:14:16 +0200 Subject: [PATCH 07/84] configure_custom_sent_spans example --- website/docs/usage/embeddings-transformers.md | 27 ++++++++++++++----- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index c1d7ee333..21bedc3d3 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -368,13 +368,17 @@ To change any of the settings, you can edit the `config.cfg` and re-run the training. To change any of the functions, like the span getter, you can replace the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to process sentences. You can also register your own functions using the -`span_getters` registry: +`span_getters` registry. For instance, the following custom function returns +`Span` objects following sentence boundaries, unless a sentence succeeds a +certain amount of tokens, in which case subsentences of at most `max_length` +tokens are returned. > #### config.cfg > > ```ini > [components.transformer.model.get_spans] > @span_getters = "custom_sent_spans" +> max_length = 25 > ``` ```python @@ -382,12 +386,23 @@ process sentences. 
You can also register your own functions using the import spacy_transformers @spacy_transformers.registry.span_getters("custom_sent_spans") -def configure_custom_sent_spans(): - # TODO: write custom example - def get_sent_spans(docs): - return [list(doc.sents) for doc in docs] +def configure_custom_sent_spans(max_length: int): + def get_custom_sent_spans(docs): + spans = [] + for doc in docs: + spans.append([]) + for sent in doc.sents: + start = 0 + end = max_length + while end <= len(sent): + spans[-1].append(sent[start:end]) + start += max_length + end += max_length + if start < len(sent): + spans[-1].append(sent[start : len(sent)]) + return spans - return get_sent_spans + return get_custom_sent_spans ``` To resolve the config during training, spaCy needs to know about your custom From 28e4ba72702485d5379309967bc751b5d6f48a9f Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 14:33:28 +0200 Subject: [PATCH 08/84] fix references to TransformerListener --- website/docs/usage/embeddings-transformers.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 21bedc3d3..e78baeb67 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -399,7 +399,7 @@ def configure_custom_sent_spans(max_length: int): start += max_length end += max_length if start < len(sent): - spans[-1].append(sent[start : len(sent)]) + spans[-1].append(sent[start:len(sent)]) return spans return get_custom_sent_spans @@ -429,7 +429,7 @@ The same idea applies to task models that power the **downstream components**. Most of spaCy's built-in model creation functions support a `tok2vec` argument, which should be a Thinc layer of type ~~Model[List[Doc], List[Floats2d]]~~. This is where we'll plug in our transformer model, using the -[Tok2VecListener](/api/architectures#Tok2VecListener) layer, which sneakily +[TransformerListener](/api/architectures#TransformerListener) layer, which sneakily delegates to the `Transformer` pipeline component. ```ini @@ -445,14 +445,14 @@ maxout_pieces = 3 use_upper = false [nlp.pipeline.ner.model.tok2vec] -@architectures = "spacy-transformers.Tok2VecListener.v1" +@architectures = "spacy-transformers.TransformerListener.v1" grad_factor = 1.0 [nlp.pipeline.ner.model.tok2vec.pooling] @layers = "reduce_mean.v1" ``` -The [Tok2VecListener](/api/architectures#Tok2VecListener) layer expects a +The [TransformerListener](/api/architectures#TransformerListener) layer expects a [pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the argument `pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This layer determines how the vector for each spaCy token will be computed from the zero or From 329e49056008680b428ea0112ff5f52d5cdc2de7 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 14:50:43 +0200 Subject: [PATCH 09/84] small import fixes --- website/docs/usage/embeddings-transformers.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index e78baeb67..96ae1978d 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -552,8 +552,9 @@ vectors, but combines them via summation with a smaller table of learned embeddings. 
```python -from thinc.api import add, chain, remap_ids, Embed +from thinc.api import add, chain, remap_ids, Embed, FeatureExtractor from spacy.ml.staticvectors import StaticVectors +from spacy.util import registry @registry.architectures("my_example.MyEmbedding.v1") def MyCustomVectors( From 556e975a30fb7d9589ed9375e4e92008fc008848 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 19:24:44 +0200 Subject: [PATCH 10/84] various fixes --- website/docs/api/transformer.md | 61 ++++++++++--------- website/docs/usage/embeddings-transformers.md | 14 ++--- 2 files changed, 38 insertions(+), 37 deletions(-) diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index 0b51487ed..a3f6deb7d 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -49,8 +49,8 @@ The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the `config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your [`config.cfg` for training](/usage/training#config). See the -[model architectures](/api/architectures) documentation for details on the -architectures and their arguments and hyperparameters. +[model architectures](/api/architectures#transformers) documentation for details +on the transformer architectures and their arguments and hyperparameters. > #### Example > @@ -60,11 +60,11 @@ architectures and their arguments and hyperparameters. > nlp.add_pipe("transformer", config=DEFAULT_CONFIG) > ``` -| Setting | Description | -| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | -| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.transformer_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | -| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | +| Setting | Description | +| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | +| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | ```python https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py @@ -97,18 +97,19 @@ Construct a `Transformer` component. 
One or more subsequent spaCy components can use the transformer outputs as features in its model, with gradients backpropagated to the single shared weights. The activations from the transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension -attribute. You can also provide a callback to set additional annotations. In -your application, you would normally use a shortcut for this and instantiate the -component using its string name and [`nlp.add_pipe`](/api/language#create_pipe). +attribute by default, but you can provide a different `annotation_setter` to +customize this behaviour. In your application, you would normally use a shortcut +and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#create_pipe). -| Name | Description | -| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | The shared vocabulary. ~~Vocab~~ | -| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ | -| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. By default, no annotations are set. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | -| _keyword-only_ | | -| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | -| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ | +| Name | Description | +| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ | +| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. By default, the function `trfdata_setter` sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| _keyword-only_ | | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ | ## Transformer.\_\_call\_\_ {#call tag="method"} @@ -204,8 +205,9 @@ modifying them. Assign the extracted features to the Doc objects. By default, the [`TransformerData`](/api/transformer#transformerdata) object is written to the -[`Doc._.trf_data`](#custom-attributes) attribute. Your annotation_setter -callback is then called, if provided. +[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be +customized by providing a different `annotation_setter` argument upon +construction. 
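A minimal sketch of this customization, assuming a pipeline that already
contains a `transformer` component: the `my_setter` function below is a
placeholder, and the same callable could equally be passed as the
`annotation_setter` argument when constructing the component.

```python
import spacy

nlp = spacy.load("en_core_trf_lg")  # any pipeline with a "transformer" component
trf = nlp.get_pipe("transformer")

def my_setter(docs, trf_data):
    # Store each document's share of the batch output on the default attribute
    for doc, data in zip(docs, trf_data.doc_data):
        doc._.trf_data = data

trf.annotation_setter = my_setter  # used the next time annotations are set
```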
> #### Example > @@ -382,9 +384,8 @@ return tensors that refer to a whole padded batch of documents. These tensors are wrapped into the [FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The `FullTransformerBatch` then splits out the per-document data, which is handled -by this class. Instances of this class -are`typically assigned to the [Doc._.trf_data`](/api/transformer#custom-attributes) -extension attribute. +by this class. Instances of this class are typically assigned to the +[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute. | Name | Description | | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -446,8 +447,9 @@ overlap, and you can also omit sections of the Doc if they are not relevant. Span getters can be referenced in the `[components.transformer.model.get_spans]` block of the config to customize the sequences processed by the transformer. You -can also register custom span getters using the `@spacy.registry.span_getters` -decorator. +can also register +[custom span getters](/usage/embeddings-transformers#transformers-training-custom-settings) +using the `@spacy.registry.span_getters` decorator. > #### Example > @@ -522,8 +524,7 @@ Annotation setters are functions that take a batch of `Doc` objects and a annotations on the `Doc`, e.g. to set custom or built-in attributes. You can register custom annotation setters using the `@registry.annotation_setters` decorator. The default annotation setter used by the `Transformer` pipeline -component is `trfdata_setter`, which sets the custom `Doc._.transformer_data` -attribute. +component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute. > #### Example > @@ -554,6 +555,6 @@ The following built-in functions are available: The component sets the following [custom extension attributes](/usage/processing-pipeline#custom-components-attributes): -| Name | Description | -| -------------- | ------------------------------------------------------------------------ | -| `Doc.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ | +| Name | Description | +| ---------------- | ------------------------------------------------------------------------ | +| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ | diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 96ae1978d..751cff6a5 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -429,8 +429,8 @@ The same idea applies to task models that power the **downstream components**. Most of spaCy's built-in model creation functions support a `tok2vec` argument, which should be a Thinc layer of type ~~Model[List[Doc], List[Floats2d]]~~. This is where we'll plug in our transformer model, using the -[TransformerListener](/api/architectures#TransformerListener) layer, which sneakily -delegates to the `Transformer` pipeline component. +[TransformerListener](/api/architectures#TransformerListener) layer, which +sneakily delegates to the `Transformer` pipeline component. 
```ini ### config.cfg (excerpt) {highlight="12"} @@ -452,11 +452,11 @@ grad_factor = 1.0 @layers = "reduce_mean.v1" ``` -The [TransformerListener](/api/architectures#TransformerListener) layer expects a -[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the argument -`pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This layer -determines how the vector for each spaCy token will be computed from the zero or -more source rows the token is aligned against. Here we use the +The [TransformerListener](/api/architectures#TransformerListener) layer expects +a [pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the +argument `pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This +layer determines how the vector for each spaCy token will be computed from the +zero or more source rows the token is aligned against. Here we use the [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which averages the wordpiece rows. We could instead use [`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom From aa9e0c9c39ccb00ed211aa2b718b8308258fa1df Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 19:56:52 +0200 Subject: [PATCH 11/84] small fix --- website/docs/api/architectures.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 374c133ff..b55027356 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -323,11 +323,11 @@ for details and system requirements. Load and wrap a transformer model from the [HuggingFace `transformers`](https://huggingface.co/transformers) library. You -can any transformer that has pretrained weights and a PyTorch implementation. -The `name` variable is passed through to the underlying library, so it can be -either a string or a path. If it's a string, the pretrained weights will be -downloaded via the transformers library if they are not already available -locally. +can use any transformer that has pretrained weights and a PyTorch +implementation. The `name` variable is passed through to the underlying library, +so it can be either a string or a path. If it's a string, the pretrained weights +will be downloaded via the transformers library if they are not already +available locally. In order to support longer documents, the [TransformerModel](/api/architectures#TransformerModel) layer allows you to pass From 72a87095d98c4b4e70b174122a0854e6f8812b23 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Thu, 27 Aug 2020 20:26:28 +0200 Subject: [PATCH 12/84] add loggers registry --- website/docs/api/top-level.md | 1 + 1 file changed, 1 insertion(+) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 797fa0191..73474a81b 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -307,6 +307,7 @@ factories. | `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | | `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | | `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | +| `loggers` | Registry for functions that log [training results](/usage/training). | | `lookups` | Registry for large lookup tables available via `vocab.lookups`. 
| | `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | | `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | From 5230529de2edd89869800a7c19bf3890d003b5bb Mon Sep 17 00:00:00 2001 From: svlandeg Date: Fri, 28 Aug 2020 21:44:04 +0200 Subject: [PATCH 13/84] add loggers registry & logger docs sections --- spacy/cli/train.py | 2 +- website/docs/api/top-level.md | 106 ++++++++++++++++++++++++--------- website/docs/usage/training.md | 18 ++++++ 3 files changed, 98 insertions(+), 28 deletions(-) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index d9ab8eca5..655a5ae58 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -272,7 +272,7 @@ def train_while_improving( step (int): How many steps have been completed. score (float): The main score form the last evaluation. other_scores: : The other scores from the last evaluation. - loss: The accumulated losses throughout training. + losses: The accumulated losses throughout training. checkpoints: A list of previous results, where each result is a (score, step, epoch) tuple. """ diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 73474a81b..011716060 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -296,24 +296,25 @@ factories. > i += 1 > ``` -| Registry name | Description | -| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | -| `assets` | Registry for data assets, knowledge bases etc. | -| `batchers` | Registry for training and evaluation [data batchers](#batchers). | -| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | -| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | -| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | -| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | -| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | -| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | -| `loggers` | Registry for functions that log [training results](/usage/training). | -| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | -| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | -| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | -| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | -| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). 
| -| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. | +| Registry name | Description | +| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `annotation_setters` | Registry for functions that store Tok2Vec annotations on `Doc` objects. | +| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | +| `assets` | Registry for data assets, knowledge bases etc. | +| `batchers` | Registry for training and evaluation [data batchers](#batchers). | +| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | +| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | +| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | +| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | +| `loggers` | Registry for functions that log [training results](/usage/training). | +| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | +| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | +| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | +| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | +| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | +| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. 
| ### spacy-transformers registry {#registry-transformers} @@ -327,18 +328,69 @@ See the [`Transformer`](/api/transformer) API reference and > ```python > import spacy_transformers > -> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1") -> def configure_custom_annotation_setter(): -> def annotation_setter(docs, trf_data) -> None: -> # Set annotations on the docs +> @spacy_transformers.registry.span_getters("my_span_getter.v1") +> def configure_custom_span_getter() -> Callable: +> def span_getter(docs: List[Doc]) -> List[List[Span]]: +> # Transform each Doc into a List of Span objects > -> return annotation_sette +> return span_getter > ``` -| Registry name | Description | -| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | -| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | + +| Registry name | Description | +| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | +| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | + +## Loggers {#loggers source="spacy/gold/loggers.py" new="3"} + +A logger records the training results for each step. When a logger is created, +it returns a `log_step` function and a `finalize` function. The `log_step` +function is called by the [training script](/api/cli#train) and receives a +dictionary of information, including + +# TODO + +> #### Example config +> +> ```ini +> [training.logger] +> @loggers = "spacy.ConsoleLogger.v1" +> ``` + +Instead of using one of the built-in batchers listed here, you can also +[implement your own](/usage/training#custom-code-readers-batchers), which may or +may not use a custom schedule. + +#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"} + +Writes the results of a training step to the console in a tabular format. + +#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"} + +> #### Installation +> +> ```bash +> $ pip install wandb +> $ wandb login +> ``` + +Built-in logger that sends the results of each training step to the dashboard of +the [Weights & Biases`](https://www.wandb.com/) dashboard. To use this logger, +Weights & Biases should be installed, and you should be logged in. The logger +will send the full config file to W&B, as well as various system information +such as GPU + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- | +| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. 
~~str~~ | + +> #### Example config +> +> ```ini +> [training.logger] +> @loggers = "spacy.WandbLogger.v1" +> project_name = "monitor_spacy_training" +> ``` ## Batchers {#batchers source="spacy/gold/batchers.py" new="3"} diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 59766bada..878161b1b 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -605,6 +605,24 @@ to your Python file. Before loading the config, spaCy will import the $ python -m spacy train config.cfg --output ./output --code ./functions.py ``` +#### Example: Custom logging function {#custom-logging} + +During training, the results of each step are passed to a logger function in a +dictionary providing the following information: + +| Key | Value | +| -------------- | ---------------------------------------------------------------------------------------------- | +| `epoch` | How many passes over the data have been completed. ~~int~~ | +| `step` | How many steps have been completed. ~~int~~ | +| `score` | The main score form the last evaluation, measured on the dev set. ~~float~~ | +| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ | +| `losses` | The accumulated training losses. ~~Dict[str, float]~~ | +| `checkpoints` | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ | + +By default, these results are written to the console with the [`ConsoleLogger`](/api/top-level#ConsoleLogger) + +# TODO + #### Example: Custom batch size schedule {#custom-code-schedule} For example, let's say you've implemented your own batch size schedule to use From e3d959d4b41d2ff72fdcd03b27b8f660e4c79442 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Sun, 30 Aug 2020 16:16:30 +0200 Subject: [PATCH 14/84] Fix makefile --- Makefile | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/Makefile b/Makefile index 64bb0b57a..006d32471 100644 --- a/Makefile +++ b/Makefile @@ -1,11 +1,15 @@ SHELL := /bin/bash -PYVER := 3.6 -VENV := ./env$(PYVER) ifndef SPACY_EXTRAS override SPACY_EXTRAS = spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core endif +ifndef PYVER +override PYVER = 3.6 +endif + +VENV := ./env$(PYVER) + version := $(shell "bin/get-version.sh") package := $(shell "bin/get-package.sh") From d62a3c65512185b4113a268fe63bcd155f28993a Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Sun, 30 Aug 2020 16:35:10 +0200 Subject: [PATCH 15/84] Fix makefile --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 006d32471..1b8175878 100644 --- a/Makefile +++ b/Makefile @@ -33,7 +33,7 @@ dist/pytest.pex : wheelhouse/pytest-*.whl $(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock chmod a+rx $@ -wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py* +wheelhouse/spacy-$(PYVER)-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py* $(VENV)/bin/pip wheel . 
-w ./wheelhouse $(VENV)/bin/pip wheel $(SPACY_EXTRAS) -w ./wheelhouse From b2463e4d04c036f2c5a4b2c5e031dcc11550db48 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Sun, 30 Aug 2020 16:37:04 +0200 Subject: [PATCH 16/84] Fix makefile --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 1b8175878..ffb07d5d1 100644 --- a/Makefile +++ b/Makefile @@ -17,7 +17,7 @@ ifndef SPACY_BIN override SPACY_BIN = $(package)-$(version).pex endif -dist/$(SPACY_BIN) : wheelhouse/spacy-$(version).stamp +dist/$(SPACY_BIN) : wheelhouse/spacy-$(PYVER)-$(version).stamp $(VENV)/bin/pex \ -f ./wheelhouse \ --no-index \ From 2ee0154bd03d5c657b832e531dc62e5263e1fb2a Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Sun, 30 Aug 2020 17:11:24 +0200 Subject: [PATCH 17/84] Fix makefile --- Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/Makefile b/Makefile index ffb07d5d1..4f5eecc4b 100644 --- a/Makefile +++ b/Makefile @@ -45,6 +45,7 @@ wheelhouse/pytest-%.whl : $(VENV)/bin/pex $(VENV)/bin/pex : python$(PYVER) -m venv $(VENV) $(VENV)/bin/pip install -U pip setuptools pex wheel + $(VENV)/bin/pip install numpy .PHONY : clean test From acdd7b9478895bb0e81fe705e052e55d06589f75 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Sun, 30 Aug 2020 20:00:49 +0200 Subject: [PATCH 18/84] Allow wheelhouse to be set in makefile --- Makefile | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/Makefile b/Makefile index 4f5eecc4b..caf418f37 100644 --- a/Makefile +++ b/Makefile @@ -17,9 +17,15 @@ ifndef SPACY_BIN override SPACY_BIN = $(package)-$(version).pex endif -dist/$(SPACY_BIN) : wheelhouse/spacy-$(PYVER)-$(version).stamp +ifndef WHEELHOUSE +override WHEELHOUSE = "./wheelhouse" +endif + + + +dist/$(SPACY_BIN) : $(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp $(VENV)/bin/pex \ - -f ./wheelhouse \ + -f $(WHEELHOUSE) \ --no-index \ --disable-cache \ -m spacy \ @@ -29,18 +35,19 @@ dist/$(SPACY_BIN) : wheelhouse/spacy-$(PYVER)-$(version).stamp chmod a+rx $@ cp $@ dist/spacy.pex -dist/pytest.pex : wheelhouse/pytest-*.whl - $(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock +dist/pytest.pex $(WHEELHOUSE)/pytest-*.whl + $(VENV)/bin/pex -f $(WHEELHOUSE) --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock chmod a+rx $@ -wheelhouse/spacy-$(PYVER)-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py* - $(VENV)/bin/pip wheel . -w ./wheelhouse - $(VENV)/bin/pip wheel $(SPACY_EXTRAS) -w ./wheelhouse +$(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py* + mkdir -p $(WHEELHOUSE) + $(VENV)/bin/pip wheel . 
-w $(WHEELHOUSE) + $(VENV)/bin/pip wheel $(SPACY_EXTRAS) -w $(WHEELHOUSE) touch $@ -wheelhouse/pytest-%.whl : $(VENV)/bin/pex - $(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse +$(WHEELHOUSE)/pytest-%.whl : $(VENV)/bin/pex + $(VENV)/bin/pip wheel pytest pytest-timeout mock -w $(WHEELHOUSE) $(VENV)/bin/pex : python$(PYVER) -m venv $(VENV) @@ -55,6 +62,6 @@ test : dist/spacy-$(version).pex dist/pytest.pex clean : setup.py rm -rf dist/* - rm -rf ./wheelhouse + rm -rf $(WHEELHOUSE)/* rm -rf $(VENV) python setup.py clean --all From b69a0e332d7aedc09f8d5bf77fc700955ca16345 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Sun, 30 Aug 2020 20:14:52 +0200 Subject: [PATCH 19/84] Fix makefile --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index caf418f37..5cd616529 100644 --- a/Makefile +++ b/Makefile @@ -35,7 +35,7 @@ dist/$(SPACY_BIN) : $(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp chmod a+rx $@ cp $@ dist/spacy.pex -dist/pytest.pex $(WHEELHOUSE)/pytest-*.whl +dist/pytest.pex : $(WHEELHOUSE)/pytest-*.whl $(VENV)/bin/pex -f $(WHEELHOUSE) --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock chmod a+rx $@ From 9341cbc013b4f471654d1fd5ad79a0827e572545 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Sun, 30 Aug 2020 23:10:43 +0200 Subject: [PATCH 20/84] Set version to v3.0.0a13 --- spacy/about.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spacy/about.py b/spacy/about.py index 418e44c1d..3fe720dbc 100644 --- a/spacy/about.py +++ b/spacy/about.py @@ -1,6 +1,6 @@ # fmt: off __title__ = "spacy-nightly" -__version__ = "3.0.0a12" +__version__ = "3.0.0a13" __release__ = True __download_url__ = "https://github.com/explosion/spacy-models/releases/download" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" From 216efaf5f53960f80519cfa2c343a1f9efdbf72e Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Mon, 31 Aug 2020 09:42:06 +0200 Subject: [PATCH 21/84] Restrict tokenizer exceptions to ORTH and NORM --- spacy/errors.py | 3 +++ spacy/tests/tokenizer/test_tokenizer.py | 8 +++++++- spacy/tokenizer.pyx | 12 +++++++++--- website/docs/usage/v3.md | 14 ++++++++++++++ 4 files changed, 33 insertions(+), 4 deletions(-) diff --git a/spacy/errors.py b/spacy/errors.py index 38c89c479..e53aaef07 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -645,6 +645,9 @@ class Errors: "Required tables '{tables}', found '{found}'. If you are not " "providing custom lookups, make sure you have the package " "spacy-lookups-data installed.") + E1005 = ("Unable to set attribute '{attr}' in tokenizer exception for " + "'{chunk}'. 
Tokenizer exceptions are only allowed to specify " + "`ORTH` and `NORM`.") @add_codes diff --git a/spacy/tests/tokenizer/test_tokenizer.py b/spacy/tests/tokenizer/test_tokenizer.py index b89c0627f..ff31ae8a9 100644 --- a/spacy/tests/tokenizer/test_tokenizer.py +++ b/spacy/tests/tokenizer/test_tokenizer.py @@ -105,7 +105,13 @@ def test_tokenizer_add_special_case(tokenizer, text, tokens): assert doc[1].text == tokens[1]["orth"] -@pytest.mark.parametrize("text,tokens", [("lorem", [{"orth": "lo"}, {"orth": "re"}])]) +@pytest.mark.parametrize( + "text,tokens", + [ + ("lorem", [{"orth": "lo"}, {"orth": "re"}]), + ("lorem", [{"orth": "lo", "tag": "A"}, {"orth": "rem"}]), + ], +) def test_tokenizer_validate_special_case(tokenizer, text, tokens): with pytest.raises(ValueError): tokenizer.add_special_case(text, tokens) diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx index 9fda1800b..12c634e61 100644 --- a/spacy/tokenizer.pyx +++ b/spacy/tokenizer.pyx @@ -17,7 +17,7 @@ from .strings cimport hash_string from .lexeme cimport EMPTY_LEXEME from .attrs import intify_attrs -from .symbols import ORTH +from .symbols import ORTH, NORM from .errors import Errors, Warnings from . import util from .util import registry @@ -584,9 +584,11 @@ cdef class Tokenizer: self.add_special_case(chunk, substrings) def _validate_special_case(self, chunk, substrings): - """Check whether the `ORTH` fields match the string. + """Check whether the `ORTH` fields match the string. Check that + additional features beyond `ORTH` and `NORM` are not set by the + exception. - string (str): The string to specially tokenize. + chunk (str): The string to specially tokenize. substrings (iterable): A sequence of dicts, where each dict describes a token and its attributes. """ @@ -594,6 +596,10 @@ cdef class Tokenizer: orth = "".join([spec[ORTH] for spec in attrs]) if chunk != orth: raise ValueError(Errors.E997.format(chunk=chunk, orth=orth, token_attrs=substrings)) + for substring in attrs: + for attr in substring: + if attr not in (ORTH, NORM): + raise ValueError(Errors.E1005.format(attr=self.vocab.strings[attr], chunk=chunk)) def add_special_case(self, unicode string, substrings): """Add a special-case tokenization rule. diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index d5fea9fee..20b7a139b 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -566,6 +566,20 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")] + matcher.add("HEALTH", patterns, on_match=on_match) ``` +### Migrating attributes in tokenizer exceptions {#migrating-tokenizer-exceptions} + +Tokenizer exceptions are now only allowed to set `ORTH` and `NORM` values as +part of the token attributes. 
Exceptions for other attributes such as `TAG` and +`LEMMA` should be moved to an [`AttributeRuler`](/api/attributeruler) component: + +```diff +nlp = spacy.blank("en") +- nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't", "LEMMA": "not"}]) ++ nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't"}]) ++ ruler = nlp.add_pipe("attribute_ruler") ++ ruler.add(patterns=[[{"ORTH": "n't"}]], attrs={"LEMMA": "not"}) +``` + ### Migrating tag maps and morph rules {#migrating-training-mappings-exceptions} Instead of defining a `tag_map` and `morph_rules` in the language data, spaCy From ec14744ee44e5dfb42f27f9c4edd02910420bdca Mon Sep 17 00:00:00 2001 From: Sofie Van Landeghem Date: Mon, 31 Aug 2020 12:41:39 +0200 Subject: [PATCH 22/84] Rename Transformer listener (#6001) * rename to spacy-transformers.TransformerListener * add some more tok2vec tests * use select_pipes * fix docs - annotation setter was not changed in the end --- spacy/cli/templates/quickstart_training.jinja | 6 +- spacy/pipeline/pipe.pyx | 2 +- spacy/pipeline/tok2vec.py | 2 +- spacy/tests/test_tok2vec.py | 97 ++++++++++++++++++- website/docs/api/architectures.md | 4 +- website/docs/usage/embeddings-transformers.md | 8 +- website/docs/usage/v3.md | 2 +- 7 files changed, 107 insertions(+), 14 deletions(-) diff --git a/spacy/cli/templates/quickstart_training.jinja b/spacy/cli/templates/quickstart_training.jinja index 0071f1b1a..fa9bb6d76 100644 --- a/spacy/cli/templates/quickstart_training.jinja +++ b/spacy/cli/templates/quickstart_training.jinja @@ -42,7 +42,7 @@ factory = "tagger" nO = null [components.tagger.model.tok2vec] -@architectures = "spacy-transformers.Tok2VecListener.v1" +@architectures = "spacy-transformers.TransformerListener.v1" grad_factor = 1.0 [components.tagger.model.tok2vec.pooling] @@ -62,7 +62,7 @@ use_upper = false nO = null [components.parser.model.tok2vec] -@architectures = "spacy-transformers.Tok2VecListener.v1" +@architectures = "spacy-transformers.TransformerListener.v1" grad_factor = 1.0 [components.parser.model.tok2vec.pooling] @@ -82,7 +82,7 @@ use_upper = false nO = null [components.ner.model.tok2vec] -@architectures = "spacy-transformers.Tok2VecListener.v1" +@architectures = "spacy-transformers.TransformerListener.v1" grad_factor = 1.0 [components.ner.model.tok2vec.pooling] diff --git a/spacy/pipeline/pipe.pyx b/spacy/pipeline/pipe.pyx index 51251dacc..a3f379a97 100644 --- a/spacy/pipeline/pipe.pyx +++ b/spacy/pipeline/pipe.pyx @@ -37,7 +37,7 @@ cdef class Pipe: and returned. This usually happens under the hood when the nlp object is called on a text and all components are applied to the Doc. - docs (Doc): The Doc to preocess. + docs (Doc): The Doc to process. RETURNS (Doc): The processed Doc. DOCS: https://spacy.io/api/pipe#call diff --git a/spacy/pipeline/tok2vec.py b/spacy/pipeline/tok2vec.py index dad66ddb3..7e61ccc02 100644 --- a/spacy/pipeline/tok2vec.py +++ b/spacy/pipeline/tok2vec.py @@ -88,7 +88,7 @@ class Tok2Vec(Pipe): """Add context-sensitive embeddings to the Doc.tensor attribute, allowing them to be used as features by downstream components. - docs (Doc): The Doc to preocess. + docs (Doc): The Doc to process. RETURNS (Doc): The processed Doc. 
DOCS: https://spacy.io/api/tok2vec#call diff --git a/spacy/tests/test_tok2vec.py b/spacy/tests/test_tok2vec.py index b30705088..1068b662d 100644 --- a/spacy/tests/test_tok2vec.py +++ b/spacy/tests/test_tok2vec.py @@ -3,11 +3,18 @@ import pytest from spacy.ml.models.tok2vec import build_Tok2Vec_model from spacy.ml.models.tok2vec import MultiHashEmbed, CharacterEmbed from spacy.ml.models.tok2vec import MishWindowEncoder, MaxoutWindowEncoder +from spacy.pipeline.tok2vec import Tok2Vec, Tok2VecListener from spacy.vocab import Vocab from spacy.tokens import Doc - +from spacy.gold import Example +from spacy import util +from spacy.lang.en import English from .util import get_batch +from thinc.api import Config + +from numpy.testing import assert_equal + def test_empty_doc(): width = 128 @@ -41,7 +48,7 @@ def test_tok2vec_batch_sizes(batch_size, width, embed_size): also_use_static_vectors=False, also_embed_subwords=True, ), - MaxoutWindowEncoder(width=width, depth=4, window_size=1, maxout_pieces=3,), + MaxoutWindowEncoder(width=width, depth=4, window_size=1, maxout_pieces=3), ) tok2vec.initialize() vectors, backprop = tok2vec.begin_update(batch) @@ -74,3 +81,89 @@ def test_tok2vec_configs(width, embed_arch, embed_config, encode_arch, encode_co assert len(vectors) == len(docs) assert vectors[0].shape == (len(docs[0]), width) backprop(vectors) + + +def test_init_tok2vec(): + # Simple test to initialize the default tok2vec + nlp = English() + tok2vec = nlp.add_pipe("tok2vec") + assert tok2vec.listeners == [] + nlp.begin_training() + + +cfg_string = """ + [nlp] + lang = "en" + pipeline = ["tok2vec","tagger"] + + [components] + + [components.tagger] + factory = "tagger" + + [components.tagger.model] + @architectures = "spacy.Tagger.v1" + nO = null + + [components.tagger.model.tok2vec] + @architectures = "spacy.Tok2VecListener.v1" + width = ${components.tok2vec.model.encode.width} + + [components.tok2vec] + factory = "tok2vec" + + [components.tok2vec.model] + @architectures = "spacy.Tok2Vec.v1" + + [components.tok2vec.model.embed] + @architectures = "spacy.MultiHashEmbed.v1" + width = ${components.tok2vec.model.encode.width} + rows = 2000 + also_embed_subwords = true + also_use_static_vectors = false + + [components.tok2vec.model.encode] + @architectures = "spacy.MaxoutWindowEncoder.v1" + width = 96 + depth = 4 + window_size = 1 + maxout_pieces = 3 + """ + +TRAIN_DATA = [ + ("I like green eggs", {"tags": ["N", "V", "J", "N"]}), + ("Eat blue ham", {"tags": ["V", "J", "N"]}), +] + +def test_tok2vec_listener(): + orig_config = Config().from_str(cfg_string) + nlp, config = util.load_model_from_config(orig_config, auto_fill=True, validate=True) + assert nlp.pipe_names == ["tok2vec", "tagger"] + tagger = nlp.get_pipe("tagger") + tok2vec = nlp.get_pipe("tok2vec") + tagger_tok2vec = tagger.model.get_ref("tok2vec") + assert isinstance(tok2vec, Tok2Vec) + assert isinstance(tagger_tok2vec, Tok2VecListener) + train_examples = [] + for t in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + for tag in t[1]["tags"]: + tagger.add_label(tag) + + # Check that the Tok2Vec component finds it listeners + assert tok2vec.listeners == [] + optimizer = nlp.begin_training(lambda: train_examples) + assert tok2vec.listeners == [tagger_tok2vec] + + for i in range(5): + losses = {} + nlp.update(train_examples, sgd=optimizer, losses=losses) + + doc = nlp("Running the pipeline as a whole.") + doc_tensor = tagger_tok2vec.predict([doc])[0] + assert_equal(doc.tensor, doc_tensor) + + # TODO: should this 
warn or error? + nlp.select_pipes(disable="tok2vec") + assert nlp.pipe_names == ["tagger"] + nlp("Running the pipeline with the Tok2Vec component disabled.") diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 3089fa1b3..e3b26a961 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -346,13 +346,13 @@ in other components, see | `tokenizer_config` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). ~~Dict[str, Any]~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], FullTransformerBatch]~~ | -### spacy-transformers.Tok2VecListener.v1 {#transformers-Tok2VecListener} +### spacy-transformers.TransformerListener.v1 {#TransformerListener} > #### Example Config > > ```ini > [model] -> @architectures = "spacy-transformers.Tok2VecListener.v1" +> @architectures = "spacy-transformers.TransformerListener.v1" > grad_factor = 1.0 > > [model.pooling] diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 75be71845..fe7fc29c0 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -225,7 +225,7 @@ transformers as subnetworks directly, you can also use them via the ![The processing pipeline with the transformer component](../images/pipeline_transformer.svg) -By default, the `Transformer` component sets the +The `Transformer` component sets the [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute, which lets you access the transformers outputs at runtime. @@ -303,7 +303,7 @@ component: > > ```python > from spacy_transformers import Transformer, TransformerModel -> from spacy_transformers.annotation_setters import configure_trfdata_setter +> from spacy_transformers.annotation_setters import null_annotation_setter > from spacy_transformers.span_getters import get_doc_spans > > trf = Transformer( @@ -313,7 +313,7 @@ component: > get_spans=get_doc_spans, > tokenizer_config={"use_fast": True}, > ), -> annotation_setter=configure_trfdata_setter(), +> annotation_setter=null_annotation_setter, > max_batch_items=4096, > ) > ``` @@ -333,7 +333,7 @@ tokenizer_config = {"use_fast": true} @span_getters = "doc_spans.v1" [components.transformer.annotation_setter] -@annotation_setters = "spacy-transformers.trfdata_setter.v1" +@annotation_setters = "spacy-transformers.null_annotation_setter.v1" ``` diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index 20b7a139b..de3d7ce33 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -64,7 +64,7 @@ menu: [`TransformerData`](/api/transformer#transformerdata), [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel), - [Tok2VecListener](/api/architectures#transformers-Tok2VecListener), + [TransformerListener](/api/architectures#TransformerListener), [Tok2VecTransformer](/api/architectures#Tok2VecTransformer) - **Models:** [`en_core_trf_lg_sm`](/models/en) - **Implementation:** From 2c90a06fee86128a504d95e5caf0e15ad439ebac Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 13:43:17 +0200 Subject: [PATCH 23/84] some more information about the loggers --- website/docs/api/top-level.md | 48 ++++++++++++++++++++++------------- 1 file changed, 31 insertions(+), 17 deletions(-) diff --git a/website/docs/api/top-level.md 
b/website/docs/api/top-level.md index 6fbb1c821..518711a8a 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -4,6 +4,7 @@ menu: - ['spacy', 'spacy'] - ['displacy', 'displacy'] - ['registry', 'registry'] + - ['Loggers', 'loggers'] - ['Batchers', 'batchers'] - ['Data & Alignment', 'gold'] - ['Utility Functions', 'util'] @@ -345,19 +346,26 @@ See the [`Transformer`](/api/transformer) API reference and > return span_getter > ``` - | Registry name | Description | | ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | | [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | ## Loggers {#loggers source="spacy/gold/loggers.py" new="3"} -A logger records the training results for each step. When a logger is created, -it returns a `log_step` function and a `finalize` function. The `log_step` -function is called by the [training script](/api/cli#train) and receives a -dictionary of information, including +A logger records the training results. When a logger is created, two functions +are returned: one for logging the information for each training step, and a +second function that is called to finalize the logging when the training is +finished. To log each training step, a +[dictionary](/usage/training#custom-logging) is passed on from the +[training script](/api/cli#train), including information such as the training +loss and the accuracy scores on the development set. -# TODO +There are two built-in logging functions: a logger printing results to the +console in tabular format (which is the default), and one that also sends the +results to a [Weights & Biases`](https://www.wandb.com/) dashboard dashboard. +Instead of using one of the built-in batchers listed here, you can also +[implement your own](/usage/training#custom-code-readers-batchers), which may or +may not use a custom schedule. > #### Example config > @@ -366,10 +374,6 @@ dictionary of information, including > @loggers = "spacy.ConsoleLogger.v1" > ``` -Instead of using one of the built-in batchers listed here, you can also -[implement your own](/usage/training#custom-code-readers-batchers), which may or -may not use a custom schedule. - #### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"} Writes the results of a training step to the console in a tabular format. @@ -384,14 +388,18 @@ Writes the results of a training step to the console in a tabular format. > ``` Built-in logger that sends the results of each training step to the dashboard of -the [Weights & Biases`](https://www.wandb.com/) dashboard. To use this logger, -Weights & Biases should be installed, and you should be logged in. The logger -will send the full config file to W&B, as well as various system information -such as GPU +the [Weights & Biases](https://www.wandb.com/) tool. To use this logger, Weights +& Biases should be installed, and you should be logged in. The logger will send +the full config file to W&B, as well as various system information such as +memory utilization, network traffic, disk IO, GPU statistics, etc. This will +also include information such as your hostname and operating system, as well as +the location of your Python executable. 
-| Name | Description | -| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- | -| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ | +Note that by default, the full (interpolated) training config file is sent over +to the W&B dashboard. If you prefer to exclude certain information such as path +names, you can list those fields in "dot notation" in the `remove_config_values` +parameter. These fields will then be removed from the config before uploading, +but will otherwise remain in the config file stored on your local system. > #### Example config > @@ -399,8 +407,14 @@ such as GPU > [training.logger] > @loggers = "spacy.WandbLogger.v1" > project_name = "monitor_spacy_training" +> remove_config_values = ["paths.train", "paths.dev", "training.dev_corpus.path", "training.train_corpus.path"] > ``` +| Name | Description | +| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | +| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ | +| `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ | + ## Batchers {#batchers source="spacy/gold/batchers.py" new="3"} A data batcher implements a batching strategy that essentially turns a stream of From 13ee742fb47466bf2b61e7c56094607a3bce815e Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 14:24:41 +0200 Subject: [PATCH 24/84] example of custom logger --- spacy/cli/train.py | 2 +- website/docs/usage/training.md | 49 +++++++++++++++++++++++++++++++--- 2 files changed, 46 insertions(+), 5 deletions(-) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index 655a5ae58..075b29c30 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -270,7 +270,7 @@ def train_while_improving( epoch (int): How many passes over the data have been completed. step (int): How many steps have been completed. - score (float): The main score form the last evaluation. + score (float): The main score from the last evaluation. other_scores: : The other scores from the last evaluation. losses: The accumulated losses throughout training. checkpoints: A list of previous results, where each result is a diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 878161b1b..069e7c00a 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -614,14 +614,55 @@ dictionary providing the following information: | -------------- | ---------------------------------------------------------------------------------------------- | | `epoch` | How many passes over the data have been completed. ~~int~~ | | `step` | How many steps have been completed. ~~int~~ | -| `score` | The main score form the last evaluation, measured on the dev set. ~~float~~ | +| `score` | The main score from the last evaluation, measured on the dev set. ~~float~~ | | `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ | -| `losses` | The accumulated training losses. ~~Dict[str, float]~~ | +| `losses` | The accumulated training losses, keyed by component name. 
~~Dict[str, float]~~ | | `checkpoints` | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ | -By default, these results are written to the console with the [`ConsoleLogger`](/api/top-level#ConsoleLogger) +By default, these results are written to the console with the +[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support +for writing the log files to [Weights & Biases](https://www.wandb.com/) with +the [`WandbLogger`](/api/top-level#WandbLogger). But you can easily implement +your own logger as well, for instance to write the tabular results to file: -# TODO +```python +### functions.py +from typing import Tuple, Callable, Dict, Any +import spacy +from pathlib import Path + +@spacy.registry.loggers("my_custom_logger.v1") +def custom_logger(log_path): + def setup_logger(nlp: "Language") -> Tuple[Callable, Callable]: + with Path(log_path).open("w") as file_: + file_.write("step\t") + file_.write("score\t") + for pipe in nlp.pipe_names: + file_.write(f"loss_{pipe}\t") + file_.write("\n") + + def log_step(info: Dict[str, Any]): + with Path(log_path).open("a") as file_: + file_.write(f"{info['step']}\t") + file_.write(f"{info['score']}\t") + for pipe in nlp.pipe_names: + file_.write(f"{info['losses'][pipe]}\t") + file_.write("\n") + + def finalize(): + pass + + return log_step, finalize + + return setup_logger +``` + +```ini +### config.cfg (excerpt) +[training.logger] +@loggers = "my_custom_logger.v1" +file_path = "my_file.tab" +``` #### Example: Custom batch size schedule {#custom-code-schedule} From e47ea88aeb8f07465b3a46c320e4ab7d11acb482 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 14:40:55 +0200 Subject: [PATCH 25/84] revert annotations refactor --- website/docs/api/top-level.md | 54 +++++++-------- website/docs/api/transformer.md | 66 +++++++++---------- website/docs/usage/embeddings-transformers.md | 19 +++--- 3 files changed, 68 insertions(+), 71 deletions(-) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 518711a8a..2643c3c02 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -306,25 +306,24 @@ factories. > i += 1 > ``` -| Registry name | Description | -| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `annotation_setters` | Registry for functions that store Tok2Vec annotations on `Doc` objects. | -| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | -| `assets` | Registry for data assets, knowledge bases etc. | -| `batchers` | Registry for training and evaluation [data batchers](#batchers). | -| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | -| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | -| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). 
| -| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | -| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | -| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | -| `loggers` | Registry for functions that log [training results](/usage/training). | -| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | -| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | -| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | -| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | -| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | -| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. | +| Registry name | Description | +| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | +| `assets` | Registry for data assets, knowledge bases etc. | +| `batchers` | Registry for training and evaluation [data batchers](#batchers). | +| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | +| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | +| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | +| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | +| `loggers` | Registry for functions that log [training results](/usage/training). | +| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | +| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | +| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | +| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | +| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | +| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. 
| ### spacy-transformers registry {#registry-transformers} @@ -338,17 +337,18 @@ See the [`Transformer`](/api/transformer) API reference and > ```python > import spacy_transformers > -> @spacy_transformers.registry.span_getters("my_span_getter.v1") -> def configure_custom_span_getter() -> Callable: -> def span_getter(docs: List[Doc]) -> List[List[Span]]: -> # Transform each Doc into a List of Span objects +> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1") +> def configure_custom_annotation_setter(): +> def annotation_setter(docs, trf_data) -> None: +> # Set annotations on the docs > -> return span_getter +> return annotation_setter > ``` -| Registry name | Description | -| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | -| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | +| Registry name | Description | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | +| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | ## Loggers {#loggers source="spacy/gold/loggers.py" new="3"} diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index a3f6deb7d..0b38c2e8d 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -33,15 +33,16 @@ the [TransformerListener](/api/architectures#TransformerListener) layer. This works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer. -We calculate an alignment between the word-piece tokens and the spaCy -tokenization, so that we can use the last hidden states to store the information -on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the -spaCy token receives the sum of their values. By default, the information is -written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but -you can implement a custom [`@annotation_setter`](#annotation_setters) to change -this behaviour. The package also adds the function registry -[`@span_getters`](#span_getters) with several built-in registered functions. For -more details, see the [usage documentation](/usage/embeddings-transformers). +The component assigns the output of the transformer to the `Doc`'s extension +attributes. We also calculate an alignment between the word-piece tokens and the +spaCy tokenization, so that we can use the last hidden states to set the +`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy +token, the spaCy token receives the sum of their values. 
To access the values, +you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The +package also adds the function registries [`@span_getters`](#span_getters) and +[`@annotation_setters`](#annotation_setters) with several built-in registered +functions. For more details, see the +[usage documentation](/usage/embeddings-transformers). ## Config and implementation {#config} @@ -60,11 +61,11 @@ on the transformer architectures and their arguments and hyperparameters. > nlp.add_pipe("transformer", config=DEFAULT_CONFIG) > ``` -| Setting | Description | -| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | -| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | -| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | +| Setting | Description | +| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | +| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | ```python https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py @@ -97,10 +98,9 @@ Construct a `Transformer` component. One or more subsequent spaCy components can use the transformer outputs as features in its model, with gradients backpropagated to the single shared weights. The activations from the transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension -attribute by default, but you can provide a different `annotation_setter` to -customize this behaviour. In your application, you would normally use a shortcut -and instantiate the component using its string name and -[`nlp.add_pipe`](/api/language#create_pipe). +attribute. You can also provide a callback to set additional annotations. In +your application, you would normally use a shortcut for this and instantiate the +component using its string name and [`nlp.add_pipe`](/api/language#create_pipe). 
| Name | Description | | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -205,9 +205,8 @@ modifying them. Assign the extracted features to the Doc objects. By default, the [`TransformerData`](/api/transformer#transformerdata) object is written to the -[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be -customized by providing a different `annotation_setter` argument upon -construction. +[`Doc._.trf_data`](#custom-attributes) attribute. Your annotation_setter +callback is then called, if provided. > #### Example > @@ -520,23 +519,20 @@ right context. ## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"} Annotation setters are functions that take a batch of `Doc` objects and a -[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the -annotations on the `Doc`, e.g. to set custom or built-in attributes. You can -register custom annotation setters using the `@registry.annotation_setters` -decorator. The default annotation setter used by the `Transformer` pipeline -component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute. +[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set +additional annotations on the `Doc`, e.g. to set custom or built-in attributes. +You can register custom annotation setters using the +`@registry.annotation_setters` decorator. > #### Example > > ```python -> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1") -> def configure_trfdata_setter() -> Callable: +> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1") +> def configure_null_annotation_setter() -> Callable: > def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None: -> doc_data = list(trf_data.doc_data) -> for doc, data in zip(docs, doc_data): -> doc._.trf_data = data +> pass > -> return setter +> return setter > ``` | Name | Description | @@ -546,9 +542,9 @@ component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute. The following built-in functions are available: -| Name | Description | -| -------------------------------------- | ------------------------------------------------------------- | -| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. | +| Name | Description | +| ---------------------------------------------- | ------------------------------------- | +| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. | ## Custom attributes {#custom-attributes} diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index cc49a86c2..aaa5fde10 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -252,12 +252,13 @@ for doc in nlp.pipe(["some text", "some other text"]): ``` You can also customize how the [`Transformer`](/api/transformer) component sets -annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`. -This callback will be called with the raw input and output data for the whole -batch, along with the batch of `Doc` objects, allowing you to implement whatever -you need. 
The annotation setter is called with a batch of [`Doc`](/api/doc) -objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) -containing the transformers data for the batch. +annotations onto the [`Doc`](/api/doc), by specifying a custom +`annotation_setter`. This callback will be called with the raw input and output +data for the whole batch, along with the batch of `Doc` objects, allowing you to +implement whatever you need. The annotation setter is called with a batch of +[`Doc`](/api/doc) objects and a +[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the +transformers data for the batch. ```python def custom_annotation_setter(docs, trf_data): @@ -370,9 +371,9 @@ To change any of the settings, you can edit the `config.cfg` and re-run the training. To change any of the functions, like the span getter, you can replace the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to process sentences. You can also register your own functions using the -[`span_getters` registry](/api/top-level#registry). For instance, the following -custom function returns [`Span`](/api/span) objects following sentence -boundaries, unless a sentence succeeds a certain amount of tokens, in which case +[`span_getters` registry](/api/top-level#registry). For instance, the following +custom function returns [`Span`](/api/span) objects following sentence +boundaries, unless a sentence succeeds a certain amount of tokens, in which case subsentences of at most `max_length` tokens are returned. > #### config.cfg From 56ba691ecd628a73d3f21ed8cbce3fc6feec598f Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 14:46:00 +0200 Subject: [PATCH 26/84] small fixes --- website/docs/api/transformer.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index 0b38c2e8d..5ac95cb29 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -102,14 +102,14 @@ attribute. You can also provide a callback to set additional annotations. In your application, you would normally use a shortcut for this and instantiate the component using its string name and [`nlp.add_pipe`](/api/language#create_pipe). -| Name | Description | -| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | The shared vocabulary. ~~Vocab~~ | -| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ | -| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. By default, the function `trfdata_setter` sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | -| _keyword-only_ | | -| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | -| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. 
~~int~~ | +| Name | Description | +| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ | +| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| _keyword-only_ | | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ | ## Transformer.\_\_call\_\_ {#call tag="method"} @@ -532,7 +532,7 @@ You can register custom annotation setters using the > def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None: > pass > -> return setter +> return setter > ``` | Name | Description | From 0e0abb03785d087b14716679efa6f543d1ceda91 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 14:50:29 +0200 Subject: [PATCH 27/84] fix --- website/docs/api/top-level.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 2643c3c02..834b71701 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -362,7 +362,7 @@ loss and the accuracy scores on the development set. There are two built-in logging functions: a logger printing results to the console in tabular format (which is the default), and one that also sends the -results to a [Weights & Biases`](https://www.wandb.com/) dashboard dashboard. +results to a [Weights & Biases](https://www.wandb.com/) dashboard dashboard. Instead of using one of the built-in batchers listed here, you can also [implement your own](/usage/training#custom-code-readers-batchers), which may or may not use a custom schedule. From fe6c08218e0e57d9b9f936634e64330fa224323f Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 14:51:49 +0200 Subject: [PATCH 28/84] fixes --- website/docs/api/top-level.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 834b71701..bde64b77b 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -362,10 +362,9 @@ loss and the accuracy scores on the development set. There are two built-in logging functions: a logger printing results to the console in tabular format (which is the default), and one that also sends the -results to a [Weights & Biases](https://www.wandb.com/) dashboard dashboard. -Instead of using one of the built-in batchers listed here, you can also -[implement your own](/usage/training#custom-code-readers-batchers), which may or -may not use a custom schedule. +results to a [Weights & Biases](https://www.wandb.com/) dashboard. +Instead of using one of the built-in loggers listed here, you can also +[implement your own](/usage/training#custom-logging). 
> #### Example config > From 6340d1c63d1d29e603ff8d84eaf23032e2687698 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Mon, 31 Aug 2020 14:53:22 +0200 Subject: [PATCH 29/84] Add as_spans to Matcher/PhraseMatcher --- spacy/matcher/matcher.pyx | 14 ++++++++++---- spacy/matcher/phrasematcher.pyx | 15 +++++++++++---- spacy/tests/matcher/test_matcher_api.py | 17 ++++++++++++++++- spacy/tests/matcher/test_phrase_matcher.py | 18 +++++++++++++++++- 4 files changed, 54 insertions(+), 10 deletions(-) diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index 16ab73735..fdce7e9fa 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -203,13 +203,16 @@ cdef class Matcher: else: yield doc - def __call__(self, object doclike): + def __call__(self, object doclike, as_spans=False): """Find all token sequences matching the supplied pattern. doclike (Doc or Span): The document to match over. - RETURNS (list): A list of `(key, start, end)` tuples, + as_spans (bool): Return Span objects with labels instead of (match_id, + start, end) tuples. + RETURNS (list): A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span - `doc[start:end]`. The `label_id` and `key` are both integers. + `doc[start:end]`. The `match_id` is an integer. If as_spans is set + to True, a list of Span objects is returned. """ if isinstance(doclike, Doc): doc = doclike @@ -262,7 +265,10 @@ cdef class Matcher: on_match = self._callbacks.get(key, None) if on_match is not None: on_match(self, doc, i, final_matches) - return final_matches + if as_spans: + return [Span(doc, start, end, label=key) for key, start, end in final_matches] + else: + return final_matches def _normalize_key(self, key): if isinstance(key, basestring): diff --git a/spacy/matcher/phrasematcher.pyx b/spacy/matcher/phrasematcher.pyx index 060c4d37f..6658c713e 100644 --- a/spacy/matcher/phrasematcher.pyx +++ b/spacy/matcher/phrasematcher.pyx @@ -7,6 +7,7 @@ import warnings from ..attrs cimport ORTH, POS, TAG, DEP, LEMMA from ..structs cimport TokenC from ..tokens.token cimport Token +from ..tokens.span cimport Span from ..typedefs cimport attr_t from ..schemas import TokenPattern @@ -216,13 +217,16 @@ cdef class PhraseMatcher: result = internal_node map_set(self.mem, result, self.vocab.strings[key], NULL) - def __call__(self, doc): + def __call__(self, doc, as_spans=False): """Find all sequences matching the supplied patterns on the `Doc`. doc (Doc): The document to match over. - RETURNS (list): A list of `(key, start, end)` tuples, + as_spans (bool): Return Span objects with labels instead of (match_id, + start, end) tuples. + RETURNS (list): A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span - `doc[start:end]`. The `label_id` and `key` are both integers. + `doc[start:end]`. The `match_id` is an integer. If as_spans is set + to True, a list of Span objects is returned. 
DOCS: https://spacy.io/api/phrasematcher#call """ @@ -239,7 +243,10 @@ cdef class PhraseMatcher: on_match = self._callbacks.get(self.vocab.strings[ent_id]) if on_match is not None: on_match(self, doc, i, matches) - return matches + if as_spans: + return [Span(doc, start, end, label=key) for key, start, end in matches] + else: + return matches cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil: cdef MapStruct* current_node = self.c_map diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index bcb224bd3..aeac509b5 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -2,7 +2,8 @@ import pytest import re from mock import Mock from spacy.matcher import Matcher, DependencyMatcher -from spacy.tokens import Doc, Token +from spacy.tokens import Doc, Token, Span + from ..doc.test_underscore import clean_underscore # noqa: F401 @@ -469,3 +470,17 @@ def test_matcher_span(matcher): assert len(matcher(doc)) == 2 assert len(matcher(span_js)) == 1 assert len(matcher(span_java)) == 1 + + +def test_matcher_as_spans(matcher): + """Test the new as_spans=True API.""" + text = "JavaScript is good but Java is better" + doc = Doc(matcher.vocab, words=text.split()) + matches = matcher(doc, as_spans=True) + assert len(matches) == 2 + assert isinstance(matches[0], Span) + assert matches[0].text == "JavaScript" + assert matches[0].label_ == "JS" + assert isinstance(matches[1], Span) + assert matches[1].text == "Java" + assert matches[1].label_ == "Java" diff --git a/spacy/tests/matcher/test_phrase_matcher.py b/spacy/tests/matcher/test_phrase_matcher.py index 2a3c7d693..edffaa900 100644 --- a/spacy/tests/matcher/test_phrase_matcher.py +++ b/spacy/tests/matcher/test_phrase_matcher.py @@ -2,7 +2,7 @@ import pytest import srsly from mock import Mock from spacy.matcher import PhraseMatcher -from spacy.tokens import Doc +from spacy.tokens import Doc, Span from ..util import get_doc @@ -287,3 +287,19 @@ def test_phrase_matcher_pickle(en_vocab): # clunky way to vaguely check that callback is unpickled (vocab, docs, callbacks, attr) = matcher_unpickled.__reduce__()[1] assert isinstance(callbacks.get("TEST2"), Mock) + + +def test_phrase_matcher_as_spans(en_vocab): + """Test the new as_spans=True API.""" + matcher = PhraseMatcher(en_vocab) + matcher.add("A", [Doc(en_vocab, words=["hello", "world"])]) + matcher.add("B", [Doc(en_vocab, words=["test"])]) + doc = Doc(en_vocab, words=["...", "hello", "world", "this", "is", "a", "test"]) + matches = matcher(doc, as_spans=True) + assert len(matches) == 2 + assert isinstance(matches[0], Span) + assert matches[0].text == "hello world" + assert matches[0].label_ == "A" + assert isinstance(matches[1], Span) + assert matches[1].text == "test" + assert matches[1].label_ == "B" From 83aff38c59a638a795a154c51a25de3f98558a31 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Mon, 31 Aug 2020 15:39:03 +0200 Subject: [PATCH 30/84] Make argument keyword-only Co-authored-by: Matthew Honnibal --- spacy/matcher/matcher.pyx | 2 +- spacy/matcher/phrasematcher.pyx | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index fdce7e9fa..ee8efd688 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -203,7 +203,7 @@ cdef class Matcher: else: yield doc - def __call__(self, object doclike, as_spans=False): + def __call__(self, object doclike, *, as_spans=False): """Find all token sequences matching the supplied 
pattern. doclike (Doc or Span): The document to match over. diff --git a/spacy/matcher/phrasematcher.pyx b/spacy/matcher/phrasematcher.pyx index 6658c713e..44dda115b 100644 --- a/spacy/matcher/phrasematcher.pyx +++ b/spacy/matcher/phrasematcher.pyx @@ -217,7 +217,7 @@ cdef class PhraseMatcher: result = internal_node map_set(self.mem, result, self.vocab.strings[key], NULL) - def __call__(self, doc, as_spans=False): + def __call__(self, doc, *, as_spans=False): """Find all sequences matching the supplied patterns on the `Doc`. doc (Doc): The document to match over. From db9f8896f5f8e4c6bde1839a4f04a06babbf64fc Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Mon, 31 Aug 2020 16:10:41 +0200 Subject: [PATCH 31/84] Add docs [ci skip] --- website/docs/api/matcher.md | 10 ++++--- website/docs/api/phrasematcher.md | 10 ++++--- website/docs/usage/rule-based-matching.md | 33 +++++++++++++++++++++++ 3 files changed, 45 insertions(+), 8 deletions(-) diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md index f259174e2..136bac3c8 100644 --- a/website/docs/api/matcher.md +++ b/website/docs/api/matcher.md @@ -116,10 +116,12 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`. > matches = matcher(doc) > ``` -| Name | Description | -| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | -| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. ~~List[Tuple[int, int, int]]~~ | +| Name | Description | +| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | +| _keyword-only_ | | +| `as_spans` 3 | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ | +| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | ## Matcher.pipe {#pipe tag="method"} diff --git a/website/docs/api/phrasematcher.md b/website/docs/api/phrasematcher.md index 143eb9edf..8064a621e 100644 --- a/website/docs/api/phrasematcher.md +++ b/website/docs/api/phrasematcher.md @@ -57,10 +57,12 @@ Find all token sequences matching the supplied patterns on the `Doc`. > matches = matcher(doc) > ``` -| Name | Description | -| ----------- | ----------------------------------- | -| `doc` | The document to match over. ~~Doc~~ | -| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. 
~~List[Tuple[int, int, int]]~~ | +| Name | Description | +| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doc` | The document to match over. ~~Doc~~ | +| _keyword-only_ | | +| `as_spans` 3 | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ | +| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index 7fdce032e..e3e0f2c19 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -493,6 +493,39 @@ you prefer. | `i` | Index of the current match (`matches[i`]). ~~int~~ | | `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~ List[Tuple[int, int int]]~~ | +### Creating spans from matches {#matcher-spans} + +Creating [`Span`](/api/span) objects from the returned matches is a very common +use case. spaCy makes this easy by giving you access to the `start` and `end` +token of each match, which you can use to construct a new span with an optional +label. As of spaCy v3.0, you can also set `as_spans=True` when calling the +matcher on a `Doc`, which will return a list of [`Span`](/api/span) objects +using the `match_id` as the span label. + +```python +### {executable="true"} +import spacy +from spacy.matcher import Matcher +from spacy.tokens import Span + +nlp = spacy.blank("en") +matcher = Matcher(nlp.vocab) +matcher.add("PERSON", [[{"lower": "barack"}, {"lower": "obama"}]]) +doc = nlp("Barack Obama was the 44th president of the United States") + +# 1. Return (match_id, start, end) tuples +matches = matcher(doc) +for match_id, start, end in matches: + # Create the matched span and assign the match_id as a label + span = Span(doc, start, end, label=match_id) + print(span.text, span.label_) + +# 2. Return Span objects directly +matches = matcher(doc, as_spans=True) +for span in matches: + print(span.text, span.label_) +``` + ### Using custom pipeline components {#matcher-pipeline} Let's say your data also contains some annoying pre-processing artifacts, like From bca6bf8ddabca87bc963186cb8f449eb27d93ce3 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Mon, 31 Aug 2020 16:39:53 +0200 Subject: [PATCH 32/84] Update docs [ci skip] --- website/docs/api/top-level.md | 19 +++++++++----- website/docs/usage/projects.md | 2 +- website/docs/usage/training.md | 47 ++++++++++++++++++---------------- 3 files changed, 38 insertions(+), 30 deletions(-) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index bde64b77b..b1a2d9532 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -362,8 +362,8 @@ loss and the accuracy scores on the development set. 
There are two built-in logging functions: a logger printing results to the console in tabular format (which is the default), and one that also sends the -results to a [Weights & Biases](https://www.wandb.com/) dashboard. -Instead of using one of the built-in loggers listed here, you can also +results to a [Weights & Biases](https://www.wandb.com/) dashboard. Instead of +using one of the built-in loggers listed here, you can also [implement your own](/usage/training#custom-logging). > #### Example config @@ -394,11 +394,16 @@ memory utilization, network traffic, disk IO, GPU statistics, etc. This will also include information such as your hostname and operating system, as well as the location of your Python executable. -Note that by default, the full (interpolated) training config file is sent over -to the W&B dashboard. If you prefer to exclude certain information such as path -names, you can list those fields in "dot notation" in the `remove_config_values` -parameter. These fields will then be removed from the config before uploading, -but will otherwise remain in the config file stored on your local system. + + +Note that by default, the full (interpolated) +[training config](/usage/training#config) is sent over to the W&B dashboard. If +you prefer to **exclude certain information** such as path names, you can list +those fields in "dot notation" in the `remove_config_values` parameter. These +fields will then be removed from the config before uploading, but will otherwise +remain in the config file stored on your local system. + + > #### Example config > diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md index 620526280..ef895195c 100644 --- a/website/docs/usage/projects.md +++ b/website/docs/usage/projects.md @@ -914,4 +914,4 @@ mattis pretium. ### Weights & Biases {#wandb} - + diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 069e7c00a..20f25924e 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -607,8 +607,12 @@ $ python -m spacy train config.cfg --output ./output --code ./functions.py #### Example: Custom logging function {#custom-logging} -During training, the results of each step are passed to a logger function in a -dictionary providing the following information: +During training, the results of each step are passed to a logger function. By +default, these results are written to the console with the +[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support +for writing the log files to [Weights & Biases](https://www.wandb.com/) with the +[`WandbLogger`](/api/top-level#WandbLogger). The logger function receives a +**dictionary** with the following keys: | Key | Value | | -------------- | ---------------------------------------------------------------------------------------------- | @@ -619,11 +623,17 @@ dictionary providing the following information: | `losses` | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ | | `checkpoints` | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ | -By default, these results are written to the console with the -[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support -for writing the log files to [Weights & Biases](https://www.wandb.com/) with -the [`WandbLogger`](/api/top-level#WandbLogger). 
But you can easily implement -your own logger as well, for instance to write the tabular results to file: +You can easily implement and plug in your own logger that records the training +results in a custom way, or sends them to an experiment management tracker of +your choice. In this example, the function `my_custom_logger.v1` writes the +tabular results to a file: + +> ```ini +> ### config.cfg (excerpt) +> [training.logger] +> @loggers = "my_custom_logger.v1" +> file_path = "my_file.tab" +> ``` ```python ### functions.py @@ -635,19 +645,19 @@ from pathlib import Path def custom_logger(log_path): def setup_logger(nlp: "Language") -> Tuple[Callable, Callable]: with Path(log_path).open("w") as file_: - file_.write("step\t") - file_.write("score\t") + file_.write("step\\t") + file_.write("score\\t") for pipe in nlp.pipe_names: - file_.write(f"loss_{pipe}\t") - file_.write("\n") + file_.write(f"loss_{pipe}\\t") + file_.write("\\n") def log_step(info: Dict[str, Any]): with Path(log_path).open("a") as file_: - file_.write(f"{info['step']}\t") - file_.write(f"{info['score']}\t") + file_.write(f"{info['step']}\\t") + file_.write(f"{info['score']}\\t") for pipe in nlp.pipe_names: - file_.write(f"{info['losses'][pipe]}\t") - file_.write("\n") + file_.write(f"{info['losses'][pipe]}\\t") + file_.write("\\n") def finalize(): pass @@ -657,13 +667,6 @@ def custom_logger(log_path): return setup_logger ``` -```ini -### config.cfg (excerpt) -[training.logger] -@loggers = "my_custom_logger.v1" -file_path = "my_file.tab" -``` - #### Example: Custom batch size schedule {#custom-code-schedule} For example, let's say you've implemented your own batch size schedule to use From 2c3b64a567b2afb54edef14dbd8b87f2adb5e7e6 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 31 Aug 2020 16:56:13 +0200 Subject: [PATCH 33/84] console logging example --- website/docs/api/top-level.md | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index b1a2d9532..b747007b0 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -377,6 +377,37 @@ using one of the built-in loggers listed here, you can also Writes the results of a training step to the console in a tabular format. + + +``` +$ python -m spacy train config.cfg +ℹ Using CPU +ℹ Loading config and nlp from: config.cfg +ℹ Pipeline: ['tok2vec', 'tagger'] +ℹ Start training +ℹ Training. Initial learn rate: 0.0 +E # LOSS TOK2VEC LOSS TAGGER TAG_ACC SCORE +--- ------ ------------ ----------- ------- ------ + 1 0 0.00 86.20 0.22 0.00 + 1 200 3.08 18968.78 34.00 0.34 + 1 400 31.81 22539.06 33.64 0.34 + 1 600 92.13 22794.91 43.80 0.44 + 1 800 183.62 21541.39 56.05 0.56 + 1 1000 352.49 25461.82 65.15 0.65 + 1 1200 422.87 23708.82 71.84 0.72 + 1 1400 601.92 24994.79 76.57 0.77 + 1 1600 662.57 22268.02 80.20 0.80 + 1 1800 1101.50 28413.77 82.56 0.83 + 1 2000 1253.43 28736.36 85.00 0.85 + 1 2200 1411.02 28237.53 87.42 0.87 + 1 2400 1605.35 28439.95 88.70 0.89 +``` + +Note that the cumulative loss keeps increasing within one epoch, but should +start decreasing across epochs. 
+ + + #### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"} > #### Installation From add9de548717f793b6dc5fa577aae34e1a24b874 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Mon, 31 Aug 2020 17:01:24 +0200 Subject: [PATCH 34/84] Deprecate (Phrase)Matcher.pipe --- spacy/errors.py | 3 +++ spacy/matcher/matcher.pyx | 14 +++----------- spacy/matcher/phrasematcher.pyx | 16 +++------------- spacy/tests/matcher/test_matcher_api.py | 9 +++++++++ spacy/tests/matcher/test_phrase_matcher.py | 11 +++++++++++ website/docs/api/matcher.md | 21 --------------------- website/docs/api/phrasematcher.md | 21 --------------------- website/docs/usage/rule-based-matching.md | 9 --------- website/docs/usage/v3.md | 1 + 9 files changed, 30 insertions(+), 75 deletions(-) diff --git a/spacy/errors.py b/spacy/errors.py index e53aaef07..b99c99959 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -112,6 +112,9 @@ class Warnings: "word segmenters: {supported}. Defaulting to {default}.") W104 = ("Skipping modifications for '{target}' segmenter. The current " "segmenter is '{current}'.") + W105 = ("As of spaCy v3.0, the {matcher}.pipe method is deprecated. If you " + "need to match on a stream of documents, you can use nlp.pipe and " + "call the {matcher} on each Doc object.") @add_codes diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index ee8efd688..d3a8fa539 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -176,18 +176,10 @@ cdef class Matcher: return (self._callbacks[key], self._patterns[key]) def pipe(self, docs, batch_size=1000, return_matches=False, as_tuples=False): - """Match a stream of documents, yielding them in turn. - - docs (Iterable[Union[Doc, Span]]): A stream of documents or spans. - batch_size (int): Number of documents to accumulate into a working set. - return_matches (bool): Yield the match lists along with the docs, making - results (doc, matches) tuples. - as_tuples (bool): Interpret the input stream as (doc, context) tuples, - and yield (result, context) tuples out. - If both return_matches and as_tuples are True, the output will - be a sequence of ((doc, matches), context) tuples. - YIELDS (Doc): Documents, in order. + """Match a stream of documents, yielding them in turn. Deprecated as of + spaCy v3.0. """ + warnings.warn(Warnings.W105.format(matcher="Matcher"), DeprecationWarning) if as_tuples: for doc, context in docs: matches = self(doc) diff --git a/spacy/matcher/phrasematcher.pyx b/spacy/matcher/phrasematcher.pyx index 44dda115b..ba0f515b5 100644 --- a/spacy/matcher/phrasematcher.pyx +++ b/spacy/matcher/phrasematcher.pyx @@ -292,20 +292,10 @@ cdef class PhraseMatcher: idx += 1 def pipe(self, stream, batch_size=1000, return_matches=False, as_tuples=False): - """Match a stream of documents, yielding them in turn. - - docs (iterable): A stream of documents. - batch_size (int): Number of documents to accumulate into a working set. - return_matches (bool): Yield the match lists along with the docs, making - results (doc, matches) tuples. - as_tuples (bool): Interpret the input stream as (doc, context) tuples, - and yield (result, context) tuples out. - If both return_matches and as_tuples are True, the output will - be a sequence of ((doc, matches), context) tuples. - YIELDS (Doc): Documents, in order. - - DOCS: https://spacy.io/api/phrasematcher#pipe + """Match a stream of documents, yielding them in turn. Deprecated as of + spaCy v3.0. 
""" + warnings.warn(Warnings.W105.format(matcher="PhraseMatcher"), DeprecationWarning) if as_tuples: for doc, context in stream: matches = self(doc) diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index aeac509b5..8310c4466 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -484,3 +484,12 @@ def test_matcher_as_spans(matcher): assert isinstance(matches[1], Span) assert matches[1].text == "Java" assert matches[1].label_ == "Java" + + +def test_matcher_deprecated(matcher): + doc = Doc(matcher.vocab, words=["hello", "world"]) + with pytest.warns(DeprecationWarning) as record: + for _ in matcher.pipe([doc]): + pass + assert record.list + assert "spaCy v3.0" in str(record.list[0].message) diff --git a/spacy/tests/matcher/test_phrase_matcher.py b/spacy/tests/matcher/test_phrase_matcher.py index edffaa900..4b7027f87 100644 --- a/spacy/tests/matcher/test_phrase_matcher.py +++ b/spacy/tests/matcher/test_phrase_matcher.py @@ -303,3 +303,14 @@ def test_phrase_matcher_as_spans(en_vocab): assert isinstance(matches[1], Span) assert matches[1].text == "test" assert matches[1].label_ == "B" + + +def test_phrase_matcher_deprecated(en_vocab): + matcher = PhraseMatcher(en_vocab) + matcher.add("TEST", [Doc(en_vocab, words=["helllo"])]) + doc = Doc(en_vocab, words=["hello", "world"]) + with pytest.warns(DeprecationWarning) as record: + for _ in matcher.pipe([doc]): + pass + assert record.list + assert "spaCy v3.0" in str(record.list[0].message) diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md index 136bac3c8..1f1946be5 100644 --- a/website/docs/api/matcher.md +++ b/website/docs/api/matcher.md @@ -123,27 +123,6 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`. | `as_spans` 3 | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ | | **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | -## Matcher.pipe {#pipe tag="method"} - -Match a stream of documents, yielding them in turn. - -> #### Example -> -> ```python -> from spacy.matcher import Matcher -> matcher = Matcher(nlp.vocab) -> for doc in matcher.pipe(docs, batch_size=50): -> pass -> ``` - -| Name | Description | -| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `docs` | A stream of documents or spans. ~~Iterable[Union[Doc, Span]]~~ | -| `batch_size` | The number of documents to accumulate into a working set. ~~int~~ | -| `return_matches` 2.1 | Yield the match lists along with the docs, making results `(doc, matches)` tuples. ~~bool~~ | -| `as_tuples` | Interpret the input stream as `(doc, context)` tuples, and yield `(result, context)` tuples out. If both `return_matches` and `as_tuples` are `True`, the output will be a sequence of `((doc, matches), context)` tuples. ~~bool~~ | -| **YIELDS** | Documents, in order. 
~~Union[Doc, Tuple[Doc, Any], Tuple[Tuple[Doc, Any], Any]]~~ | - ## Matcher.\_\_len\_\_ {#len tag="method" new="2"} Get the number of rules added to the matcher. Note that this only returns the diff --git a/website/docs/api/phrasematcher.md b/website/docs/api/phrasematcher.md index 8064a621e..39e3a298b 100644 --- a/website/docs/api/phrasematcher.md +++ b/website/docs/api/phrasematcher.md @@ -76,27 +76,6 @@ match_id_string = nlp.vocab.strings[match_id] -## PhraseMatcher.pipe {#pipe tag="method"} - -Match a stream of documents, yielding them in turn. - -> #### Example -> -> ```python -> from spacy.matcher import PhraseMatcher -> matcher = PhraseMatcher(nlp.vocab) -> for doc in matcher.pipe(docs, batch_size=50): -> pass -> ``` - -| Name | Description | -| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `docs` | A stream of documents. ~~Iterable[Doc]~~ | -| `batch_size` | The number of documents to accumulate into a working set. ~~int~~ | -| `return_matches` 2.1 | Yield the match lists along with the docs, making results `(doc, matches)` tuples. ~~bool~~ | -| `as_tuples` | Interpret the input stream as `(doc, context)` tuples, and yield `(result, context)` tuples out. If both `return_matches` and `as_tuples` are `True`, the output will be a sequence of `((doc, matches), context)` tuples. ~~bool~~ | -| **YIELDS** | Documents and optional matches or context in order. ~~Union[Doc, Tuple[Doc, Any], Tuple[Tuple[Doc, Any], Any]]~~ | - ## PhraseMatcher.\_\_len\_\_ {#len tag="method"} Get the number of rules added to the matcher. Note that this only returns the diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index e3e0f2c19..a589c556e 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -856,15 +856,6 @@ for token in doc: print(token.text, token._.is_hashtag) ``` -To process a stream of social media posts, we can use -[`Language.pipe`](/api/language#pipe), which will return a stream of `Doc` -objects that we can pass to [`Matcher.pipe`](/api/matcher#pipe). - -```python -docs = nlp.pipe(LOTS_OF_TWEETS) -matches = matcher.pipe(docs) -``` - ## Efficient phrase matching {#phrasematcher} If you need to match large terminology lists, you can also use the diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index de3d7ce33..6a1499bdf 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -389,6 +389,7 @@ Note that spaCy v3.0 now requires **Python 3.6+**. 
| `GoldParse` | [`Example`](/api/example) | | `GoldCorpus` | [`Corpus`](/api/corpus) | | `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) | +| `Matcher.pipe`, `PhraseMatcher.pipe` | not needed | | `spacy init-model` | [`spacy init model`](/api/cli#init-model) | | `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) | | `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) | From 3929431af1991d76f3594f89c8dda2c87ee8d5e6 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Mon, 31 Aug 2020 17:06:33 +0200 Subject: [PATCH 35/84] Update docs [ci skip] --- website/docs/api/top-level.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index b747007b0..d437ecc07 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -357,8 +357,8 @@ are returned: one for logging the information for each training step, and a second function that is called to finalize the logging when the training is finished. To log each training step, a [dictionary](/usage/training#custom-logging) is passed on from the -[training script](/api/cli#train), including information such as the training -loss and the accuracy scores on the development set. +[`spacy train`](/api/cli#train), including information such as the training loss +and the accuracy scores on the development set. There are two built-in logging functions: a logger printing results to the console in tabular format (which is the default), and one that also sends the @@ -366,6 +366,8 @@ results to a [Weights & Biases](https://www.wandb.com/) dashboard. Instead of using one of the built-in loggers listed here, you can also [implement your own](/usage/training#custom-logging). +#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"} + > #### Example config > > ```ini @@ -373,19 +375,21 @@ using one of the built-in loggers listed here, you can also > @loggers = "spacy.ConsoleLogger.v1" > ``` -#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"} - Writes the results of a training step to the console in a tabular format. - + + +```cli +$ python -m spacy train config.cfg +``` ``` -$ python -m spacy train config.cfg ℹ Using CPU ℹ Loading config and nlp from: config.cfg ℹ Pipeline: ['tok2vec', 'tagger'] ℹ Start training ℹ Training. 
Initial learn rate: 0.0 + E # LOSS TOK2VEC LOSS TAGGER TAG_ACC SCORE --- ------ ------------ ----------- ------- ------ 1 0 0.00 86.20 0.22 0.00 From 3ac620f09d2d18dcc4e347f610eeb3aba32875a0 Mon Sep 17 00:00:00 2001 From: Sofie Van Landeghem Date: Mon, 31 Aug 2020 17:40:04 +0200 Subject: [PATCH 36/84] fix config example [ci skip] --- website/docs/usage/training.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 20f25924e..2d7905230 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -632,7 +632,7 @@ tabular results to a file: > ### config.cfg (excerpt) > [training.logger] > @loggers = "my_custom_logger.v1" -> file_path = "my_file.tab" +> log_path = "my_file.tab" > ``` ```python From fe298fa50ab9854443b0b0229d6b5800d38869d1 Mon Sep 17 00:00:00 2001 From: Matthw Honnibal Date: Mon, 31 Aug 2020 19:55:22 +0200 Subject: [PATCH 37/84] Shuffle on first epoch of train --- spacy/cli/train.py | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index d9ab8eca5..2bfa5c56e 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -186,18 +186,12 @@ def train( def create_train_batches(iterator, batcher, max_epochs: int): - epoch = 1 - examples = [] - # Stream the first epoch, so we start training faster and support - # infinite streams. - for batch in batcher(iterator): - yield epoch, batch - if max_epochs != 1: - examples.extend(batch) + epoch = 0 + examples = list(iterator) if not examples: # Raise error if no data raise ValueError(Errors.E986) - while epoch != max_epochs: + while max_epochs < 1 or epoch != max_epochs: random.shuffle(examples) for batch in batcher(examples): yield epoch, batch From 9130094199788c8962e4d288fd38c07d9dc537a3 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Mon, 31 Aug 2020 21:24:33 +0200 Subject: [PATCH 38/84] Prevent Tagger model init with 0 labels (#5984) * Prevent Tagger model init with 0 labels Raise an error before trying to initialize a tagger model with 0 labels. * Add dummy tagger label for test * Remove tagless tagger model initializiation * Fix error number after merge * Add dummy tagger label to test * Fix formatting Co-authored-by: Matthew Honnibal --- spacy/errors.py | 1 + spacy/pipeline/tagger.pyx | 8 ++++---- spacy/tests/pipeline/test_tagger.py | 7 +++++++ spacy/tests/regression/test_issue4001-4500.py | 3 ++- spacy/tests/regression/test_issue5230.py | 1 + spacy/tests/serialize/test_serialize_config.py | 1 + 6 files changed, 16 insertions(+), 5 deletions(-) diff --git a/spacy/errors.py b/spacy/errors.py index b99c99959..be71de820 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -651,6 +651,7 @@ class Errors: E1005 = ("Unable to set attribute '{attr}' in tokenizer exception for " "'{chunk}'. 
Tokenizer exceptions are only allowed to specify " "`ORTH` and `NORM`.") + E1006 = ("Unable to initialize {name} model with 0 labels.") @add_codes diff --git a/spacy/pipeline/tagger.pyx b/spacy/pipeline/tagger.pyx index af24bf336..c94cb6b58 100644 --- a/spacy/pipeline/tagger.pyx +++ b/spacy/pipeline/tagger.pyx @@ -285,11 +285,11 @@ class Tagger(Pipe): doc_sample.append(Doc(self.vocab, words=["hello"])) for tag in sorted(tags): self.add_label(tag) + if len(self.labels) == 0: + err = Errors.E1006.format(name="Tagger") + raise ValueError(err) self.set_output(len(self.labels)) - if self.labels: - self.model.initialize(X=doc_sample) - else: - self.model.initialize() + self.model.initialize(X=doc_sample) if sgd is None: sgd = self.create_optimizer() return sgd diff --git a/spacy/tests/pipeline/test_tagger.py b/spacy/tests/pipeline/test_tagger.py index 1af4a5121..b1b52b9fa 100644 --- a/spacy/tests/pipeline/test_tagger.py +++ b/spacy/tests/pipeline/test_tagger.py @@ -69,3 +69,10 @@ def test_overfitting_IO(): assert doc2[1].tag_ is "V" assert doc2[2].tag_ is "J" assert doc2[3].tag_ is "N" + + +def test_tagger_requires_labels(): + nlp = English() + tagger = nlp.add_pipe("tagger") + with pytest.raises(ValueError): + optimizer = nlp.begin_training() diff --git a/spacy/tests/regression/test_issue4001-4500.py b/spacy/tests/regression/test_issue4001-4500.py index 1789973e9..e846841d4 100644 --- a/spacy/tests/regression/test_issue4001-4500.py +++ b/spacy/tests/regression/test_issue4001-4500.py @@ -326,7 +326,8 @@ def test_issue4348(): nlp = English() example = Example.from_dict(nlp.make_doc(""), {"tags": []}) TRAIN_DATA = [example, example] - nlp.add_pipe("tagger") + tagger = nlp.add_pipe("tagger") + tagger.add_label("A") optimizer = nlp.begin_training() for i in range(5): losses = {} diff --git a/spacy/tests/regression/test_issue5230.py b/spacy/tests/regression/test_issue5230.py index 2ac886625..78ae04bbb 100644 --- a/spacy/tests/regression/test_issue5230.py +++ b/spacy/tests/regression/test_issue5230.py @@ -63,6 +63,7 @@ def tagger(): # need to add model for two reasons: # 1. no model leads to error in serialization, # 2. 
the affected line is the one for model serialization + tagger.add_label("A") tagger.begin_training(lambda: [], pipeline=nlp.pipeline) return tagger diff --git a/spacy/tests/serialize/test_serialize_config.py b/spacy/tests/serialize/test_serialize_config.py index e425d370d..fde92b0af 100644 --- a/spacy/tests/serialize/test_serialize_config.py +++ b/spacy/tests/serialize/test_serialize_config.py @@ -144,6 +144,7 @@ def test_serialize_nlp(): """ Create a custom nlp pipeline from config and ensure it serializes it correctly """ nlp_config = Config().from_str(nlp_config_string) nlp, _ = load_model_from_config(nlp_config, auto_fill=True) + nlp.get_pipe("tagger").add_label("A") nlp.begin_training() assert "tok2vec" in nlp.pipe_names assert "tagger" in nlp.pipe_names From ec660e3131f8898cfe6570a32cda2cff7543e0e8 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Tue, 1 Sep 2020 00:41:38 +0200 Subject: [PATCH 39/84] Fix use_pytorch_for_gpu_memory --- spacy/cli/train.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index b688dd384..7525b9669 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -77,6 +77,9 @@ def train( ) if config.get("training", {}).get("seed") is not None: fix_random_seed(config["training"]["seed"]) + if config.get("system", {}).get("use_pytorch_for_gpu_memory"): + # It feels kind of weird to not have a default for this. + use_pytorch_for_gpu_memory() # Use original config here before it's resolved to functions sourced_components = get_sourced_components(config) with show_validation_error(config_path): @@ -85,9 +88,6 @@ def train( util.load_vectors_into_model(nlp, config["training"]["vectors"]) verify_config(nlp) raw_text, tag_map, morph_rules, weights_data = load_from_paths(config) - if config.get("system", {}).get("use_pytorch_for_gpu_memory"): - # It feels kind of weird to not have a default for this. - use_pytorch_for_gpu_memory() T_cfg = config["training"] optimizer = T_cfg["optimizer"] train_corpus = T_cfg["train_corpus"] From 61a71d8bccf5d91ecd20c1ec7e7809c5f58cb282 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Tue, 1 Sep 2020 01:10:53 +0200 Subject: [PATCH 40/84] Try to debug tmpdir problem --- Makefile | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 5cd616529..463de81b8 100644 --- a/Makefile +++ b/Makefile @@ -24,7 +24,11 @@ endif dist/$(SPACY_BIN) : $(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp - $(VENV)/bin/pex \ + tmp_dir = $(TMPDIR) + echo $(tmp_dir) + source $(VENV)/bin/activate + export TMPDIR=$(tmp_dir) + pex \ -f $(WHEELHOUSE) \ --no-index \ --disable-cache \ From bff1640a75ad08e9d8ab3fcf0a147df32d26c7e9 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Tue, 1 Sep 2020 01:13:09 +0200 Subject: [PATCH 41/84] Try to debug tmpdir problem --- Makefile | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/Makefile b/Makefile index 463de81b8..d3b31de1d 100644 --- a/Makefile +++ b/Makefile @@ -45,8 +45,13 @@ dist/pytest.pex : $(WHEELHOUSE)/pytest-*.whl $(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py* mkdir -p $(WHEELHOUSE) - $(VENV)/bin/pip wheel . -w $(WHEELHOUSE) - $(VENV)/bin/pip wheel $(SPACY_EXTRAS) -w $(WHEELHOUSE) + tmp_dir = $(TMPDIR) + echo $(tmp_dir) + source $(VENV)/bin/activate + export TMPDIR=$(tmp_dir) + + pip wheel . 
-w $(WHEELHOUSE) + pip wheel $(SPACY_EXTRAS) -w $(WHEELHOUSE) touch $@ From 027c82c0686133cd25b80f498f1acc763fca3fb2 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Tue, 1 Sep 2020 01:22:54 +0200 Subject: [PATCH 42/84] Update makefile --- Makefile | 16 +++------------- 1 file changed, 3 insertions(+), 13 deletions(-) diff --git a/Makefile b/Makefile index d3b31de1d..c4e77d101 100644 --- a/Makefile +++ b/Makefile @@ -24,11 +24,7 @@ endif dist/$(SPACY_BIN) : $(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp - tmp_dir = $(TMPDIR) - echo $(tmp_dir) - source $(VENV)/bin/activate - export TMPDIR=$(tmp_dir) - pex \ + $(VENV)/bin/pex \ -f $(WHEELHOUSE) \ --no-index \ --disable-cache \ @@ -44,14 +40,8 @@ dist/pytest.pex : $(WHEELHOUSE)/pytest-*.whl chmod a+rx $@ $(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py* - mkdir -p $(WHEELHOUSE) - tmp_dir = $(TMPDIR) - echo $(tmp_dir) - source $(VENV)/bin/activate - export TMPDIR=$(tmp_dir) - - pip wheel . -w $(WHEELHOUSE) - pip wheel $(SPACY_EXTRAS) -w $(WHEELHOUSE) + $(VENV)/bin/pip wheel . -w $(WHEELHOUSE) + $(VENV)/bin/pip wheel $(SPACY_EXTRAS) -w $(WHEELHOUSE) touch $@ From ef9005273bac4468df21aa3b6a4b3c62a8877932 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Tue, 1 Sep 2020 12:07:04 +0200 Subject: [PATCH 43/84] Update fill-config command and add silent mode [ci skip] --- spacy/cli/init_config.py | 25 ++++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/spacy/cli/init_config.py b/spacy/cli/init_config.py index b5335df51..1e1e55e06 100644 --- a/spacy/cli/init_config.py +++ b/spacy/cli/init_config.py @@ -64,10 +64,16 @@ def init_fill_config_cli( def fill_config( - output_file: Path, base_path: Path, *, pretraining: bool = False, diff: bool = False + output_file: Path, + base_path: Path, + *, + pretraining: bool = False, + diff: bool = False, + silent: bool = False, ) -> Tuple[Config, Config]: is_stdout = str(output_file) == "-" - msg = Printer(no_print=is_stdout) + no_print = is_stdout or silent + msg = Printer(no_print=no_print) with show_validation_error(hint_fill=False): config = util.load_config(base_path) nlp, _ = util.load_model_from_config(config, auto_fill=True, validate=False) @@ -85,7 +91,7 @@ def fill_config( msg.warn("Nothing to auto-fill: base config is already complete") else: msg.good("Auto-filled config with all values") - if diff and not is_stdout: + if diff and not no_print: if before == after: msg.warn("No diff to show: nothing was auto-filled") else: @@ -94,7 +100,8 @@ def fill_config( print(diff_strings(before, after)) msg.divider("END CONFIG DIFF") print("") - save_config(filled, output_file, is_stdout=is_stdout) + save_config(filled, output_file, is_stdout=is_stdout, silent=silent) + return config, filled def init_config( @@ -149,8 +156,11 @@ def init_config( save_config(nlp.config, output_file, is_stdout=is_stdout) -def save_config(config: Config, output_file: Path, is_stdout: bool = False) -> None: - msg = Printer(no_print=is_stdout) +def save_config( + config: Config, output_file: Path, is_stdout: bool = False, silent: bool = False +) -> None: + no_print = is_stdout or silent + msg = Printer(no_print=no_print) if is_stdout: print(config.to_str()) else: @@ -160,7 +170,8 @@ def save_config(config: Config, output_file: Path, is_stdout: bool = False) -> N msg.good("Saved config", output_file) msg.text("You can now add your data and train your model:") variables = ["--paths.train ./train.spacy", "--paths.dev ./dev.spacy"] - print(f"{COMMAND} 
train {output_file.parts[-1]} {' '.join(variables)}") + if not no_print: + print(f"{COMMAND} train {output_file.parts[-1]} {' '.join(variables)}") def has_spacy_transformers() -> bool: From a4c51f0f18e0eea6d994b53de9571e07fa077ded Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Tue, 1 Sep 2020 12:36:21 +0200 Subject: [PATCH 44/84] Add v3 info to project docs [ci skip] --- spacy/cli/project/document.py | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/spacy/cli/project/document.py b/spacy/cli/project/document.py index ffa77b2d5..30cb2af25 100644 --- a/spacy/cli/project/document.py +++ b/spacy/cli/project/document.py @@ -6,6 +6,9 @@ from .._util import project_cli, Arg, Opt, PROJECT_FILE, load_project_config DOCS_URL = "https://nightly.spacy.io" +INTRO = f"""> ⚠️ This project template uses the new [**spaCy v3.0**]({DOCS_URL}), which +> is currently available as a nightly pre-release. You can install it from pip as `spacy-nightly`: +> `pip install spacy-nightly`. Make sure to use a fresh virtual environment.""" INTRO_PROJECT = f"""The [`{PROJECT_FILE}`]({PROJECT_FILE}) defines the data assets required by the project, as well as the available commands and workflows. For details, see the [spaCy projects documentation]({DOCS_URL}/usage/projects).""" @@ -52,6 +55,7 @@ def project_document( title = config.get("title") description = config.get("description") md.add(md.title(1, f"spaCy Project{f': {title}' if title else ''}", "🪐")) + md.add(INTRO) if description: md.add(description) md.add(md.title(2, PROJECT_FILE, "📋")) From 70b226f69db29e457586c377926bd45ac7bc42e2 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Tue, 1 Sep 2020 12:49:04 +0200 Subject: [PATCH 45/84] Support ignore marker in project document [ci skip] --- spacy/cli/project/document.py | 11 ++++++++--- website/docs/usage/projects.md | 9 +++++++++ 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/spacy/cli/project/document.py b/spacy/cli/project/document.py index 30cb2af25..ab345ecd8 100644 --- a/spacy/cli/project/document.py +++ b/spacy/cli/project/document.py @@ -24,8 +24,10 @@ be fetched by running [`spacy project assets`]({DOCS_URL}/api/cli#project-assets in the project directory.""" # These markers are added to the Markdown and can be used to update the file in # place if it already exists. Only the auto-generated part will be replaced. 
-MARKER_START = "" -MARKER_END = "" +MARKER_START = "" +MARKER_END = "" +# If this marker is used in an existing README, it's ignored and not replaced +MARKER_IGNORE = "" @project_cli.command("document") @@ -100,13 +102,16 @@ def project_document( if output_file.exists(): with output_file.open("r", encoding="utf8") as f: existing = f.read() + if MARKER_IGNORE in existing: + msg.warn("Found ignore marker in existing file: skipping", output_file) + return if MARKER_START in existing and MARKER_END in existing: msg.info("Found existing file: only replacing auto-generated docs") before = existing.split(MARKER_START)[0] after = existing.split(MARKER_END)[1] content = f"{before}{content}{after}" else: - msg.info("Replacing existing file") + msg.warn("Replacing existing file") with output_file.open("w") as f: f.write(content) msg.good("Saved project documentation", output_file) diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md index ef895195c..97a0caed8 100644 --- a/website/docs/usage/projects.md +++ b/website/docs/usage/projects.md @@ -526,6 +526,15 @@ before or after it and re-running the `project document` command will **only update the auto-generated part**. This makes it easy to keep your documentation up to date. + + +Note that the contents of an existing file will be **replaced** if no existing +auto-generated docs are found. If you want spaCy to ignore a file and not update +it, you can add the comment marker `` anywhere in +your markup. + + + ### Cloning from your own repo {#custom-repo} The [`spacy project clone`](/api/cli#project-clone) command lets you customize From 690bd77669a34b77cc9ad5b06b3f6f7d62e6a991 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Tue, 1 Sep 2020 14:04:36 +0200 Subject: [PATCH 46/84] Add todos [ci skip] --- website/docs/usage/embeddings-transformers.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index aaa5fde10..7792ce124 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -578,7 +578,12 @@ def MyCustomVectors( ## Pretraining {#pretraining} - +- explain general concept and idea (short!) +- present it as a separate lightweight mechanism for pretraining the tok2vec + layer +- advantages (could also be pros/cons table) +- explain how it generates a separate file (!) and how it depends on the same + vectors > #### Raw text format > From 046c38bd265fdb504006f6ffbeccc5a9b765afee Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Tue, 1 Sep 2020 16:12:15 +0200 Subject: [PATCH 47/84] Remove 'cleanup' of strings (#6007) A long time ago we went to some trouble to try to clean up "unused" strings, to avoid the `StringStore` growing in long-running processes. This never really worked reliably, and I think it was a really wrong approach. It's much better to let the user reload the `nlp` object as necessary, now that the string encoding is stable (in v1, the string IDs were sequential integers, making reloading the NLP object really annoying.) The extra book-keeping does make some performance difference, and the feature is unsed, so it's past time we killed it. 
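For anyone who was relying on the old cleanup behaviour, the replacement is simply to reload the `nlp` object every so often in a long-running process. A rough sketch of that pattern (the model name, the `process_stream` helper and the reload interval here are placeholders for illustration, not part of this patch):

```python
# Rough sketch only: periodically reload the pipeline to bound StringStore growth.
# "en_core_web_sm" and reload_every are placeholder choices, not part of this patch.
import spacy

def process_stream(texts, reload_every=100_000):
    nlp = spacy.load("en_core_web_sm")
    for i, text in enumerate(texts):
        if i and i % reload_every == 0:
            # A freshly loaded pipeline comes with a fresh vocab and StringStore,
            # so strings accumulated from earlier documents are released.
            nlp = spacy.load("en_core_web_sm")
        yield nlp(text)
```

Since string IDs are stable hashes now, documents produced before and after the reload remain compatible, which is what makes this pattern painless compared to v1.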
--- spacy/language.py | 29 ----------------------------- spacy/strings.pxd | 1 - spacy/strings.pyx | 36 ------------------------------------ 3 files changed, 66 deletions(-) diff --git a/spacy/language.py b/spacy/language.py index e20bbdd80..8e7c39b90 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -1314,7 +1314,6 @@ class Language: as_tuples: bool = False, batch_size: int = 1000, disable: Iterable[str] = SimpleFrozenList(), - cleanup: bool = False, component_cfg: Optional[Dict[str, Dict[str, Any]]] = None, n_process: int = 1, ): @@ -1326,8 +1325,6 @@ class Language: (doc, context) tuples. Defaults to False. batch_size (int): The number of texts to buffer. disable (List[str]): Names of the pipeline components to disable. - cleanup (bool): If True, unneeded strings are freed to control memory - use. Experimental. component_cfg (Dict[str, Dict]): An optional dictionary with extra keyword arguments for specific components. n_process (int): Number of processors to process texts. If -1, set `multiprocessing.cpu_count()`. @@ -1378,35 +1375,9 @@ class Language: for pipe in pipes: docs = pipe(docs) - # Track weakrefs of "recent" documents, so that we can see when they - # expire from memory. When they do, we know we don't need old strings. - # This way, we avoid maintaining an unbounded growth in string entries - # in the string store. - recent_refs = weakref.WeakSet() - old_refs = weakref.WeakSet() - # Keep track of the original string data, so that if we flush old strings, - # we can recover the original ones. However, we only want to do this if we're - # really adding strings, to save up-front costs. - original_strings_data = None nr_seen = 0 for doc in docs: yield doc - if cleanup: - recent_refs.add(doc) - if nr_seen < 10000: - old_refs.add(doc) - nr_seen += 1 - elif len(old_refs) == 0: - old_refs, recent_refs = recent_refs, old_refs - if original_strings_data is None: - original_strings_data = list(self.vocab.strings) - else: - keys, strings = self.vocab.strings._cleanup_stale_strings( - original_strings_data - ) - self.vocab._reset_cache(keys, strings) - self.tokenizer._reset_cache(keys) - nr_seen = 0 def _multiprocessing_pipe( self, diff --git a/spacy/strings.pxd b/spacy/strings.pxd index ba2476ec7..07768d347 100644 --- a/spacy/strings.pxd +++ b/spacy/strings.pxd @@ -23,7 +23,6 @@ cdef class StringStore: cdef Pool mem cdef vector[hash_t] keys - cdef set[hash_t] hits cdef public PreshMap _map cdef const Utf8Str* intern_unicode(self, unicode py_string) diff --git a/spacy/strings.pyx b/spacy/strings.pyx index 136eda9ff..6a1d68221 100644 --- a/spacy/strings.pyx +++ b/spacy/strings.pyx @@ -127,7 +127,6 @@ cdef class StringStore: return SYMBOLS_BY_INT[string_or_id] else: key = string_or_id - self.hits.insert(key) utf8str = self._map.get(key) if utf8str is NULL: raise KeyError(Errors.E018.format(hash_value=string_or_id)) @@ -198,7 +197,6 @@ cdef class StringStore: if key < len(SYMBOLS_BY_INT): return True else: - self.hits.insert(key) return self._map.get(key) is not NULL def __iter__(self): @@ -210,7 +208,6 @@ cdef class StringStore: cdef hash_t key for i in range(self.keys.size()): key = self.keys[i] - self.hits.insert(key) utf8str = self._map.get(key) yield decode_Utf8Str(utf8str) # TODO: Iterate OOV here? @@ -269,41 +266,9 @@ cdef class StringStore: self.mem = Pool() self._map = PreshMap() self.keys.clear() - self.hits.clear() for string in strings: self.add(string) - def _cleanup_stale_strings(self, excepted): - """ - excepted (list): Strings that should not be removed. 
- RETURNS (keys, strings): Dropped strings and keys that can be dropped from other places - """ - if self.hits.size() == 0: - # If we don't have any hits, just skip cleanup - return - - cdef vector[hash_t] tmp - dropped_strings = [] - dropped_keys = [] - for i in range(self.keys.size()): - key = self.keys[i] - # Here we cannot use __getitem__ because it also set hit. - utf8str = self._map.get(key) - value = decode_Utf8Str(utf8str) - if self.hits.count(key) != 0 or value in excepted: - tmp.push_back(key) - else: - dropped_keys.append(key) - dropped_strings.append(value) - - self.keys.swap(tmp) - strings = list(self) - self._reset_and_load(strings) - # Here we have strings but hits to it should be reseted - self.hits.clear() - - return dropped_keys, dropped_strings - cdef const Utf8Str* intern_unicode(self, unicode py_string): # 0 means missing, but we don't bother offsetting the index. cdef bytes byte_string = py_string.encode("utf8") @@ -319,6 +284,5 @@ cdef class StringStore: return value value = _allocate(self.mem, utf8_string, length) self._map.set(key, value) - self.hits.insert(key) self.keys.push_back(key) return value From 4cce32f090d02ef84ad40fef6ef9bba7b9d65c9c Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Tue, 1 Sep 2020 16:38:34 +0200 Subject: [PATCH 48/84] Fix tagger initialization --- spacy/pipeline/tagger.pyx | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/spacy/pipeline/tagger.pyx b/spacy/pipeline/tagger.pyx index c94cb6b58..f831caefe 100644 --- a/spacy/pipeline/tagger.pyx +++ b/spacy/pipeline/tagger.pyx @@ -289,7 +289,14 @@ class Tagger(Pipe): err = Errors.E1006.format(name="Tagger") raise ValueError(err) self.set_output(len(self.labels)) - self.model.initialize(X=doc_sample) + if doc_sample: + label_sample = [ + self.model.ops.alloc2f(len(doc), len(self.labels)) + for doc in doc_sample + ] + self.model.initialize(X=doc_sample, Y=label_sample) + else: + self.model.initialize() if sgd is None: sgd = self.create_optimizer() return sgd From 6bfb1b3a29fa556daad7e81ab4980fb3a54c616e Mon Sep 17 00:00:00 2001 From: Sofie Van Landeghem Date: Tue, 1 Sep 2020 19:49:01 +0200 Subject: [PATCH 49/84] Fix sparse checkout for 'spacy project' (#6008) * exit if cloning fails * UX * rewrite http link to git protocol, don't use stdin * fixes to sparse checkout * formatting --- spacy/cli/_util.py | 29 ++++++++++++++++++----------- spacy/cli/project/clone.py | 5 +++-- 2 files changed, 21 insertions(+), 13 deletions(-) diff --git a/spacy/cli/_util.py b/spacy/cli/_util.py index 16e257ce2..cfa126cc4 100644 --- a/spacy/cli/_util.py +++ b/spacy/cli/_util.py @@ -297,9 +297,7 @@ def ensure_pathy(path): return Pathy(path) -def git_sparse_checkout( - repo: str, subpath: str, dest: Path, *, branch: Optional[str] = None -): +def git_sparse_checkout(repo: str, subpath: str, dest: Path, *, branch: str = "master"): if dest.exists(): msg.fail("Destination of checkout must not exist", exits=1) if not dest.parent.exists(): @@ -323,21 +321,30 @@ def git_sparse_checkout( # This is the "clone, but don't download anything" part. cmd = ( f"git clone {repo} {tmp_dir} --no-checkout --depth 1 " - "--filter=blob:none" # <-- The key bit + f"--filter=blob:none " # <-- The key bit + f"-b {branch}" ) - if branch is not None: - cmd = f"{cmd} -b {branch}" run_command(cmd, capture=True) # Now we need to find the missing filenames for the subpath we want. # Looking for this 'rev-list' command in the git --help? Hah. 
cmd = f"git -C {tmp_dir} rev-list --objects --all --missing=print -- {subpath}" ret = run_command(cmd, capture=True) - missings = "\n".join([x[1:] for x in ret.stdout.split() if x.startswith("?")]) + repo = _from_http_to_git(repo) # Now pass those missings into another bit of git internals - run_command( - f"git -C {tmp_dir} fetch-pack --stdin {repo}", capture=True, stdin=missings - ) + missings = " ".join([x[1:] for x in ret.stdout.split() if x.startswith("?")]) + cmd = f"git -C {tmp_dir} fetch-pack {repo} {missings}" + run_command(cmd, capture=True) # And finally, we can checkout our subpath - run_command(f"git -C {tmp_dir} checkout {branch} {subpath}") + cmd = f"git -C {tmp_dir} checkout {branch} {subpath}" + run_command(cmd) # We need Path(name) to make sure we also support subdirectories shutil.move(str(tmp_dir / Path(subpath)), str(dest)) + + +def _from_http_to_git(repo): + if repo.startswith("http://"): + repo = repo.replace(r"http://", r"https://") + if repo.startswith(r"https://"): + repo = repo.replace("https://", "git@").replace("/", ":", 1) + repo = f"{repo}.git" + return repo diff --git a/spacy/cli/project/clone.py b/spacy/cli/project/clone.py index 7f9a46a46..751c389bc 100644 --- a/spacy/cli/project/clone.py +++ b/spacy/cli/project/clone.py @@ -43,7 +43,7 @@ def project_clone(name: str, dest: Path, *, repo: str = about.__projects__) -> N git_sparse_checkout(repo, name, dest) except subprocess.CalledProcessError: err = f"Could not clone '{name}' from repo '{repo_name}'" - msg.fail(err) + msg.fail(err, exits=1) msg.good(f"Cloned '{name}' from {repo_name}", project_dir) if not (project_dir / PROJECT_FILE).exists(): msg.warn(f"No {PROJECT_FILE} found in directory") @@ -78,6 +78,7 @@ def check_clone(name: str, dest: Path, repo: str) -> None: if not dest.parent.exists(): # We're not creating parents, parent dir should exist msg.fail( - f"Can't clone project, parent directory doesn't exist: {dest.parent}", + f"Can't clone project, parent directory doesn't exist: {dest.parent}. " + f"Create the necessary folder(s) first before continuing.", exits=1, ) From 3d9ae9286ff15049eccd0f840d78daf47a25f045 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 2 Sep 2020 10:46:38 +0200 Subject: [PATCH 50/84] small fixes --- website/docs/usage/layers-architectures.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md index aa398f752..aca9a76e5 100644 --- a/website/docs/usage/layers-architectures.md +++ b/website/docs/usage/layers-architectures.md @@ -62,7 +62,7 @@ are: ​ | ~~Ints2d~~ | A two-dimensional `numpy` or `cupy` array of integers. Common dtypes include uint64, int32 and int8. | | ~~List[Floats2d]~~ | A list of two-dimensional arrays, generally with one array per `Doc` and one row per token. | | ~~Ragged~~ | A container to handle variable-length sequence data in an unpadded contiguous array. | -| ~~Padded~~ | A container to handle variable-length sequence data in a passed contiguous array. | +| ~~Padded~~ | A container to handle variable-length sequence data in a padded contiguous array. | The model type signatures help you figure out which model architectures and components can **fit together**. For instance, the @@ -94,7 +94,7 @@ code. 
## Defining sublayers {#sublayers} -​ Model architecture functions often accept **sublayers as arguments**, so that +Model architecture functions often accept **sublayers as arguments**, so that you can try **substituting a different layer** into the network. Depending on how the architecture function is structured, you might be able to define your network structure entirely through the [config system](/usage/training#config), @@ -112,7 +112,7 @@ you can control this important part of the network separately. This makes it easy to **switch between** transformer, CNN, BiLSTM or other feature extraction approaches. And if you want to define your own solution, all you need to do is register a ~~Model[List[Doc], List[Floats2d]]~~ architecture function, and -you'll be able to try it out in any of spaCy components. ​ +you'll be able to try it out in any of the spaCy components. ​ From 6fd7f140ecbf8604f39aa32dda74c27a157ff9d7 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 2 Sep 2020 11:14:06 +0200 Subject: [PATCH 51/84] custom-architectures section --- website/docs/usage/training.md | 34 ++++++++++++++++++++++++++++++++-- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 2d7905230..6d56f5767 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -669,7 +669,7 @@ def custom_logger(log_path): #### Example: Custom batch size schedule {#custom-code-schedule} -For example, let's say you've implemented your own batch size schedule to use +You can also implement your own batch size schedule to use during training. The `@spacy.registry.schedules` decorator lets you register that function in the `schedules` [registry](/api/top-level#registry) and assign it a string name: @@ -806,7 +806,37 @@ def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Examp ### Defining custom architectures {#custom-architectures} - +Built-in pipeline components such as the tagger or named entity recognizer are +constructed with default neural network [models](/api/architectures). +You can change the model architecture +entirely by implementing your own custom models and providing those in the config +when creating the pipeline component. See the +documentation on +[layers and model architectures](/usage/layers-architectures) for more details. + + +```python +### functions.py +from typing import List +from thinc.types import Floats2d +from thinc.api import Model +import spacy +from spacy.tokens import Doc + +@spacy.registry.architectures("custom_neural_network.v1") +def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]: + # ... 
+``` + +```ini +### config.cfg (excerpt) +[components.tagger] +factory = "tagger" + +[components.tagger.model] +@architectures = "custom_neural_network.v1" +output_width = 512 +``` ## Internal training API {#api} From 474abb2e5969f7708ba93bae901fda3ad542dc78 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 2 Sep 2020 11:37:56 +0200 Subject: [PATCH 52/84] remove unused MORPH_RULES from test --- spacy/tests/pipeline/test_tagger.py | 2 -- 1 file changed, 2 deletions(-) diff --git a/spacy/tests/pipeline/test_tagger.py b/spacy/tests/pipeline/test_tagger.py index b1b52b9fa..a1aa7e1e1 100644 --- a/spacy/tests/pipeline/test_tagger.py +++ b/spacy/tests/pipeline/test_tagger.py @@ -28,8 +28,6 @@ def test_tagger_begin_training_tag_map(): TAGS = ("N", "V", "J") -MORPH_RULES = {"V": {"like": {"lemma": "luck"}}} - TRAIN_DATA = [ ("I like green eggs", {"tags": ["N", "V", "J", "N"]}), ("Eat blue ham", {"tags": ["V", "J", "N"]}), From c1bf3a5602db211caf3d25abcf3b09ca42aec0f7 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Wed, 2 Sep 2020 12:57:13 +0200 Subject: [PATCH 53/84] Fix significant performance bug in parser training (#6010) The parser training makes use of a trick for long documents, where we use the oracle to cut up the document into sections, so that we can have batch items in the middle of a document. For instance, if we have one document of 600 words, we might make 6 states, starting at words 0, 100, 200, 300, 400 and 500. The problem is for v3, I screwed this up and didn't stop parsing! So instead of a batch of [100, 100, 100, 100, 100, 100], we'd have a batch of [600, 500, 400, 300, 200, 100]. Oops. The implementation here could probably be improved, it's annoying to have this extra variable in the state. But this'll do. This makes the v3 parser training 5-10 times faster, depending on document lengths. This problem wasn't in v2. 
--- spacy/pipeline/_parser_internals/_state.pxd | 5 +++++ spacy/pipeline/_parser_internals/stateclass.pyx | 4 ++++ spacy/pipeline/transition_parser.pyx | 11 ++++++----- 3 files changed, 15 insertions(+), 5 deletions(-) diff --git a/spacy/pipeline/_parser_internals/_state.pxd b/spacy/pipeline/_parser_internals/_state.pxd index 0d0dd8c05..d31430124 100644 --- a/spacy/pipeline/_parser_internals/_state.pxd +++ b/spacy/pipeline/_parser_internals/_state.pxd @@ -42,6 +42,7 @@ cdef cppclass StateC: RingBufferC _hist int length int offset + int n_pushes int _s_i int _b_i int _e_i @@ -49,6 +50,7 @@ cdef cppclass StateC: __init__(const TokenC* sent, int length) nogil: cdef int PADDING = 5 + this.n_pushes = 0 this._buffer = calloc(length + (PADDING * 2), sizeof(int)) this._stack = calloc(length + (PADDING * 2), sizeof(int)) this.shifted = calloc(length + (PADDING * 2), sizeof(bint)) @@ -335,6 +337,7 @@ cdef cppclass StateC: this.set_break(this.B_(0).l_edge) if this._b_i > this._break: this._break = -1 + this.n_pushes += 1 void pop() nogil: if this._s_i >= 1: @@ -351,6 +354,7 @@ cdef cppclass StateC: this._buffer[this._b_i] = this.S(0) this._s_i -= 1 this.shifted[this.B(0)] = True + this.n_pushes -= 1 void add_arc(int head, int child, attr_t label) nogil: if this.has_head(child): @@ -431,6 +435,7 @@ cdef cppclass StateC: this._break = src._break this.offset = src.offset this._empty_token = src._empty_token + this.n_pushes = src.n_pushes void fast_forward() nogil: # space token attachement policy: diff --git a/spacy/pipeline/_parser_internals/stateclass.pyx b/spacy/pipeline/_parser_internals/stateclass.pyx index 880cf6cc5..d59ade467 100644 --- a/spacy/pipeline/_parser_internals/stateclass.pyx +++ b/spacy/pipeline/_parser_internals/stateclass.pyx @@ -36,6 +36,10 @@ cdef class StateClass: hist[i] = self.c.get_hist(i+1) return hist + @property + def n_pushes(self): + return self.c.n_pushes + def is_final(self): return self.c.is_final() diff --git a/spacy/pipeline/transition_parser.pyx b/spacy/pipeline/transition_parser.pyx index 2eadfa6aa..2169b4c17 100644 --- a/spacy/pipeline/transition_parser.pyx +++ b/spacy/pipeline/transition_parser.pyx @@ -279,14 +279,14 @@ cdef class Parser(Pipe): # Chop sequences into lengths of this many transitions, to make the # batch uniform length. # We used to randomize this, but it's not clear that actually helps? 
- cut_size = self.cfg["update_with_oracle_cut_size"] - states, golds, max_steps = self._init_gold_batch( + max_pushes = self.cfg["update_with_oracle_cut_size"] + states, golds, _ = self._init_gold_batch( examples, - max_length=cut_size + max_length=max_pushes ) else: states, golds, _ = self.moves.init_gold_batch(examples) - max_steps = max([len(eg.x) for eg in examples]) + max_pushes = max([len(eg.x) for eg in examples]) if not states: return losses all_states = list(states) @@ -302,7 +302,8 @@ cdef class Parser(Pipe): backprop(d_scores) # Follow the predicted action self.transition_states(states, scores) - states_golds = [(s, g) for (s, g) in zip(states, golds) if not s.is_final()] + states_golds = [(s, g) for (s, g) in zip(states, golds) + if s.n_pushes < max_pushes and not s.is_final()] backprop_tok2vec(golds) if sgd not in (None, False): From 70238543c868187432d384477b250281f485f0a6 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Wed, 2 Sep 2020 13:04:35 +0200 Subject: [PATCH 54/84] Update layers/arch docs structure [ci skip] --- website/docs/usage/layers-architectures.md | 78 ++++++++-------------- 1 file changed, 28 insertions(+), 50 deletions(-) diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md index aa398f752..8bb73b404 100644 --- a/website/docs/usage/layers-architectures.md +++ b/website/docs/usage/layers-architectures.md @@ -3,8 +3,9 @@ title: Layers and Model Architectures teaser: Power spaCy components with custom neural networks menu: - ['Type Signatures', 'type-sigs'] - - ['Defining Sublayers', 'sublayers'] + - ['Swapping Architectures', 'swap-architectures'] - ['PyTorch & TensorFlow', 'frameworks'] + - ['Thinc Models', 'thinc'] - ['Trainable Components', 'components'] next: /usage/projects --- @@ -22,8 +23,6 @@ its model architecture. The architecture is like a recipe for the network, and you can't change the recipe once the dish has already been prepared. You have to make a new one. -![Diagram of a pipeline component with its model](../images/layers-architectures.svg) - ## Type signatures {#type-sigs} @@ -92,9 +91,13 @@ code. -## Defining sublayers {#sublayers} +## Swapping model architectures {#swap-architectures} -​ Model architecture functions often accept **sublayers as arguments**, so that + + +### Defining sublayers {#sublayers} + +​Model architecture functions often accept **sublayers as arguments**, so that you can try **substituting a different layer** into the network. Depending on how the architecture function is structured, you might be able to define your network structure entirely through the [config system](/usage/training#config), @@ -114,62 +117,37 @@ approaches. And if you want to define your own solution, all you need to do is register a ~~Model[List[Doc], List[Floats2d]]~~ architecture function, and you'll be able to try it out in any of spaCy components. ​ - - -### Registering new architectures - -- Recap concept, link to config docs. ​ + ## Wrapping PyTorch, TensorFlow and other frameworks {#frameworks} - +Thinc allows you to [wrap models](https://thinc.ai/docs/usage-frameworks) +written in other machine learning frameworks like PyTorch, TensorFlow and MXNet +using a unified [`Model`](https://thinc.ai/docs/api-model) API. 
As well as +**wrapping whole models**, Thinc lets you call into an external framework for +just **part of your model**: you can have a model where you use PyTorch just for +the transformer layers, using "native" Thinc layers to do fiddly input and +output transformations and add on task-specific "heads", as efficiency is less +of a consideration for those parts of the network. -Thinc allows you to wrap models written in other machine learning frameworks -like PyTorch, TensorFlow and MXNet using a unified -[`Model`](https://thinc.ai/docs/api-model) API. As well as **wrapping whole -models**, Thinc lets you call into an external framework for just **part of your -model**: you can have a model where you use PyTorch just for the transformer -layers, using "native" Thinc layers to do fiddly input and output -transformations and add on task-specific "heads", as efficiency is less of a -consideration for those parts of the network. + -Thinc uses a special class, [`Shim`](https://thinc.ai/docs/api-model#shim), to -hold references to external objects. This allows each wrapper space to define a -custom type, with whatever attributes and methods are helpful, to assist in -managing the communication between Thinc and the external library. The -[`Model`](https://thinc.ai/docs/api-model#model) class holds `shim` instances in -a separate list, and communicates with the shims about updates, serialization, -changes of device, etc. +## Implementing models in Thinc {#thinc} -The wrapper will receive each batch of inputs, convert them into a suitable form -for the underlying model instance, and pass them over to the shim, which will -**manage the actual communication** with the model. The output is then passed -back into the wrapper, and converted for use in the rest of the network. The -equivalent procedure happens during backpropagation. Array conversion is handled -via the [DLPack](https://github.com/dmlc/dlpack) standard wherever possible, so -that data can be passed between the frameworks **without copying the data back** -to the host device unnecessarily. - -| Framework | Wrapper layer | Shim | DLPack | -| -------------- | ------------------------------------------------------------------------- | --------------------------------------------------------- | --------------- | -| **PyTorch** | [`PyTorchWrapper`](https://thinc.ai/docs/api-layers#pytorchwrapper) | [`PyTorchShim`](https://thinc.ai/docs/api-model#shims) | ✅ | -| **TensorFlow** | [`TensorFlowWrapper`](https://thinc.ai/docs/api-layers#tensorflowwrapper) | [`TensorFlowShim`](https://thinc.ai/docs/api-model#shims) | ❌ 1 | -| **MXNet** | [`MXNetWrapper`](https://thinc.ai/docs/api-layers#mxnetwrapper) | [`MXNetShim`](https://thinc.ai/docs/api-model#shims) | ✅ | - -1. DLPack support in TensorFlow is now - [available](<(https://github.com/tensorflow/tensorflow/issues/24453)>) but - still experimental. 
- - + ## Models for trainable components {#components} + + +![Diagram of a pipeline component with its model](../images/layers-architectures.svg) ```python def update(self, examples): From b97d98783a211f485ac7bbd69dee292f31cd0849 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Wed, 2 Sep 2020 13:06:16 +0200 Subject: [PATCH 55/84] Fix Hungarian % tokenization (#6013) --- spacy/lang/hu/punctuation.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/spacy/lang/hu/punctuation.py b/spacy/lang/hu/punctuation.py index 597f01b65..f827cd677 100644 --- a/spacy/lang/hu/punctuation.py +++ b/spacy/lang/hu/punctuation.py @@ -7,6 +7,7 @@ _concat_icons = CONCAT_ICONS.replace("\u00B0", "") _currency = r"\$¢£€¥฿" _quotes = CONCAT_QUOTES.replace("'", "") +_units = UNITS.replace("%", "") _prefixes = ( LIST_PUNCT @@ -26,7 +27,7 @@ _suffixes = ( r"(?<=[0-9])\+", r"(?<=°[FfCcKk])\.", r"(?<=[0-9])(?:[{c}])".format(c=_currency), - r"(?<=[0-9])(?:{u})".format(u=UNITS), + r"(?<=[0-9])(?:{u})".format(u=_units), r"(?<=[{al}{e}{q}(?:{c})])\.".format( al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, c=_currency ), From eb5637779987b9b7062acb48d72ec917b8ad0229 Mon Sep 17 00:00:00 2001 From: Sofie Van Landeghem Date: Wed, 2 Sep 2020 13:07:41 +0200 Subject: [PATCH 56/84] Fix overfitting test (#6011) * remove unused MORPH_RULES * fix textcat architecture in overfitting test --- spacy/tests/pipeline/test_tagger.py | 2 -- spacy/tests/pipeline/test_textcat.py | 12 +++++------- 2 files changed, 5 insertions(+), 9 deletions(-) diff --git a/spacy/tests/pipeline/test_tagger.py b/spacy/tests/pipeline/test_tagger.py index b1b52b9fa..a1aa7e1e1 100644 --- a/spacy/tests/pipeline/test_tagger.py +++ b/spacy/tests/pipeline/test_tagger.py @@ -28,8 +28,6 @@ def test_tagger_begin_training_tag_map(): TAGS = ("N", "V", "J") -MORPH_RULES = {"V": {"like": {"lemma": "luck"}}} - TRAIN_DATA = [ ("I like green eggs", {"tags": ["N", "V", "J", "N"]}), ("Eat blue ham", {"tags": ["V", "J", "N"]}), diff --git a/spacy/tests/pipeline/test_textcat.py b/spacy/tests/pipeline/test_textcat.py index 66c27b233..12ead90cb 100644 --- a/spacy/tests/pipeline/test_textcat.py +++ b/spacy/tests/pipeline/test_textcat.py @@ -84,9 +84,8 @@ def test_overfitting_IO(): # Simple test to try and quickly overfit the textcat component - ensuring the ML models work correctly fix_random_seed(0) nlp = English() - textcat = nlp.add_pipe("textcat") # Set exclusive labels - textcat.model.attrs["multi_label"] = False + textcat = nlp.add_pipe("textcat", config={"model": {"exclusive_classes": True}}) train_examples = [] for text, annotations in TRAIN_DATA: train_examples.append(Example.from_dict(nlp.make_doc(text), annotations)) @@ -103,9 +102,8 @@ def test_overfitting_IO(): test_text = "I am happy." 
doc = nlp(test_text) cats = doc.cats - # note that by default, exclusive_classes = false so we need a bigger error margin - assert cats["POSITIVE"] > 0.8 - assert cats["POSITIVE"] + cats["NEGATIVE"] == pytest.approx(1.0, 0.1) + assert cats["POSITIVE"] > 0.9 + assert cats["POSITIVE"] + cats["NEGATIVE"] == pytest.approx(1.0, 0.001) # Also test the results are still the same after IO with make_tempdir() as tmp_dir: @@ -113,8 +111,8 @@ def test_overfitting_IO(): nlp2 = util.load_model_from_path(tmp_dir) doc2 = nlp2(test_text) cats2 = doc2.cats - assert cats2["POSITIVE"] > 0.8 - assert cats2["POSITIVE"] + cats2["NEGATIVE"] == pytest.approx(1.0, 0.1) + assert cats2["POSITIVE"] > 0.9 + assert cats2["POSITIVE"] + cats2["NEGATIVE"] == pytest.approx(1.0, 0.001) # Test scoring scores = nlp.evaluate(train_examples, scorer_cfg={"positive_label": "POSITIVE"}) From e29a33449dc66192868a8290c564f622ee2d79b5 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 2 Sep 2020 13:41:18 +0200 Subject: [PATCH 57/84] rewrite intro, simpel Model example --- website/docs/usage/layers-architectures.md | 38 +++++++++++++++------- 1 file changed, 26 insertions(+), 12 deletions(-) diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md index 3ef28acaf..ac91ca0ad 100644 --- a/website/docs/usage/layers-architectures.md +++ b/website/docs/usage/layers-architectures.md @@ -10,18 +10,32 @@ menu: next: /usage/projects --- -​A **model architecture** is a function that wires up a -[Thinc `Model`](https://thinc.ai/docs/api-model) instance, which you can then -use in a component or as a layer of a larger network. You can use Thinc as a -thin wrapper around frameworks such as PyTorch, TensorFlow or MXNet, or you can -implement your logic in Thinc directly. ​ spaCy's built-in components will never -construct their `Model` instances themselves, so you won't have to subclass the -component to change its model architecture. You can just **update the config** -so that it refers to a different registered function. Once the component has -been created, its model instance has already been assigned, so you cannot change -its model architecture. The architecture is like a recipe for the network, and -you can't change the recipe once the dish has already been prepared. You have to -make a new one. +> #### Example +> +> ````python +> from thinc.api import Model, chain +> +> def build_model(width: int, classes: int) -> Model: +> tok2vec = build_tok2vec(width) +> output_layer = build_output_layer(width, classes) +> model = chain(tok2vec, output_layer) +> return model +> ```` + +A **model architecture** is a function that wires up a +[Thinc `Model`](https://thinc.ai/docs/api-model) instance. It describes the +neural network that is run internally as part of a component in a spaCy +pipeline. To define the actual architecture, you can implement your logic in +Thinc directly, but you can also use Thinc as a thin wrapper around frameworks +such as PyTorch, TensorFlow or MXNet. + +spaCy's built-in components require a `Model` instance to be passed to them via +the config system. To change the model architecture of an existing component, +you just need to **update the config** so that it refers to a different +registered function. Once the component has been created from this config, you +won't be able to change it anymore. The architecture is like a recipe for the +network, and you can't change the recipe once the dish has already been +prepared. You have to make a new one. 
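As a quick sketch of how this plays out in code, assuming the `model.v1`
function from the example above has been registered, a component can be created
from a config block that references the architecture by name. Once the pipe
exists, its `Model` instance has already been constructed:

```python
import spacy

nlp = spacy.blank("en")
# Mirrors the [components.tagger.model] block from the config excerpt above.
tagger = nlp.add_pipe(
    "tagger",
    config={"model": {"@architectures": "model.v1", "width": 512, "classes": 16}},
)
print(type(tagger.model))  # a constructed thinc Model, fixed for this pipe
```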
## Type signatures {#type-sigs} From 821b2d4e630438a7fd5b93439a28d67b0823c451 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 2 Sep 2020 14:15:50 +0200 Subject: [PATCH 58/84] update examples --- website/docs/usage/layers-architectures.md | 46 ++++++++++++++-------- 1 file changed, 29 insertions(+), 17 deletions(-) diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md index ac91ca0ad..ea0427903 100644 --- a/website/docs/usage/layers-architectures.md +++ b/website/docs/usage/layers-architectures.md @@ -14,7 +14,8 @@ next: /usage/projects > > ````python > from thinc.api import Model, chain -> +> +> @spacy.registry.architectures.register("model.v1") > def build_model(width: int, classes: int) -> Model: > tok2vec = build_tok2vec(width) > output_layer = build_output_layer(width, classes) @@ -24,10 +25,12 @@ next: /usage/projects A **model architecture** is a function that wires up a [Thinc `Model`](https://thinc.ai/docs/api-model) instance. It describes the -neural network that is run internally as part of a component in a spaCy -pipeline. To define the actual architecture, you can implement your logic in -Thinc directly, but you can also use Thinc as a thin wrapper around frameworks -such as PyTorch, TensorFlow or MXNet. +neural network that is run internally as part of a component in a spaCy pipeline. +To define the actual architecture, you can implement your logic in +Thinc directly, or you can use Thinc as a thin wrapper around frameworks +such as PyTorch, TensorFlow and MXNet. Each Model can also be used as a sublayer +of a larger network, allowing you to freely combine implementations from different +frameworks into one `Thinc` Model. spaCy's built-in components require a `Model` instance to be passed to them via the config system. To change the model architecture of an existing component, @@ -37,6 +40,17 @@ won't be able to change it anymore. The architecture is like a recipe for the network, and you can't change the recipe once the dish has already been prepared. You have to make a new one. +```ini +### config.cfg (excerpt) +[components.tagger] +factory = "tagger" + +[components.tagger.model] +@architectures = "model.v1" +width = 512 +classes = 16 +``` + ## Type signatures {#type-sigs} @@ -44,17 +58,15 @@ prepared. You have to make a new one. > #### Example > > ```python -> @spacy.registry.architectures.register("spacy.Tagger.v1") -> def build_tagger_model( -> tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None -> ) -> Model[List[Doc], List[Floats2d]]: -> t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None -> output_layer = Softmax(nO, t2v_width, init_W=zero_init) -> softmax = with_array(output_layer) -> model = chain(tok2vec, softmax) -> model.set_ref("tok2vec", tok2vec) -> model.set_ref("softmax", output_layer) -> model.set_ref("output_layer", output_layer) +> from typing import List +> from thinc.api import Model, chain +> from thinc.types import Floats2d +> def chain_model( +> tok2vec: Model[List[Doc], List[Floats2d]], +> layer1: Model[List[Floats2d], Floats2d], +> layer2: Model[Floats2d, Floats2d] +> ) -> Model[List[Doc], Floats2d]: +> model = chain(tok2vec, layer1, layer2) > return model > ``` @@ -65,7 +77,7 @@ list, and the outputs will be a dictionary. Both `typing.List` and `typing.Dict` are also generics, allowing you to be more specific about the data. 
For instance, you can write ~~Model[List[Doc], Dict[str, float]]~~ to specify that the model expects a list of [`Doc`](/api/doc) objects as input, and returns a -dictionary mapping strings to floats. Some of the most common types you'll see +dictionary mapping of strings to floats. Some of the most common types you'll see are: ​ | Type | Description | From d19ec6c67b18db504172bc8bd1e2a89694e6b5c7 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 2 Sep 2020 14:25:18 +0200 Subject: [PATCH 59/84] small rewrites in types paragraph --- website/docs/usage/layers-architectures.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md index ea0427903..32319ca07 100644 --- a/website/docs/usage/layers-architectures.md +++ b/website/docs/usage/layers-architectures.md @@ -70,12 +70,11 @@ classes = 16 > return model > ``` -​ The Thinc `Model` class is a **generic type** that can specify its input and +The Thinc `Model` class is a **generic type** that can specify its input and output types. Python uses a square-bracket notation for this, so the type ~~Model[List, Dict]~~ says that each batch of inputs to the model will be a -list, and the outputs will be a dictionary. Both `typing.List` and `typing.Dict` -are also generics, allowing you to be more specific about the data. For -instance, you can write ~~Model[List[Doc], Dict[str, float]]~~ to specify that +list, and the outputs will be a dictionary. You can be even more specific and +write for instance~~Model[List[Doc], Dict[str, float]]~~ to specify that the model expects a list of [`Doc`](/api/doc) objects as input, and returns a dictionary mapping of strings to floats. Some of the most common types you'll see are: ​ @@ -103,8 +102,8 @@ interchangeably. There are many other ways they could be incompatible. However, if the types don't match, they almost surely _won't_ be compatible. This little bit of validation goes a long way, especially if you [configure your editor](https://thinc.ai/docs/usage-type-checking) or other -tools to highlight these errors early. Thinc will also verify that your types -match correctly when your config file is processed at the beginning of training. +tools to highlight these errors early. The config file is also validated +at the beginning of training, to verify that all the types match correctly. From 57e432ba2aaa5258abf1b18e2e19995bd284e3a1 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 2 Sep 2020 14:26:57 +0200 Subject: [PATCH 60/84] editor tip as Accordion instead of Infobox --- website/docs/usage/layers-architectures.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md index 32319ca07..f5cdb1ca1 100644 --- a/website/docs/usage/layers-architectures.md +++ b/website/docs/usage/layers-architectures.md @@ -105,7 +105,7 @@ bit of validation goes a long way, especially if you tools to highlight these errors early. The config file is also validated at the beginning of training, to verify that all the types match correctly. - + If you're using a modern editor like Visual Studio Code, you can [set up `mypy`](https://thinc.ai/docs/usage-type-checking#install) with the @@ -114,7 +114,7 @@ code. 
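As a small illustration of what this catches, the composed model below is
annotated with its input and output types, so a checker with the Thinc plugin
can verify that the chained layers line up (the layer sizes are arbitrary):

```python
from thinc.api import Model, chain, Relu, Softmax
from thinc.types import Floats2d

# Each layer here maps Floats2d to Floats2d, so the annotation holds; swapping
# in a layer with a different signature would be flagged before anything runs.
model: Model[Floats2d, Floats2d] = chain(Relu(nO=128), Relu(nO=128), Softmax())
```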
[![](../images/thinc_mypy.jpg)](https://thinc.ai/docs/usage-type-checking#linting) - + ## Swapping model architectures {#swap-architectures} From 737a1408d9801fc93927296c132b55fe432bb012 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Wed, 2 Sep 2020 14:42:32 +0200 Subject: [PATCH 61/84] Improve implementation of fix #6010 Follow-ups to the parser efficiency fix. * Avoid introducing new counter for number of pushes * Base cut on number of transitions, keeping it more even * Reintroduce the randomization we had in v2. --- spacy/pipeline/_parser_internals/_state.pxd | 5 --- .../pipeline/_parser_internals/stateclass.pyx | 4 -- spacy/pipeline/transition_parser.pyx | 40 ++++++++----------- 3 files changed, 17 insertions(+), 32 deletions(-) diff --git a/spacy/pipeline/_parser_internals/_state.pxd b/spacy/pipeline/_parser_internals/_state.pxd index d31430124..0d0dd8c05 100644 --- a/spacy/pipeline/_parser_internals/_state.pxd +++ b/spacy/pipeline/_parser_internals/_state.pxd @@ -42,7 +42,6 @@ cdef cppclass StateC: RingBufferC _hist int length int offset - int n_pushes int _s_i int _b_i int _e_i @@ -50,7 +49,6 @@ cdef cppclass StateC: __init__(const TokenC* sent, int length) nogil: cdef int PADDING = 5 - this.n_pushes = 0 this._buffer = calloc(length + (PADDING * 2), sizeof(int)) this._stack = calloc(length + (PADDING * 2), sizeof(int)) this.shifted = calloc(length + (PADDING * 2), sizeof(bint)) @@ -337,7 +335,6 @@ cdef cppclass StateC: this.set_break(this.B_(0).l_edge) if this._b_i > this._break: this._break = -1 - this.n_pushes += 1 void pop() nogil: if this._s_i >= 1: @@ -354,7 +351,6 @@ cdef cppclass StateC: this._buffer[this._b_i] = this.S(0) this._s_i -= 1 this.shifted[this.B(0)] = True - this.n_pushes -= 1 void add_arc(int head, int child, attr_t label) nogil: if this.has_head(child): @@ -435,7 +431,6 @@ cdef cppclass StateC: this._break = src._break this.offset = src.offset this._empty_token = src._empty_token - this.n_pushes = src.n_pushes void fast_forward() nogil: # space token attachement policy: diff --git a/spacy/pipeline/_parser_internals/stateclass.pyx b/spacy/pipeline/_parser_internals/stateclass.pyx index d59ade467..880cf6cc5 100644 --- a/spacy/pipeline/_parser_internals/stateclass.pyx +++ b/spacy/pipeline/_parser_internals/stateclass.pyx @@ -36,10 +36,6 @@ cdef class StateClass: hist[i] = self.c.get_hist(i+1) return hist - @property - def n_pushes(self): - return self.c.n_pushes - def is_final(self): return self.c.is_final() diff --git a/spacy/pipeline/transition_parser.pyx b/spacy/pipeline/transition_parser.pyx index 2169b4c17..5a6b491e0 100644 --- a/spacy/pipeline/transition_parser.pyx +++ b/spacy/pipeline/transition_parser.pyx @@ -6,6 +6,7 @@ from itertools import islice from libcpp.vector cimport vector from libc.string cimport memset from libc.stdlib cimport calloc, free +import random import srsly from thinc.api import set_dropout_rate @@ -275,22 +276,22 @@ cdef class Parser(Pipe): # Prepare the stepwise model, and get the callback for finishing the batch model, backprop_tok2vec = self.model.begin_update( [eg.predicted for eg in examples]) - if self.cfg["update_with_oracle_cut_size"] >= 1: - # Chop sequences into lengths of this many transitions, to make the + max_moves = self.cfg["update_with_oracle_cut_size"] + if max_moves >= 1: + # Chop sequences into lengths of this many words, to make the # batch uniform length. - # We used to randomize this, but it's not clear that actually helps? 
- max_pushes = self.cfg["update_with_oracle_cut_size"] + max_moves = int(random.uniform(max_moves // 2, max_moves * 2)) states, golds, _ = self._init_gold_batch( examples, - max_length=max_pushes + max_length=max_moves ) else: states, golds, _ = self.moves.init_gold_batch(examples) - max_pushes = max([len(eg.x) for eg in examples]) if not states: return losses all_states = list(states) states_golds = list(zip(states, golds)) + n_moves = 0 while states_golds: states, golds = zip(*states_golds) scores, backprop = model.begin_update(states) @@ -302,8 +303,10 @@ cdef class Parser(Pipe): backprop(d_scores) # Follow the predicted action self.transition_states(states, scores) - states_golds = [(s, g) for (s, g) in zip(states, golds) - if s.n_pushes < max_pushes and not s.is_final()] + states_golds = [(s, g) for (s, g) in zip(states, golds) if not s.is_final()] + if max_moves >= 1 and n_moves >= max_moves: + break + n_moves += 1 backprop_tok2vec(golds) if sgd not in (None, False): @@ -499,7 +502,7 @@ cdef class Parser(Pipe): raise ValueError(Errors.E149) from None return self - def _init_gold_batch(self, examples, min_length=5, max_length=500): + def _init_gold_batch(self, examples, max_length): """Make a square batch, of length equal to the shortest transition sequence or a cap. A long doc will get multiple states. Let's say we have a doc of length 2*N, @@ -512,8 +515,7 @@ cdef class Parser(Pipe): all_states = self.moves.init_batch([eg.predicted for eg in examples]) states = [] golds = [] - kept = [] - max_length_seen = 0 + to_cut = [] for state, eg in zip(all_states, examples): if self.moves.has_gold(eg) and not state.is_final(): gold = self.moves.init_gold(state, eg) @@ -523,30 +525,22 @@ cdef class Parser(Pipe): else: oracle_actions = self.moves.get_oracle_sequence_from_state( state.copy(), gold) - kept.append((eg, state, gold, oracle_actions)) - min_length = min(min_length, len(oracle_actions)) - max_length_seen = max(max_length, len(oracle_actions)) - if not kept: + to_cut.append((eg, state, gold, oracle_actions)) + if not to_cut: return states, golds, 0 - max_length = max(min_length, min(max_length, max_length_seen)) cdef int clas - max_moves = 0 - for eg, state, gold, oracle_actions in kept: + for eg, state, gold, oracle_actions in to_cut: for i in range(0, len(oracle_actions), max_length): start_state = state.copy() - n_moves = 0 for clas in oracle_actions[i:i+max_length]: action = self.moves.c[clas] action.do(state.c, action.label) state.c.push_hist(action.clas) - n_moves += 1 if state.is_final(): break - max_moves = max(max_moves, n_moves) if self.moves.has_gold(eg, start_state.B(0), state.B(0)): states.append(start_state) golds.append(gold) - max_moves = max(max_moves, n_moves) if state.is_final(): break - return states, golds, max_moves + return states, golds, max_length From 1be7ff02a61879b002e4c1236f91b8b1329278af Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 2 Sep 2020 15:26:07 +0200 Subject: [PATCH 62/84] swapping section --- website/docs/usage/layers-architectures.md | 93 ++++++++++++++++------ 1 file changed, 68 insertions(+), 25 deletions(-) diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md index f5cdb1ca1..8f10f4069 100644 --- a/website/docs/usage/layers-architectures.md +++ b/website/docs/usage/layers-architectures.md @@ -12,33 +12,33 @@ next: /usage/projects > #### Example > -> ````python +> ```python > from thinc.api import Model, chain -> +> > @spacy.registry.architectures.register("model.v1") > def build_model(width: 
int, classes: int) -> Model: > tok2vec = build_tok2vec(width) > output_layer = build_output_layer(width, classes) > model = chain(tok2vec, output_layer) > return model -> ```` +> ``` A **model architecture** is a function that wires up a [Thinc `Model`](https://thinc.ai/docs/api-model) instance. It describes the -neural network that is run internally as part of a component in a spaCy pipeline. -To define the actual architecture, you can implement your logic in -Thinc directly, or you can use Thinc as a thin wrapper around frameworks -such as PyTorch, TensorFlow and MXNet. Each Model can also be used as a sublayer -of a larger network, allowing you to freely combine implementations from different +neural network that is run internally as part of a component in a spaCy +pipeline. To define the actual architecture, you can implement your logic in +Thinc directly, or you can use Thinc as a thin wrapper around frameworks such as +PyTorch, TensorFlow and MXNet. Each Model can also be used as a sublayer of a +larger network, allowing you to freely combine implementations from different frameworks into one `Thinc` Model. spaCy's built-in components require a `Model` instance to be passed to them via the config system. To change the model architecture of an existing component, -you just need to **update the config** so that it refers to a different -registered function. Once the component has been created from this config, you -won't be able to change it anymore. The architecture is like a recipe for the -network, and you can't change the recipe once the dish has already been -prepared. You have to make a new one. +you just need to [**update the config**](#swap-architectures) so that it refers +to a different registered function. Once the component has been created from +this config, you won't be able to change it anymore. The architecture is like a +recipe for the network, and you can't change the recipe once the dish has +already been prepared. You have to make a new one. ```ini ### config.cfg (excerpt) @@ -53,8 +53,6 @@ classes = 16 ## Type signatures {#type-sigs} - - > #### Example > > ```python @@ -62,8 +60,8 @@ classes = 16 > from thinc.api import Model, chain > from thinc.types import Floats2d > def chain_model( -> tok2vec: Model[List[Doc], List[Floats2d]], -> layer1: Model[List[Floats2d], Floats2d], +> tok2vec: Model[List[Doc], List[Floats2d]], +> layer1: Model[List[Floats2d], Floats2d], > layer2: Model[Floats2d, Floats2d] > ) -> Model[List[Doc], Floats2d]: > model = chain(tok2vec, layer1, layer2) @@ -73,11 +71,11 @@ classes = 16 The Thinc `Model` class is a **generic type** that can specify its input and output types. Python uses a square-bracket notation for this, so the type ~~Model[List, Dict]~~ says that each batch of inputs to the model will be a -list, and the outputs will be a dictionary. You can be even more specific and -write for instance~~Model[List[Doc], Dict[str, float]]~~ to specify that -the model expects a list of [`Doc`](/api/doc) objects as input, and returns a -dictionary mapping of strings to floats. Some of the most common types you'll see -are: ​ +list, and the outputs will be a dictionary. You can be even more specific and +write for instance~~Model[List[Doc], Dict[str, float]]~~ to specify that the +model expects a list of [`Doc`](/api/doc) objects as input, and returns a +dictionary mapping of strings to floats. 
Some of the most common types you'll +see are: ​ | Type | Description | | ------------------ | ---------------------------------------------------------------------------------------------------- | @@ -102,8 +100,8 @@ interchangeably. There are many other ways they could be incompatible. However, if the types don't match, they almost surely _won't_ be compatible. This little bit of validation goes a long way, especially if you [configure your editor](https://thinc.ai/docs/usage-type-checking) or other -tools to highlight these errors early. The config file is also validated -at the beginning of training, to verify that all the types match correctly. +tools to highlight these errors early. The config file is also validated at the +beginning of training, to verify that all the types match correctly. @@ -118,7 +116,52 @@ code. ## Swapping model architectures {#swap-architectures} - +If no model is specified for the [`TextCategorizer`](/api/textcategorizer), the +[TextCatEnsemble](/api/architectures#TextCatEnsemble) architecture is used by +default. This architecture combines a simpel bag-of-words model with a neural +network, usually resulting in the most accurate results, but at the cost of +speed. The config file for this model would look something like this: + +```ini +### config.cfg (excerpt) +[components.textcat] +factory = "textcat" +labels = [] + +[components.textcat.model] +@architectures = "spacy.TextCatEnsemble.v1" +exclusive_classes = false +pretrained_vectors = null +width = 64 +conv_depth = 2 +embed_size = 2000 +window_size = 1 +ngram_size = 1 +dropout = 0 +nO = null +``` + +spaCy has two additional built-in `textcat` architectures, and you can easily +use those by swapping out the definition of the textcat's model. For instance, +to use the simpel and fast [bag-of-words model](/api/architectures#TextCatBOW), +you can change the config to: + +```ini +### config.cfg (excerpt) +[components.textcat] +factory = "textcat" +labels = [] + +[components.textcat.model] +@architectures = "spacy.TextCatBOW.v1" +exclusive_classes = false +ngram_size = 1 +no_output_layer = false +nO = null +``` + +The details of all prebuilt architectures and their parameters, can be consulted +on the [API page for model architectures](/api/architectures). ### Defining sublayers {#sublayers} From bbaea530f6ede494e708a10306f517e5b60c6ba2 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 2 Sep 2020 17:36:22 +0200 Subject: [PATCH 63/84] sublayers paragraph --- website/docs/api/architectures.md | 66 ++++++++++++---------- website/docs/usage/layers-architectures.md | 59 ++++++++++++++----- 2 files changed, 81 insertions(+), 44 deletions(-) diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index b55027356..93e50bfb3 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -25,36 +25,6 @@ usage documentation on ## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"} -### spacy.HashEmbedCNN.v1 {#HashEmbedCNN} - -> #### Example Config -> -> ```ini -> [model] -> @architectures = "spacy.HashEmbedCNN.v1" -> pretrained_vectors = null -> width = 96 -> depth = 4 -> embed_size = 2000 -> window_size = 1 -> maxout_pieces = 3 -> subword_features = true -> ``` - -Build spaCy's "standard" embedding layer, which uses hash embedding with subword -features and a CNN with layer-normalized maxout. 
- -| Name | Description | -| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `width` | The width of the input and output. These are required to be the same, so that residual connections can be used. Recommended values are `96`, `128` or `300`. ~~int~~ | -| `depth` | The number of convolutional layers to use. Recommended values are between `2` and `8`. ~~int~~ | -| `embed_size` | The number of rows in the hash embedding tables. This can be surprisingly small, due to the use of the hash embeddings. Recommended values are between `2000` and `10000`. ~~int~~ | -| `window_size` | The number of tokens on either side to concatenate during the convolutions. The receptive field of the CNN will be `depth * (window_size * 2 + 1)`, so a 4-layer network with a window size of `2` will be sensitive to 17 words at a time. Recommended value is `1`. ~~int~~ | -| `maxout_pieces` | The number of pieces to use in the maxout non-linearity. If `1`, the [`Mish`](https://thinc.ai/docs/api-layers#mish) non-linearity is used instead. Recommended values are `1`-`3`. ~~int~~ | -| `subword_features` | Whether to also embed subword features, specifically the prefix, suffix and word shape. This is recommended for alphabetic languages like English, but not if single-character tokens are used for a language such as Chinese. ~~bool~~ | -| `pretrained_vectors` | Whether to also use static vectors. ~~bool~~ | -| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | - ### spacy.Tok2Vec.v1 {#Tok2Vec} > #### Example config @@ -72,7 +42,8 @@ features and a CNN with layer-normalized maxout. > # ... > ``` -Construct a tok2vec model out of embedding and encoding subnetworks. See the +Construct a tok2vec model out of two subnetworks: one for embedding and one for +encoding. See the ["Embed, Encode, Attend, Predict"](https://explosion.ai/blog/deep-learning-formula-nlp) blog post for background. @@ -82,6 +53,39 @@ blog post for background. | `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). ~~Model[List[Floats2d], List[Floats2d]]~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | +### spacy.HashEmbedCNN.v1 {#HashEmbedCNN} + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy.HashEmbedCNN.v1" +> pretrained_vectors = null +> width = 96 +> depth = 4 +> embed_size = 2000 +> window_size = 1 +> maxout_pieces = 3 +> subword_features = true +> ``` + +Build spaCy's "standard" tok2vec layer. This layer is defined by a +[MultiHashEmbed](/api/architectures#MultiHashEmbed) embedding layer that uses +subword features, and a +[MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder) encoding layer +consisting of a CNN and a layer-normalized maxout activation function. + +| Name | Description | +| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `width` | The width of the input and output. 
These are required to be the same, so that residual connections can be used. Recommended values are `96`, `128` or `300`. ~~int~~ | +| `depth` | The number of convolutional layers to use. Recommended values are between `2` and `8`. ~~int~~ | +| `embed_size` | The number of rows in the hash embedding tables. This can be surprisingly small, due to the use of the hash embeddings. Recommended values are between `2000` and `10000`. ~~int~~ | +| `window_size` | The number of tokens on either side to concatenate during the convolutions. The receptive field of the CNN will be `depth * (window_size * 2 + 1)`, so a 4-layer network with a window size of `2` will be sensitive to 17 words at a time. Recommended value is `1`. ~~int~~ | +| `maxout_pieces` | The number of pieces to use in the maxout non-linearity. If `1`, the [`Mish`](https://thinc.ai/docs/api-layers#mish) non-linearity is used instead. Recommended values are `1`-`3`. ~~int~~ | +| `subword_features` | Whether to also embed subword features, specifically the prefix, suffix and word shape. This is recommended for alphabetic languages like English, but not if single-character tokens are used for a language such as Chinese. ~~bool~~ | +| `pretrained_vectors` | Whether to also use static vectors. ~~bool~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | + ### spacy.Tok2VecListener.v1 {#Tok2VecListener} > #### Example config diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md index 8f10f4069..419048f65 100644 --- a/website/docs/usage/layers-architectures.md +++ b/website/docs/usage/layers-architectures.md @@ -165,27 +165,60 @@ on the [API page for model architectures](/api/architectures). ### Defining sublayers {#sublayers} -​Model architecture functions often accept **sublayers as arguments**, so that +Model architecture functions often accept **sublayers as arguments**, so that you can try **substituting a different layer** into the network. Depending on how the architecture function is structured, you might be able to define your network structure entirely through the [config system](/usage/training#config), -using layers that have already been defined. ​The -[transformers documentation](/usage/embeddings-transformers#transformers) -section shows a common example of swapping in a different sublayer. +using layers that have already been defined. ​ In most neural network models for NLP, the most important parts of the network are what we refer to as the -[embed and encode](https://explosion.ai/blog/embed-encode-attend-predict) steps. +[embed and encode](https://explosion.ai/blog/deep-learning-formula-nlp) steps. These steps together compute dense, context-sensitive representations of the -tokens. Most of spaCy's default architectures accept a -[`tok2vec` embedding layer](/api/architectures#tok2vec-arch) as an argument, so -you can control this important part of the network separately. This makes it -easy to **switch between** transformer, CNN, BiLSTM or other feature extraction -approaches. And if you want to define your own solution, all you need to do is -register a ~~Model[List[Doc], List[Floats2d]]~~ architecture function, and -you'll be able to try it out in any of the spaCy components. 
​ +tokens, and their combination forms a typical +[`Tok2Vec`](/api/architectures#Tok2Vec) layer: - +```ini +### config.cfg (excerpt) +[components.tok2vec] +factory = "tok2vec" + +[components.tok2vec.model] +@architectures = "spacy.Tok2Vec.v1" + +[components.tok2vec.model.embed] +@architectures = "spacy.MultiHashEmbed.v1" +# ... + +[components.tok2vec.model.encode] +@architectures = "spacy.MaxoutWindowEncoder.v1" +# ... +``` + +By defining these sublayers specifically, it becomes straightforward to swap out +a sublayer for another one, for instance changing the first sublayer to a +character embedding with the [CharacterEmbed](/api/architectures#CharacterEmbed) +architecture: + +```ini +### config.cfg (excerpt) +[components.tok2vec.model.embed] +@architectures = "spacy.CharacterEmbed.v1" +# ... + +[components.tok2vec.model.encode] +@architectures = "spacy.MaxoutWindowEncoder.v1" +# ... +``` + +Most of spaCy's default architectures accept a `tok2vec` layer as a sublayer +within the larger task-specific neural network. This makes it easy to **switch +between** transformer, CNN, BiLSTM or other feature extraction approaches. The +[transformers documentation](/usage/embeddings-transformers#training-custom-model) +section shows an example of swapping out a model's standard `tok2vec` layer with +a transformer. And if you want to define your own solution, all you need to do +is register a ~~Model[List[Doc], List[Floats2d]]~~ architecture function, and +you'll be able to try it out in any of the spaCy components. ​ ## Wrapping PyTorch, TensorFlow and other frameworks {#frameworks} From 19298de3524d2aee05579c6b451d2306960a6591 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Wed, 2 Sep 2020 17:43:11 +0200 Subject: [PATCH 64/84] small fix --- website/docs/usage/training.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 6d56f5767..2967a0353 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -825,7 +825,7 @@ from spacy.tokens import Doc @spacy.registry.architectures("custom_neural_network.v1") def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]: - # ... 
+ return create_model(output_width) ``` ```ini From 122cb020010a3f4e34e696726206eb92ea9974e8 Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Wed, 2 Sep 2020 19:37:43 +0200 Subject: [PATCH 65/84] Fix averages --- spacy/cli/train.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index 7525b9669..4ce02286a 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -159,7 +159,8 @@ def train( print_row(info) if is_best_checkpoint and output_path is not None: update_meta(T_cfg, nlp, info) - nlp.to_disk(output_path / "model-best") + with nlp.use_params(optimizer.averages): + nlp.to_disk(output_path / "model-best") progress = tqdm.tqdm(total=T_cfg["eval_frequency"], leave=False) progress.set_description(f"Epoch {info['epoch']}") except Exception as e: From 77ac4a38aab806e50bd7304718bd7a363a0a0973 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Thu, 3 Sep 2020 09:42:49 +0200 Subject: [PATCH 66/84] Simplify specials and cache checks (#6012) --- spacy/tokenizer.pxd | 6 ++-- spacy/tokenizer.pyx | 69 ++++++++++++++++++--------------------------- 2 files changed, 31 insertions(+), 44 deletions(-) diff --git a/spacy/tokenizer.pxd b/spacy/tokenizer.pxd index 828f4550b..9c1398a17 100644 --- a/spacy/tokenizer.pxd +++ b/spacy/tokenizer.pxd @@ -34,9 +34,9 @@ cdef class Tokenizer: vector[SpanC] &filtered) cdef int _retokenize_special_spans(self, Doc doc, TokenC* tokens, object span_data) - cdef int _try_cache(self, hash_t key, Doc tokens) except -1 - cdef int _try_specials(self, hash_t key, Doc tokens, - int* has_special) except -1 + cdef int _try_specials_and_cache(self, hash_t key, Doc tokens, + int* has_special, + bint with_special_cases) except -1 cdef int _tokenize(self, Doc tokens, unicode span, hash_t key, int* has_special, bint with_special_cases) except -1 cdef unicode _split_affixes(self, Pool mem, unicode string, diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx index 12c634e61..759de90d3 100644 --- a/spacy/tokenizer.pyx +++ b/spacy/tokenizer.pyx @@ -169,8 +169,6 @@ cdef class Tokenizer: cdef int i = 0 cdef int start = 0 cdef int has_special = 0 - cdef bint specials_hit = 0 - cdef bint cache_hit = 0 cdef bint in_ws = string[0].isspace() cdef unicode span # The task here is much like string.split, but not quite @@ -186,13 +184,7 @@ cdef class Tokenizer: # we don't have to create the slice when we hit the cache. 
span = string[start:i] key = hash_string(span) - specials_hit = 0 - cache_hit = 0 - if with_special_cases: - specials_hit = self._try_specials(key, doc, &has_special) - if not specials_hit: - cache_hit = self._try_cache(key, doc) - if not specials_hit and not cache_hit: + if not self._try_specials_and_cache(key, doc, &has_special, with_special_cases): self._tokenize(doc, span, key, &has_special, with_special_cases) if uc == ' ': doc.c[doc.length - 1].spacy = True @@ -204,13 +196,7 @@ cdef class Tokenizer: if start < i: span = string[start:] key = hash_string(span) - specials_hit = 0 - cache_hit = 0 - if with_special_cases: - specials_hit = self._try_specials(key, doc, &has_special) - if not specials_hit: - cache_hit = self._try_cache(key, doc) - if not specials_hit and not cache_hit: + if not self._try_specials_and_cache(key, doc, &has_special, with_special_cases): self._tokenize(doc, span, key, &has_special, with_special_cases) doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws return doc @@ -364,27 +350,33 @@ cdef class Tokenizer: offset += span[3] return offset - cdef int _try_cache(self, hash_t key, Doc tokens) except -1: - cached = <_Cached*>self._cache.get(key) - if cached == NULL: - return False + cdef int _try_specials_and_cache(self, hash_t key, Doc tokens, int* has_special, bint with_special_cases) except -1: + cdef bint specials_hit = 0 + cdef bint cache_hit = 0 cdef int i - if cached.is_lex: - for i in range(cached.length): - tokens.push_back(cached.data.lexemes[i], False) - else: - for i in range(cached.length): - tokens.push_back(&cached.data.tokens[i], False) - return True - - cdef int _try_specials(self, hash_t key, Doc tokens, int* has_special) except -1: - cached = <_Cached*>self._specials.get(key) - if cached == NULL: + if with_special_cases: + cached = <_Cached*>self._specials.get(key) + if cached == NULL: + specials_hit = False + else: + for i in range(cached.length): + tokens.push_back(&cached.data.tokens[i], False) + has_special[0] = 1 + specials_hit = True + if not specials_hit: + cached = <_Cached*>self._cache.get(key) + if cached == NULL: + cache_hit = False + else: + if cached.is_lex: + for i in range(cached.length): + tokens.push_back(cached.data.lexemes[i], False) + else: + for i in range(cached.length): + tokens.push_back(&cached.data.tokens[i], False) + cache_hit = True + if not specials_hit and not cache_hit: return False - cdef int i - for i in range(cached.length): - tokens.push_back(&cached.data.tokens[i], False) - has_special[0] = 1 return True cdef int _tokenize(self, Doc tokens, unicode span, hash_t orig_key, int* has_special, bint with_special_cases) except -1: @@ -462,12 +454,7 @@ cdef class Tokenizer: for i in range(prefixes.size()): tokens.push_back(prefixes[0][i], False) if string: - if with_special_cases: - specials_hit = self._try_specials(hash_string(string), tokens, - has_special) - if not specials_hit: - cache_hit = self._try_cache(hash_string(string), tokens) - if specials_hit or cache_hit: + if self._try_specials_and_cache(hash_string(string), tokens, has_special, with_special_cases): pass elif (self.token_match and self.token_match(string)) or \ (self.url_match and \ From 1815c613c90d29d3d18ed377d166cf1dec3813ad Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Thu, 3 Sep 2020 10:07:45 +0200 Subject: [PATCH 67/84] Update docs [ci skip] --- website/docs/usage/layers-architectures.md | 13 ++++---- website/docs/usage/training.md | 39 ++++++++++------------ 2 files changed, 25 insertions(+), 27 deletions(-) diff --git 
a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md index 419048f65..e24b776c8 100644 --- a/website/docs/usage/layers-architectures.md +++ b/website/docs/usage/layers-architectures.md @@ -103,7 +103,7 @@ bit of validation goes a long way, especially if you tools to highlight these errors early. The config file is also validated at the beginning of training, to verify that all the types match correctly. - + If you're using a modern editor like Visual Studio Code, you can [set up `mypy`](https://thinc.ai/docs/usage-type-checking#install) with the @@ -143,11 +143,11 @@ nO = null spaCy has two additional built-in `textcat` architectures, and you can easily use those by swapping out the definition of the textcat's model. For instance, -to use the simpel and fast [bag-of-words model](/api/architectures#TextCatBOW), -you can change the config to: +to use the simple and fast bag-of-words model +[TextCatBOW](/api/architectures#TextCatBOW), you can change the config to: ```ini -### config.cfg (excerpt) +### config.cfg (excerpt) {highlight="6-10"} [components.textcat] factory = "textcat" labels = [] @@ -160,8 +160,9 @@ no_output_layer = false nO = null ``` -The details of all prebuilt architectures and their parameters, can be consulted -on the [API page for model architectures](/api/architectures). +For details on all pre-defined architectures shipped with spaCy and how to +configure them, check out the [model architectures](/api/architectures) +documentation. ### Defining sublayers {#sublayers} diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 2967a0353..43e1193ab 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -669,10 +669,9 @@ def custom_logger(log_path): #### Example: Custom batch size schedule {#custom-code-schedule} -You can also implement your own batch size schedule to use -during training. The `@spacy.registry.schedules` decorator lets you register -that function in the `schedules` [registry](/api/top-level#registry) and assign -it a string name: +You can also implement your own batch size schedule to use during training. The +`@spacy.registry.schedules` decorator lets you register that function in the +`schedules` [registry](/api/top-level#registry) and assign it a string name: > #### Why the version in the name? > @@ -806,14 +805,22 @@ def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Examp ### Defining custom architectures {#custom-architectures} -Built-in pipeline components such as the tagger or named entity recognizer are -constructed with default neural network [models](/api/architectures). -You can change the model architecture -entirely by implementing your own custom models and providing those in the config -when creating the pipeline component. See the -documentation on -[layers and model architectures](/usage/layers-architectures) for more details. +Built-in pipeline components such as the tagger or named entity recognizer are +constructed with default neural network [models](/api/architectures). You can +change the model architecture entirely by implementing your own custom models +and providing those in the config when creating the pipeline component. See the +documentation on [layers and model architectures](/usage/layers-architectures) +for more details. 
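One practical detail, sketched below with the illustrative names from this
example: the registration only takes effect once the module defining the custom
architecture has been imported. When training from the command line, the same
effect is achieved by passing the file to `spacy train` via the `--code`
argument.

```python
import spacy
import functions  # noqa: F401  (importing registers "custom_neural_network.v1")

nlp = spacy.blank("en")
tagger = nlp.add_pipe(
    "tagger",
    config={
        "model": {
            "@architectures": "custom_neural_network.v1",
            "output_width": 512,
        }
    },
)
```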
+> ```ini +> ### config.cfg +> [components.tagger] +> factory = "tagger" +> +> [components.tagger.model] +> @architectures = "custom_neural_network.v1" +> output_width = 512 +> ``` ```python ### functions.py @@ -828,16 +835,6 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]: return create_model(output_width) ``` -```ini -### config.cfg (excerpt) -[components.tagger] -factory = "tagger" - -[components.tagger.model] -@architectures = "custom_neural_network.v1" -output_width = 512 -``` - ## Internal training API {#api} From 5af432e0f2db1d6aeba7a031a8a707fb90b6332a Mon Sep 17 00:00:00 2001 From: Yohei Tamura Date: Thu, 3 Sep 2020 17:09:03 +0900 Subject: [PATCH 68/84] fix for empty string (#5936) --- spacy/tests/doc/test_doc_api.py | 19 ++++++++++--------- spacy/tokens/doc.pyx | 6 ++++-- 2 files changed, 14 insertions(+), 11 deletions(-) diff --git a/spacy/tests/doc/test_doc_api.py b/spacy/tests/doc/test_doc_api.py index 954181df5..b37a31e43 100644 --- a/spacy/tests/doc/test_doc_api.py +++ b/spacy/tests/doc/test_doc_api.py @@ -317,7 +317,8 @@ def test_doc_from_array_morph(en_vocab): def test_doc_api_from_docs(en_tokenizer, de_tokenizer): - en_texts = ["Merging the docs is fun.", "They don't think alike."] + en_texts = ["Merging the docs is fun.", "", "They don't think alike."] + en_texts_without_empty = [t for t in en_texts if len(t)] de_text = "Wie war die Frage?" en_docs = [en_tokenizer(text) for text in en_texts] docs_idx = en_texts[0].index("docs") @@ -338,14 +339,14 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer): Doc.from_docs(en_docs + [de_doc]) m_doc = Doc.from_docs(en_docs) - assert len(en_docs) == len(list(m_doc.sents)) + assert len(en_texts_without_empty) == len(list(m_doc.sents)) assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1]) - assert str(m_doc) == " ".join(en_texts) + assert str(m_doc) == " ".join(en_texts_without_empty) p_token = m_doc[len(en_docs[0]) - 1] assert p_token.text == "." and bool(p_token.whitespace_) en_docs_tokens = [t for doc in en_docs for t in doc] assert len(m_doc) == len(en_docs_tokens) - think_idx = len(en_texts[0]) + 1 + en_texts[1].index("think") + think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think") assert m_doc[9].idx == think_idx with pytest.raises(AttributeError): # not callable, because it was not set via set_extension @@ -353,14 +354,14 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer): assert len(m_doc.user_data) == len(en_docs[0].user_data) # but it's there m_doc = Doc.from_docs(en_docs, ensure_whitespace=False) - assert len(en_docs) == len(list(m_doc.sents)) - assert len(str(m_doc)) == len(en_texts[0]) + len(en_texts[1]) + assert len(en_texts_without_empty) == len(list(m_doc.sents)) + assert len(str(m_doc)) == sum(len(t) for t in en_texts) assert str(m_doc) == "".join(en_texts) p_token = m_doc[len(en_docs[0]) - 1] assert p_token.text == "." 
and not bool(p_token.whitespace_) en_docs_tokens = [t for doc in en_docs for t in doc] assert len(m_doc) == len(en_docs_tokens) - think_idx = len(en_texts[0]) + 0 + en_texts[1].index("think") + think_idx = len(en_texts[0]) + 0 + en_texts[2].index("think") assert m_doc[9].idx == think_idx m_doc = Doc.from_docs(en_docs, attrs=["lemma", "length", "pos"]) @@ -369,12 +370,12 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer): assert list(m_doc.sents) assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1]) # space delimiter considered, although spacy attribute was missing - assert str(m_doc) == " ".join(en_texts) + assert str(m_doc) == " ".join(en_texts_without_empty) p_token = m_doc[len(en_docs[0]) - 1] assert p_token.text == "." and bool(p_token.whitespace_) en_docs_tokens = [t for doc in en_docs for t in doc] assert len(m_doc) == len(en_docs_tokens) - think_idx = len(en_texts[0]) + 1 + en_texts[1].index("think") + think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think") assert m_doc[9].idx == think_idx diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index cd080bf35..3c7b4f8b3 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -920,7 +920,9 @@ cdef class Doc: warnings.warn(Warnings.W101.format(name=name)) else: warnings.warn(Warnings.W102.format(key=key, value=value)) - char_offset += len(doc.text) if not ensure_whitespace or doc[-1].is_space else len(doc.text) + 1 + char_offset += len(doc.text) + if ensure_whitespace and not (len(doc) > 0 and doc[-1].is_space): + char_offset += 1 arrays = [doc.to_array(attrs) for doc in docs] @@ -932,7 +934,7 @@ cdef class Doc: token_offset = -1 for doc in docs[:-1]: token_offset += len(doc) - if not doc[-1].is_space: + if not (len(doc) > 0 and doc[-1].is_space): concat_spaces[token_offset] = True concat_array = numpy.concatenate(arrays) From b02ad8045bcec91ac8c234e3cb6c42f93e3a115e Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Thu, 3 Sep 2020 10:10:13 +0200 Subject: [PATCH 69/84] Update docs [ci skip] --- website/docs/usage/training.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 43e1193ab..2fabd3f7d 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -377,7 +377,8 @@ A **model architecture** is a function that wires up a Thinc component or as a layer of a larger network. You can use Thinc as a thin [wrapper around frameworks](https://thinc.ai/docs/usage-frameworks) such as PyTorch, TensorFlow or MXNet, or you can implement your logic in Thinc -[directly](https://thinc.ai/docs/usage-models). +[directly](https://thinc.ai/docs/usage-models). For more details and examples, +see the usage guide on [layers and architectures](/usage/layers-architectures). spaCy's built-in components will never construct their `Model` instances themselves, so you won't have to subclass the component to change its model @@ -395,8 +396,6 @@ different tasks. For example: | [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Docs], List[List[Floats2d]]]~~ | | [TextCatEnsemble](/api/architectures#TextCatEnsemble) | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). 
~~Model[List[Doc], Floats2d]~~ | - - ### Metrics, training output and weighted scores {#metrics} When you train a model using the [`spacy train`](/api/cli#train) command, you'll @@ -474,11 +473,9 @@ Each custom function can have any numbers of arguments that are passed in via the [config](#config), just the built-in functions. If your function defines **default argument values**, spaCy is able to auto-fill your config when you run [`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a -given parameter is always explicitely set in the config, avoid setting a default +given parameter is always explicitly set in the config, avoid setting a default value for it. - - ### Training with custom code {#custom-code} > #### Example From ef0d0630a4fa5af2acfd71187a03b98784d80fed Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Thu, 3 Sep 2020 12:51:04 +0200 Subject: [PATCH 70/84] Let Langugae.use_params work with falsey inputs The Language.use_params method was failing if you passed in None, which meant we had to use awkward conditionals for the parameter averaging. This solves the problem. --- spacy/language.py | 43 +++++++++++++++++++++++-------------------- 1 file changed, 23 insertions(+), 20 deletions(-) diff --git a/spacy/language.py b/spacy/language.py index 8e7c39b90..7a354ee3d 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -1,5 +1,5 @@ from typing import Optional, Any, Dict, Callable, Iterable, Union, List, Pattern -from typing import Tuple, Iterator +from typing import Tuple, Iterator, Optional from dataclasses import dataclass import random import itertools @@ -1275,7 +1275,7 @@ class Language: return results @contextmanager - def use_params(self, params: dict): + def use_params(self, params: Optional[dict]): """Replace weights of models in the pipeline with those provided in the params dictionary. Can be used as a contextmanager, in which case, models go back to their original weights after the block. @@ -1288,24 +1288,27 @@ class Language: DOCS: https://spacy.io/api/language#use_params """ - contexts = [ - pipe.use_params(params) - for name, pipe in self.pipeline - if hasattr(pipe, "use_params") and hasattr(pipe, "model") - ] - # TODO: Having trouble with contextlib - # Workaround: these aren't actually context managers atm. - for context in contexts: - try: - next(context) - except StopIteration: - pass - yield - for context in contexts: - try: - next(context) - except StopIteration: - pass + if not params: + yield + else: + contexts = [ + pipe.use_params(params) + for name, pipe in self.pipeline + if hasattr(pipe, "use_params") and hasattr(pipe, "model") + ] + # TODO: Having trouble with contextlib + # Workaround: these aren't actually context managers atm. 
+ for context in contexts: + try: + next(context) + except StopIteration: + pass + yield + for context in contexts: + try: + next(context) + except StopIteration: + pass def pipe( self, From b5a0657fd6a104ff61c7c18a0fbdd1c251df5d31 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Thu, 3 Sep 2020 13:13:03 +0200 Subject: [PATCH 71/84] "model" terminology consistency in docs --- netlify.toml | 2 +- spacy/cli/__init__.py | 8 +- spacy/cli/_util.py | 2 +- spacy/cli/convert.py | 2 +- spacy/cli/debug_data.py | 10 +- spacy/cli/download.py | 29 ++- spacy/cli/evaluate.py | 4 +- spacy/cli/info.py | 18 +- spacy/cli/init_config.py | 4 +- spacy/cli/init_model.py | 28 ++- spacy/cli/package.py | 30 +-- spacy/cli/pretrain.py | 9 +- spacy/cli/profile.py | 6 +- spacy/cli/train.py | 10 +- spacy/cli/validate.py | 16 +- spacy/language.py | 2 +- website/docs/api/cli.md | 152 +++++------ website/docs/api/data-formats.md | 104 ++++---- website/docs/api/dependencymatcher.md | 4 +- website/docs/api/entitylinker.md | 4 +- website/docs/api/language.md | 81 +++--- website/docs/api/pipe.md | 2 +- website/docs/api/top-level.md | 133 +++++----- website/docs/models/index.md | 65 +++-- website/docs/usage/101/_pipelines.md | 19 +- website/docs/usage/101/_pos-deps.md | 13 +- website/docs/usage/101/_serialization.md | 10 +- website/docs/usage/101/_training.md | 20 +- website/docs/usage/101/_vectors-similarity.md | 35 ++- website/docs/usage/index.md | 64 +++-- website/docs/usage/linguistic-features.md | 136 +++++----- website/docs/usage/models.md | 235 +++++++++--------- website/docs/usage/processing-pipelines.md | 137 +++++----- website/docs/usage/projects.md | 86 +++---- website/docs/usage/rule-based-matching.md | 55 ++-- website/docs/usage/saving-loading.md | 193 +++++++------- website/docs/usage/spacy-101.md | 56 +++-- website/docs/usage/training.md | 85 +++---- website/docs/usage/v3.md | 90 +++---- website/docs/usage/visualizers.md | 14 +- website/meta/sidebars.json | 4 +- website/src/components/tag.js | 2 +- website/src/templates/models.js | 10 +- website/src/widgets/quickstart-install.js | 4 +- website/src/widgets/quickstart-models.js | 4 +- 45 files changed, 1006 insertions(+), 991 deletions(-) diff --git a/netlify.toml b/netlify.toml index 2f3e350e6..3c17b876c 100644 --- a/netlify.toml +++ b/netlify.toml @@ -24,7 +24,7 @@ redirects = [ {from = "/docs/usage/customizing-tokenizer", to = "/usage/linguistic-features#tokenization", force = true}, {from = "/docs/usage/language-processing-pipeline", to = "/usage/processing-pipelines", force = true}, {from = "/docs/usage/customizing-pipeline", to = "/usage/processing-pipelines", force = true}, - {from = "/docs/usage/training-ner", to = "/usage/training#ner", force = true}, + {from = "/docs/usage/training-ner", to = "/usage/training", force = true}, {from = "/docs/usage/tutorials", to = "/usage/examples", force = true}, {from = "/docs/usage/data-model", to = "/api", force = true}, {from = "/docs/usage/cli", to = "/api/cli", force = true}, diff --git a/spacy/cli/__init__.py b/spacy/cli/__init__.py index b47c1c16b..92cb76971 100644 --- a/spacy/cli/__init__.py +++ b/spacy/cli/__init__.py @@ -29,9 +29,9 @@ from .project.document import project_document # noqa: F401 @app.command("link", no_args_is_help=True, deprecated=True, hidden=True) def link(*args, **kwargs): - """As of spaCy v3.0, model symlinks are deprecated. You can load models - using their full names or from a directory path.""" + """As of spaCy v3.0, symlinks like "en" are deprecated. 
You can load trained + pipeline packages using their full names or from a directory path.""" msg.warn( - "As of spaCy v3.0, model symlinks are deprecated. You can load models " - "using their full names or from a directory path." + "As of spaCy v3.0, model symlinks are deprecated. You can load trained " + "pipeline packages using their full names or from a directory path." ) diff --git a/spacy/cli/_util.py b/spacy/cli/_util.py index cfa126cc4..6a24a4ba4 100644 --- a/spacy/cli/_util.py +++ b/spacy/cli/_util.py @@ -36,7 +36,7 @@ DEBUG_HELP = """Suite of helpful commands for debugging and profiling. Includes commands to check and validate your config files, training and evaluation data, and custom model implementations. """ -INIT_HELP = """Commands for initializing configs and models.""" +INIT_HELP = """Commands for initializing configs and pipeline packages.""" # Wrappers for Typer's annotations. Initially created to set defaults and to # keep the names short, but not needed at the moment. diff --git a/spacy/cli/convert.py b/spacy/cli/convert.py index f73c2f2c0..2a24bd145 100644 --- a/spacy/cli/convert.py +++ b/spacy/cli/convert.py @@ -44,7 +44,7 @@ def convert_cli( file_type: FileTypes = Opt("spacy", "--file-type", "-t", help="Type of data to produce"), n_sents: int = Opt(1, "--n-sents", "-n", help="Number of sentences per doc (0 to disable)"), seg_sents: bool = Opt(False, "--seg-sents", "-s", help="Segment sentences (for -c ner)"), - model: Optional[str] = Opt(None, "--model", "-b", help="Model for sentence segmentation (for -s)"), + model: Optional[str] = Opt(None, "--model", "-b", help="Trained spaCy pipeline for sentence segmentation (for -s)"), morphology: bool = Opt(False, "--morphology", "-m", help="Enable appending morphology to tags"), merge_subtokens: bool = Opt(False, "--merge-subtokens", "-T", help="Merge CoNLL-U subtokens"), converter: str = Opt("auto", "--converter", "-c", help=f"Converter: {tuple(CONVERTERS.keys())}"), diff --git a/spacy/cli/debug_data.py b/spacy/cli/debug_data.py index 2f48a29cd..a4269796f 100644 --- a/spacy/cli/debug_data.py +++ b/spacy/cli/debug_data.py @@ -18,7 +18,7 @@ from .. import util NEW_LABEL_THRESHOLD = 50 # Minimum number of expected occurrences of dependency labels DEP_LABEL_THRESHOLD = 20 -# Minimum number of expected examples to train a blank model +# Minimum number of expected examples to train a new pipeline BLANK_MODEL_MIN_THRESHOLD = 100 BLANK_MODEL_THRESHOLD = 2000 @@ -148,7 +148,7 @@ def debug_data( msg.text(f"Language: {config['nlp']['lang']}") msg.text(f"Training pipeline: {', '.join(pipeline)}") if resume_components: - msg.text(f"Components from other models: {', '.join(resume_components)}") + msg.text(f"Components from other pipelines: {', '.join(resume_components)}") if frozen_components: msg.text(f"Frozen components: {', '.join(frozen_components)}") msg.text(f"{len(train_dataset)} training docs") @@ -164,9 +164,7 @@ def debug_data( # TODO: make this feedback more fine-grained and report on updated # components vs. 
blank components if not resume_components and len(train_dataset) < BLANK_MODEL_THRESHOLD: - text = ( - f"Low number of examples to train from a blank model ({len(train_dataset)})" - ) + text = f"Low number of examples to train a new pipeline ({len(train_dataset)})" if len(train_dataset) < BLANK_MODEL_MIN_THRESHOLD: msg.fail(text) else: @@ -214,7 +212,7 @@ def debug_data( show=verbose, ) else: - msg.info("No word vectors present in the model") + msg.info("No word vectors present in the package") if "ner" in factory_names: # Get all unique NER labels present in the data diff --git a/spacy/cli/download.py b/spacy/cli/download.py index e55e6e40e..3d5e0a765 100644 --- a/spacy/cli/download.py +++ b/spacy/cli/download.py @@ -17,16 +17,19 @@ from ..errors import OLD_MODEL_SHORTCUTS def download_cli( # fmt: off ctx: typer.Context, - model: str = Arg(..., help="Name of model to download"), + model: str = Arg(..., help="Name of pipeline package to download"), direct: bool = Opt(False, "--direct", "-d", "-D", help="Force direct download of name + version"), # fmt: on ): """ - Download compatible model from default download path using pip. If --direct - flag is set, the command expects the full model name with version. - For direct downloads, the compatibility check will be skipped. All + Download compatible trained pipeline from the default download path using + pip. If --direct flag is set, the command expects the full package name with + version. For direct downloads, the compatibility check will be skipped. All additional arguments provided to this command will be passed to `pip install` - on model installation. + on package installation. + + DOCS: https://spacy.io/api/cli#download + AVAILABLE PACKAGES: https://spacy.io/models """ download(model, direct, *ctx.args) @@ -34,11 +37,11 @@ def download_cli( def download(model: str, direct: bool = False, *pip_args) -> None: if not is_package("spacy") and "--no-deps" not in pip_args: msg.warn( - "Skipping model package dependencies and setting `--no-deps`. " + "Skipping pipeline package dependencies and setting `--no-deps`. " "You don't seem to have the spaCy package itself installed " "(maybe because you've built from source?), so installing the " - "model dependencies would cause spaCy to be downloaded, which " - "probably isn't what you want. If the model package has other " + "package dependencies would cause spaCy to be downloaded, which " + "probably isn't what you want. If the pipeline package has other " "dependencies, you'll have to install them manually." ) pip_args = pip_args + ("--no-deps",) @@ -53,7 +56,7 @@ def download(model: str, direct: bool = False, *pip_args) -> None: if model in OLD_MODEL_SHORTCUTS: msg.warn( f"As of spaCy v3.0, shortcuts like '{model}' are deprecated. Please" - f"use the full model name '{OLD_MODEL_SHORTCUTS[model]}' instead." + f"use the full pipeline package name '{OLD_MODEL_SHORTCUTS[model]}' instead." ) model_name = OLD_MODEL_SHORTCUTS[model] compatibility = get_compatibility() @@ -61,7 +64,7 @@ def download(model: str, direct: bool = False, *pip_args) -> None: download_model(dl_tpl.format(m=model_name, v=version), pip_args) msg.good( "Download and installation successful", - f"You can now load the model via spacy.load('{model_name}')", + f"You can now load the package via spacy.load('{model_name}')", ) @@ -71,7 +74,7 @@ def get_compatibility() -> dict: if r.status_code != 200: msg.fail( f"Server error ({r.status_code})", - f"Couldn't fetch compatibility table. 
Please find a model for your spaCy " + f"Couldn't fetch compatibility table. Please find a package for your spaCy " f"installation (v{about.__version__}), and download it manually. " f"For more details, see the documentation: " f"https://spacy.io/usage/models", @@ -80,7 +83,7 @@ def get_compatibility() -> dict: comp_table = r.json() comp = comp_table["spacy"] if version not in comp: - msg.fail(f"No compatible models found for v{version} of spaCy", exits=1) + msg.fail(f"No compatible packages found for v{version} of spaCy", exits=1) return comp[version] @@ -88,7 +91,7 @@ def get_version(model: str, comp: dict) -> str: model = get_base_version(model) if model not in comp: msg.fail( - f"No compatible model found for '{model}' (spaCy v{about.__version__})", + f"No compatible package found for '{model}' (spaCy v{about.__version__})", exits=1, ) return comp[model][0] diff --git a/spacy/cli/evaluate.py b/spacy/cli/evaluate.py index 3847c74f3..3898c89a1 100644 --- a/spacy/cli/evaluate.py +++ b/spacy/cli/evaluate.py @@ -26,8 +26,8 @@ def evaluate_cli( # fmt: on ): """ - Evaluate a model. Expects a loadable spaCy model and evaluation data in the - binary .spacy format. The --gold-preproc option sets up the evaluation + Evaluate a trained pipeline. Expects a loadable spaCy pipeline and evaluation + data in the binary .spacy format. The --gold-preproc option sets up the evaluation examples with gold-standard sentences and tokens for the predictions. Gold preprocessing helps the annotations align to the tokenization, and may result in sequences of more consistent length. However, it may reduce diff --git a/spacy/cli/info.py b/spacy/cli/info.py index ca082b939..98cd042a8 100644 --- a/spacy/cli/info.py +++ b/spacy/cli/info.py @@ -12,14 +12,14 @@ from .. import about @app.command("info") def info_cli( # fmt: off - model: Optional[str] = Arg(None, help="Optional model name"), + model: Optional[str] = Arg(None, help="Optional loadable spaCy pipeline"), markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"), silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"), # fmt: on ): """ - Print info about spaCy installation. If a model is speficied as an argument, - print model information. Flag --markdown prints details in Markdown for easy + Print info about spaCy installation. If a pipeline is speficied as an argument, + print its meta information. Flag --markdown prints details in Markdown for easy copy-pasting to GitHub issues. 
""" info(model, markdown=markdown, silent=silent) @@ -30,14 +30,16 @@ def info( ) -> Union[str, dict]: msg = Printer(no_print=silent, pretty=not silent) if model: - title = f"Info about model '{model}'" + title = f"Info about pipeline '{model}'" data = info_model(model, silent=silent) else: title = "Info about spaCy" data = info_spacy() raw_data = {k.lower().replace(" ", "_"): v for k, v in data.items()} - if "Models" in data and isinstance(data["Models"], dict): - data["Models"] = ", ".join(f"{n} ({v})" for n, v in data["Models"].items()) + if "Pipelines" in data and isinstance(data["Pipelines"], dict): + data["Pipelines"] = ", ".join( + f"{n} ({v})" for n, v in data["Pipelines"].items() + ) markdown_data = get_markdown(data, title=title) if markdown: if not silent: @@ -63,7 +65,7 @@ def info_spacy() -> Dict[str, any]: "Location": str(Path(__file__).parent.parent), "Platform": platform.platform(), "Python version": platform.python_version(), - "Models": all_models, + "Pipelines": all_models, } @@ -81,7 +83,7 @@ def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]: model_path = model meta_path = model_path / "meta.json" if not meta_path.is_file(): - msg.fail("Can't find model meta.json", meta_path, exits=1) + msg.fail("Can't find pipeline meta.json", meta_path, exits=1) meta = srsly.read_json(meta_path) if model_path.resolve() != model_path: meta["source"] = str(model_path.resolve()) diff --git a/spacy/cli/init_config.py b/spacy/cli/init_config.py index 1e1e55e06..b75718a2e 100644 --- a/spacy/cli/init_config.py +++ b/spacy/cli/init_config.py @@ -27,7 +27,7 @@ def init_config_cli( # fmt: off output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True), lang: Optional[str] = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"), - pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include in the model (without 'tok2vec' or 'transformer')"), + pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"), optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."), cpu: bool = Opt(False, "--cpu", "-C", help="Whether the model needs to run on CPU. 
This will impact the choice of architecture, pretrained weights and related hyperparameters."), # fmt: on @@ -168,7 +168,7 @@ def save_config( output_file.parent.mkdir(parents=True) config.to_disk(output_file, interpolate=False) msg.good("Saved config", output_file) - msg.text("You can now add your data and train your model:") + msg.text("You can now add your data and train your pipeline:") variables = ["--paths.train ./train.spacy", "--paths.dev ./dev.spacy"] if not no_print: print(f"{COMMAND} train {output_file.parts[-1]} {' '.join(variables)}") diff --git a/spacy/cli/init_model.py b/spacy/cli/init_model.py index 4fdd2bbbc..071d5f659 100644 --- a/spacy/cli/init_model.py +++ b/spacy/cli/init_model.py @@ -28,7 +28,7 @@ except ImportError: DEFAULT_OOV_PROB = -20 -@init_cli.command("model") +@init_cli.command("vectors") @app.command( "init-model", context_settings={"allow_extra_args": True, "ignore_unknown_options": True}, @@ -37,8 +37,8 @@ DEFAULT_OOV_PROB = -20 def init_model_cli( # fmt: off ctx: typer.Context, # This is only used to read additional arguments - lang: str = Arg(..., help="Model language"), - output_dir: Path = Arg(..., help="Model output directory"), + lang: str = Arg(..., help="Pipeline language"), + output_dir: Path = Arg(..., help="Pipeline output directory"), freqs_loc: Optional[Path] = Arg(None, help="Location of words frequencies file", exists=True), clusters_loc: Optional[Path] = Opt(None, "--clusters-loc", "-c", help="Optional location of brown clusters data", exists=True), jsonl_loc: Optional[Path] = Opt(None, "--jsonl-loc", "-j", help="Location of JSONL-formatted attributes file", exists=True), @@ -46,19 +46,20 @@ def init_model_cli( prune_vectors: int = Opt(-1, "--prune-vectors", "-V", help="Optional number of vectors to prune to"), truncate_vectors: int = Opt(0, "--truncate-vectors", "-t", help="Optional number of vectors to truncate to when reading in vectors file"), vectors_name: Optional[str] = Opt(None, "--vectors-name", "-vn", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"), - model_name: Optional[str] = Opt(None, "--model-name", "-mn", help="Optional name for the model meta"), - base_model: Optional[str] = Opt(None, "--base-model", "-b", help="Base model (for languages with custom tokenizers)") + model_name: Optional[str] = Opt(None, "--model-name", "-mn", help="Optional name for the pipeline meta"), + base_model: Optional[str] = Opt(None, "--base-model", "-b", help="Base pipeline (for languages with custom tokenizers)") # fmt: on ): """ - Create a new model from raw data. If vectors are provided in Word2Vec format, - they can be either a .txt or zipped as a .zip or .tar.gz. + Create a new blank pipeline directory with vocab and vectors from raw data. + If vectors are provided in Word2Vec format, they can be either a .txt or + zipped as a .zip or .tar.gz. """ if ctx.command.name == "init-model": msg.warn( - "The init-model command is now available via the 'init model' " - "subcommand (without the hyphen). You can run python -m spacy init " - "--help for an overview of the other available initialization commands." + "The init-model command is now called 'init vocab'. You can run " + "'python -m spacy init --help' for an overview of the other " + "available initialization commands." 
) init_model( lang, @@ -115,10 +116,10 @@ def init_model( msg.fail("Can't find words frequencies file", freqs_loc, exits=1) lex_attrs = read_attrs_from_deprecated(msg, freqs_loc, clusters_loc) - with msg.loading("Creating model..."): + with msg.loading("Creating blank pipeline..."): nlp = create_model(lang, lex_attrs, name=model_name, base_model=base_model) - msg.good("Successfully created model") + msg.good("Successfully created blank pipeline") if vectors_loc is not None: add_vectors( msg, nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name @@ -242,7 +243,8 @@ def add_vectors( if vectors_data is not None: nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys) if name is None: - nlp.vocab.vectors.name = f"{nlp.meta['lang']}_model.vectors" + # TODO: Is this correct? Does this matter? + nlp.vocab.vectors.name = f"{nlp.meta['lang']}_{nlp.meta['name']}.vectors" else: nlp.vocab.vectors.name = name nlp.meta["vectors"]["name"] = nlp.vocab.vectors.name diff --git a/spacy/cli/package.py b/spacy/cli/package.py index 4e5038951..f464c97e8 100644 --- a/spacy/cli/package.py +++ b/spacy/cli/package.py @@ -14,19 +14,19 @@ from .. import about @app.command("package") def package_cli( # fmt: off - input_dir: Path = Arg(..., help="Directory with model data", exists=True, file_okay=False), + input_dir: Path = Arg(..., help="Directory with pipeline data", exists=True, file_okay=False), output_dir: Path = Arg(..., help="Output parent directory", exists=True, file_okay=False), meta_path: Optional[Path] = Opt(None, "--meta-path", "--meta", "-m", help="Path to meta.json", exists=True, dir_okay=False), create_meta: bool = Opt(False, "--create-meta", "-c", "-C", help="Create meta.json, even if one exists"), version: Optional[str] = Opt(None, "--version", "-v", help="Package version to override meta"), no_sdist: bool = Opt(False, "--no-sdist", "-NS", help="Don't build .tar.gz sdist, can be set if you want to run this step manually"), - force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing model in output directory"), + force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing data in output directory"), # fmt: on ): """ - Generate an installable Python package for a model. Includes model data, + Generate an installable Python package for a pipeline. Includes binary data, meta and required installation files. A new directory will be created in the - specified output directory, and model data will be copied over. If + specified output directory, and the data will be copied over. If --create-meta is set and a meta.json already exists in the output directory, the existing values will be used as the defaults in the command-line prompt. 
After packaging, "python setup.py sdist" is run in the package directory, @@ -59,14 +59,14 @@ def package( output_path = util.ensure_path(output_dir) meta_path = util.ensure_path(meta_path) if not input_path or not input_path.exists(): - msg.fail("Can't locate model data", input_path, exits=1) + msg.fail("Can't locate pipeline data", input_path, exits=1) if not output_path or not output_path.exists(): msg.fail("Output directory not found", output_path, exits=1) if meta_path and not meta_path.exists(): - msg.fail("Can't find model meta.json", meta_path, exits=1) + msg.fail("Can't find pipeline meta.json", meta_path, exits=1) meta_path = meta_path or input_dir / "meta.json" if not meta_path.exists() or not meta_path.is_file(): - msg.fail("Can't load model meta.json", meta_path, exits=1) + msg.fail("Can't load pipeline meta.json", meta_path, exits=1) meta = srsly.read_json(meta_path) meta = get_meta(input_dir, meta) if version is not None: @@ -77,7 +77,7 @@ def package( meta = generate_meta(meta, msg) errors = validate(ModelMetaSchema, meta) if errors: - msg.fail("Invalid model meta.json") + msg.fail("Invalid pipeline meta.json") print("\n".join(errors)) sys.exit(1) model_name = meta["lang"] + "_" + meta["name"] @@ -118,7 +118,7 @@ def get_meta( ) -> Dict[str, Any]: meta = { "lang": "en", - "name": "model", + "name": "pipeline", "version": "0.0.0", "description": "", "author": "", @@ -143,10 +143,10 @@ def get_meta( def generate_meta(existing_meta: Dict[str, Any], msg: Printer) -> Dict[str, Any]: meta = existing_meta or {} settings = [ - ("lang", "Model language", meta.get("lang", "en")), - ("name", "Model name", meta.get("name", "model")), - ("version", "Model version", meta.get("version", "0.0.0")), - ("description", "Model description", meta.get("description", None)), + ("lang", "Pipeline language", meta.get("lang", "en")), + ("name", "Pipeline name", meta.get("name", "pipeline")), + ("version", "Package version", meta.get("version", "0.0.0")), + ("description", "Package description", meta.get("description", None)), ("author", "Author", meta.get("author", None)), ("email", "Author email", meta.get("email", None)), ("url", "Author website", meta.get("url", None)), @@ -154,8 +154,8 @@ def generate_meta(existing_meta: Dict[str, Any], msg: Printer) -> Dict[str, Any] ] msg.divider("Generating meta.json") msg.text( - "Enter the package settings for your model. The following information " - "will be read from your model data: pipeline, vectors." + "Enter the package settings for your pipeline. The following information " + "will be read from your pipeline data: pipeline, vectors." 
) for setting, desc, default in settings: response = get_raw_input(desc, default) diff --git a/spacy/cli/pretrain.py b/spacy/cli/pretrain.py index 5f20773e1..fe6bfa92e 100644 --- a/spacy/cli/pretrain.py +++ b/spacy/cli/pretrain.py @@ -31,7 +31,7 @@ def pretrain_cli( # fmt: off ctx: typer.Context, # This is only used to read additional arguments texts_loc: Path = Arg(..., help="Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", exists=True), - output_dir: Path = Arg(..., help="Directory to write models to on each epoch"), + output_dir: Path = Arg(..., help="Directory to write weights to on each epoch"), config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False), code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"), resume_path: Optional[Path] = Opt(None, "--resume-path", "-r", help="Path to pretrained weights from which to resume pretraining"), @@ -376,10 +376,9 @@ def verify_cli_args(texts_loc, output_dir, config_path, resume_path, epoch_resum if output_dir.exists() and [p for p in output_dir.iterdir()]: if resume_path: msg.warn( - "Output directory is not empty. ", - "If you're resuming a run from a previous model in this directory, " - "the old models for the consecutive epochs will be overwritten " - "with the new ones.", + "Output directory is not empty.", + "If you're resuming a run in this directory, the old weights " + "for the consecutive epochs will be overwritten with the new ones.", ) else: msg.warn( diff --git a/spacy/cli/profile.py b/spacy/cli/profile.py index 14d8435fe..1b995f4bc 100644 --- a/spacy/cli/profile.py +++ b/spacy/cli/profile.py @@ -19,7 +19,7 @@ from ..util import load_model def profile_cli( # fmt: off ctx: typer.Context, # This is only used to read current calling context - model: str = Arg(..., help="Model to load"), + model: str = Arg(..., help="Trained pipeline to load"), inputs: Optional[Path] = Arg(None, help="Location of input file. 
'-' for stdin.", exists=True, allow_dash=True), n_texts: int = Opt(10000, "--n-texts", "-n", help="Maximum number of texts to use if available"), # fmt: on @@ -60,9 +60,9 @@ def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) -> inputs, _ = zip(*imdb_train) msg.info(f"Loaded IMDB dataset and using {n_inputs} examples") inputs = inputs[:n_inputs] - with msg.loading(f"Loading model '{model}'..."): + with msg.loading(f"Loading pipeline '{model}'..."): nlp = load_model(model) - msg.good(f"Loaded model '{model}'") + msg.good(f"Loaded pipeline '{model}'") texts = list(itertools.islice(inputs, n_texts)) cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(), "Profile.prof") s = pstats.Stats("Profile.prof") diff --git a/spacy/cli/train.py b/spacy/cli/train.py index 4ce02286a..5377f7f8f 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -26,7 +26,7 @@ def train_cli( # fmt: off ctx: typer.Context, # This is only used to read additional arguments config_path: Path = Arg(..., help="Path to config file", exists=True), - output_path: Optional[Path] = Opt(None, "--output", "--output-path", "-o", help="Output directory to store model in"), + output_path: Optional[Path] = Opt(None, "--output", "--output-path", "-o", help="Output directory to store trained pipeline in"), code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"), verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"), use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"), @@ -34,7 +34,7 @@ def train_cli( # fmt: on ): """ - Train or update a spaCy model. Requires data in spaCy's binary format. To + Train or update a spaCy pipeline. Requires data in spaCy's binary format. To convert data from other formats, use the `spacy convert` command. The config file includes all settings and hyperparameters used during traing. To override settings in the config, e.g. settings that point to local @@ -113,12 +113,12 @@ def train( # Load morph rules nlp.vocab.morphology.load_morph_exceptions(morph_rules) - # Load a pretrained tok2vec model - cf. CLI command 'pretrain' + # Load pretrained tok2vec weights - cf. CLI command 'pretrain' if weights_data is not None: tok2vec_path = config["pretraining"].get("tok2vec_model", None) if tok2vec_path is None: msg.fail( - f"To use a pretrained tok2vec model, the config needs to specify which " + f"To pretrained tok2vec weights, the config needs to specify which " f"tok2vec layer to load in the setting [pretraining.tok2vec_model].", exits=1, ) @@ -183,7 +183,7 @@ def train( nlp.to_disk(final_model_path) else: nlp.to_disk(final_model_path) - msg.good(f"Saved model to output directory {final_model_path}") + msg.good(f"Saved pipeline to output directory {final_model_path}") def create_train_batches(iterator, batcher, max_epochs: int): diff --git a/spacy/cli/validate.py b/spacy/cli/validate.py index e6ba284df..a1e05fdcd 100644 --- a/spacy/cli/validate.py +++ b/spacy/cli/validate.py @@ -13,9 +13,9 @@ from ..util import get_package_path, get_model_meta, is_compatible_version @app.command("validate") def validate_cli(): """ - Validate the currently installed models and spaCy version. Checks if the - installed models are compatible and shows upgrade instructions if available. - Should be run after `pip install -U spacy`. + Validate the currently installed pipeline packages and spaCy version. 
Checks + if the installed packages are compatible and shows upgrade instructions if + available. Should be run after `pip install -U spacy`. """ validate() @@ -25,13 +25,13 @@ def validate() -> None: spacy_version = get_base_version(about.__version__) current_compat = compat.get(spacy_version, {}) if not current_compat: - msg.warn(f"No compatible models found for v{spacy_version} of spaCy") + msg.warn(f"No compatible packages found for v{spacy_version} of spaCy") incompat_models = {d["name"] for _, d in model_pkgs.items() if not d["compat"]} na_models = [m for m in incompat_models if m not in current_compat] update_models = [m for m in incompat_models if m in current_compat] spacy_dir = Path(__file__).parent.parent - msg.divider(f"Installed models (spaCy v{about.__version__})") + msg.divider(f"Installed pipeline packages (spaCy v{about.__version__})") msg.info(f"spaCy installation: {spacy_dir}") if model_pkgs: @@ -47,15 +47,15 @@ def validate() -> None: rows.append((data["name"], data["spacy"], version, comp)) msg.table(rows, header=header) else: - msg.text("No models found in your current environment.", exits=0) + msg.text("No pipeline packages found in your current environment.", exits=0) if update_models: msg.divider("Install updates") - msg.text("Use the following commands to update the model packages:") + msg.text("Use the following commands to update the packages:") cmd = "python -m spacy download {}" print("\n".join([cmd.format(pkg) for pkg in update_models]) + "\n") if na_models: msg.info( - f"The following models are custom spaCy models or not " + f"The following packages are custom spaCy pipelines or not " f"available for spaCy v{about.__version__}:", ", ".join(na_models), ) diff --git a/spacy/language.py b/spacy/language.py index 8e7c39b90..211e6c547 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -192,7 +192,7 @@ class Language: self._meta.setdefault("lang", self.vocab.lang) else: self._meta.setdefault("lang", self.lang) - self._meta.setdefault("name", "model") + self._meta.setdefault("name", "pipeline") self._meta.setdefault("version", "0.0.0") self._meta.setdefault("spacy_version", spacy_version) self._meta.setdefault("description", "") diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index 9070855fa..98da62eb3 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -1,6 +1,6 @@ --- title: Command Line Interface -teaser: Download, train and package models, and debug spaCy +teaser: Download, train and package pipelines, and debug spaCy source: spacy/cli menu: - ['download', 'download'] @@ -17,45 +17,47 @@ menu: --- spaCy's CLI provides a range of helpful commands for downloading and training -models, converting data and debugging your config, data and installation. For a -list of available commands, you can type `python -m spacy --help`. You can also -add the `--help` flag to any command or subcommand to see the description, +pipelines, converting data and debugging your config, data and installation. For +a list of available commands, you can type `python -m spacy --help`. You can +also add the `--help` flag to any command or subcommand to see the description, available arguments and usage. ## download {#download tag="command"} -Download [models](/usage/models) for spaCy. The downloader finds the -best-matching compatible version and uses `pip install` to download the model as -a package. Direct downloads don't perform any compatibility checks and require -the model name to be specified with its version (e.g. `en_core_web_sm-2.2.0`). 
+Download [trained pipelines](/usage/models) for spaCy. The downloader finds the +best-matching compatible version and uses `pip install` to download the Python +package. Direct downloads don't perform any compatibility checks and require the +pipeline name to be specified with its version (e.g. `en_core_web_sm-2.2.0`). > #### Downloading best practices > > The `download` command is mostly intended as a convenient, interactive wrapper > – it performs compatibility checks and prints detailed messages in case things > go wrong. It's **not recommended** to use this command as part of an automated -> process. If you know which model your project needs, you should consider a -> [direct download via pip](/usage/models#download-pip), or uploading the model -> to a local PyPi installation and fetching it straight from there. This will -> also allow you to add it as a versioned package dependency to your project. +> process. If you know which package your project needs, you should consider a +> [direct download via pip](/usage/models#download-pip), or uploading the +> package to a local PyPi installation and fetching it straight from there. This +> will also allow you to add it as a versioned package dependency to your +> project. ```cli $ python -m spacy download [model] [--direct] [pip_args] ``` -| Name | Description | -| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `model` | Model name, e.g. [`en_core_web_sm`](/models/en#en_core_web_sm). ~~str (positional)~~ | -| `--direct`, `-d` | Force direct download of exact model version. ~~bool (flag)~~ | -| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | -| pip args 2.1 | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. ~~Any (option/flag)~~ | -| **CREATES** | The installed model package in your `site-packages` directory. | +| Name | Description | +| ------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `model` | Pipeline package name, e.g. [`en_core_web_sm`](/models/en#en_core_web_sm). ~~str (positional)~~ | +| `--direct`, `-d` | Force direct download of exact package version. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| pip args 2.1 | Additional installation options to be passed to `pip install` when installing the pipeline package. For example, `--user` to install to the user home directory or `--no-deps` to not install package dependencies. ~~Any (option/flag)~~ | +| **CREATES** | The installed pipeline package in your `site-packages` directory. | ## info {#info tag="command"} -Print information about your spaCy installation, models and local setup, and -generate [Markdown](https://en.wikipedia.org/wiki/Markdown)-formatted markup to -copy-paste into [GitHub issues](https://github.com/explosion/spaCy/issues). 
+Print information about your spaCy installation, trained pipelines and local +setup, and generate [Markdown](https://en.wikipedia.org/wiki/Markdown)-formatted +markup to copy-paste into +[GitHub issues](https://github.com/explosion/spaCy/issues). ```cli $ python -m spacy info [--markdown] [--silent] @@ -65,41 +67,41 @@ $ python -m spacy info [--markdown] [--silent] $ python -m spacy info [model] [--markdown] [--silent] ``` -| Name | Description | -| ------------------------------------------------ | ------------------------------------------------------------------------------ | -| `model` | A model, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ | -| `--markdown`, `-md` | Print information as Markdown. ~~bool (flag)~~ | -| `--silent`, `-s` 2.0.12 | Don't print anything, just return the values. ~~bool (flag)~~ | -| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | -| **PRINTS** | Information about your spaCy installation. | +| Name | Description | +| ------------------------------------------------ | ----------------------------------------------------------------------------------------- | +| `model` | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ | +| `--markdown`, `-md` | Print information as Markdown. ~~bool (flag)~~ | +| `--silent`, `-s` 2.0.12 | Don't print anything, just return the values. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **PRINTS** | Information about your spaCy installation. | ## validate {#validate new="2" tag="command"} -Find all models installed in the current environment and check whether they are -compatible with the currently installed version of spaCy. Should be run after -upgrading spaCy via `pip install -U spacy` to ensure that all installed models -are can be used with the new version. It will show a list of models and their -installed versions. If any model is out of date, the latest compatible versions -and command for updating are shown. +Find all trained pipeline packages installed in the current environment and +check whether they are compatible with the currently installed version of spaCy. +Should be run after upgrading spaCy via `pip install -U spacy` to ensure that +all installed packages are can be used with the new version. It will show a list +of packages and their installed versions. If any package is out of date, the +latest compatible versions and command for updating are shown. > #### Automated validation > > You can also use the `validate` command as part of your build process or test -> suite, to ensure all models are up to date before proceeding. If incompatible -> models are found, it will return `1`. +> suite, to ensure all packages are up to date before proceeding. If +> incompatible packages are found, it will return `1`. ```cli $ python -m spacy validate ``` -| Name | Description | -| ---------- | --------------------------------------------------------- | -| **PRINTS** | Details about the compatibility of your installed models. | +| Name | Description | +| ---------- | -------------------------------------------------------------------- | +| **PRINTS** | Details about the compatibility of your installed pipeline packages. | ## init {#init new="3"} The `spacy init` CLI includes helpful commands for initializing training config -files and model directories. +files and pipeline directories. 
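+
+For example, you could generate a starter config for an English pipeline with a
+named entity recognizer like this (an illustrative invocation with placeholder
+values; see the sections below for all available settings):
+
+```cli
+$ python -m spacy init config ./config.cfg --lang en --pipeline ner
+```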
### init config {#init-config new="3" tag="command"} @@ -125,7 +127,7 @@ $ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [ | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `output_file` | Path to output `.cfg` file or `-` to write the config to stdout (so you can pipe it forward to a file). Note that if you're writing to stdout, no additional logging info is printed. ~~Path (positional)~~ | | `--lang`, `-l` | Optional code of the [language](/usage/models#languages) to use. Defaults to `"en"`. ~~str (option)~~ | -| `--pipeline`, `-p` | Comma-separated list of trainable [pipeline components](/usage/processing-pipelines#built-in) to include in the model. Defaults to `"tagger,parser,ner"`. ~~str (option)~~ | +| `--pipeline`, `-p` | Comma-separated list of trainable [pipeline components](/usage/processing-pipelines#built-in) to include. Defaults to `"tagger,parser,ner"`. ~~str (option)~~ | | `--optimize`, `-o` | `"efficiency"` or `"accuracy"`. Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters. Defaults to `"efficiency"`. ~~str (option)~~ | | `--cpu`, `-C` | Whether the model needs to run on CPU. This will impact the choice of architecture, pretrained weights and related hyperparameters. ~~bool (flag)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | @@ -165,36 +167,36 @@ $ python -m spacy init fill-config [base_path] [output_file] [--diff] | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | | **CREATES** | Complete and auto-filled config file for training. | -### init model {#init-model new="2" tag="command"} +### init vocab {#init-vocab new="3" tag="command"} -Create a new model directory from raw data, like word frequencies, Brown -clusters and word vectors. Note that in order to populate the model's vocab, you +Create a blank pipeline directory from raw data, like word frequencies, Brown +clusters and word vectors. Note that in order to populate the vocabulary, you need to pass in a JSONL-formatted [vocabulary file](/api/data-formats#vocab-jsonl) as `--jsonl-loc` with optional `id` values that correspond to the vectors table. Just loading in vectors will not automatically populate the vocab. - + -The `init-model` command is now available as a subcommand of `spacy init`. +This command was previously called `init-model`. ```cli -$ python -m spacy init model [lang] [output_dir] [--jsonl-loc] [--vectors-loc] [--prune-vectors] +$ python -m spacy init vocab [lang] [output_dir] [--jsonl-loc] [--vectors-loc] [--prune-vectors] ``` | Name | Description | | ------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `lang` | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. 
~~str (positional)~~ | -| `output_dir` | Model output directory. Will be created if it doesn't exist. ~~Path (positional)~~ | +| `lang` | Pipeline language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. ~~str (positional)~~ | +| `output_dir` | Pipeline output directory. Will be created if it doesn't exist. ~~Path (positional)~~ | | `--jsonl-loc`, `-j` | Optional location of JSONL-formatted [vocabulary file](/api/data-formats#vocab-jsonl) with lexical attributes. ~~Optional[Path] \(option)~~ | | `--vectors-loc`, `-v` | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. ~~Optional[Path] \(option)~~ | | `--truncate-vectors`, `-t` 2.3 | Number of vectors to truncate to when reading in vectors file. Defaults to `0` for no truncation. ~~int (option)~~ | | `--prune-vectors`, `-V` | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. ~~int (option)~~ | | `--vectors-name`, `-vn` | Name to assign to the word vectors in the `meta.json`, e.g. `en_core_web_md.vectors`. ~~str (option)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | -| **CREATES** | A spaCy model containing the vocab and vectors. | +| **CREATES** | A spaCy pipeline directory containing the vocab and vectors. | ## convert {#convert tag="command"} @@ -594,11 +596,11 @@ $ python -m spacy debug profile [model] [inputs] [--n-texts] | Name | Description | | ----------------- | ---------------------------------------------------------------------------------- | -| `model` | A loadable spaCy model. ~~str (positional)~~ | +| `model` | A loadable spaCy pipeline (package name or path). ~~str (positional)~~ | | `inputs` | Optional path to input file, or `-` for standard input. ~~Path (positional)~~ | | `--n-texts`, `-n` | Maximum number of texts to use if available. Defaults to `10000`. ~~int (option)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | -| **PRINTS** | Profiling information for the model. | +| **PRINTS** | Profiling information for the pipeline. | ### debug model {#debug-model new="3" tag="command"} @@ -724,10 +726,10 @@ $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P ## train {#train tag="command"} -Train a model. Expects data in spaCy's +Train a pipeline. Expects data in spaCy's [binary format](/api/data-formats#training) and a [config file](/api/data-formats#config) with all settings and hyperparameters. -Will save out the best model from all epochs, as well as the final model. The +Will save out the best model from all epochs, as well as the final pipeline. The `--code` argument can be used to provide a Python file that's imported before the training process starts. This lets you register [custom functions](/usage/training#custom-functions) and architectures and refer @@ -753,12 +755,12 @@ $ python -m spacy train [config_path] [--output] [--code] [--verbose] [overrides | Name | Description | | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. 
~~Path (positional)~~ | -| `--output`, `-o` | Directory to store model in. Will be created if it doesn't exist. ~~Optional[Path] \(positional)~~ | +| `--output`, `-o` | Directory to store trained pipeline in. Will be created if it doesn't exist. ~~Optional[Path] \(positional)~~ | | `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | | `--verbose`, `-V` | Show more detailed messages during training. ~~bool (flag)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | | overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ | -| **CREATES** | The final model and the best model. | +| **CREATES** | The final trained pipeline and the best trained pipeline. | ## pretrain {#pretrain new="2.1" tag="command,experimental"} @@ -769,7 +771,7 @@ a component like a CNN, BiLSTM, etc to predict vectors which match the pretrained ones. The weights are saved to a directory after each epoch. You can then include a **path to one of these pretrained weights files** in your [training config](/usage/training#config) as the `init_tok2vec` setting when you -train your model. This technique may be especially helpful if you have little +train your pipeline. This technique may be especially helpful if you have little labelled data. See the usage docs on [pretraining](/usage/training#pretraining) for more info. @@ -792,7 +794,7 @@ $ python -m spacy pretrain [texts_loc] [output_dir] [config_path] [--code] [--re | Name | Description | | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `texts_loc` | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](/api/data-formats#pretrain) for details. ~~Path (positional)~~ | -| `output_dir` | Directory to write models to on each epoch. ~~Path (positional)~~ | +| `output_dir` | Directory to save binary weights to on each epoch. ~~Path (positional)~~ | | `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | | `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | | `--resume-path`, `-r` | Path to pretrained weights from which to resume pretraining. ~~Optional[Path] \(option)~~ | @@ -803,7 +805,8 @@ $ python -m spacy pretrain [texts_loc] [output_dir] [config_path] [--code] [--re ## evaluate {#evaluate new="2" tag="command"} -Evaluate a model. Expects a loadable spaCy model and evaluation data in the +Evaluate a trained pipeline. Expects a loadable spaCy pipeline (package name or +path) and evaluation data in the [binary `.spacy` format](/api/data-formats#binary-training). The `--gold-preproc` option sets up the evaluation examples with gold-standard sentences and tokens for the predictions. 
Gold preprocessing helps the @@ -819,7 +822,7 @@ $ python -m spacy evaluate [model] [data_path] [--output] [--gold-preproc] [--gp | Name | Description | | ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `model` | Model to evaluate. Can be a package or a path to a model data directory. ~~str (positional)~~ | +| `model` | Pipeline to evaluate. Can be a package or a path to a data directory. ~~str (positional)~~ | | `data_path` | Location of evaluation data in spaCy's [binary format](/api/data-formats#training). ~~Path (positional)~~ | | `--output`, `-o` | Output JSON file for metrics. If not set, no metrics will be exported. ~~Optional[Path] \(option)~~ | | `--gold-preproc`, `-G` | Use gold preprocessing. ~~bool (flag)~~ | @@ -831,13 +834,12 @@ $ python -m spacy evaluate [model] [data_path] [--output] [--gold-preproc] [--gp ## package {#package tag="command"} -Generate an installable -[model Python package](/usage/training#models-generating) from an existing model -data directory. All data files are copied over. If the path to a -[`meta.json`](/api/data-formats#meta) is supplied, or a `meta.json` is found in -the input directory, this file is used. Otherwise, the data can be entered -directly from the command line. spaCy will then create a `.tar.gz` archive file -that you can distribute and install with `pip install`. +Generate an installable [Python package](/usage/training#models-generating) from +an existing pipeline data directory. All data files are copied over. If the path +to a [`meta.json`](/api/data-formats#meta) is supplied, or a `meta.json` is +found in the input directory, this file is used. Otherwise, the data can be +entered directly from the command line. spaCy will then create a `.tar.gz` +archive file that you can distribute and install with `pip install`. @@ -855,13 +857,13 @@ $ python -m spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] > > ```cli > $ python -m spacy package /input /output -> $ cd /output/en_model-0.0.0 -> $ pip install dist/en_model-0.0.0.tar.gz +> $ cd /output/en_pipeline-0.0.0 +> $ pip install dist/en_pipeline-0.0.0.tar.gz > ``` | Name | Description | | ------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `input_dir` | Path to directory containing model data. ~~Path (positional)~~ | +| `input_dir` | Path to directory containing pipeline data. ~~Path (positional)~~ | | `output_dir` | Directory to create package folder in. ~~Path (positional)~~ | | `--meta-path`, `-m` 2 | Path to [`meta.json`](/api/data-formats#meta) file (optional). ~~Optional[Path] \(option)~~ | | `--create-meta`, `-C` 2 | Create a `meta.json` file on the command line, even if one already exists in the directory. If an existing file is found, its entries will be shown as the defaults in the command line prompt. ~~bool (flag)~~ | @@ -869,13 +871,13 @@ $ python -m spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] | `--version`, `-v` 3 | Package version to override in meta. Useful when training new versions, as it doesn't require editing the meta template. ~~Optional[str] \(option)~~ | | `--force`, `-f` | Force overwriting of existing folder in output directory. 
~~bool (flag)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | -| **CREATES** | A Python package containing the spaCy model. | +| **CREATES** | A Python package containing the spaCy pipeline. | ## project {#project new="3"} The `spacy project` CLI includes subcommands for working with [spaCy projects](/usage/projects), end-to-end workflows for building and -deploying custom spaCy models. +deploying custom spaCy pipelines. ### project clone {#project-clone tag="command"} @@ -1015,9 +1017,9 @@ Download all files or directories listed as `outputs` for commands, unless they are not already present locally. When searching for files in the remote, `pull` won't just look at the output path, but will also consider the **command string** and the **hashes of the dependencies**. For instance, let's say you've -previously pushed a model checkpoint to the remote, but now you've changed some +previously pushed a checkpoint to the remote, but now you've changed some hyper-parameters. Because you've changed the inputs to the command, if you run -`pull`, you won't retrieve the stale result. If you train your model and push +`pull`, you won't retrieve the stale result. If you train your pipeline and push the outputs to the remote, the outputs will be saved alongside the prior outputs, so if you change the config back, you'll be able to fetch back the result. diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md index 8ef8041ee..3fd2818f4 100644 --- a/website/docs/api/data-formats.md +++ b/website/docs/api/data-formats.md @@ -6,18 +6,18 @@ menu: - ['Training Data', 'training'] - ['Pretraining Data', 'pretraining'] - ['Vocabulary', 'vocab-jsonl'] - - ['Model Meta', 'meta'] + - ['Pipeline Meta', 'meta'] --- This section documents input and output formats of data used by spaCy, including the [training config](/usage/training#config), training data and lexical vocabulary data. For an overview of label schemes used by the models, see the -[models directory](/models). Each model documents the label schemes used in its -components, depending on the data it was trained on. +[models directory](/models). Each trained pipeline documents the label schemes +used in its components, depending on the data it was trained on. ## Training config {#config new="3"} -Config files define the training process and model pipeline and can be passed to +Config files define the training process and pipeline and can be passed to [`spacy train`](/api/cli#train). They use [Thinc's configuration system](https://thinc.ai/docs/usage-config) under the hood. For details on how to use training configs, see the @@ -74,16 +74,16 @@ your config and check that it's valid, you can run the Defines the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. -| Name | Description | -| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `lang` | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). Defaults to `null`. ~~str~~ | -| `pipeline` | Names of pipeline components in order. Should correspond to sections in the `[components]` block, e.g. `[components.ner]`. See docs on [defining components](/usage/training#config-components). Defaults to `[]`. 
~~List[str]~~ | -| `disabled` | Names of pipeline components that are loaded but disabled by default and not run as part of the pipeline. Should correspond to components listed in `pipeline`. After a model is loaded, disabled components can be enabled using [`Language.enable_pipe`](/api/language#enable_pipe). ~~List[str]~~ | -| `load_vocab_data` | Whether to load additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) if available. Defaults to `true`. ~~bool~~ | -| `before_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `Language` subclass before it's initialized. Defaults to `null`. ~~Optional[Callable[[Type[Language]], Type[Language]]]~~ | -| `after_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object right after it's initialized. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ | -| `after_pipeline_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object after the pipeline components have been added. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ | -| `tokenizer` | The tokenizer to use. Defaults to [`Tokenizer`](/api/tokenizer). ~~Callable[[str], Doc]~~ | +| Name | Description | +| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `lang` | Pipeline language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). Defaults to `null`. ~~str~~ | +| `pipeline` | Names of pipeline components in order. Should correspond to sections in the `[components]` block, e.g. `[components.ner]`. See docs on [defining components](/usage/training#config-components). Defaults to `[]`. ~~List[str]~~ | +| `disabled` | Names of pipeline components that are loaded but disabled by default and not run as part of the pipeline. Should correspond to components listed in `pipeline`. After a pipeline is loaded, disabled components can be enabled using [`Language.enable_pipe`](/api/language#enable_pipe). ~~List[str]~~ | +| `load_vocab_data` | Whether to load additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) if available. Defaults to `true`. ~~bool~~ | +| `before_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `Language` subclass before it's initialized. Defaults to `null`. ~~Optional[Callable[[Type[Language]], Type[Language]]]~~ | +| `after_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object right after it's initialized. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ | +| `after_pipeline_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object after the pipeline components have been added. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ | +| `tokenizer` | The tokenizer to use. Defaults to [`Tokenizer`](/api/tokenizer). ~~Callable[[str], Doc]~~ | ### components {#config-components tag="section"} @@ -105,8 +105,8 @@ This section includes definitions of the [pipeline components](/usage/processing-pipelines) and their models, if available. 
Components in this section can be referenced in the `pipeline` of the `[nlp]` block. Component blocks need to specify either a `factory` (named -function to use to create component) or a `source` (name of path of pretrained -model to copy components from). See the docs on +function to use to create component) or a `source` (name of path of trained +pipeline to copy components from). See the docs on [defining pipeline components](/usage/training#config-components) for details. ### paths, system {#config-variables tag="variables"} @@ -145,7 +145,7 @@ process that are used when you run [`spacy train`](/api/cli#train). | `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ | | `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ | | `train_corpus` | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. Defaults to [`Corpus`](/api/corpus). ~~Callable[[Language], Iterator[Example]]~~ | -| `vectors` | Model name or path to model containing pretrained word vectors to use, e.g. created with [`init model`](/api/cli#init-model). Defaults to `null`. ~~Optional[str]~~ | +| `vectors` | Name or path of pipeline containing pretrained word vectors to use, e.g. created with [`init vocab`](/api/cli#init-vocab). Defaults to `null`. ~~Optional[str]~~ | ### pretraining {#config-pretraining tag="section,optional"} @@ -184,7 +184,7 @@ run [`spacy pretrain`](/api/cli#pretrain). The main data format used in spaCy v3.0 is a **binary format** created by serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc` -objects. This means that you can train spaCy models using the same format it +objects. This means that you can train spaCy pipelines using the same format it outputs: annotated `Doc` objects. The binary format is extremely **efficient in storage**, especially when packing multiple documents together. @@ -286,8 +286,8 @@ a dictionary of gold-standard annotations. [internal training API](/usage/training#api) and they're expected when you call [`nlp.update`](/api/language#update). However, for most use cases, you **shouldn't** have to write your own training scripts. It's recommended to train -your models via the [`spacy train`](/api/cli#train) command with a config file -to keep track of your settings and hyperparameters and your own +your pipelines via the [`spacy train`](/api/cli#train) command with a config +file to keep track of your settings and hyperparameters and your own [registered functions](/usage/training/#custom-code) to customize the setup. @@ -406,15 +406,15 @@ in line-by-line, while still making it easy to represent newlines in the data. ## Lexical data for vocabulary {#vocab-jsonl new="2"} -To populate a model's vocabulary, you can use the -[`spacy init model`](/api/cli#init-model) command and load in a +To populate a pipeline's vocabulary, you can use the +[`spacy init vocab`](/api/cli#init-vocab) command and load in a [newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one lexical entry per line via the `--jsonl-loc` option. The first line defines the language and vocabulary settings. All other lines are expected to be JSON objects describing an individual lexeme. The lexical attributes will be then set as attributes on spaCy's [`Lexeme`](/api/lexeme#attributes) object. 
The `vocab` -command outputs a ready-to-use spaCy model with a `Vocab` containing the lexical -data. +command outputs a ready-to-use spaCy pipeline with a `Vocab` containing the +lexical data. ```python ### First line @@ -459,11 +459,11 @@ Here's an example of the 20 most frequent lexemes in the English training data: https://github.com/explosion/spaCy/tree/master/examples/training/vocab-data.jsonl ``` -## Model meta {#meta} +## Pipeline meta {#meta} -The model meta is available as the file `meta.json` and exported automatically -when you save an `nlp` object to disk. Its contents are available as -[`nlp.meta`](/api/language#meta). +The pipeline meta is available as the file `meta.json` and exported +automatically when you save an `nlp` object to disk. Its contents are available +as [`nlp.meta`](/api/language#meta). @@ -473,8 +473,8 @@ creating a Python package with [`spacy package`](/api/cli#package). How to set up the `nlp` object is now defined in the [`config.cfg`](/api/data-formats#config), which includes detailed information about the pipeline components and their model architectures, and all other -settings and hyperparameters used to train the model. It's the **single source -of truth** used for loading a model. +settings and hyperparameters used to train the pipeline. It's the **single +source of truth** used for loading a pipeline. @@ -482,12 +482,12 @@ of truth** used for loading a model. > > ```json > { -> "name": "example_model", +> "name": "example_pipeline", > "lang": "en", > "version": "1.0.0", > "spacy_version": ">=3.0.0,<3.1.0", > "parent_package": "spacy", -> "description": "Example model for spaCy", +> "description": "Example pipeline for spaCy", > "author": "You", > "email": "you@example.com", > "url": "https://example.com", @@ -510,23 +510,23 @@ of truth** used for loading a model. > } > ``` -| Name | Description | -| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `lang` | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). Defaults to `"en"`. ~~str~~ | -| `name` | Model name, e.g. `"core_web_sm"`. The final model package name will be `{lang}_{name}`. Defaults to `"model"`. ~~str~~ | -| `version` | Model version. Will be used to version a Python package created with [`spacy package`](/api/cli#package). Defaults to `"0.0.0"`. ~~str~~ | -| `spacy_version` | spaCy version range the model is compatible with. Defaults to the spaCy version used to create the model, up to next minor version, which is the default compatibility for the available [pretrained models](/models). For instance, a model trained with v3.0.0 will have the version range `">=3.0.0,<3.1.0"`. ~~str~~ | -| `parent_package` | Name of the spaCy package. Typically `"spacy"` or `"spacy_nightly"`. Defaults to `"spacy"`. ~~str~~ | -| `description` | Model description. Also used for Python package. Defaults to `""`. ~~str~~ | -| `author` | Model author name. Also used for Python package. Defaults to `""`. ~~str~~ | -| `email` | Model author email. Also used for Python package. Defaults to `""`. ~~str~~ | -| `url` | Model author URL. Also used for Python package. Defaults to `""`. ~~str~~ | -| `license` | Model license. Also used for Python package. Defaults to `""`. 
~~str~~ | -| `sources` | Data sources used to train the model. Typically a list of dicts with the keys `"name"`, `"url"`, `"author"` and `"license"`. [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `None`. ~~Optional[List[Dict[str, str]]]~~ | -| `vectors` | Information about the word vectors included with the model. Typically a dict with the keys `"width"`, `"vectors"` (number of vectors), `"keys"` and `"name"`. ~~Dict[str, Any]~~ | -| `pipeline` | Names of pipeline component names in the model, in order. Corresponds to [`nlp.pipe_names`](/api/language#pipe_names). Only exists for reference and is not used to create the components. This information is defined in the [`config.cfg`](/api/data-formats#config). Defaults to `[]`. ~~List[str]~~ | -| `labels` | Label schemes of the trained pipeline components, keyed by component name. Corresponds to [`nlp.pipe_labels`](/api/language#pipe_labels). [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `{}`. ~~Dict[str, Dict[str, List[str]]]~~ | -| `accuracy` | Training accuracy, added automatically by [`spacy train`](/api/cli#train). Dictionary of [score names](/usage/training#metrics) mapped to scores. Defaults to `{}`. ~~Dict[str, Union[float, Dict[str, float]]]~~ | -| `speed` | Model speed, added automatically by [`spacy train`](/api/cli#train). Typically a dictionary with the keys `"cpu"`, `"gpu"` and `"nwords"` (words per second). Defaults to `{}`. ~~Dict[str, Optional[Union[float, str]]]~~ | -| `spacy_git_version` 3 | Git commit of [`spacy`](https://github.com/explosion/spaCy) used to create model. ~~str~~ | -| other | Any other custom meta information you want to add. The data is preserved in [`nlp.meta`](/api/language#meta). ~~Any~~ | +| Name | Description | +| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `lang` | Pipeline language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). Defaults to `"en"`. ~~str~~ | +| `name` | Pipeline name, e.g. `"core_web_sm"`. The final package name will be `{lang}_{name}`. Defaults to `"pipeline"`. ~~str~~ | +| `version` | Pipeline version. Will be used to version a Python package created with [`spacy package`](/api/cli#package). Defaults to `"0.0.0"`. ~~str~~ | +| `spacy_version` | spaCy version range the package is compatible with. Defaults to the spaCy version used to create the pipeline, up to next minor version, which is the default compatibility for the available [trained pipelines](/models). For instance, a pipeline trained with v3.0.0 will have the version range `">=3.0.0,<3.1.0"`. ~~str~~ | +| `parent_package` | Name of the spaCy package. Typically `"spacy"` or `"spacy_nightly"`. Defaults to `"spacy"`. ~~str~~ | +| `description` | Pipeline description. Also used for Python package. Defaults to `""`. ~~str~~ | +| `author` | Pipeline author name. Also used for Python package. Defaults to `""`. ~~str~~ | +| `email` | Pipeline author email. Also used for Python package. Defaults to `""`. ~~str~~ | +| `url` | Pipeline author URL. Also used for Python package. Defaults to `""`. ~~str~~ | +| `license` | Pipeline license. Also used for Python package. Defaults to `""`. 
~~str~~ | +| `sources` | Data sources used to train the pipeline. Typically a list of dicts with the keys `"name"`, `"url"`, `"author"` and `"license"`. [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `None`. ~~Optional[List[Dict[str, str]]]~~ | +| `vectors` | Information about the word vectors included with the pipeline. Typically a dict with the keys `"width"`, `"vectors"` (number of vectors), `"keys"` and `"name"`. ~~Dict[str, Any]~~ | +| `pipeline` | Names of pipeline component names, in order. Corresponds to [`nlp.pipe_names`](/api/language#pipe_names). Only exists for reference and is not used to create the components. This information is defined in the [`config.cfg`](/api/data-formats#config). Defaults to `[]`. ~~List[str]~~ | +| `labels` | Label schemes of the trained pipeline components, keyed by component name. Corresponds to [`nlp.pipe_labels`](/api/language#pipe_labels). [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `{}`. ~~Dict[str, Dict[str, List[str]]]~~ | +| `accuracy` | Training accuracy, added automatically by [`spacy train`](/api/cli#train). Dictionary of [score names](/usage/training#metrics) mapped to scores. Defaults to `{}`. ~~Dict[str, Union[float, Dict[str, float]]]~~ | +| `speed` | Inference speed, added automatically by [`spacy train`](/api/cli#train). Typically a dictionary with the keys `"cpu"`, `"gpu"` and `"nwords"` (words per second). Defaults to `{}`. ~~Dict[str, Optional[Union[float, str]]]~~ | +| `spacy_git_version` 3 | Git commit of [`spacy`](https://github.com/explosion/spaCy) used to create pipeline. ~~str~~ | +| other | Any other custom meta information you want to add. The data is preserved in [`nlp.meta`](/api/language#meta). ~~Any~~ | diff --git a/website/docs/api/dependencymatcher.md b/website/docs/api/dependencymatcher.md index 2fb903100..b0395cc42 100644 --- a/website/docs/api/dependencymatcher.md +++ b/website/docs/api/dependencymatcher.md @@ -9,8 +9,8 @@ The `DependencyMatcher` follows the same API as the [`Matcher`](/api/matcher) and [`PhraseMatcher`](/api/phrasematcher) and lets you match on dependency trees using the [Semgrex syntax](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). -It requires a pretrained [`DependencyParser`](/api/parser) or other component -that sets the `Token.dep` attribute. +It requires a trained [`DependencyParser`](/api/parser) or other component that +sets the `Token.dep` attribute. ## Pattern format {#patterns} diff --git a/website/docs/api/entitylinker.md b/website/docs/api/entitylinker.md index 679c3c0c2..637bd3c68 100644 --- a/website/docs/api/entitylinker.md +++ b/website/docs/api/entitylinker.md @@ -13,8 +13,8 @@ An `EntityLinker` component disambiguates textual mentions (tagged as named entities) to unique identifiers, grounding the named entities into the "real world". It requires a `KnowledgeBase`, as well as a function to generate plausible candidates from that `KnowledgeBase` given a certain textual mention, -and a ML model to pick the right candidate, given the local context of the -mention. +and a machine learning model to pick the right candidate, given the local +context of the mention. 
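To make this concrete, here's a minimal, self-contained sketch of the kind of
`KnowledgeBase` the component relies on. The entity ID `Q42`, the alias and all
numbers are made-up toy values for illustration only – in practice you'd build
the knowledge base from real data and train the entity linker on annotated
examples.

```python
### Toy KnowledgeBase sketch
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.blank("en")
# Toy KB: one entity with a 3-dimensional vector and one alias pointing to it.
# The ID, frequency, vector and prior probability are illustration values only.
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", freq=12, entity_vector=[1.0, 2.0, 3.0])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])
print(kb.get_size_entities(), kb.get_size_aliases())  # 1 1
```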
## Config and implementation {#config} diff --git a/website/docs/api/language.md b/website/docs/api/language.md index e2668c522..d65b217a4 100644 --- a/website/docs/api/language.md +++ b/website/docs/api/language.md @@ -7,9 +7,9 @@ source: spacy/language.py Usually you'll load this once per process as `nlp` and pass the instance around your application. The `Language` class is created when you call -[`spacy.load()`](/api/top-level#spacy.load) and contains the shared vocabulary -and [language data](/usage/adding-languages), optional model data loaded from a -[model package](/models) or a path, and a +[`spacy.load`](/api/top-level#spacy.load) and contains the shared vocabulary and +[language data](/usage/adding-languages), optional binary weights, e.g. provided +by a [trained pipeline](/models), and the [processing pipeline](/usage/processing-pipelines) containing components like the tagger or parser that are called on a document in order. You can also add your own processing pipeline components that take a `Doc` object, modify it and @@ -37,7 +37,7 @@ Initialize a `Language` object. | `vocab` | A `Vocab` object. If `True`, a vocab is created using the default language data settings. ~~Vocab~~ | | _keyword-only_ | | | `max_length` | Maximum number of characters allowed in a single text. Defaults to `10 ** 6`. ~~int~~ | -| `meta` | Custom meta data for the `Language` class. Is written to by models to add model meta data. ~~dict~~ | +| `meta` | Custom meta data for the `Language` class. Is written to by pipelines to add meta data. ~~dict~~ | | `create_tokenizer` | Optional function that receives the `nlp` object and returns a tokenizer. ~~Callable[[Language], Callable[[str], Doc]]~~ | ## Language.from_config {#from_config tag="classmethod" new="3"} @@ -232,7 +232,7 @@ tuples of `Doc` and `GoldParse` objects. ## Language.resume_training {#resume_training tag="method,experimental" new="3"} -Continue training a pretrained model. Create and return an optimizer, and +Continue training a trained pipeline. Create and return an optimizer, and initialize "rehearsal" for any pipeline component that has a `rehearse` method. Rehearsal is used to prevent models from "forgetting" their initialized "knowledge". To perform rehearsal, collect samples of text you want the models @@ -314,7 +314,7 @@ the "catastrophic forgetting" problem. This feature is experimental. ## Language.evaluate {#evaluate tag="method"} -Evaluate a model's pipeline components. +Evaluate a pipeline's components. @@ -386,24 +386,24 @@ component, adds it to the pipeline and returns it. > nlp.add_pipe("component", before="ner") > component = nlp.add_pipe("component", name="custom_name", last=True) > -> # Add component from source model +> # Add component from source pipeline > source_nlp = spacy.load("en_core_web_sm") > nlp.add_pipe("ner", source=source_nlp) > ``` -| Name | Description | -| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `factory_name` | Name of the registered component factory. ~~str~~ | -| `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. 
~~Optional[str]~~ | -| _keyword-only_ | | -| `before` | Component name or index to insert component directly before. ~~Optional[Union[str, int]]~~ | -| `after` | Component name or index to insert component directly after. ~~Optional[Union[str, int]]~~ | -| `first` | Insert component first / not first in the pipeline. ~~Optional[bool]~~ | -| `last` | Insert component last / not last in the pipeline. ~~Optional[bool]~~ | -| `config` 3 | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ | -| `source` 3 | Optional source model to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source model match the target model. ~~Optional[Language]~~ | -| `validate` 3 | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ | -| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ | +| Name | Description | +| ------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `factory_name` | Name of the registered component factory. ~~str~~ | +| `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. ~~Optional[str]~~ | +| _keyword-only_ | | +| `before` | Component name or index to insert component directly before. ~~Optional[Union[str, int]]~~ | +| `after` | Component name or index to insert component directly after. ~~Optional[Union[str, int]]~~ | +| `first` | Insert component first / not first in the pipeline. ~~Optional[bool]~~ | +| `last` | Insert component last / not last in the pipeline. ~~Optional[bool]~~ | +| `config` 3 | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ | +| `source` 3 | Optional source pipeline to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source pipeline match the target pipeline. ~~Optional[Language]~~ | +| `validate` 3 | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ | +| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ | ## Language.create_pipe {#create_pipe tag="method" new="2"} @@ -790,9 +790,10 @@ token.ent_iob, token.ent_type ## Language.meta {#meta tag="property"} -Custom meta data for the Language class. If a model is loaded, contains meta -data of the model. The `Language.meta` is also what's serialized as the -[`meta.json`](/api/data-formats#meta) when you save an `nlp` object to disk. +Custom meta data for the Language class. If a trained pipeline is loaded, this +contains meta data of the pipeline. The `Language.meta` is also what's +serialized as the [`meta.json`](/api/data-formats#meta) when you save an `nlp` +object to disk. > #### Example > @@ -827,13 +828,13 @@ subclass of the built-in `dict`. 
It supports the additional methods `to_disk` ## Language.to_disk {#to_disk tag="method" new="2"} -Save the current state to a directory. If a model is loaded, this will **include -the model**. +Save the current state to a directory. If a trained pipeline is loaded, this +will **include all model data**. > #### Example > > ```python -> nlp.to_disk("/path/to/models") +> nlp.to_disk("/path/to/pipeline") > ``` | Name | Description | @@ -844,22 +845,28 @@ the model**. ## Language.from_disk {#from_disk tag="method" new="2"} -Loads state from a directory. Modifies the object in place and returns it. If -the saved `Language` object contains a model, the model will be loaded. Note -that this method is commonly used via the subclasses like `English` or `German` -to make language-specific functionality like the -[lexical attribute getters](/usage/adding-languages#lex-attrs) available to the -loaded object. +Loads state from a directory, including all data that was saved with the +`Language` object. Modifies the object in place and returns it. + + + +Keep in mind that this method **only loads serialized state** and doesn't set up +the `nlp` object. This means that it requires the correct language class to be +initialized and all pipeline components to be added to the pipeline. If you want +to load a serialized pipeline from a directory, you should use +[`spacy.load`](/api/top-level#spacy.load), which will set everything up for you. + + > #### Example > > ```python > from spacy.language import Language -> nlp = Language().from_disk("/path/to/model") +> nlp = Language().from_disk("/path/to/pipeline") > -> # using language-specific subclass +> # Using language-specific subclass > from spacy.lang.en import English -> nlp = English().from_disk("/path/to/en_model") +> nlp = English().from_disk("/path/to/pipeline") > ``` | Name | Description | @@ -924,7 +931,7 @@ available to the loaded object. | `components` 3 | List of all available `(name, component)` tuples, including components that are currently disabled. ~~List[Tuple[str, Callable[[Doc], Doc]]]~~ | | `component_names` 3 | List of all available component names, including components that are currently disabled. ~~List[str]~~ | | `disabled` 3 | Names of components that are currently disabled and don't run as part of the pipeline. ~~List[str]~~ | -| `path` 2 | Path to the model data directory, if a model is loaded. Otherwise `None`. ~~Optional[Path]~~ | +| `path` 2 | Path to the pipeline data directory, if a pipeline is loaded from a path or package. Otherwise `None`. ~~Optional[Path]~~ | ## Class attributes {#class-attributes} @@ -1004,7 +1011,7 @@ serialization by passing in the string names via the `exclude` argument. > > ```python > data = nlp.to_bytes(exclude=["tokenizer", "vocab"]) -> nlp.from_disk("./model-data", exclude=["ner"]) +> nlp.from_disk("/pipeline", exclude=["ner"]) > ``` | Name | Description | diff --git a/website/docs/api/pipe.md b/website/docs/api/pipe.md index 9c3a4104e..57b2af44d 100644 --- a/website/docs/api/pipe.md +++ b/website/docs/api/pipe.md @@ -286,7 +286,7 @@ context, the original parameters are restored. ## Pipe.add_label {#add_label tag="method"} -Add a new label to the pipe. It's possible to extend pretrained models with new +Add a new label to the pipe. It's possible to extend trained models with new labels, but care should be taken to avoid the "catastrophic forgetting" problem. 
> #### Example diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index d437ecc07..6e52585ee 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -12,14 +12,14 @@ menu: ## spaCy {#spacy hidden="true"} -### spacy.load {#spacy.load tag="function" model="any"} +### spacy.load {#spacy.load tag="function"} -Load a model using the name of an installed -[model package](/usage/training#models-generating), a string path or a -`Path`-like object. spaCy will try resolving the load argument in this order. If -a model is loaded from a model name, spaCy will assume it's a Python package and -import it and call the model's own `load()` method. If a model is loaded from a -path, spaCy will assume it's a data directory, load its +Load a pipeline using the name of an installed +[package](/usage/saving-loading#models), a string path or a `Path`-like object. +spaCy will try resolving the load argument in this order. If a pipeline is +loaded from a string name, spaCy will assume it's a Python package and import it +and call the package's own `load()` method. If a pipeline is loaded from a path, +spaCy will assume it's a data directory, load its [`config.cfg`](/api/data-formats#config) and use the language and pipeline information to construct the `Language` class. The data will be loaded in via [`Language.from_disk`](/api/language#from_disk). @@ -36,38 +36,38 @@ specified separately using the new `exclude` keyword argument. > > ```python > nlp = spacy.load("en_core_web_sm") # package -> nlp = spacy.load("/path/to/en") # string path -> nlp = spacy.load(Path("/path/to/en")) # pathlib Path +> nlp = spacy.load("/path/to/pipeline") # string path +> nlp = spacy.load(Path("/path/to/pipeline")) # pathlib Path > > nlp = spacy.load("en_core_web_sm", exclude=["parser", "tagger"]) > ``` | Name | Description | | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `name` | Model to load, i.e. package name or path. ~~Union[str, Path]~~ | +| `name` | Pipeline to load, i.e. package name or path. ~~Union[str, Path]~~ | | _keyword-only_ | | | `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ | | `exclude` 3 | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ | | `config` 3 | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~ | -| **RETURNS** | A `Language` object with the loaded model. ~~Language~~ | +| **RETURNS** | A `Language` object with the loaded pipeline. ~~Language~~ | -Essentially, `spacy.load()` is a convenience wrapper that reads the model's +Essentially, `spacy.load()` is a convenience wrapper that reads the pipeline's [`config.cfg`](/api/data-formats#config), uses the language and pipeline information to construct a `Language` object, loads in the model data and -returns it. +weights, and returns it. ```python ### Abstract example -cls = util.get_lang_class(lang) # get language for ID, e.g. 
"en" -nlp = cls() # initialize the language +cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English +nlp = cls() # 2. Initialize it for name in pipeline: - nlp.add_pipe(name) # add component to pipeline -nlp.from_disk(model_data_path) # load in model data + nlp.add_pipe(name) # 3. Add the component to the pipeline +nlp.from_disk(data_path) # 4. Load in the binary data ``` ### spacy.blank {#spacy.blank tag="function" new="2"} -Create a blank model of a given language class. This function is the twin of +Create a blank pipeline of a given language class. This function is the twin of `spacy.load()`. > #### Example @@ -85,9 +85,7 @@ Create a blank model of a given language class. This function is the twin of ### spacy.info {#spacy.info tag="function"} The same as the [`info` command](/api/cli#info). Pretty-print information about -your installation, models and local setup from within spaCy. To get the model -meta data as a dictionary instead, you can use the `meta` attribute on your -`nlp` object with a loaded model, e.g. `nlp.meta`. +your installation, installed pipelines and local setup from within spaCy. > #### Example > @@ -97,12 +95,12 @@ meta data as a dictionary instead, you can use the `meta` attribute on your > markdown = spacy.info(markdown=True, silent=True) > ``` -| Name | Description | -| -------------- | ------------------------------------------------------------------ | -| `model` | A model, i.e. a package name or path (optional). ~~Optional[str]~~ | -| _keyword-only_ | | -| `markdown` | Print information as Markdown. ~~bool~~ | -| `silent` | Don't print anything, just return. ~~bool~~ | +| Name | Description | +| -------------- | ---------------------------------------------------------------------------- | +| `model` | Optional pipeline, i.e. a package name or path (optional). ~~Optional[str]~~ | +| _keyword-only_ | | +| `markdown` | Print information as Markdown. ~~bool~~ | +| `silent` | Don't print anything, just return. ~~bool~~ | ### spacy.explain {#spacy.explain tag="function"} @@ -133,7 +131,7 @@ list of available terms, see Allocate data and perform operations on [GPU](/usage/#gpu), if available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and _before_ loading any -models. +pipelines. > #### Example > @@ -152,7 +150,7 @@ models. Allocate data and perform operations on [GPU](/usage/#gpu). Will raise an error if no GPU is available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy -and _before_ loading any models. +and _before_ loading any pipelines. > #### Example > @@ -271,9 +269,9 @@ If a setting is not present in the options, the default value will be used. | `template` 2.2 | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ | By default, displaCy comes with colors for all entity types used by -[spaCy models](/models). If you're using custom entity types, you can use the -`colors` setting to add your own colors for them. Your application or model -package can also expose a +[spaCy's trained pipelines](/models). If you're using custom entity types, you +can use the `colors` setting to add your own colors for them. 
Your application +or pipeline package can also expose a [`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy) to add custom labels and their colors automatically. @@ -666,8 +664,8 @@ loaded lazily, to avoid expensive setup code associated with the language data. ### util.load_model {#util.load_model tag="function" new="2"} -Load a model from a package or data path. If called with a package name, spaCy -will assume the model is a Python package and import and call its `load()` +Load a pipeline from a package or data path. If called with a string name, spaCy +will assume the pipeline is a Python package and import and call its `load()` method. If called with a path, spaCy will assume it's a data directory, read the language and pipeline settings from the [`config.cfg`](/api/data-formats#config) and create a `Language` object. The model data will then be loaded in via @@ -683,16 +681,16 @@ and create a `Language` object. The model data will then be loaded in via | Name | Description | | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `name` | Package name or model path. ~~str~~ | +| `name` | Package name or path. ~~str~~ | | `vocab` 3 | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. | | `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ | | `exclude` 3 | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ | | `config` 3 | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ | -| **RETURNS** | `Language` class with the loaded model. ~~Language~~ | +| **RETURNS** | `Language` class with the loaded pipeline. ~~Language~~ | ### util.load_model_from_init_py {#util.load_model_from_init_py tag="function" new="2"} -A helper function to use in the `load()` method of a model package's +A helper function to use in the `load()` method of a pipeline package's [`__init__.py`](https://github.com/explosion/spacy-models/tree/master/template/model/xx_model_name/__init__.py). > #### Example @@ -706,70 +704,72 @@ A helper function to use in the `load()` method of a model package's | Name | Description | | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `init_file` | Path to model's `__init__.py`, i.e. `__file__`. ~~Union[str, Path]~~ | +| `init_file` | Path to package's `__init__.py`, i.e. `__file__`. ~~Union[str, Path]~~ | | `vocab` 3 | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. | | `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). 
Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ | | `exclude` 3 | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ | | `config` 3 | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ | -| **RETURNS** | `Language` class with the loaded model. ~~Language~~ | +| **RETURNS** | `Language` class with the loaded pipeline. ~~Language~~ | ### util.load_config {#util.load_config tag="function" new="3"} -Load a model's [`config.cfg`](/api/data-formats#config) from a file path. The -config typically includes details about the model pipeline and how its -components are created, as well as all training settings and hyperparameters. +Load a pipeline's [`config.cfg`](/api/data-formats#config) from a file path. The +config typically includes details about the components and how they're created, +as well as all training settings and hyperparameters. > #### Example > > ```python -> config = util.load_config("/path/to/model/config.cfg") +> config = util.load_config("/path/to/config.cfg") > print(config.to_str()) > ``` | Name | Description | | ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `path` | Path to the model's `config.cfg`. ~~Union[str, Path]~~ | +| `path` | Path to the pipeline's `config.cfg`. ~~Union[str, Path]~~ | | `overrides` | Optional config overrides to replace in loaded config. Can be provided as nested dict, or as flat dict with keys in dot notation, e.g. `"nlp.pipeline"`. ~~Dict[str, Any]~~ | | `interpolate` | Whether to interpolate the config and replace variables like `${paths.train}` with their values. Defaults to `False`. ~~bool~~ | -| **RETURNS** | The model's config. ~~Config~~ | +| **RETURNS** | The pipeline's config. ~~Config~~ | ### util.load_meta {#util.load_meta tag="function" new="3"} -Get a model's [`meta.json`](/api/data-formats#meta) from a file path and -validate its contents. +Get a pipeline's [`meta.json`](/api/data-formats#meta) from a file path and +validate its contents. The meta typically includes details about author, +licensing, data sources and version. > #### Example > > ```python -> meta = util.load_meta("/path/to/model/meta.json") +> meta = util.load_meta("/path/to/meta.json") > ``` -| Name | Description | -| ----------- | ----------------------------------------------------- | -| `path` | Path to the model's `meta.json`. ~~Union[str, Path]~~ | -| **RETURNS** | The model's meta data. ~~Dict[str, Any]~~ | +| Name | Description | +| ----------- | -------------------------------------------------------- | +| `path` | Path to the pipeline's `meta.json`. ~~Union[str, Path]~~ | +| **RETURNS** | The pipeline's meta data. ~~Dict[str, Any]~~ | ### util.get_installed_models {#util.get_installed_models tag="function" new="3"} -List all model packages installed in the current environment. This will include -any spaCy model that was packaged with [`spacy package`](/api/cli#package). -Under the hood, model packages expose a Python entry point that spaCy can check, -without having to load the model. +List all pipeline packages installed in the current environment. 
This will +include any spaCy pipeline that was packaged with +[`spacy package`](/api/cli#package). Under the hood, pipeline packages expose a +Python entry point that spaCy can check, without having to load the `nlp` +object. > #### Example > > ```python -> model_names = util.get_installed_models() +> names = util.get_installed_models() > ``` -| Name | Description | -| ----------- | ---------------------------------------------------------------------------------- | -| **RETURNS** | The string names of the models installed in the current environment. ~~List[str]~~ | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------- | +| **RETURNS** | The string names of the pipelines installed in the current environment. ~~List[str]~~ | ### util.is_package {#util.is_package tag="function"} Check if string maps to a package installed via pip. Mainly used to validate -[model packages](/usage/models). +[pipeline packages](/usage/models). > #### Example > @@ -786,7 +786,8 @@ Check if string maps to a package installed via pip. Mainly used to validate ### util.get_package_path {#util.get_package_path tag="function" new="2"} Get path to an installed package. Mainly used to resolve the location of -[model packages](/usage/models). Currently imports the package to find its path. +[pipeline packages](/usage/models). Currently imports the package to find its +path. > #### Example > @@ -795,10 +796,10 @@ Get path to an installed package. Mainly used to resolve the location of > # /usr/lib/python3.6/site-packages/en_core_web_sm > ``` -| Name | Description | -| -------------- | ----------------------------------------- | -| `package_name` | Name of installed package. ~~str~~ | -| **RETURNS** | Path to model package directory. ~~Path~~ | +| Name | Description | +| -------------- | -------------------------------------------- | +| `package_name` | Name of installed package. ~~str~~ | +| **RETURNS** | Path to pipeline package directory. ~~Path~~ | ### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"} diff --git a/website/docs/models/index.md b/website/docs/models/index.md index d5f87d3b5..64e719f37 100644 --- a/website/docs/models/index.md +++ b/website/docs/models/index.md @@ -1,6 +1,6 @@ --- -title: Models -teaser: Downloadable pretrained models for spaCy +title: Trained Models & Pipelines +teaser: Downloadable trained pipelines and weights for spaCy menu: - ['Quickstart', 'quickstart'] - ['Conventions', 'conventions'] @@ -8,15 +8,15 @@ menu: -The models directory includes two types of pretrained models: +This directory includes two types of packages: -1. **Core models:** General-purpose pretrained models to predict named entities, - part-of-speech tags and syntactic dependencies. Can be used out-of-the-box - and fine-tuned on more specific data. -2. **Starter models:** Transfer learning starter packs with pretrained weights - you can initialize your models with to achieve better accuracy. They can +1. **Trained pipelines:** General-purpose spaCy pipelines to predict named + entities, part-of-speech tags and syntactic dependencies. Can be used + out-of-the-box and fine-tuned on more specific data. +2. **Starters:** Transfer learning starter packs with pretrained weights you can + initialize your pipeline models with to achieve better accuracy. They can include word vectors (which will be used as features during training) or - other pretrained representations like BERT. 
These models don't include + other pretrained representations like BERT. These packages don't include components for specific tasks like NER or text classification and are intended to be used as base models when training your own models. @@ -28,43 +28,42 @@ import QuickstartModels from 'widgets/quickstart-models.js' -For more details on how to use models with spaCy, see the -[usage guide on models](/usage/models). +For more details on how to use trained pipelines with spaCy, see the +[usage guide](/usage/models). -## Model naming conventions {#conventions} +## Package naming conventions {#conventions} -In general, spaCy expects all model packages to follow the naming convention of -`[lang`\_[name]]. For spaCy's models, we also chose to divide the name into -three components: +In general, spaCy expects all pipeline packages to follow the naming convention +of `[lang`\_[name]]. For spaCy's pipelines, we also chose to divide the name +into three components: -1. **Type:** Model capabilities (e.g. `core` for general-purpose model with +1. **Type:** Capabilities (e.g. `core` for general-purpose pipeline with vocabulary, syntax, entities and word vectors, or `depent` for only vocab, syntax and entities). -2. **Genre:** Type of text the model is trained on, e.g. `web` or `news`. -3. **Size:** Model size indicator, `sm`, `md` or `lg`. +2. **Genre:** Type of text the pipeline is trained on, e.g. `web` or `news`. +3. **Size:** Package size indicator, `sm`, `md` or `lg`. For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English -model trained on written web text (blogs, news, comments), that includes +pipeline trained on written web text (blogs, news, comments), that includes vocabulary, vectors, syntax and entities. -### Model versioning {#model-versioning} +### Package versioning {#model-versioning} -Additionally, the model versioning reflects both the compatibility with spaCy, -as well as the major and minor model version. A model version `a.b.c` translates -to: +Additionally, the pipeline package versioning reflects both the compatibility +with spaCy, as well as the major and minor version. A package version `a.b.c` +translates to: - `a`: **spaCy major version**. For example, `2` for spaCy v2.x. -- `b`: **Model major version**. Models with a different major version can't be - loaded by the same code. For example, changing the width of the model, adding - hidden layers or changing the activation changes the model major version. -- `c`: **Model minor version**. Same model structure, but different parameter - values, e.g. from being trained on different data, for different numbers of - iterations, etc. +- `b`: **Package major version**. Pipelines with a different major version can't + be loaded by the same code. For example, changing the width of the model, + adding hidden layers or changing the activation changes the major version. +- `c`: **Package minor version**. Same pipeline structure, but different + parameter values, e.g. from being trained on different data, for different + numbers of iterations, etc. For a detailed compatibility overview, see the -[`compatibility.json`](https://github.com/explosion/spacy-models/tree/master/compatibility.json) -in the models repository. This is also the source of spaCy's internal -compatibility check, performed when you run the [`download`](/api/cli#download) -command. +[`compatibility.json`](https://github.com/explosion/spacy-models/tree/master/compatibility.json). 
+This is also the source of spaCy's internal compatibility check, performed when +you run the [`download`](/api/cli#download) command. diff --git a/website/docs/usage/101/_pipelines.md b/website/docs/usage/101/_pipelines.md index 0aa821223..9a63ee42d 100644 --- a/website/docs/usage/101/_pipelines.md +++ b/website/docs/usage/101/_pipelines.md @@ -1,9 +1,9 @@ When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc` object. The `Doc` is then processed in several different steps – this is also referred to as the **processing pipeline**. The pipeline used by the -[default models](/models) typically include a tagger, a lemmatizer, a parser and -an entity recognizer. Each pipeline component returns the processed `Doc`, which -is then passed on to the next component. +[trained pipelines](/models) typically include a tagger, a lemmatizer, a parser +and an entity recognizer. Each pipeline component returns the processed `Doc`, +which is then passed on to the next component. ![The processing pipeline](../../images/pipeline.svg) @@ -23,14 +23,15 @@ is then passed on to the next component. | **textcat** | [`TextCategorizer`](/api/textcategorizer) | `Doc.cats` | Assign document labels. | | **custom** | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx` | Assign custom attributes, methods or properties. | -The processing pipeline always **depends on the statistical model** and its -capabilities. For example, a pipeline can only include an entity recognizer -component if the model includes data to make predictions of entity labels. This -is why each model will specify the pipeline to use in its meta data and -[config](/usage/training#config), as a simple list containing the component -names: +The capabilities of a processing pipeline always depend on the components, their +models and how they were trained. For example, a pipeline for named entity +recognition needs to include a trained named entity recognizer component with a +statistical model and weights that enable it to **make predictions** of entity +labels. This is why each pipeline specifies its components and their settings in +the [config](/usage/training#config): ```ini +[nlp] pipeline = ["tagger", "parser", "ner"] ``` diff --git a/website/docs/usage/101/_pos-deps.md b/website/docs/usage/101/_pos-deps.md index 1e8960edf..a531b245e 100644 --- a/website/docs/usage/101/_pos-deps.md +++ b/website/docs/usage/101/_pos-deps.md @@ -1,9 +1,9 @@ After tokenization, spaCy can **parse** and **tag** a given `Doc`. This is where -the statistical model comes in, which enables spaCy to **make a prediction** of -which tag or label most likely applies in this context. A model consists of -binary data and is produced by showing a system enough examples for it to make -predictions that generalize across the language – for example, a word following -"the" in English is most likely a noun. +the trained pipeline and its statistical models come in, which enable spaCy to +**make predictions** of which tag or label most likely applies in this context. +A trained component includes binary data that is produced by showing a system +enough examples for it to make predictions that generalize across the language – +for example, a word following "the" in English is most likely a noun. Linguistic annotations are available as [`Token` attributes](/api/token#attributes). Like many NLP libraries, spaCy @@ -25,7 +25,8 @@ for token in doc: > - **Text:** The original word text. 
> - **Lemma:** The base form of the word. -> - **POS:** The simple [UPOS](https://universaldependencies.org/docs/u/pos/) part-of-speech tag. +> - **POS:** The simple [UPOS](https://universaldependencies.org/docs/u/pos/) +> part-of-speech tag. > - **Tag:** The detailed part-of-speech tag. > - **Dep:** Syntactic dependency, i.e. the relation between tokens. > - **Shape:** The word shape – capitalization, punctuation, digits. diff --git a/website/docs/usage/101/_serialization.md b/website/docs/usage/101/_serialization.md index 01a9c39d1..ce34ea6e9 100644 --- a/website/docs/usage/101/_serialization.md +++ b/website/docs/usage/101/_serialization.md @@ -1,9 +1,9 @@ If you've been modifying the pipeline, vocabulary, vectors and entities, or made -updates to the model, you'll eventually want to **save your progress** – for -example, everything that's in your `nlp` object. This means you'll have to -translate its contents and structure into a format that can be saved, like a -file or a byte string. This process is called serialization. spaCy comes with -**built-in serialization methods** and supports the +updates to the component models, you'll eventually want to **save your +progress** – for example, everything that's in your `nlp` object. This means +you'll have to translate its contents and structure into a format that can be +saved, like a file or a byte string. This process is called serialization. spaCy +comes with **built-in serialization methods** and supports the [Pickle protocol](https://www.diveinto.org/python3/serializing.html#dump). > #### What's pickle? diff --git a/website/docs/usage/101/_training.md b/website/docs/usage/101/_training.md index 4573f5ea3..b73a83d6a 100644 --- a/website/docs/usage/101/_training.md +++ b/website/docs/usage/101/_training.md @@ -1,25 +1,25 @@ spaCy's tagger, parser, text categorizer and many other components are powered by **statistical models**. Every "decision" these components make – for example, which part-of-speech tag to assign, or whether a word is a named entity – is a -**prediction** based on the model's current **weight values**. The weight -values are estimated based on examples the model has seen -during **training**. To train a model, you first need training data – examples -of text, and the labels you want the model to predict. This could be a -part-of-speech tag, a named entity or any other information. +**prediction** based on the model's current **weight values**. The weight values +are estimated based on examples the model has seen during **training**. To train +a model, you first need training data – examples of text, and the labels you +want the model to predict. This could be a part-of-speech tag, a named entity or +any other information. -Training is an iterative process in which the model's predictions are compared +Training is an iterative process in which the model's predictions are compared against the reference annotations in order to estimate the **gradient of the loss**. The gradient of the loss is then used to calculate the gradient of the weights through [backpropagation](https://thinc.ai/backprop101). The gradients -indicate how the weight values should be changed so that the model's -predictions become more similar to the reference labels over time. +indicate how the weight values should be changed so that the model's predictions +become more similar to the reference labels over time. > - **Training data:** Examples and their annotations. > - **Text:** The input text the model should predict a label for. 
> - **Label:** The label the model should predict. > - **Gradient:** The direction and rate of change for a numeric value. -> Minimising the gradient of the weights should result in predictions that -> are closer to the reference labels on the training data. +> Minimising the gradient of the weights should result in predictions that are +> closer to the reference labels on the training data. ![The training process](../../images/training.svg) diff --git a/website/docs/usage/101/_vectors-similarity.md b/website/docs/usage/101/_vectors-similarity.md index 92df1b331..cf5b70af2 100644 --- a/website/docs/usage/101/_vectors-similarity.md +++ b/website/docs/usage/101/_vectors-similarity.md @@ -24,12 +24,12 @@ array([2.02280000e-01, -7.66180009e-02, 3.70319992e-01, -To make them compact and fast, spaCy's small [models](/models) (all packages -that end in `sm`) **don't ship with word vectors**, and only include +To make them compact and fast, spaCy's small [pipeline packages](/models) (all +packages that end in `sm`) **don't ship with word vectors**, and only include context-sensitive **tensors**. This means you can still use the `similarity()` methods to compare documents, spans and tokens – but the result won't be as good, and individual tokens won't have any vectors assigned. So in order to use -_real_ word vectors, you need to download a larger model: +_real_ word vectors, you need to download a larger pipeline package: ```diff - python -m spacy download en_core_web_sm @@ -38,11 +38,11 @@ _real_ word vectors, you need to download a larger model: -Models that come with built-in word vectors make them available as the -[`Token.vector`](/api/token#vector) attribute. [`Doc.vector`](/api/doc#vector) -and [`Span.vector`](/api/span#vector) will default to an average of their token -vectors. You can also check if a token has a vector assigned, and get the L2 -norm, which can be used to normalize vectors. +Pipeline packages that come with built-in word vectors make them available as +the [`Token.vector`](/api/token#vector) attribute. +[`Doc.vector`](/api/doc#vector) and [`Span.vector`](/api/span#vector) will +default to an average of their token vectors. You can also check if a token has +a vector assigned, and get the L2 norm, which can be used to normalize vectors. ```python ### {executable="true"} @@ -62,12 +62,12 @@ for token in tokens: > - **OOV**: Out-of-vocabulary The words "dog", "cat" and "banana" are all pretty common in English, so they're -part of the model's vocabulary, and come with a vector. The word "afskfsd" on +part of the pipeline's vocabulary, and come with a vector. The word "afskfsd" on the other hand is a lot less common and out-of-vocabulary – so its vector representation consists of 300 dimensions of `0`, which means it's practically nonexistent. If your application will benefit from a **large vocabulary** with -more vectors, you should consider using one of the larger models or loading in a -full vector package, for example, +more vectors, you should consider using one of the larger pipeline packages or +loading in a full vector package, for example, [`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg), which includes over **1 million unique vectors**. @@ -82,7 +82,7 @@ Each [`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) and method that lets you compare it with another object, and determine the similarity. Of course similarity is always subjective – whether two words, spans or documents are similar really depends on how you're looking at it. 
spaCy's -similarity model usually assumes a pretty general-purpose definition of +similarity implementation usually assumes a pretty general-purpose definition of similarity. > #### 📝 Things to try @@ -99,7 +99,7 @@ similarity. ### {executable="true"} import spacy -nlp = spacy.load("en_core_web_md") # make sure to use larger model! +nlp = spacy.load("en_core_web_md") # make sure to use larger package! doc1 = nlp("I like salty fries and hamburgers.") doc2 = nlp("Fast food tastes very good.") @@ -143,10 +143,9 @@ us that builds on top of spaCy and lets you train and query more interesting and detailed word vectors. It combines noun phrases like "fast food" or "fair game" and includes the part-of-speech tags and entity labels. The library also includes annotation recipes for our annotation tool [Prodigy](https://prodi.gy) -that let you evaluate vector models and create terminology lists. For more -details, check out -[our blog post](https://explosion.ai/blog/sense2vec-reloaded). To explore the -semantic similarities across all Reddit comments of 2015 and 2019, see the -[interactive demo](https://explosion.ai/demos/sense2vec). +that let you evaluate vectors and create terminology lists. For more details, +check out [our blog post](https://explosion.ai/blog/sense2vec-reloaded). To +explore the semantic similarities across all Reddit comments of 2015 and 2019, +see the [interactive demo](https://explosion.ai/demos/sense2vec). diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md index 76858213c..ee5fd0a3b 100644 --- a/website/docs/usage/index.md +++ b/website/docs/usage/index.md @@ -35,10 +35,10 @@ Using pip, spaCy releases are available as source packages and binary wheels. $ pip install -U spacy ``` -> #### Download models +> #### Download pipelines > -> After installation you need to download a language model. For more info and -> available models, see the [docs on models](/models). +> After installation you typically want to download a trained pipeline. For more +> info and available packages, see the [models directory](/models). > > ```cli > $ python -m spacy download en_core_web_sm @@ -54,7 +54,7 @@ To install additional data tables for lemmatization you can run [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) separately. The lookups package is needed to provide normalization and lemmatization data for new models and to lemmatize in languages that don't yet -come with pretrained models and aren't powered by third-party libraries. +come with trained pipelines and aren't powered by third-party libraries. @@ -88,23 +88,21 @@ and pull requests to the recipe and setup are always appreciated. > spaCy v2.x to v3.x may still require some changes to your code base. For > details see the sections on [backwards incompatibilities](/usage/v3#incompat) > and [migrating](/usage/v3#migrating). Also remember to download the new -> models, and retrain your own models. +> trained pipelines, and retrain your own pipelines. When updating to a newer version of spaCy, it's generally recommended to start with a clean virtual environment. If you're upgrading to a new major version, -make sure you have the latest **compatible models** installed, and that there -are no old and incompatible model packages left over in your environment, as -this can often lead to unexpected results and errors. If you've trained your own -models, keep in mind that your train and runtime inputs must match. This means -you'll have to **retrain your models** with the new version. 
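If you want to check what's installed from Python rather than the command line,
a rough sketch using the utilities documented above:

```python
### Inspect installed pipeline packages (sketch)
import spacy
from spacy import util

# Print each installed pipeline package, its version and the spaCy version
# range it declares in its meta data. Loading each package just to read the
# meta is heavyweight, but shows where the information lives.
for name in util.get_installed_models():
    nlp = spacy.load(name)
    print(name, nlp.meta["version"], "built for spaCy", nlp.meta["spacy_version"])
```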
+make sure you have the latest **compatible trained pipelines** installed, and +that there are no old and incompatible packages left over in your environment, +as this can often lead to unexpected results and errors. If you've trained your +own models, keep in mind that your train and runtime inputs must match. This +means you'll have to **retrain your pipelines** with the new version. spaCy also provides a [`validate`](/api/cli#validate) command, which lets you -verify that all installed models are compatible with your spaCy version. If -incompatible models are found, tips and installation instructions are printed. -The command is also useful to detect out-of-sync model links resulting from -links created in different virtual environments. It's recommended to run the -command with `python -m` to make sure you're executing the correct version of -spaCy. +verify that all installed pipeline packages are compatible with your spaCy +version. If incompatible packages are found, tips and installation instructions +are printed. It's recommended to run the command with `python -m` to make sure +you're executing the correct version of spaCy. ```cli $ pip install -U spacy @@ -132,8 +130,8 @@ $ pip install -U spacy[cuda92] Once you have a GPU-enabled installation, the best way to activate it is to call [`spacy.prefer_gpu`](/api/top-level#spacy.prefer_gpu) or [`spacy.require_gpu()`](/api/top-level#spacy.require_gpu) somewhere in your -script before any models have been loaded. `require_gpu` will raise an error if -no GPU is available. +script before any pipelines have been loaded. `require_gpu` will raise an error +if no GPU is available. ```python import spacy @@ -238,16 +236,16 @@ installing, loading and using spaCy, as well as their solutions. ``` -No compatible model found for [lang] (spaCy vX.X.X). +No compatible package found for [lang] (spaCy vX.X.X). ``` -This usually means that the model you're trying to download does not exist, or -isn't available for your version of spaCy. Check the +This usually means that the trained pipeline you're trying to download does not +exist, or isn't available for your version of spaCy. Check the [compatibility table](https://github.com/explosion/spacy-models/tree/master/compatibility.json) -to see which models are available for your spaCy version. If you're using an old -version, consider upgrading to the latest release. Note that while spaCy +to see which packages are available for your spaCy version. If you're using an +old version, consider upgrading to the latest release. Note that while spaCy supports tokenization for [a variety of languages](/usage/models#languages), not -all of them come with statistical models. To only use the tokenizer, import the +all of them come with trained pipelines. To only use the tokenizer, import the language's `Language` class instead, for example `from spacy.lang.fr import French`. @@ -259,7 +257,7 @@ language's `Language` class instead, for example no such option: --no-cache-dir ``` -The `download` command uses pip to install the models and sets the +The `download` command uses pip to install the pipeline packages and sets the `--no-cache-dir` flag to prevent it from requiring too much memory. [This setting](https://pip.pypa.io/en/stable/reference/pip_install/#caching) requires pip v6.0 or newer. Run `pip install -U pip` to upgrade to the latest @@ -323,19 +321,19 @@ also run `which python` to find out where your Python executable is located. 
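If `which python` isn't available, for example on Windows, the same check can be done from within Python. A minimal sketch:

```python
import sys
import spacy

print(sys.executable)     # path of the interpreter you're actually running
print(spacy.__version__)  # spaCy version installed in this environment
print(spacy.__file__)     # location of that spaCy installation
```
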
- + ``` ImportError: No module named 'en_core_web_sm' ``` -As of spaCy v1.7, all models can be installed as Python packages. This means -that they'll become importable modules of your application. If this fails, it's -usually a sign that the package is not installed in the current environment. Run -`pip list` or `pip freeze` to check which model packages you have installed, and -install the [correct models](/models) if necessary. If you're importing a model -manually at the top of a file, make sure to use the name of the package, not the -shortcut link you've created. +As of spaCy v1.7, all trained pipelines can be installed as Python packages. +This means that they'll become importable modules of your application. If this +fails, it's usually a sign that the package is not installed in the current +environment. Run `pip list` or `pip freeze` to check which pipeline packages you +have installed, and install the [correct package](/models) if necessary. If +you're importing a package manually at the top of a file, make sure to use the +full name of the package. diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index 726cf0521..7d3613cf5 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -132,7 +132,7 @@ language can extend the `Lemmatizer` as part of its ### {executable="true"} import spacy -# English models include a rule-based lemmatizer +# English pipelines include a rule-based lemmatizer nlp = spacy.load("en_core_web_sm") lemmatizer = nlp.get_pipe("lemmatizer") print(lemmatizer.mode) # 'rule' @@ -156,14 +156,14 @@ component. The data for spaCy's lemmatizers is distributed in the package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The -provided models already include all the required tables, but if you are creating -new models, you'll probably want to install `spacy-lookups-data` to provide the -data when the lemmatizer is initialized. +provided trained pipelines already include all the required tables, but if you +are creating new pipelines, you'll probably want to install `spacy-lookups-data` +to provide the data when the lemmatizer is initialized. ### Lookup lemmatizer {#lemmatizer-lookup} -For models without a tagger or morphologizer, a lookup lemmatizer can be added -to the pipeline as long as a lookup table is provided, typically through +For pipelines without a tagger or morphologizer, a lookup lemmatizer can be +added to the pipeline as long as a lookup table is provided, typically through [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The lookup lemmatizer looks up the token surface form in the lookup table without reference to the token's part-of-speech or context. @@ -178,9 +178,9 @@ nlp.add_pipe("lemmatizer", config={"mode": "lookup"}) ### Rule-based lemmatizer {#lemmatizer-rule} -When training models that include a component that assigns POS (a morphologizer -or a tagger with a [POS mapping](#mappings-exceptions)), a rule-based lemmatizer -can be added using rule tables from +When training pipelines that include a component that assigns part-of-speech +tags (a morphologizer or a tagger with a [POS mapping](#mappings-exceptions)), a +rule-based lemmatizer can be added using rule tables from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data): ```python @@ -366,10 +366,10 @@ sequence of tokens. You can walk up the tree with the > #### Projective vs. 
non-projective
>
> For the [default English pipelines](/models/en), the parse tree is
> **projective**, which means that there are no crossing brackets. The tokens
> returned by `.subtree` are therefore guaranteed to be contiguous. This is not
> true for the German pipelines, which have many
> [non-projective dependencies](https://explosion.ai/blog/german-model#word-order).

```python
@@ -497,26 +497,27 @@ displaCy in our [online demo](https://explosion.ai/demos/displacy)..

### Disabling the parser {#disabling}

-In the [default models](/models), the parser is loaded and enabled as part of
-the [standard processing pipeline](/usage/processing-pipelines). If you don't
-need any of the syntactic information, you should disable the parser. Disabling
-the parser will make spaCy load and run much faster. If you want to load the
-parser, but need to disable it for specific documents, you can also control its
-use on the `nlp` object.
+In the [trained pipelines](/models) provided by spaCy, the parser is loaded and
+enabled by default as part of the
+[standard processing pipeline](/usage/processing-pipelines). If you don't need
+any of the syntactic information, you should disable the parser. Disabling the
+parser will make spaCy load and run much faster. If you want to load the parser,
+but need to disable it for specific documents, you can also control its use on
+the `nlp` object. For more details, see the usage guide on
+[disabling pipeline components](/usage/processing-pipelines/#disabling).

```python
nlp = spacy.load("en_core_web_sm", disable=["parser"])
-nlp = English().from_disk("/model", disable=["parser"])
-doc = nlp("I don't want parsed", disable=["parser"])
```

## Named Entity Recognition {#named-entities}

spaCy features an extremely fast statistical entity recognition system, that
-assigns labels to contiguous spans of tokens. The default model identifies a
-variety of named and numeric entities, including companies, locations,
-organizations and products. You can add arbitrary classes to the entity
-recognition system, and update the model with new examples.
+assigns labels to contiguous spans of tokens. The default
+[trained pipelines](/models) can identify a variety of named and numeric
+entities, including companies, locations, organizations and products. You can
+add arbitrary classes to the entity recognition system, and update the model
+with new examples.

### Named Entity Recognition 101 {#named-entities-101}

@@ -669,7 +670,7 @@ responsibility for ensuring that the data is left in a consistent state.



-For details on the entity types available in spaCy's pretrained models, see the
+For details on the entity types available in spaCy's trained pipelines, see the
"label scheme" sections of the individual models in the
[models directory](/models).

@@ -710,9 +711,8 @@ import DisplacyEntHtml from 'images/displacy-ent2.html'

To ground the named entities into the "real world", spaCy provides
functionality to perform entity linking, which resolves a textual entity to a
unique identifier from a knowledge base (KB). You can create your own
-[`KnowledgeBase`](/api/kb) and
-[train a new Entity Linking model](/usage/training#entity-linker) using that
-custom-made KB.
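As a rough sketch of what creating such a knowledge base can look like (the identifier, frequency and vector values below are made-up examples, and the exact KB API may differ between versions):

```python
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_sm")
# Create a KB whose entity vectors have 3 dimensions (a toy value for illustration)
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
# Register an entity under a made-up identifier with a frequency and vector
kb.add_entity(entity="Q12345", freq=6, entity_vector=[0.1, 0.2, 0.3])
# Map a surface form ("alias") to candidate entities with prior probabilities
kb.add_alias(alias="Berlin", entities=["Q12345"], probabilities=[0.9])
```
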
+[`KnowledgeBase`](/api/kb) and [train](/usage/training) a new +[`EntityLinker`](/api/entitylinker) using that custom knowledge base. ### Accessing entity identifiers {#entity-linking-accessing model="entity linking"} @@ -724,7 +724,7 @@ object, or the `ent_kb_id` and `ent_kb_id_` attributes of a ```python import spacy -nlp = spacy.load("my_custom_el_model") +nlp = spacy.load("my_custom_el_pipeline") doc = nlp("Ada Lovelace was born in London") # Document level @@ -1042,13 +1042,15 @@ function that behaves the same way. -If you're using a statistical model, writing to the +If you've loaded a trained pipeline, writing to the [`nlp.Defaults`](/api/language#defaults) or `English.Defaults` directly won't -work, since the regular expressions are read from the model and will be compiled -when you load it. If you modify `nlp.Defaults`, you'll only see the effect if -you call [`spacy.blank`](/api/top-level#spacy.blank). If you want to modify the -tokenizer loaded from a statistical model, you should modify `nlp.tokenizer` -directly. +work, since the regular expressions are read from the pipeline data and will be +compiled when you load it. If you modify `nlp.Defaults`, you'll only see the +effect if you call [`spacy.blank`](/api/top-level#spacy.blank). If you want to +modify the tokenizer loaded from a trained pipeline, you should modify +`nlp.tokenizer` directly. If you're training your own pipeline, you can register +[callbacks](/usage/training/#custom-code-nlp-callbacks) to modify the `nlp` +object before training. @@ -1218,11 +1220,11 @@ print(doc.text, [token.text for token in doc]) -Keep in mind that your model's result may be less accurate if the tokenization +Keep in mind that your models' results may be less accurate if the tokenization during training differs from the tokenization at runtime. So if you modify a -pretrained model's tokenization afterwards, it may produce very different -predictions. You should therefore train your model with the **same tokenizer** -it will be using at runtime. See the docs on +trained pipeline' tokenization afterwards, it may produce very different +predictions. You should therefore train your pipeline with the **same +tokenizer** it will be using at runtime. See the docs on [training with custom tokenization](#custom-tokenizer-training) for details. @@ -1231,7 +1233,7 @@ it will be using at runtime. See the docs on spaCy's [training config](/usage/training#config) describe the settings, hyperparameters, pipeline and tokenizer used for constructing and training the -model. The `[nlp.tokenizer]` block refers to a **registered function** that +pipeline. The `[nlp.tokenizer]` block refers to a **registered function** that takes the `nlp` object and returns a tokenizer. Here, we're registering a function called `whitespace_tokenizer` in the [`@tokenizers` registry](/api/registry). To make sure spaCy knows how to @@ -1626,11 +1628,11 @@ spaCy provides four alternatives for sentence segmentation: Unlike other libraries, spaCy uses the dependency parse to determine sentence boundaries. This is usually the most accurate approach, but it requires a -**statistical model** that provides accurate predictions. If your texts are +**trained pipeline** that provides accurate predictions. If your texts are closer to general-purpose news or web text, this should work well out-of-the-box -with spaCy's provided models. For social media or conversational text that -doesn't follow the same rules, your application may benefit from a custom model -or rule-based component. 
+with spaCy's provided trained pipelines. For social media or conversational text +that doesn't follow the same rules, your application may benefit from a custom +trained or rule-based component. ```python ### {executable="true"} @@ -1652,8 +1654,8 @@ parses consistent with the sentence boundaries. The [`SentenceRecognizer`](/api/sentencerecognizer) is a simple statistical component that only provides sentence boundaries. Along with being faster and smaller than the parser, its primary advantage is that it's easier to train -custom models because it only requires annotated sentence boundaries rather than -full dependency parses. +because it only requires annotated sentence boundaries rather than full +dependency parses. @@ -1685,7 +1687,7 @@ need sentence boundaries without dependency parses. import spacy from spacy.lang.en import English -nlp = English() # just the language with no model +nlp = English() # just the language with no pipeline nlp.add_pipe("sentencizer") doc = nlp("This is a sentence. This is another sentence.") for sent in doc.sents: @@ -1827,11 +1829,11 @@ or Tomas Mikolov's original [Word2vec implementation](https://code.google.com/archive/p/word2vec/). Most word vector libraries output an easy-to-read text-based format, where each line consists of the word followed by its vector. For everyday use, we want to -convert the vectors model into a binary format that loads faster and takes up -less space on disk. The easiest way to do this is the -[`init model`](/api/cli#init-model) command-line utility. This will output a -spaCy model in the directory `/tmp/la_vectors_wiki_lg`, giving you access to -some nice Latin vectors. You can then pass the directory path to +convert the vectors into a binary format that loads faster and takes up less +space on disk. The easiest way to do this is the +[`init vocab`](/api/cli#init-vocab) command-line utility. This will output a +blank spaCy pipeline in the directory `/tmp/la_vectors_wiki_lg`, giving you +access to some nice Latin vectors. You can then pass the directory path to [`spacy.load`](/api/top-level#spacy.load). > #### Usage example @@ -1845,7 +1847,7 @@ some nice Latin vectors. You can then pass the directory path to ```cli $ wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz -$ python -m spacy init model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz +$ python -m spacy init vocab en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz ``` @@ -1853,13 +1855,13 @@ $ python -m spacy init model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300. To help you strike a good balance between coverage and memory usage, spaCy's [`Vectors`](/api/vectors) class lets you map **multiple keys** to the **same row** of the table. If you're using the -[`spacy init model`](/api/cli#init-model) command to create a vocabulary, +[`spacy init vocab`](/api/cli#init-vocab) command to create a vocabulary, pruning the vectors will be taken care of automatically if you set the `--prune-vectors` flag. You can also do it manually in the following steps: -1. Start with a **word vectors model** that covers a huge vocabulary. For +1. Start with a **word vectors package** that covers a huge vocabulary. For instance, the [`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg) - model provides 300-dimensional GloVe vectors for over 1 million terms of + starter provides 300-dimensional GloVe vectors for over 1 million terms of English. 2. 
If your vocabulary has values set for the `Lexeme.prob` attribute, the lexemes will be sorted by descending probability to determine which vectors @@ -1900,17 +1902,17 @@ the two words. In the example above, the vector for "Shore" was removed and remapped to the vector of "coast", which is deemed about 73% similar. "Leaving" was remapped to the vector of "leaving", which is identical. If you're using the -[`init model`](/api/cli#init-model) command, you can set the `--prune-vectors` +[`init vocab`](/api/cli#init-vocab) command, you can set the `--prune-vectors` option to easily reduce the size of the vectors as you add them to a spaCy -model: +pipeline: ```cli -$ python -m spacy init model en /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000 +$ python -m spacy init vocab en /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000 ``` -This will create a spaCy model with vectors for the first 10,000 words in the -vectors model. All other words in the vectors model are mapped to the closest -vector among those retained. +This will create a blank spaCy pipeline with vectors for the first 10,000 words +in the vectors. All other words in the vectors are mapped to the closest vector +among those retained. @@ -1925,8 +1927,8 @@ possible. You can modify the vectors via the [`Vocab`](/api/vocab) or if you have vectors in an arbitrary format, as you can read in the vectors with your own logic, and just set them with a simple loop. This method is likely to be slower than approaches that work with the whole vectors table at once, but -it's a great approach for once-off conversions before you save out your model to -disk. +it's a great approach for once-off conversions before you save out your `nlp` +object to disk. ```python ### Adding vectors @@ -1978,14 +1980,14 @@ print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")]) The [`@spacy.registry.languages`](/api/top-level#registry) decorator lets you register a custom language class and assign it a string name. This means that you can call [`spacy.blank`](/api/top-level#spacy.blank) with your custom -language name, and even train models with it and refer to it in your +language name, and even train pipelines with it and refer to it in your [training config](/usage/training#config). > #### Config usage > > After registering your custom language class using the `languages` registry, > you can refer to it in your [training config](/usage/training#config). This -> means spaCy will train your model using the custom subclass. +> means spaCy will train your pipeline using the custom subclass. > > ```ini > [nlp] diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md index ec0e02297..9b1e96e4e 100644 --- a/website/docs/usage/models.md +++ b/website/docs/usage/models.md @@ -8,25 +8,24 @@ menu: - ['Production Use', 'production'] --- -spaCy's models can be installed as **Python packages**. This means that they're -a component of your application, just like any other module. They're versioned -and can be defined as a dependency in your `requirements.txt`. Models can be -installed from a download URL or a local directory, manually or via -[pip](https://pypi.python.org/pypi/pip). Their data can be located anywhere on -your file system. +spaCy's trained pipelines can be installed as **Python packages**. This means +that they're a component of your application, just like any other module. +They're versioned and can be defined as a dependency in your `requirements.txt`. 
+Trained pipelines can be installed from a download URL or a local directory, +manually or via [pip](https://pypi.python.org/pypi/pip). Their data can be +located anywhere on your file system. > #### Important note > -> If you're upgrading to spaCy v3.x, you need to **download the new models**. If -> you've trained statistical models that use spaCy's annotations, you should -> **retrain your models** after updating spaCy. If you don't retrain, you may -> suffer train/test skew, which might decrease your accuracy. +> If you're upgrading to spaCy v3.x, you need to **download the new pipeline +> packages**. If you've trained your own pipelines, you need to **retrain** them +> after updating spaCy. ## Quickstart {hidden="true"} import QuickstartModels from 'widgets/quickstart-models.js' - + ## Language support {#languages} @@ -34,14 +33,14 @@ spaCy currently provides support for the following languages. You can help by [improving the existing language data](/usage/adding-languages#language-data) and extending the tokenization patterns. [See here](https://github.com/explosion/spaCy/issues/3056) for details on how to -contribute to model development. +contribute to development. > #### Usage note > -> If a model is available for a language, you can download it using the -> [`spacy download`](/api/cli#download) command. In order to use languages that -> don't yet come with a model, you have to import them directly, or use -> [`spacy.blank`](/api/top-level#spacy.blank): +> If a trained pipeline is available for a language, you can download it using +> the [`spacy download`](/api/cli#download) command. In order to use languages +> that don't yet come with a trained pipeline, you have to import them directly, +> or use [`spacy.blank`](/api/top-level#spacy.blank): > > ```python > from spacy.lang.fi import Finnish @@ -73,13 +72,13 @@ import Languages from 'widgets/languages.js' > nlp = spacy.blank("xx") > ``` -spaCy also supports models trained on more than one language. This is especially -useful for named entity recognition. The language ID used for multi-language or -language-neutral models is `xx`. The language class, a generic subclass -containing only the base language data, can be found in +spaCy also supports pipelines trained on more than one language. This is +especially useful for named entity recognition. The language ID used for +multi-language or language-neutral pipelines is `xx`. The language class, a +generic subclass containing only the base language data, can be found in [`lang/xx`](https://github.com/explosion/spaCy/tree/master/spacy/lang/xx). -To train a model using the neutral multi-language class, you can set +To train a pipeline using the neutral multi-language class, you can set `lang = "xx"` in your [training config](/usage/training#config). You can also import the `MultiLanguage` class directly, or call [`spacy.blank("xx")`](/api/top-level#spacy.blank) for lazy-loading. @@ -111,7 +110,7 @@ The Chinese language class supports three word segmentation options: 3. **PKUSeg**: As of spaCy v2.3.0, support for [PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added to support better segmentation for Chinese OntoNotes and the provided - [Chinese models](/models/zh). Enable PKUSeg with the tokenizer option + [Chinese pipelines](/models/zh). Enable PKUSeg with the tokenizer option `{"segmenter": "pkuseg"}`. 
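Putting the three options together, a sketch of how the segmenter could be selected through the tokenizer config (the exact config interface may differ slightly across v3 pre-releases, and the jieba and pkuseg examples assume those packages are installed):

```python
from spacy.lang.zh import Chinese

# Character segmentation (the default)
nlp = Chinese()

# Jieba word segmentation
nlp = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "jieba"}}})

# PKUSeg word segmentation
nlp = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "pkuseg"}}})
```
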
@@ -169,9 +168,9 @@ nlp.tokenizer.pkuseg_update_user_dict([], reset=True) - + -The [Chinese models](/models/zh) provided by spaCy include a custom `pkuseg` +The [Chinese pipelines](/models/zh) provided by spaCy include a custom `pkuseg` model trained only on [Chinese OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19), since the models provided by `pkuseg` include data restricted to research use. For @@ -208,29 +207,29 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo The Japanese language class uses [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word segmentation and part-of-speech tagging. The default Japanese language class and -the provided Japanese models use SudachiPy split mode `A`. The `meta` argument -of the `Japanese` language class can be used to configure the split mode to `A`, -`B` or `C`. +the provided Japanese pipelines use SudachiPy split mode `A`. The `meta` +argument of the `Japanese` language class can be used to configure the split +mode to `A`, `B` or `C`. If you run into errors related to `sudachipy`, which is currently under active development, we suggest downgrading to `sudachipy==0.4.5`, which is the version -used for training the current [Japanese models](/models/ja). +used for training the current [Japanese pipelines](/models/ja). -## Installing and using models {#download} +## Installing and using trained pipelines {#download} -The easiest way to download a model is via spaCy's +The easiest way to download a trained pipeline is via spaCy's [`download`](/api/cli#download) command. It takes care of finding the -best-matching model compatible with your spaCy installation. +best-matching package compatible with your spaCy installation. > #### Important note for v3.0 > -> Note that as of spaCy v3.0, model shortcut links that create (potentially +> Note that as of spaCy v3.0, shortcut links like `en` that create (potentially > brittle) symlinks in your spaCy installation are **deprecated**. To download -> and load an installed model, use its full name: +> and load an installed pipeline package, use its full name: > > ```diff > - python -m spacy download en @@ -243,14 +242,14 @@ best-matching model compatible with your spaCy installation. > ``` ```cli -# Download best-matching version of a model for your spaCy installation +# Download best-matching version of a package for your spaCy installation $ python -m spacy download en_core_web_sm -# Download exact model version +# Download exact package version $ python -m spacy download en_core_web_sm-3.0.0 --direct ``` -The download command will [install the model](/usage/models#download-pip) via +The download command will [install the package](/usage/models#download-pip) via pip and place the package in your `site-packages` directory. ```cli @@ -266,11 +265,11 @@ doc = nlp("This is a sentence.") ### Installation via pip {#download-pip} -To download a model directly using [pip](https://pypi.python.org/pypi/pip), -point `pip install` to the URL or local path of the archive file. To find the -direct link to a model, head over to the -[model releases](https://github.com/explosion/spacy-models/releases), right -click on the archive link and copy it to your clipboard. +To download a trained pipeline directly using +[pip](https://pypi.python.org/pypi/pip), point `pip install` to the URL or local +path of the archive file. 
To find the direct link to a package, head over to the +[releases](https://github.com/explosion/spacy-models/releases), right click on +the archive link and copy it to your clipboard. ```bash # With external URL @@ -280,60 +279,61 @@ $ pip install https://github.com/explosion/spacy-models/releases/download/en_cor $ pip install /Users/you/en_core_web_sm-3.0.0.tar.gz ``` -By default, this will install the model into your `site-packages` directory. You -can then use `spacy.load()` to load it via its package name or +By default, this will install the pipeline package into your `site-packages` +directory. You can then use `spacy.load` to load it via its package name or [import it](#usage-import) explicitly as a module. If you need to download -models as part of an automated process, we recommend using pip with a direct -link, instead of relying on spaCy's [`download`](/api/cli#download) command. +pipeline packages as part of an automated process, we recommend using pip with a +direct link, instead of relying on spaCy's [`download`](/api/cli#download) +command. You can also add the direct download link to your application's `requirements.txt`. For more details, see the section on -[working with models in production](#production). +[working with pipeline packages in production](#production). ### Manual download and installation {#download-manual} In some cases, you might prefer downloading the data manually, for example to -place it into a custom directory. You can download the model via your browser +place it into a custom directory. You can download the package via your browser from the [latest releases](https://github.com/explosion/spacy-models/releases), or configure your own download script using the URL of the archive file. The -archive consists of a model directory that contains another directory with the -model data. +archive consists of a package directory that contains another directory with the +pipeline data. ```yaml ### Directory structure {highlight="6"} └── en_core_web_md-3.0.0.tar.gz # downloaded archive ├── setup.py # setup file for pip installation - ├── meta.json # copy of model meta - └── en_core_web_md # 📦 model package + ├── meta.json # copy of pipeline meta + └── en_core_web_md # 📦 pipeline package ├── __init__.py # init for pip installation - └── en_core_web_md-3.0.0 # model data - ├── config.cfg # model config - ├── meta.json # model meta + └── en_core_web_md-3.0.0 # pipeline data + ├── config.cfg # pipeline config + ├── meta.json # pipeline meta └── ... # directories with component data ``` -You can place the **model package directory** anywhere on your local file +You can place the **pipeline package directory** anywhere on your local file system. -### Using models with spaCy {#usage} +### Using trained pipelines with spaCy {#usage} -To load a model, use [`spacy.load`](/api/top-level#spacy.load) with the model's -package name or a path to the data directory: +To load a pipeline package, use [`spacy.load`](/api/top-level#spacy.load) with +the package name or a path to the data directory: > #### Important note for v3.0 > -> Note that as of spaCy v3.0, model shortcut links that create (potentially -> brittle) symlinks in your spaCy installation are **deprecated**. To load an -> installed model, use its full name: +> Note that as of spaCy v3.0, shortcut links like `en` that create (potentially +> brittle) symlinks in your spaCy installation are **deprecated**. 
To download
> and load an installed pipeline package, use its full name:
>
> ```diff
> - python -m spacy download en
> + python -m spacy download en_core_web_sm
> ```

```python
import spacy

-nlp = spacy.load("en_core_web_sm") # load model package "en_core_web_sm"
+nlp = spacy.load("en_core_web_sm")  # load package "en_core_web_sm"
nlp = spacy.load("/path/to/en_core_web_sm") # load package from a directory

doc = nlp("This is a sentence.")
@@ -342,17 +342,18 @@ doc = nlp("This is a sentence.")

You can use the [`info`](/api/cli#info) command or
-[`spacy.info()`](/api/top-level#spacy.info) method to print a model's meta data
-before loading it. Each `Language` object with a loaded model also exposes the
-model's meta data as the attribute `meta`. For example, `nlp.meta['version']`
-will return the model's version.
+[`spacy.info()`](/api/top-level#spacy.info) method to print a pipeline
+package's meta data before loading it. Each `Language` object with a loaded
+pipeline also exposes the pipeline's meta data as the attribute `meta`. For
+example, `nlp.meta['version']` will return the package version.



-### Importing models as modules {#usage-import}
+### Importing pipeline packages as modules {#usage-import}

-If you've installed a model via spaCy's downloader, or directly via pip, you can
-also `import` it and then call its `load()` method with no arguments:
+If you've installed a trained pipeline via [`spacy download`](/api/cli#download)
+or directly via pip, you can also `import` it and then call its `load()` method
+with no arguments:

```python
### {executable="true"}
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")
```

-How you choose to load your models ultimately depends on personal preference.
-However, **for larger code bases**, we usually recommend native imports, as this
-will make it easier to integrate models with your existing build process,
-continuous integration workflow and testing framework. It'll also prevent you
-from ever trying to load a model that is not installed, as your code will raise
-an `ImportError` immediately, instead of failing somewhere down the line when
-calling `spacy.load()`.
+How you choose to load your trained pipelines ultimately depends on personal
+preference. However, **for larger code bases**, we usually recommend native
+imports, as this will make it easier to integrate pipeline packages with your
+existing build process, continuous integration workflow and testing framework.
+It'll also prevent you from ever trying to load a package that is not installed,
+as your code will raise an `ImportError` immediately, instead of failing
+somewhere down the line when calling `spacy.load()`. For more details, see the
+section on [working with pipeline packages in production](#production).

-For more details, see the section on
-[working with models in production](#production).
+## Using trained pipelines in production {#production}

-### Using your own models {#own-models}
-
-If you've trained your own model, for example for
-[additional languages](/usage/adding-languages) or
-[custom named entities](/usage/training#ner), you can save its state using the
-[`Language.to_disk()`](/api/language#to_disk) method. To make the model more
-convenient to deploy, we recommend wrapping it as a Python package.
-
-For more information and a detailed guide on how to package your model, see the
-documentation on [saving and loading models](/usage/saving-loading#models).
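To illustrate the fail-fast behavior of native imports described above, a small sketch (the error handling is just an example):

```python
# A missing package fails immediately with an ImportError here, instead of
# spacy.load() raising an error later, somewhere down the line.
try:
    import en_core_web_sm
except ImportError:
    raise SystemExit("Pipeline package not installed – install it first via pip.")

nlp = en_core_web_sm.load()
print(nlp.pipe_names)
```
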
- -## Using models in production {#production} - -If your application depends on one or more models, you'll usually want to -integrate them into your continuous integration workflow and build process. -While spaCy provides a range of useful helpers for downloading, linking and -loading models, the underlying functionality is entirely based on native Python -packages. This allows your application to handle a model like any other package -dependency. +If your application depends on one or more trained pipeline packages, you'll +usually want to integrate them into your continuous integration workflow and +build process. While spaCy provides a range of useful helpers for downloading +and loading pipeline packages, the underlying functionality is entirely based on +native Python packaging. This allows your application to handle a spaCy pipeline +like any other package dependency. -### Downloading and requiring model dependencies {#models-download} +### Downloading and requiring package dependencies {#models-download} spaCy's built-in [`download`](/api/cli#download) command is mostly intended as a convenient, interactive wrapper. It performs compatibility checks and prints -detailed error messages and warnings. However, if you're downloading models as -part of an automated build process, this only adds an unnecessary layer of -complexity. If you know which models your application needs, you should be -specifying them directly. +detailed error messages and warnings. However, if you're downloading pipeline +packages as part of an automated build process, this only adds an unnecessary +layer of complexity. If you know which packages your application needs, you +should be specifying them directly. -Because all models are valid Python packages, you can add them to your +Because pipeline packages are valid Python packages, you can add them to your application's `requirements.txt`. If you're running your own internal PyPi -installation, you can upload the models there. pip's +installation, you can upload the pipeline packages there. pip's [requirements file format](https://pip.pypa.io/en/latest/reference/pip_install/#requirements-file-format) supports both package names to download via a PyPi server, as well as direct URLs. @@ -422,17 +410,17 @@ the download URL. This way, the package won't be re-downloaded and overwritten if it's already installed - just like when you're downloading a package from PyPi. -All models are versioned and specify their spaCy dependency. This ensures -cross-compatibility and lets you specify exact version requirements for each -model. If you've trained your own model, you can use the -[`package`](/api/cli#package) command to generate the required meta data and -turn it into a loadable package. +All pipeline packages are versioned and specify their spaCy dependency. This +ensures cross-compatibility and lets you specify exact version requirements for +each pipeline. If you've [trained](/usage/training) your own pipeline, you can +use the [`spacy package`](/api/cli#package) command to generate the required +meta data and turn it into a loadable package. 
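That version information is also accessible at runtime. A sketch, using `en_core_web_sm` as an example package and assuming the usual meta fields:

```python
import en_core_web_sm

print(en_core_web_sm.__version__)  # version of the installed package
nlp = en_core_web_sm.load()
print(nlp.meta["version"])         # same version, read from the pipeline meta
print(nlp.meta["spacy_version"])   # spaCy version range the package was built for
```
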
-### Loading and testing models {#models-loading}
+### Loading and testing pipeline packages {#models-loading}

-Models are regular Python packages, so you can also import them as a package
-using Python's native `import` syntax, and then call the `load` method to load
-the model data and return an `nlp` object:
+Pipeline packages are regular Python packages, so you can also import them as a
+package using Python's native `import` syntax, and then call the `load` method
+to load the data and return an `nlp` object:

```python
import en_core_web_sm
nlp = en_core_web_sm.load()
```

In general, this approach is recommended for larger code bases, as it's more
-"native", and doesn't depend on symlinks or rely on spaCy's loader to resolve
-string names to model packages. If a model can't be imported, Python will raise
-an `ImportError` immediately. And if a model is imported but not used, any
-linter will catch that.
+"native", and doesn't rely on spaCy's loader to resolve string names to
+packages. If a package can't be imported, Python will raise an `ImportError`
+immediately. And if a package is imported but not used, any linter will catch
+that.

Similarly, it'll give you more flexibility when writing tests that require
-loading models. For example, instead of writing your own `try` and `except`
+loading pipelines. For example, instead of writing your own `try` and `except`
logic around spaCy's loader, you can use
[pytest](http://pytest.readthedocs.io/en/latest/)'s
[`importorskip()`](https://docs.pytest.org/en/latest/builtin.html#_pytest.outcomes.importorskip)
-method to only run a test if a specific model or model version is installed.
-Each model package exposes a `__version__` attribute which you can also use to
-perform your own version compatibility checks before loading a model.
+method to only run a test if a specific pipeline package or version is
+installed. Each pipeline package exposes a `__version__` attribute which
+you can also use to perform your own version compatibility checks before loading
+it.

diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index 3636aa3c2..c8702a147 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -42,8 +42,8 @@ texts = ["This is a text", "These are lots of texts", "..."]

- Only apply the **pipeline components you need**. Getting predictions from the
  model that you don't actually need adds up and becomes very inefficient at
  scale. To prevent this, use the `disable` keyword argument to disable
-  components you don't need – either when loading a model, or during processing
-  with `nlp.pipe`. See the section on
+  components you don't need – either when loading a pipeline, or during
+  processing with `nlp.pipe`. See the section on
  [disabling pipeline components](#disabling) for more details and examples.


@@ -95,7 +95,7 @@ spaCy makes it very easy to create your own pipelines consisting of reusable
components – this includes spaCy's default tagger, parser and entity recognizer,
but also your own custom processing functions. A pipeline component can be added
to an already existing `nlp` object, specified when initializing a `Language`
-class, or defined within a [model package](/usage/saving-loading#models).
+class, or defined within a [pipeline package](/usage/saving-loading#models).

> #### config.cfg (excerpt)
>
> ```ini
> [nlp]
> lang = "en"
> pipeline = ["tagger", "parser", "ner"]
>
> [components]
>
> [components.tagger]
> factory = "tagger"
> # Settings for the tagger component
>
> [components.parser]
> factory = "parser"
@@ -115,7 +115,7 @@ class, or defined within a [model package](/usage/saving-loading#models).
> # Settings for the parser component > ``` -When you load a model, spaCy first consults the model's +When you load a pipeline, spaCy first consults the [`meta.json`](/usage/saving-loading#models) and [`config.cfg`](/usage/training#config). The config tells spaCy what language class to use, which components are in the pipeline, and how those components @@ -131,8 +131,7 @@ should be created. spaCy will then do the following: component with with [`add_pipe`](/api/language#add_pipe). The settings are passed into the factory. 3. Make the **model data** available to the `Language` class by calling - [`from_disk`](/api/language#from_disk) with the path to the model data - directory. + [`from_disk`](/api/language#from_disk) with the path to the data directory. So when you call this... @@ -140,27 +139,27 @@ So when you call this... nlp = spacy.load("en_core_web_sm") ``` -... the model's `config.cfg` tells spaCy to use the language `"en"` and the +... the pipeline's `config.cfg` tells spaCy to use the language `"en"` and the pipeline `["tagger", "parser", "ner"]`. spaCy will then initialize `spacy.lang.en.English`, and create each pipeline component and add it to the -processing pipeline. It'll then load in the model's data from its data directory +processing pipeline. It'll then load in the model data from the data directory and return the modified `Language` class for you to use as the `nlp` object. spaCy v3.0 introduces a `config.cfg`, which includes more detailed settings for -the model pipeline, its components and the -[training process](/usage/training#config). You can export the config of your -current `nlp` object by calling [`nlp.config.to_disk`](/api/language#config). +the pipeline, its components and the [training process](/usage/training#config). +You can export the config of your current `nlp` object by calling +[`nlp.config.to_disk`](/api/language#config). -Fundamentally, a [spaCy model](/models) consists of three components: **the -weights**, i.e. binary data loaded in from a directory, a **pipeline** of +Fundamentally, a [spaCy pipeline package](/models) consists of three components: +**the weights**, i.e. binary data loaded in from a directory, a **pipeline** of functions called in order, and **language data** like the tokenization rules and -language-specific settings. For example, a Spanish NER model requires different -weights, language data and pipeline components than an English parsing and -tagging model. This is also why the pipeline state is always held by the +language-specific settings. For example, a Spanish NER pipeline requires +different weights, language data and components than an English parsing and +tagging pipeline. This is also why the pipeline state is always held by the `Language` class. [`spacy.load`](/api/top-level#spacy.load) puts this all together and returns an instance of `Language` with a pipeline set and access to the binary data: @@ -175,7 +174,7 @@ cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English nlp = cls() # 2. Initialize it for name in pipeline: nlp.add_pipe(name) # 3. Add the component to the pipeline -nlp.from_disk(model_data_path) # 4. Load in the binary data +nlp.from_disk(data_path) # 4. Load in the binary data ``` When you call `nlp` on a text, spaCy will **tokenize** it and then **call each @@ -243,28 +242,29 @@ tagger or the parser, you can **disable or exclude** it. This can sometimes make a big difference and improve loading and inference speed. There are two different mechanisms you can use: -1. 
**Disable:** The component and its data will be loaded with the model, but it - will be disabled by default and not run as part of the processing pipeline. - To run it, you can explicitly enable it by calling +1. **Disable:** The component and its data will be loaded with the pipeline, but + it will be disabled by default and not run as part of the processing + pipeline. To run it, you can explicitly enable it by calling [`nlp.enable_pipe`](/api/language#enable_pipe). When you save out the `nlp` object, the disabled component will be included but disabled by default. -2. **Exclude:** Don't load the component and its data with the model. Once the - model is loaded, there will be no reference to the excluded component. +2. **Exclude:** Don't load the component and its data with the pipeline. Once + the pipeline is loaded, there will be no reference to the excluded component. Disabled and excluded component names can be provided to [`spacy.load`](/api/top-level#spacy.load) as a list. -> #### 💡 Models with optional components +> #### 💡 Optional pipeline components > -> The `disable` mechanism makes it easy to distribute models with optional -> components that you can enable or disable at runtime. For instance, your model -> may include a statistical _and_ a rule-based component for sentence -> segmentation, and you can choose which one to run depending on your use case. +> The `disable` mechanism makes it easy to distribute pipeline packages with +> optional components that you can enable or disable at runtime. For instance, +> your pipeline may include a statistical _and_ a rule-based component for +> sentence segmentation, and you can choose which one to run depending on your +> use case. ```python -# Load the model without the entity recognizer +# Load the pipeline without the entity recognizer nlp = spacy.load("en_core_web_sm", exclude=["ner"]) # Load the tagger and parser but don't enable them @@ -358,25 +358,25 @@ run as part of the pipeline. | `nlp.component_names` | All component names, including disabled components. | | `nlp.disabled` | Names of components that are currently disabled. | -### Sourcing pipeline components from existing models {#sourced-components new="3"} +### Sourcing components from existing pipelines {#sourced-components new="3"} -Pipeline components that are independent can also be reused across models. -Instead of adding a new blank component to a pipeline, you can also copy an -existing component from a pretrained model by setting the `source` argument on +Pipeline components that are independent can also be reused across pipelines. +Instead of adding a new blank component, you can also copy an existing component +from a trained pipeline by setting the `source` argument on [`nlp.add_pipe`](/api/language#add_pipe). The first argument will then be interpreted as the name of the component in the source pipeline – for instance, `"ner"`. This is especially useful for -[training a model](/usage/training#config-components) because it lets you mix -and match components and create fully custom model packages with updated -pretrained components and new components trained on your data. +[training a pipeline](/usage/training#config-components) because it lets you mix +and match components and create fully custom pipeline packages with updated +trained components and new components trained on your data. - + -When reusing components across models, keep in mind that the **vocabulary**, -**vectors** and model settings **must match**. 
If a pretrained model includes +When reusing components across pipelines, keep in mind that the **vocabulary**, +**vectors** and model settings **must match**. If a trained pipeline includes [word vectors](/usage/linguistic-features#vectors-similarity) and the component -uses them as features, the model you copy it to needs to have the _same_ vectors -available – otherwise, it won't be able to make the same predictions. +uses them as features, the pipeline you copy it to needs to have the _same_ +vectors available – otherwise, it won't be able to make the same predictions. @@ -384,7 +384,7 @@ available – otherwise, it won't be able to make the same predictions. > > Instead of providing a `factory`, component blocks in the training > [config](/usage/training#config) can also define a `source`. The string needs -> to be a loadable spaCy model package or path. The +> to be a loadable spaCy pipeline package or path. The > > ```ini > [components.ner] @@ -404,11 +404,11 @@ available – otherwise, it won't be able to make the same predictions. ### {executable="true"} import spacy -# The source model with different components +# The source pipeline with different components source_nlp = spacy.load("en_core_web_sm") print(source_nlp.pipe_names) -# Add only the entity recognizer to the new blank model +# Add only the entity recognizer to the new blank pipeline nlp = spacy.blank("en") nlp.add_pipe("ner", source=source_nlp) print(nlp.pipe_names) @@ -535,8 +535,8 @@ only being able to modify it afterwards. The [`@Language.component`](/api/language#component) decorator lets you turn a simple function into a pipeline component. It takes at least one argument, the **name** of the component factory. You can use this name to add an instance of -your component to the pipeline. It can also be listed in your model config, so -you can save, load and train models using your component. +your component to the pipeline. It can also be listed in your pipeline config, +so you can save, load and train pipelines using your component. Custom components can be added to the pipeline using the [`add_pipe`](/api/language#add_pipe) method. Optionally, you can either specify @@ -838,7 +838,7 @@ If what you're passing in isn't JSON-serializable – e.g. a custom object like [model](#trainable-components) – saving out the component config becomes impossible because there's no way for spaCy to know _how_ that object was created, and what to do to create it again. This makes it much harder to save, -load and train custom models with custom components. A simple solution is to +load and train custom pipelines with custom components. A simple solution is to **register a function** that returns your resources. The [registry](/api/top-level#registry) lets you **map string names to functions** that create objects, so given a name and optional arguments, spaCy will know how @@ -876,13 +876,13 @@ the result of the registered function is passed in as the key `"dictionary"`. ``` Using a registered function also means that you can easily include your custom -components in models that you [train](/usage/training). To make sure spaCy knows -where to find your custom `@assets` function, you can pass in a Python file via -the argument `--code`. If someone else is using your component, all they have to -do to customize the data is to register their own function and swap out the -name. 
Registered functions can also take **arguments** by the way that can be -defined in the config as well – you can read more about this in the docs on -[training with custom code](/usage/training#custom-code). +components in pipelines that you [train](/usage/training). To make sure spaCy +knows where to find your custom `@assets` function, you can pass in a Python +file via the argument `--code`. If someone else is using your component, all +they have to do to customize the data is to register their own function and swap +out the name. Registered functions can also take **arguments** by the way that +can be defined in the config as well – you can read more about this in the docs +on [training with custom code](/usage/training#custom-code). ### Python type hints and pydantic validation {#type-hints new="3"} @@ -1121,7 +1121,14 @@ loss is calculated and to add evaluation scores to the training output. | [`get_loss`](/api/pipe#get_loss) | Return a tuple of the loss and the gradient for a batch of [`Example`](/api/example) objects. | | [`score`](/api/pipe#score) | Score a batch of [`Example`](/api/example) objects and return a dictionary of scores. The [`@Language.factory`](/api/language#factory) decorator can define the `default_socre_weights` of the component to decide which keys of the scores to display during training and how they count towards the final score. | - + + +For more details on how to implement your own trainable components and model +architectures, and plug existing models implemented in PyTorch or TensorFlow +into your spaCy pipeline, see the usage guide on +[layers and model architectures](/usage/layers-architectures#components). + + ## Extension attributes {#custom-components-attributes new="2"} @@ -1322,9 +1329,9 @@ While it's generally recommended to use the `Doc._`, `Span._` and `Token._` proxies to add your own custom attributes, spaCy offers a few exceptions to allow **customizing the built-in methods** like [`Doc.similarity`](/api/doc#similarity) or [`Doc.vector`](/api/doc#vector) with -your own hooks, which can rely on statistical models you train yourself. For -instance, you can provide your own on-the-fly sentence segmentation algorithm or -document similarity method. +your own hooks, which can rely on components you train yourself. For instance, +you can provide your own on-the-fly sentence segmentation algorithm or document +similarity method. Hooks let you customize some of the behaviors of the `Doc`, `Span` or `Token` objects by adding a component to the pipeline. For instance, to customize the @@ -1456,13 +1463,13 @@ function that takes a `Doc`, modifies it and returns it. method. However, a third-party extension should **never silently overwrite built-ins**, or attributes set by other extensions. -- If you're looking to publish a model that depends on a custom pipeline - component, you can either **require it** in the model package's dependencies, - or – if the component is specific and lightweight – choose to **ship it with - your model package**. Just make sure the +- If you're looking to publish a pipeline package that depends on a custom + pipeline component, you can either **require it** in the package's + dependencies, or – if the component is specific and lightweight – choose to + **ship it with your pipeline package**. 
Just make sure the [`@Language.component`](/api/language#component) or [`@Language.factory`](/api/language#factory) decorator that registers the - custom component runs in your model's `__init__.py` or is exposed via an + custom component runs in your package's `__init__.py` or is exposed via an [entry point](/usage/saving-loading#entry-points). - Once you're ready to share your extension with others, make sure to **add docs @@ -1511,9 +1518,9 @@ def custom_ner_wrapper(doc): return doc ``` -The `custom_ner_wrapper` can then be added to the pipeline of a blank model -using [`nlp.add_pipe`](/api/language#add_pipe). You can also replace the -existing entity recognizer of a pretrained model with +The `custom_ner_wrapper` can then be added to a blank pipeline using +[`nlp.add_pipe`](/api/language#add_pipe). You can also replace the existing +entity recognizer of a trained pipeline with [`nlp.replace_pipe`](/api/language#replace_pipe). Here's another example of a custom model, `your_custom_model`, that takes a list diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md index 97a0caed8..97e3abb6e 100644 --- a/website/docs/usage/projects.md +++ b/website/docs/usage/projects.md @@ -20,10 +20,10 @@ menu: spaCy projects let you manage and share **end-to-end spaCy workflows** for different **use cases and domains**, and orchestrate training, packaging and -serving your custom models. You can start off by cloning a pre-defined project -template, adjust it to fit your needs, load in your data, train a model, export -it as a Python package, upload your outputs to a remote storage and share your -results with your team. spaCy projects can be used via the new +serving your custom pipelines. You can start off by cloning a pre-defined +project template, adjust it to fit your needs, load in your data, train a +pipeline, export it as a Python package, upload your outputs to a remote storage +and share your results with your team. spaCy projects can be used via the new [`spacy project`](/api/cli#project) command and we provide templates in our [`projects`](https://github.com/explosion/projects) repo. @@ -51,7 +51,7 @@ production. Manage and version your data Create labelled training data -Visualize and demo your models +Visualize and demo your pipelines Serve your models and host APIs Distributed and parallel training Track your experiments and results @@ -66,8 +66,8 @@ production. The [`spacy project clone`](/api/cli#project-clone) command clones an existing project template and copies the files to a local directory. You can then run the -project, e.g. to train a model and edit the commands and scripts to build fully -custom workflows. +project, e.g. to train a pipeline and edit the commands and scripts to build +fully custom workflows. ```cli python -m spacy project clone some_example_project @@ -162,12 +162,12 @@ script). > ``` Workflows are series of commands that are run in order and often depend on each -other. For instance, to generate a packaged model, you might start by converting -your data, then run [`spacy train`](/api/cli#train) to train your model on the -converted data and if that's successful, run [`spacy package`](/api/cli#package) -to turn the best model artifact into an installable Python package. The -following command runs the workflow named `all` defined in the `project.yml`, -and executes the commands it specifies, in order: +other. 
For instance, to generate a pipeline package, you might start by +converting your data, then run [`spacy train`](/api/cli#train) to train your +pipeline on the converted data and if that's successful, run +[`spacy package`](/api/cli#package) to turn the best trained artifact into an +installable Python package. The following command runs the workflow named `all` +defined in the `project.yml`, and executes the commands it specifies, in order: ```cli $ python -m spacy project run all @@ -191,11 +191,11 @@ project as a DVC repo. > local: '/mnt/scratch/cache' > ``` -After training a model, you can optionally use the +After training a pipeline, you can optionally use the [`spacy project push`](/api/cli#project-push) command to upload your outputs to a remote storage, using protocols like [S3](https://aws.amazon.com/s3/), [Google Cloud Storage](https://cloud.google.com/storage) or SSH. This can help -you **export** your model packages, **share** work with your team, or **cache +you **export** your pipeline packages, **share** work with your team, or **cache results** to avoid repeating work. ```cli @@ -214,8 +214,8 @@ docs on [remote storage](#remote). The `project.yml` defines the assets a project depends on, like datasets and pretrained weights, as well as a series of commands that can be run separately or as a workflow – for instance, to preprocess the data, convert it to spaCy's -format, train a model, evaluate it and export metrics, package it and spin up a -quick web demo. It looks pretty similar to a config file used to define CI +format, train a pipeline, evaluate it and export metrics, package it and spin up +a quick web demo. It looks pretty similar to a config file used to define CI pipelines. @@ -324,17 +324,17 @@ others are running your project with the same data. Each command defined in the `project.yml` can optionally define a list of dependencies and outputs. These are the files the command requires and creates. -For example, a command for training a model may depend on a +For example, a command for training a pipeline may depend on a [`config.cfg`](/usage/training#config) and the training and evaluation data, and -it will export a directory `model-best`, containing the best model, which you -can then re-use in other commands. +it will export a directory `model-best`, which you can then re-use in other +commands. ```yaml ### project.yml commands: - name: train - help: 'Train a spaCy model using the specified corpus and config' + help: 'Train a spaCy pipeline using the specified corpus and config' script: - 'python -m spacy train ./configs/config.cfg -o training/ --paths.train ./corpus/training.spacy --paths.dev ./corpus/evaluation.spacy' deps: @@ -392,14 +392,14 @@ directory: ├── project.yml # the project settings ├── project.lock # lockfile that tracks inputs/outputs ├── assets/ # downloaded data assets -├── configs/ # model config.cfg files used for training +├── configs/ # pipeline config.cfg files used for training ├── corpus/ # output directory for training corpus -├── metas/ # model meta.json templates used for packaging +├── metas/ # pipeline meta.json templates used for packaging ├── metrics/ # output directory for evaluation metrics ├── notebooks/ # directory for Jupyter notebooks -├── packages/ # output directory for model Python packages +├── packages/ # output directory for pipeline Python packages ├── scripts/ # directory for scripts, e.g. 
referenced in commands -├── training/ # output directory for trained models +├── training/ # output directory for trained pipelines └── ... # any other files, like a requirements.txt etc. ``` @@ -426,7 +426,7 @@ report: ### project.yml commands: - name: test - help: 'Test the trained model' + help: 'Test the trained pipeline' script: - 'pip install pytest pytest-html' - 'python -m pytest ./scripts/tests --html=metrics/test-report.html' @@ -440,8 +440,8 @@ commands: Adding `training/model-best` to the command's `deps` lets you ensure that the file is available. If not, spaCy will show an error and the command won't run. Setting `no_skip: true` means that the command will always run, even if the -dependencies (the trained model) hasn't changed. This makes sense here, because -you typically don't want to skip your tests. +dependencies (the trained pipeline) haven't changed. This makes sense here, +because you typically don't want to skip your tests. ### Writing custom scripts {#custom-scripts} @@ -554,7 +554,7 @@ notebooks with usage examples. -It's typically not a good idea to check large data assets, trained models or +It's typically not a good idea to check large data assets, trained pipelines or other artifacts into a Git repo and you should exclude them from your project template by adding a `.gitignore`. If you want to version your data and models, check out [Data Version Control](#dvc) (DVC), which integrates with spaCy @@ -566,7 +566,7 @@ projects. You can persist your project outputs to a remote storage using the [`project push`](/api/cli#project-push) command. This can help you **export** -your model packages, **share** work with your team, or **cache results** to +your pipeline packages, **share** work with your team, or **cache results** to avoid repeating work. The [`project pull`](/api/cli#project-pull) command will download any outputs that are in the remote storage and aren't available locally. @@ -622,7 +622,7 @@ For instance, let's say you had the following command in your `project.yml`: ```yaml ### project.yml - name: train - help: 'Train a spaCy model using the specified corpus and config' + help: 'Train a spaCy pipeline using the specified corpus and config' script: - 'spacy train ./config.cfg --output training/' deps: @@ -814,8 +814,8 @@ mattis pretium. [Streamlit](https://streamlit.io) is a Python framework for building interactive data apps. The [`spacy-streamlit`](https://github.com/explosion/spacy-streamlit) package helps you integrate spaCy visualizations into your Streamlit apps and -quickly spin up demos to explore your models interactively. It includes a full -embedded visualizer, as well as individual components. +quickly spin up demos to explore your pipelines interactively. It includes a +full embedded visualizer, as well as individual components. ```bash $ pip install spacy_streamlit @@ -829,11 +829,11 @@ $ pip install spacy_streamlit Using [`spacy-streamlit`](https://github.com/explosion/spacy-streamlit), your projects can easily define their own scripts that spin up an interactive -visualizer, using the latest model you trained, or a selection of models so you -can compare their results. The following script starts an +visualizer, using the latest pipeline you trained, or a selection of pipelines +so you can compare their results. 
The following script starts an [NER visualizer](/usage/visualizers#ent) and takes two positional command-line -argument you can pass in from your `config.yml`: a comma-separated list of model -paths and an example text to use as the default text. +argument you can pass in from your `config.yml`: a comma-separated list of paths +to load the pipelines from and an example text to use as the default text. ```python ### scripts/visualize.py @@ -841,8 +841,8 @@ import spacy_streamlit import sys DEFAULT_TEXT = sys.argv[2] if len(sys.argv) >= 3 else "" -MODELS = [name.strip() for name in sys.argv[1].split(",")] -spacy_streamlit.visualize(MODELS, DEFAULT_TEXT, visualizers=["ner"]) +PIPELINES = [name.strip() for name in sys.argv[1].split(",")] +spacy_streamlit.visualize(PIPELINES, DEFAULT_TEXT, visualizers=["ner"]) ``` > #### Example usage @@ -856,7 +856,7 @@ spacy_streamlit.visualize(MODELS, DEFAULT_TEXT, visualizers=["ner"]) ### project.yml commands: - name: visualize - help: "Visualize the model's output interactively using Streamlit" + help: "Visualize the pipeline's output interactively using Streamlit" script: - 'streamlit run ./scripts/visualize.py ./training/model-best "I like Adidas shoes."' deps: @@ -879,8 +879,8 @@ mattis pretium. for building REST APIs with Python, based on Python [type hints](https://fastapi.tiangolo.com/python-types/). It's become a popular library for serving machine learning models and you can use it in your spaCy -projects to quickly serve up a trained model and make it available behind a REST -API. +projects to quickly serve up a trained pipeline and make it available behind a +REST API. ```python # TODO: show an example that addresses some of the main concerns for serving ML (workers etc.) @@ -897,7 +897,7 @@ API. ### project.yml commands: - name: serve - help: "Serve the trained model with FastAPI" + help: "Serve the trained pipeline with FastAPI" script: - 'python ./scripts/serve.py ./training/model-best' deps: diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index a589c556e..fb54c9936 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -759,7 +759,7 @@ whitespace, making them easy to match as well. from spacy.lang.en import English from spacy.matcher import Matcher -nlp = English() # We only want the tokenizer, so no need to load a model +nlp = English() # We only want the tokenizer, so no need to load a pipeline matcher = Matcher(nlp.vocab) pos_emoji = ["😀", "😃", "😂", "🤣", "😊", "😍"] # Positive emoji @@ -893,12 +893,13 @@ pattern covering the exact tokenization of the term. To create the patterns, each phrase has to be processed with the `nlp` object. -If you have a model loaded, doing this in a loop or list comprehension can -easily become inefficient and slow. If you **only need the tokenization and -lexical attributes**, you can run [`nlp.make_doc`](/api/language#make_doc) -instead, which will only run the tokenizer. For an additional speed boost, you -can also use the [`nlp.tokenizer.pipe`](/api/tokenizer#pipe) method, which will -process the texts as a stream. +If you have a trained pipeline loaded, doing this in a loop or list +comprehension can easily become inefficient and slow. If you **only need the +tokenization and lexical attributes**, you can run +[`nlp.make_doc`](/api/language#make_doc) instead, which will only run the +tokenizer. 
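As a rough, self-contained sketch of the `nlp.make_doc` approach – the term
list and match key below are made up for illustration – creating phrase
patterns with only the tokenizer might look like this:

```python
### Sketch: phrase patterns via the tokenizer only
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()  # blank pipeline – we only need the tokenizer and vocab
matcher = PhraseMatcher(nlp.vocab)

TERMS = ["machine learning", "natural language processing"]  # hypothetical terms
# nlp.make_doc only runs the tokenizer, so this stays fast for long term lists
patterns = [nlp.make_doc(term) for term in TERMS]
matcher.add("TERMS", patterns)

doc = nlp.make_doc("I work on natural language processing.")
matches = [(nlp.vocab.strings[match_id], start, end) for match_id, start, end in matcher(doc)]
print(matches)
```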
For an additional speed boost, you can also use the +[`nlp.tokenizer.pipe`](/api/tokenizer#pipe) method, which will process the texts +as a stream. ```diff - patterns = [nlp(term) for term in LOTS_OF_TERMS] @@ -977,7 +978,7 @@ of an advantage over writing one or two token patterns. The [`EntityRuler`](/api/entityruler) is an exciting new component that lets you add named entities based on pattern dictionaries, and makes it easy to combine rule-based and statistical named entity recognition for even more powerful -models. +pipelines. ### Entity Patterns {#entityruler-patterns} @@ -1021,8 +1022,8 @@ doc = nlp("Apple is opening its first big office in San Francisco.") print([(ent.text, ent.label_) for ent in doc.ents]) ``` -The entity ruler is designed to integrate with spaCy's existing statistical -models and enhance the named entity recognizer. If it's added **before the +The entity ruler is designed to integrate with spaCy's existing pipeline +components and enhance the named entity recognizer. If it's added **before the `"ner"` component**, the entity recognizer will respect the existing entity spans and adjust its predictions around it. This can significantly improve accuracy in some cases. If it's added **after the `"ner"` component**, the @@ -1111,20 +1112,20 @@ versa. When you save out an `nlp` object that has an `EntityRuler` added to its -pipeline, its patterns are automatically exported to the model directory: +pipeline, its patterns are automatically exported to the pipeline directory: ```python nlp = spacy.load("en_core_web_sm") ruler = nlp.add_pipe("entity_ruler") ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}]) -nlp.to_disk("/path/to/model") +nlp.to_disk("/path/to/pipeline") ``` -The saved model now includes the `"entity_ruler"` in its -[`config.cfg`](/api/data-formats#config) and the model directory contains a file -`entityruler.jsonl` with the patterns. When you load the model back in, all -pipeline components will be restored and deserialized – including the entity -ruler. This lets you ship powerful model packages with binary weights _and_ +The saved pipeline now includes the `"entity_ruler"` in its +[`config.cfg`](/api/data-formats#config) and the pipeline directory contains a +file `entityruler.jsonl` with the patterns. When you load the pipeline back in, +all pipeline components will be restored and deserialized – including the entity +ruler. This lets you ship powerful pipeline packages with binary weights _and_ rules included! ### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"} @@ -1141,7 +1142,7 @@ of `"phrase_matcher_attr": "POS"` for the entity ruler. Running the full language pipeline across every pattern in a large list scales linearly and can therefore take a long time on large amounts of phrase patterns. -As of spaCy 2.2.4 the `add_patterns` function has been refactored to use +As of spaCy v2.2.4 the `add_patterns` function has been refactored to use nlp.pipe on all phrase patterns resulting in about a 10x-20x speed up with 5,000-100,000 phrase patterns respectively. Even with this speedup (but especially if you're using an older version) the `add_patterns` function can @@ -1168,7 +1169,7 @@ order to implement more abstract logic. 
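To make this concrete, here's a minimal sketch of a rule-based component that
runs after the statistical entity recognizer and post-processes its
predictions – the component name, label check and stop list are invented for
illustration:

```python
### Sketch: rule-based cleanup after statistical NER
from spacy.language import Language

# Hypothetical stop list of strings we never want tagged as ORG
NOT_ORGS = {"journal", "committee"}

@Language.component("filter_org_entities")
def filter_org_entities(doc):
    # Look at the entity spans predicted by the statistical model and keep
    # only the ones that pass a simple rule
    doc.ents = [
        ent
        for ent in doc.ents
        if not (ent.label_ == "ORG" and ent.text.lower() in NOT_ORGS)
    ]
    return doc
```

Because the component would be added with
`nlp.add_pipe("filter_org_entities", after="ner")`, it always sees the
statistical predictions it's meant to correct.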
### Example: Expanding named entities {#models-rules-ner} -When using the a pretrained +When using a trained [named entity recognition](/usage/linguistic-features/#named-entities) model to extract information from your texts, you may find that the predicted span only includes parts of the entity you're looking for. Sometimes, this happens if @@ -1178,15 +1179,15 @@ what you need for your application. > #### Where corpora come from > -> Corpora used to train models from scratch are often produced in academia. They -> contain text from various sources with linguistic features labeled manually by -> human annotators (following a set of specific guidelines). The corpora are -> then distributed with evaluation data, so other researchers can benchmark -> their algorithms and everyone can report numbers on the same data. However, -> most applications need to learn information that isn't contained in any -> available corpus. +> Corpora used to train pipelines from scratch are often produced in academia. +> They contain text from various sources with linguistic features labeled +> manually by human annotators (following a set of specific guidelines). The +> corpora are then distributed with evaluation data, so other researchers can +> benchmark their algorithms and everyone can report numbers on the same data. +> However, most applications need to learn information that isn't contained in +> any available corpus. -For example, the corpus spaCy's [English models](/models/en) were trained on +For example, the corpus spaCy's [English pipelines](/models/en) were trained on defines a `PERSON` entity as just the **person name**, without titles like "Mr." or "Dr.". This makes sense, because it makes it easier to resolve the entity type back to a knowledge base. But what if your application needs the full diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md index 3f9435f5e..9955e7d84 100644 --- a/website/docs/usage/saving-loading.md +++ b/website/docs/usage/saving-loading.md @@ -4,7 +4,7 @@ menu: - ['Basics', 'basics'] - ['Serialization Methods', 'serialization-methods'] - ['Entry Points', 'entry-points'] - - ['Models', 'models'] + - ['Trained Pipelines', 'models'] --- ## Basics {#basics hidden="true"} @@ -25,10 +25,10 @@ can load in the data. > #### Saving the meta and config > > The [`nlp.meta`](/api/language#meta) attribute is a JSON-serializable -> dictionary and contains all model meta information like the author and license -> information. The [`nlp.config`](/api/language#config) attribute is a +> dictionary and contains all pipeline meta information like the author and +> license information. The [`nlp.config`](/api/language#config) attribute is a > dictionary containing the training configuration, pipeline component factories -> and other settings. It is saved out with a model as the `config.cfg`. +> and other settings. It is saved out with a pipeline as the `config.cfg`. ```python ### Serialize @@ -45,12 +45,11 @@ for pipe_name in pipeline: nlp.from_bytes(bytes_data) ``` -This is also how spaCy does it under the hood when loading a model: it loads the -model's `config.cfg` containing the language and pipeline information, -initializes the language class, creates and adds the pipeline components based -on the defined -[factories](/usage/processing-pipeline#custom-components-factories) and _then_ -loads in the binary data. 
You can read more about this process +This is also how spaCy does it under the hood when loading a pipeline: it loads +the `config.cfg` containing the language and pipeline information, initializes +the language class, creates and adds the pipeline components based on the +defined [factories](/usage/processing-pipeline#custom-components-factories) and +_then_ loads in the binary data. You can read more about this process [here](/usage/processing-pipelines#pipelines). ### Serializing Doc objects efficiently {#docs new="2.2"} @@ -168,10 +167,10 @@ data = pickle.dumps(span_doc) ## Implementing serialization methods {#serialization-methods} When you call [`nlp.to_disk`](/api/language#to_disk), -[`nlp.from_disk`](/api/language#from_disk) or load a model package, spaCy will -iterate over the components in the pipeline, check if they expose a `to_disk` or -`from_disk` method and if so, call it with the path to the model directory plus -the string name of the component. For example, if you're calling +[`nlp.from_disk`](/api/language#from_disk) or load a pipeline package, spaCy +will iterate over the components in the pipeline, check if they expose a +`to_disk` or `from_disk` method and if so, call it with the path to the pipeline +directory plus the string name of the component. For example, if you're calling `nlp.to_disk("/path")`, the data for the named entity recognizer will be saved in `/path/ner`. @@ -191,8 +190,8 @@ add to that data and saves and loads the data to and from a JSON file. > [source](https://github.com/explosion/spaCy/tree/master/spacy/pipeline/entityruler.py). > Patterns added to the component will be saved to a `.jsonl` file if the > pipeline is serialized to disk, and to a bytestring if the pipeline is -> serialized to bytes. This allows saving out a model with a rule-based entity -> recognizer and including all rules _with_ the model data. +> serialized to bytes. This allows saving out a pipeline with a rule-based +> entity recognizer and including all rules _with_ the component data. ```python ### {highlight="14-18,20-25"} @@ -232,7 +231,7 @@ component's `to_disk` method. nlp = spacy.load("en_core_web_sm") my_component = nlp.add_pipe("my_component") my_component.add({"hello": "world"}) -nlp.to_disk("/path/to/model") +nlp.to_disk("/path/to/pipeline") ``` The contents of the directory would then look like this. @@ -241,15 +240,15 @@ file `data.json` in its subdirectory: ```yaml ### Directory structure {highlight="2-3"} -└── /path/to/model +└── /path/to/pipeline ├── my_component # data serialized by "my_component" │ └── data.json ├── ner # data for "ner" component ├── parser # data for "parser" component ├── tagger # data for "tagger" component - ├── vocab # model vocabulary - ├── meta.json # model meta.json - ├── config.cfg # model config + ├── vocab # pipeline vocabulary + ├── meta.json # pipeline meta.json + ├── config.cfg # pipeline config └── tokenizer # tokenization rules ``` @@ -258,18 +257,19 @@ When you load the data back in, spaCy will call the custom component's contents of `data.json`, convert them to a Python object and restore the component state. The same works for other types of data, of course – for instance, you could add a -[wrapper for a model](/usage/processing-pipelines#wrapping-models-libraries) -trained with a different library like TensorFlow or PyTorch and make spaCy load -its weights automatically when you load the model package. 
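As a rough sketch of that last point – the component name, the model
architecture and the `weights.pt` filename below are all placeholders – a
wrapper around a PyTorch module could hook into the same `to_disk` and
`from_disk` mechanism:

```python
### Sketch: serializing the weights of a wrapped PyTorch model
from pathlib import Path

import torch
from spacy.language import Language

class TorchWrapper:
    def __init__(self):
        # Stand-in module – in practice, whatever architecture you trained
        self.model = torch.nn.Linear(300, 2)

    def __call__(self, doc):
        # A real component would run self.model here and attach its output
        # to the Doc – this sketch just passes the Doc through unchanged
        return doc

    def to_disk(self, path, exclude=tuple()):
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        torch.save(self.model.state_dict(), path / "weights.pt")

    def from_disk(self, path, exclude=tuple()):
        self.model.load_state_dict(torch.load(Path(path) / "weights.pt"))
        return self

@Language.factory("torch_wrapper")
def create_torch_wrapper(nlp, name):
    return TorchWrapper()
```

With the factory registered, calling `nlp.to_disk` and `nlp.from_disk` – or
loading a pipeline package that includes `"torch_wrapper"` – will save and
restore the PyTorch weights together with the rest of the pipeline data.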
+[wrapper for a model](/usage/layers-architectures#frameworks) trained with a +different library like TensorFlow or PyTorch and make spaCy load its weights +automatically when you load the pipeline package. -When you load back a model with custom components, make sure that the components -are **available** and that the [`@Language.component`](/api/language#component) -or [`@Language.factory`](/api/language#factory) decorators are executed _before_ -your model is loaded back. Otherwise, spaCy won't know how to resolve the string -name of a component factory like `"my_component"` back to a function. For more -details, see the documentation on +When you load back a pipeline with custom components, make sure that the +components are **available** and that the +[`@Language.component`](/api/language#component) or +[`@Language.factory`](/api/language#factory) decorators are executed _before_ +your pipeline is loaded back. Otherwise, spaCy won't know how to resolve the +string name of a component factory like `"my_component"` back to a function. For +more details, see the documentation on [adding factories](/usage/processing-pipelines#custom-components-factories) or use [entry points](#entry-points) to make your extension package expose your custom components to spaCy automatically. @@ -297,18 +297,19 @@ installed in the same environment – that's it. ### Custom components via entry points {#entry-points-components} -When you load a model, spaCy will generally use the model's `config.cfg` to set -up the language class and construct the pipeline. The pipeline is specified as a +When you load a pipeline, spaCy will generally use its `config.cfg` to set up +the language class and construct the pipeline. The pipeline is specified as a list of strings, e.g. `pipeline = ["tagger", "paser", "ner"]`. For each of those strings, spaCy will call `nlp.add_pipe` and look up the name in all factories defined by the decorators [`@Language.component`](/api/language#component) and [`@Language.factory`](/api/language#factory). This means that you have to import -your custom components _before_ loading the model. +your custom components _before_ loading the pipeline. -Using entry points, model packages and extension packages can define their own -`"spacy_factories"`, which will be loaded automatically in the background when -the `Language` class is initialized. So if a user has your package installed, -they'll be able to use your components – even if they **don't import them**! +Using entry points, pipeline packages and extension packages can define their +own `"spacy_factories"`, which will be loaded automatically in the background +when the `Language` class is initialized. So if a user has your package +installed, they'll be able to use your components – even if they **don't import +them**! To stick with the theme of [this entry points blog post](https://amir.rachum.com/blog/2017/07/28/python-entry-points/), @@ -343,10 +344,10 @@ def snek_component(doc): Since it's a very complex and sophisticated module, you want to split it off into its own package so you can version it and upload it to PyPi. You also want -your custom model to be able to define `pipeline = ["snek"]` in its +your custom package to be able to define `pipeline = ["snek"]` in its `config.cfg`. For that, you need to be able to tell spaCy where to find the component `"snek"`. If you don't do this, spaCy will raise an error when you try -to load the model because there's no built-in `"snek"` component. 
To add an +to load the pipeline because there's no built-in `"snek"` component. To add an entry to the factories, you can now expose it in your `setup.py` via the `entry_points` dictionary: @@ -380,7 +381,7 @@ $ python setup.py develop spaCy is now able to create the pipeline component `"snek"` – even though you never imported `snek_component`. When you save the [`nlp.config`](/api/language#config) to disk, it includes an entry for your -`"snek"` component and any model you train with this config will include the +`"snek"` component and any pipeline you train with this config will include the component and know how to load it – if your `snek` package is installed. > #### config.cfg (excerpt) @@ -449,9 +450,9 @@ entry_points={ The factory can also implement other pipeline component like `to_disk` and `from_disk` for serialization, or even `update` to make the component trainable. -If a component exposes a `from_disk` method and is included in a model's -pipeline, spaCy will call it on load. This lets you ship custom data with your -model. When you save out a model using `nlp.to_disk` and the component exposes a +If a component exposes a `from_disk` method and is included in a pipeline, spaCy +will call it on load. This lets you ship custom data with your pipeline package. +When you save out a pipeline using `nlp.to_disk` and the component exposes a `to_disk` method, it will be called with the disk path. ```python @@ -467,8 +468,8 @@ def from_disk(self, path, exclude=tuple()): return self ``` -The above example will serialize the current snake in a `snek.txt` in the model -data directory. When a model using the `snek` component is loaded, it will open +The above example will serialize the current snake in a `snek.txt` in the data +directory. When a pipeline using the `snek` component is loaded, it will open the `snek.txt` and make it available to the component. ### Custom language classes via entry points {#entry-points-languages} @@ -476,7 +477,7 @@ the `snek.txt` and make it available to the component. To stay with the theme of the previous example and [this blog post on entry points](https://amir.rachum.com/blog/2017/07/28/python-entry-points/), let's imagine you wanted to implement your own `SnekLanguage` class for your -custom model – but you don't necessarily want to modify spaCy's code to add a +custom pipeline – but you don't necessarily want to modify spaCy's code to add a language. In your package, you could then implement the following [custom language subclass](/usage/linguistic-features#language-subclass): @@ -510,10 +511,10 @@ setup( ``` In spaCy, you can then load the custom `snk` language and it will be resolved to -`SnekLanguage` via the custom entry point. This is especially relevant for model -packages you train, which could then specify `lang = snk` in their `config.cfg` -without spaCy raising an error because the language is not available in the core -library. +`SnekLanguage` via the custom entry point. This is especially relevant for +pipeline packages you [train](/usage/training), which could then specify +`lang = snk` in their `config.cfg` without spaCy raising an error because the +language is not available in the core library. ### Custom displaCy colors via entry points {#entry-points-displacy new="2.2"} @@ -526,7 +527,7 @@ values. 
> #### Domain-specific NER labels > -> Good examples of models with domain-specific label schemes are +> Good examples of pipelines with domain-specific label schemes are > [scispaCy](/universe/project/scispacy) and > [Blackstone](/universe/project/blackstone). @@ -559,24 +560,23 @@ import DisplaCyEntSnekHtml from 'images/displacy-ent-snek.html'
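To round off the entry point examples above, here's a rough sketch of what
custom displaCy colors could look like for the hypothetical `snek` package –
the labels and HEX values are made up, and the `spacy_displacy_colors` entry
point group is what spaCy checks for:

```python
### snek.py (sketch)
# Entity labels mapped to color values, exposed via an entry point so
# displaCy can pick them up automatically
displacy_colors = {"SNEK": "#3dff74", "HUMAN": "#cfc5ff"}
```

```python
### setup.py (sketch)
from setuptools import setup

setup(
    name="snek",
    entry_points={
        "spacy_displacy_colors": ["snek = snek:displacy_colors"]
    }
)
```

After installing the package in the same environment, displaCy should render
`SNEK` and `HUMAN` entities in these colors without any further configuration.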