From 319692aa539fe275c76ed82030396ae41726500b Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 17 Aug 2020 14:05:48 +0200 Subject: [PATCH 01/10] fix typos --- CONTRIBUTING.md | 156 +++++++++++++-------------- website/docs/api/architectures.md | 45 ++++---- website/docs/usage/index.md | 2 +- website/docs/usage/saving-loading.md | 6 +- website/setup/jinja_to_js.py | 2 +- 5 files changed, 105 insertions(+), 106 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 6b7881dd2..81cfbf8cb 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -43,33 +43,33 @@ can also submit a [regression test](#fixing-bugs) straight away. When you're opening an issue to report the bug, simply refer to your pull request in the issue body. A few more tips: -- **Describing your issue:** Try to provide as many details as possible. What - exactly goes wrong? _How_ is it failing? Is there an error? - "XY doesn't work" usually isn't that helpful for tracking down problems. Always - remember to include the code you ran and if possible, extract only the relevant - parts and don't just dump your entire script. This will make it easier for us to - reproduce the error. +- **Describing your issue:** Try to provide as many details as possible. What + exactly goes wrong? _How_ is it failing? Is there an error? + "XY doesn't work" usually isn't that helpful for tracking down problems. Always + remember to include the code you ran and if possible, extract only the relevant + parts and don't just dump your entire script. This will make it easier for us to + reproduce the error. -- **Getting info about your spaCy installation and environment:** If you're - using spaCy v1.7+, you can use the command line interface to print details and - even format them as Markdown to copy-paste into GitHub issues: - `python -m spacy info --markdown`. +- **Getting info about your spaCy installation and environment:** If you're + using spaCy v1.7+, you can use the command line interface to print details and + even format them as Markdown to copy-paste into GitHub issues: + `python -m spacy info --markdown`. -- **Checking the model compatibility:** If you're having problems with a - [statistical model](https://spacy.io/models), it may be because the - model is incompatible with your spaCy installation. In spaCy v2.0+, you can check - this on the command line by running `python -m spacy validate`. +- **Checking the model compatibility:** If you're having problems with a + [statistical model](https://spacy.io/models), it may be because the + model is incompatible with your spaCy installation. In spaCy v2.0+, you can check + this on the command line by running `python -m spacy validate`. -- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+ - comes with [built-in visualizers](https://spacy.io/usage/visualizers) that - you can run from within your script or a Jupyter notebook. For some issues, it's - helpful to **include a screenshot** of the visualization. You can simply drag and - drop the image into GitHub's editor and it will be uploaded and included. +- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+ + comes with [built-in visualizers](https://spacy.io/usage/visualizers) that + you can run from within your script or a Jupyter notebook. For some issues, it's + helpful to **include a screenshot** of the visualization. You can simply drag and + drop the image into GitHub's editor and it will be uploaded and included. 
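To make the visualizer suggestion above concrete, here is a minimal sketch (not part of the patch), assuming a trained pipeline such as `en_core_web_sm` is installed; the model name is an illustrative assumption:

```python
# Minimal sketch: rendering entities with displaCy, as the bullet above
# suggests. Assumes the small English model has been installed first, e.g.
# via `python -m spacy download en_core_web_sm`.
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# In a Jupyter notebook this renders inline; in a plain script, use
# displacy.serve(doc, style="ent") to view the visualization in a browser.
displacy.render(doc, style="ent")
```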
-- **Sharing long blocks of code or logs:** If you need to include long code,
-  logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
-  [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
-  so it only becomes visible on click, making the issue easier to read and follow.
+- **Sharing long blocks of code or logs:** If you need to include long code,
+  logs or tracebacks, you can wrap them in `<details>` and `</details>
`. This + [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details) + so it only becomes visible on click, making the issue easier to read and follow. ### Issue labels @@ -94,39 +94,39 @@ shipped in the core library, and what could be provided in other packages. Our philosophy is to prefer a smaller core library. We generally ask the following questions: -- **What would this feature look like if implemented in a separate package?** - Some features would be very difficult to implement externally – for example, - changes to spaCy's built-in methods. In contrast, a library of word - alignment functions could easily live as a separate package that depended on - spaCy — there's little difference between writing `import word_aligner` and - `import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement - [custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components), - and add your own attributes, properties and methods to the `Doc`, `Token` and - `Span`. If you're looking to implement a new spaCy feature, starting with a - custom component package is usually the best strategy. You won't have to worry - about spaCy's internals and you can test your module in an isolated - environment. And if it works well, we can always integrate it into the core - library later. +- **What would this feature look like if implemented in a separate package?** + Some features would be very difficult to implement externally – for example, + changes to spaCy's built-in methods. In contrast, a library of word + alignment functions could easily live as a separate package that depended on + spaCy — there's little difference between writing `import word_aligner` and + `import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement + [custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components), + and add your own attributes, properties and methods to the `Doc`, `Token` and + `Span`. If you're looking to implement a new spaCy feature, starting with a + custom component package is usually the best strategy. You won't have to worry + about spaCy's internals and you can test your module in an isolated + environment. And if it works well, we can always integrate it into the core + library later. -- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?** - Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or - TensorFlow/Keras do lots of useful things — but we don't want to have them as - dependencies. If the feature requires functionality in one of these libraries, - it's probably better to break it out into a different package. +- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?** + Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or + TensorFlow/Keras do lots of useful things — but we don't want to have them as + dependencies. If the feature requires functionality in one of these libraries, + it's probably better to break it out into a different package. -- **Is the feature orthogonal to the current spaCy functionality, or overlapping?** - spaCy strongly prefers to avoid having 6 different ways of doing the same thing. - As better techniques are developed, we prefer to drop support for "the old way". - However, it's rare that one approach _entirely_ dominates another. It's very - common that there's still a use-case for the "obsolete" approach. 
For instance, - [WordNet](https://wordnet.princeton.edu/) is still very useful — but word - vectors are better for most use-cases, and the two approaches to lexical - semantics do a lot of the same things. spaCy therefore only supports word - vectors, and support for WordNet is currently left for other packages. +- **Is the feature orthogonal to the current spaCy functionality, or overlapping?** + spaCy strongly prefers to avoid having 6 different ways of doing the same thing. + As better techniques are developed, we prefer to drop support for "the old way". + However, it's rare that one approach _entirely_ dominates another. It's very + common that there's still a use-case for the "obsolete" approach. For instance, + [WordNet](https://wordnet.princeton.edu/) is still very useful — but word + vectors are better for most use-cases, and the two approaches to lexical + semantics do a lot of the same things. spaCy therefore only supports word + vectors, and support for WordNet is currently left for other packages. -- **Do you need the feature to get basic things done?** We do want spaCy to be - at least somewhat self-contained. If we keep needing some feature in our - recipes, that does provide some argument for bringing it "in house". +- **Do you need the feature to get basic things done?** We do want spaCy to be + at least somewhat self-contained. If we keep needing some feature in our + recipes, that does provide some argument for bringing it "in house". ### Getting started @@ -203,10 +203,10 @@ your files on save: ```json { - "python.formatting.provider": "black", - "[python]": { - "editor.formatOnSave": true - } + "python.formatting.provider": "black", + "[python]": { + "editor.formatOnSave": true + } } ``` @@ -216,7 +216,7 @@ list of available editor integrations. #### Disabling formatting There are a few cases where auto-formatting doesn't improve readability – for -example, in some of the the language data files like the `tag_map.py`, or in +example, in some of the language data files like the `tag_map.py`, or in the tests that construct `Doc` objects from lists of words and other labels. Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting for that particular code. Here's an example: @@ -397,10 +397,10 @@ Python. If it's not fast enough the first time, just switch to Cython. ### Resources to get you started -- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org) -- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org) -- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai) -- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai) +- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org) +- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org) +- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai) +- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai) ## Adding tests @@ -440,25 +440,25 @@ simply click on the "Suggest edits" button at the bottom of a page. We're very excited about all the new possibilities for **community extensions** and plugins in spaCy v2.0, and we can't wait to see what you build with it! -- An extension or plugin should add substantial functionality, be - **well-documented** and **open-source**. 
It should be available for users to download - and install as a Python package – for example via [PyPi](http://pypi.python.org). +- An extension or plugin should add substantial functionality, be + **well-documented** and **open-source**. It should be available for users to download + and install as a Python package – for example via [PyPi](http://pypi.python.org). -- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped - as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components) - that users can **add to their processing pipeline** using `nlp.add_pipe()`. +- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped + as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components) + that users can **add to their processing pipeline** using `nlp.add_pipe()`. -- When publishing your extension on GitHub, **tag it** with the topics - [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and - [`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars) - to make it easier to find. Those are also the topics we're linking to from the - spaCy website. If you're sharing your project on Twitter, feel free to tag - [@spacy_io](https://twitter.com/spacy_io) so we can check it out. +- When publishing your extension on GitHub, **tag it** with the topics + [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and + [`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars) + to make it easier to find. Those are also the topics we're linking to from the + spaCy website. If you're sharing your project on Twitter, feel free to tag + [@spacy_io](https://twitter.com/spacy_io) so we can check it out. -- Once your extension is published, you can open an issue on the - [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the - [resources directory](https://spacy.io/usage/resources#extensions) on the - website. +- Once your extension is published, you can open an issue on the + [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the + [resources directory](https://spacy.io/usage/resources#extensions) on the + website. 📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).** diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index cc6f44fcc..0ae874d8a 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -489,18 +489,17 @@ network has an internal CNN Tok2Vec layer and uses attention. > nO = null > ``` -| Name | Type | Description | -| --------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. | -| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. | -| `width` | int | Output dimension of the feature encoding step. | -| `embed_size` | int | Input dimension of the feature encoding step. | -| `conv_depth` | int | Depth of the Tok2Vec layer. | -| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. | -| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. 
For instance, `ngram_size=3`would give unigram, trigram and bigram features. | -| `dropout` | float | The dropout rate. | -| `nO` | int | Output dimension, determined by the number of different labels. If not set, the the [`TextCategorizer`](/api/textcategorizer) component will set it when | -| `begin_training` is called. | +| Name | Type | Description | +| -------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. | +| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. | +| `width` | int | Output dimension of the feature encoding step. | +| `embed_size` | int | Input dimension of the feature encoding step. | +| `conv_depth` | int | Depth of the Tok2Vec layer. | +| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. | +| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. | +| `dropout` | float | The dropout rate. | +| `nO` | int | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. | ### spacy.TextCatCNN.v1 {#TextCatCNN} @@ -527,11 +526,11 @@ A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster. -| Name | Type | Description | -| ------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. | -| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. | -| `nO` | int | Output dimension, determined by the number of different labels. If not set, the the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. | +| Name | Type | Description | +| ------------------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. | +| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. | +| `nO` | int | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. | ### spacy.TextCatBOW.v1 {#TextCatBOW} @@ -549,12 +548,12 @@ architecture is usually less accurate than the ensemble, but runs faster. An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. 
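As a rough illustration of how the settings in the parameter tables below fit together, the registered architecture can be resolved by name and called directly. This is a minimal sketch (not part of the patch), assuming the `spacy.registry` API that later patches in this series also use:

```python
# Minimal sketch (assumption: the spacy.registry API used elsewhere in this
# patch series). Resolve the registered architecture and build a Thinc model.
import spacy

make_textcat = spacy.registry.architectures.get("spacy.TextCatBOW.v1")
model = make_textcat(
    exclusive_classes=True,  # categories are mutually exclusive
    ngram_size=3,            # use unigram, bigram and trigram features
    no_output_layer=False,   # keep the Softmax/Logistic output layer
    nO=None,                 # left unset; the TextCategorizer fills it in
)
```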
-| Name | Type | Description | -| ------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. | -| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. | -| `no_output_layer` | float | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`. | -| `nO` | int | Output dimension, determined by the number of different labels. If not set, the the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. | +| Name | Type | Description | +| ------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. | +| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. | +| `no_output_layer` | float | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`. | +| `nO` | int | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. | ## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"} diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md index bda9f76d6..a73753780 100644 --- a/website/docs/usage/index.md +++ b/website/docs/usage/index.md @@ -169,7 +169,7 @@ python setup.py build_ext --inplace # compile spaCy Compared to regular install via pip, the [`requirements.txt`](https://github.com/explosion/spaCy/tree/master/requirements.txt) -additionally installs developer dependencies such as Cython. See the the +additionally installs developer dependencies such as Cython. See the [quickstart widget](#quickstart) to get the right commands for your platform and Python version. diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md index 904477733..35112c02e 100644 --- a/website/docs/usage/saving-loading.md +++ b/website/docs/usage/saving-loading.md @@ -551,9 +551,9 @@ setup( ) ``` -After installing the package, the the custom colors will be used when -visualizing text with `displacy`. Whenever the label `SNEK` is assigned, it will -be displayed in `#3dff74`. +After installing the package, the custom colors will be used when visualizing +text with `displacy`. Whenever the label `SNEK` is assigned, it will be +displayed in `#3dff74`. import DisplaCyEntSnekHtml from 'images/displacy-ent-snek.html' diff --git a/website/setup/jinja_to_js.py b/website/setup/jinja_to_js.py index a2c896151..0d363375e 100644 --- a/website/setup/jinja_to_js.py +++ b/website/setup/jinja_to_js.py @@ -2,7 +2,7 @@ # With additional functionality: in/not in, replace, pprint, round, + for lists, # rendering empty dicts # This script is mostly used to generate the JavaScript function for the -# training quicktart widget. +# training quickstart widget. 
import contextlib import json import re From 6b6f7f3e7379e721a64c22742ef83811a375a8f0 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 17 Aug 2020 14:48:58 +0200 Subject: [PATCH 02/10] fix windows compat --- website/package.json | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/package.json b/website/package.json index e61661c11..441d996fe 100644 --- a/website/package.json +++ b/website/package.json @@ -60,7 +60,7 @@ "clear": "rm -rf .cache", "test": "echo \"Write tests! -> https://gatsby.app/unit-testing\"", "python:install": "pip install setup/requirements.txt", - "python:setup": "cd setup && ./setup.sh" + "python:setup": "cd setup && sh setup.sh" }, "devDependencies": { "@sindresorhus/slugify": "^0.8.0", From 961e818be67105c0f59db1c34749c470aa73eb1a Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 17 Aug 2020 15:02:39 +0200 Subject: [PATCH 03/10] p/r definitions --- website/docs/usage/training.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index fc1624ec1..c2034278d 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -430,13 +430,11 @@ components are weighted equally. - - | Name | Description | | -------------------------- | ----------------------------------------------------------------------------------------------------------------------- | | **Loss** | The training loss representing the amount of work left for the optimizer. Should decrease, but usually not to `0`. | -| **Precision** (P) | Should increase. | -| **Recall** (R) | Should increase. | +| **Precision** (P) | The percentage of generated predictions that are correct. Should increase. | +| **Recall** (R) | The percentage of gold-standard annotations that are in fact predicted. Should increase. | | **F-Score** (F) | The weighted average of precision and recall. Should increase. | | **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. | | **Words per second** (WPS) | Prediction speed in words per second. Should stay stable. | From 4fe4bab1c9c9c3ab60c75fe11006f1cac29f4b41 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Mon, 17 Aug 2020 17:10:15 +0200 Subject: [PATCH 04/10] typo fixes --- website/docs/api/data-formats.md | 2 +- website/docs/usage/training.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md index 4577d7ef3..8ed3232ec 100644 --- a/website/docs/api/data-formats.md +++ b/website/docs/api/data-formats.md @@ -161,7 +161,7 @@ run [`spacy pretrain`](/api/cli#pretrain). | `dropout` | The dropout rate. ~~float~~ | `0.2` | | `n_save_every` | Saving frequency. ~~int~~ | `null` | | `batch_size` | The batch size or batch size [schedule](https://thinc.ai/docs/api-schedules). ~~Union[int, Sequence[int]]~~ | `3000` | -| `seed` | The random seed. ~~int~~ | `${system.seed}` | +| `seed` | The random seed. ~~int~~ | `${system:seed}` | | `use_pytorch_for_gpu_memory` | Allocate memory via PyTorch. ~~bool~~ | `${system:use_pytorch_for_gpu_memory}` | | `tok2vec_model` | tok2vec model section in the config. ~~str~~ | `"components.tok2vec.model"` | | `objective` | The pretraining objective. 
~~Dict[str, Any]~~ | `{"type": "characters", "n_characters": 4}` | diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index f605750f6..94c5ea1cb 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -144,7 +144,7 @@ https://github.com/explosion/spaCy/blob/develop/spacy/default_config.cfg Under the hood, the config is parsed into a dictionary. It's divided into sections and subsections, indicated by the square brackets and dot notation. For -example, `[training]` is a section and `[training.batch_size]` a subsections. +example, `[training]` is a section and `[training.batch_size]` a subsection. Subsections can define values, just like a dictionary, or use the `@` syntax to refer to [registered functions](#config-functions). This allows the config to not just define static settings, but also construct objects like architectures, @@ -156,7 +156,7 @@ sections of a config file are: | `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. | | `components` | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. | | `paths` | Paths to data and other assets. Re-used across the config as variables, e.g. `${paths:train}`, and can be [overwritten](#config-overrides) on the CLI. | -| `system` | Settings related to system and hardware. Re-used across the config as variables, e.g. `${system.seed}`, and can be [overwritten](#config-overrides) on the CLI. | +| `system` | Settings related to system and hardware. Re-used across the config as variables, e.g. `${system:seed}`, and can be [overwritten](#config-overrides) on the CLI. | | `training` | Settings and controls for the training and evaluation process. | | `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). | From 8dcda351ec63d921761865d22bd93bbb469c6296 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Tue, 18 Aug 2020 10:23:27 +0200 Subject: [PATCH 05/10] typo's and quick note on default values --- website/docs/usage/training.md | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 94c5ea1cb..38c40088b 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -433,8 +433,8 @@ components are weighted equally. | Name | Description | | -------------------------- | ----------------------------------------------------------------------------------------------------------------------- | | **Loss** | The training loss representing the amount of work left for the optimizer. Should decrease, but usually not to `0`. | -| **Precision** (P) | The percentage of generated predictions that are correct. Should increase. | -| **Recall** (R) | The percentage of gold-standard annotations that are in fact predicted. Should increase. | +| **Precision** (P) | The percentage of generated predictions that are correct. Should increase. | +| **Recall** (R) | The percentage of gold-standard annotations that are in fact predicted. Should increase. | | **F-Score** (F) | The weighted average of precision and recall. Should increase. | | **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. | | **Words per second** (WPS) | Prediction speed in words per second. Should stay stable. 
|

@@ -483,11 +483,11 @@ language class and `nlp` object at different points of the lifecycle:

| `after_creation` | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object. Useful for modifying the tokenizer. |
| `after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components. |

-The `@spacy.registry.callbacks` decorator lets you register your custom function in the
-`callbacks` [registry](/api/top-level#registry) under a given name. You can then
-reference the function in a config block using the `@callbacks` key. If a block
-contains a key starting with an `@`, it's interpreted as a reference to a
-function. Because you've registered the function, spaCy knows how to create it
+The `@spacy.registry.callbacks` decorator lets you register your custom function
+in the `callbacks` [registry](/api/top-level#registry) under a given name. You
+can then reference the function in a config block using the `@callbacks` key. If
+a block contains a key starting with an `@`, it's interpreted as a reference to
+a function. Because you've registered the function, spaCy knows how to create it
when you reference `"customize_language_data"` in your config. Here's an example
of a callback that runs before the `nlp` object is created and adds a few custom
tokenization rules to the defaults:

@@ -562,9 +562,9 @@ spaCy's configs are powered by our machine learning library Thinc's
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `debug: bool` in the example above will
-ensure that the value received as the argument `debug` is an boolean. If the
+ensure that the value received as the argument `debug` is a boolean. If the
value can't be coerced into a boolean, spaCy will raise an error.
-`start: pydantic.StrictBool` will force the value to be an boolean and raise an
+`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
error if it's not – for instance, if your config defines `1` instead of `true`.

@@ -612,7 +612,9 @@ settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults and all arguments on
the functions need to be represented in the config. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
-[`init fill-config`](/api/cli#init-fill-config).
+[`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
+given parameter is always explicitly set in the config, avoid setting a default
+value for it.

```ini
### config.cfg (excerpt)

From 705e1cb06c1f1d5298e9a41b3f1bfb01376f3001 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Tue, 18 Aug 2020 12:04:05 +0200
Subject: [PATCH 06/10] typo in link

---
 website/docs/api/corpus.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/docs/api/corpus.md b/website/docs/api/corpus.md
index 3631e201e..8c530ab6d 100644
--- a/website/docs/api/corpus.md
+++ b/website/docs/api/corpus.md
@@ -17,7 +17,7 @@ customize the data loading during training, you can register your own
or evaluation data. It takes the same arguments as the `Corpus` class and
returns a callable that yields [`Example`](/api/example) objects.
You can replace it with your own registered function in the -[`@readers` registry](/api/top-level#regsitry) to customize the data loading and +[`@readers` registry](/api/top-level#registry) to customize the data loading and streaming. > #### Example config From 0d55b6ebb45ed66dbd9ef60ec0f79b983ddf093b Mon Sep 17 00:00:00 2001 From: svlandeg Date: Tue, 18 Aug 2020 18:55:56 +0200 Subject: [PATCH 07/10] formatting --- website/docs/api/architectures.md | 44 +++++++++++++++---------------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index b74f0275a..446e6c7c3 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -545,18 +545,18 @@ network has an internal CNN Tok2Vec layer and uses attention. -| Name | Description | -| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | -| `pretrained_vectors` | Whether or not pretrained vectors will be used in addition to the feature vectors. ~~bool~~ | -| `width` | Output dimension of the feature encoding step. ~~int~~ | -| `embed_size` | Input dimension of the feature encoding step. ~~int~~ | -| `conv_depth` | Depth of the tok2vec layer. ~~int~~ | -| `window_size` | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. ~~int~~ | -| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ | -| `dropout` | The dropout rate. ~~float~~ | +| Name | Description | +| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | +| `pretrained_vectors` | Whether or not pretrained vectors will be used in addition to the feature vectors. ~~bool~~ | +| `width` | Output dimension of the feature encoding step. ~~int~~ | +| `embed_size` | Input dimension of the feature encoding step. ~~int~~ | +| `conv_depth` | Depth of the tok2vec layer. ~~int~~ | +| `window_size` | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. ~~int~~ | +| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ | +| `dropout` | The dropout rate. ~~float~~ | | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model~~ | ### spacy.TextCatCNN.v1 {#TextCatCNN} @@ -585,12 +585,12 @@ architecture is usually less accurate than the ensemble, but runs faster. 
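The "mean pooled" step described above is easy to see in isolation. The following is a minimal sketch (not part of the patch), assuming Thinc's `reduce_mean` layer and `Ragged` type for variable-length batches:

```python
# Illustrative sketch of per-document mean pooling, assuming thinc's
# reduce_mean layer and Ragged type (this code is not from the patch).
import numpy
from thinc.api import reduce_mean
from thinc.types import Ragged

pooling = reduce_mean()
# Two "documents" of 2 and 3 tokens, each token a 4-dimensional vector:
vectors = numpy.ones((5, 4), dtype="f")
lengths = numpy.array([2, 3], dtype="int32")
pooled, backprop = pooling(Ragged(vectors, lengths), is_train=False)
print(pooled.shape)  # (2, 4): one mean vector per document
```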
-| Name | Description | -| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | -| `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ | +| Name | Description | +| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | +| `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ | | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model~~ | ### spacy.TextCatBOW.v1 {#TextCatBOW} @@ -610,13 +610,13 @@ others, but may not be as accurate, especially if texts are short. -| Name | Description | -| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | -| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ | -| `no_output_layer` | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes` is `True`, else `Logistic`. ~~bool~~ | +| Name | Description | +| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | +| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ | +| `no_output_layer` | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes` is `True`, else `Logistic`. ~~bool~~ | | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. 
~~Model~~ | ## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"} From a8acedd4baa2dcce742c3e6ded160f11cbcf048d Mon Sep 17 00:00:00 2001 From: svlandeg Date: Tue, 18 Aug 2020 19:15:16 +0200 Subject: [PATCH 08/10] example of custom reader and batcher --- spacy/cli/train.py | 2 +- website/docs/usage/training.md | 60 +++++++++++++++++++++++++++++++++- 2 files changed, 60 insertions(+), 2 deletions(-) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index 202a8555c..43866e4b3 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -235,7 +235,7 @@ def train_while_improving( with each iteration yielding a tuple `(batch, info, is_best_checkpoint)`, where info is a dict, and is_best_checkpoint is in [True, False, None] -- None indicating that the iteration was not evaluated as a checkpoint. - The evaluation is conducted by calling the evaluate callback, which should + The evaluation is conducted by calling the evaluate callback. Positional arguments: nlp: The spaCy pipeline to evaluate. diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index d84635177..911046965 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -657,7 +657,65 @@ factor = 1.005 #### Example: Custom data reading and batching {#custom-code-readers-batchers} - +Some use-cases require streaming in data or manipulating datasets on the fly, +rather than generating all data beforehand and storing it to file. Instead of +using the built-in reader `"spacy.Corpus.v1"`, which uses static file paths, you +can create and register a custom function that generates +[`Example`](/api/example) objects. The resulting generator can be infinite. When +using this dataset for training, other stopping criteria can be used such as +maximum number of steps, or stopping when the loss does not decrease further. + +For instance, in this example we assume a custom function `read_custom_data()` +which loads or generates texts with relevant textcat annotations. Then, small lexical +variations of the input text are created before generating the final `Example` +objects. + +```python +### functions.py +from typing import Callable, Iterable +import spacy +from spacy.gold import Example +import random + +@spacy.registry.readers("corpus_variants.v1") +def stream_data() -> Callable[["Language"], Iterable[Example]]: + def generate_stream(nlp): + for text, cats in read_custom_data(): + random_index = random.randint(0, len(text) - 1) + output_list = list(text) + output_list[random_index] = output_list[random_index].upper() + doc = nlp.make_doc("".join(output_list)) + example = Example.from_dict(doc, {"cats": cats}) + yield example + return generate_stream +``` + +We can also customize the batching strategy by registering a new "batcher" which +turns a stream of items into a stream of batches. spaCy has several useful builtin +batching strategies with customizable sizes , but it's also +easy to implement your own. For instance, the following function takes the stream +of generated Example objects, and removes those which have the exact same underlying +raw text, to avoid duplicates in the final training data. Note that in a more realistic +implementation, you'd also want to check whether the annotations are exactly the same. 
+ +```python +### functions.py +from typing import Callable, Iterable, List +import spacy +from spacy.gold import Example + +@spacy.registry.batchers("filtering_batch.v1") +def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterable[List[Example]]]: + def create_filtered_batches(examples: Iterable[Example]) -> Iterable[List[Example]]: + batch = [] + for eg in examples: + if eg.text not in [x.text for x in batch]: + batch.append(eg) + if len(batch) == size: + yield batch + batch = [] + return create_filtered_batches +``` ### Wrapping PyTorch and TensorFlow {#custom-frameworks} From f9fe5eb3230940250e25f32f6905167bda004e27 Mon Sep 17 00:00:00 2001 From: svlandeg Date: Tue, 18 Aug 2020 19:35:23 +0200 Subject: [PATCH 09/10] clean up example --- website/docs/usage/training.md | 51 ++++++++++++++++++---------------- 1 file changed, 27 insertions(+), 24 deletions(-) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 911046965..4ee17ee21 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -662,17 +662,35 @@ rather than generating all data beforehand and storing it to file. Instead of using the built-in reader `"spacy.Corpus.v1"`, which uses static file paths, you can create and register a custom function that generates [`Example`](/api/example) objects. The resulting generator can be infinite. When -using this dataset for training, other stopping criteria can be used such as -maximum number of steps, or stopping when the loss does not decrease further. +using this dataset for training, stopping criteria such as maximum number of +steps, or stopping when the loss does not decrease further, can be used. -For instance, in this example we assume a custom function `read_custom_data()` -which loads or generates texts with relevant textcat annotations. Then, small lexical -variations of the input text are created before generating the final `Example` -objects. +In this example we assume a custom function `read_custom_data()` +which loads or generates texts with relevant textcat annotations. Then, small +lexical variations of the input text are created before generating the final +`Example` objects. + +We can also customize the batching strategy by registering a new "batcher" which +turns a stream of items into a stream of batches. spaCy has several useful +built-in batching strategies with customizable sizes, but +it's also easy to implement your own. For instance, the following function takes +the stream of generated `Example` objects, and removes those which have the exact +same underlying raw text, to avoid duplicates in the final training data. Note +that in a more realistic implementation, you'd also want to check whether the +annotations are exactly the same. 
+ +> ```ini +> [training.train_corpus] +> @readers = "corpus_variants.v1" +> +> [training.batcher] +> @batchers = "filtering_batch.v1" +> size = 150 +> ``` ```python ### functions.py -from typing import Callable, Iterable +from typing import Callable, Iterable, List import spacy from spacy.gold import Example import random @@ -682,27 +700,12 @@ def stream_data() -> Callable[["Language"], Iterable[Example]]: def generate_stream(nlp): for text, cats in read_custom_data(): random_index = random.randint(0, len(text) - 1) - output_list = list(text) - output_list[random_index] = output_list[random_index].upper() - doc = nlp.make_doc("".join(output_list)) + variant = text[:random_index] + text[random_index].upper() + text[random_index + 1:] + doc = nlp.make_doc(variant) example = Example.from_dict(doc, {"cats": cats}) yield example return generate_stream -``` -We can also customize the batching strategy by registering a new "batcher" which -turns a stream of items into a stream of batches. spaCy has several useful builtin -batching strategies with customizable sizes , but it's also -easy to implement your own. For instance, the following function takes the stream -of generated Example objects, and removes those which have the exact same underlying -raw text, to avoid duplicates in the final training data. Note that in a more realistic -implementation, you'd also want to check whether the annotations are exactly the same. - -```python -### functions.py -from typing import Callable, Iterable, List -import spacy -from spacy.gold import Example @spacy.registry.batchers("filtering_batch.v1") def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterable[List[Example]]]: From 6ed67d495a5bc42b9e7a8d7e1932363488b728af Mon Sep 17 00:00:00 2001 From: svlandeg Date: Tue, 18 Aug 2020 19:43:20 +0200 Subject: [PATCH 10/10] format --- website/docs/usage/training.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 4ee17ee21..adafcac68 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -665,18 +665,18 @@ can create and register a custom function that generates using this dataset for training, stopping criteria such as maximum number of steps, or stopping when the loss does not decrease further, can be used. -In this example we assume a custom function `read_custom_data()` -which loads or generates texts with relevant textcat annotations. Then, small -lexical variations of the input text are created before generating the final -`Example` objects. +In this example we assume a custom function `read_custom_data()` which loads or +generates texts with relevant textcat annotations. Then, small lexical +variations of the input text are created before generating the final `Example` +objects. We can also customize the batching strategy by registering a new "batcher" which turns a stream of items into a stream of batches. spaCy has several useful built-in batching strategies with customizable sizes, but it's also easy to implement your own. For instance, the following function takes -the stream of generated `Example` objects, and removes those which have the exact -same underlying raw text, to avoid duplicates in the final training data. Note -that in a more realistic implementation, you'd also want to check whether the +the stream of generated `Example` objects, and removes those which have the +exact same underlying raw text, to avoid duplicates within each batch. 
Note that +in a more realistic implementation, you'd also want to check whether the annotations are exactly the same. > ```ini