Merge pull request #5933 from svlandeg/feature/more-v3-docs [ci skip]

Ines Montani 2020-08-19 11:29:02 +02:00 committed by GitHub
commit 2285e59765
10 changed files with 186 additions and 123 deletions


@ -43,33 +43,33 @@ can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
issue body. A few more tips:
- **Describing your issue:** Try to provide as many details as possible. What
  exactly goes wrong? _How_ is it failing? Is there an error? "XY doesn't work"
  usually isn't that helpful for tracking down problems. Always remember to
  include the code you ran and if possible, extract only the relevant parts and
  don't just dump your entire script. This will make it easier for us to
  reproduce the error.
- **Getting info about your spaCy installation and environment:** If you're
  using spaCy v1.7+, you can use the command line interface to print details and
  even format them as Markdown to copy-paste into GitHub issues:
  `python -m spacy info --markdown`.
- **Checking the model compatibility:** If you're having problems with a
  [statistical model](https://spacy.io/models), it may be because the model is
  incompatible with your spaCy installation. In spaCy v2.0+, you can check this
  on the command line by running `python -m spacy validate`.
- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
  comes with [built-in visualizers](https://spacy.io/usage/visualizers) that you
  can run from within your script or a Jupyter notebook. For some issues, it's
  helpful to **include a screenshot** of the visualization. You can simply drag
  and drop the image into GitHub's editor and it will be uploaded and included.
- **Sharing long blocks of code or logs:** If you need to include long code,
  logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
  [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
  so it only becomes visible on click, making the issue easier to read and
  follow.
### Issue labels
@ -94,39 +94,39 @@ shipped in the core library, and what could be provided in other packages. Our
philosophy is to prefer a smaller core library. We generally ask the following
questions:
- **What would this feature look like if implemented in a separate package?**
  Some features would be very difficult to implement externally, for example
  changes to spaCy's built-in methods. In contrast, a library of word alignment
  functions could easily live as a separate package that depended on spaCy —
  there's little difference between writing `import word_aligner` and
  `import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
  [custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
  and to add your own attributes, properties and methods to the `Doc`, `Token`
  and `Span`. If you're looking to implement a new spaCy feature, starting with
  a custom component package is usually the best strategy (see the sketch after
  this list). You won't have to worry about spaCy's internals and you can test
  your module in an isolated environment. And if it works well, we can always
  integrate it into the core library later.
- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
  Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim
  or TensorFlow/Keras do lots of useful things — but we don't want to have them
  as dependencies. If the feature requires functionality in one of these
  libraries, it's probably better to break it out into a different package.
- **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
  spaCy strongly prefers to avoid having 6 different ways of doing the same
  thing. As better techniques are developed, we prefer to drop support for "the
  old way". However, it's rare that one approach _entirely_ dominates another.
  It's very common that there's still a use-case for the "obsolete" approach.
  For instance, [WordNet](https://wordnet.princeton.edu/) is still very useful —
  but word vectors are better for most use-cases, and the two approaches to
  lexical semantics do a lot of the same things. spaCy therefore only supports
  word vectors, and support for WordNet is currently left for other packages.
- **Do you need the feature to get basic things done?** We do want spaCy to be
  at least somewhat self-contained. If we keep needing some feature in our
  recipes, that does provide some argument for bringing it "in house".
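As a hedged sketch of that strategy, here's a toy component that attaches a
custom attribute (the attribute and component names are made up, and the
v2-style `nlp.add_pipe()` call is assumed):

```python
import spacy
from spacy.tokens import Doc

# Hypothetical custom attribute for this sketch
Doc.set_extension("word_alignments", default=None)

def word_aligner(doc):
    # A real aligner would compute alignments here; we just store a stub
    doc._.word_alignments = [(token.i, token.text) for token in doc]
    return doc

nlp = spacy.blank("en")
nlp.add_pipe(word_aligner, name="word_aligner", last=True)
doc = nlp("This component lives happily outside the core library.")
print(doc._.word_alignments)
```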
### Getting started
@ -203,10 +203,10 @@ your files on save:
```json
{
    "python.formatting.provider": "black",
    "[python]": {
        "editor.formatOnSave": true
    }
}
```
@ -216,7 +216,7 @@ list of available editor integrations.
#### Disabling formatting
There are a few cases where auto-formatting doesn't improve readability, for
example in some of the language data files like the `tag_map.py`, or in the
tests that construct `Doc` objects from lists of words and other labels.
Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
for that particular code. Here's an example:
@ -397,10 +397,10 @@ Python. If it's not fast enough the first time, just switch to Cython.
### Resources to get you started
- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
- [Multi-threading spaCy's parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
## Adding tests
@ -440,25 +440,25 @@ simply click on the "Suggest edits" button at the bottom of a page.
We're very excited about all the new possibilities for **community extensions**
and plugins in spaCy v2.0, and we can't wait to see what you build with it!
- An extension or plugin should add substantial functionality, be
  **well-documented** and **open-source**. It should be available for users to
  download and install as a Python package, for example via
  [PyPI](http://pypi.python.org).
- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
  as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
  that users can **add to their processing pipeline** using `nlp.add_pipe()`.
- When publishing your extension on GitHub, **tag it** with the topics
  [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
  [`spacy-extension`](https://github.com/topics/spacy-extension?o=desc&s=stars)
  to make it easier to find. Those are also the topics we're linking to from the
  spaCy website. If you're sharing your project on Twitter, feel free to tag
  [@spacy_io](https://twitter.com/spacy_io) so we can check it out.
- Once your extension is published, you can open an issue on the
  [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for
  the [resources directory](https://spacy.io/usage/resources#extensions) on the
  website.
📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**


@ -235,7 +235,7 @@ def train_while_improving(
with each iteration yielding a tuple `(batch, info, is_best_checkpoint)`,
where info is a dict, and is_best_checkpoint is in [True, False, None] --
None indicating that the iteration was not evaluated as a checkpoint.
The evaluation is conducted by calling the evaluate callback.

Positional arguments:
nlp: The spaCy pipeline to evaluate.
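A hedged sketch of that contract, using a stand-in generator in place of the
real training loop:

```python
def fake_training_steps():
    # Stand-in for train_while_improving's generator, yielding
    # (batch, info, is_best_checkpoint) triples as described above
    yield ([], {"score": 0.50}, None)   # not evaluated as a checkpoint
    yield ([], {"score": 0.60}, True)   # evaluated: best so far
    yield ([], {"score": 0.55}, False)  # evaluated: not the best

for batch, info, is_best_checkpoint in fake_training_steps():
    if is_best_checkpoint:
        print("New best checkpoint:", info)
```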


@ -545,18 +545,18 @@ network has an internal CNN Tok2Vec layer and uses attention.
<!-- TODO: model return type -->
| Name                 | Description |
| -------------------- | ----------- |
| `exclusive_classes`  | Whether or not categories are mutually exclusive. ~~bool~~ |
| `pretrained_vectors` | Whether or not pretrained vectors will be used in addition to the feature vectors. ~~bool~~ |
| `width`              | Output dimension of the feature encoding step. ~~int~~ |
| `embed_size`         | Input dimension of the feature encoding step. ~~int~~ |
| `conv_depth`         | Depth of the tok2vec layer. ~~int~~ |
| `window_size`        | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. ~~int~~ |
| `ngram_size`         | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. ~~int~~ |
| `dropout`            | The dropout rate. ~~float~~ |
| `nO`                 | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
| **CREATES**          | The model using the architecture. ~~Model~~ |
### spacy.TextCatCNN.v1 {#TextCatCNN}
@ -585,12 +585,12 @@ architecture is usually less accurate than the ensemble, but runs faster.
<!-- TODO: model return type -->
| Name                | Description |
| ------------------- | ----------- |
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
| `tok2vec`           | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ |
| `nO`                | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
| **CREATES**         | The model using the architecture. ~~Model~~ |
### spacy.TextCatBOW.v1 {#TextCatBOW}
@ -610,13 +610,13 @@ others, but may not be as accurate, especially if texts are short.
<!-- TODO: model return type -->
| Name                | Description |
| ------------------- | ----------- |
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
| `ngram_size`        | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. ~~int~~ |
| `no_output_layer`   | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes` is `True`, else `Logistic`). ~~bool~~ |
| `nO`                | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
| **CREATES**         | The model using the architecture. ~~Model~~ |
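As a hedged sketch, such an architecture can also be resolved and instantiated
from Python instead of the config, assuming the registry entry accepts exactly
the parameters in the table above:

```python
import spacy

# Look up the registered architecture by name and build a model from it
create_model = spacy.registry.architectures.get("spacy.TextCatBOW.v1")
model = create_model(exclusive_classes=False, ngram_size=1, no_output_layer=False)
```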
## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}


@ -17,7 +17,7 @@ customize the data loading during training, you can register your own
or evaluation data. It takes the same arguments as the `Corpus` class and
returns a callable that yields [`Example`](/api/example) objects. You can
replace it with your own registered function in the
[`@readers` registry](/api/top-level#registry) to customize the data loading and
streaming.
> #### Example config


@ -162,7 +162,7 @@ run [`spacy pretrain`](/api/cli#pretrain).
| `dropout`                    | The dropout rate. Defaults to `0.2`. ~~float~~ |
| `n_save_every`               | Saving frequency. Defaults to `null`. ~~Optional[int]~~ |
| `batch_size`                 | The batch size or batch size [schedule](https://thinc.ai/docs/api-schedules). Defaults to `3000`. ~~Union[int, Sequence[int]]~~ |
| `seed`                       | The random seed. Defaults to variable `${system:seed}`. ~~int~~ |
| `use_pytorch_for_gpu_memory` | Allocate memory via PyTorch. Defaults to variable `${system:use_pytorch_for_gpu_memory}`. ~~bool~~ |
| `tok2vec_model`              | The model section of the embedding component in the config. Defaults to `"components.tok2vec.model"`. ~~str~~ |
| `objective`                  | The pretraining objective. Defaults to `{"type": "characters", "n_characters": 4}`. ~~Dict[str, Any]~~ |


@ -169,7 +169,7 @@ $ python setup.py build_ext --inplace # compile spaCy
Compared to regular install via pip, the
[`requirements.txt`](https://github.com/explosion/spaCy/tree/master/requirements.txt)
additionally installs developer dependencies such as Cython. See the
[quickstart widget](#quickstart) to get the right commands for your platform and
Python version.


@ -551,9 +551,9 @@ setup(
)
```
After installing the package, the custom colors will be used when visualizing
text with `displacy`. Whenever the label `SNEK` is assigned, it will be
displayed in `#3dff74`.
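As a hedged sketch, the same colors can also be passed in directly via the
`displacy` options, without the package entry point (the entity here is set
manually for illustration):

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("I am a snek.")
# Manually attach an entity for this sketch; a trained model would predict it
doc.ents = [Span(doc, 3, 4, label="SNEK")]

html = displacy.render(doc, style="ent", options={"colors": {"SNEK": "#3dff74"}})
```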
import DisplaCyEntSnekHtml from 'images/displacy-ent-snek.html'


@ -144,7 +144,7 @@ https://github.com/explosion/spaCy/blob/develop/spacy/default_config.cfg
Under the hood, the config is parsed into a dictionary. It's divided into
sections and subsections, indicated by the square brackets and dot notation. For
example, `[training]` is a section and `[training.batch_size]` a subsection.
Subsections can define values, just like a dictionary, or use the `@` syntax to
refer to [registered functions](#config-functions). This allows the config to
not just define static settings, but also construct objects like architectures,
@ -156,7 +156,7 @@ sections of a config file are:
| `nlp`         | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
| `components`  | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. |
| `paths`       | Paths to data and other assets. Re-used across the config as variables, e.g. `${paths:train}`, and can be [overwritten](#config-overrides) on the CLI. |
| `system`      | Settings related to system and hardware. Re-used across the config as variables, e.g. `${system:seed}`, and can be [overwritten](#config-overrides) on the CLI. |
| `training`    | Settings and controls for the training and evaluation process. |
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
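As a hedged sketch of how such variables resolve, using Thinc's `Config` (the
paths and values are made up, and interpolation on load is assumed):

```python
from thinc.api import Config

cfg_string = """
[paths]
train = "corpus/train.spacy"

[training]
train_path = ${paths:train}
"""

# The variable is resolved when the config is loaded, so [training]
# re-uses the value defined once under [paths]
config = Config().from_str(cfg_string)
print(config["training"]["train_path"])  # "corpus/train.spacy"
```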
@ -514,11 +514,11 @@ language class and `nlp` object at different points of the lifecycle:
| `after_creation`          | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object. Useful for modifying the tokenizer. |
| `after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components. |
The `@spacy.registry.callbacks` decorator lets you register your custom function
in the `callbacks` [registry](/api/top-level#registry) under a given name. You
can then reference the function in a config block using the `@callbacks` key. If
a block contains a key starting with an `@`, it's interpreted as a reference to
a function. Because you've registered the function, spaCy knows how to create it
when you reference `"customize_language_data"` in your config. Here's an example
of a callback that runs before the `nlp` object is created and adds a few custom
tokenization rules to the defaults:
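A minimal sketch consistent with that description (the exact rule added here, a
custom suffix, is an assumption):

```python
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback():
    def customize_language_data(lang_cls):
        # Assumption for this sketch: extend the default suffix rules
        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
        return lang_cls

    return customize_language_data
```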
@ -593,9 +593,9 @@ spaCy's configs are powered by our machine learning library Thinc's
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `debug: bool` in the example above will
ensure that the value received as the argument `debug` is a boolean. If the
value can't be coerced into a boolean, spaCy will raise an error.
`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
error if it's not, for instance if your config defines `1` instead of `true`.
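For illustration, a hypothetical registered function using both styles (the
`misc` registry entry and the function itself are made up):

```python
import pydantic
import spacy

@spacy.registry.misc("debug_flags.v1")  # hypothetical registry entry
def create_flags(debug: bool, strict: pydantic.StrictBool):
    # "debug" may be coerced (a config value of 1 becomes True), while
    # "strict" must already be a real boolean, or an error is raised
    return {"debug": debug, "strict": strict}
```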
</Infobox>
@ -642,7 +642,9 @@ settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults and all arguments on
the functions need to be represented in the config. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
[`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
given parameter is always explicitly set in the config, avoid setting a default
value for it.
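As a hedged, hypothetical sketch: a registered function whose `factor` default
can be auto-filled, while `rate` must always be set explicitly in the config:

```python
import spacy

@spacy.registry.misc("example_settings.v1")  # hypothetical registry entry
def create_settings(rate: float, factor: float = 1.005):
    # "factor" has a default, so `init fill-config` can fill it in;
    # "rate" has none, so the config must always define it explicitly
    return {"rate": rate, "factor": factor}
```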
```ini
### config.cfg (excerpt)
@ -654,7 +656,68 @@ factor = 1.005
#### Example: Custom data reading and batching {#custom-code-readers-batchers}
Some use-cases require streaming in data or manipulating datasets on the fly,
rather than generating all data beforehand and storing it to file. Instead of
using the built-in reader `"spacy.Corpus.v1"`, which uses static file paths, you
can create and register a custom function that generates
[`Example`](/api/example) objects. The resulting generator can be infinite. When
using this dataset for training, stopping criteria such as a maximum number of
steps, or stopping when the loss does not decrease further, can be used.
In this example we assume a custom function `read_custom_data()` which loads or
generates texts with relevant textcat annotations. Then, small lexical
variations of the input text are created before generating the final `Example`
objects.
We can also customize the batching strategy by registering a new "batcher" which
turns a stream of items into a stream of batches. spaCy has several useful
built-in batching strategies with customizable sizes<!-- TODO: link -->, but
it's also easy to implement your own. For instance, the following function takes
the stream of generated `Example` objects, and removes those which have the
exact same underlying raw text, to avoid duplicates within each batch. Note that
in a more realistic implementation, you'd also want to check whether the
annotations are exactly the same.
> ```ini
> [training.train_corpus]
> @readers = "corpus_variants.v1"
>
> [training.batcher]
> @batchers = "filtering_batch.v1"
> size = 150
> ```
```python
### functions.py
from typing import Callable, Iterable, List
import random

import spacy
from spacy.gold import Example
from spacy.language import Language


@spacy.registry.readers("corpus_variants.v1")
def stream_data() -> Callable[[Language], Iterable[Example]]:
    def generate_stream(nlp: Language) -> Iterable[Example]:
        # read_custom_data() is a user-defined helper (see above) that loads
        # or generates (text, cats) pairs with textcat annotations
        for text, cats in read_custom_data():
            # Create a small lexical variation by upper-casing one character
            random_index = random.randint(0, len(text) - 1)
            variant = text[:random_index] + text[random_index].upper() + text[random_index + 1:]
            doc = nlp.make_doc(variant)
            yield Example.from_dict(doc, {"cats": cats})

    return generate_stream


@spacy.registry.batchers("filtering_batch.v1")
def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterable[List[Example]]]:
    def create_filtered_batches(examples: Iterable[Example]) -> Iterable[List[Example]]:
        batch = []
        for eg in examples:
            # Skip examples whose raw text already occurs in the current batch
            if eg.text not in [x.text for x in batch]:
                batch.append(eg)
            if len(batch) == size:
                yield batch
                batch = []

    return create_filtered_batches
```
### Wrapping PyTorch and TensorFlow {#custom-frameworks}


@ -60,7 +60,7 @@
"clear": "rm -rf .cache", "clear": "rm -rf .cache",
"test": "echo \"Write tests! -> https://gatsby.app/unit-testing\"", "test": "echo \"Write tests! -> https://gatsby.app/unit-testing\"",
"python:install": "pip install setup/requirements.txt", "python:install": "pip install setup/requirements.txt",
"python:setup": "cd setup && ./setup.sh" "python:setup": "cd setup && sh setup.sh"
}, },
"devDependencies": { "devDependencies": {
"@sindresorhus/slugify": "^0.8.0", "@sindresorhus/slugify": "^0.8.0",


@ -2,7 +2,7 @@
# With additional functionality: in/not in, replace, pprint, round, + for lists,
# rendering empty dicts
# This script is mostly used to generate the JavaScript function for the
# training quickstart widget.
import contextlib
import json
import re