diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md
index ac6f4183d..53ca8a51d 100644
--- a/website/docs/api/data-formats.md
+++ b/website/docs/api/data-formats.md
@@ -193,10 +193,10 @@ process that are used when you run [`spacy train`](/api/cli#train).
 | `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
 | `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
 | `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ |
-| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ |
-| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ |
+| `max_epochs` | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus should be streamed rather than loaded into memory, with no shuffling within the training loop. Defaults to `0`. ~~int~~ |
+| `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ |
 | `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
-| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ |
+| `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. ~~int~~ |
 | `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
 | `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
 | `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ |
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index 5e9d3303c..9f929fe19 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -1130,8 +1130,8 @@ any other custom workflows. `corpora.train` and `corpora.dev` are used as
 conventions within spaCy's default configs, but you can also define any other
 custom blocks. Each section in the corpora config should resolve to a
 [`Corpus`](/api/corpus) – for example, using spaCy's built-in
-[corpus reader](/api/top-level#readers) that takes a path to a binary `.spacy`
-file. The `train_corpus` and `dev_corpus` fields in the
+[corpus reader](/api/top-level#corpus-readers) that takes a path to a binary
+`.spacy` file. The `train_corpus` and `dev_corpus` fields in the
 [`[training]`](/api/data-formats#config-training) block specify where to find
 the corpus in your config. This makes it easy to **swap out** different corpora
 by only changing a single config setting.
@@ -1142,21 +1142,23 @@ corpora, keyed by corpus name, e.g. `"train"` and `"dev"`.
 This can be especially useful if you need to split a single file into corpora
 for training and evaluation, without loading the same file twice.
 
+By default, the training data is loaded into memory and shuffled before each
+epoch. If the corpus is **too large to fit into memory** during training, stream
+the corpus using a custom reader as described in the next section.
+
 ### Custom data reading and batching {#custom-code-readers-batchers}
 
 Some use-cases require **streaming in data** or manipulating datasets on the
-fly, rather than generating all data beforehand and storing it to file. Instead
+fly, rather than generating all data beforehand and storing it to disk. Instead
 of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
 paths, you can create and register a custom function that generates
-[`Example`](/api/example) objects. The resulting generator can be infinite. When
-using this dataset for training, stopping criteria such as maximum number of
-steps, or stopping when the loss does not decrease further, can be used.
+[`Example`](/api/example) objects.
 
-In this example we assume a custom function `read_custom_data` which loads or
-generates texts with relevant text classification annotations. Then, small
-lexical variations of the input text are created before generating the final
-[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
-you register the function creating the custom reader in the `readers`
+In the following example we assume a custom function `read_custom_data` which
+loads or generates texts with relevant text classification annotations. Then,
+small lexical variations of the input text are created before generating the
+final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator
+lets you register the function creating the custom reader in the `readers`
 [registry](/api/top-level#registry) and assign it a string name, so it can be
 used in your config. All arguments on the registered function become available
 as **config settings** – in this case, `source`.
@@ -1199,6 +1201,80 @@ Remember that a registered function should always be a function that spaCy
+If the corpus is **too large to load into memory** or the corpus reader is an
+**infinite generator**, use the setting `max_epochs = -1` to indicate that the
+train corpus should be streamed. With this setting, the train corpus is merely
+streamed and batched, not shuffled, so any shuffling needs to be implemented in
+the corpus reader itself. In the example below, a corpus reader that generates
+sentences containing even or odd numbers is used with an unlimited number of
+examples for the train corpus and a limited number of examples for the dev
+corpus. The dev corpus should always be finite and fit in memory during the
+evaluation step. `max_steps` and/or `patience` are used to determine when the
+training should stop.
+
+> #### config.cfg
+>
+> ```ini
+> [corpora.dev]
+> @readers = "even_odd.v1"
+> limit = 100
+>
+> [corpora.train]
+> @readers = "even_odd.v1"
+> limit = -1
+>
+> [training]
+> max_epochs = -1
+> patience = 500
+> max_steps = 2000
+> ```
+
+```python
+### functions.py
+from typing import Callable, Iterable, Iterator
+from spacy import util
+import random
+from spacy.training import Example
+from spacy import Language
+
+
+@util.registry.readers("even_odd.v1")
+def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]:
+    return EvenOddCorpus(limit)
+
+
+class EvenOddCorpus:
+    def __init__(self, limit):
+        self.limit = limit
+
+    def __call__(self, nlp: Language) -> Iterator[Example]:
+        i = 0
+        while i < self.limit or self.limit < 0:
+            r = random.randint(0, 1000)
+            cat = r % 2 == 0
+            text = "This is sentence " + str(r)
+            yield Example.from_dict(
+                nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}}
+            )
+            i += 1
+```
+
+> #### config.cfg
+>
+> ```ini
+> [initialize.components.textcat.labels]
+> @readers = "spacy.read_labels.v1"
+> path = "labels/textcat.json"
+> require = true
+> ```
+
+If the train corpus is streamed, the initialize step peeks at the first 100
+examples in the corpus to find the labels for each component. If this isn't
+sufficient, you'll need to [provide the labels](#initialization-labels) for each
+component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can
+be used to generate JSON files in the correct format, which you can extend with
+the full label set.
+
 We can also customize the **batching strategy** by registering a new batcher
 function in the `batchers` [registry](/api/top-level#registry). A batcher turns
 a stream of items into a stream of batches. spaCy has several useful built-in