mirror of https://github.com/explosion/spaCy.git
synced 2025-01-12 18:26:30 +03:00

Add usage docs for streamed train corpora (#7693)

This commit is contained in:
parent 73a8c0f992
commit 673e2bc4c0
@@ -193,10 +193,10 @@ process that are used when you run [`spacy train`](/api/cli#train).
 | `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
 | `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
 | `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ |
-| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ |
-| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ |
+| `max_epochs` | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus should be streamed rather than loaded into memory with no shuffling within the training loop. Defaults to `0`. ~~int~~ |
+| `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ |
 | `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
-| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ |
+| `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. ~~int~~ |
 | `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
 | `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
 | `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ |
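For orientation, the settings documented above live together in the `[training]` block of a `config.cfg`. The fragment below is a sketch only: the values shown are the documented defaults, except `max_epochs = -1`, which is chosen here to stream the train corpus, and the `${system.*}` variables assume a `[system]` section defined elsewhere in the file.

```ini
[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
train_corpus = "corpora.train"
frozen_components = []
max_epochs = -1
max_steps = 20000
patience = 1600
```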
@@ -1130,8 +1130,8 @@ any other custom workflows. `corpora.train` and `corpora.dev` are used as
 conventions within spaCy's default configs, but you can also define any other
 custom blocks. Each section in the corpora config should resolve to a
 [`Corpus`](/api/corpus) – for example, using spaCy's built-in
-[corpus reader](/api/top-level#readers) that takes a path to a binary `.spacy`
-file. The `train_corpus` and `dev_corpus` fields in the
+[corpus reader](/api/top-level#corpus-readers) that takes a path to a binary
+`.spacy` file. The `train_corpus` and `dev_corpus` fields in the
 [`[training]`](/api/data-formats#config-training) block specify where to find
 the corpus in your config. This makes it easy to **swap out** different corpora
 by only changing a single config setting.
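As a sketch of that swap, suppose two corpora blocks are defined with the built-in `spacy.Corpus.v1` reader (the `corpora.train_large` name and the `${paths.*}` variables are hypothetical placeholders); switching between them only touches the `train_corpus` setting:

```ini
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}

[corpora.train_large]
@readers = "spacy.Corpus.v1"
path = ${paths.train_large}

[training]
# Point this at "corpora.train_large" to train on the other corpus.
train_corpus = "corpora.train"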
@@ -1142,21 +1142,23 @@ corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be
 especially useful if you need to split a single file into corpora for training
 and evaluation, without loading the same file twice.
+
+By default, the training data is loaded into memory and shuffled before each
+epoch. If the corpus is **too large to fit into memory** during training, stream
+the corpus using a custom reader as described in the next section.

 ### Custom data reading and batching {#custom-code-readers-batchers}

 Some use-cases require **streaming in data** or manipulating datasets on the
-fly, rather than generating all data beforehand and storing it to file. Instead
+fly, rather than generating all data beforehand and storing it to disk. Instead
 of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
 paths, you can create and register a custom function that generates
-[`Example`](/api/example) objects. The resulting generator can be infinite. When
-using this dataset for training, stopping criteria such as maximum number of
-steps, or stopping when the loss does not decrease further, can be used.
+[`Example`](/api/example) objects.

-In this example we assume a custom function `read_custom_data` which loads or
-generates texts with relevant text classification annotations. Then, small
-lexical variations of the input text are created before generating the final
-[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
-you register the function creating the custom reader in the `readers`
+In the following example we assume a custom function `read_custom_data` which
+loads or generates texts with relevant text classification annotations. Then,
+small lexical variations of the input text are created before generating the
+final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator
+lets you register the function creating the custom reader in the `readers`
 [registry](/api/top-level#registry) and assign it a string name, so it can be
 used in your config. All arguments on the registered function become available
 as **config settings** – in this case, `source`.
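The registration pattern can be sketched without spaCy itself. Below, a hypothetical factory `create_reader` plays the role of the registered function, its `source` argument stands in for the config setting of the same name, and plain `(text, annotations)` tuples stand in for `Example` objects; in real code the factory would carry the `@spacy.registry.readers(...)` decorator and build `Example`s from an `nlp` object.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def lexical_variants(text: str) -> List[str]:
    # Hypothetical augmentation step: keep the original text plus a
    # lowercased variant.
    return [text, text.lower()]

def create_reader(source: List[Tuple[str, str]]) -> Callable[[], Iterable[Tuple[str, Dict]]]:
    # `source` plays the role of the registered function's config setting.
    def read_custom_data() -> Iterable[Tuple[str, Dict]]:
        for text, label in source:
            for variant in lexical_variants(text):
                # In spaCy this would be Example.from_dict(nlp.make_doc(variant), ...)
                yield variant, {"cats": {label: 1.0}}
    return read_custom_data

reader = create_reader([("Nice product!", "POSITIVE")])
examples = list(reader())
```

Because the reader is a generator function, nothing is loaded up front; each call to `reader()` produces a fresh stream of annotated texts.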
@@ -1199,6 +1201,80 @@ Remember that a registered function should always be a function that spaCy

 </Infobox>

+If the corpus is **too large to load into memory** or the corpus reader is an
+**infinite generator**, use the setting `max_epochs = -1` to indicate that the
+train corpus should be streamed. With this setting the train corpus is merely
+streamed and batched, not shuffled, so any shuffling needs to be implemented in
+the corpus reader itself. In the example below, a corpus reader that generates
+sentences containing even or odd numbers is used with an unlimited number of
+examples for the train corpus and a limited number of examples for the dev
+corpus. The dev corpus should always be finite and fit in memory during the
+evaluation step. `max_steps` and/or `patience` are used to determine when the
+training should stop.
+
+> #### config.cfg
+>
+> ```ini
+> [corpora.dev]
+> @readers = "even_odd.v1"
+> limit = 100
+>
+> [corpora.train]
+> @readers = "even_odd.v1"
+> limit = -1
+>
+> [training]
+> max_epochs = -1
+> patience = 500
+> max_steps = 2000
+> ```
+
+```python
+### functions.py
+import random
+from typing import Callable, Iterable, Iterator
+
+from spacy import Language, util
+from spacy.training import Example
+
+
+@util.registry.readers("even_odd.v1")
+def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]:
+    return EvenOddCorpus(limit)
+
+
+class EvenOddCorpus:
+    def __init__(self, limit: int):
+        self.limit = limit
+
+    def __call__(self, nlp: Language) -> Iterator[Example]:
+        i = 0
+        # A negative limit yields an infinite stream of examples.
+        while i < self.limit or self.limit < 0:
+            r = random.randint(0, 1000)
+            cat = r % 2 == 0
+            text = "This is sentence " + str(r)
+            yield Example.from_dict(
+                nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}}
+            )
+            i += 1
+```
+
+> #### config.cfg
+>
+> ```ini
+> [initialize.components.textcat.labels]
+> @readers = "spacy.read_labels.v1"
+> path = "labels/textcat.json"
+> require = true
+> ```
+
+If the train corpus is streamed, the initialize step peeks at the first 100
+examples in the corpus to find the labels for each component. If this isn't
+sufficient, you'll need to [provide the labels](#initialization-labels) for each
+component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can
+be used to generate JSON files in the correct format, which you can extend with
+the full label set.
+
 We can also customize the **batching strategy** by registering a new batcher
 function in the `batchers` [registry](/api/top-level#registry). A batcher turns
 a stream of items into a stream of batches. spaCy has several useful built-in
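The batching idea itself can be sketched in plain Python: a batcher is simply a callable that turns a stream of items into a stream of lists. The fixed-size `minibatch` below is a simplified, hypothetical stand-in for spaCy's registered batchers, which can be more elaborate (e.g. batching by number of words or using a size schedule).

```python
from typing import Any, Iterable, Iterator, List

def minibatch(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield successive fixed-size batches; the final batch may be smaller."""
    batch: List[Any] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

batches = list(minibatch(range(7), size=3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

Because it consumes its input lazily, a batcher like this also works on the infinite streams produced by a streaming corpus reader.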