mirror of https://github.com/explosion/spaCy.git, synced 2025-01-13 02:36:32 +03:00

Add usage docs for streamed train corpora (#7693)

commit 673e2bc4c0 (parent 73a8c0f992)
@@ -193,10 +193,10 @@ process that are used when you run [`spacy train`](/api/cli#train).

| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
| `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
| `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]~~ |
| `max_epochs` | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus is streamed rather than loaded into memory, with no shuffling within the training loop. Defaults to `0`. ~~int~~ |
| `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ |
| `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
| `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. ~~int~~ |
| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
| `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ |
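Taken together, these settings live in the `[training]` block of the config. A minimal sketch using the documented defaults (the values below are purely illustrative):

```ini
[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
frozen_components = []
max_epochs = 0
max_steps = 20000
patience = 1600
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"
```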

@@ -1130,8 +1130,8 @@ any other custom workflows. `corpora.train` and `corpora.dev` are used as

conventions within spaCy's default configs, but you can also define any other custom blocks. Each section in the corpora config should resolve to a [`Corpus`](/api/corpus) – for example, using spaCy's built-in [corpus reader](/api/top-level#corpus-readers) that takes a path to a binary `.spacy` file. The `train_corpus` and `dev_corpus` fields in the [`[training]`](/api/data-formats#config-training) block specify where to find the corpus in your config. This makes it easy to **swap out** different corpora by only changing a single config setting.
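For instance, a corpora block using the built-in reader might look like this (the `corpus/train.spacy` path is an illustrative placeholder):

```ini
[corpora.train]
@readers = "spacy.Corpus.v1"
path = "corpus/train.spacy"

[training]
train_corpus = "corpora.train"
```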

@@ -1142,21 +1142,23 @@ corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be

especially useful if you need to split a single file into corpora for training and evaluation, without loading the same file twice.

By default, the training data is loaded into memory and shuffled before each epoch. If the corpus is **too large to fit into memory** during training, stream the corpus using a custom reader as described in the next section.
### Custom data reading and batching {#custom-code-readers-batchers}
Some use-cases require **streaming in data** or manipulating datasets on the fly, rather than generating all data beforehand and storing it to disk. Instead of using the built-in [`Corpus`](/api/corpus) reader, which uses static file paths, you can create and register a custom function that generates [`Example`](/api/example) objects.

In the following example we assume a custom function `read_custom_data` which loads or generates texts with relevant text classification annotations. Then, small lexical variations of the input text are created before generating the final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets you register the function creating the custom reader in the `readers` [registry](/api/top-level#registry) and assign it a string name, so it can be used in your config. All arguments on the registered function become available as **config settings** – in this case, `source`.
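A sketch of what such a registered reader could look like; the body of `read_custom_data` and the registered name `"custom_variants.v1"` are illustrative assumptions, not the exact code from the docs:

```python
### functions.py (sketch)
import random
from typing import Callable, Dict, Iterable, Tuple

import spacy
from spacy.language import Language
from spacy.training import Example


def read_custom_data(source: str) -> Iterable[Tuple[str, Dict[str, float]]]:
    # Stand-in for the assumed helper: yields (text, cats) pairs.
    yield "This is a sentence", {"POSITIVE": 1.0, "NEGATIVE": 0.0}


@spacy.registry.readers("custom_variants.v1")
def stream_data(source: str) -> Callable[[Language], Iterable[Example]]:
    def generate_stream(nlp: Language) -> Iterable[Example]:
        for text, cats in read_custom_data(source):
            # Create a small lexical variation by uppercasing one character
            i = random.randint(0, len(text) - 1)
            variant = text[:i] + text[i].upper() + text[i + 1 :]
            doc = nlp.make_doc(variant)
            yield Example.from_dict(doc, {"cats": cats})

    return generate_stream
```

In the config, this reader would then be referenced as `@readers = "custom_variants.v1"` with a `source` setting.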
@@ -1199,6 +1201,80 @@ Remember that a registered function should always be a function that spaCy

</Infobox>

If the corpus is **too large to load into memory** or the corpus reader is an **infinite generator**, use the setting `max_epochs = -1` to indicate that the train corpus should be streamed. With this setting the train corpus is merely streamed and batched, not shuffled, so any shuffling needs to be implemented in the corpus reader itself. In the example below, a corpus reader that generates sentences containing even or odd numbers is used with an unlimited number of examples for the train corpus and a limited number of examples for the dev corpus. The dev corpus should always be finite and fit in memory during the evaluation step. `max_steps` and/or `patience` are used to determine when the training should stop.

> #### config.cfg
>
> ```ini
> [corpora.dev]
> @readers = "even_odd.v1"
> limit = 100
>
> [corpora.train]
> @readers = "even_odd.v1"
> limit = -1
>
> [training]
> max_epochs = -1
> patience = 500
> max_steps = 2000
> ```
```python
### functions.py
import random
from typing import Callable, Iterable, Iterator

from spacy import Language, util
from spacy.training import Example


@util.registry.readers("even_odd.v1")
def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]:
    return EvenOddCorpus(limit)


class EvenOddCorpus:
    def __init__(self, limit):
        self.limit = limit

    def __call__(self, nlp: Language) -> Iterator[Example]:
        i = 0
        while i < self.limit or self.limit < 0:
            r = random.randint(0, 1000)
            cat = r % 2 == 0
            text = "This is sentence " + str(r)
            yield Example.from_dict(
                nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}}
            )
            i += 1
```
> #### config.cfg
>
> ```ini
> [initialize.components.textcat.labels]
> @readers = "spacy.read_labels.v1"
> path = "labels/textcat.json"
> require = true
> ```
If the train corpus is streamed, the initialize step peeks at the first 100 examples in the corpus to find the labels for each component. If this isn't sufficient, you'll need to [provide the labels](#initialization-labels) for each component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can be used to generate JSON files in the correct format, which you can extend with the full label set.
We can also customize the **batching strategy** by registering a new batcher function in the `batchers` [registry](/api/top-level#registry). A batcher turns a stream of items into a stream of batches. spaCy has several useful built-in