Add usage docs for streamed train corpora (#7693)

Adriane Boyd 2021-04-09 16:15:38 +02:00 committed by GitHub
parent 73a8c0f992
commit 673e2bc4c0
2 changed files with 90 additions and 14 deletions


@@ -193,10 +193,10 @@ process that are used when you run [`spacy train`](/api/cli#train).
| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
| `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
| `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ |
| `max_epochs` | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus should be streamed rather than loaded into memory, with no shuffling within the training loop. Defaults to `0`. ~~int~~ |
| `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ |
| `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
| `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. ~~int~~ |
| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
| `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ |
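
As a rough illustration of how the streaming-related settings above fit together (the config path and override values here are placeholders, not part of the reference table), a config can be loaded and overridden from Python with `spacy.util.load_config`:

```python
from spacy import util

# override max_epochs so the train corpus is streamed instead of loaded into memory
config = util.load_config("config.cfg", overrides={"training.max_epochs": -1})
training = config["training"]
print(training["max_epochs"], training["max_steps"], training["patience"])
```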


@@ -1130,8 +1130,8 @@ any other custom workflows. `corpora.train` and `corpora.dev` are used as
conventions within spaCy's default configs, but you can also define any other
custom blocks. Each section in the corpora config should resolve to a
[`Corpus`](/api/corpus), for example using spaCy's built-in
[corpus reader](/api/top-level#corpus-readers) that takes a path to a binary
`.spacy` file. The `train_corpus` and `dev_corpus` fields in the
[`[training]`](/api/data-formats#config-training) block specify where to find
the corpus in your config. This makes it easy to **swap out** different corpora
by only changing a single config setting.
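
For example (a minimal sketch with a placeholder path), the built-in reader that a `[corpora.train]` block typically resolves to can also be constructed directly in Python, which is handy for inspecting the data a config would load:

```python
from spacy.lang.en import English
from spacy.training import Corpus

nlp = English()
# roughly what a [corpora.train] block using "spacy.Corpus.v1" resolves to
train_corpus = Corpus("corpus/train.spacy")  # placeholder path
train_examples = list(train_corpus(nlp))
print(len(train_examples))
```
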
@@ -1142,21 +1142,23 @@ corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be
especially useful if you need to split a single file into corpora for training
and evaluation, without loading the same file twice.
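
A rough sketch of that pattern, assuming a single registered reader that returns a dictionary of corpora keyed by name (the reader name and the modulo-based split used here are purely illustrative):

```python
from typing import Callable, Dict, Iterable
import spacy
from spacy import Language
from spacy.training import Corpus, Example

@spacy.registry.readers("split_corpus.v1")  # illustrative name
def create_split_corpora(path: str) -> Dict[str, Callable[[Language], Iterable[Example]]]:
    # one reader over a single .spacy file, handed out as separate "train" and
    # "dev" callables so the file only has to be configured once
    corpus = Corpus(path)

    def train(nlp: Language) -> Iterable[Example]:
        for i, example in enumerate(corpus(nlp)):
            if i % 10 != 0:  # keep ~90% of the examples for training
                yield example

    def dev(nlp: Language) -> Iterable[Example]:
        for i, example in enumerate(corpus(nlp)):
            if i % 10 == 0:  # hold out ~10% for evaluation
                yield example

    return {"train": train, "dev": dev}
```

With a reader like this, the `[corpora]` block as a whole points at the registered function, while `train_corpus = "corpora.train"` and `dev_corpus = "corpora.dev"` in `[training]` keep working unchanged.
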
By default, the training data is loaded into memory and shuffled before each
epoch. If the corpus is **too large to fit into memory** during training, stream
the corpus using a custom reader as described in the next section.

### Custom data reading and batching {#custom-code-readers-batchers}

Some use-cases require **streaming in data** or manipulating datasets on the
fly, rather than generating all data beforehand and storing it to disk. Instead
of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
paths, you can create and register a custom function that generates
[`Example`](/api/example) objects.

In the following example we assume a custom function `read_custom_data` which
loads or generates texts with relevant text classification annotations. Then,
small lexical variations of the input text are created before generating the
final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator
lets you register the function creating the custom reader in the `readers`
[registry](/api/top-level#registry) and assign it a string name, so it can be
used in your config. All arguments on the registered function become available
as **config settings**, in this case `source`.
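
A compact, self-contained sketch of that pattern (the reader name, file format and labels below are illustrative and are not the documented `read_custom_data` example):

```python
from typing import Callable, Iterable
import spacy
from spacy import Language
from spacy.training import Example

@spacy.registry.readers("my_tsv_reader.v1")  # illustrative name
def stream_data(source: str) -> Callable[[Language], Iterable[Example]]:
    # every argument of the registered function, here `source`, becomes a config setting
    def generate_examples(nlp: Language) -> Iterable[Example]:
        with open(source, encoding="utf8") as file_:
            for line in file_:
                text, label = line.rstrip("\n").split("\t")
                cats = {"POSITIVE": label == "pos", "NEGATIVE": label == "neg"}
                yield Example.from_dict(nlp.make_doc(text), {"cats": cats})

    return generate_examples
```
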
@@ -1199,6 +1201,80 @@ Remember that a registered function should always be a function that spaCy
</Infobox>

If the corpus is **too large to load into memory** or the corpus reader is an
**infinite generator**, use the setting `max_epochs = -1` to indicate that the
train corpus should be streamed. With this setting the train corpus is merely
streamed and batched, not shuffled, so any shuffling needs to be implemented in
the corpus reader itself. In the example below, a corpus reader that generates
sentences containing even or odd numbers is used with an unlimited number of
examples for the train corpus and a limited number of examples for the dev
corpus. The dev corpus should always be finite and fit in memory during the
evaluation step. `max_steps` and/or `patience` are used to determine when the
training should stop.

> #### config.cfg
>
> ```ini
> [corpora.dev]
> @readers = "even_odd.v1"
> limit = 100
>
> [corpora.train]
> @readers = "even_odd.v1"
> limit = -1
>
> [training]
> max_epochs = -1
> patience = 500
> max_steps = 2000
> ```

```python
### functions.py
from typing import Callable, Iterable, Iterator
from spacy import util
import random
from spacy.training import Example
from spacy import Language

@util.registry.readers("even_odd.v1")
def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]:
    return EvenOddCorpus(limit)


class EvenOddCorpus:
    def __init__(self, limit):
        self.limit = limit

    def __call__(self, nlp: Language) -> Iterator[Example]:
        # with the default limit of -1, this corpus yields examples forever
        i = 0
        while i < self.limit or self.limit < 0:
            r = random.randint(0, 1000)
            cat = r % 2 == 0
            text = "This is sentence " + str(r)
            yield Example.from_dict(
                nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}}
            )
            i += 1
```
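
Before wiring the reader into training, it can help to sanity-check it from Python (a quick sketch that assumes `functions.py` is importable from the current directory):

```python
from itertools import islice
from spacy.lang.en import English
from functions import create_even_odd_corpus

nlp = English()
corpus = create_even_odd_corpus(limit=-1)
# the stream is infinite, so only peek at the first few examples
for example in islice(corpus(nlp), 3):
    print(example.reference.text, example.reference.cats)
```
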
> #### config.cfg
>
> ```ini
> [initialize.components.textcat.labels]
> @readers = "spacy.read_labels.v1"
> path = "labels/textcat.json"
> require = true
> ```

If the train corpus is streamed, the initialize step peeks at the first 100
examples in the corpus to find the labels for each component. If this isn't
sufficient, you'll need to [provide the labels](#initialization-labels) for each
component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can
be used to generate JSON files in the correct format, which you can extend with
the full label set.
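
To see which labels the initialize step would pick up from such a stream, you can mimic that peek yourself (a small sketch, reusing the reader from `functions.py` above):

```python
from itertools import islice
from spacy.lang.en import English
from functions import create_even_odd_corpus

nlp = English()
corpus = create_even_odd_corpus(limit=-1)
labels = set()
# the initialize step only inspects the first 100 examples of a streamed corpus
for example in islice(corpus(nlp), 100):
    labels.update(example.reference.cats)
print(sorted(labels))  # ['EVEN', 'ODD']
```
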
We can also customize the **batching strategy** by registering a new batcher
function in the `batchers` [registry](/api/top-level#registry). A batcher turns
a stream of items into a stream of batches. spaCy has several useful built-in