mirror of https://github.com/explosion/spaCy.git
synced 2025-01-12 18:26:30 +03:00

Add usage docs for streamed train corpora (#7693)

This commit is contained in:
parent 73a8c0f992
commit 673e2bc4c0
@@ -193,10 +193,10 @@ process that are used when you run [`spacy train`](/api/cli#train).
 | `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
 | `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
 | `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ |
-| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ |
-| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ |
+| `max_epochs` | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus should be streamed rather than loaded into memory with no shuffling within the training loop. Defaults to `0`. ~~int~~ |
+| `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ |
 | `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
-| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ |
+| `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. ~~int~~ |
 | `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
 | `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
 | `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ |
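For orientation, the settings documented above live together in the `[training]` block of a `config.cfg`. The fragment below is a sketch only: the values shown are the documented defaults, except `max_epochs = -1`, which is chosen here to stream the train corpus, and the `${system.*}` variables assume a `[system]` section defined elsewhere in the file.

```ini
[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
train_corpus = "corpora.train"
frozen_components = []
max_epochs = -1
max_steps = 20000
patience = 1600
```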
@@ -1130,8 +1130,8 @@ any other custom workflows. `corpora.train` and `corpora.dev` are used as
 conventions within spaCy's default configs, but you can also define any other
 custom blocks. Each section in the corpora config should resolve to a
 [`Corpus`](/api/corpus) – for example, using spaCy's built-in
-[corpus reader](/api/top-level#readers) that takes a path to a binary `.spacy`
-file. The `train_corpus` and `dev_corpus` fields in the
+[corpus reader](/api/top-level#corpus-readers) that takes a path to a binary
+`.spacy` file. The `train_corpus` and `dev_corpus` fields in the
 [`[training]`](/api/data-formats#config-training) block specify where to find
 the corpus in your config. This makes it easy to **swap out** different corpora
 by only changing a single config setting.
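As a sketch of that swap, suppose two corpora blocks are defined with the built-in `spacy.Corpus.v1` reader (the `corpora.train_large` name and the `${paths.*}` variables are hypothetical placeholders); switching between them only touches the `train_corpus` setting:

```ini
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}

[corpora.train_large]
@readers = "spacy.Corpus.v1"
path = ${paths.train_large}

[training]
# Point this at "corpora.train_large" to train on the other corpus.
train_corpus = "corpora.train"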
@@ -1142,21 +1142,23 @@ corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be
 especially useful if you need to split a single file into corpora for training
 and evaluation, without loading the same file twice.
+
+By default, the training data is loaded into memory and shuffled before each
+epoch. If the corpus is **too large to fit into memory** during training, stream
+the corpus using a custom reader as described in the next section.

 ### Custom data reading and batching {#custom-code-readers-batchers}

 Some use-cases require **streaming in data** or manipulating datasets on the
-fly, rather than generating all data beforehand and storing it to file. Instead
+fly, rather than generating all data beforehand and storing it to disk. Instead
 of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
 paths, you can create and register a custom function that generates
-[`Example`](/api/example) objects. The resulting generator can be infinite. When
-using this dataset for training, stopping criteria such as maximum number of
-steps, or stopping when the loss does not decrease further, can be used.
+[`Example`](/api/example) objects.

-In this example we assume a custom function `read_custom_data` which loads or
-generates texts with relevant text classification annotations. Then, small
-lexical variations of the input text are created before generating the final
-[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
-you register the function creating the custom reader in the `readers`
+In the following example we assume a custom function `read_custom_data` which
+loads or generates texts with relevant text classification annotations. Then,
+small lexical variations of the input text are created before generating the
+final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator
+lets you register the function creating the custom reader in the `readers`
 [registry](/api/top-level#registry) and assign it a string name, so it can be
 used in your config. All arguments on the registered function become available
 as **config settings** – in this case, `source`.
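The registration pattern can be sketched without spaCy itself. Below, a hypothetical factory `create_reader` plays the role of the registered function, its `source` argument stands in for the config setting of the same name, and plain `(text, annotations)` tuples stand in for `Example` objects; in real code the factory would carry the `@spacy.registry.readers(...)` decorator and build `Example`s from an `nlp` object.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def lexical_variants(text: str) -> List[str]:
    # Hypothetical augmentation step: keep the original text plus a
    # lowercased variant.
    return [text, text.lower()]

def create_reader(source: List[Tuple[str, str]]) -> Callable[[], Iterable[Tuple[str, Dict]]]:
    # `source` plays the role of the registered function's config setting.
    def read_custom_data() -> Iterable[Tuple[str, Dict]]:
        for text, label in source:
            for variant in lexical_variants(text):
                # In spaCy this would be Example.from_dict(nlp.make_doc(variant), ...)
                yield variant, {"cats": {label: 1.0}}
    return read_custom_data

reader = create_reader([("Nice product!", "POSITIVE")])
examples = list(reader())
```

Because the reader is a generator function, nothing is loaded up front; each call to `reader()` produces a fresh stream of annotated texts.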
@@ -1199,6 +1201,80 @@ Remember that a registered function should always be a function that spaCy

 </Infobox>

+If the corpus is **too large to load into memory** or the corpus reader is an
+**infinite generator**, use the setting `max_epochs = -1` to indicate that the
+train corpus should be streamed. With this setting the train corpus is merely
+streamed and batched, not shuffled, so any shuffling needs to be implemented in
+the corpus reader itself. In the example below, a corpus reader that generates
+sentences containing even or odd numbers is used with an unlimited number of
+examples for the train corpus and a limited number of examples for the dev
+corpus. The dev corpus should always be finite and fit in memory during the
+evaluation step. `max_steps` and/or `patience` are used to determine when the
+training should stop.
+
+> #### config.cfg
+>
+> ```ini
+> [corpora.dev]
+> @readers = "even_odd.v1"
+> limit = 100
+>
+> [corpora.train]
+> @readers = "even_odd.v1"
+> limit = -1
+>
+> [training]
+> max_epochs = -1
+> patience = 500
+> max_steps = 2000
+> ```
+
+```python
+### functions.py
+import random
+from typing import Callable, Iterable, Iterator
+
+from spacy import Language, util
+from spacy.training import Example
+
+
+@util.registry.readers("even_odd.v1")
+def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]:
+    return EvenOddCorpus(limit)
+
+
+class EvenOddCorpus:
+    def __init__(self, limit: int):
+        self.limit = limit
+
+    def __call__(self, nlp: Language) -> Iterator[Example]:
+        i = 0
+        # A negative limit yields an infinite stream of examples.
+        while i < self.limit or self.limit < 0:
+            r = random.randint(0, 1000)
+            cat = r % 2 == 0
+            text = "This is sentence " + str(r)
+            yield Example.from_dict(
+                nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}}
+            )
+            i += 1
+```
+
+> #### config.cfg
+>
+> ```ini
+> [initialize.components.textcat.labels]
+> @readers = "spacy.read_labels.v1"
+> path = "labels/textcat.json"
+> require = true
+> ```
+
+If the train corpus is streamed, the initialize step peeks at the first 100
+examples in the corpus to find the labels for each component. If this isn't
+sufficient, you'll need to [provide the labels](#initialization-labels) for each
+component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can
+be used to generate JSON files in the correct format, which you can extend with
+the full label set.
+
 We can also customize the **batching strategy** by registering a new batcher
 function in the `batchers` [registry](/api/top-level#registry). A batcher turns
 a stream of items into a stream of batches. spaCy has several useful built-in
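The batching idea itself can be sketched in plain Python: a batcher is simply a callable that turns a stream of items into a stream of lists. The fixed-size `minibatch` below is a simplified, hypothetical stand-in for spaCy's registered batchers, which can be more elaborate (e.g. batching by number of words or using a size schedule).

```python
from typing import Any, Iterable, Iterator, List

def minibatch(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield successive fixed-size batches; the final batch may be smaller."""
    batch: List[Any] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

batches = list(minibatch(range(7), size=3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

Because it consumes its input lazily, a batcher like this also works on the infinite streams produced by a streaming corpus reader.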