mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-27 17:54:39 +03:00
Add usage docs for streamed train corpora (#7693)
parent 73a8c0f992
commit 673e2bc4c0
@@ -193,10 +193,10 @@ process that are used when you run [`spacy train`](/api/cli#train).
 | `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
 | `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
 | `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ |
-| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ |
-| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ |
+| `max_epochs` | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus should be streamed rather than loaded into memory with no shuffling within the training loop. Defaults to `0`. ~~int~~ |
+| `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ |
 | `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
-| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ |
+| `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. ~~int~~ |
 | `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
 | `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
 | `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ |
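
The three rows changed in the table above (`max_epochs`, `max_steps`, `patience`) act together as stopping criteria. The following is only a rough sketch of the documented semantics, not spaCy's actual training loop; the function name `should_stop` and its arguments `step`, `epoch` and `best_step` are hypothetical and chosen for illustration.

```python
def should_stop(
    step: int,
    epoch: int,
    best_step: int,
    *,
    max_steps: int = 20000,
    max_epochs: int = 0,
    patience: int = 1600,
) -> bool:
    """Return True once any enabled stopping criterion is met.

    Hypothetical helper: `step` is the number of updates so far, `epoch` the
    number of completed passes over the corpus, and `best_step` the step at
    which the best evaluation score was seen.
    """
    # max_steps = 0 means an unlimited number of steps.
    if max_steps > 0 and step >= max_steps:
        return True
    # max_epochs = 0 (unlimited) and -1 (streamed corpus) both leave the
    # epoch count unbounded, so only positive values are checked.
    if max_epochs > 0 and epoch >= max_epochs:
        return True
    # patience = 0 disables early stopping.
    if patience > 0 and step - best_step >= patience:
        return True
    return False


# Streamed corpus: only max_steps and patience can end the run.
print(should_stop(2000, 0, 1900, max_steps=2000, max_epochs=-1, patience=500))  # True
print(should_stop(1000, 0, 900, max_steps=2000, max_epochs=-1, patience=500))   # False
```

With all three settings at `0`, this sketch never returns `True`, which is why a streamed or infinite corpus needs at least one of `max_steps` or `patience` to be set.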
@@ -1130,8 +1130,8 @@ any other custom workflows. `corpora.train` and `corpora.dev` are used as
 conventions within spaCy's default configs, but you can also define any other
 custom blocks. Each section in the corpora config should resolve to a
 [`Corpus`](/api/corpus) – for example, using spaCy's built-in
-[corpus reader](/api/top-level#readers) that takes a path to a binary `.spacy`
-file. The `train_corpus` and `dev_corpus` fields in the
+[corpus reader](/api/top-level#corpus-readers) that takes a path to a binary
+`.spacy` file. The `train_corpus` and `dev_corpus` fields in the
 [`[training]`](/api/data-formats#config-training) block specify where to find
 the corpus in your config. This makes it easy to **swap out** different corpora
 by only changing a single config setting.
@@ -1142,21 +1142,23 @@ corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be
 especially useful if you need to split a single file into corpora for training
 and evaluation, without loading the same file twice.
 
+By default, the training data is loaded into memory and shuffled before each
+epoch. If the corpus is **too large to fit into memory** during training, stream
+the corpus using a custom reader as described in the next section.
+
 ### Custom data reading and batching {#custom-code-readers-batchers}
 
 Some use-cases require **streaming in data** or manipulating datasets on the
-fly, rather than generating all data beforehand and storing it to file. Instead
+fly, rather than generating all data beforehand and storing it to disk. Instead
 of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
 paths, you can create and register a custom function that generates
-[`Example`](/api/example) objects. The resulting generator can be infinite. When
-using this dataset for training, stopping criteria such as maximum number of
-steps, or stopping when the loss does not decrease further, can be used.
+[`Example`](/api/example) objects.
 
-In this example we assume a custom function `read_custom_data` which loads or
-generates texts with relevant text classification annotations. Then, small
-lexical variations of the input text are created before generating the final
-[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
-you register the function creating the custom reader in the `readers`
+In the following example we assume a custom function `read_custom_data` which
+loads or generates texts with relevant text classification annotations. Then,
+small lexical variations of the input text are created before generating the
+final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator
+lets you register the function creating the custom reader in the `readers`
 [registry](/api/top-level#registry) and assign it a string name, so it can be
 used in your config. All arguments on the registered function become available
 as **config settings** – in this case, `source`.
@@ -1199,6 +1201,80 @@ Remember that a registered function should always be a function that spaCy
 
 </Infobox>
 
+If the corpus is **too large to load into memory** or the corpus reader is an
+**infinite generator**, use the setting `max_epochs = -1` to indicate that the
+train corpus should be streamed. With this setting the train corpus is merely
+streamed and batched, not shuffled, so any shuffling needs to be implemented in
+the corpus reader itself. In the example below, a corpus reader that generates
+sentences containing even or odd numbers is used with an unlimited number of
+examples for the train corpus and a limited number of examples for the dev
+corpus. The dev corpus should always be finite and fit in memory during the
+evaluation step. `max_steps` and/or `patience` are used to determine when the
+training should stop.
+
+> #### config.cfg
+>
+> ```ini
+> [corpora.dev]
+> @readers = "even_odd.v1"
+> limit = 100
+>
+> [corpora.train]
+> @readers = "even_odd.v1"
+> limit = -1
+>
+> [training]
+> max_epochs = -1
+> patience = 500
+> max_steps = 2000
+> ```
+
+```python
+### functions.py
+from typing import Callable, Iterable, Iterator
+from spacy import util
+import random
+from spacy.training import Example
+from spacy import Language
+
+
+@util.registry.readers("even_odd.v1")
+def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]:
+    return EvenOddCorpus(limit)
+
+
+class EvenOddCorpus:
+    def __init__(self, limit):
+        self.limit = limit
+
+    def __call__(self, nlp: Language) -> Iterator[Example]:
+        i = 0
+        while i < self.limit or self.limit < 0:
+            r = random.randint(0, 1000)
+            cat = r % 2 == 0
+            text = "This is sentence " + str(r)
+            yield Example.from_dict(
+                nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}}
+            )
+            i += 1
+```
+
+> #### config.cfg
+>
+> ```ini
+> [initialize.components.textcat.labels]
+> @readers = "spacy.read_labels.v1"
+> path = "labels/textcat.json"
+> require = true
+> ```
+
+If the train corpus is streamed, the initialize step peeks at the first 100
+examples in the corpus to find the labels for each component. If this isn't
+sufficient, you'll need to [provide the labels](#initialization-labels) for each
+component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can
+be used to generate JSON files in the correct format, which you can extend with
+the full label set.
+
 We can also customize the **batching strategy** by registering a new batcher
 function in the `batchers` [registry](/api/top-level#registry). A batcher turns
 a stream of items into a stream of batches. spaCy has several useful built-in
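
The added docs state that with `max_epochs = -1` the train corpus is not shuffled, so any shuffling has to happen inside the corpus reader. One common way to approximate shuffling on a stream is a fixed-size shuffle buffer. The sketch below is an assumption about how a reader might do this, not part of spaCy; `shuffle_buffer`, `buffer_size` and `seed` are hypothetical names, and plain integers stand in for `Example` objects.

```python
import itertools
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def shuffle_buffer(
    stream: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Yield items from `stream` in a locally shuffled order.

    Only `buffer_size` items are held in memory at a time, so this also
    works on infinite generators.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Finite streams: flush whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


# Works on an infinite stream; take the first 20 shuffled items.
head = list(itertools.islice(shuffle_buffer(itertools.count(), 8, seed=1), 20))
print(len(head), len(set(head)))
```

Inside a reader such as `EvenOddCorpus.__call__`, the generated examples could be wrapped in a helper like this before being yielded. A larger buffer gives a better shuffle at the cost of memory.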
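
The added paragraph on initialization says that for a streamed corpus only the first 100 examples are inspected to find labels. The sketch below illustrates why that can miss labels; it is not spaCy's implementation. Plain dicts stand in for `Example` objects, and `peek_labels`, `late_label_stream` and the `"LATE"` label are hypothetical.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, Set

Annotations = Dict[str, Dict[str, bool]]


def peek_labels(examples: Iterable[Annotations], n: int = 100) -> Set[str]:
    """Collect the category labels seen in the first `n` items of a stream."""
    labels: Set[str] = set()
    for annotations in islice(examples, n):
        labels.update(annotations["cats"].keys())
    return labels


def late_label_stream() -> Iterator[Annotations]:
    """Infinite stream in which the "LATE" label only appears from item 150 on."""
    i = 0
    while True:
        cats = {"EVEN": i % 2 == 0, "ODD": i % 2 == 1}
        if i >= 150:
            cats["LATE"] = True
        yield {"cats": cats}
        i += 1


print(sorted(peek_labels(late_label_stream())))         # ['EVEN', 'ODD']
print(sorted(peek_labels(late_label_stream(), n=200)))  # ['EVEN', 'LATE', 'ODD']
```

In a situation like this, the labels would need to be provided explicitly in the `[initialize]` block, as the added docs describe.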