From 9c25656ccc30f88ca023dec5a45f902ced661245 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Wed, 19 Aug 2020 12:14:41 +0200 Subject: [PATCH] Update docs [ci skip] --- website/docs/usage/embeddings-transformers.md | 2 + website/docs/usage/projects.md | 2 + website/docs/usage/training.md | 116 ++++++++++++------ website/docs/usage/v3.md | 4 + website/src/styles/layout.sass | 2 +- 5 files changed, 88 insertions(+), 38 deletions(-) diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 5a3189ecb..e097ae02a 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -179,6 +179,7 @@ of objects by referring to creation functions, including functions you register yourself. For details on how to get started with training your own model, check out the [training quickstart](/usage/training#quickstart). + The `[components]` section in the [`config.cfg`](/api/data-formats#config) describes the pipeline components and the settings used to construct them, diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md index ab8101477..61367fb0e 100644 --- a/website/docs/usage/projects.md +++ b/website/docs/usage/projects.md @@ -33,6 +33,7 @@ and prototypes and ship your models into production. + spaCy projects make it easy to integrate with many other **awesome tools** in the data science and machine learning ecosystem to track and manage your data diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 30537b8d8..7ce9457f9 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -92,6 +92,7 @@ spaCy's binary `.spacy` format. 
You can either include the data paths in the config, or pass them in on the
command line:

```cli
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```
+

## Training config {#config}

@@ -656,32 +658,74 @@ factor = 1.005

#### Example: Custom data reading and batching {#custom-code-readers-batchers}

-Some use-cases require streaming in data or manipulating datasets on the fly,
-rather than generating all data beforehand and storing it to file. Instead of
-using the built-in reader `"spacy.Corpus.v1"`, which uses static file paths, you
-can create and register a custom function that generates
+Some use-cases require **streaming in data** or manipulating datasets on the
+fly, rather than generating all data beforehand and storing it to file. Instead
+of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
+paths, you can create and register a custom function that generates
[`Example`](/api/example) objects. The resulting generator can be infinite. When
using this dataset for training, stopping criteria such as maximum number of
steps, or stopping when the loss does not decrease further, can be used.

-In this example we assume a custom function `read_custom_data()` which loads or
-generates texts with relevant textcat annotations. Then, small lexical
-variations of the input text are created before generating the final `Example`
-objects.
-
-We can also customize the batching strategy by registering a new "batcher" which
-turns a stream of items into a stream of batches. spaCy has several useful
-built-in batching strategies with customizable sizes, but
-it's also easy to implement your own. For instance, the following function takes
-the stream of generated `Example` objects, and removes those which have the
-exact same underlying raw text, to avoid duplicates within each batch. Note that
-in a more realistic implementation, you'd also want to check whether the
-annotations are exactly the same.
+In this example we assume a custom function `read_custom_data` which loads or +generates texts with relevant text classification annotations. Then, small +lexical variations of the input text are created before generating the final +[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets +you register the function creating the custom reader in the `readers` +[registry](/api/top-level#registry) and assign it a string name, so it can be +used in your config. All arguments on the registered function become available +as **config settings** – in this case, `source`. +> #### config.cfg +> > ```ini > [training.train_corpus] > @readers = "corpus_variants.v1" +> source = "s3://your_bucket/path/data.csv" +> ``` + +```python +### functions.py {highlight="7-8"} +from typing import Callable, Iterator, List +import spacy +from spacy.gold import Example +from spacy.language import Language +import random + +@spacy.registry.readers("corpus_variants.v1") +def stream_data(source: str) -> Callable[[Language], Iterator[Example]]: + def generate_stream(nlp): + for text, cats in read_custom_data(source): + # Create a random variant of the example text + i = random.randint(0, len(text) - 1) + variant = text[:i] + text[i].upper() + text[i + 1:] + doc = nlp.make_doc(variant) + example = Example.from_dict(doc, {"cats": cats}) + yield example + + return generate_stream +``` + + + +Remember that a registered function should always be a function that spaCy +**calls to create something**. In this case, it **creates the reader function** +– it's not the reader itself. + + + +We can also customize the **batching strategy** by registering a new batcher +function in the `batchers` [registry](/api/top-level#registry). A batcher turns +a stream of items into a stream of batches. spaCy has several useful built-in +[batching strategies](/api/top-level#batchers) with customizable sizes, but it's +also easy to implement your own. 
For instance, the following function takes the
stream of generated [`Example`](/api/example) objects, and removes those which
have the exact same underlying raw text, to avoid duplicates within each batch.
Note that in a more realistic implementation, you'd also want to check whether
the annotations are exactly the same.
+
> #### config.cfg
>
> ```ini
> [training.batcher]
> @batchers = "filtering_batch.v1"
> size = 150
@@ -689,39 +733,26 @@ annotations are exactly the same.

```python
### functions.py
-from typing import Callable, Iterable, List
+from typing import Callable, Iterable, Iterator, List
import spacy
from spacy.gold import Example
-import random
-
-@spacy.registry.readers("corpus_variants.v1")
-def stream_data() -> Callable[["Language"], Iterable[Example]]:
-    def generate_stream(nlp):
-        for text, cats in read_custom_data():
-            random_index = random.randint(0, len(text) - 1)
-            variant = text[:random_index] + text[random_index].upper() + text[random_index + 1:]
-            doc = nlp.make_doc(variant)
-            example = Example.from_dict(doc, {"cats": cats})
-            yield example
-    return generate_stream
-
@spacy.registry.batchers("filtering_batch.v1")
-def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterable[List[Example]]]:
-    def create_filtered_batches(examples: Iterable[Example]) -> Iterable[List[Example]]:
+def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Example]]]:
+    def create_filtered_batches(examples):
        batch = []
        for eg in examples:
+            # Remove duplicate examples with the same text from batch
            if eg.text not in [x.text for x in batch]:
                batch.append(eg)
            if len(batch) == size:
                yield batch
                batch = []
+
    return create_filtered_batches
```

-### Wrapping PyTorch and TensorFlow {#custom-frameworks}
-
-
+
+
### Defining custom architectures {#custom-architectures}
+
## Transfer learning {#transfer-learning}
+
+
### Using transformer models like BERT {#transformers}

spaCy v3.0 lets you use almost any statistical model to power your pipeline.
You @@ -748,6 +784,8 @@ do the required plumbing. It also provides a pipeline component, [`Transformer`](/api/transformer), that lets you do multi-task learning and lets you save the transformer outputs for later use. + For more details on how to integrate transformer models into your training config and customize the implementations, see the usage guide on @@ -766,7 +805,8 @@ config and customize the implementations, see the usage guide on ## Parallel Training with Ray {#parallel-training} - + + ## Internal training API {#api} diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index 47110609e..ffed1c89f 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -444,6 +444,8 @@ values. You can then use the auto-generated `config.cfg` for training: + python -m spacy train ./config.cfg --output ./output ``` + + #### Training via the Python API {#migrating-training-python} For most use cases, you **shouldn't** have to write your own training scripts diff --git a/website/src/styles/layout.sass b/website/src/styles/layout.sass index 03011bf4e..775523190 100644 --- a/website/src/styles/layout.sass +++ b/website/src/styles/layout.sass @@ -396,7 +396,7 @@ body [id]:target margin-right: -1.5em margin-left: -1.5em padding-right: 1.5em - padding-left: 1.25em + padding-left: 1.2em &:empty:before // Fix issue where empty lines would disappear
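The filtering batcher introduced in the training.md hunk above is easy to sanity-check outside of spaCy. The following is a minimal, framework-free sketch of the same logic, operating on plain strings instead of `Example` objects (the `eg.text` lookup becomes the item itself); this simplification is ours, not part of the patch:

```python
from typing import Callable, Iterable, Iterator, List

def filter_batch(size: int) -> Callable[[Iterable[str]], Iterator[List[str]]]:
    """Create a batcher that drops within-batch duplicates, mirroring the docs example."""
    def create_filtered_batches(examples: Iterable[str]) -> Iterator[List[str]]:
        batch: List[str] = []
        for eg in examples:
            # Skip items whose text already occurs in the current batch
            if eg not in batch:
                batch.append(eg)
            if len(batch) == size:
                yield batch
                batch = []
        # As in the documented version, a trailing partial batch is dropped
    return create_filtered_batches

batcher = filter_batch(2)
print(list(batcher(["a", "b", "a", "c", "d"])))  # [['a', 'b'], ['a', 'c']]
```

Note that, like the version in the patch, this yields only full batches; an infinite example stream (as produced by a custom reader) always fills the next batch eventually, so nothing is lost in the streaming setting the docs describe.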