Update docs [ci skip]

Ines Montani 2020-08-19 12:14:41 +02:00
parent 2285e59765
commit 9c25656ccc
5 changed files with 88 additions and 38 deletions

View File

@@ -179,6 +179,7 @@ of objects by referring to creation functions, including functions you register
yourself. For details on how to get started with training your own model, check
out the [training quickstart](/usage/training#quickstart).

<!-- TODO:
<Project id="en_core_bert">

The easiest way to get started is to clone a transformers-based project
@@ -186,6 +187,7 @@ template. Swap in your data, edit the settings and hyperparameters and train,
evaluate, package and visualize your model.

</Project>
-->

The `[components]` section in the [`config.cfg`](/api/data-formats#config)
describes the pipeline components and the settings used to construct them,

View File

@@ -33,6 +33,7 @@ and prototypes and ship your models into production.
<!-- TODO: decide how to introduce concept -->

<!-- TODO:
<Project id="some_example_project">

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
@@ -40,6 +41,7 @@ sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
mattis pretium.

</Project>
-->

spaCy projects make it easy to integrate with many other **awesome tools** in
the data science and machine learning ecosystem to track and manage your data

View File

@@ -92,6 +92,7 @@ spaCy's binary `.spacy` format. You can either include the data paths in the
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```

<!-- TODO:
<Project id="some_example_project">

The easiest way to get started with an end-to-end training process is to clone a
@@ -99,6 +100,7 @@ The easiest way to get started with an end-to-end training process is to clone a
workflows, from data preprocessing to training and packaging your model.

</Project>
-->

## Training config {#config}
@@ -656,32 +658,74 @@ factor = 1.005

#### Example: Custom data reading and batching {#custom-code-readers-batchers}

Some use-cases require **streaming in data** or manipulating datasets on the
fly, rather than generating all data beforehand and storing it to file. Instead
of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
paths, you can create and register a custom function that generates
[`Example`](/api/example) objects. The resulting generator can be infinite. When
using this dataset for training, stopping criteria such as maximum number of
steps, or stopping when the loss does not decrease further, can be used.

In this example we assume a custom function `read_custom_data` which loads or
generates texts with relevant text classification annotations. Then, small
lexical variations of the input text are created before generating the final
[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
you register the function creating the custom reader in the `readers`
[registry](/api/top-level#registry) and assign it a string name, so it can be
used in your config. All arguments on the registered function become available
as **config settings**, in this case `source`.

> #### config.cfg
>
> ```ini
> [training.train_corpus]
> @readers = "corpus_variants.v1"
> source = "s3://your_bucket/path/data.csv"
> ```

```python
### functions.py {highlight="7-8"}
from typing import Callable, Iterator, List
import spacy
from spacy.gold import Example
from spacy.language import Language
import random

@spacy.registry.readers("corpus_variants.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp):
        for text, cats in read_custom_data(source):
            # Create a random variant of the example text
            i = random.randint(0, len(text) - 1)
            variant = text[:i] + text[i].upper() + text[i + 1:]
            doc = nlp.make_doc(variant)
            example = Example.from_dict(doc, {"cats": cats})
            yield example
    return generate_stream
```
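For illustration, the `read_custom_data` helper is assumed above and is not part
of spaCy. A minimal sketch, assuming `source` points to a local CSV file with a
`text` column plus one 0/1 column per category label (an `s3://` URL like the
one in the config aside would need an extra download step):

```python
import csv
from typing import Dict, Iterator, Tuple

def read_custom_data(source: str) -> Iterator[Tuple[str, Dict[str, float]]]:
    # Yield (text, cats) tuples in the format expected by Example.from_dict
    with open(source, newline="", encoding="utf8") as f:
        for row in csv.DictReader(f):
            text = row.pop("text")
            cats = {label: float(value) for label, value in row.items()}
            yield text, cats
```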
<Infobox variant="warning">

Remember that a registered function should always be a function that spaCy
**calls to create something**. In this case, it **creates the reader function**,
not the reader itself.

</Infobox>

We can also customize the **batching strategy** by registering a new batcher
function in the `batchers` [registry](/api/top-level#registry). A batcher turns
a stream of items into a stream of batches. spaCy has several useful built-in
[batching strategies](/api/top-level#batchers) with customizable sizes, but it's
also easy to implement your own. For instance, the following function takes the
stream of generated [`Example`](/api/example) objects, and removes those which
have the exact same underlying raw text, to avoid duplicates within each batch.
Note that in a more realistic implementation, you'd also want to check whether
the annotations are exactly the same.

> #### config.cfg
>
> ```ini
> [training.batcher]
> @batchers = "filtering_batch.v1"
> size = 150
> ```
@@ -689,39 +733,26 @@ annotations are exactly the same.

```python
### functions.py
from typing import Callable, Iterable, Iterator, List
import spacy
from spacy.gold import Example

@spacy.registry.batchers("filtering_batch.v1")
def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Example]]]:
    def create_filtered_batches(examples):
        batch = []
        for eg in examples:
            # Remove duplicate examples with the same text from batch
            if eg.text not in [x.text for x in batch]:
                batch.append(eg)
            if len(batch) == size:
                yield batch
                batch = []
    return create_filtered_batches
```
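As a quick way to see the deduplication behavior, you can call the registered
function directly. A minimal sketch, assuming `filter_batch` from the
`functions.py` block above is importable and using made-up example texts:

```python
import spacy
from spacy.gold import Example
from functions import filter_batch  # the registered batcher defined above

nlp = spacy.blank("en")
texts = ["hello", "hello", "world", "spaCy"]
examples = [Example.from_dict(nlp.make_doc(t), {"cats": {"POSITIVE": 1.0}})
            for t in texts]

create_batches = filter_batch(size=2)
for batch in create_batches(examples):
    # The duplicate "hello" is skipped, so the first full batch is
    # ["hello", "world"]; the trailing incomplete batch is dropped
    print([eg.text for eg in batch])
```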
<!-- TODO:
<Project id="example_pytorch_model">
@@ -731,12 +762,17 @@ mattis pretium.

</Project>
-->

### Defining custom architectures {#custom-architectures}

<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
<!-- TODO: Wrapping PyTorch and TensorFlow -->
## Transfer learning {#transfer-learning}

<!-- TODO: link to embeddings and transformers page -->

### Using transformer models like BERT {#transformers}

spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
@@ -748,6 +784,8 @@ do the required plumbing. It also provides a pipeline component,
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
you save the transformer outputs for later use.
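As a rough sketch of how this slots into the training config (a minimal,
hypothetical fragment; the model name and exact settings here are placeholders,
see the usage guide linked below for the real details):

```ini
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-uncased"
```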
<!-- TODO:
<Project id="en_core_bert">

Try out a BERT-based model pipeline using this project template: swap in your
@@ -755,6 +793,7 @@ data, edit the settings and hyperparameters and train, evaluate, package and
visualize your model.

</Project>
-->
For more details on how to integrate transformer models into your training
config and customize the implementations, see the usage guide on
@@ -766,7 +805,8 @@ config and customize the implementations, see the usage guide on

## Parallel Training with Ray {#parallel-training}

<!-- TODO:
<Project id="some_example_project">
@@ -776,6 +816,8 @@ mattis pretium.

</Project>
-->

## Internal training API {#api}

<Infobox variant="warning">

View File

@@ -444,6 +444,8 @@ values. You can then use the auto-generated `config.cfg` for training:
+ python -m spacy train ./config.cfg --output ./output
```

<!-- TODO:
<Project id="some_example_project">

The easiest way to get started with an end-to-end training process is to clone a
@@ -452,6 +454,8 @@ workflows, from data preprocessing to training and packaging your model.

</Project>
-->

#### Training via the Python API {#migrating-training-python}

For most use cases, you **shouldn't** have to write your own training scripts

View File

@@ -396,7 +396,7 @@ body [id]:target
  margin-right: -1.5em
  margin-left: -1.5em
  padding-right: 1.5em
  padding-left: 1.2em

  &:empty:before
    // Fix issue where empty lines would disappear