mirror of
https://github.com/explosion/spaCy.git
synced 2025-05-02 23:03:41 +03:00
Update docs [ci skip]
parent
2285e59765
commit
9c25656ccc
|
@ -179,6 +179,7 @@ of objects by referring to creation functions, including functions you register
|
||||||
yourself. For details on how to get started with training your own model, check
|
yourself. For details on how to get started with training your own model, check
|
||||||
out the [training quickstart](/usage/training#quickstart).
|
out the [training quickstart](/usage/training#quickstart).
|
||||||
|
|
||||||
|
<!-- TODO:
|
||||||
<Project id="en_core_bert">
|
<Project id="en_core_bert">
|
||||||
|
|
||||||
The easiest way to get started is to clone a transformers-based project
|
The easiest way to get started is to clone a transformers-based project
|
||||||
|
@ -186,6 +187,7 @@ template. Swap in your data, edit the settings and hyperparameters and train,
|
||||||
evaluate, package and visualize your model.
|
evaluate, package and visualize your model.
|
||||||
|
|
||||||
</Project>
|
</Project>
|
||||||
|
-->
|
||||||
|
|
||||||
The `[components]` section in the [`config.cfg`](/api/data-formats#config)
|
The `[components]` section in the [`config.cfg`](/api/data-formats#config)
|
||||||
describes the pipeline components and the settings used to construct them,
|
describes the pipeline components and the settings used to construct them,
|
||||||
|
|
|
@ -33,6 +33,7 @@ and prototypes and ship your models into production.
|
||||||
|
|
||||||
<!-- TODO: decide how to introduce concept -->
|
<!-- TODO: decide how to introduce concept -->
|
||||||
|
|
||||||
|
<!-- TODO:
|
||||||
<Project id="some_example_project">
|
<Project id="some_example_project">
|
||||||
|
|
||||||
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
|
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
|
||||||
|
@ -40,6 +41,7 @@ sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
|
||||||
mattis pretium.
|
mattis pretium.
|
||||||
|
|
||||||
</Project>
|
</Project>
|
||||||
|
-->
|
||||||
|
|
||||||
spaCy projects make it easy to integrate with many other **awesome tools** in
|
spaCy projects make it easy to integrate with many other **awesome tools** in
|
||||||
the data science and machine learning ecosystem to track and manage your data
|
the data science and machine learning ecosystem to track and manage your data
|
||||||
|
|
|
@ -92,6 +92,7 @@ spaCy's binary `.spacy` format. You can either include the data paths in the
|
||||||
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
|
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
|
||||||
```
|
```
|
||||||
|
|
||||||
|
<!-- TODO:
|
||||||
<Project id="some_example_project">
|
<Project id="some_example_project">
|
||||||
|
|
||||||
The easiest way to get started with an end-to-end training process is to clone a
|
The easiest way to get started with an end-to-end training process is to clone a
|
||||||
|
@ -99,6 +100,7 @@ The easiest way to get started with an end-to-end training process is to clone a
|
||||||
workflows, from data preprocessing to training and packaging your model.
|
workflows, from data preprocessing to training and packaging your model.
|
||||||
|
|
||||||
</Project>
|
</Project>
|
||||||
|
-->
|
||||||
|
|
||||||
## Training config {#config}
|
## Training config {#config}
|
||||||
|
|
||||||
|
@ -656,32 +658,74 @@ factor = 1.005
|
||||||
|
|
||||||
#### Example: Custom data reading and batching {#custom-code-readers-batchers}
|
#### Example: Custom data reading and batching {#custom-code-readers-batchers}
|
||||||
|
|
||||||
Some use-cases require streaming in data or manipulating datasets on the fly,
|
Some use-cases require **streaming in data** or manipulating datasets on the
|
||||||
rather than generating all data beforehand and storing it to file. Instead of
|
fly, rather than generating all data beforehand and storing it to file. Instead
|
||||||
using the built-in reader `"spacy.Corpus.v1"`, which uses static file paths, you
|
of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
|
||||||
can create and register a custom function that generates
|
paths, you can create and register a custom function that generates
|
||||||
[`Example`](/api/example) objects. The resulting generator can be infinite. When
|
[`Example`](/api/example) objects. The resulting generator can be infinite. When
|
||||||
using this dataset for training, stopping criteria such as maximum number of
|
using this dataset for training, stopping criteria such as maximum number of
|
||||||
steps, or stopping when the loss does not decrease further, can be used.
|
steps, or stopping when the loss does not decrease further, can be used.
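As a rough illustration of why an infinite generator is workable, here is a minimal plain-Python sketch with no spaCy involved. The names `endless_examples` and `max_steps` and the toy texts are assumptions made up for this example. The point is only that the consumer, not the generator, decides when to stop:

```python
import itertools
import random
from typing import Dict, Iterator, Tuple


def endless_examples(seed: int = 0) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Hypothetical stand-in for a streaming reader: it never runs out of data."""
    rng = random.Random(seed)
    texts = ["good movie", "bad movie"]
    while True:
        text = rng.choice(texts)
        # Toy text classification annotation, keyed by label
        yield text, {"POSITIVE": float(text.startswith("good"))}


# The consumer applies the stopping criterion, here a fixed number of steps.
max_steps = 5
seen = list(itertools.islice(endless_examples(), max_steps))
```

A loss-based criterion would follow the same shape: the training loop keeps pulling from the stream and breaks out once its own condition is met.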
|
||||||
|
|
||||||
In this example we assume a custom function `read_custom_data()` which loads or
|
In this example we assume a custom function `read_custom_data` which loads or
|
||||||
generates texts with relevant textcat annotations. Then, small lexical
|
generates texts with relevant text classification annotations. Then, small
|
||||||
variations of the input text are created before generating the final `Example`
|
lexical variations of the input text are created before generating the final
|
||||||
objects.
|
[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
|
||||||
|
you register the function creating the custom reader in the `readers`
|
||||||
We can also customize the batching strategy by registering a new "batcher" which
|
[registry](/api/top-level#registry) and assign it a string name, so it can be
|
||||||
turns a stream of items into a stream of batches. spaCy has several useful
|
used in your config. All arguments on the registered function become available
|
||||||
built-in batching strategies with customizable sizes<!-- TODO: link -->, but
|
as **config settings** – in this case, `source`.
|
||||||
it's also easy to implement your own. For instance, the following function takes
|
|
||||||
the stream of generated `Example` objects, and removes those which have the
|
|
||||||
exact same underlying raw text, to avoid duplicates within each batch. Note that
|
|
||||||
in a more realistic implementation, you'd also want to check whether the
|
|
||||||
annotations are exactly the same.
|
|
||||||
|
|
||||||
|
> #### config.cfg
|
||||||
|
>
|
||||||
> ```ini
|
> ```ini
|
||||||
> [training.train_corpus]
|
> [training.train_corpus]
|
||||||
> @readers = "corpus_variants.v1"
|
> @readers = "corpus_variants.v1"
|
||||||
|
> source = "s3://your_bucket/path/data.csv"
|
||||||
|
> ```
|
||||||
|
|
||||||
|
```python
### functions.py {highlight="7-8"}
from typing import Callable, Iterator, List
import spacy
from spacy.gold import Example
from spacy.language import Language
import random

@spacy.registry.readers("corpus_variants.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp):
        for text, cats in read_custom_data(source):
            # Create a random variant of the example text
            i = random.randint(0, len(text) - 1)
            variant = text[:i] + text[i].upper() + text[i + 1:]
            doc = nlp.make_doc(variant)
            example = Example.from_dict(doc, {"cats": cats})
            yield example

    return generate_stream
```
|
||||||
|
|
||||||
|
<Infobox variant="warning">
|
||||||
|
|
||||||
|
Remember that a registered function should always be a function that spaCy
|
||||||
|
**calls to create something**. In this case, it **creates the reader function**
|
||||||
|
– it's not the reader itself.
|
||||||
|
|
||||||
|
</Infobox>
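To make the distinction between the registered function and the reader concrete, here is a minimal sketch in plain Python. It does not use spaCy's registry: the `READERS` dict, the `register` decorator and the name `"toy_reader.v1"` are hypothetical stand-ins that only mimic the pattern.

```python
from typing import Callable, Dict, Iterator

# Hypothetical stand-in for a registry: maps string names to creation functions.
READERS: Dict[str, Callable] = {}


def register(name: str):
    def decorator(func: Callable) -> Callable:
        READERS[name] = func
        return func
    return decorator


@register("toy_reader.v1")
def create_reader(source: str) -> Callable[[], Iterator[str]]:
    # Called once, with the settings from the config (here: source)
    def reader() -> Iterator[str]:
        # Called whenever the data is actually needed
        for i in range(3):
            yield f"{source}:{i}"
    return reader


# Looking up the name gives the creation function, which returns the reader.
reader = READERS["toy_reader.v1"](source="data.csv")
lines = list(reader())
```

The outer function receives the settings and the inner function does the reading, which mirrors how `stream_data` returns `generate_stream` in the example above.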
|
||||||
|
|
||||||
|
We can also customize the **batching strategy** by registering a new batcher
|
||||||
|
function in the `batchers` [registry](/api/top-level#registry). A batcher turns
|
||||||
|
a stream of items into a stream of batches. spaCy has several useful built-in
|
||||||
|
[batching strategies](/api/top-level#batchers) with customizable sizes, but it's
|
||||||
|
also easy to implement your own. For instance, the following function takes the
|
||||||
|
stream of generated [`Example`](/api/example) objects, and removes those which
|
||||||
|
have the exact same underlying raw text, to avoid duplicates within each batch.
|
||||||
|
Note that in a more realistic implementation, you'd also want to check whether
|
||||||
|
the annotations are exactly the same.
|
||||||
|
|
||||||
|
> #### config.cfg
|
||||||
>
|
>
|
||||||
|
> ```ini
|
||||||
> [training.batcher]
|
> [training.batcher]
|
||||||
> @batchers = "filtering_batch.v1"
|
> @batchers = "filtering_batch.v1"
|
||||||
> size = 150
|
> size = 150
|
||||||
|
@ -689,39 +733,26 @@ annotations are exactly the same.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
### functions.py
|
### functions.py
|
||||||
from typing import Callable, Iterable, List
|
from typing import Callable, Iterable, Iterator
|
||||||
import spacy
|
import spacy
|
||||||
from spacy.gold import Example
|
from spacy.gold import Example
|
||||||
import random
|
|
||||||
|
|
||||||
@spacy.registry.readers("corpus_variants.v1")
|
|
||||||
def stream_data() -> Callable[["Language"], Iterable[Example]]:
|
|
||||||
def generate_stream(nlp):
|
|
||||||
for text, cats in read_custom_data():
|
|
||||||
random_index = random.randint(0, len(text) - 1)
|
|
||||||
variant = text[:random_index] + text[random_index].upper() + text[random_index + 1:]
|
|
||||||
doc = nlp.make_doc(variant)
|
|
||||||
example = Example.from_dict(doc, {"cats": cats})
|
|
||||||
yield example
|
|
||||||
return generate_stream
|
|
||||||
|
|
||||||
|
|
||||||
@spacy.registry.batchers("filtering_batch.v1")
|
@spacy.registry.batchers("filtering_batch.v1")
|
||||||
def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterable[List[Example]]]:
|
def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Example]]]:
|
||||||
def create_filtered_batches(examples: Iterable[Example]) -> Iterable[List[Example]]:
|
def create_filtered_batches(examples):
|
||||||
batch = []
|
batch = []
|
||||||
for eg in examples:
|
for eg in examples:
|
||||||
|
# Remove duplicate examples with the same text from batch
|
||||||
if eg.text not in [x.text for x in batch]:
|
if eg.text not in [x.text for x in batch]:
|
||||||
batch.append(eg)
|
batch.append(eg)
|
||||||
if len(batch) == size:
|
if len(batch) == size:
|
||||||
yield batch
|
yield batch
|
||||||
batch = []
|
batch = []
|
||||||
|
|
||||||
return create_filtered_batches
|
return create_filtered_batches
|
||||||
```
|
```
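One detail worth noting: as written, the batcher above only yields once a batch reaches `size`, so a trailing partial batch would be dropped when a finite stream ends. Whether that is intended is not stated here. The following sketch shows a variant that also yields the remainder. It is plain Python and uses a hypothetical `Item` class as a stand-in for `Example`, so only the `text` attribute is assumed:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Item:
    """Hypothetical stand-in for an Example; only `text` matters here."""
    text: str


def filter_batch(size: int) -> Callable[[Iterable[Item]], Iterator[List[Item]]]:
    def create_filtered_batches(items: Iterable[Item]) -> Iterator[List[Item]]:
        batch: List[Item] = []
        seen = set()  # texts already present in the current batch
        for item in items:
            if item.text in seen:
                continue  # skip duplicates within this batch
            batch.append(item)
            seen.add(item.text)
            if len(batch) == size:
                yield batch
                batch, seen = [], set()
        if batch:
            # Yield the final, possibly smaller, batch
            yield batch

    return create_filtered_batches


batcher = filter_batch(size=2)
stream = (Item(t) for t in ["a", "a", "b", "c", "c"])
batches = [[item.text for item in batch] for batch in batcher(stream)]
```

Tracking the texts in a set also avoids rebuilding a list of texts for every incoming item. With an infinite stream the final `if batch` branch is never reached, so the behavior matches the original in that case.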
|
||||||
|
|
||||||
### Wrapping PyTorch and TensorFlow {#custom-frameworks}
|
<!-- TODO:
|
||||||
|
|
||||||
<!-- TODO: -->
|
|
||||||
|
|
||||||
<Project id="example_pytorch_model">
|
<Project id="example_pytorch_model">
|
||||||
|
|
||||||
|
@ -731,12 +762,17 @@ mattis pretium.
|
||||||
|
|
||||||
</Project>
|
</Project>
|
||||||
|
|
||||||
|
-->
|
||||||
|
|
||||||
### Defining custom architectures {#custom-architectures}
|
### Defining custom architectures {#custom-architectures}
|
||||||
|
|
||||||
<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
|
<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
|
||||||
|
<!-- TODO: Wrapping PyTorch and TensorFlow -->
|
||||||
|
|
||||||
## Transfer learning {#transfer-learning}
|
## Transfer learning {#transfer-learning}
|
||||||
|
|
||||||
|
<!-- TODO: link to embeddings and transformers page -->
|
||||||
|
|
||||||
### Using transformer models like BERT {#transformers}
|
### Using transformer models like BERT {#transformers}
|
||||||
|
|
||||||
spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
|
spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
|
||||||
|
@ -748,6 +784,8 @@ do the required plumbing. It also provides a pipeline component,
|
||||||
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
|
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
|
||||||
you save the transformer outputs for later use.
|
you save the transformer outputs for later use.
|
||||||
|
|
||||||
|
<!-- TODO:
|
||||||
|
|
||||||
<Project id="en_core_bert">
|
<Project id="en_core_bert">
|
||||||
|
|
||||||
Try out a BERT-based model pipeline using this project template: swap in your
|
Try out a BERT-based model pipeline using this project template: swap in your
|
||||||
|
@ -755,6 +793,7 @@ data, edit the settings and hyperparameters and train, evaluate, package and
|
||||||
visualize your model.
|
visualize your model.
|
||||||
|
|
||||||
</Project>
|
</Project>
|
||||||
|
-->
|
||||||
|
|
||||||
For more details on how to integrate transformer models into your training
|
For more details on how to integrate transformer models into your training
|
||||||
config and customize the implementations, see the usage guide on
|
config and customize the implementations, see the usage guide on
|
||||||
|
@ -766,7 +805,8 @@ config and customize the implementations, see the usage guide on
|
||||||
|
|
||||||
## Parallel Training with Ray {#parallel-training}
|
## Parallel Training with Ray {#parallel-training}
|
||||||
|
|
||||||
<!-- TODO: document Ray integration -->
|
<!-- TODO:
|
||||||
|
|
||||||
|
|
||||||
<Project id="some_example_project">
|
<Project id="some_example_project">
|
||||||
|
|
||||||
|
@ -776,6 +816,8 @@ mattis pretium.
|
||||||
|
|
||||||
</Project>
|
</Project>
|
||||||
|
|
||||||
|
-->
|
||||||
|
|
||||||
## Internal training API {#api}
|
## Internal training API {#api}
|
||||||
|
|
||||||
<Infobox variant="warning">
|
<Infobox variant="warning">
|
||||||
|
|
|
@ -444,6 +444,8 @@ values. You can then use the auto-generated `config.cfg` for training:
|
||||||
+ python -m spacy train ./config.cfg --output ./output
|
+ python -m spacy train ./config.cfg --output ./output
|
||||||
```
|
```
|
||||||
|
|
||||||
|
<!-- TODO:
|
||||||
|
|
||||||
<Project id="some_example_project">
|
<Project id="some_example_project">
|
||||||
|
|
||||||
The easiest way to get started with an end-to-end training process is to clone a
|
The easiest way to get started with an end-to-end training process is to clone a
|
||||||
|
@ -452,6 +454,8 @@ workflows, from data preprocessing to training and packaging your model.
|
||||||
|
|
||||||
</Project>
|
</Project>
|
||||||
|
|
||||||
|
-->
|
||||||
|
|
||||||
#### Training via the Python API {#migrating-training-python}
|
#### Training via the Python API {#migrating-training-python}
|
||||||
|
|
||||||
For most use cases, you **shouldn't** have to write your own training scripts
|
For most use cases, you **shouldn't** have to write your own training scripts
|
||||||
|
|
|
@ -396,7 +396,7 @@ body [id]:target
|
||||||
margin-right: -1.5em
|
margin-right: -1.5em
|
||||||
margin-left: -1.5em
|
margin-left: -1.5em
|
||||||
padding-right: 1.5em
|
padding-right: 1.5em
|
||||||
padding-left: 1.25em
|
padding-left: 1.2em
|
||||||
|
|
||||||
&:empty:before
|
&:empty:before
|
||||||
// Fix issue where empty lines would disappear
|
// Fix issue where empty lines would disappear
|
||||||
|
|