Update docs [ci skip]

Ines Montani 2020-08-05 20:29:53 +02:00
parent c675746ca2
commit 50311a4d37
5 changed files with 301 additions and 124 deletions


@@ -6,30 +6,44 @@ source: spacy/gold/corpus.py
new: 3
---
This class manages annotated corpora and can read training and development
datasets in the [DocBin](/api/docbin) (`.spacy`) format.
This class manages annotated corpora and can be used for training and
development datasets in the [DocBin](/api/docbin) (`.spacy`) format. To
customize the data loading during training, you can register your own
[data readers and batchers](/usage/training#custom-code-readers-batchers).
## Corpus.\_\_init\_\_ {#init tag="method"}
Create a `Corpus`. The input data can be a file or a directory of files.
Create a `Corpus` for iterating [Example](/api/example) objects from a file or
directory of [`.spacy` data files](/api/data-formats#binary-training). The
`gold_preproc` setting lets you specify whether to set up the `Example` object
with gold-standard sentences and tokens for the predictions. Gold preprocessing
helps the annotations align to the tokenization, and may result in sequences of
more consistent length. However, it may reduce runtime accuracy due to
train/test skew.
> #### Example
>
> ```python
> from spacy.gold import Corpus
>
> corpus = Corpus("./train.spacy", "./dev.spacy")
> # With a single file
> corpus = Corpus("./data/train.spacy")
>
> # With a directory
> corpus = Corpus("./data", limit=10)
> ```
| Name | Type | Description |
| ------- | ------------ | ---------------------------------------------------------------- |
| `train` | str / `Path` | Training data (`.spacy` file or directory of `.spacy` files). |
| `dev` | str / `Path` | Development data (`.spacy` file or directory of `.spacy` files). |
| `limit` | int | Maximum number of examples returned. `0` for no limit (default). |
| Name | Type | Description |
| --------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | str / `Path` | The directory or filename to read from. |
| _keyword-only_ | | |
| `gold_preproc` | bool | Whether to set up the `Example` object with gold-standard sentences and tokens for the predictions. Defaults to `False`. |
| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
## Corpus.train_dataset {#train_dataset tag="method"}
## Corpus.\_\_call\_\_ {#call tag="method"}
Yield examples from the training data.
Yield examples from the data.
> #### Example
>
@@ -37,60 +51,12 @@ Yield examples from the training data.
> from spacy.gold import Corpus
> import spacy
>
> corpus = Corpus("./train.spacy", "./dev.spacy")
> corpus = Corpus("./train.spacy")
> nlp = spacy.blank("en")
> train_data = corpus.train_dataset(nlp)
> train_data = corpus(nlp)
> ```
| Name | Type | Description |
| -------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `nlp` | `Language` | The current `nlp` object. |
| _keyword-only_ | | |
| `shuffle` | bool | Whether to shuffle the examples. Defaults to `True`. |
| `gold_preproc` | bool | Whether to train on gold-standard sentences and tokens. Defaults to `False`. |
| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. `0` for no limit (default).  |
| **YIELDS** | `Example` | The examples. |
## Corpus.dev_dataset {#dev_dataset tag="method"}
Yield examples from the development data.
> #### Example
>
> ```python
> from spacy.gold import Corpus
> import spacy
>
> corpus = Corpus("./train.spacy", "./dev.spacy")
> nlp = spacy.blank("en")
> dev_data = corpus.dev_dataset(nlp)
> ```
| Name | Type | Description |
| -------------- | ---------- | ---------------------------------------------------------------------------- |
| `nlp` | `Language` | The current `nlp` object. |
| _keyword-only_ | | |
| `gold_preproc` | bool | Whether to train on gold-standard sentences and tokens. Defaults to `False`. |
| **YIELDS** | `Example` | The examples. |
## Corpus.count_train {#count_train tag="method"}
Get the word count of all training examples.
> #### Example
>
> ```python
> from spacy.gold import Corpus
> import spacy
>
> corpus = Corpus("./train.spacy", "./dev.spacy")
> nlp = spacy.blank("en")
> word_count = corpus.count_train(nlp)
> ```
| Name | Type | Description |
| ----------- | ---------- | ------------------------- |
| `nlp` | `Language` | The current `nlp` object. |
| **RETURNS** | int | The word count. |
<!-- TODO: document remaining methods? / decide which to document -->
| Name | Type | Description |
| ---------- | ---------- | ------------------------- |
| `nlp` | `Language` | The current `nlp` object. |
| **YIELDS** | `Example` | The examples. |


@@ -4,7 +4,7 @@ menu:
- ['spacy', 'spacy']
- ['displacy', 'displacy']
- ['registry', 'registry']
- ['Loaders & Batchers', 'loaders-batchers']
- ['Readers & Batchers', 'readers-batchers']
- ['Data & Alignment', 'gold']
- ['Utility Functions', 'util']
---
@@ -303,6 +303,9 @@ factories.
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `assets` | |
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
| `readers` | Registry for training and evaluation [data readers](#readers-batchers). |
| `batchers` | Registry for training and evaluation [data batchers](#readers-batchers). |
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
@@ -334,10 +337,113 @@ See the [`Transformer`](/api/transformer) API reference and
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
## Training data loaders and batchers {#loaders-batchers new="3"}
## Data readers and batchers {#readers-batchers new="3"}
<!-- TODO: -->
### spacy.Corpus.v1 {#corpus tag="registered function" source="spacy/gold/corpus.py"}
Registered function that creates a [`Corpus`](/api/corpus) of training or
evaluation data. It takes the same arguments as the `Corpus` class and returns a
callable that yields [`Example`](/api/example) objects. You can replace it with
your own registered function in the [`@readers` registry](#registry) to
customize the data loading and streaming.
> #### Example config
>
> ```ini
> [paths]
> train = "corpus/train.spacy"
>
> [training.train_corpus]
> @readers = "spacy.Corpus.v1"
> path = ${paths:train}
> gold_preproc = false
> max_length = 0
> limit = 0
> ```
| Name | Type | Description |
| --------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | `Path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). |
| `gold_preproc` | bool | Whether to set up the `Example` object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. |
| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
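To plug in a fully custom reader, you can register a function in the `readers`
registry that returns a callable taking the current `nlp` object and yielding
[`Example`](/api/example) objects. The following is a minimal sketch, assuming
a plain-text file with one training text per line and no gold annotations. The
name `plain_text_reader.v1` and its `path` argument are hypothetical, just to
illustrate the plumbing:

```python
### functions.py
from typing import Callable, Iterable
import spacy
from spacy.gold import Example
from spacy.language import Language

@spacy.registry.readers("plain_text_reader.v1")
def create_reader(path: str) -> Callable[[Language], Iterable[Example]]:
    def read_examples(nlp: Language) -> Iterable[Example]:
        # Hypothetical format: one raw text per line, no gold annotations
        with open(path, encoding="utf8") as file_:
            for line in file_:
                doc = nlp.make_doc(line.strip())
                yield Example.from_dict(doc, {})

    return read_examples
```

You could then reference it in your config instead of `spacy.Corpus.v1`, e.g.
`@readers = "plain_text_reader.v1"` in the `[training.train_corpus]` block.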
### Batchers {#batchers source="spacy/gold/batchers.py"}
<!-- TODO: -->
#### batch_by_words.v1 {#batch_by_words tag="registered function"}
Create minibatches of roughly a given number of words. If any examples are
longer than the specified batch length, they will appear in a batch by
themselves, or be discarded if `discard_oversize` is set to `True`. The argument
`docs` can be a list of strings, [`Doc`](/api/doc) objects or
[`Example`](/api/example) objects.
> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "batch_by_words.v1"
> size = 100
> tolerance = 0.2
> discard_oversize = false
> get_length = null
> ```
<!-- TODO: complete table -->
| Name | Type | Description |
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `tolerance` | float | What percentage of the batch size to allow batches to exceed. |
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
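As a minimal sketch, assuming the registered name and the signature shown
above, you could also resolve and use the batcher directly in Python:

```python
import spacy

nlp = spacy.blank("en")
docs = [nlp.make_doc(text) for text in ["A short document.", "Another one.", "A third text."]]

# Resolve the registered factory and create a batcher with explicit settings
create_batcher = spacy.registry.batchers.get("batch_by_words.v1")
batcher = create_batcher(size=8, tolerance=0.2, discard_oversize=False, get_length=None)

for batch in batcher(docs):
    print([doc.text for doc in batch])
```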
#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
<!-- TODO: -->
> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "batch_by_sequence.v1"
> size = 32
> get_length = null
> ```
<!-- TODO: complete table -->
| Name | Type | Description |
| ------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
#### batch_by_padded.v1 {#batch_by_padded tag="registered function"}
<!-- TODO: -->
> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "batch_by_padded.v1"
> size = 100
> buffer = TODO:
> discard_oversize = false
> get_length = null
> ```
| Name | Type | Description |
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `buffer` | int | The number of sequences to accumulate before sorting by length. A larger buffer gives more even batch sizes, at the cost of a less random iteration order. |
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
## Training data and alignment {#gold source="spacy/gold"}
### gold.docs_to_json {#docs_to_json tag="function"}


@@ -5,8 +5,8 @@ menu:
- ['Introduction', 'basics']
- ['Quickstart', 'quickstart']
- ['Config System', 'config']
- ['Transfer Learning', 'transfer-learning']
- ['Custom Models', 'custom-models']
- ['Transfer Learning', 'transfer-learning']
- ['Parallel Training', 'parallel-training']
- ['Internal API', 'api']
---
@@ -315,6 +315,10 @@ stop = 1000
compound = 1.001
```
### Using variable interpolation {#config-interpolation}
<!-- TODO: describe and come up with good example showing both values and sections -->
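For example, values defined once in a section like `[paths]` can be reused
across the config by referencing them as `${section:key}`, the same syntax used
for `path = ${paths:train}` in the [`spacy.Corpus.v1`](/api/top-level#corpus)
example. A minimal sketch, assuming `train_corpus` and `dev_corpus` blocks like
the ones shown in the API docs:

```ini
### config.cfg (excerpt)
[paths]
train = "corpus/train.spacy"
dev = "corpus/dev.spacy"

[training.train_corpus]
@readers = "spacy.Corpus.v1"
path = ${paths:train}

[training.dev_corpus]
@readers = "spacy.Corpus.v1"
path = ${paths:dev}
```

This way, you only need to update the paths in one place.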
### Model architectures {#model-architectures}
<!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
@@ -384,41 +388,17 @@ still look good.
</Accordion>
## Transfer learning {#transfer-learning}
### Using transformer models like BERT {#transformers}
spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
can use models implemented in a variety of frameworks. A transformer model is
just a statistical model, so the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
actually has very little work to do: it just has to provide a few functions that
do the required plumbing. It also provides a pipeline component,
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
you save the transformer outputs for later use.
<Project id="en_core_bert">
Try out a BERT-based model pipeline using this project template: swap in your
data, edit the settings and hyperparameters and train, evaluate, package and
visualize your model.
</Project>
For more details on how to integrate transformer models into your training
config and customize the implementations, see the usage guide on
[training transformers](/usage/transformers#training).
### Pretraining with spaCy {#pretraining}
<!-- TODO: document spacy pretrain -->
## Custom model implementations and architectures {#custom-models}
<!-- TODO: intro, should summarise what spaCy v3 can do and that you can now use fully custom implementations, models defined in PyTorch and TF, etc. etc. -->
### Training with custom code {#custom-code}
> ```bash
> ### Example {wrap="true"}
> $ python -m spacy train train.spacy dev.spacy config.cfg --code functions.py
> ```
The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument
`--code` that points to a Python file. The file is imported before training and
allows you to add custom functions and architectures to the function registry
@@ -426,6 +406,120 @@ that can then be referenced from your `config.cfg`. This lets you train spaCy
models with custom components, without having to re-implement the whole training
workflow.
#### Example: Modifying the nlp object {#custom-code-nlp-callbacks}
For many use cases, you don't necessarily want to implement the whole `Language`
subclass and language data from scratch. It's often enough to make a few small
modifications, like adjusting the
[tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or
[language defaults](/api/language#defaults) like stop words. The config lets you
provide three optional **callback functions** that give you access to the
language class and `nlp` object at different points of the lifecycle:
| Callback | Description |
| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `before_creation` | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to the [`Language.Defaults`](/api/language#defaults). |
| `after_creation` | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object. Useful for modifying the tokenizer. |
| `after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components. |
The `@spacy.registry.callbacks` decorator lets you register that function in the
`callbacks` [registry](/api/top-level#registry) under a given name. You can then
reference the function in a config block using the `@callbacks` key. If a block
contains a key starting with an `@`, it's interpreted as a reference to a
function. Because you've registered the function, spaCy knows how to create it
when you reference `"customize_language_data"` in your config. Here's an example
of a callback that runs before the `nlp` object is created and adds a few custom
tokenization rules to the defaults:
> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> ```
```python
### functions.py {highlight="3,6"}
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback():
    def customize_language_data(lang_cls):
        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
        return lang_cls

    return customize_language_data
```
<Infobox variant="warning">
Remember that a registered function should always be a function that spaCy
**calls to create something**. In this case, it **creates a callback**; it's
not the callback itself.
</Infobox>
Any registered function, in this case `create_callback`, can also take
**arguments** that can be **set by the config**. This lets you implement and
keep track of different configurations, without having to hack at your code. You
can choose any arguments that make sense for your use case. In this example,
we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
(boolean) for printing additional info when the function runs.
> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> extra_stop_words = ["ooh", "aah"]
> debug = true
> ```
```python
### functions.py {highlight="5,8-10"}
from typing import List
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback(extra_stop_words: List[str] = [], debug: bool = False):
    def customize_language_data(lang_cls):
        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
        lang_cls.Defaults.stop_words.update(extra_stop_words)
        if debug:
            print("Updated stop words and tokenizer suffixes")
        return lang_cls

    return customize_language_data
```
<Infobox title="Tip: Use Python type hints" emoji="💡">
spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
[type hints](https://docs.python.org/3/library/typing.html) and even
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `debug: bool` in the example above will
ensure that the value received as the argument `debug` is a boolean. If the
value can't be coerced into a boolean, spaCy will raise an error.
`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
error if it's not, for instance if your config defines `1` instead of `true`.
</Infobox>
With your `functions.py` defining additional code and the updated `config.cfg`,
you can now run [`spacy train`](/api/cli#train) and point the argument `--code`
to your Python file. Before loading the config, spaCy will import the
`functions.py` module and your custom functions will be registered.
```bash
### Training with custom code {wrap="true"}
python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
```
#### Example: Custom batch size schedule {#custom-code-schedule}
For example, let's say you've implemented your own batch size schedule to use
during training. The `@spacy.registry.schedules` decorator lets you register
that function in the `schedules` [registry](/api/top-level#registry) and assign
@@ -459,8 +553,6 @@ the functions need to be represented in the config. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
[`init config`](/api/cli#init-config).
<!-- TODO: this needs to be updated once we've decided on a workflow for "fill config" -->
```ini
### config.cfg (excerpt)
[training.batch_size]
@@ -469,31 +561,9 @@ start = 2
factor = 1.005
```
You can now run [`spacy train`](/api/cli#train) with the `config.cfg` and your
custom `functions.py` as the argument `--code`. Before loading the config, spaCy
will import the `functions.py` module and your custom functions will be
registered.
#### Example: Custom data reading and batching {#custom-code-readers-batchers}
```bash
### Training with custom code {wrap="true"}
python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
```
<Infobox title="Tip: Use Python type hints" emoji="💡">
spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
[type hints](https://docs.python.org/3/library/typing.html) and even
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `start: int` in the example above will
ensure that the value received as the argument `start` is an integer. If the
value can't be coerced into an integer, spaCy will raise an error.
`start: pydantic.StrictInt` will force the value to be an integer and raise an
error if it's not, for instance if your config defines a float.
</Infobox>
<!-- TODO: -->
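As a minimal sketch of the plumbing, a custom batcher can be registered in the
`batchers` [registry](/api/top-level#registry) and referenced from the config
via the `@batchers` key. The name `fixed_size_batches.v1` and its simple logic
are hypothetical, just to illustrate the pattern:

```python
### functions.py
from typing import Any, Callable, Iterable, Iterator, List
import spacy

@spacy.registry.batchers("fixed_size_batches.v1")
def create_batcher(size: int) -> Callable[[Iterable[Any]], Iterator[List[Any]]]:
    def batch_items(items: Iterable[Any]) -> Iterator[List[Any]]:
        batch = []
        for item in items:
            batch.append(item)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:  # don't silently drop the final partial batch
            yield batch

    return batch_items
```

You could then reference it in your config as
`@batchers = "fixed_size_batches.v1"` in the `[training.batcher]` block.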
### Wrapping PyTorch and TensorFlow {#custom-frameworks}
@@ -511,6 +581,35 @@ mattis pretium.
<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
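For instance, Thinc provides a
[`PyTorchWrapper`](https://thinc.ai/docs/usage-frameworks) that turns a PyTorch
module into a Thinc `Model`, which can then be composed with other layers and
used in a model architecture. A minimal sketch, not tied to any particular
spaCy component:

```python
from thinc.api import PyTorchWrapper
import torch.nn

# Wrap a PyTorch module so it exposes Thinc's Model API and can be
# combined with other Thinc layers
wrapped = PyTorchWrapper(torch.nn.Linear(16, 8))
```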
## Transfer learning {#transfer-learning}
### Using transformer models like BERT {#transformers}
spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
can use models implemented in a variety of frameworks. A transformer model is
just a statistical model, so the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
actually has very little work to do: it just has to provide a few functions that
do the required plumbing. It also provides a pipeline component,
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
you save the transformer outputs for later use.
<Project id="en_core_bert">
Try out a BERT-based model pipeline using this project template: swap in your
data, edit the settings and hyperparameters and train, evaluate, package and
visualize your model.
</Project>
For more details on how to integrate transformer models into your training
config and customize the implementations, see the usage guide on
[training transformers](/usage/transformers#training).
### Pretraining with spaCy {#pretraining}
<!-- TODO: document spacy pretrain -->
## Parallel Training with Ray {#parallel-training}
<!-- TODO: document Ray integration -->


@@ -24,10 +24,16 @@ $border-radius: 6px
&:last-child
margin: 0
&:first-child h4
margin-top: 0 !important
code
padding: 0
margin: 0
h4
margin-left: 0
p, ul, ol
font: inherit
margin-bottom: var(--spacing-sm)


@@ -373,7 +373,7 @@ body [id]:target
margin-right: -1.5em
margin-left: -1.5em
padding-right: 1.5em
padding-left: 1.65em
padding-left: 1.1em
&:empty:before
// Fix issue where empty lines would disappear