mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 09:14:32 +03:00
Update docs [ci skip]
parent
c675746ca2
commit
50311a4d37
|
@ -6,30 +6,44 @@ source: spacy/gold/corpus.py
|
||||||
new: 3
|
new: 3
|
||||||
---
|
---
|
||||||
|
|
||||||
This class manages annotated corpora and can read training and development
|
This class manages annotated corpora and can be used for training and
|
||||||
datasets in the [DocBin](/api/docbin) (`.spacy`) format.
|
development datasets in the [DocBin](/api/docbin) (`.spacy`) format. To
|
||||||
|
customize the data loading during training, you can register your own
|
||||||
|
[data readers and batchers](/usage/training#custom-code-readers-batchers).
|
||||||
|
|
||||||
## Corpus.\_\_init\_\_ {#init tag="method"}
|
## Corpus.\_\_init\_\_ {#init tag="method"}
|
||||||
|
|
||||||
Create a `Corpus`. The input data can be a file or a directory of files.
|
Create a `Corpus` for iterating [Example](/api/example) objects from a file or
|
||||||
|
directory of [`.spacy` data files](/api/data-formats#binary-training). The
|
||||||
|
`gold_preproc` setting lets you specify whether to set up the `Example` object
|
||||||
|
with gold-standard sentences and tokens for the predictions. Gold preprocessing
|
||||||
|
helps the annotations align to the tokenization, and may result in sequences of
|
||||||
|
more consistent length. However, it may reduce runtime accuracy due to
|
||||||
|
train/test skew.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> from spacy.gold import Corpus
|
> from spacy.gold import Corpus
|
||||||
>
|
>
|
||||||
> corpus = Corpus("./train.spacy", "./dev.spacy")
|
> # With a single file
|
||||||
|
> corpus = Corpus("./data/train.spacy")
|
||||||
|
>
|
||||||
|
> # With a directory
|
||||||
|
> corpus = Corpus("./data", limit=10)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------- | ------------ | ---------------------------------------------------------------- |
|
| --------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `train` | str / `Path` | Training data (`.spacy` file or directory of `.spacy` files). |
|
| `path` | str / `Path` | The directory or filename to read from. |
|
||||||
| `dev` | str / `Path` | Development data (`.spacy` file or directory of `.spacy` files). |
|
| _keyword-only_ | | |
|
||||||
| `limit` | int | Maximum number of examples returned. `0` for no limit (default). |
|
| `gold_preproc` | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. |
|
||||||
|
| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
|
||||||
|
| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
|
||||||
|
|
||||||
## Corpus.train_dataset {#train_dataset tag="method"}
|
## Corpus.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
Yield examples from the training data.
|
Yield examples from the data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -37,60 +51,12 @@ Yield examples from the training data.
|
||||||
> from spacy.gold import Corpus
|
> from spacy.gold import Corpus
|
||||||
> import spacy
|
> import spacy
|
||||||
>
|
>
|
||||||
> corpus = Corpus("./train.spacy", "./dev.spacy")
|
> corpus = Corpus("./train.spacy")
|
||||||
> nlp = spacy.blank("en")
|
> nlp = spacy.blank("en")
|
||||||
> train_data = corpus.train_dataset(nlp)
|
> train_data = corpus(nlp)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| -------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
| ---------- | ---------- | ------------------------- |
|
||||||
| `nlp` | `Language` | The current `nlp` object. |
|
| `nlp` | `Language` | The current `nlp` object. |
|
||||||
| _keyword-only_ | | |
|
| **YIELDS** | `Example` | The examples. |
|
||||||
| `shuffle` | bool | Whether to shuffle the examples. Defaults to `True`. |
|
|
||||||
| `gold_preproc` | bool | Whether to train on gold-standard sentences and tokens. Defaults to `False`. |
|
|
||||||
| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. `0` for no limit (default). |
|
|
||||||
| **YIELDS** | `Example` | The examples. |
|
|
||||||
|
|
||||||
## Corpus.dev_dataset {#dev_dataset tag="method"}
|
|
||||||
|
|
||||||
Yield examples from the development data.
|
|
||||||
|
|
||||||
> #### Example
|
|
||||||
>
|
|
||||||
> ```python
|
|
||||||
> from spacy.gold import Corpus
|
|
||||||
> import spacy
|
|
||||||
>
|
|
||||||
> corpus = Corpus("./train.spacy", "./dev.spacy")
|
|
||||||
> nlp = spacy.blank("en")
|
|
||||||
> dev_data = corpus.dev_dataset(nlp)
|
|
||||||
> ```
|
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| -------------- | ---------- | ---------------------------------------------------------------------------- |
|
|
||||||
| `nlp` | `Language` | The current `nlp` object. |
|
|
||||||
| _keyword-only_ | | |
|
|
||||||
| `gold_preproc` | bool | Whether to train on gold-standard sentences and tokens. Defaults to `False`. |
|
|
||||||
| **YIELDS** | `Example` | The examples. |
|
|
||||||
|
|
||||||
## Corpus.count_train {#count_train tag="method"}
|
|
||||||
|
|
||||||
Get the word count of all training examples.
|
|
||||||
|
|
||||||
> #### Example
|
|
||||||
>
|
|
||||||
> ```python
|
|
||||||
> from spacy.gold import Corpus
|
|
||||||
> import spacy
|
|
||||||
>
|
|
||||||
> corpus = Corpus("./train.spacy", "./dev.spacy")
|
|
||||||
> nlp = spacy.blank("en")
|
|
||||||
> word_count = corpus.count_train(nlp)
|
|
||||||
> ```
|
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| ----------- | ---------- | ------------------------- |
|
|
||||||
| `nlp` | `Language` | The current `nlp` object. |
|
|
||||||
| **RETURNS** | int | The word count. |
|
|
||||||
|
|
||||||
<!-- TODO: document remaining methods? / decide which to document -->
|
|
||||||
|
|
|
@ -4,7 +4,7 @@ menu:
|
||||||
- ['spacy', 'spacy']
|
- ['spacy', 'spacy']
|
||||||
- ['displacy', 'displacy']
|
- ['displacy', 'displacy']
|
||||||
- ['registry', 'registry']
|
- ['registry', 'registry']
|
||||||
- ['Loaders & Batchers', 'loaders-batchers']
|
- ['Readers & Batchers', 'readers-batchers']
|
||||||
- ['Data & Alignment', 'gold']
|
- ['Data & Alignment', 'gold']
|
||||||
- ['Utility Functions', 'util']
|
- ['Utility Functions', 'util']
|
||||||
---
|
---
|
||||||
|
@ -303,6 +303,9 @@ factories.
|
||||||
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
||||||
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||||
| `assets` | |
|
| `assets` | |
|
||||||
|
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
|
||||||
|
| `readers` | Registry for training and evaluation [data readers](#readers-batchers). |
|
||||||
|
| `batchers` | Registry for training and evaluation [data batchers](#readers-batchers). |
|
||||||
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
|
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
|
||||||
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
|
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
|
||||||
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
|
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
|
||||||
|
@ -334,10 +337,113 @@ See the [`Transformer`](/api/transformer) API reference and
|
||||||
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
|
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
|
||||||
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
|
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
|
||||||
|
|
||||||
## Training data loaders and batchers {#loaders-batchers new="3"}
|
## Data readers and batchers {#readers-batchers new="3"}
|
||||||
|
|
||||||
<!-- TODO: -->
|
<!-- TODO: -->
|
||||||
|
|
||||||
|
### spacy.Corpus.v1 {#corpus tag="registered function" source="spacy/gold/corpus.py"}
|
||||||
|
|
||||||
|
Registered function that creates a [`Corpus`](/api/corpus) of training or
|
||||||
|
evaluation data. It takes the same arguments as the `Corpus` class and returns a
|
||||||
|
callable that yields [`Example`](/api/example) objects. You can replace it with
|
||||||
|
your own registered function in the [`@readers` registry](#registry) to
|
||||||
|
customize the data loading and streaming.
|
||||||
|
|
||||||
|
> #### Example config
|
||||||
|
>
|
||||||
|
> ```ini
|
||||||
|
> [paths]
|
||||||
|
> train = "corpus/train.spacy"
|
||||||
|
>
|
||||||
|
> [training.train_corpus]
|
||||||
|
> @readers = "spacy.Corpus.v1"
|
||||||
|
> path = ${paths:train}
|
||||||
|
> gold_preproc = false
|
||||||
|
> max_length = 0
|
||||||
|
> limit = 0
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Type | Description |
|
||||||
|
| --------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
|
| `path` | `Path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). |
|
||||||
|
| `gold_preproc` | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. |
|
||||||
|
| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
|
||||||
|
| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
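The `${paths:train}` reference in the example config above uses variable interpolation. As a rough illustration only, the standard library's `configparser` with `ExtendedInterpolation` resolves the same `${section:key}` syntax. The `json.loads` step is an assumption that stands in for the way values are parsed into Python objects; this sketch is not spaCy's or Thinc's actual loading code.

```python
import json
from configparser import ConfigParser, ExtendedInterpolation

# Excerpt modelled on the example config above
CONFIG = """
[paths]
train = "corpus/train.spacy"

[training.train_corpus]
path = ${paths:train}
limit = 0
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(CONFIG)

# ${paths:train} is replaced with the raw value of "train" in [paths]
raw_path = parser["training.train_corpus"]["path"]

# Assumption: values are JSON-like, so they are parsed after interpolation
path = json.loads(raw_path)
limit = json.loads(parser["training.train_corpus"]["limit"])
print(path, limit)
```

Keeping paths in one `[paths]` section means a reader block like `[training.train_corpus]` never has to hard-code a location.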
|
||||||
|
|
||||||
|
### Batchers {#batchers source="spacy/gold/batchers.py"}
|
||||||
|
|
||||||
|
<!-- TODO: -->
|
||||||
|
|
||||||
|
#### batch_by_words.v1 {#batch_by_words tag="registered function"}
|
||||||
|
|
||||||
|
Create minibatches of roughly a given number of words. If any examples are
|
||||||
|
longer than the specified batch length, they will appear in a batch by
|
||||||
|
themselves, or be discarded if `discard_oversize` is set to `True`. The argument
|
||||||
|
`docs` can be a list of strings, [`Doc`](/api/doc) objects or
|
||||||
|
[`Example`](/api/example) objects.
|
||||||
|
|
||||||
|
> #### Example config
|
||||||
|
>
|
||||||
|
> ```ini
|
||||||
|
> [training.batcher]
|
||||||
|
> @batchers = "batch_by_words.v1"
|
||||||
|
> size = 100
|
||||||
|
> tolerance = 0.2
|
||||||
|
> discard_oversize = false
|
||||||
|
> get_length = null
|
||||||
|
> ```
|
||||||
|
|
||||||
|
<!-- TODO: complete table -->
|
||||||
|
|
||||||
|
| Name | Type | Description |
|
||||||
|
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
|
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
|
||||||
|
| `tolerance` | float | |
|
||||||
|
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
|
||||||
|
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
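The batching behavior described above can be sketched in plain Python. This is a simplified illustration, not spaCy's implementation: the function name `batch_by_words_sketch` is hypothetical, and because the `tolerance` row in the table is still empty, the sketch assumes that `tolerance` is the fraction by which a batch may exceed `size`.

```python
from typing import Any, Callable, Iterable, Iterator, List


def batch_by_words_sketch(
    items: Iterable[Any],
    size: int,
    tolerance: float = 0.2,
    discard_oversize: bool = False,
    get_length: Callable[[Any], int] = len,
) -> Iterator[List[Any]]:
    # Assumption: a batch may hold up to size * (1 + tolerance) words
    max_words = size * (1 + tolerance)
    batch: List[Any] = []
    n_words = 0
    for item in items:
        length = get_length(item)
        if length > max_words:
            # Oversized item: drop it, or emit it in a batch by itself
            if discard_oversize:
                continue
            if batch:
                yield batch
                batch, n_words = [], 0
            yield [item]
            continue
        if batch and n_words + length > max_words:
            # The current batch is full, so start a new one
            yield batch
            batch, n_words = [], 0
        batch.append(item)
        n_words += length
    if batch:
        yield batch


# Token lists stand in for Doc or Example objects
docs = [["a"] * 3, ["b"] * 4, ["c"] * 20, ["d"] * 5]
kept = [len(b) for b in batch_by_words_sketch(docs, size=8, tolerance=0.25)]
dropped = [
    len(b)
    for b in batch_by_words_sketch(docs, size=8, tolerance=0.25, discard_oversize=True)
]
```

With `discard_oversize` left at `False`, the 20-word item ends up in a batch of its own; with `True`, it is skipped entirely.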
|
||||||
|
|
||||||
|
#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
|
||||||
|
|
||||||
|
<!-- TODO: -->
|
||||||
|
|
||||||
|
> #### Example config
|
||||||
|
>
|
||||||
|
> ```ini
|
||||||
|
> [training.batcher]
|
||||||
|
> @batchers = "batch_by_sequence.v1"
|
||||||
|
> size = 32
|
||||||
|
> get_length = null
|
||||||
|
> ```
|
||||||
|
|
||||||
|
<!-- TODO: complete table -->
|
||||||
|
|
||||||
|
| Name | Type | Description |
|
||||||
|
| ------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
|
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
|
||||||
|
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
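Since this section is still marked as a TODO, the following is only a guess at the general idea: group a fixed number of items per batch, where `size` is either a constant or a schedule. The names `compounding_sketch` and `batch_by_sequence_sketch` are hypothetical, and the schedule assumes an increasing series that is capped at `stop`.

```python
from itertools import islice, repeat
from typing import Any, Iterable, Iterator, List, Union


def compounding_sketch(start: float, stop: float, compound: float) -> Iterator[float]:
    # Assumption: increasing schedule, multiplied each step and capped at stop
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def batch_by_sequence_sketch(
    items: Iterable[Any], size: Union[int, Iterable[float]]
) -> Iterator[List[Any]]:
    # A plain int means the same batch size for every batch
    sizes = repeat(size) if isinstance(size, int) else iter(size)
    stream = iter(items)
    for batch_size in sizes:
        batch = list(islice(stream, int(batch_size)))
        if not batch:
            return
        yield batch


fixed = list(batch_by_sequence_sketch(range(10), 4))
scheduled = list(batch_by_sequence_sketch(range(10), compounding_sketch(1.0, 4.0, 2.0)))
```

Passing a schedule lets training start with small batches and grow them over time, which is the pattern the `size` description above refers to.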
|
||||||
|
|
||||||
|
#### batch_by_padded.v1 {#batch_by_padded tag="registered function"}
|
||||||
|
|
||||||
|
<!-- TODO: -->
|
||||||
|
|
||||||
|
> #### Example config
|
||||||
|
>
|
||||||
|
> ```ini
|
||||||
|
> [training.batcher]
|
||||||
|
> @batchers = "batch_by_padded.v1"
|
||||||
|
> size = 100
|
||||||
|
> buffer = TODO:
|
||||||
|
> discard_oversize = false
|
||||||
|
> get_length = null
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Type | Description |
|
||||||
|
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
|
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
|
||||||
|
| `buffer` | int | |
|
||||||
|
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
|
||||||
|
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
|
||||||
|
|
||||||
## Training data and alignment {#gold source="spacy/gold"}
|
## Training data and alignment {#gold source="spacy/gold"}
|
||||||
|
|
||||||
### gold.docs_to_json {#docs_to_json tag="function"}
|
### gold.docs_to_json {#docs_to_json tag="function"}
|
||||||
|
|
|
@ -5,8 +5,8 @@ menu:
|
||||||
- ['Introduction', 'basics']
|
- ['Introduction', 'basics']
|
||||||
- ['Quickstart', 'quickstart']
|
- ['Quickstart', 'quickstart']
|
||||||
- ['Config System', 'config']
|
- ['Config System', 'config']
|
||||||
- ['Transfer Learning', 'transfer-learning']
|
|
||||||
- ['Custom Models', 'custom-models']
|
- ['Custom Models', 'custom-models']
|
||||||
|
- ['Transfer Learning', 'transfer-learning']
|
||||||
- ['Parallel Training', 'parallel-training']
|
- ['Parallel Training', 'parallel-training']
|
||||||
- ['Internal API', 'api']
|
- ['Internal API', 'api']
|
||||||
---
|
---
|
||||||
|
@ -315,6 +315,10 @@ stop = 1000
|
||||||
compound = 1.001
|
compound = 1.001
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Using variable interpolation {#config-interpolation}
|
||||||
|
|
||||||
|
<!-- TODO: describe and come up with good example showing both values and sections -->
|
||||||
|
|
||||||
### Model architectures {#model-architectures}
|
### Model architectures {#model-architectures}
|
||||||
|
|
||||||
<!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
|
<!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
|
||||||
|
@ -384,41 +388,17 @@ still look good.
|
||||||
|
|
||||||
</Accordion>
|
</Accordion>
|
||||||
|
|
||||||
## Transfer learning {#transfer-learning}
|
|
||||||
|
|
||||||
### Using transformer models like BERT {#transformers}
|
|
||||||
|
|
||||||
spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
|
|
||||||
can use models implemented in a variety of frameworks. A transformer model is
|
|
||||||
just a statistical model, so the
|
|
||||||
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
|
|
||||||
actually has very little work to do: it just has to provide a few functions that
|
|
||||||
do the required plumbing. It also provides a pipeline component,
|
|
||||||
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
|
|
||||||
you save the transformer outputs for later use.
|
|
||||||
|
|
||||||
<Project id="en_core_bert">
|
|
||||||
|
|
||||||
Try out a BERT-based model pipeline using this project template: swap in your
|
|
||||||
data, edit the settings and hyperparameters and train, evaluate, package and
|
|
||||||
visualize your model.
|
|
||||||
|
|
||||||
</Project>
|
|
||||||
|
|
||||||
For more details on how to integrate transformer models into your training
|
|
||||||
config and customize the implementations, see the usage guide on
|
|
||||||
[training transformers](/usage/transformers#training).
|
|
||||||
|
|
||||||
### Pretraining with spaCy {#pretraining}
|
|
||||||
|
|
||||||
<!-- TODO: document spacy pretrain -->
|
|
||||||
|
|
||||||
## Custom model implementations and architectures {#custom-models}
|
## Custom model implementations and architectures {#custom-models}
|
||||||
|
|
||||||
<!-- TODO: intro, should summarise what spaCy v3 can do and that you can now use fully custom implementations, models defined in PyTorch and TF, etc. etc. -->
|
<!-- TODO: intro, should summarise what spaCy v3 can do and that you can now use fully custom implementations, models defined in PyTorch and TF, etc. etc. -->
|
||||||
|
|
||||||
### Training with custom code {#custom-code}
|
### Training with custom code {#custom-code}
|
||||||
|
|
||||||
|
> ```bash
|
||||||
|
> ### Example {wrap="true"}
|
||||||
|
> $ python -m spacy train train.spacy dev.spacy config.cfg --code functions.py
|
||||||
|
> ```
|
||||||
|
|
||||||
The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument
|
The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument
|
||||||
`--code` that points to a Python file. The file is imported before training and
|
`--code` that points to a Python file. The file is imported before training and
|
||||||
allows you to add custom functions and architectures to the function registry
|
allows you to add custom functions and architectures to the function registry
|
||||||
|
@ -426,6 +406,120 @@ that can then be referenced from your `config.cfg`. This lets you train spaCy
|
||||||
models with custom components, without having to re-implement the whole training
|
models with custom components, without having to re-implement the whole training
|
||||||
workflow.
|
workflow.
|
||||||
|
|
||||||
|
#### Example: Modifying the nlp object {#custom-code-nlp-callbacks}
|
||||||
|
|
||||||
|
For many use cases, you don't necessarily want to implement the whole `Language`
|
||||||
|
subclass and language data from scratch – it's often enough to make a few small
|
||||||
|
modifications, like adjusting the
|
||||||
|
[tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or
|
||||||
|
[language defaults](/api/language#defaults) like stop words. The config lets you
|
||||||
|
provide three optional **callback functions** that give you access to the
|
||||||
|
language class and `nlp` object at different points of the lifecycle:
|
||||||
|
|
||||||
|
| Callback | Description |
|
||||||
|
| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
|
| `before_creation` | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to the [`Language.Defaults`](/api/language#defaults). |
|
||||||
|
| `after_creation`          | Called right after the `nlp` object is created, but before the pipeline components are added. Receives the `nlp` object. Useful for modifying the tokenizer. |
|
||||||
|
| `after_pipeline_creation` | Called right after the pipeline components are created and added. Receives the `nlp` object. Useful for modifying pipeline components. |
|
||||||
|
|
||||||
|
The `@spacy.registry.callbacks` decorator lets you register that function in the
|
||||||
|
`callbacks` [registry](/api/top-level#registry) under a given name. You can then
|
||||||
|
reference the function in a config block using the `@callbacks` key. If a block
|
||||||
|
contains a key starting with an `@`, it's interpreted as a reference to a
|
||||||
|
function. Because you've registered the function, spaCy knows how to create it
|
||||||
|
when you reference `"customize_language_data"` in your config. Here's an example
|
||||||
|
of a callback that runs before the `nlp` object is created and adds a few custom
|
||||||
|
tokenization rules to the defaults:
|
||||||
|
|
||||||
|
> #### config.cfg
|
||||||
|
>
|
||||||
|
> ```ini
|
||||||
|
> [nlp.before_creation]
|
||||||
|
> @callbacks = "customize_language_data"
|
||||||
|
> ```
|
||||||
|
|
||||||
|
```python
|
||||||
|
### functions.py {highlight="3,6"}
|
||||||
|
import spacy
|
||||||
|
|
||||||
|
@spacy.registry.callbacks("customize_language_data")
|
||||||
|
def create_callback():
|
||||||
|
    def customize_language_data(lang_cls):
|
||||||
|
        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
|
||||||
|
        return lang_cls
|
||||||
|
|
||||||
|
    return customize_language_data
|
||||||
|
```
|
||||||
|
|
||||||
|
<Infobox variant="warning">
|
||||||
|
|
||||||
|
Remember that a registered function should always be a function that spaCy
|
||||||
|
**calls to create something**. In this case, it **creates a callback** – it's
|
||||||
|
not the callback itself.
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
|
Any registered function – in this case `create_callback` – can also take
|
||||||
|
**arguments** that can be **set by the config**. This lets you implement and
|
||||||
|
keep track of different configurations, without having to hack at your code. You
|
||||||
|
can choose any arguments that make sense for your use case. In this example,
|
||||||
|
we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
|
||||||
|
(boolean) for printing additional info when the function runs.
|
||||||
|
|
||||||
|
> #### config.cfg
|
||||||
|
>
|
||||||
|
> ```ini
|
||||||
|
> [nlp.before_creation]
|
||||||
|
> @callbacks = "customize_language_data"
|
||||||
|
> extra_stop_words = ["ooh", "aah"]
|
||||||
|
> debug = true
|
||||||
|
> ```
|
||||||
|
|
||||||
|
```python
|
||||||
|
### functions.py {highlight="5,8-10"}
|
||||||
|
from typing import List
|
||||||
|
import spacy
|
||||||
|
|
||||||
|
@spacy.registry.callbacks("customize_language_data")
|
||||||
|
def create_callback(extra_stop_words: List[str] = [], debug: bool = False):
|
||||||
|
    def customize_language_data(lang_cls):
|
||||||
|
        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
|
||||||
|
        lang_cls.Defaults.stop_words.update(extra_stop_words)
|
||||||
|
        if debug:
|
||||||
|
            print("Updated stop words and tokenizer suffixes")
|
||||||
|
        return lang_cls
|
||||||
|
|
||||||
|
    return customize_language_data
|
||||||
|
```
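To make the "registered function creates the callback" idea concrete, here is a self-contained sketch that does not use spaCy at all. The `CALLBACKS` dict, `register_callback`, `resolve_block`, `FakeLanguage` and `FakeDefaults` are hypothetical stand-ins invented for illustration. They only mimic how a config block with an `@callbacks` key and extra settings could be turned into a call to the registered function.

```python
from typing import Any, Callable, Dict, Iterable

# Hypothetical stand-in for the callbacks registry
CALLBACKS: Dict[str, Callable[..., Callable]] = {}


def register_callback(name: str) -> Callable:
    def decorator(factory: Callable[..., Callable]) -> Callable[..., Callable]:
        CALLBACKS[name] = factory
        return factory

    return decorator


class FakeDefaults:
    stop_words = {"the", "a"}


class FakeLanguage:
    # Minimal stand-in for a language class with Defaults
    Defaults = FakeDefaults


@register_callback("customize_language_data")
def create_callback(extra_stop_words: Iterable[str] = (), debug: bool = False):
    # The registered function receives the settings and returns the callback
    def customize_language_data(lang_cls):
        lang_cls.Defaults.stop_words.update(extra_stop_words)
        if debug:
            print("Updated stop words")
        return lang_cls

    return customize_language_data


def resolve_block(block: Dict[str, Any]) -> Callable:
    # The "@callbacks" key names the function, all other keys become arguments
    args = dict(block)
    factory = CALLBACKS[args.pop("@callbacks")]
    return factory(**args)


block = {
    "@callbacks": "customize_language_data",
    "extra_stop_words": ["ooh", "aah"],
    "debug": True,
}
callback = resolve_block(block)
lang_cls = callback(FakeLanguage)
```

The two-step structure is the point: the config supplies arguments to `create_callback`, and only the function it returns is later called with the language class.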
|
||||||
|
|
||||||
|
<Infobox title="Tip: Use Python type hints" emoji="💡">
|
||||||
|
|
||||||
|
spaCy's configs are powered by our machine learning library Thinc's
|
||||||
|
[configuration system](https://thinc.ai/docs/usage-config), which supports
|
||||||
|
[type hints](https://docs.python.org/3/library/typing.html) and even
|
||||||
|
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
|
||||||
|
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
|
||||||
|
function provides type hints, the values that are passed in will be checked
|
||||||
|
against the expected types. For example, `debug: bool` in the example above will
|
||||||
|
ensure that the value received as the argument `debug` is a boolean. If the
|
||||||
|
value can't be coerced into a boolean, spaCy will raise an error.
|
||||||
|
`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
|
||||||
|
error if it's not – for instance, if your config defines `1` instead of `true`.
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
|
With your `functions.py` defining additional code and the updated `config.cfg`,
|
||||||
|
you can now run [`spacy train`](/api/cli#train) and point the argument `--code`
|
||||||
|
to your Python file. Before loading the config, spaCy will import the
|
||||||
|
`functions.py` module and your custom functions will be registered.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
### Training with custom code {wrap="true"}
|
||||||
|
python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Example: Custom batch size schedule {#custom-code-schedule}
|
||||||
|
|
||||||
For example, let's say you've implemented your own batch size schedule to use
|
For example, let's say you've implemented your own batch size schedule to use
|
||||||
during training. The `@spacy.registry.schedules` decorator lets you register
|
during training. The `@spacy.registry.schedules` decorator lets you register
|
||||||
that function in the `schedules` [registry](/api/top-level#registry) and assign
|
that function in the `schedules` [registry](/api/top-level#registry) and assign
|
||||||
|
@ -459,8 +553,6 @@ the functions need to be represented in the config. If your function defines
|
||||||
**default argument values**, spaCy is able to auto-fill your config when you run
|
**default argument values**, spaCy is able to auto-fill your config when you run
|
||||||
[`init config`](/api/cli#init-config).
|
[`init config`](/api/cli#init-config).
|
||||||
|
|
||||||
<!-- TODO: this needs to be updated once we've decided on a workflow for "fill config" -->
|
|
||||||
|
|
||||||
```ini
|
```ini
|
||||||
### config.cfg (excerpt)
|
### config.cfg (excerpt)
|
||||||
[training.batch_size]
|
[training.batch_size]
|
||||||
|
@ -469,31 +561,9 @@ start = 2
|
||||||
factor = 1.005
|
factor = 1.005
|
||||||
```
|
```
|
||||||
|
|
||||||
You can now run [`spacy train`](/api/cli#train) with the `config.cfg` and your
|
#### Example: Custom data reading and batching {#custom-code-readers-batchers}
|
||||||
custom `functions.py` as the argument `--code`. Before loading the config, spaCy
|
|
||||||
will import the `functions.py` module and your custom functions will be
|
|
||||||
registered.
|
|
||||||
|
|
||||||
```bash
|
<!-- TODO: -->
|
||||||
### Training with custom code {wrap="true"}
|
|
||||||
python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
|
|
||||||
```
|
|
||||||
|
|
||||||
<Infobox title="Tip: Use Python type hints" emoji="💡">
|
|
||||||
|
|
||||||
spaCy's configs are powered by our machine learning library Thinc's
|
|
||||||
[configuration system](https://thinc.ai/docs/usage-config), which supports
|
|
||||||
[type hints](https://docs.python.org/3/library/typing.html) and even
|
|
||||||
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
|
|
||||||
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
|
|
||||||
function provides type hints, the values that are passed in will be checked
|
|
||||||
against the expected types. For example, `start: int` in the example above will
|
|
||||||
ensure that the value received as the argument `start` is an integer. If the
|
|
||||||
value can't be coerced into an integer, spaCy will raise an error.
|
|
||||||
`start: pydantic.StrictInt` will force the value to be an integer and raise an
|
|
||||||
error if it's not – for instance, if your config defines a float.
|
|
||||||
|
|
||||||
</Infobox>
|
|
||||||
|
|
||||||
### Wrapping PyTorch and TensorFlow {#custom-frameworks}
|
### Wrapping PyTorch and TensorFlow {#custom-frameworks}
|
||||||
|
|
||||||
|
@ -511,6 +581,35 @@ mattis pretium.
|
||||||
|
|
||||||
<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
|
<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
|
||||||
|
|
||||||
|
## Transfer learning {#transfer-learning}
|
||||||
|
|
||||||
|
### Using transformer models like BERT {#transformers}
|
||||||
|
|
||||||
|
spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
|
||||||
|
can use models implemented in a variety of frameworks. A transformer model is
|
||||||
|
just a statistical model, so the
|
||||||
|
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
|
||||||
|
actually has very little work to do: it just has to provide a few functions that
|
||||||
|
do the required plumbing. It also provides a pipeline component,
|
||||||
|
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
|
||||||
|
you save the transformer outputs for later use.
|
||||||
|
|
||||||
|
<Project id="en_core_bert">
|
||||||
|
|
||||||
|
Try out a BERT-based model pipeline using this project template: swap in your
|
||||||
|
data, edit the settings and hyperparameters and train, evaluate, package and
|
||||||
|
visualize your model.
|
||||||
|
|
||||||
|
</Project>
|
||||||
|
|
||||||
|
For more details on how to integrate transformer models into your training
|
||||||
|
config and customize the implementations, see the usage guide on
|
||||||
|
[training transformers](/usage/transformers#training).
|
||||||
|
|
||||||
|
### Pretraining with spaCy {#pretraining}
|
||||||
|
|
||||||
|
<!-- TODO: document spacy pretrain -->
|
||||||
|
|
||||||
## Parallel Training with Ray {#parallel-training}
|
## Parallel Training with Ray {#parallel-training}
|
||||||
|
|
||||||
<!-- TODO: document Ray integration -->
|
<!-- TODO: document Ray integration -->
|
||||||
|
|
|
@ -24,10 +24,16 @@ $border-radius: 6px
|
||||||
&:last-child
|
&:last-child
|
||||||
margin: 0
|
margin: 0
|
||||||
|
|
||||||
|
&:first-child h4
|
||||||
|
margin-top: 0 !important
|
||||||
|
|
||||||
code
|
code
|
||||||
padding: 0
|
padding: 0
|
||||||
margin: 0
|
margin: 0
|
||||||
|
|
||||||
|
h4
|
||||||
|
margin-left: 0
|
||||||
|
|
||||||
p, ul, ol
|
p, ul, ol
|
||||||
font: inherit
|
font: inherit
|
||||||
margin-bottom: var(--spacing-sm)
|
margin-bottom: var(--spacing-sm)
|
||||||
|
|
|
@ -373,7 +373,7 @@ body [id]:target
|
||||||
margin-right: -1.5em
|
margin-right: -1.5em
|
||||||
margin-left: -1.5em
|
margin-left: -1.5em
|
||||||
padding-right: 1.5em
|
padding-right: 1.5em
|
||||||
padding-left: 1.65em
|
padding-left: 1.1em
|
||||||
|
|
||||||
&:empty:before
|
&:empty:before
|
||||||
// Fix issue where empty lines would disappear
|
// Fix issue where empty lines would disappear
|
||||||
|
|