mirror of https://github.com/explosion/spaCy.git
Update docs [ci skip]
parent c675746ca2
commit 50311a4d37
@@ -6,30 +6,44 @@ source: spacy/gold/corpus.py
 new: 3
 ---
 
-This class manages annotated corpora and can read training and development
-datasets in the [DocBin](/api/docbin) (`.spacy`) format.
+This class manages annotated corpora and can be used for training and
+development datasets in the [DocBin](/api/docbin) (`.spacy`) format. To
+customize the data loading during training, you can register your own
+[data readers and batchers](/usage/training#custom-code-readers-batchers).
 
 ## Corpus.\_\_init\_\_ {#init tag="method"}
 
-Create a `Corpus`. The input data can be a file or a directory of files.
+Create a `Corpus` for iterating [`Example`](/api/example) objects from a file or
+directory of [`.spacy` data files](/api/data-formats#binary-training). The
+`gold_preproc` setting lets you specify whether to set up the `Example` object
+with gold-standard sentences and tokens for the predictions. Gold preprocessing
+helps the annotations align to the tokenization, and may result in sequences of
+more consistent length. However, it may reduce runtime accuracy due to
+train/test skew.
 
 > #### Example
 >
 > ```python
 > from spacy.gold import Corpus
 >
-> corpus = Corpus("./train.spacy", "./dev.spacy")
+> # With a single file
+> corpus = Corpus("./data/train.spacy")
+>
+> # With a directory
+> corpus = Corpus("./data", limit=10)
 > ```
 
-| Name    | Type         | Description                                                       |
-| ------- | ------------ | ----------------------------------------------------------------- |
-| `train` | str / `Path` | Training data (`.spacy` file or directory of `.spacy` files).     |
-| `dev`   | str / `Path` | Development data (`.spacy` file or directory of `.spacy` files).  |
-| `limit` | int          | Maximum number of examples returned. `0` for no limit (default).  |
+| Name           | Type         | Description                                                                                                                                  |
+| -------------- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path`         | str / `Path` | The directory or filename to read from.                                                                                                      |
+| _keyword-only_ |              |                                                                                                                                              |
+| `gold_preproc` | bool         | Whether to set up the `Example` object with gold-standard sentences and tokens for the predictions. Defaults to `False`.                     |
+| `max_length`   | int          | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit.  |
+| `limit`        | int          | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit.                                                      |
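
A minimal usage sketch tying these settings together (the path is illustrative, and `spacy.gold` is the nightly module used throughout this page):

```python
import spacy
from spacy.gold import Corpus

nlp = spacy.blank("en")
# gold_preproc and max_length are keyword-only, per the table above
corpus = Corpus("./data/train.spacy", gold_preproc=False, max_length=0)
for example in corpus(nlp):  # Corpus.__call__ yields Example objects
    print(example.reference)
```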
 
-## Corpus.train_dataset {#train_dataset tag="method"}
+## Corpus.\_\_call\_\_ {#call tag="method"}
 
-Yield examples from the training data.
+Yield examples from the data.
 
 > #### Example
 >
@@ -37,60 +51,12 @@ Yield examples from the training data.
 > from spacy.gold import Corpus
 > import spacy
 >
-> corpus = Corpus("./train.spacy", "./dev.spacy")
+> corpus = Corpus("./train.spacy")
 > nlp = spacy.blank("en")
-> train_data = corpus.train_dataset(nlp)
+> train_data = corpus(nlp)
 > ```
 
-| Name           | Type       | Description                                                                                                                                |
-| -------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
-| `nlp`          | `Language` | The current `nlp` object.                                                                                                                  |
-| _keyword-only_ |            |                                                                                                                                            |
-| `shuffle`      | bool       | Whether to shuffle the examples. Defaults to `True`.                                                                                       |
-| `gold_preproc` | bool       | Whether to train on gold-standard sentences and tokens. Defaults to `False`.                                                               |
-| `max_length`   | int        | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. `0` for no limit (default). |
-| **YIELDS**     | `Example`  | The examples.                                                                                                                              |
-
-## Corpus.dev_dataset {#dev_dataset tag="method"}
-
-Yield examples from the development data.
-
-> #### Example
->
-> ```python
-> from spacy.gold import Corpus
-> import spacy
->
-> corpus = Corpus("./train.spacy", "./dev.spacy")
-> nlp = spacy.blank("en")
-> dev_data = corpus.dev_dataset(nlp)
-> ```
-
-| Name           | Type       | Description                                                                  |
-| -------------- | ---------- | ---------------------------------------------------------------------------- |
-| `nlp`          | `Language` | The current `nlp` object.                                                    |
-| _keyword-only_ |            |                                                                              |
-| `gold_preproc` | bool       | Whether to train on gold-standard sentences and tokens. Defaults to `False`. |
-| **YIELDS**     | `Example`  | The examples.                                                                |
-
-## Corpus.count_train {#count_train tag="method"}
-
-Get the word count of all training examples.
-
-> #### Example
->
-> ```python
-> from spacy.gold import Corpus
-> import spacy
->
-> corpus = Corpus("./train.spacy", "./dev.spacy")
-> nlp = spacy.blank("en")
-> word_count = corpus.count_train(nlp)
-> ```
-
-| Name        | Type       | Description               |
-| ----------- | ---------- | ------------------------- |
-| `nlp`       | `Language` | The current `nlp` object. |
-| **RETURNS** | int        | The word count.           |
-
-<!-- TODO: document remaining methods? / decide which to document -->
+| Name       | Type       | Description               |
+| ---------- | ---------- | ------------------------- |
+| `nlp`      | `Language` | The current `nlp` object. |
+| **YIELDS** | `Example`  | The examples.             |
@@ -4,7 +4,7 @@ menu:
   - ['spacy', 'spacy']
   - ['displacy', 'displacy']
   - ['registry', 'registry']
-  - ['Loaders & Batchers', 'loaders-batchers']
+  - ['Readers & Batchers', 'readers-batchers']
   - ['Data & Alignment', 'gold']
   - ['Utility Functions', 'util']
 ---
@@ -303,6 +303,9 @@ factories.
 | `lookups`         | Registry for large lookup tables available via `vocab.lookups`.                                                                                                        |
 | `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
+| `assets`          |                                                                                                                                                                         |
 | `callbacks`       | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training.                                                 |
+| `readers`         | Registry for training and evaluation [data readers](#readers-batchers).                                                                                                |
+| `batchers`        | Registry for training and evaluation [data batchers](#readers-batchers).                                                                                               |
 | `optimizers`      | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers).                                                                                 |
 | `schedules`       | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules).                                                                                   |
 | `layers`          | Registry for functions that create [layers](https://thinc.ai/docs/api-layers).                                                                                         |
@@ -334,10 +337,113 @@ See the [`Transformer`](/api/transformer) API reference and
 | [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
 | [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
 
-## Training data loaders and batchers {#loaders-batchers new="3"}
+## Data readers and batchers {#readers-batchers new="3"}
 
 <!-- TODO: -->
 
+### spacy.Corpus.v1 {#corpus tag="registered function" source="spacy/gold/corpus.py"}
+
+Registered function that creates a [`Corpus`](/api/corpus) of training or
+evaluation data. It takes the same arguments as the `Corpus` class and returns a
+callable that yields [`Example`](/api/example) objects. You can replace it with
+your own registered function in the [`@readers` registry](#registry) to
+customize the data loading and streaming.
+
+> #### Example config
+>
+> ```ini
+> [paths]
+> train = "corpus/train.spacy"
+>
+> [training.train_corpus]
+> @readers = "spacy.Corpus.v1"
+> path = ${paths:train}
+> gold_preproc = false
+> max_length = 0
+> limit = 0
+> ```
+
+| Name           | Type   | Description                                                                                                                                      |
+| -------------- | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path`         | `Path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training).                    |
+| `gold_preproc` | bool   | Whether to set up the `Example` object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. |
+| `max_length`   | int    | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit.     |
+| `limit`        | int    | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit.                                                         |
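
As a sketch of the replacement mentioned above, a custom reader registered in the `@readers` registry could look like the following; the name `my_reader.v1` and the `load_my_data` helper are hypothetical stand-ins for your own code, and the import paths assume the nightly `spacy.gold` module used on this page:

```python
from typing import Callable, Iterable
import spacy
from spacy.gold import Example
from spacy.language import Language

def load_my_data(path: str):
    # hypothetical stand-in: replace with your own storage format
    yield "I like pizza", {"cats": {"FOOD": 1.0}}

@spacy.registry.readers("my_reader.v1")
def create_my_reader(path: str) -> Callable[[Language], Iterable[Example]]:
    def read_examples(nlp: Language) -> Iterable[Example]:
        for text, annots in load_my_data(path):
            doc = nlp.make_doc(text)
            yield Example.from_dict(doc, annots)
    return read_examples
```

The config block would then reference `@readers = "my_reader.v1"` instead of `"spacy.Corpus.v1"`.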
+### Batchers {#batchers source="spacy/gold/batchers.py"}
+
+<!-- TODO: -->
+
+#### batch_by_words.v1 {#batch_by_words tag="registered function"}
+
+Create minibatches of roughly a given number of words. If any examples are
+longer than the specified batch length, they will appear in a batch by
+themselves, or be discarded if `discard_oversize` is set to `True`. The argument
+`docs` can be a list of strings, [`Doc`](/api/doc) objects or
+[`Example`](/api/example) objects.
+
+> #### Example config
+>
+> ```ini
+> [training.batcher]
+> @batchers = "batch_by_words.v1"
+> size = 100
+> tolerance = 0.2
+> discard_oversize = false
+> get_length = null
+> ```
+
+<!-- TODO: complete table -->
+
+| Name               | Type                   | Description                                                                                                                          |
+| ------------------ | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
+| `size`             | `Iterable[int]` / int  | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding).  |
+| `tolerance`        | float                  |                                                                                                                                      |
+| `discard_oversize` | bool                   | Discard items that are longer than the specified batch length.                                                                      |
+| `get_length`       | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set.                     |
+
+#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
+
+<!-- TODO: -->
+
+> #### Example config
+>
+> ```ini
+> [training.batcher]
+> @batchers = "batch_by_sequence.v1"
+> size = 32
+> get_length = null
+> ```
+
+<!-- TODO: complete table -->
+
+| Name         | Type                   | Description                                                                                                                          |
+| ------------ | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
+| `size`       | `Iterable[int]` / int  | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding).  |
+| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set.                     |
+
+#### batch_by_padded.v1 {#batch_by_padded tag="registered function"}
+
+<!-- TODO: -->
+
+> #### Example config
+>
+> ```ini
+> [training.batcher]
+> @batchers = "batch_by_padded.v1"
+> size = 100
+> buffer = TODO:
+> discard_oversize = false
+> get_length = null
+> ```
+
+| Name               | Type                   | Description                                                                                                                          |
+| ------------------ | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
+| `size`             | `Iterable[int]` / int  | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding).  |
+| `buffer`           | int                    |                                                                                                                                      |
+| `discard_oversize` | bool                   | Discard items that are longer than the specified batch length.                                                                      |
+| `get_length`       | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set.                     |
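
The batchers above all follow the same registered-function pattern. As a minimal sketch (not one of the built-ins), a custom batcher producing fixed-size batches might look like this; the name `fixed_size_batches.v1` is made up for illustration:

```python
from typing import Any, Iterable, Iterator, List
import spacy

@spacy.registry.batchers("fixed_size_batches.v1")
def create_fixed_size_batcher(size: int):
    def batcher(items: Iterable[Any]) -> Iterator[List[Any]]:
        batch: List[Any] = []
        for item in items:
            batch.append(item)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:  # emit the final, possibly smaller batch
            yield batch
    return batcher
```

It could then be referenced from the config via `@batchers = "fixed_size_batches.v1"` with a `size` setting.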
 
 ## Training data and alignment {#gold source="spacy/gold"}
 
 ### gold.docs_to_json {#docs_to_json tag="function"}
@@ -5,8 +5,8 @@ menu:
   - ['Introduction', 'basics']
   - ['Quickstart', 'quickstart']
   - ['Config System', 'config']
-  - ['Transfer Learning', 'transfer-learning']
   - ['Custom Models', 'custom-models']
+  - ['Transfer Learning', 'transfer-learning']
   - ['Parallel Training', 'parallel-training']
   - ['Internal API', 'api']
 ---
@@ -315,6 +315,10 @@ stop = 1000
 compound = 1.001
 ```
 
+### Using variable interpolation {#config-interpolation}
+
+<!-- TODO: describe and come up with good example showing both values and sections -->
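
Pending the TODO above, a rough sketch of what value interpolation does, driving Thinc's config system directly (the section and key names are illustrative):

```python
from thinc.api import Config

CONFIG_STR = """
[paths]
train = "corpus/train.spacy"

[training.train_corpus]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
"""

config = Config().from_str(CONFIG_STR)
# ${paths:train} is substituted with the value defined in [paths]
print(config["training"]["train_corpus"]["path"])  # "corpus/train.spacy"
```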
+
 ### Model architectures {#model-architectures}
 
 <!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
@@ -384,41 +388,17 @@ still look good.
 
 </Accordion>
 
-## Transfer learning {#transfer-learning}
-
-### Using transformer models like BERT {#transformers}
-
-spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
-can use models implemented in a variety of frameworks. A transformer model is
-just a statistical model, so the
-[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
-actually has very little work to do: it just has to provide a few functions that
-do the required plumbing. It also provides a pipeline component,
-[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
-you save the transformer outputs for later use.
-
-<Project id="en_core_bert">
-
-Try out a BERT-based model pipeline using this project template: swap in your
-data, edit the settings and hyperparameters and train, evaluate, package and
-visualize your model.
-
-</Project>
-
-For more details on how to integrate transformer models into your training
-config and customize the implementations, see the usage guide on
-[training transformers](/usage/transformers#training).
-
-### Pretraining with spaCy {#pretraining}
-
-<!-- TODO: document spacy pretrain -->
-
 ## Custom model implementations and architectures {#custom-models}
 
 <!-- TODO: intro, should summarise what spaCy v3 can do and that you can now use fully custom implementations, models defined in PyTorch and TF, etc. etc. -->
 
 ### Training with custom code {#custom-code}
 
+> ```bash
+> ### Example {wrap="true"}
+> $ python -m spacy train train.spacy dev.spacy config.cfg --code functions.py
+> ```
+
 The [`spacy train`](/api/cli#train) command lets you specify an optional argument
 `--code` that points to a Python file. The file is imported before training and
 allows you to add custom functions and architectures to the function registry
@@ -426,6 +406,120 @@ that can then be referenced from your `config.cfg`. This lets you train spaCy
 models with custom components, without having to re-implement the whole training
 workflow.
 
+#### Example: Modifying the nlp object {#custom-code-nlp-callbacks}
+
+For many use cases, you don't necessarily want to implement the whole `Language`
+subclass and language data from scratch – it's often enough to make a few small
+modifications, like adjusting the
+[tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or
+[language defaults](/api/language#defaults) like stop words. The config lets you
+provide three optional **callback functions** that give you access to the
+language class and `nlp` object at different points of the lifecycle:
+
+| Callback                  | Description                                                                                                                                                                           |
+| ------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `before_creation`         | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to [`Language.Defaults`](/api/language#defaults). |
+| `after_creation`          | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline, and receives the `nlp` object. Useful for modifying the tokenizer.     |
+| `after_pipeline_creation` | Called right after the pipeline components are created and added, and receives the `nlp` object. Useful for modifying pipeline components.                                           |
+
+The `@spacy.registry.callbacks` decorator lets you register that function in the
+`callbacks` [registry](/api/top-level#registry) under a given name. You can then
+reference the function in a config block using the `@callbacks` key. If a block
+contains a key starting with an `@`, it's interpreted as a reference to a
+function. Because you've registered the function, spaCy knows how to create it
+when you reference `"customize_language_data"` in your config. Here's an example
+of a callback that runs before the `nlp` object is created and adds a few custom
+tokenization rules to the defaults:
+
+> #### config.cfg
+>
+> ```ini
+> [nlp.before_creation]
+> @callbacks = "customize_language_data"
+> ```
+
+```python
+### functions.py {highlight="3,6"}
+import spacy
+
+@spacy.registry.callbacks("customize_language_data")
+def create_callback():
+    def customize_language_data(lang_cls):
+        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
+        return lang_cls
+
+    return customize_language_data
+```
+
+<Infobox variant="warning">
+
+Remember that a registered function should always be a function that spaCy
+**calls to create something**. In this case, it **creates a callback** – it's
+not the callback itself.
+
+</Infobox>
+
+Any registered function – in this case `create_callback` – can also take
+**arguments** that can be **set by the config**. This lets you implement and
+keep track of different configurations, without having to hack at your code. You
+can choose any arguments that make sense for your use case. In this example,
+we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
+(boolean) for printing additional info when the function runs.
+
+> #### config.cfg
+>
+> ```ini
+> [nlp.before_creation]
+> @callbacks = "customize_language_data"
+> extra_stop_words = ["ooh", "aah"]
+> debug = true
+> ```
+
+```python
+### functions.py {highlight="5,8-10"}
+from typing import List
+import spacy
+
+@spacy.registry.callbacks("customize_language_data")
+def create_callback(extra_stop_words: List[str] = [], debug: bool = False):
+    def customize_language_data(lang_cls):
+        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
+        lang_cls.Defaults.stop_words.update(extra_stop_words)
+        if debug:
+            print("Updated stop words and tokenizer suffixes")
+        return lang_cls
+
+    return customize_language_data
+```
+
+<Infobox title="Tip: Use Python type hints" emoji="💡">
+
+spaCy's configs are powered by our machine learning library Thinc's
+[configuration system](https://thinc.ai/docs/usage-config), which supports
+[type hints](https://docs.python.org/3/library/typing.html) and even
+[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
+using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
+function provides type hints, the values that are passed in will be checked
+against the expected types. For example, `debug: bool` in the example above will
+ensure that the value received as the argument `debug` is a boolean. If the
+value can't be coerced into a boolean, spaCy will raise an error.
+`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
+error if it's not – for instance, if your config defines `1` instead of `true`.
+
+</Infobox>
+
+With your `functions.py` defining additional code and the updated `config.cfg`,
+you can now run [`spacy train`](/api/cli#train) and point the argument `--code`
+to your Python file. Before loading the config, spaCy will import the
+`functions.py` module and your custom functions will be registered.
+
+```bash
+### Training with custom code {wrap="true"}
+python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
+```
+
+#### Example: Custom batch size schedule {#custom-code-schedule}
+
 For example, let's say you've implemented your own batch size schedule to use
 during training. The `@spacy.registry.schedules` decorator lets you register
 that function in the `schedules` [registry](/api/top-level#registry) and assign
@@ -459,8 +553,6 @@ the functions need to be represented in the config. If your function defines
 **default argument values**, spaCy is able to auto-fill your config when you run
 [`init config`](/api/cli#init-config).
 
-<!-- TODO: this needs to be updated once we've decided on a workflow for "fill config" -->
-
 ```ini
 ### config.cfg (excerpt)
 [training.batch_size]
@@ -469,31 +561,9 @@ start = 2
 factor = 1.005
 ```
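
For context, the registered function behind a block like `[training.batch_size]` might look like the following; a minimal sketch, with the name `my_custom_schedule.v1` assumed for illustration:

```python
import spacy

@spacy.registry.schedules("my_custom_schedule.v1")
def my_custom_schedule(start: int = 2, factor: float = 1.005):
    # yield an endless, compounding sequence of batch sizes
    while True:
        yield start
        start = start * factor
```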
 
-You can now run [`spacy train`](/api/cli#train) with the `config.cfg` and your
-custom `functions.py` as the argument `--code`. Before loading the config, spaCy
-will import the `functions.py` module and your custom functions will be
-registered.
+#### Example: Custom data reading and batching {#custom-code-readers-batchers}
 
-```bash
-### Training with custom code {wrap="true"}
-python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
-```
-
-<Infobox title="Tip: Use Python type hints" emoji="💡">
-
-spaCy's configs are powered by our machine learning library Thinc's
-[configuration system](https://thinc.ai/docs/usage-config), which supports
-[type hints](https://docs.python.org/3/library/typing.html) and even
-[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
-using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
-function provides type hints, the values that are passed in will be checked
-against the expected types. For example, `start: int` in the example above will
-ensure that the value received as the argument `start` is an integer. If the
-value can't be coerced into an integer, spaCy will raise an error.
-`start: pydantic.StrictInt` will force the value to be an integer and raise an
-error if it's not – for instance, if your config defines a float.
-
-</Infobox>
+<!-- TODO: -->
 
 ### Wrapping PyTorch and TensorFlow {#custom-frameworks}
@@ -511,6 +581,35 @@ mattis pretium.
 
 <!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
 
+## Transfer learning {#transfer-learning}
+
+### Using transformer models like BERT {#transformers}
+
+spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
+can use models implemented in a variety of frameworks. A transformer model is
+just a statistical model, so the
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
+actually has very little work to do: it just has to provide a few functions that
+do the required plumbing. It also provides a pipeline component,
+[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
+you save the transformer outputs for later use.
+
+<Project id="en_core_bert">
+
+Try out a BERT-based model pipeline using this project template: swap in your
+data, edit the settings and hyperparameters and train, evaluate, package and
+visualize your model.
+
+</Project>
+
+For more details on how to integrate transformer models into your training
+config and customize the implementations, see the usage guide on
+[training transformers](/usage/transformers#training).
+
+### Pretraining with spaCy {#pretraining}
+
+<!-- TODO: document spacy pretrain -->
+
+## Parallel Training with Ray {#parallel-training}
+
+<!-- TODO: document Ray integration -->
@@ -24,10 +24,16 @@ $border-radius: 6px
     &:last-child
        margin: 0
 
+    &:first-child h4
+        margin-top: 0 !important
+
+    code
+        padding: 0
+        margin: 0
+
     h4
         margin-left: 0
 
     p, ul, ol
         font: inherit
         margin-bottom: var(--spacing-sm)
@@ -373,7 +373,7 @@ body [id]:target
     margin-right: -1.5em
     margin-left: -1.5em
     padding-right: 1.5em
-    padding-left: 1.65em
+    padding-left: 1.1em
 
     &:empty:before
         // Fix issue where empty lines would disappear