mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 01:04:34 +03:00
WIP: Update docs [ci skip]
parent
4d34efa697
commit
5d417d3b19
|
@ -219,23 +219,22 @@ The command will create all objects in the tree and validate them. Note that
|
||||||
some config validation errors are blocking and will prevent the rest of the
|
some config validation errors are blocking and will prevent the rest of the
|
||||||
config from being resolved. This means that you may not see all validation
|
config from being resolved. This means that you may not see all validation
|
||||||
errors at once and some issues are only shown once previous errors have been
|
errors at once and some issues are only shown once previous errors have been
|
||||||
fixed.
|
fixed. To auto-fill a partial config and save the result, you can use the
|
||||||
|
[`init config`](/api/cli#init-config) command.
|
||||||
Instead of specifying all required settings in the config file, you can rely on
|
|
||||||
an auto-fill functionality that uses spaCy's built-in defaults. The resulting
|
|
||||||
full config can be written to file and used in downstream training tasks.
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
$ python -m spacy debug config [config_path] [--code_path] [--output] [--auto_fill] [--diff] [overrides]
|
$ python -m spacy debug config [config_path] [--code_path] [--output] [--auto_fill] [--diff] [overrides]
|
||||||
```
|
```
|
||||||
|
|
||||||
> #### Example 1
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```bash
|
> ```bash
|
||||||
> $ python -m spacy debug config ./config.cfg
|
> $ python -m spacy debug config ./config.cfg
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
<Accordion title="Example 1 output" spaced>
|
<Accordion title="Example output" spaced>
|
||||||
|
|
||||||
|
<!-- TODO: update examples with validation error of final config -->
|
||||||
|
|
||||||
```
|
```
|
||||||
✘ Config validation error
|
✘ Config validation error
|
||||||
|
@ -254,30 +253,15 @@ training -> width extra fields not permitted
|
||||||
|
|
||||||
</Accordion>
|
</Accordion>
|
||||||
|
|
||||||
> #### Example 2
|
| Argument | Type | Default | Description |
|
||||||
>
|
| --------------------- | ---------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
> ```bash
|
| `config_path` | positional | - | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
|
||||||
> $ python -m spacy debug config ./minimal_config.cfg -F -o ./filled_config.cfg
|
| `--code_path`, `-c` | option | `None` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
|
||||||
> ```
|
| `--auto_fill`, `-F` | option | `False` | Whether or not to auto-fill the config with built-in defaults if possible. If `False`, the provided config needs to be complete. |
|
||||||
|
| `--output_path`, `-o` | option | `None` | Output path where the filled config can be stored. Use '-' for standard output. |
|
||||||
<Accordion title="Example 2 output" spaced>
|
| `--diff`, `-D` | option | `False` | Show a visual diff if config was auto-filled. |
|
||||||
|
| `--help`, `-h` | flag | `False` | Show help message and available arguments. |
|
||||||
```
|
| overrides | | `None` | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
|
||||||
✔ Auto-filled config is valid
|
|
||||||
✔ Saved updated config to ./filled_config.cfg
|
|
||||||
```
|
|
||||||
|
|
||||||
</Accordion>
|
|
||||||
|
|
||||||
| Argument | Type | Default | Description |
|
|
||||||
| --------------------- | ---------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
||||||
| `config_path` | positional | - | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
|
|
||||||
| `--code_path`, `-c` | option | `None` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
|
|
||||||
| `--auto_fill`, `-F` | option | `False` | Whether or not to auto-fill the config with built-in defaults if possible. If `False`, the provided config needs to be complete. |
|
|
||||||
| `--output_path`, `-o` | option | `None` | Output path where the filled config can be stored. Use '-' for standard output. |
|
|
||||||
| `--diff`, `-D` | option | `False` | Show a visual diff if config was auto-filled. |
|
|
||||||
| `--help`, `-h` | flag | `False` | Show help message and available arguments. |
|
|
||||||
| overrides | | `None` | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. |
|
|
||||||
|
|
||||||
### debug data {#debug-data}
|
### debug data {#debug-data}
|
||||||
|
|
||||||
|
@ -289,19 +273,20 @@ low data labels and more.
|
||||||
|
|
||||||
The `debug-data` command is now available as a subcommand of `spacy debug`. It
|
The `debug-data` command is now available as a subcommand of `spacy debug`. It
|
||||||
takes the same arguments as `train` and reads settings off the
|
takes the same arguments as `train` and reads settings off the
|
||||||
[`config.cfg` file](/usage/training#config).
|
[`config.cfg` file](/usage/training#config) and optional
|
||||||
|
[overrides](/usage/training#config-overrides) on the CLI.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
$ python -m spacy debug data [train_path] [dev_path] [config_path] [--code]
|
$ python -m spacy debug data [config_path] [--code] [--ignore-warnings]
|
||||||
[--ignore-warnings] [--verbose] [--no-format] [overrides]
|
[--verbose] [--no-format] [overrides]
|
||||||
```
|
```
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```bash
|
> ```bash
|
||||||
> $ python -m spacy debug data ./train.spacy ./dev.spacy ./config.cfg
|
> $ python -m spacy debug data ./config.cfg
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
<Accordion title="Example output" spaced>
|
<Accordion title="Example output" spaced>
|
||||||
|
@ -443,17 +428,15 @@ will not be available.
|
||||||
|
|
||||||
</Accordion>
|
</Accordion>
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Type | Description |
|
||||||
| -------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `train_path` | positional | Location of [binary training data](/usage/training#data-format). Can be a file or a directory of files. |
|
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
|
||||||
| `dev_path` | positional | Location of [binary development data](/usage/training#data-format) for evaluation. Can be a file or a directory of files. |
|
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
|
||||||
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
|
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
|
||||||
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
|
| `--verbose`, `-V` | flag | Print additional information and explanations. |
|
||||||
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
|
| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
|
||||||
| `--verbose`, `-V` | flag | Print additional information and explanations. |
|
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||||
| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
|
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
|
||||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
|
||||||
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. |
|
|
||||||
|
|
||||||
<!-- TODO: document debug profile?-->
|
<!-- TODO: document debug profile?-->
|
||||||
|
|
||||||
|
@ -466,13 +449,16 @@ sample text and checking how it updates its internal weights and parameters.
|
||||||
$ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR] [-GRAD] [-ATTR] [-P0] [-P1] [-P2] [-P3] [--gpu_id]
|
$ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR] [-GRAD] [-ATTR] [-P0] [-P1] [-P2] [-P3] [--gpu_id]
|
||||||
```
|
```
|
||||||
|
|
||||||
> #### Example 1
|
<Accordion title="Example outputs" spaced>
|
||||||
>
|
|
||||||
> ```bash
|
|
||||||
> $ python -m spacy debug model ./config.cfg tagger -P0
|
|
||||||
> ```
|
|
||||||
|
|
||||||
<Accordion title="Example 1 output" spaced>
|
In this example log, we just print the name of each layer after creation of the
|
||||||
|
model ("Step 0"), which helps us to understand the internal structure of the
|
||||||
|
Neural Network, and to focus on specific layers that we want to inspect further
|
||||||
|
(see next example).
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ python -m spacy debug model ./config.cfg tagger -P0
|
||||||
|
```
|
||||||
|
|
||||||
```
|
```
|
||||||
ℹ Using CPU
|
ℹ Using CPU
|
||||||
|
@ -509,20 +495,16 @@ $ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR]
|
||||||
...
|
...
|
||||||
```
|
```
|
||||||
|
|
||||||
</Accordion>
|
In this example log, we see how initialization of the model (Step 1) propagates
|
||||||
|
the correct values for the `nI` (input) and `nO` (output) dimensions of the
|
||||||
|
various layers. In the `softmax` layer, this step also defines the `W` matrix as
|
||||||
|
an all-zero matrix determined by the `nO` and `nI` dimensions. After a first
|
||||||
|
training step (Step 2), this matrix has clearly updated its values through the
|
||||||
|
training feedback loop.
|
||||||
|
|
||||||
In this example log, we just print the name of each layer after creation of the
|
```bash
|
||||||
model ("Step 0"), which helps us to understand the internal structure of the
|
$ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P2
|
||||||
Neural Network, and to focus on specific layers that we want to inspect further
|
```
|
||||||
(see next example).
|
|
||||||
|
|
||||||
> #### Example 2
|
|
||||||
>
|
|
||||||
> ```bash
|
|
||||||
> $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P2
|
|
||||||
> ```
|
|
||||||
|
|
||||||
<Accordion title="Example 2 output" spaced>
|
|
||||||
|
|
||||||
```
|
```
|
||||||
ℹ Using CPU
|
ℹ Using CPU
|
||||||
|
@ -563,27 +545,20 @@ Neural Network, and to focus on specific layers that we want to inspect further
|
||||||
|
|
||||||
</Accordion>
|
</Accordion>
|
||||||
|
|
||||||
In this example log, we see how initialization of the model (Step 1) propagates
|
| Argument | Type | Default | Description |
|
||||||
the correct values for the `nI` (input) and `nO` (output) dimensions of the
|
| ----------------------- | ---------- | ------- | ----------------------------------------------------------------------------------------------------- |
|
||||||
various layers. In the `softmax` layer, this step also defines the `W` matrix as
|
|
||||||
an all-zero matrix determined by the `nO` and `nI` dimensions. After a first
|
|
||||||
training step (Step 2), this matrix has clearly updated its values through the
|
|
||||||
training feedback loop.
|
|
||||||
|
|
||||||
| Argument | Type | Default | Description |
|
|
||||||
| ----------------------- | ---------- | ------- | ---------------------------------------------------------------------------------------------------- |
|
|
||||||
| `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
|
| `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
|
||||||
| `component` | positional | | Name of the pipeline component of which the model should be analysed. |
|
| `component` | positional | | Name of the pipeline component of which the model should be analyzed. |
|
||||||
| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. |
|
| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. |
|
||||||
| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. |
|
| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. |
|
||||||
| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. |
|
| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. |
|
||||||
| `--gradients`, `-GRAD` | option | `False` | Show gradients of each layer. |
|
| `--gradients`, `-GRAD` | option | `False` | Show gradients of each layer. |
|
||||||
| `--attributes`, `-ATTR` | option | `False` | Show attributes of each layer. |
|
| `--attributes`, `-ATTR` | option | `False` | Show attributes of each layer. |
|
||||||
| `--print-step0`, `-P0` | option | `False` | Print model before training. |
|
| `--print-step0`, `-P0` | option | `False` | Print model before training. |
|
||||||
| `--print-step1`, `-P1` | option | `False` | Print model after initialization. |
|
| `--print-step1`, `-P1` | option | `False` | Print model after initialization. |
|
||||||
| `--print-step2`, `-P2` | option | `False` | Print model after training. |
|
| `--print-step2`, `-P2` | option | `False` | Print model after training. |
|
||||||
| `--print-step3`, `-P3` | option | `False` | Print final predictions. |
|
| `--print-step3`, `-P3` | option | `False` | Print final predictions. |
|
||||||
| `--help`, `-h` | flag | | Show help message and available arguments. |
|
| `--help`, `-h` | flag | | Show help message and available arguments. |
|
||||||
|
|
||||||
## Train {#train}
|
## Train {#train}
|
||||||
|
|
||||||
|
@ -603,37 +578,37 @@ you need to manage complex multi-step training workflows, check out the new
|
||||||
The `train` command doesn't take a long list of command-line arguments anymore
|
The `train` command doesn't take a long list of command-line arguments anymore
|
||||||
and instead expects a single [`config.cfg` file](/usage/training#config)
|
and instead expects a single [`config.cfg` file](/usage/training#config)
|
||||||
containing all settings for the pipeline, training process and hyperparameters.
|
containing all settings for the pipeline, training process and hyperparameters.
|
||||||
|
Config values can be [overwritten](/usage/training#config-overrides) on the CLI
|
||||||
|
if needed. For example, `--paths.train ./train.spacy` sets the variable `train`
|
||||||
|
in the section `[paths]`.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
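To illustrate how a dotted override such as `--paths.train ./train.spacy` maps onto a config section and variable, here is a minimal sketch in plain Python. It is not spaCy's implementation: `parse_overrides` and `apply_overrides` are hypothetical helpers, the config is modeled as a nested dictionary, and the `training` values are made up for illustration.

```python
import copy
from typing import Any, Dict, List


def parse_overrides(args: List[str]) -> Dict[str, str]:
    """Hypothetical helper: turn ["--paths.train", "./train.spacy"]
    into {"paths.train": "./train.spacy"}."""
    if len(args) % 2 != 0:
        raise ValueError("Overrides must come in '--section.name value' pairs")
    overrides = {}
    for i in range(0, len(args), 2):
        key, value = args[i], args[i + 1]
        if not key.startswith("--"):
            raise ValueError(f"Override must start with '--': {key}")
        overrides[key[2:]] = value
    return overrides


def apply_overrides(config: Dict[str, Any], overrides: Dict[str, str]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of the config with each dotted key
    written into the matching (possibly nested) section."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *sections, name = dotted_key.split(".")
        node = result
        for section in sections:
            node = node.setdefault(section, {})
        node[name] = value  # values stay strings in this sketch
    return result


# Illustrative config only; the real config.cfg has many more settings.
config = {"paths": {"train": None, "dev": None}, "training": {"max_steps": 100}}
overrides = parse_overrides(["--paths.train", "./train.spacy"])
updated = apply_overrides(config, overrides)
print(updated["paths"]["train"])
```

The part before the last dot selects the section (`[paths]`) and the last part names the variable (`train`). The original config is left untouched so the overridden copy can be compared against it.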
|
|
||||||
```bash
|
```bash
|
||||||
$ python -m spacy train [train_path] [dev_path] [config_path] [--output]
|
$ python -m spacy train [config_path] [--output] [--code] [--verbose] [overrides]
|
||||||
[--code] [--verbose] [overrides]
|
|
||||||
```
|
```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Type | Description |
|
||||||
| ----------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `train_path` | positional | Location of training data in spaCy's [binary format](/api/data-formats#training). Can be a file or a directory of files. |
|
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
|
||||||
| `dev_path` | positional | Location of development data for evaluation in spaCy's [binary format](/api/data-formats#training). Can be a file or a directory of files. |
|
| `--output`, `-o` | positional | Directory to store model in. Will be created if it doesn't exist. |
|
||||||
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
|
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
|
||||||
| `--output`, `-o` | positional | Directory to store model in. Will be created if it doesn't exist. |
|
| `--verbose`, `-V` | flag | Show more detailed messages during training. |
|
||||||
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
|
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||||
| `--verbose`, `-V` | flag | Show more detailed messages during training. |
|
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
|
||||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
| **CREATES** | model | The final model and the best model. |
|
||||||
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. |
|
|
||||||
| **CREATES** | model | The final model and the best model. |
|
|
||||||
|
|
||||||
## Pretrain {#pretrain new="2.1" tag="experimental"}
|
## Pretrain {#pretrain new="2.1" tag="experimental"}
|
||||||
|
|
||||||
<!-- TODO: document new pretrain command and link to new pretraining docs -->
|
<!-- TODO: document new pretrain command and link to new pretraining docs -->
|
||||||
|
|
||||||
Pre-train the "token to vector" (`tok2vec`) layer of pipeline components, using
|
Pre-train the "token to vector" (`tok2vec`) layer of pipeline components on
|
||||||
an approximate language-modeling objective. Specifically, we load pretrained
|
[raw text](/api/data-formats#pretrain), using an approximate language-modeling
|
||||||
vectors, and train a component like a CNN, BiLSTM, etc to predict vectors which
|
objective. Specifically, we load pretrained vectors, and train a component like
|
||||||
match the pretrained ones. The weights are saved to a directory after each
|
a CNN, BiLSTM, etc. to predict vectors which match the pretrained ones. The
|
||||||
epoch. You can then pass a path to one of these pretrained weights files to the
|
weights are saved to a directory after each epoch. You can then pass a path to
|
||||||
`spacy train` command. This technique may be especially helpful if you have
|
one of these pretrained weights files to the `spacy train` command. This
|
||||||
little labelled data.
|
technique may be especially helpful if you have little labelled data.
|
||||||
|
|
||||||
<Infobox title="Changed in v3.0" variant="warning">
|
<Infobox title="Changed in v3.0" variant="warning">
|
||||||
|
|
||||||
|
@ -650,48 +625,17 @@ $ python -m spacy pretrain [texts_loc] [output_dir] [config_path]
|
||||||
[--code] [--resume-path] [--epoch-resume] [overrides]
|
[--code] [--resume-path] [--epoch-resume] [overrides]
|
||||||
```
|
```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Type | Description |
|
||||||
| ----------------------- | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------------- | ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](#pretrain-jsonl) for details. |
|
| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](/api/data-formats#pretrain) for details. |
|
||||||
| `output_dir` | positional | Directory to write models to on each epoch. |
|
| `output_dir` | positional | Directory to write models to on each epoch. |
|
||||||
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
|
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
|
||||||
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
|
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
|
||||||
| `--resume-path`, `-r` | option | TODO: |
|
| `--resume-path`, `-r` | option | TODO: |
|
||||||
| `--epoch-resume`, `-er` | option | TODO: |
|
| `--epoch-resume`, `-er` | option | TODO: |
|
||||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||||
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. |
|
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. |
|
||||||
| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. |
|
| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. |
|
||||||
|
|
||||||
### JSONL format for raw text {#pretrain-jsonl}
|
|
||||||
|
|
||||||
Raw text can be provided as a `.jsonl` (newline-delimited JSON) file containing
|
|
||||||
one input text per line (roughly paragraph length is good). Optionally, custom
|
|
||||||
tokenization can be provided.
|
|
||||||
|
|
||||||
> #### Tip: Writing JSONL
|
|
||||||
>
|
|
||||||
> Our utility library [`srsly`](https://github.com/explosion/srsly) provides a
|
|
||||||
> handy `write_jsonl` helper that takes a file path and list of dictionaries and
|
|
||||||
> writes out JSONL-formatted data.
|
|
||||||
>
|
|
||||||
> ```python
|
|
||||||
> import srsly
|
|
||||||
> data = [{"text": "Some text"}, {"text": "More..."}]
|
|
||||||
> srsly.write_jsonl("/path/to/text.jsonl", data)
|
|
||||||
> ```
|
|
||||||
|
|
||||||
| Key | Type | Description |
|
|
||||||
| -------- | ---- | ---------------------------------------------------------- |
|
|
||||||
| `text` | str | The raw input text. Is not required if `tokens` available. |
|
|
||||||
| `tokens` | list | Optional tokenization, one string per token. |
|
|
||||||
|
|
||||||
```json
|
|
||||||
### Example
|
|
||||||
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
|
|
||||||
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
|
|
||||||
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
|
|
||||||
{"tokens": ["If", "tokens", "are", "provided", "then", "we", "can", "skip", "the", "raw", "input", "text"]}
|
|
||||||
```
|
|
||||||
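For readers who want to produce or check such a file without extra dependencies, the following sketch writes and reads newline-delimited JSON using only the standard library. `dump_jsonl` and `load_jsonl` are hypothetical helpers written for this example (the `srsly.write_jsonl` helper mentioned above is the documented route), and the temporary file path is arbitrary.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable, Iterator


def dump_jsonl(path: str, records: Iterable[Dict[str, Any]]) -> None:
    """Hypothetical helper: write one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: read records back and check that each one
    provides either "text" or "tokens", as described in the table above."""
    with open(path, encoding="utf8") as f:
        for line_number, raw in enumerate(f, start=1):
            raw = raw.strip()
            if not raw:
                continue  # skip blank lines
            record = json.loads(raw)
            if "text" not in record and "tokens" not in record:
                raise ValueError(f"Line {line_number} has neither 'text' nor 'tokens'")
            yield record


records = [
    {"text": "Can I ask where you work now and what you do, and if you enjoy it?"},
    {"tokens": ["If", "tokens", "are", "provided", "then", "we", "can", "skip", "the", "raw", "input", "text"]},
]

path = os.path.join(tempfile.mkdtemp(), "raw_text.jsonl")
dump_jsonl(path, records)
loaded = list(load_jsonl(path))
print(len(loaded))
```

Each record sits on its own line, so large files can be read one record at a time instead of being loaded as a single JSON document.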
|
|
||||||
## Evaluate {#evaluate new="2"}
|
## Evaluate {#evaluate new="2"}
|
||||||
|
|
||||||
|
|
|
@ -3,6 +3,7 @@ title: Data formats
|
||||||
teaser: Details on spaCy's input and output data formats
|
teaser: Details on spaCy's input and output data formats
|
||||||
menu:
|
menu:
|
||||||
- ['Training Data', 'training']
|
- ['Training Data', 'training']
|
||||||
|
- ['Pretraining Data', 'pretraining']
|
||||||
- ['Training Config', 'config']
|
- ['Training Config', 'config']
|
||||||
- ['Vocabulary', 'vocab']
|
- ['Vocabulary', 'vocab']
|
||||||
---
|
---
|
||||||
|
@ -16,17 +17,30 @@ label schemes used in its components, depending on the data it was trained on.
|
||||||
|
|
||||||
### Binary training format {#binary-training new="3"}
|
### Binary training format {#binary-training new="3"}
|
||||||
|
|
||||||
|
The built-in [`convert`](/api/cli#convert) command helps you convert the
|
||||||
|
`.conllu` format used by the
|
||||||
|
[Universal Dependencies corpora](https://github.com/UniversalDependencies) as
|
||||||
|
well as spaCy's previous [JSON format](#json-input).
|
||||||
|
|
||||||
<!-- TODO: document DocBin format -->
|
<!-- TODO: document DocBin format -->
|
||||||
|
|
||||||
### JSON input format for training {#json-input}
|
### JSON training format {#json-input tag="deprecated"}
|
||||||
|
|
||||||
spaCy takes training data in JSON format. The built-in
|
<Infobox variant="warning" title="Changed in v3.0">
|
||||||
[`convert`](/api/cli#convert) command helps you convert the `.conllu` format
|
|
||||||
used by the
|
As of v3.0, the JSON input format is deprecated and is replaced by the
|
||||||
[Universal Dependencies corpora](https://github.com/UniversalDependencies) to
|
[binary format](#binary-training). Instead of converting [`Doc`](/api/doc)
|
||||||
spaCy's training format. To convert one or more existing `Doc` objects to
|
objects to JSON, you can now serialize them directly using the
|
||||||
spaCy's JSON format, you can use the
|
[`DocBin`](/api/docbin) container and then use them as input data.
|
||||||
[`gold.docs_to_json`](/api/top-level#docs_to_json) helper.
|
|
||||||
|
[`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`
|
||||||
|
format:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ python -m spacy convert ./data.json ./output
|
||||||
|
```
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
> #### Annotating entities {#biluo}
|
> #### Annotating entities {#biluo}
|
||||||
>
|
>
|
||||||
|
@ -68,61 +82,99 @@ spaCy's JSON format, you can use the
|
||||||
}]
|
}]
|
||||||
```
|
```
|
||||||
|
|
||||||
|
<Accordion title="Sample JSON data" spaced>
|
||||||
|
|
||||||
Here's an example of dependencies, part-of-speech tags and named entities, taken
|
Here's an example of dependencies, part-of-speech tags and named entities, taken
|
||||||
from the English Wall Street Journal portion of the Penn Treebank:
|
from the English Wall Street Journal portion of the Penn Treebank:
|
||||||
|
|
||||||
```json
|
```json
|
||||||
https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json
|
https://github.com/explosion/spaCy/blob/v2.3.x/examples/training/training-data.json
|
||||||
```
|
```
|
||||||
|
|
||||||
### Annotations in dictionary format {#dict-input}
|
</Accordion>
|
||||||
|
|
||||||
To create [`Example`](/api/example) objects, you can create a dictionary of the
|
### Annotation format for creating training examples {#dict-input}
|
||||||
gold-standard annotations `gold_dict`, and then call
|
|
||||||
|
|
||||||
```python
|
An [`Example`](/api/example) object holds the information for one training
|
||||||
example = Example.from_dict(doc, gold_dict)
|
instance. It stores two [`Doc`](/api/doc) objects: one for holding the
|
||||||
```
|
gold-standard reference data, and one for holding the predictions of the
|
||||||
|
pipeline. Examples can be created using the
|
||||||
|
[`Example.from_dict`](/api/example#from_dict) method with a reference `Doc` and
|
||||||
|
a dictionary of gold-standard annotations. There are currently two formats
|
||||||
|
supported for this dictionary of annotations: one with a simple, **flat
|
||||||
|
structure** of keywords, and one with a more **hierarchical structure**.
|
||||||
|
|
||||||
There are currently two formats supported for this dictionary of annotations:
|
> #### Example
|
||||||
one with a simple, flat structure of keywords, and one with a more hierarchical
|
>
|
||||||
structure.
|
> ```python
|
||||||
|
> example = Example.from_dict(doc, gold_dict)
|
||||||
|
> ```
|
||||||
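As a rough illustration of what a flat `gold_dict` might look like before it is passed to `Example.from_dict(doc, gold_dict)`, the sketch below builds one in plain Python and checks that the token-level lists line up with `words`. `check_flat_annotations` is a hypothetical helper and not part of spaCy, the sentence and labels are invented for the example, and the entity tuple is assumed to use character offsets.

```python
from typing import Any, Dict, List

# Keys that are expected to hold one value per token in the flat structure.
TOKEN_LEVEL_KEYS = [
    "lemmas", "spaces", "tags", "pos", "morphs", "sent_starts", "deps", "heads",
]


def check_flat_annotations(gold_dict: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of alignment problems (empty if none)."""
    problems = []
    words = gold_dict.get("words")
    if words is None:
        return problems  # nothing to align the other lists against
    for key in TOKEN_LEVEL_KEYS:
        values = gold_dict.get(key)
        if values is not None and len(values) != len(words):
            problems.append(
                f"'{key}' has {len(values)} values but there are {len(words)} words"
            )
    # Heads refer to the absolute index of a token in the text.
    for head in gold_dict.get("heads", []):
        if not 0 <= head < len(words):
            problems.append(f"head index {head} is out of range")
    return problems


# Invented annotations for the text "I like London".
gold_dict = {
    "words": ["I", "like", "London"],
    "spaces": [True, True, False],
    "tags": ["PRP", "VBP", "NNP"],
    "pos": ["PRON", "VERB", "PROPN"],
    "heads": [1, 1, 1],
    "deps": ["nsubj", "ROOT", "dobj"],
    "entities": [(7, 13, "GPE")],  # assumed character offsets of "London"
}

print(check_flat_annotations(gold_dict))
```

Only the keys relevant to the task being trained need to be present. A check like this can catch misaligned lists early, before the dictionary is handed to the training API.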
|
|
||||||
|
<Infobox title="Important note" variant="warning">
|
||||||
|
|
||||||
|
`Example` objects are used as part of the
|
||||||
|
[internal training API](/usage/training#api) and they're expected when you call
|
||||||
|
[`nlp.update`](/api/language#update). However, for most use cases, you
|
||||||
|
**shouldn't** have to write your own training scripts. It's recommended to train
|
||||||
|
your models via the [`spacy train`](/api/cli#train) command with a config file
|
||||||
|
to keep track of your settings and hyperparameters and your own
|
||||||
|
[registered functions](/usage/training/#custom-code) to customize the setup.
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
#### Flat structure {#dict-flat}
|
#### Flat structure {#dict-flat}
|
||||||
|
|
||||||
Here is the full overview of potential entries in a flat dictionary of
|
> #### Example
|
||||||
annotations. You need to only specify those keys corresponding to the task you
|
>
|
||||||
want to train.
|
> ```python
|
||||||
|
> {
|
||||||
|
> "text": str,
|
||||||
|
> "words": List[str],
|
||||||
|
> "lemmas": List[str],
|
||||||
|
> "spaces": List[bool],
|
||||||
|
> "tags": List[str],
|
||||||
|
> "pos": List[str],
|
||||||
|
> "morphs": List[str],
|
||||||
|
> "sent_starts": List[bool],
|
||||||
|
> "deps": List[str],
|
||||||
|
> "heads": List[int],
|
||||||
|
> "entities": List[str],
|
||||||
|
> "entities": List[(int, int, str)],
|
||||||
|
> "cats": Dict[str, float],
|
||||||
|
> "links": Dict[(int, int), dict],
|
||||||
|
> }
|
||||||
|
> ```
|
||||||
|
|
||||||
```python
|
| Name | Type | Description |
|
||||||
### Flat dictionary
|
| ------------- | ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
{
|
| `text` | str | Raw text. |
|
||||||
"text": string, # Raw text.
|
| `words` | `List[str]` | List of gold-standard tokens. |
|
||||||
"words": List[string], # List of gold tokens.
|
| `lemmas` | `List[str]` | List of lemmas. |
|
||||||
"lemmas": List[string], # List of lemmas.
|
| `spaces` | `List[bool]` | List of boolean values indicating whether the corresponding token is followed by a space or not. |
|
||||||
"spaces": List[bool], # List of boolean values indicating whether the corresponding token is followed by a space or not.
|
| `tags` | `List[str]` | List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging). |
|
||||||
"tags": List[string], # List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging).
|
| `pos` | `List[str]` | List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging). |
|
||||||
"pos": List[string], # List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging).
|
| `morphs` | `List[str]` | List of [morphological features](/usage/linguistic-features#rule-based-morphology). |
|
||||||
"morphs": List[string], # List of [morphological features](/usage/linguistic-features#rule-based-morphology).
|
| `sent_starts` | `List[bool]` | List of boolean values indicating whether each token is the first of a sentence or not. |
|
||||||
"sent_starts": List[bool], # List of boolean values indicating whether each token is the first of a sentence or not.
|
| `deps` | `List[str]` | List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head. |
|
||||||
"deps": List[string], # List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head.
|
| `heads` | `List[int]` | List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text. |
|
||||||
"heads": List[int], # List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text.
|
| `entities` | `List[str]` | Option 1: List of [BILUO tags](#biluo) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens. |
|
||||||
"entities": List[string], # Option 1: List of [BILUO tags](#biluo) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens.
|
| `entities` | `List[Tuple[int, int, str]]` | Option 2: List of `"(start, end, label)"` tuples defining all entities in the text. |
|
||||||
"entities": List[(int, int, string)], # Option 2: List of `"(start, end, label)"` tuples defining all entities in.
|
| `cats` | `Dict[str, float]` | Dictionary of `label`/`value` pairs indicating how relevant a certain [text category](/api/textcategorizer) is for the text. |
|
||||||
"cats": Dict[str, float], # Dictionary of `label:value` pairs indicating how relevant a certain [category](/api/textcategorizer) is for the text.
|
| `links` | `Dict[(int, int), Dict]` | Dictionary of `offset`/`dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The character offsets are linked to a dictionary of relevant knowledge base IDs. |
|
||||||
"links": Dict[(int, int), Dict], # Dictionary of `offset:dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The charachter offsets are linked to a dictionary of relevant knowledge base IDs.
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
There are a few caveats to take into account:
|
<Infobox variant="warning" title="Important notes and caveats">
|
||||||
|
|
||||||
- Multiple formats are possible for the "entities" entry, but you have to pick
|
- Multiple formats are possible for the "entities" entry, but you have to pick
|
||||||
one.
|
one.
|
||||||
- Any values for sentence starts will be ignored if there are annotations for
|
- Any values for sentence starts will be ignored if there are annotations for
|
||||||
dependency relations.
|
dependency relations.
|
||||||
- If the dictionary contains values for "text" and "words", but not "spaces",
|
- If the dictionary contains values for `"text"` and `"words"`, but not
|
||||||
the latter are inferred automatically. If "words" is not provided either, the
|
`"spaces"`, the latter are inferred automatically. If `"words"` is not provided
|
||||||
values are inferred from the `doc` argument.
|
either, the values are inferred from the `Doc` argument.
|
||||||
|
|
||||||
|
</Infobox>
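
As a rough illustration of the flat structure and the caveats above, the following sketch builds one flat dictionary and checks that every token-level list has one entry per word. The helper `check_flat_annotations` is hypothetical and not part of spaCy, and the entity offsets are assumed to be character offsets into `"text"`.

```python
from typing import Any, Dict, List

# Keys from the table above whose values hold one entry per token.
TOKEN_LEVEL_KEYS = [
    "lemmas", "spaces", "tags", "pos", "morphs", "sent_starts", "deps", "heads",
]


def check_flat_annotations(annots: Dict[str, Any]) -> List[str]:
    """Hypothetical helper (not a spaCy API): list length mismatches."""
    problems: List[str] = []
    words = annots.get("words")
    if words is None:
        # Without "words" there is nothing to compare list lengths against.
        return problems
    for key in TOKEN_LEVEL_KEYS:
        values = annots.get(key)
        if values is not None and len(values) != len(words):
            problems.append(f"{key}: expected {len(words)} values, got {len(values)}")
    entities = annots.get("entities")
    # Option 1 (per-token BILUO tags or None) must also align with "words".
    if entities and all(e is None or isinstance(e, str) for e in entities):
        if len(entities) != len(words):
            problems.append(f"entities: expected {len(words)} values, got {len(entities)}")
    return problems


annots = {
    "text": "I like London.",
    "words": ["I", "like", "London", "."],
    "spaces": [True, True, False, False],
    "tags": ["PRP", "VBP", "NNP", "."],
    # Option 2: (start, end, label) tuples, assumed to be character offsets.
    "entities": [(7, 13, "LOC")],
    "cats": {"TRAVEL": 1.0, "SPORTS": 0.0},
}
problems = check_flat_annotations(annots)
print(problems or "no length mismatches found")
```

Only the keys relevant to the task need to be present, so the check skips any key that is missing rather than reporting it.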
|
||||||
|
|
||||||
|
<!-- TODO: finish reformatting below -->
|
||||||
|
|
||||||
##### Examples
|
##### Examples
|
||||||
|
|
||||||
|
@ -192,6 +244,39 @@ There are a few caveats to take into account:
|
||||||
latter are inferred automatically. If "ORTH" is not provided either, the
|
latter are inferred automatically. If "ORTH" is not provided either, the
|
||||||
values are inferred from the `doc` argument.
|
values are inferred from the `doc` argument.
|
||||||
|
|
||||||
|
## Pretraining data {#pretraining}
|
||||||
|
|
||||||
|
The [`spacy pretrain`](/api/cli#pretrain) command lets you pretrain the tok2vec
|
||||||
|
layer of pipeline components from raw text. Raw text can be provided as a
|
||||||
|
`.jsonl` (newline-delimited JSON) file containing one input text per line
|
||||||
|
(roughly paragraph length is good). Optionally, custom tokenization can be
|
||||||
|
provided.
|
||||||
|
|
||||||
|
> #### Tip: Writing JSONL
|
||||||
|
>
|
||||||
|
> Our utility library [`srsly`](https://github.com/explosion/srsly) provides a
|
||||||
|
> handy `write_jsonl` helper that takes a file path and list of dictionaries and
|
||||||
|
> writes out JSONL-formatted data.
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> import srsly
|
||||||
|
> data = [{"text": "Some text"}, {"text": "More..."}]
|
||||||
|
> srsly.write_jsonl("/path/to/text.jsonl", data)
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Key | Type | Description |
|
||||||
|
| -------- | ---- | ---------------------------------------------------------- |
|
||||||
|
| `text` | str | The raw input text. Not required if `tokens` is available. |
|
||||||
|
| `tokens` | list | Optional tokenization, one string per token. |
|
||||||
|
|
||||||
|
```json
|
||||||
|
### Example
|
||||||
|
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
|
||||||
|
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
|
||||||
|
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
|
||||||
|
{"tokens": ["If", "tokens", "are", "provided", "then", "we", "can", "skip", "the", "raw", "input", "text"]}
|
||||||
|
```
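
Before starting a long pretraining run, it can help to confirm that every line of the file is valid JSON and carries at least one of the two keys from the table above. The sketch below uses only the standard library; `parse_pretraining_lines` is a hypothetical helper for illustration, not part of spaCy or `srsly`.

```python
import json
from typing import Any, Dict, Iterable, List


def parse_pretraining_lines(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Hypothetical helper: parse JSONL lines and require "text" or "tokens"."""
    records: List[Dict[str, Any]] = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        if "text" not in record and "tokens" not in record:
            raise ValueError(f"Line {line_number}: expected a 'text' or 'tokens' key")
        records.append(record)
    return records


sample = [
    '{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}',
    '{"tokens": ["If", "tokens", "are", "provided"]}',
]
records = parse_pretraining_lines(sample)
print(f"{len(records)} valid records")
```

To check a file on disk, the open file handle can be passed directly, since iterating over it yields one line at a time.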
|
||||||
|
|
||||||
## Training config {#config new="3"}
|
## Training config {#config new="3"}
|
||||||
|
|
||||||
Config files define the training process and model pipeline and can be passed to
|
Config files define the training process and model pipeline and can be passed to
|
||||||
|
|
|
@ -172,6 +172,8 @@ available for the different architectures are documented with the
|
||||||
|
|
||||||
### Overwriting config settings on the command line {#config-overrides}
|
### Overwriting config settings on the command line {#config-overrides}
|
||||||
|
|
||||||
|
<!-- TODO: change example to use file path overrides -->
|
||||||
|
|
||||||
The config system means that you can define all settings **in one place** and in
|
The config system means that you can define all settings **in one place** and in
|
||||||
a consistent format. There are no command-line arguments that need to be set,
|
a consistent format. There are no command-line arguments that need to be set,
|
||||||
and no hidden defaults. However, there can still be scenarios where you may want
|
and no hidden defaults. However, there can still be scenarios where you may want
|
||||||
|
|
|
@ -20,6 +20,7 @@ menu:
|
||||||
| Removed | Replacement |
|
| Removed | Replacement |
|
||||||
| -------------------------------------------------------- | ----------------------------------------- |
|
| -------------------------------------------------------- | ----------------------------------------- |
|
||||||
| `GoldParse` | [`Example`](/api/example) |
|
| `GoldParse` | [`Example`](/api/example) |
|
||||||
|
| `GoldCorpus` | [`Corpus`](/api/corpus) |
|
||||||
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
|
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
|
||||||
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
|
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
|
||||||
|
|
||||||
|
|
|
@ -82,6 +82,10 @@ export default function QuickstartTraining({ id, title, download = 'config.cfg'
|
||||||
hidePrompts
|
hidePrompts
|
||||||
>
|
>
|
||||||
<QS comment>{COMMENT}</QS>
|
<QS comment>{COMMENT}</QS>
|
||||||
|
<span>[paths]</span>
|
||||||
|
<span>train = ""</span>
|
||||||
|
<span>dev = ""</span>
|
||||||
|
<br />
|
||||||
<span>[nlp]</span>
|
<span>[nlp]</span>
|
||||||
<span>lang = "{lang}"</span>
|
<span>lang = "{lang}"</span>
|
||||||
<span>pipeline = {JSON.stringify(pipeline).replace(/,/g, ', ')}</span>
|
<span>pipeline = {JSON.stringify(pipeline).replace(/,/g, ', ')}</span>
|
||||||
|
|