WIP: Update docs [ci skip]

This commit is contained in:
Ines Montani 2020-08-06 13:10:15 +02:00
parent 4d34efa697
commit 5d417d3b19
5 changed files with 227 additions and 191 deletions

View File

@ -219,23 +219,22 @@ The command will create all objects in the tree and validate them. Note that
some config validation errors are blocking and will prevent the rest of the
config from being resolved. This means that you may not see all validation
errors at once and some issues are only shown once previous errors have been
fixed. To auto-fill a partial config and save the result, you can use the
[`init config`](/api/cli#init-config) command.
```bash
$ python -m spacy debug config [config_path] [--code_path] [--output] [--auto_fill] [--diff] [overrides]
```
> #### Example
>
> ```bash
> $ python -m spacy debug config ./config.cfg
> ```
<Accordion title="Example 1 output" spaced>
<Accordion title="Example output" spaced>
<!-- TODO: update examples with validation error of final config -->
```
✘ Config validation error
@ -254,30 +253,15 @@ training -> width extra fields not permitted
</Accordion>
| Argument | Type | Default | Description |
| --------------------- | ---------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `config_path` | positional | - | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `--code_path`, `-c` | option | `None` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
| `--auto_fill`, `-F` | option | `False` | Whether or not to auto-fill the config with built-in defaults if possible. If `False`, the provided config needs to be complete. |
| `--output_path`, `-o` | option | `None` | Output path where the filled config can be stored. Use '-' for standard output. |
| `--diff`, `-D` | option | `False` | Show a visual diff if config was auto-filled. |
| `--help`, `-h` | flag | `False` | Show help message and available arguments. |
| overrides | | `None` | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
### debug data {#debug-data}
@ -289,19 +273,20 @@ low data labels and more.
The `debug-data` command is now available as a subcommand of `spacy debug`. It
takes the same arguments as `train` and reads settings off the
[`config.cfg` file](/usage/training#config) and optional
[overrides](/usage/training#config-overrides) on the CLI.
</Infobox>
```bash
$ python -m spacy debug data [config_path] [--code] [--ignore-warnings]
[--verbose] [--no-format] [overrides]
```
> #### Example
>
> ```bash
> $ python -m spacy debug data ./config.cfg
> ```
<Accordion title="Example output" spaced>
@ -443,17 +428,15 @@ will not be available.
</Accordion>
| Argument | Type | Description |
| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
| `--verbose`, `-V` | flag | Print additional information and explanations. |
| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
| `--help`, `-h` | flag | Show help message and available arguments. |
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
<!-- TODO: document debug profile?-->
@ -466,13 +449,16 @@ sample text and checking how it updates its internal weights and parameters.
$ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR] [-GRAD] [-ATTR] [-P0] [-P1] [-P2] [-P3] [--gpu_id]
```
> #### Example 1
>
> ```bash
> $ python -m spacy debug model ./config.cfg tagger -P0
> ```
<Accordion title="Example outputs" spaced>
<Accordion title="Example 1 output" spaced>
```
Using CPU
@ -509,20 +495,16 @@ $ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR]
...
```
</Accordion>
In this example log, we just print the name of each layer after creation of the
model ("Step 0"), which helps us understand the internal structure of the
neural network and focus on specific layers that we want to inspect further
(see next example).
> #### Example 2
>
> ```bash
> $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P2
> ```
<Accordion title="Example 2 output" spaced>
```
Using CPU
@ -563,27 +545,20 @@ Neural Network, and to focus on specific layers that we want to inspect further
</Accordion>
In this example log, we see how initialization of the model (Step 1) propagates
the correct values for the `nI` (input) and `nO` (output) dimensions of the
various layers. In the `softmax` layer, this step also defines the `W` matrix as
an all-zero matrix determined by the `nO` and `nI` dimensions. After a first
training step (Step 2), this matrix has clearly updated its values through the
training feedback loop.
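The zero-initialized `W` matrix is easy to verify outside the CLI as well.
Here's a minimal sketch using Thinc directly — assuming Thinc 8's `Softmax`
layer and its `initialize`, `get_dim` and `get_param` methods — that shows how
the dimensions and the all-zero `W` matrix come about:

```python
import numpy
from thinc.api import Softmax

# Before initialization, the layer's nI/nO dimensions are unset
model = Softmax()

# Sample input (width 10) and output (width 5) data let Thinc infer the dims
X = numpy.zeros((4, 10), dtype="f")
Y = numpy.zeros((4, 5), dtype="f")
model.initialize(X=X, Y=Y)

print(model.get_dim("nI"), model.get_dim("nO"))  # 10 5
print(model.get_param("W").shape)  # (5, 10) -- all zeros until training updates it
```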
| Argument | Type | Default | Description |
| ----------------------- | ---------- | ------- | ----------------------------------------------------------------------------------------------------- |
| `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `component`             | positional |         | Name of the pipeline component whose model should be analyzed.                                          |
| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. |
| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. |
| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. |
| `--gradients`, `-GRAD` | option | `False` | Show gradients of each layer. |
| `--attributes`, `-ATTR` | option | `False` | Show attributes of each layer. |
| `--print-step0`, `-P0` | option | `False` | Print model before training. |
| `--print-step1`, `-P1` | option | `False` | Print model after initialization. |
| `--print-step2`, `-P2` | option | `False` | Print model after training. |
| `--print-step3`, `-P3` | option | `False` | Print final predictions. |
| `--help`, `-h` | flag | | Show help message and available arguments. |
## Train {#train}
@ -603,37 +578,37 @@ you need to manage complex multi-step training workflows, check out the new
The `train` command doesn't take a long list of command-line arguments anymore
and instead expects a single [`config.cfg` file](/usage/training#config)
containing all settings for the pipeline, training process and hyperparameters.
Config values can be [overwritten](/usage/training#config-overrides) on the CLI
if needed. For example, `--paths.train ./train.spacy` sets the variable `train`
in the section `[paths]`.
</Infobox>
```bash
$ python -m spacy train [config_path] [--output] [--code] [--verbose] [overrides]
```
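To make the mapping between CLI overrides and config sections concrete, here's
a rough sketch using spaCy's `util.load_config` helper — assuming its v3
`overrides` keyword, which takes the same dot notation as the CLI; the file
paths are hypothetical:

```python
from spacy.util import load_config

# Passing --paths.train ./train.spacy on the CLI corresponds to overriding the
# "train" value in the [paths] section of the config
config = load_config("./config.cfg", overrides={"paths.train": "./train.spacy"})
print(config["paths"]["train"])  # ./train.spacy
```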
| Argument | Type | Description |
| ----------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `--output`, `-o`  | option     | Directory to store model in. Will be created if it doesn't exist.                                                                                                     |
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
| `--verbose`, `-V` | flag | Show more detailed messages during training. |
| `--help`, `-h` | flag | Show help message and available arguments. |
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
| **CREATES** | model | The final model and the best model. |
## Pretrain {#pretrain new="2.1" tag="experimental"}
<!-- TODO: document new pretrain command and link to new pretraining docs -->
Pre-train the "token to vector" (`tok2vec`) layer of pipeline components, using
an approximate language-modeling objective. Specifically, we load pretrained
vectors, and train a component like a CNN, BiLSTM, etc to predict vectors which
match the pretrained ones. The weights are saved to a directory after each
epoch. You can then pass a path to one of these pretrained weights files to the
`spacy train` command. This technique may be especially helpful if you have
little labelled data.
Pre-train the "token to vector" (`tok2vec`) layer of pipeline components on
[raw text](/api/data-formats#pretrain), using an approximate language-modeling
objective. Specifically, we load pretrained vectors, and train a component like
a CNN, BiLSTM, etc to predict vectors which match the pretrained ones. The
weights are saved to a directory after each epoch. You can then pass a path to
one of these pretrained weights files to the `spacy train` command. This
technique may be especially helpful if you have little labelled data.
<Infobox title="Changed in v3.0" variant="warning">
@ -650,48 +625,17 @@ $ python -m spacy pretrain [texts_loc] [output_dir] [config_path]
[--code] [--resume-path] [--epoch-resume] [overrides]
```
| Argument | Type | Description |
| ----------------------- | ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](/api/data-formats#pretrain) for details. |
| `output_dir` | positional | Directory to write models to on each epoch. |
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
| `--resume-path`, `-r` | option | TODO: |
| `--epoch-resume`, `-er` | option | TODO: |
| `--help`, `-h` | flag | Show help message and available arguments. |
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. |
| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. |
## Evaluate {#evaluate new="2"}

View File

@ -3,6 +3,7 @@ title: Data formats
teaser: Details on spaCy's input and output data formats
menu:
- ['Training Data', 'training']
- ['Pretraining Data', 'pretraining']
- ['Training Config', 'config']
- ['Vocabulary', 'vocab']
---
@ -16,17 +17,30 @@ label schemes used in its components, depending on the data it was trained on.
### Binary training format {#binary-training new="3"}
The built-in [`convert`](/api/cli#convert) command helps you convert the
`.conllu` format used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies), as
well as spaCy's previous [JSON format](#json-input), to spaCy's binary training
format.
<!-- TODO: document DocBin format -->
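Until the `DocBin` format is documented in detail here, the following is a
minimal sketch of creating `.spacy` training data from annotated `Doc`
objects — the annotation is a toy example, and the output path is hypothetical:

```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
doc = nlp.make_doc("Apple is looking at buying U.K. startup")
# Attach the gold-standard annotation: mark "Apple" (token 0) as ORG
doc.ents = [Span(doc, 0, 1, label="ORG")]

# Serialize one or more Doc objects to the binary .spacy format
doc_bin = DocBin(docs=[doc])
doc_bin.to_disk("./train.spacy")  # pass this path to spacy train
```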
### JSON training format {#json-input tag="deprecated"}
spaCy takes training data in JSON format. The built-in
[`convert`](/api/cli#convert) command helps you convert the `.conllu` format
used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies) to
spaCy's training format. To convert one or more existing `Doc` objects to
spaCy's JSON format, you can use the
[`gold.docs_to_json`](/api/top-level#docs_to_json) helper.
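For example, a short sketch of exporting `Doc` objects to the JSON training
format — assuming `docs_to_json` accepts a list of `Doc` objects as in v2.x,
with a hypothetical output path:

```python
import spacy
import srsly
from spacy.gold import docs_to_json

nlp = spacy.load("en_core_web_sm")
docs = [nlp("This is a sentence."), nlp("This is another sentence.")]

# docs_to_json returns one JSON-serializable dict covering the batch of docs;
# the training format expects a list of such dicts
json_data = docs_to_json(docs)
srsly.write_json("./data.json", [json_data])
```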
<Infobox variant="warning" title="Changed in v3.0">
As of v3.0, the JSON input format is deprecated and is replaced by the
[binary format](#binary-training). Instead of converting [`Doc`](/api/doc)
objects to JSON, you can now serialize them directly using the
[`DocBin`](/api/docbin) container and then use them as input data.
[`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`
format:
```bash
$ python -m spacy convert ./data.json ./output
```
</Infobox>
> #### Annotating entities {#biluo}
>
@ -68,61 +82,99 @@ spaCy's JSON format, you can use the
}]
```
<Accordion title="Sample JSON data" spaced>
Here's an example of dependencies, part-of-speech tags and named entities, taken
from the English Wall Street Journal portion of the Penn Treebank:
```json
https://github.com/explosion/spaCy/blob/v2.3.x/examples/training/training-data.json
```
</Accordion>
### Annotation format for creating training examples {#dict-input}
An [`Example`](/api/example) object holds the information for one training
instance. It stores two [`Doc`](/api/doc) objects: one for holding the
gold-standard reference data, and one for holding the predictions of the
pipeline. Examples can be created using the
[`Example.from_dict`](/api/example#from_dict) method with a reference `Doc` and
a dictionary of gold-standard annotations. There are currently two formats
supported for this dictionary of annotations: one with a simple, **flat
structure** of keywords, and one with a more **hierarchical structure**.
> #### Example
>
> ```python
> example = Example.from_dict(doc, gold_dict)
> ```
<Infobox title="Important note" variant="warning">
`Example` objects are used as part of the
[internal training API](/usage/training#api) and they're expected when you call
[`nlp.update`](/api/language#update). However, for most use cases, you
**shouldn't** have to write your own training scripts. It's recommended to train
your models via the [`spacy train`](/api/cli#train) command with a config file
to keep track of your settings and hyperparameters and your own
[registered functions](/usage/training/#custom-code) to customize the setup.
</Infobox>
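That said, if you do work with the internal API, the rough shape looks like
this — a sketch assuming the v3 `spacy.training.Example` import path and a toy
NER annotation:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("ner")

train_data = [("I like London.", {"entities": [(7, 13, "GPE")]})]
examples = [
    Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data
]

# initialize() infers the labels from the examples and returns an optimizer
optimizer = nlp.initialize(lambda: examples)
losses = nlp.update(examples, sgd=optimizer)
print(losses)
```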
#### Flat structure {#dict-flat}
Here is the full overview of potential entries in a flat dictionary of
annotations. You only need to specify the keys corresponding to the task you
want to train.
> #### Example
>
> ```python
> {
> "text": str,
> "words": List[str],
> "lemmas": List[str],
> "spaces": List[bool],
> "tags": List[str],
> "pos": List[str],
> "morphs": List[str],
> "sent_starts": List[bool],
> "deps": List[string],
> "heads": List[int],
> "entities": List[str],
> "entities": List[(int, int, str)],
> "cats": Dict[str, float],
> "links": Dict[(int, int), dict],
> }
> ```
| Name | Type | Description |
| ------------- | ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `text`        | `str`                        | Raw text.                                                                                                                                                                                        |
| `words` | `List[str]` | List of gold-standard tokens. |
| `lemmas` | `List[str]` | List of lemmas. |
| `spaces`      | `List[bool]`                 | List of boolean values indicating whether the corresponding token is followed by a space or not.                                                                                                 |
| `tags` | `List[str]` | List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging). |
| `pos` | `List[str]` | List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging). |
| `morphs` | `List[str]` | List of [morphological features](/usage/linguistic-features#rule-based-morphology). |
| `sent_starts` | `List[bool]` | List of boolean values indicating whether each token is the first of a sentence or not. |
| `deps` | `List[str]` | List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head. |
| `heads` | `List[int]` | List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text. |
| `entities` | `List[str]` | Option 1: List of [BILUO tags](#biluo) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens. |
| `entities` | `List[Tuple[int, int, str]]` | Option 2: List of `"(start, end, label)"` tuples defining all entities in the text. |
| `cats` | `Dict[str, float]` | Dictionary of `label`/`value` pairs indicating how relevant a certain [text category](/api/textcategorizer) is for the text. |
| `links` | `Dict[(int, int), Dict]` | Dictionary of `offset`/`dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The character offsets are linked to a dictionary of relevant knowledge base IDs. |
<Infobox variant="warning" title="Important notes and caveats">
- Multiple formats are possible for the `"entities"` entry, but you have to
  pick one.
- Any values for sentence starts will be ignored if there are annotations for
  dependency relations.
- If the dictionary contains values for `"text"` and `"words"`, but not
  `"spaces"`, the latter are inferred automatically. If `"words"` is not
  provided either, the values are inferred from the `Doc` argument.
</Infobox>
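Putting the flat format together, here's a small sketch that creates an
`Example` from a flat annotation dictionary and reads the gold standard back
off the reference `Doc` — again assuming the v3 `spacy.training.Example` import
path; the tags, heads and dependency labels are toy values:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("She ate the pizza")
gold_dict = {
    "words": ["She", "ate", "the", "pizza"],
    "tags": ["PRP", "VBD", "DT", "NN"],
    "heads": [1, 1, 3, 1],  # absolute token indices, "ate" is the root
    "deps": ["nsubj", "ROOT", "det", "dobj"],
}
example = Example.from_dict(doc, gold_dict)

# The gold-standard annotations live on the reference doc
print([t.tag_ for t in example.reference])  # ['PRP', 'VBD', 'DT', 'NN']
print([t.dep_ for t in example.reference])  # ['nsubj', 'ROOT', 'det', 'dobj']
```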
<!-- TODO: finish reformatting below -->
##### Examples
@ -192,6 +244,39 @@ There are a few caveats to take into account:
latter are inferred automatically. If "ORTH" is not provided either, the
values are inferred from the `doc` argument.
## Pretraining data {#pretraining}
The [`spacy pretrain`](/api/cli#pretrain) command lets you pretrain the tok2vec
layer of pipeline components from raw text. Raw text can be provided as a
`.jsonl` (newline-delimited JSON) file containing one input text per line
(roughly paragraph length is good). Optionally, custom tokenization can be
provided.
> #### Tip: Writing JSONL
>
> Our utility library [`srsly`](https://github.com/explosion/srsly) provides a
> handy `write_jsonl` helper that takes a file path and list of dictionaries and
> writes out JSONL-formatted data.
>
> ```python
> import srsly
> data = [{"text": "Some text"}, {"text": "More..."}]
> srsly.write_jsonl("/path/to/text.jsonl", data)
> ```
| Key | Type | Description |
| -------- | ---- | ---------------------------------------------------------- |
| `text`   | str  | The raw input text. Not required if `tokens` is available.  |
| `tokens` | list | Optional tokenization, one string per token. |
```json
### Example
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
{"tokens": ["If", "tokens", "are", "provided", "then", "we", "can", "skip", "the", "raw", "input", "text"]}
```
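To sanity-check such a corpus before pretraining, `srsly` can also read the
format back — a quick sketch with a hypothetical path:

```python
import srsly

# read_jsonl lazily yields one parsed dict per line
for record in srsly.read_jsonl("/path/to/text.jsonl"):
    assert "text" in record or "tokens" in record
```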
## Training config {#config new="3"}
Config files define the training process and model pipeline and can be passed to

View File

@ -172,6 +172,8 @@ available for the different architectures are documented with the
### Overwriting config settings on the command line {#config-overrides}
<!-- TODO: change example to use file path overrides -->
The config system means that you can define all settings **in one place** and in
a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want

View File

@ -20,6 +20,7 @@ menu:
| Removed | Replacement |
| -------------------------------------------------------- | ----------------------------------------- |
| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |

View File

@ -82,6 +82,10 @@ export default function QuickstartTraining({ id, title, download = 'config.cfg'
hidePrompts
>
<QS comment>{COMMENT}</QS>
<span>[paths]</span>
<span>train = ""</span>
<span>dev = ""</span>
<br />
<span>[nlp]</span>
<span>lang = "{lang}"</span>
<span>pipeline = {JSON.stringify(pipeline).replace(/,/g, ', ')}</span>