WIP: Update docs [ci skip]

Ines Montani 2020-08-06 13:10:15 +02:00
parent 4d34efa697
commit 5d417d3b19
5 changed files with 227 additions and 191 deletions


@ -219,23 +219,22 @@ The command will create all objects in the tree and validate them. Note that
some config validation errors are blocking and will prevent the rest of the some config validation errors are blocking and will prevent the rest of the
config from being resolved. This means that you may not see all validation config from being resolved. This means that you may not see all validation
errors at once and some issues are only shown once previous errors have been errors at once and some issues are only shown once previous errors have been
fixed. fixed. To auto-fill a partial config and save the result, you can use the
[`init config`](/api/cli#init-config) command.
Instead of specifying all required settings in the config file, you can rely on
an auto-fill functionality that uses spaCy's built-in defaults. The resulting
full config can be written to file and used in downstream training tasks.
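
The auto-fill idea can be pictured as a recursive merge of a partial config into a set of defaults. The sketch below is only an illustration of that idea, not spaCy's implementation: the `fill_config` helper and the default values are hypothetical and chosen for the example.

```python
from typing import Any, Dict


def fill_config(partial: Dict[str, Any], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `defaults` with every value from `partial` applied on top."""
    filled = dict(defaults)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(filled.get(key), dict):
            # Both sides are sections: merge them key by key.
            filled[key] = fill_config(value, filled[key])
        else:
            # A user-provided value always wins over the default.
            filled[key] = value
    return filled


# Hypothetical defaults, for illustration only (not spaCy's built-in values).
defaults = {"training": {"dropout": 0.1, "max_epochs": 0}, "nlp": {"lang": None}}
partial = {"nlp": {"lang": "en"}, "training": {"dropout": 0.2}}

filled = fill_config(partial, defaults)
print(filled["training"])
```

Settings missing from the partial config (here `max_epochs`) are taken from the defaults, while settings the user provided are left untouched.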
```bash ```bash
$ python -m spacy debug config [config_path] [--code_path] [--output] [--auto_fill] [--diff] [overrides] $ python -m spacy debug config [config_path] [--code_path] [--output] [--auto_fill] [--diff] [overrides]
``` ```
> #### Example 1 > #### Example
> >
> ```bash > ```bash
> $ python -m spacy debug config ./config.cfg > $ python -m spacy debug config ./config.cfg
> ``` > ```
<Accordion title="Example 1 output" spaced> <Accordion title="Example output" spaced>
<!-- TODO: update examples with validation error of final config -->
``` ```
✘ Config validation error ✘ Config validation error
@ -254,30 +253,15 @@ training -> width extra fields not permitted
</Accordion> </Accordion>
> #### Example 2
>
> ```bash
> $ python -m spacy debug config ./minimal_config.cfg -F -o ./filled_config.cfg
> ```
<Accordion title="Example 2 output" spaced>
```
✔ Auto-filled config is valid
✔ Saved updated config to ./filled_config.cfg
```
</Accordion>
| Argument | Type | Default | Description | | Argument | Type | Default | Description |
| --------------------- | ---------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | | --------------------- | ---------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `config_path` | positional | - | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | | `config_path` | positional | - | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `--code_path`, `-c` | option | `None` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | | `--code_path`, `-c` | option | `None` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
| `--auto_fill`, `-F` | option | `False` | Whether or not to auto-fill the config with built-in defaults if possible. If `False`, the provided config needs to be complete. | | `--auto_fill`, `-F` | option | `False` | Whether or not to auto-fill the config with built-in defaults if possible. If `False`, the provided config needs to be complete. |
| `--output_path`, `-o` | option | `None` | Output path where the filled config can be stored. Use '-' for standard output. | | `--output_path`, `-o` | option | `None` | Output path where the filled config can be stored. Use '-' for standard output. |
| `--diff`, `-D` | option | `False` | Show a visual diff if config was auto-filled. | | `--diff`, `-D` | option | `False` | Show a visual diff if config was auto-filled. |
| `--help`, `-h` | flag | `False` | Show help message and available arguments. | | `--help`, `-h` | flag | `False` | Show help message and available arguments. |
| overrides | | `None` | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. | | overrides | | `None` | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
### debug data {#debug-data} ### debug data {#debug-data}
@ -289,19 +273,20 @@ low data labels and more.
The `debug-data` command is now available as a subcommand of `spacy debug`. It The `debug-data` command is now available as a subcommand of `spacy debug`. It
takes the same arguments as `train` and reads settings off the takes the same arguments as `train` and reads settings off the
[`config.cfg` file](/usage/training#config). [`config.cfg` file](/usage/training#config) and optional
[overrides](/usage/training#config-overrides) on the CLI.
</Infobox> </Infobox>
```bash ```bash
$ python -m spacy debug data [train_path] [dev_path] [config_path] [--code] $ python -m spacy debug data [config_path] [--code] [--ignore-warnings]
[--ignore-warnings] [--verbose] [--no-format] [overrides] [--verbose] [--no-format] [overrides]
``` ```
> #### Example > #### Example
> >
> ```bash > ```bash
> $ python -m spacy debug data ./train.spacy ./dev.spacy ./config.cfg > $ python -m spacy debug data ./config.cfg
> ``` > ```
<Accordion title="Example output" spaced> <Accordion title="Example output" spaced>
@ -444,16 +429,14 @@ will not be available.
</Accordion> </Accordion>
| Argument | Type | Description | | Argument | Type | Description |
| -------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | | -------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `train_path` | positional | Location of [binary training data](/usage/training#data-format). Can be a file or a directory of files. |
| `dev_path` | positional | Location of [binary development data](/usage/training#data-format) for evaluation. Can be a file or a directory of files. |
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | | `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | | `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. | | `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
| `--verbose`, `-V` | flag | Print additional information and explanations. | | `--verbose`, `-V` | flag | Print additional information and explanations. |
| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. | | `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
| `--help`, `-h` | flag | Show help message and available arguments. | | `--help`, `-h` | flag | Show help message and available arguments. |
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. | | overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
<!-- TODO: document debug profile?--> <!-- TODO: document debug profile?-->
@ -466,13 +449,16 @@ sample text and checking how it updates its internal weights and parameters.
$ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR] [-GRAD] [-ATTR] [-P0] [-P1] [-P2] [-P3] [--gpu_id] $ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR] [-GRAD] [-ATTR] [-P0] [-P1] [-P2] [-P3] [--gpu_id]
``` ```
> #### Example 1 <Accordion title="Example outputs" spaced>
>
> ```bash
> $ python -m spacy debug model ./config.cfg tagger -P0
> ```
<Accordion title="Example 1 output" spaced> In this example log, we just print the name of each layer after creation of the
model ("Step 0"), which helps us to understand the internal structure of the
Neural Network, and to focus on specific layers that we want to inspect further
(see next example).
```bash
$ python -m spacy debug model ./config.cfg tagger -P0
```
``` ```
Using CPU Using CPU
@ -509,20 +495,16 @@ $ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR]
... ...
``` ```
</Accordion> In this example log, we see how initialization of the model (Step 1) propagates
the correct values for the `nI` (input) and `nO` (output) dimensions of the
various layers. In the `softmax` layer, this step also defines the `W` matrix as
an all-zero matrix determined by the `nO` and `nI` dimensions. After a first
training step (Step 2), this matrix has clearly updated its values through the
training feedback loop.
In this example log, we just print the name of each layer after creation of the ```bash
model ("Step 0"), which helps us to understand the internal structure of the $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P2
Neural Network, and to focus on specific layers that we want to inspect further ```
(see next example).
> #### Example 2
>
> ```bash
> $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P2
> ```
<Accordion title="Example 2 output" spaced>
``` ```
Using CPU Using CPU
@ -563,17 +545,10 @@ Neural Network, and to focus on specific layers that we want to inspect further
</Accordion> </Accordion>
In this example log, we see how initialization of the model (Step 1) propagates
the correct values for the `nI` (input) and `nO` (output) dimensions of the
various layers. In the `softmax` layer, this step also defines the `W` matrix as
an all-zero matrix determined by the `nO` and `nI` dimensions. After a first
training step (Step 2), this matrix has clearly updated its values through the
training feedback loop.
| Argument | Type | Default | Description | | Argument | Type | Default | Description |
| ----------------------- | ---------- | ------- | ---------------------------------------------------------------------------------------------------- | | ----------------------- | ---------- | ------- | ----------------------------------------------------------------------------------------------------- |
| `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | | `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `component` | positional | | Name of the pipeline component of which the model should be analysed. | | `component` | positional | | Name of the pipeline component of which the model should be analyzed. |
| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. | | `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. |
| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. | | `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. |
| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. | | `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. |
@ -603,37 +578,37 @@ you need to manage complex multi-step training workflows, check out the new
The `train` command doesn't take a long list of command-line arguments anymore The `train` command doesn't take a long list of command-line arguments anymore
and instead expects a single [`config.cfg` file](/usage/training#config) and instead expects a single [`config.cfg` file](/usage/training#config)
containing all settings for the pipeline, training process and hyperparameters. containing all settings for the pipeline, training process and hyperparameters.
Config values can be [overwritten](/usage/training#config-overrides) on the CLI
if needed. For example, `--paths.train ./train.spacy` sets the variable `train`
in the section `[paths]`.
</Infobox> </Infobox>
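
To make the mapping from an override to a config section concrete, here is a minimal sketch in plain Python. It assumes a config represented as a dictionary of sections and handles only one level of nesting. The `apply_overrides` helper is hypothetical and is not how spaCy parses its command line.

```python
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Dict[str, Any]], args: List[str]) -> Dict[str, Dict[str, Any]]:
    """Apply CLI-style overrides such as ["--paths.train", "./train.spacy"]."""
    # Copy each section so the original config is left unchanged.
    result = {section: dict(values) for section, values in config.items()}
    # Overrides come in pairs: a dotted "--section.key" name and its value.
    for name, value in zip(args[::2], args[1::2]):
        if not name.startswith("--"):
            raise ValueError(f"Override must start with '--': {name}")
        section, _, key = name[2:].partition(".")
        if not key:
            raise ValueError(f"Override must look like --section.key: {name}")
        result.setdefault(section, {})[key] = value
    return result


config = {"paths": {"train": "", "dev": ""}, "nlp": {"lang": "en"}}
updated = apply_overrides(config, ["--paths.train", "./train.spacy"])
print(updated["paths"])  # {'train': './train.spacy', 'dev': ''}
```

The part before the dot selects the section (`[paths]`) and the part after it selects the variable (`train`), which matches the example given above.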
```bash ```bash
$ python -m spacy train [train_path] [dev_path] [config_path] [--output] $ python -m spacy train [config_path] [--output] [--code] [--verbose] [overrides]
[--code] [--verbose] [overrides]
``` ```
| Argument | Type | Description | | Argument | Type | Description |
| ----------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `train_path` | positional | Location of training data in spaCy's [binary format](/api/data-formats#training). Can be a file or a directory of files. |
| `dev_path` | positional | Location of development data for evaluation in spaCy's [binary format](/api/data-formats#training). Can be a file or a directory of files. |
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | | `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `--output`, `-o` | option | Directory to store model in. Will be created if it doesn't exist. | | `--output`, `-o` | option | Directory to store model in. Will be created if it doesn't exist. |
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | | `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
| `--verbose`, `-V` | flag | Show more detailed messages during training. | | `--verbose`, `-V` | flag | Show more detailed messages during training. |
| `--help`, `-h` | flag | Show help message and available arguments. | | `--help`, `-h` | flag | Show help message and available arguments. |
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. | | overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
| **CREATES** | model | The final model and the best model. | | **CREATES** | model | The final model and the best model. |
## Pretrain {#pretrain new="2.1" tag="experimental"} ## Pretrain {#pretrain new="2.1" tag="experimental"}
<!-- TODO: document new pretrain command and link to new pretraining docs --> <!-- TODO: document new pretrain command and link to new pretraining docs -->
Pre-train the "token to vector" (`tok2vec`) layer of pipeline components, using Pre-train the "token to vector" (`tok2vec`) layer of pipeline components on
an approximate language-modeling objective. Specifically, we load pretrained [raw text](/api/data-formats#pretraining), using an approximate language-modeling
vectors, and train a component like a CNN, BiLSTM, etc to predict vectors which objective. Specifically, we load pretrained vectors, and train a component like
match the pretrained ones. The weights are saved to a directory after each a CNN, BiLSTM, etc to predict vectors which match the pretrained ones. The
epoch. You can then pass a path to one of these pretrained weights files to the weights are saved to a directory after each epoch. You can then pass a path to
`spacy train` command. This technique may be especially helpful if you have one of these pretrained weights files to the `spacy train` command. This
little labelled data. technique may be especially helpful if you have little labelled data.
<Infobox title="Changed in v3.0" variant="warning"> <Infobox title="Changed in v3.0" variant="warning">
@ -651,8 +626,8 @@ $ python -m spacy pretrain [texts_loc] [output_dir] [config_path]
``` ```
| Argument | Type | Description | | Argument | Type | Description |
| ----------------------- | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------------- | ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](#pretrain-jsonl) for details. | | `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](/api/data-formats#pretraining) for details. |
| `output_dir` | positional | Directory to write models to on each epoch. | | `output_dir` | positional | Directory to write models to on each epoch. |
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | | `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | | `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
@ -662,37 +637,6 @@ $ python -m spacy pretrain [texts_loc] [output_dir] [config_path]
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. | | overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. |
| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. | | **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. |
### JSONL format for raw text {#pretrain-jsonl}
Raw text can be provided as a `.jsonl` (newline-delimited JSON) file containing
one input text per line (roughly paragraph length is good). Optionally, custom
tokenization can be provided.
> #### Tip: Writing JSONL
>
> Our utility library [`srsly`](https://github.com/explosion/srsly) provides a
> handy `write_jsonl` helper that takes a file path and list of dictionaries and
> writes out JSONL-formatted data.
>
> ```python
> import srsly
> data = [{"text": "Some text"}, {"text": "More..."}]
> srsly.write_jsonl("/path/to/text.jsonl", data)
> ```
| Key | Type | Description |
| -------- | ---- | ---------------------------------------------------------- |
| `text` | str | The raw input text. Is not required if `tokens` available. |
| `tokens` | list | Optional tokenization, one string per token. |
```json
### Example
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
{"tokens": ["If", "tokens", "are", "provided", "then", "we", "can", "skip", "the", "raw", "input", "text"]}
```
## Evaluate {#evaluate new="2"} ## Evaluate {#evaluate new="2"}
<!-- TODO: document new evaluate command --> <!-- TODO: document new evaluate command -->


@ -3,6 +3,7 @@ title: Data formats
teaser: Details on spaCy's input and output data formats teaser: Details on spaCy's input and output data formats
menu: menu:
- ['Training Data', 'training'] - ['Training Data', 'training']
- ['Pretraining Data', 'pretraining']
- ['Training Config', 'config'] - ['Training Config', 'config']
- ['Vocabulary', 'vocab'] - ['Vocabulary', 'vocab']
--- ---
@ -16,17 +17,30 @@ label schemes used in its components, depending on the data it was trained on.
### Binary training format {#binary-training new="3"} ### Binary training format {#binary-training new="3"}
The built-in [`convert`](/api/cli#convert) command helps you convert the
`.conllu` format used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies) as
well as spaCy's previous [JSON format](#json-input).
<!-- TODO: document DocBin format --> <!-- TODO: document DocBin format -->
### JSON input format for training {#json-input} ### JSON training format {#json-input tag="deprecated"}
spaCy takes training data in JSON format. The built-in <Infobox variant="warning" title="Changed in v3.0">
[`convert`](/api/cli#convert) command helps you convert the `.conllu` format
used by the As of v3.0, the JSON input format is deprecated and is replaced by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies) to [binary format](#binary-training). Instead of converting [`Doc`](/api/doc)
spaCy's training format. To convert one or more existing `Doc` objects to objects to JSON, you can now serialize them directly using the
spaCy's JSON format, you can use the [`DocBin`](/api/docbin) container and then use them as input data.
[`gold.docs_to_json`](/api/top-level#docs_to_json) helper.
[`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`
format:
```bash
$ python -m spacy convert ./data.json ./output
```
</Infobox>
> #### Annotating entities {#biluo} > #### Annotating entities {#biluo}
> >
@ -68,61 +82,99 @@ spaCy's JSON format, you can use the
}] }]
``` ```
<Accordion title="Sample JSON data" spaced>
Here's an example of dependencies, part-of-speech tags and named entities, taken Here's an example of dependencies, part-of-speech tags and named entities, taken
from the English Wall Street Journal portion of the Penn Treebank: from the English Wall Street Journal portion of the Penn Treebank:
```json ```json
https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json https://github.com/explosion/spaCy/blob/v2.3.x/examples/training/training-data.json
``` ```
### Annotations in dictionary format {#dict-input} </Accordion>
To create [`Example`](/api/example) objects, you can create a dictionary of the ### Annotation format for creating training examples {#dict-input}
gold-standard annotations `gold_dict`, and then call
```python An [`Example`](/api/example) object holds the information for one training
example = Example.from_dict(doc, gold_dict) instance. It stores two [`Doc`](/api/doc) objects: one for holding the
``` gold-standard reference data, and one for holding the predictions of the
pipeline. Examples can be created using the
[`Example.from_dict`](/api/example#from_dict) method with a reference `Doc` and
a dictionary of gold-standard annotations. There are currently two formats
supported for this dictionary of annotations: one with a simple, **flat
structure** of keywords, and one with a more **hierarchical structure**.
There are currently two formats supported for this dictionary of annotations: > #### Example
one with a simple, flat structure of keywords, and one with a more hierarchical >
structure. > ```python
> example = Example.from_dict(doc, gold_dict)
> ```
<Infobox title="Important note" variant="warning">
`Example` objects are used as part of the
[internal training API](/usage/training#api) and they're expected when you call
[`nlp.update`](/api/language#update). However, for most use cases, you
**shouldn't** have to write your own training scripts. It's recommended to train
your models via the [`spacy train`](/api/cli#train) command with a config file
to keep track of your settings and hyperparameters and your own
[registered functions](/usage/training#custom-code) to customize the setup.
</Infobox>
#### Flat structure {#dict-flat} #### Flat structure {#dict-flat}
Here is the full overview of potential entries in a flat dictionary of > #### Example
annotations. You need to only specify those keys corresponding to the task you >
want to train. > ```python
> {
> "text": str,
> "words": List[str],
> "lemmas": List[str],
> "spaces": List[bool],
> "tags": List[str],
> "pos": List[str],
> "morphs": List[str],
> "sent_starts": List[bool],
> "deps": List[str],
> "heads": List[int],
> "entities": List[str],
> "entities": List[(int, int, str)],
> "cats": Dict[str, float],
> "links": Dict[(int, int), dict],
> }
> ```
```python | Name | Type | Description |
### Flat dictionary | ------------- | ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
{ | `text` | str | Raw text. |
"text": string, # Raw text. | `words` | `List[str]` | List of gold-standard tokens. |
"words": List[string], # List of gold tokens. | `lemmas` | `List[str]` | List of lemmas. |
"lemmas": List[string], # List of lemmas. | `spaces` | `List[bool]` | List of boolean values indicating whether the corresponding tokens is followed by a space or not. |
"spaces": List[bool], # List of boolean values indicating whether the corresponding tokens is followed by a space or not. | `tags` | `List[str]` | List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging). |
"tags": List[string], # List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging). | `pos` | `List[str]` | List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging). |
"pos": List[string], # List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging). | `morphs` | `List[str]` | List of [morphological features](/usage/linguistic-features#rule-based-morphology). |
"morphs": List[string], # List of [morphological features](/usage/linguistic-features#rule-based-morphology). | `sent_starts` | `List[bool]` | List of boolean values indicating whether each token is the first of a sentence or not. |
"sent_starts": List[bool], # List of boolean values indicating whether each token is the first of a sentence or not. | `deps` | `List[str]` | List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head. |
"deps": List[string], # List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head. | `heads` | `List[int]` | List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text. |
"heads": List[int], # List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text. | `entities` | `List[str]` | Option 1: List of [BILUO tags](#biluo) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens. |
"entities": List[string], # Option 1: List of [BILUO tags](#biluo) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens. | `entities` | `List[Tuple[int, int, str]]` | Option 2: List of `"(start, end, label)"` tuples defining all entities in the text. |
"entities": List[(int, int, string)], # Option 2: List of `"(start, end, label)"` tuples defining all entities in. | `cats` | `Dict[str, float]` | Dictionary of `label`/`value` pairs indicating how relevant a certain [text category](/api/textcategorizer) is for the text. |
"cats": Dict[str, float], # Dictionary of `label:value` pairs indicating how relevant a certain [category](/api/textcategorizer) is for the text. | `links` | `Dict[(int, int), Dict]` | Dictionary of `offset`/`dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The character offsets are linked to a dictionary of relevant knowledge base IDs. |
"links": Dict[(int, int), Dict], # Dictionary of `offset:dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The charachter offsets are linked to a dictionary of relevant knowledge base IDs.
}
```
There are a few caveats to take into account: <Infobox variant="warning" title="Important notes and caveats">
- Multiple formats are possible for the "entities" entry, but you have to pick - Multiple formats are possible for the "entities" entry, but you have to pick
one. one.
- Any values for sentence starts will be ignored if there are annotations for - Any values for sentence starts will be ignored if there are annotations for
dependency relations. dependency relations.
- If the dictionary contains values for "text" and "words", but not "spaces", - If the dictionary contains values for `"text"` and `"words"`, but not
the latter are inferred automatically. If "words" is not provided either, the `"spaces"`, the latter are inferred automatically. If "words" is not provided
values are inferred from the `doc` argument. either, the values are inferred from the `Doc` argument.
</Infobox>
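
The caveat about inferring `"spaces"` from `"text"` and `"words"` can be illustrated with a small alignment routine. This is a sketch under the assumption that each word appears in the text in order. `infer_spaces` is a hypothetical helper and not spaCy's own implementation.

```python
from typing import List


def infer_spaces(text: str, words: List[str]) -> List[bool]:
    """For each word, report whether it is directly followed by a space in `text`."""
    spaces = []
    position = 0
    for word in words:
        # Find the next occurrence of the word at or after the current position.
        start = text.find(word, position)
        if start < 0:
            raise ValueError(f"Could not align word {word!r} with the text")
        position = start + len(word)
        # Slicing avoids an IndexError for the last token in the text.
        spaces.append(text[position:position + 1] == " ")
    return spaces


annotations = {"text": "I like London.", "words": ["I", "like", "London", "."]}
annotations["spaces"] = infer_spaces(annotations["text"], annotations["words"])
print(annotations["spaces"])  # [True, True, False, False]
```

The resulting list has one boolean per token, which is the shape the `"spaces"` entry expects.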
<!-- TODO: finish reformatting below -->
##### Examples ##### Examples
@ -192,6 +244,39 @@ There are a few caveats to take into account:
latter are inferred automatically. If "ORTH" is not provided either, the latter are inferred automatically. If "ORTH" is not provided either, the
values are inferred from the `doc` argument. values are inferred from the `doc` argument.
## Pretraining data {#pretraining}
The [`spacy pretrain`](/api/cli#pretrain) command lets you pretrain the tok2vec
layer of pipeline components from raw text. Raw text can be provided as a
`.jsonl` (newline-delimited JSON) file containing one input text per line
(roughly paragraph length is good). Optionally, custom tokenization can be
provided.
> #### Tip: Writing JSONL
>
> Our utility library [`srsly`](https://github.com/explosion/srsly) provides a
> handy `write_jsonl` helper that takes a file path and list of dictionaries and
> writes out JSONL-formatted data.
>
> ```python
> import srsly
> data = [{"text": "Some text"}, {"text": "More..."}]
> srsly.write_jsonl("/path/to/text.jsonl", data)
> ```
| Key | Type | Description |
| -------- | ---- | ---------------------------------------------------------- |
| `text` | str | The raw input text. Is not required if `tokens` available. |
| `tokens` | list | Optional tokenization, one string per token. |
```json
### Example
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
{"tokens": ["If", "tokens", "are", "provided", "then", "we", "can", "skip", "the", "raw", "input", "text"]}
```
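
Before starting a long pretraining run, it can be useful to check that every line of the file is valid JSON and carries one of the two documented keys. The following sketch uses only the standard library. The `check_pretrain_lines` helper is hypothetical, and the checks reflect the key table above rather than any validation spaCy itself performs.

```python
import json
from typing import Iterable, List


def check_pretrain_lines(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL lines and verify each record has a "text" or "tokens" key."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # Skip blank lines, e.g. a trailing newline at the end of the file.
        record = json.loads(line)
        if "text" not in record and "tokens" not in record:
            raise ValueError(f"Line {line_no}: expected a 'text' or 'tokens' key")
        if "tokens" in record and not all(isinstance(t, str) for t in record["tokens"]):
            raise ValueError(f"Line {line_no}: 'tokens' must be a list of strings")
        records.append(record)
    return records


sample = [
    '{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}',
    '{"tokens": ["If", "tokens", "are", "provided"]}',
]
records = check_pretrain_lines(sample)
print(len(records))  # 2
```

To check a real file, the same function can be given an open file handle, since iterating over a file yields one line at a time.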
## Training config {#config new="3"} ## Training config {#config new="3"}
Config files define the training process and model pipeline and can be passed to Config files define the training process and model pipeline and can be passed to


@ -172,6 +172,8 @@ available for the different architectures are documented with the
### Overwriting config settings on the command line {#config-overrides} ### Overwriting config settings on the command line {#config-overrides}
<!-- TODO: change example to use file path overrides -->
The config system means that you can define all settings **in one place** and in The config system means that you can define all settings **in one place** and in
a consistent format. There are no command-line arguments that need to be set, a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want and no hidden defaults. However, there can still be scenarios where you may want


@ -20,6 +20,7 @@ menu:
| Removed | Replacement | | Removed | Replacement |
| -------------------------------------------------------- | ----------------------------------------- | | -------------------------------------------------------- | ----------------------------------------- |
| `GoldParse` | [`Example`](/api/example) | | `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) | | `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated | | `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |


@ -82,6 +82,10 @@ export default function QuickstartTraining({ id, title, download = 'config.cfg'
hidePrompts hidePrompts
> >
<QS comment>{COMMENT}</QS> <QS comment>{COMMENT}</QS>
<span>[paths]</span>
<span>train = ""</span>
<span>dev = ""</span>
<br />
<span>[nlp]</span> <span>[nlp]</span>
<span>lang = "{lang}"</span> <span>lang = "{lang}"</span>
<span>pipeline = {JSON.stringify(pipeline).replace(/,/g, ', ')}</span> <span>pipeline = {JSON.stringify(pipeline).replace(/,/g, ', ')}</span>