From 5d417d3b19de0c82b07128761d3f40e20c84170d Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Thu, 6 Aug 2020 13:10:15 +0200 Subject: [PATCH] WIP: Update docs [ci skip] --- website/docs/api/cli.md | 238 ++++++++------------- website/docs/api/data-formats.md | 173 +++++++++++---- website/docs/usage/training.md | 2 + website/docs/usage/v3.md | 1 + website/src/widgets/quickstart-training.js | 4 + 5 files changed, 227 insertions(+), 191 deletions(-) diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index abe050661..00c3bac57 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -219,23 +219,22 @@ The command will create all objects in the tree and validate them. Note that some config validation errors are blocking and will prevent the rest of the config from being resolved. This means that you may not see all validation errors at once and some issues are only shown once previous errors have been -fixed. - -Instead of specifying all required settings in the config file, you can rely on -an auto-fill functionality that uses spaCy's built-in defaults. The resulting -full config can be written to file and used in downstream training tasks. +fixed. To auto-fill a partial config and save the result, you can use the +[`init config`](/api/cli#init-config) command. 
```bash $ python -m spacy debug config [config_path] [--code_path] [--output] [--auto_fill] [--diff] [overrides] ``` -> #### Example 1 +> #### Example > > ```bash > $ python -m spacy debug config ./config.cfg > ``` - + + + ``` ✘ Config validation error @@ -254,30 +253,15 @@ training -> width extra fields not permitted -> #### Example 2 -> -> ```bash -> $ python -m spacy debug config ./minimal_config.cfg -F -o ./filled_config.cfg -> ``` - - - -``` -✔ Auto-filled config is valid -✔ Saved updated config to ./filled_config.cfg -``` - - - -| Argument | Type | Default | Description | -| --------------------- | ---------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `config_path` | positional | - | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | -| `--code_path`, `-c` | option | `None` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | -| `--auto_fill`, `-F` | option | `False` | Whether or not to auto-fill the config with built-in defaults if possible. If `False`, the provided config needs to be complete. | -| `--output_path`, `-o` | option | `None` | Output path where the filled config can be stored. Use '-' for standard output. | -| `--diff`, `-D` | option | `False` | Show a visual diff if config was auto-filled. | -| `--help`, `-h` | flag | `False` | Show help message and available arguments. | -| overrides | | `None` | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. 
| +| Argument | Type | Default | Description | +| --------------------- | ---------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `config_path` | positional | - | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | +| `--code_path`, `-c` | option | `None` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | +| `--auto_fill`, `-F` | option | `False` | Whether or not to auto-fill the config with built-in defaults if possible. If `False`, the provided config needs to be complete. | +| `--output_path`, `-o` | option | `None` | Output path where the filled config can be stored. Use '-' for standard output. | +| `--diff`, `-D` | option | `False` | Show a visual diff if config was auto-filled. | +| `--help`, `-h` | flag | `False` | Show help message and available arguments. | +| overrides | | `None` | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. | ### debug data {#debug-data} @@ -289,19 +273,20 @@ low data labels and more. The `debug-data` command is now available as a subcommand of `spacy debug`. It takes the same arguments as `train` and reads settings off the -[`config.cfg` file](/usage/training#config). +[`config.cfg` file](/usage/training#config) and optional +[overrides](/usage/training#config-overrides) on the CLI. 
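As a rough illustration of the overrides mentioned above, the sketch below shows how a dotted option such as `--paths.train ./train.spacy` could be mapped onto a config section and key. This is not spaCy's own parser: `parse_overrides` is a hypothetical helper, and values are kept as plain strings for simplicity.

```python
from typing import Dict, List


def parse_overrides(args: List[str]) -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: map ["--section.key", "value", ...] to nested dicts."""
    overrides: Dict[str, Dict[str, str]] = {}
    # Pair each option name with the value that follows it.
    for name, value in zip(args[::2], args[1::2]):
        if not name.startswith("--") or "." not in name:
            raise ValueError(f"Expected an option like --section.key, got {name!r}")
        # Everything before the last dot is treated as the section name.
        section, key = name[2:].rsplit(".", 1)
        overrides.setdefault(section, {})[key] = value
    return overrides


result = parse_overrides(["--paths.train", "./train.spacy", "--paths.dev", "./dev.spacy"])
print(result)
```

Splitting on the last dot is a design choice of this sketch: a nested name such as `--a.b.key` would end up in a section called `a.b`.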
```bash -$ python -m spacy debug data [train_path] [dev_path] [config_path] [--code] -[--ignore-warnings] [--verbose] [--no-format] [overrides] +$ python -m spacy debug data [config_path] [--code] [--ignore-warnings] +[--verbose] [--no-format] [overrides] ``` > #### Example > > ```bash -> $ python -m spacy debug data ./train.spacy ./dev.spacy ./config.cfg +> $ python -m spacy debug data ./config.cfg > ``` @@ -443,17 +428,15 @@ will not be available. -| Argument | Type | Description | -| -------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `train_path` | positional | Location of [binary training data](/usage/training#data-format). Can be a file or a directory of files. | -| `dev_path` | positional | Location of [binary development data](/usage/training#data-format) for evaluation. Can be a file or a directory of files. | -| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | -| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | -| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. | -| `--verbose`, `-V` | flag | Print additional information and explanations. | -| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. | -| `--help`, `-h` | flag | Show help message and available arguments. | -| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. 
| +| Argument | Type | Description | +| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | +| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | +| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. | +| `--verbose`, `-V` | flag | Print additional information and explanations. | +| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. | +| `--help`, `-h` | flag | Show help message and available arguments. | +| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. | @@ -466,13 +449,16 @@ sample text and checking how it updates its internal weights and parameters. $ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR] [-GRAD] [-ATTR] [-P0] [-P1] [-P2] [P3] [--gpu_id] ``` -> #### Example 1 -> -> ```bash -> $ python -m spacy debug model ./config.cfg tagger -P0 -> ``` + - +In this example log, we just print the name of each layer after creation of the +model ("Step 0"), which helps us to understand the internal structure of the +Neural Network, and to focus on specific layers that we want to inspect further +(see next example). + +```bash +$ python -m spacy debug model ./config.cfg tagger -P0 +``` ``` ℹ Using CPU @@ -509,20 +495,16 @@ $ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR] ... 
``` - +In this example log, we see how initialization of the model (Step 1) propagates +the correct values for the `nI` (input) and `nO` (output) dimensions of the +various layers. In the `softmax` layer, this step also defines the `W` matrix as +an all-zero matrix determined by the `nO` and `nI` dimensions. After a first +training step (Step 2), this matrix has clearly updated its values through the +training feedback loop. -In this example log, we just print the name of each layer after creation of the -model ("Step 0"), which helps us to understand the internal structure of the -Neural Network, and to focus on specific layers that we want to inspect further -(see next example). - -> #### Example 2 -> -> ```bash -> $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P2 -> ``` - - +```bash +$ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P2 +``` ``` ℹ Using CPU @@ -563,27 +545,20 @@ Neural Network, and to focus on specific layers that we want to inspect further -In this example log, we see how initialization of the model (Step 1) propagates -the correct values for the `nI` (input) and `nO` (output) dimensions of the -various layers. In the `softmax` layer, this step also defines the `W` matrix as -an all-zero matrix determined by the `nO` and `nI` dimensions. After a first -training step (Step 2), this matrix has clearly updated its values through the -training feedback loop. - -| Argument | Type | Default | Description | -| ----------------------- | ---------- | ------- | ---------------------------------------------------------------------------------------------------- | +| Argument | Type | Default | Description | +| ----------------------- | ---------- | ------- | ----------------------------------------------------------------------------------------------------- | | `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. 
| -| `component` | positional | | Name of the pipeline component of which the model should be analysed. | -| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. | -| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. | -| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. | -| `--gradients`, `-GRAD` | option | `False` | Show gradients of each layer. | -| `--attributes`, `-ATTR` | option | `False` | Show attributes of each layer. | -| `--print-step0`, `-P0` | option | `False` | Print model before training. | -| `--print-step1`, `-P1` | option | `False` | Print model after initialization. | -| `--print-step2`, `-P2` | option | `False` | Print model after training. | -| `--print-step3`, `-P3` | option | `False` | Print final predictions. | -| `--help`, `-h` | flag | | Show help message and available arguments. | +| `component` | positional | | Name of the pipeline component of which the model should be analyzed. | +| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. | +| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. | +| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. | +| `--gradients`, `-GRAD` | option | `False` | Show gradients of each layer. | +| `--attributes`, `-ATTR` | option | `False` | Show attributes of each layer. | +| `--print-step0`, `-P0` | option | `False` | Print model before training. | +| `--print-step1`, `-P1` | option | `False` | Print model after initialization. | +| `--print-step2`, `-P2` | option | `False` | Print model after training. | +| `--print-step3`, `-P3` | option | `False` | Print final predictions. | +| `--help`, `-h` | flag | | Show help message and available arguments. 
| ## Train {#train} @@ -603,37 +578,37 @@ you need to manage complex multi-step training workflows, check out the new The `train` command doesn't take a long list of command-line arguments anymore and instead expects a single [`config.cfg` file](/usage/training#config) containing all settings for the pipeline, training process and hyperparameters. +Config values can be [overwritten](/usage/training#config-overrides) on the CLI +if needed. For example, `--paths.train ./train.spacy` sets the variable `train` +in the section `[paths]`. ```bash -$ python -m spacy train [train_path] [dev_path] [config_path] [--output] -[--code] [--verbose] [overrides] +$ python -m spacy train [config_path] [--output] [--code] [--verbose] [overrides] ``` -| Argument | Type | Description | -| ----------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `train_path` | positional | Location of training data in spaCy's [binary format](/api/data-formats#training). Can be a file or a directory of files. | -| `dev_path` | positional | Location of development data for evaluation in spaCy's [binary format](/api/data-formats#training). Can be a file or a directory of files. | -| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | -| `--output`, `-o` | positional | Directory to store model in. Will be created if it doesn't exist. | -| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | -| `--verbose`, `-V` | flag | Show more detailed messages during training. | -| `--help`, `-h` | flag | Show help message and available arguments. | -| overrides | | Config parameters to override. 
Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. | -| **CREATES** | model | The final model and the best model. | +| Argument | Type | Description | +| ----------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | +| `--output`, `-o` | positional | Directory to store model in. Will be created if it doesn't exist. | +| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | +| `--verbose`, `-V` | flag | Show more detailed messages during training. | +| `--help`, `-h` | flag | Show help message and available arguments. | +| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. | +| **CREATES** | model | The final model and the best model. | ## Pretrain {#pretrain new="2.1" tag="experimental"} -Pre-train the "token to vector" (`tok2vec`) layer of pipeline components, using -an approximate language-modeling objective. Specifically, we load pretrained -vectors, and train a component like a CNN, BiLSTM, etc to predict vectors which -match the pretrained ones. The weights are saved to a directory after each -epoch. You can then pass a path to one of these pretrained weights files to the -`spacy train` command. This technique may be especially helpful if you have -little labelled data. +Pre-train the "token to vector" (`tok2vec`) layer of pipeline components on +[raw text](/api/data-formats#pretrain), using an approximate language-modeling +objective. 
Specifically, we load pretrained vectors, and train a component like +a CNN, BiLSTM, etc to predict vectors which match the pretrained ones. The +weights are saved to a directory after each epoch. You can then pass a path to +one of these pretrained weights files to the `spacy train` command. This +technique may be especially helpful if you have little labelled data. @@ -650,48 +625,17 @@ $ python -m spacy pretrain [texts_loc] [output_dir] [config_path] [--code] [--resume-path] [--epoch-resume] [overrides] ``` -| Argument | Type | Description | -| ----------------------- | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](#pretrain-jsonl) for details. | -| `output_dir` | positional | Directory to write models to on each epoch. | -| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | -| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | -| `--resume-path`, `-r` | option | TODO: | -| `--epoch-resume`, `-er` | option | TODO: | -| `--help`, `-h` | flag | Show help message and available arguments. | -| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. | -| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. | - -### JSONL format for raw text {#pretrain-jsonl} - -Raw text can be provided as a `.jsonl` (newline-delimited JSON) file containing -one input text per line (roughly paragraph length is good). 
Optionally, custom -tokenization can be provided. - -> #### Tip: Writing JSONL -> -> Our utility library [`srsly`](https://github.com/explosion/srsly) provides a -> handy `write_jsonl` helper that takes a file path and list of dictionaries and -> writes out JSONL-formatted data. -> -> ```python -> import srsly -> data = [{"text": "Some text"}, {"text": "More..."}] -> srsly.write_jsonl("/path/to/text.jsonl", data) -> ``` - -| Key | Type | Description | -| -------- | ---- | ---------------------------------------------------------- | -| `text` | str | The raw input text. Is not required if `tokens` available. | -| `tokens` | list | Optional tokenization, one string per token. | - -```json -### Example -{"text": "Can I ask where you work now and what you do, and if you enjoy it?"} -{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."} -{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."} -{"tokens": ["If", "tokens", "are", "provided", "then", "we", "can", "skip", "the", "raw", "input", "text"]} -``` +| Argument | Type | Description | +| ----------------------- | ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](/api/data-formats#pretrain) for details. | +| `output_dir` | positional | Directory to write models to on each epoch. | +| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. 
| +| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. | +| `--resume-path`, `-r` | option | TODO: | +| `--epoch-resume`, `-er` | option | TODO: | +| `--help`, `-h` | flag | Show help message and available arguments. | +| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. | +| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. | ## Evaluate {#evaluate new="2"} diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md index 210e5d47d..ae398cbf5 100644 --- a/website/docs/api/data-formats.md +++ b/website/docs/api/data-formats.md @@ -3,6 +3,7 @@ title: Data formats teaser: Details on spaCy's input and output data formats menu: - ['Training Data', 'training'] + - ['Pretraining Data', 'pretraining'] - ['Training Config', 'config'] - ['Vocabulary', 'vocab'] --- @@ -16,17 +17,30 @@ label schemes used in its components, depending on the data it was trained on. ### Binary training format {#binary-training new="3"} +The built-in [`convert`](/api/cli#convert) command helps you convert the +`.conllu` format used by the +[Universal Dependencies corpora](https://github.com/UniversalDependencies) as +well as spaCy's previous [JSON format](#json-input). + -### JSON input format for training {#json-input} +### JSON training format {#json-input tag="deprecated"} -spaCy takes training data in JSON format. The built-in -[`convert`](/api/cli#convert) command helps you convert the `.conllu` format -used by the -[Universal Dependencies corpora](https://github.com/UniversalDependencies) to -spaCy's training format. To convert one or more existing `Doc` objects to -spaCy's JSON format, you can use the -[`gold.docs_to_json`](/api/top-level#docs_to_json) helper. 
+ + +As of v3.0, the JSON input format is deprecated and is replaced by the +[binary format](#binary-training). Instead of converting [`Doc`](/api/doc) +objects to JSON, you can now serialize them directly using the +[`DocBin`](/api/docbin) container and then use them as input data. + +[`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy` +format: + +```bash +$ python -m spacy convert ./data.json ./output +``` + + > #### Annotating entities {#biluo} > @@ -68,61 +82,99 @@ spaCy's JSON format, you can use the }] ``` + + Here's an example of dependencies, part-of-speech tags and names entities, taken from the English Wall Street Journal portion of the Penn Treebank: ```json -https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json +https://github.com/explosion/spaCy/blob/v2.3.x/examples/training/training-data.json ``` -### Annotations in dictionary format {#dict-input} + -To create [`Example`](/api/example) objects, you can create a dictionary of the -gold-standard annotations `gold_dict`, and then call +### Annotation format for creating training examples {#dict-input} -```python -example = Example.from_dict(doc, gold_dict) -``` +An [`Example`](/api/example) object holds the information for one training +instance. It stores two [`Doc`](/api/doc) objects: one for holding the +gold-standard reference data, and one for holding the predictions of the +pipeline. Examples can be created using the +[`Example.from_dict`](/api/example#from_dict) method with a reference `Doc` and +a dictionary of gold-standard annotations. There are currently two formats +supported for this dictionary of annotations: one with a simple, **flat +structure** of keywords, and one with a more **hierarchical structure**. -There are currently two formats supported for this dictionary of annotations: -one with a simple, flat structure of keywords, and one with a more hierarchical -structure. 
+> #### Example +> +> ```python +> example = Example.from_dict(doc, gold_dict) +> ``` + + + +`Example` objects are used as part of the +[internal training API](/usage/training#api) and they're expected when you call +[`nlp.update`](/api/language#update). However, for most use cases, you +**shouldn't** have to write your own training scripts. It's recommended to train +your models via the [`spacy train`](/api/cli#train) command with a config file +to keep track of your settings and hyperparameters and your own +[registered functions](/usage/training/#custom-code) to customize the setup. + + #### Flat structure {#dict-flat} -Here is the full overview of potential entries in a flat dictionary of -annotations. You need to only specify those keys corresponding to the task you -want to train. +> #### Example +> +> ```python +> { +> "text": str, +> "words": List[str], +> "lemmas": List[str], +> "spaces": List[bool], +> "tags": List[str], +> "pos": List[str], +> "morphs": List[str], +> "sent_starts": List[bool], +> "deps": List[str], +> "heads": List[int], +> "entities": List[str], +> "entities": List[(int, int, str)], +> "cats": Dict[str, float], +> "links": Dict[(int, int), dict], +> } +> ``` -```python -### Flat dictionary -{ - "text": string, # Raw text. - "words": List[string], # List of gold tokens. - "lemmas": List[string], # List of lemmas. - "spaces": List[bool], # List of boolean values indicating whether the corresponding tokens is followed by a space or not. - "tags": List[string], # List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging). - "pos": List[string], # List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging). - "morphs": List[string], # List of [morphological features](/usage/linguistic-features#rule-based-morphology). - "sent_starts": List[bool], # List of boolean values indicating whether each token is the first of a sentence or not. 
- "deps": List[string], # List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head. - "heads": List[int], # List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text. - "entities": List[string], # Option 1: List of [BILUO tags](#biluo) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens. - "entities": List[(int, int, string)], # Option 2: List of `"(start, end, label)"` tuples defining all entities in. - "cats": Dict[str, float], # Dictionary of `label:value` pairs indicating how relevant a certain [category](/api/textcategorizer) is for the text. - "links": Dict[(int, int), Dict], # Dictionary of `offset:dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The charachter offsets are linked to a dictionary of relevant knowledge base IDs. -} -``` +| Name | Type | Description | +| ------------- | ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `text` | str | Raw text. | +| `words` | `List[str]` | List of gold-standard tokens. | +| `lemmas` | `List[str]` | List of lemmas. | +| `spaces` | `List[bool]` | List of boolean values indicating whether the corresponding token is followed by a space or not. | +| `tags` | `List[str]` | List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging). | +| `pos` | `List[str]` | List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging). | +| `morphs` | `List[str]` | List of [morphological features](/usage/linguistic-features#rule-based-morphology). | +| `sent_starts` | `List[bool]` | List of boolean values indicating whether each token is the first of a sentence or not. 
| +| `deps` | `List[str]` | List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head. | +| `heads` | `List[int]` | List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text. | +| `entities` | `List[str]` | Option 1: List of [BILUO tags](#biluo) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens. | +| `entities` | `List[Tuple[int, int, str]]` | Option 2: List of `"(start, end, label)"` tuples defining all entities in the text. | +| `cats` | `Dict[str, float]` | Dictionary of `label`/`value` pairs indicating how relevant a certain [text category](/api/textcategorizer) is for the text. | +| `links` | `Dict[(int, int), Dict]` | Dictionary of `offset`/`dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The character offsets are linked to a dictionary of relevant knowledge base IDs. | -There are a few caveats to take into account: + - Multiple formats are possible for the "entities" entry, but you have to pick one. - Any values for sentence starts will be ignored if there are annotations for dependency relations. -- If the dictionary contains values for "text" and "words", but not "spaces", - the latter are inferred automatically. If "words" is not provided either, the - values are inferred from the `doc` argument. +- If the dictionary contains values for `"text"` and `"words"`, but not + `"spaces"`, the latter are inferred automatically. If "words" is not provided + either, the values are inferred from the `Doc` argument. + + + + ##### Examples @@ -192,6 +244,39 @@ There are a few caveats to take into account: latter are inferred automatically. If "ORTH" is not provided either, the values are inferred from the `doc` argument. 
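The caveat that `"spaces"` can be inferred when `"text"` and `"words"` are both given can be pictured with a small sketch. This is only an illustration under simple assumptions (each word occurs in the text in order, and only the plain space character counts); `infer_spaces` is a hypothetical helper, not spaCy's actual alignment code.

```python
from typing import List


def infer_spaces(text: str, words: List[str]) -> List[bool]:
    """Hypothetical helper: for each word, report whether a space follows it in the text."""
    spaces = []
    pos = 0
    for word in words:
        # Locate the word at or after the current position in the raw text.
        start = text.find(word, pos)
        if start == -1:
            raise ValueError(f"Can't align {word!r} with the text")
        end = start + len(word)
        # A word is "followed by a space" if the next character is a space.
        spaces.append(end < len(text) and text[end] == " ")
        pos = end
    return spaces


text = "I like London."
words = ["I", "like", "London", "."]
print(infer_spaces(text, words))  # [True, True, False, False]
```

If a word cannot be found in the text, the sketch raises an error instead of guessing, since a silent misalignment would corrupt every annotation that follows it.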
+## Pretraining data {#pretraining} + +The [`spacy pretrain`](/api/cli#pretrain) command lets you pretrain the tok2vec +layer of pipeline components from raw text. Raw text can be provided as a +`.jsonl` (newline-delimited JSON) file containing one input text per line +(roughly paragraph length is good). Optionally, custom tokenization can be +provided. + +> #### Tip: Writing JSONL +> +> Our utility library [`srsly`](https://github.com/explosion/srsly) provides a +> handy `write_jsonl` helper that takes a file path and list of dictionaries and +> writes out JSONL-formatted data. +> +> ```python +> import srsly +> data = [{"text": "Some text"}, {"text": "More..."}] +> srsly.write_jsonl("/path/to/text.jsonl", data) +> ``` + +| Key | Type | Description | +| -------- | ---- | ---------------------------------------------------------- | +| `text` | str | The raw input text. Is not required if `tokens` available. | +| `tokens` | list | Optional tokenization, one string per token. | + +```json +### Example +{"text": "Can I ask where you work now and what you do, and if you enjoy it?"} +{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."} +{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. 
Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."} +{"tokens": ["If", "tokens", "are", "provided", "then", "we", "can", "skip", "the", "raw", "input", "text"]} +``` + ## Training config {#config new="3"} Config files define the training process and model pipeline and can be passed to diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index c0ec052b9..5b9e76c02 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -172,6 +172,8 @@ available for the different architectures are documented with the ### Overwriting config settings on the command line {#config-overrides} + + The config system means that you can define all settings **in one place** and in a consistent format. There are no command-line arguments that need to be set, and no hidden defaults. However, there can still be scenarios where you may want diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index 1f13b6328..c78799050 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -20,6 +20,7 @@ menu: | Removed | Replacement | | -------------------------------------------------------- | ----------------------------------------- | | `GoldParse` | [`Example`](/api/example) | +| `GoldCorpus` | [`Corpus`](/api/corpus) | | `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) | | `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated | diff --git a/website/src/widgets/quickstart-training.js b/website/src/widgets/quickstart-training.js index c5cd9aab9..8d20a0744 100644 --- a/website/src/widgets/quickstart-training.js +++ b/website/src/widgets/quickstart-training.js @@ -82,6 +82,10 @@ export default function QuickstartTraining({ id, title, download = 'config.cfg' hidePrompts > {COMMENT} + [paths] + train = "" + dev = "" +
[nlp] lang = "{lang}" pipeline = {JSON.stringify(pipeline).replace(/,/g, ', ')}