diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md
index e99166af8..3fd6b7a76 100644
--- a/website/docs/api/cli.md
+++ b/website/docs/api/cli.md
@@ -790,6 +790,12 @@ in the section `[paths]`.
+> #### Example
+>
+> ```cli
+> $ python -m spacy train config.cfg --output ./output --paths.train ./train --paths.dev ./dev
+> ```
+
 ```cli
 $ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id] [overrides]
 ```
@@ -808,15 +814,16 @@ $ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id]
 ## pretrain {#pretrain new="2.1" tag="command,experimental"}
 
 Pretrain the "token to vector" ([`Tok2vec`](/api/tok2vec)) layer of pipeline
-components on [raw text](/api/data-formats#pretrain), using an approximate
-language-modeling objective. Specifically, we load pretrained vectors, and train
-a component like a CNN, BiLSTM, etc to predict vectors which match the
-pretrained ones. The weights are saved to a directory after each epoch. You can
-then include a **path to one of these pretrained weights files** in your
+components on raw text, using an approximate language-modeling objective.
+Specifically, we load pretrained vectors, and train a component like a CNN,
+BiLSTM, etc to predict vectors which match the pretrained ones. The weights are
+saved to a directory after each epoch. You can then include a **path to one of
+these pretrained weights files** in your
 [training config](/usage/training#config) as the `init_tok2vec` setting when you
 train your pipeline. This technique may be especially helpful if you have little
 labelled data. See the usage docs on
-[pretraining](/usage/embeddings-transformers#pretraining) for more info.
+[pretraining](/usage/embeddings-transformers#pretraining) for more info. To read
+the raw text, a [`JsonlCorpus`](/api/top-level#jsonlcorpus) is typically used.
@@ -830,6 +837,12 @@ auto-generated by setting `--pretraining` on
+> #### Example
+>
+> ```cli
+> $ python -m spacy pretrain config.cfg ./output_pretrain --paths.raw_text ./data.jsonl
+> ```
+
 ```cli
 $ python -m spacy pretrain [config_path] [output_dir] [--code] [--resume-path] [--epoch-resume] [--gpu-id] [overrides]
 ```
diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md
index 3df9a7ba4..fc2cda547 100644
--- a/website/docs/api/data-formats.md
+++ b/website/docs/api/data-formats.md
@@ -148,7 +148,7 @@ This section defines a **dictionary** mapping of string keys to functions. Each
 function takes an `nlp` object and yields [`Example`](/api/example) objects. By
 default, the two keys `train` and `dev` are specified and each refer to a
 [`Corpus`](/api/top-level#Corpus). When pretraining, an additional `pretrain`
-section is added that defaults to a [`JsonlCorpus`](/api/top-level#JsonlCorpus).
+section is added that defaults to a [`JsonlCorpus`](/api/top-level#jsonlcorpus).
 You can also register custom functions that return a callable.
 
 | Name | Description |