From 7d9f00bdbf45120345972516693e369b61b5dbf3 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Wed, 19 Aug 2020 19:53:00 +0200
Subject: [PATCH 1/5] waltzing schedule

---
 website/docs/api/top-level.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index 626c1d858..91f7e276c 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -282,17 +282,18 @@ concept of function registries. spaCy also uses the function registry for
 language subclasses, model architecture, lookups and pipeline component
 factories.
 
-
-
 > #### Example
 >
 > ```python
+> from typing import Iterator
 > import spacy
-> from thinc.api import Model
 >
-> @spacy.registry.architectures("CustomNER.v1")
-> def custom_ner(n0: int) -> Model:
->     return Model("custom", forward, dims={"nO": nO})
+> @spacy.registry.schedules("waltzing.v1")
+> def waltzing() -> Iterator[float]:
+>     i = 0
+>     while True:
+>         yield i % 3 + 1
+>         i += 1
 > ```
 
 | Registry name | Description |

From 09f3cfc985f186555d6255950c0970528cd25235 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Wed, 19 Aug 2020 19:58:45 +0200
Subject: [PATCH 2/5] add version

---
 website/docs/api/doc.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md
index e8ce7343d..3c4825f0d 100644
--- a/website/docs/api/doc.md
+++ b/website/docs/api/doc.md
@@ -317,9 +317,7 @@ array of attributes.
 | `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
 | **RETURNS** | The `Doc` itself. ~~Doc~~ |
 
-## Doc.from_docs {#from_docs tag="staticmethod"}
-
-
+## Doc.from_docs {#from_docs tag="staticmethod" new="3"}
 
 Concatenate multiple `Doc` objects to form a new one. Raises an error if the
 `Doc` objects do not all share the same `Vocab`.
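
Review note on PATCH 1/5: the new docs example registers an infinite generator as a schedule. A minimal sketch of what that generator yields, assuming the same body as in the patch but with the `@spacy.registry.schedules` decorator left out so it runs without spaCy installed (the `first_six` name and the use of `itertools.islice` are illustrative only, not part of the patch):

```python
from itertools import islice
from typing import Iterator


def waltzing() -> Iterator[float]:
    """Same body as the schedule in PATCH 1/5, without the registry decorator."""
    i = 0
    while True:
        # Cycle through 1, 2, 3 indefinitely.
        yield i % 3 + 1
        i += 1


# The schedule never terminates, so take a finite prefix to inspect it.
first_six = list(islice(waltzing(), 6))
print(first_six)  # [1, 2, 3, 1, 2, 3]
```

The generator yields Python ints even though the annotation says `Iterator[float]`; whether that matters depends on how the consumer uses the values, which the patch does not show.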
From 229033831aeeb4a78d9ebaa98ebfe1f06d6f9d28 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Thu, 20 Aug 2020 10:00:45 +0200
Subject: [PATCH 3/5] add explanation of raw_text

---
 website/docs/api/data-formats.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md
index ff106b229..444dc0003 100644
--- a/website/docs/api/data-formats.md
+++ b/website/docs/api/data-formats.md
@@ -135,7 +135,7 @@ process that are used when you run [`spacy train`](/api/cli#train).
 | `dropout` | The dropout rate. Defaults to `0.1`. ~~float~~ |
 | `accumulate_gradient` | Whether to divide the batch up into substeps. Defaults to `1`. ~~int~~ |
 | `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths:init_tok2vec}`. ~~Optional[str]~~ |
-| `raw_text` | TODO: ... Defaults to variable `${paths:raw}`. ~~Optional[str]~~ |
+| `raw_text` | Optional path to a jsonl file with unlabelled text documents for a [rehearsel](/api/language#rehearse) step. Defaults to variable `${paths:raw}`. ~~Optional[str]~~ |
 | `vectors` | Model name or path to model containing pretrained word vectors to use, e.g. created with [`init model`](/api/cli#init-model). Defaults to `null`. ~~Optional[str]~~ |
 | `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ |
 | `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ |

From ae719b354fad17e42092f60f60b80fe71ce8825e Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Thu, 20 Aug 2020 10:20:40 +0200
Subject: [PATCH 4/5] fix typos

---
 website/docs/api/data-formats.md | 52 ++++++++++++++++----------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md
index 0d19c797a..fd527935a 100644
--- a/website/docs/api/data-formats.md
+++ b/website/docs/api/data-formats.md
@@ -5,7 +5,7 @@ menu:
   - ['Training Config', 'config']
   - ['Training Data', 'training']
   - ['Pretraining Data', 'pretraining']
-  - ['Vocabulary', 'vocab']
+  - ['Vocabulary', 'vocab-jsonl']
   - ['Model Meta', 'meta']
 ---
 
@@ -391,10 +391,10 @@ tokenization can be provided.
 > srsly.write_jsonl("/path/to/text.jsonl", data)
 > ```
 
-| Key | Description |
-| -------- | ------------------------------------------------------------------ |
-| `text` | The raw input text. Is not required if `tokens` available. ~~str~~ |
-| `tokens` | Optional tokenization, one string per token. ~~List[str]~~ |
+| Key | Description |
+| -------- | --------------------------------------------------------------------- |
+| `text` | The raw input text. Is not required if `tokens` is available. ~~str~~ |
+| `tokens` | Optional tokenization, one string per token. ~~List[str]~~ |
 
 ```json
 ### Example
@@ -407,7 +407,7 @@ tokenization can be provided.
 ## Lexical data for vocabulary {#vocab-jsonl new="2"}
 
 To populate a model's vocabulary, you can use the
-[`spacy init-model`](/api/cli#init-model) command and load in a
+[`spacy init model`](/api/cli#init-model) command and load in a
 [newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one
 lexical entry per line via the `--jsonl-loc` option. The first line defines the
 language and vocabulary settings. All other lines are expected to be JSON
@@ -510,23 +510,23 @@ of truth** used for loading a model.
> } > ``` -| Name | Description | -| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `lang` | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). Defaults to `"en"`. ~~str~~ | -| `name` | Model name, e.g. `"core_web_sm"`. The final model package name will be `{lang}_{name}`. Defaults to `"model"`. ~~str~~ | -| `version` | Model version. Will be used to version a Python package created with [`spacy package`](/api/cli#package). Defaults to `"0.0.0"`. ~~str~~ | -| `spacy_version` | spaCy version range the model is compatible with. Defaults to spaCy version used to create the model, up to next minor version, which is the default compatibility for the available [pretrained models](/models). For instance, a model trained with v3.0.0 will have the version range `">=3.0.0,<3.1.0"`. ~~str~~ | -| `parent_package` | Name of the spaCy package. Typically `"spacy"` or `"spacy_nightly"`. Defaults to `"spacy"`. ~~str~~ | -| `description` | Model description. Also used for Python package. Defaults to `""`. ~~str~~ | -| `author` | Model author name. Also used for Python package. Defaults to `""`. ~~str~~ | -| `email` | Model author email. Also used for Python package. Defaults to `""`. ~~str~~ | -| `url` | Model author URL. Also used for Python package. Defaults to `""`. ~~str~~ | -| `license` | Model license. Also used for Python package. Defaults to `""`. ~~str~~ | -| `sources` | Data sources used to train the model. Typically a list of dicts with the keys `"name"`, `"url"`, `"author"` and `"license"`. [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `None`. 
~~Optional[List[Dict[str, str]]]~~ | -| `vectors` | Information about the word vectors included with the model. Typically a dict with the keys `"width"`, `"vectors"` (number of vectors), `"keys"` and `"name"`. ~~Dict[str, Any]~~ | -| `pipeline` | Names of pipeline component names in the model, in order. Corresponds to [`nlp.pipe_names`](/api/language#pipe_names). Only exists for reference and is not used to create the components. This information is defined in the [`config.cfg`](/api/data-formats#config). Defaults to `[]`. ~~List[str]~~ | -| `labels` | Label schemes of the trained pipeline components, keyed by component name. Corresponds to [`nlp.pipe_labels`](/api/language#pipe_labels). [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `{}`. ~~Dict[str, Dict[str, List[str]]]~~ | -| `accuracy` | Training accuracy, added automatically by [`spacy train`](/api/cli#train). Dictionary of [score names](/usage/training#metrics) mapped to scores. Defaults to `{}`. ~~Dict[str, Union[float, Dict[str, float]]]~~ | -| `speed` | Model speed, added automatically by [`spacy train`](/api/cli#train). Typically a dictionary with the keys `"cpu"`, `"gpu"` and `"nwords"` (words per second). Defaults to `{}`. ~~Dict[str, Optional[Union[float, str]]]~~ | -| `spacy_git_version` 3 | Git commit of [`spacy`](https://github.com/explosion/spaCy) used to create model. ~~str~~ | -| other | Any other custom meta information you want to add. The data is preserved in [`nlp.meta`](/api/language#meta). 
~~Any~~ | +| Name | Description | +| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `lang` | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). Defaults to `"en"`. ~~str~~ | +| `name` | Model name, e.g. `"core_web_sm"`. The final model package name will be `{lang}_{name}`. Defaults to `"model"`. ~~str~~ | +| `version` | Model version. Will be used to version a Python package created with [`spacy package`](/api/cli#package). Defaults to `"0.0.0"`. ~~str~~ | +| `spacy_version` | spaCy version range the model is compatible with. Defaults to the spaCy version used to create the model, up to next minor version, which is the default compatibility for the available [pretrained models](/models). For instance, a model trained with v3.0.0 will have the version range `">=3.0.0,<3.1.0"`. ~~str~~ | +| `parent_package` | Name of the spaCy package. Typically `"spacy"` or `"spacy_nightly"`. Defaults to `"spacy"`. ~~str~~ | +| `description` | Model description. Also used for Python package. Defaults to `""`. ~~str~~ | +| `author` | Model author name. Also used for Python package. Defaults to `""`. ~~str~~ | +| `email` | Model author email. Also used for Python package. Defaults to `""`. ~~str~~ | +| `url` | Model author URL. Also used for Python package. Defaults to `""`. ~~str~~ | +| `license` | Model license. Also used for Python package. Defaults to `""`. ~~str~~ | +| `sources` | Data sources used to train the model. Typically a list of dicts with the keys `"name"`, `"url"`, `"author"` and `"license"`. [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `None`. 
~~Optional[List[Dict[str, str]]]~~ | +| `vectors` | Information about the word vectors included with the model. Typically a dict with the keys `"width"`, `"vectors"` (number of vectors), `"keys"` and `"name"`. ~~Dict[str, Any]~~ | +| `pipeline` | Names of pipeline component names in the model, in order. Corresponds to [`nlp.pipe_names`](/api/language#pipe_names). Only exists for reference and is not used to create the components. This information is defined in the [`config.cfg`](/api/data-formats#config). Defaults to `[]`. ~~List[str]~~ | +| `labels` | Label schemes of the trained pipeline components, keyed by component name. Corresponds to [`nlp.pipe_labels`](/api/language#pipe_labels). [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `{}`. ~~Dict[str, Dict[str, List[str]]]~~ | +| `accuracy` | Training accuracy, added automatically by [`spacy train`](/api/cli#train). Dictionary of [score names](/usage/training#metrics) mapped to scores. Defaults to `{}`. ~~Dict[str, Union[float, Dict[str, float]]]~~ | +| `speed` | Model speed, added automatically by [`spacy train`](/api/cli#train). Typically a dictionary with the keys `"cpu"`, `"gpu"` and `"nwords"` (words per second). Defaults to `{}`. ~~Dict[str, Optional[Union[float, str]]]~~ | +| `spacy_git_version` 3 | Git commit of [`spacy`](https://github.com/explosion/spaCy) used to create model. ~~str~~ | +| other | Any other custom meta information you want to add. The data is preserved in [`nlp.meta`](/api/language#meta). 
~~Any~~ | From 410b54e10e46f120ac0440af02344158373ee634 Mon Sep 17 00:00:00 2001 From: Sofie Van Landeghem Date: Thu, 20 Aug 2020 11:15:34 +0200 Subject: [PATCH 5/5] Update website/docs/api/data-formats.md Co-authored-by: Ines Montani --- website/docs/api/data-formats.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md index fd527935a..8b67aa263 100644 --- a/website/docs/api/data-formats.md +++ b/website/docs/api/data-formats.md @@ -135,7 +135,7 @@ process that are used when you run [`spacy train`](/api/cli#train). | `dropout` | The dropout rate. Defaults to `0.1`. ~~float~~ | | `accumulate_gradient` | Whether to divide the batch up into substeps. Defaults to `1`. ~~int~~ | | `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ | -| `raw_text` | Optional path to a jsonl file with unlabelled text documents for a [rehearsel](/api/language#rehearse) step. Defaults to variable `${paths.raw}`. ~~Optional[str]~~ | +| `raw_text` | Optional path to a jsonl file with unlabelled text documents for a [rehearsal](/api/language#rehearse) step. Defaults to variable `${paths.raw}`. ~~Optional[str]~~ | | `vectors` | Model name or path to model containing pretrained word vectors to use, e.g. created with [`init model`](/api/cli#init-model). Defaults to `null`. ~~Optional[str]~~ | | `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ | | `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ |
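
Review note on the `raw_text` row: the description points at a jsonl file of unlabelled text. A minimal sketch of producing and reading back such a file, assuming it follows the `text` key layout from the pretraining table touched in PATCH 4/5 (that these keys also apply to the rehearsal input is an assumption, not something the patches state). The sample sentences, the `raw.jsonl` file name and both helper functions are hypothetical, and the standard library `json` module stands in for `srsly.write_jsonl` so the snippet runs without extra dependencies:

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical unlabelled documents, one JSON object per line once written.
docs: List[Dict[str, str]] = [
    {"text": "This is an unlabelled sentence."},
    {"text": "Here is another raw document."},
]


def write_jsonl(path: Path, records: List[Dict[str, str]]) -> None:
    """Write one JSON object per line (newline-delimited JSON)."""
    with path.open("w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, str]]:
    """Read the file back, skipping any empty lines."""
    with path.open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "raw.jsonl"  # hypothetical path for the raw text file
    write_jsonl(raw_path, docs)
    loaded = read_jsonl(raw_path)
```

The round trip only checks the file format; how the training config resolves `${paths.raw}` is outside what this sketch covers.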