diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md index ea4b779c7..fe8f7d8d5 100644 --- a/website/docs/api/dependencyparser.md +++ b/website/docs/api/dependencyparser.md @@ -176,12 +176,12 @@ This method was previously called `begin_training`. > path = "corpus/labels/parser.json > ``` -| Name | Description | -| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | -| _keyword-only_ | | -| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | -| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ | +| Name | Description | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. 
~~Optional[Language]~~ | +| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ | ## DependencyParser.predict {#predict tag="method"} @@ -433,6 +433,24 @@ The labels currently added to the component. | ----------- | ------------------------------------------------------ | | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ | +## DependencyParser.label_data {#label_data tag="property" new="3"} + +The labels currently added to the component and their internal meta information. +This is the data generated by [`init labels`](/api/cli#init-labels) and used by +[`DependencyParser.initialize`](/api/dependencyparser#initialize) to initialize +the model with a pre-defined label set. + +> #### Example +> +> ```python +> labels = parser.label_data +> parser.initialize(lambda: [], nlp=nlp, labels=labels) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------- | +| **RETURNS** | The label data added to the component. ~~Dict[str, Dict[str, Dict[str, int]]]~~ | + ## Serialization fields {#serialization-fields} During serialization, spaCy will export several data fields used to restore diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md index 5fbd0b229..6ac0d163f 100644 --- a/website/docs/api/entityrecognizer.md +++ b/website/docs/api/entityrecognizer.md @@ -165,12 +165,12 @@ This method was previously called `begin_training`. 
> path = "corpus/labels/ner.json > ``` -| Name | Description | -| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | -| _keyword-only_ | | -| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | -| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ | +| Name | Description | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | +| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. 
If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ | ## EntityRecognizer.predict {#predict tag="method"} @@ -421,6 +421,24 @@ The labels currently added to the component. | ----------- | ------------------------------------------------------ | | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ | +## EntityRecognizer.label_data {#label_data tag="property" new="3"} + +The labels currently added to the component and their internal meta information. +This is the data generated by [`init labels`](/api/cli#init-labels) and used by +[`EntityRecognizer.initialize`](/api/entityrecognizer#initialize) to initialize +the model with a pre-defined label set. + +> #### Example +> +> ```python +> labels = ner.label_data +> ner.initialize(lambda: [], nlp=nlp, labels=labels) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------- | +| **RETURNS** | The label data added to the component. ~~Dict[str, Dict[str, Dict[str, int]]]~~ | + ## Serialization fields {#serialization-fields} During serialization, spaCy will export several data fields used to restore diff --git a/website/docs/api/morphologizer.md b/website/docs/api/morphologizer.md index 50e2bb33a..d32514fb0 100644 --- a/website/docs/api/morphologizer.md +++ b/website/docs/api/morphologizer.md @@ -147,12 +147,12 @@ config. 
> path = "corpus/labels/morphologizer.json > ``` -| Name | Description | -| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | -| _keyword-only_ | | -| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | -| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | +| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. 
If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ | ## Morphologizer.predict {#predict tag="method"} @@ -377,6 +377,24 @@ coarse-grained POS as the feature `POS`. | ----------- | ------------------------------------------------------ | | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ | +## Morphologizer.label_data {#label_data tag="property" new="3"} + +The labels currently added to the component and their internal meta information. +This is the data generated by [`init labels`](/api/cli#init-labels) and used by +[`Morphologizer.initialize`](/api/morphologizer#initialize) to initialize the +model with a pre-defined label set. + +> #### Example +> +> ```python +> labels = morphologizer.label_data +> morphologizer.initialize(lambda: [], nlp=nlp, labels=labels) +> ``` + +| Name | Description | +| ----------- | ----------------------------------------------- | +| **RETURNS** | The label data added to the component. ~~dict~~ | + ## Serialization fields {#serialization-fields} During serialization, spaCy will export several data fields used to restore diff --git a/website/docs/api/tagger.md b/website/docs/api/tagger.md index d7c56be67..2123004b6 100644 --- a/website/docs/api/tagger.md +++ b/website/docs/api/tagger.md @@ -148,12 +148,12 @@ This method was previously called `begin_training`. > path = "corpus/labels/tagger.json > ``` -| Name | Description | -| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. 
~~Callable[[], Iterable[Example]]~~ | -| _keyword-only_ | | -| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | -| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[list]~~ | +| Name | Description | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | +| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ | ## Tagger.predict {#predict tag="method"} @@ -411,6 +411,24 @@ The labels currently added to the component. | ----------- | ------------------------------------------------------ | | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ | +## Tagger.label_data {#label_data tag="property" new="3"} + +The labels currently added to the component and their internal meta information. 
+This is the data generated by [`init labels`](/api/cli#init-labels) and used by +[`Tagger.initialize`](/api/tagger#initialize) to initialize the model with a +pre-defined label set. + +> #### Example +> +> ```python +> labels = tagger.label_data +> tagger.initialize(lambda: [], nlp=nlp, labels=labels) +> ``` + +| Name | Description | +| ----------- | ---------------------------------------------------------- | +| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ | + ## Serialization fields {#serialization-fields} During serialization, spaCy will export several data fields used to restore diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md index dd8c81040..0901a6fa9 100644 --- a/website/docs/api/textcategorizer.md +++ b/website/docs/api/textcategorizer.md @@ -29,7 +29,6 @@ architectures and their arguments and hyperparameters. > ```python > from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL > config = { -> "labels": [], > "threshold": 0.5, > "model": DEFAULT_TEXTCAT_MODEL, > } @@ -38,7 +37,6 @@ architectures and their arguments and hyperparameters. | Setting | Description | | ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `labels` | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~ | | `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ | | `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ | | `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ | @@ -61,7 +59,7 @@ architectures and their arguments and hyperparameters. 
> > # Construction from class > from spacy.pipeline import TextCategorizer -> textcat = TextCategorizer(nlp.vocab, model, labels=[], threshold=0.5, positive_label="POS") +> textcat = TextCategorizer(nlp.vocab, model, threshold=0.5, positive_label="POS") > ``` Create a new pipeline instance. In your application, you would normally use a @@ -74,7 +72,6 @@ shortcut for this and instantiate the component using its string name and | `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ | | `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | | _keyword-only_ | | -| `labels` | The labels to use. ~~Iterable[str]~~ | | `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ | | `positive_label` | The positive label for a binary task with exclusive classes, None otherwise. ~~Optional[str]~~ | @@ -161,12 +158,12 @@ This method was previously called `begin_training`. > path = "corpus/labels/textcat.json > ``` -| Name | Description | -| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | -| _keyword-only_ | | -| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | -| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. 
If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ | +| Name | Description | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | +| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ | ## TextCategorizer.predict {#predict tag="method"} @@ -425,6 +422,24 @@ The labels currently added to the component. | ----------- | ------------------------------------------------------ | | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ | +## TextCategorizer.label_data {#label_data tag="property" new="3"} + +The labels currently added to the component and their internal meta information. +This is the data generated by [`init labels`](/api/cli#init-labels) and used by +[`TextCategorizer.initialize`](/api/textcategorizer#initialize) to initialize +the model with a pre-defined label set. 
+ +> #### Example +> +> ```python +> labels = textcat.label_data +> textcat.initialize(lambda: [], nlp=nlp, labels=labels) +> ``` + +| Name | Description | +| ----------- | ---------------------------------------------------------- | +| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ | + ## Serialization fields {#serialization-fields} During serialization, spaCy will export several data fields used to restore diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 74d2f6de5..6317479bc 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -692,14 +692,14 @@ for writing the log files to [Weights & Biases](https://www.wandb.com/) with the [`WandbLogger`](/api/top-level#WandbLogger). The logger function receives a **dictionary** with the following keys: -| Key | Value | -| -------------- | ---------------------------------------------------------------------------------------------- | -| `epoch` | How many passes over the data have been completed. ~~int~~ | -| `step` | How many steps have been completed. ~~int~~ | -| `score` | The main score from the last evaluation, measured on the dev set. ~~float~~ | -| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ | -| `losses` | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ | -| `checkpoints` | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ | +| Key | Value | +| -------------- | ----------------------------------------------------------------------------------------------------- | +| `epoch` | How many passes over the data have been completed. ~~int~~ | +| `step` | How many steps have been completed. ~~int~~ | +| `score` | The main score from the last evaluation, measured on the dev set. ~~float~~ | +| `other_scores` | The other scores from the last evaluation, measured on the dev set. 
~~Dict[str, Any]~~ | +| `losses` | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ | +| `checkpoints` | A list of previous results, where each result is a `(score, step)` tuple. ~~List[Tuple[float, int]]~~ | You can easily implement and plug in your own logger that records the training results in a custom way, or sends them to an experiment management tracker of @@ -819,7 +819,84 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]: ### Customizing the initialization {#initialization} - +When you start training a new model from scratch, +[`spacy train`](/api/cli#train) will call +[`nlp.initialize`](/api/language#initialize) to initialize the pipeline for +training. This process typically includes the following: + +> #### config.cfg (excerpt) +> +> ```ini +> [initialize] +> vectors = ${paths.vectors} +> init_tok2vec = ${paths.init_tok2vec} +> +> [initialize.components] +> # Settings for components +> ``` + +1. Load in **data resources** defined in the `[initialize]` config, including + **word vectors** and + [pretrained](/usage/embeddings-transformers/#pretraining) **tok2vec + weights**. +2. Call the `initialize` methods of the tokenizer (if implemented, e.g. for + [Chinese](/usage/models#chinese)) and pipeline components with a callback to + access the training data, the current `nlp` object and any **custom + arguments** defined in the `[initialize]` config. +3. In **pipeline components**: if needed, use the data to + [infer missing shapes](/usage/layers-architectures#thinc-shape-inference) and + set up the label scheme if no labels are provided. Components may also load + other data like lookup tables or dictionaries. 
+ +The initialization step allows the config to define **all settings** required +for the pipeline, while keeping a separation between settings and functions that +should only be used **before training** to set up the initial pipeline, and +logic and configuration that need to be available **at runtime**. Without that +separation, TODO: + +![Illustration of pipeline lifecycle](../images/lifecycle.svg) + +#### Initializing labels {#initialization-labels} + +Built-in pipeline components like the +[`EntityRecognizer`](/api/entityrecognizer) or +[`DependencyParser`](/api/dependencyparser) need to know their available labels +and associated internal meta information to initialize their model weights. +Using the `get_examples` callback provided on initialization, they're able to +**read the labels off the training data** automatically, which is very +convenient – but it can also slow down the training process to compute this +information on every run. + +The [`init labels`](/api/cli#init-labels) command lets you auto-generate JSON +files containing the label data for all supported components. You can then pass +in the labels in the `[initialize]` settings for the respective components to +allow them to initialize faster. + +> #### config.cfg +> +> ```ini +> [initialize.components.ner] +> +> [initialize.components.ner.labels] +> @readers = "spacy.read_labels.v1" +> path = "corpus/labels/ner.json" +> ``` + +```cli +$ python -m spacy init labels config.cfg ./corpus --paths.train ./corpus/train.spacy +``` + +Under the hood, the command delegates to the `label_data` property of the +pipeline components, for instance +[`EntityRecognizer.label_data`](/api/entityrecognizer#label_data). + + + +The JSON format differs for each component and some components need additional +meta information about their labels. The format exported by +[`init labels`](/api/cli#init-labels) matches what the components need, so you +should always let spaCy **auto-generate the labels** for you. 
+ ## Data utilities {#data} @@ -1298,8 +1375,8 @@ of being dropped. > - [`nlp`](/api/language): The `nlp` object with the pipeline components and > their models. -> - [`nlp.initialize`](/api/language#initialize): Start the training and return -> an optimizer to update the component model weights. +> - [`nlp.initialize`](/api/language#initialize): Initialize the pipeline and +> return an optimizer to update the component model weights. > - [`Optimizer`](https://thinc.ai/docs/api-optimizers): Function that holds > state between updates. > - [`nlp.update`](/api/language#update): Update component models with examples.
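
Review note on the `training.md` hunk that changes `checkpoints` to a list of `(score, step)` tuples: the sketch below shows how a custom logger might consume the dictionary described in that table. It uses only the standard library and does not call spaCy. The helper names `best_checkpoint` and `format_log_row` and the sample values are hypothetical, and the handling of a missing `score` is an assumption of this sketch, not something the diff states.

```python
from typing import Any, Dict, List, Optional, Tuple


def best_checkpoint(info: Dict[str, Any]) -> Optional[Tuple[float, int]]:
    """Return the highest-scoring (score, step) pair, or None if there are none."""
    checkpoints: List[Tuple[float, int]] = info.get("checkpoints", [])
    if not checkpoints:
        return None
    # Tuples compare element-wise, so max() ranks by score first, then by step.
    return max(checkpoints)


def format_log_row(info: Dict[str, Any]) -> str:
    """Render one logger dictionary as a single line of text."""
    losses = ", ".join(
        f"{name}={value:.2f}" for name, value in sorted(info["losses"].items())
    )
    # Assumption of this sketch: `score` may be absent or None, so guard it.
    score = info.get("score")
    score_text = "n/a" if score is None else f"{score:.3f}"
    return (
        f"epoch={info['epoch']} step={info['step']} "
        f"score={score_text} losses[{losses}]"
    )


# Hypothetical values, shaped like the keys listed in the table.
example_info = {
    "epoch": 2,
    "step": 400,
    "score": 0.812,
    "other_scores": {"ents_p": 0.80, "ents_r": 0.82},
    "losses": {"ner": 12.3456, "tagger": 4.5},
    "checkpoints": [(0.701, 200), (0.812, 400)],
}
row = format_log_row(example_info)
best = best_checkpoint(example_info)
```

Because each checkpoint is a two-element tuple with the score first, picking the best one needs no custom key function. Code written against the old three-element `(score, step, epoch)` description would fail when unpacking, which is why the type annotation change in the table matters.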