Update docs

2025-08-09 06:34:54 +03:00 · 2020-10-03 16:08:24 +02:00 · 35d695a031
commit 35d695a031
parent 5fb776556a
6 changed files with 209 additions and 45 deletions
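
The `label_data` property documented in the diff below is described as the data generated by `init labels`, which is stored as a reusable JSON file and passed back to `initialize` through the `labels` argument. The following is a minimal, standard-library-only sketch of that round trip. The helper names `save_labels` and `load_labels`, the file name, and the sample label values are hypothetical and not part of spaCy.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def save_labels(label_data: Any, path: Path) -> None:
    """Write label data (e.g. the value of `tagger.label_data`) to a JSON file."""
    path.write_text(json.dumps(label_data), encoding="utf8")


def load_labels(path: Path) -> Any:
    """Read label data back; tuples come back as lists after the round trip."""
    return json.loads(path.read_text(encoding="utf8"))


# Hypothetical stand-in for `tagger.label_data` (documented as Tuple[str, ...]).
tagger_labels = ("NOUN", "PRON", "VERB")

with tempfile.TemporaryDirectory() as tmp:
    labels_path = Path(tmp) / "tagger.json"
    save_labels(tagger_labels, labels_path)
    restored = load_labels(labels_path)

# With spaCy installed, `restored` could then be passed on as
# `tagger.initialize(get_examples, nlp=nlp, labels=restored)`.
print(restored)
```

JSON has no tuple type, so the restored value is a list. The `labels` argument of `Tagger.initialize` is documented below as `Optional[Iterable[str]]`, which a list satisfies.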
--- a/website/docs/api/dependencyparser.md
+++ b/website/docs/api/dependencyparser.md
@@ -176,12 +176,12 @@ This method was previously called `begin_training`.
 > path = "corpus/labels/parser.json"
 > ```

-| Name           | Description                                                                                                                                                                                                                                                                                                         |
-| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~                                                                                                                                                                               |
-| _keyword-only_ |                                                                                                                                                                                                                                                                                                                     |
-| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~                                                                                                                                                                                                                                                |
-| `labels`       | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
+| Name           | Description                                                                                                                                                                                                                                                                                                                                                                                                            |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~                                                                                                                                                                                                                                                                                  |
+| _keyword-only_ |                                                                                                                                                                                                                                                                                                                                                                                                                        |
+| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~                                                                                                                                                                                                                                                                                                                                                   |
+| `labels`       | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ |

 ## DependencyParser.predict {#predict tag="method"}

@@ -433,6 +433,24 @@ The labels currently added to the component.
 | ----------- | ------------------------------------------------------ |
 | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

+## DependencyParser.label_data {#label_data tag="property" new="3"}
+
+The labels currently added to the component and their internal meta information.
+This is the data generated by [`init labels`](/api/cli#init-labels) and used by
+[`DependencyParser.initialize`](/api/dependencyparser#initialize) to initialize
+the model with a pre-defined label set.
+
+> #### Example
+>
+> ```python
+> labels = parser.label_data
+> parser.initialize(lambda: [], nlp=nlp, labels=labels)
+> ```
+
+| Name        | Description                                                                     |
+| ----------- | ------------------------------------------------------------------------------- |
+| **RETURNS** | The label data added to the component. ~~Dict[str, Dict[str, Dict[str, int]]]~~ |
+
 ## Serialization fields {#serialization-fields}

 During serialization, spaCy will export several data fields used to restore
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -165,12 +165,12 @@ This method was previously called `begin_training`.
 > path = "corpus/labels/ner.json"
 > ```

-| Name           | Description                                                                                                                                                                                                                                                                                                         |
-| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~                                                                                                                                                                               |
-| _keyword-only_ |                                                                                                                                                                                                                                                                                                                     |
-| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~                                                                                                                                                                                                                                                |
-| `labels`       | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
+| Name           | Description                                                                                                                                                                                                                                                                                                                                                                                                            |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~                                                                                                                                                                                                                                                                                  |
+| _keyword-only_ |                                                                                                                                                                                                                                                                                                                                                                                                                        |
+| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~                                                                                                                                                                                                                                                                                                                                                   |
+| `labels`       | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ |

 ## EntityRecognizer.predict {#predict tag="method"}

@@ -421,6 +421,24 @@ The labels currently added to the component.
 | ----------- | ------------------------------------------------------ |
 | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

+## EntityRecognizer.label_data {#label_data tag="property" new="3"}
+
+The labels currently added to the component and their internal meta information.
+This is the data generated by [`init labels`](/api/cli#init-labels) and used by
+[`EntityRecognizer.initialize`](/api/entityrecognizer#initialize) to initialize
+the model with a pre-defined label set.
+
+> #### Example
+>
+> ```python
+> labels = ner.label_data
+> ner.initialize(lambda: [], nlp=nlp, labels=labels)
+> ```
+
+| Name        | Description                                                                     |
+| ----------- | ------------------------------------------------------------------------------- |
+| **RETURNS** | The label data added to the component. ~~Dict[str, Dict[str, Dict[str, int]]]~~ |
+
 ## Serialization fields {#serialization-fields}

 During serialization, spaCy will export several data fields used to restore
--- a/website/docs/api/morphologizer.md
+++ b/website/docs/api/morphologizer.md
@@ -147,12 +147,12 @@ config.
 > path = "corpus/labels/morphologizer.json"
 > ```

-| Name           | Description                                                                                                                                                                                                                                                                                                         |
-| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~                                                                                                                                                                               |
-| _keyword-only_ |                                                                                                                                                                                                                                                                                                                     |
-| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~                                                                                                                                                                                                                                                |
-| `labels`       | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
+| Name           | Description                                                                                                                                                                                                                                                                                                                                                                                       |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~                                                                                                                                                                                                                                                             |
+| _keyword-only_ |                                                                                                                                                                                                                                                                                                                                                                                                   |
+| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~                                                                                                                                                                                                                                                                                                                              |
+| `labels`       | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |

 ## Morphologizer.predict {#predict tag="method"}

@@ -377,6 +377,24 @@ coarse-grained POS as the feature `POS`.
 | ----------- | ------------------------------------------------------ |
 | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

+## Morphologizer.label_data {#label_data tag="property" new="3"}
+
+The labels currently added to the component and their internal meta information.
+This is the data generated by [`init labels`](/api/cli#init-labels) and used by
+[`Morphologizer.initialize`](/api/morphologizer#initialize) to initialize the
+model with a pre-defined label set.
+
+> #### Example
+>
+> ```python
+> labels = morphologizer.label_data
+> morphologizer.initialize(lambda: [], nlp=nlp, labels=labels)
+> ```
+
+| Name        | Description                                     |
+| ----------- | ----------------------------------------------- |
+| **RETURNS** | The label data added to the component. ~~dict~~ |
+
 ## Serialization fields {#serialization-fields}

 During serialization, spaCy will export several data fields used to restore
--- a/website/docs/api/tagger.md
+++ b/website/docs/api/tagger.md
@@ -148,12 +148,12 @@ This method was previously called `begin_training`.
 > path = "corpus/labels/tagger.json"
 > ```

-| Name           | Description                                                                                                                                                                                                                                                                                                         |
-| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~                                                                                                                                                                               |
-| _keyword-only_ |                                                                                                                                                                                                                                                                                                                     |
-| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~                                                                                                                                                                                                                                                |
-| `labels`       | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[list]~~ |
+| Name           | Description                                                                                                                                                                                                                                                                                                                                                                                                |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~                                                                                                                                                                                                                                                                      |
+| _keyword-only_ |                                                                                                                                                                                                                                                                                                                                                                                                            |
+| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~                                                                                                                                                                                                                                                                                                                                       |
+| `labels`       | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |

 ## Tagger.predict {#predict tag="method"}

@@ -411,6 +411,24 @@ The labels currently added to the component.
 | ----------- | ------------------------------------------------------ |
 | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

+## Tagger.label_data {#label_data tag="property" new="3"}
+
+The labels currently added to the component and their internal meta information.
+This is the data generated by [`init labels`](/api/cli#init-labels) and used by
+[`Tagger.initialize`](/api/tagger#initialize) to initialize the model with a
+pre-defined label set.
+
+> #### Example
+>
+> ```python
+> labels = tagger.label_data
+> tagger.initialize(lambda: [], nlp=nlp, labels=labels)
+> ```
+
+| Name        | Description                                                |
+| ----------- | ---------------------------------------------------------- |
+| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |
+
 ## Serialization fields {#serialization-fields}

 During serialization, spaCy will export several data fields used to restore
--- a/website/docs/api/textcategorizer.md
+++ b/website/docs/api/textcategorizer.md
@@ -29,7 +29,6 @@ architectures and their arguments and hyperparameters.
 > ```python
 > from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
 > config = {
->    "labels": [],
 >    "threshold": 0.5,
 >    "model": DEFAULT_TEXTCAT_MODEL,
 > }
@@ -38,7 +37,6 @@ architectures and their arguments and hyperparameters.

 | Setting          | Description                                                                                                                                                      |
 | ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `labels`         | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~                                      |
 | `threshold`      | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~                                                                   |
 | `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~                                                    |
 | `model`          | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
@@ -61,7 +59,7 @@ architectures and their arguments and hyperparameters.
 >
 > # Construction from class
 > from spacy.pipeline import TextCategorizer
-> textcat = TextCategorizer(nlp.vocab, model, labels=[], threshold=0.5, positive_label="POS")
+> textcat = TextCategorizer(nlp.vocab, model, threshold=0.5, positive_label="POS")
 > ```

 Create a new pipeline instance. In your application, you would normally use a
@@ -74,7 +72,6 @@ shortcut for this and instantiate the component using its string name and
 | `model`          | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
 | `name`           | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~                        |
 | _keyword-only_   |                                                                                                                            |
-| `labels`         | The labels to use. ~~Iterable[str]~~                                                                                       |
 | `threshold`      | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~                             |
 | `positive_label` | The positive label for a binary task with exclusive classes, None otherwise. ~~Optional[str]~~                             |

@@ -161,12 +158,12 @@ This method was previously called `begin_training`.
 > path = "corpus/labels/textcat.json"
 > ```

-| Name           | Description                                                                                                                                                                                                                                                                                                         |
-| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~                                                                                                                                                                               |
-| _keyword-only_ |                                                                                                                                                                                                                                                                                                                     |
-| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~                                                                                                                                                                                                                                                |
-| `labels`       | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
+| Name           | Description                                                                                                                                                                                                                                                                                                                                                                                                |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~                                                                                                                                                                                                                                                                      |
+| _keyword-only_ |                                                                                                                                                                                                                                                                                                                                                                                                            |
+| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~                                                                                                                                                                                                                                                                                                                                       |
+| `labels`       | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |

 ## TextCategorizer.predict {#predict tag="method"}

@@ -425,6 +422,24 @@ The labels currently added to the component.
 | ----------- | ------------------------------------------------------ |
 | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

+## TextCategorizer.label_data {#label_data tag="property" new="3"}
+
+The labels currently added to the component and their internal meta information.
+This is the data generated by [`init labels`](/api/cli#init-labels) and used by
+[`TextCategorizer.initialize`](/api/textcategorizer#initialize) to initialize
+the model with a pre-defined label set.
+
+> #### Example
+>
+> ```python
+> labels = textcat.label_data
+> textcat.initialize(lambda: [], nlp=nlp, labels=labels)
+> ```
+
+| Name        | Description                                                |
+| ----------- | ---------------------------------------------------------- |
+| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |
+
 ## Serialization fields {#serialization-fields}

 During serialization, spaCy will export several data fields used to restore
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@ -692,14 +692,14 @@ for writing the log files to [Weights & Biases](https://www.wandb.com/) with the
 [`WandbLogger`](/api/top-level#WandbLogger). The logger function receives a
 **dictionary** with the following keys:

-| Key            | Value                                                                                          |
-| -------------- | ---------------------------------------------------------------------------------------------- |
-| `epoch`        | How many passes over the data have been completed. ~~int~~                                     |
-| `step`         | How many steps have been completed. ~~int~~                                                    |
-| `score`        | The main score from the last evaluation, measured on the dev set. ~~float~~                    |
-| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~         |
-| `losses`       | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~                 |
-| `checkpoints`  | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ |
+| Key            | Value                                                                                                 |
+| -------------- | ----------------------------------------------------------------------------------------------------- |
+| `epoch`        | How many passes over the data have been completed. ~~int~~                                            |
+| `step`         | How many steps have been completed. ~~int~~                                                           |
+| `score`        | The main score from the last evaluation, measured on the dev set. ~~float~~                           |
+| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~                |
+| `losses`       | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~                        |
+| `checkpoints`  | A list of previous results, where each result is a `(score, step)` tuple. ~~List[Tuple[float, int]]~~ |
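
As a minimal sketch of how a custom logger might consume this dictionary, the helper below turns one call's payload into a single log line. It only uses the keys listed in the table above. The function name `format_step` and the sample values are hypothetical, and wiring the function into spaCy's logger registry is deliberately left out.

```python
from typing import Any, Dict, List, Tuple


def format_step(info: Dict[str, Any]) -> str:
    """Render one logger payload as a single line (hypothetical helper)."""
    # Losses are keyed by component name; sort them for a stable output order.
    losses = ", ".join(
        f"{name}={value:.2f}" for name, value in sorted(info["losses"].items())
    )
    # Each checkpoint is a (score, step) tuple, so max() picks the best score.
    checkpoints: List[Tuple[float, int]] = info["checkpoints"]
    if checkpoints:
        best_score, best_step = max(checkpoints)
    else:
        best_score, best_step = info["score"], info["step"]
    return (
        f"epoch={info['epoch']} step={info['step']} score={info['score']:.3f} "
        f"losses[{losses}] best={best_score:.3f}@{best_step}"
    )


# Made-up values, shaped like the table above.
example = {
    "epoch": 2,
    "step": 400,
    "score": 0.81,
    "other_scores": {"ents_f": 0.81},
    "losses": {"ner": 12.5, "tok2vec": 3.25},
    "checkpoints": [(0.74, 200), (0.81, 400)],
}
print(format_step(example))
```

Keeping the formatting in a pure function makes it easy to test separately from whatever sink (console, file, experiment tracker) the logger eventually writes to.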

 You can easily implement and plug in your own logger that records the training
 results in a custom way, or sends them to an experiment management tracker of
@ -819,7 +819,84 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:

 ### Customizing the initialization {#initialization}

-<Infobox title="This section is still under construction" emoji="🚧" variant="warning">
+When you start training a new model from scratch,
+[`spacy train`](/api/cli#train) will call
+[`nlp.initialize`](/api/language#initialize) to initialize the pipeline for
+training. This process typically includes the following:
+
+> #### config.cfg (excerpt)
+>
+> ```ini
+> [initialize]
+> vectors = ${paths.vectors}
+> init_tok2vec = ${paths.init_tok2vec}
+>
+> [initialize.components]
+> # Settings for components
+> ```
+
+1. Load in **data resources** defined in the `[initialize]` config, including
+   **word vectors** and
+   [pretrained](/usage/embeddings-transformers/#pretraining) **tok2vec
+   weights**.
+2. Call the `initialize` methods of the tokenizer (if implemented, e.g. for
+   [Chinese](/usage/models#chinese)) and pipeline components with a callback to
+   access the training data, the current `nlp` object and any **custom
+   arguments** defined in the `[initialize]` config.
+3. In **pipeline components**: if needed, use the data to
+   [infer missing shapes](/usage/layers-architectures#thinc-shape-inference) and
+   set up the label scheme if no labels are provided. Components may also load
+   other data like lookup tables or dictionaries.
+
+The initialization step allows the config to define **all settings** required
+for the pipeline, while keeping the settings and functions that should only be
+used **before training** to set up the initial pipeline separate from the logic
+and configuration that need to be available **at runtime**. Without that
+separation, TODO:
+
+![Illustration of pipeline lifecycle](../images/lifecycle.svg)
+
+#### Initializing labels {#initialization-labels}
+
+Built-in pipeline components like the
+[`EntityRecognizer`](/api/entityrecognizer) or
+[`DependencyParser`](/api/dependencyparser) need to know their available labels
+and associated internal meta information to initialize their model weights.
+Using the `get_examples` callback provided on initialization, they're able to
+**read the labels off the training data** automatically, which is very
+convenient – but it can also slow down the training process to compute this
+information on every run.
+
+The [`init labels`](/api/cli#init-labels) command lets you auto-generate JSON
+files containing the label data for all supported components. You can then pass
+in the labels in the `[initialize]` settings for the respective components to
+allow them to initialize faster.
+
+> #### config.cfg
+>
+> ```ini
+> [initialize.components.ner]
+>
+> [initialize.components.ner.labels]
+> @readers = "spacy.read_labels.v1"
+> path = "corpus/labels/ner.json"
+> ```
+
+```cli
+$ python -m spacy init labels config.cfg ./corpus --paths.train ./corpus/train.spacy
+```
+
+Under the hood, the command delegates to the `label_data` property of the
+pipeline components, for instance
+[`EntityRecognizer.label_data`](/api/entityrecognizer#label_data).
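
The trade-off between the two initialization paths can be sketched in plain Python. `ToyComponent` below is a hypothetical stand-in, not spaCy's implementation: it assumes the label data is a flat list of strings and that each example is a dict with a `"labels"` key, whereas real components may store richer, component-specific data.

```python
import json
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class ToyComponent:
    """Hypothetical stand-in for a pipeline component with labels."""

    def __init__(self) -> None:
        self.labels: Tuple[str, ...] = ()

    @property
    def label_data(self) -> Tuple[str, ...]:
        # In this toy version, the label data is just the labels themselves.
        return self.labels

    def initialize(
        self,
        get_examples: Callable[[], Iterable[Dict[str, List[str]]]],
        *,
        labels: Optional[Iterable[str]] = None,
    ) -> None:
        if labels is None:
            # Slow path: read the labels off every training example.
            found = set()
            for example in get_examples():
                found.update(example["labels"])
            labels = sorted(found)
        # Fast path: pre-computed labels are used as provided.
        self.labels = tuple(labels)


train_data = [{"labels": ["PERSON"]}, {"labels": ["ORG", "PERSON"]}]

# First run: extract the labels from the data, then export them as JSON.
first = ToyComponent()
first.initialize(lambda: train_data)
saved = json.dumps(list(first.label_data))

# Later runs: pass in the saved labels, so no examples need to be read.
second = ToyComponent()
second.initialize(lambda: [], labels=json.loads(saved))
print(second.labels)
```

The second call mirrors the pattern of passing `labels` together with an empty `get_examples` callback: once the label data exists, the training data is no longer needed to set up the label scheme.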
+
+<Infobox variant="warning" title="Important note">
+
+The JSON format differs for each component and some components need additional
+meta information about their labels. The format exported by
+[`init labels`](/api/cli#init-labels) matches what the components need, so you
+should always let spaCy **auto-generate the labels** for you.
+
 </Infobox>

 ## Data utilities {#data}
@ -1298,8 +1375,8 @@ of being dropped.

 > - [`nlp`](/api/language): The `nlp` object with the pipeline components and
 >   their models.
-> - [`nlp.initialize`](/api/language#initialize): Start the training and return
->   an optimizer to update the component model weights.
+> - [`nlp.initialize`](/api/language#initialize): Initialize the pipeline and
+>   return an optimizer to update the component model weights.
 > - [`Optimizer`](https://thinc.ai/docs/api-optimizers): Function that holds
 >   state between updates.
 > - [`nlp.update`](/api/language#update): Update component models with examples.