mirror of
https://github.com/explosion/spaCy.git
synced 2025-06-03 20:53:12 +03:00
Update docs
parent
5fb776556a
commit
35d695a031
|
@ -177,11 +177,11 @@ This method was previously called `begin_training`.
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||||
| _keyword-only_ | |
|
| _keyword-only_ | |
|
||||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||||
| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
|
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ |
|
||||||
|
|
||||||
## DependencyParser.predict {#predict tag="method"}
|
## DependencyParser.predict {#predict tag="method"}
|
||||||
|
|
||||||
|
@ -433,6 +433,24 @@ The labels currently added to the component.
|
||||||
| ----------- | ------------------------------------------------------ |
|
| ----------- | ------------------------------------------------------ |
|
||||||
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
|
## DependencyParser.label_data {#label_data tag="property" new="3"}
|
||||||
|
|
||||||
|
The labels currently added to the component and their internal meta information.
|
||||||
|
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
|
||||||
|
[`DependencyParser.initialize`](/api/dependencyparser#initialize) to initialize
|
||||||
|
the model with a pre-defined label set.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> labels = parser.label_data
|
||||||
|
> parser.initialize(lambda: [], nlp=nlp, labels=labels)
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ------------------------------------------------------------------------------- |
|
||||||
|
| **RETURNS** | The label data added to the component. ~~Dict[str, Dict[str, Dict[str, int]]]~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
During serialization, spaCy will export several data fields used to restore
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
|
|
@ -166,11 +166,11 @@ This method was previously called `begin_training`.
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||||
| _keyword-only_ | |
|
| _keyword-only_ | |
|
||||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||||
| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
|
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ |
|
||||||
|
|
||||||
## EntityRecognizer.predict {#predict tag="method"}
|
## EntityRecognizer.predict {#predict tag="method"}
|
||||||
|
|
||||||
|
@ -421,6 +421,24 @@ The labels currently added to the component.
|
||||||
| ----------- | ------------------------------------------------------ |
|
| ----------- | ------------------------------------------------------ |
|
||||||
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
|
## EntityRecognizer.label_data {#label_data tag="property" new="3"}
|
||||||
|
|
||||||
|
The labels currently added to the component and their internal meta information.
|
||||||
|
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
|
||||||
|
[`EntityRecognizer.initialize`](/api/entityrecognizer#initialize) to initialize
|
||||||
|
the model with a pre-defined label set.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> labels = ner.label_data
|
||||||
|
> ner.initialize(lambda: [], nlp=nlp, labels=labels)
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ------------------------------------------------------------------------------- |
|
||||||
|
| **RETURNS** | The label data added to the component. ~~Dict[str, Dict[str, Dict[str, int]]]~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
During serialization, spaCy will export several data fields used to restore
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
|
|
@ -148,11 +148,11 @@ config.
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||||
| _keyword-only_ | |
|
| _keyword-only_ | |
|
||||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||||
| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
|
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
|
||||||
|
|
||||||
## Morphologizer.predict {#predict tag="method"}
|
## Morphologizer.predict {#predict tag="method"}
|
||||||
|
|
||||||
|
@ -377,6 +377,24 @@ coarse-grained POS as the feature `POS`.
|
||||||
| ----------- | ------------------------------------------------------ |
|
| ----------- | ------------------------------------------------------ |
|
||||||
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
|
## Morphologizer.label_data {#label_data tag="property" new="3"}
|
||||||
|
|
||||||
|
The labels currently added to the component and their internal meta information.
|
||||||
|
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
|
||||||
|
[`Morphologizer.initialize`](/api/morphologizer#initialize) to initialize the
|
||||||
|
model with a pre-defined label set.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> labels = morphologizer.label_data
|
||||||
|
> morphologizer.initialize(lambda: [], nlp=nlp, labels=labels)
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ----------------------------------------------- |
|
||||||
|
| **RETURNS** | The label data added to the component. ~~dict~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
During serialization, spaCy will export several data fields used to restore
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
|
|
@ -149,11 +149,11 @@ This method was previously called `begin_training`.
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||||
| _keyword-only_ | |
|
| _keyword-only_ | |
|
||||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||||
| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[list]~~ |
|
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |
|
||||||
|
|
||||||
## Tagger.predict {#predict tag="method"}
|
## Tagger.predict {#predict tag="method"}
|
||||||
|
|
||||||
|
@ -411,6 +411,24 @@ The labels currently added to the component.
|
||||||
| ----------- | ------------------------------------------------------ |
|
| ----------- | ------------------------------------------------------ |
|
||||||
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
|
## Tagger.label_data {#label_data tag="property" new="3"}
|
||||||
|
|
||||||
|
The labels currently added to the component and their internal meta information.
|
||||||
|
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
|
||||||
|
[`Tagger.initialize`](/api/tagger#initialize) to initialize the model with a
|
||||||
|
pre-defined label set.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> labels = tagger.label_data
|
||||||
|
> tagger.initialize(lambda: [], nlp=nlp, labels=labels)
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ---------------------------------------------------------- |
|
||||||
|
| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
During serialization, spaCy will export several data fields used to restore
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
|
|
@ -29,7 +29,6 @@ architectures and their arguments and hyperparameters.
|
||||||
> ```python
|
> ```python
|
||||||
> from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
|
> from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
|
||||||
> config = {
|
> config = {
|
||||||
> "labels": [],
|
|
||||||
> "threshold": 0.5,
|
> "threshold": 0.5,
|
||||||
> "model": DEFAULT_TEXTCAT_MODEL,
|
> "model": DEFAULT_TEXTCAT_MODEL,
|
||||||
> }
|
> }
|
||||||
|
@ -38,7 +37,6 @@ architectures and their arguments and hyperparameters.
|
||||||
|
|
||||||
| Setting | Description |
|
| Setting | Description |
|
||||||
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `labels` | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~ |
|
|
||||||
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
||||||
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ |
|
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ |
|
||||||
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
|
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
|
@ -61,7 +59,7 @@ architectures and their arguments and hyperparameters.
|
||||||
>
|
>
|
||||||
> # Construction from class
|
> # Construction from class
|
||||||
> from spacy.pipeline import TextCategorizer
|
> from spacy.pipeline import TextCategorizer
|
||||||
> textcat = TextCategorizer(nlp.vocab, model, labels=[], threshold=0.5, positive_label="POS")
|
> textcat = TextCategorizer(nlp.vocab, model, threshold=0.5, positive_label="POS")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
Create a new pipeline instance. In your application, you would normally use a
|
Create a new pipeline instance. In your application, you would normally use a
|
||||||
|
@ -74,7 +72,6 @@ shortcut for this and instantiate the component using its string name and
|
||||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
|
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||||
| _keyword-only_ | |
|
| _keyword-only_ | |
|
||||||
| `labels` | The labels to use. ~~Iterable[str]~~ |
|
|
||||||
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
||||||
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise. ~~Optional[str]~~ |
|
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise. ~~Optional[str]~~ |
|
||||||
|
|
||||||
|
@ -162,11 +159,11 @@ This method was previously called `begin_training`.
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||||
| _keyword-only_ | |
|
| _keyword-only_ | |
|
||||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||||
| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
|
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |
|
||||||
|
|
||||||
## TextCategorizer.predict {#predict tag="method"}
|
## TextCategorizer.predict {#predict tag="method"}
|
||||||
|
|
||||||
|
@ -425,6 +422,24 @@ The labels currently added to the component.
|
||||||
| ----------- | ------------------------------------------------------ |
|
| ----------- | ------------------------------------------------------ |
|
||||||
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
|
## TextCategorizer.label_data {#label_data tag="property" new="3"}
|
||||||
|
|
||||||
|
The labels currently added to the component and their internal meta information.
|
||||||
|
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
|
||||||
|
[`TextCategorizer.initialize`](/api/textcategorizer#initialize) to initialize
|
||||||
|
the model with a pre-defined label set.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> labels = textcat.label_data
|
||||||
|
> textcat.initialize(lambda: [], nlp=nlp, labels=labels)
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ---------------------------------------------------------- |
|
||||||
|
| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
During serialization, spaCy will export several data fields used to restore
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
|
|
@ -693,13 +693,13 @@ for writing the log files to [Weights & Biases](https://www.wandb.com/) with the
|
||||||
**dictionary** with the following keys:
|
**dictionary** with the following keys:
|
||||||
|
|
||||||
| Key | Value |
|
| Key | Value |
|
||||||
| -------------- | ---------------------------------------------------------------------------------------------- |
|
| -------------- | ----------------------------------------------------------------------------------------------------- |
|
||||||
| `epoch` | How many passes over the data have been completed. ~~int~~ |
|
| `epoch` | How many passes over the data have been completed. ~~int~~ |
|
||||||
| `step` | How many steps have been completed. ~~int~~ |
|
| `step` | How many steps have been completed. ~~int~~ |
|
||||||
| `score` | The main score from the last evaluation, measured on the dev set. ~~float~~ |
|
| `score` | The main score from the last evaluation, measured on the dev set. ~~float~~ |
|
||||||
| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ |
|
| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ |
|
||||||
| `losses` | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ |
|
| `losses` | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ |
|
||||||
| `checkpoints` | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ |
|
| `checkpoints` | A list of previous results, where each result is a `(score, step)` tuple. ~~List[Tuple[float, int]]~~ |
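
As a rough illustration of how these keys fit together, the sketch below formats one progress line from such a dictionary. The helper name `format_step` and the sample values are made up for this example and are not part of spaCy; the code relies only on the keys and types listed in the table, and it assumes `score` is a number.

```python
from typing import Any, Dict


def format_step(info: Dict[str, Any]) -> str:
    """Render one line of training progress from the logger dictionary."""
    losses = " ".join(
        f"{name}={value:.3f}" for name, value in sorted(info["losses"].items())
    )
    # `checkpoints` holds (score, step) tuples, so max() compares by score first.
    if info["checkpoints"]:
        best_score, best_step = max(info["checkpoints"])
        best = f"best={best_score:.3f}@{best_step}"
    else:
        best = "best=n/a"
    return (
        f"epoch={info['epoch']} step={info['step']} "
        f"score={info['score']:.3f} {losses} {best}"
    )


# Hypothetical values, shaped like the table above.
example_info = {
    "epoch": 2,
    "step": 400,
    "score": 0.81,
    "other_scores": {"ents_p": 0.83, "ents_r": 0.79},
    "losses": {"ner": 12.5, "tok2vec": 3.25},
    "checkpoints": [(0.75, 200), (0.81, 400)],
}
print(format_step(example_info))
```

A custom logger could call a helper like this from its step callback and send the resulting string to a file or tracker of your choice.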
|
||||||
|
|
||||||
You can easily implement and plug in your own logger that records the training
|
You can easily implement and plug in your own logger that records the training
|
||||||
results in a custom way, or sends them to an experiment management tracker of
|
results in a custom way, or sends them to an experiment management tracker of
|
||||||
|
@ -819,7 +819,84 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:
|
||||||
|
|
||||||
### Customizing the initialization {#initialization}
|
### Customizing the initialization {#initialization}
|
||||||
|
|
||||||
<Infobox title="This section is still under construction" emoji="🚧" variant="warning">
|
When you start training a new model from scratch,
|
||||||
|
[`spacy train`](/api/cli#train) will call
|
||||||
|
[`nlp.initialize`](/api/language#initialize) to initialize the pipeline for
|
||||||
|
training. This process typically includes the following:
|
||||||
|
|
||||||
|
> #### config.cfg (excerpt)
|
||||||
|
>
|
||||||
|
> ```ini
|
||||||
|
> [initialize]
|
||||||
|
> vectors = ${paths.vectors}
|
||||||
|
> init_tok2vec = ${paths.init_tok2vec}
|
||||||
|
>
|
||||||
|
> [initialize.components]
|
||||||
|
> # Settings for components
|
||||||
|
> ```
|
||||||
|
|
||||||
|
1. Load in **data resources** defined in the `[initialize]` config, including
|
||||||
|
**word vectors** and
|
||||||
|
[pretrained](/usage/embeddings-transformers/#pretraining) **tok2vec
|
||||||
|
weights**.
|
||||||
|
2. Call the `initialize` methods of the tokenizer (if implemented, e.g. for
|
||||||
|
[Chinese](/usage/models#chinese)) and pipeline components with a callback to
|
||||||
|
access the training data, the current `nlp` object and any **custom
|
||||||
|
arguments** defined in the `[initialize]` config.
|
||||||
|
3. In **pipeline components**: if needed, use the data to
|
||||||
|
[infer missing shapes](/usage/layers-architectures#thinc-shape-inference) and
|
||||||
|
set up the label scheme if no labels are provided. Components may also load
|
||||||
|
other data like lookup tables or dictionaries.
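
The second step can be pictured with a small plain-Python sketch. It is not spaCy's implementation: `RecordingComponent`, `StatelessComponent` and `initialize_pipeline` are hypothetical names, and the settings dictionary only stands in for what the `[initialize.components]` block would provide. The sketch shows components that implement `initialize` receiving the data callback, the `nlp` object and their own custom arguments, while components without the method are skipped.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


class RecordingComponent:
    """Hypothetical component that records what `initialize` received."""

    def __init__(self) -> None:
        self.settings: Dict[str, Any] = {}
        self.n_examples = 0

    def initialize(
        self,
        get_examples: Callable[[], Iterable[Any]],
        *,
        nlp: Any = None,
        **settings: Any,
    ) -> None:
        # The callback gives access to the training data on demand.
        self.n_examples = len(list(get_examples()))
        self.settings = settings


class StatelessComponent:
    """Hypothetical component that does not implement `initialize`."""


def initialize_pipeline(
    pipeline: List[Tuple[str, Any]],
    get_examples: Callable[[], Iterable[Any]],
    component_settings: Dict[str, Dict[str, Any]],
    nlp: Any = None,
) -> List[str]:
    """Call `initialize` on every component that implements it."""
    initialized = []
    for name, component in pipeline:
        init = getattr(component, "initialize", None)
        if init is None:
            continue  # nothing to set up for this component
        init(get_examples, nlp=nlp, **component_settings.get(name, {}))
        initialized.append(name)
    return initialized


ner = RecordingComponent()
pipeline = [("sentencizer", StatelessComponent()), ("ner", ner)]
settings = {"ner": {"labels": ["ORG", "PERSON"]}}
done = initialize_pipeline(pipeline, lambda: ["example 1", "example 2"], settings)
print(done, ner.settings, ner.n_examples)
```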
|
||||||
|
|
||||||
|
The initialization step allows the config to define **all settings** required
|
||||||
|
for the pipeline, while keeping a separation between settings and functions that
|
||||||
|
should only be used **before training** to set up the initial pipeline, and
|
||||||
|
logic and configuration that need to be available **at runtime**. Without that
|
||||||
|
separation, TODO:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
#### Initializing labels {#initialization-labels}
|
||||||
|
|
||||||
|
Built-in pipeline components like the
|
||||||
|
[`EntityRecognizer`](/api/entityrecognizer) or
|
||||||
|
[`DependencyParser`](/api/dependencyparser) need to know their available labels
|
||||||
|
and associated internal meta information to initialize their model weights.
|
||||||
|
Using the `get_examples` callback provided on initialization, they're able to
|
||||||
|
**read the labels off the training data** automatically, which is very
|
||||||
|
convenient – but it can also slow down the training process to compute this
|
||||||
|
information on every run.
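
The trade-off can be sketched in plain Python. `ToyLabeler` below is a hypothetical stand-in, not a spaCy class, and its examples are simple dictionaries rather than `Example` objects. It only mirrors the documented `initialize(get_examples, *, nlp=None, labels=None)` signature to show the difference between scanning the data and accepting a pre-computed label set.

```python
from typing import Callable, Dict, Iterable, List, Optional


class ToyLabeler:
    """Hypothetical stand-in for a trainable component (not a spaCy class)."""

    def __init__(self) -> None:
        self.labels: List[str] = []

    def initialize(
        self,
        get_examples: Callable[[], Iterable[Dict[str, List[str]]]],
        *,
        nlp: Optional[object] = None,
        labels: Optional[Iterable[str]] = None,
    ) -> None:
        if labels is not None:
            # Fast path: trust the pre-computed label set, never touch the data.
            self.labels = list(labels)
            return
        # Slow path: scan every example to collect the labels.
        seen = set()
        for example in get_examples():
            seen.update(example["labels"])
        self.labels = sorted(seen)


def get_examples() -> List[Dict[str, List[str]]]:
    # Toy data; real components receive Example objects instead of dicts.
    return [{"labels": ["PERSON", "ORG"]}, {"labels": ["ORG", "GPE"]}]


inferred = ToyLabeler()
inferred.initialize(get_examples)

preset = ToyLabeler()
preset.initialize(lambda: [], labels=inferred.labels)
print(inferred.labels, preset.labels)
```

With the labels supplied up front, the data callback can be empty, which is the same pattern as the `initialize(lambda: [], nlp=nlp, labels=labels)` calls in the API examples.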
|
||||||
|
|
||||||
|
The [`init labels`](/api/cli#init-labels) command lets you auto-generate JSON
|
||||||
|
files containing the label data for all supported components. You can then pass
|
||||||
|
in the labels in the `[initialize]` settings for the respective components to
|
||||||
|
allow them to initialize faster.
|
||||||
|
|
||||||
|
> #### config.cfg
|
||||||
|
>
|
||||||
|
> ```ini
|
||||||
|
> [initialize.components.ner]
|
||||||
|
>
|
||||||
|
> [initialize.components.ner.labels]
|
||||||
|
> @readers = "spacy.read_labels.v1"
|
||||||
|
> path = "corpus/labels/ner.json"
|
||||||
|
> ```
|
||||||
|
|
||||||
|
```cli
|
||||||
|
$ python -m spacy init labels config.cfg ./corpus --paths.train ./corpus/train.spacy
|
||||||
|
```
|
||||||
|
|
||||||
|
Under the hood, the command delegates to the `label_data` property of the
|
||||||
|
pipeline components, for instance
|
||||||
|
[`EntityRecognizer.label_data`](/api/entityrecognizer#label_data).
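
Because the exported files are plain JSON, they can be inspected with standard tools. The following sketch uses only the standard library and assumes a tagger-style label set, that is, a tuple of strings as documented for `Tagger.label_data`. The file name and the labels are invented for illustration; real files should come from `init labels`.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical label data shaped like a tagger's `label_data` tuple.
label_data = ("ADJ", "NOUN", "VERB")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tagger.json"
    # Write the labels as JSON, then read them back.
    path.write_text(json.dumps(label_data), encoding="utf8")
    loaded = json.loads(path.read_text(encoding="utf8"))

# JSON has no tuple type, so the labels come back as a list.
print(loaded)
```

Components with richer label data, such as the parser or entity recognizer, use nested dictionaries instead, which is one reason to rely on the generated files rather than writing them by hand.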
|
||||||
|
|
||||||
|
<Infobox variant="warning" title="Important note">
|
||||||
|
|
||||||
|
The JSON format differs for each component and some components need additional
|
||||||
|
meta information about their labels. The format exported by
|
||||||
|
[`init labels`](/api/cli#init-labels) matches what the components need, so you
|
||||||
|
should always let spaCy **auto-generate the labels** for you.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
## Data utilities {#data}
|
## Data utilities {#data}
|
||||||
|
@ -1298,8 +1375,8 @@ of being dropped.
|
||||||
|
|
||||||
> - [`nlp`](/api/language): The `nlp` object with the pipeline components and
|
> - [`nlp`](/api/language): The `nlp` object with the pipeline components and
|
||||||
> their models.
|
> their models.
|
||||||
> - [`nlp.initialize`](/api/language#initialize): Start the training and return
|
> - [`nlp.initialize`](/api/language#initialize): Initialize the pipeline and
|
||||||
> an optimizer to update the component model weights.
|
> return an optimizer to update the component model weights.
|
||||||
> - [`Optimizer`](https://thinc.ai/docs/api-optimizers): Function that holds
|
> - [`Optimizer`](https://thinc.ai/docs/api-optimizers): Function that holds
|
||||||
> state between updates.
|
> state between updates.
|
||||||
> - [`nlp.update`](/api/language#update): Update component models with examples.
|
> - [`nlp.update`](/api/language#update): Update component models with examples.
|
||||||
|
|