Merge pull request #6002 from svlandeg/feature/vectors-docs

2025-08-08 22:24:55 +03:00 · 2020-08-31 16:25:18 +02:00 · 97ffb4ed05
commit 97ffb4ed05
parent ec14744ee4 fe6c08218e
6 changed files with 172 additions and 46 deletions
--- a/spacy/cli/train.py
+++ b/spacy/cli/train.py
@ -270,9 +270,9 @@ def train_while_improving(

        epoch (int): How many passes over the data have been completed.
        step (int): How many steps have been completed.
-        score (float): The main score form the last evaluation.
+        score (float): The main score from the last evaluation.
        other_scores: The other scores from the last evaluation.
-        loss: The accumulated losses throughout training.
+        losses: The accumulated losses throughout training.
        checkpoints: A list of previous results, where each result is a
            (score, step, epoch) tuple.
    """
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@ -118,11 +118,11 @@ Instead of defining its own `Tok2Vec` instance, a model architecture like
 [Tagger](/api/architectures#tagger) can define a listener as its `tok2vec`
 argument that connects to the shared `tok2vec` component in the pipeline.

-| Name        | Description                                                                                                                                                                                                                                                                                                    |
-| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `width`     | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~                                                                                                                                                                                                               |
-| `upstream`  | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ |
-| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~                                                                                                                                                                                                                                         |
+| Name        | Description                                                                                                                                                                                                                                                                                                                          |
+| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `width`     | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~                                                                                                                                                                                                                                     |
+| `upstream`  | A string to identify the "upstream" `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ |
+| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~                                                                                                                                                                                                                                                               |

 ### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}

@ -323,11 +323,11 @@ for details and system requirements.

 Load and wrap a transformer model from the
 [HuggingFace `transformers`](https://huggingface.co/transformers) library. You
-can any transformer that has pretrained weights and a PyTorch implementation.
-The `name` variable is passed through to the underlying library, so it can be
-either a string or a path. If it's a string, the pretrained weights will be
-downloaded via the transformers library if they are not already available
-locally.
+can use any transformer that has pretrained weights and a PyTorch
+implementation. The `name` variable is passed through to the underlying library,
+so it can be either a string or a path. If it's a string, the pretrained weights
+will be downloaded via the transformers library if they are not already
+available locally.

 In order to support longer documents, the
 [TransformerModel](/api/architectures#TransformerModel) layer allows you to pass
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@ -4,6 +4,7 @@ menu:
  - ['spacy', 'spacy']
  - ['displacy', 'displacy']
  - ['registry', 'registry']
+  - ['Loggers', 'loggers']
  - ['Batchers', 'batchers']
  - ['Data & Alignment', 'gold']
  - ['Utility Functions', 'util']
@ -316,6 +317,7 @@ factories.
 | `initializers`    | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers).                                                                                                                                                         |
 | `languages`       | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                                                                 |
 | `layers`          | Registry for functions that create [layers](https://thinc.ai/docs/api-layers).                                                                                                                                                                     |
+| `loggers`         | Registry for functions that log [training results](/usage/training).                                                                                                                                                                               |
 | `lookups`         | Registry for large lookup tables available via `vocab.lookups`.                                                                                                                                                                                    |
 | `losses`          | Registry for functions that create [losses](https://thinc.ai/docs/api-loss).                                                                                                                                                                       |
 | `optimizers`      | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers).                                                                                                                                                             |
@ -340,7 +342,7 @@ See the [`Transformer`](/api/transformer) API reference and
 >     def annotation_setter(docs, trf_data) -> None:
 >        # Set annotations on the docs
 >
->     return annotation_sette
+>     return annotation_setter
 > ```

 | Registry name                                               | Description                                                                                                                                                                                                                                       |
@ -348,6 +350,70 @@ See the [`Transformer`](/api/transformer) API reference and
 | [`span_getters`](/api/transformer#span_getters)             | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences.                                                                                                      |
 | [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |

+## Loggers {#loggers source="spacy/gold/loggers.py" new="3"}
+
+A logger records the training results. When a logger is created, two functions
+are returned: one for logging the information for each training step, and a
+second function that is called to finalize the logging when the training is
+finished. To log each training step, a
+[dictionary](/usage/training#custom-logging) is passed on from the
+[training script](/api/cli#train), including information such as the training
+loss and the accuracy scores on the development set.
+
+There are two built-in logging functions: a logger printing results to the
+console in tabular format (which is the default), and one that also sends the
+results to a [Weights & Biases](https://www.wandb.com/) dashboard.
+Instead of using one of the built-in loggers listed here, you can also
+[implement your own](/usage/training#custom-logging).
+
+> #### Example config
+>
+> ```ini
+> [training.logger]
+> @loggers = "spacy.ConsoleLogger.v1"
+> ```
+
+#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"}
+
+Writes the results of a training step to the console in a tabular format.
+
+#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"}
+
+> #### Installation
+>
+> ```bash
+> $ pip install wandb
+> $ wandb login
+> ```
+
+Built-in logger that sends the results of each training step to the dashboard of
+the [Weights & Biases](https://www.wandb.com/) tool. To use this logger, Weights
+& Biases should be installed, and you should be logged in. The logger will send
+the full config file to W&B, as well as various system information such as
+memory utilization, network traffic, disk IO, GPU statistics, etc. This will
+also include information such as your hostname and operating system, as well as
+the location of your Python executable.
+
+Note that by default, the full (interpolated) training config file is sent over
+to the W&B dashboard. If you prefer to exclude certain information such as path
+names, you can list those fields in "dot notation" in the `remove_config_values`
+parameter. These fields will then be removed from the config before uploading,
+but will otherwise remain in the config file stored on your local system.
+
+> #### Example config
+>
+> ```ini
+> [training.logger]
+> @loggers = "spacy.WandbLogger.v1"
+> project_name = "monitor_spacy_training"
+> remove_config_values = ["paths.train", "paths.dev", "training.dev_corpus.path", "training.train_corpus.path"]
+> ```
+
+| Name                   | Description                                                                                                                           |
+| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| `project_name`         | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
+| `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~                              |
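
The "dot notation" removal described above can be illustrated with a small standalone sketch. This is not the logger's actual implementation: the helper name `remove_dotted_keys` and the sample config values are assumptions made for illustration. It only shows how a key such as `paths.train` could be resolved against a nested config and dropped from a copy, while the original stays intact.

```python
from copy import deepcopy
from typing import Any, Dict, List


def remove_dotted_keys(config: Dict[str, Any], dotted_keys: List[str]) -> Dict[str, Any]:
    """Return a copy of `config` without the entries named in dot notation.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    cleaned = deepcopy(config)  # leave the caller's config untouched
    for dotted in dotted_keys:
        *parents, leaf = dotted.split(".")
        node = cleaned
        # Walk down to the section that holds the final key
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(leaf, None)  # silently skip keys that are not present
    return cleaned


# Sample values are made up for this example
config = {
    "paths": {"train": "/data/train.spacy", "dev": "/data/dev.spacy"},
    "training": {"dropout": 0.1},
}
uploaded = remove_dotted_keys(config, ["paths.train", "paths.dev"])
```

Working on a deep copy mirrors the behaviour described above: the listed fields are removed from what gets uploaded, but the config on the local system is unchanged.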
+
 ## Batchers {#batchers source="spacy/gold/batchers.py" new="3"}

 A data batcher implements a batching strategy that essentially turns a stream of
--- a/website/docs/api/transformer.md
+++ b/website/docs/api/transformer.md
@ -25,8 +25,8 @@ work out-of-the-box.

 </Infobox>

-This pipeline component lets you use transformer models in your pipeline.
-Supports all models that are available via the
+This pipeline component lets you use transformer models in your pipeline. It
+supports all models that are available via the
 [HuggingFace `transformers`](https://huggingface.co/transformers) library.
 Usually you will connect subsequent components to the shared transformer using
 the [TransformerListener](/api/architectures#TransformerListener) layer. This
@ -50,8 +50,8 @@ The default config is defined by the pipeline component factory and describes
 how the component should be configured. You can override its settings via the
 `config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
 [`config.cfg` for training](/usage/training#config). See the
-[model architectures](/api/architectures) documentation for details on the
-architectures and their arguments and hyperparameters.
+[model architectures](/api/architectures#transformers) documentation for details
+on the transformer architectures and their arguments and hyperparameters.

 > #### Example
 >
@ -61,11 +61,11 @@ architectures and their arguments and hyperparameters.
 > nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
 > ```

-| Setting             | Description                                                                                                                                                                                                                                                                                                            |
-| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `max_batch_items`   | Maximum size of a padded batch. Defaults to `4096`. ~~int~~                                                                                                                                                                                                                                                            |
-| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
-| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~                                                                                                                         |
+| Setting             | Description                                                                                                                                                                                                                                                                                                           |
+| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `max_batch_items`   | Maximum size of a padded batch. Defaults to `4096`. ~~int~~                                                                                                                                                                                                                                                           |
+| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~         |
+| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~                                                                                                                        |

 ```python
 https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
@ -102,14 +102,14 @@ attribute. You can also provide a callback to set additional annotations. In
 your application, you would normally use a shortcut for this and instantiate the
 component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).

-| Name                | Description                                                                                                                                                                                                                                                                              |
-| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab`             | The shared vocabulary. ~~Vocab~~                                                                                                                                                                                                                                                         |
-| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~                                                       |
-| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. By default, no annotations are set. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
-| _keyword-only_      |                                                                                                                                                                                                                                                                                          |
-| `name`              | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~                                                                                                                                                                                      |
-| `max_batch_items`   | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~                                                                                                                                                                                                                            |
+| Name                | Description                                                                                                                                                                                                                                                                             |
+| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab`             | The shared vocabulary. ~~Vocab~~                                                                                                                                                                                                                                                        |
+| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~                                                      |
+| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
+| _keyword-only_      |                                                                                                                                                                                                                                                                                         |
+| `name`              | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~                                                                                                                                                                                     |
+| `max_batch_items`   | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~                                                                                                                                                                                                                           |

 ## Transformer.\_\_call\_\_ {#call tag="method"}

@ -383,9 +383,8 @@ return tensors that refer to a whole padded batch of documents. These tensors
 are wrapped into the
 [FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The
 `FullTransformerBatch` then splits out the per-document data, which is handled
-by this class. Instances of this class
-are`typically assigned to the [Doc._.trf_data`](/api/transformer#custom-attributes)
-extension attribute.
+by this class. Instances of this class are typically assigned to the
+[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute.

 | Name      | Description                                                                                                                                                                                                                                                                                                                                             |
 | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@ -447,8 +446,9 @@ overlap, and you can also omit sections of the Doc if they are not relevant.

 Span getters can be referenced in the `[components.transformer.model.get_spans]`
 block of the config to customize the sequences processed by the transformer. You
-can also register custom span getters using the `@spacy.registry.span_getters`
-decorator.
+can also register
+[custom span getters](/usage/embeddings-transformers#transformers-training-custom-settings)
+using the `@spacy.registry.span_getters` decorator.

 > #### Example
 >
@ -518,7 +518,7 @@ right context.

 ## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}

-Annotation setters are functions that that take a batch of `Doc` objects and a
+Annotation setters are functions that take a batch of `Doc` objects and a
 [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
 additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
 You can register custom annotation setters using the
@ -551,6 +551,6 @@ The following built-in functions are available:
 The component sets the following
 [custom extension attributes](/usage/processing-pipeline#custom-components-attributes):

-| Name           | Description                                                              |
-| -------------- | ------------------------------------------------------------------------ |
-| `Doc.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |
+| Name             | Description                                                              |
+| ---------------- | ------------------------------------------------------------------------ |
+| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |
--- a/website/docs/usage/embeddings-transformers.md
+++ b/website/docs/usage/embeddings-transformers.md
@ -251,13 +251,14 @@ for doc in nlp.pipe(["some text", "some other text"]):
    tokvecs = doc._.trf_data.tensors[-1]
 ```

-You can customize how the [`Transformer`](/api/transformer) component sets
-annotations onto the [`Doc`](/api/doc), by changing the `annotation_setter`.
-This callback will be called with the raw input and output data for the whole
-batch, along with the batch of `Doc` objects, allowing you to implement whatever
-you need. The annotation setter is called with a batch of [`Doc`](/api/doc)
-objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
-containing the transformers data for the batch.
+You can also customize how the [`Transformer`](/api/transformer) component sets
+annotations onto the [`Doc`](/api/doc), by specifying a custom
+`annotation_setter`. This callback will be called with the raw input and output
+data for the whole batch, along with the batch of `Doc` objects, allowing you to
+implement whatever you need. The annotation setter is called with a batch of
+[`Doc`](/api/doc) objects and a
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
+transformers data for the batch.

 ```python
 def custom_annotation_setter(docs, trf_data):
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@ -605,6 +605,65 @@ to your Python file. Before loading the config, spaCy will import the
 $ python -m spacy train config.cfg --output ./output --code ./functions.py
 ```

+#### Example: Custom logging function {#custom-logging}
+
+During training, the results of each step are passed to a logger function in a
+dictionary providing the following information:
+
+| Key            | Value                                                                                          |
+| -------------- | ---------------------------------------------------------------------------------------------- |
+| `epoch`        | How many passes over the data have been completed. ~~int~~                                     |
+| `step`         | How many steps have been completed. ~~int~~                                                    |
+| `score`        | The main score from the last evaluation, measured on the dev set. ~~float~~                    |
+| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~         |
+| `losses`       | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~                 |
+| `checkpoints`  | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ |
+
+By default, these results are written to the console with the
+[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support
+for writing the log files to [Weights & Biases](https://www.wandb.com/) with
+the [`WandbLogger`](/api/top-level#WandbLogger). But you can easily implement
+your own logger as well, for instance to write the tabular results to file:
+
+```python
+### functions.py
+from typing import Tuple, Callable, Dict, Any
+import spacy
+from pathlib import Path
+
+@spacy.registry.loggers("my_custom_logger.v1")
+def custom_logger(log_path):
+    def setup_logger(nlp: "Language") -> Tuple[Callable, Callable]:
+        with Path(log_path).open("w") as file_:
+            file_.write("step\t")
+            file_.write("score\t")
+            for pipe in nlp.pipe_names:
+                file_.write(f"loss_{pipe}\t")
+            file_.write("\n")
+
+        def log_step(info: Dict[str, Any]):
+            with Path(log_path).open("a") as file_:
+                file_.write(f"{info['step']}\t")
+                file_.write(f"{info['score']}\t")
+                for pipe in nlp.pipe_names:
+                    file_.write(f"{info['losses'][pipe]}\t")
+                file_.write("\n")
+
+        def finalize():
+            pass
+
+        return log_step, finalize
+
+    return setup_logger
+```
+
+```ini
+### config.cfg (excerpt)
+[training.logger]
+@loggers = "my_custom_logger.v1"
+log_path = "my_file.tab"
+```
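
To make the two-callable contract easier to see in isolation, here is a minimal sketch that runs without spaCy. It assumes the calling sequence described in this section (set up once, call `log_step` with the info dictionary for each step, call `finalize` at the end). `FakeNLP` and `in_memory_logger` are hypothetical names for this example, and the registry decorator is left out so the snippet stays self-contained.

```python
from typing import Any, Callable, Dict, List, Tuple


class FakeNLP:
    """Hypothetical stand-in for the pipeline, exposing only `pipe_names`."""

    pipe_names = ["tagger", "parser"]


def in_memory_logger(rows: List[str]) -> Callable:
    """Collect tab-separated rows in a list instead of writing to a file."""

    def setup_logger(nlp) -> Tuple[Callable, Callable]:
        # Header row: one loss column per pipeline component
        header = ["step", "score"] + [f"loss_{p}" for p in nlp.pipe_names]
        rows.append("\t".join(header))

        def log_step(info: Dict[str, Any]) -> None:
            # Fall back to 0.0 for components that reported no loss
            losses = [str(info["losses"].get(p, 0.0)) for p in nlp.pipe_names]
            rows.append("\t".join([str(info["step"]), str(info["score"])] + losses))

        def finalize() -> None:
            rows.append("done")

        return log_step, finalize

    return setup_logger


# Simulate what a training loop would do with the logger
rows: List[str] = []
log_step, finalize = in_memory_logger(rows)(FakeNLP())
log_step({"epoch": 0, "step": 100, "score": 0.5, "other_scores": {},
          "losses": {"tagger": 1.5}, "checkpoints": []})
finalize()
```

Using `.get(p, 0.0)` is a design choice in this sketch: it avoids a `KeyError` if a component in `pipe_names` has no entry in `losses` for a given step.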
+
 #### Example: Custom batch size schedule {#custom-code-schedule}

 For example, let's say you've implemented your own batch size schedule to use