mirror of
https://github.com/explosion/spaCy.git
synced 2025-02-03 13:14:11 +03:00
Merge pull request #6002 from svlandeg/feature/vectors-docs
This commit is contained in:
commit
97ffb4ed05
|
@ -270,9 +270,9 @@ def train_while_improving(
|
|||
|
||||
epoch (int): How many passes over the data have been completed.
|
||||
step (int): How many steps have been completed.
|
||||
score (float): The main score form the last evaluation.
|
||||
score (float): The main score from the last evaluation.
|
||||
other_scores: : The other scores from the last evaluation.
|
||||
loss: The accumulated losses throughout training.
|
||||
losses: The accumulated losses throughout training.
|
||||
checkpoints: A list of previous results, where each result is a
|
||||
(score, step, epoch) tuple.
|
||||
"""
|
||||
|
|
|
@ -118,11 +118,11 @@ Instead of defining its own `Tok2Vec` instance, a model architecture like
|
|||
[Tagger](/api/architectures#tagger) can define a listener as its `tok2vec`
|
||||
argument that connects to the shared `tok2vec` component in the pipeline.
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ |
|
||||
| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
| Name | Description |
|
||||
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ |
|
||||
| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
|
||||
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}
|
||||
|
||||
|
@ -323,11 +323,11 @@ for details and system requirements.
|
|||
|
||||
Load and wrap a transformer model from the
|
||||
[HuggingFace `transformers`](https://huggingface.co/transformers) library. You
|
||||
can any transformer that has pretrained weights and a PyTorch implementation.
|
||||
The `name` variable is passed through to the underlying library, so it can be
|
||||
either a string or a path. If it's a string, the pretrained weights will be
|
||||
downloaded via the transformers library if they are not already available
|
||||
locally.
|
||||
can use any transformer that has pretrained weights and a PyTorch
|
||||
implementation. The `name` variable is passed through to the underlying library,
|
||||
so it can be either a string or a path. If it's a string, the pretrained weights
|
||||
will be downloaded via the transformers library if they are not already
|
||||
available locally.
|
||||
|
||||
In order to support longer documents, the
|
||||
[TransformerModel](/api/architectures#TransformerModel) layer allows you to pass
|
||||
|
|
|
@ -4,6 +4,7 @@ menu:
|
|||
- ['spacy', 'spacy']
|
||||
- ['displacy', 'displacy']
|
||||
- ['registry', 'registry']
|
||||
- ['Loggers', 'loggers']
|
||||
- ['Batchers', 'batchers']
|
||||
- ['Data & Alignment', 'gold']
|
||||
- ['Utility Functions', 'util']
|
||||
|
@ -316,6 +317,7 @@ factories.
|
|||
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
|
||||
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
|
||||
| `loggers` | Registry for functions that log [training results](/usage/training). |
|
||||
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
||||
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
|
||||
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
|
||||
|
@ -340,7 +342,7 @@ See the [`Transformer`](/api/transformer) API reference and
|
|||
> def annotation_setter(docs, trf_data) -> None:
|
||||
> # Set annotations on the docs
|
||||
>
|
||||
> return annotation_sette
|
||||
> return annotation_setter
|
||||
> ```
|
||||
|
||||
| Registry name | Description |
|
||||
|
@ -348,6 +350,70 @@ See the [`Transformer`](/api/transformer) API reference and
|
|||
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
|
||||
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
|
||||
|
||||
## Loggers {#loggers source="spacy/gold/loggers.py" new="3"}
|
||||
|
||||
A logger records the training results. When a logger is created, two functions
|
||||
are returned: one for logging the information for each training step, and a
|
||||
second function that is called to finalize the logging when the training is
|
||||
finished. To log each training step, a
|
||||
[dictionary](/usage/training#custom-logging) is passed on from the
|
||||
[training script](/api/cli#train), including information such as the training
|
||||
loss and the accuracy scores on the development set.
|
||||
|
||||
There are two built-in logging functions: a logger printing results to the
|
||||
console in tabular format (which is the default), and one that also sends the
|
||||
results to a [Weights & Biases](https://www.wandb.com/) dashboard.
|
||||
Instead of using one of the built-in loggers listed here, you can also
|
||||
[implement your own](/usage/training#custom-logging).
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [training.logger]
|
||||
> @loggers = "spacy.ConsoleLogger.v1"
|
||||
> ```
|
||||
|
||||
#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"}
|
||||
|
||||
Writes the results of a training step to the console in a tabular format.
|
||||
|
||||
#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"}
|
||||
|
||||
> #### Installation
|
||||
>
|
||||
> ```bash
|
||||
> $ pip install wandb
|
||||
> $ wandb login
|
||||
> ```
|
||||
|
||||
Built-in logger that sends the results of each training step to the dashboard of
|
||||
the [Weights & Biases](https://www.wandb.com/) tool. To use this logger, Weights
|
||||
& Biases should be installed, and you should be logged in. The logger will send
|
||||
the full config file to W&B, as well as various system information such as
|
||||
memory utilization, network traffic, disk IO, GPU statistics, etc. This will
|
||||
also include information such as your hostname and operating system, as well as
|
||||
the location of your Python executable.
|
||||
|
||||
Note that by default, the full (interpolated) training config file is sent over
|
||||
to the W&B dashboard. If you prefer to exclude certain information such as path
|
||||
names, you can list those fields in "dot notation" in the `remove_config_values`
|
||||
parameter. These fields will then be removed from the config before uploading,
|
||||
but will otherwise remain in the config file stored on your local system.
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [training.logger]
|
||||
> @loggers = "spacy.WandbLogger.v1"
|
||||
> project_name = "monitor_spacy_training"
|
||||
> remove_config_values = ["paths.train", "paths.dev", "training.dev_corpus.path", "training.train_corpus.path"]
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
|
||||
| `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
|
||||
|
||||
## Batchers {#batchers source="spacy/gold/batchers.py" new="3"}
|
||||
|
||||
A data batcher implements a batching strategy that essentially turns a stream of
|
||||
|
|
|
@ -25,8 +25,8 @@ work out-of-the-box.
|
|||
|
||||
</Infobox>
|
||||
|
||||
This pipeline component lets you use transformer models in your pipeline.
|
||||
Supports all models that are available via the
|
||||
This pipeline component lets you use transformer models in your pipeline. It
|
||||
supports all models that are available via the
|
||||
[HuggingFace `transformers`](https://huggingface.co/transformers) library.
|
||||
Usually you will connect subsequent components to the shared transformer using
|
||||
the [TransformerListener](/api/architectures#TransformerListener) layer. This
|
||||
|
@ -50,8 +50,8 @@ The default config is defined by the pipeline component factory and describes
|
|||
how the component should be configured. You can override its settings via the
|
||||
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
|
||||
[`config.cfg` for training](/usage/training#config). See the
|
||||
[model architectures](/api/architectures) documentation for details on the
|
||||
architectures and their arguments and hyperparameters.
|
||||
[model architectures](/api/architectures#transformers) documentation for details
|
||||
on the transformer architectures and their arguments and hyperparameters.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -61,11 +61,11 @@ architectures and their arguments and hyperparameters.
|
|||
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
|
||||
> ```
|
||||
|
||||
| Setting | Description |
|
||||
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
|
||||
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
|
||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
|
||||
| Setting | Description |
|
||||
| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
|
||||
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
|
||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
|
||||
|
||||
```python
|
||||
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
|
||||
|
@ -102,14 +102,14 @@ attribute. You can also provide a callback to set additional annotations. In
|
|||
your application, you would normally use a shortcut for this and instantiate the
|
||||
component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
|
||||
|
||||
| Name | Description |
|
||||
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ |
|
||||
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. By default, no annotations are set. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||
| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ |
|
||||
| Name | Description |
|
||||
| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ |
|
||||
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||
| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ |
|
||||
|
||||
## Transformer.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
@ -383,9 +383,8 @@ return tensors that refer to a whole padded batch of documents. These tensors
|
|||
are wrapped into the
|
||||
[FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The
|
||||
`FullTransformerBatch` then splits out the per-document data, which is handled
|
||||
by this class. Instances of this class
|
||||
are`typically assigned to the [Doc._.trf_data`](/api/transformer#custom-attributes)
|
||||
extension attribute.
|
||||
by this class. Instances of this class are typically assigned to the
|
||||
[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute.
|
||||
|
||||
| Name | Description |
|
||||
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
|
@ -447,8 +446,9 @@ overlap, and you can also omit sections of the Doc if they are not relevant.
|
|||
|
||||
Span getters can be referenced in the `[components.transformer.model.get_spans]`
|
||||
block of the config to customize the sequences processed by the transformer. You
|
||||
can also register custom span getters using the `@spacy.registry.span_getters`
|
||||
decorator.
|
||||
can also register
|
||||
[custom span getters](/usage/embeddings-transformers#transformers-training-custom-settings)
|
||||
using the `@spacy.registry.span_getters` decorator.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -518,7 +518,7 @@ right context.
|
|||
|
||||
## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}
|
||||
|
||||
Annotation setters are functions that that take a batch of `Doc` objects and a
|
||||
Annotation setters are functions that take a batch of `Doc` objects and a
|
||||
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
|
||||
additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
|
||||
You can register custom annotation setters using the
|
||||
|
@ -551,6 +551,6 @@ The following built-in functions are available:
|
|||
The component sets the following
|
||||
[custom extension attributes](/usage/processing-pipeline#custom-components-attributes):
|
||||
|
||||
| Name | Description |
|
||||
| -------------- | ------------------------------------------------------------------------ |
|
||||
| `Doc.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |
|
||||
| Name | Description |
|
||||
| ---------------- | ------------------------------------------------------------------------ |
|
||||
| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |
|
||||
|
|
|
@ -251,13 +251,14 @@ for doc in nlp.pipe(["some text", "some other text"]):
|
|||
tokvecs = doc._.trf_data.tensors[-1]
|
||||
```
|
||||
|
||||
You can customize how the [`Transformer`](/api/transformer) component sets
|
||||
annotations onto the [`Doc`](/api/doc), by changing the `annotation_setter`.
|
||||
This callback will be called with the raw input and output data for the whole
|
||||
batch, along with the batch of `Doc` objects, allowing you to implement whatever
|
||||
you need. The annotation setter is called with a batch of [`Doc`](/api/doc)
|
||||
objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
|
||||
containing the transformers data for the batch.
|
||||
You can also customize how the [`Transformer`](/api/transformer) component sets
|
||||
annotations onto the [`Doc`](/api/doc), by specifying a custom
|
||||
`annotation_setter`. This callback will be called with the raw input and output
|
||||
data for the whole batch, along with the batch of `Doc` objects, allowing you to
|
||||
implement whatever you need. The annotation setter is called with a batch of
|
||||
[`Doc`](/api/doc) objects and a
|
||||
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
|
||||
transformers data for the batch.
|
||||
|
||||
```python
|
||||
def custom_annotation_setter(docs, trf_data):
|
||||
|
|
|
@ -605,6 +605,65 @@ to your Python file. Before loading the config, spaCy will import the
|
|||
$ python -m spacy train config.cfg --output ./output --code ./functions.py
|
||||
```
|
||||
|
||||
#### Example: Custom logging function {#custom-logging}
|
||||
|
||||
During training, the results of each step are passed to a logger function in a
|
||||
dictionary providing the following information:
|
||||
|
||||
| Key | Value |
|
||||
| -------------- | ---------------------------------------------------------------------------------------------- |
|
||||
| `epoch` | How many passes over the data have been completed. ~~int~~ |
|
||||
| `step` | How many steps have been completed. ~~int~~ |
|
||||
| `score` | The main score from the last evaluation, measured on the dev set. ~~float~~ |
|
||||
| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ |
|
||||
| `losses` | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ |
|
||||
| `checkpoints` | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ |
|
||||
|
||||
By default, these results are written to the console with the
|
||||
[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support
|
||||
for writing the log files to [Weights & Biases](https://www.wandb.com/) with
|
||||
the [`WandbLogger`](/api/top-level#WandbLogger). But you can easily implement
|
||||
your own logger as well, for instance to write the tabular results to file:
|
||||
|
||||
```python
|
||||
### functions.py
|
||||
from typing import Tuple, Callable, Dict, Any
|
||||
import spacy
|
||||
from pathlib import Path
|
||||
|
||||
@spacy.registry.loggers("my_custom_logger.v1")
|
||||
def custom_logger(log_path):
|
||||
def setup_logger(nlp: "Language") -> Tuple[Callable, Callable]:
|
||||
with Path(log_path).open("w") as file_:
|
||||
file_.write("step\t")
|
||||
file_.write("score\t")
|
||||
for pipe in nlp.pipe_names:
|
||||
file_.write(f"loss_{pipe}\t")
|
||||
file_.write("\n")
|
||||
|
||||
def log_step(info: Dict[str, Any]):
|
||||
with Path(log_path).open("a") as file_:
|
||||
file_.write(f"{info['step']}\t")
|
||||
file_.write(f"{info['score']}\t")
|
||||
for pipe in nlp.pipe_names:
|
||||
file_.write(f"{info['losses'][pipe]}\t")
|
||||
file_.write("\n")
|
||||
|
||||
def finalize():
|
||||
pass
|
||||
|
||||
return log_step, finalize
|
||||
|
||||
return setup_logger
|
||||
```
|
||||
|
||||
```ini
|
||||
### config.cfg (excerpt)
|
||||
[training.logger]
|
||||
@loggers = "my_custom_logger.v1"
|
||||
file_path = "my_file.tab"
|
||||
```
|
||||
|
||||
#### Example: Custom batch size schedule {#custom-code-schedule}
|
||||
|
||||
For example, let's say you've implemented your own batch size schedule to use
|
||||
|
|
Loading…
Reference in New Issue
Block a user