fix component constructors, update, begin_training, reference to GoldParse

svlandeg 2020-07-07 19:17:19 +02:00
parent 14a796e3f9
commit 2b60e894cb
12 changed files with 265 additions and 238 deletions

View File

@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import DependencyParser
> parser = DependencyParser(nlp.vocab)
> parser = DependencyParser(nlp.vocab, parser_model)
> parser.from_disk("/path/to/model")
> ```
| Name | Type | Description |
| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `DependencyParser` | The newly constructed object. |
| Name | Type | Description |
| ----------- | ------------------ | ------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `DependencyParser` | The newly constructed object. |
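For reference, a minimal sketch of the string-name shortcut described above, assuming a blank English pipeline and the v2-style `add_pipe` call (the factory then supplies the component's default model):

```python
import spacy

nlp = spacy.blank("en")
# create the component via its string name instead of the class constructor
parser = nlp.create_pipe("parser")
nlp.add_pipe(parser)
```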
## DependencyParser.\_\_call\_\_ {#call tag="method"}
@ -126,26 +126,28 @@ Modify a batch of documents, using pre-computed scores.
## DependencyParser.update {#update tag="method"}
Learn from a batch of documents and gold-standard information, updating the
pipe's model. Delegates to [`predict`](/api/dependencyparser#predict) and
Learn from a batch of [`Example`](/api/example) objects, updating the pipe's
model. Delegates to [`predict`](/api/dependencyparser#predict) and
[`get_loss`](/api/dependencyparser#get_loss).
> #### Example
>
> ```python
> parser = DependencyParser(nlp.vocab)
> parser = DependencyParser(nlp.vocab, parser_model)
> losses = {}
> optimizer = nlp.begin_training()
> parser.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
> parser.update(examples, losses=losses, sgd=optimizer)
> ```
| Name | Type | Description |
| -------- | -------- | -------------------------------------------------------------------------------------------- |
| `docs` | iterable | A batch of documents to learn from. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `drop` | float | The dropout rate. |
| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
| Name | Type | Description |
| ----------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
| _keyword-only_ | | |
| `drop` | float | The dropout rate. |
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/dependencyparser#set_annotations). |
| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
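A sketch of a full `update` call, assuming `parser` is the pipeline's parser component, `train_data` holds `(text, annotation_dict)` pairs and `optimizer` was returned by `nlp.begin_training()`:

```python
from spacy.gold import Example

examples = []
for text, annots in train_data:
    # the predicted Doc comes from the raw text, the reference from the dict
    doc = nlp.make_doc(text)
    examples.append(Example.from_dict(doc, annots))

losses = {}
parser.update(examples, drop=0.2, losses=losses, sgd=optimizer)
```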
## DependencyParser.get_loss {#get_loss tag="method"}
@ -169,8 +171,8 @@ predicted scores.
## DependencyParser.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. If no model
has been initialized yet, the model is added.
Initialize the pipe for training, using data examples if available. Return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
> #### Example
>
@ -180,16 +182,17 @@ has been initialized yet, the model is added.
> optimizer = parser.begin_training(pipeline=nlp.pipeline)
> ```
| Name | Type | Description |
| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
| `pipeline` | list | Optional list of pipeline components that this component is part of. |
| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`DependencyParser`](/api/dependencyparser#create_optimizer) if not set. |
| **RETURNS** | callable | An optimizer. |
| Name | Type | Description |
| -------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
| `pipeline`     | `List[Tuple[str, Callable]]` | Optional list of `(name, component)` tuples of pipeline components that this component is part of. |
| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/dependencyparser#create_optimizer) if not set. |
| **RETURNS** | `Optimizer` | An optimizer. |
## DependencyParser.create_optimizer {#create_optimizer tag="method"}
Create an optimizer for the pipeline component.
Create an [`Optimizer`](https://thinc.ai/docs/api-optimizers) for the pipeline
component.
> #### Example
>
@ -198,9 +201,9 @@ Create an optimizer for the pipeline component.
> optimizer = parser.create_optimizer()
> ```
| Name | Type | Description |
| ----------- | -------- | -------------- |
| **RETURNS** | callable | The optimizer. |
| Name | Type | Description |
| ----------- | ----------- | -------------- |
| **RETURNS** | `Optimizer` | The optimizer. |
## DependencyParser.use_params {#use_params tag="method, contextmanager"}

View File

@ -38,18 +38,17 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import EntityLinker
> entity_linker = EntityLinker(nlp.vocab)
> entity_linker = EntityLinker(nlp.vocab, nel_model)
> entity_linker.from_disk("/path/to/model")
> ```
| Name | Type | Description |
| -------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `hidden_width` | int | Width of the hidden layer of the entity linking model, defaults to `128`. |
| `incl_prior` | bool | Whether or not to include prior probabilities in the model. Defaults to `True`. |
| `incl_context` | bool | Whether or not to include the local context in the model (if not: only prior probabilities are used). Defaults to `True`. |
| **RETURNS** | `EntityLinker` | The newly constructed object. |
| Name | Type | Description |
| ------- | ------- | ------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `EntityLinker` | The newly constructed object. |
## EntityLinker.\_\_call\_\_ {#call tag="method"}
@ -134,7 +133,7 @@ entities.
## EntityLinker.update {#update tag="method"}
Learn from a batch of documents and gold-standard information, updating both the
Learn from a batch of [`Example`](/api/example) objects, updating both the
pipe's entity linking model and context encoder. Delegates to
[`predict`](/api/entitylinker#predict) and
[`get_loss`](/api/entitylinker#get_loss).
@ -142,19 +141,21 @@ pipe's entity linking model and context encoder. Delegates to
> #### Example
>
> ```python
> entity_linker = EntityLinker(nlp.vocab)
> entity_linker = EntityLinker(nlp.vocab, nel_model)
> losses = {}
> optimizer = nlp.begin_training()
> entity_linker.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
> entity_linker.update(examples, losses=losses, sgd=optimizer)
> ```
| Name | Type | Description |
| -------- | -------- | ------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | A batch of documents to learn from. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `drop` | float | The dropout rate, used both for the EL model and the context encoder. |
| `sgd` | callable | The optimizer for the EL model. Should take two arguments `weights` and `gradient`, and an optional ID. |
| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
| Name | Type | Description |
| ----------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
| _keyword-only_ | | |
| `drop` | float | The dropout rate. |
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entitylinker#set_annotations). |
| `sgd` | `Optimizer` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
| **RETURNS** | float | The loss from this batch. |
## EntityLinker.get_loss {#get_loss tag="method"}
@ -195,9 +196,9 @@ identifiers.
## EntityLinker.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. If no model
has been initialized yet, the model is added. Before calling this method, a
knowledge base should have been defined with
Initialize the pipe for training, using data examples if available. Return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Before calling this
method, a knowledge base should have been defined with
[`set_kb`](/api/entitylinker#set_kb).
> #### Example
@ -209,12 +210,12 @@ knowledge base should have been defined with
> optimizer = entity_linker.begin_training(pipeline=nlp.pipeline)
> ```
| Name | Type | Description |
| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
| `pipeline` | list | Optional list of pipeline components that this component is part of. |
| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`EntityLinker`](/api/entitylinker#create_optimizer) if not set. |
| **RETURNS** | callable | An optimizer. |
| Name | Type | Description |
| -------------- | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
| `pipeline`     | `List[Tuple[str, Callable]]` | Optional list of `(name, component)` tuples of pipeline components that this component is part of. |
| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/entitylinker#create_optimizer) if not set. |
| **RETURNS**    | `Optimizer`             | An optimizer. |
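As a sketch of the required order of calls, assuming `kb` is a [`KnowledgeBase`](/api/kb) created and populated beforehand:

```python
entity_linker = nlp.create_pipe("entity_linker")
entity_linker.set_kb(kb)  # the knowledge base must be set before training
optimizer = entity_linker.begin_training(pipeline=nlp.pipeline)
```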
## EntityLinker.create_optimizer {#create_optimizer tag="method"}

View File

@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import EntityRecognizer
> ner = EntityRecognizer(nlp.vocab)
> ner = EntityRecognizer(nlp.vocab, ner_model)
> ner.from_disk("/path/to/model")
> ```
| Name | Type | Description |
| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `EntityRecognizer` | The newly constructed object. |
| Name | Type | Description |
| ----------- | ------------------ | ------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `EntityRecognizer` | The newly constructed object. |
## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
@ -102,10 +102,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.
> scores, tensors = ner.predict([doc1, doc2])
> ```
| Name | Type | Description |
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | The documents to predict. |
| **RETURNS** | list | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal). |
| Name | Type | Description |
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | The documents to predict. |
| **RETURNS** | list | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal). |
## EntityRecognizer.set_annotations {#set_annotations tag="method"}
@ -127,26 +127,28 @@ Modify a batch of documents, using pre-computed scores.
## EntityRecognizer.update {#update tag="method"}
Learn from a batch of documents and gold-standard information, updating the
pipe's model. Delegates to [`predict`](/api/entityrecognizer#predict) and
Learn from a batch of [`Example`](/api/example) objects, updating the pipe's
model. Delegates to [`predict`](/api/entityrecognizer#predict) and
[`get_loss`](/api/entityrecognizer#get_loss).
> #### Example
>
> ```python
> ner = EntityRecognizer(nlp.vocab)
> ner = EntityRecognizer(nlp.vocab, ner_model)
> losses = {}
> optimizer = nlp.begin_training()
> ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
> ner.update(examples, losses=losses, sgd=optimizer)
> ```
| Name | Type | Description |
| -------- | -------- | -------------------------------------------------------------------------------------------- |
| `docs` | iterable | A batch of documents to learn from. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `drop` | float | The dropout rate. |
| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
| Name | Type | Description |
| ----------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
| _keyword-only_ | | |
| `drop` | float | The dropout rate. |
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entityrecognizer#set_annotations). |
| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
## EntityRecognizer.get_loss {#get_loss tag="method"}
@ -170,8 +172,8 @@ predicted scores.
## EntityRecognizer.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. If no model
has been initialized yet, the model is added.
Initialize the pipe for training, using data examples if available. Return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
> #### Example
>
@ -181,12 +183,14 @@ has been initialized yet, the model is added.
> optimizer = ner.begin_training(pipeline=nlp.pipeline)
> ```
| Name | Type | Description |
| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
| `pipeline` | list | Optional list of pipeline components that this component is part of. |
| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`EntityRecognizer`](/api/entityrecognizer#create_optimizer) if not set. |
| **RETURNS** | callable | An optimizer. |
| Name | Type | Description |
| -------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
| `pipeline`     | `List[Tuple[str, Callable]]` | Optional list of `(name, component)` tuples of pipeline components that this component is part of. |
| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/entityrecognizer#create_optimizer) if not set. |
| **RETURNS** | `Optimizer` | An optimizer. |
## EntityRecognizer.create_optimizer {#create_optimizer tag="method"}

View File

@ -141,11 +141,12 @@ of the `reference` document.
> assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]
> ```
Get the aligned view of a certain token attribute, denoted by its int ID or string name.
Get the aligned view of a certain token attribute, denoted by its int ID or
string name.
| Name | Type | Description | Default |
| ----------- | -------------------------- | ------------------------------------------------------------------ | ------- |
| `field` | int or str | Attribute ID or string name | |
| `field` | int or str | Attribute ID or string name | |
| `as_string` | bool | Whether or not to return the list of values as strings. | `False` |
| **RETURNS** | `List[int]` or `List[str]` | List of integer values, or string values if `as_string` is `True`. | |
@ -176,7 +177,7 @@ Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).
> ```python
> words = ["Mrs", "Smith", "flew", "to", "New York"]
> doc = Doc(en_vocab, words=words)
> entities = [(0, len("Mrs Smith"), "PERSON"), (18, 18 + len("New York"), "LOC")]
> entities = [(0, 9, "PERSON"), (18, 26, "LOC")]
> gold_words = ["Mrs Smith", "flew", "to", "New", "York"]
> example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
> ner_tags = example.get_aligned_ner()
@ -197,7 +198,7 @@ Get the aligned view of the NER
> ```python
> words = ["Mr and Mrs Smith", "flew", "to", "New York"]
> doc = Doc(en_vocab, words=words)
> entities = [(0, len("Mr and Mrs Smith"), "PERSON")]
> entities = [(0, 16, "PERSON")]
> tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
> ents_ref = example.reference.ents
@ -220,15 +221,12 @@ in `example.predicted`.
> #### Example
>
> ```python
> ruler = EntityRuler(nlp)
> patterns = [{"label": "PERSON", "pattern": "Mr and Mrs Smith"}]
> ruler.add_patterns(patterns)
> nlp.add_pipe(ruler)
> nlp.add_pipe(my_ner)
> doc = nlp("Mr and Mrs Smith flew to New York")
> entities = [(0, len("Mr and Mrs Smith"), "PERSON")]
> tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "New York"]
> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
> example = Example.from_dict(doc, {"words": tokens_ref})
> ents_pred = example.predicted.ents
> # Assume the NER model has found "Mr and Mrs Smith" as a named entity
> assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4)]
> ents_x2y = example.get_aligned_spans_x2y(ents_pred)
> assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]

View File

@ -87,18 +87,18 @@ Update the models in the pipeline.
> ```python
> for raw_text, entity_offsets in train_data:
> doc = nlp.make_doc(raw_text)
> gold = GoldParse(doc, entities=entity_offsets)
> nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
> example = Example.from_dict(doc, {"entities": entity_offsets})
> nlp.update([example], sgd=optimizer)
> ```
| Name | Type | Description |
| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | A batch of `Doc` objects or strings. If strings, a `Doc` object will be created from the text. |
| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
| `drop` | float | The dropout rate. |
| `sgd` | callable | An optimizer. |
| `losses` | dict | Dictionary to update with the loss, keyed by pipeline component. |
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| Name | Type | Description |
| -------------------------------------------- | ------------------- | ---------------------------------------------------------------------------- |
| `examples` | `Iterable[Example]` | A batch of `Example` objects to learn from. |
| _keyword-only_ | | |
| `drop` | float | The dropout rate. |
| `sgd` | `Optimizer` | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
| `losses` | `Dict[str, float]` | Dictionary to update with the loss, keyed by pipeline component. |
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
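For illustration, a minibatched training loop over this signature; a sketch assuming `train_examples` is a list of [`Example`](/api/example) objects built as in the example above:

```python
import random
from spacy.util import minibatch

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_examples)
    for batch in minibatch(train_examples, size=8):
        losses = {}
        nlp.update(batch, drop=0.35, losses=losses, sgd=optimizer)
```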
## Language.evaluate {#evaluate tag="method"}
@ -107,35 +107,37 @@ Evaluate a model's pipeline components.
> #### Example
>
> ```python
> scorer = nlp.evaluate(docs_golds, verbose=True)
> scorer = nlp.evaluate(examples, verbose=True)
> print(scorer.scores)
> ```
| Name | Type | Description |
| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects, such that the `Doc` objects contain the predictions and the `GoldParse` objects the correct annotations. Alternatively, `(text, annotations)` tuples of raw text and a dict (see [simple training style](/usage/training#training-simple-style)). |
| `verbose` | bool | Print debugging information. |
| `batch_size` | int | The batch size to use. |
| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| **RETURNS** | Scorer | The scorer containing the evaluation scores. |
| Name | Type | Description |
| -------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------- |
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
| `verbose` | bool | Print debugging information. |
| `batch_size` | int | The batch size to use. |
| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
| **RETURNS** | Scorer | The scorer containing the evaluation scores. |
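A sketch of building evaluation data for this signature, assuming `dev_data` holds `(text, annotation_dict)` pairs:

```python
from spacy.gold import Example

dev_examples = [Example.from_dict(nlp.make_doc(text), annots)
                for text, annots in dev_data]
scorer = nlp.evaluate(dev_examples, verbose=True)
print(scorer.scores)
```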
## Language.begin_training {#begin_training tag="method"}
Allocate models, pre-process training data and acquire an optimizer.
Allocate models, pre-process training data and acquire an
[`Optimizer`](https://thinc.ai/docs/api-optimizers).
> #### Example
>
> ```python
> optimizer = nlp.begin_training(gold_tuples)
> optimizer = nlp.begin_training(get_examples)
> ```
| Name | Type | Description |
| -------------------------------------------- | -------- | ---------------------------------------------------------------------------- |
| `gold_tuples` | iterable | Gold-standard training data. |
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| `**cfg` | - | Config parameters (sent to all components). |
| **RETURNS** | callable | An optimizer. |
| Name | Type | Description |
| -------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------ |
| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. If not set, a default one will be created. |
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
| `**cfg` | - | Config parameters (sent to all components). |
| **RETURNS** | `Optimizer` | An optimizer. |
## Language.use_params {#use_params tag="contextmanager, method"}
@ -155,16 +157,6 @@ their original weights after the block.
| `params` | dict | A dictionary of parameters keyed by model ID. |
| `**cfg` | - | Config parameters. |
## Language.preprocess_gold {#preprocess_gold tag="method"}
Can be called before training to pre-process gold data. By default, it handles
nonprojectivity and adds missing tags to the tag map.
| Name | Type | Description |
| ------------ | -------- | ---------------------------------------- |
| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects. |
| **YIELDS** | tuple | Tuples of `Doc` and `GoldParse` objects. |
## Language.create_pipe {#create_pipe tag="method" new="2"}
Create a pipeline component from a factory.

View File

@ -27,22 +27,20 @@ Create a new `Scorer`.
## Scorer.score {#score tag="method"}
Update the evaluation scores from a single [`Doc`](/api/doc) /
[`GoldParse`](/api/goldparse) pair.
Update the evaluation scores from a single [`Example`](/api/example) object.
> #### Example
>
> ```python
> scorer = Scorer()
> scorer.score(doc, gold)
> scorer.score(example)
> ```
| Name | Type | Description |
| -------------- | ----------- | -------------------------------------------------------------------------------------------------------------------- |
| `doc` | `Doc` | The predicted annotations. |
| `gold` | `GoldParse` | The correct annotations. |
| `verbose` | bool | Print debugging information. |
| `punct_labels` | tuple | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. |
| Name | Type | Description |
| -------------- | --------- | -------------------------------------------------------------------------------------------------------------------- |
| `example` | `Example` | The `Example` object holding both the predictions and the correct gold-standard annotations. |
| `verbose` | bool | Print debugging information. |
| `punct_labels` | tuple | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. |
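A sketch of scoring processed documents against gold annotations, again assuming `dev_data` holds `(text, annotation_dict)` pairs:

```python
from spacy.scorer import Scorer
from spacy.gold import Example

scorer = Scorer()
for text, annots in dev_data:
    doc = nlp(text)  # the processed Doc holds the predictions
    scorer.score(Example.from_dict(doc, annots))
print(scorer.scores)
```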
## Properties

View File

@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import Tagger
> tagger = Tagger(nlp.vocab)
> tagger = Tagger(nlp.vocab, tagger_model)
> tagger.from_disk("/path/to/model")
> ```
| Name | Type | Description |
| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `Tagger` | The newly constructed object. |
| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `Tagger` | The newly constructed object. |
## Tagger.\_\_call\_\_ {#call tag="method"}
@ -132,19 +132,20 @@ pipe's model. Delegates to [`predict`](/api/tagger#predict) and
> #### Example
>
> ```python
> tagger = Tagger(nlp.vocab)
> tagger = Tagger(nlp.vocab, tagger_model)
> losses = {}
> optimizer = nlp.begin_training()
> tagger.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
> tagger.update(examples, losses=losses, sgd=optimizer)
> ```
| Name | Type | Description |
| -------- | -------- | -------------------------------------------------------------------------------------------- |
| `docs` | iterable | A batch of documents to learn from. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `drop` | float | The dropout rate. |
| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
| Name | Type | Description |
| ----------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
| _keyword-only_ | | |
| `drop` | float | The dropout rate. |
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/tagger#set_annotations). |
| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
## Tagger.get_loss {#get_loss tag="method"}
@ -168,8 +169,8 @@ predicted scores.
## Tagger.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. If no model
has been initialized yet, the model is added.
Initialize the pipe for training, using data examples if available. Return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
> #### Example
>
@ -179,12 +180,12 @@ has been initialized yet, the model is added.
> optimizer = tagger.begin_training(pipeline=nlp.pipeline)
> ```
| Name | Type | Description |
| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
| `pipeline` | list | Optional list of pipeline components that this component is part of. |
| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`Tagger`](/api/tagger#create_optimizer) if not set. |
| **RETURNS** | callable | An optimizer. |
| Name | Type | Description |
| -------------- | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
| `pipeline`     | `List[Tuple[str, Callable]]` | Optional list of `(name, component)` tuples of pipeline components that this component is part of. |
| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/tagger#create_optimizer) if not set. |
| **RETURNS** | `Optimizer` | An optimizer. |
## Tagger.create_optimizer {#create_optimizer tag="method"}

View File

@ -35,17 +35,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import TextCategorizer
> textcat = TextCategorizer(nlp.vocab)
> textcat = TextCategorizer(nlp.vocab, textcat_model)
> textcat.from_disk("/path/to/model")
> ```
| Name | Type | Description |
| ------------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `exclusive_classes` | bool | Make categories mutually exclusive. Defaults to `False`. |
| `architecture` | str | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. |
| **RETURNS** | `TextCategorizer` | The newly constructed object. |
| Name | Type | Description |
| ----------- | ----------------- | ------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `TextCategorizer` | The newly constructed object. |
### Architectures {#architectures new="2.1"}
@ -151,19 +150,20 @@ pipe's model. Delegates to [`predict`](/api/textcategorizer#predict) and
> #### Example
>
> ```python
> textcat = TextCategorizer(nlp.vocab)
> textcat = TextCategorizer(nlp.vocab, textcat_model)
> losses = {}
> optimizer = nlp.begin_training()
> textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
> textcat.update(examples, losses=losses, sgd=optimizer)
> ```
| Name | Type | Description |
| -------- | -------- | -------------------------------------------------------------------------------------------- |
| `docs` | iterable | A batch of documents to learn from. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `drop` | float | The dropout rate. |
| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
| Name | Type | Description |
| ----------------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
| _keyword-only_ | | |
| `drop` | float | The dropout rate. |
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/textcategorizer#set_annotations). |
| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
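As a sketch for text classification specifically, gold categories are passed via the `"cats"` key of the annotation dict (the label names here are assumptions):

```python
from spacy.gold import Example

doc = nlp.make_doc("This is great")
example = Example.from_dict(doc, {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})
losses = {}
textcat.update([example], losses=losses, sgd=optimizer)
```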
## TextCategorizer.get_loss {#get_loss tag="method"}
@ -187,8 +187,8 @@ predicted scores.
## TextCategorizer.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. If no model
has been initialized yet, the model is added.
Initialize the pipe for training, using data examples if available. Return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
> #### Example
>
@ -198,12 +198,12 @@ has been initialized yet, the model is added.
> optimizer = textcat.begin_training(pipeline=nlp.pipeline)
> ```
| Name | Type | Description |
| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
| `pipeline` | list | Optional list of pipeline components that this component is part of. |
| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`TextCategorizer`](/api/textcategorizer#create_optimizer) if not set. |
| **RETURNS** | callable | An optimizer. |
| Name | Type | Description |
| -------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
| `pipeline`     | `List[Tuple[str, Callable]]` | Optional list of `(name, component)` tuples of pipeline components that this component is part of. |
| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/textcategorizer#create_optimizer) if not set. |
| **RETURNS** | `Optimizer` | An optimizer. |
## TextCategorizer.create_optimizer {#create_optimizer tag="method"}

View File

@ -719,8 +719,7 @@ vary on each step.
> ```python
> batches = minibatch(train_data)
> for batch in batches:
> texts, annotations = zip(*batch)
> nlp.update(texts, annotations)
> nlp.update(batch)
> ```
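One way to let the batch size vary on each step, as described above, is a compounding schedule from `spacy.util`; a sketch, assuming `train_examples` and `optimizer` exist:

```python
from spacy.util import minibatch, compounding

# grow the batch size from 4 to 32 by a factor of 1.001 per batch
batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
    nlp.update(batch, sgd=optimizer)
```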
| Name | Type | Description |

View File

@ -45,10 +45,11 @@ an **annotated document**. It also orchestrates training and serialization.
### Other classes {#architecture-other}
| Name | Description |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| [`Vocab`](/api/vocab) | A lookup table for the vocabulary that allows you to access `Lexeme` objects. |
| [`StringStore`](/api/stringstore) | Map strings to and from hash values. |
| [`Vectors`](/api/vectors) | Container class for vector data keyed by string. |
| [`GoldParse`](/api/goldparse) | Collection for training annotations. |
| [`GoldCorpus`](/api/goldcorpus) | An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER. |
| Name | Description |
| --------------------------------- | ----------------------------------------------------------------------------- |
| [`Vocab`](/api/vocab) | A lookup table for the vocabulary that allows you to access `Lexeme` objects. |
| [`StringStore`](/api/stringstore) | Map strings to and from hash values. |
| [`Vectors`](/api/vectors) | Container class for vector data keyed by string. |
| [`Example`](/api/example) | Collection for training annotations. |

View File

@ -633,8 +633,9 @@ for ent in doc.ents:
### Train and update neural network models {#lightning-tour-training}
```python
import spacy
import random
import spacy
from spacy.gold import Example
nlp = spacy.load("en_core_web_sm")
train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})]
@ -644,7 +645,9 @@ with nlp.select_pipes(enable="ner"):
for i in range(10):
random.shuffle(train_data)
for text, annotations in train_data:
nlp.update([text], [annotations], sgd=optimizer)
doc = nlp.make_doc(text)
example = Example.from_dict(doc, annotations)
nlp.update([example], sgd=optimizer)
nlp.to_disk("/model")
```

View File

@ -375,45 +375,71 @@ mattis pretium.
## Internal training API {#api}
<!-- TODO: rewrite for new nlp.update / example logic -->
The [`Example`](/api/example) object contains annotated training data, also
called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
that will hold the predictions, and another `Doc` object that holds the
gold-standard annotations. Here's an example of a simple `Example` for
part-of-speech tags:
The [`GoldParse`](/api/goldparse) object collects the annotated training
examples, also called the **gold standard**. It's initialized with the
[`Doc`](/api/doc) object it refers to, and keyword arguments specifying the
annotations, like `tags` or `entities`. Its job is to encode the annotations,
keep them aligned and create the C-level data structures required for efficient
access. Here's an example of a simple `GoldParse` for part-of-speech tags:
```python
import numpy
from spacy.gold import Example
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
words = ["I", "like", "stuff"]
predicted = Doc(vocab, words=words)
# create the reference Doc with gold-standard TAG annotations
tags = ["NOUN", "VERB", "NOUN"]
tag_ids = [vocab.strings.add(tag) for tag in tags]
reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
example = Example(predicted, reference)
```
Alternatively, the `reference` `Doc` with the gold-standard annotations can be
created from a dictionary with keyword arguments specifying the annotations,
like `tags` or `entities`:
```python
words = ["I", "like", "stuff"]
tags = ["NOUN", "VERB", "NOUN"]
predicted = Doc(vocab, words=words)
example = Example.from_dict(predicted, {"tags": tags})
```
Using the `Example` object and its gold-standard annotations, the model can be
updated to learn a sentence of three words with their assigned part-of-speech
tags.
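A short sketch of such an update, assuming an `nlp` object whose pipeline includes a tagger and that shares the vocabulary used above:

```python
optimizer = nlp.begin_training()
losses = {}
nlp.update([example], sgd=optimizer, losses=losses)
```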
<!-- TODO: is this the best place for the tag_map explanation ? -->
The [tag map](/usage/adding-languages#tag-map) is part of the vocabulary and
defines the annotation scheme. If you're training a new language model, this
will let you map the tags present in the treebank you train on to spaCy's tag
scheme:
```python
vocab = Vocab(tag_map={"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}})
doc = Doc(vocab, words=["I", "like", "stuff"])
gold = GoldParse(doc, tags=["N", "V", "N"])
```
Using the `Doc` and its gold-standard annotations, the model can be updated to
learn a sentence of three words with their assigned part-of-speech tags. The
[tag map](/usage/adding-languages#tag-map) is part of the vocabulary and defines
the annotation scheme. If you're training a new language model, this will let
you map the tags present in the treebank you train on to spaCy's tag scheme.
Another example shows how to define gold-standard named entities:
```python
doc = Doc(Vocab(), words=["Facebook", "released", "React", "in", "2014"])
gold = GoldParse(doc, entities=["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"])
doc = Doc(vocab, words=["Facebook", "released", "React", "in", "2014"])
example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```
The same goes for named entities. The letters added before the labels refer to
the tags of the [BILUO scheme](/usage/linguistic-features#updating-biluo): `O`
is a token outside an entity, `U` a single entity unit, `B` the beginning of an
entity, `I` a token inside an entity and `L` the last token of an entity.
The letters added before the labels refer to the tags of the
[BILUO scheme](/usage/linguistic-features#updating-biluo): `O` is a token
outside an entity, `U` a single entity unit, `B` the beginning of an entity,
`I` a token inside an entity and `L` the last token of an entity.
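Rather than writing BILUO tags by hand, they can be derived from character offsets; a sketch using `biluo_tags_from_offsets` from `spacy.gold`:

```python
from spacy.gold import biluo_tags_from_offsets

doc = nlp.make_doc("Facebook released React in 2014")
entities = [(0, 8, "ORG"), (18, 23, "TECHNOLOGY"), (27, 31, "DATE")]
tags = biluo_tags_from_offsets(doc, entities)
# tags == ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]
```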
> - **Training data**: The training examples.
> - **Text and label**: The current example.
> - **Doc**: A `Doc` object created from the example text.
> - **GoldParse**: A `GoldParse` object of the `Doc` and label.
> - **Example**: An `Example` object holding both predictions and gold-standard
> annotations.
> - **nlp**: The `nlp` object with the model.
> - **Optimizer**: A function that holds state between updates.
> - **Update**: Update the model's weights.
<!-- TODO: update graphic & related text -->
![The training loop](../images/training-loop.svg)
Of course, it's not enough to only show a model a single example once.
@ -427,32 +453,33 @@ dropout means that each feature or internal representation has a 1/4 likelihood
of being dropped.
> - [`begin_training`](/api/language#begin_training): Start the training and
> return an optimizer function to update the model's weights. Can take an
> optional function converting the training data to spaCy's training format.
> - [`update`](/api/language#update): Update the model with the training example
> and gold data.
> return an [`Optimizer`](https://thinc.ai/docs/api-optimizers) object to
> update the model's weights.
> - [`update`](/api/language#update): Update the model with the training
> examples.
> - [`to_disk`](/api/language#to_disk): Save the updated model to a directory.
```python
### Example training loop
optimizer = nlp.begin_training(get_data)
optimizer = nlp.begin_training()
for itn in range(100):
random.shuffle(train_data)
for raw_text, entity_offsets in train_data:
doc = nlp.make_doc(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
example = Example.from_dict(doc, {"entities": entity_offsets})
nlp.update([example], sgd=optimizer)
nlp.to_disk("/model")
```
The [`nlp.update`](/api/language#update) method takes the following arguments:
| Name | Description |
| ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs` | [`Doc`](/api/doc) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a sequence of raw texts. |
| `golds` | [`GoldParse`](/api/goldparse) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a dictionary containing the annotations. |
| `drop` | Dropout rate. Makes it harder for the model to just memorize the data. |
| `sgd` | An optimizer, i.e. a callable to update the model's weights. If not set, spaCy will create a new one and save it for further use. |
| Name | Description |
| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | [`Example`](/api/example) objects. The `update` method takes a sequence of them, so you can batch up your training examples. |
| `drop` | Dropout rate. Makes it harder for the model to just memorize the data. |
| `sgd`      | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object, which updates the model's weights. If not set, spaCy will create a new one and save it for further use. |
<!-- TODO: DocBin format ? -->
Instead of writing your own training loop, you can also use the built-in
[`train`](/api/cli#train) command, which expects data in spaCy's