diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md
index 0980dc2e0..9c9a60490 100644
--- a/website/docs/api/dependencyparser.md
+++ b/website/docs/api/dependencyparser.md
@@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import DependencyParser
-> parser = DependencyParser(nlp.vocab)
+> parser = DependencyParser(nlp.vocab, parser_model)
> parser.from_disk("/path/to/model")
> ```
-| Name | Type | Description |
-| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg` | - | Configuration parameters. |
-| **RETURNS** | `DependencyParser` | The newly constructed object. |
+| Name | Type | Description |
+| ----------- | ------------------ | ------------------------------------------------------------------------------- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `DependencyParser` | The newly constructed object. |
## DependencyParser.\_\_call\_\_ {#call tag="method"}
@@ -126,26 +126,28 @@ Modify a batch of documents, using pre-computed scores.
## DependencyParser.update {#update tag="method"}
-Learn from a batch of documents and gold-standard information, updating the
-pipe's model. Delegates to [`predict`](/api/dependencyparser#predict) and
+Learn from a batch of [`Example`](/api/example) objects, updating the pipe's
+model. Delegates to [`predict`](/api/dependencyparser#predict) and
[`get_loss`](/api/dependencyparser#get_loss).
> #### Example
>
> ```python
-> parser = DependencyParser(nlp.vocab)
+> parser = DependencyParser(nlp.vocab, parser_model)
> losses = {}
> optimizer = nlp.begin_training()
-> parser.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> parser.update(examples, losses=losses, sgd=optimizer)
> ```
-| Name | Type | Description |
-| -------- | -------- | -------------------------------------------------------------------------------------------- |
-| `docs` | iterable | A batch of documents to learn from. |
-| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
-| `drop` | float | The dropout rate. |
-| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
-| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| Name | Type | Description |
+| ----------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/dependencyparser#set_annotations). |
+| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
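+
+A sketch of how the `examples` batch used above might be built — the
+annotations here are illustrative (see the [`Example`](/api/example) API):
+
+```python
+from spacy.gold import Example
+
+doc = nlp.make_doc("She ate the pizza")
+annotations = {"heads": [1, 1, 3, 1], "deps": ["nsubj", "ROOT", "det", "dobj"]}
+examples = [Example.from_dict(doc, annotations)]
+```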
## DependencyParser.get_loss {#get_loss tag="method"}
@@ -169,8 +171,8 @@ predicted scores.
## DependencyParser.begin_training {#begin_training tag="method"}
-Initialize the pipe for training, using data examples if available. If no model
-has been initialized yet, the model is added.
+Initialize the pipe for training, using data examples if available. Return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
> #### Example
>
@@ -180,16 +182,17 @@ has been initialized yet, the model is added.
> optimizer = parser.begin_training(pipeline=nlp.pipeline)
> ```
-| Name | Type | Description |
-| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
-| `pipeline` | list | Optional list of pipeline components that this component is part of. |
-| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`DependencyParser`](/api/dependencyparser#create_optimizer) if not set. |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| -------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `pipeline`     | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/dependencyparser#create_optimizer) if not set. |
+| **RETURNS** | `Optimizer` | An optimizer. |
## DependencyParser.create_optimizer {#create_optimizer tag="method"}
-Create an optimizer for the pipeline component.
+Create an [`Optimizer`](https://thinc.ai/docs/api-optimizers) for the pipeline
+component.
> #### Example
>
@@ -198,9 +201,9 @@ Create an optimizer for the pipeline component.
> optimizer = parser.create_optimizer()
> ```
-| Name | Type | Description |
-| ----------- | -------- | -------------- |
-| **RETURNS** | callable | The optimizer. |
+| Name | Type | Description |
+| ----------- | ----------- | -------------- |
+| **RETURNS** | `Optimizer` | The optimizer. |
## DependencyParser.use_params {#use_params tag="method, contextmanager"}
diff --git a/website/docs/api/entitylinker.md b/website/docs/api/entitylinker.md
index d7f25ed56..1e6a56a48 100644
--- a/website/docs/api/entitylinker.md
+++ b/website/docs/api/entitylinker.md
@@ -38,18 +38,17 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import EntityLinker
-> entity_linker = EntityLinker(nlp.vocab)
+> entity_linker = EntityLinker(nlp.vocab, nel_model)
> entity_linker.from_disk("/path/to/model")
> ```
-| Name | Type | Description |
-| -------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `hidden_width` | int | Width of the hidden layer of the entity linking model, defaults to `128`. |
-| `incl_prior` | bool | Whether or not to include prior probabilities in the model. Defaults to `True`. |
-| `incl_context` | bool | Whether or not to include the local context in the model (if not: only prior probabilities are used). Defaults to `True`. |
-| **RETURNS** | `EntityLinker` | The newly constructed object. |
+| Name | Type | Description |
+| ------- | ------- | ------------------------------------------------------------------------------- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `EntityLinker` | The newly constructed object. |
## EntityLinker.\_\_call\_\_ {#call tag="method"}
@@ -134,7 +133,7 @@ entities.
## EntityLinker.update {#update tag="method"}
-Learn from a batch of documents and gold-standard information, updating both the
+Learn from a batch of [`Example`](/api/example) objects, updating both the
pipe's entity linking model and context encoder. Delegates to
[`predict`](/api/entitylinker#predict) and
[`get_loss`](/api/entitylinker#get_loss).
@@ -142,19 +141,21 @@ pipe's entity linking model and context encoder. Delegates to
> #### Example
>
> ```python
-> entity_linker = EntityLinker(nlp.vocab)
+> entity_linker = EntityLinker(nlp.vocab, nel_model)
> losses = {}
> optimizer = nlp.begin_training()
-> entity_linker.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> entity_linker.update(examples, losses=losses, sgd=optimizer)
> ```
-| Name | Type | Description |
-| -------- | -------- | ------------------------------------------------------------------------------------------------------- |
-| `docs` | iterable | A batch of documents to learn from. |
-| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
-| `drop` | float | The dropout rate, used both for the EL model and the context encoder. |
-| `sgd` | callable | The optimizer for the EL model. Should take two arguments `weights` and `gradient`, and an optional ID. |
-| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| Name | Type | Description |
+| ----------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entitylinker#set_annotations). |
+| `sgd` | `Optimizer` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| **RETURNS** | float | The loss from this batch. |
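+
+A sketch of building `examples` for entity linking — the KB IDs and character
+offsets here are hypothetical:
+
+```python
+from spacy.gold import Example
+
+doc = nlp.make_doc("Russ Cochran captured his first major title")
+links = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
+examples = [Example.from_dict(doc, {"links": links})]
+```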
## EntityLinker.get_loss {#get_loss tag="method"}
@@ -195,9 +196,9 @@ identifiers.
## EntityLinker.begin_training {#begin_training tag="method"}
-Initialize the pipe for training, using data examples if available. If no model
-has been initialized yet, the model is added. Before calling this method, a
-knowledge base should have been defined with
+Initialize the pipe for training, using data examples if available. Return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Before calling this
+method, a knowledge base should have been defined with
[`set_kb`](/api/entitylinker#set_kb).
> #### Example
@@ -209,12 +210,12 @@ knowledge base should have been defined with
> optimizer = entity_linker.begin_training(pipeline=nlp.pipeline)
> ```
-| Name | Type | Description |
-| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
-| `pipeline` | list | Optional list of pipeline components that this component is part of. |
-| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`EntityLinker`](/api/entitylinker#create_optimizer) if not set. |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| -------------- | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `pipeline`     | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/entitylinker#create_optimizer) if not set. |
+| **RETURNS**    | `Optimizer`                  | An optimizer. |
## EntityLinker.create_optimizer {#create_optimizer tag="method"}
diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md
index c9a81f6f1..9a9b0926b 100644
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import EntityRecognizer
-> ner = EntityRecognizer(nlp.vocab)
+> ner = EntityRecognizer(nlp.vocab, ner_model)
> ner.from_disk("/path/to/model")
> ```
-| Name | Type | Description |
-| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg` | - | Configuration parameters. |
-| **RETURNS** | `EntityRecognizer` | The newly constructed object. |
+| Name | Type | Description |
+| ----------- | ------------------ | ------------------------------------------------------------------------------- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `EntityRecognizer` | The newly constructed object. |
## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
@@ -102,10 +102,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.
> scores, tensors = ner.predict([doc1, doc2])
> ```
-| Name | Type | Description |
-| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `docs` | iterable | The documents to predict. |
-| **RETURNS** | list | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal). |
+| Name | Type | Description |
+| ----------- | -------- | ---------------------------------------------------------------------------------------------------------- |
+| `docs` | iterable | The documents to predict. |
+| **RETURNS** | list | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal). |
## EntityRecognizer.set_annotations {#set_annotations tag="method"}
@@ -127,26 +127,28 @@ Modify a batch of documents, using pre-computed scores.
## EntityRecognizer.update {#update tag="method"}
-Learn from a batch of documents and gold-standard information, updating the
-pipe's model. Delegates to [`predict`](/api/entityrecognizer#predict) and
+Learn from a batch of [`Example`](/api/example) objects, updating the pipe's
+model. Delegates to [`predict`](/api/entityrecognizer#predict) and
[`get_loss`](/api/entityrecognizer#get_loss).
> #### Example
>
> ```python
-> ner = EntityRecognizer(nlp.vocab)
+> ner = EntityRecognizer(nlp.vocab, ner_model)
> losses = {}
> optimizer = nlp.begin_training()
-> ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> ner.update(examples, losses=losses, sgd=optimizer)
> ```
-| Name | Type | Description |
-| -------- | -------- | -------------------------------------------------------------------------------------------- |
-| `docs` | iterable | A batch of documents to learn from. |
-| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
-| `drop` | float | The dropout rate. |
-| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
-| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| Name | Type | Description |
+| ----------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entityrecognizer#set_annotations). |
+| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
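+
+A sketch of building the `examples` batch, using character-offset entity
+annotations:
+
+```python
+from spacy.gold import Example
+
+doc = nlp.make_doc("Uber blew through $1 million")
+examples = [Example.from_dict(doc, {"entities": [(0, 4, "ORG")]})]
+```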
## EntityRecognizer.get_loss {#get_loss tag="method"}
@@ -170,8 +172,8 @@ predicted scores.
## EntityRecognizer.begin_training {#begin_training tag="method"}
-Initialize the pipe for training, using data examples if available. If no model
-has been initialized yet, the model is added.
+Initialize the pipe for training, using data examples if available. Return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
> #### Example
>
@@ -181,12 +183,14 @@ has been initialized yet, the model is added.
> optimizer = ner.begin_training(pipeline=nlp.pipeline)
> ```
-| Name | Type | Description |
-| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
-| `pipeline` | list | Optional list of pipeline components that this component is part of. |
-| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`EntityRecognizer`](/api/entityrecognizer#create_optimizer) if not set. |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| -------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `pipeline`     | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/entityrecognizer#create_optimizer) if not set. |
+| **RETURNS** | `Optimizer` | An optimizer. |
## EntityRecognizer.create_optimizer {#create_optimizer tag="method"}
diff --git a/website/docs/api/example.md b/website/docs/api/example.md
index 0f1ed618d..ca1b762c1 100644
--- a/website/docs/api/example.md
+++ b/website/docs/api/example.md
@@ -141,11 +141,12 @@ of the `reference` document.
> assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]
> ```
-Get the aligned view of a certain token attribute, denoted by its int ID or string name.
+Get the aligned view of a certain token attribute, denoted by its int ID or
+string name.
| Name | Type | Description | Default |
| ----------- | -------------------------- | ------------------------------------------------------------------ | ------- |
-| `field` | int or str | Attribute ID or string name | |
+| `field` | int or str | Attribute ID or string name | |
| `as_string` | bool | Whether or not to return the list of values as strings. | `False` |
| **RETURNS** | `List[int]` or `List[str]` | List of integer values, or string values if `as_string` is `True`. | |
@@ -176,7 +177,7 @@ Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).
> ```python
> words = ["Mrs", "Smith", "flew", "to", "New York"]
> doc = Doc(en_vocab, words=words)
-> entities = [(0, len("Mrs Smith"), "PERSON"), (18, 18 + len("New York"), "LOC")]
+> entities = [(0, 9, "PERSON"), (18, 26, "LOC")]
> gold_words = ["Mrs Smith", "flew", "to", "New", "York"]
> example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
> ner_tags = example.get_aligned_ner()
@@ -197,7 +198,7 @@ Get the aligned view of the NER
> ```python
> words = ["Mr and Mrs Smith", "flew", "to", "New York"]
> doc = Doc(en_vocab, words=words)
-> entities = [(0, len("Mr and Mrs Smith"), "PERSON")]
+> entities = [(0, 16, "PERSON")]
> tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
> ents_ref = example.reference.ents
@@ -220,15 +221,12 @@ in `example.predicted`.
> #### Example
>
> ```python
-> ruler = EntityRuler(nlp)
-> patterns = [{"label": "PERSON", "pattern": "Mr and Mrs Smith"}]
-> ruler.add_patterns(patterns)
-> nlp.add_pipe(ruler)
+> nlp.add_pipe(my_ner)
> doc = nlp("Mr and Mrs Smith flew to New York")
-> entities = [(0, len("Mr and Mrs Smith"), "PERSON")]
> tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "New York"]
-> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
+> example = Example.from_dict(doc, {"words": tokens_ref})
> ents_pred = example.predicted.ents
+> # Assume the NER model has found "Mr and Mrs Smith" as a named entity
> assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4)]
> ents_x2y = example.get_aligned_spans_x2y(ents_pred)
> assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]
diff --git a/website/docs/api/language.md b/website/docs/api/language.md
index e835168b7..f6631b1db 100644
--- a/website/docs/api/language.md
+++ b/website/docs/api/language.md
@@ -87,18 +87,18 @@ Update the models in the pipeline.
> ```python
> for raw_text, entity_offsets in train_data:
>     doc = nlp.make_doc(raw_text)
->     gold = GoldParse(doc, entities=entity_offsets)
->     nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
+>     example = Example.from_dict(doc, {"entities": entity_offsets})
+>     nlp.update([example], sgd=optimizer)
> ```
-| Name | Type | Description |
-| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `docs` | iterable | A batch of `Doc` objects or strings. If strings, a `Doc` object will be created from the text. |
-| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
-| `drop` | float | The dropout rate. |
-| `sgd` | callable | An optimizer. |
-| `losses` | dict | Dictionary to update with the loss, keyed by pipeline component. |
-| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
+| Name | Type | Description |
+| -------------------------------------------- | ------------------- | ---------------------------------------------------------------------------- |
+| `examples` | `Iterable[Example]` | A batch of `Example` objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `sgd` | `Optimizer` | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Dictionary to update with the loss, keyed by pipeline component. |
+| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]`   | Config parameters for specific pipeline components, keyed by component name. |
## Language.evaluate {#evaluate tag="method"}
@@ -107,35 +107,37 @@ Evaluate a model's pipeline components.
> #### Example
>
> ```python
-> scorer = nlp.evaluate(docs_golds, verbose=True)
+> scorer = nlp.evaluate(examples, verbose=True)
> print(scorer.scores)
> ```
-| Name | Type | Description |
-| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects, such that the `Doc` objects contain the predictions and the `GoldParse` objects the correct annotations. Alternatively, `(text, annotations)` tuples of raw text and a dict (see [simple training style](/usage/training#training-simple-style)). |
-| `verbose` | bool | Print debugging information. |
-| `batch_size` | int | The batch size to use. |
-| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
-| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
-| **RETURNS** | Scorer | The scorer containing the evaluation scores. |
+| Name | Type | Description |
+| -------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------- |
+| `examples`                                   | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to evaluate. |
+| `verbose` | bool | Print debugging information. |
+| `batch_size` | int | The batch size to use. |
+| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
+| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]`   | Config parameters for specific pipeline components, keyed by component name. |
+| **RETURNS**                                  | `Scorer`            | The scorer containing the evaluation scores. |
## Language.begin_training {#begin_training tag="method"}
-Allocate models, pre-process training data and acquire an optimizer.
+Allocate models, pre-process training data and acquire an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers).
> #### Example
>
> ```python
-> optimizer = nlp.begin_training(gold_tuples)
+> optimizer = nlp.begin_training(get_examples)
> ```
-| Name | Type | Description |
-| -------------------------------------------- | -------- | ---------------------------------------------------------------------------- |
-| `gold_tuples` | iterable | Gold-standard training data. |
-| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
-| `**cfg` | - | Config parameters (sent to all components). |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| -------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------ |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. If not set, a default one will be created. |
+| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]`   | Config parameters for specific pipeline components, keyed by component name. |
+| `**cfg` | - | Config parameters (sent to all components). |
+| **RETURNS** | `Optimizer` | An optimizer. |
## Language.use_params {#use_params tag="contextmanager, method"}
@@ -155,16 +157,6 @@ their original weights after the block.
| `params` | dict | A dictionary of parameters keyed by model ID. |
| `**cfg` | - | Config parameters. |
-## Language.preprocess_gold {#preprocess_gold tag="method"}
-
-Can be called before training to pre-process gold data. By default, it handles
-nonprojectivity and adds missing tags to the tag map.
-
-| Name | Type | Description |
-| ------------ | -------- | ---------------------------------------- |
-| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects. |
-| **YIELDS** | tuple | Tuples of `Doc` and `GoldParse` objects. |
-
## Language.create_pipe {#create_pipe tag="method" new="2"}
Create a pipeline component from a factory.
diff --git a/website/docs/api/scorer.md b/website/docs/api/scorer.md
index 8ad735e0d..cd720d26c 100644
--- a/website/docs/api/scorer.md
+++ b/website/docs/api/scorer.md
@@ -27,22 +27,20 @@ Create a new `Scorer`.
## Scorer.score {#score tag="method"}
-Update the evaluation scores from a single [`Doc`](/api/doc) /
-[`GoldParse`](/api/goldparse) pair.
+Update the evaluation scores from a single [`Example`](/api/example) object.
> #### Example
>
> ```python
> scorer = Scorer()
-> scorer.score(doc, gold)
+> scorer.score(example)
> ```
-| Name | Type | Description |
-| -------------- | ----------- | -------------------------------------------------------------------------------------------------------------------- |
-| `doc` | `Doc` | The predicted annotations. |
-| `gold` | `GoldParse` | The correct annotations. |
-| `verbose` | bool | Print debugging information. |
-| `punct_labels` | tuple | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. |
+| Name | Type | Description |
+| -------------- | --------- | -------------------------------------------------------------------------------------------------------------------- |
+| `example` | `Example` | The `Example` object holding both the predictions and the correct gold-standard annotations. |
+| `verbose` | bool | Print debugging information. |
+| `punct_labels` | tuple | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. |
## Properties
diff --git a/website/docs/api/tagger.md b/website/docs/api/tagger.md
index f14da3ac5..1aa5fb327 100644
--- a/website/docs/api/tagger.md
+++ b/website/docs/api/tagger.md
@@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import Tagger
-> tagger = Tagger(nlp.vocab)
+> tagger = Tagger(nlp.vocab, tagger_model)
> tagger.from_disk("/path/to/model")
> ```
-| Name | Type | Description |
-| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg` | - | Configuration parameters. |
-| **RETURNS** | `Tagger` | The newly constructed object. |
+| Name | Type | Description |
+| ----------- | -------- | ------------------------------------------------------------------------------- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `Tagger` | The newly constructed object. |
## Tagger.\_\_call\_\_ {#call tag="method"}
@@ -132,19 +132,20 @@ pipe's model. Delegates to [`predict`](/api/tagger#predict) and
> #### Example
>
> ```python
-> tagger = Tagger(nlp.vocab)
+> tagger = Tagger(nlp.vocab, tagger_model)
> losses = {}
> optimizer = nlp.begin_training()
-> tagger.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> tagger.update(examples, losses=losses, sgd=optimizer)
> ```
-| Name | Type | Description |
-| -------- | -------- | -------------------------------------------------------------------------------------------- |
-| `docs` | iterable | A batch of documents to learn from. |
-| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
-| `drop` | float | The dropout rate. |
-| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
-| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| Name | Type | Description |
+| ----------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/tagger#set_annotations). |
+| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
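+
+A sketch of building the `examples` batch with gold-standard tags:
+
+```python
+from spacy.gold import Example
+
+doc = nlp.make_doc("I like stuff")
+examples = [Example.from_dict(doc, {"tags": ["NOUN", "VERB", "NOUN"]})]
+```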
## Tagger.get_loss {#get_loss tag="method"}
@@ -168,8 +169,8 @@ predicted scores.
## Tagger.begin_training {#begin_training tag="method"}
-Initialize the pipe for training, using data examples if available. If no model
-has been initialized yet, the model is added.
+Initialize the pipe for training, using data examples if available. Return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
> #### Example
>
@@ -179,12 +180,12 @@ has been initialized yet, the model is added.
> optimizer = tagger.begin_training(pipeline=nlp.pipeline)
> ```
-| Name | Type | Description |
-| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
-| `pipeline` | list | Optional list of pipeline components that this component is part of. |
-| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`Tagger`](/api/tagger#create_optimizer) if not set. |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| -------------- | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `pipeline`     | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/tagger#create_optimizer) if not set. |
+| **RETURNS** | `Optimizer` | An optimizer. |
## Tagger.create_optimizer {#create_optimizer tag="method"}
diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md
index dc1c083ac..c0c3e15a0 100644
--- a/website/docs/api/textcategorizer.md
+++ b/website/docs/api/textcategorizer.md
@@ -35,17 +35,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import TextCategorizer
-> textcat = TextCategorizer(nlp.vocab)
+> textcat = TextCategorizer(nlp.vocab, textcat_model)
> textcat.from_disk("/path/to/model")
> ```
-| Name | Type | Description |
-| ------------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `exclusive_classes` | bool | Make categories mutually exclusive. Defaults to `False`. |
-| `architecture` | str | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. |
-| **RETURNS** | `TextCategorizer` | The newly constructed object. |
+| Name | Type | Description |
+| ----------- | ----------------- | ------------------------------------------------------------------------------- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `TextCategorizer` | The newly constructed object. |
### Architectures {#architectures new="2.1"}
@@ -151,19 +150,20 @@ pipe's model. Delegates to [`predict`](/api/textcategorizer#predict) and
> #### Example
>
> ```python
-> textcat = TextCategorizer(nlp.vocab)
+> textcat = TextCategorizer(nlp.vocab, textcat_model)
> losses = {}
> optimizer = nlp.begin_training()
-> textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> textcat.update(examples, losses=losses, sgd=optimizer)
> ```
-| Name | Type | Description |
-| -------- | -------- | -------------------------------------------------------------------------------------------- |
-| `docs` | iterable | A batch of documents to learn from. |
-| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
-| `drop` | float | The dropout rate. |
-| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
-| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| Name | Type | Description |
+| ----------------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/textcategorizer#set_annotations). |
+| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
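+
+A sketch of building the `examples` batch — the category labels here are
+illustrative:
+
+```python
+from spacy.gold import Example
+
+doc = nlp.make_doc("This is amazing")
+examples = [Example.from_dict(doc, {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})]
+```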
## TextCategorizer.get_loss {#get_loss tag="method"}
@@ -187,8 +187,8 @@ predicted scores.
## TextCategorizer.begin_training {#begin_training tag="method"}
-Initialize the pipe for training, using data examples if available. If no model
-has been initialized yet, the model is added.
+Initialize the pipe for training, using data examples if available. Return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
> #### Example
>
@@ -198,12 +198,12 @@ has been initialized yet, the model is added.
> optimizer = textcat.begin_training(pipeline=nlp.pipeline)
> ```
-| Name | Type | Description |
-| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
-| `pipeline` | list | Optional list of pipeline components that this component is part of. |
-| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`TextCategorizer`](/api/textcategorizer#create_optimizer) if not set. |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| -------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `pipeline`     | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/textcategorizer#create_optimizer) if not set. |
+| **RETURNS** | `Optimizer` | An optimizer. |
## TextCategorizer.create_optimizer {#create_optimizer tag="method"}
diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index c8fea6a34..c9c8138e8 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -719,8 +719,7 @@ vary on each step.
> ```python
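+> # assume train_data is a list of Example objects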
> batches = minibatch(train_data)
> for batch in batches:
->     texts, annotations = zip(*batch)
->     nlp.update(texts, annotations)
+>     nlp.update(batch)
> ```
| Name | Type | Description |
diff --git a/website/docs/usage/101/_architecture.md b/website/docs/usage/101/_architecture.md
index 4363b9b4f..95158b67d 100644
--- a/website/docs/usage/101/_architecture.md
+++ b/website/docs/usage/101/_architecture.md
@@ -45,10 +45,11 @@ an **annotated document**. It also orchestrates training and serialization.
### Other classes {#architecture-other}
-| Name | Description |
-| --------------------------------- | ------------------------------------------------------------------------------------------------------------- |
-| [`Vocab`](/api/vocab) | A lookup table for the vocabulary that allows you to access `Lexeme` objects. |
-| [`StringStore`](/api/stringstore) | Map strings to and from hash values. |
-| [`Vectors`](/api/vectors) | Container class for vector data keyed by string. |
-| [`GoldParse`](/api/goldparse) | Collection for training annotations. |
-| [`GoldCorpus`](/api/goldcorpus) | An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER. |
+| Name | Description |
+| --------------------------------- | ----------------------------------------------------------------------------- |
+| [`Vocab`](/api/vocab) | A lookup table for the vocabulary that allows you to access `Lexeme` objects. |
+| [`StringStore`](/api/stringstore) | Map strings to and from hash values. |
+| [`Vectors`](/api/vectors) | Container class for vector data keyed by string. |
+| [`Example`](/api/example) | Collection for training annotations. |
diff --git a/website/docs/usage/spacy-101.md b/website/docs/usage/spacy-101.md
index 245d4ef42..19580dc0f 100644
--- a/website/docs/usage/spacy-101.md
+++ b/website/docs/usage/spacy-101.md
@@ -633,8 +633,9 @@ for ent in doc.ents:
### Train and update neural network models {#lightning-tour-training}
```python
-import spacy
import random
+import spacy
+from spacy.gold import Example
nlp = spacy.load("en_core_web_sm")
train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})]
@@ -644,7 +645,9 @@ with nlp.select_pipes(enable="ner"):
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
-            nlp.update([text], [annotations], sgd=optimizer)
+            doc = nlp.make_doc(text)
+            example = Example.from_dict(doc, annotations)
+            nlp.update([example], sgd=optimizer)
nlp.to_disk("/model")
```
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index fd755c58b..51282c2ab 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -375,45 +375,71 @@ mattis pretium.
## Internal training API {#api}
-
+The [`Example`](/api/example) object contains annotated training data, also
+called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
+that will hold the predictions, and another `Doc` object that holds the
+gold-standard annotations. Here's a simple `Example` for part-of-speech
+tags:
-The [`GoldParse`](/api/goldparse) object collects the annotated training
-examples, also called the **gold standard**. It's initialized with the
-[`Doc`](/api/doc) object it refers to, and keyword arguments specifying the
-annotations, like `tags` or `entities`. Its job is to encode the annotations,
-keep them aligned and create the C-level data structures required for efficient
-access. Here's an example of a simple `GoldParse` for part-of-speech tags:
+```python
+import numpy
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from spacy.gold import Example
+
+vocab = Vocab()
+words = ["I", "like", "stuff"]
+predicted = Doc(vocab, words=words)
+# create the reference Doc with gold-standard TAG annotations
+tags = ["NOUN", "VERB", "NOUN"]
+tag_ids = [vocab.strings.add(tag) for tag in tags]
+reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
+example = Example(predicted, reference)
+```
+
+Alternatively, the `reference` `Doc` with the gold-standard annotations can be
+created from a dictionary whose keys specify the annotations, like `tags` or
+`entities`:
+
+```python
+words = ["I", "like", "stuff"]
+tags = ["NOUN", "VERB", "NOUN"]
+predicted = Doc(vocab, words=words)
+example = Example.from_dict(predicted, {"tags": tags})
+```
+
+Using the `Example` object and its gold-standard annotations, the model can be
+updated to learn a sentence of three words with their assigned part-of-speech
+tags.
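+
+A sketch of that update step, assuming an `nlp` pipeline with a tagger:
+
+```python
+optimizer = nlp.begin_training()
+nlp.update([example], sgd=optimizer)
+```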
+
+
+
+The [tag map](/usage/adding-languages#tag-map) is part of the vocabulary and
+defines the annotation scheme. If you're training a new language model, this
+will let you map the tags present in the treebank you train on to spaCy's tag
+scheme:
```python
vocab = Vocab(tag_map={"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}})
-doc = Doc(vocab, words=["I", "like", "stuff"])
-gold = GoldParse(doc, tags=["N", "V", "N"])
```
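+
+Translated to the new API, gold-standard tags in such a custom scheme might
+look like this (a sketch reusing the `N`/`V` tags defined above):
+
+```python
+doc = Doc(vocab, words=["I", "like", "stuff"])
+example = Example.from_dict(doc, {"tags": ["N", "V", "N"]})
+```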
-Using the `Doc` and its gold-standard annotations, the model can be updated to
-learn a sentence of three words with their assigned part-of-speech tags. The
-[tag map](/usage/adding-languages#tag-map) is part of the vocabulary and defines
-the annotation scheme. If you're training a new language model, this will let
-you map the tags present in the treebank you train on to spaCy's tag scheme.
+Another example shows how to define gold-standard named entities:
```python
-doc = Doc(Vocab(), words=["Facebook", "released", "React", "in", "2014"])
-gold = GoldParse(doc, entities=["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"])
+doc = Doc(vocab, words=["Facebook", "released", "React", "in", "2014"])
+example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```
-The same goes for named entities. The letters added before the labels refer to
-the tags of the [BILUO scheme](/usage/linguistic-features#updating-biluo) – `O`
-is a token outside an entity, `U` an single entity unit, `B` the beginning of an
-entity, `I` a token inside an entity and `L` the last token of an entity.
+The letters added before the labels refer to the tags of the
+[BILUO scheme](/usage/linguistic-features#updating-biluo) – `O` is a token
+outside an entity, `U` a single entity unit, `B` the beginning of an entity,
+`I` a token inside an entity and `L` the last token of an entity.
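+
+For a multi-token entity, the `B`, `I` and `L` tags combine — a small sketch
+with a hypothetical two-token `LOC` span:
+
+```python
+doc = Doc(vocab, words=["I", "flew", "to", "New", "York"])
+example = Example.from_dict(doc, {"entities": ["O", "O", "O", "B-LOC", "L-LOC"]})
+```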
> - **Training data**: The training examples.
> - **Text and label**: The current example.
> - **Doc**: A `Doc` object created from the example text.
-> - **GoldParse**: A `GoldParse` object of the `Doc` and label.
+> - **Example**: An `Example` object holding both predictions and gold-standard
+> annotations.
> - **nlp**: The `nlp` object with the model.
> - **Optimizer**: A function that holds state between updates.
> - **Update**: Update the model's weights.
+
+
![The training loop](../images/training-loop.svg)
Of course, it's not enough to only show a model a single example once.
@@ -427,32 +453,33 @@ dropout means that each feature or internal representation has a 1/4 likelihood
of being dropped.
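+
+For example, a dropout rate of 0.25 can be passed directly to
+[`nlp.update`](/api/language#update) — a sketch, assuming an `example` and an
+`optimizer` as in the loop below:
+
+```python
+nlp.update([example], drop=0.25, sgd=optimizer)
+```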
> - [`begin_training`](/api/language#begin_training): Start the training and
-> return an optimizer function to update the model's weights. Can take an
-> optional function converting the training data to spaCy's training format.
-> - [`update`](/api/language#update): Update the model with the training example
-> and gold data.
+> return an [`Optimizer`](https://thinc.ai/docs/api-optimizers) object to
+> update the model's weights.
+> - [`update`](/api/language#update): Update the model with the training
+> examples.
> - [`to_disk`](/api/language#to_disk): Save the updated model to a directory.
```python
### Example training loop
-optimizer = nlp.begin_training(get_data)
+optimizer = nlp.begin_training()
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
-        gold = GoldParse(doc, entities=entity_offsets)
-        nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
+        example = Example.from_dict(doc, {"entities": entity_offsets})
+        nlp.update([example], sgd=optimizer)
nlp.to_disk("/model")
```
The [`nlp.update`](/api/language#update) method takes the following arguments:
-| Name | Description |
-| ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `docs` | [`Doc`](/api/doc) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a sequence of raw texts. |
-| `golds` | [`GoldParse`](/api/goldparse) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a dictionary containing the annotations. |
-| `drop` | Dropout rate. Makes it harder for the model to just memorize the data. |
-| `sgd` | An optimizer, i.e. a callable to update the model's weights. If not set, spaCy will create a new one and save it for further use. |
+| Name | Description |
+| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | [`Example`](/api/example) objects. The `update` method takes a sequence of them, so you can batch up your training examples. |
+| `drop` | Dropout rate. Makes it harder for the model to just memorize the data. |
+| `sgd`      | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object, which updates the model's weights. If not set, spaCy will create a new one and save it for further use. |
+
+
Instead of writing your own training loop, you can also use the built-in
[`train`](/api/cli#train) command, which expects data in spaCy's