fix component constructors, update, begin_training, reference to GoldParse

svlandeg 2020-07-07 19:17:19 +02:00
parent 14a796e3f9
commit 2b60e894cb
12 changed files with 265 additions and 238 deletions


@@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and

>
> ```python
> # Construction from class
> from spacy.pipeline import DependencyParser
> parser = DependencyParser(nlp.vocab, parser_model)
> parser.from_disk("/path/to/model")
> ```

| Name        | Type               | Description                                                                     |
| ----------- | ------------------ | ------------------------------------------------------------------------------- |
| `vocab`     | `Vocab`            | The shared vocabulary.                                                          |
| `model`     | `Model`            | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `**cfg`     | -                  | Configuration parameters.                                                       |
| **RETURNS** | `DependencyParser` | The newly constructed object.                                                   |
## DependencyParser.\_\_call\_\_ {#call tag="method"}
@@ -126,26 +126,28 @@ Modify a batch of documents, using pre-computed scores.

## DependencyParser.update {#update tag="method"}

Learn from a batch of [`Example`](/api/example) objects, updating the pipe's
model. Delegates to [`predict`](/api/dependencyparser#predict) and
[`get_loss`](/api/dependencyparser#get_loss).

> #### Example
>
> ```python
> parser = DependencyParser(nlp.vocab, parser_model)
> losses = {}
> optimizer = nlp.begin_training()
> parser.update(examples, losses=losses, sgd=optimizer)
> ```

| Name              | Type                | Description                                                                                                                                    |
| ----------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples`        | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from.                                                                                     |
| _keyword-only_    |                     |                                                                                                                                                 |
| `drop`            | float               | The dropout rate.                                                                                                                               |
| `set_annotations` | bool                | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/dependencyparser#set_annotations).  |
| `sgd`             | `Optimizer`         | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object.                                                                                 |
| `losses`          | `Dict[str, float]`  | Optional record of the loss during training. The value keyed by the model's name is updated.                                                    |
| **RETURNS**       | `Dict[str, float]`  | The updated `losses` dictionary.                                                                                                                |
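The `losses` bookkeeping described in this table can be illustrated with a plain-Python stand-in. This is a hypothetical sketch of the contract only, not spaCy's implementation: the real `update` also runs the model and backpropagates, and the `"parser"` key and the loss values below are invented for illustration.

```python
def fake_update(examples, *, drop=0.0, sgd=None, losses=None):
    """Stand-in for the update contract: accumulate loss under the model's name."""
    if losses is None:
        losses = {}
    loss = 0.25 * len(examples)  # placeholder value; the real loss comes from get_loss
    losses["parser"] = losses.get("parser", 0.0) + loss
    return losses  # the updated losses dictionary is returned

losses = {}
for batch in [["ex1", "ex2"], ["ex3"]]:
    losses = fake_update(batch, losses=losses)
print(losses)  # {'parser': 0.75}
```

Passing the same dictionary across batches is what lets a training loop report a running loss per component.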
## DependencyParser.get_loss {#get_loss tag="method"}

@@ -169,8 +171,8 @@ predicted scores.

## DependencyParser.begin_training {#begin_training tag="method"}

Initialize the pipe for training, using data examples if available. Return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.

> #### Example
>

@@ -180,16 +182,17 @@ has been initialized yet, the model is added.

> optimizer = parser.begin_training(pipeline=nlp.pipeline)
> ```

| Name           | Type                    | Description                                                                                                                                                          |
| -------------- | ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | `Iterable[Example]`     | Optional gold-standard annotations in the form of [`Example`](/api/example) objects.                                                                                  |
| `pipeline`     | `List[(str, callable)]` | Optional list of pipeline components that this component is part of.                                                                                                  |
| `sgd`          | `Optimizer`             | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/dependencyparser#create_optimizer) if not set. |
| **RETURNS**    | `Optimizer`             | An optimizer.                                                                                                                                                         |
## DependencyParser.create_optimizer {#create_optimizer tag="method"}

Create an [`Optimizer`](https://thinc.ai/docs/api-optimizers) for the pipeline
component.

> #### Example
>

@@ -198,9 +201,9 @@ Create an optimizer for the pipeline component.

> optimizer = parser.create_optimizer()
> ```

| Name        | Type        | Description    |
| ----------- | ----------- | -------------- |
| **RETURNS** | `Optimizer` | The optimizer. |

## DependencyParser.use_params {#use_params tag="method, contextmanager"}


@@ -38,18 +38,17 @@ shortcut for this and instantiate the component using its string name and

>
> ```python
> # Construction from class
> from spacy.pipeline import EntityLinker
> entity_linker = EntityLinker(nlp.vocab, nel_model)
> entity_linker.from_disk("/path/to/model")
> ```

| Name        | Type           | Description                                                                     |
| ----------- | -------------- | ------------------------------------------------------------------------------- |
| `vocab`     | `Vocab`        | The shared vocabulary.                                                          |
| `model`     | `Model`        | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `**cfg`     | -              | Configuration parameters.                                                       |
| **RETURNS** | `EntityLinker` | The newly constructed object.                                                   |

## EntityLinker.\_\_call\_\_ {#call tag="method"}
@@ -134,7 +133,7 @@ entities.

## EntityLinker.update {#update tag="method"}

Learn from a batch of [`Example`](/api/example) objects, updating both the
pipe's entity linking model and context encoder. Delegates to
[`predict`](/api/entitylinker#predict) and
[`get_loss`](/api/entitylinker#get_loss).

@@ -142,19 +141,21 @@ pipe's entity linking model and context encoder. Delegates to

> #### Example
>
> ```python
> entity_linker = EntityLinker(nlp.vocab, nel_model)
> losses = {}
> optimizer = nlp.begin_training()
> entity_linker.update(examples, losses=losses, sgd=optimizer)
> ```

| Name              | Type                | Description                                                                                                                                |
| ----------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `examples`        | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from.                                                                                 |
| _keyword-only_    |                     |                                                                                                                                             |
| `drop`            | float               | The dropout rate.                                                                                                                           |
| `set_annotations` | bool                | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entitylinker#set_annotations).  |
| `sgd`             | `Optimizer`         | [`Optimizer`](https://thinc.ai/docs/api-optimizers) object.                                                                                 |
| `losses`          | `Dict[str, float]`  | Optional record of the loss during training. The value keyed by the model's name is updated.                                                |
| **RETURNS**       | float               | The loss from this batch.                                                                                                                   |

## EntityLinker.get_loss {#get_loss tag="method"}
@@ -195,9 +196,9 @@ identifiers.

## EntityLinker.begin_training {#begin_training tag="method"}

Initialize the pipe for training, using data examples if available. Return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Before calling this
method, a knowledge base should have been defined with
[`set_kb`](/api/entitylinker#set_kb).

> #### Example

@@ -209,12 +210,12 @@ knowledge base should have been defined with

> optimizer = entity_linker.begin_training(pipeline=nlp.pipeline)
> ```

| Name           | Type                    | Description                                                                                                                                                      |
| -------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `get_examples` | `Iterable[Example]`     | Optional gold-standard annotations in the form of [`Example`](/api/example) objects.                                                                              |
| `pipeline`     | `List[(str, callable)]` | Optional list of pipeline components that this component is part of.                                                                                              |
| `sgd`          | `Optimizer`             | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/entitylinker#create_optimizer) if not set.  |
| **RETURNS**    | `Optimizer`             | An optimizer.                                                                                                                                                    |

## EntityLinker.create_optimizer {#create_optimizer tag="method"}


@@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and

>
> ```python
> # Construction from class
> from spacy.pipeline import EntityRecognizer
> ner = EntityRecognizer(nlp.vocab, ner_model)
> ner.from_disk("/path/to/model")
> ```

| Name        | Type               | Description                                                                     |
| ----------- | ------------------ | ------------------------------------------------------------------------------- |
| `vocab`     | `Vocab`            | The shared vocabulary.                                                          |
| `model`     | `Model`            | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `**cfg`     | -                  | Configuration parameters.                                                       |
| **RETURNS** | `EntityRecognizer` | The newly constructed object.                                                   |

## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
@@ -102,10 +102,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.

> scores, tensors = ner.predict([doc1, doc2])
> ```

| Name        | Type     | Description                                                                                                |
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------- |
| `docs`      | iterable | The documents to predict.                                                                                   |
| **RETURNS** | list     | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal).  |

## EntityRecognizer.set_annotations {#set_annotations tag="method"}
@@ -127,26 +127,28 @@ Modify a batch of documents, using pre-computed scores.

## EntityRecognizer.update {#update tag="method"}

Learn from a batch of [`Example`](/api/example) objects, updating the pipe's
model. Delegates to [`predict`](/api/entityrecognizer#predict) and
[`get_loss`](/api/entityrecognizer#get_loss).

> #### Example
>
> ```python
> ner = EntityRecognizer(nlp.vocab, ner_model)
> losses = {}
> optimizer = nlp.begin_training()
> ner.update(examples, losses=losses, sgd=optimizer)
> ```

| Name              | Type                | Description                                                                                                                                    |
| ----------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples`        | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from.                                                                                     |
| _keyword-only_    |                     |                                                                                                                                                 |
| `drop`            | float               | The dropout rate.                                                                                                                               |
| `set_annotations` | bool                | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entityrecognizer#set_annotations).  |
| `sgd`             | `Optimizer`         | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object.                                                                                 |
| `losses`          | `Dict[str, float]`  | Optional record of the loss during training. The value keyed by the model's name is updated.                                                    |
| **RETURNS**       | `Dict[str, float]`  | The updated `losses` dictionary.                                                                                                                |

## EntityRecognizer.get_loss {#get_loss tag="method"}
@@ -170,8 +172,8 @@ predicted scores.

## EntityRecognizer.begin_training {#begin_training tag="method"}

Initialize the pipe for training, using data examples if available. Return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.

> #### Example
>

@@ -181,12 +183,14 @@ has been initialized yet, the model is added.

> optimizer = ner.begin_training(pipeline=nlp.pipeline)
> ```

| Name           | Type                    | Description                                                                                                                                                          |
| -------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | `Iterable[Example]`     | Optional gold-standard annotations in the form of [`Example`](/api/example) objects.                                                                                  |
| `pipeline`     | `List[(str, callable)]` | Optional list of pipeline components that this component is part of.                                                                                                  |
| `sgd`          | `Optimizer`             | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/entityrecognizer#create_optimizer) if not set. |
| **RETURNS**    | `Optimizer`             | An optimizer.                                                                                                                                                         |

## EntityRecognizer.create_optimizer {#create_optimizer tag="method"}


@@ -141,11 +141,12 @@ of the `reference` document.

> assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]
> ```

Get the aligned view of a certain token attribute, denoted by its int ID or
string name.

| Name        | Type                       | Description                                                        | Default |
| ----------- | -------------------------- | ------------------------------------------------------------------ | ------- |
| `field`     | int or str                 | Attribute ID or string name                                        |         |
| `as_string` | bool                       | Whether or not to return the list of values as strings.            | `False` |
| **RETURNS** | `List[int]` or `List[str]` | List of integer values, or string values if `as_string` is `True`. |         |
@@ -176,7 +177,7 @@ Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).

> ```python
> words = ["Mrs", "Smith", "flew", "to", "New York"]
> doc = Doc(en_vocab, words=words)
> entities = [(0, 9, "PERSON"), (18, 26, "LOC")]
> gold_words = ["Mrs Smith", "flew", "to", "New", "York"]
> example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
> ner_tags = example.get_aligned_ner()
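The hard-coded character offsets in the updated example are simply the evaluated `len()` expressions from the old one, which a few lines of plain Python can confirm:

```python
text = "Mrs Smith flew to New York"
# (0, 9) is (0, len("Mrs Smith")); (18, 26) covers "New York"
assert (0, len("Mrs Smith")) == (0, 9)
start = text.index("New York")
assert (start, start + len("New York")) == (18, 26)
assert text[0:9] == "Mrs Smith" and text[18:26] == "New York"
```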
@@ -197,7 +198,7 @@ Get the aligned view of the NER

> ```python
> words = ["Mr and Mrs Smith", "flew", "to", "New York"]
> doc = Doc(en_vocab, words=words)
> entities = [(0, 16, "PERSON")]
> tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
> ents_ref = example.reference.ents
@@ -220,15 +221,12 @@ in `example.predicted`.

> #### Example
>
> ```python
> nlp.add_pipe(my_ner)
> doc = nlp("Mr and Mrs Smith flew to New York")
> tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "New York"]
> example = Example.from_dict(doc, {"words": tokens_ref})
> ents_pred = example.predicted.ents
> # Assume the NER model has found "Mr and Mrs Smith" as a named entity
> assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4)]
> ents_x2y = example.get_aligned_spans_x2y(ents_pred)
> assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]
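The span projection shown above can be sketched as a character-overlap computation. This is a simplified illustration of the idea behind `get_aligned_spans_x2y`, not spaCy's actual alignment code; the two tokenizations are assumed from the example:

```python
text = "Mr and Mrs Smith flew to New York"
pred_tokens = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
ref_tokens = ["Mr and Mrs", "Smith", "flew", "to", "New York"]

def char_bounds(tokens, text):
    """Map each token to its (start, end) character offsets in text."""
    bounds, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        bounds.append((start, start + len(tok)))
        pos = start + len(tok)
    return bounds

def x2y(span, src_bounds, tgt_bounds):
    """Project a (start, end) token span from one tokenization onto another."""
    start_char = src_bounds[span[0]][0]
    end_char = src_bounds[span[1] - 1][1]
    overlapping = [i for i, (s, e) in enumerate(tgt_bounds)
                   if s < end_char and e > start_char]
    return (overlapping[0], overlapping[-1] + 1)

pred_bounds = char_bounds(pred_tokens, text)
ref_bounds = char_bounds(ref_tokens, text)
# "Mr and Mrs Smith" is tokens (0, 4) in the predicted tokenization
# and tokens (0, 2) in the reference tokenization
assert x2y((0, 4), pred_bounds, ref_bounds) == (0, 2)
```

Projecting through character offsets is what makes the alignment robust to tokenization mismatches between the predicted and reference documents.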


@@ -87,18 +87,18 @@ Update the models in the pipeline.

> ```python
> for raw_text, entity_offsets in train_data:
>     doc = nlp.make_doc(raw_text)
>     example = Example.from_dict(doc, {"entities": entity_offsets})
>     nlp.update([example], sgd=optimizer)
> ```

| Name                                         | Type                | Description                                                                   |
| -------------------------------------------- | ------------------- | ----------------------------------------------------------------------------- |
| `examples`                                   | `Iterable[Example]` | A batch of `Example` objects to learn from.                                   |
| _keyword-only_                               |                     |                                                                               |
| `drop`                                       | float               | The dropout rate.                                                             |
| `sgd`                                        | `Optimizer`         | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object.                |
| `losses`                                     | `Dict[str, float]`  | Dictionary to update with the loss, keyed by pipeline component.              |
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]`   | Config parameters for specific pipeline components, keyed by component name.  |
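The `(raw_text, entity_offsets)` pairs consumed by `Example.from_dict` above use plain character offsets into the raw text. A quick sanity check on a hypothetical sample (the data here is invented for illustration):

```python
# hypothetical training data in the (raw_text, entity_offsets) format used above
train_data = [
    ("Mrs Smith flew to New York", [(0, 9, "PERSON"), (18, 26, "LOC")]),
]
for raw_text, entity_offsets in train_data:
    for start, end, label in entity_offsets:
        entity_text = raw_text[start:end]
        # offsets should slice cleanly, with no leading/trailing whitespace
        assert entity_text and entity_text == entity_text.strip()
print([raw_text[s:e] for s, e, _ in train_data[0][1]])  # ['Mrs Smith', 'New York']
```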
## Language.evaluate {#evaluate tag="method"}
@@ -107,35 +107,37 @@ Evaluate a model's pipeline components.

> #### Example
>
> ```python
> scorer = nlp.evaluate(examples, verbose=True)
> print(scorer.scores)
> ```

| Name                                         | Type                | Description                                                                           |
| -------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------- |
| `examples`                                   | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from.                           |
| `verbose`                                    | bool                | Print debugging information.                                                          |
| `batch_size`                                 | int                 | The batch size to use.                                                                |
| `scorer`                                     | `Scorer`            | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]`   | Config parameters for specific pipeline components, keyed by component name.          |
| **RETURNS**                                  | `Scorer`            | The scorer containing the evaluation scores.                                          |
## Language.begin_training {#begin_training tag="method"}
Allocate models, pre-process training data and acquire an
[`Optimizer`](https://thinc.ai/docs/api-optimizers).

> #### Example
>
> ```python
> optimizer = nlp.begin_training(get_examples)
> ```

| Name                                         | Type                | Description                                                                                                        |
| -------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------ |
| `get_examples`                               | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects.                               |
| `sgd`                                        | `Optimizer`         | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. If not set, a default one will be created. |
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]`   | Config parameters for specific pipeline components, keyed by component name.                                       |
| `**cfg`                                      | -                   | Config parameters (sent to all components).                                                                        |
| **RETURNS**                                  | `Optimizer`         | An optimizer.                                                                                                      |
## Language.use_params {#use_params tag="contextmanager, method"}
@ -155,16 +157,6 @@ their original weights after the block.
| `params` | dict | A dictionary of parameters keyed by model ID. | | `params` | dict | A dictionary of parameters keyed by model ID. |
| `**cfg` | - | Config parameters. | | `**cfg` | - | Config parameters. |
## Language.create_pipe {#create_pipe tag="method" new="2"}

Create a pipeline component from a factory.


@ -27,22 +27,20 @@ Create a new `Scorer`.
## Scorer.score {#score tag="method"}

Update the evaluation scores from a single [`Example`](/api/example) object.
> #### Example
>
> ```python
> scorer = Scorer()
> scorer.score(example)
> ```
| Name           | Type      | Description                                                                                                          |
| -------------- | --------- | -------------------------------------------------------------------------------------------------------------------- |
| `example`      | `Example` | The `Example` object holding both the predictions and the correct gold-standard annotations.                          |
| `verbose`      | bool      | Print debugging information.                                                                                          |
| `punct_labels` | tuple     | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. |
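Under the hood, scoring boils down to comparing predicted annotations against the gold standard and computing precision, recall and F-score. A minimal illustrative sketch of that arithmetic (not spaCy's actual `Scorer` implementation), treating entities as `(start, end, label)` tuples:

```python
def prf(predicted: set, gold: set) -> dict:
    tp = len(predicted & gold)   # true positives: predicted and in gold
    fp = len(predicted - gold)   # predicted but not in gold
    fn = len(gold - predicted)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f}

# One correct entity, one spurious prediction, no misses
scores = prf({(0, 4, "ORG"), (10, 14, "DATE")}, {(0, 4, "ORG")})
```

With one correct entity, one spurious prediction and no misses, this yields a precision of 0.5, a recall of 1.0 and an F-score of 2/3.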
## Properties


@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import Tagger
> tagger = Tagger(nlp.vocab, tagger_model)
> tagger.from_disk("/path/to/model")
> ```
| Name        | Type     | Description                                                                     |
| ----------- | -------- | ------------------------------------------------------------------------------- |
| `vocab`     | `Vocab`  | The shared vocabulary.                                                          |
| `model`     | `Model`  | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `**cfg`     | -        | Configuration parameters.                                                       |
| **RETURNS** | `Tagger` | The newly constructed object.                                                   |
## Tagger.\_\_call\_\_ {#call tag="method"}
@ -132,19 +132,20 @@ pipe's model. Delegates to [`predict`](/api/tagger#predict) and
> #### Example
>
> ```python
> tagger = Tagger(nlp.vocab, tagger_model)
> losses = {}
> optimizer = nlp.begin_training()
> tagger.update(examples, losses=losses, sgd=optimizer)
> ```
| Name              | Type                | Description                                                                                                                          |
| ----------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| `examples`        | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from.                                                                           |
| _keyword-only_    |                     |                                                                                                                                       |
| `drop`            | float               | The dropout rate.                                                                                                                     |
| `set_annotations` | bool                | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/tagger#set_annotations).  |
| `sgd`             | `Optimizer`         | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object.                                                                       |
| `losses`          | `Dict[str, float]`  | Optional record of the loss during training. The value keyed by the model's name is updated.                                          |
## Tagger.get_loss {#get_loss tag="method"}
@ -168,8 +169,8 @@ predicted scores.
## Tagger.begin_training {#begin_training tag="method"}

Initialize the pipe for training, using data examples if available. Return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
> #### Example
>
@ -179,12 +180,12 @@ has been initialized yet, the model is added.
> optimizer = tagger.begin_training(pipeline=nlp.pipeline)
> ```
| Name           | Type                    | Description                                                                                                                                                |
| -------------- | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | `Iterable[Example]`     | Optional gold-standard annotations in the form of [`Example`](/api/example) objects.                                                                        |
| `pipeline`     | `List[(str, callable)]` | Optional list of pipeline components that this component is part of.                                                                                        |
| `sgd`          | `Optimizer`             | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/tagger#create_optimizer) if not set.  |
| **RETURNS**    | `Optimizer`             | An optimizer.                                                                                                                                               |
## Tagger.create_optimizer {#create_optimizer tag="method"}


@ -35,17 +35,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import TextCategorizer
> textcat = TextCategorizer(nlp.vocab, textcat_model)
> textcat.from_disk("/path/to/model")
> ```
| Name        | Type              | Description                                                                     |
| ----------- | ----------------- | ------------------------------------------------------------------------------- |
| `vocab`     | `Vocab`           | The shared vocabulary.                                                          |
| `model`     | `Model`           | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `**cfg`     | -                 | Configuration parameters.                                                       |
| **RETURNS** | `TextCategorizer` | The newly constructed object.                                                   |
### Architectures {#architectures new="2.1"}
@ -151,19 +150,20 @@ pipe's model. Delegates to [`predict`](/api/textcategorizer#predict) and
> #### Example
>
> ```python
> textcat = TextCategorizer(nlp.vocab, textcat_model)
> losses = {}
> optimizer = nlp.begin_training()
> textcat.update(examples, losses=losses, sgd=optimizer)
> ```
| Name              | Type                | Description                                                                                                                                   |
| ----------------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples`        | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from.                                                                                    |
| _keyword-only_    |                     |                                                                                                                                                |
| `drop`            | float               | The dropout rate.                                                                                                                              |
| `set_annotations` | bool                | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/textcategorizer#set_annotations).  |
| `sgd`             | `Optimizer`         | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object.                                                                                |
| `losses`          | `Dict[str, float]`  | Optional record of the loss during training. The value keyed by the model's name is updated.                                                   |
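The `update` step drives learning from the difference between the predicted scores and the gold-standard truth. As a rough sketch of that arithmetic (illustrative only, not spaCy's actual implementation), the gradient of a squared-error loss is simply the scores minus the one-hot truth vector:

```python
def get_loss(scores, truths):
    """Squared-error loss and its gradient for one set of category scores."""
    d_scores = [s - t for s, t in zip(scores, truths)]  # gradient per class
    loss = sum(d * d for d in d_scores)                 # scalar loss
    return loss, d_scores

# Scores for three categories; the first one is the correct label
loss, d_scores = get_loss([0.8, 0.1, 0.1], [1.0, 0.0, 0.0])
```

The gradient is then backpropagated through the model, and the optimizer uses it to adjust the weights.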
## TextCategorizer.get_loss {#get_loss tag="method"}
@ -187,8 +187,8 @@ predicted scores.
## TextCategorizer.begin_training {#begin_training tag="method"}

Initialize the pipe for training, using data examples if available. Return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
> #### Example
>
@ -198,12 +198,12 @@ has been initialized yet, the model is added.
> optimizer = textcat.begin_training(pipeline=nlp.pipeline)
> ```
| Name           | Type                    | Description                                                                                                                                                         |
| -------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | `Iterable[Example]`     | Optional gold-standard annotations in the form of [`Example`](/api/example) objects.                                                                                 |
| `pipeline`     | `List[(str, callable)]` | Optional list of pipeline components that this component is part of.                                                                                                 |
| `sgd`          | `Optimizer`             | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/textcategorizer#create_optimizer) if not set.  |
| **RETURNS**    | `Optimizer`             | An optimizer.                                                                                                                                                        |
## TextCategorizer.create_optimizer {#create_optimizer tag="method"}


@ -719,8 +719,7 @@ vary on each step.
> ```python
> batches = minibatch(train_data)
> for batch in batches:
>     nlp.update(batch)
> ```
| Name | Type | Description |
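The `minibatch` helper typically pairs with a compounding batch size, so batches grow gradually during training. Here is a self-contained sketch of that behavior (illustrative only, not the source of spaCy's `util.minibatch` and `util.compounding`):

```python
from itertools import islice

def compounding(start, stop, compound):
    """Yield an infinite series of compounding values, capped at `stop`."""
    size = start
    while True:
        yield size
        size = min(size * compound, stop)

def minibatch(items, size):
    """Iterate over batches of items, with batch sizes drawn from `size`."""
    items = iter(items)
    for batch_size in size:
        batch = list(islice(items, int(batch_size)))
        if not batch:
            return
        yield batch

# Batch sizes compound 1 -> 2 -> 4 and then stay at 4
batches = list(minibatch(range(10), compounding(1.0, 4.0, 2.0)))
# → [[0], [1, 2], [3, 4, 5, 6], [7, 8, 9]]
```

Starting with small batches and growing them is a common trick: early updates are noisy but frequent, while later updates average over more examples.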


@ -45,10 +45,11 @@ an **annotated document**. It also orchestrates training and serialization.
### Other classes {#architecture-other}
| Name                              | Description                                                                                                   |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| [`Vocab`](/api/vocab)             | A lookup table for the vocabulary that allows you to access `Lexeme` objects.                                 |
| [`StringStore`](/api/stringstore) | Map strings to and from hash values.                                                                          |
| [`Vectors`](/api/vectors)         | Container class for vector data keyed by string.                                                              |
| [`Example`](/api/example)         | Collection for training annotations.                                                                          |
| [`GoldCorpus`](/api/goldcorpus)   | An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER. |


@ -633,8 +633,9 @@ for ent in doc.ents:
### Train and update neural network models {#lightning-tour-training}
```python
import random
import spacy
from spacy.gold import Example

nlp = spacy.load("en_core_web_sm")
train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})]

with nlp.select_pipes(enable="ner"):
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update([example], sgd=optimizer)
nlp.to_disk("/model")
```


@ -375,45 +375,71 @@ mattis pretium.
## Internal training API {#api}

The [`Example`](/api/example) object contains annotated training data, also
called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
that will hold the predictions, and another `Doc` object that holds the
gold-standard annotations. Here's an example of a simple `Example` for
part-of-speech tags:
```python
words = ["I", "like", "stuff"]
predicted = Doc(vocab, words=words)
# create the reference Doc with gold-standard TAG annotations
tags = ["NOUN", "VERB", "NOUN"]
tag_ids = [vocab.strings.add(tag) for tag in tags]
reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
example = Example(predicted, reference)
```
Alternatively, the `reference` `Doc` with the gold-standard annotations can be
created from a dictionary with keyword arguments specifying the annotations,
like `tags` or `entities`:
```python
words = ["I", "like", "stuff"]
tags = ["NOUN", "VERB", "NOUN"]
predicted = Doc(vocab, words=words)
example = Example.from_dict(predicted, {"tags": tags})
```
Using the `Example` object and its gold-standard annotations, the model can be
updated to learn a sentence of three words with their assigned part-of-speech
tags.
<!-- TODO: is this the best place for the tag_map explanation ? -->
The [tag map](/usage/adding-languages#tag-map) is part of the vocabulary and
defines the annotation scheme. If you're training a new language model, this
will let you map the tags present in the treebank you train on to spaCy's tag
scheme:
```python
vocab = Vocab(tag_map={"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}})
```
Another example shows how to define gold-standard named entities:
```python
doc = Doc(vocab, words=["Facebook", "released", "React", "in", "2014"])
example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```
The letters added before the labels refer to the tags of the
[BILUO scheme](/usage/linguistic-features#updating-biluo): `O` is a token
outside an entity, `U` a single entity unit, `B` the beginning of an entity,
`I` a token inside an entity and `L` the last token of an entity.
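To make the mapping from character offsets to BILUO tags concrete, here is a hedged, pure-Python sketch that assumes naive whitespace tokenization. spaCy ships its own conversion helper that handles real tokenization and alignment; the function name below is ours, for illustration only.

```python
def biluo_from_offsets(text, entities):
    """Convert (start, end, label) character offsets to BILUO tags,
    assuming the text splits into tokens on single spaces."""
    tokens, i = [], 0
    for word in text.split(" "):
        tokens.append((i, i + len(word)))  # (start_char, end_char) per token
        i += len(word) + 1
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        span = [idx for idx, (s, e) in enumerate(tokens) if s >= start and e <= end]
        if len(span) == 1:
            tags[span[0]] = f"U-{label}"           # single-token entity
        elif span:
            tags[span[0]] = f"B-{label}"           # beginning
            for idx in span[1:-1]:
                tags[idx] = f"I-{label}"           # inside
            tags[span[-1]] = f"L-{label}"          # last
    return tags

tags = biluo_from_offsets("Uber blew through $1 million", [(0, 4, "ORG")])
# → ["U-ORG", "O", "O", "O", "O"]
```

A multi-token entity such as `(0, 13, "GPE")` over "San Francisco" would instead yield `B-GPE` followed by `L-GPE`.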
> - **Training data**: The training examples.
> - **Text and label**: The current example.
> - **Doc**: A `Doc` object created from the example text.
> - **Example**: An `Example` object holding both predictions and gold-standard
>   annotations.
> - **nlp**: The `nlp` object with the model.
> - **Optimizer**: A function that holds state between updates.
> - **Update**: Update the model's weights.
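The idea that an optimizer "holds state between updates" can be made concrete with a minimal SGD-with-momentum object that remembers a velocity per parameter key. This is an illustrative sketch only, far simpler than the real [`Optimizer`](https://thinc.ai/docs/api-optimizers) Thinc provides:

```python
class SGDMomentum:
    """Minimal stateful optimizer: velocity is kept between calls."""

    def __init__(self, learn_rate=0.1, momentum=0.9):
        self.learn_rate = learn_rate
        self.momentum = momentum
        self.velocities = {}  # state carried across updates, keyed by parameter

    def __call__(self, key, weights, gradient):
        v = self.velocities.get(key, [0.0] * len(weights))
        # blend the previous velocity with the new gradient
        v = [self.momentum * vi + gi for vi, gi in zip(v, gradient)]
        self.velocities[key] = v
        # step the weights against the velocity
        return [w - self.learn_rate * vi for w, vi in zip(weights, v)]

sgd = SGDMomentum()
w = sgd("layer1", [1.0, 2.0], [0.5, -0.5])  # first step: velocity == gradient
```

Because the velocity persists, a repeated gradient pushes the weights further on each successive call, which is exactly the state an optimizer carries between updates.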
<!-- TODO: update graphic & related text -->
![The training loop](../images/training-loop.svg)

Of course, it's not enough to only show a model a single example once.
@ -427,32 +453,33 @@ dropout means that each feature or internal representation has a 1/4 likelihood
of being dropped.
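The 1/4 likelihood above corresponds to inverted dropout at rate 0.25: each value is zeroed with probability 0.25, and the survivors are rescaled so the expected value is unchanged. A minimal sketch of the idea (not spaCy or Thinc internals):

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the survivors."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)  # seeded for reproducibility
out = dropout([1.0, 1.0, 1.0, 1.0], 0.25, rng)
```

Every surviving value becomes `1.0 / 0.75 ≈ 1.33`, so averaged over many draws the output matches the input, while individual updates can never rely on any single feature being present.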
> - [`begin_training`](/api/language#begin_training): Start the training and
>   return an [`Optimizer`](https://thinc.ai/docs/api-optimizers) object to
>   update the model's weights.
> - [`update`](/api/language#update): Update the model with the training
>   examples.
> - [`to_disk`](/api/language#to_disk): Save the updated model to a directory.
```python
### Example training loop
optimizer = nlp.begin_training()
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        example = Example.from_dict(doc, {"entities": entity_offsets})
        nlp.update([example], sgd=optimizer)
nlp.to_disk("/model")
```
The [`nlp.update`](/api/language#update) method takes the following arguments:
| Name       | Description                                                                                                                                                            |
| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | [`Example`](/api/example) objects. The `update` method takes a sequence of them, so you can batch up your training examples.                                            |
| `drop`     | Dropout rate. Makes it harder for the model to just memorize the data.                                                                                                  |
| `sgd`      | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object, which updates the model's weights. If not set, spaCy will create a new one and save it for further use.  |
<!-- TODO: DocBin format ? -->
Instead of writing your own training loop, you can also use the built-in
[`train`](/api/cli#train) command, which expects data in spaCy's