Mirror of https://github.com/explosion/spaCy.git
fix component constructors, update, begin_training, reference to GoldParse
Commit 2b60e894cb (parent 14a796e3f9)
@@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import DependencyParser
-> parser = DependencyParser(nlp.vocab)
+> parser = DependencyParser(nlp.vocab, parser_model)
> parser.from_disk("/path/to/model")
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg` | - | Configuration parameters. |
-| **RETURNS** | `DependencyParser` | The newly constructed object. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `DependencyParser` | The newly constructed object. |

## DependencyParser.\_\_call\_\_ {#call tag="method"}

@@ -126,26 +126,28 @@ Modify a batch of documents, using pre-computed scores.

## DependencyParser.update {#update tag="method"}

-Learn from a batch of documents and gold-standard information, updating the
-pipe's model. Delegates to [`predict`](/api/dependencyparser#predict) and
+Learn from a batch of [`Example`](/api/example) objects, updating the pipe's
+model. Delegates to [`predict`](/api/dependencyparser#predict) and
[`get_loss`](/api/dependencyparser#get_loss).

> #### Example
>
> ```python
-> parser = DependencyParser(nlp.vocab)
+> parser = DependencyParser(nlp.vocab, parser_model)
> losses = {}
> optimizer = nlp.begin_training()
-> parser.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> parser.update(examples, losses=losses, sgd=optimizer)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `docs` | iterable | A batch of documents to learn from. |
-| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
-| `drop` | float | The dropout rate. |
-| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
-| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/dependencyparser#set_annotations). |
+| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
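
To make the migration concrete, here is a minimal sketch of the new `Example`-based call, assuming `train_data` is a list of `(text, annotations)` tuples and using the `spacy.gold.Example` import that appears later in this commit:

```python
from spacy.gold import Example

# Pair each predicted Doc with its gold-standard annotations (sketch)
examples = [Example.from_dict(nlp.make_doc(text), annots)
            for text, annots in train_data]

losses = {}
optimizer = nlp.begin_training()
parser.update(examples, drop=0.2, sgd=optimizer, losses=losses)
print(losses)  # losses are keyed by the component's name, e.g. "parser"
```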

## DependencyParser.get_loss {#get_loss tag="method"}

@@ -169,8 +171,8 @@ predicted scores.

## DependencyParser.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. If no model
-has been initialized yet, the model is added.
+Initialize the pipe for training, using data examples if available. Return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.

> #### Example
>
@@ -180,16 +182,17 @@ has been initialized yet, the model is added.
> optimizer = parser.begin_training(pipeline=nlp.pipeline)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
-| `pipeline` | list | Optional list of pipeline components that this component is part of. |
-| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`DependencyParser`](/api/dependencyparser#create_optimizer) if not set. |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `pipeline` | `List[(str, callable)]` | Optional list of pipeline components that this component is part of. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/dependencyparser#create_optimizer) if not set. |
+| **RETURNS** | `Optimizer` | An optimizer. |

## DependencyParser.create_optimizer {#create_optimizer tag="method"}

-Create an optimizer for the pipeline component.
+Create an [`Optimizer`](https://thinc.ai/docs/api-optimizers) for the pipeline
+component.

> #### Example
>
@@ -198,9 +201,9 @@ Create an optimizer for the pipeline component.
> optimizer = parser.create_optimizer()
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| **RETURNS** | callable | The optimizer. |
+| Name | Type | Description |
+| --- | --- | --- |
+| **RETURNS** | `Optimizer` | The optimizer. |

## DependencyParser.use_params {#use_params tag="method, contextmanager"}

@@ -38,18 +38,17 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import EntityLinker
-> entity_linker = EntityLinker(nlp.vocab)
+> entity_linker = EntityLinker(nlp.vocab, nel_model)
> entity_linker.from_disk("/path/to/model")
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `hidden_width` | int | Width of the hidden layer of the entity linking model, defaults to `128`. |
-| `incl_prior` | bool | Whether or not to include prior probabilities in the model. Defaults to `True`. |
-| `incl_context` | bool | Whether or not to include the local context in the model (if not: only prior probabilities are used). Defaults to `True`. |
-| **RETURNS** | `EntityLinker` | The newly constructed object. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `EntityLinker` | The newly constructed object. |

## EntityLinker.\_\_call\_\_ {#call tag="method"}

@@ -134,7 +133,7 @@ entities.

## EntityLinker.update {#update tag="method"}

-Learn from a batch of documents and gold-standard information, updating both the
+Learn from a batch of [`Example`](/api/example) objects, updating both the
pipe's entity linking model and context encoder. Delegates to
[`predict`](/api/entitylinker#predict) and
[`get_loss`](/api/entitylinker#get_loss).

@@ -142,19 +141,21 @@ pipe's entity linking model and context encoder. Delegates to
> #### Example
>
> ```python
-> entity_linker = EntityLinker(nlp.vocab)
+> entity_linker = EntityLinker(nlp.vocab, nel_model)
> losses = {}
> optimizer = nlp.begin_training()
-> entity_linker.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> entity_linker.update(examples, losses=losses, sgd=optimizer)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `docs` | iterable | A batch of documents to learn from. |
-| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
-| `drop` | float | The dropout rate, used both for the EL model and the context encoder. |
-| `sgd` | callable | The optimizer for the EL model. Should take two arguments `weights` and `gradient`, and an optional ID. |
-| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entitylinker#set_annotations). |
+| `sgd` | `Optimizer` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| **RETURNS** | float | The loss from this batch. |

## EntityLinker.get_loss {#get_loss tag="method"}

@@ -195,9 +196,9 @@ identifiers.

## EntityLinker.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. If no model
-has been initialized yet, the model is added. Before calling this method, a
-knowledge base should have been defined with
+Initialize the pipe for training, using data examples if available. Return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Before calling this
+method, a knowledge base should have been defined with
[`set_kb`](/api/entitylinker#set_kb).

> #### Example

@@ -209,12 +210,12 @@ knowledge base should have been defined with
> optimizer = entity_linker.begin_training(pipeline=nlp.pipeline)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
-| `pipeline` | list | Optional list of pipeline components that this component is part of. |
-| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`EntityLinker`](/api/entitylinker#create_optimizer) if not set. |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `pipeline` | `List[(str, callable)]` | Optional list of pipeline components that this component is part of. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/entitylinker#create_optimizer) if not set. |
+| **RETURNS** | `Optimizer` | An optimizer. |
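
Since a knowledge base is a hard prerequisite here, a minimal sketch of the intended order of calls might look as follows (`kb` is assumed to be a previously built `KnowledgeBase`):

```python
# Sketch: attach the knowledge base before initializing for training
entity_linker.set_kb(kb)
optimizer = entity_linker.begin_training(pipeline=nlp.pipeline)

losses = {}
entity_linker.update(examples, sgd=optimizer, losses=losses)
```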

## EntityLinker.create_optimizer {#create_optimizer tag="method"}

@@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import EntityRecognizer
-> ner = EntityRecognizer(nlp.vocab)
+> ner = EntityRecognizer(nlp.vocab, ner_model)
> ner.from_disk("/path/to/model")
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg` | - | Configuration parameters. |
-| **RETURNS** | `EntityRecognizer` | The newly constructed object. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `EntityRecognizer` | The newly constructed object. |

## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

@@ -102,10 +102,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.
> scores, tensors = ner.predict([doc1, doc2])
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `docs` | iterable | The documents to predict. |
-| **RETURNS** | list | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal). |
+| Name | Type | Description |
+| --- | --- | --- |
+| `docs` | iterable | The documents to predict. |
+| **RETURNS** | list | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal). |
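
The `predict`/`set_annotations` split lets you compute scores once and apply them separately. A rough sketch; the exact `set_annotations` argument order is taken from these docs and should be treated as an assumption:

```python
docs = [nlp.make_doc("Google rebranded its parent company")]
# Compute predictions without modifying the docs...
scores, tensors = ner.predict(docs)
# ...then write the entity annotations back onto them
ner.set_annotations(docs, scores, tensors)
```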

## EntityRecognizer.set_annotations {#set_annotations tag="method"}

@@ -127,26 +127,28 @@ Modify a batch of documents, using pre-computed scores.

## EntityRecognizer.update {#update tag="method"}

-Learn from a batch of documents and gold-standard information, updating the
-pipe's model. Delegates to [`predict`](/api/entityrecognizer#predict) and
+Learn from a batch of [`Example`](/api/example) objects, updating the pipe's
+model. Delegates to [`predict`](/api/entityrecognizer#predict) and
[`get_loss`](/api/entityrecognizer#get_loss).

> #### Example
>
> ```python
-> ner = EntityRecognizer(nlp.vocab)
+> ner = EntityRecognizer(nlp.vocab, ner_model)
> losses = {}
> optimizer = nlp.begin_training()
-> ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> ner.update(examples, losses=losses, sgd=optimizer)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `docs` | iterable | A batch of documents to learn from. |
-| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
-| `drop` | float | The dropout rate. |
-| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
-| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entityrecognizer#set_annotations). |
+| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
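
For migrating existing v2-style training code, the change is mechanical: wrap each `Doc` and its annotations in an `Example` instead of passing parallel lists. A hedged before/after sketch, assuming `(text, annotations)` training tuples:

```python
from spacy.gold import Example

# Before: ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
# After (sketch):
examples = [Example.from_dict(nlp.make_doc(text), annots)
            for text, annots in train_data]
ner.update(examples, losses=losses, sgd=optimizer)
```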

## EntityRecognizer.get_loss {#get_loss tag="method"}

@@ -170,8 +172,8 @@ predicted scores.

## EntityRecognizer.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. If no model
-has been initialized yet, the model is added.
+Initialize the pipe for training, using data examples if available. Return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.

> #### Example
>
@@ -181,12 +183,14 @@ has been initialized yet, the model is added.
> optimizer = ner.begin_training(pipeline=nlp.pipeline)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
-| `pipeline` | list | Optional list of pipeline components that this component is part of. |
-| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`EntityRecognizer`](/api/entityrecognizer#create_optimizer) if not set. |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `pipeline` | `List[(str, callable)]` | Optional list of pipeline components that this component is part of. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/entityrecognizer#create_optimizer) if not set. |
+| **RETURNS** | `Optimizer` | An optimizer. |

## EntityRecognizer.create_optimizer {#create_optimizer tag="method"}

@@ -141,11 +141,12 @@ of the `reference` document.
> assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]
> ```

-Get the aligned view of a certain token attribute, denoted by its int ID or string name.
+Get the aligned view of a certain token attribute, denoted by its int ID or
+string name.

| Name | Type | Description | Default |
| --- | --- | --- | --- |
-| `field` | int or str | Attribute ID or string name | |
+| `field` | int or str | Attribute ID or string name | |
| `as_string` | bool | Whether or not to return the list of values as strings. | `False` |
| **RETURNS** | `List[int]` or `List[str]` | List of integer values, or string values if `as_string` is `True`. | |
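
Without `as_string`, the same call returns the integer hash IDs instead of the labels; a small sketch of the difference, reusing the `example` from the snippet above:

```python
tag_ids = example.get_aligned("TAG")                   # list of integer hash IDs
tag_strs = example.get_aligned("TAG", as_string=True)  # ["VERB", "DET", "NOUN"]
```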

@@ -176,7 +177,7 @@ Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).
> ```python
> words = ["Mrs", "Smith", "flew", "to", "New York"]
> doc = Doc(en_vocab, words=words)
-> entities = [(0, len("Mrs Smith"), "PERSON"), (18, 18 + len("New York"), "LOC")]
+> entities = [(0, 9, "PERSON"), (18, 26, "LOC")]
> gold_words = ["Mrs Smith", "flew", "to", "New", "York"]
> example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
> ner_tags = example.get_aligned_ner()

@@ -197,7 +198,7 @@ Get the aligned view of the NER
> ```python
> words = ["Mr and Mrs Smith", "flew", "to", "New York"]
> doc = Doc(en_vocab, words=words)
-> entities = [(0, len("Mr and Mrs Smith"), "PERSON")]
+> entities = [(0, 16, "PERSON")]
> tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
> ents_ref = example.reference.ents

@@ -220,15 +221,12 @@ in `example.predicted`.
> #### Example
>
> ```python
-> ruler = EntityRuler(nlp)
-> patterns = [{"label": "PERSON", "pattern": "Mr and Mrs Smith"}]
-> ruler.add_patterns(patterns)
-> nlp.add_pipe(ruler)
+> nlp.add_pipe(my_ner)
> doc = nlp("Mr and Mrs Smith flew to New York")
-> entities = [(0, len("Mr and Mrs Smith"), "PERSON")]
> tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "New York"]
-> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
+> example = Example.from_dict(doc, {"words": tokens_ref})
> ents_pred = example.predicted.ents
+> # Assume the NER model has found "Mr and Mrs Smith" as a named entity
> assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4)]
> ents_x2y = example.get_aligned_spans_x2y(ents_pred)
> assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]

@@ -87,18 +87,18 @@ Update the models in the pipeline.
> ```python
> for raw_text, entity_offsets in train_data:
>     doc = nlp.make_doc(raw_text)
->     gold = GoldParse(doc, entities=entity_offsets)
->     nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
+>     example = Example.from_dict(doc, {"entities": entity_offsets})
+>     nlp.update([example], sgd=optimizer)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `docs` | iterable | A batch of `Doc` objects or strings. If strings, a `Doc` object will be created from the text. |
-| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
-| `drop` | float | The dropout rate. |
-| `sgd` | callable | An optimizer. |
-| `losses` | dict | Dictionary to update with the loss, keyed by pipeline component. |
-| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `examples` | `Iterable[Example]` | A batch of `Example` objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `sgd` | `Optimizer` | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Dictionary to update with the loss, keyed by pipeline component. |
+| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |

## Language.evaluate {#evaluate tag="method"}

@@ -107,35 +107,37 @@ Evaluate a model's pipeline components.
> #### Example
>
> ```python
-> scorer = nlp.evaluate(docs_golds, verbose=True)
+> scorer = nlp.evaluate(examples, verbose=True)
> print(scorer.scores)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects, such that the `Doc` objects contain the predictions and the `GoldParse` objects the correct annotations. Alternatively, `(text, annotations)` tuples of raw text and a dict (see [simple training style](/usage/training#training-simple-style)). |
-| `verbose` | bool | Print debugging information. |
-| `batch_size` | int | The batch size to use. |
-| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
-| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
-| **RETURNS** | Scorer | The scorer containing the evaluation scores. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| `verbose` | bool | Print debugging information. |
+| `batch_size` | int | The batch size to use. |
+| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
+| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
+| **RETURNS** | Scorer | The scorer containing the evaluation scores. |
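
A sketch of evaluating held-out data under the new API; `dev_data` as a list of `(text, annotations)` tuples and the exact score keys are assumptions:

```python
examples = [Example.from_dict(nlp.make_doc(text), annots)
            for text, annots in dev_data]
scorer = nlp.evaluate(examples, batch_size=32)
print(scorer.scores)  # e.g. scorer.scores["ents_f"] for the NER F-score
```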

## Language.begin_training {#begin_training tag="method"}

-Allocate models, pre-process training data and acquire an optimizer.
+Allocate models, pre-process training data and acquire an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers).

> #### Example
>
> ```python
-> optimizer = nlp.begin_training(gold_tuples)
+> optimizer = nlp.begin_training(get_examples)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `gold_tuples` | iterable | Gold-standard training data. |
-| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
-| `**cfg` | - | Config parameters (sent to all components). |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. If not set, a default one will be created. |
+| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
+| `**cfg` | - | Config parameters (sent to all components). |
+| **RETURNS** | `Optimizer` | An optimizer. |
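
If you want control over the optimizer rather than the default, the new `sgd` argument accepts a Thinc optimizer. A sketch; the `thinc.api.Adam` import path for Thinc 8 is an assumption:

```python
from thinc.api import Adam

# Pass a custom optimizer instead of letting spaCy create one
optimizer = nlp.begin_training(get_examples, sgd=Adam(0.001))
```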

## Language.use_params {#use_params tag="contextmanager, method"}

@@ -155,16 +157,6 @@ their original weights after the block.
| `params` | dict | A dictionary of parameters keyed by model ID. |
| `**cfg` | - | Config parameters. |

-## Language.preprocess_gold {#preprocess_gold tag="method"}
-
-Can be called before training to pre-process gold data. By default, it handles
-nonprojectivity and adds missing tags to the tag map.
-
-| Name | Type | Description |
-| --- | --- | --- |
-| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects. |
-| **YIELDS** | tuple | Tuples of `Doc` and `GoldParse` objects. |
-
## Language.create_pipe {#create_pipe tag="method" new="2"}

Create a pipeline component from a factory.

@@ -27,22 +27,20 @@ Create a new `Scorer`.

## Scorer.score {#score tag="method"}

-Update the evaluation scores from a single [`Doc`](/api/doc) /
-[`GoldParse`](/api/goldparse) pair.
+Update the evaluation scores from a single [`Example`](/api/example) object.

> #### Example
>
> ```python
> scorer = Scorer()
-> scorer.score(doc, gold)
+> scorer.score(example)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `doc` | `Doc` | The predicted annotations. |
-| `gold` | `GoldParse` | The correct annotations. |
-| `verbose` | bool | Print debugging information. |
-| `punct_labels` | tuple | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `example` | `Example` | The `Example` object holding both the predictions and the correct gold-standard annotations. |
+| `verbose` | bool | Print debugging information. |
+| `punct_labels` | tuple | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. |
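
Scoring a whole dataset is just a loop over `Example` objects; a minimal sketch, with `examples` assumed to be an iterable of `Example` objects:

```python
from spacy.scorer import Scorer

scorer = Scorer()
for example in examples:
    scorer.score(example)  # accumulates scores across the dataset
print(scorer.scores)
```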

## Properties

@@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import Tagger
-> tagger = Tagger(nlp.vocab)
+> tagger = Tagger(nlp.vocab, tagger_model)
> tagger.from_disk("/path/to/model")
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg` | - | Configuration parameters. |
-| **RETURNS** | `Tagger` | The newly constructed object. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `Tagger` | The newly constructed object. |

## Tagger.\_\_call\_\_ {#call tag="method"}

@@ -132,19 +132,20 @@ pipe's model. Delegates to [`predict`](/api/tagger#predict) and
> #### Example
>
> ```python
-> tagger = Tagger(nlp.vocab)
+> tagger = Tagger(nlp.vocab, tagger_model)
> losses = {}
> optimizer = nlp.begin_training()
-> tagger.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> tagger.update(examples, losses=losses, sgd=optimizer)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `docs` | iterable | A batch of documents to learn from. |
-| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
-| `drop` | float | The dropout rate. |
-| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
-| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/tagger#set_annotations). |
+| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |

## Tagger.get_loss {#get_loss tag="method"}

@@ -168,8 +169,8 @@ predicted scores.

## Tagger.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. If no model
-has been initialized yet, the model is added.
+Initialize the pipe for training, using data examples if available. Return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.

> #### Example
>
@@ -179,12 +180,12 @@ has been initialized yet, the model is added.
> optimizer = tagger.begin_training(pipeline=nlp.pipeline)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
-| `pipeline` | list | Optional list of pipeline components that this component is part of. |
-| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`Tagger`](/api/tagger#create_optimizer) if not set. |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `pipeline` | `List[(str, callable)]` | Optional list of pipeline components that this component is part of. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/tagger#create_optimizer) if not set. |
+| **RETURNS** | `Optimizer` | An optimizer. |

## Tagger.create_optimizer {#create_optimizer tag="method"}

@@ -35,17 +35,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import TextCategorizer
-> textcat = TextCategorizer(nlp.vocab)
+> textcat = TextCategorizer(nlp.vocab, textcat_model)
> textcat.from_disk("/path/to/model")
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `exclusive_classes` | bool | Make categories mutually exclusive. Defaults to `False`. |
-| `architecture` | str | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. |
-| **RETURNS** | `TextCategorizer` | The newly constructed object. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `TextCategorizer` | The newly constructed object. |

### Architectures {#architectures new="2.1"}

@@ -151,19 +150,20 @@ pipe's model. Delegates to [`predict`](/api/textcategorizer#predict) and
> #### Example
>
> ```python
-> textcat = TextCategorizer(nlp.vocab)
+> textcat = TextCategorizer(nlp.vocab, textcat_model)
> losses = {}
> optimizer = nlp.begin_training()
-> textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> textcat.update(examples, losses=losses, sgd=optimizer)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `docs` | iterable | A batch of documents to learn from. |
-| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
-| `drop` | float | The dropout rate. |
-| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
-| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| _keyword-only_ | | |
+| `drop` | float | The dropout rate. |
+| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/textcategorizer#set_annotations). |
+| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
+| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
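
For text classification the annotations dict passed to `Example.from_dict` uses a `"cats"` key; a hedged sketch with made-up labels:

```python
train_data = [("This made my day", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})]
examples = [Example.from_dict(nlp.make_doc(text), annots)
            for text, annots in train_data]
losses = {}
textcat.update(examples, sgd=optimizer, losses=losses)
```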

## TextCategorizer.get_loss {#get_loss tag="method"}

@@ -187,8 +187,8 @@ predicted scores.

## TextCategorizer.begin_training {#begin_training tag="method"}

-Initialize the pipe for training, using data examples if available. If no model
-has been initialized yet, the model is added.
+Initialize the pipe for training, using data examples if available. Return an
+[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.

> #### Example
>
@@ -198,12 +198,12 @@ has been initialized yet, the model is added.
> optimizer = textcat.begin_training(pipeline=nlp.pipeline)
> ```

-| Name | Type | Description |
-| --- | --- | --- |
-| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. |
-| `pipeline` | list | Optional list of pipeline components that this component is part of. |
-| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`TextCategorizer`](/api/textcategorizer#create_optimizer) if not set. |
-| **RETURNS** | callable | An optimizer. |
+| Name | Type | Description |
+| --- | --- | --- |
+| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `pipeline` | `List[(str, callable)]` | Optional list of pipeline components that this component is part of. |
+| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/textcategorizer#create_optimizer) if not set. |
+| **RETURNS** | `Optimizer` | An optimizer. |

## TextCategorizer.create_optimizer {#create_optimizer tag="method"}

@@ -719,8 +719,7 @@ vary on each step.
> ```python
> batches = minibatch(train_data)
> for batch in batches:
->     texts, annotations = zip(*batch)
->     nlp.update(texts, annotations)
+>     nlp.update(batch)
> ```
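
In a full loop this composes with the new `Example`-based `update`; a sketch assuming `train_examples` is a list of `Example` objects and `optimizer` comes from `nlp.begin_training()`:

```python
from spacy.util import minibatch

losses = {}
for batch in minibatch(train_examples, size=8):
    # Each batch is a list of Example objects, passed straight to update
    nlp.update(batch, sgd=optimizer, losses=losses)
```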

| Name | Type | Description |
@@ -45,10 +45,11 @@ an **annotated document**. It also orchestrates training and serialization.

### Other classes {#architecture-other}

-| Name | Description |
-| --- | --- |
-| [`Vocab`](/api/vocab) | A lookup table for the vocabulary that allows you to access `Lexeme` objects. |
-| [`StringStore`](/api/stringstore) | Map strings to and from hash values. |
-| [`Vectors`](/api/vectors) | Container class for vector data keyed by string. |
-| [`GoldParse`](/api/goldparse) | Collection for training annotations. |
-| [`GoldCorpus`](/api/goldcorpus) | An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER. |
+| Name | Description |
+| --- | --- |
+| [`Vocab`](/api/vocab) | A lookup table for the vocabulary that allows you to access `Lexeme` objects. |
+| [`StringStore`](/api/stringstore) | Map strings to and from hash values. |
+| [`Vectors`](/api/vectors) | Container class for vector data keyed by string. |
+| [`Example`](/api/example) | Collection for training annotations. |

@@ -633,8 +633,9 @@ for ent in doc.ents:
### Train and update neural network models {#lightning-tour-training}

```python
-import spacy
import random
+import spacy
+from spacy.gold import Example

nlp = spacy.load("en_core_web_sm")
train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})]

@@ -644,7 +645,9 @@ with nlp.select_pipes(enable="ner"):
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
-            nlp.update([text], [annotations], sgd=optimizer)
+            doc = nlp.make_doc(text)
+            example = Example.from_dict(doc, annotations)
+            nlp.update([example], sgd=optimizer)
nlp.to_disk("/model")
```

@@ -375,45 +375,71 @@ mattis pretium.

## Internal training API {#api}

+<!-- TODO: rewrite for new nlp.update / example logic -->
+
+The [`Example`](/api/example) object contains annotated training data, also
+called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
+that will hold the predictions, and another `Doc` object that holds the
+gold-standard annotations. Here's an example of a simple `Example` for
+part-of-speech tags:
-The [`GoldParse`](/api/goldparse) object collects the annotated training
-examples, also called the **gold standard**. It's initialized with the
-[`Doc`](/api/doc) object it refers to, and keyword arguments specifying the
-annotations, like `tags` or `entities`. Its job is to encode the annotations,
-keep them aligned and create the C-level data structures required for efficient
-access. Here's an example of a simple `GoldParse` for part-of-speech tags:

+```python
+words = ["I", "like", "stuff"]
+predicted = Doc(vocab, words=words)
+# create the reference Doc with gold-standard TAG annotations
+tags = ["NOUN", "VERB", "NOUN"]
+tag_ids = [vocab.strings.add(tag) for tag in tags]
+reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
+example = Example(predicted, reference)
+```
+
+Alternatively, the `reference` `Doc` with the gold-standard annotations can be
+created from a dictionary with keyword arguments specifying the annotations,
+like `tags` or `entities`:
+
+```python
+words = ["I", "like", "stuff"]
+tags = ["NOUN", "VERB", "NOUN"]
+predicted = Doc(en_vocab, words=words)
+example = Example.from_dict(predicted, {"tags": tags})
+```
+
+Using the `Example` object and its gold-standard annotations, the model can be
+updated to learn a sentence of three words with their assigned part-of-speech
+tags.
+
+<!-- TODO: is this the best place for the tag_map explanation ? -->
+
+The [tag map](/usage/adding-languages#tag-map) is part of the vocabulary and
+defines the annotation scheme. If you're training a new language model, this
+will let you map the tags present in the treebank you train on to spaCy's tag
+scheme:

```python
vocab = Vocab(tag_map={"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}})
-doc = Doc(vocab, words=["I", "like", "stuff"])
-gold = GoldParse(doc, tags=["N", "V", "N"])
```

-Using the `Doc` and its gold-standard annotations, the model can be updated to
-learn a sentence of three words with their assigned part-of-speech tags. The
-[tag map](/usage/adding-languages#tag-map) is part of the vocabulary and defines
-the annotation scheme. If you're training a new language model, this will let
-you map the tags present in the treebank you train on to spaCy's tag scheme.
+Another example shows how to define gold-standard named entities:

```python
-doc = Doc(Vocab(), words=["Facebook", "released", "React", "in", "2014"])
-gold = GoldParse(doc, entities=["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"])
+doc = Doc(vocab, words=["Facebook", "released", "React", "in", "2014"])
+example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```

-The same goes for named entities. The letters added before the labels refer to
-the tags of the [BILUO scheme](/usage/linguistic-features#updating-biluo) – `O`
-is a token outside an entity, `U` an single entity unit, `B` the beginning of an
-entity, `I` a token inside an entity and `L` the last token of an entity.
+The letters added before the labels refer to the tags of the
+[BILUO scheme](/usage/linguistic-features#updating-biluo) – `O` is a token
+outside an entity, `U` a single entity unit, `B` the beginning of an entity,
+`I` a token inside an entity and `L` the last token of an entity.
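
If you have character offsets rather than BILUO tags, they can be converted; a sketch using the offsets-to-BILUO helper (around this version it lived in `spacy.gold` as `biluo_tags_from_offsets`; treat the import path as an assumption):

```python
from spacy.gold import biluo_tags_from_offsets

doc = nlp.make_doc("Facebook released React in 2014")
offsets = [(0, 8, "ORG"), (18, 23, "TECHNOLOGY"), (27, 31, "DATE")]
tags = biluo_tags_from_offsets(doc, offsets)
# tags == ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]
```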

> - **Training data**: The training examples.
> - **Text and label**: The current example.
> - **Doc**: A `Doc` object created from the example text.
-> - **GoldParse**: A `GoldParse` object of the `Doc` and label.
+> - **Example**: An `Example` object holding both predictions and gold-standard
+>   annotations.
> - **nlp**: The `nlp` object with the model.
> - **Optimizer**: A function that holds state between updates.
> - **Update**: Update the model's weights.

+<!-- TODO: update graphic & related text -->

![The training loop](../images/training-loop.svg)

Of course, it's not enough to only show a model a single example once.

@@ -427,32 +453,33 @@ dropout means that each feature or internal representation has a 1/4 likelihood
of being dropped.

> - [`begin_training`](/api/language#begin_training): Start the training and
->   return an optimizer function to update the model's weights. Can take an
->   optional function converting the training data to spaCy's training format.
-> - [`update`](/api/language#update): Update the model with the training example
->   and gold data.
+>   return an [`Optimizer`](https://thinc.ai/docs/api-optimizers) object to
+>   update the model's weights.
+> - [`update`](/api/language#update): Update the model with the training
+>   examples.
> - [`to_disk`](/api/language#to_disk): Save the updated model to a directory.

```python
### Example training loop
-optimizer = nlp.begin_training(get_data)
+optimizer = nlp.begin_training()
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
-        gold = GoldParse(doc, entities=entity_offsets)
-        nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
+        example = Example.from_dict(doc, {"entities": entity_offsets})
+        nlp.update([example], sgd=optimizer)
nlp.to_disk("/model")
```

The [`nlp.update`](/api/language#update) method takes the following arguments:

-| Name | Description |
-| --- | --- |
-| `docs` | [`Doc`](/api/doc) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a sequence of raw texts. |
-| `golds` | [`GoldParse`](/api/goldparse) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a dictionary containing the annotations. |
-| `drop` | Dropout rate. Makes it harder for the model to just memorize the data. |
-| `sgd` | An optimizer, i.e. a callable to update the model's weights. If not set, spaCy will create a new one and save it for further use. |
+| Name | Description |
+| --- | --- |
+| `examples` | [`Example`](/api/example) objects. The `update` method takes a sequence of them, so you can batch up your training examples. |
+| `drop` | Dropout rate. Makes it harder for the model to just memorize the data. |
+| `sgd` | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object, which updates the model's weights. If not set, spaCy will create a new one and save it for further use. |

+<!-- TODO: DocBin format ? -->

Instead of writing your own training loop, you can also use the built-in
[`train`](/api/cli#train) command, which expects data in spaCy's