diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md index 0980dc2e0..9c9a60490 100644 --- a/website/docs/api/dependencyparser.md +++ b/website/docs/api/dependencyparser.md @@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and > > # Construction from class > from spacy.pipeline import DependencyParser -> parser = DependencyParser(nlp.vocab) +> parser = DependencyParser(nlp.vocab, parser_model) > parser.from_disk("/path/to/model") > ``` -| Name | Type | Description | -| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The shared vocabulary. | -| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. | -| `**cfg` | - | Configuration parameters. | -| **RETURNS** | `DependencyParser` | The newly constructed object. | +| Name | Type | Description | +| ----------- | ------------------ | ------------------------------------------------------------------------------- | +| `vocab` | `Vocab` | The shared vocabulary. | +| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. | +| `**cfg` | - | Configuration parameters. | +| **RETURNS** | `DependencyParser` | The newly constructed object. | ## DependencyParser.\_\_call\_\_ {#call tag="method"} @@ -126,26 +126,28 @@ Modify a batch of documents, using pre-computed scores. ## DependencyParser.update {#update tag="method"} -Learn from a batch of documents and gold-standard information, updating the -pipe's model. Delegates to [`predict`](/api/dependencyparser#predict) and +Learn from a batch of [`Example`](/api/example) objects, updating the pipe's +model. Delegates to [`predict`](/api/dependencyparser#predict) and [`get_loss`](/api/dependencyparser#get_loss). > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) +> parser = DependencyParser(nlp.vocab, parser_model) > losses = {} > optimizer = nlp.begin_training() -> parser.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer) +> parser.update(examples, losses=losses, sgd=optimizer) > ``` -| Name | Type | Description | -| -------- | -------- | -------------------------------------------------------------------------------------------- | -| `docs` | iterable | A batch of documents to learn from. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `drop` | float | The dropout rate. | -| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. | -| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. | +| Name | Type | Description | +| ----------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. | +| _keyword-only_ | | | +| `drop` | float | The dropout rate. | +| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/dependencyparser#set_annotations). 
| +| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. | +| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. | +| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. | ## DependencyParser.get_loss {#get_loss tag="method"} @@ -169,8 +171,8 @@ predicted scores. ## DependencyParser.begin_training {#begin_training tag="method"} -Initialize the pipe for training, using data examples if available. If no model -has been initialized yet, the model is added. +Initialize the pipe for training, using data examples if available. Return an +[`Optimizer`](https://thinc.ai/docs/api-optimizers) object. > #### Example > @@ -180,16 +182,17 @@ has been initialized yet, the model is added. > optimizer = parser.begin_training(pipeline=nlp.pipeline) > ``` -| Name | Type | Description | -| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. | -| `pipeline` | list | Optional list of pipeline components that this component is part of. | -| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`DependencyParser`](/api/dependencyparser#create_optimizer) if not set. | -| **RETURNS** | callable | An optimizer. | +| Name | Type | Description | +| -------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. | +| `pipeline` | `List[(str, callable)]` | Optional list of pipeline components that this component is part of. | +| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/dependencyparser#create_optimizer) if not set. | +| **RETURNS** | `Optimizer` | An optimizer. | ## DependencyParser.create_optimizer {#create_optimizer tag="method"} -Create an optimizer for the pipeline component. +Create an [`Optimizer`](https://thinc.ai/docs/api-optimizers) for the pipeline +component. > #### Example > @@ -198,9 +201,9 @@ Create an optimizer for the pipeline component. > optimizer = parser.create_optimizer() > ``` -| Name | Type | Description | -| ----------- | -------- | -------------- | -| **RETURNS** | callable | The optimizer. | +| Name | Type | Description | +| ----------- | ----------- | -------------- | +| **RETURNS** | `Optimizer` | The optimizer. 
|

## DependencyParser.use_params {#use_params tag="method, contextmanager"}

diff --git a/website/docs/api/entitylinker.md b/website/docs/api/entitylinker.md
index d7f25ed56..1e6a56a48 100644
--- a/website/docs/api/entitylinker.md
+++ b/website/docs/api/entitylinker.md
@@ -38,18 +38,17 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import EntityLinker
-> entity_linker = EntityLinker(nlp.vocab)
+> entity_linker = EntityLinker(nlp.vocab, nel_model)
> entity_linker.from_disk("/path/to/model")
> ```

-| Name           | Type                          | Description                                                                                                                                            |
-| -------------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `vocab`        | `Vocab`                       | The shared vocabulary.                                                                                                                                 |
-| `model`        | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `hidden_width` | int                           | Width of the hidden layer of the entity linking model, defaults to `128`.                                                                              |
-| `incl_prior`   | bool                          | Whether or not to include prior probabilities in the model. Defaults to `True`.                                                                        |
-| `incl_context` | bool                          | Whether or not to include the local context in the model (if not: only prior probabilities are used). Defaults to `True`.                              |
-| **RETURNS**    | `EntityLinker`                | The newly constructed object.                                                                                                                          |
+| Name        | Type           | Description                                                                      |
+| ----------- | -------------- | -------------------------------------------------------------------------------- |
+| `vocab`     | `Vocab`        | The shared vocabulary.                                                           |
+| `model`     | `Model`        | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.  |
+| `**cfg`     | -              | Configuration parameters.                                                        |
+| **RETURNS** | `EntityLinker` | The newly constructed object.                                                    |

## EntityLinker.\_\_call\_\_ {#call tag="method"}

@@ -134,7 +133,7 @@ entities.

## EntityLinker.update {#update tag="method"}

-Learn from a batch of documents and gold-standard information, updating both the
+Learn from a batch of [`Example`](/api/example) objects, updating both the
pipe's entity linking model and context encoder. Delegates to
[`predict`](/api/entitylinker#predict) and
[`get_loss`](/api/entitylinker#get_loss).

@@ -142,19 +141,21 @@ pipe's entity linking model and context encoder. Delegates to
> #### Example
>
> ```python
-> entity_linker = EntityLinker(nlp.vocab)
+> entity_linker = EntityLinker(nlp.vocab, nel_model)
> losses = {}
> optimizer = nlp.begin_training()
-> entity_linker.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> entity_linker.update(examples, losses=losses, sgd=optimizer)
> ```

-| Name     | Type     | Description                                                                                              |
-| -------- | -------- | -------------------------------------------------------------------------------------------------------- |
-| `docs`   | iterable | A batch of documents to learn from.                                                                      |
-| `golds`  | iterable | The gold-standard data. Must have the same length as `docs`.                                             |
-| `drop`   | float    | The dropout rate, used both for the EL model and the context encoder.                                    |
-| `sgd`    | callable | The optimizer for the EL model. Should take two arguments `weights` and `gradient`, and an optional ID.  |
-| `losses` | dict     | Optional record of the loss during training. The value keyed by the model's name is updated.
| +| Name | Type | Description | +| ----------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. | +| _keyword-only_ | | | +| `drop` | float | The dropout rate. | +| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entitylinker#set_annotations). | +| `sgd` | `Optimizer` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. | +| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. | +| **RETURNS** | float | The loss from this batch. | ## EntityLinker.get_loss {#get_loss tag="method"} @@ -195,9 +196,9 @@ identifiers. ## EntityLinker.begin_training {#begin_training tag="method"} -Initialize the pipe for training, using data examples if available. If no model -has been initialized yet, the model is added. Before calling this method, a -knowledge base should have been defined with +Initialize the pipe for training, using data examples if available. Return an +[`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Before calling this +method, a knowledge base should have been defined with [`set_kb`](/api/entitylinker#set_kb). > #### Example @@ -209,12 +210,12 @@ knowledge base should have been defined with > optimizer = entity_linker.begin_training(pipeline=nlp.pipeline) > ``` -| Name | Type | Description | -| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. | -| `pipeline` | list | Optional list of pipeline components that this component is part of. | -| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`EntityLinker`](/api/entitylinker#create_optimizer) if not set. | -| **RETURNS** | callable | An optimizer. | +| Name | Type | Description | +| -------------- | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. | +| `pipeline` | `List[(str, callable)]` | Optional list of pipeline components that this component is part of. | +| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/entitylinker#create_optimizer) if not set. | +| **RETURNS** | `Optimizer` | An optimizer. 
|

## EntityLinker.create_optimizer {#create_optimizer tag="method"}

diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md
index c9a81f6f1..9a9b0926b 100644
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and
>
> # Construction from class
> from spacy.pipeline import EntityRecognizer
-> ner = EntityRecognizer(nlp.vocab)
+> ner = EntityRecognizer(nlp.vocab, ner_model)
> ner.from_disk("/path/to/model")
> ```

-| Name        | Type                          | Description                                                                                                                                            |
-| ----------- | ----------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab`     | `Vocab`                       | The shared vocabulary.                                                                                                                                 |
-| `model`     | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg`     | -                             | Configuration parameters.                                                                                                                              |
-| **RETURNS** | `EntityRecognizer`            | The newly constructed object.                                                                                                                          |
+| Name        | Type               | Description                                                                      |
+| ----------- | ------------------ | --------------------------------------------------------------------------------- |
+| `vocab`     | `Vocab`            | The shared vocabulary.                                                           |
+| `model`     | `Model`            | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.  |
+| `**cfg`     | -                  | Configuration parameters.                                                        |
+| **RETURNS** | `EntityRecognizer` | The newly constructed object.                                                    |

## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

@@ -102,10 +102,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.

> #### Example
>
> ```python
> scores, tensors = ner.predict([doc1, doc2])
> ```

-| Name        | Type     | Description                                                                                                                                                                                                                          |
-| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `docs`      | iterable | The documents to predict.                                                                                                                                                                                                            |
-| **RETURNS** | list     | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal).                                                                                                                           |
+| Name        | Type     | Description                                                                                                  |
+| ----------- | -------- | ------------------------------------------------------------------------------------------------------------ |
+| `docs`      | iterable | The documents to predict.                                                                                    |
+| **RETURNS** | list     | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal).   |

## EntityRecognizer.set_annotations {#set_annotations tag="method"}

@@ -127,26 +127,28 @@ Modify a batch of documents, using pre-computed scores.

## EntityRecognizer.update {#update tag="method"}

-Learn from a batch of documents and gold-standard information, updating the
-pipe's model. Delegates to [`predict`](/api/entityrecognizer#predict) and
+Learn from a batch of [`Example`](/api/example) objects, updating the pipe's
+model. Delegates to [`predict`](/api/entityrecognizer#predict) and
[`get_loss`](/api/entityrecognizer#get_loss).
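+
+The batch of `examples` can be built up front with
+[`Example.from_dict`](/api/example#from_dict). A minimal sketch, assuming
+`train_data` holds `(text, entity_offsets)` pairs (the variable name is
+illustrative, not part of the API):
+
+```python
+from spacy.gold import Example
+
+# assumed input: (text, entity_offsets) pairs, as in the usage docs
+examples = []
+for text, entity_offsets in train_data:
+    doc = nlp.make_doc(text)
+    examples.append(Example.from_dict(doc, {"entities": entity_offsets}))
+```
+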
> #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) +> ner = EntityRecognizer(nlp.vocab, ner_model) > losses = {} > optimizer = nlp.begin_training() -> ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer) +> ner.update(examples, losses=losses, sgd=optimizer) > ``` -| Name | Type | Description | -| -------- | -------- | -------------------------------------------------------------------------------------------- | -| `docs` | iterable | A batch of documents to learn from. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `drop` | float | The dropout rate. | -| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. | -| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. | +| Name | Type | Description | +| ----------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. | +| _keyword-only_ | | | +| `drop` | float | The dropout rate. | +| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entityrecognizer#set_annotations). | +| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. | +| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. | +| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. | ## EntityRecognizer.get_loss {#get_loss tag="method"} @@ -170,8 +172,8 @@ predicted scores. ## EntityRecognizer.begin_training {#begin_training tag="method"} -Initialize the pipe for training, using data examples if available. If no model -has been initialized yet, the model is added. +Initialize the pipe for training, using data examples if available. Return an +[`Optimizer`](https://thinc.ai/docs/api-optimizers) object. > #### Example > @@ -181,12 +183,14 @@ has been initialized yet, the model is added. > optimizer = ner.begin_training(pipeline=nlp.pipeline) > ``` -| Name | Type | Description | -| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. | -| `pipeline` | list | Optional list of pipeline components that this component is part of. | -| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`EntityRecognizer`](/api/entityrecognizer#create_optimizer) if not set. | -| **RETURNS** | callable | An optimizer. | +| Name | Type | Description | +| -------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. | +| `pipeline` | `List[(str, callable)]` | Optional list of pipeline components that this component is part of. 
|
+| `sgd`          | `Optimizer`             | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/entityrecognizer#create_optimizer) if not set. |
+| **RETURNS**    | `Optimizer`             | An optimizer.                                                                                                                                                         |

## EntityRecognizer.create_optimizer {#create_optimizer tag="method"}

diff --git a/website/docs/api/example.md b/website/docs/api/example.md
index 0f1ed618d..ca1b762c1 100644
--- a/website/docs/api/example.md
+++ b/website/docs/api/example.md
@@ -141,11 +141,12 @@ of the `reference` document.
> assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]
> ```

-Get the aligned view of a certain token attribute, denoted by its int ID or string name.
+Get the aligned view of a certain token attribute, denoted by its int ID or
+string name.

| Name        | Type                       | Description                                                         | Default |
| ----------- | -------------------------- | ------------------------------------------------------------------ | ------- |
-| `field`     | int or str                 | Attribute ID or string name                                         |         |
+| `field`     | int or str                 | Attribute ID or string name                                         |         |
| `as_string` | bool                       | Whether or not to return the list of values as strings.             | `False` |
| **RETURNS** | `List[int]` or `List[str]` | List of integer values, or string values if `as_string` is `True`. |         |

@@ -176,7 +177,7 @@ Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).
> ```python
> words = ["Mrs", "Smith", "flew", "to", "New York"]
> doc = Doc(en_vocab, words=words)
-> entities = [(0, len("Mrs Smith"), "PERSON"), (18, 18 + len("New York"), "LOC")]
+> entities = [(0, 9, "PERSON"), (18, 26, "LOC")]
> gold_words = ["Mrs Smith", "flew", "to", "New", "York"]
> example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
> ner_tags = example.get_aligned_ner()

@@ -197,7 +198,7 @@ Get the aligned view of the NER
> ```python
> words = ["Mr and Mrs Smith", "flew", "to", "New York"]
> doc = Doc(en_vocab, words=words)
-> entities = [(0, len("Mr and Mrs Smith"), "PERSON")]
+> entities = [(0, 16, "PERSON")]
> tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
> ents_ref = example.reference.ents

@@ -220,15 +221,12 @@ in `example.predicted`.
> #### Example
>
> ```python
-> ruler = EntityRuler(nlp)
-> patterns = [{"label": "PERSON", "pattern": "Mr and Mrs Smith"}]
-> ruler.add_patterns(patterns)
-> nlp.add_pipe(ruler)
+> nlp.add_pipe(my_ner)
> doc = nlp("Mr and Mrs Smith flew to New York")
-> entities = [(0, len("Mr and Mrs Smith"), "PERSON")]
> tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "New York"]
-> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
+> example = Example.from_dict(doc, {"words": tokens_ref})
> ents_pred = example.predicted.ents
+> # Assume the NER model has found "Mr and Mrs Smith" as a named entity
> assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4)]
> ents_x2y = example.get_aligned_spans_x2y(ents_pred)
> assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]

diff --git a/website/docs/api/language.md b/website/docs/api/language.md
index e835168b7..f6631b1db 100644
--- a/website/docs/api/language.md
+++ b/website/docs/api/language.md
@@ -87,18 +87,18 @@ Update the models in the pipeline.
> ```python
> for raw_text, entity_offsets in train_data:
>     doc = nlp.make_doc(raw_text)
->     gold = GoldParse(doc, entities=entity_offsets)
->     nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
+>     example = Example.from_dict(doc, {"entities": entity_offsets})
+>     nlp.update([example], sgd=optimizer)
> ```
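+
+Batching can be handled with [`minibatch`](/api/top-level#minibatch), as in the
+sketch below. It assumes `train_examples` is a prepared list of `Example`
+objects and `optimizer` comes from `nlp.begin_training()`:
+
+```python
+from spacy.util import minibatch
+
+# `train_examples` is an assumed name for a list of Example objects
+losses = {}
+for batch in minibatch(train_examples, size=8):
+    nlp.update(batch, losses=losses, sgd=optimizer)
+```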

-| Name                | Type     | Description                                                                                                                                                                                                          |
-| ------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `docs`              | iterable | A batch of `Doc` objects or strings. If strings, a `Doc` object will be created from the text.                                                                                                                       |
-| `golds`             | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init).   |
-| `drop`              | float    | The dropout rate.                                                                                                                                                                                                    |
-| `sgd`               | callable | An optimizer.                                                                                                                                                                                                        |
-| `losses`            | dict     | Dictionary to update with the loss, keyed by pipeline component.                                                                                                                                                     |
-| `component_cfg` 2.1 | dict     | Config parameters for specific pipeline components, keyed by component name.                                                                                                                                         |
+| Name                | Type                | Description                                                                   |
+| ------------------- | ------------------- | ------------------------------------------------------------------------------- |
+| `examples`          | `Iterable[Example]` | A batch of `Example` objects to learn from.                                   |
+| _keyword-only_      |                     |                                                                               |
+| `drop`              | float               | The dropout rate.                                                             |
+| `sgd`               | `Optimizer`         | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object.                |
+| `losses`            | `Dict[str, float]`  | Dictionary to update with the loss, keyed by pipeline component.              |
+| `component_cfg` 2.1 | `Dict[str, Dict]`   | Config parameters for specific pipeline components, keyed by component name.  |

## Language.evaluate {#evaluate tag="method"}

Evaluate a model's pipeline components.

> #### Example
>
> ```python
-> scorer = nlp.evaluate(docs_golds, verbose=True)
+> scorer = nlp.evaluate(examples, verbose=True)
> print(scorer.scores)
> ```

-| Name                | Type     | Description                                                                                                                                                                                                                                                                                  |
-| ------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `docs_golds`        | iterable | Tuples of `Doc` and `GoldParse` objects, such that the `Doc` objects contain the predictions and the `GoldParse` objects the correct annotations. Alternatively, `(text, annotations)` tuples of raw text and a dict (see [simple training style](/usage/training#training-simple-style)). |
-| `verbose`           | bool     | Print debugging information.                                                                                                                                                                                                                                                                 |
-| `batch_size`        | int      | The batch size to use.                                                                                                                                                                                                                                                                       |
-| `scorer`            | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created.                                                                                                                                                                                                        |
-| `component_cfg` 2.1 | dict     | Config parameters for specific pipeline components, keyed by component name.                                                                                                                                                                                                                |
-| **RETURNS**         | Scorer   | The scorer containing the evaluation scores.                                                                                                                                                                                                                                                 |
+| Name                | Type                | Description                                                                             |
+| ------------------- | ------------------- | ----------------------------------------------------------------------------------------- |
+| `examples`          | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to evaluate.
| +| `verbose` | bool | Print debugging information. | +| `batch_size` | int | The batch size to use. | +| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. | +| `component_cfg` 2.1 | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. | +| **RETURNS** | Scorer | The scorer containing the evaluation scores. | ## Language.begin_training {#begin_training tag="method"} -Allocate models, pre-process training data and acquire an optimizer. +Allocate models, pre-process training data and acquire an +[`Optimizer`](https://thinc.ai/docs/api-optimizers). > #### Example > > ```python -> optimizer = nlp.begin_training(gold_tuples) +> optimizer = nlp.begin_training(get_examples) > ``` -| Name | Type | Description | -| -------------------------------------------- | -------- | ---------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Gold-standard training data. | -| `component_cfg` 2.1 | dict | Config parameters for specific pipeline components, keyed by component name. | -| `**cfg` | - | Config parameters (sent to all components). | -| **RETURNS** | callable | An optimizer. | +| Name | Type | Description | +| -------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------ | +| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. | +| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. If not set, a default one will be created. | +| `component_cfg` 2.1 | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. | +| `**cfg` | - | Config parameters (sent to all components). | +| **RETURNS** | `Optimizer` | An optimizer. | ## Language.use_params {#use_params tag="contextmanager, method"} @@ -155,16 +157,6 @@ their original weights after the block. | `params` | dict | A dictionary of parameters keyed by model ID. | | `**cfg` | - | Config parameters. | -## Language.preprocess_gold {#preprocess_gold tag="method"} - -Can be called before training to pre-process gold data. By default, it handles -nonprojectivity and adds missing tags to the tag map. - -| Name | Type | Description | -| ------------ | -------- | ---------------------------------------- | -| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects. | -| **YIELDS** | tuple | Tuples of `Doc` and `GoldParse` objects. | - ## Language.create_pipe {#create_pipe tag="method" new="2"} Create a pipeline component from a factory. diff --git a/website/docs/api/scorer.md b/website/docs/api/scorer.md index 8ad735e0d..cd720d26c 100644 --- a/website/docs/api/scorer.md +++ b/website/docs/api/scorer.md @@ -27,22 +27,20 @@ Create a new `Scorer`. ## Scorer.score {#score tag="method"} -Update the evaluation scores from a single [`Doc`](/api/doc) / -[`GoldParse`](/api/goldparse) pair. +Update the evaluation scores from a single [`Example`](/api/example) object. > #### Example > > ```python > scorer = Scorer() -> scorer.score(doc, gold) +> scorer.score(example) > ``` -| Name | Type | Description | -| -------------- | ----------- | -------------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The predicted annotations. 
| -| `gold` | `GoldParse` | The correct annotations. | -| `verbose` | bool | Print debugging information. | -| `punct_labels` | tuple | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. | +| Name | Type | Description | +| -------------- | --------- | -------------------------------------------------------------------------------------------------------------------- | +| `example` | `Example` | The `Example` object holding both the predictions and the correct gold-standard annotations. | +| `verbose` | bool | Print debugging information. | +| `punct_labels` | tuple | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. | ## Properties diff --git a/website/docs/api/tagger.md b/website/docs/api/tagger.md index f14da3ac5..1aa5fb327 100644 --- a/website/docs/api/tagger.md +++ b/website/docs/api/tagger.md @@ -33,16 +33,16 @@ shortcut for this and instantiate the component using its string name and > > # Construction from class > from spacy.pipeline import Tagger -> tagger = Tagger(nlp.vocab) +> tagger = Tagger(nlp.vocab, tagger_model) > tagger.from_disk("/path/to/model") > ``` -| Name | Type | Description | -| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The shared vocabulary. | -| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. | -| `**cfg` | - | Configuration parameters. | -| **RETURNS** | `Tagger` | The newly constructed object. | +| Name | Type | Description | +| ----------- | -------- | ------------------------------------------------------------------------------- | +| `vocab` | `Vocab` | The shared vocabulary. | +| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. | +| `**cfg` | - | Configuration parameters. | +| **RETURNS** | `Tagger` | The newly constructed object. | ## Tagger.\_\_call\_\_ {#call tag="method"} @@ -132,19 +132,20 @@ pipe's model. Delegates to [`predict`](/api/tagger#predict) and > #### Example > > ```python -> tagger = Tagger(nlp.vocab) +> tagger = Tagger(nlp.vocab, tagger_model) > losses = {} > optimizer = nlp.begin_training() -> tagger.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer) +> tagger.update(examples, losses=losses, sgd=optimizer) > ``` -| Name | Type | Description | -| -------- | -------- | -------------------------------------------------------------------------------------------- | -| `docs` | iterable | A batch of documents to learn from. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `drop` | float | The dropout rate. | -| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. | -| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. | +| Name | Type | Description | +| ----------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | +| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. 
| +| _keyword-only_ | | | +| `drop` | float | The dropout rate. | +| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/tagger#set_annotations). | +| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. | +| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. | ## Tagger.get_loss {#get_loss tag="method"} @@ -168,8 +169,8 @@ predicted scores. ## Tagger.begin_training {#begin_training tag="method"} -Initialize the pipe for training, using data examples if available. If no model -has been initialized yet, the model is added. +Initialize the pipe for training, using data examples if available. Return an +[`Optimizer`](https://thinc.ai/docs/api-optimizers) object. > #### Example > @@ -179,12 +180,12 @@ has been initialized yet, the model is added. > optimizer = tagger.begin_training(pipeline=nlp.pipeline) > ``` -| Name | Type | Description | -| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. | -| `pipeline` | list | Optional list of pipeline components that this component is part of. | -| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`Tagger`](/api/tagger#create_optimizer) if not set. | -| **RETURNS** | callable | An optimizer. | +| Name | Type | Description | +| -------------- | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. | +| `pipeline` | `List[(str, callable)]` | Optional list of pipeline components that this component is part of. | +| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/tagger#create_optimizer) if not set. | +| **RETURNS** | `Optimizer` | An optimizer. | ## Tagger.create_optimizer {#create_optimizer tag="method"} diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md index dc1c083ac..c0c3e15a0 100644 --- a/website/docs/api/textcategorizer.md +++ b/website/docs/api/textcategorizer.md @@ -35,17 +35,16 @@ shortcut for this and instantiate the component using its string name and > > # Construction from class > from spacy.pipeline import TextCategorizer -> textcat = TextCategorizer(nlp.vocab) +> textcat = TextCategorizer(nlp.vocab, textcat_model) > textcat.from_disk("/path/to/model") > ``` -| Name | Type | Description | -| ------------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The shared vocabulary. | -| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. | -| `exclusive_classes` | bool | Make categories mutually exclusive. Defaults to `False`. 
| -| `architecture` | str | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. | -| **RETURNS** | `TextCategorizer` | The newly constructed object. | +| Name | Type | Description | +| ----------- | ----------------- | ------------------------------------------------------------------------------- | +| `vocab` | `Vocab` | The shared vocabulary. | +| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. | +| `**cfg` | - | Configuration parameters. | +| **RETURNS** | `TextCategorizer` | The newly constructed object. | ### Architectures {#architectures new="2.1"} @@ -151,19 +150,20 @@ pipe's model. Delegates to [`predict`](/api/textcategorizer#predict) and > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) +> textcat = TextCategorizer(nlp.vocab, textcat_model) > losses = {} > optimizer = nlp.begin_training() -> textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer) +> textcat.update(examples, losses=losses, sgd=optimizer) > ``` -| Name | Type | Description | -| -------- | -------- | -------------------------------------------------------------------------------------------- | -| `docs` | iterable | A batch of documents to learn from. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `drop` | float | The dropout rate. | -| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. | -| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. | +| Name | Type | Description | +| ----------------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. | +| _keyword-only_ | | | +| `drop` | float | The dropout rate. | +| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/textcategorizer#set_annotations). | +| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. | +| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. | ## TextCategorizer.get_loss {#get_loss tag="method"} @@ -187,8 +187,8 @@ predicted scores. ## TextCategorizer.begin_training {#begin_training tag="method"} -Initialize the pipe for training, using data examples if available. If no model -has been initialized yet, the model is added. +Initialize the pipe for training, using data examples if available. Return an +[`Optimizer`](https://thinc.ai/docs/api-optimizers) object. > #### Example > @@ -198,12 +198,12 @@ has been initialized yet, the model is added. > optimizer = textcat.begin_training(pipeline=nlp.pipeline) > ``` -| Name | Type | Description | -| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. | -| `pipeline` | list | Optional list of pipeline components that this component is part of. | -| `sgd` | callable | An optional optimizer. 
Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`TextCategorizer`](/api/textcategorizer#create_optimizer) if not set. | -| **RETURNS** | callable | An optimizer. | +| Name | Type | Description | +| -------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | `Iterable[Example]` | Optional gold-standard annotations in the form of [`Example`](/api/example) objects. | +| `pipeline` | `List[(str, callable)]` | Optional list of pipeline components that this component is part of. | +| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/textcategorizer#create_optimizer) if not set. | +| **RETURNS** | `Optimizer` | An optimizer. | ## TextCategorizer.create_optimizer {#create_optimizer tag="method"} diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index c8fea6a34..c9c8138e8 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -719,8 +719,7 @@ vary on each step. > ```python > batches = minibatch(train_data) > for batch in batches: -> texts, annotations = zip(*batch) -> nlp.update(texts, annotations) +> nlp.update(batch) > ``` | Name | Type | Description | diff --git a/website/docs/usage/101/_architecture.md b/website/docs/usage/101/_architecture.md index 4363b9b4f..95158b67d 100644 --- a/website/docs/usage/101/_architecture.md +++ b/website/docs/usage/101/_architecture.md @@ -45,10 +45,11 @@ an **annotated document**. It also orchestrates training and serialization. ### Other classes {#architecture-other} -| Name | Description | -| --------------------------------- | ------------------------------------------------------------------------------------------------------------- | -| [`Vocab`](/api/vocab) | A lookup table for the vocabulary that allows you to access `Lexeme` objects. | -| [`StringStore`](/api/stringstore) | Map strings to and from hash values. | -| [`Vectors`](/api/vectors) | Container class for vector data keyed by string. | -| [`GoldParse`](/api/goldparse) | Collection for training annotations. | -| [`GoldCorpus`](/api/goldcorpus) | An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER. | +| Name | Description | +| --------------------------------- | ----------------------------------------------------------------------------- | +| [`Vocab`](/api/vocab) | A lookup table for the vocabulary that allows you to access `Lexeme` objects. | +| [`StringStore`](/api/stringstore) | Map strings to and from hash values. | +| [`Vectors`](/api/vectors) | Container class for vector data keyed by string. | +| [`Example`](/api/example) | Collection for training annotations. 
|

diff --git a/website/docs/usage/spacy-101.md b/website/docs/usage/spacy-101.md
index 245d4ef42..19580dc0f 100644
--- a/website/docs/usage/spacy-101.md
+++ b/website/docs/usage/spacy-101.md
@@ -633,8 +633,9 @@ for ent in doc.ents:
### Train and update neural network models {#lightning-tour-training}

```python
-import spacy
import random
+import spacy
+from spacy.gold import Example

nlp = spacy.load("en_core_web_sm")
train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})]
@@ -644,7 +645,9 @@ with nlp.select_pipes(enable="ner"):
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
-            nlp.update([text], [annotations], sgd=optimizer)
+            doc = nlp.make_doc(text)
+            example = Example.from_dict(doc, annotations)
+            nlp.update([example], sgd=optimizer)
nlp.to_disk("/model")
```

diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index fd755c58b..51282c2ab 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -375,45 +375,71 @@ mattis pretium.

## Internal training API {#api}

-

+The [`Example`](/api/example) object contains annotated training data, also
+called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
+that will hold the predictions, and another `Doc` object that holds the
+gold-standard annotations. Here's an example of a simple `Example` for
+part-of-speech tags:

-The [`GoldParse`](/api/goldparse) object collects the annotated training
-examples, also called the **gold standard**. It's initialized with the
-[`Doc`](/api/doc) object it refers to, and keyword arguments specifying the
-annotations, like `tags` or `entities`. Its job is to encode the annotations,
-keep them aligned and create the C-level data structures required for efficient
-access. Here's an example of a simple `GoldParse` for part-of-speech tags:

+```python
+words = ["I", "like", "stuff"]
+predicted = Doc(vocab, words=words)
+# create the reference Doc with gold-standard TAG annotations
+tags = ["NOUN", "VERB", "NOUN"]
+tag_ids = [vocab.strings.add(tag) for tag in tags]
+reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
+example = Example(predicted, reference)
+```
+
+Alternatively, the `reference` `Doc` with the gold-standard annotations can be
+created from a dictionary with keyword arguments specifying the annotations,
+like `tags` or `entities`:
+
+```python
+words = ["I", "like", "stuff"]
+tags = ["NOUN", "VERB", "NOUN"]
+predicted = Doc(vocab, words=words)
+example = Example.from_dict(predicted, {"tags": tags})
+```
+
+Using the `Example` object and its gold-standard annotations, the model can be
+updated to learn a sentence of three words with their assigned part-of-speech
+tags.
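+
+A minimal sketch of such an update step, assuming an `nlp` pipeline that
+contains a tagger and the `example` created above:
+
+```python
+# assumes `nlp` with a tagger; `example` comes from the snippets above
+optimizer = nlp.begin_training()
+losses = {}
+nlp.update([example], losses=losses, sgd=optimizer)
+```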
+
+The [tag map](/usage/adding-languages#tag-map) is part of the vocabulary and
+defines the annotation scheme. If you're training a new language model, this
+will let you map the tags present in the treebank you train on to spaCy's tag
+scheme:

```python
vocab = Vocab(tag_map={"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}})
-doc = Doc(vocab, words=["I", "like", "stuff"])
-gold = GoldParse(doc, tags=["N", "V", "N"])
```

-Using the `Doc` and its gold-standard annotations, the model can be updated to
-learn a sentence of three words with their assigned part-of-speech tags. The
-[tag map](/usage/adding-languages#tag-map) is part of the vocabulary and defines
-the annotation scheme. If you're training a new language model, this will let
-you map the tags present in the treebank you train on to spaCy's tag scheme.
+Another example shows how to define gold-standard named entities:

```python
-doc = Doc(Vocab(), words=["Facebook", "released", "React", "in", "2014"])
-gold = GoldParse(doc, entities=["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"])
+doc = Doc(vocab, words=["Facebook", "released", "React", "in", "2014"])
+example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```

-The same goes for named entities. The letters added before the labels refer to
-the tags of the [BILUO scheme](/usage/linguistic-features#updating-biluo) – `O`
-is a token outside an entity, `U` an single entity unit, `B` the beginning of an
-entity, `I` a token inside an entity and `L` the last token of an entity.
+The letters added before the labels refer to the tags of the
+[BILUO scheme](/usage/linguistic-features#updating-biluo) – `O` is a token
+outside an entity, `U` a single entity unit, `B` the beginning of an entity,
+`I` a token inside an entity and `L` the last token of an entity.

> - **Training data**: The training examples.
> - **Text and label**: The current example.
> - **Doc**: A `Doc` object created from the example text.
-> - **GoldParse**: A `GoldParse` object of the `Doc` and label.
+> - **Example**: An `Example` object holding both predictions and gold-standard
+>   annotations.
> - **nlp**: The `nlp` object with the model.
> - **Optimizer**: A function that holds state between updates.
> - **Update**: Update the model's weights.

+
![The training loop](../images/training-loop.svg)

Of course, it's not enough to only show a model a single example once.
@@ -427,32 +453,33 @@ dropout means that each feature or internal representation has a 1/4 likelihood
of being dropped.

> - [`begin_training`](/api/language#begin_training): Start the training and
->   return an optimizer function to update the model's weights. Can take an
->   optional function converting the training data to spaCy's training format.
-> - [`update`](/api/language#update): Update the model with the training example
->   and gold data.
+>   return an [`Optimizer`](https://thinc.ai/docs/api-optimizers) object to
+>   update the model's weights.
+> - [`update`](/api/language#update): Update the model with the training
+>   examples.
> - [`to_disk`](/api/language#to_disk): Save the updated model to a directory.

```python
### Example training loop
-optimizer = nlp.begin_training(get_data)
+optimizer = nlp.begin_training()
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
-        gold = GoldParse(doc, entities=entity_offsets)
-        nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
+        example = Example.from_dict(doc, {"entities": entity_offsets})
+        nlp.update([example], sgd=optimizer)
nlp.to_disk("/model")
```
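+
+After the loop, held-back data can be scored with
+[`nlp.evaluate`](/api/language#evaluate). A sketch, assuming `dev_examples` is
+a list of `Example` objects that were not used for training:
+
+```python
+# `dev_examples` is an assumed name for held-back Example objects
+scorer = nlp.evaluate(dev_examples, verbose=True)
+print(scorer.scores)
+```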

The [`nlp.update`](/api/language#update) method takes the following arguments:

-| Name    | Description                                                                                                                                                                                                    |
-| ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `docs`  | [`Doc`](/api/doc) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a sequence of raw texts.                             |
-| `golds` | [`GoldParse`](/api/goldparse) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a dictionary containing the annotations. |
-| `drop`  | Dropout rate. Makes it harder for the model to just memorize the data.                                                                                                                                        |
-| `sgd`   | An optimizer, i.e. a callable to update the model's weights. If not set, spaCy will create a new one and save it for further use.                                                                             |
+| Name       | Description                                                                                                                                                             |
+| ---------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | [`Example`](/api/example) objects. The `update` method takes a sequence of them, so you can batch up your training examples.                                            |
+| `drop`     | Dropout rate. Makes it harder for the model to just memorize the data.                                                                                                  |
+| `sgd`      | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object, which updates the model's weights. If not set, spaCy will create a new one and save it for further use.  |

Instead of writing your own training loop, you can also use the built-in
[`train`](/api/cli#train) command, which expects data in spaCy's