remove tensors, fix predict, get_loss and set_annotations

This commit is contained in:
svlandeg 2020-07-08 13:11:54 +02:00
parent 90b100c39f
commit c94279ac1b
7 changed files with 104 additions and 135 deletions

View File

@@ -15,7 +15,7 @@ via the ID `"parser"`.
> ```python
> # Construction via create_pipe with default model
> parser = nlp.create_pipe("parser")
>
> # Construction via create_pipe with custom model
> config = {"model": {"@architectures": "my_parser"}}
> parser = nlp.create_pipe("parser", config)
@@ -112,10 +112,10 @@ Modify a batch of documents, using pre-computed scores.
> parser.set_annotations([doc1, doc2], scores)
> ```
| Name | Type | Description |
| -------- | -------- | ---------------------------------------------------------- |
| `docs` | iterable | The documents to modify. |
| `scores` | - | The scores to set, produced by `DependencyParser.predict`. |
| Name | Type | Description |
| -------- | ------------------- | ---------------------------------------------------------- |
| `docs` | `Iterable[Doc]` | The documents to modify. |
| `scores` | `syntax.StateClass` | The scores to set, produced by `DependencyParser.predict`. |
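
Taken together, `predict` and `set_annotations` split the work of `__call__` into a side-effect-free step and a writing step. A minimal sketch of the two-step flow, assuming `nlp` has a trained parser and `doc1`/`doc2` already exist:

```python
# Two-step flow: compute scores without touching the docs, then write them back.
parser = nlp.get_pipe("parser")
docs = [doc1, doc2]
scores = parser.predict(docs)         # docs are not modified yet
parser.set_annotations(docs, scores)  # heads and dependency labels are set here
print([(token.text, token.dep_) for token in docs[0]])
```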
## DependencyParser.update {#update tag="method"}
@@ -150,16 +150,15 @@ predicted scores.
>
> ```python
> parser = DependencyParser(nlp.vocab)
> scores = parser.predict([doc1, doc2])
> loss, d_loss = parser.get_loss([doc1, doc2], [gold1, gold2], scores)
> scores = parser.predict([eg.predicted for eg in examples])
> loss, d_loss = parser.get_loss(examples, scores)
> ```
| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------ |
| `docs` | iterable | The batch of documents. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `scores` | - | Scores representing the model's predictions. |
| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. |
| Name | Type | Description |
| ----------- | ------------------- | --------------------------------------------------- |
| `examples` | `Iterable[Example]` | The batch of examples. |
| `scores` | `syntax.StateClass` | Scores representing the model's predictions. |
| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. |
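
The new signature takes `Example` objects, which pair a predicted `Doc` with its gold-standard annotations. A hedged sketch of how the pieces fit together, assuming the `Example` class from `spacy.gold` (moved to `spacy.training` in later versions) and purely illustrative training annotations:

```python
from spacy.gold import Example

# Hypothetical gold heads and labels, for illustration only.
train_data = [("She ate the pizza", {"heads": [1, 1, 3, 1],
                                     "deps": ["nsubj", "ROOT", "det", "dobj"]})]
examples = []
for text, annotations in train_data:
    doc = nlp.make_doc(text)
    examples.append(Example.from_dict(doc, annotations))

scores = parser.predict([eg.predicted for eg in examples])
loss, d_scores = parser.get_loss(examples, scores)  # d_scores feeds backprop
```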
## DependencyParser.begin_training {#begin_training tag="method"}
@@ -193,9 +192,9 @@ component.
> optimizer = parser.create_optimizer()
> ```
| Name | Type | Description |
| ----------- | ----------- | -------------- |
| **RETURNS** | `Optimizer` | The optimizer. |
| Name | Type | Description |
| ----------- | ----------- | --------------------------------------------------------------- |
| **RETURNS** | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
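
The returned object is a Thinc `Optimizer` and can be passed straight back into the training methods. A short sketch, assuming `examples` is a batch of `Example` objects prepared as above:

```python
optimizer = parser.create_optimizer()
# Reuse the same optimizer across update calls during training.
losses = parser.update(examples, sgd=optimizer, losses={})
```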
## DependencyParser.use_params {#use_params tag="method, contextmanager"}

View File

@@ -96,13 +96,13 @@ Apply the pipeline's model to a batch of docs, without modifying them.
>
> ```python
> entity_linker = EntityLinker(nlp.vocab)
> kb_ids, tensors = entity_linker.predict([doc1, doc2])
> kb_ids = entity_linker.predict([doc1, doc2])
> ```
| Name | Type | Description |
| ----------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | The documents to predict. |
| **RETURNS** | tuple | A `(kb_ids, tensors)` tuple where `kb_ids` are the model's predicted KB identifiers for the entities in the `docs`, and `tensors` are the token representations used to predict these identifiers. |
| Name | Type | Description |
| ----------- | --------------- | ------------------------------------------------------------ |
| `docs` | `Iterable[Doc]` | The documents to predict. |
| **RETURNS** | `Iterable[str]` | The predicted KB identifiers for the entities in the `docs`. |
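
The returned identifiers line up with the entities across the batch, one ID per entity. A minimal sketch, assuming the docs already have entities set and the linker has a knowledge base loaded:

```python
entity_linker = nlp.get_pipe("entity_linker")
docs = [doc1, doc2]
kb_ids = entity_linker.predict(docs)  # one KB identifier per entity, in order
entities = (ent for doc in docs for ent in doc.ents)
for ent, kb_id in zip(entities, kb_ids):
    print(ent.text, "->", kb_id)
```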
## EntityLinker.set_annotations {#set_annotations tag="method"}
@@ -113,15 +113,14 @@ entities.
>
> ```python
> entity_linker = EntityLinker(nlp.vocab)
> kb_ids, tensors = entity_linker.predict([doc1, doc2])
> entity_linker.set_annotations([doc1, doc2], kb_ids, tensors)
> kb_ids = entity_linker.predict([doc1, doc2])
> entity_linker.set_annotations([doc1, doc2], kb_ids)
> ```
| Name | Type | Description |
| --------- | -------- | ------------------------------------------------------------------------------------------------- |
| `docs` | iterable | The documents to modify. |
| `kb_ids` | iterable | The knowledge base identifiers for the entities in the docs, predicted by `EntityLinker.predict`. |
| `tensors` | iterable | The token representations used to predict the identifiers. |
| Name | Type | Description |
| -------- | --------------- | ------------------------------------------------------------------------------------------------- |
| `docs` | `Iterable[Doc]` | The documents to modify. |
| `kb_ids` | `Iterable[str]` | The knowledge base identifiers for the entities in the docs, predicted by `EntityLinker.predict`. |
## EntityLinker.update {#update tag="method"}
@@ -148,27 +147,6 @@ pipe's entity linking model and context encoder. Delegates to
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
## EntityLinker.get_loss {#get_loss tag="method"}
Find the loss and gradient of loss for the entities in a batch of documents and
their predicted scores.
> #### Example
>
> ```python
> entity_linker = EntityLinker(nlp.vocab)
> kb_ids, tensors = entity_linker.predict(docs)
> loss, d_loss = entity_linker.get_loss(docs, [gold1, gold2], kb_ids, tensors)
> ```
| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------ |
| `docs` | iterable | The batch of documents. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `kb_ids` | iterable | KB identifiers representing the model's predictions. |
| `tensors` | iterable | The token representations used to predict the identifiers. |
| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. |
## EntityLinker.set_kb {#set_kb tag="method"}
Define the knowledge base (KB) used for disambiguating named entities to KB
@@ -219,9 +197,9 @@ Create an optimizer for the pipeline component.
> optimizer = entity_linker.create_optimizer()
> ```
| Name | Type | Description |
| ----------- | -------- | -------------- |
| **RETURNS** | callable | The optimizer. |
| Name | Type | Description |
| ----------- | ----------- | --------------------------------------------------------------- |
| **RETURNS** | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
## EntityLinker.use_params {#use_params tag="method, contextmanager"}

View File

@@ -15,7 +15,7 @@ via the ID `"ner"`.
> ```python
> # Construction via create_pipe
> ner = nlp.create_pipe("ner")
>
> # Construction via create_pipe with custom model
> config = {"model": {"@architectures": "my_ner"}}
> ner = nlp.create_pipe("ner", config)
@@ -92,13 +92,13 @@ Apply the pipeline's model to a batch of docs, without modifying them.
>
> ```python
> ner = EntityRecognizer(nlp.vocab)
> scores, tensors = ner.predict([doc1, doc2])
> scores = ner.predict([doc1, doc2])
> ```
| Name | Type | Description |
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | The documents to predict. |
| **RETURNS** | list | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal). |
| Name | Type | Description |
| ----------- | ------------------ | ---------------------------------------------------------------------------------------------------------- |
| `docs` | `Iterable[Doc]` | The documents to predict. |
| **RETURNS** | `List[StateClass]` | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal). |
## EntityRecognizer.set_annotations {#set_annotations tag="method"}
@@ -108,15 +108,14 @@ Modify a batch of documents, using pre-computed scores.
>
> ```python
> ner = EntityRecognizer(nlp.vocab)
> scores, tensors = ner.predict([doc1, doc2])
> ner.set_annotations([doc1, doc2], scores, tensors)
> scores = ner.predict([doc1, doc2])
> ner.set_annotations([doc1, doc2], scores)
> ```
| Name | Type | Description |
| --------- | -------- | ---------------------------------------------------------- |
| `docs` | iterable | The documents to modify. |
| `scores` | - | The scores to set, produced by `EntityRecognizer.predict`. |
| `tensors` | iterable | The token representations used to predict the scores. |
| Name | Type | Description |
| -------- | ------------------ | ---------------------------------------------------------- |
| `docs` | `Iterable[Doc]` | The documents to modify. |
| `scores` | `List[StateClass]` | The scores to set, produced by `EntityRecognizer.predict`. |
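
As with the parser, the state objects returned by `predict` are only meaningful to the component itself; `set_annotations` is what turns them into visible entities. A sketch, assuming a trained NER component:

```python
ner = nlp.get_pipe("ner")
docs = [doc1, doc2]
states = ner.predict(docs)         # internal StateClass objects
ner.set_annotations(docs, states)  # doc.ents is populated here
print([(ent.text, ent.label_) for ent in docs[0].ents])
```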
## EntityRecognizer.update {#update tag="method"}
@@ -151,16 +150,15 @@ predicted scores.
>
> ```python
> ner = EntityRecognizer(nlp.vocab)
> scores = ner.predict([doc1, doc2])
> loss, d_loss = ner.get_loss([doc1, doc2], [gold1, gold2], scores)
> scores = ner.predict([eg.predicted for eg in examples])
> loss, d_loss = ner.get_loss(examples, scores)
> ```
| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------ |
| `docs` | iterable | The batch of documents. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `scores` | - | Scores representing the model's predictions. |
| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. |
| Name | Type | Description |
| ----------- | ------------------- | --------------------------------------------------- |
| `examples` | `Iterable[Example]` | The batch of examples. |
| `scores` | `List[StateClass]` | Scores representing the model's predictions. |
| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. |
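
A hedged sketch of where `get_loss` sits in a training step: the scalar is useful for logging, while the gradient is what gets passed back through the model. The `examples` batch is assumed to be prepared as in the parser sketch above:

```python
scores = ner.predict([eg.predicted for eg in examples])
loss, d_scores = ner.get_loss(examples, scores)
print(f"NER loss on this batch: {loss:.4f}")  # d_scores drives backpropagation
```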
## EntityRecognizer.begin_training {#begin_training tag="method"}
@@ -182,8 +180,6 @@ Initialize the pipe for training, using data examples if available. Return an
| `sgd` | `Optimizer` | An optional [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Will be created via [`create_optimizer`](/api/entityrecognizer#create_optimizer) if not set. |
| **RETURNS** | `Optimizer` | An optimizer. |
## EntityRecognizer.create_optimizer {#create_optimizer tag="method"}
Create an optimizer for the pipeline component.
@@ -195,9 +191,9 @@ Create an optimizer for the pipeline component.
> optimizer = ner.create_optimizer()
> ```
| Name | Type | Description |
| ----------- | -------- | -------------- |
| **RETURNS** | callable | The optimizer. |
| Name | Type | Description |
| ----------- | ----------- | --------------------------------------------------------------- |
| **RETURNS** | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
## EntityRecognizer.use_params {#use_params tag="method, contextmanager"}

View File

@@ -52,7 +52,7 @@ contain arbitrary whitespace. Alignment into the original string is preserved.
| Name | Type | Description |
| ----------- | ----- | --------------------------------------------------------------------------------- |
| `text` | str | The text to be processed. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Doc` | A container for accessing the annotations. |
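
With the stricter `List[str]` annotation, `disable` works the same way at call time as before. For example, skipping the parser for a single text (assuming the pipeline contains one):

```python
# The parser is skipped for this call only; other components still run.
doc = nlp("Apple is looking at buying a U.K. startup.", disable=["parser"])
assert not doc.is_parsed
```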
## Language.pipe {#pipe tag="method"}

View File

@@ -15,7 +15,7 @@ via the ID `"tagger"`.
> ```python
> # Construction via create_pipe
> tagger = nlp.create_pipe("tagger")
>
> # Construction via create_pipe with custom model
> config = {"model": {"@architectures": "my_tagger"}}
> tagger = nlp.create_pipe("tagger", config)
@@ -90,13 +90,13 @@ Apply the pipeline's model to a batch of docs, without modifying them.
>
> ```python
> tagger = Tagger(nlp.vocab)
> scores, tensors = tagger.predict([doc1, doc2])
> scores = tagger.predict([doc1, doc2])
> ```
| Name | Type | Description |
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | The documents to predict. |
| **RETURNS** | tuple | A `(scores, tensors)` tuple where `scores` is the model's prediction for each document and `tensors` is the token representations used to predict the scores. Each tensor is an array with one row for each token in the document. |
| Name | Type | Description |
| ----------- | --------------- | ----------------------------------------- |
| `docs` | `Iterable[Doc]` | The documents to predict. |
| **RETURNS** | - | The model's prediction for each document. |
## Tagger.set_annotations {#set_annotations tag="method"}
@@ -106,15 +106,14 @@ Modify a batch of documents, using pre-computed scores.
>
> ```python
> tagger = Tagger(nlp.vocab)
> scores, tensors = tagger.predict([doc1, doc2])
> tagger.set_annotations([doc1, doc2], scores, tensors)
> scores = tagger.predict([doc1, doc2])
> tagger.set_annotations([doc1, doc2], scores)
> ```
| Name | Type | Description |
| --------- | -------- | ----------------------------------------------------- |
| `docs` | iterable | The documents to modify. |
| `scores` | - | The scores to set, produced by `Tagger.predict`. |
| `tensors` | iterable | The token representations used to predict the scores. |
| Name | Type | Description |
| -------- | --------------- | ------------------------------------------------ |
| `docs` | `Iterable[Doc]` | The documents to modify. |
| `scores` | - | The scores to set, produced by `Tagger.predict`. |
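
For the tagger, `scores` holds the model's per-token tag predictions for each document, and `set_annotations` converts them into `token.tag_` values. A minimal sketch, assuming a trained tagger:

```python
tagger = nlp.get_pipe("tagger")
docs = [doc1, doc2]
scores = tagger.predict(docs)
tagger.set_annotations(docs, scores)  # writes token.tag_ on each token
print([(token.text, token.tag_) for token in docs[0]])
```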
## Tagger.update {#update tag="method"}
@@ -149,16 +148,15 @@ predicted scores.
>
> ```python
> tagger = Tagger(nlp.vocab)
> scores = tagger.predict([doc1, doc2])
> loss, d_loss = tagger.get_loss([doc1, doc2], [gold1, gold2], scores)
> scores = tagger.predict([eg.predicted for eg in examples])
> loss, d_loss = tagger.get_loss(examples, scores)
> ```
| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------ |
| `docs` | iterable | The batch of documents. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `scores` | - | Scores representing the model's predictions. |
| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. |
| Name | Type | Description |
| ----------- | ------------------- | --------------------------------------------------- |
| `examples` | `Iterable[Example]` | The batch of examples. |
| `scores` | - | Scores representing the model's predictions. |
| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. |
## Tagger.begin_training {#begin_training tag="method"}
@@ -191,9 +189,9 @@ Create an optimizer for the pipeline component.
> optimizer = tagger.create_optimizer()
> ```
| Name | Type | Description |
| ----------- | -------- | -------------- |
| **RETURNS** | callable | The optimizer. |
| Name | Type | Description |
| ----------- | ----------- | --------------------------------------------------------------- |
| **RETURNS** | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
## Tagger.use_params {#use_params tag="method, contextmanager"}

View File

@@ -16,11 +16,11 @@ via the ID `"textcat"`.
> ```python
> # Construction via create_pipe
> textcat = nlp.create_pipe("textcat")
>
> # Construction via create_pipe with custom model
> config = {"model": {"@architectures": "my_textcat"}}
> textcat = nlp.create_pipe("textcat", config)
>
> # Construction from class with custom model from file
> from spacy.pipeline import TextCategorizer
> model = util.load_config("model.cfg", create_objects=True)["model"]
@@ -38,7 +38,7 @@ shortcut for this and instantiate the component using its string name and
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `TextCategorizer` | The newly constructed object. |
<!-- TODO move to config page
### Architectures {#architectures new="2.1"}
Text classification models can be used to solve a wide variety of problems.
@@ -109,13 +109,13 @@ Apply the pipeline's model to a batch of docs, without modifying them.
>
> ```python
> textcat = TextCategorizer(nlp.vocab)
> scores, tensors = textcat.predict([doc1, doc2])
> scores = textcat.predict([doc1, doc2])
> ```
| Name | Type | Description |
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | The documents to predict. |
| **RETURNS** | tuple | A `(scores, tensors)` tuple where `scores` is the model's prediction for each document and `tensors` is the token representations used to predict the scores. Each tensor is an array with one row for each token in the document. |
| Name | Type | Description |
| ----------- | --------------- | ----------------------------------------- |
| `docs` | `Iterable[Doc]` | The documents to predict. |
| **RETURNS** | - | The model's prediction for each document. |
## TextCategorizer.set_annotations {#set_annotations tag="method"}
@@ -125,15 +125,14 @@ Modify a batch of documents, using pre-computed scores.
>
> ```python
> textcat = TextCategorizer(nlp.vocab)
> scores, tensors = textcat.predict([doc1, doc2])
> textcat.set_annotations([doc1, doc2], scores, tensors)
> scores = textcat.predict(docs)
> textcat.set_annotations(docs, scores)
> ```
| Name | Type | Description |
| --------- | -------- | --------------------------------------------------------- |
| `docs` | iterable | The documents to modify. |
| `scores` | - | The scores to set, produced by `TextCategorizer.predict`. |
| `tensors` | iterable | The token representations used to predict the scores. |
| Name | Type | Description |
| -------- | --------------- | --------------------------------------------------------- |
| `docs` | `Iterable[Doc]` | The documents to modify. |
| `scores` | - | The scores to set, produced by `TextCategorizer.predict`. |
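
The scores written by `set_annotations` end up in `doc.cats`, keyed by label. A sketch of the full flow, assuming a trained text classifier (the label names depend entirely on training):

```python
textcat = nlp.get_pipe("textcat")
docs = [nlp.make_doc("This is great!")]
scores = textcat.predict(docs)
textcat.set_annotations(docs, scores)
print(docs[0].cats)  # e.g. {"POSITIVE": 0.95, "NEGATIVE": 0.05}
```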
## TextCategorizer.update {#update tag="method"}
@@ -168,16 +167,15 @@ predicted scores.
>
> ```python
> textcat = TextCategorizer(nlp.vocab)
> scores = textcat.predict([doc1, doc2])
> loss, d_loss = textcat.get_loss([doc1, doc2], [gold1, gold2], scores)
> scores = textcat.predict([eg.predicted for eg in examples])
> loss, d_loss = textcat.get_loss(examples, scores)
> ```
| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------ |
| `docs` | iterable | The batch of documents. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `scores` | - | Scores representing the model's predictions. |
| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. |
| Name | Type | Description |
| ----------- | ------------------- | --------------------------------------------------- |
| `examples` | `Iterable[Example]` | The batch of examples. |
| `scores` | - | Scores representing the model's predictions. |
| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. |
## TextCategorizer.begin_training {#begin_training tag="method"}
@@ -210,9 +208,9 @@ Create an optimizer for the pipeline component.
> optimizer = textcat.create_optimizer()
> ```
| Name | Type | Description |
| ----------- | -------- | -------------- |
| **RETURNS** | callable | The optimizer. |
| Name | Type | Description |
| ----------- | ----------- | --------------------------------------------------------------- |
| **RETURNS** | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
## TextCategorizer.use_params {#use_params tag="method, contextmanager"}

View File

@@ -34,7 +34,7 @@ loaded in via [`Language.from_disk`](/api/language#from_disk).
| Name | Type | Description |
| ----------- | ------------ | --------------------------------------------------------------------------------- |
| `name` | str / `Path` | Model to load, i.e. package name or path. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Language` | A `Language` object with the loaded model. |
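
The `disable` argument, now typed as `List[str]`, is the usual way to load a model without components you don't need:

```python
import spacy

# Load the model, but leave parser and NER out of the pipeline.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
print(nlp.pipe_names)  # the disabled components are gone entirely
```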
Essentially, `spacy.load()` is a convenience wrapper that reads the language ID
@@ -61,11 +61,11 @@ Create a blank model of a given language class. This function is the twin of
> nlp_de = spacy.blank("de")
> ```
| Name | Type | Description |
| ----------- | ---------- | ------------------------------------------------------------------------------------------------ |
| `name` | str | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Language` | An empty `Language` object of the appropriate subclass. |
| Name | Type | Description |
| ----------- | ----------- | ------------------------------------------------------------------------------------------------ |
| `name` | str | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. |
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Language` | An empty `Language` object of the appropriate subclass. |
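
A blank pipeline starts out with no components, which makes it the usual starting point for training from scratch:

```python
import spacy

nlp = spacy.blank("en")
print(nlp.pipe_names)  # [] (no components yet)
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
```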
#### spacy.info {#spacy.info tag="function"}