remove component.Model, update constructor, make losses the return value of update

svlandeg 2020-07-08 12:14:30 +02:00
parent 2298e129e6
commit 90b100c39f
6 changed files with 104 additions and 141 deletions

View File

@@ -8,35 +8,28 @@ This class is a subclass of `Pipe` and follows the same API. The pipeline
component is available in the [processing pipeline](/usage/processing-pipelines)
via the ID `"parser"`.
## DependencyParser.Model {#model tag="classmethod"}
Initialize a model for the pipe. The model should implement the
`thinc.neural.Model` API. Wrappers are under development for most major machine
learning libraries.
| Name | Type | Description |
| ----------- | ------ | ------------------------------------- |
| `**kwargs` | - | Parameters for initializing the model |
| **RETURNS** | object | The initialized model. |
## DependencyParser.\_\_init\_\_ {#init tag="method"}
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.create_pipe`](/api/language#create_pipe).
> #### Example
>
> ```python
> # Construction via create_pipe
> # Construction via create_pipe with default model
> parser = nlp.create_pipe("parser")
>
> # Construction via create_pipe with custom model
> config = {"model": {"@architectures": "my_parser"}}
> parser = nlp.create_pipe("parser", config)
>
> # Construction from class
> # Construction from class with custom model from file
> from spacy.pipeline import DependencyParser
> parser = DependencyParser(nlp.vocab, parser_model)
> parser.from_disk("/path/to/model")
> model = util.load_config("model.cfg", create_objects=True)["model"]
> parser = DependencyParser(nlp.vocab, model)
> ```
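The `@architectures` resolution that `util.load_config(..., create_objects=True)` performs can be sketched with a minimal mock registry. This is illustrative only — not spaCy/thinc's real registry; `register_architecture` and `create_objects` here are hypothetical stand-ins:

```python
# Mock registry sketch: a name registered under "@architectures" in a config
# block is resolved to its constructor and replaced by the built object.
ARCHITECTURES = {}

def register_architecture(name):
    def wrapper(func):
        ARCHITECTURES[name] = func
        return func
    return wrapper

@register_architecture("my_parser")
def build_my_parser():
    # Stand-in for constructing a thinc Model
    return {"arch": "my_parser"}

def create_objects(config):
    """Resolve {"@architectures": name} blocks into constructed objects."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, dict) and "@architectures" in value:
            resolved[key] = ARCHITECTURES[value["@architectures"]]()
        else:
            resolved[key] = value
    return resolved

model = create_objects({"model": {"@architectures": "my_parser"}})["model"]
# model is now the object built by the registered constructor
```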
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.create_pipe`](/api/language#create_pipe).
| Name | Type | Description |
| ----------- | ------------------ | ------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
@@ -85,11 +78,11 @@ applied to the `Doc` in order. Both [`__call__`](/api/dependencyparser#call) and
> pass
> ```
| Name | Type | Description |
| ------------ | -------- | ------------------------------------------------------ |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
| Name | Type | Description |
| ------------ | --------------- | ------------------------------------------------------ |
| `stream` | `Iterable[Doc]` | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
## DependencyParser.predict {#predict tag="method"}
@@ -104,7 +97,7 @@ Apply the pipeline's model to a batch of docs, without modifying them.
| Name | Type | Description |
| ----------- | ------------------- | ---------------------------------------------- |
| `docs` | iterable | The documents to predict. |
| `docs` | `Iterable[Doc]` | The documents to predict. |
| **RETURNS** | `syntax.StateClass` | A helper class for the parse state (internal). |
## DependencyParser.set_annotations {#set_annotations tag="method"}
@@ -134,9 +127,8 @@ model. Delegates to [`predict`](/api/dependencyparser#predict) and
>
> ```python
> parser = DependencyParser(nlp.vocab, parser_model)
> losses = {}
> optimizer = nlp.begin_training()
> parser.update(examples, losses=losses, sgd=optimizer)
> losses = parser.update(examples, sgd=optimizer)
> ```
| Name | Type | Description |

View File

@@ -12,36 +12,28 @@ This class is a subclass of `Pipe` and follows the same API. The pipeline
component is available in the [processing pipeline](/usage/processing-pipelines)
via the ID `"entity_linker"`.
## EntityLinker.Model {#model tag="classmethod"}
Initialize a model for the pipe. The model should implement the
`thinc.neural.Model` API, and should contain a field `tok2vec` that contains the
context encoder. Wrappers are under development for most major machine learning
libraries.
| Name | Type | Description |
| ----------- | ------ | ------------------------------------- |
| `**kwargs` | - | Parameters for initializing the model |
| **RETURNS** | object | The initialized model. |
## EntityLinker.\_\_init\_\_ {#init tag="method"}
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.create_pipe`](/api/language#create_pipe).
> #### Example
>
> ```python
> # Construction via create_pipe
> # Construction via create_pipe with default model
> entity_linker = nlp.create_pipe("entity_linker")
>
> # Construction from class
> # Construction via create_pipe with custom model
> config = {"model": {"@architectures": "my_el"}}
> entity_linker = nlp.create_pipe("entity_linker", config)
>
> # Construction from class with custom model from file
> from spacy.pipeline import EntityLinker
> entity_linker = EntityLinker(nlp.vocab, nel_model)
> entity_linker.from_disk("/path/to/model")
> model = util.load_config("model.cfg", create_objects=True)["model"]
> entity_linker = EntityLinker(nlp.vocab, model)
> ```
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.create_pipe`](/api/language#create_pipe).
| Name | Type | Description |
| ------- | ------- | ------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
@@ -90,11 +82,11 @@ applied to the `Doc` in order. Both [`__call__`](/api/entitylinker#call) and
> pass
> ```
| Name | Type | Description |
| ------------ | -------- | ------------------------------------------------------ |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
| Name | Type | Description |
| ------------ | --------------- | ------------------------------------------------------ |
| `stream` | `Iterable[Doc]` | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
## EntityLinker.predict {#predict tag="method"}
@@ -142,9 +134,8 @@ pipe's entity linking model and context encoder. Delegates to
>
> ```python
> entity_linker = EntityLinker(nlp.vocab, nel_model)
> losses = {}
> optimizer = nlp.begin_training()
> entity_linker.update(examples, losses=losses, sgd=optimizer)
> losses = entity_linker.update(examples, sgd=optimizer)
> ```
| Name | Type | Description |
@@ -155,7 +146,7 @@ pipe's entity linking model and context encoder. Delegates to
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entitylinker#set_annotations). |
| `sgd` | `Optimizer` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
| **RETURNS** | float | The loss from this batch. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
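The updated contract — `losses` is both an optional in/out parameter and the return value — can be sketched with a mock `update` (not spaCy's implementation; the loss computation here is a stand-in):

```python
# Mock of the updated `update` contract: the same dict can be passed back in
# across batches, is updated in place, and is also returned.
def update(examples, *, sgd=None, losses=None, name="entity_linker"):
    if losses is None:
        losses = {}
    # Stand-in for the real loss contribution of this batch
    losses[name] = losses.get(name, 0.0) + float(len(examples))
    return losses

losses = {}
for batch in [["a", "b"], ["c"]]:
    losses = update(batch, sgd=None, losses=losses)
# losses == {"entity_linker": 3.0}
```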
## EntityLinker.get_loss {#get_loss tag="method"}

View File

@@ -8,35 +8,28 @@ This class is a subclass of `Pipe` and follows the same API. The pipeline
component is available in the [processing pipeline](/usage/processing-pipelines)
via the ID `"ner"`.
## EntityRecognizer.Model {#model tag="classmethod"}
Initialize a model for the pipe. The model should implement the
`thinc.neural.Model` API. Wrappers are under development for most major machine
learning libraries.
| Name | Type | Description |
| ----------- | ------ | ------------------------------------- |
| `**kwargs` | - | Parameters for initializing the model |
| **RETURNS** | object | The initialized model. |
## EntityRecognizer.\_\_init\_\_ {#init tag="method"}
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.create_pipe`](/api/language#create_pipe).
> #### Example
>
> ```python
> # Construction via create_pipe
> ner = nlp.create_pipe("ner")
>
> # Construction via create_pipe with custom model
> config = {"model": {"@architectures": "my_ner"}}
> ner = nlp.create_pipe("ner", config)
>
> # Construction from class
> # Construction from class with custom model from file
> from spacy.pipeline import EntityRecognizer
> ner = EntityRecognizer(nlp.vocab, ner_model)
> ner.from_disk("/path/to/model")
> model = util.load_config("model.cfg", create_objects=True)["model"]
> ner = EntityRecognizer(nlp.vocab, model)
> ```
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.create_pipe`](/api/language#create_pipe).
| Name | Type | Description |
| ----------- | ------------------ | ------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
@@ -85,11 +78,11 @@ applied to the `Doc` in order. Both [`__call__`](/api/entityrecognizer#call) and
> pass
> ```
| Name | Type | Description |
| ------------ | -------- | ------------------------------------------------------ |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
| Name | Type | Description |
| ------------ | --------------- | ------------------------------------------------------ |
| `stream` | `Iterable[Doc]` | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
## EntityRecognizer.predict {#predict tag="method"}
@@ -135,9 +128,8 @@ model. Delegates to [`predict`](/api/entityrecognizer#predict) and
>
> ```python
> ner = EntityRecognizer(nlp.vocab, ner_model)
> losses = {}
> optimizer = nlp.begin_training()
> ner.update(examples, losses=losses, sgd=optimizer)
> losses = ner.update(examples, sgd=optimizer)
> ```
| Name | Type | Description |

View File

@@ -68,15 +68,15 @@ more efficient than processing texts one-by-one.
> assert doc.is_parsed
> ```
| Name | Type | Description |
| -------------------------------------------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `texts` | iterable | A sequence of strings. |
| `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. |
| `batch_size` | int | The number of texts to buffer. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| `n_process` <Tag variant="new">2.2.2</Tag> | int | Number of processors to use, only supported in Python 3. Defaults to `1`. |
| **YIELDS** | `Doc` | Documents in the order of the original text. |
| Name | Type | Description |
| -------------------------------------------- | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `texts` | `Iterable[str]` | A sequence of strings. |
| `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. |
| `batch_size` | int | The number of texts to buffer. |
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
| `n_process` <Tag variant="new">2.2.2</Tag> | int | Number of processors to use, only supported in Python 3. Defaults to `1`. |
| **YIELDS** | `Doc` | Documents in the order of the original text. |
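The `as_tuples` contract from the table can be sketched generically (a mock `pipe`, not spaCy's implementation; uppercasing stands in for producing a `Doc`):

```python
# Mock of the `as_tuples` contract: with (text, context) tuples in,
# (doc, context) tuples come out, order preserved.
def pipe(texts, as_tuples=False, batch_size=128):
    if as_tuples:
        texts, contexts = zip(*texts)
        docs = pipe(texts, batch_size=batch_size)
        yield from zip(docs, contexts)
        return
    for text in texts:
        yield text.upper()  # stand-in for a processed Doc

pairs = list(pipe([("hello", 1), ("world", 2)], as_tuples=True))
# pairs == [("HELLO", 1), ("WORLD", 2)]
```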
## Language.update {#update tag="method"}
@@ -99,6 +99,7 @@ Update the models in the pipeline.
| `sgd` | `Optimizer` | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
| `losses` | `Dict[str, float]` | Dictionary to update with the loss, keyed by pipeline component. |
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
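The new return value slightly simplifies the common training loop, since callers no longer need to pre-create the dict. A minimal mock of the contract (not spaCy itself; the per-component losses are stand-ins):

```python
# Mock of `Language.update`: losses is created if not passed, updated with
# one entry per pipeline component, and returned to the caller.
def update(examples, *, sgd=None, losses=None, components=("tagger", "parser")):
    if losses is None:
        losses = {}
    for name in components:
        # Stand-in for each component's contribution to the loss
        losses[name] = losses.get(name, 0.0) + float(len(examples))
    return losses

losses = update(["ex1", "ex2"])  # no pre-created dict needed
# losses == {"tagger": 2.0, "parser": 2.0}
```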
## Language.evaluate {#evaluate tag="method"}

View File

@@ -8,35 +8,28 @@ This class is a subclass of `Pipe` and follows the same API. The pipeline
component is available in the [processing pipeline](/usage/processing-pipelines)
via the ID `"tagger"`.
## Tagger.Model {#model tag="classmethod"}
Initialize a model for the pipe. The model should implement the
`thinc.neural.Model` API. Wrappers are under development for most major machine
learning libraries.
| Name | Type | Description |
| ----------- | ------ | ------------------------------------- |
| `**kwargs` | - | Parameters for initializing the model |
| **RETURNS** | object | The initialized model. |
## Tagger.\_\_init\_\_ {#init tag="method"}
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.create_pipe`](/api/language#create_pipe).
> #### Example
>
> ```python
> # Construction via create_pipe
> tagger = nlp.create_pipe("tagger")
>
> # Construction via create_pipe with custom model
> config = {"model": {"@architectures": "my_tagger"}}
> tagger = nlp.create_pipe("tagger", config)
>
> # Construction from class
> # Construction from class with custom model from file
> from spacy.pipeline import Tagger
> tagger = Tagger(nlp.vocab, tagger_model)
> tagger.from_disk("/path/to/model")
> model = util.load_config("model.cfg", create_objects=True)["model"]
> tagger = Tagger(nlp.vocab, model)
> ```
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.create_pipe`](/api/language#create_pipe).
| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
@@ -83,11 +76,11 @@ applied to the `Doc` in order. Both [`__call__`](/api/tagger#call) and
> pass
> ```
| Name | Type | Description |
| ------------ | -------- | ------------------------------------------------------ |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
| Name | Type | Description |
| ------------ | --------------- | ------------------------------------------------------ |
| `stream` | `Iterable[Doc]` | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
## Tagger.predict {#predict tag="method"}
@@ -133,9 +126,8 @@ pipe's model. Delegates to [`predict`](/api/tagger#predict) and
>
> ```python
> tagger = Tagger(nlp.vocab, tagger_model)
> losses = {}
> optimizer = nlp.begin_training()
> tagger.update(examples, losses=losses, sgd=optimizer)
> losses = tagger.update(examples, sgd=optimizer)
> ```
| Name | Type | Description |
@@ -146,6 +138,7 @@ pipe's model. Delegates to [`predict`](/api/tagger#predict) and
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/tagger#set_annotations). |
| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
## Tagger.get_loss {#get_loss tag="method"}

View File

@@ -9,36 +9,28 @@ This class is a subclass of `Pipe` and follows the same API. The pipeline
component is available in the [processing pipeline](/usage/processing-pipelines)
via the ID `"textcat"`.
## TextCategorizer.Model {#model tag="classmethod"}
Initialize a model for the pipe. The model should implement the
`thinc.neural.Model` API. Wrappers are under development for most major machine
learning libraries.
| Name | Type | Description |
| ----------- | ------ | ------------------------------------- |
| `**kwargs` | - | Parameters for initializing the model |
| **RETURNS** | object | The initialized model. |
## TextCategorizer.\_\_init\_\_ {#init tag="method"}
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.create_pipe`](/api/language#create_pipe).
> #### Example
>
> ```python
> # Construction via create_pipe
> textcat = nlp.create_pipe("textcat")
> textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})
>
> # Construction from class
>
> # Construction via create_pipe with custom model
> config = {"model": {"@architectures": "my_textcat"}}
> textcat = nlp.create_pipe("textcat", config)
>
> # Construction from class with custom model from file
> from spacy.pipeline import TextCategorizer
> textcat = TextCategorizer(nlp.vocab, textcat_model)
> textcat.from_disk("/path/to/model")
> model = util.load_config("model.cfg", create_objects=True)["model"]
> textcat = TextCategorizer(nlp.vocab, model)
> ```
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.create_pipe`](/api/language#create_pipe).
| Name | Type | Description |
| ----------- | ----------------- | ------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
@@ -46,6 +38,7 @@ shortcut for this and instantiate the component using its string name and
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `TextCategorizer` | The newly constructed object. |
<!-- TODO move to config page
### Architectures {#architectures new="2.1"}
Text classification models can be used to solve a wide variety of problems.
@@ -60,6 +53,7 @@ argument.
| `"ensemble"` | **Default:** Stacked ensemble of a bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. The "ngram_size" and "attr" arguments can be used to configure the feature extraction for the bag-of-words model. |
| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster. |
| `"bow"` | An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. The features extracted can be controlled using the keyword arguments `ngram_size` and `attr`. For instance, `ngram_size=3` and `attr="lower"` would give lower-cased unigram, trigram and bigram features. 2, 3 or 4 are usually good choices of ngram size. |
-->
## TextCategorizer.\_\_call\_\_ {#call tag="method"}
@@ -101,11 +95,11 @@ applied to the `Doc` in order. Both [`__call__`](/api/textcategorizer#call) and
> pass
> ```
| Name | Type | Description |
| ------------ | -------- | ------------------------------------------------------ |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
| Name | Type | Description |
| ------------ | --------------- | ------------------------------------------------------ |
| `stream` | `Iterable[Doc]` | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
## TextCategorizer.predict {#predict tag="method"}
@@ -151,9 +145,8 @@ pipe's model. Delegates to [`predict`](/api/textcategorizer#predict) and
>
> ```python
> textcat = TextCategorizer(nlp.vocab, textcat_model)
> losses = {}
> optimizer = nlp.begin_training()
> textcat.update(examples, losses=losses, sgd=optimizer)
> losses = textcat.update(examples, sgd=optimizer)
> ```
| Name | Type | Description |
@@ -164,6 +157,7 @@ pipe's model. Delegates to [`predict`](/api/textcategorizer#predict) and
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/textcategorizer#set_annotations). |
| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
## TextCategorizer.get_loss {#get_loss tag="method"}