Merge pull request #5938 from svlandeg/feature/update-docs [ci skip]

Ines Montani 2020-08-19 20:24:02 +02:00 committed by GitHub
commit 7a8cc64ea8
8 changed files with 84 additions and 60 deletions

View File

@@ -1,5 +1,5 @@
# Recommended settings and available resources for each language, if available.
-# Not all languages have recommended word vecotrs or transformers and for some,
+# Not all languages have recommended word vectors or transformers and for some,
# the recommended transformer for efficiency and accuracy may be the same.
en:
word_vectors: en_vectors_web_lg

View File

@@ -13,7 +13,7 @@ menu:
TODO: intro and how architectures work, link to
[`registry`](/api/top-level#registry),
-[custom models](/usage/training#custom-models) usage etc.
+[custom functions](/usage/training#custom-functions) usage etc.
## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"}
@@ -622,7 +622,7 @@ others, but may not be as accurate, especially if texts are short.
An [`EntityLinker`](/api/entitylinker) component disambiguates textual mentions
(tagged as named entities) to unique identifiers, grounding the named entities
into the "real world". This requires 3 main component
into the "real world". This requires 3 main components:
- A [`KnowledgeBase`](/api/kb) (KB) holding the unique identifiers, potential
synonyms and prior probabilities.

View File

@@ -276,7 +276,7 @@ python -m spacy init fill-config tmp/starter-config_invalid.cfg --base tmp/start
| Name | Description |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ |
-| `--code_path`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. ~~Optional[Path] \(option)~~ |
+| `--code_path`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
| **PRINTS** | Config validation errors, if available. |
@@ -448,7 +448,7 @@ will not be available.
| Name | Description |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ |
-| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. ~~Optional[Path] \(option)~~ |
+| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
| `--ignore-warnings`, `-IW` | Ignore warnings, only show stats and errors. ~~bool (flag)~~ |
| `--verbose`, `-V` | Print additional information and explanations. ~~bool (flag)~~ |
| `--no-format`, `-NF` | Don't pretty-print the results. Use this if you want to write to a file. ~~bool (flag)~~ |
@@ -612,9 +612,9 @@ Train a model. Expects data in spaCy's
Will save out the best model from all epochs, as well as the final model. The
`--code` argument can be used to provide a Python file that's imported before
the training process starts. This lets you register
-[custom functions](/usage/training#custom-models) and architectures and refer to
-them in your config, all while still using spaCy's built-in `train` workflow. If
-you need to manage complex multi-step training workflows, check out the new
+[custom functions](/usage/training#custom-functions) and architectures and refer
+to them in your config, all while still using spaCy's built-in `train` workflow.
+If you need to manage complex multi-step training workflows, check out the new
[spaCy projects](/usage/projects).
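For instance, a registered architecture in such a file could look like the
following minimal sketch. The registry name `"my_relu_layer.v1"` and the layer
it composes are made up for illustration and not part of spaCy itself:

```python
### functions.py (sketch)
import spacy
from thinc.api import Model, Relu, Dropout, chain

@spacy.registry.architectures("my_relu_layer.v1")
def create_relu_layer(width: int, dropout: float = 0.2) -> Model:
    # Compose a small Thinc model that a config can reference via
    # @architectures = "my_relu_layer.v1"
    return chain(Relu(nO=width), Dropout(dropout))
```

Passing the file on the command line, e.g. `--code functions.py`, makes sure the
function is registered before the config is resolved, so the config can refer to
it by name.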
<Infobox title="New in v3.0" variant="warning">
@@ -636,7 +636,7 @@ $ python -m spacy train [config_path] [--output] [--code] [--verbose] [overrides
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ |
| `--output`, `-o` | Directory to store model in. Will be created if it doesn't exist. ~~Optional[Path] \(positional)~~ |
-| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. ~~Optional[Path] \(option)~~ |
+| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
| `--verbose`, `-V` | Show more detailed messages during training. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
@@ -674,7 +674,7 @@ $ python -m spacy pretrain [texts_loc] [output_dir] [config_path] [--code] [--re
| `texts_loc` | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](/api/data-formats#pretrain) for details. ~~Path (positional)~~ |
| `output_dir` | Directory to write models to on each epoch. ~~Path (positional)~~ |
| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ |
-| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. ~~Optional[Path] \(option)~~ |
+| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
| `--resume-path`, `-r` | Path to pretrained weights from which to resume pretraining. ~~Optional[Path] \(option)~~ |
| `--epoch-resume`, `-er` | The epoch to resume counting from when using `--resume-path`. Prevents unintended overwriting of existing weight files. ~~Optional[int] \(option)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |

View File

@@ -40,13 +40,14 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("entity_linker", config=config)
> ```
-| Setting          | Description |
-| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ |
-| `incl_prior`     | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ |
-| `incl_context`   | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ |
-| `model`          | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ |
-| `kb`             | The [`KnowledgeBase`](/api/kb). Defaults to [EmptyKB](/api/architectures#EmptyKB), a function returning an empty `KnowledgeBase` with an `entity_vector_length` of `64`. ~~KnowledgeBase~~ |
+| Setting          | Description |
+| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ |
+| `incl_prior`     | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ |
+| `incl_context`   | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ |
+| `model`          | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ |
+| `kb_loader`      | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. Defaults to [EmptyKB](/api/architectures#EmptyKB), a function returning an empty `KnowledgeBase` with an `entity_vector_length` of `64`. ~~Callable[[Vocab], KnowledgeBase]~~ |
+| `get_candidates` | Function that generates plausible candidates for a given `Span` object. Defaults to [CandidateGenerator](/api/architectures#CandidateGenerator), a function looking up exact, case-dependent aliases in the KB. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
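For example, a pipeline could override a few of these settings when adding the
component. This is only a minimal sketch, assuming a blank English pipeline and
the default model and knowledge base:

```python
### Sketch: overriding a few settings
import spacy

nlp = spacy.blank("en")
# Only plain values are overridden here; `model` and `kb_loader` keep their defaults
config = {"labels_discard": ["CARDINAL", "ORDINAL"], "incl_prior": False}
entity_linker = nlp.add_pipe("entity_linker", config=config)
```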
```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
@@ -79,16 +80,17 @@ shortcut for this and instantiate the component using its string name and
`KnowledgeBase` as well as the Candidate generator can be customized by
providing custom registered functions.
-| Name             | Description |
-| ---------------- | --------------------------------------------------------------------------------------------------- |
-| `vocab`          | The shared vocabulary. ~~Vocab~~ |
-| `model`          | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model~~ |
-| `name`           | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
-| _keyword-only_   | | |
-| `kb`             | The [`KnowledgeBase`](/api/kb). ~~KnowledgeBase~~ |
-| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |
-| `incl_prior`     | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ |
-| `incl_context`   | Whether or not to include the local context in the model. ~~bool~~ |
+| Name             | Description |
+| ---------------- | -------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab`          | The shared vocabulary. ~~Vocab~~ |
+| `model`          | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model~~ |
+| `name`           | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
+| _keyword-only_   | | |
+| `kb_loader`      | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. ~~Callable[[Vocab], KnowledgeBase]~~ |
+| `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
+| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |
+| `incl_prior`     | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ |
+| `incl_context`   | Whether or not to include the local context in the model. ~~bool~~ |
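Both callables can be plain functions with the signatures described above,
typically registered so a config can refer to them by name. A rough sketch
follows: the entity and alias are invented, and the exact KB lookup may differ
for a custom `KnowledgeBase` implementation:

```python
### Sketch: kb_loader and get_candidates callables
from typing import Iterable
from spacy.kb import Candidate, KnowledgeBase
from spacy.tokens import Span
from spacy.vocab import Vocab

def my_kb_loader(vocab: Vocab) -> KnowledgeBase:
    # kb_loader: receives the shared Vocab and returns a populated KB
    kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
    kb.add_entity(entity="Q42", freq=12, entity_vector=[0.0] * 64)
    kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[1.0])
    return kb

def my_get_candidates(kb: KnowledgeBase, span: Span) -> Iterable[Candidate]:
    # get_candidates: exact, case-sensitive alias lookup in the KB
    return kb.get_alias_candidates(span.text)
```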
## EntityLinker.\_\_call\_\_ {#call tag="method"}

View File

@@ -295,23 +295,23 @@ factories.
> return Model("custom", forward, dims={"nO": nO})
> ```
-| Registry name     | Description |
-| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `architectures`   | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
-| `factories`       | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points) |
-| `tokenizers`      | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
-| `languages`       | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
-| `lookups`         | Registry for large lookup tables available via `vocab.lookups`. |
-| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
-| `assets`          | Registry for data assets, knowledge bases etc. |
-| `callbacks`       | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
-| `readers`         | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
-| `batchers`        | Registry for training and evaluation [data batchers](#batchers). |
-| `optimizers`      | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
-| `schedules`       | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
-| `layers`          | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
-| `losses`          | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
-| `initializers`    | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
+| Registry name     | Description |
+| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `architectures`   | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
+| `factories`       | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
+| `tokenizers`      | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
+| `languages`       | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
+| `lookups`         | Registry for large lookup tables available via `vocab.lookups`. |
+| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
+| `assets`          | Registry for data assets, knowledge bases etc. |
+| `callbacks`       | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
+| `readers`         | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
+| `batchers`        | Registry for training and evaluation [data batchers](#batchers). |
+| `optimizers`      | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
+| `schedules`       | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
+| `layers`          | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
+| `losses`          | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
+| `initializers`    | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
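All of these registries follow the same pattern: decorate a function to register
it under a name, and resolve it again by that name. A minimal sketch using the
`callbacks` registry follows; the name `"my_callbacks.add_case.v1"` and the
tokenizer tweak are purely illustrative:

```python
### Sketch: registering and resolving a custom callback
import spacy
from spacy.attrs import ORTH
from spacy.language import Language

@spacy.registry.callbacks("my_callbacks.add_case.v1")
def create_callback():
    def add_special_case(nlp: Language) -> Language:
        # Modify the nlp object in place before training starts
        nlp.tokenizer.add_special_case("spaCy", [{ORTH: "spaCy"}])
        return nlp
    return add_special_case

# Resolve the registered function by name, as the config system would
callback = spacy.registry.callbacks.get("my_callbacks.add_case.v1")()
nlp = callback(spacy.blank("en"))
```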
### spacy-transformers registry {#registry-transformers}
@@ -340,7 +340,17 @@ See the [`Transformer`](/api/transformer) API reference and
## Batchers {#batchers source="spacy/gold/batchers.py" new="3"}
-<!-- TODO: intro -->
+A data batcher implements a batching strategy that essentially turns a stream of
+items into a stream of batches, with each batch consisting of one item or a list
+of items. During training, the models update their weights after processing one
+batch at a time. Typical batching strategies include presenting the training
+data as a stream of batches with similar sizes, or with increasing batch sizes.
+See the Thinc documentation on
+[`schedules`](https://thinc.ai/docs/api-schedules) for a few standard examples.
+Instead of using one of the built-in batchers listed here, you can also
+[implement your own](/usage/training#custom-code-readers-batchers), which may or
+may not use a custom schedule.
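As a rough sketch of the shape a custom batcher can take (the registry name
`"my_batchers.fixed_size.v1"` is made up for illustration), the registered
function returns a callable that turns a stream of items into a stream of lists:

```python
### Sketch: a custom fixed-size batcher
from typing import Callable, Iterable, Iterator, List, TypeVar
import spacy

ItemT = TypeVar("ItemT")

@spacy.registry.batchers("my_batchers.fixed_size.v1")
def configure_fixed_size_batches(size: int) -> Callable[[Iterable[ItemT]], Iterator[List[ItemT]]]:
    def batch(items: Iterable[ItemT]) -> Iterator[List[ItemT]]:
        buffer: List[ItemT] = []
        for item in items:
            buffer.append(item)
            if len(buffer) == size:
                yield buffer
                buffer = []
        if buffer:
            yield buffer  # final, possibly smaller batch
    return batch
```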
#### batch_by_words.v1 {#batch_by_words tag="registered function"}

View File

@@ -155,8 +155,8 @@ other. For instance, to generate a packaged model, you might start by converting
your data, then run [`spacy train`](/api/cli#train) to train your model on the
converted data and if that's successful, run [`spacy package`](/api/cli#package)
to turn the best model artifact into an installable Python package. The
-following command run the workflow named `all` defined in the `project.yml`, and
-execute the commands it specifies, in order:
+following command runs the workflow named `all` defined in the `project.yml`, and
+executes the commands it specifies, in order:
```cli
$ python -m spacy project run all
@@ -199,7 +199,7 @@ https://github.com/explosion/spacy-boilerplates/blob/master/ner_fashion/project.
### Dependencies and outputs {#deps-outputs}
Each command defined in the `project.yml` can optionally define a list of
-dependencies and outputs. These are the files the commands requires and creates.
+dependencies and outputs. These are the files the command requires and creates.
For example, a command for training a model may depend on a
[`config.cfg`](/usage/training#config) and the training and evaluation data, and
it will export a directory `model-best`, containing the best model, which you

View File

@@ -5,7 +5,7 @@ menu:
- ['Introduction', 'basics']
- ['Quickstart', 'quickstart']
- ['Config System', 'config']
-- ['Custom Models', 'custom-models']
+- ['Custom Functions', 'custom-functions']
- ['Transfer Learning', 'transfer-learning']
- ['Parallel Training', 'parallel-training']
- ['Internal API', 'api']
@@ -127,7 +127,7 @@ Some of the main advantages and features of spaCy's training config are:
[optimizers](https://thinc.ai/docs/api-optimizers) or
[schedules](https://thinc.ai/docs/api-schedules) and define arguments that are
passed into them. You can also register your own functions to define
-[custom architectures](#custom-models), reference them in your config and
+[custom architectures](#custom-functions), reference them in your config and
tweak their parameters.
- **Interpolation.** If you have hyperparameters or other settings used by
multiple components, define them once and reference them as
@@ -299,7 +299,7 @@ case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding) defined
in the [function registry](/api/top-level#registry). All other values defined in
the block are passed to the function as keyword arguments when it's initialized.
You can also use this mechanism to register
-[custom implementations and architectures](#custom-models) and reference them
+[custom implementations and architectures](#custom-functions) and reference them
from your configs.
> #### How the config is resolved
@@ -481,9 +481,25 @@ still look good.
</Accordion>
-## Custom model implementations and architectures {#custom-models}
+## Custom Functions {#custom-functions}
-<!-- TODO: intro, should summarise what spaCy v3 can do and that you can now use fully custom implementations, models defined in PyTorch and TF, etc. etc. possibly link to new (not yet created) page on creating models -->
+Registered functions in the training config files can refer to built-in
+implementations, but you can also plug in fully custom implementations. To do
+so, you first write your own implementation of a custom architecture, data
+reader or any other functionality, and then register this function with the
+correct [registry](/api/top-level#registry). This allows you to plug in models
+defined in PyTorch or TensorFlow, make custom modifications to the `nlp` object,
+create custom optimizers or schedules, or write a function that streams in data
+and preprocesses it on the fly while training.
+
+Each custom function can have any number of arguments that are passed in
+through the config, just like the built-in functions. If your
+function defines **default argument values**, spaCy is able to auto-fill your
+config when you run [`init fill-config`](/api/cli#init-fill-config). If you want
+to make sure that a given parameter is always explicitly set in the config,
+avoid setting a default value for it.
+
+<!-- TODO: possibly link to new (not yet created) page on creating models ? -->
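For example, a custom schedule could be registered like in the following sketch
(the name `"my_schedules.slow_warmup.v1"` is invented for illustration). Since
`initial` defines a default value, [`init fill-config`](/api/cli#init-fill-config)
can auto-fill it, while `target` always has to be set explicitly in the config:

```python
### Sketch: a registered custom schedule
from typing import Iterator
import spacy

@spacy.registry.schedules("my_schedules.slow_warmup.v1")
def slow_warmup(target: float, initial: float = 0.0) -> Iterator[float]:
    # Yield an endless stream of values moving from `initial` towards `target`
    value = initial
    while True:
        yield value
        value += (target - value) * 0.01
```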
### Training with custom code {#custom-code}
@@ -642,11 +658,7 @@ In your config, you can now reference the schedule in the
starting with an `@`, it's interpreted as a reference to a function. All other
settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults and all arguments on
-the functions need to be represented in the config. If your function defines
-**default argument values**, spaCy is able to auto-fill your config when you run
-[`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
-given parameter is always explicitely set in the config, avoid setting a default
-value for it.
+the functions need to be represented in the config.
```ini
### config.cfg (excerpt)
@@ -733,7 +745,7 @@ the annotations are exactly the same.
```python
### functions.py
-from typing import Callable, Iterable, Iterator
+from typing import Callable, Iterable, Iterator, List
import spacy
from spacy.gold import Example

View File

@@ -44,7 +44,7 @@ menu:
</Infobox>
-### Custom models using any framework {#feautres-custom-models}
+### Custom models using any framework {#features-custom-models}
### Manage end-to-end workflows with projects {#features-projects}