Update docs [ci skip]

Ines Montani 2020-08-05 20:29:53 +02:00
parent c675746ca2
commit 50311a4d37
5 changed files with 301 additions and 124 deletions


@@ -6,30 +6,44 @@ source: spacy/gold/corpus.py
new: 3
---
This class manages annotated corpora and can read training and development
datasets in the [DocBin](/api/docbin) (`.spacy`) format.
This class manages annotated corpora and can be used for training and
development datasets in the [DocBin](/api/docbin) (`.spacy`) format. To
customize the data loading during training, you can register your own
[data readers and batchers](/usage/training#custom-code-readers-batchers).
## Corpus.\_\_init\_\_ {#init tag="method"}
Create a `Corpus`. The input data can be a file or a directory of files.
Create a `Corpus` for iterating [Example](/api/example) objects from a file or
directory of [`.spacy` data files](/api/data-formats#binary-training). The
`gold_preproc` setting lets you specify whether to set up the `Example` object
with gold-standard sentences and tokens for the predictions. Gold preprocessing
helps the annotations align to the tokenization, and may result in sequences of
more consistent length. However, it may reduce runtime accuracy due to
train/test skew.
> #### Example
>
> ```python
> from spacy.gold import Corpus
>
> corpus = Corpus("./train.spacy", "./dev.spacy")
> # With a single file
> corpus = Corpus("./data/train.spacy")
>
> # With a directory
> corpus = Corpus("./data", limit=10)
> ```
| Name | Type | Description |
| ------- | ------------ | ---------------------------------------------------------------- |
| `train` | str / `Path` | Training data (`.spacy` file or directory of `.spacy` files). |
| `dev` | str / `Path` | Development data (`.spacy` file or directory of `.spacy` files). |
| `limit` | int | Maximum number of examples returned. `0` for no limit (default). |
| Name | Type | Description |
| --------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | str / `Path` | The directory or filename to read from. |
| _keyword-only_ | | |
| `gold_preproc` | bool | Whether to set up the `Example` object with gold-standard sentences and tokens for the predictions. Defaults to `False`. |
| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
## Corpus.train_dataset {#train_dataset tag="method"}
## Corpus.\_\_call\_\_ {#call tag="method"}
Yield examples from the training data.
Yield examples from the data.
> #### Example
>
@@ -37,60 +51,12 @@ Yield examples from the training data.
> from spacy.gold import Corpus
> import spacy
>
> corpus = Corpus("./train.spacy", "./dev.spacy")
> corpus = Corpus("./train.spacy")
> nlp = spacy.blank("en")
> train_data = corpus.train_dataset(nlp)
> train_data = corpus(nlp)
> ```
| Name | Type | Description |
| -------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `nlp` | `Language` | The current `nlp` object. |
| _keyword-only_ | | |
| `shuffle` | bool | Whether to shuffle the examples. Defaults to `True`. |
| `gold_preproc` | bool | Whether to train on gold-standard sentences and tokens. Defaults to `False`. |
| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. `0` for no limit (default).  |
| **YIELDS** | `Example` | The examples. |
## Corpus.dev_dataset {#dev_dataset tag="method"}
Yield examples from the development data.
> #### Example
>
> ```python
> from spacy.gold import Corpus
> import spacy
>
> corpus = Corpus("./train.spacy", "./dev.spacy")
> nlp = spacy.blank("en")
> dev_data = corpus.dev_dataset(nlp)
> ```
| Name | Type | Description |
| -------------- | ---------- | ---------------------------------------------------------------------------- |
| `nlp` | `Language` | The current `nlp` object. |
| _keyword-only_ | | |
| `gold_preproc` | bool | Whether to train on gold-standard sentences and tokens. Defaults to `False`. |
| **YIELDS** | `Example` | The examples. |
## Corpus.count_train {#count_train tag="method"}
Get the word count of all training examples.
> #### Example
>
> ```python
> from spacy.gold import Corpus
> import spacy
>
> corpus = Corpus("./train.spacy", "./dev.spacy")
> nlp = spacy.blank("en")
> word_count = corpus.count_train(nlp)
> ```
| Name | Type | Description |
| ----------- | ---------- | ------------------------- |
| `nlp` | `Language` | The current `nlp` object. |
| **RETURNS** | int | The word count. |
<!-- TODO: document remaining methods? / decide which to document -->
| Name | Type | Description |
| ---------- | ---------- | ------------------------- |
| `nlp` | `Language` | The current `nlp` object. |
| **YIELDS** | `Example` | The examples. |


@@ -4,7 +4,7 @@ menu:
- ['spacy', 'spacy']
- ['displacy', 'displacy']
- ['registry', 'registry']
- ['Loaders & Batchers', 'loaders-batchers']
- ['Readers & Batchers', 'readers-batchers']
- ['Data & Alignment', 'gold']
- ['Utility Functions', 'util']
---
@@ -303,6 +303,9 @@ factories.
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `assets` | |
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
| `readers` | Registry for training and evaluation [data readers](#readers-batchers). |
| `batchers` | Registry for training and evaluation [data batchers](#readers-batchers). |
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
@@ -334,10 +337,113 @@ See the [`Transformer`](/api/transformer) API reference and
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
## Training data loaders and batchers {#loaders-batchers new="3"}
## Data readers and batchers {#readers-batchers new="3"}
<!-- TODO: -->
### spacy.Corpus.v1 {#corpus tag="registered function" source="spacy/gold/corpus.py"}
Registered function that creates a [`Corpus`](/api/corpus) of training or
evaluation data. It takes the same arguments as the `Corpus` class and returns a
callable that yields [`Example`](/api/example) objects. You can replace it with
your own registered function in the [`@readers` registry](#registry) to
customize the data loading and streaming.
> #### Example config
>
> ```ini
> [paths]
> train = "corpus/train.spacy"
>
> [training.train_corpus]
> @readers = "spacy.Corpus.v1"
> path = ${paths:train}
> gold_preproc = false
> max_length = 0
> limit = 0
> ```
| Name | Type | Description |
| --------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | `Path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). |
| `gold_preproc` | bool | Whether to set up the `Example` object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. |
| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
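To plug in a fully custom reader, you can register a function in the `readers`
registry that returns a callable taking the current `nlp` object and yielding
[`Example`](/api/example) objects. The following is a minimal sketch, assuming
a plain-text file with one training text per line and no gold annotations. The
name `plain_text_reader.v1` and its `path` argument are hypothetical, just to
illustrate the plumbing:

```python
### functions.py
from typing import Callable, Iterable
import spacy
from spacy.gold import Example
from spacy.language import Language

@spacy.registry.readers("plain_text_reader.v1")
def create_reader(path: str) -> Callable[[Language], Iterable[Example]]:
    def read_examples(nlp: Language) -> Iterable[Example]:
        # Hypothetical format: one raw text per line, no gold annotations
        with open(path, encoding="utf8") as file_:
            for line in file_:
                doc = nlp.make_doc(line.strip())
                yield Example.from_dict(doc, {})

    return read_examples
```

You could then reference it in your config instead of `spacy.Corpus.v1`, e.g.
`@readers = "plain_text_reader.v1"` in the `[training.train_corpus]` block.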
### Batchers {#batchers source="spacy/gold/batchers.py"}
<!-- TODO: -->
#### batch_by_words.v1 {#batch_by_words tag="registered function"}
Create minibatches of roughly a given number of words. If any examples are
longer than the specified batch length, they will appear in a batch by
themselves, or be discarded if `discard_oversize` is set to `True`. The argument
`docs` can be a list of strings, [`Doc`](/api/doc) objects or
[`Example`](/api/example) objects.
> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "batch_by_words.v1"
> size = 100
> tolerance = 0.2
> discard_oversize = false
> get_length = null
> ```
<!-- TODO: complete table -->
| Name | Type | Description |
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `tolerance` | float | What percentage of the batch size to allow batches to exceed. |
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
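As a minimal sketch, assuming the registered name and the signature shown
above, you could also resolve and use the batcher directly in Python:

```python
import spacy

nlp = spacy.blank("en")
docs = [nlp.make_doc(text) for text in ["A short document.", "Another one.", "A third text."]]

# Resolve the registered factory and create a batcher with explicit settings
create_batcher = spacy.registry.batchers.get("batch_by_words.v1")
batcher = create_batcher(size=8, tolerance=0.2, discard_oversize=False, get_length=None)

for batch in batcher(docs):
    print([doc.text for doc in batch])
```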
#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
<!-- TODO: -->
> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "batch_by_sequence.v1"
> size = 32
> get_length = null
> ```
<!-- TODO: complete table -->
| Name | Type | Description |
| ------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
#### batch_by_padded.v1 {#batch_by_padded tag="registered function"}
<!-- TODO: -->
> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "batch_by_padded.v1"
> size = 100
> buffer = TODO:
> discard_oversize = false
> get_length = null
> ```
| Name | Type | Description |
| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
| `buffer` | int | The number of sequences to accumulate before sorting by length. A larger buffer gives more even batch sizes, at the cost of a less random iteration order. |
| `discard_oversize` | bool | Discard items that are longer than the specified batch length. |
| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
## Training data and alignment {#gold source="spacy/gold"}
### gold.docs_to_json {#docs_to_json tag="function"}


@@ -5,8 +5,8 @@ menu:
- ['Introduction', 'basics']
- ['Quickstart', 'quickstart']
- ['Config System', 'config']
- ['Transfer Learning', 'transfer-learning']
- ['Custom Models', 'custom-models']
- ['Transfer Learning', 'transfer-learning']
- ['Parallel Training', 'parallel-training']
- ['Internal API', 'api']
---
@@ -315,6 +315,10 @@ stop = 1000
compound = 1.001
```
### Using variable interpolation {#config-interpolation}
<!-- TODO: describe and come up with good example showing both values and sections -->
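For example, values defined once in a section like `[paths]` can be reused
across the config by referencing them as `${section:key}`, the same syntax used
for `path = ${paths:train}` in the [`spacy.Corpus.v1`](/api/top-level#corpus)
example. A minimal sketch, assuming `train_corpus` and `dev_corpus` blocks like
the ones shown in the API docs:

```ini
### config.cfg (excerpt)
[paths]
train = "corpus/train.spacy"
dev = "corpus/dev.spacy"

[training.train_corpus]
@readers = "spacy.Corpus.v1"
path = ${paths:train}

[training.dev_corpus]
@readers = "spacy.Corpus.v1"
path = ${paths:dev}
```

This way, you only need to update the paths in one place.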
### Model architectures {#model-architectures}
<!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
@@ -384,41 +388,17 @@ still look good.
</Accordion>
## Transfer learning {#transfer-learning}
### Using transformer models like BERT {#transformers}
spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
can use models implemented in a variety of frameworks. A transformer model is
just a statistical model, so the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
actually has very little work to do: it just has to provide a few functions that
do the required plumbing. It also provides a pipeline component,
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
you save the transformer outputs for later use.
<Project id="en_core_bert">
Try out a BERT-based model pipeline using this project template: swap in your
data, edit the settings and hyperparameters and train, evaluate, package and
visualize your model.
</Project>
For more details on how to integrate transformer models into your training
config and customize the implementations, see the usage guide on
[training transformers](/usage/transformers#training).
### Pretraining with spaCy {#pretraining}
<!-- TODO: document spacy pretrain -->
## Custom model implementations and architectures {#custom-models}
<!-- TODO: intro, should summarise what spaCy v3 can do and that you can now use fully custom implementations, models defined in PyTorch and TF, etc. etc. -->
### Training with custom code {#custom-code}
> ```bash
> ### Example {wrap="true"}
> $ python -m spacy train train.spacy dev.spacy config.cfg --code functions.py
> ```
The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument
`--code` that points to a Python file. The file is imported before training and
allows you to add custom functions and architectures to the function registry
@@ -426,6 +406,120 @@ that can then be referenced from your `config.cfg`. This lets you train spaCy
models with custom components, without having to re-implement the whole training
workflow.
#### Example: Modifying the nlp object {#custom-code-nlp-callbacks}
For many use cases, you don't necessarily want to implement the whole `Language`
subclass and language data from scratch. It's often enough to make a few small
modifications, like adjusting the
[tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or
[language defaults](/api/language#defaults) like stop words. The config lets you
provide three optional **callback functions** that give you access to the
language class and `nlp` object at different points of the lifecycle:
| Callback | Description |
| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `before_creation` | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to the [`Language.Defaults`](/api/language#defaults). |
| `after_creation` | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object. Useful for modifying the tokenizer. |
| `after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components. |
The `@spacy.registry.callbacks` decorator lets you register that function in the
`callbacks` [registry](/api/top-level#registry) under a given name. You can then
reference the function in a config block using the `@callbacks` key. If a block
contains a key starting with an `@`, it's interpreted as a reference to a
function. Because you've registered the function, spaCy knows how to create it
when you reference `"customize_language_data"` in your config. Here's an example
of a callback that runs before the `nlp` object is created and adds a few custom
tokenization rules to the defaults:
> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> ```
```python
### functions.py {highlight="3,6"}
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback():
    def customize_language_data(lang_cls):
        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
        return lang_cls

    return customize_language_data
```
<Infobox variant="warning">
Remember that a registered function should always be a function that spaCy
**calls to create something**. In this case, it **creates a callback**; it's
not the callback itself.
</Infobox>
Any registered function, in this case `create_callback`, can also take
**arguments** that can be **set by the config**. This lets you implement and
keep track of different configurations, without having to hack at your code. You
can choose any arguments that make sense for your use case. In this example,
we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
(boolean) for printing additional info when the function runs.
> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> extra_stop_words = ["ooh", "aah"]
> debug = true
> ```
```python
### functions.py {highlight="5,8-10"}
from typing import List
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback(extra_stop_words: List[str] = [], debug: bool = False):
    def customize_language_data(lang_cls):
        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
        lang_cls.Defaults.stop_words.update(extra_stop_words)
        if debug:
            print("Updated stop words and tokenizer suffixes")
        return lang_cls

    return customize_language_data
```
<Infobox title="Tip: Use Python type hints" emoji="💡">
spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
[type hints](https://docs.python.org/3/library/typing.html) and even
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `debug: bool` in the example above will
ensure that the value received as the argument `debug` is a boolean. If the
value can't be coerced into a boolean, spaCy will raise an error.
`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
error if it's not, for instance if your config defines `1` instead of `true`.
</Infobox>
With your `functions.py` defining additional code and the updated `config.cfg`,
you can now run [`spacy train`](/api/cli#train) and point the argument `--code`
to your Python file. Before loading the config, spaCy will import the
`functions.py` module and your custom functions will be registered.
```bash
### Training with custom code {wrap="true"}
python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
```
#### Example: Custom batch size schedule {#custom-code-schedule}
For example, let's say you've implemented your own batch size schedule to use
during training. The `@spacy.registry.schedules` decorator lets you register
that function in the `schedules` [registry](/api/top-level#registry) and assign
@@ -459,8 +553,6 @@ the functions need to be represented in the config. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
[`init config`](/api/cli#init-config).
<!-- TODO: this needs to be updated once we've decided on a workflow for "fill config" -->
```ini
### config.cfg (excerpt)
[training.batch_size]
@@ -469,31 +561,9 @@ start = 2
factor = 1.005
```
You can now run [`spacy train`](/api/cli#train) with the `config.cfg` and your
custom `functions.py` as the argument `--code`. Before loading the config, spaCy
will import the `functions.py` module and your custom functions will be
registered.
#### Example: Custom data reading and batching {#custom-code-readers-batchers}
```bash
### Training with custom code {wrap="true"}
python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
```
<Infobox title="Tip: Use Python type hints" emoji="💡">
spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
[type hints](https://docs.python.org/3/library/typing.html) and even
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `start: int` in the example above will
ensure that the value received as the argument `start` is an integer. If the
value can't be coerced into an integer, spaCy will raise an error.
`start: pydantic.StrictInt` will force the value to be an integer and raise an
error if it's not, for instance if your config defines a float.
</Infobox>
<!-- TODO: -->
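As a minimal sketch of the plumbing, a custom batcher can be registered in the
`batchers` [registry](/api/top-level#registry) and referenced from the config
via the `@batchers` key. The name `fixed_size_batches.v1` and its simple logic
are hypothetical, just to illustrate the pattern:

```python
### functions.py
from typing import Any, Callable, Iterable, Iterator, List
import spacy

@spacy.registry.batchers("fixed_size_batches.v1")
def create_batcher(size: int) -> Callable[[Iterable[Any]], Iterator[List[Any]]]:
    def batch_items(items: Iterable[Any]) -> Iterator[List[Any]]:
        batch = []
        for item in items:
            batch.append(item)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:  # don't silently drop the final partial batch
            yield batch

    return batch_items
```

You could then reference it in your config as
`@batchers = "fixed_size_batches.v1"` in the `[training.batcher]` block.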
### Wrapping PyTorch and TensorFlow {#custom-frameworks}
@@ -511,6 +581,35 @@ mattis pretium.
<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
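For instance, Thinc provides a
[`PyTorchWrapper`](https://thinc.ai/docs/usage-frameworks) that turns a PyTorch
module into a Thinc `Model`, which can then be composed with other layers and
used in a model architecture. A minimal sketch, not tied to any particular
spaCy component:

```python
from thinc.api import PyTorchWrapper
import torch.nn

# Wrap a PyTorch module so it exposes Thinc's Model API and can be
# combined with other Thinc layers
wrapped = PyTorchWrapper(torch.nn.Linear(16, 8))
```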
## Transfer learning {#transfer-learning}
### Using transformer models like BERT {#transformers}
spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
can use models implemented in a variety of frameworks. A transformer model is
just a statistical model, so the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
actually has very little work to do: it just has to provide a few functions that
do the required plumbing. It also provides a pipeline component,
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
you save the transformer outputs for later use.
<Project id="en_core_bert">
Try out a BERT-based model pipeline using this project template: swap in your
data, edit the settings and hyperparameters and train, evaluate, package and
visualize your model.
</Project>
For more details on how to integrate transformer models into your training
config and customize the implementations, see the usage guide on
[training transformers](/usage/transformers#training).
### Pretraining with spaCy {#pretraining}
<!-- TODO: document spacy pretrain -->
## Parallel Training with Ray {#parallel-training}
<!-- TODO: document Ray integration -->


@@ -24,10 +24,16 @@ $border-radius: 6px
&:last-child
margin: 0
&:first-child h4
margin-top: 0 !important
code
padding: 0
margin: 0
h4
margin-left: 0
p, ul, ol
font: inherit
margin-bottom: var(--spacing-sm)


@@ -373,7 +373,7 @@ body [id]:target
margin-right: -1.5em
margin-left: -1.5em
padding-right: 1.5em
padding-left: 1.65em
padding-left: 1.1em
&:empty:before
// Fix issue where empty lines would disappear