Update docs [ci skip]

2026-01-06 00:39:25 +03:00 · 2020-08-05 20:29:53 +02:00 · 50311a4d37
commit 50311a4d37
parent c675746ca2
5 changed files with 301 additions and 124 deletions
--- a/website/docs/api/corpus.md
+++ b/website/docs/api/corpus.md
@ -6,30 +6,44 @@ source: spacy/gold/corpus.py
 new: 3
 ---
-This class manages annotated corpora and can read training and development
+This class manages annotated corpora and can be used for training and
-datasets in the [DocBin](/api/docbin) (`.spacy`) format.
+development datasets in the [DocBin](/api/docbin) (`.spacy`) format. To
 customize the data loading during training, you can register your own
 [data readers and batchers](/usage/training#custom-code-readers-batchers)
 ## Corpus.\_\_init\_\_ {#init tag="method"}
-Create a `Corpus`. The input data can be a file or a directory of files.
+Create a `Corpus` for iterating [Example](/api/example) objects from a file or
 directory of [`.spacy` data files](/api/data-formats#binary-training). The
 `gold_preproc` setting lets you specify whether to set up the `Example` object
 with gold-standard sentences and tokens for the predictions. Gold preprocessing
 helps the annotations align to the tokenization, and may result in sequences of
 more consistent length. However, it may reduce runtime accuracy due to
 train/test skew.
 > #### Example
 >
 > ```python
 > from spacy.gold import Corpus
 >
-> corpus = Corpus("./train.spacy", "./dev.spacy")
+> # With a single file
 > corpus = Corpus("./data/train.spacy")
 >
 > # With a directory
 > corpus = Corpus("./data", limit=10)
 > ```
-| Name    | Type         | Description                                                      |
+| Name            | Type         | Description                                                                                                                                 |
-| ------- | ------------ | ---------------------------------------------------------------- |
+| --------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------- |
-| `train` | str / `Path` | Training data (`.spacy` file or directory of `.spacy` files).    |
+| `path`          | str / `Path` | The directory or filename to read from.                                                                                                     |
-| `dev`   | str / `Path` | Development data (`.spacy` file or directory of `.spacy` files). |
+| _keyword-only_  |              |                                                                                                                                             |
-| `limit` | int          | Maximum number of examples returned. `0` for no limit (default). |
+|  `gold_preproc` | bool         | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`.                      |
 | `max_length`    | int          | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
 | `limit`         | int          | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit.                                                     |
-## Corpus.train_dataset {#train_dataset tag="method"}
+## Corpus.\_\_call\_\_ {#call tag="method"}
-Yield examples from the training data.
+Yield examples from the data.
 > #### Example
 >
@ -37,60 +51,12 @@ Yield examples from the training data.
 > from spacy.gold import Corpus
 > import spacy
 >
-> corpus = Corpus("./train.spacy", "./dev.spacy")
+> corpus = Corpus("./train.spacy")
 > nlp = spacy.blank("en")
-> train_data = corpus.train_dataset(nlp)
+> train_data = corpus(nlp)
 > ```
-| Name           | Type       | Description                                                                                                                                |
+| Name       | Type       | Description               |
-| -------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| ---------- | ---------- | ------------------------- |
-| `nlp`          | `Language` | The current `nlp` object.                                                                                                                  |
+| `nlp`      | `Language` | The current `nlp` object. |
-| _keyword-only_ |            |                                                                                                                                            |
+| **YIELDS** | `Example`  | The examples.             |
 | `shuffle`      | bool       | Whether to shuffle the examples. Defaults to `True`.                                                                                       |
 | `gold_preproc` | bool       | Whether to train on gold-standard sentences and tokens. Defaults to `False`.                                                               |
 | `max_length`   | int        | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. `0` for no limit (default).  |
 | **YIELDS**     | `Example`  | The examples.                                                                                                                              |
 ## Corpus.dev_dataset {#dev_dataset tag="method"}
 Yield examples from the development data.
 > #### Example
 >
 > ```python
 > from spacy.gold import Corpus
 > import spacy
 >
 > corpus = Corpus("./train.spacy", "./dev.spacy")
 > nlp = spacy.blank("en")
 > dev_data = corpus.dev_dataset(nlp)
 > ```
 | Name           | Type       | Description                                                                  |
 | -------------- | ---------- | ---------------------------------------------------------------------------- |
 | `nlp`          | `Language` | The current `nlp` object.                                                    |
 | _keyword-only_ |            |                                                                              |
 | `gold_preproc` | bool       | Whether to train on gold-standard sentences and tokens. Defaults to `False`. |
 | **YIELDS**     | `Example`  | The examples.                                                                |
 ## Corpus.count_train {#count_train tag="method"}
 Get the word count of all training examples.
 > #### Example
 >
 > ```python
 > from spacy.gold import Corpus
 > import spacy
 >
 > corpus = Corpus("./train.spacy", "./dev.spacy")
 > nlp = spacy.blank("en")
 > word_count = corpus.count_train(nlp)
 > ```
 | Name        | Type       | Description               |
 | ----------- | ---------- | ------------------------- |
 | `nlp`       | `Language` | The current `nlp` object. |
 | **RETURNS** | int        | The word count.           |
 <!-- TODO: document remaining methods? / decide which to document -->
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@ -4,7 +4,7 @@ menu:
  - ['spacy', 'spacy']
  - ['displacy', 'displacy']
  - ['registry', 'registry']
-  - ['Loaders & Batchers', 'loaders-batchers']
+  - ['Readers & Batchers', 'readers-batchers']
  - ['Data & Alignment', 'gold']
  - ['Utility Functions', 'util']
 ---
@ -303,6 +303,9 @@ factories.
 | `lookups`         | Registry for large lookup tables available via `vocab.lookups`.                                                                                                                                                                                   |
 | `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                            |
 | `assets`          |                                                                                                                                                                                                                                                   |
 | `callbacks`       | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training.                                                                                                                            |
 | `readers`         | Registry for training and evaluation [data readers](#readers-batchers).                                                                                                                                                                           |
 | `batchers`        | Registry for training and evaluation [data batchers](#readers-batchers).                                                                                                                                                                          |
 | `optimizers`      | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers).                                                                                                                                                            |
 | `schedules`       | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules).                                                                                                                                                              |
 | `layers`          | Registry for functions that create [layers](https://thinc.ai/docs/api-layers).                                                                                                                                                                    |
@ -334,10 +337,113 @@ See the [`Transformer`](/api/transformer) API reference and
 | [`span_getters`](/api/transformer#span_getters)             | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences.                                                                                                      |
 | [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
-## Training data loaders and batchers {#loaders-batchers new="3"}
+## Data readers and batchers {#readers-batchers new="3"}
 <!-- TODO: -->
 ### spacy.Corpus.v1 {#corpus tag="registered function" source="spacy/gold/corpus.py"}
 Registered function that creates a [`Corpus`](/api/corpus) of training or
 evaluation data. It takes the same arguments as the `Corpus` class and returns a
 callable that yields [`Example`](/api/example) objects. You can replace it with
 your own registered function in the [`@readers` registry](#registry) to
 customize the data loading and streaming.
 > #### Example config
 >
 > ```ini
 > [paths]
 > train = "corpus/train.spacy"
 >
 > [training.train_corpus]
 > @readers = "spacy.Corpus.v1"
 > path = ${paths:train}
 > gold_preproc = false
 > max_length = 0
 > limit = 0
 > ```
 | Name            | Type   | Description                                                                                                                                     |
 | --------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
 | `path`          | `Path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training).                    |
 |  `gold_preproc` | bool   | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. |
 | `max_length`    | int    | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit.     |
 | `limit`         | int    | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit.                                                         |
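The reader contract described above (a registered function that returns a callable, which in turn yields examples when given the `nlp` object) can be sketched in plain Python. This is only an illustration under stated assumptions: `create_reader_sketch` and its in-memory `records` argument are hypothetical stand-ins, and a real reader would load `.spacy` files from `path` and yield `Example` objects.

```python
from typing import Any, Callable, Iterator, List


def create_reader_sketch(records: List[Any], limit: int = 0) -> Callable[[Any], Iterator[Any]]:
    """Hypothetical reader factory: returns a callable that yields records."""

    def read(nlp: Any) -> Iterator[Any]:
        # A real reader would use `nlp` to build its examples; this sketch
        # ignores it and just replays the in-memory records.
        for i, record in enumerate(records):
            # Mirror the documented `limit` setting: 0 means no limit.
            if limit and i >= limit:
                break
            yield record

    return read


reader = create_reader_sketch(["doc one", "doc two", "doc three"], limit=2)
first_pass = list(reader(None))
second_pass = list(reader(None))  # the callable can be invoked again later
```

Because the factory returns a callable rather than a one-off generator, the data can be streamed again on each call.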
 ### Batchers {#batchers source="spacy/gold/batchers.py"}
 <!-- TODO: -->
 #### batch_by_words.v1 {#batch_by_words tag="registered function"}
 Create minibatches of roughly a given number of words. If any examples are
 longer than the specified batch length, they will appear in a batch by
 themselves, or be discarded if `discard_oversize` is set to `True`. The argument
 `docs` can be a list of strings, [`Doc`](/api/doc) objects or
 [`Example`](/api/example) objects.
 > #### Example config
 >
 > ```ini
 > [training.batcher]
 > @batchers = "batch_by_words.v1"
 > size = 100
 > tolerance = 0.2
 > discard_oversize = false
 > get_length = null
 > ```
 <!-- TODO: complete table -->
 | Name               | Type                   | Description                                                                                                                         |
 | ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
 | `size`             | `Iterable[int]` / int  | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
 | `tolerance`        | float                  |                                                                                                                                     |
 | `discard_oversize` | bool                   | Discard items that are longer than the specified batch length.                                                                      |
 | `get_length`       | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set.                     |
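To make the oversize behaviour described above concrete, here is a minimal greedy sketch in plain Python. It is not spaCy's implementation: the name `batch_by_words_sketch` is hypothetical, `size` is assumed to be a plain integer, and the `tolerance` setting is left out because its behaviour is not described here.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Sequence


def batch_by_words_sketch(
    items: Iterable[Sequence],
    size: int,
    discard_oversize: bool = False,
    get_length: Optional[Callable[[Sequence], int]] = None,
) -> Iterator[List[Sequence]]:
    """Greedy illustration only: group items into batches of at most `size` words."""
    # Fall back to the built-in len() if no length function is given.
    length_of = get_length if get_length is not None else len
    batch: List[Sequence] = []
    n_words = 0
    for item in items:
        n = length_of(item)
        if n > size:
            # Oversized item: discard it, or emit it in a batch by itself.
            if not discard_oversize:
                yield [item]
            continue
        if batch and n_words + n > size:
            # Adding the item would exceed the budget, so emit the batch.
            yield batch
            batch, n_words = [], 0
        batch.append(item)
        n_words += n
    if batch:
        yield batch


docs = [["a", "b"], ["c"], ["d", "e", "f", "g", "h"], ["i", "j"], ["k", "l"]]
kept = list(batch_by_words_sketch(docs, size=3))
dropped = list(batch_by_words_sketch(docs, size=3, discard_oversize=True))
```

In this sketch the five-word item is emitted on its own as soon as it is encountered, so it can appear before the batch that was still being filled.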
 #### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
 <!-- TODO: -->
 > #### Example config
 >
 > ```ini
 > [training.batcher]
 > @batchers = "batch_by_sequence.v1"
 > size = 32
 > get_length = null
 > ```
 <!-- TODO: complete table -->
 | Name         | Type                   | Description                                                                                                                         |
 | ------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
 | `size`       | `Iterable[int]` / int  | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
 | `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set.                     |
 #### batch_by_padded.v1 {#batch_by_padded tag="registered function"}
 <!-- TODO: -->
 > #### Example config
 >
 > ```ini
 > [training.batcher]
 > @batchers = "batch_by_padded.v1"
 > size = 100
 > buffer = TODO:
 > discard_oversize = false
 > get_length = null
 > ```
 | Name               | Type                   | Description                                                                                                                         |
 | ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
 | `size`             | `Iterable[int]` / int  | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
 | `buffer`           | int                    |                                                                                                                                     |
 | `discard_oversize` | bool                   | Discard items that are longer than the specified batch length.                                                                      |
 | `get_length`       | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set.                     |
 ## Training data and alignment {#gold source="spacy/gold"}
 ### gold.docs_to_json {#docs_to_json tag="function"}
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@ -5,8 +5,8 @@ menu:
  - ['Introduction', 'basics']
  - ['Quickstart', 'quickstart']
  - ['Config System', 'config']
  - ['Transfer Learning', 'transfer-learning']
  - ['Custom Models', 'custom-models']
  - ['Transfer Learning', 'transfer-learning']
  - ['Parallel Training', 'parallel-training']
  - ['Internal API', 'api']
 ---
@ -315,6 +315,10 @@ stop = 1000
 compound = 1.001
 ```
 ### Using variable interpolation {#config-interpolation}
 <!-- TODO: describe and come up with good example showing both values and sections -->
 ### Model architectures {#model-architectures}
 <!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
@ -384,41 +388,17 @@ still look good.
 </Accordion>
 ## Transfer learning {#transfer-learning}
 ### Using transformer models like BERT {#transformers}
 spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
 can use models implemented in a variety of frameworks. A transformer model is
 just a statistical model, so the
 [`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
 actually has very little work to do: it just has to provide a few functions that
 do the required plumbing. It also provides a pipeline component,
 [`Transformer`](/api/transformer), that lets you do multi-task learning and lets
 you save the transformer outputs for later use.
 <Project id="en_core_bert">
 Try out a BERT-based model pipeline using this project template: swap in your
 data, edit the settings and hyperparameters and train, evaluate, package and
 visualize your model.
 </Project>
 For more details on how to integrate transformer models into your training
 config and customize the implementations, see the usage guide on
 [training transformers](/usage/transformers#training).
 ### Pretraining with spaCy {#pretraining}
 <!-- TODO: document spacy pretrain -->
 ## Custom model implementations and architectures {#custom-models}
 <!-- TODO: intro, should summarise what spaCy v3 can do and that you can now use fully custom implementations, models defined in PyTorch and TF, etc. etc. -->
 ### Training with custom code {#custom-code}
 > ```bash
 > ### Example {wrap="true"}
 > $ python -m spacy train train.spacy dev.spacy config.cfg --code functions.py
 > ```
 The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument
 `--code` that points to a Python file. The file is imported before training and
 allows you to add custom functions and architectures to the function registry
@ -426,6 +406,120 @@ that can then be referenced from your `config.cfg`. This lets you train spaCy
 models with custom components, without having to re-implement the whole training
 workflow.
 #### Example: Modifying the nlp object {#custom-code-nlp-callbacks}
 For many use cases, you don't necessarily want to implement the whole `Language`
 subclass and language data from scratch – it's often enough to make a few small
 modifications, like adjusting the
 [tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or
 [language defaults](/api/language#defaults) like stop words. The config lets you
 provide three optional **callback functions** that give you access to the
 language class and `nlp` object at different points of the lifecycle:
 | Callback                  | Description                                                                                                                                                                              |
 | ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `before_creation`         | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to the [`Language.Defaults`](/api/language#defaults). |
 | `after_creation`          | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object. Useful for modifying the tokenizer.          |
 | `after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components.                                                |
 The `@spacy.registry.callbacks` decorator lets you register a function in the
 `callbacks` [registry](/api/top-level#registry) under a given name. You can then
 reference the function in a config block using the `@callbacks` key. If a block
 contains a key starting with an `@`, it's interpreted as a reference to a
 function. Because you've registered the function, spaCy knows how to create it
 when you reference `"customize_language_data"` in your config. Here's an example
 of a callback that runs before the `nlp` object is created and adds a few custom
 tokenization rules to the defaults:
 > #### config.cfg
 >
 > ```ini
 > [nlp.before_creation]
 > @callbacks = "customize_language_data"
 > ```
 ```python
 ### functions.py {highlight="3,6"}
 import spacy
@spacy.registry.callbacks("customize_language_data")
 def create_callback():
    def customize_language_data(lang_cls):
        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
        return lang_cls
    return customize_language_data
 ```
 <Infobox variant="warning">
 Remember that a registered function should always be a function that spaCy
 **calls to create something**. In this case, it **creates a callback** – it's
 not the callback itself.
 </Infobox>
 Any registered function – in this case `create_callback` – can also take
 **arguments** that can be **set by the config**. This lets you implement and
 keep track of different configurations, without having to hack at your code. You
 can choose any arguments that make sense for your use case. In this example,
 we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
 (boolean) for printing additional info when the function runs.
 > #### config.cfg
 >
 > ```ini
 > [nlp.before_creation]
 > @callbacks = "customize_language_data"
 > extra_stop_words = ["ooh", "aah"]
 > debug = true
 > ```
 ```python
 ### functions.py {highlight="5,8-10"}
 from typing import List
 import spacy
@spacy.registry.callbacks("customize_language_data")
 def create_callback(extra_stop_words: List[str] = [], debug: bool = False):
    def customize_language_data(lang_cls):
        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
         lang_cls.Defaults.stop_words.update(extra_stop_words)
        if debug:
            print("Updated stop words and tokenizer suffixes")
        return lang_cls
    return customize_language_data
 ```
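The two ideas above (a registered function creates the callback, and the other keys in the config block become its arguments) can be illustrated with a small registry written in plain Python. This is a simplified sketch, not how spaCy or Thinc resolve configs internally: `CALLBACKS`, `register` and the dictionary-based `settings` are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for a function registry, keyed by registered name.
CALLBACKS: Dict[str, Callable[..., Callable]] = {}


def register(name: str):
    """Store the decorated factory function under the given name."""
    def decorator(factory):
        CALLBACKS[name] = factory
        return factory
    return decorator


@register("customize_language_data")
def create_callback(suffix: str = "-+$"):
    # The registered function is a factory: it returns the actual callback.
    def customize(settings: Dict[str, Any]) -> Dict[str, Any]:
        settings["suffixes"] = settings.get("suffixes", ()) + (suffix,)
        return settings
    return customize


# Resolve a config-like block: the "@" key names the factory, and the
# remaining keys are passed to it as keyword arguments.
block = {"@callbacks": "customize_language_data", "suffix": "-+$"}
kwargs = {key: value for key, value in block.items() if not key.startswith("@")}
callback = CALLBACKS[block["@callbacks"]](**kwargs)
result = callback({"suffixes": ("'s",)})
```

The factory runs once when the block is resolved, and only the callback it returns is applied to the data afterwards.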
 <Infobox title="Tip: Use Python type hints" emoji="💡">
 spaCy's configs are powered by our machine learning library Thinc's
 [configuration system](https://thinc.ai/docs/usage-config), which supports
 [type hints](https://docs.python.org/3/library/typing.html) and even
 [advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
 using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
 function provides type hints, the values that are passed in will be checked
 against the expected types. For example, `debug: bool` in the example above will
 ensure that the value received as the argument `debug` is a boolean. If the
 value can't be coerced into a boolean, spaCy will raise an error.
 `debug: pydantic.StrictBool` will force the value to be a boolean and raise an
 error if it's not – for instance, if your config defines `1` instead of `true`.
 </Infobox>
 With your `functions.py` defining additional code and the updated `config.cfg`,
 you can now run [`spacy train`](/api/cli#train) and point the argument `--code`
 to your Python file. Before loading the config, spaCy will import the
 `functions.py` module and your custom functions will be registered.
 ```bash
 ### Training with custom code {wrap="true"}
 python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
 ```
 #### Example: Custom batch size schedule {#custom-code-schedule}
 For example, let's say you've implemented your own batch size schedule to use
 during training. The `@spacy.registry.schedules` decorator lets you register
 that function in the `schedules` [registry](/api/top-level#registry) and assign
@ -459,8 +553,6 @@ the functions need to be represented in the config. If your function defines
 **default argument values**, spaCy is able to auto-fill your config when you run
 [`init config`](/api/cli#init-config).
 <!-- TODO: this needs to be updated once we've decided on a workflow for "fill config" -->
 ```ini
 ### config.cfg (excerpt)
 [training.batch_size]
@ -469,31 +561,9 @@ start = 2
 factor = 1.005
 ```
-You can now run [`spacy train`](/api/cli#train) with the `config.cfg` and your
+#### Example: Custom data reading and batching {#custom-code-readers-batchers}
 custom `functions.py` as the argument `--code`. Before loading the config, spaCy
 will import the `functions.py` module and your custom functions will be
 registered.
-```bash
+<!-- TODO: -->
 ### Training with custom code {wrap="true"}
 python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
 ```
 <Infobox title="Tip: Use Python type hints" emoji="💡">
 spaCy's configs are powered by our machine learning library Thinc's
 [configuration system](https://thinc.ai/docs/usage-config), which supports
 [type hints](https://docs.python.org/3/library/typing.html) and even
 [advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
 using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
 function provides type hints, the values that are passed in will be checked
 against the expected types. For example, `start: int` in the example above will
 ensure that the value received as the argument `start` is an integer. If the
 value can't be coerced into an integer, spaCy will raise an error.
 `start: pydantic.StrictInt` will force the value to be an integer and raise an
 error if it's not – for instance, if your config defines a float.
 </Infobox>
 ### Wrapping PyTorch and TensorFlow {#custom-frameworks}
@ -511,6 +581,35 @@ mattis pretium.
 <!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
 ## Transfer learning {#transfer-learning}
 ### Using transformer models like BERT {#transformers}
 spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
 can use models implemented in a variety of frameworks. A transformer model is
 just a statistical model, so the
 [`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
 actually has very little work to do: it just has to provide a few functions that
 do the required plumbing. It also provides a pipeline component,
 [`Transformer`](/api/transformer), that lets you do multi-task learning and lets
 you save the transformer outputs for later use.
 <Project id="en_core_bert">
 Try out a BERT-based model pipeline using this project template: swap in your
 data, edit the settings and hyperparameters and train, evaluate, package and
 visualize your model.
 </Project>
 For more details on how to integrate transformer models into your training
 config and customize the implementations, see the usage guide on
 [training transformers](/usage/transformers#training).
 ### Pretraining with spaCy {#pretraining}
 <!-- TODO: document spacy pretrain -->
 ## Parallel Training with Ray {#parallel-training}
 <!-- TODO: document Ray integration -->
--- a/website/src/styles/aside.module.sass
+++ b/website/src/styles/aside.module.sass
@ -24,10 +24,16 @@ $border-radius: 6px
        &:last-child
            margin: 0
        &:first-child h4
            margin-top: 0 !important
        code
            padding: 0
            margin: 0
        h4
            margin-left: 0
    p, ul, ol
        font: inherit
        margin-bottom: var(--spacing-sm)
--- a/website/src/styles/layout.sass
+++ b/website/src/styles/layout.sass
@ -373,7 +373,7 @@ body [id]:target
    margin-right: -1.5em
    margin-left: -1.5em
    padding-right: 1.5em
-    padding-left: 1.65em
+    padding-left: 1.1em
    &:empty:before
        // Fix issue where empty lines would disappear