Update docs [ci skip]
parent 1346ee06d4 · commit c288dba8e7
@@ -49,11 +49,11 @@ contain arbitrary whitespace. Alignment into the original string is preserved.

> assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
> ```

| Name        | Type        | Description                                                                       |
| ----------- | ----------- | --------------------------------------------------------------------------------- |
| `text`      | str         | The text to be processed.                                                         |
| `disable`   | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Doc`       | A container for accessing the annotations.                                        |
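A minimal usage sketch of calling the `nlp` object directly with the parameters above (the `en_core_web_sm` package name is an assumption for illustration, not part of this excerpt):

```python
# Hypothetical sketch: process a text and disable a pipeline component for this call.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this trained model package is installed
doc = nlp("An example sentence.", disable=["parser"])
print([(token.text, token.tag_) for token in doc])
```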

## Language.pipe {#pipe tag="method"}
@@ -112,14 +112,14 @@ Evaluate a model's pipeline components.

> print(scores)
> ```

| Name                                         | Type                            | Description                                                                            |
| -------------------------------------------- | ------------------------------- | -------------------------------------------------------------------------------------- |
| `examples`                                   | `Iterable[Example]`             | A batch of [`Example`](/api/example) objects to learn from.                            |
| `verbose`                                    | bool                            | Print debugging information.                                                           |
| `batch_size`                                 | int                             | The batch size to use.                                                                 |
| `scorer`                                     | `Scorer`                        | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created.  |
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]`               | Config parameters for specific pipeline components, keyed by component name.          |
| **RETURNS**                                  | `Dict[str, Union[float, Dict]]` | A dictionary of evaluation scores.                                                     |
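A short sketch of how the method might be called, assuming a loaded pipeline `nlp` and reference annotations supplied as `Example` objects (the import path for `Example` has moved between v3 previews, so treat it as an assumption):

```python
# Hypothetical sketch: build a couple of Example objects and evaluate the pipeline on them.
from spacy.training import Example  # may live elsewhere (e.g. spacy.gold) in earlier v3 previews

texts_and_annots = [
    ("Apple is looking at buying U.K. startup", {"entities": [(0, 5, "ORG")]}),
]
examples = []
for text, annots in texts_and_annots:
    examples.append(Example.from_dict(nlp.make_doc(text), annots))

scores = nlp.evaluate(examples, verbose=False)
print(scores)
```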

## Language.begin_training {#begin_training tag="method"}
@@ -418,11 +418,70 @@ available to the loaded object.

## Class attributes {#class-attributes}

| Name       | Type  | Description                                                                                     |
| ---------- | ----- | ----------------------------------------------------------------------------------------------- |
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline.      |
| `lang`     | str   | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |

## Defaults {#defaults}

The following attributes can be set on the `Language.Defaults` class to
customize the default language data:

> #### Example
>
> ```python
> from spacy.language import Language
> from spacy.lang.tokenizer_exceptions import URL_MATCH
> from thinc.api import Config
>
> DEFAULT_CONFIG = """
> [nlp.tokenizer]
> @tokenizers = "MyCustomTokenizer.v1"
> """
>
> class Defaults(Language.Defaults):
>     stop_words = set()
>     tokenizer_exceptions = {}
>     prefixes = tuple()
>     suffixes = tuple()
>     infixes = tuple()
>     token_match = None
>     url_match = URL_MATCH
>     lex_attr_getters = {}
>     syntax_iterators = {}
>     writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
>     config = Config().from_str(DEFAULT_CONFIG)
> ```

| Name                              | Description                                                                                                                                                                                                            |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `stop_words`                      | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`][stop_words.py]                                                                                                                       |
| `tokenizer_exceptions`            | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`][de/tokenizer_exceptions.py]                                                                     |
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`punctuation.py`][punctuation.py]                                                                                                         |
| `token_match`                     | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`][fr/tokenizer_exceptions.py]                                               |
| `url_match`                       | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`][tokenizer_exceptions.py]                                              |
| `lex_attr_getters`                | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`][lex_attrs.py]                                                                                           |
| `syntax_iterators`                | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).<br />**Example:** [`syntax_iterators.py`][syntax_iterators.py] |
| `writing_system`                  | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to `{"direction": "ltr", "has_case": True, "has_letters": True}`.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] |
| `config`                          | Default [config](/usage/training#config) added to `nlp.config`. This can include references to custom tokenizers or lemmatizers.<br />**Example:** [`zh/__init__.py`][zh/__init__.py]                                  |

[stop_words.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
[tokenizer_exceptions.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/tokenizer_exceptions.py
[de/tokenizer_exceptions.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
[fr/tokenizer_exceptions.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/tokenizer_exceptions.py
[punctuation.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
[lex_attrs.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[zh/__init__.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/zh/__init__.py
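To see how these defaults surface at runtime, here is a small sketch (assuming only spaCy itself is installed; no trained model is needed):

```python
# Minimal sketch: values defined on a Defaults class end up on the nlp object,
# e.g. the writing system info is exposed via Vocab.writing_system.
from spacy.lang.en import English

nlp = English()
print(nlp.vocab.writing_system)   # {'direction': 'ltr', 'has_case': True, 'has_letters': True}
print(len(nlp.Defaults.stop_words) > 0)  # the shipped English defaults include stop words
```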

## Serialization fields {#serialization-fields}
@@ -8,12 +8,10 @@ makes the data easy to update and extend.

The **shared language data** in the directory root includes rules that can be
generalized across languages – for example, rules for basic punctuation, emoji,
emoticons and single-letter abbreviations. The **individual language data** in a
submodule contains rules that are only relevant to a particular language. It
also takes care of putting together all components and creating the `Language`
subclass – for example, `English` or `German`.

> ```python
> from spacy.lang.en import English
@@ -23,27 +21,28 @@ together all components and creating the `Language` subclass – for example,

> nlp_de = German() # Includes German data
> ```

<!-- TODO: upgrade graphic

![Language data architecture](../../images/language_data.svg)

-->

<!-- TODO: remove this table in favor of more specific Language.Defaults docs in linguistic features? -->

| Name                                                                               | Description                                                                                                                                               |
| ---------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Stop words**<br />[`stop_words.py`][stop_words.py]                               | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.".                            |
| **Punctuation rules**<br />[`punctuation.py`][punctuation.py]                      | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes.       |
| **Character classes**<br />[`char_classes.py`][char_classes.py]                    | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons.                                            |
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py]                         | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred".              |
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py]             | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).  |
| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data]                     | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was".                                              |

[stop_words.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
[tokenizer_exceptions.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
[punctuation.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
[char_classes.py]:
@@ -52,8 +51,4 @@ together all components and creating the `Language` subclass – for example,

  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
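For a quick look at some of this data in action, a small sketch (assuming only spaCy itself is installed; a blank pipeline is enough, no trained model required):

```python
# Minimal sketch: stop words and lexical attributes from the language data are
# applied even by a blank pipeline without a trained model.
from spacy.lang.en import English

nlp = English()
doc = nlp("I like ten pizzas")
print([(token.text, token.is_stop, token.like_num) for token in doc])
```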
@@ -602,7 +602,95 @@ import Tokenization101 from 'usage/101/\_tokenization.md'

<Tokenization101 />

<Accordion title="Algorithm details: How spaCy's tokenizer works" id="how-tokenizer-works" spaced>

spaCy introduces a novel tokenization algorithm that gives a better balance
between performance, ease of definition, and ease of alignment into the original
string.

After consuming a prefix or suffix, we consult the special cases again. We want
the special cases to handle things like "don't" in English, and we want the same
rule to work for "(don't)!". We do this by splitting off the open bracket, then
the exclamation, then the close bracket, and finally matching the special case.
Here's an implementation of the algorithm in Python, optimized for readability
rather than performance:

```python
def tokenizer_pseudo_code(
    text,
    special_cases,
    prefix_search,
    suffix_search,
    infix_finditer,
    token_match,
    url_match
):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            while prefix_search(substring) or suffix_search(substring):
                if token_match(substring):
                    tokens.append(substring)
                    substring = ""
                    break
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ""
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            if token_match(substring):
                tokens.append(substring)
                substring = ""
            elif url_match(substring):
                tokens.append(substring)
                substring = ""
            elif substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""
            elif list(infix_finditer(substring)):
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ""
            elif substring:
                tokens.append(substring)
                substring = ""
        tokens.extend(reversed(suffixes))
    return tokens
```

The algorithm can be summarized as follows:

1. Iterate over whitespace-separated substrings.
2. Look for a token match. If there is a match, stop processing and keep this
   token.
3. Check whether we have an explicitly defined special case for this substring.
   If we do, use it.
4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
   so that the token match and special cases always get priority.
5. If we didn't consume a prefix, try to consume a suffix and then go back to
   #2.
6. If we can't consume a prefix or a suffix, look for a URL match.
7. If there's no URL match, then look for a special case.
8. Look for "infixes" – stuff like hyphens etc. and split the substring into
   tokens on all infixes.
9. Once we can't consume any more of the string, handle it as a single token.

</Accordion>

**Global** and **language-specific** tokenizer data is supplied via the language
data in
@@ -613,15 +701,6 @@ The prefixes, suffixes and infixes mostly define punctuation rules – for

example, when to split off periods (at the end of a sentence), and when to leave
tokens containing periods intact (abbreviations like "U.S.").
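A quick sketch of that behavior (assuming only spaCy itself is installed; a blank English pipeline is enough to exercise the rules):

```python
# Minimal sketch: the sentence-final period is split off, while the period
# inside the abbreviation "U.S." stays attached to its token.
from spacy.lang.en import English

nlp = English()
print([t.text for t in nlp("I live in the U.S. It is big.")])
```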

<Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">

Tokenization rules that are specific to one language, but can be **generalized
@@ -637,6 +716,14 @@ subclass.

---

<!--

### Customizing the tokenizer {#tokenizer-custom}

TODO: rewrite the docs on custom tokenization in a more user-friendly order, including details on how to integrate a fully custom tokenizer, representing a tokenizer in the config etc.

-->

### Adding special case tokenization rules {#special-cases}

Most domains have at least some idiosyncrasies that require custom tokenization
@@ -677,88 +764,6 @@ nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])

assert len(nlp("...gimme...?")) == 1
```

#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}

A working implementation of the pseudo-code above is available for debugging as
@@ -766,6 +771,17 @@ A working implementation of the pseudo-code above is available for debugging as

tuples showing which tokenizer rule or pattern was matched for each token. The
tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:

> #### Expected output
>
> ```
> " PREFIX
> Let SPECIAL-1
> 's SPECIAL-2
> go TOKEN
> ! SUFFIX
> " SUFFIX
> ```

```python
### {executable="true"}
from spacy.lang.en import English
@@ -777,13 +793,6 @@ tok_exp = nlp.tokenizer.explain(text)

assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
    print(t[1], "\\t", t[0])
```

### Customizing spaCy's Tokenizer class {#native-tokenizers}
@@ -1437,3 +1446,73 @@ print("After:", [sent.text for sent in doc.sents])

import LanguageData101 from 'usage/101/\_language-data.md'

<LanguageData101 />

### Creating a custom language subclass {#language-subclass}

If you want to customize multiple components of the language data or add support
for a custom language or domain-specific "dialect", you can also implement your
own language subclass. The subclass should define two attributes: the `lang`
(unique language code) and the `Defaults` defining the language data. For an
overview of the available attributes that can be overwritten, see the
[`Language.Defaults`](/api/language#defaults) documentation.

```python
### {executable="true"}
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

nlp1 = English()
nlp2 = CustomEnglish()

print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
```

The [`@spacy.registry.languages`](/api/top-level#registry) decorator lets you
register a custom language class and assign it a string name. This means that
you can call [`spacy.blank`](/api/top-level#spacy.blank) with your custom
language name, and even train models with it and refer to it in your
[training config](/usage/training#config).

> #### Config usage
>
> After registering your custom language class using the `languages` registry,
> you can refer to it in your [training config](/usage/training#config). This
> means spaCy will train your model using the custom subclass.
>
> ```ini
> [nlp]
> lang = "custom_en"
> ```
>
> In order to resolve `"custom_en"` to your subclass, the registered function
> needs to be available during training. You can load a Python file containing
> the code using the `--code` argument:
>
> ```bash
> ### {wrap="true"}
> $ python -m spacy train train.spacy dev.spacy config.cfg --code code.py
> ```

```python
### Registering a custom language {highlight="7,12-13"}
import spacy
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

@spacy.registry.languages("custom_en")
class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

# This now works! 🎉
nlp = spacy.blank("custom_en")
```
@@ -618,7 +618,9 @@ mattis pretium.

[FastAPI](https://fastapi.tiangolo.com/) is a modern high-performance framework
for building REST APIs with Python, based on Python
[type hints](https://fastapi.tiangolo.com/python-types/). It's become a popular
library for serving machine learning models and you can use it in your spaCy
projects to quickly serve up a trained model and make it available behind a REST
API.
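The snippet below is only a rough sketch of the idea (not the fuller example the TODO below refers to); it assumes `fastapi`, `uvicorn` and a trained model package such as `en_core_web_sm` are installed:

```python
# Hypothetical sketch: a tiny FastAPI app that loads a spaCy model once at import
# time and exposes a single endpoint returning entities for a posted text.
from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_core_web_sm")  # assumed model package

class Request(BaseModel):
    text: str

@app.post("/entities")
def entities(req: Request):
    doc = nlp(req.text)
    return [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
```

Run it with e.g. `uvicorn main:app --workers 2` (assuming the file is saved as `main.py`) and post a JSON body like `{"text": "Apple is looking at buying U.K. startup"}` to `/entities`.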

```python
# TODO: show an example that addresses some of the main concerns for serving ML (workers etc.)
@@ -74,7 +74,7 @@ When you train a model using the [`spacy train`](/api/cli#train) command, you'll

see a table showing metrics after each pass over the data. Here's what those
metrics mean:

<!-- TODO: update table below and include note about scores in config -->

| Name       | Description                                                                                         |
| ---------- | --------------------------------------------------------------------------------------------------- |
@@ -116,7 +116,7 @@ integrate custom models and architectures, written in your framework of choice.

Some of the main advantages and features of spaCy's training config are:

- **Structured sections.** The config is grouped into sections, and nested
  sections are defined using the `.` notation. For example, `[components.ner]`
  defines the settings for the pipeline's named entity recognizer. The config
  can be loaded as a Python dict.
- **References to registered functions.** Sections can refer to registered
@@ -136,10 +136,8 @@ Some of the main advantages and features of spaCy's training config are:

  Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
  config which types of data to expect.

```ini
https://github.com/explosion/spaCy/blob/develop/spacy/default_config.cfg
```

Under the hood, the config is parsed into a dictionary. It's divided into
@@ -151,11 +149,12 @@ not just define static settings, but also construct objects like architectures,

schedules, optimizers or any other custom components. The main top-level
sections of a config file are:

| Section       | Description                                                                                                           |
| ------------- | -------------------------------------------------------------------------------------------------------------------- |
| `training`    | Settings and controls for the training and evaluation process.                                                       |
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining).                                   |
| `nlp`         | Definition of the `nlp` object, its tokenizer and [processing pipeline](/docs/processing-pipelines) component names. |
| `components`  | Definitions of the [pipeline components](/docs/processing-pipelines) and their models.                               |
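As a rough sketch of what "parsed into a dictionary" means in practice (assuming `thinc` is installed, which spaCy depends on), a config string can be loaded and read back like nested dicts; the section names below simply mirror the table above:

```python
# Minimal sketch: parse a config string with thinc's Config and access values
# like a nested dictionary.
from thinc.api import Config

cfg_string = """
[nlp]
lang = "en"

[training]
seed = 0
"""
config = Config().from_str(cfg_string)
print(list(config.keys()))     # top-level sections, e.g. ['nlp', 'training']
print(config["nlp"]["lang"])   # 'en'
```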

<Infobox title="Config format and settings" emoji="📖">
@@ -176,16 +175,16 @@ a consistent format. There are no command-line arguments that need to be set,

and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
hard-coded in a config file, or **system-dependent settings**.

For cases like this, you can set additional command-line options starting with
`--` that correspond to the config section and value to override. For example,
`--training.batch_size 128` sets the `batch_size` value in the `[training]`
block to `128`.

```bash
$ python -m spacy train train.spacy dev.spacy config.cfg
--training.batch_size 128 --nlp.vectors /path/to/vectors
```

Only existing sections and values in the config can be overwritten. At the end
@@ -14,4 +14,20 @@ menu:

## Backwards Incompatibilities {#incompat}

### Removed deprecated methods, attributes and arguments {#incompat-removed}

The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been deprecated for quite a while now and many would
previously raise errors. Many of them were also mostly internals. If you've been
working with more recent versions of spaCy v2.x, it's unlikely that your code
relied on them.

| Class                 | Removed                                                 |
| --------------------- | ------------------------------------------------------- |
| [`Doc`](/api/doc)     | `Doc.tokens_from_list`, `Doc.merge`                     |
| [`Span`](/api/span)   | `Span.merge`, `Span.upper`, `Span.lower`, `Span.string` |
| [`Token`](/api/token) | `Token.string`                                          |
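For instance, code that used the long-deprecated `Doc.merge`/`Span.merge` can switch to the retokenizer context manager, which has been the recommended API since v2.1 (a sketch, assuming a `Doc` with at least two tokens):

```python
# Hypothetical sketch: merge the first two tokens with the retokenizer instead
# of the removed Doc.merge / Span.merge.
import spacy

nlp = spacy.blank("en")
doc = nlp("New York is busy")
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])
print([t.text for t in doc])  # ['New York', 'is', 'busy']
```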

<!-- TODO: complete (see release notes Dropbox Paper doc) -->

## Migrating from v2.x {#migrating}