Update docs [ci skip]

Ines Montani 2020-10-02 13:24:33 +02:00
parent d2aa662ab2
commit df06f7a792
10 changed files with 88 additions and 44 deletions

View File

@ -8,8 +8,8 @@ source: spacy/language.py
Usually you'll load this once per process as `nlp` and pass the instance around
your application. The `Language` class is created when you call
[`spacy.load`](/api/top-level#spacy.load) and contains the shared vocabulary and
[language data](/usage/adding-languages), optional binary weights, e.g. provided
by a [trained pipeline](/models), and the
[language data](/usage/linguistic-features#language-data), optional binary
weights, e.g. provided by a [trained pipeline](/models), and the
[processing pipeline](/usage/processing-pipelines) containing components like
the tagger or parser that are called on a document in order. You can also add
your own processing pipeline components that take a `Doc` object, modify it and
@ -210,7 +210,9 @@ settings defined in the [`[initialize]`](/api/data-formats#config-initialize)
config block to set up the vocabulary, load in vectors and tok2vec weights and
pass optional arguments to the `initialize` methods implemented by pipeline
components or the tokenizer. This method is typically called automatically when
you run [`spacy train`](/api/cli#train).
you run [`spacy train`](/api/cli#train). See the usage guide on the
[config lifecycle](/usage/training#config-lifecycle) and
[initialization](/usage/training#initialization) for details.
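For illustration, a minimal sketch of calling `initialize` directly with a
`get_examples` callback – the blank pipeline, component and toy annotations
below are assumptions for the example, not requirements of the API:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("tagger")

def get_examples():
    # Return an iterable of Example objects; a single toy example here
    doc = nlp.make_doc("I like cats")
    return [Example.from_dict(doc, {"tags": ["PRON", "VERB", "NOUN"]})]

optimizer = nlp.initialize(get_examples)
```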
`get_examples` should be a function that returns an iterable of
[`Example`](/api/example) objects. The data examples can either be the full
@ -928,7 +930,7 @@ Serialize the current state to a binary string.
Load state from a binary string. Note that this method is commonly used via the
subclasses like `English` or `German` to make language-specific functionality
like the [lexical attribute getters](/usage/adding-languages#lex-attrs)
like the [lexical attribute getters](/usage/linguistic-features#language-data)
available to the loaded object.
> #### Example
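>
> A minimal sketch of the usual round trip, assuming `nlp` is an existing
> pipeline object:
>
> ```python
> from spacy.lang.en import English
>
> nlp_bytes = nlp.to_bytes()
> nlp2 = English()
> nlp2.from_bytes(nlp_bytes)
> ```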

View File

@ -130,8 +130,7 @@ applied to the `Doc` in order.
## Lemmatizer.lookup_lemmatize {#lookup_lemmatize tag="method"}
Lemmatize a token using a lookup-based approach. If no lemma is found, the
original string is returned. Languages can provide a
[lookup table](/usage/adding-languages#lemmatizer) via the `Lookups`.
original string is returned.
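For illustration, a hedged sketch of calling the method directly – it assumes
the [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
package is installed so the lookup tables can be loaded:

```python
import spacy

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
lemmatizer.initialize()  # loads the lookup tables from spacy-lookups-data
doc = nlp("leaves")
print(lemmatizer.lookup_lemmatize(doc[0]))  # e.g. ["leaf"]
```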
| Name | Description |
| ----------- | --------------------------------------------------- |

View File

@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
| `ent_id_` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~str~~ |
| `lemma` | Base form of the token, with no inflectional suffixes. ~~int~~ |
| `lemma_` | Base form of the token, with no inflectional suffixes. ~~str~~ |
| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~int~~ |
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~str~~ |
| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~int~~ |
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~str~~ |
| `lower` | Lowercase form of the token. ~~int~~ |
| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ |
| `shape` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~int~~ |
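For illustration, a quick sketch of inspecting a few of these attributes – the
values shown in the comments are approximate and depend on the language data:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup")
token = doc[5]           # "U.K."
print(token.lower_)      # "u.k."
print(token.shape_)      # "X.X."
print(token.norm_)       # normalized form, e.g. "u.k."
```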

View File

@ -22,9 +22,8 @@ like punctuation and special case rules from the
## Tokenizer.\_\_init\_\_ {#init tag="method"}
Create a `Tokenizer` to create `Doc` objects given unicode text. For examples
of how to construct a custom tokenizer with different tokenization rules, see
the
Create a `Tokenizer` to create `Doc` objects given unicode text. For examples of
how to construct a custom tokenizer with different tokenization rules, see the
[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
> #### Example
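>
> A minimal sketch, creating a blank tokenizer that only uses the shared vocab
> and no prefix, suffix or infix rules:
>
> ```python
> from spacy.tokenizer import Tokenizer
> from spacy.lang.en import English
>
> nlp = English()
> # Create a blank Tokenizer with just the English vocab
> tokenizer = Tokenizer(nlp.vocab)
> ```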
@ -121,10 +120,10 @@ if no suffix rules match.
## Tokenizer.add_special_case {#add_special_case tag="method"}
Add a special-case tokenization rule. This mechanism is also used to add custom
tokenizer exceptions to the language data. See the usage guide on
[adding languages](/usage/adding-languages#tokenizer-exceptions) and
[linguistic features](/usage/linguistic-features#special-cases) for more details
and examples.
tokenizer exceptions to the language data. See the usage guide on the
[language data](/usage/linguistic-features#language-data) and
[tokenizer special cases](/usage/linguistic-features#special-cases) for more
details and examples.
> #### Example
>
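> A minimal sketch – the `ORTH` values must concatenate to the original string:
>
> ```python
> from spacy.attrs import ORTH, NORM
> from spacy.lang.en import English
>
> nlp = English()
> special_case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
> nlp.tokenizer.add_special_case("don't", special_case)
> ```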

View File

@ -827,7 +827,7 @@ utilities.
### util.get_lang_class {#util.get_lang_class tag="function"}
Import and load a `Language` class. Allows lazy-loading
[language data](/usage/adding-languages) and importing languages using the
[language data](/usage/linguistic-features#language-data) and importing languages using the
two-letter language code. To add a language code for a custom language class,
you can register it using the [`@registry.languages`](/api/top-level#registry)
decorator.
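For illustration, a small sketch of looking up a language class by its
two-letter code:

```python
from spacy.util import get_lang_class

lang_cls = get_lang_class("en")  # lazily imports and returns the English class
nlp = lang_cls()
```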

View File

@ -30,7 +30,7 @@ import QuickstartModels from 'widgets/quickstart-models.js'
## Language support {#languages}
spaCy currently provides support for the following languages. You can help by
[improving the existing language data](/usage/adding-languages#language-data)
improving the existing [language data](/usage/linguistic-features#language-data)
and extending the tokenization patterns.
[See here](https://github.com/explosion/spaCy/issues/3056) for details on how to
contribute to development.
@ -83,55 +83,81 @@ To train a pipeline using the neutral multi-language class, you can set
import the `MultiLanguage` class directly, or call
[`spacy.blank("xx")`](/api/top-level#spacy.blank) for lazy-loading.
### Chinese language support {#chinese new=2.3}
### Chinese language support {#chinese new="2.3"}
The Chinese language class supports three word segmentation options, `char`,
`jieba` and `pkuseg`:
`jieba` and `pkuseg`.
> #### Manual setup
>
> ```python
> from spacy.lang.zh import Chinese
>
> # Character segmentation (default)
> nlp = Chinese()
>
> # Jieba
> cfg = {"segmenter": "jieba"}
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
>
> # PKUSeg with "default" model provided by pkuseg
> cfg = {"segmenter": "pkuseg"}
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
> nlp.tokenizer.initialize(pkuseg_model="default")
> ```
1. **Character segmentation:** Character segmentation is the default
segmentation option. It's enabled when you create a new `Chinese` language
class or call `spacy.blank("zh")`.
2. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word
segmentation with the tokenizer option `{"segmenter": "jieba"}`.
3. **PKUSeg**: As of spaCy v2.3.0, support for
[PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added to support
better segmentation for Chinese OntoNotes and the provided
[Chinese pipelines](/models/zh). Enable PKUSeg with the tokenizer option
`{"segmenter": "pkuseg"}`.
```ini
### config.cfg
[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "char"
```
<Infobox variant="warning">
| Segmenter | Description |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `char` | **Character segmentation:** The default segmentation option. It's enabled when you create a new `Chinese` language class or call `spacy.blank("zh")`. |
| `jieba` | **Jieba:** To use [Jieba](https://github.com/fxsjy/jieba) for word segmentation, set the option `segmenter` to `"jieba"`. |
| `pkuseg` | **PKUSeg:** As of spaCy v2.3.0, support for [PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added for better segmentation of Chinese OntoNotes and the provided [Chinese pipelines](/models/zh). Enable PKUSeg by setting the tokenizer option `segmenter` to `"pkuseg"`. |
In spaCy v3.0, the default Chinese word segmenter has switched from Jieba to
character segmentation.
<Infobox title="Changed in v3.0" variant="warning">
In v3.0, the default word segmenter has switched from Jieba to character
segmentation. Because the `pkuseg` segmenter depends on a model that can be
loaded from a file, the model is loaded on
[initialization](/usage/training#config-lifecycle) (typically before training).
This ensures that your packaged Chinese model doesn't depend on a local path at
runtime.
</Infobox>
<Accordion title="Details on spaCy's Chinese API">
The `initialize` method for the Chinese tokenizer class supports the following
config settings for loading pkuseg models:
config settings for loading `pkuseg` models:
| Name | Description |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
| `pkuseg_model` | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~ |
The initialization settings are typically provided in the
[training config](/usage/training#config) and the data is loaded in before
training and serialized with the model. This allows you to load the data from a
local path and save out your pipeline and config, without requiring the same
local path at runtime. See the usage guide on the
[config lifecycle](/usage/training#config-lifecycle) for more background on
this.
```ini
### config.cfg
[initialize]
[initialize.tokenizer]
pkuseg_model = "/path/to/model"
pkuseg_user_dict = "default"
```
You can also initialize the tokenizer for a blank language class by calling its
`initialize` method:
```python
### Examples
# Initialize the pkuseg tokenizer
@ -191,12 +217,13 @@ nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
### Japanese language support {#japanese new=2.3}
> #### Manual setup
>
> ```python
> from spacy.lang.ja import Japanese
>
> # Load SudachiPy with split mode A (default)
> nlp = Japanese()
>
> # Load SudachiPy with split mode B
> cfg = {"split_mode": "B"}
> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
@ -208,6 +235,13 @@ segmentation and part-of-speech tagging. The default Japanese language class and
the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
config can be used to configure the split mode to `A`, `B` or `C`.
```ini
### config.cfg
[nlp.tokenizer]
@tokenizers = "spacy.ja.JapaneseTokenizer"
split_mode = "A"
```
<Infobox variant="warning">
If you run into errors related to `sudachipy`, which is currently under active

View File

@ -895,6 +895,10 @@ the name. Registered functions can also take **arguments**, by the way, that
can be defined in the config as well – you can read more about this in the docs
on [training with custom code](/usage/training#custom-code).
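As a rough sketch – the registry name `"my_length_filter.v1"` and its argument
are made up for this example – a registered function whose argument is supplied
from the config:

```python
import spacy
from spacy.tokens import Doc

@spacy.registry.misc("my_length_filter.v1")
def create_length_filter(min_length: int):
    def is_long_enough(doc: Doc) -> bool:
        return len(doc) >= min_length
    return is_long_enough

# The config block that requests this function passes the argument along:
#
# [components.my_component.filter]
# @misc = "my_length_filter.v1"
# min_length = 5
```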
### Initializing components with data {#initialization}
<!-- TODO: -->
### Python type hints and pydantic validation {#type-hints new="3"}
spaCy's configs are powered by our machine learning library Thinc's

View File

@ -291,7 +291,7 @@ installed in the same environment – that's it.
| Entry point | Description |
| ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories, keyed by component name. Can be used to expose custom components defined by another package. |
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. |
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/linguistic-features#language-data), keyed by language shortcut. |
| `spacy_lookups` <Tag variant="new">2.2</Tag> | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
| [`spacy_displacy_colors`](#entry-points-displacy) <Tag variant="new">2.2</Tag> | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |

View File

@ -200,7 +200,7 @@ import Tokenization101 from 'usage/101/\_tokenization.md'
To learn more about how spaCy's tokenization rules work in detail, how to
**customize and replace** the default tokenizer and how to **add
language-specific data**, see the usage guides on
[adding languages](/usage/adding-languages) and
[language data](/usage/linguistic-features#language-data) and
[customizing the tokenizer](/usage/linguistic-features#tokenization).
</Infobox>
@ -479,7 +479,7 @@ find a "Suggest edits" link at the bottom of each page that points you to the
source.
Another way of getting involved is to help us improve the
[language data](/usage/adding-languages#language-data) – especially if you
[language data](/usage/linguistic-features#language-data) – especially if you
happen to speak one of the languages currently in
[alpha support](/usage/models#languages). Even adding simple tokenizer
exceptions, stop words or lemmatizer data can make a big difference. It will

View File

@ -216,7 +216,9 @@ The initialization settings are only loaded and used when
[`nlp.initialize`](/api/language#initialize) is called (typically right before
training). This allows you to set up your pipeline using local data resources
and custom functions, and preserve the information in your config but without
requiring it to be available at runtime
requiring it to be available at runtime. You can also use this mechanism to
provide data paths to custom pipeline components and custom tokenizers – see the
section on [custom initialization](#initialization) for details.
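For example, a rough sketch – the component name, `data_path` argument and file
format are all made up here – of a custom component whose `initialize` method
receives a local path from the `[initialize.components]` block, so the file only
has to exist when `nlp.initialize` runs:

```python
from pathlib import Path

import srsly
from spacy.language import Language
from spacy.tokens import Doc


class Gazetteer:
    def __init__(self):
        self.terms = set()

    def initialize(self, get_examples=None, *, nlp=None, data_path: str = ""):
        # Called from nlp.initialize() before training – the path doesn't
        # need to be available when the trained pipeline is loaded later
        if data_path:
            self.terms = set(srsly.read_json(Path(data_path)))

    def __call__(self, doc: Doc) -> Doc:
        return doc


@Language.factory("my_gazetteer")
def create_gazetteer(nlp: Language, name: str):
    return Gazetteer()
```

The matching config would then provide the path under
`[initialize.components.my_gazetteer]`, e.g. `data_path = "/path/to/terms.json"`.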
### Overwriting config settings on the command line {#config-overrides}
@ -815,9 +817,9 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:
return create_model(output_width)
```
<!-- TODO:
### Customizing the initialization {#initialization}
-->
<!-- TODO: -->
## Data utilities {#data}
@ -1135,7 +1137,11 @@ An easy way to create modified `Example` objects is to use the
capitalization changes, so only the `ORTH` values of the tokens will be
different between the original and augmented examples.
<!-- TODO: mention alignment -->
Note that if your data augmentation strategy involves changing the tokenization
(for instance, removing or adding tokens) and your training examples include
token-based annotations like the dependency parse or entity labels, you'll need
to take care to adjust the `Example` object so its annotations match and remain
valid.
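As a rough illustration – the text and entity offsets are made up – this is the
kind of adjustment involved. When the augmented text keeps the same
tokenization, the original annotations can be carried over after updating the
`ORTH` values; if tokens were added or removed, the other annotations would have
to be adjusted to match as well:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("I like London")
example = Example.from_dict(doc, {"entities": [(7, 13, "GPE")]})

# Lowercasing keeps the tokenization: update the ORTH values and build a new
# Example against a Doc of the modified text
example_dict = example.to_dict()
example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in example.reference]
lower_doc = nlp.make_doc(example.text.lower())
lower_example = Example.from_dict(lower_doc, example_dict)
```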
## Parallel & distributed training with Ray {#parallel-training}