Update docs [ci skip]

This commit is contained in:
Ines Montani 2020-08-10 01:20:10 +02:00
parent 0832cdd443
commit 12052bd8f6
8 changed files with 86 additions and 65 deletions

View File

@ -48,8 +48,6 @@ features and a CNN with layer-normalized maxout.
### spacy.Tok2Vec.v1 {#Tok2Vec}
<!-- TODO: example config -->
> #### Example config
>
> ```ini
@ -57,18 +55,22 @@ features and a CNN with layer-normalized maxout.
> @architectures = "spacy.Tok2Vec.v1"
>
> [model.embed]
> @architectures = "spacy.CharacterEmbed.v1"
> # ...
>
> [model.encode]
> @architectures = "spacy.MaxoutWindowEncoder.v1"
> # ...
> ```
Construct a tok2vec model out of embedding and encoding subnetworks. See the
["Embed, Encode, Attend, Predict"](https://explosion.ai/blog/deep-learning-formula-nlp)
blog post for background.
| Name | Type | Description |
| -------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `embed` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. |
| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |
| Name | Type | Description |
| -------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `embed`  | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). |
| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). |
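The two sublayers compose sequentially: the encoder runs over the embedder's output. As a minimal stand-alone sketch of that composition (plain Python rather than Thinc's actual `chain` combinator; the toy `embed` and `encode` functions are placeholders, not real architectures):

```python
def chain(embed, encode):
    """Compose the two sublayers: encode runs on embed's output."""
    def model(docs):
        return encode(embed(docs))
    return model

# Toy stand-ins: embed maps each token to a 1-d "vector" (its length),
# encode "contextualizes" by doubling each value.
embed = lambda docs: [[len(tok) for tok in doc] for doc in docs]
encode = lambda rows: [[x * 2 for x in row] for row in rows]

tok2vec = chain(embed, encode)
print(tok2vec([["hello", "world"]]))  # [[10, 10]]
```

In the real architecture, `embed` and `encode` are the `[model.embed]` and `[model.encode]` blocks of the config, resolved from the registry.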
### spacy.Tok2VecListener.v1 {#Tok2VecListener}
@ -113,8 +115,6 @@ argument that connects to the shared `tok2vec` component in the pipeline.
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}
<!-- TODO: check example config -->
> #### Example config
>
> ```ini
@ -143,17 +143,15 @@ representation.
### spacy.CharacterEmbed.v1 {#CharacterEmbed}
<!-- TODO: check example config -->
> #### Example config
>
> ```ini
> [model]
> @architectures = "spacy.CharacterEmbed.v1"
> width = 64
> rows = 2000
> nM = 16
> nC = 4
> width = 128
> rows = 7000
> nM = 64
> nC = 8
> ```
Construct an embedded representation based on character embeddings, using a
@ -186,9 +184,9 @@ construct a single vector to represent the information.
> ```ini
> [model]
> @architectures = "spacy.MaxoutWindowEncoder.v1"
> width = 64
> width = 128
> window_size = 1
> maxout_pieces = 2
> maxout_pieces = 3
> depth = 4
> ```
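`maxout_pieces = 3` means each output unit takes the maximum over three separate linear projections of the input. A toy sketch of the maxout activation for a single output unit (hypothetical weights, plain Python rather than the actual Thinc layer):

```python
def maxout(x, piece_weights):
    """For one output unit, take the max over the linear pieces.

    x: input vector; piece_weights: one weight vector per piece.
    """
    return max(
        sum(w_i * x_i for w_i, x_i in zip(w, x))
        for w in piece_weights
    )

x = [1.0, -2.0]
pieces = [[0.5, 0.5], [1.0, 0.0], [0.0, -1.0]]  # maxout_pieces = 3
print(maxout(x, pieces))  # 2.0: the third piece gives (-1) * (-2) = 2
```

A full layer applies this per output dimension; the `width` setting controls how many such units there are.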
@ -254,8 +252,6 @@ architectures into your training config.
### spacy-transformers.TransformerModel.v1 {#TransformerModel}
<!-- TODO: description -->
> #### Example Config
>
> ```ini
@ -270,6 +266,8 @@ architectures into your training config.
> stride = 96
> ```
<!-- TODO: description -->
| Name | Type | Description |
| ------------------ | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name` | str | Any model name that can be loaded by [`transformers.AutoModel`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel). |
@ -309,7 +307,11 @@ a single token vector given zero or more wordpiece vectors.
> #### Example Config
>
> ```ini
> # TODO:
> [model]
> @architectures = "spacy.Tok2VecTransformer.v1"
> name = "albert-base-v2"
> tokenizer_config = {"use_fast": false}
> grad_factor = 1.0
> ```
Use a transformer as a [`Tok2Vec`](/api/tok2vec) layer directly. This does
@ -554,10 +556,6 @@ others, but may not be as accurate, especially if texts are short.
| `no_output_layer` | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
| `nO` | int | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
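The activation choice reflects the label scheme: `Softmax` normalizes the scores into a distribution over mutually exclusive classes, while `Logistic` squashes each label's score independently for the multilabel case. A quick stdlib sketch of the two (illustrative only, not spaCy's implementation):

```python
import math

def softmax(scores):
    """Mutually exclusive classes: outputs sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def logistic(scores):
    """Independent labels: each score squashed to (0, 1) on its own."""
    return [1 / (1 + math.exp(-s)) for s in scores]

probs = softmax([2.0, 1.0, 0.1])
print(sum(probs))       # 1.0 (up to rounding)
print(logistic([0.0]))  # [0.5]
```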
<!-- TODO:
### spacy.TextCatLowData.v1 {#TextCatLowData}
-->
## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}
An [`EntityLinker`](/api/entitylinker) component disambiguates textual mentions

View File

@ -438,7 +438,29 @@ will not be available.
| `--help`, `-h` | flag | Show help message and available arguments. |
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
<!-- TODO: document debug profile?-->
### debug profile {#debug-profile}
Profile which functions take the most time in a spaCy pipeline. Input should be
formatted as one JSON object per line with a key `"text"`. It can either be
provided as a JSONL file, or be read from `sys.stdin`. If no input file is
specified, the IMDB dataset is loaded via
[`ml_datasets`](https://github.com/explosion/ml_datasets).
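As a hypothetical sketch of the expected input, each line is a standalone JSON object with a `"text"` key (the filename and texts here are made up):

```python
import json

# Hypothetical sample texts to profile the pipeline on.
texts = ["This movie was great.", "Not my kind of film."]
with open("profile_input.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}) + "\n")
```

The resulting file can then be passed as the `inputs` argument, or piped to standard input with `-`.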
<Infobox title="New in v3.0" variant="warning">
The `profile` command is now available as a subcommand of `spacy debug`.
</Infobox>
```bash
$ python -m spacy debug profile [model] [inputs] [--n-texts]
```
| Argument | Type | Description |
| ----------------- | ----------------------------------------------------------------- | ------------------------------------------------------- |
| `model` | positional | A loadable spaCy model. |
| `inputs` | positional | Optional path to input file, or `-` for standard input. |
| `--n-texts`, `-n` | option | Maximum number of texts to use if available. Defaults to `10000`. |
### debug model {#debug-model}
@ -546,20 +568,20 @@ $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P
</Accordion>
| Argument | Type | Default | Description |
| ----------------------- | ---------- | ------- | ----------------------------------------------------------------------------------------------------- |
| `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `component`             | positional |         | Name of the pipeline component whose model should be analyzed.                                        |
| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. |
| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. |
| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. |
| `--gradients`, `-GRAD` | option | `False` | Show gradients of each layer. |
| `--attributes`, `-ATTR` | option | `False` | Show attributes of each layer. |
| `--print-step0`, `-P0` | option | `False` | Print model before training. |
| `--print-step1`, `-P1` | option | `False` | Print model after initialization. |
| `--print-step2`, `-P2` | option | `False` | Print model after training. |
| `--print-step3`, `-P3` | option | `False` | Print final predictions. |
| `--help`, `-h` | flag | | Show help message and available arguments. |
| Argument | Type | Description | Default |
| ----------------------- | ---------- | ----------------------------------------------------------------------------------------------------- | ------- |
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | |
| `component`             | positional | Name of the pipeline component whose model should be analyzed.                                        | |
| `--layers`, `-l` | option | Comma-separated names of layer IDs to print. | |
| `--dimensions`, `-DIM` | option | Show dimensions of each layer. | `False` |
| `--parameters`, `-PAR` | option | Show parameters of each layer. | `False` |
| `--gradients`, `-GRAD` | option | Show gradients of each layer. | `False` |
| `--attributes`, `-ATTR` | option | Show attributes of each layer. | `False` |
| `--print-step0`, `-P0` | option | Print model before training. | `False` |
| `--print-step1`, `-P1` | option | Print model after initialization. | `False` |
| `--print-step2`, `-P2` | option | Print model after training. | `False` |
| `--print-step3`, `-P3` | option | Print final predictions. | `False` |
| `--help`, `-h` | flag | Show help message and available arguments. | |
## Train {#train}

View File

@ -293,8 +293,6 @@ factories.
> return Model("custom", forward, dims={"nO": nO})
> ```
<!-- TODO: finish table -->
| Registry name | Description |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
@ -303,7 +301,7 @@ factories.
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `assets` | |
| `assets`          | Registry for data assets, knowledge bases, etc. |
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
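Underneath, each registry is a mapping from string names to functions, populated by a decorator. A minimal stand-alone sketch of that pattern (plain Python, not spaCy's actual `registry` object; the registered function and its name are made up):

```python
REGISTRIES = {"architectures": {}}

def register(registry_name, func_name):
    """Decorator: file the function under its string name."""
    def wrapper(func):
        REGISTRIES[registry_name][func_name] = func
        return func
    return wrapper

@register("architectures", "CustomModel.v1")
def create_custom_model(width: int):
    # A real registered function would return a Thinc Model.
    return {"architecture": "CustomModel.v1", "width": width}

# Config strings like @architectures = "CustomModel.v1" resolve via lookup:
fn = REGISTRIES["architectures"]["CustomModel.v1"]
print(fn(width=128))
```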

View File

@ -37,20 +37,19 @@ import Accordion from 'components/accordion.js'
<Accordion title="Does the order of pipeline components matter?" id="pipeline-components-order">
<!-- TODO: note on v3 tok2vec own model vs. upstream listeners -->
The statistical components like the tagger or parser are typically independent
and don't share any data between each other. For example, the named entity
recognizer doesn't use any features set by the tagger and parser, and so on.
This means that you can swap them, or remove single components from the pipeline
without affecting the others. However, components may share a "token-to-vector"
component like [`Tok2Vec`](/api/tok2vec) or [`Transformer`](/api/transformer).
In spaCy v2.x, the statistical components like the tagger or parser are
independent and don't share any data between themselves. For example, the named
entity recognizer doesn't use any features set by the tagger and parser, and so
on. This means that you can swap them, or remove single components from the
pipeline without affecting the others.
However, custom components may depend on annotations set by other components.
For example, a custom lemmatizer may need the part-of-speech tags assigned, so
it'll only work if it's added after the tagger. The parser will respect
pre-defined sentence boundaries, so if a previous component in the pipeline sets
them, its dependency predictions may be different. Similarly, it matters if you
add the [`EntityRuler`](/api/entityruler) before or after the statistical entity
Custom components may also depend on annotations set by other components. For
example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll
only work if it's added after the tagger. The parser will respect pre-defined
sentence boundaries, so if a previous component in the pipeline sets them, its
dependency predictions may be different. Similarly, it matters if you add the
[`EntityRuler`](/api/entityruler) before or after the statistical entity
recognizer: if it's added before, the entity recognizer will take the existing
entities into account when making predictions. The
[`EntityLinker`](/api/entitylinker), which resolves named entities to knowledge

View File

@ -371,7 +371,7 @@ that reference this variable.
### Model architectures {#model-architectures}
<!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
<!-- TODO: refer to architectures API: /api/architectures -->
### Metrics, training output and weighted scores {#metrics}

View File

@ -32,8 +32,6 @@ transformer pipeline component is available to spaCy.
$ pip install spacy-transformers
```
<!-- TODO: the text below has been copied from the spacy-transformers repo and needs to be updated and adjusted -->
## Runtime usage {#runtime}
Transformer models can be used as **drop-in replacements** for other types of
@ -99,9 +97,9 @@ evaluate, package and visualize your model.
</Project>
The `[components]` section in the [`config.cfg`](#TODO:) describes the pipeline
components and the settings used to construct them, including their model
implementation. Here's a config snippet for the
The `[components]` section in the [`config.cfg`](/api/data-formats#config)
describes the pipeline components and the settings used to construct them,
including their model implementation. Here's a config snippet for the
[`Transformer`](/api/transformer) component, along with matching Python code. In
this case, the `[components.transformer]` block describes the `transformer`
component:

View File

@ -249,7 +249,15 @@ $ python -m spacy convert ./training.json ./output
#### Training config {#migrating-training-config}
<!-- TODO: update once we have recommended "getting started with a new config" workflow -->
The easiest way to get started with a training config is to use the
[`init config`](/api/cli#init-config) command. You can start off with a blank
config for a new model, copy the config from an existing model, or auto-fill a
partial config like a starter config generated by our
[quickstart widget](/usage/training#quickstart).
```bash
python -m spacy init config ./config.cfg --lang en --pipeline tagger,parser
```
```diff
### {wrap="true"}

View File

@ -4,12 +4,10 @@ import { StaticQuery, graphql } from 'gatsby'
import { Quickstart, QS } from '../components/quickstart'
const DEFAULT_LANG = 'en'
const MODELS_SMALL = { en: 'roberta-base-small' }
const MODELS_LARGE = { en: 'roberta-base' }
const COMPONENTS = ['tagger', 'parser', 'ner', 'textcat']
const COMMENT = `# This is an auto-generated partial config for training a model.
# TODO: instructions for how to fill and use it`
# To use it for training, auto-fill it with all default values.
# python -m spacy init config config.cfg --base base_config.cfg`
const DATA = [
{
id: 'lang',