Merge branch 'develop' into nightly.spacy.io
Commit 922250ca58
@@ -48,8 +48,6 @@ features and a CNN with layer-normalized maxout.

 ### spacy.Tok2Vec.v1 {#Tok2Vec}

-<!-- TODO: example config -->
-
 > #### Example config
 >
 > ```ini
@@ -57,18 +55,22 @@ features and a CNN with layer-normalized maxout.
 > @architectures = "spacy.Tok2Vec.v1"
 >
 > [model.embed]
+> @architectures = "spacy.CharacterEmbed.v1"
+> # ...
 >
 > [model.encode]
+> @architectures = "spacy.MaxoutWindowEncoder.v1"
+> # ...
 > ```

 Construct a tok2vec model out of embedding and encoding subnetworks. See the
 ["Embed, Encode, Attend, Predict"](https://explosion.ai/blog/deep-learning-formula-nlp)
 blog post for background.

 | Name | Type | Description |
 | ---- | ---- | ----------- |
-| `embed` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. |
+| `embed` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). |
-| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |
+| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). |

 ### spacy.Tok2VecListener.v1 {#Tok2VecListener}

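The `embed`/`encode` contract documented in this hunk can be sketched in Thinc terms. The `build_tok2vec` helper below is hypothetical (spaCy's registered function does more wiring); it only illustrates that the architecture is a composition of the two subnetworks:

```python
# Hypothetical sketch of the spacy.Tok2Vec.v1 contract: compose an
# embedding subnetwork with an encoding subnetwork via Thinc's `chain`.
# spaCy's real implementation adds more wiring; this shows the shapes only.
from thinc.api import Model, chain

def build_tok2vec(
    embed: Model,   # List[Doc]      -> List[Floats2d]
    encode: Model,  # List[Floats2d] -> List[Floats2d]
) -> Model:
    # The composed model maps List[Doc] -> List[Floats2d]: one row of
    # `width` floats per token, with context mixed in by the encoder.
    return chain(embed, encode)
```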
@@ -113,8 +115,6 @@ argument that connects to the shared `tok2vec` component in the pipeline.

 ### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}

-<!-- TODO: check example config -->
-
 > #### Example config
 >
 > ```ini
@@ -143,17 +143,15 @@ representation.

 ### spacy.CharacterEmbed.v1 {#CharacterEmbed}

-<!-- TODO: check example config -->
-
 > #### Example config
 >
 > ```ini
 > [model]
 > @architectures = "spacy.CharacterEmbed.v1"
-> width = 64
+> width = 128
-> rows = 2000
+> rows = 7000
-> nM = 16
+> nM = 64
-> nC = 4
+> nC = 8
 > ```

 Construct an embedded representation based on character embeddings, using a
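The new CharacterEmbed defaults above can be inspected by parsing the block with Thinc's `Config`; a minimal sketch (resolving it into a real layer additionally requires the architecture to be registered, e.g. by importing spaCy):

```python
# Minimal sketch: parse the example [model] block above with Thinc's
# Config. Values are parsed into native types, so the hyperparameters
# can be read back programmatically.
from thinc.api import Config

CONFIG = """
[model]
@architectures = "spacy.CharacterEmbed.v1"
width = 128
rows = 7000
nM = 64
nC = 8
"""

cfg = Config().from_str(CONFIG)
# nC characters per word, each embedded into nM dimensions:
print(cfg["model"]["nC"], "chars x", cfg["model"]["nM"], "dims")
```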
@@ -186,9 +184,9 @@ construct a single vector to represent the information.
 > ```ini
 > [model]
 > @architectures = "spacy.MaxoutWindowEncoder.v1"
-> width = 64
+> width = 128
 > window_size = 1
-> maxout_pieces = 2
+> maxout_pieces = 3
 > depth = 4
 > ```

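A quick back-of-the-envelope reading of the encoder settings above (my reading, not taken from spaCy's source): with `window_size = 1` and `depth = 4`, each token ends up seeing roughly four tokens of context on each side, since every stacked layer widens the receptive field by one window.

```python
# Back-of-the-envelope: each of `depth` stacked convolutional layers adds
# `window_size` tokens of context per side, so the effective receptive
# field grows linearly with depth.
window_size = 1
depth = 4
per_side = depth * window_size
print(f"{per_side} tokens of context per side, {2 * per_side + 1} total")
```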
@@ -254,8 +252,6 @@ architectures into your training config.

 ### spacy-transformers.TransformerModel.v1 {#TransformerModel}

-<!-- TODO: description -->
-
 > #### Example Config
 >
 > ```ini
@@ -270,6 +266,8 @@ architectures into your training config.
 > stride = 96
 > ```

+<!-- TODO: description -->
+
 | Name | Type | Description |
 | ---- | ---- | ----------- |
 | `name` | str | Any model name that can be loaded by [`transformers.AutoModel`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel). |
@@ -309,7 +307,11 @@ a single token vector given zero or more wordpiece vectors.
 > #### Example Config
 >
 > ```ini
-> # TODO:
+> [model]
+> @architectures = "spacy.Tok2VecTransformer.v1"
+> name = "albert-base-v2"
+> tokenizer_config = {"use_fast": false}
+> grad_factor = 1.0
 > ```

 Use a transformer as a [`Tok2Vec`](/api/tok2vec) layer directly. This does
@@ -554,10 +556,6 @@ others, but may not be as accurate, especially if texts are short.
 | `no_output_layer` | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
 | `nO` | int | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |

-<!-- TODO:
-### spacy.TextCatLowData.v1 {#TextCatLowData}
--->
-
 ## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}

 An [`EntityLinker`](/api/entitylinker) component disambiguates textual mentions
@@ -438,7 +438,29 @@ will not be available.
 | `--help`, `-h` | flag | Show help message and available arguments. |
 | overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |

-<!-- TODO: document debug profile?-->
+### debug profile {#debug-profile}
+
+Profile which functions take the most time in a spaCy pipeline. Input should be
+formatted as one JSON object per line with a key `"text"`. It can either be
+provided as a JSONL file, or be read from `sys.stdin`. If no input file is
+specified, the IMDB dataset is loaded via
+[`ml_datasets`](https://github.com/explosion/ml_datasets).
+
+<Infobox title="New in v3.0" variant="warning">
+
+The `profile` command is now available as a subcommand of `spacy debug`.
+
+</Infobox>
+
+```bash
+$ python -m spacy debug profile [model] [inputs] [--n-texts]
+```
+
+| Argument | Type | Description |
+| -------- | ---- | ----------- |
+| `model` | positional | A loadable spaCy model. |
+| `inputs` | positional | Optional path to input file, or `-` for standard input. |
+| `--n-texts`, `-n` | option | Maximum number of texts to use if available. Defaults to `10000`. |

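Since `debug profile` expects one JSON object per line with a `"text"` key, here is a minimal sketch of producing a compatible file (the file name is just an example):

```python
# Write profiling inputs in the JSONL shape `debug profile` reads:
# one {"text": ...} object per line.
import json

texts = ["First sample document.", "Another text to profile the pipeline on."]
with open("inputs.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}) + "\n")
```

The resulting file can then be passed as the `inputs` argument, or piped in with `-` for standard input.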
 ### debug model {#debug-model}

@@ -546,20 +568,20 @@ $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P
 </Accordion>

-| Argument | Type | Default | Description |
+| Argument | Type | Description | Default |
-| -------- | ---- | ------- | ----------- |
+| -------- | ---- | ----------- | ------- |
-| `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
+| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | |
-| `component` | positional | | Name of the pipeline component of which the model should be analyzed. |
+| `component` | positional | Name of the pipeline component of which the model should be analyzed. | |
-| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. |
+| `--layers`, `-l` | option | Comma-separated names of layer IDs to print. | |
-| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. |
+| `--dimensions`, `-DIM` | option | Show dimensions of each layer. | `False` |
-| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. |
+| `--parameters`, `-PAR` | option | Show parameters of each layer. | `False` |
-| `--gradients`, `-GRAD` | option | `False` | Show gradients of each layer. |
+| `--gradients`, `-GRAD` | option | Show gradients of each layer. | `False` |
-| `--attributes`, `-ATTR` | option | `False` | Show attributes of each layer. |
+| `--attributes`, `-ATTR` | option | Show attributes of each layer. | `False` |
-| `--print-step0`, `-P0` | option | `False` | Print model before training. |
+| `--print-step0`, `-P0` | option | Print model before training. | `False` |
-| `--print-step1`, `-P1` | option | `False` | Print model after initialization. |
+| `--print-step1`, `-P1` | option | Print model after initialization. | `False` |
-| `--print-step2`, `-P2` | option | `False` | Print model after training. |
+| `--print-step2`, `-P2` | option | Print model after training. | `False` |
-| `--print-step3`, `-P3` | option | `False` | Print final predictions. |
+| `--print-step3`, `-P3` | option | Print final predictions. | `False` |
-| `--help`, `-h` | flag | | Show help message and available arguments. |
+| `--help`, `-h` | flag | Show help message and available arguments. | |

 ## Train {#train}

@@ -293,8 +293,6 @@ factories.
 > return Model("custom", forward, dims={"nO": nO})
 > ```

-<!-- TODO: finish table -->
-
 | Registry name | Description |
 | ------------- | ----------- |
 | `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
|
||||||
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||||
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
||||||
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||||
| `assets` | |
|
| `assets` | Registry for data assets, knowledge bases etc. |
|
||||||
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
|
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
|
||||||
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
|
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
|
||||||
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
|
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
|
||||||
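As a hedged illustration of the `architectures` row above, mirroring the `Model("custom", ...)` snippet earlier in this hunk's context — the registry name and the layer itself are made up for the example:

```python
# Register a hypothetical architecture so "custom.ZeroModel.v1" can be
# referenced from config.cfg. Decorator form per spaCy v3's
# catalogue-based registry; the layer is a do-nothing identity model.
import spacy
from thinc.api import Model

@spacy.registry.architectures("custom.ZeroModel.v1")
def create_zero_model(nO: int) -> Model:
    def forward(model, X, is_train):
        return X, lambda dY: dY  # identity forward and backprop
    return Model("zero", forward, dims={"nO": nO})
```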
@@ -37,20 +37,19 @@ import Accordion from 'components/accordion.js'

 <Accordion title="Does the order of pipeline components matter?" id="pipeline-components-order">

-<!-- TODO: note on v3 tok2vec own model vs. upstream listeners -->
-
-In spaCy v2.x, the statistical components like the tagger or parser are
-independent and don't share any data between themselves. For example, the named
-entity recognizer doesn't use any features set by the tagger and parser, and so
-on. This means that you can swap them, or remove single components from the
-pipeline without affecting the others.
-
-However, custom components may depend on annotations set by other components.
-For example, a custom lemmatizer may need the part-of-speech tags assigned, so
-it'll only work if it's added after the tagger. The parser will respect
-pre-defined sentence boundaries, so if a previous component in the pipeline sets
-them, its dependency predictions may be different. Similarly, it matters if you
-add the [`EntityRuler`](/api/entityruler) before or after the statistical entity
+The statistical components like the tagger or parser are typically independent
+and don't share any data between each other. For example, the named entity
+recognizer doesn't use any features set by the tagger and parser, and so on.
+This means that you can swap them, or remove single components from the pipeline
+without affecting the others. However, components may share a "token-to-vector"
+component like [`Tok2Vec`](/api/tok2vec) or [`Transformer`](/api/transformer).
+
+Custom components may also depend on annotations set by other components. For
+example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll
+only work if it's added after the tagger. The parser will respect pre-defined
+sentence boundaries, so if a previous component in the pipeline sets them, its
+dependency predictions may be different. Similarly, it matters if you add the
+[`EntityRuler`](/api/entityruler) before or after the statistical entity
 recognizer: if it's added before, the entity recognizer will take the existing
 entities into account when making predictions. The
 [`EntityLinker`](/api/entitylinker), which resolves named entities to knowledge
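A short sketch of what the ordering rules in the rewritten paragraph look like in code; the model name is just an example, and string-based `add_pipe` assumes spaCy v3:

```python
# Order matters for interacting components: add the entity ruler before
# the statistical NER so its entities are taken into account.
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline with an "ner" component
print(nlp.pipe_names)               # e.g. ['tagger', 'parser', 'ner']

nlp.add_pipe("entity_ruler", before="ner")
print(nlp.pipe_names)               # entity_ruler now precedes ner
```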
@@ -371,7 +371,7 @@ that reference this variable.

 ### Model architectures {#model-architectures}

-<!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
+<!-- TODO: refer to architectures API: /api/architectures -->

 ### Metrics, training output and weighted scores {#metrics}

@@ -32,8 +32,6 @@ transformer pipeline component is available to spaCy.
 $ pip install spacy-transformers
 ```

-<!-- TODO: the text below has been copied from the spacy-transformers repo and needs to be updated and adjusted -->
-
 ## Runtime usage {#runtime}

 Transformer models can be used as **drop-in replacements** for other types of
@@ -99,9 +97,9 @@ evaluate, package and visualize your model.

 </Project>

-The `[components]` section in the [`config.cfg`](#TODO:) describes the pipeline
-components and the settings used to construct them, including their model
-implementation. Here's a config snippet for the
+The `[components]` section in the [`config.cfg`](/api/data-formats#config)
+describes the pipeline components and the settings used to construct them,
+including their model implementation. Here's a config snippet for the
 [`Transformer`](/api/transformer) component, along with matching Python code. In
 this case, the `[components.transformer]` block describes the `transformer`
 component:
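The paragraph above promises a config snippet with matching Python code; as a hedged sketch of what the Python side can look like (the model name and the deep-merging of a partial `config` block over the factory defaults are assumptions here):

```python
# Add a transformer component whose settings mirror a
# [components.transformer] config block. Assumes spaCy v3 with
# spacy-transformers installed, which provides the "transformer" factory.
import spacy
import spacy_transformers  # noqa: F401 -- ensures the factory is registered

nlp = spacy.blank("en")
nlp.add_pipe(
    "transformer",
    config={
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v1",
            "name": "bert-base-uncased",  # any transformers.AutoModel name
        }
    },
)
```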
@@ -249,7 +249,15 @@ $ python -m spacy convert ./training.json ./output

 #### Training config {#migrating-training-config}

-<!-- TODO: update once we have recommended "getting started with a new config" workflow -->
+The easiest way to get started with a training config is to use the
+[`init config`](/api/cli#init-config) command. You can start off with a blank
+config for a new model, copy the config from an existing model, or auto-fill a
+partial config like a starter config generated by our
+[quickstart widget](/usage/training#quickstart).
+
+```bash
+python -m spacy init-config ./config.cfg --lang en --pipeline tagger,parser
+```

 ```diff
 ### {wrap="true"}
@@ -4,12 +4,10 @@ import { StaticQuery, graphql } from 'gatsby'
 import { Quickstart, QS } from '../components/quickstart'

 const DEFAULT_LANG = 'en'
-const MODELS_SMALL = { en: 'roberta-base-small' }
-const MODELS_LARGE = { en: 'roberta-base' }

 const COMPONENTS = ['tagger', 'parser', 'ner', 'textcat']
 const COMMENT = `# This is an auto-generated partial config for training a model.
-# TODO: intructions for how to fill and use it`
+# To use it for training, auto-fill it with all default values.
+# python -m spacy init config config.cfg --base base_config.cfg`
 const DATA = [
     {
         id: 'lang',