From 12052bd8f66b5ec41f5f2469f296763d6b88c0e4 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Mon, 10 Aug 2020 01:20:10 +0200
Subject: [PATCH] Update docs [ci skip]

---
 website/docs/api/architectures.md          | 44 +++++++++---------
 website/docs/api/cli.md                    | 52 +++++++++++++++-------
 website/docs/api/top-level.md              |  4 +-
 website/docs/usage/101/_pipelines.md       | 25 +++++------
 website/docs/usage/training.md             |  2 +-
 website/docs/usage/transformers.md         |  8 ++--
 website/docs/usage/v3.md                   | 10 ++++-
 website/src/widgets/quickstart-training.js |  6 +--
 8 files changed, 86 insertions(+), 65 deletions(-)

diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md
index c79551761..73631c64a 100644
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@@ -48,8 +48,6 @@ features and a CNN with layer-normalized maxout.

### spacy.Tok2Vec.v1 {#Tok2Vec}

-
-
> #### Example config
>
> ```ini
@@ -57,18 +55,22 @@ features and a CNN with layer-normalized maxout.
> @architectures = "spacy.Tok2Vec.v1"
>
> [model.embed]
+> @architectures = "spacy.CharacterEmbed.v1"
+> # ...
>
> [model.encode]
+> @architectures = "spacy.MaxoutWindowEncoder.v1"
+> # ...
> ```

Construct a tok2vec model out of embedding and encoding subnetworks. See the
["Embed, Encode, Attend, Predict"](https://explosion.ai/blog/deep-learning-formula-nlp)
blog post for background.

-| Name     | Type                                       | Description |
-| -------- | ------------------------------------------ | ----------- |
-| `embed`  | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. |
-| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |
+| Name     | Type                                       | Description |
+| -------- | ------------------------------------------ | ----------- |
+| `embed`  | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). |
+| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). |

### spacy.Tok2VecListener.v1 {#Tok2VecListener}
@@ -113,8 +115,6 @@ argument that connects to the shared `tok2vec` component in the pipeline.

### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}

-
-
> #### Example config
>
> ```ini
@@ -143,17 +143,15 @@ representation.
### spacy.CharacterEmbed.v1 {#CharacterEmbed}

-
-
> #### Example config
>
> ```ini
> [model]
> @architectures = "spacy.CharacterEmbed.v1"
-> width = 64
-> rows = 2000
-> nM = 16
-> nC = 4
+> width = 128
+> rows = 7000
+> nM = 64
+> nC = 8
> ```

Construct an embedded representation based on character embeddings, using a
@@ -186,9 +184,9 @@ construct a single vector to represent the information.
> ```ini
> [model]
> @architectures = "spacy.MaxoutWindowEncoder.v1"
-> width = 64
+> width = 128
> window_size = 1
-> maxout_pieces = 2
+> maxout_pieces = 3
> depth = 4
> ```

@@ -254,8 +252,6 @@ architectures into your training config.

### spacy-transformers.TransformerModel.v1 {#TransformerModel}

-
-
> #### Example Config
>
> ```ini
@@ -270,6 +266,8 @@ architectures into your training config.
> stride = 96
> ```

+
+
| Name   | Type | Description |
| ------ | ---- | ----------- |
| `name` | str  | Any model name that can be loaded by [`transformers.AutoModel`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel). |
@@ -309,7 +307,11 @@ a single token vector given zero or more wordpiece vectors.
> #### Example Config
>
> ```ini
-> # TODO:
+> [model]
+> @architectures = "spacy.Tok2VecTransformer.v1"
+> name = "albert-base-v2"
+> tokenizer_config = {"use_fast": false}
+> grad_factor = 1.0
> ```

Use a transformer as a [`Tok2Vec`](/api/tok2vec) layer directly. This does
@@ -554,10 +556,6 @@ others, but may not be as accurate, especially if texts are short.
| `no_output_layer` | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
| `nO`              | int  | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |

-
-
## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}

An [`EntityLinker`](/api/entitylinker) component disambiguates textual mentions
diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md
index c4a774cd0..5c971effa 100644
--- a/website/docs/api/cli.md
+++ b/website/docs/api/cli.md
@@ -438,7 +438,29 @@ will not be available.
| `--help`, `-h` | flag | Show help message and available arguments. |
| overrides      |      | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |

-
+### debug profile {#debug-profile}
+
+Profile which functions take the most time in a spaCy pipeline. Input should be
+formatted as one JSON object per line with a key `"text"`. It can either be
+provided as a JSONL file, or be read from `sys.stdin`. If no input file is
+specified, the IMDB dataset is loaded via
+[`ml_datasets`](https://github.com/explosion/ml_datasets).
+
+
+The `profile` command is now available as a subcommand of `spacy debug`.
+
+
+```bash
+$ python -m spacy debug profile [model] [inputs] [--n-texts]
+```
+
+| Argument          | Type       | Description                                                        |
+| ----------------- | ---------- | ------------------------------------------------------------------ |
+| `model`           | positional | A loadable spaCy model.                                            |
+| `inputs`          | positional | Optional path to input file, or `-` for standard input.            |
+| `--n-texts`, `-n` | option     | Maximum number of texts to use if available. Defaults to `10000`.  |
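As a quick illustration of the input format described above, a JSONL file passed as `inputs` could look like the following minimal sketch. The file name and the texts are invented for the example:

```
{"text": "Apple is looking at buying a U.K. startup for $1 billion."}
{"text": "Autonomous cars shift insurance liability toward manufacturers."}
```

Saved as e.g. `inputs.jsonl`, it would be supplied as the `inputs` argument shown in the table above, or piped to `-` on standard input.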

### debug model {#debug-model}
@@ -546,20 +568,20 @@ $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P

-| Argument                | Type       | Default | Description |
-| ----------------------- | ---------- | ------- | ----------- |
-| `config_path`           | positional |         | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
-| `component`             | positional |         | Name of the pipeline component of which the model should be analyzed. |
-| `--layers`, `-l`        | option     |         | Comma-separated names of layer IDs to print. |
-| `--dimensions`, `-DIM`  | option     | `False` | Show dimensions of each layer. |
-| `--parameters`, `-PAR`  | option     | `False` | Show parameters of each layer. |
-| `--gradients`, `-GRAD`  | option     | `False` | Show gradients of each layer. |
-| `--attributes`, `-ATTR` | option     | `False` | Show attributes of each layer. |
-| `--print-step0`, `-P0`  | option     | `False` | Print model before training. |
-| `--print-step1`, `-P1`  | option     | `False` | Print model after initialization. |
-| `--print-step2`, `-P2`  | option     | `False` | Print model after training. |
-| `--print-step3`, `-P3`  | option     | `False` | Print final predictions. |
-| `--help`, `-h`          | flag       |         | Show help message and available arguments. |
+| Argument                | Type       | Description | Default |
+| ----------------------- | ---------- | ----------- | ------- |
+| `config_path`           | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | |
+| `component`             | positional | Name of the pipeline component of which the model should be analyzed. | |
+| `--layers`, `-l`        | option     | Comma-separated names of layer IDs to print. | |
+| `--dimensions`, `-DIM`  | option     | Show dimensions of each layer. | `False` |
+| `--parameters`, `-PAR`  | option     | Show parameters of each layer. | `False` |
+| `--gradients`, `-GRAD`  | option     | Show gradients of each layer. | `False` |
+| `--attributes`, `-ATTR` | option     | Show attributes of each layer. | `False` |
+| `--print-step0`, `-P0`  | option     | Print model before training. | `False` |
+| `--print-step1`, `-P1`  | option     | Print model after initialization. | `False` |
+| `--print-step2`, `-P2`  | option     | Print model after training. | `False` |
+| `--print-step3`, `-P3`  | option     | Print final predictions. | `False` |
+| `--help`, `-h`          | flag       | Show help message and available arguments. | |

## Train {#train}
diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index 0b3167901..60885f246 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -293,8 +293,6 @@ factories.
> return Model("custom", forward, dims={"nO": nO})
> ```

-
-
| Registry name | Description |
| ------------- | ----------- |
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
@@ -303,7 +301,7 @@ factories.
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
-| `assets` | |
+| `assets` | Registry for data assets, knowledge bases, etc. |
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
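To make the registry table above more concrete, here is a minimal sketch of registering a custom function with one of these registries, using `callbacks` as the example. The registered name `"customize_nlp.v1"` and the function body are hypothetical:

```python
import spacy

@spacy.registry.callbacks("customize_nlp.v1")
def create_nlp_callback():
    def customize_nlp(nlp):
        # Inspect or modify the nlp object before training starts
        print("Pipeline components:", nlp.pipe_names)
        return nlp
    return customize_nlp
```

A training config could then reference the callback by its string name, just like the custom architectures registered via `@spacy.registry.architectures` in the example above; the exact config block depends on the hook (see the usage docs linked in the table).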
diff --git a/website/docs/usage/101/_pipelines.md b/website/docs/usage/101/_pipelines.md
index 4bbc41f62..295aa6e52 100644
--- a/website/docs/usage/101/_pipelines.md
+++ b/website/docs/usage/101/_pipelines.md
@@ -37,20 +37,19 @@ import Accordion from 'components/accordion.js'

-
+The statistical components like the tagger or parser are typically independent
+and don't share any data with each other. For example, the named entity
+recognizer doesn't use any features set by the tagger and parser, and so on.
+This means that you can swap them, or remove single components from the pipeline
+without affecting the others. However, components may share a "token-to-vector"
+component like [`Tok2Vec`](/api/tok2vec) or [`Transformer`](/api/transformer).

-In spaCy v2.x, the statistical components like the tagger or parser are
-independent and don't share any data between themselves. For example, the named
-entity recognizer doesn't use any features set by the tagger and parser, and so
-on. This means that you can swap them, or remove single components from the
-pipeline without affecting the others.
-
-However, custom components may depend on annotations set by other components.
-For example, a custom lemmatizer may need the part-of-speech tags assigned, so
-it'll only work if it's added after the tagger. The parser will respect
-pre-defined sentence boundaries, so if a previous component in the pipeline sets
-them, its dependency predictions may be different. Similarly, it matters if you
-add the [`EntityRuler`](/api/entityruler) before or after the statistical entity
+Custom components may also depend on annotations set by other components. For
+example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll
+only work if it's added after the tagger. The parser will respect pre-defined
+sentence boundaries, so if a previous component in the pipeline sets them, its
+dependency predictions may be different. Similarly, it matters if you add the
+[`EntityRuler`](/api/entityruler) before or after the statistical entity
recognizer: if it's added before, the entity recognizer will take the existing
entities into account when making predictions. The
[`EntityLinker`](/api/entitylinker), which resolves named entities to knowledge
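The new paragraph on component ordering can be illustrated with a small sketch. Assuming a pretrained pipeline such as `en_core_web_sm` is installed, adding the `EntityRuler` before the statistical recognizer would look roughly like this (the label and pattern are invented):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Insert the rule-based entity ruler before the statistical NER, so the
# recognizer takes the pre-set entities into account when predicting
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "ORG", "pattern": "MyCorp Inc."}])
print(nlp.pipe_names)
```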
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index ef69c302c..e6d328d02 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -371,7 +371,7 @@ that reference this variable.

### Model architectures {#model-architectures}

-
+

### Metrics, training output and weighted scores {#metrics}
diff --git a/website/docs/usage/transformers.md b/website/docs/usage/transformers.md
index 9a8c472af..e52417d13 100644
--- a/website/docs/usage/transformers.md
+++ b/website/docs/usage/transformers.md
@@ -32,8 +32,6 @@ transformer pipeline component is available to spaCy.
$ pip install spacy-transformers
```

-
-
## Runtime usage {#runtime}

Transformer models can be used as **drop-in replacements** for other types of
@@ -99,9 +97,9 @@ evaluate, package and visualize your model.

-
-The `[components]` section in the [`config.cfg`](#TODO:) describes the pipeline
-components and the settings used to construct them, including their model
-implementation. Here's a config snippet for the
+The `[components]` section in the [`config.cfg`](/api/data-formats#config)
+describes the pipeline components and the settings used to construct them,
+including their model implementation. Here's a config snippet for the
[`Transformer`](/api/transformer) component, along with matching Python code.
In this case, the `[components.transformer]` block describes the `transformer`
component:
diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md
index 36f934e96..02f6882e4 100644
--- a/website/docs/usage/v3.md
+++ b/website/docs/usage/v3.md
@@ -249,7 +249,15 @@ $ python -m spacy convert ./training.json ./output

#### Training config {#migrating-training-config}

-
+The easiest way to get started with a training config is to use the
+[`init config`](/api/cli#init-config) command. You can start off with a blank
+config for a new model, copy the config from an existing model, or auto-fill a
+partial config like a starter config generated by our
+[quickstart widget](/usage/training#quickstart).
+
+```bash
+$ python -m spacy init config ./config.cfg --lang en --pipeline tagger,parser
+```

```diff
### {wrap="true"}
diff --git a/website/src/widgets/quickstart-training.js b/website/src/widgets/quickstart-training.js
index 8d20a0744..b7920dd02 100644
--- a/website/src/widgets/quickstart-training.js
+++ b/website/src/widgets/quickstart-training.js
@@ -4,12 +4,10 @@ import { StaticQuery, graphql } from 'gatsby'
import { Quickstart, QS } from '../components/quickstart'

const DEFAULT_LANG = 'en'
-const MODELS_SMALL = { en: 'roberta-base-small' }
-const MODELS_LARGE = { en: 'roberta-base' }
-
const COMPONENTS = ['tagger', 'parser', 'ner', 'textcat']
const COMMENT = `# This is an auto-generated partial config for training a model.
-# TODO: intructions for how to fill and use it`
+# To use it for training, auto-fill it with all default values.
+# python -m spacy init config config.cfg --base base_config.cfg`
const DATA = [
    { id: 'lang',
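Taken together, the workflow hinted at in the updated quickstart comment would be roughly the following, assuming the partial config generated by the widget has been saved as `base_config.cfg` (the `--base` flag is taken from the comment above; the exact CLI flags may still change before release):

```bash
# Fill the partial quickstart config with all default values
$ python -m spacy init config config.cfg --base base_config.cfg
# Train with the filled config, overriding the data paths on the CLI
$ python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
```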