Update docs [ci skip]

This commit is contained in:
Ines Montani 2020-08-10 01:20:10 +02:00
parent 0832cdd443
commit 12052bd8f6
8 changed files with 86 additions and 65 deletions

View File

@ -48,8 +48,6 @@ features and a CNN with layer-normalized maxout.
### spacy.Tok2Vec.v1 {#Tok2Vec}
<!-- TODO: example config -->
> #### Example config
>
> ```ini
@ -57,18 +55,22 @@ features and a CNN with layer-normalized maxout.
> @architectures = "spacy.Tok2Vec.v1"
>
> [model.embed]
> @architectures = "spacy.CharacterEmbed.v1"
> # ...
>
> [model.encode]
> @architectures = "spacy.MaxoutWindowEncoder.v1"
> # ...
> ```
Construct a tok2vec model out of embedding and encoding subnetworks. See the
["Embed, Encode, Attend, Predict"](https://explosion.ai/blog/deep-learning-formula-nlp)
blog post for background.
| Name | Type | Description |
| -------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `embed` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. |
| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |
| Name | Type | Description |
| -------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `embed`  | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). |
| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). |
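The two sublayers compose sequentially: the encoder runs over the embedder's output. As a minimal stand-alone sketch of that composition (plain Python rather than Thinc's actual `chain` combinator; the toy `embed` and `encode` functions are placeholders, not real architectures):

```python
def chain(embed, encode):
    """Compose the two sublayers: encode runs on embed's output."""
    def model(docs):
        return encode(embed(docs))
    return model

# Toy stand-ins: embed maps each token to a 1-d "vector" (its length),
# encode "contextualizes" by doubling each value.
embed = lambda docs: [[len(tok) for tok in doc] for doc in docs]
encode = lambda rows: [[x * 2 for x in row] for row in rows]

tok2vec = chain(embed, encode)
print(tok2vec([["hello", "world"]]))  # [[10, 10]]
```

In the real architecture, `embed` and `encode` are the `[model.embed]` and `[model.encode]` blocks of the config, resolved from the registry.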
### spacy.Tok2VecListener.v1 {#Tok2VecListener}
@ -113,8 +115,6 @@ argument that connects to the shared `tok2vec` component in the pipeline.
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}
<!-- TODO: check example config -->
> #### Example config
>
> ```ini
@ -143,17 +143,15 @@ representation.
### spacy.CharacterEmbed.v1 {#CharacterEmbed}
<!-- TODO: check example config -->
> #### Example config
>
> ```ini
> [model]
> @architectures = "spacy.CharacterEmbed.v1"
> width = 64
> rows = 2000
> nM = 16
> nC = 4
> width = 128
> rows = 7000
> nM = 64
> nC = 8
> ```
Construct an embedded representation based on character embeddings, using a
@ -186,9 +184,9 @@ construct a single vector to represent the information.
> ```ini
> [model]
> @architectures = "spacy.MaxoutWindowEncoder.v1"
> width = 64
> width = 128
> window_size = 1
> maxout_pieces = 2
> maxout_pieces = 3
> depth = 4
> ```
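`maxout_pieces = 3` means each output unit takes the maximum over three separate linear projections of the input. A toy sketch of the maxout activation for a single output unit (hypothetical weights, plain Python rather than the actual Thinc layer):

```python
def maxout(x, piece_weights):
    """For one output unit, take the max over the linear pieces.

    x: input vector; piece_weights: one weight vector per piece.
    """
    return max(
        sum(w_i * x_i for w_i, x_i in zip(w, x))
        for w in piece_weights
    )

x = [1.0, -2.0]
pieces = [[0.5, 0.5], [1.0, 0.0], [0.0, -1.0]]  # maxout_pieces = 3
print(maxout(x, pieces))  # 2.0: the third piece gives (-1) * (-2) = 2
```

A full layer applies this per output dimension; the `width` setting controls how many such units there are.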
@ -254,8 +252,6 @@ architectures into your training config.
### spacy-transformers.TransformerModel.v1 {#TransformerModel}
<!-- TODO: description -->
> #### Example Config
>
> ```ini
@ -270,6 +266,8 @@ architectures into your training config.
> stride = 96
> ```
<!-- TODO: description -->
| Name | Type | Description |
| ------------------ | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name` | str | Any model name that can be loaded by [`transformers.AutoModel`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel). |
@ -309,7 +307,11 @@ a single token vector given zero or more wordpiece vectors.
> #### Example Config
>
> ```ini
> # TODO:
> [model]
> @architectures = "spacy.Tok2VecTransformer.v1"
> name = "albert-base-v2"
> tokenizer_config = {"use_fast": false}
> grad_factor = 1.0
> ```
Use a transformer as a [`Tok2Vec`](/api/tok2vec) layer directly. This does
@ -554,10 +556,6 @@ others, but may not be as accurate, especially if texts are short.
| `no_output_layer` | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
| `nO` | int | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
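The activation choice reflects the label scheme: `Softmax` normalizes the scores into a distribution over mutually exclusive classes, while `Logistic` squashes each label's score independently for the multilabel case. A quick stdlib sketch of the two (illustrative only, not spaCy's implementation):

```python
import math

def softmax(scores):
    """Mutually exclusive classes: outputs sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def logistic(scores):
    """Independent labels: each score squashed to (0, 1) on its own."""
    return [1 / (1 + math.exp(-s)) for s in scores]

probs = softmax([2.0, 1.0, 0.1])
print(sum(probs))       # 1.0 (up to rounding)
print(logistic([0.0]))  # [0.5]
```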
<!-- TODO:
### spacy.TextCatLowData.v1 {#TextCatLowData}
-->
## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}
An [`EntityLinker`](/api/entitylinker) component disambiguates textual mentions

View File

@ -438,7 +438,29 @@ will not be available.
| `--help`, `-h` | flag | Show help message and available arguments. |
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
<!-- TODO: document debug profile?-->
### debug profile {#debug-profile}
Profile which functions take the most time in a spaCy pipeline. Input should be
formatted as one JSON object per line with a key `"text"`. It can either be
provided as a JSONL file, or be read from `sys.stdin`. If no input file is
specified, the IMDB dataset is loaded via
[`ml_datasets`](https://github.com/explosion/ml_datasets).
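As a hypothetical sketch of the expected input, each line is a standalone JSON object with a `"text"` key (the filename and texts here are made up):

```python
import json

# Hypothetical sample texts to profile the pipeline on.
texts = ["This movie was great.", "Not my kind of film."]
with open("profile_input.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}) + "\n")
```

The resulting file can then be passed as the `inputs` argument, or piped to standard input with `-`.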
<Infobox title="New in v3.0" variant="warning">
The `profile` command is now available as a subcommand of `spacy debug`.
</Infobox>
```bash
$ python -m spacy debug profile [model] [inputs] [--n-texts]
```
| Argument | Type | Description |
| ----------------- | ----------------------------------------------------------------- | ------------------------------------------------------- |
| `model` | positional | A loadable spaCy model. |
| `inputs` | positional | Optional path to input file, or `-` for standard input. |
| `--n-texts`, `-n` | option | Maximum number of texts to use if available. Defaults to `10000`. |
### debug model {#debug-model}
@ -546,20 +568,20 @@ $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P
</Accordion>
| Argument | Type | Default | Description |
| ----------------------- | ---------- | ------- | ----------------------------------------------------------------------------------------------------- |
| `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `component`             | positional |         | Name of the pipeline component whose model should be analyzed.                                        |
| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. |
| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. |
| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. |
| `--gradients`, `-GRAD` | option | `False` | Show gradients of each layer. |
| `--attributes`, `-ATTR` | option | `False` | Show attributes of each layer. |
| `--print-step0`, `-P0` | option | `False` | Print model before training. |
| `--print-step1`, `-P1` | option | `False` | Print model after initialization. |
| `--print-step2`, `-P2` | option | `False` | Print model after training. |
| `--print-step3`, `-P3` | option | `False` | Print final predictions. |
| `--help`, `-h` | flag | | Show help message and available arguments. |
| Argument | Type | Description | Default |
| ----------------------- | ---------- | ----------------------------------------------------------------------------------------------------- | ------- |
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | |
| `component`             | positional | Name of the pipeline component whose model should be analyzed.                                        | |
| `--layers`, `-l` | option | Comma-separated names of layer IDs to print. | |
| `--dimensions`, `-DIM` | option | Show dimensions of each layer. | `False` |
| `--parameters`, `-PAR` | option | Show parameters of each layer. | `False` |
| `--gradients`, `-GRAD` | option | Show gradients of each layer. | `False` |
| `--attributes`, `-ATTR` | option | Show attributes of each layer. | `False` |
| `--print-step0`, `-P0` | option | Print model before training. | `False` |
| `--print-step1`, `-P1` | option | Print model after initialization. | `False` |
| `--print-step2`, `-P2` | option | Print model after training. | `False` |
| `--print-step3`, `-P3` | option | Print final predictions. | `False` |
| `--help`, `-h` | flag | Show help message and available arguments. | |
## Train {#train}

View File

@ -293,8 +293,6 @@ factories.
> return Model("custom", forward, dims={"nO": nO})
> ```
<!-- TODO: finish table -->
| Registry name | Description |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
@ -303,7 +301,7 @@ factories.
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `assets` | |
| `assets`          | Registry for data assets, knowledge bases, etc. |
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
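Underneath, each registry is a mapping from string names to functions, populated by a decorator. A minimal stand-alone sketch of that pattern (plain Python, not spaCy's actual `registry` object; the registered function and its name are made up):

```python
REGISTRIES = {"architectures": {}}

def register(registry_name, func_name):
    """Decorator: file the function under its string name."""
    def wrapper(func):
        REGISTRIES[registry_name][func_name] = func
        return func
    return wrapper

@register("architectures", "CustomModel.v1")
def create_custom_model(width: int):
    # A real registered function would return a Thinc Model.
    return {"architecture": "CustomModel.v1", "width": width}

# Config strings like @architectures = "CustomModel.v1" resolve via lookup:
fn = REGISTRIES["architectures"]["CustomModel.v1"]
print(fn(width=128))
```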

View File

@ -37,20 +37,19 @@ import Accordion from 'components/accordion.js'
<Accordion title="Does the order of pipeline components matter?" id="pipeline-components-order">
<!-- TODO: note on v3 tok2vec own model vs. upstream listeners -->
The statistical components like the tagger or parser are typically independent
and don't share any data between each other. For example, the named entity
recognizer doesn't use any features set by the tagger and parser, and so on.
This means that you can swap them, or remove single components from the pipeline
without affecting the others. However, components may share a "token-to-vector"
component like [`Tok2Vec`](/api/tok2vec) or [`Transformer`](/api/transformer).
In spaCy v2.x, the statistical components like the tagger or parser are
independent and don't share any data between themselves. For example, the named
entity recognizer doesn't use any features set by the tagger and parser, and so
on. This means that you can swap them, or remove single components from the
pipeline without affecting the others.
However, custom components may depend on annotations set by other components.
For example, a custom lemmatizer may need the part-of-speech tags assigned, so
it'll only work if it's added after the tagger. The parser will respect
pre-defined sentence boundaries, so if a previous component in the pipeline sets
them, its dependency predictions may be different. Similarly, it matters if you
add the [`EntityRuler`](/api/entityruler) before or after the statistical entity
Custom components may also depend on annotations set by other components. For
example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll
only work if it's added after the tagger. The parser will respect pre-defined
sentence boundaries, so if a previous component in the pipeline sets them, its
dependency predictions may be different. Similarly, it matters if you add the
[`EntityRuler`](/api/entityruler) before or after the statistical entity
recognizer: if it's added before, the entity recognizer will take the existing
entities into account when making predictions. The
[`EntityLinker`](/api/entitylinker), which resolves named entities to knowledge

View File

@ -371,7 +371,7 @@ that reference this variable.
### Model architectures {#model-architectures}
<!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
<!-- TODO: refer to architectures API: /api/architectures -->
### Metrics, training output and weighted scores {#metrics}

View File

@ -32,8 +32,6 @@ transformer pipeline component is available to spaCy.
$ pip install spacy-transformers
```
<!-- TODO: the text below has been copied from the spacy-transformers repo and needs to be updated and adjusted -->
## Runtime usage {#runtime}
Transformer models can be used as **drop-in replacements** for other types of
@ -99,9 +97,9 @@ evaluate, package and visualize your model.
</Project>
The `[components]` section in the [`config.cfg`](#TODO:) describes the pipeline
components and the settings used to construct them, including their model
implementation. Here's a config snippet for the
The `[components]` section in the [`config.cfg`](/api/data-formats#config)
describes the pipeline components and the settings used to construct them,
including their model implementation. Here's a config snippet for the
[`Transformer`](/api/transformer) component, along with matching Python code. In
this case, the `[components.transformer]` block describes the `transformer`
component:

View File

@ -249,7 +249,15 @@ $ python -m spacy convert ./training.json ./output
#### Training config {#migrating-training-config}
<!-- TODO: update once we have recommended "getting started with a new config" workflow -->
The easiest way to get started with a training config is to use the
[`init config`](/api/cli#init-config) command. You can start off with a blank
config for a new model, copy the config from an existing model, or auto-fill a
partial config like a starter config generated by our
[quickstart widget](/usage/training#quickstart).
```bash
python -m spacy init config ./config.cfg --lang en --pipeline tagger,parser
```
```diff
### {wrap="true"}

View File

@ -4,12 +4,10 @@ import { StaticQuery, graphql } from 'gatsby'
import { Quickstart, QS } from '../components/quickstart'
const DEFAULT_LANG = 'en'
const MODELS_SMALL = { en: 'roberta-base-small' }
const MODELS_LARGE = { en: 'roberta-base' }
const COMPONENTS = ['tagger', 'parser', 'ner', 'textcat']
const COMMENT = `# This is an auto-generated partial config for training a model.
# TODO: instructions for how to fill and use it`
# To use it for training, auto-fill it with all default values.
# python -m spacy init config config.cfg --base base_config.cfg`
const DATA = [
{
id: 'lang',