Merge branch 'develop' into nightly.spacy.io
Commit 922250ca58
@@ -48,8 +48,6 @@ features and a CNN with layer-normalized maxout.

 ### spacy.Tok2Vec.v1 {#Tok2Vec}

-<!-- TODO: example config -->
-
 > #### Example config
 >
 > ```ini
@@ -57,18 +55,22 @@ features and a CNN with layer-normalized maxout.
 > @architectures = "spacy.Tok2Vec.v1"
 >
 > [model.embed]
+> @architectures = "spacy.CharacterEmbed.v1"
+> # ...
 >
 > [model.encode]
+> @architectures = "spacy.MaxoutWindowEncoder.v1"
+> # ...
 > ```

 Construct a tok2vec model out of embedding and encoding subnetworks. See the
 ["Embed, Encode, Attend, Predict"](https://explosion.ai/blog/deep-learning-formula-nlp)
 blog post for background.

 | Name | Type | Description |
 | ---- | ---- | ----------- |
-| `embed` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. |
+| `embed` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). |
-| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |
+| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). |

 ### spacy.Tok2VecListener.v1 {#Tok2VecListener}

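The `embed`/`encode` contract documented in this hunk can be sketched in Thinc terms. The `build_tok2vec` helper below is hypothetical (spaCy's registered function does more wiring); it only illustrates that the architecture is a composition of the two subnetworks:

```python
# Hypothetical sketch of the spacy.Tok2Vec.v1 contract: compose an
# embedding subnetwork with an encoding subnetwork via Thinc's `chain`.
# spaCy's real implementation adds more wiring; this shows the shapes only.
from thinc.api import Model, chain

def build_tok2vec(
    embed: Model,   # List[Doc]      -> List[Floats2d]
    encode: Model,  # List[Floats2d] -> List[Floats2d]
) -> Model:
    # The composed model maps List[Doc] -> List[Floats2d]: one row of
    # `width` floats per token, with context mixed in by the encoder.
    return chain(embed, encode)
```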
@@ -113,8 +115,6 @@ argument that connects to the shared `tok2vec` component in the pipeline.

 ### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}

-<!-- TODO: check example config -->
-
 > #### Example config
 >
 > ```ini
@@ -143,17 +143,15 @@ representation.

 ### spacy.CharacterEmbed.v1 {#CharacterEmbed}

-<!-- TODO: check example config -->
-
 > #### Example config
 >
 > ```ini
 > [model]
 > @architectures = "spacy.CharacterEmbed.v1"
-> width = 64
+> width = 128
-> rows = 2000
+> rows = 7000
-> nM = 16
+> nM = 64
-> nC = 4
+> nC = 8
 > ```

 Construct an embedded representation based on character embeddings, using a
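The new CharacterEmbed defaults above can be inspected by parsing the block with Thinc's `Config`; a minimal sketch (resolving it into a real layer additionally requires the architecture to be registered, e.g. by importing spaCy):

```python
# Minimal sketch: parse the example [model] block above with Thinc's
# Config. Values are parsed into native types, so the hyperparameters
# can be read back programmatically.
from thinc.api import Config

CONFIG = """
[model]
@architectures = "spacy.CharacterEmbed.v1"
width = 128
rows = 7000
nM = 64
nC = 8
"""

cfg = Config().from_str(CONFIG)
# nC characters per word, each embedded into nM dimensions:
print(cfg["model"]["nC"], "chars x", cfg["model"]["nM"], "dims")
```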
@@ -186,9 +184,9 @@ construct a single vector to represent the information.
 > ```ini
 > [model]
 > @architectures = "spacy.MaxoutWindowEncoder.v1"
-> width = 64
+> width = 128
 > window_size = 1
-> maxout_pieces = 2
+> maxout_pieces = 3
 > depth = 4
 > ```

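A quick back-of-the-envelope reading of the encoder settings above (my reading, not taken from spaCy's source): with `window_size = 1` and `depth = 4`, each token ends up seeing roughly four tokens of context on each side, since every stacked layer widens the receptive field by one window.

```python
# Back-of-the-envelope: each of `depth` stacked convolutional layers adds
# `window_size` tokens of context per side, so the effective receptive
# field grows linearly with depth.
window_size = 1
depth = 4
per_side = depth * window_size
print(f"{per_side} tokens of context per side, {2 * per_side + 1} total")
```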
@@ -254,8 +252,6 @@ architectures into your training config.

 ### spacy-transformers.TransformerModel.v1 {#TransformerModel}

-<!-- TODO: description -->
-
 > #### Example Config
 >
 > ```ini
@@ -270,6 +266,8 @@ architectures into your training config.
 > stride = 96
 > ```

+<!-- TODO: description -->
+
 | Name | Type | Description |
 | ---- | ---- | ----------- |
 | `name` | str | Any model name that can be loaded by [`transformers.AutoModel`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel). |
@@ -309,7 +307,11 @@ a single token vector given zero or more wordpiece vectors.
 > #### Example Config
 >
 > ```ini
-> # TODO:
+> [model]
+> @architectures = "spacy.Tok2VecTransformer.v1"
+> name = "albert-base-v2"
+> tokenizer_config = {"use_fast": false}
+> grad_factor = 1.0
 > ```

 Use a transformer as a [`Tok2Vec`](/api/tok2vec) layer directly. This does
@@ -554,10 +556,6 @@ others, but may not be as accurate, especially if texts are short.
 | `no_output_layer` | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
 | `nO` | int | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |

-<!-- TODO:
-### spacy.TextCatLowData.v1 {#TextCatLowData}
--->
-
 ## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}

 An [`EntityLinker`](/api/entitylinker) component disambiguates textual mentions
@@ -438,7 +438,29 @@ will not be available.
 | `--help`, `-h` | flag | Show help message and available arguments. |
 | overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |

-<!-- TODO: document debug profile?-->
+### debug profile {#debug-profile}
+
+Profile which functions take the most time in a spaCy pipeline. Input should be
+formatted as one JSON object per line with a key `"text"`. It can either be
+provided as a JSONL file, or be read from `sys.stdin`. If no input file is
+specified, the IMDB dataset is loaded via
+[`ml_datasets`](https://github.com/explosion/ml_datasets).
+
+<Infobox title="New in v3.0" variant="warning">
+
+The `profile` command is now available as a subcommand of `spacy debug`.
+
+</Infobox>
+
+```bash
+$ python -m spacy debug profile [model] [inputs] [--n-texts]
+```
+
+| Argument | Type | Description |
+| -------- | ---- | ----------- |
+| `model` | positional | A loadable spaCy model. |
+| `inputs` | positional | Optional path to input file, or `-` for standard input. |
+| `--n-texts`, `-n` | option | Maximum number of texts to use if available. Defaults to `10000`. |

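Since `debug profile` expects one JSON object per line with a `"text"` key, here is a minimal sketch of producing a compatible file (the file name is just an example):

```python
# Write profiling inputs in the JSONL shape `debug profile` reads:
# one {"text": ...} object per line.
import json

texts = ["First sample document.", "Another text to profile the pipeline on."]
with open("inputs.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}) + "\n")
```

The resulting file can then be passed as the `inputs` argument, or piped in with `-` for standard input.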
 ### debug model {#debug-model}

@@ -546,20 +568,20 @@ $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P
 </Accordion>

-| Argument | Type | Default | Description |
+| Argument | Type | Description | Default |
-| -------- | ---- | ------- | ----------- |
+| -------- | ---- | ----------- | ------- |
-| `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
+| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | |
-| `component` | positional | | Name of the pipeline component of which the model should be analyzed. |
+| `component` | positional | Name of the pipeline component of which the model should be analyzed. | |
-| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. |
+| `--layers`, `-l` | option | Comma-separated names of layer IDs to print. | |
-| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. |
+| `--dimensions`, `-DIM` | option | Show dimensions of each layer. | `False` |
-| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. |
+| `--parameters`, `-PAR` | option | Show parameters of each layer. | `False` |
-| `--gradients`, `-GRAD` | option | `False` | Show gradients of each layer. |
+| `--gradients`, `-GRAD` | option | Show gradients of each layer. | `False` |
-| `--attributes`, `-ATTR` | option | `False` | Show attributes of each layer. |
+| `--attributes`, `-ATTR` | option | Show attributes of each layer. | `False` |
-| `--print-step0`, `-P0` | option | `False` | Print model before training. |
+| `--print-step0`, `-P0` | option | Print model before training. | `False` |
-| `--print-step1`, `-P1` | option | `False` | Print model after initialization. |
+| `--print-step1`, `-P1` | option | Print model after initialization. | `False` |
-| `--print-step2`, `-P2` | option | `False` | Print model after training. |
+| `--print-step2`, `-P2` | option | Print model after training. | `False` |
-| `--print-step3`, `-P3` | option | `False` | Print final predictions. |
+| `--print-step3`, `-P3` | option | Print final predictions. | `False` |
-| `--help`, `-h` | flag | | Show help message and available arguments. |
+| `--help`, `-h` | flag | Show help message and available arguments. | |

 ## Train {#train}

@@ -293,8 +293,6 @@ factories.
 > return Model("custom", forward, dims={"nO": nO})
 > ```

-<!-- TODO: finish table -->
-
 | Registry name | Description |
 | ------------- | ----------- |
 | `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
|
||||||
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||||
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
||||||
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||||
| `assets` | |
|
| `assets` | Registry for data assets, knowledge bases etc. |
|
||||||
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
|
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
|
||||||
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
|
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
|
||||||
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
|
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
|
||||||
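As a hedged illustration of the `architectures` row above, mirroring the `Model("custom", ...)` snippet earlier in this hunk's context — the registry name and the layer itself are made up for the example:

```python
# Register a hypothetical architecture so "custom.ZeroModel.v1" can be
# referenced from config.cfg. Decorator form per spaCy v3's
# catalogue-based registry; the layer is a do-nothing identity model.
import spacy
from thinc.api import Model

@spacy.registry.architectures("custom.ZeroModel.v1")
def create_zero_model(nO: int) -> Model:
    def forward(model, X, is_train):
        return X, lambda dY: dY  # identity forward and backprop
    return Model("zero", forward, dims={"nO": nO})
```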
@@ -37,20 +37,19 @@ import Accordion from 'components/accordion.js'

 <Accordion title="Does the order of pipeline components matter?" id="pipeline-components-order">

-<!-- TODO: note on v3 tok2vec own model vs. upstream listeners -->
-
-In spaCy v2.x, the statistical components like the tagger or parser are
-independent and don't share any data between themselves. For example, the named
-entity recognizer doesn't use any features set by the tagger and parser, and so
-on. This means that you can swap them, or remove single components from the
-pipeline without affecting the others.
-
-However, custom components may depend on annotations set by other components.
-For example, a custom lemmatizer may need the part-of-speech tags assigned, so
-it'll only work if it's added after the tagger. The parser will respect
-pre-defined sentence boundaries, so if a previous component in the pipeline sets
-them, its dependency predictions may be different. Similarly, it matters if you
-add the [`EntityRuler`](/api/entityruler) before or after the statistical entity
+The statistical components like the tagger or parser are typically independent
+and don't share any data between each other. For example, the named entity
+recognizer doesn't use any features set by the tagger and parser, and so on.
+This means that you can swap them, or remove single components from the pipeline
+without affecting the others. However, components may share a "token-to-vector"
+component like [`Tok2Vec`](/api/tok2vec) or [`Transformer`](/api/transformer).
+
+Custom components may also depend on annotations set by other components. For
+example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll
+only work if it's added after the tagger. The parser will respect pre-defined
+sentence boundaries, so if a previous component in the pipeline sets them, its
+dependency predictions may be different. Similarly, it matters if you add the
+[`EntityRuler`](/api/entityruler) before or after the statistical entity
 recognizer: if it's added before, the entity recognizer will take the existing
 entities into account when making predictions. The
 [`EntityLinker`](/api/entitylinker), which resolves named entities to knowledge
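A short sketch of what the ordering rules in the rewritten paragraph look like in code; the model name is just an example, and string-based `add_pipe` assumes spaCy v3:

```python
# Order matters for interacting components: add the entity ruler before
# the statistical NER so its entities are taken into account.
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline with an "ner" component
print(nlp.pipe_names)               # e.g. ['tagger', 'parser', 'ner']

nlp.add_pipe("entity_ruler", before="ner")
print(nlp.pipe_names)               # entity_ruler now precedes ner
```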
@@ -371,7 +371,7 @@ that reference this variable.

 ### Model architectures {#model-architectures}

-<!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
+<!-- TODO: refer to architectures API: /api/architectures -->

 ### Metrics, training output and weighted scores {#metrics}

@@ -32,8 +32,6 @@ transformer pipeline component is available to spaCy.
 $ pip install spacy-transformers
 ```

-<!-- TODO: the text below has been copied from the spacy-transformers repo and needs to be updated and adjusted -->
-
 ## Runtime usage {#runtime}

 Transformer models can be used as **drop-in replacements** for other types of
@@ -99,9 +97,9 @@ evaluate, package and visualize your model.

 </Project>

-The `[components]` section in the [`config.cfg`](#TODO:) describes the pipeline
-components and the settings used to construct them, including their model
-implementation. Here's a config snippet for the
+The `[components]` section in the [`config.cfg`](/api/data-formats#config)
+describes the pipeline components and the settings used to construct them,
+including their model implementation. Here's a config snippet for the
 [`Transformer`](/api/transformer) component, along with matching Python code. In
 this case, the `[components.transformer]` block describes the `transformer`
 component:
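The paragraph above promises a config snippet with matching Python code; as a hedged sketch of what the Python side can look like (the model name and the deep-merging of a partial `config` block over the factory defaults are assumptions here):

```python
# Add a transformer component whose settings mirror a
# [components.transformer] config block. Assumes spaCy v3 with
# spacy-transformers installed, which provides the "transformer" factory.
import spacy
import spacy_transformers  # noqa: F401 -- ensures the factory is registered

nlp = spacy.blank("en")
nlp.add_pipe(
    "transformer",
    config={
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v1",
            "name": "bert-base-uncased",  # any transformers.AutoModel name
        }
    },
)
```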
@@ -249,7 +249,15 @@ $ python -m spacy convert ./training.json ./output

 #### Training config {#migrating-training-config}

-<!-- TODO: update once we have recommended "getting started with a new config" workflow -->
+The easiest way to get started with a training config is to use the
+[`init config`](/api/cli#init-config) command. You can start off with a blank
+config for a new model, copy the config from an existing model, or auto-fill a
+partial config like a starter config generated by our
+[quickstart widget](/usage/training#quickstart).
+
+```bash
+python -m spacy init-config ./config.cfg --lang en --pipeline tagger,parser
+```

 ```diff
 ### {wrap="true"}
@@ -4,12 +4,10 @@ import { StaticQuery, graphql } from 'gatsby'
 import { Quickstart, QS } from '../components/quickstart'

 const DEFAULT_LANG = 'en'
-const MODELS_SMALL = { en: 'roberta-base-small' }
-const MODELS_LARGE = { en: 'roberta-base' }

 const COMPONENTS = ['tagger', 'parser', 'ner', 'textcat']
 const COMMENT = `# This is an auto-generated partial config for training a model.
-# TODO: intructions for how to fill and use it`
+# To use it for training, auto-fill it with all default values.
+# python -m spacy init config config.cfg --base base_config.cfg`
 const DATA = [
     {
         id: 'lang',