Update docs [ci skip]
This commit is contained in:
parent 0832cdd443
commit 12052bd8f6
@@ -48,8 +48,6 @@ features and a CNN with layer-normalized maxout.
 
 ### spacy.Tok2Vec.v1 {#Tok2Vec}
 
-<!-- TODO: example config -->
-
 > #### Example config
 >
 > ```ini
@@ -57,18 +55,22 @@ features and a CNN with layer-normalized maxout.
 > @architectures = "spacy.Tok2Vec.v1"
 >
 > [model.embed]
 > @architectures = "spacy.CharacterEmbed.v1"
 > # ...
 >
 > [model.encode]
 > @architectures = "spacy.MaxoutWindowEncoder.v1"
 > # ...
 > ```
 
 Construct a tok2vec model out of embedding and encoding subnetworks. See the
 ["Embed, Encode, Attend, Predict"](https://explosion.ai/blog/deep-learning-formula-nlp)
 blog post for background.
 
-| Name     | Type                                       | Description                                                                                                                                                 |
-| -------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `embed`  | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations.                                   |
-| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. |
+| Name     | Type                                       | Description                                                                                                                                                                                                                                       |
+| -------- | ------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `embed`  | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed) |
+| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder).           |
 
 ### spacy.Tok2VecListener.v1 {#Tok2VecListener}
 
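Editor's note: the embed/encode contract in the updated table maps directly onto Thinc's combinators. As a minimal sketch of the composition only (using just `thinc.api.chain`, not the actual internals of the `spacy.Tok2Vec.v1` factory):

```python
from thinc.api import Model, chain

def build_tok2vec(embed: Model, encode: Model) -> Model:
    # embed maps List[Doc] -> List[Floats2d]; encode maps
    # List[Floats2d] -> List[Floats2d]. Chaining them yields the full
    # tok2vec model: List[Doc] -> List[Floats2d].
    return chain(embed, encode)
```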
@@ -113,8 +115,6 @@ argument that connects to the shared `tok2vec` component in the pipeline.
 
 ### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}
 
-<!-- TODO: check example config -->
-
 > #### Example config
 >
 > ```ini
@@ -143,17 +143,15 @@ representation.
 
 ### spacy.CharacterEmbed.v1 {#CharacterEmbed}
 
-<!-- TODO: check example config -->
-
 > #### Example config
 >
 > ```ini
 > [model]
 > @architectures = "spacy.CharacterEmbed.v1"
-> width = 64
-> rows = 2000
-> nM = 16
-> nC = 4
+> width = 128
+> rows = 7000
+> nM = 64
+> nC = 8
 > ```
 
 Construct an embedded representation based on character embeddings, using a
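Editor's note: for orientation, the updated example values imply this shape arithmetic, assuming `nC` is the number of characters embedded per word and `nM` the width of each character embedding (the surrounding prose is truncated here, so this interpretation is an assumption):

```python
# Hedged sketch of the shape arithmetic behind the updated example config.
nC, nM, width = 8, 64, 128
char_features = nC * nM  # 512 concatenated character dimensions per token
print(f"{char_features} character dims mixed down to width={width} per token")
```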
@@ -186,9 +184,9 @@ construct a single vector to represent the information.
 > ```ini
 > [model]
 > @architectures = "spacy.MaxoutWindowEncoder.v1"
-> width = 64
+> width = 128
 > window_size = 1
-> maxout_pieces = 2
+> maxout_pieces = 3
 > depth = 4
 > ```
 
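Editor's note: the updated values slot into a standard convolutional window encoder. A minimal sketch of such an encoder in Thinc combinators, assuming the usual expand-window-then-maxout block with residual connections; the real `spacy.MaxoutWindowEncoder.v1` may differ in details:

```python
from thinc.api import Maxout, chain, clone, expand_window, residual

def maxout_window_encoder(width: int = 128, window_size: int = 1,
                          maxout_pieces: int = 3, depth: int = 4):
    # Concatenate each token's vector with its window_size neighbors on
    # each side, then map back down to `width` with a layer-normalized
    # maxout layer.
    block = chain(
        expand_window(window_size=window_size),
        Maxout(nO=width, nI=width * (window_size * 2 + 1),
               nP=maxout_pieces, normalize=True),
    )
    # Stack `depth` residual copies of the block.
    return clone(residual(block), depth)
```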
@@ -254,8 +252,6 @@ architectures into your training config.
 
 ### spacy-transformers.TransformerModel.v1 {#TransformerModel}
 
-<!-- TODO: description -->
-
 > #### Example Config
 >
 > ```ini
@@ -270,6 +266,8 @@ architectures into your training config.
 > stride = 96
 > ```
 
+<!-- TODO: description -->
+
 | Name | Type | Description |
 | ------------------ | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `name` | str | Any model name that can be loaded by [`transformers.AutoModel`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel). |
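Editor's note: as the table says, `name` accepts anything `transformers.AutoModel` can load. A quick illustration of that resolution in plain `transformers`; the model name below is just an example, not a spaCy recommendation:

```python
from transformers import AutoModel, AutoTokenizer

# Any Hugging Face model name or local path works here.
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```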
@@ -309,7 +307,11 @@ a single token vector given zero or more wordpiece vectors.
 > #### Example Config
 >
 > ```ini
+> # TODO:
 > [model]
 > @architectures = "spacy.Tok2VecTransformer.v1"
+> name = "albert-base-v2"
+> tokenizer_config = {"use_fast": false}
+> grad_factor = 1.0
 > ```
 
 Use a transformer as a [`Tok2Vec`](/api/tok2vec) layer directly. This does
@@ -554,10 +556,6 @@ others, but may not be as accurate, especially if texts are short.
 | `no_output_layer` | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
 | `nO` | int | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
 
-<!-- TODO:
-### spacy.TextCatLowData.v1 {#TextCatLowData}
--->
-
 ## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}
 
 An [`EntityLinker`](/api/entitylinker) component disambiguates textual mentions
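Editor's note on the `no_output_layer` row at the top of the hunk above: a minimal sketch of the output-layer choice it describes, using Thinc layers; the helper function is hypothetical, not spaCy's internal code:

```python
from thinc.api import Logistic, Softmax

def make_output_layer(exclusive_classes: bool, nO: int):
    # Per the row above: Softmax for mutually exclusive classes,
    # Logistic (independent sigmoids) for multi-label output.
    return Softmax(nO=nO) if exclusive_classes else Logistic()
```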
@@ -438,7 +438,29 @@ will not be available.
 | `--help`, `-h` | flag | Show help message and available arguments. |
 | overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
 
-<!-- TODO: document debug profile?-->
+### debug profile {#debug-profile}
+
+Profile which functions take the most time in a spaCy pipeline. Input should be
+formatted as one JSON object per line with a key `"text"`. It can either be
+provided as a JSONL file, or be read from `sys.stdin`. If no input file is
+specified, the IMDB dataset is loaded via
+[`ml_datasets`](https://github.com/explosion/ml_datasets).
+
+<Infobox title="New in v3.0" variant="warning">
+
+The `profile` command is now available as a subcommand of `spacy debug`.
+
+</Infobox>
+
+```bash
+$ python -m spacy debug profile [model] [inputs] [--n-texts]
+```
+
+| Argument | Type | Description |
+| ----------------- | ---------- | ----------------------------------------------------------------- |
+| `model` | positional | A loadable spaCy model. |
+| `inputs` | positional | Optional path to input file, or `-` for standard input. |
+| `--n-texts`, `-n` | option | Maximum number of texts to use if available. Defaults to `10000`. |
 
 ### debug model {#debug-model}
 
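Editor's note: the input format documented in the `debug profile` section added above (one JSON object per line with a `"text"` key) is easy to produce. A minimal sketch; the file name and texts are just examples:

```python
import json

# Write inputs for `spacy debug profile` as JSONL: one object per line
# with a "text" key.
texts = ["First example document.", "Second example document."]
with open("inputs.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}) + "\n")
```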
@@ -546,20 +568,20 @@ $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P
 
 </Accordion>
 
-| Argument | Type | Default | Description |
-| ----------------------- | ---------- | ------- | ------------------------------------------------------------------------------------------------------ |
-| `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
-| `component` | positional | | Name of the pipeline component of which the model should be analyzed. |
-| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. |
-| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. |
-| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. |
-| `--gradients`, `-GRAD` | option | `False` | Show gradients of each layer. |
-| `--attributes`, `-ATTR` | option | `False` | Show attributes of each layer. |
-| `--print-step0`, `-P0` | option | `False` | Print model before training. |
-| `--print-step1`, `-P1` | option | `False` | Print model after initialization. |
-| `--print-step2`, `-P2` | option | `False` | Print model after training. |
-| `--print-step3`, `-P3` | option | `False` | Print final predictions. |
-| `--help`, `-h` | flag | | Show help message and available arguments. |
+| Argument | Type | Description | Default |
+| ----------------------- | ---------- | ------------------------------------------------------------------------------------------------------ | ------- |
+| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. | |
+| `component` | positional | Name of the pipeline component of which the model should be analyzed. | |
+| `--layers`, `-l` | option | Comma-separated names of layer IDs to print. | |
+| `--dimensions`, `-DIM` | option | Show dimensions of each layer. | `False` |
+| `--parameters`, `-PAR` | option | Show parameters of each layer. | `False` |
+| `--gradients`, `-GRAD` | option | Show gradients of each layer. | `False` |
+| `--attributes`, `-ATTR` | option | Show attributes of each layer. | `False` |
+| `--print-step0`, `-P0` | option | Print model before training. | `False` |
+| `--print-step1`, `-P1` | option | Print model after initialization. | `False` |
+| `--print-step2`, `-P2` | option | Print model after training. | `False` |
+| `--print-step3`, `-P3` | option | Print final predictions. | `False` |
+| `--help`, `-h` | flag | Show help message and available arguments. | |
 
 ## Train {#train}
 
@@ -293,8 +293,6 @@ factories.
 > return Model("custom", forward, dims={"nO": nO})
 > ```
 
-<!-- TODO: finish table -->
-
 | Registry name | Description |
 | ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
@@ -303,7 +301,7 @@ factories.
 | `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
 | `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
 | `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
-| `assets` | |
+| `assets` | Registry for data assets, knowledge bases etc. |
 | `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
 | `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
 | `batchers` | Registry for training and evaluation [data batchers](#batchers). |
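Editor's note: the `architectures` registry in the table above is the hook for plugging custom models into the config system. A minimal sketch of registering one, assuming spaCy v3's `spacy.registry` decorator API; the registered name and the identity forward pass are hypothetical, with the model constructor mirroring the `Model("custom", ...)` snippet in the hunk above:

```python
import spacy
from thinc.api import Model

def forward(model: Model, X, is_train: bool):
    # Identity forward pass, just to make the sketch runnable.
    return X, lambda dY: dY

@spacy.registry.architectures("my_custom_model.v1")
def create_custom_model(nO: int) -> Model:
    # The registered name can then be referenced in config.cfg as
    # @architectures = "my_custom_model.v1".
    return Model("custom", forward, dims={"nO": nO})
```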
@@ -37,20 +37,19 @@ import Accordion from 'components/accordion.js'
 
 <Accordion title="Does the order of pipeline components matter?" id="pipeline-components-order">
 
 <!-- TODO: note on v3 tok2vec own model vs. upstream listeners -->
 
-The statistical components like the tagger or parser are typically independent
-and don't share any data between each other. For example, the named entity
-recognizer doesn't use any features set by the tagger and parser, and so on.
-This means that you can swap them, or remove single components from the pipeline
-without affecting the others. However, components may share a "token-to-vector"
-component like [`Tok2Vec`](/api/tok2vec) or [`Transformer`](/api/transformer).
+In spaCy v2.x, the statistical components like the tagger or parser are
+independent and don't share any data between themselves. For example, the named
+entity recognizer doesn't use any features set by the tagger and parser, and so
+on. This means that you can swap them, or remove single components from the
+pipeline without affecting the others.
 
-However, custom components may depend on annotations set by other components.
-For example, a custom lemmatizer may need the part-of-speech tags assigned, so
-it'll only work if it's added after the tagger. The parser will respect
-pre-defined sentence boundaries, so if a previous component in the pipeline sets
-them, its dependency predictions may be different. Similarly, it matters if you
-add the [`EntityRuler`](/api/entityruler) before or after the statistical entity
+Custom components may also depend on annotations set by other components. For
+example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll
+only work if it's added after the tagger. The parser will respect pre-defined
+sentence boundaries, so if a previous component in the pipeline sets them, its
+dependency predictions may be different. Similarly, it matters if you add the
+[`EntityRuler`](/api/entityruler) before or after the statistical entity
 recognizer: if it's added before, the entity recognizer will take the existing
 entities into account when making predictions. The
 [`EntityLinker`](/api/entitylinker), which resolves named entities to knowledge
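Editor's note: the EntityRuler ordering point in the hunk above is easy to see in code. A minimal sketch using the v3-style `add_pipe` API; the model name and pattern are just examples:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Insert the rule-based EntityRuler *before* the statistical NER, so the
# recognizer takes the ruler's entities into account when predicting.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion"}])
```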
@@ -371,7 +371,7 @@ that reference this variable.
 
 ### Model architectures {#model-architectures}
 
-<!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
+<!-- TODO: refer to architectures API: /api/architectures -->
 
 ### Metrics, training output and weighted scores {#metrics}
 
@@ -32,8 +32,6 @@ transformer pipeline component is available to spaCy.
 $ pip install spacy-transformers
 ```
 
-<!-- TODO: the text below has been copied from the spacy-transformers repo and needs to be updated and adjusted -->
-
 ## Runtime usage {#runtime}
 
 Transformer models can be used as **drop-in replacements** for other types of
@@ -99,9 +97,9 @@ evaluate, package and visualize your model.
 
 </Project>
 
-The `[components]` section in the [`config.cfg`](#TODO:) describes the pipeline
-components and the settings used to construct them, including their model
-implementation. Here's a config snippet for the
+The `[components]` section in the [`config.cfg`](/api/data-formats#config)
+describes the pipeline components and the settings used to construct them,
+including their model implementation. Here's a config snippet for the
 [`Transformer`](/api/transformer) component, along with matching Python code. In
 this case, the `[components.transformer]` block describes the `transformer`
 component:
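Editor's note: for working with that `[components]` section programmatically, a short sketch using Thinc's `Config` loader, which spaCy's config system builds on; the path is a placeholder:

```python
from thinc.api import Config

# Load a training config and inspect the [components] section described
# above. "./config.cfg" is a hypothetical path.
config = Config().from_disk("./config.cfg")
print(list(config["components"].keys()))
```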
@@ -249,7 +249,15 @@ $ python -m spacy convert ./training.json ./output
 
 #### Training config {#migrating-training-config}
 
-<!-- TODO: update once we have recommended "getting started with a new config" workflow -->
+The easiest way to get started with a training config is to use the
+[`init config`](/api/cli#init-config) command. You can start off with a blank
+config for a new model, copy the config from an existing model, or auto-fill a
+partial config like a starter config generated by our
+[quickstart widget](/usage/training#quickstart).
+
+```bash
+python -m spacy init-config ./config.cfg --lang en --pipeline tagger,parser
+```
 
 ```diff
 ### {wrap="true"}
@@ -4,12 +4,10 @@ import { StaticQuery, graphql } from 'gatsby'
 import { Quickstart, QS } from '../components/quickstart'
 
 const DEFAULT_LANG = 'en'
-const MODELS_SMALL = { en: 'roberta-base-small' }
-const MODELS_LARGE = { en: 'roberta-base' }
-
 const COMPONENTS = ['tagger', 'parser', 'ner', 'textcat']
 const COMMENT = `# This is an auto-generated partial config for training a model.
-# TODO: intructions for how to fill and use it`
+# To use it for training, auto-fill it with all default values.
+# python -m spacy init config config.cfg --base base_config.cfg`
 const DATA = [
   {
     id: 'lang',