Merge pull request #5938 from svlandeg/feature/update-docs [ci skip]

Ines Montani 2020-08-19 20:24:02 +02:00 committed by GitHub
commit 7a8cc64ea8
8 changed files with 84 additions and 60 deletions

View File

@@ -1,5 +1,5 @@
# Recommended settings and available resources for each language, if available.
-# Not all languages have recommended word vecotrs or transformers and for some,
+# Not all languages have recommended word vectors or transformers and for some,
# the recommended transformer for efficiency and accuracy may be the same.
en:
word_vectors: en_vectors_web_lg

View File

@@ -13,7 +13,7 @@ menu:
TODO: intro and how architectures work, link to
[`registry`](/api/top-level#registry),
-[custom models](/usage/training#custom-models) usage etc.
+[custom functions](/usage/training#custom-functions) usage etc.
## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"}
@@ -622,7 +622,7 @@ others, but may not be as accurate, especially if texts are short.
An [`EntityLinker`](/api/entitylinker) component disambiguates textual mentions
(tagged as named entities) to unique identifiers, grounding the named entities
into the "real world". This requires 3 main component
into the "real world". This requires 3 main components:
- A [`KnowledgeBase`](/api/kb) (KB) holding the unique identifiers, potential
synonyms and prior probabilities.

View File

@@ -276,7 +276,7 @@ python -m spacy init fill-config tmp/starter-config_invalid.cfg --base tmp/start
| Name | Description |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ |
-| `--code_path`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. ~~Optional[Path] \(option)~~ |
+| `--code_path`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
| **PRINTS** | Config validation errors, if available. |
@@ -448,7 +448,7 @@ will not be available.
| Name | Description |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ |
-| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. ~~Optional[Path] \(option)~~ |
+| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
| `--ignore-warnings`, `-IW` | Ignore warnings, only show stats and errors. ~~bool (flag)~~ |
| `--verbose`, `-V` | Print additional information and explanations. ~~bool (flag)~~ |
| `--no-format`, `-NF` | Don't pretty-print the results. Use this if you want to write to a file. ~~bool (flag)~~ |
@@ -612,9 +612,9 @@ Train a model. Expects data in spaCy's
Will save out the best model from all epochs, as well as the final model. The
`--code` argument can be used to provide a Python file that's imported before
the training process starts. This lets you register
-[custom functions](/usage/training#custom-models) and architectures and refer to
-them in your config, all while still using spaCy's built-in `train` workflow. If
-you need to manage complex multi-step training workflows, check out the new
+[custom functions](/usage/training#custom-functions) and architectures and refer
+to them in your config, all while still using spaCy's built-in `train` workflow.
+If you need to manage complex multi-step training workflows, check out the new
[spaCy projects](/usage/projects).
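For instance, a registered architecture in such a file could look like the
following minimal sketch. The registry name `"my_relu_layer.v1"` and the layer
it composes are made up for illustration and not part of spaCy itself:

```python
### functions.py (sketch)
import spacy
from thinc.api import Model, Relu, Dropout, chain

@spacy.registry.architectures("my_relu_layer.v1")
def create_relu_layer(width: int, dropout: float = 0.2) -> Model:
    # Compose a small Thinc model that a config can reference via
    # @architectures = "my_relu_layer.v1"
    return chain(Relu(nO=width), Dropout(dropout))
```

Passing the file on the command line, e.g. `--code functions.py`, makes sure the
function is registered before the config is resolved, so the config can refer to
it by name.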
<Infobox title="New in v3.0" variant="warning">
@@ -636,7 +636,7 @@ $ python -m spacy train [config_path] [--output] [--code] [--verbose] [overrides
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ |
| `--output`, `-o` | Directory to store model in. Will be created if it doesn't exist. ~~Optional[Path] \(positional)~~ |
-| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. ~~Optional[Path] \(option)~~ |
+| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
| `--verbose`, `-V` | Show more detailed messages during training. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
@@ -674,7 +674,7 @@ $ python -m spacy pretrain [texts_loc] [output_dir] [config_path] [--code] [--re
| `texts_loc` | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](/api/data-formats#pretrain) for details. ~~Path (positional)~~ |
| `output_dir` | Directory to write models to on each epoch. ~~Path (positional)~~ |
| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ |
-| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. ~~Optional[Path] \(option)~~ |
+| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
| `--resume-path`, `-r` | Path to pretrained weights from which to resume pretraining. ~~Optional[Path] \(option)~~ |
| `--epoch-resume`, `-er` | The epoch to resume counting from when using `--resume-path`. Prevents unintended overwriting of existing weight files. ~~Optional[int] \(option)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |

View File

@@ -40,13 +40,14 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("entity_linker", config=config)
> ```
-| Setting          | Description |
-| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ |
-| `incl_prior`     | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ |
-| `incl_context`   | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ |
-| `model`          | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ |
-| `kb`             | The [`KnowledgeBase`](/api/kb). Defaults to [EmptyKB](/api/architectures#EmptyKB), a function returning an empty `KnowledgeBase` with an `entity_vector_length` of `64`. ~~KnowledgeBase~~ |
+| Setting          | Description |
+| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ |
+| `incl_prior`     | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ |
+| `incl_context`   | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ |
+| `model`          | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ |
+| `kb_loader`      | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. Defaults to [EmptyKB](/api/architectures#EmptyKB), a function returning an empty `KnowledgeBase` with an `entity_vector_length` of `64`. ~~Callable[[Vocab], KnowledgeBase]~~ |
+| `get_candidates` | Function that generates plausible candidates for a given `Span` object. Defaults to [CandidateGenerator](/api/architectures#CandidateGenerator), a function looking up exact, case-dependent aliases in the KB. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
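For example, a pipeline could override a few of these settings when adding the
component. This is only a minimal sketch, assuming a blank English pipeline and
the default model and knowledge base:

```python
### Sketch: overriding a few settings
import spacy

nlp = spacy.blank("en")
# Only plain values are overridden here; `model` and `kb_loader` keep their defaults
config = {"labels_discard": ["CARDINAL", "ORDINAL"], "incl_prior": False}
entity_linker = nlp.add_pipe("entity_linker", config=config)
```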
```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
@@ -79,16 +80,17 @@ shortcut for this and instantiate the component using its string name and
`KnowledgeBase` as well as the Candidate generator can be customized by
providing custom registered functions.
-| Name             | Description |
-| ---------------- | --------------------------------------------------------------------------------------------------- |
-| `vocab`          | The shared vocabulary. ~~Vocab~~ |
-| `model`          | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model~~ |
-| `name`           | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
-| _keyword-only_   | | |
-| `kb`             | The [`KnowledgeBase`](/api/kb). ~~KnowledgeBase~~ |
-| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |
-| `incl_prior`     | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ |
-| `incl_context`   | Whether or not to include the local context in the model. ~~bool~~ |
+| Name             | Description |
+| ---------------- | -------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab`          | The shared vocabulary. ~~Vocab~~ |
+| `model`          | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model~~ |
+| `name`           | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
+| _keyword-only_   | | |
+| `kb_loader`      | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. ~~Callable[[Vocab], KnowledgeBase]~~ |
+| `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
+| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |
+| `incl_prior`     | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ |
+| `incl_context`   | Whether or not to include the local context in the model. ~~bool~~ |
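Both callables can be plain functions with the signatures described above,
typically registered so a config can refer to them by name. A rough sketch
follows: the entity and alias are invented, and the exact KB lookup may differ
for a custom `KnowledgeBase` implementation:

```python
### Sketch: kb_loader and get_candidates callables
from typing import Iterable
from spacy.kb import Candidate, KnowledgeBase
from spacy.tokens import Span
from spacy.vocab import Vocab

def my_kb_loader(vocab: Vocab) -> KnowledgeBase:
    # kb_loader: receives the shared Vocab and returns a populated KB
    kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
    kb.add_entity(entity="Q42", freq=12, entity_vector=[0.0] * 64)
    kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[1.0])
    return kb

def my_get_candidates(kb: KnowledgeBase, span: Span) -> Iterable[Candidate]:
    # get_candidates: exact, case-sensitive alias lookup in the KB
    return kb.get_alias_candidates(span.text)
```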
## EntityLinker.\_\_call\_\_ {#call tag="method"}

View File

@@ -295,23 +295,23 @@ factories.
> return Model("custom", forward, dims={"nO": nO})
> ```
-| Registry name     | Description |
-| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `architectures`   | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
-| `factories`       | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points) |
-| `tokenizers`      | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
-| `languages`       | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
-| `lookups`         | Registry for large lookup tables available via `vocab.lookups`. |
-| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
-| `assets`          | Registry for data assets, knowledge bases etc. |
-| `callbacks`       | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
-| `readers`         | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
-| `batchers`        | Registry for training and evaluation [data batchers](#batchers). |
-| `optimizers`      | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
-| `schedules`       | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
-| `layers`          | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
-| `losses`          | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
-| `initializers`    | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
+| Registry name     | Description |
+| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `architectures`   | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
+| `factories`       | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
+| `tokenizers`      | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
+| `languages`       | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
+| `lookups`         | Registry for large lookup tables available via `vocab.lookups`. |
+| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
+| `assets`          | Registry for data assets, knowledge bases etc. |
+| `callbacks`       | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
+| `readers`         | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
+| `batchers`        | Registry for training and evaluation [data batchers](#batchers). |
+| `optimizers`      | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
+| `schedules`       | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
+| `layers`          | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
+| `losses`          | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
+| `initializers`    | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
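All of these registries follow the same pattern: decorate a function to register
it under a name, and resolve it again by that name. A minimal sketch using the
`callbacks` registry follows; the name `"my_callbacks.add_case.v1"` and the
tokenizer tweak are purely illustrative:

```python
### Sketch: registering and resolving a custom callback
import spacy
from spacy.attrs import ORTH
from spacy.language import Language

@spacy.registry.callbacks("my_callbacks.add_case.v1")
def create_callback():
    def add_special_case(nlp: Language) -> Language:
        # Modify the nlp object in place before training starts
        nlp.tokenizer.add_special_case("spaCy", [{ORTH: "spaCy"}])
        return nlp
    return add_special_case

# Resolve the registered function by name, as the config system would
callback = spacy.registry.callbacks.get("my_callbacks.add_case.v1")()
nlp = callback(spacy.blank("en"))
```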
### spacy-transformers registry {#registry-transformers}
@@ -340,7 +340,17 @@ See the [`Transformer`](/api/transformer) API reference and
## Batchers {#batchers source="spacy/gold/batchers.py" new="3"}
-<!-- TODO: intro -->
+A data batcher implements a batching strategy that essentially turns a stream of
+items into a stream of batches, with each batch consisting of one item or a list
+of items. During training, the models update their weights after processing one
+batch at a time. Typical batching strategies include presenting the training
+data as a stream of batches with similar sizes, or with increasing batch sizes.
+See the Thinc documentation on
+[`schedules`](https://thinc.ai/docs/api-schedules) for a few standard examples.
+Instead of using one of the built-in batchers listed here, you can also
+[implement your own](/usage/training#custom-code-readers-batchers), which may or
+may not use a custom schedule.
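As a rough sketch of the shape a custom batcher can take (the registry name
`"my_batchers.fixed_size.v1"` is made up for illustration), the registered
function returns a callable that turns a stream of items into a stream of lists:

```python
### Sketch: a custom fixed-size batcher
from typing import Callable, Iterable, Iterator, List, TypeVar
import spacy

ItemT = TypeVar("ItemT")

@spacy.registry.batchers("my_batchers.fixed_size.v1")
def configure_fixed_size_batches(size: int) -> Callable[[Iterable[ItemT]], Iterator[List[ItemT]]]:
    def batch(items: Iterable[ItemT]) -> Iterator[List[ItemT]]:
        buffer: List[ItemT] = []
        for item in items:
            buffer.append(item)
            if len(buffer) == size:
                yield buffer
                buffer = []
        if buffer:
            yield buffer  # final, possibly smaller batch
    return batch
```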
#### batch_by_words.v1 {#batch_by_words tag="registered function"}

View File

@@ -155,8 +155,8 @@ other. For instance, to generate a packaged model, you might start by converting
your data, then run [`spacy train`](/api/cli#train) to train your model on the
converted data and if that's successful, run [`spacy package`](/api/cli#package)
to turn the best model artifact into an installable Python package. The
-following command run the workflow named `all` defined in the `project.yml`, and
-execute the commands it specifies, in order:
+following command runs the workflow named `all` defined in the `project.yml`, and
+executes the commands it specifies, in order:
```cli
$ python -m spacy project run all
@@ -199,7 +199,7 @@ https://github.com/explosion/spacy-boilerplates/blob/master/ner_fashion/project.
### Dependencies and outputs {#deps-outputs}
Each command defined in the `project.yml` can optionally define a list of
-dependencies and outputs. These are the files the commands requires and creates.
+dependencies and outputs. These are the files the command requires and creates.
For example, a command for training a model may depend on a
[`config.cfg`](/usage/training#config) and the training and evaluation data, and
it will export a directory `model-best`, containing the best model, which you

View File

@@ -5,7 +5,7 @@ menu:
- ['Introduction', 'basics']
- ['Quickstart', 'quickstart']
- ['Config System', 'config']
-- ['Custom Models', 'custom-models']
+- ['Custom Functions', 'custom-functions']
- ['Transfer Learning', 'transfer-learning']
- ['Parallel Training', 'parallel-training']
- ['Internal API', 'api']
@@ -127,7 +127,7 @@ Some of the main advantages and features of spaCy's training config are:
[optimizers](https://thinc.ai/docs/api-optimizers) or
[schedules](https://thinc.ai/docs/api-schedules) and define arguments that are
passed into them. You can also register your own functions to define
-[custom architectures](#custom-models), reference them in your config and
+[custom architectures](#custom-functions), reference them in your config and
tweak their parameters.
- **Interpolation.** If you have hyperparameters or other settings used by
multiple components, define them once and reference them as
@@ -299,7 +299,7 @@ case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding) defined
in the [function registry](/api/top-level#registry). All other values defined in
the block are passed to the function as keyword arguments when it's initialized.
You can also use this mechanism to register
-[custom implementations and architectures](#custom-models) and reference them
+[custom implementations and architectures](#custom-functions) and reference them
from your configs.
> #### How the config is resolved
@@ -481,9 +481,25 @@ still look good.
</Accordion>
-## Custom model implementations and architectures {#custom-models}
+## Custom Functions {#custom-functions}
-<!-- TODO: intro, should summarise what spaCy v3 can do and that you can now use fully custom implementations, models defined in PyTorch and TF, etc. etc. possibly link to new (not yet created) page on creating models -->
+Registered functions in the training config files can refer to built-in
+implementations, but you can also plug in fully custom implementations. To do
+so, you first write your own implementation of a custom architecture, data
+reader or any other functionality, and then register this function with the
+correct [registry](/api/top-level#registry). This allows you to plug in models
+defined in PyTorch or TensorFlow, make custom modifications to the `nlp` object,
+create custom optimizers or schedules, or write a function that streams in data
+and preprocesses it on the fly while training.
+
+Each custom function can have any number of arguments that are passed in
+through the config, just like the built-in functions. If your
+function defines **default argument values**, spaCy is able to auto-fill your
+config when you run [`init fill-config`](/api/cli#init-fill-config). If you want
+to make sure that a given parameter is always explicitly set in the config,
+avoid setting a default value for it.
+
+<!-- TODO: possibly link to new (not yet created) page on creating models ? -->
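For example, a custom schedule could be registered like in the following sketch
(the name `"my_schedules.slow_warmup.v1"` is invented for illustration). Since
`initial` defines a default value, [`init fill-config`](/api/cli#init-fill-config)
can auto-fill it, while `target` always has to be set explicitly in the config:

```python
### Sketch: a registered custom schedule
from typing import Iterator
import spacy

@spacy.registry.schedules("my_schedules.slow_warmup.v1")
def slow_warmup(target: float, initial: float = 0.0) -> Iterator[float]:
    # Yield an endless stream of values moving from `initial` towards `target`
    value = initial
    while True:
        yield value
        value += (target - value) * 0.01
```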
### Training with custom code {#custom-code}
@@ -642,11 +658,7 @@ In your config, you can now reference the schedule in the
starting with an `@`, it's interpreted as a reference to a function. All other
settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults and all arguments on
-the functions need to be represented in the config. If your function defines
-**default argument values**, spaCy is able to auto-fill your config when you run
-[`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
-given parameter is always explicitely set in the config, avoid setting a default
-value for it.
+the functions need to be represented in the config.
```ini
### config.cfg (excerpt)
@@ -733,7 +745,7 @@ the annotations are exactly the same.
```python
### functions.py
-from typing import Callable, Iterable, Iterator
+from typing import Callable, Iterable, Iterator, List
import spacy
from spacy.gold import Example

View File

@@ -44,7 +44,7 @@ menu:
</Infobox>
-### Custom models using any framework {#feautres-custom-models}
+### Custom models using any framework {#features-custom-models}
### Manage end-to-end workflows with projects {#features-projects}