diff --git a/spacy/cli/templates/quickstart_training_recommendations.yml b/spacy/cli/templates/quickstart_training_recommendations.yml index efb6da2be..206e69954 100644 --- a/spacy/cli/templates/quickstart_training_recommendations.yml +++ b/spacy/cli/templates/quickstart_training_recommendations.yml @@ -1,5 +1,5 @@ # Recommended settings and available resources for each language, if available. -# Not all languages have recommended word vecotrs or transformers and for some, +# Not all languages have recommended word vectors or transformers and for some, # the recommended transformer for efficiency and accuracy may be the same. en: word_vectors: en_vectors_web_lg diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 446e6c7c3..25a44245d 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -13,7 +13,7 @@ menu: TODO: intro and how architectures work, link to [`registry`](/api/top-level#registry), -[custom models](/usage/training#custom-models) usage etc. +[custom functions](/usage/training#custom-functions) usage etc. ## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"} @@ -622,7 +622,7 @@ others, but may not be as accurate, especially if texts are short. An [`EntityLinker`](/api/entitylinker) component disambiguates textual mentions (tagged as named entities) to unique identifiers, grounding the named entities -into the "real world". This requires 3 main component +into the "real world". This requires 3 main components: - A [`KnowledgeBase`](/api/kb) (KB) holding the unique identifiers, potential synonyms and prior probabilities. diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index a86c920ad..9cadb2f0f 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -276,7 +276,7 @@ python -m spacy init fill-config tmp/starter-config_invalid.cfg --base tmp/start | Name | Description | | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | -| `--code_path`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. ~~Optional[Path] \(option)~~ | +| `--code_path`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | | overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ | | **PRINTS** | Config validation errors, if available. | @@ -448,7 +448,7 @@ will not be available. | Name | Description | | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | -| `--code`, `-c` | Path to Python file with additional code to be imported. 
Allows [registering custom functions](/usage/training#custom-models) for new architectures. ~~Optional[Path] \(option)~~ | +| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | | `--ignore-warnings`, `-IW` | Ignore warnings, only show stats and errors. ~~bool (flag)~~ | | `--verbose`, `-V` | Print additional information and explanations. ~~bool (flag)~~ | | `--no-format`, `-NF` | Don't pretty-print the results. Use this if you want to write to a file. ~~bool (flag)~~ | @@ -612,9 +612,9 @@ Train a model. Expects data in spaCy's Will save out the best model from all epochs, as well as the final model. The `--code` argument can be used to provide a Python file that's imported before the training process starts. This lets you register -[custom functions](/usage/training#custom-models) and architectures and refer to -them in your config, all while still using spaCy's built-in `train` workflow. If -you need to manage complex multi-step training workflows, check out the new +[custom functions](/usage/training#custom-functions) and architectures and refer +to them in your config, all while still using spaCy's built-in `train` workflow. +If you need to manage complex multi-step training workflows, check out the new [spaCy projects](/usage/projects). @@ -636,7 +636,7 @@ $ python -m spacy train [config_path] [--output] [--code] [--verbose] [overrides | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | | `--output`, `-o` | Directory to store model in. Will be created if it doesn't exist. ~~Optional[Path] \(positional)~~ | -| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. ~~Optional[Path] \(option)~~ | +| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | | `--verbose`, `-V` | Show more detailed messages during training. ~~bool (flag)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | | overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ | @@ -674,7 +674,7 @@ $ python -m spacy pretrain [texts_loc] [output_dir] [config_path] [--code] [--re | `texts_loc` | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](/api/data-formats#pretrain) for details. ~~Path (positional)~~ | | `output_dir` | Directory to write models to on each epoch. ~~Path (positional)~~ | | `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | -| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. 
~~Optional[Path] \(option)~~ | +| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | | `--resume-path`, `-r` | Path to pretrained weights from which to resume pretraining. ~~Optional[Path] \(option)~~ | | `--epoch-resume`, `-er` | The epoch to resume counting from when using `--resume-path`. Prevents unintended overwriting of existing weight files. ~~Optional[int] \(option)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | diff --git a/website/docs/api/entitylinker.md b/website/docs/api/entitylinker.md index a1bc52199..679c3c0c2 100644 --- a/website/docs/api/entitylinker.md +++ b/website/docs/api/entitylinker.md @@ -40,13 +40,14 @@ architectures and their arguments and hyperparameters. > nlp.add_pipe("entity_linker", config=config) > ``` -| Setting | Description | -| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ | -| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ | -| `incl_context` | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ | -| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ | -| `kb` | The [`KnowledgeBase`](/api/kb). Defaults to [EmptyKB](/api/architectures#EmptyKB), a function returning an empty `KnowledgeBase` with an `entity_vector_length` of `64`. ~~KnowledgeBase~~ | +| Setting | Description | +| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ | +| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ | +| `incl_context` | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ | +| `kb_loader` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. Defaults to [EmptyKB](/api/architectures#EmptyKB), a function returning an empty `KnowledgeBase` with an `entity_vector_length` of `64`. ~~Callable[[Vocab], KnowledgeBase]~~ | +| `get_candidates` | Function that generates plausible candidates for a given `Span` object. Defaults to [CandidateGenerator](/api/architectures#CandidateGenerator), a function looking up exact, case-dependent aliases in the KB. 
~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ | ```python https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py @@ -79,16 +80,17 @@ shortcut for this and instantiate the component using its string name and `KnowledgeBase` as well as the Candidate generator can be customized by providing custom registered functions. -| Name | Description | -| ---------------- | --------------------------------------------------------------------------------------------------- | -| `vocab` | The shared vocabulary. ~~Vocab~~ | -| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model~~ | -| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | -| _keyword-only_ | | | -| `kb` | The [`KnowledgeBase`](/api/kb). ~~KnowledgeBase~~ | -| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ | -| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ | -| `incl_context` | Whether or not to include the local context in the model. ~~bool~~ | +| Name | Description | +| ---------------- | -------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| _keyword-only_ | | | +| `kb_loader` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. ~~Callable[[Vocab], KnowledgeBase]~~ | +| `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ | +| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ | +| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ | +| `incl_context` | Whether or not to include the local context in the model. ~~bool~~ | ## EntityLinker.\_\_call\_\_ {#call tag="method"} diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 646b685f0..b33d7f022 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -295,23 +295,23 @@ factories. > return Model("custom", forward, dims={"nO": nO}) > ``` -| Registry name | Description | -| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | -| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points) | -| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. | -| `languages` | Registry for language-specific `Language` subclasses. 
Automatically reads from [entry points](/usage/saving-loading#entry-points). | -| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | -| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | -| `assets` | Registry for data assets, knowledge bases etc. | -| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | -| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | -| `batchers` | Registry for training and evaluation [data batchers](#batchers). | -| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | -| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | -| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | -| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | -| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | +| Registry name | Description | +| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | +| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | +| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. | +| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | +| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `assets` | Registry for data assets, knowledge bases etc. | +| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | +| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | +| `batchers` | Registry for training and evaluation [data batchers](#batchers). | +| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | +| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | +| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | +| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | +| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). 
| ### spacy-transformers registry {#registry-transformers} @@ -340,7 +340,17 @@ See the [`Transformer`](/api/transformer) API reference and ## Batchers {#batchers source="spacy/gold/batchers.py" new="3"} - +A data batcher implements a batching strategy that essentially turns a stream of +items into a stream of batches, with each batch consisting of one item or a list +of items. During training, the models update their weights after processing one +batch at a time. Typical batching strategies include presenting the training +data as a stream of batches with similar sizes, or with increasing batch sizes. +See the Thinc documentation on +[`schedules`](https://thinc.ai/docs/api-schedules) for a few standard examples. + +Instead of using one of the built-in batchers listed here, you can also +[implement your own](/usage/training#custom-code-readers-batchers), which may or +may not use a custom schedule. #### batch_by_words.v1 {#batch_by_words tag="registered function"} diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md index 41f0357ca..30e4394d1 100644 --- a/website/docs/usage/projects.md +++ b/website/docs/usage/projects.md @@ -155,8 +155,8 @@ other. For instance, to generate a packaged model, you might start by converting your data, then run [`spacy train`](/api/cli#train) to train your model on the converted data and if that's successful, run [`spacy package`](/api/cli#package) to turn the best model artifact into an installable Python package. The -following command run the workflow named `all` defined in the `project.yml`, and -execute the commands it specifies, in order: +following command runs the workflow named `all` defined in the `project.yml`, and +executes the commands it specifies, in order: ```cli $ python -m spacy project run all @@ -199,7 +199,7 @@ https://github.com/explosion/spacy-boilerplates/blob/master/ner_fashion/project. ### Dependencies and outputs {#deps-outputs} Each command defined in the `project.yml` can optionally define a list of -dependencies and outputs. These are the files the commands requires and creates. +dependencies and outputs. These are the files the command requires and creates. For example, a command for training a model may depend on a [`config.cfg`](/usage/training#config) and the training and evaluation data, and it will export a directory `model-best`, containing the best model, which you diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index f7a74bbcc..1579e61ea 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -5,7 +5,7 @@ menu: - ['Introduction', 'basics'] - ['Quickstart', 'quickstart'] - ['Config System', 'config'] - - ['Custom Models', 'custom-models'] + - ['Custom Functions', 'custom-functions'] - ['Transfer Learning', 'transfer-learning'] - ['Parallel Training', 'parallel-training'] - ['Internal API', 'api'] @@ -127,7 +127,7 @@ Some of the main advantages and features of spaCy's training config are: [optimizers](https://thinc.ai/docs/api-optimizers) or [schedules](https://thinc.ai/docs/api-schedules) and define arguments that are passed into them. You can also register your own functions to define - [custom architectures](#custom-models), reference them in your config and + [custom architectures](#custom-functions), reference them in your config and tweak their parameters. 
- **Interpolation.** If you have hyperparameters or other settings used by multiple components, define them once and reference them as @@ -299,7 +299,7 @@ case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding) defined in the [function registry](/api/top-level#registry). All other values defined in the block are passed to the function as keyword arguments when it's initialized. You can also use this mechanism to register -[custom implementations and architectures](#custom-models) and reference them +[custom implementations and architectures](#custom-functions) and reference them from your configs. > #### How the config is resolved @@ -481,9 +481,25 @@ still look good. -## Custom model implementations and architectures {#custom-models} +## Custom Functions {#custom-functions} - +Registered functions in the training config files can refer to built-in +implementations, but you can also plug in fully custom implementations. To do +so, you first write your own implementation of a custom architecture, data +reader or any other functionality, and then register this function with the +correct [registry](/api/top-level#registry). This allows you to plug in models +defined in PyTorch or TensorFlow, make custom modifications to the `nlp` object, +create custom optimizers or schedules, or write a function that streams in data +and preprocesses it on the fly while training. + +Each custom function can have any number of arguments that are passed +into it through the config, just as with the built-in functions. If your +function defines **default argument values**, spaCy is able to auto-fill your +config when you run [`init fill-config`](/api/cli#init-fill-config). If you want +to make sure that a given parameter is always explicitly set in the config, +avoid setting a default value for it. + + ### Training with custom code {#custom-code} @@ -642,11 +658,7 @@ In your config, you can now reference the schedule in the starting with an `@`, it's interpreted as a reference to a function. All other settings in the block will be passed to the function as keyword arguments. Keep in mind that the config shouldn't have any hidden defaults and all arguments on -the functions need to be represented in the config. If your function defines -**default argument values**, spaCy is able to auto-fill your config when you run -[`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a -given parameter is always explicitely set in the config, avoid setting a default -value for it. +the functions need to be represented in the config. ```ini ### config.cfg (excerpt) @@ -733,7 +745,7 @@ the annotations are exactly the same. ```python ### functions.py -from typing import Callable, Iterable, Iterator +from typing import Callable, Iterable, Iterator, List import spacy from spacy.gold import Example diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index ffed1c89f..837818a83 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -44,7 +44,7 @@ menu: -### Custom models using any framework {#feautres-custom-models} +### Custom models using any framework {#features-custom-models} ### Manage end-to-end workflows with projects {#features-projects}
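As a quick illustration of the registered-functions pattern that the renamed "Custom Functions" section describes, the sketch below registers a custom schedule under spaCy's `schedules` registry. The function name `my_custom_schedule.v1` and its default values are assumptions chosen for the example, not part of the patch above.

```python
# Minimal sketch, assuming spaCy v3's `spacy.registry.schedules` registry.
# The function name and default values are illustrative only.
import spacy


@spacy.registry.schedules("my_custom_schedule.v1")
def my_custom_schedule(start: int = 30, factor: float = 1.005):
    # Yield an infinite stream of compounding values, e.g. for batch sizes.
    while True:
        yield start
        start = start * factor
```

Because both arguments define defaults, `init fill-config` should be able to auto-fill the corresponding config block; a config would reference the function with `@schedules = "my_custom_schedule.v1"` and override `start` or `factor` as keyword arguments.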
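Along the same lines, the `kb_loader` argument documented for `EntityLinker` expects a `Callable[[Vocab], KnowledgeBase]`. A minimal sketch of such a loader is shown below, assuming it is registered under the `assets` registry from the registry table; the registry choice, the function name `my_kb.v1` and the example entity values are illustrative assumptions.

```python
# Sketch of a custom kb_loader matching Callable[[Vocab], KnowledgeBase].
# The registry (`assets`), function name and example entity are assumptions.
import spacy
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab


@spacy.registry.assets("my_kb.v1")
def create_kb_loader(entity_vector_length: int = 64):
    def kb_loader(vocab: Vocab) -> KnowledgeBase:
        kb = KnowledgeBase(vocab=vocab, entity_vector_length=entity_vector_length)
        # Add a single example entity and alias so the KB isn't empty.
        kb.add_entity(entity="Q42", freq=100, entity_vector=[0.0] * entity_vector_length)
        kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])
        return kb

    return kb_loader
```

In a config, this would be referenced from the component's `kb_loader` block; the exact block layout depends on the component defaults, so treat this as a sketch rather than a drop-in.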
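Finally, the batchers section describes functions that turn a stream of items into a stream of batches. A custom batcher registered under the `batchers` registry might look roughly like the following; the name `fixed_size_batcher.v1` and the fixed-size strategy are assumptions for illustration, not one of the built-ins documented above.

```python
# Sketch of a custom batcher: a registered function returning a callable that
# turns a stream of items into fixed-size lists. Name and strategy are
# illustrative only.
from typing import Iterable, Iterator, List, TypeVar

import spacy

ItemT = TypeVar("ItemT")


@spacy.registry.batchers("fixed_size_batcher.v1")
def configure_fixed_size_batcher(size: int = 8):
    def batcher(items: Iterable[ItemT]) -> Iterator[List[ItemT]]:
        batch: List[ItemT] = []
        for item in items:
            batch.append(item)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch

    return batcher
```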