diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index 2398cb632..626c1d858 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -509,8 +509,6 @@ page should be safe to use and we'll try to ensure backwards compatibility.
 However, we recommend having additional tests in place if your application
 depends on any of spaCy's utilities.
 
-
-
 ### util.get_lang_class {#util.get_lang_class tag="function"}
 
 Import and load a `Language` class. Allows lazy-loading
diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index 73ad88bcc..bc8c990e8 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -623,7 +623,7 @@ added to the pipeline:
 >
 > @Language.factory("my_component")
 > def my_component(nlp, name):
-> return MyComponent()
+>     return MyComponent()
 > ```
 
 | Argument | Description |
@@ -636,8 +636,6 @@ All other settings can be passed in by the user via the `config` argument on
 [`@Language.factory`](/api/language#factory) decorator also lets you define a
 `default_config` that's used as a fallback.
 
-
-
 ```python
 ### With config {highlight="4,9"}
 import spacy
@@ -688,7 +686,7 @@ make your factory a separate function. That's also how spaCy does it internally.
 
-### Example: Stateful component with settings
+### Example: Stateful component with settings {#example-stateful-components}
 
 This example shows a **stateful** pipeline component for handling acronyms:
 based on a dictionary, it will detect acronyms and their expanded forms in both
@@ -757,6 +755,85 @@ doc = nlp("LOL, be right back")
 print(doc._.acronyms)
 ```
 
+Many stateful components depend on **data resources** like dictionaries and
+lookup tables that should ideally be **configurable**. For example, it makes
+sense to make the `DICTIONARY` an argument of the registered function, so the
+`AcronymComponent` can be re-used with different data.
+One logical solution
+would be to make it an argument of the component factory, and allow it to be
+initialized with different dictionaries.
+
+> #### Example
+>
+> Making the data an argument of the registered function would result in output
+> like this in your `config.cfg`, which is typically not what you want (and only
+> works for JSON-serializable data).
+>
+> ```ini
+> [components.acronyms.dictionary]
+> lol = "laughing out loud"
+> brb = "be right back"
+> ```
+
+However, passing in the dictionary directly is problematic, because it means
+that if a component saves out its config and settings, the
+[`config.cfg`](/usage/training#config) will include a dump of the entire data,
+since that's the config the component was created with.
+
+```diff
+DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
+- default_config = {"dictionary": DICTIONARY}
+```
+
+If what you're passing in isn't JSON-serializable – e.g. a custom object like a
+[model](#trainable-components) – saving out the component config becomes
+impossible because there's no way for spaCy to know _how_ that object was
+created, and what to do to create it again. This makes it much harder to save,
+load and train custom models with custom components. A simple solution is to
+**register a function** that returns your resources. The
+[registry](/api/top-level#registry) lets you **map string names to functions**
+that create objects, so given a name and optional arguments, spaCy will know how
+to recreate the object.
+To register a function that returns a custom asset, you
+can use the `@spacy.registry.assets` decorator with a single argument, the name:
+
+```python
+### Registered function for assets {highlight="1"}
+@spacy.registry.assets("acronyms.slang_dict.v1")
+def create_acronyms_slang_dict():
+    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
+    dictionary.update({value: key for key, value in dictionary.items()})
+    return dictionary
+```
+
+In your `default_config` (and later in your
+[training config](/usage/training#config)), you can now refer to the function
+registered under the name `"acronyms.slang_dict.v1"` using the `@assets` key.
+This tells spaCy how to create the value, and when your component is created,
+the result of the registered function is passed in as the key `"dictionary"`.
+
+> #### config.cfg
+>
+> ```ini
+> [components.acronyms]
+> factory = "acronyms"
+>
+> [components.acronyms.dictionary]
+> @assets = "acronyms.slang_dict.v1"
+> ```
+
+```diff
+- default_config = {"dictionary": DICTIONARY}
++ default_config = {"dictionary": {"@assets": "acronyms.slang_dict.v1"}}
+```
+
+Using a registered function also means that you can easily include your custom
+components in models that you [train](/usage/training). To make sure spaCy knows
+where to find your custom `@assets` function, you can pass in a Python file via
+the argument `--code`. If someone else is using your component, all they have to
+do to customize the data is to register their own function and swap out the
+name. Registered functions can also take **arguments**, which can be
+defined in the config as well – you can read more about this in the docs on
+[training with custom code](/usage/training#custom-code).
+
 ### Python type hints and pydantic validation {#type-hints new="3"}
 
 spaCy's configs are powered by our machine learning library Thinc's
@@ -994,7 +1071,7 @@ loss is calculated and to add evaluation scores to the training output.
 | [`get_loss`](/api/pipe#get_loss) | Return a tuple of the loss and the gradient for a batch of [`Example`](/api/example) objects. |
 | [`score`](/api/pipe#score) | Score a batch of [`Example`](/api/example) objects and return a dictionary of scores. The [`@Language.factory`](/api/language#factory) decorator can define the `default_score_weights` of the component to decide which keys of the scores to display during training and how they count towards the final score. |
 
-
+
 
 ## Extension attributes {#custom-components-attributes new="2"}
 
diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md
index 123ef195e..30e4394d1 100644
--- a/website/docs/usage/projects.md
+++ b/website/docs/usage/projects.md
@@ -97,7 +97,7 @@ to download and where to put them. The
 [`spacy project assets`](/api/cli#project-assets) will fetch the project assets
 for you:
 
-```
+```cli
 $ cd some_example_project
 $ python -m spacy project assets
 ```
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index c4eaf1d88..739403625 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -414,7 +414,7 @@ recipe once the dish has already been prepared. You have to make a new one.
 spaCy includes a variety of built-in [architectures](/api/architectures) for
 different tasks. For example:
 
-
+
 
 | Architecture | Description |
 | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -776,12 +776,11 @@ mattis pretium.
 ### Defining custom architectures {#custom-architectures}
 
-
-
+
 
 ## Transfer learning {#transfer-learning}
 
-
+
 
 ### Using transformer models like BERT {#transformers}
 
@@ -811,7 +810,7 @@ config and customize the implementations, see the usage guide on
 
 ### Pretraining with spaCy {#pretraining}
 
-
+
 
 ## Parallel Training with Ray {#parallel-training}
 
@@ -836,9 +835,8 @@ spaCy gives you full control over the training loop. However, for most use
 cases, it's recommended to train your models via the
 [`spacy train`](/api/cli#train) command with a [`config.cfg`](#config) to keep
 track of your settings and hyperparameters, instead of writing your own training
-scripts from scratch.
-[Custom registered functions](/usage/training/#custom-code) should typically
-give you everything you need to train fully custom models with
+scripts from scratch. [Custom registered functions](#custom-code) should
+typically give you everything you need to train fully custom models with
 [`spacy train`](/api/cli#train).
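
The idea documented in the `processing-pipelines.md` hunk, that a config stores only the *name* of a registered function and the data is recreated from that name on load, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not spaCy's implementation: `register_asset`, `resolve` and `_ASSETS` are hypothetical names invented for this sketch, and only the `"@assets"` key and the `"acronyms.slang_dict.v1"` name come from the diff itself.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for a function registry (NOT spaCy's API):
# maps string names to functions that create objects.
_ASSETS: Dict[str, Callable[..., Any]] = {}


def register_asset(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that stores a function under a string name."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        _ASSETS[name] = func
        return func

    return decorator


def resolve(value: Any) -> Any:
    """Recursively replace {"@assets": name, ...} blocks with the function result."""
    if isinstance(value, dict):
        if "@assets" in value:
            # Remaining keys are treated as keyword arguments of the function.
            kwargs = {k: resolve(v) for k, v in value.items() if k != "@assets"}
            return _ASSETS[value["@assets"]](**kwargs)
        return {k: resolve(v) for k, v in value.items()}
    return value


@register_asset("acronyms.slang_dict.v1")
def create_acronyms_slang_dict() -> Dict[str, str]:
    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
    # Also map the expanded forms back to their acronyms.
    dictionary.update({value: key for key, value in dictionary.items()})
    return dictionary


# The config only holds the reference, so it stays small and serializable.
config = {"dictionary": {"@assets": "acronyms.slang_dict.v1"}}
resolved = resolve(config)
```

In this sketch `config` never contains the data itself, which mirrors the motivation given in the added docs: saving the config writes out only the function name, and swapping in different data means registering another function and changing that name.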