From cdec46493fd316338ef39a528a482184c4162a6f Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Wed, 5 Aug 2020 15:00:54 +0200
Subject: [PATCH] Update docs

---
 website/docs/api/language.md               |  31 +++---
 website/docs/api/top-level.md              |  17 +++-
 website/docs/usage/processing-pipelines.md |  56 +++++++++++
 website/docs/usage/training.md             | 104 +++++++++++++++++++--
 4 files changed, 180 insertions(+), 28 deletions(-)

diff --git a/website/docs/api/language.md b/website/docs/api/language.md
index ba62d0b13..7464a029e 100644
--- a/website/docs/api/language.md
+++ b/website/docs/api/language.md
@@ -363,7 +363,7 @@ that take a `Doc` object, modify it and return it. Only one of `before`,
As of v3.0, the [`Language.add_pipe`](/api/language#add_pipe) method doesn't
-take callables anymore and instead expects the name of a component factory
+take callables anymore and instead expects the **name of a component factory**
registered using [`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory). It now takes care of creating the
component, adds it to the pipeline and returns it.

@@ -379,20 +379,25 @@ component, adds it to the pipeline and returns it.
>
> nlp.add_pipe("component", before="ner")
> component = nlp.add_pipe("component", name="custom_name", last=True)
+>
+> # Add component from source model
+> source_nlp = spacy.load("en_core_web_sm")
+> nlp.add_pipe("ner", source=source_nlp)
> ```

-| Name | Type | Description |
-| -------------- | ---------------- | ----------- |
-| `factory_name` | str | Name of the registered component factory. |
-| `name` | str | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. |
-| _keyword-only_ | | |
-| `before` | str / int | Component name or index to insert component directly before. |
-| `after` | str / int | Component name or index to insert component directly after: |
-| `first` | bool | Insert component first / not first in the pipeline. |
-| `last` | bool | Insert component last / not last in the pipeline. |
-| `config` 3 | `Dict[str, Any]` | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. |
-| `validate` 3 | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. |
-| **RETURNS** 3 | callable | The pipeline component. |
+| Name | Type | Description |
+| -------------- | ---------------- | ----------- |
+| `factory_name` | str | Name of the registered component factory. |
+| `name` | str | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. |
+| _keyword-only_ | | |
+| `before` | str / int | Component name or index to insert component directly before. |
+| `after` | str / int | Component name or index to insert component directly after. |
+| `first` | bool | Insert component first / not first in the pipeline. |
+| `last` | bool | Insert component last / not last in the pipeline. |
+| `config` 3 | `Dict[str, Any]` | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. |
+| `source` 3 | `Language` | Optional source model to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source model match the target model. |
+| `validate` 3 | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. |
+| **RETURNS** 3 | callable | The pipeline component. |
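
The `config` entries are merged with the factory's `default_config` and passed
to the factory when the component is created. A minimal sketch of how the two
interact – the factory name `"my_component"` and its `mode` setting are made up
for illustration:

```python
import spacy
from spacy.language import Language

@Language.factory("my_component", default_config={"mode": "lower"})
def create_my_component(nlp, name, mode):
    # The factory receives the nlp object, the instance name and the merged
    # config ("mode" defaults to "lower" unless overridden via add_pipe)
    def my_component(doc):
        return doc  # placeholder: a real component would modify the doc
    return my_component

nlp = spacy.blank("en")
component = nlp.add_pipe("my_component", config={"mode": "strict"})
```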

## Language.has_factory {#has_factory tag="classmethod" new="3"}

diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index 368b58a9b..2ebdb911e 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -4,6 +4,7 @@
menu:
  - ['spacy', 'spacy']
  - ['displacy', 'displacy']
  - ['registry', 'registry']
+  - ['Loaders & Batchers', 'loaders-batchers']
  - ['Data & Alignment', 'gold']
  - ['Utility Functions', 'util']
---

@@ -34,6 +35,7 @@ loaded in via [`Language.from_disk`](/api/language#from_disk).

| Name | Type | Description |
| ------------------------------------------ | ----------------- | ----------- |
| `name` | str / `Path` | Model to load, i.e. package name or path. |
+| _keyword-only_ | | |
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| `component_cfg` 3 | `Dict[str, dict]` | Optional config overrides for pipeline components, keyed by component names. |
| **RETURNS** | `Language` | A `Language` object with the loaded model. |

@@ -83,11 +85,12 @@ meta data as a dictionary instead, you can use the `meta` attribute on your

> markdown = spacy.info(markdown=True, silent=True)
> ```

-| Name | Type | Description |
-| ---------- | ---- | ----------- |
-| `model` | str | A model, i.e. a package name or path (optional). |
-| `markdown` | bool | Print information as Markdown. |
-| `silent` | bool | Don't print anything, just return. |
+| Name | Type | Description |
+| -------------- | ---- | ----------- |
+| `model` | str | A model, i.e. a package name or path (optional). |
+| _keyword-only_ | | |
+| `markdown` | bool | Print information as Markdown. |
+| `silent` | bool | Don't print anything, just return. |

### spacy.explain {#spacy.explain tag="function"}

@@ -331,6 +334,10 @@ See the [`Transformer`](/api/transformer) API reference and

| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |

+## Training data loaders and batchers {#loaders-batchers new="3"}
+
+

## Training data and alignment {#gold source="spacy/gold"}

### gold.docs_to_json {#docs_to_json tag="function"}

diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index 6388529f6..7c47c0c73 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -311,6 +311,62 @@
nlp.rename_pipe("ner", "entityrecognizer")
nlp.replace_pipe("tagger", my_custom_tagger)
```

+### Sourcing pipeline components from existing models {#sourced-components new="3"}

Pipeline components that are independent can also be reused across models.
Instead of adding a new blank component to a pipeline, you can copy an existing
component from a pretrained model by setting the `source` argument on
[`nlp.add_pipe`](/api/language#add_pipe). The first argument will then be
interpreted as the name of the component in the source pipeline – for instance,
`"ner"`. This is especially useful for
[training a model](/usage/training#config-components) because it lets you mix
and match components and create fully custom model packages with updated
pretrained components and new components trained on your data.

When reusing components across models, keep in mind that the **vocabulary**,
**vectors** and model settings **must match**. If a pretrained model includes
[word vectors](/usage/vectors-embeddings) and the component uses them as
features, the model you copy it to needs to have the _same_ vectors available –
otherwise, it won't be able to make the same predictions.

> #### In training config
>
> Instead of providing a `factory`, component blocks in the training
> [config](/usage/training#config) can also define a `source`. The string needs
> to be a loadable spaCy model package or path.
>
> ```ini
> [components.ner]
> source = "en_core_web_sm"
> component = "ner"
> ```
>
> By default, sourced components will be updated with your data during
> training. If you want to preserve the component as-is, you can "freeze" it:
>
> ```ini
> [training]
> frozen_components = ["ner"]
> ```

```python
### {executable="true"}
import spacy

# The source model with different components
source_nlp = spacy.load("en_core_web_sm")
print(source_nlp.pipe_names)

# Add only the entity recognizer to the new blank model
nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
print(nlp.pipe_names)
```
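
The sourced component keeps the weights and label scheme it was trained with,
so it is usable right away. A small sketch, continuing the example above, to
check what was copied:

```python
# Inspect the copied component: it still knows the labels it was trained on
ner = nlp.get_pipe("ner")
print(ner.labels)
```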

### Analyzing pipeline components {#analysis new="3"}

The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the

diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index 955e484fb..7c9d50921 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -149,12 +149,14 @@ not just define static settings, but also construct objects like architectures,
schedules, optimizers or any other custom components. The main top-level
sections of a config file are:

-| Section | Description |
-| ------------- | ----------- |
-| `training` | Settings and controls for the training and evaluation process. |
-| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
-| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
-| `components` | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. |
+| Section | Description |
+| ------------- | ----------- |
+| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
+| `components` | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. |
+| `paths` | Paths to data and other assets. Can be re-used across the config as variables, e.g. `${paths:train}`, and [overwritten](#config-overrides) on the CLI. |
+| `system` | Settings related to system and hardware. |
+| `training` | Settings and controls for the training and evaluation process. |
+| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |

available for the different architectures are documented with the

-#### Overwriting config settings on the command line {#config-overrides}
+### Overwriting config settings on the command line {#config-overrides}

The config system means that you can define all settings **in one place** and
in a consistent format. There are no command-line arguments that need to be set,

of the training, the final filled `config.cfg` is exported with your model, so
you'll always have a record of the settings that were used, including your
overrides.

-#### Using registered functions {#config-functions}
+### Defining pipeline components {#config-components}

When you train a model, you typically train a
[pipeline](/usage/processing-pipelines) of **one or more components**. The
`[components]` block in the config defines the available pipeline components
and how they should be created – either by a built-in or custom
[factory](/usage/processing-pipelines#built-in), or
[sourced](/usage/processing-pipelines#sourced-components) from an existing
pretrained model. For example, `[components.parser]` defines the component
named `"parser"` in the pipeline. There are different ways you might want to
treat your components during training, and the most common scenarios are:

1. Train a **new component** from scratch on your data.
2. Update an existing **pretrained component** with more examples.
3. Include an existing pretrained component without updating it.
4. Include a non-trainable component, like a rule-based
   [`EntityRuler`](/api/entityruler) or [`Sentencizer`](/api/sentencizer), or a
   fully [custom component](/usage/processing-pipelines#custom-components).

If a component block defines a `factory`, spaCy will look it up in the
[built-in](/usage/processing-pipelines#built-in) or
[custom](/usage/processing-pipelines#custom-components) components and create a
new component from scratch. All settings defined in the config block will be
passed to the component factory as arguments. This lets you configure the model
settings and hyperparameters. If a component block defines a `source`, the
component will be copied over from an existing pretrained model, with its
existing weights. This lets you include an already trained component in your
model pipeline, or update a pretrained component with more data specific to
your use case.

```ini
### config.cfg (excerpt)
[components]

# "parser" and "ner" are sourced from a pretrained model
[components.parser]
source = "en_core_web_sm"

[components.ner]
source = "en_core_web_sm"

# "textcat" and "custom" are created blank from a built-in / custom factory
[components.textcat]
factory = "textcat"

[components.custom]
factory = "your_custom_factory"
your_custom_setting = true
```

The `pipeline` setting in the `[nlp]` block defines the components added to
the pipeline, in order. For example, `"parser"` here references
`[components.parser]`. By default, spaCy will **update all components that can
be updated**. Trainable components that are created from scratch are
initialized with random weights. For sourced components, spaCy will keep the
existing weights and [resume training](/api/language#resume_training).

If you don't want a component to be updated, you can **freeze** it by adding
it to the `frozen_components` list in the `[training]` block. Frozen
components are **not updated** during training and are included in the final
trained model as-is.

> #### Note on frozen components
>
> Even though frozen components are not **updated** during training, they will
> still **run** during training and evaluation. This is very important, because
> they may still impact your model's performance – for instance, a sentence
> boundary detector can impact what the parser or entity recognizer considers a
> valid parse. So the evaluation results should always reflect what your model
> will produce at runtime.

```ini
[nlp]
lang = "en"
pipeline = ["parser", "ner", "textcat", "custom"]

[training]
frozen_components = ["parser", "custom"]
```

### Using registered functions {#config-functions}

The training configuration defined in the config file doesn't have to consist
only of static values. Some settings can also be **functions**. For instance,

@@ -373,7 +455,9 @@ In your config, you can now reference the schedule in the
starting with an `@`, it's interpreted as a reference to a function. All other
settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults and all arguments on
the functions need to be represented in the config. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you
run [`init config`](/api/cli#init-config).

using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your
registered function provides type hints, the values that are passed in will be
checked against the expected types. For example, `start: int` in the example
above will ensure that the value received as the argument `start` is an
integer. If the
-value can't be cast to an integer, spaCy will raise an error.
+value can't be coerced into an integer, spaCy will raise an error.
`start: pydantic.StrictInt` will force the value to be an integer and raise an
error if it's not – for instance, if your config defines a float.
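
To make this concrete, a registered schedule with type hints might look like
the following – a minimal sketch, where the name `"my_custom_schedule.v1"` and
the `[training.batch_size]` placement are only illustrative:

```python
### functions.py
import spacy

@spacy.registry.schedules("my_custom_schedule.v1")
def my_custom_schedule(start: int = 1, factor: float = 1.001):
    # Schedules are generators that yield an infinite stream of values,
    # one per step - here a slowly compounding batch size
    while True:
        yield start
        start = start * factor
```

```ini
### config.cfg (excerpt)
[training.batch_size]
@schedules = "my_custom_schedule.v1"
start = 2
factor = 1.005
```

Because `start` and `factor` declare both type hints and default values, the
config values are validated against the annotations, and a partial config can
be auto-filled with the defaults.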