mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-22 15:24:11 +03:00
6b07be2110
* Add `Language.distill` This method is the distillation counterpart of `Language.update`. It takes a teacher `Language` instance and distills the student pipes on the teacher pipes. * Apply suggestions from code review Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Clarify that how Example is used in distillation * Update transition parser distill docstring for examples argument * Pass optimizer to `TrainablePipe.distill` * Annotate pipe before update As discussed internally, we want to let a pipe annotate before doing an update with gold/silver data. Otherwise, the output may be (too) informed by the gold/silver data. * Rename `component_map` to `student_to_teacher` * Better synopsis in `Language.distill` docstring * `name` -> `student_name` * Fix labels type in docstring * Mark distill test as slow * Fix `student_to_teacher` type in docs --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
1160 lines
78 KiB
Plaintext
---
title: Language
teaser: A text-processing pipeline
tag: class
source: spacy/language.py
---

Usually you'll load this once per process as `nlp` and pass the instance around
your application. The `Language` class is created when you call
[`spacy.load`](/api/top-level#spacy.load) and contains the shared vocabulary and
[language data](/usage/linguistic-features#language-data), optional binary
weights, e.g. provided by a [trained pipeline](/models), and the
[processing pipeline](/usage/processing-pipelines) containing components like
the tagger or parser that are called on a document in order. You can also add
your own processing pipeline components that take a `Doc` object, modify it and
return it.

## Language.\_\_init\_\_ {id="init",tag="method"}

Initialize a `Language` object. Note that the `meta` is only used for meta
information in [`Language.meta`](/api/language#meta) and not to configure the
`nlp` object or to override the config. To initialize from a config, use
[`Language.from_config`](/api/language#from_config) instead.

> #### Example
>
> ```python
> # Construction from subclass
> from spacy.lang.en import English
> nlp = English()
>
> # Construction from scratch
> from spacy.vocab import Vocab
> from spacy.language import Language
> nlp = Language(Vocab())
> ```

| Name | Description |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------ |
| `vocab` | A `Vocab` object. If `True`, a vocab is created using the default language data settings. ~~Vocab~~ |
| _keyword-only_ | |
| `max_length` | Maximum number of characters allowed in a single text. Defaults to `10 ** 6`. ~~int~~ |
| `meta` | [Meta data](/api/data-formats#meta) overrides. ~~Dict[str, Any]~~ |
| `create_tokenizer` | Optional function that receives the `nlp` object and returns a tokenizer. ~~Callable[[Language], Callable[[str], Doc]]~~ |
| `batch_size` | Default batch size for [`pipe`](#pipe) and [`evaluate`](#evaluate). Defaults to `1000`. ~~int~~ |

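
As a small illustration of the keyword-only arguments above, the sketch below
passes `max_length` and `meta` to the `English` subclass. The values and the
`"demo_pipeline"` name are made up for the example.

```python
from spacy.lang.en import English

# Hypothetical values, chosen only for illustration.
nlp = English(max_length=2_000_000, meta={"name": "demo_pipeline"})

# max_length is stored on the object; meta only feeds nlp.meta.
print(nlp.max_length)
print(nlp.meta["name"])

# Passing meta does not add any pipeline components.
print(nlp.pipe_names)
```
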
## Language.from_config {id="from_config",tag="classmethod",version="3"}

Create a `Language` object from a loaded config. Will set up the tokenizer and
language data, and add pipeline components based on the pipeline and the
component definitions specified in the config. If no config is provided, the
default config of the given language is used. This is also how spaCy loads a
model under the hood based on its [`config.cfg`](/api/data-formats#config).

> #### Example
>
> ```python
> from thinc.api import Config
> from spacy.language import Language
>
> config = Config().from_disk("./config.cfg")
> nlp = Language.from_config(config)
> ```

| Name | Description |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `config` | The loaded config. ~~Union[Dict[str, Any], Config]~~ |
| _keyword-only_ | |
| `vocab` | A `Vocab` object. If `True`, a vocab is created using the default language data settings. ~~Vocab~~ |
| `disable` | Name(s) of pipeline component(s) to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). Is merged with the config entry `nlp.disabled`. ~~Union[str, Iterable[str]]~~ |
| `enable` <Tag variant="new">3.4</Tag> | Name(s) of pipeline component(s) to [enable](/usage/processing-pipelines#disabling). All other pipes will be disabled, but can be enabled again using [nlp.enable_pipe](/api/language#enable_pipe). ~~Union[str, Iterable[str]]~~ |
| `exclude` | Name(s) of pipeline component(s) to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~Union[str, Iterable[str]]~~ |
| `meta` | [Meta data](/api/data-formats#meta) overrides. ~~Dict[str, Any]~~ |
| `auto_fill` | Whether to automatically fill in missing values in the config, based on defaults and function argument annotations. Defaults to `True`. ~~bool~~ |
| `validate` | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
| **RETURNS** | The initialized object. ~~Language~~ |

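
To make the difference between loading and running a component concrete, here
is a minimal sketch that builds a config in memory and passes `disable`. It
assumes the `English` subclass (so the class language matches `lang = "en"`)
and the built-in `sentencizer` factory. The dict itself is illustrative and
relies on `auto_fill` to supply the remaining sections.

```python
from spacy.lang.en import English

# Minimal config as a plain dict; missing sections are auto-filled.
config = {
    "nlp": {"lang": "en", "pipeline": ["sentencizer"]},
    "components": {"sentencizer": {"factory": "sentencizer"}},
}

# The sentencizer is loaded but kept disabled.
nlp = English.from_config(config, disable=["sentencizer"])
active_before = list(nlp.pipe_names)    # components that would run
loaded = list(nlp.component_names)      # all loaded components

# Disabled components can be switched back on later.
nlp.enable_pipe("sentencizer")
active_after = list(nlp.pipe_names)
print(active_before, loaded, active_after)
```
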
## Language.component {id="component",tag="classmethod",version="3"}

Register a custom pipeline component under a given name. This allows
initializing the component by name using
[`Language.add_pipe`](/api/language#add_pipe) and referring to it in
[config files](/usage/training#config). This classmethod and decorator is
intended for **simple stateless functions** that take a `Doc` and return it. For
more complex stateful components that allow settings and need access to the
shared `nlp` object, use the [`Language.factory`](/api/language#factory)
decorator. For more details and examples, see the
[usage documentation](/usage/processing-pipelines#custom-components).

> #### Example
>
> ```python
> from spacy.language import Language
>
> # Usage as a decorator
> @Language.component("my_component")
> def my_component(doc):
>     # Do something to the doc
>     return doc
>
> # Usage as a function
> Language.component("my_component2", func=my_component)
> ```

| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `name` | The name of the component factory. ~~str~~ |
| _keyword-only_ | |
| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
| `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~ |
| `func` | Optional function if not used as a decorator. ~~Optional[Callable[[Doc], Doc]]~~ |

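
The following sketch shows a registered stateless component running end to end
in a blank English pipeline. The component name `token_counter_demo` and the
`n_tokens` key are hypothetical, chosen only for this example.

```python
import spacy
from spacy.language import Language


# Hypothetical component: records the token count on the Doc.
@Language.component("token_counter_demo")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("token_counter_demo")  # added by its registered name

doc = nlp("Hello world")
print(doc.user_data["n_tokens"])
```
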
## Language.factory {id="factory",tag="classmethod"}

Register a custom pipeline component factory under a given name. This allows
initializing the component by name using
[`Language.add_pipe`](/api/language#add_pipe) and referring to it in
[config files](/usage/training#config). The registered factory function needs to
take at least two **named arguments** which spaCy fills in automatically: `nlp`
for the current `nlp` object and `name` for the component instance name. This
can be useful to distinguish multiple instances of the same component and allows
trainable components to add custom losses using the component instance name. The
`default_config` defines the default values of the remaining factory arguments.
It's merged into the [`nlp.config`](/api/language#config). For more details and
examples, see the
[usage documentation](/usage/processing-pipelines#custom-components).

> #### Example
>
> ```python
> from spacy.language import Language
>
> # Usage as a decorator
> @Language.factory(
>     "my_component",
>     default_config={"some_setting": True},
> )
> def create_my_component(nlp, name, some_setting):
>     return MyComponent(some_setting)
>
> # Usage as function
> Language.factory(
>     "my_component",
>     default_config={"some_setting": True},
>     func=create_my_component
> )
> ```

| Name | Description |
| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name` | The name of the component factory. ~~str~~ |
| _keyword-only_ | |
| `default_config` | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~ |
| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
| `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~ |
| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. If a weight is set to `None`, the score will not be logged or weighted. ~~Dict[str, Optional[float]]~~ |
| `func` | Optional function if not used as a decorator. ~~Optional[Callable[[...], Callable[[Doc], Doc]]]~~ |

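
A minimal sketch of a stateful component may help here. It shows how the
`name` argument distinguishes two instances of the same factory and how
`default_config` is overridden per instance. The `Prefixer` class, the
`prefixer_demo` factory name and the `prefix` setting are all hypothetical.

```python
import spacy
from spacy.language import Language


class Prefixer:
    """Hypothetical stateful component that stores a prefixed copy of the text."""

    def __init__(self, name, prefix):
        self.name = name
        self.prefix = prefix

    def __call__(self, doc):
        # Key the result by instance name so two instances don't collide.
        doc.user_data[self.name] = self.prefix + doc.text
        return doc


@Language.factory("prefixer_demo", default_config={"prefix": ">> "})
def create_prefixer(nlp, name, prefix):
    return Prefixer(name, prefix)


nlp = spacy.blank("en")
nlp.add_pipe("prefixer_demo")  # uses the default config
nlp.add_pipe("prefixer_demo", name="custom", config={"prefix": "## "})

doc = nlp("hi")
print(doc.user_data["prefixer_demo"], doc.user_data["custom"])
```
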
## Language.\_\_call\_\_ {id="call",tag="method"}

Apply the pipeline to some text. The text can span multiple sentences, and can
contain arbitrary whitespace. Alignment into the original string is preserved.

Instead of text, a `Doc` can be passed as input, in which case tokenization is
skipped, but the rest of the pipeline is run.

> #### Example
>
> ```python
> doc = nlp("An example sentence. Another sentence.")
> assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
> ```

| Name | Description |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `text` | The text to be processed, or a Doc. ~~Union[str, Doc]~~ |
| _keyword-only_ | |
| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). ~~List[str]~~ |
| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
| **RETURNS** | A container for accessing the annotations. ~~Doc~~ |

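
The per-call `disable` argument and `Doc` input can be sketched as follows.
The example assumes a blank English pipeline with only the built-in
`sentencizer`, so that no trained pipeline needs to be installed.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "An example sentence. Another sentence."
doc = nlp(text)
n_sents = len(list(doc.sents))

# Skip the sentencizer for this call only; the pipeline itself is unchanged.
doc_no_sents = nlp(text, disable=["sentencizer"])

# A Doc can be passed instead of text; tokenization is then skipped.
pre_tokenized = nlp.make_doc(text)
doc_from_doc = nlp(pre_tokenized)
print(n_sents, doc_no_sents.has_annotation("SENT_START"))
```
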
## Language.pipe {id="pipe",tag="method"}

Process texts as a stream, and yield `Doc` objects in order. This is usually
more efficient than processing texts one-by-one.

Instead of text, a `Doc` object can be passed as input. In this case
tokenization is skipped but the rest of the pipeline is run.

> #### Example
>
> ```python
> texts = ["One document.", "...", "Lots of documents"]
> for doc in nlp.pipe(texts, batch_size=50):
>     assert doc.has_annotation("DEP")
> ```

| Name | Description |
| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `texts` | A sequence of strings (or `Doc` objects). ~~Iterable[Union[str, Doc]]~~ |
| _keyword-only_ | |
| `as_tuples` | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. ~~bool~~ |
| `batch_size` | The number of texts to buffer. ~~Optional[int]~~ |
| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). ~~List[str]~~ |
| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
| `n_process` | Number of processes to use. Defaults to `1`. ~~int~~ |
| **YIELDS** | Documents in the order of the original text. ~~Doc~~ |

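
The `as_tuples` option is easiest to see with a short sketch. It streams
`(text, context)` pairs through a blank English pipeline and reads back the
context next to each `Doc`. The records and the `id` key are invented for the
example.

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical records: each text travels with its own metadata.
records = [
    ("One document.", {"id": 1}),
    ("Another document.", {"id": 2}),
]

results = []
for doc, context in nlp.pipe(records, as_tuples=True, batch_size=2):
    # Docs are yielded in input order, paired with their context.
    results.append((context["id"], len(doc)))

print(results)
```
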
## Language.set_error_handler {id="set_error_handler",tag="method",version="3"}

Define a callback that will be invoked when an error is thrown during processing
of one or more documents. Specifically, this function will call
[`set_error_handler`](/api/pipe#set_error_handler) on all the pipeline
components that define that function. The error handler will be invoked with the
original component's name, the component itself, the list of documents that was
being processed, and the original error.

> #### Example
>
> ```python
> def warn_error(proc_name, proc, docs, e):
>     print(f"An error occurred when applying component {proc_name}.")
>
> nlp.set_error_handler(warn_error)
> ```

| Name | Description |
| --------------- | ------------------------------------------------------------------------------------------------------------------------- |
| `error_handler` | A function that performs custom error handling. ~~Callable[[str, Callable[[Doc], Doc], List[Doc], Exception], NoReturn]~~ |

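
As a rough sketch of the handler arguments described above, the example below
registers a component that fails on empty documents and a handler that records
the failure instead of letting it propagate. The names `fail_on_empty_demo`
and `collect_error` are hypothetical, and the behaviour shown assumes a spaCy
v3-style pipeline.

```python
import spacy
from spacy.language import Language


# Hypothetical component that raises on empty input.
@Language.component("fail_on_empty_demo")
def fail_on_empty(doc):
    if len(doc) == 0:
        raise RuntimeError("empty document")
    return doc


failures = []


def collect_error(proc_name, proc, docs, e):
    # Record the component name, affected texts and error message.
    failures.append((proc_name, [d.text for d in docs], str(e)))


nlp = spacy.blank("en")
nlp.add_pipe("fail_on_empty_demo")
nlp.set_error_handler(collect_error)

doc = nlp("")  # the handler is called with the error
print(failures)
```
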
## Language.initialize {id="initialize",tag="method",version="3"}

Initialize the pipeline for training and return an
[`Optimizer`](https://thinc.ai/docs/api-optimizers). Under the hood, it uses the
settings defined in the [`[initialize]`](/api/data-formats#config-initialize)
config block to set up the vocabulary, load in vectors and tok2vec weights and
pass optional arguments to the `initialize` methods implemented by pipeline
components or the tokenizer. This method is typically called automatically when
you run [`spacy train`](/api/cli#train). See the usage guide on the
[config lifecycle](/usage/training#config-lifecycle) and
[initialization](/usage/training#initialization) for details.

`get_examples` should be a function that returns an iterable of
[`Example`](/api/example) objects. The data examples can either be the full
training data or a representative sample. They are used to **initialize the
models** of trainable pipeline components and are passed to each component's
[`initialize`](/api/pipe#initialize) method, if available. Initialization
includes validating the network,
[inferring missing shapes](/usage/layers-architectures#thinc-shape-inference)
and setting up the label scheme based on the data.

If no `get_examples` function is provided when calling `nlp.initialize`, the
pipeline components will be initialized with generic data. In this case, it is
crucial that the output dimension of each component has already been defined
either in the [config](/usage/training#config), or by calling
[`pipe.add_label`](/api/pipe#add_label) for each possible output label (e.g. for
the tagger or textcat).

> #### Example
>
> ```python
> get_examples = lambda: examples
> optimizer = nlp.initialize(get_examples)
> ```

| Name | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Optional[Callable[[], Iterable[Example]]]~~ |
| _keyword-only_ | |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| **RETURNS** | The optimizer. ~~Optimizer~~ |

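
To illustrate how the label scheme is inferred from the data, here is a small
sketch that initializes a tagger in a blank English pipeline from two toy
examples. The sentences and tags are invented and far too small for real
training.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
tagger = nlp.add_pipe("tagger")

# Toy data, for illustration only.
train_data = [
    ("I like cats", {"tags": ["PRON", "VERB", "NOUN"]}),
    ("Dogs bark", {"tags": ["NOUN", "VERB"]}),
]
examples = [
    Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data
]

# The callback returns the examples used to infer shapes and labels.
optimizer = nlp.initialize(lambda: examples)
print(tagger.labels)
```
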
## Language.resume_training {id="resume_training",tag="method,experimental",version="3"}

Continue training a trained pipeline. Create and return an optimizer, and
initialize "rehearsal" for any pipeline component that has a `rehearse` method.
Rehearsal is used to prevent models from "forgetting" their initialized
"knowledge". To perform rehearsal, collect samples of text you want the models
to retain performance on, and call [`nlp.rehearse`](/api/language#rehearse) with
a batch of [`Example`](/api/example) objects.

> #### Example
>
> ```python
> optimizer = nlp.resume_training()
> nlp.rehearse(examples, sgd=optimizer)
> ```

| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------- |
| _keyword-only_ | |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| **RETURNS** | The optimizer. ~~Optimizer~~ |

## Language.update {id="update",tag="method"}

Update the models in the pipeline.

<Infobox variant="warning" title="Changed in v3.0">

The `Language.update` method now takes a batch of [`Example`](/api/example)
objects instead of the raw texts and annotations or `Doc` and `GoldParse`
objects. An [`Example`](/api/example) streamlines how data is passed around. It
stores two `Doc` objects: one for holding the gold-standard reference data, and
one for holding the predictions of the pipeline.

For most use cases, you shouldn't have to write your own training scripts
anymore. Instead, you can use [`spacy train`](/api/cli#train) with a config file
and custom registered functions if needed. See the
[training documentation](/usage/training) for details.

</Infobox>

> #### Example
>
> ```python
> for raw_text, entity_offsets in train_data:
>     doc = nlp.make_doc(raw_text)
>     example = Example.from_dict(doc, {"entities": entity_offsets})
>     nlp.update([example], sgd=optimizer)
> ```

| Name | Description |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Dictionary to update with the loss, keyed by pipeline component. ~~Optional[Dict[str, float]]~~ |
| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |

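
The `losses` argument can be sketched with a tiny, self-contained training
loop. It assumes a blank English pipeline with a tagger and toy data. The
number of iterations and the dropout rate are arbitrary, and the resulting
model is not meant to be useful.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("tagger")

# Toy data, for illustration only.
train_data = [("I like cats", {"tags": ["PRON", "VERB", "NOUN"]})]
examples = [
    Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data
]
optimizer = nlp.initialize(lambda: examples)

losses = {}
for _ in range(5):
    # Each call accumulates the loss under the component's name.
    nlp.update(examples, sgd=optimizer, drop=0.1, losses=losses)

print(losses)
```
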
## Language.distill {id="distill",tag="method,experimental",version="4"}

Distill the models in a student pipeline from a teacher pipeline.

> #### Example
>
> ```python
> teacher = spacy.load("en_core_web_lg")
> student = English()
> student.add_pipe("tagger")
> student.distill(teacher, examples, sgd=optimizer)
> ```

| Name | Description |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `teacher` | The teacher pipeline to distill from. ~~Language~~ |
| `examples` | A batch of [`Example`](/api/example) distillation examples. The reference (teacher) and predicted (student) docs must have the same number of tokens and orthography. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Dictionary to update with the loss, keyed by pipeline component. ~~Optional[Dict[str, float]]~~ |
| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
| `exclude` | Names of components that shouldn't be updated. Defaults to `[]`. ~~Iterable[str]~~ |
| `annotates` | Names of components that should set annotations on the predicted examples after updating. Defaults to `[]`. ~~Iterable[str]~~ |
| `student_to_teacher` | Map student component names to teacher component names, only necessary when the names differ. Defaults to `None`. ~~Optional[Dict[str, str]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |

## Language.rehearse {id="rehearse",tag="method,experimental",version="3"}

Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address
the "catastrophic forgetting" problem. This feature is experimental.

> #### Example
>
> ```python
> optimizer = nlp.resume_training()
> losses = nlp.rehearse(examples, sgd=optimizer)
> ```

| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Dictionary to update with the loss, keyed by pipeline component. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |

## Language.evaluate {id="evaluate",tag="method"}

Evaluate a pipeline's components.

<Infobox variant="warning" title="Changed in v3.0">

The `Language.evaluate` method now takes a batch of [`Example`](/api/example)
objects instead of tuples of `Doc` and `GoldParse` objects.

</Infobox>

> #### Example
>
> ```python
> scores = nlp.evaluate(examples)
> print(scores)
> ```

| Name | Description |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to evaluate. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `batch_size` | The batch size to use. ~~Optional[int]~~ |
| `scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. ~~Optional[Scorer]~~ |
| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
| `scorer_cfg` | Optional dictionary of keyword arguments for the `Scorer`. Defaults to `None`. ~~Optional[Dict[str, Any]]~~ |
| **RETURNS** | A dictionary of evaluation scores. ~~Dict[str, Union[float, Dict[str, float]]]~~ |

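
A short sketch of reading the returned scores, assuming a blank English
pipeline with a tagger trained briefly on toy data. Evaluating on the training
examples is done here only to keep the snippet self-contained; the `tag_acc`
key is the accuracy score reported for the tagger.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("tagger")

# Toy data, for illustration only.
train_data = [("I like cats", {"tags": ["PRON", "VERB", "NOUN"]})]
examples = [
    Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data
]
optimizer = nlp.initialize(lambda: examples)
for _ in range(10):
    nlp.update(examples, sgd=optimizer)

# Scores are returned as a flat dictionary keyed by score name.
scores = nlp.evaluate(examples)
print(scores["tag_acc"])
```
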
## Language.use_params {id="use_params",tag="contextmanager, method"}

Replace weights of models in the pipeline with those provided in the params
dictionary. Can be used as a context manager, in which case, models go back to
their original weights after the block.

> #### Example
>
> ```python
> with nlp.use_params(optimizer.averages):
>     nlp.to_disk("/tmp/checkpoint")
> ```

| Name | Description |
| -------- | ------------------------------------------------------ |
| `params` | A dictionary of parameters keyed by model ID. ~~dict~~ |

## Language.add_pipe {id="add_pipe",tag="method",version="2"}

Add a component to the processing pipeline. Expects a name that maps to a
component factory registered using
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory). Components should be callables
that take a `Doc` object, modify it and return it. Only one of `before`,
`after`, `first` or `last` can be set. Default behavior is `last=True`.

<Infobox title="Changed in v3.0" variant="warning">

As of v3.0, the [`Language.add_pipe`](/api/language#add_pipe) method doesn't
take callables anymore and instead expects the **name of a component factory**
registered using [`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory). It now takes care of creating the
component, adds it to the pipeline and returns it.

</Infobox>

> #### Example
>
> ```python
> @Language.component("component")
> def component_func(doc):
>     # modify Doc and return it
>     return doc
>
> nlp.add_pipe("component", before="ner")
> component = nlp.add_pipe("component", name="custom_name", last=True)
>
> # Add component from source pipeline
> source_nlp = spacy.load("en_core_web_sm")
> nlp.add_pipe("ner", source=source_nlp)
> ```

| Name | Description |
| ------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `factory_name` | Name of the registered component factory. ~~str~~ |
| `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. ~~Optional[str]~~ |
| _keyword-only_ | |
| `before` | Component name or index to insert component directly before. ~~Optional[Union[str, int]]~~ |
| `after` | Component name or index to insert component directly after. ~~Optional[Union[str, int]]~~ |
| `first` | Insert component first / not first in the pipeline. ~~Optional[bool]~~ |
| `last` | Insert component last / not last in the pipeline. ~~Optional[bool]~~ |
| `config` <Tag variant="new">3</Tag> | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Dict[str, Any]~~ |
| `source` <Tag variant="new">3</Tag> | Optional source pipeline to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source pipeline match the target pipeline. ~~Optional[Language]~~ |
| `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ |

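
The placement arguments can be sketched by adding the same no-op component
several times under different instance names and inspecting the resulting
order. The `noop_demo` component and the instance names are hypothetical.

```python
import spacy
from spacy.language import Language


# Hypothetical no-op component used only to show ordering.
@Language.component("noop_demo")
def noop(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")                                     # last by default
nlp.add_pipe("noop_demo", name="start", first=True)             # moved to the front
nlp.add_pipe("noop_demo", name="middle", before="sentencizer")  # relative placement
nlp.add_pipe("noop_demo", name="end")                           # appended last

print(nlp.pipe_names)
```
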
## Language.create_pipe {id="create_pipe",tag="method",version="2"}

Create a pipeline component from a factory.

<Infobox title="Changed in v3.0" variant="warning">

As of v3.0, the [`Language.add_pipe`](/api/language#add_pipe) method also takes
the string name of the factory, creates the component, adds it to the pipeline
and returns it. The `Language.create_pipe` method is now mostly used internally.
To create a component and add it to the pipeline, you should always use
`Language.add_pipe`.

</Infobox>

> #### Example
>
> ```python
> parser = nlp.create_pipe("parser")
> ```

| Name | Description |
| ------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `factory_name` | Name of the registered component factory. ~~str~~ |
| `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. ~~Optional[str]~~ |
| _keyword-only_ | |
| `config` <Tag variant="new">3</Tag> | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Dict[str, Any]~~ |
| `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ |

## Language.has_factory {id="has_factory",tag="classmethod",version="3"}

Check whether a factory name is registered on the `Language` class or subclass.
Will check for
[language-specific factories](/usage/processing-pipelines#factories-language)
registered on the subclass, as well as general-purpose factories registered on
the `Language` base class, available to all subclasses.

> #### Example
>
> ```python
> from spacy.language import Language
> from spacy.lang.en import English
>
> @English.component("component")
> def component(doc):
>     return doc
>
> assert English.has_factory("component")
> assert not Language.has_factory("component")
> ```

| Name | Description |
| ----------- | ------------------------------------------------------------------- |
| `name` | Name of the pipeline factory to check. ~~str~~ |
| **RETURNS** | Whether a factory of that name is registered on the class. ~~bool~~ |

## Language.has_pipe {id="has_pipe",tag="method",version="2"}

Check whether a component is present in the pipeline. Equivalent to
`name in nlp.pipe_names`.

> #### Example
>
> ```python
> @Language.component("component")
> def component(doc):
>     return doc
>
> nlp.add_pipe("component", name="my_component")
> assert "my_component" in nlp.pipe_names
> assert nlp.has_pipe("my_component")
> ```

| Name | Description |
| ----------- | ----------------------------------------------------------------- |
| `name` | Name of the pipeline component to check. ~~str~~ |
| **RETURNS** | Whether a component of that name exists in the pipeline. ~~bool~~ |

## Language.get_pipe {id="get_pipe",tag="method",version="2"}
|
|
|
|
Get a pipeline component for a given component name.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> parser = nlp.get_pipe("parser")
|
|
> custom_component = nlp.get_pipe("custom_component")
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | ------------------------------------------------ |
|
|
| `name` | Name of the pipeline component to get. ~~str~~ |
|
|
| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ |
|
|
|
|
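
As a minimal sketch, assuming a blank English pipeline with only the built-in `sentencizer`, the object returned by `get_pipe` is the same instance that `add_pipe` created. The name `not_in_pipeline` is a made-up name that deliberately doesn't exist.

```python
import spacy

nlp = spacy.blank("en")
sentencizer = nlp.add_pipe("sentencizer")

# get_pipe returns the component instance itself, not a copy.
same_object = nlp.get_pipe("sentencizer") is sentencizer

# Looking up a name that isn't in the pipeline raises a KeyError.
try:
    nlp.get_pipe("not_in_pipeline")
    lookup_failed = False
except KeyError:
    lookup_failed = True
```
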
## Language.replace_pipe {id="replace_pipe",tag="method",version="2"}
|
|
|
|
Replace a component in the pipeline and return the new component.
|
|
|
|
<Infobox title="Changed in v3.0" variant="warning">
|
|
|
|
As of v3.0, the `Language.replace_pipe` method doesn't take callables anymore
|
|
and instead expects the **name of a component factory** registered using
|
|
[`@Language.component`](/api/language#component) or
|
|
[`@Language.factory`](/api/language#factory).
|
|
|
|
</Infobox>
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> new_parser = nlp.replace_pipe("parser", "my_custom_parser")
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
|
| `name` | Name of the component to replace. ~~str~~ |
|
|
| `component` | The factory name of the component to insert. ~~str~~ |
|
|
| _keyword-only_ | |
|
|
| `config` <Tag variant="new">3</Tag> | Optional config parameters to use for the new component. Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ |
|
|
| `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
|
|
| **RETURNS** | The new pipeline component. ~~Callable[[Doc], Doc]~~ |
|
|
|
|
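
The following sketch shows how the `config` argument could be used when replacing a component. The factory `demo_marker`, its `marker` setting and the instance name `marker` are hypothetical; the component simply writes its setting to `doc.user_data` so the effect is visible.

```python
import spacy
from spacy.language import Language


# Hypothetical factory: the component stores its setting on each Doc.
@Language.factory("demo_marker", default_config={"marker": "default"})
def create_demo_marker(nlp, name, marker: str):
    def demo_marker(doc):
        doc.user_data["marker"] = marker
        return doc

    return demo_marker


nlp = spacy.blank("en")
nlp.add_pipe("demo_marker", name="marker")
before = nlp("hello").user_data["marker"]

# Replace the instance by factory name; the config is merged with default_config.
new_component = nlp.replace_pipe("marker", "demo_marker", config={"marker": "replaced"})
after = nlp("hello").user_data["marker"]
names = list(nlp.pipe_names)
```

The replaced component keeps its name and position in the pipeline, so `names` still contains a single entry, `"marker"`.
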
## Language.rename_pipe {id="rename_pipe",tag="method",version="2"}
|
|
|
|
Rename a component in the pipeline. Useful to create custom names for
|
|
pre-defined and pre-loaded components. To change the default name of a component
|
|
added to the pipeline, you can also use the `name` argument on
|
|
[`add_pipe`](/api/language#add_pipe).
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> nlp.rename_pipe("parser", "spacy_parser")
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ---------- | ---------------------------------------- |
|
|
| `old_name` | Name of the component to rename. ~~str~~ |
|
|
| `new_name` | New name of the component. ~~str~~ |
|
|
|
|
## Language.remove_pipe {id="remove_pipe",tag="method",version="2"}
|
|
|
|
Remove a component from the pipeline. Returns the removed component name and
|
|
component function.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> name, component = nlp.remove_pipe("parser")
|
|
> assert name == "parser"
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | ------------------------------------------------------------------------------------------ |
|
|
| `name` | Name of the component to remove. ~~str~~ |
|
|
| **RETURNS** | A `(name, component)` tuple of the removed component. ~~Tuple[str, Callable[[Doc], Doc]]~~ |
|
|
|
|
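
Renaming and removing can be combined, as in this short sketch. It assumes a blank English pipeline with the built-in `sentencizer`; the new name `sentence_splitter` is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Give the component a custom name, then remove it under that name.
nlp.rename_pipe("sentencizer", "sentence_splitter")
names_after_rename = list(nlp.pipe_names)

removed_name, removed_component = nlp.remove_pipe("sentence_splitter")
names_after_remove = list(nlp.pipe_names)
```
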
## Language.disable_pipe {id="disable_pipe",tag="method",version="3"}
|
|
|
|
Temporarily disable a pipeline component so it's not run as part of the
|
|
pipeline. Disabled components are listed in
|
|
[`nlp.disabled`](/api/language#attributes) and included in
|
|
[`nlp.components`](/api/language#attributes), but not in
|
|
[`nlp.pipeline`](/api/language#pipeline), so they're not run when you process a
|
|
`Doc` with the `nlp` object. If the component is already disabled, this method
|
|
does nothing.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> nlp.add_pipe("ner")
|
|
> nlp.add_pipe("textcat")
|
|
> assert nlp.pipe_names == ["ner", "textcat"]
|
|
> nlp.disable_pipe("ner")
|
|
> assert nlp.pipe_names == ["textcat"]
|
|
> assert nlp.component_names == ["ner", "textcat"]
|
|
> assert nlp.disabled == ["ner"]
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ------ | ----------------------------------------- |
|
|
| `name` | Name of the component to disable. ~~str~~ |
|
|
|
|
## Language.enable_pipe {id="enable_pipe",tag="method",version="3"}
|
|
|
|
Enable a previously disabled component (e.g. via
|
|
[`Language.disable_pipes`](/api/language#disable_pipes)) so it's run as part of
|
|
the pipeline, [`nlp.pipeline`](/api/language#pipeline). If the component is
|
|
already enabled, this method does nothing.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> nlp.disable_pipe("ner")
|
|
> assert "ner" in nlp.disabled
|
|
> assert "ner" not in nlp.pipe_names
|
|
> nlp.enable_pipe("ner")
|
|
> assert "ner" not in nlp.disabled
|
|
> assert "ner" in nlp.pipe_names
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ------ | ---------------------------------------- |
|
|
| `name` | Name of the component to enable. ~~str~~ |
|
|
|
|
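
To make the effect of disabling and enabling visible, here is a small sketch that checks whether sentence boundaries were set. It assumes a blank English pipeline with only the built-in `sentencizer`; the example text is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# While disabled, the component stays on the nlp object but is not run.
nlp.disable_pipe("sentencizer")
disabled_names = list(nlp.disabled)
still_available = "sentencizer" in nlp.component_names
has_sents_disabled = nlp("One. Two.").has_annotation("SENT_START")

# After enabling, the component runs again as part of the pipeline.
nlp.enable_pipe("sentencizer")
has_sents_enabled = nlp("One. Two.").has_annotation("SENT_START")
n_sents = len(list(nlp("One. Two.").sents))
```
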
## Language.select_pipes {id="select_pipes",tag="contextmanager, method",version="3"}
|
|
|
|
Disable one or more pipeline components. If used as a context manager, the
|
|
pipeline will be restored to the initial state at the end of the block.
|
|
Otherwise, a `DisabledPipes` object is returned, which has a `.restore()` method
|
|
you can use to undo your changes. You can specify either `disable` (as a list or
|
|
string), or `enable`. In the latter case, all components not in the `enable`
|
|
list will be disabled. Under the hood, this method calls into
|
|
[`disable_pipe`](/api/language#disable_pipe) and
|
|
[`enable_pipe`](/api/language#enable_pipe).
|
|
|
|
> #### Example
>
> ```python
> with nlp.select_pipes(disable=["tagger", "parser"]):
>     nlp.initialize()
>
> with nlp.select_pipes(enable="ner"):
>     nlp.initialize()
>
> disabled = nlp.select_pipes(disable=["tagger", "parser"])
> nlp.initialize()
> disabled.restore()
> ```
|
|
|
|
<Infobox title="Changed in v3.0" variant="warning" id="disable_pipes">
|
|
|
|
As of spaCy v3.0, the `disable_pipes` method has been renamed to `select_pipes`:
|
|
|
|
```diff
|
|
- nlp.disable_pipes(["tagger", "parser"])
|
|
+ nlp.select_pipes(disable=["tagger", "parser"])
|
|
```
|
|
|
|
</Infobox>
|
|
|
|
| Name | Description |
|
|
| -------------- | ------------------------------------------------------------------------------------------------------ |
|
|
| _keyword-only_ | |
|
|
| `disable` | Name(s) of pipeline component(s) to disable. ~~Optional[Union[str, Iterable[str]]]~~ |
|
|
| `enable` | Name(s) of pipeline component(s) that will not be disabled. ~~Optional[Union[str, Iterable[str]]]~~ |
|
|
| **RETURNS** | The disabled pipes that can be restored by calling the object's `.restore()` method. ~~DisabledPipes~~ |
|
|
|
|
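
The restore behavior can be checked with a small sketch. It assumes a blank English pipeline with the built-in `sentencizer` and a hypothetical no-op component called `demo_noop`.

```python
import spacy
from spacy.language import Language


# Hypothetical no-op component, used only for this illustration.
@Language.component("demo_noop")
def demo_noop(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("demo_noop")

# Context manager with `disable`: the pipeline is restored after the block.
with nlp.select_pipes(disable=["sentencizer"]):
    inside_disable = list(nlp.pipe_names)
after_disable = list(nlp.pipe_names)

# Context manager with `enable`: everything else is disabled inside the block.
with nlp.select_pipes(enable="sentencizer"):
    inside_enable = list(nlp.pipe_names)

# Without a context manager, call .restore() on the returned object yourself.
disabled = nlp.select_pipes(disable="demo_noop")
while_disabled = list(nlp.pipe_names)
disabled.restore()
after_restore = list(nlp.pipe_names)
```
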
## Language.get_factory_meta {id="get_factory_meta",tag="classmethod",version="3"}
|
|
|
|
Get the factory meta information for a given pipeline component name. Expects
|
|
the name of the component **factory**. The factory meta is an instance of the
|
|
[`FactoryMeta`](/api/language#factorymeta) dataclass and contains the
|
|
information about the component and its defaults provided by the
|
|
[`@Language.component`](/api/language#component) or
|
|
[`@Language.factory`](/api/language#factory) decorator.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> factory_meta = Language.get_factory_meta("ner")
|
|
> assert factory_meta.factory == "ner"
|
|
> print(factory_meta.default_config)
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | --------------------------------- |
|
|
| `name` | The factory name. ~~str~~ |
|
|
| **RETURNS** | The factory meta. ~~FactoryMeta~~ |
|
|
|
|
## Language.get_pipe_meta {id="get_pipe_meta",tag="method",version="3"}
|
|
|
|
Get the factory meta information for a given pipeline component name. Expects
|
|
the name of the component **instance** in the pipeline. The factory meta is an
|
|
instance of the [`FactoryMeta`](/api/language#factorymeta) dataclass and
|
|
contains the information about the component and its defaults provided by the
|
|
[`@Language.component`](/api/language#component) or
|
|
[`@Language.factory`](/api/language#factory) decorator.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> nlp.add_pipe("ner", name="entity_recognizer")
|
|
> factory_meta = nlp.get_pipe_meta("entity_recognizer")
|
|
> assert factory_meta.factory == "ner"
|
|
> print(factory_meta.default_config)
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | ------------------------------------ |
|
|
| `name` | The pipeline component name. ~~str~~ |
|
|
| **RETURNS** | The factory meta. ~~FactoryMeta~~ |
|
|
|
|
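
The two lookups differ only in the name they expect, which this sketch tries to show. The factory `demo_thresholder`, its `threshold` setting and the instance name `custom_name` are hypothetical.

```python
import spacy
from spacy.language import Language


# Hypothetical factory with one default setting.
@Language.factory("demo_thresholder", default_config={"threshold": 0.5})
def create_demo_thresholder(nlp, name, threshold: float):
    def demo_thresholder(doc):
        return doc

    return demo_thresholder


nlp = spacy.blank("en")
nlp.add_pipe("demo_thresholder", name="custom_name")

# Class-level lookup by factory name.
factory_meta = Language.get_factory_meta("demo_thresholder")
# Instance-level lookup by the component's name in this pipeline.
pipe_meta = nlp.get_pipe_meta("custom_name")

default_threshold = factory_meta.default_config["threshold"]
```

Both calls describe the same factory here, so `pipe_meta.factory` is `"demo_thresholder"` even though the instance is called `"custom_name"`.
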
## Language.analyze_pipes {id="analyze_pipes",tag="method",version="3"}
|
|
|
|
Analyze the current pipeline components and show a summary of the attributes
|
|
they assign and require, and the scores they set. The data is based on the
|
|
information provided in the [`@Language.component`](/api/language#component) and
[`@Language.factory`](/api/language#factory) decorators. If requirements aren't
|
|
met, e.g. if a component specifies a required property that is not set by a
|
|
previous component, a warning is shown.
|
|
|
|
<Infobox variant="warning" title="Important note">
|
|
|
|
The pipeline analysis is static and does **not actually run the components**.
|
|
This means that it relies on the information provided by the components
|
|
themselves. If a custom component declares that it assigns an attribute but it
|
|
doesn't, the pipeline analysis won't catch that.
|
|
|
|
</Infobox>
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> nlp = spacy.blank("en")
|
|
> nlp.add_pipe("tagger")
|
|
> nlp.add_pipe("entity_linker")
|
|
> analysis = nlp.analyze_pipes()
|
|
> ```
|
|
|
|
<Accordion title="Example output" spaced>
|
|
|
|
```json {title="Structured"}
{
  "summary": {
    "tagger": {
      "assigns": ["token.tag"],
      "requires": [],
      "scores": ["tag_acc", "pos_acc", "lemma_acc"],
      "retokenizes": false
    },
    "entity_linker": {
      "assigns": ["token.ent_kb_id"],
      "requires": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"],
      "scores": [],
      "retokenizes": false
    }
  },
  "problems": {
    "tagger": [],
    "entity_linker": [
      "doc.ents",
      "doc.sents",
      "token.ent_iob",
      "token.ent_type"
    ]
  },
  "attrs": {
    "token.ent_iob": { "assigns": [], "requires": ["entity_linker"] },
    "doc.ents": { "assigns": [], "requires": ["entity_linker"] },
    "token.ent_kb_id": { "assigns": ["entity_linker"], "requires": [] },
    "doc.sents": { "assigns": [], "requires": ["entity_linker"] },
    "token.tag": { "assigns": ["tagger"], "requires": [] },
    "token.ent_type": { "assigns": [], "requires": ["entity_linker"] }
  }
}
```
|
|
|
|
```
### Pretty
============================= Pipeline Overview =============================

#   Component       Assigns           Requires         Scores        Retokenizes
-   -------------   ---------------   --------------   -----------   -----------
0   tagger          token.tag                          tag_acc       False

1   entity_linker   token.ent_kb_id   doc.ents         nel_micro_f   False
                                      doc.sents        nel_micro_r
                                      token.ent_iob    nel_micro_p
                                      token.ent_type


================================ Problems (4) ================================
⚠ 'entity_linker' requirements not met: doc.ents, doc.sents,
token.ent_iob, token.ent_type
```
|
|
|
|
</Accordion>
|
|
|
|
| Name | Description |
|
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| _keyword-only_ | |
|
|
| `keys` | The values to display in the table. Corresponds to attributes of the [`FactoryMeta`](/api/language#factorymeta). Defaults to `["assigns", "requires", "scores", "retokenizes"]`. ~~List[str]~~ |
|
|
| `pretty` | Pretty-print the results as a table. Defaults to `False`. ~~bool~~ |
|
|
| **RETURNS** | Dictionary containing the pipe analysis, keyed by `"summary"` (component meta by pipe), `"problems"` (attribute names by pipe) and `"attrs"` (pipes that assign and require an attribute, keyed by attribute). ~~Optional[Dict[str, Any]]~~ |
|
|
|
|
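
Because the analysis only reads the declared metadata, it can be tried out with components that do nothing. The sketch below uses two hypothetical no-op components, `demo_ent_user` (declares that it requires `doc.ents`) and `demo_ent_setter` (declares that it assigns `doc.ents`).

```python
import spacy
from spacy.language import Language


# Hypothetical components: only the declared metadata matters for the analysis.
@Language.component("demo_ent_setter", assigns=["doc.ents"])
def demo_ent_setter(doc):
    return doc


@Language.component("demo_ent_user", requires=["doc.ents"])
def demo_ent_user(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("demo_ent_user")

# No earlier component declares that it assigns doc.ents.
problems_before = nlp.analyze_pipes()["problems"]["demo_ent_user"]

# Insert a component that declares doc.ents before the one that requires it.
nlp.add_pipe("demo_ent_setter", before="demo_ent_user")
problems_after = nlp.analyze_pipes()["problems"]["demo_ent_user"]
```

Note that `demo_ent_setter` never actually sets any entities. The second analysis still reports no problems, which is the static behavior described in the note above.
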
## Language.replace_listeners {id="replace_listeners",tag="method",version="3"}
|
|
|
|
Find [listener layers](/usage/embeddings-transformers#embedding-layers)
|
|
(connecting to a shared token-to-vector embedding component) of a given pipeline
|
|
component model and replace them with a standalone copy of the token-to-vector
|
|
layer. The listener layer allows other components to connect to a shared
|
|
token-to-vector embedding component like [`Tok2Vec`](/api/tok2vec) or
|
|
[`Transformer`](/api/transformer). Replacing listeners can be useful when
|
|
training a pipeline with components sourced from an existing pipeline: if
|
|
multiple components (e.g. tagger, parser, NER) listen to the same
|
|
token-to-vector component, but some of them are frozen and not updated, their
|
|
performance may degrade significantly as the token-to-vector component is updated
|
|
with new data. To prevent this, listeners can be replaced with a standalone
|
|
token-to-vector layer that is owned by the component and doesn't change if the
|
|
component isn't updated.
|
|
|
|
This method is typically not called directly and only executed under the hood
|
|
when loading a config with
|
|
[sourced components](/usage/training#config-components) that define
|
|
`replace_listeners`.
|
|
|
|
> ```python
|
|
> ### Example
|
|
> nlp = spacy.load("en_core_web_sm")
|
|
> nlp.replace_listeners("tok2vec", "tagger", ["model.tok2vec"])
|
|
> ```
|
|
>
|
|
> ```ini
|
|
> ### config.cfg (excerpt)
|
|
> [training]
|
|
> frozen_components = ["tagger"]
|
|
>
|
|
> [components]
|
|
>
|
|
> [components.tagger]
|
|
> source = "en_core_web_sm"
|
|
> replace_listeners = ["model.tok2vec"]
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `tok2vec_name` | Name of the token-to-vector component, typically `"tok2vec"` or `"transformer"`. ~~str~~ |
|
|
| `pipe_name` | Name of pipeline component to replace listeners for. ~~str~~ |
|
|
| `listeners` | The paths to the listeners, relative to the component config, e.g. `["model.tok2vec"]`. Typically, implementations will only connect to one tok2vec component, `model.tok2vec`, but in theory, custom models can use multiple listeners. The value here can either be an empty list to not replace any listeners, or a _complete_ list of the paths to all listener layers used by the model that should be replaced. ~~Iterable[str]~~ |
|
|
|
|
## Language.meta {id="meta",tag="property"}
|
|
|
|
Meta data for the `Language` class, including name, version, data sources,
|
|
license, author information and more. If a trained pipeline is loaded, this
|
|
contains meta data of the pipeline. The `Language.meta` is also what's
|
|
serialized as the `meta.json` when you save an `nlp` object to disk. See the
|
|
[meta data format](/api/data-formats#meta) for more details.
|
|
|
|
<Infobox variant="warning" title="Changed in v3.0">
|
|
|
|
As of v3.0, the meta only contains **meta information** about the pipeline and
|
|
isn't used to construct the language class and pipeline components. This
|
|
information is expressed in the [`config.cfg`](/api/data-formats#config).
|
|
|
|
</Infobox>
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> print(nlp.meta)
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | --------------------------------- |
|
|
| **RETURNS** | The meta data. ~~Dict[str, Any]~~ |
|
|
|
|
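
For a pipeline that wasn't loaded from a package, the meta is filled with defaults. A minimal sketch, assuming a blank English pipeline with the built-in `sentencizer`:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

meta = nlp.meta
lang = meta["lang"]
pipeline = meta["pipeline"]
```
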
## Language.config {id="config",tag="property",version="3"}
|
|
|
|
Export a trainable [`config.cfg`](/api/data-formats#config) for the current
|
|
`nlp` object. Includes the current pipeline, all configs used to create the
|
|
currently active pipeline components, as well as the default training config
|
|
that can be used with [`spacy train`](/api/cli#train). `Language.config` returns
|
|
a [Thinc `Config` object](https://thinc.ai/docs/api-config#config), which is a
|
|
subclass of the built-in `dict`. It supports the additional methods `to_disk`
|
|
(serialize the config to a file) and `to_str` (output the config as a string).
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> nlp.config.to_disk("./config.cfg")
|
|
> print(nlp.config.to_str())
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | ---------------------- |
|
|
| **RETURNS** | The config. ~~Config~~ |
|
|
|
|
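
Since the config behaves like a nested dictionary, individual sections can be inspected directly. A small sketch, assuming a blank English pipeline with the built-in `sentencizer`:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

config = nlp.config
lang = config["nlp"]["lang"]
pipeline = config["nlp"]["pipeline"]
factory = config["components"]["sentencizer"]["factory"]

# The string form is what would be written to a config.cfg file.
config_str = config.to_str()
```
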
## Language.to_disk {id="to_disk",tag="method",version="2"}
|
|
|
|
Save the current state to a directory. Under the hood, this method delegates to
|
|
the `to_disk` methods of the individual pipeline components, if available. This
|
|
means that if a trained pipeline is loaded, all components and their weights
|
|
will be saved to disk.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> nlp.to_disk("/path/to/pipeline")
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
|
| _keyword-only_ | |
|
|
| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
|
|
|
## Language.from_disk {id="from_disk",tag="method",version="2"}
|
|
|
|
Load state from a directory, including all data that was saved with the
|
|
`Language` object. Modifies the object in place and returns it.
|
|
|
|
<Infobox variant="warning" title="Important note">
|
|
|
|
Keep in mind that this method **only loads the serialized state** and doesn't
|
|
set up the `nlp` object. This means that it requires the correct language class
|
|
to be initialized and all pipeline components to be added to the pipeline. If
|
|
you want to load a serialized pipeline from a directory, you should use
|
|
[`spacy.load`](/api/top-level#spacy.load), which will set everything up for you.
|
|
|
|
</Infobox>
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.language import Language
|
|
> nlp = Language().from_disk("/path/to/pipeline")
|
|
>
|
|
> # Using language-specific subclass
|
|
> from spacy.lang.en import English
|
|
> nlp = English().from_disk("/path/to/pipeline")
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| -------------- | ----------------------------------------------------------------------------------------------------------- |
|
|
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
|
| _keyword-only_ | |
|
|
| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
|
| **RETURNS** | The modified `Language` object. ~~Language~~ |
|
|
|
|
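
The two ways of loading described in the note above can be compared in a round trip through a temporary directory. This is a sketch under the assumption of a blank English pipeline with only the built-in `sentencizer`.

```python
import tempfile

import spacy
from spacy.lang.en import English

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)

    # spacy.load reads the saved config and sets up the pipeline for you.
    loaded = spacy.load(tmp_dir)

    # from_disk only loads state, so the pipeline has to be rebuilt first.
    manual = English()
    manual.add_pipe("sentencizer")
    manual.from_disk(tmp_dir)

loaded_names = list(loaded.pipe_names)
manual_names = list(manual.pipe_names)
n_sents = len(list(loaded("One. Two.").sents))
```
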
## Language.to_bytes {id="to_bytes",tag="method"}
|
|
|
|
Serialize the current state to a binary string.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> nlp_bytes = nlp.to_bytes()
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| -------------- | ------------------------------------------------------------------------------------------------------ |
|
|
| _keyword-only_ | |
|
|
| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
|
| **RETURNS** | The serialized form of the `Language` object. ~~bytes~~ |
|
|
|
|
## Language.from_bytes {id="from_bytes",tag="method"}
|
|
|
|
Load state from a binary string. Note that this method is commonly used via the
|
|
subclasses like `English` or `German` to make language-specific functionality
|
|
like the [lexical attribute getters](/usage/linguistic-features#language-data)
|
|
available to the loaded object.
|
|
|
|
Note that if you want to serialize and reload a whole pipeline, using this alone
|
|
won't work: you also need to handle the config. See
|
|
["Serializing the pipeline"](https://spacy.io/usage/saving-loading#pipeline) for
|
|
details.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.lang.en import English
|
|
> nlp_bytes = nlp.to_bytes()
|
|
> nlp2 = English()
|
|
> nlp2.from_bytes(nlp_bytes)
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| -------------- | ----------------------------------------------------------------------------------------------------------- |
|
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
|
| _keyword-only_ | |
|
|
| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
|
| **RETURNS** | The `Language` object. ~~Language~~ |
|
|
|
|
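
One way to handle the config alongside the bytes is sketched below: keep `nlp.config`, use it to rebuild an empty pipeline of the right language class, then load the binary data into it. The sketch assumes a blank English pipeline with only the built-in `sentencizer`.

```python
import spacy
from spacy.util import get_lang_class

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Keep both parts: the config (how to build the pipeline) and the bytes (its state).
config = nlp.config
bytes_data = nlp.to_bytes()

# Rebuild the pipeline from the config, then load the state into it.
lang_cls = get_lang_class(config["nlp"]["lang"])
nlp2 = lang_cls.from_config(config)
nlp2.from_bytes(bytes_data)

restored_names = list(nlp2.pipe_names)
```
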
## Attributes {id="attributes"}
|
|
|
|
| Name | Description |
|
|
| -------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `vocab` | A container for the lexical types. ~~Vocab~~ |
|
|
| `tokenizer` | The tokenizer. ~~Tokenizer~~ |
|
|
| `make_doc` | Callable that takes a string and returns a `Doc`. ~~Callable[[str], Doc]~~ |
|
|
| `pipeline` | List of `(name, component)` tuples describing the current processing pipeline, in order. ~~List[Tuple[str, Callable[[Doc], Doc]]]~~ |
|
|
| `pipe_names` | List of pipeline component names, in order. ~~List[str]~~ |
|
|
| `pipe_labels` | List of labels set by the pipeline components, if available, keyed by component name. ~~Dict[str, List[str]]~~ |
|
|
| `pipe_factories` | Dictionary of pipeline component names, mapped to their factory names. ~~Dict[str, str]~~ |
|
|
| `factories` | All available factory functions, keyed by name. ~~Dict[str, Callable[[...], Callable[[Doc], Doc]]]~~ |
|
|
| `factory_names` <Tag variant="new">3</Tag> | List of all available factory names. ~~List[str]~~ |
|
|
| `components` <Tag variant="new">3</Tag> | List of all available `(name, component)` tuples, including components that are currently disabled. ~~List[Tuple[str, Callable[[Doc], Doc]]]~~ |
|
|
| `component_names` <Tag variant="new">3</Tag> | List of all available component names, including components that are currently disabled. ~~List[str]~~ |
|
|
| `disabled` <Tag variant="new">3</Tag> | Names of components that are currently disabled and don't run as part of the pipeline. ~~List[str]~~ |
|
|
| `path` | Path to the pipeline data directory, if a pipeline is loaded from a path or package. Otherwise `None`. ~~Optional[Path]~~ |
|
|
|
|
## Class attributes {id="class-attributes"}
|
|
|
|
| Name | Description |
|
|
| ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `Defaults` | Settings, data and factory methods for creating the `nlp` object and processing pipeline. ~~Defaults~~ |
|
|
| `lang` | [IETF language tag](https://www.w3.org/International/articles/language-tags/), such as 'en' for English. ~~str~~ |
|
|
| `default_config` | Base [config](/usage/training#config) to use for [Language.config](/api/language#config). Defaults to [`default_config.cfg`](%%GITHUB_SPACY/spacy/default_config.cfg). ~~Config~~ |
|
|
|
|
## Defaults {id="defaults"}
|
|
|
|
The following attributes can be set on the `Language.Defaults` class to
|
|
customize the default language data:
|
|
|
|
> #### Example
>
> ```python
> from spacy.language import Language
> from spacy.lang.tokenizer_exceptions import URL_MATCH
> from thinc.api import Config
>
> DEFAULT_CONFIG = """
> [nlp.tokenizer]
> @tokenizers = "MyCustomTokenizer.v1"
> """
>
> class Defaults(Language.Defaults):
>     stop_words = set()
>     tokenizer_exceptions = {}
>     prefixes = tuple()
>     suffixes = tuple()
>     infixes = tuple()
>     token_match = None
>     url_match = URL_MATCH
>     lex_attr_getters = {}
>     syntax_iterators = {}
>     writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
>     config = Config().from_str(DEFAULT_CONFIG)
> ```
|
|
|
|
| Name | Description |
|
|
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `stop_words` | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`](%%GITHUB_SPACY/spacy/lang/en/stop_words.py) ~~Set[str]~~ |
|
|
| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/de/tokenizer_exceptions.py) ~~Dict[str, List[dict]]~~ |
|
|
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) ~~Optional[Sequence[Union[str, Pattern]]]~~ |
|
|
| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/fr/tokenizer_exceptions.py) ~~Optional[Callable]~~ |
|
|
| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/tokenizer_exceptions.py) ~~Optional[Callable]~~ |
|
|
| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`](%%GITHUB_SPACY/spacy/lang/en/lex_attrs.py) ~~Dict[int, Callable[[str], Any]]~~ |
|
|
| `syntax_iterators` | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).<br />**Example:** [`syntax_iterators.py`](%%GITHUB_SPACY/spacy/lang/en/syntax_iterators.py). ~~Dict[str, Callable[[Union[Doc, Span]], Iterator[Span]]]~~ |
|
|
| `writing_system` | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to: `{"direction": "ltr", "has_case": True, "has_letters": True}`.<br />**Example:** [`zh/__init__.py`](%%GITHUB_SPACY/spacy/lang/zh/__init__.py) ~~Dict[str, Any]~~ |
|
|
| `config` | Default [config](/usage/training#config) added to `nlp.config`. This can include references to custom tokenizers or lemmatizers.<br />**Example:** [`zh/__init__.py`](%%GITHUB_SPACY/spacy/lang/zh/__init__.py) ~~Config~~ |
|
|
|
|
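
To see a customized default take effect, the defaults class can be attached to a `Language` subclass. The sketch below overrides only the stop words; the class names and the language code `custom_en` are made up for this illustration.

```python
from spacy.lang.en import English


# Hypothetical subclass that replaces the English stop words.
class CustomEnglishDefaults(English.Defaults):
    stop_words = {"custom", "stop"}


class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults


stock_nlp = English()
custom_nlp = CustomEnglish()

# Token.is_stop reflects the stop words of the respective defaults.
stock_flags = [token.is_stop for token in stock_nlp("custom stop")]
custom_flags = [token.is_stop for token in custom_nlp("custom stop")]
```
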
## Serialization fields {id="serialization-fields"}
|
|
|
|
During serialization, spaCy will export several data fields used to restore
|
|
different aspects of the object. If needed, you can exclude them from
|
|
serialization by passing in the string names via the `exclude` argument.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> data = nlp.to_bytes(exclude=["tokenizer", "vocab"])
|
|
> nlp.from_disk("/pipeline", exclude=["ner"])
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | ------------------------------------------------------------------ |
|
|
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
|
| `tokenizer` | Tokenization rules and exceptions. |
|
|
| `meta` | The meta data, available as [`Language.meta`](/api/language#meta). |
|
|
| ... | String names of pipeline components, e.g. `"ner"`. |
|
|
|
|
## FactoryMeta {id="factorymeta",version="3",tag="dataclass"}
|
|
|
|
The `FactoryMeta` contains the information about the component and its defaults
|
|
provided by the [`@Language.component`](/api/language#component) or
|
|
[`@Language.factory`](/api/language#factory) decorator. It's created whenever a
|
|
component is defined and stored on the `Language` class for each component
|
|
instance and factory instance.
|
|
|
|
| Name | Description |
|
|
| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `factory` | The name of the registered component factory. ~~str~~ |
|
|
| `default_config` | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~ |
|
|
| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
|
|
| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
|
|
| `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~ |
|
|
| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. If a weight is set to `None`, the score will not be logged or weighted. ~~Dict[str, Optional[float]]~~ |
|
|
| `scores` | All scores set by the component if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Based on the `default_score_weights` and used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
|