spaCy/language.md at fe4cfd06322720905aac8c1b0e26f88c4d375c64

mirror of https://github.com/explosion/spaCy.git synced 2025-07-02 10:53:05 +03:00

Ines Montani fe4cfd0632 Start updating website for v3 [ci skip]

2020-07-01 21:26:39 +02:00

28 KiB

title	teaser	tag	source
Language	A text-processing pipeline	class	spacy/language.py

Usually you'll load this once per process as nlp and pass the instance around your application. A Language object is created when you call spacy.load(). It contains the shared vocabulary and language data, optional model data loaded from a model package or a path, and a processing pipeline of components, such as the tagger or parser, that are called on a document in order. You can also add your own pipeline components: callables that take a Doc object, modify it and return it.
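
As a rough mental model of the paragraph above, the pipeline can be pictured as an ordered list of callables that each receive the document and return it. The sketch below is a toy stand-in that uses only the standard library. `ToyDoc`, `ToyPipeline` and the two components are hypothetical names and are not part of spaCy.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: raw text plus a dict of annotations."""
    text: str
    annotations: dict = field(default_factory=dict)


class ToyPipeline:
    """Hypothetical stand-in for a Language object (not spaCy's implementation)."""

    def __init__(self):
        self.pipeline = []  # ordered list of (name, component) tuples

    def add(self, name, component):
        self.pipeline.append((name, component))

    def __call__(self, text):
        doc = ToyDoc(text)  # plays the role of make_doc
        for _name, component in self.pipeline:
            doc = component(doc)  # each component takes a doc and returns it
        return doc


def count_tokens(doc):
    # Whitespace splitting is a placeholder for real tokenization.
    doc.annotations["n_tokens"] = len(doc.text.split())
    return doc


def flag_long(doc):
    # Relies on the annotation set by the previous component, so order matters.
    doc.annotations["is_long"] = doc.annotations["n_tokens"] > 3
    return doc


toy_nlp = ToyPipeline()
toy_nlp.add("count_tokens", count_tokens)
toy_nlp.add("flag_long", flag_long)
toy_doc = toy_nlp("An example sentence. Another sentence.")
```

Because `flag_long` reads what `count_tokens` wrote, swapping the two would raise a `KeyError`, which is why the order of components is part of the pipeline's definition.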

Language.__init__

Initialize a Language object.

Example

from spacy.vocab import Vocab
from spacy.language import Language
nlp = Language(Vocab())

from spacy.lang.en import English
nlp = English()

Name	Type	Description
`vocab`	`Vocab`	A `Vocab` object. If `True`, a vocab is created via `Language.Defaults.create_vocab`.
`make_doc`	callable	A function that takes text and returns a `Doc` object. Usually a `Tokenizer`.
`meta`	dict	Custom meta data for the `Language` class. Is written to by models to add model meta data.
RETURNS	`Language`	The newly constructed object.

Language.__call__

Apply the pipeline to some text. The text can span multiple sentences, and can contain arbitrary whitespace. Alignment into the original string is preserved.

Example

doc = nlp("An example sentence. Another sentence.")
assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")

Name	Type	Description
`text`	str	The text to be processed.
`disable`	list	Names of pipeline components to disable.
RETURNS	`Doc`	A container for accessing the annotations.

Pipeline components that should not be applied can now be passed as a list to disable, instead of specifying one keyword argument per component.

- doc = nlp("I don't want parsed", parse=False)
+ doc = nlp("I don't want parsed", disable=["parser"])

Language.pipe

Process texts as a stream, and yield Doc objects in order. This is usually more efficient than processing texts one-by-one.

Example

texts = ["One document.", "...", "Lots of documents"]
for doc in nlp.pipe(texts, batch_size=50):
    assert doc.is_parsed

Name	Type	Description
`texts`	iterable	A sequence of strings.
`as_tuples`	bool	If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`.
`batch_size`	int	The number of texts to buffer.
`disable`	list	Names of pipeline components to disable.
`component_cfg` 2.1	dict	Config parameters for specific pipeline components, keyed by component name.
`n_process` 2.2.2	int	Number of processors to use, only supported in Python 3. Defaults to `1`.
YIELDS	`Doc`	Documents in the order of the original text.
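
The batching and `as_tuples` behavior described above can be sketched without spaCy. This is a simplified illustration, not spaCy's implementation: `minibatch`, `toy_pipe` and `upper_batch` are hypothetical helpers, and upper-casing stands in for real processing.

```python
from itertools import islice


def minibatch(items, size):
    """Yield lists of up to `size` items from any iterable, preserving order."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def toy_pipe(texts, process_batch, batch_size=2, as_tuples=False):
    """Hypothetical streaming helper: buffer inputs, process per batch, yield in order."""
    for batch in minibatch(texts, batch_size):
        if as_tuples:
            # Inputs are (text, context) pairs; the context is passed through untouched.
            batch_texts, contexts = zip(*batch)
            for result, context in zip(process_batch(list(batch_texts)), contexts):
                yield result, context
        else:
            yield from process_batch(batch)


def upper_batch(batch):
    # Placeholder for real processing that handles a whole batch at once.
    return [text.upper() for text in batch]


plain = list(toy_pipe(["one", "two", "three"], upper_batch))
paired = list(toy_pipe([("one", 1), ("two", 2), ("three", 3)], upper_batch, as_tuples=True))
```

Since the helper is a generator, the input can be any iterable, including one that is too large to hold in memory, and only `batch_size` items are buffered at a time.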

Language.update

Update the models in the pipeline.

Example

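from spacy.gold import GoldParse  # import path in spaCy v2.x

# train_data and optimizer are assumed to be defined elsewhere
for raw_text, entity_offsets in train_data:
    doc = nlp.make_doc(raw_text)
    gold = GoldParse(doc, entities=entity_offsets)
    nlp.update([doc], [gold], drop=0.5, sgd=optimizer)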

Name	Type	Description
`docs`	iterable	A batch of `Doc` objects or strings. If strings, a `Doc` object will be created from the text.
`golds`	iterable	A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create `GoldParse` objects. For the available keys and their usage, see `GoldParse.__init__`.
`drop`	float	The dropout rate.
`sgd`	callable	An optimizer.
`losses`	dict	Dictionary to update with the loss, keyed by pipeline component.
`component_cfg` 2.1	dict	Config parameters for specific pipeline components, keyed by component name.

Language.evaluate

Evaluate a model's pipeline components.

Example

scorer = nlp.evaluate(docs_golds, verbose=True)
print(scorer.scores)

Name	Type	Description
`docs_golds`	iterable	Tuples of `Doc` and `GoldParse` objects, such that the `Doc` objects contain the predictions and the `GoldParse` objects the correct annotations. Alternatively, `(text, annotations)` tuples of raw text and a dict (see simple training style).
`verbose`	bool	Print debugging information.
`batch_size`	int	The batch size to use.
`scorer`	`Scorer`	Optional `Scorer` to use. If not passed in, a new one will be created.
`component_cfg` 2.1	dict	Config parameters for specific pipeline components, keyed by component name.
RETURNS	Scorer	The scorer containing the evaluation scores.

Language.begin_training

Allocate models, pre-process training data and acquire an optimizer.

Example

optimizer = nlp.begin_training(gold_tuples)

Name	Type	Description
`gold_tuples`	iterable	Gold-standard training data.
`component_cfg` 2.1	dict	Config parameters for specific pipeline components, keyed by component name.
`**cfg`	-	Config parameters (sent to all components).
RETURNS	callable	An optimizer.

Language.use_params

Replace weights of models in the pipeline with those provided in the params dictionary. Can be used as a context manager, in which case, models go back to their original weights after the block.

Example

with nlp.use_params(optimizer.averages):
    nlp.to_disk("/tmp/checkpoint")

Name	Type	Description
`params`	dict	A dictionary of parameters keyed by model ID.
`**cfg`	-	Config parameters.
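
The "swap weights, then restore them after the block" behavior can be illustrated with a plain context manager. This is a minimal sketch under simplifying assumptions, not spaCy's code: `ToyModel` and this standalone `use_params` function are hypothetical, and weights are plain floats in a dict.

```python
from contextlib import contextmanager


class ToyModel:
    """Hypothetical model holding its weights in a dict keyed by parameter ID."""

    def __init__(self, weights):
        self.weights = dict(weights)


@contextmanager
def use_params(model, params):
    """Temporarily replace weights that appear in `params`, then restore them."""
    backup = dict(model.weights)
    model.weights.update({key: value for key, value in params.items() if key in model.weights})
    try:
        yield model
    finally:
        # Runs even if the block raises, so the original weights always come back.
        model.weights = backup


toy_model = ToyModel({"W": 1.0, "b": 0.5})
averages = {"W": 0.9}  # e.g. averaged parameters collected by an optimizer
with use_params(toy_model, averages):
    inside = dict(toy_model.weights)
after = dict(toy_model.weights)
```

This mirrors the example above, where a checkpoint is written inside the block so that the saved model uses the averaged weights while training continues with the originals.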

Language.preprocess_gold

Can be called before training to pre-process gold data. By default, it handles nonprojectivity and adds missing tags to the tag map.

Name	Type	Description
`docs_golds`	iterable	Tuples of `Doc` and `GoldParse` objects.
YIELDS	tuple	Tuples of `Doc` and `GoldParse` objects.

Language.create_pipe

Create a pipeline component from a factory.

Example

parser = nlp.create_pipe("parser")
nlp.add_pipe(parser)

Name	Type	Description
`name`	str	Factory name to look up in `Language.factories`.
`config`	dict	Configuration parameters to initialize component.
RETURNS	callable	The pipeline component.

Language.add_pipe

Add a component to the processing pipeline. Valid components are callables that take a Doc object, modify it and return it. Only one of before, after, first or last can be set. Default behavior is last=True.

Example

def component(doc):
    # modify Doc and return it
    return doc

nlp.add_pipe(component, before="ner")
nlp.add_pipe(component, name="custom_name", last=True)

Name	Type	Description
`component`	callable	The pipeline component.
`name`	str	Name of pipeline component. Overwrites existing `component.name` attribute if available. If no `name` is set and the component exposes no name attribute, `component.__name__` is used. An error is raised if the name already exists in the pipeline.
`before`	str	Component name to insert component directly before.
`after`	str	Component name to insert component directly after.
`first`	bool	Insert component first / not first in the pipeline.
`last`	bool	Insert component last / not last in the pipeline.
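
The positioning rules above (at most one of `before`, `after`, `first` or `last`, defaulting to last) can be sketched as list insertion on `(name, component)` tuples. This is a toy illustration and not spaCy's implementation. `insert_component` and `identity` are hypothetical names.

```python
def insert_component(pipeline, name, component, before=None, after=None, first=None, last=None):
    """Insert (name, component) into a list of tuples at the requested position."""
    chosen = [arg for arg in (before, after, first, last) if arg]
    if len(chosen) > 1:
        raise ValueError("Only one of before, after, first or last can be set.")
    names = [existing for existing, _ in pipeline]
    if name in names:
        raise ValueError(f"'{name}' already exists in the pipeline.")
    if before is not None:
        index = names.index(before)  # raises ValueError if `before` is unknown
    elif after is not None:
        index = names.index(after) + 1
    elif first:
        index = 0
    else:
        index = len(pipeline)  # default behavior: append at the end
    pipeline.insert(index, (name, component))


def identity(doc):
    return doc


toy_pipeline = [("tagger", identity), ("ner", identity)]
insert_component(toy_pipeline, "custom", identity, before="ner")
insert_component(toy_pipeline, "final", identity)
insert_component(toy_pipeline, "start", identity, first=True)
order = [name for name, _ in toy_pipeline]

try:
    insert_component(toy_pipeline, "conflict", identity, before="ner", last=True)
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
```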

Language.has_pipe

Check whether a component is present in the pipeline. Equivalent to name in nlp.pipe_names.

Example

nlp.add_pipe(lambda doc: doc, name="component")
assert "component" in nlp.pipe_names
assert nlp.has_pipe("component")

Name	Type	Description
`name`	str	Name of the pipeline component to check.
RETURNS	bool	Whether a component of that name exists in the pipeline.

Language.get_pipe

Get a pipeline component for a given component name.

Example

parser = nlp.get_pipe("parser")
custom_component = nlp.get_pipe("custom_component")

Name	Type	Description
`name`	str	Name of the pipeline component to get.
RETURNS	callable	The pipeline component.

Language.replace_pipe

Replace a component in the pipeline.

Example

nlp.replace_pipe("parser", my_custom_parser)

Name	Type	Description
`name`	str	Name of the component to replace.
`component`	callable	The pipeline component to insert.

Language.rename_pipe

Rename a component in the pipeline. Useful to create custom names for pre-defined and pre-loaded components. To change the default name of a component added to the pipeline, you can also use the name argument on add_pipe.

Example

nlp.rename_pipe("parser", "spacy_parser")

Name	Type	Description
`old_name`	str	Name of the component to rename.
`new_name`	str	New name of the component.

Language.remove_pipe

Remove a component from the pipeline. Returns the removed component name and component function.

Example

name, component = nlp.remove_pipe("parser")
assert name == "parser"

Name	Type	Description
`name`	str	Name of the component to remove.
RETURNS	tuple	A `(name, component)` tuple of the removed component.
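
Renaming and removing can both be pictured as operations on the ordered list of `(name, component)` tuples. The sketch below is a toy model of that idea, not spaCy's implementation. `rename_component` and `remove_component` are hypothetical helpers, and strings stand in for component callables.

```python
def rename_component(pipeline, old_name, new_name):
    """Give an existing component a new name while keeping its position."""
    names = [name for name, _ in pipeline]
    if old_name not in names:
        raise ValueError(f"No component named '{old_name}' in the pipeline.")
    if new_name in names:
        raise ValueError(f"'{new_name}' already exists in the pipeline.")
    index = names.index(old_name)
    pipeline[index] = (new_name, pipeline[index][1])


def remove_component(pipeline, name):
    """Remove a component and return it as a (name, component) tuple."""
    names = [existing for existing, _ in pipeline]
    if name not in names:
        raise ValueError(f"No component named '{name}' in the pipeline.")
    return pipeline.pop(names.index(name))


components = [("tagger", "T"), ("parser", "P")]
rename_component(components, "parser", "spacy_parser")
removed = remove_component(components, "spacy_parser")
```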

Language.select_pipes

Disable one or more pipeline components. If used as a context manager, the pipeline will be restored to the initial state at the end of the block. Otherwise, a DisabledPipes object is returned, which has a .restore() method you can use to undo your changes.

You can specify either disable or enable, each as a list or a string. If you specify enable, all components not in the enable list will be disabled.

Example

# New API as of v3.0
with nlp.select_pipes(disable=["tagger", "parser"]):
    nlp.begin_training()

with nlp.select_pipes(enable="ner"):
    nlp.begin_training()

disabled = nlp.select_pipes(disable=["tagger", "parser"])
nlp.begin_training()
disabled.restore()

Name	Type	Description
`disable`	list	Names of pipeline components to disable.
`disable`	str	Name of pipeline component to disable.
`enable`	list	Names of pipeline components that will not be disabled.
`enable`	str	Name of pipeline component that will not be disabled.
RETURNS	`DisabledPipes`	The disabled pipes that can be restored by calling the object's `.restore()` method.

As of spaCy v3.0, the disable_pipes method has been renamed to select_pipes:

- nlp.disable_pipes(["tagger", "parser"])
+ nlp.select_pipes(disable=["tagger", "parser"])
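
To make the disable/enable and restore semantics concrete, here is a minimal sketch that models the pipeline as a list of `(name, component)` tuples. It is an illustration under that assumption, not spaCy's implementation. `ToyDisabledPipes` and `select_components` are hypothetical names.

```python
class ToyDisabledPipes(list):
    """Hypothetical stand-in for DisabledPipes: remembers the removed components."""

    def __init__(self, pipeline, names):
        super().__init__()
        self.pipeline = pipeline
        self.original = list(pipeline)  # snapshot used to restore the order
        for name, component in self.original:
            if name in names:
                self.append((name, component))
        pipeline[:] = [item for item in self.original if item[0] not in names]

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.restore()

    def restore(self):
        """Put the pipeline back into the state it had before disabling."""
        self.pipeline[:] = self.original
        self[:] = []


def select_components(pipeline, disable=None, enable=None):
    """Disable the named components, or everything except the enabled ones."""
    if (disable is None) == (enable is None):
        raise ValueError("Specify exactly one of disable or enable.")
    if enable is not None:
        keep = [enable] if isinstance(enable, str) else list(enable)
        names = [name for name, _ in pipeline if name not in keep]
    else:
        names = [disable] if isinstance(disable, str) else list(disable)
    return ToyDisabledPipes(pipeline, names)


toy_components = [("tagger", None), ("parser", None), ("ner", None)]
with select_components(toy_components, enable="ner"):
    during = [name for name, _ in toy_components]
restored = [name for name, _ in toy_components]

disabled = select_components(toy_components, disable=["tagger", "parser"])
disabled_names = [name for name, _ in disabled]
disabled.restore()
```

Returning an object that is both a list and a context manager is what lets the same call support the `with` form and the explicit `.restore()` form.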

Language.to_disk

Save the current state to a directory. If a model is loaded, this will include the model.

Example

nlp.to_disk("/path/to/models")

Name	Type	Description
`path`	str / `Path`	A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects.
`exclude`	list	Names of pipeline components or serialization fields to exclude.

Language.from_disk

Load state from a directory. Modifies the object in place and returns it. If the saved Language object contains a model, the model will be loaded. Note that this method is commonly used via subclasses like English or German, so that language-specific functionality like the lexical attribute getters is available to the loaded object.

Example

from spacy.language import Language
nlp = Language().from_disk("/path/to/model")

# using language-specific subclass
from spacy.lang.en import English
nlp = English().from_disk("/path/to/en_model")

Name	Type	Description
`path`	str / `Path`	A path to a directory. Paths may be either strings or `Path`-like objects.
`exclude`	list	Names of pipeline components or serialization fields to exclude.
RETURNS	`Language`	The modified `Language` object.

Language.to_bytes

Serialize the current state to a binary string.

Example

nlp_bytes = nlp.to_bytes()

Name	Type	Description
`exclude`	list	Names of pipeline components or serialization fields to exclude.
RETURNS	bytes	The serialized form of the `Language` object.

Language.from_bytes

Load state from a binary string. Note that this method is commonly used via subclasses like English or German, so that language-specific functionality like the lexical attribute getters is available to the loaded object.

Example

from spacy.lang.en import English
nlp_bytes = nlp.to_bytes()
nlp2 = English()
nlp2.from_bytes(nlp_bytes)

Name	Type	Description
`bytes_data`	bytes	The data to load from.
`exclude`	list	Names of pipeline components or serialization fields to exclude.
RETURNS	`Language`	The `Language` object.

Pipeline components that should not be loaded can now be passed as a list to disable (v2.0) or exclude (v2.1), instead of specifying one keyword argument per component.

- nlp = English().from_bytes(bytes, tagger=False, entity=False)
+ nlp = English().from_bytes(bytes, exclude=["tagger", "ner"])

Attributes

Name	Type	Description
`vocab`	`Vocab`	A container for the lexical types.
`tokenizer`	`Tokenizer`	The tokenizer.
`make_doc`	`callable`	Callable that takes a string and returns a `Doc`.
`pipeline`	list	List of `(name, component)` tuples describing the current processing pipeline, in order.
`pipe_names` 2	list	List of pipeline component names, in order.
`pipe_labels` 2.2	dict	Labels set by the pipeline components, if available, keyed by component name.
`meta`	dict	Custom meta data for the Language class. If a model is loaded, contains meta data of the model.
`path` 2	`Path`	Path to the model data directory, if a model is loaded. Otherwise `None`.

Class attributes

Name	Type	Description
`Defaults`	class	Settings, data and factory methods for creating the `nlp` object and processing pipeline.
`lang`	str	Two-letter language ID, i.e. ISO code.
`factories` 2	dict	Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name.

Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Example

data = nlp.to_bytes(exclude=["tokenizer", "vocab"])
nlp.from_disk("./model-data", exclude=["ner"])

Name	Description
`vocab`	The shared `Vocab`.
`tokenizer`	Tokenization rules and exceptions.
`meta`	The meta data, available as `Language.meta`.
...	String names of pipeline components, e.g. `"ner"`.
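
The effect of `exclude` on both saving and loading can be sketched with a dict of named fields. This is a simplified illustration that uses JSON from the standard library and is not spaCy's serialization format. `toy_to_bytes`, `toy_from_bytes` and the field values are hypothetical.

```python
import json


def toy_to_bytes(state, exclude=()):
    """Serialize every field of `state` except those named in `exclude`."""
    kept = {key: value for key, value in state.items() if key not in exclude}
    return json.dumps(kept, sort_keys=True).encode("utf8")


def toy_from_bytes(state, bytes_data, exclude=()):
    """Update `state` in place from `bytes_data`, skipping excluded fields."""
    loaded = json.loads(bytes_data.decode("utf8"))
    for key, value in loaded.items():
        if key not in exclude:
            state[key] = value
    return state


saved = {
    "vocab": ["an", "example"],
    "tokenizer": {"rules": 3},
    "meta": {"lang": "en"},
    "ner": [0.1, 0.2],
}
# Excluded fields are never written ...
data = toy_to_bytes(saved, exclude=["tokenizer", "vocab"])

fresh = {"vocab": [], "tokenizer": {}, "meta": {}, "ner": []}
# ... and fields excluded on load keep whatever value the target already had.
toy_from_bytes(fresh, data, exclude=["ner"])
```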
