| title | next |
|---|---|
| Language Processing Pipelines | /usage/embeddings-transformers |
Processing text
When you call nlp on a text, spaCy will tokenize it and then call each
component on the Doc, in order. It then returns the processed Doc that you
can work with.
doc = nlp("This is a text")
When processing large volumes of text, the statistical models are usually more
efficient if you let them work on batches of texts. spaCy's
nlp.pipe method takes an iterable of texts and yields
processed Doc objects. The batching is done internally.
texts = ["This is a text", "These are lots of texts", "..."]
- docs = [nlp(text) for text in texts]
+ docs = list(nlp.pipe(texts))
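To make the idea of internal batching concrete, here is a minimal pure-Python sketch of how a stream of texts can be grouped into fixed-size batches without loading everything into memory. The helper name `batched_stream` is hypothetical and this is not spaCy's actual implementation; it only illustrates the buffering pattern.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_stream(texts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to batch_size texts, consuming the stream lazily."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def text_stream() -> Iterator[str]:
    # Stand-in for a (potentially very large) source of texts
    for i in range(5):
        yield f"Text number {i}"


batches = list(batched_stream(text_stream(), batch_size=2))
print([len(batch) for batch in batches])
```

Each batch can then be handed to a model in one call, which is typically where the efficiency gain comes from.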
- Process the texts as a stream using nlp.pipe and buffer them in batches, instead of one-by-one. This is usually much more efficient.
- Only apply the pipeline components you need. Getting predictions from the model that you don't actually need adds up and becomes very inefficient at scale. To prevent this, use the disable keyword argument to disable components you don't need – either when loading a model, or during processing with nlp.pipe. See the section on disabling pipeline components for more details and examples.
In this example, we're using nlp.pipe to process a
(potentially very large) iterable of texts as a stream. Because we're only
accessing the named entities in doc.ents (set by the ner component), we'll
disable all other statistical components (the tagger and parser) during
processing. nlp.pipe yields Doc objects, so we can iterate over them and
access the named entity predictions:
✏️ Things to try
- Also disable the "ner" component. You'll see that the doc.ents are now empty, because the entity recognizer didn't run.
### {executable="true"}
import spacy
texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
When using nlp.pipe, keep in mind that it returns a
generator that
yields Doc objects – not a list. So if you want to use it like a list, you'll
have to call list() on it first:
- docs = nlp.pipe(texts)[0]         # will raise an error
+ docs = list(nlp.pipe(texts))[0]   # works as expected
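If you only need the first few results, you don't have to materialize the whole stream. The following sketch uses a blank English pipeline (tokenizer only, so no trained model is assumed to be installed) and `itertools.islice` to take two docs from the generator; it also shows that indexing the generator directly fails.

```python
from itertools import islice

import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components needed
texts = ["This is a text", "These are lots of texts", "..."]

indexing_error = None
docs = nlp.pipe(texts)
try:
    docs[0]  # generators don't support indexing
except TypeError as err:
    indexing_error = err

# Take only the first two docs without processing the rest of the stream
first_two = list(islice(nlp.pipe(texts), 2))
print([doc.text for doc in first_two])
```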
How pipelines work
spaCy makes it very easy to create your own pipelines consisting of reusable
components – this includes spaCy's default tagger, parser and entity recognizer,
but also your own custom processing functions. A pipeline component can be added
to an already existing nlp object, specified when initializing a Language
class, or defined within a model package.
config.cfg (excerpt)
[nlp]
lang = "en"
pipeline = ["tagger", "parser"]

[components]

[components.tagger]
factory = "tagger"
# Settings for the tagger component

[components.parser]
factory = "parser"
# Settings for the parser component
When you load a model, spaCy first consults the model's
meta.json and
config.cfg. The config tells spaCy what language
class to use, which components are in the pipeline, and how those components
should be created. spaCy will then do the following:
- Load the language class and data for the given ID via get_lang_class and initialize it. The Language class contains the shared vocabulary, tokenization rules and the language-specific settings.
- Iterate over the pipeline names and look up each component name in the [components] block. The factory tells spaCy which component factory to use for adding the component with add_pipe. The settings are passed into the factory.
- Make the model data available to the Language class by calling from_disk with the path to the model data directory.
So when you call this...
nlp = spacy.load("en_core_web_sm")
... the model's config.cfg tells spaCy to use the language "en" and the
pipeline ["tagger", "parser", "ner"]. spaCy will then initialize
spacy.lang.en.English, and create each pipeline component and add it to the
processing pipeline. It'll then load in the model's data from its data directory
and return the modified Language class for you to use as the nlp object.
spaCy v3.0 introduces a config.cfg, which includes more detailed settings for
the model pipeline, its components and the
training process. You can export the config of your
current nlp object by calling nlp.config.to_disk.
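As a rough illustration of exporting a config, the sketch below builds a blank English pipeline with a rule-based sentencizer, writes its config to a temporary file and reads it back. It assumes `spacy.util.load_config` is available for loading the file; the temporary path is only for demonstration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.util import load_config

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.cfg"
    nlp.config.to_disk(config_path)     # Export the current config
    config = load_config(config_path)   # Read it back for inspection

print(config["nlp"]["lang"])
print(config["nlp"]["pipeline"])
print(config["components"]["sentencizer"]["factory"])
```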
Fundamentally, a spaCy model consists of three components: the
weights, i.e. binary data loaded in from a directory, a pipeline of
functions called in order, and language data like the tokenization rules and
language-specific settings. For example, a Spanish NER model requires different
weights, language data and pipeline components than an English parsing and
tagging model. This is also why the pipeline state is always held by the
Language class. spacy.load puts this all
together and returns an instance of Language with a pipeline set and access to
the binary data:
### spacy.load under the hood
import spacy

lang = "en"
pipeline = ["tagger", "parser", "ner"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"
cls = spacy.util.get_lang_class(lang)  # 1. Get Language class, e.g. English
nlp = cls()                            # 2. Initialize it
for name in pipeline:
    nlp.add_pipe(name)                 # 3. Add the component to the pipeline
nlp.from_disk(data_path)               # 4. Load in the binary data
When you call nlp on a text, spaCy will tokenize it and then call each
component on the Doc, in order. Since the model data is loaded, the
components can access it to assign annotations to the Doc object, and
subsequently to the Token and Span which are only views of the Doc, and
don't own any data themselves. All components return the modified document,
which is then processed by the component next in the pipeline.
### The pipeline under the hood
doc = nlp.make_doc("This is a sentence")  # Create a Doc from raw text
for name, proc in nlp.pipeline:           # Iterate over components in order
    doc = proc(doc)                       # Apply each component
The current processing pipeline is available as nlp.pipeline, which returns a
list of (name, component) tuples, or nlp.pipe_names, which only returns a
list of human-readable component names.
print(nlp.pipeline)
# [('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>)]
print(nlp.pipe_names)
# ['tagger', 'parser', 'ner']
Built-in pipeline components
spaCy ships with several built-in pipeline components that are registered with
string names. This means that you can initialize them by calling
nlp.add_pipe with their names and spaCy will know
how to create them. See the API documentation for a full list of
available pipeline components and component functions.
Usage
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# add_pipe returns the added component
ruler = nlp.add_pipe("entity_ruler")
| String name | Component | Description | 
|---|---|---|
| tagger | Tagger | Assign part-of-speech-tags. | 
| parser | DependencyParser | Assign dependency labels. | 
| ner | EntityRecognizer | Assign named entities. | 
| entity_linker | EntityLinker | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. | 
| entity_ruler | EntityRuler | Assign named entities based on pattern rules and dictionaries. | 
| textcat | TextCategorizer | Assign text categories. | 
| lemmatizer | Lemmatizer | Assign base forms to words. | 
| morphologizer | Morphologizer | Assign morphological features and coarse-grained POS tags. | 
| senter | SentenceRecognizer | Assign sentence boundaries. | 
| sentencizer | Sentencizer | Add rule-based sentence segmentation without the dependency parse. | 
| tok2vec | Tok2Vec | Assign token-to-vector embeddings. | 
| transformer | Transformer | Assign the tokens and outputs of a transformer model. | 
Disabling and modifying pipeline components
If you don't need a particular component of the pipeline – for example, the
tagger or the parser – you can disable loading it. This can sometimes make a
big difference and improve loading speed. Disabled component names can be
provided to spacy.load,
Language.from_disk or the nlp object itself as a
list:
### Disable loading
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
In some cases, you do want to load all pipeline components and their weights,
because you need them at different points in your application. However, if you
only need a Doc object with named entities, there's no need to run all
pipeline components on it – that can potentially make processing much slower.
Instead, you can use the disable keyword argument on
nlp.pipe to temporarily disable the components during
processing:
### Disable for processing
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
If you need to execute more code with components disabled – e.g. to reset
the weights or update only some components during training – you can use the
nlp.select_pipes context manager. At the end of
the with block, the disabled pipeline components will be restored
automatically. Alternatively, select_pipes returns an object that lets you
call its restore() method to restore the disabled components when needed. This
can be useful if you want to prevent unnecessary code indentation of large
blocks.
### Disable for block
# 1. Use as a context manager
with nlp.select_pipes(disable=["tagger", "parser"]):
    doc = nlp("I won't be tagged and parsed")
doc = nlp("I will be tagged and parsed")
# 2. Restore manually
disabled = nlp.select_pipes(disable="ner")
doc = nlp("I won't have named entities")
disabled.restore()
If you want to disable all pipes except for one or a few, you can use the
enable keyword. Just like the disable keyword, it takes a list of pipe
names, or a string defining just one pipe.
# Enable only the parser
with nlp.select_pipes(enable="parser"):
    doc = nlp("I will only be parsed")
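The effect of select_pipes can be checked without a trained model. This sketch uses a blank English pipeline with only a sentencizer (an assumption made to keep the example self-contained) and inspects `nlp.pipe_names` inside and outside the block, then repeats the same with a manual `restore()`.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# 1. Context manager: the component is disabled only inside the block
with nlp.select_pipes(disable="sentencizer"):
    names_inside = list(nlp.pipe_names)
names_after = list(nlp.pipe_names)

# 2. Manual restore: the component stays disabled until restore() is called
disabled = nlp.select_pipes(disable=["sentencizer"])
names_manual = list(nlp.pipe_names)
disabled.restore()
names_restored = list(nlp.pipe_names)

print(names_inside, names_after, names_manual, names_restored)
```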
Finally, you can also use the remove_pipe method
to remove pipeline components from an existing pipeline, the
rename_pipe method to rename them, or the
replace_pipe method to replace them with a
custom component entirely (more details on this in the section on
custom components).
nlp.remove_pipe("parser")
nlp.rename_pipe("ner", "entityrecognizer")
nlp.replace_pipe("tagger", "my_custom_tagger")
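Here is a small self-contained sketch of these three methods on a blank pipeline. The component `demo_keep_doc` and the instance names are hypothetical, and the example assumes the v3 convention that replace_pipe, like add_pipe, takes the string name of a registered component.

```python
import spacy
from spacy.language import Language


@Language.component("demo_keep_doc")
def demo_keep_doc(doc):
    # Hypothetical placeholder component: returns the doc unchanged
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("demo_keep_doc", name="placeholder")
nlp.add_pipe("demo_keep_doc", name="extra")

nlp.rename_pipe("placeholder", "passthrough")     # Same component, new name
nlp.replace_pipe("sentencizer", "demo_keep_doc")  # Same name, new component
nlp.remove_pipe("extra")                          # Drop the component

print(nlp.pipe_names)
```

Note that replacing a component keeps its name and position in the pipeline; only the callable behind the name changes.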
Sourcing pipeline components from existing models
Pipeline components that are independent can also be reused across models.
Instead of adding a new blank component to a pipeline, you can also copy an
existing component from a pretrained model by setting the source argument on
nlp.add_pipe. The first argument will then be
interpreted as the name of the component in the source pipeline – for instance,
"ner". This is especially useful for
training a model because it lets you mix
and match components and create fully custom model packages with updated
pretrained components and new components trained on your data.
When reusing components across models, keep in mind that the vocabulary, vectors and model settings must match. If a pretrained model includes word vectors and the component uses them as features, the model you copy it to needs to have the same vectors available – otherwise, it won't be able to make the same predictions.
In training config
Instead of providing a factory, component blocks in the training config can also define a source. The string needs to be a loadable spaCy model package or path.
[components.ner]
source = "en_core_web_sm"
component = "ner"
By default, sourced components will be updated with your data during training. If you want to preserve the component as-is, you can "freeze" it:
[training]
frozen_components = ["ner"]
### {executable="true"}
import spacy
# The source model with different components
source_nlp = spacy.load("en_core_web_sm")
print(source_nlp.pipe_names)
# Add only the entity recognizer to the new blank model
nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
print(nlp.pipe_names)
Analyzing pipeline components
The nlp.analyze_pipes method analyzes the
components in the current pipeline and outputs information about them, like the
attributes they set on the Doc and Token, whether
they retokenize the Doc and which scores they produce during training. It will
also show warnings if components require values that aren't set by previous
components – for instance, if the entity linker is used but no component that
runs before it sets named entities. Setting pretty=True will pretty-print a
table instead of only returning the structured data.
✏️ Things to try
- Add the components "ner" and "sentencizer" before the "entity_linker". The analysis should now show no problems, because requirements are met.
### {executable="true"}
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# This is a problem because it needs entities and sentence boundaries
nlp.add_pipe("entity_linker")
analysis = nlp.analyze_pipes(pretty=True)
### Structured
{
  "summary": {
    "tagger": {
      "assigns": ["token.tag"],
      "requires": [],
      "scores": ["tag_acc", "pos_acc", "lemma_acc"],
      "retokenizes": false
    },
    "entity_linker": {
      "assigns": ["token.ent_kb_id"],
      "requires": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"],
      "scores": [],
      "retokenizes": false
    }
  },
  "problems": {
    "tagger": [],
    "entity_linker": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"]
  },
  "attrs": {
    "token.ent_iob": { "assigns": [], "requires": ["entity_linker"] },
    "doc.ents": { "assigns": [], "requires": ["entity_linker"] },
    "token.ent_kb_id": { "assigns": ["entity_linker"], "requires": [] },
    "doc.sents": { "assigns": [], "requires": ["entity_linker"] },
    "token.tag": { "assigns": ["tagger"], "requires": [] },
    "token.ent_type": { "assigns": [], "requires": ["entity_linker"] }
  }
}
### Pretty
============================= Pipeline Overview =============================
#   Component       Assigns           Requires         Scores      Retokenizes
-   -------------   ---------------   --------------   ---------   -----------
0   tagger          token.tag                          tag_acc     False
                                                       pos_acc
                                                       lemma_acc
1   entity_linker   token.ent_kb_id   doc.ents                     False
                                      doc.sents
                                      token.ent_iob
                                      token.ent_type
================================ Problems (4) ================================
⚠ 'entity_linker' requirements not met: doc.ents, doc.sents,
token.ent_iob, token.ent_type
The pipeline analysis is static and does not actually run the components. This means that it relies on the information provided by the components themselves. If a custom component declares that it assigns an attribute but it doesn't, the pipeline analysis won't catch that.
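The static nature of the analysis can be demonstrated with two hypothetical components that only declare metadata. The sketch assumes the `assigns` and `requires` keyword arguments of `@Language.component`; the first component claims to set `doc.ents` but never does, and the analysis still reports the requirement as met.

```python
import spacy
from spacy.language import Language


@Language.component("claims_entities", assigns=["doc.ents"])
def claims_entities(doc):
    # Declares that it assigns doc.ents but doesn't actually set anything
    return doc


@Language.component("needs_entities", requires=["doc.ents"])
def needs_entities(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("needs_entities")
problems_before = nlp.analyze_pipes()["problems"]

# Adding the component that only *claims* to set doc.ents resolves the problem
nlp.add_pipe("claims_entities", first=True)
problems_after = nlp.analyze_pipes()["problems"]

print(problems_before)
print(problems_after)
```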
Creating custom pipeline components
A pipeline component is a function that receives a Doc object, modifies it and
returns it – for example, by using the current weights to make a prediction
and set some annotation on the document. By adding a component to the pipeline,
you'll get access to the Doc at any point during processing – instead of
only being able to modify it afterwards.
Example
from spacy.language import Language

@Language.component("my_component")
def my_component(doc):
    # Do something to the doc here
    return doc
| Argument | Type | Description | 
|---|---|---|
| doc | Doc | The Doc object processed by the previous component. | 
| RETURNS | Doc | The Doc object processed by this pipeline component. | 
The @Language.component decorator lets you turn a
simple function into a pipeline component. It takes at least one argument, the
name of the component factory. You can use this name to add an instance of
your component to the pipeline. It can also be listed in your model config, so
you can save, load and train models using your component.
Custom components can be added to the pipeline using the
add_pipe method. Optionally, you can either specify
a component to add it before or after, tell spaCy to add it first or
last in the pipeline, or define a custom name. If no name is set and no
name attribute is present on your component, the function name is used.
Example
nlp.add_pipe("my_component")
nlp.add_pipe("my_component", first=True)
nlp.add_pipe("my_component", before="parser")
| Argument | Description | 
|---|---|
| last | If set to True, component is added last in the pipeline (default). | 
| first | If set to True, component is added first in the pipeline. | 
| before | String name or index to add the new component before. | 
| after | String name or index to add the new component after. | 
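The positioning arguments in the table can be tried out on a blank pipeline. In this sketch, `demo_passthrough` is a hypothetical do-nothing component that is added several times under different instance names so the resulting order is easy to read off.

```python
import spacy
from spacy.language import Language


@Language.component("demo_passthrough")
def demo_passthrough(doc):
    # Hypothetical component that leaves the doc untouched
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("demo_passthrough", name="at_end")  # last is the default
nlp.add_pipe("demo_passthrough", name="at_start", first=True)
nlp.add_pipe("demo_passthrough", name="before_sents", before="sentencizer")
nlp.add_pipe("demo_passthrough", name="after_sents", after="sentencizer")

print(nlp.pipe_names)
```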
As of v3.0, components need to be registered using the
@Language.component or
@Language.factory decorator so spaCy knows that a
function is a component. nlp.add_pipe now takes the
string name of the component factory instead of the component function. This
doesn't only save you lines of code, it also allows spaCy to validate and track
your custom components, and make sure they can be saved and loaded.
- ruler = nlp.create_pipe("entity_ruler")
- nlp.add_pipe(ruler)
+ ruler = nlp.add_pipe("entity_ruler")
Examples: Simple stateless pipeline components
The following component receives the Doc in the pipeline and prints some
information about it: the number of tokens, the part-of-speech tags of the
tokens and a conditional message based on the document length. The
@Language.component decorator lets you register the
component under the name "info_component".
✏️ Things to try
- Add the component first in the pipeline by setting first=True. You'll see that the part-of-speech tags are empty, because the component now runs before the tagger and the tags aren't available yet.
- Change the component name or remove the name argument. You should see this change reflected in nlp.pipe_names.
- Print nlp.pipeline. You'll see a list of tuples describing the component name and the function that's called on the Doc object in the pipeline.
- Change the first argument to @Language.component, the name, to something else. spaCy should now complain that it doesn't know a component of the name "info_component".
### {executable="true"}
import spacy
from spacy.language import Language
@Language.component("info_component")
def my_component(doc):
    print(f"After tokenization, this doc has {len(doc)} tokens.")
    print("The part-of-speech tags are:", [token.pos_ for token in doc])
    if len(doc) < 10:
        print("This is a pretty short document.")
    return doc
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("info_component", name="print_info", last=True)
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner', 'print_info']
doc = nlp("This is a sentence.")
Here's another example of a pipeline component that implements custom logic to improve the sentence boundaries set by the dependency parser. The custom logic should therefore be applied after tokenization, but before the dependency parsing – this way, the parser can also take advantage of the sentence boundaries.
✏️ Things to try
- Print [token.dep_ for token in doc] with and without the custom pipeline component. You'll see that the predicted dependency parse changes to match the sentence boundaries.
- Remove the else block. All other tokens will now have is_sent_start set to None (missing value), so the parser will assign sentence boundaries in between.
### {executable="true"}
import spacy
from spacy.language import Language
@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Define sentence start if pipe + titlecase token
        if token.text == "|" and doc[i + 1].is_title:
            doc[i + 1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise, to tell
            # the parser to leave those tokens alone
            doc[i + 1].is_sent_start = False
    return doc
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentencizer", before="parser")  # Insert before the parser
doc = nlp("This is. A sentence. | This is. Another sentence.")
for sent in doc.sents:
    print(sent.text)
Component factories and stateful components
Component factories are callables that take settings and return a pipeline
component function. This is useful if your component is stateful and if you
need to customize its creation, or if you need access to the current nlp
object or the shared vocab. Component factories can be registered using the
@Language.factory decorator and they need at least
two named arguments that are filled in automatically when the component is
added to the pipeline:
Example
from spacy.language import Language

@Language.factory("my_component")
def my_component(nlp, name):
    return MyComponent()
| Argument | Description | 
|---|---|
| nlp | The current nlp object. Can be used to access the shared vocab. | 
| name | The instance name of the component in the pipeline. This lets you identify different instances of the same component. | 
All other settings can be passed in by the user via the config argument on
nlp.add_pipe. The
@Language.factory decorator also lets you define a
default_config that's used as a fallback.
### With config {highlight="4,9"}
import spacy
from spacy.language import Language
@Language.factory("my_component", default_config={"some_setting": True})
def my_component(nlp, name, some_setting: bool):
    return MyComponent(some_setting=some_setting)
nlp = spacy.blank("en")
nlp.add_pipe("my_component", config={"some_setting": False})
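For a version of this pattern that runs end to end, the following sketch defines a small stateful component and registers it with a default setting. The names `ShortDocFlagger`, `short_doc_flagger`, `max_tokens` and the `is_short` extension are all hypothetical and chosen only for illustration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


class ShortDocFlagger:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        # Register the custom attribute once
        if not Doc.has_extension("is_short"):
            Doc.set_extension("is_short", default=False)

    def __call__(self, doc: Doc) -> Doc:
        doc._.is_short = len(doc) <= self.max_tokens
        return doc


@Language.factory("short_doc_flagger", default_config={"max_tokens": 3})
def create_short_doc_flagger(nlp: Language, name: str, max_tokens: int):
    return ShortDocFlagger(max_tokens=max_tokens)


nlp_default = spacy.blank("en")
nlp_default.add_pipe("short_doc_flagger")  # falls back to max_tokens=3

nlp_custom = spacy.blank("en")
nlp_custom.add_pipe("short_doc_flagger", config={"max_tokens": 10})

text = "This is a short text"  # 5 tokens
print(nlp_default(text)._.is_short)
print(nlp_custom(text)._.is_short)
```

The same registered factory produces two differently configured instances, depending on whether the default or the user-supplied config is used.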
The @Language.component decorator is essentially a
shortcut for stateless pipeline components that don't need any settings. This
means you don't have to always write a function that returns your function if
there's no state to be passed through – spaCy can just take care of this for
you. The following two code examples are equivalent:
# Stateless component with @Language.factory
@Language.factory("my_component")
def create_my_component(nlp, name):
    def my_component(doc):
        # Do something to the doc
        return doc
    return my_component
# Stateless component with @Language.component
@Language.component("my_component")
def my_component(doc):
    # Do something to the doc
    return doc
The @Language.factory decorator can be added to
a function or a class. If it's added to a class, it expects the __init__
method to take the arguments nlp and name, and will populate all other
arguments from the config. That said, it's often cleaner and more intuitive to
make your factory a separate function. That's also how spaCy does it internally.
Example: Stateful component with settings
This example shows a stateful pipeline component for handling acronyms:
based on a dictionary, it will detect acronyms and their expanded forms in both
directions and add them to a list as the custom doc._.acronyms
extension attribute. Under the hood, it uses
the PhraseMatcher to find instances of the phrases.
The factory function takes three arguments: the shared nlp object and
component instance name, which are passed in automatically by spaCy, and a
case_sensitive config setting that makes the matching and acronym detection
case-sensitive.
✏️ Things to try
- Change the config passed to nlp.add_pipe and set "case_sensitive" to True. You should see that the expanded acronym for "LOL" isn't detected anymore.
- Add some more terms to the DICTIONARY and update the processed text so they're detected.
- Add a name argument to nlp.add_pipe to change the component name. Print nlp.pipe_names to see the change reflected in the pipeline.
- Print the config of the current nlp object with print(nlp.config.to_str()) and inspect the [components] block. You should see an entry for the acronyms component, referencing the factory acronyms and the config settings.
### {executable="true"}
from spacy.language import Language
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
import spacy
DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
DICTIONARY.update({value: key for key, value in DICTIONARY.items()})
@Language.factory("acronyms", default_config={"case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, case_sensitive: bool):
    return AcronymComponent(nlp, case_sensitive)
class AcronymComponent:
    def __init__(self, nlp: Language, case_sensitive: bool):
        # Create the matcher and match on Token.lower if case-insensitive
        matcher_attr = "TEXT" if case_sensitive else "LOWER"
        self.matcher = PhraseMatcher(nlp.vocab, attr=matcher_attr)
        self.matcher.add("ACRONYMS", [nlp.make_doc(term) for term in DICTIONARY])
        self.case_sensitive = case_sensitive
        # Register custom extension on the Doc
        if not Doc.has_extension("acronyms"):
            Doc.set_extension("acronyms", default=[])
    def __call__(self, doc: Doc) -> Doc:
        # Add the matched spans when doc is processed
        for _, start, end in self.matcher(doc):
            span = doc[start:end]
            acronym = DICTIONARY.get(span.text if self.case_sensitive else span.text.lower())
            doc._.acronyms.append((span, acronym))
        return doc
# Add the component to the pipeline and configure it
nlp = spacy.blank("en")
nlp.add_pipe("acronyms", config={"case_sensitive": False})
# Process a doc and see the results
doc = nlp("LOL, be right back")
print(doc._.acronyms)
Many stateful components depend on data resources like dictionaries and
lookup tables that should ideally be configurable. For example, it makes
sense to make the DICTIONARY an argument of the registered function, so the
AcronymComponent can be re-used with different data. One logical solution
would be to make it an argument of the component factory, and allow it to be
initialized with different dictionaries.
Example
Making the data an argument of the registered function would result in output like this in your config.cfg, which is typically not what you want (and only works for JSON-serializable data).
[components.acronyms.dictionary]
lol = "laughing out loud"
brb = "be right back"
However, passing in the dictionary directly is problematic, because it means
that if a component saves out its config and settings, the
config.cfg will include a dump of the entire data,
since that's the config the component was created with.
DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
- default_config = {"dictionary": DICTIONARY}
If what you're passing in isn't JSON-serializable – e.g. a custom object like a
model – saving out the component config becomes
impossible because there's no way for spaCy to know how that object was
created, and what to do to create it again. This makes it much harder to save,
load and train custom models with custom components. A simple solution is to
register a function that returns your resources. The
registry lets you map string names to functions
that create objects, so given a name and optional arguments, spaCy will know how
to recreate the object. To register a function that returns a custom asset, you
can use the @spacy.registry.assets decorator with a single argument, the name:
### Registered function for assets {highlight="1"}
@spacy.registry.assets("acronyms.slang_dict.v1")
def create_acronyms_slang_dict():
    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
    dictionary.update({value: key for key, value in dictionary.items()})
    return dictionary
In your default_config (and later in your
training config), you can now refer to the function
registered under the name "acronyms.slang_dict.v1" using the @assets key.
This tells spaCy how to create the value, and when your component is created,
the result of the registered function is passed in as the key "dictionary".
config.cfg
[components.acronyms]
factory = "acronyms"

[components.acronyms.dictionary]
@assets = "acronyms.slang_dict.v1"
- default_config = {"dictionary": DICTIONARY}
+ default_config = {"dictionary": {"@assets": "acronyms.slang_dict.v1"}}
Using a registered function also means that you can easily include your custom
components in models that you train. To make sure spaCy knows
where to find your custom @assets function, you can pass in a Python file via
the argument --code. If someone else is using your component, all they have to
do to customize the data is to register their own function and swap out the
name. Registered functions can also take arguments, which can be defined in the
config as well – you can read more about this in the docs on
training with custom code.
Python type hints and pydantic validation
spaCy's configs are powered by our machine learning library Thinc's
configuration system, which supports
type hints and even
advanced type annotations
using pydantic. If your component
factory provides type hints, the values that are passed in will be checked
against the expected types. For example, if an argument is annotated as int and
the value can't be cast to an integer, spaCy will raise an error. pydantic also
provides strict types like StrictInt, which will force the value to be an
integer and raise an error if it's not – for instance, if your config defines a
float.
If you're not using
strict types,
values that can be cast to the given type will still be accepted. For
example, 1 can be cast to a float or a bool type, but not to a
List[str]. However, if the type is
StrictFloat,
only a float will be accepted.
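The difference between lenient and strict types can be seen with pydantic on its own, outside of spaCy's config system. The model classes and the `batch_size` field below are hypothetical; the sketch only shows that a plain `int` annotation accepts a value that can be cast, while `StrictInt` rejects it.

```python
from pydantic import BaseModel, StrictInt, ValidationError


class LenientSettings(BaseModel):
    batch_size: int  # values that can be cast to int are accepted


class StrictSettings(BaseModel):
    batch_size: StrictInt  # only real integers are accepted


# The numeric string is cast to an integer
lenient = LenientSettings(batch_size="8")
print(lenient.batch_size)

# The same value is rejected by the strict type
try:
    StrictSettings(batch_size="8")
    strict_rejected = False
except ValidationError:
    strict_rejected = True
print(strict_rejected)
```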
The following example shows a custom pipeline component for debugging. It can be
added anywhere in the pipeline and logs information about the nlp object and
the Doc that passes through. The log_level config setting lets the user
customize what log statements are shown – for instance, "INFO" will show info
logs and more critical logging statements, whereas "DEBUG" will show
everything. The value is annotated as a StrictStr, so it will only accept a
string value.
✏️ Things to try
- Change the config passed to nlp.add_pipe to use the log level "INFO". You should see that only the statement logged with logger.info is shown.
- Change the config passed to nlp.add_pipe so that it contains unexpected values – for example, a boolean instead of a string: "log_level": False. You should see a validation error.
- Check out the docs on pydantic's constrained types and write a type hint for log_level that only accepts the exact string values "DEBUG", "INFO" or "CRITICAL".
### {executable="true"}
import spacy
from spacy.language import Language
from spacy.tokens import Doc
from pydantic import StrictStr
import logging
@Language.factory("debug", default_config={"log_level": "DEBUG"})
class DebugComponent:
    def __init__(self, nlp: Language, name: str, log_level: StrictStr):
        self.logger = logging.getLogger(f"spacy.{name}")
        self.logger.setLevel(log_level)
        self.logger.info(f"Pipeline: {nlp.pipe_names}")
    def __call__(self, doc: Doc) -> Doc:
        self.logger.debug(f"Doc: {len(doc)} tokens, is_tagged: {doc.is_tagged}")
        return doc
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("debug", config={"log_level": "DEBUG"})
doc = nlp("This is a text...")
Language-specific factories
There are many use cases where you might want your pipeline components to be
language-specific. Sometimes this requires an entirely different implementation
per language, sometimes the only difference is in the settings or data. spaCy allows
you to register factories of the same name on both the Language base
class, as well as its subclasses like English or German. Factories are
resolved starting with the specific subclass. If the subclass doesn't define a
component of that name, spaCy will check the Language base class.
Here's an example of a pipeline component that overwrites the normalized form of
a token, Token.norm_, with an entry from a language-specific lookup table.
It's registered twice under the name "token_normalizer" – once using
@English.factory and once using @German.factory:
### {executable="true"}
from spacy.lang.en import English
from spacy.lang.de import German
class TokenNormalizer:
    def __init__(self, norm_table):
        self.norm_table = norm_table
    def __call__(self, doc):
        for token in doc:
            # Overwrite the token.norm_ if there's an entry in the data
            token.norm_ = self.norm_table.get(token.text, token.norm_)
        return doc
@English.factory("token_normalizer")
def create_en_normalizer(nlp, name):
    return TokenNormalizer({"realise": "realize", "colour": "color"})
@German.factory("token_normalizer")
def create_de_normalizer(nlp, name):
    return TokenNormalizer({"daß": "dass", "wußte": "wusste"})
nlp_en = English()
nlp_en.add_pipe("token_normalizer")  # uses the English factory
print([token.norm_ for token in nlp_en("realise colour daß wußte")])
nlp_de = German()
nlp_de.add_pipe("token_normalizer")  # uses the German factory
print([token.norm_ for token in nlp_de("realise colour daß wußte")])
Under the hood, language-specific factories are added to the
factories registry prefixed with the language code,
e.g. "en.token_normalizer". When resolving the factory in
nlp.add_pipe, spaCy first checks for a
language-specific version of the factory using nlp.lang and if none is
available, falls back to looking up the regular factory name.
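The lookup order described above can be sketched without spaCy. The FACTORIES dict and resolve_factory function below are hypothetical stand-ins for illustration, not spaCy's registry or its actual implementation.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the factories registry: plain names for factories
# registered on Language, "<lang>.<name>" for language-specific ones.
FACTORIES: Dict[str, Callable[[], str]] = {
    "token_normalizer": lambda: "generic normalizer",
    "en.token_normalizer": lambda: "English normalizer",
    "de.token_normalizer": lambda: "German normalizer",
}


def resolve_factory(lang: str, name: str) -> Callable[[], str]:
    """Prefer the language-specific entry, then fall back to the plain name."""
    lang_specific = f"{lang}.{name}"
    if lang_specific in FACTORIES:
        return FACTORIES[lang_specific]
    if name in FACTORIES:
        return FACTORIES[name]
    raise KeyError(f"No factory named {name!r} for language {lang!r}")


print(resolve_factory("en", "token_normalizer")())  # English normalizer
print(resolve_factory("fr", "token_normalizer")())  # generic normalizer
```

Because the language-specific key is checked first, a subclass can override a component without affecting other languages.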
Trainable components
spaCy's Pipe class helps you implement your own trainable
components that have their own model instance, make predictions over Doc
objects and can be updated using spacy train. This lets you
plug fully custom machine learning components into your pipeline. You'll need
the following:
- Model: A Thinc Model instance. This can be a model using layers implemented in Thinc, or a wrapped model implemented in PyTorch, TensorFlow, MXNet or a fully custom solution. The model must take a list of Doc objects as input and can have any type of output.
- Pipe subclass: A subclass of Pipe that implements at least two methods: Pipe.predict and Pipe.set_annotations.
- Component factory: A component factory registered with @Language.factory that takes the nlp object and component name and optional settings provided by the config and returns an instance of your trainable component.
Example
from spacy.pipeline import Pipe
from spacy.language import Language
class TrainableComponent(Pipe):
    def predict(self, docs):
        ...
    def set_annotations(self, docs, scores):
        ...
@Language.factory("my_trainable_component")
def make_component(nlp, name, model):
    return TrainableComponent(nlp.vocab, model, name=name)
| Name | Description | 
|---|---|
| predict | Apply the component's model to a batch of Doc objects (without modifying them) and return the scores. | 
| set_annotations | Modify a batch of Doc objects, using pre-computed scores generated by predict. | 
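To illustrate how the two methods divide the work, here is a spaCy-free sketch. ToyDoc and LengthLabeler are hypothetical names, the "model" is just a word count, and the real Pipe class does considerably more than this.

```python
from typing import Dict, List

ToyDoc = Dict[str, object]  # hypothetical stand-in for spaCy's Doc


class LengthLabeler:
    """Toy component: labels a text "long" if it has more than 3 words."""

    def predict(self, docs: List[ToyDoc]) -> List[int]:
        # Compute scores only; the docs are left untouched
        return [len(str(doc["text"]).split()) for doc in docs]

    def set_annotations(self, docs: List[ToyDoc], scores: List[int]) -> None:
        # Write the pre-computed scores onto the docs
        for doc, score in zip(docs, scores):
            doc["label"] = "long" if score > 3 else "short"

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        # Processing a single doc chains the two steps
        scores = self.predict([doc])
        self.set_annotations([doc], scores)
        return doc


component = LengthLabeler()
print(component({"text": "This is a fairly long text"})["label"])  # long
```

Keeping prediction separate from annotation means the scores can be computed for a whole batch before any document is modified.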
By default, Pipe.__init__ takes the shared vocab, the
Model and the name of the component
instance in the pipeline, which you can use as a key in the losses. All other
keyword arguments will become available as Pipe.cfg and will
also be serialized with the component.
spaCy's config system resolves the config describing
the pipeline components and models bottom-up. This means that it will
first create a Model from a registered architecture,
validate its arguments and then pass the object forward to the component. As a
result, the config can express very complex, nested trees of objects – but
the objects don't have to pass the model settings all the way down to the
components. It also makes the components more modular and lets you swap
different architectures in your config, and re-use model definitions.
### config.cfg (excerpt)
[components]
[components.textcat]
factory = "textcat"
labels = []
# This function is created and then passed to the "textcat" component as
# the argument "model"
[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v1"
exclusive_classes = false
pretrained_vectors = null
width = 64
conv_depth = 2
embed_size = 2000
window_size = 1
ngram_size = 1
dropout = null
[components.other_textcat]
factory = "textcat"
# This references the [components.textcat.model] block above
model = ${components.textcat.model}
labels = []
Your trainable pipeline component factories should therefore always take a
model argument instead of instantiating the
Model inside the component. To register
custom architectures, you can use the
@spacy.registry.architectures decorator. Also see
the training guide for details.
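The bottom-up resolution can be illustrated with a small spaCy-free sketch. The two registries, the "toy.Linear.v1" architecture and the "toy_textcat" factory are all hypothetical, and the real config system also validates arguments, which is omitted here.

```python
from typing import Any, Callable, Dict

# Hypothetical registries standing in for spaCy's architectures and factories
ARCHITECTURES: Dict[str, Callable[..., Any]] = {
    "toy.Linear.v1": lambda width: {"kind": "linear", "width": width},
}
FACTORIES: Dict[str, Callable[..., Any]] = {
    "toy_textcat": lambda model, labels: {"model": model, "labels": labels},
}


def resolve(block: Dict[str, Any]) -> Any:
    """Resolve nested blocks first, then build the current block."""
    resolved = {
        key: resolve(value) if isinstance(value, dict) else value
        for key, value in block.items()
    }
    if "@architectures" in resolved:
        return ARCHITECTURES[resolved.pop("@architectures")](**resolved)
    if "factory" in resolved:
        return FACTORIES[resolved.pop("factory")](**resolved)
    return resolved


config = {
    "components": {
        "textcat": {
            "factory": "toy_textcat",
            "labels": [],
            "model": {"@architectures": "toy.Linear.v1", "width": 64},
        }
    }
}
pipeline = resolve(config)
print(pipeline["components"]["textcat"]["model"])  # the already-built model
```

By the time the factory is called, its model argument is a finished object, so the factory never sees the width setting.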
For some use cases, it makes sense to also overwrite additional methods to customize how the model is updated from examples, how it's initialized, how the loss is calculated and to add evaluation scores to the training output.
| Name | Description | 
|---|---|
| update | Learn from a batch of Example objects containing the predictions and gold-standard annotations, and update the component's model. | 
| begin_training | Initialize the model. Typically calls into Model.initialize and Pipe.create_optimizer if no optimizer is provided. | 
| get_loss | Return a tuple of the loss and the gradient for a batch of Example objects. | 
| score | Score a batch of Example objects and return a dictionary of scores. The @Language.factory decorator can define the default_score_weights of the component to decide which keys of the scores to display during training and how they count towards the final score. | 
Extension attributes
spaCy allows you to set any custom attributes and methods on the Doc, Span
and Token, which become available as Doc._, Span._ and Token._ – for
example, Token._.my_attr. This lets you store additional information relevant
to your application, add new features and functionality to spaCy, and implement
your own models trained with other machine learning libraries. It also lets you
take advantage of spaCy's data structures and the Doc object as the "single
source of truth".
Writing to a ._ attribute instead of to the Doc directly keeps a clearer
separation and makes it easier to ensure backwards compatibility. For example,
if you've implemented your own .coref property and spaCy claims it one day,
it'll break your code. Similarly, just by looking at the code, you'll
immediately know what's built-in and what's custom – for example,
doc.sentiment is spaCy, while doc._.sent_score isn't.
Extension definitions – the defaults, methods, getters and setters you pass in
to set_extension – are stored in class attributes on the Underscore class.
If you write to an extension attribute, e.g. doc._.hello = True, the data is
stored within the Doc.user_data dictionary. To keep the
underscore data separate from your other dictionary entries, the string "._."
is placed before the name, in a tuple.
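To see where the data ends up, the following sketch sets an extension value and then inspects Doc.user_data. It assumes spaCy is installed and uses a blank English pipeline, so no trained model is needed.

```python
import spacy
from spacy.tokens import Doc

# force=True only so the snippet can be re-run in the same session
Doc.set_extension("hello", default=False, force=True)

nlp = spacy.blank("en")
doc = nlp("This is a text")
doc._.hello = True

# The value is stored in Doc.user_data under a tuple key starting with "._."
for key, value in doc.user_data.items():
    print(key, value)
```

Your own entries in Doc.user_data can use any other keys without clashing with the underscore data.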
There are three main types of extensions, which can be defined using the
Doc.set_extension,
Span.set_extension and
Token.set_extension methods.
- Attribute extensions. Set a default value for an attribute, which can be overwritten manually at any time. Attribute extensions work like "normal" variables and are the quickest way to store arbitrary information on a Doc, Span or Token.

    Doc.set_extension("hello", default=True)
    assert doc._.hello
    doc._.hello = False
- Property extensions. Define a getter and an optional setter function. If no setter is provided, the extension is immutable. Since the getter and setter functions are only called when you retrieve the attribute, you can also access values of previously added attribute extensions. For example, a Doc getter can average over Token attributes. For Span extensions, you'll almost always want to use a property – otherwise, you'd have to write to every possible Span in the Doc to set up the values correctly.

    Doc.set_extension("hello", getter=get_hello_value, setter=set_hello_value)
    assert doc._.hello
    doc._.hello = "Hi!"
- Method extensions. Assign a function that becomes available as an object method. Method extensions are always immutable. For more details and implementation ideas, see these examples.

    Doc.set_extension("hello", method=lambda doc, name: f"Hi {name}!")
    assert doc._.hello("Bob") == "Hi Bob!"
Before you can access a custom extension, you need to register it using the
set_extension method on the object you want to add it to, e.g. the Doc. Keep
in mind that extensions are always added globally and not just on a
particular instance. If an attribute of the same name already exists, or if
you're trying to access an attribute that hasn't been registered, spaCy will
raise an AttributeError.
### Example
from spacy.tokens import Doc, Span, Token
fruits = ["apple", "pear", "banana", "orange", "strawberry"]
is_fruit_getter = lambda token: token.text in fruits
has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])
Token.set_extension("is_fruit", getter=is_fruit_getter)
Doc.set_extension("has_fruit", getter=has_fruit_getter)
Span.set_extension("has_fruit", getter=has_fruit_getter)
Usage example
doc = nlp("I have an apple and a melon")
assert doc[3]._.is_fruit  # get Token attributes
assert not doc[0]._.is_fruit
assert doc._.has_fruit  # get Doc attributes
assert doc[1:4]._.has_fruit  # get Span attributes
Once you've registered your custom attribute, you can also use the built-in
set, get and has methods to modify and retrieve the attributes. This is
especially useful if you want to pass in a string instead of calling
doc._.my_attr.
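As a short sketch of the string-based access, the snippet below reads the attribute name from a variable. It assumes spaCy is installed; the attribute name and its values are made up for illustration.

```python
import spacy
from spacy.tokens import Doc

# force=True only so the snippet can be re-run in the same session
Doc.set_extension("my_attr", default="unset", force=True)

nlp = spacy.blank("en")
doc = nlp("This is a text")

attr_name = "my_attr"  # e.g. a name that comes from a config or an argument
if doc._.has(attr_name):
    doc._.set(attr_name, "custom value")
print(doc._.get(attr_name))
```

This is handy in components that let the user choose the attribute name, because the name is only known at runtime.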
Example: Pipeline component for GPE entities and country meta data via a REST API
This example shows the implementation of a pipeline component that fetches
country meta data via the REST Countries API, sets
entity annotations for countries, merges entities into one token and sets custom
attributes on the Doc, Span and Token – for example, the capital,
latitude/longitude coordinates and even the country flag.
### {executable="true"}
import requests
from spacy.lang.en import English
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token
@Language.factory("rest_countries")
class RESTCountriesComponent:
    def __init__(self, nlp, name, label="GPE"):
        r = requests.get("https://restcountries.eu/rest/v2/all")
        r.raise_for_status()  # make sure requests raises an error if it fails
        countries = r.json()
        # Convert API response to dict keyed by country name for easy lookup
        self.countries = {c["name"]: c for c in countries}
        self.label = label
        # Set up the PhraseMatcher with Doc patterns for each country name
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add("COUNTRIES", [nlp.make_doc(c) for c in self.countries.keys()])
        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension("is_country", default=False)
        Token.set_extension("country_capital", default=False)
        Token.set_extension("country_latlng", default=False)
        Token.set_extension("country_flag", default=False)
        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_country == True.
        Doc.set_extension("has_country", getter=self.has_country)
        Span.set_extension("has_country", getter=self.has_country)
    def __call__(self, doc):
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in self.matcher(doc):
            # Generate Span representing the entity & set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            # Can be extended with other data returned by the API, like
            # currencies, country code, flag, calling code etc.
            for token in entity:
                token._.set("is_country", True)
                token._.set("country_capital", self.countries[entity.text]["capital"])
                token._.set("country_latlng", self.countries[entity.text]["latlng"])
                token._.set("country_flag", self.countries[entity.text]["flag"])
        # Iterate over all spans and merge them into one token
        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span)
        # Overwrite doc.ents and add entity – be careful not to replace!
        doc.ents = list(doc.ents) + spans
        return doc  # don't forget to return the Doc!
    def has_country(self, tokens):
        """Getter for Doc and Span attributes. Since the getter is only called
        when we access the attribute, we can refer to the Token's 'is_country'
        attribute here, which is already set in the processing step."""
        return any([t._.get("is_country") for t in tokens])
nlp = English()
nlp.add_pipe("rest_countries", config={"label": "GPE"})
doc = nlp("Some text about Colombia and the Czech Republic")
print("Pipeline", nlp.pipe_names)  # pipeline contains component name
print("Doc has countries", doc._.has_country)  # Doc contains countries
for token in doc:
    if token._.is_country:
        print(token.text, token._.country_capital, token._.country_latlng, token._.country_flag)
print("Entities", [(e.text, e.label_) for e in doc.ents])
In this case, all data can be fetched on initialization in one request. However,
if you're working with text that contains incomplete country names, spelling
mistakes or foreign-language versions, you could also implement a
like_country-style getter function that makes a request to the search API
endpoint and returns the best-matching result.
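As a rough sketch of the best-match idea, the function below does the fuzzy lookup locally with Python's difflib instead of calling a search endpoint. The COUNTRY_NAMES list, the like_country name and the cutoff value are assumptions for illustration.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical local list; the component above fetches the names from the API
COUNTRY_NAMES = ["Colombia", "Czech Republic", "Germany", "Netherlands"]


def like_country(text: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the closest known country name, or None if nothing is close."""
    matches = get_close_matches(text, COUNTRY_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(like_country("Columbia"))  # Colombia
print(like_country("Germny"))    # Germany
print(like_country("apple"))     # None
```

A function like this could be registered as a getter with Token.set_extension or Span.set_extension after adapting it to take the token or span and pass its text. A remote search would trade the local list for better coverage at the cost of one request per lookup.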
User hooks
While it's generally recommended to use the Doc._, Span._ and Token._
proxies to add your own custom attributes, spaCy offers a few exceptions to
allow customizing the built-in methods like
Doc.similarity or Doc.vector with
your own hooks, which can rely on statistical models you train yourself. For
instance, you can provide your own on-the-fly sentence segmentation algorithm or
document similarity method.
Hooks let you customize some of the behaviors of the Doc, Span or Token
objects by adding a component to the pipeline. For instance, to customize the
Doc.similarity method, you can add a component that
sets a custom function to doc.user_hooks["similarity"]. The built-in
Doc.similarity method will check the user_hooks dict, and delegate to your
function if you've set one. Similar results can be achieved by setting functions
to Doc.user_span_hooks and Doc.user_token_hooks.
Implementation note
The hooks live on the Doc object because the Span and Token objects are created lazily, and don't own any data. They just proxy to their parent Doc. This turns out to be convenient here – we only have to worry about installing hooks in one place.
| Name | Customizes | 
|---|---|
| user_hooks | Doc.similarity, Doc.vector, Doc.has_vector, Doc.vector_norm, Doc.sents | 
| user_token_hooks | Token.similarity, Token.vector, Token.has_vector, Token.vector_norm, Token.conjuncts | 
| user_span_hooks | Span.similarity, Span.vector, Span.has_vector, Span.vector_norm, Span.root | 
### Add custom similarity hooks
class SimilarityModel:
    def __init__(self, model):
        self._model = model
    def __call__(self, doc):
        doc.user_hooks["similarity"] = self.similarity
        doc.user_span_hooks["similarity"] = self.similarity
        doc.user_token_hooks["similarity"] = self.similarity
        return doc  # a pipeline component must return the Doc
    def similarity(self, obj1, obj2):
        y = self._model([obj1.vector, obj2.vector])
        return float(y[0])
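For a version that runs without a trained model, here is a minimal sketch that installs a toy similarity hook based on word overlap. It assumes spaCy is installed, and it calls the hook installer directly instead of going through nlp.add_pipe to keep the example short.

```python
import spacy


def jaccard_similarity(obj1, obj2) -> float:
    """Toy similarity: overlap of lowercased token texts, no vectors needed."""
    words1 = {token.lower_ for token in obj1}
    words2 = {token.lower_ for token in obj2}
    if not words1 and not words2:
        return 1.0
    return len(words1 & words2) / len(words1 | words2)


def install_similarity_hook(doc):
    # Doc.similarity delegates to this function once the hook is set
    doc.user_hooks["similarity"] = jaccard_similarity
    return doc  # a pipeline component must return the Doc


nlp = spacy.blank("en")
doc1 = install_similarity_hook(nlp("I like apples"))
doc2 = nlp("I like oranges")
print(doc1.similarity(doc2))  # 2 shared words out of 4 distinct ones
```

Only the Doc that similarity is called on needs the hook, since the built-in method looks it up on that object.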
Developing plugins and wrappers
We're very excited about all the new possibilities for community extensions and plugins in spaCy, and we can't wait to see what you build with it! To get you started, here are a few tips, tricks and best practices. See here for examples of other spaCy extensions.
Usage ideas
- Adding new features and hooking in models. For example, a sentiment analysis model, or your preferred solution for lemmatization. spaCy's built-in tagger, parser and entity recognizer respect annotations that were already set on the Doc in a previous step of the pipeline.
- Integrating other libraries and APIs. For example, your pipeline component can write additional information and data directly to the Doc or Token as custom attributes, while making sure no information is lost in the process. This can be output generated by other libraries and models, or an external service with a REST API.
- Debugging and logging. For example, a component that stores and/or exports relevant information about the current state of the processed document, which you can insert at any point of your pipeline.
Best practices
Extensions can claim their own ._ namespace and exist as standalone packages.
If you're developing a tool or library and want to make it easy for others to
use it with spaCy and add it to their pipeline, all you have to do is expose a
function that takes a Doc, modifies it and returns it.
- Make sure to choose a descriptive and specific name for your pipeline component class, and set it as its name attribute. Avoid names that are too common or likely to clash with built-in or a user's other custom components. While it's fine to call your package "spacy_my_extension", avoid component names including "spacy", since this can easily lead to confusion.

    + name = "myapp_lemmatizer"
    - name = "lemmatizer"
- When writing to Doc, Token or Span objects, use getter functions wherever possible, and avoid setting values explicitly. Tokens and spans don't own any data themselves, and they're implemented as C extension classes – so you can't usually add new attributes to them like you could with most pure Python objects.

    + is_fruit = lambda token: token.text in ("apple", "orange")
    + Token.set_extension("is_fruit", getter=is_fruit)
    - token._.set_extension("is_fruit", default=False)
    - if token.text in ("apple", "orange"):
    -     token._.set("is_fruit", True)
- Always add your custom attributes to the global Doc, Token or Span objects, not a particular instance of them. Add the attributes as early as possible, e.g. in your extension's __init__ method or in the global scope of your module. This means that in the case of namespace collisions, the user will see an error immediately, not just when they run their pipeline.

    + from spacy.tokens import Doc
    + def __init__(self, attr="my_attr"):
    +     Doc.set_extension(attr, getter=self.get_doc_attr)
    - def __call__(self, doc):
    -     doc.set_extension("my_attr", getter=self.get_doc_attr)
- If your extension is setting properties on the Doc, Token or Span, include an option to let the user change those attribute names. This makes it easier to avoid namespace collisions and accommodate users with different naming preferences. We recommend adding an attrs argument to the __init__ method of your class so you can write the names to class attributes and reuse them across your component.

    + Doc.set_extension(self.doc_attr, default="some value")
    - Doc.set_extension("my_doc_attr", default="some value")
- Ideally, extensions should be standalone packages with spaCy and optionally, other packages specified as a dependency. They can freely assign to their own ._ namespace, but should stick to that. If your extension's only job is to provide a better .similarity implementation, and your docs state this explicitly, there's no problem with writing to the user_hooks and overwriting spaCy's built-in method. However, a third-party extension should never silently overwrite built-ins, or attributes set by other extensions.
- If you're looking to publish a model that depends on a custom pipeline component, you can either require it in the model package's dependencies, or – if the component is specific and lightweight – choose to ship it with your model package. Just make sure the @Language.component or @Language.factory decorator that registers the custom component runs in your model's __init__.py or is exposed via an entry point.
- Once you're ready to share your extension with others, make sure to add docs and installation instructions (you can always link to this page for more info). Make it easy for others to install and use your extension, for example by uploading it to PyPI. If you're sharing your code on GitHub, don't forget to tag it with spacy and spacy-extension to help people find it. If you post it on Twitter, feel free to tag @spacy_io so we can check it out.
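Several of the practices above can be combined in one small sketch: a global extension registered in __init__, a getter instead of explicitly set values, and a user-configurable attribute name. It assumes spaCy is installed; the WordCounter class and its attribute names are hypothetical.

```python
import spacy
from spacy.tokens import Doc


class WordCounter:
    """Hypothetical component that exposes a word count on the Doc."""

    def __init__(self, attr: str = "myapp_word_count"):
        self.attr = attr
        # Registered globally as soon as the component is created, so a
        # name collision surfaces here and not while processing texts
        Doc.set_extension(self.attr, getter=self.count_words)

    def count_words(self, doc) -> int:
        return sum(1 for token in doc if token.is_alpha)

    def __call__(self, doc):
        return doc  # nothing to precompute: the getter runs on access


nlp = spacy.blank("en")
counter = WordCounter(attr="n_words")  # the user picks the attribute name
doc = counter(nlp("Hello, world!"))
print(doc._.n_words)
```

The component is called directly here to keep the sketch short. In a real package it would be registered with @Language.factory so users can add it with nlp.add_pipe.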
Wrapping other models and libraries
Let's say you have a custom entity recognizer that takes a list of strings and
returns their BILUO tags. Given an
input like ["A", "text", "about", "Facebook"], it will predict and return
["O", "O", "O", "U-ORG"]. To integrate it into your spaCy pipeline and make it
add those entities to the doc.ents, you can wrap it in a custom pipeline
component function and pass it the token texts from the Doc object received by
the component.
The gold.spans_from_biluo_tags function is very
helpful here, because it takes a Doc object and token-based BILUO tags and
returns a sequence of Span objects in the Doc with added labels. So all your
wrapper has to do is compute the entity spans and overwrite the doc.ents.
How the doc.ents work
When you add spans to the doc.ents, spaCy will automatically resolve them back to the underlying tokens and set the Token.ent_type and Token.ent_iob attributes. By definition, each token can only be part of one entity, so overlapping entity spans are not allowed.
### {highlight="1,8-9"}
import your_custom_entity_recognizer
from spacy.gold import spans_from_biluo_tags
from spacy.language import Language
@Language.component("custom_ner_wrapper")
def custom_ner_wrapper(doc):
    words = [token.text for token in doc]
    custom_entities = your_custom_entity_recognizer(words)
    doc.ents = spans_from_biluo_tags(doc, custom_entities)
    return doc
The custom_ner_wrapper can then be added to the pipeline of a blank model
using nlp.add_pipe. You can also replace the
existing entity recognizer of a pretrained model with
nlp.replace_pipe.
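To make the BILUO scheme concrete, here is a spaCy-free sketch of how token-level tags map to entity spans. The biluo_to_spans function is a simplified illustration, not spaCy's implementation, and it skips validation of malformed tag sequences.

```python
from typing import List, Optional, Tuple


def biluo_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BILUO tags to (start, end, label) token spans, end exclusive."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":  # Unit: a single-token entity
            spans.append((i, i + 1, label))
        elif prefix == "B":  # Begin: remember where the entity starts
            start = i
        elif prefix == "L" and start is not None:  # Last: close the entity
            spans.append((start, i + 1, label))
            start = None
        # "I" (Inside) continues an open entity, "O" (Outside) adds nothing
    return spans


print(biluo_to_spans(["O", "O", "O", "U-ORG"]))
# [(3, 4, 'ORG')]
print(biluo_to_spans(["O", "O", "B-GPE", "I-GPE", "L-GPE"]))
# [(2, 5, 'GPE')]
```

The token offsets produced here correspond to the start and end positions of the Span objects that end up in doc.ents.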
Here's another example of a custom model, your_custom_model, that takes a list
of tokens and returns lists of fine-grained part-of-speech tags, coarse-grained
part-of-speech tags, dependency labels and head token indices. Here, we can use
the Doc.from_array method to create a new Doc object using
those values. To create a numpy array we need integers, so we can look up the
string labels in the StringStore. The
doc.vocab.strings.add method comes in handy here,
because it returns the integer ID of the string and makes sure it's added to
the vocab. This is especially important if the custom model uses a different
label scheme than spaCy's default models.
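The string-to-integer lookup can be tried on a standalone StringStore. This sketch assumes spaCy is installed, and "MY_TAG" is a made-up label standing in for a custom label scheme.

```python
from spacy.strings import StringStore

strings = StringStore()
noun_id = strings.add("NOUN")      # returns the integer ID of the string
custom_id = strings.add("MY_TAG")  # hypothetical label from a custom scheme

print(strings[noun_id])            # the ID maps back to the string
print("MY_TAG" in strings)         # the new label is now in the store
```

Adding the same string again returns the same ID, so it's safe to call add for every label your model produces.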
Example: spacy-stanza
For an example of an end-to-end wrapper for statistical tokenization, tagging and parsing, check out spacy-stanza. It uses a very similar approach to the example in this section – the only difference is that it fully replaces the nlp object instead of providing a pipeline component, since it also needs to handle tokenization.
### {highlight="1,11,17-19"}
import your_custom_model
from spacy.language import Language
from spacy.symbols import POS, TAG, DEP, HEAD
from spacy.tokens import Doc
import numpy
@Language.component("custom_model_wrapper")
def custom_model_wrapper(doc):
    words = [token.text for token in doc]
    spaces = [bool(token.whitespace_) for token in doc]
    pos, tags, deps, heads = your_custom_model(words)
    # Convert the strings to integers and add them to the string store
    pos = [doc.vocab.strings.add(label) for label in pos]
    tags = [doc.vocab.strings.add(label) for label in tags]
    deps = [doc.vocab.strings.add(label) for label in deps]
    # Create a new Doc from a numpy array
    attrs = [POS, TAG, DEP, HEAD]
    arr = numpy.array(list(zip(pos, tags, deps, heads)), dtype="uint64")
    new_doc = Doc(doc.vocab, words=words, spaces=spaces).from_array(attrs, arr)
    return new_doc
If you create a Doc object with dependencies and heads, spaCy is able to
resolve the sentence boundaries automatically. However, note that the HEAD
value used to construct a Doc is the token index relative to the current
token – e.g. -1 for the previous token. The CoNLL format typically annotates
heads as 1-indexed absolute indices with 0 indicating the root. If that's
the case in your annotations, you need to convert them first:
heads = [2, 0, 4, 2, 2]
new_heads = [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]
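As a quick check of the conversion, the following plain-Python sketch wraps it in a function and converts the result back to absolute positions. The function names are made up for illustration.

```python
from typing import List


def conll_to_relative(heads: List[int]) -> List[int]:
    """1-indexed absolute heads (0 = root) to offsets relative to each token."""
    return [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]


def relative_to_absolute(rel_heads: List[int]) -> List[int]:
    """Relative offsets back to 0-indexed absolute head positions."""
    return [i + offset for i, offset in enumerate(rel_heads)]


heads = [2, 0, 4, 2, 2]
rel_heads = conll_to_relative(heads)
print(rel_heads)                        # [1, 0, 1, -2, -3]
print(relative_to_absolute(rel_heads))  # [1, 1, 3, 1, 1]
```

The root token gets an offset of 0, so in the 0-indexed absolute form it points to itself (token 1 here).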
For more details on how to write and package custom components, make them available to spaCy via entry points and implement your own serialization methods, check out the usage guide on saving and loading.