Top-level Functions
spaCy
spacy.load
Load a model using the name of an installed model package, a string path or a Path-like object. spaCy will try resolving the load argument in this order. If a model is loaded from a model name, spaCy will assume it's a Python package and import it and call the model's own load() method. If a model is loaded from a path, spaCy will assume it's a data directory, load its config.cfg and use the language and pipeline information to construct the Language class. The data will be loaded in via Language.from_disk.
Example
import spacy
from pathlib import Path

nlp = spacy.load("en_core_web_sm")  # package
nlp = spacy.load("/path/to/en")  # string path
nlp = spacy.load(Path("/path/to/en"))  # pathlib Path
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
Name | Description |
---|---|
name | Model to load, i.e. package name or path. |
keyword-only | |
disable | Names of pipeline components to disable. |
config (v3) | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. "components.name.value". |
RETURNS | A Language object with the loaded model. |
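The two accepted shapes for config overrides describe the same data. As a rough illustration (not spaCy's own code), the sketch below expands dot-notation keys into the equivalent nested dict. The helper name dot_to_nested and the override value are hypothetical.

```python
from typing import Any, Dict


def dot_to_nested(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand keys like "components.name.value" into nested dicts."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Walk down, creating intermediate sections as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Hypothetical override: both spellings describe the same setting.
flat_overrides = {"components.name.value": 1}
nested_overrides = {"components": {"name": {"value": 1}}}
assert dot_to_nested(flat_overrides) == nested_overrides
```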
Essentially, spacy.load() is a convenience wrapper that reads the model's config.cfg, uses the language and pipeline information to construct a Language object, loads in the model data and returns it.
### Abstract example
cls = util.get_lang_class(lang) # get language for ID, e.g. "en"
nlp = cls() # initialize the language
for name in pipeline:
    nlp.add_pipe(name) # add component to pipeline
nlp.from_disk(model_data_path) # load in model data
spacy.blank
Create a blank model of a given language class. This function is the twin of spacy.load().
Example
nlp_en = spacy.blank("en")  # equivalent to English()
nlp_de = spacy.blank("de")  # equivalent to German()
Name | Description |
---|---|
name | ISO code of the language class to load. |
RETURNS | An empty Language object of the appropriate subclass. |
spacy.info
The same as the info command. Pretty-print information about your installation, models and local setup from within spaCy. To get the model meta data as a dictionary instead, you can use the meta attribute on your nlp object with a loaded model, e.g. nlp.meta.
Example
spacy.info()
spacy.info("en_core_web_sm")
markdown = spacy.info(markdown=True, silent=True)
Name | Description |
---|---|
model | A model, i.e. a package name or path (optional). |
keyword-only | |
markdown | Print information as Markdown. |
silent | Don't print anything, just return. |
spacy.explain
Get a description for a given POS tag, dependency label or entity type. For a list of available terms, see glossary.py.
Example
spacy.explain("NORP")
# Nationalities or religious or political groups

doc = nlp("Hello world")
for word in doc:
    print(word.text, word.tag_, spacy.explain(word.tag_))
# Hello UH interjection
# world NN noun, singular or mass
Name | Description |
---|---|
term | Term to explain. |
RETURNS | The explanation, or None if not found in the glossary. |
spacy.prefer_gpu
Allocate data and perform operations on GPU, if available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any models.
Example
import spacy

activated = spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")
Name | Description |
---|---|
RETURNS | Whether the GPU was activated. |
spacy.require_gpu
Allocate data and perform operations on GPU. Will raise an error if no GPU is available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any models.
Example
import spacy

spacy.require_gpu()
nlp = spacy.load("en_core_web_sm")
Name | Description |
---|---|
RETURNS | True |
displaCy
As of v2.0, spaCy comes with a built-in visualization suite. For more info and examples, see the usage guide on visualizing spaCy.
displacy.serve
Serve a dependency parse tree or named entity visualization to view it in your browser. Will run a simple web server.
Example
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc1 = nlp("This is a sentence.")
doc2 = nlp("This is another sentence.")
displacy.serve([doc1, doc2], style="dep")
Name | Description |
---|---|
docs | Document(s) or span(s) to visualize. |
style | Visualization style, "dep" or "ent". Defaults to "dep". |
page | Render markup as full HTML page. Defaults to True. |
minify | Minify HTML markup. Defaults to False. |
options | Visualizer-specific options, e.g. colors. |
manual | Don't parse Doc and instead, expect a dict or list of dicts. See here for formats and examples. Defaults to False. |
port | Port to serve visualization. Defaults to 5000. |
host | Host to serve visualization. Defaults to "0.0.0.0". |
displacy.render
Render a dependency parse tree or named entity visualization.
Example
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
html = displacy.render(doc, style="dep")
Name | Description |
---|---|
docs | Document(s) or span(s) to visualize. |
style | Visualization style, "dep" or "ent". Defaults to "dep". |
page | Render markup as full HTML page. Defaults to True. |
minify | Minify HTML markup. Defaults to False. |
options | Visualizer-specific options, e.g. colors. |
manual | Don't parse Doc and instead, expect a dict or list of dicts. See here for formats and examples. Defaults to False. |
jupyter | Explicitly enable or disable "Jupyter mode" to return markup ready to be rendered in a notebook. Detected automatically if None (default). |
RETURNS | The rendered HTML markup. |
Visualizer options
The options argument lets you specify additional settings for each visualizer. If a setting is not present in the options, the default value will be used.
Dependency Visualizer options
Example
options = {"compact": True, "color": "blue"}
displacy.serve(doc, style="dep", options=options)
Name | Description |
---|---|
fine_grained | Use fine-grained part-of-speech tags (Token.tag_) instead of coarse-grained tags (Token.pos_). Defaults to False. |
add_lemma (v2.2.4) | Print the lemmas in a separate row below the token texts. Defaults to False. |
collapse_punct | Attach punctuation to tokens. Can make the parse more readable, as it avoids long arcs that only attach punctuation. Defaults to True. |
collapse_phrases | Merge noun phrases into one token. Defaults to False. |
compact | "Compact mode" with square arrows that takes up less space. Defaults to False. |
color | Text color (HEX, RGB or color names). Defaults to "#000000". |
bg | Background color (HEX, RGB or color names). Defaults to "#ffffff". |
font | Font name or font family for all text. Defaults to "Arial". |
offset_x | Spacing on left side of the SVG in px. Defaults to 50. |
arrow_stroke | Width of arrow path in px. Defaults to 2. |
arrow_width | Width of arrow head in px. Defaults to 10 in regular mode and 8 in compact mode. |
arrow_spacing | Spacing between arrows in px to avoid overlaps. Defaults to 20 in regular mode and 12 in compact mode. |
word_spacing | Vertical spacing between words and arcs in px. Defaults to 45. |
distance | Distance between words in px. Defaults to 175 in regular mode and 150 in compact mode. |
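To make the fallback behavior concrete, here is a minimal sketch of how user options could be merged with the defaults from the table above, including the settings whose default depends on compact mode. It is an illustration only, not displaCy's implementation; resolve_options, DEFAULTS and MODE_DEFAULTS are hypothetical names, and only a subset of the settings is covered.

```python
from typing import Any, Dict, Tuple

# Fixed defaults, copied from the table above (a subset).
DEFAULTS: Dict[str, Any] = {
    "compact": False,
    "color": "#000000",
    "bg": "#ffffff",
    "font": "Arial",
}
# Settings whose default depends on the mode: (regular, compact).
MODE_DEFAULTS: Dict[str, Tuple[int, int]] = {
    "arrow_width": (10, 8),
    "arrow_spacing": (20, 12),
    "distance": (175, 150),
}


def resolve_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in every setting the user did not provide."""
    resolved = {**DEFAULTS, **options}  # user values win over defaults
    for key, (regular, compact) in MODE_DEFAULTS.items():
        if key not in options:
            resolved[key] = compact if resolved["compact"] else regular
    return resolved


resolved = resolve_options({"compact": True, "color": "blue"})
print(resolved["color"], resolved["font"], resolved["distance"])
# blue Arial 150
```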
Named Entity Visualizer options
Example
options = {"ents": ["PERSON", "ORG", "PRODUCT"], "colors": {"ORG": "yellow"}}
displacy.serve(doc, style="ent", options=options)
Name | Description |
---|---|
ents | Entity types to highlight or None for all types (default). |
colors | Color overrides. Entity types in uppercase should be mapped to color names or values. |
template (v2.2) | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use {bg}, {text} and {label}. See templates.py for examples. |
By default, displaCy comes with colors for all entity types used by spaCy models. If you're using custom entity types, you can use the colors setting to add your own colors for them. Your application or model package can also expose a spacy_displacy_colors entry point to add custom labels and their colors automatically.
registry
spaCy's function registry extends Thinc's registry and allows you to map strings to functions. You can register functions to create architectures, optimizers, schedules and more, and then refer to them and set their arguments in your config file. Python type hints are used to validate the inputs. See the Thinc docs for details on the registry methods and our helper library catalogue for some background on the concept of function registries. spaCy also uses the function registry for language subclasses, model architecture, lookups and pipeline component factories.
Example
import spacy
from thinc.api import Model

@spacy.registry.architectures("CustomNER.v1")
def custom_ner(nO: int) -> Model:
    # forward is the model's forward function, defined elsewhere
    return Model("custom", forward, dims={"nO": nO})
Registry name | Description |
---|---|
architectures | Registry for functions that create model architectures. Can be used to register custom model architectures and reference them in the config.cfg. |
factories | Registry for functions that create pipeline components. Added automatically when you use the @spacy.component decorator and also reads from entry points. |
tokenizers | Registry for tokenizer factories. Registered functions should return a callback that receives the nlp object and returns a Tokenizer or a custom callable. |
languages | Registry for language-specific Language subclasses. Automatically reads from entry points. |
lookups | Registry for large lookup tables available via vocab.lookups. |
displacy_colors | Registry for custom color scheme for the displacy NER visualizer. Automatically reads from entry points. |
assets | Registry for data assets, knowledge bases etc. |
callbacks | Registry for custom callbacks to modify the nlp object before training. |
readers | Registry for training and evaluation data readers like Corpus. |
batchers | Registry for training and evaluation data batchers. |
optimizers | Registry for functions that create optimizers. |
schedules | Registry for functions that create schedules. |
layers | Registry for functions that create layers. |
losses | Registry for functions that create losses. |
initializers | Registry for functions that create initializers. |
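The idea of mapping strings to functions can be sketched in a few lines of plain Python. This is a toy illustration of the concept, not the API of spaCy, Thinc or catalogue; the Registry class, the "constant.v1" name and the resolve helper are all hypothetical.

```python
from typing import Any, Callable, Dict


class Registry:
    """A tiny string-to-function registry, for illustration only."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
            self._functions[name] = func
            return func

        return decorator

    def get(self, name: str) -> Callable[..., Any]:
        return self._functions[name]  # raises KeyError for unknown names


schedules = Registry()  # hypothetical registry instance


@schedules.register("constant.v1")
def constant(rate: float) -> float:
    return rate


def resolve(block: Dict[str, Any]) -> Any:
    """Look up the function named under "@schedules" and call it with the rest."""
    args = dict(block)
    func = schedules.get(args.pop("@schedules"))
    return func(**args)


# A config block refers to the function by its registered string name.
print(resolve({"@schedules": "constant.v1", "rate": 0.001}))
# 0.001
```

Keeping the lookup behind a string name is what lets a config file select and parameterize a function without importing it directly.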
spacy-transformers registry
The following registries are added by the spacy-transformers package. See the Transformer API reference and usage docs for details.
Example
import spacy_transformers

@spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
def configure_custom_annotation_setter():
    def annotation_setter(docs, trf_data) -> None:
        # Set annotations on the docs
        ...

    return annotation_setter
Registry name | Description |
---|---|
span_getters | Registry for functions that take a batch of Doc objects and return a list of Span objects to process by the transformer, e.g. sentences. |
annotation_setters | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of Doc objects and a FullTransformerBatch and can set additional annotations on the Doc. |
Batchers
A batcher implements a batching strategy that essentially turns a stream of items into a stream of batches, with each batch consisting of one item or a list of items. During training, the models update their weights after processing one batch at a time. Typical batching strategies include presenting the training data as a stream of batches with similar sizes, or with increasing batch sizes. See the Thinc documentation on schedules for a few standard examples.
Instead of using one of the built-in batchers listed here, you can also implement your own, which may or may not use a custom schedule.
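As an illustration of an "increasing batch sizes" strategy, the sketch below generates sizes that grow by a constant factor until they reach a cap. It is modeled on the idea behind a compounding schedule but is a standalone approximation, not Thinc's code; compounding_sizes is a hypothetical name, and it assumes compound > 1 and start <= stop.

```python
from itertools import islice
from typing import Iterator


def compounding_sizes(start: float, stop: float, compound: float) -> Iterator[float]:
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    size = float(start)
    while True:
        yield min(size, stop)
        size *= compound


print(list(islice(compounding_sizes(1.0, 8.0, 2.0), 6)))
# [1.0, 2.0, 4.0, 8.0, 8.0, 8.0]
```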
batch_by_words.v1
Create minibatches of roughly a given number of words. If any examples are longer than the specified batch length, they will appear in a batch by themselves, or be discarded if discard_oversize is set to True. The argument seqs can be a list of strings, Doc objects or Example objects.
Example config
[training.batcher]
@batchers = "batch_by_words.v1"
size = 100
tolerance = 0.2
discard_oversize = false
get_length = null
Name | Description |
---|---|
seqs | The sequences to minibatch. |
size | The target number of words per batch. Can also be a block referencing a schedule, e.g. compounding. |
tolerance | What percentage of the size to allow batches to exceed. |
discard_oversize | Whether to discard sequences that by themselves exceed the tolerated size. |
get_length | Optional function that receives a sequence item and returns its length. Defaults to the built-in len() if not set. |
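The interplay of size, tolerance and discard_oversize can be sketched with a greedy loop. This is a simplified approximation written for illustration and does not reproduce spaCy's exact algorithm; batch_by_words_sketch is a hypothetical name, size is assumed to be a plain integer, and the toy documents are lists of words.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def batch_by_words_sketch(
    seqs: Iterable[Sequence],
    size: int,
    tolerance: float = 0.2,
    discard_oversize: bool = False,
    get_length: Callable[[Sequence], int] = len,
) -> Iterator[List[Sequence]]:
    max_words = size + size * tolerance  # target size plus tolerated excess
    batch: List[Sequence] = []
    n_words = 0
    for seq in seqs:
        length = get_length(seq)
        if length > max_words:
            # Oversized on its own: drop it, or emit it as a one-item batch.
            if not discard_oversize:
                if batch:
                    yield batch
                    batch, n_words = [], 0
                yield [seq]
            continue
        if batch and n_words + length > max_words:
            # Adding this item would exceed the tolerated size: close the batch.
            yield batch
            batch, n_words = [], 0
        batch.append(seq)
        n_words += length
    if batch:
        yield batch


docs = [["w"] * n for n in (3, 3, 10, 2)]  # toy "documents" as word lists
for discard in (False, True):
    batches = batch_by_words_sketch(docs, size=5, discard_oversize=discard)
    print([[len(doc) for doc in batch] for batch in batches])
# [[3, 3], [10], [2]]
# [[3, 3], [2]]
```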
batch_by_sequence.v1
Example config
[training.batcher]
@batchers = "batch_by_sequence.v1"
size = 32
get_length = null
Create a batcher that creates batches of the specified size.
Name | Description |
---|---|
size | The target number of items per batch. Can also be a block referencing a schedule, e.g. compounding. |
get_length | Optional function that receives a sequence item and returns its length. Defaults to the built-in len() if not set. |
batch_by_padded.v1
Example config
[training.batcher]
@batchers = "batch_by_padded.v1"
size = 100
buffer = 256
discard_oversize = false
get_length = null
Minibatch a sequence by the size of padded batches that would result, with sequences binned by length within a window. The padded size is defined as the maximum length of sequences within the batch multiplied by the number of sequences in the batch.
Name | Description |
---|---|
size | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. compounding. |
buffer | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. |
discard_oversize | Whether to discard sequences that are by themselves longer than the largest padded batch size. |
get_length | Optional function that receives a sequence item and returns its length. Defaults to the built-in len() if not set. |
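The padded size defined above is easy to compute by hand. The short sketch below (padded_size is a hypothetical helper, and the lengths are made up) shows why grouping sequences of similar length keeps the padded size small.

```python
from typing import Sequence


def padded_size(lengths: Sequence[int]) -> int:
    """Padded size = longest sequence in the batch * number of sequences."""
    return max(lengths) * len(lengths) if lengths else 0


print(padded_size([3, 4, 5]))   # 15: similar lengths, little padding
print(padded_size([3, 4, 50]))  # 150: one long sequence pads all the others
```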
Training data and alignment
gold.biluo_tags_from_offsets
Encode labelled spans into per-token tags, using the BILUO scheme (Begin, In, Last, Unit, Out). Returns a list of strings, describing the tags. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". The string "-" is used where the entity offsets don't align with the tokenization in the Doc object. The training algorithm will view these as missing values. O denotes a non-entity token. B denotes the beginning of a multi-token entity, I the inside of an entity of three or more tokens, and L the end of an entity of two or more tokens. U denotes a single-token entity.
Example
from spacy.gold import biluo_tags_from_offsets

doc = nlp("I like London.")
entities = [(7, 13, "LOC")]
tags = biluo_tags_from_offsets(doc, entities)
assert tags == ["O", "O", "U-LOC", "O"]
Name | Description |
---|---|
doc | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
entities | A sequence of (start, end, label) triples. start and end should be character-offset integers denoting the slice into the original string. |
RETURNS | A list of strings, describing the BILUO tags. |
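To show how the scheme maps character offsets onto tokens, here is a simplified, dependency-free sketch. It is not spaCy's implementation: tokens are represented as hypothetical (start_char, end_char) pairs instead of a Doc, overlapping entities are not checked, and biluo_from_offsets is an illustrative name.

```python
from typing import List, Sequence, Tuple

TokenSpan = Tuple[int, int]    # (start_char, end_char) of one token
Entity = Tuple[int, int, str]  # (start_char, end_char, label)


def biluo_from_offsets(tokens: Sequence[TokenSpan], entities: Sequence[Entity]) -> List[str]:
    starts = {start: i for i, (start, _) in enumerate(tokens)}
    ends = {end: i for i, (_, end) in enumerate(tokens)}
    tags = ["O"] * len(tokens)
    for start_char, end_char, label in entities:
        first = starts.get(start_char)
        last = ends.get(end_char)
        if first is not None and last is not None and first <= last:
            if first == last:
                tags[first] = f"U-{label}"  # single-token entity
            else:
                tags[first] = f"B-{label}"
                for i in range(first + 1, last):
                    tags[i] = f"I-{label}"
                tags[last] = f"L-{label}"
        else:
            # Offsets don't match token boundaries: mark touched tokens as missing.
            for i, (tok_start, tok_end) in enumerate(tokens):
                if tok_start < end_char and tok_end > start_char:
                    tags[i] = "-"
    return tags


# Token boundaries for "I like London."
tokens = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(biluo_from_offsets(tokens, [(7, 13, "LOC")]))  # ['O', 'O', 'U-LOC', 'O']
print(biluo_from_offsets(tokens, [(7, 12, "LOC")]))  # ['O', 'O', '-', 'O']
```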
gold.offsets_from_biluo_tags
Encode per-token tags following the BILUO scheme into entity offsets.
Example
from spacy.gold import offsets_from_biluo_tags

doc = nlp("I like London.")
tags = ["O", "O", "U-LOC", "O"]
entities = offsets_from_biluo_tags(doc, tags)
assert entities == [(7, 13, "LOC")]
Name | Description |
---|---|
doc | The document that the BILUO tags refer to. |
tags | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
RETURNS | A sequence of (start, end, label) triples. start and end will be character-offset integers denoting the slice into the original string. |
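The reverse direction can be sketched the same way. Again this is an illustration rather than spaCy's code: tokens are hypothetical (start_char, end_char) pairs, the tags are assumed to be well-formed, and offsets_from_biluo is an illustrative name.

```python
from typing import List, Optional, Sequence, Tuple

TokenSpan = Tuple[int, int]    # (start_char, end_char) of one token
Entity = Tuple[int, int, str]  # (start_char, end_char, label)


def offsets_from_biluo(tokens: Sequence[TokenSpan], tags: Sequence[str]) -> List[Entity]:
    entities: List[Entity] = []
    open_start: Optional[int] = None  # start offset of an entity opened by "B"
    for (tok_start, tok_end), tag in zip(tokens, tags):
        if tag in ("", "O", "-"):
            continue  # missing value or non-entity token
        action, label = tag.split("-", 1)
        if action == "U":
            entities.append((tok_start, tok_end, label))
        elif action == "B":
            open_start = tok_start
        elif action == "L" and open_start is not None:
            entities.append((open_start, tok_end, label))
            open_start = None
        # "I" continues the open entity, so there is nothing to record.
    return entities


# Token boundaries for "I like London."
tokens = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(offsets_from_biluo(tokens, ["O", "O", "U-LOC", "O"]))  # [(7, 13, 'LOC')]
```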
gold.spans_from_biluo_tags
Encode per-token tags following the BILUO scheme into Span objects. This can be used to create entity spans from token-based tags, e.g. to overwrite the doc.ents.
Example
from spacy.gold import spans_from_biluo_tags

doc = nlp("I like London.")
tags = ["O", "O", "U-LOC", "O"]
doc.ents = spans_from_biluo_tags(doc, tags)
Name | Description |
---|---|
doc | The document that the BILUO tags refer to. |
tags | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
RETURNS | A sequence of Span objects with added entity labels. |
Utility functions
spaCy comes with a small collection of utility functions located in spacy/util.py.
Because utility functions are mostly intended for internal use within spaCy,
their behavior may change with future releases. The functions documented on this
page should be safe to use and we'll try to ensure backwards compatibility.
However, we recommend having additional tests in place if your application
depends on any of spaCy's utilities.
util.get_lang_class
Import and load a Language class. Allows lazy-loading language data and importing languages using the two-letter language code. To add a language code for a custom language class, you can register it using the @registry.languages decorator.
Example
for lang_id in ["en", "de"]:
    lang_class = util.get_lang_class(lang_id)
    lang = lang_class()
Name | Description |
---|---|
lang | Two-letter language code, e.g. "en". |
RETURNS | The respective subclass. |
util.lang_class_is_loaded
Check whether a Language subclass is already loaded. Language subclasses are loaded lazily, to avoid expensive setup code associated with the language data.
Example
lang_cls = util.get_lang_class("en")
assert util.lang_class_is_loaded("en") is True
assert util.lang_class_is_loaded("de") is False
Name | Description |
---|---|
name | Two-letter language code, e.g. "en". |
RETURNS | Whether the class has been loaded. |
util.load_model
Load a model from a package or data path. If called with a package name, spaCy will assume the model is a Python package and import and call its load() method. If called with a path, spaCy will assume it's a data directory, read the language and pipeline settings from the config.cfg and create a Language object. The model data will then be loaded in via Language.from_disk.
Example
nlp = util.load_model("en_core_web_sm")
nlp = util.load_model("en_core_web_sm", disable=["ner"])
nlp = util.load_model("/path/to/data")
Name | Description |
---|---|
name | Package name or model path. |
vocab (v3) | Optional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. |
disable | Names of pipeline components to disable. |
config (v3) | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. "nlp.pipeline". |
RETURNS | Language class with the loaded model. |
util.load_model_from_init_py
A helper function to use in the load() method of a model package's __init__.py.
Example
from spacy.util import load_model_from_init_py

def load(**overrides):
    return load_model_from_init_py(__file__, **overrides)
Name | Description |
---|---|
init_file | Path to model's __init__.py, i.e. __file__. |
vocab (v3) | Optional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. |
disable | Names of pipeline components to disable. |
config (v3) | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. "nlp.pipeline". |
RETURNS | Language class with the loaded model. |
util.load_config
Load a model's config.cfg from a file path. The config typically includes details about the model pipeline and how its components are created, as well as all training settings and hyperparameters.
Example
config = util.load_config("/path/to/model/config.cfg")
print(config.to_str())
Name | Description |
---|---|
path | Path to the model's config.cfg. |
overrides | Optional config overrides to replace in loaded config. Can be provided as nested dict, or as flat dict with keys in dot notation, e.g. "nlp.pipeline". |
interpolate | Whether to interpolate the config and replace variables like ${paths:train} with their values. Defaults to False. |
RETURNS | The model's config. |
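The ${section:key} variable syntax is the same one Python's standard configparser calls extended interpolation. The sketch below uses only the standard library to show the difference between reading a config raw and interpolated; it is not spaCy's loader, and the section names and the corpus path are made up for the example.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Hypothetical config text with one variable reference.
CONFIG_TEXT = """
[paths]
train = corpus/train.spacy

[training]
train_corpus = ${paths:train}
"""

# Without interpolation, the variable is kept verbatim.
raw = ConfigParser(interpolation=None)
raw.read_string(CONFIG_TEXT)
print(raw["training"]["train_corpus"])  # ${paths:train}

# With extended interpolation, the variable is replaced by its value.
interpolated = ConfigParser(interpolation=ExtendedInterpolation())
interpolated.read_string(CONFIG_TEXT)
print(interpolated["training"]["train_corpus"])  # corpus/train.spacy
```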
util.load_meta
Get a model's meta.json from a file path and validate its contents.
Example
meta = util.load_meta("/path/to/model/meta.json")
Name | Description |
---|---|
path | Path to the model's meta.json. |
RETURNS | The model's meta data. |
util.is_package
Check if string maps to a package installed via pip. Mainly used to validate model packages.
Example
util.is_package("en_core_web_sm")  # True
util.is_package("xyz")  # False
Name | Description |
---|---|
name | Name of package. |
RETURNS | True if installed package, False if not. |
util.get_package_path
Get path to an installed package. Mainly used to resolve the location of model packages. Currently imports the package to find its path.
Example
util.get_package_path("en_core_web_sm")
# /usr/lib/python3.6/site-packages/en_core_web_sm
Name | Description |
---|---|
package_name | Name of installed package. |
RETURNS | Path to model package directory. |
util.is_in_jupyter
Check if user is running spaCy from a Jupyter notebook by detecting the IPython kernel. Mainly used for the displacy visualizer.
Example
html = "<h1>Hello world!</h1>"
if util.is_in_jupyter():
    from IPython.core.display import display, HTML
    display(HTML(html))
Name | Description |
---|---|
RETURNS | True if in Jupyter, False if not. |
util.compile_prefix_regex
Compile a sequence of prefix rules into a regex object.
Example
prefixes = ("§", "%", "=", r"\+")
prefix_regex = util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search
Name | Description |
---|---|
entries | The prefix rules, e.g. lang.punctuation.TOKENIZER_PREFIXES. |
RETURNS | The regex object, to be used for Tokenizer.prefix_search. |
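One plausible way to turn a sequence of prefix rules into a single regex is to anchor each rule at the start of the string and join the rules with alternation. The sketch below does that with the standard re module; it is an assumption about the general approach rather than a copy of spaCy's code, and compile_prefix_sketch is a hypothetical name.

```python
import re
from typing import Iterable


def compile_prefix_sketch(entries: Iterable[str]) -> "re.Pattern[str]":
    # Anchor every rule at the start of the string, then join with "|".
    expression = "|".join("^" + piece for piece in entries if piece.strip())
    return re.compile(expression)


prefix_regex = compile_prefix_sketch(("§", "%", "=", r"\+"))
match = prefix_regex.search("+1")
print(match.group() if match else None)  # +
print(prefix_regex.search("1+"))         # None
```

Because every alternative is anchored, search only ever matches at the beginning of the string, which is what a prefix rule needs.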
util.compile_suffix_regex
Compile a sequence of suffix rules into a regex object.
Example
suffixes = ("'s", "'S", r"(?<=[0-9])\+")
suffix_regex = util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
Name | Description |
---|---|
entries | The suffix rules, e.g. lang.punctuation.TOKENIZER_SUFFIXES. |
RETURNS | The regex object, to be used for Tokenizer.suffix_search. |
util.compile_infix_regex
Compile a sequence of infix rules into a regex object.
Example
infixes = ("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])")
infix_regex = util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
Name | Description |
---|---|
entries | The infix rules, e.g. lang.punctuation.TOKENIZER_INFIXES. |
RETURNS | The regex object, to be used for Tokenizer.infix_finditer. |
util.minibatch
Iterate over batches of items. size may be an iterator, so that the batch size can vary on each step.
Example
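from spacy.util import minibatch

batches = minibatch(train_data)
for batch in batches:
    nlp.update(batch)
Name | Description |
---|---|
items | The items to batch up. |
size | The batch size: an integer, or an iterable of integers to vary the batch size on each step. |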
YIELDS | The batches. |
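A minimal sketch of batching with either a fixed size or a stream of sizes is shown below. It illustrates the behavior described above and is not spaCy's implementation; minibatch_sketch is a hypothetical name, and if a finite iterable of sizes runs out before the items do, this version simply stops.

```python
from itertools import islice, repeat
from typing import Iterable, Iterator, List, TypeVar, Union

T = TypeVar("T")


def minibatch_sketch(items: Iterable[T], size: Union[int, Iterable[int]]) -> Iterator[List[T]]:
    # A plain integer is treated as an endless stream of that size.
    sizes = repeat(size) if isinstance(size, int) else iter(size)
    stream = iter(items)
    for batch_size in sizes:
        batch = list(islice(stream, int(batch_size)))
        if not batch:
            return  # the items are exhausted
        yield batch


print(list(minibatch_sketch(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
print(list(minibatch_sketch(range(7), [1, 2, 4, 8])))
# [[0], [1, 2], [3, 4, 5, 6]]
```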
util.filter_spans
Filter a sequence of Span objects and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.
Example
from spacy.util import filter_spans

doc = nlp("This is a sentence.")
spans = [doc[0:2], doc[0:2], doc[0:4]]
filtered = filter_spans(spans)
Name | Description |
---|---|
spans | The spans to filter. |
RETURNS | The filtered spans. |
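The "first longest span wins" rule can be sketched on plain token ranges. This is an illustration of the selection logic under the assumption that spans are (start_token, end_token) pairs with an exclusive end; it does not operate on Span objects, and filter_ranges is a hypothetical name.

```python
from typing import List, Sequence, Set, Tuple

TokenRange = Tuple[int, int]  # (start_token, end_token), end exclusive


def filter_ranges(spans: Sequence[TokenRange]) -> List[TokenRange]:
    # Longest first; among equal lengths, the earliest start wins.
    by_priority = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[TokenRange] = []
    seen: Set[int] = set()  # token indices already covered by a kept span
    for start, end in by_priority:
        if seen.isdisjoint(range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)  # restore document order


print(filter_ranges([(0, 2), (0, 2), (0, 4)]))  # [(0, 4)]
print(filter_ranges([(0, 2), (1, 3), (3, 5)]))  # [(0, 2), (3, 5)]
```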
util.get_words_and_spaces
Given a list of words and a text, reconstruct the original tokens and return a list of words and spaces that can be used to create a Doc. This can help recover from destructive tokenization that didn't preserve any whitespace information.
Example
from spacy.util import get_words_and_spaces

orig_words = ["Hey", ",", "what", "'s", "up", "?"]
orig_text = "Hey, what's up?"
words, spaces = get_words_and_spaces(orig_words, orig_text)
# ['Hey', ',', 'what', "'s", 'up', '?']
# [False, True, False, True, False, False]
Name | Description |
---|---|
words | The list of words. |
text | The original text. |
RETURNS | A list of words and a list of boolean values indicating whether the word at this position is followed by a space. |
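The alignment idea can be sketched by walking through the text and checking what follows each word. This simplified version is an illustration only: it assumes every word occurs in the text in order and that tokens are separated by at most a single space, which is narrower than what spaCy's function handles. words_and_spaces_sketch is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def words_and_spaces_sketch(words: Sequence[str], text: str) -> Tuple[List[str], List[bool]]:
    spaces: List[bool] = []
    position = 0
    for word in words:
        start = text.index(word, position)  # raises ValueError if the word is missing
        position = start + len(word)
        # True if the character right after the word is a space.
        spaces.append(text[position:position + 1] == " ")
    return list(words), spaces


orig_words = ["Hey", ",", "what", "'s", "up", "?"]
orig_text = "Hey, what's up?"
words, spaces = words_and_spaces_sketch(orig_words, orig_text)
print(spaces)  # [False, True, False, True, False, False]
```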