---
title: Top-level Functions
menu:
  - ['spaCy', 'spacy']
  - ['displaCy', 'displacy']
  - ['registry', 'registry']
  - ['Loaders & Batchers', 'loaders-batchers']
  - ['Data & Alignment', 'gold']
  - ['Utility Functions', 'util']
---

## spaCy

### spacy.load

Load a model using the name of an installed model package, a string path or a `Path`-like object. spaCy will try resolving the load argument in this order. If a model is loaded from a model name, spaCy will assume it's a Python package and import it and call the model's own `load()` method. If a model is loaded from a path, spaCy will assume it's a data directory, read the language and pipeline settings from the `meta.json` and initialize the `Language` class. The data will be loaded in via `Language.from_disk`.

> #### Example
>
> ```python
> nlp = spacy.load("en_core_web_sm")           # package
> nlp = spacy.load("/path/to/en")              # string path
> nlp = spacy.load(Path("/path/to/en"))        # pathlib Path
>
> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
> ```

| Name                                        | Type              | Description                                                                  |
| ------------------------------------------- | ----------------- | ---------------------------------------------------------------------------- |
| `name`                                      | str / `Path`      | Model to load, i.e. package name or path.                                    |
| _keyword-only_                              |                   |                                                                              |
| `disable`                                   | `List[str]`       | Names of pipeline components to disable.                                     |
| `component_cfg` <Tag variant="new">3</Tag>  | `Dict[str, dict]` | Optional config overrides for pipeline components, keyed by component names. |
| **RETURNS**                                 | `Language`        | A `Language` object with the loaded model.                                   |

Essentially, `spacy.load()` is a convenience wrapper that reads the language ID and pipeline components from a model's `meta.json`, initializes the `Language` class, loads in the model data and returns it.

```python
### Abstract example
cls = util.get_lang_class(lang)   # get language for ID, e.g. "en"
nlp = cls()                       # initialize the language
for name in pipeline:
    nlp.add_pipe(name)            # add component to pipeline
nlp.from_disk(model_data_path)    # load in model data
```

### spacy.blank

Create a blank model of a given language class. This function is the twin of `spacy.load()`.

> #### Example
>
> ```python
> nlp_en = spacy.blank("en")   # equivalent to English()
> nlp_de = spacy.blank("de")   # equivalent to German()
> ```

| Name        | Type       | Description                                              |
| ----------- | ---------- | -------------------------------------------------------- |
| `name`      | str        | ISO code of the language class to load.                  |
| **RETURNS** | `Language` | An empty `Language` object of the appropriate subclass.  |

### spacy.info

The same as the `info` command. Pretty-print information about your installation, models and local setup from within spaCy. To get the model meta data as a dictionary instead, you can use the `meta` attribute on your `nlp` object with a loaded model, e.g. `nlp.meta`.

> #### Example
>
> ```python
> spacy.info()
> spacy.info("en_core_web_sm")
> markdown = spacy.info(markdown=True, silent=True)
> ```

| Name           | Type | Description                                      |
| -------------- | ---- | ------------------------------------------------ |
| `model`        | str  | A model, i.e. a package name or path (optional). |
| _keyword-only_ |      |                                                  |
| `markdown`     | bool | Print information as Markdown.                   |
| `silent`       | bool | Don't print anything, just return.               |

### spacy.explain

Get a description for a given POS tag, dependency label or entity type. For a list of available terms, see `glossary.py`.

> #### Example
>
> ```python
> spacy.explain("NORP")
> # Nationalities or religious or political groups
>
> doc = nlp("Hello world")
> for word in doc:
>     print(word.text, word.tag_, spacy.explain(word.tag_))
> # Hello UH interjection
> # world NN noun, singular or mass
> ```

| Name        | Type | Description                                              |
| ----------- | ---- | -------------------------------------------------------- |
| `term`      | str  | Term to explain.                                         |
| **RETURNS** | str  | The explanation, or `None` if not found in the glossary. |

### spacy.prefer_gpu

Allocate data and perform operations on GPU, if available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any models.

> #### Example
>
> ```python
> import spacy
> activated = spacy.prefer_gpu()
> nlp = spacy.load("en_core_web_sm")
> ```

| Name        | Type | Description                    |
| ----------- | ---- | ------------------------------ |
| **RETURNS** | bool | Whether the GPU was activated. |

### spacy.require_gpu

Allocate data and perform operations on GPU. Will raise an error if no GPU is available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any models.

> #### Example
>
> ```python
> import spacy
> spacy.require_gpu()
> nlp = spacy.load("en_core_web_sm")
> ```

| Name        | Type | Description |
| ----------- | ---- | ----------- |
| **RETURNS** | bool | `True`      |

## displaCy

As of v2.0, spaCy comes with a built-in visualization suite. For more info and examples, see the usage guide on visualizing spaCy.

### displacy.serve

Serve a dependency parse tree or named entity visualization to view it in your browser. Will run a simple web server.

> #### Example
>
> ```python
> import spacy
> from spacy import displacy
> nlp = spacy.load("en_core_web_sm")
> doc1 = nlp("This is a sentence.")
> doc2 = nlp("This is another sentence.")
> displacy.serve([doc1, doc2], style="dep")
> ```

| Name      | Type                | Description                                                                            | Default     |
| --------- | ------------------- | -------------------------------------------------------------------------------------- | ----------- |
| `docs`    | list, `Doc`, `Span` | Document(s) to visualize.                                                               |             |
| `style`   | str                 | Visualization style, `'dep'` or `'ent'`.                                                | `'dep'`     |
| `page`    | bool                | Render markup as full HTML page.                                                        | `True`      |
| `minify`  | bool                | Minify HTML markup.                                                                     | `False`     |
| `options` | dict                | Visualizer-specific options, e.g. colors.                                               | `{}`        |
| `manual`  | bool                | Don't parse `Doc` and instead expect a dict or list of dicts. See here for formats and examples. | `False`     |
| `port`    | int                 | Port to serve visualization.                                                            | `5000`      |
| `host`    | str                 | Host to serve visualization.                                                            | `'0.0.0.0'` |

### displacy.render

Render a dependency parse tree or named entity visualization.

> #### Example
>
> ```python
> import spacy
> from spacy import displacy
> nlp = spacy.load("en_core_web_sm")
> doc = nlp("This is a sentence.")
> html = displacy.render(doc, style="dep")
> ```

| Name        | Type                | Description                                                                                                        | Default |
| ----------- | ------------------- | ------------------------------------------------------------------------------------------------------------------ | ------- |
| `docs`      | list, `Doc`, `Span` | Document(s) to visualize.                                                                                           |         |
| `style`     | str                 | Visualization style, `'dep'` or `'ent'`.                                                                            | `'dep'` |
| `page`      | bool                | Render markup as full HTML page.                                                                                    | `False` |
| `minify`    | bool                | Minify HTML markup.                                                                                                 | `False` |
| `jupyter`   | bool                | Explicitly enable or disable "Jupyter mode" to return markup ready to be rendered in a notebook. Detected automatically if `None`. | `None`  |
| `options`   | dict                | Visualizer-specific options, e.g. colors.                                                                           | `{}`    |
| `manual`    | bool                | Don't parse `Doc` and instead expect a dict or list of dicts. See here for formats and examples.                     | `False` |
| **RETURNS** | str                 | Rendered HTML markup.                                                                                               |         |

### Visualizer options

The `options` argument lets you specify additional settings for each visualizer. If a setting is not present in the options, the default value will be used.

#### Dependency Visualizer options

> #### Example
>
> ```python
> options = {"compact": True, "color": "blue"}
> displacy.serve(doc, style="dep", options=options)
> ```

| Name                                      | Type | Description                                                                                            | Default                 |
| ----------------------------------------- | ---- | ------------------------------------------------------------------------------------------------------ | ----------------------- |
| `fine_grained`                            | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`).      | `False`                 |
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemmas in a separate row below the token texts.                                               | `False`                 |
| `collapse_punct`                          | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs used to attach punctuation. | `True`                  |
| `collapse_phrases`                        | bool | Merge noun phrases into one token.                                                                      | `False`                 |
| `compact`                                 | bool | "Compact mode" with square arrows that takes up less space.                                             | `False`                 |
| `color`                                   | str  | Text color (HEX, RGB or color names).                                                                   | `'#000000'`             |
| `bg`                                      | str  | Background color (HEX, RGB or color names).                                                             | `'#ffffff'`             |
| `font`                                    | str  | Font name or font family for all text.                                                                  | `'Arial'`               |
| `offset_x`                                | int  | Spacing on left side of the SVG in px.                                                                  | `50`                    |
| `arrow_stroke`                            | int  | Width of arrow path in px.                                                                              | `2`                     |
| `arrow_width`                             | int  | Width of arrow head in px.                                                                              | `10` / `8` (compact)    |
| `arrow_spacing`                           | int  | Spacing between arrows in px to avoid overlaps.                                                         | `20` / `12` (compact)   |
| `word_spacing`                            | int  | Vertical spacing between words and arcs in px.                                                          | `45`                    |
| `distance`                                | int  | Distance between words in px.                                                                           | `175` / `150` (compact) |

#### Named Entity Visualizer options

> #### Example
>
> ```python
> options = {"ents": ["PERSON", "ORG", "PRODUCT"],
>            "colors": {"ORG": "yellow"}}
> displacy.serve(doc, style="ent", options=options)
> ```

| Name                                    | Type | Description                                                                                                                | Default            |
| --------------------------------------- | ---- | --------------------------------------------------------------------------------------------------------------------------- | ------------------ |
| `ents`                                  | list | Entity types to highlight (`None` for all types).                                                                            | `None`             |
| `colors`                                | dict | Color overrides. Entity types in uppercase should be mapped to color names or values.                                        | `{}`               |
| `template` <Tag variant="new">2.2</Tag> | str  | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. | see `templates.py` |

By default, displaCy comes with colors for all entity types used by spaCy models. If you're using custom entity types, you can use the `colors` setting to add your own colors for them. Your application or model package can also expose a `spacy_displacy_colors` entry point to add custom labels and their colors automatically.
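
Custom colors can also be provided through the `displacy_colors` registry (see the registry table below). As a minimal sketch — the registry name `"my_colors.v1"` and the `MY_ENT` label are made-up placeholders, and it assumes the registered function returns a dict mapping uppercase entity labels to colors:

```python
import spacy

# Hypothetical example: register a function that provides extra displaCy colors
@spacy.registry.displacy_colors("my_colors.v1")
def create_displacy_colors():
    # Map custom (uppercase) entity labels to color names or values
    return {"MY_ENT": "#ffcc00"}
```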

## registry

spaCy's function registry extends Thinc's registry and allows you to map strings to functions. You can register functions to create architectures, optimizers, schedules and more, and then refer to them and set their arguments in your config file. Python type hints are used to validate the inputs. See the Thinc docs for details on the registry methods and our helper library `catalogue` for some background on the concept of function registries. spaCy also uses the function registry for language subclasses, model architectures, lookups and pipeline component factories.

> #### Example
>
> ```python
> import spacy
> from thinc.api import Model
>
> @spacy.registry.architectures("CustomNER.v1")
> def custom_ner(nO: int) -> Model:
>     return Model("custom", forward, dims={"nO": nO})
> ```

| Registry name     | Description                                                                                                                         |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| `architectures`   | Registry for functions that create model architectures. Can be used to register custom model architectures and reference them in the `config.cfg`. |
| `factories`       | Registry for functions that create pipeline components. Added automatically when you use the `@spacy.component` decorator and also reads from entry points. |
| `languages`       | Registry for language-specific `Language` subclasses. Automatically reads from entry points.                                           |
| `lookups`         | Registry for large lookup tables available via `vocab.lookups`.                                                                        |
| `displacy_colors` | Registry for custom color scheme for the displacy NER visualizer. Automatically reads from entry points.                               |
| `assets`          |                                                                                                                                        |
| `optimizers`      | Registry for functions that create optimizers.                                                                                         |
| `schedules`       | Registry for functions that create schedules.                                                                                          |
| `layers`          | Registry for functions that create layers.                                                                                             |
| `losses`          | Registry for functions that create losses.                                                                                             |
| `initializers`    | Registry for functions that create initializers.                                                                                       |

### spacy-transformers registry

The following registries are added by the `spacy-transformers` package. See the `Transformer` API reference and usage docs for details.

> #### Example
>
> ```python
> import spacy_transformers
>
> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
> def configure_custom_annotation_setter():
>     def annotation_setter(docs, trf_data) -> None:
>         # Set annotations on the docs
>         ...
>
>     return annotation_setter
> ```

| Registry name        | Description                                                                                                                                                                       |
| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `span_getters`       | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences.                                        |
| `annotation_setters` | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a `FullTransformerBatch` and can set additional annotations on the `Doc`. |
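
A span getter can be registered the same way. A minimal sketch, assuming the registry name `"custom_sent_spans.v1"` is a placeholder and that sentence boundaries have been set on the docs:

```python
import spacy_transformers

@spacy_transformers.registry.span_getters("custom_sent_spans.v1")
def configure_custom_sent_spans():
    def get_sent_spans(docs):
        # Let the transformer process each sentence as a separate span
        return [list(doc.sents) for doc in docs]

    return get_sent_spans
```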

## Training data loaders and batchers

## Training data and alignment

### gold.docs_to_json

Convert a list of `Doc` objects into the JSON-serializable format used by the `spacy train` command. Each input doc will be treated as a 'paragraph' in the output doc.

> #### Example
>
> ```python
> from spacy.gold import docs_to_json
>
> doc = nlp("I like London")
> json_data = docs_to_json([doc])
> ```

| Name        | Type             | Description                          |
| ----------- | ---------------- | ------------------------------------ |
| `docs`      | iterable / `Doc` | The `Doc` object(s) to convert.      |
| `id`        | int              | ID to assign to the JSON. Defaults to `0`. |
| **RETURNS** | dict             | The data in spaCy's JSON format.     |

### gold.align

Calculate alignment tables between two tokenizations, using the Levenshtein algorithm. The alignment is case-insensitive.

The current implementation of the alignment algorithm assumes that both tokenizations add up to the same string. For example, you'll be able to align `["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not `["I", "'m"]` and `["I", "am"]`.

> #### Example
>
> ```python
> from spacy.gold import align
>
> bert_tokens = ["obama", "'", "s", "podcast"]
> spacy_tokens = ["obama", "'s", "podcast"]
> alignment = align(bert_tokens, spacy_tokens)
> cost, a2b, b2a, a2b_multi, b2a_multi = alignment
> ```

| Name        | Type  | Description                                                              |
| ----------- | ----- | ------------------------------------------------------------------------ |
| `tokens_a`  | list  | String values of candidate tokens to align.                               |
| `tokens_b`  | list  | String values of reference tokens to align.                               |
| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. |

The returned tuple contains the following alignment information:

> #### Example
>
> ```python
> a2b = array([0, -1, -1, 2])
> b2a = array([0, 2, 3])
> a2b_multi = {1: 1, 2: 1}
> b2a_multi = {}
> ```

If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If there's no one-to-one alignment for a token, it has the value `-1`.

| Name        | Type                                   | Description                                                                                                          |
| ----------- | -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| `cost`      | int                                    | The number of misaligned tokens.                                                                                        |
| `a2b`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`.                                                  |
| `b2a`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`.                                                  |
| `a2b_multi` | dict                                   | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. |
| `b2a_multi` | dict                                   | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. |

### gold.biluo_tags_from_offsets

Encode labelled spans into per-token tags, using the BILUO scheme (Begin, In, Last, Unit, Out). Returns a list of strings, describing the tags. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets don't align with the tokenization in the `Doc` object. The training algorithm will view these as missing values. `O` denotes a non-entity token. `B` denotes the beginning of a multi-token entity, `I` the inside of an entity of three or more tokens, and `L` the end of an entity of two or more tokens. `U` denotes a single-token entity.

> #### Example
>
> ```python
> from spacy.gold import biluo_tags_from_offsets
>
> doc = nlp("I like London.")
> entities = [(7, 13, "LOC")]
> tags = biluo_tags_from_offsets(doc, entities)
> assert tags == ["O", "O", "U-LOC", "O"]
> ```

| Name        | Type     | Description                                                                                                              |
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------- |
| `doc`       | `Doc`    | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document.   |
| `entities`  | iterable | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. |
| **RETURNS** | list     | A list of str, describing the BILUO tags.                                                                                |

### gold.offsets_from_biluo_tags

Encode per-token tags following the BILUO scheme into entity offsets.

> #### Example
>
> ```python
> from spacy.gold import offsets_from_biluo_tags
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> entities = offsets_from_biluo_tags(doc, tags)
> assert entities == [(7, 13, "LOC")]
> ```

| Name        | Type     | Description                                                                                                              |
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------- |
| `doc`       | `Doc`    | The document that the BILUO tags refer to.                                                                               |
| `tags`      | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
| **RETURNS** | list     | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. |

### gold.spans_from_biluo_tags

Encode per-token tags following the BILUO scheme into `Span` objects. This can be used to create entity spans from token-based tags, e.g. to overwrite the `doc.ents`.

> #### Example
>
> ```python
> from spacy.gold import spans_from_biluo_tags
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> doc.ents = spans_from_biluo_tags(doc, tags)
> ```

| Name        | Type     | Description                                                                                                              |
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------- |
| `doc`       | `Doc`    | The document that the BILUO tags refer to.                                                                               |
| `tags`      | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
| **RETURNS** | list     | A sequence of `Span` objects with added entity labels.                                                                   |

## Utility functions

spaCy comes with a small collection of utility functions located in `spacy/util.py`. Because utility functions are mostly intended for internal use within spaCy, their behavior may change with future releases. The functions documented on this page should be safe to use and we'll try to ensure backwards compatibility. However, we recommend having additional tests in place if your application depends on any of spaCy's utilities.

### util.get_lang_class

Import and load a `Language` class. Allows lazy-loading language data and importing languages using the two-letter language code. To add a language code for a custom language class, you can use the `set_lang_class` helper.

> #### Example
>
> ```python
> for lang_id in ["en", "de"]:
>     lang_class = util.get_lang_class(lang_id)
>     lang = lang_class()
> ```

| Name        | Type       | Description                          |
| ----------- | ---------- | ------------------------------------ |
| `lang`      | str        | Two-letter language code, e.g. `'en'`. |
| **RETURNS** | `Language` | Language class.                      |

### util.set_lang_class

Set a custom `Language` class name that can be loaded via `get_lang_class`. If your model uses a custom language, this is required so that spaCy can load the correct class from the two-letter language code.

> #### Example
>
> ```python
> from spacy.lang.xy import CustomLanguage
>
> util.set_lang_class('xy', CustomLanguage)
> lang_class = util.get_lang_class('xy')
> nlp = lang_class()
> ```

| Name   | Type       | Description                          |
| ------ | ---------- | ------------------------------------ |
| `name` | str        | Two-letter language code, e.g. `'en'`. |
| `cls`  | `Language` | The language class, e.g. `English`.  |

### util.lang_class_is_loaded

Check whether a `Language` class is already loaded. `Language` classes are loaded lazily, to avoid expensive setup code associated with the language data.

> #### Example
>
> ```python
> lang_cls = util.get_lang_class("en")
> assert util.lang_class_is_loaded("en") is True
> assert util.lang_class_is_loaded("de") is False
> ```

| Name        | Type | Description                          |
| ----------- | ---- | ------------------------------------ |
| `name`      | str  | Two-letter language code, e.g. `'en'`. |
| **RETURNS** | bool | Whether the class has been loaded.   |

### util.load_model

Load a model from a package or data path. If called with a package name, spaCy will assume the model is a Python package and import and call its `load()` method. If called with a path, spaCy will assume it's a data directory, read the language and pipeline settings from the `meta.json` and initialize a `Language` class. The model data will then be loaded in via `Language.from_disk()`.

> #### Example
>
> ```python
> nlp = util.load_model("en_core_web_sm")
> nlp = util.load_model("en_core_web_sm", disable=["ner"])
> nlp = util.load_model("/path/to/data")
> ```

| Name          | Type       | Description                                             |
| ------------- | ---------- | ------------------------------------------------------- |
| `name`        | str        | Package name or model path.                             |
| `**overrides` | -          | Specific overrides, like pipeline components to disable. |
| **RETURNS**   | `Language` | `Language` class with the loaded model.                 |

### util.load_model_from_path

Load a model from a data directory path. Creates the `Language` class and pipeline based on the directory's `meta.json` and then calls `from_disk()` with the path. This function also makes it easy to test a new model that you haven't packaged yet.

> #### Example
>
> ```python
> nlp = load_model_from_path("/path/to/data")
> ```

| Name          | Type       | Description                                                                        |
| ------------- | ---------- | ----------------------------------------------------------------------------------- |
| `model_path`  | str        | Path to model data directory.                                                        |
| `meta`        | dict       | Model meta data. If `False`, spaCy will try to load the meta from a `meta.json` in the same directory. |
| `**overrides` | -          | Specific overrides, like pipeline components to disable.                             |
| **RETURNS**   | `Language` | `Language` class with the loaded model.                                              |

### util.load_model_from_init_py

A helper function to use in the `load()` method of a model package's `__init__.py`.

> #### Example
>
> ```python
> from spacy.util import load_model_from_init_py
>
> def load(**overrides):
>     return load_model_from_init_py(__file__, **overrides)
> ```

| Name          | Type       | Description                                             |
| ------------- | ---------- | ------------------------------------------------------- |
| `init_file`   | str        | Path to model's `__init__.py`, i.e. `__file__`.         |
| `**overrides` | -          | Specific overrides, like pipeline components to disable. |
| **RETURNS**   | `Language` | `Language` class with the loaded model.                 |

### util.get_model_meta

Get a model's `meta.json` from a directory path and validate its contents.

> #### Example
>
> ```python
> meta = util.get_model_meta("/path/to/model")
> ```

| Name        | Type         | Description              |
| ----------- | ------------ | ------------------------ |
| `path`      | str / `Path` | Path to model directory. |
| **RETURNS** | dict         | The model's meta data.   |

### util.is_package

Check if a string maps to a package installed via pip. Mainly used to validate model packages.

> #### Example
>
> ```python
> util.is_package("en_core_web_sm")  # True
> util.is_package("xyz")             # False
> ```

| Name        | Type | Description                                 |
| ----------- | ---- | ------------------------------------------- |
| `name`      | str  | Name of package.                            |
| **RETURNS** | bool | `True` if installed package, `False` if not. |

### util.get_package_path

Get the path to an installed package. Mainly used to resolve the location of model packages. Currently imports the package to find its path.

> #### Example
>
> ```python
> util.get_package_path("en_core_web_sm")
> # /usr/lib/python3.6/site-packages/en_core_web_sm
> ```

| Name           | Type   | Description                      |
| -------------- | ------ | -------------------------------- |
| `package_name` | str    | Name of installed package.       |
| **RETURNS**    | `Path` | Path to model package directory. |

### util.is_in_jupyter

Check if user is running spaCy from a Jupyter notebook by detecting the IPython kernel. Mainly used for the displacy visualizer.

> #### Example
>
> ```python
> html = "<h1>Hello world!</h1>"
> if util.is_in_jupyter():
>     from IPython.core.display import display, HTML
>     display(HTML(html))
> ```

| Name        | Type | Description                         |
| ----------- | ---- | ----------------------------------- |
| **RETURNS** | bool | `True` if in Jupyter, `False` if not. |

### util.compile_prefix_regex

Compile a sequence of prefix rules into a regex object.

> #### Example
>
> ```python
> prefixes = ("§", "%", "=", r"\+")
> prefix_regex = util.compile_prefix_regex(prefixes)
> nlp.tokenizer.prefix_search = prefix_regex.search
> ```

| Name        | Type  | Description                                                    |
| ----------- | ----- | -------------------------------------------------------------- |
| `entries`   | tuple | The prefix rules, e.g. `lang.punctuation.TOKENIZER_PREFIXES`.   |
| **RETURNS** | regex | The regex object to be used for `Tokenizer.prefix_search`.      |

### util.compile_suffix_regex

Compile a sequence of suffix rules into a regex object.

> #### Example
>
> ```python
> suffixes = ("'s", "'S", r"(?<=[0-9])\+")
> suffix_regex = util.compile_suffix_regex(suffixes)
> nlp.tokenizer.suffix_search = suffix_regex.search
> ```

| Name        | Type  | Description                                                    |
| ----------- | ----- | -------------------------------------------------------------- |
| `entries`   | tuple | The suffix rules, e.g. `lang.punctuation.TOKENIZER_SUFFIXES`.   |
| **RETURNS** | regex | The regex object to be used for `Tokenizer.suffix_search`.      |

### util.compile_infix_regex

Compile a sequence of infix rules into a regex object.

> #### Example
>
> ```python
> infixes = ("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])")
> infix_regex = util.compile_infix_regex(infixes)
> nlp.tokenizer.infix_finditer = infix_regex.finditer
> ```

| Name        | Type  | Description                                                    |
| ----------- | ----- | -------------------------------------------------------------- |
| `entries`   | tuple | The infix rules, e.g. `lang.punctuation.TOKENIZER_INFIXES`.     |
| **RETURNS** | regex | The regex object to be used for `Tokenizer.infix_finditer`.     |

### util.minibatch

Iterate over batches of items. `size` may be an iterator, so that batch-size can vary on each step.

> #### Example
>
> ```python
> batches = minibatch(train_data)
> for batch in batches:
>     nlp.update(batch)
> ```

| Name       | Type           | Description            |
| ---------- | -------------- | ---------------------- |
| `items`    | iterable       | The items to batch up. |
| `size`     | int / iterable | The batch size(s).     |
| **YIELDS** | list           | The batches.           |
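
Because `size` accepts any iterable of sizes, a schedule can control how the batch size changes over time. A minimal sketch, assuming `train_data` and `nlp` are already set up and using Thinc's `compounding` schedule (any iterator of sizes works the same way):

```python
from spacy.util import minibatch
from thinc.api import compounding

# Batch size starts at 1 and compounds up to 32, growing by 0.1% per batch
sizes = compounding(1.0, 32.0, 1.001)
for batch in minibatch(train_data, size=sizes):
    nlp.update(batch)
```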

### util.filter_spans

Filter a sequence of `Span` objects and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with `Retokenizer.merge`. When spans overlap, the (first) longest span is preferred over shorter spans.

> #### Example
>
> ```python
> doc = nlp("This is a sentence.")
> spans = [doc[0:2], doc[0:2], doc[0:4]]
> filtered = filter_spans(spans)
> # filtered == [doc[0:4]]: the duplicate and the shorter overlapping span are removed
> ```

| Name        | Type     | Description          |
| ----------- | -------- | -------------------- |
| `spans`     | iterable | The spans to filter. |
| **RETURNS** | list     | The filtered spans.  |

### util.get_words_and_spaces

Given a list of words and a text, reconstruct the original tokens and return a list of words and spaces that can be used to create a `Doc`. This can help recover destructive tokenization that didn't preserve any whitespace information.

| Name        | Type  | Description                                                                                      |
| ----------- | ----- | ------------------------------------------------------------------------------------------------ |
| `words`     | list  | The list of words.                                                                                |
| `text`      | str   | The original text.                                                                                |
| **RETURNS** | tuple | A `(words, spaces)` tuple, where `spaces` is a list of booleans indicating whether each word is followed by a space. |
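
A minimal usage sketch; the output shown is an assumption based on the behavior described above (trailing spaces are inferred from the text):

```python
from spacy.util import get_words_and_spaces

words = ["Hello", "world", "!"]
text = "Hello world!"
words, spaces = get_words_and_spaces(words, text)
assert words == ["Hello", "world", "!"]
assert spaces == [True, False, False]
```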