spaCy/top-level.md at e9babd99730331010bc8cd1d6e176755e3372df6

explosion/spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 19:39:13 +03:00

Ines Montani 1ea1bc98e7 Document regex utilities [ci skip]

2019-02-24 18:34:10 +01:00

title

Top-level Functions

spacy

displacy

Utility Functions

util

Compatibility

compat

spaCy

spacy.load

Load a model via its shortcut link, the name of an installed model package, a unicode path or a Path-like object. spaCy will try resolving the load argument in this order. If a model is loaded from a shortcut link or package name, spaCy will assume it's a Python package and import it and call the model's own load() method. If a model is loaded from a path, spaCy will assume it's a data directory, read the language and pipeline settings off the meta.json and initialize the Language class. The data will be loaded in via Language.from_disk.

Example

nlp = spacy.load("en") # shortcut link
nlp = spacy.load("en_core_web_sm") # package
nlp = spacy.load("/path/to/en") # unicode path
nlp = spacy.load(Path("/path/to/en")) # pathlib Path

nlp = spacy.load("en", disable=["parser", "tagger"])

Name	Type	Description
`name`	unicode / `Path`	Model to load, i.e. shortcut link, package name or path.
`disable`	list	Names of pipeline components to disable.
RETURNS	`Language`	A `Language` object with the loaded model.

Essentially, spacy.load() is a convenience wrapper that reads the language ID and pipeline components from a model's meta.json, initializes the Language class, loads in the model data and returns it.

Abstract example
cls = util.get_lang_class(lang)         #  get language for ID, e.g. 'en'
nlp = cls()                             #  initialise the language
for name in pipeline:
    component = nlp.create_pipe(name)   #  create each pipeline component
    nlp.add_pipe(component)             #  add component to pipeline
nlp.from_disk(model_data_path)          #  load in model data
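
The order in which the load argument is resolved can be sketched roughly as follows. This is an illustration only, not spaCy's actual code: `resolve_model_source` and its `shortcut_links` and `installed_packages` arguments are hypothetical names that stand in for spaCy's internal lookups.

```python
from pathlib import Path


def resolve_model_source(name, shortcut_links, installed_packages):
    """Classify `name` in the order described above (illustrative only)."""
    if isinstance(name, str):
        if name in shortcut_links:        # 1. shortcut link
            return "link"
        if name in installed_packages:    # 2. installed model package
            return "package"
        if Path(name).exists():           # 3. string path to a data directory
            return "path"
    elif isinstance(name, Path) and name.exists():  # 4. Path-like object
        return "path"
    # Mirrors the documented behavior of raising instead of returning
    # an empty Language object.
    raise IOError("Can't find model {!r}".format(name))
```

Under these assumptions, `resolve_model_source("en", {"en"}, {"en_core_web_sm"})` is classified as a link before the package name or any path is considered.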

As of spaCy 2.0, the path keyword argument is deprecated. spaCy will also raise an error if no model could be loaded and never just return an empty Language object. If you need a blank language, you can use the new function spacy.blank() or import the class explicitly, e.g. from spacy.lang.en import English.

- nlp = spacy.load("en", path="/model")
+ nlp = spacy.load("/model")

spacy.blank

Create a blank model of a given language class. This function is the twin of spacy.load().

Example

nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")

Name	Type	Description
`name`	unicode	ISO code of the language class to load.
`disable`	list	Names of pipeline components to disable.
RETURNS	`Language`	An empty `Language` object of the appropriate subclass.

spacy.info

The same as the info command. Pretty-print information about your installation, models and local setup from within spaCy. To get the model meta data as a dictionary instead, you can use the meta attribute on your nlp object with a loaded model, e.g. nlp.meta.

Example

spacy.info()
spacy.info("en")
spacy.info("de", markdown=True)

Name	Type	Description
`model`	unicode	A model, i.e. shortcut link, package name or path (optional).
`markdown`	bool	Print information as Markdown.

spacy.explain

Get a description for a given POS tag, dependency label or entity type. For a list of available terms, see glossary.py.

Example

spacy.explain(u"NORP")
# Nationalities or religious or political groups

doc = nlp(u"Hello world")
for word in doc:
    print(word.text, word.tag_, spacy.explain(word.tag_))
# Hello UH interjection
# world NN noun, singular or mass

Name	Type	Description
`term`	unicode	Term to explain.
RETURNS	unicode	The explanation, or `None` if not found in the glossary.
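
Conceptually, this is a dictionary lookup that falls back to `None`. A minimal sketch, assuming a tiny stand-in glossary built from the three terms in the example above (the real glossary in glossary.py is much larger, and `explain` here is a local function, not the spaCy one):

```python
# Stand-in glossary; the entries are copied from the example output above.
GLOSSARY = {
    "NORP": "Nationalities or religious or political groups",
    "UH": "interjection",
    "NN": "noun, singular or mass",
}


def explain(term):
    """Return the description for `term`, or None if it is not in the glossary."""
    return GLOSSARY.get(term)
```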

spacy.prefer_gpu

Allocate data and perform operations on GPU, if available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any models.

Example

import spacy
activated = spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

Name	Type	Description
RETURNS	bool	Whether the GPU was activated.

spacy.require_gpu

Allocate data and perform operations on GPU. Will raise an error if no GPU is available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any models.

Example

import spacy
spacy.require_gpu()
nlp = spacy.load("en_core_web_sm")

Name	Type	Description
RETURNS	bool	`True`

displaCy

As of v2.0, spaCy comes with a built-in visualization suite. For more info and examples, see the usage guide on visualizing spaCy.

displacy.serve

Serve a dependency parse tree or named entity visualization to view it in your browser. Will run a simple web server.

Example

import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc1 = nlp(u"This is a sentence.")
doc2 = nlp(u"This is another sentence.")
displacy.serve([doc1, doc2], style="dep")

Name	Type	Description	Default
`docs`	list, `Doc`, `Span`	Document(s) to visualize.
`style`	unicode	Visualization style, `'dep'` or `'ent'`.	`'dep'`
`page`	bool	Render markup as full HTML page.	`True`
`minify`	bool	Minify HTML markup.	`False`
`options`	dict	Visualizer-specific options, e.g. colors.	`{}`
`manual`	bool	Don't parse `Doc` and instead, expect a dict or list of dicts. See here for formats and examples.	`False`
`port`	int	Port to serve visualization.	`5000`
`host`	unicode	Host to serve visualization.	`'0.0.0.0'`

displacy.render

Render a dependency parse tree or named entity visualization.

Example

import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is a sentence.")
html = displacy.render(doc, style="dep")

Name	Type	Description	Default
`docs`	list, `Doc`, `Span`	Document(s) to visualize.
`style`	unicode	Visualization style, `'dep'` or `'ent'`.	`'dep'`
`page`	bool	Render markup as full HTML page.	`False`
`minify`	bool	Minify HTML markup.	`False`
`jupyter`	bool	Explicitly enable "Jupyter mode" to return markup ready to be rendered in a notebook.	detected automatically
`options`	dict	Visualizer-specific options, e.g. colors.	`{}`
`manual`	bool	Don't parse `Doc` and instead, expect a dict or list of dicts. See here for formats and examples.	`False`
RETURNS	unicode	Rendered HTML markup.

Visualizer options

The options argument lets you specify additional settings for each visualizer. If a setting is not present in the options, the default value will be used.

Dependency Visualizer options

Example

options = {"compact": True, "color": "blue"}
displacy.serve(doc, style="dep", options=options)

Name	Type	Description	Default
`fine_grained`	bool	Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`).	`False`
`collapse_punct`	bool	Attach punctuation to tokens. This can make the parse more readable, as it avoids long arcs that only attach punctuation.	`True`
`collapse_phrases`	bool	Merge noun phrases into one token.	`False`
`compact`	bool	"Compact mode" with square arrows that takes up less space.	`False`
`color`	unicode	Text color (HEX, RGB or color names).	`'#000000'`
`bg`	unicode	Background color (HEX, RGB or color names).	`'#ffffff'`
`font`	unicode	Font name or font family for all text.	`'Arial'`
`offset_x`	int	Spacing on left side of the SVG in px.	`50`
`arrow_stroke`	int	Width of arrow path in px.	`2`
`arrow_width`	int	Width of arrow head in px.	`10` / `8` (compact)
`arrow_spacing`	int	Spacing between arrows in px to avoid overlaps.	`20` / `12` (compact)
`word_spacing`	int	Vertical spacing between words and arcs in px.	`45`
`distance`	int	Distance between words in px.	`175` / `150` (compact)
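
The way missing settings fall back to defaults, including the values that depend on compact mode, could look something like this. It is a sketch based only on the table above: `resolve_dep_options` is a hypothetical helper, not part of displaCy.

```python
def resolve_dep_options(options=None):
    """Fill in dependency visualizer defaults (values taken from the table above)."""
    options = dict(options or {})
    compact = options.get("compact", False)
    settings = {
        "fine_grained": False,
        "collapse_punct": True,
        "collapse_phrases": False,
        "compact": compact,
        "color": "#000000",
        "bg": "#ffffff",
        "font": "Arial",
        "offset_x": 50,
        "arrow_stroke": 2,
        "arrow_width": 8 if compact else 10,      # smaller arrow heads in compact mode
        "arrow_spacing": 12 if compact else 20,
        "word_spacing": 45,
        "distance": 150 if compact else 175,
    }
    settings.update(options)  # user-supplied settings take precedence
    return settings


settings = resolve_dep_options({"compact": True, "color": "blue"})
```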

Named Entity Visualizer options

Example

options = {"ents": ["PERSON", "ORG", "PRODUCT"],
           "colors": {"ORG": "yellow"}}
displacy.serve(doc, style="ent", options=options)

Name	Type	Description	Default
`ents`	list	Entity types to highlight (`None` for all types).	`None`
`colors`	dict	Color overrides. Entity types in uppercase should be mapped to color names or values.	`{}`

By default, displaCy comes with colors for all entity types supported by spaCy. If you're using custom entity types, you can use the colors setting to add your own colors for them.

Utility functions

spaCy comes with a small collection of utility functions located in spacy/util.py. Because utility functions are mostly intended for internal use within spaCy, their behavior may change with future releases. The functions documented on this page should be safe to use and we'll try to ensure backwards compatibility. However, we recommend having additional tests in place if your application depends on any of spaCy's utilities.

util.get_data_path

Get path to the data directory where spaCy looks for models. Defaults to spacy/data.

Name	Type	Description
`require_exists`	bool	Only return path if it exists, otherwise return `None`.
RETURNS	`Path` / `None`	Data path or `None`.

util.set_data_path

Set custom path to the data directory where spaCy looks for models.

Example

util.set_data_path("/custom/path")
util.get_data_path()
# PosixPath('/custom/path')

Name	Type	Description
`path`	unicode / `Path`	Path to new data directory.

util.get_lang_class

Import and load a Language class. Allows lazy-loading language data and importing languages using the two-letter language code. To add a language code for a custom language class, you can use the set_lang_class helper.

Example

for lang_id in ["en", "de"]:
    lang_class = util.get_lang_class(lang_id)
    lang = lang_class()
    tokenizer = lang.Defaults.create_tokenizer()

Name	Type	Description
`lang`	unicode	Two-letter language code, e.g. `'en'`.
RETURNS	`Language`	Language class.

util.set_lang_class

Set a custom Language class name that can be loaded via get_lang_class. If your model uses a custom language, this is required so that spaCy can load the correct class from the two-letter language code.

Example

from spacy.lang.xy import CustomLanguage

util.set_lang_class('xy', CustomLanguage)
lang_class = util.get_lang_class('xy')
nlp = lang_class()

Name	Type	Description
`name`	unicode	Two-letter language code, e.g. `'en'`.
`cls`	`Language`	The language class, e.g. `English`.
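
The interplay between the two helpers can be pictured as a small registry with a lazy import on a cache miss. The sketch below is a simplified stand-alone version, not spaCy's source; the `LANGUAGES` dict, the `package` argument and the convention that a language module exposes its class as the first entry of `__all__` are assumptions made for illustration.

```python
import importlib

LANGUAGES = {}  # cache: language code -> Language class


def set_lang_class(name, cls):
    """Register `cls` under the given language code."""
    LANGUAGES[name] = cls


def get_lang_class(lang, package="spacy.lang"):
    """Return the class for `lang`, importing its module only on first use."""
    if lang not in LANGUAGES:
        # Assumption: the module `<package>.<lang>` lists its class first in __all__.
        module = importlib.import_module("{}.{}".format(package, lang))
        set_lang_class(lang, getattr(module, module.__all__[0]))
    return LANGUAGES[lang]


class CustomLanguage:  # hypothetical stand-in for a Language subclass
    lang = "xy"


set_lang_class("xy", CustomLanguage)
```

Because `"xy"` was registered explicitly, `get_lang_class("xy")` returns `CustomLanguage` without attempting any import.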

util.load_model

Load a model from a shortcut link, package or data path. If called with a shortcut link or package name, spaCy will assume the model is a Python package and import and call its load() method. If called with a path, spaCy will assume it's a data directory, read the language and pipeline settings from the meta.json and initialize a Language class. The model data will then be loaded in via Language.from_disk().

Example

nlp = util.load_model("en")
nlp = util.load_model("en_core_web_sm", disable=["ner"])
nlp = util.load_model("/path/to/data")

Name	Type	Description
`name`	unicode	Package name, shortcut link or model path.
`**overrides`	-	Specific overrides, like pipeline components to disable.
RETURNS	`Language`	`Language` class with the loaded model.

util.load_model_from_path

Load a model from a data directory path. Creates the Language class and pipeline based on the directory's meta.json and then calls from_disk() with the path. This function also makes it easy to test a new model that you haven't packaged yet.

Example

nlp = util.load_model_from_path("/path/to/data")

Name	Type	Description
`model_path`	unicode	Path to model data directory.
`meta`	dict	Model meta data. If `False`, spaCy will try to load the meta from a meta.json in the same directory.
`**overrides`	-	Specific overrides, like pipeline components to disable.
RETURNS	`Language`	`Language` class with the loaded model.

util.load_model_from_init_py

A helper function to use in the load() method of a model package's __init__.py.

Example

from spacy.util import load_model_from_init_py

def load(**overrides):
    return load_model_from_init_py(__file__, **overrides)

Name	Type	Description
`init_file`	unicode	Path to model's `__init__.py`, i.e. `__file__`.
`**overrides`	-	Specific overrides, like pipeline components to disable.
RETURNS	`Language`	`Language` class with the loaded model.
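
One way to picture what such a helper has to do is to locate the data directory that ships next to the package's `__init__.py`. The sketch below assumes the data lives in a folder named `<lang>_<name>-<version>`; that layout, and the helper name `model_data_path`, are assumptions for illustration rather than something stated on this page.

```python
from pathlib import Path


def model_data_path(init_file, meta):
    """Locate the data directory next to a model package's __init__.py.

    Assumption: the folder is named '<lang>_<name>-<version>'.
    """
    model_path = Path(init_file).parent
    data_dir = "{}_{}-{}".format(meta["lang"], meta["name"], meta["version"])
    return model_path / data_dir
```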

util.get_model_meta

Get a model's meta.json from a directory path and validate its contents.

Example

meta = util.get_model_meta("/path/to/model")

Name	Type	Description
`path`	unicode / `Path`	Path to model directory.
RETURNS	dict	The model's meta data.
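
Reading and validating a meta.json might look roughly like the following. This is a minimal sketch, not spaCy's implementation: `read_model_meta` is a hypothetical name, and the set of required keys (`lang`, `name`, `version`) is an assumption.

```python
import json
from pathlib import Path

REQUIRED_SETTINGS = ("lang", "name", "version")  # assumed required keys


def read_model_meta(path):
    """Read meta.json from a model directory and check a few required keys."""
    model_path = Path(path)
    if not model_path.exists():
        raise IOError("Can't find model directory: {}".format(model_path))
    meta_path = model_path / "meta.json"
    if not meta_path.is_file():
        raise IOError("Could not read meta.json from {}".format(meta_path))
    with meta_path.open(encoding="utf8") as file_:
        meta = json.load(file_)
    for setting in REQUIRED_SETTINGS:
        if not meta.get(setting):  # missing or empty value
            raise ValueError("No valid '{}' setting found in meta.json".format(setting))
    return meta
```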

util.is_package

Check if string maps to a package installed via pip. Mainly used to validate model packages.

Example

util.is_package("en_core_web_sm") # True
util.is_package("xyz") # False

Name	Type	Description
`name`	unicode	Name of package.
RETURNS	`bool`	`True` if installed package, `False` if not.
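
A comparable check can be written with the standard library alone. The sketch below uses `importlib.metadata` (Python 3.8+), which is not necessarily how spaCy implements `is_package`; the function name `is_installed_package` is made up for this example.

```python
from importlib import metadata  # available from Python 3.8


def is_installed_package(name):
    """Return True if a distribution called `name` is installed."""
    try:
        metadata.version(name)
    except metadata.PackageNotFoundError:
        return False
    return True
```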

util.get_package_path

Get path to an installed package. Mainly used to resolve the location of model packages. Currently imports the package to find its path.

Example

util.get_package_path("en_core_web_sm")
# /usr/lib/python3.6/site-packages/en_core_web_sm

Name	Type	Description
`package_name`	unicode	Name of installed package.
RETURNS	`Path`	Path to model package directory.

util.is_in_jupyter

Check if user is running spaCy from a Jupyter notebook by detecting the IPython kernel. Mainly used for the displacy visualizer.

Example

html = "<h1>Hello world!</h1>"
if util.is_in_jupyter():
    from IPython.core.display import display, HTML
    display(HTML(html))

Name	Type	Description
RETURNS	bool	`True` if in Jupyter, `False` if not.
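
Detecting the IPython kernel usually comes down to checking whether `get_ipython` is defined. A minimal sketch, assuming notebook kernels expose a shell class named `ZMQInteractiveShell` (the detail spaCy actually inspects may differ, and `in_jupyter_kernel` is a hypothetical name):

```python
def in_jupyter_kernel():
    """Best-effort check for an IPython kernel, as used by Jupyter."""
    try:
        shell = get_ipython()  # noqa: F821 (only defined inside IPython)
    except NameError:
        return False  # plain Python interpreter
    # Assumption: notebook kernels use the ZMQ-based interactive shell.
    return shell.__class__.__name__ == "ZMQInteractiveShell"
```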

util.update_exc

Update, validate and overwrite tokenizer exceptions. Used to combine global exceptions with custom, language-specific exceptions. Will raise an error if key doesn't match ORTH values.

Example

from spacy.symbols import ORTH, LEMMA

BASE = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
NEW = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
exceptions = util.update_exc(BASE, NEW)
# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}

Name	Type	Description
`base_exceptions`	dict	Base tokenizer exceptions.
`*addition_dicts`	dicts	Exception dictionaries to add to the base exceptions, in order.
RETURNS	dict	Combined tokenizer exceptions.
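
The merge-and-validate behavior can be sketched without spaCy as follows. Plain strings stand in for the `ORTH` and `LEMMA` symbols, `merge_exceptions` is a hypothetical name, and the check that a key must equal its joined `ORTH` values is this sketch's reading of "key doesn't match ORTH values".

```python
ORTH, LEMMA = "orth", "lemma"  # plain-string stand-ins for spaCy's symbols


def merge_exceptions(base_exceptions, *addition_dicts):
    """Merge exception dicts in order, validating each key against its ORTH values."""
    exceptions = dict(base_exceptions)  # leave the base dict untouched
    for additions in addition_dicts:
        for orth, token_attrs in additions.items():
            described = "".join(attr[ORTH] for attr in token_attrs)
            if described != orth:
                raise ValueError(
                    "Key {!r} doesn't match ORTH values {!r}".format(orth, described)
                )
        exceptions.update(additions)  # later dicts overwrite earlier entries
    return exceptions


BASE = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
NEW = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
combined = merge_exceptions(BASE, NEW)
```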

util.compile_prefix_regex

Compile a sequence of prefix rules into a regex object.

Example

prefixes = ("§", "%", "=", r"\+")
prefix_regex = util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

Name	Type	Description
`entries`	tuple	The prefix rules, e.g. `lang.punctuation.TOKENIZER_PREFIXES`.
RETURNS	regex	The regex object, to be used for `Tokenizer.prefix_search`.
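
To see why the result is used via `.search`, it can help to build a similar pattern by hand. The sketch below anchors every rule at the start of the string and joins them into one alternation; it approximates the idea with the standard `re` module and is not necessarily identical to spaCy's implementation (`build_prefix_regex` is a hypothetical name).

```python
import re


def build_prefix_regex(entries):
    """Join prefix rules into one pattern, each anchored at the string start."""
    expression = "|".join("^" + piece for piece in entries if piece.strip())
    return re.compile(expression)


prefix_regex = build_prefix_regex(("§", "%", "=", r"\+"))
match = prefix_regex.search("§12")
```

Here `match.group()` is `"§"`, while a string such as `"12§"` produces no match because the rule only applies at the start.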

util.compile_suffix_regex

Compile a sequence of suffix rules into a regex object.

Example

suffixes = ("'s", "'S", r"(?<=[0-9])\+")
suffix_regex = util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

Name	Type	Description
`entries`	tuple	The suffix rules, e.g. `lang.punctuation.TOKENIZER_SUFFIXES`.
RETURNS	regex	The regex object, to be used for `Tokenizer.suffix_search`.

util.compile_infix_regex

Compile a sequence of infix rules into a regex object.

Example

infixes = ("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])")
infix_regex = util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

Name	Type	Description
`entries`	tuple	The infix rules, e.g. `lang.punctuation.TOKENIZER_INFIXES`.
RETURNS	regex	The regex object, to be used for `Tokenizer.infix_finditer`.
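
Unlike prefixes and suffixes, infixes can occur anywhere in a string, which is why the tokenizer uses `finditer` here. A rough stand-alone equivalent, using a subset of the rules from the example above and a hypothetical `build_infix_regex` helper (a sketch, not spaCy's code):

```python
import re


def build_infix_regex(entries):
    """Join infix rules into one alternation that may match anywhere."""
    expression = "|".join(piece for piece in entries if piece.strip())
    return re.compile(expression)


infix_regex = build_infix_regex(("…", "-", r"(?<=[0-9])[+\-\*^](?=[0-9-])"))
# Character offsets of every infix found in the string
spans = [m.span() for m in infix_regex.finditer("well-known 3+4")]
```

The hyphen matches on its own, whereas the plus sign only matches because it sits between two digits.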

util.minibatch

Iterate over batches of items. size may be an iterator, so that batch-size can vary on each step.

Example

batches = util.minibatch(train_data)
for batch in batches:
    texts, annotations = zip(*batch)
    nlp.update(texts, annotations)

Name	Type	Description
`items`	iterable	The items to batch up.
`size`	int / iterable	The batch size(s). Use `util.compounding` or `util.decaying` for an infinite series of compounding or decaying values.
YIELDS	list	The batches.
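
Batching with a size that changes on each step can be sketched with `itertools.islice`. This is a simplified illustration, not spaCy's implementation; `iter_minibatches` and its default size of 8 are assumptions, and in this version batching simply stops if a finite iterator of sizes runs out.

```python
import itertools


def iter_minibatches(items, size=8):
    """Yield lists of items; `size` may be an int or an iterator of sizes."""
    sizes = itertools.repeat(size) if isinstance(size, int) else iter(size)
    items = iter(items)
    for batch_size in sizes:
        batch = list(itertools.islice(items, int(batch_size)))
        if not batch:  # no items left
            return
        yield batch


fixed = list(iter_minibatches(range(10), size=4))
varying = list(iter_minibatches(range(10), size=iter([1, 2, 4, 8])))
```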

util.compounding

Yield an infinite series of compounding values. Each time the generator is called, a value is produced by multiplying the previous value by the compound rate.

Example

sizes = util.compounding(1., 10., 1.5)
assert next(sizes) == 1.
assert next(sizes) == 1. * 1.5
assert next(sizes) == 1.5 * 1.5

Name	Type	Description
`start`	int / float	The first value.
`stop`	int / float	The maximum value.
`compound`	int / float	The compounding factor.
YIELDS	int	Compounding values.
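
The series can be reproduced with a short generator. The sketch below multiplies by the compound rate on each step and clips at `stop`, following the parameter descriptions above; `compounding_values` is a hypothetical stand-alone function and may differ in detail from spaCy's own.

```python
def compounding_values(start, stop, compound):
    """Yield start, start * compound, ... without ever passing `stop`."""
    def clip(value):
        return min(value, stop) if start <= stop else max(value, stop)

    curr = float(start)
    while True:
        yield clip(curr)
        curr *= compound


sizes = compounding_values(1.0, 10.0, 1.5)
first = [next(sizes) for _ in range(8)]
```

With these arguments the values grow from 1.0 and stay at 10.0 once the limit is reached, which is what makes such a generator usable as the `size` argument for batching.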

util.decaying

Yield an infinite series of linearly decaying values.

Example

sizes = util.decaying(1., 10., 0.001)
assert next(sizes) == 1.
assert next(sizes) == 1. - 0.001
assert next(sizes) == 0.999 - 0.001

Name	Type	Description
`start`	int / float	The first value.
`end`	int / float	The maximum value.
`decay`	int / float	The decaying factor.
YIELDS	int	The decaying values.

util.itershuffle

Shuffle an iterator. This works by holding bufsize items back and yielding them sometime later. The result is not unbiased, but it should be good enough for batching. A larger bufsize means less bias.

Example

values = range(1000)
shuffled = util.itershuffle(values)

Name	Type	Description
`iterable`	iterable	Iterator to shuffle.
`bufsize`	int	Items to hold back.
YIELDS	iterable	The shuffled iterator.
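
The hold-back idea can be illustrated with a bounded buffer: once the buffer is full, a random element is released for every new item, and the remainder is shuffled and flushed at the end. This is a sketch of the technique rather than spaCy's exact code; `buffered_shuffle` and its `seed` argument are additions made for this example.

```python
import random


def buffered_shuffle(iterable, bufsize=1000, seed=None):
    """Yield items in a roughly shuffled order using a bounded buffer."""
    rng = random.Random(seed)  # `seed` is only here to make the sketch repeatable
    buffer = []
    for item in iterable:
        buffer.append(item)
        if len(buffer) > bufsize:
            # Move a random element to the end and release it.
            index = rng.randrange(len(buffer))
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    rng.shuffle(buffer)  # flush whatever is still held back
    yield from buffer


shuffled = list(buffered_shuffle(range(1000), bufsize=100, seed=0))
```

Every item is yielded exactly once, but an item can never appear more than `bufsize` positions earlier than its original position, which is the source of the bias mentioned above.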

Compatibility functions

All Python code is written in an intersection of Python 2 and Python 3. This is easy in Cython, but somewhat ugly in Python. Logic that deals with Python or platform compatibility only lives in spacy.compat. To distinguish them from the builtin functions, replacement functions are suffixed with an underscore, e.g. unicode_.

Example

from spacy.compat import unicode_

compatible_unicode = unicode_("hello world")

Name	Python 2	Python 3
`compat.bytes_`	`str`	`bytes`
`compat.unicode_`	`unicode`	`str`
`compat.basestring_`	`basestring`	`str`
`compat.input_`	`raw_input`	`input`
`compat.path2str`	`str(path)` with `.decode('utf8')`	`str(path)`

compat.is_config

Check if a specific configuration of Python version and operating system matches the user's setup. Mostly used to display targeted error messages.

Example

from spacy.compat import is_config

if is_config(python2=True, windows=True):
    print("You are using Python 2 on Windows.")

Name	Type	Description
`python2`	bool	spaCy is executed with Python 2.x.
`python3`	bool	spaCy is executed with Python 3.x.
`windows`	bool	spaCy is executed on Windows.
`linux`	bool	spaCy is executed on Linux.
`osx`	bool	spaCy is executed on OS X or macOS.
RETURNS	bool	Whether the specified configuration matches the user's platform.
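
A check of this kind can be built from `sys.version_info` and `sys.platform`. In the sketch below, only the flags that are actually passed are compared against the detected values; the function name `matches_config` and the way each platform is detected are assumptions, not a copy of spacy.compat.

```python
import sys

is_python2 = sys.version_info[0] == 2
is_python3 = sys.version_info[0] == 3
is_windows = sys.platform.startswith("win")
is_linux = sys.platform.startswith("linux")
is_osx = sys.platform == "darwin"


def matches_config(python2=None, python3=None, windows=None, linux=None, osx=None):
    """Return True if every flag that was passed equals the detected value."""
    return (
        python2 in (None, is_python2)
        and python3 in (None, is_python3)
        and windows in (None, is_windows)
        and linux in (None, is_linux)
        and osx in (None, is_osx)
    )
```

Flags left at `None` are ignored, so `matches_config()` is always `True` and `matches_config(python2=True, windows=True)` is only `True` on Python 2 running on Windows.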
