---
title: Top-level Functions
menu:
  - ['spacy', 'spacy']
  - ['displacy', 'displacy']
  - ['registry', 'registry']
  - ['Batchers', 'batchers']
  - ['Data & Alignment', 'gold']
  - ['Utility Functions', 'util']
---
|
|
|
|
## spaCy {#spacy hidden="true"}
|
|
|
|
### spacy.load {#spacy.load tag="function" model="any"}
|
|
|
|
Load a model using the name of an installed
|
|
[model package](/usage/training#models-generating), a string path or a
|
|
`Path`-like object. spaCy will try resolving the load argument in this order. If
|
|
a model is loaded from a model name, spaCy will assume it's a Python package and
|
|
import it and call the model's own `load()` method. If a model is loaded from a
|
|
path, spaCy will assume it's a data directory, load its
|
|
[`config.cfg`](/api/data-formats#config) and use the language and pipeline
|
|
information to construct the `Language` class. The data will be loaded in via
|
|
[`Language.from_disk`](/api/language#from_disk).
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> nlp = spacy.load("en_core_web_sm") # package
|
|
> nlp = spacy.load("/path/to/en") # string path
|
|
> nlp = spacy.load(Path("/path/to/en")) # pathlib Path
|
|
>
|
|
> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
|
| `name` | Model to load, i.e. package name or path. ~~Union[str, Path]~~ |
|
|
| _keyword-only_ | |
|
|
| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). ~~List[str]~~ |
|
|
| `config` <Tag variant="new">3</Tag> | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~ |
|
|
| **RETURNS** | A `Language` object with the loaded model. ~~Language~~ |
|
|
|
|
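The two accepted shapes for `config` overrides (nested dict, or flat dict with keys in dot notation) carry the same information. The following is a minimal sketch of how a dot-notation dict maps onto a nested one. The helper `dot_to_nested` and the override keys are hypothetical and only illustrate the idea; this is not spaCy's own implementation.

```python
from typing import Any, Dict


def dot_to_nested(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand keys like "components.name.value" into nested dicts."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Hypothetical override keys, used only for illustration.
flat_overrides = {"components.name.value": 1, "components.name.other": "x"}
nested_overrides = dot_to_nested(flat_overrides)
print(nested_overrides)
```

Either form could then be passed as the `config` argument described in the table above.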
Essentially, `spacy.load()` is a convenience wrapper that reads the model's
|
|
[`config.cfg`](/api/data-formats#config), uses the language and pipeline
|
|
information to construct a `Language` object, loads in the model data and
|
|
returns it.
|
|
|
|
```python
### Abstract example
cls = util.get_lang_class(lang)  # get language for ID, e.g. "en"
nlp = cls()                      # initialize the language
for name in pipeline:
    nlp.add_pipe(name)           # add component to pipeline
nlp.from_disk(model_data_path)   # load in model data
```
|
|
|
|
### spacy.blank {#spacy.blank tag="function" new="2"}
|
|
|
|
Create a blank model of a given language class. This function is the twin of
|
|
`spacy.load()`.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> nlp_en = spacy.blank("en") # equivalent to English()
|
|
> nlp_de = spacy.blank("de") # equivalent to German()
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | -------------------------------------------------------------------------------------------------------- |
|
|
| `name` | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. ~~str~~ |
|
|
| **RETURNS** | An empty `Language` object of the appropriate subclass. ~~Language~~ |
|
|
|
|
### spacy.info {#spacy.info tag="function"}
|
|
|
|
The same as the [`info` command](/api/cli#info). Pretty-print information about
|
|
your installation, models and local setup from within spaCy. To get the model
|
|
meta data as a dictionary instead, you can use the `meta` attribute on your
|
|
`nlp` object with a loaded model, e.g. `nlp.meta`.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> spacy.info()
|
|
> spacy.info("en_core_web_sm")
|
|
> markdown = spacy.info(markdown=True, silent=True)
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| -------------- | ------------------------------------------------------------------ |
|
|
| `model` | A model, i.e. a package name or path (optional). ~~Optional[str]~~ |
|
|
| _keyword-only_ | |
|
|
| `markdown` | Print information as Markdown. ~~bool~~ |
|
|
| `silent` | Don't print anything, just return. ~~bool~~ |
|
|
|
|
### spacy.explain {#spacy.explain tag="function"}
|
|
|
|
Get a description for a given POS tag, dependency label or entity type. For a
|
|
list of available terms, see
|
|
[`glossary.py`](https://github.com/explosion/spaCy/tree/master/spacy/glossary.py).
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
> spacy.explain("NORP")
> # Nationalities or religious or political groups
>
> doc = nlp("Hello world")
> for word in doc:
>     print(word.text, word.tag_, spacy.explain(word.tag_))
> # Hello UH interjection
> # world NN noun, singular or mass
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | -------------------------------------------------------------------------- |
|
|
| `term` | Term to explain. ~~str~~ |
|
|
| **RETURNS** | The explanation, or `None` if not found in the glossary. ~~Optional[str]~~ |
|
|
|
|
### spacy.prefer_gpu {#spacy.prefer_gpu tag="function" new="2.0.14"}
|
|
|
|
Allocate data and perform operations on [GPU](/usage/#gpu), if available. If
|
|
data has already been allocated on CPU, it will not be moved. Ideally, this
|
|
function should be called right after importing spaCy and _before_ loading any
|
|
models.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> import spacy
|
|
> activated = spacy.prefer_gpu()
|
|
> nlp = spacy.load("en_core_web_sm")
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | --------------------------------------- |
|
|
| **RETURNS** | Whether the GPU was activated. ~~bool~~ |
|
|
|
|
### spacy.require_gpu {#spacy.require_gpu tag="function" new="2.0.14"}
|
|
|
|
Allocate data and perform operations on [GPU](/usage/#gpu). Will raise an error
|
|
if no GPU is available. If data has already been allocated on CPU, it will not
|
|
be moved. Ideally, this function should be called right after importing spaCy
|
|
and _before_ loading any models.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> import spacy
|
|
> spacy.require_gpu()
|
|
> nlp = spacy.load("en_core_web_sm")
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | --------------- |
|
|
| **RETURNS** | `True` ~~bool~~ |
|
|
|
|
## displaCy {#displacy source="spacy/displacy"}
|
|
|
|
As of v2.0, spaCy comes with a built-in visualization suite. For more info and
|
|
examples, see the usage guide on [visualizing spaCy](/usage/visualizers).
|
|
|
|
### displacy.serve {#displacy.serve tag="method" new="2"}
|
|
|
|
Serve a dependency parse tree or named entity visualization to view it in your
|
|
browser. Will run a simple web server.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> import spacy
|
|
> from spacy import displacy
|
|
> nlp = spacy.load("en_core_web_sm")
|
|
> doc1 = nlp("This is a sentence.")
|
|
> doc2 = nlp("This is another sentence.")
|
|
> displacy.serve([doc1, doc2], style="dep")
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
|
| `docs` | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span]], Doc, Span]~~ |
|
|
| `style` | Visualization style, `"dep"` or `"ent"`. Defaults to `"dep"`. ~~str~~ |
|
|
| `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
|
|
| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
|
|
| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
|
|
| `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
|
|
| `port` | Port to serve visualization. Defaults to `5000`. ~~int~~ |
|
|
| `host` | Host to serve visualization. Defaults to `"0.0.0.0"`. ~~str~~ |
|
|
|
|
### displacy.render {#displacy.render tag="method" new="2"}
|
|
|
|
Render a dependency parse tree or named entity visualization.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> import spacy
|
|
> from spacy import displacy
|
|
> nlp = spacy.load("en_core_web_sm")
|
|
> doc = nlp("This is a sentence.")
|
|
> html = displacy.render(doc, style="dep")
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `docs` | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span]], Doc, Span]~~ |
|
|
| `style` | Visualization style, `"dep"` or `"ent"`. Defaults to `"dep"`. ~~str~~ |
|
|
| `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
|
|
| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
|
|
| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
|
|
| `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
|
|
| `jupyter` | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None` (default). ~~Optional[bool]~~ |
|
|
| **RETURNS** | The rendered HTML markup. ~~str~~ |
|
|
|
|
### Visualizer options {#displacy_options}
|
|
|
|
The `options` argument lets you specify additional settings for each visualizer.
|
|
If a setting is not present in the options, the default value will be used.
|
|
|
|
#### Dependency Visualizer options {#options-dep}
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> options = {"compact": True, "color": "blue"}
|
|
> displacy.serve(doc, style="dep", options=options)
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `fine_grained` | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). Defaults to `False`. ~~bool~~ |
|
|
| `add_lemma` <Tag variant="new">2.2.4</Tag> | Print the lemmas in a separate row below the token texts. Defaults to `False`. ~~bool~~ |
|
|
| `collapse_punct` | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. Defaults to `True`. ~~bool~~ |
|
|
| `collapse_phrases` | Merge noun phrases into one token. Defaults to `False`. ~~bool~~ |
|
|
| `compact` | "Compact mode" with square arrows that takes up less space. Defaults to `False`. ~~bool~~ |
|
|
| `color` | Text color (HEX, RGB or color names). Defaults to `"#000000"`. ~~str~~ |
|
|
| `bg` | Background color (HEX, RGB or color names). Defaults to `"#ffffff"`. ~~str~~ |
|
|
| `font` | Font name or font family for all text. Defaults to `"Arial"`. ~~str~~ |
|
|
| `offset_x` | Spacing on left side of the SVG in px. Defaults to `50`. ~~int~~ |
|
|
| `arrow_stroke` | Width of arrow path in px. Defaults to `2`. ~~int~~ |
|
|
| `arrow_width` | Width of arrow head in px. Defaults to `10` in regular mode and `8` in compact mode. ~~int~~ |
|
|
| `arrow_spacing` | Spacing between arrows in px to avoid overlaps. Defaults to `20` in regular mode and `12` in compact mode. ~~int~~ |
|
|
| `word_spacing` | Vertical spacing between words and arcs in px. Defaults to `45`. ~~int~~ |
|
|
| `distance` | Distance between words in px. Defaults to `175` in regular mode and `150` in compact mode. ~~int~~ |
|
|
|
|
#### Named Entity Visualizer options {#displacy_options-ent}
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> options = {"ents": ["PERSON", "ORG", "PRODUCT"],
|
|
> "colors": {"ORG": "yellow"}}
|
|
> displacy.serve(doc, style="ent", options=options)
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `ents` | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~ |
|
|
| `colors` | Color overrides. Entity types in uppercase should be mapped to color names or values. ~~Dict[str, str]~~ |
|
|
| `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |
|
|
|
|
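Since the `template` option is described as a format string with the placeholders `{bg}`, `{text}` and `{label}`, filling it in can be sketched with plain `str.format`. The markup in `TEMPLATE` and the helper `render_entity` are hypothetical; spaCy's real default templates live in `templates.py` and look different.

```python
# Hypothetical template for illustration only; see templates.py for the real ones.
TEMPLATE = (
    '<mark class="entity" style="background: {bg}">'
    "{text} <span>{label}</span></mark>"
)


def render_entity(text: str, label: str, bg: str) -> str:
    """Fill the three documented placeholders of an entity template."""
    return TEMPLATE.format(bg=bg, text=text, label=label)


html = render_entity("London", "LOC", "#ff9561")
print(html)

# The same string could be supplied through the documented "template" option.
options = {"template": TEMPLATE}
```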
By default, displaCy comes with colors for all entity types used by
|
|
[spaCy models](/models). If you're using custom entity types, you can use the
|
|
`colors` setting to add your own colors for them. Your application or model
|
|
package can also expose a
|
|
[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
|
|
to add custom labels and their colors automatically.
|
|
|
|
## registry {#registry source="spacy/util.py" new="3"}
|
|
|
|
spaCy's function registry extends
|
|
[Thinc's `registry`](https://thinc.ai/docs/api-config#registry) and allows you
|
|
to map strings to functions. You can register functions to create architectures,
|
|
optimizers, schedules and more, and then refer to them and set their arguments
|
|
in your [config file](/usage/training#config). Python type hints are used to
|
|
validate the inputs. See the
|
|
[Thinc docs](https://thinc.ai/docs/api-config#registry) for details on the
|
|
`registry` methods and our helper library
|
|
[`catalogue`](https://github.com/explosion/catalogue) for some background on the
|
|
concept of function registries. spaCy also uses the function registry for
|
|
language subclasses, model architecture, lookups and pipeline component
|
|
factories.
|
|
|
|
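The idea of mapping strings to functions can be sketched in plain Python. The `MiniRegistry` class below is hypothetical and far simpler than Thinc's `registry` or `catalogue` (no type validation, no entry points). It registers the `waltzing` schedule used in this section's example and looks it up by name.

```python
from itertools import islice
from typing import Callable, Dict, Iterator


class MiniRegistry:
    """Toy string-to-function registry (not spaCy's or catalogue's code)."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable] = {}

    def register(self, name: str) -> Callable[[Callable], Callable]:
        def decorator(func: Callable) -> Callable:
            self._functions[name] = func
            return func

        return decorator

    def get(self, name: str) -> Callable:
        if name not in self._functions:
            raise KeyError(f"Unknown registered function: {name}")
        return self._functions[name]


schedules = MiniRegistry()


@schedules.register("waltzing.v1")
def waltzing() -> Iterator[float]:
    i = 0
    while True:
        yield i % 3 + 1
        i += 1


# Resolve the function by its string name, as a config file would.
make_schedule = schedules.get("waltzing.v1")
first_values = list(islice(make_schedule(), 6))
print(first_values)  # [1, 2, 3, 1, 2, 3]
```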
> #### Example
|
|
>
|
|
> ```python
> from typing import Iterator
> import spacy
>
> @spacy.registry.schedules("waltzing.v1")
> def waltzing() -> Iterator[float]:
>     i = 0
>     while True:
>         yield i % 3 + 1
>         i += 1
> ```
|
|
|
|
| Registry name | Description |
|
|
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
|
|
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
|
|
| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
|
|
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
|
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
|
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
|
| `assets` | Registry for data assets, knowledge bases etc. |
|
|
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
|
|
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
|
|
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
|
|
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
|
|
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
|
|
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
|
|
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
|
|
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
|
|
|
|
### spacy-transformers registry {#registry-transformers}
|
|
|
|
The following registries are added by the
|
|
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package.
|
|
See the [`Transformer`](/api/transformer) API reference and
|
|
[usage docs](/usage/embeddings-transformers) for details.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
> import spacy_transformers
>
> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
> def configure_custom_annotation_setter():
>     def annotation_setter(docs, trf_data) -> None:
>         # Set annotations on the docs
>         pass
>
>     return annotation_setter
> ```
|
|
|
|
| Registry name | Description |
|
|
| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
|
|
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
|
|
|
|
## Batchers {#batchers source="spacy/gold/batchers.py" new="3"}
|
|
|
|
A data batcher implements a batching strategy that essentially turns a stream of
|
|
items into a stream of batches, with each batch consisting of one item or a list
|
|
of items. During training, the models update their weights after processing one
|
|
batch at a time. Typical batching strategies include presenting the training
|
|
data as a stream of batches with similar sizes, or with increasing batch sizes.
|
|
See the Thinc documentation on
|
|
[`schedules`](https://thinc.ai/docs/api-schedules) for a few standard examples.
|
|
|
|
Instead of using one of the built-in batchers listed here, you can also
|
|
[implement your own](/usage/training#custom-code-readers-batchers), which may or
|
|
may not use a custom schedule.
|
|
|
|
#### batch_by_words.v1 {#batch_by_words tag="registered function"}
|
|
|
|
Create minibatches of roughly a given number of words. If any examples are
|
|
longer than the specified batch length, they will appear in a batch by
|
|
themselves, or be discarded if `discard_oversize` is set to `True`. The argument
|
|
`seqs` can be a list of strings, [`Doc`](/api/doc) objects or
|
|
[`Example`](/api/example) objects.
|
|
|
|
> #### Example config
|
|
>
|
|
> ```ini
|
|
> [training.batcher]
|
|
> @batchers = "batch_by_words.v1"
|
|
> size = 100
|
|
> tolerance = 0.2
|
|
> discard_oversize = false
|
|
> get_length = null
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `seqs` | The sequences to minibatch. ~~Iterable[Any]~~ |
|
|
| `size` | The target number of words per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ |
|
|
| `tolerance` | What percentage of the size to allow batches to exceed. ~~float~~ |
|
|
| `discard_oversize` | Whether to discard sequences that by themselves exceed the tolerated size. ~~bool~~ |
|
|
| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~ |
|
|
|
|
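The interaction of `size`, `tolerance` and `discard_oversize` can be sketched in plain Python. This is a simplified illustration under stated assumptions, not spaCy's implementation: the function name `batch_by_words_sketch` is hypothetical, `size` is a plain integer rather than a schedule, and a batch is closed as soon as it reaches `size` words while never exceeding `size * (1 + tolerance)`.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


def batch_by_words_sketch(
    seqs: Iterable[Any],
    size: int,
    tolerance: float = 0.2,
    discard_oversize: bool = False,
    get_length: Optional[Callable[[Any], int]] = None,
) -> Iterator[List[Any]]:
    get_length = get_length or len
    max_size = size + int(size * tolerance)  # hard cap including tolerance
    batch: List[Any] = []
    batch_words = 0
    for seq in seqs:
        n_words = get_length(seq)
        if n_words > max_size:
            # Oversized: drop it, or emit it as a batch by itself.
            if not discard_oversize:
                yield [seq]
            continue
        if batch and batch_words + n_words > max_size:
            # Adding this item would break the cap, so close the batch first.
            yield batch
            batch, batch_words = [], 0
        batch.append(seq)
        batch_words += n_words
        if batch_words >= size:
            # Target reached: close the batch.
            yield batch
            batch, batch_words = [], 0
    if batch:
        yield batch


# Toy "documents" with 4, 5, 3, 20 and 6 words.
docs = [["w"] * n for n in (4, 5, 3, 20, 6)]
kept = [[len(d) for d in b] for b in batch_by_words_sketch(docs, size=10)]
dropped = [
    [len(d) for d in b]
    for b in batch_by_words_sketch(docs, size=10, discard_oversize=True)
]
print(kept)     # [[4, 5, 3], [20], [6]]
print(dropped)  # [[4, 5, 3], [6]]
```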
#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
|
|
|
|
> #### Example config
|
|
>
|
|
> ```ini
|
|
> [training.batcher]
|
|
> @batchers = "batch_by_sequence.v1"
|
|
> size = 32
|
|
> get_length = null
|
|
> ```
|
|
|
|
Create a batcher that groups items into batches of the specified size.
|
|
|
|
| Name | Description |
|
|
| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `size` | The target number of items per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ |
|
|
| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~ |
|
|
|
|
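Batching by item count, with `size` given either as a fixed integer or as a stream of sizes, can be sketched as follows. Both `batch_by_sequence_sketch` and `growing_sizes` are hypothetical helpers; `growing_sizes` only imitates the general shape of a compounding schedule and is not Thinc's `compounding`.

```python
from itertools import islice, repeat
from typing import Any, Iterable, Iterator, List, Union


def growing_sizes(start: int, stop: int, factor: float) -> Iterator[int]:
    """Hypothetical schedule: multiply by `factor` each step, capped at `stop`."""
    current = float(start)
    while True:
        yield int(min(current, stop))
        current *= factor


def batch_by_sequence_sketch(
    items: Iterable[Any], size: Union[int, Iterable[int]]
) -> Iterator[List[Any]]:
    # A fixed size is treated as an endless schedule of the same value.
    sizes = repeat(size) if isinstance(size, int) else iter(size)
    items = iter(items)
    for batch_size in sizes:
        batch = list(islice(items, batch_size))
        if not batch:
            break
        yield batch


fixed = list(batch_by_sequence_sketch(range(5), 2))
growing = list(batch_by_sequence_sketch(range(10), growing_sizes(1, 4, 2.0)))
print(fixed)    # [[0, 1], [2, 3], [4]]
print(growing)  # [[0], [1, 2], [3, 4, 5, 6], [7, 8, 9]]
```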
#### batch_by_padded.v1 {#batch_by_padded tag="registered function"}
|
|
|
|
> #### Example config
|
|
>
|
|
> ```ini
|
|
> [training.batcher]
|
|
> @batchers = "batch_by_padded.v1"
|
|
> size = 100
|
|
> buffer = 256
|
|
> discard_oversize = false
|
|
> get_length = null
|
|
> ```
|
|
|
|
Minibatch a sequence by the size of padded batches that would result, with
|
|
sequences binned by length within a window. The padded size is defined as the
|
|
maximum length of sequences within the batch multiplied by the number of
|
|
sequences in the batch.
|
|
|
|
| Name | Description |
|
|
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `size` | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ |
|
|
| `buffer` | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. ~~int~~ |
|
|
| `discard_oversize` | Whether to discard sequences that are by themselves longer than the largest padded batch size. ~~bool~~ |
|
|
| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~ |
|
|
|
|
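The padded size defined above (longest sequence in the batch times the number of sequences) is easy to compute by hand, which also shows why grouping sequences of similar length helps. The sketch below is a simplified illustration: `padded_size` and `batch_by_padded_sketch` are hypothetical names, the whole input is sorted at once instead of using a `buffer`, and `discard_oversize` is ignored.

```python
from typing import Any, Callable, Iterable, Iterator, List, Sequence


def padded_size(lengths: Sequence[int]) -> int:
    """Longest sequence length multiplied by the number of sequences."""
    return max(lengths) * len(lengths) if lengths else 0


def batch_by_padded_sketch(
    seqs: Iterable[Any], size: int, get_length: Callable[[Any], int] = len
) -> Iterator[List[Any]]:
    batch: List[Any] = []
    # Sorting by length keeps similarly sized sequences together.
    for seq in sorted(seqs, key=get_length):
        candidate = batch + [seq]
        if batch and padded_size([get_length(s) for s in candidate]) > size:
            yield batch
            batch = [seq]
        else:
            batch = candidate
    if batch:
        yield batch


texts = ["a" * n for n in (10, 3, 4, 9)]
batches = list(batch_by_padded_sketch(texts, size=20))
lengths = [[len(t) for t in batch] for batch in batches]
print(padded_size([3, 4, 10]))  # 30
print(lengths)                  # [[3, 4], [9, 10]]
```

Batching `[3, 4, 10]` together would cost a padded size of 30, whereas `[3, 4]` costs only 8, which is the motivation for binning by length.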
## Training data and alignment {#gold source="spacy/gold"}
|
|
|
|
### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}
|
|
|
|
Encode labelled spans into per-token tags, using the
|
|
[BILUO scheme](/usage/linguistic-features#accessing-ner) (Begin, In, Last, Unit,
|
|
Out). Returns a list of strings, describing the tags. Each tag string will be of
|
|
the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of
|
|
`"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
|
|
don't align with the tokenization in the `Doc` object. The training algorithm
|
|
will view these as missing values. `O` denotes a non-entity token. `B` denotes
|
|
the beginning of a multi-token entity, `I` the inside of an entity of three or
|
|
more tokens, and `L` the end of an entity of two or more tokens. `U` denotes a
|
|
single-token entity.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.gold import biluo_tags_from_offsets
|
|
>
|
|
> doc = nlp("I like London.")
|
|
> entities = [(7, 13, "LOC")]
|
|
> tags = biluo_tags_from_offsets(doc, entities)
|
|
> assert tags == ["O", "O", "U-LOC", "O"]
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
|
| `doc` | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. ~~Doc~~ |
|
|
| `entities` | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. ~~List[Tuple[int, int, Union[str, int]]]~~ |
|
|
| **RETURNS** | A list of strings, describing the [BILUO](/usage/linguistic-features#accessing-ner) tags. ~~List[str]~~ |
|
|
|
|
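The tagging rules described above can be sketched without a `Doc` by working on token character spans directly. This is an illustrative simplification, not spaCy's implementation: `biluo_sketch` is a hypothetical helper and the token spans are written out by hand.

```python
from typing import List, Tuple


def biluo_sketch(
    token_spans: List[Tuple[int, int]],
    entities: List[Tuple[int, int, str]],
) -> List[str]:
    """Assign one BILUO tag per token from character-offset entities."""
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    tags = ["O"] * len(token_spans)
    for start_char, end_char, label in entities:
        first, last = starts.get(start_char), ends.get(end_char)
        if first is None or last is None:
            # Offsets don't align with token boundaries: mark overlapping
            # tokens as missing values.
            for i, (tok_start, tok_end) in enumerate(token_spans):
                if tok_start < end_char and start_char < tok_end:
                    tags[i] = "-"
        elif first == last:
            tags[first] = f"U-{label}"
        elif first < last:
            tags[first] = f"B-{label}"
            for i in range(first + 1, last):
                tags[i] = f"I-{label}"
            tags[last] = f"L-{label}"
    return tags


# "I like London." -> I, like, London, .
london = [(0, 1), (2, 6), (7, 13), (13, 14)]
# "I like New York City." -> I, like, New, York, City, .
nyc = [(0, 1), (2, 6), (7, 10), (11, 15), (16, 20), (20, 21)]

print(biluo_sketch(london, [(7, 13, "LOC")]))  # ['O', 'O', 'U-LOC', 'O']
print(biluo_sketch(nyc, [(7, 20, "GPE")]))
print(biluo_sketch(london, [(7, 12, "LOC")]))  # misaligned end offset
```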
### gold.offsets_from_biluo_tags {#offsets_from_biluo_tags tag="function"}
|
|
|
|
Convert per-token tags following the
|
|
[BILUO scheme](/usage/linguistic-features#accessing-ner) into entity offsets.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.gold import offsets_from_biluo_tags
|
|
>
|
|
> doc = nlp("I like London.")
|
|
> tags = ["O", "O", "U-LOC", "O"]
|
|
> entities = offsets_from_biluo_tags(doc, tags)
|
|
> assert entities == [(7, 13, "LOC")]
|
|
> ```
|
|
|
|
| Name        | Description |
| ----------- | ----------- |
| `doc`       | The document that the BILUO tags refer to. ~~Doc~~ |
| `tags`      | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
| **RETURNS** | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. ~~List[Tuple[int, int, str]]~~ |
|
|
|
|
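The reverse direction, from per-token tags back to character offsets, can be sketched in the same style. `offsets_sketch` is a hypothetical helper working on hand-written token character spans; it is not spaCy's implementation and simply skips `""`, `"O"`, `"-"` and `I` tags.

```python
from typing import List, Optional, Tuple


def offsets_sketch(
    token_spans: List[Tuple[int, int]], tags: List[str]
) -> List[Tuple[int, int, str]]:
    """Collect (start, end, label) triples from per-token BILUO tags."""
    entities: List[Tuple[int, int, str]] = []
    start_char: Optional[int] = None
    for (tok_start, tok_end), tag in zip(token_spans, tags):
        action, _, label = tag.partition("-")
        if action == "U":
            entities.append((tok_start, tok_end, label))
        elif action == "B":
            start_char = tok_start
        elif action == "L" and start_char is not None:
            entities.append((start_char, tok_end, label))
            start_char = None
    return entities


# "I like London." -> I, like, London, .
london = [(0, 1), (2, 6), (7, 13), (13, 14)]
# "I like New York City." -> I, like, New, York, City, .
nyc = [(0, 1), (2, 6), (7, 10), (11, 15), (16, 20), (20, 21)]

print(offsets_sketch(london, ["O", "O", "U-LOC", "O"]))  # [(7, 13, 'LOC')]
print(offsets_sketch(nyc, ["O", "O", "B-GPE", "I-GPE", "L-GPE", "O"]))
```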
### gold.spans_from_biluo_tags {#spans_from_biluo_tags tag="function" new="2.1"}
|
|
|
|
Convert per-token tags following the
|
|
[BILUO scheme](/usage/linguistic-features#accessing-ner) into
|
|
[`Span`](/api/span) objects. This can be used to create entity spans from
|
|
token-based tags, e.g. to overwrite the `doc.ents`.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.gold import spans_from_biluo_tags
|
|
>
|
|
> doc = nlp("I like London.")
|
|
> tags = ["O", "O", "U-LOC", "O"]
|
|
> doc.ents = spans_from_biluo_tags(doc, tags)
|
|
> ```
|
|
|
|
| Name        | Description |
| ----------- | ----------- |
| `doc`       | The document that the BILUO tags refer to. ~~Doc~~ |
| `tags`      | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
| **RETURNS** | A sequence of `Span` objects with added entity labels. ~~List[Span]~~ |
|
|
|
|
## Utility functions {#util source="spacy/util.py"}
|
|
|
|
spaCy comes with a small collection of utility functions located in
|
|
[`spacy/util.py`](https://github.com/explosion/spaCy/tree/master/spacy/util.py).
|
|
Because utility functions are mostly intended for **internal use within spaCy**,
|
|
their behavior may change with future releases. The functions documented on this
|
|
page should be safe to use and we'll try to ensure backwards compatibility.
|
|
However, we recommend having additional tests in place if your application
|
|
depends on any of spaCy's utilities.
|
|
|
|
### util.get_lang_class {#util.get_lang_class tag="function"}
|
|
|
|
Import and load a `Language` class. Allows lazy-loading
|
|
[language data](/usage/adding-languages) and importing languages using the
|
|
two-letter language code. To add a language code for a custom language class,
|
|
you can register it using the [`@registry.languages`](/api/top-level#registry)
|
|
decorator.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
> for lang_id in ["en", "de"]:
>     lang_class = util.get_lang_class(lang_id)
>     lang = lang_class()
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | ---------------------------------------------- |
|
|
| `lang` | Two-letter language code, e.g. `"en"`. ~~str~~ |
|
|
| **RETURNS** | The respective subclass. ~~Language~~ |
|
|
|
|
### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}
|
|
|
|
Check whether a `Language` subclass is already loaded. `Language` subclasses are
|
|
loaded lazily, to avoid expensive setup code associated with the language data.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> lang_cls = util.get_lang_class("en")
|
|
> assert util.lang_class_is_loaded("en") is True
|
|
> assert util.lang_class_is_loaded("de") is False
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | ---------------------------------------------- |
|
|
| `name` | Two-letter language code, e.g. `"en"`. ~~str~~ |
|
|
| **RETURNS** | Whether the class has been loaded. ~~bool~~ |
|
|
|
|
### util.load_model {#util.load_model tag="function" new="2"}
|
|
|
|
Load a model from a package or data path. If called with a package name, spaCy
|
|
will assume the model is a Python package and import and call its `load()`
|
|
method. If called with a path, spaCy will assume it's a data directory, read the
|
|
language and pipeline settings from the [`config.cfg`](/api/data-formats#config)
|
|
and create a `Language` object. The model data will then be loaded in via
|
|
[`Language.from_disk`](/api/language#from_disk).
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> nlp = util.load_model("en_core_web_sm")
|
|
> nlp = util.load_model("en_core_web_sm", disable=["ner"])
|
|
> nlp = util.load_model("/path/to/data")
|
|
> ```
|
|
|
|
| Name                                | Description |
| ----------------------------------- | ----------- |
| `name`                              | Package name or model path. ~~str~~ |
| `vocab` <Tag variant="new">3</Tag>  | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~ |
| `disable`                           | Names of pipeline components to disable. ~~Iterable[str]~~ |
| `config` <Tag variant="new">3</Tag> | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ |
| **RETURNS**                         | `Language` object with the loaded model. ~~Language~~ |
|
|
|
|
### util.load_model_from_init_py {#util.load_model_from_init_py tag="function" new="2"}
|
|
|
|
A helper function to use in the `load()` method of a model package's
|
|
[`__init__.py`](https://github.com/explosion/spacy-models/tree/master/template/model/xx_model_name/__init__.py).
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
> from spacy.util import load_model_from_init_py
>
> def load(**overrides):
>     return load_model_from_init_py(__file__, **overrides)
> ```
|
|
|
|
| Name | Description |
|
|
| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `init_file` | Path to model's `__init__.py`, i.e. `__file__`. ~~Union[str, Path]~~ |
|
|
| `vocab` <Tag variant="new">3</Tag> | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. |
|
|
| `disable` | Names of pipeline components to disable. ~~Iterable[str]~~ |
|
|
| `config` <Tag variant="new">3</Tag> | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ |
|
|
| **RETURNS** | `Language` class with the loaded model. ~~Language~~ |
|
|
|
|
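
To show why `__file__` is enough to find the model data, the sketch below
resolves a data directory relative to a package's `__init__.py`. It assumes a
package layout in which the data lives in a sibling directory named
`<lang>_<name>-<version>` taken from `meta.json`. Both that layout and the
helper `data_path_from_init` are assumptions for illustration, not a
description of spaCy's internals.

```python
import json
import tempfile
from pathlib import Path


def data_path_from_init(init_file: str) -> Path:
    """Locate the data directory next to a package's __init__.py.

    Hypothetical helper; assumes the layout <package>/<lang>_<name>-<version>/.
    """
    package_dir = Path(init_file).parent
    meta = json.loads((package_dir / "meta.json").read_text(encoding="utf8"))
    data_dir = f"{meta['lang']}_{meta['name']}-{meta['version']}"
    return package_dir / data_dir


# Build a throwaway package directory to demonstrate the lookup.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "xx_model_name"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("", encoding="utf8")
    meta = {"lang": "xx", "name": "model_name", "version": "0.0.0"}
    (pkg / "meta.json").write_text(json.dumps(meta), encoding="utf8")
    resolved = data_path_from_init(str(pkg / "__init__.py"))
    print(resolved.name)  # xx_model_name-0.0.0
```
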
### util.load_config {#util.load_config tag="function" new="3"}

Load a model's [`config.cfg`](/api/data-formats#config) from a file path. The
config typically includes details about the model pipeline and how its
components are created, as well as all training settings and hyperparameters.

> #### Example
>
> ```python
> config = util.load_config("/path/to/model/config.cfg")
> print(config.to_str())
> ```

| Name | Description |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | Path to the model's `config.cfg`. ~~Union[str, Path]~~ |
| `overrides` | Optional config overrides to replace in loaded config. Can be provided as nested dict, or as flat dict with keys in dot notation, e.g. `"nlp.pipeline"`. ~~Dict[str, Any]~~ |
| `interpolate` | Whether to interpolate the config and replace variables like `${paths.train}` with their values. Defaults to `False`. ~~bool~~ |
| **RETURNS** | The model's config. ~~Config~~ |

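
The two accepted shapes for `overrides`, a nested dict and a flat dict keyed in
dot notation, carry the same information. The following sketch expands the flat
form into the nested one. `dot_to_nested` is a hypothetical helper and
`"training.dropout"` is an illustrative key, so treat this as an explanation of
the notation rather than the library's own conversion code.

```python
from typing import Any, Dict


def dot_to_nested(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand keys like "training.dropout" into nested dictionaries."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Walk down, creating intermediate sections as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# "training.dropout" is an illustrative key, not taken from a real config.
overrides = {"nlp.pipeline": ["tagger"], "training.dropout": 0.2}
print(dot_to_nested(overrides))
# {'nlp': {'pipeline': ['tagger']}, 'training': {'dropout': 0.2}}
```
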
### util.load_meta {#util.load_meta tag="function" new="3"}

Get a model's [`meta.json`](/api/data-formats#meta) from a file path and
validate its contents.

> #### Example
>
> ```python
> meta = util.load_meta("/path/to/model/meta.json")
> ```

| Name | Description |
| ----------- | ----------------------------------------------------- |
| `path` | Path to the model's `meta.json`. ~~Union[str, Path]~~ |
| **RETURNS** | The model's meta data. ~~Dict[str, Any]~~ |

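
A rough picture of what "load and validate" can mean for a `meta.json` is shown
below. The set of required keys (`lang`, `name`, `version`) and the error
messages are assumptions made for this sketch, and `load_meta_sketch` is a
hypothetical stand-in for the real function.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Union

REQUIRED_KEYS = ("lang", "name", "version")  # assumed set of required settings


def load_meta_sketch(path: Union[str, Path]) -> Dict[str, Any]:
    """Read a meta.json file and check that the assumed required keys are set."""
    path = Path(path)
    if not path.is_file():
        raise OSError(f"Could not read meta.json from {path}")
    meta = json.loads(path.read_text(encoding="utf8"))
    for key in REQUIRED_KEYS:
        if not meta.get(key):
            raise ValueError(f"No valid '{key}' setting found in meta.json")
    return meta


with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "meta.json"
    meta_path.write_text(
        json.dumps({"lang": "en", "name": "demo", "version": "0.0.1"}),
        encoding="utf8",
    )
    print(load_meta_sketch(meta_path)["name"])  # demo
```
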
### util.is_package {#util.is_package tag="function"}

Check if a string maps to a package installed via pip. Mainly used to validate
[model packages](/usage/models).

> #### Example
>
> ```python
> util.is_package("en_core_web_sm") # True
> util.is_package("xyz") # False
> ```

| Name | Description |
| ----------- | ----------------------------------------------------- |
| `name` | Name of package. ~~str~~ |
| **RETURNS** | `True` if installed package, `False` if not. ~~bool~~ |

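
One way to check whether a name maps to a pip-installed distribution is to query
the installed package metadata. The sketch below uses `importlib.metadata` from
the standard library (Python 3.8+). It illustrates the idea and is not
necessarily how spaCy performs the check. `is_package_sketch` is a hypothetical
name.

```python
from importlib import metadata


def is_package_sketch(name: str) -> bool:
    """Return True if a distribution called `name` is installed."""
    try:
        metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return False
    return True


print(is_package_sketch("surely-not-an-installed-distribution-xyz"))  # False
```
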
### util.get_package_path {#util.get_package_path tag="function" new="2"}

Get the path to an installed package. Mainly used to resolve the location of
[model packages](/usage/models). Currently imports the package to find its path.

> #### Example
>
> ```python
> util.get_package_path("en_core_web_sm")
> # /usr/lib/python3.6/site-packages/en_core_web_sm
> ```

| Name | Description |
| -------------- | ----------------------------------------- |
| `package_name` | Name of installed package. ~~str~~ |
| **RETURNS** | Path to model package directory. ~~Path~~ |

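
Because the lookup works by importing the package, it can be approximated in a
few lines. The helper name `get_package_path_sketch` is hypothetical, and the
stdlib package `json` is used only so the example runs without a model
installed.

```python
import importlib
from pathlib import Path


def get_package_path_sketch(package_name: str) -> Path:
    """Import the package and return the directory containing its __init__.py."""
    module = importlib.import_module(package_name)
    return Path(module.__file__).parent


# "json" stands in for an installed model package.
print(get_package_path_sketch("json").name)  # json
```

Note that importing a package runs its top-level code, which is a side effect
worth keeping in mind for large packages.
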
### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"}

Check if the user is running spaCy from a [Jupyter](https://jupyter.org)
notebook by detecting the IPython kernel. Mainly used for the
[`displacy`](/api/top-level#displacy) visualizer.

> #### Example
>
> ```python
> html = "<h1>Hello world!</h1>"
> if util.is_in_jupyter():
>     from IPython.display import display, HTML
>     display(HTML(html))
> ```

| Name | Description |
| ----------- | ---------------------------------------------- |
| **RETURNS** | `True` if in Jupyter, `False` if not. ~~bool~~ |

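
Kernel detection of this kind usually relies on `get_ipython`, which IPython
injects into the interactive namespace and which is undefined in a plain Python
process. The sketch below assumes that Jupyter kernels expose a shell class
named `ZMQInteractiveShell`. That class name and the helper
`is_in_jupyter_sketch` are assumptions for illustration.

```python
def is_in_jupyter_sketch() -> bool:
    """Guess whether the code runs inside a Jupyter kernel."""
    try:
        # get_ipython only exists when IPython has set up the namespace.
        shell_name = get_ipython().__class__.__name__  # type: ignore[name-defined]
    except NameError:
        return False
    # Assumption: Jupyter kernels use the ZMQ-based interactive shell class.
    return shell_name == "ZMQInteractiveShell"


print(is_in_jupyter_sketch())
```
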
### util.compile_prefix_regex {#util.compile_prefix_regex tag="function"}

Compile a sequence of prefix rules into a regex object.

> #### Example
>
> ```python
> prefixes = ("§", "%", "=", r"\+")
> prefix_regex = util.compile_prefix_regex(prefixes)
> nlp.tokenizer.prefix_search = prefix_regex.search
> ```

| Name | Description |
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
| **RETURNS** | The regex object to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~ |

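
Conceptually, compiling prefix rules means anchoring each rule at the start of
the string and joining the rules as alternatives. The following standard-library
sketch shows that idea with the same rules as the example above.
`compile_prefix_sketch` is a hypothetical name and only handles string rules.

```python
import re
from typing import Iterable


def compile_prefix_sketch(entries: Iterable[str]) -> re.Pattern:
    """Join non-empty rules into one pattern that only matches at the start."""
    expression = "|".join("^" + piece for piece in entries if piece.strip())
    return re.compile(expression)


prefix_re = compile_prefix_sketch(("§", "%", "=", r"\+"))
print(prefix_re.search("+1").group())  # +
print(prefix_re.search("1+1"))  # None, because "+" is not at the start
```
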
### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"}

Compile a sequence of suffix rules into a regex object.

> #### Example
>
> ```python
> suffixes = ("'s", "'S", r"(?<=[0-9])\+")
> suffix_regex = util.compile_suffix_regex(suffixes)
> nlp.tokenizer.suffix_search = suffix_regex.search
> ```

| Name | Description |
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
| **RETURNS** | The regex object to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~ |

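
Suffix rules differ from prefix rules in where they are anchored: each rule
must match at the end of the string. A minimal sketch under that assumption,
using a hypothetical `compile_suffix_sketch` and the rules from the example
above:

```python
import re
from typing import Iterable


def compile_suffix_sketch(entries: Iterable[str]) -> re.Pattern:
    """Join non-empty rules into one pattern that only matches at the end."""
    expression = "|".join(piece + "$" for piece in entries if piece.strip())
    return re.compile(expression)


suffix_re = compile_suffix_sketch(("'s", "'S", r"(?<=[0-9])\+"))
print(suffix_re.search("John's").group())  # 's
print(suffix_re.search("10+").group())  # +
print(suffix_re.search("a+"))  # None, the lookbehind requires a digit
```
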
### util.compile_infix_regex {#util.compile_infix_regex tag="function"}

Compile a sequence of infix rules into a regex object.

> #### Example
>
> ```python
> infixes = ("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])")
> infix_regex = util.compile_infix_regex(infixes)
> nlp.tokenizer.infix_finditer = infix_regex.finditer
> ```

| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
| **RETURNS** | The regex object to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~ |

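
Infix rules are not anchored, since they may match anywhere inside a string,
which is why the result is consumed through `finditer` rather than `search`.
The sketch below joins the rules as plain alternatives and lists every match
position. `compile_infix_sketch` is a hypothetical helper, not the library
function.

```python
import re
from typing import Iterable


def compile_infix_sketch(entries: Iterable[str]) -> re.Pattern:
    """Join non-empty rules into one unanchored pattern."""
    expression = "|".join(piece for piece in entries if piece.strip())
    return re.compile(expression)


infix_re = compile_infix_sketch(("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])"))
text = "well-known 2+2"
matches = [(m.start(), m.group()) for m in infix_re.finditer(text)]
print(matches)  # [(4, '-'), (12, '+')]
```
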
### util.minibatch {#util.minibatch tag="function" new="2"}

Iterate over batches of items. `size` may be an iterator, so that the batch
size can vary on each step.

> #### Example
>
> ```python
> from spacy.util import minibatch
>
> batches = minibatch(train_data, size=8)
> for batch in batches:
>     nlp.update(batch)
> ```

| Name | Description |
| ---------- | ------------------------------------------------ |
| `items` | The items to batch up. ~~Iterable[Any]~~ |
| `size` | The batch size(s). ~~Union[int, Sequence[int]]~~ |
| **YIELDS** | The batches. |

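
The effect of passing an iterator as `size` is easiest to see in a small
standalone sketch. The generator below is an approximation written with
`itertools`, not the library source. `minibatch_sketch` is a hypothetical name
and `itertools.count(1)` merely stands in for a schedule of growing batch
sizes.

```python
import itertools
from typing import Any, Iterable, Iterator, List, Union


def minibatch_sketch(
    items: Iterable[Any], size: Union[int, Iterator[int]]
) -> Iterator[List[Any]]:
    """Yield lists of items, drawing each batch size from `size`."""
    # A fixed int becomes an endless stream of the same size.
    sizes = itertools.repeat(size) if isinstance(size, int) else iter(size)
    items = iter(items)
    # If a finite size schedule runs out first, remaining items are not yielded.
    for batch_size in sizes:
        batch = list(itertools.islice(items, int(batch_size)))
        if not batch:
            return
        yield batch


print(list(minibatch_sketch(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
print(list(minibatch_sketch(range(7), itertools.count(1))))
# [[0], [1, 2], [3, 4, 5], [6]]
```
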
### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"}

Filter a sequence of [`Span`](/api/span) objects and remove duplicates or
overlaps. Useful for creating named entities (where one token can only be part
of one entity) or when merging spans with
[`Retokenizer.merge`](/api/doc#retokenizer.merge). When spans overlap, the
(first) longest span is preferred over shorter spans.

> #### Example
>
> ```python
> from spacy.util import filter_spans
>
> doc = nlp("This is a sentence.")
> spans = [doc[0:2], doc[0:2], doc[0:4]]
> filtered = filter_spans(spans)
> ```

| Name | Description |
| ----------- | --------------------------------------- |
| `spans` | The spans to filter. ~~Iterable[Span]~~ |
| **RETURNS** | The filtered spans. ~~List[Span]~~ |

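
The "(first) longest span wins" rule can be illustrated on plain
`(start, end)` token offsets, without any `Span` objects. The greedy strategy
below is one straightforward way to satisfy the description. It is a sketch
with a hypothetical name, `filter_offsets`, and is not presented as the
library's implementation.

```python
from typing import Iterable, List, Tuple

Offsets = Tuple[int, int]  # (start, end) token offsets, end exclusive


def filter_offsets(spans: Iterable[Offsets]) -> List[Offsets]:
    """Drop duplicate and overlapping spans, preferring longer, earlier ones."""
    # Longest spans first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[Offsets] = []
    seen_tokens = set()
    for start, end in ordered:
        # Checking both ends suffices because longer spans were handled first.
        if start not in seen_tokens and end - 1 not in seen_tokens:
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)


print(filter_offsets([(0, 2), (0, 2), (0, 4)]))  # [(0, 4)]
print(filter_offsets([(0, 2), (1, 3), (4, 6)]))  # [(0, 2), (4, 6)]
```
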
### util.get_words_and_spaces {#get_words_and_spaces tag="function" new="3"}

Given a list of words and a text, reconstruct the original tokens and return a
list of words and spaces that can be used to create a [`Doc`](/api/doc#init).
This can help recover from destructive tokenization that didn't preserve any
whitespace information.

> #### Example
>
> ```python
> from spacy.util import get_words_and_spaces
>
> orig_words = ["Hey", ",", "what", "'s", "up", "?"]
> orig_text = "Hey, what's up?"
> words, spaces = get_words_and_spaces(orig_words, orig_text)
> # ['Hey', ',', 'what', "'s", 'up', '?']
> # [False, True, False, True, False, False]
> ```

| Name | Description |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| `words` | The list of words. ~~Iterable[str]~~ |
| `text` | The original text. ~~str~~ |
| **RETURNS** | A list of words and a list of boolean values indicating whether the word at this position is followed by a space. ~~Tuple[List[str], List[bool]]~~ |

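
The alignment idea can be sketched by walking through the text and checking
which character follows each word. This simplified version assumes that tokens
are separated by at most a single space and that the text has no leading
whitespace. The real function may handle more cases, and
`words_and_spaces_sketch` is a hypothetical name.

```python
from typing import Iterable, List, Tuple


def words_and_spaces_sketch(
    words: Iterable[str], text: str
) -> Tuple[List[str], List[bool]]:
    """Align words with the text and record whether a space follows each one."""
    out_words: List[str] = []
    out_spaces: List[bool] = []
    pos = 0
    for word in words:
        if not text.startswith(word, pos):
            raise ValueError(f"Can't align {word!r} with the text at offset {pos}")
        pos += len(word)
        # Simplifying assumption: at most one space separates two tokens.
        has_space = text[pos : pos + 1] == " "
        out_words.append(word)
        out_spaces.append(has_space)
        if has_space:
            pos += 1
    return out_words, out_spaces


words, spaces = words_and_spaces_sketch(
    ["Hey", ",", "what", "'s", "up", "?"], "Hey, what's up?"
)
print(spaces)  # [False, True, False, True, False, False]
```
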