---
title: Top-level Functions
menu:
  - ['spacy', 'spacy']
  - ['displacy', 'displacy']
  - ['registry', 'registry']
  - ['Data & Alignment', 'gold']
  - ['Utility Functions', 'util']
---

## spaCy {#spacy hidden="true"}

### spacy.load {#spacy.load tag="function" model="any"}

Load a model using the name of an installed
[model package](/usage/training#models-generating), a string path or a
`Path`-like object. spaCy will try resolving the load argument in this order. If
a model is loaded from a model name, spaCy will assume it's a Python package,
import it and call the model's own `load()` method. If a model is loaded from a
path, spaCy will assume it's a data directory, read the language and pipeline
settings from the `meta.json` and initialize the `Language` class. The data will
then be loaded via [`Language.from_disk`](/api/language#from_disk).

> #### Example
>
> ```python
> nlp = spacy.load("en_core_web_sm") # package
> nlp = spacy.load("/path/to/en") # string path
> nlp = spacy.load(Path("/path/to/en")) # pathlib Path
>
> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
> ```

| Name | Type | Description |
| --- | --- | --- |
| `name` | str / `Path` | Model to load, i.e. package name or path. |
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| `component_cfg` <Tag variant="new">3</Tag> | `Dict[str, dict]` | Optional config overrides for pipeline components, keyed by component names. |
| **RETURNS** | `Language` | A `Language` object with the loaded model. |

Essentially, `spacy.load()` is a convenience wrapper that reads the language ID
and pipeline components from a model's `meta.json`, initializes the `Language`
class, loads in the model data and returns it.

```python
### Abstract example
cls = util.get_lang_class(lang)  # get language for ID, e.g. "en"
nlp = cls()  # initialize the language
for name in pipeline:
    nlp.add_pipe(name)  # add component to pipeline
nlp.from_disk(model_data_path)  # load in model data
```

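The resolution order described for `spacy.load` (package name first, then string path, then `Path`-like object) can be pictured with a small dispatch function. This is only a sketch under stated assumptions: `resolve_load_target` is a hypothetical helper, not part of spaCy, and the `is_package` callable stands in for whatever check spaCy performs internally.

```python
from pathlib import Path
from typing import Callable, Union


def resolve_load_target(
    name: Union[str, Path],
    is_package: Callable[[str], bool],
) -> str:
    """Return which loading route would be taken for `name` (hypothetical helper)."""
    if isinstance(name, str):
        if is_package(name):  # 1. installed model package
            return "package"
        if Path(name).exists():  # 2. string path to a data directory
            return "path"
    elif hasattr(name, "exists") and name.exists():  # 3. Path-like object
        return "path"
    raise IOError(f"Can't find model '{name}'")


# Pretend that only "en_core_web_sm" is installed (assumption for the demo).
installed = {"en_core_web_sm"}
print(resolve_load_target("en_core_web_sm", installed.__contains__))
print(resolve_load_target(Path("."), installed.__contains__))
```

A name that is neither an installed package nor an existing path falls through to the error at the end.
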
### spacy.blank {#spacy.blank tag="function" new="2"}

Create a blank model of a given language class. This function is the twin of
`spacy.load()`.

> #### Example
>
> ```python
> nlp_en = spacy.blank("en") # equivalent to English()
> nlp_de = spacy.blank("de") # equivalent to German()
> ```

| Name | Type | Description |
| --- | --- | --- |
| `name` | str | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. |
| **RETURNS** | `Language` | An empty `Language` object of the appropriate subclass. |

### spacy.info {#spacy.info tag="function"}

The same as the [`info` command](/api/cli#info). Pretty-print information about
your installation, models and local setup from within spaCy. To get the model
meta data as a dictionary instead, you can use the `meta` attribute on your
`nlp` object with a loaded model, e.g. `nlp.meta`.

> #### Example
>
> ```python
> spacy.info()
> spacy.info("en_core_web_sm")
> markdown = spacy.info(markdown=True, silent=True)
> ```

| Name | Type | Description |
| --- | --- | --- |
| `model` | str | A model, i.e. a package name or path (optional). |
| `markdown` | bool | Print information as Markdown. |
| `silent` | bool | Don't print anything, just return. |

### spacy.explain {#spacy.explain tag="function"}

Get a description for a given POS tag, dependency label or entity type. For a
list of available terms, see
[`glossary.py`](https://github.com/explosion/spaCy/tree/master/spacy/glossary.py).

> #### Example
>
> ```python
> spacy.explain("NORP")
> # Nationalities or religious or political groups
>
> doc = nlp("Hello world")
> for word in doc:
>     print(word.text, word.tag_, spacy.explain(word.tag_))
> # Hello UH interjection
> # world NN noun, singular or mass
> ```

| Name | Type | Description |
| --- | --- | --- |
| `term` | str | Term to explain. |
| **RETURNS** | str | The explanation, or `None` if not found in the glossary. |

### spacy.prefer_gpu {#spacy.prefer_gpu tag="function" new="2.0.14"}

Allocate data and perform operations on [GPU](/usage/#gpu), if available. If
data has already been allocated on CPU, it will not be moved. Ideally, this
function should be called right after importing spaCy and _before_ loading any
models.

> #### Example
>
> ```python
> import spacy
> activated = spacy.prefer_gpu()
> nlp = spacy.load("en_core_web_sm")
> ```

| Name | Type | Description |
| --- | --- | --- |
| **RETURNS** | bool | Whether the GPU was activated. |

### spacy.require_gpu {#spacy.require_gpu tag="function" new="2.0.14"}

Allocate data and perform operations on [GPU](/usage/#gpu). Will raise an error
if no GPU is available. If data has already been allocated on CPU, it will not
be moved. Ideally, this function should be called right after importing spaCy
and _before_ loading any models.

> #### Example
>
> ```python
> import spacy
> spacy.require_gpu()
> nlp = spacy.load("en_core_web_sm")
> ```

| Name | Type | Description |
| --- | --- | --- |
| **RETURNS** | bool | `True` |

## displaCy {#displacy source="spacy/displacy"}

As of v2.0, spaCy comes with a built-in visualization suite. For more info and
examples, see the usage guide on [visualizing spaCy](/usage/visualizers).

### displacy.serve {#displacy.serve tag="method" new="2"}

Serve a dependency parse tree or named entity visualization to view it in your
browser. Will run a simple web server.

> #### Example
>
> ```python
> import spacy
> from spacy import displacy
> nlp = spacy.load("en_core_web_sm")
> doc1 = nlp("This is a sentence.")
> doc2 = nlp("This is another sentence.")
> displacy.serve([doc1, doc2], style="dep")
> ```

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `docs` | list, `Doc`, `Span` | Document(s) to visualize. | |
| `style` | str | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
| `page` | bool | Render markup as full HTML page. | `True` |
| `minify` | bool | Minify HTML markup. | `False` |
| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` |
| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
| `port` | int | Port to serve visualization. | `5000` |
| `host` | str | Host to serve visualization. | `'0.0.0.0'` |

### displacy.render {#displacy.render tag="method" new="2"}

Render a dependency parse tree or named entity visualization.

> #### Example
>
> ```python
> import spacy
> from spacy import displacy
> nlp = spacy.load("en_core_web_sm")
> doc = nlp("This is a sentence.")
> html = displacy.render(doc, style="dep")
> ```

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `docs` | list, `Doc`, `Span` | Document(s) to visualize. | |
| `style` | str | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
| `page` | bool | Render markup as full HTML page. | `False` |
| `minify` | bool | Minify HTML markup. | `False` |
| `jupyter` | bool | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None`. | `None` |
| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` |
| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
| **RETURNS** | str | Rendered HTML markup. | |

### Visualizer options {#displacy_options}

The `options` argument lets you specify additional settings for each visualizer.
If a setting is not present in the options, the default value will be used.

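The fallback to default values can be pictured as a dictionary merge. The following is a minimal sketch, not displaCy's actual implementation: `DEP_DEFAULTS` and `resolve_options` are hypothetical names, and only a handful of the documented dependency visualizer defaults are included.

```python
from typing import Any, Dict

# A few defaults from the dependency visualizer options table
# (hypothetical constant, for illustration only).
DEP_DEFAULTS: Dict[str, Any] = {
    "compact": False,
    "color": "#000000",
    "bg": "#ffffff",
    "font": "Arial",
}


def resolve_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Return the defaults, overridden by any user-supplied settings."""
    return {**DEP_DEFAULTS, **options}


settings = resolve_options({"compact": True, "color": "blue"})
print(settings["color"], settings["bg"])
```

Settings you pass in win over the defaults, and anything you leave out keeps its default value.
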
#### Dependency Visualizer options {#options-dep}

> #### Example
>
> ```python
> options = {"compact": True, "color": "blue"}
> displacy.serve(doc, style="dep", options=options)
> ```

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemmas in a separate row below the token texts. | `False` |
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, since it avoids long arcs drawn only to attach punctuation. | `True` |
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
| `color` | str | Text color (HEX, RGB or color names). | `'#000000'` |
| `bg` | str | Background color (HEX, RGB or color names). | `'#ffffff'` |
| `font` | str | Font name or font family for all text. | `'Arial'` |
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
| `arrow_stroke` | int | Width of arrow path in px. | `2` |
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
| `distance` | int | Distance between words in px. | `175` / `150` (compact) |

#### Named Entity Visualizer options {#displacy_options-ent}

> #### Example
>
> ```python
> options = {"ents": ["PERSON", "ORG", "PRODUCT"],
>            "colors": {"ORG": "yellow"}}
> displacy.serve(doc, style="ent", options=options)
> ```

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `ents` | list | Entity types to highlight (`None` for all types). | `None` |
| `colors` | dict | Color overrides. Entity types in uppercase should be mapped to color names or values. | `{}` |
| `template` <Tag variant="new">2.2</Tag> | str | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. | see [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) |

By default, displaCy comes with colors for all entity types used by
[spaCy models](/models). If you're using custom entity types, you can use the
`colors` setting to add your own colors for them. Your application or model
package can also expose a
[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
to add custom labels and their colors automatically.

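Since the `template` option is a plain format string, you can check how the `{bg}`, `{text}` and `{label}` placeholders are filled with nothing but `str.format`. The markup below is a made-up example for illustration, not the default template shipped in `templates.py`.

```python
# Hypothetical template; the real default lives in spacy/displacy/templates.py.
ENT_TEMPLATE = (
    '<mark style="background: {bg}; padding: 0.25em">'
    "{text} <span>{label}</span></mark>"
)

# Fill the three documented placeholders for a single entity span.
html = ENT_TEMPLATE.format(bg="yellow", text="Explosion", label="ORG")
print(html)
```

A string like this would be passed as the `template` key of the `options` dict. Note that literal curly braces in a format string need to be doubled.
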
## registry {#registry source="spacy/util.py" new="3"}

spaCy's function registry extends
[Thinc's `registry`](https://thinc.ai/docs/api-config#registry) and allows you
to map strings to functions. You can register functions to create architectures,
optimizers, schedules and more, and then refer to them and set their arguments
in your [config file](/usage/training#config). Python type hints are used to
validate the inputs. See the
[Thinc docs](https://thinc.ai/docs/api-config#registry) for details on the
`registry` methods and our helper library
[`catalogue`](https://github.com/explosion/catalogue) for some background on the
concept of function registries. spaCy also uses the function registry for
language subclasses, model architectures, lookups and pipeline component
factories.

<!-- TODO: improve example? -->

> #### Example
>
> ```python
> import spacy
> from thinc.api import Model
>
> def forward(model, X, is_train):
>     # Placeholder forward pass: return the input and an identity backprop callback
>     return X, lambda dY: dY
>
> @spacy.registry.architectures("CustomNER.v1")
> def custom_ner(nO: int) -> Model:
>     return Model("custom", forward, dims={"nO": nO})
> ```

| Registry name | Description |
| --- | --- |
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `assets` | <!-- TODO: what is this used for again?--> |
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |

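If the idea of mapping strings to functions is new to you, the toy registry below may help. It's a pure-Python sketch of the concept only: `FunctionRegistry`, `schedules` and `constant.v1` are hypothetical names, and the real registry in Thinc and `catalogue` does considerably more (for example validation via type hints and entry points).

```python
from typing import Callable, Dict


class FunctionRegistry:
    """Toy registry mapping string names to functions (hypothetical class)."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable] = {}

    def register(self, name: str) -> Callable[[Callable], Callable]:
        """Return a decorator that stores the function under `name`."""
        def decorator(func: Callable) -> Callable:
            self._functions[name] = func
            return func
        return decorator

    def get(self, name: str) -> Callable:
        """Look up a registered function by its string name."""
        if name not in self._functions:
            raise KeyError(f"No function registered under '{name}'")
        return self._functions[name]


schedules = FunctionRegistry()


@schedules.register("constant.v1")
def constant(rate: float):
    """Yield the same rate forever."""
    while True:
        yield rate


# A config only needs to contain the registered string and the arguments.
config = {"@schedules": "constant.v1", "rate": 0.001}
make_schedule = schedules.get(config["@schedules"])
schedule = make_schedule(rate=config["rate"])
print(next(schedule))
```

The point of the indirection is that the config stays plain data: it names the function and its arguments, and the registry resolves the name at load time.
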
## Training data and alignment {#gold source="spacy/gold"}

### gold.docs_to_json {#docs_to_json tag="function"}

Convert a list of `Doc` objects into the
[JSON-serializable format](/api/data-formats#json-input) used by the
[`spacy train`](/api/cli#train) command. Each input doc will be treated as a
'paragraph' in the output doc.

> #### Example
>
> ```python
> from spacy.gold import docs_to_json
>
> doc = nlp("I like London")
> json_data = docs_to_json([doc])
> ```

| Name | Type | Description |
| --- | --- | --- |
| `docs` | iterable / `Doc` | The `Doc` object(s) to convert. |
| `id` | int | ID to assign to the JSON. Defaults to `0`. |
| **RETURNS** | dict | The data in spaCy's JSON format. |

### gold.align {#align tag="function"}

Calculate alignment tables between two tokenizations, using the Levenshtein
algorithm. The alignment is case-insensitive.

<Infobox title="Important note" variant="warning">

The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.

</Infobox>

> #### Example
>
> ```python
> from spacy.gold import align
>
> bert_tokens = ["obama", "'", "s", "podcast"]
> spacy_tokens = ["obama", "'s", "podcast"]
> alignment = align(bert_tokens, spacy_tokens)
> cost, a2b, b2a, a2b_multi, b2a_multi = alignment
> ```

| Name | Type | Description |
| --- | --- | --- |
| `tokens_a` | list | String values of candidate tokens to align. |
| `tokens_b` | list | String values of reference tokens to align. |
| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. |

The returned tuple contains the following alignment information:

> #### Example
>
> ```python
> a2b = array([0, -1, -1, 2])
> b2a = array([0, 2, 3])
> a2b_multi = {1: 1, 2: 1}
> b2a_multi = {}
> ```
>
> If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If
> there's no one-to-one alignment for a token, it has the value `-1`.

| Name | Type | Description |
| --- | --- | --- |
| `cost` | int | The number of misaligned tokens. |
| `a2b` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`. |
| `b2a` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`. |
| `a2b_multi` | dict | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. |
| `b2a_multi` | dict | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. |

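To get a feel for what the one-to-one and multi mappings express, here's a simplified sketch that aligns two tokenizations by comparing character spans. It is not the Levenshtein-based algorithm used by `align`, and `char_spans` and `align_by_offsets` are hypothetical helpers. It reports a one-to-one match only when two tokens cover exactly the same characters, so its `b2a` output can differ from what `align` returns, and it doesn't compute a cost.

```python
from typing import Dict, List, Tuple


def char_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character span of each token in the joined string."""
    spans, offset = [], 0
    for token in tokens:
        spans.append((offset, offset + len(token)))
        offset += len(token)
    return spans


def align_by_offsets(tokens_a: List[str], tokens_b: List[str]):
    """Align two tokenizations of the same string via character spans."""
    if "".join(tokens_a).lower() != "".join(tokens_b).lower():
        raise ValueError("Both tokenizations must add up to the same string")
    spans_a, spans_b = char_spans(tokens_a), char_spans(tokens_b)

    def one_direction(src, tgt):
        one_to_one: List[int] = []
        multi: Dict[int, int] = {}
        for i, (start, end) in enumerate(src):
            match = -1
            for j, (t_start, t_end) in enumerate(tgt):
                if (start, end) == (t_start, t_end):
                    match = j  # identical span: one-to-one alignment
                elif t_start <= start and end <= t_end:
                    multi[i] = j  # source token is a piece of a longer target token
            one_to_one.append(match)
        return one_to_one, multi

    a2b, a2b_multi = one_direction(spans_a, spans_b)
    b2a, b2a_multi = one_direction(spans_b, spans_a)
    return a2b, b2a, a2b_multi, b2a_multi


bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
a2b, b2a, a2b_multi, b2a_multi = align_by_offsets(bert_tokens, spacy_tokens)
print(a2b, b2a, a2b_multi, b2a_multi)
```

For these tokens, the sketch gives `a2b == [0, -1, -1, 2]` and `a2b_multi == {1: 1, 2: 1}`, because `"'"` and `"s"` are both pieces of `"'s"`. Its `b2a` is `[0, -1, 3]`, since `"'s"` has no exact counterpart among the candidate tokens.
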
### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}

Encode labelled spans into per-token tags, using the
[BILUO scheme](/usage/linguistic-features#accessing-ner) (Begin, In, Last, Unit,
Out). Returns a list of strings, describing the tags. Each tag string has the
form `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`,
`"L"`, `"U"`. The string `"-"` is used where the entity offsets don't align with
the tokenization in the `Doc` object. The training algorithm will view these as
missing values. `O` denotes a non-entity token. `B` denotes the beginning of a
multi-token entity, `I` the inside of an entity of three or more tokens, and `L`
the end of an entity of two or more tokens. `U` denotes a single-token entity.

> #### Example
>
> ```python
> from spacy.gold import biluo_tags_from_offsets
>
> doc = nlp("I like London.")
> entities = [(7, 13, "LOC")]
> tags = biluo_tags_from_offsets(doc, entities)
> assert tags == ["O", "O", "U-LOC", "O"]
> ```

| Name | Type | Description |
| --- | --- | --- |
| `doc` | `Doc` | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
| `entities` | iterable | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. |
| **RETURNS** | list | A list of strings, describing the [BILUO](/usage/linguistic-features#accessing-ner) tags. |

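The encoding rules above can be sketched without spaCy by representing each token as a `(start, end)` character span. This is an illustration of the scheme under simplifying assumptions, not the library's implementation: `biluo_from_offsets` is a hypothetical helper, and the way it marks misaligned entities with `"-"` is only one plausible reading of the description.

```python
from typing import List, Tuple


def biluo_from_offsets(
    token_spans: List[Tuple[int, int]],
    entities: List[Tuple[int, int, str]],
) -> List[str]:
    """Encode (start, end, label) character offsets as per-token BILUO tags."""
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    tags = ["O"] * len(token_spans)
    for ent_start, ent_end, label in entities:
        first, last = starts.get(ent_start), ends.get(ent_end)
        if first is None or last is None:
            # Offsets don't fall on token boundaries: mark overlapping tokens as missing.
            for i, (tok_start, tok_end) in enumerate(token_spans):
                if tok_start < ent_end and ent_start < tok_end:
                    tags[i] = "-"
        elif first == last:
            tags[first] = f"U-{label}"  # single-token entity
        else:
            tags[first] = f"B-{label}"  # beginning of the entity
            for i in range(first + 1, last):
                tags[i] = f"I-{label}"  # inside the entity
            tags[last] = f"L-{label}"  # last token of the entity
    return tags


# Character spans of the tokens in "I like London."
spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(biluo_from_offsets(spans, [(7, 13, "LOC")]))
```

With the spans for "I like London.", the entity `(7, 13, "LOC")` covers exactly one token and is tagged `U-LOC`, matching the example above.
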
### gold.offsets_from_biluo_tags {#offsets_from_biluo_tags tag="function"}

Convert per-token tags following the
[BILUO scheme](/usage/linguistic-features#accessing-ner) into entity offsets.

> #### Example
>
> ```python
> from spacy.gold import offsets_from_biluo_tags
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> entities = offsets_from_biluo_tags(doc, tags)
> assert entities == [(7, 13, "LOC")]
> ```

| Name | Type | Description |
| --- | --- | --- |
| `doc` | `Doc` | The document that the BILUO tags refer to. |
| `tags` | iterable | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string has the form `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
| **RETURNS** | list | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. |

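Going in the other direction is a single pass over the tags. The sketch below again works on plain `(start, end)` token spans instead of a `Doc`; `offsets_from_biluo` is a hypothetical helper and it simply skips tags it can't pair up, which may not match how spaCy treats malformed sequences.

```python
from typing import List, Optional, Tuple


def offsets_from_biluo(
    token_spans: List[Tuple[int, int]],
    tags: List[str],
) -> List[Tuple[int, int, str]]:
    """Decode per-token BILUO tags into (start, end, label) character offsets."""
    entities: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    for (tok_start, tok_end), tag in zip(token_spans, tags):
        if tag in ("", "O", "-"):
            continue  # non-entity or missing value
        action, label = tag.split("-", 1)
        if action == "U":
            entities.append((tok_start, tok_end, label))
        elif action == "B":
            start = tok_start  # remember where the entity begins
        elif action == "L" and start is not None:
            entities.append((start, tok_end, label))
            start = None
        # "I" tags need no action: the entity simply continues
    return entities


# Character spans of the tokens in "I like London."
spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(offsets_from_biluo(spans, ["O", "O", "U-LOC", "O"]))
```
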
### gold.spans_from_biluo_tags {#spans_from_biluo_tags tag="function" new="2.1"}

Convert per-token tags following the
[BILUO scheme](/usage/linguistic-features#accessing-ner) into
[`Span`](/api/span) objects. This can be used to create entity spans from
token-based tags, e.g. to overwrite the `doc.ents`.

> #### Example
>
> ```python
> from spacy.gold import spans_from_biluo_tags
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> doc.ents = spans_from_biluo_tags(doc, tags)
> ```

| Name | Type | Description |
| --- | --- | --- |
| `doc` | `Doc` | The document that the BILUO tags refer to. |
| `tags` | iterable | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string has the form `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
| **RETURNS** | list | A sequence of `Span` objects with added entity labels. |

## Utility functions {#util source="spacy/util.py"}

spaCy comes with a small collection of utility functions located in
[`spacy/util.py`](https://github.com/explosion/spaCy/tree/master/spacy/util.py).
Because utility functions are mostly intended for **internal use within spaCy**,
their behavior may change with future releases. The functions documented on this
page should be safe to use and we'll try to ensure backwards compatibility.
However, we recommend having additional tests in place if your application
depends on any of spaCy's utilities.

<!-- TODO: document new config-related util functions? -->

### util.get_lang_class {#util.get_lang_class tag="function"}

Import and load a `Language` class. Allows lazy-loading
[language data](/usage/adding-languages) and importing languages using the
two-letter language code. To add a language code for a custom language class,
you can use the [`set_lang_class`](/api/top-level#util.set_lang_class) helper.

> #### Example
>
> ```python
> for lang_id in ["en", "de"]:
>     lang_class = util.get_lang_class(lang_id)
>     lang = lang_class()
> ```

| Name | Type | Description |
| --- | --- | --- |
| `lang` | str | Two-letter language code, e.g. `'en'`. |
| **RETURNS** | `Language` | Language class. |

### util.set_lang_class {#util.set_lang_class tag="function"}

Set a custom `Language` class name that can be loaded via
[`get_lang_class`](/api/top-level#util.get_lang_class). If your model uses a
custom language, this is required so that spaCy can load the correct class from
the two-letter language code.

> #### Example
>
> ```python
> from spacy.lang.xy import CustomLanguage
>
> util.set_lang_class('xy', CustomLanguage)
> lang_class = util.get_lang_class('xy')
> nlp = lang_class()
> ```

| Name | Type | Description |
| --- | --- | --- |
| `name` | str | Two-letter language code, e.g. `'en'`. |
| `cls` | `Language` | The language class, e.g. `English`. |

### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}

Check whether a `Language` class is already loaded. `Language` classes are
loaded lazily, to avoid expensive setup code associated with the language data.

> #### Example
>
> ```python
> lang_cls = util.get_lang_class("en")
> assert util.lang_class_is_loaded("en") is True
> assert util.lang_class_is_loaded("de") is False
> ```

| Name | Type | Description |
| --- | --- | --- |
| `name` | str | Two-letter language code, e.g. `'en'`. |
| **RETURNS** | bool | Whether the class has been loaded. |

### util.load_model {#util.load_model tag="function" new="2"}

Load a model from a package or data path. If called with a package name, spaCy
will assume the model is a Python package and import and call its `load()`
method. If called with a path, spaCy will assume it's a data directory, read the
language and pipeline settings from the `meta.json` and initialize a `Language`
class. The model data will then be loaded in via
[`Language.from_disk()`](/api/language#from_disk).

> #### Example
>
> ```python
> nlp = util.load_model("en_core_web_sm")
> nlp = util.load_model("en_core_web_sm", disable=["ner"])
> nlp = util.load_model("/path/to/data")
> ```

| Name | Type | Description |
| --- | --- | --- |
| `name` | str | Package name or model path. |
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
| **RETURNS** | `Language` | `Language` class with the loaded model. |

### util.load_model_from_path {#util.load_model_from_path tag="function" new="2"}

Load a model from a data directory path. Creates the [`Language`](/api/language)
class and pipeline based on the directory's `meta.json` and then calls
[`from_disk()`](/api/language#from_disk) with the path. This function also makes
it easy to test a new model that you haven't packaged yet.

> #### Example
>
> ```python
> nlp = util.load_model_from_path("/path/to/data")
> ```

| Name | Type | Description |
| --- | --- | --- |
| `model_path` | str | Path to model data directory. |
| `meta` | dict | Model meta data. If `False`, spaCy will try to load the meta from a `meta.json` in the same directory. |
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
| **RETURNS** | `Language` | `Language` class with the loaded model. |

### util.load_model_from_init_py {#util.load_model_from_init_py tag="function" new="2"}

A helper function to use in the `load()` method of a model package's
[`__init__.py`](https://github.com/explosion/spacy-models/tree/master/template/model/xx_model_name/__init__.py).

> #### Example
>
> ```python
> from spacy.util import load_model_from_init_py
>
> def load(**overrides):
>     return load_model_from_init_py(__file__, **overrides)
> ```

| Name | Type | Description |
| --- | --- | --- |
| `init_file` | str | Path to model's `__init__.py`, i.e. `__file__`. |
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
| **RETURNS** | `Language` | `Language` class with the loaded model. |

### util.get_model_meta {#util.get_model_meta tag="function" new="2"}

Get a model's `meta.json` from a directory path and validate its contents.

> #### Example
>
> ```python
> meta = util.get_model_meta("/path/to/model")
> ```

| Name | Type | Description |
| --- | --- | --- |
| `path` | str / `Path` | Path to model directory. |
| **RETURNS** | dict | The model's meta data. |

### util.is_package {#util.is_package tag="function"}

Check if string maps to a package installed via pip. Mainly used to validate
[model packages](/usage/models).

> #### Example
>
> ```python
> util.is_package("en_core_web_sm") # True
> util.is_package("xyz") # False
> ```

| Name | Type | Description |
| --- | --- | --- |
| `name` | str | Name of package. |
| **RETURNS** | `bool` | `True` if installed package, `False` if not. |

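If you need a similar check outside of spaCy, the standard library can tell you whether a distribution is installed. The sketch below assumes Python 3.8+ for `importlib.metadata`; `is_installed_package` is a hypothetical helper and isn't guaranteed to behave exactly like `util.is_package`. Keep in mind that it looks up distribution names, which can differ from import names.

```python
from importlib.metadata import PackageNotFoundError, distribution


def is_installed_package(name: str) -> bool:
    """Return True if a distribution called `name` is installed."""
    try:
        distribution(name)  # raises if no matching distribution is found
    except PackageNotFoundError:
        return False
    return True


print(is_installed_package("definitely-not-a-real-package-xyz"))
```
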
### util.get_package_path {#util.get_package_path tag="function" new="2"}

Get path to an installed package. Mainly used to resolve the location of
[model packages](/usage/models). Currently imports the package to find its path.

> #### Example
>
> ```python
> util.get_package_path("en_core_web_sm")
> # /usr/lib/python3.6/site-packages/en_core_web_sm
> ```

| Name | Type | Description |
| --- | --- | --- |
| `package_name` | str | Name of installed package. |
| **RETURNS** | `Path` | Path to model package directory. |

### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"}

Check if user is running spaCy from a [Jupyter](https://jupyter.org) notebook by
detecting the IPython kernel. Mainly used for the
[`displacy`](/api/top-level#displacy) visualizer.

> #### Example
>
> ```python
> html = "<h1>Hello world!</h1>"
> if util.is_in_jupyter():
>     from IPython.display import display, HTML
>     display(HTML(html))
> ```

| Name | Type | Description |
| --- | --- | --- |
| **RETURNS** | bool | `True` if in Jupyter, `False` if not. |

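Detecting the IPython kernel usually comes down to checking for the `get_ipython` function that IPython injects into the global namespace. The following is a sketch of that common idiom rather than a copy of spaCy's code; `running_in_jupyter` is a hypothetical name, and the shell class name it checks is an assumption about how notebook kernels identify themselves.

```python
def running_in_jupyter() -> bool:
    """Return True if the code appears to run inside a Jupyter kernel."""
    try:
        # `get_ipython` is only defined inside an IPython session.
        shell_name = get_ipython().__class__.__name__  # type: ignore[name-defined]
    except NameError:
        return False  # plain Python interpreter
    # Jupyter notebooks and the qtconsole use the ZMQ-based shell.
    return shell_name == "ZMQInteractiveShell"


print(running_in_jupyter())
```
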
### util.compile_prefix_regex {#util.compile_prefix_regex tag="function"}
|
|
|
|
Compile a sequence of prefix rules into a regex object.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> prefixes = ("§", "%", "=", r"\+")
|
|
> prefix_regex = util.compile_prefix_regex(prefixes)
|
|
> nlp.tokenizer.prefix_search = prefix_regex.search
|
|
> ```
|
|
|
|
| Name | Type | Description |
|
|
| ----------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `entries` | tuple | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). |
|
|
| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object. to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). |
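
To make the idea of "compiling prefix rules" concrete, here is a minimal sketch
using only the `re` module. It assumes each rule is anchored to the start of
the string with `^` and the rules are joined as alternatives. The helper name
`compile_prefix_sketch` is hypothetical and the actual implementation may
differ.

```python
import re


def compile_prefix_sketch(entries):
    # Anchor every non-empty rule at the start of the string and join the
    # rules with "|" so that any one of them can match.
    expression = "|".join("^" + piece for piece in entries if piece.strip())
    return re.compile(expression)


prefix_regex = compile_prefix_sketch(("§", "%", "=", r"\+"))
print(prefix_regex.search("+42").group())  # +
print(prefix_regex.search("42+"))  # None
```

Note that rules are regular expression fragments, which is why the literal
plus sign has to be escaped as `r"\+"`.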

### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"}

Compile a sequence of suffix rules into a regex object.

> #### Example
>
> ```python
> suffixes = ("'s", "'S", r"(?<=[0-9])\+")
> suffix_regex = util.compile_suffix_regex(suffixes)
> nlp.tokenizer.suffix_search = suffix_regex.search
> ```

| Name        | Type                                                          | Description                                                                                                                               |
| ----------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `entries`   | tuple                                                         | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). |
| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object, to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes).                                                 |
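
The suffix case can be sketched the same way, this time assuming each rule is
anchored to the end of the string with `$`. The helper name
`compile_suffix_sketch` is hypothetical and only illustrates the idea.

```python
import re


def compile_suffix_sketch(entries):
    # Anchor every non-empty rule at the end of the string and join the
    # rules with "|" as alternatives.
    expression = "|".join(piece + "$" for piece in entries if piece.strip())
    return re.compile(expression)


suffix_regex = compile_suffix_sketch(("'s", "'S", r"(?<=[0-9])\+"))
print(suffix_regex.search("John's").group())  # 's
print(suffix_regex.search("5+").group())  # +
print(suffix_regex.search("a+"))  # None
```

The lookbehind `(?<=[0-9])` restricts the plus-sign rule to strings where a
digit comes directly before it, so `"a+"` is left alone.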

### util.compile_infix_regex {#util.compile_infix_regex tag="function"}

Compile a sequence of infix rules into a regex object.

> #### Example
>
> ```python
> infixes = ("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])")
> infix_regex = util.compile_infix_regex(infixes)
> nlp.tokenizer.infix_finditer = infix_regex.finditer
> ```

| Name        | Type                                                          | Description                                                                                                                             |
| ----------- | ------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `entries`   | tuple                                                         | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). |
| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object, to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes).                                              |
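
Infix rules can match anywhere inside a string, so a sketch of the compiled
pattern needs no anchors, and `finditer` reports every match position. The
helper name `compile_infix_sketch` is hypothetical, and joining the rules with
`|` is an assumption about how such a pattern could be built.

```python
import re


def compile_infix_sketch(entries):
    # Join the non-empty rules as plain alternatives, without anchoring.
    expression = "|".join(piece for piece in entries if piece.strip())
    return re.compile(expression)


infixes = ("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])")
infix_regex = compile_infix_sketch(infixes)

# Each match marks a character span where a string could be split.
spans = [match.span() for match in infix_regex.finditer("well-known 3+4")]
print(spans)  # [(4, 5), (12, 13)]
```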

### util.minibatch {#util.minibatch tag="function" new="2"}

Iterate over batches of items. `size` may be an iterator, so that the batch size
can vary on each step.

> #### Example
>
> ```python
> from spacy.util import minibatch
>
> batches = minibatch(train_data, size=8)
> for batch in batches:
>     nlp.update(batch)
> ```

| Name       | Type           | Description            |
| ---------- | -------------- | ---------------------- |
| `items`    | iterable       | The items to batch up. |
| `size`     | int / iterable | The batch size(s).     |
| **YIELDS** | list           | The batches.           |
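
The batching behavior, including a `size` that varies per step, can be
illustrated with a small generator built on `itertools`. This is a sketch of
the technique, not spaCy's implementation. `minibatch_sketch` is a
hypothetical name, and stopping when the size iterator runs out is an
assumption made for this example.

```python
import itertools


def minibatch_sketch(items, size=8):
    # Accept either a fixed integer size or an iterable of sizes.
    sizes = itertools.repeat(size) if isinstance(size, int) else iter(size)
    items = iter(items)
    for batch_size in sizes:
        # Pull up to batch_size items; an empty batch means we are done.
        batch = list(itertools.islice(items, int(batch_size)))
        if not batch:
            return
        yield batch


print(list(minibatch_sketch(range(10), size=4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(list(minibatch_sketch(range(7), size=iter([1, 2, 4]))))
# [[0], [1, 2], [3, 4, 5, 6]]
```

Passing an iterator of sizes makes it possible to start training with small
batches and increase the batch size over time.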

### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"}

Filter a sequence of [`Span`](/api/span) objects and remove duplicates or
overlaps. Useful for creating named entities (where one token can only be part
of one entity) or when merging spans with
[`Retokenizer.merge`](/api/doc#retokenizer.merge). When spans overlap, the
(first) longest span is preferred over shorter spans.

> #### Example
>
> ```python
> from spacy.util import filter_spans
>
> doc = nlp("This is a sentence.")
> spans = [doc[0:2], doc[0:2], doc[0:4]]
> filtered = filter_spans(spans)
> ```

| Name        | Type     | Description          |
| ----------- | -------- | -------------------- |
| `spans`     | iterable | The spans to filter. |
| **RETURNS** | list     | The filtered spans.  |
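
The "(first) longest span wins" rule can be sketched without spaCy by
representing each span as a `(start, end)` tuple of token offsets. The tuple
representation and the name `filter_spans_sketch` are assumptions made for
illustration. The real function operates on `Span` objects.

```python
def filter_spans_sketch(spans):
    # Sort longest first; among equally long spans, the earliest start wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    result = []
    seen_tokens = set()
    for start, end in ordered:
        # A span that is no longer than the spans already kept overlaps one
        # of them exactly when its first or last token is already claimed.
        if start not in seen_tokens and end - 1 not in seen_tokens:
            result.append((start, end))
            seen_tokens.update(range(start, end))
    # Return the surviving spans in document order.
    return sorted(result)


# Mirrors the example above: two duplicates plus a longer overlapping span.
print(filter_spans_sketch([(0, 2), (0, 2), (0, 4)]))  # [(0, 4)]
print(filter_spans_sketch([(0, 2), (1, 3), (5, 6)]))  # [(0, 2), (5, 6)]
```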

### util.get_words_and_spaces {#get_words_and_spaces tag="function" new="3"}

<!-- TODO: document -->

| Name        | Type  | Description |
| ----------- | ----- | ----------- |
| `words`     | list  |             |
| `text`      | str   |             |
| **RETURNS** | tuple |             |