mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-11 12:18:04 +03:00
---
title: Top-level Functions
menu:
  - ['spacy', 'spacy']
  - ['displacy', 'displacy']
  - ['Utility Functions', 'util']
  - ['Compatibility', 'compat']
---

## spaCy {#spacy hidden="true"}

### spacy.load {#spacy.load tag="function" model="any"}

Load a model via its [shortcut link](/usage/models#usage), the name of an
installed [model package](/usage/training#models-generating), a unicode path or
a `Path`-like object. spaCy will try resolving the load argument in this order.
If a model is loaded from a shortcut link or package name, spaCy will assume
it's a Python package, import it and call the model's own `load()` method. If
a model is loaded from a path, spaCy will assume it's a data directory, read the
language and pipeline settings from the `meta.json` and initialize the
`Language` class. The data will then be loaded in via
[`Language.from_disk`](/api/language#from_disk).

> #### Example
>
> ```python
> nlp = spacy.load("en") # shortcut link
> nlp = spacy.load("en_core_web_sm") # package
> nlp = spacy.load("/path/to/en") # unicode path
> nlp = spacy.load(Path("/path/to/en")) # pathlib Path
>
> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
> ```

| Name | Type | Description |
| ----------- | ---------------- | --------------------------------------------------------------------------------- |
| `name` | unicode / `Path` | Model to load, i.e. shortcut link, package name or path. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Language` | A `Language` object with the loaded model. |
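
The resolution order described above can be sketched in plain Python. This is an illustration only, not spaCy's implementation: `resolve_model_name`, the `shortcut_links` mapping and the `installed_packages` set are hypothetical stand-ins for spaCy's internal lookups.

```python
from pathlib import Path


def resolve_model_name(name, shortcut_links, installed_packages):
    """Classify a load argument as 'link', 'package' or 'path'.

    All three parameters are hypothetical stand-ins used for illustration.
    """
    if isinstance(name, str):
        if name in shortcut_links:       # 1. shortcut link, e.g. "en"
            return "link"
        if name in installed_packages:   # 2. installed model package
            return "package"
        if Path(name).exists():          # 3. string path to a data directory
            return "path"
    elif isinstance(name, Path) and name.exists():  # 4. Path-like object
        return "path"
    # Nothing matched: raise instead of returning an empty Language object
    raise IOError("Can't find model '{}'".format(name))


links = {"en": "en_core_web_sm"}
packages = {"en_core_web_sm"}
print(resolve_model_name("en", links, packages))              # link
print(resolve_model_name("en_core_web_sm", links, packages))  # package
print(resolve_model_name(Path("."), links, packages))         # path
```

Because the checks run in a fixed order, a shortcut link that shares its name with a local directory would win over the directory in this sketch.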

Essentially, `spacy.load()` is a convenience wrapper that reads the language ID
and pipeline components from a model's `meta.json`, initializes the `Language`
class, loads in the model data and returns it.

```python
### Abstract example
cls = util.get_lang_class(lang)         # get language for ID, e.g. 'en'
nlp = cls()                             # initialise the language
for name in pipeline:
    component = nlp.create_pipe(name)   # create each pipeline component
    nlp.add_pipe(component)             # add component to pipeline
nlp.from_disk(model_data_path)          # load in model data
```

<Infobox title="Changed in v2.0" variant="warning">

As of spaCy 2.0, the `path` keyword argument is deprecated. spaCy will also
raise an error if no model could be loaded and never just return an empty
`Language` object. If you need a blank language, you can use the new function
[`spacy.blank()`](/api/top-level#spacy.blank) or import the class explicitly,
e.g. `from spacy.lang.en import English`.

```diff
- nlp = spacy.load("en", path="/model")
+ nlp = spacy.load("/model")
```

</Infobox>

### spacy.blank {#spacy.blank tag="function" new="2"}

Create a blank model of a given language class. This function is the twin of
`spacy.load()`.

> #### Example
>
> ```python
> nlp_en = spacy.blank("en")
> nlp_de = spacy.blank("de")
> ```

| Name | Type | Description |
| ----------- | ---------- | ------------------------------------------------------------------------------------------------ |
| `name` | unicode | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Language` | An empty `Language` object of the appropriate subclass. |

### spacy.info {#spacy.info tag="function"}

The same as the [`info` command](/api/cli#info). Pretty-print information about
your installation, models and local setup from within spaCy. To get the model
meta data as a dictionary instead, you can use the `meta` attribute on your
`nlp` object with a loaded model, e.g. `nlp.meta`.

> #### Example
>
> ```python
> spacy.info()
> spacy.info("en")
> spacy.info("de", markdown=True)
> ```

| Name | Type | Description |
| ---------- | ------- | ------------------------------------------------------------- |
| `model` | unicode | A model, i.e. shortcut link, package name or path (optional). |
| `markdown` | bool | Print information as Markdown. |

### spacy.explain {#spacy.explain tag="function"}

Get a description for a given POS tag, dependency label or entity type. For a
list of available terms, see
[`glossary.py`](https://github.com/explosion/spaCy/tree/master/spacy/glossary.py).

> #### Example
>
> ```python
> spacy.explain("NORP")
> # Nationalities or religious or political groups
>
> doc = nlp("Hello world")
> for word in doc:
>     print(word.text, word.tag_, spacy.explain(word.tag_))
> # Hello UH interjection
> # world NN noun, singular or mass
> ```

| Name | Type | Description |
| ----------- | ------- | -------------------------------------------------------- |
| `term` | unicode | Term to explain. |
| **RETURNS** | unicode | The explanation, or `None` if not found in the glossary. |
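
The lookup behavior in the table above (a description, or `None` for unknown terms) amounts to a dictionary lookup. A minimal sketch, assuming a hypothetical two-entry glossary named `GLOSSARY_SKETCH`; the real terms live in `glossary.py`:

```python
# Hypothetical two-entry glossary; the real terms live in spacy/glossary.py.
GLOSSARY_SKETCH = {
    "NORP": "Nationalities or religious or political groups",
    "UH": "interjection",
}


def explain_sketch(term):
    """Return the description for a term, or None if it is unknown."""
    return GLOSSARY_SKETCH.get(term)


print(explain_sketch("NORP"))     # Nationalities or religious or political groups
print(explain_sketch("MADE_UP"))  # None
```

Since unknown terms yield `None` rather than raising, callers that format the result into a string may want to handle that case explicitly.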

### spacy.prefer_gpu {#spacy.prefer_gpu tag="function" new="2.0.14"}

Allocate data and perform operations on [GPU](/usage/#gpu), if available. If
data has already been allocated on CPU, it will not be moved. Ideally, this
function should be called right after importing spaCy and _before_ loading any
models.

> #### Example
>
> ```python
> import spacy
> activated = spacy.prefer_gpu()
> nlp = spacy.load("en_core_web_sm")
> ```

| Name | Type | Description |
| ----------- | ---- | ------------------------------ |
| **RETURNS** | bool | Whether the GPU was activated. |

### spacy.require_gpu {#spacy.require_gpu tag="function" new="2.0.14"}

Allocate data and perform operations on [GPU](/usage/#gpu). Will raise an error
if no GPU is available. If data has already been allocated on CPU, it will not
be moved. Ideally, this function should be called right after importing spaCy
and _before_ loading any models.

> #### Example
>
> ```python
> import spacy
> spacy.require_gpu()
> nlp = spacy.load("en_core_web_sm")
> ```

| Name | Type | Description |
| ----------- | ---- | ----------- |
| **RETURNS** | bool | `True` |

## displaCy {#displacy source="spacy/displacy"}

As of v2.0, spaCy comes with a built-in visualization suite. For more info and
examples, see the usage guide on [visualizing spaCy](/usage/visualizers).

### displacy.serve {#displacy.serve tag="method" new="2"}

Serve a dependency parse tree or named entity visualization to view it in your
browser. Will run a simple web server.

> #### Example
>
> ```python
> import spacy
> from spacy import displacy
> nlp = spacy.load("en_core_web_sm")
> doc1 = nlp("This is a sentence.")
> doc2 = nlp("This is another sentence.")
> displacy.serve([doc1, doc2], style="dep")
> ```

| Name | Type | Description | Default |
| --------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ----------- |
| `docs` | list, `Doc`, `Span` | Document(s) to visualize. | |
| `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
| `page` | bool | Render markup as full HTML page. | `True` |
| `minify` | bool | Minify HTML markup. | `False` |
| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` |
| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
| `port` | int | Port to serve visualization. | `5000` |
| `host` | unicode | Host to serve visualization. | `'0.0.0.0'` |

### displacy.render {#displacy.render tag="method" new="2"}

Render a dependency parse tree or named entity visualization.

> #### Example
>
> ```python
> import spacy
> from spacy import displacy
> nlp = spacy.load("en_core_web_sm")
> doc = nlp("This is a sentence.")
> html = displacy.render(doc, style="dep")
> ```

| Name | Type | Description | Default |
| ----------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `docs` | list, `Doc`, `Span` | Document(s) to visualize. | |
| `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
| `page` | bool | Render markup as full HTML page. | `False` |
| `minify` | bool | Minify HTML markup. | `False` |
| `jupyter` | bool | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None`. | `None` |
| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` |
| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
| **RETURNS** | unicode | Rendered HTML markup. | |

### Visualizer options {#displacy_options}

The `options` argument lets you specify additional settings for each visualizer.
If a setting is not present in the options, the default value will be used.
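
The fallback to defaults can be pictured as a dictionary merge. The sketch below is an assumption about the general pattern, not displaCy's actual code; `resolve_options` and `DEP_DEFAULTS` are hypothetical names, and the default values shown are a subset taken from the dependency visualizer options.

```python
# A subset of the documented dependency visualizer defaults.
DEP_DEFAULTS = {"compact": False, "color": "#000000", "bg": "#ffffff", "font": "Arial"}


def resolve_options(options, defaults=DEP_DEFAULTS):
    """Return the defaults, overridden by whatever the user supplied."""
    settings = dict(defaults)   # copy so the shared defaults stay untouched
    settings.update(options)
    return settings


settings = resolve_options({"compact": True, "color": "blue"})
print(settings["color"])  # blue
print(settings["bg"])     # #ffffff
```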

#### Dependency Visualizer options {#options-dep}

> #### Example
>
> ```python
> options = {"compact": True, "color": "blue"}
> displacy.serve(doc, style="dep", options=options)
> ```

| Name | Type | Description | Default |
| ------------------------------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemmas in a separate row below the token texts. | `False` |
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it avoids long arcs that attach punctuation. | `True` |
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
| `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` |
| `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` |
| `font` | unicode | Font name or font family for all text. | `'Arial'` |
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
| `arrow_stroke` | int | Width of arrow path in px. | `2` |
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
| `distance` | int | Distance between words in px. | `175` / `150` (compact) |

#### Named Entity Visualizer options {#displacy_options-ent}

> #### Example
>
> ```python
> options = {"ents": ["PERSON", "ORG", "PRODUCT"],
>            "colors": {"ORG": "yellow"}}
> displacy.serve(doc, style="ent", options=options)
> ```

| Name | Type | Description | Default |
| --------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------ |
| `ents` | list | Entity types to highlight (`None` for all types). | `None` |
| `colors` | dict | Color overrides. Entity types in uppercase should be mapped to color names or values. | `{}` |
| `template` <Tag variant="new">2.2</Tag> | unicode | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. | see [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) |

By default, displaCy comes with colors for all
[entity types supported by spaCy](/api/annotation#named-entities). If you're
using custom entity types, you can use the `colors` setting to add your own
colors for them. Your application or model package can also expose a
[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
to add custom labels and their colors automatically.

## Utility functions {#util source="spacy/util.py"}

spaCy comes with a small collection of utility functions located in
[`spacy/util.py`](https://github.com/explosion/spaCy/tree/master/spacy/util.py).
Because utility functions are mostly intended for **internal use within spaCy**,
their behavior may change with future releases. The functions documented on this
page should be safe to use and we'll try to ensure backwards compatibility.
However, we recommend having additional tests in place if your application
depends on any of spaCy's utilities.

### util.get_data_path {#util.get_data_path tag="function"}

Get path to the data directory where spaCy looks for models. Defaults to
`spacy/data`.

| Name | Type | Description |
| ---------------- | --------------- | ------------------------------------------------------- |
| `require_exists` | bool | Only return path if it exists, otherwise return `None`. |
| **RETURNS** | `Path` / `None` | Data path or `None`. |

### util.set_data_path {#util.set_data_path tag="function"}

Set custom path to the data directory where spaCy looks for models.

> #### Example
>
> ```python
> util.set_data_path("/custom/path")
> util.get_data_path()
> # PosixPath('/custom/path')
> ```

| Name | Type | Description |
| ------ | ---------------- | --------------------------- |
| `path` | unicode / `Path` | Path to new data directory. |

### util.get_lang_class {#util.get_lang_class tag="function"}

Import and load a `Language` class. Allows lazy-loading
[language data](/usage/adding-languages) and importing languages using the
two-letter language code. To add a language code for a custom language class,
you can use the [`set_lang_class`](/api/top-level#util.set_lang_class) helper.

> #### Example
>
> ```python
> for lang_id in ["en", "de"]:
>     lang_class = util.get_lang_class(lang_id)
>     lang = lang_class()
>     tokenizer = lang.Defaults.create_tokenizer()
> ```

| Name | Type | Description |
| ----------- | ---------- | -------------------------------------- |
| `lang` | unicode | Two-letter language code, e.g. `'en'`. |
| **RETURNS** | `Language` | Language class. |

### util.set_lang_class {#util.set_lang_class tag="function"}

Set a custom `Language` class name that can be loaded via
[`get_lang_class`](/api/top-level#util.get_lang_class). If your model uses a
custom language, this is required so that spaCy can load the correct class from
the two-letter language code.

> #### Example
>
> ```python
> from spacy.lang.xy import CustomLanguage
>
> util.set_lang_class('xy', CustomLanguage)
> lang_class = util.get_lang_class('xy')
> nlp = lang_class()
> ```

| Name | Type | Description |
| ------ | ---------- | -------------------------------------- |
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
| `cls` | `Language` | The language class, e.g. `English`. |

### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}

Check whether a `Language` class is already loaded. `Language` classes are
loaded lazily, to avoid expensive setup code associated with the language data.

> #### Example
>
> ```python
> lang_cls = util.get_lang_class("en")
> assert util.lang_class_is_loaded("en") is True
> assert util.lang_class_is_loaded("de") is False
> ```

| Name | Type | Description |
| ----------- | ------- | -------------------------------------- |
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
| **RETURNS** | bool | Whether the class has been loaded. |
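
One way to picture how lazy loading, `set_lang_class` and `lang_class_is_loaded` fit together is a registry that is filled on first access. This is a simplified sketch under that assumption, not spaCy's source; every name ending in `_sketch`, the `_LANGUAGES` dict, the `loader` callback and `FakeEnglish` are hypothetical.

```python
# Hypothetical registry mapping language codes to already-loaded classes.
_LANGUAGES = {}


def set_lang_class_sketch(name, cls):
    """Register a class under a language code."""
    _LANGUAGES[name] = cls


def get_lang_class_sketch(name, loader):
    """Return the class for `name`, calling `loader` only on first access."""
    if name not in _LANGUAGES:
        _LANGUAGES[name] = loader(name)   # expensive setup happens here, once
    return _LANGUAGES[name]


def lang_class_is_loaded_sketch(name):
    """Report whether a class is registered, without triggering a load."""
    return name in _LANGUAGES


class FakeEnglish:
    """Stand-in for a real Language subclass."""


print(lang_class_is_loaded_sketch("en"))  # False
get_lang_class_sketch("en", lambda code: FakeEnglish)
print(lang_class_is_loaded_sketch("en"))  # True
print(lang_class_is_loaded_sketch("de"))  # False
```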

### util.load_model {#util.load_model tag="function" new="2"}

Load a model from a shortcut link, package or data path. If called with a
shortcut link or package name, spaCy will assume the model is a Python package
and import and call its `load()` method. If called with a path, spaCy will
assume it's a data directory, read the language and pipeline settings from the
`meta.json` and initialize a `Language` class. The model data will then be
loaded in via [`Language.from_disk()`](/api/language#from_disk).

> #### Example
>
> ```python
> nlp = util.load_model("en")
> nlp = util.load_model("en_core_web_sm", disable=["ner"])
> nlp = util.load_model("/path/to/data")
> ```

| Name | Type | Description |
| ------------- | ---------- | -------------------------------------------------------- |
| `name` | unicode | Package name, shortcut link or model path. |
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
| **RETURNS** | `Language` | `Language` class with the loaded model. |

### util.load_model_from_path {#util.load_model_from_path tag="function" new="2"}

Load a model from a data directory path. Creates the [`Language`](/api/language)
class and pipeline based on the directory's meta.json and then calls
[`from_disk()`](/api/language#from_disk) with the path. This function also makes
it easy to test a new model that you haven't packaged yet.

> #### Example
>
> ```python
> nlp = load_model_from_path("/path/to/data")
> ```

| Name | Type | Description |
| ------------- | ---------- | ---------------------------------------------------------------------------------------------------- |
| `model_path` | unicode | Path to model data directory. |
| `meta` | dict | Model meta data. If `False`, spaCy will try to load the meta from a meta.json in the same directory. |
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
| **RETURNS** | `Language` | `Language` class with the loaded model. |

### util.load_model_from_init_py {#util.load_model_from_init_py tag="function" new="2"}

A helper function to use in the `load()` method of a model package's
[`__init__.py`](https://github.com/explosion/spacy-models/tree/master/template/model/xx_model_name/__init__.py).

> #### Example
>
> ```python
> from spacy.util import load_model_from_init_py
>
> def load(**overrides):
>     return load_model_from_init_py(__file__, **overrides)
> ```

| Name | Type | Description |
| ------------- | ---------- | -------------------------------------------------------- |
| `init_file` | unicode | Path to model's `__init__.py`, i.e. `__file__`. |
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
| **RETURNS** | `Language` | `Language` class with the loaded model. |

### util.get_model_meta {#util.get_model_meta tag="function" new="2"}

Get a model's meta.json from a directory path and validate its contents.

> #### Example
>
> ```python
> meta = util.get_model_meta("/path/to/model")
> ```

| Name | Type | Description |
| ----------- | ---------------- | ------------------------ |
| `path` | unicode / `Path` | Path to model directory. |
| **RETURNS** | dict | The model's meta data. |
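
To make "read and validate" concrete, here is a rough sketch of what such a helper could look like. It is not spaCy's implementation: `get_model_meta_sketch` is a hypothetical name, and the set of required keys in `REQUIRED_KEYS` is an assumption made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Assumed minimal set of required settings; spaCy's actual checks may differ.
REQUIRED_KEYS = ("lang", "name", "version")


def get_model_meta_sketch(path):
    """Read meta.json from a model directory and check the assumed keys."""
    model_path = Path(path)
    if not model_path.exists():
        raise IOError("Can't find model directory: {}".format(model_path))
    meta_path = model_path / "meta.json"
    if not meta_path.is_file():
        raise IOError("Could not read meta.json from {}".format(meta_path))
    with meta_path.open(encoding="utf8") as file_:
        meta = json.load(file_)
    for key in REQUIRED_KEYS:
        if not meta.get(key):
            raise ValueError("No valid '{}' setting found in meta.json".format(key))
    return meta


# Demonstrate on a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_meta = {"lang": "en", "name": "demo", "version": "0.0.0"}
    (Path(tmp_dir) / "meta.json").write_text(json.dumps(demo_meta), encoding="utf8")
    print(get_model_meta_sketch(tmp_dir)["lang"])  # en
```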

### util.is_package {#util.is_package tag="function"}

Check if string maps to a package installed via pip. Mainly used to validate
[model packages](/usage/models).

> #### Example
>
> ```python
> util.is_package("en_core_web_sm") # True
> util.is_package("xyz") # False
> ```

| Name | Type | Description |
| ----------- | ------- | -------------------------------------------- |
| `name` | unicode | Name of package. |
| **RETURNS** | `bool` | `True` if installed package, `False` if not. |

### util.get_package_path {#util.get_package_path tag="function" new="2"}

Get path to an installed package. Mainly used to resolve the location of
[model packages](/usage/models). Currently imports the package to find its path.

> #### Example
>
> ```python
> util.get_package_path("en_core_web_sm")
> # /usr/lib/python3.6/site-packages/en_core_web_sm
> ```

| Name | Type | Description |
| -------------- | ------- | -------------------------------- |
| `package_name` | unicode | Name of installed package. |
| **RETURNS** | `Path` | Path to model package directory. |
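
Finding a package's location by importing it can be sketched with the standard library alone. `get_package_path_sketch` is a hypothetical name, and the snippet only illustrates the import-based approach mentioned above rather than reproducing spaCy's code.

```python
import importlib
from pathlib import Path


def get_package_path_sketch(package_name):
    """Import the package and return the directory containing its __init__.py."""
    pkg = importlib.import_module(package_name)
    return Path(pkg.__file__).parent


# "json" is a standard-library package, used here so the example runs anywhere.
print(get_package_path_sketch("json").name)  # json
```

Note that importing has side effects: any top-level code in the package runs as part of the lookup.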

### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"}

Check if user is running spaCy from a [Jupyter](https://jupyter.org) notebook by
detecting the IPython kernel. Mainly used for the
[`displacy`](/api/top-level#displacy) visualizer.

> #### Example
>
> ```python
> html = "<h1>Hello world!</h1>"
> if util.is_in_jupyter():
>     from IPython.display import display, HTML
>     display(HTML(html))
> ```

| Name | Type | Description |
| ----------- | ---- | ------------------------------------- |
| **RETURNS** | bool | `True` if in Jupyter, `False` if not. |
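
Kernel detection of this kind is commonly done by inspecting the shell object that IPython injects into the namespace. The following is a sketch of that common pattern, with `is_in_jupyter_sketch` as a hypothetical name; treat the exact check as an assumption rather than spaCy's definitive logic.

```python
def is_in_jupyter_sketch():
    """Return True only when running under a Jupyter (ZMQ-based) IPython kernel."""
    try:
        # get_ipython() is injected into the namespace by IPython; it is
        # undefined in a plain Python interpreter, which raises NameError.
        shell = get_ipython().__class__.__name__  # noqa: F821
    except NameError:
        return False
    # Notebook kernels use this shell class; the terminal IPython shell uses
    # a different one and is not treated as Jupyter here.
    return shell == "ZMQInteractiveShell"


print(is_in_jupyter_sketch())
```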

### util.update_exc {#util.update_exc tag="function"}

Update, validate and overwrite
[tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). Used to
combine global exceptions with custom, language-specific exceptions. Will raise
an error if key doesn't match `ORTH` values.

> #### Example
>
> ```python
> BASE = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
> NEW = {"a.": [{ORTH: "a.", NORM: "all"}]}
> exceptions = util.update_exc(BASE, NEW)
> # {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]}
> ```

| Name | Type | Description |
| ----------------- | ----- | --------------------------------------------------------------- |
| `base_exceptions` | dict | Base tokenizer exceptions. |
| `*addition_dicts` | dicts | Exception dictionaries to add to the base exceptions, in order. |
| **RETURNS** | dict | Combined tokenizer exceptions. |
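
The merge-and-validate behavior can be illustrated without spaCy. In this sketch, `update_exc_sketch` is a hypothetical name, plain strings stand in for the `ORTH` and `NORM` symbols, and the validation rule (the concatenated `ORTH` values must equal the key) is an interpretation of the description above.

```python
# Stand-ins for the ORTH and NORM symbols from spacy.symbols (assumption).
ORTH, NORM = "ORTH", "NORM"


def update_exc_sketch(base_exceptions, *addition_dicts):
    """Merge exception dicts in order; later dicts overwrite earlier entries."""
    exc = dict(base_exceptions)
    for additions in addition_dicts:
        for orth, token_attrs in additions.items():
            # The concatenated ORTH values must reproduce the key exactly.
            described = "".join(attr[ORTH] for attr in token_attrs)
            if described != orth:
                raise ValueError(
                    "ORTH values {!r} don't match key {!r}".format(described, orth)
                )
            exc[orth] = token_attrs
    return exc


BASE = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
NEW = {"a.": [{ORTH: "a.", NORM: "all"}]}
exceptions = update_exc_sketch(BASE, NEW)
print(exceptions["a."])  # [{'ORTH': 'a.', 'NORM': 'all'}]
```

The validation matters because a tokenizer exception that does not reproduce the original text would silently change the document's content.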

### util.compile_prefix_regex {#util.compile_prefix_regex tag="function"}

Compile a sequence of prefix rules into a regex object.

> #### Example
>
> ```python
> prefixes = ("§", "%", "=", r"\+")
> prefix_regex = util.compile_prefix_regex(prefixes)
> nlp.tokenizer.prefix_search = prefix_regex.search
> ```

| Name | Type | Description |
| ----------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | tuple | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). |
| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object, to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). |
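
Conceptually, compiling prefix rules means joining them into one alternation in which every branch is anchored to the start of the string. A minimal sketch with the standard `re` module follows; `compile_prefix_regex_sketch` is a hypothetical name and the exact pattern spaCy builds may differ.

```python
import re


def compile_prefix_regex_sketch(entries):
    """Join prefix rules into one pattern that only matches at the string start."""
    # Each entry is already a regex fragment, so it is not escaped again.
    expression = "|".join("^" + piece for piece in entries if piece.strip())
    return re.compile(expression)


prefixes = ("§", "%", "=", r"\+")
prefix_search = compile_prefix_regex_sketch(prefixes).search
print(prefix_search("§12").group())  # §
print(prefix_search("12§"))          # None
```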

### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"}

Compile a sequence of suffix rules into a regex object.

> #### Example
>
> ```python
> suffixes = ("'s", "'S", r"(?<=[0-9])\+")
> suffix_regex = util.compile_suffix_regex(suffixes)
> nlp.tokenizer.suffix_search = suffix_regex.search
> ```

| Name | Type | Description |
| ----------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | tuple | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). |
| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object, to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). |

### util.compile_infix_regex {#util.compile_infix_regex tag="function"}

Compile a sequence of infix rules into a regex object.

> #### Example
>
> ```python
> infixes = ("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])")
> infix_regex = util.compile_infix_regex(infixes)
> nlp.tokenizer.infix_finditer = infix_regex.finditer
> ```

| Name | Type | Description |
| ----------- | ------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | tuple | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). |
| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object, to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). |

### util.minibatch {#util.minibatch tag="function" new="2"}

Iterate over batches of items. `size` may be an iterator, so that batch-size can
vary on each step.

> #### Example
>
> ```python
> batches = minibatch(train_data)
> for batch in batches:
>     texts, annotations = zip(*batch)
>     nlp.update(texts, annotations)
> ```

| Name | Type | Description |
| ---------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `items` | iterable | The items to batch up. |
| `size` | int / iterable | The batch size(s). Use [`util.compounding`](/api/top-level#util.compounding) or [`util.decaying`](/api/top-level#util.decaying) for an infinite series of compounding or decaying values. |
| **YIELDS** | list | The batches. |
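
To show how a batch size can be either a fixed number or a stream of numbers, here is a small standalone sketch. `minibatch_sketch` is a hypothetical re-implementation for illustration, and it assumes that a size iterator is infinite, as the series mentioned in the table are.

```python
import itertools


def minibatch_sketch(items, size):
    """Yield lists of items; `size` is an int or an infinite iterator of sizes."""
    # A fixed size becomes an endless iterator so both cases share one loop.
    sizes = itertools.repeat(size) if isinstance(size, int) else iter(size)
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(sizes))))
        if not batch:   # the items are exhausted
            break
        yield batch


print(list(minibatch_sketch(range(10), size=4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(list(minibatch_sketch(range(10), size=itertools.count(1))))
# [[0], [1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Growing batch sizes like the second call are a common choice early in training, where small batches give more frequent updates.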

### util.compounding {#util.compounding tag="function" new="2"}

Yield an infinite series of compounding values. Each time a value is requested
from the generator, it is produced by multiplying the previous value by the
compound rate.

> #### Example
>
> ```python
> sizes = compounding(1., 10., 1.5)
> assert next(sizes) == 1.
> assert next(sizes) == 1. * 1.5
> assert next(sizes) == 1.5 * 1.5
> ```

| Name | Type | Description |
| ---------- | ----------- | ----------------------- |
| `start` | int / float | The first value. |
| `stop` | int / float | The maximum value. |
| `compound` | int / float | The compounding factor. |
| **YIELDS** | int | Compounding values. |
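
The series can be reproduced with a few lines of plain Python. `compounding_sketch` is a hypothetical stand-in that assumes `start <= stop` and `compound >= 1`, and that values are capped at `stop` once they would exceed it.

```python
def compounding_sketch(start, stop, compound):
    """Yield start, start * compound, ... without ever exceeding stop."""
    value = float(start)
    while True:
        yield min(value, stop)   # assumes start <= stop and compound >= 1
        value *= compound


sizes = compounding_sketch(1.0, 10.0, 1.5)
first_ten = [next(sizes) for _ in range(10)]
print(first_ten[:3])   # [1.0, 1.5, 2.25]
print(first_ten[-1])   # 10.0
```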

### util.decaying {#util.decaying tag="function" new="2"}

Yield an infinite series of linearly decaying values.

> #### Example
>
> ```python
> sizes = decaying(10., 1., 0.001)
> assert next(sizes) == 10.
> assert next(sizes) == 10. - 0.001
> assert next(sizes) == 9.999 - 0.001
> ```

| Name | Type | Description |
| ---------- | ----------- | -------------------- |
| `start` | int / float | The first value. |
| `end` | int / float | The minimum value. |
| `decay` | int / float | The decaying factor. |
| **YIELDS** | int | The decaying values. |
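
A linear decay subtracts a fixed amount at every step. The sketch below uses the hypothetical name `decaying_sketch` and assumes that the second argument acts as a floor that the values never drop below; a large decay step is used so the floor becomes visible quickly.

```python
def decaying_sketch(start, end, decay):
    """Yield start, start - decay, ... and never drop below end."""
    value = float(start)
    while True:
        yield max(value, end)   # assumption: `end` acts as a floor
        value -= decay


sizes = decaying_sketch(10.0, 1.0, 4.0)
print([next(sizes) for _ in range(5)])  # [10.0, 6.0, 2.0, 1.0, 1.0]
```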

### util.itershuffle {#util.itershuffle tag="function" new="2"}

Shuffle an iterator. This works by holding `bufsize` items back and yielding
them sometime later. The result is not an unbiased shuffle, but it should be
good enough for batching. Larger `bufsize` means less bias.

> #### Example
>
> ```python
> values = range(1000)
> shuffled = itershuffle(values)
> ```

| Name | Type | Description |
| ---------- | -------- | ----------------------------------- |
| `iterable` | iterable | Iterator to shuffle. |
| `bufsize` | int | Items to hold back (default: 1000). |
| **YIELDS** | iterable | The shuffled iterator. |
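
The idea of shuffling a stream through a bounded buffer can be sketched as follows. This is a simplified variant written for illustration, not spaCy's algorithm: `itershuffle_sketch` is a hypothetical name, and it releases one random item whenever the buffer is full.

```python
import random


def itershuffle_sketch(iterable, bufsize=1000):
    """Approximately shuffle a stream while holding at most `bufsize` items."""
    buffer = []
    for item in iterable:
        buffer.append(item)
        if len(buffer) >= bufsize:
            # Buffer is full: release one randomly chosen item.
            index = random.randrange(len(buffer))
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    # Input exhausted: shuffle and flush whatever is still held back.
    random.shuffle(buffer)
    yield from buffer


shuffled = list(itershuffle_sketch(range(1000), bufsize=100))
print(sorted(shuffled) == list(range(1000)))  # True
```

An item can never be emitted before it has been read, so early items tend to appear early in the output. That is the bias the description refers to, and a larger buffer reduces it.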

### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"}

Filter a sequence of [`Span`](/api/span) objects and remove duplicates or
overlaps. Useful for creating named entities (where one token can only be part
of one entity) or when merging spans with
[`Retokenizer.merge`](/api/doc#retokenizer.merge). When spans overlap, the
(first) longest span is preferred over shorter spans.

> #### Example
>
> ```python
> doc = nlp("This is a sentence.")
> spans = [doc[0:2], doc[0:2], doc[0:4]]
> filtered = filter_spans(spans)
> ```

| Name | Type | Description |
| ----------- | -------- | -------------------- |
| `spans` | iterable | The spans to filter. |
| **RETURNS** | list | The filtered spans. |
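
The "longest span wins" rule can be demonstrated on plain `(start, end)` token offsets instead of `Span` objects. `filter_spans_sketch` is a hypothetical helper written for this illustration; it assumes non-empty spans with an exclusive end offset.

```python
def filter_spans_sketch(spans):
    """Keep the longest non-overlapping (start, end) spans; end is exclusive."""
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda span: (span[1] - span[0], -span[0]), reverse=True)
    result = []
    seen_tokens = set()
    for start, end in ordered:
        # Checking both boundary tokens is enough: every span kept so far is
        # at least as long as this one, so it cannot sit strictly inside it.
        if start not in seen_tokens and end - 1 not in seen_tokens:
            result.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(result)


print(filter_spans_sketch([(0, 2), (0, 2), (0, 4)]))  # [(0, 4)]
print(filter_spans_sketch([(0, 2), (1, 3), (3, 5)]))  # [(0, 2), (3, 5)]
```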