---
title: Top-level Functions
menu:
  - ['spacy', 'spacy']
  - ['displacy', 'displacy']
  - ['registry', 'registry']
  - ['Loggers', 'loggers']
  - ['Readers', 'readers']
  - ['Batchers', 'batchers']
  - ['Augmenters', 'augmenters']
  - ['Training & Alignment', 'gold']
  - ['Utility Functions', 'util']
---

## spaCy {#spacy hidden="true"}

### spacy.load {#spacy.load tag="function"}

Load a pipeline using the name of an installed
[package](/usage/saving-loading#models), a string path or a `Path`-like object.
spaCy will try resolving the load argument in this order. If a pipeline is
loaded from a string name, spaCy will assume it's a Python package and import it
and call the package's own `load()` method. If a pipeline is loaded from a path,
spaCy will assume it's a data directory, load its
[`config.cfg`](/api/data-formats#config) and use the language and pipeline
information to construct the `Language` class. The data will be loaded in via
[`Language.from_disk`](/api/language#from_disk).

<Infobox variant="warning" title="Changed in v3.0">

As of v3.0, the `disable` keyword argument specifies components to load but
disable, instead of components to not load at all. Those components can now be
specified separately using the new `exclude` keyword argument.

</Infobox>

> #### Example
>
> ```python
> import spacy
> from pathlib import Path
>
> nlp = spacy.load("en_core_web_sm") # package
> nlp = spacy.load("/path/to/pipeline") # string path
> nlp = spacy.load(Path("/path/to/pipeline")) # pathlib Path
>
> nlp = spacy.load("en_core_web_sm", exclude=["parser", "tagger"])
> ```

| Name                                 | Description                                                                                                                                                                                                                                    |
| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name`                               | Pipeline to load, i.e. package name or path. ~~Union[str, Path]~~                                                                                                                                                                              |
| _keyword-only_                       |                                                                                                                                                                                                                                                |
| `disable`                            | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ |
| `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~                                                                                                           |
| `config` <Tag variant="new">3</Tag>  | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~                                                                             |
| **RETURNS**                          | A `Language` object with the loaded pipeline. ~~Language~~                                                                                                                                                                                     |

Essentially, `spacy.load()` is a convenience wrapper that reads the pipeline's
[`config.cfg`](/api/data-formats#config), uses the language and pipeline
information to construct a `Language` object, loads in the model data and
weights, and returns it.

```python
### Abstract example
cls = spacy.util.get_lang_class(lang)  # 1. Get Language class, e.g. English
nlp = cls()                            # 2. Initialize it
for name in pipeline:
    nlp.add_pipe(name)                 # 3. Add the component to the pipeline
nlp.from_disk(data_path)               # 4. Load in the binary data
```

### spacy.blank {#spacy.blank tag="function" new="2"}

Create a blank pipeline of a given language class. This function is the twin of
`spacy.load()`.

> #### Example
>
> ```python
> nlp_en = spacy.blank("en")   # equivalent to English()
> nlp_de = spacy.blank("de")   # equivalent to German()
> ```

| Name                                | Description                                                                                                                                                        |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `name`                              | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. ~~str~~                                                           |
| _keyword-only_                      |                                                                                                                                                                    |
| `vocab` <Tag variant="new">3</Tag>  | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~                              |
| `config` <Tag variant="new">3</Tag> | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~ |
| `meta` <Tag variant="new">3</Tag>   | Optional meta overrides for [`nlp.meta`](/api/language#meta). ~~Dict[str, Any]~~                                                                                   |
| **RETURNS**                         | An empty `Language` object of the appropriate subclass. ~~Language~~                                                                                               |
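
The two accepted shapes of the `config` argument describe the same overrides.
As a rough illustration only, the sketch below expands a dict keyed in dot
notation into the equivalent nested dict. This is not spaCy's implementation:
`dot_to_nested` is a hypothetical helper, and the `nlp.batch_size` key is used
purely as an example setting.

```python
from typing import Any, Dict


def dot_to_nested(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Expand {"a.b.c": 1} into {"a": {"b": {"c": 1}}} (hypothetical helper)."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate sections on demand
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


dot_overrides = {"nlp.batch_size": 128, "components.name.value": True}
nested_overrides = dot_to_nested(dot_overrides)
print(nested_overrides)
```

Either form could then be passed as `config`; the dot notation is usually more
compact when only a handful of values change.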

### spacy.info {#spacy.info tag="function"}

The same as the [`info` command](/api/cli#info). Pretty-print information about
your installation, installed pipelines and local setup from within spaCy.

> #### Example
>
> ```python
> spacy.info()
> spacy.info("en_core_web_sm")
> markdown = spacy.info(markdown=True, silent=True)
> ```

| Name           | Description                                                                  |
| -------------- | ---------------------------------------------------------------------------- |
| `model`        | Optional pipeline, i.e. a package name or path. ~~Optional[str]~~            |
| _keyword-only_ |                                                                              |
| `markdown`     | Print information as Markdown. ~~bool~~                                      |
| `silent`       | Don't print anything, just return. ~~bool~~                                  |

### spacy.explain {#spacy.explain tag="function"}

Get a description for a given POS tag, dependency label or entity type. For a
list of available terms, see [`glossary.py`](%%GITHUB_SPACY/spacy/glossary.py).

> #### Example
>
> ```python
> spacy.explain("NORP")
> # Nationalities or religious or political groups
>
> doc = nlp("Hello world")
> for word in doc:
>     print(word.text, word.tag_, spacy.explain(word.tag_))
> # Hello UH interjection
> # world NN noun, singular or mass
> ```

| Name        | Description                                                                |
| ----------- | -------------------------------------------------------------------------- |
| `term`      | Term to explain. ~~str~~                                                   |
| **RETURNS** | The explanation, or `None` if not found in the glossary. ~~Optional[str]~~ |

### spacy.prefer_gpu {#spacy.prefer_gpu tag="function" new="2.0.14"}

Allocate data and perform operations on [GPU](/usage/#gpu), if available. If
data has already been allocated on CPU, it will not be moved. Ideally, this
function should be called right after importing spaCy and _before_ loading any
pipelines.

> #### Example
>
> ```python
> import spacy
> activated = spacy.prefer_gpu()
> nlp = spacy.load("en_core_web_sm")
> ```

| Name        | Description                                      |
| ----------- | ------------------------------------------------ |
| `gpu_id`    | Device index to select. Defaults to `0`. ~~int~~ |
| **RETURNS** | Whether the GPU was activated. ~~bool~~          |

### spacy.require_gpu {#spacy.require_gpu tag="function" new="2.0.14"}

Allocate data and perform operations on [GPU](/usage/#gpu). Will raise an error
if no GPU is available. If data has already been allocated on CPU, it will not
be moved. Ideally, this function should be called right after importing spaCy
and _before_ loading any pipelines.

> #### Example
>
> ```python
> import spacy
> spacy.require_gpu()
> nlp = spacy.load("en_core_web_sm")
> ```

| Name        | Description                                      |
| ----------- | ------------------------------------------------ |
| `gpu_id`    | Device index to select. Defaults to `0`. ~~int~~ |
| **RETURNS** | `True` ~~bool~~                                  |

## displaCy {#displacy source="spacy/displacy"}

As of v2.0, spaCy comes with a built-in visualization suite. For more info and
examples, see the usage guide on [visualizing spaCy](/usage/visualizers).

### displacy.serve {#displacy.serve tag="method" new="2"}

Serve a dependency parse tree or named entity visualization to view it in your
browser. Will run a simple web server.

> #### Example
>
> ```python
> import spacy
> from spacy import displacy
> nlp = spacy.load("en_core_web_sm")
> doc1 = nlp("This is a sentence.")
> doc2 = nlp("This is another sentence.")
> displacy.serve([doc1, doc2], style="dep")
> ```

| Name      | Description                                                                                                                                                       |
| --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs`    | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span]], Doc, Span]~~                                                                             |
| `style`   | Visualization style, `"dep"` or `"ent"`. Defaults to `"dep"`. ~~str~~                                                                                             |
| `page`    | Render markup as full HTML page. Defaults to `True`. ~~bool~~                                                                                                     |
| `minify`  | Minify HTML markup. Defaults to `False`. ~~bool~~                                                                                                                 |
| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~                                                                                 |
| `manual`  | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
| `port`    | Port to serve visualization. Defaults to `5000`. ~~int~~                                                                                                          |
| `host`    | Host to serve visualization. Defaults to `"0.0.0.0"`. ~~str~~                                                                                                     |

### displacy.render {#displacy.render tag="method" new="2"}

Render a dependency parse tree or named entity visualization.

> #### Example
>
> ```python
> import spacy
> from spacy import displacy
> nlp = spacy.load("en_core_web_sm")
> doc = nlp("This is a sentence.")
> html = displacy.render(doc, style="dep")
> ```

| Name        | Description                                                                                                                                                                            |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs`      | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span]], Doc, Span]~~                                                                                                  |
| `style`     | Visualization style, `"dep"` or `"ent"`. Defaults to `"dep"`. ~~str~~                                                                                                                  |
| `page`      | Render markup as full HTML page. Defaults to `True`. ~~bool~~                                                                                                                          |
| `minify`    | Minify HTML markup. Defaults to `False`. ~~bool~~                                                                                                                                      |
| `options`   | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~                                                                                                      |
| `manual`    | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~                      |
| `jupyter`   | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None` (default). ~~Optional[bool]~~ |
| **RETURNS** | The rendered HTML markup. ~~str~~                                                                                                                                                      |

### Visualizer options {#displacy_options}

The `options` argument lets you specify additional settings for each visualizer.
If a setting is not present in the options, the default value will be used.

#### Dependency Visualizer options {#options-dep}

> #### Example
>
> ```python
> options = {"compact": True, "color": "blue"}
> displacy.serve(doc, style="dep", options=options)
> ```

| Name                                       | Description                                                                                                                                  |
| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- |
| `fine_grained`                             | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). Defaults to `False`. ~~bool~~             |
| `add_lemma` <Tag variant="new">2.2.4</Tag> | Print the lemmas in a separate row below the token texts. Defaults to `False`. ~~bool~~                                                      |
| `collapse_punct`                           | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. Defaults to `True`. ~~bool~~ |
| `collapse_phrases`                         | Merge noun phrases into one token. Defaults to `False`. ~~bool~~                                                                             |
| `compact`                                  | "Compact mode" with square arrows that takes up less space. Defaults to `False`. ~~bool~~                                                    |
| `color`                                    | Text color (HEX, RGB or color names). Defaults to `"#000000"`. ~~str~~                                                                       |
| `bg`                                       | Background color (HEX, RGB or color names). Defaults to `"#ffffff"`. ~~str~~                                                                 |
| `font`                                     | Font name or font family for all text. Defaults to `"Arial"`. ~~str~~                                                                        |
| `offset_x`                                 | Spacing on left side of the SVG in px. Defaults to `50`. ~~int~~                                                                             |
| `arrow_stroke`                             | Width of arrow path in px. Defaults to `2`. ~~int~~                                                                                          |
| `arrow_width`                              | Width of arrow head in px. Defaults to `10` in regular mode and `8` in compact mode. ~~int~~                                                 |
| `arrow_spacing`                            | Spacing between arrows in px to avoid overlaps. Defaults to `20` in regular mode and `12` in compact mode. ~~int~~                           |
| `word_spacing`                             | Vertical spacing between words and arcs in px. Defaults to `45`. ~~int~~                                                                     |
| `distance`                                 | Distance between words in px. Defaults to `175` in regular mode and `150` in compact mode. ~~int~~                                           |

#### Named Entity Visualizer options {#displacy_options-ent}

> #### Example
>
> ```python
> options = {"ents": ["PERSON", "ORG", "PRODUCT"],
>            "colors": {"ORG": "yellow"}}
> displacy.serve(doc, style="ent", options=options)
> ```

| Name                                    | Description                                                                                                                                                                                                                                 |
| --------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ents`                                  | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~                                                                                                                                                        |
| `colors`                                | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~                                                                                                                                                 |
| `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](%%GITHUB_SPACY/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |

By default, displaCy comes with colors for all entity types used by
[spaCy's trained pipelines](/models). If you're using custom entity types, you
can use the `colors` setting to add your own colors for them. Your application
or pipeline package can also expose a
[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
to add custom labels and their colors automatically.
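
Because the `template` option is a format string, its placeholders can be
checked with plain `str.format` before handing it to displaCy. The markup below
is a made-up example, not displaCy's default template, and the values filled in
are placeholders chosen for illustration.

```python
# Hypothetical custom template; only the {bg}, {text} and {label}
# placeholders come from the option described above.
custom_template = (
    '<mark style="background: {bg}; padding: 0.25em">'
    "{text} <small>{label}</small>"
    "</mark>"
)

# Fill the placeholders by hand to preview the markup for one entity span
preview = custom_template.format(bg="yellow", text="Explosion", label="ORG")
print(preview)

# The template would then be passed along with the other visualizer options
options = {"colors": {"ORG": "yellow"}, "template": custom_template}
```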

## registry {#registry source="spacy/util.py" new="3"}

spaCy's function registry extends
[Thinc's `registry`](https://thinc.ai/docs/api-config#registry) and allows you
to map strings to functions. You can register functions to create architectures,
optimizers, schedules and more, and then refer to them and set their arguments
in your [config file](/usage/training#config). Python type hints are used to
validate the inputs. See the
[Thinc docs](https://thinc.ai/docs/api-config#registry) for details on the
`registry` methods and our helper library
[`catalogue`](https://github.com/explosion/catalogue) for some background on the
concept of function registries. spaCy also uses the function registry for
language subclasses, model architectures, lookups and pipeline component
factories.

> #### Example
>
> ```python
> from typing import Iterator
> import spacy
>
> @spacy.registry.schedules("waltzing.v1")
> def waltzing() -> Iterator[float]:
>     i = 0
>     while True:
>         yield i % 3 + 1
>         i += 1
> ```

| Registry name     | Description                                                                                                                                                                                                                                        |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `architectures`   | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`.                                                                           |
| `augmenters`      | Registry for functions that create [data augmentation](#augmenters) callbacks for corpora and other training data iterators.                                                                                                                       |
| `batchers`        | Registry for training and evaluation [data batchers](#batchers).                                                                                                                                                                                   |
| `callbacks`       | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training.                                                                                                                             |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                             |
| `factories`       | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
| `initializers`    | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers).                                                                                                                                                         |
| `languages`       | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                                                                 |
| `layers`          | Registry for functions that create [layers](https://thinc.ai/docs/api-layers).                                                                                                                                                                     |
| `loggers`         | Registry for functions that log [training results](/usage/training).                                                                                                                                                                               |
| `lookups`         | Registry for large lookup tables available via `vocab.lookups`.                                                                                                                                                                                    |
| `losses`          | Registry for functions that create [losses](https://thinc.ai/docs/api-loss).                                                                                                                                                                       |
| `misc`            | Registry for miscellaneous functions that return data assets, knowledge bases or anything else you may need.                                                                                                                                       |
| `optimizers`      | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers).                                                                                                                                                             |
| `readers`         | Registry for file and data readers, including training and evaluation data readers like [`Corpus`](/api/corpus).                                                                                                                                   |
| `schedules`       | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules).                                                                                                                                                               |
| `tokenizers`      | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable.                                                                   |

### spacy-transformers registry {#registry-transformers}

The following registries are added by the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package.
See the [`Transformer`](/api/transformer) API reference and
[usage docs](/usage/embeddings-transformers) for details.

> #### Example
>
> ```python
> import spacy_transformers
>
> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
> def configure_custom_annotation_setter():
>     def annotation_setter(docs, trf_data) -> None:
>         # Set annotations on the docs
>         pass
>
>     return annotation_setter
> ```

| Registry name                                               | Description                                                                                                                                                                                                                                       |
| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`span_getters`](/api/transformer#span_getters)             | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences.                                                                                                      |
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |

## Loggers {#loggers source="spacy/training/loggers.py" new="3"}

A logger records the training results. When a logger is created, two functions
are returned: one for logging the information for each training step, and a
second function that is called to finalize the logging when the training is
finished. To log each training step, a
[dictionary](/usage/training#custom-logging) is passed on from the
[`spacy train`](/api/cli#train) command, including information such as the
training loss and the accuracy scores on the development set.

There are two built-in logging functions: a logger printing results to the
console in tabular format (which is the default), and one that also sends the
results to a [Weights & Biases](https://www.wandb.com/) dashboard. Instead of
using one of the built-in loggers listed here, you can also
[implement your own](/usage/training#custom-logging).
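
The two-function contract described above can be sketched in plain Python. This
is an illustration under assumptions, not spaCy's logger code: `setup_logger`
is a hypothetical name, the `info` keys used here (`step`, `score`, `losses`)
are assumed for the example, and a real logger would additionally be registered
in the `loggers` registry.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

LogStep = Callable[[Optional[Dict[str, Any]]], None]
Finalize = Callable[[], None]


def setup_logger() -> Tuple[LogStep, Finalize]:
    """Hypothetical setup function returning (log_step, finalize)."""
    rows: List[str] = []

    def log_step(info: Optional[Dict[str, Any]]) -> None:
        # Assumption: steps without results are passed as None and skipped
        if info is None:
            return
        total_loss = sum(info["losses"].values())
        rows.append(f"{info['step']:>6} {total_loss:>10.2f} {info['score']:>6.2f}")

    def finalize() -> None:
        # Called once when training is finished
        print("\n".join(rows))

    return log_step, finalize


log_step, finalize = setup_logger()
log_step({"step": 200, "score": 0.34, "losses": {"tok2vec": 3.08, "tagger": 18968.78}})
log_step(None)
finalize()
```

Keeping the state in a closure means the two returned functions share it
without any global variables.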

#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"}

> #### Example config
>
> ```ini
> [training.logger]
> @loggers = "spacy.ConsoleLogger.v1"
> ```

Writes the results of a training step to the console in a tabular format.

<Accordion title="Example console output" spaced>

```cli
$ python -m spacy train config.cfg
```

```
ℹ Using CPU
ℹ Loading config and nlp from: config.cfg
ℹ Pipeline: ['tok2vec', 'tagger']
ℹ Start training
ℹ Training. Initial learn rate: 0.0

E     #        LOSS TOK2VEC   LOSS TAGGER   TAG_ACC   SCORE
---   ------   ------------   -----------   -------   ------
  1        0           0.00         86.20      0.22     0.00
  1      200           3.08      18968.78     34.00     0.34
  1      400          31.81      22539.06     33.64     0.34
  1      600          92.13      22794.91     43.80     0.44
  1      800         183.62      21541.39     56.05     0.56
  1     1000         352.49      25461.82     65.15     0.65
  1     1200         422.87      23708.82     71.84     0.72
  1     1400         601.92      24994.79     76.57     0.77
  1     1600         662.57      22268.02     80.20     0.80
  1     1800        1101.50      28413.77     82.56     0.83
  1     2000        1253.43      28736.36     85.00     0.85
  1     2200        1411.02      28237.53     87.42     0.87
  1     2400        1605.35      28439.95     88.70     0.89
```

Note that the cumulative loss keeps increasing within one epoch, but should
start decreasing across epochs.

</Accordion>

#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"}

> #### Installation
>
> ```bash
> $ pip install wandb
> $ wandb login
> ```

Built-in logger that sends the results of each training step to the dashboard of
the [Weights & Biases](https://www.wandb.com/) tool. To use this logger, Weights
& Biases should be installed, and you should be logged in. The logger will send
the full config file to W&B, as well as various system information such as
memory utilization, network traffic, disk IO, GPU statistics, etc. This will
also include information such as your hostname and operating system, as well as
the location of your Python executable.

<Infobox variant="warning">

Note that by default, the full (interpolated)
[training config](/usage/training#config) is sent over to the W&B dashboard. If
you prefer to **exclude certain information** such as path names, you can list
those fields in "dot notation" in the `remove_config_values` parameter. These
fields will then be removed from the config before uploading, but will otherwise
remain in the config file stored on your local system.

</Infobox>

> #### Example config
>
> ```ini
> [training.logger]
> @loggers = "spacy.WandbLogger.v1"
> project_name = "monitor_spacy_training"
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
> ```

| Name                   | Description                                                                                                                           |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `project_name`         | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
| `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~                              |
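
To make the "dot notation" used by `remove_config_values` concrete, here is a minimal sketch of removing dotted keys from a nested config dictionary. The helper `remove_values` is hypothetical and only illustrates the idea; it is not the logger's actual implementation.

```python
from copy import deepcopy
from typing import Any, Dict, List


def remove_values(config: Dict[str, Any], dotted_keys: List[str]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of `config` without the dotted keys."""
    cleaned = deepcopy(config)  # leave the local config untouched
    for dotted in dotted_keys:
        *parents, leaf = dotted.split(".")
        node = cleaned
        for part in parents:
            node = node.get(part)
            if not isinstance(node, dict):
                break  # the section doesn't exist, so there is nothing to remove
        else:
            node.pop(leaf, None)
    return cleaned


config = {
    "paths": {"train": "corpus/train.spacy", "dev": "corpus/dev.spacy"},
    "training": {"seed": 0},
}
uploaded = remove_values(config, ["paths.train", "paths.dev"])
```

In this sketch, `uploaded` no longer contains the two paths, while `config` still does, mirroring how the values are removed only from the uploaded copy.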

<Project id="integrations/wandb">

Get started with tracking your spaCy training runs in Weights & Biases using our
project template. It trains on the IMDB Movie Review Dataset and includes a
simple config with the built-in `WandbLogger`, as well as a custom example of
creating variants of the config for a simple hyperparameter grid search and
logging the results.

</Project>

## Readers {#readers}

### File readers {#file-readers source="github.com/explosion/srsly" new="3"}

The following file readers are provided by our serialization library
[`srsly`](https://github.com/explosion/srsly). All registered functions take one
argument `path`, pointing to the file path to load.

> #### Example config
>
> ```ini
> [corpora.train.augmenter.orth_variants]
> @readers = "srsly.read_json.v1"
> path = "corpus/en_orth_variants.json"
> ```

| Name                    | Description                                           |
| ----------------------- | ----------------------------------------------------- |
| `srsly.read_json.v1`    | Read data from a JSON file.                           |
| `srsly.read_jsonl.v1`   | Read data from a JSONL (newline-delimited JSON) file. |
| `srsly.read_yaml.v1`    | Read data from a YAML file.                           |
| `srsly.read_msgpack.v1` | Read data from a binary MessagePack file.             |

<Infobox title="Important note" variant="warning">

Since the file readers expect a local path, you should only use them in config
blocks that are **not executed at runtime** – for example, in `[training]` and
`[corpora]` (to load data or resources like data augmentation tables) or in
`[initialize]` (to pass data to pipeline components).

</Infobox>

#### spacy.read_labels.v1 {#read_labels tag="registered function"}

Read a JSON-formatted labels file generated with
[`init labels`](/api/cli#init-labels). Typically used in the
[`[initialize]`](/api/data-formats#config-initialize) block of the training
config to speed up the model initialization process and provide pre-generated
label sets.

> #### Example config
>
> ```ini
> [initialize.components]
>
> [initialize.components.ner]
>
> [initialize.components.ner.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/ner.json"
> ```

| Name        | Description                                                                                                                                                                                                               |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`      | The path to the labels file generated with [`init labels`](/api/cli#init-labels). ~~Path~~                                                                                                                                |
| `require`   | Whether to require the file to exist. If set to `False` and the labels file doesn't exist, the loader will return `None` and the `initialize` method will extract the labels from the data. Defaults to `False`. ~~bool~~ |
| **CREATES** | The labels read from the file, or `None` if the file doesn't exist and `require` is `False`.                                                                                                                              |

### Corpus readers {#corpus-readers source="spacy/training/corpus.py" new="3"}

Corpus readers are registered functions that load data and return a function
that takes the current `nlp` object and yields [`Example`](/api/example) objects
that can be used for [training](/usage/training) and
[pretraining](/usage/embeddings-transformers#pretraining). You can replace the
built-in readers with your own registered function in the
[`@readers` registry](/api/top-level#registry) to customize the data loading and
streaming.

#### spacy.Corpus.v1 {#corpus tag="registered function"}

The `Corpus` reader manages annotated corpora and can be used for training and
development datasets in the [DocBin](/api/docbin) (`.spacy`) format. Also see
the [`Corpus`](/api/corpus) class.

> #### Example config
>
> ```ini
> [paths]
> train = "corpus/train.spacy"
>
> [corpora.train]
> @readers = "spacy.Corpus.v1"
> path = ${paths.train}
> gold_preproc = false
> max_length = 0
> limit = 0
> ```

| Name            | Description                                                                                                                                                                                                                                                                              |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`          | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Union[str, Path]~~                                                                                                                                        |
|  `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~                                                                                                                                 |
| `max_length`    | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~                                                                                                                                      |
| `limit`         | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~                                                                                                                                                                                          |
| `augmenter`     | Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ |
| **CREATES**     | The corpus reader. ~~Corpus~~                                                                                                                                                                                                                                                            |

#### spacy.JsonlCorpus.v1 {#jsonlcorpus tag="registered function"}

Create [`Example`](/api/example) objects from a JSONL (newline-delimited JSON)
file of texts keyed by `"text"`. Can be used to read the raw text corpus for
language model [pretraining](/usage/embeddings-transformers#pretraining) from a
JSONL file. Also see the [`JsonlCorpus`](/api/corpus#jsonlcorpus) class.

> #### Example config
>
> ```ini
> [paths]
> pretrain = "corpus/raw_text.jsonl"
>
> [corpora.pretrain]
> @readers = "spacy.JsonlCorpus.v1"
> path = ${paths.pretrain}
> min_length = 0
> max_length = 0
> limit = 0
> ```

| Name         | Description                                                                                                                      |
| ------------ | -------------------------------------------------------------------------------------------------------------------------------- |
| `path`       | The directory or filename to read from. Expects newline-delimited JSON with a key `"text"` for each record. ~~Union[str, Path]~~ |
| `min_length` | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~       |
| `max_length` | Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~        |
| `limit`      | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~                                  |
| **CREATES**  | The corpus reader. ~~JsonlCorpus~~                                                                                               |
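
The following is a rough sketch of how the `min_length`, `max_length` and `limit` settings interact when reading JSONL records. It uses only the standard library, and it counts whitespace-separated words as a stand-in for tokens, which is an assumption: the real reader works with spaCy's tokenization. The function `read_texts` is hypothetical.

```python
import io
import json
from typing import Iterable, Iterator


def read_texts(lines: Iterable[str], min_length: int = 0,
               max_length: int = 0, limit: int = 0) -> Iterator[str]:
    """Hypothetical reader: yield the "text" of each record within the limits."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        text = json.loads(line)["text"]
        n_tokens = len(text.split())  # assumption: whitespace tokens
        if min_length and n_tokens < min_length:
            continue  # too short, skip
        if max_length and n_tokens > max_length:
            continue  # too long, skip
        yield text
        count += 1
        if limit and count >= limit:
            return


raw = io.StringIO(
    '{"text": "Hi"}\n'
    '{"text": "This is a longer sentence"}\n'
    '{"text": "Three word text"}\n'
)
texts = list(read_texts(raw, min_length=2, max_length=4))
```

A value of `0` disables the respective check, matching the defaults described in the table.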

## Batchers {#batchers source="spacy/training/batchers.py" new="3"}

A data batcher implements a batching strategy that essentially turns a stream of
items into a stream of batches, with each batch consisting of one item or a list
of items. During training, the models update their weights after processing one
batch at a time. Typical batching strategies include presenting the training
data as a stream of batches with similar sizes, or with increasing batch sizes.
See the Thinc documentation on
[`schedules`](https://thinc.ai/docs/api-schedules) for a few standard examples.

Instead of using one of the built-in batchers listed here, you can also
[implement your own](/usage/training#custom-code-readers-batchers), which may or
may not use a custom schedule.

### spacy.batch_by_words.v1 {#batch_by_words tag="registered function"}

Create minibatches of roughly a given number of words. If any examples are
longer than the specified batch length, they will appear in a batch by
themselves, or be discarded if `discard_oversize` is set to `True`. The argument
`seqs` can be a list of strings, [`Doc`](/api/doc) objects or
[`Example`](/api/example) objects.

> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "spacy.batch_by_words.v1"
> size = 100
> tolerance = 0.2
> discard_oversize = false
> get_length = null
> ```

| Name               | Description                                                                                                                                                                             |
| ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `seqs`             | The sequences to minibatch. ~~Iterable[Any]~~                                                                                                                                           |
| `size`             | The target number of words per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ |
| `tolerance`        | What percentage of the size to allow batches to exceed. ~~float~~                                                                                                                       |
| `discard_oversize` | Whether to discard sequences that by themselves exceed the tolerated size. ~~bool~~                                                                                                     |
| `get_length`       | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~                                 |
| **CREATES**        | The batcher that takes an iterable of items and returns batches. ~~Callable[[Iterable[Any]], Iterable[List[Any]]]~~                                                                     |
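
As an illustration of the `size`, `tolerance` and `discard_oversize` settings, here is a simplified sketch of batching by length. It is not spaCy's implementation, which handles overflow differently; the function `batch_by_length` is hypothetical and treats `tolerance` as a fraction of `size`.

```python
from typing import Any, Callable, Iterable, Iterator, List


def batch_by_length(items: Iterable[Any], size: int, tolerance: float = 0.2,
                    discard_oversize: bool = False,
                    get_length: Callable[[Any], int] = len) -> Iterator[List[Any]]:
    """Hypothetical batcher: group items so each batch stays within the tolerated size."""
    max_size = size + int(size * tolerance)
    batch: List[Any] = []
    batch_len = 0
    for item in items:
        n = get_length(item)
        if n > max_size:
            # The item exceeds the tolerated size on its own
            if not discard_oversize:
                yield [item]
            continue
        if batch and batch_len + n > max_size:
            yield batch
            batch, batch_len = [], 0
        batch.append(item)
        batch_len += n
    if batch:
        yield batch


docs = [["w"] * 3, ["w"] * 4, ["w"] * 12, ["w"] * 5, ["w"] * 2]
batches = list(batch_by_length(docs, size=8, tolerance=0.25))
words_per_batch = [sum(len(doc) for doc in batch) for batch in batches]
```

With these inputs, the 12-word item ends up in a batch by itself, and it would be dropped entirely with `discard_oversize=True`.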

### spacy.batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}

> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "spacy.batch_by_sequence.v1"
> size = 32
> get_length = null
> ```

Create a batcher that creates batches of the specified size.

| Name         | Description                                                                                                                                                                             |
| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `size`       | The target number of items per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ |
| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~                                 |
| **CREATES**  | The batcher that takes an iterable of items and returns batches. ~~Callable[[Iterable[Any]], Iterable[List[Any]]]~~                                                                     |
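
The sketch below shows what it means for `size` to be either a fixed integer or a schedule. Both functions are hypothetical stand-ins written in plain Python: `compounding_sizes` only mimics the general idea of a compounding schedule and is not Thinc's implementation.

```python
from itertools import islice, repeat
from typing import Any, Iterable, Iterator, List, Union


def compounding_sizes(start: float, stop: float, compound: float) -> Iterator[float]:
    """Hypothetical schedule: start, start * compound, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def batch_by_count(items: Iterable[Any],
                   size: Union[int, Iterable[float]]) -> Iterator[List[Any]]:
    """Hypothetical batcher: yield batches with a fixed or scheduled item count."""
    sizes = repeat(size) if isinstance(size, int) else iter(size)
    stream = iter(items)
    for batch_size in sizes:
        batch = list(islice(stream, int(batch_size)))
        if not batch:
            return
        yield batch


fixed = list(batch_by_count(range(7), 3))
growing = list(batch_by_count(range(10), compounding_sizes(1.0, 4.0, 2.0)))
```

Here `fixed` contains batches of three items (plus a shorter final batch), while the batches in `growing` increase in size until they reach the cap.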

### spacy.batch_by_padded.v1 {#batch_by_padded tag="registered function"}

> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "spacy.batch_by_padded.v1"
> size = 100
> buffer = 256
> discard_oversize = false
> get_length = null
> ```

Minibatch a sequence by the size of padded batches that would result, with
sequences binned by length within a window. The padded size is defined as the
maximum length of sequences within the batch multiplied by the number of
sequences in the batch.

| Name               | Description                                                                                                                                                                                                                                 |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `size`             | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~                                          |
| `buffer`           | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. ~~int~~ |
| `discard_oversize` | Whether to discard sequences that are by themselves longer than the largest padded batch size. ~~bool~~                                                                                                                                     |
| `get_length`       | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~                                                                                     |
| **CREATES**        | The batcher that takes an iterable of items and returns batches. ~~Callable[[Iterable[Any]], Iterable[List[Any]]]~~                                                                                                                         |
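
To illustrate the padded size and why sequences are sorted within a buffer, here is a simplified sketch. It assumes a plain greedy grouping over each sorted window, which is not necessarily how spaCy's batcher proceeds; `padded_size` and `batch_by_padded_size` are hypothetical names.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, Sequence


def padded_size(lengths: Sequence[int]) -> int:
    """Longest sequence in the batch times the number of sequences."""
    return max(lengths) * len(lengths) if lengths else 0


def batch_by_padded_size(seqs: Iterable[Any], size: int, buffer: int = 256,
                         discard_oversize: bool = False,
                         get_length: Callable[[Any], int] = len) -> Iterator[List[Any]]:
    """Hypothetical batcher: sort each window by length, then group greedily."""
    stream = iter(seqs)
    while True:
        window = sorted(islice(stream, buffer), key=get_length)
        if not window:
            return
        batch: List[Any] = []
        for seq in window:
            n = get_length(seq)
            if n > size:
                # Longer than the largest padded size on its own
                if not discard_oversize:
                    yield [seq]
                continue
            # The window is sorted, so `seq` would be the longest in the batch
            if batch and n * (len(batch) + 1) > size:
                yield batch
                batch = []
            batch.append(seq)
        if batch:
            yield batch


words = ["aaaaa", "a", "bb", "ccccccccc", "dd", "eee"]
batches = list(batch_by_padded_size(words, size=8, buffer=6, discard_oversize=True))
```

Sorting keeps sequences of similar length together: three sequences of lengths 3, 4 and 12 have a padded size of 36, whereas lengths 1, 2 and 2 only need 6.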

## Augmenters {#augmenters source="spacy/training/augment.py" new="3"}

Data augmentation is the process of applying small modifications to the training
data. It can be especially useful for punctuation and case replacement – for
example, if your corpus only uses smart quotes and you want to include
variations using regular quotes, or to make the model less sensitive to
capitalization by including a mix of capitalized and lowercase examples. See the
[usage guide](/usage/training#data-augmentation) for details and examples.

### spacy.orth_variants.v1 {#orth_variants tag="registered function"}

> #### Example config
>
> ```ini
> [corpora.train.augmenter]
> @augmenters = "spacy.orth_variants.v1"
> level = 0.1
> lower = 0.5
>
> [corpora.train.augmenter.orth_variants]
> @readers = "srsly.read_json.v1"
> path = "corpus/en_orth_variants.json"
> ```

Create a data augmentation callback that uses orth-variant replacement. The
callback can be added to a corpus or other data iterator during training. It's
especially useful for punctuation and case replacement, to help generalize
beyond corpora that don't have smart quotes, or only have smart quotes, etc.

| Name            | Description                                                                                                                                                                                                                                                                                               |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `level`         | The percentage of texts that will be augmented. ~~float~~                                                                                                                                                                                                                                                 |
| `lower`         | The percentage of texts that will be lowercased. ~~float~~                                                                                                                                                                                                                                                |
| `orth_variants` | A dictionary containing the single and paired orth variants. Typically loaded from a JSON file. See [`en_orth_variants.json`](https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_orth_variants.json) for an example. ~~Dict[str, Dict[List[Union[str, List[str]]]]]~~ |
| **CREATES**     | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~                                                                                                                              |
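
The idea of orth-variant replacement can be sketched on plain token lists. This is a heavily simplified illustration: it assumes a flat list of interchangeable variant groups rather than the single and paired structure of the real JSON file, and `make_variant_augmenter` is a hypothetical name.

```python
import random
from typing import Callable, List, Optional


def make_variant_augmenter(variant_groups: List[List[str]], level: float,
                           rng: Optional[random.Random] = None) -> Callable[[List[str]], List[str]]:
    """Hypothetical augmenter: swap tokens for a variant from the same group."""
    rng = rng or random.Random()
    # Map every variant to the group it belongs to
    lookup = {variant: group for group in variant_groups for variant in group}

    def augment(tokens: List[str]) -> List[str]:
        if rng.random() >= level:
            return list(tokens)  # leave this text unchanged
        return [rng.choice(lookup[t]) if t in lookup else t for t in tokens]

    return augment


groups = [["...", "…"]]
augment = make_variant_augmenter(groups, level=1.0, rng=random.Random(0))
augmented = augment(["Wait", "..."])
```

Tokens that don't belong to any variant group, like `"Wait"` here, are always left as they are.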

### spacy.lower_case.v1 {#lower_case tag="registered function"}

> #### Example config
>
> ```ini
> [corpora.train.augmenter]
> @augmenters = "spacy.lower_case.v1"
> level = 0.3
> ```

Create a data augmentation callback that lowercases documents. The callback can
be added to a corpus or other data iterator during training. It's especially
useful for making the model less sensitive to capitalization.

| Name        | Description                                                                                                                                                                  |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `level`     | The percentage of texts that will be augmented. ~~float~~                                                                                                                    |
| **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |
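
A minimal sketch of the effect of `level`, assuming the augmenter works on plain strings instead of `Example` objects. The factory `create_lower_caser` is hypothetical and only mirrors the shape of a function that creates an augmentation callback.

```python
import random
from typing import Callable, Iterator, Optional


def create_lower_caser(level: float,
                       rng: Optional[random.Random] = None) -> Callable[[str], Iterator[str]]:
    """Hypothetical factory: lowercase roughly `level` of the incoming texts."""
    rng = rng or random.Random()

    def augment(text: str) -> Iterator[str]:
        if rng.random() < level:
            yield text.lower()
        else:
            yield text

    return augment


always = create_lower_caser(level=1.0)
never = create_lower_caser(level=0.0)
lowered = list(always("I like London."))
original = list(never("I like London."))
```

With `level = 0.3` as in the example config, about three in ten texts would be lowercased in this sketch.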

## Training data and alignment {#gold source="spacy/training"}

### training.offsets_to_biluo_tags {#offsets_to_biluo_tags tag="function"}

Encode labelled spans into per-token tags, using the
[BILUO scheme](/usage/linguistic-features#accessing-ner) (Begin, In, Last, Unit,
Out). Returns a list of strings, describing the tags. Each tag string will be in
the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of
`"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
don't align with the tokenization in the `Doc` object. The training algorithm
will view these as missing values. `O` denotes a non-entity token. `B` denotes
the beginning of a multi-token entity, `I` the inside of an entity of three or
more tokens, and `L` the end of an entity of two or more tokens. `U` denotes a
single-token entity.

<Infobox title="Changed in v3.0" variant="warning" id="biluo_tags_from_offsets">

This method was previously available as `spacy.gold.biluo_tags_from_offsets`.

</Infobox>

> #### Example
>
> ```python
> from spacy.training import offsets_to_biluo_tags
>
> doc = nlp("I like London.")
> entities = [(7, 13, "LOC")]
> tags = offsets_to_biluo_tags(doc, entities)
> assert tags == ["O", "O", "U-LOC", "O"]
> ```

| Name        | Description                                                                                                                                                                                |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc`       | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. ~~Doc~~                                                             |
| `entities`  | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. ~~List[Tuple[int, int, Union[str, int]]]~~ |
| `missing`   | The label used for missing values, e.g. if tokenization doesn't align with the entity offsets. Defaults to `"O"`. ~~str~~                                                                  |
| **RETURNS** | A list of strings, describing the [BILUO](/usage/linguistic-features#accessing-ner) tags. ~~List[str]~~                                                                                    |
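
To show how the scheme maps character offsets to per-token tags, here is a pure-Python sketch. It assumes a simple regex tokenizer as a stand-in for the `Doc`, and the helpers `token_spans` and `offsets_to_tags` are hypothetical, not spaCy's implementation.

```python
import re
from typing import List, Tuple


def token_spans(text: str) -> List[Tuple[int, int]]:
    """Stand-in tokenizer: character offsets of words and punctuation marks."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def offsets_to_tags(spans: List[Tuple[int, int]],
                    entities: List[Tuple[int, int, str]]) -> List[str]:
    """Hypothetical encoder: character offsets to BILUO tags."""
    tags = ["O"] * len(spans)
    starts = {start: i for i, (start, _) in enumerate(spans)}
    ends = {end: i for i, (_, end) in enumerate(spans)}
    for start_char, end_char, label in entities:
        first, last = starts.get(start_char), ends.get(end_char)
        if first is None or last is None:
            # Offsets don't align with token boundaries: mark overlapping tokens
            for i, (start, end) in enumerate(spans):
                if start < end_char and end > start_char:
                    tags[i] = "-"
            continue
        if first == last:
            tags[first] = f"U-{label}"
        else:
            tags[first] = f"B-{label}"
            for i in range(first + 1, last):
                tags[i] = f"I-{label}"
            tags[last] = f"L-{label}"
    return tags


multi = offsets_to_tags(token_spans("I like New York City."), [(7, 20, "GPE")])
single = offsets_to_tags(token_spans("I like London."), [(7, 13, "LOC")])
misaligned = offsets_to_tags(token_spans("I like London."), [(7, 12, "LOC")])
```

The three calls cover a multi-token entity, a single-token entity and an entity whose end offset falls inside a token.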

### training.biluo_tags_to_offsets {#biluo_tags_to_offsets tag="function"}

Convert per-token tags following the
[BILUO scheme](/usage/linguistic-features#accessing-ner) into entity offsets.

<Infobox title="Changed in v3.0" variant="warning" id="offsets_from_biluo_tags">

This method was previously available as `spacy.gold.offsets_from_biluo_tags`.

</Infobox>

> #### Example
>
> ```python
> from spacy.training import biluo_tags_to_offsets
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> entities = biluo_tags_to_offsets(doc, tags)
> assert entities == [(7, 13, "LOC")]
> ```

| Name        | Description                                                                                                                                                                                                                                                  |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc`       | The document that the BILUO tags refer to. ~~Doc~~                                                                                                                                                                                                           |
| `tags`      | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
| **RETURNS** | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. ~~List[Tuple[int, int, str]]~~                                                                                 |
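
The reverse direction can be sketched in the same spirit. The example assumes the token boundaries are given as character offsets; `tags_to_offsets` is a hypothetical helper and not spaCy's implementation.

```python
from typing import List, Tuple


def tags_to_offsets(spans: List[Tuple[int, int]],
                    tags: List[str]) -> List[Tuple[int, int, str]]:
    """Hypothetical decoder: BILUO tags to (start, end, label) triples."""
    entities = []
    start = None
    for (tok_start, tok_end), tag in zip(spans, tags):
        if tag in ("", "O", "-"):
            start = None  # not part of an entity
            continue
        action, label = tag.split("-", 1)
        if action == "U":
            entities.append((tok_start, tok_end, label))
        elif action == "B":
            start = tok_start
        elif action == "L" and start is not None:
            entities.append((start, tok_end, label))
            start = None
        # "I" tags only continue an entity that is already open
    return entities


# Token offsets for "I like New York City."
spans = [(0, 1), (2, 6), (7, 10), (11, 15), (16, 20), (20, 21)]
tags = ["O", "O", "B-GPE", "I-GPE", "L-GPE", "O"]
entities = tags_to_offsets(spans, tags)
```

The resulting triple slices `"New York City"` out of the original string.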

### training.biluo_tags_to_spans {#biluo_tags_to_spans tag="function" new="2.1"}

Convert per-token tags following the
[BILUO scheme](/usage/linguistic-features#accessing-ner) into
[`Span`](/api/span) objects. This can be used to create entity spans from
token-based tags, e.g. to overwrite the `doc.ents`.

<Infobox title="Changed in v3.0" variant="warning" id="spans_from_biluo_tags">

This method was previously available as `spacy.gold.spans_from_biluo_tags`.

</Infobox>

> #### Example
>
> ```python
> from spacy.training import biluo_tags_to_spans
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> doc.ents = biluo_tags_to_spans(doc, tags)
> ```

| Name        | Description                                                                                                                                                                                                                                                  |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc`       | The document that the BILUO tags refer to. ~~Doc~~                                                                                                                                                                                                           |
| `tags`      | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
| **RETURNS** | A sequence of `Span` objects with added entity labels. ~~List[Span]~~                                                                                                                                                                                        |

## Utility functions {#util source="spacy/util.py"}

spaCy comes with a small collection of utility functions located in
[`spacy/util.py`](%%GITHUB_SPACY/spacy/util.py). Because utility functions are
mostly intended for **internal use within spaCy**, their behavior may change
with future releases. The functions documented on this page should be safe to
use and we'll try to ensure backwards compatibility. However, we recommend
having additional tests in place if your application depends on any of spaCy's
utilities.

### util.get_lang_class {#util.get_lang_class tag="function"}

Import and load a `Language` class. Allows lazy-loading
[language data](/usage/linguistic-features#language-data) and importing
languages using the two-letter language code. To add a language code for a
custom language class, you can register it using the
[`@registry.languages`](/api/top-level#registry) decorator.

> #### Example
>
> ```python
> for lang_id in ["en", "de"]:
>     lang_class = util.get_lang_class(lang_id)
>     lang = lang_class()
> ```

| Name        | Description                                    |
| ----------- | ---------------------------------------------- |
| `lang`      | Two-letter language code, e.g. `"en"`. ~~str~~ |
| **RETURNS** | The respective subclass. ~~Language~~          |

### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}

Check whether a `Language` subclass is already loaded. `Language` subclasses are
loaded lazily to avoid expensive setup code associated with the language data.

> #### Example
>
> ```python
> lang_cls = util.get_lang_class("en")
> assert util.lang_class_is_loaded("en") is True
> assert util.lang_class_is_loaded("de") is False
> ```

| Name        | Description                                    |
| ----------- | ---------------------------------------------- |
| `name`      | Two-letter language code, e.g. `"en"`. ~~str~~ |
| **RETURNS** | Whether the class has been loaded. ~~bool~~    |

### util.load_model {#util.load_model tag="function" new="2"}

Load a pipeline from a package or data path. If called with a string name, spaCy
will assume the pipeline is a Python package and import and call its `load()`
method. If called with a path, spaCy will assume it's a data directory, read the
language and pipeline settings from the [`config.cfg`](/api/data-formats#config)
and create a `Language` object. The model data will then be loaded in via
[`Language.from_disk`](/api/language#from_disk).

> #### Example
>
> ```python
> nlp = util.load_model("en_core_web_sm")
> nlp = util.load_model("en_core_web_sm", exclude=["ner"])
> nlp = util.load_model("/path/to/data")
> ```

| Name                                 | Description                                                                                                                                                                                                                                      |
| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `name`                               | Package name or path. ~~str~~                                                                                                                                                                                                                    |
| `vocab` <Tag variant="new">3</Tag>   | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~                                                                                                            |
| `disable`                            | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [`nlp.enable_pipe`](/api/language#enable_pipe). ~~List[str]~~ |
| `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~                                                                                                             |
| `config` <Tag variant="new">3</Tag>  | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~                                                                                                   |
| **RETURNS**                          | `Language` class with the loaded pipeline. ~~Language~~                                                                                                                                                                                          |
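
The two accepted shapes of `config` overrides, nested and flat with dot notation, describe the same settings. The sketch below shows the correspondence with a hypothetical helper `dot_to_nested`; the override values are made up for illustration, and spaCy's own handling may differ.

```python
from typing import Any, Dict


def dot_to_nested(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: expand dot-notation keys into a nested dict."""
    nested: Dict[str, Any] = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


flat_overrides = {"nlp.pipeline": ["ner"], "nlp.batch_size": 64}
nested_overrides = dot_to_nested(flat_overrides)
```

Both `flat_overrides` and `nested_overrides` express the same two settings in the `nlp` section.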

### util.load_model_from_init_py {#util.load_model_from_init_py tag="function" new="2"}

A helper function to use in the `load()` method of a pipeline package's
[`__init__.py`](https://github.com/explosion/spacy-models/tree/master/template/model/xx_model_name/__init__.py).

> #### Example
>
> ```python
> from spacy.util import load_model_from_init_py
>
> def load(**overrides):
>     return load_model_from_init_py(__file__, **overrides)
> ```

| Name                                 | Description                                                                                                                                                                                                                                    |
| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `init_file`                          | Path to package's `__init__.py`, i.e. `__file__`. ~~Union[str, Path]~~                                                                                                                                                                         |
| `vocab` <Tag variant="new">3</Tag>   | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~                                                                                                          |
| `disable`                            | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ |
| `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~                                                                                                           |
| `config` <Tag variant="new">3</Tag>  | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~                                                                                                 |
| **RETURNS**                          | `Language` object with the loaded pipeline. ~~Language~~                                                                                                                                                                                       |

### util.load_config {#util.load_config tag="function" new="3"}

Load a pipeline's [`config.cfg`](/api/data-formats#config) from a file path. The
config typically includes details about the components and how they're created,
as well as all training settings and hyperparameters.

> #### Example
>
> ```python
> config = util.load_config("/path/to/config.cfg")
> print(config.to_str())
> ```

| Name          | Description                                                                                                                                                                 |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`        | Path to the pipeline's `config.cfg`. ~~Union[str, Path]~~                                                                                                                   |
| `overrides`   | Optional config overrides to replace in loaded config. Can be provided as nested dict, or as flat dict with keys in dot notation, e.g. `"nlp.pipeline"`. ~~Dict[str, Any]~~ |
| `interpolate` | Whether to interpolate the config and replace variables like `${paths.train}` with their values. Defaults to `False`. ~~bool~~                                              |
| **RETURNS**   | The pipeline's config. ~~Config~~                                                                                                                                           |
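
The two accepted shapes for `overrides` carry the same information. As a minimal sketch of that relationship (not spaCy's own implementation), the hypothetical helper `dot_to_nested` below expands flat dot-notation keys into a nested dict using only the standard library. The keys and values are made up for illustration.

```python
from typing import Any, Dict


def dot_to_nested(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand keys like "nlp.pipeline" into nested dictionaries (hypothetical helper)."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        # Everything before the last dot names a section, the last part is the setting
        *sections, setting = dotted_key.split(".")
        node = nested
        for section in sections:
            node = node.setdefault(section, {})
        node[setting] = value
    return nested


# Illustrative overrides only, not taken from a real config
flat_overrides = {"nlp.pipeline": ["tagger"], "training.seed": 1}
nested_overrides = dot_to_nested(flat_overrides)
print(nested_overrides)
```

Either form could then be passed as `overrides`, depending on which is more convenient to build in your code.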

### util.load_meta {#util.load_meta tag="function" new="3"}

Get a pipeline's [`meta.json`](/api/data-formats#meta) from a file path and
validate its contents. The meta typically includes details about author,
licensing, data sources and version.

> #### Example
>
> ```python
> meta = util.load_meta("/path/to/meta.json")
> ```

| Name        | Description                                              |
| ----------- | -------------------------------------------------------- |
| `path`      | Path to the pipeline's `meta.json`. ~~Union[str, Path]~~ |
| **RETURNS** | The pipeline's meta data. ~~Dict[str, Any]~~             |

### util.get_installed_models {#util.get_installed_models tag="function" new="3"}

List all pipeline packages installed in the current environment. This will
include any spaCy pipeline that was packaged with
[`spacy package`](/api/cli#package). Under the hood, pipeline packages expose a
Python entry point that spaCy can check, without having to load the `nlp`
object.

> #### Example
>
> ```python
> names = util.get_installed_models()
> ```

| Name        | Description                                                                           |
| ----------- | ------------------------------------------------------------------------------------- |
| **RETURNS** | The string names of the pipelines installed in the current environment. ~~List[str]~~ |
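
As a rough sketch of the entry point lookup described above, the standard library can list entry point names without importing the packages that declare them. The group name `"spacy_models"` and the helper `list_entry_point_names` are assumptions made for illustration. In practice, call `util.get_installed_models()` instead.

```python
from importlib import metadata
from typing import List


def list_entry_point_names(group: str) -> List[str]:
    """Return the names registered under an entry point group (hypothetical helper)."""
    entry_points = metadata.entry_points()
    if hasattr(entry_points, "select"):
        # Newer Python versions expose a selectable collection
        selected = entry_points.select(group=group)
    else:
        # Older Python versions return a plain dict keyed by group
        selected = entry_points.get(group, [])
    return sorted({entry_point.name for entry_point in selected})


# Assumption: pipeline packages register themselves under this group name
print(list_entry_point_names("spacy_models"))
```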

### util.is_package {#util.is_package tag="function"}

Check if string maps to a package installed via pip. Mainly used to validate
[pipeline packages](/usage/models).

> #### Example
>
> ```python
> util.is_package("en_core_web_sm") # True
> util.is_package("xyz") # False
> ```

| Name        | Description                                           |
| ----------- | ----------------------------------------------------- |
| `name`      | Name of package. ~~str~~                              |
| **RETURNS** | `True` if installed package, `False` if not. ~~bool~~ |
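
One way to make such a check without importing the package is to look up its distribution metadata. The sketch below uses only the standard library. The helper name `is_installed` is hypothetical, and the approach may differ from what `util.is_package` does internally.

```python
from importlib import metadata


def is_installed(name: str) -> bool:
    """Check whether a distribution with this name is installed (hypothetical helper)."""
    try:
        metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return False
    return True


# A name that is very unlikely to exist in any environment
print(is_installed("surely-not-a-real-package-xyz-123"))
```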

### util.get_package_path {#util.get_package_path tag="function" new="2"}

Get path to an installed package. Mainly used to resolve the location of
[pipeline packages](/usage/models). Currently imports the package to find its
path.

> #### Example
>
> ```python
> util.get_package_path("en_core_web_sm")
> # /usr/lib/python3.6/site-packages/en_core_web_sm
> ```

| Name           | Description                                  |
| -------------- | -------------------------------------------- |
| `package_name` | Name of installed package. ~~str~~           |
| **RETURNS**    | Path to pipeline package directory. ~~Path~~ |

### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"}

Check if user is running spaCy from a [Jupyter](https://jupyter.org) notebook by
detecting the IPython kernel. Mainly used for the
[`displacy`](/api/top-level#displacy) visualizer.

> #### Example
>
> ```python
> html = "<h1>Hello world!</h1>"
> if util.is_in_jupyter():
>     from IPython.display import display, HTML
>     display(HTML(html))
> ```

| Name        | Description                                    |
| ----------- | ---------------------------------------------- |
| **RETURNS** | `True` if in Jupyter, `False` if not. ~~bool~~ |

### util.compile_prefix_regex {#util.compile_prefix_regex tag="function"}

Compile a sequence of prefix rules into a regex object.

> #### Example
>
> ```python
> prefixes = ("§", "%", "=", r"\+")
> prefix_regex = util.compile_prefix_regex(prefixes)
> nlp.tokenizer.prefix_search = prefix_regex.search
> ```

| Name        | Description                                                                                                                                 |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `entries`   | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
| **RETURNS** | The regex object to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~                                         |
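
To make the idea of "compiling a sequence of rules" concrete, here is a minimal sketch using only the `re` module. It assumes that each rule is anchored to the start of the string and that the rules are joined as alternatives. That is an illustration of the concept, not a statement about the exact expression `compile_prefix_regex` builds.

```python
import re

prefixes = ("§", "%", "=", r"\+")

# Assumption: anchor every rule with "^" and combine them with "|"
prefix_pattern = re.compile("|".join("^" + rule for rule in prefixes))

# A prefix is only found at the very beginning of the string
at_start = prefix_pattern.search("§12")
in_middle = prefix_pattern.search("12§")
print(at_start.group(), in_middle)
```

Because every alternative is anchored, `search` only ever reports a match that begins at position 0, which is the behavior a prefix rule needs.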

### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"}

Compile a sequence of suffix rules into a regex object.

> #### Example
>
> ```python
> suffixes = ("'s", "'S", r"(?<=[0-9])\+")
> suffix_regex = util.compile_suffix_regex(suffixes)
> nlp.tokenizer.suffix_search = suffix_regex.search
> ```

| Name        | Description                                                                                                                                 |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `entries`   | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
| **RETURNS** | The regex object to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~                                         |

### util.compile_infix_regex {#util.compile_infix_regex tag="function"}

Compile a sequence of infix rules into a regex object.

> #### Example
>
> ```python
> infixes = ("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])")
> infix_regex = util.compile_infix_regex(infixes)
> nlp.tokenizer.infix_finditer = infix_regex.finditer
> ```

| Name        | Description                                                                                                                               |
| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `entries`   | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
| **RETURNS** | The regex object to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~                                      |
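
Unlike prefixes and suffixes, infix rules can match anywhere inside a string, so the tokenizer needs all match positions rather than a single match. The sketch below uses only the `re` module and assumes the rules are simply joined as alternatives. It illustrates the concept and is not necessarily how `compile_infix_regex` is implemented.

```python
import re

infixes = ("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])")

# Assumption: combine the rules with "|" without anchoring them
infix_pattern = re.compile("|".join(infixes))

# finditer yields one match per infix, with character offsets into the string
hyphen_offsets = [(m.start(), m.end()) for m in infix_pattern.finditer("well-known")]
plus_offsets = [(m.start(), m.end()) for m in infix_pattern.finditer("3+4")]
print(hyphen_offsets, plus_offsets)
```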

### util.minibatch {#util.minibatch tag="function" new="2"}

Iterate over batches of items. `size` may be an iterator, so that the batch size
can vary on each step.

> #### Example
>
> ```python
> from spacy.util import minibatch
>
> batches = minibatch(train_data, size=8)
> for batch in batches:
>     nlp.update(batch)
> ```

| Name       | Description                                      |
| ---------- | ------------------------------------------------ |
| `items`    | The items to batch up. ~~Iterable[Any]~~         |
| `size`     | The batch size(s). ~~Union[int, Sequence[int]]~~ |
| **YIELDS** | The batches.                                     |
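
The effect of passing a fixed size versus a series of sizes can be sketched in plain Python. The generator `minibatch_sketch` below is a hypothetical stand-in written for illustration, not spaCy's implementation. It stops when either the items or the sizes run out.

```python
import itertools
from typing import Any, Iterable, Iterator, List, Union


def minibatch_sketch(
    items: Iterable[Any], size: Union[int, Iterable[int]]
) -> Iterator[List[Any]]:
    """Yield lists of items, taking the next batch size on each step (hypothetical helper)."""
    # A single int is treated as the same size repeated forever
    sizes = itertools.repeat(size) if isinstance(size, int) else iter(size)
    item_iter = iter(items)
    for batch_size in sizes:
        batch = list(itertools.islice(item_iter, batch_size))
        if not batch:
            return
        yield batch


fixed = list(minibatch_sketch(range(7), 3))
growing = list(minibatch_sketch(range(7), [1, 2, 4, 8]))
print(fixed)
print(growing)
```

With a growing series of sizes, early batches stay small and later ones get larger, which is the pattern the variable `size` argument is meant to support.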

### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"}

Filter a sequence of [`Span`](/api/span) objects and remove duplicates or
overlaps. Useful for creating named entities (where one token can only be part
of one entity) or when merging spans with
[`Retokenizer.merge`](/api/doc#retokenizer.merge). When spans overlap, the
(first) longest span is preferred over shorter spans.

> #### Example
>
> ```python
> from spacy.util import filter_spans
>
> doc = nlp("This is a sentence.")
> spans = [doc[0:2], doc[0:2], doc[0:4]]
> filtered = filter_spans(spans)
> ```

| Name        | Description                             |
| ----------- | --------------------------------------- |
| `spans`     | The spans to filter. ~~Iterable[Span]~~ |
| **RETURNS** | The filtered spans. ~~List[Span]~~      |
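
The preference for the (first) longest span can be sketched with plain `(start, end)` token offsets instead of `Span` objects. The helper `filter_offsets` is hypothetical and only illustrates the rule described above. It is not spaCy's implementation.

```python
from typing import List, Set, Tuple

Offsets = Tuple[int, int]


def filter_offsets(spans: List[Offsets]) -> List[Offsets]:
    """Keep the longest non-overlapping spans, preferring earlier ones on ties (hypothetical helper)."""
    # Longest first; among equal lengths, the span that starts earlier comes first
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[Offsets] = []
    seen_tokens: Set[int] = set()
    for start, end in ordered:
        # Skip duplicates and anything that shares a token with a kept span
        if any(i in seen_tokens for i in range(start, end)):
            continue
        kept.append((start, end))
        seen_tokens.update(range(start, end))
    return sorted(kept)


duplicates = filter_offsets([(0, 2), (0, 2), (0, 4)])
overlaps = filter_offsets([(0, 2), (1, 3), (3, 5)])
print(duplicates, overlaps)
```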

### util.get_words_and_spaces {#get_words_and_spaces tag="function" new="3"}

Given a list of words and a text, reconstruct the original tokens and return a
list of words and spaces that can be used to create a [`Doc`](/api/doc#init).
This can help recover from destructive tokenization that didn't preserve any
whitespace information.

> #### Example
>
> ```python
> from spacy.util import get_words_and_spaces
>
> orig_words = ["Hey", ",", "what", "'s", "up", "?"]
> orig_text = "Hey, what's up?"
> words, spaces = get_words_and_spaces(orig_words, orig_text)
> # ['Hey', ',', 'what', "'s", 'up', '?']
> # [False, True, False, True, False, False]
> ```

| Name        | Description                                                                                                                                        |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| `words`     | The list of words. ~~Iterable[str]~~                                                                                                               |
| `text`      | The original text. ~~str~~                                                                                                                         |
| **RETURNS** | A list of words and a list of boolean values indicating whether the word at this position is followed by a space. ~~Tuple[List[str], List[bool]]~~ |
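
The alignment can be pictured as walking through the text and checking what follows each word. The sketch below is a simplified illustration under stated assumptions: every word occurs in the text in order, and words are separated by at most one space. The helper `words_and_spaces_sketch` is hypothetical and does not attempt to cover other whitespace layouts.

```python
from typing import List, Sequence, Tuple


def words_and_spaces_sketch(words: Sequence[str], text: str) -> Tuple[List[str], List[bool]]:
    """Recover trailing-space flags by locating each word in the text (hypothetical helper)."""
    spaces: List[bool] = []
    position = 0
    for word in words:
        # Raises ValueError if the word cannot be found from the current position
        start = text.index(word, position)
        position = start + len(word)
        followed_by_space = text[position:position + 1] == " "
        spaces.append(followed_by_space)
        if followed_by_space:
            position += 1
    return list(words), spaces


words, spaces = words_and_spaces_sketch(["Hey", ",", "what", "'s", "up", "?"], "Hey, what's up?")
print(words)
print(spaces)
```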
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								---
 								title: Top-level Functions
 								menu:
 								  - ['spacy', 'spacy']
 								  - ['displacy', 'displacy']
-												Update docs and add keyword-only tag

											
										
										
											2020-07-06 19:14:57 +03:00
+								  - ['registry', 'registry']
-												some more information about the loggers

											
										
										
											2020-08-31 14:43:17 +03:00
+								  - ['Loggers', 'loggers']
-												Update docs and consistency [ci skip]

											
										
										
											2020-09-15 01:32:49 +03:00
+								  - ['Readers', 'readers']
-												Update docs

											
										
										
											2020-08-06 20:30:43 +03:00
+								  - ['Batchers', 'batchers']
-												Update augmenter lookups and docs

											
										
										
											2020-10-01 00:03:47 +03:00
+								  - ['Augmenters', 'augmenters']
 								  - ['Training & Alignment', 'gold']
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								  - ['Utility Functions', 'util']
 								---
 								## spaCy {#spacy hidden="true"}
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								### spacy.load {#spacy.load tag="function"}
 								Load a pipeline using the name of an installed
 								[package](/usage/saving-loading#models), a string path or a `Path`-like object.
 								spaCy will try resolving the load argument in this order. If a pipeline is
 								loaded from a string name, spaCy will assume it's a Python package and import it
 								and call the package's own `load()` method. If a pipeline is loaded from a path,
 								spaCy will assume it's a data directory, load its
-												Update docs and consistency [ci skip]

											
										
										
											2020-08-18 15:39:40 +03:00
+								[`config.cfg`](/api/data-formats#config) and use the language and pipeline
 								information to construct the `Language` class. The data will be loaded in via
 								[`Language.from_disk`](/api/language#from_disk).
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
-												Update docs

											
										
										
											2020-08-29 13:36:05 +03:00
+								<Infobox variant="warning" title="Changed in v3.0">
 								As of v3.0, the `disable` keyword argument specifies components to load but
 								disable, instead of components to not load at all. Those components can now be
 								specified separately using the new `exclude` keyword argument.
 								</Infobox>
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								> #### Example
 								>
 								> ```python
 								> nlp = spacy.load("en_core_web_sm") # package
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								> nlp = spacy.load("/path/to/pipeline") # string path
 								> nlp = spacy.load(Path("/path/to/pipeline")) # pathlib Path
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								>
-												Update docs

											
										
										
											2020-08-29 13:36:05 +03:00
+								> nlp = spacy.load("en_core_web_sm", exclude=["parser", "tagger"])
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								> ```
-												Update docs

											
										
										
											2020-08-29 13:36:05 +03:00
+								| Name                                 | Description                                                                                                                                                                                                                                    |
 								| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								| `name`                               | Pipeline to load, i.e. package name or path. ~~Union[str, Path]~~                                                                                                                                                                              |
-												Update docs

											
										
										
											2020-08-29 13:36:05 +03:00
+								| _keyword-only_                       |                                                                                                                                                                                                                                                |
 								| `disable`                            | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ |
 								| `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~                                                                                                           |
 								| `config` <Tag variant="new">3</Tag>  | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~                                                                             |
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								| **RETURNS**                          | A `Language` object with the loaded pipeline. ~~Language~~                                                                                                                                                                                     |
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								Essentially, `spacy.load()` is a convenience wrapper that reads the pipeline's
-												Update docs and consistency [ci skip]

											
										
										
											2020-08-18 15:39:40 +03:00
+								[`config.cfg`](/api/data-formats#config), uses the language and pipeline
 								information to construct a `Language` object, loads in the model data and
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								weights, and returns it.
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
```python
### Abstract example
cls = spacy.util.get_lang_class(lang)  # 1. Get Language class, e.g. English
nlp = cls()                            # 2. Initialize it
for name in pipeline:
    nlp.add_pipe(name)                 # 3. Add the component to the pipeline
nlp.from_disk(data_path)               # 4. Load in the binary data
```

### spacy.blank {#spacy.blank tag="function" new="2"}

Create a blank pipeline of a given language class. This function is the twin of
`spacy.load()`.

> #### Example
>
> ```python
> nlp_en = spacy.blank("en")   # equivalent to English()
> nlp_de = spacy.blank("de")   # equivalent to German()
> ```

| Name                                | Description                                                                                                                                                        |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `name`                              | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. ~~str~~                                                           |
| _keyword-only_                      |                                                                                                                                                                    |
| `vocab` <Tag variant="new">3</Tag>  | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~                              |
| `config` <Tag variant="new">3</Tag> | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~ |
| `meta` <Tag variant="new">3</Tag>   | Optional meta overrides for [`nlp.meta`](/api/language#meta). ~~Dict[str, Any]~~                                                                                   |
| **RETURNS**                         | An empty `Language` object of the appropriate subclass. ~~Language~~                                                                                               |

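
The two accepted shapes for `config` overrides describe the same settings. As a rough illustration, a dict keyed in dot notation can be expanded into the equivalent nested dict as sketched below. The helper `dot_to_nested` and the key `components.name.other_value` are hypothetical and exist only for this example; this is not spaCy's own code.

```python
from typing import Any, Dict


def dot_to_nested(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Expand {"a.b.c": 1} into {"a": {"b": {"c": 1}}}.

    Hypothetical helper for illustration only, not part of spaCy's API.
    """
    nested: Dict[str, Any] = {}
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# "components.name.value" is the placeholder key from the table above.
flat = {"components.name.value": 1, "components.name.other_value": 2}
nested = dot_to_nested(flat)
print(nested)
```

Under this reading, `flat` and `nested` express the same overrides, so either form could be passed as `config`.
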
### spacy.info {#spacy.info tag="function"}

The same as the [`info` command](/api/cli#info). Pretty-print information about
your installation, installed pipelines and local setup from within spaCy.

> #### Example
>
> ```python
> spacy.info()
> spacy.info("en_core_web_sm")
> markdown = spacy.info(markdown=True, silent=True)
> ```

| Name           | Description                                                                  |
| -------------- | ---------------------------------------------------------------------------- |
| `model`        | Optional pipeline, i.e. a package name or path. ~~Optional[str]~~            |
| _keyword-only_ |                                                                              |
| `markdown`     | Print information as Markdown. ~~bool~~                                      |
| `silent`       | Don't print anything, just return. ~~bool~~                                  |

### spacy.explain {#spacy.explain tag="function"}

Get a description for a given POS tag, dependency label or entity type. For a
list of available terms, see [`glossary.py`](%%GITHUB_SPACY/spacy/glossary.py).

> #### Example
>
> ```python
> spacy.explain("NORP")
> # Nationalities or religious or political groups
>
> doc = nlp("Hello world")
> for word in doc:
>     print(word.text, word.tag_, spacy.explain(word.tag_))
> # Hello UH interjection
> # world NN noun, singular or mass
> ```

| Name        | Description                                                                |
| ----------- | -------------------------------------------------------------------------- |
| `term`      | Term to explain. ~~str~~                                                   |
| **RETURNS** | The explanation, or `None` if not found in the glossary. ~~Optional[str]~~ |

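
Because the return type is `Optional[str]`, callers should be ready for `None` when a term is not in the glossary. The sketch below models that lookup with a tiny stand-in dictionary; `GLOSSARY` and `explain` here are simplified stand-ins written for this example, not spaCy's implementation, and only the three descriptions shown in the example above are included.

```python
from typing import Dict, Optional

# Tiny stand-in glossary; the full list of terms lives in spacy/glossary.py.
GLOSSARY: Dict[str, str] = {
    "NORP": "Nationalities or religious or political groups",
    "UH": "interjection",
    "NN": "noun, singular or mass",
}


def explain(term: str) -> Optional[str]:
    """Return the description for a term, or None if it is unknown."""
    return GLOSSARY.get(term)


for term in ["NORP", "NN", "NOT_A_LABEL"]:
    description = explain(term)
    # Handle None explicitly instead of assuming a string comes back.
    print(term, "->", description if description is not None else "(no description)")
```
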
### spacy.prefer_gpu {#spacy.prefer_gpu tag="function" new="2.0.14"}

Allocate data and perform operations on [GPU](/usage/#gpu), if available. If
data has already been allocated on CPU, it will not be moved. Ideally, this
function should be called right after importing spaCy and _before_ loading any
pipelines.

> #### Example
>
> ```python
> import spacy
> activated = spacy.prefer_gpu()
> nlp = spacy.load("en_core_web_sm")
> ```

| Name        | Description                                      |
| ----------- | ------------------------------------------------ |
| `gpu_id`    | Device index to select. Defaults to `0`. ~~int~~ |
| **RETURNS** | Whether the GPU was activated. ~~bool~~          |

### spacy.require_gpu {#spacy.require_gpu tag="function" new="2.0.14"}

Allocate data and perform operations on [GPU](/usage/#gpu). Will raise an error
if no GPU is available. If data has already been allocated on CPU, it will not
be moved. Ideally, this function should be called right after importing spaCy
and _before_ loading any pipelines.

> #### Example
>
> ```python
> import spacy
> spacy.require_gpu()
> nlp = spacy.load("en_core_web_sm")
> ```

| Name        | Description                                      |
| ----------- | ------------------------------------------------ |
| `gpu_id`    | Device index to select. Defaults to `0`. ~~int~~ |
| **RETURNS** | `True` ~~bool~~                                  |

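
The difference between `spacy.prefer_gpu` and `spacy.require_gpu` comes down to what happens when no GPU is found. The sketch below models that contract in plain Python. The names `prefer_gpu_sketch`, `require_gpu_sketch` and `gpu_available` are hypothetical, the code touches no real device, and the exception type is an assumption rather than the one spaCy raises.

```python
def prefer_gpu_sketch(gpu_available: bool) -> bool:
    """Model of the 'prefer' contract: report success, never raise."""
    return gpu_available


def require_gpu_sketch(gpu_available: bool) -> bool:
    """Model of the 'require' contract: raise if no GPU, else return True."""
    if not gpu_available:
        # The exact exception type raised by spaCy is not modeled here.
        raise RuntimeError("No GPU available")
    return True


# With no GPU, 'prefer' falls back quietly while 'require' fails loudly.
print(prefer_gpu_sketch(gpu_available=False))
try:
    require_gpu_sketch(gpu_available=False)
except RuntimeError as err:
    print("error:", err)
```

In practice, this suggests checking the return value of `spacy.prefer_gpu()` when you need to know which device is in use, while `spacy.require_gpu()` suits scripts that should stop early without a GPU.
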
## displaCy {#displacy source="spacy/displacy"}

As of v2.0, spaCy comes with a built-in visualization suite. For more info and
examples, see the usage guide on [visualizing spaCy](/usage/visualizers).

### displacy.serve {#displacy.serve tag="method" new="2"}

Serve a dependency parse tree or named entity visualization to view it in your
browser. Will run a simple web server.

> #### Example
>
> ```python
> import spacy
> from spacy import displacy
> nlp = spacy.load("en_core_web_sm")
> doc1 = nlp("This is a sentence.")
> doc2 = nlp("This is another sentence.")
> displacy.serve([doc1, doc2], style="dep")
> ```

| Name      | Description                                                                                                                                                       |
| --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs`    | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span]], Doc, Span]~~                                                                             |
| `style`   | Visualization style, `"dep"` or `"ent"`. Defaults to `"dep"`. ~~str~~                                                                                             |
| `page`    | Render markup as full HTML page. Defaults to `True`. ~~bool~~                                                                                                     |
| `minify`  | Minify HTML markup. Defaults to `False`. ~~bool~~                                                                                                                 |
| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~                                                                                 |
| `manual`  | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
| `port`    | Port to serve visualization. Defaults to `5000`. ~~int~~                                                                                                          |
| `host`    | Host to serve visualization. Defaults to `"0.0.0.0"`. ~~str~~                                                                                                     |
### displacy.render {#displacy.render tag="method" new="2"}

Render a dependency parse tree or named entity visualization.

> #### Example
>
> ```python
> import spacy
> from spacy import displacy
> nlp = spacy.load("en_core_web_sm")
> doc = nlp("This is a sentence.")
> html = displacy.render(doc, style="dep")
> ```

| Name        | Description                                                                                                                                                                            |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs`      | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span]], Doc, Span]~~                                                                                                  |
| `style`     | Visualization style, `"dep"` or `"ent"`. Defaults to `"dep"`. ~~str~~                                                                                                                  |
| `page`      | Render markup as full HTML page. Defaults to `True`. ~~bool~~                                                                                                                          |
| `minify`    | Minify HTML markup. Defaults to `False`. ~~bool~~                                                                                                                                      |
| `options`   | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~                                                                                                      |
| `manual`    | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~                      |
| `jupyter`   | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None` (default). ~~Optional[bool]~~ |
| **RETURNS** | The rendered HTML markup. ~~str~~                                                                                                                                                      |
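
As a minimal sketch of how `page`, `minify` and `jupyter` work together, the
snippet below renders entity markup without a trained pipeline. The blank
English pipeline, the example sentence and the hand-set entity spans are
assumptions made for illustration only. All three arguments are passed
explicitly so the result doesn't depend on their defaults.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: a blank pipeline is enough here because the entities are set by hand.
nlp = spacy.blank("en")
doc = nlp("Apple is opening an office in Berlin.")
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 6, 7, label="GPE")]

# jupyter=False asks for the markup as a plain string, even inside a notebook.
fragment = displacy.render(doc, style="ent", page=False, jupyter=False)

# page=True wraps the markup in a complete HTML document, and minify=True
# removes newlines and indentation from it.
full_page = displacy.render(doc, style="ent", page=True, minify=True, jupyter=False)
```

The `fragment` string is suited to embedding in an existing page, while
`full_page` can be written to an `.html` file as-is.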
### Visualizer options {#displacy_options}

The `options` argument lets you specify additional settings for each visualizer.
If a setting is not present in the options, the default value will be used.

#### Dependency Visualizer options {#options-dep}

> #### Example
>
> ```python
> options = {"compact": True, "color": "blue"}
> displacy.serve(doc, style="dep", options=options)
> ```

| Name                                       | Description                                                                                                                                  |
| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- |
| `fine_grained`                             | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). Defaults to `False`. ~~bool~~             |
| `add_lemma` <Tag variant="new">2.2.4</Tag> | Print the lemmas in a separate row below the token texts. Defaults to `False`. ~~bool~~                                                      |
| `collapse_punct`                           | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. Defaults to `True`. ~~bool~~ |
| `collapse_phrases`                         | Merge noun phrases into one token. Defaults to `False`. ~~bool~~                                                                             |
| `compact`                                  | "Compact mode" with square arrows that takes up less space. Defaults to `False`. ~~bool~~                                                    |
| `color`                                    | Text color (HEX, RGB or color names). Defaults to `"#000000"`. ~~str~~                                                                       |
| `bg`                                       | Background color (HEX, RGB or color names). Defaults to `"#ffffff"`. ~~str~~                                                                 |
| `font`                                     | Font name or font family for all text. Defaults to `"Arial"`. ~~str~~                                                                        |
| `offset_x`                                 | Spacing on left side of the SVG in px. Defaults to `50`. ~~int~~                                                                             |
| `arrow_stroke`                             | Width of arrow path in px. Defaults to `2`. ~~int~~                                                                                          |
| `arrow_width`                              | Width of arrow head in px. Defaults to `10` in regular mode and `8` in compact mode. ~~int~~                                                 |
| `arrow_spacing`                            | Spacing between arrows in px to avoid overlaps. Defaults to `20` in regular mode and `12` in compact mode. ~~int~~                           |
| `word_spacing`                             | Vertical spacing between words and arcs in px. Defaults to `45`. ~~int~~                                                                     |
| `distance`                                 | Distance between words in px. Defaults to `175` in regular mode and `150` in compact mode. ~~int~~                                           |
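
The options above can be tried out without a trained pipeline by passing a
hand-written parse with `manual=True`. The following is a sketch under that
assumption: the words, tags and arcs are made up for illustration, and the
particular option values are arbitrary.

```python
from spacy import displacy

# Assumption: a hand-written parse in displaCy's manual input format, so no
# trained pipeline is needed for this sketch.
parse = {
    "words": [
        {"text": "This", "tag": "DT"},
        {"text": "is", "tag": "VBZ"},
        {"text": "a", "tag": "DT"},
        {"text": "sentence", "tag": "NN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"},
    ],
}

# Settings that are left out of the dict fall back to their defaults.
options = {"compact": True, "color": "blue", "bg": "#fafafa", "distance": 120}

svg = displacy.render(
    parse, style="dep", manual=True, page=False, jupyter=False, options=options
)
```

With `page=False`, the returned string is the SVG markup on its own, so it can
be saved to an `.svg` file or embedded in other HTML.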
#### Named Entity Visualizer options {#displacy_options-ent}

> #### Example
>
> ```python
> options = {"ents": ["PERSON", "ORG", "PRODUCT"],
>            "colors": {"ORG": "yellow"}}
> displacy.serve(doc, style="ent", options=options)
> ```

| Name                                    | Description                                                                                                                                                                                                                                 |
| --------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ents`                                  | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~                                                                                                                                                        |
| `colors`                                | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~                                                                                                                                                 |
| `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](%%GITHUB_SPACY/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |

By default, displaCy comes with colors for all entity types used by
[spaCy's trained pipelines](/models). If you're using custom entity types, you
can use the `colors` setting to add your own colors for them. Your application
or pipeline package can also expose a
[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
to add custom labels and their colors automatically.
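
To illustrate the `ents` and `colors` settings with a custom entity type, here
is a small sketch that uses manual-format input. The text, the character
offsets and the `GADGET` label are hypothetical and only serve as an example.

```python
from spacy import displacy

# Assumption: manual-format input with a made-up custom label "GADGET".
example = {
    "text": "Ada Lovelace reviewed the Pixel for Google.",
    "ents": [
        {"start": 0, "end": 12, "label": "PERSON"},
        {"start": 26, "end": 31, "label": "GADGET"},
        {"start": 36, "end": 42, "label": "ORG"},
    ],
    "title": None,
}

# Only PERSON and GADGET are highlighted; the ORG span is rendered as plain
# text. The custom label gets its own color via the "colors" setting.
options = {"ents": ["PERSON", "GADGET"], "colors": {"GADGET": "gold"}}

html = displacy.render(
    example, style="ent", manual=True, page=False, jupyter=False, options=options
)
```

Labels without an entry in `colors` keep their built-in color if displaCy has
one for them.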
## registry {#registry source="spacy/util.py" new="3"}

spaCy's function registry extends
[Thinc's `registry`](https://thinc.ai/docs/api-config#registry) and allows you
to map strings to functions. You can register functions to create architectures,
optimizers, schedules and more, and then refer to them and set their arguments
in your [config file](/usage/training#config). Python type hints are used to
validate the inputs. See the
[Thinc docs](https://thinc.ai/docs/api-config#registry) for details on the
`registry` methods and our helper library
[`catalogue`](https://github.com/explosion/catalogue) for some background on the
concept of function registries. spaCy also uses the function registry for
language subclasses, model architecture, lookups and pipeline component
factories.

> #### Example
>
> ```python
> from typing import Iterator
> import spacy
>
> @spacy.registry.schedules("waltzing.v1")
> def waltzing() -> Iterator[float]:
>     i = 0
>     while True:
>         yield i % 3 + 1
>         i += 1
> ```

| Registry name     | Description                                                                                                                                                                                                                                        |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `architectures`   | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`.                                                                           |
| `augmenters`      | Registry for functions that create [data augmentation](#augmenters) callbacks for corpora and other training data iterators.                                                                                                                       |
| `batchers`        | Registry for training and evaluation [data batchers](#batchers).                                                                                                                                                                                   |
| `callbacks`       | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training.                                                                                                                             |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                             |
| `factories`       | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
| `initializers`    | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers).                                                                                                                                                         |
| `languages`       | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points).                                                                                                                 |
| `layers`          | Registry for functions that create [layers](https://thinc.ai/docs/api-layers).                                                                                                                                                                     |
| `loggers`         | Registry for functions that log [training results](/usage/training).                                                                                                                                                                               |
| `lookups`         | Registry for large lookup tables available via `vocab.lookups`.                                                                                                                                                                                    |
| `losses`          | Registry for functions that create [losses](https://thinc.ai/docs/api-loss).                                                                                                                                                                       |
| `misc`            | Registry for miscellaneous functions that return data assets, knowledge bases or anything else you may need.                                                                                                                                       |
| `optimizers`      | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers).                                                                                                                                                             |
| `readers`         | Registry for file and data readers, including training and evaluation data readers like [`Corpus`](/api/corpus).                                                                                                                                   |
| `schedules`       | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules).                                                                                                                                                               |
| `tokenizers`      | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable.                                                                   |
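
As a rough sketch of the string-to-function mapping, the snippet below registers
a function in the `misc` registry and looks it up again by name. The registered
name `example_stop_words.v1` and the function itself are hypothetical. In a
config file, the same function would be referenced with
`@misc = "example_stop_words.v1"`.

```python
import spacy

# Hypothetical name and function, registered in the `misc` registry.
@spacy.registry.misc("example_stop_words.v1")
def make_stop_words():
    return {"the", "a", "an"}

# Look the function up by its string name and call it to get the data asset.
func = spacy.registry.misc.get("example_stop_words.v1")
stop_words = func()
```

Versioned names such as `.v1` follow the convention used in the example above,
which makes it possible to register a changed function under a new name later.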

### spacy-transformers registry {#registry-transformers}

The following registries are added by the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package.
See the [`Transformer`](/api/transformer) API reference and
[usage docs](/usage/embeddings-transformers) for details.

> #### Example
>
> ```python
> import spacy_transformers
>
> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
> def configure_custom_annotation_setter():
>     def annotation_setter(docs, trf_data) -> None:
>         # Set annotations on the docs
>         pass
>
>     return annotation_setter
> ```

| Registry name                                               | Description                                                                                                                                                                                                                                       |
| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`span_getters`](/api/transformer#span_getters)             | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences.                                                                                                      |
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |

## Loggers {#loggers source="spacy/training/loggers.py" new="3"}

A logger records the training results. When a logger is created, two functions
are returned: one for logging the information for each training step, and a
second function that is called to finalize the logging when the training is
finished. To log each training step, a
[dictionary](/usage/training#custom-logging) is passed on from
[`spacy train`](/api/cli#train), including information such as the training loss
and the accuracy scores on the development set.

There are two built-in logging functions: a logger printing results to the
console in tabular format (which is the default), and one that also sends the
results to a [Weights & Biases](https://www.wandb.com/) dashboard. Instead of
using one of the built-in loggers listed here, you can also
[implement your own](/usage/training#custom-logging).
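
The two-function contract described above can be sketched in plain Python. This
is only an illustration: the factory name `make_list_logger`, the `records`
list and the `info` keys used here (`step`, `losses`, `score`) are assumptions,
and the sketch leaves out registration in the `loggers` registry. See the
[custom logging](/usage/training#custom-logging) usage docs for the dictionary
that is actually passed in.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def make_list_logger(
    records: List[str],
) -> Tuple[Callable[[Optional[Dict[str, Any]]], None], Callable[[], None]]:
    """Hypothetical logger factory: returns (log_step, finalize)."""

    def log_step(info: Optional[Dict[str, Any]]) -> None:
        # Called once per training step; skip steps that carry no results.
        if info is None:
            return
        total_loss = sum(info["losses"].values())
        records.append(
            f"step={info['step']} loss={total_loss:.2f} score={info['score']:.2f}"
        )

    def finalize() -> None:
        # Called once when training is finished.
        records.append("done")

    return log_step, finalize


records: List[str] = []
log_step, finalize = make_list_logger(records)
log_step({"step": 200, "losses": {"tok2vec": 3.08, "tagger": 18968.78}, "score": 0.34})
log_step(None)
finalize()
```

Returning both callables from one factory lets the logger keep shared state (an
open file, a dashboard run) in the enclosing scope and release it in `finalize`.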

#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"}

> #### Example config
>
> ```ini
> [training.logger]
> @loggers = "spacy.ConsoleLogger.v1"
> ```

Writes the results of a training step to the console in a tabular format.

<Accordion title="Example console output" spaced>

```cli
$ python -m spacy train config.cfg
```

```
ℹ Using CPU
ℹ Loading config and nlp from: config.cfg
ℹ Pipeline: ['tok2vec', 'tagger']
ℹ Start training
ℹ Training. Initial learn rate: 0.0

E     #        LOSS TOK2VEC   LOSS TAGGER   TAG_ACC   SCORE
---   ------   ------------   -----------   -------   ------
           0           0.00         86.20      0.22     0.00
         200           3.08      18968.78     34.00     0.34
         400          31.81      22539.06     33.64     0.34
         600          92.13      22794.91     43.80     0.44
         800         183.62      21541.39     56.05     0.56
        1000         352.49      25461.82     65.15     0.65
        1200         422.87      23708.82     71.84     0.72
        1400         601.92      24994.79     76.57     0.77
        1600         662.57      22268.02     80.20     0.80
        1800        1101.50      28413.77     82.56     0.83
        2000        1253.43      28736.36     85.00     0.85
        2200        1411.02      28237.53     87.42     0.87
        2400        1605.35      28439.95     88.70     0.89
```

Note that the cumulative loss keeps increasing within one epoch, but should
start decreasing across epochs.

</Accordion>

#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"}

> #### Installation
>
> ```bash
> $ pip install wandb
> $ wandb login
> ```

Built-in logger that sends the results of each training step to the dashboard of
the [Weights & Biases](https://www.wandb.com/) tool. To use this logger, Weights
& Biases should be installed, and you should be logged in. The logger will send
the full config file to W&B, as well as various system information such as
memory utilization, network traffic, disk IO, GPU statistics, etc. This will
also include information such as your hostname and operating system, as well as
the location of your Python executable.

<Infobox variant="warning">

Note that by default, the full (interpolated)
[training config](/usage/training#config) is sent over to the W&B dashboard. If
you prefer to **exclude certain information** such as path names, you can list
those fields in "dot notation" in the `remove_config_values` parameter. These
fields will then be removed from the config before uploading, but will otherwise
remain in the config file stored on your local system.

</Infobox>

> #### Example config
>
> ```ini
> [training.logger]
> @loggers = "spacy.WandbLogger.v1"
> project_name = "monitor_spacy_training"
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
> ```

| Name                   | Description                                                                                                                           |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `project_name`         | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
| `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~                              |
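
To make the "dot notation" used by `remove_config_values` concrete, here is a
minimal plain-Python sketch that drops dotted keys from a copy of a nested
config dictionary. The helper `remove_dotted_keys` is hypothetical and is not
how the logger is necessarily implemented; it only illustrates that
`"paths.train"` addresses the `train` entry inside the `paths` block, and that
the local config stays untouched.

```python
from copy import deepcopy
from typing import Any, Dict, Iterable


def remove_dotted_keys(
    config: Dict[str, Any], dotted_keys: Iterable[str]
) -> Dict[str, Any]:
    """Return a copy of `config` without the entries named in dot notation."""
    result = deepcopy(config)  # leave the caller's config untouched
    for dotted in dotted_keys:
        *parents, last = dotted.split(".")
        node: Any = result
        for key in parents:
            # Walk down one block per dotted segment; stop if a block is missing.
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(last, None)  # silently ignore keys that are not present
    return result


config = {
    "paths": {"train": "corpus/train.spacy", "dev": "corpus/dev.spacy"},
    "training": {"logger": {"project_name": "monitor_spacy_training"}},
}
uploaded = remove_dotted_keys(
    config, ["paths.train", "paths.dev", "corpora.train.path"]
)
```

Keys that do not exist in the config (here `corpora.train.path`) are skipped in
this sketch rather than treated as errors.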

<Project id="integrations/wandb">

Get started with tracking your spaCy training runs in Weights & Biases using our
project template. It trains on the IMDB Movie Review Dataset and includes a
simple config with the built-in `WandbLogger`, as well as a custom example of
creating variants of the config for a simple hyperparameter grid search and
logging the results.

</Project>

## Readers {#readers}

### File readers {#file-readers source="github.com/explosion/srsly" new="3"}

The following file readers are provided by our serialization library
[`srsly`](https://github.com/explosion/srsly). All registered functions take one
argument `path`, pointing to the file path to load.

> #### Example config
>
> ```ini
> [corpora.train.augmenter.orth_variants]
> @readers = "srsly.read_json.v1"
> path = "corpus/en_orth_variants.json"
> ```

| Name                    | Description                                           |
| ----------------------- | ----------------------------------------------------- |
| `srsly.read_json.v1`    | Read data from a JSON file.                           |
| `srsly.read_jsonl.v1`   | Read data from a JSONL (newline-delimited JSON) file. |
| `srsly.read_yaml.v1`    | Read data from a YAML file.                           |
| `srsly.read_msgpack.v1` | Read data from a binary MessagePack file.             |

<Infobox title="Important note" variant="warning">

Since the file readers expect a local path, you should only use them in config
blocks that are **not executed at runtime** – for example, in `[training]` and
`[corpora]` (to load data or resources like data augmentation tables) or in
`[initialize]` (to pass data to pipeline components).

</Infobox>

#### spacy.read_labels.v1 {#read_labels tag="registered function"}

Read a JSON-formatted labels file generated with
[`init labels`](/api/cli#init-labels). Typically used in the
[`[initialize]`](/api/data-formats#config-initialize) block of the training
config to speed up the model initialization process and provide pre-generated
label sets.

> #### Example config
>
> ```ini
> [initialize.components]
>
> [initialize.components.ner]
>
> [initialize.components.ner.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/ner.json"
> ```

| Name        | Description                                                                                                                                                                                                               |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`      | The path to the labels file generated with [`init labels`](/api/cli#init-labels). ~~Path~~                                                                                                                                |
| `require`   | Whether to require the file to exist. If set to `False` and the labels file doesn't exist, the loader will return `None` and the `initialize` method will extract the labels from the data. Defaults to `False`. ~~bool~~ |
| **CREATES** | The                                                                                                                                                                                                                       |
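
The effect of the `require` setting can be sketched with the standard library
alone. The function `read_labels_sketch` below is a hypothetical stand-in, not
the registered reader itself, and the exception type raised for a missing
required file is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional, Union


def read_labels_sketch(path: Union[str, Path], require: bool = False) -> Optional[Any]:
    """Load a JSON labels file, or return None if it is optional and missing."""
    path = Path(path)
    if not path.exists():
        if require:
            # Assumption: the real reader may raise a different error type.
            raise FileNotFoundError(f"Labels file not found: {path}")
        return None
    with path.open("r", encoding="utf8") as file_:
        return json.load(file_)


with tempfile.TemporaryDirectory() as tmp_dir:
    labels_path = Path(tmp_dir) / "ner.json"
    missing = read_labels_sketch(labels_path)  # file does not exist yet
    labels_path.write_text(json.dumps(["ORG", "PERSON"]), encoding="utf8")
    labels = read_labels_sketch(labels_path, require=True)
```

When the reader returns `None`, the labels are extracted from the data during
initialization instead, as described in the table above.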

### Corpus readers {#corpus-readers source="spacy/training/corpus.py" new="3"}

Corpus readers are registered functions that load data and return a function
that takes the current `nlp` object and yields [`Example`](/api/example) objects
that can be used for [training](/usage/training) and
[pretraining](/usage/embeddings-transformers#pretraining). You can replace the
built-in readers with your own registered function in the
[`@readers` registry](/api/top-level#registry) to customize the data loading and
streaming.

#### spacy.Corpus.v1 {#corpus tag="registered function"}

The `Corpus` reader manages annotated corpora and can be used for training and
development datasets in the [DocBin](/api/docbin) (`.spacy`) format. Also see
the [`Corpus`](/api/corpus) class.

> #### Example config
>
> ```ini
> [paths]
> train = "corpus/train.spacy"
>
> [corpora.train]
> @readers = "spacy.Corpus.v1"
> path = ${paths.train}
> gold_preproc = false
> max_length = 0
> limit = 0
> ```

| Name            | Description                                                                                                                                                                                                                                                                              |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`          | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Union[str, Path]~~                                                                                                                                        |
| `gold_preproc`  | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~                                                                                                                                 |
| `max_length`    | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~                                                                                                                                      |
| `limit`         | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~                                                                                                                                                                                          |
| `augmenter`     | Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart-quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ |
| **CREATES**     | The corpus reader. ~~Corpus~~                                                                                                                                                                                                                                                            |

#### spacy.JsonlCorpus.v1 {#jsonlcorpus tag="registered function"}

Create [`Example`](/api/example) objects from a JSONL (newline-delimited JSON)
file of texts keyed by `"text"`. Can be used to read the raw text corpus for
language model [pretraining](/usage/embeddings-transformers#pretraining) from a
JSONL file. Also see the [`JsonlCorpus`](/api/corpus#jsonlcorpus) class.

> #### Example config
>
> ```ini
> [paths]
> pretrain = "corpus/raw_text.jsonl"
>
> [corpora.pretrain]
> @readers = "spacy.JsonlCorpus.v1"
> path = ${paths.pretrain}
> min_length = 0
> max_length = 0
> limit = 0
> ```

| Name         | Description                                                                                                                      |
| ------------ | -------------------------------------------------------------------------------------------------------------------------------- |
| `path`       | The directory or filename to read from. Expects newline-delimited JSON with a key `"text"` for each record. ~~Union[str, Path]~~ |
| `min_length` | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~       |
| `max_length` | Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~        |
| `limit`      | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~                                  |
| **CREATES**  | The corpus reader. ~~JsonlCorpus~~                                                                                               |
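
The length and limit filters can be pictured with a small stand-alone sketch.
It is not the `JsonlCorpus` implementation: the function `iter_jsonl_texts` is
hypothetical, whitespace splitting stands in for the pipeline's tokenizer, and
counting only yielded records towards `limit` is an assumption.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_jsonl_texts(
    lines: Iterable[str], min_length: int = 0, max_length: int = 0, limit: int = 0
) -> Iterator[str]:
    """Yield the "text" value of each JSONL record that passes the filters."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record: Dict[str, str] = json.loads(line)
        # Whitespace split stands in for the tokenizer in this sketch.
        n_tokens = len(record["text"].split())
        if min_length and n_tokens < min_length:
            continue  # too short
        if max_length and n_tokens > max_length:
            continue  # too long
        yield record["text"]
        count += 1
        if limit and count >= limit:
            break


raw = [
    '{"text": "Hi"}',
    '{"text": "This is a short sentence."}',
    '{"text": "This one is a little bit longer than the others in the file."}',
    '{"text": "Another short one here."}',
]
kept = list(iter_jsonl_texts(raw, min_length=2, max_length=6))
first_only = list(iter_jsonl_texts(raw, limit=1))
```

As in the table above, a value of `0` disables the corresponding filter.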

## Batchers {#batchers source="spacy/training/batchers.py" new="3"}

A data batcher implements a batching strategy that essentially turns a stream of
items into a stream of batches, with each batch consisting of one item or a list
of items. During training, the models update their weights after processing one
batch at a time. Typical batching strategies include presenting the training
data as a stream of batches with similar sizes, or with increasing batch sizes.
See the Thinc documentation on
[`schedules`](https://thinc.ai/docs/api-schedules) for a few standard examples.

Instead of using one of the built-in batchers listed here, you can also
[implement your own](/usage/training#custom-code-readers-batchers), which may or
may not use a custom schedule.
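
As a rough illustration of "increasing batch sizes", the generator below mimics
the idea behind a compounding schedule: each value is the previous one
multiplied by a constant factor, capped at an upper bound. The name
`compounding_sketch` is hypothetical and the sketch only handles growing
schedules; use the schedules provided by Thinc in real configs.

```python
from itertools import islice
from typing import Iterator


def compounding_sketch(start: float, stop: float, compound: float) -> Iterator[float]:
    """Yield start, start * compound, start * compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# The first six batch sizes of a schedule that doubles from 100 up to 1000.
sizes = [int(size) for size in islice(compounding_sketch(100, 1000, 2.0), 6)]
```

A batcher that accepts a schedule draws one value per batch, so early batches
are small and later batches grow until they reach the cap.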

### spacy.batch_by_words.v1 {#batch_by_words tag="registered function"}

Create minibatches of roughly a given number of words. If any examples are
longer than the specified batch length, they will appear in a batch by
themselves, or be discarded if `discard_oversize` is set to `True`. The argument
`seqs` can be a list of strings, [`Doc`](/api/doc) objects or
[`Example`](/api/example) objects.

> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "spacy.batch_by_words.v1"
> size = 100
> tolerance = 0.2
> discard_oversize = false
> get_length = null
> ```

| Name               | Description                                                                                                                                                                             |
| ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `seqs`             | The sequences to minibatch. ~~Iterable[Any]~~                                                                                                                                           |
| `size`             | The target number of words per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ |
| `tolerance`        | What percentage of the size to allow batches to exceed. ~~float~~                                                                                                                       |
| `discard_oversize` | Whether to discard sequences that by themselves exceed the tolerated size. ~~bool~~                                                                                                     |
| `get_length`       | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~                                 |
| **CREATES**        | The batcher that takes an iterable of items and returns batches. ~~Callable[[Iterable[Any]], Iterable[List[Any]]]~~                                                                     |
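
The interplay of `size`, `tolerance` and `discard_oversize` can be sketched as
follows. This is a simplified greedy version written for illustration, not the
built-in batcher: the function name `batch_by_words_sketch` is hypothetical, it
only accepts a fixed integer `size`, and the exact rules for closing a batch
are assumptions.

```python
from typing import Any, Callable, Iterable, Iterator, List


def batch_by_words_sketch(
    seqs: Iterable[Any],
    size: int,
    tolerance: float = 0.2,
    discard_oversize: bool = False,
    get_length: Callable[[Any], int] = len,
) -> Iterator[List[Any]]:
    max_size = size + int(size * tolerance)  # upper bound including tolerance
    batch: List[Any] = []
    n_words = 0
    for seq in seqs:
        length = get_length(seq)
        if length > max_size:
            # Oversized item: drop it, or emit it in a batch by itself.
            if not discard_oversize:
                yield [seq]
            continue
        if batch and n_words + length > max_size:
            # Adding this item would exceed the tolerated size: close the batch.
            yield batch
            batch, n_words = [], 0
        batch.append(seq)
        n_words += length
        if n_words >= size:
            # Target reached: close the batch.
            yield batch
            batch, n_words = [], 0
    if batch:
        yield batch


# Six "documents" represented as lists of words with the given lengths.
docs = [["w"] * n for n in (3, 4, 5, 20, 2, 6)]
kept_all = [[len(d) for d in b] for b in batch_by_words_sketch(docs, size=10)]
dropped = [
    [len(d) for d in b]
    for b in batch_by_words_sketch(docs, size=10, discard_oversize=True)
]
```

With `size = 10` and `tolerance = 0.2`, a batch may hold up to 12 words, so the
20-word document either forms its own batch or is discarded.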

### spacy.batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}

> #### Example config
>
> ```ini
> [training.batcher]
> @batchers = "spacy.batch_by_sequence.v1"
> size = 32
> get_length = null
> ```

Create a batcher that creates batches of the specified size.

| Name         | Description                                                                                                                                                                             |
| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `size`       | The target number of items per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ |
| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~                                 |
| **CREATES**  | The batcher that takes an iterable of items and returns batches. ~~Callable[[Iterable[Any]], Iterable[List[Any]]]~~                                                                     |
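
Batching by number of items, with `size` given either as a fixed integer or as
a schedule of integers, might look like the sketch below. The function name
`batch_by_sequence_sketch` is hypothetical, `get_length` is left out because
this sketch never needs item lengths, and stopping when a finite schedule runs
out is an assumption.

```python
from itertools import islice, repeat
from typing import Any, Iterable, Iterator, List, Union


def batch_by_sequence_sketch(
    items: Iterable[Any], size: Union[int, Iterable[int]]
) -> Iterator[List[Any]]:
    # A plain int means a constant batch size; otherwise sizes come from the schedule.
    sizes = repeat(size) if isinstance(size, int) else iter(size)
    stream = iter(items)
    for batch_size in sizes:
        batch = list(islice(stream, int(batch_size)))
        if not batch:
            break  # the input stream is exhausted
        yield batch


fixed = list(batch_by_sequence_sketch(range(7), size=3))
growing = list(batch_by_sequence_sketch(range(7), size=[1, 2, 4, 8]))
```

The last batch can be smaller than the requested size when the stream runs out.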
-												Update docs [ci skip]

											
										
										
											2020-08-05 21:29:53 +03:00
-												Update docs [ci skip]

											
										
										
											2020-10-01 13:15:53 +03:00
+								### spacy.batch_by_padded.v1 {#batch_by_padded tag="registered function"}
-												Update docs [ci skip]

											
										
										
											2020-08-05 21:29:53 +03:00
 								> #### Example config
 								>
 								> ```ini
 								> [training.batcher]
-												Add prefix to batchers

											
										
										
											2020-09-03 18:30:41 +03:00
+								> @batchers = "spacy.batch_by_padded.v1"
-												Update docs [ci skip]

											
										
										
											2020-08-05 21:29:53 +03:00
+								> size = 100
-												Update docs [ci skip]

											
										
										
											2020-08-07 21:14:31 +03:00
+								> buffer = 256
-												Update docs [ci skip]

											
										
										
											2020-08-05 21:29:53 +03:00
+								> discard_oversize = false
 								> get_length = null
 								> ```
-												Update docs [ci skip]

											
										
										
											2020-08-07 21:14:31 +03:00
+								Minibatch a sequence by the size of padded batches that would result, with
 								sequences binned by length within a window. The padded size is defined as the
 								maximum length of sequences within the batch multiplied by the number of
 								sequences in the batch.
-												Update docs, types and API consistency

											
										
										
| Name               | Description                                                                                                                                                                                                                                 |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `size`             | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~                                          |
| `buffer`           | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. ~~int~~ |
| `discard_oversize` | Whether to discard sequences that are by themselves longer than the largest padded batch size. ~~bool~~                                                                                                                                     |
| `get_length`       | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~                                                                                     |
| **CREATES**        | The batcher that takes an iterable of items and returns batches. ~~Callable[[Iterable[Any]], Iterable[List[Any]]]~~                                                                                                                         |

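
To make the padded-size definition concrete, here is a minimal sketch in plain
Python. It is not spaCy's implementation: the helper name `padded_size` is
hypothetical and only illustrates the arithmetic described above.

```python
from typing import Sequence, Sized


def padded_size(batch: Sequence[Sized]) -> int:
    """Return the padded size: longest sequence length times batch size."""
    if not batch:
        return 0
    return max(len(seq) for seq in batch) * len(batch)


# Three sequences of lengths 2, 5 and 3 are each padded to length 5.
batch = [["a", "b"], ["a", "b", "c", "d", "e"], ["a", "b", "c"]]
print(padded_size(batch))  # 15
```

Sorting sequences by length within the buffer keeps the longest and shortest
sequence in a batch close together, so less of the padded size is wasted on
padding.
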
## Augmenters {#augmenters source="spacy/training/augment.py" new="3"}

Data augmentation is the process of applying small modifications to the training
data. It can be especially useful for punctuation and case replacement – for
example, if your corpus only uses smart quotes and you want to include
variations using regular quotes, or to make the model less sensitive to
capitalization by including a mix of capitalized and lowercase examples. See the
[usage guide](/usage/training#data-augmentation) for details and examples.

### spacy.orth_variants.v1 {#orth_variants tag="registered function"}

> #### Example config
>
> ```ini
> [corpora.train.augmenter]
> @augmenters = "spacy.orth_variants.v1"
> level = 0.1
> lower = 0.5
>
> [corpora.train.augmenter.orth_variants]
> @readers = "srsly.read_json.v1"
> path = "corpus/en_orth_variants.json"
> ```

Create a data augmentation callback that uses orth-variant replacement. The
callback can be added to a corpus or other data iterator during training. It's
especially useful for punctuation and case replacement, helping the model
generalize beyond corpora that don't have smart quotes, or that only have smart
quotes.

| Name            | Description                                                                                                                                                                                                                                                                                               |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `level`         | The percentage of texts that will be augmented. ~~float~~                                                                                                                                                                                                                                                 |
| `lower`         | The percentage of texts that will be lowercased. ~~float~~                                                                                                                                                                                                                                                |
| `orth_variants` | A dictionary containing the single and paired orth variants. Typically loaded from a JSON file. See [`en_orth_variants.json`](https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_orth_variants.json) for an example. ~~Dict[str, Dict[List[Union[str, List[str]]]]]~~ |
| **CREATES**     | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~                                                                                                                              |

### spacy.lower_case.v1 {#lower_case tag="registered function"}

> #### Example config
>
> ```ini
> [corpora.train.augmenter]
> @augmenters = "spacy.lower_case.v1"
> level = 0.3
> ```

Create a data augmentation callback that lowercases documents. The callback can
be added to a corpus or other data iterator during training. It's especially
useful for making the model less sensitive to capitalization.

| Name        | Description                                                                                                                                                                  |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `level`     | The percentage of texts that will be augmented. ~~float~~                                                                                                                    |
| **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |

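
The effect of `level` can be sketched in plain Python. The snippet below is an
illustrative assumption that operates on raw strings rather than `Example`
objects, so it is not spaCy's implementation. The function name
`lower_case_texts` and the `seed` parameter are hypothetical.

```python
import random
from typing import Iterable, Iterator


def lower_case_texts(
    texts: Iterable[str], level: float, seed: int = 0
) -> Iterator[str]:
    """Yield each text, lowercased with probability `level`."""
    rng = random.Random(seed)
    for text in texts:
        # rng.random() is in [0.0, 1.0), so level=1.0 always lowercases
        # and level=0.0 never does.
        if rng.random() < level:
            yield text.lower()
        else:
            yield text


texts = ["I like London.", "Berlin is big."]
print(list(lower_case_texts(texts, level=1.0)))
print(list(lower_case_texts(texts, level=0.0)))
```
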
## Training data and alignment {#gold source="spacy/training"}

### training.offsets_to_biluo_tags {#offsets_to_biluo_tags tag="function"}

Encode labelled spans into per-token tags, using the
[BILUO scheme](/usage/linguistic-features#accessing-ner) (Begin, In, Last, Unit,
Out). Returns a list of strings, describing the tags. Each tag string will be in
the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of
`"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
don't align with the tokenization in the `Doc` object. The training algorithm
will view these as missing values. `O` denotes a non-entity token. `B` denotes
the beginning of a multi-token entity, `I` the inside of an entity of three or
more tokens, and `L` the end of an entity of two or more tokens. `U` denotes a
single-token entity.

<Infobox title="Changed in v3.0" variant="warning" id="biluo_tags_from_offsets">

This method was previously available as `spacy.gold.biluo_tags_from_offsets`.

</Infobox>

> #### Example
>
> ```python
> from spacy.training import offsets_to_biluo_tags
>
> doc = nlp("I like London.")
> entities = [(7, 13, "LOC")]
> tags = offsets_to_biluo_tags(doc, entities)
> assert tags == ["O", "O", "U-LOC", "O"]
> ```

| Name        | Description                                                                                                                                                                                |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc`       | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. ~~Doc~~                                                             |
| `entities`  | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. ~~List[Tuple[int, int, Union[str, int]]]~~ |
| `missing`   | The label used for missing values, e.g. if tokenization doesn't align with the entity offsets. Defaults to `"O"`. ~~str~~                                                                  |
| **RETURNS** | A list of strings, describing the [BILUO](/usage/linguistic-features#accessing-ner) tags. ~~List[str]~~                                                                                    |

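
The tagging rules described above can be sketched without spaCy. The function
below is a simplified illustration, not the library's implementation: it takes
token boundaries as plain `(start, end)` character spans instead of a `Doc`,
and the name `offsets_to_biluo` is hypothetical.

```python
from typing import List, Tuple


def offsets_to_biluo(
    token_spans: List[Tuple[int, int]],
    entities: List[Tuple[int, int, str]],
) -> List[str]:
    """Tag tokens, given as character spans, with BILUO tags."""
    tags = ["O"] * len(token_spans)
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    for start, end, label in entities:
        first, last = starts.get(start), ends.get(end)
        if first is None or last is None:
            # Entity boundaries don't match token boundaries: mark every
            # overlapping token as misaligned.
            for i, (tok_start, tok_end) in enumerate(token_spans):
                if tok_start < end and tok_end > start:
                    tags[i] = "-"
        elif first == last:
            tags[first] = f"U-{label}"
        else:
            tags[first] = f"B-{label}"
            for i in range(first + 1, last):
                tags[i] = f"I-{label}"
            tags[last] = f"L-{label}"
    return tags


# Token spans for "I like London." -> I, like, London, .
spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(offsets_to_biluo(spans, [(7, 13, "LOC")]))  # ['O', 'O', 'U-LOC', 'O']
print(offsets_to_biluo(spans, [(7, 12, "LOC")]))  # ['O', 'O', '-', 'O']
```
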
### training.biluo_tags_to_offsets {#biluo_tags_to_offsets tag="function"}

Encode per-token tags following the
[BILUO scheme](/usage/linguistic-features#accessing-ner) into entity offsets.

<Infobox title="Changed in v3.0" variant="warning" id="offsets_from_biluo_tags">

This method was previously available as `spacy.gold.offsets_from_biluo_tags`.

</Infobox>

> #### Example
>
> ```python
> from spacy.training import biluo_tags_to_offsets
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> entities = biluo_tags_to_offsets(doc, tags)
> assert entities == [(7, 13, "LOC")]
> ```

| Name        | Description                                                                                                                                                                                                                                                  |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc`       | The document that the BILUO tags refer to. ~~Doc~~                                                                                                                                                                                                           |
| `tags`      | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be in the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
| **RETURNS** | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. ~~List[Tuple[int, int, str]]~~                                                                                 |

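
The reverse conversion can be sketched in the same way. This is a simplified
illustration rather than spaCy's implementation: it assumes a well-formed tag
sequence, takes token boundaries as plain `(start, end)` character spans
instead of a `Doc`, and uses the hypothetical name `biluo_to_offsets`.

```python
from typing import List, Optional, Tuple


def biluo_to_offsets(
    token_spans: List[Tuple[int, int]], tags: List[str]
) -> List[Tuple[int, int, str]]:
    """Convert per-token BILUO tags into (start, end, label) triples."""
    entities = []
    start: Optional[int] = None
    for (tok_start, tok_end), tag in zip(token_spans, tags):
        if tag in ("", "O", "-"):
            continue  # non-entity, missing or misaligned token
        action, _, label = tag.partition("-")
        if action == "U":
            entities.append((tok_start, tok_end, label))
        elif action == "B":
            start = tok_start
        elif action == "L" and start is not None:
            entities.append((start, tok_end, label))
            start = None
        # "I" tokens lie inside an open entity and need no action.
    return entities


# Token spans for "I like London." -> I, like, London, .
spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(biluo_to_offsets(spans, ["O", "O", "U-LOC", "O"]))  # [(7, 13, 'LOC')]
```
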
### training.biluo_tags_to_spans {#biluo_tags_to_spans tag="function" new="2.1"}

Encode per-token tags following the
[BILUO scheme](/usage/linguistic-features#accessing-ner) into
[`Span`](/api/span) objects. This can be used to create entity spans from
token-based tags, e.g. to overwrite the `doc.ents`.

<Infobox title="Changed in v3.0" variant="warning" id="spans_from_biluo_tags">

This method was previously available as `spacy.gold.spans_from_biluo_tags`.

</Infobox>

> #### Example
>
> ```python
> from spacy.training import biluo_tags_to_spans
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> doc.ents = biluo_tags_to_spans(doc, tags)
> ```

| Name        | Description                                                                                                                                                                                                                                                  |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc`       | The document that the BILUO tags refer to. ~~Doc~~                                                                                                                                                                                                           |
| `tags`      | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be in the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
| **RETURNS** | A sequence of `Span` objects with added entity labels. ~~List[Span]~~                                                                                                                                                                                        |

## Utility functions {#util source="spacy/util.py"}

spaCy comes with a small collection of utility functions located in
[`spacy/util.py`](%%GITHUB_SPACY/spacy/util.py). Because utility functions are
mostly intended for **internal use within spaCy**, their behavior may change
with future releases. The functions documented on this page should be safe to
use and we'll try to ensure backwards compatibility. However, we recommend
having additional tests in place if your application depends on any of spaCy's
utilities.

### util.get_lang_class {#util.get_lang_class tag="function"}

Import and load a `Language` class. Allows lazy-loading
[language data](/usage/linguistic-features#language-data) and importing
languages using the two-letter language code. To add a language code for a
custom language class, you can register it using the
[`@registry.languages`](/api/top-level#registry) decorator.

> #### Example
>
> ```python
> for lang_id in ["en", "de"]:
>     lang_class = util.get_lang_class(lang_id)
>     lang = lang_class()
> ```

| Name        | Description                                    |
| ----------- | ---------------------------------------------- |
| `lang`      | Two-letter language code, e.g. `"en"`. ~~str~~ |
| **RETURNS** | The respective subclass. ~~Language~~          |

### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}

Check whether a `Language` subclass is already loaded. `Language` subclasses are
loaded lazily to avoid expensive setup code associated with the language data.

> #### Example
>
> ```python
> lang_cls = util.get_lang_class("en")
> assert util.lang_class_is_loaded("en") is True
> assert util.lang_class_is_loaded("de") is False
> ```

| Name        | Description                                    |
| ----------- | ---------------------------------------------- |
| `name`      | Two-letter language code, e.g. `"en"`. ~~str~~ |
| **RETURNS** | Whether the class has been loaded. ~~bool~~    |

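
The lazy-loading pattern behind these two functions can be sketched
generically. The snippet below is not spaCy's code: the names `get_class`,
`class_is_loaded` and `_LAZY_TARGETS` are hypothetical, and standard-library
classes stand in for `Language` subclasses so that the example runs without
spaCy installed.

```python
import importlib
from typing import Dict, Tuple

# Hypothetical mapping from short codes to (module, attribute) pairs.
_LAZY_TARGETS: Dict[str, Tuple[str, str]] = {
    "dec": ("decimal", "Decimal"),
    "frac": ("fractions", "Fraction"),
}
# Cache of classes that have been requested so far.
_LOADED: Dict[str, type] = {}


def get_class(code: str) -> type:
    """Import and cache the class registered under `code` on first use."""
    if code not in _LOADED:
        module_name, attr = _LAZY_TARGETS[code]
        module = importlib.import_module(module_name)
        _LOADED[code] = getattr(module, attr)
    return _LOADED[code]


def class_is_loaded(code: str) -> bool:
    """Check whether the class for `code` has already been requested."""
    return code in _LOADED


cls = get_class("dec")
print(class_is_loaded("dec"), class_is_loaded("frac"))  # True False
```
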
### util.load_model {#util.load_model tag="function" new="2"}

											2020-09-03 14:13:03 +03:00
+								Load a pipeline from a package or data path. If called with a string name, spaCy
 								will assume the pipeline is a Python package and import and call its `load()`
-												Start updating website for v3 [ci skip]

											
										
										
											2020-07-01 22:26:39 +03:00
+								method. If called with a path, spaCy will assume it's a data directory, read the
-												Update docs and consistency [ci skip]

											
										
										
											2020-08-18 15:39:40 +03:00
+								language and pipeline settings from the [`config.cfg`](/api/data-formats#config)
 								and create a `Language` object. The model data will then be loaded in via
-												Update docs, types and API consistency

											
										
										
											2020-08-17 17:45:24 +03:00
+								[`Language.from_disk`](/api/language#from_disk).

> #### Example
>
> ```python
> from spacy import util
> nlp = util.load_model("en_core_web_sm")
> nlp = util.load_model("en_core_web_sm", exclude=["ner"])
> nlp = util.load_model("/path/to/data")
> ```

| Name                                 | Description                                                                                                                                                                                                                                      |
| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `name`                               | Package name or path. ~~str~~                                                                                                                                                                                                                    |
| `vocab` <Tag variant="new">3</Tag>   | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~                                                                                                            |
| `disable`                            | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [`nlp.enable_pipe`](/api/language#enable_pipe). ~~List[str]~~ |
| `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~                                                                                                             |
| `config` <Tag variant="new">3</Tag>  | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~                                                                                                   |
| **RETURNS**                          | `Language` object with the loaded pipeline. ~~Language~~                                                                                                                                                                                         |
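
The difference between `disable` and `exclude` can be illustrated with a small sketch that does not use spaCy at all. The `build_pipeline` helper and the component names are hypothetical and only model the behavior described in the table above. They are not how spaCy implements loading.

```python
from typing import Iterable, List, Tuple


def build_pipeline(
    available: Iterable[str],
    disable: Iterable[str] = (),
    exclude: Iterable[str] = (),
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (loaded, active) component names."""
    excluded = set(exclude)
    disabled = set(disable)
    # Excluded components are never loaded.
    loaded = [name for name in available if name not in excluded]
    # Disabled components are loaded but skipped when the pipeline runs.
    active = [name for name in loaded if name not in disabled]
    return loaded, active


loaded, active = build_pipeline(
    ["tagger", "parser", "ner"], disable=["parser"], exclude=["ner"]
)
print(loaded)  # ['tagger', 'parser']
print(active)  # ['tagger']
```

In this toy model, a disabled component could be switched back on later because it is still in `loaded`, whereas an excluded one would have to be loaded again from scratch.
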

### util.load_model_from_init_py {#util.load_model_from_init_py tag="function" new="2"}

A helper function to use in the `load()` method of a pipeline package's
[`__init__.py`](https://github.com/explosion/spacy-models/tree/master/template/model/xx_model_name/__init__.py).

> #### Example
>
> ```python
> from spacy.util import load_model_from_init_py
>
> def load(**overrides):
>     return load_model_from_init_py(__file__, **overrides)
> ```

| Name                                 | Description                                                                                                                                                                                                                                      |
| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `init_file`                          | Path to package's `__init__.py`, i.e. `__file__`. ~~Union[str, Path]~~                                                                                                                                                                           |
| `vocab` <Tag variant="new">3</Tag>   | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~                                                                                                            |
| `disable`                            | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [`nlp.enable_pipe`](/api/language#enable_pipe). ~~List[str]~~ |
| `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~                                                                                                             |
| `config` <Tag variant="new">3</Tag>  | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~                                                                                                   |
| **RETURNS**                          | `Language` object with the loaded pipeline. ~~Language~~                                                                                                                                                                                         |
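
As a rough illustration of what such a helper has to do, the sketch below locates a data directory next to a package's `__init__.py`. The `resolve_data_path` function, the `meta.json` placed beside `__init__.py` and the `{lang}_{name}-{version}` directory naming are all assumptions made for this example. They are not a description of spaCy's internals.

```python
import json
import tempfile
from pathlib import Path


def resolve_data_path(init_file):
    """Hypothetical helper: find the data directory shipped with a package."""
    package_dir = Path(init_file).parent
    # Assumption: the package keeps a meta.json next to __init__.py ...
    meta = json.loads((package_dir / "meta.json").read_text(encoding="utf8"))
    # ... and a data directory named after language, name and version.
    data_dir = f"{meta['lang']}_{meta['name']}-{meta['version']}"
    return package_dir / data_dir


with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "xx_model_name"
    package_dir.mkdir()
    meta = {"lang": "xx", "name": "model_name", "version": "0.0.0"}
    (package_dir / "meta.json").write_text(json.dumps(meta), encoding="utf8")
    data_path = resolve_data_path(package_dir / "__init__.py")
    print(data_path.name)  # xx_model_name-0.0.0
```
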

### util.load_config {#util.load_config tag="function" new="3"}

Load a pipeline's [`config.cfg`](/api/data-formats#config) from a file path. The
config typically includes details about the components and how they're created,
as well as all training settings and hyperparameters.

> #### Example
>
> ```python
> from spacy import util
> config = util.load_config("/path/to/config.cfg")
> print(config.to_str())
> ```

| Name          | Description                                                                                                                                                                 |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`        | Path to the pipeline's `config.cfg`. ~~Union[str, Path]~~                                                                                                                   |
| `overrides`   | Optional config overrides to replace in loaded config. Can be provided as nested dict, or as flat dict with keys in dot notation, e.g. `"nlp.pipeline"`. ~~Dict[str, Any]~~ |
| `interpolate` | Whether to interpolate the config and replace variables like `${paths.train}` with their values. Defaults to `False`. ~~bool~~                                              |
| **RETURNS**   | The pipeline's config. ~~Config~~                                                                                                                                           |
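
To make the `overrides` and `interpolate` arguments more concrete, here is a minimal sketch using plain dictionaries. The helpers `apply_overrides` and `interpolate` are hypothetical, and the sketch is far simpler than the real config system: it only shows what dot-notation keys and `${section.key}` variables mean.

```python
import re
from typing import Any, Dict, Optional


def apply_overrides(config: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: apply flat dot-notation overrides to a nested dict."""
    for dotted_key, value in overrides.items():
        *parents, last = dotted_key.split(".")
        node = config
        for part in parents:
            # Walk down the sections, creating them if needed.
            node = node.setdefault(part, {})
        node[last] = value
    return config


def interpolate(config: Dict[str, Any], root: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical helper: replace ${section.key} variables in string values."""
    root = config if root is None else root

    def lookup(match):
        value = root
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    result = {}
    for key, value in config.items():
        if isinstance(value, dict):
            result[key] = interpolate(value, root)
        elif isinstance(value, str):
            result[key] = re.sub(r"\$\{([^}]+)\}", lookup, value)
        else:
            result[key] = value
    return result


config = {
    "paths": {"train": "corpus/train.spacy"},
    "corpora": {"train": {"path": "${paths.train}"}},
}
config = apply_overrides(config, {"paths.train": "data/train.spacy"})
print(config["corpora"]["train"]["path"])  # ${paths.train}
resolved = interpolate(config)
print(resolved["corpora"]["train"]["path"])  # data/train.spacy
```

Leaving variables uninterpolated, as in the first `print`, keeps the references intact so that a later override of `paths.train` still takes effect everywhere it is used.
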

### util.load_meta {#util.load_meta tag="function" new="3"}

Get a pipeline's [`meta.json`](/api/data-formats#meta) from a file path and
validate its contents. The meta typically includes details about author,
licensing, data sources and version.

> #### Example
>
> ```python
> from spacy import util
> meta = util.load_meta("/path/to/meta.json")
> ```

| Name        | Description                                              |
| ----------- | -------------------------------------------------------- |
| `path`      | Path to the pipeline's `meta.json`. ~~Union[str, Path]~~ |
| **RETURNS** | The pipeline's meta data. ~~Dict[str, Any]~~             |
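
The following sketch shows one way "load and validate" could look for a `meta.json` file. The `read_meta` helper and the choice of `lang`, `name` and `version` as required keys are assumptions for illustration only. The checks spaCy actually performs may differ.

```python
import json
import tempfile
from pathlib import Path

# Assumption for this sketch: these keys must be present and non-empty.
REQUIRED_KEYS = ("lang", "name", "version")


def read_meta(path):
    """Hypothetical helper: read a meta.json and check a few required keys."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Can't find meta file: {path}")
    meta = json.loads(path.read_text(encoding="utf8"))
    missing = [key for key in REQUIRED_KEYS if not meta.get(key)]
    if missing:
        raise ValueError(f"Missing or empty meta settings: {missing}")
    return meta


with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "meta.json"
    meta_path.write_text(json.dumps({"lang": "en", "name": "demo"}), encoding="utf8")
    try:
        read_meta(meta_path)
        error = None
    except ValueError as err:
        error = str(err)
    print(error)  # Missing or empty meta settings: ['version']
```
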

### util.get_installed_models {#util.get_installed_models tag="function" new="3"}

List all pipeline packages installed in the current environment. This will
include any spaCy pipeline that was packaged with
[`spacy package`](/api/cli#package). Under the hood, pipeline packages expose a
Python entry point that spaCy can check, without having to load the `nlp`
object.

> #### Example
>
> ```python
> from spacy import util
> names = util.get_installed_models()
> ```

| Name        | Description                                                                           |
| ----------- | ------------------------------------------------------------------------------------- |
| **RETURNS** | The string names of the pipelines installed in the current environment. ~~List[str]~~ |
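
The entry point mechanism mentioned above can be explored with the standard library alone. This is a sketch, not spaCy's implementation: `list_entry_point_names` is a hypothetical helper, and the group name `"spacy_models"` is an assumption about how pipeline packages register themselves.

```python
from importlib.metadata import entry_points


def list_entry_point_names(group):
    """Hypothetical helper: names of all entry points registered under a group."""
    all_entry_points = entry_points()
    if hasattr(all_entry_points, "select"):
        # Python 3.10+ exposes a selectable collection.
        selected = all_entry_points.select(group=group)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = all_entry_points.get(group, [])
    return sorted({entry_point.name for entry_point in selected})


# Assumption: pipeline packages register themselves under "spacy_models".
names = list_entry_point_names("spacy_models")
print(names)
```

Because only package metadata is read, nothing is imported and no `nlp` object is created, which is what keeps this kind of lookup cheap.
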

### util.is_package {#util.is_package tag="function"}

Check if a string maps to a package installed via pip. Mainly used to validate
[pipeline packages](/usage/models).

> #### Example
>
> ```python
> from spacy import util
> util.is_package("en_core_web_sm") # True
> util.is_package("xyz") # False
> ```

| Name        | Description                                           |
| ----------- | ----------------------------------------------------- |
| `name`      | Name of package. ~~str~~                              |
| **RETURNS** | `True` if installed package, `False` if not. ~~bool~~ |
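
A check of this kind can be approximated with the standard library's `importlib.metadata`. The `is_installed` helper below is a hypothetical stand-in used for illustration and may not match what `util.is_package` does internally.

```python
from importlib.metadata import PackageNotFoundError, distribution


def is_installed(name):
    """Hypothetical helper: True if a distribution with this name is installed."""
    try:
        distribution(name)
    except PackageNotFoundError:
        return False
    return True


print(is_installed("surely-not-an-installed-package-xyz"))  # False
```
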
 								### util.get_package_path {#util.get_package_path tag="function" new="2"}
 								Get path to an installed package. Mainly used to resolve the location of
 								[pipeline packages](/usage/models). Currently imports the package to find its
 								path.
 								> #### Example
 								>
 								> ```python
 								> util.get_package_path("en_core_web_sm")
 								> # /usr/lib/python3.6/site-packages/en_core_web_sm
 								> ```
 								| Name           | Description                                  |
 								| -------------- | -------------------------------------------- |
 								| `package_name` | Name of installed package. ~~str~~           |
 								| **RETURNS**    | Path to pipeline package directory. ~~Path~~ |
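
Resolving a path by importing the package can be sketched with the standard library as follows. The helper name `package_dir` is hypothetical, and the stdlib `json` package stands in for a pipeline package purely for illustration.

```python
import importlib
from pathlib import Path


def package_dir(package_name: str) -> Path:
    """Import `package_name` and return the directory that contains it (hypothetical helper)."""
    module = importlib.import_module(package_name)
    # For a package, `__file__` points at its `__init__.py`, so the parent is the package directory.
    return Path(module.__file__).parent


json_dir = package_dir("json")
```

Because the package is imported, any import-time side effects of the package run as part of the lookup.
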
 								### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"}
 								Check if user is running spaCy from a [Jupyter](https://jupyter.org) notebook by
 								detecting the IPython kernel. Mainly used for the
 								[`displacy`](/api/top-level#displacy) visualizer.
 								> #### Example
 								>
 								> ```python
 								> html = "<h1>Hello world!</h1>"
 								> if util.is_in_jupyter():
 								>     from IPython.core.display import display, HTML
 								>     display(HTML(html))
 								> ```
 								| Name        | Description                                    |
 								| ----------- | ---------------------------------------------- |
 								| **RETURNS** | `True` if in Jupyter, `False` if not. ~~bool~~ |
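
Detecting the IPython kernel could look roughly like the sketch below. The helper name `running_in_jupyter` is hypothetical, and the shell class name checked here is an assumption about how Jupyter kernels identify themselves, not a documented part of spaCy.

```python
def running_in_jupyter() -> bool:
    """Best-effort check for a Jupyter kernel (hypothetical helper)."""
    try:
        # `get_ipython` is only defined when IPython has set up an interactive shell.
        shell_name = get_ipython().__class__.__name__  # noqa: F821
    except NameError:
        return False
    # Assumption: Jupyter kernels use IPython's ZMQ-based shell class.
    return shell_name == "ZMQInteractiveShell"
```
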
 								### util.compile_prefix_regex {#util.compile_prefix_regex tag="function"}
 								Compile a sequence of prefix rules into a regex object.
 								> #### Example
 								>
 								> ```python
 								> prefixes = ("§", "%", "=", r"\+")
 								> prefix_regex = util.compile_prefix_regex(prefixes)
 								> nlp.tokenizer.prefix_search = prefix_regex.search
 								> ```
 								| Name        | Description                                                                                                                                 |
 								| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
 								| `entries`   | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
 								| **RETURNS** | The regex object to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~                                         |
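
To see what such a compiled prefix pattern does, here is a minimal sketch using only the standard `re` module. It assumes that each rule is anchored to the start of the string and that the rules are joined as alternatives. That is an assumption for illustration and may differ from what `util.compile_prefix_regex` builds internally.

```python
import re

prefixes = ("§", "%", "=", r"\+")
# Assumption: each rule is anchored with "^" and the rules are joined with "|".
prefix_regex = re.compile("|".join("^" + rule for rule in prefixes))

at_start = prefix_regex.search("%value")  # "%" opens the string, so this matches
elsewhere = prefix_regex.search("value%")  # "%" is not at the start, so this is None
```
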
 								### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"}
 								Compile a sequence of suffix rules into a regex object.
 								> #### Example
 								>
 								> ```python
 								> suffixes = ("'s", "'S", r"(?<=[0-9])\+")
 								> suffix_regex = util.compile_suffix_regex(suffixes)
 								> nlp.tokenizer.suffix_search = suffix_regex.search
 								> ```
 								| Name        | Description                                                                                                                                 |
 								| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
 								| `entries`   | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
 								| **RETURNS** | The regex object to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~                                         |
 								### util.compile_infix_regex {#util.compile_infix_regex tag="function"}
 								Compile a sequence of infix rules into a regex object.
 								> #### Example
 								>
 								> ```python
 								> infixes = ("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])")
 								> infix_regex = util.compile_infix_regex(infixes)
 								> nlp.tokenizer.infix_finditer = infix_regex.finditer
 								> ```
 								| Name        | Description                                                                                                                               |
 								| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
 								| `entries`   | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
 								| **RETURNS** | The regex object to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~                                      |
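
Unlike prefixes and suffixes, infix rules are applied with `finditer`, so every match inside a string is reported. The sketch below uses plain `re` with a subset of the rules from the example and assumes the rules are simply joined as alternatives. The helper name `infix_spans` is hypothetical.

```python
import re

infixes = ("…", "-", r"(?<=[0-9])[+\-\*^](?=[0-9-])")
# Assumption: the rules are joined with "|" without extra anchoring.
infix_regex = re.compile("|".join(infixes))


def infix_spans(text):
    """Return the (start, end) offsets of every infix match in `text`."""
    return [match.span() for match in infix_regex.finditer(text)]


hyphen = infix_spans("well-known")  # the hyphen rule matches anywhere
digits = infix_spans("1+2")         # "+" only matches between digits
letters = infix_spans("a+b")        # no digit before "+", so nothing matches
```
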
 								### util.minibatch {#util.minibatch tag="function" new="2"}
 								Iterate over batches of items. `size` may be an iterator, so that the batch size
 								can vary on each step.
 								> #### Example
 								>
 								> ```python
 								> batches = minibatch(train_data)
 								> for batch in batches:
 								>     nlp.update(batch)
 								> ```
 								| Name       | Description                                      |
 								| ---------- | ------------------------------------------------ |
 								| `items`    | The items to batch up. ~~Iterable[Any]~~         |
 								| `size`     | The batch size(s). ~~Union[int, Sequence[int]]~~ |
 								| **YIELDS** | The batches.                                     |
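
The effect of passing an iterator as `size` is easiest to see with a small stand-in. The function `minibatch_sketch` below is a simplified pure-Python illustration of the batching idea, not spaCy's implementation, and its name is hypothetical.

```python
import itertools


def minibatch_sketch(items, size):
    """Yield lists of items; `size` is an int or an iterable of ints (simplified sketch)."""
    sizes = itertools.repeat(size) if isinstance(size, int) else iter(size)
    items = iter(items)
    for batch_size in sizes:
        batch = list(itertools.islice(items, int(batch_size)))
        if not batch:
            # The items are exhausted, so stop yielding batches.
            return
        yield batch


fixed = list(minibatch_sketch(range(7), 3))
growing = list(minibatch_sketch(range(7), itertools.count(1)))
```

With a fixed size, the last batch is simply shorter. With `itertools.count(1)`, each batch is one item larger than the previous one until the items run out.
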
 								### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"}
 								Filter a sequence of [`Span`](/api/span) objects and remove duplicates or
 								overlaps. Useful for creating named entities (where one token can only be part
 								of one entity) or when merging spans with
 								[`Retokenizer.merge`](/api/doc#retokenizer.merge). When spans overlap, the
 								(first) longest span is preferred over shorter spans.
 								> #### Example
 								>
 								> ```python
 								> doc = nlp("This is a sentence.")
 								> spans = [doc[0:2], doc[0:2], doc[0:4]]
 								> filtered = filter_spans(spans)
 								> ```
 								| Name        | Description                             |
 								| ----------- | --------------------------------------- |
 								| `spans`     | The spans to filter. ~~Iterable[Span]~~ |
 								| **RETURNS** | The filtered spans. ~~List[Span]~~      |
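
The "(first) longest span wins" rule can be illustrated on plain `(start, end)` token offsets. The function `filter_spans_sketch` is a hypothetical stand-in that works on tuples rather than `Span` objects, and it is a sketch of the described behavior rather than spaCy's own code.

```python
def filter_spans_sketch(spans):
    """Keep the longest non-overlapping (start, end) spans; ties go to the earlier span."""
    # Longest spans first; among equal lengths, the one that starts first.
    by_priority = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in by_priority:
        # Skip spans whose first or last token is already covered.
        if start not in seen_tokens and end - 1 not in seen_tokens:
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)


nested = filter_spans_sketch([(0, 2), (0, 2), (0, 4)])
overlapping = filter_spans_sketch([(0, 2), (1, 3), (3, 5)])
```
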
 								### util.get_words_and_spaces {#get_words_and_spaces tag="function" new="3"}
 								Given a list of words and a text, reconstruct the original tokens and return a
 								list of words and spaces that can be used to create a [`Doc`](/api/doc#init).
 								This can help recover from destructive tokenization that didn't preserve any
 								whitespace information.
 								> #### Example
 								>
 								> ```python
 								> orig_words = ["Hey", ",", "what", "'s", "up", "?"]
 								> orig_text = "Hey, what's up?"
 								> words, spaces = get_words_and_spaces(orig_words, orig_text)
 								> # ['Hey', ',', 'what', "'s", 'up', '?']
 								> # [False, True, False, True, False, False]
 								> ```
 								| Name        | Description                                                                                                                                        |
 								| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
 								| `words`     | The list of words. ~~Iterable[str]~~                                                                                                               |
 								| `text`      | The original text. ~~str~~                                                                                                                         |
 								| **RETURNS** | A list of words and a list of boolean values indicating whether the word at this position is followed by a space. ~~Tuple[List[str], List[bool]]~~ |
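
The alignment idea behind this can be sketched in a few lines. The function `words_and_spaces_sketch` is a hypothetical, simplified stand-in: it assumes the words appear in order in the text and are separated by at most a single space, so it does not cover leading, trailing or repeated whitespace.

```python
def words_and_spaces_sketch(words, text):
    """Align `words` against `text` and record whether each word is followed by a space."""
    spaces = []
    position = 0
    for word in words:
        if not text.startswith(word, position):
            raise ValueError(f"Can't align {word!r} at offset {position}")
        position += len(word)
        followed_by_space = text.startswith(" ", position)
        spaces.append(followed_by_space)
        if followed_by_space:
            position += 1
    return list(words), spaces


words, spaces = words_and_spaces_sketch(
    ["Hey", ",", "what", "'s", "up", "?"], "Hey, what's up?"
)
```
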