Merge branch 'master' into spacy.io

Ines Montani 2021-03-19 12:09:03 +11:00
commit 6db5414668
21 changed files with 254 additions and 58 deletions

View File

@@ -3,6 +3,7 @@ recursive-include spacy *.pyx *.pxd *.txt *.cfg *.jinja
 include LICENSE
 include README.md
 include pyproject.toml
+include spacy/py.typed
 recursive-exclude spacy/lang *.json
 recursive-include spacy/lang *.json.gz
 recursive-include spacy/cli *.json *.yml

examples/README.md (new file, 130 lines)
View File

@@ -0,0 +1,130 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# spaCy examples
For spaCy v3 we've converted many of the [v2 example
scripts](https://github.com/explosion/spaCy/tree/v2.3.x/examples/) into
end-to-end [spaCy projects](https://spacy.io/usage/projects) workflows. The
workflows include all the steps to go from data to packaged spaCy models.
## 🪐 Pipeline component demos
The simplest demos for training a single pipeline component are in the
[`pipelines`](https://github.com/explosion/projects/blob/v3/pipelines) category,
including:
- [`pipelines/ner_demo`](https://github.com/explosion/projects/blob/v3/pipelines/ner_demo):
Train a named entity recognizer
- [`pipelines/textcat_demo`](https://github.com/explosion/projects/blob/v3/pipelines/textcat_demo):
Train a text classifier
- [`pipelines/parser_intent_demo`](https://github.com/explosion/projects/blob/v3/pipelines/parser_intent_demo):
Train a dependency parser for custom semantics
## 🪐 Tutorials
The [`tutorials`](https://github.com/explosion/projects/blob/v3/tutorials)
category includes examples that work through specific NLP use cases end-to-end:
- [`tutorials/textcat_goemotions`](https://github.com/explosion/projects/blob/v3/tutorials/textcat_goemotions):
Train a text classifier to categorize emotions in Reddit posts
- [`tutorials/nel_emerson`](https://github.com/explosion/projects/blob/v3/tutorials/nel_emerson):
Use an entity linker to disambiguate mentions of the same name
Check out the [projects documentation](https://spacy.io/usage/projects) and
browse through the [available
projects](https://github.com/explosion/projects/)!
## 🚀 Get started with a demo project
The
[`pipelines/ner_demo`](https://github.com/explosion/projects/blob/v3/pipelines/ner_demo)
project converts the spaCy v2
[`train_ner.py`](https://github.com/explosion/spaCy/blob/v2.3.x/examples/training/train_ner.py)
demo script into a spaCy v3 project.
1. Clone the project:
```bash
python -m spacy project clone pipelines/ner_demo
```
2. Install requirements and download any data assets:
```bash
cd ner_demo
python -m pip install -r requirements.txt
python -m spacy project assets
```
3. Run the default workflow to convert, train and evaluate:
```bash
python -m spacy project run all
```
Sample output:
```none
Running workflow 'all'
================================== convert ==================================
Running command: /home/user/venv/bin/python scripts/convert.py en assets/train.json corpus/train.spacy
Running command: /home/user/venv/bin/python scripts/convert.py en assets/dev.json corpus/dev.spacy
=============================== create-config ===============================
Running command: /home/user/venv/bin/python -m spacy init config --lang en --pipeline ner configs/config.cfg --force
Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
configs/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
=================================== train ===================================
Running command: /home/user/venv/bin/python -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --training.eval_frequency 10 --training.max_steps 100 --gpu-id -1
Using CPU
=========================== Initializing pipeline ===========================
[2021-03-11 19:34:59,101] [INFO] Set up nlp object from config
[2021-03-11 19:34:59,109] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-03-11 19:34:59,113] [INFO] Created vocabulary
[2021-03-11 19:34:59,113] [INFO] Finished initializing nlp object
[2021-03-11 19:34:59,265] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
============================= Training pipeline =============================
Pipeline: ['tok2vec', 'ner']
Initial learn rate: 0.001
  E    #  LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R   SCORE
---  ---  ------------  --------  ------  ------  ------  ------
  0    0          0.00      7.90    0.00    0.00    0.00    0.00
 10   10          0.11     71.07    0.00    0.00    0.00    0.00
 20   20          0.65     22.44   50.00   50.00   50.00    0.50
 30   30          0.22      6.38   80.00   66.67  100.00    0.80
 40   40          0.00      0.00   80.00   66.67  100.00    0.80
 50   50          0.00      0.00   80.00   66.67  100.00    0.80
 60   60          0.00      0.00  100.00  100.00  100.00    1.00
 70   70          0.00      0.00  100.00  100.00  100.00    1.00
 80   80          0.00      0.00  100.00  100.00  100.00    1.00
 90   90          0.00      0.00  100.00  100.00  100.00    1.00
100  100          0.00      0.00  100.00  100.00  100.00    1.00
✔ Saved pipeline to output directory
training/model-last
```
4. Package the model:
```bash
python -m spacy project run package
```
5. Visualize the model's output with [Streamlit](https://streamlit.io):
```bash
python -m spacy project run visualize-model
```
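After training you can also load the saved pipeline directly from the output
directory and inspect its predictions. A minimal sketch, assuming you're still
in the `ner_demo` directory and using the toy example data from the original
`train_ner.py` demo:
```python
import spacy

# Load the pipeline written by the "train" step (see the sample output above)
nlp = spacy.load("training/model-last")

doc = nlp("I like London and Berlin.")
for ent in doc.ents:
    # The labels you see depend on the demo's toy training data
    print(ent.text, ent.label_)
```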

View File

@@ -0,0 +1,5 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# spaCy examples
See [examples/README.md](../README.md)

View File

@@ -1,7 +1,25 @@
 ## Examples of NER/IOB data that can be converted with `spacy convert`

-spacy JSON training files were generated with:
+To convert an IOB file to `.spacy` ([`DocBin`](https://spacy.io/api/docbin))
+for spaCy v3:
+
+```bash
+python -m spacy convert -c iob -s -n 10 -b en_core_web_sm file.iob .
+```
+
+See all the `spacy convert` options: https://spacy.io/api/cli#convert
+
+---
+
+The spaCy v2 JSON training files were generated using **spaCy v2** with:

 ```bash
 python -m spacy convert -c iob -s -n 10 -b en file.iob
 ```
+
+To convert an existing JSON training file to `.spacy` for spaCy v3, convert
+with **spaCy v3**:
+
+```bash
+python -m spacy convert file.json .
+```

View File

@@ -5,7 +5,7 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc>=8.0.0,<8.1.0",
+    "thinc>=8.0.2,<8.1.0",
     "blis>=0.4.0,<0.8.0",
     "pathy",
     "numpy>=1.15.0",

View File

@@ -2,7 +2,7 @@
 spacy-legacy>=3.0.0,<3.1.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.0,<8.1.0
+thinc>=8.0.2,<8.1.0
 blis>=0.4.0,<0.8.0
 ml_datasets>=0.2.0,<0.3.0
 murmurhash>=0.28.0,<1.1.0

View File

@@ -34,14 +34,14 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc>=8.0.0,<8.1.0
+    thinc>=8.0.2,<8.1.0
 install_requires =
     # Our libraries
     spacy-legacy>=3.0.0,<3.1.0
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc>=8.0.0,<8.1.0
+    thinc>=8.0.2,<8.1.0
     blis>=0.4.0,<0.8.0
     wasabi>=0.8.1,<1.1.0
     srsly>=2.4.0,<3.0.0

View File

@@ -28,6 +28,8 @@ if sys.maxunicode == 65535:

 def load(
     name: Union[str, Path],
+    *,
+    vocab: Union[Vocab, bool] = True,
     disable: Iterable[str] = util.SimpleFrozenList(),
     exclude: Iterable[str] = util.SimpleFrozenList(),
     config: Union[Dict[str, Any], Config] = util.SimpleFrozenDict(),
@@ -35,6 +37,7 @@ def load(
     """Load a spaCy model from an installed package or a local path.

     name (str): Package name or model path.
+    vocab (Vocab): A Vocab object. If True, a vocab is created.
     disable (Iterable[str]): Names of pipeline components to disable. Disabled
         pipes will be loaded but they won't be run unless you explicitly
         enable them by calling nlp.enable_pipe.
@@ -44,7 +47,9 @@ def load(
         keyed by section values in dot notation.
     RETURNS (Language): The loaded nlp object.
     """
-    return util.load_model(name, disable=disable, exclude=exclude, config=config)
+    return util.load_model(
+        name, vocab=vocab, disable=disable, exclude=exclude, config=config
+    )


 def blank(
@@ -52,7 +57,7 @@ def blank(
     *,
     vocab: Union[Vocab, bool] = True,
     config: Union[Dict[str, Any], Config] = util.SimpleFrozenDict(),
-    meta: Dict[str, Any] = util.SimpleFrozenDict()
+    meta: Dict[str, Any] = util.SimpleFrozenDict(),
 ) -> Language:
     """Create a blank nlp object for a given language code.

View File

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.4"
+__version__ = "3.0.5"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"

View File

@@ -20,7 +20,7 @@ def debug_config_cli(
     # fmt: off
     ctx: typer.Context,  # This is only used to read additional arguments
     config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
-    code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
+    code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
     show_funcs: bool = Opt(False, "--show-functions", "-F", help="Show an overview of all registered functions used in the config and where they come from (modules, files etc.)"),
     show_vars: bool = Opt(False, "--show-variables", "-V", help="Show an overview of all variables referenced in the config and their values. This will also reflect variables overwritten on the CLI.")
     # fmt: on

View File

@@ -39,7 +39,7 @@ def debug_data_cli(
     # fmt: off
     ctx: typer.Context,  # This is only used to read additional arguments
     config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
-    code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
+    code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
     ignore_warnings: bool = Opt(False, "--ignore-warnings", "-IW", help="Ignore warnings, only show stats and errors"),
     verbose: bool = Opt(False, "--verbose", "-V", help="Print additional information and explanations"),
     no_format: bool = Opt(False, "--no-format", "-NF", help="Don't pretty-print the results"),

View File

@@ -10,7 +10,8 @@ from jinja2 import Template

 from .. import util
 from ..language import DEFAULT_CONFIG_PRETRAIN_PATH
 from ..schemas import RecommendationSchema
-from ._util import init_cli, Arg, Opt, show_validation_error, COMMAND, string_to_list
+from ._util import init_cli, Arg, Opt, show_validation_error, COMMAND
+from ._util import string_to_list, import_code

 ROOT = Path(__file__).parent / "templates"

@@ -70,7 +71,8 @@ def init_fill_config_cli(
     base_path: Path = Arg(..., help="Base config to fill", exists=True, dir_okay=False),
     output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True),
     pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
-    diff: bool = Opt(False, "--diff", "-D", help="Print a visual diff highlighting the changes")
+    diff: bool = Opt(False, "--diff", "-D", help="Print a visual diff highlighting the changes"),
+    code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
     # fmt: on
 ):
     """
@@ -82,6 +84,7 @@ def init_fill_config_cli(
     DOCS: https://spacy.io/api/cli#init-fill-config
     """
+    import_code(code_path)
     fill_config(output_file, base_path, pretraining=pretraining, diff=diff)
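With the new `--code` option, functions registered in your own module become
importable before the config is filled, e.g.
`python -m spacy init fill-config base_config.cfg config.cfg --code functions.py`.
A sketch of a hypothetical `functions.py`:
```python
# functions.py -- hypothetical example; any registrations made here are
# resolvable while the config is being validated and filled
import spacy

@spacy.registry.misc("my_patterns.v1")
def create_patterns():
    # A value that a config section could reference via @misc
    return [{"label": "ORG", "pattern": "Explosion"}]
```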

View File

@@ -120,7 +120,7 @@ def parse_deps(orig_doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
     doc (Doc): Document do parse.
     RETURNS (dict): Generated dependency parse keyed by words and arcs.
     """
-    doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes(exclude=["user_data"]))
+    doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes(exclude=["user_data", "user_hooks"]))
     if not doc.has_annotation("DEP"):
         warnings.warn(Warnings.W005)
     if options.get("collapse_phrases", False):
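A sketch of the case this fixes: custom user hooks are plain Python functions
and can't be serialized, so the doc copy inside `parse_deps` now leaves them
out. Assuming `en_core_web_sm` is installed:
```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# A custom user hook (not serializable) no longer breaks rendering
doc.user_hooks["similarity"] = lambda doc1, doc2: 0.0

html = displacy.render(doc, style="dep", jupyter=False)
```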

View File

@@ -90,12 +90,12 @@ class RussianLemmatizer(Lemmatizer):
             return [string.lower()]
         return list(set([analysis.normal_form for analysis in filtered_analyses]))

-    def lookup_lemmatize(self, token: Token) -> List[str]:
+    def pymorphy2_lookup_lemmatize(self, token: Token) -> List[str]:
         string = token.text
         analyses = self._morph.parse(string)
         if len(analyses) == 1:
-            return analyses[0].normal_form
-        return string
+            return [analyses[0].normal_form]
+        return [string]


 def oc2ud(oc_tag: str) -> Tuple[str, Dict[str, str]]:
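The rename matters because `Lemmatizer` dispatches on its `mode` setting by
looking up a `{mode}_lemmatize` method, and lookup-style modes must return a
list of lemma strings. A hedged sketch of selecting the lookup mode, assuming
`pymorphy2` is installed:
```python
import spacy

nlp = spacy.blank("ru")
# Resolves to RussianLemmatizer.pymorphy2_lookup_lemmatize via the mode name
nlp.add_pipe("lemmatizer", config={"mode": "pymorphy2_lookup"})
nlp.initialize()

doc = nlp("мама мыла раму")
print([token.lemma_ for token in doc])
```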

spacy/py.typed (new file, empty)
View File
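Shipping the `py.typed` marker (PEP 561) tells type checkers that spaCy's
inline annotations may be used. A small sketch of what this enables:
```python
# check_types.py -- run "mypy check_types.py"; with py.typed installed,
# mypy resolves spaCy's own annotations instead of treating it as untyped
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc: Doc = nlp("hello world")  # checked: calling the pipeline returns a Doc
```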

View File

@@ -6,14 +6,15 @@ def test_build_dependencies():
     # Check that library requirements are pinned exactly the same across different setup files.
     # TODO: correct checks for numpy rather than ignoring
     libs_ignore_requirements = [
-        "numpy",
         "pytest",
         "pytest-timeout",
         "mock",
         "flake8",
+        "hypothesis",
     ]
     # ignore language-specific packages that shouldn't be installed by all
     libs_ignore_setup = [
+        "numpy",
         "fugashi",
         "natto-py",
         "pythainlp",

View File

@@ -142,7 +142,7 @@ def create_pretraining_model(nlp, pretrain_config):
     # If the config referred to a Tok2VecListener, grab the original model instead
     if type(tok2vec).__name__ == "Tok2VecListener":
         original_tok2vec = (
-            tok2vec.upstream_name if tok2vec.upstream_name is not "*" else "tok2vec"
+            tok2vec.upstream_name if tok2vec.upstream_name != "*" else "tok2vec"
         )
         tok2vec = nlp.get_pipe(original_tok2vec).model
     try:
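The fix replaces an identity check with an equality check: `is not` compares
object identity, so string comparisons with it only work when CPython happens
to reuse the same object, and Python 3.8+ emits a `SyntaxWarning` for `is`
with a literal. A quick illustration:
```python
a = "tok2vec"
b = "".join(["tok2", "vec"])  # equal value, built at runtime

print(a == b)  # True: the values match
print(a is b)  # False in CPython: two distinct string objects
```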

View File

@@ -88,7 +88,7 @@ class registry(thinc.registry):
     displacy_colors = catalogue.create("spacy", "displacy_colors", entry_points=True)
     misc = catalogue.create("spacy", "misc", entry_points=True)
     # Callback functions used to manipulate nlp object etc.
-    callbacks = catalogue.create("spacy", "callbacks")
+    callbacks = catalogue.create("spacy", "callbacks", entry_points=True)
     batchers = catalogue.create("spacy", "batchers", entry_points=True)
     readers = catalogue.create("spacy", "readers", entry_points=True)
     augmenters = catalogue.create("spacy", "augmenters", entry_points=True)
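With `entry_points=True`, callbacks registered by installed packages are
discovered through the `spacy_callbacks` entry-point group without an explicit
import. A sketch of registering one (the name is hypothetical):
```python
import spacy

@spacy.registry.callbacks("log_pipeline.v1")
def create_log_callback():
    # Could be referenced from a config block such as [initialize.before_init]
    def log_pipeline(nlp):
        print("components:", nlp.pipe_names)
        return nlp
    return log_pipeline
```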

View File

@@ -170,14 +170,15 @@ validation error with more details.

 $ python -m spacy init fill-config [base_path] [output_file] [--diff]
 ```

 | Name                   | Description |
 | ---------------------- | ----------- |
 | `base_path`            | Path to base config to fill, e.g. generated by the [quickstart widget](/usage/training#quickstart). ~~Path (positional)~~ |
 | `output_file`          | Path to output `.cfg` file. If not set, the config is written to stdout so you can pipe it forward to a file. ~~Path (positional)~~ |
+| `--code`, `-c`         | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
 | `--pretraining`, `-pt` | Include config for pretraining (with [`spacy pretrain`](/api/cli#pretrain)). Defaults to `False`. ~~bool (flag)~~ |
 | `--diff`, `-D`         | Print a visual diff highlighting the changes. ~~bool (flag)~~ |
 | `--help`, `-h`         | Show help message and available arguments. ~~bool (flag)~~ |
 | **CREATES**            | Complete and auto-filled config file for training. |

 ### init vectors {#init-vectors new="3" tag="command"}

@@ -261,24 +262,24 @@ $ python -m spacy convert [input_file] [output_dir] [--converter] [--file-type]

 | `output_dir` | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. ~~Optional[Path] \(positional)~~ |
 | `--converter`, `-c` <Tag variant="new">2</Tag> | Name of converter to use (see below). ~~str (option)~~ |
 | `--file-type`, `-t` <Tag variant="new">2.1</Tag> | Type of file to create. Either `spacy` (default) for binary [`DocBin`](/api/docbin) data or `json` for v2.x JSON format. ~~str (option)~~ |
-| `--n-sents`, `-n` | Number of sentences per document. ~~int (option)~~ |
+| `--n-sents`, `-n` | Number of sentences per document. Supported for: `conll`, `conllu`, `iob`, `ner` ~~int (option)~~ |
-| `--seg-sents`, `-s` <Tag variant="new">2.2</Tag> | Segment sentences (for `--converter ner`). ~~bool (flag)~~ |
+| `--seg-sents`, `-s` <Tag variant="new">2.2</Tag> | Segment sentences. Supported for: `conll`, `ner` ~~bool (flag)~~ |
 | `--base`, `-b` | Trained spaCy pipeline for sentence segmentation to use as base (for `--seg-sents`). ~~Optional[str](option)~~ |
-| `--morphology`, `-m` | Enable appending morphology to tags. ~~bool (flag)~~ |
+| `--morphology`, `-m` | Enable appending morphology to tags. Supported for: `conllu` ~~bool (flag)~~ |
-| `--ner-map`, `-nm` | NER tag mapping (as JSON-encoded dict of entity types). ~~Optional[Path](option)~~ |
+| `--ner-map`, `-nm` | NER tag mapping (as JSON-encoded dict of entity types). Supported for: `conllu` ~~Optional[Path](option)~~ |
 | `--lang`, `-l` <Tag variant="new">2.1</Tag> | Language code (if tokenizer required). ~~Optional[str] \(option)~~ |
 | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
 | **CREATES** | Binary [`DocBin`](/api/docbin) training data that can be used with [`spacy train`](/api/cli#train). |

 ### Converters {#converters}

 | ID | Description |
 | --------------- | ----------- |
 | `auto` | Automatically pick converter based on file extension and file content (default). |
 | `json` | JSON-formatted training data used in spaCy v2.x. |
-| `conll` | Universal Dependencies `.conllu` or `.conll` format. |
+| `conllu` | Universal Dependencies `.conllu` format. |
-| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). |
+| `ner` / `conll` | NER with IOB/IOB2/BILUO tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the NER tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). |
-| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `\|`, either `word\|B-ENT` or `word\|POS\|B-ENT`. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). |
+| `iob` | NER with IOB/IOB2/BILUO tags, one sentence per line with tokens separated by whitespace and annotation separated by `\|`, either `word\|B-ENT` or `word\|POS\|B-ENT`. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). |

 ## debug {#debug new="3"}

@@ -805,7 +806,7 @@ $ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id]

 | Name | Description |
 | ----------------- | ----------- |
 | `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. If `-`, the data will be [read from stdin](/usage/training#config-stdin). ~~Union[Path, str] \(positional)~~ |
-| `--output`, `-o` | Directory to store trained pipeline in. Will be created if it doesn't exist. ~~Optional[Path] \(positional)~~ |
+| `--output`, `-o` | Directory to store trained pipeline in. Will be created if it doesn't exist. ~~Optional[Path] \(option)~~ |
 | `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
 | `--verbose`, `-V` | Show more detailed messages during training. ~~bool (flag)~~ |
 | `--gpu-id`, `-g` | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~ |

View File

@@ -48,6 +48,7 @@ specified separately using the new `exclude` keyword argument.

 | ------------------------------------ | ----------- |
 | `name` | Pipeline to load, i.e. package name or path. ~~Union[str, Path]~~ |
 | _keyword-only_ | |
+| `vocab` | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~ |
 | `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ |
 | `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ |
 | `config` <Tag variant="new">3</Tag> | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~ |
@@ -83,9 +84,9 @@ Create a blank pipeline of a given language class. This function is the twin of

 | ----------------------------------- | ----------- |
 | `name` | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. ~~str~~ |
 | _keyword-only_ | |
-| `vocab` <Tag variant="new">3</Tag> | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. |
+| `vocab` | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~ |
 | `config` <Tag variant="new">3</Tag> | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~ |
-| `meta` <Tag variant="new">3</Tag> | Optional meta overrides for [`nlp.meta`](/api/language#meta). ~~Dict[str, Any]~~ |
+| `meta` | Optional meta overrides for [`nlp.meta`](/api/language#meta). ~~Dict[str, Any]~~ |
 | **RETURNS** | An empty `Language` object of the appropriate subclass. ~~Language~~ |
 ### spacy.info {#spacy.info tag="function"}

@@ -140,9 +141,9 @@ pipelines.

 <Infobox variant="warning" title="Jupyter notebook usage">

-In a Jupyter notebook, run `prefer_gpu()` in the same cell as `spacy.load()`
-to ensure that the model is loaded on the correct device. See [more
-details](/usage/v3#jupyter-notebook-gpu).
+In a Jupyter notebook, run `prefer_gpu()` in the same cell as `spacy.load()` to
+ensure that the model is loaded on the correct device. See
+[more details](/usage/v3#jupyter-notebook-gpu).

 </Infobox>

@@ -168,9 +169,9 @@ and _before_ loading any pipelines.

 <Infobox variant="warning" title="Jupyter notebook usage">

-In a Jupyter notebook, run `require_gpu()` in the same cell as `spacy.load()`
-to ensure that the model is loaded on the correct device. See [more
-details](/usage/v3#jupyter-notebook-gpu).
+In a Jupyter notebook, run `require_gpu()` in the same cell as `spacy.load()` to
+ensure that the model is loaded on the correct device. See
+[more details](/usage/v3#jupyter-notebook-gpu).

 </Infobox>

@@ -195,9 +196,9 @@ after importing spaCy and _before_ loading any pipelines.

 <Infobox variant="warning" title="Jupyter notebook usage">

-In a Jupyter notebook, run `require_cpu()` in the same cell as `spacy.load()`
-to ensure that the model is loaded on the correct device. See [more
-details](/usage/v3#jupyter-notebook-gpu).
+In a Jupyter notebook, run `require_cpu()` in the same cell as `spacy.load()` to
+ensure that the model is loaded on the correct device. See
+[more details](/usage/v3#jupyter-notebook-gpu).

 </Infobox>
@@ -945,7 +946,8 @@ and create a `Language` object. The model data will then be loaded in via

 | Name | Description |
 | ------------------------------------ | ----------- |
 | `name` | Package name or path. ~~str~~ |
-| `vocab` <Tag variant="new">3</Tag> | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. |
+| _keyword-only_ | |
+| `vocab` | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~ |
 | `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [`nlp.enable_pipe`](/api/language#enable_pipe). ~~List[str]~~ |
 | `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ |
 | `config` <Tag variant="new">3</Tag> | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ |
@@ -968,7 +970,8 @@ A helper function to use in the `load()` method of a pipeline package's

 | Name | Description |
 | ------------------------------------ | ----------- |
 | `init_file` | Path to package's `__init__.py`, i.e. `__file__`. ~~Union[str, Path]~~ |
-| `vocab` <Tag variant="new">3</Tag> | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. |
+| _keyword-only_ | |
+| `vocab` <Tag variant="new">3</Tag> | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~ |
 | `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ |
 | `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ |
 | `config` <Tag variant="new">3</Tag> | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ |
@@ -1147,11 +1150,11 @@ vary on each step.

 > nlp.update(batch)
 > ```

 | Name | Description |
 | ---------- | ------------------------------------------------ |
 | `items` | The items to batch up. ~~Iterable[Any]~~ |
-| `size` | int / iterable | The batch size(s). ~~Union[int, Sequence[int]]~~ |
+| `size` | The batch size(s). ~~Union[int, Sequence[int]]~~ |
 | **YIELDS** | The batches. |

 ### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"}
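A quick sketch of the corrected `util.minibatch` signature in use:
```python
from spacy.util import minibatch

# size can be a fixed int or a sequence/schedule of sizes
for batch in minibatch(range(10), size=4):
    print(batch)  # [0, 1, 2, 3] then [4, 5, 6, 7] then [8, 9]
```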

View File

@@ -1,5 +1,36 @@
 {
     "resources": [
+        {
+            "id": "spikex",
+            "title": "SpikeX - SpaCy Pipes for Knowledge Extraction",
+            "slogan": "Use SpikeX to build knowledge extraction tools with almost-zero effort",
+            "description": "SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.",
+            "github": "erre-quadro/spikex",
+            "pip": "spikex",
+            "code_example": [
+                "from spacy import load as spacy_load",
+                "from spikex.wikigraph import load as wg_load",
+                "from spikex.pipes import WikiPageX",
+                "",
+                "# load a spacy model and get a doc",
+                "nlp = spacy_load('en_core_web_sm')",
+                "doc = nlp('An apple a day keeps the doctor away')",
+                "# load a WikiGraph",
+                "wg = wg_load('simplewiki_core')",
+                "# get a WikiPageX and extract all pages",
+                "wikipagex = WikiPageX(wg)",
+                "doc = wikipagex(doc)",
+                "# see all pages extracted from the doc",
+                "for span in doc._.wiki_spans:",
+                "    print(span._.wiki_pages)"
+            ],
+            "category": ["pipeline", "standalone"],
+            "author": "Erre Quadro",
+            "author_links": {
+                "github": "erre-quadro",
+                "website": "https://www.errequadrosrl.com"
+            }
+        },
         {
             "id": "spacy-dbpedia-spotlight",
             "title": "DBpedia Spotlight for SpaCy",