mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-24 08:14:15 +03:00
Merge pull request #9777 from explosion/master
Update develop with master
This commit is contained in:
commit
58e29776bd
106
.github/contributors/Pantalaymon.md
vendored
Normal file
106
.github/contributors/Pantalaymon.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name |Valentin-Gabriel Soumah|
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2021-11-23 |
|
||||
| GitHub username | Pantalaymon |
|
||||
| Website (optional) | |
|
|
@ -23,7 +23,7 @@ jobs:
|
|||
# defined in .flake8 and overwrites the selected codes.
|
||||
- job: "Validate"
|
||||
pool:
|
||||
vmImage: "ubuntu-18.04"
|
||||
vmImage: "ubuntu-latest"
|
||||
steps:
|
||||
- task: UsePythonVersion@0
|
||||
inputs:
|
||||
|
@ -39,49 +39,49 @@ jobs:
|
|||
matrix:
|
||||
# We're only running one platform per Python version to speed up builds
|
||||
Python36Linux:
|
||||
imageName: "ubuntu-18.04"
|
||||
imageName: "ubuntu-latest"
|
||||
python.version: "3.6"
|
||||
# Python36Windows:
|
||||
# imageName: "windows-2019"
|
||||
# imageName: "windows-latest"
|
||||
# python.version: "3.6"
|
||||
# Python36Mac:
|
||||
# imageName: "macos-10.14"
|
||||
# imageName: "macos-latest"
|
||||
# python.version: "3.6"
|
||||
# Python37Linux:
|
||||
# imageName: "ubuntu-18.04"
|
||||
# imageName: "ubuntu-latest"
|
||||
# python.version: "3.7"
|
||||
Python37Windows:
|
||||
imageName: "windows-2019"
|
||||
imageName: "windows-latest"
|
||||
python.version: "3.7"
|
||||
# Python37Mac:
|
||||
# imageName: "macos-10.14"
|
||||
# imageName: "macos-latest"
|
||||
# python.version: "3.7"
|
||||
# Python38Linux:
|
||||
# imageName: "ubuntu-18.04"
|
||||
# imageName: "ubuntu-latest"
|
||||
# python.version: "3.8"
|
||||
# Python38Windows:
|
||||
# imageName: "windows-2019"
|
||||
# imageName: "windows-latest"
|
||||
# python.version: "3.8"
|
||||
Python38Mac:
|
||||
imageName: "macos-10.14"
|
||||
imageName: "macos-latest"
|
||||
python.version: "3.8"
|
||||
Python39Linux:
|
||||
imageName: "ubuntu-18.04"
|
||||
imageName: "ubuntu-latest"
|
||||
python.version: "3.9"
|
||||
# Python39Windows:
|
||||
# imageName: "windows-2019"
|
||||
# imageName: "windows-latest"
|
||||
# python.version: "3.9"
|
||||
# Python39Mac:
|
||||
# imageName: "macos-10.14"
|
||||
# imageName: "macos-latest"
|
||||
# python.version: "3.9"
|
||||
Python310Linux:
|
||||
imageName: "ubuntu-20.04"
|
||||
imageName: "ubuntu-latest"
|
||||
python.version: "3.10"
|
||||
Python310Windows:
|
||||
imageName: "windows-2019"
|
||||
imageName: "windows-latest"
|
||||
python.version: "3.10"
|
||||
Python310Mac:
|
||||
imageName: "macos-10.15"
|
||||
imageName: "macos-latest"
|
||||
python.version: "3.10"
|
||||
maxParallel: 4
|
||||
pool:
|
||||
|
|
|
@ -4,6 +4,7 @@ from pathlib import Path
|
|||
from wasabi import Printer, MarkdownRenderer, get_raw_input
|
||||
from thinc.api import Config
|
||||
from collections import defaultdict
|
||||
from catalogue import RegistryError
|
||||
import srsly
|
||||
import sys
|
||||
|
||||
|
@ -212,9 +213,18 @@ def get_third_party_dependencies(
|
|||
if "factory" in component:
|
||||
funcs["factories"].add(component["factory"])
|
||||
modules = set()
|
||||
lang = config["nlp"]["lang"]
|
||||
for reg_name, func_names in funcs.items():
|
||||
for func_name in func_names:
|
||||
func_info = util.registry.find(reg_name, func_name)
|
||||
# Try the lang-specific version and fall back
|
||||
try:
|
||||
func_info = util.registry.find(reg_name, lang + "." + func_name)
|
||||
except RegistryError:
|
||||
try:
|
||||
func_info = util.registry.find(reg_name, func_name)
|
||||
except RegistryError as regerr:
|
||||
# lang-specific version being absent is not actually an issue
|
||||
raise regerr from None
|
||||
module_name = func_info.get("module") # type: ignore[attr-defined]
|
||||
if module_name: # the code is part of a module, not a --code file
|
||||
modules.add(func_info["module"].split(".")[0]) # type: ignore[index]
|
||||
|
@ -397,7 +407,7 @@ def _format_label_scheme(data: Dict[str, Any]) -> str:
|
|||
continue
|
||||
col1 = md.bold(md.code(pipe))
|
||||
col2 = ", ".join(
|
||||
[md.code(label.replace("|", "\\|")) for label in labels]
|
||||
[md.code(str(label).replace("|", "\\|")) for label in labels]
|
||||
) # noqa: W605
|
||||
label_data.append((col1, col2))
|
||||
n_labels += len(labels)
|
||||
|
|
|
@ -181,11 +181,19 @@ def parse_deps(orig_doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
|
|||
def parse_ents(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
|
||||
"""Generate named entities in [{start: i, end: i, label: 'label'}] format.
|
||||
|
||||
doc (Doc): Document do parse.
|
||||
doc (Doc): Document to parse.
|
||||
options (Dict[str, Any]): NER-specific visualisation options.
|
||||
RETURNS (dict): Generated entities keyed by text (original text) and ents.
|
||||
"""
|
||||
kb_url_template = options.get("kb_url_template", None)
|
||||
ents = [
|
||||
{"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
|
||||
{
|
||||
"start": ent.start_char,
|
||||
"end": ent.end_char,
|
||||
"label": ent.label_,
|
||||
"kb_id": ent.kb_id_ if ent.kb_id_ else "",
|
||||
"kb_url": kb_url_template.format(ent.kb_id_) if kb_url_template else "#",
|
||||
}
|
||||
for ent in doc.ents
|
||||
]
|
||||
if not ents:
|
||||
|
|
|
@ -191,6 +191,7 @@ class Warnings(metaclass=ErrorsWithCodes):
|
|||
"lead to errors.")
|
||||
W115 = ("Skipping {method}: the floret vector table cannot be modified. "
|
||||
"Vectors are calculated from character ngrams.")
|
||||
W116 = ("Unable to clean attribute '{attr}'.")
|
||||
|
||||
|
||||
class Errors(metaclass=ErrorsWithCodes):
|
||||
|
@ -887,6 +888,7 @@ class Errors(metaclass=ErrorsWithCodes):
|
|||
E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. "
|
||||
"Non-UD tags should use the `tag` property.")
|
||||
E1022 = ("Words must be of type str or int, but input is of type '{wtype}'")
|
||||
E1023 = ("Couldn't read EntityRuler from the {path}. This file doesn't exist.")
|
||||
|
||||
|
||||
# Deprecated model shortcuts, only used in errors and warnings
|
||||
|
|
|
@ -701,7 +701,8 @@ class Language:
|
|||
if (
|
||||
self.vocab.vectors.shape != source.vocab.vectors.shape
|
||||
or self.vocab.vectors.key2row != source.vocab.vectors.key2row
|
||||
or self.vocab.vectors.to_bytes() != source.vocab.vectors.to_bytes()
|
||||
or self.vocab.vectors.to_bytes(exclude=["strings"])
|
||||
!= source.vocab.vectors.to_bytes(exclude=["strings"])
|
||||
):
|
||||
warnings.warn(Warnings.W113.format(name=source_name))
|
||||
if source_name not in source.component_names:
|
||||
|
@ -1822,7 +1823,9 @@ class Language:
|
|||
)
|
||||
if model not in source_nlp_vectors_hashes:
|
||||
source_nlp_vectors_hashes[model] = hash(
|
||||
source_nlps[model].vocab.vectors.to_bytes()
|
||||
source_nlps[model].vocab.vectors.to_bytes(
|
||||
exclude=["strings"]
|
||||
)
|
||||
)
|
||||
if "_sourced_vectors_hashes" not in nlp.meta:
|
||||
nlp.meta["_sourced_vectors_hashes"] = {}
|
||||
|
|
|
@ -28,7 +28,13 @@ def forward(
|
|||
X, spans = source_spans
|
||||
assert spans.dataXd.ndim == 2
|
||||
indices = _get_span_indices(ops, spans, X.lengths)
|
||||
Y = Ragged(X.dataXd[indices], spans.dataXd[:, 1] - spans.dataXd[:, 0]) # type: ignore[arg-type, index]
|
||||
if len(indices) > 0:
|
||||
Y = Ragged(X.dataXd[indices], spans.dataXd[:, 1] - spans.dataXd[:, 0]) # type: ignore[arg-type, index]
|
||||
else:
|
||||
Y = Ragged(
|
||||
ops.xp.zeros(X.dataXd.shape, dtype=X.dataXd.dtype),
|
||||
ops.xp.zeros((len(X.lengths),), dtype="i"),
|
||||
)
|
||||
x_shape = X.dataXd.shape
|
||||
x_lengths = X.lengths
|
||||
|
||||
|
@ -53,7 +59,7 @@ def _get_span_indices(ops, spans: Ragged, lengths: Ints1d) -> Ints1d:
|
|||
for j in range(spans_i.shape[0]):
|
||||
indices.append(ops.xp.arange(spans_i[j, 0], spans_i[j, 1])) # type: ignore[call-overload, index]
|
||||
offset += length
|
||||
return ops.flatten(indices)
|
||||
return ops.flatten(indices, dtype="i", ndim_if_empty=1)
|
||||
|
||||
|
||||
def _ensure_cpu(spans: Ragged, lengths: Ints1d) -> Tuple[Ragged, Ints1d]:
|
||||
|
|
|
@ -585,7 +585,10 @@ cdef class ArcEager(TransitionSystem):
|
|||
actions[RIGHT][label] = 1
|
||||
actions[REDUCE][label] = 1
|
||||
for example in kwargs.get('examples', []):
|
||||
heads, labels = example.get_aligned_parse(projectivize=True)
|
||||
# use heads and labels from the reference parse (without regard to
|
||||
# misalignments between the predicted and reference)
|
||||
example_gold_preproc = Example(example.reference, example.reference)
|
||||
heads, labels = example_gold_preproc.get_aligned_parse(projectivize=True)
|
||||
for child, (head, label) in enumerate(zip(heads, labels)):
|
||||
if head is None or label is None:
|
||||
continue
|
||||
|
|
|
@ -431,10 +431,16 @@ class EntityRuler(Pipe):
|
|||
path = ensure_path(path)
|
||||
self.clear()
|
||||
depr_patterns_path = path.with_suffix(".jsonl")
|
||||
if depr_patterns_path.is_file():
|
||||
if path.suffix == ".jsonl": # user provides a jsonl
|
||||
if path.is_file:
|
||||
patterns = srsly.read_jsonl(path)
|
||||
self.add_patterns(patterns)
|
||||
else:
|
||||
raise ValueError(Errors.E1023.format(path=path))
|
||||
elif depr_patterns_path.is_file():
|
||||
patterns = srsly.read_jsonl(depr_patterns_path)
|
||||
self.add_patterns(patterns)
|
||||
else:
|
||||
elif path.is_dir(): # path is a valid directory
|
||||
cfg = {}
|
||||
deserializers_patterns = {
|
||||
"patterns": lambda p: self.add_patterns(
|
||||
|
@ -451,6 +457,8 @@ class EntityRuler(Pipe):
|
|||
self.nlp.vocab, attr=self.phrase_matcher_attr
|
||||
)
|
||||
from_disk(path, deserializers_patterns, {})
|
||||
else: # path is not a valid directory or file
|
||||
raise ValueError(Errors.E146.format(path=path))
|
||||
return self
|
||||
|
||||
def to_disk(
|
||||
|
|
|
@ -1,6 +1,8 @@
|
|||
from typing import Dict, Any
|
||||
import srsly
|
||||
import warnings
|
||||
|
||||
from ..errors import Warnings
|
||||
from ..language import Language
|
||||
from ..matcher import Matcher
|
||||
from ..tokens import Doc
|
||||
|
@ -136,3 +138,65 @@ class TokenSplitter:
|
|||
"cfg": lambda p: self._set_config(srsly.read_json(p)),
|
||||
}
|
||||
util.from_disk(path, serializers, [])
|
||||
|
||||
|
||||
@Language.factory(
|
||||
"doc_cleaner",
|
||||
default_config={"attrs": {"tensor": None, "_.trf_data": None}, "silent": True},
|
||||
)
|
||||
def make_doc_cleaner(nlp: Language, name: str, *, attrs: Dict[str, Any], silent: bool):
|
||||
return DocCleaner(attrs, silent=silent)
|
||||
|
||||
|
||||
class DocCleaner:
|
||||
def __init__(self, attrs: Dict[str, Any], *, silent: bool = True):
|
||||
self.cfg: Dict[str, Any] = {"attrs": dict(attrs), "silent": silent}
|
||||
|
||||
def __call__(self, doc: Doc) -> Doc:
|
||||
attrs: dict = self.cfg["attrs"]
|
||||
silent: bool = self.cfg["silent"]
|
||||
for attr, value in attrs.items():
|
||||
obj = doc
|
||||
parts = attr.split(".")
|
||||
skip = False
|
||||
for part in parts[:-1]:
|
||||
if hasattr(obj, part):
|
||||
obj = getattr(obj, part)
|
||||
else:
|
||||
skip = True
|
||||
if not silent:
|
||||
warnings.warn(Warnings.W116.format(attr=attr))
|
||||
if not skip:
|
||||
if hasattr(obj, parts[-1]):
|
||||
setattr(obj, parts[-1], value)
|
||||
else:
|
||||
if not silent:
|
||||
warnings.warn(Warnings.W116.format(attr=attr))
|
||||
return doc
|
||||
|
||||
def to_bytes(self, **kwargs):
|
||||
serializers = {
|
||||
"cfg": lambda: srsly.json_dumps(self.cfg),
|
||||
}
|
||||
return util.to_bytes(serializers, [])
|
||||
|
||||
def from_bytes(self, data, **kwargs):
|
||||
deserializers = {
|
||||
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
|
||||
}
|
||||
util.from_bytes(data, deserializers, [])
|
||||
return self
|
||||
|
||||
def to_disk(self, path, **kwargs):
|
||||
path = util.ensure_path(path)
|
||||
serializers = {
|
||||
"cfg": lambda p: srsly.write_json(p, self.cfg),
|
||||
}
|
||||
return util.to_disk(path, serializers, [])
|
||||
|
||||
def from_disk(self, path, **kwargs):
|
||||
path = util.ensure_path(path)
|
||||
serializers = {
|
||||
"cfg": lambda p: self.cfg.update(srsly.read_json(p)),
|
||||
}
|
||||
util.from_disk(path, serializers, [])
|
||||
|
|
|
@ -231,12 +231,13 @@ class Morphologizer(Tagger):
|
|||
cdef Vocab vocab = self.vocab
|
||||
cdef bint overwrite = self.cfg["overwrite"]
|
||||
cdef bint extend = self.cfg["extend"]
|
||||
labels = self.labels
|
||||
for i, doc in enumerate(docs):
|
||||
doc_tag_ids = batch_tag_ids[i]
|
||||
if hasattr(doc_tag_ids, "get"):
|
||||
doc_tag_ids = doc_tag_ids.get()
|
||||
for j, tag_id in enumerate(doc_tag_ids):
|
||||
morph = self.labels[tag_id]
|
||||
morph = labels[tag_id]
|
||||
# set morph
|
||||
if doc.c[j].morph == 0 or overwrite or extend:
|
||||
if overwrite and extend:
|
||||
|
|
|
@ -78,7 +78,7 @@ def build_ngram_suggester(sizes: List[int]) -> Suggester:
|
|||
if len(spans) > 0:
|
||||
output = Ragged(ops.xp.vstack(spans), lengths_array)
|
||||
else:
|
||||
output = Ragged(ops.xp.zeros((0, 0)), lengths_array)
|
||||
output = Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)
|
||||
|
||||
assert output.dataXd.ndim == 2
|
||||
return output
|
||||
|
|
|
@ -166,13 +166,14 @@ class Tagger(TrainablePipe):
|
|||
cdef Doc doc
|
||||
cdef Vocab vocab = self.vocab
|
||||
cdef bint overwrite = self.cfg["overwrite"]
|
||||
labels = self.labels
|
||||
for i, doc in enumerate(docs):
|
||||
doc_tag_ids = batch_tag_ids[i]
|
||||
if hasattr(doc_tag_ids, "get"):
|
||||
doc_tag_ids = doc_tag_ids.get()
|
||||
for j, tag_id in enumerate(doc_tag_ids):
|
||||
if doc.c[j].tag == 0 or overwrite:
|
||||
doc.c[j].tag = self.vocab.strings[self.labels[tag_id]]
|
||||
doc.c[j].tag = self.vocab.strings[labels[tag_id]]
|
||||
|
||||
def update(self, examples, *, drop=0., sgd=None, losses=None):
|
||||
"""Learn from a batch of documents and gold-standard information,
|
||||
|
|
|
@ -222,6 +222,8 @@ class TokenPattern(BaseModel):
|
|||
lemma: Optional[StringValue] = None
|
||||
shape: Optional[StringValue] = None
|
||||
ent_type: Optional[StringValue] = None
|
||||
ent_id: Optional[StringValue] = None
|
||||
ent_kb_id: Optional[StringValue] = None
|
||||
norm: Optional[StringValue] = None
|
||||
length: Optional[NumberValue] = None
|
||||
spacy: Optional[StrictBool] = None
|
||||
|
|
|
@ -359,14 +359,15 @@ class Scorer:
|
|||
pred_doc = example.predicted
|
||||
gold_doc = example.reference
|
||||
# Option to handle docs without annotation for this attribute
|
||||
if has_annotation is not None:
|
||||
if not has_annotation(gold_doc):
|
||||
continue
|
||||
# Find all labels in gold and doc
|
||||
labels = set(
|
||||
[k.label_ for k in getter(gold_doc, attr)]
|
||||
+ [k.label_ for k in getter(pred_doc, attr)]
|
||||
)
|
||||
if has_annotation is not None and not has_annotation(gold_doc):
|
||||
continue
|
||||
# Find all labels in gold
|
||||
labels = set([k.label_ for k in getter(gold_doc, attr)])
|
||||
# If labeled, find all labels in pred
|
||||
if has_annotation is None or (
|
||||
has_annotation is not None and has_annotation(pred_doc)
|
||||
):
|
||||
labels |= set([k.label_ for k in getter(pred_doc, attr)])
|
||||
# Set up all labels for per type scoring and prepare gold per type
|
||||
gold_per_type: Dict[str, Set] = {label: set() for label in labels}
|
||||
for label in labels:
|
||||
|
@ -384,16 +385,19 @@ class Scorer:
|
|||
gold_spans.add(gold_span)
|
||||
gold_per_type[span.label_].add(gold_span)
|
||||
pred_per_type: Dict[str, Set] = {label: set() for label in labels}
|
||||
for span in example.get_aligned_spans_x2y(
|
||||
getter(pred_doc, attr), allow_overlap
|
||||
if has_annotation is None or (
|
||||
has_annotation is not None and has_annotation(pred_doc)
|
||||
):
|
||||
pred_span: Tuple
|
||||
if labeled:
|
||||
pred_span = (span.label_, span.start, span.end - 1)
|
||||
else:
|
||||
pred_span = (span.start, span.end - 1)
|
||||
pred_spans.add(pred_span)
|
||||
pred_per_type[span.label_].add(pred_span)
|
||||
for span in example.get_aligned_spans_x2y(
|
||||
getter(pred_doc, attr), allow_overlap
|
||||
):
|
||||
pred_span: Tuple
|
||||
if labeled:
|
||||
pred_span = (span.label_, span.start, span.end - 1)
|
||||
else:
|
||||
pred_span = (span.start, span.end - 1)
|
||||
pred_spans.add(pred_span)
|
||||
pred_per_type[span.label_].add(pred_span)
|
||||
# Scores per label
|
||||
if labeled:
|
||||
for k, v in score_per_type.items():
|
||||
|
|
|
@ -49,6 +49,11 @@ def tokenizer():
|
|||
return get_lang_class("xx")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def af_tokenizer():
|
||||
return get_lang_class("af")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def am_tokenizer():
|
||||
return get_lang_class("am")().tokenizer
|
||||
|
@ -125,6 +130,11 @@ def es_vocab():
|
|||
return get_lang_class("es")().vocab
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def et_tokenizer():
|
||||
return get_lang_class("et")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def eu_tokenizer():
|
||||
return get_lang_class("eu")().tokenizer
|
||||
|
@ -190,6 +200,11 @@ def id_tokenizer():
|
|||
return get_lang_class("id")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def is_tokenizer():
|
||||
return get_lang_class("is")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def it_tokenizer():
|
||||
return get_lang_class("it")().tokenizer
|
||||
|
@ -222,6 +237,11 @@ def lt_tokenizer():
|
|||
return get_lang_class("lt")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def lv_tokenizer():
|
||||
return get_lang_class("lv")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def mk_tokenizer():
|
||||
return get_lang_class("mk")().tokenizer
|
||||
|
@ -289,11 +309,26 @@ def sa_tokenizer():
|
|||
return get_lang_class("sa")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def sk_tokenizer():
|
||||
return get_lang_class("sk")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def sl_tokenizer():
|
||||
return get_lang_class("sl")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def sr_tokenizer():
|
||||
return get_lang_class("sr")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def sq_tokenizer():
|
||||
return get_lang_class("sq")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def sv_tokenizer():
|
||||
return get_lang_class("sv")().tokenizer
|
||||
|
@ -354,6 +389,11 @@ def vi_tokenizer():
|
|||
return get_lang_class("vi")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def xx_tokenizer():
|
||||
return get_lang_class("xx")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def yo_tokenizer():
|
||||
return get_lang_class("yo")().tokenizer
|
||||
|
|
0
spacy/tests/lang/af/__init__.py
Normal file
0
spacy/tests/lang/af/__init__.py
Normal file
22
spacy/tests/lang/af/test_text.py
Normal file
22
spacy/tests/lang/af/test_text.py
Normal file
|
@ -0,0 +1,22 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(af_tokenizer):
|
||||
# Excerpt: Universal Declaration of Human Rights; “'n” changed to “die” in first sentence
|
||||
text = """
|
||||
Hierdie Universele Verklaring van Menseregte as die algemene standaard vir die verwesenliking deur alle mense en nasies,
|
||||
om te verseker dat elke individu en elke deel van die gemeenskap hierdie Verklaring in ag sal neem en deur opvoeding,
|
||||
respek vir hierdie regte en vryhede te bevorder, op nasionale en internasionale vlak, daarna sal strewe om die universele
|
||||
en effektiewe erkenning en agting van hierdie regte te verseker, nie net vir die mense van die Lidstate nie, maar ook vir
|
||||
die mense in die gebiede onder hul jurisdiksie.
|
||||
|
||||
"""
|
||||
tokens = af_tokenizer(text)
|
||||
assert len(tokens) == 100
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_indefinite_article(af_tokenizer):
|
||||
text = "as 'n algemene standaard"
|
||||
tokens = af_tokenizer(text)
|
||||
assert len(tokens) == 4
|
29
spacy/tests/lang/af/test_tokenizer.py
Normal file
29
spacy/tests/lang/af/test_tokenizer.py
Normal file
|
@ -0,0 +1,29 @@
|
|||
import pytest
|
||||
|
||||
AF_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Elkeen het die reg tot lewe, vryheid en sekuriteit van persoon.",
|
||||
[
|
||||
"Elkeen",
|
||||
"het",
|
||||
"die",
|
||||
"reg",
|
||||
"tot",
|
||||
"lewe",
|
||||
",",
|
||||
"vryheid",
|
||||
"en",
|
||||
"sekuriteit",
|
||||
"van",
|
||||
"persoon",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", AF_BASIC_TOKENIZATION_TESTS)
|
||||
def test_af_tokenizer_basic(af_tokenizer, text, expected_tokens):
|
||||
tokens = af_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
0
spacy/tests/lang/et/__init__.py
Normal file
0
spacy/tests/lang/et/__init__.py
Normal file
26
spacy/tests/lang/et/test_text.py
Normal file
26
spacy/tests/lang/et/test_text.py
Normal file
|
@ -0,0 +1,26 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(et_tokenizer):
|
||||
# Excerpt: European Convention on Human Rights
|
||||
text = """
|
||||
arvestades, et nimetatud deklaratsiooni eesmärk on tagada selles
|
||||
kuulutatud õiguste üldine ja tõhus tunnustamine ning järgimine;
|
||||
arvestades, et Euroopa Nõukogu eesmärk on saavutada tema
|
||||
liikmete suurem ühtsus ning et üheks selle eesmärgi saavutamise
|
||||
vahendiks on inimõiguste ja põhivabaduste järgimine ning
|
||||
elluviimine;
|
||||
taaskinnitades oma sügavat usku neisse põhivabadustesse, mis
|
||||
on õigluse ja rahu aluseks maailmas ning mida kõige paremini
|
||||
tagab ühelt poolt tõhus poliitiline demokraatia ning teiselt poolt
|
||||
inimõiguste, millest nad sõltuvad, üldine mõistmine ja järgimine;
|
||||
"""
|
||||
tokens = et_tokenizer(text)
|
||||
assert len(tokens) == 94
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_ordinal_number(et_tokenizer):
|
||||
text = "10. detsembril 1948"
|
||||
tokens = et_tokenizer(text)
|
||||
assert len(tokens) == 3
|
29
spacy/tests/lang/et/test_tokenizer.py
Normal file
29
spacy/tests/lang/et/test_tokenizer.py
Normal file
|
@ -0,0 +1,29 @@
|
|||
import pytest
|
||||
|
||||
ET_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Kedagi ei või piinata ega ebainimlikult või alandavalt kohelda "
|
||||
"ega karistada.",
|
||||
[
|
||||
"Kedagi",
|
||||
"ei",
|
||||
"või",
|
||||
"piinata",
|
||||
"ega",
|
||||
"ebainimlikult",
|
||||
"või",
|
||||
"alandavalt",
|
||||
"kohelda",
|
||||
"ega",
|
||||
"karistada",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", ET_BASIC_TOKENIZATION_TESTS)
|
||||
def test_et_tokenizer_basic(et_tokenizer, text, expected_tokens):
|
||||
tokens = et_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
0
spacy/tests/lang/hr/__init__.py
Normal file
0
spacy/tests/lang/hr/__init__.py
Normal file
26
spacy/tests/lang/hr/test_text.py
Normal file
26
spacy/tests/lang/hr/test_text.py
Normal file
|
@ -0,0 +1,26 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(hr_tokenizer):
|
||||
# Excerpt: European Convention on Human Rights
|
||||
text = """
|
||||
uzimajući u obzir da ta deklaracija nastoji osigurati opće i djelotvorno
|
||||
priznanje i poštovanje u njoj proglašenih prava;
|
||||
uzimajući u obzir da je cilj Vijeća Europe postizanje većeg jedinstva
|
||||
njegovih članica, i da je jedan od načina postizanja toga cilja
|
||||
očuvanje i daljnje ostvarivanje ljudskih prava i temeljnih sloboda;
|
||||
potvrđujući svoju duboku privrženost tim temeljnim slobodama
|
||||
koje su osnova pravde i mira u svijetu i koje su najbolje zaštićene
|
||||
istinskom političkom demokracijom s jedne strane te zajedničkim
|
||||
razumijevanjem i poštovanjem ljudskih prava o kojima te slobode
|
||||
ovise s druge strane;
|
||||
"""
|
||||
tokens = hr_tokenizer(text)
|
||||
assert len(tokens) == 105
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_ordinal_number(hr_tokenizer):
|
||||
text = "10. prosinca 1948"
|
||||
tokens = hr_tokenizer(text)
|
||||
assert len(tokens) == 3
|
31
spacy/tests/lang/hr/test_tokenizer.py
Normal file
31
spacy/tests/lang/hr/test_tokenizer.py
Normal file
|
@ -0,0 +1,31 @@
|
|||
import pytest
|
||||
|
||||
HR_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Nitko se ne smije podvrgnuti mučenju ni nečovječnom ili "
|
||||
"ponižavajućem postupanju ili kazni.",
|
||||
[
|
||||
"Nitko",
|
||||
"se",
|
||||
"ne",
|
||||
"smije",
|
||||
"podvrgnuti",
|
||||
"mučenju",
|
||||
"ni",
|
||||
"nečovječnom",
|
||||
"ili",
|
||||
"ponižavajućem",
|
||||
"postupanju",
|
||||
"ili",
|
||||
"kazni",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", HR_BASIC_TOKENIZATION_TESTS)
|
||||
def test_hr_tokenizer_basic(hr_tokenizer, text, expected_tokens):
|
||||
tokens = hr_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
0
spacy/tests/lang/is/__init__.py
Normal file
0
spacy/tests/lang/is/__init__.py
Normal file
26
spacy/tests/lang/is/test_text.py
Normal file
26
spacy/tests/lang/is/test_text.py
Normal file
|
@ -0,0 +1,26 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(is_tokenizer):
|
||||
# Excerpt: European Convention on Human Rights
|
||||
text = """
|
||||
hafa í huga, að yfirlýsing þessi hefur það markmið að tryggja
|
||||
almenna og raunhæfa viðurkenningu og vernd þeirra réttinda,
|
||||
sem þar er lýst;
|
||||
hafa í huga, að markmið Evrópuráðs er að koma á nánari einingu
|
||||
aðildarríkjanna og að ein af leiðunum að því marki er sú, að
|
||||
mannréttindi og mannfrelsi séu í heiðri höfð og efld;
|
||||
lýsa á ný eindreginni trú sinni á það mannfrelsi, sem er undirstaða
|
||||
réttlætis og friðar í heiminum og best er tryggt, annars vegar með
|
||||
virku, lýðræðislegu stjórnarfari og, hins vegar, almennum skilningi
|
||||
og varðveislu þeirra mannréttinda, sem eru grundvöllur frelsisins;
|
||||
"""
|
||||
tokens = is_tokenizer(text)
|
||||
assert len(tokens) == 120
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_ordinal_number(is_tokenizer):
|
||||
text = "10. desember 1948"
|
||||
tokens = is_tokenizer(text)
|
||||
assert len(tokens) == 3
|
30
spacy/tests/lang/is/test_tokenizer.py
Normal file
30
spacy/tests/lang/is/test_tokenizer.py
Normal file
|
@ -0,0 +1,30 @@
|
|||
import pytest
|
||||
|
||||
IS_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Enginn maður skal sæta pyndingum eða ómannlegri eða "
|
||||
"vanvirðandi meðferð eða refsingu. ",
|
||||
[
|
||||
"Enginn",
|
||||
"maður",
|
||||
"skal",
|
||||
"sæta",
|
||||
"pyndingum",
|
||||
"eða",
|
||||
"ómannlegri",
|
||||
"eða",
|
||||
"vanvirðandi",
|
||||
"meðferð",
|
||||
"eða",
|
||||
"refsingu",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", IS_BASIC_TOKENIZATION_TESTS)
|
||||
def test_is_tokenizer_basic(is_tokenizer, text, expected_tokens):
|
||||
tokens = is_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
0
spacy/tests/lang/lv/__init__.py
Normal file
0
spacy/tests/lang/lv/__init__.py
Normal file
27
spacy/tests/lang/lv/test_text.py
Normal file
27
spacy/tests/lang/lv/test_text.py
Normal file
|
@ -0,0 +1,27 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(lv_tokenizer):
|
||||
# Excerpt: European Convention on Human Rights
|
||||
text = """
|
||||
Ievērodamas, ka šī deklarācija paredz nodrošināt vispārēju un
|
||||
efektīvu tajā pasludināto tiesību atzīšanu un ievērošanu;
|
||||
Ievērodamas, ka Eiropas Padomes mērķis ir panākt lielāku vienotību
|
||||
tās dalībvalstu starpā un ka viens no līdzekļiem, kā šo mērķi
|
||||
sasniegt, ir cilvēka tiesību un pamatbrīvību ievērošana un turpmāka
|
||||
īstenošana;
|
||||
No jauna apliecinādamas patiesu pārliecību, ka šīs pamatbrīvības
|
||||
ir taisnīguma un miera pamats visā pasaulē un ka tās vislabāk var
|
||||
nodrošināt patiess demokrātisks politisks režīms no vienas puses un
|
||||
vispārējo cilvēktiesību, uz kurām tās pamatojas, kopīga izpratne un
|
||||
ievērošana no otras puses;
|
||||
"""
|
||||
tokens = lv_tokenizer(text)
|
||||
assert len(tokens) == 109
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_ordinal_number(lv_tokenizer):
|
||||
text = "10. decembrī"
|
||||
tokens = lv_tokenizer(text)
|
||||
assert len(tokens) == 2
|
30
spacy/tests/lang/lv/test_tokenizer.py
Normal file
30
spacy/tests/lang/lv/test_tokenizer.py
Normal file
|
@ -0,0 +1,30 @@
|
|||
import pytest
|
||||
|
||||
LV_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Nevienu nedrīkst spīdzināt vai cietsirdīgi vai pazemojoši ar viņu "
|
||||
"apieties vai sodīt.",
|
||||
[
|
||||
"Nevienu",
|
||||
"nedrīkst",
|
||||
"spīdzināt",
|
||||
"vai",
|
||||
"cietsirdīgi",
|
||||
"vai",
|
||||
"pazemojoši",
|
||||
"ar",
|
||||
"viņu",
|
||||
"apieties",
|
||||
"vai",
|
||||
"sodīt",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", LV_BASIC_TOKENIZATION_TESTS)
|
||||
def test_lv_tokenizer_basic(lv_tokenizer, text, expected_tokens):
|
||||
tokens = lv_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
0
spacy/tests/lang/sk/__init__.py
Normal file
0
spacy/tests/lang/sk/__init__.py
Normal file
48
spacy/tests/lang/sk/test_text.py
Normal file
48
spacy/tests/lang/sk/test_text.py
Normal file
|
@ -0,0 +1,48 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(sk_tokenizer):
|
||||
# Excerpt: European Convention on Human Rights
|
||||
text = """
|
||||
majúc na zreteli, že cieľom tejto deklarácie je zabezpečiť všeobecné
|
||||
a účinné uznávanie a dodržiavanie práv v nej vyhlásených;
|
||||
majúc na zreteli, že cieľom Rady Európy je dosiahnutie väčšej
|
||||
jednoty medzi jej členmi, a že jedným zo spôsobov, ktorým sa
|
||||
má tento cieľ napĺňať, je ochrana a ďalší rozvoj ľudských práv
|
||||
a základných slobôd;
|
||||
znovu potvrdzujúc svoju hlbokú vieru v tie základné slobody, ktoré
|
||||
sú základom spravodlivosti a mieru vo svete, a ktoré sú najlepšie
|
||||
zachovávané na jednej strane účinnou politickou demokraciou
|
||||
a na strane druhej spoločným poňatím a dodržiavaním ľudských
|
||||
práv, od ktorých závisia;
|
||||
"""
|
||||
tokens = sk_tokenizer(text)
|
||||
assert len(tokens) == 118
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text,match",
|
||||
[
|
||||
("10", True),
|
||||
("1", True),
|
||||
("10,000", True),
|
||||
("10,00", True),
|
||||
("štyri", True),
|
||||
("devätnásť", True),
|
||||
("milión", True),
|
||||
("pes", False),
|
||||
(",", False),
|
||||
("1/2", True),
|
||||
],
|
||||
)
|
||||
def test_lex_attrs_like_number(sk_tokenizer, text, match):
|
||||
tokens = sk_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].like_num == match
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_ordinal_number(sk_tokenizer):
|
||||
text = "10. decembra 1948"
|
||||
tokens = sk_tokenizer(text)
|
||||
assert len(tokens) == 3
|
15
spacy/tests/lang/sk/test_tokenizer.py
Normal file
15
spacy/tests/lang/sk/test_tokenizer.py
Normal file
|
@ -0,0 +1,15 @@
|
|||
import pytest
|
||||
|
||||
SK_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Kedy sa narodil Andrej Kiska?",
|
||||
["Kedy", "sa", "narodil", "Andrej", "Kiska", "?"],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", SK_BASIC_TOKENIZATION_TESTS)
|
||||
def test_sk_tokenizer_basic(sk_tokenizer, text, expected_tokens):
|
||||
tokens = sk_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
0
spacy/tests/lang/sl/__init__.py
Normal file
0
spacy/tests/lang/sl/__init__.py
Normal file
27
spacy/tests/lang/sl/test_text.py
Normal file
27
spacy/tests/lang/sl/test_text.py
Normal file
|
@ -0,0 +1,27 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(sl_tokenizer):
|
||||
# Excerpt: European Convention on Human Rights
|
||||
text = """
|
||||
upoštevajoč, da si ta deklaracija prizadeva zagotoviti splošno in
|
||||
učinkovito priznavanje in spoštovanje v njej razglašenih pravic,
|
||||
upoštevajoč, da je cilj Sveta Evrope doseči večjo enotnost med
|
||||
njegovimi članicami, in da je eden izmed načinov za zagotavljanje
|
||||
tega cilja varstvo in nadaljnji razvoj človekovih pravic in temeljnih
|
||||
svoboščin,
|
||||
ponovno potrjujoč svojo globoko vero v temeljne svoboščine, na
|
||||
katerih temeljita pravičnost in mir v svetu, in ki jih je mogoče najbolje
|
||||
zavarovati na eni strani z dejansko politično demokracijo in na drugi
|
||||
strani s skupnim razumevanjem in spoštovanjem človekovih pravic,
|
||||
od katerih so te svoboščine odvisne,
|
||||
"""
|
||||
tokens = sl_tokenizer(text)
|
||||
assert len(tokens) == 116
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_ordinal_number(sl_tokenizer):
|
||||
text = "10. decembra 1948"
|
||||
tokens = sl_tokenizer(text)
|
||||
assert len(tokens) == 3
|
32
spacy/tests/lang/sl/test_tokenizer.py
Normal file
32
spacy/tests/lang/sl/test_tokenizer.py
Normal file
|
@ -0,0 +1,32 @@
|
|||
import pytest
|
||||
|
||||
SL_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Vsakdo ima pravico do spoštovanja njegovega zasebnega in "
|
||||
"družinskega življenja, doma in dopisovanja.",
|
||||
[
|
||||
"Vsakdo",
|
||||
"ima",
|
||||
"pravico",
|
||||
"do",
|
||||
"spoštovanja",
|
||||
"njegovega",
|
||||
"zasebnega",
|
||||
"in",
|
||||
"družinskega",
|
||||
"življenja",
|
||||
",",
|
||||
"doma",
|
||||
"in",
|
||||
"dopisovanja",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", SL_BASIC_TOKENIZATION_TESTS)
|
||||
def test_sl_tokenizer_basic(sl_tokenizer, text, expected_tokens):
|
||||
tokens = sl_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
0
spacy/tests/lang/sq/__init__.py
Normal file
0
spacy/tests/lang/sq/__init__.py
Normal file
25
spacy/tests/lang/sq/test_text.py
Normal file
25
spacy/tests/lang/sq/test_text.py
Normal file
|
@ -0,0 +1,25 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(sq_tokenizer):
|
||||
# Excerpt: European Convention on Human Rights
|
||||
text = """
|
||||
Qeveritë nënshkruese, anëtare të Këshillit të Evropës,
|
||||
Duke pasur parasysh Deklaratën Universale të të Drejtave të
|
||||
Njeriut, të shpallur nga Asambleja e Përgjithshme e Kombeve të
|
||||
Bashkuara më 10 dhjetor 1948;
|
||||
Duke pasur parasysh, se kjo Deklaratë ka për qëllim të sigurojë
|
||||
njohjen dhe zbatimin universal dhe efektiv të të drejtave të
|
||||
shpallura në të;
|
||||
Duke pasur parasysh se qëllimi i Këshillit të Evropës është që të
|
||||
realizojë një bashkim më të ngushtë midis anëtarëve të tij dhe
|
||||
se një nga mjetet për të arritur këtë qëllim është mbrojtja dhe
|
||||
zhvillimi i të drejtave të njeriut dhe i lirive themelore;
|
||||
Duke ripohuar besimin e tyre të thellë në këto liri themelore që
|
||||
përbëjnë themelet e drejtësisë dhe të paqes në botë, ruajtja e të
|
||||
cilave mbështetet kryesisht mbi një regjim politik demokratik nga
|
||||
njëra anë, dhe nga ana tjetër mbi një kuptim dhe respektim të
|
||||
përbashkët të të drejtave të njeriut nga të cilat varen;
|
||||
"""
|
||||
tokens = sq_tokenizer(text)
|
||||
assert len(tokens) == 182
|
31
spacy/tests/lang/sq/test_tokenizer.py
Normal file
31
spacy/tests/lang/sq/test_tokenizer.py
Normal file
|
@ -0,0 +1,31 @@
|
|||
import pytest
|
||||
|
||||
SQ_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Askush nuk mund t’i nënshtrohet torturës ose dënimeve ose "
|
||||
"trajtimeve çnjerëzore ose poshtëruese.",
|
||||
[
|
||||
"Askush",
|
||||
"nuk",
|
||||
"mund",
|
||||
"t’i",
|
||||
"nënshtrohet",
|
||||
"torturës",
|
||||
"ose",
|
||||
"dënimeve",
|
||||
"ose",
|
||||
"trajtimeve",
|
||||
"çnjerëzore",
|
||||
"ose",
|
||||
"poshtëruese",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", SQ_BASIC_TOKENIZATION_TESTS)
|
||||
def test_sq_tokenizer_basic(sq_tokenizer, text, expected_tokens):
|
||||
tokens = sq_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
0
spacy/tests/lang/xx/__init__.py
Normal file
0
spacy/tests/lang/xx/__init__.py
Normal file
24
spacy/tests/lang/xx/test_text.py
Normal file
24
spacy/tests/lang/xx/test_text.py
Normal file
|
@ -0,0 +1,24 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(xx_tokenizer):
|
||||
# Excerpt: Text in Skolt Sami taken from https://www.samediggi.fi
|
||||
text = """
|
||||
Säʹmmla lie Euroopp unioon oʹdinakai alggmeer. Säʹmmlai alggmeerstatus lij raʹvvjum Lääʹddjânnam vuâđđlääʹjjest.
|
||||
Alggmeer kriteeʹr vuâđđâʹvve meeraikõskksaž tuâjjorganisaatio, ILO, suåppmõʹšše nââmar 169.
|
||||
Suåppmõõžž mieʹldd jiõččvälddsaž jânnmin jälsteei meeraid ââʹnet alggmeeran,
|
||||
ko sij puõlvvâʹvve naroodâst, kååʹtt jânnam välddmõõžž leʹbe aazztummuž leʹbe ânnʼjõž riikkraaʹji šõddâm ääiʹj jälste
|
||||
jânnmest leʹbe tõn mäddtiõđlaž vuuʹdest, koozz jânnam kooll. Alggmeer ij leäkku mieʹrreei sââʹjest jiiʹjjes jälstemvuuʹdest.
|
||||
Alggmeer âlgg jiõčč ââʹnned jiiʹjjes alggmeeran leʹbe leeʹd tõn miõlâst, što sij lie alggmeer.
|
||||
Alggmeer lij õlggâm seeilted vuõiggâdvuõđlaž sââʹjest huõlǩâni obbnes leʹbe vueʹzzi jiiʹjjes sosiaalʼlaž, täälʼlaž,
|
||||
kulttuurlaž da poliittlaž instituutioid.
|
||||
|
||||
Säʹmmlai statuuzz ǩeeʹrjteš Lääʹddjânnam vuâđđläkka eeʹjj 1995. Säʹmmlain alggmeeran lij vuõiggâdvuõtt tuõʹllʼjed da
|
||||
ooudâsviikkâd ǩiõlâz da kulttuurâz di tõõzz kuulli ääʹrbvuâlaž jieʹllemvueʹjjeez. Sääʹmǩiõl ââʹnnmest veʹrǧǧniiʹǩǩi
|
||||
åʹrnn lij šiõttuum jiiʹjjes lääʹǩǩ. Säʹmmlain lij leämmaž eeʹjjest 1996 vueʹljeeʹl dommvuuʹdsteez ǩiõlâz da kulttuurâz kuõskki
|
||||
vuâđđlääʹjj meâldlaž jiõččvaaldâšm. Säʹmmlai jiõččvaldšma kuulli tuâjaid håidd säʹmmlai vaalin vaʹlljääm parlameʹntt,
|
||||
Sääʹmteʹǧǧ.
|
||||
"""
|
||||
|
||||
tokens = xx_tokenizer(text)
|
||||
assert len(tokens) == 179
|
25
spacy/tests/lang/xx/test_tokenizer.py
Normal file
25
spacy/tests/lang/xx/test_tokenizer.py
Normal file
|
@ -0,0 +1,25 @@
|
|||
import pytest
|
||||
|
||||
XX_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Lääʹddjânnmest lie nuʹtt 10 000 säʹmmliʹžžed. Seeʹst pâʹjjel",
|
||||
[
|
||||
"Lääʹddjânnmest",
|
||||
"lie",
|
||||
"nuʹtt",
|
||||
"10",
|
||||
"000",
|
||||
"säʹmmliʹžžed",
|
||||
".",
|
||||
"Seeʹst",
|
||||
"pâʹjjel",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", XX_BASIC_TOKENIZATION_TESTS)
|
||||
def test_xx_tokenizer_basic(xx_tokenizer, text, expected_tokens):
|
||||
tokens = xx_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
|
@ -22,6 +22,8 @@ TEST_PATTERNS = [
|
|||
([{"TEXT": {"VALUE": "foo"}}], 2, 0), # prev: (1, 0)
|
||||
([{"IS_DIGIT": -1}], 1, 0),
|
||||
([{"ORTH": -1}], 1, 0),
|
||||
([{"ENT_ID": -1}], 1, 0),
|
||||
([{"ENT_KB_ID": -1}], 1, 0),
|
||||
# Good patterns
|
||||
([{"TEXT": "foo"}, {"LOWER": "bar"}], 0, 0),
|
||||
([{"LEMMA": {"IN": ["love", "like"]}}, {"POS": "DET", "OP": "?"}], 0, 0),
|
||||
|
@ -33,6 +35,8 @@ TEST_PATTERNS = [
|
|||
([{"orth": "foo"}], 0, 0), # prev: xfail
|
||||
([{"IS_SENT_START": True}], 0, 0),
|
||||
([{"SENT_START": True}], 0, 0),
|
||||
([{"ENT_ID": "STRING"}], 0, 0),
|
||||
([{"ENT_KB_ID": "STRING"}], 0, 0),
|
||||
]
|
||||
|
||||
|
||||
|
|
|
@ -5,6 +5,8 @@ from spacy.tokens import Span
|
|||
from spacy.language import Language
|
||||
from spacy.pipeline import EntityRuler
|
||||
from spacy.errors import MatchPatternError
|
||||
from spacy.tests.util import make_tempdir
|
||||
|
||||
from thinc.api import NumpyOps, get_current_ops
|
||||
|
||||
|
||||
|
@ -238,3 +240,23 @@ def test_entity_ruler_multiprocessing(nlp, n_process):
|
|||
for doc in nlp.pipe(texts, n_process=2):
|
||||
for ent in doc.ents:
|
||||
assert ent.ent_id_ == "1234"
|
||||
|
||||
|
||||
def test_entity_ruler_serialize_jsonl(nlp, patterns):
|
||||
ruler = nlp.add_pipe("entity_ruler")
|
||||
ruler.add_patterns(patterns)
|
||||
with make_tempdir() as d:
|
||||
ruler.to_disk(d / "test_ruler.jsonl")
|
||||
ruler.from_disk(d / "test_ruler.jsonl") # read from an existing jsonl file
|
||||
with pytest.raises(ValueError):
|
||||
ruler.from_disk(d / "non_existing.jsonl") # read from a bad jsonl file
|
||||
|
||||
|
||||
def test_entity_ruler_serialize_dir(nlp, patterns):
|
||||
ruler = nlp.add_pipe("entity_ruler")
|
||||
ruler.add_patterns(patterns)
|
||||
with make_tempdir() as d:
|
||||
ruler.to_disk(d / "test_ruler")
|
||||
ruler.from_disk(d / "test_ruler") # read from an existing directory
|
||||
with pytest.raises(ValueError):
|
||||
ruler.from_disk(d / "non_existing_dir") # read from a bad directory
|
||||
|
|
|
@ -3,6 +3,8 @@ from spacy.pipeline.functions import merge_subtokens
|
|||
from spacy.language import Language
|
||||
from spacy.tokens import Span, Doc
|
||||
|
||||
from ..doc.test_underscore import clean_underscore # noqa: F401
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc(en_vocab):
|
||||
|
@ -74,3 +76,26 @@ def test_token_splitter():
|
|||
"i",
|
||||
]
|
||||
assert all(len(t.text) <= token_splitter.split_length for t in doc)
|
||||
|
||||
|
||||
@pytest.mark.usefixtures("clean_underscore")
|
||||
def test_factories_doc_cleaner():
|
||||
nlp = Language()
|
||||
nlp.add_pipe("doc_cleaner")
|
||||
doc = nlp.make_doc("text")
|
||||
doc.tensor = [1, 2, 3]
|
||||
doc = nlp(doc)
|
||||
assert doc.tensor is None
|
||||
|
||||
nlp = Language()
|
||||
nlp.add_pipe("doc_cleaner", config={"silent": False})
|
||||
with pytest.warns(UserWarning):
|
||||
doc = nlp("text")
|
||||
|
||||
Doc.set_extension("test_attr", default=-1)
|
||||
nlp = Language()
|
||||
nlp.add_pipe("doc_cleaner", config={"attrs": {"_.test_attr": 0}})
|
||||
doc = nlp.make_doc("text")
|
||||
doc._.test_attr = 100
|
||||
doc = nlp(doc)
|
||||
assert doc._.test_attr == 0
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
import pytest
|
||||
import numpy
|
||||
from numpy.testing import assert_array_equal, assert_almost_equal
|
||||
from thinc.api import get_current_ops
|
||||
from thinc.api import get_current_ops, Ragged
|
||||
|
||||
from spacy import util
|
||||
from spacy.lang.en import English
|
||||
|
@ -29,6 +29,7 @@ TRAIN_DATA_OVERLAPPING = [
|
|||
"I like London and Berlin",
|
||||
{"spans": {SPAN_KEY: [(7, 13, "LOC"), (18, 24, "LOC"), (7, 24, "DOUBLE_LOC")]}},
|
||||
),
|
||||
("", {"spans": {SPAN_KEY: []}}),
|
||||
]
|
||||
|
||||
|
||||
|
@ -365,3 +366,31 @@ def test_overfitting_IO_overlapping():
|
|||
"London and Berlin",
|
||||
}
|
||||
assert set([span.label_ for span in spans2]) == {"LOC", "DOUBLE_LOC"}
|
||||
|
||||
|
||||
def test_zero_suggestions():
|
||||
# Test with a suggester that returns 0 suggestions
|
||||
|
||||
@registry.misc("test_zero_suggester")
|
||||
def make_zero_suggester():
|
||||
def zero_suggester(docs, *, ops=None):
|
||||
if ops is None:
|
||||
ops = get_current_ops()
|
||||
return Ragged(
|
||||
ops.xp.zeros((0, 0), dtype="i"), ops.xp.zeros((len(docs),), dtype="i")
|
||||
)
|
||||
|
||||
return zero_suggester
|
||||
|
||||
fix_random_seed(0)
|
||||
nlp = English()
|
||||
spancat = nlp.add_pipe(
|
||||
"spancat",
|
||||
config={"suggester": {"@misc": "test_zero_suggester"}, "spans_key": SPAN_KEY},
|
||||
)
|
||||
train_examples = make_examples(nlp)
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
assert spancat.model.get_dim("nO") == 2
|
||||
assert set(spancat.labels) == {"LOC", "PERSON"}
|
||||
|
||||
nlp.update(train_examples, sgd=optimizer)
|
||||
|
|
|
@ -565,7 +565,16 @@ def test_get_third_party_dependencies():
|
|||
}
|
||||
},
|
||||
)
|
||||
get_third_party_dependencies(nlp.config) == []
|
||||
assert get_third_party_dependencies(nlp.config) == []
|
||||
|
||||
# Test with lang-specific factory
|
||||
@Dutch.factory("third_party_test")
|
||||
def test_factory(nlp, name):
|
||||
return lambda x: x
|
||||
|
||||
nlp.add_pipe("third_party_test")
|
||||
# Before #9674 this would throw an exception
|
||||
get_third_party_dependencies(nlp.config)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
|
|
|
@ -1,8 +1,9 @@
|
|||
import pytest
|
||||
|
||||
from spacy import displacy
|
||||
from spacy.displacy.render import DependencyRenderer, EntityRenderer
|
||||
from spacy.tokens import Span, Doc
|
||||
from spacy.lang.fa import Persian
|
||||
from spacy.tokens import Span, Doc
|
||||
|
||||
|
||||
def test_displacy_parse_ents(en_vocab):
|
||||
|
@ -12,7 +13,38 @@ def test_displacy_parse_ents(en_vocab):
|
|||
ents = displacy.parse_ents(doc)
|
||||
assert isinstance(ents, dict)
|
||||
assert ents["text"] == "But Google is starting from behind "
|
||||
assert ents["ents"] == [{"start": 4, "end": 10, "label": "ORG"}]
|
||||
assert ents["ents"] == [
|
||||
{"start": 4, "end": 10, "label": "ORG", "kb_id": "", "kb_url": "#"}
|
||||
]
|
||||
|
||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"], kb_id="Q95")]
|
||||
ents = displacy.parse_ents(doc)
|
||||
assert isinstance(ents, dict)
|
||||
assert ents["text"] == "But Google is starting from behind "
|
||||
assert ents["ents"] == [
|
||||
{"start": 4, "end": 10, "label": "ORG", "kb_id": "Q95", "kb_url": "#"}
|
||||
]
|
||||
|
||||
|
||||
def test_displacy_parse_ents_with_kb_id_options(en_vocab):
|
||||
"""Test that named entities with kb_id on a Doc are converted into displaCy's format."""
|
||||
doc = Doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"], kb_id="Q95")]
|
||||
|
||||
ents = displacy.parse_ents(
|
||||
doc, {"kb_url_template": "https://www.wikidata.org/wiki/{}"}
|
||||
)
|
||||
assert isinstance(ents, dict)
|
||||
assert ents["text"] == "But Google is starting from behind "
|
||||
assert ents["ents"] == [
|
||||
{
|
||||
"start": 4,
|
||||
"end": 10,
|
||||
"label": "ORG",
|
||||
"kb_id": "Q95",
|
||||
"kb_url": "https://www.wikidata.org/wiki/Q95",
|
||||
}
|
||||
]
|
||||
|
||||
|
||||
def test_displacy_parse_deps(en_vocab):
|
||||
|
|
|
@ -132,7 +132,7 @@ def init_vocab(
|
|||
logger.info(f"Added vectors: {vectors}")
|
||||
# warn if source model vectors are not identical
|
||||
sourced_vectors_hashes = nlp.meta.pop("_sourced_vectors_hashes", {})
|
||||
vectors_hash = hash(nlp.vocab.vectors.to_bytes())
|
||||
vectors_hash = hash(nlp.vocab.vectors.to_bytes(exclude=["strings"]))
|
||||
for sourced_component, sourced_vectors_hash in sourced_vectors_hashes.items():
|
||||
if vectors_hash != sourced_vectors_hash:
|
||||
warnings.warn(Warnings.W113.format(name=sourced_component))
|
||||
|
|
|
@ -31,6 +31,8 @@ def pretrain(
|
|||
allocator = config["training"]["gpu_allocator"]
|
||||
if use_gpu >= 0 and allocator:
|
||||
set_gpu_allocator(allocator)
|
||||
# ignore in pretraining because we're creating it now
|
||||
config["initialize"]["init_tok2vec"] = None
|
||||
nlp = load_model_from_config(config)
|
||||
_config = nlp.config.interpolate()
|
||||
P = registry.resolve(_config["pretraining"], schema=ConfigSchemaPretrain)
|
||||
|
|
|
@ -124,6 +124,14 @@ Instead of defining its own `Tok2Vec` instance, a model architecture like
|
|||
[Tagger](/api/architectures#tagger) can define a listener as its `tok2vec`
|
||||
argument that connects to the shared `tok2vec` component in the pipeline.
|
||||
|
||||
Listeners work by caching the `Tok2Vec` output for a given batch of `Doc`s. This
|
||||
means that in order for a component to work with the listener, the batch of
|
||||
`Doc`s passed to the listener must be the same as the batch of `Doc`s passed to
|
||||
the `Tok2Vec`. As a result, any manipulation of the `Doc`s which would affect
|
||||
`Tok2Vec` output, such as to create special contexts or remove `Doc`s for which
|
||||
no prediction can be made, must happen inside the model, **after** the call to
|
||||
the `Tok2Vec` component.
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ |

@@ -181,25 +181,25 @@ single corpus once and then divide it up into `train` and `dev` partitions.
This section defines settings and controls for the training and evaluation
process that are used when you run [`spacy train`](/api/cli#train).

| Name | Description |
| ----------------------- | ----------- |
| `accumulate_gradient` | Whether to divide the batch up into substeps. Defaults to `1`. ~~int~~ |
| `batcher` | Callable that takes an iterator of [`Doc`](/api/doc) objects and yields batches of `Doc`s. Defaults to [`batch_by_words`](/api/top-level#batch_by_words). ~~Callable[[Iterator[Doc], Iterator[List[Doc]]]]~~ |
| `before_to_disk` | Optional callback to modify `nlp` object right before it is saved to disk during and after training. Can be used to remove or reset config values or disable components. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ |
| `dev_corpus` | Dot notation of the config location defining the dev corpus. Defaults to `corpora.dev`. ~~str~~ |
| `dropout` | The dropout rate. Defaults to `0.1`. ~~float~~ |
| `eval_frequency` | How often to evaluate during training (steps). Defaults to `200`. ~~int~~ |
| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
| `annotating_components` | Pipeline component names that should set annotations on the predicted docs during training. See [here](/usage/training#annotating-components) for details. Defaults to `[]`. ~~List[str]~~ |
| `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
| `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ |
| `max_epochs` | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus should be streamed rather than loaded into memory with no shuffling within the training loop. Defaults to `0`. ~~int~~ |
| `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ |
| `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
| `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. ~~int~~ |
| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
| `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ |

| Name | Description |
| ---------------------------------------------------- | ----------- |
| `accumulate_gradient` | Whether to divide the batch up into substeps. Defaults to `1`. ~~int~~ |
| `batcher` | Callable that takes an iterator of [`Doc`](/api/doc) objects and yields batches of `Doc`s. Defaults to [`batch_by_words`](/api/top-level#batch_by_words). ~~Callable[[Iterator[Doc], Iterator[List[Doc]]]]~~ |
| `before_to_disk` | Optional callback to modify `nlp` object right before it is saved to disk during and after training. Can be used to remove or reset config values or disable components. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ |
| `dev_corpus` | Dot notation of the config location defining the dev corpus. Defaults to `corpora.dev`. ~~str~~ |
| `dropout` | The dropout rate. Defaults to `0.1`. ~~float~~ |
| `eval_frequency` | How often to evaluate during training (steps). Defaults to `200`. ~~int~~ |
| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
| `annotating_components` <Tag variant="new">3.1</Tag> | Pipeline component names that should set annotations on the predicted docs during training. See [here](/usage/training#annotating-components) for details. Defaults to `[]`. ~~List[str]~~ |
| `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
| `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ |
| `max_epochs` | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus should be streamed rather than loaded into memory with no shuffling within the training loop. Defaults to `0`. ~~int~~ |
| `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ |
| `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
| `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. ~~int~~ |
| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
| `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ |
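
As a hedged example of working with these settings programmatically, the snippet below overrides a few `[training]` values when loading a config. The file name `config.cfg` and the override values are placeholders.

```python
# A small sketch: load a training config and override selected [training] settings.
from spacy.util import load_config

config = load_config(
    "config.cfg",
    overrides={
        "training.max_steps": 1000,
        "training.eval_frequency": 100,
        "training.patience": 400,
        "training.dropout": 0.2,
    },
)
print(config["training"]["max_steps"])  # 1000
```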

### pretraining {#config-pretraining tag="section,optional"}

@@ -248,7 +248,7 @@ Also see the usage guides on the
| `after_init` | Optional callback to modify the `nlp` object after initialization. ~~Optional[Callable[[Language], Language]]~~ |
| `before_init` | Optional callback to modify the `nlp` object before initialization. ~~Optional[Callable[[Language], Language]]~~ |
| `components` | Additional arguments passed to the `initialize` method of a pipeline component, keyed by component name. If type annotations are available on the method, the config will be validated against them. The `initialize` methods will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Dict[str, Any]]~~ |
| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ |
| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. Ignored when actually running pretraining, as you're creating the file to be used later. ~~Optional[str]~~ |
| `lookups` | Additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). Defaults to `null`. ~~Optional[Lookups]~~ |
| `tokenizer` | Additional arguments passed to the `initialize` method of the specified tokenizer. Can be used for languages like Chinese that depend on dictionaries or trained models for tokenization. If type annotations are available on the method, the config will be validated against them. The `initialize` method will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Any]~~ |
| `vectors` | Name or path of pipeline containing pretrained word vectors to use, e.g. created with [`init vectors`](/api/cli#init-vectors). Defaults to `null`. ~~Optional[str]~~ |

@@ -44,6 +44,8 @@ rule-based matching are:
| `SPACY` | Token has a trailing space. ~~bool~~ |
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ |
| `ENT_TYPE` | The token's entity label. ~~str~~ |
| `ENT_ID` | The token's entity ID (`ent_id`). ~~str~~ |
| `ENT_KB_ID` | The token's entity knowledge base ID (`ent_kb_id`). ~~str~~ |
| `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
| `OP` | Operator or quantifier to determine how often to match a token pattern. ~~str~~ |
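
As a small, hedged sketch of matching on the entity attributes above, assuming a spaCy version that supports the `ENT_ID` and `ENT_KB_ID` pattern keys, the KB ID `"Q95"` is set manually here purely for illustration; in practice it would usually come from an `entity_linker` component.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Google is a search engine.")
# Manually attach an entity with a knowledge base ID for the sake of the example
doc.ents = [Span(doc, 0, 1, label="ORG", kb_id="Q95")]

matcher = Matcher(nlp.vocab)
# Match any single token whose entity KB ID is "Q95"
matcher.add("WIKIDATA_Q95", [[{"ENT_KB_ID": "Q95"}]])
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # Google
```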

@@ -130,3 +130,25 @@ exceed the transformer model max length.
| `min_length` | The minimum length for a token to be split. Defaults to `25`. ~~int~~ |
| `split_length` | The length of the split tokens. Defaults to `5`. ~~int~~ |
| **RETURNS** | The modified `Doc` with the split tokens. ~~Doc~~ |
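
Assuming the table above documents the built-in `token_splitter` factory, a minimal sketch of its behavior might look like this, with tokens of at least `min_length` characters split into pieces of `split_length` characters each:

```python
import spacy

nlp = spacy.blank("en")
# Split any token of 25+ characters into 5-character pieces
nlp.add_pipe("token_splitter", config={"min_length": 25, "split_length": 5})
doc = nlp("a" * 30)  # a single 30-character token
print([len(token.text) for token in doc])  # [5, 5, 5, 5, 5, 5]
```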

## doc_cleaner {#doc_cleaner tag="function" new="3.2.1"}

Clean up `Doc` attributes. Intended for use at the end of pipelines with
`tok2vec` or `transformer` pipeline components that store tensors and other
values that can require a lot of memory and frequently aren't needed after the
whole pipeline has run.

> #### Example
>
> ```python
> config = {"attrs": {"tensor": None}}
> nlp.add_pipe("doc_cleaner", config=config)
> doc = nlp("text")
> assert doc.tensor is None
> ```

| Setting | Description |
| ----------- | ----------- |
| `attrs` | A dict of the `Doc` attributes and the values to set them to. Defaults to `{"tensor": None, "_.trf_data": None}` to clean up after `tok2vec` and `transformer` components. ~~dict~~ |
| `silent` | If `False`, show warnings if attributes aren't found or can't be set. Defaults to `True`. ~~bool~~ |
| **RETURNS** | The modified `Doc` with the modified attributes. ~~Doc~~ |

@@ -313,11 +313,12 @@ If a setting is not present in the options, the default value will be used.
> displacy.serve(doc, style="ent", options=options)
> ```

| Name | Description |
| --------------------------------------- | ----------- |
| `ents` | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~ |
| `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ |
| `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](%%GITHUB_SPACY/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |

| Name | Description |
| ------------------------------------------------ | ----------- |
| `ents` | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~ |
| `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ |
| `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](%%GITHUB_SPACY/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |
| `kb_url_template` <Tag variant="new">3.2.1</Tag> | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in. ~~Optional[str]~~ |

By default, displaCy comes with colors for all entity types used by
[spaCy's trained pipelines](/models). If you're using custom entity types, you

@@ -326,6 +327,14 @@ or pipeline package can also expose a
[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
to add custom labels and their colors automatically.

By default, displaCy links to `#` for entities without a `kb_id` set on their
span. If you wish to link an entity to their URL then consider using the
`kb_url_template` option from above. For example if the `kb_id` on a span is
`Q95` and this is a Wikidata identifier then this option can be set to
`https://www.wikidata.org/wiki/{}`. Clicking on your entity in the rendered HTML
should redirect you to their Wikidata page, in this case
`https://www.wikidata.org/wiki/Q95`.
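
A minimal sketch of this behavior, with the entity and its Wikidata ID set manually for illustration:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Google was founded in 1998.")
doc.ents = [Span(doc, 0, 1, label="ORG", kb_id="Q95")]

options = {"kb_url_template": "https://www.wikidata.org/wiki/{}"}
html = displacy.render(doc, style="ent", options=options)
# The rendered markup should now link the entity to its Wikidata page
assert "https://www.wikidata.org/wiki/Q95" in html
```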

## registry {#registry source="spacy/util.py" new="3"}

spaCy's function registry extends

@@ -412,10 +421,10 @@ finished. To log each training step, a
and the accuracy scores on the development set.

The built-in, default logger is the ConsoleLogger, which prints results to the
console in tabular format. The
console in tabular format. The
[spacy-loggers](https://github.com/explosion/spacy-loggers) package, included as
a dependency of spaCy, enables other loggers: currently it provides one that sends
results to a [Weights & Biases](https://www.wandb.com/) dashboard.
a dependency of spaCy, enables other loggers: currently it provides one that
sends results to a [Weights & Biases](https://www.wandb.com/) dashboard.

Instead of using one of the built-in loggers, you can
[implement your own](/usage/training#custom-logging).
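
As a hedged sketch, a custom logger registered under an assumed name such as `"my_console_logger.v1"` could then be referenced from `[training.logger]` in the config. The `info` keys used below follow the built-in loggers.

```python
import sys
from typing import IO

import spacy


@spacy.registry.loggers("my_console_logger.v1")
def make_my_logger():
    def setup_logger(nlp, stdout: IO = sys.stdout, stderr: IO = sys.stderr):
        def log_step(info):
            # info is None for steps that aren't logged
            if info is not None:
                stdout.write(f"step {info['step']}: score {info['score']}\n")

        def finalize():
            stdout.write("training finished\n")

        return log_step, finalize

    return setup_logger
```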

@@ -466,7 +475,6 @@ start decreasing across epochs.

</Accordion>

## Readers {#readers}

### File readers {#file-readers source="github.com/explosion/srsly" new="3"}

@@ -391,8 +391,8 @@ A wide variety of PyTorch models are supported, but some might not work. If a
model doesn't seem to work feel free to open an
[issue](https://github.com/explosion/spacy/issues). Additionally note that
Transformers loaded in spaCy can only be used for tensors, and pretrained
task-specific heads or text generation features cannot be used as part of
the `transformer` pipeline component.
task-specific heads or text generation features cannot be used as part of the
`transformer` pipeline component.

<Infobox variant="warning">

@@ -715,8 +715,8 @@ network for a temporary task that forces the model to learn something about
sentence structure and word cooccurrence statistics.

Pretraining produces a **binary weights file** that can be loaded back in at the
start of training, using the configuration option `initialize.init_tok2vec`.
The weights file specifies an initial set of weights. Training then proceeds as
start of training, using the configuration option `initialize.init_tok2vec`. The
weights file specifies an initial set of weights. Training then proceeds as
normal.

You can only pretrain one subnetwork from your pipeline at a time, and the

@@ -751,15 +751,14 @@ layer = "tok2vec"

#### Connecting pretraining to training {#pretraining-training}

To benefit from pretraining, your training step needs to know to initialize
its `tok2vec` component with the weights learned from the pretraining step.
You do this by setting `initialize.init_tok2vec` to the filename of the
`.bin` file that you want to use from pretraining.

To benefit from pretraining, your training step needs to know to initialize its
`tok2vec` component with the weights learned from the pretraining step. You do
this by setting `initialize.init_tok2vec` to the filename of the `.bin` file
that you want to use from pretraining.

A pretraining step that runs for 5 epochs with an output path of `pretrain/`,
as an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`.
To make use of the final output, you could fill in this value in your config
file:

A pretraining step that runs for 5 epochs with an output path of `pretrain/`, as
an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`. To
make use of the final output, you could fill in this value in your config file:

```ini
### config.cfg

@@ -773,16 +772,14 @@ init_tok2vec = ${paths.init_tok2vec}

<Infobox variant="warning">

The outputs of `spacy pretrain` are not the same data format as the
pre-packaged static word vectors that would go into
[`initialize.vectors`](/api/data-formats#config-initialize).
The pretraining output consists of the weights that the `tok2vec`
component should start with in an existing pipeline, so it goes in
`initialize.init_tok2vec`.

The outputs of `spacy pretrain` are not the same data format as the pre-packaged
static word vectors that would go into
[`initialize.vectors`](/api/data-formats#config-initialize). The pretraining
output consists of the weights that the `tok2vec` component should start with in
an existing pipeline, so it goes in `initialize.init_tok2vec`.

</Infobox>

#### Pretraining objectives {#pretraining-objectives}

> ```ini

@@ -159,7 +159,7 @@ their contributions!

- All Universal Dependencies training data has been updated to v2.8.
- The Catalan data, tokenizer and lemmatizer have been updated, thanks to Carlos
  Rodriguez and the Barcelona Supercomputing Center!
  Rodriguez, Carme Armentano and the Barcelona Supercomputing Center!
- The transformer pipelines are trained using spacy-transformers v1.1, with
  improved IO and more options for
  [model config and output](/api/architectures#TransformerModel).

@@ -1752,6 +1752,23 @@
            },
            "category": ["courses"]
        },
        {
            "type": "education",
            "id": "applt-course",
            "title": "Applied Language Technology",
            "slogan": "NLP for newcomers using spaCy and Stanza",
            "description": "These learning materials provide an introduction to applied language technology for audiences who are unfamiliar with language technology and programming. The learning materials assume no previous knowledge of the Python programming language.",
            "url": "https://applied-language-technology.readthedocs.io/",
            "image": "https://www.mv.helsinki.fi/home/thiippal/images/applt-preview.jpg",
            "thumb": "https://applied-language-technology.readthedocs.io/en/latest/_static/logo.png",
            "author": "Tuomo Hiippala",
            "author_links": {
                "twitter": "tuomo_h",
                "github": "thiippal",
                "website": "https://www.mv.helsinki.fi/home/thiippal/"
            },
            "category": ["courses"]
        },
        {
            "type": "education",
            "id": "video-spacys-ner-model",

@@ -3592,6 +3609,32 @@
                "github": "xxyzz"
            },
            "category": ["standalone"]
        },
        {
            "id": "eng_spacysentiment",
            "title": "eng_spacysentiment",
            "slogan": "Simple sentiment analysis using spaCy pipelines",
            "description": "Sentiment analysis for simple english sentences using pre-trained spaCy pipelines",
            "github": "vishnunkumar/spacysentiment",
            "pip": "eng-spacysentiment",
            "code_example": [
                "import eng_spacysentiment",
                "nlp = eng_spacysentiment.load()",
                "text = \"Welcome to Arsenals official YouTube channel Watch as we take you closer and show you the personality of the club\"",
                "doc = nlp(text)",
                "print(doc.cats)",
                "# {'positive': 0.29878824949264526, 'negative': 0.7012117505073547}"
            ],
            "thumb": "",
            "image": "",
            "code_language": "python",
            "author": "Vishnu Nandakumar",
            "author_links": {
                "github": "Vishnunkumar",
                "twitter": "vishnun_uchiha"
            },
            "category": ["pipeline"],
            "tags": ["pipeline", "nlp", "sentiment"]
        }
    ],