Mirror of https://github.com/explosion/spaCy.git (synced 2024-11-11 20:28:20 +03:00)

Commit 3246cf8b2b: Merge branch 'master' into spacy.io
.github/contributors/peter-exos.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” next to one of the applicable statements below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [x] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry          |
| ------------------------------ | -------------- |
| Name                           | Peter Baumann  |
| Company name (if applicable)   | Exos Financial |
| Title or role (if applicable)  | data scientist |
| Date                           | Feb 1st, 2021  |
| GitHub username                | peter-exos     |
| Website (optional)             |                |
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.1"
+__version__ = "3.0.3"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"
@@ -16,7 +16,7 @@ import os
 from ..schemas import ProjectConfigSchema, validate
 from ..util import import_file, run_command, make_tempdir, registry, logger
-from ..util import is_compatible_version, ENV_VARS
+from ..util import is_compatible_version, SimpleFrozenDict, ENV_VARS
 from .. import about

 if TYPE_CHECKING:

@@ -111,26 +111,33 @@ def _parse_overrides(args: List[str], is_cli: bool = False) -> Dict[str, Any]:
                 value = "true"
             else:
                 value = args.pop(0)
-            # Just like we do in the config, we're calling json.loads on the
-            # values. But since they come from the CLI, it'd be unintuitive to
-            # explicitly mark strings with escaped quotes. So we're working
-            # around that here by falling back to a string if parsing fails.
-            # TODO: improve logic to handle simple types like list of strings?
-            try:
-                result[opt] = srsly.json_loads(value)
-            except ValueError:
-                result[opt] = str(value)
+            result[opt] = _parse_override(value)
         else:
             msg.fail(f"{err}: name should start with --", exits=1)
     return result


-def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
+def _parse_override(value: Any) -> Any:
+    # Just like we do in the config, we're calling json.loads on the
+    # values. But since they come from the CLI, it'd be unintuitive to
+    # explicitly mark strings with escaped quotes. So we're working
+    # around that here by falling back to a string if parsing fails.
+    # TODO: improve logic to handle simple types like list of strings?
+    try:
+        return srsly.json_loads(value)
+    except ValueError:
+        return str(value)
+
+
+def load_project_config(
+    path: Path, interpolate: bool = True, overrides: Dict[str, Any] = SimpleFrozenDict()
+) -> Dict[str, Any]:
     """Load the project.yml file from a directory and validate it. Also make
     sure that all directories defined in the config exist.

     path (Path): The path to the project directory.
     interpolate (bool): Whether to substitute project variables.
+    overrides (Dict[str, Any]): Optional config overrides.
     RETURNS (Dict[str, Any]): The loaded project.yml.
     """
     config_path = path / PROJECT_FILE

@@ -154,20 +161,36 @@ def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
         if not dir_path.exists():
             dir_path.mkdir(parents=True)
     if interpolate:
-        err = "project.yml validation error"
+        err = f"{PROJECT_FILE} validation error"
         with show_validation_error(title=err, hint_fill=False):
-            config = substitute_project_variables(config)
+            config = substitute_project_variables(config, overrides)
     return config


-def substitute_project_variables(config: Dict[str, Any], overrides: Dict = {}):
-    key = "vars"
+def substitute_project_variables(
+    config: Dict[str, Any],
+    overrides: Dict[str, Any] = SimpleFrozenDict(),
+    key: str = "vars",
+    env_key: str = "env",
+) -> Dict[str, Any]:
+    """Interpolate variables in the project file using the config system.
+
+    config (Dict[str, Any]): The project config.
+    overrides (Dict[str, Any]): Optional config overrides.
+    key (str): Key containing variables in project config.
+    env_key (str): Key containing environment variable mapping in project config.
+    RETURNS (Dict[str, Any]): The interpolated project config.
+    """
     config.setdefault(key, {})
-    config[key].update(overrides)
+    config.setdefault(env_key, {})
+    # Substitute references to env vars with their values
+    for config_var, env_var in config[env_key].items():
+        config[env_key][config_var] = _parse_override(os.environ.get(env_var, ""))
     # Need to put variables in the top scope again so we can have a top-level
     # section "project" (otherwise, a list of commands in the top scope wouldn't)
     # be allowed by Thinc's config system
-    cfg = Config({"project": config, key: config[key]})
+    cfg = Config({"project": config, key: config[key], env_key: config[env_key]})
+    cfg = Config().from_str(cfg.to_str(), overrides=overrides)
     interpolated = cfg.interpolate()
     return dict(interpolated["project"])
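The hunks above add two things to project.yml handling: config-style overrides and an "env" section that maps variable names to environment variable names. A minimal sketch of what that enables, assuming the helpers are importable from spacy.cli._util (the relative imports above suggest this module) and reusing the SPACY_TEST_FOO variable from the test at the bottom of this diff:

import os
from spacy.cli._util import substitute_project_variables

# project.yml contents as a dict: "vars" holds plain values, "env" maps
# variable names to environment variable names.
project = {
    "vars": {"a": 10},
    "env": {"foo": "SPACY_TEST_FOO"},
    "commands": [{"name": "x", "script": ["hello ${vars.a} ${env.foo}"]}],
}
os.environ["SPACY_TEST_FOO"] = "bar"
interpolated = substitute_project_variables(project, overrides={"vars.a": 20})
# The override replaces vars.a and the env var is substituted, so this
# should print something like: hello 20 bar
print(interpolated["commands"][0]["script"][0])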
@@ -175,10 +175,13 @@ def render_parses(
 def print_prf_per_type(
     msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str
 ) -> None:
-    data = [
-        (k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}")
-        for k, v in scores.items()
-    ]
+    data = []
+    for key, value in scores.items():
+        row = [key]
+        for k in ("p", "r", "f"):
+            v = value[k]
+            row.append(f"{v * 100:.2f}" if isinstance(v, (int, float)) else v)
+        data.append(row)
     msg.table(
         data,
         header=("", "P", "R", "F"),

@@ -191,7 +194,10 @@ def print_textcats_auc_per_cat(
     msg: Printer, scores: Dict[str, Dict[str, float]]
 ) -> None:
     msg.table(
-        [(k, f"{v:.2f}") for k, v in scores.items()],
+        [
+            (k, f"{v:.2f}" if isinstance(v, (float, int)) else v)
+            for k, v in scores.items()
+        ],
         header=("", "ROC AUC"),
         aligns=("l", "r"),
         title="Textcat ROC AUC (per label)",
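The isinstance() guards above exist so that a missing score (None, e.g. for a label without examples) is printed as-is instead of crashing the f-string formatting; the regression test for issue 7019 further down in this diff exercises exactly this. A small sketch:

from wasabi import msg
from spacy.cli.evaluate import print_prf_per_type, print_textcats_auc_per_cat

# Both tables now tolerate None values instead of raising on formatting,
# matching test_issue7019 below.
print_textcats_auc_per_cat(msg, {"LABEL_A": 0.4, "LABEL_C": None})
print_prf_per_type(
    msg, {"LABEL_B": {"p": None, "r": None, "f": None}}, name="textcat", type="label"
)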
@@ -3,19 +3,23 @@ from pathlib import Path
 from wasabi import msg
 import sys
 import srsly
+import typer

 from ... import about
 from ...git_info import GIT_VERSION
 from ...util import working_dir, run_command, split_command, is_cwd, join_command
 from ...util import SimpleFrozenList, is_minor_version_match, ENV_VARS
-from ...util import check_bool_env_var
+from ...util import check_bool_env_var, SimpleFrozenDict
 from .._util import PROJECT_FILE, PROJECT_LOCK, load_project_config, get_hash
-from .._util import get_checksum, project_cli, Arg, Opt, COMMAND
+from .._util import get_checksum, project_cli, Arg, Opt, COMMAND, parse_config_overrides


-@project_cli.command("run")
+@project_cli.command(
+    "run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True}
+)
 def project_run_cli(
     # fmt: off
+    ctx: typer.Context,  # This is only used to read additional arguments
     subcommand: str = Arg(None, help=f"Name of command defined in the {PROJECT_FILE}"),
     project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
     force: bool = Opt(False, "--force", "-F", help="Force re-running steps, even if nothing changed"),

@@ -33,13 +37,15 @@ def project_run_cli(
     if show_help or not subcommand:
         print_run_help(project_dir, subcommand)
     else:
-        project_run(project_dir, subcommand, force=force, dry=dry)
+        overrides = parse_config_overrides(ctx.args)
+        project_run(project_dir, subcommand, overrides=overrides, force=force, dry=dry)


 def project_run(
     project_dir: Path,
     subcommand: str,
     *,
+    overrides: Dict[str, Any] = SimpleFrozenDict(),
     force: bool = False,
     dry: bool = False,
     capture: bool = False,

@@ -59,7 +65,7 @@ def project_run(
     when you want to turn over execution to the command, and capture=True
     when you want to run the command more like a function.
     """
-    config = load_project_config(project_dir)
+    config = load_project_config(project_dir, overrides=overrides)
     commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
     workflows = config.get("workflows", {})
     validate_subcommand(commands.keys(), workflows.keys(), subcommand)
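With allow_extra_args enabled and the extra arguments routed through parse_config_overrides, project variables can now be overridden per run. A sketch under the assumption that the project defines a "train" command and a vars.batch_size setting (both hypothetical names used only for illustration):

from pathlib import Path
from spacy.cli.project.run import project_run

# Equivalent CLI form (the extra args are picked up via ctx.args):
#   python -m spacy project run train . --vars.batch_size 128
project_run(
    Path("."),            # project directory containing project.yml
    "train",              # hypothetical command name
    overrides={"vars.batch_size": 128},
    dry=True,             # show what would run without executing it
)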
@@ -28,6 +28,15 @@ bg:
     accuracy:
       name: iarfmoose/roberta-base-bulgarian
       size_factor: 3
+bn:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: sagorsarker/bangla-bert-base
+      size_factor: 3
+    accuracy:
+      name: sagorsarker/bangla-bert-base
+      size_factor: 3
 da:
   word_vectors: da_core_news_lg
   transformer:

@@ -104,10 +113,10 @@ hi:
   word_vectors: null
   transformer:
     efficiency:
-      name: monsoon-nlp/hindi-tpu-electra
+      name: ai4bharat/indic-bert
       size_factor: 3
     accuracy:
-      name: monsoon-nlp/hindi-tpu-electra
+      name: ai4bharat/indic-bert
       size_factor: 3
 id:
   word_vectors: null

@@ -185,10 +194,10 @@ si:
   word_vectors: null
   transformer:
     efficiency:
-      name: keshan/SinhalaBERTo
+      name: setu4993/LaBSE
       size_factor: 3
     accuracy:
-      name: keshan/SinhalaBERTo
+      name: setu4993/LaBSE
       size_factor: 3
 sv:
   word_vectors: null

@@ -203,10 +212,10 @@ ta:
   word_vectors: null
   transformer:
     efficiency:
-      name: monsoon-nlp/tamillion
+      name: ai4bharat/indic-bert
       size_factor: 3
     accuracy:
-      name: monsoon-nlp/tamillion
+      name: ai4bharat/indic-bert
       size_factor: 3
 te:
   word_vectors: null
@@ -579,8 +579,8 @@ class Errors:
     E922 = ("Component '{name}' has been initialized with an output dimension of "
             "{nO} - cannot add any more labels.")
     E923 = ("It looks like there is no proper sample data to initialize the "
-            "Model of component '{name}'. This is likely a bug in spaCy, so "
-            "feel free to open an issue: https://github.com/explosion/spaCy/issues")
+            "Model of component '{name}'. To check your input data paths and "
+            "annotation, run: python -m spacy debug data config.cfg")
     E924 = ("The '{name}' component does not seem to be initialized properly. "
             "This is likely a bug in spaCy, so feel free to open an issue: "
             "https://github.com/explosion/spaCy/issues")
spacy/lang/tn/__init__.py (new file, 18 lines)
@@ -0,0 +1,18 @@
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_INFIXES
from ...language import Language


class SetswanaDefaults(Language.Defaults):
    infixes = TOKENIZER_INFIXES
    stop_words = STOP_WORDS
    lex_attr_getters = LEX_ATTRS


class Setswana(Language):
    lang = "tn"
    Defaults = SetswanaDefaults


__all__ = ["Setswana"]
spacy/lang/tn/examples.py (new file, 15 lines)
@@ -0,0 +1,15 @@
"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.tn.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

sentences = [
    "Apple e nyaka go reka JSE ka tlhwatlhwa ta R1 billion",
    "Johannesburg ke toropo e kgolo mo Afrika Borwa.",
    "O ko kae?",
    "ke mang presidente ya Afrika Borwa?",
    "ke eng toropo kgolo ya Afrika Borwa?",
    "Nelson Mandela o belegwe leng?",
]
spacy/lang/tn/lex_attrs.py (new file, 107 lines)
@@ -0,0 +1,107 @@
from ...attrs import LIKE_NUM

_num_words = [
    "lefela",
    "nngwe",
    "pedi",
    "tharo",
    "nne",
    "tlhano",
    "thataro",
    "supa",
    "robedi",
    "robongwe",
    "lesome",
    "lesomenngwe",
    "lesomepedi",
    "sometharo",
    "somenne",
    "sometlhano",
    "somethataro",
    "somesupa",
    "somerobedi",
    "somerobongwe",
    "someamabedi",
    "someamararo",
    "someamane",
    "someamatlhano",
    "someamarataro",
    "someamasupa",
    "someamarobedi",
    "someamarobongwe",
    "lekgolo",
    "sekete",
    "milione",
    "bilione",
    "terilione",
    "kwatirilione",
    "gajillione",
    "bazillione",
]


_ordinal_words = [
    "ntlha",
    "bobedi",
    "boraro",
    "bone",
    "botlhano",
    "borataro",
    "bosupa",
    "borobedi",
    "borobongwe",
    "bolesome",
    "bolesomengwe",
    "bolesomepedi",
    "bolesometharo",
    "bolesomenne",
    "bolesometlhano",
    "bolesomethataro",
    "bolesomesupa",
    "bolesomerobedi",
    "bolesomerobongwe",
    "somamabedi",
    "someamararo",
    "someamane",
    "someamatlhano",
    "someamarataro",
    "someamasupa",
    "someamarobedi",
    "someamarobongwe",
    "lekgolo",
    "sekete",
    "milione",
    "bilione",
    "terilione",
    "kwatirilione",
    "gajillione",
    "bazillione",
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True

    text_lower = text.lower()
    if text_lower in _num_words:
        return True

    # Check ordinal number
    if text_lower in _ordinal_words:
        return True
    if text_lower.endswith("th"):
        if text_lower[:-2].isdigit():
            return True

    return False


LEX_ATTRS = {LIKE_NUM: like_num}
spacy/lang/tn/punctuation.py (new file, 19 lines)
@@ -0,0 +1,19 @@
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, HYPHENS
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA

_infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)


TOKENIZER_INFIXES = _infixes
spacy/lang/tn/stop_words.py (new file, 20 lines)
@@ -0,0 +1,20 @@
# Stop words
STOP_WORDS = set(
    """
ke gareng ga selekanyo tlhwatlhwa yo mongwe se
sengwe fa go le jalo gongwe ba na mo tikologong
jaaka kwa morago nna gonne ka sa pele nako teng
tlase fela ntle magareng tsona feta bobedi kgabaganya
moo gape kgatlhanong botlhe tsotlhe bokana e esi
setseng mororo dinako golo kgolo nnye wena gago
o ntse ntle tla goreng gangwe mang yotlhe gore
eo yona tseraganyo eng ne sentle re rona thata
godimo fitlha pedi masomamabedi lesomepedi mmogo
tharo tseo boraro tseno yone jaanong bobona bona
lesome tsaya tsamaiso nngwe masomethataro thataro
tsa mmatota tota sale thoko supa dira tshwanetse di mmalwa masisi
bonala e tshwanang bogolo tsenya tsweetswee karolo
sepe tlhalosa dirwa robedi robongwe lesomenngwe gaisa
tlhano lesometlhano botlalo lekgolo
""".split()
)
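The five new files above register Setswana under the language code "tn". A quick usage sketch (the printed output is illustrative, not taken from the diff):

import spacy

# A blank Setswana pipeline picks up the tokenizer infixes, stop words and
# lexical attributes defined above.
nlp = spacy.blank("tn")
doc = nlp("Johannesburg ke toropo e kgolo mo Afrika Borwa.")
print([token.text for token in doc])
print([token.is_stop for token in doc])
# like_num recognises the Setswana number words from _num_words:
print([token.like_num for token in nlp("lesome 10 3/4")])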
@@ -451,7 +451,7 @@ cdef class Lexeme:
         Lexeme.c_set_flag(self.c, IS_QUOTE, x)

     property is_left_punct:
-        """RETURNS (bool): Whether the lexeme is left punctuation, e.g. )."""
+        """RETURNS (bool): Whether the lexeme is left punctuation, e.g. (."""
         def __get__(self):
             return Lexeme.c_check_flag(self.c, IS_LEFT_PUNCT)
@@ -18,4 +18,4 @@ cdef class PhraseMatcher:
     cdef Pool mem
     cdef key_t _terminal_hash

-    cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil
+    cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil
@@ -230,10 +230,10 @@ cdef class PhraseMatcher:
             result = internal_node
         map_set(self.mem, <MapStruct*>result, self.vocab.strings[key], NULL)

-    def __call__(self, doc, *, as_spans=False):
+    def __call__(self, object doclike, *, as_spans=False):
         """Find all sequences matching the supplied patterns on the `Doc`.

-        doc (Doc): The document to match over.
+        doclike (Doc or Span): The document to match over.
         as_spans (bool): Return Span objects with labels instead of (match_id,
             start, end) tuples.
         RETURNS (list): A list of `(match_id, start, end)` tuples,

@@ -244,12 +244,22 @@ cdef class PhraseMatcher:
         DOCS: https://spacy.io/api/phrasematcher#call
         """
         matches = []
-        if doc is None or len(doc) == 0:
+        if doclike is None or len(doclike) == 0:
             # if doc is empty or None just return empty list
             return matches
+        if isinstance(doclike, Doc):
+            doc = doclike
+            start_idx = 0
+            end_idx = len(doc)
+        elif isinstance(doclike, Span):
+            doc = doclike.doc
+            start_idx = doclike.start
+            end_idx = doclike.end
+        else:
+            raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
+
         cdef vector[SpanC] c_matches
-        self.find_matches(doc, &c_matches)
+        self.find_matches(doc, start_idx, end_idx, &c_matches)
         for i in range(c_matches.size()):
             matches.append((c_matches[i].label, c_matches[i].start, c_matches[i].end))
         for i, (ent_id, start, end) in enumerate(matches):

@@ -261,17 +271,17 @@ cdef class PhraseMatcher:
         else:
             return matches

-    cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil:
+    cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil:
         cdef MapStruct* current_node = self.c_map
         cdef int start = 0
-        cdef int idx = 0
-        cdef int idy = 0
+        cdef int idx = start_idx
+        cdef int idy = start_idx
         cdef key_t key
         cdef void* value
         cdef int i = 0
         cdef SpanC ms
         cdef void* result
-        while idx < doc.length:
+        while idx < end_idx:
             start = idx
             token = Token.get_struct_attr(&doc.c[idx], self.attr)
             # look for sequences from this position

@@ -279,7 +289,7 @@ cdef class PhraseMatcher:
             if result:
                 current_node = <MapStruct*>result
                 idy = idx + 1
-                while idy < doc.length:
+                while idy < end_idx:
                     result = map_get(current_node, self._terminal_hash)
                     if result:
                         i = 0
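A short sketch of the new behaviour (it mirrors the tests added further down in this diff): calling the matcher on a Span restricts matching to that window of the Doc.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("SPACY", [nlp.make_doc("Spans and Docs")])
doc = nlp("I like Spans and Docs in my input, and nothing else.")
print(matcher(doc))      # list of (match_id, start, end) over the whole Doc
print(matcher(doc[:8]))  # same call on a Span: only matches inside doc[:8]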
@@ -107,6 +107,7 @@ def init_ensemble_textcat(model, X, Y) -> Model:
     model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
     model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
     model.get_ref("norm_layer").set_dim("nI", tok2vec_width)
+    model.get_ref("norm_layer").set_dim("nO", tok2vec_width)
     init_chain(model, X, Y)
     return model
@@ -273,7 +273,7 @@ class EntityLinker(TrainablePipe):
         gradients = self.distance.get_grad(sentence_encodings, entity_encodings)
         loss = self.distance.get_loss(sentence_encodings, entity_encodings)
         loss = loss / len(entity_encodings)
-        return loss, gradients
+        return float(loss), gradients

     def predict(self, docs: Iterable[Doc]) -> List[str]:
         """Apply the pipeline's model to a batch of docs, without modifying them.
@@ -76,7 +76,7 @@ def merge_subtokens(doc: Doc, label: str = "subtok") -> Doc:
     retokenizes=True,
 )
 def make_token_splitter(
-    nlp: Language, name: str, *, min_length=0, split_length=0,
+    nlp: Language, name: str, *, min_length: int = 0, split_length: int = 0
 ):
     return TokenSplitter(min_length=min_length, split_length=split_length)
@@ -197,7 +197,7 @@ class ClozeMultitask(TrainablePipe):
         target = vectors[ids]
         gradient = self.distance.get_grad(prediction, target)
         loss = self.distance.get_loss(prediction, target)
-        return loss, gradient
+        return float(loss), gradient

     def update(self, examples, *, drop=0., sgd=None, losses=None):
         pass
@@ -121,7 +121,7 @@ class Tok2Vec(TrainablePipe):
         tokvecs = self.model.predict(docs)
         batch_id = Tok2VecListener.get_batch_id(docs)
         for listener in self.listeners:
-            listener.receive(batch_id, tokvecs, lambda dX: [])
+            listener.receive(batch_id, tokvecs, _empty_backprop)
         return tokvecs

     def set_annotations(self, docs: Sequence[Doc], tokvecses) -> None:

@@ -291,12 +291,18 @@ def forward(model: Tok2VecListener, inputs, is_train: bool):
         # of data.
         # When the components batch differently, we don't receive a matching
         # prediction from the upstream, so we can't predict.
-        if not all(doc.tensor.size for doc in inputs):
-            # But we do need to do *something* if the tensor hasn't been set.
-            # The compromise is to at least return data of the right shape,
-            # so the output is valid.
-            width = model.get_dim("nO")
-            outputs = [model.ops.alloc2f(len(doc), width) for doc in inputs]
-        else:
-            outputs = [doc.tensor for doc in inputs]
+        outputs = []
+        width = model.get_dim("nO")
+        for doc in inputs:
+            if doc.tensor.size == 0:
+                # But we do need to do *something* if the tensor hasn't been set.
+                # The compromise is to at least return data of the right shape,
+                # so the output is valid.
+                outputs.append(model.ops.alloc2f(len(doc), width))
+            else:
+                outputs.append(doc.tensor)
         return outputs, lambda dX: []


+def _empty_backprop(dX):  # for pickling
+    return []
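Replacing the lambda passed to the listeners with the module-level _empty_backprop is what makes a pipeline containing a Tok2VecListener picklable after prediction; test_issue6950 further down builds such a pipeline from a config and checks exactly this. A minimal sketch, assuming some installed pipeline that uses tok2vec listeners (en_core_web_sm is used here only as an example):

import pickle
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed to be installed
nlp("hello")              # runs predict(), which hands _empty_backprop to the listeners
data = pickle.dumps(nlp)  # previously this could fail on the un-picklable lambda
print(len(data) > 0)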
@@ -446,6 +446,7 @@ class ProjectConfigCommand(BaseModel):
 class ProjectConfigSchema(BaseModel):
     # fmt: off
     vars: Dict[StrictStr, Any] = Field({}, title="Optional variables to substitute in commands")
+    env: Dict[StrictStr, Any] = Field({}, title="Optional variable names to substitute in commands, mapped to environment variable names")
     assets: List[Union[ProjectConfigAssetURL, ProjectConfigAssetGit]] = Field([], title="Data assets")
     workflows: Dict[StrictStr, List[StrictStr]] = Field({}, title="Named workflows, mapped to list of project commands to run in order")
     commands: List[ProjectConfigCommand] = Field([], title="Project command shortucts")
@@ -8,7 +8,8 @@ from spacy.util import get_lang_class
 LANGUAGES = ["af", "ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", "es",
              "et", "fa", "fi", "fr", "ga", "he", "hi", "hr", "hu", "id", "is",
              "it", "kn", "lt", "lv", "nb", "nl", "pl", "pt", "ro", "si", "sk",
-             "sl", "sq", "sr", "sv", "ta", "te", "tl", "tr", "tt", "ur", 'yo']
+             "sl", "sq", "sr", "sv", "ta", "te", "tl", "tn", "tr", "tt", "ur",
+             "yo"]
 # fmt: on
@@ -323,3 +323,39 @@ def test_phrase_matcher_deprecated(en_vocab):
 @pytest.mark.parametrize("attr", ["SENT_START", "IS_SENT_START"])
 def test_phrase_matcher_sent_start(en_vocab, attr):
     _ = PhraseMatcher(en_vocab, attr=attr)  # noqa: F841
+
+
+def test_span_in_phrasematcher(en_vocab):
+    """Ensure that PhraseMatcher accepts Span and Doc as input"""
+    # fmt: off
+    words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
+    # fmt: on
+    doc = Doc(en_vocab, words=words)
+    span = doc[:8]
+    pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
+    matcher = PhraseMatcher(en_vocab)
+    matcher.add("SPACY", [pattern])
+    matches_doc = matcher(doc)
+    matches_span = matcher(span)
+    assert len(matches_doc) == 1
+    assert len(matches_span) == 1
+
+
+def test_span_v_doc_in_phrasematcher(en_vocab):
+    """Ensure that PhraseMatcher only returns matches in input Span and not in entire Doc"""
+    # fmt: off
+    words = [
+        "I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "Spans",
+        "and", "Docs", "in", "my", "matchers", "," "and", "Spans", "and", "Docs",
+        "everywhere", "."
+    ]
+    # fmt: on
+    doc = Doc(en_vocab, words=words)
+    span = doc[9:15]  # second clause
+    pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
+    matcher = PhraseMatcher(en_vocab)
+    matcher.add("SPACY", [pattern])
+    matches_doc = matcher(doc)
+    matches_span = matcher(span)
+    assert len(matches_doc) == 3
+    assert len(matches_span) == 1
@@ -451,13 +451,27 @@ def test_pipe_factories_from_source_config():
     assert config["arg"] == "world"


-def test_pipe_factories_decorator_idempotent():
+class PipeFactoriesIdempotent:
+    def __init__(self, nlp, name):
+        ...
+
+    def __call__(self, doc):
+        ...
+
+
+@pytest.mark.parametrize(
+    "i,func,func2",
+    [
+        (0, lambda nlp, name: lambda doc: doc, lambda doc: doc),
+        (1, PipeFactoriesIdempotent, PipeFactoriesIdempotent(None, None)),
+    ],
+)
+def test_pipe_factories_decorator_idempotent(i, func, func2):
     """Check that decorator can be run multiple times if the function is the
     same. This is especially relevant for live reloading because we don't
     want spaCy to raise an error if a module registering components is reloaded.
     """
-    name = "test_pipe_factories_decorator_idempotent"
-    func = lambda nlp, name: lambda doc: doc
+    name = f"test_pipe_factories_decorator_idempotent_{i}"
     for i in range(5):
         Language.factory(name, func=func)
     nlp = Language()

@@ -466,7 +480,6 @@ def test_pipe_factories_decorator_idempotent():
     # Make sure it also works for component decorator, which creates the
     # factory function
     name2 = f"{name}2"
-    func2 = lambda doc: doc
     for i in range(5):
         Language.component(name2, func=func2)
     nlp = Language()
spacy/tests/regression/test_issue6501-7000.py (new file, 229 lines)
@@ -0,0 +1,229 @@
import pytest
from spacy.lang.en import English
import numpy as np
import spacy
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
from spacy.tokens import DocBin
from spacy.util import load_config_from_str
from spacy.training import Example
from spacy.training.initialize import init_nlp
import pickle

from ..util import make_tempdir


def test_issue6730(en_vocab):
    """Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
    from spacy.kb import KnowledgeBase

    kb = KnowledgeBase(en_vocab, entity_vector_length=3)
    kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])

    with pytest.raises(ValueError):
        kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
    assert kb.contains_alias("") is False

    kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
    kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])

    with make_tempdir() as tmp_dir:
        kb.to_disk(tmp_dir)
        kb.from_disk(tmp_dir)
    assert kb.get_size_aliases() == 2
    assert set(kb.get_alias_strings()) == {"x", "y"}


def test_issue6755(en_tokenizer):
    doc = en_tokenizer("This is a magnificent sentence.")
    span = doc[:0]
    assert span.text_with_ws == ""
    assert span.text == ""


@pytest.mark.parametrize(
    "sentence, start_idx,end_idx,label",
    [("Welcome to Mumbai, my friend", 11, 17, "GPE")],
)
def test_issue6815_1(sentence, start_idx, end_idx, label):
    nlp = English()
    doc = nlp(sentence)
    span = doc[:].char_span(start_idx, end_idx, label=label)
    assert span.label_ == label


@pytest.mark.parametrize(
    "sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
)
def test_issue6815_2(sentence, start_idx, end_idx, kb_id):
    nlp = English()
    doc = nlp(sentence)
    span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
    assert span.kb_id == kb_id


@pytest.mark.parametrize(
    "sentence, start_idx,end_idx,vector",
    [("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
)
def test_issue6815_3(sentence, start_idx, end_idx, vector):
    nlp = English()
    doc = nlp(sentence)
    span = doc[:].char_span(start_idx, end_idx, vector=vector)
    assert (span.vector == vector).all()


def test_issue6839(en_vocab):
    """Ensure that PhraseMatcher accepts Span as input"""
    # fmt: off
    words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
    # fmt: on
    doc = Doc(en_vocab, words=words)
    span = doc[:8]
    pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
    matcher = PhraseMatcher(en_vocab)
    matcher.add("SPACY", [pattern])
    matches = matcher(span)
    assert matches


CONFIG_ISSUE_6908 = """
[paths]
train = "TRAIN_PLACEHOLDER"
raw = null
init_tok2vec = null
vectors = null

[system]
seed = 0
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["textcat"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000

[components]

[components.textcat]
factory = "TEXTCAT_PLACEHOLDER"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths:train}


[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
frozen_components = []
before_to_disk = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.components.textcat]
labels = ['label1', 'label2']

[initialize.tokenizer]
"""


@pytest.mark.parametrize(
    "component_name", ["textcat", "textcat_multilabel"],
)
def test_issue6908(component_name):
    """Test initializing textcat with labels in a list"""

    def create_data(out_file):
        nlp = spacy.blank("en")
        doc = nlp.make_doc("Some text")
        doc.cats = {"label1": 0, "label2": 1}
        out_data = DocBin(docs=[doc]).to_bytes()
        with out_file.open("wb") as file_:
            file_.write(out_data)

    with make_tempdir() as tmp_path:
        train_path = tmp_path / "train.spacy"
        create_data(train_path)
        config_str = CONFIG_ISSUE_6908.replace("TEXTCAT_PLACEHOLDER", component_name)
        config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
        config = load_config_from_str(config_str)
        init_nlp(config)


CONFIG_ISSUE_6950 = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.ner]
factory = "ner"

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "*"
"""


def test_issue6950():
    """Test that the nlp object with initialized tok2vec with listeners pickles
    correctly (and doesn't have lambdas).
    """
    nlp = English.from_config(load_config_from_str(CONFIG_ISSUE_6950))
    nlp.initialize(lambda: [Example.from_dict(nlp.make_doc("hello"), {"tags": ["V"]})])
    pickle.dumps(nlp)
    nlp("hello")
    pickle.dumps(nlp)
@@ -1,23 +0,0 @@
(deleted file: its test_issue6730, the KnowledgeBase empty-alias IO test, is identical to the version consolidated into spacy/tests/regression/test_issue6501-7000.py above)

@@ -1,5 +0,0 @@
(deleted file: its test_issue6755 empty-span test is identical to the version consolidated into test_issue6501-7000.py above)

@@ -1,35 +0,0 @@
(deleted file: its parametrized char_span tests, test_char_span_label, test_char_span_kb_id and test_char_span_vector, reappear in test_issue6501-7000.py above as test_issue6815_1/2/3)

@@ -1,102 +0,0 @@
(deleted file: its TEXTCAT_WITH_LABELS_ARRAY_CONFIG constant and test_textcat_initialize_labels_validation reappear in test_issue6501-7000.py above as CONFIG_ISSUE_6908 and test_issue6908; the deleted module additionally imported Language, util and ConfigSchemaInit)
spacy/tests/regression/test_issue7019.py (new file, 12 lines)
@@ -0,0 +1,12 @@
from spacy.cli.evaluate import print_textcats_auc_per_cat, print_prf_per_type
from wasabi import msg


def test_issue7019():
    scores = {"LABEL_A": 0.39829102, "LABEL_B": 0.938298329382, "LABEL_C": None}
    print_textcats_auc_per_cat(msg, scores)
    scores = {
        "LABEL_A": {"p": 0.3420302, "r": 0.3929020, "f": 0.49823928932},
        "LABEL_B": {"p": None, "r": None, "f": None},
    }
    print_prf_per_type(msg, scores, name="foo", type="bar")
spacy/tests/regression/test_issue7029.py (new file, 67 lines)
@@ -0,0 +1,67 @@
from spacy.lang.en import English
from spacy.training import Example
from spacy.util import load_config_from_str


CONFIG = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "*"
"""


TRAIN_DATA = [
    ("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
    ("Eat blue ham", {"tags": ["V", "J", "N"]}),
]


def test_issue7029():
    """Test that an empty document doesn't mess up an entire batch."""
    nlp = English.from_config(load_config_from_str(CONFIG))
    train_examples = []
    for t in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
    optimizer = nlp.initialize(get_examples=lambda: train_examples)
    for i in range(50):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
    nlp.select_pipes(enable=["tok2vec", "tagger"])
    docs1 = list(nlp.pipe(texts, batch_size=1))
    docs2 = list(nlp.pipe(texts, batch_size=4))
    assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]
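The regression test above guards against an empty document breaking batched processing. As a quick standalone illustration of the scenario (a sketch, not part of the diff), an empty string simply becomes a zero-token `Doc` when piped through a pipeline:

```python
import spacy

# Minimal illustration: an empty string in a batch yields an empty Doc
# and should not affect the other documents processed alongside it.
nlp = spacy.blank("en")
docs = list(nlp.pipe(["first text", "", "third text"]))
assert len(docs) == 3
assert len(docs[1]) == 0  # the empty input is a zero-token Doc
```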
@@ -325,6 +325,23 @@ def test_project_config_interpolation():
    substitute_project_variables(project)


+ def test_project_config_interpolation_env():
+     variables = {"a": 10}
+     env_var = "SPACY_TEST_FOO"
+     env_vars = {"foo": env_var}
+     commands = [{"name": "x", "script": ["hello ${vars.a} ${env.foo}"]}]
+     project = {"commands": commands, "vars": variables, "env": env_vars}
+     with make_tempdir() as d:
+         srsly.write_yaml(d / "project.yml", project)
+         cfg = load_project_config(d)
+     assert cfg["commands"][0]["script"][0] == "hello 10 "
+     os.environ[env_var] = "123"
+     with make_tempdir() as d:
+         srsly.write_yaml(d / "project.yml", project)
+         cfg = load_project_config(d)
+     assert cfg["commands"][0]["script"][0] == "hello 10 123"
+
+
@pytest.mark.parametrize(
    "args,expected",
    [
@@ -1,7 +1,9 @@
import pytest
import numpy
import srsly
+ from spacy.lang.en import English
from spacy.strings import StringStore
+ from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.attrs import NORM

@@ -20,7 +22,10 @@ def test_pickle_string_store(text1, text2):

@pytest.mark.parametrize("text1,text2", [("dog", "cat")])
def test_pickle_vocab(text1, text2):
-     vocab = Vocab(lex_attr_getters={int(NORM): lambda string: string[:-1]})
+     vocab = Vocab(
+         lex_attr_getters={int(NORM): lambda string: string[:-1]},
+         get_noun_chunks=English.Defaults.syntax_iterators.get("noun_chunks"),
+     )
    vocab.set_vector("dog", numpy.ones((5,), dtype="f"))
    lex1 = vocab[text1]
    lex2 = vocab[text2]

@@ -34,4 +39,23 @@ def test_pickle_vocab(text1, text2):
    assert unpickled[text2].norm == lex2.norm
    assert unpickled[text1].norm != unpickled[text2].norm
    assert unpickled.vectors is not None
+     assert unpickled.get_noun_chunks is not None
    assert list(vocab["dog"].vector) == [1.0, 1.0, 1.0, 1.0, 1.0]
+
+
+ def test_pickle_doc(en_vocab):
+     words = ["a", "b", "c"]
+     deps = ["dep"] * len(words)
+     heads = [0] * len(words)
+     doc = Doc(
+         en_vocab,
+         words=words,
+         deps=deps,
+         heads=heads,
+     )
+     data = srsly.pickle_dumps(doc)
+     unpickled = srsly.pickle_loads(data)
+     assert [t.text for t in unpickled] == words
+     assert [t.dep_ for t in unpickled] == deps
+     assert [t.head.i for t in unpickled] == heads
+     assert list(doc.noun_chunks) == []
@@ -55,6 +55,7 @@ def test_vocab_lexeme_add_flag_provided_id(en_vocab):
    assert en_vocab["199"].check_flag(IS_DIGIT) is False
    assert en_vocab["the"].check_flag(is_len4) is False
    assert en_vocab["dogs"].check_flag(is_len4) is True
+     en_vocab.add_flag(lambda string: string.isdigit(), flag_id=IS_DIGIT)


def test_vocab_lexeme_oov_rank(en_vocab):
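For context, `Vocab.add_flag` registers a custom boolean lexeme attribute and returns a flag ID that can be checked with `Lexeme.check_flag`. A short sketch of the general pattern (the flag name and predicate here are made up for illustration):

```python
import spacy

# Hypothetical custom flag: mark all-uppercase lexemes.
nlp = spacy.blank("en")
IS_SHOUTING = nlp.vocab.add_flag(lambda text: text.isupper())
assert nlp.vocab["HELLO"].check_flag(IS_SHOUTING) is True
assert nlp.vocab["hello"].check_flag(IS_SHOUTING) is False
```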
@@ -245,7 +245,7 @@ cdef class Tokenizer:
        cdef int offset
        cdef int modified_doc_length
        # Find matches for special cases
-         self._special_matcher.find_matches(doc, &c_matches)
+         self._special_matcher.find_matches(doc, 0, doc.length, &c_matches)
        # Skip processing if no matches
        if c_matches.size() == 0:
            return True
|
@ -215,8 +215,7 @@ def convert_vectors(
|
||||||
|
|
||||||
|
|
||||||
def read_vectors(vectors_loc: Path, truncate_vectors: int):
|
def read_vectors(vectors_loc: Path, truncate_vectors: int):
|
||||||
f = open_file(vectors_loc)
|
f = ensure_shape(vectors_loc)
|
||||||
f = ensure_shape(f)
|
|
||||||
shape = tuple(int(size) for size in next(f).split())
|
shape = tuple(int(size) for size in next(f).split())
|
||||||
if truncate_vectors >= 1:
|
if truncate_vectors >= 1:
|
||||||
shape = (truncate_vectors, shape[1])
|
shape = (truncate_vectors, shape[1])
|
||||||
|
@ -251,11 +250,12 @@ def open_file(loc: Union[str, Path]) -> IO:
|
||||||
return loc.open("r", encoding="utf8")
|
return loc.open("r", encoding="utf8")
|
||||||
|
|
||||||
|
|
||||||
def ensure_shape(lines):
|
def ensure_shape(vectors_loc):
|
||||||
"""Ensure that the first line of the data is the vectors shape.
|
"""Ensure that the first line of the data is the vectors shape.
|
||||||
If it's not, we read in the data and output the shape as the first result,
|
If it's not, we read in the data and output the shape as the first result,
|
||||||
so that the reader doesn't have to deal with the problem.
|
so that the reader doesn't have to deal with the problem.
|
||||||
"""
|
"""
|
||||||
|
lines = open_file(vectors_loc)
|
||||||
first_line = next(lines)
|
first_line = next(lines)
|
||||||
try:
|
try:
|
||||||
shape = tuple(int(size) for size in first_line.split())
|
shape = tuple(int(size) for size in first_line.split())
|
||||||
|
@ -269,7 +269,11 @@ def ensure_shape(lines):
|
||||||
# Figure out the shape, make it the first value, and then give the
|
# Figure out the shape, make it the first value, and then give the
|
||||||
# rest of the data.
|
# rest of the data.
|
||||||
width = len(first_line.split()) - 1
|
width = len(first_line.split()) - 1
|
||||||
captured = [first_line] + list(lines)
|
length = 1
|
||||||
length = len(captured)
|
for _ in lines:
|
||||||
|
length += 1
|
||||||
yield f"{length} {width}"
|
yield f"{length} {width}"
|
||||||
yield from captured
|
# Reading the lines in again from file. This to avoid having to
|
||||||
|
# store all the results in a list in memory
|
||||||
|
lines2 = open_file(vectors_loc)
|
||||||
|
yield from lines2
|
||||||
|
|
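The change above avoids materializing the whole vectors file in memory just to count its rows: the generator counts lines in a first pass and then re-opens the file to stream them. A standalone sketch of the same two-pass pattern on a plain text file (the helper name is hypothetical, not spaCy's internal function):

```python
from pathlib import Path
from typing import Iterator


def stream_with_shape(path: Path) -> Iterator[str]:
    """Yield "<rows> <width>" first, then every line of the file.

    Sketch of the two-pass approach: count the rows without buffering
    them, then re-open the file and stream the rows themselves.
    Assumes a non-empty whitespace-separated vectors file.
    """
    with path.open("r", encoding="utf8") as f:
        first = next(f)
        width = len(first.split()) - 1   # first column is the key
        length = 1 + sum(1 for _ in f)   # count remaining rows lazily
    yield f"{length} {width}"
    with path.open("r", encoding="utf8") as f:
        yield from f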
@@ -930,6 +930,8 @@ def is_same_func(func1: Callable, func2: Callable) -> bool:
    """
    if not callable(func1) or not callable(func2):
        return False
+     if not hasattr(func1, "__qualname__") or not hasattr(func2, "__qualname__"):
+         return False
    same_name = func1.__qualname__ == func2.__qualname__
    same_file = inspect.getfile(func1) == inspect.getfile(func2)
    same_code = inspect.getsourcelines(func1) == inspect.getsourcelines(func2)
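The new guard matters because not every callable carries a `__qualname__`. `functools.partial` objects are one common example, shown here purely as an illustration rather than as the case that motivated the change:

```python
from functools import partial


def evaluate(batch_size, verbose=False):
    return batch_size


wrapped = partial(evaluate, verbose=True)
# A partial is callable but has no __qualname__, so comparing
# __qualname__ directly would raise AttributeError without the guard.
assert callable(wrapped)
assert not hasattr(wrapped, "__qualname__")
```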
@@ -551,12 +551,13 @@ def pickle_vocab(vocab):
    data_dir = vocab.data_dir
    lex_attr_getters = srsly.pickle_dumps(vocab.lex_attr_getters)
    lookups = vocab.lookups
+     get_noun_chunks = vocab.get_noun_chunks
    return (unpickle_vocab,
-         (sstore, vectors, morph, data_dir, lex_attr_getters, lookups))
+         (sstore, vectors, morph, data_dir, lex_attr_getters, lookups, get_noun_chunks))


def unpickle_vocab(sstore, vectors, morphology, data_dir,
-         lex_attr_getters, lookups):
+         lex_attr_getters, lookups, get_noun_chunks):
    cdef Vocab vocab = Vocab()
    vocab.vectors = vectors
    vocab.strings = sstore

@@ -564,6 +565,7 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
    vocab.data_dir = data_dir
    vocab.lex_attr_getters = srsly.pickle_loads(lex_attr_getters)
    vocab.lookups = lookups
+     vocab.get_noun_chunks = get_noun_chunks
    return vocab
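With `get_noun_chunks` now part of the pickle state, the noun-chunk iterator survives a `Vocab` round trip. A short sketch of what that looks like from user code (assuming a language that ships syntax iterators, such as English):

```python
import spacy
import srsly

# Sketch: pickle a Vocab and check that the noun-chunk iterator is restored.
nlp = spacy.blank("en")
data = srsly.pickle_dumps(nlp.vocab)
restored = srsly.pickle_loads(data)
assert restored.get_noun_chunks is not None
```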
@@ -67,7 +67,7 @@ data format used by the lookup and rule-based lemmatizers, see
> lemmatizer = nlp.add_pipe("lemmatizer")
>
> # Construction via add_pipe with custom settings
- > config = {"mode": "rule", overwrite=True}
+ > config = {"mode": "rule", "overwrite": True}
> lemmatizer = nlp.add_pipe("lemmatizer", config=config)
> ```
@@ -44,7 +44,7 @@ be shown.

## PhraseMatcher.\_\_call\_\_ {#call tag="method"}

- Find all token sequences matching the supplied patterns on the `Doc`.
+ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.

> #### Example
>

@@ -59,7 +59,7 @@ Find all token sequences matching the supplied patterns on the `Doc`.

| Name | Description |
| --- | --- |
- | `doc` | The document to match over. ~~Doc~~ |
+ | `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
| _keyword-only_ | |
| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
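The doc change above reflects that `PhraseMatcher.__call__` now accepts a `Span` as well as a `Doc`. A brief sketch of what that enables (assuming a spaCy version that includes this change; the pattern and texts are placeholders):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("GREETING", [nlp.make_doc("good morning")])

doc = nlp("He said good morning. Then he left. She said good morning too.")
# Matching over a Span restricts the search to that slice of the Doc,
# while calling on the whole Doc behaves as before.
span_matches = matcher(doc[0:5])
doc_matches = matcher(doc)
print(len(span_matches), len(doc_matches))
```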
@@ -727,7 +727,7 @@ capitalization by including a mix of capitalized and lowercase examples. See the

Create a data augmentation callback that uses orth-variant replacement. The
callback can be added to a corpus or other data iterator during training. It's
- is especially useful for punctuation and case replacement, to help generalize
+ especially useful for punctuation and case replacement, to help generalize
beyond corpora that don't have smart quotes, or only have smart quotes etc.

| Name | Description |
@@ -4,8 +4,8 @@ import { Help } from 'components/typography'; import Link from 'components/link'

| Pipeline | Parser | Tagger | NER |
| --- | ---: | ---: | ---: |
- | [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.2 | 97.8 | 89.9 |
+ | [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.1 | 97.8 | 89.8 |
- | [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 91.9 | 97.4 | 85.5 |
+ | [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.0 | 97.4 | 85.5 |
| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | 85.5 |

<figcaption class="caption">

@@ -22,7 +22,7 @@ the development set).

| Named Entity Recognition System | OntoNotes | CoNLL '03 |
| --- | ---: | ---: |
- | spaCy RoBERTa (2020) | 89.7 | 91.6 |
+ | spaCy RoBERTa (2020) | 89.8 | 91.6 |
| Stanza (StanfordNLP)<sup>1</sup> | 88.8 | 92.1 |
| Flair<sup>2</sup> | 89.7 | 93.1 |

@@ -77,7 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

| Dependency Parsing System | UAS | LAS |
| --- | ---: | ---: |
- | spaCy RoBERTa (2020) | 95.5 | 94.3 |
+ | spaCy RoBERTa (2020) | 95.1 | 93.7 |
| [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
| [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019) | 97.2 | 95.7 |
@@ -69,9 +69,9 @@ python -m spacy project clone pipelines/tagger_parser_ud

By default, the project will be cloned into the current working directory. You
can specify an optional second argument to define the output directory. The
- `--repo` option lets you define a custom repo to clone from if you don't want
- to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
- can also use any private repo you have access to with Git.
+ `--repo` option lets you define a custom repo to clone from if you don't want to
+ use the spaCy [`projects`](https://github.com/explosion/projects) repo. You can
+ also use any private repo you have access to with Git.

### 2. Fetch the project assets {#assets}
@@ -221,6 +221,7 @@ pipelines.
| `title` | An optional project title used in `--help` message and [auto-generated docs](#custom-docs). |
| `description` | An optional project description used in [auto-generated docs](#custom-docs). |
| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
+ | `env` | A dictionary of variables, mapped to the names of environment variables that will be read in when running the project. For example, `${env.name}` will use the value of the environment variable defined as `name`. |
| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. |
| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
@@ -310,8 +311,8 @@ company-internal and not available over the internet. In that case, you can
specify the destination paths and a checksum, and leave out the URL. When your
teammates clone and run your project, they can place the files in the respective
directory themselves. The [`project assets`](/api/cli#project-assets) command
- will alert you about missing files and mismatched checksums, so you can ensure that
- others are running your project with the same data.
+ will alert you about missing files and mismatched checksums, so you can ensure
+ that others are running your project with the same data.

### Dependencies and outputs {#deps-outputs}
@@ -358,9 +359,10 @@ graphs based on the dependencies and outputs, and won't re-run previous steps
automatically. For instance, if you only run the command `train` that depends on
data created by `preprocess` and those files are missing, spaCy will show an
error – it won't just re-run `preprocess`. If you're looking for more advanced
- data management, check out the [Data Version Control (DVC) integration](#dvc). If you're planning on integrating your spaCy project with DVC, you
- can also use `outputs_no_cache` instead of `outputs` to define outputs that
- won't be cached or tracked.
+ data management, check out the [Data Version Control (DVC) integration](#dvc).
+ If you're planning on integrating your spaCy project with DVC, you can also use
+ `outputs_no_cache` instead of `outputs` to define outputs that won't be cached
+ or tracked.

### Files and directory structure {#project-files}
@@ -467,7 +469,9 @@ In your `project.yml`, you can then run the script by calling
`python scripts/custom_evaluation.py` with the function arguments. You can also
use the `vars` section to define reusable variables that will be substituted in
commands, paths and URLs. In this example, the batch size is defined as a
- variable will be added in place of `${vars.batch_size}` in the script.
+ variable will be added in place of `${vars.batch_size}` in the script. Just like
+ in the [training config](/usage/training##config-overrides), you can also
+ override settings on the command line – for example using `--vars.batch_size`.

> #### Calling into Python
>
@@ -491,6 +495,29 @@ commands:
    - 'corpus/eval.json'
```

+ You can also use the `env` section to reference **environment variables** and
+ make their values available to the commands. This can be useful for overriding
+ settings on the command line and passing through system-level settings.
+
+ > #### Usage example
+ >
+ > ```bash
+ > export GPU_ID=1
+ > BATCH_SIZE=128 python -m spacy project run evaluate
+ > ```
+
+ ```yaml
+ ### project.yml
+ env:
+   batch_size: BATCH_SIZE
+   gpu_id: GPU_ID
+
+ commands:
+   - name: evaluate
+     script:
+       - 'python scripts/custom_evaluation.py ${env.batch_size}'
+ ```

### Documenting your project {#custom-docs}

> #### Readme Example
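For completeness, the resolved `${env.batch_size}` value simply arrives as a command-line argument to the script named in the command. A hypothetical `scripts/custom_evaluation.py` (not part of the diff; the argument handling is just one way to do it) might look like:

```python
import sys

# Hypothetical sketch: the project command passes the resolved
# ${env.batch_size} value as the first CLI argument.
def main(batch_size: int) -> None:
    print(f"Evaluating with batch size {batch_size}")


if __name__ == "__main__":
    main(int(sys.argv[1]))
```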
@@ -185,7 +185,7 @@ sections of a config file are:

For a full overview of spaCy's config format and settings, see the
[data format documentation](/api/data-formats#config) and
- [Thinc's config system docs](https://thinc.ai/usage/config). The settings
+ [Thinc's config system docs](https://thinc.ai/docs/usage-config). The settings
available for the different architectures are documented with the
[model architectures API](/api/architectures). See the Thinc documentation for
[optimizers](https://thinc.ai/docs/api-optimizers) and
@@ -198,6 +198,7 @@
    "has_examples": true
  },
  { "code": "tl", "name": "Tagalog" },
+   { "code": "tn", "name": "Setswana", "has_examples": true },
  { "code": "tr", "name": "Turkish", "example": "Bu bir cümledir.", "has_examples": true },
  { "code": "tt", "name": "Tatar", "has_examples": true },
  {