Merge branch 'master' into spacy.io

This commit is contained in:
Ines Montani 2021-02-14 13:38:33 +11:00
commit 3246cf8b2b
47 changed files with 898 additions and 280 deletions

.github/contributors/peter-exos.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Peter Baumann |
| Company name (if applicable) | Exos Financial |
| Title or role (if applicable) | data scientist |
| Date | Feb 1st, 2021 |
| GitHub username | peter-exos |
| Website (optional) | |


@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy"
__version__ = "3.0.1"
__version__ = "3.0.3"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects"


@ -16,7 +16,7 @@ import os
from ..schemas import ProjectConfigSchema, validate
from ..util import import_file, run_command, make_tempdir, registry, logger
from ..util import is_compatible_version, ENV_VARS
from ..util import is_compatible_version, SimpleFrozenDict, ENV_VARS
from .. import about
if TYPE_CHECKING:
@ -111,26 +111,33 @@ def _parse_overrides(args: List[str], is_cli: bool = False) -> Dict[str, Any]:
value = "true"
else:
value = args.pop(0)
result[opt] = _parse_override(value)
else:
msg.fail(f"{err}: name should start with --", exits=1)
return result
def _parse_override(value: Any) -> Any:
# Just like we do in the config, we're calling json.loads on the
# values. But since they come from the CLI, it'd be unintuitive to
# explicitly mark strings with escaped quotes. So we're working
# around that here by falling back to a string if parsing fails.
# TODO: improve logic to handle simple types like list of strings?
try:
result[opt] = srsly.json_loads(value)
return srsly.json_loads(value)
except ValueError:
result[opt] = str(value)
else:
msg.fail(f"{err}: name should start with --", exits=1)
return result
return str(value)
def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
def load_project_config(
path: Path, interpolate: bool = True, overrides: Dict[str, Any] = SimpleFrozenDict()
) -> Dict[str, Any]:
"""Load the project.yml file from a directory and validate it. Also make
sure that all directories defined in the config exist.
path (Path): The path to the project directory.
interpolate (bool): Whether to substitute project variables.
overrides (Dict[str, Any]): Optional config overrides.
RETURNS (Dict[str, Any]): The loaded project.yml.
"""
config_path = path / PROJECT_FILE
@ -154,20 +161,36 @@ def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
if not dir_path.exists():
dir_path.mkdir(parents=True)
if interpolate:
err = "project.yml validation error"
err = f"{PROJECT_FILE} validation error"
with show_validation_error(title=err, hint_fill=False):
config = substitute_project_variables(config)
config = substitute_project_variables(config, overrides)
return config
def substitute_project_variables(config: Dict[str, Any], overrides: Dict = {}):
key = "vars"
def substitute_project_variables(
config: Dict[str, Any],
overrides: Dict[str, Any] = SimpleFrozenDict(),
key: str = "vars",
env_key: str = "env",
) -> Dict[str, Any]:
"""Interpolate variables in the project file using the config system.
config (Dict[str, Any]): The project config.
overrides (Dict[str, Any]): Optional config overrides.
key (str): Key containing variables in project config.
env_key (str): Key containing environment variable mapping in project config.
RETURNS (Dict[str, Any]): The interpolated project config.
"""
config.setdefault(key, {})
config[key].update(overrides)
config.setdefault(env_key, {})
# Substitute references to env vars with their values
for config_var, env_var in config[env_key].items():
config[env_key][config_var] = _parse_override(os.environ.get(env_var, ""))
# Need to put variables in the top scope again so we can have a top-level
# section "project" (otherwise, a list of commands in the top scope wouldn't)
# be allowed by Thinc's config system
cfg = Config({"project": config, key: config[key]})
cfg = Config({"project": config, key: config[key], env_key: config[env_key]})
cfg = Config().from_str(cfg.to_str(), overrides=overrides)
interpolated = cfg.interpolate()
return dict(interpolated["project"])
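For reference, the value coercion that `_parse_override` applies to CLI and environment values can be sketched on its own: JSON parsing first, plain string as the fallback. This is an illustrative snippet, not part of the commit.

```python
import srsly

def parse_override(value):
    # Mirrors _parse_override above: try JSON first, fall back to a plain string.
    try:
        return srsly.json_loads(value)
    except ValueError:
        return str(value)

assert parse_override("128") == 128                          # numbers are parsed
assert parse_override("false") is False                      # so are booleans
assert parse_override("en_core_web_sm") == "en_core_web_sm"  # everything else stays a string
```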


@ -175,10 +175,13 @@ def render_parses(
def print_prf_per_type(
msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str
) -> None:
data = [
(k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}")
for k, v in scores.items()
]
data = []
for key, value in scores.items():
row = [key]
for k in ("p", "r", "f"):
v = value[k]
row.append(f"{v * 100:.2f}" if isinstance(v, (int, float)) else v)
data.append(row)
msg.table(
data,
header=("", "P", "R", "F"),
@ -191,7 +194,10 @@ def print_textcats_auc_per_cat(
msg: Printer, scores: Dict[str, Dict[str, float]]
) -> None:
msg.table(
[(k, f"{v:.2f}") for k, v in scores.items()],
[
(k, f"{v:.2f}" if isinstance(v, (float, int)) else v)
for k, v in scores.items()
],
header=("", "ROC AUC"),
aligns=("l", "r"),
title="Textcat ROC AUC (per label)",


@ -3,19 +3,23 @@ from pathlib import Path
from wasabi import msg
import sys
import srsly
import typer
from ... import about
from ...git_info import GIT_VERSION
from ...util import working_dir, run_command, split_command, is_cwd, join_command
from ...util import SimpleFrozenList, is_minor_version_match, ENV_VARS
from ...util import check_bool_env_var
from ...util import check_bool_env_var, SimpleFrozenDict
from .._util import PROJECT_FILE, PROJECT_LOCK, load_project_config, get_hash
from .._util import get_checksum, project_cli, Arg, Opt, COMMAND
from .._util import get_checksum, project_cli, Arg, Opt, COMMAND, parse_config_overrides
@project_cli.command("run")
@project_cli.command(
"run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True}
)
def project_run_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
subcommand: str = Arg(None, help=f"Name of command defined in the {PROJECT_FILE}"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
force: bool = Opt(False, "--force", "-F", help="Force re-running steps, even if nothing changed"),
@ -33,13 +37,15 @@ def project_run_cli(
if show_help or not subcommand:
print_run_help(project_dir, subcommand)
else:
project_run(project_dir, subcommand, force=force, dry=dry)
overrides = parse_config_overrides(ctx.args)
project_run(project_dir, subcommand, overrides=overrides, force=force, dry=dry)
def project_run(
project_dir: Path,
subcommand: str,
*,
overrides: Dict[str, Any] = SimpleFrozenDict(),
force: bool = False,
dry: bool = False,
capture: bool = False,
@ -59,7 +65,7 @@ def project_run(
when you want to turn over execution to the command, and capture=True
when you want to run the command more like a function.
"""
config = load_project_config(project_dir)
config = load_project_config(project_dir, overrides=overrides)
commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
workflows = config.get("workflows", {})
validate_subcommand(commands.keys(), workflows.keys(), subcommand)
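With the extra-args passthrough, project variables can also be overridden when calling `project_run` from Python. A hedged sketch, where the `train` command and `vars.batch_size` are placeholders for whatever the project actually defines; on the command line the same override is spelled `--vars.batch_size 128` after the subcommand.

```python
from pathlib import Path
from spacy.cli.project.run import project_run

# Override a project.yml variable for a single run (names are hypothetical).
project_run(Path("."), "train", overrides={"vars.batch_size": 128})
```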


@ -28,6 +28,15 @@ bg:
accuracy:
name: iarfmoose/roberta-base-bulgarian
size_factor: 3
bn:
word_vectors: null
transformer:
efficiency:
name: sagorsarker/bangla-bert-base
size_factor: 3
accuracy:
name: sagorsarker/bangla-bert-base
size_factor: 3
da:
word_vectors: da_core_news_lg
transformer:
@ -104,10 +113,10 @@ hi:
word_vectors: null
transformer:
efficiency:
name: monsoon-nlp/hindi-tpu-electra
name: ai4bharat/indic-bert
size_factor: 3
accuracy:
name: monsoon-nlp/hindi-tpu-electra
name: ai4bharat/indic-bert
size_factor: 3
id:
word_vectors: null
@ -185,10 +194,10 @@ si:
word_vectors: null
transformer:
efficiency:
name: keshan/SinhalaBERTo
name: setu4993/LaBSE
size_factor: 3
accuracy:
name: keshan/SinhalaBERTo
name: setu4993/LaBSE
size_factor: 3
sv:
word_vectors: null
@ -203,10 +212,10 @@ ta:
word_vectors: null
transformer:
efficiency:
name: monsoon-nlp/tamillion
name: ai4bharat/indic-bert
size_factor: 3
accuracy:
name: monsoon-nlp/tamillion
name: ai4bharat/indic-bert
size_factor: 3
te:
word_vectors: null


@ -579,8 +579,8 @@ class Errors:
E922 = ("Component '{name}' has been initialized with an output dimension of "
"{nO} - cannot add any more labels.")
E923 = ("It looks like there is no proper sample data to initialize the "
"Model of component '{name}'. This is likely a bug in spaCy, so "
"feel free to open an issue: https://github.com/explosion/spaCy/issues")
"Model of component '{name}'. To check your input data paths and "
"annotation, run: python -m spacy debug data config.cfg")
E924 = ("The '{name}' component does not seem to be initialized properly. "
"This is likely a bug in spaCy, so feel free to open an issue: "
"https://github.com/explosion/spaCy/issues")

spacy/lang/tn/__init__.py (new file, 18 lines)

@ -0,0 +1,18 @@
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_INFIXES
from ...language import Language
class SetswanaDefaults(Language.Defaults):
infixes = TOKENIZER_INFIXES
stop_words = STOP_WORDS
lex_attr_getters = LEX_ATTRS
class Setswana(Language):
lang = "tn"
Defaults = SetswanaDefaults
__all__ = ["Setswana"]

spacy/lang/tn/examples.py (new file, 15 lines)

@ -0,0 +1,15 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.tn.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple e nyaka go reka JSE ka tlhwatlhwa ta R1 billion",
"Johannesburg ke toropo e kgolo mo Afrika Borwa.",
"O ko kae?",
"ke mang presidente ya Afrika Borwa?",
"ke eng toropo kgolo ya Afrika Borwa?",
"Nelson Mandela o belegwe leng?",
]

spacy/lang/tn/lex_attrs.py (new file, 107 lines)

@ -0,0 +1,107 @@
from ...attrs import LIKE_NUM
_num_words = [
"lefela",
"nngwe",
"pedi",
"tharo",
"nne",
"tlhano",
"thataro",
"supa",
"robedi",
"robongwe",
"lesome",
"lesomenngwe",
"lesomepedi",
"sometharo",
"somenne",
"sometlhano",
"somethataro",
"somesupa",
"somerobedi",
"somerobongwe",
"someamabedi",
"someamararo",
"someamane",
"someamatlhano",
"someamarataro",
"someamasupa",
"someamarobedi",
"someamarobongwe",
"lekgolo",
"sekete",
"milione",
"bilione",
"terilione",
"kwatirilione",
"gajillione",
"bazillione",
]
_ordinal_words = [
"ntlha",
"bobedi",
"boraro",
"bone",
"botlhano",
"borataro",
"bosupa",
"borobedi ",
"borobongwe",
"bolesome",
"bolesomengwe",
"bolesomepedi",
"bolesometharo",
"bolesomenne",
"bolesometlhano",
"bolesomethataro",
"bolesomesupa",
"bolesomerobedi",
"bolesomerobongwe",
"somamabedi",
"someamararo",
"someamane",
"someamatlhano",
"someamarataro",
"someamasupa",
"someamarobedi",
"someamarobongwe",
"lekgolo",
"sekete",
"milione",
"bilione",
"terilione",
"kwatirilione",
"gajillione",
"bazillione",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
text_lower = text.lower()
if text_lower in _num_words:
return True
# Check ordinal number
if text_lower in _ordinal_words:
return True
if text_lower.endswith("th"):
if text_lower[:-2].isdigit():
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}
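A short hedged sketch of the new attribute in use; the example words come from the lists above.

```python
import spacy

nlp = spacy.blank("tn")  # available once this commit registers Setswana
doc = nlp("lesome 10 3/4 ntlha")
print([(token.text, token.like_num) for token in doc])
# [('lesome', True), ('10', True), ('3/4', True), ('ntlha', True)]
```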


@ -0,0 +1,19 @@
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, HYPHENS
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
_infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
]
)
TOKENIZER_INFIXES = _infixes


@ -0,0 +1,20 @@
# Stop words
STOP_WORDS = set(
"""
ke gareng ga selekanyo tlhwatlhwa yo mongwe se
sengwe fa go le jalo gongwe ba na mo tikologong
jaaka kwa morago nna gonne ka sa pele nako teng
tlase fela ntle magareng tsona feta bobedi kgabaganya
moo gape kgatlhanong botlhe tsotlhe bokana e esi
setseng mororo dinako golo kgolo nnye wena gago
o ntse ntle tla goreng gangwe mang yotlhe gore
eo yona tseraganyo eng ne sentle re rona thata
godimo fitlha pedi masomamabedi lesomepedi mmogo
tharo tseo boraro tseno yone jaanong bobona bona
lesome tsaya tsamaiso nngwe masomethataro thataro
tsa mmatota tota sale thoko supa dira tshwanetse di mmalwa masisi
bonala e tshwanang bogolo tsenya tsweetswee karolo
sepe tlhalosa dirwa robedi robongwe lesomenngwe gaisa
tlhano lesometlhano botlalo lekgolo
""".split()
)


@ -451,7 +451,7 @@ cdef class Lexeme:
Lexeme.c_set_flag(self.c, IS_QUOTE, x)
property is_left_punct:
"""RETURNS (bool): Whether the lexeme is left punctuation, e.g. )."""
"""RETURNS (bool): Whether the lexeme is left punctuation, e.g. (."""
def __get__(self):
return Lexeme.c_check_flag(self.c, IS_LEFT_PUNCT)


@ -18,4 +18,4 @@ cdef class PhraseMatcher:
cdef Pool mem
cdef key_t _terminal_hash
cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil
cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil


@ -230,10 +230,10 @@ cdef class PhraseMatcher:
result = internal_node
map_set(self.mem, <MapStruct*>result, self.vocab.strings[key], NULL)
def __call__(self, doc, *, as_spans=False):
def __call__(self, object doclike, *, as_spans=False):
"""Find all sequences matching the supplied patterns on the `Doc`.
doc (Doc): The document to match over.
doclike (Doc or Span): The document to match over.
as_spans (bool): Return Span objects with labels instead of (match_id,
start, end) tuples.
RETURNS (list): A list of `(match_id, start, end)` tuples,
@ -244,12 +244,22 @@ cdef class PhraseMatcher:
DOCS: https://spacy.io/api/phrasematcher#call
"""
matches = []
if doc is None or len(doc) == 0:
if doclike is None or len(doclike) == 0:
# if doclike is empty or None just return empty list
return matches
if isinstance(doclike, Doc):
doc = doclike
start_idx = 0
end_idx = len(doc)
elif isinstance(doclike, Span):
doc = doclike.doc
start_idx = doclike.start
end_idx = doclike.end
else:
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
cdef vector[SpanC] c_matches
self.find_matches(doc, &c_matches)
self.find_matches(doc, start_idx, end_idx, &c_matches)
for i in range(c_matches.size()):
matches.append((c_matches[i].label, c_matches[i].start, c_matches[i].end))
for i, (ent_id, start, end) in enumerate(matches):
@ -261,17 +271,17 @@ cdef class PhraseMatcher:
else:
return matches
cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil:
cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil:
cdef MapStruct* current_node = self.c_map
cdef int start = 0
cdef int idx = 0
cdef int idy = 0
cdef int idx = start_idx
cdef int idy = start_idx
cdef key_t key
cdef void* value
cdef int i = 0
cdef SpanC ms
cdef void* result
while idx < doc.length:
while idx < end_idx:
start = idx
token = Token.get_struct_attr(&doc.c[idx], self.attr)
# look for sequences from this position
@ -279,7 +289,7 @@ cdef class PhraseMatcher:
if result:
current_node = <MapStruct*>result
idy = idx + 1
while idy < doc.length:
while idy < end_idx:
result = map_get(current_node, self._terminal_hash)
if result:
i = 0
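The effect of the new `start_idx`/`end_idx` bounds, sketched with the public API; the sentence and pattern are made up, but the behavior matches the tests added below.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("PATTERN", [nlp.make_doc("Spans and Docs")])

doc = nlp("I like Spans and Docs , and Spans and Docs everywhere .")
span = doc[2:6]  # "Spans and Docs ,"
print(len(matcher(doc)))   # 2 - matches anywhere in the Doc
print(len(matcher(span)))  # 1 - only matches inside the Span
```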


@ -107,6 +107,7 @@ def init_ensemble_textcat(model, X, Y) -> Model:
model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
model.get_ref("norm_layer").set_dim("nI", tok2vec_width)
model.get_ref("norm_layer").set_dim("nO", tok2vec_width)
init_chain(model, X, Y)
return model


@ -273,7 +273,7 @@ class EntityLinker(TrainablePipe):
gradients = self.distance.get_grad(sentence_encodings, entity_encodings)
loss = self.distance.get_loss(sentence_encodings, entity_encodings)
loss = loss / len(entity_encodings)
return loss, gradients
return float(loss), gradients
def predict(self, docs: Iterable[Doc]) -> List[str]:
"""Apply the pipeline's model to a batch of docs, without modifying them.


@ -76,7 +76,7 @@ def merge_subtokens(doc: Doc, label: str = "subtok") -> Doc:
retokenizes=True,
)
def make_token_splitter(
nlp: Language, name: str, *, min_length=0, split_length=0,
nlp: Language, name: str, *, min_length: int = 0, split_length: int = 0
):
return TokenSplitter(min_length=min_length, split_length=split_length)


@ -197,7 +197,7 @@ class ClozeMultitask(TrainablePipe):
target = vectors[ids]
gradient = self.distance.get_grad(prediction, target)
loss = self.distance.get_loss(prediction, target)
return loss, gradient
return float(loss), gradient
def update(self, examples, *, drop=0., sgd=None, losses=None):
pass


@ -121,7 +121,7 @@ class Tok2Vec(TrainablePipe):
tokvecs = self.model.predict(docs)
batch_id = Tok2VecListener.get_batch_id(docs)
for listener in self.listeners:
listener.receive(batch_id, tokvecs, lambda dX: [])
listener.receive(batch_id, tokvecs, _empty_backprop)
return tokvecs
def set_annotations(self, docs: Sequence[Doc], tokvecses) -> None:
@ -291,12 +291,18 @@ def forward(model: Tok2VecListener, inputs, is_train: bool):
# of data.
# When the components batch differently, we don't receive a matching
# prediction from the upstream, so we can't predict.
if not all(doc.tensor.size for doc in inputs):
outputs = []
width = model.get_dim("nO")
for doc in inputs:
if doc.tensor.size == 0:
# But we do need to do *something* if the tensor hasn't been set.
# The compromise is to at least return data of the right shape,
# so the output is valid.
width = model.get_dim("nO")
outputs = [model.ops.alloc2f(len(doc), width) for doc in inputs]
outputs.append(model.ops.alloc2f(len(doc), width))
else:
outputs = [doc.tensor for doc in inputs]
outputs.append(doc.tensor)
return outputs, lambda dX: []
def _empty_backprop(dX): # for pickling
return []
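The module-level `_empty_backprop` replaces the inline `lambda dX: []` so the listener callback stays picklable; a quick sketch of why:

```python
import pickle

def _empty_backprop(dX):
    return []

pickle.dumps(_empty_backprop)  # fine: top-level functions pickle by reference

try:
    pickle.dumps(lambda dX: [])
except (pickle.PicklingError, AttributeError) as err:
    print("lambdas don't pickle:", err)
```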


@ -446,6 +446,7 @@ class ProjectConfigCommand(BaseModel):
class ProjectConfigSchema(BaseModel):
# fmt: off
vars: Dict[StrictStr, Any] = Field({}, title="Optional variables to substitute in commands")
env: Dict[StrictStr, Any] = Field({}, title="Optional variable names to substitute in commands, mapped to environment variable names")
assets: List[Union[ProjectConfigAssetURL, ProjectConfigAssetGit]] = Field([], title="Data assets")
workflows: Dict[StrictStr, List[StrictStr]] = Field({}, title="Named workflows, mapped to list of project commands to run in order")
commands: List[ProjectConfigCommand] = Field([], title="Project command shortcuts")


@ -8,7 +8,8 @@ from spacy.util import get_lang_class
LANGUAGES = ["af", "ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", "es",
"et", "fa", "fi", "fr", "ga", "he", "hi", "hr", "hu", "id", "is",
"it", "kn", "lt", "lv", "nb", "nl", "pl", "pt", "ro", "si", "sk",
"sl", "sq", "sr", "sv", "ta", "te", "tl", "tr", "tt", "ur", 'yo']
"sl", "sq", "sr", "sv", "ta", "te", "tl", "tn", "tr", "tt", "ur",
"yo"]
# fmt: on


@ -323,3 +323,39 @@ def test_phrase_matcher_deprecated(en_vocab):
@pytest.mark.parametrize("attr", ["SENT_START", "IS_SENT_START"])
def test_phrase_matcher_sent_start(en_vocab, attr):
_ = PhraseMatcher(en_vocab, attr=attr) # noqa: F841
def test_span_in_phrasematcher(en_vocab):
"""Ensure that PhraseMatcher accepts Span and Doc as input"""
# fmt: off
words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
# fmt: on
doc = Doc(en_vocab, words=words)
span = doc[:8]
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
matcher = PhraseMatcher(en_vocab)
matcher.add("SPACY", [pattern])
matches_doc = matcher(doc)
matches_span = matcher(span)
assert len(matches_doc) == 1
assert len(matches_span) == 1
def test_span_v_doc_in_phrasematcher(en_vocab):
"""Ensure that PhraseMatcher only returns matches in input Span and not in entire Doc"""
# fmt: off
words = [
"I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "Spans",
"and", "Docs", "in", "my", "matchers", "," "and", "Spans", "and", "Docs",
"everywhere", "."
]
# fmt: on
doc = Doc(en_vocab, words=words)
span = doc[9:15] # second clause
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
matcher = PhraseMatcher(en_vocab)
matcher.add("SPACY", [pattern])
matches_doc = matcher(doc)
matches_span = matcher(span)
assert len(matches_doc) == 3
assert len(matches_span) == 1


@ -451,13 +451,27 @@ def test_pipe_factories_from_source_config():
assert config["arg"] == "world"
def test_pipe_factories_decorator_idempotent():
class PipeFactoriesIdempotent:
def __init__(self, nlp, name):
...
def __call__(self, doc):
...
@pytest.mark.parametrize(
"i,func,func2",
[
(0, lambda nlp, name: lambda doc: doc, lambda doc: doc),
(1, PipeFactoriesIdempotent, PipeFactoriesIdempotent(None, None)),
],
)
def test_pipe_factories_decorator_idempotent(i, func, func2):
"""Check that decorator can be run multiple times if the function is the
same. This is especially relevant for live reloading because we don't
want spaCy to raise an error if a module registering components is reloaded.
"""
name = "test_pipe_factories_decorator_idempotent"
func = lambda nlp, name: lambda doc: doc
name = f"test_pipe_factories_decorator_idempotent_{i}"
for i in range(5):
Language.factory(name, func=func)
nlp = Language()
@ -466,7 +480,6 @@ def test_pipe_factories_decorator_idempotent():
# Make sure it also works for component decorator, which creates the
# factory function
name2 = f"{name}2"
func2 = lambda doc: doc
for i in range(5):
Language.component(name2, func=func2)
nlp = Language()


@ -0,0 +1,229 @@
import pytest
from spacy.lang.en import English
import numpy as np
import spacy
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
from spacy.tokens import DocBin
from spacy.util import load_config_from_str
from spacy.training import Example
from spacy.training.initialize import init_nlp
import pickle
from ..util import make_tempdir
def test_issue6730(en_vocab):
"""Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
from spacy.kb import KnowledgeBase
kb = KnowledgeBase(en_vocab, entity_vector_length=3)
kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])
with pytest.raises(ValueError):
kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
assert kb.contains_alias("") is False
kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])
with make_tempdir() as tmp_dir:
kb.to_disk(tmp_dir)
kb.from_disk(tmp_dir)
assert kb.get_size_aliases() == 2
assert set(kb.get_alias_strings()) == {"x", "y"}
def test_issue6755(en_tokenizer):
doc = en_tokenizer("This is a magnificent sentence.")
span = doc[:0]
assert span.text_with_ws == ""
assert span.text == ""
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,label",
[("Welcome to Mumbai, my friend", 11, 17, "GPE")],
)
def test_issue6815_1(sentence, start_idx, end_idx, label):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, label=label)
assert span.label_ == label
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
)
def test_issue6815_2(sentence, start_idx, end_idx, kb_id):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
assert span.kb_id == kb_id
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,vector",
[("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
)
def test_issue6815_3(sentence, start_idx, end_idx, vector):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, vector=vector)
assert (span.vector == vector).all()
def test_issue6839(en_vocab):
"""Ensure that PhraseMatcher accepts Span as input"""
# fmt: off
words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
# fmt: on
doc = Doc(en_vocab, words=words)
span = doc[:8]
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
matcher = PhraseMatcher(en_vocab)
matcher.add("SPACY", [pattern])
matches = matcher(span)
assert matches
CONFIG_ISSUE_6908 = """
[paths]
train = "TRAIN_PLACEHOLDER"
raw = null
init_tok2vec = null
vectors = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["textcat"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
[components]
[components.textcat]
factory = "TEXTCAT_PLACEHOLDER"
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
frozen_components = []
before_to_disk = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.components.textcat]
labels = ['label1', 'label2']
[initialize.tokenizer]
"""
@pytest.mark.parametrize(
"component_name", ["textcat", "textcat_multilabel"],
)
def test_issue6908(component_name):
"""Test intializing textcat with labels in a list"""
def create_data(out_file):
nlp = spacy.blank("en")
doc = nlp.make_doc("Some text")
doc.cats = {"label1": 0, "label2": 1}
out_data = DocBin(docs=[doc]).to_bytes()
with out_file.open("wb") as file_:
file_.write(out_data)
with make_tempdir() as tmp_path:
train_path = tmp_path / "train.spacy"
create_data(train_path)
config_str = CONFIG_ISSUE_6908.replace("TEXTCAT_PLACEHOLDER", component_name)
config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
config = load_config_from_str(config_str)
init_nlp(config)
CONFIG_ISSUE_6950 = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.ner]
factory = "ner"
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "*"
"""
def test_issue6950():
"""Test that the nlp object with initialized tok2vec with listeners pickles
correctly (and doesn't have lambdas).
"""
nlp = English.from_config(load_config_from_str(CONFIG_ISSUE_6950))
nlp.initialize(lambda: [Example.from_dict(nlp.make_doc("hello"), {"tags": ["V"]})])
pickle.dumps(nlp)
nlp("hello")
pickle.dumps(nlp)


@ -1,23 +0,0 @@
import pytest
from ..util import make_tempdir
def test_issue6730(en_vocab):
"""Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
from spacy.kb import KnowledgeBase
kb = KnowledgeBase(en_vocab, entity_vector_length=3)
kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])
with pytest.raises(ValueError):
kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
assert kb.contains_alias("") is False
kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])
with make_tempdir() as tmp_dir:
kb.to_disk(tmp_dir)
kb.from_disk(tmp_dir)
assert kb.get_size_aliases() == 2
assert set(kb.get_alias_strings()) == {"x", "y"}


@ -1,5 +0,0 @@
def test_issue6755(en_tokenizer):
doc = en_tokenizer("This is a magnificent sentence.")
span = doc[:0]
assert span.text_with_ws == ""
assert span.text == ""


@ -1,35 +0,0 @@
import pytest
from spacy.lang.en import English
import numpy as np
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,label",
[("Welcome to Mumbai, my friend", 11, 17, "GPE")],
)
def test_char_span_label(sentence, start_idx, end_idx, label):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, label=label)
assert span.label_ == label
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
)
def test_char_span_kb_id(sentence, start_idx, end_idx, kb_id):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
assert span.kb_id == kb_id
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,vector",
[("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
)
def test_char_span_vector(sentence, start_idx, end_idx, vector):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, vector=vector)
assert (span.vector == vector).all()


@ -1,102 +0,0 @@
import pytest
import spacy
from spacy.language import Language
from spacy.tokens import DocBin
from spacy import util
from spacy.schemas import ConfigSchemaInit
from spacy.training.initialize import init_nlp
from ..util import make_tempdir
TEXTCAT_WITH_LABELS_ARRAY_CONFIG = """
[paths]
train = "TRAIN_PLACEHOLDER"
raw = null
init_tok2vec = null
vectors = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["textcat"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
[components]
[components.textcat]
factory = "TEXTCAT_PLACEHOLDER"
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
frozen_components = []
before_to_disk = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.components.textcat]
labels = ['label1', 'label2']
[initialize.tokenizer]
"""
@pytest.mark.parametrize(
"component_name",
["textcat", "textcat_multilabel"],
)
def test_textcat_initialize_labels_validation(component_name):
"""Test intializing textcat with labels in a list"""
def create_data(out_file):
nlp = spacy.blank("en")
doc = nlp.make_doc("Some text")
doc.cats = {"label1": 0, "label2": 1}
out_data = DocBin(docs=[doc]).to_bytes()
with out_file.open("wb") as file_:
file_.write(out_data)
with make_tempdir() as tmp_path:
train_path = tmp_path / "train.spacy"
create_data(train_path)
config_str = TEXTCAT_WITH_LABELS_ARRAY_CONFIG.replace(
"TEXTCAT_PLACEHOLDER", component_name
)
config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
config = util.load_config_from_str(config_str)
init_nlp(config)


@ -0,0 +1,12 @@
from spacy.cli.evaluate import print_textcats_auc_per_cat, print_prf_per_type
from wasabi import msg
def test_issue7019():
scores = {"LABEL_A": 0.39829102, "LABEL_B": 0.938298329382, "LABEL_C": None}
print_textcats_auc_per_cat(msg, scores)
scores = {
"LABEL_A": {"p": 0.3420302, "r": 0.3929020, "f": 0.49823928932},
"LABEL_B": {"p": None, "r": None, "f": None},
}
print_prf_per_type(msg, scores, name="foo", type="bar")


@ -0,0 +1,67 @@
from spacy.lang.en import English
from spacy.training import Example
from spacy.util import load_config_from_str
CONFIG = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "*"
"""
TRAIN_DATA = [
("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
("Eat blue ham", {"tags": ["V", "J", "N"]}),
]
def test_issue7029():
"""Test that an empty document doesn't mess up an entire batch."""
nlp = English.from_config(load_config_from_str(CONFIG))
train_examples = []
for t in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
optimizer = nlp.initialize(get_examples=lambda: train_examples)
for i in range(50):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
nlp.select_pipes(enable=["tok2vec", "tagger"])
docs1 = list(nlp.pipe(texts, batch_size=1))
docs2 = list(nlp.pipe(texts, batch_size=4))
assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]


@ -325,6 +325,23 @@ def test_project_config_interpolation():
substitute_project_variables(project)
def test_project_config_interpolation_env():
variables = {"a": 10}
env_var = "SPACY_TEST_FOO"
env_vars = {"foo": env_var}
commands = [{"name": "x", "script": ["hello ${vars.a} ${env.foo}"]}]
project = {"commands": commands, "vars": variables, "env": env_vars}
with make_tempdir() as d:
srsly.write_yaml(d / "project.yml", project)
cfg = load_project_config(d)
assert cfg["commands"][0]["script"][0] == "hello 10 "
os.environ[env_var] = "123"
with make_tempdir() as d:
srsly.write_yaml(d / "project.yml", project)
cfg = load_project_config(d)
assert cfg["commands"][0]["script"][0] == "hello 10 123"
@pytest.mark.parametrize(
"args,expected",
[


@ -1,7 +1,9 @@
import pytest
import numpy
import srsly
from spacy.lang.en import English
from spacy.strings import StringStore
from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.attrs import NORM
@ -20,7 +22,10 @@ def test_pickle_string_store(text1, text2):
@pytest.mark.parametrize("text1,text2", [("dog", "cat")])
def test_pickle_vocab(text1, text2):
vocab = Vocab(lex_attr_getters={int(NORM): lambda string: string[:-1]})
vocab = Vocab(
lex_attr_getters={int(NORM): lambda string: string[:-1]},
get_noun_chunks=English.Defaults.syntax_iterators.get("noun_chunks"),
)
vocab.set_vector("dog", numpy.ones((5,), dtype="f"))
lex1 = vocab[text1]
lex2 = vocab[text2]
@ -34,4 +39,23 @@ def test_pickle_vocab(text1, text2):
assert unpickled[text2].norm == lex2.norm
assert unpickled[text1].norm != unpickled[text2].norm
assert unpickled.vectors is not None
assert unpickled.get_noun_chunks is not None
assert list(vocab["dog"].vector) == [1.0, 1.0, 1.0, 1.0, 1.0]
def test_pickle_doc(en_vocab):
words = ["a", "b", "c"]
deps = ["dep"] * len(words)
heads = [0] * len(words)
doc = Doc(
en_vocab,
words=words,
deps=deps,
heads=heads,
)
data = srsly.pickle_dumps(doc)
unpickled = srsly.pickle_loads(data)
assert [t.text for t in unpickled] == words
assert [t.dep_ for t in unpickled] == deps
assert [t.head.i for t in unpickled] == heads
assert list(doc.noun_chunks) == []


@ -55,6 +55,7 @@ def test_vocab_lexeme_add_flag_provided_id(en_vocab):
assert en_vocab["199"].check_flag(IS_DIGIT) is False
assert en_vocab["the"].check_flag(is_len4) is False
assert en_vocab["dogs"].check_flag(is_len4) is True
en_vocab.add_flag(lambda string: string.isdigit(), flag_id=IS_DIGIT)
def test_vocab_lexeme_oov_rank(en_vocab):


@ -245,7 +245,7 @@ cdef class Tokenizer:
cdef int offset
cdef int modified_doc_length
# Find matches for special cases
self._special_matcher.find_matches(doc, &c_matches)
self._special_matcher.find_matches(doc, 0, doc.length, &c_matches)
# Skip processing if no matches
if c_matches.size() == 0:
return True


@ -215,8 +215,7 @@ def convert_vectors(
def read_vectors(vectors_loc: Path, truncate_vectors: int):
f = open_file(vectors_loc)
f = ensure_shape(f)
f = ensure_shape(vectors_loc)
shape = tuple(int(size) for size in next(f).split())
if truncate_vectors >= 1:
shape = (truncate_vectors, shape[1])
@ -251,11 +250,12 @@ def open_file(loc: Union[str, Path]) -> IO:
return loc.open("r", encoding="utf8")
def ensure_shape(lines):
def ensure_shape(vectors_loc):
"""Ensure that the first line of the data is the vectors shape.
If it's not, we read in the data and output the shape as the first result,
so that the reader doesn't have to deal with the problem.
"""
lines = open_file(vectors_loc)
first_line = next(lines)
try:
shape = tuple(int(size) for size in first_line.split())
@ -269,7 +269,11 @@ def ensure_shape(lines):
# Figure out the shape, make it the first value, and then give the
# rest of the data.
width = len(first_line.split()) - 1
captured = [first_line] + list(lines)
length = len(captured)
length = 1
for _ in lines:
length += 1
yield f"{length} {width}"
yield from captured
# Reading the lines in again from the file. This is to avoid having to
# store all the results in a list in memory
lines2 = open_file(vectors_loc)
yield from lines2
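The rewritten generator makes two passes over the file instead of buffering every row. A simplified standalone sketch of the same idea (not the actual implementation):

```python
from pathlib import Path

def ensure_shape_sketch(vectors_loc):
    """Yield a 'rows cols' header line, then the rows, without buffering the file."""
    width = None
    length = 0
    with Path(vectors_loc).open(encoding="utf8") as f:
        for line in f:  # first pass: count the rows, remember the width
            if width is None:
                width = len(line.split()) - 1
            length += 1
    yield f"{length} {width}"
    with Path(vectors_loc).open(encoding="utf8") as f:
        yield from f  # second pass: stream the rows
```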


@ -930,6 +930,8 @@ def is_same_func(func1: Callable, func2: Callable) -> bool:
"""
if not callable(func1) or not callable(func2):
return False
if not hasattr(func1, "__qualname__") or not hasattr(func2, "__qualname__"):
return False
same_name = func1.__qualname__ == func2.__qualname__
same_file = inspect.getfile(func1) == inspect.getfile(func2)
same_code = inspect.getsourcelines(func1) == inspect.getsourcelines(func2)
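The new `__qualname__` guard matters for callables that don't carry one, such as class instances or `functools.partial` objects; a hedged sketch:

```python
from functools import partial
from spacy.util import is_same_func

def component(doc):
    return doc

assert is_same_func(component, component) is True
# partial objects (and class instances) have no __qualname__; instead of
# raising, they now simply compare as "not the same function".
assert is_same_func(partial(component), component) is False
```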


@ -551,12 +551,13 @@ def pickle_vocab(vocab):
data_dir = vocab.data_dir
lex_attr_getters = srsly.pickle_dumps(vocab.lex_attr_getters)
lookups = vocab.lookups
get_noun_chunks = vocab.get_noun_chunks
return (unpickle_vocab,
(sstore, vectors, morph, data_dir, lex_attr_getters, lookups))
(sstore, vectors, morph, data_dir, lex_attr_getters, lookups, get_noun_chunks))
def unpickle_vocab(sstore, vectors, morphology, data_dir,
lex_attr_getters, lookups):
lex_attr_getters, lookups, get_noun_chunks):
cdef Vocab vocab = Vocab()
vocab.vectors = vectors
vocab.strings = sstore
@ -564,6 +565,7 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
vocab.data_dir = data_dir
vocab.lex_attr_getters = srsly.pickle_loads(lex_attr_getters)
vocab.lookups = lookups
vocab.get_noun_chunks = get_noun_chunks
return vocab
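With `get_noun_chunks` included in the pickle state, the noun-chunk iterator survives a round trip; a hedged sketch:

```python
import pickle
from spacy.lang.en import English

nlp = English()
vocab2 = pickle.loads(pickle.dumps(nlp.vocab))
assert vocab2.get_noun_chunks is not None  # previously dropped on unpickling
```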


@ -67,7 +67,7 @@ data format used by the lookup and rule-based lemmatizers, see
> lemmatizer = nlp.add_pipe("lemmatizer")
>
> # Construction via add_pipe with custom settings
> config = {"mode": "rule", overwrite=True}
> config = {"mode": "rule", "overwrite": True}
> lemmatizer = nlp.add_pipe("lemmatizer", config=config)
> ```


@ -44,7 +44,7 @@ be shown.
## PhraseMatcher.\_\_call\_\_ {#call tag="method"}
Find all token sequences matching the supplied patterns on the `Doc`.
Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
> #### Example
>
@ -59,7 +59,7 @@ Find all token sequences matching the supplied patterns on the `Doc`.
| Name | Description |
| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc` | The document to match over. ~~Doc~~ |
| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
| _keyword-only_ | |
| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
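A hedged usage sketch of the relaxed signature; the text and pattern are made up:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", [nlp.make_doc("Barack Obama")])

doc = nlp("Barack Obama lifts America one last time in emotional farewell")
# A Span works anywhere a Doc did; as_spans=True still returns Span objects.
for span in matcher(doc[0:5], as_spans=True):
    print(span.text, span.label_)
```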


@ -727,7 +727,7 @@ capitalization by including a mix of capitalized and lowercase examples. See the
Create a data augmentation callback that uses orth-variant replacement. The
callback can be added to a corpus or other data iterator during training. It's
is especially useful for punctuation and case replacement, to help generalize
especially useful for punctuation and case replacement, to help generalize
beyond corpora that don't have smart quotes, or only have smart quotes etc.
| Name | Description |


@ -4,8 +4,8 @@ import { Help } from 'components/typography'; import Link from 'components/link'
| Pipeline | Parser | Tagger | NER |
| ---------------------------------------------------------- | -----: | -----: | ---: |
| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.2 | 97.8 | 89.9 |
| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 91.9 | 97.4 | 85.5 |
| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.1 | 97.8 | 89.8 |
| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.0 | 97.4 | 85.5 |
| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | 85.5 |
<figcaption class="caption">
@ -22,7 +22,7 @@ the development set).
| Named Entity Recognition System | OntoNotes | CoNLL '03 |
| -------------------------------- | --------: | --------: |
| spaCy RoBERTa (2020) | 89.7 | 91.6 |
| spaCy RoBERTa (2020) | 89.8 | 91.6 |
| Stanza (StanfordNLP)<sup>1</sup> | 88.8 | 92.1 |
| Flair<sup>2</sup> | 89.7 | 93.1 |


@ -77,7 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
| Dependency Parsing System | UAS | LAS |
| ------------------------------------------------------------------------------ | ---: | ---: |
| spaCy RoBERTa (2020) | 95.5 | 94.3 |
| spaCy RoBERTa (2020) | 95.1 | 93.7 |
| [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
| [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019) | 97.2 | 95.7 |


@ -69,9 +69,9 @@ python -m spacy project clone pipelines/tagger_parser_ud
By default, the project will be cloned into the current working directory. You
can specify an optional second argument to define the output directory. The
`--repo` option lets you define a custom repo to clone from if you don't want
to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
can also use any private repo you have access to with Git.
`--repo` option lets you define a custom repo to clone from if you don't want to
use the spaCy [`projects`](https://github.com/explosion/projects) repo. You can
also use any private repo you have access to with Git.
### 2. Fetch the project assets {#assets}
@ -221,6 +221,7 @@ pipelines.
| `title` | An optional project title used in `--help` message and [auto-generated docs](#custom-docs). |
| `description` | An optional project description used in [auto-generated docs](#custom-docs). |
| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
| `env` | A dictionary of variables, mapped to the names of environment variables that will be read in when running the project. For example, `${env.name}` will use the value of the environment variable defined as `name`. |
| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. |
| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
@ -310,8 +311,8 @@ company-internal and not available over the internet. In that case, you can
specify the destination paths and a checksum, and leave out the URL. When your
teammates clone and run your project, they can place the files in the respective
directory themselves. The [`project assets`](/api/cli#project-assets) command
will alert you about missing files and mismatched checksums, so you can ensure that
others are running your project with the same data.
will alert you about missing files and mismatched checksums, so you can ensure
that others are running your project with the same data.
### Dependencies and outputs {#deps-outputs}
@ -358,9 +359,10 @@ graphs based on the dependencies and outputs, and won't re-run previous steps
automatically. For instance, if you only run the command `train` that depends on
data created by `preprocess` and those files are missing, spaCy will show an
error it won't just re-run `preprocess`. If you're looking for more advanced
data management, check out the [Data Version Control (DVC) integration](#dvc). If you're planning on integrating your spaCy project with DVC, you
can also use `outputs_no_cache` instead of `outputs` to define outputs that
won't be cached or tracked.
data management, check out the [Data Version Control (DVC) integration](#dvc).
If you're planning on integrating your spaCy project with DVC, you can also use
`outputs_no_cache` instead of `outputs` to define outputs that won't be cached
or tracked.
### Files and directory structure {#project-files}
@ -467,7 +469,9 @@ In your `project.yml`, you can then run the script by calling
`python scripts/custom_evaluation.py` with the function arguments. You can also
use the `vars` section to define reusable variables that will be substituted in
commands, paths and URLs. In this example, the batch size is defined as a
variable will be added in place of `${vars.batch_size}` in the script.
variable and will be added in place of `${vars.batch_size}` in the script. Just like
in the [training config](/usage/training#config-overrides), you can also
override settings on the command line, for example using `--vars.batch_size`.
> #### Calling into Python
>
@ -491,6 +495,29 @@ commands:
- 'corpus/eval.json'
```
You can also use the `env` section to reference **environment variables** and
make their values available to the commands. This can be useful for overriding
settings on the command line and passing through system-level settings.
> #### Usage example
>
> ```bash
> export GPU_ID=1
> BATCH_SIZE=128 python -m spacy project run evaluate
> ```
```yaml
### project.yml
env:
batch_size: BATCH_SIZE
gpu_id: GPU_ID
commands:
- name: evaluate
script:
- 'python scripts/custom_evaluation.py ${env.batch_size}'
```
### Documenting your project {#custom-docs}
> #### Readme Example


@ -185,7 +185,7 @@ sections of a config file are:
For a full overview of spaCy's config format and settings, see the
[data format documentation](/api/data-formats#config) and
[Thinc's config system docs](https://thinc.ai/usage/config). The settings
[Thinc's config system docs](https://thinc.ai/docs/usage-config). The settings
available for the different architectures are documented with the
[model architectures API](/api/architectures). See the Thinc documentation for
[optimizers](https://thinc.ai/docs/api-optimizers) and


@ -198,6 +198,7 @@
"has_examples": true
},
{ "code": "tl", "name": "Tagalog" },
{ "code": "tn", "name": "Setswana", "has_examples": true },
{ "code": "tr", "name": "Turkish", "example": "Bu bir cümledir.", "has_examples": true },
{ "code": "tt", "name": "Tatar", "has_examples": true },
{