From 67ecdcc3ac3079b45273bad469db131614cf1936 Mon Sep 17 00:00:00 2001 From: Paul O'Leary McCann Date: Tue, 27 Jul 2021 19:08:46 +0900 Subject: [PATCH 01/22] Update subset/superset docs (#8795) * Update subset/superset docs * Update website/docs/usage/rule-based-matching.md Co-authored-by: Adriane Boyd Co-authored-by: Adriane Boyd --- website/docs/usage/rule-based-matching.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index 037850154..b718ef2b2 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -232,14 +232,20 @@ following rich comparison attributes are available: > > # Matches tokens of length >= 10 > pattern2 = [{"LENGTH": {">=": 10}}] +> +> # Match based on morph attributes +> pattern3 = [{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}] +> # "", "Number=Sing" and "Number=Sing|Gender=Neut" will match as subsets +> # "Number=Plur|Gender=Neut" will not match +> # "Number=Sing|Gender=Neut|Polite=Infm" will not match because it's a superset > ``` | Attribute | Description | | -------------------------- | ------------------------------------------------------------------------------------------------------- | | `IN` | Attribute value is member of a list. ~~Any~~ | | `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ | -| `ISSUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ | -| `ISSUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ | +| `IS_SUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ | +| `IS_SUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ | | `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ | #### Regular expressions {#regex new="2.1"} From 76ac95923a7db82c222bb45f17c4316e4afc0b8f Mon Sep 17 00:00:00 2001 From: Paul O'Leary McCann Date: Tue, 27 Jul 2021 19:19:25 +0900 Subject: [PATCH 02/22] Add note to migration guide about lexeme tables (fix #7290) This just adds the resolution from #6388 to the docs. --- website/docs/usage/v3.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index 8b4d2de7c..cdf78d59e 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -854,6 +854,19 @@ pipeline component, the [`AttributeRuler`](/api/attributeruler). See the you have tag maps and morph rules in the v2.x format, you can load them into the attribute ruler before training using the `[initialize]` block of your config. +### Using Lexeme Tables + +To use tables like `lexeme_prob` when training a model from scratch, you need +to add an entry to the `initialize` block in your config. Here's what that +looks like for the pretrained models: + +``` +[initialize.lookups] +@misc = "spacy.LookupsDataLoader.v1" +lang = ${nlp.lang} +tables = ["lexeme_norm"] +``` + > #### What does the initialization do? 
> > The `[initialize]` block is used when From 8547514aa4d17bd5f457d2b58d6c3efc3748520d Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Tue, 27 Jul 2021 13:14:38 +0200 Subject: [PATCH 03/22] Remove labels from textcat component config example (#8815) --- website/docs/api/data-formats.md | 1 - 1 file changed, 1 deletion(-) diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md index 7dbf50595..1bdeb509a 100644 --- a/website/docs/api/data-formats.md +++ b/website/docs/api/data-formats.md @@ -90,7 +90,6 @@ Defines the `nlp` object, its tokenizer and > ```ini > [components.textcat] > factory = "textcat" -> labels = ["POSITIVE", "NEGATIVE"] > > [components.textcat.model] > @architectures = "spacy.TextCatBOW.v2" From 8867e60fbb3fd737fc32d0b7c85374a230365e80 Mon Sep 17 00:00:00 2001 From: Paul O'Leary McCann Date: Thu, 29 Jul 2021 14:56:56 +0900 Subject: [PATCH 04/22] Update website/docs/usage/v3.md Co-authored-by: Ines Montani --- website/docs/usage/v3.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index cdf78d59e..980f06172 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -858,9 +858,9 @@ attribute ruler before training using the `[initialize]` block of your config. To use tables like `lexeme_prob` when training a model from scratch, you need to add an entry to the `initialize` block in your config. Here's what that -looks like for the pretrained models: +looks like for the existing trained pipelines: -``` +```ini [initialize.lookups] @misc = "spacy.LookupsDataLoader.v1" lang = ${nlp.lang} From c92d268176eb8ba9816e6ca582c87ce99cf3b6a3 Mon Sep 17 00:00:00 2001 From: Paul O'Leary McCann Date: Thu, 29 Jul 2021 15:03:07 +0900 Subject: [PATCH 05/22] Add note about SPEED in output In #8823 it was pointed out that the `SPEED` value wasn't documented anywhere. --- website/docs/api/cli.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index 10ab2083e..c04ae5879 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -878,6 +878,9 @@ skew. To render a sample of dependency parses in a HTML file using the [displaCy visualizations](/usage/visualizers), set as output directory as the `--displacy-path` argument. +The `SPEED` value included in the output is the words per second processed by +the pipeline. + ```cli $ python -m spacy evaluate [model] [data_path] [--output] [--code] [--gold-preproc] [--gpu-id] [--displacy-path] [--displacy-limit] ``` From e125313a50413e703ae4e04f983148c7cb71c030 Mon Sep 17 00:00:00 2001 From: Paul O'Leary McCann Date: Thu, 29 Jul 2021 16:34:08 +0900 Subject: [PATCH 06/22] Revert "Add note about SPEED in output" This reverts commit c92d268176eb8ba9816e6ca582c87ce99cf3b6a3. --- website/docs/api/cli.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index c04ae5879..10ab2083e 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -878,9 +878,6 @@ skew. To render a sample of dependency parses in a HTML file using the [displaCy visualizations](/usage/visualizers), set as output directory as the `--displacy-path` argument. -The `SPEED` value included in the output is the words per second processed by -the pipeline. 
- ```cli $ python -m spacy evaluate [model] [data_path] [--output] [--code] [--gold-preproc] [--gpu-id] [--displacy-path] [--displacy-limit] ``` From a60cb1391081e7827db8e65e288ff4f19a12044f Mon Sep 17 00:00:00 2001 From: Paul O'Leary McCann Date: Thu, 29 Jul 2021 16:35:19 +0900 Subject: [PATCH 07/22] Update speed entry in metrics table --- website/docs/usage/training.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 17fac05e5..89aff8844 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -652,7 +652,7 @@ excluded from the logs and the score won't be weighted. | **Recall** (R) | Percentage of reference annotations recovered. Should increase. | | **F-Score** (F) | Harmonic mean of precision and recall. Should increase. | | **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. | -| **Words per second** (WPS) | Prediction speed in words per second. Should stay stable. | +| Speed | Prediction speed in words per second (WPS). Should stay stable. | Note that if the development data has raw text, some of the gold-standard entities might not align to the predicted tokenization. These tokenization From 15b12f3e35ccff2c597fc6c8cc65d15f9591205b Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Thu, 29 Jul 2021 10:10:12 +0200 Subject: [PATCH 08/22] Fix formatting of ent_id_sep in EntityRuler API docs --- website/docs/api/entityruler.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/website/docs/api/entityruler.md b/website/docs/api/entityruler.md index 66cb6d4e4..93b5da45a 100644 --- a/website/docs/api/entityruler.md +++ b/website/docs/api/entityruler.md @@ -35,11 +35,11 @@ how the component should be configured. You can override its settings via the > ``` | Setting | Description | -| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ----------- | +| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `phrase_matcher_attr` | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ | | `validate` | Whether patterns should be validated (passed to the `Matcher` and `PhraseMatcher`). Defaults to `False`. ~~bool~~ | | `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ | -| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `" | | "`. ~~str~~ | +| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"\|\|"`. ~~str~~ | ```python %%GITHUB_SPACY/spacy/pipeline/entityruler.py @@ -64,14 +64,14 @@ be a token pattern (list) or a phrase pattern (string). 
For example: > ``` | Name | Description | -| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ----------- | +| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `nlp` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. ~~Language~~ | | `name` 3 | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object. ~~str~~ | | _keyword-only_ | | | `phrase_matcher_attr` | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ | | `validate` | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. ~~bool~~ | | `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ | -| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `" | | "`. ~~str~~ | +| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"\|\|"`. ~~str~~ | | `patterns` | Optional patterns to load in on initialization. ~~Optional[List[Dict[str, Union[str, List[dict]]]]]~~ | ## EntityRuler.initialize {#initialize tag="method" new="3"} From 02258916c8977c0d5d506380c9f65eec9df880d7 Mon Sep 17 00:00:00 2001 From: thomashacker Date: Thu, 29 Jul 2021 11:19:40 +0200 Subject: [PATCH 09/22] Fix example config typo for transformer architecture --- website/docs/api/architectures.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index e90dc1183..f1a11bbc4 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -409,7 +409,7 @@ a single token vector given zero or more wordpiece vectors. > > ```ini > [model] -> @architectures = "spacy.Tok2VecTransformer.v1" +> @architectures = "spacy-transformers.Tok2VecTransformer.v1" > name = "albert-base-v2" > tokenizer_config = {"use_fast": false} > grad_factor = 1.0 From 65d163fab596120684f8ca15d6623afaf2a220ed Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Fri, 30 Jul 2021 09:10:04 +1000 Subject: [PATCH 10/22] Adjust formatting [ci skip] --- website/docs/usage/training.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 89aff8844..6deba3761 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -652,7 +652,7 @@ excluded from the logs and the score won't be weighted. | **Recall** (R) | Percentage of reference annotations recovered. Should increase. | | **F-Score** (F) | Harmonic mean of precision and recall. Should increase. | | **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. | -| Speed | Prediction speed in words per second (WPS). 
Should stay stable. | +| **Speed** | Prediction speed in words per second (WPS). Should stay stable. | Note that if the development data has raw text, some of the gold-standard entities might not align to the predicted tokenization. These tokenization From de076194c4a08c70a2a80e300fece917e621ca9c Mon Sep 17 00:00:00 2001 From: themrmax Date: Mon, 2 Aug 2021 05:33:38 -0700 Subject: [PATCH 11/22] Make ConsoleLogger flush after each logging line (#8810) This is necessary to avoid "logging blackouts" when running training on Kubernetes pods --- spacy/training/loggers.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spacy/training/loggers.py b/spacy/training/loggers.py index f7f70226d..5cf2db6b3 100644 --- a/spacy/training/loggers.py +++ b/spacy/training/loggers.py @@ -29,7 +29,7 @@ def console_logger(progress_bar: bool = False): def setup_printer( nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr ) -> Tuple[Callable[[Optional[Dict[str, Any]]], None], Callable[[], None]]: - write = lambda text: stdout.write(f"{text}\n") + write = lambda text: print(text, file=stdout, flush=True) msg = Printer(no_print=True) # ensure that only trainable components are logged logged_pipes = [ From 0485cdefcc0d6f2fd81ac2191e36543a2a6d54f1 Mon Sep 17 00:00:00 2001 From: Nick Sorros Date: Mon, 2 Aug 2021 19:13:53 +0300 Subject: [PATCH 12/22] Add logger debug for project push and pull (#8860) * Add logger debug for project push and pull * Sign contributor agreement --- .github/contributors/nsorros.md | 106 ++++++++++++++++++++++++++++++++ spacy/cli/project/pull.py | 6 +- spacy/cli/project/push.py | 6 +- 3 files changed, 116 insertions(+), 2 deletions(-) create mode 100644 .github/contributors/nsorros.md diff --git a/.github/contributors/nsorros.md b/.github/contributors/nsorros.md new file mode 100644 index 000000000..a449c52e1 --- /dev/null +++ b/.github/contributors/nsorros.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Nick Sorros | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2/8/2021 | +| GitHub username | nsorros | +| Website (optional) | | diff --git a/spacy/cli/project/pull.py b/spacy/cli/project/pull.py index b88387a9f..f121ef0a0 100644 --- a/spacy/cli/project/pull.py +++ b/spacy/cli/project/pull.py @@ -2,7 +2,7 @@ from pathlib import Path from wasabi import msg from .remote_storage import RemoteStorage from .remote_storage import get_command_hash -from .._util import project_cli, Arg +from .._util import project_cli, Arg, logger from .._util import load_project_config from .run import update_lockfile @@ -39,11 +39,13 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False): # in the list. while commands: for i, cmd in enumerate(list(commands)): + logger.debug(f"CMD: {cmd['name']}.") deps = [project_dir / dep for dep in cmd.get("deps", [])] if all(dep.exists() for dep in deps): cmd_hash = get_command_hash("", "", deps, cmd["script"]) for output_path in cmd.get("outputs", []): url = storage.pull(output_path, command_hash=cmd_hash) + logger.debug(f"URL: {url} for {output_path} with command hash {cmd_hash}") yield url, output_path out_locs = [project_dir / out for out in cmd.get("outputs", [])] @@ -53,6 +55,8 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False): # we iterate over the loop again. commands.pop(i) break + else: + logger.debug(f"Dependency missing. Skipping {cmd['name']} outputs.") else: # If we didn't break the for loop, break the while loop. break diff --git a/spacy/cli/project/push.py b/spacy/cli/project/push.py index 44050b716..98a01d6dd 100644 --- a/spacy/cli/project/push.py +++ b/spacy/cli/project/push.py @@ -3,7 +3,7 @@ from wasabi import msg from .remote_storage import RemoteStorage from .remote_storage import get_content_hash, get_command_hash from .._util import load_project_config -from .._util import project_cli, Arg +from .._util import project_cli, Arg, logger @project_cli.command("push") @@ -37,12 +37,15 @@ def project_push(project_dir: Path, remote: str): remote = config["remotes"][remote] storage = RemoteStorage(project_dir, remote) for cmd in config.get("commands", []): + logger.debug(f"CMD: cmd['name']") deps = [project_dir / dep for dep in cmd.get("deps", [])] if any(not dep.exists() for dep in deps): + logger.debug(f"Dependency missing. 
Skipping {cmd['name']} outputs") continue cmd_hash = get_command_hash( "", "", [project_dir / dep for dep in cmd.get("deps", [])], cmd["script"] ) + logger.debug(f"CMD_HASH: {cmd_hash}") for output_path in cmd.get("outputs", []): output_loc = project_dir / output_path if output_loc.exists() and _is_not_empty_dir(output_loc): @@ -51,6 +54,7 @@ def project_push(project_dir: Path, remote: str): command_hash=cmd_hash, content_hash=get_content_hash(output_loc), ) + logger.debug(f"URL: {url} for output {output_path} with cmd_hash {cmd_hash}") yield output_path, url From 9ad3b8cf8d3ebf52c10537ee7459b57393ad3445 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Mon, 2 Aug 2021 18:22:35 +0200 Subject: [PATCH 13/22] Only add sourced vectors hashes to meta if necessary (#8830) --- spacy/language.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/spacy/language.py b/spacy/language.py index 589dca2bf..14b423be6 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -1698,7 +1698,6 @@ class Language: # them here so they're only loaded once source_nlps = {} source_nlp_vectors_hashes = {} - nlp.meta["_sourced_vectors_hashes"] = {} for pipe_name in config["nlp"]["pipeline"]: if pipe_name not in pipeline: opts = ", ".join(pipeline.keys()) @@ -1747,6 +1746,8 @@ class Language: source_nlp_vectors_hashes[model] = hash( source_nlps[model].vocab.vectors.to_bytes() ) + if "_sourced_vectors_hashes" not in nlp.meta: + nlp.meta["_sourced_vectors_hashes"] = {} nlp.meta["_sourced_vectors_hashes"][ pipe_name ] = source_nlp_vectors_hashes[model] From fbbbda195446be61eb8a3ea4930884e059c28e64 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Mon, 2 Aug 2021 19:07:19 +0200 Subject: [PATCH 14/22] Fix start/end chars for empty and out-of-bounds spans (#8816) --- spacy/tests/doc/test_span.py | 30 ++++++++++++++++++++++++++++++ spacy/tokens/span.pyx | 9 +++++++-- 2 files changed, 37 insertions(+), 2 deletions(-) diff --git a/spacy/tests/doc/test_span.py b/spacy/tests/doc/test_span.py index 6e34f2126..01b022b9d 100644 --- a/spacy/tests/doc/test_span.py +++ b/spacy/tests/doc/test_span.py @@ -357,6 +357,9 @@ def test_span_eq_hash(doc, doc_not_parsed): assert hash(doc[0:2]) != hash(doc[1:3]) assert hash(doc[0:2]) != hash(doc_not_parsed[0:2]) + # check that an out-of-bounds is not equivalent to the span of the full doc + assert doc[0 : len(doc)] != doc[len(doc) : len(doc) + 1] + def test_span_boundaries(doc): start = 1 @@ -369,6 +372,33 @@ def test_span_boundaries(doc): with pytest.raises(IndexError): span[5] + empty_span_0 = doc[0:0] + assert empty_span_0.text == "" + assert empty_span_0.start == 0 + assert empty_span_0.end == 0 + assert empty_span_0.start_char == 0 + assert empty_span_0.end_char == 0 + + empty_span_1 = doc[1:1] + assert empty_span_1.text == "" + assert empty_span_1.start == 1 + assert empty_span_1.end == 1 + assert empty_span_1.start_char == empty_span_1.end_char + + oob_span_start = doc[-len(doc) - 1 : -len(doc) - 10] + assert oob_span_start.text == "" + assert oob_span_start.start == 0 + assert oob_span_start.end == 0 + assert oob_span_start.start_char == 0 + assert oob_span_start.end_char == 0 + + oob_span_end = doc[len(doc) + 1 : len(doc) + 10] + assert oob_span_end.text == "" + assert oob_span_end.start == len(doc) + assert oob_span_end.end == len(doc) + assert oob_span_end.start_char == len(doc.text) + assert oob_span_end.end_char == len(doc.text) + def test_span_lemma(doc): # span lemmas should have the same number of spaces as the span diff --git a/spacy/tokens/span.pyx 
b/spacy/tokens/span.pyx index 093b2a4da..48c6053c1 100644 --- a/spacy/tokens/span.pyx +++ b/spacy/tokens/span.pyx @@ -105,13 +105,18 @@ cdef class Span: if label not in doc.vocab.strings: raise ValueError(Errors.E084.format(label=label)) + start_char = doc[start].idx if start < doc.length else len(doc.text) + if start == end: + end_char = start_char + else: + end_char = doc[end - 1].idx + len(doc[end - 1]) self.c = SpanC( label=label, kb_id=kb_id, start=start, end=end, - start_char=doc[start].idx if start < doc.length else 0, - end_char=doc[end - 1].idx + len(doc[end - 1]) if end >= 1 else 0, + start_char=start_char, + end_char=end_char, ) self._vector = vector self._vector_norm = vector_norm From 175847f92cb89f2515f848c5eda8872af247954f Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Mon, 2 Aug 2021 19:39:26 +0200 Subject: [PATCH 15/22] Support list values and INTERSECTS in Matcher (#8784) * Support list values and IS_INTERSECT in Matcher * Support list values as token attributes for set operators, not just as pattern values. * Add `IS_INTERSECT` operator. * Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs. * Rename IS_INTERSECT to INTERSECTS --- spacy/matcher/matcher.pyx | 15 +++-- spacy/schemas.py | 6 +- spacy/tests/matcher/test_matcher_api.py | 76 +++++++++++++++++++++++ website/docs/api/matcher.md | 15 ++--- website/docs/usage/rule-based-matching.md | 15 ++--- 5 files changed, 106 insertions(+), 21 deletions(-) diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index 7b1cfb633..555766f62 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -845,7 +845,7 @@ class _RegexPredicate: class _SetPredicate: - operators = ("IN", "NOT_IN", "IS_SUBSET", "IS_SUPERSET") + operators = ("IN", "NOT_IN", "IS_SUBSET", "IS_SUPERSET", "INTERSECTS") def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None): self.i = i @@ -868,14 +868,16 @@ class _SetPredicate: else: value = get_token_attr_for_matcher(token.c, self.attr) - if self.predicate in ("IS_SUBSET", "IS_SUPERSET"): + if self.predicate in ("IS_SUBSET", "IS_SUPERSET", "INTERSECTS"): if self.attr == MORPH: # break up MORPH into individual Feat=Val values value = set(get_string_id(v) for v in MorphAnalysis.from_id(self.vocab, value)) else: - # IS_SUBSET for other attrs will be equivalent to "IN" - # IS_SUPERSET will only match for other attrs with 0 or 1 values - value = set([value]) + # treat a single value as a list + if isinstance(value, (str, int)): + value = set([get_string_id(value)]) + else: + value = set(get_string_id(v) for v in value) if self.predicate == "IN": return value in self.value elif self.predicate == "NOT_IN": @@ -884,6 +886,8 @@ class _SetPredicate: return value <= self.value elif self.predicate == "IS_SUPERSET": return value >= self.value + elif self.predicate == "INTERSECTS": + return bool(value & self.value) def __repr__(self): return repr(("SetPredicate", self.i, self.attr, self.value, self.predicate)) @@ -928,6 +932,7 @@ def _get_extra_predicates(spec, extra_predicates, vocab): "NOT_IN": _SetPredicate, "IS_SUBSET": _SetPredicate, "IS_SUPERSET": _SetPredicate, + "INTERSECTS": _SetPredicate, "==": _ComparisonPredicate, "!=": _ComparisonPredicate, ">=": _ComparisonPredicate, diff --git a/spacy/schemas.py b/spacy/schemas.py index 992e17d70..83623b104 100644 --- a/spacy/schemas.py +++ b/spacy/schemas.py @@ -159,6 +159,7 @@ class TokenPatternString(BaseModel): NOT_IN: Optional[List[StrictStr]] = Field(None, alias="not_in") IS_SUBSET: Optional[List[StrictStr]] = 
Field(None, alias="is_subset") IS_SUPERSET: Optional[List[StrictStr]] = Field(None, alias="is_superset") + INTERSECTS: Optional[List[StrictStr]] = Field(None, alias="intersects") class Config: extra = "forbid" @@ -175,8 +176,9 @@ class TokenPatternNumber(BaseModel): REGEX: Optional[StrictStr] = Field(None, alias="regex") IN: Optional[List[StrictInt]] = Field(None, alias="in") NOT_IN: Optional[List[StrictInt]] = Field(None, alias="not_in") - ISSUBSET: Optional[List[StrictInt]] = Field(None, alias="issubset") - ISSUPERSET: Optional[List[StrictInt]] = Field(None, alias="issuperset") + IS_SUBSET: Optional[List[StrictInt]] = Field(None, alias="is_subset") + IS_SUPERSET: Optional[List[StrictInt]] = Field(None, alias="is_superset") + INTERSECTS: Optional[List[StrictInt]] = Field(None, alias="intersects") EQ: Union[StrictInt, StrictFloat] = Field(None, alias="==") NEQ: Union[StrictInt, StrictFloat] = Field(None, alias="!=") GEQ: Union[StrictInt, StrictFloat] = Field(None, alias=">=") diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index e0f655bbe..a42735eae 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -270,6 +270,16 @@ def test_matcher_subset_value_operator(en_vocab): doc[0].tag_ = "A" assert len(matcher(doc)) == 0 + # IS_SUBSET with a list value + Token.set_extension("ext", default=[]) + matcher = Matcher(en_vocab) + pattern = [{"_": {"ext": {"IS_SUBSET": ["A", "B"]}}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0]._.ext = ["A"] + doc[1]._.ext = ["C", "D"] + assert len(matcher(doc)) == 2 + def test_matcher_superset_value_operator(en_vocab): matcher = Matcher(en_vocab) @@ -308,6 +318,72 @@ def test_matcher_superset_value_operator(en_vocab): doc[0].tag_ = "A" assert len(matcher(doc)) == 3 + # IS_SUPERSET with a list value + Token.set_extension("ext", default=[]) + matcher = Matcher(en_vocab) + pattern = [{"_": {"ext": {"IS_SUPERSET": ["A"]}}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0]._.ext = ["A", "B"] + assert len(matcher(doc)) == 1 + + +def test_matcher_intersect_value_operator(en_vocab): + matcher = Matcher(en_vocab) + pattern = [{"MORPH": {"INTERSECTS": ["Feat=Val", "Feat2=Val2", "Feat3=Val3"]}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + assert len(matcher(doc)) == 0 + doc[0].set_morph("Feat=Val") + assert len(matcher(doc)) == 1 + doc[0].set_morph("Feat=Val|Feat2=Val2") + assert len(matcher(doc)) == 1 + doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3") + assert len(matcher(doc)) == 1 + doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3|Feat4=Val4") + assert len(matcher(doc)) == 1 + + # INTERSECTS with a single value is the same as IN + matcher = Matcher(en_vocab) + pattern = [{"TAG": {"INTERSECTS": ["A", "B"]}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0].tag_ = "A" + assert len(matcher(doc)) == 1 + + # INTERSECTS with an empty pattern list matches nothing + matcher = Matcher(en_vocab) + pattern = [{"TAG": {"INTERSECTS": []}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0].tag_ = "A" + assert len(matcher(doc)) == 0 + + # INTERSECTS with a list value + Token.set_extension("ext", default=[]) + matcher = Matcher(en_vocab) + pattern = [{"_": {"ext": {"INTERSECTS": ["A", "C"]}}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0]._.ext = ["A", "B"] + assert len(matcher(doc)) 
== 1 + + # INTERSECTS with an empty pattern list matches nothing + matcher = Matcher(en_vocab) + pattern = [{"_": {"ext": {"INTERSECTS": []}}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0]._.ext = ["A", "B"] + assert len(matcher(doc)) == 0 + + # INTERSECTS with an empty value matches nothing + matcher = Matcher(en_vocab) + pattern = [{"_": {"ext": {"INTERSECTS": ["A", "B"]}}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0]._.ext = [] + assert len(matcher(doc)) == 0 + def test_matcher_morph_handling(en_vocab): # order of features in pattern doesn't matter diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md index 9c15f8797..c34560dec 100644 --- a/website/docs/api/matcher.md +++ b/website/docs/api/matcher.md @@ -77,13 +77,14 @@ it compares to another value. > ] > ``` -| Attribute | Description | -| -------------------------- | ------------------------------------------------------------------------------------------------------- | -| `IN` | Attribute value is member of a list. ~~Any~~ | -| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ | -| `ISSUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ | -| `ISSUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ | -| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ | +| Attribute | Description | +| -------------------------- | -------------------------------------------------------------------------------------------------------- | +| `IN` | Attribute value is member of a list. ~~Any~~ | +| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ | +| `IS_SUBSET` | Attribute value (for `MORPH` or custom list attributes) is a subset of a list. ~~Any~~ | +| `IS_SUPERSET` | Attribute value (for `MORPH` or custom list attributes) is a superset of a list. ~~Any~~ | +| `INTERSECTS` | Attribute value (for `MORPH` or custom list attribute) has a non-empty intersection with a list. ~~Any~~ | +| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ | ## Matcher.\_\_init\_\_ {#init tag="method"} diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index b718ef2b2..81c838584 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -240,13 +240,14 @@ following rich comparison attributes are available: > # "Number=Sing|Gender=Neut|Polite=Infm" will not match because it's a superset > ``` -| Attribute | Description | -| -------------------------- | ------------------------------------------------------------------------------------------------------- | -| `IN` | Attribute value is member of a list. ~~Any~~ | -| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ | -| `IS_SUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ | -| `IS_SUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ | -| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ | +| Attribute | Description | +| -------------------------- | --------------------------------------------------------------------------------------------------------- | +| `IN` | Attribute value is member of a list. ~~Any~~ | +| `NOT_IN` | Attribute value is _not_ member of a list. 
~~Any~~ | +| `IS_SUBSET` | Attribute value (for `MORPH` or custom list attributes) is a subset of a list. ~~Any~~ | +| `IS_SUPERSET` | Attribute value (for `MORPH` or custom list attributes) is a superset of a list. ~~Any~~ | +| `INTERSECTS` | Attribute value (for `MORPH` or custom list attributes) has a non-empty intersection with a list. ~~Any~~ | +| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ | #### Regular expressions {#regex new="2.1"} From 941a591f3cb5d1406e1816417eb526529cdc4d27 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Tue, 3 Aug 2021 14:42:44 +0200 Subject: [PATCH 16/22] Pass excludes when serializing vocab (#8824) * Pass excludes when serializing vocab Additional minor bug fix: * Deserialize vocab in `EntityLinker.from_disk` * Add test for excluding strings on load * Fix formatting --- spacy/language.py | 8 ++++---- spacy/pipeline/attributeruler.py | 8 ++++---- spacy/pipeline/entity_linker.py | 7 ++++--- spacy/pipeline/lemmatizer.py | 8 ++++---- spacy/pipeline/trainable_pipe.pyx | 8 ++++---- spacy/pipeline/transition_parser.pyx | 8 ++++---- .../serialize/test_serialize_pipeline.py | 20 ++++++++++++++++++- spacy/tokenizer.pyx | 4 ++-- 8 files changed, 45 insertions(+), 26 deletions(-) diff --git a/spacy/language.py b/spacy/language.py index 14b423be6..a8cad1259 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -1909,7 +1909,7 @@ class Language: if not hasattr(proc, "to_disk"): continue serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"]) - serializers["vocab"] = lambda p: self.vocab.to_disk(p) + serializers["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude) util.to_disk(path, serializers, exclude) def from_disk( @@ -1940,7 +1940,7 @@ class Language: def deserialize_vocab(path: Path) -> None: if path.exists(): - self.vocab.from_disk(path) + self.vocab.from_disk(path, exclude=exclude) path = util.ensure_path(path) deserializers = {} @@ -1978,7 +1978,7 @@ class Language: DOCS: https://spacy.io/api/language#to_bytes """ serializers = {} - serializers["vocab"] = lambda: self.vocab.to_bytes() + serializers["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude) serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"]) serializers["meta.json"] = lambda: srsly.json_dumps(self.meta) serializers["config.cfg"] = lambda: self.config.to_bytes() @@ -2014,7 +2014,7 @@ class Language: b, interpolate=False ) deserializers["meta.json"] = deserialize_meta - deserializers["vocab"] = self.vocab.from_bytes + deserializers["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude) deserializers["tokenizer"] = lambda b: self.tokenizer.from_bytes( b, exclude=["vocab"] ) diff --git a/spacy/pipeline/attributeruler.py b/spacy/pipeline/attributeruler.py index a6efd5906..f95a5a48c 100644 --- a/spacy/pipeline/attributeruler.py +++ b/spacy/pipeline/attributeruler.py @@ -276,7 +276,7 @@ class AttributeRuler(Pipe): DOCS: https://spacy.io/api/attributeruler#to_bytes """ serialize = {} - serialize["vocab"] = self.vocab.to_bytes + serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude) serialize["patterns"] = lambda: srsly.msgpack_dumps(self.patterns) return util.to_bytes(serialize, exclude) @@ -296,7 +296,7 @@ class AttributeRuler(Pipe): self.add_patterns(srsly.msgpack_loads(b)) deserialize = { - "vocab": lambda b: self.vocab.from_bytes(b), + "vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude), "patterns": load_patterns, } 
util.from_bytes(bytes_data, deserialize, exclude) @@ -313,7 +313,7 @@ class AttributeRuler(Pipe): DOCS: https://spacy.io/api/attributeruler#to_disk """ serialize = { - "vocab": lambda p: self.vocab.to_disk(p), + "vocab": lambda p: self.vocab.to_disk(p, exclude=exclude), "patterns": lambda p: srsly.write_msgpack(p, self.patterns), } util.to_disk(path, serialize, exclude) @@ -334,7 +334,7 @@ class AttributeRuler(Pipe): self.add_patterns(srsly.read_msgpack(p)) deserialize = { - "vocab": lambda p: self.vocab.from_disk(p), + "vocab": lambda p: self.vocab.from_disk(p, exclude=exclude), "patterns": load_patterns, } util.from_disk(path, deserialize, exclude) diff --git a/spacy/pipeline/entity_linker.py b/spacy/pipeline/entity_linker.py index ba7e71f15..7b52025bc 100644 --- a/spacy/pipeline/entity_linker.py +++ b/spacy/pipeline/entity_linker.py @@ -412,7 +412,7 @@ class EntityLinker(TrainablePipe): serialize = {} if hasattr(self, "cfg") and self.cfg is not None: serialize["cfg"] = lambda: srsly.json_dumps(self.cfg) - serialize["vocab"] = self.vocab.to_bytes + serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude) serialize["kb"] = self.kb.to_bytes serialize["model"] = self.model.to_bytes return util.to_bytes(serialize, exclude) @@ -436,7 +436,7 @@ class EntityLinker(TrainablePipe): deserialize = {} if hasattr(self, "cfg") and self.cfg is not None: deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b)) - deserialize["vocab"] = lambda b: self.vocab.from_bytes(b) + deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude) deserialize["kb"] = lambda b: self.kb.from_bytes(b) deserialize["model"] = load_model util.from_bytes(bytes_data, deserialize, exclude) @@ -453,7 +453,7 @@ class EntityLinker(TrainablePipe): DOCS: https://spacy.io/api/entitylinker#to_disk """ serialize = {} - serialize["vocab"] = lambda p: self.vocab.to_disk(p) + serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude) serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg) serialize["kb"] = lambda p: self.kb.to_disk(p) serialize["model"] = lambda p: self.model.to_disk(p) @@ -480,6 +480,7 @@ class EntityLinker(TrainablePipe): deserialize = {} deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p)) + deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude) deserialize["kb"] = lambda p: self.kb.from_disk(p) deserialize["model"] = load_model util.from_disk(path, deserialize, exclude) diff --git a/spacy/pipeline/lemmatizer.py b/spacy/pipeline/lemmatizer.py index 87504fade..2f436c57a 100644 --- a/spacy/pipeline/lemmatizer.py +++ b/spacy/pipeline/lemmatizer.py @@ -269,7 +269,7 @@ class Lemmatizer(Pipe): DOCS: https://spacy.io/api/lemmatizer#to_disk """ serialize = {} - serialize["vocab"] = lambda p: self.vocab.to_disk(p) + serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude) serialize["lookups"] = lambda p: self.lookups.to_disk(p) util.to_disk(path, serialize, exclude) @@ -285,7 +285,7 @@ class Lemmatizer(Pipe): DOCS: https://spacy.io/api/lemmatizer#from_disk """ deserialize = {} - deserialize["vocab"] = lambda p: self.vocab.from_disk(p) + deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude) deserialize["lookups"] = lambda p: self.lookups.from_disk(p) util.from_disk(path, deserialize, exclude) self._validate_tables() @@ -300,7 +300,7 @@ class Lemmatizer(Pipe): DOCS: https://spacy.io/api/lemmatizer#to_bytes """ serialize = {} - serialize["vocab"] = self.vocab.to_bytes + serialize["vocab"] = lambda: 
self.vocab.to_bytes(exclude=exclude) serialize["lookups"] = self.lookups.to_bytes return util.to_bytes(serialize, exclude) @@ -316,7 +316,7 @@ class Lemmatizer(Pipe): DOCS: https://spacy.io/api/lemmatizer#from_bytes """ deserialize = {} - deserialize["vocab"] = lambda b: self.vocab.from_bytes(b) + deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude) deserialize["lookups"] = lambda b: self.lookups.from_bytes(b) util.from_bytes(bytes_data, deserialize, exclude) self._validate_tables() diff --git a/spacy/pipeline/trainable_pipe.pyx b/spacy/pipeline/trainable_pipe.pyx index ce1e133a2..76b0733cf 100644 --- a/spacy/pipeline/trainable_pipe.pyx +++ b/spacy/pipeline/trainable_pipe.pyx @@ -273,7 +273,7 @@ cdef class TrainablePipe(Pipe): serialize = {} if hasattr(self, "cfg") and self.cfg is not None: serialize["cfg"] = lambda: srsly.json_dumps(self.cfg) - serialize["vocab"] = self.vocab.to_bytes + serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude) serialize["model"] = self.model.to_bytes return util.to_bytes(serialize, exclude) @@ -296,7 +296,7 @@ cdef class TrainablePipe(Pipe): deserialize = {} if hasattr(self, "cfg") and self.cfg is not None: deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b)) - deserialize["vocab"] = lambda b: self.vocab.from_bytes(b) + deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude) deserialize["model"] = load_model util.from_bytes(bytes_data, deserialize, exclude) return self @@ -313,7 +313,7 @@ cdef class TrainablePipe(Pipe): serialize = {} if hasattr(self, "cfg") and self.cfg is not None: serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg) - serialize["vocab"] = lambda p: self.vocab.to_disk(p) + serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude) serialize["model"] = lambda p: self.model.to_disk(p) util.to_disk(path, serialize, exclude) @@ -338,7 +338,7 @@ cdef class TrainablePipe(Pipe): deserialize = {} if hasattr(self, "cfg") and self.cfg is not None: deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p)) - deserialize["vocab"] = lambda p: self.vocab.from_disk(p) + deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude) deserialize["model"] = load_model util.from_disk(path, deserialize, exclude) return self diff --git a/spacy/pipeline/transition_parser.pyx b/spacy/pipeline/transition_parser.pyx index a495b1bc7..5e11f5972 100644 --- a/spacy/pipeline/transition_parser.pyx +++ b/spacy/pipeline/transition_parser.pyx @@ -569,7 +569,7 @@ cdef class Parser(TrainablePipe): def to_disk(self, path, exclude=tuple()): serializers = { "model": lambda p: (self.model.to_disk(p) if self.model is not True else True), - "vocab": lambda p: self.vocab.to_disk(p), + "vocab": lambda p: self.vocab.to_disk(p, exclude=exclude), "moves": lambda p: self.moves.to_disk(p, exclude=["strings"]), "cfg": lambda p: srsly.write_json(p, self.cfg) } @@ -577,7 +577,7 @@ cdef class Parser(TrainablePipe): def from_disk(self, path, exclude=tuple()): deserializers = { - "vocab": lambda p: self.vocab.from_disk(p), + "vocab": lambda p: self.vocab.from_disk(p, exclude=exclude), "moves": lambda p: self.moves.from_disk(p, exclude=["strings"]), "cfg": lambda p: self.cfg.update(srsly.read_json(p)), "model": lambda p: None, @@ -597,7 +597,7 @@ cdef class Parser(TrainablePipe): def to_bytes(self, exclude=tuple()): serializers = { "model": lambda: (self.model.to_bytes()), - "vocab": lambda: self.vocab.to_bytes(), + "vocab": lambda: self.vocab.to_bytes(exclude=exclude), "moves": lambda: 
self.moves.to_bytes(exclude=["strings"]), "cfg": lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True) } @@ -605,7 +605,7 @@ cdef class Parser(TrainablePipe): def from_bytes(self, bytes_data, exclude=tuple()): deserializers = { - "vocab": lambda b: self.vocab.from_bytes(b), + "vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude), "moves": lambda b: self.moves.from_bytes(b, exclude=["strings"]), "cfg": lambda b: self.cfg.update(srsly.json_loads(b)), "model": lambda b: None, diff --git a/spacy/tests/serialize/test_serialize_pipeline.py b/spacy/tests/serialize/test_serialize_pipeline.py index c8162a690..05871a524 100644 --- a/spacy/tests/serialize/test_serialize_pipeline.py +++ b/spacy/tests/serialize/test_serialize_pipeline.py @@ -1,5 +1,5 @@ import pytest -from spacy import registry, Vocab +from spacy import registry, Vocab, load from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer from spacy.pipeline import TextCategorizer, SentenceRecognizer, TrainablePipe from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL @@ -268,3 +268,21 @@ def test_serialize_custom_trainable_pipe(): pipe.to_disk(d) new_pipe = CustomPipe(Vocab(), Linear()).from_disk(d) assert new_pipe.to_bytes() == pipe_bytes + + +def test_load_without_strings(): + nlp = spacy.blank("en") + orig_strings_length = len(nlp.vocab.strings) + word = "unlikely_word_" * 20 + nlp.vocab.strings.add(word) + assert len(nlp.vocab.strings) == orig_strings_length + 1 + with make_tempdir() as d: + nlp.to_disk(d) + # reload with strings + reloaded_nlp = load(d) + assert len(nlp.vocab.strings) == len(reloaded_nlp.vocab.strings) + assert word in reloaded_nlp.vocab.strings + # reload without strings + reloaded_nlp = load(d, exclude=["strings"]) + assert orig_strings_length == len(reloaded_nlp.vocab.strings) + assert word not in reloaded_nlp.vocab.strings diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx index 61a7582b1..5a89e5a17 100644 --- a/spacy/tokenizer.pyx +++ b/spacy/tokenizer.pyx @@ -765,7 +765,7 @@ cdef class Tokenizer: DOCS: https://spacy.io/api/tokenizer#to_bytes """ serializers = { - "vocab": lambda: self.vocab.to_bytes(), + "vocab": lambda: self.vocab.to_bytes(exclude=exclude), "prefix_search": lambda: _get_regex_pattern(self.prefix_search), "suffix_search": lambda: _get_regex_pattern(self.suffix_search), "infix_finditer": lambda: _get_regex_pattern(self.infix_finditer), @@ -786,7 +786,7 @@ cdef class Tokenizer: """ data = {} deserializers = { - "vocab": lambda b: self.vocab.from_bytes(b), + "vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude), "prefix_search": lambda b: data.setdefault("prefix_search", b), "suffix_search": lambda b: data.setdefault("suffix_search", b), "infix_finditer": lambda b: data.setdefault("infix_finditer", b), From 77d698dcae9d81767cced2ce307d4cb89060358e Mon Sep 17 00:00:00 2001 From: Paul O'Leary McCann Date: Wed, 4 Aug 2021 16:20:41 +0900 Subject: [PATCH 17/22] Fix check for RIGHT_ATTRS in dep matcher (#8807) * Fix check for RIGHT_ATTRs in dep matcher If a non-anchor node does not have RIGHT_ATTRS, the dep matcher throws an E100, which says that non-anchor nodes must have LEFT_ID, REL_OP, and RIGHT_ID. It specifically does not say RIGHT_ATTRS is required. A blank RIGHT_ATTRS is also valid, and patterns with one will be excepted. While not normal, sometimes a REL_OP is enough to specify a non-anchor node - maybe you just want the head of another node unconditionally, for example. This change just sets RIGHT_ATTRS to {} if not present. 
Alternatively changing E100 to state RIGHT_ATTRS is required could also be reasonable. * Fix test This test was written on the assumption that if `RIGHT_ATTRS` isn't present an error will be raised. Since the proposed changes make it so an error won't be raised this is no longer necessary. * Revert test, update error message Error message now lists missing keys, and RIGHT_ATTRS is required. * Use list of required keys in error message Also removes unused key param arg. --- spacy/errors.py | 4 ++-- spacy/matcher/dependencymatcher.pyx | 18 +++++++++++------- 2 files changed, 13 insertions(+), 9 deletions(-) diff --git a/spacy/errors.py b/spacy/errors.py index 5651ab0fa..9264ca6d1 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -356,8 +356,8 @@ class Errors: E098 = ("Invalid pattern: expected both RIGHT_ID and RIGHT_ATTRS.") E099 = ("Invalid pattern: the first node of pattern should be an anchor " "node. The node should only contain RIGHT_ID and RIGHT_ATTRS.") - E100 = ("Nodes other than the anchor node should all contain LEFT_ID, " - "REL_OP and RIGHT_ID.") + E100 = ("Nodes other than the anchor node should all contain {required}, " + "but these are missing: {missing}") E101 = ("RIGHT_ID should be a new node and LEFT_ID should already have " "have been declared in previous edges.") E102 = ("Can't merge non-disjoint spans. '{token}' is already part of " diff --git a/spacy/matcher/dependencymatcher.pyx b/spacy/matcher/dependencymatcher.pyx index b6e84a5da..9e0842d59 100644 --- a/spacy/matcher/dependencymatcher.pyx +++ b/spacy/matcher/dependencymatcher.pyx @@ -122,13 +122,17 @@ cdef class DependencyMatcher: raise ValueError(Errors.E099.format(key=key)) visited_nodes[relation["RIGHT_ID"]] = True else: - if not( - "RIGHT_ID" in relation - and "RIGHT_ATTRS" in relation - and "REL_OP" in relation - and "LEFT_ID" in relation - ): - raise ValueError(Errors.E100.format(key=key)) + required_keys = set( + ("RIGHT_ID", "RIGHT_ATTRS", "REL_OP", "LEFT_ID") + ) + relation_keys = set(relation.keys()) + missing = required_keys - relation_keys + if missing: + missing_txt = ", ".join(list(missing)) + raise ValueError(Errors.E100.format( + required=required_keys, + missing=missing_txt + )) if ( relation["RIGHT_ID"] in visited_nodes or relation["LEFT_ID"] not in visited_nodes From fa2e7a4bbffe34bc6164818b1d52a3e463e7ea99 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Wed, 4 Aug 2021 14:29:43 +0200 Subject: [PATCH 18/22] Fix spancat tests on GPU (#8872) * Fix spancat tests on GPU * Fix more spancat tests --- spacy/tests/pipeline/test_spancat.py | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/spacy/tests/pipeline/test_spancat.py b/spacy/tests/pipeline/test_spancat.py index 0364abf73..a0ef7e3f0 100644 --- a/spacy/tests/pipeline/test_spancat.py +++ b/spacy/tests/pipeline/test_spancat.py @@ -1,9 +1,11 @@ import pytest -from numpy.testing import assert_equal +from numpy.testing import assert_equal, assert_array_equal +from thinc.api import get_current_ops from spacy.language import Language from spacy.training import Example from spacy.util import fix_random_seed, registry +OPS = get_current_ops() SPAN_KEY = "labeled_spans" @@ -116,12 +118,12 @@ def test_ngram_suggester(en_tokenizer): for span in spans: assert 0 <= span[0] < len(doc) assert 0 < span[1] <= len(doc) - spans_set.add((span[0], span[1])) + spans_set.add((int(span[0]), int(span[1]))) # spans are unique assert spans.shape[0] == len(spans_set) offset += ngrams.lengths[i] # the number of spans is correct - 
assert_equal(ngrams.lengths, [max(0, len(doc) - (size - 1)) for doc in docs]) + assert_array_equal(OPS.to_numpy(ngrams.lengths), [max(0, len(doc) - (size - 1)) for doc in docs]) # test 1-3-gram suggestions ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1, 2, 3]) @@ -129,9 +131,9 @@ def test_ngram_suggester(en_tokenizer): en_tokenizer(text) for text in ["a", "a b", "a b c", "a b c d", "a b c d e"] ] ngrams = ngram_suggester(docs) - assert_equal(ngrams.lengths, [1, 3, 6, 9, 12]) - assert_equal( - ngrams.data, + assert_array_equal(OPS.to_numpy(ngrams.lengths), [1, 3, 6, 9, 12]) + assert_array_equal( + OPS.to_numpy(ngrams.data), [ # doc 0 [0, 1], @@ -176,13 +178,13 @@ def test_ngram_suggester(en_tokenizer): ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1]) docs = [en_tokenizer(text) for text in ["", "a", ""]] ngrams = ngram_suggester(docs) - assert_equal(ngrams.lengths, [len(doc) for doc in docs]) + assert_array_equal(OPS.to_numpy(ngrams.lengths), [len(doc) for doc in docs]) # test all empty docs ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1]) docs = [en_tokenizer(text) for text in ["", "", ""]] ngrams = ngram_suggester(docs) - assert_equal(ngrams.lengths, [len(doc) for doc in docs]) + assert_array_equal(OPS.to_numpy(ngrams.lengths), [len(doc) for doc in docs]) def test_ngram_sizes(en_tokenizer): @@ -195,12 +197,12 @@ def test_ngram_sizes(en_tokenizer): ] ngrams_1 = size_suggester(docs) ngrams_2 = range_suggester(docs) - assert_equal(ngrams_1.lengths, [1, 3, 6, 9, 12]) - assert_equal(ngrams_1.lengths, ngrams_2.lengths) - assert_equal(ngrams_1.data, ngrams_2.data) + assert_array_equal(OPS.to_numpy(ngrams_1.lengths), [1, 3, 6, 9, 12]) + assert_array_equal(OPS.to_numpy(ngrams_1.lengths), OPS.to_numpy(ngrams_2.lengths)) + assert_array_equal(OPS.to_numpy(ngrams_1.data), OPS.to_numpy(ngrams_2.data)) # one more variation suggester_factory = registry.misc.get("spacy.ngram_range_suggester.v1") range_suggester = suggester_factory(min_size=2, max_size=4) ngrams_3 = range_suggester(docs) - assert_equal(ngrams_3.lengths, [0, 1, 3, 6, 9]) + assert_array_equal(OPS.to_numpy(ngrams_3.lengths), [0, 1, 3, 6, 9]) From 1dfffe5fb445196d72d6064d340e698f79a49b60 Mon Sep 17 00:00:00 2001 From: Kabir Khan Date: Thu, 5 Aug 2021 00:21:22 -0700 Subject: [PATCH 19/22] No output info message in train (#8885) * Add info message that no output directory was provided in train * Update train.py * Fix logging --- spacy/cli/train.py | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index 2932edd3b..9fd87dbc1 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -43,9 +43,13 @@ def train_cli( # Make sure all files and paths exists if they are needed if not config_path or (str(config_path) != "-" and not config_path.exists()): msg.fail("Config file not found", config_path, exits=1) - if output_path is not None and not output_path.exists(): - output_path.mkdir(parents=True) - msg.good(f"Created output directory: {output_path}") + if not output_path: + msg.info("No output directory provided") + else: + if not output_path.exists(): + output_path.mkdir(parents=True) + msg.good(f"Created output directory: {output_path}") + msg.info(f"Saving to output directory: {output_path}") overrides = parse_config_overrides(ctx.args) import_code(code_path) setup_gpu(use_gpu) From 56d4d87aebdaf3383c9ae8f482f47edd423de1cd Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" 
<41898282+github-actions[bot]@users.noreply.github.com> Date: Fri, 6 Aug 2021 13:38:06 +0200 Subject: [PATCH 20/22] Auto-format code with black (#8895) Co-authored-by: explosion-bot --- spacy/cli/project/pull.py | 4 +++- spacy/cli/project/push.py | 4 +++- spacy/tests/pipeline/test_spancat.py | 5 ++++- 3 files changed, 10 insertions(+), 3 deletions(-) diff --git a/spacy/cli/project/pull.py b/spacy/cli/project/pull.py index f121ef0a0..6e3cde88c 100644 --- a/spacy/cli/project/pull.py +++ b/spacy/cli/project/pull.py @@ -45,7 +45,9 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False): cmd_hash = get_command_hash("", "", deps, cmd["script"]) for output_path in cmd.get("outputs", []): url = storage.pull(output_path, command_hash=cmd_hash) - logger.debug(f"URL: {url} for {output_path} with command hash {cmd_hash}") + logger.debug( + f"URL: {url} for {output_path} with command hash {cmd_hash}" + ) yield url, output_path out_locs = [project_dir / out for out in cmd.get("outputs", [])] diff --git a/spacy/cli/project/push.py b/spacy/cli/project/push.py index 98a01d6dd..bc779e9cd 100644 --- a/spacy/cli/project/push.py +++ b/spacy/cli/project/push.py @@ -54,7 +54,9 @@ def project_push(project_dir: Path, remote: str): command_hash=cmd_hash, content_hash=get_content_hash(output_loc), ) - logger.debug(f"URL: {url} for output {output_path} with cmd_hash {cmd_hash}") + logger.debug( + f"URL: {url} for output {output_path} with cmd_hash {cmd_hash}" + ) yield output_path, url diff --git a/spacy/tests/pipeline/test_spancat.py b/spacy/tests/pipeline/test_spancat.py index a0ef7e3f0..6a5ae2c66 100644 --- a/spacy/tests/pipeline/test_spancat.py +++ b/spacy/tests/pipeline/test_spancat.py @@ -123,7 +123,10 @@ def test_ngram_suggester(en_tokenizer): assert spans.shape[0] == len(spans_set) offset += ngrams.lengths[i] # the number of spans is correct - assert_array_equal(OPS.to_numpy(ngrams.lengths), [max(0, len(doc) - (size - 1)) for doc in docs]) + assert_array_equal( + OPS.to_numpy(ngrams.lengths), + [max(0, len(doc) - (size - 1)) for doc in docs], + ) # test 1-3-gram suggestions ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1, 2, 3]) From 439f30faadea4b63efe01f9e79ecf048a08eeadd Mon Sep 17 00:00:00 2001 From: Eduard Zorita Date: Sat, 7 Aug 2021 12:30:03 +0200 Subject: [PATCH 21/22] Add stub files for main cython classes (#8427) * Add stub files for main API classes * Add contributor agreement for ezorita * Update types for ndarray and hash() * Fix __getitem__ and __iter__ * Add attributes of Doc and Token classes * Overload type hints for Span.__getitem__ * Fix type hint overload for Span.__getitem__ Co-authored-by: Luca Dorigo --- .github/contributors/ezorita.md | 106 ++++++++++++++++ spacy/lexeme.pyi | 61 ++++++++++ spacy/matcher/matcher.pyi | 41 +++++++ spacy/strings.pyi | 22 ++++ spacy/tokens/_retokenize.pyi | 17 +++ spacy/tokens/doc.pyi | 180 +++++++++++++++++++++++++++ spacy/tokens/morphanalysis.pyi | 20 +++ spacy/tokens/span.pyi | 124 +++++++++++++++++++ spacy/tokens/span_group.pyi | 24 ++++ spacy/tokens/token.pyi | 208 ++++++++++++++++++++++++++++++++ spacy/vocab.pyi | 78 ++++++++++++ 11 files changed, 881 insertions(+) create mode 100644 .github/contributors/ezorita.md create mode 100644 spacy/lexeme.pyi create mode 100644 spacy/matcher/matcher.pyi create mode 100644 spacy/strings.pyi create mode 100644 spacy/tokens/_retokenize.pyi create mode 100644 spacy/tokens/doc.pyi create mode 100644 spacy/tokens/morphanalysis.pyi create mode 100644 
spacy/tokens/span.pyi create mode 100644 spacy/tokens/span_group.pyi create mode 100644 spacy/tokens/token.pyi create mode 100644 spacy/vocab.pyi diff --git a/.github/contributors/ezorita.md b/.github/contributors/ezorita.md new file mode 100644 index 000000000..e5f3f5283 --- /dev/null +++ b/.github/contributors/ezorita.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. 
With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Eduard Zorita | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 06/17/2021 | +| GitHub username | ezorita | +| Website (optional) | | diff --git a/spacy/lexeme.pyi b/spacy/lexeme.pyi new file mode 100644 index 000000000..4eae6be43 --- /dev/null +++ b/spacy/lexeme.pyi @@ -0,0 +1,61 @@ +from typing import ( + Union, + Any, +) +from thinc.types import Floats1d +from .tokens import Doc, Span, Token +from .vocab import Vocab + +class Lexeme: + def __init__(self, vocab: Vocab, orth: int) -> None: ... + def __richcmp__(self, other: Lexeme, op: int) -> bool: ... + def __hash__(self) -> int: ... + def set_attrs(self, **attrs: Any) -> None: ... + def set_flag(self, flag_id: int, value: bool) -> None: ... + def check_flag(self, flag_id: int) -> bool: ... + def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ... + @property + def has_vector(self) -> bool: ... + @property + def vector_norm(self) -> float: ... + vector: Floats1d + rank: str + sentiment: float + @property + def orth_(self) -> str: ... + @property + def text(self) -> str: ... 
+ lower: str + norm: int + shape: int + prefix: int + suffix: int + cluster: int + lang: int + prob: float + lower_: str + norm_: str + shape_: str + prefix_: str + suffix_: str + lang_: str + flags: int + @property + def is_oov(self) -> bool: ... + is_stop: bool + is_alpha: bool + is_ascii: bool + is_digit: bool + is_lower: bool + is_upper: bool + is_title: bool + is_punct: bool + is_space: bool + is_bracket: bool + is_quote: bool + is_left_punct: bool + is_right_punct: bool + is_currency: bool + like_url: bool + like_num: bool + like_email: bool diff --git a/spacy/matcher/matcher.pyi b/spacy/matcher/matcher.pyi new file mode 100644 index 000000000..3be065bcd --- /dev/null +++ b/spacy/matcher/matcher.pyi @@ -0,0 +1,41 @@ +from typing import Any, List, Dict, Tuple, Optional, Callable, Union, Iterator, Iterable +from ..vocab import Vocab +from ..tokens import Doc, Span + +class Matcher: + def __init__(self, vocab: Vocab, validate: bool = ...) -> None: ... + def __reduce__(self) -> Any: ... + def __len__(self) -> int: ... + def __contains__(self, key: str) -> bool: ... + def add( + self, + key: str, + patterns: List[List[Dict[str, Any]]], + *, + on_match: Optional[ + Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any] + ] = ..., + greedy: Optional[str] = ... + ) -> None: ... + def remove(self, key: str) -> None: ... + def has_key(self, key: Union[str, int]) -> bool: ... + def get( + self, key: Union[str, int], default: Optional[Any] = ... + ) -> Tuple[Optional[Callable[[Any], Any]], List[List[Dict[Any, Any]]]]: ... + def pipe( + self, + docs: Iterable[Tuple[Doc, Any]], + batch_size: int = ..., + return_matches: bool = ..., + as_tuples: bool = ..., + ) -> Union[ + Iterator[Tuple[Tuple[Doc, Any], Any]], Iterator[Tuple[Doc, Any]], Iterator[Doc] + ]: ... + def __call__( + self, + doclike: Union[Doc, Span], + *, + as_spans: bool = ..., + allow_missing: bool = ..., + with_alignments: bool = ... + ) -> Union[List[Tuple[int, int, int]], List[Span]]: ... diff --git a/spacy/strings.pyi b/spacy/strings.pyi new file mode 100644 index 000000000..57bf71b93 --- /dev/null +++ b/spacy/strings.pyi @@ -0,0 +1,22 @@ +from typing import Optional, Iterable, Iterator, Union, Any +from pathlib import Path + +def get_string_id(key: str) -> int: ... + +class StringStore: + def __init__( + self, strings: Optional[Iterable[str]] = ..., freeze: bool = ... + ) -> None: ... + def __getitem__(self, string_or_id: Union[bytes, str, int]) -> Union[str, int]: ... + def as_int(self, key: Union[bytes, str, int]) -> int: ... + def as_string(self, key: Union[bytes, str, int]) -> str: ... + def add(self, string: str) -> int: ... + def __len__(self) -> int: ... + def __contains__(self, string: str) -> bool: ... + def __iter__(self) -> Iterator[str]: ... + def __reduce__(self) -> Any: ... + def to_disk(self, path: Union[str, Path]) -> None: ... + def from_disk(self, path: Union[str, Path]) -> StringStore: ... + def to_bytes(self, **kwargs: Any) -> bytes: ... + def from_bytes(self, bytes_data: bytes, **kwargs: Any) -> StringStore: ... + def _reset_and_load(self, strings: Iterable[str]) -> None: ... diff --git a/spacy/tokens/_retokenize.pyi b/spacy/tokens/_retokenize.pyi new file mode 100644 index 000000000..b829b71a3 --- /dev/null +++ b/spacy/tokens/_retokenize.pyi @@ -0,0 +1,17 @@ +from typing import Dict, Any, Union, List, Tuple +from .doc import Doc +from .span import Span +from .token import Token + +class Retokenizer: + def __init__(self, doc: Doc) -> None: ... 
+ def merge(self, span: Span, attrs: Dict[Union[str, int], Any] = ...) -> None: ... + def split( + self, + token: Token, + orths: List[str], + heads: List[Union[Token, Tuple[Token, int]]], + attrs: Dict[Union[str, int], List[Any]] = ..., + ) -> None: ... + def __enter__(self) -> Retokenizer: ... + def __exit__(self, *args: Any) -> None: ... diff --git a/spacy/tokens/doc.pyi b/spacy/tokens/doc.pyi new file mode 100644 index 000000000..8688fb91f --- /dev/null +++ b/spacy/tokens/doc.pyi @@ -0,0 +1,180 @@ +from typing import ( + Callable, + Protocol, + Iterable, + Iterator, + Optional, + Union, + Tuple, + List, + Dict, + Any, + overload, +) +from cymem.cymem import Pool +from thinc.types import Floats1d, Floats2d, Ints2d +from .span import Span +from .token import Token +from ._dict_proxies import SpanGroups +from ._retokenize import Retokenizer +from ..lexeme import Lexeme +from ..vocab import Vocab +from .underscore import Underscore +from pathlib import Path +import numpy + +class DocMethod(Protocol): + def __call__(self: Doc, *args: Any, **kwargs: Any) -> Any: ... + +class Doc: + vocab: Vocab + mem: Pool + spans: SpanGroups + max_length: int + length: int + sentiment: float + cats: Dict[str, float] + user_hooks: Dict[str, Callable[..., Any]] + user_token_hooks: Dict[str, Callable[..., Any]] + user_span_hooks: Dict[str, Callable[..., Any]] + tensor: numpy.ndarray + user_data: Dict[str, Any] + has_unknown_spaces: bool + @classmethod + def set_extension( + cls, + name: str, + default: Optional[Any] = ..., + getter: Optional[Callable[[Doc], Any]] = ..., + setter: Optional[Callable[[Doc, Any], None]] = ..., + method: Optional[DocMethod] = ..., + force: bool = ..., + ) -> None: ... + @classmethod + def get_extension( + cls, name: str + ) -> Tuple[ + Optional[Any], + Optional[DocMethod], + Optional[Callable[[Doc], Any]], + Optional[Callable[[Doc, Any], None]], + ]: ... + @classmethod + def has_extension(cls, name: str) -> bool: ... + @classmethod + def remove_extension( + cls, name: str + ) -> Tuple[ + Optional[Any], + Optional[DocMethod], + Optional[Callable[[Doc], Any]], + Optional[Callable[[Doc, Any], None]], + ]: ... + def __init__( + self, + vocab: Vocab, + words: Optional[List[str]] = ..., + spaces: Optional[List[bool]] = ..., + user_data: Optional[Dict[Any, Any]] = ..., + tags: Optional[List[str]] = ..., + pos: Optional[List[str]] = ..., + morphs: Optional[List[str]] = ..., + lemmas: Optional[List[str]] = ..., + heads: Optional[List[int]] = ..., + deps: Optional[List[str]] = ..., + sent_starts: Optional[List[Union[bool, None]]] = ..., + ents: Optional[List[str]] = ..., + ) -> None: ... + @property + def _(self) -> Underscore: ... + @property + def is_tagged(self) -> bool: ... + @property + def is_parsed(self) -> bool: ... + @property + def is_nered(self) -> bool: ... + @property + def is_sentenced(self) -> bool: ... + def has_annotation( + self, attr: Union[int, str], *, require_complete: bool = ... + ) -> bool: ... + @overload + def __getitem__(self, i: int) -> Token: ... + @overload + def __getitem__(self, i: slice) -> Span: ... + def __iter__(self) -> Iterator[Token]: ... + def __len__(self) -> int: ... + def __unicode__(self) -> str: ... + def __bytes__(self) -> bytes: ... + def __str__(self) -> str: ... + def __repr__(self) -> str: ... + @property + def doc(self) -> Doc: ... + def char_span( + self, + start_idx: int, + end_idx: int, + label: Union[int, str] = ..., + kb_id: Union[int, str] = ..., + vector: Optional[Floats1d] = ..., + alignment_mode: str = ..., + ) -> Span: ... 
+ def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ... + @property + def has_vector(self) -> bool: ... + vector: Floats1d + vector_norm: float + @property + def text(self) -> str: ... + @property + def text_with_ws(self) -> str: ... + ents: Tuple[Span] + def set_ents( + self, + entities: List[Span], + *, + blocked: Optional[List[Span]] = ..., + missing: Optional[List[Span]] = ..., + outside: Optional[List[Span]] = ..., + default: str = ... + ) -> None: ... + @property + def noun_chunks(self) -> Iterator[Span]: ... + @property + def sents(self) -> Iterator[Span]: ... + @property + def lang(self) -> int: ... + @property + def lang_(self) -> str: ... + def count_by( + self, attr_id: int, exclude: Optional[Any] = ..., counts: Optional[Any] = ... + ) -> Dict[Any, int]: ... + def from_array(self, attrs: List[int], array: Ints2d) -> Doc: ... + @staticmethod + def from_docs( + docs: List[Doc], + ensure_whitespace: bool = ..., + attrs: Optional[Union[Tuple[Union[str, int]], List[Union[int, str]]]] = ..., + ) -> Doc: ... + def get_lca_matrix(self) -> Ints2d: ... + def copy(self) -> Doc: ... + def to_disk( + self, path: Union[str, Path], *, exclude: Iterable[str] = ... + ) -> None: ... + def from_disk( + self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ... + ) -> Doc: ... + def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ... + def from_bytes( + self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ... + ) -> Doc: ... + def to_dict(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ... + def from_dict( + self, msg: bytes, *, exclude: Union[List[str], Tuple[str]] = ... + ) -> Doc: ... + def extend_tensor(self, tensor: Floats2d) -> None: ... + def retokenize(self) -> Retokenizer: ... + def to_json(self, underscore: Optional[List[str]] = ...) -> Dict[str, Any]: ... + def to_utf8_array(self, nr_char: int = ...) -> Ints2d: ... + @staticmethod + def _get_array_attrs() -> Tuple[Any]: ... diff --git a/spacy/tokens/morphanalysis.pyi b/spacy/tokens/morphanalysis.pyi new file mode 100644 index 000000000..c7e05e58f --- /dev/null +++ b/spacy/tokens/morphanalysis.pyi @@ -0,0 +1,20 @@ +from typing import Any, Dict, Iterator, List, Union +from ..vocab import Vocab + +class MorphAnalysis: + def __init__( + self, vocab: Vocab, features: Union[Dict[str, str], str] = ... + ) -> None: ... + @classmethod + def from_id(cls, vocab: Vocab, key: Any) -> MorphAnalysis: ... + def __contains__(self, feature: str) -> bool: ... + def __iter__(self) -> Iterator[str]: ... + def __len__(self) -> int: ... + def __hash__(self) -> int: ... + def __eq__(self, other: MorphAnalysis) -> bool: ... + def __ne__(self, other: MorphAnalysis) -> bool: ... + def get(self, field: Any) -> List[str]: ... + def to_json(self) -> str: ... + def to_dict(self) -> Dict[str, str]: ... + def __str__(self) -> str: ... + def __repr__(self) -> str: ... diff --git a/spacy/tokens/span.pyi b/spacy/tokens/span.pyi new file mode 100644 index 000000000..4f65abace --- /dev/null +++ b/spacy/tokens/span.pyi @@ -0,0 +1,124 @@ +from typing import Callable, Protocol, Iterator, Optional, Union, Tuple, Any, overload +from thinc.types import Floats1d, Ints2d, FloatsXd +from .doc import Doc +from .token import Token +from .underscore import Underscore +from ..lexeme import Lexeme +from ..vocab import Vocab + +class SpanMethod(Protocol): + def __call__(self: Span, *args: Any, **kwargs: Any) -> Any: ... 
+ +class Span: + @classmethod + def set_extension( + cls, + name: str, + default: Optional[Any] = ..., + getter: Optional[Callable[[Span], Any]] = ..., + setter: Optional[Callable[[Span, Any], None]] = ..., + method: Optional[SpanMethod] = ..., + force: bool = ..., + ) -> None: ... + @classmethod + def get_extension( + cls, name: str + ) -> Tuple[ + Optional[Any], + Optional[SpanMethod], + Optional[Callable[[Span], Any]], + Optional[Callable[[Span, Any], None]], + ]: ... + @classmethod + def has_extension(cls, name: str) -> bool: ... + @classmethod + def remove_extension( + cls, name: str + ) -> Tuple[ + Optional[Any], + Optional[SpanMethod], + Optional[Callable[[Span], Any]], + Optional[Callable[[Span, Any], None]], + ]: ... + def __init__( + self, + doc: Doc, + start: int, + end: int, + label: int = ..., + vector: Optional[Floats1d] = ..., + vector_norm: Optional[float] = ..., + kb_id: Optional[int] = ..., + ) -> None: ... + def __richcmp__(self, other: Span, op: int) -> bool: ... + def __hash__(self) -> int: ... + def __len__(self) -> int: ... + def __repr__(self) -> str: ... + @overload + def __getitem__(self, i: int) -> Token: ... + @overload + def __getitem__(self, i: slice) -> Span: ... + def __iter__(self) -> Iterator[Token]: ... + @property + def _(self) -> Underscore: ... + def as_doc(self, *, copy_user_data: bool = ...) -> Doc: ... + def get_lca_matrix(self) -> Ints2d: ... + def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ... + @property + def vocab(self) -> Vocab: ... + @property + def sent(self) -> Span: ... + @property + def ents(self) -> Tuple[Span]: ... + @property + def has_vector(self) -> bool: ... + @property + def vector(self) -> Floats1d: ... + @property + def vector_norm(self) -> float: ... + @property + def tensor(self) -> FloatsXd: ... + @property + def sentiment(self) -> float: ... + @property + def text(self) -> str: ... + @property + def text_with_ws(self) -> str: ... + @property + def noun_chunks(self) -> Iterator[Span]: ... + @property + def root(self) -> Token: ... + def char_span( + self, + start_idx: int, + end_idx: int, + label: int = ..., + kb_id: int = ..., + vector: Optional[Floats1d] = ..., + ) -> Span: ... + @property + def conjuncts(self) -> Tuple[Token]: ... + @property + def lefts(self) -> Iterator[Token]: ... + @property + def rights(self) -> Iterator[Token]: ... + @property + def n_lefts(self) -> int: ... + @property + def n_rights(self) -> int: ... + @property + def subtree(self) -> Iterator[Token]: ... + start: int + end: int + start_char: int + end_char: int + label: int + kb_id: int + ent_id: int + ent_id_: str + @property + def orth_(self) -> str: ... + @property + def lemma_(self) -> str: ... + label_: str + kb_id_: str diff --git a/spacy/tokens/span_group.pyi b/spacy/tokens/span_group.pyi new file mode 100644 index 000000000..4bd6bec27 --- /dev/null +++ b/spacy/tokens/span_group.pyi @@ -0,0 +1,24 @@ +from typing import Any, Dict, Iterable +from .doc import Doc +from .span import Span + +class SpanGroup: + def __init__( + self, + doc: Doc, + *, + name: str = ..., + attrs: Dict[str, Any] = ..., + spans: Iterable[Span] = ... + ) -> None: ... + def __repr__(self) -> str: ... + @property + def doc(self) -> Doc: ... + @property + def has_overlap(self) -> bool: ... + def __len__(self) -> int: ... + def append(self, span: Span) -> None: ... + def extend(self, spans: Iterable[Span]) -> None: ... + def __getitem__(self, i: int) -> Span: ... + def to_bytes(self) -> bytes: ... 
+ def from_bytes(self, bytes_data: bytes) -> SpanGroup: ... diff --git a/spacy/tokens/token.pyi b/spacy/tokens/token.pyi new file mode 100644 index 000000000..23d028ffd --- /dev/null +++ b/spacy/tokens/token.pyi @@ -0,0 +1,208 @@ +from typing import ( + Callable, + Protocol, + Iterator, + Optional, + Union, + Tuple, + Any, +) +from thinc.types import Floats1d, FloatsXd +from .doc import Doc +from .span import Span +from .morphanalysis import MorphAnalysis +from ..lexeme import Lexeme +from ..vocab import Vocab +from .underscore import Underscore + +class TokenMethod(Protocol): + def __call__(self: Token, *args: Any, **kwargs: Any) -> Any: ... + +class Token: + i: int + doc: Doc + vocab: Vocab + @classmethod + def set_extension( + cls, + name: str, + default: Optional[Any] = ..., + getter: Optional[Callable[[Token], Any]] = ..., + setter: Optional[Callable[[Token, Any], None]] = ..., + method: Optional[TokenMethod] = ..., + force: bool = ..., + ) -> None: ... + @classmethod + def get_extension( + cls, name: str + ) -> Tuple[ + Optional[Any], + Optional[TokenMethod], + Optional[Callable[[Token], Any]], + Optional[Callable[[Token, Any], None]], + ]: ... + @classmethod + def has_extension(cls, name: str) -> bool: ... + @classmethod + def remove_extension( + cls, name: str + ) -> Tuple[ + Optional[Any], + Optional[TokenMethod], + Optional[Callable[[Token], Any]], + Optional[Callable[[Token, Any], None]], + ]: ... + def __init__(self, vocab: Vocab, doc: Doc, offset: int) -> None: ... + def __hash__(self) -> int: ... + def __len__(self) -> int: ... + def __unicode__(self) -> str: ... + def __bytes__(self) -> bytes: ... + def __str__(self) -> str: ... + def __repr__(self) -> str: ... + def __richcmp__(self, other: Token, op: int) -> bool: ... + @property + def _(self) -> Underscore: ... + def nbor(self, i: int = ...) -> Token: ... + def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ... + def has_morph(self) -> bool: ... + morph: MorphAnalysis + @property + def lex(self) -> Lexeme: ... + @property + def lex_id(self) -> int: ... + @property + def rank(self) -> int: ... + @property + def text(self) -> str: ... + @property + def text_with_ws(self) -> str: ... + @property + def prob(self) -> float: ... + @property + def sentiment(self) -> float: ... + @property + def lang(self) -> int: ... + @property + def idx(self) -> int: ... + @property + def cluster(self) -> int: ... + @property + def orth(self) -> int: ... + @property + def lower(self) -> int: ... + @property + def norm(self) -> int: ... + @property + def shape(self) -> int: ... + @property + def prefix(self) -> int: ... + @property + def suffix(self) -> int: ... + lemma: int + pos: int + tag: int + dep: int + @property + def has_vector(self) -> bool: ... + @property + def vector(self) -> Floats1d: ... + @property + def vector_norm(self) -> float: ... + @property + def tensor(self) -> Optional[FloatsXd]: ... + @property + def n_lefts(self) -> int: ... + @property + def n_rights(self) -> int: ... + @property + def sent(self) -> Span: ... + sent_start: bool + is_sent_start: Optional[bool] + is_sent_end: Optional[bool] + @property + def lefts(self) -> Iterator[Token]: ... + @property + def rights(self) -> Iterator[Token]: ... + @property + def children(self) -> Iterator[Token]: ... + @property + def subtree(self) -> Iterator[Token]: ... + @property + def left_edge(self) -> Token: ... + @property + def right_edge(self) -> Token: ... + @property + def ancestors(self) -> Iterator[Token]: ... 
+ def is_ancestor(self, descendant: Token) -> bool: ... + def has_head(self) -> bool: ... + head: Token + @property + def conjuncts(self) -> Tuple[Token]: ... + ent_type: int + ent_type_: str + @property + def ent_iob(self) -> int: ... + @classmethod + def iob_strings(cls) -> Tuple[str]: ... + @property + def ent_iob_(self) -> str: ... + ent_id: int + ent_id_: str + ent_kb_id: int + ent_kb_id_: str + @property + def whitespace_(self) -> str: ... + @property + def orth_(self) -> str: ... + @property + def lower_(self) -> str: ... + norm_: str + @property + def shape_(self) -> str: ... + @property + def prefix_(self) -> str: ... + @property + def suffix_(self) -> str: ... + @property + def lang_(self) -> str: ... + lemma_: str + pos_: str + tag_: str + def has_dep(self) -> bool: ... + dep_: str + @property + def is_oov(self) -> bool: ... + @property + def is_stop(self) -> bool: ... + @property + def is_alpha(self) -> bool: ... + @property + def is_ascii(self) -> bool: ... + @property + def is_digit(self) -> bool: ... + @property + def is_lower(self) -> bool: ... + @property + def is_upper(self) -> bool: ... + @property + def is_title(self) -> bool: ... + @property + def is_punct(self) -> bool: ... + @property + def is_space(self) -> bool: ... + @property + def is_bracket(self) -> bool: ... + @property + def is_quote(self) -> bool: ... + @property + def is_left_punct(self) -> bool: ... + @property + def is_right_punct(self) -> bool: ... + @property + def is_currency(self) -> bool: ... + @property + def like_url(self) -> bool: ... + @property + def like_num(self) -> bool: ... + @property + def like_email(self) -> bool: ... diff --git a/spacy/vocab.pyi b/spacy/vocab.pyi new file mode 100644 index 000000000..0a8ef6198 --- /dev/null +++ b/spacy/vocab.pyi @@ -0,0 +1,78 @@ +from typing import ( + Callable, + Iterator, + Optional, + Union, + Tuple, + List, + Dict, + Any, +) +from thinc.types import Floats1d, FloatsXd +from . import Language +from .strings import StringStore +from .lexeme import Lexeme +from .lookups import Lookups +from .tokens import Doc, Span +from pathlib import Path + +def create_vocab( + lang: Language, defaults: Any, vectors_name: Optional[str] = ... +) -> Vocab: ... + +class Vocab: + def __init__( + self, + lex_attr_getters: Optional[Dict[str, Callable[[str], Any]]] = ..., + strings: Optional[Union[List[str], StringStore]] = ..., + lookups: Optional[Lookups] = ..., + oov_prob: float = ..., + vectors_name: Optional[str] = ..., + writing_system: Dict[str, Any] = ..., + get_noun_chunks: Optional[Callable[[Union[Doc, Span]], Iterator[Span]]] = ..., + ) -> None: ... + @property + def lang(self) -> Language: ... + def __len__(self) -> int: ... + def add_flag( + self, flag_getter: Callable[[str], bool], flag_id: int = ... + ) -> int: ... + def __contains__(self, key: str) -> bool: ... + def __iter__(self) -> Iterator[Lexeme]: ... + def __getitem__(self, id_or_string: Union[str, int]) -> Lexeme: ... + @property + def vectors_length(self) -> int: ... + def reset_vectors( + self, *, width: Optional[int] = ..., shape: Optional[int] = ... + ) -> None: ... + def prune_vectors(self, nr_row: int, batch_size: int = ...) -> Dict[str, float]: ... + def get_vector( + self, + orth: Union[int, str], + minn: Optional[int] = ..., + maxn: Optional[int] = ..., + ) -> FloatsXd: ... + def set_vector(self, orth: Union[int, str], vector: Floats1d) -> None: ... + def has_vector(self, orth: Union[int, str]) -> bool: ... 
+ lookups: Lookups + def to_disk( + self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ... + ) -> None: ... + def from_disk( + self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ... + ) -> Vocab: ... + def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ... + def from_bytes( + self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ... + ) -> Vocab: ... + +def pickle_vocab(vocab: Vocab) -> Any: ... +def unpickle_vocab( + sstore: StringStore, + vectors: Any, + morphology: Any, + data_dir: Any, + lex_attr_getters: Any, + lookups: Any, + get_noun_chunks: Any, +) -> Vocab: ... From cac298471fb1d8395afc965443ffe5453dcb8b0c Mon Sep 17 00:00:00 2001 From: Paul O'Leary McCann Date: Sun, 8 Aug 2021 22:04:00 +0900 Subject: [PATCH 22/22] Fix #8902 (bad link in docs) typo fix --- website/docs/api/vocab.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/api/vocab.md b/website/docs/api/vocab.md index 8fe769cdd..320ad5605 100644 --- a/website/docs/api/vocab.md +++ b/website/docs/api/vocab.md @@ -29,7 +29,7 @@ Create the vocabulary. | `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ | | `vectors_name` 2.2 | A name to identify the vectors table. ~~str~~ | | `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ | -| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ | +| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ | ## Vocab.\_\_len\_\_ {#len tag="method"}
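
For illustration, here is a small sketch of the kind of user code that the stub files added in [PATCH 21/22] (#8427) let a static type checker verify. The helper functions and the `en_core_web_sm` pipeline name below are assumptions chosen for the sketch, not part of the patches; only the annotated types — `Doc` iterating over `Token`, `Doc.ents` as spans with `label_`, `Token.pos_` as a string — correspond to signatures declared in the new `.pyi` files.

```python
# Minimal sketch (not part of the patch series) of code the new stubs can
# type-check. Only the spaCy types and attributes come from the stubs; the
# function names and the "en_core_web_sm" pipeline are example assumptions.
from typing import List

import spacy
from spacy.tokens import Doc


def count_adjectives(doc: Doc) -> int:
    # Doc is declared as iterable over Token, and Token.pos_ is a str,
    # so this comprehension checks cleanly against token.pyi.
    return sum(1 for token in doc if token.pos_ == "ADJ")


def entity_labels(doc: Doc) -> List[str]:
    # Doc.ents is a tuple of Span objects and Span.label_ is a str
    # (see doc.pyi and span.pyi above).
    return [ent.label_ for ent in doc.ents]


if __name__ == "__main__":
    # Assumes a trained English pipeline is installed; spacy.blank("en")
    # would also run, but pos_ and ents would then stay empty.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
    print(count_adjectives(doc), entity_labels(doc))
```

Assuming the package also ships a `py.typed` marker, a checker such as mypy picks up the bundled stubs automatically and validates a module like this instead of treating the Cython classes as `Any`.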