mirror of https://github.com/explosion/spaCy.git
synced 2025-01-27 09:44:36 +03:00

Merge remote-tracking branch 'upstream/master' into spacy.io

This commit is contained in:
commit fb8c2f794a

106  .github/contributors/ezorita.md  vendored  Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry         |
| ------------------------------ | ------------- |
| Name                           | Eduard Zorita |
| Company name (if applicable)   |               |
| Title or role (if applicable)  |               |
| Date                           | 06/17/2021    |
| GitHub username                | ezorita       |
| Website (optional)             |               |

106  .github/contributors/nsorros.md  vendored  Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry       |
| ------------------------------ | ----------- |
| Name                           | Nick Sorros |
| Company name (if applicable)   |             |
| Title or role (if applicable)  |             |
| Date                           | 2/8/2021    |
| GitHub username                | nsorros     |
| Website (optional)             |             |

88  .github/contributors/swfarnsworth.md  vendored  Normal file

@@ -0,0 +1,88 @@
## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry             |
| ------------------------------ | ----------------- |
| Name                           | Steele Farnsworth |
| Company name (if applicable)   |                   |
| Title or role (if applicable)  |                   |
| Date                           | 13 August, 2021   |
| GitHub username                | swfarnsworth      |
| Website (optional)             |                   |

@@ -185,7 +185,6 @@ Each time a `git commit` is initiated, `black` and `flake8` will run automatically

In case of error, or when `black` modified a file, the modified file needs to be `git add` once again and a new
`git commit` has to be issued.

### Code formatting

[`black`](https://github.com/ambv/black) is an opinionated Python code

@@ -414,14 +413,7 @@ all test files and test functions need to be prefixed with `test_`.

When adding tests, make sure to use descriptive names, keep the code short and
concise and only test for one behavior at a time. Try to `parametrize` test
cases wherever possible, use our pre-defined fixtures for spaCy components and
avoid unnecessary imports.

Extensive tests that take a long time should be marked with `@pytest.mark.slow`.
Tests that require the model to be loaded should be marked with
`@pytest.mark.models`. Loading the models is expensive and not necessary if
you're not actually testing the model performance. If all you need is a `Doc`
object with annotations like heads, POS tags or the dependency parse, you can
use the `Doc` constructor to construct it manually.
avoid unnecessary imports. Extensive tests that take a long time should be marked with `@pytest.mark.slow`.

📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**

546  extra/DEVELOPER_DOCS/Code Conventions.md  Normal file

@@ -0,0 +1,546 @@
# Code Conventions

For a general overview of code conventions for contributors, see the [section in the contributing guide](https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md#code-conventions).

1. [Code compatibility](#code-compatibility)
2. [Auto-formatting](#auto-formatting)
3. [Linting](#linting)
4. [Documenting code](#documenting-code)
5. [Type hints](#type-hints)
6. [Structuring logic](#structuring-logic)
7. [Naming](#naming)
8. [Error handling](#error-handling)
9. [Writing tests](#writing-tests)

## Code compatibility

spaCy supports **Python 3.6** and above, so all code should be written to be compatible with 3.6. This means that there are certain new syntax features that we won't be able to use until we drop support for older Python versions. Some newer features provide backports that we can conditionally install for older versions, although we only want to do this if it's absolutely necessary. If we need to use conditional imports based on the Python version or other custom compatibility-specific helpers, those should live in `compat.py`.
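As an illustration, a conditionally installed backport might be imported roughly like this (a minimal sketch only; the chosen symbol is an assumption for illustration, not a description of what `compat.py` actually contains):

```python
# Sketch of a version-gated import for a backported typing feature.
import sys

if sys.version_info >= (3, 8):
    from typing import Literal  # available in the standard library from 3.8
else:
    from typing_extensions import Literal  # conditionally installed backport for 3.6/3.7
```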
## Auto-formatting

spaCy uses `black` for auto-formatting (which is also available as a pre-commit hook). It's recommended to configure your editor to perform this automatically, either triggered manually or whenever you save a file. We also have a GitHub action that regularly formats the code base and submits a PR if changes are available. Note that auto-formatting is currently only available for `.py` (Python) files, not for `.pyx` (Cython).

As a rule of thumb, if the auto-formatting produces output that looks messy, it can often indicate that there's a better way to structure the code to make it more concise.

```diff
- range_suggester = registry.misc.get("spacy.ngram_range_suggester.v1")(
-     min_size=1, max_size=3
- )
+ suggester_factory = registry.misc.get("spacy.ngram_range_suggester.v1")
+ range_suggester = suggester_factory(min_size=1, max_size=3)
```

In some specific cases, e.g. in the tests, it can make sense to disable auto-formatting for a specific block. You can do this by wrapping the code in `# fmt: off` and `# fmt: on`:

```diff
+ # fmt: off
text = "I look forward to using Thingamajig. I've been told it will make my life easier..."
deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
        "nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
        "poss", "nsubj", "ccomp", "punct"]
+ # fmt: on
```

## Linting

[`flake8`](http://flake8.pycqa.org/en/latest/) is a tool for enforcing code style. It scans one or more files and outputs errors and warnings. This feedback can help you stick to general standards and conventions, and can be very useful for spotting potential mistakes and inconsistencies in your code. Code you write should be compatible with our flake8 rules and not cause any warnings.

```bash
flake8 spacy
```

The most common problems surfaced by linting are:

- **Trailing or missing whitespace.** This is related to formatting and should be fixed automatically by running `black`.
- **Unused imports.** Those should be removed if the imports aren't actually used. If they're required, e.g. to expose them so they can be imported from the given module, you can add a comment and `# noqa: F401` exception (see details below).
- **Unused variables.** This can often indicate bugs, e.g. a variable that's declared and not correctly passed on or returned. To prevent ambiguity here, your code shouldn't contain unused variables. If you're unpacking a list of tuples and end up with variables you don't need, you can call them `_` to indicate that they're unused.
- **Redefinition of function.** This can also indicate bugs, e.g. a copy-pasted function that you forgot to rename and that now replaces the original function.
- **Repeated dictionary keys.** This either indicates a bug or unnecessary duplication.
- **Comparison with `True`, `False`, `None`**. This is mostly a stylistic thing: when checking whether a value is `True`, `False` or `None`, you should be using `is` instead of `==`. For example, `if value is None`.

### Ignoring linter rules for special cases

To ignore a given line, you can add a comment like `# noqa: F401`, specifying the code of the error or warning we want to ignore. It's also possible to ignore several comma-separated codes at once, e.g. `# noqa: E731,E123`. In general, you should always **specify the code(s)** you want to ignore – otherwise, you may end up missing actual problems.

```python
# The imported class isn't used in this file, but imported here, so it can be
# imported *from* here by another module.
from .submodule import SomeClass  # noqa: F401

try:
    do_something()
except:  # noqa: E722
    # This bare except is justified, for some specific reason
    do_something_else()
```

## Documenting code

All functions and methods you write should be documented with a docstring inline. The docstring can contain a simple summary, and an overview of the arguments and their (simplified) types. Modern editors will show this information to users when they call the function or method in their code.

If it's part of the public API and there's a documentation section available, we usually add the link as `DOCS:` at the end. This allows us to keep the docstrings simple and concise, while also providing additional information and examples if necessary.

```python
def has_pipe(self, name: str) -> bool:
    """Check if a component name is present in the pipeline. Equivalent to
    `name in nlp.pipe_names`.

    name (str): Name of the component.
    RETURNS (bool): Whether a component of the name exists in the pipeline.

    DOCS: https://spacy.io/api/language#has_pipe
    """
    ...
```

We specifically chose this approach of maintaining the docstrings and API reference separately, instead of auto-generating the API docs from the docstrings like other packages do. We want to be able to provide extensive explanations and examples in the documentation and use our own custom markup for it that would otherwise clog up the docstrings. We also want to be able to update the documentation independently of the code base. It's slightly more work, but it's absolutely worth it in terms of user and developer experience.

### Inline code comments

We don't expect you to add inline comments for everything you're doing – this should be obvious from reading the code. If it's not, the first thing to check is whether your code can be improved to make it more explicit. That said, if your code includes complex logic or aspects that may be unintuitive at first glance (or even included a subtle bug that you ended up fixing), you should leave a quick comment that provides more context.

```diff
token_index = indices[value]
+ # Index describes Token.i of last token but Span indices are inclusive
span = doc[prev_token_index:token_index + 1]
```

```diff
+ # To create the components we need to use the final interpolated config
+ # so all values are available (if component configs use variables).
+ # Later we replace the component config with the raw config again.
interpolated = filled.interpolate() if not filled.is_interpolated else filled
```

Don't be shy about including comments for tricky parts that _you_ found hard to implement or get right – those may come in handy for the next person working on this code, or even future you!

If your change implements a fix to a specific issue, it can often be helpful to include the issue number in the comment, especially if it's a relatively straightforward adjustment:

```diff
+ # Ensure object is a Span, not a Doc (#1234)
if isinstance(obj, Doc):
    obj = obj[obj.start:obj.end]
```
### Including TODOs

It's fine to include code comments that indicate future TODOs, using the `TODO:` prefix. Modern editors typically format this in a different color, so it's easy to spot. TODOs don't necessarily have to be things that are absolutely critical to fix right now – those should already be addressed in your pull request once it's ready for review. But they can include notes about potential future improvements.

```diff
+ # TODO: this is currently pretty slow
dir_checksum = hashlib.md5()
for sub_file in sorted(fp for fp in path.rglob("*") if fp.is_file()):
    dir_checksum.update(sub_file.read_bytes())
```

If any of the TODOs you've added are important and should be fixed soon, you should add a task for this on Explosion's internal Ora board or an issue on the public issue tracker to make sure we don't forget to address it.
## Type hints

We use Python type hints across the `.py` files wherever possible. This makes it easy to understand what a function expects and returns, and modern editors will be able to show this information to you when you call an annotated function. Type hints are not currently used in the `.pyx` (Cython) code, except for definitions of registered functions and component factories, where they're used for config validation.

If possible, you should always use the more descriptive type hints like `List[str]` or even `List[Any]` instead of only `list`. We also annotate arguments and return types of `Callable` – although, you can simplify this if the type otherwise gets too verbose (e.g. functions that return factories to create callbacks). Remember that `Callable` takes two values: a **list** of the argument type(s) in order, and the return type.

```diff
- def func(some_arg: dict) -> None:
+ def func(some_arg: Dict[str, Any]) -> None:
    ...
```

```python
def create_callback(some_arg: bool) -> Callable[[str, int], List[str]]:
    def callback(arg1: str, arg2: int) -> List[str]:
        ...

    return callback
```

For model architectures, Thinc also provides a collection of [custom types](https://thinc.ai/docs/api-types), including more specific types for arrays and model inputs/outputs. Even outside of static type checking, using these types will make the code a lot easier to read and follow, since it's always clear what array types are expected (and what might go wrong if the output is different from the expected type).

```python
def build_tagger_model(
    tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None
) -> Model[List[Doc], List[Floats2d]]:
    ...
```

If you need to use a type hint that refers to something later declared in the same module, or the class that a method belongs to, you can use a string value instead:

```python
class SomeClass:
    def from_bytes(self, data: bytes) -> "SomeClass":
        ...
```

In some cases, you won't be able to import a class from a different module to use it as a type hint because it'd cause circular imports. For instance, `spacy/util.py` includes various helper functions that return an instance of `Language`, but we couldn't import it, because `spacy/language.py` imports `util` itself. In this case, we can provide `"Language"` as a string and make the import conditional on `typing.TYPE_CHECKING` so it only runs when the code is evaluated by a type checker:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from .language import Language

def load_model(name: str) -> "Language":
    ...
```
## Structuring logic

### Positional and keyword arguments

We generally try to avoid writing functions and methods with too many arguments, and use keyword-only arguments wherever possible. Python lets you define arguments as keyword-only by separating them with a `, *`. If you're writing functions with additional arguments that customize the behavior, you typically want to make those arguments keyword-only, so their names have to be provided explicitly.

```diff
- def do_something(name: str, validate: bool = False):
+ def do_something(name: str, *, validate: bool = False):
    ...

- do_something("some_name", True)
+ do_something("some_name", validate=True)
```

This makes the function calls easier to read, because it's immediately clear what the additional values mean. It also makes it easier to extend arguments or change their order later on, because you don't end up with any function calls that depend on a specific positional order.
### Avoid mutable default arguments

A common Python gotcha is [mutable default arguments](https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments): if your argument defines a mutable default value like `[]` or `{}` and then goes and mutates it, the default value is created _once_ when the function is created and the same object is then mutated every time the function is called. This can be pretty unintuitive when you first encounter it. We therefore avoid writing logic that does this.

If your arguments need to default to an empty list or dict, you can use the `SimpleFrozenList` and `SimpleFrozenDict` helpers provided by spaCy. They are simple frozen implementations that raise an error if they're being mutated to prevent bugs and logic that accidentally mutates default arguments.

```diff
- def to_bytes(self, *, exclude: List[str] = []):
+ def to_bytes(self, *, exclude: List[str] = SimpleFrozenList()):
    ...
```

```diff
def do_something(values: List[str] = SimpleFrozenList()):
    if some_condition:
-         values.append("foo")  # raises an error
+         values = [*values, "foo"]
    return values
```
### Don't use `try`/`except` for control flow

We strongly discourage using `try`/`except` blocks for anything that's not third-party error handling or error handling that we otherwise have little control over. There's typically always a way to anticipate the _actual_ problem and **check for it explicitly**, which makes the code easier to follow and understand, and prevents bugs:

```diff
- try:
-     token = doc[i]
- except IndexError:
-     token = doc[-1]

+ if i < len(doc):
+     token = doc[i]
+ else:
+     token = doc[-1]
```

Even if you end up having to check for multiple conditions explicitly, this is still preferred over a catch-all `try`/`except`. It can be very helpful to think about the exact scenarios you need to cover, and what could go wrong at each step, which often leads to better code and fewer bugs. `try/except` blocks can also easily mask _other_ bugs and problems that raise the same errors you're catching, which is obviously bad.

If you have to use `try`/`except`, make sure to only include what's **absolutely necessary** in the `try` block and define the exception(s) explicitly. Otherwise, you may end up masking very different exceptions caused by other bugs.

```diff
- try:
-     value1 = get_some_value()
-     value2 = get_some_other_value()
-     score = external_library.compute_some_score(value1, value2)
- except:
-     score = 0.0

+ value1 = get_some_value()
+ value2 = get_some_other_value()
+ try:
+     score = external_library.compute_some_score(value1, value2)
+ except ValueError:
+     score = 0.0
```

### Avoid lambda functions

`lambda` functions can be useful for defining simple anonymous functions in a single line, but they also introduce problems: for instance, they require [additional logic](https://stackoverflow.com/questions/25348532/can-python-pickle-lambda-functions) in order to be pickled and are pretty ugly to type-annotate. So we typically avoid them in the code base and only use them in the serialization handlers and within tests for simplicity. Instead of `lambda`s, check if your code can be refactored to not need them, or use helper functions instead.

```diff
- split_string: Callable[[str], List[str]] = lambda value: [v.strip() for v in value.split(",")]

+ def split_string(value: str) -> List[str]:
+     return [v.strip() for v in value.split(",")]
```

### Iteration and comprehensions

We generally avoid using built-in functions like `filter` or `map` in favor of list or generator comprehensions.

```diff
- filtered = filter(lambda x: x in ["foo", "bar"], values)
+ filtered = (x for x in values if x in ["foo", "bar"])
- filtered = list(filter(lambda x: x in ["foo", "bar"], values))
+ filtered = [x for x in values if x in ["foo", "bar"]]

- result = map(lambda x: { x: x in ["foo", "bar"]}, values)
+ result = ({x: x in ["foo", "bar"]} for x in values)
- result = list(map(lambda x: { x: x in ["foo", "bar"]}, values))
+ result = [{x: x in ["foo", "bar"]} for x in values]
```

If your logic is more complex, it's often better to write a loop instead, even if it adds more lines of code in total. The result will be much easier to follow and understand.

```diff
- result = [{"key": key, "scores": {f"{i}": score for i, score in enumerate(scores)}} for key, scores in values]

+ result = []
+ for key, scores in values:
+     scores_dict = {f"{i}": score for i, score in enumerate(scores)}
+     result.append({"key": key, "scores": scores_dict})
```

### Composition vs. inheritance

Although spaCy uses a lot of classes, **inheritance is viewed with some suspicion** – it's seen as a mechanism of last resort. You should discuss plans to extend the class hierarchy before implementing. Unless you're implementing a new data structure or pipeline component, you typically shouldn't have to use classes at all.

### Don't use `print`

The core library never `print`s anything. While we encourage using `print` statements for simple debugging (it's the most straightforward way of looking at what's happening), make sure to clean them up once you're ready to submit your pull request. If you want to output warnings or debugging information for users, use the respective dedicated mechanisms for this instead (see sections on warnings and logging for details).

The only exceptions are the CLI functions, which pretty-print messages for the user, and methods that are explicitly intended for printing things, e.g. `Language.analyze_pipes` with `pretty=True` enabled. For this, we use our lightweight helper library [`wasabi`](https://github.com/ines/wasabi).
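For illustration, pretty-printed CLI output with `wasabi` might look roughly like this (a minimal sketch; the messages are made up and this is not prescribed usage from the docs):

```python
from wasabi import msg

# Pretty-printed status messages: only appropriate in CLI functions,
# never in the core library
msg.info("Loading config...")
msg.warn("No output directory provided")
msg.good("Created output directory")
```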
## Naming

Naming is hard and often a topic of long internal discussions. We don't expect you to come up with the perfect names for everything you write – finding the right names is often an iterative and collaborative process. That said, we do try to follow some basic conventions.

Consistent with general Python conventions, we use `CamelCase` for class names including dataclasses, `snake_case` for methods, functions and variables, and `UPPER_SNAKE_CASE` for constants, typically defined at the top of a module. We also avoid using variable names that shadow the names of built-in functions, e.g. `input`, `help` or `list`.
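A quick sketch of these conventions (all names below are made up for illustration):

```python
# Constants: UPPER_SNAKE_CASE, defined at the top of the module
MAX_EXAMPLES = 1000


# Classes (including dataclasses): CamelCase
class SpanGrouper:
    ...


# Functions and variables: snake_case, not shadowing built-ins like list or input
def merge_overlapping_spans(spans):
    merged_spans = list(spans)
    return merged_spans
```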
### Naming variables

Variable names should always make it clear _what exactly_ the variable is and what it's used for. Instances of common classes should use the same consistent names. For example, you should avoid naming a text string (or anything else that's not a `Doc` object) `doc`. The most common class-to-variable mappings are:

| Class      | Variable              | Example                                     |
| ---------- | --------------------- | ------------------------------------------- |
| `Language` | `nlp`                 | `nlp = spacy.blank("en")`                   |
| `Doc`      | `doc`                 | `doc = nlp("Some text")`                    |
| `Span`     | `span`, `ent`, `sent` | `span = doc[1:4]`, `ent = doc.ents[0]`      |
| `Token`    | `token`               | `token = doc[0]`                            |
| `Lexeme`   | `lexeme`, `lex`       | `lex = nlp.vocab["foo"]`                    |
| `Vocab`    | `vocab`               | `vocab = Vocab()`                           |
| `Example`  | `example`, `eg`       | `example = Example.from_dict(doc, gold)`    |
| `Config`   | `config`, `cfg`       | `config = Config().from_disk("config.cfg")` |

We try to avoid introducing too many temporary variables, as these clutter your namespace. It's okay to re-assign to an existing variable, but only if the value has the same type.

```diff
ents = get_a_list_of_entities()
ents = [ent for ent in doc.ents if ent.label_ == "PERSON"]
- ents = {(ent.start, ent.end): ent.label_ for ent in ents}
+ ent_mappings = {(ent.start, ent.end): ent.label_ for ent in ents}
```

### Naming methods and functions

Try choosing short and descriptive names wherever possible and imperative verbs for methods that do something, e.g. `disable_pipes`, `add_patterns` or `get_vector`. Private methods and functions that are not intended to be part of the user-facing API should be prefixed with an underscore `_`. It's often helpful to look at the existing classes for inspiration.

Objects that can be serialized, e.g. data structures and pipeline components, should implement the same consistent methods for serialization. Those usually include at least `to_disk`, `from_disk`, `to_bytes` and `from_bytes`. Some objects can also implement more specific methods like `{to/from}_dict` or `{to/from}_str`.
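As a sketch of that shape (the class and its methods below are hypothetical, not actual spaCy objects):

```python
from typing import List

from spacy.util import SimpleFrozenList


class PatternStore:
    """Hypothetical serializable object following the naming conventions above."""

    def to_bytes(self, *, exclude: List[str] = SimpleFrozenList()) -> bytes:
        ...

    def from_bytes(self, bytes_data: bytes, *, exclude: List[str] = SimpleFrozenList()) -> "PatternStore":
        ...

    def _reset_cache(self) -> None:
        # Private helper, prefixed with an underscore
        ...
```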
## Error handling

We always encourage writing helpful and detailed custom error messages for everything we can anticipate going wrong, and including as much detail as possible. spaCy provides a directory of error messages in `errors.py` with unique codes for each message. This allows us to keep the code base more concise and avoids long and nested blocks of text throughout the code that disrupt the reading flow. The codes make it easy to find references to the same error in different places, and also help identify problems reported by users (since we can just search for the error code).

Errors can be referenced via their code, e.g. `Errors.E123`. Messages can also include placeholders for values that can be populated by formatting the string with `.format()`.

```python
class Errors:
    E123 = "Something went wrong"
    E456 = "Unexpected value: {value}"
```

```diff
if something_went_wrong:
-     raise ValueError("Something went wrong!")
+     raise ValueError(Errors.E123)

if not isinstance(value, int):
-     raise ValueError(f"Unexpected value: {value}")
+     raise ValueError(Errors.E456.format(value=value))
```

As a general rule of thumb, all error messages raised within the **core library** should be added to `Errors`. The only place where we write errors and messages as strings is `spacy.cli`, since these functions typically pretty-print and generate a lot of output that'd otherwise be very difficult to separate from the actual logic.
### Re-raising exceptions

If we anticipate possible errors in third-party code that we don't control, or our own code in a very different context, we typically try to provide custom and more specific error messages if possible. If we need to re-raise an exception within a `try`/`except` block, we can re-raise a custom exception.

[Re-raising `from`](https://docs.python.org/3/tutorial/errors.html#exception-chaining) the original caught exception lets us chain the exceptions, so the user sees both the original error, as well as the custom message with a note "The above exception was the direct cause of the following exception".

```diff
try:
    run_third_party_code_that_might_fail()
except ValueError as e:
+     raise ValueError(Errors.E123) from e
```

In some cases, it makes sense to suppress the original exception, e.g. if we know what it is and know that it's not particularly helpful. In that case, we can raise `from None`. This prevents clogging up the user's terminal with multiple and irrelevant chained exceptions.

```diff
try:
    run_our_own_code_that_might_fail_confusingly()
except ValueError:
+     raise ValueError(Errors.E123) from None
```
### Avoid using naked `assert`

During development, it can sometimes be helpful to add `assert` statements throughout your code to make sure that the values you're working with are what you expect. However, as you clean up your code, those should either be removed or replaced by more explicit error handling:

```diff
- assert score >= 0.0
+ if score < 0.0:
+     raise ValueError(Errors.E789.format(score=score))
```

Otherwise, the user will get to see a naked `AssertionError` with no further explanation, which is very unhelpful. Instead of adding an error message to `assert`, it's always better to `raise` more explicit errors for specific conditions. If you're checking for something that _has to be right_ and would otherwise be a bug in spaCy, you can express this in the error message:

```python
E161 = ("Found an internal inconsistency when predicting entity links. "
        "This is likely a bug in spaCy, so feel free to open an issue: "
        "https://github.com/explosion/spaCy/issues")
```
### Warnings

Instead of raising an error, some parts of the code base can raise warnings to notify the user of a potential problem. This is done using Python's `warnings.warn` and the messages defined in `Warnings` in `errors.py`. Whether or not warnings are shown can be controlled by the user, including custom filters for disabling specific warnings using a regular expression matching our internal codes, e.g. `W123`.

```diff
- print("Warning: No examples provided for validation")
+ warnings.warn(Warnings.W123)
```

When adding warnings, make sure you're not calling `warnings.warn` repeatedly, e.g. in a loop, which will clog up the terminal output. Instead, you can collect the potential problems first and then raise a single warning. If the problem is critical, consider raising an error instead.

```diff
+ n_empty = 0
for spans in lots_of_annotations:
    if len(spans) == 0:
-         warnings.warn(Warnings.W456)
+         n_empty += 1
+ warnings.warn(Warnings.W456.format(count=n_empty))
```
### Logging

Log statements can be added via spaCy's `logger`, which uses Python's native `logging` module under the hood. We generally only use logging for debugging information that **the user may choose to see** in debugging mode or that's **relevant during training** but not at runtime.

```diff
+ logger.info("Set up nlp object from config")
config = nlp.config.interpolate()
```

`spacy train` and similar CLI commands will enable all log statements of level `INFO` by default (which is not the case at runtime). This allows outputting specific information within certain parts of the core library during training, without having it shown at runtime. `DEBUG`-level logs are only shown if the user enables `--verbose` logging during training. They can be used to provide more specific and potentially more verbose details, especially in areas that can indicate bugs or problems, or to surface more details about what spaCy does under the hood. You should only use logging statements if absolutely necessary and important.
## Writing tests

spaCy uses the [`pytest`](http://doc.pytest.org/) framework for testing. Tests for spaCy modules and classes live in their own directories of the same name and all test files should be prefixed with `test_`. Tests included in the core library only cover the code and do not depend on any trained pipelines. When implementing a new feature or fixing a bug, it's usually good to start by writing some tests that describe what _should_ happen. As you write your code, you can then keep running the relevant tests until all of them pass.
### Test suite structure

When adding tests, make sure to use descriptive names and only test for one behavior at a time. Tests should be grouped into modules dedicated to the same type of functionality and some test modules are organized as directories of test files related to the same larger area of the library, e.g. `matcher` or `tokenizer`.

Regression tests are tests that refer to bugs reported in specific issues. They should live in the `regression` module and are named according to the issue number (e.g. `test_issue1234.py`). This system allows us to relate tests for specific bugs back to the original reported issue, which is especially useful if we introduce a regression and a previously passing regression test suddenly fails again. When fixing a bug, it's often useful to create a regression test for it first. Every once in a while, we go through the `regression` module and group tests together into larger files by issue number, in groups of 500 to 1000 numbers. This prevents us from ending up with too many individual files over time.
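A minimal sketch of what such a regression test could look like (the issue number matches the naming example above; the asserted behavior is a placeholder):

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def test_issue1234():
    # Reproduce the reported setup with manually constructed objects and
    # assert that the previously broken behavior now works
    doc = Doc(Vocab(), words=["hello", "world"])
    assert len(doc) == 2
```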
The test suite also provides [fixtures](https://github.com/explosion/spaCy/blob/master/spacy/tests/conftest.py) for different language tokenizers that can be used as function arguments of the same name and will be passed in automatically. Those should only be used for tests related to those specific languages. We also have [test utility functions](https://github.com/explosion/spaCy/blob/master/spacy/tests/util.py) for common operations, like creating a temporary file.

### Constructing objects and state

Test functions usually follow the same simple structure: they set up some state, perform the operation you want to test and `assert` conditions that you expect to be true, usually before and after the operation.

Tests should focus on exactly what they're testing and avoid dependencies on other unrelated library functionality wherever possible. If all your test needs is a `Doc` object with certain annotations set, you should always construct it manually:

```python
def test_doc_creation_with_pos():
    doc = Doc(Vocab(), words=["hello", "world"], pos=["NOUN", "VERB"])
    assert doc[0].pos_ == "NOUN"
    assert doc[1].pos_ == "VERB"
```
### Parametrizing tests

If you need to run the same test function over different input examples, you usually want to parametrize the test cases instead of using a loop within your test. This lets you keep a better separation between test cases and test logic, and it'll result in more useful output because `pytest` will be able to tell you which exact test case failed.

The `@pytest.mark.parametrize` decorator takes two arguments: a string defining one or more comma-separated arguments that should be passed to the test function and a list of corresponding test cases (or a list of tuples to provide multiple arguments).

```python
@pytest.mark.parametrize("words", [["hello", "world"], ["this", "is", "a", "test"]])
def test_doc_length(words):
    doc = Doc(Vocab(), words=words)
    assert len(doc) == len(words)
```

```python
@pytest.mark.parametrize("text,expected_len", [("hello world", 2), ("I can't!", 4)])
def test_token_length(en_tokenizer, text, expected_len):  # en_tokenizer is a fixture
    doc = en_tokenizer(text)
    assert len(doc) == expected_len
```

You can also stack `@pytest.mark.parametrize` decorators, although this is not recommended unless it's absolutely needed or required for the test. When stacking decorators, keep in mind that this will run the test with all possible combinations of the respective parametrized values, which is often not what you want and can slow down the test suite.
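For illustration, stacked decorators look like this (a sketch; the parameters are made up), and the test body runs once per combination, i.e. 2 x 2 = 4 times here:

```python
import pytest
from spacy.tokens import Doc
from spacy.vocab import Vocab


@pytest.mark.parametrize("words", [["hello", "world"], ["just", "three", "words"]])
@pytest.mark.parametrize("lowercase", [True, False])
def test_doc_length_combinations(words, lowercase):
    # Runs for every combination of "words" and "lowercase"
    if lowercase:
        words = [word.lower() for word in words]
    doc = Doc(Vocab(), words=words)
    assert len(doc) == len(words)
```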
### Handling failing tests

`xfail` means that a test **should pass but currently fails**, i.e. is expected to fail. You can mark a test as currently xfailing by adding the `@pytest.mark.xfail` decorator. This should only be used for tests that don't yet work, not for logic that causes errors we raise on purpose (see the section on testing errors for this). It's often very helpful to implement tests for edge cases that we don't yet cover and mark them as `xfail`. You can also provide a `reason` keyword argument to the decorator with an explanation of why the test currently fails.

```diff
+ @pytest.mark.xfail(reason="Issue #225 - not yet implemented")
def test_en_tokenizer_splits_em_dash_infix(en_tokenizer):
    doc = en_tokenizer("Will this road take me to Puddleton?\u2014No.")
    assert doc[8].text == "\u2014"
```

When you run the test suite, you may come across tests that are reported as `xpass`. This means that they're marked as `xfail` but didn't actually fail. This is worth looking into: sometimes, it can mean that we have since fixed a bug that caused the test to previously fail, so we can remove the decorator. In other cases, especially when it comes to machine learning model implementations, it can also indicate that the **test is flaky**: it sometimes passes and sometimes fails. This can be caused by a bug, or by constraints being too narrowly defined. If a test shows different behavior depending on whether it's run in isolation or not, this can indicate that it reacts to global state set in a previous test, which is not ideal and should be avoided.
### Writing slow tests

If a test is useful but potentially quite slow, you can mark it with the `@pytest.mark.slow` decorator. This is a special marker we introduced and tests decorated with it only run if you run the test suite with `--slow`, but not as part of the main CI process. Before introducing a slow test, double-check that there isn't another, more efficient way to test for the behavior. You should also consider adding a simpler test with maybe only a subset of the test cases that can always run, so we at least have some coverage.
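For example (a sketch; the test name and body are placeholders):

```python
import pytest


@pytest.mark.slow
def test_tokenizer_on_very_long_texts(en_tokenizer):
    # Only runs when the test suite is invoked with --slow
    doc = en_tokenizer("Lots of text. " * 10_000)
    assert len(doc) > 0
```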
### Skipping tests

The `@pytest.mark.skip` decorator lets you skip tests entirely. You only want to do this for failing tests that may be slow to run or cause memory errors or segfaults, which would otherwise terminate the entire process and wouldn't be caught by `xfail`. We also sometimes use the `skip` decorator for old and outdated regression tests that we want to keep around but that don't apply anymore. When using the `skip` decorator, make sure to provide the `reason` keyword argument with a quick explanation of why you chose to skip this test.
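For example (a sketch; the reason text and test name are placeholders):

```python
import pytest


@pytest.mark.skip(reason="Outdated regression test, behavior covered elsewhere")
def test_old_tokenizer_behavior():
    ...
```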
### Testing errors and warnings

`pytest` lets you check whether a given error is raised by using the `pytest.raises` contextmanager. This is very useful when implementing custom error handling, so make sure you're not only testing for the correct behavior but also for errors resulting from incorrect inputs. If you're testing errors, you should always check for `pytest.raises` explicitly and not use `xfail`.

```python
words = ["a", "b", "c", "d", "e"]
ents = ["Q-PERSON", "I-PERSON", "O", "I-PERSON", "I-GPE"]
with pytest.raises(ValueError):
    Doc(Vocab(), words=words, ents=ents)
```

You can also use the `pytest.warns` contextmanager to check that a given warning type is raised. The first argument is the warning type or `None` (which will capture a list of warnings that you can `assert` is empty).

```python
def test_phrase_matcher_validation(en_vocab):
    doc1 = Doc(en_vocab, words=["Test"], deps=["ROOT"])
    doc2 = Doc(en_vocab, words=["Test"])
    matcher = PhraseMatcher(en_vocab, validate=True)
    with pytest.warns(UserWarning):
        # Warn about unnecessarily parsed document
        matcher.add("TEST1", [doc1])
    with pytest.warns(None) as record:
        matcher.add("TEST2", [doc2])
        assert not record.list
```
Keep in mind that your tests will fail if you're using the `pytest.warns` contextmanager with a given warning and the warning is _not_ shown. So you should only use it to check that spaCy handles and outputs warnings correctly. If your test outputs a warning that's expected but not relevant to what you're testing, you can use the `@pytest.mark.filterwarnings` decorator and ignore specific warnings starting with a given code:

```python
@pytest.mark.filterwarnings("ignore:\\[W036")
def test_matcher_empty(en_vocab):
    matcher = Matcher(en_vocab)
    matcher(Doc(en_vocab, words=["test"]))
```
### Testing trained pipelines

Our regular test suite does not depend on any of the trained pipelines, since their outputs can vary and aren't generally required to test the library functionality. We test pipelines separately using the tests included in the [`spacy-models`](https://github.com/explosion/spacy-models) repository, which run whenever we train a new suite of models. The tests here mostly focus on making sure that the packages can be loaded and that the predictions seem reasonable, and they include checks for common bugs we encountered previously. If your test does not primarily focus on verifying a model's predictions, it should be part of the core library tests and construct the required objects manually, instead of being added to the models tests.

Keep in mind that specific predictions may change, and we can't test for all incorrect predictions reported by users. Different models make different mistakes, so even a model that's significantly more accurate overall may end up making wrong predictions that it previously didn't. However, some surprising incorrect predictions may indicate deeper bugs that we definitely want to investigate.
150
extra/DEVELOPER_DOCS/Language.md
Normal file
|
@ -0,0 +1,150 @@
|
|||
# Language
|
||||
|
||||
> Reference: `spacy/language.py`
|
||||
|
||||
1. [Constructing the `nlp` object from a config](#1-constructing-the-nlp-object-from-a-config)
|
||||
- [A. Overview of `Language.from_config`](#1a-overview)
|
||||
- [B. Component factories](#1b-how-pipeline-component-factories-work-in-the-config)
|
||||
- [C. Sourcing a component](#1c-sourcing-a-pipeline-component)
|
||||
- [D. Tracking components as they're modified](#1d-tracking-components-as-theyre-modified)
|
||||
- [E. spaCy's config utility function](#1e-spacys-config-utility-functions)
|
||||
2. [Initialization](#initialization)
|
||||
- [A. Initialization for training](#2a-initialization-for-training): `init_nlp`
|
||||
- [B. Initializing the `nlp` object](#2b-initializing-the-nlp-object): `Language.initialize`
|
||||
- [C. Initializing the vocab](#2c-initializing-the-vocab): `init_vocab`
|
||||
|
||||
## 1. Constructing the `nlp` object from a config
|
||||
|
||||
### 1A. Overview
|
||||
|
||||
Most of the functions referenced in the config are regular functions with arbitrary arguments registered via the function registry. However, the pipeline components are a bit special: they receive not only the arguments passed in via the config file, but also the current `nlp` object and the string `name` of the individual component instance (so a user can have multiple components created with the same factory, e.g. `ner_one` and `ner_two`). This name can then be used by the components to add to the losses and scores. This special requirement means that pipeline components can't just be resolved via the config the "normal" way: we need to retrieve the component functions manually and pass them their arguments, plus the `nlp` and `name`.
|
||||
|
||||
The `Language.from_config` classmethod takes care of constructing the `nlp` object from a config. It's the single place where this happens and what `spacy.load` delegates to under the hood. Its main responsibilities are:
|
||||
|
||||
- **Load and validate the config**, and optionally **auto-fill** all missing values that we either have defaults for in the config template or that registered function arguments define defaults for. This helps ensure backwards compatibility, because we're able to add a new argument `foo: str = "bar"` to an existing function, without breaking configs that don't specify it.
|
||||
- **Execute relevant callbacks** for pipeline creation, e.g. optional functions called before and after creation of the `nlp` object and pipeline.
|
||||
- **Initialize language subclass and create tokenizer**. The `from_config` classmethod will always be called on a language subclass, e.g. `English`, not on `Language` directly. Initializing the subclass takes a callback to create the tokenizer.
|
||||
- **Set up the pipeline components**. Components can either refer to a component factory or a `source`, i.e. an existing pipeline that's loaded and that the component is then copied from. We also need to ensure that we update the information about which components are disabled.
|
||||
- **Manage listeners.** If sourced components "listen" to other components (`tok2vec`, `transformer`), we need to ensure that the references are valid. If the config specifies that listeners should be replaced by copies (e.g. to give the `ner` component its own `tok2vec` model instead of listening to the shared `tok2vec` component in the pipeline), we also need to take care of that.
|
||||
|
||||
Note that we only resolve and load **selected sections** in `Language.from_config`, i.e. only the parts that are relevant at runtime, which is `[nlp]` and `[components]`. We don't want to be resolving anything related to training or initialization, since this would mean loading and constructing unnecessary functions, including functions that require information that isn't necessarily available at runtime, like `paths.train`.
|
||||
|
||||
### 1B. How pipeline component factories work in the config
|
||||
|
||||
As opposed to regular registered functions that refer to a registry and function name (e.g. `"@misc": "foo.v1"`), pipeline components follow a different format and refer to their component `factory` name. This corresponds to the name defined via the `@Language.component` or `@Language.factory` decorator. We need this decorator to define additional meta information for the components, like their default config and score weights.
|
||||
|
||||
```ini
|
||||
[components.my_component]
|
||||
factory = "foo"
|
||||
some_arg = "bar"
|
||||
other_arg = ${paths.some_path}
|
||||
```
|
||||
|
||||
This means that we need to create and resolve the `config["components"]` separately from the rest of the config. There are some important considerations and things we need to manage explicitly to avoid unexpected behavior:
|
||||
|
||||
#### Variable interpolation
|
||||
|
||||
When a config is resolved, references to variables are replaced, so that the functions receive the correct value instead of just the variable name. To interpolate a config, we need it in its entirety: we couldn't just interpolate a subsection that refers to variables defined in a different subsection. So we first interpolate the entire config.
|
||||
|
||||
However, the `nlp.config` should include the original config with variables intact – otherwise, loading a pipeline and saving it to disk will destroy all logic implemented via variables and hard-code the values all over the place. This means that when we create the components, we need to keep two versions of the config: the interpolated config with the "real" values and the `raw_config` including the variable references.
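
A rough sketch of these two views of the same config, assuming thinc's `Config` API (`from_str` with `interpolate=False` and `interpolate()`); the section and variable names here are made up:

```python
from thinc.api import Config

CONFIG_STR = """
[paths]
some_path = "/data"

[components]

[components.my_component]
factory = "foo"
some_arg = "bar"
other_arg = ${paths.some_path}
"""

# raw_config keeps the variable references, filled has the "real" values
raw_config = Config().from_str(CONFIG_STR, interpolate=False)
filled = raw_config.interpolate()
assert filled["components"]["my_component"]["other_arg"] == "/data"
assert "${paths.some_path}" in raw_config.to_str()
```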
|
||||
|
||||
#### Factory registry
|
||||
|
||||
Component factories are special and use the `@Language.factory` or `@Language.component` decorator to register themselves and their meta. When the decorator runs, it performs some basic validation, stores the meta information for the factory on the `Language` class (default config, scores etc.) and then adds the factory function to `registry.factories`. The `component` decorator can be used for registering simple functions that just take a `Doc` object and return it, so in that case we create the factory for the user automatically.
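
For reference, the two decorators in use — the component and factory names here are made up, not real spaCy built-ins:

```python
from spacy.language import Language
from spacy.tokens import Doc


@Language.component("debug_tokens")
def debug_tokens(doc: Doc) -> Doc:
    # Plain function component: the factory is created for us automatically
    print("Tokens:", [t.text for t in doc])
    return doc


@Language.factory("token_counter", default_config={"verbose": False})
def create_token_counter(nlp: Language, name: str, verbose: bool):
    # Factory: receives the nlp object and the instance name on top of the config args
    def token_counter(doc: Doc) -> Doc:
        if verbose:
            print(f"{name}: {len(doc)} tokens")
        return doc

    return token_counter
```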
|
||||
|
||||
There's one important detail to note about how factories are registered via entry points: A package that wants to expose spaCy components still needs to register them via the `@Language` decorators so we have the component meta information and can perform required checks. All we care about here is that the decorated function is **loaded and imported**. When it is, the `@Language` decorator takes care of everything, including actually registering the component factory.
|
||||
|
||||
Normally, adding to the registry via an entry point will just add the function to the registry under the given name. But for `spacy_factories`, we don't actually want that: all we care about is that the function decorated with `@Language` is imported so the decorator runs. So we only exploit Python's entry point system to automatically import the function, and the `spacy_factories` entry point group actually adds to a **separate registry**, `registry._factories`, under the hood. Its only purpose is that the functions are imported. The decorator then runs, creates the factory if needed and adds it to the `registry.factories` registry.
|
||||
|
||||
#### Language-specific factories
|
||||
|
||||
spaCy supports registering factories on the `Language` base class, as well as language-specific subclasses like `English` or `German`. This allows providing different factories depending on the language, e.g. a different default lemmatizer. The `Language.get_factory_name` classmethod constructs the factory name as `{lang}.{name}` if a language is available (i.e. if it's a subclass) and falls back to `{name}` otherwise. So `@German.factory("foo")` will add a factory `de.foo` under the hood. If you call `nlp.add_pipe("foo")`, we first check if there's a factory for `{nlp.lang}.foo` and if not, we fall back to checking for a factory `foo`.
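
As a small sketch (the factory name and component body are hypothetical):

```python
from spacy.lang.de import German
from spacy.tokens import Doc


@German.factory("my_lemmatizer")
def create_my_lemmatizer(nlp, name: str):
    # Registered as "de.my_lemmatizer" under the hood
    def my_lemmatizer(doc: Doc) -> Doc:
        return doc

    return my_lemmatizer


nlp = German()
# Checks for "de.my_lemmatizer" first, then falls back to "my_lemmatizer"
nlp.add_pipe("my_lemmatizer")
```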
|
||||
|
||||
#### Creating a pipeline component from a factory
|
||||
|
||||
`Language.add_pipe` takes care of adding a pipeline component, given its factory name and its config. If no source pipeline to copy the component from is provided, it delegates to `Language.create_pipe`, which sets up the actual component function.
|
||||
|
||||
- Validate the config and make sure that the factory was registered via the decorator and that we have meta for it.
|
||||
- Update the component config with any defaults specified by the component's `default_config`, if available. This is done by merging the values we receive into the defaults. It ensures that you can still add a component without having to specify its _entire_ config including more complex settings like `model`. If no `model` is defined, we use the default.
|
||||
- Check if we have a language-specific factory for the given `nlp.lang` and if not, fall back to the global factory.
|
||||
- Construct the component config, consisting of whatever arguments were provided, plus the current `nlp` object and `name`, which are default expected arguments of all factories. We also add a reference to the `@factories` registry, so we can resolve the config via the registry, like any other config. With the added `nlp` and `name`, it should now include all expected arguments of the given function.
|
||||
- Fill the config to make sure all unspecified defaults from the function arguments are added and update the `raw_config` (uninterpolated with variables intact) with that information, so the component config we store in `nlp.config` is up to date. We do this by adding the `raw_config` _into_ the filled config – otherwise, the references to variables would be overwritten.
|
||||
- Resolve the config and create all functions it refers to (e.g. `model`). This gives us the actual component function that we can insert into the pipeline.
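
In terms of the public API, the process described above is what runs when you call `add_pipe` with a partial config — a minimal sketch (the component and setting are just an example):

```python
import spacy

nlp = spacy.blank("en")
# Only settings that differ from the factory defaults need to be provided;
# everything else (including the model) is filled in from the defaults.
nlp.add_pipe("textcat", config={"threshold": 0.5})
print(nlp.config["components"]["textcat"])
```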
|
||||
|
||||
### 1C. Sourcing a pipeline component
|
||||
|
||||
```ini
|
||||
[components.ner]
|
||||
source = "en_core_web_sm"
|
||||
```
|
||||
|
||||
spaCy also allows ["sourcing" a component](https://spacy.io/usage/processing-pipelines#sourced-components), which will copy it over from an existing pipeline. In this case, `Language.add_pipe` will delegate to `Language.create_pipe_from_source`. In order to copy a component effectively and validate it, the source pipeline first needs to be loaded. This is done in `Language.from_config`, so a source pipeline only has to be loaded once if multiple components source from it. Sourcing a component will perform the following checks and modifications:
|
||||
|
||||
- For each sourced pipeline component loaded in `Language.from_config`, a hash of the vectors data from the source pipeline is stored in the pipeline meta so we're able to check whether the vectors match and warn if not (since different vectors that are used as features in components can lead to degraded performance). Because the vectors are not loaded at the point when components are sourced, the check is postponed to `init_vocab` as part of `Language.initialize`.
|
||||
- If the sourced pipeline component is loaded through `Language.add_pipe(source=)`, the vectors are already loaded and can be compared directly. The check compares the shape and keys first and finally falls back to comparing the actual byte representation of the vectors (which is slower).
|
||||
- Ensure that the component is available in the pipeline.
|
||||
- Interpolate the entire config of the source pipeline so all variables are replaced and the component's config that's copied over doesn't include references to variables that are not available in the destination config.
|
||||
- Add the source `vocab.strings` to the destination's `vocab.strings` so we don't end up with unavailable strings in the final pipeline (which would also include labels used by the sourced component).
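
The runtime equivalent of the config block above, assuming the `en_core_web_sm` package is installed:

```python
import spacy

source_nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
# Copies the trained ner component (and its strings) from the source pipeline
nlp.add_pipe("ner", source=source_nlp)
```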
|
||||
|
||||
Note that there may be other incompatibilities that we're currently not checking for and that could cause a sourced component to not work in the destination pipeline. We're interested in adding more checks here but there'll always be a small number of edge cases we'll never be able to catch, including a sourced component depending on other pipeline state that's not available in the destination pipeline.
|
||||
|
||||
### 1D. Tracking components as they're modified
|
||||
|
||||
The `Language` class implements methods for removing, replacing or renaming pipeline components. Whenever we make these changes, we need to update the information stored on the `Language` object to ensure that it matches the current state of the pipeline. If a user just writes to `nlp.config` manually, we obviously can't ensure that the config matches the reality – but since we offer modification via the pipe methods, it's expected that spaCy keeps the config in sync under the hood. Otherwise, saving a modified pipeline to disk and loading it back wouldn't work. The internal attributes we need to keep in sync here are:
|
||||
|
||||
| Attribute | Type | Description |
|
||||
| ------------------------ | ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `Language._components` | `List[Tuple[str, Callable]]` | All pipeline components as `(name, func)` tuples. This is used as the source of truth for `Language.pipeline`, `Language.pipe_names` and `Language.components`. |
|
||||
| `Language._pipe_meta` | `Dict[str, FactoryMeta]` | The meta information of a component's factory, keyed by component name. This can include multiple components referring to the same factory meta. |
|
||||
| `Language._pipe_configs` | `Dict[str, Config]` | The component's config, keyed by component name. |
|
||||
| `Language._disabled` | `Set[str]` | Names of components that are currently disabled. |
|
||||
| `Language._config` | `Config` | The underlying config. This is internal only and is used as the basis for constructing the config in the `Language.config` property. |
|
||||
|
||||
In addition to the actual component settings in `[components]`, the config also allows specifying component-specific arguments via the `[initialize.components]` block, which are passed to the component's `initialize` method during initialization if it's available. So we also need to keep this in sync in the underlying config.
|
||||
|
||||
### 1E. spaCy's config utility functions
|
||||
|
||||
When working with configs in spaCy, make sure to use the utility functions provided by spaCy if available, instead of calling the respective `Config` methods. The utilities take care of providing spaCy-specific error messages and ensure a consistent order of config sections by setting the `section_order` argument. This ensures that exported configs always have the same consistent format.
|
||||
|
||||
- `util.load_config`: load a config from a file
|
||||
- `util.load_config_from_str`: load a config from its string representation
|
||||
- `util.copy_config`: deepcopy a config
|
||||
|
||||
## 2. Initialization
|
||||
|
||||
Initialization is a separate step of the [config lifecycle](https://spacy.io/usage/training#config-lifecycle) that's not performed at runtime. It's implemented via the `training.initialize.init_nlp` helper and calls into the `Language.initialize` method, which sets up the pipeline and component models before training. The `initialize` method takes a callback that returns a sample of examples, which is used to initialize the component models, add all required labels and perform shape inference if applicable.
|
||||
|
||||
Components can also define custom initialization settings via the `[initialize.components]` block, e.g. if they require external data like lookup tables to be loaded in. All config settings defined here will be passed to the component's `initialize` method, if it implements one. Components are expected to handle their own serialization after they're initialized so that any data or settings they require are saved with the pipeline and will be available from disk when the pipeline is loaded back at runtime.
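
A sketch of what such a component can look like — the component class, its settings and data format are hypothetical, and the factory registration (via `@Language.factory`) is omitted:

```python
import srsly
from spacy.tokens import Doc


class LookupComponent:
    def __init__(self, nlp, name: str):
        self.name = name
        self.table = {}

    def initialize(self, get_examples=None, *, nlp=None, table_path: str = ""):
        # Receives the settings defined under [initialize.components.<name>],
        # e.g. a path to an external lookup table
        if table_path:
            self.table = srsly.read_json(table_path)

    def __call__(self, doc: Doc) -> Doc:
        return doc
```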
|
||||
|
||||
### 2A. Initialization for training
|
||||
|
||||
The `init_nlp` function is called before training and returns an initialized `nlp` object that can be updated with the examples. It only needs the config and does the following:
|
||||
|
||||
- Load and validate the config. In order to validate certain settings like the `seed`, we also interpolate the config to get the final value (because in theory, a user could provide this via a variable).
|
||||
- Set up the GPU allocation, if required.
|
||||
- Create the `nlp` object from the raw, uninterpolated config, which delegates to `Language.from_config`. Since this method may modify and auto-fill the config and pipeline component settings, we then use the interpolated version of `nlp.config` going forward, to ensure that what we're training with is up to date.
|
||||
- Resolve the `[training]` block of the config and perform validation, e.g. to check that the corpora are available.
|
||||
- Determine the components that should be frozen (not updated during training) or resumed (sourced components from a different pipeline that should be updated from the examples and not reset and re-initialized). To resume training, we can call the `nlp.resume_training` method.
|
||||
- Initialize the `nlp` object via `nlp.initialize` and pass it a `get_examples` callback that returns the training corpus (used for shape inference, setting up labels etc.). If the training corpus is streamed, we only provide a small sample of the data, since the full stream can potentially be infinite. `nlp.initialize` will delegate to the components as well and pass the data sample forward.
|
||||
- Check the listeners and warn about component dependencies, e.g. if a frozen component listens to a component that is retrained, or vice versa (which can degrade results).
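
At a glance, the entry point boils down to something like this (a sketch; the config path is a placeholder):

```python
from spacy import util
from spacy.training.initialize import init_nlp

config = util.load_config("config.cfg", interpolate=False)
nlp = init_nlp(config, use_gpu=-1)
```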
|
||||
|
||||
### 2B. Initializing the `nlp` object
|
||||
|
||||
The `Language.initialize` method does the following:
|
||||
|
||||
- **Resolve the config** defined in the `[initialize]` block separately (since everything else is already available in the loaded `nlp` object), based on the fully interpolated config.
|
||||
- **Execute callbacks**, i.e. `before_init` and `after_init`, if they're defined.
|
||||
- **Initialize the vocab**, including vocab data, lookup tables and vectors.
|
||||
- **Initialize the tokenizer** if it implements an `initialize` method. This is not the case for the default tokenizers, but it allows custom tokenizers to depend on external data resources that are loaded in on initialization.
|
||||
- **Initialize all pipeline components** if they implement an `initialize` method and pass them the `get_examples` callback, the current `nlp` object as well as any additional initialization config settings provided in the component-specific block.
|
||||
- **Initialize pretraining** if a `[pretraining]` block is available in the config. This allows loading pretrained tok2vec weights in `spacy pretrain`.
|
||||
- **Register listeners** if token-to-vector embedding layers of a component model "listen" to a previous component (`tok2vec`, `transformer`) in the pipeline.
|
||||
- **Create an optimizer** on the `Language` class, either by adding the optimizer passed as `sgd` to `initialize`, or by creating the optimizer defined in the config's training settings.
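
For a component-level view, a minimal initialization call might look like this (the pipeline and examples are made up):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("textcat")
examples = [
    Example.from_dict(
        nlp.make_doc("I love this"), {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
    ),
    Example.from_dict(
        nlp.make_doc("I hate this"), {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}
    ),
]
# Labels and shapes are inferred from the sample, and an optimizer is created
optimizer = nlp.initialize(get_examples=lambda: examples)
```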
|
||||
|
||||
### 2C. Initializing the vocab
|
||||
|
||||
Vocab initialization is handled in the `training.initialize.init_vocab` helper. It receives the relevant loaded functions and values from the config and takes care of the following:
|
||||
|
||||
- Add lookup tables defined in the config initialization, e.g. custom lemmatization tables. Those will be added to `nlp.vocab.lookups` from where they can be accessed by components.
|
||||
- Add JSONL-formatted [vocabulary data](https://spacy.io/api/data-formats#vocab-jsonl) to pre-populate the lexical attributes.
|
||||
- Load vectors into the pipeline. Vectors are defined as a name or path to a saved `nlp` object containing the vectors, e.g. `en_vectors_web_lg`. It's loaded and the vectors are ported over, while ensuring that all source strings are available in the destination strings. We also warn if there's a mismatch between sourced vectors, since this can lead to problems.
|
7
extra/DEVELOPER_DOCS/README.md
Normal file
|
@ -0,0 +1,7 @@
|
|||
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
|
||||
|
||||
# Developer Documentation
|
||||
|
||||
This directory includes additional documentation and explanations of spaCy's internals. It's mostly intended for the spaCy core development team and contributors interested in the more complex parts of the library. The documents generally focus on more abstract implementation details and how specific methods and algorithms work, and they assume knowledge of what's already available in the [usage documentation](https://spacy.io/usage) and [API reference](https://spacy.io/api).
|
||||
|
||||
If you're looking to contribute to spaCy, make sure to check out the documentation and [contributing guide](https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md) first.
|
|
@ -104,3 +104,26 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
||||
|
||||
|
||||
importlib_metadata
|
||||
------------------
|
||||
|
||||
* Files: util.py
|
||||
|
||||
The implementation of packages_distributions() is adapted from
|
||||
importlib_metadata, which is distributed under the following license:
|
||||
|
||||
Copyright 2017-2019 Jason R. Coombs, Barry Warsaw
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
# fmt: off
|
||||
__title__ = "spacy"
|
||||
__version__ = "3.1.1"
|
||||
__version__ = "3.1.2"
|
||||
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
|
||||
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
||||
__projects__ = "https://github.com/explosion/projects"
|
||||
|
|
|
@ -101,13 +101,14 @@ def debug_data(
|
|||
# Create the gold corpus to be able to better analyze data
|
||||
dot_names = [T["train_corpus"], T["dev_corpus"]]
|
||||
train_corpus, dev_corpus = resolve_dot_names(config, dot_names)
|
||||
|
||||
nlp.initialize(lambda: train_corpus(nlp))
|
||||
msg.good("Pipeline can be initialized with data")
|
||||
|
||||
train_dataset = list(train_corpus(nlp))
|
||||
dev_dataset = list(dev_corpus(nlp))
|
||||
msg.good("Corpus is loadable")
|
||||
|
||||
nlp.initialize(lambda: train_dataset)
|
||||
msg.good("Pipeline can be initialized with data")
|
||||
|
||||
# Create all gold data here to avoid iterating over the train_dataset constantly
|
||||
gold_train_data = _compile_gold(train_dataset, factory_names, nlp, make_proj=True)
|
||||
gold_train_unpreprocessed_data = _compile_gold(
|
||||
|
|
|
@ -2,6 +2,8 @@ from typing import Optional, Union, Any, Dict, List, Tuple
|
|||
import shutil
|
||||
from pathlib import Path
|
||||
from wasabi import Printer, MarkdownRenderer, get_raw_input
|
||||
from thinc.api import Config
|
||||
from collections import defaultdict
|
||||
import srsly
|
||||
import sys
|
||||
|
||||
|
@ -99,6 +101,12 @@ def package(
|
|||
msg.fail("Can't load pipeline meta.json", meta_path, exits=1)
|
||||
meta = srsly.read_json(meta_path)
|
||||
meta = get_meta(input_dir, meta)
|
||||
if meta["requirements"]:
|
||||
msg.good(
|
||||
f"Including {len(meta['requirements'])} package requirement(s) from "
|
||||
f"meta and config",
|
||||
", ".join(meta["requirements"]),
|
||||
)
|
||||
if name is not None:
|
||||
meta["name"] = name
|
||||
if version is not None:
|
||||
|
@ -175,6 +183,51 @@ def has_wheel() -> bool:
|
|||
return False
|
||||
|
||||
|
||||
def get_third_party_dependencies(
|
||||
config: Config, exclude: List[str] = util.SimpleFrozenList()
|
||||
) -> List[str]:
|
||||
"""If the config includes references to registered functions that are
|
||||
provided by third-party packages (spacy-transformers, other libraries), we
|
||||
want to include them in meta["requirements"] so that the package specifies
|
||||
them as dependencies and the user won't have to do it manually.
|
||||
|
||||
We do this by:
|
||||
- traversing the config to check for registered function (@ keys)
|
||||
- looking up the functions and getting their module
|
||||
- looking up the module version and generating an appropriate version range
|
||||
|
||||
config (Config): The pipeline config.
|
||||
exclude (list): List of packages to exclude (e.g. that already exist in meta).
|
||||
RETURNS (list): The versioned requirements.
|
||||
"""
|
||||
own_packages = ("spacy", "spacy-nightly", "thinc", "srsly")
|
||||
distributions = util.packages_distributions()
|
||||
funcs = defaultdict(set)
|
||||
for path, value in util.walk_dict(config):
|
||||
if path[-1].startswith("@"): # collect all function references by registry
|
||||
funcs[path[-1][1:]].add(value)
|
||||
modules = set()
|
||||
for reg_name, func_names in funcs.items():
|
||||
sub_registry = getattr(util.registry, reg_name)
|
||||
for func_name in func_names:
|
||||
func_info = sub_registry.find(func_name)
|
||||
module_name = func_info.get("module")
|
||||
if module_name: # the code is part of a module, not a --code file
|
||||
modules.add(func_info["module"].split(".")[0])
|
||||
dependencies = []
|
||||
for module_name in modules:
|
||||
if module_name in distributions:
|
||||
dist = distributions.get(module_name)
|
||||
if dist:
|
||||
pkg = dist[0]
|
||||
if pkg in own_packages or pkg in exclude:
|
||||
continue
|
||||
version = util.get_package_version(pkg)
|
||||
version_range = util.get_minor_version_range(version)
|
||||
dependencies.append(f"{pkg}{version_range}")
|
||||
return dependencies
|
||||
|
||||
|
||||
def get_build_formats(formats: List[str]) -> Tuple[bool, bool]:
|
||||
supported = ["sdist", "wheel", "none"]
|
||||
for form in formats:
|
||||
|
@ -208,7 +261,7 @@ def get_meta(
|
|||
nlp = util.load_model_from_path(Path(model_path))
|
||||
meta.update(nlp.meta)
|
||||
meta.update(existing_meta)
|
||||
meta["spacy_version"] = util.get_model_version_range(about.__version__)
|
||||
meta["spacy_version"] = util.get_minor_version_range(about.__version__)
|
||||
meta["vectors"] = {
|
||||
"width": nlp.vocab.vectors_length,
|
||||
"vectors": len(nlp.vocab.vectors),
|
||||
|
@ -217,6 +270,11 @@ def get_meta(
|
|||
}
|
||||
if about.__title__ != "spacy":
|
||||
meta["parent_package"] = about.__title__
|
||||
meta.setdefault("requirements", [])
|
||||
# Update the requirements with all third-party packages in the config
|
||||
existing_reqs = [util.split_requirement(req)[0] for req in meta["requirements"]]
|
||||
reqs = get_third_party_dependencies(nlp.config, exclude=existing_reqs)
|
||||
meta["requirements"].extend(reqs)
|
||||
return meta
|
||||
|
||||
|
||||
|
|
|
@ -2,7 +2,7 @@ from pathlib import Path
|
|||
from wasabi import msg
|
||||
from .remote_storage import RemoteStorage
|
||||
from .remote_storage import get_command_hash
|
||||
from .._util import project_cli, Arg
|
||||
from .._util import project_cli, Arg, logger
|
||||
from .._util import load_project_config
|
||||
from .run import update_lockfile
|
||||
|
||||
|
@ -39,11 +39,15 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
|
|||
# in the list.
|
||||
while commands:
|
||||
for i, cmd in enumerate(list(commands)):
|
||||
logger.debug(f"CMD: {cmd['name']}.")
|
||||
deps = [project_dir / dep for dep in cmd.get("deps", [])]
|
||||
if all(dep.exists() for dep in deps):
|
||||
cmd_hash = get_command_hash("", "", deps, cmd["script"])
|
||||
for output_path in cmd.get("outputs", []):
|
||||
url = storage.pull(output_path, command_hash=cmd_hash)
|
||||
logger.debug(
|
||||
f"URL: {url} for {output_path} with command hash {cmd_hash}"
|
||||
)
|
||||
yield url, output_path
|
||||
|
||||
out_locs = [project_dir / out for out in cmd.get("outputs", [])]
|
||||
|
@ -53,6 +57,8 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
|
|||
# we iterate over the loop again.
|
||||
commands.pop(i)
|
||||
break
|
||||
else:
|
||||
logger.debug(f"Dependency missing. Skipping {cmd['name']} outputs.")
|
||||
else:
|
||||
# If we didn't break the for loop, break the while loop.
|
||||
break
|
||||
|
|
|
@ -3,7 +3,7 @@ from wasabi import msg
|
|||
from .remote_storage import RemoteStorage
|
||||
from .remote_storage import get_content_hash, get_command_hash
|
||||
from .._util import load_project_config
|
||||
from .._util import project_cli, Arg
|
||||
from .._util import project_cli, Arg, logger
|
||||
|
||||
|
||||
@project_cli.command("push")
|
||||
|
@ -37,12 +37,15 @@ def project_push(project_dir: Path, remote: str):
|
|||
remote = config["remotes"][remote]
|
||||
storage = RemoteStorage(project_dir, remote)
|
||||
for cmd in config.get("commands", []):
|
||||
logger.debug(f"CMD: cmd['name']")
|
||||
deps = [project_dir / dep for dep in cmd.get("deps", [])]
|
||||
if any(not dep.exists() for dep in deps):
|
||||
logger.debug(f"Dependency missing. Skipping {cmd['name']} outputs")
|
||||
continue
|
||||
cmd_hash = get_command_hash(
|
||||
"", "", [project_dir / dep for dep in cmd.get("deps", [])], cmd["script"]
|
||||
)
|
||||
logger.debug(f"CMD_HASH: {cmd_hash}")
|
||||
for output_path in cmd.get("outputs", []):
|
||||
output_loc = project_dir / output_path
|
||||
if output_loc.exists() and _is_not_empty_dir(output_loc):
|
||||
|
@ -51,6 +54,9 @@ def project_push(project_dir: Path, remote: str):
|
|||
command_hash=cmd_hash,
|
||||
content_hash=get_content_hash(output_loc),
|
||||
)
|
||||
logger.debug(
|
||||
f"URL: {url} for output {output_path} with cmd_hash {cmd_hash}"
|
||||
)
|
||||
yield output_path, url
|
||||
|
||||
|
||||
|
|
|
@ -212,6 +212,9 @@ def check_rerun(
|
|||
strict_version (bool):
|
||||
RETURNS (bool): Whether to re-run the command.
|
||||
"""
|
||||
# Always rerun if no-skip is set
|
||||
if command.get("no_skip", False):
|
||||
return True
|
||||
lock_path = project_dir / PROJECT_LOCK
|
||||
if not lock_path.exists(): # We don't have a lockfile, run command
|
||||
return True
|
||||
|
|
|
@ -41,10 +41,10 @@ da:
|
|||
word_vectors: da_core_news_lg
|
||||
transformer:
|
||||
efficiency:
|
||||
name: DJSammy/bert-base-danish-uncased_BotXO,ai
|
||||
name: Maltehb/danish-bert-botxo
|
||||
size_factor: 3
|
||||
accuracy:
|
||||
name: DJSammy/bert-base-danish-uncased_BotXO,ai
|
||||
name: Maltehb/danish-bert-botxo
|
||||
size_factor: 3
|
||||
de:
|
||||
word_vectors: de_core_news_lg
|
||||
|
|
|
@ -43,9 +43,13 @@ def train_cli(
|
|||
# Make sure all files and paths exists if they are needed
|
||||
if not config_path or (str(config_path) != "-" and not config_path.exists()):
|
||||
msg.fail("Config file not found", config_path, exits=1)
|
||||
if output_path is not None and not output_path.exists():
|
||||
output_path.mkdir(parents=True)
|
||||
msg.good(f"Created output directory: {output_path}")
|
||||
if not output_path:
|
||||
msg.info("No output directory provided")
|
||||
else:
|
||||
if not output_path.exists():
|
||||
output_path.mkdir(parents=True)
|
||||
msg.good(f"Created output directory: {output_path}")
|
||||
msg.info(f"Saving to output directory: {output_path}")
|
||||
overrides = parse_config_overrides(ctx.args)
|
||||
import_code(code_path)
|
||||
setup_gpu(use_gpu)
|
||||
|
|
|
@ -27,6 +27,14 @@ try: # Python 3.8+
|
|||
except ImportError:
|
||||
from typing_extensions import Literal # noqa: F401
|
||||
|
||||
# Important note: The importlib_metadata "backport" includes functionality
|
||||
# that's not part of the built-in importlib.metadata. We should treat this
|
||||
# import like the built-in and only use what's available there.
|
||||
try: # Python 3.8+
|
||||
import importlib.metadata as importlib_metadata
|
||||
except ImportError:
|
||||
from catalogue import _importlib_metadata as importlib_metadata # noqa: F401
|
||||
|
||||
from thinc.api import Optimizer # noqa: F401
|
||||
|
||||
pickle = pickle
|
||||
|
|
|
@ -356,8 +356,8 @@ class Errors:
|
|||
E098 = ("Invalid pattern: expected both RIGHT_ID and RIGHT_ATTRS.")
|
||||
E099 = ("Invalid pattern: the first node of pattern should be an anchor "
|
||||
"node. The node should only contain RIGHT_ID and RIGHT_ATTRS.")
|
||||
E100 = ("Nodes other than the anchor node should all contain LEFT_ID, "
|
||||
"REL_OP and RIGHT_ID.")
|
||||
E100 = ("Nodes other than the anchor node should all contain {required}, "
|
||||
"but these are missing: {missing}")
|
||||
E101 = ("RIGHT_ID should be a new node and LEFT_ID should already have "
|
||||
"have been declared in previous edges.")
|
||||
E102 = ("Can't merge non-disjoint spans. '{token}' is already part of "
|
||||
|
@ -521,6 +521,10 @@ class Errors:
|
|||
E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")
|
||||
|
||||
# New errors added in v3.x
|
||||
E866 = ("A SpanGroup is not functional after the corresponding Doc has "
|
||||
"been garbage collected. To keep using the spans, make sure that "
|
||||
"the corresponding Doc object is still available in the scope of "
|
||||
"your function.")
|
||||
E867 = ("The 'textcat' component requires at least two labels because it "
|
||||
"uses mutually exclusive classes where exactly one label is True "
|
||||
"for each doc. For binary classification tasks, you can use two "
|
||||
|
|
|
@ -95,6 +95,7 @@ GLOSSARY = {
|
|||
"XX": "unknown",
|
||||
"BES": 'auxiliary "be"',
|
||||
"HVS": 'forms of "have"',
|
||||
"_SP": "whitespace",
|
||||
# POS Tags (German)
|
||||
# TIGER Treebank
|
||||
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
|
||||
|
|
|
@ -199,7 +199,7 @@ class Language:
|
|||
|
||||
DOCS: https://spacy.io/api/language#meta
|
||||
"""
|
||||
spacy_version = util.get_model_version_range(about.__version__)
|
||||
spacy_version = util.get_minor_version_range(about.__version__)
|
||||
if self.vocab.lang:
|
||||
self._meta.setdefault("lang", self.vocab.lang)
|
||||
else:
|
||||
|
@ -1698,7 +1698,6 @@ class Language:
|
|||
# them here so they're only loaded once
|
||||
source_nlps = {}
|
||||
source_nlp_vectors_hashes = {}
|
||||
nlp.meta["_sourced_vectors_hashes"] = {}
|
||||
for pipe_name in config["nlp"]["pipeline"]:
|
||||
if pipe_name not in pipeline:
|
||||
opts = ", ".join(pipeline.keys())
|
||||
|
@ -1747,6 +1746,8 @@ class Language:
|
|||
source_nlp_vectors_hashes[model] = hash(
|
||||
source_nlps[model].vocab.vectors.to_bytes()
|
||||
)
|
||||
if "_sourced_vectors_hashes" not in nlp.meta:
|
||||
nlp.meta["_sourced_vectors_hashes"] = {}
|
||||
nlp.meta["_sourced_vectors_hashes"][
|
||||
pipe_name
|
||||
] = source_nlp_vectors_hashes[model]
|
||||
|
@ -1908,7 +1909,7 @@ class Language:
|
|||
if not hasattr(proc, "to_disk"):
|
||||
continue
|
||||
serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
|
||||
serializers["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
serializers["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
|
||||
util.to_disk(path, serializers, exclude)
|
||||
|
||||
def from_disk(
|
||||
|
@ -1939,7 +1940,7 @@ class Language:
|
|||
|
||||
def deserialize_vocab(path: Path) -> None:
|
||||
if path.exists():
|
||||
self.vocab.from_disk(path)
|
||||
self.vocab.from_disk(path, exclude=exclude)
|
||||
|
||||
path = util.ensure_path(path)
|
||||
deserializers = {}
|
||||
|
@ -1977,7 +1978,7 @@ class Language:
|
|||
DOCS: https://spacy.io/api/language#to_bytes
|
||||
"""
|
||||
serializers = {}
|
||||
serializers["vocab"] = lambda: self.vocab.to_bytes()
|
||||
serializers["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
|
||||
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
|
||||
serializers["meta.json"] = lambda: srsly.json_dumps(self.meta)
|
||||
serializers["config.cfg"] = lambda: self.config.to_bytes()
|
||||
|
@ -2013,7 +2014,7 @@ class Language:
|
|||
b, interpolate=False
|
||||
)
|
||||
deserializers["meta.json"] = deserialize_meta
|
||||
deserializers["vocab"] = self.vocab.from_bytes
|
||||
deserializers["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
|
||||
deserializers["tokenizer"] = lambda b: self.tokenizer.from_bytes(
|
||||
b, exclude=["vocab"]
|
||||
)
|
||||
|
|
61
spacy/lexeme.pyi
Normal file
|
@ -0,0 +1,61 @@
|
|||
from typing import (
|
||||
Union,
|
||||
Any,
|
||||
)
|
||||
from thinc.types import Floats1d
|
||||
from .tokens import Doc, Span, Token
|
||||
from .vocab import Vocab
|
||||
|
||||
class Lexeme:
|
||||
def __init__(self, vocab: Vocab, orth: int) -> None: ...
|
||||
def __richcmp__(self, other: Lexeme, op: int) -> bool: ...
|
||||
def __hash__(self) -> int: ...
|
||||
def set_attrs(self, **attrs: Any) -> None: ...
|
||||
def set_flag(self, flag_id: int, value: bool) -> None: ...
|
||||
def check_flag(self, flag_id: int) -> bool: ...
|
||||
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
|
||||
@property
|
||||
def has_vector(self) -> bool: ...
|
||||
@property
|
||||
def vector_norm(self) -> float: ...
|
||||
vector: Floats1d
|
||||
rank: str
|
||||
sentiment: float
|
||||
@property
|
||||
def orth_(self) -> str: ...
|
||||
@property
|
||||
def text(self) -> str: ...
|
||||
lower: str
|
||||
norm: int
|
||||
shape: int
|
||||
prefix: int
|
||||
suffix: int
|
||||
cluster: int
|
||||
lang: int
|
||||
prob: float
|
||||
lower_: str
|
||||
norm_: str
|
||||
shape_: str
|
||||
prefix_: str
|
||||
suffix_: str
|
||||
lang_: str
|
||||
flags: int
|
||||
@property
|
||||
def is_oov(self) -> bool: ...
|
||||
is_stop: bool
|
||||
is_alpha: bool
|
||||
is_ascii: bool
|
||||
is_digit: bool
|
||||
is_lower: bool
|
||||
is_upper: bool
|
||||
is_title: bool
|
||||
is_punct: bool
|
||||
is_space: bool
|
||||
is_bracket: bool
|
||||
is_quote: bool
|
||||
is_left_punct: bool
|
||||
is_right_punct: bool
|
||||
is_currency: bool
|
||||
like_url: bool
|
||||
like_num: bool
|
||||
like_email: bool
|
|
@ -3,7 +3,6 @@ from typing import List
|
|||
from collections import defaultdict
|
||||
from itertools import product
|
||||
|
||||
import numpy
|
||||
import warnings
|
||||
|
||||
from .matcher cimport Matcher
|
||||
|
@ -122,13 +121,15 @@ cdef class DependencyMatcher:
|
|||
raise ValueError(Errors.E099.format(key=key))
|
||||
visited_nodes[relation["RIGHT_ID"]] = True
|
||||
else:
|
||||
if not(
|
||||
"RIGHT_ID" in relation
|
||||
and "RIGHT_ATTRS" in relation
|
||||
and "REL_OP" in relation
|
||||
and "LEFT_ID" in relation
|
||||
):
|
||||
raise ValueError(Errors.E100.format(key=key))
|
||||
required_keys = {"RIGHT_ID", "RIGHT_ATTRS", "REL_OP", "LEFT_ID"}
|
||||
relation_keys = set(relation.keys())
|
||||
missing = required_keys - relation_keys
|
||||
if missing:
|
||||
missing_txt = ", ".join(list(missing))
|
||||
raise ValueError(Errors.E100.format(
|
||||
required=required_keys,
|
||||
missing=missing_txt
|
||||
))
|
||||
if (
|
||||
relation["RIGHT_ID"] in visited_nodes
|
||||
or relation["LEFT_ID"] not in visited_nodes
|
||||
|
@ -175,28 +176,22 @@ cdef class DependencyMatcher:
|
|||
self._callbacks[key] = on_match
|
||||
|
||||
# Add 'RIGHT_ATTRS' to self._patterns[key]
|
||||
_patterns = []
|
||||
for pattern in patterns:
|
||||
token_patterns = []
|
||||
for i in range(len(pattern)):
|
||||
token_pattern = [pattern[i]["RIGHT_ATTRS"]]
|
||||
token_patterns.append(token_pattern)
|
||||
_patterns.append(token_patterns)
|
||||
_patterns = [[[pat["RIGHT_ATTRS"]] for pat in pattern] for pattern in patterns]
|
||||
self._patterns[key].extend(_patterns)
|
||||
|
||||
# Add each node pattern of all the input patterns individually to the
|
||||
# matcher. This enables only a single instance of Matcher to be used.
|
||||
# Multiple adds are required to track each node pattern.
|
||||
tokens_to_key_list = []
|
||||
for i in range(len(_patterns)):
|
||||
for i, current_patterns in enumerate(_patterns):
|
||||
|
||||
# Preallocate list space
|
||||
tokens_to_key = [None]*len(_patterns[i])
|
||||
tokens_to_key = [None] * len(current_patterns)
|
||||
|
||||
# TODO: Better ways to hash edges in pattern?
|
||||
for j in range(len(_patterns[i])):
|
||||
for j, _pattern in enumerate(current_patterns):
|
||||
k = self._get_matcher_key(key, i, j)
|
||||
self._matcher.add(k, [_patterns[i][j]])
|
||||
self._matcher.add(k, [_pattern])
|
||||
tokens_to_key[j] = k
|
||||
|
||||
tokens_to_key_list.append(tokens_to_key)
|
||||
|
@ -333,7 +328,7 @@ cdef class DependencyMatcher:
|
|||
# position of the matched tokens
|
||||
for candidate_match in product(*all_positions):
|
||||
|
||||
# A potential match is a valid match if all relationhips between the
|
||||
# A potential match is a valid match if all relationships between the
|
||||
# matched tokens are satisfied.
|
||||
is_valid = True
|
||||
for left_idx in range(len(candidate_match)):
|
||||
|
@ -420,18 +415,10 @@ cdef class DependencyMatcher:
|
|||
return []
|
||||
|
||||
def _right_sib(self, doc, node):
|
||||
candidate_children = []
|
||||
for child in list(doc[node].head.children):
|
||||
if child.i > node:
|
||||
candidate_children.append(doc[child.i])
|
||||
return candidate_children
|
||||
return [doc[child.i] for child in doc[node].head.children if child.i > node]
|
||||
|
||||
def _left_sib(self, doc, node):
|
||||
candidate_children = []
|
||||
for child in list(doc[node].head.children):
|
||||
if child.i < node:
|
||||
candidate_children.append(doc[child.i])
|
||||
return candidate_children
|
||||
return [doc[child.i] for child in doc[node].head.children if child.i < node]
|
||||
|
||||
def _normalize_key(self, key):
|
||||
if isinstance(key, basestring):
|
||||
|
|
41
spacy/matcher/matcher.pyi
Normal file
|
@ -0,0 +1,41 @@
|
|||
from typing import Any, List, Dict, Tuple, Optional, Callable, Union, Iterator, Iterable
|
||||
from ..vocab import Vocab
|
||||
from ..tokens import Doc, Span
|
||||
|
||||
class Matcher:
|
||||
def __init__(self, vocab: Vocab, validate: bool = ...) -> None: ...
|
||||
def __reduce__(self) -> Any: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __contains__(self, key: str) -> bool: ...
|
||||
def add(
|
||||
self,
|
||||
key: str,
|
||||
patterns: List[List[Dict[str, Any]]],
|
||||
*,
|
||||
on_match: Optional[
|
||||
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
|
||||
] = ...,
|
||||
greedy: Optional[str] = ...
|
||||
) -> None: ...
|
||||
def remove(self, key: str) -> None: ...
|
||||
def has_key(self, key: Union[str, int]) -> bool: ...
|
||||
def get(
|
||||
self, key: Union[str, int], default: Optional[Any] = ...
|
||||
) -> Tuple[Optional[Callable[[Any], Any]], List[List[Dict[Any, Any]]]]: ...
|
||||
def pipe(
|
||||
self,
|
||||
docs: Iterable[Tuple[Doc, Any]],
|
||||
batch_size: int = ...,
|
||||
return_matches: bool = ...,
|
||||
as_tuples: bool = ...,
|
||||
) -> Union[
|
||||
Iterator[Tuple[Tuple[Doc, Any], Any]], Iterator[Tuple[Doc, Any]], Iterator[Doc]
|
||||
]: ...
|
||||
def __call__(
|
||||
self,
|
||||
doclike: Union[Doc, Span],
|
||||
*,
|
||||
as_spans: bool = ...,
|
||||
allow_missing: bool = ...,
|
||||
with_alignments: bool = ...
|
||||
) -> Union[List[Tuple[int, int, int]], List[Span]]: ...
|
|
@ -276,7 +276,7 @@ class AttributeRuler(Pipe):
|
|||
DOCS: https://spacy.io/api/attributeruler#to_bytes
|
||||
"""
|
||||
serialize = {}
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
|
||||
serialize["patterns"] = lambda: srsly.msgpack_dumps(self.patterns)
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
|
@ -296,7 +296,7 @@ class AttributeRuler(Pipe):
|
|||
self.add_patterns(srsly.msgpack_loads(b))
|
||||
|
||||
deserialize = {
|
||||
"vocab": lambda b: self.vocab.from_bytes(b),
|
||||
"vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
|
||||
"patterns": load_patterns,
|
||||
}
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
|
@ -313,7 +313,7 @@ class AttributeRuler(Pipe):
|
|||
DOCS: https://spacy.io/api/attributeruler#to_disk
|
||||
"""
|
||||
serialize = {
|
||||
"vocab": lambda p: self.vocab.to_disk(p),
|
||||
"vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
|
||||
"patterns": lambda p: srsly.write_msgpack(p, self.patterns),
|
||||
}
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
@ -334,7 +334,7 @@ class AttributeRuler(Pipe):
|
|||
self.add_patterns(srsly.read_msgpack(p))
|
||||
|
||||
deserialize = {
|
||||
"vocab": lambda p: self.vocab.from_disk(p),
|
||||
"vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
|
||||
"patterns": load_patterns,
|
||||
}
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
|
|
|
@ -412,7 +412,7 @@ class EntityLinker(TrainablePipe):
|
|||
serialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
|
||||
serialize["kb"] = self.kb.to_bytes
|
||||
serialize["model"] = self.model.to_bytes
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
@ -436,7 +436,7 @@ class EntityLinker(TrainablePipe):
|
|||
deserialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
|
||||
deserialize["kb"] = lambda b: self.kb.from_bytes(b)
|
||||
deserialize["model"] = load_model
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
|
@ -453,7 +453,7 @@ class EntityLinker(TrainablePipe):
|
|||
DOCS: https://spacy.io/api/entitylinker#to_disk
|
||||
"""
|
||||
serialize = {}
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
|
||||
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
|
||||
serialize["kb"] = lambda p: self.kb.to_disk(p)
|
||||
serialize["model"] = lambda p: self.model.to_disk(p)
|
||||
|
@ -480,6 +480,7 @@ class EntityLinker(TrainablePipe):
|
|||
|
||||
deserialize = {}
|
||||
deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
|
||||
deserialize["kb"] = lambda p: self.kb.from_disk(p)
|
||||
deserialize["model"] = load_model
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
|
|
|
@ -269,7 +269,7 @@ class Lemmatizer(Pipe):
|
|||
DOCS: https://spacy.io/api/lemmatizer#to_disk
|
||||
"""
|
||||
serialize = {}
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
|
||||
serialize["lookups"] = lambda p: self.lookups.to_disk(p)
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
||||
|
@ -285,7 +285,7 @@ class Lemmatizer(Pipe):
|
|||
DOCS: https://spacy.io/api/lemmatizer#from_disk
|
||||
"""
|
||||
deserialize = {}
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
|
||||
deserialize["lookups"] = lambda p: self.lookups.from_disk(p)
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
self._validate_tables()
|
||||
|
@ -300,7 +300,7 @@ class Lemmatizer(Pipe):
|
|||
DOCS: https://spacy.io/api/lemmatizer#to_bytes
|
||||
"""
|
||||
serialize = {}
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
|
||||
serialize["lookups"] = self.lookups.to_bytes
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
|
@ -316,7 +316,7 @@ class Lemmatizer(Pipe):
|
|||
DOCS: https://spacy.io/api/lemmatizer#from_bytes
|
||||
"""
|
||||
deserialize = {}
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
|
||||
deserialize["lookups"] = lambda b: self.lookups.from_bytes(b)
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
self._validate_tables()
|
||||
|
|
|
@ -386,9 +386,7 @@ class SpanCategorizer(TrainablePipe):
|
|||
kwargs = dict(kwargs)
|
||||
attr_prefix = "spans_"
|
||||
kwargs.setdefault("attr", f"{attr_prefix}{self.key}")
|
||||
kwargs.setdefault("labels", self.labels)
|
||||
kwargs.setdefault("multi_label", True)
|
||||
kwargs.setdefault("threshold", self.cfg["threshold"])
|
||||
kwargs.setdefault("allow_overlap", True)
|
||||
kwargs.setdefault(
|
||||
"getter", lambda doc, key: doc.spans.get(key[len(attr_prefix) :], [])
|
||||
)
|
||||
|
@ -400,7 +398,7 @@ class SpanCategorizer(TrainablePipe):
|
|||
pass
|
||||
|
||||
def _get_aligned_spans(self, eg: Example):
|
||||
return eg.get_aligned_spans_y2x(eg.reference.spans.get(self.key, []))
|
||||
return eg.get_aligned_spans_y2x(eg.reference.spans.get(self.key, []), allow_overlap=True)
|
||||
|
||||
def _make_span_group(
|
||||
self, doc: Doc, indices: Ints2d, scores: Floats2d, labels: List[str]
|
||||
|
@ -408,16 +406,24 @@ class SpanCategorizer(TrainablePipe):
|
|||
spans = SpanGroup(doc, name=self.key)
|
||||
max_positive = self.cfg["max_positive"]
|
||||
threshold = self.cfg["threshold"]
|
||||
|
||||
keeps = scores >= threshold
|
||||
ranked = (scores * -1).argsort()
|
||||
if max_positive is not None:
|
||||
filter = ranked[:, max_positive:]
|
||||
for i, row in enumerate(filter):
|
||||
keeps[i, row] = False
|
||||
spans.attrs["scores"] = scores[keeps].flatten()
|
||||
|
||||
indices = self.model.ops.to_numpy(indices)
|
||||
keeps = self.model.ops.to_numpy(keeps)
|
||||
|
||||
for i in range(indices.shape[0]):
|
||||
start = int(indices[i, 0])
|
||||
end = int(indices[i, 1])
|
||||
positives = []
|
||||
for j, score in enumerate(scores[i]):
|
||||
if score >= threshold:
|
||||
positives.append((score, start, end, labels[j]))
|
||||
positives.sort(reverse=True)
|
||||
if max_positive:
|
||||
positives = positives[:max_positive]
|
||||
for score, start, end, label in positives:
|
||||
spans.append(Span(doc, start, end, label=label))
|
||||
start = indices[i, 0]
|
||||
end = indices[i, 1]
|
||||
|
||||
for j, keep in enumerate(keeps[i]):
|
||||
if keep:
|
||||
spans.append(Span(doc, start, end, label=labels[j]))
|
||||
|
||||
return spans
|
||||
|
|
|
@ -273,7 +273,7 @@ cdef class TrainablePipe(Pipe):
|
|||
serialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
|
||||
serialize["model"] = self.model.to_bytes
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
|
@ -296,7 +296,7 @@ cdef class TrainablePipe(Pipe):
|
|||
deserialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
|
||||
deserialize["model"] = load_model
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
return self
|
||||
|
@ -313,7 +313,7 @@ cdef class TrainablePipe(Pipe):
|
|||
serialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
|
||||
serialize["model"] = lambda p: self.model.to_disk(p)
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
||||
|
@ -338,7 +338,7 @@ cdef class TrainablePipe(Pipe):
|
|||
deserialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
|
||||
deserialize["model"] = load_model
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
return self
|
||||
|
|
|
@ -569,7 +569,7 @@ cdef class Parser(TrainablePipe):
|
|||
def to_disk(self, path, exclude=tuple()):
|
||||
serializers = {
|
||||
"model": lambda p: (self.model.to_disk(p) if self.model is not True else True),
|
||||
"vocab": lambda p: self.vocab.to_disk(p),
|
||||
"vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
|
||||
"moves": lambda p: self.moves.to_disk(p, exclude=["strings"]),
|
||||
"cfg": lambda p: srsly.write_json(p, self.cfg)
|
||||
}
|
||||
|
@ -577,7 +577,7 @@ cdef class Parser(TrainablePipe):
|
|||
|
||||
def from_disk(self, path, exclude=tuple()):
|
||||
deserializers = {
|
||||
"vocab": lambda p: self.vocab.from_disk(p),
|
||||
"vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
|
||||
"moves": lambda p: self.moves.from_disk(p, exclude=["strings"]),
|
||||
"cfg": lambda p: self.cfg.update(srsly.read_json(p)),
|
||||
"model": lambda p: None,
|
||||
|
@ -597,7 +597,7 @@ cdef class Parser(TrainablePipe):
|
|||
def to_bytes(self, exclude=tuple()):
|
||||
serializers = {
|
||||
"model": lambda: (self.model.to_bytes()),
|
||||
"vocab": lambda: self.vocab.to_bytes(),
|
||||
"vocab": lambda: self.vocab.to_bytes(exclude=exclude),
|
||||
"moves": lambda: self.moves.to_bytes(exclude=["strings"]),
|
||||
"cfg": lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True)
|
||||
}
|
||||
|
@ -605,7 +605,7 @@ cdef class Parser(TrainablePipe):
|
|||
|
||||
def from_bytes(self, bytes_data, exclude=tuple()):
|
||||
deserializers = {
|
||||
"vocab": lambda b: self.vocab.from_bytes(b),
|
||||
"vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
|
||||
"moves": lambda b: self.moves.from_bytes(b, exclude=["strings"]),
|
||||
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
|
||||
"model": lambda b: None,
|
||||
|
|
22
spacy/strings.pyi
Normal file
|
@ -0,0 +1,22 @@
|
|||
from typing import Optional, Iterable, Iterator, Union, Any
|
||||
from pathlib import Path
|
||||
|
||||
def get_string_id(key: str) -> int: ...
|
||||
|
||||
class StringStore:
|
||||
def __init__(
|
||||
self, strings: Optional[Iterable[str]] = ..., freeze: bool = ...
|
||||
) -> None: ...
|
||||
def __getitem__(self, string_or_id: Union[bytes, str, int]) -> Union[str, int]: ...
|
||||
def as_int(self, key: Union[bytes, str, int]) -> int: ...
|
||||
def as_string(self, key: Union[bytes, str, int]) -> str: ...
|
||||
def add(self, string: str) -> int: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __contains__(self, string: str) -> bool: ...
|
||||
def __iter__(self) -> Iterator[str]: ...
|
||||
def __reduce__(self) -> Any: ...
|
||||
def to_disk(self, path: Union[str, Path]) -> None: ...
|
||||
def from_disk(self, path: Union[str, Path]) -> StringStore: ...
|
||||
def to_bytes(self, **kwargs: Any) -> bytes: ...
|
||||
def from_bytes(self, bytes_data: bytes, **kwargs: Any) -> StringStore: ...
|
||||
def _reset_and_load(self, strings: Iterable[str]) -> None: ...
|
|
@ -357,6 +357,9 @@ def test_span_eq_hash(doc, doc_not_parsed):
|
|||
assert hash(doc[0:2]) != hash(doc[1:3])
|
||||
assert hash(doc[0:2]) != hash(doc_not_parsed[0:2])
|
||||
|
||||
# check that an out-of-bounds is not equivalent to the span of the full doc
|
||||
assert doc[0 : len(doc)] != doc[len(doc) : len(doc) + 1]
|
||||
|
||||
|
||||
def test_span_boundaries(doc):
|
||||
start = 1
|
||||
|
@ -369,6 +372,33 @@ def test_span_boundaries(doc):
|
|||
with pytest.raises(IndexError):
|
||||
span[5]
|
||||
|
||||
empty_span_0 = doc[0:0]
|
||||
assert empty_span_0.text == ""
|
||||
assert empty_span_0.start == 0
|
||||
assert empty_span_0.end == 0
|
||||
assert empty_span_0.start_char == 0
|
||||
assert empty_span_0.end_char == 0
|
||||
|
||||
empty_span_1 = doc[1:1]
|
||||
assert empty_span_1.text == ""
|
||||
assert empty_span_1.start == 1
|
||||
assert empty_span_1.end == 1
|
||||
assert empty_span_1.start_char == empty_span_1.end_char
|
||||
|
||||
oob_span_start = doc[-len(doc) - 1 : -len(doc) - 10]
|
||||
assert oob_span_start.text == ""
|
||||
assert oob_span_start.start == 0
|
||||
assert oob_span_start.end == 0
|
||||
assert oob_span_start.start_char == 0
|
||||
assert oob_span_start.end_char == 0
|
||||
|
||||
oob_span_end = doc[len(doc) + 1 : len(doc) + 10]
|
||||
assert oob_span_end.text == ""
|
||||
assert oob_span_end.start == len(doc)
|
||||
assert oob_span_end.end == len(doc)
|
||||
assert oob_span_end.start_char == len(doc.text)
|
||||
assert oob_span_end.end_char == len(doc.text)
|
||||
|
||||
|
||||
def test_span_lemma(doc):
|
||||
# span lemmas should have the same number of spaces as the span
|
||||
|
|
|
@ -1,9 +1,17 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from spacy.language import Language
|
||||
from spacy.training import Example
|
||||
from spacy.util import fix_random_seed, registry
|
||||
import numpy
|
||||
from numpy.testing import assert_array_equal, assert_almost_equal
|
||||
from thinc.api import get_current_ops
|
||||
|
||||
from spacy import util
|
||||
from spacy.lang.en import English
|
||||
from spacy.language import Language
|
||||
from spacy.tokens.doc import SpanGroups
|
||||
from spacy.tokens import SpanGroup
|
||||
from spacy.training import Example
|
||||
from spacy.util import fix_random_seed, registry, make_tempdir
|
||||
|
||||
OPS = get_current_ops()
|
||||
|
||||
SPAN_KEY = "labeled_spans"
|
||||
|
||||
|
@ -15,17 +23,21 @@ TRAIN_DATA = [
|
|||
),
|
||||
]
|
||||
|
||||
TRAIN_DATA_OVERLAPPING = [
|
||||
("Who is Shaka Khan?", {"spans": {SPAN_KEY: [(7, 17, "PERSON")]}}),
|
||||
(
|
||||
"I like London and Berlin",
|
||||
{"spans": {SPAN_KEY: [(7, 13, "LOC"), (18, 24, "LOC"), (7, 24, "DOUBLE_LOC")]}},
|
||||
),
|
||||
]
|
||||
|
||||
def make_get_examples(nlp):
|
||||
|
||||
def make_examples(nlp, data=TRAIN_DATA):
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
for t in data:
|
||||
eg = Example.from_dict(nlp.make_doc(t[0]), t[1])
|
||||
train_examples.append(eg)
|
||||
|
||||
def get_examples():
|
||||
return train_examples
|
||||
|
||||
return get_examples
|
||||
return train_examples
|
||||
|
||||
|
||||
def test_no_label():
|
||||
|
@ -52,9 +64,7 @@ def test_implicit_labels():
|
|||
nlp = Language()
|
||||
spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
|
||||
assert len(spancat.labels) == 0
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
train_examples = make_examples(nlp)
|
||||
nlp.initialize(get_examples=lambda: train_examples)
|
||||
assert spancat.labels == ("PERSON", "LOC")
|
||||
|
||||
|
@ -69,24 +79,70 @@ def test_explicit_labels():
|
|||
assert spancat.labels == ("PERSON", "LOC")
|
||||
|
||||
|
||||
def test_simple_train():
|
||||
fix_random_seed(0)
|
||||
def test_doc_gc():
|
||||
# If the Doc object is garbage collected, the spans won't be functional afterwards
|
||||
nlp = Language()
|
||||
spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
|
||||
get_examples = make_get_examples(nlp)
|
||||
nlp.initialize(get_examples)
|
||||
sgd = nlp.create_optimizer()
|
||||
assert len(spancat.labels) != 0
|
||||
for i in range(40):
|
||||
losses = {}
|
||||
nlp.update(list(get_examples()), losses=losses, drop=0.1, sgd=sgd)
|
||||
doc = nlp("I like London and Berlin.")
|
||||
assert doc.spans[spancat.key] == doc.spans[SPAN_KEY]
|
||||
assert len(doc.spans[spancat.key]) == 2
|
||||
assert doc.spans[spancat.key][0].text == "London"
|
||||
scores = nlp.evaluate(get_examples())
|
||||
assert f"spans_{SPAN_KEY}_f" in scores
|
||||
assert scores[f"spans_{SPAN_KEY}_f"] == 1.0
|
||||
spancat.add_label("PERSON")
|
||||
nlp.initialize()
|
||||
texts = ["Just a sentence.", "I like London and Berlin", "I like Berlin", "I eat ham."]
|
||||
all_spans = [doc.spans for doc in nlp.pipe(texts)]
|
||||
for text, spangroups in zip(texts, all_spans):
|
||||
assert isinstance(spangroups, SpanGroups)
|
||||
for key, spangroup in spangroups.items():
|
||||
assert isinstance(spangroup, SpanGroup)
|
||||
assert len(spangroup) > 0
|
||||
with pytest.raises(RuntimeError):
|
||||
span = spangroup[0]
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"max_positive,nr_results", [(None, 4), (1, 2), (2, 3), (3, 4), (4, 4)]
|
||||
)
|
||||
def test_make_spangroup(max_positive, nr_results):
|
||||
fix_random_seed(0)
|
||||
nlp = Language()
|
||||
spancat = nlp.add_pipe(
|
||||
"spancat",
|
||||
config={"spans_key": SPAN_KEY, "threshold": 0.5, "max_positive": max_positive},
|
||||
)
|
||||
doc = nlp.make_doc("Greater London")
|
||||
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1, 2])
|
||||
indices = ngram_suggester([doc])[0].dataXd
|
||||
assert_array_equal(indices, numpy.asarray([[0, 1], [1, 2], [0, 2]]))
|
||||
labels = ["Thing", "City", "Person", "GreatCity"]
|
||||
scores = numpy.asarray(
|
||||
[[0.2, 0.4, 0.3, 0.1], [0.1, 0.6, 0.2, 0.4], [0.8, 0.7, 0.3, 0.9]], dtype="f"
|
||||
)
|
||||
spangroup = spancat._make_span_group(doc, indices, scores, labels)
|
||||
assert len(spangroup) == nr_results
|
||||
|
||||
# first span is always the second token "London"
|
||||
assert spangroup[0].text == "London"
|
||||
assert spangroup[0].label_ == "City"
|
||||
assert_almost_equal(0.6, spangroup.attrs["scores"][0], 5)
|
||||
|
||||
# second span depends on the number of positives that were allowed
|
||||
assert spangroup[1].text == "Greater London"
|
||||
if max_positive == 1:
|
||||
assert spangroup[1].label_ == "GreatCity"
|
||||
assert_almost_equal(0.9, spangroup.attrs["scores"][1], 5)
|
||||
else:
|
||||
assert spangroup[1].label_ == "Thing"
|
||||
assert_almost_equal(0.8, spangroup.attrs["scores"][1], 5)
|
||||
|
||||
if nr_results > 2:
|
||||
assert spangroup[2].text == "Greater London"
|
||||
if max_positive == 2:
|
||||
assert spangroup[2].label_ == "GreatCity"
|
||||
assert_almost_equal(0.9, spangroup.attrs["scores"][2], 5)
|
||||
else:
|
||||
assert spangroup[2].label_ == "City"
|
||||
assert_almost_equal(0.7, spangroup.attrs["scores"][2], 5)
|
||||
|
||||
assert spangroup[-1].text == "Greater London"
|
||||
assert spangroup[-1].label_ == "GreatCity"
|
||||
assert_almost_equal(0.9, spangroup.attrs["scores"][-1], 5)
|
||||
|
||||
|
||||
def test_ngram_suggester(en_tokenizer):
|
||||
|
@ -116,12 +172,15 @@ def test_ngram_suggester(en_tokenizer):
|
|||
for span in spans:
|
||||
assert 0 <= span[0] < len(doc)
|
||||
assert 0 < span[1] <= len(doc)
|
||||
spans_set.add((span[0], span[1]))
|
||||
spans_set.add((int(span[0]), int(span[1])))
|
||||
# spans are unique
|
||||
assert spans.shape[0] == len(spans_set)
|
||||
offset += ngrams.lengths[i]
|
||||
# the number of spans is correct
|
||||
assert_equal(ngrams.lengths, [max(0, len(doc) - (size - 1)) for doc in docs])
|
||||
assert_array_equal(
|
||||
OPS.to_numpy(ngrams.lengths),
|
||||
[max(0, len(doc) - (size - 1)) for doc in docs],
|
||||
)
|
||||
|
||||
# test 1-3-gram suggestions
|
||||
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1, 2, 3])
|
||||
|
@ -129,9 +188,9 @@ def test_ngram_suggester(en_tokenizer):
|
|||
en_tokenizer(text) for text in ["a", "a b", "a b c", "a b c d", "a b c d e"]
|
||||
]
|
||||
ngrams = ngram_suggester(docs)
|
||||
assert_equal(ngrams.lengths, [1, 3, 6, 9, 12])
|
||||
assert_equal(
|
||||
ngrams.data,
|
||||
assert_array_equal(OPS.to_numpy(ngrams.lengths), [1, 3, 6, 9, 12])
|
||||
assert_array_equal(
|
||||
OPS.to_numpy(ngrams.data),
|
||||
[
|
||||
# doc 0
|
||||
[0, 1],
|
||||
|
@ -176,13 +235,13 @@ def test_ngram_suggester(en_tokenizer):
|
|||
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1])
|
||||
docs = [en_tokenizer(text) for text in ["", "a", ""]]
|
||||
ngrams = ngram_suggester(docs)
|
||||
assert_equal(ngrams.lengths, [len(doc) for doc in docs])
|
||||
assert_array_equal(OPS.to_numpy(ngrams.lengths), [len(doc) for doc in docs])
|
||||
|
||||
# test all empty docs
|
||||
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1])
|
||||
docs = [en_tokenizer(text) for text in ["", "", ""]]
|
||||
ngrams = ngram_suggester(docs)
|
||||
assert_equal(ngrams.lengths, [len(doc) for doc in docs])
|
||||
assert_array_equal(OPS.to_numpy(ngrams.lengths), [len(doc) for doc in docs])
|
||||
|
||||
|
||||
def test_ngram_sizes(en_tokenizer):
|
||||
|
@ -195,12 +254,101 @@ def test_ngram_sizes(en_tokenizer):
|
|||
]
|
||||
ngrams_1 = size_suggester(docs)
|
||||
ngrams_2 = range_suggester(docs)
|
||||
assert_equal(ngrams_1.lengths, [1, 3, 6, 9, 12])
|
||||
assert_equal(ngrams_1.lengths, ngrams_2.lengths)
|
||||
assert_equal(ngrams_1.data, ngrams_2.data)
|
||||
assert_array_equal(OPS.to_numpy(ngrams_1.lengths), [1, 3, 6, 9, 12])
|
||||
assert_array_equal(OPS.to_numpy(ngrams_1.lengths), OPS.to_numpy(ngrams_2.lengths))
|
||||
assert_array_equal(OPS.to_numpy(ngrams_1.data), OPS.to_numpy(ngrams_2.data))
|
||||
|
||||
# one more variation
|
||||
suggester_factory = registry.misc.get("spacy.ngram_range_suggester.v1")
|
||||
range_suggester = suggester_factory(min_size=2, max_size=4)
|
||||
ngrams_3 = range_suggester(docs)
|
||||
assert_equal(ngrams_3.lengths, [0, 1, 3, 6, 9])
|
||||
assert_array_equal(OPS.to_numpy(ngrams_3.lengths), [0, 1, 3, 6, 9])
|
||||
|
||||
|
||||
def test_overfitting_IO():
|
||||
# Simple test to try and quickly overfit the spancat component - ensuring the ML models work correctly
|
||||
fix_random_seed(0)
|
||||
nlp = English()
|
||||
spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
|
||||
train_examples = make_examples(nlp)
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
assert spancat.model.get_dim("nO") == 2
|
||||
assert set(spancat.labels) == {"LOC", "PERSON"}
|
||||
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["spancat"] < 0.01
|
||||
|
||||
# test the trained model
|
||||
test_text = "I like London and Berlin"
|
||||
doc = nlp(test_text)
|
||||
assert doc.spans[spancat.key] == doc.spans[SPAN_KEY]
|
||||
spans = doc.spans[SPAN_KEY]
|
||||
assert len(spans) == 2
|
||||
assert len(spans.attrs["scores"]) == 2
|
||||
assert min(spans.attrs["scores"]) > 0.9
|
||||
assert set([span.text for span in spans]) == {"London", "Berlin"}
|
||||
assert set([span.label_ for span in spans]) == {"LOC"}
|
||||
|
||||
# Also test the results are still the same after IO
|
||||
with make_tempdir() as tmp_dir:
|
||||
nlp.to_disk(tmp_dir)
|
||||
nlp2 = util.load_model_from_path(tmp_dir)
|
||||
doc2 = nlp2(test_text)
|
||||
spans2 = doc2.spans[SPAN_KEY]
|
||||
assert len(spans2) == 2
|
||||
assert len(spans2.attrs["scores"]) == 2
|
||||
assert min(spans2.attrs["scores"]) > 0.9
|
||||
assert set([span.text for span in spans2]) == {"London", "Berlin"}
|
||||
assert set([span.label_ for span in spans2]) == {"LOC"}
|
||||
|
||||
# Test scoring
|
||||
scores = nlp.evaluate(train_examples)
|
||||
assert f"spans_{SPAN_KEY}_f" in scores
|
||||
assert scores[f"spans_{SPAN_KEY}_p"] == 1.0
|
||||
assert scores[f"spans_{SPAN_KEY}_r"] == 1.0
|
||||
assert scores[f"spans_{SPAN_KEY}_f"] == 1.0
|
||||
|
||||
# also test that the spancat works for just a single entity in a sentence
|
||||
doc = nlp("London")
|
||||
assert len(doc.spans[spancat.key]) == 1
|
||||
|
||||
|
||||
def test_overfitting_IO_overlapping():
|
||||
# Test for overfitting on overlapping entities
|
||||
fix_random_seed(0)
|
||||
nlp = English()
|
||||
spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
|
||||
|
||||
train_examples = make_examples(nlp, data=TRAIN_DATA_OVERLAPPING)
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
assert spancat.model.get_dim("nO") == 3
|
||||
assert set(spancat.labels) == {"PERSON", "LOC", "DOUBLE_LOC"}
|
||||
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["spancat"] < 0.01
|
||||
|
||||
# test the trained model
|
||||
test_text = "I like London and Berlin"
|
||||
doc = nlp(test_text)
|
||||
spans = doc.spans[SPAN_KEY]
|
||||
assert len(spans) == 3
|
||||
assert len(spans.attrs["scores"]) == 3
|
||||
assert min(spans.attrs["scores"]) > 0.9
|
||||
assert set([span.text for span in spans]) == {"London", "Berlin", "London and Berlin"}
|
||||
assert set([span.label_ for span in spans]) == {"LOC", "DOUBLE_LOC"}
|
||||
|
||||
# Also test the results are still the same after IO
|
||||
with make_tempdir() as tmp_dir:
|
||||
nlp.to_disk(tmp_dir)
|
||||
nlp2 = util.load_model_from_path(tmp_dir)
|
||||
doc2 = nlp2(test_text)
|
||||
spans2 = doc2.spans[SPAN_KEY]
|
||||
assert len(spans2) == 3
|
||||
assert len(spans2.attrs["scores"]) == 3
|
||||
assert min(spans2.attrs["scores"]) > 0.9
|
||||
assert set([span.text for span in spans2]) == {"London", "Berlin", "London and Berlin"}
|
||||
assert set([span.label_ for span in spans2]) == {"LOC", "DOUBLE_LOC"}
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
import pytest
|
||||
from spacy import registry, Vocab
|
||||
from spacy import registry, Vocab, load
|
||||
from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer
|
||||
from spacy.pipeline import TextCategorizer, SentenceRecognizer, TrainablePipe
|
||||
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
|
||||
|
@ -268,3 +268,21 @@ def test_serialize_custom_trainable_pipe():
|
|||
pipe.to_disk(d)
|
||||
new_pipe = CustomPipe(Vocab(), Linear()).from_disk(d)
|
||||
assert new_pipe.to_bytes() == pipe_bytes
|
||||
|
||||
|
||||
def test_load_without_strings():
|
||||
nlp = spacy.blank("en")
|
||||
orig_strings_length = len(nlp.vocab.strings)
|
||||
word = "unlikely_word_" * 20
|
||||
nlp.vocab.strings.add(word)
|
||||
assert len(nlp.vocab.strings) == orig_strings_length + 1
|
||||
with make_tempdir() as d:
|
||||
nlp.to_disk(d)
|
||||
# reload with strings
|
||||
reloaded_nlp = load(d)
|
||||
assert len(nlp.vocab.strings) == len(reloaded_nlp.vocab.strings)
|
||||
assert word in reloaded_nlp.vocab.strings
|
||||
# reload without strings
|
||||
reloaded_nlp = load(d, exclude=["strings"])
|
||||
assert orig_strings_length == len(reloaded_nlp.vocab.strings)
|
||||
assert word not in reloaded_nlp.vocab.strings
|
||||
|
|
|
@ -14,6 +14,7 @@ from spacy import about
|
|||
from spacy.util import get_minor_version
|
||||
from spacy.cli.validate import get_model_pkgs
|
||||
from spacy.cli.download import get_compatibility, get_version
|
||||
from spacy.cli.package import get_third_party_dependencies
|
||||
from thinc.api import ConfigValidationError, Config
|
||||
import srsly
|
||||
import os
|
||||
|
@ -532,3 +533,10 @@ def test_init_labels(component_name):
|
|||
assert len(nlp2.get_pipe(component_name).labels) == 0
|
||||
nlp2.initialize()
|
||||
assert len(nlp2.get_pipe(component_name).labels) == 4
|
||||
|
||||
|
||||
def test_get_third_party_dependencies_runs():
|
||||
# We can't easily test the detection of third-party packages here, but we
|
||||
# can at least make sure that the function and its importlib magic runs.
|
||||
nlp = Dutch()
|
||||
assert get_third_party_dependencies(nlp.config) == []
|
||||
|
|
|
@ -765,7 +765,7 @@ cdef class Tokenizer:
|
|||
DOCS: https://spacy.io/api/tokenizer#to_bytes
|
||||
"""
|
||||
serializers = {
|
||||
"vocab": lambda: self.vocab.to_bytes(),
|
||||
"vocab": lambda: self.vocab.to_bytes(exclude=exclude),
|
||||
"prefix_search": lambda: _get_regex_pattern(self.prefix_search),
|
||||
"suffix_search": lambda: _get_regex_pattern(self.suffix_search),
|
||||
"infix_finditer": lambda: _get_regex_pattern(self.infix_finditer),
|
||||
|
@ -786,7 +786,7 @@ cdef class Tokenizer:
|
|||
"""
|
||||
data = {}
|
||||
deserializers = {
|
||||
"vocab": lambda b: self.vocab.from_bytes(b),
|
||||
"vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
|
||||
"prefix_search": lambda b: data.setdefault("prefix_search", b),
|
||||
"suffix_search": lambda b: data.setdefault("suffix_search", b),
|
||||
"infix_finditer": lambda b: data.setdefault("infix_finditer", b),
|
||||
|
|
17
spacy/tokens/_retokenize.pyi
Normal file
|
@ -0,0 +1,17 @@
|
|||
from typing import Dict, Any, Union, List, Tuple
|
||||
from .doc import Doc
|
||||
from .span import Span
|
||||
from .token import Token
|
||||
|
||||
class Retokenizer:
|
||||
def __init__(self, doc: Doc) -> None: ...
|
||||
def merge(self, span: Span, attrs: Dict[Union[str, int], Any] = ...) -> None: ...
|
||||
def split(
|
||||
self,
|
||||
token: Token,
|
||||
orths: List[str],
|
||||
heads: List[Union[Token, Tuple[Token, int]]],
|
||||
attrs: Dict[Union[str, int], List[Any]] = ...,
|
||||
) -> None: ...
|
||||
def __enter__(self) -> Retokenizer: ...
|
||||
def __exit__(self, *args: Any) -> None: ...
|
180
spacy/tokens/doc.pyi
Normal file
|
@ -0,0 +1,180 @@
|
|||
from typing import (
|
||||
Callable,
|
||||
Protocol,
|
||||
Iterable,
|
||||
Iterator,
|
||||
Optional,
|
||||
Union,
|
||||
Tuple,
|
||||
List,
|
||||
Dict,
|
||||
Any,
|
||||
overload,
|
||||
)
|
||||
from cymem.cymem import Pool
|
||||
from thinc.types import Floats1d, Floats2d, Ints2d
|
||||
from .span import Span
|
||||
from .token import Token
|
||||
from ._dict_proxies import SpanGroups
|
||||
from ._retokenize import Retokenizer
|
||||
from ..lexeme import Lexeme
|
||||
from ..vocab import Vocab
|
||||
from .underscore import Underscore
|
||||
from pathlib import Path
|
||||
import numpy
|
||||
|
||||
class DocMethod(Protocol):
|
||||
def __call__(self: Doc, *args: Any, **kwargs: Any) -> Any: ...
|
||||
|
||||
class Doc:
|
||||
vocab: Vocab
|
||||
mem: Pool
|
||||
spans: SpanGroups
|
||||
max_length: int
|
||||
length: int
|
||||
sentiment: float
|
||||
cats: Dict[str, float]
|
||||
user_hooks: Dict[str, Callable[..., Any]]
|
||||
user_token_hooks: Dict[str, Callable[..., Any]]
|
||||
user_span_hooks: Dict[str, Callable[..., Any]]
|
||||
tensor: numpy.ndarray
|
||||
user_data: Dict[str, Any]
|
||||
has_unknown_spaces: bool
|
||||
@classmethod
|
||||
def set_extension(
|
||||
cls,
|
||||
name: str,
|
||||
default: Optional[Any] = ...,
|
||||
getter: Optional[Callable[[Doc], Any]] = ...,
|
||||
setter: Optional[Callable[[Doc, Any], None]] = ...,
|
||||
method: Optional[DocMethod] = ...,
|
||||
force: bool = ...,
|
||||
) -> None: ...
|
||||
@classmethod
|
||||
def get_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[DocMethod],
|
||||
Optional[Callable[[Doc], Any]],
|
||||
Optional[Callable[[Doc, Any], None]],
|
||||
]: ...
|
||||
@classmethod
|
||||
def has_extension(cls, name: str) -> bool: ...
|
||||
@classmethod
|
||||
def remove_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[DocMethod],
|
||||
Optional[Callable[[Doc], Any]],
|
||||
Optional[Callable[[Doc, Any], None]],
|
||||
]: ...
|
||||
def __init__(
|
||||
self,
|
||||
vocab: Vocab,
|
||||
words: Optional[List[str]] = ...,
|
||||
spaces: Optional[List[bool]] = ...,
|
||||
user_data: Optional[Dict[Any, Any]] = ...,
|
||||
tags: Optional[List[str]] = ...,
|
||||
pos: Optional[List[str]] = ...,
|
||||
morphs: Optional[List[str]] = ...,
|
||||
lemmas: Optional[List[str]] = ...,
|
||||
heads: Optional[List[int]] = ...,
|
||||
deps: Optional[List[str]] = ...,
|
||||
sent_starts: Optional[List[Union[bool, None]]] = ...,
|
||||
ents: Optional[List[str]] = ...,
|
||||
) -> None: ...
|
||||
@property
|
||||
def _(self) -> Underscore: ...
|
||||
@property
|
||||
def is_tagged(self) -> bool: ...
|
||||
@property
|
||||
def is_parsed(self) -> bool: ...
|
||||
@property
|
||||
def is_nered(self) -> bool: ...
|
||||
@property
|
||||
def is_sentenced(self) -> bool: ...
|
||||
def has_annotation(
|
||||
self, attr: Union[int, str], *, require_complete: bool = ...
|
||||
) -> bool: ...
|
||||
@overload
|
||||
def __getitem__(self, i: int) -> Token: ...
|
||||
@overload
|
||||
def __getitem__(self, i: slice) -> Span: ...
|
||||
def __iter__(self) -> Iterator[Token]: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __unicode__(self) -> str: ...
|
||||
def __bytes__(self) -> bytes: ...
|
||||
def __str__(self) -> str: ...
|
||||
def __repr__(self) -> str: ...
|
||||
@property
|
||||
def doc(self) -> Doc: ...
|
||||
def char_span(
|
||||
self,
|
||||
start_idx: int,
|
||||
end_idx: int,
|
||||
label: Union[int, str] = ...,
|
||||
kb_id: Union[int, str] = ...,
|
||||
vector: Optional[Floats1d] = ...,
|
||||
alignment_mode: str = ...,
|
||||
) -> Span: ...
|
||||
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
|
||||
@property
|
||||
def has_vector(self) -> bool: ...
|
||||
vector: Floats1d
|
||||
vector_norm: float
|
||||
@property
|
||||
def text(self) -> str: ...
|
||||
@property
|
||||
def text_with_ws(self) -> str: ...
|
||||
ents: Tuple[Span]
|
||||
def set_ents(
|
||||
self,
|
||||
entities: List[Span],
|
||||
*,
|
||||
blocked: Optional[List[Span]] = ...,
|
||||
missing: Optional[List[Span]] = ...,
|
||||
outside: Optional[List[Span]] = ...,
|
||||
default: str = ...
|
||||
) -> None: ...
|
||||
@property
|
||||
def noun_chunks(self) -> Iterator[Span]: ...
|
||||
@property
|
||||
def sents(self) -> Iterator[Span]: ...
|
||||
@property
|
||||
def lang(self) -> int: ...
|
||||
@property
|
||||
def lang_(self) -> str: ...
|
||||
def count_by(
|
||||
self, attr_id: int, exclude: Optional[Any] = ..., counts: Optional[Any] = ...
|
||||
) -> Dict[Any, int]: ...
|
||||
def from_array(self, attrs: List[int], array: Ints2d) -> Doc: ...
|
||||
@staticmethod
|
||||
def from_docs(
|
||||
docs: List[Doc],
|
||||
ensure_whitespace: bool = ...,
|
||||
attrs: Optional[Union[Tuple[Union[str, int]], List[Union[int, str]]]] = ...,
|
||||
) -> Doc: ...
|
||||
def get_lca_matrix(self) -> Ints2d: ...
|
||||
def copy(self) -> Doc: ...
|
||||
def to_disk(
|
||||
self, path: Union[str, Path], *, exclude: Iterable[str] = ...
|
||||
) -> None: ...
|
||||
def from_disk(
|
||||
self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> Doc: ...
|
||||
def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
|
||||
def from_bytes(
|
||||
self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> Doc: ...
|
||||
def to_dict(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
|
||||
def from_dict(
|
||||
self, msg: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> Doc: ...
|
||||
def extend_tensor(self, tensor: Floats2d) -> None: ...
|
||||
def retokenize(self) -> Retokenizer: ...
|
||||
def to_json(self, underscore: Optional[List[str]] = ...) -> Dict[str, Any]: ...
|
||||
def to_utf8_array(self, nr_char: int = ...) -> Ints2d: ...
|
||||
@staticmethod
|
||||
def _get_array_attrs() -> Tuple[Any]: ...
|
20
spacy/tokens/morphanalysis.pyi
Normal file
|
@ -0,0 +1,20 @@
|
|||
from typing import Any, Dict, Iterator, List, Union
|
||||
from ..vocab import Vocab
|
||||
|
||||
class MorphAnalysis:
|
||||
def __init__(
|
||||
self, vocab: Vocab, features: Union[Dict[str, str], str] = ...
|
||||
) -> None: ...
|
||||
@classmethod
|
||||
def from_id(cls, vocab: Vocab, key: Any) -> MorphAnalysis: ...
|
||||
def __contains__(self, feature: str) -> bool: ...
|
||||
def __iter__(self) -> Iterator[str]: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __hash__(self) -> int: ...
|
||||
def __eq__(self, other: MorphAnalysis) -> bool: ...
|
||||
def __ne__(self, other: MorphAnalysis) -> bool: ...
|
||||
def get(self, field: Any) -> List[str]: ...
|
||||
def to_json(self) -> str: ...
|
||||
def to_dict(self) -> Dict[str, str]: ...
|
||||
def __str__(self) -> str: ...
|
||||
def __repr__(self) -> str: ...
|
124
spacy/tokens/span.pyi
Normal file
|
@ -0,0 +1,124 @@
|
|||
from typing import Callable, Protocol, Iterator, Optional, Union, Tuple, Any, overload
|
||||
from thinc.types import Floats1d, Ints2d, FloatsXd
|
||||
from .doc import Doc
|
||||
from .token import Token
|
||||
from .underscore import Underscore
|
||||
from ..lexeme import Lexeme
|
||||
from ..vocab import Vocab
|
||||
|
||||
class SpanMethod(Protocol):
|
||||
def __call__(self: Span, *args: Any, **kwargs: Any) -> Any: ...
|
||||
|
||||
class Span:
|
||||
@classmethod
|
||||
def set_extension(
|
||||
cls,
|
||||
name: str,
|
||||
default: Optional[Any] = ...,
|
||||
getter: Optional[Callable[[Span], Any]] = ...,
|
||||
setter: Optional[Callable[[Span, Any], None]] = ...,
|
||||
method: Optional[SpanMethod] = ...,
|
||||
force: bool = ...,
|
||||
) -> None: ...
|
||||
@classmethod
|
||||
def get_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[SpanMethod],
|
||||
Optional[Callable[[Span], Any]],
|
||||
Optional[Callable[[Span, Any], None]],
|
||||
]: ...
|
||||
@classmethod
|
||||
def has_extension(cls, name: str) -> bool: ...
|
||||
@classmethod
|
||||
def remove_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[SpanMethod],
|
||||
Optional[Callable[[Span], Any]],
|
||||
Optional[Callable[[Span, Any], None]],
|
||||
]: ...
|
||||
def __init__(
|
||||
self,
|
||||
doc: Doc,
|
||||
start: int,
|
||||
end: int,
|
||||
label: int = ...,
|
||||
vector: Optional[Floats1d] = ...,
|
||||
vector_norm: Optional[float] = ...,
|
||||
kb_id: Optional[int] = ...,
|
||||
) -> None: ...
|
||||
def __richcmp__(self, other: Span, op: int) -> bool: ...
|
||||
def __hash__(self) -> int: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __repr__(self) -> str: ...
|
||||
@overload
|
||||
def __getitem__(self, i: int) -> Token: ...
|
||||
@overload
|
||||
def __getitem__(self, i: slice) -> Span: ...
|
||||
def __iter__(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def _(self) -> Underscore: ...
|
||||
def as_doc(self, *, copy_user_data: bool = ...) -> Doc: ...
|
||||
def get_lca_matrix(self) -> Ints2d: ...
|
||||
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
|
||||
@property
|
||||
def vocab(self) -> Vocab: ...
|
||||
@property
|
||||
def sent(self) -> Span: ...
|
||||
@property
|
||||
def ents(self) -> Tuple[Span]: ...
|
||||
@property
|
||||
def has_vector(self) -> bool: ...
|
||||
@property
|
||||
def vector(self) -> Floats1d: ...
|
||||
@property
|
||||
def vector_norm(self) -> float: ...
|
||||
@property
|
||||
def tensor(self) -> FloatsXd: ...
|
||||
@property
|
||||
def sentiment(self) -> float: ...
|
||||
@property
|
||||
def text(self) -> str: ...
|
||||
@property
|
||||
def text_with_ws(self) -> str: ...
|
||||
@property
|
||||
def noun_chunks(self) -> Iterator[Span]: ...
|
||||
@property
|
||||
def root(self) -> Token: ...
|
||||
def char_span(
|
||||
self,
|
||||
start_idx: int,
|
||||
end_idx: int,
|
||||
label: int = ...,
|
||||
kb_id: int = ...,
|
||||
vector: Optional[Floats1d] = ...,
|
||||
) -> Span: ...
|
||||
@property
|
||||
def conjuncts(self) -> Tuple[Token]: ...
|
||||
@property
|
||||
def lefts(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def rights(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def n_lefts(self) -> int: ...
|
||||
@property
|
||||
def n_rights(self) -> int: ...
|
||||
@property
|
||||
def subtree(self) -> Iterator[Token]: ...
|
||||
start: int
|
||||
end: int
|
||||
start_char: int
|
||||
end_char: int
|
||||
label: int
|
||||
kb_id: int
|
||||
ent_id: int
|
||||
ent_id_: str
|
||||
@property
|
||||
def orth_(self) -> str: ...
|
||||
@property
|
||||
def lemma_(self) -> str: ...
|
||||
label_: str
|
||||
kb_id_: str
|
|
@ -105,13 +105,18 @@ cdef class Span:
|
|||
if label not in doc.vocab.strings:
|
||||
raise ValueError(Errors.E084.format(label=label))
|
||||
|
||||
start_char = doc[start].idx if start < doc.length else len(doc.text)
|
||||
if start == end:
|
||||
end_char = start_char
|
||||
else:
|
||||
end_char = doc[end - 1].idx + len(doc[end - 1])
|
||||
self.c = SpanC(
|
||||
label=label,
|
||||
kb_id=kb_id,
|
||||
start=start,
|
||||
end=end,
|
||||
start_char=doc[start].idx if start < doc.length else 0,
|
||||
end_char=doc[end - 1].idx + len(doc[end - 1]) if end >= 1 else 0,
|
||||
start_char=start_char,
|
||||
end_char=end_char,
|
||||
)
|
||||
self._vector = vector
|
||||
self._vector_norm = vector_norm
|
||||
|
@ -213,10 +218,12 @@ cdef class Span:
|
|||
return Underscore(Underscore.span_extensions, self,
|
||||
start=self.c.start_char, end=self.c.end_char)
|
||||
|
||||
def as_doc(self, *, bint copy_user_data=False):
|
||||
def as_doc(self, *, bint copy_user_data=False, array_head=None, array=None):
|
||||
"""Create a `Doc` object with a copy of the `Span`'s data.
|
||||
|
||||
copy_user_data (bool): Whether or not to copy the original doc's user data.
|
||||
array_head (tuple): `Doc` array attrs, can be passed in to speed up computation.
|
||||
array (ndarray): `Doc` as array, can be passed in to speed up computation.
|
||||
RETURNS (Doc): The `Doc` copy of the span.
|
||||
|
||||
DOCS: https://spacy.io/api/span#as_doc
|
||||
|
@ -224,8 +231,10 @@ cdef class Span:
|
|||
words = [t.text for t in self]
|
||||
spaces = [bool(t.whitespace_) for t in self]
|
||||
cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces)
|
||||
array_head = self.doc._get_array_attrs()
|
||||
array = self.doc.to_array(array_head)
|
||||
if array_head is None:
|
||||
array_head = self.doc._get_array_attrs()
|
||||
if array is None:
|
||||
array = self.doc.to_array(array_head)
|
||||
array = array[self.start : self.end]
|
||||
self._fix_dep_copy(array_head, array)
|
||||
# Fix initial IOB so the entities are valid for doc.ents below.
|
||||
|
|
24
spacy/tokens/span_group.pyi
Normal file
|
@ -0,0 +1,24 @@
|
|||
from typing import Any, Dict, Iterable
|
||||
from .doc import Doc
|
||||
from .span import Span
|
||||
|
||||
class SpanGroup:
|
||||
def __init__(
|
||||
self,
|
||||
doc: Doc,
|
||||
*,
|
||||
name: str = ...,
|
||||
attrs: Dict[str, Any] = ...,
|
||||
spans: Iterable[Span] = ...
|
||||
) -> None: ...
|
||||
def __repr__(self) -> str: ...
|
||||
@property
|
||||
def doc(self) -> Doc: ...
|
||||
@property
|
||||
def has_overlap(self) -> bool: ...
|
||||
def __len__(self) -> int: ...
|
||||
def append(self, span: Span) -> None: ...
|
||||
def extend(self, spans: Iterable[Span]) -> None: ...
|
||||
def __getitem__(self, i: int) -> Span: ...
|
||||
def to_bytes(self) -> bytes: ...
|
||||
def from_bytes(self, bytes_data: bytes) -> SpanGroup: ...
|
|
@ -1,6 +1,8 @@
|
|||
import weakref
|
||||
import struct
|
||||
import srsly
|
||||
|
||||
from spacy.errors import Errors
|
||||
from .span cimport Span
|
||||
from libc.stdint cimport uint64_t, uint32_t, int32_t
|
||||
|
||||
|
@ -58,7 +60,11 @@ cdef class SpanGroup:
|
|||
|
||||
DOCS: https://spacy.io/api/spangroup#doc
|
||||
"""
|
||||
return self._doc_ref()
|
||||
doc = self._doc_ref()
|
||||
if doc is None:
|
||||
# referent has been garbage collected
|
||||
raise RuntimeError(Errors.E866)
|
||||
return doc
|
||||
|
||||
@property
|
||||
def has_overlap(self):
|
||||
|
|
208
spacy/tokens/token.pyi
Normal file
|
@ -0,0 +1,208 @@
|
|||
from typing import (
|
||||
Callable,
|
||||
Protocol,
|
||||
Iterator,
|
||||
Optional,
|
||||
Union,
|
||||
Tuple,
|
||||
Any,
|
||||
)
|
||||
from thinc.types import Floats1d, FloatsXd
|
||||
from .doc import Doc
|
||||
from .span import Span
|
||||
from .morphanalysis import MorphAnalysis
|
||||
from ..lexeme import Lexeme
|
||||
from ..vocab import Vocab
|
||||
from .underscore import Underscore
|
||||
|
||||
class TokenMethod(Protocol):
|
||||
def __call__(self: Token, *args: Any, **kwargs: Any) -> Any: ...
|
||||
|
||||
class Token:
|
||||
i: int
|
||||
doc: Doc
|
||||
vocab: Vocab
|
||||
@classmethod
|
||||
def set_extension(
|
||||
cls,
|
||||
name: str,
|
||||
default: Optional[Any] = ...,
|
||||
getter: Optional[Callable[[Token], Any]] = ...,
|
||||
setter: Optional[Callable[[Token, Any], None]] = ...,
|
||||
method: Optional[TokenMethod] = ...,
|
||||
force: bool = ...,
|
||||
) -> None: ...
|
||||
@classmethod
|
||||
def get_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[TokenMethod],
|
||||
Optional[Callable[[Token], Any]],
|
||||
Optional[Callable[[Token, Any], None]],
|
||||
]: ...
|
||||
@classmethod
|
||||
def has_extension(cls, name: str) -> bool: ...
|
||||
@classmethod
|
||||
def remove_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[TokenMethod],
|
||||
Optional[Callable[[Token], Any]],
|
||||
Optional[Callable[[Token, Any], None]],
|
||||
]: ...
|
||||
def __init__(self, vocab: Vocab, doc: Doc, offset: int) -> None: ...
|
||||
def __hash__(self) -> int: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __unicode__(self) -> str: ...
|
||||
def __bytes__(self) -> bytes: ...
|
||||
def __str__(self) -> str: ...
|
||||
def __repr__(self) -> str: ...
|
||||
def __richcmp__(self, other: Token, op: int) -> bool: ...
|
||||
@property
|
||||
def _(self) -> Underscore: ...
|
||||
def nbor(self, i: int = ...) -> Token: ...
|
||||
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
|
||||
def has_morph(self) -> bool: ...
|
||||
morph: MorphAnalysis
|
||||
@property
|
||||
def lex(self) -> Lexeme: ...
|
||||
@property
|
||||
def lex_id(self) -> int: ...
|
||||
@property
|
||||
def rank(self) -> int: ...
|
||||
@property
|
||||
def text(self) -> str: ...
|
||||
@property
|
||||
def text_with_ws(self) -> str: ...
|
||||
@property
|
||||
def prob(self) -> float: ...
|
||||
@property
|
||||
def sentiment(self) -> float: ...
|
||||
@property
|
||||
def lang(self) -> int: ...
|
||||
@property
|
||||
def idx(self) -> int: ...
|
||||
@property
|
||||
def cluster(self) -> int: ...
|
||||
@property
|
||||
def orth(self) -> int: ...
|
||||
@property
|
||||
def lower(self) -> int: ...
|
||||
@property
|
||||
def norm(self) -> int: ...
|
||||
@property
|
||||
def shape(self) -> int: ...
|
||||
@property
|
||||
def prefix(self) -> int: ...
|
||||
@property
|
||||
def suffix(self) -> int: ...
|
||||
lemma: int
|
||||
pos: int
|
||||
tag: int
|
||||
dep: int
|
||||
@property
|
||||
def has_vector(self) -> bool: ...
|
||||
@property
|
||||
def vector(self) -> Floats1d: ...
|
||||
@property
|
||||
def vector_norm(self) -> float: ...
|
||||
@property
|
||||
def tensor(self) -> Optional[FloatsXd]: ...
|
||||
@property
|
||||
def n_lefts(self) -> int: ...
|
||||
@property
|
||||
def n_rights(self) -> int: ...
|
||||
@property
|
||||
def sent(self) -> Span: ...
|
||||
sent_start: bool
|
||||
is_sent_start: Optional[bool]
|
||||
is_sent_end: Optional[bool]
|
||||
@property
|
||||
def lefts(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def rights(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def children(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def subtree(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def left_edge(self) -> Token: ...
|
||||
@property
|
||||
def right_edge(self) -> Token: ...
|
||||
@property
|
||||
def ancestors(self) -> Iterator[Token]: ...
|
||||
def is_ancestor(self, descendant: Token) -> bool: ...
|
||||
def has_head(self) -> bool: ...
|
||||
head: Token
|
||||
@property
|
||||
def conjuncts(self) -> Tuple[Token]: ...
|
||||
ent_type: int
|
||||
ent_type_: str
|
||||
@property
|
||||
def ent_iob(self) -> int: ...
|
||||
@classmethod
|
||||
def iob_strings(cls) -> Tuple[str]: ...
|
||||
@property
|
||||
def ent_iob_(self) -> str: ...
|
||||
ent_id: int
|
||||
ent_id_: str
|
||||
ent_kb_id: int
|
||||
ent_kb_id_: str
|
||||
@property
|
||||
def whitespace_(self) -> str: ...
|
||||
@property
|
||||
def orth_(self) -> str: ...
|
||||
@property
|
||||
def lower_(self) -> str: ...
|
||||
norm_: str
|
||||
@property
|
||||
def shape_(self) -> str: ...
|
||||
@property
|
||||
def prefix_(self) -> str: ...
|
||||
@property
|
||||
def suffix_(self) -> str: ...
|
||||
@property
|
||||
def lang_(self) -> str: ...
|
||||
lemma_: str
|
||||
pos_: str
|
||||
tag_: str
|
||||
def has_dep(self) -> bool: ...
|
||||
dep_: str
|
||||
@property
|
||||
def is_oov(self) -> bool: ...
|
||||
@property
|
||||
def is_stop(self) -> bool: ...
|
||||
@property
|
||||
def is_alpha(self) -> bool: ...
|
||||
@property
|
||||
def is_ascii(self) -> bool: ...
|
||||
@property
|
||||
def is_digit(self) -> bool: ...
|
||||
@property
|
||||
def is_lower(self) -> bool: ...
|
||||
@property
|
||||
def is_upper(self) -> bool: ...
|
||||
@property
|
||||
def is_title(self) -> bool: ...
|
||||
@property
|
||||
def is_punct(self) -> bool: ...
|
||||
@property
|
||||
def is_space(self) -> bool: ...
|
||||
@property
|
||||
def is_bracket(self) -> bool: ...
|
||||
@property
|
||||
def is_quote(self) -> bool: ...
|
||||
@property
|
||||
def is_left_punct(self) -> bool: ...
|
||||
@property
|
||||
def is_right_punct(self) -> bool: ...
|
||||
@property
|
||||
def is_currency(self) -> bool: ...
|
||||
@property
|
||||
def like_url(self) -> bool: ...
|
||||
@property
|
||||
def like_num(self) -> bool: ...
|
||||
@property
|
||||
def like_email(self) -> bool: ...
|
|
@ -77,7 +77,9 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language":
|
|||
f"examples for initialization. If necessary, provide all labels "
|
||||
f"in [initialize]. More info: https://spacy.io/api/cli#init_labels"
|
||||
)
|
||||
nlp.initialize(lambda: islice(train_corpus(nlp), sample_size), sgd=optimizer)
|
||||
nlp.initialize(
|
||||
lambda: islice(train_corpus(nlp), sample_size), sgd=optimizer
|
||||
)
|
||||
else:
|
||||
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
|
||||
logger.info(f"Initialized pipeline components: {nlp.pipe_names}")
|
||||
|
|
|
@ -29,7 +29,7 @@ def console_logger(progress_bar: bool = False):
|
|||
def setup_printer(
|
||||
nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
|
||||
) -> Tuple[Callable[[Optional[Dict[str, Any]]], None], Callable[[], None]]:
|
||||
write = lambda text: stdout.write(f"{text}\n")
|
||||
write = lambda text: print(text, file=stdout, flush=True)
|
||||
msg = Printer(no_print=True)
|
||||
# ensure that only trainable components are logged
|
||||
logged_pipes = [
|
||||
|
|
|
@ -20,8 +20,10 @@ import sys
|
|||
import warnings
|
||||
from packaging.specifiers import SpecifierSet, InvalidSpecifier
|
||||
from packaging.version import Version, InvalidVersion
|
||||
from packaging.requirements import Requirement
|
||||
import subprocess
|
||||
from contextlib import contextmanager
|
||||
from collections import defaultdict
|
||||
import tempfile
|
||||
import shutil
|
||||
import shlex
|
||||
|
@ -33,11 +35,6 @@ try:
|
|||
except ImportError:
|
||||
cupy = None
|
||||
|
||||
try: # Python 3.8
|
||||
import importlib.metadata as importlib_metadata
|
||||
except ImportError:
|
||||
from catalogue import _importlib_metadata as importlib_metadata
|
||||
|
||||
# These are functions that were previously (v2.x) available from spacy.util
|
||||
# and have since moved to Thinc. We're importing them here so people's code
|
||||
# doesn't break, but they should always be imported from Thinc from now on,
|
||||
|
@ -46,7 +43,7 @@ from thinc.api import fix_random_seed, compounding, decaying # noqa: F401
|
|||
|
||||
|
||||
from .symbols import ORTH
|
||||
from .compat import cupy, CudaStream, is_windows
|
||||
from .compat import cupy, CudaStream, is_windows, importlib_metadata
|
||||
from .errors import Errors, Warnings, OLD_MODEL_SHORTCUTS
|
||||
from . import about
|
||||
|
||||
|
@ -639,13 +636,18 @@ def is_unconstrained_version(
|
|||
return True
|
||||
|
||||
|
||||
def get_model_version_range(spacy_version: str) -> str:
|
||||
"""Generate a version range like >=1.2.3,<1.3.0 based on a given spaCy
|
||||
version. Models are always compatible across patch versions but not
|
||||
across minor or major versions.
|
||||
def split_requirement(requirement: str) -> Tuple[str, str]:
|
||||
"""Split a requirement like spacy>=1.2.3 into ("spacy", ">=1.2.3")."""
|
||||
req = Requirement(requirement)
|
||||
return (req.name, str(req.specifier))
|
||||
|
||||
|
||||
def get_minor_version_range(version: str) -> str:
|
||||
"""Generate a version range like >=1.2.3,<1.3.0 based on a given version
|
||||
(e.g. of spaCy).
|
||||
"""
|
||||
release = Version(spacy_version).release
|
||||
return f">={spacy_version},<{release[0]}.{release[1] + 1}.0"
|
||||
release = Version(version).release
|
||||
return f">={version},<{release[0]}.{release[1] + 1}.0"
|
||||
|
||||
|
||||
def get_model_lower_version(constraint: str) -> Optional[str]:
|
||||
|
@ -733,7 +735,7 @@ def load_meta(path: Union[str, Path]) -> Dict[str, Any]:
|
|||
model=f"{meta['lang']}_{meta['name']}",
|
||||
model_version=meta["version"],
|
||||
version=meta["spacy_version"],
|
||||
example=get_model_version_range(about.__version__),
|
||||
example=get_minor_version_range(about.__version__),
|
||||
)
|
||||
warnings.warn(warn_msg)
|
||||
return meta
|
||||
|
@ -1549,3 +1551,19 @@ def to_ternary_int(val) -> int:
|
|||
return 0
|
||||
else:
|
||||
return -1
|
||||
|
||||
|
||||
# The following implementation of packages_distributions() is adapted from
|
||||
# importlib_metadata, which is distributed under the Apache 2.0 License.
|
||||
# Copyright (c) 2017-2019 Jason R. Coombs, Barry Warsaw
|
||||
# See licenses/3rd_party_licenses.txt
|
||||
def packages_distributions() -> Dict[str, List[str]]:
|
||||
"""Return a mapping of top-level packages to their distributions. We're
|
||||
inlining this helper from the importlib_metadata "backport" here, since
|
||||
it's not available in the builtin importlib.metadata.
|
||||
"""
|
||||
pkg_to_dist = defaultdict(list)
|
||||
for dist in importlib_metadata.distributions():
|
||||
for pkg in (dist.read_text("top_level.txt") or "").split():
|
||||
pkg_to_dist[pkg].append(dist.metadata["Name"])
|
||||
return dict(pkg_to_dist)
|
||||
|
|
78
spacy/vocab.pyi
Normal file
|
@ -0,0 +1,78 @@
|
|||
from typing import (
|
||||
Callable,
|
||||
Iterator,
|
||||
Optional,
|
||||
Union,
|
||||
Tuple,
|
||||
List,
|
||||
Dict,
|
||||
Any,
|
||||
)
|
||||
from thinc.types import Floats1d, FloatsXd
|
||||
from . import Language
|
||||
from .strings import StringStore
|
||||
from .lexeme import Lexeme
|
||||
from .lookups import Lookups
|
||||
from .tokens import Doc, Span
|
||||
from pathlib import Path
|
||||
|
||||
def create_vocab(
|
||||
lang: Language, defaults: Any, vectors_name: Optional[str] = ...
|
||||
) -> Vocab: ...
|
||||
|
||||
class Vocab:
|
||||
def __init__(
|
||||
self,
|
||||
lex_attr_getters: Optional[Dict[str, Callable[[str], Any]]] = ...,
|
||||
strings: Optional[Union[List[str], StringStore]] = ...,
|
||||
lookups: Optional[Lookups] = ...,
|
||||
oov_prob: float = ...,
|
||||
vectors_name: Optional[str] = ...,
|
||||
writing_system: Dict[str, Any] = ...,
|
||||
get_noun_chunks: Optional[Callable[[Union[Doc, Span]], Iterator[Span]]] = ...,
|
||||
) -> None: ...
|
||||
@property
|
||||
def lang(self) -> Language: ...
|
||||
def __len__(self) -> int: ...
|
||||
def add_flag(
|
||||
self, flag_getter: Callable[[str], bool], flag_id: int = ...
|
||||
) -> int: ...
|
||||
def __contains__(self, key: str) -> bool: ...
|
||||
def __iter__(self) -> Iterator[Lexeme]: ...
|
||||
def __getitem__(self, id_or_string: Union[str, int]) -> Lexeme: ...
|
||||
@property
|
||||
def vectors_length(self) -> int: ...
|
||||
def reset_vectors(
|
||||
self, *, width: Optional[int] = ..., shape: Optional[int] = ...
|
||||
) -> None: ...
|
||||
def prune_vectors(self, nr_row: int, batch_size: int = ...) -> Dict[str, float]: ...
|
||||
def get_vector(
|
||||
self,
|
||||
orth: Union[int, str],
|
||||
minn: Optional[int] = ...,
|
||||
maxn: Optional[int] = ...,
|
||||
) -> FloatsXd: ...
|
||||
def set_vector(self, orth: Union[int, str], vector: Floats1d) -> None: ...
|
||||
def has_vector(self, orth: Union[int, str]) -> bool: ...
|
||||
lookups: Lookups
|
||||
def to_disk(
|
||||
self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> None: ...
|
||||
def from_disk(
|
||||
self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> Vocab: ...
|
||||
def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
|
||||
def from_bytes(
|
||||
self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> Vocab: ...
|
||||
|
||||
def pickle_vocab(vocab: Vocab) -> Any: ...
|
||||
def unpickle_vocab(
|
||||
sstore: StringStore,
|
||||
vectors: Any,
|
||||
morphology: Any,
|
||||
data_dir: Any,
|
||||
lex_attr_getters: Any,
|
||||
lookups: Any,
|
||||
get_noun_chunks: Any,
|
||||
) -> Vocab: ...
|
|
@ -303,6 +303,10 @@ not been implemeted for the given language, a `NotImplementedError` is raised.

Create a new `Doc` object corresponding to the `Span`, with a copy of the data.

When calling this on many spans from the same doc, passing in a precomputed
array representation of the doc using the `array_head` and `array` args can save
time.

> #### Example
>
> ```python

@ -312,10 +316,12 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.

> assert doc2.text == "New York"
> ```

| Name             | Description                                                    |
| ---------------- | -------------------------------------------------------------- |
| `copy_user_data` | Whether or not to copy the original doc's user data. ~~bool~~   |
| **RETURNS**      | A `Doc` object of the `Span`'s content. ~~Doc~~                 |

| Name             | Description                                                                                                            |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------- |
| `copy_user_data` | Whether or not to copy the original doc's user data. ~~bool~~                                                           |
| `array_head`     | Precomputed array attributes (headers) of the original doc, as generated by `Doc._get_array_attrs()`. ~~Tuple~~         |
| `array`          | Precomputed array version of the original doc as generated by [`Doc.to_array`](/api/doc#to_array). ~~numpy.ndarray~~    |
| **RETURNS**      | A `Doc` object of the `Span`'s content. ~~Doc~~                                                                          |
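
As a rough sketch of the batching pattern described above (a blank English
pipeline keeps the snippet self-contained; the keyword names follow the table,
everything else is illustrative):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("New York and San Francisco are cities")
spans = [doc[0:2], doc[3:5]]

# Precompute the array representation of the Doc once and reuse it for every
# span, instead of letting each as_doc() call rebuild it from scratch.
array_head = doc._get_array_attrs()
array = doc.to_array(array_head)
span_docs = [span.as_doc(array_head=array_head, array=array) for span in spans]
assert [d.text.strip() for d in span_docs] == ["New York", "San Francisco"]
```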
## Span.root {#root tag="property" model="parser"}
|
||||
|
||||
|
|
|
@ -13,6 +13,22 @@ A span categorizer consists of two parts: a [suggester function](#suggesters)
that proposes candidate spans, which may or may not overlap, and a labeler model
that predicts zero or more labels for each candidate.

Predicted spans will be saved in a [`SpanGroup`](/api/spangroup) on the doc.
Individual span scores can be found in `spangroup.attrs["scores"]`.

## Assigned Attributes {#assigned-attributes}

Predictions will be saved to `Doc.spans[spans_key]` as a
[`SpanGroup`](/api/spangroup). The scores for the spans in the `SpanGroup` will
be saved in `SpanGroup.attrs["scores"]`.

`spans_key` defaults to `"sc"`, but can be passed as a parameter.

| Location                               | Value                                                      |
| -------------------------------------- | ----------------------------------------------------------- |
| `Doc.spans[spans_key]`                 | The annotated spans. ~~SpanGroup~~                          |
| `Doc.spans[spans_key].attrs["scores"]` | The score for each span in the `SpanGroup`. ~~Floats1d~~    |
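
As a rough sketch of how these attributes might be read back (this assumes an
`nlp` pipeline that already contains a trained `spancat` component writing to
the default `"sc"` key; the variable names are illustrative):

```python
# Assumes `nlp` already has a trained "spancat" component using the key "sc".
doc = nlp("I like London and Berlin")
span_group = doc.spans["sc"]          # SpanGroup with the predicted spans
scores = span_group.attrs["scores"]   # one score per span, aligned by index
for span, score in zip(span_group, scores):
    print(span.text, span.label_, float(score))
```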

## Config and implementation {#config}

The default config is defined by the pipeline component factory and describes
|
|
|
@ -46,6 +46,16 @@ Create a `SpanGroup`.

The [`Doc`](/api/doc) object the span group is referring to.

<Infobox title="SpanGroup and Doc lifecycle" variant="warning">

When a `Doc` object is garbage collected, any related `SpanGroup` object won't
be functional anymore, as these objects use a `weakref` to refer to the
document. An error will be raised as the internal `doc` object will be `None`.
To avoid this, make sure that the original `Doc` objects are still available in
the scope of your function.

</Infobox>
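
A short sketch of the failure mode described above, modelled on the
`test_doc_gc` test added in this commit (the function names are illustrative):

```python
def get_spans_fragile(nlp, texts):
    # The Docs created inside the comprehension can be garbage collected, so
    # accessing a span in the returned SpanGroups may raise a RuntimeError.
    return [doc.spans for doc in nlp.pipe(texts)]


def get_spans_safe(nlp, texts):
    # Keep the Docs alive for as long as their span groups are needed.
    docs = list(nlp.pipe(texts))
    return docs, [doc.spans for doc in docs]
```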

> #### Example
>
> ```python