Compare commits


1707 Commits

Author SHA1 Message Date
Jeff Adolphe
41e07772dc
Added Haitian Creole (ht) Language Support to spaCy (#13807)
This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module.
It includes:

    Added all core language data files for spacy/lang/ht:
        tokenizer_exceptions.py
        punctuation.py
        lex_attrs.py
        syntax_iterators.py
        lemmatizer.py
        stop_words.py
        tag_map.py

    Unit tests for the tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.); all 58 new pytest tests under spacy/tests/lang/ht pass.

    Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions.

    Custom like_num attribute supporting Haitian number formats (e.g., "3yèm").

    Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm").

    Ensured no breakages in other language modules.

    Followed spaCy coding style (PEP8, Black).

This provides a foundation for Haitian Creole NLP development using spaCy.
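A minimal usage sketch (not part of the PR) exercising the new language data; exact tokenization output depends on the rules added here:

```python
import spacy

# spacy.blank("ht") picks up the new spacy/lang/ht language data
nlp = spacy.blank("ht")
doc = nlp("M'ap li 3yèm liv la.")
print([t.text for t in doc])      # contractions split per the new tokenizer rules
print([t.like_num for t in doc])  # "3yèm" should be recognized as number-like
```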
2025-05-28 17:23:38 +02:00
Martin Schorfmann
e8f40e2169
Correct API docs for Span.lemma_, Vocab.to_bytes and Vectors.__init__ (#13436)
* Correct code example for Span.lemma_ in API Docs (#13405)

* Correct documented return type of Vocab.to_bytes in API docs

* Correct wording for Vectors.__init__ in API docs
2025-05-28 17:22:50 +02:00
BLKSerene
7b1d6e58ff
Remove dependency on langcodes (#13760)
This PR removes the dependency on langcodes introduced in #9342.

While the introduction of langcodes allows a significantly wider range of language codes, there are some unexpected side effects:

    zh-Hant (Traditional Chinese) should be mapped to zh instead of None, as spaCy's Chinese model is based on pkuseg, which supports tokenization of both Simplified and Traditional Chinese.
    Since it is possible that spaCy may have a model for Norwegian Nynorsk in the future, mapping no (macrolanguage Norwegian) to nb (Norwegian Bokmål) might be misleading. In that case, the user should be asked to specify nb or nn (Norwegian Nynorsk) specifically or consult the doc.
    Same as above for regional variants of languages such as en_gb and en_us.

Overall, IMHO, introducing an extra dependency just for the conversion of language codes is overkill. Most users likely just need conversion between 2/3-letter ISO codes, and a simple dictionary lookup should suffice.

With this PR, ISO 639-1 and ISO 639-3 codes are supported. ISO 639-2/B (bibliographic codes which are not favored and used in ISO 639-3) and deprecated ISO 639-1/2 codes are also supported to maximize backward compatibility.
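A sketch of the dictionary-lookup approach described above. The mapping here is a tiny hypothetical excerpt, not spaCy's actual table; per this PR, regional variants such as en_gb are deliberately left for the user to resolve:

```python
from typing import Optional

# Hypothetical excerpt; the real table covers all ISO 639-1/639-3 codes.
ISO639_3_TO_1 = {"eng": "en", "fra": "fr", "nob": "nb", "hat": "ht"}

def to_iso639_1(code: str) -> Optional[str]:
    code = code.lower()
    if len(code) == 2:
        return code                 # already a 2-letter ISO 639-1 code
    return ISO639_3_TO_1.get(code)  # None -> ask the user to be explicit

assert to_iso639_1("eng") == "en"
assert to_iso639_1("en_gb") is None  # no silent mapping to "en"
```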
2025-05-28 17:21:46 +02:00
Matthew Honnibal
864c2f3b51 Format 2025-05-28 17:06:11 +02:00
Matthew Honnibal
75a9d9b9ad Test and fix issue13769 2025-05-28 17:04:23 +02:00
Ilie
bec546cec0
Add TeNs plugin (#13800)
Co-authored-by: Ilie Cristian Dorobat <idorobat@cisco.com>
2025-05-27 01:21:07 +02:00
d0ngw
46613e27cf
fix: match hyphenated words to lemmas in index_table (e.g. "co-authored" -> "co-author") (#13816) 2025-05-27 01:20:26 +02:00
omahs
b205ff65e6
fix typos (#13813) 2025-05-26 16:05:29 +02:00
BLKSerene
92f1b8cdb4
Switch to typer-slim (#13759) 2025-05-26 16:03:49 +02:00
Matthew Honnibal
4b65aa79ee Add release script 2025-05-22 14:00:48 +02:00
Matthew Honnibal
d08f4e3b10 Increment version 2025-05-22 13:58:00 +02:00
Matthew Honnibal
6036f344d3 Remove print statements 2025-05-22 13:56:31 +02:00
Matthew Honnibal
5bebbf7550
Python 3.13 support (#13823)
In order to support Python 3.13, we had to migrate to Cython 3.0. This caused some tricky interactions with our Pydantic usage, because Cython 3 uses the from __future__ import annotations semantics, which causes type annotations to be saved as strings.

The end result is that we can't have Language.factory decorated functions in Cython modules anymore, as the Language.factory decorator expects to inspect the signature of the functions and build a Pydantic model. If the function is implemented in Cython, an error is raised because the type is not resolved.

To address this I've moved the factory functions into a new module, spacy.pipeline.factories. I've added __getattr__ importlib hooks to the previous locations, in case anyone was importing these functions directly (a sketch of this pattern follows below). The change should have no backwards compatibility implications.

Along the way I've also refactored the registration of functions for the config. Previously these ran as import-time side-effects, using the registry decorator. Instead, I've created a new module, spacy.registrations. When the registry is accessed, it calls a function ensure_populated(), which causes the registrations to occur.

I've made a similar change to the Language.factory registrations in the new spacy.pipeline.factories module.

I want to remove these import-time side-effects so that we can speed up the loading time of the library, which can be especially painful on the CLI. I also find that I'm often working to track down the implementations of functions referenced by strings in the config. Having the registrations all happen in one place will make this easier.

With these changes I've fortunately avoided the need to migrate to Pydantic v2 properly --- we're still using the v1 compatibility shim. We might not be able to hold out forever though: Pydantic (reasonably) aren't actively supporting the v1 shims. I put a lot of work into v2 migration when investigating the 3.13 support, and it's definitely challenging. In any case, it's a relief that we don't have to do the v2 migration at the same time as the Cython 3.0/Python 3.13 support.
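For reference, the module-level __getattr__ hook mentioned above follows the PEP 562 pattern; this is an illustrative sketch with a made-up factory name, not spaCy's actual layout:

```python
# old_module.py -- keeps legacy imports working after functions moved away
import importlib

_MOVED = {"make_example_factory": "spacy.pipeline.factories"}  # hypothetical name

def __getattr__(name):
    if name in _MOVED:
        return getattr(importlib.import_module(_MOVED[name]), name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```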
2025-05-22 13:47:21 +02:00
Matthew Honnibal
911539e9a4 Update version 2025-05-18 12:18:38 +02:00
Matthew Honnibal
22c1bc785b Replace lte with lt for clarity 2025-05-18 12:18:17 +02:00
Matthew Honnibal
cb5e760e91 Fix python version supported 2025-05-18 12:17:23 +02:00
Gunther Cox
87ec2b72a5
Update spaCy Universe entry for ChatterBot to use correct name casing (#13784) 2025-05-12 07:47:50 +02:00
翟持江
aa8de0ed37
Update embeddings-transformers.mdx, update trf_data examples info in <Runtime usage> (#13811) 2025-05-12 07:47:12 +02:00
Adrien Carpentier
98a19df91a
docs: fix README.md for compatible Python versions (#13749) 2025-04-11 20:56:52 +02:00
Matthew Honnibal
92bd042502 Allow Python 3.13 2025-04-03 23:15:12 +02:00
Matthew Honnibal
d0c705cbc9 Increment version 2025-04-01 09:40:59 +02:00
Matthew Honnibal
b3c46c315e Add support for linux-arm 2025-02-03 18:32:23 +01:00
Ines Montani
d194f06437 Add live stream to site [ci skip] 2025-02-03 09:42:52 +01:00
Ines Montani
055e07d9cc Update README.md [ci skip] 2025-02-03 09:38:32 +01:00
Ines Montani
8e1c14e977 Add live stream to README [ci skip] 2025-02-03 09:37:48 +01:00
Christine P. Chai
4278182dd0
Change Twitter to X (#13740) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2025-02-03 09:30:21 +01:00
Matthew Honnibal
85cc763006 Fix python version requirement 2025-01-13 18:17:36 +01:00
Matthew Honnibal
ba7468e32e
Update requirements, fixing windows crashes (#13727)
* Re-enable pretraining test

* Require thinc 8.3.4

* Reformat

* Re-enable test
2025-01-13 16:39:46 +01:00
Matthew Honnibal
311f7cc9fb Set version to v3.8.4 2024-12-11 14:14:08 +01:00
Matthew Honnibal
682140496a Align requirements better 2024-12-11 14:13:51 +01:00
Matthew Honnibal
343f4f21d7 Enable Python 3.13 2024-12-11 14:13:28 +01:00
Matthew Honnibal
be0fa812c2 Update cibuildwheel 2024-12-11 13:08:40 +01:00
Matthew Honnibal
a6317b3836
Fix allocation of non-transient strings in StringStore (#13713)
* Fix bug in memory-zone code when adding non-transient strings. The error could result in segmentation faults or other memory errors during memory zones if new labels were added to the model.
* Fix handling of new morphological labels within memory zones. Addresses the second issue reported in "Memory leak of MorphAnalysis object" (#13684).
2024-12-11 13:06:53 +01:00
Ines Montani
3e30b5bef6 Add spacy-layout [ci skip] 2024-11-19 10:43:40 +01:00
Matthew Honnibal
3ecec1324c
Usage page on memory management, explaining memory zones and doc_cleaner (#13643) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-10-23 12:42:54 +02:00
Ikko Eltociear Ashimine
15fbf5ef36
docs: update rule-based-matching.mdx (#13665) [ci skip] 2024-10-23 12:07:01 +02:00
Sergei Pashakhin
1ee9a19059
Fix typo (#13657) [ci skip] 2024-10-23 12:06:36 +02:00
thjbdvlt
0d7e57fc3e
universe-pipeline-solipCysme-french (#13627) [ci skip] 2024-10-11 11:26:15 +02:00
Ines Montani
ae5c3e078d Fix universe.json [ci skip] 2024-10-11 11:24:42 +02:00
Andrei (Andrey) Khropov
8d2902b0e7
Fix misspelling (#13631) [ci skip] 2024-10-11 11:23:12 +02:00
aravind-mc
44d1906453
Update universe.json to add my spaCy online course (#13632) [ci skip] 2024-10-11 11:21:57 +02:00
sam rxh
52a4cb0d14
Fix 'issue template' link in CONTRIBUTING.md (#13587) [ci skip] 2024-10-11 11:20:34 +02:00
Ines Montani
10a6f508ab Fix landing banner links [ci skip] 2024-10-11 11:19:10 +02:00
Matthew Honnibal
bda4bb0184
Try disabling pretraining tests to probe windows ci failure (#13646) 2024-10-02 01:01:40 +02:00
Matthew Honnibal
628c973db5 Note minimum python requirement in setup.cfg 2024-10-02 00:49:09 +02:00
Matthew Honnibal
e0782c5e4c Merge branch 'master' into v3.8.x 2024-10-01 23:57:48 +02:00
Matthew Honnibal
5230754986 Fix thinc dependency 2024-10-01 23:49:17 +02:00
Matthew Honnibal
411b70f5f3 Upd requirements 2024-10-01 23:46:54 +02:00
Matthew Honnibal
08705f5a8c Upd tests 2024-10-01 22:57:25 +02:00
Matthew Honnibal
77177d0216 Upd tests workflow 2024-10-01 22:54:12 +02:00
Matthew Honnibal
5196366af5 Upd tests workflow 2024-10-01 22:53:11 +02:00
Matthew Honnibal
29232ad3b5 Upd tests workflow 2024-10-01 22:51:09 +02:00
Matthew Honnibal
dd47fbb45f Remove 'apple' extra 2024-10-01 22:24:25 +02:00
Matthew Honnibal
63f1b53c1a Check test failure 2024-10-01 16:49:49 +02:00
Matthew Honnibal
0cdcfe56cb Set version to v3.8.2 2024-10-01 16:47:24 +02:00
Matthew Honnibal
924cbc9703 Fix environment variable for test 2024-10-01 16:08:06 +02:00
Matthew Honnibal
e1d050517d Fix requirements.txt 2024-10-01 15:56:18 +02:00
Matthew Honnibal
6c038aaae0 Don't disable tests on workflow changes 2024-10-01 15:32:01 +02:00
Matthew Honnibal
f0084b9143 Fix matrix in tests 2024-10-01 15:28:22 +02:00
Matthew Honnibal
ff81bfb8db Update tests 2024-10-01 13:21:10 +02:00
Matthew Honnibal
9c5b61bdff isort 2024-10-01 12:38:51 +02:00
Matthew Honnibal
725ccbac39 Format 2024-10-01 12:38:02 +02:00
Matthew Honnibal
a8837beab7 Set version to v3.8.1 2024-10-01 12:37:11 +02:00
Matthew Honnibal
3a0aadcf86 Update spacy[apple] thinc-apple-ops pin for numpy v2 compatibility 2024-10-01 10:16:35 +02:00
DomHudson
a61a1d43cf
[Documentation] Replace broken URL in _serialization.mdx (#13641) 2024-09-30 17:45:50 +02:00
Matthew Honnibal
114b4894fb Fix --require-parent default 2024-09-29 15:50:31 +02:00
Matthew Honnibal
dec13b4258 Fix inverted cli arg 2024-09-29 15:50:05 +02:00
Matthew Honnibal
c03f060527 Allow positive option --require-parent 2024-09-29 14:30:14 +02:00
Matthew Honnibal
6255cb985f Include version constraint in parent package requirement 2024-09-29 14:22:21 +02:00
Matthew Honnibal
3b165a8716 Simplify setting to require parent package 2024-09-29 14:19:10 +02:00
Matthew Honnibal
969832f5d6 Fix package 2024-09-29 14:00:11 +02:00
Matthew Honnibal
8ce53a6bbe Syntax 2024-09-29 13:51:44 +02:00
Matthew Honnibal
6fa0d709d5 Support option to not depend on parent package in spacy package 2024-09-29 13:51:04 +02:00
Matthew Honnibal
5010fcbd3a Fix numpy constant 2024-09-14 13:13:11 +02:00
Matthew Honnibal
de4f19f3a3 Fix version 2024-09-14 13:12:44 +02:00
Matthew Honnibal
3d03565498 Replace numpy floats in evaluate and update 2024-09-14 12:55:53 +02:00
Matthew Honnibal
0576a1ff56 Fix numpy floats in meta.json 2024-09-14 12:54:08 +02:00
Matthew Honnibal
2f1e7ed09a Lint 2024-09-14 11:36:27 +02:00
Matthew Honnibal
e2dc9b79e1 Format 2024-09-14 11:29:40 +02:00
Matthew Honnibal
3c3d75015b Set version to v3.7.7 2024-09-14 11:27:32 +02:00
Matthew Honnibal
50aa3b5cbe Merge branch 'master' of https://github.com/explosion/spaCy 2024-09-14 11:09:44 +02:00
Matthew Honnibal
8266031454 Merge numpy version update 2024-09-14 11:08:35 +02:00
Matthew Honnibal
8dcc4b8daf Skip running tests on PRs 2024-09-14 11:07:23 +02:00
Matthew Honnibal
3a635d2c94 Try skipping 686 2024-09-14 00:12:49 +02:00
Matthew Honnibal
a0ce61f55a Fix thinc pin 2024-09-13 14:21:03 +02:00
Matthew Honnibal
83b4015b36 Remove aarch 2024-09-13 12:35:50 +02:00
Matthew Honnibal
419bfaf6e7 Update cibuildwheel 2024-09-13 10:44:48 +02:00
Matthew Honnibal
69ecb85fad Set version to v3.8.1 2024-09-13 10:43:40 +02:00
Matthew Honnibal
b427597fc8 Set version to v3.8.0 2024-09-11 21:32:26 +02:00
Matthew Honnibal
1869a197c9 Try enabling macos-14 for arm builds 2024-09-11 16:06:57 +02:00
Matthew Honnibal
c068e1de1b Fix dependencies 2024-09-11 15:57:52 +02:00
Matthew Honnibal
184e508d9c Update numpy pin 2024-09-11 15:57:17 +02:00
William Mattingly
30f1f33e78
Added Date spaCy to universe (#13415) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:29:03 +02:00
William Mattingly
f1a5ff9dba
added spacy whisper to universe (#13418) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:28:00 +02:00
William Mattingly
c80dacd046
added spacy annoy to universe (#13416) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:26:21 +02:00
William Mattingly
7fbbb2002a
updated universe for number spacy (#13424) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:25:23 +02:00
William Mattingly
89c1774d43
added bagpipes-spacy to universe (#13425) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:24:06 +02:00
thjbdvlt
081e4e385d
universe-project-presque (#13515) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:21:41 +02:00
thjbdvlt
0190e669c5
universe-package-quelquhui (#13514) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:17:33 +02:00
Oren Halvani
54dc4ee8fb
Added: Constituent-Treelib to: universe.json (#13432) [ci skip]
Co-authored-by: Halvani <>
2024-09-10 14:13:36 +02:00
William Mattingly
5a7ad5572c
added gliner-spacy to universe (#13417) [ci skip]
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:12:52 +02:00
marinelay
b18cc94451
Delete unnecessary method (#13441)
Co-authored-by: marinelay <marinelay@gmail.com>
2024-09-09 20:57:13 +02:00
Matthew Honnibal
4cc3ebe74e Format 2024-09-09 20:56:01 +02:00
Matthew Honnibal
a019315534 Fix memory zones 2024-09-09 13:49:41 +02:00
Matthew Honnibal
59ac7e6bdb Format 2024-09-09 11:22:52 +02:00
Matthew Honnibal
b65491b641 Set version to v3.8.0.dev0 2024-09-09 11:20:23 +02:00
Matthew Honnibal
1b8d560d0e
Support 'memory zones' for user memory management (#13621)
Add a context manager nlp.memory_zone(), which will begin
memory_zone() blocks on the vocab, string store, and potentially
other components.

Example usage:

```
with nlp.memory_zone():
    for doc in nlp.pipe(texts):
        do_something(doc)
# do_something(doc) <-- Invalid
```

Once the memory_zone() block expires, spaCy will free any shared
resources that were allocated for the text-processing that occurred
within the memory_zone. If you create Doc objects within a memory
zone, it's invalid to access them once the memory zone is expired.

The purpose of this is that spaCy creates and stores Lexeme objects
in the Vocab that can be shared between multiple Doc objects. It also
interns strings. Normally, spaCy can't know when all Doc objects using
a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab,
causing memory pressure.

Memory zones solve this problem by telling spaCy "okay none of the
documents allocated within this block will be accessed again". This
lets spaCy free all new Lexeme objects and other data that were
created during the block.

The mechanism is general, so memory_zone() context managers can be
added to other components that could benefit from them, e.g. pipeline
components.

I experimented with adding memory zone support to the tokenizer as well,
for its cache. However, this seems unnecessarily complicated. It makes
more sense to just stick a limit on the cache size. This lets spaCy
benefit from the efficiency advantage of the cache better, because
we can maintain a (bounded) cache even if only small batches of
documents are being processed.
2024-09-09 11:19:39 +02:00
ykyogoku
608f65ce40
add Tibetan (#13510) 2024-09-09 11:18:03 +02:00
Muzaffer Cikay
acbf2a428f
Add Kurdish Kurmanji language (#13561)
* Add Kurdish Kurmanji language

* Add lex_attrs
2024-09-09 11:15:40 +02:00
Mark Liberko
55db9c2e87
Added gd language folder (#13570)
Implemented a foundational Scottish Gaelic (gd) language option with tokenizer_exceptions and stop_words files.
2024-09-09 11:14:09 +02:00
Matthew Honnibal
319e02545c Set version to 3.7.6 2024-08-20 12:16:08 +02:00
Matthew Honnibal
a8accc3396
Use cibuildwheel to build wheels (#13603)
* Add workflow files for cibuildwheel

* Add config for cibuildwheel

* Set version for experimental prerelease

* Try updating cython

* Skip 32-bit windows builds

* Revert "Try updating cython"

This reverts commit c1b794ab5c.

* Try to import cibuildwheel settings from previous setup
2024-08-20 12:15:05 +02:00
Ines Montani
8cda27aefa Add case study [ci skip] 2024-06-26 09:41:23 +02:00
Matthew Honnibal
f78e5ce732 Disable extra CI 2024-06-21 14:32:00 +02:00
Sofie Van Landeghem
a6d0fc3602
Remove typing-extensions from requirements (#13516) 2024-05-31 19:20:46 +02:00
Sofie Van Landeghem
82fc2ecfa5
Bump version to 3.7.5 (#13493) 2024-05-15 12:11:33 +02:00
Sofie Van Landeghem
c195ca4f9c
fix docs for MorphAnalysis.__contains__ (#13433) 2024-05-02 16:46:41 +02:00
Sofie Van Landeghem
d3a232f773
Update LICENSE to include 2024 (#13472) 2024-04-30 09:17:59 +02:00
Sofie Van Landeghem
ecd85d2618
Update Typer pin and GH actions (#13471)
* update gh actions

* pin typer upperbound to 1.0.0
2024-04-29 13:28:46 +02:00
Alex Strick van Linschoten
045cd43c3f
Fix typos in docs (#13466)
* fix typos

* prettier formatting

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2024-04-29 11:10:17 +02:00
Sofie Van Landeghem
74836524e3
Bump to v5 (#13470) 2024-04-29 10:36:31 +02:00
Sofie Van Landeghem
6d6c10ab9c
Fix CI (#13469)
* Remove hardcoded architecture setting

* update classifiers to include Python 3.12
2024-04-29 10:18:07 +02:00
Sofie Van Landeghem
2e2334632b
Fix use_gold_ents behaviour for EntityLinker (#13400)
* fix type annotation in docs

* only restore entities after loss calculation

* restore entities of sample in initialization

* rename overfitting function

* fix EL scorer

* Relax test

* fix formatting

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* rename to _ensure_ents

* further rename

* allow for scorer to be None

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2024-04-16 12:00:22 +02:00
Joe Schiff
2e96797696
Convert properties to decorator syntax (#13390) 2024-04-16 11:51:14 +02:00
Sofie Van Landeghem
f5e85fa05a
allow weasel 0.4.x (#13409) 2024-04-04 12:55:08 +02:00
Yaseen
21aea59001
Update code.module.sass to make code title sticky (#13379) 2024-03-26 12:15:25 +01:00
Sofie Van Landeghem
4dc5fe5469
Renamed main branch back to v4 for now (#13395)
* Update gputests.yml

* Update slowtests.yml
2024-03-26 09:53:07 +01:00
Ines Montani
1252370f69 Move DocSearch key to env var [ci skip] 2024-03-25 10:17:57 +01:00
Sofie Van Landeghem
d410d95b52
remove smart_open requirement as it's taken care of via Weasel (#13391) 2024-03-22 18:21:20 +01:00
Matthew Honnibal
0518c36f04
Sanitize direct download (#13313)
The 'direct' option in 'spacy download' is supposed to only download from our model releases repository. However, users were able to pass in a relative path, allowing download from arbitrary repositories. This meant that a service that sourced strings from user input and which used the direct option would allow users to install arbitrary packages.
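A sketch of the kind of validation this fix implies (not the actual patch): accept only plain versioned model names for direct downloads and reject anything path-like:

```python
import re

# Hypothetical validator: "en_core_web_sm-3.8.0" is fine, "../evil" is not.
_NAME_RE = re.compile(r"^[a-z0-9_]+-\d+\.\d+\.\d+$")

def validate_direct_name(name: str) -> str:
    if "/" in name or "\\" in name or not _NAME_RE.fullmatch(name):
        raise ValueError(f"unsafe model name for direct download: {name!r}")
    return name
```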
2024-02-20 13:17:51 +01:00
Daniël de Kok
bff8725f4b
Set version to 3.7.4 (#13327) 2024-02-14 14:46:28 +01:00
Daniël de Kok
fdfdbcd9f4
Make Language.pipe workers exit cleanly (#13321)
Also warn when any worker exited with a non-zero exit code and modify
test to ensure that workers exit cleanly by default.
2024-02-12 14:39:38 +01:00
Daniël de Kok
14bd9d89a3
Update example that shows model in requirements (#13302)
See #13293.
2024-02-11 19:46:43 +01:00
Daniël de Kok
e1249d3722
Test if closing explicitly solves recursive lock issues (#13304) 2024-02-05 10:07:03 +01:00
Daniël de Kok
40422ff904
Set version to 3.7.3 (#13301) 2024-02-02 13:51:26 +01:00
Daniël de Kok
2dbb332cea
TextCatParametricAttention.v1: set key transform dimensions (#13249)
* TextCatParametricAttention.v1: set key transform dimensions

This is necessary for tok2vec implementations that initialize
lazily (e.g. curated transformers).

* Add lazily-initialized tok2vec to simulate transformers

Add a lazily-initialized tok2vec to the tests and test the current
textcat models with it.

Fix some additional issues found using this test.

* isort

* Add `test.` prefix to `LazyInitTok2Vec.v1`
2024-02-02 13:01:59 +01:00
Daniël de Kok
d84068e460
Run slow tests: v4 -> main (#13290)
* Run slow tests: v4 -> main

* Also update the branch in GPU tests
2024-01-30 13:58:28 +01:00
Sofie Van Landeghem
89a43f39b7
update universe description (#13291) 2024-01-30 13:49:49 +01:00
Daniël de Kok
68d7841df5
Extension serialization attr tests: add teardown (#13284)
The doc/token extension serialization tests add extensions that are not
serializable with pickle. This didn't cause issues before due to the
implicit run order of tests. However, test ordering has changed with
pytest 8.0.0, leading to failed tests in test_language.

Update the fixtures in the extension serialization tests to do proper
teardown and remove the extensions.
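The teardown pattern described above looks roughly like this (extension name is illustrative):

```python
import pytest
from spacy.tokens import Doc

@pytest.fixture
def doc_with_custom_extension():
    # Register a (non-picklable) extension for the test...
    Doc.set_extension("not_serializable", default=lambda: None)
    yield
    # ...and remove it again so later tests see a clean state.
    Doc.remove_extension("not_serializable")
```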
2024-01-29 13:51:56 +01:00
Eliana Vornov
00e938a7c3
add custom code support to CLI speed benchmark (#13247)
* add custom code support to CLI speed benchmark

* sort imports

* better copying for warmup docs
2024-01-26 13:29:22 +01:00
Sofie Van Landeghem
68b85ea950
Clarify data_path loading for apply CLI command (#13272)
* attempt to clarify additional annotations on .spacy file

* suggestion by Daniël

* pipeline instead of pipe
2024-01-26 12:10:05 +01:00
Sofie Van Landeghem
7496e03a2c
Clarify vocab docs (#13273)
* add line to ensure that apple is in fact in the vocab

* add that the vocab may be empty
2024-01-26 10:58:48 +01:00
Sofie Van Landeghem
a493981163
fix typo (#13254) 2024-01-24 09:29:57 +01:00
Daniël de Kok
a8894a8946
Merge pull request #13240 from mauricesvp/patch-1
Fix typo in method name
2024-01-23 20:49:21 +01:00
Daniël de Kok
afac7fb650
test_find_available_port: use port 5001 (#13255)
macOS now uses port 5000 for the AirPlay receiver functionality, so this
test will always fail on a macOS desktop (unless AirPlay receiver
functionality is disabled like in CI).
2024-01-23 20:11:16 +01:00
Daniël de Kok
5a2ad4af4b Merge remote-tracking branch 'upstream/master' into patch-1 2024-01-23 19:53:20 +01:00
Daniël de Kok
128197a5fc
Properly clean up pipe multiprocessing workers (#13259)
Before this change, the workers of a pipe call with n_process != 1 were
stopped by calling `terminate` on the processes. However, terminating a
process can leave queues, pipes, and other concurrent data structures in
an invalid state.

With this change, we stop using terminate and take the following approach
instead:

* When all documents are processed, the parent process puts a
  sentinel in the queue of each worker.
* The parent process then calls `join` on each worker process to
  let them finish up gracefully.
* Worker processes break from the queue processing loop when the
  sentinel is encountered, so that they exit.

We need special handling when one of the workers encounters an error and
the error handler is set to raise an exception. In this case, we cannot
rely on the sentinel to finish all workers -- the queue is a FIFO queue
and there may be other work queued up before the sentinel. We use the
following approach to handle error scenarios:

* The parent puts the end-of-work sentinel in the queue of each worker.
* The parent closes the reading-end of the channel of each worker.
* Then:
  - If the worker was waiting for work, it will encounter the sentinel
    and break from the processing loop.
  - If the worker was processing a batch, it will attempt to write
    results to the channel. This will fail because the channel was
    closed by the parent and the worker will break from the processing
    loop.
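A generic sketch of the sentinel-based shutdown described above, using a simplified worker loop rather than spaCy's actual pipe implementation:

```python
import multiprocessing as mp

SENTINEL = None  # end-of-work marker

def worker(inbox, outbox):
    while True:
        batch = inbox.get()
        if batch is SENTINEL:
            break  # graceful exit instead of being terminate()d
        try:
            outbox.put([item.upper() for item in batch])
        except (BrokenPipeError, EOFError):
            break  # parent closed the channel after an error

if __name__ == "__main__":
    inbox, outbox = mp.Queue(), mp.Queue()
    proc = mp.Process(target=worker, args=(inbox, outbox))
    proc.start()
    inbox.put(["hello", "world"])
    print(outbox.get())
    inbox.put(SENTINEL)  # ask the worker to stop...
    proc.join()          # ...and let it finish up gracefully
```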
2024-01-23 18:33:04 +01:00
Raphael Mitsch
3b3b5cdc63
Merge pull request #13253 from explosion/chore/sync-master-with-llm_main
Sync `master` with `docs/llm_main`
2024-01-19 16:50:43 +01:00
Raphael Mitsch
575c405ae3 Fix LLM docs on task factories. 2024-01-19 16:48:54 +01:00
Raphael Mitsch
256468c414 Merge branch 'docs/llm_main' into chore/sync-master-with-llm_main
# Conflicts:
#	website/docs/api/large-language-models.mdx
2024-01-19 16:34:35 +01:00
Raphael Mitsch
91c24c0285
Merge pull request #13251 from explosion/docs/llm_develop
Sync `docs/llm_main` with `docs/llm_develop`
2024-01-19 12:56:38 +01:00
maurice
c608baeecc
Fix typo in method name 2024-01-16 21:54:54 +01:00
Raphael Mitsch
0062c22c35
Updated docs w.r.t. infinite doc length changes (#13214)
* Updated docs w.r.t. infinite doc length.

* Fix typo.

* fix typos

* Fix table formatting.

* Update formatting.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2024-01-05 14:20:58 +01:00
Daniël de Kok
e2a3952de5
Add spacy.TextCatParametricAttention.v1 (#13201)
* Add spacy.TextCatParametricAttention.v1

This layer is a simplification of the ensemble classifier that uses
only parametric attention. We have found empirically that with a
sufficient amount of training data, using the ensemble classifier with
BoW does not provide significant improvement in classifier accuracy.
However, plugging in a BoW classifier does reduce GPU training and
inference performance substantially, since it uses a GPU-only kernel.
(A config sketch follows below.)

* Fix merge fallout
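A hedged sketch of what the corresponding model block might look like in a pipeline config, expressed as a Python dict; the tok2vec entry is a placeholder and exact defaults may differ:

```python
# Illustrative only -- check the architecture docs for the canonical config.
model = {
    "@architectures": "spacy.TextCatParametricAttention.v1",
    "exclusive_classes": True,
    "nO": None,  # output size, inferred from the labels at initialization
    "tok2vec": {"@architectures": "spacy.Tok2Vec.v2"},  # placeholder sub-config
}
```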
2024-01-02 10:03:06 +01:00
Daniël de Kok
7ebba86402
Add TextCatReduce.v1 (#13181)
* Add TextCatReduce.v1

This is a textcat classifier that pools the vectors generated by a
tok2vec implementation and then applies a classifier to the pooled
representation. Three reductions are supported for pooling: first, max,
and mean. When multiple reductions are enabled, the reductions are
concatenated before providing them to the classification layer.

This model is a generalization of the TextCatCNN model, which only
supports mean reductions and is a bit of a misnomer, because it can also
be used with transformers. This change also reimplements TextCatCNN.v2
using the new TextCatReduce.v1 layer (a config sketch follows below).

* Doc fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence

* Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy

* Add back a test for TextCatCNN.v2

* Replace TextCatCNN in pipe configurations and templates

* Add an infobox to the `TextCatReduce` section with an `TextCatCNN` anchor

* Add last reduction (`use_reduce_last`)

* Remove non-working TextCatCNN Netlify redirect

* Revert layer changes for the quickstart

* Revert one more quickstart change

* Remove unused import

* Fix docstring

* Fix setting name in error message

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
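A sketch of selecting the new layer for a textcat component; treat the parameter values and the tok2vec sub-config as illustrative rather than canonical:

```python
import spacy

nlp = spacy.blank("en")
config = {
    "model": {
        "@architectures": "spacy.TextCatReduce.v1",
        "exclusive_classes": True,
        # Enable one or more reductions; multiple ones are concatenated.
        "use_reduce_first": False,
        "use_reduce_last": False,
        "use_reduce_max": True,
        "use_reduce_mean": True,
        "nO": None,
        "tok2vec": {
            "@architectures": "spacy.HashEmbedCNN.v2",
            "pretrained_vectors": None, "width": 96, "depth": 4,
            "embed_size": 2000, "window_size": 1, "maxout_pieces": 3,
            "subword_features": True,
        },
    }
}
textcat = nlp.add_pipe("textcat", config=config)
```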
2023-12-21 11:00:06 +01:00
Steven Crowther
764be103bc
update README to include links to GPU processing, LLMs, and the spaCy blog. (#13197)
* Update README.md to include links for GPU processing, LLM, and spaCy's blog.

* Create ojo4f3.md

* corrected README to most current version with links to GPU processing, LLMs, and the spaCy blog.

* Delete .github/contributors/ojo4f3.md

* changed LLM icon

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-12-18 09:49:07 +01:00
Sofie Van Landeghem
56fc3bc0f3
Type documentation fixes for Doc (#13187)
* correct char_span output type - can be None

* unify type of exclude parameter

* black

* further fixes to from_dict and to_dict

* formatting
2023-12-18 09:00:47 +01:00
Ines Montani
7df328fbfe
Update README.md [ci skip] 2023-12-12 10:19:57 +01:00
Raphael Mitsch
d56ee65ddf
Document spacy-llm's TranslationTask (#13183)
* Describe translation task.

* Fix references to examples and template.

* Format.
2023-12-11 17:41:04 +01:00
Raphael Mitsch
e79a9c5acd
Document spacy-llm's RawTask (#13180)
* Add section on RawTask.

* Fix API docs.

* Update website/docs/api/large-language-models.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-12-11 17:14:12 +01:00
Ines Montani
8cfccdd2f8
Update links [ci skip] 2023-12-11 15:51:43 +01:00
Ines Montani
f78b91c03b
Update links [ci skip] 2023-12-11 15:51:01 +01:00
Raphael Mitsch
9fcd2bfa08
Add info on endpoint arg. (#13169) 2023-12-05 12:46:29 +01:00
Raphael Mitsch
a25a3b996b
Merge pull request #13173 from explosion/docs/llm_main
Sync `llm_develop` with `llm_main`
2023-12-04 16:46:21 +01:00
Raphael Mitsch
55ed2b4e82
Add documentation for EL task (#12988)
* Add documentation for EL task.

* Fix EL factory name.

* Add llm_entity_linker_mentio.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Incorporate feedback.

* Format.

* Fix link to KB data.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-12-04 15:23:28 +01:00
Adriane Boyd
e467573550
Docs: update trf_data examples and pipeline design info (#13164) 2023-12-04 15:15:54 +01:00
Raphael Mitsch
0e43fca036
Add Claude-2.1 mention. (#13167) 2023-12-01 16:48:35 +01:00
Daniël de Kok
da7ad97519
Update TextCatBOW to use the fixed SparseLinear layer (#13149)
* Update `TextCatBOW` to use the fixed `SparseLinear` layer

A while ago, we fixed the `SparseLinear` layer to use all available
parameters: https://github.com/explosion/thinc/pull/754

This change updates `TextCatBOW` to `v3` which uses the new
`SparseLinear_v2` layer. This results in a sizeable improvement on a
text categorization task that was tested.

While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent`
option to make it possible to change the hidden size. Ideally, we'd just
have an option called `length`. But the way that `TextCatBOW` uses
hashes results in a non-uniform distribution of parameters when the
length is not a power of two.

* Replace TextCatBOW `length_exponent` parameter by `length`

We now round up the length to the next power of two if it isn't
a power of two (see the small sketch below).

* Remove some tests for TextCatBOW.v2

* Fix missing import
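The rounding behaviour described above amounts to (illustrative sketch):

```python
def next_power_of_two(n: int) -> int:
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

assert next_power_of_two(5) == 8   # rounded up
assert next_power_of_two(8) == 8   # already a power of two
```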
2023-11-29 09:11:54 +01:00
Ines Montani
bf7c2ea99a
Add merch link [ci skip] 2023-11-22 12:55:00 +01:00
Ines Montani
8f69e56a5a Add swag [ci skip] 2023-11-20 14:42:01 +01:00
Lise
b6e022381d
Feature/nn and fo language extensions (#13116)
* add language extensions for norwegian nynorsk and faroese

* update docstring for nn/examples.py

* use relative imports

* add fo and nn tokenizers to pytest fixtures

* add unittests for fo and nn and fix bug in nn

* remove module docstring from fo/__init__.py

* add comments about example sentences' origin

* add license information to faroese data credit

* format unittests using black

* add __init__ files to test/lang/nn and tests/lang/fo

* fix import order and use relative imports in fo/__nit__.py and nn/__init__.py

* Make the tests a bit more compact

* Add fo and nn to website languages

* Add note about jul.

* Add "jul." as exception

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-11-20 07:49:59 +01:00
ajbond
9f2ce6bb00
Add Redfield NLP Nodes to the Spacy Universe (#13133) 2023-11-17 09:48:02 +01:00
Madeesh Kannan
bd2c17e206
Warn about reloading dependencies after downloading models (#13081)
* Update the "Missing factory" error message

This accounts for model installations that took place during the current Python session.

* Add a note about Jupyter notebooks

* Move error to `spacy.cli.download`
Add extra message for Jupyter sessions

* Add additional note for interactive sessions

* Remove note about `spacy-transformers` from error message

* `isort`

* Improve checks for colab (also helps displacy)

* Update warning messages

* Improve flow for multiple checks

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-11-10 08:05:07 +01:00
Raphael Mitsch
b2e831d966
LLM docs: OpenAI model update (#13119)
* Update supported OpenAI models.

* Update with new GPT-3.5 and GPT-4 versions.

* Add links to OpenAI model docs.
2023-11-08 17:55:16 +01:00
Adriane Boyd
513bbd5fa3
Add preferred use of build for package CLI (#13109)
Build with `build` if available. Warn and fall back to previous
`setup.py`-based builds if `build` build fails.
2023-11-08 17:35:24 +01:00
Ridge Kimani
2b8da84717
feat: add extra lexical attributes (#13106)
Co-authored-by: Ridge Kimani <ridgekimani@gmail.com>
2023-11-08 17:29:11 +01:00
Adriane Boyd
0c25725359
Update Tokenizer.explain for special cases with whitespace (#13086)
* Update Tokenizer.explain for special cases with whitespace

Update `Tokenizer.explain` to skip special case matches if the exact
text has not been matched due to intervening whitespace.

Enable fuzzy `Tokenizer.explain` tests with additional whitespace
normalization.

* Add unit test for special cases with whitespace, xfail fuzzy tests again
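For context, Tokenizer.explain reports which rule produced each substring; a quick usage sketch with illustrative text:

```python
import spacy

nlp = spacy.blank("en")
# Each entry pairs the rule/pattern name with the matched substring.
print(nlp.tokenizer.explain("Let's visit N.Y.!"))
```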
2023-11-06 17:29:59 +01:00
Adriane Boyd
ff9ddb6a07
Unskip python 3.12 remote tests (#13110) 2023-11-06 11:59:45 +01:00
Adriane Boyd
c096c5c0c9
Update for numpy 2.0 deprecations (#13103)
- Replace `np.trapz` with vendored `trapezoid` from scipy
- Replace `np.float_` with `np.float64`
2023-11-06 08:47:53 +01:00
Adriane Boyd
92f1d0a195
CI: Switch to stable python 3.12 and limit 3.11 runs (#13104) 2023-11-03 15:46:03 +01:00
Raphael Mitsch
c4e2daf6ef
Fix displacy span stacking (#13068)
* Fix displacy span stacking.

* Format. Remove counter.

* Remove test files.

* Add unit test. Refactor to allow for unit test.

* Fix off-by-one error in tests.
2023-11-02 12:02:18 +01:00
Sofie Van Landeghem
a804b83a4b
Update llm docs to clarify task-specific factories (#13082)
* fix typo

* add examples to specify custom model for task-specific factory
2023-10-31 22:07:07 +01:00
Sofie Van Landeghem
48248c62b6
Clarify EL example in docs (#13071)
* add comment that pipeline is a custom one

* add link to NEL tutorial

* prettier

* revert prettier reformat

* revert prettier reformat (2)

* fix typo

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-10-31 21:58:29 +01:00
Raphael Mitsch
0c15876502
Fix spancat typo. (#13095) 2023-10-31 13:45:10 +01:00
Raphael Mitsch
9deaac9786
Add note in docs on score_weight config if using a non-default spans_key for SpanCat (#13093)
* Add note on score_weight if using a non-default span_key for SpanCat.

* Fix formatting.

* Fix formatting.

* Fix typo.

* Use warning infobox.

* Fix infobox formatting.
2023-10-30 17:02:08 +01:00
Sofie Van Landeghem
d717123819
Update LICENSE (#13078) 2023-10-23 11:59:18 +02:00
Adriane Boyd
a89eae9283
Set version to v3.7.2 (#13066) 2023-10-16 15:10:55 +02:00
Sofie Van Landeghem
699dd8b3b7
Update __all__ fields (#13063)
* update all for pipeline.init

* add all in training.init

* add all in kb.init

* alphabetically
2023-10-16 10:17:47 +02:00
Adriane Boyd
ea1befa8ff
Support Any comparisons for Token and Span (#13058)
* Support Any comparisons for Token and Span

* Preserve previous behavior for None
2023-10-12 11:53:33 +02:00
Raphael Mitsch
d72029d9c8
Add binary examples for Textcat task in spacy-llm (#13051)
* Add examples for binary classification.

* Fix example.

* Remove binary textcat example. Format.

* Rephrase.
2023-10-11 12:23:38 +02:00
Adriane Boyd
77c568e524
Restore spacy.cli.project API (#13053)
* Restore spacy.cli.project API

* Fix typing errors, add simple import test
2023-10-10 15:35:25 +02:00
Ines Montani
65e7bd54f5 Update usage sidebar and nav alert [ci skip] 2023-10-06 14:36:37 +02:00
Ines Montani
b83f1e3724
Inline displaCy visualizations in docs (#13050) [ci skip] 2023-10-06 14:22:43 +02:00
Raphael Mitsch
df07c4734b
Merge pull request #13046 from explosion/docs/llm_main
Sync `docs/llm_develop` with `docs/llm_main`
2023-10-05 16:31:20 +02:00
Raphael Mitsch
030d63ad73
Merge pull request #13045 from explosion/master
Sync `docs/llm_main` with `master`
2023-10-05 16:28:19 +02:00
Raphael Mitsch
be29216fe2
Merge pull request #13044 from explosion/docs/llm_main
Sync `master` with `docs/llm_main`
2023-10-05 16:10:19 +02:00
Raphael Mitsch
1162fcf099
Add Mistral mentions. (#13037) 2023-10-05 14:44:38 +02:00
Raphael Mitsch
862f8254e8
Add docs on Azure OpenAI support in spacy-llm (#13043)
* Add gpt-3.5-turbo-instruct to list of supported OpenAI models.

* Update `spacy-llm` task argument docs w.r.t. task refactoring (#12995)

* Update task arguments w.r.t. task refactoring in 0.5.0.

* Add disclaimer w.r.t. gated models/Llama 2.

* Update website/docs/api/large-language-models.mdx

* Update website/docs/api/large-language-models.mdx

* Update docs w.r.t. PaLM support. (#13018)

* Add info on spacy.Azure.v1.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Format.
2023-10-05 13:18:27 +02:00
Raphael Mitsch
1dec138e61
Update docs w.r.t. PaLM support. (#13018) 2023-10-05 08:50:41 +02:00
Adriane Boyd
6e54360a3d
Remove pathy dependency, update docs for cloudpathlib in Weasel (#13035) 2023-10-05 08:50:22 +02:00
Raphael Mitsch
734826db79
Update spacy-llm task argument docs w.r.t. task refactoring (#12995)
* Update task arguments w.r.t. task refactoring in 0.5.0.

* Add disclaimer w.r.t. gated models/Llama 2.

* Update website/docs/api/large-language-models.mdx

* Update website/docs/api/large-language-models.mdx
2023-10-05 08:45:25 +02:00
Raphael Mitsch
829613b959
Merge pull request #12999 from rmitsch/docs/gpt-3.5-turbo-instruct
Add `gpt-3.5-turbo-instruct` to list of supported OpenAI models
2023-10-05 08:41:07 +02:00
Adriane Boyd
9d036607f1
Set version to v3.7.1 (#13042) 2023-10-04 18:13:12 +02:00
Adriane Boyd
aec59c0088
Merge pull request #13040 from adrianeboyd/revert/12962-spacy-info
Revert "Load the cli module lazily for spacy.info (#12962)"
2023-10-04 17:20:32 +02:00
Adriane Boyd
6d0185f7fb Revert "Load the cli module lazily for spacy.info (#12962)"
This reverts commit beda27a91e.
2023-10-04 12:33:33 +02:00
Adriane Boyd
92ce32aa3f
Update binder version to v3.7 (#13034) 2023-10-02 12:53:46 +02:00
Adriane Boyd
160e61772e
Docs for v3.7.0 (#13029)
* Docs for v3.7.0

* Minor fixes

* Extend Weasel notes

* Minor edits

* Update version in README
2023-10-01 21:40:07 +02:00
Adriane Boyd
0fed2d9e28
Merge pull request #12899 from adrianeboyd/chore/v3.7.0
Reenable model tests for v3.7.0
2023-10-01 21:38:17 +02:00
Adriane Boyd
1b043dde3f Revert "disable tests until 3.7 models are available"
This reverts commit 991bcc111e.
2023-10-01 18:48:31 +02:00
Adriane Boyd
4ec41e98f6
Merge pull request #12979 from adrianeboyd/feature/cython-profile-312
Redesigned cython profiling and other minor updates for python 3.12
2023-09-29 08:23:38 +02:00
Matthew Hoffman
483d4a5bc0
Allow spacy-transformers v1.3.x in transformers extra (#13025)
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-29 08:22:56 +02:00
Adriane Boyd
6b4f774418
Set version to v3.7.0 (#13028) 2023-09-28 21:27:42 +02:00
Adriane Boyd
78504c25a5 CI: Add python 3.12.0rc2 2023-09-28 17:12:42 +02:00
Adriane Boyd
467c82439e Always use tqdm with disable=None
`tqdm` can cause deadlocks in the test suite if enabled.
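With disable=None, tqdm auto-disables itself when the output stream is not a TTY, e.g.:

```python
from tqdm import tqdm

for _ in tqdm(range(3), disable=None):  # no bar when stderr is not a TTY
    pass
```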
2023-09-28 17:12:42 +02:00
Adriane Boyd
b4990395f9 Update mypy requirements 2023-09-28 17:12:42 +02:00
Adriane Boyd
76d94b31f2 Branch on python 3.12+ shutil.rmtree in make_tempdir 2023-09-28 17:09:41 +02:00
Adriane Boyd
1adf79414e Set cython profiling default to True for <3.12, False for >=3.12 2023-09-28 17:09:41 +02:00
Adriane Boyd
538304948e Remove profile=True from currently profiled cython 2023-09-28 17:09:41 +02:00
Adriane Boyd
55614d6799 Add profile=False to currently unprofiled cython 2023-09-28 17:09:41 +02:00
Adriane Boyd
36201cb6a1
Merge pull request #13027 from adrianeboyd/chore/update-develop-from-master-v3.7-1
Update develop from master for v3.7
2023-09-28 16:01:40 +02:00
Adriane Boyd
406794a081 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.7-1 2023-09-28 15:09:06 +02:00
Daniël de Kok
beda27a91e
Load the cli module lazily for spacy.info (#12962)
* Load the cli module lazily for spacy.info

This ensures that the `spacy` module can still be imported when the
user chooses not to install `typer`/`requests`.

* Add test

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-28 11:36:44 +02:00
Sergiu Nisioi
6255e38695
Adding rolegal model to the spaCy universe (#13017)
* adding rolegal model to the spaCy universe

* Fix formatting

* Use raw URL

* update image url and example

* fix pip and update url to raw

* okay, let's add thumb instead of image 🐙

* Update website/meta/universe.json

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-28 11:06:50 +02:00
Madeesh Kannan
b4501db6f8
Update emoji library in rule-based matcher example (#13014) 2023-09-25 18:20:30 +02:00
Adriane Boyd
ff4215f1c7
Drop support for python 3.6 (#13009)
* Drop support for python 3.6

* Update docs
2023-09-25 14:48:38 +02:00
Adriane Boyd
935a5455b6
Docs: add new tag for evaluate CLI --spans-keys (#13013) 2023-09-25 11:49:28 +02:00
Ikko Eltociear Ashimine
ed8c11e2aa
Fix typo in lemmatizer.py (#13003)
specfic -> specific
2023-09-25 11:44:35 +02:00
Eliana Vornov
4e3360ad12
add --spans-key option for CLI spancat evaluation (#12981)
* add span key option for CLI evaluation

* Rephrase CLI help to refer to Doc.spans instead of spancat

* Rephrase docs to refer to Doc.spans instead of spancat

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-25 11:25:41 +02:00
Raphael Mitsch
bef9f63e13 Add gpt-3.5-turbo-instruct to list of supported OpenAI models. 2023-09-21 11:28:58 +02:00
Raphael Mitsch
830eba5426
Merge pull request #12994 from explosion/docs/llm_main
Synch `llm_develop` with `llm_main`
2023-09-20 10:05:40 +02:00
Raphael Mitsch
163ec6fba8
Merge pull request #12993 from explosion/master
Synch `llm_main` with `master`
2023-09-20 10:04:35 +02:00
Sofie Van Landeghem
8f0d6b0a8c
Fix in BertTokenizer docs (#12955)
* fix BertWordPieceTokenizer constructor call

* fix

* Update website/docs/usage/linguistic-features.mdx

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-13 13:21:58 +02:00
Adriane Boyd
36d4767aca
Skip project remotes test for python 3.12 (#12980)
`weasel` (using `cloudpathlib`) does not currently support remote paths
for python 3.12.
2023-09-13 13:16:05 +02:00
Sofie Van Landeghem
013762be41
Few spacy-llm doc fixes (#12969)
* fix construction example

* shorten task-specific factory list

* small edits to HF models

* small edit to API models

* typo

* fix space

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-09-08 11:35:38 +02:00
Sofie Van Landeghem
def7013eec
Docs for spacy-llm 0.5.0 (#12968)
* Update incorrect example config. (#12893)

* spacy-llm docs cleanup (#12945)

* Shorten NER section

* fix template references

* simplify sections

* set temperature to 0.0 in examples

* condense model information

* fix parameters for REST models

* set temperature to 0.0

* spelling fix

* trigger preview

* fix quotes

* add small note on noop.v1

* move up example noop config

* set appropriate model example configs

* explain config

* fix

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Docs for ner.v3 and spancat.v3 spacy-llm tasks (#12949)

* formatting

* update usage table with NER.v3

* fix typo in links

* v3 overview of parameters

* add spancat.v3

* add further v3 explanations

* remove TODO comment

* few more small fixes

* Add doc section on LLM + task factories (#12905)

* Add section on LLM + task factories.

* Apply suggestions from code review

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* add default config to openai models (#12961)

* Docs for spacy-llm 0.5.0 (#12967)

* simplify Python example

* simplify Python example

* Refer only to latest OpenAI model versions from usage doc

* Typo fix

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* clarify accuracy claim

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-09-08 10:25:14 +02:00
Magdalena Aniol
cc78847688
fix training.batch_size example (#12963) 2023-09-06 16:38:13 +02:00
Sofie Van Landeghem
6d1f6d9a23
Fix LLM usage example (#12950)
* fix usage example

* revert back to v2 to allow hot fix on main
2023-09-04 09:05:50 +02:00
Sofie Van Landeghem
5c1f9264c2
fix typo in link (#12948)
* fix typo in link

* fix REL.v1 parameter
2023-09-01 13:47:20 +02:00
David Berenstein
065ead4eed
updated add_pipe docs (#12947) 2023-09-01 11:05:36 +02:00
vincent d warmerdam
3e4264899c
Update large-language-models.mdx (#12944) 2023-08-30 11:58:14 +02:00
Ines Montani
52758e1afa Add headers to netlify.toml [ci skip] 2023-08-30 11:55:23 +02:00
Vinit Ravishankar
c2303858e6
Documentation for spacy-curated-transformers (#12677)
* initial

* initial documentation run

* fix typo

* Remove mentions of Torchscript and quantization

Both are disabled in the initial release of `spacy-curated-transformers`.

* Fix `piece_encoder` entries

* Remove `spacy-transformers`-specific warning

* Fix duplicate entries in tables

* Doc fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove type aliases

* Fix copy-paste typo

* Change `debug pieces` version tag to `3.7`

* Set curated transformers API version to `3.7`

* Fix transformer listener naming

* Add docs for `init fill-config-transformer`

* Update CLI command invocation syntax

* Update intro section of the pipeline component docs

* Fix source URL

* Add a note to the architectures section about the `init fill-config-transformer` CLI command

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update CLI command name, args

* Remove hyphen from the `curated-transformers.mdx` filename

* Fix links

* Remove placeholder text

* Add text to the model/tokenizer loader sections

* Fill in the `DocTransformerOutput` section

* Formatting fixes

* Add curated transformer page to API docs sidebar

* More formatting fixes

* Remove TODO comment

* Remove outdated info about default config

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add link to HF model hub

* `prettier`

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-29 17:52:16 +02:00
PD Hall
d8a32c1050
docs: fix ngram_range_suggester max_size description (#12939) 2023-08-29 11:10:58 +02:00
Sofie Van Landeghem
869cc4ab0b
warn when an unsupported/unknown key is given to the dependency matcher (#12928) 2023-08-22 09:03:35 +02:00
Connor Brinton
6dd56868de
📝 Fix formula for receptive field in docs (#12918)
SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.

For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```

Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is to say, that we have a receptive field of 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolutions adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.

However, this doesn't match the formula currently given in the docs,
which read:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.

Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```

This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.

The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```

These changes update the docs to correct the receptive field formula and
the example receptive field size.
2023-08-21 10:52:32 +02:00
Adriane Boyd
198488ee86
Extend to weasel v0.3 (#12908)
* Extend to weasel v0.3

* Clean up unused imports in test_cli
2023-08-16 17:36:53 +02:00
Adriane Boyd
76a9f9c6c6
Docs: clarify abstract spacy.load examples (#12889) 2023-08-16 17:28:34 +02:00
William Mattingly
64b8ee2dbe
Update universe.json (#12904)
* Update universe.json

added hobbit-spacy to the universe json

* Update universe.json

removed displacy from hobbit-spacy and added a default text.
2023-08-14 16:44:14 +02:00
denizcodeyaa
d50b8d51e2
Update examples.py (#12895)
Add: example sentences to improve the Turkish model. Let's get the tr_web_core_sm out in the world yaa
2023-08-11 15:38:06 +02:00
Adriane Boyd
6a4aa43164
Extend to thinc v8.2 (#12897) 2023-08-11 13:05:46 +02:00
Adriane Boyd
9622c11529
Extend to weasel v0.2 (#12902) 2023-08-11 10:59:51 +02:00
Adriane Boyd
6ef29c4115
Merge pull request #12901 from adrianeboyd/feature/spacy-transformers-v1.3-revert
Revert "Extend to spacy-transformers v1.3.x (#12877)"
2023-08-10 16:43:10 +02:00
Adriane Boyd
060241a8d5 Revert "Extend to spacy-transformers v1.3.x (#12877)"
This reverts commit e5773e0c69.
2023-08-10 11:42:09 +02:00
Adriane Boyd
458bc5f45c
Set version to v3.6.1 (#12892) 2023-08-08 15:04:13 +02:00
Adriane Boyd
c4e378df97
Update CuPy extras (#12890)
* Add `cuda12x` for `cupy-cuda12x`.
* Drop `cuda-autodetect` from quickstart, set default to `cuda11x`
instead.
2023-08-08 12:58:28 +02:00
Adriane Boyd
245e2ddc25
Allow pydantic v2 using transitional v1 support (#12888) 2023-08-08 11:27:28 +02:00
Adriane Boyd
45af8a5dcf
Update br tags (#12882)
* Fix displacy br tag

* Prefer <br>, also update package CLI
2023-08-04 10:52:41 +02:00
Sofie Van Landeghem
3b7faf4f5e
fix (#12881) 2023-08-03 08:37:43 +02:00
Arman Mohammadi
07407e07ab
fix the regular expression matching on the full text (#12883)
There was a mistake in the regex pattern that caused it not to match all the desired tokens. The problem was that when we use the r string literal prefix to denote raw text, we should not use two backslashes to represent a backslash.
2023-08-02 16:52:26 +02:00
Adriane Boyd
e5773e0c69
Extend to spacy-transformers v1.3.x (#12877) 2023-08-02 09:35:16 +02:00
Sofie Van Landeghem
0737443096
feat: add example stubs (3) (#12801)
* feat: add example stubs

* fix: add required annotations

* fix: mypy issues

* fix: use Py36-compatible Protocol

* Minor reformatting

* adding further type specifications and removing internal methods

* black formatting

* widen type to iterable

* add private methods that are being used by the built-in convertors

* revert changes to corpus.py

* fixes

* fixes

* fix typing of PlainTextCorpus

---------

Co-authored-by: Basile Dura <basile@bdura.me>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-08-02 08:15:12 +02:00
Madeesh Kannan
222bd3c5b1
Display model's full base version string in incompatiblity warning (#12857) 2023-08-02 08:06:41 +02:00
Adriane Boyd
0fe43f40f1
Support registered vectors (#12492)
* Support registered vectors

* Format

* Auto-fill [nlp] on load from config and from bytes/disk

* Only auto-fill [nlp]

* Undo all changes to Language.from_disk

* Expand BaseVectors

These methods are needed in various places for training and vector
similarity.

* isort

* More linting

* Only fill [nlp.vectors]

* Update spacy/vocab.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Revert changes to test related to auto-filling [nlp]

* Add vectors registry

* Rephrase error about vocab methods for vectors

* Switch to dummy implementation for BaseVectors.to_ops

* Add initial draft of docs

* Remove example from BaseVectors docs

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/basevectors.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type and lint bpemb example

* Update website/docs/api/basevectors.mdx

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-01 15:46:08 +02:00
Peter Baumgartner
a0a195688f
Tests for CLI app - init config generates train-able config (#12173)
* remove migration support form

* initial test commit

* add fixture

* add combo test

* pull out parameter example data

* fix formatting on examples

* remove unused import

* remove unnecessary fmt:off instructions

* only set logger level if verbose flag is explicitly set

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2023-07-31 14:45:04 +02:00
Andy Friedman
186889ec9c
added entry for SaysWho (#12828)
* Update universe.json

added entry for Sayswho

* Update universe.json

updated sayswho entry

* Update universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-31 10:52:32 +02:00
Sofie Van Landeghem
c9e9dccf79
Add displaCy data structures to docs (2) (#12875)
* Add data structures to docs

* Adjusted descriptions for more consistency

* Add _optional_ flag to parameters

* Add tests and adjust optional title key in doc

* Add title to dep visualizations

* fix typo

---------

Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
2023-07-31 10:47:57 +02:00
Victoria
49055ed7c8
Add cli for finding locations of registered func (#12757)
* Add cli for finding locations of registered func

* fixes: naming and typing

* isort

* update naming

* remove to find-function

* remove file:// bit

* use registry name if given and exit gracefully if a registry was not found

* clean up failure msg

* specify registry_name options

* mypy fixes

* return location for internal usage

* add documentation

* more mypy fixes

* clean up example

* add section to menu

* add tests

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2023-07-31 09:39:00 +02:00
Adriane Boyd
9ffa5d8a15
Remove ray extra (#12870) 2023-07-28 15:48:36 +02:00
Márton Kardos
51b9655470
Added OdyCy to spaCy Universe (#12826)
* Added OdyCy to spaCy Universe

* Replaced template tags

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-26 16:05:53 +02:00
Madeesh Kannan
98799d849e
SpanCat: Remove invalid threshold config argument (#12860) 2023-07-26 13:56:31 +02:00
Adriane Boyd
f8f489bcd6
Switch from distutils to setuptools/sysconfig (#12853)
Additionally remove outdated `is_new_osx` check and settings.
2023-07-24 16:58:27 +02:00
Victoria
e2b89012a2
Add spacy-llm docs to website (#12782)
* initial commit

* update for v0.4.0

* Apply suggestions from code review

* Fix formatting

* Apply suggestions from code review

* Update website/docs/api/large-language-models.mdx

* Update website/docs/api/large-language-models.mdx

* update usage page

* Apply suggestions from review

* Apply suggestions from review

* fix links

* fix relative links

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply suggestions from review

* Add section on Llama 2. Format.

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-07-24 14:44:47 +02:00
Adriane Boyd
1d216a7ea6
Update README for v3.6 (#12844)
* Update most recent release
* Switch from azure to GHA CI tests badge
* Remove link to survey
* Format
2023-07-24 10:41:04 +02:00
Adriane Boyd
5888afa884
Update numpy build constraints for numpy 1.25 (#12839)
* Update numpy build constraints for numpy 1.25

Starting in numpy 1.25 (see
https://github.com/numpy/numpy/releases/tag/v1.25.0), the numpy C API is
backwards-compatible by default.

For python 3.9+, we should be able to drop the specific numpy build
requirements and use `numpy>=1.25`, which is currently
backwards-compatible to `numpy>=1.19`.

In the future, the python <3.9 requirements could be dropped and the
lower numpy pin could correspond to the oldest supported version for the
current lower python pin (see the sketch after this list).

* Turn off fail-fast

* Revert "Turn off fail-fast"

This reverts commit 4306f516bc.

* Update for python 3.6

* Fix typo
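
A hypothetical sketch of the version split described above; the environment-marker syntax is standard PEP 508, but the exact pins are assumptions, not the PR's actual requirements:

```python
# Assumed illustration: expressing python-version-dependent numpy build pins
# with PEP 508 environment markers (pins here are examples, not the real ones).
BUILD_REQUIRES = [
    "numpy>=1.25.0; python_version >= '3.9'",  # backwards-compatible C API
    "numpy==1.19.*; python_version < '3.9'",   # build against the oldest ABI
]
print("\n".join(BUILD_REQUIRES))
```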
2023-07-24 10:32:56 +02:00
Jacobo Myerston
4f8daa4f00
Add Left and Right Pointing Angle Brackets as punctuation to ancient Greek (#12829)
* Update universe.json

* Update universe.json

add some missing commas in greCy's description.

* Update punctuation.py

Add mathematical left and right angle brackets as punctuation for ancient Greek for better tokenization.
2023-07-20 11:16:01 +02:00
Sofie Van Landeghem
ea54d1775a
Merge pull request #12840 from svlandeg/sync_develop
Sync develop
2023-07-19 13:12:51 +02:00
svlandeg
79ec68f01b Merge branch 'upstream_master' into sync_develop 2023-07-19 12:08:52 +02:00
Basile Dura
b0228d8ea6
ci: add cython linter (#12694)
* chore: add cython-linter dev dependency

* fix: lexeme.pyx

* fix: morphology.pxd

* fix: tokenizer.pxd

* fix: vocab.pxd

* fix: morphology.pxd (line length)

* ci: add cython-lint

* ci: fix cython-lint call

* Fix kb/candidate.pyx.

* Fix kb/kb.pyx.

* Fix kb/kb_in_memory.pyx.

* Fix kb.

* Fix training/ partially.

* Fix training/. Ignore trailing whitespaces and too long lines.

* Fix ml/.

* Fix matcher/.

* Fix pipeline/.

* Fix tokens/.

* Fix build errors. Fix vocab.pyx.

* Fix cython-lint install and run.

* Fix lexeme.pyx, parts_of_speech.pxd, vectors.pyx. Temporarily disable cython-lint execution.

* Fix attrs.pyx, lexeme.pyx, symbols.pxd, isort issues.

* Make cython-lint install conditional. Fix tokenizer.pyx.

* Fix remaining files. Reenable cython-lint check.

* Readded parentheses.

* Fix test_build_dependencies().

* Add explanatory comment to cython-lint execution.

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-07-19 12:03:31 +02:00
Adriane Boyd
1509c96694
Clean up unused code in Language (#12836)
Follow-up to #12701.
2023-07-18 14:10:30 +02:00
Adriane Boyd
6bf7c65329
Update matcher pattern validation tests (#12835)
- parametrize over individual token patterns (as originally intended, as
far as I can tell)
- add a test for lowercase `in` in patterns
2023-07-18 10:00:07 +02:00
Adriane Boyd
95075298f5
Update pex Makefile defaults (#12832)
* Update pex Makefile defaults

- switch to python 3.8
- only install spacy-lookups-data for extra packages

* Update website for pex defaults
2023-07-18 09:29:04 +02:00
Ian Thompson
ef20e114e0
Typo fix in Language.replace_listeners docs (#12823)
* modified:   spacy/language.py
	- corrected typo in docstring for :method:`Language.replace_listeners`
	- added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned

modified:   website/docs/api/language.mdx
	- corrected typo in `Language.replace_listeners` markdown

* modified:   spacy/language.py
	- removed noqa comment

---------

Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>
2023-07-14 09:45:54 +02:00
Connor Brinton
0566c3a166
🐛 Escape annotated HTML tags in span renderer (#12817)
These changes add a missing call to `escape_html` in the displaCy span
renderer. Previously span-annotated tokens would be inserted into the
page markup without being escaped, resulting in potentially incorrect
rendering. When I encountered this issue, it resulted in some docs and
span underlines being superimposed on top of properly rendered docs and
span underlines near the beginning of the visualization (due to an
unescaped `<span>` tag).
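
A minimal sketch of the failure mode, using the stdlib's html.escape for illustration:

```python
# Minimal sketch: why unescaped token text breaks the rendered markup
from html import escape

token_text = "<span>"  # literal text of a span-annotated token
unsafe = f"<span class='tok'>{token_text}</span>"        # injects a real tag
safe = f"<span class='tok'>{escape(token_text)}</span>"  # renders as text
print(safe)  # <span class='tok'>&lt;span&gt;</span>
```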
2023-07-13 17:33:05 +02:00
Sofie Van Landeghem
ddffd09602
Trainable lemmatizer docs link (#12795)
* add an anchor to the trainable lemmatizer section

* add requirement for morphologizer,tagger to rule-based lemmatizer

* morphologizer only
2023-07-07 15:18:16 +02:00
Adriane Boyd
1a55661cfb
Update website binder version to v3.6 (#12805) 2023-07-07 10:52:33 +02:00
Adriane Boyd
41dba5bd34
Update max_length default in span finder docs (#12803) 2023-07-07 10:17:41 +02:00
Sofie Van Landeghem
b1b20bf69d
Replace projects functionality with weasel (#12769)
* Setting up weasel branch (#12456)

* remove project-specific functionality

* remove project-specific tests

* remove project-specific schemas

* remove project-specific information in about

* remove project-specific functions in util.py

* remove project-specific error strings

* remove project-specific CLI commands

* black formatting

* restore some functions that are used beyond projects

* remove project imports

* remove imports

* remove remote_storage tests

* remove one more project unit test

* update for PR 12394

* remove get_hash and get_checksum

* remove upload_ and download_file methods

* remove ensure_pathy

* revert clumsy fingers

* reinstate E970

* feat: use weasel as spacy project command (#12473)

* feat: use weasel as spacy project command

* build: use constrained requirement for weasel

* feat: add weasel to the library requirements

* build: update weasel to new version

* build: use specific weasel tag

* build: use weasel-0.1.0rc1 from PyPI

* fix: remove weasel from requirements.txt

* fix: requirements.txt and setup.cfg need to reflect each other

* feat: remove legacy spacy project code

* bump version

* further merge fixes

* isort

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-07-07 09:10:27 +02:00
Sofie Van Landeghem
9e63006b12
Merge pull request #12800 from explosion/master_copy
Sync develop with master
2023-07-07 08:44:19 +02:00
svlandeg
991bcc111e disable tests until 3.7 models are available 2023-07-07 08:09:57 +02:00
Madeesh Kannan
d195923164
Set version to 3.7.0.dev0 (#12799) 2023-07-06 18:29:03 +02:00
svlandeg
d26e4e0849 Revert "feat: add example stubs (#12679)"
This reverts commit 30bb34533a.
2023-07-06 17:02:38 +02:00
Basile Dura
30bb34533a
feat: add example stubs (#12679)
* feat: add example stubs

* fix: add required annotations

* fix: mypy issues

* fix: use Py36-compatible Protocol

* Minor reformatting

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2023-07-06 16:49:43 +02:00
Adriane Boyd
6fc153a266
Merge pull request #12794 from adrianeboyd/chore/v3.6.0-2
Reenable compat+models tests for v3.6.0
2023-07-06 13:22:21 +02:00
Adriane Boyd
4e19ec7eb8
Docs for v3.6.0 (#12792)
* Docs for v3.6.0

* Add sl performance

* Add da trf note
2023-07-06 12:58:25 +02:00
Adriane Boyd
76329e1dde Revert "Temporarily skip download CLI related tests in CI"
This reverts commit 46ce66021a.
2023-07-06 12:48:06 +02:00
Adriane Boyd
a1191146f5 Revert "Temporarily skip tests for compat table"
This reverts commit dd5e00c735.
2023-07-06 12:47:50 +02:00
Adriane Boyd
830dcca367
SpanFinder: set default max_length to 25 (#12791)
When the default `max_length` is not set and there are longer training
documents, it can be difficult to train and evaluate the span finder due
to memory limits and the time it takes to evaluate a huge number of
predicted spans.
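
A minimal sketch, assuming the documented span_finder factory settings:

```python
# Minimal sketch, assuming the "span_finder" factory and its max_length setting
import spacy

nlp = spacy.blank("en")
# capping candidate span length bounds memory use and the number of
# predicted spans that have to be evaluated
nlp.add_pipe("span_finder", config={"max_length": 25})
```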
2023-07-06 09:55:34 +02:00
Madeesh Kannan
8113cfb257
Language.replace_listeners: Pass the replaced listener and the tok2vec pipe to the callback (#12785)
* `Language.replace_listeners`: Pass the replaced listener and the `tok2vec` pipe to the callback

* Update developer docs

* `isort` fixes

* Add error message to assertion

* Add clarification to dev docs

* Replace assertion with exception

* Doc fixes
2023-07-05 13:36:04 +02:00
Sofie Van Landeghem
6f3a71999e
Merge pull request #12784 from explosion/master
Merge `master` into `develop`
2023-07-04 15:05:15 +02:00
Tom Aarsen
eab929361d
Use 'exclude' instead of 'disable' (#12783)
as suggested by @svlandeg
2023-07-04 11:45:13 +02:00
Marcus Blättermann
bd239511a4
Fix problem with missing syntax highlighting languages causing runtime crash on the website (#12781)
* Fix problem with universe pages using `docker` language

* Fix problem with universe pages using `r` language

* Add fallback, in case code language is unknown
2023-07-03 10:24:25 +02:00
Daniël de Kok
57a230c6e4
Remove section about parallel training with Ray (#12770)
The Ray integration is currently broken; having these docs around
suggests that this functionality is currently available.
2023-06-28 17:09:57 +02:00
Adriane Boyd
fb0da3e097
Support custom token/lexeme attribute for vectors (#12625)
* Support custom token/lexeme attribute for vectors

* Fix imports

* Back off to ORTH without Vectors.attr

* Fallback if vectors.attr doesn't exist

* Update docs
2023-06-28 09:43:14 +02:00
Adriane Boyd
337a360cc7
Use spans_ prefix for default span finder scores (#12753) 2023-06-27 19:32:17 +02:00
Adriane Boyd
65f6c9cd10
Support overriding registered functions in configs (#12623)
Support overriding registered functions in configs. Previously the registry name was parsed as a section name rather than as a registry name.
2023-06-27 17:36:33 +02:00
Adriane Boyd
c067b5264c
Address issues with source with component names and replacing listeners (#12701)
When sourcing a component, the object from the original pipeline is added to the new pipeline as the same object. This creates a situation where there are several attributes that cannot be in sync between the original pipeline and the new pipeline at the same time for this one object:

* component.name
* component.listener_map / component.listening_components for tok2vec and transformer

When running replace_listeners on a component, the config is not updated correctly if the state of the component is incorrect for the current pipeline (in particular changes that should be applied from model.attrs["replace_listener_cfg"] as used in spacy-transformers) due to the fact that:

* find_listeners relies on component.name to set the name in the listener_map
* replace_listeners relies on listener_map to determine how to modify the configs

In addition, there are several places where pipeline components are modified and the listener map and/or internal component names aren't currently updated.

In cases where there is a component shared by two pipelines that cannot be in sync, this PR chooses to prioritize the most recently modified or initialized pipeline. There is no actual solution with the current source behavior that will make both pipelines usable, so the current pipeline is updated whenever components are added/renamed/removed or the pipeline is initialized for training.
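
A hedged sketch of the sharing behaviour described above (assumes en_core_web_sm is installed):

```python
# Hedged sketch: a sourced component is the same object in both pipelines
import spacy

source_nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
# attributes like component.name can only be in sync with one pipeline at a
# time, so the most recently modified/initialized pipeline wins
assert nlp.get_pipe("ner") is source_nlp.get_pipe("ner")
```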
2023-06-27 10:47:07 +02:00
Adriane Boyd
e1664217f5
Add spancat_singlelabel to debug data CLI (#12749) 2023-06-26 10:25:20 +02:00
Adriane Boyd
cb4fdc83e4
Merge pull request #12742 from adrianeboyd/chore/v3.6.0
Set version to v3.6.0
2023-06-21 15:34:28 +02:00
Adriane Boyd
34971bcbd1 Set version to v3.6.0 2023-06-21 12:59:36 +02:00
Adriane Boyd
dd5e00c735 Temporarily skip tests for compat table 2023-06-21 12:59:36 +02:00
Sofie Van Landeghem
d3ac8e897c
default value for phrasematcher in pyi (#12714) 2023-06-21 10:10:13 +02:00
Tom Aarsen
93983f08fc
Add SpanMarker for NER to spaCy universe (#12730)
* Add SpanMarker for NER to spaCy universe

* Escape the newlines in the text in the code example

Or at least, attempt to

* Remove now unnecessary import

* Disable NER pipeline component in code example
2023-06-20 16:47:44 +02:00
David Berenstein
53c400bd7a
docs: added reference to spacy-setfit to the spaCy Universe (#12737)
* docs: added reference to spacy-setfit

* removed package import after adding factory entry points to packages
2023-06-19 15:52:07 +02:00
Ziad Amerr
3125b97ace
Fixed e941 link rendering by removing the dot (#12735) 2023-06-19 13:31:08 +02:00
Marcus Blättermann
7e4b38c841
Fix #12716 does not update the config generation section (#12718)
This is a really odd bug, where Firefox doesn't re-render the `code` element, even though `children` changed.

Two things fixed that:
- remove the `language-ini` `className`
- replace the `code` block with a `div`

Neither is ideal. Therefore this solution adds an inner `div` that now has the classes while still maintaining the semantic `code` element.

I couldn't find any explanation for why this is happening and why it only happens in Firefox. I assume it is a bug caused by one of our many dependencies (or their interplay)

To make matters worse: This bug *doesn't* occur when running the site in dev mode. You have to build and serve the site to recreate it.
2023-06-19 09:34:28 +02:00
Daniël de Kok
e73c1a89bf
CI: add isort --check to validate job (#12727) 2023-06-15 23:10:25 +01:00
Daniël de Kok
e2b70df012
Configure isort to use the Black profile, recursively isort the spacy module (#12721)
* Use isort with Black profile

* isort all the things

* Fix import cycles as a result of import sorting

* Add DOCBIN_ALL_ATTRS type definition

* Add isort to requirements

* Remove isort from build dependencies check

* Typo
2023-06-14 17:48:41 +02:00
Jacobo Myerston
daa6e0339f
Update universe.json (#12709)
* Update universe.json

* Update universe.json

add some missing commas in greCy's description.
2023-06-12 13:55:20 +02:00
Sofie Van Landeghem
d65e3c31a6
use system-independent commands (#12693) 2023-06-08 11:43:36 +02:00
Adriane Boyd
0f9d2b01fb
Set version v3.6.0.dev1 (#12703) 2023-06-07 16:23:14 +02:00
kadarakos
c003aac29a
SpanFinder into spaCy from experimental (#12507)
* span finder integrated into spacy from experimental

* black

* isort

* black

* default spankey constant

* black

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* rename

* rename

* max_length and min_length as Optional[int] and strict checking

* black

* mypy fix for integer type infinity

* revert line order

* implement all comparison operators for inf int

* avoid two for loops over all docs by not precomputing

* interleave thresholding with span creation

* black

* revert to not interleaving (realized it's faster)

* black

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update docstring

* enforce that the gold and predicted documents have the same text

* new error for ensuring reference and predicted texts are the same

* remove todo

* adjust test

* black

* handle misaligned tokenization

* return correct variable

* failing overfit test

* only use a single spans_key like in spancat

* black

* remove debug lines

* typo

* remove comment

* remove a near-duplicate redundant method

* use the 'spans_key' variable name everywhere

* Update spacy/pipeline/span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* flaky test fix suggestion, hand set bias terms

* only test suggester and test result exhaustively

* make it clear that the span_finder_suggester is more general (not specific to span_finder)

* Update spacy/tests/pipeline/test_span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

* remove question comment

* move preset_spans_suggester test to spancat tests

* Add docs and unify default configs for spancat and span finder

* Add `allow_overlap=True` to span finder scorer

* Fix offset bug in set_annotations

* Ignore labels in span finder scorer

* Format

* Add span_finder to quickstart template

* Move settings to self.cfg, store min/max unset as None

* Remove debugging

* Update docstrings and docs

* Update spacy/pipeline/span_finder.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix imports

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-06-07 15:52:28 +02:00
Basile Dura
c3c064ace4
fix: InitializableComponent type hints (#12692)
* fix: InitializableComponent type hints

* fix: avoid circular dependency

* style: clean imports in language.py

* style: use relative imports

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* fix: apply black

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-06-02 14:29:52 +02:00
Adriane Boyd
c4112a1da3
Require that all SpanGroup spans are from the current doc (#12569)
* Require that all SpanGroup spans are from the current doc

The restriction on only adding spans from the current doc were already
implemented for all operations except for `SpanGroup.__init__`.

Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in
order to validate the character offsets and to make it possible to copy
spans between documents with differing tokenization. Currently there is
no validation that the document texts are identical, but the span char
offsets must be valid spans in the target doc, which prevents you from
ending up with completely invalid spans.

* Undo change in test_beam_overfitting_IO
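
A minimal sketch of the restriction, assuming SpanGroup's documented API:

```python
# Minimal sketch: all spans in a SpanGroup must come from the owning Doc
import spacy
from spacy.tokens import SpanGroup

nlp = spacy.blank("en")
doc = nlp("All spans must come from this doc.")
other = nlp("A different doc.")

doc.spans["demo"] = SpanGroup(doc, name="demo", spans=[doc[0:2]])  # OK
# SpanGroup(doc, spans=[other[0:1]]) would now raise, since that span belongs
# to a different Doc; SpanGroup.copy(doc=...) instead re-validates the char
# offsets via Doc.char_span.
```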
2023-06-01 19:19:17 +02:00
Isabel Zimmerman
05df59fd4a
[DOCS] add vetiver to spacy universe (#12557)
* add vetiver to spacy universe

* remove image

* update logo to render correctly in thumbnail

* apply Basile's suggestion

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* refer to the same model

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-06-01 17:11:18 +02:00
Adriane Boyd
c936db2faf
Address numpy 1.25 deprecations in test suite (#12684)
* Address upcoming numpy v1.25 deprecations in test suite

* Temporarily test most recent numpy prerelease in CI

* Revert "Temporarily test most recent numpy prerelease in CI"

This reverts commit d75a66e55e.
2023-05-31 17:23:07 +02:00
Adriane Boyd
9b7a59c325
Revert "CI: Disable fail-fast (#12658)" (#12676)
This reverts commit 1f088cbf4a.
2023-05-26 10:57:02 +02:00
Vinit Ravishankar
f0e0206b77
update universe for spacypdfreader (#12661) 2023-05-23 13:28:48 +02:00
Adriane Boyd
1f088cbf4a
CI: Disable fail-fast (#12658)
While the typing_extensions/pydantic `Literal` bugs are being sorted
out, disable fail-fast so the rest of the CI is available for
development purposes.
2023-05-23 10:48:06 +02:00
Basile Dura
6ea4155487
feat: add comparison operators in span.pyi (#12652)
* feat: add comparison operators in span.pyi

remove Cython-specific `__richcmp__`

* fix: comparison operators should be defined for any other object
2023-05-23 08:50:37 +02:00
Victoria
6930a6bf45
Add spaCy VSCode extension materials (#12592) 2023-05-19 14:38:53 +02:00
Basile Dura
95fd46b1dd
feat: add type hinting on SpanGroup.__iter__ (#12642) 2023-05-17 14:20:00 +02:00
Adriane Boyd
df083f91a5
Add Malay to website languages (#12643) 2023-05-17 13:13:43 +02:00
Sani
873c16a4df
Malay language support (#12602)
* add malay lang

* fix token len

* black format

* reformat conftest malay

* remove exceptions not exist in dbp

* format code
2023-05-17 12:45:21 +02:00
Lj Miranda
58779c24ef
Remove shorthand for output-file in spacy apply (#12636)
The output-file argument is positional, so it can't use a shorthand like -o.
2023-05-17 12:36:29 +02:00
David Berenstein
83b6f488cb
universe: Update examples for Adept Augmentation (#12620)
* Update universe.json

* chore: changed readme example as suggested by Vincent Warmerdam (koaning)
2023-05-15 14:09:33 +02:00
Adriane Boyd
3dc445df8d
Fix new tags in docs for v3.5.x (#12629)
* Fix new tags in docs for v3.5.x

* Fix new tag
2023-05-15 12:06:58 +02:00
Basile Dura
2dd8825f09
docs: add comment on offset_x argument (#12630) 2023-05-15 11:42:47 +02:00
Basile Dura
f96b9e03df
build: bump typer version to accept >=0.3<0.10 (#12631) 2023-05-15 08:06:58 +02:00
Adriane Boyd
3637148c4d
Add scorer option to return per-component scores (#12540)
* Add scorer option to return per-component scores

Add `per_component` option to `Language.evaluate` and `Scorer.score` to
return scores keyed by `tokenizer` (hard-coded) or by component name.

Add option to `evaluate` CLI to score by component. Per-component scores
can only be saved to JSON.

* Update help text and messages
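
A hedged sketch of the new option (assumes en_core_web_sm is installed):

```python
# Hedged sketch: scores keyed by "tokenizer" and by component name
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("Apple is looking at a startup.")
example = Example.from_dict(doc, {"entities": [(0, 5, "ORG")]})

scores = nlp.evaluate([example], per_component=True)
print(list(scores))  # e.g. ["tokenizer", "tok2vec", "tagger", ...]
```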
2023-05-12 15:36:54 +02:00
Kenneth Enevoldsen
88680a6eed
docs: remove invalid huggingface-hub push argument (#12624) 2023-05-12 09:40:28 +02:00
Adriane Boyd
b5af0fe836
Revert "Use Latin normalization for Serbian attrs (#12608)" (#12621)
This reverts commit 6f314f99c4.

We are reverting this until we can support this normalization more
consistently across vectors, training corpora, and lemmatizer data.
2023-05-11 11:54:16 +02:00
royashcenazi
3252f6b13f
Parsigs universe 3 (#12617)
* parsigs universe

* added model installation explanation in the description

* Update website/meta/universe.json

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* added model installation instruction in the code example

* added biomedical category

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:49:51 +02:00
royashcenazi
a56ab98e3c
parsigs universe (#12616)
* parsigs universe

* added model installation explanation in the description

* Update website/meta/universe.json

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* added model installation instruction in the code example

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:19:28 +02:00
David Berenstein
d11b549195
chore: added adept-augmentations to the spacy universe (#12609)
* chore: added adept-augmentations to the spacy universe

* Apply suggestions from code review

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* Update universe.json

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:16:16 +02:00
Patrick J. Burns
15f16db6ca
Fix typo (#12615) 2023-05-09 15:52:34 +02:00
Patrick J. Burns
eb3960a15a
Add LatinCy models to universe.json (#12597)
* Add LatinCy models to universe.json

* Update website/meta/universe.json

Add install code for LatinCy models to 'code_example'

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update LatinCy ‘code_example’ in website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-05-09 12:02:45 +02:00
Adriane Boyd
1279b464bb
In initialize only calculate current vectors hash if needed (#12607) 2023-05-08 16:51:58 +02:00
Adriane Boyd
6f314f99c4
Use Latin normalization for Serbian attrs (#12608)
* Use Latin normalization for Serbian attrs

Use Latin normalization for Serbian `NORM`, `PREFIX`, and `SUFFIX`.

* Update NORMs in tokenizer exceptions and related tests

* Add tests for all custom lex attrs

* Remove unused imports
2023-05-08 12:33:56 +02:00
Adriane Boyd
cbc6bcf434
Merge pull request #12604 from adrianeboyd/chore/v3.6.0.dev0
Set version to v3.6.0.dev0
2023-05-08 10:05:15 +02:00
Adriane Boyd
46ce66021a Temporarily skip download CLI related tests in CI 2023-05-08 09:17:33 +02:00
Adriane Boyd
fbd12eb4a4 Set version to v3.6.0.dev0 2023-05-08 09:10:35 +02:00
Adriane Boyd
dbc71ecd44
Remove #egg from download URLs (#12567)
The current URLs will become invalid in pip 25.0. According to the pip
docs, the egg= URLs are currently only needed for editable VCS installs.
2023-05-04 17:13:12 +02:00
Kenneth Enevoldsen
73698326df
Update inmemorylookupkb.mdx (#12586)
Example does not refer to the in-memory lookup
2023-05-02 12:51:13 +02:00
Lj Miranda
298e6036b7
Add spans in spacy benchmark (#12575)
* Add spans in spacy benchmark

The current implementation of spaCy benchmark accuracy / spacy evaluate
doesn't include the "spans" type, so calling the command doesn't render
the HTML displaCy file needed.

This PR attempts to fix that by creating a new parameter for "spans"
and calling the appropriate displaCy value.

* Reformat file with black

* Add tests for evaluate

* Fix spans -> span for displacy style

* Update test to check render instead

* Update source so mypy passes

* Add parser information to avoid warnings
2023-04-28 14:32:52 +02:00
Adriane Boyd
6817e3d372
CI: Only run test suite once with thinc-apple-ops for macos python 3.11 (#12436)
* CI: Only run test suite once with thinc-apple-ops for macos python 3.11

* Adjust syntax

* Try alternate syntax

* Try alternate syntax

* Try alternate syntax
2023-04-28 14:29:51 +02:00
kadarakos
34d1164b0e
Spancat speed improvement (#12577)
* avoid nesting then flattening

* mypy fix

* Apply suggestions from code review

* Add type for indices

* Run full matrix for mypy

* Add back modified type: ignore

* Revert "Run full matrix for mypy"

This reverts commit e218873d04.

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-27 15:27:13 +02:00
Victoria
a8dfc66135
Add spacy-wasm to universe (#12572)
* add spacy-wasm to universe

* add tag
2023-04-26 14:18:40 +02:00
moxley01
070fa16545
add spacysee project (#12568) 2023-04-25 12:30:19 +02:00
Adriane Boyd
68da580a4c
CI: Disable Azure (#12560) 2023-04-21 15:05:53 +02:00
Victoria
e115408514
remove survey link (#12559) 2023-04-21 10:22:26 +02:00
Patrick J. Burns
ab4ba04c32
Update LatinDefaults for lang 'la' (#12538)
* Add noun chunking to la syntax iterators

* Expand list of numeral, ordinal words

* Expand abbreviations in la tokenizer_exceptions

* Add example sents

* Update spacy/lang/la/syntax_iterators.py

Reorganize la syntax iterators

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Minor updates based on review

* fix call

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-04-20 16:55:40 +02:00
Adriane Boyd
b60b027927
Add default option to MorphAnalysis.get (#12545)
* Add default to MorphAnalysis.get

Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for
the user to provide a default return value if the field is not found.
The default return value remains `[]`, which is not the same as
`dict.get`, but is already established as this method's default return
value with the return type `List[str]`. However the new `default` option
does not enforce that the user-provided default is actually `List[str]`.

* Restore test case
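
A minimal sketch of the new `default` argument:

```python
# Minimal sketch of MorphAnalysis.get with the new `default` argument
import spacy

nlp = spacy.blank("en")
token = nlp("cats")[0]  # no morphology set in a blank pipeline
print(token.morph.get("Number"))                    # [] (established default)
print(token.morph.get("Number", default=["Sing"]))  # user-provided fallback
```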
2023-04-20 14:06:32 +02:00
Adriane Boyd
dc0a1a9808
Load exceptions last in Tokenizer.from_bytes (#12553)
In `Tokenizer.from_bytes`, the exceptions should be loaded last so that
they are only processed once as part of loading the model.

The exceptions are tokenized as phrase matcher patterns in the
background and the internal tokenization needs to be synced with all the
remaining tokenizer settings. If the exceptions are not loaded last,
there are speed regressions for `Tokenizer.from_bytes/disk` vs.
`Tokenizer.add_special_case` as the caches are reloaded more than
necessary during deserialization.
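
A hedged sketch contrasting the two paths:

```python
# Hedged sketch: deserialize special cases in one go (synced once at the end)
# rather than re-adding them one by one, which rebuilds the caches repeatedly
import spacy

nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("dont", [{"ORTH": "do"}, {"ORTH": "nt"}])

nlp2 = spacy.blank("en")
nlp2.tokenizer.from_bytes(nlp.tokenizer.to_bytes())  # exceptions loaded last
```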
2023-04-20 11:30:34 +02:00
Sofie Van Landeghem
8e6a3d58d8
fix typo (#12543) 2023-04-19 10:59:33 +02:00
TAN Long
923d24e885
perf(REL_OP): Replace some token.children with token.rights or token.lefts (#12528)
Co-authored-by: Tan Long <tanloong@foxmail.com>
2023-04-17 13:16:34 +02:00
TAN Long
119f959218
docs(REL_OP): modify docs for REL_OPs to match Semgrex's update on CoreNLP v4.5.2 (#12531)
Co-authored-by: Tan Long <tanloong@foxmail.com>
2023-04-17 13:14:01 +02:00
andyjessen
02259fa195
Add category to spaCy project (#12506)
ScispaCy fits within the biomedical domain. Consider adding this category.
2023-04-07 15:31:04 +02:00
Madeesh Kannan
6db20b354f
Docs: Fix rule-based matching example that expands named entities (#12495) 2023-04-06 11:45:58 +02:00
Edward
c95d320d28
Add more information to custom code docs (#12491)
* Add info to sections

* Update website/docs/usage/training.mdx

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-06 11:45:19 +02:00
Will Frey
8d4129e177
Fix invalid ConsoleLogger.v3 example config (#12498)
Replace `progress_bar = "all_steps"` with `progress_bar = "eval"`, which is consistent with the default behavior for `spacy.ConsoleLogger.v1` and `spacy.ConsoleLogger.v2`.
2023-04-04 20:53:07 +02:00
Edward
de32011e4c
Add model-last saving mechanism to pretraining (#12459)
* Adjust pretrain command

* change naming and add finally block

* Add unit test

* Add unit test assertions

* Update spacy/training/pretrain.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* change finally block

* Add to docs

* Update website/docs/usage/embeddings-transformers.mdx

* Add flag to skip saving model-last

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-03 15:24:03 +02:00
Adriane Boyd
4a1ec332de
Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set (#12493)
* Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set

* Format
2023-04-03 15:11:12 +02:00
Adriane Boyd
4538ceb507
Remove redundant strings.add for Doc.char_span (#12429) 2023-04-03 11:38:56 +02:00
Adriane Boyd
476a2e7a0a
Allow cupy 12.0 for extras (#12490) 2023-03-31 13:48:15 +02:00
Adriane Boyd
69e20ce03d
Fix pickle for ngram suggester (#12486) 2023-03-31 13:43:51 +02:00
Adriane Boyd
140d53649d
Convert values to numpy for label smoothing tests (#12472) 2023-03-31 13:41:41 +02:00
Ye Lei (叶磊)
ce258670b7
Allow passing a Span to displacy.parse_deps (#12477)
* Allow passing a Span to displacy.parse_deps

* Update docstring

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update API docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
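
A hedged sketch (assumes a pipeline with a parser is installed):

```python
# Hedged sketch: displacy.parse_deps now accepts a Span as well as a Doc
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is one sentence. This is another one.")
deps = displacy.parse_deps(doc[5:10])  # e.g. just the second sentence
print(deps["words"])
```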
2023-03-31 09:44:01 +02:00
Raphael Mitsch
d85df9d577
Fix Span.sents for edge case of Span being the only Span in the last sentence of a Doc. (#12484) 2023-03-29 18:54:47 +02:00
kadarakos
372a90885e
Fix spancat-singlelabel score (#12469)
* debug argmax sort and add span scores

* add missing tests for spanscores
2023-03-29 08:38:11 +02:00
Edward
dba4e7bece
Add info to stringstore and vocab (#12471) 2023-03-27 13:15:14 +02:00
Adriane Boyd
2fba21be63
Restrict github workflows to explosion (#12470) 2023-03-27 12:44:04 +02:00
sloev / Johannes Valbjørn
fd072533e7
add spacy_onnx_sentiment_english to universe (#12422)
* add spacy_onnx_sentiment_english to universe

* rename to sentimental-onix

* fix comma json error

* fix typo

* typo fix

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* mention need to download model before example works

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-27 11:35:14 +02:00
Prajakta Darade
ae7779e830
corrected example code (#12466) 2023-03-27 11:32:49 +02:00
kadarakos
d1474fdd91
add explanation about overwriting behaviour (#12464)
* add explanation about overwriting behaviour

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* format

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-27 10:27:11 +02:00
Adriane Boyd
fac457a509
Support floret for PretrainVectors (#12435)
* Support floret for PretrainVectors

* Format
2023-03-24 16:28:51 +01:00
Adriane Boyd
d0bd3f5ee4
Update Serbian tokenization for UD Serbian SET (#12442) 2023-03-24 16:26:40 +01:00
Vinit Ravishankar
28de85737f
Tagger label smoothing (#12293)
* add label smoothing

* use True/False instead of floats

* add entropy to debug data

* formatting

* docs

* change test to check difference in distributions

* Update website/docs/api/tagger.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/tagger.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* bool -> float

* update docs

* fix seed

* black

* update tests to use label_smoothing = 0.0

* set default to 0.0, update quickstart

* Update spacy/pipeline/tagger.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update morphologizer, tagger test

* fix morph docs

* add url to docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
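
A hedged sketch, treating label_smoothing as a tagger factory setting per the bullets above:

```python
# Hedged sketch; label_smoothing assumed to be a tagger config setting
# (default 0.0, i.e. disabled)
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger", config={"label_smoothing": 0.05})
```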
2023-03-22 12:17:56 +01:00
Ines Montani
b479f8bfa5
Add user survey alert to the top (#12452)
* Add user survey alert to the top

* Shorter

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-22 11:09:37 +01:00
Adriane Boyd
54c614e116
CI: Separate spacy universe validation into a separate workflow (#12440)
* Separate spacy universe validation into a separate workflow

* Fix new workflow name
2023-03-17 10:59:53 +01:00
Adriane Boyd
5f72d6c836
CI: Switch PR back to paths-ignore (#12438)
Switch PR tests back to paths-ignore but include changes to `.github`
for all PRs rather than trying to figure out complicated
includes+excludes.  Changes to `.github` are relatively rare and should
not be a huge burden for the CI.
2023-03-17 10:01:49 +01:00
Adriane Boyd
4c5a3a2a7b
Remove autoblack workflow (#12437)
Now that all PRs have `black` formatting validation, we no longer need the
autoblack workflow.
2023-03-17 09:35:00 +01:00
Raphael Mitsch
96b61d0671
Fix EL failure with sentence-crossing entities (#12398)
* Add test reproducing EL failure in sentence-crossing entities.

* Format.

* Draft fix.

* Format.

* Fix case for len(ent.sents) == 1.

* Format.

* Format.

* Format.

* Fix mypy error.

* Merge EL sentence crossing tests.

* Remove unneeded sentencizer component.

* Fix or ignore mypy issues in test.

* Simplify ent.sents handling.

* Format. Update assert in ent.sents handling.

* Small rewrite

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-14 22:02:49 +01:00
Adriane Boyd
2ce9a220db
Fix --verbose for spacy find-threshold (#12418) 2023-03-14 17:16:49 +01:00
Adriane Boyd
377f601bff
CI: Add all paths before excluding patterns (#12419) 2023-03-14 16:06:08 +01:00
Raphael Mitsch
e8cab4625c
Fix sentence indexing bug in Span.sents (#12405)
* Add test for partial sentences in ent.sents.

* Removed unneeded import.

* Format. Simplify code.
2023-03-14 10:21:53 +01:00
Adriane Boyd
ea6de64596
CI: Move CLI tests to ubuntu for speed (#12409) 2023-03-13 15:14:46 +01:00
Adriane Boyd
8f1280a514
Fix thinc-apple-ops test to run for python 3.11 (#12408) 2023-03-13 15:10:04 +01:00
Adriane Boyd
8ff9073161
CI: Move universe validation to validate job (#12406)
* CI: Move universe validation to validate job

* Fix indentation

* Update step name
2023-03-13 14:21:17 +01:00
Adriane Boyd
3c999f052e
Add GHA for CI tests (#12403)
* Add GHA for CI tests

* Reorder paths
2023-03-13 13:13:47 +01:00
Adriane Boyd
f27bce67fd
Skip project clone tests if git is not available (#12394) 2023-03-09 16:41:21 +01:00
Lj Miranda
913d74f509
Add spancat_singlelabel pipeline for multiclass and non-overlapping span labelling tasks (#11365)
* [wip] Update

* [wip] Update

* Add initial port

* [wip] Update

* Fix all imports

* Add spancat_exclusive to pipeline

* [WIP] Update

* [ci skip] Add breakpoint for debugging

* Use spacy.SpanCategorizer.v1 as default archi

* Update spacy/pipeline/spancat_exclusive.py

Co-authored-by: kadarakos <kadar.akos@gmail.com>

* [ci skip] Small updates

* Use Softmax v2 directly from thinc

* Cache the label map

* Fix mypy errors

However, I ignored line 370 because it opened up a bunch of type errors
that might be trickier to solve and might lead to a more complicated
codebase.

* avoid multiplication with 1.0

Co-authored-by: kadarakos <kadar.akos@gmail.com>

* Update spacy/pipeline/spancat_exclusive.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update component versions to v2

* Add scorer to docstring

* Add _n_labels property to SpanCategorizer

Instead of using len(self.labels) in initialize() I am using a private
property self._n_labels. This achieves implementation parity and allows
me to delete the whole initialize() method for spancat_exclusive (since
it's now the same with spancat).

* Inherit from SpanCat instead of TrainablePipe

This commit changes the inheritance structure of Exclusive_Spancat:
it now inherits from SpanCategorizer rather than from TrainablePipe. This
allows me to remove duplicate methods that are already present in
the parent class.

* Revert documentation link to spancat

* Fix init call for exclusive spancat

* Update spacy/pipeline/spancat_exclusive.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Import Suggester from spancat

* Include zero_init.v1 for spancat

* Implement _allow_extra_label to use _n_labels

To ensure that spancat / spancat_exclusive cannot be resized after
initialization, I inherited the _allow_extra_label() method from
spacy/pipeline/trainable_pipe.pyx and used self._n_labels instead
of len(self.labels) for checking.

I think that changing it locally is a better solution rather than
forcing each class that inherits TrainablePipe to use the self._n_labels
attribute.

Also note that I turned off black formatting in this block of code
because it reads better without the overhang.

* Extend existing tests to spancat_exclusive

In this commit, I extended the existing tests for spancat to include
spancat_exclusive. I parametrized the test functions with 'name'
(similar var name with textcat and textcat_multilabel) for each
applicable test.

TODO: Add overfitting tests for spancat_exclusive

* Update documentation for spancat

* Turn on formatting for allow_extra_label

* Remove initializers in default config

* Use DEFAULT_EXCL_SPANCAT_MODEL

I also renamed spancat_exclusive_default_config to
spancat_excl_default_config because black makes some unattractive
formatting changes.

* Update documentation

Update grammar and usage

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Clarify docstring for Exclusive_SpanCategorizer

* Remove mypy ignore and typecast labels to list

* Fix documentation API

* Use a single variable for tests

* Update defaults for number of rows

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Put back initializers in spancat config

Whenever I remove model.scorer.init_w and model.scorer.init_b,
I encounter an error in the test:

    SystemError: <method '__getitem__' of 'dict' objects> returned a result
    with an error set.

My Thinc version is 8.1.5, but I can't seem to check what's causing the
error.

* Update spancat_exclusive docstring

* Remove init_W and init_B parameters

This commit is expected to fail until the new Thinc release.

* Require thinc>=8.1.6 for serializable Softmax defaults

* Handle zero suggestions to make tests pass

I'm not sure if this is the most elegant solution. But what should
happen is that the _make_span_group function MUST return an empty
SpanGroup if there are no suggestions.

The error happens when the 'scores' variable is empty. We cannot
get the 'predicted' and other downstream vars.

* Better approach for handling zero suggestions

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spancategorizer headers

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add default value in negative_weight in docs

* Add default value in allow_overlap in docs

* Update how spancat_exclusive is constructed

In this commit, I added the following:
- Put the default values of negative_weight and allow_overlap
    in the default_config dictionary.
- Rename make_spancat -> make_exclusive_spancat

* Run prettier on spancategorizer.mdx

* Change exactly one -> at most one

* Add suggester documentation in Exclusive_SpanCategorizer

* Add suggester to spancat docstrings

* merge multilabel and singlelabel spancat

* rename spancat_exclusive to singlelabel

* wire up different make_spangroups for single and multilabel

* black

* black

* add docstrings

* more docstring and fix negative_label

* don't rely on default arguments

* black

* remove spancat exclusive

* replace single_label with add_negative_label and adjust inference

* mypy

* logical bug in configuration check

* add spans.attrs[scores]

* single label make_spangroup test

* bugfix

* black

* tests for make_span_group with negative labels

* refactor make_span_group

* black

* Update spacy/tests/pipeline/test_spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* remove duplicate declaration

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* raise error instead of just print

* make label mapper private

* update docs

* run prettier

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* don't keep recomputing self._label_map for each span

* typo in docs

* Intervals to private and document 'name' param

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* add Tag to new features

* replace tags

* revert

* revert

* revert

* revert

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* prettier

* Fix merge

* Update website/docs/api/spancategorizer.mdx

* remove references to 'single_label'

* remove old paragraph

* Add spancat_singlelabel to config template

* Format

* Extend init config tests

---------

Co-authored-by: kadarakos <kadar.akos@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
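
A minimal sketch of the merged component, assuming the factory name above:

```python
# Minimal sketch, assuming the "spancat_singlelabel" factory name
import spacy

nlp = spacy.blank("en")
# at most one label per span, with an optional internal negative label
nlp.add_pipe("spancat_singlelabel", config={"spans_key": "sc"})
```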
2023-03-09 10:30:59 +01:00
Victoria
4fdf356b29
Add links in website and readme for survey (#12385) 2023-03-09 10:01:18 +01:00
Marcus Blättermann
b309336712
Make sure to run Python setup before NPM dev mode (#12384) 2023-03-08 11:59:10 +01:00
Paul O'Leary McCann
e656189ec3
Change GPU efficient textcat to use CNN, not BOW in generated configs (#11900)
* Change GPU efficient textcat to use CNN, not BOW

If you generate a config with a textcat component using GPU
(transformers), the default option (efficiency) uses a BOW architecture,
which does not use tok2vec features. While that can make sense as part
of a larger pipeline, in the case of just a transformer and a textcat,
that means the transformer is doing a lot of work for no purpose.

This changes it so that the CNN architecture is used instead. It could
also be changed to be the same as the accuracy config, which uses the
ensemble architecture.

* Add the transformer when using a textcat with GPU

* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)

* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6

* Require thinc v8.1.7

* Require thinc v8.1.8

* Break up longer expression

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-07 17:47:45 +01:00
Sofie Van Landeghem
3bf4539e31
fix types (#12365) 2023-03-07 13:29:08 +01:00
Adriane Boyd
260cb9c6fe
Raise error for non-default vectors with PretrainVectors (#12366) 2023-03-06 18:06:31 +01:00
Adriane Boyd
5ecb3babed
Update to use absolute imports in tests (#12372) 2023-03-06 17:30:17 +01:00
Adriane Boyd
0bbc620dd8
Partially work around pending deprecation of pkg_resources (#12368)
* Handle deprecation of pkg_resources

* Replace `pkg_resources` with `importlib_metadata` for `spacy info
--url`
* Remove requirements check from `spacy project` given the lack of
alternatives

* Fix installed model URL method and CI test

* Fix types/handling, simplify catch-all return

* Move imports instead of disabling requirements check

* Format

* Reenable test with ignored deprecation warning

* Fix except

* Fix return
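
A hedged sketch of the replacement pattern described above:

```python
# Hedged sketch: importlib.metadata in place of pkg_resources
try:
    import importlib.metadata as importlib_metadata  # stdlib on Python 3.8+
except ImportError:
    import importlib_metadata  # backport package for older Pythons

print(importlib_metadata.version("spacy"))
```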
2023-03-06 14:48:57 +01:00
Raphael Mitsch
6aa6b86d49
Make generation of empty KnowledgeBase instances configurable in EntityLinker (#12320)
* Make empty_kb() configurable.

* Format.

* Update docs.

* Be more specific in KB serialization test.

* Update KB serialization tests. Update docs.

* Remove doc update for batched candidate generation.

* Fix serialization of subclassed KB in tests.

* Format.

* Update docstring.

* Update docstring.

* Switch from pickle to json for custom field serialization.
2023-03-01 16:02:55 +01:00
kadarakos
56aa0cc75f
Displacy doc fix (#12352)
* more details for color setting

* more details for color setting

* prettier
2023-03-01 15:38:23 +01:00
Sofie Van Landeghem
74cae47bf6
rely on is_empty property instead of __len__ (#12347) 2023-03-01 12:06:07 +01:00
Raphael Mitsch
efbc3d37b3
Update docs w.r.t. spacy.CandidateBatchGenerator.v1. (#12350) 2023-03-01 11:01:35 +01:00
Adriane Boyd
33864f1d07
Add new tags in docs for #12334 (#12348) 2023-03-01 10:46:13 +01:00
Adriane Boyd
8f058e39bd
Fix error message for displacy auto_select_port (#12343) 2023-02-28 16:36:03 +01:00
TAN Long
071667376a
Add new REL_OPs: >+, >-, <+, and <- (#12334)
* Add immediate left/right child/parent dependency relations

* Add tests for new REL_OPs: `>+`, `>-`, `<+`, and `<-`.

---------

Co-authored-by: Tan Long <tanloong@foxmail.com>
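
A hedged sketch of one new operator with the DependencyMatcher; the reading of `>+` as "right immediate child" is an assumption based on the bullet above (assumes a pipeline with a parser is installed):

```python
# Hedged sketch; assumes `>+` selects a child that immediately follows its head
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {
        "LEFT_ID": "verb",
        "REL_OP": ">+",
        "RIGHT_ID": "object",
        "RIGHT_ATTRS": {"DEP": "dobj"},
    },
]
matcher.add("IMMEDIATE_OBJECT", [pattern])
print(matcher(nlp("She bought apples.")))
```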
2023-02-28 14:36:33 +01:00
lise-brinck
e2de188cf1
Bugfix/swedish tokenizer (#12315)
* add unittest for explosion#12311

* create punctuation.py for swedish

* removed : from infixes in swedish punctuation.py

* allow : as infix if succeeding char is uppercase
2023-02-27 10:53:45 +01:00
Adriane Boyd
4539fbae17
Revert "Fix FUZZY operator definition (#12318)" (#12336)
This reverts commit daedc45d05.

The default length depends on the length of the pattern string and was
correct for this example.
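
A minimal sketch of the operator in question; as noted above, the default allowed edit distance scales with the pattern string's length:

```python
# Minimal sketch of FUZZY matching in a token pattern
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("TYPO", [[{"LOWER": {"FUZZY": "definitely"}}]])
print(matcher(nlp("definately")))  # one fuzzy match expected
```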
2023-02-27 09:48:36 +01:00
Kevin Humphreys
acdd993071
Matcher performance fix for extension predicates: use shared key function (#12272)
* standardize predicate key format

* single key function

* Make optional args in key function keyword-only

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-27 08:35:08 +01:00
Paul O'Leary McCann
1e8bac99f3
Add tests for projects to master (#12303)
* Add tests for projects to master

* Fix git clone related issues on Windows

* Add stat import
2023-02-23 10:22:57 +01:00
andyjessen
daedc45d05
Fix FUZZY operator definition (#12318)
* Fix FUZZY operator definition

The default length of the FUZZY operator is 2 and not 3.

* adjust edit distance in matcher usage docs too

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2023-02-23 09:37:40 +01:00
Adriane Boyd
80bc140533
Add grc to langs with lexeme norms in spacy-lookups-data (#12287) 2023-02-16 17:57:02 +01:00
Edward
61b8454137
Adjust return type of registry.find (#12227)
* Fix registry find return type

* add dot

* Add type ignore for mypy

* update black formatting version

* add mypy ignore to package cli

* mypy type fix (for real)

* Update find description in spacy/util.py

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* adjust mypy directive

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-02-15 12:32:53 +01:00
Raphael Mitsch
2d4fb94ba0
Fix wrong file name in docs for rule-based matcher. (#12262) 2023-02-09 12:58:14 +01:00
Adriane Boyd
9d920bafcf
Extend mypy to v1.0.x (#12245) 2023-02-08 14:33:16 +01:00
Raphael Mitsch
d38a88f0f3
Remove negation. (#12252) 2023-02-08 14:18:33 +01:00
Adriane Boyd
9a454676f3
Use black version constraints from requirements.txt (#12220) 2023-02-03 11:44:10 +01:00
Sofie Van Landeghem
79ef6cf0f9
Have logging calls use string formatting types (#12215)
* change logging call for spacy.LookupsDataLoader.v1

* substitutions in language and _util

* various more substitutions

* add string formatting guidelines to contribution guidelines
2023-02-02 11:15:22 +01:00
Sofie Van Landeghem
4c60afb946
Backslash fixes in docs (#12213)
* backslash fixes

* revert unrelated change
2023-02-01 10:15:38 +01:00
Raphael Mitsch
02af17a5c8
Remove flaky assertions. (#12210) 2023-01-31 16:52:06 +01:00
Adriane Boyd
0e51c918ae
Normalize whitespace in evaluate CLI output test (#12157)
* Normalize whitespace in evaluate CLI output test

Depending on terminal settings, lines may be padded to the screen width
so the comparison is too strict with only the command string replacement.

* Move to test util method

* Change to normalization method
2023-01-30 17:51:27 +01:00
Paul O'Leary McCann
8932f4dc35
Add extra flag to assets docs (#12194)
* Add extra flag to assets docs

For some reason this wasn't included.

* Add new tag to docs
2023-01-30 10:05:23 +01:00
Adriane Boyd
606273f7e4
Normalize whitespace in evaluate CLI output test (#12157)
* Normalize whitespace in evaluate CLI output test

Depending on terminal settings, lines may be padded to the screen width
so the comparison is too strict with only the command string replacement.

* Move to test util method

* Change to normalization method
2023-01-27 16:13:34 +01:00
Sofie Van Landeghem
bd739e67d6
explain KB change and how to remedy (#12189) 2023-01-27 15:13:20 +01:00
Adriane Boyd
5f8a398bb9
Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196)
* Add span_id to Span.char_span, update Doc/Span.char_span docs

`Span.char_span(id=)` should be removed in the future.

* Also use Union[int, str] in Doc docstring
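
A minimal sketch of span_id on char_span:

```python
# Minimal sketch: attaching an id when creating a span from char offsets
import spacy

nlp = spacy.blank("en")
doc = nlp("Welcome to Berlin")
span = doc.char_span(11, 17, label="GPE", span_id="city")
print(span.id_)  # "city"
```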
2023-01-27 15:09:17 +01:00
Simon Gurcke
774c10fa39
Add alignment_mode argument to Span.char_span() (#12145)
* Add alignment_mode argument to Span.char_span()

* Update website

* Update spacy/tokens/span.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
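
A hedged sketch; the alignment_mode values are assumed to mirror Doc.char_span ("strict", "contract", "expand"):

```python
# Hedged sketch of alignment_mode on Span.char_span
import spacy

nlp = spacy.blank("en")
doc = nlp("Introducing alignment modes")
sent = doc[0:3]
# offsets are relative to the span's text; "expand" snaps out to token
# boundaries, where the default "strict" would give None for this slice
sub = sent.char_span(0, 14, alignment_mode="expand")
print(sub.text)  # "Introducing alignment"
```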
2023-01-27 11:43:40 +01:00
Peter Baumgartner
c68e6b8a96
trainable_lemmatizer in debug data (#11419)
* WIP

* rm ipython embeds

* rm total

* WIP

* cleanup

* cleanup + reword

* rm component function

* remove migration support form

* fix reference dataset for dev data

* additional fixes

- set approach to identifying unique trees
- adjust line length on messages
- add logic for detecting docs without annotations

* use 0 instead of none for no annotation

* partial annotation support

* initial tests for _compile_gold lemma attributes

Using the example data from the edit tree lemmatizer tests for:
- lemmatizer_trees
- partial_lemma_annotations
- n_low_cardinality_lemmas
- no_lemma_annotations

* adds output test for cli app

* switch msg level

* rm unclear uniqueness check

* Revert "rm unclear uniqueness check"

This reverts commit 6ea2b3524b.

* remove good message on uniqueness

* formatting

* use en_vocab fixture

* clarify data set source in messages

* remove unnecessary import

Co-authored-by: svlandeg <svlandeg@github.com>
2023-01-26 17:36:50 +01:00
Daniël de Kok
8d69874afb
Add spacy.PlainTextCorpusReader.v1 (#12122)
* Add `spacy.PlainTextCorpusReader.v1`

This is a corpus reader that reads plain text corpora with the following
format:

- UTF-8 encoding
- One line per document.
- Blank lines are ignored.

It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.
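
A hedged sketch of using the reader; the registered name is assumed from the `PlainTextCorpus` references below and may differ slightly:

```python
# Hedged sketch; reader name and `path` argument are assumptions based on
# this changelog entry
import spacy
from spacy import registry

nlp = spacy.blank("en")
make_corpus = registry.readers.get("spacy.PlainTextCorpus.v1")
corpus = make_corpus(path="train.txt")  # UTF-8, one document per line
for example in corpus(nlp):
    print(example.reference.text)
```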

* Update spacy/training/corpus.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* docs: add version to `PlainTextCorpus`

* Add docstring to registry function

* Add plain text corpus tests

* Only strip newline/carriage return

* Add return type _string_to_tmp_file helper

* Use a temporary directory in place of file name

Different OS auto delete/sharing semantics are just wonky.

* This will be new in 3.5.1 (rather than 4)

* Test improvements from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-26 11:33:22 +01:00
Marcus Blättermann
a37117abd0
Fix text colors in docs (#12186) 2023-01-26 10:30:24 +01:00
Marcus Blättermann
056b73468c
Load components dynamically (decrease initial file size for docs) (#12175)
* Extract `CodeBlock` component into own file

* Extract `InlineCode` component into own file

* Extract `TypeAnnotation` component into own file

* Convert named `export` to `default export`

* Remove unused `export`

* Simplify `TypeAnnotation` to remove dependency for Prism

* Load `Code` component dynamically

* Extract `MarkdownToReact` component into own file

* WIP Code Dynamic

* Load `MarkdownToReact` component dynamically

* Extract `htmlToReact` to own file

* Load `htmlToReact` component dynamically

* Dynamically load `Juniper`
2023-01-25 17:30:41 +01:00
Adriane Boyd
07dfa54669
CI: Extend website excludes (#12185) 2023-01-25 15:35:17 +01:00
Marcus Blättermann
11f10fff60
Fix frontpage image (#12184) 2023-01-25 13:17:35 +01:00
Marcus Blättermann
5a6000fb8b
Fix text color in docs (#12183)
* Fix text color on landing page

* Fix code color
2023-01-25 13:14:32 +01:00
Adriane Boyd
8ea15240ca
Update binder version to v3.5 (#12153) 2023-01-25 13:14:23 +01:00
Adriane Boyd
2dbb764183
CI: Add black formatting check to validation (#12182) 2023-01-25 12:51:37 +01:00
Marcus Blättermann
99a05734a8
Add aria-label to quickstart widget (#12179) 2023-01-25 11:46:55 +01:00
Marcus Blättermann
0298b1a863
WEB-28 Increase contrast of grey text (#12178)
* Use transparent colors to increase contrast on darker backgrounds

* Increase color contrast of grey text
2023-01-25 11:46:43 +01:00
Marcus Blättermann
3062fae2ca
Fix broken URL (#12176) 2023-01-25 11:42:19 +01:00
Marcus Blättermann
57ba37bc52
Fix regression with links in prompts (#12172) 2023-01-25 08:51:40 +01:00
Marcus Blättermann
05a3685849
Fix broken syntax for type annotations (#12171) 2023-01-25 08:51:25 +01:00
Marcus Blättermann
f3c586f74a
Fix navigation alert (#12169)
Fixes a regression introduced in #12163
2023-01-24 16:40:40 +01:00
Marcus Blättermann
49237f05a6
Fix aria-hidden element (#12163)
* Rename CSS class to make use more clear

* Rename component prop to improve code readability

* Fix `aria-hidden` directly on a link element

This link wouldn't have been clickable by screenreaders

* Refactor component

This removes an unnecessary `div` and a duplicate link

Co-authored-by: Ines Montani <ines@ines.io>
2023-01-24 14:44:47 +01:00
Marcus Blättermann
0a70696923
Fix wrong HTML element attribute (#12151)
Originally introduced in 62b9c9c6d7

Original error: Warning: Invalid DOM property `class`. Did you mean `className`?

React doesn't have `class`, it uses `className`.
2023-01-24 14:35:31 +01:00
Marcus Blättermann
9555e7aecf
Remove unnecessary links (#12159)
There is no need to link to the image we are already viewing, and this is also considered an accessibility issue.
2023-01-24 14:01:00 +01:00
Marcus Blättermann
031f6c7b60
WEB-27 Add alt tags to images (#12166)
* Update spaCy badge `alt` text

* Add `next/image` component to Universe

* Add missing `alt` texts
2023-01-24 13:56:14 +01:00
Marcus Blättermann
c9beb47ab7
Increase contrast of text and theme color (#12165) 2023-01-24 13:55:20 +01:00
Marcus Blättermann
a7d6a62f7c
Remove zoom locking (#12164)
* Fix missing comma

* Activate user zoom for website

This is recommended by lighthouse:

> Disabling zooming is problematic for users with low vision who rely on screen magnification to properly see the contents of a web page. Learn more.

Also, iOS already ignores this attribute anyway.
2023-01-24 13:54:49 +01:00
Marcus Blättermann
48159e1d60
Update explosion logo (#12162)
This fixes a misalignment of the explosion logo
2023-01-24 13:53:51 +01:00
Marcus Blättermann
7160f7835d
Fix GitHub badge (#12161)
* Extract component

* Remove rounded border from GitHub Stars badge

* Add `alt` text
2023-01-24 13:53:28 +01:00
Marcus Blättermann
3aa61e615f
Add missing label (#12160) 2023-01-24 13:52:55 +01:00
Marcus Blättermann
fcedcd54a8
WEB-30 spaCy pattern in .png (#12158)
* Fix gap in landing pattern at the top

* Replace `.jpg` patterns with `.png`

This drastically reduces file size (for the landing page from 221kb to 57kb) while doubling the resolution to look sharper on retina displays.
2023-01-24 13:51:39 +01:00
Sofie Van Landeghem
de1fe8dce3
Fix Azure ignoring website files (#12129)
* ignore all mdx files and all files in website

* have both .md and .mdx

* exclude everything but universe.json
2023-01-24 10:02:07 +01:00
Edward
e9048fd4a1
Add how to load probability tables to existing models to spaCy docs (#12051)
* add section about adding tables to models

* change to lexeme_norm

* Change syntax

* change to _prob

* Update website/docs/usage/saving-loading.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-24 10:01:22 +01:00
Raphael Mitsch
950fceceb6
Make test_cli_find_threshold() more robust. (#12148) 2023-01-23 14:42:33 +01:00
Richard Hudson
f9e020dd67
Fix speed problem with top_k>1 on CPU in edit tree lemmatizer (#12017)
* Refactor _scores2guesses

* Handle arrays on GPU

* Convert argmax result to raw integer

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use NumpyOps() to copy data to CPU

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Changes based on review comments

* Use different _scores2guesses depending on tree_k

* Add tests for corner cases

* Add empty line for consistency

* Improve naming

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* Improve naming

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2023-01-20 19:34:11 +01:00
Marcus Blättermann
8a3ca77d9e
Fix broken social media image (#12137) 2023-01-20 16:57:43 +01:00
Adriane Boyd
dec81508d2
Update README for v3.5 (#12132) 2023-01-19 16:13:41 +01:00
Sofie Van Landeghem
0f5d8a27f2
3.5 usage page (#12057)
* skeleton

* Fill in non-CLI details from release notes draft

* Add TODO for fuzzy matching

* Website updates for v3-5 draft

* Fill in usage examples

* Add fuzzy matching to intro

* Fix fuzzy examples

* Shell example formatting

* Fix typo

* Format

* Remove trailing periods in internal list

* Update

* Fix spacing for nested lists

* Update InMemoryLookupKB link

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2023-01-19 16:13:04 +01:00
Adriane Boyd
1e993d3b03
Merge pull request #12121 from adrianeboyd/chore/v3.5.0-2
Revert "Temporarily skip tests that require models/compat"
2023-01-19 15:59:30 +01:00
Adriane Boyd
3b8918e166
API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128)
* API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar

* adjust to mdx

* linkout to InMemoryLookupKB at first occurrence in kb.mdx

* fix links to docs

* revert Azure trigger setting (I'll make a separate PR)

Co-authored-by: svlandeg <svlandeg@github.com>
2023-01-19 13:29:17 +01:00
Adriane Boyd
a9910b6081
Update years in website landing page (#12107)
* Update years in website landing page

* Update website/pages/index.tsx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-19 11:08:02 +01:00
Sofie Van Landeghem
7d88c55eeb
update docs for apply (#12127)
* update docs for apply

* prettier
2023-01-19 10:37:09 +01:00
Adriane Boyd
28fd589b85
Move all website gitignore settings to website/.gitignore (#12120) 2023-01-18 21:46:19 +01:00
Daniël de Kok
668ec989ad
Update Dockerfile to work with Next.js (#12119)
* Update Dockerfile to work with Next.js

- Update to Node 18
- Do not run as root, this also works better with Node
  privilege-dropping.
- Update README with new run instructions and adding the
  `--rm` flag to avoid leaving a bunch of unused Docker
  containers.
- Also change README to recommend building the image locally.
  Image builds are pretty fast and the uploaded images get
  outdated pretty quickly.

* Add .dockerignore to avoid sending large build contexts

* Typo
2023-01-18 18:15:47 +01:00
Adriane Boyd
dc0f527039 Revert "Temporarily skip tests that require models/compat"
This reverts commit 378db0eb1e.
2023-01-18 12:54:56 +01:00
Adriane Boyd
794cea6907
Fix comments and examples for levenshtein_compare (#12113) 2023-01-18 08:02:33 +01:00
Paul O'Leary McCann
a3b15c9f53
Clarify how --code arg works (#12102)
* Clarify how `--code` arg works

This adds a few sentences to the docs to clarify how the `--code`
argument works, including an explanation of how to load custom
components in your own code.

* Add link to spacy.load docs
2023-01-17 19:30:02 +09:00
Albert Villanova del Moral
25373d8e8e
Fix required maximum version of typing-extensions (#12036)
* Fix required maximum version of typing-extensions

* Restrict to <4.5.0, sync minimum pin

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-13 10:44:02 +01:00
github-actions[bot]
9ef7d26032
Auto-format code with black (#12100)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2023-01-13 10:12:10 +01:00
Daniël de Kok
dda7331da3
Handle missing annotations in the edit tree lemmatizer (#12098)
The losses/gradients of missing annotations were not correctly masked
out. Fix this and check the masking in the partial data test.
2023-01-12 12:13:55 +01:00
Daniël de Kok
319eb508b5
Add a spacy benchmark speed subcommand (#11902)
* Add a `spacy evaluate speed` subcommand

This subcommand reports the mean batch performance of a model on a data set with
a 95% confidence interval. For reliability, it first performs some warmup
rounds. Then it will measure performance on batches with randomly shuffled
documents.

To avoid having too many spaCy commands, `speed` is a subcommand of `evaluate`
and accuracy evaluation is moved to its own `evaluate accuracy` subcommand.

* Fix import cycle

* Restore `spacy evaluate`, make `spacy benchmark speed` an alias

* Add documentation for `spacy benchmark`

* CREATES -> PRINTS

* WPS -> words/s

* Disable formatting of benchmark speed arguments

* Fail with an error message when trying to speed-bench an empty corpus

* Make it clearer that `benchmark accuracy` is a replacement for `evaluate`

* Fix docstring webpage reference

* tests: check `evaluate` output against `benchmark accuracy`
2023-01-12 11:55:21 +01:00
Paul O'Leary McCann
8e558095a1
Clean up displacy port-related error messages, docs (#12089)
* Clean up displacy port-related error messages, docs

There were some issues in the error messages and docs in #11948.

1. the error messages didn't specify the port argument to displacy.serve correctly
2. the docs didn't mark the auto select argument as new

This addresses those issues.

* Update website/docs/api/top-level.md

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Apply prettier

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-01-12 14:54:09 +09:00
Sofie Van Landeghem
554df9ef20
Website migration from Gatsby to Next (#12058)
* Rename all MDX file to `.mdx`

* Lock current node version (#11885)

* Apply Prettier (#11996)

* Minor website fixes (#11974) [ci skip]

* fix table

* Migrate to Next WEB-17 (#12005)

* Initial commit

* Run `npx create-next-app@13 next-blog`

* Install MDX packages

Following: 77b5f79a4d/packages/next-mdx/readme.md

* Add MDX to Next

* Allow Next to handle `.md` and `.mdx` files.

* Add VSCode extension recommendation

* Disabled TypeScript strict mode for now

* Add prettier

* Apply Prettier to all files

* Make sure to use correct Node version

* Add basic implementation for `MDXRemote`

* Add experimental Rust MDX parser

* Add `/public`

* Add SASS support

* Remove default pages and styling

* Convert to module

This allows to use `import/export` syntax

* Add import for custom components

* Add ability to load plugins

* Extract function

This will make the next commit easier to read

* Allow to handle directories for page creation

* Refactoring

* Allow to parse subfolders for pages

* Extract logic

* Redirect `index.mdx` to parent directory

* Disabled ESLint during builds

* Disabled typescript during build

* Remove Gatsby from `README.md`

* Rephrase Docker part of `README.md`

* Update project structure in `README.md`

* Move and rename plugins

* Update plugin for wrapping sections

* Add dependencies for  plugin

* Use  plugin

* Rename wrapper type

* Simplify unnecessary adding of ids to sections

The slugified section ids are useless, because they cannot be referenced anywhere anyway. The navigation only works if the section has the same id as the heading.

* Add plugin for custom attributes on Markdown elements

* Add plugin to readd support for tables

* Add plugin to fix problem with wrapped images

For more details see this issue: https://github.com/mdx-js/mdx/issues/1798

* Add necessary meta data to pages

* Install necessary dependencies

* Remove outdated MDX handling

* Remove reliance on `InlineList`

* Use existing Remark components

* Remove unallowed heading

Previously, `h1` components were not overwritten and would never have worked, and they aren't used anywhere either.

* Add missing components to MDX

* Add correct styling

* Fix broken list

* Fix broken CSS classes

* Implement layout

* Fix links

* Fix broken images

* Fix pattern image

* Fix heading attributes

* Rename heading attribute

`new` was causing some weird issue, so renaming it to `version`

* Update comment syntax in MDX

* Merge imports

* Fix markdown rendering inside components

* Add model pages

* Simplify anchors

* Fix default value for theme

* Add Universe index page

* Add Universe categories

* Add Universe projects

* Fix Next problem with copy

Next complains when the server renders something different than the client, therefore we move the differing logic to `useEffect`

* Fix improper component nesting

Next doesn't allow block elements inside a `<p>`

* Replace landing page MDX with page component

* Remove inlined iframe content

* Remove ability to inline HTML content in iFrames

* Remove MDX imports

* Fix problem with image inside link in MDX

* Escape character for MDX

* Fix unescaped characters in MDX

* Fix headings with logo

* Allow to export static HTML pages

* Add prebuild script

This command is automatically run by Next

* Replace `svg-loader` with `react-inlinesvg`

`svg-loader` is no longer maintained

* Fix ESLint `react-hooks/exhaustive-deps`

* Fix dropdowns

* Change code language from `cli` to `bash`

* Remove unnecessary language `none`

* Fix invalid code language

`markdown_` with an underscore was used to basically turn off syntax highlighting, but using unknown languages now throws an error.

* Enable code blocks plugin

* Readd `InlineCode` component

MDX2 removed the `inlineCode` component

> The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions

Source: https://mdxjs.com/migrating/v2/#update-mdx-content

* Remove unused code

* Extract function to own file

* Fix code syntax highlighting

* Update syntax for code block meta data

* Remove unused prop

* Fix internal link recognition

There is a problem with regex between Node and browser, and since Next runs the component on both, this creates an error.

`Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"`

This simplifies the implementation and fixes the above error.

* Replace `react-helmet` with `next/head`

* Fix `className` problem for JSX component

* Fix broken bold markdown

* Convert file to `.mjs` to be used by Node process

* Add plugin to replace strings

* Fix custom table row styling

* Fix problem with `span` inside inline `code`

React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode.

* Add `_document` to be able to customize `<html>` and `<body>`

* Add `lang="en"`

* Store Netlify settings in file

This way we don't need to update via Netlify UI, which can be tricky if changing build settings.

* Add sitemap

* Add Smartypants

* Add PWA support

* Add `manifest.webmanifest`

* Fix bug with anchor links after reloading

There was no need for the previous implementation, since the browser handles this natively. Additionally, the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar.

* Rename custom event

I was googling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠

* Fix missing comment syntax highlighting

* Refactor Quickstart component

The previous implementation was hiding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle.

The new implementation simply filters the list of children (React elements) via their props.

* Fix syntax highlighting for Training Quickstart

* Unify code rendering

* Improve error logging in Juniper

* Fix Juniper component

* Automatically generate "Read Next" link

* Add Plausible

* Use recent DocSearch component and adjust styling

* Fix images

* Turn off image optimization

> Image Optimization using Next.js' default loader is not compatible with `next export`.

We currently deploy to Netlify via `next export`

* Don't build pages starting with `_`

* Remove unused files

* Add Next plugin to Netlify

* Fix button layout

MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line, so we hack around it with a JSX string.

* Add 404 page

* Apply Prettier

* Update Prettier for `package.json`

Next sometimes wants to patch `package-lock.json`. The old Prettier setting indented with 4 spaces, but Next always indents with 2 spaces. Since `npm install` automatically uses the indentation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces.

* Apply Next patch to `package-lock.json`

When starting the dev server Next would warn `warn  - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes.

* fix link

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* small backslash fixes

* adjust to new style

Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 17:30:07 +01:00
Adriane Boyd
e0168ccce9
Allow spacy-transformers v1.2.x in transformers extra (#12092) 2023-01-11 13:54:58 +01:00
Adriane Boyd
9e0322de1a
Restore v2 token_acc score implementation (#12073)
In the v3 scorer refactoring, `token_acc` was implemented incorrectly.
It should use `precision` instead of `fscore` for the measure of
correctly aligned tokens / number of predicted tokens.

Fix the docs to reflect that the measure uses the number of predicted
tokens rather than the number of gold tokens.
2023-01-11 08:01:47 +01:00
Kevin Humphreys
19650ebb52
Enable fuzzy text matching in Matcher (#11359)
* enable fuzzy matching

* add fuzzy param to EntityMatcher

* include rapidfuzz_capi

not yet used

* fix type

* add FUZZY predicate

* add fuzzy attribute list

* fix type properly

* tidying

* remove unnecessary dependency

* handle fuzzy sets

* simplify fuzzy sets

* case fix

* switch to FUZZYn predicates

use Levenshtein distance.
remove fuzzy param.
remove rapidfuzz_capi.

* revert changes added for fuzzy param

* switch to polyleven

(Python package)

* fuzzy match only on oov tokens

* remove polyleven

* exclude whitespace tokens

* don't allow more edits than characters

* fix min distance

* reinstate FUZZY operator

with length-based distance function

* handle sets inside regex operator

* remove is_oov check

* attempt build fix

no mypy failure locally

* re-attempt build fix

* don't overwrite fuzzy param value

* move fuzzy_match

to its own Python module to allow patching

* move fuzzy_match back inside Matcher

simplify logic and add tests

* Format tests

* Parametrize fuzzyn tests

* Parametrize and merge fuzzy+set tests

* Format

* Move fuzzy_match to a standalone method

* Change regex kwarg type to bool

* Add types for fuzzy_match

- Refactor variable names
- Add test for symmetrical behavior

* Parametrize fuzzyn+set tests

* Minor refactoring for fuzz/fuzzy

* Make fuzzy_match a Matcher kwarg

* Update type for _default_fuzzy_match

* don't overwrite function param

* Rename to fuzzy_compare

* Update fuzzy_compare default argument declarations

* allow fuzzy_compare override from EntityRuler

* define new Matcher keyword arg

* fix type definition

* Implement fuzzy_compare config option for EntityRuler and SpanRuler

* Rename _default_fuzzy_compare to fuzzy_compare, remove from reexported objects

* Use simpler fuzzy_compare algorithm

* Update types

* Increase minimum to 2 in fuzzy_compare to allow one transposition

* Fix predicate keys and matching for SetPredicate with FUZZY and REGEX

* Add FUZZY6..9

* Add initial docs

* Increase default fuzzy to rounded 30% of pattern length

* Update docs for fuzzy_compare in components

* Update EntityRuler and SpanRuler API docs

* Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare

To have naming similar to `phrase_matcher_attr`, rename the
`fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to
`matcher_fuzzy_compare`. Organize next to `phrase_matcher_attr` in docs.

* Fix schema aliases

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typo

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add FUZZY6-9 operators and update tests

* Parameterize test over greedy

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type for fuzzy_compare to remove Optional

* Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein

* Update docs following levenshtein_compare renaming

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-10 10:36:17 +01:00
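
A minimal sketch of the fuzzy matching added in the commit above, using the pattern syntax documented for spaCy v3.5:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# FUZZY allows Levenshtein edits up to ~30% of the pattern length (minimum 2);
# FUZZY1..FUZZY9 pin the maximum edit distance explicitly.
matcher.add("GREETING", [[{"LOWER": {"FUZZY": "hello"}}]])
doc = nlp("Helo there")
print([doc[start:end].text for _, start, end in matcher(doc)])  # ['Helo']
```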
Zhangrp
eb8bb35c13
improve ux for displacy when the serve port is in use (#11948)
* check port in use and add itself

* check port in use and add itself

* Auto switch to nearest available port.

* Use bind to check port instead of connect_ex.

* Reformat.

* Add auto_select_port argument.

* update docs for displacy.serve

* Update spacy/errors.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update website/docs/api/top-level.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/errors.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add test using multiprocessing

* fix argument name

* Increase sleep times

Want to rule this out as a cause of test failure

* Don't terminate a process that isn't alive

* Refactor port finding logic

This moves all the port logic into its own util function, which can be
tested without having to background a server directly.

* Use with for the server

This ensures the server is closed correctly.

* Pass in the host when checking port availability

* Shorten argument name

* Update error codes following merge

* Add types for arguments, specify docstrings.

* Add typing for arguments with default value.

* Update docstring to match spaCy format.

* Update docstring to match spaCy format.

* Fix docs

Arg name changed from `auto_select_port` to just `auto_select`.

* Revert "Fix docs"

This reverts commit 356966fe84.

Co-authored-by: zhiiw <1302593554@qq.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-01-10 15:52:57 +09:00
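
A minimal sketch of the port auto-selection described in the commit above (assumes the `en_core_web_sm` model is installed):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("displaCy falls back to the nearest free port.")
# With auto_select_port=True, a busy port 5000 no longer raises an error;
# the server switches to the nearest available port instead.
displacy.serve(doc, style="dep", auto_select_port=True)
```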
Sofie Van Landeghem
6d03b04901
Improve score_cats for use with multiple textcat components (#11820)
* add test for running evaluate on an nlp pipeline with two distinct textcat components

* cleanup

* merge dicts instead of overwrite

* don't add more labels to the given set

* Revert "merge dicts instead of overwrite"

This reverts commit 89bee0ed77.

* Switch tests to separate scorer keys rather than merged dicts

* Revert unrelated edits

* Switch textcat scorers to v2

* formatting

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-09 11:43:48 +01:00
Madeesh Kannan
f1dcdefc8a
Add version tag to before_update config key (#12059) 2023-01-05 11:46:04 +01:00
Sofie Van Landeghem
7f6c638c3a
fix processing of "auto" in convert (#12050)
* fix processing of "auto" in walk_directory

* add check for None

* move AUTO check to convert and fix verification of args

* add specific CLI test with CliRunner

* cleanup

* more cleanup

* update docstring
2023-01-05 10:21:00 +01:00
Paul O'Leary McCann
dbd829f0ed
Fix inconsistency in displaCy docs about page option (#12047)
* Fix inconsistency in displaCy docs about page option

The `page` option, which wraps the output SVG in HTML, is true by
default for `serve` but not for `render`. The `render` docs were wrong
though, so this updates them.

* Update the same statement in more docs

A few renderers used the same language
2023-01-04 12:51:40 +09:00
Wannaphong Phatthiyaphaibun
31c1beba78
Add spacy-pythainlp (#12038)
* Add spacy-pythainlp

* Move submission to right section

* Minor cleanup

* Remove extra list call

* Update universe.json

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2023-01-03 17:03:59 +09:00
github-actions[bot]
abb0ab109d
Auto-format code with black (#12035)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2023-01-02 11:59:57 +01:00
Adriane Boyd
ef9e504eac
Rename modified textcat scorer to v2 (#11971)
As a follow-up to #11696, rename the modified scorer to v2 and move the
v1 scorer to `spacy-legacy`.
2022-12-29 14:01:08 +01:00
kadarakos
933b54ac79
typo fix (#11995) 2022-12-26 13:26:35 +01:00
Madeesh Kannan
aa2b471a6e
New console logger with expanded progress tracking (#11972)
* Add `ConsoleLogger.v3`

This addition expands the progress bar feature to count up the training/distillation steps to either the next evaluation pass or the maximum number of steps.

* Rename progress bar types

* Add defaults to docs
Minor fixes

* Move comment

* Minor punctuation fixes

* Explicitly check for `None` when validating progress bar type

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-12-23 15:21:44 +01:00
github-actions[bot]
90896504a5
Auto-format code with black (#12019)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-12-23 12:44:07 +01:00
Adriane Boyd
64d2d27c5d
Add classifier for python 3.11 (#12013) 2022-12-22 10:53:16 +01:00
Raphael Mitsch
eef3d950b4
Fix SpanGroup and Span typing (#12009)
* Correct Span.label, Span.kb_id types. Fix SpanGroup.__iter__().

* Extend test.

* Rename test. Fix typo.

* Add comment.

* Fix types for Span.label, Span.kb_id, Span.char_span().

* Update spacy/tests/doc/test_span_group.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs.

* Fix typo.

* Update spacy/tokens/span_group.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-12-21 18:54:27 +01:00
kadarakos
c223cd7a86
Add apply CLI (#11376)
* annotate cli first try

* add batch-size and n_process

* rename to apply

* typing fix

* handle file suffixes

* walk directories

* support jsonl

* typing fix

* remove debug

* make suffix optional for walk

* revert unrelated

* don't warn but raise

* better error message

* minor touch up

* Update spacy/tests/test_cli.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* update tests and bugfix

* add force_overwrite

* typo

* fix adding .spacy suffix

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* store user data and rename cmd arg

* include test for user attr

* rename cmd arg

* better help message

* documentation

* prettier

* black

* link fix

* Update spacy/cli/apply.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update website/docs/api/cli.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update website/docs/api/cli.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update website/docs/api/cli.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* addressing reviews

* dont quit but warn

* prettier

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-12-20 17:11:33 +01:00
Jos Polfliet
18ffe5bbd6
Update stop_words.py (#11997)
fix typo in "aangaande"
2022-12-19 16:17:49 +01:00
cfuerbachersparks
3a2b655a29
Update lexeme.md (#11994)
Change suffix_ string to end
2022-12-19 10:33:38 +01:00
Adriane Boyd
c9d9d6847f
Update build constraints for python 3.11 (#11981) 2022-12-15 10:55:01 +01:00
Adriane Boyd
e5c7f3b077
CI: Install thinc-apple-ops through extra (#11963) 2022-12-12 10:13:10 +01:00
Adriane Boyd
0591e67265
Cast to uint64 for all array-based doc representations (#11933)
* Convert all individual values explicitly to uint64 for array-based doc representations

* Temporarily test with latest numpy v1.24.0rc

* Remove unnecessary conversion from attr_t

* Reduce number of individual casts

* Convert specifically from int32 to uint64

* Revert "Temporarily test with latest numpy v1.24.0rc"

This reverts commit eb0e3c5006.

* Also use int32 in tests
2022-12-12 08:45:35 +01:00
Adriane Boyd
8c291ace0c
Extend to wasabi v1.1 (#11945)
* Extend to wasabi v1.1

* Temporarily run mypy and tests with newest wasabi

* Temporarily skip check requirements test

* Revert "Temporarily skip check requirements test"

This reverts commit 44f4ce20a8.

* Revert "Temporarily run mypy and tests with newest wasabi"

This reverts commit e677a2257c.
2022-12-12 08:38:36 +01:00
github-actions[bot]
f22fc7a113
Auto-format code with black (#11955)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-12-09 10:15:52 +01:00
vincent d warmerdam
6d2ca1ab3a
Update custom solutions links (#11903)
* Update custom solutions

Will now point to https://explosion.ai/custom-solutions

* added-sidebar

* added-analysis-to-readme

* update-landing-page
2022-12-07 16:02:09 +01:00
Paul O'Leary McCann
73919336fb
Remove spacy-sentence-segmenter from Universe (#11932) 2022-12-07 15:56:03 +01:00
Paul O'Leary McCann
5c3a60e8f4
Add in errors used in the beam code that were removed at some point (#11935)
I don't think there's any way to use the beam code at the moment, but as
long as it's around, the errors it refers to should also be present.
2022-12-07 15:52:35 +01:00
Paul O'Leary McCann
916191848a
Update scattertext example code (#11937)
* Update scattertext example code

* Remove PMI Filter Threshold
2022-12-07 18:09:04 +09:00
Daniël de Kok
27fac7df2e
EditTreeLemmatizer: correctly add strings when initializing from labels (#11934)
Strings in replacement nodes were not added to the `StringStore`
when `EditTreeLemmatizer` was initialized from a set of labels. The
corresponding test did not capture this because it added the strings
through the examples that were passed to the initialization.

This change fixes both this bug in the initialization and the 'shadowing'
of the bug in the test.
2022-12-07 13:53:41 +09:00
Zhangrp
23085ffef4
Fix interpolation in directory names, see #11235. (#11914) 2022-12-06 17:42:12 +09:00
Ryn Daniels
1aadcfcb37
update lock-threads to v4 (#11930) 2022-12-05 10:17:10 +01:00
Adriane Boyd
8afa8b5a7b
Refactor kwargs in CLI msg for future wasabi compatibility (#11918)
Necessary for mypy with wasabi v1+.
2022-12-05 10:00:00 +01:00
Darigov Research
6f342bdd72
docs: Adds link to license in readme (#11924)
Would resolve https://github.com/explosion/spaCy/issues/11923 if merged
2022-12-05 09:49:04 +01:00
Paul O'Leary McCann
5848656b5e
Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)
* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6
2022-12-05 09:43:23 +01:00
Sofie Van Landeghem
4b2097a271
fix links (#11927) 2022-12-05 16:29:13 +09:00
github-actions[bot]
df0cb4b77b
Auto-format code with black (#11913)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-12-02 14:49:12 +01:00
Paul O'Leary McCann
f9d17a644b
Config generation fails for GPU without transformers (#11899)
If you don't have spacy-transformers installed, but try to use `init
config` with the GPU flag, you'll get an error. The issue is that the
`use_transformers` flag in the config is conflated with the GPU flag,
and then there's an attempt to access transformers config info that may
not exist.

There may be a better way to do this, but this stops the error.
2022-12-02 10:17:11 +01:00
Adriane Boyd
445c670a2d
Fix spancat for zero suggestions (#11860)
* Add test for spancat predict with zero suggestions

* Fix spancat for zero suggestions

* Undo changes to extract_spans

* Use .sum() as in update
2022-12-02 09:33:52 +01:00
Zhangrp
9cf3fa9711
Add docs for biluo_to_iob and iob_to_biluo. (#11901)
* Add docs for biluo_to_iob and iob_to_biluo.

* Fix typos.

* Remove redundant links.
2022-12-01 13:30:27 +01:00
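
A minimal sketch of the two conversion helpers documented in the commit above:

```python
from spacy.training import biluo_to_iob, iob_to_biluo

biluo = ["O", "B-LOC", "I-LOC", "L-LOC", "U-PERSON"]
print(biluo_to_iob(biluo))  # ['O', 'B-LOC', 'I-LOC', 'I-LOC', 'B-PERSON']
print(iob_to_biluo(["O", "B-LOC", "I-LOC"]))  # ['O', 'B-LOC', 'L-LOC']
```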
Damian Romero
afd7a2476d
Fix typo in vocab.md table (#11908)
* Fix typo in vocab.md table

Fixes explosion/spaCy/#11907

* Reformat vocab.md with Prettier
2022-12-01 13:06:28 +01:00
Adriane Boyd
6f9d630f7e
Replace Pipe type with Callable in Language (#11803)
* Replace Pipe type with Callable in Language

* Use Callable[[Doc], Doc] in the docstrings
2022-11-29 13:20:08 +01:00
Paul O'Leary McCann
f1e0243450
Remove macro auc per type from textcat defaults (#11887)
This appears to have been added by mistake and never used. Removing it
does not break validation.
2022-11-29 11:50:23 +01:00
Adriane Boyd
e0d43557b7
Merge pull request #11871 from adrianeboyd/chore/v3.5.0
Prepare for v3.5.0
2022-11-29 11:41:32 +01:00
Adriane Boyd
1ebe7db07c
Support local filesystem remotes for projects (#11762)
* Support local filesystem remotes for projects

* Fix support for local filesystem remotes for projects
  * Use `FluidPath` instead of `Pathy` to support both filesystem and
    remote paths
  * Create missing parent directories if required for local filesystem
  * Add a more general `_file_exists` method to support both `Pathy`,
    `Path`, and `smart_open`-compatible URLs
* Add explicit `smart_open` dependency starting with support for
  `compression` flag
* Update `pathy` dependency to exclude older versions that aren't
  compatible with required `smart_open` version
* Update docs to refer to `Pathy` instead of `smart_open` for project
  remotes (technically you can still push to any `smart_open`-compatible
  path but you can't pull from them)
* Add tests for local filesystem remotes

* Update pathy for general BlobStat sorting

* Add import

* Remove _file_exists since only Pathy remotes are supported

* Format CLI docs

* Clean up merge
2022-11-29 11:40:58 +01:00
Sofie Van Landeghem
96c9cf3448
Merge pull request #11855 from essenmitsosse/move-styleguide-out-of-readme
Move Styleguide out of Readme
2022-11-28 21:22:56 +01:00
Paul O'Leary McCann
f54bfb56c9
Don't throw an error if using displacy on an unset span key (#11845)
* Don't throw an error if using displacy on an unset span key

* List available keys in W117
2022-11-28 10:01:09 +01:00
Zhangrp
9f986af120
Add example sentence for Chinese in website meta (#11879) 2022-11-28 14:50:30 +09:00
Marcus Blättermann
5c9faf6eea
Update menu for styleguide
This reflects the removed parts from ecbf052abd
2022-11-27 03:48:05 +01:00
Marcus Blättermann
90141202c0
Merge branch 'move-styleguide-out-of-readme' into migrate-to-next-web-17 2022-11-27 03:48:03 +01:00
Marcus Blättermann
7f2ea20fee
Update README.md 2022-11-27 03:47:11 +01:00
Marcus Blättermann
c23d54fd26
Remove MDX tags from README.md 2022-11-27 03:47:11 +01:00
Adriane Boyd
681ec20914
Add smart_open requirement, update deprecated options (#11864)
* Switch from deprecated `ignore_ext` to `compression`
* Add upload/download test for local files
2022-11-25 13:00:57 +01:00
Adriane Boyd
32396e0bda Set version to v3.5.0 2022-11-25 12:05:25 +01:00
Adriane Boyd
378db0eb1e Temporarily skip tests that require models/compat 2022-11-25 12:05:25 +01:00
Raphael Mitsch
c0fd8a2e71
find-threshold: CLI command for multi-label classifier threshold tuning (#11280)
* Add foundation for find-threshold CLI functionality.

* Finish first draft for find-threshold.

* Add tests.

* Revert adjusted import statements.

* Fix mypy errors.

* Fix imports.

* Harmonize arguments with spacy evaluate command.

* Generalize component and threshold handling. Harmonize arguments with 'spacy evaluate' CLI.

* Fix Spancat test.

* Add beta parameter to Scorer and PRFScore.

* Make beta a component scorer setting.

* Remove beta.

* Update nlp.config (workaround).

* Reload pipeline on threshold change. Adjust tests. Remove confection reference.

* Remove assumption of component being a Pipe object or having a .cfg attribute.

* Adjust test output and reference values.

* Remove beta references. Delete universe.json.

* Reverting unnecessary changes. Removing unused default values. Renaming variables in find-cli tests.

* Update spacy/cli/find_threshold.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Remove adding labels in tests.

* Remove unused error

* Undo changes to PRFScorer

* Change default value for n_trials. Log table iteratively.

* Add warnings for pointless applications of find_threshold().

* Fix imports.

* Adjust type check of TextCategorizer to exclude subclasses.

* Change check of whether there's only one unique value in scores.

* Update spacy/cli/find_threshold.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Incorporate feedback.

* Fix test issue. Update docstring.

* Update docs & docstring.

* Update spacy/tests/test_cli.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add examples to docs. Rename _nlp to nlp in tests.

* Update spacy/cli/find_threshold.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/find_threshold.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-11-25 11:44:55 +01:00
kadarakos
dece775279
correct ndim in docs (#11869) 2022-11-25 11:31:28 +01:00
Adriane Boyd
30d31fd335
Update Russian and Ukrainian lemmatizers (#11811)
* pymorphy2 issues #11620, #11626, #11625:
- #11620: pymorphy2_lookup
- #11626: handle multiple forms pointing to the same normal form + handling empty POS tag
- #11625: matching DET that are labelled as PRON by pymorphy2

* Move lemmatizer algorithm changes back into RussianLemmatizer

* Fix uk pymorphy3_lookup mode init

* Move and update tests for ru/uk lookup lemmatizer modes

* Fix typo

* Remove traces of previous behavior for uninflected POS

* Refactor to private generic-looking pymorphy methods

* Remove xfailed uk lemmatizer cases

* Update spacy/lang/ru/lemmatizer.py

Co-authored-by: Richard Hudson <richard@explosion.ai>

Co-authored-by: Dmytro S Lituiev <d.lituiev@gmail.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
2022-11-25 11:12:46 +01:00
Adriane Boyd
8f062b849c
Fix Matcher cython profile=True header (#11867) 2022-11-24 16:03:42 +01:00
Madeesh Kannan
5ea14af32b
Add training.before_update callback (#11739)
* Add `training.before_update` callback

This callback can be used to implement training paradigms like gradual (un)freezing of components (e.g: the Transformer) after a certain number of training steps to mitigate catastrophic forgetting during fine-tuning.

* Fix type annotation, default config value

* Generalize arguments passed to the callback

* Update schema

* Pass `epoch` to callback, rename `current_step` to `step`

* Add test

* Simplify test

* Replace config string with `spacy.blank`

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Cleanup imports

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-23 17:54:58 +01:00
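
A hedged sketch of registering a `before_update` callback as described above; the registry name `gradual_unfreeze.v1` and the step threshold are illustrative, not part of spaCy:

```python
from typing import Any, Callable, Dict

import spacy
from spacy.language import Language

@spacy.registry.callbacks("gradual_unfreeze.v1")
def create_before_update() -> Callable[[Language, Dict[str, Any]], None]:
    def before_update(nlp: Language, info: Dict[str, Any]) -> None:
        # Per the commit above, the trainer passes the current "step" and
        # "epoch" before each update, which can drive schedules like
        # gradual (un)freezing of components.
        if info["step"] == 500:
            print(f"Reached step 500 in epoch {info['epoch']}")
    return before_update
```

The callback is then referenced from the config via `[training.before_update]` with `@callbacks = "gradual_unfreeze.v1"`.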
Paul O'Leary McCann
8271cfb4cd
Remove Learning Path spaCy (#11846) 2022-11-23 11:03:18 +01:00
Paul O'Leary McCann
f1ddac187d
Remove unused error object (#11837) 2022-11-23 10:51:31 +01:00
Marcus Blättermann
ecbf052abd
Remove README.md content from styleguide 2022-11-23 02:04:54 +01:00
Marcus Blättermann
5659eeaadd
Remove styleguide content from README.md 2022-11-23 02:04:54 +01:00
Marcus Blättermann
8c0ceca637
Move README.md content to styleguide 2022-11-23 02:04:54 +01:00
Marcus Blättermann
0794e5c6cc
Add missing files to project structure in README.md 2022-11-23 02:04:54 +01:00
Marcus Blättermann
96218a1e8f
Delete styleguide.md
This is an intermediate commit, so the content of `/README.md` can be moved to the styleguide, but the history is kept
2022-11-23 02:04:54 +01:00
Marcus Blättermann
9d96e44a87
Apply Prettier to README.md 2022-11-23 02:04:49 +01:00
Marco Edward Gorelli
f0d8309a28
fix comparison of constants (#11834)
Co-authored-by: MarcoGorelli <>
2022-11-21 08:12:03 +01:00
github-actions[bot]
89bfd06fbd
Auto-format code with black (#11826)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-11-18 18:24:13 +09:00
Paul O'Leary McCann
e3173bd86d
Remove spikex from Universe (#11825) 2022-11-18 08:24:22 +01:00
Adriane Boyd
a83463c5e0
Add transformer recommendation for ca (#11819)
Model recommendation from @cayorodriguez.
2022-11-18 08:15:27 +01:00
Paul O'Leary McCann
75bb7ad541
Check textcat values for validity (#11763)
* Check textcat values for validity

* Fix error numbers

* Clean up vals reference

* Check category value validity through training

`_validate_categories` is called in `update`, which for multilabel is
inherited from the single-label component.

* Formatting
2022-11-17 10:25:01 +01:00
Adriane Boyd
317b6ef99c
Update to mypy 0.990 (#11801) 2022-11-16 14:09:10 +01:00
Paul O'Leary McCann
c0c54e44bc
Add equality definition for vectors (#11806)
* Add equality definition for vectors

This re-uses the check from sourcing components.

* Use the equality check

* Format

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-16 09:44:42 +01:00
Sofie Van Landeghem
caa9efad59
prevent rewriting an already raw URL (#11810) 2022-11-15 14:15:00 +01:00
Denis Bezykornov
7e684ad691
Update russian tokenizer exceptions (#11753)
* Fix typos, add a couple of new abbreviations, remove non-breaking spaces

* Remove space from abbreviation

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-15 11:37:25 +01:00
Peter Baumgartner
9baa686f82
remove migration support form (#11802) 2022-11-14 16:53:14 +01:00
Paul O'Leary McCann
bb523d4d91
Remove spacy-ray from docs (#11781)
* Remove spacy ray from cli docs

* Remove more ray docs

* Remove ray from universe
2022-11-14 19:58:38 +09:00
Edward
3478ff1eb0
remove new v2 tags (#11780) 2022-11-14 17:41:01 +09:00
github-actions[bot]
188a7d00eb
Auto-format code with black (#11792)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-11-11 09:58:31 +01:00
Jacobo Myerston
322b5dc1df
Add greCy to Universe (#11774)
* Update universe.json

* Update universe.json

fixes Github value
2022-11-10 13:21:20 +09:00
Adriane Boyd
03eebe9d1c
Update warning, add tests for project requirements check (#11777)
* Update warning, add tests for project requirements check

* Make warning more general for differences between PEP 508 and pip
* Add tests for _check_requirements

* Parameterize test
2022-11-09 10:59:28 +01:00
Raphael Mitsch
20bbbe3e44
Revert disable/disabled merging behavior (#11745)
* Merge disable with disabled. Adjust warnings, errors and tests.

* Replace any() with set operation.

* Update spacy/tests/pipeline/test_pipe_methods.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs.

* Remove reference to config entry nlp.enabled from docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-08 14:58:10 +01:00
Adriane Boyd
2e3cfd758e
Use python 3.10 for GHA universe alert (#11768) 2022-11-08 12:46:19 +09:00
Adriane Boyd
e116395f89
Add fallback in requirements check, only check once (#11735)
* Add fallback in requirements check, only check once

* Rename to skip_requirements_check

* Update spacy/cli/project/run.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-11-07 14:46:08 +01:00
Adriane Boyd
6105f20d8a
Switch CI to python 3.11 (#11765) 2022-11-07 13:25:40 +01:00
Adriane Boyd
e91b47a226
Check for unsafe paths in tarfile.extractall (CVE-2007-4559) (#11746)
* Adding tarfile member sanitization to extractall()

* Format

* Simplify and add error message

* Fix import

* Add comment about CVE

Co-authored-by: TrellixVulnTeam <charles.mcfarland@trellix.com>
2022-11-07 10:43:34 +01:00
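
A generic sketch of the sanitization pattern for CVE-2007-4559; this illustrates the idea rather than spaCy's exact implementation:

```python
import os
import tarfile

def safe_extract(tar: tarfile.TarFile, path: str = ".") -> None:
    root = os.path.realpath(path)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(path, member.name))
        # Reject members that would escape the extraction directory
        if os.path.commonpath([root, target]) != root:
            raise ValueError(f"Unsafe path in archive: {member.name}")
    tar.extractall(path)
```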
Paul O'Leary McCann
b76222e56a
Raise Typer limit (#11720)
* Raise typer limit to <0.7.0

* Raise limit to <0.8.0
2022-11-07 08:11:55 +01:00
Adriane Boyd
ea326cf47d
Fix types for Span.id and Span.id_ (#11744) 2022-11-07 08:11:13 +01:00
github-actions[bot]
bbf64cfc43
Auto-format code with black (#11749)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-11-04 11:17:43 +01:00
Adriane Boyd
40e1000db0
Restore Doc attr getter values in Doc.to_json (#11700) 2022-11-03 11:49:08 +01:00
Paul O'Leary McCann
db56600536
Fix default parameters for load functions (fix #11706) (#11713)
* Fix default parameters for load functions

Some load functions used SimpleFrozenList() directly instead of the
_DEFAULT_EMPTY_PIPES parameter. That mostly worked as intended, but
the changes in #11459 check for equality using identity, not value, so a
warning is incorrectly raised sometimes, as in #11706.

This change just has all the load functions use the singleton value
instead.

* Add test that there are no warnings on module-based load

This will succeed due to changes in this branch, but local tests with
the latest release failed as intended.

* Try reverting commit and see if CI changes

There is an error in CI that is probably unrelated.

Revert "Fix default parameters for load functions"

This reverts commit dc46b35687.

* Revert "Try reverting commit and see if CI changes"

This reverts commit 2514ed07ef.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-03 10:52:59 +01:00
Adriane Boyd
1211552f0e
Modernize and simplify CI steps (#11738)
* Use `build` instead of `python setup.py sdist`
* Remove in-place build with `setup.py`
* Remove `gpu` parameter and GPU tests
* Keep `architecture` and `num_build_jobs` in azure steps with CI
  defaults
* Fix use of `num_build_jobs` parameters
* Remove now-unused `prefix` parameter
* Test imports and CLI before installing test requirements
* Remove `*.egg-info` directory in addition to source directory for a
    warning-free `import spacy`
* Switch `thinc-apple-ops` test to python 3.11 (as most recent python
  that is tested across platforms)
2022-11-03 09:29:46 +01:00
Ryn Daniels
2fb7e4dc74
More version updates for github action deprecation warnings (#11705)
* More version updates for github action deprecation warnings

* fix the deprecated set-output commands

* bump explosion-bot to run on ubuntu-latest
2022-11-02 15:36:30 +01:00
Adriane Boyd
420b1d854b
Update textcat scorer threshold behavior (#11696)
* Update textcat scorer threshold behavior

For `textcat` (with exclusive classes) the scorer should always use a
threshold of 0.0 because there should be one predicted label per doc and
the numeric score for that particular label should not matter.

* Rename to test_textcat_multilabel_threshold

* Remove all uses of threshold for multi_label=False

* Update Scorer.score_cats API docs

* Add tests for score_cats with thresholds

* Update textcat API docs

* Fix types

* Convert threshold back to float

* Fix threshold type in docstring

* Improve formatting in Scorer API docs
2022-11-02 15:35:04 +01:00
Adriane Boyd
f7edd84b44
Switch CI to Python 3.11.0 (#11737) 2022-11-02 13:42:20 +01:00
Aaron Zipp
d25f09468c
Spelling mistake in rule-based-matching.md (#11717)
Changed retokenize to retokenizer
2022-10-31 13:27:12 +09:00
Paul O'Leary McCann
d61e742960
Handle Docs with no entities in EntityLinker (#11640)
* Handle docs with no entities

If a whole batch contains no entities it won't make it to the model, but
it's possible for individual Docs to have no entities. Before this
commit, those Docs would cause an error when attempting to concatenate
arrays because the dimensions didn't match.

It turns out the process of preparing the Ragged at the end of the span
maker forward was a little different from list2ragged, which just uses
the flatten function directly. Letting list2ragged do the conversion
avoids the dimension issue.

This did not come up before because in NEL demo projects it's typical
for data with no entities to be discarded before it reaches the NEL
component.

This includes a simple direct test that shows the issue and checks it's
resolved. It doesn't check if there are any downstream changes, so a
more complete test could be added. A full run was tested by adding an
example with no entities to the Emerson sample project.

* Add a blank instance to default training data in tests

Rather than adding a specific test, since not failing on instances with
no entities is basic functionality, it makes sense to add it to the
default set.

* Fix without modifying architecture

If the architecture is modified this would have to be a new version, but
this change isn't big enough to merit that.
2022-10-28 10:25:34 +02:00
Paul O'Leary McCann
6b78135b9e
Add warning to install widget for M1 GPUs (#11666)
* Add warning to install widget for M1 GPUs

* Use Thinc tracking issue instead

* Update website/src/widgets/quickstart-install.js

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Underline URL in warning

* Update website/src/widgets/quickstart-install.js

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Don't install cupy on m1 gpus

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-10-27 15:08:24 +02:00
Adriane Boyd
865691d169
Adjust default attrs for textcat configs (#11698) 2022-10-26 08:43:00 +02:00
Ryn Daniels
a9139907a9
update github actions to deal with deprecations (#11702) 2022-10-26 08:15:13 +02:00
Adriane Boyd
0a9859ba01
Reduce python 3.10 in CI to one OS (#11703) 2022-10-25 19:38:23 +02:00
Adriane Boyd
8740e4341f
Update languages and version in README and website (#11694) 2022-10-25 14:54:54 +02:00
Adriane Boyd
88d35450dc
Rename test helper method with non-test_ name (#11701) 2022-10-25 14:53:18 +02:00
github-actions[bot]
84d9cb6b38
Auto-format code with black (#11687)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-10-21 11:54:17 +02:00
Adriane Boyd
fb280001cc
Merge pull request #11678 from adrianeboyd/chore/update-develop-from-master-v3.5
Update develop from master before v3.5
2022-10-20 15:45:19 +02:00
Adriane Boyd
6c380d4fc6 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5 2022-10-20 13:45:17 +02:00
Adriane Boyd
7e56701057 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5 2022-10-20 13:38:49 +02:00
Cellan Hall
b69d249a22
Adding spacy-cleaner to the spaCy universe (#11674)
* added spacy-cleaner to the spaCy universe

* Move data to righ section of universe.json

* Cleanup

- fix typo ("replacers")
- spaCy doesn't need to be marked as code
- lemma of "Hello" is lower case

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-10-20 20:38:29 +09:00
Paul O'Leary McCann
bf83f6872a
Add detailed example of env dict usage (#11677)
* Add detailed example of env dict usage

* Mark code blocks as yaml
2022-10-20 20:35:03 +09:00
Adriane Boyd
3d0e895363
Set version to v3.4.2 (#11672) 2022-10-19 17:33:55 +02:00
Edward
d66ccb8eb0
Fix multiple entries per custom extension in doc json (#11551)
* Fix multiple extensions and character offset

* Rename token_start/end to start/end

* Refactor Doc.from_json based on review

* Iterate over user_data items

* Only add non-empty underscore entries

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-10-19 15:52:47 +02:00
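
A minimal sketch of serializing custom extensions with `Doc.to_json`, which the fix above makes reliable for multiple entries; the `my_attr` extension is an illustrative name:

```python
import spacy
from spacy.tokens import Doc

Doc.set_extension("my_attr", default=None)
nlp = spacy.blank("en")
doc = nlp("Hello world")
doc._.my_attr = "example"
print(doc.to_json(underscore=["my_attr"])["_"])  # {'my_attr': 'example'}
```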
Adriane Boyd
a1eacaa8db
Add python 3.11.0rc2 to CI (#11667) 2022-10-18 14:36:06 +02:00
Paul O'Leary McCann
858565a567
Fix issues with DVC commands (#11592)
* Fix flag handling in dvc

Prior to this commit, if a flag (--verbose or --quiet) was passed to
DVC, it would be added to the end of the generated dvc command line.
This would result in the command being interpreted as part of the actual
command to run, rather than an argument to dvc. This would result in
command lines like:

    spacy project run preprocess --verbose

That would fail with an error that there's no such directory as
`--verbose`.

This change puts the flags at the front of the dvc command so that they
are interpreted correctly. It removes the `run_dvc_commands` function,
which had been reduced to just a for loop and wasn't used elsewhere.

A separate problem is that there's no way to specify the quiet behaviour
to dvc from the command line, though it's unclear if that's a bug.

* Add dvc quiet flag to docs

* Handle case in DVC where no commands are appropriate

If you only have commands with no deps or outputs (admittedly unlikely), you
get a weird error about the dvc file not existing. This gives explicit
output instead.

* Add support for quiet flag

* Fix command execution

Commands are strings now because they're joined further up.
2022-10-18 15:11:39 +09:00
Sofie Van Landeghem
2ce6aadda2
update default configs to recent versions (#11618) 2022-10-17 12:10:03 +02:00
github-actions[bot]
ceb62352bf
Auto-format code with black (#11649)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-10-14 18:04:55 +09:00
Adriane Boyd
6b5a3e7219
Extend to pydantic v1.10 (#11635)
* Update types in `spacy.schemas` for updated pydantic+mypy
2022-10-14 08:16:49 +02:00
Sofie Van Landeghem
4d869fcc11
Small fixes to docstrings (#11610)
* add missing scorer arg to docstring

* fix class names in textcat_multilabel

* add missing scorer to docstrings
2022-10-12 15:17:40 +02:00
Adriane Boyd
fe06e037bc
Fix init for pymorphy2_lookup lemmatizer mode (#11631) 2022-10-12 12:18:39 +02:00
Paul O'Leary McCann
2e52479eec
Fix example code for spacy-wordnet (#11593)
* Fix example code for spacy-wordnet

It looks like in the most recent version, 0.1.0, it's no longer possible
to pass the lang parameter to the component separately. Doing so will
raise an error.

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Cleanup

* More cleanup

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-10-11 16:45:05 +02:00
Sofie Van Landeghem
29649589fc
remove dtype (#11615) 2022-10-11 15:25:05 +02:00
Sofie Van Landeghem
ef74f8f5e4
Fix mypy error in edittree lemmatizer (#11612)
* cleanup imports

* try limiting Thinc to previous release

* remove Model specification

* fix code and revert Thinc constraint
2022-10-11 14:15:22 +02:00
Adriane Boyd
8cd77dd54c
Sync flake8 version across requirements (#11580) 2022-10-04 11:23:04 +02:00
Sofie Van Landeghem
b187076a2d
fix docs (#11573) 2022-10-03 17:01:04 +02:00
Sofie Van Landeghem
3033babe98
Merge pull request #11571 from svlandeg/copy_develop
update develop with latest from master, incl CI fix
2022-10-03 14:05:51 +02:00
svlandeg
83425d4f6f Merge branch 'copy_master' into copy_develop 2022-10-03 13:06:31 +02:00
Sofie Van Landeghem
70e21dfcad
PR to test importlib-metadata (#11569)
* empty commit

* restrict importlib-metadata to lower than 5.0.0

* restrict importlib-metadata also for validate CI step

* set fixed version for CI

* try flake8 5.0.4 in CI validation step

* from importlib-metadata from requirements again
2022-10-03 13:04:03 +02:00
Paul O'Leary McCann
087cc74c6a
Remove mention of 1.7 from issue template (#11570)
It's rare to have anyone using v1 anymore, so this message is no longer
helpful.
2022-10-03 11:53:21 +02:00
Sofie Van Landeghem
bf6e43ab2f
Merge pull request #11563 from svlandeg/develop_copy
update develop with latest from master
2022-10-03 09:34:38 +02:00
svlandeg
9c8cdb403e Merge branch 'master_copy' into develop_copy 2022-09-30 15:40:26 +02:00
Gabriele Picco
ff9002b726
Add Zshot Spacy plugin (#11557)
* Add Zshot Spacy plugin

Add Zshot (Zero and Few shot named entity & relationships recognition) Spacy plugin

* Update website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-29 17:34:44 +02:00
Sofie Van Landeghem
bcda8bc1e7
update mypy to latest version (#11546)
* update mypy and disable it for python 3.6

* ignoring mypy's type redefinition error
2022-09-29 14:24:40 +02:00
Paul O'Leary McCann
ba63f57f81
Update docs to reflect Doc input to Language (#11555) 2022-09-29 18:50:29 +09:00
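
A minimal sketch of the behavior the docs update above describes, passing a pre-constructed `Doc` through the pipeline:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp.make_doc("Hello world")  # tokenization only
doc = nlp(doc)  # Language.__call__ also accepts a Doc and runs the pipeline on it
print([t.text for t in doc])
```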
Adriane Boyd
6d7630c5d3
Allow overriding spacy_version in spacy package meta (#11552) 2022-09-29 10:44:06 +02:00
Peter Baumgartner
e794d4ae39
debug data Spancat Table Improvements (#11504)
* update

* fix format function

* pull out _format_number

* format with black
2022-09-28 17:16:05 +02:00
Raphael Mitsch
aea16719be
Simplify and clarify enable/disable behavior of spacy.load() (#11459)
* Change enable/disable behavior so that arguments take precedence over config options. Extend error message on conflict. Add warning message in case of overwriting config option with arguments.

* Fix tests in test_serialize_pipeline.py to reflect changes to handling of enable/disable.

* Fix type issue.

* Move comment.

* Move comment.

* Issue UserWarning instead of printing wasabi message. Adjust test.

* Added pytest.warns(UserWarning) for expected warning to fix tests.

* Update warning message.

* Move type handling out of fetch_pipes_status().

* Add global variable for default value. Use id() to determine whether used values are default value.

* Fix default value for disable.

* Rename DEFAULT_PIPE_STATUS to _DEFAULT_EMPTY_PIPES.
2022-09-27 14:22:36 +02:00
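
A minimal sketch of the clarified precedence: keyword arguments to `spacy.load()` now take precedence over the pipeline's config, and conflicts trigger a `UserWarning` (assumes `en_core_web_sm` is installed):

```python
import spacy

# disable=... wins over the pipeline's config; the component stays loaded
# but is skipped at runtime and can be re-enabled later.
nlp = spacy.load("en_core_web_sm", disable=["lemmatizer"])
print(nlp.disabled)  # ['lemmatizer']
nlp.enable_pipe("lemmatizer")
```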
Taniguchi Yasufumi
9557b0fb01
Add spacy-partial-tagger to spaCy Universe (#11538) 2022-09-27 14:11:50 +02:00
Jacobo Myerston
3e8bc1272f
add punctuation to grc (#11426)
* add punctuation to grc

Add support for special editorial punctuation that is common in ancient Greek texts.  Ancient Greek texts, as found in digital and print form, have been largely edited by scholars. Restorations and improvements are normally marked with special characters that need to be handled properly by the tokenizer.

* add unit tests

* simplify regex

* move generic quotes to char classes

* rename unit test

* fix regex

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-27 11:38:56 +02:00
Paul O'Leary McCann
a44b7d4622
Add experimental coref docs (#11291)
* Add experimental coref docs

* Docs cleanup

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply changes from code review

* Fix prettier formatting

It seems a period after a number made this think it was a list?

* Update docs on examples for initialize

* Add docs for coref scorers

* Remove 3.4 notes from coref

There won't be a "new" tag until it's in core.

* Add docs for span cleaner

* Fix docs

* Fix docs to match spacy-experimental

These weren't properly updated when the code was moved out of spacy
core.

* More doc fixes

* Formatting

* Update architectures

* Fix links

* Fix another link

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2022-09-27 18:11:23 +09:00
Adriane Boyd
877671e09a
Preserve missing entity annotation in augmenters (#11540)
Preserve both `-` and `O` annotation in augmenters rather than relying
on `Example.to_dict`'s default support for one option outside of labeled
entity spans.

This is intended as a temporary workaround for augmenters for v3.4.x.
The behavior of `Example` and related IOB utils could be improved in the
general case for v3.5.
2022-09-27 10:16:51 +02:00
Paul O'Leary McCann
936a5f0506
Fix English pipeline names in 3.4 release notes (#11542) 2022-09-27 08:25:24 +02:00
Richard Hudson
6f692a06d5
Remove side effects from Doc.__init__() (#11506)
* Remove side effects from Doc.__init__()

* Changes based on review comment

* Readd test

* Change interface of Doc.__init__()

* Simplify test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update doc.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-26 15:58:21 +02:00
Basile Dura
f40d2fac29
fix: remove duplicate v3.2 (#11530) 2022-09-23 13:18:51 +02:00
Raphael Mitsch
af9b01ef97
Add dependency check to project step runs (#11226)
* Add dependency check to project step running.

* Fix dependency mismatch warning.

* Remove newline.

* Add types-setuptools to setup.cfg.

* Move types-setuptools to test requirements. Move warnings into _validate_requirements(). Handle file reading in project_run().

* Remove newline formatting for output of package conflicts.

* Show full version conflict message instead of just package name.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix typo.

* Re-add rephrasing of message for conflicting packages. Remove requirements path redundancy.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Print unified message for requirement conflicts and missing requirements.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix warning message.

* Print conflict/missing messages individually.

* Print conflict/missing messages individually.

* Add check_requirements setting in project.yml to disable requirements check.

* Update website/docs/usage/projects.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/usage/projects.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update description of project.yml structure in projects.md.

* Update website/docs/usage/projects.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Prettify projects docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-16 16:54:31 +02:00
github-actions[bot]
279358be63
Auto-format code with black (#11513)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-16 11:50:19 +02:00
Sofie Van Landeghem
df0b815c23
more explicit Example constructor example (#11489)
* make constructor example for Example more explicit

* shorten example and add spaces
2022-09-16 09:26:33 +02:00
Sofie Van Landeghem
d5c8498f2f
disable mypy run for Python 3.10 (#11508) (#11511) 2022-09-15 17:41:25 +02:00
Sofie Van Landeghem
0509f90874
add dot (#11500) 2022-09-15 17:29:42 +02:00
Sofie Van Landeghem
ca1ad67458
disable mypy run for Python 3.10 (#11508) 2022-09-15 15:51:19 +02:00
Adriane Boyd
7c98245c0c
Add levenshtein from polyleven (#11418)
Add a simple levenshtein distance function using the implementation from
the polyleven library as `spacy.matcher.levenshtein`.
2022-09-14 17:05:22 +02:00
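A minimal usage sketch of the new helper (nothing assumed beyond the import path named above):

```
from spacy.matcher import levenshtein

# edit distance between two strings, backed by the polyleven implementation
assert levenshtein("kitten", "sitting") == 3
```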
Richard Hudson
3f0c3ad7d3
Correct alignment example and documentation (#11491)
* Correct example and documentation

* Added altered example.md

* Changes based on review + apply prettier

* Remove unnecessary 'the'

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2022-09-14 09:36:55 +02:00
Adriane Boyd
6be6913ba5
Update cupy extras (#11279)
* Update cupy extras:

* Extend to v11
* Add `cupy-cuda11x` and `cupy-wheel`
* Update quickstart to use `cupy-wheel` for CUDA 10.2+

* Rename cuda-wheel to cuda-autodetect, remove repeated CUDA in menu
2022-09-13 09:04:53 +02:00
Sofie Van Landeghem
cc10a27c59
Prevent tok2vec to broadcast to listeners when predicting (#11385)
* replicate bug with tok2vec in annotating components

* add overfitting test with a frozen tok2vec

* remove broadcast from predict and check doc.tensor instead

* remove broadcast

* proper error

* slight rephrase of documentation
2022-09-12 15:36:48 +02:00
Madeesh Kannan
0ec9a696e6
Fix config validation failures caused by NVTX pipeline wrappers (#11460)
* Enable Cython<->Python bindings for `Pipe` and `TrainablePipe` methods

* `pipes_with_nvtx_range`: Skip hooking methods whose signature cannot be ascertained

When loading pipelines from a config file, the arguments passed to individual pipeline components are validated by `pydantic` during init. For this, the validation model attempts to parse the function signature of the component's c'tor/entry point so that it can check if all mandatory parameters are present in the config file.

When using `models_and_pipes_with_nvtx_range` as an `after_pipeline_creation` callback, the methods of all pipeline components get replaced by an NVTX range wrapper **before** the above-mentioned validation takes place. This can be problematic for components that are implemented as Cython extension types - if the extension type is not compiled with Python bindings for its methods, they will have no signatures at runtime. This resulted in `pydantic` matching the *wrapper's* parameters with those in the config and raising errors.

To avoid this, we now skip applying the wrapper to any (Cython) methods that do not have signatures.
2022-09-12 14:55:41 +02:00
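For context, a sketch of how such a callback is wired in via config overrides; the callback name is the registered `spacy.models_and_pipes_with_nvtx_range.v1` (see the docs commit further down), while the exact override shape here is an assumption:

```
import spacy

# config override registering the NVTX callback after pipeline creation
config = {
    "nlp": {
        "after_pipeline_creation": {
            "@callbacks": "spacy.models_and_pipes_with_nvtx_range.v1"
        }
    }
}
# requires an installed pipeline package and an NVTX-capable setup:
# nlp = spacy.load("en_core_web_sm", config=config)
```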
kadarakos
6b83fee58d
Assets message (#11458)
* new error message when 'project run assets'

* new error message when 'project run assets'

* Update spacy/cli/project/run.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-09 17:17:10 +02:00
Adriane Boyd
8a86a35eab
Remove has_letters in config template (#11465)
Due to problems with the javascript conversion in the website
quickstart, remove the `has_letters` setting to simplify generating
`attrs` for the default `tok2vec`.

Additionally reduce `PREFIX` as in the trained pipelines.
2022-09-09 15:10:04 +02:00
github-actions[bot]
0c72c6bb2c
Auto-format code with black (#11468)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-09 11:21:17 +02:00
Madeesh Kannan
aac9a58c29
Add docs for the spacy.models_and_pipes_with_nvtx_range.v1 callback (#11463)
* Add docs for the `spacy.models_and_pipes_with_nvtx_range.v1` callback

* Add `new` tag
2022-09-09 10:46:01 +02:00
Paul O'Leary McCann
2602a30d32
Fix DVC command example (#11457)
This command doesn't have the project dir, but it's required.
2022-09-08 13:42:47 +02:00
Raphael Mitsch
1f23c615d7
Refactor KB for easier customization (#11268)
* Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups.

* Fix tests. Add distinction w.r.t. batch size.

* Remove redundant and add new comments.

* Adjust comments. Fix variable naming in EL prediction.

* Fix mypy errors.

* Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add error messages to NotImplementedErrors. Remove redundant comment.

* Fix imports.

* Remove redundant comments.

* Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase.

* Fix tests.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move KB into subdirectory.

* Adjust imports after KB move to dedicated subdirectory.

* Fix config imports.

* Move Candidate + retrieval functions to separate module. Fix other, small issues.

* Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions.

* Update spacy/kb/kb_in_memory.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typing.

* Change typing of mentions to be Span instead of Union[Span, str].

* Update docs.

* Update EntityLinker and _architecture docs.

* Update website/docs/api/entitylinker.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Adjust message for E1046.

* Re-add section for Candidate in kb.md, add reference to dedicated page.

* Update docs and docstrings.

* Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs.

* Update spacy/kb/candidate.pyx

* Update spacy/kb/kb_in_memory.pyx

* Update spacy/pipeline/legacy/entity_linker.py

* Remove canididate.md. Remove mistakenly added config snippet in entity_linker.py.

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-08 10:38:07 +02:00
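A short sketch of the renamed class in use, assuming the post-refactor import path `spacy.kb.InMemoryLookupKB`:

```
import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.blank("en")
# the concrete lookup-table KB; KnowledgeBase is now the abstract base class
kb = InMemoryLookupKB(nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", freq=12, entity_vector=[1.0, 2.0, 3.0])
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[1.0])
```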
Paul O'Leary McCann
515d5c65d5
Add dev docs on satellite packages (#11435)
* Add dev docs on satellite packages

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add displacy link

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-07 15:24:22 +02:00
Adriane Boyd
f292569b1a
Merge pull request #11444 from shadeMe/merge-master-into-develop
Merge `master` into `develop`
2022-09-06 19:58:21 +02:00
shademe
21000ae935
Merge branch 'master' into merge-master-into-develop 2022-09-06 17:50:07 +02:00
Paul O'Leary McCann
ff0522f8da Fix asent pip package name 2022-09-06 19:19:05 +09:00
Sofie Van Landeghem
d801cccd38
Merge pull request #11430 from rmitsch/chore/synch-develop
Synch develop with master
2022-09-05 15:07:18 +02:00
Paul O'Leary McCann
977dc33312
Add a way to get the URL to download a pipeline to the CLI (#11175)
* Add a dry run flag to download

* Remove --dry-run, add --url option to `spacy info` instead

* Make mypy happy

* Print only the URL, so it's easier to use in scripts

* Don't add the egg hash unless downloading an sdist

* Update spacy/cli/info.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add two implementations of requirements

* Clean up requirements sample slightly

This should make mypy happy

* Update URL help string

* Remove requirements option

* Add url option to docs

* Add URL to spacy info model output, when available

* Add types-setuptools to testing reqs

* Add types-setuptools to requirements

* Add "compatible", expand docstring

* Update spacy/cli/info.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Run prettier on CLI docs

* Update docs

Add a sidebar about finding download URLs, with some examples of the new
command.

* Add download URLs to table on model page

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Updates from review

* download url -> download link

* Update docs

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-02 11:58:21 +02:00
github-actions[bot]
71884d0942
Auto-format code with black (#11427)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-02 11:43:20 +02:00
Madeesh Kannan
d1760ebe02
Better handling of unexpected types in SetPredicate (#11312)
* `Matcher`: Better type checking of values in `SetPredicate`
`SetPredicate`: Emit warning and return `False` on unexpected value types

* Rename `value_type_mismatch` variable

* Inline warning

* Remove unexpected type warning from `_SetPredicate`

* Ensure that `str` values are not interpreted as sequences
Check elements of sequence values for convertibility to `str` or `int`

* Add more `INTERSECT` and `IN` test cases

* Test for inputs with multiple characters

* Return `False` early instead of using a boolean flag

* Remove superfluous `int` check, parentheses

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

* Clarify test comment

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-02 09:09:48 +02:00
Adriane Boyd
78f5503a29
Check for any non-Doc returned value for components (#11424) 2022-09-01 19:37:23 +02:00
Madeesh Kannan
604a7c3c26
SpanGroup(s)-related optimizations (#11380)
* `SpanGroup`: Add support for binding copies to a new reference document

* `SpanGroups`: Replace superfluous serialize-deserialize roundtrip in `copy`

Instead, directly copy the in-memory representations of the constituent `SpanGroup`s.

* Update `SpanGroup.copy()` signature

* Rename `new_doc` param to `doc`

* Fix kwarg

* Update `.pyi` file and docstrings

* `mypy` fix

* Update spacy/tokens/span_group.pyx

* Update docs

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-31 09:03:20 +02:00
Sofie Van Landeghem
8fc0efc502
Allow string argument for disable/enable/exclude (#11406)
* adding unit test for spacy.load with disable/exclude string arg

* allow pure strings in from_config

* update docs

* upstream type adjustements

* docs update

* make docstring more consistent

* Update spacy/language.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* two more cleanups

* fix type in internal method

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-31 09:02:34 +02:00
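A minimal sketch of the new string form (assumes an installed `en_core_web_sm`):

```
import spacy

# a plain string is now accepted where a list was previously required
nlp = spacy.load("en_core_web_sm", disable="ner")
assert "ner" in nlp.disabled
```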
Daniël de Kok
3f4b4b7b4f
Fix test_{prefer,require}_gpu (#11390)
* Fix `test_{prefer,require}_gpu`

These tests assumed that GPUs are only supported with CuPy, but since Thinc 8.1
we also support Metal Performance Shaders.

* test_misc: arrange thinc imports to be together
2022-08-30 14:21:02 +02:00
Patrick J. Burns
5ae63b1fbd
Add Latin language support (#11349)
* Add lang folder for la (Latin)

* Add Latin lang classes

* Add minimal tokenizer exceptions

* Add minimal stopwords

* Add minimal lex_attrs

* Update stopwords, tokenizer exceptions

* Add la tests; register la_tokenizer in conftest.py

* Update spacy/lang/la/lex_attrs.py

Remove duplicate form in Latin lex_attrs

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update natto-py version spec (#11222)

* Update natto-py version spec

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add scorer to textcat API docs config settings (#11263)

* Update docs for pipeline initialize() methods (#11221)

* Update documentation for dependency parser

* Update documentation for trainable_lemmatizer

* Update documentation for entity_linker

* Update documentation for ner

* Update documentation for morphologizer

* Update documentation for senter

* Update documentation for spancat

* Update documentation for tagger

* Update documentation for textcat

* Update documentation for tok2vec

* Run prettier on edited files

* Apply similar changes in transformer docs

* Remove need to say annotated example explicitly

I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.

* Run prettier on transformer docs

* chore: add 'concepCy' to spacy universe (#11255)

* chore: add 'concepCy' to spacy universe

* docs: add 'slogan' to concepCy

* Support full prerelease versions in the compat table (#11228)

* Support full prerelease versions in the compat table

* Fix types

* adding spans to doc_annotation in Example.to_dict (#11261)

* adding spans to doc_annotation in Example.to_dict

* to_dict compatible with from_dict: tuples instead of spans

* use strings for label and kb_id

* Simplify test

* Update data formats docs

Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix regex invalid escape sequences (#11276)

* Add W605 to the errors raised by flake8 in the CI (#11283)

* Clean up automated label-based issue handling (#11284)

* Clean up automated label-based issue handling

1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config

* Use old, longer message

* Fix label name

* Fix Dutch noun chunks to skip overlapping spans (#11275)

* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950)

* add in spans example and parse references

* rm autoformatter

* rm extra ents copy

* TypedDict draft

* type fixes

* restore non-documentation files

* docs update

* fix spans example

* fix hyperlinks

* add parse example

* example fix + argument fix

* fix api arg in docs

* fix bad variable replacement

* fix spacing in style

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* fix spacing on table

* fix spacing on table

* rm temp files

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* include span_ruler for default warning filter (#11333)

* Add uk pipelines to website (#11332)

* Check for . in factory names (#11336)

* Make fixes for PR #11349

* Fix roman numeral coverage in #11349

Co-authored-by: Patrick J. Burns <patricks@diyclassics.org>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com>
Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com>
Co-authored-by: stefawolf <wlf.ste@gmail.com>
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
2022-08-30 14:04:54 +02:00
Paul O'Leary McCann
aafee5e1b7
Fix lookup usage in French/Catalan (fix #11347) (#11382)
* Fix lookup usage (fix #11347)

Before using the lookups table in the French (and Catalan) lemmatizers,
there's a check to see if the current term is in the table. But it's
checking a string against hashes, so it's always false. Also the table
lookup function is designed so you don't have to do that anyway.

* Use the lookup table directly

* Use string, not token
2022-08-29 10:32:38 +02:00
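A rough sketch of the pattern the fix moves to, with illustrative data: `Table.get` hashes the string key itself, so no separate membership check against the hashed keys is needed.

```
from spacy.lookups import Lookups

lookups = Lookups()
table = lookups.add_table("lemma_lookup", {"feet": "foot"})

# the table handles hashing and the fallback default internally
assert table.get("feet", "feet") == "foot"
assert table.get("hands", "hands") == "hands"
```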
Edward
6723d76f24
Add ConsoleLogger.v2 (#11214)
* Init

* Change logger to ConsoleLogger.v2

* adjust naming

* More naming adjustments

* Fix output_file reference error

* ignore type

* Add basic test for logger

* Hopefully fix mypy issue

* mypy ignore line

* Update mypy line

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update test method name

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Change file saving logic

* Fix finalize method

* increase spacy-legacy version in requirements

* Update docs

* small adjustments

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-29 10:23:05 +02:00
Adriane Boyd
ba33200979
Remove pathy from pyproject.toml (#11383) 2022-08-26 16:07:16 +02:00
Paul O'Leary McCann
7a2c58864c
Move deps outside explosion to "third-party" (#11381) 2022-08-26 10:23:10 +02:00
Adriane Boyd
6fd3b4d9d6
Merge pull request #11375 from adrianeboyd/chore/update-develop-from-master-v3.5-1
Update develop from master for v3.5
2022-08-24 20:41:25 +02:00
Adriane Boyd
81874265e9 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5-1 2022-08-24 12:47:42 +02:00
Tobius Saul
c09d2fa25b
luganda language extension (#10847)
* luganda language extension

* __init__.py changes

* New enhancements

* Lexical attribute changed

* punctuation and sentence additions

* Remove comment header

* Fix typos, reformat

* reformatted version

* Add tokenizer test

* Remove contractions from stop words

* Format

* Add Luganda to website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 13:09:36 +02:00
Edward
5afa98aabf
Support custom attributes for tokens and spans in json conversion (#11125)
* Add token and span custom attributes to to_json()

* Change logic for to_json

* Add functionality to from_json

* Small adjustments

* Move token/span attributes to new dict key

* Fix test

* Fix the same test but much better

* Add backwards compatibility tests and adjust logic

* Add test to check if attributes not set in underscore are not saved in the json

* Add tests for json compatibility

* Adjust test names

* Fix tests and clean up code

* Fix assert json tests

* small adjustment

* adjust naming and code readability

* Adjust naming, added more tests and changed logic

* Fix typo

* Adjust errors, naming, and small test optimization

* Fix byte tests

* Fix bytes tests

* Change naming and json structure

* update schema

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update schema for underscore attributes

* Adjust underscore schema

* adjust schema tests

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 10:05:02 +02:00
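In brief, the round trip this enables; `review_id` is an illustrative extension name, and the sketch assumes a blank pipeline:

```
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
Doc.set_extension("review_id", default=None)

doc = nlp("Great phone")
doc._.review_id = "r-17"

# custom underscore attributes can now survive a JSON round trip
data = doc.to_json(underscore=["review_id"])
doc2 = Doc(nlp.vocab).from_json(data)
assert doc2._.review_id == "r-17"
```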
Tal Zussman
7e75327893
Fix menu order in linguistic-features.md (#11364)
Swap 'Vectors & Similarity' and 'Mappings & Exceptions' in menu to match order in body
2022-08-23 14:40:38 +09:00
Sofie Van Landeghem
6e20842370
dev docs: numeric comparators (#11334)
* add section on numeric comparators

* edit

* prettier

* Update extra/DEVELOPER_DOCS/Code Conventions.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* note on typing imports

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-22 15:52:53 +02:00
Adriane Boyd
f55bb7470d
Clean up warnings in the test suite (#11331) 2022-08-22 12:04:30 +02:00
Paul O'Leary McCann
0f07defe2c
Remove reference to voting on issue (#11335)
Not clear which issue this refers to, we don't suggest this for any
other issues, and we don't use votes in general.
2022-08-22 11:29:05 +02:00
Adriane Boyd
04c6e5cb95
Improve floret vectors display in pipeline docs (#11343) 2022-08-22 11:28:13 +02:00
Adriane Boyd
5fa8f4faca
Switch ru and uk lemmatizers to pymorphy3 (#11345)
* Switch ru and uk lemmatizers to pymorphy3

* Switch to pymorphy3 in tests
2022-08-22 11:27:14 +02:00
Adriane Boyd
3e4cf1bbe1
Check for . in factory names (#11336) 2022-08-19 09:52:12 +02:00
Adriane Boyd
09b3118b26
Add uk pipelines to website (#11332) 2022-08-18 14:04:57 +02:00
Sofie Van Landeghem
cab263791f
include span_ruler for default warning filter (#11333) 2022-08-17 19:55:54 +02:00
Peter Baumgartner
db7b9938a4
Docs: displaCy documentation - data types, parse_{deps,ents,spans}, spans example (#10950)
* add in spans example and parse references

* rm autoformatter

* rm extra ents copy

* TypedDict draft

* type fixes

* restore non-documentation files

* docs update

* fix spans example

* fix hyperlinks

* add parse example

* example fix + argument fix

* fix api arg in docs

* fix bad variable replacement

* fix spacing in style

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* fix spacing on table

* fix spacing on table

* rm temp files

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-16 11:23:34 -04:00
Adriane Boyd
ed4ad309e6
Fix Dutch noun chunks to skip overlapping spans (#11275)
* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-10 09:49:08 +02:00
Paul O'Leary McCann
231a17817d
Clean up automated label-based issue handling (#11284)
* Clean up automated label-based issue handling

1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config

* Use old, longer message

* Fix label name
2022-08-09 14:50:50 +02:00
Adriane Boyd
e700358ba0
Add W605 to the errors raised by flake8 in the CI (#11283) 2022-08-09 12:15:13 +02:00
Adriane Boyd
fc4246558b
Fix regex invalid escape sequences (#11276) 2022-08-09 10:59:36 +02:00
stefawolf
23749cfc91
adding spans to doc_annotation in Example.to_dict (#11261)
* adding spans to doc_annotation in Example.to_dict

* to_dict compatible with from_dict: tuples instead of spans

* use strings for label and kb_id

* Simplify test

* Update data formats docs

Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-05 12:26:38 +02:00
Luka Dragar
b64243ed55
Updates to Slovenian language (#11162)
* Added examples for Slovene

* Update spacy/lang/sl/examples.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Corrected a typo in one of the sentences

* Updated support for Slovenian

* Some minor changes to corrections

* Added forint currency

* Corrected HYPHENS_PERMITTED regex and some formatting

* Minor changes

* Un-xfail tokenizer test

* Format

Co-authored-by: Luka Dragar <D20124481@mytudublin.ie>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-05 10:10:18 +02:00
Adriane Boyd
b5d9d0897e
Merge pull request #11270 from adrianeboyd/chore/update-develop-v3.5
Prepare develop for v3.5
2022-08-04 21:17:26 +02:00
Adriane Boyd
a3f6d6bce1 Merge remote-tracking branch 'upstream/master' into develop 2022-08-04 18:19:28 +02:00
Adriane Boyd
b07708d5d0
Support full prerelease versions in the compat table (#11228)
* Support full prerelease versions in the compat table

* Fix types
2022-08-04 15:14:19 +02:00
Jules Belveze
cd09614ab2
chore: add 'concepCy' to spacy universe (#11255)
* chore: add 'concepCy' to spacy universe

* docs: add 'slogan' to concepCy
2022-08-04 15:42:38 +09:00
Lj Miranda
d993df41e5
Update docs for pipeline initialize() methods (#11221)
* Update documentation for dependency parser

* Update documentation for trainable_lemmatizer

* Update documentation for entity_linker

* Update documentation for ner

* Update documentation for morphologizer

* Update documentation for senter

* Update documentation for spancat

* Update documentation for tagger

* Update documentation for textcat

* Update documentation for tok2vec

* Run prettier on edited files

* Apply similar changes in transformer docs

* Remove need to say annotated example explicitly

I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.

* Run prettier on transformer docs
2022-08-03 16:53:02 +02:00
Adriane Boyd
d0578c2ede
Add scorer to textcat API docs config settings (#11263) 2022-08-03 16:41:20 +02:00
Paul O'Leary McCann
2d89dd9db8
Update natto-py version spec (#11222)
* Update natto-py version spec

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-07-28 07:45:02 +02:00
ninjalu
95a1b8aca6
add additional REL_OP (#10371)
* add additional REL_OP

* change to condition and new rel_op symbols

* add operators to docs

* add the anchor while we're in here

* add tests

Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
2022-07-27 13:16:44 +02:00
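For orientation, a minimal `DependencyMatcher` pattern showing where `REL_OP` lives; this uses the long-standing `>` operator, with the new operators from this PR slotting into the same field (assumes an installed `en_core_web_sm`):

```
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    # REL_OP relates this node to "verb"; the PR adds new operator symbols here
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
matcher.add("VERB_SUBJECT", [pattern])
matches = matcher(nlp("The cat chased the dog."))
```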
Madeesh Kannan
1829d7120a
ExplosionBot: Add note about case-sensitivity (#11211) 2022-07-27 14:24:22 +09:00
Edward
360a702ecd
Add parent argument (#11210) 2022-07-26 14:35:18 +02:00
Adriane Boyd
5c2a00cef0
Set version to v3.4.1 (#11209) 2022-07-26 12:52:38 +02:00
Adriane Boyd
c8f5b752bb
Add link to developer docs code conventions (#11171) 2022-07-26 10:56:53 +02:00
Daniël de Kok
4ee8a06149
Fix compatibility with CuPy 9.x (#11194)
After the precomputable affine table of shape [nB, nF, nO, nP] is
computed, padding with shape [1, nF, nO, nP] is assigned to the first
row of the precomputed affine table. However, when we are indexing the
precomputed table, we get a row of shape [nF, nO, nP]. CuPy versions
before 10.0 cannot paper over this shape difference.

This change fixes compatibility with CuPy < 10.0 by squeezing the first
dimension of the padding before assignment.
2022-07-26 10:52:01 +02:00
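A rough numpy illustration of the shape mismatch described above (numpy stands in for CuPy, and the dimension sizes are illustrative):

```
import numpy as np

nB, nF, nO, nP = 4, 3, 2, 5
cached = np.zeros((nB, nF, nO, nP), dtype="f")
padding = np.zeros((1, nF, nO, nP), dtype="f")

# cached[0] has shape (nF, nO, nP); CuPy < 10.0 cannot paper over the
# leading 1-dimension of the padding, so squeeze it before assigning
cached[0] = padding.squeeze(0)
```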
Adriane Boyd
36ff2a5441
Merge pull request #11200 from adrianeboyd/chore/reenable-model-tests
Revert "Temporarily skip tests that require models/compat"
2022-07-25 20:13:44 +02:00
Adriane Boyd
e5990db713 Revert "Temporarily skip tests that require models/compat"
This reverts commit d9320db7db.
2022-07-25 18:12:18 +02:00
Paul O'Leary McCann
1c12812d1a
Replace link to old label (#11188) 2022-07-25 16:39:34 +09:00
Adriane Boyd
7a99fe3c65
Move sent-patterns to correct section of universe.json (#11192) 2022-07-25 09:14:50 +02:00
0xpeIpeI
93960dc4b5
[universe project] create English interpretation project (#11184)
* [add] my universe project setting

* [modify] A few adjustments

* [Modify] change package description
2022-07-24 19:01:04 +09:00
Dan Radenkovic
a5aa3a818f
fix docs (#11123) 2022-07-24 17:16:36 +09:00
Lucas Terriel
7ff52c02a1
Update meta for spacyfishing in spaCy Universe (#11185)
* add new logo for spacyfishing to update spacy universe

* change logo location
2022-07-24 17:10:29 +09:00
Maarten Grootendorst
1caa2d1d16
Added BERTopic to Spacy Universe (#11159)
* Added BERTopic to Spacy Universe

* Fix no render of visualization
2022-07-19 19:37:18 +09:00
Adriane Boyd
2235e3520c
Update binder version in docs (#11124) 2022-07-12 15:20:33 +02:00
Nicolai Bjerre Pedersen
2fa983aa2e
Fix span typings (#11119)
Add id, id_ to span.pyi.
2022-07-12 13:47:35 +02:00
Adriane Boyd
11f859c132
Docs for v3.4 (#11057)
* Add draft of v3.4 usage

* Add Croatian models

* Add Matcher min/max

* Update release notes

* Minor edits

* Add updates, tables

* Update pydantic/mypy versions

* Update version in README

* Fix sidebar
2022-07-11 15:36:31 +02:00
Adriane Boyd
d583626a82
Update build setup for aarch64 (#11112)
* Extend build constraints for aarch64

* Skip mypy for aarch64
2022-07-11 13:29:35 +02:00
Adriane Boyd
5cb6f1ae51
CI: Install with two parallel build jobs (#11111) 2022-07-11 12:20:00 +02:00
Adriane Boyd
3701039c1f
Tweak build jobs setting, update install docs (#11077)
* Restrict SPACY_NUM_BUILD_JOBS to only override if set

* Update install docs
2022-07-08 19:21:17 +02:00
Peter Baumgartner
36cb2029a9
displaCy Spans Vertical Alignment Fix 2 (#11092)
* add in span render slot fix

* fix spacing off by one

* rm demo

* adjust comments

* fix whitespace and overlap issue
2022-07-08 19:20:13 +02:00
Richard Hudson
dc38a0f079
Change demo URL (#11102) 2022-07-08 19:19:48 +02:00
Adriane Boyd
66d6461c8f
Use thinc v8.1 (#11101) 2022-07-08 17:52:41 +02:00
Adriane Boyd
397197ec0e
Extend to mypy<0.970 (#11100) 2022-07-08 14:58:01 +02:00
Madeesh Kannan
f38aff4ec9
Add examples for new explosion bot commands (#11082)
* Add examples for new explosion bot commands

* Update extra/DEVELOPER_DOCS/ExplosionBot.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-07-08 13:36:12 +02:00
Adriane Boyd
be9e17c0e4
Add docs for compiling with build constraints (#11081) 2022-07-08 11:45:56 +02:00
github-actions[bot]
e7fd06bdbe
Auto-format code with black (#11099)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-07-08 18:43:25 +09:00
Nipun Sadvilkar
86ee26e3c2
Use pull_request_target event for spaCy universe GA trigger (#11097) 2022-07-07 19:43:50 +05:30
Nipun Sadvilkar
bb3e11b9a1
Github Action for spaCy universe project alert (#11090) 2022-07-07 17:50:30 +05:30
Kenneth Enevoldsen
7b220afc29
Added asent to spacy universe (#11078)
* Added asent to spacy universe

* Update addition of asent following correction
2022-07-07 13:25:25 +09:00
Schero1994
c7c3fb1d0c
Merge pull request #11074 from Schero1994/feature/remove
Batch #2 | spaCy universe cleanup
2022-07-06 10:39:04 +02:00
Daniël de Kok
a06cbae70d
precompute_hiddens/Parser: do not look up CPU ops (3.4) (#11069)
* precompute_hiddens/Parser: do not look up CPU ops

`get_ops("cpu")` is quite expensive. To avoid this, we want to cache the
result as in #11068. However, for 3.x we do not want to change the ABI.
So we avoid the expensive lookup by using NumpyOps. This should have a
minimal impact, since `get_ops("cpu")` was only used when the model ops
were `CupyOps`. If the ops are `AppleOps`, we are still passing through
the correct BLAS implementation.

* _NUMPY_OPS -> NUMPY_OPS
2022-07-05 10:53:42 +02:00
Adriane Boyd
78a84f0d78
Support env var for num build jobs (#11073) 2022-07-04 20:50:16 +02:00
Madeesh Kannan
d36d66b7ca
Increase test deadline to 30 minutes to prevent spurious test failures (#11070)
* Increase test deadline to 30 minutes to prevent spurious test failures

* Reduce deadline to 2 minutes
2022-07-04 18:37:09 +02:00
kadarakos
5240baccfe
dont use get_array_module (#11056) 2022-07-04 17:15:33 +02:00
Raphael Mitsch
e9eb59699f
NEL confidence threshold (#11016)
* Add base for NEL abstention threshold mechanism.

* Add abstention threshold to entity linker. Add test.

* Fix entity linking tests.

* Changed abstention default threshold from 0 to None.

* Fix default values for abstention thresholds.

* Fix mypy errors.

* Replace assertion with raise of proper error code.

* Simplify threshold check. Remove thresholding from EntityLinker_v1.

* Rename test.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Make E1043 configurable.

* Update docs.

* Rephrase description in docs. Adjusting error code message.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-07-04 17:05:21 +02:00
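Roughly how the option is meant to be used; the config key name `threshold` is an assumption based on the commit message, and the pipe would still need a KB and initialization before it can run:

```
import spacy

nlp = spacy.blank("en")
# per the commit, the default is None, i.e. no abstention
entity_linker = nlp.add_pipe("entity_linker", config={"threshold": 0.4})
```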
schaeran
b3165db41b remove universe object: spacy-langdetect 2022-07-04 16:07:18 +02:00
schaeran
4e8a5994df remove universe object: NLPre 2022-07-04 16:06:58 +02:00
schaeran
0e4a835468 remove universe object: num_fh 2022-07-04 16:06:38 +02:00
schaeran
5000a08a20 remove universe object: adam_qas 2022-07-04 16:06:20 +02:00
schaeran
60a35a2bb2 remove universe object: spacy_kenlm 2022-07-04 16:06:02 +02:00
schaeran
224f30c563 remove universe object: spacy-raspberry 2022-07-04 16:05:34 +02:00
schaeran
a9062ebf17 remove universe object: spacy-lookup 2022-07-04 16:05:11 +02:00
schaeran
9b823fc9e9 remove universe object: NeuroNER 2022-07-04 16:04:50 +02:00
schaeran
b94bcaa62f remove universe object: spacy-vis 2022-07-04 16:04:29 +02:00
schaeran
880e7db44e remove universe object: spacy_grammar 2022-07-04 16:04:06 +02:00
schaeran
6c036d1e25 remove universe object: spacy_hunspell 2022-07-04 16:03:30 +02:00
Madeesh Kannan
59c763eec1
StringStore-related optimizations (#10938)
* `strings`: More roubust type checking of keys/IDs, coerce `int`-like types to `hash_t`

* Preserve existing public API behaviour

* Fix return type

* Replace `bool` with `bint`, rename to `_try_coerce_to_hash`, replace `id` with `hash`

* Avoid unnecessary re-encoding and re-calculation of strings and hashes respectively

* Rename variables named `hash`
Add comment on early return
2022-07-04 15:04:03 +02:00
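The public behaviour being preserved, in brief:

```
from spacy.strings import StringStore

strings = StringStore()
h = strings.add("coffee")      # returns the 64-bit hash of the string
assert strings[h] == "coffee"  # int-like keys are coerced to hash_t
assert strings["coffee"] == h  # string keys resolve to their hash
```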
Paul O'Leary McCann
7c1bf2fa1f
Merge pull request #11062 from explosion/autoblack
Auto-format code with black
2022-07-03 14:35:53 +09:00
explosion-bot
7e55a51314 Auto-format code with black 2022-07-01 08:04:32 +00:00
Paul O'Leary McCann
e8fdbfc65e Minor fix in Lemmatizer docs 2022-07-01 14:28:03 +09:00
Madeesh Kannan
eaf66e7431
Add NVTX ranges to TrainablePipe components (#10965)
* `TrainablePipe`: Add NVTX range decorator

* Annotate `TrainablePipe` subclasses with NVTX ranges

* Export function signature to allow introspection of args in tests

* Revert "Annotate `TrainablePipe` subclasses with NVTX ranges"

This reverts commit d8684f7372.

* Revert "Export function signature to allow introspection of args in tests"

This reverts commit f4405ca3ad.

* Revert "`TrainablePipe`: Add NVTX range decorator"

This reverts commit 26536eb6b8.

* Add `spacy.pipes_with_nvtx_range` pipeline callback

* Show warnings for all missing user-defined pipe functions that need to be annotated
Fix imports, typos

* Rename `DEFAULT_ANNOTATABLE_PIPE_METHODS` to `DEFAULT_NVTX_ANNOTATABLE_PIPE_METHODS`
Reorder import

* Walk model nodes directly whilst applying NVTX ranges
Ignore pipe method wrapper when applying range
2022-06-30 11:28:12 +02:00
Adriane Boyd
3fe9f47de4
Revert "disable failing test because Stanford servers are down (#11015)" (#11054)
This reverts commit f8116078ce.
2022-06-30 11:24:54 +02:00
Adriane Boyd
3bc1fe0a78
Update cupy extras (#11055)
* Add cuda116 and cuda117 extras

* Revert "remove `cuda116` extra from install widget (#11012)"

This reverts commit e7b498fb1f.

* Add cuda117 to quickstart
2022-06-30 11:24:37 +02:00
Shen Qin
be00db6645
Addition of min_max quantifier in matcher {n,m} (#10981)
* Min_max_operators
1. Modified API and Usage for spaCy website to include min_max operator
2. Modified matcher.pyx to include min_max function {n,m} and its variants
3. Modified schemas.py to include min_max validation error
4. Added test cases to test_matcher_api.py, test_matcher_logic.py and test_pattern_validation.py

* attempt to fix mypy/pydantic compat issue

* formatting

* Update spacy/tests/matcher/test_pattern_validation.py

Co-authored-by: Source-Shen <82353723+Source-Shen@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-30 11:01:58 +02:00
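A minimal sketch of the new quantifier:

```
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# the new {n,m} form of OP: match 2 to 3 consecutive "very" tokens
pattern = [{"LOWER": "very", "OP": "{2,3}"}, {"LOWER": "good"}]
matcher.add("VERY_RUN", [pattern])
matches = matcher(nlp("this is very very good"))
```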
Adriane Boyd
4581a4f53f
Run mypy for python 3.10 (#11052) 2022-06-29 20:03:36 +02:00
Daniël de Kok
0ff14aabce
vectors: avoid expensive comparisons between numpy ints and Python ints (#10992)
* vectors: avoid expensive comparisons between numpy ints and Python ints

* vectors: avoid failure on lists of ints

* Convert another numpy int to Python
2022-06-29 12:58:31 +02:00
Peter Baumgartner
dd038b536c
fix to horizontal space (#10994) 2022-06-28 20:42:40 +02:00
Adriane Boyd
24f4908fce
Update vector handling in similarity methods (#11013)
Distinguish between vectors that are 0 vs. missing vectors when warning
about missing vectors.

Update `Doc.has_vector` to match `Span.has_vector` and
`Token.has_vector` for cases where the vocab has vectors but none of the
tokens in the container have vectors.
2022-06-28 19:50:47 +02:00
Madeesh Kannan
1d5cad0b42
Example.get_aligned_parse: Handle unit and zero length vectors correctly (#11026)
* `Example.get_aligned_parse`: Do not squeeze gold token idx vector
Correctly handle zero-size vectors passed to `np.vectorize`

* Add tests

* Use `Doc` ctor to initialize attributes

* Remove unintended change

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Remove unused import

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-28 19:42:58 +02:00
Richard Hudson
a9559e7435
Handle Cyrillic combining diacritics (#10837)
* Handle Russian, Ukrainian and Bulgarian

* Corrections

* Correction

* Correction to comment

* Changes based on review

* Correction

* Reverted irrelevant change in punctuation.py

* Remove unnecessary group

* Reverted accidental change
2022-06-28 15:35:32 +02:00
Zackere
8ffff18ac4
Try cloning repo from main & master (#10843)
* Try cloning repo from main & master

* fixup! Try cloning repo from main & master

* fixup! fixup! Try cloning repo from main & master

* refactor clone and check for repo:branch existence

* spacing fix

* make mypy happy

* type util function

* Update spacy/cli/project/clone.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-28 09:11:15 -04:00
Eric Holscher
308a612ec9
Remove simply (#11017)
I was reading this page, and as a relative beginner, nothing about it was simple :)
2022-06-27 09:45:22 +02:00
github-actions[bot]
4155a59d47
Auto-format code with black (#11022)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-27 09:35:35 +02:00
Adriane Boyd
738b38064f
Merge pull request #11021 from adrianeboyd/chore/v3.4.0
Set version to v3.4.0
2022-06-24 14:54:16 +02:00
Madeesh Kannan
8f1ba4de58
Backport parser/alignment optimizations from feature/refactor-parser (#10952) 2022-06-24 13:39:52 +02:00
Adriane Boyd
d9320db7db Temporarily skip tests that require models/compat 2022-06-24 11:20:53 +02:00
Adriane Boyd
bffe54d02b Set version to v3.4.0 2022-06-24 08:48:58 +02:00
Peter Baumgartner
9738b69c0e
Update Code Conventions.md (#11018) 2022-06-24 15:11:29 +09:00
Dmytro Sadovnychyi
4cd8b4cc22
Fix some of the broken links on universe pages (#11011)
Currently some of the "AUTHOR INFO" links (e.g. here[0]) are broken:

```
https://github.com/https://github.com/explosion
```

[0] https://spacy.io/universe/project/spacy-experimental


Also one remains broken with `https://szegedai.github.io/`.
2022-06-23 17:53:00 +02:00
Sofie Van Landeghem
f8116078ce
disable failing test because Stanford servers are down (#11015) 2022-06-23 10:57:46 +02:00
Adriane Boyd
d4e3f43639
Update thinc version to switch back to blis v0.7 (#11014) 2022-06-23 09:50:25 +02:00
Adriane Boyd
f1197d9175
Add API docs for token attribute symbols (#10836)
* Add API docs for token attribute symbols

* Remove NBSP's

* Fix typo

* Rephrase

Co-authored-by: svlandeg <svlandeg@github.com>
2022-06-23 08:16:38 +02:00
Peter Baumgartner
3335bb9d0c
remove cuda116 extra from install widget (#11012) 2022-06-23 08:15:28 +02:00
jademlc
bed23ff291
Update serialization methods code block (#11004)
* Update serialization methods code block

* Update website/docs/usage/saving-loading.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-22 20:45:26 +02:00
Sofie Van Landeghem
0fa004c4cd the 'new' indicator wants a 'number' (#10997) 2022-06-21 22:01:16 +02:00
Philip Vollet
1ae13b2a70
Merge pull request #10991 from Lucaterre/master
updated spacy universe for spacyfishing
2022-06-21 10:33:26 +02:00
Daniël de Kok
0271306f16
Use thinc-apple-ops>=0.1.0.dev0 with apple extras (#10904)
* Use thinc-apple-ops>=0.1.0.dev0 with `apple` extras

Also test with thinc-apple-ops that is at least 0.1.0.dev0.

* Check thinc-apple-ops on macOS with Python 3.10

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Use `pip install --pre` for installing thinc-apple-ops in CI

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-21 08:26:59 +02:00
Victoria
a08ca064e5
Update linguistic-features.md (#10993)
Change link for downloading fasttext word vectors
2022-06-21 15:03:41 +09:00
Lucaterre
2820d7dd8d correct typo in universe.json for 'code_example' key : pipe name 'entityfishing' 2022-06-20 15:26:23 +02:00
Lucaterre
cdad815c68 updated spacy universe for spacyfishing 2022-06-20 14:28:49 +02:00
Sofie Van Landeghem
f00254ae27
add counts to verbose list of NER labels (#10957) 2022-06-20 09:48:40 +02:00
Raphael Mitsch
4c058eb40a
enable argument for spacy.load() (#10784)
* Enable flag on spacy.load: foundation for include, enable arguments.

* Enable flag on spacy.load: fixed tests.

* Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests.

* Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added 'enable' to default config.

* Enable flag on spacy.load: added support for fields not in pipeline.

* Enable flag on spacy.load: removed serialization fields from supported fields.

* Enable flag on spacy.load: removed 'enable' from config again.

* Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes.

* Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests.

* Enable flag on spacy.load: comments w.r.t. resolution workarounds.

* Enable flag on spacy.load: remove include fields. Update website docs.

* Enable flag on spacy.load: updates w.r.t. changes in master.

* Implement Doc.from_json(): update docstrings.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): remove newline.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): change error message for E1038.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars.

* Enable flag on spacy.load: changed examples for enable flag.

* Remove newline.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix docstring for Language._resolve_component_status().

* Rename E1038 to E1042.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-17 20:24:13 +01:00
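A minimal sketch of the flag (assumes an installed `en_core_web_sm`):

```
import spacy

# run only the listed components; the rest are loaded but start disabled
nlp = spacy.load("en_core_web_sm", enable=["tagger", "attribute_ruler"])
```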
Sofie Van Landeghem
eaeca5eb6a
account for NER labels with a hyphen in the name (#10960)
* account for NER labels with a hyphen in the name

* cleanup

* fix docstring

* add return type to helper method

* shorter method and few more occurrences

* user helper method across repo

* fix circular import

* partial revert to avoid circular import
2022-06-17 20:02:37 +01:00
github-actions[bot]
6313787fb6
Auto-format code with black (#10977)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-17 19:41:55 +01:00
Raphael Mitsch
d50668dbf0
Made _initialize_X() methods private. (#10978) 2022-06-17 15:55:34 +02:00
Raphael Mitsch
a7f6bc5dfb
Workaround for Typer optional default values with Python calls (#10788)
* Workaround for Typer optional default values with Python calls: added test and workaround.

* @rmitsch Workaround for Typer optional default values with Python calls: reverting some black formatting changes.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* @rmitsch Workaround for Typer optional default values with Python calls: removing return type hint.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Workaround for Typer optional default values with Python calls: fixed imports, added GitHub issue marker.

* Workaround for Typer optional default values with Python calls: removed forcing of default values for optional arguments in init_config_cli(). Added default values for init_config(). Synchronized default values for init_config_cli() and init_config().

* Workaround for Typer optional default values with Python calls: removed unused import.

* Workaround for Typer optional default values with Python calls: fixed usage of optimize in init_config_cli().

* Workaround for Typer optional default values with Python calls: remove output_file from InitDefaultValues.

* Workaround for Typer optional default values with Python calls: rename class for default init values.

* Workaround for Typer optional default values with Python calls: remove newline.

* remove introduced newlines

* Remove test_init_config_from_python_without_optional_args().

* remove leftover import

* reformat import

* remove duplicate

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-17 12:15:36 +02:00
Daniël de Kok
3d3fbeda9f
Update for CBlas changes in Thinc 8.1.0.dev2 (#10970) 2022-06-16 11:42:34 +02:00
Daniël de Kok
0d352c46ed
vectors: remove use of float as row number (#10955)
The float -1 was returned rather than the integer -1 as the row for
unknown keys. This doesn't introduce a real bug, since such floats
cast (without issues) to int in the conversion to NumPy arrays. Still,
it's nice to do the correct thing :).
2022-06-15 15:32:02 +02:00
Madeesh Kannan
126d1db123
Add failing test: test_matcher_extension_in_set_predicate (#10948) 2022-06-13 10:56:45 +02:00
Daniël de Kok
a83a501195
precomputable_biaffine: avoid concatenation (#10911)
The `forward` of `precomputable_biaffine` performs matrix multiplication
and then `vstack`s the result with padding. This creates a temporary
array used for the output of matrix concatenation.

This change avoids the temporary by pre-allocating an array that is
large enough for the output of matrix multiplication plus padding and
fills the array in-place.

This gave me a small speedup (a bit over 100 WPS) on de_core_news_lg on
M1 Max (after changing thinc-apple-ops to support in-place gemm as BLIS
does).
2022-06-10 18:12:28 +02:00
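A rough numpy sketch of the idea; the shapes are illustrative and the real code runs inside the parser model:

```
import numpy as np

def concat_with_vstack(Yf, padding):
    # before: vstack allocates a temporary for the concatenated result
    return np.vstack((padding, Yf))

def concat_preallocated(Yf, padding):
    # after: allocate the output once and fill it in place
    out = np.empty((Yf.shape[0] + padding.shape[0],) + Yf.shape[1:], Yf.dtype)
    out[: padding.shape[0]] = padding
    out[padding.shape[0] :] = Yf
    return out

Yf = np.ones((8, 3, 2), dtype="f")
padding = np.zeros((1, 3, 2), dtype="f")
assert np.array_equal(concat_with_vstack(Yf, padding),
                      concat_preallocated(Yf, padding))
```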
github-actions[bot]
97e8a5041b
Auto-format code with black (#10945)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-10 13:21:33 +02:00
Gor Arakelyan
605f84938b
Add "Aim-spaCy" to spaCy Universe (#10943)
* Add Aim-spaCy to spaCy universe

* Update Aim thumbnail

* Fix author links

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-06-10 18:33:17 +09:00
kadarakos
1bb87f35bc
Detect cycle during projectivize (#10877)
* detect cycle during projectivize

* not complete test to detect cycle in projectivize

* boolean to int type to propagate error

* use unordered_set instead of set

* moved error message to errors

* removed cycle from test case

* use find instead of count

* cycle check: only perform one lookup

* Return bool again from _has_head_as_ancestor

Communicate presence of cycles through an output argument.

* Switch to returning std::pair to encode presence of a cycle

The has_cycle pointer is too easy to misuse. Ideally, we would have a
sum type like Rust's `Result` here, but C++ is not there yet.

* _is_non_proj_arc: clarify what we are returning

* _has_head_as_ancestor: remove count

We are now explicitly checking for cycles, so the algorithm must always
terminate. Either we encounter the head, we find a root, or a cycle.

* _is_nonproj_arc: simplify condition

* Another refactor using C++ exceptions

* Remove unused error code

* Print graph with cycle on exception

* Include .hh files in source package

* Add FIXME comment

* cycle detection test

* find cycle when starting from problematic vertex

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2022-06-08 19:34:11 +02:00
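A rough Python sketch of the check; the real implementation is C++ and also reports the offending cycle in the error message:

```
def has_cycle(heads):
    # follow head pointers from every token; roots point at themselves.
    # if we revisit a token before reaching a root, the "tree" has a cycle.
    for start in range(len(heads)):
        seen = set()
        i = start
        while heads[i] != i and i not in seen:
            seen.add(i)
            i = heads[i]
        if i in seen:
            return True
    return False

assert not has_cycle([1, 1, 1])  # 0 -> 1 and 2 -> 1; token 1 is the root
assert has_cycle([1, 2, 0])      # 0 -> 1 -> 2 -> 0
```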
Paul O'Leary McCann
d176afd32f
Add note about multiple patterns (#10826)
* Add note about multiple patterns

* Move note to the top of method docs

* Remove EntityRuler note
2022-06-08 16:24:14 +02:00
Sofie Van Landeghem
763dcbf885
Fix version in SpanRuler docs (#10925)
* SpanRuler is new since 3.3.1

* update SpanRuler version since 3.3.1
2022-06-08 14:45:04 +02:00
Daniël de Kok
a4003532a3
Update README: spaCy 3.3.1 is out now (#10927) 2022-06-08 15:16:22 +09:00
Ilya Nikitin
c323789721
token.md: Fix documentation of Token.ancestors (#10917) 2022-06-06 14:32:36 +09:00
vincent d warmerdam
e7d2b26966
Add spacy-report to universe (#10910)
* Add spacy-report to universe

* Remove extra comma

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-06-05 18:57:58 +09:00
github-actions[bot]
24aafdffad
Auto-format code with black (#10908)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-03 11:01:55 +02:00
Adriane Boyd
727ce6d1f5
Remove English exceptions with mismatched features (#10873)
Remove English contraction exceptions with mismatched features that lead
to exceptions like "theses" and "thisre".
2022-06-03 09:44:04 +02:00
Madeesh Kannan
41389ffe1e
Avoid pickling Doc inputs passed to Language.pipe() (#10864)
* `Language.pipe()`: Serialize `Doc` objects to bytes when using multiprocessing to avoid pickling overhead

* `Doc.to_dict()`: Serialize `_context` attribute (keeping in line with `(un)pickle_doc()`

* Correct type annotations

* Fix typo

* `Doc`: Do not serialize `_context`

* `Language.pipe`: Send context objects to child processes, Simplify `as_tuples` handling

* Fix type annotation

* `Language.pipe`: Simplify `as_tuple` multiprocessor handling

* Cleanup code, fix typos

* MyPy fixes

* Move doc preparation function into `_multiprocessing_pipe`
Whitespace changes

* Remove superfluous comma

* Rename `prepare_doc` to `prepare_input`

* Update spacy/errors.py

* Undo renaming for error

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-02 20:06:49 +02:00
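The gist of the change, in miniature: ship `Doc`s between processes as bytes rather than pickles.

```
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("Serialize me")

# Doc round trip via bytes, as now used by the multiprocessing path
payload = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(payload)
assert restored.text == doc.text
```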
Adriane Boyd
430592b3ce
Extend typing_extensions to <4.2.0 (#10899) 2022-06-02 17:22:34 +02:00
single-fingal
6c6b8da7cc
Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707)
* fix: De/Serialize `SpanGroups` including the SpanGroup keys

This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`).

Fixes #10685

* Maintain backwards compatibility for serialized `SpanGroups`

(serialized as: a list of `SpanGroup`s, or b'')

* Add tests for `SpanGroups` deserialization backwards-compatibility

* Move a `SpanGroups` de/serialization test (test_issue10685)
  to tests/serialize/test_serialize_spangroups.py

* Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s

* Minor refactor

* `SpanGroups.from_bytes` handles only `list` and `dict` types with
`dict` as the expected default
* For lists, keep first rather than last value encountered
* Update error message
* Rename and update tests

* Update to preserve list serialization of SpanGroups

To avoid breaking compatibility of serialized `Doc` and `DocBin` with
earlier versions of spacy v3, revert back to a list-only serialization,
but update the names just for serialization so that the SpanGroups keys
override the SpanGroup names.

* Preserve object identity and current key overwrite

* Preserve SpanGroup object identity
* Preserve last rather than first span group from SpanGroup list
  format without SpanGroups keys

* Update inline comments

* Fix types

* Add type info for SpanGroup.copy

* Deserialize `SpanGroup`s as copies

when a single SpanGroup is the value for more than 1 `SpanGroups` key.
This is because we serialize `SpanGroups` as dicts (to maintain backward-
and forward-compatibility) and we can't assume `SpanGroup`s with the same
bytes/serialization were the same (identical) object, pre-serialization.

* Update spacy/tokens/_dict_proxies.py

* Add more SpanGroups serialization tests

Test that serialized SpanGroups maintain their Span order

* small clarification on older spaCy version

* Update spacy/tests/serialize/test_serialize_span_groups.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 15:56:27 +02:00
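To illustrate the behavior being preserved, a minimal round-trip sketch (key names are illustrative):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The quick brown fox jumps")

# Two keys in doc.spans; the serialized form now records the SpanGroups
# keys, so same-named SpanGroup values can no longer collide on load.
doc.spans["events"] = [doc[0:2]]
doc.spans["places"] = [doc[2:4]]

data = doc.spans.to_bytes()
doc2 = nlp("The quick brown fox jumps")
doc2.spans.from_bytes(data)
assert set(doc2.spans.keys()) == {"events", "places"}
```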
Adriane Boyd
7e13652d36
Fix schemas import in Doc (#10898) 2022-06-02 15:53:03 +02:00
Raphael Mitsch
8387ce4c01
Add Doc.from_json() (#10688)
* Implement Doc.from_json: rough draft.

* Implement Doc.from_json: first draft with tests.

* Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json().

* Implement Doc.from_json: formatting changes.

* Implement Doc.to_json(): reverting unrelated formatting changes.

* Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file.

* Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls.

* Implement Doc.from_json(): handling sentence boundaries in spans.

* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.

* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.

* Implementing Doc.from_json(): incorporated various PR feedback.

* Renaming fixture for document without dependencies.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): using two sent_starts instead of one.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master.

* Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations.

* Implement Doc.from_json(): reverting unwanted formatting/rebasing changes.

* Implement Doc.from_json(): added check for char_span() calculation for entities.

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test.

* Implement Doc.from_json(): removed redundancy in annotation type key naming.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): Simplifying setting annotation values.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement doc.from_json(): renaming annot_types to token_attrs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs.

* Implement Doc.from_json(): removing default categories.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): simplifying lexeme initialization.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): simplifying lexeme initialization.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): refactoring to only have keys for present annotations.

* Implement Doc.from_json(): fix check for tokens' HEAD attributes.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): refactoring Doc.from_json().

* Implement Doc.from_json(): fixing span_group retrieval.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fixing span retrieval.

* Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json().

* Implement Doc.from_json(): added comment regarding Token and Span extension support.

* Implement Doc.from_json(): renaming inconsistent_props to partial_attrs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjusting error message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): extending E1038 message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): added params to E1038 raises.

* Implement Doc.from_json(): combined attribute collection with partial attributes check.

* Implement Doc.from_json(): added optional schema validation.

* Implement Doc.from_json(): fixed optional fields in schema, tests.

* Implement Doc.from_json(): removed redundant None check for DEP.

* Implement Doc.from_json(): added passing of schema validation message to E1037.

* Implement Doc.from_json(): removing redundant error E1040.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): changing message for E1037.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json().

* Update spacy/tests/doc/test_json_doc_conversion.py

* Implement Doc.from_json(): docstring update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): website docs update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring formatting.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring formatting.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fixing Doc reference in website docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): reformatted website/docs/api/doc.md.

* Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts.

* Implement Doc.from_json(): fixing bug in tests.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fix setting of sentence starts for docs without DEP.

* Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py.

* Implement Doc.from_json(): simplify token sentence start manipulation.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Combine related error messages

* Update spacy/tests/doc/test_json_doc_conversion.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-02 14:03:47 +02:00
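A minimal round-trip sketch of the new API, per the docs added in this PR:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("Hello world")

json_doc = doc.to_json()  # plain-dict representation of the Doc
restored = Doc(nlp.vocab).from_json(json_doc, validate=True)
assert restored.text == doc.text
```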
Adriane Boyd
a322d6d5f2
Add SpanRuler component (#9880)
* Add SpanRuler component

Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.

* Update spacy/pipeline/span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix cast

* Add self.key property

* Use number of patterns as length

* Remove patterns kwarg from init

* Update spacy/tests/pipeline/test_span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add options for spans filter and setting to ents

* Add `spans_filter` option as a registered function
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.

* Update and generalize tests

* Add test for setting doc.ents, fix key property type

* Fix typing

* Allow independent doc.spans and doc.ents

* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_filter`.
  * Use `util.filter_spans` by default as `ents_filter`.
  * Use a custom warning if the filter does not work for `doc.ents`.

* Enable use of SpanC.id in Span

* Support id in SpanRuler as Span.id

* Update types

* `id` can only be provided as string (already by `PatternType`
definition)

* Update all uses of Span.id/ent_id in Doc

* Rename Span id kwarg to span_id

* Update types and docs

* Add ents filter to mimic EntityRuler overwrite_ents

* Refactor `ents_filter` to take `entities, spans` args for more
  filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
  (`spacy.first_longest_spans_filter.v1`) to take any number of
  `Iterable[Span]` objects as args so it can be used for spans filter
  or ents filter

* Implement future entity ruler as span ruler

Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
  customized. (Necessary for the same sorting/filtering as in
  `EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
  `preserve_existing_ents_filter` to support
  `EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
  `entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
  properties.

Additional changes:

* Move all config settings to top-level attributes to avoid duplicating
  settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
  casting.)

* Format

* Fix filter make method name

* Refactor to use same error for removing by label or ID

* Also provide existing spans to spans filter

* Support ids property

* Remove token_patterns and phrase_patterns

* Update docstrings

* Add span ruler docs

* Fix types

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move sorting into filters

* Check for all tokens in seen tokens in entity ruler filters

* Remove registered sort key

* Set Token.ent_id in a backwards-compatible way in Doc.set_ents

* Remove sort options from API docs

* Update docstrings

* Rename entity ruler filters

* Fix and parameterize scoring

* Add id to Span API docs

* Fix typo in API docs

* Include explicit labeled=True for scorer

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 13:12:53 +02:00
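A short sketch of the new component (pattern contents are illustrative):

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")  # default spans_key is "ruler"
ruler.add_patterns([
    {"label": "ORG", "pattern": "Explosion"},
    {"label": "GPE", "pattern": [{"LOWER": "berlin"}]},
])

doc = nlp("Explosion is based in Berlin")
# Matches land in doc.spans["ruler"]; set annotate_ents=True in the
# component config to write to doc.ents instead.
print([(span.text, span.label_) for span in doc.spans["ruler"]])
```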
Sofie Van Landeghem
f7507c2327
fix typo + CI slow testing (#10835)
* fix typo

* one more typo
2022-06-02 00:10:16 +02:00
Madeesh Kannan
f8b769e7bf
Add test_slow_gpu explosion-bot command (#10858) 2022-06-01 09:37:30 +02:00
Paul O'Leary McCann
dca2e8c644
Minor NEL type fixes (#10860)
* Fix TODO about typing

Fix was simple: just request an array2f.

* Add type ignore

Maxout has a more restrictive type than the residual layer expects (only
Floats2d vs any Floats).

* Various cleanup

This moves a lot of lines around but doesn't change any functionality.
Details:

1. use `continue` to reduce indentation
2. move sentence doc building inside conditional since it's otherwise
   unused
3. reduce some temporary assignments
2022-06-01 00:41:28 +02:00
Philip Vollet
56d4055d96
Merge pull request #10880 from richardpaulhudson/website/holmes-update
Update Holmes entry in universe.json
2022-05-30 19:19:37 +02:00
richardpaulhudson
d4218366c5 Update Holmes entry in universe.json 2022-05-30 18:05:26 +02:00
Daniël de Kok
09a5f03dd7
Merge pull request #10849 from danieldk/simplify-gpu-check
Simplify GPU check
2022-05-30 16:35:10 +02:00
Max Tarlov
709d6d9114
Update documentation for displacy style kwargs (#10841)
* Update docs for displacy style kwargs

Added "span" to the accepted values for the style kwarg in the displacy.serve and displacy.render top-level functions. These styles are new as of SpaCy 3.3, so I added the "new" tag for that option only

* restored alpha ordering
2022-05-30 09:11:55 +02:00
Peter Baumgartner
bf95f0a1dd
add doc cleaner to menu (#10862) 2022-05-30 08:51:19 +02:00
Paul O'Leary McCann
87adb32576
Merge pull request #10867 from freddyheppell/patch-1
Fix misspelt keyword in StringStore docs example
2022-05-30 14:22:19 +09:00
Freddy Heppell
322c5a3ac4
Fix misspelt keyword in StringStore example 2022-05-29 10:49:19 +01:00
Daniël de Kok
85dd2b6c04
Parser: use C saxpy/sgemm provided by the Ops implementation (#10773)
* Parser: use C saxpy/sgemm provided by the Ops implementation

This is a backport of https://github.com/explosion/spaCy/pull/10747
from the parser refactor branch. It eliminates the explicit calls
to BLIS, instead using the saxpy/sgemm provided by the Ops
implementation.

This allows us to use Accelerate in the parser on M1 Macs (with
an updated thinc-apple-ops).

Performance of the de_core_news_lg pipe:

BLIS 0.7.0, no thinc-apple-ops:  6385 WPS
BLIS 0.7.0, thinc-apple-ops:    36455 WPS
BLIS 0.9.0, no thinc-apple-ops: 19188 WPS
BLIS 0.9.0, thinc-apple-ops:    36682 WPS
This PR, thinc-apple-ops:       38726 WPS

Performance of the de_core_news_lg pipe (only tok2vec -> parser):

BLIS 0.7.0, no thinc-apple-ops: 13907 WPS
BLIS 0.7.0, thinc-apple-ops:    73172 WPS
BLIS 0.9.0, no thinc-apple-ops: 41576 WPS
BLIS 0.9.0, thinc-apple-ops:    72569 WPS
This PR, thinc-apple-ops:       87061 WPS

* Require thinc >=8.1.0,<8.2.0

* Lower thinc lowerbound to 8.1.0.dev0

* Use best CPU ops for CBLAS when the parser model is on the GPU

* Fix another unguarded cblas() call

* Fix: use ops as a shorthand for self.model.ops

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2022-05-27 11:20:52 +02:00
github-actions[bot]
6172af8158
Auto-format code with black (#10857)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-05-27 10:54:54 +02:00
Daniël de Kok
7c6a97559d Simplify GPU check
This change removes `thinc.util.has_cupy` from the GPU presence check.
Currently `gpu_is_available` already implies `has_cupy`.  We also want
to show this warning in the future when a machine has a non-CuPy GPU.
2022-05-25 14:06:45 +02:00
kadarakos
f6a4b80c0b
Better errors for has_annotation and Matcher (#10830)
* Show input argument instead of None

* catch invalid attr early

* moved error message from code to errors.py

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/errors.py

* update E153 and E154

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-25 11:12:29 +02:00
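For reference, the call whose error reporting this PR improves (a minimal sketch):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world")

doc.has_annotation("DEP")    # valid attribute name, returns a bool
# doc.has_annotation("FOO")  # now reports the invalid input early,
#                            # instead of a confusing message showing None
```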
Sofie Van Landeghem
83ed1f391b
Remove NBSP's across tables in the docs (#10842) 2022-05-25 09:48:39 +02:00
Richard Hudson
32954c3bcb
Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786)
* Make changes to typing

* Correction

* Format with black

* Corrections based on review

* Bumped Thinc dependency version

* Bumped blis requirement

* Correction for older Python versions

* Update spacy/ml/models/textcat.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* Corrections based on review feedback

* Readd deleted docstring line

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2022-05-25 09:33:54 +02:00
Paul O'Leary McCann
6be09bbd07
Fix Entity Linker with tokenization mismatches (fix #9575) (#10457)
* Add failing test

* Partial fix for issue

This kind of works. The issue with token length mismatches is gone. The
problem is that when you get empty lists of encodings to compare, it
fails because the sizes are not the same, even though they're both zero:
(0, 3) vs (0,). Not sure why that happens...

* Short circuit on empties

* Remove spurious check

The check here isn't needed now that the short circuit is fixed.

* Update spacy/tests/pipeline/test_entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use "eg", not "example"

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-05-23 20:42:26 +02:00
Lj Miranda
1d34aa2b3d
Add spacy-span-analyzer to debug data (#10668)
* Rename to spans_key for consistency

* Implement spans length in debug data

* Implement how span bounds and spans are obtained

In this commit, I implemented how span boundaries (the tokens around a
given span) and the spans themselves are obtained. I've put them in the
compile_gold() function so that they're accessible later on. I will do the actual
computation of the span and boundary distinctiveness in the main
function above.

* Compute for p_spans and p_bounds

* Add computation for SD and BD

* Fix mypy issues

* Add weighted average computation

* Fix compile_gold conditional logic

* Add test for frequency distribution computation

* Add tests for kl-divergence computation

* Fix weighted average computation

* Make tables more compact by rounding them

* Add more descriptive checks for spans

* Modularize span computation methods

In this commit, I added the _get_span_characteristics and
_print_span_characteristics functions so that they can be reusable
anywhere.

* Remove unnecessary arguments and make fxs more compact

* Update a few parameter arguments

* Add tests for print_span and get_span methods

* Update API to talk about span characteristics in brief

* Add better reporting of spans_length

* Add test for span length reporting

* Update formatting of span length report

Removed '' to indicate that it's not a string, then
sort the n-grams by their length, not by their frequency.

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Show all frequency distribution when -V

In this commit, I displayed the full frequency distribution of the
span lengths when --verbose is passed. To make things simpler, I
rewrote some of the formatter functions so that I can call them
whenever.

Another notable change is that instead of showing percentages as
integers, I showed them as floats (max 2 decimal places). I did this
because it looks weird when it displays (0%).

* Update logic on how total is computed

The way the 90% thresholding is computed now is that we keep
adding the percentages until we reach >= 90%. I also updated the wording
and used the term "At least" to denote that >= 90% of your spans have
these distributions.

* Fix display when showing the threshold percentage

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add better phrasing for span information

* Update spacy/cli/debug_data.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add minor edits for whitespaces etc.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-23 19:06:38 +02:00
Peter Baumgartner
7ce3460b23
add floret to static vectors docs (#10833) 2022-05-23 09:16:31 +02:00
kadarakos
a3814ee739
oov confusion fix (#10828) 2022-05-23 09:15:51 +02:00
Madeesh Kannan
4fb1809c72
Disable weekly GPU/slow tests on forks (#10831) 2022-05-20 15:46:30 +02:00
Adriane Boyd
a82ec56aae
Remove cuda extras for non-linux arm in install widget (#10796)
* Remove cuda extras for non-linux arm platforms in install widget
* Extend cuda versions install widget
* Update GPU install docs to clarify cuda
2022-05-20 09:57:41 +02:00
Paul O'Leary McCann
46982cf694
Add glossary entry for root (#10821)
* Add glossary entry for root

There was already one but it was lower case, maybe that should be
removed?

* remove lowercase root

On reflection, that was probably just a mistake.

* Add lowercase root back

It's harmless to leave it there.
2022-05-20 09:56:32 +02:00
Raphael Mitsch
357be2614e
Fuzz tokenizer.explain: draft for fuzzy tests. (#10771)
* Fuzz tokenizer.explain: draft for fuzzy tests.

* Fuzz tokenizer.explain: xignoring tokenizer.explain() tests. Removed deadline modification. Removed LANGUAGES_WITHOUT_TOKENIZERS.

* Fuzz tokenizer.explain: changed tokenizer initialization to avoid failures in Azure runs.

* Fuzz tokenizer.explain: type hint for tokenizer in test.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-17 10:23:16 +02:00
github-actions[bot]
99aeaf9bd3
Auto-format code with black (#10795)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-05-13 19:02:08 +02:00
kadarakos
fd36469900
bugfix parser labels (#10797) 2022-05-13 11:41:32 +02:00
Paul O'Leary McCann
7634a488fe
Merge pull request #10793 from Schero1994/feature/update
Update spaCy Universe: spacytextblob (code example)
2022-05-13 12:07:37 +09:00
schaeran
f5952c0851 update spaCy Universe: spacytextblob (code example) 2022-05-12 18:23:00 +02:00
Patrick Düggelin
cb06309ed8
Fix PhraseMatcher remove overlapping terms (#10734)
* Add regression test for issue 10643

* Improve overlapping terms testcase

* Fix removing overlapping terms in phrase matcher (#10643)
2022-05-12 12:23:52 +02:00
Raphael Mitsch
6f9e2ca81f
Ignore overrides for pipe names in config argument (#10779)
* Pipe name override in config: added check with warning, added removal of name override from config, extended tests.

* Pipe name override in config: added pytest UserWarning.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-12 11:46:08 +02:00
Adriane Boyd
b65d652881
Override SpanGroups.setdefault to provide default SpanGroup (#10772)
* Fix mistake in SpanGroup API docs

* Restrict SpanGroups.setdefault to SpanGroup only

* Refactor to support default span iterable
2022-05-12 10:06:25 +02:00
Richard Hudson
d524f6415f
Add documentation tip about overriding variables (#10780) 2022-05-11 10:15:32 +02:00
Raphael Mitsch
2904359685
Allow assets to be optional in spacy project (#10714)
* Allow assets to be optional in spacy project: draft for optional flag/download_all options.

* Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality.

* Allow assets to be optional in spacy project: renamed --all to --extra.

* Allow assets to be optional in spacy project: included optional flag in project config test.

* Allow assets to be optional in spacy project: added documentation.

* Allow assets to be optional in spacy project: fixing deprecated --all reference.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Allow assets to be optional in spacy project: fixed project_assets() docstring.

* Allow assets to be optional in spacy project: adjusted wording in justification of optional assets.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: switched to `extra` as keyword in project.yml. Updated docs.

* Allow assets to be optional in spacy project: updated comment.

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring..

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test..

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-05-10 10:40:11 +02:00
Sofie Van Landeghem
6d17168c4d
Merge pull request #10777 from adrianeboyd/chore/update-develop-v3.4
Update develop for v3.4
2022-05-10 09:43:24 +02:00
Sofie Van Landeghem
1543558d08
Add test for old architectures (#10751)
* add v1 and v2 tests for tok2vec architectures

* textcat architectures are not "layers"

* test older textcat architectures

* test older parser architecture
2022-05-10 08:24:42 +02:00
Madeesh Kannan
733114bdd9
training.md: Fix typos (#10775) 2022-05-09 19:44:14 +02:00
Raphael Mitsch
e626df959f
Document different ways to create a pipeline (#10762)
* Document different ways to create a pipeline: moved up/slightly modified paragraph on pipeline creation.

* Document different ways to create a pipeline: changed Finnish to Ukrainian in example for language without trained pipeline.

* Document different ways to create a pipeline: added explanation of blank pipeline.

* Document different ways to create a pipeline: exchanged Ukrainian with Yoruba.
2022-05-06 15:40:59 +02:00
Richard Hudson
c32e1a0079
Updated Coreferee Universe entry (#10763) 2022-05-06 13:21:39 +02:00
Luca Dorigo
0a92d5644e
Fix StringStore.__getitem__ return type depending on parameter types (#10741)
* Fix StringStore.__getitem__ return type depending on parameter types

Small fix using  `@overload` so that `StringStore.__getitem__` returns an `int` when given a `str` or `bytes` and a `str` when given an `int`.

* Update spacy/strings.pyi

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-03 17:57:07 +02:00
Sofie Van Landeghem
e03b9f8095
Small doc typos (#10750)
* fix typos

* formatting
2022-05-03 13:55:27 +02:00
Raphael Mitsch
f5390e278a
Refactor error messages to remove hardcoded strings (#10729)
* Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings.

* Use custom error msg instead of hardcoded string: fixing faulty Errors import.
2022-05-02 13:38:46 +02:00
Madeesh Kannan
0a503ce5e0
Remove vestigial debug print statement in walk_head_nodes (#10718)
* `graph`: Remove vestigial debug print statement in `walk_head_nodes`

* Revert whitespace changes

* Remove more debug print statements
2022-05-02 13:36:35 +02:00
vincent d warmerdam
f3de976513
Update universe.json to Include spaCy video #6 (#10723)
* Update universe.json

I noticed that episode 6 was missing, so I added it.

* Update universe.json

* Update universe.json
2022-05-02 13:35:14 +02:00
Adriane Boyd
497a708c71
Docs for v3.3 (#10628)
* Temporarily disable CI tests

* Start v3.3 website updates

* Add trainable lemmatizer to pipeline design

* Fix Vectors.most_similar

* Add floret vector info to pipeline design

* Add Lower and Upper Sorbian

* Add span to sidebar

* Work on release notes

* Copy from release notes

* Update pipeline design graphic

* Upgrading note about Doc.from_docs

* Add tables and details

* Update website/docs/models/index.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix da lemma acc

* Add minimal intro, various updates

* Round lemma acc

* Add section on floret / word lists

* Add new pipelines table, minor edits

* Fix displacy spans example title

* Clarify adding non-trainable lemmatizer

* Update adding-languages URLs

* Revert "Temporarily disable CI tests"

This reverts commit 1dee505920.

* Spell out words/sec

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-28 14:09:35 +02:00
Adriane Boyd
10377fb945
Set version to v3.3.0 (#10614)
* Set version to v3.3.0

* Revert "Temporarily skip tests that require models/compat"

This reverts commit e422101e00.
2022-04-28 13:07:49 +02:00
Raphael Mitsch
3579507ba1
Bumped black to 22.3.0 due to a fix for https://github.com/psf/black/issues/2964. (#10715) 2022-04-27 14:49:24 +02:00
harmbuisman
c066fb8a4e
#10672: fixes displacy output for manual unsorted entities (#10673)
* #10672: fixes displacy output for manual unsorted entities

* #10672: removed unused import

* fix prettier formatting

Co-authored-by: Harm Buisman <h.buisman@iknl.nl>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-27 09:51:58 +02:00
Sofie Van Landeghem
b3717ba53a
removing print statements from the test suite (#10712) 2022-04-27 09:14:25 +02:00
Adriane Boyd
455f089c9b
Support exclude in Doc.from_docs (#10689)
* Support exclude in Doc.from_docs

* Update API docs

* Add new tag to docs
2022-04-25 18:19:03 +02:00
Mike
3b208197c3
Fixed example for spacy_syllables (#10705)
There was a typo in the example for the spacy_syllables project.
2022-04-25 16:40:54 +02:00
github-actions[bot]
e07500369c
Auto-format code with black (#10687)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-04-22 11:24:53 +02:00
Sofie Van Landeghem
2c2dbb844c Syntax for a branch from a PR 2022-04-22 09:45:49 +02:00
Ryn Daniels
29afbdb91e
add readme for explosion-bot (#10677) 2022-04-20 09:52:34 +02:00
Richard Hudson
4b227f4861
Merge pull request #10669 from mgrojo/develop
Fix some issues in Spanish stop-word list and examples
2022-04-19 09:37:34 +02:00
mgr
3d50b1a989 Fix some issues in Spanish examples
- Spelling: nationalities in lowercase, accent.
- Incorrect verb composition
- Untranslated word
2022-04-18 22:12:57 +02:00
mgr
2a2654c756 Remove significant or not very frequent words from stop word list [es]
The list of stop words for Spanish contained many inadequate words, see:

https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100

Removed words:
- verb forms of 'trabajar' (work) and intentar (try)
- words related to 'empleo' (employment)
- incorrect words: ampleamos, arribaabajo, soyos, paìs
- miscellaneous words due to being too significant or too infrequent:
  actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general,
  pais, principalmente, raras

Added other stop words for completion:
- Spanish one-letter words
- numbers up to twelve

Some reformatting to 79 columns.

When in doubt, the English and German lists have been consulted as good
examples.
2022-04-18 22:04:02 +02:00
Madeesh Kannan
aa6780eb27
Matcher: Remove superfluous GIL-acquiring check in get_is_final (#10659)
* `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final`

This check incurred a significant performance penalty due to implicit interactions between the GIL and Cython ref-counting code.

* `Matcher`: Inline `PatternStateC` accessors
2022-04-18 12:59:34 +02:00
Duy Ngo
229ecaf0ea
Add numbers and definitions (#10665) 2022-04-18 12:58:32 +02:00
Schero1994
d622883a42
Adding and updating content in the spacy universe (#10493)
* signing contributor agreement

* adding new content to the spaCy universe

* updating outdated example codes

* resolving issues for the PR

* resolve review for klayers

* remove contributor-agreement file from the PR

* Update code example of spaCySentiWS

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy-sentiws code example

Co-authored-by: schaeran <schaeran1994@gmail.com>
Co-authored-by: schaeran <schaeran@explosion.ai>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-04-15 15:36:54 +02:00
Joachim Fainberg
4e1716223c
displaCy: Avoid increasing levels for identical arcs (#10639)
* Test for arc levels for identical arcs

Also moves the test in order with the other numbered tests.

* displaCy: filter identical arcs

Avoid increased levels due to identical arcs by first
filtering any identical arcs.

* Sort keys before filtering

Manual entry with keys out of order would previously become
different tuples and therefore not filtered correctly.

Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MBP.lan>
2022-04-14 16:48:00 +02:00
Philip Vollet
e63a5d4888
Update newsletter id (#10655) 2022-04-14 13:34:01 +02:00
fonfonx
028cbad05e
Add feminine form of word "one" in French (#10653)
* Add French number

* Add fonfonx.md

* Add feminine ordinal words for French
2022-04-14 10:21:27 +02:00
Schero1994
caf8528af7
Batch #1 | spaCy universe cleanup (#10642)
* delete universe object: wmd-relax

* delete universe object: spaCy.jl

* delete universe object: saber

* delete universe object: languagecrunch

* delete universe object: gracyql

* delete universe object: ExcelCy

* delete universe object: EpiTator

Co-authored-by: schaeran <schaeran1994@gmail.com>
2022-04-14 10:08:19 +02:00
single-fingal
4228f3c757
Fix a few minor bugs in the SpanGroup API web docs (#10650)
* Fix a few minor bugs in the SpanGroup API web docs

* Update SpanGroup docs examples to have Spans reflect intended "errors"
2022-04-14 09:59:48 +02:00
Adriane Boyd
64602d997d
Require srsly v2.4.3+ due to buffer overflow vulnerability (#10651) 2022-04-13 11:41:40 +02:00
Richard Hudson
75fbbcdc18
Display warning when spacy.explain() finds no term (#10645)
* Display warning when spacy.explain() finds no term

* Updated warning message text
2022-04-12 10:48:28 +02:00
David Berenstein
d4196a62f1
added crosslingual coreference to spacy universe without additional commits (#10580)
* added crosslingual coreference to spacy universe

* Updated example to introduce batching example.

Co-authored-by: David Berenstein <david.berenstein@pandoraintelligence.com>
2022-04-08 08:23:58 +02:00
Madeesh Kannan
9ba3e1cb2f
Basic tests for the Tamil language (#10629)
* Add basic tests for Tamil (ta)

* Add comment
Remove superfluous condition

* Remove superfluous call to `pipe`
Instantiate new tokenizer for special case
2022-04-07 14:47:37 +02:00
Lj Miranda
02dafa3a84
Add debug diff command in spaCy CLI (#10502)
* Add initial design for diff command

For now, the diffing process looks like this:
- The default config is created based from some values in the user
config (e.g. which pipeline components were used, the lang, etc.)
- The user must supply manually if it was optimized for acc/efficiency
and if pretraining was involved.

* Make diff command structure similar to siblings

* Include gpu as a user option for CLI

* Make variables more explicit

* Fix type declaration for optimize enum

* Improve docstrings for diff CLI

* Add debug-diff to website API docs

* Switch position of configs so that user config is modded

* Add markdown flag for debug diff

This commit adds a --markdown (--md) flag that allows easier
copy-pasting to Github issues. Please note that this commit is dependent
on an unreleased version of wasabi (for the time being).

For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20

* Bump version of wasabi to 0.9.1

So that we can use the add_symbols parameter.

* Apply suggestions from code review

Co-authored-by: Ines Montani <ines@ines.io>

* Update docs based on code review suggestions

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Change command name from diff -> diff-config

* Clarify when options are relevant or not

* Rerun prettier on cli.md

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-07 10:48:45 +02:00
Joachim Fainberg
b91255a454
displacy: avoid overlapping arcs in manual mode (#10534)
* Added test for overlapping arcs

* Provide distinct levels to overlapping arcs

* Update return type hint for get_levels

* Improved formatting spacy/displacy/render.py

Co-authored-by: Ines Montani <ines@ines.io>

Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MacBook-Pro.local>
Co-authored-by: Ines Montani <ines@ines.io>
2022-04-05 09:08:02 +02:00
Adriane Boyd
0d0153db63
Update default spans_key to sc in API docs (#10616) 2022-04-04 18:09:15 +02:00
Bram Vanroy
f966bf6a15
Update to spacy_conll in universe (#10617)
* update to spacy_conll

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-04 17:57:52 +02:00
Madeesh Kannan
cfd9217bae
Update link to flake8 config (#10620)
* Update link to flake8 config

* Run prettier
2022-04-04 17:35:37 +02:00
Adriane Boyd
849bef2de6
Merge pull request #10596 from adrianeboyd/chore/v3.3.0.dev0
Set version to v3.3.0.dev0
2022-04-04 09:18:07 +02:00
Adriane Boyd
e422101e00 Temporarily skip tests that require models/compat 2022-04-01 11:09:28 +02:00
Adriane Boyd
ca54de27bb
Support more internal methods for SpanGroup (#10476)
* Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects

* Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap

* Added a method to efficiently merge SpanGroups

* Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge

* Renamed merge to concat and added missing things to documentation

* Added operator+ and operator += in the documentation

* Added a test for Doc deallocation

* Update spacy/tokens/span_group.pyx

* Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal function

* Fixed typos in SpanGroup documentation

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation

* SpanGroup: moved repetitive list index check/adjustment in a separate function

* Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove formatting that hurts readability spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Support more internal methods for SpanGroup

Add support for:

* `__setitem__`
* `__delitem__`
* `__iadd__`: for `SpanGroup` or `Iterable[Span]`
* `__add__`: for `SpanGroup` only

Adapted from #9698 with the scope limited to the magic methods.

* Use v3.3 as new version in docs

* Add new tag to SpanGroup.copy in API docs

* Remove duplicate import

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remaining suggestions and formatting

Co-authored-by: nrodnova <nrodnova@hotmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>
2022-04-01 09:56:26 +02:00
Adriane Boyd
d56b1400d2 Set version to v3.3.0.dev0 2022-04-01 09:54:52 +02:00
Daniël de Kok
c90dd6f265
Alignment: use a simplified ragged type for performance (#10319)
* Alignment: use a simplified ragged type for performance

This introduces the AlignmentArray type, which is a simplified version
of Ragged that performs better on the simple(r) indexing performed for
alignment.

* AlignmentArray: raise an error when using unsupported index

* AlignmentArray: move error messages to Errors

* AlignmentArray: remove simplified ... with simplifications

* AlignmentArray: fix typo that broke a[n:n] indexing
2022-04-01 09:02:06 +02:00
Adriane Boyd
03762b4b92
Add spancat, trainable_lemmatizer to quickstart (#10524)
* Add `SPACY` and `IS_SPACE` as default `tok2vec` features
2022-04-01 09:01:04 +02:00
Adriane Boyd
7d1edc0c25
Merge pull request #10593 from adrianeboyd/chore/undo-click-pin
Revert "Add click pin to avoid typer issues (#10573)"
2022-04-01 09:00:38 +02:00
Adriane Boyd
e3ccc1973b
Provide debug data info for floret vectors (#10592) 2022-03-31 15:11:32 +02:00
Adriane Boyd
88933ca878 Revert "Add click pin to avoid typer issues (#10573)"
This reverts commit 9966e08f32.
2022-03-31 14:16:21 +02:00
Yunus Atahan
36d3af3013
Fixed typo in Turkish lang. (#10582)
* added failing test case for the issue.

* Fixed typo.

* fixed typo in test.

* added the corrected word into test_tr_lex_attrs_capitals as a param. Test passes. Also tried and confirmed that the test fails after fixing the typo in the test case I wrote. Deleted the test case for the typo.

Co-authored-by: Yunus Atahan <yunus.atahan@trmotor.local>
2022-03-30 13:16:08 +02:00
Adriane Boyd
f98b41c390
Add vector deduplication (#10551)
* Add vector deduplication

* Add `Vocab.deduplicate_vectors()`
* Always run deduplication in `spacy init vectors`
* Clean up a few vector-related error messages and docs examples

* Always unique with numpy

* Fix types
2022-03-30 08:54:23 +02:00
Adriane Boyd
9966e08f32
Add click pin to avoid typer issues (#10573) 2022-03-29 11:15:24 +02:00
Adriane Boyd
85778dfcf4
Add edit tree lemmatizer (#10231)
* Add edit tree lemmatizer

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* Hide edit tree lemmatizer labels

* Use relative imports

* Switch to single quotes in error message

* Type annotation fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Reformat edit_tree_lemmatizer with black

* EditTreeLemmatizer.predict: take Iterable

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Validate edit trees during deserialization

This change also changes the serialized representation. Rather than
mirroring the deep C structure, we use a simple flat union of the match
and substitution node types.

* Move edit_trees to _edit_tree_internals

* Fix invalid edit tree format error message

* edit_tree_lemmatizer: remove outdated TODO comment

* Rename factory name to trainable_lemmatizer

* Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14

* Switch to Tagger.v2

* Add documentation for EditTreeLemmatizer

* docs: Fix 3.2 -> 3.3 somewhere

* trainable_lemmatizer documentation fixes

* docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-28 11:13:50 +02:00
github-actions[bot]
98ed941c39
Auto-format code with black (#10550)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-03-28 10:44:46 +02:00
Luka Dragar
53674bb745
Examples for Slovene (#10539)
* Added examples for Slovene

* Update spacy/lang/sl/examples.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Corrected a typo in one of the sentences

Co-authored-by: Luka Dragar <D20124481@mytudublin.ie>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-28 10:44:10 +02:00
Adriane Boyd
d5666fd12d
Add NORM to Matcher feature in docs (#10560) 2022-03-28 10:35:47 +02:00
Adriane Boyd
33eb63b157 Remove now-built-in jinja2>=3.1.0 extensions 2022-03-25 14:29:33 +01:00
David Berenstein
ed2ac34a8a
added Concise Concepts to spaCy universe (#10499)
* Update universe.json

added classy-classification to the spaCy universe

* Update universe.json

added classy-classification to the spacy universe resources

* Update universe.json

corrected a small typo in json

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update universe.json

processed merge feedback

* Update universe.json

* updated information for Classy Classification

Made a more comprehensible and accessible description for Classy Classification based on feedback from Philip Vollet, to prepare for sharing.

* added note about examples

* corrected for wrong formatting changes

* Update website/meta/universe.json with small typo correction

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* resolved another typo

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* added Concise Concepts package to spaCy universe.

* updated example code Concise Concepts

* updated description for Concise Concepts

* updated PR with more visually appealing examples

SO to koaning for the suggestions.

* corrected small JSON typos in concise concepts

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-24 18:00:12 +01:00
Adriane Boyd
3711af74e5
Add tokenizer option to allow Matcher handling for all rules (#10452)
* Add tokenizer option to allow Matcher handling for all rules

Add tokenizer option `with_faster_rules_heuristics` that determines
whether the special cases applied by the internal `Matcher` are filtered
by whether they contain affixes or space. If `True` (default), the rules
are filtered to prioritize speed over rare edge cases. If `False`, all
rules are included in the final `Matcher`-based pass over the doc.

* Reset all caches when reloading special cases

* Revert "Reset all caches when reloading special cases"

This reverts commit 4ef6bd171d.

* Initialize max_length properly

* Add new tag to API docs

* Rename to faster heuristics
2022-03-24 13:21:32 +01:00
Adriane Boyd
31a5d99efa
Maintain support for empty DocBin span groups (#10538) 2022-03-24 11:51:07 +01:00
Daniël de Kok
2ff197603e
matcher: remove an undefined behavior (#10537)
Indexing into a zero-length std::vector is undefined behavior.
2022-03-24 11:48:22 +01:00
Adriane Boyd
d85117f88c
Stream large assets on download (#10521)
Stream large assets on download rather than reading the whole file at
once and potentially running into `urllib3` limits on single read sizes.
2022-03-24 11:47:05 +01:00
Adriane Boyd
e908a67829
Handle unknown tags in KoreanTokenizer tag map (#10536) 2022-03-24 11:25:36 +01:00
Adriane Boyd
c17980e535
Save vectors as little endian, load with Ops.asarray (#10201)
* Save vectors as little endian, load with Ops.asarray

* Always save vector data as little endian
* Always run `Vectors.to_ops` when vector data is loaded so that
  `Ops.asarray` can be used to load the data correctly for the current
  ops.

* Update spacy/vectors.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/vectors.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-21 14:24:46 +01:00
Basile Dura
107bab56b5
docs: add EDS-NLP to spaCy universe (#10489)
* docs: add EDS-NLP to spaCy universe

* fix: remove "standalone" tag for EDS-NLP

Co-authored-by: Basile Dura <basile.dura-ext@aphp.fr>
2022-03-21 11:03:39 +01:00
github-actions[bot]
bf1cf77a5b
Auto-format code with black (#10518)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-03-21 09:21:24 +01:00
Adriane Boyd
04f3f414d1
Update pytest to forbid ==7.1.0, allow >=7.1.1 (#10519) 2022-03-18 13:43:54 +01:00
Lj Miranda
0b02dc4c57
Fix mixed-up parameters for spacy-conll (#10516) 2022-03-18 08:56:21 +01:00
Grey Murav
3ff5a6a5c0
Extend list of _num_words (#10468) 2022-03-16 18:25:42 +01:00
Lj Miranda
a79cd3542b
Add displacy support for overlapping Spans (#10332)
* Fix docstring for EntityRenderer

* Add warning in displacy if doc.spans are empty

* Implement parse_spans converter

One notable change here is that the default spans_key is sc, and
it's set by the user through the options.

* Implement SpanRenderer

Here, I implemented a SpanRenderer that looks similar to the
EntityRenderer except for some templates.  The spans_key, by default, is
set to sc, but can be configured in the options (see parse_spans). The
way I rendered these spans is per-token, i.e., I first check if each
token (1) belongs to a given span type and (2) a starting token of a
given span type. Once I have this information, I render them into the
markup.

* Fix mypy issues on typing

* Add tests for displacy spans support

* Update colors from RGB to hex

Co-authored-by: Ines Montani <ines@ines.io>

* Remove unnecessary CSS properties

* Add documentation for website

* Remove unnecessary scripts

* Update wording on the documentation

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Put typing dependency on top of file

* Put back z-index so that spans overlap properly

* Make warning more explicit for spans_key

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-16 18:14:34 +01:00
David Berenstein
e021dc6279
Updated explanation for classy classification (#10484)
* Update universe.json

added classy-classification to the spaCy universe

* Update universe.json

added classy-classification to the spacy universe resources

* Update universe.json

corrected a small typo in json

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update universe.json

processed merge feedback

* Update universe.json

* updated information for Classy Classification

Made a more comprehensible and accessible description for Classy Classification based on feedback from Philip Vollet, to prepare for sharing.

* added note about examples

* corrected for wrong formatting changes

* Update website/meta/universe.json with small typo correction

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* resolved another typo

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-15 16:42:33 +01:00
Daniël de Kok
e5debc68e4
Tagger: use unnormalized probabilities for inference (#10197)
* Tagger: use unnormalized probabilities for inference

Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).

* Add spacy.Tagger.v2 with configurable normalization

Normalization of probabilities is disabled by default to improve
performance.

* Update documentation, models, and tests to spacy.Tagger.v2

* Move Tagger.v1 to spacy-legacy

* docs/architectures: run prettier

* Unnormalized softmax is now a Softmax_v2 option

* Require thinc 8.0.14 and spacy-legacy 3.0.9
2022-03-15 14:15:31 +01:00
Adriane Boyd
e8357923ec
Various install docs updates (#10487)
* Simplify quickstart source install to use only editable pip install

* Update pytorch install instructions to more recent versions
2022-03-15 11:12:50 +01:00
vincent d warmerdam
610001e8c7
Update universe.json (#10490)
The project moved away from Rasa and into my personal GitHub account.
2022-03-15 11:12:04 +01:00
Adriane Boyd
0dc454ba95
Update docs for Vocab.get_vector (#10486)
* Update docs for Vocab.get_vector

* Clarify description of 0-vector dimensions
2022-03-15 09:10:47 +01:00
Edward
2eef47dd26
Save span candidates produced by spancat suggesters (#10413)
* Add save_candidates attribute

* Change spancat api

* Add unit test

* reimplement method to produce a list of doc

* Add method to docs

* Add new version tag

* Add intended use to docstring

* prettier formatting
2022-03-14 16:46:58 +01:00
Edward
b68bf43f5b
Add spans to doc.to_json (#10073)
* Add spans to to_json

* adjustments to_json

* Change docstring

* change doc key naming

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-14 15:47:57 +01:00
Sofie Van Landeghem
23bc93d3d2
limit pytest to <7.1 (#10488)
* limit pytest to <7.1

* 7.1.0
2022-03-14 15:17:22 +01:00
Lj Miranda
6af6c2e86c
Add a note to the dev docs on mypy (#10485) 2022-03-14 09:41:31 +01:00
github-actions[bot]
1bbf232074
Auto-format code with black (#10479)
* Auto-format code with black

* Update spacy/lang/hsb/lex_attrs.py

Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-11 12:20:23 +01:00
Adriane Boyd
297dd82c86
Fix initial special cases for Tokenizer.explain (#10460)
Add the missing initial check for special cases to `Tokenizer.explain`
to align with `Tokenizer._tokenize_affixes`.
2022-03-11 10:50:47 +01:00
Peter Baumgartner
01ec6349ea
Add path.mkdir to custom component examples of to_disk (#10348)
* add `path.mkdir` to examples

* add ensure_path + mkdir

* update highlights
2022-03-08 16:04:10 +01:00
Adriane Boyd
191e8b31fa
Remove English tokenizer exception May. (#10463) 2022-03-08 14:28:46 +01:00
Adriane Boyd
60520d8669
Fix types in API docs for moves in parser and ner (#10464) 2022-03-08 13:51:11 +01:00
Adriane Boyd
b2bbefd0b5
Add Finnish, Korean, and Swedish models and Korean support notes (#10355)
* Add Finnish, Korean, and Swedish models to website

* Add Korean language support notes
2022-03-07 17:03:45 +01:00
jnphilipp
5ca0dbae76
Add Lower Sorbian support. (#10431)
* Add support basic support for lower sorbian.

* Add some test for dsb.

* Update spacy/lang/dsb/examples.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 16:57:14 +01:00
Paul O'Leary McCann
61ba5450ff
Fix get_matching_ents (#10451)
* Fix get_matching_ents

Not sure what happened here - the code prior to this commit simply does
not work. It's already covered by entity linker tests, which were
succeeding in the NEL PR, but couldn't possibly succeed on master.

* Fix test

Test was indented inside another test and so doesn't seem to have been
running properly.
2022-03-07 16:56:57 +01:00
jnphilipp
7ed7908716
Add Upper Sorbian support. (#10432)
* Add support basic support for upper sorbian.

* Add tokenizer exceptions and tests.

* Update spacy/lang/hsb/examples.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 16:20:39 +01:00
David Berenstein
a6d5824e5f
added classy-classification package to spacy universe (#10393)
* Update universe.json

added classy-classification to the spaCy universe

* Update universe.json

added classy-classification to the spacy universe resources

* Update universe.json

corrected a small typo in json

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update universe.json

processed merge feedback

* Update universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 12:47:26 +01:00
Paul O'Leary McCann
6f4f57f317
Update Issue Templates (#10446)
* Remove mention of python 3.10 wheels

These were released a while ago, just forgot to remove this notice.

* Add note about Discussions
2022-03-07 10:41:03 +01:00
Sofie Van Landeghem
d89dac4066
hook up meta in load_model_from_config (#10400) 2022-03-04 11:07:45 +01:00
Paul O'Leary McCann
91acc3ea75
Fix entity linker batching (#9669)
* Partial fix of entity linker batching

* Add import

* Better name

* Add `use_gold_ents` option, docs

* Change to v2, create stub v1, update docs etc.

* Fix error type

Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?

* Make mypy happy

* Add hacky fix for init issue

* Add legacy pipeline entity linker

* Fix references to class name

* Add __init__.py for legacy

* Attempted fix for loss issue

* Remove placeholder V1

* formatting

* slightly more interesting train data

* Handle batches with no usable examples

This adds a test for batches that have docs but not entities, and a
check in the component that detects such cases and skips the update step
as though the batch were empty.

* Remove todo about data verification

Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.

* Fix gradient calculation

The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.

However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.

This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with numpyops it was causing
nans to happen, which is another problem that should be addressed
separately.

This commit changes the loss to give a zero gradient for entities not in
the kb.

* add failing test for v1 EL legacy architecture

* Add nasty but simple working check for legacy arch

* Clarify why init hack works the way it does

* Clarify use_gold_ents use case

* Fix use gold ents related handling

* Add tests for no gold ents and fix other tests

* Use aligned ents function (not working)

This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.

* Use proper matching ent check

This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.

* Move get_matching_ents to Example

* Use model attribute to check for legacy arch

* Rename flag

* bump spacy-legacy lower pin to 3.0.9

Co-authored-by: svlandeg <svlandeg@github.com>
2022-03-04 09:17:36 +01:00
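The zero-gradient fix described in the commit above lends itself to a short sketch. This is an illustrative NumPy version, not spaCy's actual EntityLinker internals; the function and variable names are made up for the example.

```
import numpy

def masked_gradient(context_embeds, gold_embeds, in_kb_mask):
    # Gradient between predicted context embeddings and gold embeddings.
    d_embeds = context_embeds - gold_embeds
    # Entities missing from the KB carry no sensible signal, so their
    # rows get a zero gradient instead of being dropped (dropping them
    # caused the dimension mismatch described above).
    d_embeds[~in_kb_mask] = 0.0
    return d_embeds

context = numpy.random.rand(4, 64).astype("float32")
gold = numpy.random.rand(4, 64).astype("float32")
in_kb = numpy.array([True, False, True, True])
assert masked_gradient(context, gold, in_kb)[1].sum() == 0.0
```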
Adriane Boyd
8e93fa8507
Fix Vectors.n_keys for floret vectors (#10394)
Fix `Vectors.n_keys` for floret vectors to match docstring description
and avoid W007 warnings in similarity methods.
2022-03-01 09:21:25 +01:00
Sofie Van Landeghem
3f68bbcfec
Clean up loggers docs (#10351)
* update docs to point to spacy-loggers docs

* remove unused error code
2022-02-25 16:29:12 +01:00
github-actions[bot]
d637b34e2f
Auto-format code with black (#10377)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-02-25 10:00:21 +01:00
Sam Edwardes
5f568f7e41
Updated spaCy universe for spacytextblob (#10335)
* Updated spacytextblob in universe.json

* Fixed json

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added spacy_version tag to spacytextblob

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-24 14:18:10 +09:00
Adriane Boyd
b16da378bb
Re-remove universe tests from test suite (#10357) 2022-02-23 21:08:56 +01:00
kadarakos
249b97184d
Bugfixes and test for rehearse (#10347)
* fixing argument order for rehearse

* rehearse test for ner and tagger

* rehearse bugfix

* added test for parser

* test for multilabel textcat

* rehearse fix

* remove debug line

* Update spacy/tests/training/test_rehearse.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/tests/training/test_rehearse.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-23 16:10:05 +01:00
Adriane Boyd
b7ba7f78a2
Merge pull request #10344 from adrianeboyd/chore/v3.2-backport-10324
Fix Tok2Vec for empty batches (#10324)
2022-02-21 16:40:53 +01:00
Daniël de Kok
78a8bec4d0
Make core projectivization functions cdef nogil (#10241)
* Make core projectivization methods cdef nogil

While profiling the parser, I noticed that a relatively large share of time is
spent in projectivization. This change rewrites the functions in the
core loops as cdef nogil for efficiency.

In C++-land, we use vector in place of Python lists and absent heads
are represented as -1 in place of None.

* _heads_to_c: add assertion

Validation should be performed by the caller, but this assertion ensures that
we are not reading/writing out of bounds with incorrect input.
2022-02-21 15:02:21 +01:00
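A tiny Python illustration of the -1-for-None convention mentioned in the commit above (the real code is Cython operating on C++ vectors; this only shows the data mapping):

```
def heads_to_c(heads):
    # Absent heads (None in Python) become -1 so the parse can be stored
    # in a plain integer array on the C++ side.
    return [-1 if head is None else head for head in heads]

assert heads_to_c([1, None, 1]) == [1, -1, 1]
```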
Adriane Boyd
cf5b46b63e Fix Tok2Vec for empty batches (#10324)
* Add test for tok2vec with vectors and empty docs

* Add shortcut for empty batch in Tok2Vec.predict

* Avoid types
2022-02-21 14:29:36 +01:00
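The empty-batch shortcut is essentially an early return. A hedged sketch of the pattern (the method and dimension names follow Thinc conventions but are assumptions, not spaCy's exact code):

```
from thinc.api import Model

def predict(model: Model, docs):
    # If the batch is empty (or contains only empty docs), skip the
    # network entirely and return correctly shaped empty outputs.
    if not any(len(doc) for doc in docs):
        width = model.get_dim("nO")
        return [model.ops.alloc((0, width)) for _ in docs]
    return model.predict(docs)
```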
Adriane Boyd
30030176ee
Update Korean defaults for Tokenizer (#10322)
Update Korean defaults for `Tokenizer` for tokenization following UD
Korean Kaist.
2022-02-21 10:26:19 +01:00
Adriane Boyd
f32ee2e533
Fix NER check in CoNLL-U converter (#10302)
* Fix NER check in CoNLL-U converter

Leave ents unset if no NER annotation is found in the MISC column.

* Revert to global rather than per-sentence NER check

* Update spacy/training/converters/conllu_to_docs.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-21 10:24:52 +01:00
Peter Baumgartner
3358fb9bdd
Miscellaneous Minor SpanGroups/DocBin Improvements (#10250)
* MultiHashEmbed vector docs correction

* doc copy span test

* ignore empty lists in DocBin.span_groups

* serialized empty list const + SpanGroups.is_empty

* add conditional deserial on from_bytes

* clean up + reorganize

* rm test

* add constant as class attribute

* rename to _EMPTY_BYTES

* Update spacy/tests/doc/test_span.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-21 10:24:15 +01:00
Adriane Boyd
f4c74764b8
Fix Tok2Vec for empty batches (#10324)
* Add test for tok2vec with vectors and empty docs

* Add shortcut for empty batch in Tok2Vec.predict

* Avoid types
2022-02-21 10:22:36 +01:00
github-actions[bot]
6de84c8757
Auto-format code with black (#10333)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-02-21 09:15:42 +01:00
Adriane Boyd
28ba31e793
Add whitespace and combined augmenters (#10170)
Add whitespace augmenter that inserts a single whitespace token into a
doc containing annotation used in core trained pipelines.

Add a combined augmenter that handles lowercasing, orth variants and
whitespace augmentation.
2022-02-17 15:54:09 +01:00
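Custom augmenters like the ones above follow a simple contract: a callable that takes the `nlp` object and an `Example` and yields one or more `Example`s. A sketch of a lowercasing augmenter in that shape, simplified from the documented pattern (the whitespace augmenter does the same but inserts a whitespace token):

```
import random
from spacy.training import Example

def create_lower_casing_augmenter(level: float):
    def augment(nlp, example):
        if random.random() >= level:
            yield example  # pass through unchanged
        else:
            # Build a lowercased copy, keeping annotations aligned by
            # updating the token orths to match the new text.
            example_dict = example.to_dict()
            doc = nlp.make_doc(example.text.lower())
            example_dict["token_annotation"]["ORTH"] = [
                t.lower_ for t in example.reference
            ]
            yield example.from_dict(doc, example_dict)

    return augment
```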
Grey Murav
aa93b471a1
Extend list of stopwords for ru language (#10313) 2022-02-17 15:51:15 +01:00
Grey Murav
23f06dc37f
Extend list of numbers for ru language (#10280)
* Extended list of numbers for ru language

Extended the list of numbers with all forms and cases, including short forms, slang variants and Roman numerals.

* Update lex_attrs.py

* Update 'like_num' function with percentages

Added support for numbers with percentages like 12% and 1.2% to the 'like_num' function.

* black formatting

Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
2022-02-17 15:50:08 +01:00
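The percentage handling reduces to stripping a trailing % before the usual numeric check. A simplified, language-agnostic sketch of a `like_num` in that spirit (the real Russian version also covers spelled-out forms, cases and Roman numerals):

```
def like_num(text: str) -> bool:
    text = text.replace(",", ".").replace(" ", "")
    if text.endswith("%"):
        text = text[:-1]  # "12%" -> "12", "1.2%" -> "1.2"
    if text.isdigit():
        return True
    if text.count(".") == 1:
        num, denom = text.split(".")
        return num.isdigit() and denom.isdigit()
    return False

assert like_num("12%") and like_num("1.2%") and not like_num("abc")
```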
Grey Murav
a9756963e6
Extend list of abbreviations for ru language (#10282)
* Extend list of abbreviations for ru language

Extended the list of abbreviations for the ru language that may influence tokenization.

* black formatting

Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
2022-02-17 15:48:50 +01:00
Adriane Boyd
da7520a83c
Delay loading of mecab in Korean tokenizer (#10295)
* Delay loading of mecab in Korean tokenizer

Delay loading of mecab until the tokenizer is called the first time so
that it's possible to initialize a blank `ko` pipeline without having
mecab installed, e.g. for use with `spacy init vectors`.

* Move mecab import back to __init__

Move mecab import back to __init__ to warn users at the same point as
before for missing python dependencies.
2022-02-17 11:35:34 +01:00
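A sketch of the resulting pattern: import the optional dependency eagerly in `__init__` (so a missing install still fails with a clear message at construction time), but defer creating the heavyweight object until the tokenizer is first called. The class name is illustrative and the `natto` dependency is an assumption here.

```
class LazyMecabTokenizer:
    def __init__(self):
        # Import in __init__ so users are warned about a missing
        # dependency as early as before, but do not instantiate yet.
        from natto import MeCab  # assumed optional dependency

        self._mecab_cls = MeCab
        self._mecab = None

    def __call__(self, text: str):
        if self._mecab is None:
            # Created lazily: blank pipelines and `spacy init vectors`
            # never reach this point.
            self._mecab = self._mecab_cls()
        return self._mecab.parse(text)
```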
Sofie Van Landeghem
3854ab901f
Merge pull request #10312 from svlandeg/fix/drop-develop
Remove daily/weekly tests for develop branch
2022-02-16 16:28:22 +01:00
Sofie Van Landeghem
26eac22d3b
remove develop also from GPU tests 2022-02-16 15:44:05 +01:00
Sofie Van Landeghem
fef768ef74
remove develop (not an active branch anymore) 2022-02-16 15:43:36 +01:00
Sofie Van Landeghem
228aaa16b7
Merge pull request #10309 from svlandeg/copy/develop
Update master with latest from develop
2022-02-16 15:40:58 +01:00
Ryn Daniels
d30ee14ab3
Pass the matrix branch to the checkout action (#10304) 2022-02-16 15:39:42 +01:00
Sofie Van Landeghem
a16b14e591
Merge branch 'master' into copy/develop 2022-02-16 14:04:59 +01:00
Adriane Boyd
22066f4e0f
Also exclude workflows from non-PR CI runs (#10305) 2022-02-16 13:45:30 +01:00
Ryn Daniels
f6250015ab
Fix the datemath for reals (#10294)
* add debugging branch and quotes to daily slowtest action

* Apparently the quotes fixed it
2022-02-15 14:18:36 +01:00
Paul O'Leary McCann
23bd103d89 Add tmtoolkit setup steps 2022-02-14 15:17:25 +09:00
Markus Konrad
8818a44a39
add tmtoolkit package to spaCy universe (#10245) 2022-02-14 15:16:43 +09:00
github-actions[bot]
5adedb8587
Auto-format code with black (#10260)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-02-11 14:23:01 +01:00
Adriane Boyd
9a06a210ec
Exclude github workflow edits from CI (#10261) 2022-02-11 14:22:43 +01:00
Adriane Boyd
bbaf41fb3b
Set version to v3.2.2 (#10262) 2022-02-11 11:45:26 +01:00
Edward
7961a0a959
Fix typo in errors (#10256) 2022-02-10 13:45:46 +01:00
Ryn Daniels
2d6cabb23c
Fix the date command and the matrix failure mode (#10254) 2022-02-10 12:06:30 +01:00
Peter Baumgartner
ee662ec381
Raise error in spacy package when model name is not a valid python identifier (#10192)
* MultiHashEmbed vector docs correction

* raise error for invalid identifier as model name

* more succinct error message

* update success message

* permitted package name + double underscore

* clarify package name error

* clarify underscore run message

* tweak language + simplify underscore run

* cleanup underscore run warning

* spacing correction

* Update spacy/tests/test_cli.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-10 08:15:23 +01:00
Ryn Daniels
3877f78ff9
fix the syntax for the slow/gpu test crons (#10244) 2022-02-09 11:21:20 +01:00
John Boy
10c77af83d
add textnets to spaCy universe (#10216)
https://github.com/jboynyc/textnets/issues/38
2022-02-09 15:04:26 +09:00
Ines Montani
7b883da9fd
Merge pull request #10239 from explosion/docs/spacy-tailored-pipelines [ci skip] 2022-02-08 18:04:01 +01:00
Ramon Ziai
6477dafac2
fix(phrasematcher.pyi): change type annotation of docs in add() to List[Doc] (#10235)
https://github.com/explosion/spaCy/issues/10234
2022-02-08 13:37:27 +01:00
Ines Montani
f2c2b97e56 Add spaCy Tailored Pipelines 2022-02-08 11:46:42 +01:00
Adriane Boyd
a9ee5bff98
Support mixed case model package names (#10223) 2022-02-08 10:52:46 +01:00
Ryn Daniels
f939da0bfa
Add github actions for slow and gpu tests (#10225)
* Add github actions for slow and gpu tests

* change weekly GPU tests to also run slow tests, and change the time

* only run the tests if there were commits in the past day
2022-02-08 10:05:35 +01:00
Antti Ajanki
e9c26f2ee9
Add a noun chunker for Finnish (#10214)
with test cases
2022-02-08 08:44:11 +01:00
Sofie Van Landeghem
deb143fa70
Token sent attributes more consistent (#10164)
* remove duplicate line

* add sent start/end token attributes to the docs

* let has_annotation work with IS_SENT_END

* elif instead of if

* add has_annotation test for sent attributes

* fix typo

* remove duplicate is_sent_start entry in docs
2022-02-08 08:35:37 +01:00
Peter Baumgartner
836f689cc7
YAML multiline tip for project.yml files (#10187)
* MultiHashEmbed vector docs correction

* add in multi-line tip

* convert to sidebar tip
2022-02-08 08:35:09 +01:00
Kenneth Enevoldsen
e4625d2fc3
Added Augmenty to universe (#10229)
* Added Augmenty to universe

* Update website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-08 08:32:11 +01:00
Lj Miranda
42072f4468
Add spancat pipeline in spacy debug data (#10070)
* Setup debug data for spancat

* Add check for missing labels

* Add low-level data warning error

* Improve logic when compiling the gold train data

* Implement check for negative examples

* Remove breakpoint

* Remove ws_ents and missing entity checks

* Fix mypy errors

* Make variable name spans_key consistent

* Rename pipeline -> component for consistency

* Account for missing labels per spans_key

* Cleanup variable names for consistency

* Improve brevity of conditional statements

* Remove unused variables

* Include spans_key as an argument for _get_examples

* Add a conditional check for spans_key

* Update spancat debug data based on new API

- Instead of using _get_labels_from_model(), I'm now using
_get_labels_from_spancat() (cf. https://github.com/explosion/spaCy/pull/10079)
- The way information is displayed was also changed (text -> table)

* Rename model_labels to ensure mypy works

* Update wording on warning messages

Use "span type" instead of "entity type" in wording the warning messages.
This is because Spans aren't necessarily entities.

* Update component type into a Literal

This is to make it clear that the component parameter should only accept
either 'spancat' or 'ner'.

* Update checks to include actual model span_keys

Instead of looking at everything in the data, we only check those
span_keys from the actual spancat component. Instead of doing the filter
inside the for-loop, I just made another dictionary,
data_labels_in_component to hold this value.

* Update spacy/cli/debug_data.py

* Show label counts only when verbose is True

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-07 15:03:36 +01:00
Lj Miranda
72fece712f
Add shuffle parameter to Corpus API docs (#10220)
* Add shuffle parameter to Corpus API docs

* Update website/docs/api/corpus.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-07 14:55:53 +01:00
Adriane Boyd
63e1e4e8f6
Fix debug data check for ents that cross sents (#10188)
* Fix debug data check for ents that cross sents

* Use aligned sent starts to have the same indices for the NER and sent
start annotation
* Add a temporary, insufficient hack for the case where a
sentence-initial reference token is split into multiple tokens in the
predicted doc, since `Example.get_aligned("SENT_START")` currently
aligns `True` to all the split tokens.

* Improve test example

* Use Example.get_aligned_sent_starts

* Add test for crossing entity
2022-02-07 08:53:30 +01:00
github-actions[bot]
91ccacea12
Auto-format code with black (#10209)
* Auto-format code with black

* add black requirement to dev dependencies and pin to 22.x

* ignore black dependency for comparison with setup.cfg

Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2022-02-06 16:30:30 +01:00
Sofie Van Landeghem
bc12ecb870
Merge pull request #10185 from martinjack/master
Update Ukrainian tokenizer_exceptions
2022-02-06 16:30:03 +01:00
Sofie Van Landeghem
14513f82da
Merge pull request #10215 from explosion/master
update develop
2022-02-06 13:45:41 +01:00
Adriane Boyd
0668a449ba
Add Pipe.hide_labels to omit labels from pipeline meta (#10175) 2022-02-05 17:59:24 +01:00
Adriane Boyd
6f551043e4
Use paths.vectors for vectors in init config (#10146)
So that overriding `paths.vectors` works consistently in generated
configs, set vectors model in `paths.vectors` and always refer to this
path in `initialize.vectors`.
2022-02-04 21:09:48 +01:00
Adriane Boyd
fef896ce49
Allow Example to align whitespace annotation (#10189)
Remove exception for whitespace tokens in `Example.get_aligned` so that
annotation on whitespace tokens is aligned in the same way as for
non-whitespace tokens.
2022-02-03 17:01:53 +01:00
Kenneth Enevoldsen
a2f27ff83a
Added spacy-wrap to universe (#10168)
* Added spacy-wrap to universe 

Added spacy-wrap to the universe: a small package for wrapping fine-tuned Hugging Face transformers in a spaCy pipeline, following the same API as spacy-transformers. (Currently limited to classification models.)

* Update website/meta/universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-03 12:30:09 +01:00
Evgen Kytonin
fc3d446c71 Update Ukrainian tokenizer_exceptions 2022-02-01 13:24:00 +02:00
Lj Miranda
345e7f6bc4
Clarify Span.ents documentation (#10154)
* Clarify Span.ents documentation

Ref: #10135

Retain current behaviour. Span.ents will only include entities within
said span. You can't get tokens outside of the original span.

* Reword docstrings

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update API docs in the website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-31 08:41:42 +01:00
Marek Šuppa
f09c799a96
fix: Add missing comma to _eleven_to_beyond (#10166)
* This comma was most probably left out unintentionally, leading to string concatenation between two consecutive lines. The issue was found automatically using a regular expression.
2022-01-30 16:45:06 +09:00
Marek Šuppa
67ecac633f
fix: Add missing comma to examples.py (#10167)
* This comma was most probably left out unintentionally, leading to string concatenation between two consecutive lines. The issue was found automatically using a regular expression.
2022-01-30 16:43:29 +09:00
Adriane Boyd
4f441dfa24
Fix infix as prefix in Tokenizer.explain (#10140)
* Fix infix as prefix in Tokenizer.explain

Update `Tokenizer.explain` to align with the `Tokenizer` algorithm:

* skip infix matches that are prefixes in the current substring

* Update tokenizer pseudocode in docs
2022-01-28 17:00:54 +01:00
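The algorithm detail being aligned above: when scanning a substring for infixes, a match at offset 0 is really a prefix of that substring and is handled as such by the tokenizer, so `explain` must skip it. A minimal, illustrative sketch of that skip (not spaCy's actual code):

```
import re

def find_true_infixes(substring: str, infix_pattern: str):
    # Matches starting at offset 0 are prefixes of the current
    # substring, so they are not treated as infixes.
    return [
        m for m in re.finditer(infix_pattern, substring) if m.start() != 0
    ]

# The "-" at the start of "-hello-world" is a prefix, not an infix.
print([m.span() for m in find_true_infixes("-hello-world", r"-")])  # [(6, 7)]
```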
Eduard Zorita
30cf9d6a05
Update typing hints (#10109)
* Improve typing hints for Matcher.__call__

* Add typing hints for DependencyMatcher

* Add typing hints to underscore extensions

* Update Doc.tensor type (requires numpy 1.21)

* Fix typing hints for Language.component decorator

* Use generic np.ndarray type in Doc to avoid numpy version update

* Fix mypy errors

* Fix cyclic import caused by Underscore typing hints

* Use Literal type from spacy.compat

* Update matcher.pyi import format

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-28 16:59:54 +01:00
Adriane Boyd
09734c56fc
Use simple suggester for spancat initialization (#10143)
Instead of the running the actual suggester, which may require
annotation from annotating components that is not necessarily present in
the reference docs, use the built-in 1-gram suggester.
2022-01-28 09:34:23 +01:00
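The built-in suggester referenced above can be resolved from spaCy's function registry; `spacy.ngram_suggester.v1` is its documented registered name. A short usage sketch:

```
import spacy
from spacy import registry

nlp = spacy.blank("en")
docs = [nlp.make_doc("a short example")]

# Resolve the registered factory, then build a 1-gram suggester.
make_suggester = registry.misc.get("spacy.ngram_suggester.v1")
suggester = make_suggester(sizes=[1])

candidates = suggester(docs)  # a Ragged array of (start, end) offsets
print(candidates.lengths)     # one candidate count per doc
```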
github-actions[bot]
6d4db5c3c7
Auto-format code with black (#10106)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-01-21 10:01:10 +01:00
Ines Montani
34ed93ef68
Support version tags in universe and add note about reporting (#10093)
* Support version tags in universe and add note about reporting

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 23:21:26 +01:00
Peter Baumgartner
a69005037a
Docker Image for Website Dev (#10098)
* add docker instructions

* Update website/README.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/README.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* clarifying language on docker image

* fix markdown formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 23:02:13 +01:00
pepemedigu
2abd380f2d
Update lex_attrs.py for Spanish with ordinals (#10038)
* Update lex_attrs.py

Add ordinal words

* black formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 15:44:13 +01:00
Sofie Van Landeghem
d2afdfefc2
Merge pull request #10100 from svlandeg/feature/master_copy
Update develop with latest from master (2)
2022-01-20 14:29:50 +01:00
Sofie Van Landeghem
4465fe0306
Merge branch 'develop' into feature/master_copy 2022-01-20 13:36:17 +01:00
Duygu Altinok
47a2916801
Intify IOB (#9738)
* added iob to int

* added tests

* added iob strings

* added error

* blacked attrs

* Update spacy/tests/lang/test_attrs.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/attrs.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* added iob strings as global

* minor refinement with iob

* removed iob strings from token

* changed to uppercase

* cleaned and went back to master version

* imported iob from attrs

* Update and format errors

* Support and test both str and int ENT_IOB key

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-20 13:19:38 +01:00
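The mapping itself is small. spaCy documents the integer encoding for `Token.ent_iob` as 0 = no tag set, 1 = I, 2 = O, 3 = B, which is what the shared strings tuple encodes. A sketch of the normalization (illustrative function, not the library's exact code):

```
# Index in the tuple == spaCy's documented ent_iob integer code.
IOB_STRINGS = ("", "I", "O", "B")

def normalize_iob(value):
    if isinstance(value, int):
        if 0 <= value < len(IOB_STRINGS):
            return value
        raise ValueError(f"invalid ENT_IOB int: {value}")
    try:
        return IOB_STRINGS.index(value)
    except ValueError:
        raise ValueError(f"invalid ENT_IOB string: {value!r}") from None

assert normalize_iob("B") == 3 and normalize_iob(2) == 2
```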
Duygu Altinok
268ddf8a06
Add ENT_IOB key to Matcher (#9649)
* added new field

* added exception for IOb strings

* minor refinement to schema

* removed field

* fixed typo

* imported numerical val

* changed the code bit

* cosmetics

* added test for matcher

* set ents of mock docs

* added invalid pattern

* minor update to documentation

* blacked matcher

* added pattern validation

* add IOB vals to schema

* changed into test

* mypy compat

* cleaned left over

* added compat import

* changed type

* added compat import

* changed literal a bit

* went back to old

* made explicit type

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-20 13:18:39 +01:00
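With the new key, token patterns can constrain matches by entity IOB position. A small usage sketch:

```
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("San Francisco is foggy")
doc.ents = [Span(doc, 0, 2, label="GPE")]

matcher = Matcher(nlp.vocab)
# "B" matches tokens that begin an entity; per the commit above, the
# integer code (3) is accepted as well.
matcher.add("ENT_START", [[{"ENT_IOB": "B"}]])
print([doc[s:e].text for _, s, e in matcher(doc)])  # ['San']
```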
Daniël de Kok
6984f55277
Merge pull request #10048 from danieldk/index-arcs-by-head
Use constant-time head lookups in StateC::{L,R}
2022-01-20 13:06:14 +01:00
Paul O'Leary McCann
32bd3856b3
Rename FACILITY to FAC in color list (#10067)
This matches the English models
2022-01-20 12:00:28 +01:00
Adriane Boyd
a55212fca0
Determine labels by factory name in debug data (#10079)
* Determine labels by factory name in debug data

For all components, return labels for all components with the
corresponding factory name rather than for only the default name.

For `spancat`, return labels as a dict keyed by `spans_key`.

* Refactor for typing

* Add test

* Use assert instead of cast, removed unneeded arg

* Mark test as slow
2022-01-20 11:42:52 +01:00
Richard Hudson
e9c6314539
Bugfix for similarity return types (#10051) 2022-01-20 11:40:46 +01:00
Adriane Boyd
7d528e607c
Update quickstart install steps (#10092)
* For conda:
  * Use conda environment rather than venv
  * Install `spacy-transformers` as a conda package
* For pip:
  * Add quotes if extras are included
2022-01-20 10:53:40 +01:00
Paul O'Leary McCann
2ff53834bb
Add link to pattern file info in EntityRuler.initialize docs (#10091)
* Add link to pattern file info in EntityRuler.initialize docs

* Update website/docs/api/entityruler.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-19 10:45:11 +01:00
Daniël de Kok
50d2a2c930
Use fewer Vector internals (#9879)
* Use Vectors.shape rather than Vectors.data.shape

* Use Vectors.size rather than Vectors.data.size

* Add Vectors.to_ops to move data between different ops

* Add documentation for Vectors.to_ops
2022-01-18 17:14:35 +01:00
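A short usage sketch of the `Vectors.to_ops` method added above, together with the public `Vectors.shape` attribute:

```
import spacy
from thinc.api import get_current_ops

nlp = spacy.blank("en")
vectors = nlp.vocab.vectors

# Move the underlying table to the currently active ops (e.g. after
# switching between CPU and GPU), then inspect it via the public API.
vectors.to_ops(get_current_ops())
print(vectors.shape)  # (rows, width) without touching Vectors.data
```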
Adriane Boyd
4dfd559e55
Fix spaces in Doc.from_docs for empty docs (#10052)
Fix spaces in `Doc.from_docs(ensure_whitespace=True)` for cases where a
doc ending in whitespace is followed by an empty doc.
2022-01-18 17:12:42 +01:00
Paul O'Leary McCann
c28e33637b
Mark flaky spancat test so it doesn't fail the build (#10075)
* Mark flaky spancat test so it doesn't fail the build

* Skip, don't run and ignore
2022-01-18 09:36:28 +01:00
Adriane Boyd
39f1b13e77
Update sudachipy extras (#10072)
By @polm, redone from #9917 after incorrect (reverted) rebase.

`sudachipy>=0.5.2` is needed for newer dictionaries. `sudachipy<0.6.0`
is kept for users who might still prefer the older version, in
particular to be able to compile it without rust.
2022-01-17 11:48:39 +01:00
Natalia Rodnova
47ea6704f1
Span richcmp fix (#9956)
* Corrected Span's __richcmp__ implementation to take end, label and kb_id in consideration

* Updated test

* Updated test

* Removed formatting from a test for readability sake

* Use same tuples for all comparisons

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-17 11:17:49 +01:00
Adriane Boyd
add52935ff
Revert "Bump sudachipy version (#9917)" (#10071)
This reverts commit 58bdd8607b.
2022-01-17 10:38:37 +01:00
Tuomo Hiippala
6a8619dd73
Update the entry for Applied Language Technology in spaCy Universe (#10068)
* add entry for Applied Language Technology under "Courses"

Added the following entry into `universe.json`:

```
        {
            "type": "education",
            "id": "applt-course",
            "title": "Applied Language Technology",
            "slogan": "NLP for newcomers using spaCy and Stanza",
            "description": "These learning materials provide an introduction to applied language technology for audiences who are unfamiliar with language technology and programming. The learning materials assume no previous knowledge of the Python programming language.",
            "url": "https://applied-language-technology.readthedocs.io/",
            "image": "https://www.mv.helsinki.fi/home/thiippal/images/applt-preview.jpg",
            "thumb": "https://applied-language-technology.readthedocs.io/en/latest/_static/logo.png",
            "author": "Tuomo Hiippala",
            "author_links": {
                "twitter": "tuomo_h",
                "github": "thiippal",
                "website": "https://www.mv.helsinki.fi/home/thiippal/"
            },
            "category": ["courses"]
        },
```

* Update the entry for "Applied Language Technology"
2022-01-17 08:28:51 +01:00
Paul O'Leary McCann
58bdd8607b
Bump sudachipy version (#9917)
* Edited Slovenian stop words list (#9707)

* Noun chunks for Italian (#9662)

* added it vocab

* copied portuguese

* added possessive determiner

* added conjed Nps

* added nmoded Nps

* test misc

* more examples

* fixed typo

* fixed parenth

* fixed comma

* comma fix

* added syntax iters

* fix some index problems

* fixed index

* corrected heads for test case

* fixed test case

* fixed determiner gender

* cleaned left over

* added example with apostrophe

* French NP review (#9667)

* adapted from pt

* added basic tests

* added fr vocab

* fixed noun chunks

* more examples

* typo fix

* changed naming

* changed the naming

* typo fix

* Add Japanese kana characters to default exceptions (fix #9693) (#9742)

This includes the main kana, or phonetic characters, used in Japanese.

There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:

- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension

* Remove NER words from stop words in Norwegian (#9820)

Default stop words for Norwegian Bokmål (nb) in spaCy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police districts, important units of time, e.g. months and days of the week, and organisations.

Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831

* Bump sudachipy version

* Update sudachipy versions

* Bump versions

Bumping to the most recent dictionary just to keep things current.
Bumping sudachipy to 5.2 because older versions don't support recent
dictionaries.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
Co-authored-by: Duygu Altinok <duygu@explosion.ai>
Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>
2022-01-17 08:16:22 +01:00
ColleterVi
a784b12eff
fix: new restcountries url (#10043)
URL extension "eu" and path "rest" are no longer available. Replacing them with a working URL.
2022-01-13 20:25:06 +09:00
Daniël de Kok
63fa55089d Use constant-time head lookups in StateC::{L,R}
This change changes the type of left/right-arc collections from
vector[ArcC] to unordered_map[int, vector[Arc]], so that the arcs are
keyed by the head. This allows us to find all the left/right arcs for a
particular head in constant time in StateC::{L,R}.

Benchmarks with long docs (N is the number of text repetitions):

Before (using #10019):

    N  Time (s)

  400   3.2
  800   5.0
 1600   9.5
 3200  23.2
 6400  66.8
12800  220.0

After (this commit):

   N   Time (s)

  400   3.1
  800   4.3
 1600   6.7
 3200  12.0
 6400  22.0
12800  42.0

Related to #9858 and #10019.
2022-01-13 12:08:46 +01:00
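A Python rendering of the indexing change above (the real structure is a C++ `unordered_map[int, vector[ArcC]]`): keying the arcs by head makes the `StateC::{L,R}` lookups independent of the total arc count.

```
from collections import defaultdict

arcs = [(0, 2, "dobj"), (0, 5, "prep"), (3, 4, "amod")]  # (head, child, label)

# Before: one flat list, so finding arcs for a head scans everything.
# After: arcs keyed by head for constant-time lookup.
arcs_by_head = defaultdict(list)
for head, child, label in arcs:
    arcs_by_head[head].append((child, label))

print(arcs_by_head[0])  # [(2, 'dobj'), (5, 'prep')]
```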
Daniël de Kok
e8a047a8d4
Merge pull request #10045 from danieldk/merge-master
Update develop with master
2022-01-13 10:32:52 +01:00
Daniël de Kok
677c1a3507 Speed up the StateC::L feature function (#10019)
* Speed up the StateC::L feature function

This function gets the n-th most-recent left-arc with a particular head.
Before this change, StateC::L would construct a vector of all left-arcs
with the given head and then pick the n-th most recent from that vector.
Since the number of left-arcs strongly correlates with the doc length
and the feature is constructed for every transition, this can make
transition-parsing quadratic.

With this change StateC::L:

- Searches left-arcs backwards.
- Stops early when the n-th matching transition is found.
- Does not construct a vector (reducing memory pressure).

This change doesn't avoid the linear search when the transition that is
queried does not occur in the left-arcs. Regardless, performance is
improved quite a bit with very long docs:

Before:

   N  Time

 400   3.3
 800   5.4
1600  11.6
3200  30.7

After:

   N  Time

 400   3.2
 800   5.0
1600   9.5
3200  23.2

We can probably do better with more tailored data structures, but I
first wanted to make a low-impact PR.

Found while investigating #9858.

* StateC::L: simplify loop
2022-01-13 09:29:58 +01:00
Daniël de Kok
28299644fc
Speed up the StateC::L feature function (#10019)
* Speed up the StateC::L feature function

This function gets the n-th most-recent left-arc with a particular head.
Before this change, StateC::L would construct a vector of all left-arcs
with the given head and then pick the n-th most recent from that vector.
Since the number of left-arcs strongly correlates with the doc length
and the feature is constructed for every transition, this can make
transition-parsing quadratic.

With this change StateC::L:

- Searches left-arcs backwards.
- Stops early when the n-th matching transition is found.
- Does not construct a vector (reducing memory pressure).

This change doesn't avoid the linear search when the transition that is
queried does not occur in the left-arcs. Regardless, performance is
improved quite a bit with very long docs:

Before:

   N  Time

 400   3.3
 800   5.4
1600  11.6
3200  30.7

After:

   N  Time

 400   3.2
 800   5.0
1600   9.5
3200  23.2

We can probably do better with more tailored data structures, but I
first wanted to make a low-impact PR.

Found while investigating #9858.

* StateC::L: simplify loop
2022-01-13 09:03:55 +01:00
jsnfly
176a90edee
Fix textcat loss scaling (#9904) (#10002)
* add failing test for issue 9904

* remove division by batch size and summation before applying the mean

Co-authored-by: jonas <jsnfly@gmx.de>
2022-01-13 09:03:23 +01:00
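A sketch of the corrected scaling, simplified from the textcat loss (the names are illustrative): the squared-error loss is a single mean over all scores, with no extra division by the batch size beforehand.

```
import numpy

def get_loss(scores, truths):
    d_scores = scores - truths      # gradient, unscaled
    loss = (d_scores ** 2).mean()   # one mean, no prior /batch_size
    return loss, d_scores

scores = numpy.array([[0.9, 0.1], [0.2, 0.8]])
truths = numpy.array([[1.0, 0.0], [0.0, 1.0]])
print(get_loss(scores, truths)[0])
```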
Sofie Van Landeghem
d8a3012539
Merge pull request #10037 from explosion/master
Update develop with master
2022-01-12 12:29:23 +01:00
Ryn Daniels
057b8c64c0
Check for assets with size of 0 bytes (#10026)
* Check for assets with size of 0 bytes

* Update spacy/cli/project/assets.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-12 10:34:23 +01:00
Sofie Van Landeghem
5ba4171b19
Update LICENSE to include 2022 [ci skip] 2022-01-07 09:24:07 +01:00
Ines Montani
005e23a525
Merge pull request #9989 from explosion/docs/update-algolia-search-api [ci skip] 2022-01-05 14:14:42 +01:00
Ines Montani
a437ca6737 Update website to use new Algolia search API 2022-01-05 13:21:06 +01:00
Sofie Van Landeghem
067a44a417
Merge pull request #9987 from explosion/master
Update develop with commits from master
2022-01-05 11:49:50 +01:00
Lj Miranda
00e7bf5ffd
Add a few docs to the default_config.cfg (#9981)
* Clarify patience hyperparameter

The current value for patience doesn't seem to indicate that it's
pointing to the number of steps. It may be useful to specify that
explicitly.

Ref: https://github.com/explosion/spaCy/discussions/7450
Ref: https://github.com/explosion/spaCy/discussions/7465

* Update docs for max_steps
2022-01-05 09:16:40 +01:00
Duygu Altinok
55cf492218
Feat/debug data warn spread ents (#9960)
* added check for crossing boundaries

* formatted blacked

* Rephrasing slightly

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-04 18:22:10 +01:00
Sofie Van Landeghem
56dcb39fb7
Fix references to config file in the docs & UX (#9961)
* doc fixes around config file

* fix typo

* clarify default
2022-01-04 14:31:26 +01:00
Sofie Van Landeghem
029a48e340
fix type of lexeme.rank (#9979) 2022-01-04 13:15:25 +01:00
Sam Edwardes
6f65e2b544
Added spacypdfreader to universe.json (#9963) 2022-01-03 16:34:36 +09:00
Paul O'Leary McCann
f40e237c5a
Remove denomme from universe (#9952)
Package seems to have been deleted.
2021-12-29 11:41:29 +01:00
Florian Cäsar
86e71e7b19
Fix Scorer.score_cats for missing labels (#9443)
* Fix Scorer.score_cats for missing labels

* Add test case for Scorer.score_cats missing labels

* semantic nitpick

* black formatting

* adjust test to give different results depending on multi_label setting

* fix loss function according to whether or not missing values are supported

* add note to docs

* small fixes

* make mypy happy

* Update spacy/pipeline/textcat.py

Co-authored-by: Florian Cäsar <florian.caesar@pm.me>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2021-12-29 11:04:39 +01:00
Sofie Van Landeghem
b8106e0f95
Merge pull request #9951 from explosion/master
Update develop branch with master
2021-12-29 10:11:43 +01:00
Yoav Vollansky
9d63dfacfc
Update UNIVERSE.md (#9941)
typo
2021-12-27 13:46:04 +01:00
Peter Baumgartner
72abf9e102
MultiHashEmbed vector docs correction (#9918) 2021-12-27 11:18:08 +01:00
Duygu Altinok
7ec1452f5f
added elided forms (#9878)
* added elided forms

* rearranged a bit

* rearranged a bit

* added stopword tests

* blacked tests file
2021-12-23 13:41:01 +01:00
Andrew Janco
3cfeb518ee
Handle "_" value for token pos in conllu data (#9903)
* change '_' to '' to allow Token.pos when there is no value for token pos in conllu data

* Minor code style

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-12-21 15:46:33 +01:00
Adriane Boyd
837d241b68
Make floret murmurhash endian-neutral (#9735) 2021-12-20 17:11:31 +01:00
Adriane Boyd
1163073756
Remove outdated patterns MANIFEST.in (#9912) 2021-12-20 16:40:20 +01:00
Adriane Boyd
18e5638af0
Extend cupy to v10.x (#9911)
* Add extra for `cupy-cuda115`
2021-12-20 15:48:35 +01:00
Sofie Van Landeghem
7847839003
Merge pull request #9891 from explosion/master
Update develop with master
2021-12-17 14:01:27 +01:00
Daniël de Kok
93e9bf681f
Merge pull request #9873 from danieldk/temporarily-pin-mypy
Pin mypy to 0.910 until there is a compatible pydantic version
2021-12-16 10:28:31 +01:00
Daniël de Kok
b08f1ac17d Pin mypy to 0.910 until there is a compatible pydantic version 2021-12-16 09:31:45 +01:00
Adriane Boyd
94fbd88521
Use dict.copy().items() instead of list(.items()) (#9868) 2021-12-16 09:17:33 +01:00
Edward
018827e9fd Add healthsea to universe (#9838)
* Add healthsea to universe

* Update website/meta/universe.json

* Add thumbnail

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-15 17:57:19 +01:00
antonpibm
ac45ae3779
Update Tokenizer documentation to reflect token_match and url_match signatures (#9859) 2021-12-15 09:34:33 +01:00
Ines Montani
ba0fa7a64e
Support Google Sheets embeds in docs (#9861) 2021-12-15 09:27:08 +01:00
Adriane Boyd
800737b416
Set version to v3.2.1 (#9823) 2021-12-07 10:51:45 +01:00
Haakon Meland Eriksen
251119455d
Remove NER words from stop words in Norwegian (#9820)
Default stop words for Norwegian Bokmål (nb) in spaCy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police districts, important units of time, e.g. months and days of the week, and organisations.

Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831
2021-12-07 09:45:10 +01:00
Adriane Boyd
51a3b60027
Document Tagger neg_prefix, fix typo (#9821) 2021-12-07 09:42:40 +01:00
Adriane Boyd
a0cdc2b007
Use Language.pipe in evaluate (#9800) 2021-12-06 20:39:15 +01:00
Adriane Boyd
9964243eb2
Make the Tagger neg_prefix configurable (#9802) 2021-12-06 18:04:44 +01:00
Duygu Altinok
b56b9e7f31
Entity ruler remove pattern (#9685)
* added ruler code

* added error for non-existing pattern

* changed error to warning

* changed error to warning

* added basic tests

* fixed place

* added test files

* went back to error

* went back to pattern error

* minor change to docs

* changed style

* changed doc

* changed error slightly

* added remove to phrasematcher api

* error key already existed

* phrase matcher match code to api

* blacked tests

* moved comments before expr

* corrected error no

* Update website/docs/api/entityruler.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/entityruler.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-06 15:32:49 +01:00
Natalia Rodnova
472740d613
Added sents property to Span for Spans spanning over several sentences (#9699)
* Added sents property to Span class that returns a generator of sentences the Span belongs to

* Added description to Span.sents property

* Update test_span to clarify the difference between span.sent and span.sents

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/tests/doc/test_span.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix documentation typos in spacy/tokens/span.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update Span.sents doc string in spacy/tokens/span.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Parametrized test_span_spans

* Corrected Span.sents to check for span-level hook first. Also, made Span.sent respect doc-level sents hook if no span-level hook is provided

* Corrected Span documentation copy/paste issue

* Put back accidentally deleted lines

* Fixed formatting in span.pyx

* Moved check for SENT_START annotation after user hooks in Span.sents

* add version where the property was introduced

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-06 09:58:01 +01:00
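A short usage sketch of the new property (sentence boundaries here come from the sentencizer; `Span.sents` yields every sentence the span overlaps, while `Span.sent` returns only one):

```
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("First sentence. Second sentence. Third.")

span = doc[2:6]  # straddles the first two sentences
print(span.sent.text)                      # sentence of the span start only
print([sent.text for sent in span.sents])  # every sentence the span overlaps
```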
Lj Miranda
7d50804644
Migrate regression tests into the main test suite (#9655)
* Migrate regressions 1-1000

* Move serialize test to correct file

* Remove tests that won't work in v3

* Migrate regressions 1000-1500

Removed regression test 1250 because v3 doesn't support the old LEX
scheme anymore.

* Add missing imports in serializer tests

* Migrate tests 1500-2000

* Migrate regressions from 2000-2500

* Migrate regressions from 2501-3000

* Migrate regressions from 3000-3501

* Migrate regressions from 3501-4000

* Migrate regressions from 4001-4500

* Migrate regressions from 4501-5000

* Migrate regressions from 5001-5501

* Migrate regressions from 5501 to 7000

* Migrate regressions from 7001 to 8000

* Migrate remaining regression tests

* Fixing missing imports

* Update docs with new system [ci skip]

* Update CONTRIBUTING.md

- Fix formatting
- Update wording

* Remove lemmatizer tests in el lang

* Move a few tests into the general tokenizer

* Separate Doc and DocBin tests
2021-12-04 20:34:48 +01:00
Paul O'Leary McCann
b4d526c357
Add Japanese kana characters to default exceptions (fix #9693) (#9742)
This includes the main kana, or phonetic characters, used in Japanese.

There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:

- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension
2021-11-30 23:36:39 +01:00
Sofie Van Landeghem
58e29776bd
Merge pull request #9777 from explosion/master
Update develop with master
2021-11-30 14:01:23 +01:00
Duygu Altinok
29f28d1f3e
French NP review (#9667)
* adapted from pt

* added basic tests

* added fr vocab

* fixed noun chunks

* more examples

* typo fix

* changed naming

* changed the naming

* typo fix
2021-11-30 12:19:07 +01:00
Daniël de Kok
72f7f4e68a
morphologizer: avoid recreating label tuple for each token (#9764)
* morphologizer: avoid recreating label tuple for each token

The `labels` property converts the dictionary key set to a tuple. This
property was used for every annotated token, recreating the tuple over
and over again.

Construct the tuple once in the set_annotations function and reuse it.

On a Finnish pipeline that I was experimenting with, this results in a
speedup of ~15% (~13000 -> ~15000 WPS).

* tagger: avoid recreating label tuple for each token
2021-11-30 11:58:59 +01:00
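The pattern behind the ~15% speedup above is simply hoisting an invariant out of the inner loop. A schematic Python version (illustrative names, not the actual Cython code):

```
class MorphologizerSketch:
    def __init__(self, labels_morph):
        self.cfg = {"labels_morph": labels_morph}

    @property
    def labels(self):
        # Rebuilds a tuple from the dict keys on *every* access.
        return tuple(self.cfg["labels_morph"].keys())

    def set_annotations(self, docs, batch_tag_ids):
        labels = self.labels  # built once per batch, not once per token
        for doc, tag_ids in zip(docs, batch_tag_ids):
            for token, tag_id in zip(doc, tag_ids):
                morph = labels[tag_id]  # reuse the cached tuple
                # apply `morph` to the token here (elided)
```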
Adriane Boyd
c19f0c1604
Switch to latest CI images (#9773) 2021-11-30 10:08:51 +01:00
Narayan Acharya
1be8a4dab3
Displacy serve entity linking support without manual=True support. (#9748)
* Add support for kb_id to be displayed via displacy.serve. The current support is limited to the manual option in displacy.render

* Commit to check pre-commit hooks are run.

* Update spacy/displacy/__init__.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Changes as per suggestions on the PR.

* Update website/docs/api/top-level.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/top-level.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* tag option as new from 3.2.1 onwards

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-11-29 17:13:26 +01:00
Adriane Boyd
6763cbfdc0
Update Catalan acknowledgements for v3.2 (#9763) 2021-11-29 14:14:21 +01:00
Paul O'Leary McCann
ac05de2c6c
Fix Language-specific factory handling in package command (#9674)
* Use internal names for factories

If a component factory is registered like `@French.factory(...)` instead
of `@Language.factory(...)`, the name in the factories registry will be
prefixed with the language code. However in the nlp.config object the
factory will be listed without the language code. The `add_pipe` code
has fallback logic to handle this, but packaging code and the registry
itself don't.

This change makes it so that the factory name in nlp.config is the
language-specific form. It's not clear if this will break anything else,
but it does seem to fix the inconsistency and resolve the specific user
issue that brought this to our attention.

* Change approach to use fallback in package lookup

This adds fallback logic to the package lookup, so it doesn't have to
touch the way the config is built. It seems to fix the tests too.

* Remove unnecessary line

* Add test

This also adds an assert that seems to have been forgotten.
2021-11-29 08:31:02 +01:00
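The naming wrinkle described above is easiest to see with a registration sketch (the component name is made up for the example):

```
from spacy.lang.fr import French

@French.factory("my_component")
def create_my_component(nlp, name):
    # Registered internally as "fr.my_component"; nlp.config may list it
    # without the prefix, which is the inconsistency the fallback covers.
    def my_component(doc):
        return doc
    return my_component

nlp = French()
nlp.add_pipe("my_component")  # fallback logic resolves the prefixed name
print(nlp.config["components"]["my_component"]["factory"])
```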
Richard Hudson
7b134b8fbd
New tests for a number of alpha languages (#9703)
* Added Slovak

* Added Slovenian tests

* Added Estonian tests

* Added Croatian tests

* Added Latvian tests

* Added Icelandic tests

* Added Afrikaans tests

* Added language-independent tests

* Added Kannada tests

* Tidied up

* Added Albanian tests

* Formatted with black

* Added failing tests for anomalies

* Update spacy/tests/lang/af/test_text.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Estonian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Croatian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Icelandic tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Latvian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Slovak tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Slovenian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-11-28 21:59:23 +01:00
Tuomo Hiippala
5c44533263
add entry for Applied Language Technology under "Courses" (#9755)
Added the following entry into `universe.json`:

```
        {
            "type": "education",
            "id": "applt-course",
            "title": "Applied Language Technology",
            "slogan": "NLP for newcomers using spaCy and Stanza",
            "description": "These learning materials provide an introduction to applied language technology for audiences who are unfamiliar with language technology and programming. The learning materials assume no previous knowledge of the Python programming language.",
            "url": "https://applied-language-technology.readthedocs.io/",
            "image": "https://www.mv.helsinki.fi/home/thiippal/images/applt-preview.jpg",
            "thumb": "https://applied-language-technology.readthedocs.io/en/latest/_static/logo.png",
            "author": "Tuomo Hiippala",
            "author_links": {
                "twitter": "tuomo_h",
                "github": "thiippal",
                "website": "https://www.mv.helsinki.fi/home/thiippal/"
            },
            "category": ["courses"]
        },
```
2021-11-28 19:33:16 +09:00
Natalia Rodnova
a4c43e5c57
Allow Matcher to match on ENT_ID and ENT_KB_ID (#9688)
* Added ENT_ID and ENT_KB_ID into the list of the attributes that Matcher matches on

* Added ENT_ID and ENT_KB_ID to TEST_PATTERNS in test_pattern_validation.py. Disabled tests that I added before

* Update website/docs/api/matcher.md

* Format

* Remove skipped tests

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-11-24 10:37:10 +01:00
Richard Hudson
7fec5fd647
Merge pull request #9737 from Pantalaymon/patch-1
Create Pantalaymon.md
2021-11-24 09:56:43 +01:00
Valentin-Gabriel Soumah
0bbf86bba8
Create Pantalaymon.md
Submitting the contributor agreement to spaCy in order to contribute to the Coreferee project.
2021-11-23 17:29:23 +01:00
Duygu Altinok
25bd9f9d48
Noun chunks for Italian (#9662)
* added it vocab

* copied portuguese

* added possessive determiner

* added conjed Nps

* added nmoded Nps

* test misc

* more examples

* fixed typo

* fixed parenth

* fixed comma

* comma fix

* added syntax iters

* fix some index problems

* fixed index

* corrected heads for test case

* fixed test case

* fixed determiner gender

* cleaned left over

* added example with apostrophe
2021-11-23 16:29:25 +01:00
Duygu Altinok
a7d7e80adb
EntityRuler improve disk load error message (#9658)
* added error string

* added serialization test

* added more to if statements

* wrote file to tempdir

* added tempdir

* changed parameter a bit

* Update spacy/tests/pipeline/test_entity_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-11-23 16:26:05 +01:00
Adriane Boyd
9ac6d4991e
Add doc_cleaner component (#9659)
* Add doc_cleaner component

* Fix types

* Fix loop

* Rephrase method description
2021-11-23 15:33:33 +01:00
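Usage sketch for the new component, assuming its default configuration:

```
import spacy

nlp = spacy.blank("en")
# doc_cleaner strips intermediate data (e.g. Doc.tensor) after other
# components have consumed it, reducing memory use in long pipelines.
nlp.add_pipe("doc_cleaner")
doc = nlp("some text")
```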
Adriane Boyd
a77f50baa4
Allow Scorer.score_spans to handle pred docs with missing annotation (#9701)
If the predicted docs are missing annotation according to
`has_annotation`, treat the docs as having no predictions rather than
raising errors when the annotation is missing.

The motivation for this is a combined tokenization+sents scorer for a
component where the sents annotation is optional. To provide a single
scorer in the component factory, it needs to be possible for the scorer
to continue despite missing sents annotation in the case where the
component is not annotating sents.
2021-11-23 15:17:19 +01:00
Adriane Boyd
36c7047946
Use reference parse to initialize parser moves (#9722) 2021-11-23 14:55:55 +01:00
Paul O'Leary McCann
52b8c2d2e0
Add note on batch contract for listeners (#9691)
* Add note on batch contract

Using listeners requires batches to be consistent. This is obvious if
you understand how the listener works, but it wasn't clearly stated in
the Docs, and was subtle enough that the EntityLinker missed it.

There is probably a clearer way to explain what the actual requirement
is, but I figure this is a good start.

* Rewrite to clarify role of caching
2021-11-22 11:06:07 +01:00
Richard Hudson
a1f25412da
Edited Slovenian stop words list (#9707) 2021-11-22 09:46:34 +01:00
Sofie Van Landeghem
13645dcbf5
add note that annotating components is new since 3.1 (#9678) 2021-11-22 14:43:11 +09:00
Adriane Boyd
0e93b315f3
Convert labels to strings for README in package CLI (#9694) 2021-11-19 08:51:46 +01:00
Adriane Boyd
ea450d652c
Exclude strings from v3.2+ source vector checks (#9697)
Exclude strings from `Vector.to_bytes()` comparions for v3.2+ `Vectors`
that now include the string store so that the source vector comparison
is only comparing the vectors and not the strings.
2021-11-19 08:51:19 +01:00
Paul O'Leary McCann
f3981bd0c8
Clarify how to fill in init_tok2vec after pretraining (#9639)
* Clarify how to fill in init_tok2vec after pretraining

* Ignore init_tok2vec arg in pretraining

* Update docs, config setting

* Remove obsolete note about not filling init_tok2vec early

This seems to have also caught some lines that needed cleanup.
2021-11-18 15:38:30 +01:00
Vishnu Nandakumar
86fa37e8ba
Update universe.json with new library eng_spacysentiment (#9679)
* Update universe.json

* Update universe.json

* Cleanup fields

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-11-16 14:06:19 +09:00
Adriane Boyd
c9baf9d196
Fix spancat for empty docs and zero suggestions (#9654)
* Fix spancat for empty docs and zero suggestions

* Use ops.xp.zeros in test
2021-11-15 12:40:55 +01:00
Sofie Van Landeghem
4694b43d87
Merge pull request #9673 from explosion/master
update develop branch for 3.3
2021-11-15 11:14:49 +01:00
github-actions[bot]
67d8c8a081
Auto-format code with black (#9664)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-11-12 10:00:03 +01:00
Sofie Van Landeghem
24cdd4c88e
Merge pull request #9638 from polm/fix/optional-pretrain-path
Make Jsonl Corpus reader path optional again
2021-11-09 10:45:14 +01:00
Paul O'Leary McCann
8aa2d32ca9 Update jsonlcorpus constructor types 2021-11-09 16:20:19 +09:00
Paul O'Leary McCann
71fb00ed95
Update spacy/training/corpus.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-11-08 10:02:29 +00:00
Sofie Van Landeghem
c97f29c593
Merge pull request #9629 from ljvmiranda921/chore/migrate-regressions
Migrate regression and other tests to the new pytest marker
2021-11-08 09:07:38 +01:00
Paul O'Leary McCann
141f12b92e Make Jsonl Corpus reader optional again 2021-11-07 18:56:23 +09:00
Lj Miranda
909177589d Remove utility script 2021-11-06 06:35:58 +08:00
Ines Montani
86af0234ab
Update version [ci skip] 2021-11-05 19:02:35 +01:00
Adriane Boyd
216ed231a9 What's new in v3.2 (#9633)
* What's new in v3.2

* Fix formatting

* Fix typo

* Redo thanks

* Formatting

* Fix typo

* Fix project links

* Fix typo

* Minimal intro, floret python module

* Rephrase

* Rephrase, extend

* Rephrase

* Update links and formatting [ci skip]

* Minor correction

* Fix typo

Co-authored-by: Ines Montani <ines@ines.io>
2021-11-05 16:31:14 +01:00
Adriane Boyd
0fc3dee772
Merge pull request #9596 from adrianeboyd/tests/reenable-v3.2.0-tests
Reenable tests for v3.2.0
2021-11-05 10:54:30 +01:00
github-actions[bot]
5cdb7eb5c2
Auto-format code with black (#9631)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-11-05 09:58:36 +01:00
Adriane Boyd
e6f91b6f27
Format (#9630) 2021-11-05 09:56:26 +01:00
Lj Miranda
8e7deaf210 Add missing imports in some regression tests
- test_issue7001-8000.py
- test_issue8190.py
2021-11-05 11:47:59 +08:00
Lj Miranda
addeb34bc4 Decorate regression tests
Even if the issue number is already in the file, I still
decorated them just to follow the convention found in test_issue8168.py
2021-11-05 11:47:44 +08:00
Lj Miranda
91dec2c76e Decorate non-regression tests 2021-11-05 11:47:33 +08:00
Lj Miranda
199943deb4 Add simple script to add pytest marks 2021-11-05 11:47:28 +08:00
Duygu Altinok
f0e8c9fe58
Spanish noun chunks review (#9537)
* updated syntax iters

* formatted the code

* added prepositional objects

* code clean up

* eliminated left attached adp

* added es vocab

* added basic tests

* fixed typo

* fixed typo

* list to set

* fixed doc name

* added code for conj

* more tests

* differentiated adjectives and flat

* fixed typo

* added compounds

* more compounds

* tests for compounds

* tests for nominal modifiers

* fixed typo

* fixed typo

* formatted file

* reformatted tests

* fixed typo

* fixed punct typo

* formatted after changes

* added indirect object

* added full sentence examples

* added longer full sentence examples

* fixed sentence length of test

* added passive subj

* added test case by Damian
2021-11-05 00:46:36 +01:00
Duygu Altinok
6e6650307d
Portuguese noun chunks review (#9559)
* added tests

* added pt vocab

* transferred spanish

* added syntax iters

* fixed parenthesis

* added nmod example

* added relative pron

* fixed rel pron

* added rel subclause

* corrected typo

* added more NP chains

* long sentence

* fixed typo

* fixed typo

* fixed typo

* corrected heads

* added passive subj

* added pass subj

* added passive obj

* refinement to rights

* went back to odl

* fixed test

* fixed typo

* fixed typo

* formatted

* Format

* Format test cases

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-11-04 23:55:49 +01:00
Adriane Boyd
2bf52c44b1
Merge pull request #9612 from adrianeboyd/chore/switch-to-master-v3.2.0
Switch v3.2.0 to master
2021-11-03 16:27:34 +01:00
Adriane Boyd
07dea324f6 Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0 2021-11-03 15:32:18 +01:00
Bram Vanroy
cab9209c3d
use metaclass to decorate errors (#9593) 2021-11-03 15:29:32 +01:00
Paul O'Leary McCann
c1cc94a33a
Fix typo about receptive field size (#9564) 2021-11-03 15:16:55 +01:00
Adriane Boyd
e06bbf72a4
Fix tok2vec-less textcat generation in website quickstart (#9610) 2021-11-03 15:11:07 +01:00
Adriane Boyd
db0d8c56d0
Add test for Language.pipe as_tuples with custom error handlers (#9608)
* make nlp.pipe() return None docs when no exceptions are (re-)raised during error handling

* Remove changes other than as_tuples test

* Only check warning count for one process

* Fix types

* Format

Co-authored-by: Xi Bai <xi.bai.ed@gmail.com>
2021-11-03 10:57:34 +01:00
Adriane Boyd
79cea03983
Update website model display (#9589)
* Remove vectors from core trf model descriptions

* Update accuracy labels and exclude morph_acc for ja
2021-11-03 09:56:00 +01:00
Paul O'Leary McCann
e43639b27a
Add note about round-trip serializing pipeline to API docs (#9583) 2021-11-03 09:55:30 +01:00
Adriane Boyd
6eee024ff6
Pickle Doc._context (#9603) 2021-11-03 09:14:29 +01:00
Adriane Boyd
61daac54e4
Serialize _context separately in multiprocessing pipe (#9597)
* Serialize _context with Doc

* Revert "Serialize _context with Doc"

This reverts commit 161f1fac91.

* Serialize Doc._context separately for multiprocessing pipe
2021-11-03 07:51:53 +01:00
Adriane Boyd
5a979137a7
Set as_tuples on Doc during processing (#9592)
* Set as_tuples on Doc during processing

* Fix types

* Format
2021-11-02 15:08:22 +01:00
Adriane Boyd
c155f333bb Revert "Temporarily use v3.1.0 models in CI"
This reverts commit bd6433bbab.
2021-11-02 14:25:05 +01:00
Adriane Boyd
53a3523910 Revert "Temporarily ignore W095 in assemble CLI CI test (#9460)"
This reverts commit 8db574e0b5.
2021-11-02 14:24:54 +01:00
Adriane Boyd
4d5db737e9 Revert "Temporarily skip compat tests (#9594)"
This reverts commit 667572adca.
2021-11-02 14:24:06 +01:00
Adriane Boyd
667572adca
Temporarily skip compat tests (#9594) 2021-11-02 14:10:48 +01:00
Lj Miranda
f1bc655a38
Add initial Tagalog (tl) tests (#9582)
* Add tl_tokenizer to test fixtures

* Add tagalog tests
2021-11-02 08:35:49 +01:00
xxyzz
90ec820f05
Add WordDumb to spaCy Universe (#9572)
* Add WordDumb to spaCy Universe

* Add standalone category

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-11-01 18:38:41 +09:00
Bruce W. Lee (이웅성)
a4dcb68cf6
Adding LingFeat Software to spaCy Universe. (#9574)
* add lingfeat in universe

* add lingfeat in universe

* Fix JSON

* Minor cleanup

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-11-01 18:38:14 +09:00
Vasundhara
5279c7c4ba
Fix broken link to mappings-exceptions (#9573) 2021-10-31 13:44:29 +09:00
Adriane Boyd
bb26550e22
Fix StaticVectors after floret+mypy merge (#9566) 2021-10-29 16:25:43 +02:00
Adriane Boyd
322635e371
Set version to v3.2.0 (#9565) 2021-10-29 15:22:40 +02:00
Adriane Boyd
5e9db156c2
Merge pull request #9563 from adrianeboyd/chore/update-develop-from-master-v3.2-3
Update develop from master for v3.2
2021-10-29 14:08:14 +02:00
Adriane Boyd
2d430958e1 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3 2021-10-29 12:18:15 +02:00
Paul O'Leary McCann
006df1ae1f
Clarify error when words are of wrong type (#9541)
* Clarify error when words are of wrong type

See #9437

* Update docs

* Use try/except

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-29 12:08:40 +02:00
Paul O'Leary McCann
2fd8d616e7
Add docs section for spacy.cli.train.train (#9545)
* Add section for spacy.cli.train.train

* Add link from training page to train function

* Ensure path in train helper

* Update docs

Co-authored-by: Ines Montani <ines@ines.io>
2021-10-29 10:36:34 +02:00
Adriane Boyd
5477453ea3
Docs for thinc-apple-ops (#9549)
* Docs for thinc-apple-ops

* Ignore thinc-apple-ops in reqs tests

* Fix install quickstart

* Add cupy cuda 113, 114 extras

* Remove draft section

Co-authored-by: Ines Montani <ines@ines.io>
2021-10-29 10:35:31 +02:00
Adriane Boyd
12974bf4d9
Add micro PRF for morph scoring (#9546)
* Add micro PRF for morph scoring

For pipelines where morph features are added by more than one component
and a reference training corpus may not contain all features, a micro
PRF score is more flexible than a simple accuracy score. An example is
the reading and inflection features added by the Japanese tokenizer.

* Use `morph_micro_f` as the default morph score for Japanese
morphologizers.

* Update docstring

* Fix typo in docstring

* Update Scorer API docs

* Fix results type

* Organize score list by attribute prefix
2021-10-29 10:29:29 +02:00
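For intuition, a minimal sketch of the micro-averaged PRF computation described above, with made-up pooled counts; the printed names mirror the `morph_micro_*` score keys mentioned in the entry:

```python
# Micro-averaging pools true positives, false positives and false negatives
# across all morph features before computing precision/recall/F, so features
# missing from the reference corpus reduce the score gradually instead of
# zeroing out a strict per-token accuracy.
tp, fp, fn = 90, 10, 20  # hypothetical pooled counts

precision = tp / (tp + fp)                               # 0.900
recall = tp / (tp + fn)                                  # ~0.818
f_score = 2 * precision * recall / (precision + recall)  # ~0.857
print(f"morph_micro_p={precision:.3f} morph_micro_r={recall:.3f} morph_micro_f={f_score:.3f}")
```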
Philip Vollet
76173b0866
fixed typo and URL (#9560) 2021-10-29 13:57:44 +09:00
Adriane Boyd
72dc63b3fb
Update for python 3.10 (#9519)
* Update for python 3.10

* Update mac image

* Update build constraints for python 3.10

* Add extras for cupy cuda 11.3-11.5

* Remove cupy-cuda115 extra

* Require thinc>=8.0.12

* Switch CI to windows-2019

* Skip mypy for python 3.10
2021-10-28 15:32:06 +02:00
Adriane Boyd
554fa414ec
Require spacy-transformers v1.1 in transformers extra (#9557)
So that the install/upgrade quickstart also upgrades
`spacy-transformers` with `pip install spacy[transformers]`, require
`spacy-transformers>=1.1.2` in the `transformers` extra.
2021-10-28 11:18:19 +02:00
Adriane Boyd
c053f158c5
Add support for floret vectors (#8909)
* Add support for fasttext-bloom hash-only vectors

Overview:

* Extend `Vectors` to have two modes: `default` and `ngram`
  * `default` is the default mode and equivalent to the current
    `Vectors`
  * `ngram` supports the hash-only ngram tables from `fasttext-bloom`
* Extend `spacy.StaticVectors.v2` to handle both modes with no changes
  for `default` vectors
* Extend `spacy init vectors` to support ngram tables

The `ngram` mode **only** supports vector tables produced by this
fork of fastText, which adds an option to represent all vectors using
only the ngram buckets table and which uses the exact same ngram
generation algorithm and hash function (`MurmurHash3_x64_128`).
`fasttext-bloom` produces an additional `.hashvec` table, which can be
loaded by `spacy init vectors --fasttext-bloom-vectors`.

https://github.com/adrianeboyd/fastText/tree/feature/bloom

Implementation details:

* `Vectors` now includes the `StringStore` as `Vectors.strings` so that
  the API can stay consistent for both `default` (which can look up from
  `str` or `int`) and `ngram` (which requires `str` to calculate the
  ngrams).

* In ngram mode `Vectors` uses a default `Vectors` object as a cache
  since the ngram vectors lookups are relatively expensive.

  * The default cache size is the same size as the provided ngram vector
    table.

  * Once the cache is full, no more entries are added. The user is
    responsible for managing the cache in cases where the initial
    documents are not representative of the texts.

  * The cache can be resized by setting `Vectors.ngram_cache_size` or
    cleared with `vectors._ngram_cache.clear()`.

* The API ends up a bit split between methods for `default` and for
  `ngram`, so functions that only make sense for `default` or `ngram`
  include warnings with custom messages suggesting alternatives where
  possible.

* `Vocab.vectors` becomes a property so that the string stores can be
  synced when assigning vectors to a vocab.

* `Vectors` serializes its own config settings as `vectors.cfg`.

* The `Vectors` serialization methods have added support for `exclude`
  so that the `Vocab` can exclude the `Vectors` strings while serializing.

Removed:

* The `minn` and `maxn` options and related code from
  `Vocab.get_vector`, which does not work in a meaningful way for default
  vector tables.

* The unused `GlobalRegistry` in `Vectors`.

* Refactor to use reduce_mean

Refactor to use reduce_mean and remove the ngram vectors cache.

* Rename to floret

* Rename to floret in error messages

* Use --vectors-mode in CLI, vector init

* Fix vectors mode in init

* Remove unused var

* Minor API and docstrings adjustments

* Rename `--vectors-mode` to `--mode` in `init vectors` CLI
* Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support
  both modes.
* Minor updates to Vectors docstrings.

* Update API docs for Vectors and init vectors CLI

* Update types for StaticVectors
2021-10-27 14:08:31 +02:00
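A minimal usage sketch of the floret mode described above; the package name is hypothetical, and this assumes a pipeline whose vectors were created with `--mode floret`:

```python
import spacy

# Hypothetical pipeline packaged with floret (ngram) vectors. In this mode
# a vector is composed from hashed character-ngram buckets, so words that
# never appeared in the vector training data still get usable vectors.
nlp = spacy.load("xx_pipeline_floret")

doc = nlp("floretization")  # unlikely to be in any word list
print(doc[0].vector.shape)  # a real vector rather than all zeros
```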
Adriane Boyd
0c97ed2746
Rename ja morph features to Inflection and Reading (#9520)
* Rename ja morph features to Inflection and Reading
2021-10-27 13:13:03 +02:00
Adriane Boyd
2ea9b58006
Ignore prefix in suffix matches (#9155)
* Ignore prefix in suffix matches

Ignore the currently matched prefix when looking for suffix matches in
the tokenizer. Otherwise a lookbehind in the suffix pattern may match
incorrectly due to the presence of the prefix in the token string.

* Move °[cfkCFK]. to a tokenizer exception

* Adjust exceptions for same tokenization as v3.1

* Also update test accordingly

* Continue to split . after °CFK if ° is not a prefix

* Exclude new ° exceptions for pl

* Switch back to default tokenization of "° C ."

* Revert "Exclude new ° exceptions for pl"

This reverts commit 952013a5b4.

* Add exceptions for °C for hu
2021-10-27 13:02:25 +02:00
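To illustrate the bug class fixed above, a minimal sketch with plain `re` (not spaCy's actual patterns): a suffix rule with a lookbehind can match only because the already-split prefix is still present in the string being searched:

```python
import re

# Suffix rule: split a trailing "." only after °C/°F/°K.
suffix = re.compile(r"(?<=°[CFK])\.$")

token = "°C."  # "°" would be split off as a prefix
print(bool(suffix.search(token)))      # True: the lookbehind still sees "°"
print(bool(suffix.search(token[1:])))  # False: with the prefix stripped, "C." no longer matches
```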
Adriane Boyd
4170110ce7
Merge pull request #9540 from adrianeboyd/chore/update-develop-from-master-v3.2-1
Update develop from master for v3.2
2021-10-27 08:23:57 +02:00
Adriane Boyd
386dcada1c
Address random results in slow readers tests (#9544)
* Set random seed for dataset shuffling
* Use more dev examples for non-zero scores
2021-10-26 16:53:10 +02:00
Adriane Boyd
a803af9dfa Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
Elia Robyn Lake (Robyn Speer)
fa70837f28
clarify how to connect pretraining to training (#9450)
* clarify how to connect pretraining to training

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-10-22 13:15:47 +02:00
github-actions[bot]
b0b115ff39
Auto-format code with black (#9530)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-10-22 13:03:10 +02:00
Sofie Van Landeghem
c9f28b6d08
Merge branch 'spacy.io' into master 2021-10-21 20:46:33 +02:00
Sofie Van Landeghem
c7ed631f3c
bump version to 3.1.4 (#9524) 2021-10-21 20:34:57 +02:00
Daniël de Kok
f31ac6fd4f
Print a warning when multiprocessing is used on a GPU (#9475)
* Raise an error when multiprocessing is used on a GPU

As reported in #5507, a confusing exception is thrown when
multiprocessing is used with a GPU model and the `fork` multiprocessing
start method:

cupy.cuda.runtime.CUDARuntimeError: cudaErrorInitializationError: initialization error

This change checks whether one of the models uses the GPU when
multiprocessing is used. If so, raise a friendly error message.

Even though multiprocessing can work on a GPU with the `spawn` method,
it quickly runs the GPU out of memory on real-world data. Also,
multiprocessing on a single GPU typically does not provide large
performance gains.

* Move GPU multiprocessing check to Language.pipe

* Warn rather than error when using multiprocessing with GPU models

* Improve GPU multiprocessing warning message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Reduce API assumptions

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/language.py

* Update spacy/language.py

* Test that warning is thrown with GPU + multiprocessing

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-10-21 16:14:23 +02:00
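A minimal sketch of the setup that now triggers the warning, assuming a CUDA-capable machine and an installed `en_core_web_sm`:

```python
import spacy

spacy.require_gpu()                 # allocate models on the GPU
nlp = spacy.load("en_core_web_sm")

texts = ["One document.", "Another document."]
# A GPU-allocated pipeline combined with n_process > 1 is the case that
# used to fail confusingly with fork and now emits a friendly warning.
docs = list(nlp.pipe(texts, n_process=2))
```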
Sofie Van Landeghem
5a38f79f18
Custom component types in spacy.ty (#9469)
* add custom protocols in spacy.ty

* add a test for the new types in spacy.ty

* import Example when type checking

* some type fixes

* put Protocol in compat

* revert update check back to hasattr

* runtime_checkable in compat as well
2021-10-21 15:31:06 +02:00
Daniël de Kok
d0631e3005
Replace use_ops("numpy") by use_ops("cpu") in the parser (#9501)
* Replace use_ops("numpy") by use_ops("cpu") in the parser

This ensures that the best available CPU implementation is chosen
(e.g. Thinc Apple Ops on macOS).

* Run spaCy tests with apple-thinc-ops on macOS
2021-10-21 11:22:45 +02:00
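For reference, a minimal sketch of the thinc API in question; which backend you actually get depends on what is installed:

```python
from thinc.api import get_current_ops, use_ops

# use_ops("cpu") selects the best available CPU implementation instead of
# forcing the plain numpy backend.
with use_ops("cpu"):
    ops = get_current_ops()
    print(type(ops).__name__)  # e.g. NumpyOps, or AppleOps with thinc-apple-ops
```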
Paul O'Leary McCann
28ecf399da
Remove some old version refs in the docs (#9448)
* Remove some old version refs in the docs

* Remove warning

* Update spacy/matcher/matcher.pyx

* Remove all references to the punctuation warning

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-10-21 11:17:59 +02:00
Duygu Altinok
1ee4d6ef49 Corrected broken (#9505) 2021-10-20 18:07:28 +02:00
Philip Vollet
a31a4bb7bd Add projects to spaCy Universe (#9269)
* Added spaCy Universe projects

* Added user license agreement Philip Vollet

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-10-20 18:07:07 +02:00
Duygu Altinok
7b98aa4c16
Corrected broken (#9505) 2021-10-20 17:31:59 +02:00
Edward
014da12f1d
Dont add tok2vec when efficiency textcat (#9502) 2021-10-20 17:30:19 +02:00
Ryn Daniels
ddc1bf5b8b
Merge pull request #9518 from explosion/rfd-robot-slowtests
Enable the test_slow command for explosionbot
2021-10-20 12:44:20 +02:00
Daniël de Kok
1f05f56433
Add the spacy.models_with_nvtx_range.v1 callback (#9124)
* Add the spacy.models_with_nvtx_range.v1 callback

This callback recursively adds NVTX ranges to the Models in each pipe in
a pipeline.

* Fix create_models_with_nvtx_range type signature

* NVTX range: wrap models of all trainable pipes jointly

This prevents (sub-)models that are shared between pipes from being
wrapped twice.

* NVTX range callback: make color configurable

Add forward_color and backprop_color options to set the color for the
NVTX range.

* Move create_models_with_nvtx_range to spacy.ml

* Update create_models_with_nvtx_range for thinc changes

with_nvtx_range now updates an existing node, rather than returning a
wrapper node. So, we can simply walk over the nodes and update them.

* NVTX: use after_pipeline_creation in example
2021-10-20 11:59:48 +02:00
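A sketch of wiring the callback into a training config; the section layout follows spaCy's usual callback conventions, and the color values are assumptions:

```ini
[nlp.after_pipeline_creation]
@callbacks = "spacy.models_with_nvtx_range.v1"
forward_color = -1
backprop_color = -1
```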
Ryn Daniels
66b474ce05
Merge branch 'master' into rfd-robot-slowtests 2021-10-20 11:56:01 +02:00
Ryn Daniels
393e187f2c Enable the test_slow command for explosionbot 2021-10-20 11:20:57 +02:00
Ines Montani
5facdb031c
Merge pull request #9506 from explosion/tests/conftest-options 2021-10-20 10:33:43 +02:00
Sofie Van Landeghem
b758270654
bump thinc to 8.0.11 (#9516) 2021-10-20 10:32:09 +02:00
Adriane Boyd
3f181b73d0
Add ja_core_news_trf to website (#9515) 2021-10-20 10:18:02 +02:00
Paul O'Leary McCann
ef4d4f793b Clarify how to change base Transformer model (#9498)
* Add note about how the model name is used

* Add link to TransformersModel docs, separate paragraph

* Local link

* Revise docs

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-10-19 23:32:14 +02:00
Paul O'Leary McCann
222cf9b6d2
Clarify how to change base Transformer model (#9498)
* Add note about how the model name is used

* Add link to TransformersModel docs, separate paragraph

* Local link

* Revise docs

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-10-19 23:28:20 +02:00
Ines Montani
ad9f57cbbf Allow conftest.py to run twice for build envs 2021-10-19 15:13:25 +02:00
Sofie Van Landeghem
da578c3d3b
Fix kb.set_entities (#9463)
* avoid creating _vectors_table when also using c_add_vector

* write to self._vectors_table directly in set_entities
2021-10-19 09:39:17 +02:00
Lj Miranda
068cae7755
Include .pyi files in MANIFEST.in (#9500) 2021-10-19 09:05:37 +02:00
Lj Miranda
2bcd383685
Replace previous lock threads with GH action (#9499) 2021-10-19 09:03:59 +02:00
Adriane Boyd
e66fddf934 Minor updates to spacy-transformers docs for v1.1.0 (#9496) 2021-10-18 14:55:22 +02:00
Adriane Boyd
a6424bcea9
Minor updates to spacy-transformers docs for v1.1.0 (#9496) 2021-10-18 14:55:02 +02:00
Adriane Boyd
404aff08e3 Update docs for spacy-transformers v1.1 data classes (#9361) 2021-10-18 14:17:48 +02:00
Sofie Van Landeghem
eaa6798c66 Docs for new spacy-trf architectures (#8954)
* use TransformerModel.v2 in quickstart

* update docs for new transformer architectures

* bump spacy_transformers to 1.1.0

* Add new arguments spacy-transformers.TransformerModel.v3

* Mention that mixed-precision support is experimental

* Describe delta transformers.Tok2VecTransformer versions

* add dot

* add dot, again

* Update some more TransformerModel references v2 -> v3

* Add mixed-precision options to the training quickstart

Disable mixed-precision training/prediction by default.

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/usage/embeddings-transformers.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-18 14:17:43 +02:00
Adriane Boyd
9b86209a4a
Update docs for spacy-transformers v1.1 data classes (#9361) 2021-10-18 14:16:58 +02:00
Sofie Van Landeghem
3fd3531e12
Docs for new spacy-trf architectures (#8954)
* use TransformerModel.v2 in quickstart

* update docs for new transformer architectures

* bump spacy_transformers to 1.1.0

* Add new arguments spacy-transformers.TransformerModel.v3

* Mention that mixed-precision support is experimental

* Describe delta transformers.Tok2VecTransformer versions

* add dot

* add dot, again

* Update some more TransformerModel references v2 -> v3

* Add mixed-precision options to the training quickstart

Disable mixed-precision training/prediction by default.

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/usage/embeddings-transformers.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-18 14:15:06 +02:00
Edward
a7cb8de0d7
Fix assertion error in staticvectors (#9481)
* Fix assertion error in staticvectors

* Update spacy/ml/staticvectors.py

* Update spacy/ml/staticvectors.py

Co-authored-by: Ines Montani <ines@ines.io>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Ines Montani <ines@ines.io>
2021-10-18 09:10:45 +02:00
Adriane Boyd
74ec37f7a8
Switch back to exclude pr paths (#9486) 2021-10-16 23:48:55 +02:00
Adriane Boyd
fca242b34e
Avoid initial wildcards in PR CI filter (#9477)
* Avoid initial wildcards in PR CI filter

* Adjust wildcard patterns
2021-10-15 17:32:00 +02:00
Adriane Boyd
271e8e7856
Skip compat table tests for prerelease versions (#9476) 2021-10-15 14:28:02 +02:00
github-actions[bot]
29e83f0819
Auto-format code with black (#9474)
* Auto-format code with black

* Update spacy/pipeline/pipe.pyi

Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-10-15 11:36:49 +02:00
Aviora
9a824255d3
Add examples and num_words for Vietnamese (#9412)
* add examples and num_words

* add contributor agreement

* Update spacy/lang/vi/examples.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* consistent format

add empty line at the end of file

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-10-14 19:15:51 +02:00
Adriane Boyd
b5143b1b84
Minor fixes to convert CLI (#9465)
* Provide default value for `msg`
* Compare paths correctly for file conversion
2021-10-14 18:37:34 +02:00
Connor Brinton
657af5f91f
🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167)
* 🚨 Ignore all existing Mypy errors

* 🏗 Add Mypy check to CI

* Add types-mock and types-requests as dev requirements

* Add additional type ignore directives

* Add types packages to dev-only list in reqs test

* Add types-dataclasses for python 3.6

* Add ignore to pretrain

* 🏷 Improve type annotation on `run_command` helper

The `run_command` helper previously declared that it returned an
`Optional[subprocess.CompletedProcess]`, but it isn't actually possible
for the function to return `None`. These changes modify the type
annotation of the `run_command` helper and remove all now-unnecessary
`# type: ignore` directives.

* 🔧 Allow variable type redefinition in limited contexts

These changes modify how Mypy is configured to allow variables to have
their type automatically redefined under certain conditions. The Mypy
documentation contains the following example:

```python
from typing import List

def process(items: List[str]) -> None:
    # 'items' has type List[str]
    items = [item.split() for item in items]
    # 'items' now has type List[List[str]]
    ...
```

This configuration change is especially helpful in reducing the number
of `# type: ignore` directives needed to handle the common pattern of:
* Accepting a filepath as a string
* Overwriting the variable using `filepath = ensure_path(filepath)`

These changes enable redefinition and remove all `# type: ignore`
directives rendered redundant by this change.

* 🏷 Add type annotation to converters mapping

* 🚨 Fix Mypy error in convert CLI argument verification

* 🏷 Improve type annotation on `resolve_dot_names` helper

* 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors`

* 🏷 Add type annotations for more `Vocab` attributes

* 🏷 Add loose type annotation for gold data compilation

* 🏷 Improve `_format_labels` type annotation

* 🏷 Fix `get_lang_class` type annotation

* 🏷 Loosen return type of `Language.evaluate`

* 🏷 Don't accept `Scorer` in `handle_scores_per_type`

* 🏷 Add `string_to_list` overloads

* 🏷 Fix non-Optional command-line options

* 🙈 Ignore redefinition of `wandb_logger` in `loggers.py`

*  Install `typing_extensions` in Python 3.8+

The `typing_extensions` package states that it should be used when
"writing code that must be compatible with multiple Python versions".
Since spaCy needs to support multiple Python versions, it should be used
when newer `typing` module members are required. One example of this is
`Literal`, which is available starting with Python 3.8.

Previously spaCy tried to import `Literal` from `typing`, falling back
to `typing_extensions` if the import failed. However, Mypy doesn't seem
to be able to understand what `Literal` means when the initial import
fails. Therefore, these changes modify how `compat` imports `Literal` by
always importing it from `typing_extensions`.

These changes also modify how `typing_extensions` is installed, so that
it is a requirement for all Python versions, including those greater
than or equal to 3.8.

* 🏷 Improve type annotation for `Language.pipe`

These changes add a missing overload variant to the type signature of
`Language.pipe`. Additionally, the type signature is enhanced to allow
type checkers to differentiate between the two overload variants based
on the `as_tuple` parameter.

Fixes #8772

*  Don't install `typing-extensions` in Python 3.8+

After more detailed analysis of how to implement Python version-specific
type annotations in spaCy, it has been determined that branching
on a comparison against `sys.version_info` can be statically analyzed by
Mypy well enough to enable us to conditionally use
`typing_extensions.Literal`. This means that we no longer need to
install `typing_extensions` for Python versions greater than or equal to
3.8! 🎉

These changes revert previous changes installing `typing-extensions`
regardless of Python version and modify how we import the `Literal` type
to ensure that Mypy treats it properly (a sketch of this import pattern
follows this entry).

* resolve mypy errors for Strict pydantic types

* refactor code to avoid missing return statement

* fix types of convert CLI command

* avoid list-set confusion in debug_data

* fix typo and formatting

* small fixes to avoid type ignores

* fix types in profile CLI command and make it more efficient

* type fixes in projects CLI

* put one ignore back

* type fixes for render

* fix render types - the sequel

* fix BaseDefault in language definitions

* fix type of noun_chunks iterator - yields tuple instead of span

* fix types in language-specific modules

* 🏷 Expand accepted inputs of `get_string_id`

`get_string_id` accepts either a string (in which case it returns its 
ID) or an ID (in which case it immediately returns the ID). These 
changes extend the type annotation of `get_string_id` to indicate that 
it can accept either strings or IDs.

* 🏷 Handle override types in `combine_score_weights`

The `combine_score_weights` function allows users to pass an `overrides` 
mapping to override data extracted from the `weights` argument. Since it 
allows `Optional` dictionary values, the return value may also include 
`Optional` dictionary values.

These changes update the type annotations for `combine_score_weights` to 
reflect this fact.

* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer`

* 🏷 Fix redefinition of `wandb_logger`

These changes fix the redefinition of `wandb_logger` by giving a 
separate name to each `WandbLogger` version. For 
backwards-compatibility, `spacy.train` still exports `wandb_logger_v3` 
as `wandb_logger` for now.

* more fixes for typing in language

* type fixes in model definitions

* 🏷 Annotate `_RandomWords.probs` as `NDArray`

* 🏷 Annotate `tok2vec` layers to help Mypy

* 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6

Also remove an import that I forgot to move to the top of the module 😅

* more fixes for matchers and other pipeline components

* quick fix for entity linker

* fixing types for spancat, textcat, etc

* bugfix for tok2vec

* type annotations for scorer

* add runtime_checkable for Protocol

* type and import fixes in tests

* mypy fixes for training utilities

* few fixes in util

* fix import

* 🐵 Remove unused `# type: ignore` directives

* 🏷 Annotate `Language._components`

* 🏷 Annotate `spacy.pipeline.Pipe`

* add doc as property to span.pyi

* small fixes and cleanup

* explicit type annotations instead of via comment

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2021-10-14 15:21:40 +02:00
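The `sys.version_info` branching described in the entry above, as a minimal standalone sketch:

```python
import sys

# Mypy can statically follow this comparison, so typing_extensions is only
# needed (and only imported) on Python < 3.8.
if sys.version_info >= (3, 8):
    from typing import Literal
else:
    from typing_extensions import Literal

Mode = Literal["strict", "lax"]  # example use of the imported type
```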
Adriane Boyd
631c170fa5
Adjust PR CI includes (#9464) 2021-10-14 13:46:09 +02:00
Adriane Boyd
8db574e0b5
Temporarily ignore W095 in assemble CLI CI test (#9460)
* Temporarily ignore W095 in assemble CLI CI test

* Adjust PR CI includes
2021-10-14 13:27:39 +02:00
Adriane Boyd
fe6d63aedc
Merge pull request #9459 from adrianeboyd/chore/v3.2.0.dev0
Set version to v3.2.0.dev0
2021-10-14 11:07:19 +02:00
Adriane Boyd
bd6433bbab Temporarily use v3.1.0 models in CI 2021-10-14 10:31:11 +02:00
Adriane Boyd
8a018f5207 Set version to v3.2.0.dev0 2021-10-14 10:31:11 +02:00
Adriane Boyd
d8e8ffdc5c
Merge pull request #9458 from adrianeboyd/chore/update-develop-from-master-v3.1-3
Update develop from master for v3.2.0
2021-10-14 10:30:33 +02:00
Adriane Boyd
d98d525bc8 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3 2021-10-14 09:41:46 +02:00
Edward
9dfb12e29f Update universe example codes (#9422)
* Update universe plugins

* Adjust azure trigger

* Add init to tests/universe

* deliberately trying to break the universe to see if the CI catches it

* revert

Co-authored-by: svlandeg <svlandeg@github.com>
2021-10-14 09:37:05 +02:00
Paul O'Leary McCann
a3b7519aba
Fix JA Morph Values (#9449)
* Don't set empty / weird values in morph

* Update tests to handle empty morph values

* Fix everything

* Replace potentially problematic characters

* Fix test
2021-10-14 09:21:36 +02:00
Ines Montani
c48564688f
Merge pull request #9423 from explosion/tests/issue-marker 2021-10-13 16:53:40 +02:00
Edward
72711dc2c9
Update universe example codes (#9422)
* Update universe plugins

* Adjust azure trigger

* Add init to tests/universe

* deliberately trying to break the universe to see if the CI catches it

* revert

Co-authored-by: svlandeg <svlandeg@github.com>
2021-10-13 16:29:19 +02:00
Jette16
78365452d3
Moved test for universe into .github folder (#9447)
* Moved universe-test into .github folder

* Cleaned code

* CHanged a file name
2021-10-13 14:13:06 +02:00
Sofie Van Landeghem
d2645b2e03
Fix test for spancat (#9446)
* fix test for spancat

* increase tolerance for almost equal checks

* Update spacy/tests/test_models.py

* Update spacy/tests/test_models.py
2021-10-13 10:48:35 +02:00
Sofie Van Landeghem
2e3d6b8b5a
Fix test for spancat (#9446)
* fix test for spancat

* increase tolerance for almost equal checks

* Update spacy/tests/test_models.py

* Update spacy/tests/test_models.py
2021-10-13 10:47:56 +02:00
Sofie Van Landeghem
5e8e8525f0
fix W108 filter (#9438)
* remove text argument from W108 to enable 'once' filtering

* include the option of partial POS annotation

* fix typo

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-12 19:56:44 +02:00
Lj Miranda
6425b9a1c4
Include JsonlCorpus from the imports (#9431) 2021-10-12 15:39:14 +02:00
Ryn Daniels
bb6623bb2d
Merge pull request #9426 from explosion/rfd-bot-config
Add allowed_teams to the explosion-bot config
2021-10-12 09:50:39 +02:00
Ryn Daniels
2fb420ec23 Add allowed_teams to the explosion-bot config 2021-10-11 18:20:48 +02:00
Ryn Daniels
f64e39fa49
Install explosionbot as a github action (#9420) 2021-10-11 15:43:27 +02:00
Paul O'Leary McCann
efe5beefe0
Add test for case where parser overwrites annotations (#9406)
* Add test for case where parser overwrites annotations

* Move test to its own file

Also add note about how other tokens modify results.

* Fix xfail decorator
2021-10-11 14:57:45 +02:00
Ines Montani
1fa7c4e73b Support issue marker via pytest 2021-10-11 13:56:24 +02:00
Paul O'Leary McCann
3b429619a8 Fix UD POS docs links (fix #9013) (#9407)
* Fix UD POS docs links (fix #9013)

The previous link seems to have been for UD v1.

* Fix link
2021-10-11 11:51:59 +02:00
Paul O'Leary McCann
b53e39455e
Fix UD POS docs links (fix #9013) (#9407)
* Fix UD POS docs links (fix #9013)

The previous link seems to have been for UD v1.

* Fix link
2021-10-11 11:51:19 +02:00
Paul O'Leary McCann
fd759a881b
Fix inconsistent lemmas (#9405)
* Add util function to unique lists and preserve order

* Use unique function instead of list(set())

list(set()) has the issue that it's not consistent between runs of the
Python interpreter, so order can vary.

list(set()) calls were left in a few places where they were behind calls
to sorted(). I think in this case the calls to list() can be removed,
but this commit doesn't do that.

* Use the existing pattern for this
2021-10-11 11:38:45 +02:00
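A minimal sketch of an order-preserving unique helper of the kind described above (the actual helper in `spacy.util` may differ in details):

```python
def unique(items):
    # Unlike list(set(items)), output order is the first-seen order and is
    # therefore stable across interpreter runs.
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


print(unique(["run", "ran", "run", "running"]))  # ['run', 'ran', 'running']
```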
Adriane Boyd
fd91e6a33c Fix types descriptions of sm and sent models (#9401) 2021-10-11 11:18:10 +02:00
Adriane Boyd
fd7edbc645
Fix types descriptions of sm and sent models (#9401) 2021-10-11 11:17:18 +02:00
Adriane Boyd
bbe4d3300a Remove traces of lexemes from vocab serialization (#9400) 2021-10-11 11:15:51 +02:00
Sofie Van Landeghem
a6ac36bcb3 Doc fixes in convert API (#9350)
* add more info on the spacy debug command

* formatting
2021-10-11 11:15:20 +02:00
Adriane Boyd
a5231cb044
Remove traces of lexemes from vocab serialization (#9400) 2021-10-11 11:13:35 +02:00
Jette16
3b144a3a51 Add universe test (#9278)
* Added test for universe.json

* Added contributor agreement

* Ran black on test_universe_json.py
2021-10-11 11:08:46 +02:00
Ines Montani
5003a9c3c7
Move core training logic in CLI into standalone function (#9398) 2021-10-11 10:56:14 +02:00
Adriane Boyd
ae1b3e960b
Update overwrite and scorer in API docs (#9384)
* Update overwrite and scorer in API docs

* Rephrase morphologizer extend + example
2021-10-11 10:35:07 +02:00
Paul O'Leary McCann
2a7e327310
Fix Dependency Matcher Ordering Issue (#9337)
* Fix inconsistency

This makes the failing test pass, so that behavior is consistent whether
patterns are added in one call or two.

The issue is that the hash for patterns depended on the index of the
pattern in the list of current patterns, not the list of total patterns,
so a second call would get identical match ids.

* Add illustrative test case

* Add failing test for remove case

Patterns are not removed from the internal matcher on calls to remove,
which causes spurious weird matches (or misses).

* Fix removal issue

Remove patterns from the internal matcher.

* Check that the single add call also gets no matches
2021-10-11 10:26:13 +02:00
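A sketch of the guarantee the fix provides; the patterns are toy examples, and this assumes `en_core_web_sm` is installed:

```python
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")

pattern1 = [{"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}}]
pattern2 = [{"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}}]

# Adding patterns in one call or spread over two calls should now yield the
# same matches; previously the second call could reuse identical match ids.
one_call = DependencyMatcher(nlp.vocab)
one_call.add("PATTERNS", [pattern1, pattern2])

two_calls = DependencyMatcher(nlp.vocab)
two_calls.add("P1", [pattern1])
two_calls.add("P2", [pattern2])

doc = nlp("The cat chased the dog.")
assert len(one_call(doc)) == len(two_calls(doc))
```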
Paul O'Leary McCann
5dbe4e8392 Update new issue config with Python 3.10 info
Also adds note that Install issues go to Discussions.
2021-10-11 15:41:32 +09:00
Paul O'Leary McCann
48ba4e60f4
Add new style citation file (#9388) 2021-10-07 17:47:39 +02:00
Paul O'Leary McCann
113d53ab6c
Fix tests for changes to inflection structure (#9390) 2021-10-07 13:42:18 +02:00
Paul O'Leary McCann
c4e3b7a5db Change JA inflection separator to semicolon
Hyphen is unsuitable because of interactions with the JA data fields,
but pipe is also unsuitable because it has a different meaning in UD
data, so it's better to use something that has no significance in either
case. So this uses a semicolon.
2021-10-07 17:28:15 +09:00
Paul O'Leary McCann
227f98081b Use a pipe for separating Japanese inflections
Inflection values are pipe-separated and look like this:

    五段-ラ行|連用形-促音便

So using a hyphen erases the original fields.
2021-10-07 17:14:05 +09:00
Paul O'Leary McCann
f975690cc9 Use hyphen to join parts of inflection in JA tokenizer 2021-10-07 17:09:38 +09:00
Sofie Van Landeghem
f87ae3cb7d
Doc fixes in convert API (#9350)
* add more info on the spacy debug command

* formatting
2021-10-06 13:13:18 +09:00
Elia Robyn Lake (Robyn Speer)
53b5f245ed
Allow IETF language codes, aliases, and close matches (#9342)
* use language-matching to allow language code aliases

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* link to "IETF language tags" in docs

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* Make requirements consistent

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* change "two-letter language ID" to "IETF language tag" in language docs

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* use langcodes 3.2 and handle language-tag errors better

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* all unknown language codes are ImportErrors

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
2021-10-05 09:52:22 +02:00
Adriane Boyd
4192e71599
Sync vocab in vectors and components sourced in configs (#9335)
Since a component may reference anything in the vocab, share the full
vocab when loading source components and vectors (which will include
`strings` as of #8909).

When loading a source component from a config, save and restore the
vocab state after loading source pipelines, in particular to preserve
the original state without vectors, since `[initialize.vectors]
= null` skips rather than resets the vectors.

The vocab references are not synced for components loaded with
`Language.add_pipe(source=)` because the pipelines are already loaded
and not necessarily with the same vocab. A warning could be added in
`Language.create_pipe_from_source` that it may be necessary to save and
reload before training, but it's a rare enough case that this kind of
warning may be too noisy overall.
2021-10-04 12:19:02 +02:00
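A config sketch of the scenario described above; the package name is an example:

```ini
# A component sourced from an installed pipeline; the shared vocab
# (including strings) is now synced when this is loaded from the config.
[components.ner]
source = "en_core_web_sm"

# Skip vectors entirely rather than resetting them.
[initialize]
vectors = null
```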
Paul O'Leary McCann
1ee6541ab0
Moving Japanese tokenizer extra info to Token.morph (#8977)
* Use morph for extra Japanese tokenizer info

Previously Japanese tokenizer info that didn't correspond to Token
fields was put in user data. Since spaCy core should avoid touching user
data, this moves most information to the Token.morph attribute. It also
adds the normalized form, which wasn't exposed before.

The subtokens, which are a list of full tokens, are still added to user
data, except with the default tokenizer granularity. With the default
tokenizer settings the subtokens are all None, so in this case the user
data is simply not set.

* Update tests

Also adds a new test for norm data.

* Update docs

* Add Japanese morphologizer factory

Set the default to `extend=True` so that the morphologizer does not
clobber the values set by the tokenizer.

* Use the norm_ field for normalized forms

Before this commit, normalized forms were put in the "norm" field in the
morph attributes. I am not sure why I did that instead of using the
token morph; I think I just forgot about it.

* Skip test if sudachipy is not installed

* Fix import

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-01 19:19:26 +02:00
Paul O'Leary McCann
8f2409e514
Don't serialize user data in DocBin if not saving it (fix #9190) (#9226)
* Don't store user data if told not to (fix #9190)

* Add unit tests for the store_user_data setting
2021-10-01 12:37:39 +02:00
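A minimal sketch of the setting under test, with a made-up user data entry:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp("hello world")
doc.user_data["source"] = "example.txt"  # arbitrary user data

lean = DocBin(store_user_data=False)  # the default: user data is dropped
full = DocBin(store_user_data=True)   # user data is serialized
lean.add(doc)
full.add(doc)
print(len(lean.to_bytes()) < len(full.to_bytes()))  # True
```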
Paul O'Leary McCann
23badbd55c Updating Troubleshooting Docs (#9329)
* Add link to Discussions FAQ

* Remove old FAQ entries

I think these are no longer relevant.

- no-cache-dir: affected pip versions are *very* old now
- narrow unicode: not an issue from py3.3+
- utf-8 osx: upstream bug closed in 2019

Some of the other issues are also probably no longer frequent.
2021-10-01 12:31:41 +02:00
Paul O'Leary McCann
6e833b617a
Updating Troubleshooting Docs (#9329)
* Add link to Discussions FAQ

* Remove old FAQ entries

I think these are no longer relevant.

- no-cache-dir: affected pip versions are *very* old now
- narrow unicode: not an issue from py3.3+
- utf-8 osx: upstream bug closed in 2019

Some of the other issues are also probably no longer frequent.
2021-10-01 12:28:22 +02:00
github-actions[bot]
42a76c758f
Auto-format code with black (#9346)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-10-01 11:17:11 +02:00
Adriane Boyd
b3192ddea3
Sync thinc install dep in setup, fix test packaging (#9336)
* Sync thinc install dep in setup

* Add __init__.py to include package tests in package

* Include *.toml in package
2021-09-30 19:02:10 +02:00
Adriane Boyd
03fefa37e2
Add overwrite settings for more components (#9050)
* Add overwrite settings for more components

For pipeline components where it's relevant and not already implemented,
add an explicit `overwrite` setting that controls whether
`set_annotations` overwrites existing annotation.

For the `morphologizer`, add an additional setting `extend`, which
controls whether the existing features are preserved.

* +overwrite, +extend: overwrite values of existing features, add any new
features
* +overwrite, -extend: overwrite completely, removing any existing
features
* -overwrite, +extend: keep values of existing features, add any new
features
* -overwrite, -extend: do not modify the existing value if set

In all cases an unset value will be set by `set_annotations`.

Preserve current overwrite defaults:

* True: morphologizer, entity linker
* False: tagger, sentencizer, senter

* Add backwards compat overwrite settings

* Put empty line back

Removed by accident in last commit

* Set backwards-compatible defaults in __init__

Because the `TrainablePipe` serialization methods update `cfg`, there's
no straightforward way to detect whether models serialized with a
previous version are missing the overwrite settings.

It would be possible in the sentencizer due to its separate
serialization methods; however, to keep the changes parallel, this also
sets the default in `__init__`.

* Remove traces

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-09-30 15:35:55 +02:00
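A usage sketch of the new settings; the components and values are examples, and per the entry above the shipped defaults differ by component:

```python
import spacy

nlp = spacy.blank("en")

# -overwrite, +extend: keep values of existing morph features, add new ones.
nlp.add_pipe("morphologizer", config={"overwrite": False, "extend": True})

# -overwrite: never replace sentence boundaries that are already set.
nlp.add_pipe("senter", config={"overwrite": False})
```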
Jim O’Regan
8fe525beb5
Add an Irish lemmatiser, based on BuNaMo (#9102)
* add tréis/théis

* remove previous contents, add demutate/unponc

* fmt off/on wrapping

* type hints

* IrishLemmatizer (sic)

* Use spacy-lookups-data>=1.0.3

* Minor bug fixes, refactoring for IrishLemmatizer

* Fix return type for ADP list lookups
* Fix and refactor lookup table lookups for missing/string/list
* Remove unused variables

* skip lookup of verbal substantives and adjectives; just demutate

* Fix morph checks API details

* Add types and format

* Move helper methods into lemmatizer

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-09-30 14:18:47 +02:00
Paul O'Leary McCann
0508795d67 Fix invalid json 2021-09-30 15:24:47 +09:00
Paul O'Leary McCann
78a88f7de7 Fix invalid json 2021-09-30 15:23:55 +09:00
Martin Vallone
f15bb40941 Adding PhruzzMatcher to spaCy universe (#9321)
* Adding PhruzzMatcher to spaCy universe

* Fixes to make the package work properly
2021-09-30 14:26:40 +09:00
Martin Vallone
a14ab7e882
Adding PhruzzMatcher to spaCy universe (#9321)
* Adding PhruzzMatcher to spaCy universe

* Fixes to make the package work properly
2021-09-30 13:46:53 +09:00
Elia Robyn Lake (Robyn Speer)
5b0b0ca809
Move WandB loggers into spacy-loggers (#9223)
* factor out the WandB logger into spacy-loggers

Signed-off-by: Elia Robyn Speer <gh@arborelia.net>

* depend on spacy-loggers so they are available

Signed-off-by: Elia Robyn Speer <gh@arborelia.net>

* remove docs of spacy.WandbLogger.v2 (moved to spacy-loggers)

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* Version number suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update references to WandbLogger

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* make order of deps more consistent

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-09-29 11:12:50 +02:00
Adriane Boyd
e750c1760c
Restore tokenization timing in Language.evaluate (#9305)
Restore tokenization timing steps that were accidentally removed in #6765.
2021-09-27 20:44:14 +02:00
Sofie Van Landeghem
a361df00cd
Raise E983 early on in docbin init (#9247)
* raise E983 early on in docbin init

* catch situation before error is raised

* add more info on the spacy debug command
2021-09-27 20:43:03 +02:00
Adriane Boyd
effae12cbd
Update slow readers test to use textcat_multilabel (#9300) 2021-09-27 20:04:02 +02:00
Adriane Boyd
fe5f5d6ac6
Update Catalan tokenizer (#9297)
* Update Makefile

For more recent python version

* updated for bsc changes

New tokenization changes

* Update test_text.py

* updating tests and requirements

* changed failed test in test/lang/ca

changed failed test in test/lang/ca

* Update .gitignore

deleted stashed changes line

* back to python 3.6 and remove transformer requirements

As per request

* Update test_exception.py

Change the test

* Update test_exception.py

Remove test print

* Update Makefile

For more recent python version

* updated for bsc changes

New tokenization changes

* updating tests and requirements

* Update requirements.txt

Removed spacy-transformers from requirements

* Update test_exception.py

Added final punctuation to ensure consistency

* Update Makefile

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Format

* Update test to check all tokens

Co-authored-by: cayorodriguez <crodriguezp@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-09-27 14:42:30 +02:00
Adriane Boyd
200121a035
Merge pull request #9296 from adrianeboyd/chore/update-develop-from-master-v3.1-2
Update develop from master
2021-09-27 11:19:00 +02:00
Adriane Boyd
12ab49342c Sync requirements in setup.cfg 2021-09-27 09:16:31 +02:00
Adriane Boyd
03f234b739 Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
github-actions[bot]
4da2af4e0e
Auto-format code with black (#9284)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-09-24 10:46:43 +02:00
Jette16
5eced281d8
Add universe test (#9278)
* Added test for universe.json

* Added contributor agreement

* Ran black on test_universe_json.py
2021-09-23 14:31:42 +02:00
Ines Montani
6bb0324b81 Adjust kb_id visualizer templating and docs 2021-09-23 11:59:02 +02:00
Ines Montani
beb4a8c524
Merge pull request #9199 from shigapov/master (resolves #9129) 2021-09-23 19:41:53 +10:00
Philip Vollet
d2adfe1efa
Add projects to spaCy Universe (#9269)
* Added spaCy Universe projects

* Added user license agreement Philip Vollet

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-09-23 10:56:45 +02:00
Ines Montani
57b5fc1995
Apply suggestions from code review
Co-authored-by: Renat Shigapov <57352291+shigapov@users.noreply.github.com>
2021-09-23 17:58:32 +10:00
Sofie Van Landeghem
3fc3b7a13a
avoid crash when unicode in title (#9254) 2021-09-22 21:01:34 +02:00
Rumesh Madhusanka
68264b4cee
Updating the stop word list for Sinhala language (#9270) 2021-09-22 20:43:42 +02:00
Adriane Boyd
2f0bb77920
Accept Doc input in pipelines (#9069)
* Accept Doc input in pipelines

Allow `Doc` input to `Language.__call__` and `Language.pipe`, which
skips `Language.make_doc` and passes the doc directly to the pipeline.

* ensure_doc helper function

* avoid running multiple processes on GPU

* Update spacy/tests/test_language.py

Co-authored-by: svlandeg <svlandeg@github.com>
2021-09-22 09:41:05 +02:00
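A minimal sketch of the new behavior:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Passing a pre-made Doc skips Language.make_doc, so the pipeline runs on
# exactly the tokenization (and any other state) the Doc already carries.
doc = nlp.make_doc("One sentence. Another sentence.")
doc = nlp(doc)
print([sent.text for sent in doc.sents])
```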
Daniël de Kok
17802836be
Allow overriding vars in the project assets subcommand (#9248)
This change makes the `project assets` subcommand accept variables to
override as well, making the interface more similar to `project run`.
2021-09-21 10:49:45 +02:00
Adriane Boyd
00bdb31150
Fix vector for 0-length span (#9244) 2021-09-20 20:22:49 +02:00
svlandeg
ec621e6853 Merge remote-tracking branch 'upstream/master' into spacy.io 2021-09-20 15:54:00 +02:00
svlandeg
e0e3e9653b Revert "raise E983 early on in docbin init"
This reverts commit f3f7afa21f.
2021-09-20 15:52:02 +02:00
svlandeg
f3f7afa21f raise E983 early on in docbin init 2021-09-20 15:49:31 +02:00
github-actions[bot]
015d439eb6
Auto-format code with black (#9234)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-09-20 08:49:19 +02:00
Edward
79c7c62970 Update Hammurabi example code to v3 (#9218)
* Update Hammurabi example code

* Fix typo
2021-09-16 13:35:00 +02:00
Edward
8bda39f088
Update Hammurabi example code to v3 (#9218)
* Update Hammurabi example code

* Fix typo
2021-09-16 13:32:44 +02:00
Paul O'Leary McCann
c4f0800fb8
Validate pos values when creating Doc (#9148)
* Validate pos values when creating Doc

* Add clear error when setting invalid pos

This also changes the error language slightly.

* Fix variable name

* Update spacy/tokens/doc.pyx

* Test that setting invalid pos raises an error

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-09-16 13:28:05 +02:00
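A minimal sketch of the validation; that the invalid value raises `ValueError` is my assumption, and the exact error text lives in `spacy.errors`:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["hello", "world"]

Doc(nlp.vocab, words=words, pos=["INTJ", "NOUN"])  # valid UPOS tags: fine

try:
    Doc(nlp.vocab, words=words, pos=["INTJ", "NOT_A_TAG"])
except ValueError as err:
    print("rejected invalid pos value:", err)
```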
Jozef Harag
865cfbc903
feat: add spacy.WandbLogger.v3 with optional run_name and entity parameters (#9202)
* feat: add `spacy.WandbLogger.v3` with optional `run_name` and `entity` parameters

* update versioning in docs

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-09-16 12:26:41 +02:00
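A config sketch using the new optional parameters; the project, run and entity names are placeholders:

```ini
[training.logger]
@loggers = "spacy.WandbLogger.v3"
project_name = "my-project"
run_name = "experiment-1"
entity = "my-team"
```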
Sofie Van Landeghem
00836c2d7d
Update spacy/displacy/templates.py 2021-09-16 09:23:21 +02:00
Sofie Van Landeghem
4bf2606adf
Update spacy/displacy/render.py
Co-authored-by: Renat Shigapov <57352291+shigapov@users.noreply.github.com>
2021-09-16 09:22:38 +02:00
Paul O'Leary McCann
fd99438fb2 Make docs consistent (fix #9126) 2021-09-16 15:56:19 +09:00
Paul O'Leary McCann
1d57d78758 Make docs consistent (fix #9126) 2021-09-16 15:54:12 +09:00
Paul O'Leary McCann
9ceb8f413c
StringStore/Vocab dev docs (#9142)
* First take at StringStore/Vocab docs

Things to check:

1. The mysterious vocab members
2. How to make table of contents? Is it autogenerated?
3. Anything I missed / needs more detail?

* Update docs

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Updates based on review feedback

* Minor fix

* Move example code down

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-09-16 12:50:22 +09:00
Ines Montani
20f63e7154
Only include runtime-relevant config in package CLI dependency detection (#9211) 2021-09-15 23:16:01 +02:00
Paul O'Leary McCann
cd75f96501
Remove two attributes marked for removal in 3.1 (#9150)
* Remove two attributes marked for removal in 3.1

* Add back unused ints with changed names

* Change data_dir to _unused_object

This is still kept in the type definition, but I removed it from the
serialization code.

* Put serialization code back for now

Not sure how this interacts with old serialized models yet.
2021-09-15 23:07:21 +02:00
Adriane Boyd
d74870d38c
Prepare for v3.1.3 (#9200)
* Update thinc and spacy-legacy requirements

* Set version to v3.1.3
2021-09-14 11:03:51 +02:00
Paul O'Leary McCann
0f01f46e02
Update Cython string types (#9143)
* Replace all basestring references with unicode

`basestring` was a compatibility type introduced by Cython to make
dealing with utf-8 strings in Python 2 easier. In Python 3 it is
equivalent to the unicode (or str) type.

I replaced all references to basestring with unicode, since that was
used elsewhere, but we could also just replace them with str, which
should also be equivalent.

All tests pass locally.

* Replace all references to unicode type with str

Since we only support python3 this is simpler.

* Remove all references to unicode type

This removes all references to the unicode type across the codebase and
replaces them with `str`, which makes it more drastic than the prior
commits. In order to make this work importing `unicode_literals` had to
be removed, and one explicit unicode literal also had to be removed (it
is unclear why this is necessary in Cython with language level 3, but
without doing it there were errors about implicit conversion).

When `unicode` is used as a type in comments it was also edited to be
`str`.

Additionally `coding: utf8` headers were removed from a few files.
2021-09-13 17:02:17 +02:00
j-frei
5d0cc0d2ab Correct parser.py use_upper param info (#9180) 2021-09-13 09:29:11 +02:00
Renat Shigapov
d5cc009faf
Merge branch 'explosion:master' into master 2021-09-13 08:43:48 +02:00
Renat Shigapov
e61d93f8c3
add NEL-visualisation to manual-usage 2021-09-13 08:38:58 +02:00
Renat Shigapov
f4b5c4209d
specify kb_id and kb_url for URL visualisation 2021-09-13 08:15:07 +02:00
Renat Shigapov
7562fb5354
add links to entities into the TPL_ENT-template 2021-09-13 08:06:54 +02:00
Paul O'Leary McCann
9c4e84d4a1 Minor typo fix in docs 2021-09-11 14:23:11 +09:00
Paul O'Leary McCann
f89e1c34c9
Minor typo fix in docs 2021-09-11 14:22:05 +09:00
Renat Shigapov
2e2d0e8701 added spaCyOpenTapioca (#9181)
* add spaCyOpenTapioca to universe

* add agreement

* fix misprint in tags
2021-09-11 13:25:25 +09:00
mylibrar
d621df6422 Update example code of forte (#9175)
Co-authored-by: Suqi Sun <suqi.sun@petuum.com>
2021-09-11 13:25:17 +09:00
Renat Shigapov
646f3a54db
added spaCyOpenTapioca (#9181)
* add spaCyOpenTapioca to universe

* add agreement

* fix misprint in tags
2021-09-11 13:16:51 +09:00
mylibrar
ee28aac68e
Update example code of forte (#9175)
Co-authored-by: Suqi Sun <suqi.sun@petuum.com>
2021-09-11 13:13:13 +09:00
j-frei
462b009648
Correct parser.py use_upper param info (#9180) 2021-09-10 16:19:58 +02:00
Renat Shigapov
c1927fe994
fix misprint in tags 2021-09-09 15:37:34 +02:00
Renat Shigapov
8940e0baca
add agreement 2021-09-09 15:33:29 +02:00
Renat Shigapov
ea58294076
add spaCyOpenTapioca to universe 2021-09-09 15:13:18 +02:00
Adriane Boyd
aba6ce3a43
Handle spacy-legacy in package CLI for dependencies (#9163)
* Handle spacy-legacy in package CLI for dependencies

* Implement legacy backoff in spacy registry.find

* Remove unused import

* Update and format test
2021-09-08 11:46:40 +02:00
Sofie Van Landeghem
632d8d4c35
bump thinc to 8.0.9 (#9133) 2021-09-03 13:34:42 +02:00
github-actions[bot]
584fae5807
Auto-format code with black (#9130)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-09-03 10:47:03 +02:00
Kevin Humphreys
ca93504660
Pass alignments to Matcher callbacks (#9001)
* pass alignments to callbacks

* refactor for single callback loop

* Update spacy/matcher/matcher.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-09-02 12:58:05 +02:00
Sofie Van Landeghem
721f4554c8 matcher doc corrections (#9115)
* update error message to current UX

* clarify uppercase effect

* fix docstring
2021-09-02 09:29:44 +02:00
Sofie Van Landeghem
8895e3c9ad
matcher doc corrections (#9115)
* update error message to current UX

* clarify uppercase effect

* fix docstring
2021-09-02 09:26:33 +02:00
Robyn Speer
d60b748e3c
Fix surprises when asking for the root of a git repo (#9074)
* Fix surprises when asking for the root of a git repo

In the case of the first asset I wanted to get from git, the data I
wanted was the entire repository. I tried leaving "path" blank, which
gave a less-than-helpful error, and then I tried `path: "/"`, which
started copying my entire filesystem into the project. The path I should
have used was "".

I've made two changes to make this smoother for others:

- The 'path' within a git clone defaults to ""
- If the path points outside of the tmpdir that the git clone goes
into, we fail with an error

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* use a descriptive error instead of a default

plus some minor fixes from PR review

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* check for None values in assets

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
2021-09-01 22:52:08 +02:00
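A `project.yml` sketch of the fixed behavior; the repository URL is a placeholder:

```yaml
assets:
  - dest: "assets/whole_repo"
    git:
      repo: "https://github.com/example/repo"
      branch: "main"
      # path now defaults to "" (the whole repository); values that escape
      # the temporary clone directory are rejected with a descriptive error.
      path: ""
```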
Paul O'Leary McCann
752696f134 Document Assigned Attributes of Pipeline Components (#9041)
* Add textcat docs

* Add NER docs

* Add Entity Linker docs

* Add assigned fields docs for the tagger

This also adds a preamble, since there wasn't one.

* Add morphologizer docs

* Add dependency parser docs

* Update entityrecognizer docs

This is a little weird because `Doc.ents` is the only thing assigned to,
but it's actually a bidirectional property.

* Add token fields for entityrecognizer

* Fix section name

* Add entity ruler docs

* Add lemmatizer docs

* Add sentencizer/recognizer docs

* Update website/docs/api/entityrecognizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/entityruler.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/tagger.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/entityruler.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update type for Doc.ents

This was `Tuple[Span, ...]` everywhere but `Tuple[Span]` seems to be
correct.

* Run prettier

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier

* Add transformers section

This basically just moves and renames the "custom attributes" section
from the bottom of the page to be consistent with "assigned attributes"
on other pages.

I looked at moving the paragraph just above the section into the
section, but it includes the unrelated registry additions, so it seemed
better to leave it unchanged.

* Make table header consistent

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-09-01 12:10:52 +02:00
Paul O'Leary McCann
ba6a37d358
Document Assigned Attributes of Pipeline Components (#9041)
* Add textcat docs

* Add NER docs

* Add Entity Linker docs

* Add assigned fields docs for the tagger

This also adds a preamble, since there wasn't one.

* Add morphologizer docs

* Add dependency parser docs

* Update entityrecognizer docs

This is a little weird because `Doc.ents` is the only thing assigned to,
but it's actually a bidirectional property.

* Add token fields for entityrecognizer

* Fix section name

* Add entity ruler docs

* Add lemmatizer docs

* Add sentencizer/recognizer docs

* Update website/docs/api/entityrecognizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/entityruler.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/tagger.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/entityruler.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update type for Doc.ents

This was `Tuple[Span, ...]` everywhere but `Tuple[Span]` seems to be
correct.

* Run prettier

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier

* Add transformers section

This basically just moves and renames the "custom attributes" section
from the bottom of the page to be consistent with "assigned attributes"
on other pages.

I looked at moving the paragraph just above the section into the
section, but it includes the unrelated registry additions, so it seemed
better to leave it unchanged.

* Make table header consistent

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-09-01 12:09:39 +02:00
Paul O'Leary McCann
f803a84571
Fix inference of epoch_resume (#9084)
* Fix inference of epoch_resume

When an epoch_resume value is not specified individually, it can often
be inferred from the filename. The value inference code was there but
the value wasn't passed back to the training loop.

This also adds a specific error in the case where no epoch_resume value
is provided and it can't be inferred from the filename.

* Add new error

* Always use the epoch resume value if specified

Before this, the value in the filename was used if found.
2021-09-01 14:17:42 +09:00
Sofie Van Landeghem
a17b06d18b
allow typer 0.4 (#9089) 2021-08-31 20:53:51 +10:00
svlandeg
3f16c45281 Merge branch 'spacy.io' of https://github.com/explosion/spaCy into spacy.io 2021-08-31 10:58:40 +02:00
Davide Fiocco
5c88998b9d Fix point typo on docbin docs (#9097) 2021-08-31 10:58:31 +02:00
Ines Montani
753149bc88 Update references to contributor agreement [ci skip] 2021-08-31 10:58:22 +02:00
Davide Fiocco
1dd69be1f1
Fix point typo on docbin docs (#9097) 2021-08-31 10:55:44 +02:00
Ines Montani
1a86d545af Update references to contributor agreement [ci skip] 2021-08-31 10:03:38 +10:00
Sofie Van Landeghem
5af88427a2
Dev docs: listeners (#9061)
* Start Listeners documentation

* intro table of different architectures

* initialization, linking, dim inference

* internal comm (WIP)

* expand internal comm section

* frozen components and replacing listeners

* various small fixes

* fix content table

* fix link
2021-08-30 14:56:35 +02:00
Adriane Boyd
1e9b4b55ee
Pass overrides to subcommands in workflows (#9059)
* Pass overrides to subcommands in workflows

* Add missing docstring
2021-08-30 09:23:54 +02:00
Meenal Jhajharia
db42ba5240 benepar usage example has deprecated imports 2021-08-29 14:44:18 +09:00
Paul O'Leary McCann
6ff8d90070
Merge pull request #9081 from mjhajharia/patch-1
benepar usage example has deprecated imports
2021-08-29 14:41:52 +09:00
Meenal Jhajharia
2613f0e98f
benepar usage example has deprecated imports 2021-08-28 16:35:58 +05:30
Sofie Van Landeghem
689535c264 config is not Optional (#9024) 2021-08-27 11:53:54 +02:00
Sofie Van Landeghem
1e974de837
config is not Optional (#9024) 2021-08-27 11:44:31 +02:00
github-actions[bot]
fb9c31fbda
Auto-format code with black (#9065)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-08-27 11:42:27 +02:00
Sofie Van Landeghem
8c1d86ea92 Document use-case of freezing tok2vec (#8992)
* update error msg

* add sentence to docs

* expand note on frozen components
2021-08-26 09:53:29 +02:00
Sofie Van Landeghem
31c0a75e6d fix docs for Span constructor arguments (#9023) 2021-08-26 09:52:59 +02:00
Sofie Van Landeghem
4d39430b82
Document use-case of freezing tok2vec (#8992)
* update error msg

* add sentence to docs

* expand note on frozen components
2021-08-26 09:50:35 +02:00
Sofie Van Landeghem
94fb840443
fix docs for Span constructor arguments (#9023) 2021-08-25 16:06:22 +02:00
David Strouk
31e9b126a0
Fix verbs list in lang/fr/tokenizer_exceptions.py (#9033) 2021-08-25 15:55:09 +02:00
Ines Montani
4cd052e81d
Include component factories in third-party dependencies resolver (#9009)
* Include component factories in third-party dependencies resolver

* Increment catalogue and update test
2021-08-25 14:58:01 +02:00
svlandeg
fb8c2f794a Merge remote-tracking branch 'upstream/master' into spacy.io 2021-08-20 14:49:51 +02:00
Sofie Van Landeghem
e1f88de729
bump to 3.1.2 (#9008) 2021-08-20 12:41:09 +02:00
Sofie Van Landeghem
4d52d7051c
Fix spancat training on nested entities (#9007)
* overfitting test on non-overlapping entities

* add failing overfitting test for overlapping entities

* failing test for list comprehension

* remove test that was put in separate PR

* bugfix

* cleanup
2021-08-20 12:37:50 +02:00
Paul O'Leary McCann
9cc3dc2b67
Add glossary entry for _SP (#8983) 2021-08-20 12:04:02 +02:00
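
With the glossary entry in place, the tag can be looked up like any other (a quick sketch; the exact output wording may differ):

```python
import spacy

# _SP is the fine-grained tag English pipelines assign to whitespace tokens;
# spacy.explain looks the code up in the glossary this PR extends.
print(spacy.explain("_SP"))  # e.g. "whitespace"
```
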
Sofie Van Landeghem
de025beb5f
Warn and document spangroup.doc weakref (#8980)
* test for error after Doc has been garbage collected

* warn about using a SpanGroup when the Doc has been garbage collected

* add warning to the docs

* rephrase slightly

* raise error instead of warning

* update

* move warning to doc property
2021-08-20 11:06:19 +02:00
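
A small sketch of the failure mode being documented here, keeping a SpanGroup alive after its Doc has gone away (the `"demo"` key is arbitrary):

```python
import spacy

nlp = spacy.blank("en")

def make_group():
    doc = nlp("spaCy is written in Python")
    doc.spans["demo"] = [doc[0:1]]
    # Returning only the SpanGroup lets the Doc go out of scope here.
    return doc.spans["demo"]

group = make_group()
# The SpanGroup holds a weak reference to its Doc, so once the Doc has
# been garbage collected, accessing group.doc raises (per the change above)
# instead of silently returning a dead reference.
group.doc  # expected to raise
```
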
Paul O'Leary McCann
0e4da8ed70 Fix type annotation in docs 2021-08-20 15:35:41 +09:00
Paul O'Leary McCann
37fe847af4 Fix type annotation in docs 2021-08-20 15:34:22 +09:00
Ines Montani
8444aa75e2 Fix universe.json [ci skip] 2021-08-20 11:26:46 +10:00
Ines Montani
f2b61b77a5 Fix universe.json [ci skip] 2021-08-20 11:26:29 +10:00
Ines Montani
f2d19e6dc2 Merge pull request #9003 from bbieniek/add-spacy-api-v3 [ci skip] 2021-08-20 11:23:50 +10:00
Ines Montani
894e16f5ca
Merge pull request #9003 from bbieniek/add-spacy-api-v3 [ci skip] 2021-08-20 11:23:30 +10:00
Baltazar
4d85cb88a5 added contribution license 2021-08-19 21:45:18 +02:00
Baltazar
71e65fe943 added spacy api v3 docker 2021-08-19 21:29:25 +02:00
Adriane Boyd
c5de9b463a
Update custom tokenizer APIs and pickling (#8972)
* Fix incorrect pickling of Japanese and Korean pipelines, which led to
the entire pipeline being reset if pickled

* Enable pickling of Vietnamese tokenizer

* Update tokenizer APIs for Chinese, Japanese, Korean, Thai, and
Vietnamese so that only the `Vocab` is required for initialization
2021-08-19 14:37:47 +02:00
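
The fixed behavior can be checked with a round-trip like this (assumes the Japanese tokenizer dependencies such as SudachiPy are installed):

```python
import pickle
import spacy

nlp = spacy.blank("ja")
# Before the fix, pickling reset the custom tokenizer state; now the
# pipeline survives a pickle round-trip intact.
nlp2 = pickle.loads(pickle.dumps(nlp))
print([t.text for t in nlp2("これはテストです")])
```
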
Adriane Boyd
6722dc3dc5
Fix allow_overlap default for spancat scoring (#8970)
* Remove irrelevant default options
2021-08-18 09:56:56 +02:00
Steele Farnsworth
b18cb1cd2a
Refactor dependencymatcher.pyx to use list comps and enumerate. (#8956)
* Refactor to use list comps and enumerate.

Replace loops that append to a list with list comprehensions where this does not change the behavior; replace range(len(...)) loops with enumerate. Correct one typo in a comment. Replace a call to set() with a set literal.

* Undo double assignment.

Expand `tokens_to_key[j] = k = self._get_matcher_key(key, i, j)` to two statements.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Sign contributors agreement

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-18 09:55:45 +02:00
Ines Montani
d94ddd5686
Auto-detect package dependencies in spacy package (#8948)
* Auto-detect package dependencies in spacy package

* Add simple get_third_party_dependencies test

* Import packages_distributions explicitly

* Inline packages_distributions

* Fix docstring [ci skip]

* Relax catalogue requirement

* Move importlib_metadata to spacy.compat with note

* Include license information [ci skip]
2021-08-17 14:05:13 +02:00
Sofie Van Landeghem
0a6b68848f
Fix making span_group (#8975)
* fix _make_span_group

* fix imports
2021-08-17 10:36:34 +02:00
Ines Montani
593a22cf2d
Add development docs for Language and code conventions (#8745)
* WIP: add dev docs for Language / config [ci skip]

* Add section on initialization [ci skip]

* Fix wording [ci skip]

* Add code conventions WIP [ci skip]

* Update code convention docs [ci skip]

* Update contributing guide and conventions [ci skip]

* Update Code Conventions.md [ci skip]

* Clarify sourced components + vectors

* Apply suggestions from code review [ci skip]

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update wording and add link [ci skip]

* restructure slightly + extended index

* remove paragraph that breaks flow and is repeated in more detail later

* fix anchors

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-08-17 09:38:15 +02:00
Paul O'Leary McCann
4ed5d9ad5a Add notes on preparing training data to docs (#8964)
* Add training data section

Not entirely sure this is in the right location on the page - maybe it
should be after quickstart?

* Add pointer from binary format to training data section

* Minor cleanup

* Add to ToC, fix filename

* Update website/docs/usage/training.md

Co-authored-by: Ines Montani <ines@ines.io>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move the training data section further down the page

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-16 17:39:19 +02:00
Paul O'Leary McCann
9391998c77
Add notes on preparing training data to docs (#8964)
* Add training data section

Not entirely sure this is in the right location on the page - maybe it
should be after quickstart?

* Add pointer from binary format to training data section

* Minor cleanup

* Add to ToC, fix filename

* Update website/docs/usage/training.md

Co-authored-by: Ines Montani <ines@ines.io>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move the training data section further down the page

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-16 17:37:21 +02:00
Ines Montani
d65e03adae Merge pull request #8951 from HLasse/master 2021-08-16 11:41:53 +10:00
Ines Montani
a894fe0440
Merge pull request #8951 from HLasse/master 2021-08-16 11:41:32 +10:00
Lasse
839ea0f987 change tags formatting to match 2021-08-13 14:40:08 +02:00
Lasse
70ab596f61 Merge branch 'master' of https://github.com/HLasse/spaCy 2021-08-13 14:35:21 +02:00
Lasse
195e4e48c3 add textdescriptives to universe 2021-08-13 14:35:18 +02:00
github-actions[bot]
92071326d8
Auto-format code with black (#8950)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-08-13 11:48:38 +02:00
Adriane Boyd
8448c7dbc5
Update da trf recommendation (#8921)
Update the da trf recommendation to the same model used in the
pretrained pipelines.
2021-08-12 13:54:02 +02:00
Ines Montani
647abe186c Merge pull request #8938 from explosion/docs/prodigy-v1-11-project [ci skip]
Update Prodigy project template for v1.11
2021-08-12 21:17:14 +10:00
Ines Montani
6260f044cc
Merge pull request #8938 from explosion/docs/prodigy-v1-11-project [ci skip]
Update Prodigy project template for v1.11
2021-08-12 21:16:49 +10:00
Adriane Boyd
b278f31ee6
Document scorers in registry and components from #8766 (#8929)
* Document scorers in registry and components from #8766

* Update spacy/pipeline/lemmatizer.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/dependencyparser.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Reformat

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-12 12:50:03 +02:00
Edward
944ad6b1d4
Add new parameter for saving every n epoch in pretraining (#8912)
* Add parameter for saving every n epoch

* Add new parameter in schemas

* Add new parameter in default_config

* Adjust schemas

* format code
2021-08-12 11:14:48 +02:00
Ines Montani
4f769ff913 Update Prodigy project template for v1.11 [ci skip] 2021-08-12 13:46:20 +10:00
Paul O'Leary McCann
e227d24d43
Allow passing in array vars for speedup (#8882)
* Allow passing in array vars for speedup

This fixes #8845. Not sure about the docstring changes here...

* Update docs

Types maybe need more detail? Maybe not?

* Run prettier on docs

* Update spacy/tokens/span.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-10 15:13:53 +02:00
Adriane Boyd
f99d6d5e39
Refactor scoring methods to use registered functions (#8766)
* Add scorer option to components

Add an optional `scorer` parameter to all pipeline components. If a
scoring function is provided, it overrides the default scoring method
for that component.

* Add registered scorers for all components

* Add `scorers` registry
* Move all scoring methods outside of components as independent
  functions and register
* Use the registered scoring methods as defaults in configs and inits

Additional:

* The scoring methods no longer have access to the full component, so
  use settings from `cfg` as default scorer options to handle settings
  such as `labels`, `threshold`, and `positive_label`
* The `attribute_ruler` scoring method no longer has access to the
  patterns, so all scoring methods are called
* Bug fix: `spancat` scoring method is updated to set `allow_overlap` to
  score overlapping spans correctly

* Update Russian lemmatizer to use direct score method

* Check type of cfg in Pipe.score

* Fix check

* Update spacy/pipeline/sentencizer.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove validate_examples from scoring functions

* Use Pipe.labels instead of Pipe.cfg["labels"]

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-10 15:13:39 +02:00
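
With the new `scorers` registry, a custom scoring function can be registered and wired into a component's config roughly like this (the registry name `"my_tagger_scorer.v1"` is illustrative, not a built-in):

```python
import spacy
from spacy.scorer import Scorer

@spacy.registry.scorers("my_tagger_scorer.v1")
def make_tagger_scorer():
    def tagger_score(examples, **kwargs):
        # Score the fine-grained tag attribute, like the default tagger scorer.
        return Scorer.score_token_attr(examples, "tag", **kwargs)
    return tagger_score

nlp = spacy.blank("en")
nlp.add_pipe("tagger", config={"scorer": {"@scorers": "my_tagger_scorer.v1"}})
```
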
fgaim
ee011ca963
Update Tigrinya ትግርኛ language support (#8900)
* Add missing punctuation for Tigrinya and Amharic

* Fix numeral and ordinal numbers for Tigrinya

 - Amharic was used in many cases
 - Also fixed some typos

* Update Tigrinya stop-words

* Contributor agreement for fgaim

* Fix typo in "ti" lang test

* Remove multi-word entries from numbers and ordinals
2021-08-10 13:55:08 +02:00
Paul O'Leary McCann
6029cfc391
Add scores to output in spancat (#8855)
* Add scores to output in spancat

This exposes the scores as an attribute on the SpanGroup. Includes a
basic test.

* Add basic doc note

* Vectorize score calcs

* Add "annotation format" section

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Clean up doc section

* Ran prettier on docs

* Get arrays off the gpu before iterating over them

* Remove int() calls

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-08-10 13:47:49 +02:00
Dimitar Ganev
733ffe439d
Improve the stop words and the tokenizer exceptions in Bulgarian language. (#8862)
* Add more stop words and improve the readability

* Add and categorize the tokenizer exceptions for `bg` lang

* Create syrull.md

* Add references for the additional stop words and tokenizer exc abbrs
2021-08-10 13:44:23 +02:00
Adriane Boyd
415dee587c
Merge pull request #8911 from adrianeboyd/chore/update-develop-from-master-v3.1-1
Update develop from master
2021-08-09 15:41:36 +02:00
Ines Montani
c581848cbb Merge pull request #8910 from DuyguA/patch-1 [ci skip]
updated universe.json for new book
2021-08-09 23:13:17 +10:00
Ines Montani
a1e9f19460
Merge pull request #8910 from DuyguA/patch-1 [ci skip]
updated universe.json for new book
2021-08-09 23:12:50 +10:00
Paul O'Leary McCann
35255786a1 Fix #8902 (bad link in docs)
typo fix
2021-08-09 13:59:59 +02:00
Adriane Boyd
a79888ed67 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-1 2021-08-09 13:13:13 +02:00
Duygu Altinok
380b2817cf
updated universe.json for new book 2021-08-09 12:39:22 +02:00
Paul O'Leary McCann
cac298471f
Fix #8902 (bad link in docs)
typo fix
2021-08-08 22:04:00 +09:00
Eduard Zorita
439f30faad
Add stub files for main cython classes (#8427)
* Add stub files for main API classes

* Add contributor agreement for ezorita

* Update types for ndarray and hash()

* Fix __getitem__ and __iter__

* Add attributes of Doc and Token classes

* Overload type hints for Span.__getitem__

* Fix type hint overload for Span.__getitem__

Co-authored-by: Luca Dorigo <dorigoluca@gmail.com>
2021-08-07 12:30:03 +02:00
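
The `Span.__getitem__` overloads mentioned above work along these lines in a stub (a simplified sketch, not the actual `.pyi` contents):

```python
from typing import overload

class Token: ...

class Span:
    # Indexing with an int yields a Token; slicing yields a Span. The
    # overloads let type checkers tell the two cases apart.
    @overload
    def __getitem__(self, i: int) -> Token: ...
    @overload
    def __getitem__(self, i: slice) -> "Span": ...
    def __getitem__(self, i):
        raise NotImplementedError  # stub-style sketch only
```
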
github-actions[bot]
56d4d87aeb
Auto-format code with black (#8895)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-08-06 13:38:06 +02:00
Kabir Khan
1dfffe5fb4
No output info message in train (#8885)
* Add info message that no output directory was provided in train

* Update train.py

* Fix logging
2021-08-05 09:21:22 +02:00
Adriane Boyd
fa2e7a4bbf
Fix spancat tests on GPU (#8872)
* Fix spancat tests on GPU

* Fix more spancat tests
2021-08-04 14:29:43 +02:00
Paul O'Leary McCann
77d698dcae
Fix check for RIGHT_ATTRS in dep matcher (#8807)
* Fix check for RIGHT_ATTRs in dep matcher

If a non-anchor node does not have RIGHT_ATTRS, the dep matcher throws
an E100, which says that non-anchor nodes must have LEFT_ID, REL_OP, and
RIGHT_ID. It specifically does not say RIGHT_ATTRS is required.

A blank RIGHT_ATTRS is also valid, and patterns with one will be
accepted. While not typical, sometimes a REL_OP is enough to specify a
non-anchor node - maybe you just want the head of another node
unconditionally, for example.

This change just sets RIGHT_ATTRS to {} if not present. Alternatively,
changing E100 to state that RIGHT_ATTRS is required could also be
reasonable.

* Fix test

This test was written on the assumption that if `RIGHT_ATTRS` isn't
present an error will be raised. Since the proposed changes make it so
an error won't be raised this is no longer necessary.

* Revert test, update error message

Error message now lists missing keys, and RIGHT_ATTRS is required.

* Use list of required keys in error message

Also removes unused key param arg.
2021-08-04 09:20:41 +02:00
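
After this change, a pattern like the following is accepted; the second node is constrained only by its relation to the anchor (a sketch - a trained pipeline with a parser is needed to get real matches):

```python
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.blank("en")
matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    # A non-anchor node with an empty RIGHT_ATTRS: "the head of the verb",
    # with no constraints of its own.
    {"LEFT_ID": "verb", "REL_OP": "<", "RIGHT_ID": "verb_head", "RIGHT_ATTRS": {}},
]
matcher.add("VERB_AND_HEAD", [pattern])
```
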
Adriane Boyd
941a591f3c
Pass excludes when serializing vocab (#8824)
* Pass excludes when serializing vocab

Additional minor bug fix:

* Deserialize vocab in `EntityLinker.from_disk`

* Add test for excluding strings on load

* Fix formatting
2021-08-03 14:42:44 +02:00
Adriane Boyd
c1caa47aa7 Support list values and INTERSECTS in Matcher (#8784)
* Support list values and IS_INTERSECT in Matcher

* Support list values as token attributes for set operators, not just as
pattern values.

* Add `IS_INTERSECT` operator.

* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.

* Rename IS_INTERSECT to INTERSECTS
2021-08-02 19:41:10 +02:00
Adriane Boyd
175847f92c
Support list values and INTERSECTS in Matcher (#8784)
* Support list values and IS_INTERSECT in Matcher

* Support list values as token attributes for set operators, not just as
pattern values.

* Add `IS_INTERSECT` operator.

* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.

* Rename IS_INTERSECT to INTERSECTS
2021-08-02 19:39:26 +02:00
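
A sketch of the new operator with a list-valued custom attribute (the `topics` extension is invented here for illustration):

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

Token.set_extension("topics", default=[])

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# INTERSECTS matches when the token's list value and the pattern's list
# share at least one element.
matcher.add("TOPIC", [[{"_": {"topics": {"INTERSECTS": ["nlp", "ml"]}}}]])

doc = nlp("spaCy rocks")
doc[0]._.topics = ["nlp", "python"]
print(matcher(doc))  # one match on the first token
```
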
Adriane Boyd
fbbbda1954
Fix start/end chars for empty and out-of-bounds spans (#8816) 2021-08-02 19:07:19 +02:00
Adriane Boyd
9ad3b8cf8d
Only add sourced vectors hashes to meta if necessary (#8830) 2021-08-02 18:22:35 +02:00
Nick Sorros
0485cdefcc
Add logger debug for project push and pull (#8860)
* Add logger debug for project push and pull

* Sign contributor agreement
2021-08-02 18:13:53 +02:00
themrmax
de076194c4
Make ConsoleLogger flush after each logging line (#8810)
This is necessary to avoid "logging blackouts" when running training on Kubernetes pods
2021-08-02 14:33:38 +02:00
Ines Montani
d79dbd0624 Merge pull request #8844 from thomashacker/bugfix/fix-doc-transformer-typo [ci skip]
Fix typo in Tok2VecTransformer example config
2021-07-30 09:11:24 +10:00
Ines Montani
4ddee5e84c Merge pull request #8841 from adrianeboyd/docs/ent-id-sep [ci skip]
Fix formatting of ent_id_sep in EntityRuler API docs
2021-07-30 09:11:15 +10:00
Ines Montani
cf9b671566 Merge pull request #8840 from polm/docs/evaluate-speed [ci skip] 2021-07-30 09:11:05 +10:00
Ines Montani
30f20496d5
Merge pull request #8840 from polm/docs/evaluate-speed [ci skip] 2021-07-30 09:10:15 +10:00
Ines Montani
65d163fab5
Adjust formatting [ci skip] 2021-07-30 09:10:04 +10:00
Ines Montani
3a701d3645
Merge pull request #8841 from adrianeboyd/docs/ent-id-sep [ci skip]
Fix formatting of ent_id_sep in EntityRuler API docs
2021-07-30 09:09:25 +10:00
Ines Montani
f08be084fb
Merge pull request #8844 from thomashacker/bugfix/fix-doc-transformer-typo [ci skip]
Fix typo in Tok2VecTransformer example config
2021-07-30 09:08:59 +10:00
thomashacker
02258916c8 Fix example config typo for transformer architecture 2021-07-29 11:19:40 +02:00
Adriane Boyd
15b12f3e35 Fix formatting of ent_id_sep in EntityRuler API docs 2021-07-29 10:10:12 +02:00
Paul O'Leary McCann
a60cb13910 Update speed entry in metrics table 2021-07-29 16:35:19 +09:00
Paul O'Leary McCann
e125313a50 Revert "Add note about SPEED in output"
This reverts commit c92d268176.
2021-07-29 16:34:08 +09:00
Ines Montani
03a742f332 Merge pull request #8814 from polm/docs/migrate-lexeme-tables [ci skip] 2021-07-29 17:19:44 +10:00
Ines Montani
0a1e299d30
Merge pull request #8814 from polm/docs/migrate-lexeme-tables [ci skip] 2021-07-29 17:18:02 +10:00
Paul O'Leary McCann
c92d268176 Add note about SPEED in output
In #8823 it was pointed out that the `SPEED` value wasn't documented
anywhere.
2021-07-29 15:03:07 +09:00
Paul O'Leary McCann
8867e60fbb
Update website/docs/usage/v3.md
Co-authored-by: Ines Montani <ines@ines.io>
2021-07-29 14:56:56 +09:00
Adriane Boyd
9e9611233f Remove labels from textcat component config example (#8815) 2021-07-27 13:15:33 +02:00
Paul O'Leary McCann
de5bc8a0e1 Update subset/superset docs (#8795)
* Update subset/superset docs

* Update website/docs/usage/rule-based-matching.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-07-27 13:15:27 +02:00
Adriane Boyd
8547514aa4
Remove labels from textcat component config example (#8815) 2021-07-27 13:14:38 +02:00
Paul O'Leary McCann
76ac95923a Add note to migration guide about lexeme tables (fix #7290)
This just adds the resolution from #6388 to the docs.
2021-07-27 19:19:25 +09:00
Paul O'Leary McCann
67ecdcc3ac
Update subset/superset docs (#8795)
* Update subset/superset docs

* Update website/docs/usage/rule-based-matching.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-07-27 12:08:46 +02:00
Adriane Boyd
81d3a1edb1
Use tokenizer URL_MATCH pattern in LIKE_URL (#8765) 2021-07-27 12:07:01 +02:00
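
A quick check of the aligned behavior (sketch):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Docs live at https://spacy.io")
# token.like_url is backed by LIKE_URL, which now reuses the tokenizer's
# URL_MATCH pattern instead of a separate, looser regex.
print([t.text for t in doc if t.like_url])  # ['https://spacy.io']
```
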
Adriane Boyd
4f28190afe
Merge pull request #8813 from adrianeboyd/chore/develop-v3.2
Update develop for v3.2
2021-07-27 11:26:18 +02:00
Ines Montani
7f21c7dfa2
Merge pull request #8794 from explosion/autoblack
Auto-format code with black
2021-07-27 12:17:15 +10:00
Ines Montani
34c401f04f
Merge pull request #8801 from polm/fix/respect-no-skip (fixes #8796)
Respect the no_skip value
2021-07-27 12:16:47 +10:00
Ines Montani
cf3855ae05 Merge pull request #8806 from Ledenel/master [ci skip]
fix typo
2021-07-27 12:15:44 +10:00
Ines Montani
5c762e08d7 Merge pull request #8808 from kevinlu1248/master [ci skip]
Changed a CLI command in data-formats.md due to erroneous information
2021-07-27 12:15:35 +10:00
Ines Montani
134cb06af3
Merge pull request #8808 from kevinlu1248/master [ci skip]
Changed a CLI command in data-formats.md due to erroneous information
2021-07-27 12:15:16 +10:00
Ines Montani
9bf0d6f2fd
Merge pull request #8806 from Ledenel/master [ci skip]
fix typo
2021-07-27 12:14:22 +10:00
Kevin Lu
4a8e9e4e4e
Update data-formats.md 2021-07-25 22:58:53 -07:00
Ledenel
413f745c68 fix broken example in spaCy universe Chatterbot 2021-07-25 15:53:32 +00:00
Paul O'Leary McCann
284b530c63 Respect the no_skip value
Seems like the logic for this was just left out. See #8796.
2021-07-24 15:31:17 +09:00
explosion-bot
a58ab6ea22 Auto-format code with black 2021-07-23 08:04:09 +00:00
Adriane Boyd
6bbc2b1956
Reload train corpus in debug data after initialize (#8776) 2021-07-21 22:38:40 +02:00
svlandeg
f4f270940c Merge remote-tracking branch 'upstream/master' into spacy.io 2021-07-20 16:14:16 +02:00
Adriane Boyd
d48c01a6f7
Remove extraneous grc test file (#8768) 2021-07-20 15:51:15 +02:00
Sofie Van Landeghem
ffaead8fe0
bump to 3.1.1 2021-07-19 14:48:27 +02:00
Sofie Van Landeghem
83e27d262e
negative tag annotation (#8731)
* unit test to unlearn tag via negative annotation

* bump thinc to 8.0.8
2021-07-19 14:39:11 +02:00
Adriane Boyd
0e4b96c97e
Update lexeme ranks for loaded vectors (#8640)
Update the ranks for any lexemes that have been added to the vocab
before the vectors are added to the model.
2021-07-19 18:25:54 +10:00
Adriane Boyd
e532c69475
Update Language.replace_pipe for disabled components (#8729)
* Fix the index where the replacement in inserted to account for
disabled components
* Allow `Language.replace_pipe` to replace disabled components
2021-07-19 18:06:12 +10:00
Kenneth Enevoldsen
2880ae70b0 removed outdated spacy version for spacymoji
From the documentation of spacymoji (and its requirements.txt), it seems it is not limited to spaCy v2.
2021-07-18 19:19:55 +09:00
Kenneth Enevoldsen
812746464b fixed GitHub link and thumbnail
Sorry, I seem to have misunderstood that the GitHub reference shouldn't be a link.
2021-07-18 19:19:37 +09:00
Paul O'Leary McCann
d717593eb7
Merge pull request #8754 from KennethEnevoldsen/patch-1
[minor] removed outdated spacy version for spacymoji
2021-07-18 19:17:33 +09:00
Paul O'Leary McCann
ac67639eaf
Merge pull request #8755 from KennethEnevoldsen/patch-2
fixed GitHub link and thumbnail
2021-07-18 19:14:57 +09:00
Kenneth Enevoldsen
5d6aed0773
fixed GitHub link and thumbnail
Sorry, I seem to have misunderstood that the GitHub reference shouldn't be a link.
2021-07-18 10:22:00 +02:00
Ines Montani
f90482d077 Tidy up and auto-format 2021-07-18 15:44:56 +10:00
Ines Montani
98cf872e11 Fix JSON [ci skip] 2021-07-18 13:21:43 +10:00
Ines Montani
313f55e560 Fix JSON [ci skip] 2021-07-18 13:21:33 +10:00
Ines Montani
a792e1119f Merge pull request #8702 from KennethEnevoldsen/master [ci skip] 2021-07-18 13:19:09 +10:00
Ines Montani
51e5903d6f
Merge pull request #8702 from KennethEnevoldsen/master [ci skip] 2021-07-18 13:18:42 +10:00
Kenneth Enevoldsen
8546948fba
removed outdated spacy version for spacymoji
From the documentation of spacymoji (and its requirements.txt), it seems it is not limited to spaCy v2.
2021-07-17 15:19:43 +02:00
Kenneth Enevoldsen
a0e0ccdb46
Update website/meta/universe.json
Co-authored-by: Ines Montani <ines@ines.io>
2021-07-17 07:14:46 +02:00
Ines Montani
c0f436efbc
Merge pull request #8735 from explosion/autoblack 2021-07-17 13:46:17 +10:00
Ines Montani
483f3175cb Tidy up [ci skip] 2021-07-17 13:43:15 +10:00
Ines Montani
15e6578f7d
Adjust formatting 2021-07-17 10:49:13 +10:00
Mario Šaško
47c5a63a83 Add TakeLab/spacy-udpipe to Universe (#8698)
* Add TakeLab/spacy-udpipe to universe

* Add SCA

* Sign SCA
2021-07-16 11:18:09 +02:00
Mario Šaško
1ba2e8a646
Add TakeLab/spacy-udpipe to Universe (#8698)
* Add TakeLab/spacy-udpipe to universe

* Add SCA

* Sign SCA
2021-07-16 11:15:52 +02:00
explosion-bot
eff3d1088b Auto-format code with black 2021-07-16 08:03:36 +00:00
Adriane Boyd
e76e2addd1 Remove TrainablePipe as base class for Lemmatizer in API docs (#8725) 2021-07-15 16:42:14 +02:00
Adriane Boyd
f5acc48111
Remove TrainablePipe as base class for Lemmatizer in API docs (#8725) 2021-07-15 16:41:36 +02:00
Adriane Boyd
ac45c7c045
Add pre-commit to ignored requirements (#8728) 2021-07-15 16:41:15 +02:00
jmyerston
993b0fab0e
Added ancient Greek language support (#8606)
* Add ancient Greek language support

Initial commit

* Contributor Agreement

* grc tokenizer test added  and files formatted with black, unnecessary import removed

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Commas in lists fixed. __init__py added to test

* Update lex_attrs.py

* Update stop_words.py

* Update stop_words.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-07-15 10:27:17 +02:00
Sofie Van Landeghem
77859beb99
spacy.ngram_range_suggester.v1 (#8699) 2021-07-15 10:01:22 +02:00
Julien Rossi
e117573822
Adding noun_chunks to the DUTCH language model (nl) (#8529)
* implement noun_chunks for the Dutch language

* copy/paste FR and SV syntax iterators to accommodate UD tags
* added tests with dutch text
* signed contributor agreement

* 🐛 fix noun chunks generator

* built from scratch
* define noun chunk as a single Noun-Phrase
* includes some corner cases debugging (incorrect POS tagging)
* test with provided annotated sample (POS, DEP)

* fix failing test

* CI pipeline did not like the added sample file
* add the sample as a pytest fixture

* Update spacy/lang/nl/syntax_iterators.py

* Update spacy/lang/nl/syntax_iterators.py

Code readability

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/tests/lang/nl/test_noun_chunks.py

correct comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* finalize code

* change "if next_word" into "if next_word is not None"

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-07-14 14:01:02 +02:00
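
Once a Dutch pipeline with a parser is loaded, the new iterator is available through the usual property (assumes `nl_core_news_sm` is installed):

```python
import spacy

nlp = spacy.load("nl_core_news_sm")
doc = nlp("De snelle bruine vos springt over de luie hond")
# doc.noun_chunks now works for nl thanks to the syntax iterator above.
print([chunk.text for chunk in doc.noun_chunks])
```
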
Ines Montani
8ca6c58625 Merge pull request #8703 from thomashacker/update/spacy-stanza [ci skip]
Update spacy-stanza universe.json
2021-07-13 19:03:56 +10:00
Ines Montani
2a8eeed5da
Merge pull request #8703 from thomashacker/update/spacy-stanza [ci skip]
Update spacy-stanza universe.json
2021-07-13 19:03:42 +10:00
thomashacker
aafb89df78 Update universe.json code_example 2021-07-13 10:22:49 +02:00
KennethEnevoldsen
e5127992a0 added agreement 2021-07-13 10:11:02 +02:00
Kenneth Enevoldsen
94ce904e10
added missing comma 2021-07-13 09:59:34 +02:00
Kenneth Enevoldsen
a81fcc81b0
added dacy to universe 2021-07-13 09:54:08 +02:00
Adriane Boyd
f9fd2889b7
Use 0-vector for OOV lexemes (#8639) 2021-07-13 14:48:12 +10:00
Edward
8233359225
Fix preservation of spacy package meta (#8663)
* update package meta with existing_meta and nlp_meta

* Add spaCy contributor agreement

* Added more info when creating readme
2021-07-12 11:18:52 +02:00
Paul O'Leary McCann
1c70c87daf
Fix autoblack
The conditional needs double equals.
2021-07-10 16:02:39 +09:00
Ines Montani
616f4de034
Merge pull request #8674 from polm/fix/autoblack-no-forks [ci skip]
Make the autoblack job not run on forks
2021-07-10 16:41:59 +10:00
Paul O'Leary McCann
b8cdbb4bb6 Make the autoblack job not run on forks
The autoblack job is an occasional cleanup job. If it runs on forks and
those PRs are accepted, the git history will be weird, and that doesn't
help anyone.

The way to make the job not run on forks is a little non-obvious, but is
based on this thread:

https://github.com/prisma/prisma/issues/3539
2021-07-10 15:38:20 +09:00
Ines Montani
d8ae5750a6 Merge pull request #8665 from rynoV/patch-1 [ci skip] 2021-07-10 10:52:39 +10:00
Ines Montani
d4fecdfb82
Merge pull request #8665 from rynoV/patch-1 [ci skip] 2021-07-10 10:52:15 +10:00
Ines Montani
50000d37e4
Avoid double parentheses [ci skip] 2021-07-10 10:52:01 +10:00
Calum Sieppert
e2d53aa1a6
Typo fixes 2021-07-09 10:25:56 -06:00
Adriane Boyd
d8805a1073
Fix ru/uk lemmatizer mp with spawn (#8657)
Use an instance variable instead of a class variable for the morphological
analyzer so that multiprocessing with spawn is possible.
2021-07-09 15:36:56 +02:00
Adriane Boyd
b8e720fdb9
Fix Azerbaijani init, extend lang init tests (#8656)
* Extend langs in initialize tests

* Fix az init
2021-07-09 15:36:35 +02:00
Ines Montani
1c0ed22d1e
Merge pull request #8573 from julien-talkair/code-quality-pre-commit 2021-07-09 23:09:24 +10:00
Ines Montani
bbca56687f
Merge pull request #8655 from explosion/autoblack
Auto-format code with black
2021-07-09 23:08:05 +10:00
explosion-bot
334f1f98d8 Auto-format code with black 2021-07-09 08:06:06 +00:00
Adriane Boyd
363230de19 Add Macedonian models to website (#8637) 2021-07-08 09:32:41 +02:00
Adriane Boyd
1ee5bee29d
Add Macedonian models to website (#8637) 2021-07-08 09:32:14 +02:00
Suqi Sun
20a2beafb5 Update pip 2021-07-08 15:09:52 +09:00
Suqi Sun
c61ecb6f7c Update pip and code example 2021-07-08 15:09:52 +09:00
Suqi Sun
f011126ebd Add forte to universe.json 2021-07-08 15:09:52 +09:00
Paul O'Leary McCann
1d9209d43a
Merge pull request #8547 from mylibrar/update-universe
Add forte to universe.json
2021-07-08 14:59:49 +09:00
Ines Montani
cdc0d669c1 Add code preview for textcat_multilabel [ci skip] 2021-07-08 13:33:33 +10:00
Ines Montani
39c8f7949e Add code preview for textcat_multilabel [ci skip] 2021-07-08 13:33:25 +10:00
Calum Sieppert
e7f8573e01 Typo fixes 2021-07-08 12:53:23 +10:00
Ines Montani
bcd2be40b5
Merge pull request #8634 from rynoV/patch-1 [ci skip] 2021-07-08 12:52:59 +10:00
Calum Sieppert
889c187bc2
Typo fixes 2021-07-07 16:53:04 -06:00
julien-talkair
833f7f2918 👷 configure flake8 pre-commit
* uses setup.cfg for flake8 configuration during pre-commit
2021-07-07 21:31:46 +02:00
Ines Montani
2cbe250381
Merge pull request #8631 from adrianeboyd/chore/update-spacy-io-v3-1 2021-07-08 00:33:17 +10:00
Adriane Boyd
c2fa62e690 Revert "Restrict website to 3.0 models until 3.1 release"
This reverts commit a13f29bb54.
2021-07-07 15:31:40 +02:00
Adriane Boyd
065e94a6eb Merge branch 'master' into spacy.io 2021-07-07 15:30:25 +02:00
Ines Montani
530b5d72f6
Merge pull request #8624 from adrianeboyd/docs/v3-1-usage-updates [ci skip]
Update v3.1 usage docs
2021-07-07 16:50:36 +10:00
Adriane Boyd
6db647dfe0 Update v3.1 usage docs 2021-07-07 08:43:33 +02:00
Sofie Van Landeghem
64fac754fe
add spacy prefix to ngram_suggester.v1 (#8623) 2021-07-07 08:09:30 +02:00
julien-talkair
82b01964fa 🚨 adjust flake8 sensitivity
* pass arguments to flake8
* reproduce arguments from CI config
2021-07-06 22:41:54 +02:00
Sofie Van Landeghem
733e8ceea9
fix spancat initialize with labels (#8620) 2021-07-06 19:08:25 +02:00
Sofie Van Landeghem
608fc1d623
avoid msg var implicitness (#8619)
* avoid msg var implicitness

* rename local msg

* Add CI tests for debug data and train

* Adjust debug data CLI test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-07-06 19:08:08 +02:00
Sofie Van Landeghem
e7d747e3ee
TransitionBasedParser.v1 to legacy (#8586)
* TransitionBasedParser.v1 to legacy

* register sublayers

* bump spacy-legacy to 3.0.7
2021-07-06 15:26:45 +02:00
Ines Montani
04a9ade40f
Merge pull request #8466 from explosion/docs/new-in-v3-1 [ci skip] 2021-07-06 22:20:24 +10:00
Luca Dorigo
e8ef4a46d5
Add the right return type for Language.pipe and an overload for the as_tuples case (#8441)
* Add the right return type for Language.pipe and an overload for the as_tuples version

* Reformat, tidy up

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-07-06 14:18:40 +02:00
Sofie Van Landeghem
b9f59118bf
Fix silent evaluation (#8581)
* fix silentness

* sneak in docs typo fix

* pass silent boolean instead
2021-07-06 14:16:19 +02:00
Sofie Van Landeghem
3daf57d70c
Small spancat fixes (#8614)
* two small fixes + additional tests

* rename
2021-07-06 14:15:41 +02:00
Ines Montani
327f83573a
Move scores per type handling into util function (#8590) 2021-07-06 13:02:37 +02:00
Adriane Boyd
5fd0b5207e
Fix vectors check for sourced components (#8559)
* Fix vectors check for sourced components

Since vectors are not loaded when components are sourced, store a hash
for the vectors of each sourced component and compare it to the loaded
vectors after the vectors are loaded from the `[initialize]` block.

* Pop temporary info

* Remove stored hash in remove_pipe

* Add default for pop

* Add additional convert/debug/assemble CLI tests
2021-07-06 12:43:17 +02:00
Adriane Boyd
29906884c5
Raise an error for textcat with <2 labels (#8584)
* Raise an error for textcat with <2 labels

Raise an error if initializing a `textcat` component without at least
two labels.

* Add similar note to docs

* Update positive_label description in API docs
2021-07-06 12:35:22 +02:00
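
So a `textcat` component now has to look something like this before initialization (minimal sketch):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
# At least two labels are required; otherwise initialization raises.
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

examples = [
    Example.from_dict(nlp.make_doc("I love this"),
                      {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    Example.from_dict(nlp.make_doc("I hate this"),
                      {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
nlp.initialize(get_examples=lambda: examples)
```
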
Ines Montani
5bb7fe4b41 Update with HF hub integration [ci skip] 2021-07-06 19:30:59 +10:00
Cass
c84eb478eb Fix a command typo in models.md
"dowmload" -> "download"
2021-07-06 14:46:14 +09:00
Paul O'Leary McCann
3b1d5350d0
Merge pull request #8609 from mathcass/model-documentation-typo
Fix a command typo in models.md
2021-07-06 14:43:58 +09:00
Cass
7d13fc799b
Fix a command typo in models.md
"dowmload" -> "download"
2021-07-05 18:44:18 -07:00
Ines Montani
218ed81f84 Add docs notes on installing models from Python and in Jupyter [ci skip] (#8597) 2021-07-05 13:52:15 +02:00
Ines Montani
8423864b50
Add docs notes on installing models from Python and in Jupyter [ci skip] (#8597) 2021-07-05 13:49:20 +02:00
Ines Montani
f3a91b68a8 Merge pull request #8593 from yohasebe/patch-1 [ci skip] 2021-07-05 11:31:53 +10:00
Ines Montani
15108cd930
Merge pull request #8593 from yohasebe/patch-1 [ci skip] 2021-07-05 11:31:38 +10:00
Ines Montani
fdcd4003e5
Merge pull request #8592 from yohasebe/patch-2 [ci skip]
Adds contributor agreement yohasebe.md
2021-07-05 11:27:43 +10:00
Yoichiro Hasebe
596e04cbb4
Github repo info fixed for ruby-spacy 2021-07-04 18:55:17 +09:00
Yoichiro Hasebe
e541092088
Create yohasebe.md 2021-07-04 08:57:04 +09:00
Yoichiro Hasebe
2bdfa42107
Update universe.json 2021-07-04 08:44:39 +09:00
Ines Montani
3dcb747980
Merge pull request #8580 from explosion/autoblack
Auto-format code with black
2021-07-03 13:15:07 +10:00
explosion-bot
ee37288a1f Auto-format code with black 2021-07-02 07:48:26 +00:00
Ines Montani
c5c4e96597 Fix syntax [ci skip] 2021-07-02 17:46:56 +10:00
Ines Montani
6b905d67df Try workflow_dispatch and schedule [ci skip] 2021-07-02 17:45:27 +10:00
Ines Montani
70589e348e Commit as explosion-bot [ci skip] 2021-07-02 17:45:11 +10:00
Ines Montani
dd34a3a433 Try simpler approach [ci skip] 2021-07-02 17:40:49 +10:00
Ines Montani
2898331494 Improve logic [ci skip] 2021-07-02 17:37:35 +10:00
Ines Montani
519a9e29be Fix git login [ci skip] 2021-07-02 17:30:59 +10:00
Ines Montani
8961f36415 Commit manually in workflow [ci skip] 2021-07-02 17:27:48 +10:00
Ines Montani
2a5cbf1b0c Test different workflow trigger [ci skip] 2021-07-02 17:22:43 +10:00
Ines Montani
bbbaae0b5e Update triggers [ci skip] 2021-07-02 17:10:24 +10:00
Ines Montani
cdefb8cf1b Experimental: add autoblack.yml action [ci skip] 2021-07-02 17:07:05 +10:00
julien-talkair
6b1f9a5be0 add spacy contributor agreement 2021-07-01 17:41:12 +02:00
julien-talkair
a0004349f3 🎉 run code quality checks automatically
* installs black and flake8 as pre-commit hooks
* runs only on modified files
2021-07-01 17:37:43 +02:00
Adriane Boyd
2fc67e2aeb
Require thinc >=8.0.7 (#8572) 2021-07-01 16:55:09 +02:00
Suqi Sun
3901507df8 Update pip 2021-06-30 16:44:43 -04:00
Suqi Sun
61c868ed75 Update pip and code example 2021-06-30 14:49:51 -04:00
Ines Montani
af9d984407
Merge pull request #8405 from svlandeg/fix/whitespace_tokenizer [ci skip] 2021-06-30 20:52:59 +10:00
Adriane Boyd
2b8c679a3d
Fix duplicate spacy package CLI opts (#8551)
Use `-c` for `--code` and not additionally for `--create-meta`, in line
with the docs.
2021-06-30 11:23:26 +02:00
Suqi Sun
4331c40b78 Add forte to universe.json 2021-06-29 16:17:22 -04:00
Adriane Boyd
41292a1b84 Add note about updating with fill-config 2021-06-29 10:45:36 +02:00
Nick Sorros
bb781ae7f7
Remove extra parenthesis from the example for spacy-streamlit (#8527) 2021-06-28 14:03:31 +02:00
Ines Montani
7f65902702
Merge pull request #8522 from adrianeboyd/chore/update-flake8
Update flake8 version in reqs and CI
2021-06-28 21:46:06 +10:00
Ines Montani
8bc235dcc0
Merge pull request #8523 from adrianeboyd/chore/cleanup-v3.1.0 2021-06-28 21:45:38 +10:00
Adriane Boyd
86d01e9229 Tidy up with flake8: imports, comparisons, etc. 2021-06-28 12:08:15 +02:00
Adriane Boyd
4d1ef8f695 Tidy up docs 2021-06-28 12:08:15 +02:00
Adriane Boyd
5eeb25f043 Tidy up code 2021-06-28 12:08:15 +02:00
Adriane Boyd
4b0ed73ed4 Update flake8 version in reqs and CI
* Update some unneeded forward refs related to flake8 checks
2021-06-28 11:29:36 +02:00
Ines Montani
b8e8cd4482 Merge pull request #8505 from bryant1410/patch-2 [ci skip]
Fix double slash in model release web page
2021-06-28 12:51:25 +10:00
Ines Montani
93572dc12a
Merge pull request #8505 from bryant1410/patch-2 [ci skip]
Fix double slash in model release web page
2021-06-28 12:51:06 +10:00
Ines Montani
88ad41316c
Update issue template [ci skip] 2021-06-28 03:11:37 +02:00
Ines Montani
db6361ab6e
Update issue template [ci skip] 2021-06-28 03:10:52 +02:00
Ines Montani
2e453bda92
Update issue links [ci skip] 2021-06-28 03:09:48 +02:00
Ines Montani
bd510fcbf0
Merge pull request #8514 from polm/feature/github-discussions [ci skip] 2021-06-28 11:00:28 +10:00
Paul O'Leary McCann
0d3caa52a6 Update New Issue choices
This uses some new features related to Issue Templates to help direct
more people to Discussions.

1. Change the Discussions option to link to Discussions
2. Add a link to the FAQ
3. Disable blank issues
2021-06-27 14:41:33 +09:00
Kevin
edcaf9a1d7 Updated PyATE syntax to fit spaCy V3 2021-06-27 13:58:29 +09:00
Paul O'Leary McCann
75569f723a
Merge pull request #8512 from kevinlu1248/master
Updated PyATE syntax to fit spaCy V3 in spaCy universe
2021-06-27 13:56:17 +09:00
Paul O'Leary McCann
f144888793
Merge pull request #8504 from bryant1410/patch-1
Fix typo in comment
2021-06-27 13:51:19 +09:00
Paul O'Leary McCann
894caab475
Merge pull request #8507 from bryant1410/patch-3
Fix typo in `train_cli` docstring
2021-06-27 13:50:48 +09:00
Kevin
1a3e7cc5ef Updated PyATE syntax to fit spaCy V3 2021-06-26 17:52:41 -07:00
Santiago Castro
ee63b2b199
Fix typo in train_cli docstring 2021-06-25 22:45:03 -07:00
Santiago Castro
2e71944e1e
Fix double slash in model release web page 2021-06-25 19:19:10 -07:00
Santiago Castro
a2bc743e47
Fix typo in comment 2021-06-25 18:58:38 -07:00
Adrian Zuber
f5aee0bbdf
Raise custom error in EntityLinker when KB is not set (#8442)
* Raise custom error in EntityLinker when KB is not set

* add contributor agreement

* Update E1018 error message
2021-06-25 23:04:00 +02:00
Ines Montani
4544412442
Update wording [ci skip]
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-25 13:52:48 +10:00
Ines Montani
0d2e2b59bc Update intro [ci skip] 2021-06-24 22:53:20 +10:00
Matthew Honnibal
f9946154d9
Add SpanCategorizer component (#6747)
* Draft spancat model

* Add spancat model

* Add test for extract_spans

* Add extract_spans layer

* Upd extract_spans

* Add spancat model

* Add test for spancat model

* Upd spancat model

* Update spancat component

* Upd spancat

* Update spancat model

* Add quick spancat test

* Import SpanCategorizer

* Fix SpanCategorizer component

* Import SpanGroup

* Fix span extraction

* Fix import

* Fix import

* Upd model

* Update spancat models

* Add scoring, update defaults

* Update and add docs

* Fix type

* Update spacy/ml/extract_spans.py

* Auto-format and fix import

* Fix comment

* Fix type

* Fix type

* Update website/docs/api/spancategorizer.md

* Fix comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Better defense

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix labels list

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/extract_spans.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Set annotations during update

* Set annotations in spancat

* fix imports in test

* Update spacy/pipeline/spancat.py

* replace MaxoutLogistic with LinearLogistic

* fix config

* various small fixes

* remove set_annotations parameter in update

* use our beloved tupley format with recent support for doc.spans

* bugfix to allow renaming the default span_key (scores weren't showing up)

* use different key in docs example

* change defaults to better-working parameters from project (WIP)

* register spacy.extract_spans.v1 for legacy purposes

* Upd dev version so can build wheel

* layers instead of architectures for smaller building blocks

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Include additional scores from overrides in combined score weights

* Parameterize spans key in scoring

Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.

* Use the (intentionally very short) default spans key `sc` in the
  `SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
  returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
  the key in the `getter` instead.

Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.

* Update website/docs/api/scorer.md

* Fix scorer for spans key containing underscore

* Increment version

* Add Spans to Evaluate CLI (#8439)

* Add Spans to Evaluate CLI

* Change to spans_key

* Add spans per_type output

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix spancat GPU issues (#8455)

* Fix GPU issues

* Require thinc >=8.0.6

* Switch to glorot_uniform_init

* Fix and test ngram suggester

* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
2021-06-24 12:35:27 +02:00
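
In usage terms, the new component slots in like this (minimal sketch; the char offsets 0-5 cover "spaCy"):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat")  # predictions land in doc.spans["sc"] by default
spancat.add_label("TOPIC")

doc = nlp.make_doc("spaCy now has a span categorizer")
example = Example.from_dict(doc, {"spans": {"sc": [(0, 5, "TOPIC")]}})
nlp.initialize(get_examples=lambda: [example])

pred = nlp("spaCy now has a span categorizer")
# Untrained output is mostly empty; scores are exposed on the group's attrs.
print(pred.spans["sc"], pred.spans["sc"].attrs.get("scores"))
```
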
Ines Montani
68721af628 Formatting and preliminary intro [ci skip] 2021-06-24 20:32:23 +10:00
Adriane Boyd
92dc6b409e Notes on source with vectors 2021-06-24 10:34:07 +02:00
Adriane Boyd
35425d7e26 Add details for Catalan and Danish 2021-06-24 10:10:33 +02:00
Ines Montani
5daf450f51 Update upgrading notes [ci skip] 2021-06-24 18:06:28 +10:00
Adriane Boyd
172dfec4f2
Test download in CI with ca_core_news_sm (#8493) 2021-06-24 09:26:30 +02:00
Ines Montani
fb9b389f52
Merge pull request #8486 from adrianeboyd/bugfix/template-paths-vectors
Preserve paths.vectors/initialize.vectors setting in quickstart template
2021-06-24 13:12:18 +10:00
Ines Montani
528746129d Merge branch 'master' into docs/new-in-v3-1 2021-06-24 13:11:37 +10:00
Ines Montani
a8e8d02ba7
Merge pull request #8465 from explosion/feature/spacy-package-readme 2021-06-24 13:11:08 +10:00
Ines Montani
3e3d87a068 Update maintainer info [ci skip] 2021-06-24 12:37:55 +10:00
Ines Montani
3e058dee62 Update features [ci skip] 2021-06-24 12:36:04 +10:00
Ines Montani
40f13c3f0c Add docs [ci skip] 2021-06-24 11:57:15 +10:00
Ines Montani
3982be14e8 Improve fallbacks 2021-06-24 11:55:50 +10:00
Ines Montani
a1e4aca267 Fix sentence [ci skip] 2021-06-24 11:40:36 +10:00
Adriane Boyd
393c3c70d7
Various fixes for spans in Docs.from_docs (#8487)
* Fix spans offsets if a doc ends in a single space and no space is
  inserted
* Also include spans key in merged doc for empty spans lists
2021-06-23 15:51:35 +02:00
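
The affected path, for reference (a sketch; the `"key"` spans key is arbitrary):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc1 = nlp("First document ")   # ends in a trailing space
doc2 = nlp("second document")
doc1.spans["key"] = [doc1[0:2]]
doc2.spans["key"] = []  # empty span lists are now carried over too

# Span offsets in the merged Doc stay correct even when a doc ends in a
# single space and no extra space is inserted between docs.
merged = Doc.from_docs([doc1, doc2])
print(merged.spans["key"])
```
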
Adriane Boyd
5aa099505f Preserve paths.vectors/initialize.vectors setting in quickstart template 2021-06-23 11:07:14 +02:00
Adriane Boyd
a13f29bb54 Restrict website to 3.0 models until 3.1 release 2021-06-23 10:53:29 +02:00
Ines Montani
ca0d904faa Update details [ci skip] 2021-06-23 13:05:56 +10:00
Ines Montani
ed1ba13439
Merge pull request #8477 from themrmax/patch-1 [ci skip]
Fix broken link
2021-06-23 10:41:22 +10:00
themrmax
d96c422cfc
Fix broken link
change /api/registry to /api/top-level#registry
2021-06-22 15:34:06 -07:00
Ines Montani
e9b68d4f4c Update details and add example [ci skip] 2021-06-22 17:51:03 +10:00
Nick Sorros
5467bd3655 Switch model and data path in prodigy project.yml recipe (#8467) 2021-06-22 09:43:48 +02:00
Nick Sorros
31504f5982
Switch model and data path in prodigy project.yml recipe (#8467) 2021-06-22 09:41:45 +02:00
Ines Montani
bc93c34f54 Add "New in v3.1" guide 2021-06-22 15:23:18 +10:00
Ines Montani
cdcbd1023a Auto-generate README in spacy package 2021-06-22 12:06:25 +10:00
Adriane Boyd
caba63b74f
Set version to v3.1.0 (#8452)
* Update test for v3.1

* Set version to v3.1.0
2021-06-21 10:41:40 +02:00
Adriane Boyd
9fde258053
Use minor version for compatibility check (#8403)
* Use minor version for compatibility check

* Use minor version of compatibility table
* Soften warning message about incompatible models
* Add test for presence of current version in compatibility table

* Add test for download compatibility table

* Use minor version of lower pin in error message if possible

* Fall back to spacy_git_version if available

* Fix unknown version string
2021-06-21 09:39:22 +02:00
Adriane Boyd
ec71a6b572
Filter W036 for entity ruler, etc. (#8424) 2021-06-21 09:34:29 +02:00
Adriane Boyd
e39d1bd4ab
Various docs updates for v3.1 (#8406)
* Update for Catalan/Italian lemmatizer changes

* Add warning about relevance of section
2021-06-21 09:33:50 +02:00
Adriane Boyd
7abfa25035
Don't use the same vocab for source models (#8388)
* Don't use the same vocab for source models

The source models should not be loaded with the vocab from the current
pipeline because this loads the vectors from the source model into the
current vocab.

The strings are all copied in `Language.create_pipe_from_source`, so if
the vectors are configured correctly in the current pipeline, the
sourced component will work as expected. If there is a vector mismatch,
a warning is shown. (It's not possible to inspect whether the vectors
are actually used by the component, so a warning is the best option.)

* Update comment on source model loading
2021-06-21 09:33:33 +02:00
Ines Montani
5c46edbf0d Add link anchor [ci skip] 2021-06-20 11:29:31 +10:00
Ines Montani
02d2fdb123 Add link anchor [ci skip] 2021-06-20 11:29:19 +10:00
Adriane Boyd
83fd04dee5
Update package CLI handling of README and LICENSE (#8422)
* Copy rather than move files to top-level of package
* Add all files to `MANIFEST.in` (primarily for older versions of pip)
* Include the `README.md` contents as `long_description` in the setup
2021-06-18 15:48:53 +02:00
Adriane Boyd
30d4eb506a
Fix setting empty entities in Example.from_dict (#8426) 2021-06-18 10:41:50 +02:00
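
The case in question, as a sketch: an explicitly empty entity list should mean "annotated as containing no entities", not "unannotated":

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("No entities here")
# After the fix, this is treated as a positive annotation that there are
# no entities, rather than leaving the annotation missing.
example = Example.from_dict(doc, {"entities": []})
print(example.reference.ents)  # ()
```
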
Adriane Boyd
59da26ddad
Update spacy-lookups-data in Makefile (#8408) 2021-06-17 09:56:36 +02:00
Matthew Honnibal
6f5e308d17
Support negative examples in partial NER annotations (#8106)
* Support a cfg field in transition system

* Make NER 'has gold' check use right alignment for span

* Pass 'negative_samples_key' property into NER transition system

* Add field for negative samples to NER transition system

* Check neg_key in NER has_gold

* Support negative examples in NER oracle

* Test for negative examples in NER

* Fix name of config variable in NER

* Remove vestiges of old-style partial annotation

* Remove obsolete tests

* Add comment noting lack of support for negative samples in parser

* Additions to "neg examples" PR (#8201)

* add custom error and test for deprecated format

* add test for unlearning an entity

* add break also for Begin's cost

* add negative_samples_key property on Parser

* rename

* extend docs & fix some older docs issues

* add subclass constructors, clean up tests, fix docs

* add flaky test with ValueError if gold parse was not found

* remove ValueError if n_gold == 0

* fix docstring

* Hack in environment variables to try out training

* Remove hack

* Remove NER hack, and support 'negative O' samples

* Fix O oracle

* Fix transition parser

* Remove 'not O' from oracle

* Fix NER oracle

* check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation

* use set instead of list in consistency check

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-17 17:33:00 +10:00
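
The feature is exposed through the transition system's negative-samples key; wiring it up looks roughly like this (the config key name follows the v3.1 `ner` factory, and the `"incorrect_spans"` span key is arbitrary):

```python
import spacy

nlp = spacy.blank("en")
# Spans stored under this key in the training docs are treated as
# known-incorrect entities ("negative examples") by the NER oracle.
ner = nlp.add_pipe("ner", config={"incorrect_spans_key": "incorrect_spans"})
ner.add_label("ORG")
```
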
Adriane Boyd
02bac8f269
Fix non-deterministic deduplication in Greek lemmatizer (#8421) 2021-06-17 09:11:01 +02:00
svlandeg
bb9d2f1546 extend example to ensure the text is preserved 2021-06-16 23:56:35 +02:00
Adriane Boyd
994bed2fe2
Update dependencies (#8409)
* Require `thinc>=8.0.5`
* Use `spacy-lookups-data>=1.0.2`
2021-06-16 19:50:28 +02:00
Sofie Van Landeghem
e796aab4b3
Resizable textcat (#7862)
* implement textcat resizing for TextCatCNN

* resizing textcat in-place

* simplify code

* ensure predictions for old textcat labels remain the same after resizing (WIP)

* fix for softmax

* store softmax as attr

* fix ensemble weight copy and cleanup

* restructure slightly

* adjust documentation, update tests and quickstart templates to use latest versions

* extend unit test slightly

* revert unnecessary edits

* fix typo

* ensemble architecture won't be resizable for now

* use resizable layer (WIP)

* revert using resizable layer

* resizable container while avoiding shape inference trouble

* cleanup

* ensure model continues training after resizing

* use fill_b parameter

* use fill_defaults

* resize_layer callback

* format

* bump thinc to 8.0.4

* bump spacy-legacy to 3.0.6
2021-06-16 11:45:00 +02:00
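
A sketch of what in-place resizing enables, using the BOW architecture (assumed here to be one of the resizable ones; per the notes above, the ensemble architecture is explicitly not resizable yet):

```python
import spacy
from spacy.training import Example

config = {
    "model": {
        "@architectures": "spacy.TextCatBOW.v1",
        "exclusive_classes": True,
        "ngram_size": 1,
        "no_output_layer": False,
    }
}
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat", config=config)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
examples = [
    Example.from_dict(nlp.make_doc("great"),
                      {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
]
nlp.initialize(get_examples=lambda: examples)

# The point of the PR: adding a label after initialization resizes the
# model in place, preserving predictions for the existing labels.
textcat.add_label("NEUTRAL")
```
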
Giovanni Toffoli
19521d525b
Added Italian POS-aware lemmatizer. (#8079)
* Added Italian POS-aware lemmatizer.

Also added the code used to build the lookup tables by POS.

* Create gtoffoli.md

* Add imports and format

* Remove helper script

* Use lemma_lookup instead of lemma_lookup_legacy

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-16 11:14:45 +02:00
svlandeg
29d83dec0c adjust whitespace tokenizer to avoid sep in split() 2021-06-16 10:58:45 +02:00
Antti Ajanki
5a6125c227
[Finnish tokenizer] Handle conjunction contractions (#8105) 2021-06-16 10:56:47 +02:00
Adriane Boyd
b09be3e1cb
Merge pull request #8397 from adrianeboyd/chore/develop-into-master-v3.1
Merge develop into master for v3.1
2021-06-16 10:54:47 +02:00
Adriane Boyd
33240ed2c5 Temporarily skip model download test 2021-06-16 10:14:42 +02:00
Adriane Boyd
5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Adriane Boyd
6b69b8934b
Set version to v3.1.0.dev0 (#8379) 2021-06-14 11:17:35 +02:00
Adriane Boyd
63d748f80e
Add Catalan and Danish trf to website models (#8378) 2021-06-14 09:50:13 +02:00
Ines Montani
b01ca625ea Update YouTube embed [ci skip] 2021-06-14 10:21:52 +10:00
Ines Montani
f247545123 Fix universe.json and auto-format [ci skip] 2021-06-14 10:21:52 +10:00
Adriane Boyd
b98d216205
Update Catalan language data (#8308)
* Update Catalan language data

Update Catalan language data based on contributions from the Text Mining
Unit at the Barcelona Supercomputing Center:

https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data

* Update tokenizer settings for UD Catalan AnCora

Update for UD Catalan AnCora v2.7 with merged multi-word tokens.

* Update test

* Move prefix pattern to more generic infix pattern

* Clean up
2021-06-11 10:21:22 +02:00
Adriane Boyd
d9be9e6cf9
Move README.md and LICENSES_SOURCES in package (#8297)
In addition to `LICENSE`, move the files `README.md` and
`LICENSES_SOURCES` to the top directory in `spacy package` if present in
the model directory.
2021-06-11 10:20:24 +02:00
Adriane Boyd
dbbeab2506
Merge pull request #8285 from adrianeboyd/feature/refactor-logger-warnings
Refactor warnings
2021-06-11 10:20:02 +02:00
Francisco Aranda
704d605599 update spacy-wordnet code example (#8327)
* update spacy-wordnet code example

- include spaCy 2.x and 3.x init alternatives
- upgrade recognai logo

* fix escape chars
2021-06-10 21:55:00 +02:00
Adriane Boyd
9dfd3c9484 Use warnings.warn instead of logger.warning 2021-06-04 17:44:08 +02:00
Sofie Van Landeghem
f0277bdeab Show warning if entity_ruler runs without patterns (#7807)
* Show warning if entity_ruler runs without patterns

* Show warning if matcher runs without patterns

* fix wording

* unit test for warning once (WIP)

* warn W036 only once

* cleanup

* create filter_warning helper
2021-06-04 17:37:38 +02:00
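
The warning (W036) fires when the ruler has nothing to do; adding patterns the usual way avoids it (sketch):

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
# Without this, running the pipeline now warns (W036) once instead of
# silently matching nothing.
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}])
doc = nlp("spaCy is an NLP library")
print([(ent.text, ent.label_) for ent in doc.ents])
```
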
Adriane Boyd
07082c9692
Exclude generated .cpp files from package (#8271) 2021-06-04 14:56:07 +02:00
Paul O'Leary McCann
d959603d51
Don't add duplicate patterns all the time in EntityRuler (fix #8216) (#8246)
* Don't add duplicate patterns (fix #8216)

* Refactor EntityRuler init

This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.

* Make EntityRuler.clear reset matchers

Includes a new test for this.

* Tidy PhraseMatcher instantiation

Since the attr can be None safely now, the guard if is no longer
required here.

Also renamed the `_validate` attr. Maybe it's not needed?

* Fix NER test

* Add test to make sure patterns aren't increasing

* Move test to regression tests
2021-06-03 09:05:26 +02:00
Paul O'Leary McCann
d54631f68b
Fix other open calls without context managers (#8245) 2021-05-31 19:04:29 +10:00
Kristian Boda
0035db4103 Add hmrb to spaCy Universe (#8129)
* docs: add hmrb to spacy universe

* docs: add sentence on spacy versions

* docs: update description and images

* misc: add spaCy Contributor Agreement
2021-05-31 10:41:34 +02:00
Paul O'Leary McCann
04239e94c7
Use a context manager when reading model (fix #7036) (#8244) 2021-05-31 17:36:17 +10:00
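The fix replaces bare open() calls with the usual context-manager pattern; a generic sketch (not the actual spaCy code):

```python
from pathlib import Path

def read_model_bytes(path: Path) -> bytes:
    # The context manager closes the handle even if reading raises,
    # avoiding leaked file descriptors and ResourceWarnings.
    with path.open("rb") as file_:
        return file_.read()
```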
Sofie Van Landeghem
4b81f58eda fix docs (#8200) 2021-05-27 10:50:46 +02:00
Paul O'Leary McCann
ee62344970 Fix skweak GitHub URL
GitHub entry should not contain a URL, just user/repo
2021-05-24 20:31:43 +09:00
Paul O'Leary McCann
68ccfc4c39 Fix docs (fix #8189) 2021-05-24 19:49:21 +09:00
Adriane Boyd
cd6bd91c3a
Switch default train corpus max_length to 0 in quickstart (#8142)
The behavior of `spacy.Corpus.v1` is unexpected enough for `max_length
!= 0` that `0` is a better default for users creating a new config with
the quickstart.

If not, documents are skipped, sometimes the entire corpus is skipped,
and sometimes documents are (quite unexpectedly for your average user)
split into sentences.
2021-05-20 14:48:09 +02:00
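For reference, the setting lives in the `[corpora]` block of a v3 config; a sketch of the quickstart-generated section (values illustrative):

```ini
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
# 0 disables the length limit: no documents are skipped or split into sentences
max_length = 0
```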
Adriane Boyd
4e69fcaa50 Disable GPU CI tests (#8143) 2021-05-19 12:00:31 +02:00
Adriane Boyd
e0b0892ef2 Minor updates to quickstart settings/instructions (#7965)
* Minor updates to quickstart settings/instructions

* set default value of textcat exclusive to `false` until the default
checkbox behavior is updated
* add the `morphologizer` to the list of components
* add a note that v3.0.6+ is required

* Switch to warning above quickstart

* Undo changes to textcat default in quickstart

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-05-17 16:55:50 +02:00
Adriane Boyd
1d59fdbd39
Update Vietnamese tokenizer (#8099)
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese

Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
2021-05-17 18:16:20 +10:00
Ines Montani
47bae9f8ec Merge pull request #8096 from juliensalinas/master [ci skip] 2021-05-17 13:59:03 +10:00
Paul O'Leary McCann
0b77843768 Minor formatting fixes 2021-05-14 20:12:58 +09:00
Frederic R. Hopp
44b66ce78f Update universe.json
fixed typo
2021-05-14 19:59:34 +09:00
Frederic R. Hopp
78bb4275d5 Update universe.json
Added more detailed description to eMFDscore project
2021-05-14 19:59:34 +09:00
Frederic R. Hopp
93d9860cba Update universe.json 2021-05-14 19:59:33 +09:00
Ines Montani
77ee7c872b Fix default transformer in quickstart generator (resolves #8018) [ci skip] 2021-05-11 11:27:30 +10:00
Adriane Boyd
40ca23bde0 Fix new version for match_alignments (#8021) 2021-05-07 09:56:22 +02:00
Jeno Pizarro
7cc8df1a28 Update negspacy example code for spaCy 3.0 (#8022) 2021-05-07 09:35:07 +02:00
meghanabhange
46311cf03f Update details in universe denomme | Multilingual Name Detection (#7982)
* Add denomme

* spaCy contributor agreement

* Update install and thumb

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-05-05 17:14:14 +02:00
Ines Montani
62a01956c3 Fix quickstart default checked of conditional fields [ci skip] 2021-05-03 14:04:45 +02:00
Adriane Boyd
ffaa0d6b9b Fix Transformer.initialize example (#7963) 2021-04-30 12:21:59 +02:00
Janis Klaise
b33fb9ac1e Update load_lookups return type and docstring (#7907)
* Update load_lookups return type and docstring

* Add contributor agreement
2021-04-27 09:14:59 +02:00
Adriane Boyd
946a4284be Set spacy-legacy to >=3.0.5 (#7897)
Set `spacy-legacy` to `>=3.0.5` due to `spacy.StaticVectors.v1` init bug.
2021-04-26 18:25:39 +02:00
Adriane Boyd
ae855a4625
Clean up Morphology imports and definitions (#7441)
* Clean up Morphology imports and definitions

* Whitespace formatting
2021-04-26 16:54:23 +02:00
Adriane Boyd
ceee1ecf17
Replace cpdef variables with cdef (#7834) 2021-04-26 16:54:02 +02:00
Adriane Boyd
95c0833656
Add training option to set annotations on update (#7767)
* Add training option to set annotations on update

Add a `[training]` option called `set_annotations_on_update` to specify
a list of components for which the predicted annotations should be set
on `example.predicted` immediately after that component has been
updated. The predicted annotations can be accessed by later components
in the pipeline during the processing of the batch in the same `update`
call.

* Rename to annotates / annotating_components

* Add test for `annotating_components` when training from config

* Add documentation
2021-04-26 16:53:53 +02:00
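A sketch of how the renamed option appears in a training config (component names illustrative):

```ini
[training]
# Predictions from these components are set on example.predicted during
# update, so later components in the same batch can use them.
annotating_components = ["parser"]
```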
Jacopo Farina
c105ed10fd
Remove torino from stop words (#7634)
Torino is the proper name of a city and the token has no other meaning
2021-04-26 16:53:43 +02:00
Sofie Van Landeghem
e0b29f8ef7
Fix scoring normalization (#7629)
* fix scoring normalization

* score weights by total sum instead of per component

* cleanup

* more cleanup
2021-04-26 16:53:38 +02:00
Sofie Van Landeghem
95e3cf576b
Optionally append lang for packaged model name (#7417)
* Add empty lines at the end of Python files

* Only prepend the lang code if it's not there already

* Update spacy/cli/package.py

* fix whitespace stripping
2021-04-26 16:53:21 +02:00
Adriane Boyd
29ac7f776a Merge branch 'master' into spacy.io 2021-04-24 12:58:47 +02:00
Sofie Van Landeghem
047d912904 fix typo in entity_linker docs 2021-04-22 10:10:31 +02:00
Sofie Van Landeghem
47bbc46392 update EL training data format in docs (#7839)
* update EL training data format

* fix typo

* all -1 because reasons
2021-04-22 08:50:31 +02:00
meghanabhange
7985e6bb39 Project Idea : denomme | Multilingual Name Detection (#7845)
* Add denomme

* spaCy contributor agreement

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-22 08:48:41 +02:00
Sam Edwardes
05c609cdeb Added a logo to spaCyTextBlob (#7818)
* Added a logo to spaCyTextBlob

* Updated to better thumb
2021-04-22 08:42:14 +02:00
Diego Palma
ac101cba00 Add TRUNAJOD to spaCy universe. (#7754)
* Add TRUNAJOD to spaCy universe.

* Add trunajod logo and thumb.

Co-authored-by: Diego <dpalma@evernote.com>
2021-04-22 08:41:03 +02:00
Ines Montani
ee68dc260f Auto-format [ci skip] 2021-04-22 10:58:18 +10:00
Ines Montani
3931fa146b Merge branch 'spacy.io' of https://github.com/explosion/spaCy into spacy.io 2021-04-22 10:57:25 +10:00
Ines Montani
c3f7d33f8e Merge pull request #7851 from plison/master [ci skip] 2021-04-22 10:57:08 +10:00
Pierre Lison
663a160867 adding skweak to the spaCy universe 2021-04-22 10:57:08 +10:00
Pierre Lison
bb961a2c11 adding skweak to the spaCy universe 2021-04-22 10:57:08 +10:00
Shantam Raj
5aac993604 Default code for Setting Entity annotations on the website errors (#7738)
* the default example for "Setting entity annotations" errors on Binder

* updating contributor info

* using a new variable to store original entities
2021-04-21 09:18:22 +02:00
Ines Montani
1c1087e4ff Merge pull request #7826 from richardpaulhudson/master
Add entry for Coreferee project to universe.json
2021-04-21 16:23:09 +10:00
hudsonr
1eaf6e5ccb Added universe entry for Coreferee 2021-04-21 16:23:09 +10:00
langdonholmes
cef9f25ec0 Update processing-pipelines.md to mention method for doc metadata (#7480)
* Update processing-pipelines.md

Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True)

Link to a new example on the attributes page detailing the following:

> ```
> data = [
>   ("Some text to process", {"meta": "foo"}),
>   ("And more text...", {"meta": "bar"})
> ]
> 
> for doc, context in nlp.pipe(data, as_tuples=True):
>     # Let's assume you have a "meta" extension registered on the Doc
>     doc._.meta = context["meta"]
> ```

from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as

* Updating the attributes section

Update the attributes section with example of how extensions can be used to store metadata.

* Update processing-pipelines.md

* Update processing-pipelines.md

Made as_tuples example executable and relocated to the end of the "Processing Text" section.

* Update processing-pipelines.md

* Update processing-pipelines.md

Removed extra line

* Reformat and rephrase

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-19 12:00:45 +02:00
Sofie Van Landeghem
fd6eebbfdc expand quickstart widget with cuda 11.1 and 11.2 (#7615) 2021-04-09 20:36:55 +02:00
Ines Montani
13949e9f9b Add more link anchors [ci skip] 2021-04-06 14:15:29 +10:00
Ines Montani
1e0a478805 Update pipeline design docs [ci skip] 2021-04-06 14:13:43 +10:00
Ines Montani
e9496feca6 Fix formatting [ci skip] 2021-04-06 14:13:36 +10:00
Jaidev Deshpande
089d345491 Add Numerizer to spaCy universe (#7650)
Numerizer is a spaCy extension that converts numbers written in natural language
into numeric strings.
2021-04-05 19:03:15 +02:00
Sam Edwardes
7ab17a2ea1 Updates to universe.json for spaCyTextBlob (#7647)
* Updates to universe.json for spaCyTextBlob

Updated the documentation for spaCy 3.0.

* SamEdwardes.md

* Update SamEdwardes.md
2021-04-04 20:19:00 +02:00
vincent d warmerdam
15e6b7284e Add Tokenwiser to Projects (#7541)
* Add tokenwiser

* Update universe.json
2021-04-01 14:42:43 +02:00
Sofie Van Landeghem
82eb4c37d9 Legacy docs (#7601)
* document legacy Tok2Vec architectures

* add TextCatEnsemble.v1 legacy documentation

* Separate legacy section in side bar
2021-03-30 12:43:50 +02:00
Álvaro Abella Bascarán
24dd491b09 fix fn name: tokenizer.infixes_finditer -> tokenizer.infix_finditer (#7606) 2021-03-30 09:46:15 +02:00
Ines Montani
4cb7125f7a Merge branch 'master' into spacy.io 2021-03-22 22:47:30 +11:00
Ines Montani
6db5414668 Merge branch 'master' into spacy.io 2021-03-19 12:09:03 +11:00
bsweileh
42fcff6f8a Update _training.md - Fix broken link on backpropagation (#7431)
* Update _training.md

Fix broken link on backpropagation

* Add agreement

add spacy contributor agreement
2021-03-15 09:24:12 +01:00
Ines Montani
c32cbac14f Merge branch 'master' into spacy.io 2021-03-10 12:22:21 +11:00
Ines Montani
08d6036de3 Merge branch 'master' into spacy.io 2021-03-06 17:39:03 +11:00
vincent d warmerdam
b97e23d380 Removed Languages that were listed twice on Docs (#7272)
* removed languages that were listed twice

* sorted

* d0h

* the d0h strikes back when you don't hit save
2021-03-05 14:35:09 +01:00
Ines Montani
9280e844fb Merge branch 'master' into spacy.io 2021-03-03 23:15:25 +11:00
Ken
4927dcc8c2 Update sentencizer documentation example with sentencizer pipe name (#7185) 2021-02-24 08:08:25 +01:00
Tocic
8109f0bcf2 fix typo in models.md (#7157) 2021-02-22 09:01:34 +01:00
palandlom
d18ccdf822 var batch is useless (#7111)
It seems that nlp.update(examples) should be nlp.update(batch)
2021-02-18 09:45:58 +01:00
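The docs example in question follows the usual minibatch training loop; a self-contained sketch (pipeline, data, and sizes are illustrative):

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POS")
textcat.add_label("NEG")
train_data = [
    ("great stuff", {"cats": {"POS": 1.0, "NEG": 0.0}}),
    ("awful stuff", {"cats": {"POS": 0.0, "NEG": 1.0}}),
]
train_examples = [
    Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data
]
optimizer = nlp.initialize(lambda: train_examples)
losses = {}
for _ in range(5):
    random.shuffle(train_examples)
    for batch in minibatch(train_examples, size=2):
        nlp.update(batch, sgd=optimizer, losses=losses)  # the batch, not the full list
print(losses)
```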
Ines Montani
e6526b0831 Merge branch 'master' into spacy.io 2021-02-17 23:42:41 +11:00
Rajat
de2874f0f8 updated code eg & description of contextualSpellCheck (#7096) 2021-02-17 13:27:56 +01:00
Ines Montani
0c7937c74d Merge branch 'master' into spacy.io 2021-02-14 14:39:46 +11:00
Ines Montani
3246cf8b2b Merge branch 'master' into spacy.io 2021-02-14 13:38:33 +11:00
Ines Montani
9fbee83f8a Merge branch 'master' into spacy.io 2021-02-05 13:35:39 +11:00
Ines Montani
a0feb72623 Merge branch 'master' into spacy.io 2021-02-03 23:59:36 +11:00
Ines Montani
c0220dddcb Merge branch 'master' into spacy.io 2021-02-02 14:28:00 +11:00
Ines Montani
a47e449431 Merge branch 'master' into spacy.io 2021-02-02 00:28:42 +11:00
Ines Montani
cce428298b Merge branch 'v2.x' into spacy.io 2021-02-01 11:48:56 +11:00
Ines Montani
c70e6ee72d Fix code branch for v2.x site [ci skip] 2021-02-01 11:48:35 +11:00
Ines Montani
6daf2381fa Update meta [ci skip] 2021-01-30 20:18:01 +11:00
Ines Montani
7d28fc121f Update netlify.toml [ci skip] 2021-01-30 19:59:47 +11:00
Ines Montani
fba7550537 Set to legacy [ci skip] 2021-01-30 19:57:14 +11:00
Ines Montani
06c2eae08f Merge branch 'master' into spacy.io 2021-01-14 13:38:59 +11:00
Ines Montani
30cb6ced1e Merge branch 'master' into spacy.io 2021-01-14 11:38:24 +11:00
Sofie Van Landeghem
fa3b374c8a fix backticks in docs (#6635) 2020-12-27 22:13:34 +01:00
Sofie Van Landeghem
aa50aca519 fix documentation of 'path' in tokenizer.to_disk (#6634) 2020-12-27 22:05:05 +01:00
Tim Gates
c7feeeb660 docs: fix simple typo, speficied -> specified (#6611)
There is a small typo in spacy/cli/info.py.

Should read `specified` rather than `speficied`.
2020-12-22 10:05:56 +01:00
Gareth Sparks
2e4e630049 Doc.char_span arg: alignment_mode (#6591)
Currently labeled "mode", actually "alignment_mode"
2020-12-18 09:57:46 +01:00
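For reference, a small example of the argument (offsets illustrative):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("spaCy is great")
# "strict" (the default) returns None unless the offsets sit exactly on
# token boundaries; "expand" widens the span to the covering tokens.
print(doc.char_span(0, 3))                           # None
print(doc.char_span(0, 3, alignment_mode="expand"))  # spaCy
```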
Ines Montani
a778d66e65 Merge branch 'master' into spacy.io 2020-12-16 17:32:53 +11:00
Ines Montani
42c20ca2b2 Merge branch 'master' into spacy.io 2020-12-15 19:32:56 +11:00
Adriane Boyd
2668958ae6 Docs and extras updates for v2.3.5
* Update install instructions for updated packages

* Add `cuda110` and `cuda111` extras, remove upper `cupy` pins (only
compatible with `thinc>=7.4.4`)
2020-12-14 08:44:56 +01:00
Ines Montani
7b4a2f4150 Merge pull request #6545 from svlandeg/feature/discussions [ci skip] 2020-12-11 10:24:12 +11:00
svlandeg
035a78c40f Merge branch 'spacy.io' of https://github.com/explosion/spaCy into spacy.io 2020-11-24 16:18:27 +01:00
Jacob Bortell
8790fd571c Add jabortell to the contributors (#6422)
* Add jabortell to the contributors

* Update jabortell.md

Added tick to applicable statement
2020-11-24 16:18:15 +01:00
Jacob Bortell
50e3b624b4
Update rule-based-matching.md (#6421)
* Update rule-based-matching.md

Clarified case-sensitivity of dictionary-referencing attributes (POS/TAG/DEP/etc).

Clarified "Type" column header to "Value Type"

* Update rule-based-matching.md

Improved clarity of wording
2020-11-24 16:16:06 +01:00
Yusuke Mori
ee84f8f4cb Avoid a SyntaxError in self-attentive-parser (#6428)
* Avoid a SyntaxError in self-attentive-parser

Fix a usage of quotation marks in the example of spaCy Universe self-attentive-parser

* Create forest1988.md

Fill in the spaCy contributor agreement
2020-11-22 22:02:03 +01:00
M. Revuelta Espinosa
d940bc3f87 Update universe.json (include PatternOmatic) (#6399)
Request to include PatternOmatic in spaCy Universe

Adds @revuel to contributors
2020-11-19 13:25:48 +01:00
Adriane Boyd
c2eb0992ae Fix JSON in #6395 2020-11-17 15:24:38 +01:00
Sam Edwardes
c3d9550f30 Added spaCyTextBlob to universe.json (#6395) 2020-11-17 14:38:59 +01:00
Alec Chapman
8b919d77c1 add medspacy to universe and fix example w/ cov-bsv 2020-11-10 09:49:39 +08:00
Adriane Boyd
e4c3d6748c Update TIGER link and tag description (#6344) 2020-11-05 09:33:45 +01:00
Adriane Boyd
58a7461cff Add Macedonian to website languages 2020-10-29 08:51:26 +01:00
Adriane Boyd
94aa4c7410 Add Nepali to supported languages on website (#6315) 2020-10-29 08:51:15 +01:00
Kunal Sharma
1b8f1f6f1b Adding MindMeld to Universe JSON (#6275)
* Adding Mindmeld to Universe JSON

Mindmeld is a conversational AI platform for deep-domain voice interfaces and chatbots. https://www.mindmeld.com/

* Signing contribution agreement.

Co-authored-by: kunshar2 <kunshar2@cisco.com>
2020-10-29 08:50:59 +01:00
Ines Montani
9227f9a3ca Update landing [ci skip] 2020-10-16 11:46:55 +02:00
Ines Montani
8eff159603 Merge branch 'spacy.io' of https://github.com/explosion/spaCy into spacy.io 2020-10-15 12:44:36 +02:00
Ines Montani
57eec9cc14 Update .gitignore 2020-10-15 12:44:32 +02:00
Ines Montani
ac77be48f2 Update netlify.toml [ci skip] 2020-10-15 12:44:03 +02:00
delzac
e4bea595aa Reflect in usage doc that the IS_SENT_START attribute exists (#6114)
* Reflect in usage doc that the IS_SENT_START attribute exists

* Create delzac.md
2020-10-06 15:11:31 +02:00
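A minimal sketch of using the attribute in a Matcher pattern (pattern and text are illustrative):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # sets sentence starts, so IS_SENT_START is defined
matcher = Matcher(nlp.vocab)
matcher.add("SENT_INITIAL_THE", [[{"IS_SENT_START": True, "LOWER": "the"}]])
doc = nlp("The cat sat. Then the dog barked.")
print([doc[start:end].text for _, start, end in matcher(doc)])  # ['The']
```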
Šarūnas Navickas
5d2fe53547 Website (Universe): An entry for rita-dsl (#6138)
* Create zaibacu.md

* Add RITA-DSL entry

* Update agreement

* Fix formatting
2020-10-06 11:25:05 +02:00
Ines Montani
1aef484985 Fix version check in models directory [ci skip] 2020-09-25 09:24:11 +02:00
Marek Grzenkowicz
f5d230ba00 Clarify how to choose pretrained weights files (closes #6027) [ci skip] (#6039) 2020-09-08 21:15:57 +02:00
Ines Montani
1a9897467d Update link.js 2020-09-04 14:56:00 +02:00
Ines Montani
6a737c01eb Replace docs analytics [ci skip] 2020-09-04 14:55:13 +02:00
Brad Jascob
a6e437b2bf Updates spaCy Universe for amrlib (#6020)
* Updates spaCy Universe for amrlib

* Updates to doc based on feedback
2020-09-04 10:06:55 +02:00
Bram Vanroy
d7926de1e5 Update universe details spacy_conll (#5871) 2020-08-05 14:34:34 +02:00
Li Zhe
296f8b65b4 fix the wrong hash url in adding-languages.md file (#5810)
* fix the wrong hash url in adding-languages.md file

change the #101 url hash path to #language-data

* filled in the spaCy Contributor Agreement
2020-08-02 23:15:56 +02:00
Adriane Boyd
315d7d611f Normalize spelling for spaCy (#5822) 2020-07-27 10:11:23 +02:00
Martino Mensio
b57b994d38 Sentence transformers added to spaCy universe (#5814)
* fix details for spacy-universal-sentence-encoder

* added sentence-transformers
2020-07-27 10:11:15 +02:00
Nipun Sadvilkar
1bfc177b10 ✏️ typo in pysbd code example (#5821) 2020-07-27 10:11:09 +02:00
Alec Chapman
199f3ff7de Add VA COVID-19 NLP project to spaCy Universe (#5777)
* Update universe.json

Add cov-bsv to "resources"

* Update universe.json

* add contributor agreement
2020-07-19 13:38:42 +02:00
Ines Montani
73098dbaf6 Add Plausible 2020-07-18 23:53:27 +02:00
gandersen101
9ce10207bf Fix quote issue in spaczz universe.json 2020-07-08 11:36:18 +02:00
Ines Montani
f1653d281f Fix and update universe.json [ci skip] 2020-07-07 21:12:56 +02:00
Jonathan Besomi
f904f1f361 Add texthero to universe.json (#5716)
* Add texthero to universe.json

* Add spaCy contributor Agreement
2020-07-07 20:57:45 +02:00
gandersen101
9cfd294e59 Adding spaczz package to universe.json (#5717)
* Adding spaczz package to universe.json

* Adding contributor agreement.
2020-07-07 20:57:36 +02:00
Ines Montani
295279f74b Update netlify.toml [ci skip] 2020-07-01 22:06:43 +02:00
Ines Montani
0e28edd2cb Update netlify.toml [ci skip] 2020-07-01 13:34:52 +02:00
1119 changed files with 97723 additions and 51810 deletions

.github/FUNDING.yml (new file)

@ -0,0 +1 @@
custom: [https://explosion.ai/merch, https://explosion.ai/tailored-solutions]

(deleted file)

@ -1,18 +0,0 @@
<!--- Please provide a summary in the title and describe your issue here.
Is this a bug or feature request? If a bug, include all the steps that led to the issue.
If you're looking for help with your code, consider posting a question here:
- GitHub Discussions: https://github.com/explosion/spaCy/discussions
- Stack Overflow: http://stackoverflow.com/questions/tagged/spacy
-->
## Your Environment
<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type
`python -m spacy info --markdown` and copy-paste the result here.-->
- Operating System:
- Python Version Used:
- spaCy Version Used:
- Environment Information:

(modified file)

@ -1,14 +1,16 @@
---
name: "\U0001F6A8 Bug Report"
about: Did you come across a bug or unexpected behaviour differing from the docs?
name: "\U0001F6A8 Submit a Bug Report"
about: Use this template if you came across a bug or unexpected behaviour differing from the docs.
---
<!-- NOTE: For questions or install related issues, please open a Discussion instead. -->
## How to reproduce the behaviour
<!-- Include a code example or the steps that led to the problem. Please try to be as specific as possible. -->
## Your Environment
<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->
<!-- Include details of your environment. You can also type `python -m spacy info --markdown` and copy-paste the result here.-->
* Operating System:
* Python Version Used:
* spaCy Version Used:

(modified file)

@ -1,5 +1,5 @@
---
name: "\U0001F4DA Documentation"
name: "\U0001F4DA Submit a Documentation Report"
about: Did you spot a mistake in the docs, is anything unclear or do you have a
suggestion?

(deleted file)

@ -1,19 +0,0 @@
---
name: "\U0001F4AC Anything else?"
about: For feature and project ideas, general usage questions or help with your code, please post on the GitHub Discussions board instead.
---
<!-- Describe your issue here. Please keep in mind that the GitHub issue tracker is mostly intended for reports related to the spaCy code base and source, and for bugs and enhancements. If you're looking for help with your code, consider posting a question here:
- GitHub Discussions: https://github.com/explosion/spaCy/discussions
- Stack Overflow: http://stackoverflow.com/questions/tagged/spacy
-->
## Your Environment
<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->
- Operating System:
- Python Version Used:
- spaCy Version Used:
- Environment Information:

.github/ISSUE_TEMPLATE/config.yml (new file)

@ -0,0 +1,14 @@
blank_issues_enabled: false
contact_links:
- name: 🗯 Discussions Forum
url: https://github.com/explosion/spaCy/discussions
about: Install issues, usage questions, general discussion and anything else that isn't a bug report.
- name: 📖 spaCy FAQ & Troubleshooting
url: https://github.com/explosion/spaCy/discussions/8226
about: Before you post, check out the FAQ for answers to common community questions!
- name: 💫 spaCy Usage Guides & API reference
url: https://spacy.io/usage
about: Everything you need to know about spaCy and how to use it.
- name: 🛠 Submit a Pull Request
url: https://github.com/explosion/spaCy/pulls
about: Did you spot a mistake and know how to fix it? Feel free to submit a PR straight away!

(modified file)

@ -14,6 +14,6 @@ or new feature, or a change to the documentation? -->
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I confirm that I have the right to submit this contribution under the project's MIT license.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

(deleted file)

@ -1,88 +0,0 @@
parameters:
  python_version: ''
  architecture: 'x64'

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: ${{ parameters.python_version }}
      architecture: ${{ parameters.architecture }}
  - bash: |
      echo "##vso[task.setvariable variable=python_version]${{ parameters.python_version }}"
    displayName: 'Set variables'
  - script: |
      python -m pip install -U build pip setuptools
      python -m pip install -U -r requirements.txt
    displayName: "Install dependencies"
  - script: |
      python -m build --sdist
    displayName: "Build sdist"
  - task: DeleteFiles@1
    inputs:
      contents: "spacy"
    displayName: "Delete source directory"
  - task: DeleteFiles@1
    inputs:
      contents: "*.egg-info"
    displayName: "Delete egg-info directory"
  - script: |
      python -m pip freeze > installed.txt
      python -m pip uninstall -y -r installed.txt
    displayName: "Uninstall all packages"
  - bash: |
      SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
      python -m pip install dist/$SDIST
    displayName: "Install from sdist"
  - script: |
      python -W error -c "import spacy"
    displayName: "Test import"
  - script: |
      python -m spacy download es_core_news_sm
      python -c "import spacy; nlp=spacy.load('es_core_news_sm'); doc=nlp('test')"
    displayName: 'Test download CLI'
    condition: eq(variables['python_version'], '3.8')
  - script: |
      python -m spacy convert extra/example_data/ner_example_data/ner-token-per-line-conll2003.json .
    displayName: 'Test convert CLI'
    condition: eq(variables['python_version'], '3.8')
  - script: |
      python -m spacy init config -p ner -l es ner.cfg
      python -m spacy debug config ner.cfg --paths.train ner-token-per-line-conll2003.spacy --paths.dev ner-token-per-line-conll2003.spacy
    displayName: 'Test debug config CLI'
    condition: eq(variables['python_version'], '3.8')
  - script: |
      # will have errors due to sparse data, check for summary in output
      python -m spacy debug data ner.cfg --paths.train ner-token-per-line-conll2003.spacy --paths.dev ner-token-per-line-conll2003.spacy | grep -q Summary
    displayName: 'Test debug data CLI'
    condition: eq(variables['python_version'], '3.8')
  - script: |
      python -m spacy train ner.cfg --paths.train ner-token-per-line-conll2003.spacy --paths.dev ner-token-per-line-conll2003.spacy --training.max_steps 10 --gpu-id -1
    displayName: 'Test train CLI'
    condition: eq(variables['python_version'], '3.8')
  - script: |
      python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'es_core_news_sm'}; config.to_disk('ner_source_sm.cfg')"
      PYTHONWARNINGS="error,ignore::DeprecationWarning" python -m spacy assemble ner_source_sm.cfg output_dir
    displayName: 'Test assemble CLI'
    condition: eq(variables['python_version'], '3.8')
  - script: |
      python -m pip install -U -r requirements.txt
    displayName: "Install test requirements"
  - script: |
      python -m pytest --pyargs spacy -W error
    displayName: "Run CPU tests"

.github/contributors/Jette16.md (new file)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Henriette Behr |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 23.09.2021 |
| GitHub username | Jette16 |
| Website (optional) | |

(new file)

@ -0,0 +1,106 @@
# spaCy contributor agreement
(Standard SCA text, identical to the agreement in Jette16.md above; signed as an individual contributor.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------------- |
| Name | Kenneth Enevoldsen |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-07-13 |
| GitHub username | KennethEnevoldsen |
| Website (optional) | www.kennethenevoldsen.com |

.github/contributors/Lucaterre.md (new file)

@ -0,0 +1,106 @@
# spaCy contributor agreement
(Standard SCA text, identical to the agreement in Jette16.md above; signed as an individual contributor.)
## Contributor Details
| Field | Entry |
|------------------------------- |---------------|
| Name | Lucas Terriel |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2022-06-20 |
| GitHub username | Lucaterre |
| Website (optional) | |

.github/contributors/Pantalaymon.md (new file)

@ -0,0 +1,106 @@
# spaCy contributor agreement
(Standard SCA text, identical to the agreement in Jette16.md above; signed as an individual contributor.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Valentin-Gabriel Soumah |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-11-23 |
| GitHub username | Pantalaymon |
| Website (optional) | |

.github/contributors/avi197.md (new file)

@ -0,0 +1,106 @@
# spaCy contributor agreement
(Standard SCA text, identical to the agreement in Jette16.md above; signed as an individual contributor.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Son Pham |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 09/10/2021 |
| GitHub username | Avi197 |
| Website (optional) | |

.github/contributors/bbieniek.md (new file)

@ -0,0 +1,106 @@
# spaCy contributor agreement
(Standard SCA text, identical to the agreement in Jette16.md above; signed as an individual contributor.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Baltazar Bieniek |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021.08.19 |
| GitHub username | bbieniek |
| Website (optional) | https://baltazar.bieniek.org.pl/ |

.github/contributors/connorbrinton.md (new file)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Connor Brinton |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | July 20th, 2021 |
| GitHub username | connorbrinton |
| Website (optional) | |

.github/contributors/ezorita.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Eduard Zorita |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 06/17/2021 |
| GitHub username | ezorita |
| Website (optional) | |

.github/contributors/fgaim.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Fitsum Gaim |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-08-07 |
| GitHub username | fgaim |
| Website (optional) | |

.github/contributors/fonfonx.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Xavier Fontaine |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2022-04-13 |
| GitHub username | fonfonx |
| Website (optional) | |

.github/contributors/gtoffoli.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Giovanni Toffoli |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-05-12 |
| GitHub username | gtoffoli |
| Website (optional) | |

.github/contributors/hlasse.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------------- |
| Name | Lasse Hansen |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-08-11 |
| GitHub username | HLasse |
| Website (optional) | www.lassehansen.me |

.github/contributors/jmyerston.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
| ----------------------------- | ----------------------------------- |
| Name | Jacobo Myerston |
| Company name (if applicable) | University of California, San Diego |
| Title or role (if applicable) | Academic |
| Date | 07/05/2021 |
| GitHub username | jmyerston |
| Website (optional) | diogenet.ucsd.edu |

.github/contributors/julien-talkair.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Julien Rossi |
| Company name (if applicable) | TalkAir BV |
| Title or role (if applicable) | CTO, Partner |
| Date | June 28 2021 |
| GitHub username | julien-talkair |
| Website (optional) | |

.github/contributors/mariosasko.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Mario Šaško |
| Company name (if applicable) | TakeLab FER |
| Title or role (if applicable) | R&D Intern |
| Date | 2021-07-12 |
| GitHub username | mariosasko |
| Website (optional) | |

.github/contributors/nsorros.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Nick Sorros |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2/8/2021 |
| GitHub username | nsorros |
| Website (optional) | |

.github/contributors/philipvollet.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Philip Vollet |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 22.09.2021 |
| GitHub username | philipvollet |
| Website (optional) | |

.github/contributors/shigapov.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Renat Shigapov |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-09-09 |
| GitHub username | shigapov |
| Website (optional) | |

.github/contributors/swfarnsworth.md

@ -0,0 +1,88 @@
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Steele Farnsworth |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 13 August, 2021 |
| GitHub username | swfarnsworth |
| Website (optional) | |

.github/contributors/syrull.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Dimitar Ganev |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021/8/2 |
| GitHub username | syrull |
| Website (optional) | |

.github/contributors/thomashacker.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Edward Schmuhl |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 09.07.2021 |
| GitHub username | thomashacker |
| Website (optional) | |

.github/contributors/xadrianzetx.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Adrian Zuber |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 20-06-2021 |
| GitHub username | xadrianzetx |
| Website (optional) | |

.github/contributors/yohasebe.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Yoichiro Hasebe |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | July 4th, 2021 |
| GitHub username | yohasebe |
| Website (optional) | https://yohasebe.com |

.github/lock.yml

@ -1,19 +0,0 @@
# Configuration for lock-threads - https://github.com/dessant/lock-threads
# Number of days of inactivity before a closed issue or pull request is locked
daysUntilLock: 30
# Issues and pull requests with these labels will not be locked. Set to `[]` to disable
exemptLabels: []
# Label to add before locking, such as `outdated`. Set to `false` to disable
lockLabel: false
# Comment to post before locking. Set to `false` to disable
lockComment: >
  This thread has been automatically locked since there has not been
  any recent activity after it was closed. Please open a new issue for
  related bugs.
# Limit to only `issues` or `pulls`
only: issues


@ -1,13 +0,0 @@
# Configuration for probot-no-response - https://github.com/probot/no-response
# Number of days of inactivity before an Issue is closed for lack of response
daysUntilClose: 14
# Label requiring a response
responseRequiredLabel: more-info-needed
# Comment to post when closing an Issue for lack of response. Set to `false` to disable
closeComment: >
  This issue has been automatically closed because there has been no response
  to a request for more information from the original author. With only the
  information that is currently in the issue, there's not enough information
  to take action. If you're the original author, feel free to reopen the issue
  if you have or find the answers needed to investigate further.

.github/spacy_universe_alert.py

@ -0,0 +1,67 @@
import os
import sys
import json
from datetime import datetime

from slack_sdk.web.client import WebClient

CHANNEL = "#alerts-universe"
SLACK_TOKEN = os.environ.get("SLACK_BOT_TOKEN", "ENV VAR not available!")
DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

client = WebClient(SLACK_TOKEN)
github_context = json.loads(sys.argv[1])

# Pull the PR metadata out of the GitHub Actions event payload
event = github_context['event']
pr_title = event['pull_request']["title"]
pr_link = event['pull_request']["patch_url"].replace(".patch", "")
pr_author_url = event['sender']["html_url"]
pr_author_name = pr_author_url.rsplit('/')[-1]
pr_created_at_dt = datetime.strptime(
    event['pull_request']["created_at"],
    DATETIME_FORMAT
)
pr_created_at = pr_created_at_dt.strftime("%c")
pr_updated_at_dt = datetime.strptime(
    event['pull_request']["updated_at"],
    DATETIME_FORMAT
)
pr_updated_at = pr_updated_at_dt.strftime("%c")

# Slack Block Kit payload for the alert message
blocks = [
    {
        "type": "section",
        "text": {
            "type": "mrkdwn",
            "text": "📣 New spaCy Universe Project Alert ✨"
        }
    },
    {
        "type": "section",
        "fields": [
            {
                "type": "mrkdwn",
                "text": f"*Pull Request:*\n<{pr_link}|{pr_title}>"
            },
            {
                "type": "mrkdwn",
                "text": f"*Author:*\n<{pr_author_url}|{pr_author_name}>"
            },
            {
                "type": "mrkdwn",
                "text": f"*Created at:*\n {pr_created_at}"
            },
            {
                "type": "mrkdwn",
                "text": f"*Last Updated:*\n {pr_updated_at}"
            }
        ]
    }
]

client.chat_postMessage(
    channel=CHANNEL,
    text="spaCy universe project PR alert",
    blocks=blocks
)

.github/validate_universe_json.py

@ -0,0 +1,19 @@
import json
import re
import sys
from pathlib import Path

def validate_json(document):
    universe_file = Path(document)
    with universe_file.open() as f:
        universe_data = json.load(f)
    for entry in universe_data["resources"]:
        if "github" in entry:
            # The github field must be a "user/repo" slug, not a full URL
            assert not re.match(
                r"^(http:)|^(https:)", entry["github"]
            ), "Github field should be user/repo, not a url"

if __name__ == "__main__":
    validate_json(str(sys.argv[1]))
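For reference, a minimal sketch of how this check behaves on sample entries (both entries below are hypothetical, not taken from `universe.json`):

```python
import re

# Hypothetical resource entries, mirroring the shape validate_json() expects
entries = [
    {"github": "explosion/spacy-stanza"},                     # user/repo slug: passes
    {"github": "https://github.com/explosion/spacy-stanza"},  # full URL: fails
]
for entry in entries:
    ok = not re.match(r"^(http:)|^(https:)", entry["github"])
    print(entry["github"], "->", "ok" if ok else "Github field should be user/repo, not a url")
```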

.github/workflows/cibuildwheel.yml

@ -0,0 +1,99 @@
name: Build

on:
  push:
    tags:
      # ytf did they invent their own syntax that's almost regex?
      # ** matches 'zero or more of any character'
      - 'release-v[0-9]+.[0-9]+.[0-9]+**'
      - 'prerelease-v[0-9]+.[0-9]+.[0-9]+**'

jobs:
  build_wheels:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        # macos-13 is an intel runner, macos-14 is apple silicon
        os: [ubuntu-latest, windows-latest, macos-13, macos-14, ubuntu-24.04-arm]
    steps:
      - uses: actions/checkout@v4
      # aarch64 (arm) is built via qemu emulation
      # QEMU is sadly too slow. We need to wait for public ARM support
      #- name: Set up QEMU
      #  if: runner.os == 'Linux'
      #  uses: docker/setup-qemu-action@v3
      #  with:
      #    platforms: all
      - name: Build wheels
        uses: pypa/cibuildwheel@v2.21.3
        env:
          CIBW_ARCHS_LINUX: auto
        with:
          package-dir: .
          output-dir: wheelhouse
          config-file: "{package}/pyproject.toml"
      - uses: actions/upload-artifact@v4
        with:
          name: cibw-wheels-${{ matrix.os }}-${{ strategy.job-index }}
          path: ./wheelhouse/*.whl

  build_sdist:
    name: Build source distribution
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build sdist
        run: pipx run build --sdist
      - uses: actions/upload-artifact@v4
        with:
          name: cibw-sdist
          path: dist/*.tar.gz

  create_release:
    needs: [build_wheels, build_sdist]
    runs-on: ubuntu-latest
    permissions:
      contents: write
      checks: write
      actions: read
      issues: read
      packages: write
      pull-requests: read
      repository-projects: read
      statuses: read
    steps:
      - name: Get the tag name and determine if it's a prerelease
        id: get_tag_info
        run: |
          FULL_TAG=${GITHUB_REF#refs/tags/}
          if [[ $FULL_TAG == release-* ]]; then
            TAG_NAME=${FULL_TAG#release-}
            IS_PRERELEASE=false
          elif [[ $FULL_TAG == prerelease-* ]]; then
            TAG_NAME=${FULL_TAG#prerelease-}
            IS_PRERELEASE=true
          else
            echo "Tag does not match expected patterns" >&2
            exit 1
          fi
          echo "FULL_TAG=$TAG_NAME" >> $GITHUB_ENV
          echo "TAG_NAME=$TAG_NAME" >> $GITHUB_ENV
          echo "IS_PRERELEASE=$IS_PRERELEASE" >> $GITHUB_ENV
      - uses: actions/download-artifact@v4
        with:
          # unpacks all CIBW artifacts into dist/
          pattern: cibw-*
          path: dist
          merge-multiple: true
      - name: Create Draft Release
        id: create_release
        uses: softprops/action-gh-release@v2
        if: startsWith(github.ref, 'refs/tags/')
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          name: ${{ env.TAG_NAME }}
          draft: true
          prerelease: ${{ env.IS_PRERELEASE }}
          files: "./dist/*"

.github/workflows/explosionbot.yml

@ -0,0 +1,28 @@
name: Explosion Bot

on:
  issue_comment:
    types:
      - created
      - edited

jobs:
  explosion-bot:
    if: github.repository_owner == 'explosion'
    runs-on: ubuntu-latest
    steps:
      - name: Dump GitHub context
        env:
          GITHUB_CONTEXT: ${{ toJson(github) }}
        run: echo "$GITHUB_CONTEXT"
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
      - name: Install and run explosion-bot
        run: |
          pip install git+https://${{ secrets.EXPLOSIONBOT_TOKEN }}@github.com/explosion/explosion-bot
          python -m explosionbot
        env:
          INPUT_TOKEN: ${{ secrets.EXPLOSIONBOT_TOKEN }}
          INPUT_BK_TOKEN: ${{ secrets.BUILDKITE_SECRET }}
          ENABLED_COMMANDS: "test_gpu,test_slow,test_slow_gpu"
          ALLOWED_TEAMS: "spaCy"

.github/workflows/gputests.yml.disabled

@ -0,0 +1,22 @@
name: Weekly GPU tests

on:
  schedule:
    - cron: '0 1 * * MON'

jobs:
  weekly-gputests:
    strategy:
      fail-fast: false
      matrix:
        branch: [master, v4]
    if: github.repository_owner == 'explosion'
    runs-on: ubuntu-latest
    steps:
      - name: Trigger buildkite build
        uses: buildkite/trigger-pipeline-action@v1.2.0
        env:
          PIPELINE: explosion-ai/spacy-slow-gpu-tests
          BRANCH: ${{ matrix.branch }}
          MESSAGE: ":github: Weekly GPU + slow tests - triggered from a GitHub Action"
          BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}


@ -13,9 +13,10 @@ on:
jobs:
  issue-manager:
    if: github.repository_owner == 'explosion'
    runs-on: ubuntu-latest
    steps:
      - uses: tiangolo/issue-manager@0.2.1
      - uses: tiangolo/issue-manager@0.4.0
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          config: >
@ -25,5 +26,11 @@ jobs:
            "message": "This issue has been automatically closed because it was answered and there was no follow-up discussion.",
            "remove_label_on_comment": true,
            "remove_label_on_close": true
            },
            "more-info-needed": {
              "delay": "P7D",
              "message": "This issue has been automatically closed because there has been no response to a request for more information from the original author. With only the information that is currently in the issue, there's not enough information to take action. If you're the original author, feel free to reopen the issue if you have or find the answers needed to investigate further.",
              "remove_label_on_comment": true,
              "remove_label_on_close": true
            }
          }

.github/workflows/lock.yml

@ -0,0 +1,26 @@
name: 'Lock Threads'

on:
  schedule:
    - cron: '0 0 * * *' # check every day
  workflow_dispatch:

permissions:
  issues: write

concurrency:
  group: lock

jobs:
  action:
    if: github.repository_owner == 'explosion'
    runs-on: ubuntu-latest
    steps:
      - uses: dessant/lock-threads@v5
        with:
          process-only: 'issues'
          issue-inactive-days: '30'
          issue-comment: >
            This thread has been automatically locked since there
            has not been any recent activity after it was closed.
            Please open a new issue for related bugs.

.github/workflows/publish_pypi.yml

@ -0,0 +1,29 @@
# The cibuildwheel action triggers on creation of a release, this
# triggers on publication.
# The expected workflow is to create a draft release and let the wheels
# upload, and then hit 'publish', which uploads to PyPi.

on:
  release:
    types:
      - published

jobs:
  upload_pypi:
    runs-on: ubuntu-latest
    environment:
      name: pypi
      url: https://pypi.org/p/spacy
    permissions:
      id-token: write
      contents: read
    if: github.event_name == 'release' && github.event.action == 'published'
    # or, alternatively, upload to PyPI on every tag starting with 'v' (remove on: release above to use this)
    # if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: robinraju/release-downloader@v1
        with:
          tag: ${{ github.event.release.tag_name }}
          fileName: '*'
          out-file-path: 'dist'
      - uses: pypa/gh-action-pypi-publish@release/v1


@ -0,0 +1,38 @@
name: Daily slow tests

on:
  schedule:
    - cron: '0 0 * * *'

jobs:
  daily-slowtests:
    strategy:
      fail-fast: false
      matrix:
        branch: [master, v4]
    if: github.repository_owner == 'explosion'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          ref: ${{ matrix.branch }}
      - name: Get commits from past 24 hours
        id: check_commits
        run: |
          today=$(date '+%Y-%m-%d %H:%M:%S')
          yesterday=$(date -d "yesterday" '+%Y-%m-%d %H:%M:%S')
          if git log --after="$yesterday" --before="$today" | grep commit ; then
            echo run_tests=true >> $GITHUB_OUTPUT
          else
            echo run_tests=false >> $GITHUB_OUTPUT
          fi
      - name: Trigger buildkite build
        if: steps.check_commits.outputs.run_tests == 'true'
        uses: buildkite/trigger-pipeline-action@v1.2.0
        env:
          PIPELINE: explosion-ai/spacy-slow-tests
          BRANCH: ${{ matrix.branch }}
          MESSAGE: ":github: Daily slow tests - triggered from a GitHub Action"
          BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}


@ -0,0 +1,33 @@
name: spaCy universe project alert

on:
  pull_request_target:
    paths:
      - "website/meta/universe.json"

jobs:
  build:
    if: github.repository_owner == 'explosion'
    runs-on: ubuntu-latest
    steps:
      - name: Dump GitHub context
        env:
          GITHUB_CONTEXT: ${{ toJson(github) }}
          PR_NUMBER: ${{ github.event.number }}
        run: |
          echo "$GITHUB_CONTEXT"
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install Bernadette app dependency and send an alert
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
          GITHUB_CONTEXT: ${{ toJson(github) }}
          CHANNEL: "#alerts-universe"
        run: |
          pip install slack-sdk==3.17.2 aiohttp==3.8.1
          echo "$CHANNEL"
          python .github/spacy_universe_alert.py "$GITHUB_CONTEXT"

.github/workflows/tests.yml

@ -0,0 +1,175 @@
name: tests

on:
  push:
    tags-ignore:
      - '**'
    branches-ignore:
      - "spacy.io"
      - "nightly.spacy.io"
      - "v2.spacy.io"
    paths-ignore:
      - "*.md"
      - "*.mdx"
      - "website/**"
  pull_request:
    types: [opened, synchronize, reopened, edited]
    paths-ignore:
      - "*.md"
      - "*.mdx"
      - "website/**"

jobs:
  validate:
    name: Validate
    if: github.repository_owner == 'explosion'
    runs-on: ubuntu-latest
    steps:
      - name: Check out repo
        uses: actions/checkout@v4
      - name: Configure Python version
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: black
        run: |
          python -m pip install black -c requirements.txt
          python -m black spacy --check
      - name: isort
        run: |
          python -m pip install isort -c requirements.txt
          python -m isort spacy --check
      - name: flake8
        run: |
          python -m pip install flake8==5.0.4
          python -m flake8 spacy --count --select=E901,E999,F821,F822,F823,W605 --show-source --statistics
      # Unfortunately cython-lint isn't working after the shift to Cython 3.
      #- name: cython-lint
      #  run: |
      #    python -m pip install cython-lint -c requirements.txt
      #    # E501: line too long, W291: trailing whitespace, E266: too many leading '#' for block comment
      #    cython-lint spacy --ignore E501,W291,E266

  tests:
    name: Test
    needs: Validate
    strategy:
      fail-fast: true
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        python_version: ["3.9", "3.12", "3.13"]
    runs-on: ${{ matrix.os }}
    steps:
      - name: Check out repo
        uses: actions/checkout@v4
      - name: Configure Python version
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python_version }}
      - name: Install dependencies
        run: |
          python -m pip install -U build pip setuptools
          python -m pip install -U -r requirements.txt
      - name: Build sdist
        run: |
          python -m build --sdist
      - name: Run mypy
        run: |
          python -m mypy spacy
        if: matrix.python_version != '3.7'
      - name: Delete source directory and .egg-info
        run: |
          rm -rf spacy *.egg-info
        shell: bash
      - name: Uninstall all packages
        run: |
          python -m pip freeze
          python -m pip freeze --exclude pywin32 > installed.txt
          python -m pip uninstall -y -r installed.txt
      - name: Install from sdist
        run: |
          SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
          SPACY_NUM_BUILD_JOBS=2 python -m pip install dist/$SDIST
        shell: bash
      - name: Test import
        run: python -W error -c "import spacy"
      - name: "Test download CLI"
        run: |
          python -m spacy download ca_core_news_sm
          python -m spacy download ca_core_news_md
          python -c "import spacy; nlp=spacy.load('ca_core_news_sm'); doc=nlp('test')"
        if: matrix.python_version == '3.9'
      - name: "Test download_url in info CLI"
        run: |
          python -W error -m spacy info ca_core_news_sm | grep -q download_url
        if: matrix.python_version == '3.9'
      - name: "Test no warnings on load (#11713)"
        run: |
          python -W error -c "import ca_core_news_sm; nlp = ca_core_news_sm.load(); doc=nlp('test')"
        if: matrix.python_version == '3.9'
      - name: "Test convert CLI"
        run: |
          python -m spacy convert extra/example_data/ner_example_data/ner-token-per-line-conll2003.json .
        if: matrix.python_version == '3.9'
      - name: "Test debug config CLI"
        run: |
          python -m spacy init config -p ner -l ca ner.cfg
          python -m spacy debug config ner.cfg --paths.train ner-token-per-line-conll2003.spacy --paths.dev ner-token-per-line-conll2003.spacy
        if: matrix.python_version == '3.9'
      - name: "Test debug data CLI"
        run: |
          # will have errors due to sparse data, check for summary in output
          python -m spacy debug data ner.cfg --paths.train ner-token-per-line-conll2003.spacy --paths.dev ner-token-per-line-conll2003.spacy | grep -q Summary
        if: matrix.python_version == '3.9'
      - name: "Test train CLI"
        run: |
          python -m spacy train ner.cfg --paths.train ner-token-per-line-conll2003.spacy --paths.dev ner-token-per-line-conll2003.spacy --training.max_steps 10 --gpu-id -1
        if: matrix.python_version == '3.9'
      - name: "Test assemble CLI"
        run: |
          python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_sm'}; config.to_disk('ner_source_sm.cfg')"
          python -m spacy assemble ner_source_sm.cfg output_dir
        env:
          PYTHONWARNINGS: "error,ignore::DeprecationWarning"
        if: matrix.python_version == '3.9'
      - name: "Test assemble CLI vectors warning"
        run: |
          python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_md'}; config.to_disk('ner_source_md.cfg')"
          python -m spacy assemble ner_source_md.cfg output_dir 2>&1 | grep -q W113
        if: matrix.python_version == '3.9'
      - name: "Install test requirements"
        run: |
          python -m pip install -U -r requirements.txt
      - name: "Run CPU tests"
        run: |
          python -m pytest --pyargs spacy -W error
        if: "!(startsWith(matrix.os, 'macos') && matrix.python_version == '3.11')"
      - name: "Run CPU tests with thinc-apple-ops"
        run: |
          python -m pip install 'spacy[apple]'
          python -m pytest --pyargs spacy
        if: startsWith(matrix.os, 'macos') && matrix.python_version == '3.11'


@ -0,0 +1,32 @@
name: universe validation

on:
  push:
    branches-ignore:
      - "spacy.io"
      - "nightly.spacy.io"
      - "v2.spacy.io"
    paths:
      - "website/meta/universe.json"
  pull_request:
    types: [opened, synchronize, reopened, edited]
    paths:
      - "website/meta/universe.json"

jobs:
  validate:
    name: Validate
    if: github.repository_owner == 'explosion'
    runs-on: ubuntu-latest
    steps:
      - name: Check out repo
        uses: actions/checkout@v4
      - name: Configure Python version
        uses: actions/setup-python@v4
        with:
          python-version: "3.7"
      - name: Validate website/meta/universe.json
        run: |
          python .github/validate_universe_json.py website/meta/universe.json

.gitignore

@ -10,20 +10,11 @@ spacy/tests/package/setup.cfg
spacy/tests/package/pyproject.toml
spacy/tests/package/requirements.txt
# Website
website/.cache/
website/public/
website/node_modules
website/.npm
website/logs
*.log
npm-debug.log*
quickstart-training-generator.js
# Cython / C extensions
cythonize.json
spacy/*.html
*.cpp
*.c
*.so
# Vim / VSCode / editors

.pre-commit-config.yaml

@ -0,0 +1,13 @@
repos:
  - repo: https://github.com/ambv/black
    rev: 22.3.0
    hooks:
      - id: black
        language_version: python3.7
        additional_dependencies: ['click==8.0.4']
  - repo: https://github.com/pycqa/flake8
    rev: 5.0.4
    hooks:
      - id: flake8
        args:
          - "--config=setup.cfg"


@ -1,8 +0,0 @@
@software{spacy,
  author = {Honnibal, Matthew and Montani, Ines and Van Landeghem, Sofie and Boyd, Adriane},
  title = {{spaCy: Industrial-strength Natural Language Processing in Python}},
  year = 2020,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.1212303},
  url = {https://doi.org/10.5281/zenodo.1212303}
}

CITATION.cff

@ -0,0 +1,16 @@
cff-version: 1.2.0
preferred-citation:
  type: article
  message: "If you use spaCy, please cite it as below."
  authors:
    - family-names: "Honnibal"
      given-names: "Matthew"
    - family-names: "Montani"
      given-names: "Ines"
    - family-names: "Van Landeghem"
      given-names: "Sofie"
    - family-names: "Boyd"
      given-names: "Adriane"
  title: "spaCy: Industrial-strength Natural Language Processing in Python"
  doi: "10.5281/zenodo.1212303"
  year: 2020


@ -2,11 +2,7 @@
# Contribute to spaCy
Thanks for your interest in contributing to spaCy 🎉 The project is maintained
by **[@honnibal](https://github.com/honnibal)**,
**[@ines](https://github.com/ines)**, **[@svlandeg](https://github.com/svlandeg)** and
**[@adrianeboyd](https://github.com/adrianeboyd)**,
and we'll do our best to help you get started. This page will give you a quick
Thanks for your interest in contributing to spaCy 🎉 This page will give you a quick
overview of how things are organized and most importantly, how to get involved.
## Table of contents
@ -39,7 +35,7 @@ so that more people can benefit from it.
When opening an issue, use a **descriptive title** and include your
**environment** (operating system, Python version, spaCy version). Our
[issue template](https://github.com/explosion/spaCy/issues/new) helps you
[issue templates](https://github.com/explosion/spaCy/issues/new/choose) help you
remember the most important details to include. If you've discovered a bug, you
can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
@ -144,29 +140,28 @@ Changes to `.py` files will be effective immediately.
📖 **For more details and instructions, see the documentation on [compiling spaCy from source](https://spacy.io/usage/#source) and the [quickstart widget](https://spacy.io/usage/#section-quickstart) to get the right commands for your platform and Python version.**
### Contributor agreement
If you've made a contribution to spaCy, you should fill in the
[spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that
your contribution can be used across the project. If you agree to be bound by
the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md)
and include it with your pull request, or submit it separately to
[`.github/contributors/`](/.github/contributors). The name of the file should be
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
### Fixing bugs
When fixing a bug, first create an
[issue](https://github.com/explosion/spaCy/issues) if one does not already exist.
The description text can be very short -- we don't want to make this too
[issue](https://github.com/explosion/spaCy/issues) if one does not already
exist. The description text can be very short -- we don't want to make this too
bureaucratic.
Next, create a test file named `test_issue[ISSUE NUMBER].py` in the
[`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug
you're fixing, and make sure the test fails. Next, add and commit your test file
referencing the issue number in the commit message. Finally, fix the bug, make
sure your test passes and reference the issue in your commit message.
Next, add a test to the relevant file in the
[`spacy/tests`](spacy/tests) folder. Then add a [pytest
mark](https://docs.pytest.org/en/6.2.x/example/markers.html#working-with-custom-markers),
`@pytest.mark.issue(NUMBER)`, to reference the issue number.
```python
# Assume you're fixing Issue #1234
@pytest.mark.issue(1234)
def test_issue1234():
    ...
```
Test for the bug you're fixing, and make sure the test fails. Next, add and
commit your test file. Finally, fix the bug, make sure your test passes and
reference the issue number in your pull request description.
📖 **For more information on how to add tests, check out the [tests README](spacy/tests/README.md).**
@ -178,9 +173,22 @@ formatting and [`flake8`](http://flake8.pycqa.org/en/latest/) for linting its
Python modules. If you've built spaCy from source, you'll already have both
tools installed.
As a general rule of thumb, we use f-strings for any formatting of strings.
One exception is calls to Python's `logging` functionality.
To avoid unnecessary string conversions in these cases, we use string formatting
templates with `%s` and `%d` etc.
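For example, a minimal sketch of this convention (the names are illustrative):

```python
import logging

logger = logging.getLogger(__name__)
n_docs = 1000

# General rule: f-strings for string formatting
print(f"Processed {n_docs} docs")

# Exception: logging calls use %-style templates, so the string is only
# interpolated when the message is actually emitted
logger.debug("Processed %d docs", n_docs)
```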
**⚠️ Note that formatting and linting is currently only possible for Python
modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
### Pre-Commit Hooks
After cloning the repo and installing the packages from `requirements.txt`, enter the repo folder and run `pre-commit install`.
Each time a `git commit` is initiated, `black` and `flake8` will run automatically on the modified files only.
If an error is raised, or if `black` modified a file, the changed file needs to be `git add`ed once again and a new
`git commit` issued.
### Code formatting
[`black`](https://github.com/ambv/black) is an opinionated Python code
@ -230,7 +238,7 @@ also want to keep an eye on unused declared variables or repeated
(i.e. overwritten) dictionary keys. If your code was formatted with `black`
(see above), you shouldn't see any formatting-related warnings.
The [`.flake8`](.flake8) config defines the configuration we use for this
The `flake8` section in [`setup.cfg`](setup.cfg) defines the configuration we use for this
codebase. For example, we're not super strict about the line length, and we're
excluding very large files like lemmatization and tokenizer exception tables.
@ -268,7 +276,8 @@ except: # noqa: E722
### Python conventions
All Python code must be written **compatible with Python 3.6+**.
All Python code must be written **compatible with Python 3.6+**. More detailed
code conventions can be found in the [developer docs](https://github.com/explosion/spaCy/blob/master/extra/DEVELOPER_DOCS/Code%20Conventions.md).
#### I/O and handling paths
@ -409,14 +418,7 @@ all test files and test functions need to be prefixed with `test_`.
When adding tests, make sure to use descriptive names, keep the code short and
concise and only test for one behavior at a time. Try to `parametrize` test
cases wherever possible, use our pre-defined fixtures for spaCy components and
avoid unnecessary imports.
Extensive tests that take a long time should be marked with `@pytest.mark.slow`.
Tests that require the model to be loaded should be marked with
`@pytest.mark.models`. Loading the models is expensive and not necessary if
you're not actually testing the model performance. If all you need is a `Doc`
object with annotations like heads, POS tags or the dependency parse, you can
use the `Doc` constructor to construct it manually.
avoid unnecessary imports. Extensive tests that take a long time should be marked with `@pytest.mark.slow`.
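For instance, a minimal sketch of a parametrized test (the fixture and the cases are illustrative):

```python
import pytest

@pytest.mark.parametrize("text,n_tokens", [("Hello world!", 3), ("One", 1)])
def test_tokenizer_length(en_tokenizer, text, n_tokens):
    # en_tokenizer is one of the pre-defined fixtures in spacy/tests
    doc = en_tokenizer(text)
    assert len(doc) == n_tokens
```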
📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**
@ -433,7 +435,7 @@ simply click on the "Suggest edits" button at the bottom of a page.
## Publishing spaCy extensions and plugins
We're very excited about all the new possibilities for **community extensions**
and plugins in spaCy v2.0, and we can't wait to see what you build with it!
and plugins in spaCy v3.0, and we can't wait to see what you build with it!
- An extension or plugin should add substantial functionality, be
**well-documented** and **open-source**. It should be available for users to download
@ -447,13 +449,12 @@ and plugins in spaCy v2.0, and we can't wait to see what you build with it!
[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
[`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
to make it easier to find. Those are also the topics we're linking to from the
spaCy website. If you're sharing your project on Twitter, feel free to tag
[@spacy_io](https://twitter.com/spacy_io) so we can check it out.
spaCy website. If you're sharing your project on X, feel free to tag
[@spacy_io](https://x.com/spacy_io) so we can check it out.
- Once your extension is published, you can open an issue on the
[issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
[resources directory](https://spacy.io/usage/resources#extensions) on the
website.
- Once your extension is published, you can open a
[PR](https://github.com/explosion/spaCy/pulls) to suggest it for the
[Universe](https://spacy.io/universe) page.
📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**


@ -1,6 +1,6 @@
The MIT License (MIT)
Copyright (C) 2016-2021 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
Copyright (C) 2016-2024 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal


@ -1,11 +1,9 @@
recursive-include include *.h
recursive-include spacy *.pyx *.pxd *.txt *.cfg *.jinja
recursive-include spacy *.pyi *.pyx *.pxd *.txt *.cfg *.jinja *.toml *.hh
include LICENSE
include README.md
include pyproject.toml
include spacy/py.typed
recursive-exclude spacy/lang *.json
recursive-include spacy/lang *.json.gz
recursive-include spacy/cli *.json *.yml
recursive-include spacy/cli *.yml
recursive-include spacy/tests *.json
recursive-include licenses *
recursive-exclude spacy *.cpp


@ -1,11 +1,11 @@
SHELL := /bin/bash
ifndef SPACY_EXTRAS
override SPACY_EXTRAS = spacy-lookups-data==1.0.0 jieba spacy-pkuseg==0.0.28 sudachipy sudachidict_core pymorphy2
override SPACY_EXTRAS = spacy-lookups-data==1.0.3
endif
ifndef PYVER
override PYVER = 3.6
override PYVER = 3.8
endif
VENV := ./env$(PYVER)


@ -6,20 +6,20 @@ spaCy is a library for **advanced Natural Language Processing** in Python and
Cython. It's built on the very latest research, and was designed from day one to
be used in real products.
spaCy comes with
[pretrained pipelines](https://spacy.io/models) and
currently supports tokenization and training for **60+ languages**. It features
state-of-the-art speed and **neural network models** for tagging,
parsing, **named entity recognition**, **text classification** and more,
multi-task learning with pretrained **transformers** like BERT, as well as a
spaCy comes with [pretrained pipelines](https://spacy.io/models) and currently
supports tokenization and training for **70+ languages**. It features
state-of-the-art speed and **neural network models** for tagging, parsing,
**named entity recognition**, **text classification** and more, multi-task
learning with pretrained **transformers** like BERT, as well as a
production-ready [**training system**](https://spacy.io/usage/training) and easy
model packaging, deployment and workflow management. spaCy is commercial
open-source software, released under the MIT license.
open-source software, released under the
[MIT license](https://github.com/explosion/spaCy/blob/master/LICENSE).
💫 **Version 3.0 out now!**
💫 **Version 3.8 out now!**
[Check out the release notes here.](https://github.com/explosion/spaCy/releases)
[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
[![tests](https://github.com/explosion/spaCy/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spaCy/actions/workflows/tests.yml)
[![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases)
[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/)
[![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy)
@ -28,66 +28,79 @@ open-source software, released under the MIT license.
<br />
[![PyPi downloads](https://static.pepy.tech/personalized-badge/spacy?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads)](https://pypi.org/project/spacy/)
[![Conda downloads](https://img.shields.io/conda/dn/conda-forge/spacy?label=conda%20downloads)](https://anaconda.org/conda-forge/spacy)
[![spaCy on Twitter](https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow)](https://twitter.com/spacy_io)
## 📖 Documentation
| Documentation | |
| -------------------------- | -------------------------------------------------------------- |
| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
| 📚 **[Usage Guides]** | How to use spaCy and its features. |
| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
| 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
| 📦 **[Models]** | Download trained pipelines for spaCy. |
| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
| 🛠 **[Changelog]** | Changes and version history. |
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
| Documentation | |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
| 📚 **[Usage Guides]** | How to use spaCy and its features. |
| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
| 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
| ⏩ **[GPU Processing]** | Use spaCy with CUDA-compatible GPU processing. |
| 📦 **[Models]** | Download trained pipelines for spaCy. |
| 🦙 **[Large Language Models]** | Integrate LLMs into spaCy pipelines. |
| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
| ⚙️ **[spaCy VS Code Extension]** | Additional tooling and features for working with spaCy's config files. |
| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
| 📰 **[Blog]** | Read about current spaCy and Prodigy development, releases, talks and more from Explosion. |
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
| 🔴 **[Live Stream]** | Join Matt as he works on spaCy and chat about NLP, live every week. |
| 🛠 **[Changelog]** | Changes and version history. |
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
| 👕 **[Swag]** | Support us and our work with unique, custom-designed swag! |
| <a href="https://explosion.ai/tailored-solutions"><img src="https://github.com/explosion/spaCy/assets/13643239/36d2a42e-98c0-4599-90e1-788ef75181be" width="150" alt="Tailored Solutions"/></a> | Custom NLP consulting, implementation and strategic advice by spaCy's core development team. Streamlined, production-ready, predictable and maintainable. Send us an email or take our 5-minute questionnaire, and we'll be in touch! **[Learn more &rarr;](https://explosion.ai/tailored-solutions)** |
[spacy 101]: https://spacy.io/usage/spacy-101
[new in v3.0]: https://spacy.io/usage/v3
[usage guides]: https://spacy.io/usage/
[api reference]: https://spacy.io/api/
[gpu processing]: https://spacy.io/usage#gpu
[models]: https://spacy.io/models
[large language models]: https://spacy.io/usage/large-language-models
[universe]: https://spacy.io/universe
[spacy vs code extension]: https://github.com/explosion/spacy-vscode
[videos]: https://www.youtube.com/c/ExplosionAI
[live stream]: https://www.youtube.com/playlist?list=PLBmcuObd5An5_iAxNYLJa_xWmNzsYce8c
[online course]: https://course.spacy.io
[blog]: https://explosion.ai
[project templates]: https://github.com/explosion/projects
[changelog]: https://spacy.io/usage#changelog
[contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
[swag]: https://explosion.ai/merch
## 💬 Where to ask questions
The spaCy project is maintained by **[@honnibal](https://github.com/honnibal)**,
**[@ines](https://github.com/ines)**, **[@svlandeg](https://github.com/svlandeg)** and
**[@adrianeboyd](https://github.com/adrianeboyd)**. Please understand that we won't
be able to provide individual support via email. We also believe that help is
much more valuable if it's shared publicly, so that more people can benefit from
it.
The spaCy project is maintained by the [spaCy team](https://explosion.ai/about).
Please understand that we won't be able to provide individual support via email.
We also believe that help is much more valuable if it's shared publicly, so that
more people can benefit from it.
| Type | Platforms |
| ------------------------------- | --------------------------------------- |
| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
| 🎁 **Feature Requests & Ideas** | [GitHub Discussions] |
| 🎁 **Feature Requests & Ideas** | [GitHub Discussions] · [Live Stream] |
| 👩‍💻 **Usage Questions** | [GitHub Discussions] · [Stack Overflow] |
| 🗯 **General Discussion** | [GitHub Discussions] |
| 🗯 **General Discussion** | [GitHub Discussions] · [Live Stream] |
[github issue tracker]: https://github.com/explosion/spaCy/issues
[github discussions]: https://github.com/explosion/spaCy/discussions
[stack overflow]: https://stackoverflow.com/questions/tagged/spacy
[live stream]: https://www.youtube.com/playlist?list=PLBmcuObd5An5_iAxNYLJa_xWmNzsYce8c
## Features
- Support for **60+ languages**
- Support for **70+ languages**
- **Trained pipelines** for different languages and tasks
- Multi-task learning with pretrained **transformers** like BERT
- Support for pretrained **word vectors** and embeddings
- State-of-the-art speed
- Production-ready **training system**
- Linguistically-motivated **tokenization**
- Components for named **entity recognition**, part-of-speech-tagging, dependency parsing, sentence segmentation, **text classification**, lemmatization, morphological analysis, entity linking and more
- Components for named **entity recognition**, part-of-speech-tagging,
dependency parsing, sentence segmentation, **text classification**,
lemmatization, morphological analysis, entity linking and more
- Easily extensible with **custom components** and attributes
- Support for custom models in **PyTorch**, **TensorFlow** and other frameworks
- Built in **visualizers** for syntax and NER
@ -104,7 +117,7 @@ For detailed installation instructions, see the
- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual
Studio)
- **Python version**: Python 3.6+ (only 64 bit)
- **Python version**: Python >=3.7, <3.13 (only 64 bit)
- **Package managers**: [pip] · [conda] (via `conda-forge`)
[pip]: https://pypi.org/project/spacy/
@ -113,8 +126,8 @@ For detailed installation instructions, see the
### pip
Using pip, spaCy releases are available as source packages and binary wheels.
Before you install spaCy and its dependencies, make sure that
your `pip`, `setuptools` and `wheel` are up to date.
Before you install spaCy and its dependencies, make sure that your `pip`,
`setuptools` and `wheel` are up to date.
```bash
pip install -U pip setuptools wheel
@ -169,9 +182,9 @@ with the new version.
## 📦 Download model packages
Trained pipelines for spaCy can be installed as **Python packages**. This
means that they're a component of your application, just like any other module.
Models can be installed using spaCy's [`download`](https://spacy.io/api/cli#download)
Trained pipelines for spaCy can be installed as **Python packages**. This means
that they're a component of your application, just like any other module. Models
can be installed using spaCy's [`download`](https://spacy.io/api/cli#download)
command, or manually by pointing pip to a path or URL.
| Documentation | |
@ -237,8 +250,7 @@ do that depends on your system.
| **Mac** | Install a recent version of [XCode](https://developer.apple.com/xcode/), including the so-called "Command Line Tools". macOS and OS X ship with Python and git preinstalled. |
| **Windows** | Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or [Visual Studio Express](https://visualstudio.microsoft.com/vs/express/) that matches the version that was used to compile your Python interpreter. |
For more details
and instructions, see the documentation on
For more details and instructions, see the documentation on
[compiling spaCy from source](https://spacy.io/usage#source) and the
[quickstart widget](https://spacy.io/usage#section-quickstart) to get the right
commands for your platform and Python version.

View File

@ -1,91 +0,0 @@
trigger:
batch: true
branches:
include:
- "*"
exclude:
- "spacy.io"
- "nightly.spacy.io"
- "v2.spacy.io"
paths:
exclude:
- "website/*"
- "*.md"
pr:
paths:
exclude:
- "website/*"
- "*.md"
jobs:
# Perform basic checks for most important errors (syntax etc.) Uses the config
# defined in .flake8 and overwrites the selected codes.
- job: "Validate"
pool:
vmImage: "ubuntu-latest"
steps:
- task: UsePythonVersion@0
inputs:
versionSpec: "3.7"
- script: |
pip install flake8==5.0.4
python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics
displayName: "flake8"
- job: "Test"
dependsOn: "Validate"
strategy:
matrix:
# We're only running one platform per Python version to speed up builds
Python36Linux:
imageName: "ubuntu-20.04"
python.version: "3.6"
# Python36Windows:
# imageName: "windows-latest"
# python.version: "3.6"
# Python36Mac:
# imageName: "macos-latest"
# python.version: "3.6"
# Python37Linux:
# imageName: "ubuntu-20.04"
# python.version: "3.7"
Python37Windows:
imageName: "windows-latest"
python.version: "3.7"
# Python37Mac:
# imageName: "macos-latest"
# python.version: "3.7"
# Python38Linux:
# imageName: "ubuntu-latest"
# python.version: "3.8"
# Python38Windows:
# imageName: "windows-latest"
# python.version: "3.8"
Python38Mac:
imageName: "macos-latest"
python.version: "3.8"
Python39Linux:
imageName: "ubuntu-latest"
python.version: "3.9"
# Python39Windows:
# imageName: "windows-latest"
# python.version: "3.9"
# Python39Mac:
# imageName: "macos-latest"
# python.version: "3.9"
Python310Linux:
imageName: "ubuntu-latest"
python.version: "3.10"
Python310Windows:
imageName: "windows-latest"
python.version: "3.10"
Python310Mac:
imageName: "macos-latest"
python.version: "3.10"
maxParallel: 4
pool:
vmImage: $(imageName)
steps:
- template: .github/azure-steps.yml
parameters:
python_version: '$(python.version)'

20
bin/release.sh Executable file
View File

@ -0,0 +1,20 @@
#!/usr/bin/env bash
set -e
# Insist repository is clean
git diff-index --quiet HEAD
version=$(grep "__version__ = " spacy/about.py)
version=${version/__version__ = }
version=${version/\'/}
version=${version/\'/}
version=${version/\"/}
version=${version/\"/}
echo "Pushing release-v"$version
git tag -d release-v$version || true
git push origin :release-v$version || true
git tag release-v$version
git push origin release-v$version

View File

@ -1,8 +1,2 @@
# build version constraints for use with wheelwright + multibuild
numpy==1.15.0; python_version<='3.7' and platform_machine!='aarch64'
numpy==1.19.2; python_version<='3.7' and platform_machine=='aarch64'
numpy==1.17.3; python_version=='3.8' and platform_machine!='aarch64'
numpy==1.19.2; python_version=='3.8' and platform_machine=='aarch64'
numpy==1.19.3; python_version=='3.9'
numpy==1.21.3; python_version=='3.10'
numpy; python_version>='3.11'
# build version constraints for use with wheelwright
numpy>=2.0.0,<3.0.0

View File

@ -0,0 +1,580 @@
# Code Conventions
For a general overview of code conventions for contributors, see the [section in the contributing guide](https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md#code-conventions).
1. [Code compatibility](#code-compatibility)
2. [Auto-formatting](#auto-formatting)
3. [Linting](#linting)
4. [Documenting code](#documenting-code)
5. [Type hints](#type-hints)
6. [Structuring logic](#structuring-logic)
7. [Naming](#naming)
8. [Error handling](#error-handling)
9. [Writing tests](#writing-tests)
## Code compatibility
spaCy supports **Python 3.6** and above, so all code should be written to be compatible with 3.6. This means that there are certain new syntax features that we won't be able to use until we drop support for older Python versions. Some newer features provide backports that we can conditionally install for older versions, although we only want to do this if it's absolutely necessary. If we need to use conditional imports based on the Python version or other custom compatibility-specific helpers, those should live in `compat.py`.
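For illustration, a minimal sketch of the kind of helper that belongs in `compat.py` (the specific names here are assumed, not taken from the actual file):
```python
import sys

# Conditionally import a backport when the stdlib version isn't available
# (hypothetical example; requires typing_extensions on Python < 3.8)
if sys.version_info >= (3, 8):
    from typing import Literal  # noqa: F401
else:
    from typing_extensions import Literal  # noqa: F401
```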
## Auto-formatting
spaCy uses `black` for auto-formatting (which is also available as a pre-commit hook). It's recommended to configure your editor to perform this automatically, either triggered manually or whenever you save a file. We also have a GitHub action that regularly formats the code base and submits a PR if changes are available. Note that auto-formatting is currently only available for `.py` (Python) files, not for `.pyx` (Cython).
As a rule of thumb, if the auto-formatting produces output that looks messy, it can often indicate that there's a better way to structure the code to make it more concise.
```diff
- range_suggester = registry.misc.get("spacy.ngram_range_suggester.v1")(
-     min_size=1, max_size=3
- )
+ suggester_factory = registry.misc.get("spacy.ngram_range_suggester.v1")
+ range_suggester = suggester_factory(min_size=1, max_size=3)
```
In some specific cases, e.g. in the tests, it can make sense to disable auto-formatting for a specific block. You can do this by wrapping the code in `# fmt: off` and `# fmt: on`:
```diff
+ # fmt: off
text = "I look forward to using Thingamajig. I've been told it will make my life easier..."
deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
"nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
"poss", "nsubj", "ccomp", "punct"]
+ # fmt: on
```
## Linting
[`flake8`](http://flake8.pycqa.org/en/latest/) is a tool for enforcing code style. It scans one or more files and outputs errors and warnings. This feedback can help you stick to general standards and conventions, and can be very useful for spotting potential mistakes and inconsistencies in your code. Code you write should be compatible with our flake8 rules and not cause any warnings.
```bash
flake8 spacy
```
The most common problems surfaced by linting are:
- **Trailing or missing whitespace.** This is related to formatting and should be fixed automatically by running `black`.
- **Unused imports.** Those should be removed if the imports aren't actually used. If they're required, e.g. to expose them so they can be imported from the given module, you can add a comment and `# noqa: F401` exception (see details below).
- **Unused variables.** This can often indicate bugs, e.g. a variable that's declared and not correctly passed on or returned. To prevent ambiguity here, your code shouldn't contain unused variables. If you're unpacking a list of tuples and end up with variables you don't need, you can call them `_` to indicate that they're unused.
- **Redefinition of function.** This can also indicate bugs, e.g. a copy-pasted function that you forgot to rename and that now replaces the original function.
- **Repeated dictionary keys.** This either indicates a bug or unnecessary duplication.
- **Comparison with `True`, `False`, `None`**. This is mostly a stylistic thing: when checking whether a value is `True`, `False` or `None`, you should be using `is` instead of `==`. For example, `if value is None`.
### Ignoring linter rules for special cases
To ignore a given line, you can add a comment like `# noqa: F401`, specifying the code of the error or warning we want to ignore. It's also possible to ignore several comma-separated codes at once, e.g. `# noqa: E731,E123`. In general, you should always **specify the code(s)** you want to ignore; otherwise, you may end up missing actual problems.
```python
# The imported class isn't used in this file, but imported here, so it can be
# imported *from* here by another module.
from .submodule import SomeClass # noqa: F401
try:
    do_something()
except:  # noqa: E722
    # This bare except is justified, for some specific reason
    do_something_else()
```
## Documenting code
All functions and methods you write should be documented with a docstring inline. The docstring can contain a simple summary, and an overview of the arguments and their (simplified) types. Modern editors will show this information to users when they call the function or method in their code.
If it's part of the public API and there's a documentation section available, we usually add the link as `DOCS:` at the end. This allows us to keep the docstrings simple and concise, while also providing additional information and examples if necessary.
```python
def has_pipe(self, name: str) -> bool:
    """Check if a component name is present in the pipeline. Equivalent to
    `name in nlp.pipe_names`.

    name (str): Name of the component.
    RETURNS (bool): Whether a component of the name exists in the pipeline.

    DOCS: https://spacy.io/api/language#has_pipe
    """
    ...
```
We specifically chose this approach of maintaining the docstrings and API reference separately, instead of auto-generating the API docs from the docstrings like other packages do. We want to be able to provide extensive explanations and examples in the documentation and use our own custom markup for it that would otherwise clog up the docstrings. We also want to be able to update the documentation independently of the code base. It's slightly more work, but it's absolutely worth it in terms of user and developer experience.
### Inline code comments
We don't expect you to add inline comments for everything you're doing; this should be obvious from reading the code. If it's not, the first thing to check is whether your code can be improved to make it more explicit. That said, if your code includes complex logic or aspects that may be unintuitive at first glance (or even included a subtle bug that you ended up fixing), you should leave a quick comment that provides more context.
```diff
token_index = indices[value]
+ # Index describes Token.i of last token but Span indices are inclusive
span = doc[prev_token_index:token_index + 1]
```
```diff
+ # To create the components we need to use the final interpolated config
+ # so all values are available (if component configs use variables).
+ # Later we replace the component config with the raw config again.
interpolated = filled.interpolate() if not filled.is_interpolated else filled
```
Don't be shy about including comments for tricky parts that _you_ found hard to implement or get right; those may come in handy for the next person working on this code, or even future you!
If your change implements a fix to a specific issue, it can often be helpful to include the issue number in the comment, especially if it's a relatively straightforward adjustment:
```diff
+ # Ensure object is a Span, not a Doc (#1234)
if isinstance(obj, Doc):
    obj = obj[obj.start:obj.end]
```
### Including TODOs
It's fine to include code comments that indicate future TODOs, using the `TODO:` prefix. Modern editors typically format this in a different color, so it's easy to spot. TODOs don't necessarily have to be things that are absolutely critical to fix right now; those should already be addressed in your pull request once it's ready for review. But they can include notes about potential future improvements.
```diff
+ # TODO: this is currently pretty slow
dir_checksum = hashlib.md5()
for sub_file in sorted(fp for fp in path.rglob("*") if fp.is_file()):
    dir_checksum.update(sub_file.read_bytes())
```
If any of the TODOs you've added are important and should be fixed soon, you should add a task for this on Explosion's internal Ora board or an issue on the public issue tracker to make sure we don't forget to address it.
## Type hints
We use Python type hints across the `.py` files wherever possible. This makes it easy to understand what a function expects and returns, and modern editors will be able to show this information to you when you call an annotated function. Type hints are not currently used in the `.pyx` (Cython) code, except for definitions of registered functions and component factories, where they're used for config validation. Ideally when developing, run `mypy spacy` on the code base to inspect any issues.
If possible, you should always use the more descriptive type hints like `List[str]` or even `List[Any]` instead of only `list`. We also annotate arguments and return types of `Callable`, although you can simplify this if the type otherwise gets too verbose (e.g. functions that return factories to create callbacks). Remember that `Callable` takes two values: a **list** of the argument type(s) in order, and the return value.
```diff
- def func(some_arg: dict) -> None:
+ def func(some_arg: Dict[str, Any]) -> None:
    ...
```
```python
def create_callback(some_arg: bool) -> Callable[[str, int], List[str]]:
    def callback(arg1: str, arg2: int) -> List[str]:
        ...

    return callback
```
For typing variables, we prefer the explicit format.
```diff
- var = value # type: Type
+ var: Type = value
```
For model architectures, Thinc also provides a collection of [custom types](https://thinc.ai/docs/api-types), including more specific types for arrays and model inputs/outputs. Even outside of static type checking, using these types will make the code a lot easier to read and follow, since it's always clear what array types are expected (and what might go wrong if the output is different from the expected type).
```python
def build_tagger_model(
    tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None
) -> Model[List[Doc], List[Floats2d]]:
    ...
```
If you need to use a type hint that refers to something later declared in the same module, or the class that a method belongs to, you can use a string value instead:
```python
class SomeClass:
    def from_bytes(self, data: bytes) -> "SomeClass":
        ...
```
In some cases, you won't be able to import a class from a different module to use it as a type hint because it'd cause circular imports. For instance, `spacy/util.py` includes various helper functions that return an instance of `Language`, but we couldn't import it, because `spacy/language.py` imports `util` itself. In this case, we can provide `"Language"` as a string and make the import conditional on `typing.TYPE_CHECKING` so it only runs when the code is evaluated by a type checker:
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from .language import Language

def load_model(name: str) -> "Language":
    ...
```
Note that we typically put the `from typing` import statements on the first line(s) of the Python module.
## Structuring logic
### Positional and keyword arguments
We generally try to avoid writing functions and methods with too many arguments, and use keyword-only arguments wherever possible. Python lets you define arguments as keyword-only by separating them with a `, *`. If you're writing functions with additional arguments that customize the behavior, you typically want to make those arguments keyword-only, so their names have to be provided explicitly.
```diff
- def do_something(name: str, validate: bool = False):
+ def do_something(name: str, *, validate: bool = False):
    ...
- do_something("some_name", True)
+ do_something("some_name", validate=True)
```
This makes the function calls easier to read, because it's immediately clear what the additional values mean. It also makes it easier to extend arguments or change their order later on, because you don't end up with any function calls that depend on a specific positional order.
### Avoid mutable default arguments
A common Python gotcha are [mutable default arguments](https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments): if your argument defines a mutable default value like `[]` or `{}` and then goes and mutates it, the default value is created _once_ when the function is created and the same object is then mutated every time the function is called. This can be pretty unintuitive when you first encounter it. We therefore avoid writing logic that does this.
If your arguments need to default to an empty list or dict, you can use the `SimpleFrozenList` and `SimpleFrozenDict` helpers provided by spaCy. They are simple frozen implementations that raise an error if they're being mutated to prevent bugs and logic that accidentally mutates default arguments.
```diff
- def to_bytes(self, *, exclude: List[str] = []):
+ def to_bytes(self, *, exclude: List[str] = SimpleFrozenList()):
    ...
```
```diff
def do_something(values: List[str] = SimpleFrozenList()):
    if some_condition:
-         values.append("foo")  # raises an error
+         values = [*values, "foo"]
    return values
```
### Don't use `try`/`except` for control flow
We strongly discourage using `try`/`except` blocks for anything that's not third-party error handling or error handling that we otherwise have little control over. There's typically always a way to anticipate the _actual_ problem and **check for it explicitly**, which makes the code easier to follow and understand, and prevents bugs:
```diff
- try:
-     token = doc[i]
- except IndexError:
-     token = doc[-1]
+ if i < len(doc):
+     token = doc[i]
+ else:
+     token = doc[-1]
```
Even if you end up having to check for multiple conditions explicitly, this is still preferred over a catch-all `try`/`except`. It can be very helpful to think about the exact scenarios you need to cover, and what could go wrong at each step, which often leads to better code and fewer bugs. `try/except` blocks can also easily mask _other_ bugs and problems that raise the same errors you're catching, which is obviously bad.
If you have to use `try`/`except`, make sure to only include what's **absolutely necessary** in the `try` block and define the exception(s) explicitly. Otherwise, you may end up masking very different exceptions caused by other bugs.
```diff
- try:
-     value1 = get_some_value()
-     value2 = get_some_other_value()
-     score = external_library.compute_some_score(value1, value2)
- except:
-     score = 0.0
+ value1 = get_some_value()
+ value2 = get_some_other_value()
+ try:
+     score = external_library.compute_some_score(value1, value2)
+ except ValueError:
+     score = 0.0
```
### Avoid lambda functions
`lambda` functions can be useful for defining simple anonymous functions in a single line, but they also introduce problems: for instance, they require [additional logic](https://stackoverflow.com/questions/25348532/can-python-pickle-lambda-functions) in order to be pickled and are pretty ugly to type-annotate. So we typically avoid them in the code base and only use them in the serialization handlers and within tests for simplicity. Instead of `lambda`s, check if your code can be refactored to not need them, or use helper functions instead.
```diff
- split_string: Callable[[str], List[str]] = lambda value: [v.strip() for v in value.split(",")]
+ def split_string(value: str) -> List[str]:
+     return [v.strip() for v in value.split(",")]
```
### Numeric comparisons
For numeric comparisons, as a general rule we always use `<` and `>=` and avoid the usage of `<=` and `>`. This is to ensure we consistently
apply inclusive lower bounds and exclusive upper bounds, helping to prevent off-by-one errors.
One exception to this rule is the chained-comparison case. With a chain like
```python
if value >= 0 and value < max:
    ...
```
it's fine to rewrite this to the shorter form
```python
if 0 <= value < max:
    ...
```
even though this requires the usage of the `<=` operator.
### Iteration and comprehensions
We generally avoid using built-in functions like `filter` or `map` in favor of list or generator comprehensions.
```diff
- filtered = filter(lambda x: x in ["foo", "bar"], values)
+ filtered = (x for x in values if x in ["foo", "bar"])
- filtered = list(filter(lambda x: x in ["foo", "bar"], values))
+ filtered = [x for x in values if x in ["foo", "bar"]]
- result = map(lambda x: { x: x in ["foo", "bar"]}, values)
+ result = ({x: x in ["foo", "bar"]} for x in values)
- result = list(map(lambda x: { x: x in ["foo", "bar"]}, values))
+ result = [{x: x in ["foo", "bar"]} for x in values]
```
If your logic is more complex, it's often better to write a loop instead, even if it adds more lines of code in total. The result will be much easier to follow and understand.
```diff
- result = [{"key": key, "scores": {f"{i}": score for i, score in enumerate(scores)}} for key, scores in values]
+ result = []
+ for key, scores in values:
+     scores_dict = {f"{i}": score for i, score in enumerate(scores)}
+     result.append({"key": key, "scores": scores_dict})
```
### Composition vs. inheritance
Although spaCy uses a lot of classes, **inheritance is viewed with some suspicion** — it's seen as a mechanism of last resort. You should discuss plans to extend the class hierarchy before implementing. Unless you're implementing a new data structure or pipeline component, you typically shouldn't have to use classes at all.
### Don't use `print`
The core library never `print`s anything. While we encourage using `print` statements for simple debugging (it's the most straightforward way of looking at what's happening), make sure to clean them up once you're ready to submit your pull request. If you want to output warnings or debugging information for users, use the respective dedicated mechanisms for this instead (see sections on warnings and logging for details).
The only exceptions are the CLI functions, which pretty-print messages for the user, and methods that are explicitly intended for printing things, e.g. `Language.analyze_pipes` with `pretty=True` enabled. For this, we use our lightweight helper library [`wasabi`](https://github.com/ines/wasabi).
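For example, a small `wasabi` usage sketch (the messages are made up for illustration):
```python
from wasabi import msg

msg.good("Created output directory")  # green success message
msg.warn("No config value set, using defaults")
msg.fail("Couldn't load pipeline")  # red failure message
```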
## Naming
Naming is hard and often a topic of long internal discussions. We don't expect you to come up with the perfect names for everything you write; finding the right names is often an iterative and collaborative process. That said, we do try to follow some basic conventions.
Consistent with general Python conventions, we use `CamelCase` for class names including dataclasses, `snake_case` for methods, functions and variables, and `UPPER_SNAKE_CASE` for constants, typically defined at the top of a module. We also avoid using variable names that shadow the names of built-in functions, e.g. `input`, `help` or `list`.
### Naming variables
Variable names should always make it clear _what exactly_ the variable is and what it's used for. Instances of common classes should use the same consistent names. For example, you should avoid naming a text string (or anything else that's not a `Doc` object) `doc`. The most common class-to-variable mappings are:
| Class | Variable | Example |
| ---------- | --------------------- | ------------------------------------------- |
| `Language` | `nlp` | `nlp = spacy.blank("en")` |
| `Doc` | `doc` | `doc = nlp("Some text")` |
| `Span` | `span`, `ent`, `sent` | `span = doc[1:4]`, `ent = doc.ents[0]` |
| `Token` | `token` | `token = doc[0]` |
| `Lexeme` | `lexeme`, `lex` | `lex = nlp.vocab["foo"]` |
| `Vocab` | `vocab` | `vocab = Vocab()` |
| `Example` | `example`, `eg` | `example = Example.from_dict(doc, gold)` |
| `Config` | `config`, `cfg` | `config = Config().from_disk("config.cfg")` |
We try to avoid introducing too many temporary variables, as these clutter your namespace. It's okay to re-assign to an existing variable, but only if the value has the same type.
```diff
ents = get_a_list_of_entities()
ents = [ent for ent in doc.ents if ent.label_ == "PERSON"]
- ents = {(ent.start, ent.end): ent.label_ for ent in ents}
+ ent_mappings = {(ent.start, ent.end): ent.label_ for ent in ents}
```
### Naming methods and functions
Try choosing short and descriptive names wherever possible and imperative verbs for methods that do something, e.g. `disable_pipes`, `add_patterns` or `get_vector`. Private methods and functions that are not intended to be part of the user-facing API should be prefixed with an underscore `_`. It's often helpful to look at the existing classes for inspiration.
Objects that can be serialized, e.g. data structures and pipeline components, should implement the same consistent methods for serialization. Those usually include at least `to_disk`, `from_disk`, `to_bytes` and `from_bytes`. Some objects can also implement more specific methods like `{to/from}_dict` or `{to/from}_str`.
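A minimal sketch of what this consistent API can look like on a hypothetical object (real components typically handle the `exclude` argument and carry more state):
```python
from pathlib import Path

import srsly  # serialization helpers used across the spaCy ecosystem

class MyState:
    def __init__(self) -> None:
        self.data = {}

    def to_bytes(self, *, exclude=tuple()) -> bytes:
        return srsly.msgpack_dumps(self.data)

    def from_bytes(self, bytes_data: bytes, *, exclude=tuple()) -> "MyState":
        self.data = srsly.msgpack_loads(bytes_data)
        return self

    def to_disk(self, path, *, exclude=tuple()) -> None:
        Path(path).write_bytes(self.to_bytes())

    def from_disk(self, path, *, exclude=tuple()) -> "MyState":
        return self.from_bytes(Path(path).read_bytes())
```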
## Error handling
We always encourage writing helpful and detailed custom error messages for everything we can anticipate going wrong, and including as much detail as possible. spaCy provides a directory of error messages in `errors.py` with unique codes for each message. This allows us to keep the code base more concise and avoids long and nested blocks of texts throughout the code that disrupt the reading flow. The codes make it easy to find references to the same error in different places, and also helps identify problems reported by users (since we can just search for the error code).
Errors can be referenced via their code, e.g. `Errors.E123`. Messages can also include placeholders for values that can be populated by formatting the string with `.format()`.
```python
class Errors:
E123 = "Something went wrong"
E456 = "Unexpected value: {value}"
```
```diff
if something_went_wrong:
-     raise ValueError("Something went wrong!")
+     raise ValueError(Errors.E123)
if not isinstance(value, int):
-     raise ValueError(f"Unexpected value: {value}")
+     raise ValueError(Errors.E456.format(value=value))
```
As a general rule of thumb, all error messages raised within the **core library** should be added to `Errors`. The only place where we write errors and messages as strings is `spacy.cli`, since these functions typically pretty-print and generate a lot of output that'd otherwise be very difficult to separate from the actual logic.
### Re-raising exceptions
If we anticipate possible errors in third-party code that we don't control, or our own code in a very different context, we typically try to provide custom and more specific error messages if possible. If we need to re-raise an exception within a `try`/`except` block, we can re-raise a custom exception.
[Re-raising `from`](https://docs.python.org/3/tutorial/errors.html#exception-chaining) the original caught exception lets us chain the exceptions, so the user sees both the original error, as well as the custom message with a note "The above exception was the direct cause of the following exception".
```diff
try:
    run_third_party_code_that_might_fail()
except ValueError as e:
+     raise ValueError(Errors.E123) from e
```
In some cases, it makes sense to suppress the original exception, e.g. if we know what it is and know that it's not particularly helpful. In that case, we can raise `from None`. This prevents clogging up the user's terminal with multiple and irrelevant chained exceptions.
```diff
try:
    run_our_own_code_that_might_fail_confusingly()
except ValueError:
+     raise ValueError(Errors.E123) from None
```
### Avoid using naked `assert`
During development, it can sometimes be helpful to add `assert` statements throughout your code to make sure that the values you're working with are what you expect. However, as you clean up your code, those should either be removed or replaced by more explicit error handling:
```diff
- assert score >= 0.0
+ if score < 0.0:
+     raise ValueError(Errors.E789.format(score=score))
```
Otherwise, the user will get to see a naked `AssertionError` with no further explanation, which is very unhelpful. Instead of adding an error message to `assert`, it's always better to `raise` more explicit errors for specific conditions. If you're checking for something that _has to be right_ and would otherwise be a bug in spaCy, you can express this in the error message:
```python
E161 = ("Found an internal inconsistency when predicting entity links. "
"This is likely a bug in spaCy, so feel free to open an issue: "
"https://github.com/explosion/spaCy/issues")
```
### Warnings
Instead of raising an error, some parts of the code base can raise warnings to notify the user of a potential problem. This is done using Python's `warnings.warn` and the messages defined in `Warnings` in the `errors.py`. Whether or not warnings are shown can be controlled by the user, including custom filters for disabling specific warnings using a regular expression matching our internal codes, e.g. `W123`.
```diff
- print("Warning: No examples provided for validation")
+ warnings.warn(Warnings.W123)
```
When adding warnings, make sure you're not calling `warnings.warn` repeatedly, e.g. in a loop, which will clog up the terminal output. Instead, you can collect the potential problems first and then raise a single warning. If the problem is critical, consider raising an error instead.
```diff
+ n_empty = 0
for spans in lots_of_annotations:
    if len(spans) == 0:
-         warnings.warn(Warnings.W456)
+         n_empty += 1
+ warnings.warn(Warnings.W456.format(count=n_empty))
```
### Logging
Log statements can be added via spaCy's `logger`, which uses Python's native `logging` module under the hood. We generally only use logging for debugging information that **the user may choose to see** in debugging mode or that's **relevant during training** but not at runtime.
```diff
+ logger.info("Set up nlp object from config")
config = nlp.config.interpolate()
```
`spacy train` and similar CLI commands will enable all log statements of level `INFO` by default (which is not the case at runtime). This allows outputting specific information within certain parts of the core library during training, without having it shown at runtime. `DEBUG`-level logs are only shown if the user enables `--verbose` logging during training. They can be used to provide more specific and potentially more verbose details, especially in areas that can indicate bugs or problems, or to surface more details about what spaCy does under the hood. You should only use logging statements if absolutely necessary and important.
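A quick sketch, assuming the logger is the standard `logging.Logger` registered under the name `"spacy"`:
```python
import logging

logger = logging.getLogger("spacy")
# INFO shows up during `spacy train`; DEBUG only with --verbose
logger.info("Set up nlp object from config")
logger.debug("Loading corpus from disk")
```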
## Writing tests
spaCy uses the [`pytest`](http://doc.pytest.org/) framework for testing. Tests for spaCy modules and classes live in their own directories of the same name and all test files should be prefixed with `test_`. Tests included in the core library only cover the code and do not depend on any trained pipelines. When implementing a new feature or fixing a bug, it's usually good to start by writing some tests that describe what _should_ happen. As you write your code, you can then keep running the relevant tests until all of them pass.
### Test suite structure
When adding tests, make sure to use descriptive names and only test for one behavior at a time. Tests should be grouped into modules dedicated to the same type of functionality and some test modules are organized as directories of test files related to the same larger area of the library, e.g. `matcher` or `tokenizer`.
Regression tests are tests that refer to bugs reported in specific issues. They should live in the relevant module of the test suite, named according to the issue number (e.g., `test_issue1234.py`), and [marked](https://docs.pytest.org/en/6.2.x/example/markers.html#working-with-custom-markers) appropriately (e.g. `@pytest.mark.issue(1234)`). This system allows us to relate tests for specific bugs back to the original reported issue, which is especially useful if we introduce a regression and a previously passing regression test suddenly fails again. When fixing a bug, it's often useful to create a regression test for it first.
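A sketch of what such a regression test can look like (the issue number and assertion are made up; `issue` is spaCy's custom marker):
```python
import pytest
from spacy.tokens import Doc
from spacy.vocab import Vocab

@pytest.mark.issue(1234)
def test_issue1234():
    # Hypothetical fix being verified: Doc construction with plain words
    doc = Doc(Vocab(), words=["hello", "world"])
    assert len(doc) == 2
```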
The test suite also provides [fixtures](https://github.com/explosion/spaCy/blob/master/spacy/tests/conftest.py) for different language tokenizers that can be used as function arguments of the same name and will be passed in automatically. Those should only be used for tests related to those specific languages. We also have [test utility functions](https://github.com/explosion/spaCy/blob/master/spacy/tests/util.py) for common operations, like creating a temporary file.
### Testing Cython Code
If you're developing Cython code (`.pyx` files), those extensions will need to be built before the test runner can test that code - otherwise it's going to run the tests with stale code from the last time the extension was built. You can build the extensions locally with `python setup.py build_ext -i`.
### Constructing objects and state
Test functions usually follow the same simple structure: they set up some state, perform the operation you want to test and `assert` conditions that you expect to be true, usually before and after the operation.
Tests should focus on exactly what they're testing and avoid dependencies on other unrelated library functionality wherever possible. If all your test needs is a `Doc` object with certain annotations set, you should always construct it manually:
```python
def test_doc_creation_with_pos():
doc = Doc(Vocab(), words=["hello", "world"], pos=["NOUN", "VERB"])
assert doc[0].pos_ == "NOUN"
assert doc[1].pos_ == "VERB"
```
### Parametrizing tests
If you need to run the same test function over different input examples, you usually want to parametrize the test cases instead of using a loop within your test. This lets you keep a better separation between test cases and test logic, and it'll result in more useful output because `pytest` will be able to tell you which exact test case failed.
The `@pytest.mark.parametrize` decorator takes two arguments: a string defining one or more comma-separated arguments that should be passed to the test function and a list of corresponding test cases (or a list of tuples to provide multiple arguments).
```python
@pytest.mark.parametrize("words", [["hello", "world"], ["this", "is", "a", "test"]])
def test_doc_length(words):
    doc = Doc(Vocab(), words=words)
    assert len(doc) == len(words)
```
```python
@pytest.mark.parametrize("text,expected_len", [("hello world", 2), ("I can't!", 4)])
def test_token_length(en_tokenizer, text, expected_len): # en_tokenizer is a fixture
    doc = en_tokenizer(text)
    assert len(doc) == expected_len
```
You can also stack `@pytest.mark.parametrize` decorators, although this is not recommended unless it's absolutely needed or required for the test. When stacking decorators, keep in mind that this will run the test with all possible combinations of the respective parametrized values, which is often not what you want and can slow down the test suite.
### Handling failing tests
`xfail` means that a test **should pass but currently fails**, i.e. is expected to fail. You can mark a test as currently xfailing by adding the `@pytest.mark.xfail` decorator. This should only be used for tests that don't yet work, not for logic that causes errors we raise on purpose (see the section on testing errors for this). It's often very helpful to implement tests for edge cases that we don't yet cover and mark them as `xfail`. You can also provide a `reason` keyword argument to the decorator with an explanation of why the test currently fails.
```diff
+ @pytest.mark.xfail(reason="Issue #225 - not yet implemented")
def test_en_tokenizer_splits_em_dash_infix(en_tokenizer):
doc = en_tokenizer("Will this road take me to Puddleton?\u2014No.")
assert doc[8].text == "\u2014"
```
When you run the test suite, you may come across tests that are reported as `xpass`. This means that they're marked as `xfail` but didn't actually fail. This is worth looking into: sometimes, it can mean that we have since fixed a bug that caused the test to previously fail, so we can remove the decorator. In other cases, especially when it comes to machine learning model implementations, it can also indicate that the **test is flaky**: it sometimes passes and sometimes fails. This can be caused by a bug, or by constraints being too narrowly defined. If a test shows different behavior depending on whether it's run in isolation or not, this can indicate that it reacts to global state set in a previous test, which is not ideal and should be avoided.
### Writing slow tests
If a test is useful but potentially quite slow, you can mark it with the `@pytest.mark.slow` decorator. This is a special marker we introduced and tests decorated with it only run if you run the test suite with `--slow`, but not as part of the main CI process. Before introducing a slow test, double-check that there isn't another and more efficient way to test for the behavior. You should also consider adding a simpler test with maybe only a subset of the test cases that can always run, so we at least have some coverage.
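For example (a hypothetical test that loops over many input sizes, so it only runs with `--slow`):
```python
import pytest
from spacy.tokens import Doc
from spacy.vocab import Vocab

@pytest.mark.slow
def test_doc_creation_many_lengths():
    # Exhaustive check over many sizes, too slow for the main CI run
    for n in range(1, 500):
        doc = Doc(Vocab(), words=["word"] * n)
        assert len(doc) == n
```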
### Skipping tests
The `@pytest.mark.skip` decorator lets you skip tests entirely. You only want to do this for failing tests that may be slow to run or cause memory errors or segfaults, which would otherwise terminate the entire process and wouldn't be caught by `xfail`. We also sometimes use the `skip` decorator for old and outdated regression tests that we want to keep around but that don't apply anymore. When using the `skip` decorator, make sure to provide the `reason` keyword argument with a quick explanation of why you chose to skip this test.
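For example (a hypothetical skipped test; always provide a reason):
```python
import pytest

@pytest.mark.skip(reason="Segfaults on some platforms, see issue tracker")
def test_old_vector_regression():
    ...
```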
### Testing errors and warnings
`pytest` lets you check whether a given error is raised by using the `pytest.raises` contextmanager. This is very useful when implementing custom error handling, so make sure you're not only testing for the correct behavior but also for errors resulting from incorrect inputs. If you're testing errors, you should always check for `pytest.raises` explicitly and not use `xfail`.
```python
words = ["a", "b", "c", "d", "e"]
ents = ["Q-PERSON", "I-PERSON", "O", "I-PERSON", "I-GPE"]
with pytest.raises(ValueError):
    Doc(Vocab(), words=words, ents=ents)
```
You can also use the `pytest.warns` contextmanager to check that a given warning type is raised. The first argument is the warning type or `None` (which will capture a list of warnings that you can `assert` is empty).
```python
def test_phrase_matcher_validation(en_vocab):
    doc1 = Doc(en_vocab, words=["Test"], deps=["ROOT"])
    doc2 = Doc(en_vocab, words=["Test"])
    matcher = PhraseMatcher(en_vocab, validate=True)
    with pytest.warns(UserWarning):
        # Warn about unnecessarily parsed document
        matcher.add("TEST1", [doc1])
    with pytest.warns(None) as record:
        matcher.add("TEST2", [doc2])
    assert not record.list
```
Keep in mind that your tests will fail if you're using the `pytest.warns` contextmanager with a given warning and the warning is _not_ shown. So you should only use it to check that spaCy handles and outputs warnings correctly. If your test outputs a warning that's expected but not relevant to what you're testing, you can use the `@pytest.mark.filterwarnings` decorator and ignore specific warnings starting with a given code:
```python
@pytest.mark.filterwarnings("ignore:\\[W036")
def test_matcher_empty(en_vocab):
    matcher = Matcher(en_vocab)
    matcher(Doc(en_vocab, words=["test"]))
```
### Testing trained pipelines
Our regular test suite does not depend on any of the trained pipelines, since their outputs can vary and aren't generally required to test the library functionality. We test pipelines separately using the tests included in the [`spacy-models`](https://github.com/explosion/spacy-models) repository, which run whenever we train a new suite of models. The tests here mostly focus on making sure that the packages can be loaded and that the predictions seem reasonable, and they include checks for common bugs we encountered previously. If your test does not primarily focus on verifying a model's predictions, it should be part of the core library tests and construct the required objects manually, instead of being added to the models tests.
Keep in mind that specific predictions may change, and we can't test for all incorrect predictions reported by users. Different models make different mistakes, so even a model that's significantly more accurate overall may end up making wrong predictions that it previously didn't. However, some surprising incorrect predictions may indicate deeper bugs that we definitely want to investigate.

View File

@ -0,0 +1,56 @@
# Explosion-bot
Explosion-bot is a robot that can be invoked to help with running particular test commands.
## Permissions
Only maintainers have permissions to summon explosion-bot. Each of the open source repos that use explosion-bot has its own team(s) of maintainers, and only github users who are members of those teams can successfully run bot commands.
## Running robot commands
To summon the robot, write a github comment on the issue/PR you wish to test. The comment must be in the following format:
```
@explosion-bot please test_gpu
```
Some things to note:
- The `@explosion-bot please` must be the beginning of the command - you cannot add anything in front of this or else the robot won't know how to parse it. Adding anything at the end aside from the test name will also confuse the robot, so keep it simple!
- The command name (such as `test_gpu`) must be one of the tests that the bot knows how to run. The available commands are documented in the bot's [workflow config](https://github.com/explosion/spaCy/blob/master/.github/workflows/explosionbot.yml#L26) and must match exactly one of the commands listed there.
- The robot can't do multiple things at once, so if you want it to run multiple tests, you'll have to summon it with one comment per test.
### Examples
- Execute spaCy slow GPU tests with a custom thinc branch from a spaCy PR:
```
@explosion-bot please test_slow_gpu --thinc-branch <branch_name>
```
`branch_name` can either be a named branch, e.g. `develop`, or an unmerged PR, e.g. `refs/pull/<pr_number>/head`.
- Execute spaCy Transformers GPU tests from a spaCy PR:
```
@explosion-bot please test_gpu --run-on spacy-transformers --run-on-branch master --spacy-branch current_pr
```
This will launch the GPU pipeline for the `spacy-transformers` repo on its `master` branch, using the current spaCy PR's branch to build spaCy. The name of the repository passed to `--run-on` is case-sensitive, e.g. use `spaCy` instead of `spacy`.
- General info about supported commands.
```
@explosion-bot please info
```
- Help text for a specific command
```
@explosion-bot please <command> --help
```
## Troubleshooting
If the robot isn't responding to commands as expected, you can check its logs in the [Github Action](https://github.com/explosion/spaCy/actions/workflows/explosionbot.yml).
For each command sent to the bot, there should be a run of the `explosion-bot` workflow. In the `Install and run explosion-bot` step, towards the end of the logs, you should see info about the configuration that the bot was run with, as well as any errors that the bot encountered.

View File

@ -0,0 +1,150 @@
# Language
> Reference: `spacy/language.py`
1. [Constructing the `nlp` object from a config](#1-constructing-the-nlp-object-from-a-config)
- [A. Overview of `Language.from_config`](#1a-overview)
- [B. Component factories](#1b-how-pipeline-component-factories-work-in-the-config)
- [C. Sourcing a component](#1c-sourcing-a-pipeline-component)
- [D. Tracking components as they're modified](#1d-tracking-components-as-theyre-modified)
- [E. spaCy's config utility function](#1e-spacys-config-utility-functions)
2. [Initialization](#initialization)
- [A. Initialization for training](#2a-initialization-for-training): `init_nlp`
- [B. Initializing the `nlp` object](#2b-initializing-the-nlp-object): `Language.initialize`
- [C. Initializing the vocab](#2c-initializing-the-vocab): `init_vocab`
## 1. Constructing the `nlp` object from a config
### 1A. Overview
Most of the functions referenced in the config are regular functions with arbitrary arguments registered via the function registry. However, the pipeline components are a bit special: they don't only receive arguments passed in via the config file, but also the current `nlp` object and the string `name` of the individual component instance (so a user can have multiple components created with the same factory, e.g. `ner_one` and `ner_two`). This name can then be used by the components to add to the losses and scores. This special requirement means that pipeline components can't just be resolved via the config the "normal" way: we need to retrieve the component functions manually and pass them their arguments, plus the `nlp` and `name`.
The `Language.from_config` classmethod takes care of constructing the `nlp` object from a config. It's the single place where this happens and what `spacy.load` delegates to under the hood. Its main responsibilities are:
- **Load and validate the config**, and optionally **auto-fill** all missing values that we either have defaults for in the config template or that registered function arguments define defaults for. This helps ensure backwards-compatibility, because we're able to add a new argument `foo: str = "bar"` to an existing function, without breaking configs that don't specify it.
- **Execute relevant callbacks** for pipeline creation, e.g. optional functions called before and after creation of the `nlp` object and pipeline.
- **Initialize language subclass and create tokenizer**. The `from_config` classmethod will always be called on a language subclass, e.g. `English`, not on `Language` directly. Initializing the subclass takes a callback to create the tokenizer.
- **Set up the pipeline components**. Components can either refer to a component factory or a `source`, i.e. an existing pipeline that's loaded and that the component is then copied from. We also need to ensure that we update the information about which components are disabled.
- **Manage listeners.** If sourced components "listen" to other components (`tok2vec`, `transformer`), we need to ensure that the references are valid. If the config specifies that listeners should be replaced by copies (e.g. to give the `ner` component its own `tok2vec` model instead of listening to the shared `tok2vec` component in the pipeline), we also need to take care of that.
Note that we only resolve and load **selected sections** in `Language.from_config`, i.e. only the parts that are relevant at runtime, which is `[nlp]` and `[components]`. We don't want to be resolving anything related to training or initialization, since this would mean loading and constructing unnecessary functions, including functions that require information that isn't necessarily available at runtime, like `paths.train`.
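As a usage sketch (a deliberately minimal config; `sentencizer` is just a convenient built-in factory to demonstrate with, and the exact defaults are filled in automatically):
```python
from thinc.api import Config
from spacy.lang.en import English

config_str = """
[nlp]
lang = "en"
pipeline = ["sentencizer"]

[components]

[components.sentencizer]
factory = "sentencizer"
"""
config = Config().from_str(config_str)
# spacy.load delegates to this under the hood; auto_fill adds defaults
nlp = English.from_config(config, auto_fill=True)
print(nlp.pipe_names)  # ["sentencizer"]
```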
### 1B. How pipeline component factories work in the config
As opposed to regular registered functions that refer to a registry and function name (e.g. `"@misc": "foo.v1"`), pipeline components follow a different format and refer to their component `factory` name. This corresponds to the name defined via the `@Language.component` or `@Language.factory` decorator. We need this decorator to define additional meta information for the components, like their default config and score weights.
```ini
[components.my_component]
factory = "foo"
some_arg = "bar"
other_arg = ${paths.some_path}
```
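The factory this config block refers to could be registered like this (a sketch; `foo` and its arguments are hypothetical):
```python
from spacy.language import Language
from spacy.tokens import Doc

@Language.factory("foo", default_config={"some_arg": "default"})
def create_foo(nlp: Language, name: str, some_arg: str, other_arg: str):
    # nlp and name are passed in automatically; the rest comes from the config
    def foo_component(doc: Doc) -> Doc:
        return doc

    return foo_component
```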
This means that we need to create and resolve the `config["components"]` separately from the rest of the config. There are some important considerations and things we need to manage explicitly to avoid unexpected behavior:
#### Variable interpolation
When a config is resolved, references to variables are replaced, so that the functions receive the correct value instead of just the variable name. To interpolate a config, we need it in its entirety: we couldn't just interpolate a subsection that refers to variables defined in a different subsection. So we first interpolate the entire config.
However, the `nlp.config` should include the original config with variables intact; otherwise, loading a pipeline and saving it to disk will destroy all logic implemented via variables and hard-code the values all over the place. This means that when we create the components, we need to keep two versions of the config: the interpolated config with the "real" values and the `raw_config` including the variable references.
#### Factory registry
Component factories are special and use the `@Language.factory` or `@Language.component` decorator to register themselves and their meta. When the decorator runs, it performs some basic validation, stores the meta information for the factory on the `Language` class (default config, scores etc.) and then adds the factory function to `registry.factories`. The `component` decorator can be used for registering simple functions that just take a `Doc` object and return it, so in that case, we create the factory for the user automatically.
There's one important detail to note about how factories are registered via entry points: A package that wants to expose spaCy components still needs to register them via the `@Language` decorators so we have the component meta information and can perform required checks. All we care about here is that the decorated function is **loaded and imported**. When it is, the `@Language` decorator takes care of everything, including actually registering the component factory.
Normally, adding to the registry via an entry point will just add the function to the registry under the given name. But for `spacy_factories`, we don't actually want that: all we care about is that the function decorated with `@Language` is imported so the decorator runs. So we only exploit Python's entry point system to automatically import the function, and the `spacy_factories` entry point group actually adds to a **separate registry**, `registry._factories`, under the hood. Its only purpose is that the functions are imported. The decorator then runs, creates the factory if needed and adds it to the `registry.factories` registry.
#### Language-specific factories
spaCy supports registering factories on the `Language` base class, as well as language-specific subclasses like `English` or `German`. This allows providing different factories depending on the language, e.g. a different default lemmatizer. The `Language.get_factory_name` classmethod constructs the factory name as `{lang}.{name}` if a language is available (i.e. if it's a subclass) and falls back to `{name}` otherwise. So `@German.factory("foo")` will add a factory `de.foo` under the hood. If you call `nlp.add_pipe("foo")`, we first check if there's a factory for `{nlp.lang}.foo` and if not, we fall back to checking for a factory `foo`.
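For illustration, registering a hypothetical language-specific factory and relying on the lookup order described above:

```python
from spacy.lang.de import German
from spacy.language import Language

@German.factory("my_lemmatizer")  # registered as "de.my_lemmatizer" under the hood
def create_my_lemmatizer(nlp: Language, name: str):
    def my_lemmatizer(doc):
        return doc
    return my_lemmatizer

nlp = German()
nlp.add_pipe("my_lemmatizer")  # resolves "de.my_lemmatizer" before falling back to "my_lemmatizer"
```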
#### Creating a pipeline component from a factory
`Language.add_pipe` takes care of adding a pipeline component, given its factory name and its config. If no source pipeline to copy the component from is provided, it delegates to `Language.create_pipe`, which sets up the actual component function via the following steps (a short sketch follows the list):
- Validate the config and make sure that the factory was registered via the decorator and that we have meta for it.
- Update the component config with any defaults specified by the component's `default_config`, if available. This is done by merging the values we receive into the defaults. It ensures that you can still add a component without having to specify its _entire_ config including more complex settings like `model`. If no `model` is defined, we use the default.
- Check if we have a language-specific factory for the given `nlp.lang` and if not, fall back to the global factory.
- Construct the component config, consisting of whatever arguments were provided, plus the current `nlp` object and `name`, which are default expected arguments of all factories. We also add a reference to the `@factories` registry, so we can resolve the config via the registry, like any other config. With the added `nlp` and `name`, it should now include all expected arguments of the given function.
- Fill the config to make sure all unspecified defaults from the function arguments are added and update the `raw_config` (uninterpolated, with variables intact) with that information, so the component config we store in `nlp.config` is up to date. We do this by adding the `raw_config` _into_ the filled config; otherwise, the references to variables would be overwritten.
- Resolve the config and create all functions it refers to (e.g. `model`). This gives us the actual component function that we can insert into the pipeline.
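A small illustration of the default-filling behavior described above:

```python
import spacy

nlp = spacy.blank("en")
# Only overrides need to be passed to add_pipe; everything else, including
# the default model, is filled in from the factory's default config.
nlp.add_pipe("ner")
print(nlp.config["components"]["ner"])  # the fully filled component config
```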
### 1C. Sourcing a pipeline component
```ini
[components.ner]
source = "en_core_web_sm"
```
spaCy also allows ["sourcing" a component](https://spacy.io/usage/processing-pipelines#sourced-components), which will copy it over from an existing pipeline. In this case, `Language.add_pipe` will delegate to `Language.create_pipe_from_source`. In order to copy a component effectively and validate it, the source pipeline first needs to be loaded. This is done in `Language.from_config`, so a source pipeline only has to be loaded once if multiple components source from it. Sourcing a component will perform the following checks and modifications:
- For each sourced pipeline component loaded in `Language.from_config`, a hash of the vectors data from the source pipeline is stored in the pipeline meta so we're able to check whether the vectors match and warn if not (since different vectors that are used as features in components can lead to degraded performance). Because the vectors are not loaded at the point when components are sourced, the check is postponed to `init_vocab` as part of `Language.initialize`.
- If the sourced pipeline component is loaded through `Language.add_pipe(source=)`, the vectors are already loaded and can be compared directly. The check compares the shape and keys first and finally falls back to comparing the actual byte representation of the vectors (which is slower).
- Ensure that the component is available in the pipeline.
- Interpolate the entire config of the source pipeline so all variables are replaced and the component's config that's copied over doesn't include references to variables that are not available in the destination config.
- Add the source `vocab.strings` to the destination's `vocab.strings` so we don't end up with unavailable strings in the final pipeline (which would also include labels used by the sourced component).
Note that there may be other incompatibilities that we're currently not checking for and that could cause a sourced component to not work in the destination pipeline. We're interested in adding more checks here but there'll always be a small number of edge cases we'll never be able to catch, including a sourced component depending on other pipeline state that's not available in the destination pipeline.
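The runtime equivalent of the config block above (assuming the `en_core_web_sm` package is installed):

```python
import spacy

source_nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
# Copies the trained component along with its config; the source strings
# are merged into the destination vocab.
nlp.add_pipe("ner", source=source_nlp)
```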
### 1D. Tracking components as they're modified
The `Language` class implements methods for removing, replacing or renaming pipeline components. Whenever we make these changes, we need to update the information stored on the `Language` object to ensure that it matches the current state of the pipeline. If a user just writes to `nlp.config` manually, we obviously can't ensure that the config matches the reality but since we offer modification via the pipe methods, it's expected that spaCy keeps the config in sync under the hood. Otherwise, saving a modified pipeline to disk and loading it back wouldn't work. The internal attributes we need to keep in sync here are:
| Attribute | Type | Description |
| ------------------------ | ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Language._components` | `List[Tuple[str, Callable]]` | All pipeline components as `(name, func)` tuples. This is used as the source of truth for `Language.pipeline`, `Language.pipe_names` and `Language.components`. |
| `Language._pipe_meta` | `Dict[str, FactoryMeta]` | The meta information of a component's factory, keyed by component name. This can include multiple components referring to the same factory meta. |
| `Language._pipe_configs` | `Dict[str, Config]` | The component's config, keyed by component name. |
| `Language._disabled` | `Set[str]` | Names of components that are currently disabled. |
| `Language._config` | `Config` | The underlying config. This is only internals and will be used as the basis for constructing the config in the `Language.config` property. |
In addition to the actual component settings in `[components]`, the config also allows specifying component-specific arguments via the `[initialize.components]` block, which are passed to the component's `initialize` method during initialization if it's available. So we also need to keep this in sync in the underlying config.
### 1E. spaCy's config utility functions
When working with configs in spaCy, make sure to use the utility functions provided by spaCy if available, instead of calling the respective `Config` methods. The utilities take care of providing spaCy-specific error messages and ensure a consistent order of config sections by setting the `section_order` argument. This ensures that exported configs always have the same consistent format. A short example follows the list.
- `util.load_config`: load a config from a file
- `util.load_config_from_str`: load a config from a string representation
- `util.copy_config`: deepcopy a config
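For example:

```python
from spacy import util

config = util.load_config("config.cfg", interpolate=False)
config_copy = util.copy_config(config)
roundtripped = util.load_config_from_str(config.to_str())
```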
## 2. Initialization
Initialization is a separate step of the [config lifecycle](https://spacy.io/usage/training#config-lifecycle) that's not performed at runtime. It's implemented via the `training.initialize.init_nlp` helper and calls into the `Language.initialize` method, which sets up the pipeline and component models before training. The `initialize` method takes a callback that returns a sample of examples, which is used to initialize the component models, add all required labels and perform shape inference if applicable.
Components can also define custom initialization settings via the `[initialize.components]` block, e.g. if they require external data like lookup tables to be loaded in. All config settings defined here will be passed to the component's `initialize` method, if it implements one. Components are expected to handle their own serialization after they're initialized so that any data or settings they require are saved with the pipeline and will be available from disk when the pipeline is loaded back at runtime.
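As a sketch, a (hypothetical) component that accepts settings from `[initialize.components]` might look like this:

```python
from spacy.language import Language

class MyComponent:
    def __init__(self, name: str):
        self.name = name
        self.labels = []

    def __call__(self, doc):
        return doc

    def initialize(self, get_examples, nlp=None, labels=None):
        # Settings from [initialize.components.my_component] arrive as
        # keyword arguments, alongside the example callback and nlp object.
        if labels is not None:
            self.labels = list(labels)

@Language.factory("my_component")
def create_my_component(nlp: Language, name: str):
    return MyComponent(name)
```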
### 2A. Initialization for training
The `init_nlp` function is called before training and returns an initialized `nlp` object that can be updated with the examples. It only needs the config and does the following (a usage sketch follows the list):
- Load and validate the config. In order to validate certain settings like the `seed`, we also interpolate the config to get the final value (because in theory, a user could provide this via a variable).
- Set up the GPU allocation, if required.
- Create the `nlp` object from the raw, uninterpolated config, which delegates to `Language.from_config`. Since this method may modify and auto-fill the config and pipeline component settings, we then use the interpolated version of `nlp.config` going forward, to ensure that what we're training with is up to date.
- Resolve the `[training]` block of the config and perform validation, e.g. to check that the corpora are available.
- Determine the components that should be frozen (not updated during training) or resumed (sourced components from a different pipeline that should be updated from the examples and not reset and re-initialized). To resume training, we can call the `nlp.resume_training` method.
- Initialize the `nlp` object via `nlp.initialize` and pass it a `get_examples` callback that returns the training corpus (used for shape inference, setting up labels etc.). If the training corpus is streamed, we only provide a small sample of the data, which can potentially be infinite. `nlp.initialize` will delegate to the components as well and pass the data sample forward.
- Check the listeners and warn about component dependencies, e.g. if a frozen component listens to a component that is retrained, or vice versa (which can degrade results).
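In code, the training entry point boils down to something like this (a sketch, assuming a `config.cfg` on disk):

```python
from spacy import util
from spacy.training.initialize import init_nlp

config = util.load_config("config.cfg", interpolate=False)
nlp = init_nlp(config, use_gpu=-1)  # -1 means CPU
```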
### 2B. Initializing the `nlp` object
The `Language.initialize` method does the following (a usage sketch follows the list):
- **Resolve the config** defined in the `[initialize]` block separately (since everything else is already available in the loaded `nlp` object), based on the fully interpolated config.
- **Execute callbacks**, i.e. `before_init` and `after_init`, if they're defined.
- **Initialize the vocab**, including vocab data, lookup tables and vectors.
- **Initialize the tokenizer** if it implements an `initialize` method. This is not the case for the default tokenizers, but it allows custom tokenizers to depend on external data resources that are loaded in on initialization.
- **Initialize all pipeline components** if they implement an `initialize` method and pass them the `get_examples` callback, the current `nlp` object, as well as additional initialization config settings provided in the component-specific block.
- **Initialize pretraining** if a `[pretraining]` block is available in the config. This allows loading pretrained tok2vec weights in `spacy pretrain`.
- **Register listeners** if token-to-vector embedding layers of a component model "listen" to a previous component (`tok2vec`, `transformer`) in the pipeline.
- **Create an optimizer** on the `Language` class, either by adding the optimizer passed as `sgd` to `initialize`, or by creating the optimizer defined in the config's training settings.
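A minimal usage sketch:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("textcat")

def get_examples():
    doc = nlp.make_doc("A short example.")
    return [Example.from_dict(doc, {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})]

# Infers labels and shapes from the data sample and returns the optimizer.
optimizer = nlp.initialize(get_examples)
```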
### 2C. Initializing the vocab
Vocab initialization is handled in the `training.initialize.init_vocab` helper. It takes the relevant loaded functions and values from the config and takes care of the following (a small example follows the list):
- Add lookup tables defined in the config initialization, e.g. custom lemmatization tables. Those will be added to `nlp.vocab.lookups` from where they can be accessed by components.
- Add JSONL-formatted [vocabulary data](https://spacy.io/api/data-formats#vocab-jsonl) to pre-populate the lexical attributes.
- Load vectors into the pipeline. Vectors are defined as a name or path to a saved `nlp` object containing the vectors, e.g. `en_vectors_web_lg`. It's loaded and the vectors are ported over, while ensuring that all source strings are available in the destination strings. We also warn if there's a mismatch between sourced vectors, since this can lead to problems.
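For example, a lookup table added during initialization ends up in `nlp.vocab.lookups`, where components can access it (table name here is made up):

```python
import spacy

nlp = spacy.blank("en")
nlp.vocab.lookups.add_table("lemma_custom", {"wings": "wing"})
table = nlp.vocab.lookups.get_table("lemma_custom")
assert table["wings"] == "wing"
```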


@ -0,0 +1,235 @@
# Listeners
- [1. Overview](#1-overview)
- [2. Initialization](#2-initialization)
- [2A. Linking listeners to the embedding component](#2a-linking-listeners-to-the-embedding-component)
- [2B. Shape inference](#2b-shape-inference)
- [3. Internal communication](#3-internal-communication)
- [3A. During prediction](#3a-during-prediction)
- [3B. During training](#3b-during-training)
- [Training with multiple listeners](#training-with-multiple-listeners)
- [3C. Frozen components](#3c-frozen-components)
- [The Tok2Vec or Transformer is frozen](#the-tok2vec-or-transformer-is-frozen)
- [The upstream component is frozen](#the-upstream-component-is-frozen)
- [4. Replacing listener with standalone](#4-replacing-listener-with-standalone)
## 1. Overview
Trainable spaCy components typically use some sort of `tok2vec` layer as part of the `model` definition.
This `tok2vec` layer produces embeddings and is either a standard `Tok2Vec` layer, or a Transformer-based one.
Both versions can be used either inline/standalone, which means that they are defined and used
by only one specific component (e.g. NER), or
[shared](https://spacy.io/usage/embeddings-transformers#embedding-layers),
in which case the embedding functionality becomes a separate component that can
feed embeddings to multiple components downstream, using a listener pattern.
| Type | Usage | Model Architecture |
| ------------- | ---------- | -------------------------------------------------------------------------------------------------- |
| `Tok2Vec` | standalone | [`spacy.Tok2Vec`](https://spacy.io/api/architectures#Tok2Vec) |
| `Tok2Vec` | listener | [`spacy.Tok2VecListener`](https://spacy.io/api/architectures#Tok2VecListener) |
| `Transformer` | standalone | [`spacy-transformers.Tok2VecTransformer`](https://spacy.io/api/architectures#Tok2VecTransformer) |
| `Transformer` | listener | [`spacy-transformers.TransformerListener`](https://spacy.io/api/architectures#TransformerListener) |
Here we discuss the listener pattern and its implementation in code in more detail.
## 2. Initialization
### 2A. Linking listeners to the embedding component
To allow sharing a `tok2vec` layer, a separate `tok2vec` component needs to be defined in the config:
```ini
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
```
A listener can then be set up by making sure the correct `upstream` name is defined, referring to the
name of the `tok2vec` component (which equals the factory name by default), or `*` as a wildcard:
```ini
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
upstream = "tok2vec"
```
When an [`nlp`](https://github.com/explosion/spaCy/blob/master/extra/DEVELOPER_DOCS/Language.md) object is
initialized or deserialized, it will make sure to link each `tok2vec` component to its listeners. This is
implemented in the method `nlp._link_components()` which loops over each
component in the pipeline and calls `find_listeners()` on a component if it's defined.
The [`tok2vec` component](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)'s implementation
of this `find_listeners()` method will specifically identify sublayers of a model definition that are of type
`Tok2VecListener` with a matching upstream name and will then add that listener to the internal `self.listener_map`.
If it's a Transformer-based pipeline, a
[`transformer` component](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py)
has a similar implementation, but its `find_listeners()` function will specifically look for `TransformerListener`
sublayers of downstream components.
### 2B. Shape inference
Typically, the output dimension `nO` of a listener's model equals the `nO` (or `width`) of the upstream embedding layer.
For a standard `Tok2Vec`-based component, this is typically known up-front and defined as such in the config:
```ini
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```
A `transformer` component, however, only knows its `nO` dimension after the HuggingFace transformer
is set with the function `model.attrs["set_transformer"]`,
[implemented](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
by `set_pytorch_transformer`.
This is why, upon linking of the transformer listeners, the `transformer` component also makes sure to set
the listener's output dimension correctly.
This shape inference mechanism also needs to happen with resumed/frozen components, which means that for some CLI
commands (`assemble` and `train`), we need to call `nlp._link_components` even before initializing the `nlp`
object. To cover all use-cases and avoid negative side effects, the code base ensures that performing the
linking twice is not harmful.
## 3. Internal communication
The internal communication between a listener and its downstream components is organized by sending and
receiving information across the components - either directly or implicitly.
The details are different depending on whether the pipeline is currently training, or predicting.
Either way, the `tok2vec` or `transformer` component always needs to run before the listener.
### 3A. During prediction
When the `Tok2Vec` pipeline component is called, its `predict()` method is executed to produce the results,
which are then stored by `set_annotations()` in the `doc.tensor` field of the document(s).
Similarly, the `Transformer` component stores the produced embeddings
in `doc._.trf_data`. Next, the `forward` pass of a
[`Tok2VecListener`](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)
or a
[`TransformerListener`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/listener.py)
accesses these fields on the `Doc` directly. Both listener implementations have a fallback mechanism for when these
properties were not set on the `Doc`: in that case an all-zero tensor is produced and returned.
We need this fallback mechanism to enable shape inference methods in Thinc, but the code
is slightly risky and at times might hide another bug - so it's a good spot to be aware of.
### 3B. During training
During training, the `update()` methods of the `Tok2Vec` & `Transformer` components don't necessarily set the
annotations on the `Doc` (though since 3.1 they can if they are part of the `annotating_components` list in the config).
Instead, we rely on a caching mechanism between the original embedding component and its listener.
Specifically, the produced embeddings are sent to the listeners by calling `listener.receive()` and uniquely
identifying the batch of documents with a `batch_id`. This `receive()` call also sends the appropriate `backprop`
call to ensure that gradients from the downstream component flow back to the trainable `Tok2Vec` or `Transformer`
network.
We rely on the `nlp` object properly batching the data and sending each batch through the pipeline in sequence,
which means that only one such batch needs to be kept in memory for each listener.
When the downstream component runs and the listener should produce embeddings, it accesses the batch in memory,
runs the backpropagation, and returns the results and the gradients.
There are two ways in which this mechanism can fail, both of which are detected by `verify_inputs()`:
- `E953` if a different batch is in memory than the requested one - signaling some kind of out-of-sync state of the
training pipeline.
- `E954` if no batch is in memory at all - signaling that the pipeline is probably not set up correctly.
#### Training with multiple listeners
One `Tok2Vec` or `Transformer` component may be listened to by several downstream components, e.g.
a tagger and a parser could be sharing the same embeddings. In this case, we need to be careful about how we do
the backpropagation. When the `Tok2Vec` or `Transformer` sends out data to the listener with `receive()`, they will
send an `accumulate_gradient` function call to all listeners, except the last one. This function will keep track
of the gradients received so far. Only the final listener in the pipeline will get an actual `backprop` call that
will initiate the backpropagation of the `tok2vec` or `transformer` model with the accumulated gradients.
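A minimal sketch of this accumulate-then-backprop pattern (an illustration of the idea, not spaCy's actual implementation):

```python
class SharedEmbeddingOutput:
    def __init__(self, backprop):
        self.backprop = backprop  # backward pass of the shared tok2vec/transformer
        self.accumulated = None

    def accumulate_gradient(self, d_output):
        # Called for every listener except the last one: just sum the gradients.
        if self.accumulated is None:
            self.accumulated = d_output
        else:
            self.accumulated = [a + b for a, b in zip(self.accumulated, d_output)]

    def backprop_final(self, d_output):
        # Only the last listener triggers the real backward pass, with the sum.
        self.accumulate_gradient(d_output)
        return self.backprop(self.accumulated)

# Usage: two listeners sharing one embedding layer
out = SharedEmbeddingOutput(backprop=lambda d: print("backprop with", d))
out.accumulate_gradient([0.1, 0.2])  # e.g. gradient from the tagger
out.backprop_final([0.3, 0.4])       # gradient from the parser triggers backprop
```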
### 3C. Frozen components
The listener pattern can get particularly tricky in combination with frozen components. To detect components
with listeners that are not frozen consistently, `init_nlp()` (which is called by `spacy train`) goes through
the listeners and their upstream components and warns in two scenarios.
#### The Tok2Vec or Transformer is frozen
If the `Tok2Vec` or `Transformer` was already trained,
e.g. by [pretraining](https://spacy.io/usage/embeddings-transformers#pretraining),
it could be a valid use-case to freeze the embedding architecture and only train downstream components such
as a tagger or a parser. This was impossible before 3.1, but is supported since then by putting the
embedding component in the [`annotating_components`](https://spacy.io/usage/training#annotating-components)
list of the config. This works like any other "annotating component" because it relies on the `Doc` attributes.
However, if the `Tok2Vec` or `Transformer` is frozen, and not present in `annotating_components`, and a related
listener isn't frozen, then a `W086` warning is shown and further training of the pipeline will likely end with `E954`.
#### The upstream component is frozen
If an upstream component is frozen but the underlying `Tok2Vec` or `Transformer` isn't, the performance of
the upstream component will be degraded after training. In this case, a `W087` warning is shown, explaining
how to use the `replace_listeners` functionality to prevent this problem.
## 4. Replacing listener with standalone
The [`replace_listeners`](https://spacy.io/api/language#replace_listeners) functionality changes the architecture
of a downstream component from using a listener pattern to a standalone `tok2vec` or `transformer` layer,
effectively making the downstream component independent of any other components in the pipeline.
It is implemented by `nlp.replace_listeners()` and typically executed by `Language.from_config()`.
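From user code, the documented entry point looks like this (assuming `en_core_web_sm` is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Give ner its own private copy of the shared tok2vec layer.
nlp.replace_listeners("tok2vec", "ner", ["model.tok2vec"])
```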
First, it fetches the original `Model` of the original component that creates the embeddings:
```python
tok2vec = self.get_pipe(tok2vec_name)
tok2vec_model = tok2vec.model
```
Which is either a [`Tok2Vec` model](https://github.com/explosion/spaCy/blob/master/spacy/ml/models/tok2vec.py) or a
[`TransformerModel`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py).
In the case of the `tok2vec`, this model can be copied as-is into the configuration and architecture of the
downstream component. However, for the `transformer`, this doesn't work.
The reason is that the `TransformerListener` architecture chains the listener with
[`trfs2arrays`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/trfs2arrays.py):
```python
model = chain(
    TransformerListener(upstream_name=upstream),
    trfs2arrays(pooling, grad_factor),
)
```
but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained in between the model
and `trfs2arrays`:
```python
model = chain(
    TransformerModel(name, get_spans, tokenizer_config),
    split_trf_batch(),
    trfs2arrays(pooling, grad_factor),
)
```
So you can't just take the model from the listener and drop it into the component internally. You need to
adjust the model and the config. To facilitate this, `nlp.replace_listeners()` will check whether additional
[functions](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/_util.py) are
[defined](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
in `model.attrs`, and if so, it will essentially call these to make the appropriate changes:
```python
replace_func = tok2vec_model.attrs["replace_listener_cfg"]
new_config = replace_func(tok2vec_cfg["model"], pipe_cfg["model"]["tok2vec"])
...
new_model = tok2vec_model.attrs["replace_listener"](new_model)
```
The new config and model are then properly stored on the `nlp` object.
Note that this functionality (running the replacement for a transformer listener) was broken prior to
`spacy-transformers` 1.0.5.
In spaCy 3.7, `Language.replace_listeners` was updated to pass the following additional arguments to the `replace_listener` callback:
the listener to be replaced and the `tok2vec`/`transformer` pipe from which the new model was copied. To maintain backwards-compatibility,
the method only passes these extra arguments for callbacks that support them:
```python
def replace_listener_pre_37(copied_tok2vec_model):
...
def replace_listener_post_37(copied_tok2vec_model, replaced_listener, tok2vec_pipe):
...
```


@ -0,0 +1,7 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# Developer Documentation
This directory includes additional documentation and explanations of spaCy's internals. It's mostly intended for the spaCy core development team and contributors interested in the more complex parts of the library. The documents generally focus on more abstract implementation details and how specific methods and algorithms work, and they assume knowledge of what's already available in the [usage documentation](https://spacy.io/usage) and [API reference](https://spacy.io/api).
If you're looking to contribute to spaCy, make sure to check out the documentation and [contributing guide](https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md) first.


@ -0,0 +1,82 @@
# spaCy Satellite Packages
This is a list of all the active repos relevant to spaCy besides the main one, with short descriptions, history, and current status. Archived repos will not be covered.
## Always Included in spaCy
These packages are always pulled in when you install spaCy. Most of them are direct dependencies, but some are transitive dependencies through other packages.
- [spacy-legacy](https://github.com/explosion/spacy-legacy): When an architecture in spaCy changes enough to get a new version, the old version is frozen and moved to spacy-legacy. This allows us to keep the core library slim while also preserving backwards compatibility.
- [thinc](https://github.com/explosion/thinc): Thinc is the machine learning library that powers trainable components in spaCy. It wraps backends like NumPy, PyTorch, and TensorFlow to provide a functional interface for specifying architectures.
- [catalogue](https://github.com/explosion/catalogue): Small library for adding function registries, like those used for model architectures in spaCy.
- [confection](https://github.com/explosion/confection): This library contains the functionality for config parsing that was formerly contained directly in Thinc.
- [spacy-loggers](https://github.com/explosion/spacy-loggers): Contains loggers beyond the default logger available in spaCy's core code base. This includes loggers integrated with third-party services, which may differ in release cadence from spaCy itself.
- [wasabi](https://github.com/explosion/wasabi): A command line formatting library, used for terminal output in spaCy.
- [srsly](https://github.com/explosion/srsly): A wrapper that vendors several serialization libraries for spaCy. Includes parsers for JSON, JSONL, MessagePack, (extended) Pickle, and YAML.
- [preshed](https://github.com/explosion/preshed): A Cython library for low-level data structures like hash maps, used for memory efficient data storage.
- [cython-blis](https://github.com/explosion/cython-blis): Fast matrix multiplication using BLIS without depending on system libraries. Required by Thinc, rather than spaCy directly.
- [murmurhash](https://github.com/explosion/murmurhash): A wrapper library for a C++ murmurhash implementation, used for string IDs in spaCy and preshed.
- [cymem](https://github.com/explosion/cymem): A small library for RAII-style memory management in Cython.
## Optional Extensions for spaCy
These are repos that can be used by spaCy but aren't part of a default installation. Many of these are wrappers to integrate various kinds of third-party libraries.
- [spacy-transformers](https://github.com/explosion/spacy-transformers): A wrapper for the [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) library, this handles the extensive conversion necessary to coordinate spaCy's powerful `Doc` representation, training pipeline, and the Transformer embeddings. When released, this was known as `spacy-pytorch-transformers`, but it changed to the current name when HuggingFace updated the name of their library as well.
- [spacy-huggingface-hub](https://github.com/explosion/spacy-huggingface-hub): This package has a CLI script for uploading a packaged spaCy pipeline (created with `spacy package`) to the [Hugging Face Hub](https://huggingface.co/models).
- [spacy-alignments](https://github.com/explosion/spacy-alignments): A wrapper for the tokenizations library (mentioned below) with a modified build system to simplify cross-platform wheel creation. Used in spacy-transformers for aligning spaCy and HuggingFace tokenizations.
- [spacy-experimental](https://github.com/explosion/spacy-experimental): Experimental components that are not quite ready for inclusion in the main spaCy library. Usually there are unresolved questions around their APIs, so the experimental library allows us to expose them to the community for feedback before fully integrating them.
- [spacy-lookups-data](https://github.com/explosion/spacy-lookups-data): A repository of linguistic data, such as lemmas, that takes up a lot of disk space. Originally created to reduce the size of the spaCy core library. This is mainly useful if you want the data included but aren't using a pretrained pipeline; for the affected languages, the relevant data is included in pretrained pipelines directly.
- [coreferee](https://github.com/explosion/coreferee): Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages. Used as a spaCy pipeline component.
- [spacy-stanza](https://github.com/explosion/spacy-stanza): This is a wrapper that allows the use of Stanford's Stanza library in spaCy.
- [spacy-streamlit](https://github.com/explosion/spacy-streamlit): A wrapper for the Streamlit dashboard building library to help with integrating [displaCy](https://spacy.io/api/top-level/#displacy).
- [spacymoji](https://github.com/explosion/spacymoji): A library to add extra support for emoji to spaCy, such as including character names.
- [thinc-apple-ops](https://github.com/explosion/thinc-apple-ops): A special backend for OSX that uses Apple's native libraries for improved performance.
- [os-signpost](https://github.com/explosion/os-signpost): A Python package that allows you to use the `OSSignposter` API in OSX for performance analysis.
- [spacy-ray](https://github.com/explosion/spacy-ray): A wrapper to integrate spaCy with Ray, a distributed training framework. Currently a work in progress.
## Prodigy
[Prodigy](https://prodi.gy) is Explosion's easy-to-use and highly customizable tool for annotating data. Prodigy itself requires a license, but the repos below contain documentation, examples, and editor or notebook integrations.
- [prodigy-recipes](https://github.com/explosion/prodigy-recipes): Sample recipes for Prodigy, along with notebooks and other examples of usage.
- [vscode-prodigy](https://github.com/explosion/vscode-prodigy): A VS Code extension that lets you run Prodigy inside VS Code.
- [jupyterlab-prodigy](https://github.com/explosion/jupyterlab-prodigy): An extension for JupyterLab that lets you run Prodigy inside JupyterLab.
## Independent Tools or Projects
These are tools that may be related to or use spaCy, but are functionally independent projects in their own right as well.
- [floret](https://github.com/explosion/floret): A modification of fastText to use Bloom Embeddings. Can be used to add vectors with subword features to spaCy, and also works independently in the same manner as fastText.
- [sense2vec](https://github.com/explosion/sense2vec): A library to make embeddings of noun phrases or words coupled with their part of speech. This library uses spaCy.
- [spacy-vectors-builder](https://github.com/explosion/spacy-vectors-builder): This is a spaCy project that builds vectors using floret and a lot of input text. It handles downloading the input data as well as the actual building of vectors.
- [holmes-extractor](https://github.com/explosion/holmes-extractor): Information extraction from English and German texts based on predicate logic. Uses spaCy.
- [healthsea](https://github.com/explosion/healthsea): Healthsea is a project to extract information from comments about health supplements. Structurally, it's a self-contained, large spaCy project.
- [spacy-pkuseg](https://github.com/explosion/spacy-pkuseg): A fork of the pkuseg Chinese tokenizer. Used for Chinese support in spaCy, but also works independently.
- [ml-datasets](https://github.com/explosion/ml-datasets): This repo includes loaders for several standard machine learning datasets, like MNIST or WikiNER, and has historically been used in spaCy example code and documentation.
## Documentation and Informational Repos
These repos are used to support the spaCy docs or otherwise present information about spaCy or other Explosion projects.
- [projects](https://github.com/explosion/projects): The projects repo is used to show detailed examples of spaCy usage. Individual projects can be checked out using the spaCy command line tool, rather than checking out the projects repo directly.
- [spacy-course](https://github.com/explosion/spacy-course): Home to the interactive spaCy course for learning about how to use the library and some basic NLP principles.
- [spacy-io-binder](https://github.com/explosion/spacy-io-binder): Home to the notebooks used for interactive examples in the documentation.
## Organizational / Meta
These repos are used for organizing data around spaCy, but are not something an end user would need to install as part of using the library.
- [spacy-models](https://github.com/explosion/spacy-models): This repo contains metadata (but not training data) for all the spaCy models. This includes information about where their training data came from, version compatibility, and performance information. It also includes tests for the model packages, and the built models are hosted as releases of this repo.
- [wheelwright](https://github.com/explosion/wheelwright): A tool for automating our PyPI builds and releases.
- [ec2buildwheel](https://github.com/explosion/ec2buildwheel): A small project that allows you to build Python packages in the manner of cibuildwheel, but on any EC2 image. Used by wheelwright.
## Other
Repos that don&#39;t fit in any of the above categories.
- [blis](https://github.com/explosion/blis): A fork of the official BLIS library. The main branch is not updated, but work continues in various branches. This is used for cython-blis.
- [tokenizations](https://github.com/explosion/tokenizations): A library originally by Yohei Tamura to align strings with tolerance to some variations in features like case and diacritics, used for aligning tokens and wordpieces. Adopted and maintained by Explosion, but usually spacy-alignments is used instead.
- [conll-2012](https://github.com/explosion/conll-2012): A repo to hold some slightly cleaned up versions of the official scripts for the CoNLL 2012 shared task involving coreference resolution. Used in the coref project.
- [fastapi-explosion-extras](https://github.com/explosion/fastapi-explosion-extras): Some small tweaks to FastAPI used at Explosion.


@ -0,0 +1,216 @@
# StringStore & Vocab
> Reference: `spacy/strings.pyx`
> Reference: `spacy/vocab.pyx`
## Overview
spaCy represents most strings internally using a `uint64` in Cython, which
corresponds to a hash. The magic required to make this largely transparent is
handled by the `StringStore`, and is integrated into the pipelines using the
`Vocab`, which also connects it to some other information.
These are mostly internal details that average library users should never have
to think about. On the other hand, when developing a component it's normal to
interact with the Vocab for lexeme data or word vectors, and it's not unusual
to add labels to the `StringStore`.
## StringStore
### Overview
The `StringStore` is a `cdef class` that looks a bit like a two-way dictionary,
though it is not a subclass of anything in particular.
The main functionality of the `StringStore` is that `__getitem__` converts
hashes into strings or strings into hashes.
The full details of the conversion are complicated. Normally you shouldn't have
to worry about them, but the first applicable case here is used to get the
return value:
1. 0 and the empty string are special cased to each other
2. internal symbols use a lookup table (`SYMBOLS_BY_STR`)
3. normal strings or bytes are hashed
4. internal symbol IDs in `SYMBOLS_BY_INT` are handled
5. anything not yet handled is used as a hash to lookup a string
For the symbol enums, see [`symbols.pxd`](https://github.com/explosion/spaCy/blob/master/spacy/symbols.pxd).
Almost all strings in spaCy are stored in the `StringStore`. This naturally
includes tokens, but also includes things like labels (not just NER/POS/dep,
but also categories etc.), lemmas, lowercase forms, word shapes, and so on. One
of the main results of this is that tokens can be represented by a compact C
struct ([`LexemeC`](https://spacy.io/api/cython-structs#lexemec)/[`TokenC`](https://spacy.io/api/cython-structs#tokenc)) that mostly consists of string hashes. This also means that converting
input for the models is straightforward, and there's not a token mapping step
like in many machine learning frameworks. Additionally, because the token IDs
in spaCy are based on hashes, they are consistent across environments or
models.
One pattern you'll see a lot in spaCy APIs is that `something.value` returns an
`int` and `something.value_` returns a string. That's implemented using the
`StringStore`. Typically the `int` is stored in a C struct and the string is
generated via a property that calls into the `StringStore` with the `int`.
Besides `__getitem__`, the `StringStore` has functions to return specifically a
string or specifically a hash, regardless of whether the input was a string or
hash to begin with, though these are only used occasionally.
### Implementation Details: Hashes and Allocations
Hashes are 64-bit and are computed using [murmurhash][] on UTF-8 bytes. There is no
mechanism for detecting and avoiding collisions. To date there has never been a
reproducible collision or user report about any related issues.
[murmurhash]: https://github.com/explosion/murmurhash
The empty string is not hashed; it's just converted to/from 0.
A small number of strings use indices into a lookup table (so low integers)
rather than hashes. This is mostly Universal Dependencies labels or other
strings considered "core" in spaCy. This was critical in v1, which hadn't
introduced hashing yet. Since v2 it's important for items in `spacy.attrs`,
especially lexeme flags, but is otherwise only maintained for backwards
compatibility.
You can call `strings["mystring"]` with a string the `StringStore` has never seen
before and it will return a hash. But in order to do the reverse operation, you
need to call `strings.add("mystring")` first. Without a call to `add` the
string will not be interned.
Example:
```python
from spacy.strings import StringStore

ss = StringStore()
hashval = ss["spacy"]  # 10639093010105930009
try:
    # this won't work
    ss[hashval]
except KeyError:
    print(f"key {hashval} unknown in the StringStore.")

ss.add("spacy")
assert ss[hashval] == "spacy"  # it works now

# There is no `.keys` property, but you can iterate over keys.
# The empty string will never be in the list of keys.
for key in ss:
    print(key)
```
In normal use nothing is ever removed from the `StringStore`. In theory this
means that if you do something like iterate through all hex values of a certain
length you can have explosive memory usage. In practice this has never been an
issue. (Note that this is also different from using `sys.intern` to intern
Python strings, which does not guarantee they won't be garbage collected later.)
Strings are stored in the `StringStore` in a peculiar way: each string uses a
union that is either an eight-byte `char[]` or a `char*`. Short strings are
stored directly in the `char[]`, while longer strings are stored in allocated
memory and prefixed with their length. This is a strategy to reduce indirection
and memory fragmentation. See `decode_Utf8Str` and `_allocate` in
`strings.pyx` for the implementation.
### When to Use the StringStore?
While you can ignore the `StringStore` in many cases, there are situations where
you should make use of it to avoid errors.
Any time you introduce a string that may be set on a `Doc` field that has a hash,
you should add the string to the `StringStore`. This mainly happens when adding
labels in components, but there are some other cases:
- syntax iterators, mainly `get_noun_chunks`
- external data used in components, like the `KnowledgeBase` in the `entity_linker`
- labels used in tests
## Vocab
The `Vocab` is a core component of a `Language` pipeline. Its main function is
to manage `Lexeme`s, which are structs that contain information about a token
that depends only on its surface form, without context. `Lexeme`s store much of
the data associated with `Token`s. As a side effect of this, the `Vocab` also
manages the `StringStore` for a pipeline and a grab-bag of other data.
These are things stored in the vocab:
- `Lexeme`s
- `StringStore`
- `Morphology`: manages info used in `MorphAnalysis` objects
- `vectors`: basically a dict for word vectors
- `lookups`: language specific data like lemmas
- `writing_system`: language specific metadata
- `get_noun_chunks`: a syntax iterator
- lex attribute getters: functions like `is_punct`, set in language defaults
- `cfg`: **not** the pipeline config, this is mostly unused
- `_unused_object`: Formerly an unused object, kept around until v4 for compatibility
Some of these, like the Morphology and Vectors, are complex enough that they
need their own explanations. Here we'll just look at Vocab-specific items.
### Lexemes
A `Lexeme` is a type that mainly wraps a `LexemeC`, a struct consisting of ints
that identify various context-free token attributes. Lexemes are the core data
of the `Vocab`, and can be accessed using `__getitem__` on the `Vocab`. The memory
for storing `LexemeC` objects is managed by a pool that belongs to the `Vocab`.
Note that `__getitem__` on the `Vocab` works much like the `StringStore`, in
that it accepts a hash or id, with one important difference: if you do a lookup
using a string, that value is added to the `StringStore` automatically.
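For example:

```python
import spacy

nlp = spacy.blank("en")
lex = nlp.vocab["avocado"]  # a string lookup interns the string automatically
assert lex.orth == nlp.vocab.strings["avocado"]
assert lex.orth_ == "avocado"
```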
The attributes stored in a `LexemeC` are:
- orth (the raw text)
- lower
- norm
- shape
- prefix
- suffix
Most of these are straightforward. All of them can be customized, and (except
`orth`) probably should be since the defaults are based on English, but in
practice this is rarely done at present.
### Lookups
This is basically a dict of dicts, implemented using a `Table` for each
sub-dict, that stores lemmas and other language-specific lookup data.
A `Table` is a subclass of `OrderedDict` used for string-to-string data. It uses
Bloom filters to speed up misses and has some extra serialization features.
Tables are not used outside of the lookups.
### Lex Attribute Getters
Lexical Attribute Getters like `is_punct` are defined on a per-language basis,
much like lookups, but take the form of functions rather than string-to-string
dicts, so they're stored separately.
### Writing System
This is a dict with three attributes:
- `direction`: ltr or rtl (default ltr)
- `has_case`: bool (default `True`)
- `has_letters`: bool (default `True`, `False` only for CJK for now)
Currently these are not used much - the main use is that `direction` is used in
visualizers, though `rtl` doesn't quite work (see
[#4854](https://github.com/explosion/spaCy/issues/4854)). In the future they
could be used when choosing hyperparameters for subwords, controlling word
shape generation, and similar tasks.
### Other Vocab Members
The Vocab is kind of the default place to store things from `Language.defaults`
that don't belong to the Tokenizer. The following properties are in the Vocab
just because they don't have anywhere else to go.
- `get_noun_chunks`
- `cfg`: This is a dict that just stores `oov_prob` (hardcoded to `-20`)
- `_unused_object`: Leftover C member, should be removed in the next major version


@ -43,8 +43,8 @@ scikit-learn
* Files: scorer.py
The following implementation of roc_auc_score() is adapted from
scikit-learn, which is distributed under the following license:
The implementation of roc_auc_score() is adapted from scikit-learn, which is
distributed under the following license:
New BSD License
@ -77,3 +77,126 @@ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGE.
pyvi
----
* Files: lang/vi/__init__.py
The MIT License (MIT)
Copyright (c) 2016 Viet-Trung Tran
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
importlib_metadata
------------------
* Files: util.py
The implementation of packages_distributions() is adapted from
importlib_metadata, which is distributed under the following license:
Copyright 2017-2019 Jason R. Coombs, Barry Warsaw
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
polyleven
---------
* Files: spacy/matcher/polyleven.c
MIT License
Copyright (c) 2021 Fujimoto Seiji <fujimoto@ceptord.net>
Copyright (c) 2021 Max Bachmann <kontakt@maxbachmann.de>
Copyright (c) 2022 Nick Mazuk
Copyright (c) 2022 Michael Weiss <code@mweiss.ch>
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
SciPy
-----
* Files: scorer.py
The implementation of trapezoid() is adapted from SciPy, which is distributed
under the following license:
New BSD License
Copyright (c) 2001-2002 Enthought, Inc. 2003-2023, SciPy Developers.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


@ -1,13 +1,67 @@
[build-system]
requires = [
"setuptools",
"cython>=0.25,<3.0",
"cython>=3.0,<4.0",
"cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc>=8.0.3,<8.1.0",
"blis>=0.4.0,<0.8.0",
"pathy",
"numpy>=1.15.0",
"thinc>=8.3.4,<8.4.0",
"numpy>=2.0.0,<3.0.0"
]
build-backend = "setuptools.build_meta"
[tool.cibuildwheel]
build = "*"
skip = "pp* cp36* cp37* cp38* *-win32 *i686*"
test-skip = ""
free-threaded-support = false
archs = ["native"]
build-frontend = "default"
config-settings = {}
dependency-versions = "pinned"
environment = { PIP_CONSTRAINT = "build-constraints.txt" }
environment-pass = []
build-verbosity = 0
before-all = "curl https://sh.rustup.rs -sSf | sh -s -- -y --profile minimal --default-toolchain stable"
before-build = "pip install -r requirements.txt && python setup.py clean"
repair-wheel-command = ""
test-command = ""
before-test = ""
test-requires = []
test-extras = []
container-engine = "docker"
manylinux-x86_64-image = "manylinux2014"
manylinux-i686-image = "manylinux2014"
manylinux-aarch64-image = "manylinux2014"
manylinux-ppc64le-image = "manylinux2014"
manylinux-s390x-image = "manylinux2014"
manylinux-pypy_x86_64-image = "manylinux2014"
manylinux-pypy_i686-image = "manylinux2014"
manylinux-pypy_aarch64-image = "manylinux2014"
musllinux-x86_64-image = "musllinux_1_2"
musllinux-i686-image = "musllinux_1_2"
musllinux-aarch64-image = "musllinux_1_2"
musllinux-ppc64le-image = "musllinux_1_2"
musllinux-s390x-image = "musllinux_1_2"
[tool.cibuildwheel.linux]
repair-wheel-command = "auditwheel repair -w {dest_dir} {wheel}"
[tool.cibuildwheel.macos]
repair-wheel-command = "delocate-wheel --require-archs {delocate_archs} -w {dest_dir} -v {wheel}"
[tool.cibuildwheel.windows]
[tool.cibuildwheel.pyodide]
[tool.isort]
profile = "black"


@ -1,32 +1,38 @@
# Our libraries
spacy-legacy>=3.0.5,<3.1.0
spacy-legacy>=3.0.11,<3.1.0
spacy-loggers>=1.0.0,<2.0.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.0.3,<8.1.0
blis>=0.4.0,<0.8.0
thinc>=8.3.4,<8.4.0
ml_datasets>=0.2.0,<0.3.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.8.1,<1.1.0
srsly>=2.4.1,<3.0.0
catalogue>=2.0.4,<2.1.0
typer>=0.3.0,<0.4.0
pathy>=0.3.5
smart-open>=5.2.1,<7.0.0
wasabi>=0.9.1,<1.2.0
srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0
typer-slim>=0.3.0,<1.0.0
weasel>=0.1.0,<0.5.0
# Third party dependencies
numpy>=1.15.0
numpy>=2.0.0,<3.0.0
requests>=2.13.0,<3.0.0
tqdm>=4.38.0,<5.0.0
pydantic>=1.7.4,!=1.8,!=1.8.1,<1.9.0
pydantic>=1.7.4,!=1.8,!=1.8.1,<3.0.0
jinja2
# Official Python utilities
setuptools
packaging>=20.0
typing_extensions>=3.7.4.1,<4.0.0.0; python_version < "3.8"
# Development dependencies
cython>=0.25,<3.0
pytest>=5.2.0
pre-commit>=2.13.0
cython>=3.0,<4.0
pytest>=5.2.0,!=7.1.0
pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0
flake8>=3.8.0,<6.0.0
hypothesis>=3.27.0,<7.0.0
mypy==0.910
mypy>=1.5.0,<1.6.0; platform_machine != "aarch64" and python_version >= "3.8"
types-mock>=0.1.1
types-setuptools>=57.0.0
types-requests
types-setuptools>=57.0.0
black==22.3.0
cython-lint>=0.15.0
isort>=5.0,<6.0


@ -17,10 +17,11 @@ classifiers =
Operating System :: Microsoft :: Windows
Programming Language :: Cython
Programming Language :: Python :: 3
Programming Language :: Python :: 3.6
Programming Language :: Python :: 3.7
Programming Language :: Python :: 3.8
Programming Language :: Python :: 3.9
Programming Language :: Python :: 3.10
Programming Language :: Python :: 3.11
Programming Language :: Python :: 3.12
Programming Language :: Python :: 3.13
Topic :: Scientific/Engineering
project_urls =
Release notes = https://github.com/explosion/spaCy/releases
@ -29,39 +30,41 @@ project_urls =
[options]
zip_safe = false
include_package_data = true
python_requires = >=3.6
python_requires = >=3.9,<3.14
# NOTE: This section is superseded by pyproject.toml and will be removed in
# spaCy v4
setup_requires =
cython>=0.25,<3.0
numpy>=1.15.0
cython>=3.0,<4.0
numpy>=2.0.0,<3.0.0; python_version < "3.9"
numpy>=2.0.0,<3.0.0; python_version >= "3.9"
# We also need our Cython packages here to compile against
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0
thinc>=8.0.3,<8.1.0
thinc>=8.3.4,<8.4.0
install_requires =
# Our libraries
spacy-legacy>=3.0.5,<3.1.0
spacy-legacy>=3.0.11,<3.1.0
spacy-loggers>=1.0.0,<2.0.0
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.0.3,<8.1.0
blis>=0.4.0,<0.8.0
wasabi>=0.8.1,<1.1.0
srsly>=2.4.1,<3.0.0
catalogue>=2.0.4,<2.1.0
thinc>=8.3.4,<8.4.0
wasabi>=0.9.1,<1.2.0
srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0
weasel>=0.1.0,<0.5.0
# Third-party dependencies
typer>=0.3.0,<0.4.0
pathy>=0.3.5
smart-open>=5.2.1,<7.0.0
typer-slim>=0.3.0,<1.0.0
tqdm>=4.38.0,<5.0.0
numpy>=1.15.0
numpy>=1.15.0; python_version < "3.9"
numpy>=1.19.0; python_version >= "3.9"
requests>=2.13.0,<3.0.0
pydantic>=1.7.4,!=1.8,!=1.8.1,<1.9.0
pydantic>=1.7.4,!=1.8,!=1.8.1,<3.0.0
jinja2
# Official Python utilities
setuptools
packaging>=20.0
typing_extensions>=3.7.4,<4.0.0.0; python_version < "3.8"
[options.entry_points]
console_scripts =
@ -69,39 +72,55 @@ console_scripts =
[options.extras_require]
lookups =
spacy_lookups_data>=1.0.0,<1.1.0
spacy_lookups_data>=1.0.3,<1.1.0
transformers =
spacy_transformers>=1.0.1,<1.1.0
ray =
spacy_ray>=0.1.0,<1.0.0
spacy_transformers>=1.1.2,<1.4.0
cuda =
cupy>=5.0.0b4,<10.0.0
cupy>=5.0.0b4,<13.0.0
cuda80 =
cupy-cuda80>=5.0.0b4,<10.0.0
cupy-cuda80>=5.0.0b4,<13.0.0
cuda90 =
cupy-cuda90>=5.0.0b4,<10.0.0
cupy-cuda90>=5.0.0b4,<13.0.0
cuda91 =
cupy-cuda91>=5.0.0b4,<10.0.0
cupy-cuda91>=5.0.0b4,<13.0.0
cuda92 =
cupy-cuda92>=5.0.0b4,<10.0.0
cupy-cuda92>=5.0.0b4,<13.0.0
cuda100 =
cupy-cuda100>=5.0.0b4,<10.0.0
cupy-cuda100>=5.0.0b4,<13.0.0
cuda101 =
cupy-cuda101>=5.0.0b4,<10.0.0
cupy-cuda101>=5.0.0b4,<13.0.0
cuda102 =
cupy-cuda102>=5.0.0b4,<10.0.0
cupy-cuda102>=5.0.0b4,<13.0.0
cuda110 =
cupy-cuda110>=5.0.0b4,<10.0.0
cupy-cuda110>=5.0.0b4,<13.0.0
cuda111 =
cupy-cuda111>=5.0.0b4,<10.0.0
cupy-cuda111>=5.0.0b4,<13.0.0
cuda112 =
cupy-cuda112>=5.0.0b4,<10.0.0
cupy-cuda112>=5.0.0b4,<13.0.0
cuda113 =
cupy-cuda113>=5.0.0b4,<13.0.0
cuda114 =
cupy-cuda114>=5.0.0b4,<13.0.0
cuda115 =
cupy-cuda115>=5.0.0b4,<13.0.0
cuda116 =
cupy-cuda116>=5.0.0b4,<13.0.0
cuda117 =
cupy-cuda117>=5.0.0b4,<13.0.0
cuda11x =
cupy-cuda11x>=11.0.0,<13.0.0
cuda12x =
cupy-cuda12x>=11.5.0,<13.0.0
cuda-autodetect =
cupy-wheel>=11.0.0,<13.0.0
apple =
thinc-apple-ops>=1.0.0,<2.0.0
# Language tokenizers with external dependencies
ja =
sudachipy>=0.4.9
sudachidict_core>=20200330
sudachipy>=0.5.2,!=0.6.1
sudachidict_core>=20211220
ko =
natto-py==0.9.0
natto-py>=0.9.0
th =
pythainlp>=2.0
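The CUDA extras above pin each cupy wheel to its matching CUDA release, with cuda11x/cuda12x and cuda-autodetect covering the newer wheels. Once one of them (or the apple extra) is installed, GPU use remains opt-in at runtime; a small sketch, assuming any such extra is present:

# Hedged sketch: GPU use is opt-in after installing e.g. spacy[cuda12x].
import spacy

if spacy.prefer_gpu():  # returns False and falls back to CPU if no GPU is found
    print("Allocating on GPU")
else:
    print("Allocating on CPU")
nlp = spacy.blank("en")  # pipelines created after the call use the chosen device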
@ -112,7 +131,7 @@ universal = false
formats = gztar
[flake8]
ignore = E203, E266, E501, E731, W503, E741
ignore = E203, E266, E501, E731, W503, E741, F541
max-line-length = 80
select = B,C,E,F,W,T4,B9
exclude =
@ -123,9 +142,11 @@ exclude =
[tool:pytest]
markers =
slow
slow: mark a test as slow
issue: reference specific issue
[mypy]
ignore_missing_imports = True
no_implicit_optional = True
plugins = pydantic.mypy, thinc.mypy
allow_redefinition = True


@ -1,10 +1,9 @@
#!/usr/bin/env python
from setuptools import Extension, setup, find_packages
import sys
import platform
import numpy
from distutils.command.build_ext import build_ext
from distutils.sysconfig import get_python_inc
from setuptools.command.build_ext import build_ext
from sysconfig import get_path
from pathlib import Path
import shutil
from Cython.Build import cythonize
@ -23,16 +22,20 @@ Options.docstrings = True
PACKAGES = find_packages()
MOD_NAMES = [
"spacy.training.alignment_array",
"spacy.training.example",
"spacy.parts_of_speech",
"spacy.strings",
"spacy.lexeme",
"spacy.vocab",
"spacy.attrs",
"spacy.kb",
"spacy.kb.candidate",
"spacy.kb.kb",
"spacy.kb.kb_in_memory",
"spacy.ml.parser_model",
"spacy.morphology",
"spacy.pipeline.dep_parser",
"spacy.pipeline._edit_tree_internals.edit_trees",
"spacy.pipeline.morphologizer",
"spacy.pipeline.multitask",
"spacy.pipeline.ner",
@ -75,6 +78,7 @@ COMPILER_DIRECTIVES = {
"language_level": -3,
"embedsignature": True,
"annotation_typing": False,
"profile": sys.version_info < (3, 12),
}
# Files to copy into the package that are otherwise not included
COPY_FILES = {
@ -84,30 +88,6 @@ COPY_FILES = {
}
def is_new_osx():
"""Check whether we're on OSX >= 10.7"""
if sys.platform != "darwin":
return False
mac_ver = platform.mac_ver()[0]
if mac_ver.startswith("10"):
minor_version = int(mac_ver.split(".")[1])
if minor_version >= 7:
return True
else:
return False
return False
if is_new_osx():
# On Mac, use libc++ because Apple deprecated use of
# libstdc
COMPILE_OPTIONS["other"].append("-stdlib=libc++")
LINK_OPTIONS["other"].append("-lc++")
# g++ (used by unix compiler on mac) links to libstdc++ as a default lib.
# See: https://stackoverflow.com/questions/1653047/avoid-linking-to-libstdc
LINK_OPTIONS["other"].append("-nodefaultlibs")
# By subclassing build_extensions we have the actual compiler that will be used which is really known only after finalize_options
# http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
class build_ext_options:
@ -124,6 +104,8 @@ class build_ext_options:
class build_ext_subclass(build_ext, build_ext_options):
def build_extensions(self):
if self.parallel is None and os.environ.get("SPACY_NUM_BUILD_JOBS") is not None:
self.parallel = int(os.environ.get("SPACY_NUM_BUILD_JOBS"))
build_ext_options.build_options(self)
build_ext.build_extensions(self)
@ -198,13 +180,28 @@ def setup_package():
include_dirs = [
numpy.get_include(),
get_python_inc(plat_specific=True),
get_path("include"),
]
ext_modules = []
ext_modules.append(
Extension(
"spacy.matcher.levenshtein",
[
"spacy/matcher/levenshtein.pyx",
"spacy/matcher/polyleven.c",
],
language="c",
include_dirs=include_dirs,
)
)
for name in MOD_NAMES:
mod_path = name.replace(".", "/") + ".pyx"
ext = Extension(
name, [mod_path], language="c++", include_dirs=include_dirs, extra_compile_args=["-std=c++11"]
name,
[mod_path],
language="c++",
include_dirs=include_dirs,
extra_compile_args=["-std=c++11"],
)
ext_modules.append(ext)
print("Cythonizing sources")


@ -1,26 +1,25 @@
from typing import Union, Iterable, Dict, Any
from pathlib import Path
import sys
from pathlib import Path
from typing import Any, Dict, Iterable, Union
# set library-specific custom warning handling before doing anything else
from .errors import setup_default_warnings
setup_default_warnings()
setup_default_warnings() # noqa: E402
# These are imported as part of the API
from thinc.api import prefer_gpu, require_gpu, require_cpu # noqa: F401
from thinc.api import Config
from thinc.api import Config, prefer_gpu, require_cpu, require_gpu # noqa: F401
from . import pipeline # noqa: F401
from .cli.info import info # noqa: F401
from .glossary import explain # noqa: F401
from .about import __version__ # noqa: F401
from .util import registry, logger # noqa: F401
from .errors import Errors
from .language import Language
from .vocab import Vocab
from . import util
from .about import __version__ # noqa: F401
from .cli.info import info # noqa: F401
from .errors import Errors
from .glossary import explain # noqa: F401
from .language import Language
from .registrations import REGISTRY_POPULATED, populate_registry
from .util import logger, registry # noqa: F401
from .vocab import Vocab
if sys.maxunicode == 65535:
raise SystemError(Errors.E130)
@ -30,25 +29,33 @@ def load(
name: Union[str, Path],
*,
vocab: Union[Vocab, bool] = True,
disable: Iterable[str] = util.SimpleFrozenList(),
exclude: Iterable[str] = util.SimpleFrozenList(),
disable: Union[str, Iterable[str]] = util._DEFAULT_EMPTY_PIPES,
enable: Union[str, Iterable[str]] = util._DEFAULT_EMPTY_PIPES,
exclude: Union[str, Iterable[str]] = util._DEFAULT_EMPTY_PIPES,
config: Union[Dict[str, Any], Config] = util.SimpleFrozenDict(),
) -> Language:
"""Load a spaCy model from an installed package or a local path.
name (str): Package name or model path.
vocab (Vocab): A Vocab object. If True, a vocab is created.
disable (Iterable[str]): Names of pipeline components to disable. Disabled
disable (Union[str, Iterable[str]]): Name(s) of pipeline component(s) to disable. Disabled
pipes will be loaded but they won't be run unless you explicitly
enable them by calling nlp.enable_pipe.
exclude (Iterable[str]): Names of pipeline components to exclude. Excluded
enable (Union[str, Iterable[str]]): Name(s) of pipeline component(s) to enable. All other
pipes will be disabled (but can be enabled later using nlp.enable_pipe).
exclude (Union[str, Iterable[str]]): Name(s) of pipeline component(s) to exclude. Excluded
components won't be loaded.
config (Dict[str, Any] / Config): Config overrides as nested dict or dict
keyed by section values in dot notation.
RETURNS (Language): The loaded nlp object.
"""
return util.load_model(
name, vocab=vocab, disable=disable, exclude=exclude, config=config
name,
vocab=vocab,
disable=disable,
enable=enable,
exclude=exclude,
config=config,
)
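The new enable argument rounds out the disable/exclude pair: disabled pipes are loaded but skipped, enable inverts that logic, and excluded pipes are never loaded at all. A usage sketch, assuming the en_core_web_sm pipeline is installed:

# Hedged sketch, assuming en_core_web_sm is installed.
import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner"])  # loaded but not run
nlp.enable_pipe("ner")                               # can be switched back on

nlp = spacy.load("en_core_web_sm", enable=["tagger"])  # everything else disabled
nlp = spacy.load("en_core_web_sm", exclude=["lemmatizer"])  # never loaded
print(nlp.pipe_names)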


@ -1,7 +1,5 @@
# fmt: off
__title__ = "spacy"
__version__ = "3.0.9"
__version__ = "3.8.7"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects"
__projects_branch__ = "v3"


@ -1,6 +1,7 @@
# Reserve 64 values for flag features
from . cimport symbols
cdef enum attr_id_t:
NULL_ATTR
IS_ALPHA
@ -95,4 +96,4 @@ cdef enum attr_id_t:
ENT_ID = symbols.ENT_ID
IDX
SENT_END
SENT_END


@ -1,3 +1,7 @@
# cython: profile=False
from .errors import Errors
IOB_STRINGS = ("", "I", "O", "B")
IDS = {
"": NULL_ATTR,
@ -64,7 +68,6 @@ IDS = {
"FLAG61": FLAG61,
"FLAG62": FLAG62,
"FLAG63": FLAG63,
"ID": ID,
"ORTH": ORTH,
"LOWER": LOWER,
@ -72,7 +75,6 @@ IDS = {
"SHAPE": SHAPE,
"PREFIX": PREFIX,
"SUFFIX": SUFFIX,
"LENGTH": LENGTH,
"LEMMA": LEMMA,
"POS": POS,
@ -87,7 +89,7 @@ IDS = {
"SPACY": SPACY,
"LANG": LANG,
"MORPH": MORPH,
"IDX": IDX
"IDX": IDX,
}
@ -109,28 +111,66 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
"""
inty_attrs = {}
if _do_deprecated:
if 'F' in stringy_attrs:
if "F" in stringy_attrs:
stringy_attrs["ORTH"] = stringy_attrs.pop("F")
if 'L' in stringy_attrs:
if "L" in stringy_attrs:
stringy_attrs["LEMMA"] = stringy_attrs.pop("L")
if 'pos' in stringy_attrs:
if "pos" in stringy_attrs:
stringy_attrs["TAG"] = stringy_attrs.pop("pos")
if 'morph' in stringy_attrs:
morphs = stringy_attrs.pop('morph')
if 'number' in stringy_attrs:
stringy_attrs.pop('number')
if 'tenspect' in stringy_attrs:
stringy_attrs.pop('tenspect')
if "morph" in stringy_attrs:
morphs = stringy_attrs.pop("morph") # no-cython-lint
if "number" in stringy_attrs:
stringy_attrs.pop("number")
if "tenspect" in stringy_attrs:
stringy_attrs.pop("tenspect")
morph_keys = [
'PunctType', 'PunctSide', 'Other', 'Degree', 'AdvType', 'Number',
'VerbForm', 'PronType', 'Aspect', 'Tense', 'PartType', 'Poss',
'Hyph', 'ConjType', 'NumType', 'Foreign', 'VerbType', 'NounType',
'Gender', 'Mood', 'Negative', 'Tense', 'Voice', 'Abbr',
'Derivation', 'Echo', 'Foreign', 'NameType', 'NounType', 'NumForm',
'NumValue', 'PartType', 'Polite', 'StyleVariant',
'PronType', 'AdjType', 'Person', 'Variant', 'AdpType',
'Reflex', 'Negative', 'Mood', 'Aspect', 'Case',
'Polarity', 'PrepCase', 'Animacy' # U20
"PunctType",
"PunctSide",
"Other",
"Degree",
"AdvType",
"Number",
"VerbForm",
"PronType",
"Aspect",
"Tense",
"PartType",
"Poss",
"Hyph",
"ConjType",
"NumType",
"Foreign",
"VerbType",
"NounType",
"Gender",
"Mood",
"Negative",
"Tense",
"Voice",
"Abbr",
"Derivation",
"Echo",
"Foreign",
"NameType",
"NounType",
"NumForm",
"NumValue",
"PartType",
"Polite",
"StyleVariant",
"PronType",
"AdjType",
"Person",
"Variant",
"AdpType",
"Reflex",
"Negative",
"Mood",
"Aspect",
"Case",
"Polarity",
"PrepCase",
"Animacy", # U20
]
for key in morph_keys:
if key in stringy_attrs:
@ -142,8 +182,13 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
for name, value in stringy_attrs.items():
int_key = intify_attr(name)
if int_key is not None:
if strings_map is not None and isinstance(value, basestring):
if hasattr(strings_map, 'add'):
if int_key == ENT_IOB:
if value in IOB_STRINGS:
value = IOB_STRINGS.index(value)
elif isinstance(value, str):
raise ValueError(Errors.E1025.format(value=value))
if strings_map is not None and isinstance(value, str):
if hasattr(strings_map, "add"):
value = strings_map.add(value)
else:
value = strings_map[value]
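The new ENT_IOB branch means string IOB values are converted to their integer index rather than passed through to the string store. A small sketch of how intify_attrs behaves after this change:

# Hedged sketch: attribute names become integer IDs; ENT_IOB accepts
# the strings "", "I", "O", "B" and stores their index instead of a hash.
from spacy.attrs import intify_attrs
from spacy.vocab import Vocab

vocab = Vocab()
attrs = intify_attrs({"ORTH": "dogs", "ENT_IOB": "B"}, strings_map=vocab.strings)
print(attrs)  # keys are integer attribute IDs; "dogs" is a string-store hash, "B" becomes 3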


@ -1,31 +1,40 @@
from wasabi import msg
# Needed for testing
from . import download as download_module # noqa: F401
from ._util import app, setup_cli # noqa: F401
from .apply import apply # noqa: F401
from .assemble import assemble_cli # noqa: F401
# These are the actual functions, NOT the wrapped CLI commands. The CLI commands
# are registered automatically and won't have to be imported here.
from .download import download # noqa: F401
from .info import info # noqa: F401
from .package import package # noqa: F401
from .profile import profile # noqa: F401
from .train import train_cli # noqa: F401
from .assemble import assemble_cli # noqa: F401
from .pretrain import pretrain # noqa: F401
from .debug_data import debug_data # noqa: F401
from .debug_config import debug_config # noqa: F401
from .debug_model import debug_model # noqa: F401
from .evaluate import evaluate # noqa: F401
from .benchmark_speed import benchmark_speed_cli # noqa: F401
from .convert import convert # noqa: F401
from .debug_config import debug_config # noqa: F401
from .debug_data import debug_data # noqa: F401
from .debug_diff import debug_diff # noqa: F401
from .debug_model import debug_model # noqa: F401
from .download import download # noqa: F401
from .evaluate import evaluate # noqa: F401
from .find_function import find_function # noqa: F401
from .find_threshold import find_threshold # noqa: F401
from .info import info # noqa: F401
from .init_config import fill_config, init_config # noqa: F401
from .init_pipeline import init_pipeline_cli # noqa: F401
from .init_config import init_config, fill_config # noqa: F401
from .validate import validate # noqa: F401
from .project.clone import project_clone # noqa: F401
from .project.assets import project_assets # noqa: F401
from .project.run import project_run # noqa: F401
from .project.dvc import project_update_dvc # noqa: F401
from .project.push import project_push # noqa: F401
from .project.pull import project_pull # noqa: F401
from .project.document import project_document # noqa: F401
from .package import package # noqa: F401
from .pretrain import pretrain # noqa: F401
from .profile import profile # noqa: F401
from .project.assets import project_assets # type: ignore[attr-defined] # noqa: F401
from .project.clone import project_clone # type: ignore[attr-defined] # noqa: F401
from .project.document import ( # type: ignore[attr-defined] # noqa: F401
project_document,
)
from .project.dvc import project_update_dvc # type: ignore[attr-defined] # noqa: F401
from .project.pull import project_pull # type: ignore[attr-defined] # noqa: F401
from .project.push import project_push # type: ignore[attr-defined] # noqa: F401
from .project.run import project_run # type: ignore[attr-defined] # noqa: F401
from .train import train_cli # type: ignore[attr-defined] # noqa: F401
from .validate import validate # type: ignore[attr-defined] # noqa: F401
@app.command("link", no_args_is_help=True, deprecated=True, hidden=True)


@ -1,34 +1,50 @@
from typing import Dict, Any, Union, List, Optional, Tuple, Iterable, TYPE_CHECKING
import sys
import shutil
from pathlib import Path
from wasabi import msg
import srsly
import hashlib
import os
import shutil
import sys
from configparser import InterpolationError
from contextlib import contextmanager
from pathlib import Path
from typing import (
TYPE_CHECKING,
Any,
Dict,
Iterable,
List,
Optional,
Tuple,
Union,
overload,
)
import srsly
import typer
from click import NoSuchOption
from click.parser import split_arg_string
from typer.main import get_command
from contextlib import contextmanager
from thinc.api import Config, ConfigValidationError, require_gpu
from thinc.util import has_cupy, gpu_is_available
from configparser import InterpolationError
import os
from thinc.util import gpu_is_available
from typer.main import get_command
from wasabi import Printer, msg
from weasel import app as project_cli
from ..schemas import ProjectConfigSchema, validate
from ..util import import_file, run_command, make_tempdir, registry, logger
from ..util import is_compatible_version, SimpleFrozenDict, ENV_VARS
from .. import about
if TYPE_CHECKING:
from pathy import Pathy # noqa: F401
from ..compat import Literal
from ..schemas import validate
from ..util import (
ENV_VARS,
SimpleFrozenDict,
import_file,
is_compatible_version,
logger,
make_tempdir,
registry,
run_command,
)
SDIST_SUFFIX = ".tar.gz"
WHEEL_SUFFIX = "-py3-none-any.whl"
PROJECT_FILE = "project.yml"
PROJECT_LOCK = "project.lock"
COMMAND = "python -m spacy"
NAME = "spacy"
HELP = """spaCy Command-line Interface
@ -44,6 +60,7 @@ DEBUG_HELP = """Suite of helpful commands for debugging and profiling. Includes
commands to check and validate your config files, training and evaluation data,
and custom model implementations.
"""
BENCHMARK_HELP = """Commands for benchmarking pipelines."""
INIT_HELP = """Commands for initializing configs and pipeline packages."""
# Wrappers for Typer's annotations. Initially created to set defaults and to
@ -52,12 +69,13 @@ Arg = typer.Argument
Opt = typer.Option
app = typer.Typer(name=NAME, help=HELP)
project_cli = typer.Typer(name="project", help=PROJECT_HELP, no_args_is_help=True)
benchmark_cli = typer.Typer(name="benchmark", help=BENCHMARK_HELP, no_args_is_help=True)
debug_cli = typer.Typer(name="debug", help=DEBUG_HELP, no_args_is_help=True)
init_cli = typer.Typer(name="init", help=INIT_HELP, no_args_is_help=True)
app.add_typer(project_cli)
app.add_typer(project_cli, name="project", help=PROJECT_HELP, no_args_is_help=True)
app.add_typer(debug_cli)
app.add_typer(benchmark_cli)
app.add_typer(init_cli)
@ -85,9 +103,9 @@ def parse_config_overrides(
cli_overrides = _parse_overrides(args, is_cli=True)
if cli_overrides:
keys = [k for k in cli_overrides if k not in env_overrides]
logger.debug(f"Config overrides from CLI: {keys}")
logger.debug("Config overrides from CLI: %s", keys)
if env_overrides:
logger.debug(f"Config overrides from env variables: {list(env_overrides)}")
logger.debug("Config overrides from env variables: %s", list(env_overrides))
return {**cli_overrides, **env_overrides}
@ -130,147 +148,6 @@ def _parse_override(value: Any) -> Any:
return str(value)
def load_project_config(
path: Path, interpolate: bool = True, overrides: Dict[str, Any] = SimpleFrozenDict()
) -> Dict[str, Any]:
"""Load the project.yml file from a directory and validate it. Also make
sure that all directories defined in the config exist.
path (Path): The path to the project directory.
interpolate (bool): Whether to substitute project variables.
overrides (Dict[str, Any]): Optional config overrides.
RETURNS (Dict[str, Any]): The loaded project.yml.
"""
config_path = path / PROJECT_FILE
if not config_path.exists():
msg.fail(f"Can't find {PROJECT_FILE}", config_path, exits=1)
invalid_err = f"Invalid {PROJECT_FILE}. Double-check that the YAML is correct."
try:
config = srsly.read_yaml(config_path)
except ValueError as e:
msg.fail(invalid_err, e, exits=1)
errors = validate(ProjectConfigSchema, config)
if errors:
msg.fail(invalid_err)
print("\n".join(errors))
sys.exit(1)
validate_project_version(config)
validate_project_commands(config)
# Make sure directories defined in config exist
for subdir in config.get("directories", []):
dir_path = path / subdir
if not dir_path.exists():
dir_path.mkdir(parents=True)
if interpolate:
err = f"{PROJECT_FILE} validation error"
with show_validation_error(title=err, hint_fill=False):
config = substitute_project_variables(config, overrides)
return config
def substitute_project_variables(
config: Dict[str, Any],
overrides: Dict[str, Any] = SimpleFrozenDict(),
key: str = "vars",
env_key: str = "env",
) -> Dict[str, Any]:
"""Interpolate variables in the project file using the config system.
config (Dict[str, Any]): The project config.
overrides (Dict[str, Any]): Optional config overrides.
key (str): Key containing variables in project config.
env_key (str): Key containing environment variable mapping in project config.
RETURNS (Dict[str, Any]): The interpolated project config.
"""
config.setdefault(key, {})
config.setdefault(env_key, {})
# Substitute references to env vars with their values
for config_var, env_var in config[env_key].items():
config[env_key][config_var] = _parse_override(os.environ.get(env_var, ""))
# Need to put variables in the top scope again so we can have a top-level
# section "project" (otherwise, a list of commands in the top scope wouldn't)
# be allowed by Thinc's config system
cfg = Config({"project": config, key: config[key], env_key: config[env_key]})
cfg = Config().from_str(cfg.to_str(), overrides=overrides)
interpolated = cfg.interpolate()
return dict(interpolated["project"])
def validate_project_version(config: Dict[str, Any]) -> None:
"""If the project defines a compatible spaCy version range, chec that it's
compatible with the current version of spaCy.
config (Dict[str, Any]): The loaded config.
"""
spacy_version = config.get("spacy_version", None)
if spacy_version and not is_compatible_version(about.__version__, spacy_version):
err = (
f"The {PROJECT_FILE} specifies a spaCy version range ({spacy_version}) "
f"that's not compatible with the version of spaCy you're running "
f"({about.__version__}). You can edit version requirement in the "
f"{PROJECT_FILE} to load it, but the project may not run as expected."
)
msg.fail(err, exits=1)
def validate_project_commands(config: Dict[str, Any]) -> None:
"""Check that project commands and workflows are valid, don't contain
duplicates, don't clash and only refer to commands that exist.
config (Dict[str, Any]): The loaded config.
"""
command_names = [cmd["name"] for cmd in config.get("commands", [])]
workflows = config.get("workflows", {})
duplicates = set([cmd for cmd in command_names if command_names.count(cmd) > 1])
if duplicates:
err = f"Duplicate commands defined in {PROJECT_FILE}: {', '.join(duplicates)}"
msg.fail(err, exits=1)
for workflow_name, workflow_steps in workflows.items():
if workflow_name in command_names:
err = f"Can't use workflow name '{workflow_name}': name already exists as a command"
msg.fail(err, exits=1)
for step in workflow_steps:
if step not in command_names:
msg.fail(
f"Unknown command specified in workflow '{workflow_name}': {step}",
f"Workflows can only refer to commands defined in the 'commands' "
f"section of the {PROJECT_FILE}.",
exits=1,
)
def get_hash(data, exclude: Iterable[str] = tuple()) -> str:
"""Get the hash for a JSON-serializable object.
data: The data to hash.
exclude (Iterable[str]): Top-level keys to exclude if data is a dict.
RETURNS (str): The hash.
"""
if isinstance(data, dict):
data = {k: v for k, v in data.items() if k not in exclude}
data_str = srsly.json_dumps(data, sort_keys=True).encode("utf8")
return hashlib.md5(data_str).hexdigest()
def get_checksum(path: Union[Path, str]) -> str:
"""Get the checksum for a file or directory given its file path. If a
directory path is provided, this uses all files in that directory.
path (Union[Path, str]): The file or directory path.
RETURNS (str): The checksum.
"""
path = Path(path)
if path.is_file():
return hashlib.md5(Path(path).read_bytes()).hexdigest()
if path.is_dir():
# TODO: this is currently pretty slow
dir_checksum = hashlib.md5()
for sub_file in sorted(fp for fp in path.rglob("*") if fp.is_file()):
dir_checksum.update(sub_file.read_bytes())
return dir_checksum.hexdigest()
msg.fail(f"Can't get checksum for {path}: not a file or directory", exits=1)
@contextmanager
def show_validation_error(
file_path: Optional[Union[str, Path]] = None,
@ -328,153 +205,33 @@ def import_code(code_path: Optional[Union[Path, str]]) -> None:
msg.fail(f"Couldn't load Python code: {code_path}", e, exits=1)
def upload_file(src: Path, dest: Union[str, "Pathy"]) -> None:
"""Upload a file.
src (Path): The source path.
url (str): The destination URL to upload to.
"""
import smart_open
dest = str(dest)
with smart_open.open(dest, mode="wb") as output_file:
with src.open(mode="rb") as input_file:
output_file.write(input_file.read())
def download_file(src: Union[str, "Pathy"], dest: Path, *, force: bool = False) -> None:
"""Download a file using smart_open.
url (str): The URL of the file.
dest (Path): The destination path.
force (bool): Whether to force download even if file exists.
If False, the download will be skipped.
"""
import smart_open
if dest.exists() and not force:
return None
src = str(src)
with smart_open.open(src, mode="rb", compression="disable") as input_file:
with dest.open(mode="wb") as output_file:
output_file.write(input_file.read())
def ensure_pathy(path):
"""Temporary helper to prevent importing Pathy globally (which can cause
a slow and annoying Google Cloud warning)."""
from pathy import Pathy # noqa: F811
return Pathy(path)
def git_checkout(
repo: str, subpath: str, dest: Path, *, branch: str = "master", sparse: bool = False
):
git_version = get_git_version()
if dest.exists():
msg.fail("Destination of checkout must not exist", exits=1)
if not dest.parent.exists():
msg.fail("Parent of destination of checkout must exist", exits=1)
if sparse and git_version >= (2, 22):
return git_sparse_checkout(repo, subpath, dest, branch)
elif sparse:
# Only show warnings if the user explicitly wants sparse checkout but
# the Git version doesn't support it
err_old = (
f"You're running an old version of Git (v{git_version[0]}.{git_version[1]}) "
f"that doesn't fully support sparse checkout yet."
)
err_unk = "You're running an unknown version of Git, so sparse checkout has been disabled."
msg.warn(
f"{err_unk if git_version == (0, 0) else err_old} "
f"This means that more files than necessary may be downloaded "
f"temporarily. To only download the files needed, make sure "
f"you're using Git v2.22 or above."
)
with make_tempdir() as tmp_dir:
cmd = f"git -C {tmp_dir} clone {repo} . -b {branch}"
run_command(cmd, capture=True)
# We need Path(name) to make sure we also support subdirectories
try:
shutil.copytree(str(tmp_dir / Path(subpath)), str(dest))
except FileNotFoundError:
err = f"Can't clone {subpath}. Make sure the directory exists in the repo (branch '{branch}')"
msg.fail(err, repo, exits=1)
def git_sparse_checkout(repo, subpath, dest, branch):
# We're using Git, partial clone and sparse checkout to
# only clone the files we need
# This ends up being RIDICULOUS. omg.
# So, every tutorial and SO post talks about 'sparse checkout'...But they
# go and *clone* the whole repo. Worthless. And cloning part of a repo
# turns out to be completely broken. The only way to specify a "path" is..
# a path *on the server*? The contents of which, specifies the paths. Wat.
# Obviously this is hopelessly broken and insecure, because you can query
# arbitrary paths on the server! So nobody enables this.
# What we have to do is disable *all* files. We could then just checkout
# the path, and it'd "work", but be hopelessly slow...Because it goes and
# transfers every missing object one-by-one. So the final piece is that we
# need to use some weird git internals to fetch the missings in bulk, and
# *that* we can do by path.
# We're using Git and sparse checkout to only clone the files we need
with make_tempdir() as tmp_dir:
# This is the "clone, but don't download anything" part.
cmd = (
f"git clone {repo} {tmp_dir} --no-checkout --depth 1 "
f"-b {branch} --filter=blob:none"
)
run_command(cmd)
# Now we need to find the missing filenames for the subpath we want.
# Looking for this 'rev-list' command in the git --help? Hah.
cmd = f"git -C {tmp_dir} rev-list --objects --all --missing=print -- {subpath}"
ret = run_command(cmd, capture=True)
git_repo = _http_to_git(repo)
# Now pass those missings into another bit of git internals
missings = " ".join([x[1:] for x in ret.stdout.split() if x.startswith("?")])
if not missings:
err = (
f"Could not find any relevant files for '{subpath}'. "
f"Did you specify a correct and complete path within repo '{repo}' "
f"and branch {branch}?"
)
msg.fail(err, exits=1)
cmd = f"git -C {tmp_dir} fetch-pack {git_repo} {missings}"
run_command(cmd, capture=True)
# And finally, we can checkout our subpath
cmd = f"git -C {tmp_dir} checkout {branch} {subpath}"
run_command(cmd, capture=True)
# We need Path(name) to make sure we also support subdirectories
shutil.move(str(tmp_dir / Path(subpath)), str(dest))
def get_git_version(
error: str = "Could not run 'git'. Make sure it's installed and the executable is available.",
) -> Tuple[int, int]:
"""Get the version of git and raise an error if calling 'git --version' fails.
error (str): The error message to show.
RETURNS (Tuple[int, int]): The version as a (major, minor) tuple. Returns
(0, 0) if the version couldn't be determined.
"""
ret = run_command("git --version", capture=True)
try:
ret = run_command("git --version", capture=True)
except:
raise RuntimeError(error)
stdout = ret.stdout.strip()
if not stdout or not stdout.startswith("git version"):
return (0, 0)
return 0, 0
version = stdout[11:].strip().split(".")
return (int(version[0]), int(version[1]))
return int(version[0]), int(version[1])
def _http_to_git(repo: str) -> str:
if repo.startswith("http://"):
repo = repo.replace(r"http://", r"https://")
if repo.startswith(r"https://"):
repo = repo.replace("https://", "git@").replace("/", ":", 1)
if repo.endswith("/"):
repo = repo[:-1]
repo = f"{repo}.git"
return repo
@overload
def string_to_list(value: str, intify: Literal[False] = ...) -> List[str]:
...
@overload
def string_to_list(value: str, intify: Literal[True]) -> List[int]:
...
def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[int]]:
@ -487,7 +244,7 @@ def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[in
RETURNS (Union[List[str], List[int]]): A list of strings or ints.
"""
if not value:
return []
return [] # type: ignore[return-value]
if value.startswith("[") and value.endswith("]"):
value = value[1:-1]
result = []
@ -499,17 +256,57 @@ def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[in
p = p[1:-1]
p = p.strip()
if intify:
p = int(p)
p = int(p) # type: ignore[assignment]
result.append(p)
return result
def setup_gpu(use_gpu: int) -> None:
def setup_gpu(use_gpu: int, silent=None) -> None:
"""Configure the GPU and log info."""
if silent is None:
local_msg = Printer()
else:
local_msg = Printer(no_print=silent, pretty=not silent)
if use_gpu >= 0:
msg.info(f"Using GPU: {use_gpu}")
local_msg.info(f"Using GPU: {use_gpu}")
require_gpu(use_gpu)
else:
msg.info("Using CPU")
if has_cupy and gpu_is_available():
msg.info("To switch to GPU 0, use the option: --gpu-id 0")
local_msg.info("Using CPU")
if gpu_is_available():
local_msg.info("To switch to GPU 0, use the option: --gpu-id 0")
def walk_directory(path: Path, suffix: Optional[str] = None) -> List[Path]:
"""Given a directory and a suffix, recursively find all files matching the suffix.
Directories or files with names beginning with a . are ignored, but hidden flags on
filesystems are not checked.
If the suffix is `None`, no suffix-based filtering is applied."""
if not path.is_dir():
return [path]
paths = [path]
locs = []
seen = set()
for path in paths:
if str(path) in seen:
continue
seen.add(str(path))
if path.parts[-1].startswith("."):
continue
elif path.is_dir():
paths.extend(path.iterdir())
elif suffix is not None and not path.parts[-1].endswith(suffix):
continue
else:
locs.append(path)
# It's good to sort these, in case the ordering messes up cache.
locs.sort()
return locs
def _format_number(number: Union[int, float], ndigits: int = 2) -> str:
"""Formats a number (float or int) rounding to `ndigits`, without truncating trailing 0s,
as happens with `round(number, ndigits)`"""
if isinstance(number, float):
return f"{number:.{ndigits}f}"
else:
return str(number)
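walk_directory (now shared by apply and convert) recursively collects files, skipping anything whose name starts with a dot, and _format_number keeps trailing zeros that round() would drop. A quick sketch, with a hypothetical corpus directory (note _format_number is a private helper):

# Hedged sketch of the two helpers above (corpus/ is hypothetical).
from pathlib import Path
from spacy.cli._util import _format_number, walk_directory

print(walk_directory(Path("corpus"), suffix=".jsonl"))  # sorted; dotfiles skipped
print(_format_number(3.10))  # "3.10" -- f-string formatting keeps the trailing 0
print(_format_number(7))     # "7"   -- ints pass through unchanged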

spacy/cli/apply.py (new file)

@ -0,0 +1,142 @@
from itertools import chain
from pathlib import Path
from typing import Iterable, List, Optional, Union, cast
import srsly
import tqdm
from wasabi import msg
from ..tokens import Doc, DocBin
from ..util import ensure_path, load_model
from ..vocab import Vocab
from ._util import Arg, Opt, app, import_code, setup_gpu, walk_directory
path_help = """Location of the documents to predict on.
Can be a single file in .spacy format or a .jsonl file.
Files with other extensions are treated as single plain text documents.
If a directory is provided it is traversed recursively to grab
all files to be processed.
The files can be a mixture of .spacy, .jsonl and text files.
If .jsonl is provided the specified field is going
to be grabbed ("text" by default)."""
out_help = "Path to save the resulting .spacy file"
code_help = (
"Path to Python file with additional " "code (registered functions) to be imported"
)
gold_help = "Use gold preprocessing provided in the .spacy files"
force_msg = (
"The provided output file already exists. "
"To force overwriting the output file, set the --force or -F flag."
)
DocOrStrStream = Union[Iterable[str], Iterable[Doc]]
def _stream_docbin(path: Path, vocab: Vocab) -> Iterable[Doc]:
"""
Stream Doc objects from DocBin.
"""
docbin = DocBin().from_disk(path)
for doc in docbin.get_docs(vocab):
yield doc
def _stream_jsonl(path: Path, field: str) -> Iterable[str]:
"""
Stream "text" field from JSONL. If the field "text" is
not found, an error is raised.
"""
for entry in srsly.read_jsonl(path):
if field not in entry:
msg.fail(f"{path} does not contain the required '{field}' field.", exits=1)
else:
yield entry[field]
def _stream_texts(paths: Iterable[Path]) -> Iterable[str]:
"""
Yields strings from text files in paths.
"""
for path in paths:
with open(path, "r") as fin:
text = fin.read()
yield text
@app.command("apply")
def apply_cli(
# fmt: off
model: str = Arg(..., help="Model name or path"),
data_path: Path = Arg(..., help=path_help, exists=True),
output_file: Path = Arg(..., help=out_help, dir_okay=False),
code_path: Optional[Path] = Opt(None, "--code", "-c", help=code_help),
text_key: str = Opt("text", "--text-key", "-tk", help="Key containing text string for JSONL"),
force_overwrite: bool = Opt(False, "--force", "-F", help="Force overwriting the output file"),
use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU."),
batch_size: int = Opt(1, "--batch-size", "-b", help="Batch size."),
n_process: int = Opt(1, "--n-process", "-n", help="Number of processes to use.")
):
"""
Apply a trained pipeline to documents to get predictions.
Expects a loadable spaCy pipeline and path to the data, which
can be a directory or a file.
The data files can be provided in multiple formats:
1. .spacy files
2. .jsonl files with a specified "field" to read the text from.
3. Files with any other extension are assumed to contain
a single document.
DOCS: https://spacy.io/api/cli#apply
"""
data_path = ensure_path(data_path)
output_file = ensure_path(output_file)
code_path = ensure_path(code_path)
if output_file.exists() and not force_overwrite:
msg.fail(force_msg, exits=1)
if not data_path.exists():
msg.fail(f"Couldn't find data path: {data_path}", exits=1)
import_code(code_path)
setup_gpu(use_gpu)
apply(data_path, output_file, model, text_key, batch_size, n_process)
def apply(
data_path: Path,
output_file: Path,
model: str,
json_field: str,
batch_size: int,
n_process: int,
):
docbin = DocBin(store_user_data=True)
paths = walk_directory(data_path)
if len(paths) == 0:
docbin.to_disk(output_file)
msg.warn(
"Did not find data to process,"
f" {data_path} seems to be an empty directory."
)
return
nlp = load_model(model)
msg.good(f"Loaded model {model}")
vocab = nlp.vocab
streams: List[DocOrStrStream] = []
text_files = []
for path in paths:
if path.suffix == ".spacy":
streams.append(_stream_docbin(path, vocab))
elif path.suffix == ".jsonl":
streams.append(_stream_jsonl(path, json_field))
else:
text_files.append(path)
if len(text_files) > 0:
streams.append(_stream_texts(text_files))
datagen = cast(DocOrStrStream, chain(*streams))
for doc in tqdm.tqdm(
nlp.pipe(datagen, batch_size=batch_size, n_process=n_process), disable=None
):
docbin.add(doc)
if output_file.suffix == "":
output_file = output_file.with_suffix(".spacy")
docbin.to_disk(output_file)
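The command streams .spacy, .jsonl and plain-text inputs through nlp.pipe and serializes everything into a single DocBin. A sketch for reading the predictions back, assuming an installed en_core_web_sm and an output file named preds.spacy:

# Hedged sketch: load the DocBin written by `spacy apply`.
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")
for doc in DocBin().from_disk("preds.spacy").get_docs(nlp.vocab):
    print(doc.text[:60], [(ent.text, ent.label_) for ent in doc.ents])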


@ -1,14 +1,20 @@
from typing import Optional
from pathlib import Path
from wasabi import msg
import typer
import logging
from pathlib import Path
from typing import Optional
import typer
from wasabi import msg
from ._util import app, Arg, Opt, parse_config_overrides, show_validation_error
from ._util import import_code
from ..training.initialize import init_nlp
from .. import util
from ..util import get_sourced_components, load_model_from_config
from ._util import (
Arg,
Opt,
app,
import_code,
parse_config_overrides,
show_validation_error,
)
@app.command(
@ -34,7 +40,8 @@ def assemble_cli(
DOCS: https://spacy.io/api/cli#assemble
"""
util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
if verbose:
util.logger.setLevel(logging.DEBUG)
# Make sure all files and paths exists if they are needed
if not config_path or (str(config_path) != "-" and not config_path.exists()):
msg.fail("Config file not found", config_path, exits=1)


@ -0,0 +1,177 @@
import random
import time
from itertools import islice
from pathlib import Path
from typing import Iterable, List, Optional
import numpy
import typer
from tqdm import tqdm
from wasabi import msg
from .. import util
from ..language import Language
from ..tokens import Doc
from ..training import Corpus
from ._util import Arg, Opt, benchmark_cli, import_code, setup_gpu
@benchmark_cli.command(
"speed",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def benchmark_speed_cli(
# fmt: off
ctx: typer.Context,
model: str = Arg(..., help="Model name or path"),
data_path: Path = Arg(..., help="Location of binary evaluation data in .spacy format", exists=True),
batch_size: Optional[int] = Opt(None, "--batch-size", "-b", min=1, help="Override the pipeline batch size"),
no_shuffle: bool = Opt(False, "--no-shuffle", help="Do not shuffle benchmark data"),
use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"),
n_batches: int = Opt(50, "--batches", help="Minimum number of batches to benchmark", min=30,),
warmup_epochs: int = Opt(3, "--warmup", "-w", min=0, help="Number of iterations over the data for warmup"),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
# fmt: on
):
"""
Benchmark a pipeline. Expects a loadable spaCy pipeline and benchmark
data in the binary .spacy format.
"""
import_code(code_path)
setup_gpu(use_gpu=use_gpu, silent=False)
nlp = util.load_model(model)
batch_size = batch_size if batch_size is not None else nlp.batch_size
corpus = Corpus(data_path)
docs = [eg.predicted for eg in corpus(nlp)]
if len(docs) == 0:
msg.fail("Cannot benchmark speed using an empty corpus.", exits=1)
print(f"Warming up for {warmup_epochs} epochs...")
warmup(nlp, docs, warmup_epochs, batch_size)
print()
print(f"Benchmarking {n_batches} batches...")
wps = benchmark(nlp, docs, n_batches, batch_size, not no_shuffle)
print()
print_outliers(wps)
print_mean_with_ci(wps)
# Lowercased, behaves as a context manager function.
class time_context:
"""Register the running time of a context."""
def __enter__(self):
self.start = time.perf_counter()
return self
def __exit__(self, type, value, traceback):
self.elapsed = time.perf_counter() - self.start
class Quartiles:
"""Calculate the q1, q2, q3 quartiles and the inter-quartile range (iqr)
of a sample."""
q1: float
q2: float
q3: float
iqr: float
def __init__(self, sample: numpy.ndarray) -> None:
self.q1 = numpy.quantile(sample, 0.25)
self.q2 = numpy.quantile(sample, 0.5)
self.q3 = numpy.quantile(sample, 0.75)
self.iqr = self.q3 - self.q1
def annotate(
nlp: Language, docs: List[Doc], batch_size: Optional[int]
) -> numpy.ndarray:
docs = nlp.pipe(tqdm(docs, unit="doc", disable=None), batch_size=batch_size)
wps = []
while True:
with time_context() as elapsed:
batch_docs = list(
islice(docs, batch_size if batch_size else nlp.batch_size)
)
if len(batch_docs) == 0:
break
n_tokens = count_tokens(batch_docs)
wps.append(n_tokens / elapsed.elapsed)
return numpy.array(wps)
def benchmark(
nlp: Language,
docs: List[Doc],
n_batches: int,
batch_size: int,
shuffle: bool,
) -> numpy.ndarray:
if shuffle:
bench_docs = [
nlp.make_doc(random.choice(docs).text)
for _ in range(n_batches * batch_size)
]
else:
bench_docs = [
nlp.make_doc(docs[i % len(docs)].text)
for i in range(n_batches * batch_size)
]
return annotate(nlp, bench_docs, batch_size)
def bootstrap(x, statistic=numpy.mean, iterations=10000) -> numpy.ndarray:
"""Apply a statistic to repeated random samples of an array."""
return numpy.fromiter(
(
statistic(numpy.random.choice(x, len(x), replace=True))
for _ in range(iterations)
),
numpy.float64,
)
def count_tokens(docs: Iterable[Doc]) -> int:
return sum(len(doc) for doc in docs)
def print_mean_with_ci(sample: numpy.ndarray):
mean = numpy.mean(sample)
bootstrap_means = bootstrap(sample)
bootstrap_means.sort()
# 95% confidence interval
low = bootstrap_means[int(len(bootstrap_means) * 0.025)]
high = bootstrap_means[int(len(bootstrap_means) * 0.975)]
print(f"Mean: {mean:.1f} words/s (95% CI: {low-mean:.1f} +{high-mean:.1f})")
def print_outliers(sample: numpy.ndarray):
quartiles = Quartiles(sample)
n_outliers = numpy.sum(
(sample < (quartiles.q1 - 1.5 * quartiles.iqr))
| (sample > (quartiles.q3 + 1.5 * quartiles.iqr))
)
n_extreme_outliers = numpy.sum(
(sample < (quartiles.q1 - 3.0 * quartiles.iqr))
| (sample > (quartiles.q3 + 3.0 * quartiles.iqr))
)
print(
f"Outliers: {(100 * n_outliers) / len(sample):.1f}%, extreme outliers: {(100 * n_extreme_outliers) / len(sample)}%"
)
def warmup(
nlp: Language, docs: List[Doc], warmup_epochs: int, batch_size: Optional[int]
) -> numpy.ndarray:
docs = [doc.copy() for doc in docs * warmup_epochs]
return annotate(nlp, docs, batch_size)
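bootstrap resamples the per-batch words-per-second measurements with replacement, so print_mean_with_ci can read the 95% interval straight off the sorted bootstrap means. A worked sketch of the same computation on synthetic data:

# Hedged sketch of the bootstrap confidence interval, on fake wps samples.
import numpy

rng = numpy.random.default_rng(0)
wps = rng.normal(loc=20000, scale=1500, size=50)  # synthetic words/s per batch

means = numpy.sort(numpy.fromiter(
    (numpy.mean(rng.choice(wps, len(wps), replace=True)) for _ in range(10000)),
    numpy.float64,
))
low, high = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
mean = wps.mean()
print(f"Mean: {mean:.1f} words/s (95% CI: {low - mean:.1f} +{high - mean:.1f})")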


@ -1,25 +1,29 @@
from typing import Optional, Any, List, Union
from enum import Enum
from pathlib import Path
from wasabi import Printer
import srsly
import itertools
import re
import sys
import itertools
from enum import Enum
from pathlib import Path
from typing import Any, Callable, Iterable, Mapping, Optional, Union
from ._util import app, Arg, Opt
import srsly
from wasabi import Printer
from ..tokens import Doc, DocBin
from ..training import docs_to_json
from ..tokens import DocBin
from ..training.converters import iob_to_docs, conll_ner_to_docs, json_to_docs
from ..training.converters import conllu_to_docs
from ..training.converters import (
conll_ner_to_docs,
conllu_to_docs,
iob_to_docs,
json_to_docs,
)
from ._util import Arg, Opt, app, walk_directory
# Converters are matched by file extension except for ner/iob, which are
# matched by file extension and content. To add a converter, add a new
# entry to this dict with the file extension mapped to the converter function
# imported from /converters.
CONVERTERS = {
CONVERTERS: Mapping[str, Callable[..., Iterable[Doc]]] = {
"conllubio": conllu_to_docs,
"conllu": conllu_to_docs,
"conll": conll_ner_to_docs,
@ -28,6 +32,8 @@ CONVERTERS = {
"json": json_to_docs,
}
AUTO = "auto"
# File types that can be written to stdout
FILE_TYPES_STDOUT = ("json",)
@ -49,7 +55,7 @@ def convert_cli(
model: Optional[str] = Opt(None, "--model", "--base", "-b", help="Trained spaCy pipeline for sentence segmentation to use as base (for --seg-sents)"),
morphology: bool = Opt(False, "--morphology", "-m", help="Enable appending morphology to tags"),
merge_subtokens: bool = Opt(False, "--merge-subtokens", "-T", help="Merge CoNLL-U subtokens"),
converter: str = Opt("auto", "--converter", "-c", help=f"Converter: {tuple(CONVERTERS.keys())}"),
converter: str = Opt(AUTO, "--converter", "-c", help=f"Converter: {tuple(CONVERTERS.keys())}"),
ner_map: Optional[Path] = Opt(None, "--ner-map", "-nm", help="NER tag mapping (as JSON-encoded dict of entity types)", exists=True),
lang: Optional[str] = Opt(None, "--lang", "-l", help="Language (if tokenizer required)"),
concatenate: bool = Opt(None, "--concatenate", "-C", help="Concatenate output to a single file"),
@ -66,19 +72,16 @@ def convert_cli(
DOCS: https://spacy.io/api/cli#convert
"""
if isinstance(file_type, FileTypes):
# We get an instance of the FileTypes from the CLI so we need its string value
file_type = file_type.value
input_path = Path(input_path)
output_dir = "-" if output_dir == Path("-") else output_dir
output_dir: Union[str, Path] = "-" if output_dir == Path("-") else output_dir
silent = output_dir == "-"
msg = Printer(no_print=silent)
verify_cli_args(msg, input_path, output_dir, file_type, converter, ner_map)
converter = _get_converter(msg, converter, input_path)
verify_cli_args(msg, input_path, output_dir, file_type.value, converter, ner_map)
convert(
input_path,
output_dir,
file_type=file_type,
file_type=file_type.value,
n_sents=n_sents,
seg_sents=seg_sents,
model=model,
@ -94,7 +97,7 @@ def convert_cli(
def convert(
input_path: Union[str, Path],
input_path: Path,
output_dir: Union[str, Path],
*,
file_type: str = "json",
@ -103,18 +106,19 @@ def convert(
model: Optional[str] = None,
morphology: bool = False,
merge_subtokens: bool = False,
converter: str = "auto",
converter: str,
ner_map: Optional[Path] = None,
lang: Optional[str] = None,
concatenate: bool = False,
silent: bool = True,
msg: Optional[Printer],
msg: Optional[Printer] = None,
) -> None:
input_path = Path(input_path)
if not msg:
msg = Printer(no_print=silent)
ner_map = srsly.read_json(ner_map) if ner_map is not None else None
doc_files = []
for input_loc in walk_directory(Path(input_path), converter):
for input_loc in walk_directory(input_path, converter):
with input_loc.open("r", encoding="utf-8") as infile:
input_data = infile.read()
# Use converter function to convert data
@ -141,7 +145,7 @@ def convert(
else:
db = DocBin(docs=docs, store_user_data=True)
len_docs = len(db)
data = db.to_bytes()
data = db.to_bytes() # type: ignore[assignment]
if output_dir == "-":
_print_docs_to_stdout(data, file_type)
else:
@ -191,42 +195,14 @@ def autodetect_ner_format(input_data: str) -> Optional[str]:
return None
def walk_directory(path: Path, converter: str) -> List[Path]:
if not path.is_dir():
return [path]
paths = [path]
locs = []
seen = set()
for path in paths:
if str(path) in seen:
continue
seen.add(str(path))
if path.parts[-1].startswith("."):
continue
elif path.is_dir():
paths.extend(path.iterdir())
elif converter == "json" and not path.parts[-1].endswith("json"):
continue
elif converter == "conll" and not path.parts[-1].endswith("conll"):
continue
elif converter == "iob" and not path.parts[-1].endswith("iob"):
continue
else:
locs.append(path)
# It's good to sort these, in case the ordering messes up cache.
locs.sort()
return locs
def verify_cli_args(
msg: Printer,
input_path: Union[str, Path],
input_path: Path,
output_dir: Union[str, Path],
file_type: FileTypes,
file_type: str,
converter: str,
ner_map: Optional[Path],
):
input_path = Path(input_path)
if file_type not in FILE_TYPES_STDOUT and output_dir == "-":
msg.fail(
f"Can't write .{file_type} data to stdout. Please specify an output directory.",
@ -242,18 +218,22 @@ def verify_cli_args(
input_locs = walk_directory(input_path, converter)
if len(input_locs) == 0:
msg.fail("No input files in directory", input_path, exits=1)
file_types = list(set([loc.suffix[1:] for loc in input_locs]))
if converter == "auto" and len(file_types) >= 2:
file_types = ",".join(file_types)
msg.fail("All input files must be same type", file_types, exits=1)
if converter != "auto" and converter not in CONVERTERS:
if converter not in CONVERTERS:
msg.fail(f"Can't find converter for {converter}", exits=1)
def _get_converter(msg, converter, input_path):
def _get_converter(msg, converter, input_path: Path):
if input_path.is_dir():
input_path = walk_directory(input_path, converter)[0]
if converter == "auto":
if converter == AUTO:
input_locs = walk_directory(input_path, suffix=None)
file_types = list(set([loc.suffix[1:] for loc in input_locs]))
if len(file_types) >= 2:
file_types_str = ",".join(file_types)
msg.fail("All input files must be same type", file_types_str, exits=1)
input_path = input_locs[0]
else:
input_path = walk_directory(input_path, suffix=converter)[0]
if converter == AUTO:
converter = input_path.suffix[1:]
if converter == "ner" or converter == "iob":
with input_path.open(encoding="utf8") as file_:
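With AUTO, the converter name is derived from the first input file's suffix after checking that all inputs share one type; ner/iob files are further sniffed by content. A simplified sketch of the suffix dispatch (converter names only, for illustration):

# Hedged sketch of the AUTO converter resolution (simplified).
from pathlib import Path

CONVERTERS = {"conllu": "conllu_to_docs", "conll": "conll_ner_to_docs",
              "ner": "conll_ner_to_docs", "iob": "iob_to_docs", "json": "json_to_docs"}

def resolve_converter(path: Path, converter: str = "auto") -> str:
    if converter == "auto":
        converter = path.suffix[1:]  # "train.conllu" -> "conllu"
    if converter not in CONVERTERS:
        raise ValueError(f"Can't find converter for {converter}")
    return CONVERTERS[converter]

print(resolve_converter(Path("train.conllu")))  # conllu_to_docs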


@ -1,15 +1,22 @@
from typing import Optional, Dict, Any, Union, List
from pathlib import Path
from wasabi import msg, table
from typing import Any, Dict, List, Optional, Union
import typer
from thinc.api import Config
from thinc.config import VARIABLE_RE
import typer
from wasabi import msg, table
from ._util import Arg, Opt, show_validation_error, parse_config_overrides
from ._util import import_code, debug_cli
from .. import util
from ..schemas import ConfigSchemaInit, ConfigSchemaTraining
from ..util import registry
from .. import util
from ._util import (
Arg,
Opt,
debug_cli,
import_code,
parse_config_overrides,
show_validation_error,
)
@debug_cli.command(
@ -25,7 +32,7 @@ def debug_config_cli(
show_vars: bool = Opt(False, "--show-variables", "-V", help="Show an overview of all variables referenced in the config and their values. This will also reflect variables overwritten on the CLI.")
# fmt: on
):
"""Debug a config.cfg file and show validation errors. The command will
"""Debug a config file and show validation errors. The command will
create all objects in the tree and validate them. Note that some config
validation errors are blocking and will prevent the rest of the config from
being resolved. This means that you may not see all validation errors at


@ -1,24 +1,49 @@
from typing import List, Sequence, Dict, Any, Tuple, Optional, Set
from pathlib import Path
from collections import Counter
import math
import sys
import srsly
from wasabi import Printer, MESSAGES, msg
import typer
from collections import Counter
from pathlib import Path
from typing import (
Any,
Dict,
Iterable,
List,
Optional,
Sequence,
Set,
Tuple,
Union,
cast,
overload,
)
from ._util import app, Arg, Opt, show_validation_error, parse_config_overrides
from ._util import import_code, debug_cli
from ..training import Example
from ..training.initialize import get_sourced_components
from ..schemas import ConfigSchemaTraining
import numpy
import srsly
import typer
from wasabi import MESSAGES, Printer, msg
from .. import util
from ..compat import Literal
from ..language import Language
from ..morphology import Morphology
from ..pipeline import Morphologizer, SpanCategorizer, TrainablePipe
from ..pipeline._edit_tree_internals.edit_trees import EditTrees
from ..pipeline._parser_internals import nonproj
from ..pipeline._parser_internals.nonproj import DELIMITER
from ..pipeline import Morphologizer
from ..morphology import Morphology
from ..language import Language
from ..schemas import ConfigSchemaTraining
from ..training import Example, remove_bilu_prefix
from ..training.initialize import get_sourced_components
from ..util import registry, resolve_dot_names
from .. import util
from ..vectors import Mode as VectorsMode
from ._util import (
Arg,
Opt,
_format_number,
app,
debug_cli,
import_code,
parse_config_overrides,
show_validation_error,
)
# Minimum number of expected occurrences of NER label in data to train new label
NEW_LABEL_THRESHOLD = 50
@ -27,6 +52,12 @@ DEP_LABEL_THRESHOLD = 20
# Minimum number of expected examples to train a new pipeline
BLANK_MODEL_MIN_THRESHOLD = 100
BLANK_MODEL_THRESHOLD = 2000
# Arbitrary threshold where SpanCat performs well
SPAN_DISTINCT_THRESHOLD = 1
# Arbitrary threshold where SpanCat performs well
BOUNDARY_DISTINCT_THRESHOLD = 1
# Arbitrary threshold for filtering span lengths during reporting (percentage)
SPAN_LENGTH_THRESHOLD_PERCENTAGE = 90
@debug_cli.command(
@ -101,13 +132,14 @@ def debug_data(
# Create the gold corpus to be able to better analyze data
dot_names = [T["train_corpus"], T["dev_corpus"]]
train_corpus, dev_corpus = resolve_dot_names(config, dot_names)
nlp.initialize(lambda: train_corpus(nlp))
msg.good("Pipeline can be initialized with data")
train_dataset = list(train_corpus(nlp))
dev_dataset = list(dev_corpus(nlp))
msg.good("Corpus is loadable")
nlp.initialize(lambda: train_dataset)
msg.good("Pipeline can be initialized with data")
# Create all gold data here to avoid iterating over the train_dataset constantly
gold_train_data = _compile_gold(train_dataset, factory_names, nlp, make_proj=True)
gold_train_unpreprocessed_data = _compile_gold(
@ -167,29 +199,164 @@ def debug_data(
show=verbose,
)
if len(nlp.vocab.vectors):
msg.info(
f"{len(nlp.vocab.vectors)} vectors ({nlp.vocab.vectors.n_keys} "
f"unique keys, {nlp.vocab.vectors_length} dimensions)"
)
n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
msg.warn(
"{} words in training data without vectors ({:.0f}%)".format(
n_missing_vectors,
100 * (n_missing_vectors / gold_train_data["n_words"]),
),
)
msg.text(
"10 most common words without vectors: {}".format(
_format_labels(
gold_train_data["words_missing_vectors"].most_common(10),
counts=True,
)
),
show=verbose,
)
if nlp.vocab.vectors.mode == VectorsMode.floret:
msg.info(
f"floret vectors with {len(nlp.vocab.vectors)} vectors, "
f"{nlp.vocab.vectors_length} dimensions, "
f"{nlp.vocab.vectors.minn}-{nlp.vocab.vectors.maxn} char "
f"n-gram subwords"
)
else:
msg.info(
f"{len(nlp.vocab.vectors)} vectors ({nlp.vocab.vectors.n_keys} "
f"unique keys, {nlp.vocab.vectors_length} dimensions)"
)
n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
msg.warn(
"{} words in training data without vectors ({:.0f}%)".format(
n_missing_vectors,
100 * (n_missing_vectors / gold_train_data["n_words"]),
),
)
msg.text(
"10 most common words without vectors: {}".format(
_format_labels(
gold_train_data["words_missing_vectors"].most_common(10),
counts=True,
)
),
show=verbose,
)
else:
msg.info("No word vectors present in the package")
if "spancat" in factory_names or "spancat_singlelabel" in factory_names:
model_labels_spancat = _get_labels_from_spancat(nlp)
has_low_data_warning = False
has_no_neg_warning = False
msg.divider("Span Categorization")
msg.table(model_labels_spancat, header=["Spans Key", "Labels"], divider=True)
msg.text("Label counts in train data: ", show=verbose)
for spans_key, data_labels in gold_train_data["spancat"].items():
msg.text(
f"Key: {spans_key}, {_format_labels(data_labels.items(), counts=True)}",
show=verbose,
)
# Data checks: only take the spans keys in the actual spancat components
data_labels_in_component = {
spans_key: gold_train_data["spancat"][spans_key]
for spans_key in model_labels_spancat.keys()
}
for spans_key, data_labels in data_labels_in_component.items():
for label, count in data_labels.items():
# Check for missing labels
spans_key_in_model = spans_key in model_labels_spancat.keys()
if (spans_key_in_model) and (
label not in model_labels_spancat[spans_key]
):
msg.warn(
f"Label '{label}' is not present in the model labels of key '{spans_key}'. "
"Performance may degrade after training."
)
# Check for low number of examples per label
if count <= NEW_LABEL_THRESHOLD:
msg.warn(
f"Low number of examples for label '{label}' in key '{spans_key}' ({count})"
)
has_low_data_warning = True
# Check for negative examples
with msg.loading("Analyzing label distribution..."):
neg_docs = _get_examples_without_label(
train_dataset, label, "spancat", spans_key
)
if neg_docs == 0:
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
has_no_neg_warning = True
with msg.loading("Obtaining span characteristics..."):
span_characteristics = _get_span_characteristics(
train_dataset, gold_train_data, spans_key
)
msg.info(f"Span characteristics for spans_key '{spans_key}'")
msg.info("SD = Span Distinctiveness, BD = Boundary Distinctiveness")
_print_span_characteristics(span_characteristics)
_span_freqs = _get_spans_length_freq_dist(
gold_train_data["spans_length"][spans_key]
)
_filtered_span_freqs = _filter_spans_length_freq_dist(
_span_freqs, threshold=SPAN_LENGTH_THRESHOLD_PERCENTAGE
)
msg.info(
f"Over {SPAN_LENGTH_THRESHOLD_PERCENTAGE}% of spans have lengths of 1 -- "
f"{max(_filtered_span_freqs.keys())} "
f"(min={span_characteristics['min_length']}, max={span_characteristics['max_length']}). "
f"The most common span lengths are: {_format_freqs(_filtered_span_freqs)}. "
"If you are using the n-gram suggester, note that omitting "
"infrequent n-gram lengths can greatly improve speed and "
"memory usage."
)
msg.text(
f"Full distribution of span lengths: {_format_freqs(_span_freqs)}",
show=verbose,
)
# Add report regarding span characteristics
if span_characteristics["avg_sd"] < SPAN_DISTINCT_THRESHOLD:
msg.warn("Spans may not be distinct from the rest of the corpus")
else:
msg.good("Spans are distinct from the rest of the corpus")
p_spans = span_characteristics["p_spans"].values()
all_span_tokens: Counter = sum(p_spans, Counter())
most_common_spans = [w for w, _ in all_span_tokens.most_common(10)]
msg.text(
"10 most common span tokens: {}".format(
_format_labels(most_common_spans)
),
show=verbose,
)
# Add report regarding span boundary characteristics
if span_characteristics["avg_bd"] < BOUNDARY_DISTINCT_THRESHOLD:
msg.warn("Boundary tokens are not distinct from the rest of the corpus")
else:
msg.good("Boundary tokens are distinct from the rest of the corpus")
p_bounds = span_characteristics["p_bounds"].values()
all_span_bound_tokens: Counter = sum(p_bounds, Counter())
most_common_bounds = [w for w, _ in all_span_bound_tokens.most_common(10)]
msg.text(
"10 most common span boundary tokens: {}".format(
_format_labels(most_common_bounds)
),
show=verbose,
)
if has_low_data_warning:
msg.text(
f"To train a new span type, your data should include at "
f"least {NEW_LABEL_THRESHOLD} instances of the new label",
show=verbose,
)
else:
msg.good("Good amount of examples for all labels")
if has_no_neg_warning:
msg.text(
"Training data should always include examples of spans "
"in context, as well as examples without a given span "
"type.",
show=verbose,
)
else:
msg.good("Examples without occurrences available for all labels")
if "ner" in factory_names:
# Get all unique NER labels present in the data
labels = set(
@ -200,7 +367,7 @@ def debug_data(
has_low_data_warning = False
has_no_neg_warning = False
has_ws_ents_error = False
has_punct_ents_warning = False
has_boundary_cross_ents_warning = False
msg.divider("Named Entity Recognition")
msg.info(f"{len(model_labels)} label(s)")
@ -215,7 +382,7 @@ def debug_data(
if label != "-"
]
labels_with_counts = _format_labels(labels_with_counts, counts=True)
msg.text(f"Labels in train data: {_format_labels(labels)}", show=verbose)
msg.text(f"Labels in train data: {labels_with_counts}", show=verbose)
missing_labels = model_labels - labels
if missing_labels:
msg.warn(
@ -227,10 +394,6 @@ def debug_data(
msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans")
has_ws_ents_error = True
if gold_train_data["punct_ents"]:
msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation")
has_punct_ents_warning = True
for label in labels:
if label_counts[label] <= NEW_LABEL_THRESHOLD:
msg.warn(
@ -239,19 +402,25 @@ def debug_data(
has_low_data_warning = True
with msg.loading("Analyzing label distribution..."):
neg_docs = _get_examples_without_label(train_dataset, label)
neg_docs = _get_examples_without_label(train_dataset, label, "ner")
if neg_docs == 0:
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
has_no_neg_warning = True
if gold_train_data["boundary_cross_ents"]:
msg.warn(
f"{gold_train_data['boundary_cross_ents']} entity span(s) crossing sentence boundaries"
)
has_boundary_cross_ents_warning = True
if not has_low_data_warning:
msg.good("Good amount of examples for all labels")
if not has_no_neg_warning:
msg.good("Examples without occurrences available for all labels")
if not has_ws_ents_error:
msg.good("No entities consisting of or starting/ending with whitespace")
if not has_punct_ents_warning:
msg.good("No entities consisting of or starting/ending with punctuation")
if not has_boundary_cross_ents_warning:
msg.good("No entities crossing sentence boundaries")
if has_low_data_warning:
msg.text(
@ -267,15 +436,9 @@ def debug_data(
show=verbose,
)
if has_ws_ents_error:
msg.text(
"As of spaCy v2.1.0, entity spans consisting of or starting/ending "
"with whitespace characters are considered invalid."
)
if has_punct_ents_warning:
msg.text(
"Entity spans consisting of or starting/ending "
"with punctuation can not be trained with a noise level > 0."
"with whitespace characters are considered invalid."
)
if "textcat" in factory_names:
@ -377,10 +540,15 @@ def debug_data(
if "tagger" in factory_names:
msg.divider("Part-of-speech Tagging")
labels = [label for label in gold_train_data["tags"]]
label_list, counts = zip(*gold_train_data["tags"].items())
msg.info(f"{len(label_list)} label(s) in train data")
p = numpy.array(counts)
p = p / p.sum()
norm_entropy = (-p * numpy.log2(p)).sum() / numpy.log2(len(label_list))
msg.info(f"{norm_entropy} is the normalised label entropy")
model_labels = _get_labels_from_model(nlp, "tagger")
msg.info(f"{len(labels)} label(s) in train data")
missing_labels = model_labels - set(labels)
labels = set(label_list)
missing_labels = model_labels - labels
if missing_labels:
msg.warn(
"Some model labels are not present in the train data. The "
@ -394,10 +562,11 @@ def debug_data(
if "morphologizer" in factory_names:
msg.divider("Morphologizer (POS+Morph)")
labels = [label for label in gold_train_data["morphs"]]
label_list = [label for label in gold_train_data["morphs"]]
model_labels = _get_labels_from_model(nlp, "morphologizer")
msg.info(f"{len(labels)} label(s) in train data")
missing_labels = model_labels - set(labels)
msg.info(f"{len(label_list)} label(s) in train data")
labels = set(label_list)
missing_labels = model_labels - labels
if missing_labels:
msg.warn(
"Some model labels are not present in the train data. The "
@ -526,6 +695,59 @@ def debug_data(
f"Found {gold_train_data['n_cycles']} projectivized train sentence(s) with cycles"
)
if "trainable_lemmatizer" in factory_names:
msg.divider("Trainable Lemmatizer")
trees_train: Set[str] = gold_train_data["lemmatizer_trees"]
trees_dev: Set[str] = gold_dev_data["lemmatizer_trees"]
# This is necessary context when someone is attempting to interpret whether the
# number of trees exclusively in the dev set is meaningful.
msg.info(f"{len(trees_train)} lemmatizer trees generated from training data")
msg.info(f"{len(trees_dev)} lemmatizer trees generated from dev data")
dev_not_train = trees_dev - trees_train
if len(dev_not_train) != 0:
pct = len(dev_not_train) / len(trees_dev)
msg.info(
f"{len(dev_not_train)} lemmatizer trees ({pct*100:.1f}% of dev trees)"
" were found exclusively in the dev data."
)
else:
# Would we ever expect this case? It seems like it would be pretty rare,
# and we might actually want a warning?
msg.info("All trees in dev data present in training data.")
if gold_train_data["n_low_cardinality_lemmas"] > 0:
n = gold_train_data["n_low_cardinality_lemmas"]
msg.warn(f"{n} training docs with 0 or 1 unique lemmas.")
if gold_dev_data["n_low_cardinality_lemmas"] > 0:
n = gold_dev_data["n_low_cardinality_lemmas"]
msg.warn(f"{n} dev docs with 0 or 1 unique lemmas.")
if gold_train_data["no_lemma_annotations"] > 0:
n = gold_train_data["no_lemma_annotations"]
msg.warn(f"{n} training docs with no lemma annotations.")
else:
msg.good("All training docs have lemma annotations.")
if gold_dev_data["no_lemma_annotations"] > 0:
n = gold_dev_data["no_lemma_annotations"]
msg.warn(f"{n} dev docs with no lemma annotations.")
else:
msg.good("All dev docs have lemma annotations.")
if gold_train_data["partial_lemma_annotations"] > 0:
n = gold_train_data["partial_lemma_annotations"]
msg.info(f"{n} training docs with partial lemma annotations.")
else:
msg.good("All training docs have complete lemma annotations.")
if gold_dev_data["partial_lemma_annotations"] > 0:
n = gold_dev_data["partial_lemma_annotations"]
msg.info(f"{n} dev docs with partial lemma annotations.")
else:
msg.good("All dev docs have complete lemma annotations.")
msg.divider("Summary")
good_counts = msg.counts[MESSAGES.GOOD]
warn_counts = msg.counts[MESSAGES.WARN]
@ -564,7 +786,7 @@ def _compile_gold(
nlp: Language,
make_proj: bool,
) -> Dict[str, Any]:
data = {
data: Dict[str, Any] = {
"ner": Counter(),
"cats": Counter(),
"tags": Counter(),
@ -572,8 +794,12 @@ def _compile_gold(
"deps": Counter(),
"words": Counter(),
"roots": Counter(),
"spancat": dict(),
"spans_length": dict(),
"spans_per_type": dict(),
"sb_per_type": dict(),
"ws_ents": 0,
"punct_ents": 0,
"boundary_cross_ents": 0,
"n_words": 0,
"n_misaligned_words": 0,
"words_missing_vectors": Counter(),
@ -583,7 +809,13 @@ def _compile_gold(
"n_cats_multilabel": 0,
"n_cats_bad_values": 0,
"texts": set(),
"lemmatizer_trees": set(),
"no_lemma_annotations": 0,
"partial_lemma_annotations": 0,
"n_low_cardinality_lemmas": 0,
}
if "trainable_lemmatizer" in factory_names:
trees = EditTrees(nlp.vocab.strings)
for eg in examples:
gold = eg.reference
doc = eg.predicted
@ -602,27 +834,74 @@ def _compile_gold(
if nlp.vocab.strings[word] not in nlp.vocab.vectors:
data["words_missing_vectors"].update([word])
if "ner" in factory_names:
sent_starts = eg.get_aligned_sent_starts()
for i, label in enumerate(eg.get_aligned_ner()):
if label is None:
continue
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
# "Illegal" whitespace entity
data["ws_ents"] += 1
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [
".",
"'",
"!",
"?",
",",
]:
# punctuation entity: could be replaced by whitespace when training with noise,
# so add a warning to alert the user to this unexpected side effect.
data["punct_ents"] += 1
if label.startswith(("B-", "U-")):
combined_label = label.split("-")[1]
combined_label = remove_bilu_prefix(label)
data["ner"][combined_label] += 1
if sent_starts[i] and label.startswith(("I-", "L-")):
data["boundary_cross_ents"] += 1
elif label == "-":
data["ner"]["-"] += 1
if "spancat" in factory_names or "spancat_singlelabel" in factory_names:
for spans_key in list(eg.reference.spans.keys()):
# Obtain the span frequency
if spans_key not in data["spancat"]:
data["spancat"][spans_key] = Counter()
for i, span in enumerate(eg.reference.spans[spans_key]):
if span.label_ is None:
continue
else:
data["spancat"][spans_key][span.label_] += 1
# Obtain the span length
if spans_key not in data["spans_length"]:
data["spans_length"][spans_key] = dict()
for span in gold.spans[spans_key]:
if span.label_ is None:
continue
if span.label_ not in data["spans_length"][spans_key]:
data["spans_length"][spans_key][span.label_] = []
data["spans_length"][spans_key][span.label_].append(len(span))
# Obtain spans per span type
if spans_key not in data["spans_per_type"]:
data["spans_per_type"][spans_key] = dict()
for span in gold.spans[spans_key]:
if span.label_ not in data["spans_per_type"][spans_key]:
data["spans_per_type"][spans_key][span.label_] = []
data["spans_per_type"][spans_key][span.label_].append(span)
# Obtain boundary tokens per span type
window_size = 1
if spans_key not in data["sb_per_type"]:
data["sb_per_type"][spans_key] = dict()
for span in gold.spans[spans_key]:
if span.label_ not in data["sb_per_type"][spans_key]:
# Creating a data structure that holds the start and
# end tokens for each span type
data["sb_per_type"][spans_key][span.label_] = {
"start": [],
"end": [],
}
for offset in range(window_size):
sb_start_idx = span.start - (offset + 1)
if sb_start_idx >= 0:
data["sb_per_type"][spans_key][span.label_]["start"].append(
gold[sb_start_idx : sb_start_idx + 1]
)
sb_end_idx = span.end + (offset + 1)
if sb_end_idx <= len(gold):
data["sb_per_type"][spans_key][span.label_]["end"].append(
gold[sb_end_idx - 1 : sb_end_idx]
)
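# Illustration: with window_size=1, a span gold[3:5] contributes the token
# slice gold[2:3] as its "start" boundary and gold[5:6] (the token just
# after the span) as its "end" boundary, provided both offsets fall inside
# the Doc.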
if "textcat" in factory_names or "textcat_multilabel" in factory_names:
data["cats"].update(gold.cats)
if any(val not in (0, 1) for val in gold.cats.values()):
@ -666,30 +945,288 @@ def _compile_gold(
data["n_nonproj"] += 1
if nonproj.contains_cycle(aligned_heads):
data["n_cycles"] += 1
if "trainable_lemmatizer" in factory_names:
# from EditTreeLemmatizer._labels_from_data
if all(token.lemma == 0 for token in gold):
data["no_lemma_annotations"] += 1
continue
if any(token.lemma == 0 for token in gold):
data["partial_lemma_annotations"] += 1
lemma_set = set()
for token in gold:
if token.lemma != 0:
lemma_set.add(token.lemma)
tree_id = trees.add(token.text, token.lemma_)
tree_str = trees.tree_to_str(tree_id)
data["lemmatizer_trees"].add(tree_str)
# We want to identify cases where lemmas aren't assigned
# or are all assigned the same value, as this would indicate
# an issue since we're expecting a large set of lemmas
if len(lemma_set) < 2 and len(gold) > 1:
data["n_low_cardinality_lemmas"] += 1
return data
def _format_labels(labels: List[Tuple[str, int]], counts: bool = False) -> str:
@overload
def _format_labels(labels: Iterable[str], counts: Literal[False] = False) -> str:
...
@overload
def _format_labels(
labels: Iterable[Tuple[str, int]],
counts: Literal[True],
) -> str:
...
def _format_labels(
labels: Union[Iterable[str], Iterable[Tuple[str, int]]],
counts: bool = False,
) -> str:
if counts:
return ", ".join([f"'{l}' ({c})" for l, c in labels])
return ", ".join([f"'{l}'" for l in labels])
return ", ".join(
[f"'{l}' ({c})" for l, c in cast(Iterable[Tuple[str, int]], labels)]
)
return ", ".join([f"'{l}'" for l in cast(Iterable[str], labels)])
def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
def _format_freqs(freqs: Dict[int, float], sort: bool = True) -> str:
if sort:
freqs = dict(sorted(freqs.items()))
_freqs = [(str(k), v) for k, v in freqs.items()]
return ", ".join(
[f"{l} ({c}%)" for l, c in cast(Iterable[Tuple[str, float]], _freqs)]
)
def _get_examples_without_label(
data: Sequence[Example],
label: str,
component: Literal["ner", "spancat"] = "ner",
spans_key: Optional[str] = "sc",
) -> int:
count = 0
for eg in data:
labels = [
label.split("-")[1]
for label in eg.get_aligned_ner()
if label not in ("O", "-", None)
]
if component == "ner":
labels = [
remove_bilu_prefix(label)
for label in eg.get_aligned_ner()
if label not in ("O", "-", None)
]
if component == "spancat":
labels = (
[span.label_ for span in eg.reference.spans[spans_key]]
if spans_key in eg.reference.spans
else []
)
if label not in labels:
count += 1
return count
def _get_labels_from_model(nlp: Language, pipe_name: str) -> Set[str]:
if pipe_name not in nlp.pipe_names:
return set()
pipe = nlp.get_pipe(pipe_name)
return set(pipe.labels)
def _get_labels_from_model(nlp: Language, factory_name: str) -> Set[str]:
pipe_names = [
pipe_name
for pipe_name in nlp.pipe_names
if nlp.get_pipe_meta(pipe_name).factory == factory_name
]
labels: Set[str] = set()
for pipe_name in pipe_names:
pipe = nlp.get_pipe(pipe_name)
assert isinstance(pipe, TrainablePipe)
labels.update(pipe.labels)
return labels
def _get_labels_from_spancat(nlp: Language) -> Dict[str, Set[str]]:
pipe_names = [
pipe_name
for pipe_name in nlp.pipe_names
if nlp.get_pipe_meta(pipe_name).factory in ("spancat", "spancat_singlelabel")
]
labels: Dict[str, Set[str]] = {}
for pipe_name in pipe_names:
pipe = nlp.get_pipe(pipe_name)
assert isinstance(pipe, SpanCategorizer)
if pipe.key not in labels:
labels[pipe.key] = set()
labels[pipe.key].update(pipe.labels)
return labels
def _gmean(l: List) -> float:
"""Compute geometric mean of a list"""
return math.exp(math.fsum(math.log(i) for i in l) / len(l))
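# For example, _gmean([1, 2, 4]) == 2.0 (the exponential of the mean log
# value); unlike the arithmetic mean, a few very long spans won't dominate.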
def _wgt_average(metric: Dict[str, float], frequencies: Counter) -> float:
total = sum(value * frequencies[span_type] for span_type, value in metric.items())
return total / sum(frequencies.values())
def _get_distribution(docs, normalize: bool = True) -> Counter:
"""Get the frequency distribution given a set of Docs"""
word_counts: Counter = Counter()
for doc in docs:
for token in doc:
# Normalize the text
t = token.text.lower().replace("``", '"').replace("''", '"')
word_counts[t] += 1
if normalize:
total = sum(word_counts.values(), 0.0)
word_counts = Counter({k: v / total for k, v in word_counts.items()})
return word_counts
def _get_kl_divergence(p: Counter, q: Counter) -> float:
"""Compute the Kullback-Leibler divergence from two frequency distributions"""
total = 0.0
for word, p_word in p.items():
total += p_word * math.log(p_word / q[word])
return total
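# Intuition with made-up numbers: for a corpus distribution
# q = {"the": 0.5, "acme": 0.1, "inc": 0.1, "said": 0.3} and a span-token
# distribution p = {"acme": 0.5, "inc": 0.5},
# KL(p||q) = 0.5*log(5) + 0.5*log(5) = log(5) ~= 1.61, i.e. highly distinct.
# q[word] is assumed non-zero wherever p has mass, which holds here because
# span tokens are drawn from the same corpus as q.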
def _format_span_row(span_data: List[Dict], labels: List[str]) -> List[Any]:
"""Compile into one list for easier reporting"""
d = {
label: [label] + list(_format_number(d[label]) for d in span_data)
for label in labels
}
return list(d.values())
def _get_span_characteristics(
examples: List[Example], compiled_gold: Dict[str, Any], spans_key: str
) -> Dict[str, Any]:
"""Obtain all span characteristics"""
data_labels = compiled_gold["spancat"][spans_key]
# Get lengths
span_length = {
label: _gmean(l)
for label, l in compiled_gold["spans_length"][spans_key].items()
}
spans_per_type = {
label: len(spans)
for label, spans in compiled_gold["spans_per_type"][spans_key].items()
}
min_lengths = [min(l) for l in compiled_gold["spans_length"][spans_key].values()]
max_lengths = [max(l) for l in compiled_gold["spans_length"][spans_key].values()]
# Get relevant distributions: corpus, spans, span boundaries
p_corpus = _get_distribution([eg.reference for eg in examples], normalize=True)
p_spans = {
label: _get_distribution(spans, normalize=True)
for label, spans in compiled_gold["spans_per_type"][spans_key].items()
}
p_bounds = {
label: _get_distribution(sb["start"] + sb["end"], normalize=True)
for label, sb in compiled_gold["sb_per_type"][spans_key].items()
}
# Compute for actual span characteristics
span_distinctiveness = {
label: _get_kl_divergence(freq_dist, p_corpus)
for label, freq_dist in p_spans.items()
}
sb_distinctiveness = {
label: _get_kl_divergence(freq_dist, p_corpus)
for label, freq_dist in p_bounds.items()
}
return {
"sd": span_distinctiveness,
"bd": sb_distinctiveness,
"spans_per_type": spans_per_type,
"lengths": span_length,
"min_length": min(min_lengths),
"max_length": max(max_lengths),
"avg_sd": _wgt_average(span_distinctiveness, data_labels),
"avg_bd": _wgt_average(sb_distinctiveness, data_labels),
"avg_length": _wgt_average(span_length, data_labels),
"labels": list(data_labels.keys()),
"p_spans": p_spans,
"p_bounds": p_bounds,
}
def _print_span_characteristics(span_characteristics: Dict[str, Any]):
"""Print all span characteristics into a table"""
headers = ("Span Type", "Length", "SD", "BD", "N")
# Wasabi has this at 30 by default, but we might have some long labels
max_col = max(30, max(len(label) for label in span_characteristics["labels"]))
# Prepare table data with all span characteristics
table_data = [
span_characteristics["lengths"],
span_characteristics["sd"],
span_characteristics["bd"],
span_characteristics["spans_per_type"],
]
table = _format_span_row(
span_data=table_data, labels=span_characteristics["labels"]
)
# Prepare table footer with weighted averages
footer_data = [
span_characteristics["avg_length"],
span_characteristics["avg_sd"],
span_characteristics["avg_bd"],
]
footer = (
["Wgt. Average"] + ["{:.2f}".format(round(f, 2)) for f in footer_data] + ["-"]
)
msg.table(
table,
footer=footer,
header=headers,
divider=True,
aligns=["l"] + ["r"] * (len(footer_data) + 1),
max_col=max_col,
)
def _get_spans_length_freq_dist(
length_dict: Dict, threshold=SPAN_LENGTH_THRESHOLD_PERCENTAGE
) -> Dict[int, float]:
"""Get frequency distribution of spans length under a certain threshold"""
all_span_lengths = []
for _, lengths in length_dict.items():
all_span_lengths.extend(lengths)
freq_dist: Counter = Counter()
for i in all_span_lengths:
if freq_dist.get(i):
freq_dist[i] += 1
else:
freq_dist[i] = 1
# We will be working with percentages instead of raw counts
freq_dist_percentage = {}
for span_length, count in freq_dist.most_common():
percentage = (count / len(all_span_lengths)) * 100.0
percentage = round(percentage, 2)
freq_dist_percentage[span_length] = percentage
return freq_dist_percentage
def _filter_spans_length_freq_dist(
freq_dist: Dict[int, float], threshold: int
) -> Dict[int, float]:
"""Filter frequency distribution with respect to a threshold
We're going to filter all the span lengths that fall
around a percentage threshold when summed.
"""
total = 0.0
filtered_freq_dist = {}
for span_length, dist in freq_dist.items():
if total >= threshold:
break
else:
filtered_freq_dist[span_length] = dist
total += dist
return filtered_freq_dist
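# For intuition: with freqs = {1: 60.0, 2: 25.0, 3: 10.0, 4: 5.0} (already in
# most_common order) and threshold=90, the filtered result is
# {1: 60.0, 2: 25.0, 3: 10.0} -- length 4 is dropped because the cumulative
# percentage has already reached the 90% threshold.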

spacy/cli/debug_diff.py

@ -0,0 +1,89 @@
from pathlib import Path
from typing import Optional
import typer
from thinc.api import Config
from wasabi import MarkdownRenderer, Printer, diff_strings
from ..util import load_config
from ._util import Arg, Opt, debug_cli, parse_config_overrides, show_validation_error
from .init_config import Optimizations, init_config
@debug_cli.command(
"diff-config",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def debug_diff_cli(
# fmt: off
ctx: typer.Context,
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
compare_to: Optional[Path] = Opt(None, help="Path to a config file to diff against, or `None` to compare against default settings", exists=True, allow_dash=True),
optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether the user config was optimized for efficiency or accuracy. Only relevant when comparing against the default config."),
gpu: bool = Opt(False, "--gpu", "-G", help="Whether the original config can run on a GPU. Only relevant when comparing against the default config."),
pretraining: bool = Opt(False, "--pretraining", "--pt", help="Whether to compare on a config with pretraining involved. Only relevant when comparing against the default config."),
markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues")
# fmt: on
):
"""Show a diff of a config file with respect to spaCy's defaults or another config file. If
additional settings were used in the creation of the config file, then you
must supply these as extra parameters to the command when comparing to the default settings. The generated diff
can also be used when posting to the discussion forum to provide more
information for the maintainers.
The `optimize`, `gpu`, and `pretraining` options are only relevant when
comparing against the default configuration (or specifically when `compare_to` is None).
DOCS: https://spacy.io/api/cli#debug-diff
"""
debug_diff(
config_path=config_path,
compare_to=compare_to,
gpu=gpu,
optimize=optimize,
pretraining=pretraining,
markdown=markdown,
)
def debug_diff(
config_path: Path,
compare_to: Optional[Path],
gpu: bool,
optimize: Optimizations,
pretraining: bool,
markdown: bool,
):
msg = Printer()
with show_validation_error(hint_fill=False):
user_config = load_config(config_path)
if compare_to:
other_config = load_config(compare_to)
else:
# Recreate a default config based on the user's config
lang = user_config["nlp"]["lang"]
pipeline = list(user_config["nlp"]["pipeline"])
msg.info(f"Found user-defined language: '{lang}'")
msg.info(f"Found user-defined pipelines: {pipeline}")
other_config = init_config(
lang=lang,
pipeline=pipeline,
optimize=optimize.value,
gpu=gpu,
pretraining=pretraining,
silent=True,
)
user = user_config.to_str()
other = other_config.to_str()
if user == other:
msg.warn("No diff to show: configs are identical")
else:
diff_text = diff_strings(other, user, add_symbols=markdown)
if markdown:
md = MarkdownRenderer()
md.add(md.code_block(diff_text, "diff"))
print(md.text)
else:
print(diff_text)
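A minimal usage sketch of the Python entry point (the config path is an assumption; the CLI equivalent is `spacy debug diff-config config.cfg`):

from pathlib import Path

from spacy.cli.debug_diff import debug_diff
from spacy.cli.init_config import Optimizations

debug_diff(
    config_path=Path("config.cfg"),
    compare_to=None,  # None diffs against spaCy's generated defaults
    gpu=False,
    optimize=Optimizations.efficiency,
    pretraining=False,
    markdown=False,
)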

spacy/cli/debug_model.py

@ -1,19 +1,32 @@
from typing import Dict, Any, Optional, Iterable
from pathlib import Path
import itertools
from pathlib import Path
from typing import Any, Dict, Optional
import typer
from thinc.api import (
Model,
data_validation,
fix_random_seed,
set_dropout_rate,
set_gpu_allocator,
)
from wasabi import msg
from spacy.training import Example
from spacy.util import resolve_dot_names
from wasabi import msg
from thinc.api import fix_random_seed, set_dropout_rate, Adam
from thinc.api import Model, data_validation, set_gpu_allocator
import typer
from ._util import Arg, Opt, debug_cli, show_validation_error
from ._util import parse_config_overrides, string_to_list, setup_gpu
from .. import util
from ..schemas import ConfigSchemaTraining
from ..util import registry
from .. import util
from ._util import (
Arg,
Opt,
debug_cli,
parse_config_overrides,
setup_gpu,
show_validation_error,
string_to_list,
)
@debug_cli.command(
@ -133,15 +146,16 @@ def debug_model(
_print_model(model, print_settings)
# STEP 2: Updating the model and printing again
optimizer = Adam(0.001)
set_dropout_rate(model, 0.2)
# ugly hack to deal with Tok2Vec/Transformer listeners
upstream_component = None
if model.has_ref("tok2vec") and "tok2vec-listener" in model.get_ref("tok2vec").name:
upstream_component = nlp.get_pipe("tok2vec")
if model.has_ref("tok2vec") and "transformer-listener" in model.get_ref("tok2vec").name:
if (
model.has_ref("tok2vec")
and "transformer-listener" in model.get_ref("tok2vec").name
):
upstream_component = nlp.get_pipe("transformer")
goldY = None
for e in range(3):
if upstream_component:
upstream_component.update(examples)
@ -156,7 +170,7 @@ def debug_model(
msg.divider(f"STEP 3 - prediction")
msg.info(str(prediction))
msg.good(f"Succesfully ended analysis - model looks good.")
msg.good(f"Successfully ended analysis - model looks good.")
def _sentences():

spacy/cli/download.py

@ -1,13 +1,22 @@
from typing import Optional, Sequence
import requests
import sys
from wasabi import msg
import typer
from typing import Optional, Sequence
from urllib.parse import urljoin
import requests
import typer
from wasabi import msg
from ._util import app, Arg, Opt, WHEEL_SUFFIX, SDIST_SUFFIX
from .. import about
from ..util import is_package, get_base_version, run_command
from ..errors import OLD_MODEL_SHORTCUTS
from ..util import (
get_minor_version,
is_in_interactive,
is_in_jupyter,
is_package,
is_prerelease_version,
run_command,
)
from ._util import SDIST_SUFFIX, WHEEL_SUFFIX, Arg, Opt, app
@app.command(
@ -19,7 +28,7 @@ def download_cli(
ctx: typer.Context,
model: str = Arg(..., help="Name of pipeline package to download"),
direct: bool = Opt(False, "--direct", "-d", "-D", help="Force direct download of name + version"),
sdist: bool = Opt(False, "--sdist", "-S", help="Download sdist (.tar.gz) archive instead of pre-built binary wheel")
sdist: bool = Opt(False, "--sdist", "-S", help="Download sdist (.tar.gz) archive instead of pre-built binary wheel"),
# fmt: on
):
"""
@ -35,7 +44,12 @@ def download_cli(
download(model, direct, sdist, *ctx.args)
def download(model: str, direct: bool = False, sdist: bool = False, *pip_args) -> None:
def download(
model: str,
direct: bool = False,
sdist: bool = False,
*pip_args,
) -> None:
if (
not (is_package("spacy") or is_package("spacy-nightly"))
and "--no-deps" not in pip_args
@ -49,13 +63,17 @@ def download(model: str, direct: bool = False, sdist: bool = False, *pip_args) -
"dependencies, you'll have to install them manually."
)
pip_args = pip_args + ("--no-deps",)
suffix = SDIST_SUFFIX if sdist else WHEEL_SUFFIX
dl_tpl = "{m}-{v}/{m}-{v}{s}#egg={m}=={v}"
if direct:
# Reject model names with '/', in order to prevent shenanigans.
if "/" in model:
msg.fail(
title="Model download rejected",
text=f"Cannot download model '{model}'. Models are expected to be file names, not URLs or fragments",
exits=True,
)
components = model.split("-")
model_name = "".join(components[:-1])
version = components[-1]
download_model(dl_tpl.format(m=model_name, v=version, s=suffix), pip_args)
else:
model_name = model
if model in OLD_MODEL_SHORTCUTS:
@ -66,15 +84,49 @@ def download(model: str, direct: bool = False, sdist: bool = False, *pip_args) -
model_name = OLD_MODEL_SHORTCUTS[model]
compatibility = get_compatibility()
version = get_version(model_name, compatibility)
download_model(dl_tpl.format(m=model_name, v=version, s=suffix), pip_args)
filename = get_model_filename(model_name, version, sdist)
download_model(filename, pip_args)
msg.good(
"Download and installation successful",
f"You can now load the package via spacy.load('{model_name}')",
)
if is_in_jupyter():
reload_deps_msg = (
"If you are in a Jupyter or Colab notebook, you may need to "
"restart Python in order to load all the package's dependencies. "
"You can do this by selecting the 'Restart kernel' or 'Restart "
"runtime' option."
)
msg.warn(
"Restart to reload dependencies",
reload_deps_msg,
)
elif is_in_interactive():
reload_deps_msg = (
"If you are in an interactive Python session, you may need to "
"exit and restart Python to load all the package's dependencies. "
"You can exit with Ctrl-D (or Ctrl-Z and Enter on Windows)."
)
msg.warn(
"Restart to reload dependencies",
reload_deps_msg,
)
def get_model_filename(model_name: str, version: str, sdist: bool = False) -> str:
dl_tpl = "{m}-{v}/{m}-{v}{s}"
suffix = SDIST_SUFFIX if sdist else WHEEL_SUFFIX
filename = dl_tpl.format(m=model_name, v=version, s=suffix)
return filename
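# For example, get_model_filename("en_core_web_sm", "3.5.0") returns
# "en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl" (assuming
# WHEEL_SUFFIX == "-py3-none-any.whl"); with sdist=True the .tar.gz
# SDIST_SUFFIX is used instead.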
def get_compatibility() -> dict:
version = get_base_version(about.__version__)
if is_prerelease_version(about.__version__):
version: Optional[str] = about.__version__
else:
version = get_minor_version(about.__version__)
r = requests.get(about.__compatibility__)
if r.status_code != 200:
msg.fail(
@ -101,10 +153,24 @@ def get_version(model: str, comp: dict) -> str:
return comp[model][0]
def get_latest_version(model: str) -> str:
comp = get_compatibility()
return get_version(model, comp)
def download_model(
filename: str, user_pip_args: Optional[Sequence[str]] = None
) -> None:
download_url = about.__download_url__ + "/" + filename
# Construct the download URL carefully. We need to make sure we don't
# allow relative paths or other shenanigans to trick us into downloading
# from outside our own repo.
base_url = about.__download_url__
# urljoin requires that the path ends with /, or the last path part will be dropped
if not base_url.endswith("/"):
base_url = about.__download_url__ + "/"
download_url = urljoin(base_url, filename)
if not download_url.startswith(about.__download_url__):
raise ValueError(f"Download from {filename} rejected. Was it a relative path?")
pip_args = list(user_pip_args) if user_pip_args is not None else []
cmd = [sys.executable, "-m", "pip", "install"] + pip_args + [download_url]
run_command(cmd)
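A quick sketch of why the urljoin() guard matters (the base URL value is assumed from about.__download_url__):

from urllib.parse import urljoin

base = "https://github.com/explosion/spacy-models/releases/download/"
print(urljoin(base, "en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl"))
# stays under the releases URL
print(urljoin(base, "../../evil/payload.whl"))
# resolves to https://github.com/explosion/spacy-models/evil/payload.whl,
# which fails the startswith() check, so download_model() raises ValueError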

spacy/cli/evaluate.py

@ -1,18 +1,21 @@
from typing import Optional, List, Dict
from wasabi import Printer
from pathlib import Path
import re
from pathlib import Path
from typing import Any, Dict, List, Optional, Union
import srsly
from thinc.api import fix_random_seed
from wasabi import Printer
from ..training import Corpus
from ..tokens import Doc
from ._util import app, Arg, Opt, setup_gpu, import_code
from .. import displacy, util
from ..scorer import Scorer
from .. import util
from .. import displacy
from ..tokens import Doc
from ..training import Corpus
from ._util import Arg, Opt, app, benchmark_cli, import_code, setup_gpu
@benchmark_cli.command(
"accuracy",
)
@app.command("evaluate")
def evaluate_cli(
# fmt: off
@ -24,6 +27,8 @@ def evaluate_cli(
gold_preproc: bool = Opt(False, "--gold-preproc", "-G", help="Use gold preprocessing"),
displacy_path: Optional[Path] = Opt(None, "--displacy-path", "-dp", help="Directory to output rendered parses as HTML", exists=True, file_okay=False),
displacy_limit: int = Opt(25, "--displacy-limit", "-dl", help="Limit of parses to render as HTML"),
per_component: bool = Opt(False, "--per-component", "-P", help="Return scores per component, only applicable when an output JSON file is specified."),
spans_key: str = Opt("sc", "--spans-key", "-sk", help="Spans key to use when evaluating Doc.spans"),
# fmt: on
):
"""
@ -36,7 +41,7 @@ def evaluate_cli(
dependency parses in a HTML file, set as output directory as the
displacy_path argument.
DOCS: https://spacy.io/api/cli#evaluate
DOCS: https://spacy.io/api/cli#benchmark-accuracy
"""
import_code(code_path)
evaluate(
@ -47,7 +52,9 @@ def evaluate_cli(
gold_preproc=gold_preproc,
displacy_path=displacy_path,
displacy_limit=displacy_limit,
per_component=per_component,
silent=False,
spans_key=spans_key,
)
@ -60,10 +67,12 @@ def evaluate(
displacy_path: Optional[Path] = None,
displacy_limit: int = 25,
silent: bool = True,
) -> Scorer:
spans_key: str = "sc",
per_component: bool = False,
) -> Dict[str, Any]:
msg = Printer(no_print=silent, pretty=not silent)
fix_random_seed()
setup_gpu(use_gpu)
setup_gpu(use_gpu, silent=silent)
data_path = util.ensure_path(data_path)
output_path = util.ensure_path(output)
displacy_path = util.ensure_path(displacy_path)
@ -74,41 +83,86 @@ def evaluate(
corpus = Corpus(data_path, gold_preproc=gold_preproc)
nlp = util.load_model(model)
dev_dataset = list(corpus(nlp))
scores = nlp.evaluate(dev_dataset)
metrics = {
"TOK": "token_acc",
"TAG": "tag_acc",
"POS": "pos_acc",
"MORPH": "morph_acc",
"LEMMA": "lemma_acc",
"UAS": "dep_uas",
"LAS": "dep_las",
"NER P": "ents_p",
"NER R": "ents_r",
"NER F": "ents_f",
"TEXTCAT": "cats_score",
"SENT P": "sents_p",
"SENT R": "sents_r",
"SENT F": "sents_f",
"SPEED": "speed",
}
results = {}
data = {}
for metric, key in metrics.items():
if key in scores:
if key == "cats_score":
metric = metric + " (" + scores.get("cats_score_desc", "unk") + ")"
if isinstance(scores[key], (int, float)):
if key == "speed":
results[metric] = f"{scores[key]:.0f}"
scores = nlp.evaluate(dev_dataset, per_component=per_component)
if per_component:
data = scores
if output is None:
msg.warn(
"The per-component option is enabled but there is no output JSON file provided to save the scores to."
)
else:
msg.info("Per-component scores will be saved to output JSON file.")
else:
metrics = {
"TOK": "token_acc",
"TAG": "tag_acc",
"POS": "pos_acc",
"MORPH": "morph_acc",
"LEMMA": "lemma_acc",
"UAS": "dep_uas",
"LAS": "dep_las",
"NER P": "ents_p",
"NER R": "ents_r",
"NER F": "ents_f",
"TEXTCAT": "cats_score",
"SENT P": "sents_p",
"SENT R": "sents_r",
"SENT F": "sents_f",
"SPAN P": f"spans_{spans_key}_p",
"SPAN R": f"spans_{spans_key}_r",
"SPAN F": f"spans_{spans_key}_f",
"SPEED": "speed",
}
results = {}
data = {}
for metric, key in metrics.items():
if key in scores:
if key == "cats_score":
metric = metric + " (" + scores.get("cats_score_desc", "unk") + ")"
if isinstance(scores[key], (int, float)):
if key == "speed":
results[metric] = f"{scores[key]:.0f}"
else:
results[metric] = f"{scores[key]*100:.2f}"
else:
results[metric] = f"{scores[key]*100:.2f}"
else:
results[metric] = "-"
data[re.sub(r"[\s/]", "_", key.lower())] = scores[key]
results[metric] = "-"
data[re.sub(r"[\s/]", "_", key.lower())] = scores[key]
msg.table(results, title="Results")
msg.table(results, title="Results")
data = handle_scores_per_type(scores, data, spans_key=spans_key, silent=silent)
if displacy_path:
factory_names = [nlp.get_pipe_meta(pipe).factory for pipe in nlp.pipe_names]
docs = list(nlp.pipe(ex.reference.text for ex in dev_dataset[:displacy_limit]))
render_deps = "parser" in factory_names
render_ents = "ner" in factory_names
render_spans = "spancat" in factory_names
render_parses(
docs,
displacy_path,
model_name=model,
limit=displacy_limit,
deps=render_deps,
ents=render_ents,
spans=render_spans,
)
msg.good(f"Generated {displacy_limit} parses as HTML", displacy_path)
if output_path is not None:
srsly.write_json(output_path, data)
msg.good(f"Saved results to {output_path}")
return data
def handle_scores_per_type(
scores: Dict[str, Any],
data: Dict[str, Any] = {},
*,
spans_key: str = "sc",
silent: bool = False,
) -> Dict[str, Any]:
msg = Printer(no_print=silent, pretty=not silent)
if "morph_per_feat" in scores:
if scores["morph_per_feat"]:
print_prf_per_type(msg, scores["morph_per_feat"], "MORPH", "feat")
@ -121,6 +175,12 @@ def evaluate(
if scores["ents_per_type"]:
print_prf_per_type(msg, scores["ents_per_type"], "NER", "type")
data["ents_per_type"] = scores["ents_per_type"]
if f"spans_{spans_key}_per_type" in scores:
if scores[f"spans_{spans_key}_per_type"]:
print_prf_per_type(
msg, scores[f"spans_{spans_key}_per_type"], "SPANS", "type"
)
data[f"spans_{spans_key}_per_type"] = scores[f"spans_{spans_key}_per_type"]
if "cats_f_per_type" in scores:
if scores["cats_f_per_type"]:
print_prf_per_type(msg, scores["cats_f_per_type"], "Textcat F", "label")
@ -129,26 +189,7 @@ def evaluate(
if scores["cats_auc_per_type"]:
print_textcats_auc_per_cat(msg, scores["cats_auc_per_type"])
data["cats_auc_per_type"] = scores["cats_auc_per_type"]
if displacy_path:
factory_names = [nlp.get_pipe_meta(pipe).factory for pipe in nlp.pipe_names]
docs = list(nlp.pipe(ex.reference.text for ex in dev_dataset[:displacy_limit]))
render_deps = "parser" in factory_names
render_ents = "ner" in factory_names
render_parses(
docs,
displacy_path,
model_name=model,
limit=displacy_limit,
deps=render_deps,
ents=render_ents,
)
msg.good(f"Generated {displacy_limit} parses as HTML", displacy_path)
if output_path is not None:
srsly.write_json(output_path, data)
msg.good(f"Saved results to {output_path}")
return data
return scores
def render_parses(
@ -158,6 +199,7 @@ def render_parses(
limit: int = 250,
deps: bool = True,
ents: bool = True,
spans: bool = True,
):
docs[0].user_data["title"] = model_name
if ents:
@ -171,6 +213,11 @@ def render_parses(
with (output_path / "parses.html").open("w", encoding="utf8") as file_:
file_.write(html)
if spans:
html = displacy.render(docs[:limit], style="span", page=True)
with (output_path / "spans.html").open("w", encoding="utf8") as file_:
file_.write(html)
def print_prf_per_type(
msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str

spacy/cli/find_function.py

@ -0,0 +1,69 @@
from typing import Optional, Tuple
from catalogue import RegistryError
from wasabi import msg
from ..util import registry
from ._util import Arg, Opt, app
@app.command("find-function")
def find_function_cli(
# fmt: off
func_name: str = Arg(..., help="Name of the registered function."),
registry_name: Optional[str] = Opt(None, "--registry", "-r", help="Name of the catalogue registry."),
# fmt: on
):
"""
Find the module, path and line number of the file the registered
function is defined in, if available.
func_name (str): Name of the registered function.
registry_name (Optional[str]): Name of the catalogue registry.
DOCS: https://spacy.io/api/cli#find-function
"""
if not registry_name:
registry_names = registry.get_registry_names()
for name in registry_names:
if registry.has(name, func_name):
registry_name = name
break
if not registry_name:
msg.fail(
f"Couldn't find registered function: '{func_name}'",
exits=1,
)
assert registry_name is not None
find_function(func_name, registry_name)
def find_function(func_name: str, registry_name: str) -> Tuple[str, int]:
registry_desc = None
try:
registry_desc = registry.find(registry_name, func_name)
except RegistryError as e:
msg.fail(
f"Couldn't find registered function: '{func_name}' in registry '{registry_name}'",
)
msg.fail(f"{e}", exits=1)
assert registry_desc is not None
registry_path = None
line_no = None
if registry_desc["file"]:
registry_path = registry_desc["file"]
line_no = registry_desc["line_no"]
if not registry_path or not line_no:
msg.fail(
f"Couldn't find path to registered function: '{func_name}' in registry '{registry_name}'",
exits=1,
)
assert registry_path is not None
assert line_no is not None
msg.good(f"Found registered function '{func_name}' at {registry_path}:{line_no}")
return str(registry_path), int(line_no)
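For example, the built-in tokenizer factory can be located like this (a sketch, assuming the module is importable as spacy.cli.find_function):

from spacy.cli.find_function import find_function

path, line_no = find_function("spacy.Tokenizer.v1", registry_name="tokenizers")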

spacy/cli/find_threshold.py

@ -0,0 +1,233 @@
import functools
import logging
import operator
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
import numpy
import wasabi.tables
from .. import util
from ..errors import Errors
from ..pipeline import MultiLabel_TextCategorizer, TextCategorizer
from ..training import Corpus
from ._util import Arg, Opt, app, import_code, setup_gpu
_DEFAULTS = {
"n_trials": 11,
"use_gpu": -1,
"gold_preproc": False,
}
@app.command(
"find-threshold",
context_settings={"allow_extra_args": False, "ignore_unknown_options": True},
)
def find_threshold_cli(
# fmt: off
model: str = Arg(..., help="Model name or path"),
data_path: Path = Arg(..., help="Location of binary evaluation data in .spacy format", exists=True),
pipe_name: str = Arg(..., help="Name of pipe to examine thresholds for"),
threshold_key: str = Arg(..., help="Key of threshold attribute in component's configuration"),
scores_key: str = Arg(..., help="Metric to optimize"),
n_trials: int = Opt(_DEFAULTS["n_trials"], "--n_trials", "-n", help="Number of trials to determine optimal thresholds"),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
use_gpu: int = Opt(_DEFAULTS["use_gpu"], "--gpu-id", "-g", help="GPU ID or -1 for CPU"),
gold_preproc: bool = Opt(_DEFAULTS["gold_preproc"], "--gold-preproc", "-G", help="Use gold preprocessing"),
verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
# fmt: on
):
"""
Runs prediction trials for a trained model with varying thresholds to maximize
the specified metric. The search space for the threshold is traversed linearly
from 0 to 1 in `n_trials` steps. Results are displayed in a table on `stdout`
(the corresponding API call to `spacy.cli.find_threshold.find_threshold()`
returns all results).
This is applicable only for components whose predictions are influenced by
thresholds - e.g. `textcat_multilabel` and `spancat`, but not `textcat`. Note
that the full path to the corresponding threshold attribute in the config has to
be provided.
DOCS: https://spacy.io/api/cli#find-threshold
"""
if verbose:
util.logger.setLevel(logging.DEBUG)
import_code(code_path)
find_threshold(
model=model,
data_path=data_path,
pipe_name=pipe_name,
threshold_key=threshold_key,
scores_key=scores_key,
n_trials=n_trials,
use_gpu=use_gpu,
gold_preproc=gold_preproc,
silent=False,
)
def find_threshold(
model: str,
data_path: Path,
pipe_name: str,
threshold_key: str,
scores_key: str,
*,
n_trials: int = _DEFAULTS["n_trials"], # type: ignore
use_gpu: int = _DEFAULTS["use_gpu"], # type: ignore
gold_preproc: bool = _DEFAULTS["gold_preproc"], # type: ignore
silent: bool = True,
) -> Tuple[float, float, Dict[float, float]]:
"""
Runs prediction trials for models with varying thresholds to maximize the specified metric.
model (Union[str, Path]): Pipeline to evaluate. Can be a package or a path to a data directory.
data_path (Path): Path to file with DocBin with docs to use for threshold search.
pipe_name (str): Name of pipe to examine thresholds for.
threshold_key (str): Key of threshold attribute in component's configuration.
scores_key (str): Name of the score metric to optimize.
n_trials (int): Number of trials to determine optimal thresholds.
use_gpu (int): GPU ID or -1 for CPU.
gold_preproc (bool): Whether to use gold preprocessing. Gold preprocessing helps the annotations align to the
tokenization, and may result in sequences of more consistent length. However, it may reduce runtime accuracy due
to train/test skew.
silent (bool): Whether to print non-error-related output to stdout.
RETURNS (Tuple[float, float, Dict[float, float]]): Best found threshold, the corresponding score, scores for all
evaluated thresholds.
"""
setup_gpu(use_gpu, silent=silent)
data_path = util.ensure_path(data_path)
if not data_path.exists():
wasabi.msg.fail("Evaluation data not found", data_path, exits=1)
nlp = util.load_model(model)
if pipe_name not in nlp.component_names:
raise AttributeError(
Errors.E001.format(name=pipe_name, opts=nlp.component_names)
)
pipe = nlp.get_pipe(pipe_name)
if not hasattr(pipe, "scorer"):
raise AttributeError(Errors.E1045)
if type(pipe) == TextCategorizer:
wasabi.msg.warn(
"The `textcat` component doesn't use a threshold as it's not applicable to the concept of "
"exclusive classes. All thresholds will yield the same results."
)
if not silent:
wasabi.msg.info(
title=f"Optimizing for {scores_key} for component '{pipe_name}' with {n_trials} "
f"trials."
)
# Load evaluation corpus.
corpus = Corpus(data_path, gold_preproc=gold_preproc)
dev_dataset = list(corpus(nlp))
config_keys = threshold_key.split(".")
def set_nested_item(
config: Dict[str, Any], keys: List[str], value: float
) -> Dict[str, Any]:
"""Set item in nested dictionary. Adapted from https://stackoverflow.com/a/54138200.
config (Dict[str, Any]): Configuration dictionary.
keys (List[Any]): Path to value to set.
value (float): Value to set.
RETURNS (Dict[str, Any]): Updated dictionary.
"""
functools.reduce(operator.getitem, keys[:-1], config)[keys[-1]] = value
return config
def filter_config(
config: Dict[str, Any], keys: List[str], full_key: str
) -> Dict[str, Any]:
"""Filters provided config dictionary so that only the specified keys path remains.
config (Dict[str, Any]): Configuration dictionary.
keys (List[Any]): Path to value to set.
full_key (str): Full user-specified key.
RETURNS (Dict[str, Any]): Filtered dictionary.
"""
if keys[0] not in config:
wasabi.msg.fail(
title=f"Failed to look up `{full_key}` in config: sub-key {[keys[0]]} not found.",
text=f"Make sure you specified {[keys[0]]} correctly. The following sub-keys are available instead: "
f"{list(config.keys())}",
exits=1,
)
return {
keys[0]: filter_config(config[keys[0]], keys[1:], full_key)
if len(keys) > 1
else config[keys[0]]
}
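# Toy illustration: for cfg = {"components": {"spancat": {"threshold": 0.5,
# "max_positive": 1}}},
# filter_config(cfg, ["components", "spancat", "threshold"], full_key)
# returns {"components": {"spancat": {"threshold": 0.5}}}, while
# set_nested_item(cfg, ["components", "spancat", "threshold"], 0.7)
# overwrites the value in place and returns cfg.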
# Evaluate with varying threshold values.
scores: Dict[float, float] = {}
config_keys_full = ["components", pipe_name, *config_keys]
table_col_widths = (10, 10)
thresholds = numpy.linspace(0, 1, n_trials)
print(wasabi.tables.row(["Threshold", f"{scores_key}"], widths=table_col_widths))
for threshold in thresholds:
# Reload pipeline with overrides specifying the new threshold.
nlp = util.load_model(
model,
config=set_nested_item(
filter_config(
nlp.config, config_keys_full, ".".join(config_keys_full)
).copy(),
config_keys_full,
threshold,
),
)
if hasattr(pipe, "cfg"):
setattr(
nlp.get_pipe(pipe_name),
"cfg",
set_nested_item(getattr(pipe, "cfg"), config_keys, threshold),
)
eval_scores = nlp.evaluate(dev_dataset)
if scores_key not in eval_scores:
wasabi.msg.fail(
title=f"Failed to look up score `{scores_key}` in evaluation results.",
text=f"Make sure you specified the correct value for `scores_key`. The following scores are "
f"available: {list(eval_scores.keys())}",
exits=1,
)
scores[threshold] = eval_scores[scores_key]
if not isinstance(scores[threshold], (float, int)):
wasabi.msg.fail(
f"Returned score for key '{scores_key}' is not numeric. Threshold optimization only works for numeric "
f"scores.",
exits=1,
)
print(
wasabi.row(
[round(threshold, 3), round(scores[threshold], 3)],
widths=table_col_widths,
)
)
best_threshold = max(scores.keys(), key=(lambda key: scores[key]))
# If all scores are identical, emit warning.
if len(set(scores.values())) == 1:
wasabi.msg.warn(
title="All scores are identical. Verify that all settings are correct.",
text=""
if (
not isinstance(pipe, MultiLabel_TextCategorizer)
or scores_key in ("cats_macro_f", "cats_micro_f")
)
else "Use `cats_macro_f` or `cats_micro_f` when optimizing the threshold for `textcat_multilabel`.",
)
else:
if not silent:
print(
f"\nBest threshold: {round(best_threshold, ndigits=4)} with {scores_key} value of {scores[best_threshold]}."
)
return best_threshold, scores[best_threshold], scores
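A usage sketch (model path, data path and metric are assumptions):

from pathlib import Path

from spacy.cli.find_threshold import find_threshold

best_threshold, best_score, all_scores = find_threshold(
    model="training/model-best",
    data_path=Path("corpus/dev.spacy"),
    pipe_name="textcat_multilabel",
    threshold_key="threshold",
    scores_key="cats_macro_f",
    n_trials=11,
)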

spacy/cli/info.py

@ -1,12 +1,15 @@
from typing import Optional, Dict, Any, Union, List
import json
import platform
from pathlib import Path
from wasabi import Printer, MarkdownRenderer
import srsly
from typing import Any, Dict, List, Optional, Union
from ._util import app, Arg, Opt, string_to_list
from .. import util
from .. import about
import srsly
from wasabi import MarkdownRenderer, Printer
from .. import about, util
from ..compat import importlib_metadata
from ._util import Arg, Opt, app, string_to_list
from .download import get_latest_version, get_model_filename
@app.command("info")
@ -15,7 +18,8 @@ def info_cli(
model: Optional[str] = Arg(None, help="Optional loadable spaCy pipeline"),
markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"),
silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"),
exclude: Optional[str] = Opt("labels", "--exclude", "-e", help="Comma-separated keys to exclude from the print-out"),
exclude: str = Opt("labels", "--exclude", "-e", help="Comma-separated keys to exclude from the print-out"),
url: bool = Opt(False, "--url", "-u", help="Print the URL to download the most recent compatible version of the pipeline"),
# fmt: on
):
"""
@ -23,10 +27,19 @@ def info_cli(
print its meta information. Flag --markdown prints details in Markdown for easy
copy-pasting to GitHub issues.
Flag --url prints only the download URL of the most recent compatible
version of the pipeline.
DOCS: https://spacy.io/api/cli#info
"""
exclude = string_to_list(exclude)
info(model, markdown=markdown, silent=silent, exclude=exclude)
info(
model,
markdown=markdown,
silent=silent,
exclude=exclude,
url=url,
)
def info(
@ -35,11 +48,20 @@ def info(
markdown: bool = False,
silent: bool = True,
exclude: Optional[List[str]] = None,
url: bool = False,
) -> Union[str, dict]:
msg = Printer(no_print=silent, pretty=not silent)
if not exclude:
exclude = []
if model:
if url:
if model is not None:
title = f"Download info for pipeline '{model}'"
data = info_model_url(model)
print(data["download_url"])
return data
else:
msg.fail("--url option requires a pipeline name", exits=1)
elif model:
title = f"Info about pipeline '{model}'"
data = info_model(model, silent=silent)
else:
@ -61,7 +83,7 @@ def info(
return raw_data
def info_spacy() -> Dict[str, any]:
def info_spacy() -> Dict[str, Any]:
"""Generate info about the current spaCy intallation.
RETURNS (dict): The spaCy info.
@ -99,11 +121,43 @@ def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
meta["source"] = str(model_path.resolve())
else:
meta["source"] = str(model_path)
download_url = info_installed_model_url(model)
if download_url:
meta["download_url"] = download_url
return {
k: v for k, v in meta.items() if k not in ("accuracy", "performance", "speed")
}
def info_installed_model_url(model: str) -> Optional[str]:
"""Given a pipeline name, get the download URL if available, otherwise
return None.
This is only available for pipelines installed as modules that have
dist-info available.
"""
try:
dist = importlib_metadata.distribution(model)
text = dist.read_text("direct_url.json")
if isinstance(text, str):
data = json.loads(text)
return data["url"]
except Exception:
pass
return None
def info_model_url(model: str) -> Dict[str, Any]:
"""Return the download URL for the latest version of a pipeline."""
version = get_latest_version(model)
filename = get_model_filename(model, version)
download_url = about.__download_url__ + "/" + filename
release_tpl = "https://github.com/explosion/spacy-models/releases/tag/{m}-{v}"
release_url = release_tpl.format(m=model, v=version)
return {"download_url": download_url, "release_url": release_url}
def get_markdown(
data: Dict[str, Any],
title: Optional[str] = None,

spacy/cli/init_config.py

@ -1,18 +1,26 @@
from typing import Optional, List, Tuple
import re
from enum import Enum
from pathlib import Path
from wasabi import Printer, diff_strings
from thinc.api import Config
from typing import List, Optional, Tuple
import srsly
import re
from jinja2 import Template
from thinc.api import Config
from wasabi import Printer, diff_strings
from .. import util
from ..language import DEFAULT_CONFIG_PRETRAIN_PATH
from ..schemas import RecommendationSchema
from ._util import init_cli, Arg, Opt, show_validation_error, COMMAND
from ._util import string_to_list, import_code
from ..util import SimpleFrozenList
from ._util import (
COMMAND,
Arg,
Opt,
import_code,
init_cli,
show_validation_error,
string_to_list,
)
ROOT = Path(__file__).parent / "templates"
TEMPLATE_PATH = ROOT / "quickstart_training.jinja"
@ -24,28 +32,40 @@ class Optimizations(str, Enum):
accuracy = "accuracy"
class InitValues:
"""
Default values for initialization. Dedicated class to allow synchronized default values for init_config_cli() and
init_config(), i.e. initialization calls via the CLI and Python, respectively.
"""
lang = "en"
pipeline = SimpleFrozenList(["tagger", "parser", "ner"])
optimize = Optimizations.efficiency
gpu = False
pretraining = False
force_overwrite = False
@init_cli.command("config")
def init_config_cli(
# fmt: off
output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
lang: Optional[str] = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"),
pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"),
optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
gpu: bool = Opt(False, "--gpu", "-G", help="Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters."),
pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
force_overwrite: bool = Opt(False, "--force", "-F", help="Force overwriting the output file"),
output_file: Path = Arg(..., help="File to save the config to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
lang: str = Opt(InitValues.lang, "--lang", "-l", help="Two-letter code of the language to use"),
pipeline: str = Opt(",".join(InitValues.pipeline), "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"),
optimize: Optimizations = Opt(InitValues.optimize, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
gpu: bool = Opt(InitValues.gpu, "--gpu", "-G", help="Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters."),
pretraining: bool = Opt(InitValues.pretraining, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
force_overwrite: bool = Opt(InitValues.force_overwrite, "--force", "-F", help="Force overwriting the output file"),
# fmt: on
):
"""
Generate a starter config.cfg for training. Based on your requirements
Generate a starter config file for training. Based on your requirements
specified via the CLI arguments, this command generates a config with the
optimal settings for your use case. This includes the choice of architecture,
pretrained weights and related hyperparameters.
DOCS: https://spacy.io/api/cli#init-config
"""
if isinstance(optimize, Optimizations): # instance of enum from the CLI
optimize = optimize.value
pipeline = string_to_list(pipeline)
is_stdout = str(output_file) == "-"
if not is_stdout and output_file.exists() and not force_overwrite:
@ -57,7 +77,7 @@ def init_config_cli(
config = init_config(
lang=lang,
pipeline=pipeline,
optimize=optimize,
optimize=optimize.value,
gpu=gpu,
pretraining=pretraining,
silent=is_stdout,
@ -68,15 +88,15 @@ def init_config_cli(
@init_cli.command("fill-config")
def init_fill_config_cli(
# fmt: off
base_path: Path = Arg(..., help="Base config to fill", exists=True, dir_okay=False),
output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True),
base_path: Path = Arg(..., help="Path to base config to fill", exists=True, dir_okay=False),
output_file: Path = Arg("-", help="Path to output .cfg file (or - for stdout)", allow_dash=True),
pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
diff: bool = Opt(False, "--diff", "-D", help="Print a visual diff highlighting the changes"),
code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
# fmt: on
):
"""
Fill partial config.cfg with default values. Will add all missing settings
Fill partial config file with default values. Will add all missing settings
from the default config and will create all objects, check the registered
functions for their default values and update the base config. This command
can be used with a config generated via the training quickstart widget:
@ -135,11 +155,11 @@ def fill_config(
def init_config(
*,
lang: str,
pipeline: List[str],
optimize: str,
gpu: bool,
pretraining: bool = False,
lang: str = InitValues.lang,
pipeline: List[str] = InitValues.pipeline,
optimize: str = InitValues.optimize,
gpu: bool = InitValues.gpu,
pretraining: bool = InitValues.pretraining,
silent: bool = True,
) -> Config:
msg = Printer(no_print=silent)
@ -175,8 +195,8 @@ def init_config(
"Pipeline": ", ".join(pipeline),
"Optimize for": optimize,
"Hardware": variables["hardware"].upper(),
"Transformer": template_vars.transformer.get("name")
if template_vars.use_transformer
"Transformer": template_vars.transformer.get("name") # type: ignore[attr-defined]
if template_vars.use_transformer # type: ignore[attr-defined]
else None,
}
msg.info("Generated config template specific for your use case")


@ -1,15 +1,23 @@
import logging
from pathlib import Path
from typing import Optional

import srsly
import typer
from wasabi import msg

from .. import util
from ..language import Language
from ..training.initialize import convert_vectors, init_nlp
from ._util import (
Arg,
Opt,
import_code,
init_cli,
parse_config_overrides,
setup_gpu,
show_validation_error,
)
@init_cli.command("vectors")
@ -20,21 +28,32 @@ def init_vectors_cli(
output_dir: Path = Arg(..., help="Pipeline output directory"),
prune: int = Opt(-1, "--prune", "-p", help="Optional number of vectors to prune to"),
truncate: int = Opt(0, "--truncate", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
mode: str = Opt("default", "--mode", "-m", help="Vectors mode: default or floret"),
name: Optional[str] = Opt(None, "--name", "-n", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
jsonl_loc: Optional[Path] = Opt(None, "--lexemes-jsonl", "-j", help="Location of JSONL-formatted attributes file", hidden=True),
attr: str = Opt("ORTH", "--attr", "-a", help="Optional token attribute to use for vectors, e.g. LOWER or NORM"),
# fmt: on
):
"""Convert word vectors for use with spaCy. Will export an nlp object that
you can use in the [initialize] block of your config to initialize
a model with vectors.
"""
if verbose:
util.logger.setLevel(logging.DEBUG)
msg.info(f"Creating blank nlp object for language '{lang}'")
nlp = util.get_lang_class(lang)()
if jsonl_loc is not None:
update_lexemes(nlp, jsonl_loc)
convert_vectors(nlp, vectors_loc, truncate=truncate, prune=prune, name=name)
convert_vectors(
nlp,
vectors_loc,
truncate=truncate,
prune=prune,
name=name,
mode=mode,
attr=attr,
)
msg.good(f"Successfully converted {len(nlp.vocab.vectors)} vectors")
nlp.to_disk(output_dir)
msg.good(
@ -69,7 +88,8 @@ def init_pipeline_cli(
use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU")
# fmt: on
):
if verbose:
util.logger.setLevel(logging.DEBUG)
overrides = parse_config_overrides(ctx.args)
import_code(code_path)
setup_gpu(use_gpu)
@ -98,7 +118,8 @@ def init_labels_cli(
"""Generate JSON files for the labels in the data. This helps speed up the
training process, since spaCy won't have to preprocess the data to
extract the labels."""
if verbose:
util.logger.setLevel(logging.DEBUG)
if not output_path.exists():
output_path.mkdir(parents=True)
overrides = parse_config_overrides(ctx.args)
@ -108,6 +129,10 @@ def init_labels_cli(
config = util.load_config(config_path, overrides=overrides)
with show_validation_error(hint_fill=False):
nlp = init_nlp(config, use_gpu=use_gpu)
_init_labels(nlp, output_path)
def _init_labels(nlp, output_path):
for name, component in nlp.pipeline:
if getattr(component, "label_data", None) is not None:
output_file = output_path / f"{name}.json"


@ -1,14 +1,21 @@
import os
import re
import shutil
import subprocess
import sys
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union, cast

import srsly
from catalogue import RegistryError
from thinc.api import Config
from wasabi import MarkdownRenderer, Printer, get_raw_input

from .. import about, util
from ..compat import importlib_metadata
from ..schemas import ModelMetaSchema, validate
from ._util import SDIST_SUFFIX, WHEEL_SUFFIX, Arg, Opt, app, string_to_list
@app.command("package")
@ -23,6 +30,7 @@ def package_cli(
version: Optional[str] = Opt(None, "--version", "-v", help="Package version to override meta"),
build: str = Opt("sdist", "--build", "-b", help="Comma-separated formats to build: sdist and/or wheel, or none."),
force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing data in output directory"),
require_parent: bool = Opt(True, "--require-parent/--no-require-parent", "-R", "-R", help="Include the parent package (e.g. spacy) in the requirements"),
# fmt: on
):
"""
@ -31,7 +39,7 @@ def package_cli(
specified output directory, and the data will be copied over. If
--create-meta is set and a meta.json already exists in the output directory,
the existing values will be used as the defaults in the command-line prompt.
After packaging, "python setup.py sdist" is run in the package directory,
After packaging, "python -m build --sdist" is run in the package directory,
which will create a .tar.gz archive that can be installed via "pip install".
If additional code files are provided (e.g. Python files containing custom
@ -53,6 +61,7 @@ def package_cli(
create_sdist=create_sdist,
create_wheel=create_wheel,
force=force,
require_parent=require_parent,
silent=False,
)
@ -67,6 +76,7 @@ def package(
create_meta: bool = False,
create_sdist: bool = True,
create_wheel: bool = False,
require_parent: bool = False,
force: bool = False,
silent: bool = True,
) -> None:
@ -74,9 +84,17 @@ def package(
input_path = util.ensure_path(input_dir)
output_path = util.ensure_path(output_dir)
meta_path = util.ensure_path(meta_path)
if create_wheel and not has_wheel() and not has_build():
err = (
"Generating wheels requires 'build' or 'wheel' (deprecated) to be installed"
)
msg.fail(err, "pip install build", exits=1)
if not has_build():
msg.warn(
"Generating packages without the 'build' package is deprecated and "
"will not be supported in the future. To install 'build': pip "
"install build"
)
if not input_path or not input_path.exists():
msg.fail("Can't locate pipeline data", input_path, exits=1)
if not output_path or not output_path.exists():
@ -98,8 +116,32 @@ def package(
if not meta_path.exists() or not meta_path.is_file():
msg.fail("Can't load pipeline meta.json", meta_path, exits=1)
meta = srsly.read_json(meta_path)
meta = get_meta(input_dir, meta, require_parent=require_parent)
if meta["requirements"]:
msg.good(
f"Including {len(meta['requirements'])} package requirement(s) from "
f"meta and config",
", ".join(meta["requirements"]),
)
if name is not None:
if not name.isidentifier():
msg.fail(
f"Model name ('{name}') is not a valid module name. "
"This is required so it can be imported as a module.",
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
"and 0-9. "
"For specific details see: https://docs.python.org/3/reference/lexical_analysis.html#identifiers",
exits=1,
)
if not _is_permitted_package_name(name):
msg.fail(
f"Model name ('{name}') is not a permitted package name. "
"This is required to correctly load the model with spacy.load.",
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
"and 0-9. "
"For specific details see: https://www.python.org/dev/peps/pep-0426/#name",
exits=1,
)
meta["name"] = name
if version is not None:
meta["version"] = version
@ -112,7 +154,9 @@ def package(
msg.fail("Invalid pipeline meta.json")
print("\n".join(errors))
sys.exit(1)
model_name = meta["lang"] + "_" + meta["name"]
model_name = meta["name"]
if not model_name.startswith(meta["lang"] + "_"):
model_name = f"{meta['lang']}_{model_name}"
model_name_v = model_name + "-" + meta["version"]
main_path = output_dir / model_name_v
package_path = main_path / model_name
@ -128,31 +172,72 @@ def package(
)
Path.mkdir(package_path, parents=True)
shutil.copytree(str(input_dir), str(package_path / model_name_v))
license_path = package_path / model_name_v / "LICENSE"
if license_path.exists():
shutil.move(str(license_path), str(main_path))
for file_name in FILENAMES_DOCS:
file_path = package_path / model_name_v / file_name
if file_path.exists():
shutil.copy(str(file_path), str(main_path))
readme_path = main_path / "README.md"
if not readme_path.exists():
readme = generate_readme(meta)
create_file(readme_path, readme)
create_file(package_path / model_name_v / "README.md", readme)
msg.good("Generated README.md from meta.json")
else:
msg.info("Using existing README.md from pipeline directory")
imports = []
for code_path in code_paths:
imports.append(code_path.stem)
shutil.copy(str(code_path), str(package_path))
create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2))
create_file(main_path / "setup.py", TEMPLATE_SETUP)
create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST)
init_py = TEMPLATE_INIT.format(
imports="\n".join(f"from . import {m}" for m in imports)
)
create_file(package_path / "__init__.py", init_py)
msg.good(f"Successfully created package '{model_name_v}'", main_path)
msg.good(f"Successfully created package directory '{model_name_v}'", main_path)
if create_sdist:
with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
# run directly, since util.run_command is not designed to continue
# after a command fails
ret = subprocess.run(
[sys.executable, "-m", "build", ".", "--sdist"],
env=os.environ.copy(),
)
if ret.returncode != 0:
msg.warn(
"Creating sdist with 'python -m build' failed. Falling "
"back to deprecated use of 'python setup.py sdist'"
)
util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
zip_file = main_path / "dist" / f"{model_name_v}{SDIST_SUFFIX}"
msg.good(f"Successfully created zipped Python package", zip_file)
if create_wheel:
with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False)
wheel = main_path / "dist" / f"{model_name_v}{WHEEL_SUFFIX}"
# run directly, since util.run_command is not designed to continue
# after a command fails
ret = subprocess.run(
[sys.executable, "-m", "build", ".", "--wheel"],
env=os.environ.copy(),
)
if ret.returncode != 0:
msg.warn(
"Creating wheel with 'python -m build' failed. Falling "
"back to deprecated use of 'wheel' with "
"'python setup.py bdist_wheel'"
)
util.run_command(
[sys.executable, "setup.py", "bdist_wheel"], capture=False
)
wheel_name_squashed = re.sub("_+", "_", model_name_v)
wheel = main_path / "dist" / f"{wheel_name_squashed}{WHEEL_SUFFIX}"
msg.good(f"Successfully created binary wheel", wheel)
if "__" in model_name:
msg.warn(
f"Model name ('{model_name}') contains a run of underscores. "
"Runs of underscores are not significant in installed package names.",
)
def has_wheel() -> bool:
@ -164,6 +249,77 @@ def has_wheel() -> bool:
return False
def has_build() -> bool:
# it's very likely that there is a local directory named build/ (especially
# in an editable install), so an import check is not sufficient; instead
# check that there is a package version
try:
importlib_metadata.version("build")
return True
except importlib_metadata.PackageNotFoundError: # type: ignore[attr-defined]
return False
def get_third_party_dependencies(
config: Config, exclude: List[str] = util.SimpleFrozenList()
) -> List[str]:
"""If the config includes references to registered functions that are
provided by third-party packages (spacy-transformers, other libraries), we
want to include them in meta["requirements"] so that the package specifies
them as dependencies and the user won't have to do it manually.
We do this by:
- traversing the config to check for registered function (@ keys)
- looking up the functions and getting their module
- looking up the module version and generating an appropriate version range
config (Config): The pipeline config.
exclude (list): List of packages to exclude (e.g. that already exist in meta).
RETURNS (list): The versioned requirements.
"""
own_packages = ("spacy", "spacy-legacy", "spacy-nightly", "thinc", "srsly")
distributions = util.packages_distributions()
funcs = defaultdict(set)
# We only want to look at runtime-relevant sections, not [training] or [initialize]
for section in ("nlp", "components"):
for path, value in util.walk_dict(config[section]):
if path[-1].startswith("@"): # collect all function references by registry
funcs[path[-1][1:]].add(value)
for component in config.get("components", {}).values():
if "factory" in component:
funcs["factories"].add(component["factory"])
modules = set()
lang = config["nlp"]["lang"]
for reg_name, func_names in funcs.items():
for func_name in func_names:
# Try the lang-specific version and fall back
try:
func_info = util.registry.find(reg_name, lang + "." + func_name)
except RegistryError:
try:
func_info = util.registry.find(reg_name, func_name)
except RegistryError as regerr:
# lang-specific version being absent is not actually an issue
raise regerr from None
module_name = func_info.get("module") # type: ignore[attr-defined]
if module_name: # the code is part of a module, not a --code file
modules.add(func_info["module"].split(".")[0]) # type: ignore[union-attr]
dependencies = []
for module_name in modules:
if module_name == about.__title__:
continue
if module_name in distributions:
dist = distributions.get(module_name)
if dist:
pkg = dist[0]
if pkg in own_packages or pkg in exclude:
continue
version = util.get_package_version(pkg)
version_range = util.get_minor_version_range(version) # type: ignore[arg-type]
dependencies.append(f"{pkg}{version_range}")
return dependencies
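The traversal at the heart of this helper can be shown in isolation. A small sketch over an in-memory config (the architecture name is just an example of a registered function reference):

from collections import defaultdict

from thinc.api import Config

from spacy import util

cfg = Config().from_str("""
[components]

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
""")

funcs = defaultdict(set)
for path, value in util.walk_dict(cfg["components"]):
    if path[-1].startswith("@"):  # registry references look like "@architectures"
        funcs[path[-1][1:]].add(value)
# funcs == {"architectures": {"spacy.Tagger.v2"}}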
def get_build_formats(formats: List[str]) -> Tuple[bool, bool]:
supported = ["sdist", "wheel", "none"]
for form in formats:
@ -182,9 +338,11 @@ def create_file(file_path: Path, contents: str) -> None:
def get_meta(
model_path: Union[str, Path],
existing_meta: Dict[str, Any],
require_parent: bool = False,
) -> Dict[str, Any]:
meta: Dict[str, Any] = {
"lang": "en",
"name": "pipeline",
"version": "0.0.0",
@ -194,9 +352,10 @@ def get_meta(
"url": "",
"license": "MIT",
}
nlp = util.load_model_from_path(Path(model_path))
meta.update(nlp.meta)
meta["spacy_version"] = util.get_minor_version_range(about.__version__)
meta.update(existing_meta)
meta["vectors"] = {
"width": nlp.vocab.vectors_length,
"vectors": len(nlp.vocab.vectors),
@ -205,6 +364,13 @@ def get_meta(
}
if about.__title__ != "spacy":
meta["parent_package"] = about.__title__
meta.setdefault("requirements", [])
# Update the requirements with all third-party packages in the config
existing_reqs = [util.split_requirement(req)[0] for req in meta["requirements"]]
reqs = get_third_party_dependencies(nlp.config, exclude=existing_reqs)
meta["requirements"].extend(reqs)
if require_parent and about.__title__ not in meta["requirements"]:
meta["requirements"].append(about.__title__ + meta["spacy_version"])
return meta
@ -231,6 +397,121 @@ def generate_meta(existing_meta: Dict[str, Any], msg: Printer) -> Dict[str, Any]
return meta
def generate_readme(meta: Dict[str, Any]) -> str:
"""
Generate a Markdown-formatted README text from a model meta.json. Used
within the GitHub release notes and as content for README.md file added
to model packages.
"""
md = MarkdownRenderer()
lang = meta["lang"]
name = f"{lang}_{meta['name']}"
version = meta["version"]
pipeline = ", ".join([md.code(p) for p in meta.get("pipeline", [])])
components = ", ".join([md.code(p) for p in meta.get("components", [])])
vecs = meta.get("vectors", {})
vectors = f"{vecs.get('keys', 0)} keys, {vecs.get('vectors', 0)} unique vectors ({ vecs.get('width', 0)} dimensions)"
author = meta.get("author") or "n/a"
notes = meta.get("notes", "")
license_name = meta.get("license")
sources = _format_sources(meta.get("sources"))
description = meta.get("description")
label_scheme = _format_label_scheme(cast(Dict[str, Any], meta.get("labels")))
accuracy = _format_accuracy(cast(Dict[str, Any], meta.get("performance")))
table_data = [
(md.bold("Name"), md.code(name)),
(md.bold("Version"), md.code(version)),
(md.bold("spaCy"), md.code(meta["spacy_version"])),
(md.bold("Default Pipeline"), pipeline),
(md.bold("Components"), components),
(md.bold("Vectors"), vectors),
(md.bold("Sources"), sources or "n/a"),
(md.bold("License"), md.code(license_name) if license_name else "n/a"),
(md.bold("Author"), md.link(author, meta["url"]) if "url" in meta else author),
]
# Put together Markdown body
if description:
md.add(description)
md.add(md.table(table_data, ["Feature", "Description"]))
if label_scheme:
md.add(md.title(3, "Label Scheme"))
md.add(label_scheme)
if accuracy:
md.add(md.title(3, "Accuracy"))
md.add(accuracy)
if notes:
md.add(notes)
return md.text
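A quick way to see the output is to call it with a minimal meta dict (all field values below are made up):

meta = {
    "lang": "en",
    "name": "demo_pipeline",
    "version": "0.0.1",
    "spacy_version": ">=3.5.0,<3.6.0",
    "description": "A demo pipeline.",
    "pipeline": ["tok2vec", "tagger"],
}
print(generate_readme(meta))  # Markdown with a feature table built from the fields above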
def _format_sources(data: Any) -> str:
if not data or not isinstance(data, list):
return "n/a"
sources = []
for source in data:
if not isinstance(source, dict):
source = {"name": source}
name = source.get("name")
if not name:
continue
url = source.get("url")
author = source.get("author")
result = name if not url else "[{}]({})".format(name, url)
if author:
result += " ({})".format(author)
sources.append(result)
return "<br>".join(sources)
def _format_accuracy(data: Dict[str, Any], exclude: List[str] = ["speed"]) -> str:
if not data:
return ""
md = MarkdownRenderer()
scalars = [(k, v) for k, v in data.items() if isinstance(v, (int, float))]
scores = [
(md.code(acc.upper()), f"{score*100:.2f}")
for acc, score in scalars
if acc not in exclude
]
md.add(md.table(scores, ["Type", "Score"]))
return md.text
def _format_label_scheme(data: Dict[str, Any]) -> str:
if not data:
return ""
md = MarkdownRenderer()
n_labels = 0
n_pipes = 0
label_data = []
for pipe, labels in data.items():
if not labels:
continue
col1 = md.bold(md.code(pipe))
col2 = ", ".join(
[md.code(str(label).replace("|", "\\|")) for label in labels]
) # noqa: W605
label_data.append((col1, col2))
n_labels += len(labels)
n_pipes += 1
if not label_data:
return ""
label_info = f"View label scheme ({n_labels} labels for {n_pipes} components)"
md.add("<details>")
md.add(f"<summary>{label_info}</summary>")
md.add(md.table(label_data, ["Component", "Labels"]))
md.add("</details>")
return md.text
def _is_permitted_package_name(package_name: str) -> bool:
# regex from: https://www.python.org/dev/peps/pep-0426/#name
permitted_match = re.search(
r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", package_name, re.IGNORECASE
)
return permitted_match is not None
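For example, under this check:

assert _is_permitted_package_name("en_core_web_sm")      # letters, digits, separators
assert not _is_permitted_package_name("_private_model")  # must start alphanumeric
assert not _is_permitted_package_name("my model")        # no spaces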
TEMPLATE_SETUP = """
#!/usr/bin/env python
import io
@ -245,6 +526,13 @@ def load_meta(fp):
return json.load(f)
def load_readme(fp):
if path.exists(fp):
with io.open(fp, encoding='utf8') as f:
return f.read()
return ""
def list_files(data_dir):
output = []
for root, _, filenames in walk(data_dir):
@ -257,8 +545,11 @@ def list_files(data_dir):
def list_requirements(meta):
# Up to version 3.7, we included the parent package
# in requirements by default. This behaviour is removed
# in 3.8, with a setting to include the parent package in
# the requirements list in the meta if desired.
requirements = []
if 'setup_requires' in meta:
requirements += meta['setup_requires']
if 'requirements' in meta:
@ -270,6 +561,8 @@ def setup_package():
root = path.abspath(path.dirname(__file__))
meta_path = path.join(root, 'meta.json')
meta = load_meta(meta_path)
readme_path = path.join(root, 'README.md')
readme = load_readme(readme_path)
model_name = str(meta['lang'] + '_' + meta['name'])
model_dir = path.join(model_name, model_name + '-' + meta['version'])
@ -279,6 +572,7 @@ def setup_package():
setup(
name=model_name,
description=meta.get('description'),
long_description=readme,
author=meta.get('author'),
author_email=meta.get('email'),
url=meta.get('url'),
@ -294,12 +588,14 @@ def setup_package():
if __name__ == '__main__':
setup_package()
""".strip()
""".lstrip()
TEMPLATE_MANIFEST = """
include meta.json
include LICENSE
include LICENSES_SOURCES
include README.md
""".strip()
@ -314,4 +610,7 @@ __version__ = get_model_meta(Path(__file__).parent)['version']
def load(**overrides):
return load_model_from_init_py(__file__, **overrides)
""".strip()
""".lstrip()
FILENAMES_DOCS = ["LICENSE", "LICENSES_SOURCES", "README.md"]


@ -1,13 +1,21 @@
from typing import Optional
from pathlib import Path
from wasabi import msg
import typer
import re
from pathlib import Path
from typing import Optional
import typer
from wasabi import msg
from ._util import app, Arg, Opt, parse_config_overrides, show_validation_error
from ._util import import_code, setup_gpu
from ..training.pretrain import pretrain
from ..util import load_config
from ._util import (
Arg,
Opt,
app,
import_code,
parse_config_overrides,
setup_gpu,
show_validation_error,
)
@app.command(
@ -23,6 +31,7 @@ def pretrain_cli(
resume_path: Optional[Path] = Opt(None, "--resume-path", "-r", help="Path to pretrained weights from which to resume pretraining"),
epoch_resume: Optional[int] = Opt(None, "--epoch-resume", "-er", help="The epoch to resume counting from when using --resume-path. Prevents unintended overwriting of existing weight files."),
use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"),
skip_last: bool = Opt(False, "--skip-last", "-L", help="Skip saving model-last.bin"),
# fmt: on
):
"""
@ -61,7 +70,7 @@ def pretrain_cli(
# TODO: What's the solution here? How do we handle optional blocks?
msg.fail("The [pretraining] block in your config is empty", exits=1)
if not output_dir.exists():
output_dir.mkdir()
output_dir.mkdir(parents=True)
msg.good(f"Created output directory: {output_dir}")
# Save non-interpolated config
raw_config.to_disk(output_dir / "config.cfg")
@ -74,6 +83,7 @@ def pretrain_cli(
epoch_resume=epoch_resume,
use_gpu=use_gpu,
silent=False,
skip_last=skip_last,
)
msg.good("Successfully finished pretrain")

View File

@ -1,17 +1,18 @@
import cProfile
import itertools
import pstats
import sys
from pathlib import Path
from typing import Iterator, Optional, Sequence, Union

import srsly
import tqdm
import typer
from wasabi import Printer, msg

from ..language import Language
from ..util import load_model
from ._util import NAME, Arg, Opt, app, debug_cli
@debug_cli.command("profile")
@ -32,7 +33,7 @@ def profile_cli(
DOCS: https://spacy.io/api/cli#debug-profile
"""
if ctx.parent.command.name == NAME: # type: ignore[union-attr] # called as top-level command
msg.warn(
"The profile command is now available via the 'debug profile' "
"subcommand. You can run python -m spacy debug --help for an "
@ -42,9 +43,9 @@ def profile_cli(
def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) -> None:
if inputs is not None:
texts = _read_inputs(inputs, msg)
texts = list(itertools.islice(texts, n_texts))
if inputs is None:
try:
import ml_datasets
@ -56,16 +57,13 @@ def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) ->
exits=1,
)
with msg.loading("Loading IMDB dataset via ml_datasets..."):
imdb_train, _ = ml_datasets.imdb(train_limit=n_texts, dev_limit=0)
texts, _ = zip(*imdb_train)
msg.info(f"Loaded IMDB dataset and using {n_texts} examples")
with msg.loading(f"Loading pipeline '{model}'..."):
nlp = load_model(model)
msg.good(f"Loaded pipeline '{model}'")
cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(), "Profile.prof")
s = pstats.Stats("Profile.prof")
msg.divider("Profile stats")
@ -73,7 +71,7 @@ def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) ->
def parse_texts(nlp: Language, texts: Sequence[str]) -> None:
for doc in nlp.pipe(tqdm.tqdm(texts, disable=None), batch_size=16):
pass
@ -87,7 +85,7 @@ def _read_inputs(loc: Union[Path, str], msg: Printer) -> Iterator[str]:
if not input_path.exists() or not input_path.is_file():
msg.fail("Not a valid input data file", loc, exits=1)
msg.info(f"Using data from {input_path.parts[-1]}")
file_ = input_path.open() # type: ignore[assignment]
for line in file_:
data = srsly.json_loads(line)
text = data["text"]


@ -1,154 +1 @@
from typing import Optional
from pathlib import Path
from wasabi import msg
import re
import shutil
import requests
from ...util import ensure_path, working_dir
from .._util import project_cli, Arg, Opt, PROJECT_FILE, load_project_config
from .._util import get_checksum, download_file, git_checkout, get_git_version
@project_cli.command("assets")
def project_assets_cli(
# fmt: off
project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
sparse_checkout: bool = Opt(False, "--sparse", "-S", help="Use sparse checkout for assets provided via Git, to only check out and clone the files needed. Requires Git v22.2+.")
# fmt: on
):
"""Fetch project assets like datasets and pretrained weights. Assets are
defined in the "assets" section of the project.yml. If a checksum is
provided in the project.yml, the file is only downloaded if no local file
with the same checksum exists.
DOCS: https://spacy.io/api/cli#project-assets
"""
project_assets(project_dir, sparse_checkout=sparse_checkout)
def project_assets(project_dir: Path, *, sparse_checkout: bool = False) -> None:
"""Fetch assets for a project using DVC if possible.
project_dir (Path): Path to project directory.
"""
project_path = ensure_path(project_dir)
config = load_project_config(project_path)
assets = config.get("assets", {})
if not assets:
msg.warn(f"No assets specified in {PROJECT_FILE}", exits=0)
msg.info(f"Fetching {len(assets)} asset(s)")
for asset in assets:
dest = (project_dir / asset["dest"]).resolve()
checksum = asset.get("checksum")
if "git" in asset:
git_err = (
f"Cloning spaCy project templates requires Git and the 'git' command. "
f"Make sure it's installed and that the executable is available."
)
get_git_version(error=git_err)
if dest.exists():
# If there's already a file, check for checksum
if checksum and checksum == get_checksum(dest):
msg.good(
f"Skipping download with matching checksum: {asset['dest']}"
)
continue
else:
if dest.is_dir():
shutil.rmtree(dest)
else:
dest.unlink()
git_checkout(
asset["git"]["repo"],
asset["git"]["path"],
dest,
branch=asset["git"].get("branch"),
sparse=sparse_checkout,
)
msg.good(f"Downloaded asset {dest}")
else:
url = asset.get("url")
if not url:
# project.yml defines asset without URL that the user has to place
check_private_asset(dest, checksum)
continue
fetch_asset(project_path, url, dest, checksum)
def check_private_asset(dest: Path, checksum: Optional[str] = None) -> None:
"""Check and validate assets without a URL (private assets that the user
has to provide themselves) and give feedback about the checksum.
dest (Path): Destination path of the asset.
checksum (Optional[str]): Optional checksum of the expected file.
"""
if not Path(dest).exists():
err = f"No URL provided for asset. You need to add this file yourself: {dest}"
msg.warn(err)
else:
if not checksum:
msg.good(f"Asset already exists: {dest}")
elif checksum == get_checksum(dest):
msg.good(f"Asset exists with matching checksum: {dest}")
else:
msg.fail(f"Asset available but with incorrect checksum: {dest}")
def fetch_asset(
project_path: Path, url: str, dest: Path, checksum: Optional[str] = None
) -> None:
"""Fetch an asset from a given URL or path. If a checksum is provided and a
local file exists, it's only re-downloaded if the checksum doesn't match.
project_path (Path): Path to project directory.
url (str): URL or path to asset.
checksum (Optional[str]): Optional expected checksum of local file.
RETURNS (Optional[Path]): The path to the fetched asset or None if fetching
the asset failed.
"""
dest_path = (project_path / dest).resolve()
if dest_path.exists() and checksum:
# If there's already a file, check for checksum
if checksum == get_checksum(dest_path):
msg.good(f"Skipping download with matching checksum: {dest}")
return dest_path
# We might as well support the user here and create parent directories in
# case the asset dir isn't listed as a dir to create in the project.yml
if not dest_path.parent.exists():
dest_path.parent.mkdir(parents=True)
with working_dir(project_path):
url = convert_asset_url(url)
try:
download_file(url, dest_path)
msg.good(f"Downloaded asset {dest}")
except requests.exceptions.RequestException as e:
if Path(url).exists() and Path(url).is_file():
# If it's a local file, copy to destination
shutil.copy(url, str(dest_path))
msg.good(f"Copied local asset {dest}")
else:
msg.fail(f"Download failed: {dest}", e)
return
if checksum and checksum != get_checksum(dest_path):
msg.fail(f"Checksum doesn't match value defined in {PROJECT_FILE}: {dest}")
def convert_asset_url(url: str) -> str:
"""Check and convert the asset URL if needed.
url (str): The asset URL.
RETURNS (str): The converted URL.
"""
# If the asset URL is a regular GitHub URL it's likely a mistake
if re.match(r"(http(s?)):\/\/github.com", url) and "releases/download" not in url:
converted = url.replace("github.com", "raw.githubusercontent.com")
converted = re.sub(r"/(tree|blob)/", "/", converted)
msg.warn(
"Downloading from a regular GitHub URL. This will only download "
"the source of the page, not the actual file. Converting the URL "
"to a raw URL.",
converted,
)
return converted
return url
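For instance, a regular blob URL is rewritten like this (the repository path is illustrative):

url = "https://github.com/explosion/projects/blob/v3/pipelines/data.jsonl"
print(convert_asset_url(url))
# https://raw.githubusercontent.com/explosion/projects/v3/pipelines/data.jsonl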
from weasel.cli.assets import *


@ -1,99 +1 @@
from typing import Optional
from pathlib import Path
from wasabi import msg
import subprocess
import re
from ... import about
from ...util import ensure_path
from .._util import project_cli, Arg, Opt, COMMAND, PROJECT_FILE
from .._util import git_checkout, get_git_version
DEFAULT_REPO = about.__projects__
DEFAULT_PROJECTS_BRANCH = about.__projects_branch__
DEFAULT_BRANCH = "master"
@project_cli.command("clone")
def project_clone_cli(
# fmt: off
name: str = Arg(..., help="The name of the template to clone"),
dest: Optional[Path] = Arg(None, help="Where to clone the project. Defaults to current working directory", exists=False),
repo: str = Opt(DEFAULT_REPO, "--repo", "-r", help="The repository to clone from"),
branch: Optional[str] = Opt(None, "--branch", "-b", help="The branch to clone from"),
sparse_checkout: bool = Opt(False, "--sparse", "-S", help="Use sparse Git checkout to only check out and clone the files needed. Requires Git v22.2+.")
# fmt: on
):
"""Clone a project template from a repository. Calls into "git" and will
only download the files from the given subdirectory. The GitHub repo
defaults to the official spaCy template repo, but can be customized
(including using a private repo).
DOCS: https://spacy.io/api/cli#project-clone
"""
if dest is None:
dest = Path.cwd() / Path(name).parts[-1]
if branch is None:
# If it's a user repo, we want to default to other branch
branch = DEFAULT_PROJECTS_BRANCH if repo == DEFAULT_REPO else DEFAULT_BRANCH
project_clone(name, dest, repo=repo, branch=branch, sparse_checkout=sparse_checkout)
def project_clone(
name: str,
dest: Path,
*,
repo: str = about.__projects__,
branch: str = about.__projects_branch__,
sparse_checkout: bool = False,
) -> None:
"""Clone a project template from a repository.
name (str): Name of subdirectory to clone.
dest (Path): Destination path of cloned project.
repo (str): URL of Git repo containing project templates.
branch (str): The branch to clone from
"""
dest = ensure_path(dest)
check_clone(name, dest, repo)
project_dir = dest.resolve()
repo_name = re.sub(r"(http(s?)):\/\/github.com/", "", repo)
try:
git_checkout(repo, name, dest, branch=branch, sparse=sparse_checkout)
except subprocess.CalledProcessError:
err = f"Could not clone '{name}' from repo '{repo_name}'"
msg.fail(err, exits=1)
msg.good(f"Cloned '{name}' from {repo_name}", project_dir)
if not (project_dir / PROJECT_FILE).exists():
msg.warn(f"No {PROJECT_FILE} found in directory")
else:
msg.good(f"Your project is now ready!")
print(f"To fetch the assets, run:\n{COMMAND} project assets {dest}")
def check_clone(name: str, dest: Path, repo: str) -> None:
"""Check and validate that the destination path can be used to clone. Will
check that Git is available and that the destination path is suitable.
name (str): Name of the directory to clone from the repo.
dest (Path): Local destination of cloned directory.
repo (str): URL of the repo to clone from.
"""
git_err = (
f"Cloning spaCy project templates requires Git and the 'git' command. ",
f"To clone a project without Git, copy the files from the '{name}' "
f"directory in the {repo} to {dest} manually.",
)
get_git_version(error=git_err)
if not dest:
msg.fail(f"Not a valid directory to clone project: {dest}", exits=1)
if dest.exists():
# Directory already exists (not allowed, clone needs to create it)
msg.fail(f"Can't clone project, directory already exists: {dest}", exits=1)
if not dest.parent.exists():
# We're not creating parents, parent dir should exist
msg.fail(
f"Can't clone project, parent directory doesn't exist: {dest.parent}. "
f"Create the necessary folder(s) first before continuing.",
exits=1,
)
from weasel.cli.clone import *


@ -1,115 +1 @@
from pathlib import Path
from wasabi import msg, MarkdownRenderer
from ...util import working_dir
from .._util import project_cli, Arg, Opt, PROJECT_FILE, load_project_config
DOCS_URL = "https://spacy.io"
INTRO_PROJECT = f"""The [`{PROJECT_FILE}`]({PROJECT_FILE}) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[spaCy projects documentation]({DOCS_URL}/usage/projects)."""
INTRO_COMMANDS = f"""The following commands are defined by the project. They
can be executed using [`spacy project run [name]`]({DOCS_URL}/api/cli#project-run).
Commands are only re-run if their inputs have changed."""
INTRO_WORKFLOWS = f"""The following workflows are defined by the project. They
can be executed using [`spacy project run [name]`]({DOCS_URL}/api/cli#project-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed."""
INTRO_ASSETS = f"""The following assets are defined by the project. They can
be fetched by running [`spacy project assets`]({DOCS_URL}/api/cli#project-assets)
in the project directory."""
# These markers are added to the Markdown and can be used to update the file in
# place if it already exists. Only the auto-generated part will be replaced.
MARKER_START = "<!-- SPACY PROJECT: AUTO-GENERATED DOCS START (do not remove) -->"
MARKER_END = "<!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) -->"
# If this marker is used in an existing README, it's ignored and not replaced
MARKER_IGNORE = "<!-- SPACY PROJECT: IGNORE -->"
@project_cli.command("document")
def project_document_cli(
# fmt: off
project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
output_file: Path = Opt("-", "--output", "-o", help="Path to output Markdown file for output. Defaults to - for standard output"),
no_emoji: bool = Opt(False, "--no-emoji", "-NE", help="Don't use emoji")
# fmt: on
):
"""
Auto-generate a README.md for a project. If the content is saved to a file,
hidden markers are added so you can add custom content before or after the
auto-generated section and only the auto-generated docs will be replaced
when you re-run the command.
DOCS: https://spacy.io/api/cli#project-document
"""
project_document(project_dir, output_file, no_emoji=no_emoji)
def project_document(
project_dir: Path, output_file: Path, *, no_emoji: bool = False
) -> None:
is_stdout = str(output_file) == "-"
config = load_project_config(project_dir)
md = MarkdownRenderer(no_emoji=no_emoji)
md.add(MARKER_START)
title = config.get("title")
description = config.get("description")
md.add(md.title(1, f"spaCy Project{f': {title}' if title else ''}", "🪐"))
if description:
md.add(description)
md.add(md.title(2, PROJECT_FILE, "📋"))
md.add(INTRO_PROJECT)
# Commands
cmds = config.get("commands", [])
data = [(md.code(cmd["name"]), cmd.get("help", "")) for cmd in cmds]
if data:
md.add(md.title(3, "Commands", ""))
md.add(INTRO_COMMANDS)
md.add(md.table(data, ["Command", "Description"]))
# Workflows
wfs = config.get("workflows", {}).items()
data = [(md.code(n), " &rarr; ".join(md.code(w) for w in stp)) for n, stp in wfs]
if data:
md.add(md.title(3, "Workflows", ""))
md.add(INTRO_WORKFLOWS)
md.add(md.table(data, ["Workflow", "Steps"]))
# Assets
assets = config.get("assets", [])
data = []
for a in assets:
source = "Git" if a.get("git") else "URL" if a.get("url") else "Local"
dest_path = a["dest"]
dest = md.code(dest_path)
if source == "Local":
# Only link assets if they're in the repo
with working_dir(project_dir) as p:
if (p / dest_path).exists():
dest = md.link(dest, dest_path)
data.append((dest, source, a.get("description", "")))
if data:
md.add(md.title(3, "Assets", "🗂"))
md.add(INTRO_ASSETS)
md.add(md.table(data, ["File", "Source", "Description"]))
md.add(MARKER_END)
# Output result
if is_stdout:
print(md.text)
else:
content = md.text
if output_file.exists():
with output_file.open("r", encoding="utf8") as f:
existing = f.read()
if MARKER_IGNORE in existing:
msg.warn("Found ignore marker in existing file: skipping", output_file)
return
if MARKER_START in existing and MARKER_END in existing:
msg.info("Found existing file: only replacing auto-generated docs")
before = existing.split(MARKER_START)[0]
after = existing.split(MARKER_END)[1]
content = f"{before}{content}{after}"
else:
msg.warn("Replacing existing file")
with output_file.open("w", encoding="utf8") as f:
f.write(content)
msg.good("Saved project documentation", output_file)
from weasel.cli.document import *
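The marker splicing above is easy to exercise on its own; a sketch with made-up content:

MARKER_START = "<!-- SPACY PROJECT: AUTO-GENERATED DOCS START (do not remove) -->"
MARKER_END = "<!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) -->"

existing = f"My intro\n{MARKER_START}\nstale docs\n{MARKER_END}\nMy outro"
generated = f"{MARKER_START}\nfresh docs\n{MARKER_END}"
before = existing.split(MARKER_START)[0]
after = existing.split(MARKER_END)[1]
updated = f"{before}{generated}{after}"  # keeps intro/outro, replaces the generated middle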


@ -1,204 +1 @@
"""This module contains helpers and subcommands for integrating spaCy projects
with Data Version Controk (DVC). https://dvc.org"""
from typing import Dict, Any, List, Optional, Iterable
import subprocess
from pathlib import Path
from wasabi import msg
from .._util import PROJECT_FILE, load_project_config, get_hash, project_cli
from .._util import Arg, Opt, NAME, COMMAND
from ...util import working_dir, split_command, join_command, run_command
from ...util import SimpleFrozenList
DVC_CONFIG = "dvc.yaml"
DVC_DIR = ".dvc"
UPDATE_COMMAND = "dvc"
DVC_CONFIG_COMMENT = f"""# This file is auto-generated by spaCy based on your {PROJECT_FILE}. If you've
# edited your {PROJECT_FILE}, you can regenerate this file by running:
# {COMMAND} project {UPDATE_COMMAND}"""
@project_cli.command(UPDATE_COMMAND)
def project_update_dvc_cli(
# fmt: off
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
workflow: Optional[str] = Arg(None, help=f"Name of workflow defined in {PROJECT_FILE}. Defaults to first workflow if not set."),
verbose: bool = Opt(False, "--verbose", "-V", help="Print more info"),
force: bool = Opt(False, "--force", "-F", help="Force update DVC config"),
# fmt: on
):
"""Auto-generate Data Version Control (DVC) config. A DVC
project can only define one pipeline, so you need to specify one workflow
defined in the project.yml. If no workflow is specified, the first defined
workflow is used. The DVC config will only be updated if the project.yml
changed.
DOCS: https://spacy.io/api/cli#project-dvc
"""
project_update_dvc(project_dir, workflow, verbose=verbose, force=force)
def project_update_dvc(
project_dir: Path,
workflow: Optional[str] = None,
*,
verbose: bool = False,
force: bool = False,
) -> None:
"""Update the auto-generated Data Version Control (DVC) config file. A DVC
project can only define one pipeline, so you need to specify one workflow
defined in the project.yml. Will only update the file if the checksum changed.
project_dir (Path): The project directory.
workflow (Optional[str]): Optional name of workflow defined in project.yml.
If not set, the first workflow will be used.
verbose (bool): Print more info.
force (bool): Force update DVC config.
"""
config = load_project_config(project_dir)
updated = update_dvc_config(
project_dir, config, workflow, verbose=verbose, force=force
)
help_msg = "To execute the workflow with DVC, run: dvc repro"
if updated:
msg.good(f"Updated DVC config from {PROJECT_FILE}", help_msg)
else:
msg.info(f"No changes found in {PROJECT_FILE}, no update needed", help_msg)
def update_dvc_config(
path: Path,
config: Dict[str, Any],
workflow: Optional[str] = None,
verbose: bool = False,
silent: bool = False,
force: bool = False,
) -> bool:
"""Re-run the DVC commands in dry mode and update dvc.yaml file in the
project directory. The file is auto-generated based on the config. The
first line of the auto-generated file specifies the hash of the config
dict, so if any of the config values change, the DVC config is regenerated.
path (Path): The path to the project directory.
config (Dict[str, Any]): The loaded project.yml.
verbose (bool): Whether to print additional info (via DVC).
silent (bool): Don't output anything (via DVC).
force (bool): Force update, even if hashes match.
RETURNS (bool): Whether the DVC config file was updated.
"""
ensure_dvc(path)
workflows = config.get("workflows", {})
workflow_names = list(workflows.keys())
check_workflows(workflow_names, workflow)
if not workflow:
workflow = workflow_names[0]
config_hash = get_hash(config)
path = path.resolve()
dvc_config_path = path / DVC_CONFIG
if dvc_config_path.exists():
# Check if the file was generated using the current config, if not, redo
with dvc_config_path.open("r", encoding="utf8") as f:
ref_hash = f.readline().strip().replace("# ", "")
if ref_hash == config_hash and not force:
return False # Nothing has changed in project.yml, don't need to update
dvc_config_path.unlink()
dvc_commands = []
config_commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
for name in workflows[workflow]:
command = config_commands[name]
deps = command.get("deps", [])
outputs = command.get("outputs", [])
outputs_no_cache = command.get("outputs_no_cache", [])
if not deps and not outputs and not outputs_no_cache:
continue
# Default to the working dir as the project path since dvc.yaml is auto-generated
# and we don't want arbitrary paths in there
project_cmd = ["python", "-m", NAME, "project", "run", name]
deps_cmd = [c for cl in [["-d", p] for p in deps] for c in cl]
outputs_cmd = [c for cl in [["-o", p] for p in outputs] for c in cl]
outputs_nc_cmd = [c for cl in [["-O", p] for p in outputs_no_cache] for c in cl]
dvc_cmd = ["run", "-n", name, "-w", str(path), "--no-exec"]
if command.get("no_skip"):
dvc_cmd.append("--always-changed")
full_cmd = [*dvc_cmd, *deps_cmd, *outputs_cmd, *outputs_nc_cmd, *project_cmd]
dvc_commands.append(join_command(full_cmd))
with working_dir(path):
dvc_flags = {"--verbose": verbose, "--quiet": silent}
run_dvc_commands(dvc_commands, flags=dvc_flags)
with dvc_config_path.open("r+", encoding="utf8") as f:
content = f.read()
f.seek(0, 0)
f.write(f"# {config_hash}\n{DVC_CONFIG_COMMENT}\n{content}")
return True
def run_dvc_commands(
commands: Iterable[str] = SimpleFrozenList(), flags: Dict[str, bool] = {}
) -> None:
"""Run a sequence of DVC commands in a subprocess, in order.
commands (List[str]): The string commands without the leading "dvc".
flags (Dict[str, bool]): Conditional flags to be added to command. Makes it
easier to pass flags like --quiet that depend on a variable or
command-line setting while avoiding lots of nested conditionals.
"""
for command in commands:
command = split_command(command)
dvc_command = ["dvc", *command]
# Add the flags if they are set to True
for flag, is_active in flags.items():
if is_active:
dvc_command.append(flag)
run_command(dvc_command)
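Assuming `dvc` is installed and on PATH, the helper above could be driven like this (the stage definition is illustrative):

run_dvc_commands(
    ["run -n train -d corpus/train.spacy -o training/model-best --no-exec python -m spacy project run train"],
    flags={"--quiet": True, "--verbose": False},
)
# executes: dvc run -n train ... --no-exec --quiet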
def check_workflows(workflows: List[str], workflow: Optional[str] = None) -> None:
"""Validate workflows provided in project.yml and check that a given
workflow can be used to generate a DVC config.
workflows (List[str]): Names of the available workflows.
workflow (Optional[str]): The name of the workflow to convert.
"""
if not workflows:
msg.fail(
f"No workflows defined in {PROJECT_FILE}. To generate a DVC config, "
f"define at least one list of commands.",
exits=1,
)
if workflow is not None and workflow not in workflows:
msg.fail(
f"Workflow '{workflow}' not defined in {PROJECT_FILE}. "
f"Available workflows: {', '.join(workflows)}",
exits=1,
)
if not workflow:
msg.warn(
f"No workflow specified for DVC pipeline. Using the first workflow "
f"defined in {PROJECT_FILE}: '{workflows[0]}'"
)
def ensure_dvc(project_dir: Path) -> None:
"""Ensure that the "dvc" command is available and that the current project
directory is an initialized DVC project.
"""
try:
subprocess.run(["dvc", "--version"], stdout=subprocess.DEVNULL)
except Exception:
msg.fail(
"To use spaCy projects with DVC (Data Version Control), DVC needs "
"to be installed and the 'dvc' command needs to be available",
"You can install the Python package from pip (pip install dvc) or "
"conda (conda install -c conda-forge dvc). For more details, see the "
"documentation: https://dvc.org/doc/install",
exits=1,
)
if not (project_dir / ".dvc").exists():
msg.fail(
"Project not initialized as a DVC project",
"To initialize a DVC project, you can run 'dvc init' in the project "
"directory. For more details, see the documentation: "
"https://dvc.org/doc/command-reference/init",
exits=1,
)
from weasel.cli.dvc import *


@ -1,58 +1 @@
from pathlib import Path
from wasabi import msg
from .remote_storage import RemoteStorage
from .remote_storage import get_command_hash
from .._util import project_cli, Arg
from .._util import load_project_config
from .run import update_lockfile
@project_cli.command("pull")
def project_pull_cli(
# fmt: off
remote: str = Arg("default", help="Name or path of remote storage"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
# fmt: on
):
"""Retrieve available precomputed outputs from a remote storage.
You can alias remotes in your project.yml by mapping them to storage paths.
A storage can be anything that the smart-open library can upload to, e.g.
AWS, Google Cloud Storage, SSH, local directories etc.
DOCS: https://spacy.io/api/cli#project-pull
"""
for url, output_path in project_pull(project_dir, remote):
if url is not None:
msg.good(f"Pulled {output_path} from {url}")
def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
# TODO: We don't have tests for this :(. It would take a bit of mockery to
# set up. I guess see if it breaks first?
config = load_project_config(project_dir)
if remote in config.get("remotes", {}):
remote = config["remotes"][remote]
storage = RemoteStorage(project_dir, remote)
commands = list(config.get("commands", []))
# We use a while loop here because we don't know how the commands
# will be ordered. A command might need dependencies from one that's later
# in the list.
while commands:
for i, cmd in enumerate(list(commands)):
deps = [project_dir / dep for dep in cmd.get("deps", [])]
if all(dep.exists() for dep in deps):
cmd_hash = get_command_hash("", "", deps, cmd["script"])
for output_path in cmd.get("outputs", []):
url = storage.pull(output_path, command_hash=cmd_hash)
yield url, output_path
out_locs = [project_dir / out for out in cmd.get("outputs", [])]
if all(loc.exists() for loc in out_locs):
update_lockfile(project_dir, cmd)
# We remove the command from the list here, and break, so that
# we iterate over the loop again.
commands.pop(i)
break
else:
# If we didn't break the for loop, break the while loop.
break
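The while/for/else pattern above is a dependency-ordered sweep; the same idea in miniature with toy command data:

commands = [
    {"name": "evaluate", "deps": ["model"], "outputs": ["metrics"]},
    {"name": "train", "deps": [], "outputs": ["model"]},
]
done = set()
order = []
while commands:
    for i, cmd in enumerate(list(commands)):
        if all(dep in done for dep in cmd["deps"]):
            done.update(cmd["outputs"])
            order.append(cmd["name"])
            commands.pop(i)
            break
    else:
        break  # nothing runnable; stop
# order == ["train", "evaluate"]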
from weasel.cli.pull import *


@ -1,63 +1 @@
from pathlib import Path
from wasabi import msg
from .remote_storage import RemoteStorage
from .remote_storage import get_content_hash, get_command_hash
from .._util import load_project_config
from .._util import project_cli, Arg
@project_cli.command("push")
def project_push_cli(
# fmt: off
remote: str = Arg("default", help="Name or path of remote storage"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
# fmt: on
):
"""Persist outputs to a remote storage. You can alias remotes in your
project.yml by mapping them to storage paths. A storage can be anything that
the smart-open library can upload to, e.g. AWS, Google Cloud Storage, SSH,
local directories etc.
DOCS: https://spacy.io/api/cli#project-push
"""
for output_path, url in project_push(project_dir, remote):
if url is None:
msg.info(f"Skipping {output_path}")
else:
msg.good(f"Pushed {output_path} to {url}")
def project_push(project_dir: Path, remote: str):
"""Persist outputs to a remote storage. You can alias remotes in your project.yml
by mapping them to storage paths. A storage can be anything that the smart-open
library can upload to, e.g. gcs, aws, ssh, local directories etc
"""
config = load_project_config(project_dir)
if remote in config.get("remotes", {}):
remote = config["remotes"][remote]
storage = RemoteStorage(project_dir, remote)
for cmd in config.get("commands", []):
deps = [project_dir / dep for dep in cmd.get("deps", [])]
if any(not dep.exists() for dep in deps):
continue
cmd_hash = get_command_hash(
"", "", [project_dir / dep for dep in cmd.get("deps", [])], cmd["script"]
)
for output_path in cmd.get("outputs", []):
output_loc = project_dir / output_path
if output_loc.exists() and _is_not_empty_dir(output_loc):
url = storage.push(
output_path,
command_hash=cmd_hash,
content_hash=get_content_hash(output_loc),
)
yield output_path, url
def _is_not_empty_dir(loc: Path):
if not loc.is_dir():
return True
elif any(_is_not_empty_dir(child) for child in loc.iterdir()):
return True
else:
return False
from weasel.cli.push import *


@ -1,174 +1 @@
from typing import Optional, List, Dict, TYPE_CHECKING
import os
import site
import hashlib
import urllib.parse
import tarfile
from pathlib import Path
from .._util import get_hash, get_checksum, download_file, ensure_pathy
from ...util import make_tempdir, get_minor_version, ENV_VARS, check_bool_env_var
from ...git_info import GIT_VERSION
from ... import about
if TYPE_CHECKING:
from pathy import Pathy # noqa: F401
class RemoteStorage:
"""Push and pull outputs to and from a remote file storage.
Remotes can be anything that `smart-open` can support: AWS, GCS, file system,
ssh, etc.
"""
def __init__(self, project_root: Path, url: str, *, compression="gz"):
self.root = project_root
self.url = ensure_pathy(url)
self.compression = compression
def push(self, path: Path, command_hash: str, content_hash: str) -> "Pathy":
"""Compress a file or directory within a project and upload it to a remote
storage. If an object exists at the full URL, nothing is done.
Within the remote storage, files are addressed by their project path
(url encoded) and two user-supplied hashes, representing their creation
context and their file contents. If the URL already exists, the data is
not uploaded. Paths are archived and compressed prior to upload.
"""
loc = self.root / path
if not loc.exists():
raise IOError(f"Cannot push {loc}: does not exist.")
url = self.make_url(path, command_hash, content_hash)
if url.exists():
return None
tmp: Path
with make_tempdir() as tmp:
tar_loc = tmp / self.encode_name(str(path))
mode_string = f"w:{self.compression}" if self.compression else "w"
with tarfile.open(tar_loc, mode=mode_string) as tar_file:
tar_file.add(str(loc), arcname=str(path))
with tar_loc.open(mode="rb") as input_file:
with url.open(mode="wb") as output_file:
output_file.write(input_file.read())
return url
def pull(
self,
path: Path,
*,
command_hash: Optional[str] = None,
content_hash: Optional[str] = None,
) -> Optional["Pathy"]:
"""Retrieve a file from the remote cache. If the file already exists,
nothing is done.
If the command_hash and/or content_hash are specified, only matching
results are returned. If no results are available, an error is raised.
"""
dest = self.root / path
if dest.exists():
return None
url = self.find(path, command_hash=command_hash, content_hash=content_hash)
if url is None:
return url
else:
# Make sure the destination exists
if not dest.parent.exists():
dest.parent.mkdir(parents=True)
tmp: Path
with make_tempdir() as tmp:
tar_loc = tmp / url.parts[-1]
download_file(url, tar_loc)
mode_string = f"r:{self.compression}" if self.compression else "r"
with tarfile.open(tar_loc, mode=mode_string) as tar_file:
# This requires that the path is added correctly, relative
# to root. This is how we set things up in push()
tar_file.extractall(self.root)
return url
def find(
self,
path: Path,
*,
command_hash: Optional[str] = None,
content_hash: Optional[str] = None,
) -> Optional["Pathy"]:
"""Find the best matching version of a file within the storage,
or `None` if no match can be found. If both the creation and content hash
are specified, only exact matches will be returned. Otherwise, the most
recent matching file is preferred.
"""
name = self.encode_name(str(path))
if command_hash is not None and content_hash is not None:
url = self.make_url(path, command_hash, content_hash)
urls = [url] if url.exists() else []
elif command_hash is not None:
urls = list((self.url / name / command_hash).iterdir())
else:
urls = list((self.url / name).iterdir())
if content_hash is not None:
urls = [url for url in urls if url.parts[-1] == content_hash]
return urls[-1] if urls else None
def make_url(self, path: Path, command_hash: str, content_hash: str) -> "Pathy":
"""Construct a URL from a subpath, a creation hash and a content hash."""
return self.url / self.encode_name(str(path)) / command_hash / content_hash
def encode_name(self, name: str) -> str:
"""Encode a subpath into a URL-safe name."""
return urllib.parse.quote_plus(name)
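Put together, `encode_name` and `make_url` compose addresses like the following (the bucket and hashes are made up):

import urllib.parse

name = urllib.parse.quote_plus("training/model-best")  # 'training%2Fmodel-best'
# <remote url>/<encoded path>/<command hash>/<content hash>
url = f"s3://my-bucket/{name}/1a2b3c4d/5e6f7a8b"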
def get_content_hash(loc: Path) -> str:
return get_checksum(loc)
def get_command_hash(
site_hash: str, env_hash: str, deps: List[Path], cmd: List[str]
) -> str:
"""Create a hash representing the execution of a command. This includes the
currently installed packages, whatever environment variables have been marked
as relevant, and the command.
"""
check_commit = check_bool_env_var(ENV_VARS.PROJECT_USE_GIT_VERSION)
spacy_v = GIT_VERSION if check_commit else get_minor_version(about.__version__)
dep_checksums = [get_checksum(dep) for dep in sorted(deps)]
hashes = [spacy_v, site_hash, env_hash] + dep_checksums
hashes.extend(cmd)
creation_bytes = "".join(hashes).encode("utf8")
return hashlib.md5(creation_bytes).hexdigest()
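A self-contained analogue of the hashing above: any change to the version, environment, dependencies, or command produces a different digest (all inputs are placeholders):

import hashlib

parts = ["3.5", "<site-hash>", "<env-hash>", "<dep-checksum>", "python train.py"]
command_hash = hashlib.md5("".join(parts).encode("utf8")).hexdigest()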
def get_site_hash():
"""Hash the current Python environment's site-packages contents, including
the name and version of the libraries. The list we're hashing is what
`pip freeze` would output.
"""
site_dirs = site.getsitepackages()
if site.ENABLE_USER_SITE:
site_dirs.append(site.getusersitepackages())  # returns a single str, so append, not extend
packages = set()
for site_dir in site_dirs:
site_dir = Path(site_dir)
for subpath in site_dir.iterdir():
if subpath.parts[-1].endswith("dist-info"):
packages.add(subpath.parts[-1].replace(".dist-info", ""))
package_bytes = "".join(sorted(packages)).encode("utf8")
return hashlib.md5(package_bytes).hexdigest()
def get_env_hash(env: Dict[str, str]) -> str:
"""Construct a hash of the environment variables that will be passed into
the commands.
Values in the env dict may be references to the current os.environ, using
the syntax $ENV_VAR to mean os.environ[ENV_VAR]
"""
env_vars = {}
for key, value in env.items():
if value.startswith("$"):
env_vars[key] = os.environ.get(value[1:], "")
else:
env_vars[key] = value
return get_hash(env_vars)
from weasel.cli.remote_storage import *
