Merge remote-tracking branch 'origin/develop' into rliaw-develop

Richard Liaw 2020-06-30 13:50:03 -07:00
commit 610dfd85c2
235 changed files with 8908 additions and 5314 deletions

106
.github/contributors/Arvindcheenu.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Arvind Srinivasan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-06-13 |
| GitHub username | arvindcheenu |
| Website (optional) | |

106
.github/contributors/JannisTriesToCode.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ----------------------------- |
| Name | Jannis Rauschke |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 22.05.2020 |
| GitHub username | JannisTriesToCode |
| Website (optional) | https://twitter.com/JRauschke |

.github/contributors/MartinoMensio.md

@ -99,8 +99,8 @@ mark both statements:
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Martino Mensio |
| Company name (if applicable) | Polytechnic University of Turin |
| Title or role (if applicable) | Student |
| Company name (if applicable) | The Open University |
| Title or role (if applicable) | PhD Student |
| Date | 17 November 2017 |
| GitHub username | MartinoMensio |
| Website (optional) | https://martinomensio.github.io/ |

106
.github/contributors/R1j1t.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Rajat |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 24 May 2020 |
| GitHub username | R1j1t |
| Website (optional) | |

106
.github/contributors/hiroshi-matsuda-rit.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Hiroshi Matsuda |
| Company name (if applicable) | Megagon Labs, Tokyo |
| Title or role (if applicable) | Research Scientist |
| Date | June 6, 2020 |
| GitHub username | hiroshi-matsuda-rit |
| Website (optional) | |

106
.github/contributors/jonesmartins.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jones Martins |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-06-10 |
| GitHub username | jonesmartins |
| Website (optional) | |

106
.github/contributors/leomrocha.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Leonardo M. Rocha |
| Company name (if applicable) | |
| Title or role (if applicable) | Eng. |
| Date | 31/05/2020 |
| GitHub username | leomrocha |
| Website (optional) | |

106
.github/contributors/lfiedler.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Leander Fiedler |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 06 April 2020 |
| GitHub username | lfiedler |
| Website (optional) | |

106
.github/contributors/mahnerak.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Karen Hambardzumyan |
| Company name (if applicable) | YerevaNN |
| Title or role (if applicable) | Researcher |
| Date | 2020-06-19 |
| GitHub username | mahnerak |
| Website (optional)             | https://mahnerak.com/ |

106
.github/contributors/myavrum.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Marat M. Yavrumyan |
| Company name (if applicable) | YSU, UD_Armenian Project |
| Title or role (if applicable) | Dr., Principal Investigator |
| Date | 2020-06-19 |
| GitHub username | myavrum |
| Website (optional) | http://armtreebank.yerevann.com/ |

106
.github/contributors/theudas.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Philipp Sodmann |
| Company name (if applicable) | Empolis |
| Title or role (if applicable) | |
| Date | 2017-05-06 |
| GitHub username | theudas |
| Website (optional) | |

29
.github/workflows/issue-manager.yml vendored Normal file

@ -0,0 +1,29 @@
name: Issue Manager

on:
  schedule:
    - cron: "0 0 * * *"
  issue_comment:
    types:
      - created
      - edited
  issues:
    types:
      - labeled

jobs:
  issue-manager:
    runs-on: ubuntu-latest
    steps:
      - uses: tiangolo/issue-manager@0.2.1
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          config: >
            {
              "resolved": {
                "delay": "P7D",
                "message": "This issue has been automatically closed because it was answered and there was no follow-up discussion.",
                "remove_label_on_comment": true,
                "remove_label_on_close": true
              }
            }

Makefile

@ -5,8 +5,9 @@ VENV := ./env$(PYVER)
version := $(shell "bin/get-version.sh")
dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy_lookups_data
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core
chmod a+rx $@
cp $@ dist/spacy.pex
dist/pytest.pex : wheelhouse/pytest-*.whl
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
@ -14,7 +15,7 @@ dist/pytest.pex : wheelhouse/pytest-*.whl
wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
$(VENV)/bin/pip wheel . -w ./wheelhouse
$(VENV)/bin/pip wheel spacy_lookups_data -w ./wheelhouse
$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core -w ./wheelhouse
touch $@
wheelhouse/pytest-%.whl : $(VENV)/bin/pex

README.md

@ -6,12 +6,12 @@ spaCy is a library for advanced Natural Language Processing in Python and
Cython. It's built on the very latest research, and was designed from day one to
be used in real products. spaCy comes with
[pretrained statistical models](https://spacy.io/models) and word vectors, and
currently supports tokenization for **50+ languages**. It features
currently supports tokenization for **60+ languages**. It features
state-of-the-art speed, convolutional **neural network models** for tagging,
parsing and **named entity recognition** and easy **deep learning** integration.
It's commercial open-source software, released under the MIT license.
💫 **Version 2.2 out now!**
💫 **Version 2.3 out now!**
[Check out the release notes here.](https://github.com/explosion/spaCy/releases)
[![Azure Pipelines](<https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build+(3.x)>)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
@ -31,7 +31,7 @@ It's commercial open-source software, released under the MIT license.
| --------------- | -------------------------------------------------------------- |
| [spaCy 101] | New to spaCy? Here's everything you need to know! |
| [Usage Guides] | How to use spaCy and its features. |
| [New in v2.2] | New features, backwards incompatibilities and migration guide. |
| [New in v2.3] | New features, backwards incompatibilities and migration guide. |
| [API Reference] | The detailed reference for spaCy's API. |
| [Models] | Download statistical language models for spaCy. |
| [Universe] | Libraries, extensions, demos, books and courses. |
@ -39,7 +39,7 @@ It's commercial open-source software, released under the MIT license.
| [Contribute] | How to contribute to the spaCy project and code base. |
[spacy 101]: https://spacy.io/usage/spacy-101
[new in v2.2]: https://spacy.io/usage/v2-2
[new in v2.3]: https://spacy.io/usage/v2-3
[usage guides]: https://spacy.io/usage/
[api reference]: https://spacy.io/api/
[models]: https://spacy.io/models
@ -119,12 +119,13 @@ of `v2.0.13`).
```
pip install spacy
```
To install additional data tables for lemmatization in **spaCy v2.2+** you can
run `pip install spacy[lookups]` or install
To install additional data tables for lemmatization and normalization in
**spaCy v2.2+** you can run `pip install spacy[lookups]` or install
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
separately. The lookups package is needed to create blank models with
lemmatization data, and to lemmatize in languages that don't yet come with
pretrained models and aren't powered by third-party libraries.
lemmatization data for v2.2+ plus normalization data for v2.3+, and to
lemmatize in languages that don't yet come with pretrained models and aren't
powered by third-party libraries.
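
As a quick illustration (not part of this diff; assumes spaCy v2.3 and `spacy-lookups-data` are installed), lookup-based lemmatization on a blank model looks roughly like this:

```python
import spacy

# A blank pipeline has no pretrained tagger or parser; with
# spacy-lookups-data installed, lemmas for supported languages
# come from the bundled lookup tables.
nlp = spacy.blank("en")
doc = nlp("The cats were running")
print([token.lemma_ for token in doc])
```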
When using pip it is generally recommended to install packages in a virtual
environment to avoid modifying system state:

bin/ud/ud_train.py

@ -14,7 +14,7 @@ import spacy
import spacy.util
from bin.ud import conll17_ud_eval
from spacy.tokens import Token, Doc
from spacy.gold import GoldParse, Example
from spacy.gold import Example
from spacy.util import compounding, minibatch, minibatch_by_words
from spacy.syntax.nonproj import projectivize
from spacy.matcher import Matcher
@ -78,22 +78,21 @@ def read_data(
head = int(head) - 1 if head != "0" else id_
sent["words"].append(word)
sent["tags"].append(tag)
sent["morphology"].append(_parse_morph_string(morph))
sent["morphology"][-1].add("POS_%s" % pos)
sent["morphs"].append(_compile_morph_string(morph, pos))
sent["heads"].append(head)
sent["deps"].append("ROOT" if dep == "root" else dep)
sent["spaces"].append(space_after == "_")
sent["entities"] = ["-"] * len(sent["words"])
sent["entities"] = ["-"] * len(sent["words"]) # TODO: doc-level format
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
if oracle_segments:
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
golds.append(GoldParse(docs[-1], **sent))
assert golds[-1].morphology is not None
golds.append(sent)
assert golds[-1]["morphs"] is not None
sent_annots.append(sent)
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
doc, gold = _make_gold(nlp, None, sent_annots)
assert gold.morphology is not None
assert gold["morphs"] is not None
sent_annots = []
docs.append(doc)
golds.append(gold)
@ -109,17 +108,10 @@ def read_data(
return golds_to_gold_data(docs, golds)
def _parse_morph_string(morph_string):
def _compile_morph_string(morph_string, pos):
if morph_string == '_':
return set()
output = []
replacements = {'1': 'one', '2': 'two', '3': 'three'}
for feature in morph_string.split('|'):
key, value = feature.split('=')
value = replacements.get(value, value)
value = value.split(',')[0]
output.append('%s_%s' % (key, value.lower()))
return set(output)
return f"POS={pos}"
return morph_string + f"|POS={pos}"
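# Worked example (follows directly from the code above):
#   _compile_morph_string("Case=Nom|Number=Sing", "NOUN") -> "Case=Nom|Number=Sing|POS=NOUN"
#   _compile_morph_string("_", "VERB")                    -> "POS=VERB"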
def read_conllu(file_):
@ -151,28 +143,27 @@ def read_conllu(file_):
def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
# Flatten the conll annotations, and adjust the head indices
flat = defaultdict(list)
gold = defaultdict(list)
sent_starts = []
for sent in sent_annots:
flat["heads"].extend(len(flat["words"])+head for head in sent["heads"])
for field in ["words", "tags", "deps", "morphology", "entities", "spaces"]:
flat[field].extend(sent[field])
gold["heads"].extend(len(gold["words"])+head for head in sent["heads"])
for field in ["words", "tags", "deps", "morphs", "entities", "spaces"]:
gold[field].extend(sent[field])
sent_starts.append(True)
sent_starts.extend([False] * (len(sent["words"]) - 1))
# Construct text if necessary
assert len(flat["words"]) == len(flat["spaces"])
assert len(gold["words"]) == len(gold["spaces"])
if text is None:
text = "".join(
word + " " * space for word, space in zip(flat["words"], flat["spaces"])
word + " " * space for word, space in zip(gold["words"], gold["spaces"])
)
doc = nlp.make_doc(text)
flat.pop("spaces")
gold = GoldParse(doc, **flat)
gold.sent_starts = sent_starts
for i in range(len(gold.heads)):
gold.pop("spaces")
gold["sent_starts"] = sent_starts
for i in range(len(gold["heads"])):
if random.random() < drop_deps:
gold.heads[i] = None
gold.labels[i] = None
gold["heads"][i] = None
gold["labels"][i] = None
return doc, gold
@ -183,15 +174,10 @@ def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
def golds_to_gold_data(docs, golds):
"""Get out the training data format used by begin_training, given the
GoldParse objects."""
"""Get out the training data format used by begin_training"""
data = []
for doc, gold in zip(docs, golds):
example = Example(doc=doc)
example.add_doc_annotation(cats=gold.cats)
token_annotation_dict = gold.orig.to_dict()
example.add_token_annotation(**token_annotation_dict)
example.goldparse = gold
example = Example.from_dict(doc, dict(gold))
data.append(example)
return data
@ -359,9 +345,8 @@ def initialize_pipeline(nlp, examples, config, device):
nlp.parser.add_multitask_objective("tag")
if config.multitask_sent:
nlp.parser.add_multitask_objective("sent_start")
for ex in examples:
gold = ex.gold
for tag in gold.tags:
for eg in examples:
for tag in eg.get_aligned("TAG", as_string=True):
if tag is not None:
nlp.tagger.add_label(tag)
if torch is not None and device != -1:
@ -495,10 +480,6 @@ def main(
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)
Token.set_extension("get_conllu_lines", method=get_token_conllu)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)
spacy.util.fix_random_seed()
lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False
@ -541,10 +522,10 @@ def main(
else:
batches = minibatch(examples, size=batch_sizes)
losses = {}
n_train_words = sum(len(ex.doc) for ex in examples)
n_train_words = sum(len(eg.predicted) for eg in examples)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches:
pbar.update(sum(len(ex.doc) for ex in batch))
pbar.update(sum(len(ex.predicted) for ex in batch))
nlp.parser.cfg["beam_update_prob"] = next(beam_prob)
nlp.update(
batch,


@ -5,17 +5,16 @@
# data is passed in sentence-by-sentence via some prior preprocessing.
gold_preproc = false
# Limitations on training document length or number of examples.
max_length = 0
max_length = 5000
limit = 0
# Data augmentation
orth_variant_level = 0.0
noise_level = 0.0
dropout = 0.1
# Controls early-stopping. 0 or -1 mean unlimited.
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 400
eval_frequency = 200
# Other settings
seed = 0
accumulate_gradient = 1
@ -41,15 +40,15 @@ beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
use_averages = false
eps = 1e-8
learn_rate = 0.001
#learn_rate = 0.001
#[optimizer.learn_rate]
#@schedules = "warmup_linear.v1"
#warmup_steps = 250
#total_steps = 20000
#initial_rate = 0.001
[optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.001
[nlp]
lang = "en"
@ -58,15 +57,11 @@ vectors = null
[nlp.pipeline.tok2vec]
factory = "tok2vec"
[nlp.pipeline.senter]
factory = "senter"
[nlp.pipeline.ner]
factory = "ner"
learn_tokens = false
min_action_freq = 1
beam_width = 1
beam_update_prob = 1.0
[nlp.pipeline.tagger]
factory = "tagger"
@ -74,16 +69,7 @@ factory = "tagger"
[nlp.pipeline.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 1
beam_width = 1
beam_update_prob = 1.0
[nlp.pipeline.senter.model]
@architectures = "spacy.Tagger.v1"
[nlp.pipeline.senter.model.tok2vec]
@architectures = "spacy.Tok2VecTensors.v1"
width = ${nlp.pipeline.tok2vec.model:width}
min_action_freq = 30
[nlp.pipeline.tagger.model]
@architectures = "spacy.Tagger.v1"
@ -96,8 +82,8 @@ width = ${nlp.pipeline.tok2vec.model:width}
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 8
hidden_width = 128
maxout_pieces = 3
use_upper = false
maxout_pieces = 2
use_upper = true
[nlp.pipeline.parser.model.tok2vec]
@architectures = "spacy.Tok2VecTensors.v1"
@ -107,8 +93,8 @@ width = ${nlp.pipeline.tok2vec.model:width}
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 128
maxout_pieces = 3
use_upper = false
maxout_pieces = 2
use_upper = true
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy.Tok2VecTensors.v1"
@ -117,10 +103,10 @@ width = ${nlp.pipeline.tok2vec.model:width}
[nlp.pipeline.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = ${nlp:vectors}
width = 256
depth = 6
width = 128
depth = 4
window_size = 1
embed_size = 10000
embed_size = 7000
maxout_pieces = 3
subword_features = true
dropout = null
dropout = ${training:dropout}


@ -9,7 +9,6 @@ max_length = 0
limit = 0
# Data augmentation
orth_variant_level = 0.0
noise_level = 0.0
dropout = 0.1
# Controls early-stopping. 0 or -1 mean unlimited.
patience = 1600


@ -0,0 +1,80 @@
# Training hyper-parameters and additional features.
[training]
# Whether to train on sequences with 'gold standard' sentence boundaries
# and tokens. If you set this to true, take care to ensure your run-time
# data is passed in sentence-by-sentence via some prior preprocessing.
gold_preproc = false
# Limitations on training document length or number of examples.
max_length = 5000
limit = 0
# Data augmentation
orth_variant_level = 0.0
dropout = 0.2
# Controls early-stopping. 0 or -1 mean unlimited.
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 500
# Other settings
seed = 0
accumulate_gradient = 1
use_pytorch_for_gpu_memory = false
# Control how scores are printed and checkpoints are evaluated.
scores = ["speed", "ents_p", "ents_r", "ents_f"]
score_weights = {"ents_f": 1.0}
# These settings are invalid for the transformer models.
init_tok2vec = null
discard_oversize = false
omit_extra_lookups = false
[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = false
L2 = 1e-6
grad_clip = 1.0
use_averages = true
eps = 1e-8
learn_rate = 0.001
#[optimizer.learn_rate]
#@schedules = "warmup_linear.v1"
#warmup_steps = 250
#total_steps = 20000
#initial_rate = 0.001
[nlp]
lang = "en"
vectors = null
[nlp.pipeline.ner]
factory = "ner"
learn_tokens = false
min_action_freq = 1
beam_width = 1
beam_update_prob = 1.0
[nlp.pipeline.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 64
maxout_pieces = 2
use_upper = true
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = ${nlp:vectors}
width = 96
depth = 4
window_size = 1
embed_size = 2000
maxout_pieces = 3
subword_features = true
dropout = ${training:dropout}


@ -6,7 +6,6 @@ init_tok2vec = null
vectors = null
max_epochs = 100
orth_variant_level = 0.0
noise_level = 0.0
gold_preproc = true
max_length = 0
use_gpu = 0


@ -6,7 +6,6 @@ init_tok2vec = null
vectors = null
max_epochs = 100
orth_variant_level = 0.0
noise_level = 0.0
gold_preproc = true
max_length = 0
use_gpu = -1


@ -12,7 +12,7 @@ import tqdm
import spacy
import spacy.util
from spacy.tokens import Token, Doc
from spacy.gold import GoldParse, Example
from spacy.gold import Example
from spacy.syntax.nonproj import projectivize
from collections import defaultdict
from spacy.matcher import Matcher
@ -33,31 +33,6 @@ random.seed(0)
numpy.random.seed(0)
def minibatch_by_words(examples, size=5000):
random.shuffle(examples)
if isinstance(size, int):
size_ = itertools.repeat(size)
else:
size_ = size
examples = iter(examples)
while True:
batch_size = next(size_)
batch = []
while batch_size >= 0:
try:
example = next(examples)
except StopIteration:
if batch:
yield batch
return
batch_size -= len(example.doc)
batch.append(example)
if batch:
yield batch
else:
break
################
# Data reading #
################
@ -110,7 +85,7 @@ def read_data(
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
if oracle_segments:
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
golds.append(GoldParse(docs[-1], **sent))
golds.append(sent)
sent_annots.append(sent)
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
@ -159,20 +134,19 @@ def read_conllu(file_):
def _make_gold(nlp, text, sent_annots):
# Flatten the conll annotations, and adjust the head indices
flat = defaultdict(list)
gold = defaultdict(list)
for sent in sent_annots:
flat["heads"].extend(len(flat["words"]) + head for head in sent["heads"])
gold["heads"].extend(len(gold["words"]) + head for head in sent["heads"])
for field in ["words", "tags", "deps", "entities", "spaces"]:
flat[field].extend(sent[field])
gold[field].extend(sent[field])
# Construct text if necessary
assert len(flat["words"]) == len(flat["spaces"])
assert len(gold["words"]) == len(gold["spaces"])
if text is None:
text = "".join(
word + " " * space for word, space in zip(flat["words"], flat["spaces"])
word + " " * space for word, space in zip(gold["words"], gold["spaces"])
)
doc = nlp.make_doc(text)
flat.pop("spaces")
gold = GoldParse(doc, **flat)
gold.pop("spaces")
return doc, gold
@ -182,15 +156,10 @@ def _make_gold(nlp, text, sent_annots):
def golds_to_gold_data(docs, golds):
"""Get out the training data format used by begin_training, given the
GoldParse objects."""
"""Get out the training data format used by begin_training."""
data = []
for doc, gold in zip(docs, golds):
example = Example(doc=doc)
example.add_doc_annotation(cats=gold.cats)
token_annotation_dict = gold.orig.to_dict()
example.add_token_annotation(**token_annotation_dict)
example.goldparse = gold
example = Example.from_dict(doc, gold)
data.append(example)
return data
@ -313,15 +282,15 @@ def initialize_pipeline(nlp, examples, config):
nlp.parser.add_multitask_objective("sent_start")
nlp.parser.moves.add_action(2, "subtok")
nlp.add_pipe(nlp.create_pipe("tagger"))
for ex in examples:
for tag in ex.gold.tags:
for eg in examples:
for tag in eg.get_aligned("TAG", as_string=True):
if tag is not None:
nlp.tagger.add_label(tag)
# Replace labels that didn't make the frequency cutoff
actions = set(nlp.parser.labels)
label_set = set([act.split("-")[1] for act in actions if "-" in act])
for ex in examples:
gold = ex.gold
for eg in examples:
gold = eg.gold
for i, label in enumerate(gold.labels):
if label is not None and label not in label_set:
gold.labels[i] = label.split("||")[0]
@ -415,13 +384,12 @@ def main(ud_dir, parses_dir, config, corpus, limit=0):
optimizer = initialize_pipeline(nlp, examples, config)
for i in range(config.nr_epoch):
docs = [nlp.make_doc(example.doc.text) for example in examples]
batches = minibatch_by_words(examples, size=config.batch_size)
batches = spacy.minibatch_by_words(examples, size=config.batch_size)
losses = {}
n_train_words = sum(len(doc) for doc in docs)
n_train_words = sum(len(eg.reference.doc) for eg in examples)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches:
pbar.update(sum(len(ex.doc) for ex in batch))
pbar.update(sum(len(eg.reference.doc) for eg in batch))
nlp.update(
examples=batch, sgd=optimizer, drop=config.dropout, losses=losses,
)


@ -30,7 +30,7 @@ ENTITIES = {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)}
model=("Model name, should have pretrained word embeddings", "positional", None, str),
output_dir=("Optional output directory", "option", "o", Path),
)
def main(model=None, output_dir=None):
def main(model, output_dir=None):
"""Load the model and create the KB with pre-defined entity encodings.
If an output_dir is provided, the KB will be stored there in a file 'kb'.
The updated vocab will also be written to a directory in the output_dir."""


@ -24,8 +24,10 @@ import random
import plac
import spacy
import os.path
from spacy.gold.example import Example
from spacy.tokens import Doc
from spacy.gold import read_json_file, GoldParse
from spacy.gold import read_json_file
random.seed(0)
@ -59,27 +61,25 @@ def main(n_iter=10):
print(nlp.pipeline)
print("Create data", len(TRAIN_DATA))
optimizer = nlp.begin_training(get_examples=lambda: TRAIN_DATA)
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for example in TRAIN_DATA:
for token_annotation in example.token_annotations:
doc = Doc(nlp.vocab, words=token_annotation.words)
gold = GoldParse.from_annotation(doc, example.doc_annotation, token_annotation)
nlp.update(
examples=[(doc, gold)], # 1 example
drop=0.2, # dropout - make it harder to memorise data
sgd=optimizer, # callable to update weights
losses=losses,
)
for example_dict in TRAIN_DATA:
doc = Doc(nlp.vocab, words=example_dict["words"])
example = Example.from_dict(doc, example_dict)
nlp.update(
examples=[example], # 1 example
drop=0.2, # dropout - make it harder to memorise data
sgd=optimizer, # callable to update weights
losses=losses,
)
print(losses.get("nn_labeller", 0.0), losses["ner"])
# test the trained model
for example in TRAIN_DATA:
if example.text is not None:
doc = nlp(example.text)
for example_dict in TRAIN_DATA:
if "text" in example_dict:
doc = nlp(example_dict["text"])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])


@ -4,9 +4,10 @@ import random
import warnings
import srsly
import spacy
from spacy.gold import GoldParse
from spacy.gold import Example
from spacy.util import minibatch, compounding
# TODO: further fix & test this script for v3 (read_gold_data is never called)
LABEL = "ANIMAL"
TRAIN_DATA = [
@ -36,15 +37,13 @@ def read_raw_data(nlp, jsonl_loc):
def read_gold_data(nlp, gold_loc):
docs = []
golds = []
examples = []
for json_obj in srsly.read_jsonl(gold_loc):
doc = nlp.make_doc(json_obj["text"])
ents = [(ent["start"], ent["end"], ent["label"]) for ent in json_obj["spans"]]
gold = GoldParse(doc, entities=ents)
docs.append(doc)
golds.append(gold)
return list(zip(docs, golds))
example = Example.from_dict(doc, {"entities": ents})
examples.append(example)
return examples
def main(model_name, unlabelled_loc):


@ -2,7 +2,7 @@
# coding: utf-8
"""Using the parser to recognise your own semantics
spaCy's parser component can be used to trained to predict any type of tree
spaCy's parser component can be trained to predict any type of tree
structure over your input text. You can also predict trees over whole documents
or chat logs, with connections between the sentence-roots used to annotate
discourse structure. In this example, we'll build a message parser for a common


@ -56,7 +56,7 @@ def main(model=None, output_dir=None, n_iter=100):
print("Add label", ent[2])
ner.add_label(ent[2])
with nlp.select_pipes(enable="ner") and warnings.catch_warnings():
with nlp.select_pipes(enable="simple_ner") and warnings.catch_warnings():
# show warnings for misaligned entity spans once
warnings.filterwarnings("once", category=UserWarning, module="spacy")


@ -19,7 +19,7 @@ from ml_datasets import loaders
import spacy
from spacy import util
from spacy.util import minibatch, compounding
from spacy.gold import Example, GoldParse
from spacy.gold import Example
@plac.annotations(
@ -62,11 +62,10 @@ def main(config_path, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=Non
train_examples = []
for text, cats in zip(train_texts, train_cats):
doc = nlp.make_doc(text)
gold = GoldParse(doc, cats=cats)
example = Example.from_dict(doc, {"cats": cats})
for cat in cats:
textcat.add_label(cat)
ex = Example.from_gold(gold, doc=doc)
train_examples.append(ex)
train_examples.append(example)
with nlp.select_pipes(enable="textcat"): # only train textcat
optimizer = nlp.begin_training()


@ -6,7 +6,7 @@ requires = [
"cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc==8.0.0a9",
"thinc==8.0.0a11",
"blis>=0.4.0,<0.5.0"
]
build-backend = "setuptools.build_meta"

View File

@ -1,17 +1,17 @@
# Our libraries
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc==8.0.0a9
thinc==8.0.0a11
blis>=0.4.0,<0.5.0
ml_datasets>=0.1.1
murmurhash>=0.28.0,<1.1.0
wasabi>=0.4.0,<1.1.0
srsly>=2.0.0,<3.0.0
wasabi>=0.7.0,<1.1.0
srsly>=2.1.0,<3.0.0
catalogue>=0.0.7,<1.1.0
typer>=0.3.0,<1.0.0
# Third party dependencies
numpy>=1.15.0
requests>=2.13.0,<3.0.0
plac>=0.9.6,<1.2.0
tqdm>=4.38.0,<5.0.0
pydantic>=1.3.0,<2.0.0
# Official Python utilities


@ -36,22 +36,21 @@ setup_requires =
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0
thinc==8.0.0a9
thinc==8.0.0a11
install_requires =
# Our libraries
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc==8.0.0a9
thinc==8.0.0a11
blis>=0.4.0,<0.5.0
wasabi>=0.4.0,<1.1.0
srsly>=2.0.0,<3.0.0
wasabi>=0.7.0,<1.1.0
srsly>=2.1.0,<3.0.0
catalogue>=0.0.7,<1.1.0
ml_datasets>=0.1.1
typer>=0.3.0,<1.0.0
# Third-party dependencies
tqdm>=4.38.0,<5.0.0
numpy>=1.15.0
plac>=0.9.6,<1.2.0
requests>=2.13.0,<3.0.0
pydantic>=1.3.0,<2.0.0
# Official Python utilities
@ -61,7 +60,7 @@ install_requires =
[options.extras_require]
lookups =
spacy_lookups_data>=0.3.1,<0.4.0
spacy_lookups_data>=0.3.2,<0.4.0
cuda =
cupy>=5.0.0b4,<9.0.0
cuda80 =
@ -80,7 +79,8 @@ cuda102 =
cupy-cuda102>=5.0.0b4,<9.0.0
# Language tokenizers with external dependencies
ja =
fugashi>=0.1.3
sudachipy>=0.4.5
sudachidict_core>=20200330
ko =
natto-py==0.9.0
th =


@ -23,6 +23,8 @@ Options.docstrings = True
PACKAGES = find_packages()
MOD_NAMES = [
"spacy.gold.align",
"spacy.gold.example",
"spacy.parts_of_speech",
"spacy.strings",
"spacy.lexeme",
@ -37,11 +39,10 @@ MOD_NAMES = [
"spacy.tokenizer",
"spacy.syntax.nn_parser",
"spacy.syntax._parser_model",
"spacy.syntax._beam_utils",
"spacy.syntax.nonproj",
"spacy.syntax.transition_system",
"spacy.syntax.arc_eager",
"spacy.gold",
"spacy.gold.gold_io",
"spacy.tokens.doc",
"spacy.tokens.span",
"spacy.tokens.token",
@ -120,7 +121,7 @@ class build_ext_subclass(build_ext, build_ext_options):
def clean(path):
for path in path.glob("**/*"):
if path.is_file() and path.suffix in (".so", ".cpp"):
if path.is_file() and path.suffix in (".so", ".cpp", ".html"):
print(f"Deleting {path.name}")
path.unlink()


@ -8,7 +8,7 @@ warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
from thinc.api import prefer_gpu, require_gpu
from . import pipeline
from .cli.info import info as cli_info
from .cli.info import info
from .glossary import explain
from .about import __version__
from .errors import Errors, Warnings
@ -34,7 +34,3 @@ def load(name, **overrides):
def blank(name, **kwargs):
LangClass = util.get_lang_class(name)
return LangClass(**kwargs)
def info(model=None, markdown=False, silent=False):
return cli_info(model, markdown, silent)


@ -1,31 +1,4 @@
if __name__ == "__main__":
import plac
import sys
from wasabi import msg
from spacy.cli import download, link, info, package, pretrain, convert
from spacy.cli import init_model, profile, evaluate, validate, debug_data
from spacy.cli import train_cli
from spacy.cli import setup_cli
commands = {
"download": download,
"link": link,
"info": info,
"train": train_cli,
"pretrain": pretrain,
"debug-data": debug_data,
"evaluate": evaluate,
"convert": convert,
"package": package,
"init-model": init_model,
"profile": profile,
"validate": validate,
}
if len(sys.argv) == 1:
msg.info("Available commands", ", ".join(commands), exits=1)
command = sys.argv.pop(1)
sys.argv[0] = f"spacy {command}"
if command in commands:
plac.call(commands[command], sys.argv[1:])
else:
available = f"Available: {', '.join(commands)}"
msg.fail(f"Unknown command: {command}", available, exits=1)
setup_cli()


@ -1,7 +1,8 @@
# fmt: off
__title__ = "spacy"
__version__ = "3.0.0.dev9"
__version__ = "3.0.0.dev12"
__release__ = True
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__shortcuts__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json"
__projects__ = "https://github.com/explosion/spacy-boilerplates"


@ -1,19 +1,28 @@
from wasabi import msg
from ._app import app, setup_cli # noqa: F401
# These are the actual functions, NOT the wrapped CLI commands. The CLI commands
# are registered automatically and won't have to be imported here.
from .download import download # noqa: F401
from .info import info # noqa: F401
from .package import package # noqa: F401
from .profile import profile # noqa: F401
from .train_from_config import train_cli # noqa: F401
from .train import train_cli # noqa: F401
from .pretrain import pretrain # noqa: F401
from .debug_data import debug_data # noqa: F401
from .evaluate import evaluate # noqa: F401
from .convert import convert # noqa: F401
from .init_model import init_model # noqa: F401
from .validate import validate # noqa: F401
from .project import project_clone, project_assets, project_run # noqa: F401
from .project import project_run_all # noqa: F401
@app.command("link", no_args_is_help=True, deprecated=True, hidden=True)
def link(*args, **kwargs):
"""As of spaCy v3.0, model symlinks are deprecated. You can load models
using their full names or from a directory path."""
msg.warn(
"As of spaCy v3.0, model symlinks are deprecated. You can load models "
"using their full names or from a directory path."

24
spacy/cli/_app.py Normal file

@ -0,0 +1,24 @@
import typer
from typer.main import get_command
COMMAND = "python -m spacy"
NAME = "spacy"
HELP = """spaCy Command-line Interface
DOCS: https://spacy.io/api/cli
"""
app = typer.Typer(name=NAME, help=HELP)
# Wrappers for Typer's annotations. Initially created to set defaults and to
# keep the names short, but not needed at the moment.
Arg = typer.Argument
Opt = typer.Option
def setup_cli() -> None:
# Ensure that the help messages always display the correct prompt
command = get_command(app)
command(prog_name=COMMAND)
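# Illustration only (not a real spaCy command): a subcommand registers itself
# against this shared app object using the same Arg/Opt wrappers, e.g.:
#
#     @app.command("hello")
#     def hello_cli(
#         name: str = Arg(..., help="Name to greet"),
#         loud: bool = Opt(False, "--loud", "-l", help="Upper-case the output"),
#     ):
#         text = f"Hello, {name}!"
#         print(text.upper() if loud else text)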


@ -1,118 +1,157 @@
from typing import Optional
from enum import Enum
from pathlib import Path
from wasabi import Printer
import srsly
import re
import sys
from .converters import conllu2json, iob2json, conll_ner2json
from .converters import ner_jsonl2json
from ._app import app, Arg, Opt
from ..gold import docs_to_json
from ..tokens import DocBin
from ..gold.converters import iob2docs, conll_ner2docs, json2docs
# Converters are matched by file extension except for ner/iob, which are
# matched by file extension and content. To add a converter, add a new
# entry to this dict with the file extension mapped to the converter function
# imported from /converters.
CONVERTERS = {
"conllubio": conllu2json,
"conllu": conllu2json,
"conll": conllu2json,
"ner": conll_ner2json,
"iob": iob2json,
"jsonl": ner_jsonl2json,
# "conllubio": conllu2docs, TODO
# "conllu": conllu2docs, TODO
# "conll": conllu2docs, TODO
"ner": conll_ner2docs,
"iob": iob2docs,
"json": json2docs,
}
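# For illustration: a converter is just another callable that takes the raw
# input text plus keyword options and returns the converted docs, so a
# hypothetical "foo" format would be wired up as CONVERTERS["foo"] = foo2docs.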
# File types
FILE_TYPES = ("json", "jsonl", "msg")
FILE_TYPES_STDOUT = ("json", "jsonl")
# File types that can be written to stdout
FILE_TYPES_STDOUT = ("json")
def convert(
class FileTypes(str, Enum):
json = "json"
spacy = "spacy"
@app.command("convert")
def convert_cli(
# fmt: off
input_file: ("Input file", "positional", None, str),
output_dir: ("Output directory. '-' for stdout.", "positional", None, str) = "-",
file_type: (f"Type of data to produce: {FILE_TYPES}", "option", "t", str, FILE_TYPES) = "json",
n_sents: ("Number of sentences per doc (0 to disable)", "option", "n", int) = 1,
seg_sents: ("Segment sentences (for -c ner)", "flag", "s") = False,
model: ("Model for sentence segmentation (for -s)", "option", "b", str) = None,
morphology: ("Enable appending morphology to tags", "flag", "m", bool) = False,
merge_subtokens: ("Merge CoNLL-U subtokens", "flag", "T", bool) = False,
converter: (f"Converter: {tuple(CONVERTERS.keys())}", "option", "c", str) = "auto",
ner_map_path: ("NER tag mapping (as JSON-encoded dict of entity types)", "option", "N", Path) = None,
lang: ("Language (if tokenizer required)", "option", "l", str) = None,
input_path: str = Arg(..., help="Input file or directory", exists=True),
output_dir: Path = Arg("-", help="Output directory. '-' for stdout.", allow_dash=True, exists=True),
file_type: FileTypes = Opt("spacy", "--file-type", "-t", help="Type of data to produce"),
n_sents: int = Opt(1, "--n-sents", "-n", help="Number of sentences per doc (0 to disable)"),
seg_sents: bool = Opt(False, "--seg-sents", "-s", help="Segment sentences (for -c ner)"),
model: Optional[str] = Opt(None, "--model", "-b", help="Model for sentence segmentation (for -s)"),
morphology: bool = Opt(False, "--morphology", "-m", help="Enable appending morphology to tags"),
merge_subtokens: bool = Opt(False, "--merge-subtokens", "-T", help="Merge CoNLL-U subtokens"),
converter: str = Opt("auto", "--converter", "-c", help=f"Converter: {tuple(CONVERTERS.keys())}"),
ner_map: Optional[Path] = Opt(None, "--ner-map", "-N", help="NER tag mapping (as JSON-encoded dict of entity types)", exists=True),
lang: Optional[str] = Opt(None, "--lang", "-l", help="Language (if tokenizer required)"),
# fmt: on
):
"""
Convert files into JSON format for use with train command and other
Convert files into json or DocBin format for use with train command and other
experiment management functions. If no output_dir is specified, the data
is written to stdout, so you can pipe it forward to a JSON file:
$ spacy convert some_file.conllu > some_file.json
"""
no_print = output_dir == "-"
msg = Printer(no_print=no_print)
input_path = Path(input_file)
if file_type not in FILE_TYPES_STDOUT and output_dir == "-":
# TODO: support msgpack via stdout in srsly?
msg.fail(
f"Can't write .{file_type} data to stdout",
"Please specify an output directory.",
exits=1,
)
if not input_path.exists():
msg.fail("Input file not found", input_path, exits=1)
if output_dir != "-" and not Path(output_dir).exists():
msg.fail("Output directory not found", output_dir, exits=1)
input_data = input_path.open("r", encoding="utf-8").read()
if converter == "auto":
converter = input_path.suffix[1:]
if converter == "ner" or converter == "iob":
converter_autodetect = autodetect_ner_format(input_data)
if converter_autodetect == "ner":
msg.info("Auto-detected token-per-line NER format")
converter = converter_autodetect
elif converter_autodetect == "iob":
msg.info("Auto-detected sentence-per-line NER format")
converter = converter_autodetect
else:
msg.warn(
"Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert"
)
if converter not in CONVERTERS:
msg.fail(f"Can't find converter for {converter}", exits=1)
ner_map = None
if ner_map_path is not None:
ner_map = srsly.read_json(ner_map_path)
# Use converter function to convert data
func = CONVERTERS[converter]
data = func(
input_data,
if isinstance(file_type, FileTypes):
# We get an instance of the FileTypes from the CLI so we need its string value
file_type = file_type.value
input_path = Path(input_path)
output_dir = "-" if output_dir == Path("-") else output_dir
cli_args = locals()
silent = output_dir == "-"
msg = Printer(no_print=silent)
verify_cli_args(msg, **cli_args)
converter = _get_converter(msg, converter, input_path)
convert(
input_path,
output_dir,
file_type=file_type,
n_sents=n_sents,
seg_sents=seg_sents,
append_morphology=morphology,
merge_subtokens=merge_subtokens,
lang=lang,
model=model,
no_print=no_print,
morphology=morphology,
merge_subtokens=merge_subtokens,
converter=converter,
ner_map=ner_map,
lang=lang,
silent=silent,
msg=msg,
)
if output_dir != "-":
# Export data to a file
suffix = f".{file_type}"
output_file = Path(output_dir) / Path(input_path.parts[-1]).with_suffix(suffix)
if file_type == "json":
srsly.write_json(output_file, data)
elif file_type == "jsonl":
srsly.write_jsonl(output_file, data)
elif file_type == "msg":
srsly.write_msgpack(output_file, data)
msg.good(f"Generated output file ({len(data)} documents): {output_file}")
def convert(
input_path: Path,
output_dir: Path,
*,
file_type: str = "json",
n_sents: int = 1,
seg_sents: bool = False,
model: Optional[str] = None,
morphology: bool = False,
merge_subtokens: bool = False,
converter: str = "auto",
ner_map: Optional[Path] = None,
lang: Optional[str] = None,
silent: bool = True,
msg: Optional[Printer] = None,
) -> None:
if not msg:
msg = Printer(no_print=silent)
ner_map = srsly.read_json(ner_map) if ner_map is not None else None
for input_loc in walk_directory(input_path):
input_data = input_loc.open("r", encoding="utf-8").read()
# Use converter function to convert data
func = CONVERTERS[converter]
docs = func(
input_data,
n_sents=n_sents,
seg_sents=seg_sents,
append_morphology=morphology,
merge_subtokens=merge_subtokens,
lang=lang,
model=model,
no_print=silent,
ner_map=ner_map,
)
if output_dir == "-":
_print_docs_to_stdout(docs, file_type)
else:
if input_loc != input_path:
subpath = input_loc.relative_to(input_path)
output_file = Path(output_dir) / subpath.with_suffix(f".{file_type}")
else:
output_file = Path(output_dir) / input_loc.parts[-1]
output_file = output_file.with_suffix(f".{file_type}")
_write_docs_to_file(docs, output_file, file_type)
msg.good(f"Generated output file ({len(docs)} documents): {output_file}")
def _print_docs_to_stdout(docs, output_type):
if output_type == "json":
srsly.write_json("-", docs_to_json(docs))
else:
# Print to stdout
if file_type == "json":
srsly.write_json("-", data)
elif file_type == "jsonl":
srsly.write_jsonl("-", data)
sys.stdout.buffer.write(DocBin(docs=docs).to_bytes())
def autodetect_ner_format(input_data):
def _write_docs_to_file(docs, output_file, output_type):
if not output_file.parent.exists():
output_file.parent.mkdir(parents=True)
if output_type == "json":
srsly.write_json(output_file, docs_to_json(docs))
else:
data = DocBin(docs=docs).to_bytes()
with output_file.open("wb") as file_:
file_.write(data)
def autodetect_ner_format(input_data: str) -> str:
# guess format from the first 20 lines
lines = input_data.split("\n")[:20]
format_guesses = {"ner": 0, "iob": 0}
@ -129,3 +168,86 @@ def autodetect_ner_format(input_data):
if format_guesses["ner"] == 0 and format_guesses["iob"] > 0:
return "iob"
return None
def walk_directory(path):
if not path.is_dir():
return [path]
paths = [path]
locs = []
seen = set()
for path in paths:
if str(path) in seen:
continue
seen.add(str(path))
if path.parts[-1].startswith("."):
continue
elif path.is_dir():
paths.extend(path.iterdir())
else:
locs.append(path)
return locs
def verify_cli_args(
msg,
input_path,
output_dir,
file_type,
n_sents,
seg_sents,
model,
morphology,
merge_subtokens,
converter,
ner_map,
lang,
):
input_path = Path(input_path)
if file_type not in FILE_TYPES_STDOUT and output_dir == "-":
# TODO: support msgpack via stdout in srsly?
msg.fail(
f"Can't write .{file_type} data to stdout",
"Please specify an output directory.",
exits=1,
)
if not input_path.exists():
msg.fail("Input file not found", input_path, exits=1)
if output_dir != "-" and not Path(output_dir).exists():
msg.fail("Output directory not found", output_dir, exits=1)
if input_path.is_dir():
input_locs = walk_directory(input_path)
if len(input_locs) == 0:
msg.fail("No input files in directory", input_path, exits=1)
file_types = list(set([loc.suffix[1:] for loc in input_locs]))
if len(file_types) >= 2:
file_types = ",".join(file_types)
msg.fail("All input files must be same type", file_types, exits=1)
converter = _get_converter(msg, converter, input_path)
if converter not in CONVERTERS:
msg.fail(f"Can't find converter for {converter}", exits=1)
return converter
def _get_converter(msg, converter, input_path):
if input_path.is_dir():
input_path = walk_directory(input_path)[0]
if converter == "auto":
converter = input_path.suffix[1:]
if converter == "ner" or converter == "iob":
with input_path.open() as file_:
input_data = file_.read()
converter_autodetect = autodetect_ner_format(input_data)
if converter_autodetect == "ner":
msg.info("Auto-detected token-per-line NER format")
converter = converter_autodetect
elif converter_autodetect == "iob":
msg.info("Auto-detected sentence-per-line NER format")
converter = converter_autodetect
else:
msg.warn(
"Can't automatically detect NER format. "
"Conversion may not succeed. "
"See https://spacy.io/api/cli#convert"
)
return converter


@ -1,4 +0,0 @@
from .conllu2json import conllu2json # noqa: F401
from .iob2json import iob2json # noqa: F401
from .conll_ner2json import conll_ner2json # noqa: F401
from .jsonl2json import ner_jsonl2json # noqa: F401


@ -1,65 +0,0 @@
from wasabi import Printer
from ...gold import iob_to_biluo
from ...util import minibatch
from .conll_ner2json import n_sents_info
def iob2json(input_data, n_sents=10, no_print=False, *args, **kwargs):
"""
Convert IOB files with one sentence per line and tags separated with '|'
into JSON format for use with train cli. IOB and IOB2 are accepted.
Sample formats:
I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
"""
msg = Printer(no_print=no_print)
docs = read_iob(input_data.split("\n"))
if n_sents > 0:
n_sents_info(msg, n_sents)
docs = merge_sentences(docs, n_sents)
return docs
def read_iob(raw_sents):
sentences = []
for line in raw_sents:
if not line.strip():
continue
tokens = [t.split("|") for t in line.split()]
if len(tokens[0]) == 3:
words, pos, iob = zip(*tokens)
elif len(tokens[0]) == 2:
words, iob = zip(*tokens)
pos = ["-"] * len(words)
else:
raise ValueError(
"The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert"
)
biluo = iob_to_biluo(iob)
sentences.append(
[
{"orth": w, "tag": p, "ner": ent}
for (w, p, ent) in zip(words, pos, biluo)
]
)
sentences = [{"tokens": sent} for sent in sentences]
paragraphs = [{"sentences": [sent]} for sent in sentences]
docs = [{"id": i, "paragraphs": [para]} for i, para in enumerate(paragraphs)]
return docs
def merge_sentences(docs, n_sents):
merged = []
for group in minibatch(docs, size=n_sents):
group = list(group)
first = group.pop(0)
to_extend = first["paragraphs"][0]["sentences"]
for sent in group:
to_extend.extend(sent["paragraphs"][0]["sentences"])
merged.append(first)
return merged


@ -1,50 +0,0 @@
import srsly
from ...gold import docs_to_json
from ...util import get_lang_class, minibatch
def ner_jsonl2json(input_data, lang=None, n_sents=10, use_morphology=False, **_):
if lang is None:
raise ValueError("No --lang specified, but tokenization required")
json_docs = []
input_examples = [srsly.json_loads(line) for line in input_data.strip().split("\n")]
nlp = get_lang_class(lang)()
sentencizer = nlp.create_pipe("sentencizer")
for i, batch in enumerate(minibatch(input_examples, size=n_sents)):
docs = []
for record in batch:
raw_text = record["text"]
if "entities" in record:
ents = record["entities"]
else:
ents = record["spans"]
ents = [(e["start"], e["end"], e["label"]) for e in ents]
doc = nlp.make_doc(raw_text)
sentencizer(doc)
spans = [doc.char_span(s, e, label=L) for s, e, L in ents]
doc.ents = _cleanup_spans(spans)
docs.append(doc)
json_docs.append(docs_to_json(docs, id=i))
return json_docs
def _cleanup_spans(spans):
output = []
seen = set()
for span in spans:
if span is not None:
# Trim whitespace
while len(span) and span[0].is_space:
span = span[1:]
while len(span) and span[-1].is_space:
span = span[:-1]
if not len(span):
continue
for i in range(span.start, span.end):
if i in seen:
break
else:
output.append(span)
seen.update(range(span.start, span.end))
return output


@ -1,11 +1,14 @@
from typing import Optional, List, Sequence, Dict, Any, Tuple
from pathlib import Path
from collections import Counter
import sys
import srsly
from wasabi import Printer, MESSAGES
from ..gold import GoldCorpus
from ._app import app, Arg, Opt
from ..gold import Corpus, Example
from ..syntax import nonproj
from ..language import Language
from ..util import load_model, get_lang_class
@ -18,17 +21,18 @@ BLANK_MODEL_MIN_THRESHOLD = 100
BLANK_MODEL_THRESHOLD = 2000
def debug_data(
@app.command("debug-data")
def debug_data_cli(
# fmt: off
lang: ("Model language", "positional", None, str),
train_path: ("Location of JSON-formatted training data", "positional", None, Path),
dev_path: ("Location of JSON-formatted development data", "positional", None, Path),
tag_map_path: ("Location of JSON-formatted tag map", "option", "tm", Path) = None,
base_model: ("Name of model to update (optional)", "option", "b", str) = None,
pipeline: ("Comma-separated names of pipeline components to train", "option", "p", str) = "tagger,parser,ner",
ignore_warnings: ("Ignore warnings, only show stats and errors", "flag", "IW", bool) = False,
verbose: ("Print additional information and explanations", "flag", "V", bool) = False,
no_format: ("Don't pretty-print the results", "flag", "NF", bool) = False,
lang: str = Arg(..., help="Model language"),
train_path: Path = Arg(..., help="Location of JSON-formatted training data", exists=True),
dev_path: Path = Arg(..., help="Location of JSON-formatted development data", exists=True),
tag_map_path: Optional[Path] = Opt(None, "--tag-map-path", "-tm", help="Location of JSON-formatted tag map", exists=True, dir_okay=False),
base_model: Optional[str] = Opt(None, "--base-model", "-b", help="Name of model to update (optional)"),
pipeline: str = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of pipeline components to train"),
ignore_warnings: bool = Opt(False, "--ignore-warnings", "-IW", help="Ignore warnings, only show stats and errors"),
verbose: bool = Opt(False, "--verbose", "-V", help="Print additional information and explanations"),
no_format: bool = Opt(False, "--no-format", "-NF", help="Don't pretty-print the results"),
# fmt: on
):
"""
@ -36,8 +40,36 @@ def debug_data(
stats, and find problems like invalid entity annotations, cyclic
dependencies, low data labels and more.
"""
msg = Printer(pretty=not no_format, ignore_warnings=ignore_warnings)
debug_data(
lang,
train_path,
dev_path,
tag_map_path=tag_map_path,
base_model=base_model,
pipeline=[p.strip() for p in pipeline.split(",")],
ignore_warnings=ignore_warnings,
verbose=verbose,
no_format=no_format,
silent=False,
)
def debug_data(
lang: str,
train_path: Path,
dev_path: Path,
*,
tag_map_path: Optional[Path] = None,
base_model: Optional[str] = None,
pipeline: List[str] = ["tagger", "parser", "ner"],
ignore_warnings: bool = False,
verbose: bool = False,
no_format: bool = True,
silent: bool = True,
):
msg = Printer(
no_print=silent, pretty=not no_format, ignore_warnings=ignore_warnings
)
# Make sure all files and paths exists if they are needed
if not train_path.exists():
msg.fail("Training data not found", train_path, exits=1)
@ -49,7 +81,6 @@ def debug_data(
tag_map = srsly.read_json(tag_map_path)
# Initialize the model and pipeline
pipeline = [p.strip() for p in pipeline.split(",")]
if base_model:
nlp = load_model(base_model)
else:
@ -68,12 +99,9 @@ def debug_data(
loading_train_error_message = ""
loading_dev_error_message = ""
with msg.loading("Loading corpus..."):
corpus = GoldCorpus(train_path, dev_path)
corpus = Corpus(train_path, dev_path)
try:
train_dataset = list(corpus.train_dataset(nlp))
train_dataset_unpreprocessed = list(
corpus.train_dataset_without_preprocessing(nlp)
)
except ValueError as e:
loading_train_error_message = f"Training data cannot be loaded: {e}"
try:
@ -89,11 +117,9 @@ def debug_data(
msg.good("Corpus is loadable")
# Create all gold data here to avoid iterating over the train_dataset constantly
gold_train_data = _compile_gold(train_dataset, pipeline, nlp)
gold_train_unpreprocessed_data = _compile_gold(
train_dataset_unpreprocessed, pipeline
)
gold_dev_data = _compile_gold(dev_dataset, pipeline, nlp)
gold_train_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=True)
gold_train_unpreprocessed_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=False)
gold_dev_data = _compile_gold(dev_dataset, pipeline, nlp, make_proj=True)
train_texts = gold_train_data["texts"]
dev_texts = gold_dev_data["texts"]
@ -446,7 +472,7 @@ def debug_data(
sys.exit(1)
def _load_file(file_path, msg):
def _load_file(file_path: Path, msg: Printer) -> None:
file_name = file_path.parts[-1]
if file_path.suffix == ".json":
with msg.loading(f"Loading {file_name}..."):
@ -465,7 +491,9 @@ def _load_file(file_path, msg):
)
def _compile_gold(examples, pipeline, nlp):
def _compile_gold(
examples: Sequence[Example], pipeline: List[str], nlp: Language, make_proj: bool
) -> Dict[str, Any]:
data = {
"ner": Counter(),
"cats": Counter(),
@ -484,20 +512,20 @@ def _compile_gold(examples, pipeline, nlp):
"n_cats_multilabel": 0,
"texts": set(),
}
for example in examples:
gold = example.gold
doc = example.doc
valid_words = [x for x in gold.words if x is not None]
for eg in examples:
gold = eg.reference
doc = eg.predicted
valid_words = [x for x in gold if x is not None]
data["words"].update(valid_words)
data["n_words"] += len(valid_words)
data["n_misaligned_words"] += len(gold.words) - len(valid_words)
data["n_misaligned_words"] += len(gold) - len(valid_words)
data["texts"].add(doc.text)
if len(nlp.vocab.vectors):
for word in valid_words:
if nlp.vocab.strings[word] not in nlp.vocab.vectors:
data["words_missing_vectors"].update([word])
if "ner" in pipeline:
for i, label in enumerate(gold.ner):
for i, label in enumerate(eg.get_aligned_ner()):
if label is None:
continue
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
@ -523,32 +551,34 @@ def _compile_gold(examples, pipeline, nlp):
if list(gold.cats.values()).count(1.0) != 1:
data["n_cats_multilabel"] += 1
if "tagger" in pipeline:
data["tags"].update([x for x in gold.tags if x is not None])
tags = eg.get_aligned("TAG", as_string=True)
data["tags"].update([x for x in tags if x is not None])
if "parser" in pipeline:
data["deps"].update([x for x in gold.labels if x is not None])
for i, (dep, head) in enumerate(zip(gold.labels, gold.heads)):
aligned_heads, aligned_deps = eg.get_aligned_parse(projectivize=make_proj)
data["deps"].update([x for x in aligned_deps if x is not None])
for i, (dep, head) in enumerate(zip(aligned_deps, aligned_heads)):
if head == i:
data["roots"].update([dep])
data["n_sents"] += 1
if nonproj.is_nonproj_tree(gold.heads):
if nonproj.is_nonproj_tree(aligned_heads):
data["n_nonproj"] += 1
if nonproj.contains_cycle(gold.heads):
if nonproj.contains_cycle(aligned_heads):
data["n_cycles"] += 1
return data
def _format_labels(labels, counts=False):
def _format_labels(labels: List[Tuple[str, int]], counts: bool = False) -> str:
if counts:
return ", ".join([f"'{l}' ({c})" for l, c in labels])
return ", ".join([f"'{l}'" for l in labels])
def _get_examples_without_label(data, label):
def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
count = 0
for ex in data:
for eg in data:
labels = [
label.split("-")[1]
for label in ex.gold.ner
for label in eg.get_aligned_ner()
if label not in ("O", "-", None)
]
if label not in labels:
@ -556,7 +586,7 @@ def _get_examples_without_label(data, label):
return count
def _get_labels_from_model(nlp, pipe_name):
def _get_labels_from_model(nlp: Language, pipe_name: str) -> Sequence[str]:
if pipe_name not in nlp.pipe_names:
return set()
pipe = nlp.get_pipe(pipe_name)


@ -1,23 +1,36 @@
from typing import Optional, Sequence, Union
import requests
import os
import subprocess
import sys
from wasabi import msg
import typer
from ._app import app, Arg, Opt
from .. import about
from ..util import is_package, get_base_version
from ..util import is_package, get_base_version, run_command
def download(
model: ("Model to download (shortcut or name)", "positional", None, str),
direct: ("Force direct download of name + version", "flag", "d", bool) = False,
*pip_args: ("Additional arguments to be passed to `pip install` on model install"),
@app.command(
"download",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def download_cli(
# fmt: off
ctx: typer.Context,
model: str = Arg(..., help="Model to download (shortcut or name)"),
direct: bool = Opt(False, "--direct", "-d", "-D", help="Force direct download of name + version"),
# fmt: on
):
"""
Download a compatible model from the default download path using pip. If the --direct
flag is set, the command expects the full model name with version.
For direct downloads, the compatibility check will be skipped.
For direct downloads, the compatibility check will be skipped. All
additional arguments provided to this command will be passed to `pip install`
on model installation.
"""
download(model, direct, *ctx.args)
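# The plain function is also importable; e.g. (model name is just an example):
#
#     from spacy.cli import download
#     download("en_core_web_sm")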
def download(model: str, direct: bool = False, *pip_args) -> None:
if not is_package("spacy") and "--no-deps" not in pip_args:
msg.warn(
"Skipping model package dependencies and setting `--no-deps`. "
@ -33,22 +46,20 @@ def download(
components = model.split("-")
model_name = "".join(components[:-1])
version = components[-1]
dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
download_model(dl_tpl.format(m=model_name, v=version), pip_args)
else:
shortcuts = get_json(about.__shortcuts__, "available shortcuts")
model_name = shortcuts.get(model, model)
compatibility = get_compatibility()
version = get_version(model_name, compatibility)
dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
if dl != 0: # if download subprocess doesn't return 0, exit
sys.exit(dl)
msg.good(
"Download and installation successful",
f"You can now load the model via spacy.load('{model_name}')",
)
download_model(dl_tpl.format(m=model_name, v=version), pip_args)
msg.good(
"Download and installation successful",
f"You can now load the model via spacy.load('{model_name}')",
)
def get_json(url, desc):
def get_json(url: str, desc: str) -> Union[dict, list]:
r = requests.get(url)
if r.status_code != 200:
msg.fail(
@ -62,7 +73,7 @@ def get_json(url, desc):
return r.json()
def get_compatibility():
def get_compatibility() -> dict:
version = get_base_version(about.__version__)
comp_table = get_json(about.__compatibility__, "compatibility table")
comp = comp_table["spacy"]
@ -71,7 +82,7 @@ def get_compatibility():
return comp[version]
def get_version(model, comp):
def get_version(model: str, comp: dict) -> str:
model = get_base_version(model)
if model not in comp:
msg.fail(
@ -81,10 +92,12 @@ def get_version(model, comp):
return comp[model][0]
def download_model(filename, user_pip_args=None):
def download_model(
filename: str, user_pip_args: Optional[Sequence[str]] = None
) -> None:
download_url = about.__download_url__ + "/" + filename
pip_args = ["--no-cache-dir"]
if user_pip_args:
pip_args.extend(user_pip_args)
cmd = [sys.executable, "-m", "pip", "install"] + pip_args + [download_url]
return subprocess.call(cmd, env=os.environ.copy())
run_command(cmd)


@ -1,46 +1,75 @@
from typing import Optional, List, Dict
from timeit import default_timer as timer
from wasabi import msg
from wasabi import Printer
from pathlib import Path
import re
import srsly
from ..gold import GoldCorpus
from ..gold import Corpus
from ..tokens import Doc
from ._app import app, Arg, Opt
from ..scorer import Scorer
from .. import util
from .. import displacy
def evaluate(
@app.command("evaluate")
def evaluate_cli(
# fmt: off
model: ("Model name or path", "positional", None, str),
data_path: ("Location of JSON-formatted evaluation data", "positional", None, str),
gpu_id: ("Use GPU", "option", "g", int) = -1,
gold_preproc: ("Use gold preprocessing", "flag", "G", bool) = False,
displacy_path: ("Directory to output rendered parses as HTML", "option", "dp", str) = None,
displacy_limit: ("Limit of parses to render as HTML", "option", "dl", int) = 25,
return_scores: ("Return dict containing model scores", "flag", "R", bool) = False,
model: str = Arg(..., help="Model name or path"),
data_path: Path = Arg(..., help="Location of JSON-formatted evaluation data", exists=True),
output: Optional[Path] = Opt(None, "--output", "-o", help="Output JSON file for metrics", dir_okay=False),
gpu_id: int = Opt(-1, "--gpu-id", "-g", help="Use GPU"),
gold_preproc: bool = Opt(False, "--gold-preproc", "-G", help="Use gold preprocessing"),
displacy_path: Optional[Path] = Opt(None, "--displacy-path", "-dp", help="Directory to output rendered parses as HTML", exists=True, file_okay=False),
displacy_limit: int = Opt(25, "--displacy-limit", "-dl", help="Limit of parses to render as HTML"),
# fmt: on
):
"""
Evaluate a model. To render a sample of parses in an HTML file, set an
output directory as the displacy_path argument.
"""
evaluate(
model,
data_path,
output=output,
gpu_id=gpu_id,
gold_preproc=gold_preproc,
displacy_path=displacy_path,
displacy_limit=displacy_limit,
silent=False,
)
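# The underlying function returns the metrics as a dict and can be used
# directly from Python (model name and paths below are illustrative):
#
#     scores = evaluate("my_model", Path("dev.json"), output=Path("metrics.json"))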
def evaluate(
model: str,
data_path: Path,
output: Optional[Path],
gpu_id: int = -1,
gold_preproc: bool = False,
displacy_path: Optional[Path] = None,
displacy_limit: int = 25,
silent: bool = True,
) -> Scorer:
msg = Printer(no_print=silent, pretty=not silent)
util.fix_random_seed()
if gpu_id >= 0:
util.use_gpu(gpu_id)
util.set_env_log(False)
data_path = util.ensure_path(data_path)
output_path = util.ensure_path(output)
displacy_path = util.ensure_path(displacy_path)
if not data_path.exists():
msg.fail("Evaluation data not found", data_path, exits=1)
if displacy_path and not displacy_path.exists():
msg.fail("Visualization output directory not found", displacy_path, exits=1)
corpus = GoldCorpus(data_path, data_path)
if model.startswith("blank:"):
nlp = util.get_lang_class(model.replace("blank:", ""))()
else:
nlp = util.load_model(model)
corpus = Corpus(data_path, data_path)
nlp = util.load_model(model)
dev_dataset = list(corpus.dev_dataset(nlp, gold_preproc=gold_preproc))
begin = timer()
scorer = nlp.evaluate(dev_dataset, verbose=False)
end = timer()
nwords = sum(len(ex.doc) for ex in dev_dataset)
nwords = sum(len(ex.predicted) for ex in dev_dataset)
results = {
"Time": f"{end - begin:.2f} s",
"Words": nwords,
@ -60,10 +89,22 @@ def evaluate(
"Sent R": f"{scorer.sent_r:.2f}",
"Sent F": f"{scorer.sent_f:.2f}",
}
data = {re.sub(r"[\s/]", "_", k.lower()): v for k, v in results.items()}
msg.table(results, title="Results")
if scorer.ents_per_type:
data["ents_per_type"] = scorer.ents_per_type
print_ents_per_type(msg, scorer.ents_per_type)
if scorer.textcats_f_per_cat:
data["textcats_f_per_cat"] = scorer.textcats_f_per_cat
print_textcats_f_per_cat(msg, scorer.textcats_f_per_cat)
if scorer.textcats_auc_per_cat:
data["textcats_auc_per_cat"] = scorer.textcats_auc_per_cat
print_textcats_auc_per_cat(msg, scorer.textcats_auc_per_cat)
if displacy_path:
docs = [ex.doc for ex in dev_dataset]
docs = [ex.predicted for ex in dev_dataset]
render_deps = "parser" in nlp.meta.get("pipeline", [])
render_ents = "ner" in nlp.meta.get("pipeline", [])
render_parses(
@ -75,11 +116,21 @@ def evaluate(
ents=render_ents,
)
msg.good(f"Generated {displacy_limit} parses as HTML", displacy_path)
if return_scores:
return scorer.scores
if output_path is not None:
srsly.write_json(output_path, data)
msg.good(f"Saved results to {output_path}")
return data
def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=True):
def render_parses(
docs: List[Doc],
output_path: Path,
model_name: str = "",
limit: int = 250,
deps: bool = True,
ents: bool = True,
):
docs[0].user_data["title"] = model_name
if ents:
html = displacy.render(docs[:limit], style="ent", page=True)
@ -91,3 +142,40 @@ def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=T
)
with (output_path / "parses.html").open("w", encoding="utf8") as file_:
file_.write(html)
def print_ents_per_type(msg: Printer, scores: Dict[str, Dict[str, float]]) -> None:
data = [
(k, f"{v['p']:.2f}", f"{v['r']:.2f}", f"{v['f']:.2f}")
for k, v in scores.items()
]
msg.table(
data,
header=("", "P", "R", "F"),
aligns=("l", "r", "r", "r"),
title="NER (per type)",
)
def print_textcats_f_per_cat(msg: Printer, scores: Dict[str, Dict[str, float]]) -> None:
data = [
(k, f"{v['p']:.2f}", f"{v['r']:.2f}", f"{v['f']:.2f}")
for k, v in scores.items()
]
msg.table(
data,
header=("", "P", "R", "F"),
aligns=("l", "r", "r", "r"),
title="Textcat F (per type)",
)
def print_textcats_auc_per_cat(
msg: Printer, scores: Dict[str, Dict[str, float]]
) -> None:
msg.table(
[(k, f"{v['roc_auc_score']:.2f}") for k, v in scores.items()],
header=("", "ROC AUC"),
aligns=("l", "r"),
title="Textcat ROC AUC (per label)",
)


@ -1,77 +1,109 @@
from typing import Optional, Dict, Any, Union
import platform
from pathlib import Path
from wasabi import msg
from wasabi import Printer
import srsly
from .validate import get_model_pkgs
from ._app import app, Arg, Opt
from .. import util
from .. import about
def info(
model: ("Optional model name", "positional", None, str) = None,
markdown: ("Generate Markdown for GitHub issues", "flag", "md", str) = False,
silent: ("Don't print anything (just return)", "flag", "s") = False,
@app.command("info")
def info_cli(
# fmt: off
model: Optional[str] = Arg(None, help="Optional model name"),
markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"),
silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"),
# fmt: on
):
"""
Print info about the spaCy installation. If a model is specified as an argument,
print model information. The --markdown flag prints details in Markdown for easy
copy-pasting to GitHub issues.
"""
info(model, markdown=markdown, silent=silent)
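# Called directly, the function returns the raw data instead of printing it;
# e.g. (model name is just an example):
#
#     from spacy.cli import info
#     data = info(silent=True)
#     model_meta = info("en_core_web_sm", silent=True)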
def info(
model: Optional[str] = None, *, markdown: bool = False, silent: bool = True
) -> Union[str, dict]:
msg = Printer(no_print=silent, pretty=not silent)
if model:
if util.is_package(model):
model_path = util.get_package_path(model)
else:
model_path = model
meta_path = model_path / "meta.json"
if not meta_path.is_file():
msg.fail("Can't find model meta.json", meta_path, exits=1)
meta = srsly.read_json(meta_path)
if model_path.resolve() != model_path:
meta["link"] = str(model_path)
meta["source"] = str(model_path.resolve())
else:
meta["source"] = str(model_path)
title = f"Info about model '{model}'"
data = info_model(model, silent=silent)
else:
title = "Info about spaCy"
data = info_spacy()
raw_data = {k.lower().replace(" ", "_"): v for k, v in data.items()}
if "Models" in data and isinstance(data["Models"], dict):
data["Models"] = ", ".join(f"{n} ({v})" for n, v in data["Models"].items())
markdown_data = get_markdown(data, title=title)
if markdown:
if not silent:
title = f"Info about model '{model}'"
model_meta = {
k: v for k, v in meta.items() if k not in ("accuracy", "speed")
}
if markdown:
print_markdown(model_meta, title=title)
else:
msg.table(model_meta, title=title)
return meta
all_models, _ = get_model_pkgs()
data = {
print(markdown_data)
return markdown_data
if not silent:
table_data = dict(data)
msg.table(table_data, title=title)
return raw_data
def info_spacy() -> Dict[str, Any]:
"""Generate info about the current spaCy installation.
RETURNS (dict): The spaCy info.
"""
all_models = {}
for pkg_name in util.get_installed_models():
package = pkg_name.replace("-", "_")
all_models[package] = util.get_package_version(pkg_name)
return {
"spaCy version": about.__version__,
"Location": str(Path(__file__).parent.parent),
"Platform": platform.platform(),
"Python version": platform.python_version(),
"Models": ", ".join(
f"{m['name']} ({m['version']})" for m in all_models.values()
),
"Models": all_models,
}
if not silent:
title = "Info about spaCy"
if markdown:
print_markdown(data, title=title)
else:
msg.table(data, title=title)
return data
def print_markdown(data, title=None):
"""Print data in GitHub-flavoured Markdown format for issues etc.
def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
"""Generate info about a specific model.
model (str): Model name or path.
silent (bool): Don't print anything, just return.
RETURNS (dict): The model meta.
"""
msg = Printer(no_print=silent, pretty=not silent)
if util.is_package(model):
model_path = util.get_package_path(model)
else:
model_path = model
meta_path = model_path / "meta.json"
if not meta_path.is_file():
msg.fail("Can't find model meta.json", meta_path, exits=1)
meta = srsly.read_json(meta_path)
if model_path.resolve() != model_path:
meta["link"] = str(model_path)
meta["source"] = str(model_path.resolve())
else:
meta["source"] = str(model_path)
return {k: v for k, v in meta.items() if k not in ("accuracy", "speed")}
def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
"""Get data in GitHub-flavoured Markdown format for issues etc.
data (dict or list of tuples): Label/value pairs.
title (str / None): Title, will be rendered as headline 2.
RETURNS (str): The Markdown string.
"""
markdown = []
for key, value in data.items():
if isinstance(value, str) and Path(value).exists():
continue
markdown.append(f"* **{key}:** {value}")
result = "\n{}\n".format("\n".join(markdown))
if title:
print(f"\n## {title}")
print("\n{}\n".format("\n".join(markdown)))
result = f"\n## {title}\n{result}"
return result
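Because the CLI entry point is now separate from info(), the same data can be retrieved programmatically; a minimal sketch, assuming the spacy.cli re-export that spaCy provides:

from spacy.cli import info

# With the defaults (markdown=False, silent=True) the call returns the raw data
# dict with normalized keys such as "spacy_version" and "python_version".
data = info()
print(data.get("spacy_version"))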

View File

@ -1,3 +1,4 @@
from typing import Optional, List, Dict, Any, Union, IO
import math
from tqdm import tqdm
import numpy
@ -9,10 +10,12 @@ import gzip
import zipfile
import srsly
import warnings
from wasabi import msg
from wasabi import Printer
from ._app import app, Arg, Opt
from ..vectors import Vectors
from ..errors import Errors, Warnings
from ..language import Language
from ..util import ensure_path, get_lang_class, load_model, OOV_RANK
from ..lookups import Lookups
@ -25,20 +28,21 @@ except ImportError:
DEFAULT_OOV_PROB = -20
def init_model(
@app.command("init-model")
def init_model_cli(
# fmt: off
lang: ("Model language", "positional", None, str),
output_dir: ("Model output directory", "positional", None, Path),
freqs_loc: ("Location of words frequencies file", "option", "f", Path) = None,
clusters_loc: ("Optional location of brown clusters data", "option", "c", str) = None,
jsonl_loc: ("Location of JSONL-formatted attributes file", "option", "j", Path) = None,
vectors_loc: ("Optional vectors file in Word2Vec format", "option", "v", str) = None,
prune_vectors: ("Optional number of vectors to prune to", "option", "V", int) = -1,
truncate_vectors: ("Optional number of vectors to truncate to when reading in vectors file", "option", "t", int) = 0,
vectors_name: ("Optional name for the word vectors, e.g. en_core_web_lg.vectors", "option", "vn", str) = None,
model_name: ("Optional name for the model meta", "option", "mn", str) = None,
omit_extra_lookups: ("Don't include extra lookups in model", "flag", "OEL", bool) = False,
base_model: ("Base model (for languages with custom tokenizers)", "option", "b", str) = None
lang: str = Arg(..., help="Model language"),
output_dir: Path = Arg(..., help="Model output directory"),
freqs_loc: Optional[Path] = Arg(None, help="Location of words frequencies file", exists=True),
clusters_loc: Optional[Path] = Opt(None, "--clusters-loc", "-c", help="Optional location of brown clusters data", exists=True),
jsonl_loc: Optional[Path] = Opt(None, "--jsonl-loc", "-j", help="Location of JSONL-formatted attributes file", exists=True),
vectors_loc: Optional[Path] = Opt(None, "--vectors-loc", "-v", help="Optional vectors file in Word2Vec format", exists=True),
prune_vectors: int = Opt(-1 , "--prune-vectors", "-V", help="Optional number of vectors to prune to"),
truncate_vectors: int = Opt(0, "--truncate-vectors", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
vectors_name: Optional[str] = Opt(None, "--vectors-name", "-vn", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
model_name: Optional[str] = Opt(None, "--model-name", "-mn", help="Optional name for the model meta"),
omit_extra_lookups: bool = Opt(False, "--omit-extra-lookups", "-OEL", help="Don't include extra lookups in model"),
base_model: Optional[str] = Opt(None, "--base-model", "-b", help="Base model (for languages with custom tokenizers)")
# fmt: on
):
"""
@ -46,6 +50,38 @@ def init_model(
and word vectors. If vectors are provided in Word2Vec format, they can
be either a .txt or zipped as a .zip or .tar.gz.
"""
init_model(
lang,
output_dir,
freqs_loc=freqs_loc,
clusters_loc=clusters_loc,
jsonl_loc=jsonl_loc,
vectors_loc=vectors_loc,
prune_vectors=prune_vectors,
truncate_vectors=truncate_vectors,
vectors_name=vectors_name,
model_name=model_name,
omit_extra_lookups=omit_extra_lookups,
base_model=base_model,
silent=False,
)
def init_model(
lang: str,
output_dir: Path,
freqs_loc: Optional[Path] = None,
clusters_loc: Optional[Path] = None,
jsonl_loc: Optional[Path] = None,
vectors_loc: Optional[Path] = None,
prune_vectors: int = -1,
truncate_vectors: int = 0,
vectors_name: Optional[str] = None,
model_name: Optional[str] = None,
omit_extra_lookups: bool = False,
base_model: Optional[str] = None,
silent: bool = True,
) -> Language:
msg = Printer(no_print=silent, pretty=not silent)
if jsonl_loc is not None:
if freqs_loc is not None or clusters_loc is not None:
settings = ["-j"]
@ -68,7 +104,7 @@ def init_model(
freqs_loc = ensure_path(freqs_loc)
if freqs_loc is not None and not freqs_loc.exists():
msg.fail("Can't find words frequencies file", freqs_loc, exits=1)
lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc)
lex_attrs = read_attrs_from_deprecated(msg, freqs_loc, clusters_loc)
with msg.loading("Creating model..."):
nlp = create_model(lang, lex_attrs, name=model_name, base_model=base_model)
@ -83,7 +119,9 @@ def init_model(
msg.good("Successfully created model")
if vectors_loc is not None:
add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name)
add_vectors(
msg, nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name
)
vec_added = len(nlp.vocab.vectors)
lex_added = len(nlp.vocab)
msg.good(
@ -95,7 +133,7 @@ def init_model(
return nlp
def open_file(loc):
def open_file(loc: Union[str, Path]) -> IO:
"""Handle .gz, .tar.gz or unzipped files"""
loc = ensure_path(loc)
if tarfile.is_tarfile(str(loc)):
@ -111,7 +149,9 @@ def open_file(loc):
return loc.open("r", encoding="utf8")
def read_attrs_from_deprecated(freqs_loc, clusters_loc):
def read_attrs_from_deprecated(
msg: Printer, freqs_loc: Optional[Path], clusters_loc: Optional[Path]
) -> List[Dict[str, Any]]:
if freqs_loc is not None:
with msg.loading("Counting frequencies..."):
probs, _ = read_freqs(freqs_loc)
@ -139,7 +179,12 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc):
return lex_attrs
def create_model(lang, lex_attrs, name=None, base_model=None):
def create_model(
lang: str,
lex_attrs: List[Dict[str, Any]],
name: Optional[str] = None,
base_model: Optional[Union[str, Path]] = None,
) -> Language:
if base_model:
nlp = load_model(base_model)
# keep the tokenizer but remove any existing pipeline components due to
@ -166,7 +211,14 @@ def create_model(lang, lex_attrs, name=None, base_model=None):
return nlp
def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
def add_vectors(
msg: Printer,
nlp: Language,
vectors_loc: Optional[Path],
truncate_vectors: int,
prune_vectors: int,
name: Optional[str] = None,
) -> None:
vectors_loc = ensure_path(vectors_loc)
if vectors_loc and vectors_loc.parts[-1].endswith(".npz"):
nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb")))
@ -176,7 +228,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
else:
if vectors_loc:
with msg.loading(f"Reading vectors from {vectors_loc}"):
vectors_data, vector_keys = read_vectors(vectors_loc)
vectors_data, vector_keys = read_vectors(msg, vectors_loc)
msg.good(f"Loaded vectors from {vectors_loc}")
else:
vectors_data, vector_keys = (None, None)
@ -195,7 +247,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
nlp.vocab.prune_vectors(prune_vectors)
def read_vectors(vectors_loc, truncate_vectors=0):
def read_vectors(msg: Printer, vectors_loc: Path, truncate_vectors: int = 0):
f = open_file(vectors_loc)
shape = tuple(int(size) for size in next(f).split())
if truncate_vectors >= 1:
@ -215,7 +267,9 @@ def read_vectors(vectors_loc, truncate_vectors=0):
return vectors_data, vectors_keys
def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
def read_freqs(
freqs_loc: Path, max_length: int = 100, min_doc_freq: int = 5, min_freq: int = 50
):
counts = PreshCounter()
total = 0
with freqs_loc.open() as f:
@ -244,7 +298,7 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
return probs, oov_prob
def read_clusters(clusters_loc):
def read_clusters(clusters_loc: Path) -> dict:
clusters = {}
if ftfy is None:
warnings.warn(Warnings.W004)
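With the CLI wrapper split off, init_model() can be called directly and returns the created Language object; a minimal sketch with placeholder paths, assuming the module path spacy.cli.init_model used in this diff:

from pathlib import Path
from spacy.cli.init_model import init_model

# The JSONL file (placeholder path) should contain one JSON object of lexeme
# attributes per line; silent=False prints progress via wasabi.
nlp = init_model(
    "en",
    Path("./my_model"),
    jsonl_loc=Path("./lexemes.jsonl"),
    silent=False,
)
print(nlp.lang, len(nlp.vocab))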

View File

@ -1,19 +1,25 @@
from typing import Optional, Union, Any, Dict
import shutil
from pathlib import Path
from wasabi import msg, get_raw_input
from wasabi import Printer, get_raw_input
import srsly
import sys
from ._app import app, Arg, Opt
from ..schemas import validate, ModelMetaSchema
from .. import util
from .. import about
def package(
@app.command("package")
def package_cli(
# fmt: off
input_dir: ("Directory with model data", "positional", None, str),
output_dir: ("Output parent directory", "positional", None, str),
meta_path: ("Path to meta.json", "option", "m", str) = None,
create_meta: ("Create meta.json, even if one exists", "flag", "c", bool) = False,
force: ("Force overwriting existing model in output directory", "flag", "f", bool) = False,
input_dir: Path = Arg(..., help="Directory with model data", exists=True, file_okay=False),
output_dir: Path = Arg(..., help="Output parent directory", exists=True, file_okay=False),
meta_path: Optional[Path] = Opt(None, "--meta-path", "--meta", "-m", help="Path to meta.json", exists=True, dir_okay=False),
create_meta: bool = Opt(False, "--create-meta", "-c", "-C", help="Create meta.json, even if one exists"),
version: Optional[str] = Opt(None, "--version", "-v", help="Package version to override meta"),
force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing model in output directory"),
# fmt: on
):
"""
@ -23,6 +29,27 @@ def package(
set and a meta.json already exists in the output directory, the existing
values will be used as the defaults in the command-line prompt.
"""
package(
input_dir,
output_dir,
meta_path=meta_path,
version=version,
create_meta=create_meta,
force=force,
silent=False,
)
def package(
input_dir: Path,
output_dir: Path,
meta_path: Optional[Path] = None,
version: Optional[str] = None,
create_meta: bool = False,
force: bool = False,
silent: bool = True,
) -> None:
msg = Printer(no_print=silent, pretty=not silent)
input_path = util.ensure_path(input_dir)
output_path = util.ensure_path(output_dir)
meta_path = util.ensure_path(meta_path)
@ -33,23 +60,23 @@ def package(
if meta_path and not meta_path.exists():
msg.fail("Can't find model meta.json", meta_path, exits=1)
meta_path = meta_path or input_path / "meta.json"
if meta_path.is_file():
meta = srsly.read_json(meta_path)
if not create_meta: # only print if user doesn't want to overwrite
msg.good("Loaded meta.json from file", meta_path)
else:
meta = generate_meta(input_dir, meta, msg)
for key in ("lang", "name", "version"):
if key not in meta or meta[key] == "":
msg.fail(
f"No '{key}' setting found in meta.json",
"This setting is required to build your package.",
exits=1,
)
meta_path = meta_path or input_dir / "meta.json"
if not meta_path.exists() or not meta_path.is_file():
msg.fail("Can't load model meta.json", meta_path, exits=1)
meta = srsly.read_json(meta_path)
meta = get_meta(input_dir, meta)
if version is not None:
meta["version"] = version
if not create_meta: # only print if user doesn't want to overwrite
msg.good("Loaded meta.json from file", meta_path)
else:
meta = generate_meta(meta, msg)
errors = validate(ModelMetaSchema, meta)
if errors:
msg.fail("Invalid model meta.json", "\n".join(errors), exits=1)
model_name = meta["lang"] + "_" + meta["name"]
model_name_v = model_name + "-" + meta["version"]
main_path = output_path / model_name_v
main_path = output_dir / model_name_v
package_path = main_path / model_name
if package_path.exists():
@ -63,32 +90,37 @@ def package(
exits=1,
)
Path.mkdir(package_path, parents=True)
shutil.copytree(str(input_path), str(package_path / model_name_v))
shutil.copytree(str(input_dir), str(package_path / model_name_v))
create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2))
create_file(main_path / "setup.py", TEMPLATE_SETUP)
create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST)
create_file(package_path / "__init__.py", TEMPLATE_INIT)
msg.good(f"Successfully created package '{model_name_v}'", main_path)
msg.text("To build the package, run `python setup.py sdist` in this directory.")
with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "sdist"])
zip_file = main_path / "dist" / f"{model_name_v}.tar.gz"
msg.good(f"Successfully created zipped Python package", zip_file)
def create_file(file_path, contents):
def create_file(file_path: Path, contents: str) -> None:
file_path.touch()
file_path.open("w", encoding="utf-8").write(contents)
def generate_meta(model_path, existing_meta, msg):
meta = existing_meta or {}
settings = [
("lang", "Model language", meta.get("lang", "en")),
("name", "Model name", meta.get("name", "model")),
("version", "Model version", meta.get("version", "0.0.0")),
("description", "Model description", meta.get("description", False)),
("author", "Author", meta.get("author", False)),
("email", "Author email", meta.get("email", False)),
("url", "Author website", meta.get("url", False)),
("license", "License", meta.get("license", "MIT")),
]
def get_meta(
model_path: Union[str, Path], existing_meta: Dict[str, Any]
) -> Dict[str, Any]:
meta = {
"lang": "en",
"name": "model",
"version": "0.0.0",
"description": None,
"author": None,
"email": None,
"url": None,
"license": "MIT",
}
meta.update(existing_meta)
nlp = util.load_model_from_path(Path(model_path))
meta["spacy_version"] = util.get_model_version_range(about.__version__)
meta["pipeline"] = nlp.pipe_names
@ -98,6 +130,23 @@ def generate_meta(model_path, existing_meta, msg):
"keys": nlp.vocab.vectors.n_keys,
"name": nlp.vocab.vectors.name,
}
if about.__title__ != "spacy":
meta["parent_package"] = about.__title__
return meta
def generate_meta(existing_meta: Dict[str, Any], msg: Printer) -> Dict[str, Any]:
meta = existing_meta or {}
settings = [
("lang", "Model language", meta.get("lang", "en")),
("name", "Model name", meta.get("name", "model")),
("version", "Model version", meta.get("version", "0.0.0")),
("description", "Model description", meta.get("description", None)),
("author", "Author", meta.get("author", None)),
("email", "Author email", meta.get("email", None)),
("url", "Author website", meta.get("url", None)),
("license", "License", meta.get("license", "MIT")),
]
msg.divider("Generating meta.json")
msg.text(
"Enter the package settings for your model. The following information "
@ -106,8 +155,6 @@ def generate_meta(model_path, existing_meta, msg):
for setting, desc, default in settings:
response = get_raw_input(desc, default)
meta[setting] = default if response == "" and default else response
if about.__title__ != "spacy":
meta["parent_package"] = about.__title__
return meta
@ -158,12 +205,12 @@ def setup_package():
setup(
name=model_name,
description=meta['description'],
author=meta['author'],
author_email=meta['email'],
url=meta['url'],
description=meta.get('description'),
author=meta.get('author'),
author_email=meta.get('email'),
url=meta.get('url'),
version=meta['version'],
license=meta['license'],
license=meta.get('license'),
packages=[model_name],
package_data={model_name: list_files(model_dir)},
install_requires=list_requirements(meta),
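package() can likewise be driven from Python with the new keyword-only options; a minimal sketch with placeholder paths, assuming the module path spacy.cli.package used in this diff:

from pathlib import Path
from spacy.cli.package import package

# Builds the package directory and, per the change above, also runs
# "python setup.py sdist" inside it to produce a .tar.gz archive.
package(
    Path("./training/model-best"),  # placeholder: trained model directory
    Path("./packages"),             # placeholder: output parent directory
    version="0.0.1",                # overrides the version from meta.json
    force=True,
    silent=False,
)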

View File

@ -1,14 +1,15 @@
from typing import Optional
import random
import numpy
import time
import re
from collections import Counter
import plac
from pathlib import Path
from thinc.api import Linear, Maxout, chain, list2array, use_pytorch_for_gpu_memory
from wasabi import msg
import srsly
from ._app import app, Arg, Opt
from ..errors import Errors
from ..ml.models.multi_task import build_masked_language_model
from ..tokens import Doc
@ -17,25 +18,17 @@ from .. import util
from ..gold import Example
@plac.annotations(
@app.command("pretrain")
def pretrain_cli(
# fmt: off
texts_loc=("Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", "positional", None, str),
vectors_model=("Name or path to spaCy model with vectors to learn from", "positional", None, str),
output_dir=("Directory to write models to on each epoch", "positional", None, Path),
config_path=("Path to config file", "positional", None, Path),
use_gpu=("Use GPU", "option", "g", int),
resume_path=("Path to pretrained weights from which to resume pretraining", "option","r", Path),
epoch_resume=("The epoch to resume counting from when using '--resume_path'. Prevents unintended overwriting of existing weight files.","option", "er", int),
texts_loc: Path = Arg(..., help="Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", exists=True),
vectors_model: str = Arg(..., help="Name or path to spaCy model with vectors to learn from"),
output_dir: Path = Arg(..., help="Directory to write models to on each epoch"),
config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False),
use_gpu: int = Opt(-1, "--use-gpu", "-g", help="Use GPU"),
resume_path: Optional[Path] = Opt(None, "--resume-path", "-r", help="Path to pretrained weights from which to resume pretraining"),
epoch_resume: Optional[int] = Opt(None, "--epoch-resume", "-er", help="The epoch to resume counting from when using '--resume_path'. Prevents unintended overwriting of existing weight files."),
# fmt: on
)
def pretrain(
texts_loc,
vectors_model,
config_path,
output_dir,
use_gpu=-1,
resume_path=None,
epoch_resume=None,
):
"""
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
@ -52,6 +45,26 @@ def pretrain(
all settings are the same between pretraining and training. Ideally,
this is done by using the same config file for both commands.
"""
pretrain(
texts_loc,
vectors_model,
output_dir,
config_path,
use_gpu=use_gpu,
resume_path=resume_path,
epoch_resume=epoch_resume,
)
def pretrain(
texts_loc: Path,
vectors_model: str,
output_dir: Path,
config_path: Path,
use_gpu: int = -1,
resume_path: Optional[Path] = None,
epoch_resume: Optional[int] = None,
):
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
@ -166,8 +179,7 @@ def pretrain(
skip_counter = 0
loss_func = pretrain_config["loss_func"]
for epoch in range(epoch_resume, pretrain_config["max_epochs"]):
examples = [Example(doc=text) for text in texts]
batches = util.minibatch_by_words(examples, size=pretrain_config["batch_size"])
batches = util.minibatch_by_words(texts, size=pretrain_config["batch_size"])
for batch_id, batch in enumerate(batches):
docs, count = make_docs(
nlp,

View File

@ -1,3 +1,4 @@
from typing import Optional, Sequence, Union, Iterator
import tqdm
from pathlib import Path
import srsly
@ -5,17 +6,19 @@ import cProfile
import pstats
import sys
import itertools
import ml_datasets
from wasabi import msg
from wasabi import msg, Printer
from ._app import app, Arg, Opt
from ..language import Language
from ..util import load_model
def profile(
@app.command("profile")
def profile_cli(
# fmt: off
model: ("Model to load", "positional", None, str),
inputs: ("Location of input file. '-' for stdin.", "positional", None, str) = None,
n_texts: ("Maximum number of texts to use if available", "option", "n", int) = 10000,
model: str = Arg(..., help="Model to load"),
inputs: Optional[Path] = Arg(None, help="Location of input file. '-' for stdin.", exists=True, allow_dash=True),
n_texts: int = Opt(10000, "--n-texts", "-n", help="Maximum number of texts to use if available"),
# fmt: on
):
"""
@ -24,6 +27,18 @@ def profile(
It can either be provided as a JSONL file, or be read from sys.stdin.
If no input file is specified, the IMDB dataset is loaded via Thinc.
"""
profile(model, inputs=inputs, n_texts=n_texts)
def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) -> None:
try:
import ml_datasets
except ImportError:
msg.fail(
"This command requires the ml_datasets library to be installed:"
"pip install ml_datasets",
exits=1,
)
if inputs is not None:
inputs = _read_inputs(inputs, msg)
if inputs is None:
@ -43,12 +58,12 @@ def profile(
s.strip_dirs().sort_stats("time").print_stats()
def parse_texts(nlp, texts):
def parse_texts(nlp: Language, texts: Sequence[str]) -> None:
for doc in nlp.pipe(tqdm.tqdm(texts), batch_size=16):
pass
def _read_inputs(loc, msg):
def _read_inputs(loc: Union[Path, str], msg: Printer) -> Iterator[str]:
if loc == "-":
msg.info("Reading input from sys.stdin")
file_ = sys.stdin
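The profiler can also be invoked programmatically; a minimal sketch, assuming ml_datasets and a model such as en_core_web_sm are installed (neither is guaranteed by this diff):

from spacy.cli.profile import profile

# With no input file, texts are pulled from the IMDB dataset via ml_datasets.
profile("en_core_web_sm", n_texts=500)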

704
spacy/cli/project.py Normal file
View File

@ -0,0 +1,704 @@
from typing import List, Dict, Any, Optional, Sequence
import typer
import srsly
from pathlib import Path
from wasabi import msg
import subprocess
import os
import re
import shutil
import sys
import requests
import tqdm
from ._app import app, Arg, Opt, COMMAND, NAME
from .. import about
from ..schemas import ProjectConfigSchema, validate
from ..util import ensure_path, run_command, make_tempdir, working_dir
from ..util import get_hash, get_checksum, split_command
CONFIG_FILE = "project.yml"
DVC_CONFIG = "dvc.yaml"
DVC_DIR = ".dvc"
DIRS = [
"assets",
"metas",
"configs",
"packages",
"metrics",
"scripts",
"notebooks",
"training",
"corpus",
]
CACHES = [
Path.home() / ".torch",
Path.home() / ".caches" / "torch",
os.environ.get("TORCH_HOME"),
Path.home() / ".keras",
]
DVC_CONFIG_COMMENT = """# This file is auto-generated by spaCy based on your project.yml. Do not edit
# it directly and edit the project.yml instead and re-run the project."""
CLI_HELP = f"""Command-line interface for spaCy projects and working with project
templates. You'd typically start by cloning a project template to a local
directory and fetching its assets like datasets etc. See the project's
{CONFIG_FILE} for the available commands. Under the hood, spaCy uses DVC (Data
Version Control) to manage input and output files and to ensure steps are only
re-run if their inputs change.
"""
project_cli = typer.Typer(help=CLI_HELP, no_args_is_help=True)
@project_cli.callback(invoke_without_command=True)
def callback(ctx: typer.Context):
"""This runs before every project command and ensures DVC is installed."""
ensure_dvc()
################
# CLI COMMANDS #
################
@project_cli.command("clone")
def project_clone_cli(
# fmt: off
name: str = Arg(..., help="The name of the template to fetch"),
dest: Path = Arg(Path.cwd(), help="Where to download and work. Defaults to current working directory.", exists=False),
repo: str = Opt(about.__projects__, "--repo", "-r", help="The repository to look in."),
git: bool = Opt(False, "--git", "-G", help="Initialize project as a Git repo"),
no_init: bool = Opt(False, "--no-init", "-NI", help="Don't initialize the project with DVC"),
# fmt: on
):
"""Clone a project template from a repository. Calls into "git" and will
only download the files from the given subdirectory. The GitHub repo
defaults to the official spaCy template repo, but can be customized
(including using a private repo). Setting the --git flag will also
initialize the project directory as a Git repo. If the project is intended
to be a Git repo, it should be initialized with Git first, before
initializing DVC (Data Version Control). This allows DVC to integrate with
Git.
"""
if dest == Path.cwd():
dest = dest / name
project_clone(name, dest, repo=repo, git=git, no_init=no_init)
@project_cli.command("init")
def project_init_cli(
# fmt: off
path: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
git: bool = Opt(False, "--git", "-G", help="Initialize project as a Git repo"),
force: bool = Opt(False, "--force", "-F", help="Force initialization"),
# fmt: on
):
"""Initialize a project directory with DVC and optionally Git. This should
typically be taken care of automatically when you run the "project clone"
command, but you can also run it separately. If the project is intended to
be a Git repo, it should be initialized with Git first, before initializing
DVC. This allows DVC to integrate with Git.
"""
project_init(path, git=git, force=force, silent=True)
@project_cli.command("assets")
def project_assets_cli(
# fmt: off
project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
# fmt: on
):
"""Use DVC (Data Version Control) to fetch project assets. Assets are
defined in the "assets" section of the project config. If possible, DVC
will try to track the files so you can pull changes from upstream. It will
also try and store the checksum so the assets are versioned. If the file
can't be tracked or checked, it will be downloaded without DVC. If a checksum
is provided in the project config, the file is only downloaded if no local
file with the same checksum exists.
"""
project_assets(project_dir)
@project_cli.command(
"run-all",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def project_run_all_cli(
# fmt: off
ctx: typer.Context,
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
show_help: bool = Opt(False, "--help", help="Show help message and available subcommands")
# fmt: on
):
"""Run all commands defined in the project. This command will use DVC and
the defined outputs and dependencies in the project config to determine
which steps need to be re-run and where to start. This means you're only
re-generating data if the inputs have changed.
This command calls into "dvc repro" and all additional arguments are passed
to the "dvc repro" command: https://dvc.org/doc/command-reference/repro
"""
if show_help:
print_run_help(project_dir)
else:
project_run_all(project_dir, *ctx.args)
@project_cli.command(
"run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def project_run_cli(
# fmt: off
ctx: typer.Context,
subcommand: str = Arg(None, help="Name of command defined in project config"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
show_help: bool = Opt(False, "--help", help="Show help message and available subcommands")
# fmt: on
):
"""Run a named script defined in the project config. If the command is
part of the default pipeline defined in the "run" section, DVC is used to
determine whether the step should re-run if its inputs have changed, or
whether everything is up to date. If the script is not part of the default
pipeline, it will be called separately without DVC.
If DVC is used, the command calls into "dvc repro" and all additional
arguments are passed to the "dvc repro" command:
https://dvc.org/doc/command-reference/repro
"""
if show_help or not subcommand:
print_run_help(project_dir, subcommand)
else:
project_run(project_dir, subcommand, *ctx.args)
@project_cli.command("exec", hidden=True)
def project_exec_cli(
# fmt: off
subcommand: str = Arg(..., help="Name of command defined in project config"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
# fmt: on
):
"""Execute a command defined in the project config. This CLI command is
only called internally in auto-generated DVC pipelines, as a shortcut for
multi-step commands in the project config. You typically shouldn't have to
call it yourself. To run a command, call "run" or "run-all".
"""
project_exec(project_dir, subcommand)
@project_cli.command("update-dvc")
def project_update_dvc_cli(
# fmt: off
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
verbose: bool = Opt(False, "--verbose", "-V", help="Print more info"),
force: bool = Opt(False, "--force", "-F", help="Force update DVC config"),
# fmt: on
):
"""Update the auto-generated DVC config file. Uses the steps defined in the
"run" section of the project config. This typically happens automatically
when running a command, but can also be triggered manually if needed.
"""
config = load_project_config(project_dir)
updated = update_dvc_config(project_dir, config, verbose=verbose, force=force)
if updated:
msg.good(f"Updated DVC config from {CONFIG_FILE}")
else:
msg.info(f"No changes found in {CONFIG_FILE}, no update needed")
app.add_typer(project_cli, name="project")
#################
# CLI FUNCTIONS #
#################
def project_clone(
name: str,
dest: Path,
*,
repo: str = about.__projects__,
git: bool = False,
no_init: bool = False,
) -> None:
"""Clone a project template from a repository.
name (str): Name of subdirectory to clone.
dest (Path): Destination path of cloned project.
repo (str): URL of Git repo containing project templates.
git (bool): Initialize project as Git repo. Should be set to True if project
is intended as a repo, since it will allow DVC to integrate with Git.
no_init (bool): Don't initialize DVC and Git automatically. If True, the
"init" command or "git init" and "dvc init" need to be run manually.
"""
dest = ensure_path(dest)
check_clone(name, dest, repo)
project_dir = dest.resolve()
# We're using Git and sparse checkout to only clone the files we need
with make_tempdir() as tmp_dir:
cmd = f"git clone {repo} {tmp_dir} --no-checkout --depth 1 --config core.sparseCheckout=true"
try:
run_command(cmd)
except SystemExit:
err = f"Could not clone the repo '{repo}' into the temp dir '{tmp_dir}'"
msg.fail(err)
with (tmp_dir / ".git" / "info" / "sparse-checkout").open("w") as f:
f.write(name)
run_command(["git", "-C", str(tmp_dir), "fetch"])
run_command(["git", "-C", str(tmp_dir), "checkout"])
shutil.move(str(tmp_dir / Path(name).name), str(project_dir))
msg.good(f"Cloned project '{name}' from {repo} into {project_dir}")
for sub_dir in DIRS:
dir_path = project_dir / sub_dir
if not dir_path.exists():
dir_path.mkdir(parents=True)
if not no_init:
project_init(project_dir, git=git, force=True, silent=True)
msg.good(f"Your project is now ready!", dest)
print(f"To fetch the assets, run:\n{COMMAND} project assets {dest}")
def project_init(
project_dir: Path,
*,
git: bool = False,
force: bool = False,
silent: bool = False,
analytics: bool = False,
):
"""Initialize a project as a DVC and (optionally) as a Git repo.
project_dir (Path): Path to project directory.
git (bool): Also call "git init" to initialize directory as a Git repo.
silent (bool): Don't print any output (via DVC).
analytics (bool): Opt-in to DVC analytics (defaults to False).
"""
with working_dir(project_dir) as cwd:
if git:
run_command(["git", "init"])
init_cmd = ["dvc", "init"]
if silent:
init_cmd.append("--quiet")
if not git:
init_cmd.append("--no-scm")
if force:
init_cmd.append("--force")
run_command(init_cmd)
# We don't want to have analytics on by default; our users should
# opt in explicitly. If they want it, they can always enable it.
if not analytics:
run_command(["dvc", "config", "core.analytics", "false"])
# Remove unused and confusing plot templates from .dvc directory
# TODO: maybe we shouldn't do this, but it's otherwise super confusing
# once you commit your changes via Git and it creates a bunch of files
# that have no purpose
plots_dir = cwd / DVC_DIR / "plots"
if plots_dir.exists():
shutil.rmtree(str(plots_dir))
config = load_project_config(cwd)
setup_check_dvc(cwd, config)
def project_assets(project_dir: Path) -> None:
"""Fetch assets for a project using DVC if possible.
project_dir (Path): Path to project directory.
"""
project_path = ensure_path(project_dir)
config = load_project_config(project_path)
setup_check_dvc(project_path, config)
assets = config.get("assets", {})
if not assets:
msg.warn(f"No assets specified in {CONFIG_FILE}", exits=0)
msg.info(f"Fetching {len(assets)} asset(s)")
variables = config.get("variables", {})
fetched_assets = []
for asset in assets:
url = asset["url"].format(**variables)
dest = asset["dest"].format(**variables)
fetched_path = fetch_asset(project_path, url, dest, asset.get("checksum"))
if fetched_path:
fetched_assets.append(str(fetched_path))
if fetched_assets:
with working_dir(project_path):
run_command(["dvc", "add", *fetched_assets, "--external"])
def fetch_asset(
project_path: Path, url: str, dest: Path, checksum: Optional[str] = None
) -> Optional[Path]:
"""Fetch an asset from a given URL or path. Will try to import the file
using DVC's import-url if possible (fully tracked and versioned) and falls
back to get-url (versioned) and a non-DVC download if necessary. If a
checksum is provided and a local file exists, it's only re-downloaded if the
checksum doesn't match.
project_path (Path): Path to project directory.
url (str): URL or path to asset.
dest (Path): Destination path of the asset, relative to the project directory.
checksum (Optional[str]): Optional expected checksum of local file.
RETURNS (Optional[Path]): The path to the fetched asset or None if fetching
the asset failed.
"""
url = convert_asset_url(url)
dest_path = (project_path / dest).resolve()
if dest_path.exists() and checksum:
# If there's already a file, check for checksum
# TODO: add support for caches (dvc import-url with local path)
if checksum == get_checksum(dest_path):
msg.good(f"Skipping download with matching checksum: {dest}")
return dest_path
with working_dir(project_path):
try:
# If these fail, we don't want to output an error or info message.
# Try with tracking the source first, then just downloading with
# DVC, then a regular non-DVC download.
try:
dvc_cmd = ["dvc", "import-url", url, str(dest_path)]
print(subprocess.check_output(dvc_cmd, stderr=subprocess.DEVNULL))
except subprocess.CalledProcessError:
dvc_cmd = ["dvc", "get-url", url, str(dest_path)]
print(subprocess.check_output(dvc_cmd, stderr=subprocess.DEVNULL))
except subprocess.CalledProcessError:
try:
download_file(url, dest_path)
except requests.exceptions.HTTPError as e:
msg.fail(f"Download failed: {dest}", e)
return None
if checksum and checksum != get_checksum(dest_path):
msg.warn(f"Checksum doesn't match value defined in {CONFIG_FILE}: {dest}")
msg.good(f"Fetched asset {dest}")
return dest_path
def project_run_all(project_dir: Path, *dvc_args) -> None:
"""Run all commands defined in the project using DVC.
project_dir (Path): Path to project directory.
*dvc_args: Other arguments passed to "dvc repro".
"""
config = load_project_config(project_dir)
setup_check_dvc(project_dir, config)
dvc_cmd = ["dvc", "repro", *dvc_args]
with working_dir(project_dir):
run_command(dvc_cmd)
def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None:
"""Simulate a CLI help prompt using the info available in the project config.
project_dir (Path): The project directory.
subcommand (Optional[str]): The subcommand or None. If a subcommand is
provided, the subcommand help is shown. Otherwise, the top-level help
and a list of available commands is printed.
"""
config = load_project_config(project_dir)
setup_check_dvc(project_dir, config)
config_commands = config.get("commands", [])
commands = {cmd["name"]: cmd for cmd in config_commands}
if subcommand:
validate_subcommand(commands.keys(), subcommand)
print(f"Usage: {COMMAND} project run {subcommand} {project_dir}")
help_text = commands[subcommand].get("help")
if help_text:
msg.text(f"\n{help_text}\n")
else:
print(f"\nAvailable commands in {CONFIG_FILE}")
print(f"Usage: {COMMAND} project run [COMMAND] {project_dir}")
msg.table([(cmd["name"], cmd.get("help", "")) for cmd in config_commands])
msg.text("Run all commands defined in the 'run' block of the project config:")
print(f"{COMMAND} project run-all {project_dir}")
def project_run(project_dir: Path, subcommand: str, *dvc_args) -> None:
"""Run a named script defined in the project config. If the script is part
of the default pipeline (defined in the "run" section), DVC is used to
execute the command, so it can determine whether to rerun it. It then
calls into "exec" to execute it.
project_dir (Path): Path to project directory.
subcommand (str): Name of command to run.
*dvc_args: Other arguments passed to "dvc repro".
"""
config = load_project_config(project_dir)
setup_check_dvc(project_dir, config)
config_commands = config.get("commands", [])
variables = config.get("variables", {})
commands = {cmd["name"]: cmd for cmd in config_commands}
validate_subcommand(commands.keys(), subcommand)
if subcommand in config.get("run", []):
# This is one of the pipeline commands tracked in DVC
dvc_cmd = ["dvc", "repro", subcommand, *dvc_args]
with working_dir(project_dir):
run_command(dvc_cmd)
else:
cmd = commands[subcommand]
# Deps in non-DVC commands aren't tracked, but if they're defined,
# make sure they exist before running the command
for dep in cmd.get("deps", []):
if not (project_dir / dep).exists():
err = f"Missing dependency specified by command '{subcommand}': {dep}"
msg.fail(err, exits=1)
with working_dir(project_dir):
run_commands(cmd["script"], variables)
def project_exec(project_dir: Path, subcommand: str):
"""Execute a command defined in the project config.
project_dir (Path): Path to project directory.
subcommand (str): Name of command to run.
"""
config = load_project_config(project_dir)
config_commands = config.get("commands", [])
variables = config.get("variables", {})
commands = {cmd["name"]: cmd for cmd in config_commands}
with working_dir(project_dir):
run_commands(commands[subcommand]["script"], variables)
###########
# HELPERS #
###########
def load_project_config(path: Path) -> Dict[str, Any]:
"""Load the project config file from a directory and validate it.
path (Path): The path to the project directory.
RETURNS (Dict[str, Any]): The loaded project config.
"""
config_path = path / CONFIG_FILE
if not config_path.exists():
msg.fail("Can't find project config", config_path, exits=1)
invalid_err = f"Invalid project config in {CONFIG_FILE}"
try:
config = srsly.read_yaml(config_path)
except ValueError as e:
msg.fail(invalid_err, e, exits=1)
errors = validate(ProjectConfigSchema, config)
if errors:
msg.fail(invalid_err, "\n".join(errors), exits=1)
return config
def update_dvc_config(
path: Path,
config: Dict[str, Any],
verbose: bool = False,
silent: bool = False,
force: bool = False,
) -> bool:
"""Re-run the DVC commands in dry mode and update dvc.yaml file in the
project directory. The file is auto-generated based on the config. The
first line of the auto-generated file specifies the hash of the config
dict, so if any of the config values change, the DVC config is regenerated.
path (Path): The path to the project directory.
config (Dict[str, Any]): The loaded project config.
verbose (bool): Whether to print additional info (via DVC).
silent (bool): Don't output anything (via DVC).
force (bool): Force update, even if hashes match.
RETURNS (bool): Whether the DVC config file was updated.
"""
config_hash = get_hash(config)
path = path.resolve()
dvc_config_path = path / DVC_CONFIG
if dvc_config_path.exists():
# Check if the file was generated using the current config, if not, redo
with dvc_config_path.open("r", encoding="utf8") as f:
ref_hash = f.readline().strip().replace("# ", "")
if ref_hash == config_hash and not force:
return False # Nothing has changed in project config, don't need to update
dvc_config_path.unlink()
variables = config.get("variables", {})
commands = []
# We only want to include commands that are part of the main list of "run"
# commands in project.yml and should be run in sequence
config_commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
for name in config.get("run", []):
validate_subcommand(config_commands.keys(), name)
command = config_commands[name]
deps = command.get("deps", [])
outputs = command.get("outputs", [])
outputs_no_cache = command.get("outputs_no_cache", [])
if not deps and not outputs and not outputs_no_cache:
continue
# Default to "." as the project path since dvc.yaml is auto-generated
# and we don't want arbitrary paths in there
project_cmd = ["python", "-m", NAME, "project", ".", "exec", name]
deps_cmd = [c for cl in [["-d", p] for p in deps] for c in cl]
outputs_cmd = [c for cl in [["-o", p] for p in outputs] for c in cl]
outputs_nc_cmd = [c for cl in [["-O", p] for p in outputs_no_cache] for c in cl]
dvc_cmd = ["dvc", "run", "-n", name, "-w", str(path), "--no-exec"]
if verbose:
dvc_cmd.append("--verbose")
if silent:
dvc_cmd.append("--quiet")
full_cmd = [*dvc_cmd, *deps_cmd, *outputs_cmd, *outputs_nc_cmd, *project_cmd]
commands.append(" ".join(full_cmd))
with working_dir(path):
run_commands(commands, variables, silent=True)
with dvc_config_path.open("r+", encoding="utf8") as f:
content = f.read()
f.seek(0, 0)
f.write(f"# {config_hash}\n{DVC_CONFIG_COMMENT}\n{content}")
return True
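# Usage sketch: the auto-generated dvc.yaml starts with "# <config hash>", which
# is what the early-exit check above compares against, e.g.
#   first_line = (path / DVC_CONFIG).read_text(encoding="utf8").splitlines()[0]
#   ref_hash = first_line.replace("# ", "")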
def ensure_dvc() -> None:
"""Ensure that the "dvc" command is available and show an error if not."""
try:
subprocess.run(["dvc", "--version"], stdout=subprocess.DEVNULL)
except Exception:
msg.fail(
"spaCy projects require DVC (Data Version Control) and the 'dvc' command",
"You can install the Python package from pip (pip install dvc) or "
"conda (conda install -c conda-forge dvc). For more details, see the "
"documentation: https://dvc.org/doc/install",
exits=1,
)
def setup_check_dvc(project_dir: Path, config: Dict[str, Any]) -> None:
"""Check that the project is set up correctly with DVC and update its
config if needed. Will raise an error if the project is not an initialized
DVC project.
project_dir (Path): The path to the project directory.
config (Dict[str, Any]): The loaded project config.
"""
if not project_dir.exists():
msg.fail(f"Can't find project directory: {project_dir}")
if not (project_dir / ".dvc").exists():
msg.fail(
"Project not initialized as a DVC project.",
f"Make sure that the project template was cloned correctly. To "
f"initialize the project directory manually, you can run: "
f"{COMMAND} project init {project_dir}",
exits=1,
)
with msg.loading("Updating DVC config..."):
updated = update_dvc_config(project_dir, config, silent=True)
if updated:
msg.good(f"Updated DVC config from changed {CONFIG_FILE}")
def run_commands(
commands: List[str] = tuple(), variables: Dict[str, str] = {}, silent: bool = False
) -> None:
"""Run a sequence of commands in a subprocess, in order.
commands (List[str]): The string commands.
variables (Dict[str, str]): Dictionary of variable names, mapped to their
values. Will be used to substitute format string variables in the
commands.
silent (bool): Don't print the commands.
"""
for command in commands:
# Substitute variables, e.g. "./{NAME}.json"
command = command.format(**variables)
command = split_command(command)
# Not sure if this is needed or a good idea. Motivation: users may often
# use commands in their config that reference "python" and we want to
# make sure that it's always executing the same Python that spaCy is
# executed with and the pip in the same env, not some other Python/pip.
# Also ensures cross-compatibility if user 1 writes "python3" (because
# that's how it's set up on their system), and user 2 without the
# shortcut tries to re-run the command.
if len(command) and command[0] in ("python", "python3"):
command[0] = sys.executable
elif len(command) and command[0] in ("pip", "pip3"):
command = [sys.executable, "-m", "pip", *command[1:]]
if not silent:
print(f"Running command: {' '.join(command)}")
run_command(command)
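# Usage sketch (paths and variables are illustrative): variables are substituted
# into each command string before it is split and executed, e.g.
#   run_commands(["python scripts/build.py {NAME}.json"], {"NAME": "corpus"})
# runs "<sys.executable> scripts/build.py corpus.json" in the current directory.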
def convert_asset_url(url: str) -> str:
"""Check and convert the asset URL if needed.
url (str): The asset URL.
RETURNS (str): The converted URL.
"""
# If the asset URL is a regular GitHub URL it's likely a mistake
if re.match("(http(s?)):\/\/github.com", url):
converted = url.replace("github.com", "raw.githubusercontent.com")
converted = re.sub(r"/(tree|blob)/", "/", converted)
msg.warn(
"Downloading from a regular GitHub URL. This will only download "
"the source of the page, not the actual file. Converting the URL "
"to a raw URL.",
converted,
)
return converted
return url
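# Conversion sketch (the URL is a made-up example):
#   convert_asset_url("https://github.com/user/repo/blob/master/assets/data.json")
#   -> "https://raw.githubusercontent.com/user/repo/master/assets/data.json"
# A warning is also printed, since a non-raw GitHub URL is usually a mistake.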
def check_clone(name: str, dest: Path, repo: str) -> None:
"""Check and validate that the destination path can be used to clone. Will
check that Git is available and that the destination path is suitable.
name (str): Name of the directory to clone from the repo.
dest (Path): Local destination of cloned directory.
repo (str): URL of the repo to clone from.
"""
try:
subprocess.run(["git", "--version"], stdout=subprocess.DEVNULL)
except Exception:
msg.fail(
f"Cloning spaCy project templates requires Git and the 'git' command. ",
f"To clone a project without Git, copy the files from the '{name}' "
f"directory in the {repo} to {dest} manually and then run:",
f"{COMMAND} project init {dest}",
exits=1,
)
if not dest:
msg.fail(f"Not a valid directory to clone project: {dest}", exits=1)
if dest.exists():
# Directory already exists (not allowed, clone needs to create it)
msg.fail(f"Can't clone project, directory already exists: {dest}", exits=1)
if not dest.parent.exists():
# We're not creating parents, parent dir should exist
msg.fail(
f"Can't clone project, parent directory doesn't exist: {dest.parent}",
exits=1,
)
def validate_subcommand(commands: Sequence[str], subcommand: str) -> None:
"""Check that a subcommand is valid and defined. Raises an error otherwise.
commands (Sequence[str]): The available commands.
subcommand (str): The subcommand.
"""
if subcommand not in commands:
msg.fail(
f"Can't find command '{subcommand}' in {CONFIG_FILE}. "
f"Available commands: {', '.join(commands)}",
exits=1,
)
def download_file(url: str, dest: Path, chunk_size: int = 1024) -> None:
"""Download a file using requests.
url (str): The URL of the file.
dest (Path): The destination path.
chunk_size (int): The size of chunks to read/write.
"""
response = requests.get(url, stream=True)
response.raise_for_status()
total = int(response.headers.get("content-length", 0))
progress_settings = {
"total": total,
"unit": "iB",
"unit_scale": True,
"unit_divisor": chunk_size,
"leave": False,
}
with dest.open("wb") as f, tqdm.tqdm(**progress_settings) as bar:
for data in response.iter_content(chunk_size=chunk_size):
size = f.write(data)
bar.update(size)
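download_file() streams the response to disk with a tqdm progress bar; a minimal sketch with a placeholder URL, assuming the module path spacy.cli.project added in this diff:

from pathlib import Path
from spacy.cli.project import download_file

# The destination's parent directory must already exist.
download_file("https://example.com/assets/train.jsonl", Path("assets/train.jsonl"))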

View File

@ -2,9 +2,8 @@ from typing import Optional, Dict, List, Union, Sequence
from timeit import default_timer as timer
import srsly
from pydantic import BaseModel, FilePath
import plac
import tqdm
from pydantic import BaseModel, FilePath
from pathlib import Path
from wasabi import msg
import thinc
@ -12,11 +11,17 @@ import thinc.schedules
from thinc.api import Model, Optimizer, use_pytorch_for_gpu_memory
import random
from ..gold import GoldCorpus
from ._app import app, Arg, Opt
from ..gold import Corpus
from ..lookups import Lookups
from .. import util
from ..errors import Errors
from ..ml import models # don't remove - required to load the built-in architectures
# Don't remove - required to load the built-in architectures
from ..ml import models # noqa: F401
# from ..schemas import ConfigSchema # TODO: include?
registry = util.registry
@ -114,41 +119,24 @@ class ConfigSchema(BaseModel):
extra = "allow"
@plac.annotations(
# fmt: off
train_path=("Location of JSON-formatted training data", "positional", None, Path),
dev_path=("Location of JSON-formatted development data", "positional", None, Path),
config_path=("Path to config file", "positional", None, Path),
output_path=("Output directory to store model in", "option", "o", Path),
init_tok2vec=(
"Path to pretrained weights for the tok2vec components. See 'spacy pretrain'. Experimental.", "option", "t2v",
Path),
raw_text=("Path to jsonl file with unlabelled text documents.", "option", "rt", Path),
verbose=("Display more information for debugging purposes", "flag", "VV", bool),
use_gpu=("Use GPU", "option", "g", int),
num_workers=("Parallel Workers", "option", "j", int),
strategy=("Distributed training strategy (requires spacy_ray)", "option", "strategy", str),
ray_address=(
"Address of the Ray cluster. Multi-node training (requires spacy_ray)",
"option", "address", str),
tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
# fmt: on
)
@app.command("train")
def train_cli(
train_path,
dev_path,
config_path,
output_path=None,
init_tok2vec=None,
raw_text=None,
verbose=False,
use_gpu=-1,
num_workers=1,
strategy="allreduce",
ray_address=None,
tag_map_path=None,
omit_extra_lookups=False,
# fmt: off
train_path: Path = Arg(..., help="Location of JSON-formatted training data", exists=True),
dev_path: Path = Arg(..., help="Location of JSON-formatted development data", exists=True),
config_path: Path = Arg(..., help="Path to config file", exists=True),
output_path: Optional[Path] = Opt(None, "--output-path", "-o", help="Output directory to store model in"),
code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
init_tok2vec: Optional[Path] = Opt(None, "--init-tok2vec", "-t2v", help="Path to pretrained weights for the tok2vec components. See 'spacy pretrain'. Experimental."),
raw_text: Optional[Path] = Opt(None, "--raw-text", "-rt", help="Path to jsonl file with unlabelled text documents."),
verbose: bool = Opt(False, "--verbose", "-VV", help="Display more information for debugging purposes"),
use_gpu: int = Opt(-1, "--use-gpu", "-g", help="Use GPU"),
num_workers: Optional[int] = Opt(None, "-j", help="Parallel workers"),
strategy: Optional[str] = Opt(None, "--strategy", help="Distributed training strategy (requires spacy_ray)"),
ray_address: Optional[str] = Opt(None, "--address", help="Address of the Ray cluster. Multi-node training (requires spacy_ray)"),
tag_map_path: Optional[Path] = Opt(None, "--tag-map-path", "-tm", help="Location of JSON-formatted tag map"),
omit_extra_lookups: bool = Opt(False, "--omit-extra-lookups", "-OEL", help="Don't include extra lookups in model"),
# fmt: on
):
"""
Train or update a spaCy model. Requires data to be formatted in spaCy's
@ -156,26 +144,8 @@ def train_cli(
command.
"""
util.set_env_log(verbose)
verify_cli_args(**locals())
# Make sure all files and paths exist if they are needed
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
if not train_path or not train_path.exists():
msg.fail("Training data not found", train_path, exits=1)
if not dev_path or not dev_path.exists():
msg.fail("Development data not found", dev_path, exits=1)
if output_path is not None:
if not output_path.exists():
output_path.mkdir()
msg.good(f"Created output directory: {output_path}")
elif output_path.exists() and [p for p in output_path.iterdir() if p.is_dir()]:
msg.warn(
"Output directory is not empty.",
"This can lead to unintended side effects when saving the model. "
"Please use an empty directory or a different path instead. If "
"the specified output path doesn't exist, the directory will be "
"created for you.",
)
if raw_text is not None:
raw_text = list(srsly.read_jsonl(raw_text))
tag_map = {}
@ -184,8 +154,6 @@ def train_cli(
weights_data = None
if init_tok2vec is not None:
if not init_tok2vec.exists():
msg.fail("Can't find pretrained tok2vec", init_tok2vec, exits=1)
with init_tok2vec.open("rb") as file_:
weights_data = file_.read()
@ -214,17 +182,17 @@ def train_cli(
train(**train_args)
def train(
config_path,
data_paths,
raw_text=None,
output_path=None,
tag_map=None,
weights_data=None,
omit_extra_lookups=False,
disable_tqdm=False,
remote_optimizer=None,
randomization_index=0
):
config_path: Path,
data_paths: Dict[str, Path],
raw_text: Optional[Path] = None,
output_path: Optional[Path] = None,
tag_map: Optional[Path] = None,
weights_data: Optional[bytes] = None,
omit_extra_lookups: bool = False,
disable_tqdm: bool = False,
remote_optimizer: Optional[Optimizer] = None,
randomization_index: int = 0
) -> None:
msg.info(f"Loading config from: {config_path}")
# Read the config first without creating objects, to get to the original nlp_config
config = util.load_config(config_path, create_objects=False)
@ -243,69 +211,20 @@ def train(
if remote_optimizer:
optimizer = remote_optimizer
limit = training["limit"]
msg.info("Loading training corpus")
corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit)
# verify textcat config
corpus = Corpus(data_paths["train"], data_paths["dev"], limit=limit)
if "textcat" in nlp_config["pipeline"]:
textcat_labels = set(nlp.get_pipe("textcat").labels)
textcat_multilabel = not nlp_config["pipeline"]["textcat"]["model"]["exclusive_classes"]
# check whether the setting 'exclusive_classes' corresponds to the provided training data
if textcat_multilabel:
multilabel_found = False
for ex in corpus.train_examples:
cats = ex.doc_annotation.cats
textcat_labels.update(cats.keys())
if list(cats.values()).count(1.0) != 1:
multilabel_found = True
if not multilabel_found:
msg.warn(
"The textcat training instances look like they have "
"mutually exclusive classes. Set 'exclusive_classes' "
"to 'true' in the config to train a classifier with "
"mutually exclusive classes more accurately."
)
else:
for ex in corpus.train_examples:
cats = ex.doc_annotation.cats
textcat_labels.update(cats.keys())
if list(cats.values()).count(1.0) != 1:
msg.fail(
"Some textcat training instances do not have exactly "
"one positive label. Set 'exclusive_classes' "
"to 'false' in the config to train a classifier with classes "
"that are not mutually exclusive."
)
msg.info(f"Initialized textcat component for {len(textcat_labels)} unique labels")
nlp.get_pipe("textcat").labels = tuple(textcat_labels)
# if 'positive_label' is provided: double check whether it's in the data and the task is binary
if nlp_config["pipeline"]["textcat"].get("positive_label", None):
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
pos_label = nlp_config["pipeline"]["textcat"]["positive_label"]
if pos_label not in textcat_labels:
msg.fail(
f"The textcat's 'positive_label' config setting '{pos_label}' "
f"does not match any label in the training data.",
exits=1,
)
if len(textcat_labels) != 2:
msg.fail(
f"A textcat 'positive_label' '{pos_label}' was "
f"provided for training data that does not appear to be a "
f"binary classification problem with two labels.",
exits=1,
)
verify_textcat_config(nlp, nlp_config)
if training.get("resume", False):
msg.info("Resuming training")
nlp.resume_training()
else:
msg.info(f"Initializing the nlp pipeline: {nlp.pipe_names}")
nlp.begin_training(
lambda: corpus.train_examples
)
train_examples = list(corpus.train_dataset(
nlp,
shuffle=False,
gold_preproc=training["gold_preproc"]
))
nlp.begin_training(lambda: train_examples)
# Update tag map with provided mapping
nlp.vocab.morphology.tag_map.update(tag_map)
@ -332,11 +251,11 @@ def train(
tok2vec = tok2vec.get(subpath)
if not tok2vec:
msg.fail(
f"Could not locate the tok2vec model at {tok2vec_path}.",
exits=1,
f"Could not locate the tok2vec model at {tok2vec_path}.", exits=1,
)
tok2vec.from_bytes(weights_data)
msg.info("Loading training corpus")
train_batches = create_train_batches(nlp, corpus, training, randomization_index)
evaluate = create_evaluation_callback(nlp, optimizer, corpus, training)
@ -369,18 +288,15 @@ def train(
update_meta(training, nlp, info)
nlp.to_disk(output_path / "model-best")
progress = tqdm.tqdm(**tqdm_args)
# Clean up the objects to facilitate garbage collection.
for eg in batch:
eg.doc = None
eg.goldparse = None
eg.doc_annotation = None
eg.token_annotation = None
except Exception as e:
msg.warn(
f"Aborting and saving the final best model. "
f"Encountered exception: {str(e)}",
exits=1,
)
if output_path is not None:
msg.warn(
f"Aborting and saving the final best model. "
f"Encountered exception: {str(e)}",
exits=1,
)
else:
raise e
finally:
if output_path is not None:
final_model_path = output_path / "model-final"
@ -393,23 +309,22 @@ def train(
def create_train_batches(nlp, corpus, cfg, randomization_index):
epochs_todo = cfg.get("max_epochs", 0)
max_epochs = cfg.get("max_epochs", 0)
train_examples = list(corpus.train_dataset(
nlp,
shuffle=True,
gold_preproc=cfg["gold_preproc"],
max_length=cfg["max_length"]
))
epoch = 0
while True:
train_examples = list(
corpus.train_dataset(
nlp,
noise_level=0.0, # I think this is deprecated?
orth_variant_level=cfg["orth_variant_level"],
gold_preproc=cfg["gold_preproc"],
max_length=cfg["max_length"],
ignore_misaligned=True,
)
)
if len(train_examples) == 0:
raise ValueError(Errors.E988)
for _ in range(randomization_index):
random.random()
random.shuffle(train_examples)
epoch += 1
batches = util.minibatch_by_words(
train_examples,
size=cfg["batch_size"],
@ -418,15 +333,12 @@ def create_train_batches(nlp, corpus, cfg, randomization_index):
# make sure the minibatch_by_words result is not empty, or we'll have an infinite training loop
try:
first = next(batches)
yield first
yield epoch, first
except StopIteration:
raise ValueError(Errors.E986)
for batch in batches:
yield batch
epochs_todo -= 1
# We intentionally compare exactly to 0 here, so that max_epochs < 1
# will not break.
if epochs_todo == 0:
yield epoch, batch
if max_epochs >= 1 and epoch >= max_epochs:
break
@ -437,7 +349,8 @@ def create_evaluation_callback(nlp, optimizer, corpus, cfg):
nlp, gold_preproc=cfg["gold_preproc"], ignore_misaligned=True
)
)
n_words = sum(len(ex.doc) for ex in dev_examples)
n_words = sum(len(ex.predicted) for ex in dev_examples)
start_time = timer()
if optimizer.averages:
@ -453,7 +366,11 @@ def create_evaluation_callback(nlp, optimizer, corpus, cfg):
try:
weighted_score = sum(scores[s] * weights.get(s, 0.0) for s in weights)
except KeyError as e:
raise KeyError(Errors.E983.format(dict_name='score_weights', key=str(e), keys=list(scores.keys())))
raise KeyError(
Errors.E983.format(
dict="score_weights", key=str(e), keys=list(scores.keys())
)
)
scores["speed"] = wps
return weighted_score, scores
@ -494,7 +411,7 @@ def train_while_improving(
Every iteration, the function yields out a tuple with:
* batch: A zipped sequence of Tuple[Doc, GoldParse] pairs.
* batch: A list of Example objects.
* info: A dict with various information about the last update (see below).
* is_best_checkpoint: A value in None, False, True, indicating whether this
was the best evaluation so far. You should use this to save the model
@ -526,7 +443,7 @@ def train_while_improving(
(nlp.make_doc(rt["text"]) for rt in raw_text), size=8
)
for step, batch in enumerate(train_data):
for step, (epoch, batch) in enumerate(train_data):
dropout = next(dropouts)
with nlp.select_pipes(enable=to_enable):
for subbatch in subdivide_batch(batch, accumulate_gradient):
@ -548,6 +465,7 @@ def train_while_improving(
score, other_scores = (None, None)
is_best_checkpoint = None
info = {
"epoch": epoch,
"step": step,
"score": score,
"other_scores": other_scores,
@ -568,7 +486,7 @@ def train_while_improving(
def subdivide_batch(batch, accumulate_gradient):
batch = list(batch)
batch.sort(key=lambda eg: len(eg.doc))
batch.sort(key=lambda eg: len(eg.predicted))
sub_len = len(batch) // accumulate_gradient
start = 0
for i in range(accumulate_gradient):
@ -586,9 +504,9 @@ def setup_printer(training, nlp):
score_widths = [max(len(col), 6) for col in score_cols]
loss_cols = [f"Loss {pipe}" for pipe in nlp.pipe_names]
loss_widths = [max(len(col), 8) for col in loss_cols]
table_header = ["#"] + loss_cols + score_cols + ["Score"]
table_header = ["E", "#"] + loss_cols + score_cols + ["Score"]
table_header = [col.upper() for col in table_header]
table_widths = [6] + loss_widths + score_widths + [6]
table_widths = [3, 6] + loss_widths + score_widths + [6]
table_aligns = ["r" for _ in table_widths]
msg.row(table_header, widths=table_widths)
@ -602,17 +520,25 @@ def setup_printer(training, nlp):
]
except KeyError as e:
raise KeyError(
Errors.E983.format(dict_name='scores (losses)', key=str(e), keys=list(info["losses"].keys())))
Errors.E983.format(
dict="scores (losses)", key=str(e), keys=list(info["losses"].keys())
)
)
try:
scores = [
"{0:.2f}".format(float(info["other_scores"][col]))
for col in score_cols
"{0:.2f}".format(float(info["other_scores"][col])) for col in score_cols
]
except KeyError as e:
raise KeyError(Errors.E983.format(dict_name='scores (other)', key=str(e), keys=list(info["other_scores"].keys())))
raise KeyError(
Errors.E983.format(
dict="scores (other)",
key=str(e),
keys=list(info["other_scores"].keys()),
)
)
data = (
[info["step"]] + losses + scores + ["{0:.2f}".format(float(info["score"]))]
[info["epoch"], info["step"]] + losses + scores + ["{0:.2f}".format(float(info["score"]))]
)
msg.row(data, widths=table_widths, aligns=table_aligns)
@ -626,3 +552,67 @@ def update_meta(training, nlp, info):
nlp.meta["performance"][metric] = info["other_scores"][metric]
for pipe_name in nlp.pipe_names:
nlp.meta["performance"][f"{pipe_name}_loss"] = info["losses"][pipe_name]
def verify_cli_args(
train_path,
dev_path,
config_path,
output_path=None,
code_path=None,
init_tok2vec=None,
raw_text=None,
verbose=False,
use_gpu=-1,
tag_map_path=None,
omit_extra_lookups=False,
):
# Make sure all files and paths exists if they are needed
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
if not train_path or not train_path.exists():
msg.fail("Training data not found", train_path, exits=1)
if not dev_path or not dev_path.exists():
msg.fail("Development data not found", dev_path, exits=1)
if output_path is not None:
if not output_path.exists():
output_path.mkdir()
msg.good(f"Created output directory: {output_path}")
elif output_path.exists() and [p for p in output_path.iterdir() if p.is_dir()]:
msg.warn(
"Output directory is not empty.",
"This can lead to unintended side effects when saving the model. "
"Please use an empty directory or a different path instead. If "
"the specified output path doesn't exist, the directory will be "
"created for you.",
)
if code_path is not None:
if not code_path.exists():
msg.fail("Path to Python code not found", code_path, exits=1)
try:
util.import_file("python_code", code_path)
except Exception as e:
msg.fail(f"Couldn't load Python code: {code_path}", e, exits=1)
if init_tok2vec is not None and not init_tok2vec.exists():
msg.fail("Can't find pretrained tok2vec", init_tok2vec, exits=1)
def verify_textcat_config(nlp, nlp_config):
# if 'positive_label' is provided: double check whether it's in the data and
# the task is binary
if nlp_config["pipeline"]["textcat"].get("positive_label", None):
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
pos_label = nlp_config["pipeline"]["textcat"]["positive_label"]
if pos_label not in textcat_labels:
msg.fail(
f"The textcat's 'positive_label' config setting '{pos_label}' "
f"does not match any label in the training data.",
exits=1,
)
if len(textcat_labels) != 2:
msg.fail(
f"A textcat 'positive_label' '{pos_label}' was "
f"provided for training data that does not appear to be a "
f"binary classification problem with two labels.",
exits=1,
)
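# An illustrative sketch (not part of this diff) of the nlp_config fragment that
# verify_textcat_config() inspects. The label name "POSITIVE" is hypothetical.
example_nlp_config = {
    "pipeline": {
        "textcat": {
            "model": {"exclusive_classes": True},
            "positive_label": "POSITIVE",
        }
    }
}
# verify_textcat_config(nlp, example_nlp_config) would then require "POSITIVE"
# to be one of exactly two labels on the trained textcat component.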


@ -1,18 +1,25 @@
from typing import Tuple
from pathlib import Path
import sys
import requests
from wasabi import msg
from wasabi import msg, Printer
from ._app import app
from .. import about
from ..util import get_package_version, get_installed_models, get_base_version
from ..util import get_package_path, get_model_meta, is_compatible_version
def validate():
@app.command("validate")
def validate_cli():
"""
Validate that the currently installed version of spaCy is compatible
with the installed models. Should be run after `pip install -U spacy`.
"""
validate()
def validate() -> None:
model_pkgs, compat = get_model_pkgs()
spacy_version = get_base_version(about.__version__)
current_compat = compat.get(spacy_version, {})
@ -55,7 +62,8 @@ def validate():
sys.exit(1)
def get_model_pkgs():
def get_model_pkgs(silent: bool = False) -> Tuple[dict, dict]:
msg = Printer(no_print=silent, pretty=not silent)
with msg.loading("Loading compatibility table..."):
r = requests.get(about.__compatibility__)
if r.status_code != 200:
@ -93,7 +101,7 @@ def get_model_pkgs():
return pkgs, compat
def reformat_version(version):
def reformat_version(version: str) -> str:
"""Hack to reformat old versions ending on '-alpha' to match pip format."""
if version.endswith("-alpha"):
return version.replace("-alpha", "a0")
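# For example (illustrative, not part of this diff):
# reformat_version("2.2.0-alpha") returns "2.2.0a0", the pip/PEP 440 pre-release form.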


@ -3,7 +3,7 @@ def add_codes(err_cls):
class ErrorsWithCodes(err_cls):
def __getattribute__(self, code):
msg = super().__getattribute__(code)
msg = super(ErrorsWithCodes, self).__getattribute__(code)
if code.startswith("__"): # python system attributes like __class__
return msg
else:
@ -111,8 +111,31 @@ class Warnings(object):
"`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`"
" to check the alignment. Misaligned entities ('-') will be "
"ignored during training.")
W031 = ("Model '{model}' ({model_version}) requires spaCy {version} and "
"is incompatible with the current spaCy version ({current}). This "
"may lead to unexpected results or runtime errors. To resolve "
"this, download a newer compatible model or retrain your custom "
"model with the current spaCy version. For more details and "
"available updates, run: python -m spacy validate")
W032 = ("Unable to determine model compatibility for model '{model}' "
"({model_version}) with the current spaCy version ({current}). "
"This may lead to unexpected results or runtime errors. To resolve "
"this, download a newer compatible model or retrain your custom "
"model with the current spaCy version. For more details and "
"available updates, run: python -m spacy validate")
W033 = ("Training a new {model} using a model with no lexeme normalization "
"table. This may degrade the performance of the model to some "
"degree. If this is intentional or the language you're using "
"doesn't have a normalization table, please ignore this warning. "
"If this is surprising, make sure you have the spacy-lookups-data "
"package installed. The languages with lexeme normalization tables "
"are currently: da, de, el, en, id, lb, pt, ru, sr, ta, th.")
# TODO: fix numbering after merging develop into master
W091 = ("Could not clean/remove the temp directory at {dir}: {msg}.")
W092 = ("Ignoring annotations for sentence starts, as dependency heads are set.")
W093 = ("Could not find any data to train the {name} on. Is your "
"input data correctly formatted ?")
W094 = ("Model '{model}' ({model_version}) specifies an under-constrained "
"spaCy version requirement: {version}. This can lead to compatibility "
"problems with older versions, or as new spaCy versions are "
@ -133,7 +156,7 @@ class Warnings(object):
"so a default configuration was used.")
W099 = ("Expected 'dict' type for the 'model' argument of pipe '{pipe}', "
"but got '{type}' instead, so ignoring it.")
W100 = ("Skipping unsupported morphological feature(s): {feature}. "
W100 = ("Skipping unsupported morphological feature(s): '{feature}'. "
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
"string \"Field1=Value1,Value2|Field2=Value3\".")
@ -161,18 +184,13 @@ class Errors(object):
"`nlp.select_pipes()`, you should remove them explicitly with "
"`nlp.remove_pipe()` before the pipeline is restored. Names of "
"the new components: {names}")
E009 = ("The `update` method expects same number of docs and golds, but "
"got: {n_docs} docs, {n_golds} golds.")
E010 = ("Word vectors set to length 0. This may be because you don't have "
"a model installed or loaded, or because your model doesn't "
"include word vectors. For more info, see the docs:\n"
"https://spacy.io/usage/models")
E011 = ("Unknown operator: '{op}'. Options: {opts}")
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
E013 = ("Error selecting action in matcher")
E014 = ("Unknown tag ID: {tag}")
E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use "
"`force=True` to overwrite.")
E016 = ("MultitaskObjective target should be function or one of: dep, "
"tag, ent, dep_tag_offset, ent_tag.")
E017 = ("Can only add unicode or bytes. Got type: {value_type}")
@ -180,21 +198,8 @@ class Errors(object):
"refers to an issue with the `Vocab` or `StringStore`.")
E019 = ("Can't create transition with unknown action ID: {action}. Action "
"IDs are enumerated in spacy/syntax/{src}.pyx.")
E020 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The tree is non-projective (i.e. it has "
"crossing arcs - see spacy/syntax/nonproj.pyx for definitions). "
"The ArcEager transition system only supports projective trees. "
"To learn non-projective representations, transform the data "
"before training and after parsing. Either pass "
"`make_projective=True` to the GoldParse class, or use "
"spacy.syntax.nonproj.preprocess_training_data.")
E021 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The GoldParse was projective. The transition "
"system has {n_actions} actions. State at failure: {state}")
E022 = ("Could not find a transition with the name '{name}' in the NER "
"model.")
E023 = ("Error cleaning up beam: The same state occurred twice at "
"memory address {addr} and position {i}.")
E024 = ("Could not find an optimal move to supervise the parser. Usually, "
"this means that the model can't be updated in a way that's valid "
"and satisfies the correct annotations specified in the GoldParse. "
@ -238,7 +243,6 @@ class Errors(object):
"offset {start}.")
E037 = ("Error calculating span: Can't find a token ending at character "
"offset {end}.")
E038 = ("Error finding sentence for span. Infinite loop detected.")
E039 = ("Array bounds exceeded while searching for root word. This likely "
"means the parse tree is in an invalid state. Please report this "
"issue here: http://github.com/explosion/spaCy/issues")
@ -269,8 +273,6 @@ class Errors(object):
E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}")
E060 = ("Cannot add new key to vectors: the table is full. Current shape: "
"({rows}, {cols}).")
E061 = ("Bad file name: {filename}. Example of a valid file name: "
"'vectors.128.f.bin'")
E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 "
"and 63 are occupied. You can replace one by specifying the "
"`flag_id` explicitly, e.g. "
@ -284,39 +286,17 @@ class Errors(object):
"Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}")
E065 = ("Only one of the vector table's width and shape can be specified. "
"Got width {width} and shape {shape}.")
E066 = ("Error creating model helper for extracting columns. Can only "
"extract columns by positive integer. Got: {value}.")
E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside "
"an entity) without a preceding 'B' (beginning of an entity). "
"Tag sequence:\n{tags}")
E068 = ("Invalid BILUO tag: '{tag}'.")
E069 = ("Invalid gold-standard parse tree. Found cycle between word "
"IDs: {cycle} (tokens: {cycle_tokens}) in the document starting "
"with tokens: {doc_tokens}.")
E070 = ("Invalid gold-standard data. Number of documents ({n_docs}) "
"does not align with number of annotations ({n_annots}).")
E071 = ("Error creating lexeme: specified orth ID ({orth}) does not "
"match the one in the vocab ({vocab_orth}).")
E072 = ("Error serializing lexeme: expected data length {length}, "
"got {bad_length}.")
E073 = ("Cannot assign vector of length {new_length}. Existing vectors "
"are of length {length}. You can use `vocab.reset_vectors` to "
"clear the existing vectors and resize the table.")
E074 = ("Error interpreting compiled match pattern: patterns are expected "
"to end with the attribute {attr}. Got: {bad_attr}.")
E075 = ("Error accepting match: length ({length}) > maximum length "
"({max_len}).")
E076 = ("Error setting tensor on Doc: tensor has {rows} rows, while Doc "
"has {words} words.")
E077 = ("Error computing {value}: number of Docs ({n_docs}) does not "
"equal number of GoldParse objects ({n_golds}) in batch.")
E078 = ("Error computing score: number of words in Doc ({words_doc}) does "
"not equal number of words in GoldParse ({words_gold}).")
E079 = ("Error computing states in beam: number of predicted beams "
"({pbeams}) does not equal number of gold beams ({gbeams}).")
E080 = ("Duplicate state found in beam: {key}.")
E081 = ("Error getting gradient in beam: number of histories ({n_hist}) "
"does not equal number of losses ({losses}).")
E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), "
"projective heads ({n_proj_heads}) and labels ({n_labels}) do not "
"match.")
@ -324,8 +304,6 @@ class Errors(object):
"`getter` (plus optional `setter`) is allowed. Got: {nr_defined}")
E084 = ("Error assigning label ID {label} to span: not in StringStore.")
E085 = ("Can't create lexeme for string '{string}'.")
E086 = ("Error deserializing lexeme '{string}': orth ID {orth_id} does "
"not match hash {hash_id} in StringStore.")
E087 = ("Unknown displaCy style: {style}.")
E088 = ("Text of length {length} exceeds maximum of {max_length}. The "
"v2.x parser and NER models require roughly 1GB of temporary "
@ -367,7 +345,6 @@ class Errors(object):
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
"token can only be part of one entity, so make sure the entities "
"you're setting don't overlap.")
E104 = ("Can't find JSON schema for '{name}'.")
E105 = ("The Doc.print_tree() method is now deprecated. Please use "
"Doc.to_json() instead or write your own function.")
E106 = ("Can't find doc._.{attr} attribute specified in the underscore "
@ -390,8 +367,6 @@ class Errors(object):
"practically no advantage over pickling the parent Doc directly. "
"So instead of pickling the span, pickle the Doc it belongs to or "
"use Span.as_doc to convert the span to a standalone Doc object.")
E113 = ("The newly split token can only have one root (head = 0).")
E114 = ("The newly split token needs to have a root (head = 0).")
E115 = ("All subtokens must have associated heads.")
E116 = ("Cannot currently add labels to pretrained text classifier. Add "
"labels before training begins. This functionality was available "
@ -414,12 +389,9 @@ class Errors(object):
"equal to span length ({span_len}).")
E122 = ("Cannot find token to be split. Did it get merged?")
E123 = ("Cannot find head of token to be split. Did it get merged?")
E124 = ("Cannot read from file: {path}. Supported formats: {formats}")
E125 = ("Unexpected value: {value}")
E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. "
"This is likely a bug in spaCy, so feel free to open an issue.")
E127 = ("Cannot create phrase pattern representation for length 0. This "
"is likely a bug in spaCy.")
E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
"arguments to exclude fields from being serialized or deserialized "
"is now deprecated. Please use the `exclude` argument instead. "
@ -461,8 +433,6 @@ class Errors(object):
"provided {found}.")
E143 = ("Labels for component '{name}' not initialized. Did you forget to "
"call add_label()?")
E144 = ("Could not find parameter `{param}` when building the entity "
"linker model.")
E145 = ("Error reading `{param}` from input file.")
E146 = ("Could not access `{path}`.")
E147 = ("Unexpected error in the {method} functionality of the "
@ -474,8 +444,6 @@ class Errors(object):
"the component matches the model being loaded.")
E150 = ("The language of the `nlp` object and the `vocab` should be the "
"same, but found '{nlp}' and '{vocab}' respectively.")
E151 = ("Trying to call nlp.update without required annotation types. "
"Expected top-level keys: {exp}. Got: {unexp}.")
E152 = ("The attribute {attr} is not supported for token patterns. "
"Please use the option validate=True with Matcher, PhraseMatcher, "
"or EntityRuler for more details.")
@ -512,11 +480,6 @@ class Errors(object):
"that case.")
E166 = ("Can only merge DocBins with the same pre-defined attributes.\n"
"Current DocBin: {current}\nOther DocBin: {other}")
E167 = ("Unknown morphological feature: '{feat}' ({feat_id}). This can "
"happen if the tagger was trained with a different set of "
"morphological features. If you're using a pretrained model, make "
"sure that your models are up to date:\npython -m spacy validate")
E168 = ("Unknown field: {field}")
E169 = ("Can't find module: {module}")
E170 = ("Cannot apply transition {name}: invalid for the current state.")
E171 = ("Matcher.add received invalid on_match callback argument: expected "
@ -527,8 +490,6 @@ class Errors(object):
E173 = ("As of v2.2, the Lemmatizer is initialized with an instance of "
"Lookups containing the lemmatization tables. See the docs for "
"details: https://spacy.io/api/lemmatizer#init")
E174 = ("Architecture '{name}' not found in registry. Available "
"names: {names}")
E175 = ("Can't remove rule for unknown match pattern ID: {key}")
E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
E177 = ("Ill-formed IOB input detected: {tag}")
@ -556,9 +517,6 @@ class Errors(object):
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
E187 = ("Only unicode strings are supported as labels.")
E188 = ("Could not match the gold entity links to entities in the doc - "
"make sure the gold EL data refers to valid results of the "
"named entity recognizer in the `nlp` pipeline.")
E189 = ("Each argument to `get_doc` should be of equal length.")
E190 = ("Token head out of range in `Doc.from_array()` for token index "
"'{index}' with value '{value}' (equivalent to relative head "
@ -578,12 +536,32 @@ class Errors(object):
E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
E198 = ("Unable to return {n} most similar vectors for the current vectors "
"table, which contains {n_rows} vectors.")
E199 = ("Unable to merge 0-length span at doc[{start}:{end}].")
# TODO: fix numbering after merging develop into master
E983 = ("Invalid key for '{dict_name}': {key}. Available keys: "
E970 = ("Can not execute command '{str_command}'. Do you have '{tool}' installed?")
E971 = ("Found incompatible lengths in Doc.from_array: {array_length} for the "
"array and {doc_length} for the Doc itself.")
E972 = ("Example.__init__ got None for '{arg}'. Requires Doc.")
E973 = ("Unexpected type for NER data")
E974 = ("Unknown {obj} attribute: {key}")
E975 = ("The method Example.from_dict expects a Doc as first argument, "
"but got {type}")
E976 = ("The method Example.from_dict expects a dict as second argument, "
"but received None.")
E977 = ("Can not compare a MorphAnalysis with a string object. "
"This is likely a bug in spaCy, so feel free to open an issue.")
E978 = ("The {method} method of component {name} takes a list of Example objects, "
"but found {types} instead.")
E979 = ("Cannot convert {type} to an Example object.")
E980 = ("Each link annotation should refer to a dictionary with at most one "
"identifier mapping to 1.0, and all others to 0.0.")
E981 = ("The offsets of the annotations for 'links' need to refer exactly "
"to the offsets of the 'entities' annotations.")
E982 = ("The 'ent_iob' attribute of a Token should be an integer indexing "
"into {values}, but found {value}.")
E983 = ("Invalid key for '{dict}': {key}. Available keys: "
"{keys}")
E984 = ("Could not parse the {input} - double check the data is written "
"in the correct format as expected by spaCy.")
E985 = ("The pipeline component '{component}' is already available in the base "
"model. The settings in the component block in the config file are "
"being ignored. If you want to replace this component instead, set "
@ -615,22 +593,13 @@ class Errors(object):
E997 = ("Tokenizer special cases are not allowed to modify the text. "
"This would map '{chunk}' to '{orth}' given token attributes "
"'{token_attrs}'.")
E998 = ("To create GoldParse objects from Example objects without a "
"Doc, get_gold_parses() should be called with a Vocab object.")
E999 = ("Encountered an unexpected format for the dictionary holding "
"gold annotations: {gold_dict}")
@add_codes
class TempErrors(object):
T003 = ("Resizing pretrained Tagger models is not currently supported.")
T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.")
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
"issue tracker: http://github.com/explosion/spaCy/issues")
T008 = ("Bad configuration of Tagger. This is probably a bug within "
"spaCy. We changed the name of an internal attribute for loading "
"pretrained vectors, and the class has been passed the old name "
"(pretrained_dims) but not the new name (pretrained_vectors).")
# fmt: on


@ -1,68 +0,0 @@
from cymem.cymem cimport Pool
from .typedefs cimport attr_t
from .syntax.transition_system cimport Transition
from .tokens import Doc
cdef struct GoldParseC:
int* tags
int* heads
int* has_dep
int* sent_start
attr_t* labels
int** brackets
Transition* ner
cdef class GoldParse:
cdef Pool mem
cdef GoldParseC c
cdef readonly TokenAnnotation orig
cdef int length
cdef public int loss
cdef public list words
cdef public list tags
cdef public list pos
cdef public list morphs
cdef public list lemmas
cdef public list sent_starts
cdef public list heads
cdef public list labels
cdef public dict orths
cdef public list ner
cdef public dict brackets
cdef public dict cats
cdef public dict links
cdef readonly list cand_to_gold
cdef readonly list gold_to_cand
cdef class TokenAnnotation:
cdef public list ids
cdef public list words
cdef public list tags
cdef public list pos
cdef public list morphs
cdef public list lemmas
cdef public list heads
cdef public list deps
cdef public list entities
cdef public list sent_starts
cdef public dict brackets_by_start
cdef class DocAnnotation:
cdef public object cats
cdef public object links
cdef class Example:
cdef public object doc
cdef public TokenAnnotation token_annotation
cdef public DocAnnotation doc_annotation
cdef public object goldparse

File diff suppressed because it is too large

0 spacy/gold/__init__.pxd Normal file

11 spacy/gold/__init__.py Normal file
View File

@ -0,0 +1,11 @@
from .corpus import Corpus
from .example import Example
from .align import align
from .iob_utils import iob_to_biluo, biluo_to_iob
from .iob_utils import biluo_tags_from_offsets, offsets_from_biluo_tags
from .iob_utils import spans_from_biluo_tags
from .iob_utils import tags_to_entities
from .gold_io import docs_to_json
from .gold_io import read_json_file

8 spacy/gold/align.pxd Normal file

@ -0,0 +1,8 @@
cdef class Alignment:
cdef public object cost
cdef public object i2j
cdef public object j2i
cdef public object i2j_multi
cdef public object j2i_multi
cdef public object cand_to_gold
cdef public object gold_to_cand

101 spacy/gold/align.pyx Normal file

@ -0,0 +1,101 @@
import numpy
from ..errors import Errors, AlignmentError
cdef class Alignment:
def __init__(self, spacy_words, gold_words):
# Do many-to-one alignment for misaligned tokens.
# If we over-segment, we'll have one gold word that covers a sequence
# of predicted words
# If we under-segment, we'll have one predicted word that covers a
# sequence of gold words.
# If we "mis-segment", we'll have a sequence of predicted words covering
# a sequence of gold words. That's many-to-many -- we don't do that
# except for NER spans where the start and end can be aligned.
cost, i2j, j2i, i2j_multi, j2i_multi = align(spacy_words, gold_words)
self.cost = cost
self.i2j = i2j
self.j2i = j2i
self.i2j_multi = i2j_multi
self.j2i_multi = j2i_multi
self.cand_to_gold = [(j if j >= 0 else None) for j in i2j]
self.gold_to_cand = [(i if i >= 0 else None) for i in j2i]
def align(tokens_a, tokens_b):
"""Calculate alignment tables between two tokenizations.
tokens_a (List[str]): The candidate tokenization.
tokens_b (List[str]): The reference tokenization.
RETURNS: (tuple): A 5-tuple consisting of the following information:
* cost (int): The number of misaligned tokens.
* a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
to `tokens_b[6]`. If there's no one-to-one alignment for a token,
it has the value -1.
* b2a (List[int]): The same as `a2b`, but mapping the other direction.
* a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
the same token of `tokens_b`.
* b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
direction.
"""
tokens_a = _normalize_for_alignment(tokens_a)
tokens_b = _normalize_for_alignment(tokens_b)
cost = 0
a2b = numpy.empty(len(tokens_a), dtype="i")
b2a = numpy.empty(len(tokens_b), dtype="i")
a2b.fill(-1)
b2a.fill(-1)
a2b_multi = {}
b2a_multi = {}
i = 0
j = 0
offset_a = 0
offset_b = 0
while i < len(tokens_a) and j < len(tokens_b):
a = tokens_a[i][offset_a:]
b = tokens_b[j][offset_b:]
if a == b:
if offset_a == offset_b == 0:
a2b[i] = j
b2a[j] = i
elif offset_a == 0:
cost += 2
a2b_multi[i] = j
elif offset_b == 0:
cost += 2
b2a_multi[j] = i
offset_a = offset_b = 0
i += 1
j += 1
elif a == "":
assert offset_a == 0
cost += 1
i += 1
elif b == "":
assert offset_b == 0
cost += 1
j += 1
elif b.startswith(a):
cost += 1
if offset_a == 0:
a2b_multi[i] = j
i += 1
offset_a = 0
offset_b += len(a)
elif a.startswith(b):
cost += 1
if offset_b == 0:
b2a_multi[j] = i
j += 1
offset_b = 0
offset_a += len(b)
else:
assert "".join(tokens_a) != "".join(tokens_b)
raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b))
return cost, a2b, b2a, a2b_multi, b2a_multi
def _normalize_for_alignment(tokens):
return [w.replace(" ", "").lower() for w in tokens]
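# A minimal usage sketch (not part of this file), assuming the import exposed
# by this diff's spacy/gold/__init__.py:
from spacy.gold import align

cost, a2b, b2a, a2b_multi, b2a_multi = align(
    ["I", "like", "New York"],      # candidate tokenization
    ["I", "like", "New", "York"],   # reference tokenization
)
# "I" and "like" align one-to-one (a2b[0] == 0, a2b[1] == 1), while the single
# candidate token "New York" covers two reference tokens, so both reference
# indices map to it through b2a_multi rather than the one-to-one b2a table.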

111 spacy/gold/augment.py Normal file

@ -0,0 +1,111 @@
import random
import itertools
def make_orth_variants_example(nlp, example, orth_variant_level=0.0): # TODO: naming
raw_text = example.text
orig_dict = example.to_dict()
variant_text, variant_token_annot = make_orth_variants(
nlp, raw_text, orig_dict["token_annotation"], orth_variant_level
)
doc = nlp.make_doc(variant_text)
orig_dict["token_annotation"] = variant_token_annot
return example.from_dict(doc, orig_dict)
def make_orth_variants(nlp, raw_text, orig_token_dict, orth_variant_level=0.0):
if random.random() >= orth_variant_level:
return raw_text, orig_token_dict
if not orig_token_dict:
return raw_text, orig_token_dict
raw = raw_text
token_dict = orig_token_dict
lower = False
if random.random() >= 0.5:
lower = True
if raw is not None:
raw = raw.lower()
ndsv = nlp.Defaults.single_orth_variants
ndpv = nlp.Defaults.paired_orth_variants
words = token_dict.get("words", [])
tags = token_dict.get("tags", [])
# keep unmodified if words or tags are not defined
if words and tags:
if lower:
words = [w.lower() for w in words]
# single variants
punct_choices = [random.choice(x["variants"]) for x in ndsv]
for word_idx in range(len(words)):
for punct_idx in range(len(ndsv)):
if (
tags[word_idx] in ndsv[punct_idx]["tags"]
and words[word_idx] in ndsv[punct_idx]["variants"]
):
words[word_idx] = punct_choices[punct_idx]
# paired variants
punct_choices = [random.choice(x["variants"]) for x in ndpv]
for word_idx in range(len(words)):
for punct_idx in range(len(ndpv)):
if tags[word_idx] in ndpv[punct_idx]["tags"] and words[
word_idx
] in itertools.chain.from_iterable(ndpv[punct_idx]["variants"]):
# backup option: random left vs. right from pair
pair_idx = random.choice([0, 1])
# best option: rely on paired POS tags like `` / ''
if len(ndpv[punct_idx]["tags"]) == 2:
pair_idx = ndpv[punct_idx]["tags"].index(tags[word_idx])
# next best option: rely on position in variants
# (may not be unambiguous, so order of variants matters)
else:
for pair in ndpv[punct_idx]["variants"]:
if words[word_idx] in pair:
pair_idx = pair.index(words[word_idx])
words[word_idx] = punct_choices[punct_idx][pair_idx]
token_dict["words"] = words
token_dict["tags"] = tags
# modify raw
if raw is not None:
variants = []
for single_variants in ndsv:
variants.extend(single_variants["variants"])
for paired_variants in ndpv:
variants.extend(
list(itertools.chain.from_iterable(paired_variants["variants"]))
)
# store variants in reverse length order to be able to prioritize
# longer matches (e.g., "---" before "--")
variants = sorted(variants, key=lambda x: len(x))
variants.reverse()
variant_raw = ""
raw_idx = 0
# add initial whitespace
while raw_idx < len(raw) and raw[raw_idx].isspace():
variant_raw += raw[raw_idx]
raw_idx += 1
for word in words:
match_found = False
# skip whitespace words
if word.isspace():
match_found = True
# add identical word
elif word not in variants and raw[raw_idx:].startswith(word):
variant_raw += word
raw_idx += len(word)
match_found = True
# add variant word
else:
for variant in variants:
if not match_found and raw[raw_idx:].startswith(variant):
raw_idx += len(variant)
variant_raw += word
match_found = True
# something went wrong, abort
# (add a warning message?)
if not match_found:
return raw_text, orig_token_dict
# add following whitespace
while raw_idx < len(raw) and raw[raw_idx].isspace():
variant_raw += raw[raw_idx]
raw_idx += 1
raw = variant_raw
return raw, token_dict
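# A rough usage sketch (not part of this file). Whether anything actually
# changes depends on the language's single_orth_variants / paired_orth_variants
# defaults and on random chance; the tags below are only illustrative.
from spacy.gold.augment import make_orth_variants
from spacy.lang.en import English

nlp = English()
raw = "Coffee -- black, please."
token_dict = {
    "words": ["Coffee", "--", "black", ",", "please", "."],
    "tags": ["NN", ":", "JJ", ",", "UH", "."],
}
variant_raw, variant_tokens = make_orth_variants(
    nlp, raw, token_dict, orth_variant_level=1.0
)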


@ -0,0 +1,6 @@
from .iob2docs import iob2docs # noqa: F401
from .conll_ner2docs import conll_ner2docs # noqa: F401
from .json2docs import json2docs
# TODO: Update this one
# from .conllu2docs import conllu2docs # noqa: F401


@ -1,17 +1,18 @@
from wasabi import Printer
from .. import tags_to_entities
from ...gold import iob_to_biluo
from ...lang.xx import MultiLanguage
from ...tokens.doc import Doc
from ...tokens import Doc, Span
from ...util import load_model
def conll_ner2json(
def conll_ner2docs(
input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs
):
"""
Convert files in the CoNLL-2003 NER format and similar
whitespace-separated columns into JSON format for use with train cli.
whitespace-separated columns into Doc objects.
The first column is the tokens, the final column is the IOB tags. If an
additional second column is present, the second column is the tags.
@ -81,17 +82,25 @@ def conll_ner2json(
"No document delimiters found. Use `-n` to automatically group "
"sentences into documents."
)
if model:
nlp = load_model(model)
else:
nlp = MultiLanguage()
output_docs = []
for doc in input_data.strip().split(doc_delimiter):
doc = doc.strip()
if not doc:
for conll_doc in input_data.strip().split(doc_delimiter):
conll_doc = conll_doc.strip()
if not conll_doc:
continue
output_doc = []
for sent in doc.split("\n\n"):
sent = sent.strip()
if not sent:
words = []
sent_starts = []
pos_tags = []
biluo_tags = []
for conll_sent in conll_doc.split("\n\n"):
conll_sent = conll_sent.strip()
if not conll_sent:
continue
lines = [line.strip() for line in sent.split("\n") if line.strip()]
lines = [line.strip() for line in conll_sent.split("\n") if line.strip()]
cols = list(zip(*[line.split() for line in lines]))
if len(cols) < 2:
raise ValueError(
@ -99,25 +108,19 @@ def conll_ner2json(
"Try checking whitespace and delimiters. See "
"https://spacy.io/api/cli#convert"
)
words = cols[0]
iob_ents = cols[-1]
if len(cols) > 2:
tags = cols[1]
else:
tags = ["-"] * len(words)
biluo_ents = iob_to_biluo(iob_ents)
output_doc.append(
{
"tokens": [
{"orth": w, "tag": tag, "ner": ent}
for (w, tag, ent) in zip(words, tags, biluo_ents)
]
}
)
output_docs.append(
{"id": len(output_docs), "paragraphs": [{"sentences": output_doc}]}
)
output_doc = []
length = len(cols[0])
words.extend(cols[0])
sent_starts.extend([True] + [False] * (length - 1))
biluo_tags.extend(iob_to_biluo(cols[-1]))
pos_tags.extend(cols[1] if len(cols) > 2 else ["-"] * length)
doc = Doc(nlp.vocab, words=words)
for i, token in enumerate(doc):
token.tag_ = pos_tags[i]
token.is_sent_start = sent_starts[i]
entities = tags_to_entities(biluo_tags)
doc.ents = [Span(doc, start=s, end=e + 1, label=L) for L, s, e in entities]
output_docs.append(doc)
return output_docs
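# A minimal usage sketch (not part of this file), assuming the converters
# package location in this diff (spacy/gold/converters) and the conventional
# "-DOCSTART- -X- O O" document delimiter; with model=None a blank
# MultiLanguage pipeline is used for tokenization.
from spacy.gold.converters import conll_ner2docs

conll_sample = """
-DOCSTART- -X- O O

I PRP O
like VBP O
London NNP B-GPE
. . O
"""
docs = conll_ner2docs(conll_sample, n_sents=10, no_print=True)
# docs[0] carries the tag column on its tokens and a single GPE entity
# ("London") built from the IOB column.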


@ -1,10 +1,10 @@
import re
from .conll_ner2docs import n_sents_info
from ...gold import Example
from ...gold import iob_to_biluo, spans_from_biluo_tags, biluo_tags_from_offsets
from ...gold import iob_to_biluo, spans_from_biluo_tags
from ...language import Language
from ...tokens import Doc, Token
from .conll_ner2json import n_sents_info
from wasabi import Printer
@ -12,7 +12,6 @@ def conllu2json(
input_data,
n_sents=10,
append_morphology=False,
lang=None,
ner_map=None,
merge_subtokens=False,
no_print=False,
@ -44,10 +43,7 @@ def conllu2json(
raw += example.text
sentences.append(
generate_sentence(
example.token_annotation,
has_ner_tags,
MISC_NER_PATTERN,
ner_map=ner_map,
example.to_dict(), has_ner_tags, MISC_NER_PATTERN, ner_map=ner_map,
)
)
# Real-sized documents could be extracted using the comments on the
@ -145,21 +141,22 @@ def get_entities(lines, tag_pattern, ner_map=None):
return iob_to_biluo(iob)
def generate_sentence(token_annotation, has_ner_tags, tag_pattern, ner_map=None):
def generate_sentence(example_dict, has_ner_tags, tag_pattern, ner_map=None):
sentence = {}
tokens = []
for i, id_ in enumerate(token_annotation.ids):
token_annotation = example_dict["token_annotation"]
for i, id_ in enumerate(token_annotation["ids"]):
token = {}
token["id"] = id_
token["orth"] = token_annotation.get_word(i)
token["tag"] = token_annotation.get_tag(i)
token["pos"] = token_annotation.get_pos(i)
token["lemma"] = token_annotation.get_lemma(i)
token["morph"] = token_annotation.get_morph(i)
token["head"] = token_annotation.get_head(i) - id_
token["dep"] = token_annotation.get_dep(i)
token["orth"] = token_annotation["words"][i]
token["tag"] = token_annotation["tags"][i]
token["pos"] = token_annotation["pos"][i]
token["lemma"] = token_annotation["lemmas"][i]
token["morph"] = token_annotation["morphs"][i]
token["head"] = token_annotation["heads"][i] - i
token["dep"] = token_annotation["deps"][i]
if has_ner_tags:
token["ner"] = token_annotation.get_entity(i)
token["ner"] = example_dict["doc_annotation"]["entities"][i]
tokens.append(token)
sentence["tokens"] = tokens
return sentence
@ -267,40 +264,25 @@ def example_from_conllu_sentence(
doc = merge_conllu_subtokens(lines, doc)
# create Example from custom Doc annotation
ids, words, tags, heads, deps = [], [], [], [], []
pos, lemmas, morphs, spaces = [], [], [], []
words, spaces, tags, morphs, lemmas = [], [], [], [], []
for i, t in enumerate(doc):
ids.append(i)
words.append(t._.merged_orth)
lemmas.append(t._.merged_lemma)
spaces.append(t._.merged_spaceafter)
morphs.append(t._.merged_morph)
if append_morphology and t._.merged_morph:
tags.append(t.tag_ + "__" + t._.merged_morph)
else:
tags.append(t.tag_)
pos.append(t.pos_)
morphs.append(t._.merged_morph)
lemmas.append(t._.merged_lemma)
heads.append(t.head.i)
deps.append(t.dep_)
spaces.append(t._.merged_spaceafter)
ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
ents = biluo_tags_from_offsets(doc, ent_offsets)
raw = ""
for word, space in zip(words, spaces):
raw += word
if space:
raw += " "
example = Example(doc=raw)
example.set_token_annotation(
ids=ids,
words=words,
tags=tags,
pos=pos,
morphs=morphs,
lemmas=lemmas,
heads=heads,
deps=deps,
entities=ents,
)
doc_x = Doc(vocab, words=words, spaces=spaces)
ref_dict = Example(doc_x, reference=doc).to_dict()
ref_dict["words"] = words
ref_dict["lemmas"] = lemmas
ref_dict["spaces"] = spaces
ref_dict["tags"] = tags
ref_dict["morphs"] = morphs
example = Example.from_dict(doc_x, ref_dict)
return example


@ -0,0 +1,64 @@
from wasabi import Printer
from .conll_ner2docs import n_sents_info
from ...gold import iob_to_biluo, tags_to_entities
from ...tokens import Doc, Span
from ...util import minibatch
def iob2docs(input_data, vocab, n_sents=10, no_print=False, *args, **kwargs):
"""
Convert IOB files with one sentence per line and tags separated with '|'
into Doc objects so they can be saved. IOB and IOB2 are accepted.
Sample formats:
I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
"""
msg = Printer(no_print=no_print)
if n_sents > 0:
n_sents_info(msg, n_sents)
docs = read_iob(input_data.split("\n"), vocab, n_sents)
return docs
def read_iob(raw_sents, vocab, n_sents):
docs = []
for group in minibatch(raw_sents, size=n_sents):
tokens = []
words = []
tags = []
iob = []
sent_starts = []
for line in group:
if not line.strip():
continue
sent_tokens = [t.split("|") for t in line.split()]
if len(sent_tokens[0]) == 3:
sent_words, sent_tags, sent_iob = zip(*sent_tokens)
elif len(sent_tokens[0]) == 2:
sent_words, sent_iob = zip(*sent_tokens)
sent_tags = ["-"] * len(sent_words)
else:
raise ValueError(
"The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert"
)
words.extend(sent_words)
tags.extend(sent_tags)
iob.extend(sent_iob)
tokens.extend(sent_tokens)
sent_starts.append(True)
sent_starts.extend([False for _ in sent_words[1:]])
doc = Doc(vocab, words=words)
for i, tag in enumerate(tags):
doc[i].tag_ = tag
for i, sent_start in enumerate(sent_starts):
doc[i].is_sent_start = sent_start
biluo = iob_to_biluo(iob)
entities = tags_to_entities(biluo)
doc.ents = [Span(doc, start=s, end=e+1, label=L) for (L, s, e) in entities]
docs.append(doc)
return docs
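# A minimal usage sketch (not part of this file), using the three-column
# word|tag|IOB variant from the docstring above with a blank English vocab.
from spacy.gold.converters import iob2docs
from spacy.lang.en import English

iob_line = "I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O Paris|NNP|B-GPE .|.|O"
docs = iob2docs(iob_line, English().vocab, n_sents=10, no_print=True)
# docs[0].ents should hold two GPE spans ("London" and "Paris") derived from
# the IOB column after conversion to BILUO tags.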


@ -0,0 +1,24 @@
import srsly
from ..gold_io import json_iterate, json_to_annotations
from ..example import annotations2doc
from ..example import _fix_legacy_dict_data, _parse_example_dict_data
from ...util import load_model
from ...lang.xx import MultiLanguage
def json2docs(input_data, model=None, **kwargs):
nlp = load_model(model) if model is not None else MultiLanguage()
if not isinstance(input_data, bytes):
if not isinstance(input_data, str):
input_data = srsly.json_dumps(input_data)
input_data = input_data.encode("utf8")
docs = []
for json_doc in json_iterate(input_data):
for json_para in json_to_annotations(json_doc):
example_dict = _fix_legacy_dict_data(json_para)
tok_dict, doc_dict = _parse_example_dict_data(example_dict)
if json_para.get("raw"):
assert tok_dict.get("SPACY")
doc = annotations2doc(nlp.vocab, tok_dict, doc_dict)
docs.append(doc)
return docs
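# A rough round-trip sketch (not part of this file): serialize a Doc with
# docs_to_json() from gold_io and read it back through json2docs(). Sentence
# starts are set by hand because docs_to_json() iterates over doc.sents.
from spacy.gold import docs_to_json
from spacy.gold.converters import json2docs
from spacy.lang.en import English

nlp = English()
doc = nlp.make_doc("I like London")
for token in doc:
    token.is_sent_start = token.i == 0
doc.ents = [doc.char_span(7, 13, label="GPE")]
json_doc = docs_to_json([doc])
docs = json2docs([json_doc])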

122 spacy/gold/corpus.py Normal file

@ -0,0 +1,122 @@
import random
from .. import util
from .example import Example
from ..tokens import DocBin, Doc
class Corpus:
"""An annotated corpus, reading train and dev datasets from
the DocBin (.spacy) format.
DOCS: https://spacy.io/api/goldcorpus
"""
def __init__(self, train_loc, dev_loc, limit=0):
"""Create a Corpus.
train (str / Path): File or directory of training data.
dev (str / Path): File or directory of development data.
limit (int): Max. number of examples returned
RETURNS (Corpus): The newly created object.
"""
self.train_loc = train_loc
self.dev_loc = dev_loc
self.limit = limit
@staticmethod
def walk_corpus(path):
path = util.ensure_path(path)
if not path.is_dir():
return [path]
paths = [path]
locs = []
seen = set()
for path in paths:
if str(path) in seen:
continue
seen.add(str(path))
if path.parts[-1].startswith("."):
continue
elif path.is_dir():
paths.extend(path.iterdir())
elif path.parts[-1].endswith(".spacy"):
locs.append(path)
return locs
def make_examples(self, nlp, reference_docs, max_length=0):
for reference in reference_docs:
if len(reference) >= max_length >= 1:
if reference.is_sentenced:
for ref_sent in reference.sents:
yield Example(
nlp.make_doc(ref_sent.text),
ref_sent.as_doc()
)
else:
yield Example(
nlp.make_doc(reference.text),
reference
)
def make_examples_gold_preproc(self, nlp, reference_docs):
for reference in reference_docs:
if reference.is_sentenced:
ref_sents = [sent.as_doc() for sent in reference.sents]
else:
ref_sents = [reference]
for ref_sent in ref_sents:
yield Example(
Doc(
nlp.vocab,
words=[w.text for w in ref_sent],
spaces=[bool(w.whitespace_) for w in ref_sent]
),
ref_sent
)
def read_docbin(self, vocab, locs):
""" Yield training examples as example dicts """
i = 0
for loc in locs:
loc = util.ensure_path(loc)
if loc.parts[-1].endswith(".spacy"):
with loc.open("rb") as file_:
doc_bin = DocBin().from_bytes(file_.read())
docs = doc_bin.get_docs(vocab)
for doc in docs:
if len(doc):
yield doc
i += 1
if self.limit >= 1 and i >= self.limit:
break
def count_train(self, nlp):
"""Returns count of words in train examples"""
n = 0
i = 0
for example in self.train_dataset(nlp):
n += len(example.predicted)
if self.limit >= 0 and i >= self.limit:
break
i += 1
return n
def train_dataset(self, nlp, *, shuffle=True, gold_preproc=False,
max_length=0, **kwargs):
ref_docs = self.read_docbin(nlp.vocab, self.walk_corpus(self.train_loc))
if gold_preproc:
examples = self.make_examples_gold_preproc(nlp, ref_docs)
else:
examples = self.make_examples(nlp, ref_docs, max_length)
if shuffle:
examples = list(examples)
random.shuffle(examples)
yield from examples
def dev_dataset(self, nlp, *, gold_preproc=False, **kwargs):
ref_docs = self.read_docbin(nlp.vocab, self.walk_corpus(self.dev_loc))
if gold_preproc:
examples = self.make_examples_gold_preproc(nlp, ref_docs)
else:
examples = self.make_examples(nlp, ref_docs, max_length=0)
yield from examples
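# A minimal usage sketch (not part of this file). "train.spacy" and "dev.spacy"
# are hypothetical files containing serialized DocBin data.
from spacy.gold import Corpus
from spacy.lang.en import English

nlp = English()
corpus = Corpus("train.spacy", "dev.spacy", limit=0)
train_examples = list(corpus.train_dataset(nlp, gold_preproc=False, max_length=0))
dev_examples = list(corpus.dev_dataset(nlp, gold_preproc=False))
# Each item is an Example pairing a freshly tokenized `predicted` doc with the
# annotated `reference` doc read from the DocBin.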

8 spacy/gold/example.pxd Normal file

@ -0,0 +1,8 @@
from ..tokens.doc cimport Doc
from .align cimport Alignment
cdef class Example:
cdef readonly Doc x
cdef readonly Doc y
cdef readonly Alignment _alignment

432 spacy/gold/example.pyx Normal file

@ -0,0 +1,432 @@
import warnings
import numpy
from ..tokens.doc cimport Doc
from ..tokens.span cimport Span
from ..tokens.span import Span
from ..attrs import IDS
from .align cimport Alignment
from .iob_utils import biluo_to_iob, biluo_tags_from_offsets, biluo_tags_from_doc
from .iob_utils import spans_from_biluo_tags
from .align import Alignment
from ..errors import Errors, Warnings
from ..syntax import nonproj
cpdef Doc annotations2doc(vocab, tok_annot, doc_annot):
""" Create a Doc from dictionaries with token and doc annotations. Assumes ORTH & SPACY are set. """
attrs, array = _annot2array(vocab, tok_annot, doc_annot)
output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"])
if "entities" in doc_annot:
_add_entities_to_doc(output, doc_annot["entities"])
if array.size:
output = output.from_array(attrs, array)
# links are currently added with ENT_KB_ID on the token level
output.cats.update(doc_annot.get("cats", {}))
return output
cdef class Example:
def __init__(self, Doc predicted, Doc reference, *, Alignment alignment=None):
""" Doc can either be text, or an actual Doc """
if predicted is None:
raise TypeError(Errors.E972.format(arg="predicted"))
if reference is None:
raise TypeError(Errors.E972.format(arg="reference"))
self.x = predicted
self.y = reference
self._alignment = alignment
property predicted:
def __get__(self):
return self.x
def __set__(self, doc):
self.x = doc
property reference:
def __get__(self):
return self.y
def __set__(self, doc):
self.y = doc
def copy(self):
return Example(
self.x.copy(),
self.y.copy()
)
@classmethod
def from_dict(cls, Doc predicted, dict example_dict):
if example_dict is None:
raise ValueError(Errors.E976)
if not isinstance(predicted, Doc):
raise TypeError(Errors.E975.format(type=type(predicted)))
example_dict = _fix_legacy_dict_data(example_dict)
tok_dict, doc_dict = _parse_example_dict_data(example_dict)
if "ORTH" not in tok_dict:
tok_dict["ORTH"] = [tok.text for tok in predicted]
tok_dict["SPACY"] = [tok.whitespace_ for tok in predicted]
if not _has_field(tok_dict, "SPACY"):
spaces = _guess_spaces(predicted.text, tok_dict["ORTH"])
return Example(
predicted,
annotations2doc(predicted.vocab, tok_dict, doc_dict)
)
@property
def alignment(self):
if self._alignment is None:
spacy_words = [token.orth_ for token in self.predicted]
gold_words = [token.orth_ for token in self.reference]
if gold_words == []:
gold_words = spacy_words
self._alignment = Alignment(spacy_words, gold_words)
return self._alignment
def get_aligned(self, field, as_string=False):
"""Return an aligned array for a token attribute."""
i2j_multi = self.alignment.i2j_multi
cand_to_gold = self.alignment.cand_to_gold
vocab = self.reference.vocab
gold_values = self.reference.to_array([field])
output = [None] * len(self.predicted)
for i, gold_i in enumerate(cand_to_gold):
if self.predicted[i].text.isspace():
output[i] = None
if gold_i is None:
if i in i2j_multi:
output[i] = gold_values[i2j_multi[i]]
else:
output[i] = None
else:
output[i] = gold_values[gold_i]
if as_string and field not in ["ENT_IOB", "SENT_START"]:
output = [vocab.strings[o] if o is not None else o for o in output]
return output
def get_aligned_parse(self, projectivize=True):
cand_to_gold = self.alignment.cand_to_gold
gold_to_cand = self.alignment.gold_to_cand
aligned_heads = [None] * self.x.length
aligned_deps = [None] * self.x.length
heads = [token.head.i for token in self.y]
deps = [token.dep_ for token in self.y]
if projectivize:
heads, deps = nonproj.projectivize(heads, deps)
for cand_i in range(self.x.length):
gold_i = cand_to_gold[cand_i]
if gold_i is not None: # Alignment found
gold_head = gold_to_cand[heads[gold_i]]
if gold_head is not None:
aligned_heads[cand_i] = gold_head
aligned_deps[cand_i] = deps[gold_i]
return aligned_heads, aligned_deps
def get_aligned_ner(self):
if not self.y.is_nered:
return [None] * len(self.x) # should this be 'missing' instead of 'None' ?
x_text = self.x.text
# Get a list of entities, and make spans for non-entity tokens.
# We then work through the spans in order, trying to find them in
# the text and using that to get the offset. Any token that doesn't
# get a tag set this way is tagged None.
# This could maybe be improved? It at least feels easy to reason about.
y_spans = list(self.y.ents)
y_spans.sort()
x_text_offset = 0
x_spans = []
for y_span in y_spans:
if x_text.count(y_span.text) >= 1:
start_char = x_text.index(y_span.text) + x_text_offset
end_char = start_char + len(y_span.text)
x_span = self.x.char_span(start_char, end_char, label=y_span.label)
if x_span is not None:
x_spans.append(x_span)
x_text = self.x.text[end_char:]
x_text_offset = end_char
x_tags = biluo_tags_from_offsets(
self.x,
[(e.start_char, e.end_char, e.label_) for e in x_spans],
missing=None
)
gold_to_cand = self.alignment.gold_to_cand
for token in self.y:
if token.ent_iob_ == "O":
cand_i = gold_to_cand[token.i]
if cand_i is not None and x_tags[cand_i] is None:
x_tags[cand_i] = "O"
i2j_multi = self.alignment.i2j_multi
for i, tag in enumerate(x_tags):
if tag is None and i in i2j_multi:
gold_i = i2j_multi[i]
if gold_i is not None and self.y[gold_i].ent_iob_ == "O":
x_tags[i] = "O"
return x_tags
def to_dict(self):
return {
"doc_annotation": {
"cats": dict(self.reference.cats),
"entities": biluo_tags_from_doc(self.reference),
"links": self._links_to_dict()
},
"token_annotation": {
"ids": [t.i+1 for t in self.reference],
"words": [t.text for t in self.reference],
"tags": [t.tag_ for t in self.reference],
"lemmas": [t.lemma_ for t in self.reference],
"pos": [t.pos_ for t in self.reference],
"morphs": [t.morph_ for t in self.reference],
"heads": [t.head.i for t in self.reference],
"deps": [t.dep_ for t in self.reference],
"sent_starts": [int(bool(t.is_sent_start)) for t in self.reference]
}
}
def _links_to_dict(self):
links = {}
for ent in self.reference.ents:
if ent.kb_id_:
links[(ent.start_char, ent.end_char)] = {ent.kb_id_: 1.0}
return links
def split_sents(self):
""" Split the token annotations into multiple Examples based on
sent_starts and return a list of the new Examples"""
if not self.reference.is_sentenced:
return [self]
sent_starts = self.get_aligned("SENT_START")
sent_starts.append(1) # appending virtual start of a next sentence to facilitate search
output = []
pred_start = 0
for sent in self.reference.sents:
new_ref = sent.as_doc()
pred_end = sent_starts.index(1, pred_start+1) # find where the next sentence starts
new_pred = self.predicted[pred_start : pred_end].as_doc()
output.append(Example(new_pred, new_ref))
pred_start = pred_end
return output
property text:
def __get__(self):
return self.x.text
def __str__(self):
return str(self.to_dict())
def __repr__(self):
return str(self.to_dict())
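# A minimal usage sketch (not part of this file): building an Example from a
# predicted Doc plus a legacy-style annotation dict. The keys used here are the
# legacy names remapped by _fix_legacy_dict_data() below.
from spacy.gold import Example
from spacy.lang.en import English

nlp = English()
predicted = nlp.make_doc("I like London")
example = Example.from_dict(predicted, {
    "words": ["I", "like", "London"],
    "entities": [(7, 13, "GPE")],
})
# example.reference now holds a GPE span over "London"; example.predicted is
# the tokenization the model will actually see during training.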
def _annot2array(vocab, tok_annot, doc_annot):
attrs = []
values = []
for key, value in doc_annot.items():
if value:
if key == "entities":
pass
elif key == "links":
entities = doc_annot.get("entities", {})
if not entities:
raise ValueError(Errors.E981)
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], value, entities)
tok_annot["ENT_KB_ID"] = ent_kb_ids
elif key == "cats":
pass
else:
raise ValueError(Errors.E974.format(obj="doc", key=key))
for key, value in tok_annot.items():
if key not in IDS:
raise ValueError(Errors.E974.format(obj="token", key=key))
elif key in ["ORTH", "SPACY"]:
pass
elif key == "HEAD":
attrs.append(key)
values.append([h-i for i, h in enumerate(value)])
elif key == "SENT_START":
attrs.append(key)
values.append(value)
elif key == "MORPH":
attrs.append(key)
values.append([vocab.morphology.add(v) for v in value])
else:
attrs.append(key)
values.append([vocab.strings.add(v) for v in value])
array = numpy.asarray(values, dtype="uint64")
return attrs, array.T
def _add_entities_to_doc(doc, ner_data):
if ner_data is None:
return
elif ner_data == []:
doc.ents = []
elif isinstance(ner_data[0], tuple):
return _add_entities_to_doc(
doc,
biluo_tags_from_offsets(doc, ner_data)
)
elif isinstance(ner_data[0], str) or ner_data[0] is None:
return _add_entities_to_doc(
doc,
spans_from_biluo_tags(doc, ner_data)
)
elif isinstance(ner_data[0], Span):
# Ugh, this is super messy. Really hard to set O entities
doc.ents = ner_data
doc.ents = [span for span in ner_data if span.label_]
else:
raise ValueError(Errors.E973)
def _parse_example_dict_data(example_dict):
return (
example_dict["token_annotation"],
example_dict["doc_annotation"]
)
def _fix_legacy_dict_data(example_dict):
token_dict = example_dict.get("token_annotation", {})
doc_dict = example_dict.get("doc_annotation", {})
for key, value in example_dict.items():
if value:
if key in ("token_annotation", "doc_annotation"):
pass
elif key == "ids":
pass
elif key in ("cats", "links"):
doc_dict[key] = value
elif key in ("ner", "entities"):
doc_dict["entities"] = value
else:
token_dict[key] = value
# Remap keys
remapping = {
"words": "ORTH",
"tags": "TAG",
"pos": "POS",
"lemmas": "LEMMA",
"deps": "DEP",
"heads": "HEAD",
"sent_starts": "SENT_START",
"morphs": "MORPH",
"spaces": "SPACY",
}
old_token_dict = token_dict
token_dict = {}
for key, value in old_token_dict.items():
if key in ("text", "ids", "brackets"):
pass
elif key in remapping:
token_dict[remapping[key]] = value
else:
raise KeyError(Errors.E983.format(key=key, dict="token_annotation", keys=remapping.keys()))
text = example_dict.get("text", example_dict.get("raw"))
if _has_field(token_dict, "ORTH") and not _has_field(token_dict, "SPACY"):
token_dict["SPACY"] = _guess_spaces(text, token_dict["ORTH"])
if "HEAD" in token_dict and "SENT_START" in token_dict:
# If heads are set, we don't also redundantly specify SENT_START.
token_dict.pop("SENT_START")
warnings.warn(Warnings.W092)
return {
"token_annotation": token_dict,
"doc_annotation": doc_dict
}
def _has_field(annot, field):
if field not in annot:
return False
elif annot[field] is None:
return False
elif len(annot[field]) == 0:
return False
elif all([value is None for value in annot[field]]):
return False
else:
return True
def _parse_ner_tags(biluo_or_offsets, vocab, words, spaces):
if isinstance(biluo_or_offsets[0], (list, tuple)):
# Convert to biluo if necessary
# This is annoying but to convert the offsets we need a Doc
# that has the target tokenization.
reference = Doc(vocab, words=words, spaces=spaces)
biluo = biluo_tags_from_offsets(reference, biluo_or_offsets)
else:
biluo = biluo_or_offsets
ent_iobs = []
ent_types = []
for iob_tag in biluo_to_iob(biluo):
if iob_tag in (None, "-"):
ent_iobs.append("")
ent_types.append("")
else:
ent_iobs.append(iob_tag.split("-")[0])
if iob_tag.startswith("I") or iob_tag.startswith("B"):
ent_types.append(iob_tag.split("-", 1)[1])
else:
ent_types.append("")
return ent_iobs, ent_types
def _parse_links(vocab, words, links, entities):
reference = Doc(vocab, words=words)
starts = {token.idx: token.i for token in reference}
ends = {token.idx + len(token): token.i for token in reference}
ent_kb_ids = ["" for _ in reference]
entity_map = [(ent[0], ent[1]) for ent in entities]
# links annotations need to refer 1:1 to entity annotations - raise an error otherwise
for index, annot_dict in links.items():
start_char, end_char = index
if (start_char, end_char) not in entity_map:
raise ValueError(Errors.E981)
for index, annot_dict in links.items():
true_kb_ids = []
for key, value in annot_dict.items():
if value == 1.0:
true_kb_ids.append(key)
if len(true_kb_ids) > 1:
raise ValueError(Errors.E980)
if len(true_kb_ids) == 1:
start_char, end_char = index
start_token = starts.get(start_char)
end_token = ends.get(end_char)
for i in range(start_token, end_token+1):
ent_kb_ids[i] = true_kb_ids[0]
return ent_kb_ids
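# Usage sketch (illustrative): the "links" annotation format consumed by
# _parse_links above. Keys are (start_char, end_char) offsets that must match
# an entity annotation 1:1; each value maps candidate KB IDs to a gold score,
# with exactly one entry set to 1.0. "Q84" and "Q100" are hypothetical KB IDs.
doc_annotation = {
    "entities": [(7, 13, "GPE")],                   # "London" in "I like London"
    "links": {(7, 13): {"Q84": 1.0, "Q100": 0.0}},  # gold KB ID is "Q84"
}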
def _guess_spaces(text, words):
if text is None:
return [True] * len(words)
spaces = []
text_pos = 0
# align words with text
for word in words:
try:
word_start = text[text_pos:].index(word)
except ValueError:
spaces.append(True)
continue
text_pos += word_start + len(word)
if text_pos < len(text) and text[text_pos] == " ":
spaces.append(True)
else:
spaces.append(False)
return spaces

spacy/gold/gold_io.pyx (new file)

@ -0,0 +1,199 @@
import warnings
import srsly
from .. import util
from ..errors import Warnings
from ..tokens import Doc
from .iob_utils import biluo_tags_from_offsets, tags_to_entities
import json
def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
"""Convert a list of Doc objects into the JSON-serializable format used by
the spacy train command.
docs (iterable / Doc): The Doc object(s) to convert.
doc_id (int): Id for the JSON.
RETURNS (dict): The data in spaCy's JSON format
- each input doc will be treated as a paragraph in the output doc
"""
if isinstance(docs, Doc):
docs = [docs]
json_doc = {"id": doc_id, "paragraphs": []}
for i, doc in enumerate(docs):
json_para = {'raw': doc.text, "sentences": [], "cats": [], "entities": [], "links": []}
for cat, val in doc.cats.items():
json_cat = {"label": cat, "value": val}
json_para["cats"].append(json_cat)
for ent in doc.ents:
ent_tuple = (ent.start_char, ent.end_char, ent.label_)
json_para["entities"].append(ent_tuple)
if ent.kb_id_:
link_dict = {(ent.start_char, ent.end_char): {ent.kb_id_: 1.0}}
json_para["links"].append(link_dict)
ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
biluo_tags = biluo_tags_from_offsets(doc, ent_offsets, missing=ner_missing_tag)
for j, sent in enumerate(doc.sents):
json_sent = {"tokens": [], "brackets": []}
for token in sent:
json_token = {"id": token.i, "orth": token.text, "space": token.whitespace_}
if doc.is_tagged:
json_token["tag"] = token.tag_
json_token["pos"] = token.pos_
json_token["morph"] = token.morph_
json_token["lemma"] = token.lemma_
if doc.is_parsed:
json_token["head"] = token.head.i-token.i
json_token["dep"] = token.dep_
json_sent["tokens"].append(json_token)
json_para["sentences"].append(json_sent)
json_doc["paragraphs"].append(json_para)
return json_doc
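# Usage sketch (illustrative): converting a parsed Doc into the JSON training
# format. Assumes an English pipeline with a parser and NER (en_core_web_sm) is
# installed, and that docs_to_json is re-exported from spacy.gold as in earlier
# releases; it lives in spacy/gold/gold_io.pyx in this commit.
import spacy
from spacy.gold import docs_to_json

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like London. It is big.")
json_data = docs_to_json([doc], doc_id=0)
para = json_data["paragraphs"][0]
print(len(para["sentences"]))  # 2 -- one entry per sentence in the paragraph
print(para["entities"])        # (start_char, end_char, label) triples from doc.ents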
def read_json_file(loc, docs_filter=None, limit=None):
"""Read Example dictionaries from a json file or directory."""
loc = util.ensure_path(loc)
if loc.is_dir():
for filename in loc.iterdir():
yield from read_json_file(loc / filename, limit=limit)
else:
with loc.open("rb") as file_:
utf8_str = file_.read()
for json_doc in json_iterate(utf8_str):
if docs_filter is not None and not docs_filter(json_doc):
continue
for json_paragraph in json_to_annotations(json_doc):
yield json_paragraph
def json_to_annotations(doc):
"""Convert an item in the JSON-formatted training data to the format
used by Example.
doc (dict): One entry in the training data.
YIELDS (dict): The reformatted data - one training example per paragraph
"""
for paragraph in doc["paragraphs"]:
example = {"text": paragraph.get("raw", None)}
words = []
spaces = []
ids = []
tags = []
ner_tags = []
pos = []
morphs = []
lemmas = []
heads = []
labels = []
sent_starts = []
brackets = []
for sent in paragraph["sentences"]:
sent_start_i = len(words)
for i, token in enumerate(sent["tokens"]):
words.append(token["orth"])
spaces.append(token.get("space", None))
ids.append(token.get('id', sent_start_i + i))
tags.append(token.get("tag", None))
pos.append(token.get("pos", None))
morphs.append(token.get("morph", None))
lemmas.append(token.get("lemma", None))
if "head" in token:
heads.append(token["head"] + sent_start_i + i)
else:
heads.append(None)
if "dep" in token:
labels.append(token["dep"])
# Accept the ROOT label case-insensitively and normalize it
if labels[-1].lower() == "root":
labels[-1] = "ROOT"
else:
labels.append(None)
ner_tags.append(token.get("ner", None))
if i == 0:
sent_starts.append(1)
else:
sent_starts.append(0)
if "brackets" in sent:
brackets.extend((b["first"] + sent_start_i,
b["last"] + sent_start_i, b["label"])
for b in sent["brackets"])
example["token_annotation"] = dict(
ids=ids,
words=words,
spaces=spaces,
sent_starts=sent_starts,
brackets=brackets
)
# avoid including dummy values that look like gold info was present
if any(tags):
example["token_annotation"]["tags"] = tags
if any(pos):
example["token_annotation"]["pos"] = pos
if any(morphs):
example["token_annotation"]["morphs"] = morphs
if any(lemmas):
example["token_annotation"]["lemmas"] = lemmas
if any(head is not None for head in heads):
example["token_annotation"]["heads"] = heads
if any(labels):
example["token_annotation"]["deps"] = labels
cats = {}
for cat in paragraph.get("cats", {}):
cats[cat["label"]] = cat["value"]
example["doc_annotation"] = dict(
cats=cats,
entities=ner_tags,
links=paragraph.get("links", [])
)
yield example
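# Usage sketch (illustrative): the shape of the examples yielded for a minimal
# JSON doc. Assumes json_to_annotations is importable from the compiled
# spacy.gold.gold_io module.
from spacy.gold.gold_io import json_to_annotations

json_doc = {
    "id": 0,
    "paragraphs": [{
        "raw": "I like London.",
        "sentences": [{"tokens": [
            {"id": 0, "orth": "I", "ner": "O"},
            {"id": 1, "orth": "like", "ner": "O"},
            {"id": 2, "orth": "London", "ner": "U-GPE"},
            {"id": 3, "orth": ".", "ner": "O"},
        ]}],
    }],
}
example = next(json_to_annotations(json_doc))
print(example["token_annotation"]["words"])   # ['I', 'like', 'London', '.']
print(example["doc_annotation"]["entities"])  # ['O', 'O', 'U-GPE', 'O']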
def json_iterate(bytes utf8_str):
# We should've made these files jsonl...But since we didn't, parse out
# the docs one-by-one to reduce memory usage.
# It's okay to read in the whole file -- just don't parse it into JSON.
cdef long file_length = len(utf8_str)
if file_length > 2 ** 30:
warnings.warn(Warnings.W027.format(size=file_length))
raw = <char*>utf8_str
cdef int square_depth = 0
cdef int curly_depth = 0
cdef int inside_string = 0
cdef int escape = 0
cdef long start = -1
cdef char c
cdef char quote = ord('"')
cdef char backslash = ord("\\")
cdef char open_square = ord("[")
cdef char close_square = ord("]")
cdef char open_curly = ord("{")
cdef char close_curly = ord("}")
for i in range(file_length):
c = raw[i]
if escape:
escape = False
continue
if c == backslash:
escape = True
continue
if c == quote:
inside_string = not inside_string
continue
if inside_string:
continue
if c == open_square:
square_depth += 1
elif c == close_square:
square_depth -= 1
elif c == open_curly:
if square_depth == 1 and curly_depth == 0:
start = i
curly_depth += 1
elif c == close_curly:
curly_depth -= 1
if square_depth == 1 and curly_depth == 0:
substr = utf8_str[start : i + 1].decode("utf8")
yield srsly.json_loads(substr)
start = -1
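# Usage sketch (illustrative): json_iterate streams the top-level documents out
# of one large JSON list without parsing the whole file at once. Assumes the
# compiled spacy.gold.gold_io module exposes it.
from spacy.gold.gold_io import json_iterate

raw = b'[{"id": 0, "paragraphs": []}, {"id": 1, "paragraphs": []}]'
for json_doc in json_iterate(raw):
    print(json_doc["id"])  # 0, then 1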

spacy/gold/iob_utils.py (new file)

@ -0,0 +1,209 @@
import warnings
from ..errors import Errors, Warnings
from ..tokens import Span
def iob_to_biluo(tags):
out = []
tags = list(tags)
while tags:
out.extend(_consume_os(tags))
out.extend(_consume_ent(tags))
return out
def biluo_to_iob(tags):
out = []
for tag in tags:
if tag is None:
out.append(tag)
else:
tag = tag.replace("U-", "B-", 1).replace("L-", "I-", 1)
out.append(tag)
return out
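# Usage sketch (illustrative): round-tripping a small tag sequence between the
# IOB and BILUO schemes with the two helpers above.
from spacy.gold.iob_utils import iob_to_biluo, biluo_to_iob

iob = ["O", "B-LOC", "I-LOC", "O", "B-PER"]
biluo = iob_to_biluo(iob)
print(biluo)                # ['O', 'B-LOC', 'L-LOC', 'O', 'U-PER']
print(biluo_to_iob(biluo))  # ['O', 'B-LOC', 'I-LOC', 'O', 'B-PER']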
def _consume_os(tags):
while tags and tags[0] == "O":
yield tags.pop(0)
def _consume_ent(tags):
if not tags:
return []
tag = tags.pop(0)
target_in = "I" + tag[1:]
target_last = "L" + tag[1:]
length = 1
while tags and tags[0] in {target_in, target_last}:
length += 1
tags.pop(0)
label = tag[2:]
if length == 1:
if len(label) == 0:
raise ValueError(Errors.E177.format(tag=tag))
return ["U-" + label]
else:
start = "B-" + label
end = "L-" + label
middle = [f"I-{label}" for _ in range(1, length - 1)]
return [start] + middle + [end]
def biluo_tags_from_doc(doc, missing="O"):
return biluo_tags_from_offsets(
doc,
[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents],
missing=missing,
)
def biluo_tags_from_offsets(doc, entities, missing="O"):
"""Encode labelled spans into per-token tags, using the
Begin/In/Last/Unit/Out scheme (BILUO).
doc (Doc): The document that the entity offsets refer to. The output tags
will refer to the token boundaries within the document.
entities (iterable): A sequence of `(start, end, label)` triples. `start`
and `end` should be character-offset integers denoting the slice into
the original string.
RETURNS (list): A list of unicode strings, describing the tags. Each tag
string will be of the form either "", "O" or "{action}-{label}", where
action is one of "B", "I", "L", "U". The string "-" is used where the
entity offsets don't align with the tokenization in the `Doc` object.
The training algorithm will view these as missing values. "O" denotes a
non-entity token. "B" denotes the beginning of a multi-token entity,
"I" the inside of an entity of three or more tokens, and "L" the end
of an entity of two or more tokens. "U" denotes a single-token entity.
EXAMPLE:
>>> text = 'I like London.'
>>> entities = [(len('I like '), len('I like London'), 'LOC')]
>>> doc = nlp.tokenizer(text)
>>> tags = biluo_tags_from_offsets(doc, entities)
>>> assert tags == ["O", "O", 'U-LOC', "O"]
"""
# Ensure no overlapping entity labels exist
tokens_in_ents = {}
starts = {token.idx: token.i for token in doc}
ends = {token.idx + len(token): token.i for token in doc}
biluo = ["-" for _ in doc]
# Handle entity cases
for start_char, end_char, label in entities:
if not label:
for s in starts: # account for many-to-one
if s >= start_char and s < end_char:
biluo[starts[s]] = "O"
else:
for token_index in range(start_char, end_char):
if token_index in tokens_in_ents.keys():
raise ValueError(
Errors.E103.format(
span1=(
tokens_in_ents[token_index][0],
tokens_in_ents[token_index][1],
tokens_in_ents[token_index][2],
),
span2=(start_char, end_char, label),
)
)
tokens_in_ents[token_index] = (start_char, end_char, label)
start_token = starts.get(start_char)
end_token = ends.get(end_char)
# Only interested if the tokenization is correct
if start_token is not None and end_token is not None:
if start_token == end_token:
biluo[start_token] = f"U-{label}"
else:
biluo[start_token] = f"B-{label}"
for i in range(start_token + 1, end_token):
biluo[i] = f"I-{label}"
biluo[end_token] = f"L-{label}"
# Now distinguish the O cases from ones where we miss the tokenization
entity_chars = set()
for start_char, end_char, label in entities:
for i in range(start_char, end_char):
entity_chars.add(i)
for token in doc:
for i in range(token.idx, token.idx + len(token)):
if i in entity_chars:
break
else:
biluo[token.i] = missing
if "-" in biluo and missing != "-":
ent_str = str(entities)
warnings.warn(
Warnings.W030.format(
text=doc.text[:50] + "..." if len(doc.text) > 50 else doc.text,
entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str,
)
)
return biluo
def spans_from_biluo_tags(doc, tags):
"""Encode per-token tags following the BILUO scheme into Span object, e.g.
to overwrite the doc.ents.
doc (Doc): The document that the BILUO tags refer to.
tags (iterable): A sequence of BILUO tags with each tag describing one
token. Each tag string will be of the form of either "", "O" or
"{action}-{label}", where action is one of "B", "I", "L", "U".
RETURNS (list): A sequence of Span objects.
"""
token_offsets = tags_to_entities(tags)
spans = []
for label, start_idx, end_idx in token_offsets:
span = Span(doc, start_idx, end_idx + 1, label=label)
spans.append(span)
return spans
def offsets_from_biluo_tags(doc, tags):
"""Encode per-token tags following the BILUO scheme into entity offsets.
doc (Doc): The document that the BILUO tags refer to.
tags (iterable): A sequence of BILUO tags with each tag describing one
token. Each tag string will be of the form of either "", "O" or
"{action}-{label}", where action is one of "B", "I", "L", "U".
RETURNS (list): A sequence of `(start, end, label)` triples. `start` and
`end` will be character-offset integers denoting the slice into the
original string.
"""
spans = spans_from_biluo_tags(doc, tags)
return [(span.start_char, span.end_char, span.label_) for span in spans]
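# Usage sketch (illustrative): decoding BILUO tags back into character offsets
# with a blank English tokenizer.
import spacy
from spacy.gold.iob_utils import offsets_from_biluo_tags

nlp = spacy.blank("en")
doc = nlp.tokenizer("New York City")
tags = ["B-GPE", "I-GPE", "L-GPE"]
print(offsets_from_biluo_tags(doc, tags))  # [(0, 13, 'GPE')]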
def tags_to_entities(tags):
""" Note that the end index returned by this function is inclusive.
To use it for Span creation, increment the end by 1."""
entities = []
start = None
for i, tag in enumerate(tags):
if tag is None:
continue
if tag.startswith("O"):
# TODO: We shouldn't be getting these malformed inputs. Fix this.
if start is not None:
start = None
else:
entities.append(("", i, i))
continue
elif tag == "-":
continue
elif tag.startswith("I"):
if start is None:
raise ValueError(Errors.E067.format(tags=tags[: i + 1]))
continue
if tag.startswith("U"):
entities.append((tag[2:], i, i))
elif tag.startswith("B"):
start = i
elif tag.startswith("L"):
entities.append((tag[2:], start, i))
start = None
else:
raise ValueError(Errors.E068.format(tag=tag))
return entities
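# Usage sketch (illustrative): the returned token end indices are inclusive, so
# add 1 when building Span objects, as spans_from_biluo_tags does above.
from spacy.gold.iob_utils import tags_to_entities

print(tags_to_entities(["B-PER", "L-PER", "U-LOC"]))  # [('PER', 0, 1), ('LOC', 2, 2)]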


@ -446,6 +446,8 @@ cdef class Writer:
assert not path.isdir(loc), f"{loc} is directory"
if isinstance(loc, Path):
loc = bytes(loc)
if path.exists(loc):
assert not path.isdir(loc), "%s is directory." % loc
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
self._fp = fopen(<char*>bytes_loc, 'wb')
if not self._fp:
@ -487,10 +489,10 @@ cdef class Writer:
cdef class Reader:
def __init__(self, object loc):
assert path.exists(loc)
assert not path.isdir(loc)
if isinstance(loc, Path):
loc = bytes(loc)
assert path.exists(loc)
assert not path.isdir(loc)
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
self._fp = fopen(<char*>bytes_loc, 'rb')
if not self._fp:


@ -20,29 +20,25 @@ def noun_chunks(doclike):
conj = doc.vocab.strings.add("conj")
nmod = doc.vocab.strings.add("nmod")
np_label = doc.vocab.strings.add("NP")
seen = set()
prev_end = -1
for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON):
continue
# Prevent nested chunks from being produced
if word.i in seen:
if word.left_edge.i <= prev_end:
continue
if word.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
flag = False
if word.pos == NOUN:
# check for patterns such as γραμμή παραγωγής
for potential_nmod in word.rights:
if potential_nmod.dep == nmod:
seen.update(
j for j in range(word.left_edge.i, potential_nmod.i + 1)
)
prev_end = potential_nmod.i
yield word.left_edge.i, potential_nmod.i + 1, np_label
flag = True
break
if flag is False:
seen.update(j for j in range(word.left_edge.i, word.i + 1))
prev_end = word.i
yield word.left_edge.i, word.i + 1, np_label
elif word.dep == conj:
# covers the case: έχει όμορφα και έξυπνα παιδιά
@ -51,9 +47,7 @@ def noun_chunks(doclike):
head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.i + 1))
prev_end = word.i
yield word.left_edge.i, word.i + 1, np_label


@ -25,17 +25,15 @@ def noun_chunks(doclike):
np_deps = [doc.vocab.strings.add(label) for label in labels]
conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
seen = set()
prev_end = -1
for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON):
continue
# Prevent nested chunks from being produced
if word.i in seen:
if word.left_edge.i <= prev_end:
continue
if word.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.i + 1))
prev_end = word.i
yield word.left_edge.i, word.i + 1, np_label
elif word.dep == conj:
head = word.head
@ -43,9 +41,7 @@ def noun_chunks(doclike):
head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.i + 1))
prev_end = word.i
yield word.left_edge.i, word.i + 1, np_label


@ -136,7 +136,19 @@ for pron in ["he", "she", "it"]:
# W-words, relative pronouns, prepositions etc.
for word in ["who", "what", "when", "where", "why", "how", "there", "that"]:
for word in [
"who",
"what",
"when",
"where",
"why",
"how",
"there",
"that",
"this",
"these",
"those",
]:
for orth in [word, word.title()]:
_exc[orth + "'s"] = [
{ORTH: orth, LEMMA: word, NORM: word},
@ -396,6 +408,8 @@ _other_exc = {
{ORTH: "Let", LEMMA: "let", NORM: "let"},
{ORTH: "'s", LEMMA: PRON_LEMMA, NORM: "us"},
],
"c'mon": [{ORTH: "c'm", NORM: "come", LEMMA: "come"}, {ORTH: "on"}],
"C'mon": [{ORTH: "C'm", NORM: "come", LEMMA: "come"}, {ORTH: "on"}],
}
_exc.update(_other_exc)


@ -14,5 +14,9 @@ sentences = [
"El gato come pescado.",
"Veo al hombre con el telescopio.",
"La araña come moscas.",
"El pingüino incuba en su nido.",
"El pingüino incuba en su nido sobre el hielo.",
"¿Dónde estais?",
"¿Quién es el presidente Francés?",
"¿Dónde está encuentra la capital de Argentina?",
"¿Cuándo nació José de San Martín?",
]


@ -1,6 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES
from ..char_classes import LIST_ICONS, CURRENCY, LIST_UNITS, PUNCT
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA


@ -7,8 +7,12 @@ _exc = {
for exc_data in [
{ORTH: "", LEMMA: "número"},
{ORTH: "°C", LEMMA: "grados Celcius"},
{ORTH: "aprox.", LEMMA: "aproximadamente"},
{ORTH: "dna.", LEMMA: "docena"},
{ORTH: "dpto.", LEMMA: "departamento"},
{ORTH: "ej.", LEMMA: "ejemplo"},
{ORTH: "esq.", LEMMA: "esquina"},
{ORTH: "pág.", LEMMA: "página"},
{ORTH: "p.ej.", LEMMA: "por ejemplo"},
@ -16,6 +20,7 @@ for exc_data in [
{ORTH: "Vd.", LEMMA: PRON_LEMMA, NORM: "usted"},
{ORTH: "Uds.", LEMMA: PRON_LEMMA, NORM: "ustedes"},
{ORTH: "Vds.", LEMMA: PRON_LEMMA, NORM: "ustedes"},
{ORTH: "vol.", NORM: "volúmen"},
]:
_exc[exc_data[ORTH]] = [exc_data]
@ -35,10 +40,14 @@ for h in range(1, 12 + 1):
for orth in [
"a.C.",
"a.J.C.",
"d.C.",
"d.J.C.",
"apdo.",
"Av.",
"Avda.",
"Cía.",
"Dr.",
"Dra.",
"EE.UU.",
"etc.",
"fig.",
@ -54,9 +63,9 @@ for orth in [
"Prof.",
"Profa.",
"q.e.p.d.",
"S.A.",
"Q.E.P.D." "S.A.",
"S.L.",
"s.s.s.",
"S.R.L." "s.s.s.",
"Sr.",
"Sra.",
"Srta.",


@ -25,17 +25,15 @@ def noun_chunks(doclike):
np_deps = [doc.vocab.strings.add(label) for label in labels]
conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
seen = set()
prev_end = -1
for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON):
continue
# Prevent nested chunks from being produced
if word.i in seen:
if word.left_edge.i <= prev_end:
continue
if word.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.i + 1))
prev_end = word.i
yield word.left_edge.i, word.i + 1, np_label
elif word.dep == conj:
head = word.head
@ -43,9 +41,7 @@ def noun_chunks(doclike):
head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.i + 1))
prev_end = word.i
yield word.left_edge.i, word.i + 1, np_label


@ -531,7 +531,6 @@ FR_BASE_EXCEPTIONS = [
"Beaumont-Hamel",
"Beaumont-Louestault",
"Beaumont-Monteux",
"Beaumont-Pied-de-Bœuf",
"Beaumont-Pied-de-Bœuf",
"Beaumont-Sardolles",
"Beaumont-Village",
@ -948,7 +947,7 @@ FR_BASE_EXCEPTIONS = [
"Buxières-sous-les-Côtes",
"Buzy-Darmont",
"Byhleguhre-Byhlen",
"Bœurs-en-Othe",
"Bœurs-en-Othe",
"Bâle-Campagne",
"Bâle-Ville",
"Béard-Géovreissiat",
@ -1586,11 +1585,11 @@ FR_BASE_EXCEPTIONS = [
"Cruci-Falgardiens",
"Cruquius-Oost",
"Cruviers-Lascours",
"Crèvecœur-en-Auge",
"Crèvecœur-en-Brie",
"Crèvecœur-le-Grand",
"Crèvecœur-le-Petit",
"Crèvecœur-sur-l'Escaut",
"Crèvecœur-en-Auge",
"Crèvecœur-en-Brie",
"Crèvecœur-le-Grand",
"Crèvecœur-le-Petit",
"Crèvecœur-sur-l'Escaut",
"Crécy-Couvé",
"Créon-d'Armagnac",
"Cubjac-Auvézère-Val-d'Ans",
@ -1616,7 +1615,7 @@ FR_BASE_EXCEPTIONS = [
"Cuxac-Cabardès",
"Cuxac-d'Aude",
"Cuyk-Sainte-Agathe",
"Cœuvres-et-Valsery",
"Cœuvres-et-Valsery",
"Céaux-d'Allègre",
"Céleste-Empire",
"Cénac-et-Saint-Julien",
@ -1679,7 +1678,7 @@ FR_BASE_EXCEPTIONS = [
"Devrai-Gondragnières",
"Dhuys et Morin-en-Brie",
"Diane-Capelle",
"Dieffenbach-lès-Wœrth",
"Dieffenbach-lès-Wœrth",
"Diekhusen-Fahrstedt",
"Diennes-Aubigny",
"Diensdorf-Radlow",
@ -1752,7 +1751,7 @@ FR_BASE_EXCEPTIONS = [
"Durdat-Larequille",
"Durfort-Lacapelette",
"Durfort-et-Saint-Martin-de-Sossenac",
"Dœuil-sur-le-Mignon",
"Dœuil-sur-le-Mignon",
"Dão-Lafões",
"Débats-Rivière-d'Orpra",
"Décines-Charpieu",
@ -2687,8 +2686,8 @@ FR_BASE_EXCEPTIONS = [
"Kuhlen-Wendorf",
"KwaZulu-Natal",
"Kyzyl-Arvat",
"Kœur-la-Grande",
"Kœur-la-Petite",
"Kœur-la-Grande",
"Kœur-la-Petite",
"Kölln-Reisiek",
"Königsbach-Stein",
"Königshain-Wiederau",
@ -4024,7 +4023,7 @@ FR_BASE_EXCEPTIONS = [
"Marcilly-d'Azergues",
"Marcillé-Raoul",
"Marcillé-Robert",
"Marcq-en-Barœul",
"Marcq-en-Barœul",
"Marcy-l'Etoile",
"Marcy-l'Étoile",
"Mareil-Marly",
@ -4258,7 +4257,7 @@ FR_BASE_EXCEPTIONS = [
"Monlezun-d'Armagnac",
"Monléon-Magnoac",
"Monnetier-Mornex",
"Mons-en-Barœul",
"Mons-en-Barœul",
"Monsempron-Libos",
"Monsteroux-Milieu",
"Montacher-Villegardin",
@ -4348,7 +4347,7 @@ FR_BASE_EXCEPTIONS = [
"Mornay-Berry",
"Mortain-Bocage",
"Morteaux-Couliboeuf",
"Morteaux-Coulibœuf",
"Morteaux-Coulibœuf",
"Morteaux-Coulibœuf",
"Mortes-Frontières",
"Mory-Montcrux",
@ -4391,7 +4390,7 @@ FR_BASE_EXCEPTIONS = [
"Muncq-Nieurlet",
"Murtin-Bogny",
"Murtin-et-le-Châtelet",
"Mœurs-Verdey",
"Mœurs-Verdey",
"Ménestérol-Montignac",
"Ménil'muche",
"Ménil-Annelles",
@ -4612,7 +4611,7 @@ FR_BASE_EXCEPTIONS = [
"Neuves-Maisons",
"Neuvic-Entier",
"Neuvicq-Montguyon",
"Neuville-lès-Lœuilly",
"Neuville-lès-Lœuilly",
"Neuvy-Bouin",
"Neuvy-Deux-Clochers",
"Neuvy-Grandchamp",
@ -4773,8 +4772,8 @@ FR_BASE_EXCEPTIONS = [
"Nuncq-Hautecôte",
"Nurieux-Volognat",
"Nuthe-Urstromtal",
"Nœux-les-Mines",
"Nœux-lès-Auxi",
"Nœux-les-Mines",
"Nœux-lès-Auxi",
"Nâves-Parmelan",
"Nézignan-l'Evêque",
"Nézignan-l'Évêque",
@ -5343,7 +5342,7 @@ FR_BASE_EXCEPTIONS = [
"Quincy-Voisins",
"Quincy-sous-le-Mont",
"Quint-Fonsegrives",
"Quœux-Haut-Maînil",
"Quœux-Haut-Maînil",
"Quœux-Haut-Maînil",
"Qwa-Qwa",
"R.-V.",
@ -5631,12 +5630,12 @@ FR_BASE_EXCEPTIONS = [
"Saint Aulaye-Puymangou",
"Saint Geniez d'Olt et d'Aubrac",
"Saint Martin de l'If",
"Saint-Denœux",
"Saint-Jean-de-Bœuf",
"Saint-Martin-le-Nœud",
"Saint-Michel-Tubœuf",
"Saint-Denœux",
"Saint-Jean-de-Bœuf",
"Saint-Martin-le-Nœud",
"Saint-Michel-Tubœuf",
"Saint-Paul - Flaugnac",
"Saint-Pierre-de-Bœuf",
"Saint-Pierre-de-Bœuf",
"Saint-Thegonnec Loc-Eguiner",
"Sainte-Alvère-Saint-Laurent Les Bâtons",
"Salignac-Eyvignes",
@ -6208,7 +6207,7 @@ FR_BASE_EXCEPTIONS = [
"Tite-Live",
"Titisee-Neustadt",
"Tobel-Tägerschen",
"Togny-aux-Bœufs",
"Togny-aux-Bœufs",
"Tongre-Notre-Dame",
"Tonnay-Boutonne",
"Tonnay-Charente",
@ -6336,7 +6335,7 @@ FR_BASE_EXCEPTIONS = [
"Vals-près-le-Puy",
"Valverde-Enrique",
"Valzin-en-Petite-Montagne",
"Vandœuvre-lès-Nancy",
"Vandœuvre-lès-Nancy",
"Varces-Allières-et-Risset",
"Varenne-l'Arconce",
"Varenne-sur-le-Doubs",
@ -6457,9 +6456,9 @@ FR_BASE_EXCEPTIONS = [
"Villenave-d'Ornon",
"Villequier-Aumont",
"Villerouge-Termenès",
"Villers-aux-Nœuds",
"Villers-aux-Nœuds",
"Villez-sur-le-Neubourg",
"Villiers-en-Désœuvre",
"Villiers-en-Désœuvre",
"Villieu-Loyes-Mollon",
"Villingen-Schwenningen",
"Villié-Morgon",
@ -6467,7 +6466,7 @@ FR_BASE_EXCEPTIONS = [
"Vilosnes-Haraumont",
"Vilters-Wangs",
"Vincent-Froideville",
"Vincy-Manœuvre",
"Vincy-Manœuvre",
"Vincy-Manœuvre",
"Vincy-Reuil-et-Magny",
"Vindrac-Alayrac",
@ -6511,8 +6510,8 @@ FR_BASE_EXCEPTIONS = [
"Vrigne-Meusiens",
"Vrijhoeve-Capelle",
"Vuisternens-devant-Romont",
"Vœlfling-lès-Bouzonville",
"Vœuil-et-Giget",
"Vœlfling-lès-Bouzonville",
"Vœuil-et-Giget",
"Vélez-Blanco",
"Vélez-Málaga",
"Vélez-Rubio",
@ -6615,7 +6614,7 @@ FR_BASE_EXCEPTIONS = [
"Wust-Fischbeck",
"Wutha-Farnroda",
"Wy-dit-Joli-Village",
"Wœlfling-lès-Sarreguemines",
"Wœlfling-lès-Sarreguemines",
"Wünnewil-Flamatt",
"X-SAMPA",
"X-arbre",


@ -24,17 +24,15 @@ def noun_chunks(doclike):
np_deps = [doc.vocab.strings[label] for label in labels]
conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
seen = set()
prev_end = -1
for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON):
continue
# Prevent nested chunks from being produced
if word.i in seen:
if word.left_edge.i <= prev_end:
continue
if word.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
prev_end = word.right_edge.i
yield word.left_edge.i, word.right_edge.i + 1, np_label
elif word.dep == conj:
head = word.head
@ -42,9 +40,7 @@ def noun_chunks(doclike):
head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
prev_end = word.right_edge.i
yield word.left_edge.i, word.right_edge.i + 1, np_label


@ -1,7 +1,6 @@
import re
from .punctuation import ELISION, HYPHENS
from ..tokenizer_exceptions import URL_PATTERN
from ..char_classes import ALPHA_LOWER, ALPHA
from ...symbols import ORTH, LEMMA
@ -452,9 +451,6 @@ _regular_exp += [
for hc in _hyphen_combination
]
# URLs
_regular_exp.append(URL_PATTERN)
TOKENIZER_EXCEPTIONS = _exc
TOKEN_MATCH = re.compile(


@ -1,6 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from ...language import Language


@ -1,7 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.


@ -1,6 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
STOP_WORDS = set(
"""
એમ


@ -7,7 +7,6 @@ _concat_icons = CONCAT_ICONS.replace("\u00B0", "")
_currency = r"\$¢£€¥฿"
_quotes = CONCAT_QUOTES.replace("'", "")
_units = UNITS.replace("%", "")
_prefixes = (
LIST_PUNCT
@ -18,7 +17,8 @@ _prefixes = (
)
_suffixes = (
LIST_PUNCT
[r"\+"]
+ LIST_PUNCT
+ LIST_ELLIPSES
+ LIST_QUOTES
+ [_concat_icons]
@ -26,7 +26,7 @@ _suffixes = (
r"(?<=[0-9])\+",
r"(?<=°[FfCcKk])\.",
r"(?<=[0-9])(?:[{c}])".format(c=_currency),
r"(?<=[0-9])(?:{u})".format(u=_units),
r"(?<=[0-9])(?:{u})".format(u=UNITS),
r"(?<=[{al}{e}{q}(?:{c})])\.".format(
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, c=_currency
),


@ -1,7 +1,6 @@
import re
from ..punctuation import ALPHA_LOWER, CURRENCY
from ..tokenizer_exceptions import URL_PATTERN
from ...symbols import ORTH
@ -646,4 +645,4 @@ _nums = r"(({ne})|({t})|({on})|({c}))({s})?".format(
TOKENIZER_EXCEPTIONS = _exc
TOKEN_MATCH = re.compile(r"^({u})|({n})$".format(u=URL_PATTERN, n=_nums)).match
TOKEN_MATCH = re.compile(r"^{n}$".format(n=_nums)).match


@ -1,6 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP


@ -1,6 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.hy.examples import sentences


@ -1,12 +1,9 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = [
"զրօ",
"մէկ",
"զրո",
"մեկ",
"երկու",
"երեք",
"չորս",
@ -28,10 +25,10 @@ _num_words = [
"քսան" "երեսուն",
"քառասուն",
"հիսուն",
"վաթցսուն",
"վաթսուն",
"յոթանասուն",
"ութսուն",
"ինիսուն",
"իննսուն",
"հարյուր",
"հազար",
"միլիոն",


@ -1,6 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
STOP_WORDS = set(
"""
նա


@ -1,6 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN
from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ


@ -24,17 +24,15 @@ def noun_chunks(doclike):
np_deps = [doc.vocab.strings[label] for label in labels]
conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
seen = set()
prev_end = -1
for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON):
continue
# Prevent nested chunks from being produced
if word.i in seen:
if word.left_edge.i <= prev_end:
continue
if word.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
prev_end = word.right_edge.i
yield word.left_edge.i, word.right_edge.i + 1, np_label
elif word.dep == conj:
head = word.head
@ -42,9 +40,7 @@ def noun_chunks(doclike):
head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
prev_end = word.right_edge.i
yield word.left_edge.i, word.right_edge.i + 1, np_label


@ -1,111 +1,266 @@
import re
from collections import namedtuple
import srsly
from collections import namedtuple, OrderedDict
from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS
from .tag_map import TAG_MAP
from .tag_orth_map import TAG_ORTH_MAP
from .tag_bigram_map import TAG_BIGRAM_MAP
from ...attrs import LANG
from ...language import Language
from ...tokens import Doc
from ...compat import copy_reg
from ...errors import Errors
from ...language import Language
from ...symbols import POS
from ...tokens import Doc
from ...util import DummyTokenizer
from ... import util
# Hold the attributes we need with convenient names
DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])
# Handling for multiple spaces in a row is somewhat awkward, this simplifies
# the flow by creating a dummy with the same interface.
DummyNode = namedtuple("DummyNode", ["surface", "pos", "feature"])
DummyNodeFeatures = namedtuple("DummyNodeFeatures", ["lemma"])
DummySpace = DummyNode(" ", " ", DummyNodeFeatures(" "))
DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
DummySpace = DummyNode(" ", " ", " ")
def try_fugashi_import():
"""Fugashi is required for Japanese support, so check for it.
If it's not available, blow up and explain how to fix it."""
def try_sudachi_import(split_mode="A"):
"""SudachiPy is required for Japanese support, so check for it.
If it's not available, blow up and explain how to fix it.
split_mode should be one of these values: "A", "B", "C", None->"A"."""
try:
import fugashi
from sudachipy import dictionary, tokenizer
return fugashi
split_mode = {
None: tokenizer.Tokenizer.SplitMode.A,
"A": tokenizer.Tokenizer.SplitMode.A,
"B": tokenizer.Tokenizer.SplitMode.B,
"C": tokenizer.Tokenizer.SplitMode.C,
}[split_mode]
tok = dictionary.Dictionary().create(mode=split_mode)
return tok
except ImportError:
raise ImportError(
"Japanese support requires Fugashi: " "https://github.com/polm/fugashi"
"Japanese support requires SudachiPy and SudachiDict-core "
"(https://github.com/WorksApplications/SudachiPy). "
"Install with `pip install sudachipy sudachidict_core` or "
"install spaCy with `pip install spacy[ja]`."
)
def resolve_pos(token):
def resolve_pos(orth, pos, next_pos):
"""If necessary, add a field to the POS tag for UD mapping.
Under Universal Dependencies, sometimes the same Unidic POS tag can
be mapped differently depending on the literal token or its context
in the sentence. This function adds information to the POS tag to
resolve ambiguous mappings.
in the sentence. This function returns the resolved POS tags for the
token and the next token as a tuple.
"""
# this is only used for consecutive ascii spaces
if token.surface == " ":
return "空白"
# Some tokens have their UD tag decided based on the POS of the following
# token.
# TODO: This is a first take. The rules here are crude approximations.
# For many of these, full dependencies are needed to properly resolve
# PoS mappings.
if token.pos == "連体詞,*,*,*":
if re.match(r"[こそあど此其彼]の", token.surface):
return token.pos + ",DET"
if re.match(r"[こそあど此其彼]", token.surface):
return token.pos + ",PRON"
return token.pos + ",ADJ"
return token.pos
# orth based rules
if pos[0] in TAG_ORTH_MAP:
orth_map = TAG_ORTH_MAP[pos[0]]
if orth in orth_map:
return orth_map[orth], None
# tag bi-gram mapping
if next_pos:
tag_bigram = pos[0], next_pos[0]
if tag_bigram in TAG_BIGRAM_MAP:
bipos = TAG_BIGRAM_MAP[tag_bigram]
if bipos[0] is None:
return TAG_MAP[pos[0]][POS], bipos[1]
else:
return bipos
return TAG_MAP[pos[0]][POS], None
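# Usage sketch (illustrative): the orth-based exception table takes precedence
# over the default tag map, so "こんな" with Unidic tag 連体詞 resolves to PRON
# even though 連体詞 maps to DET by default. Assumes spacy.lang.ja can be
# imported without SudachiPy installed, since the tokenizer is only built when
# a pipeline is created.
from spacy.lang.ja import resolve_pos
from spacy.symbols import PRON

pos, next_pos = resolve_pos("こんな", ("連体詞", ""), None)
assert pos == PRON and next_pos is None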
def get_words_and_spaces(tokenizer, text):
"""Get the individual tokens that make up the sentence and handle white space.
# Use a mapping of paired punctuation to avoid splitting quoted sentences.
pairpunct = {"": "", "": "", "": ""}
Japanese doesn't usually use white space, and MeCab's handling of it for
multiple spaces in a row is somewhat awkward.
def separate_sentences(doc):
"""Given a doc, mark tokens that start sentences based on Unidic tags.
"""
tokens = tokenizer.parseToNodeList(text)
stack = [] # save paired punctuation
for i, token in enumerate(doc[:-2]):
# Set all tokens after the first to false by default. This is necessary
# for the doc code to be aware we've done sentencization, see
# `is_sentenced`.
token.sent_start = i == 0
if token.tag_:
if token.tag_ == "補助記号-括弧開":
ts = str(token)
if ts in pairpunct:
stack.append(pairpunct[ts])
elif stack and ts == stack[-1]:
stack.pop()
if token.tag_ == "補助記号-句点":
next_token = doc[i + 1]
if next_token.tag_ != token.tag_ and not stack:
next_token.sent_start = True
def get_dtokens(tokenizer, text):
tokens = tokenizer.tokenize(text)
words = []
spaces = []
for token in tokens:
# If there's more than one space, spaces after the first become tokens
for ii in range(len(token.white_space) - 1):
words.append(DummySpace)
spaces.append(False)
for ti, token in enumerate(tokens):
tag = "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"])
inf = "-".join([xx for xx in token.part_of_speech()[4:] if xx != "*"])
dtoken = DetailedToken(token.surface(), (tag, inf), token.dictionary_form())
if ti > 0 and words[-1].pos[0] == "空白" and tag == "空白":
# don't add multiple space tokens in a row
continue
words.append(dtoken)
words.append(token)
spaces.append(bool(token.white_space))
return words, spaces
# remove empty tokens. These can be produced with characters like … that
# Sudachi normalizes internally.
words = [ww for ww in words if len(ww.surface) > 0]
return words
def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
words = [x.surface for x in dtokens]
if "".join("".join(words).split()) != "".join(text.split()):
raise ValueError(Errors.E194.format(text=text, words=words))
text_words = []
text_lemmas = []
text_tags = []
text_spaces = []
text_pos = 0
# handle empty and whitespace-only texts
if len(words) == 0:
return text_words, text_lemmas, text_tags, text_spaces
elif len([word for word in words if not word.isspace()]) == 0:
assert text.isspace()
text_words = [text]
text_lemmas = [text]
text_tags = [gap_tag]
text_spaces = [False]
return text_words, text_lemmas, text_tags, text_spaces
# normalize words to remove all whitespace tokens
norm_words, norm_dtokens = zip(
*[
(word, dtokens)
for word, dtokens in zip(words, dtokens)
if not word.isspace()
]
)
# align words with text
for word, dtoken in zip(norm_words, norm_dtokens):
try:
word_start = text[text_pos:].index(word)
except ValueError:
raise ValueError(Errors.E194.format(text=text, words=words))
if word_start > 0:
w = text[text_pos : text_pos + word_start]
text_words.append(w)
text_lemmas.append(w)
text_tags.append(gap_tag)
text_spaces.append(False)
text_pos += word_start
text_words.append(word)
text_lemmas.append(dtoken.lemma)
text_tags.append(dtoken.pos)
text_spaces.append(False)
text_pos += len(word)
if text_pos < len(text) and text[text_pos] == " ":
text_spaces[-1] = True
text_pos += 1
if text_pos < len(text):
w = text[text_pos:]
text_words.append(w)
text_lemmas.append(w)
text_tags.append(gap_tag)
text_spaces.append(False)
return text_words, text_lemmas, text_tags, text_spaces
class JapaneseTokenizer(DummyTokenizer):
def __init__(self, cls, nlp=None):
def __init__(self, cls, nlp=None, config={}):
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
self.tokenizer = try_fugashi_import().Tagger()
self.tokenizer.parseToNodeList("") # see #2901
self.split_mode = config.get("split_mode", None)
self.tokenizer = try_sudachi_import(self.split_mode)
def __call__(self, text):
dtokens, spaces = get_words_and_spaces(self.tokenizer, text)
words = [x.surface for x in dtokens]
dtokens = get_dtokens(self.tokenizer, text)
words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
doc = Doc(self.vocab, words=words, spaces=spaces)
unidic_tags = []
for token, dtoken in zip(doc, dtokens):
unidic_tags.append(dtoken.pos)
token.tag_ = resolve_pos(dtoken)
next_pos = None
for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
token.tag_ = unidic_tag[0]
if next_pos:
token.pos = next_pos
next_pos = None
else:
token.pos, next_pos = resolve_pos(
token.orth_,
unidic_tag,
unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None,
)
# if there's no lemma info (it's an unk) just use the surface
token.lemma_ = dtoken.feature.lemma or dtoken.surface
token.lemma_ = lemma
doc.user_data["unidic_tags"] = unidic_tags
return doc
def _get_config(self):
config = OrderedDict((("split_mode", self.split_mode),))
return config
def _set_config(self, config={}):
self.split_mode = config.get("split_mode", None)
def to_bytes(self, **kwargs):
serializers = OrderedDict(
(("cfg", lambda: srsly.json_dumps(self._get_config())),)
)
return util.to_bytes(serializers, [])
def from_bytes(self, data, **kwargs):
deserializers = OrderedDict(
(("cfg", lambda b: self._set_config(srsly.json_loads(b))),)
)
util.from_bytes(data, deserializers, [])
self.tokenizer = try_sudachi_import(self.split_mode)
return self
def to_disk(self, path, **kwargs):
path = util.ensure_path(path)
serializers = OrderedDict(
(("cfg", lambda p: srsly.write_json(p, self._get_config())),)
)
return util.to_disk(path, serializers, [])
def from_disk(self, path, **kwargs):
path = util.ensure_path(path)
serializers = OrderedDict(
(("cfg", lambda p: self._set_config(srsly.read_json(p))),)
)
util.from_disk(path, serializers, [])
self.tokenizer = try_sudachi_import(self.split_mode)
class JapaneseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda _text: "ja"
stop_words = STOP_WORDS
tag_map = TAG_MAP
syntax_iterators = SYNTAX_ITERATORS
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
@classmethod
def create_tokenizer(cls, nlp=None):
return JapaneseTokenizer(cls, nlp)
def create_tokenizer(cls, nlp=None, config={}):
return JapaneseTokenizer(cls, nlp, config)
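# Usage sketch (illustrative): selecting a Sudachi split mode, following the
# spaCy v2.3 usage docs. Assumes sudachipy and sudachidict_core are installed
# and that the v2.3-style meta plumbing still forwards the "config" dict to
# create_tokenizer on this development branch.
from spacy.lang.ja import Japanese

nlp_a = Japanese()  # split mode "A" (default, shortest units)
nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
print([t.text for t in nlp_a("選挙管理委員会")])  # e.g. ['選挙', '管理', '委員', '会']
print([t.text for t in nlp_c("選挙管理委員会")])  # e.g. ['選挙管理委員会']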
class Japanese(Language):

spacy/lang/ja/bunsetu.py (new file)

@ -0,0 +1,176 @@
POS_PHRASE_MAP = {
"NOUN": "NP",
"NUM": "NP",
"PRON": "NP",
"PROPN": "NP",
"VERB": "VP",
"ADJ": "ADJP",
"ADV": "ADVP",
"CCONJ": "CCONJP",
}
# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
def yield_bunsetu(doc, debug=False):
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
prev = None
prev_tag = None
prev_dep = None
prev_head = None
for t in doc:
pos = t.pos_
pos_type = POS_PHRASE_MAP.get(pos, None)
tag = t.tag_
dep = t.dep_
head = t.head.i
if debug:
print(
t.i,
t.orth_,
pos,
pos_type,
dep,
head,
bunsetu_may_end,
phrase_type,
phrase,
bunsetu,
)
# DET is always an individual bunsetu
if pos == "DET":
if bunsetu:
yield bunsetu, phrase_type, phrase
yield [t], None, None
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
# PRON or Open PUNCT always splits bunsetu
elif tag == "補助記号-括弧開":
if bunsetu:
yield bunsetu, phrase_type, phrase
bunsetu = [t]
bunsetu_may_end = True
phrase_type = None
phrase = None
# bunsetu head not appeared
elif phrase_type is None:
if bunsetu and prev_tag == "補助記号-読点":
yield bunsetu, phrase_type, phrase
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
bunsetu.append(t)
if pos_type: # begin phrase
phrase = [t]
phrase_type = pos_type
if pos_type in {"ADVP", "CCONJP"}:
bunsetu_may_end = True
# entering new bunsetu
elif pos_type and (
pos_type != phrase_type
or bunsetu_may_end # different phrase type arises # same phrase type but bunsetu already ended
):
# exceptional case: NOUN to VERB
if (
phrase_type == "NP"
and pos_type == "VP"
and prev_dep == "compound"
and prev_head == t.i
):
bunsetu.append(t)
phrase_type = "VP"
phrase.append(t)
# exceptional case: VERB to NOUN
elif (
phrase_type == "VP"
and pos_type == "NP"
and (
prev_dep == "compound"
and prev_head == t.i
or dep == "compound"
and prev == head
or prev_dep == "nmod"
and prev_head == t.i
)
):
bunsetu.append(t)
phrase_type = "NP"
phrase.append(t)
else:
yield bunsetu, phrase_type, phrase
bunsetu = [t]
bunsetu_may_end = False
phrase_type = pos_type
phrase = [t]
# NOUN bunsetu
elif phrase_type == "NP":
bunsetu.append(t)
if not bunsetu_may_end and (
(
(pos_type == "NP" or pos == "SYM")
and (prev_head == t.i or prev_head == head)
and prev_dep in {"compound", "nummod"}
)
or (
pos == "PART"
and (prev == head or prev_head == head)
and dep == "mark"
)
):
phrase.append(t)
else:
bunsetu_may_end = True
# VERB bunsetu
elif phrase_type == "VP":
bunsetu.append(t)
if (
not bunsetu_may_end
and pos == "VERB"
and prev_head == t.i
and prev_dep == "compound"
):
phrase.append(t)
else:
bunsetu_may_end = True
# ADJ bunsetu
elif phrase_type == "ADJP" and tag != "連体詞":
bunsetu.append(t)
if not bunsetu_may_end and (
(
pos == "NOUN"
and (prev_head == t.i or prev_head == head)
and prev_dep in {"amod", "compound"}
)
or (
pos == "PART"
and (prev == head or prev_head == head)
and dep == "mark"
)
):
phrase.append(t)
else:
bunsetu_may_end = True
# other bunsetu
else:
bunsetu.append(t)
prev = t.i
prev_tag = t.tag_
prev_dep = t.dep_
prev_head = head
if bunsetu:
yield bunsetu, phrase_type, phrase


@ -0,0 +1,54 @@
from ...symbols import NOUN, PROPN, PRON, VERB
# XXX this can probably be pruned a bit
labels = [
"nsubj",
"nmod",
"dobj",
"nsubjpass",
"pcomp",
"pobj",
"obj",
"obl",
"dative",
"appos",
"attr",
"ROOT",
]
def noun_chunks(obj):
"""
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
"""
doc = obj.doc # Ensure works on both Doc and Span.
np_deps = [doc.vocab.strings.add(label) for label in labels]
doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
seen = set()
for i, word in enumerate(obj):
if word.pos not in (NOUN, PROPN, PRON):
continue
# Prevent nested chunks from being produced
if word.i in seen:
continue
if word.dep in np_deps:
unseen = [w.i for w in word.subtree if w.i not in seen]
if not unseen:
continue
# this takes care of particles etc.
seen.update(j.i for j in word.subtree)
# This avoids duplicating embedded clauses
seen.update(range(word.i + 1))
# if the head of this is a verb, mark that and rights seen
# Don't do the subtree as that can hide other phrases
if word.head.pos == VERB:
seen.add(word.head.i)
seen.update(w.i for w in word.head.rights)
yield unseen[0], word.i + 1, np_label
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}


@ -0,0 +1,28 @@
from ...symbols import ADJ, AUX, NOUN, PART, VERB
# mapping from tag bi-gram to pos of previous token
TAG_BIGRAM_MAP = {
# This covers only small part of AUX.
("形容詞-非自立可能", "助詞-終助詞"): (AUX, None),
("名詞-普通名詞-形状詞可能", "助動詞"): (ADJ, None),
# ("副詞", "名詞-普通名詞-形状詞可能"): (None, ADJ),
# This covers acl, advcl, obl and root, but has side effect for compound.
("名詞-普通名詞-サ変可能", "動詞-非自立可能"): (VERB, AUX),
# This covers almost all of the deps
("名詞-普通名詞-サ変形状詞可能", "動詞-非自立可能"): (VERB, AUX),
("名詞-普通名詞-副詞可能", "動詞-非自立可能"): (None, VERB),
("副詞", "動詞-非自立可能"): (None, VERB),
("形容詞-一般", "動詞-非自立可能"): (None, VERB),
("形容詞-非自立可能", "動詞-非自立可能"): (None, VERB),
("接頭辞", "動詞-非自立可能"): (None, VERB),
("助詞-係助詞", "動詞-非自立可能"): (None, VERB),
("助詞-副助詞", "動詞-非自立可能"): (None, VERB),
("助詞-格助詞", "動詞-非自立可能"): (None, VERB),
("補助記号-読点", "動詞-非自立可能"): (None, VERB),
("形容詞-一般", "接尾辞-名詞的-一般"): (None, PART),
("助詞-格助詞", "形状詞-助動詞語幹"): (None, NOUN),
("連体詞", "形状詞-助動詞語幹"): (None, NOUN),
("動詞-一般", "助詞-副助詞"): (None, PART),
("動詞-非自立可能", "助詞-副助詞"): (None, PART),
("助動詞", "助詞-副助詞"): (None, PART),
}


@ -1,79 +1,68 @@
from ...symbols import POS, PUNCT, INTJ, X, ADJ, AUX, ADP, PART, SCONJ, NOUN
from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE
from ...symbols import POS, PUNCT, INTJ, ADJ, AUX, ADP, PART, SCONJ, NOUN
from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE, CCONJ
TAG_MAP = {
# Explanation of Unidic tags:
# https://www.gavo.t.u-tokyo.ac.jp/~mine/japanese/nlp+slp/UNIDIC_manual.pdf
# Universal Dependencies Mapping:
# Universal Dependencies Mapping: (Some of the entries in this mapping are updated to v2.6 in the list below)
# http://universaldependencies.org/ja/overview/morphology.html
# http://universaldependencies.org/ja/pos/all.html
"記号,一般,*,*": {
POS: PUNCT
}, # this includes characters used to represent sounds like ドレミ
"記号,文字,*,*": {
POS: PUNCT
}, # this is for Greek and Latin characters used as symbols, as in math
"感動詞,フィラー,*,*": {POS: INTJ},
"感動詞,一般,*,*": {POS: INTJ},
# this is specifically for unicode full-width space
"空白,*,*,*": {POS: X},
# This is used when sequential half-width spaces are present
"記号-一般": {POS: NOUN}, # this includes characters used to represent sounds like ドレミ
"記号-文字": {
POS: NOUN
}, # this is for Greek and Latin characters having some meanings, or used as symbols, as in math
"感動詞-フィラー": {POS: INTJ},
"感動詞-一般": {POS: INTJ},
"空白": {POS: SPACE},
"形状詞,一般,*,*": {POS: ADJ},
"形状詞,タリ,*,*": {POS: ADJ},
"形状詞,助動詞語幹,*,*": {POS: ADJ},
"形容詞,一般,*,*": {POS: ADJ},
"形容詞,非自立可能,*,*": {POS: AUX}, # XXX ADJ if alone, AUX otherwise
"助詞,格助詞,*,*": {POS: ADP},
"助詞,係助詞,*,*": {POS: ADP},
"助詞,終助詞,*,*": {POS: PART},
"助詞,準体助詞,*,*": {POS: SCONJ}, # の as in 走るのが速い
"助詞,接続助詞,*,*": {POS: SCONJ}, # verb ending て
"助詞,副助詞,*,*": {POS: PART}, # ばかり, つつ after a verb
"助動詞,*,*,*": {POS: AUX},
"接続詞,*,*,*": {POS: SCONJ}, # XXX: might need refinement
"接頭辞,*,*,*": {POS: NOUN},
"接尾辞,形状詞的,*,*": {POS: ADJ}, # がち, チック
"接尾辞,形容詞的,*,*": {POS: ADJ}, # -らしい
"接尾辞,動詞的,*,*": {POS: NOUN}, # -じみ
"接尾辞,名詞的,サ変可能,*": {POS: NOUN}, # XXX see 名詞,普通名詞,サ変可能,*
"接尾辞,名詞的,一般,*": {POS: NOUN},
"接尾辞,名詞的,助数詞,*": {POS: NOUN},
"接尾辞,名詞的,副詞可能,*": {POS: NOUN}, # -後, -過ぎ
"代名詞,*,*,*": {POS: PRON},
"動詞,一般,*,*": {POS: VERB},
"動詞,非自立可能,*,*": {POS: VERB}, # XXX VERB if alone, AUX otherwise
"動詞,非自立可能,*,*,AUX": {POS: AUX},
"動詞,非自立可能,*,*,VERB": {POS: VERB},
"副詞,*,*,*": {POS: ADV},
"補助記号,,一般,*": {POS: SYM}, # text art
"補助記号,,顔文字,*": {POS: SYM}, # kaomoji
"補助記号,一般,*,*": {POS: SYM},
"補助記号,括弧開,*,*": {POS: PUNCT}, # open bracket
"補助記号,括弧閉,*,*": {POS: PUNCT}, # close bracket
"補助記号,句点,*,*": {POS: PUNCT}, # period or other EOS marker
"補助記号,読点,*,*": {POS: PUNCT}, # comma
"名詞,固有名詞,一般,*": {POS: PROPN}, # general proper noun
"名詞,固有名詞,人名,一般": {POS: PROPN}, # person's name
"名詞,固有名詞,人名,姓": {POS: PROPN}, # surname
"名詞,固有名詞,人名,名": {POS: PROPN}, # first name
"名詞,固有名詞,地名,一般": {POS: PROPN}, # place name
"名詞,固有名詞,地名,国": {POS: PROPN}, # country name
"名詞,助動詞語幹,*,*": {POS: AUX},
"名詞,数詞,*,*": {POS: NUM}, # includes Chinese numerals
"名詞,普通名詞,サ変可能,*": {POS: NOUN}, # XXX: sometimes VERB in UDv2; suru-verb noun
"名詞,普通名詞,サ変可能,*,NOUN": {POS: NOUN},
"名詞,普通名詞,サ変可能,*,VERB": {POS: VERB},
"名詞,普通名詞,サ変形状詞可能,*": {POS: NOUN}, # ex: 下手
"名詞,普通名詞,一般,*": {POS: NOUN},
"名詞,普通名詞,形状詞可能,*": {POS: NOUN}, # XXX: sometimes ADJ in UDv2
"名詞,普通名詞,形状詞可能,*,NOUN": {POS: NOUN},
"名詞,普通名詞,形状詞可能,*,ADJ": {POS: ADJ},
"名詞,普通名詞,助数詞可能,*": {POS: NOUN}, # counter / unit
"名詞,普通名詞,副詞可能,*": {POS: NOUN},
"連体詞,*,*,*": {POS: ADJ}, # XXX this has exceptions based on literal token
"連体詞,*,*,*,ADJ": {POS: ADJ},
"連体詞,*,*,*,PRON": {POS: PRON},
"連体詞,*,*,*,DET": {POS: DET},
"形状詞-一般": {POS: ADJ},
"形状詞-タリ": {POS: ADJ},
"形状詞-助動詞語幹": {POS: AUX},
"形容詞-一般": {POS: ADJ},
"形容詞-非自立可能": {POS: ADJ}, # XXX ADJ if alone, AUX otherwise
"助詞-格助詞": {POS: ADP},
"助詞-係助詞": {POS: ADP},
"助詞-終助詞": {POS: PART},
"助詞-準体助詞": {POS: SCONJ}, # の as in 走るのが速い
"助詞-接続助詞": {POS: SCONJ}, # verb ending て0
"助詞-副助詞": {POS: ADP}, # ばかり, つつ after a verb
"助動詞": {POS: AUX},
"接続詞": {POS: CCONJ}, # XXX: might need refinement
"接頭辞": {POS: NOUN},
"接尾辞-形状詞的": {POS: PART}, # がち, チック
"接尾辞-形容詞的": {POS: AUX}, # -らしい
"接尾辞-動詞的": {POS: PART}, # -じみ
"接尾辞-名詞的-サ変可能": {POS: NOUN}, # XXX see 名詞,普通名詞,サ変可能,*
"接尾辞-名詞的-一般": {POS: NOUN},
"接尾辞-名詞的-助数詞": {POS: NOUN},
"接尾辞-名詞的-副詞可能": {POS: NOUN}, # -後, -過ぎ
"代名詞": {POS: PRON},
"動詞-一般": {POS: VERB},
"動詞-非自立可能": {POS: AUX}, # XXX VERB if alone, AUX otherwise
"副詞": {POS: ADV},
"補助記号--一般": {POS: SYM}, # text art
"補助記号--顔文字": {POS: PUNCT}, # kaomoji
"補助記号-一般": {POS: SYM},
"補助記号-括弧開": {POS: PUNCT}, # open bracket
"補助記号-括弧閉": {POS: PUNCT}, # close bracket
"補助記号-句点": {POS: PUNCT}, # period or other EOS marker
"補助記号-読点": {POS: PUNCT}, # comma
"名詞-固有名詞-一般": {POS: PROPN}, # general proper noun
"名詞-固有名詞-人名-一般": {POS: PROPN}, # person's name
"名詞-固有名詞-人名-姓": {POS: PROPN}, # surname
"名詞-固有名詞-人名-名": {POS: PROPN}, # first name
"名詞-固有名詞-地名-一般": {POS: PROPN}, # place name
"名詞-固有名詞-地名-国": {POS: PROPN}, # country name
"名詞-助動詞語幹": {POS: AUX},
"名詞-数詞": {POS: NUM}, # includes Chinese numerals
"名詞-普通名詞-サ変可能": {POS: NOUN}, # XXX: sometimes VERB in UDv2; suru-verb noun
"名詞-普通名詞-サ変形状詞可能": {POS: NOUN},
"名詞-普通名詞-一般": {POS: NOUN},
"名詞-普通名詞-形状詞可能": {POS: NOUN}, # XXX: sometimes ADJ in UDv2
"名詞-普通名詞-助数詞可能": {POS: NOUN}, # counter / unit
"名詞-普通名詞-副詞可能": {POS: NOUN},
"連体詞": {POS: DET}, # XXX this has exceptions based on literal token
# GSD tags. These aren't in Unidic, but we need them for the GSD data.
"外国語": {POS: PROPN}, # Foreign words
"絵文字・記号等": {POS: SYM}, # emoji / kaomoji ^^;
}


@ -0,0 +1,22 @@
from ...symbols import DET, PART, PRON, SPACE, X
# mapping from (tag, token orth) pairs to a POS that overrides the default tag map entry
TAG_ORTH_MAP = {
"空白": {" ": SPACE, " ": X},
"助詞-副助詞": {"たり": PART},
"連体詞": {
"あの": DET,
"かの": DET,
"この": DET,
"その": DET,
"どの": DET,
"彼の": DET,
"此の": DET,
"其の": DET,
"ある": PRON,
"こんな": PRON,
"そんな": PRON,
"どんな": PRON,
"あらゆる": PRON,
},
}


@ -1,7 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.


@ -1,6 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from ...language import Language


@ -1,7 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.

Some files were not shown because too many files have changed in this diff.