Merge remote-tracking branch 'origin/develop' into rliaw-develop

Author: Richard Liaw, 2020-06-30 13:50:03 -07:00
Commit: 610dfd85c2
235 changed files with 8908 additions and 5314 deletions

.github/contributors/Arvindcheenu.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Arvind Srinivasan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-06-13 |
| GitHub username | arvindcheenu |
| Website (optional) | |

.github/contributors/JannisTriesToCode.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | ----------------------------- |
| Name | Jannis Rauschke |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 22.05.2020 |
| GitHub username | JannisTriesToCode |
| Website (optional) | https://twitter.com/JRauschke |

.github/contributors/MartinoMensio.md (updated)

@@ -99,8 +99,8 @@ mark both statements:
 | Field | Entry |
 |------------------------------- | -------------------- |
 | Name | Martino Mensio |
-| Company name (if applicable) | Polytechnic University of Turin |
-| Title or role (if applicable) | Student |
+| Company name (if applicable) | The Open University |
+| Title or role (if applicable) | PhD Student |
 | Date | 17 November 2017 |
 | GitHub username | MartinoMensio |
 | Website (optional) | https://martinomensio.github.io/ |

.github/contributors/R1j1t.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Rajat |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 24 May 2020 |
| GitHub username | R1j1t |
| Website (optional) | |

.github/contributors/hiroshi-matsuda-rit.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Hiroshi Matsuda |
| Company name (if applicable) | Megagon Labs, Tokyo |
| Title or role (if applicable) | Research Scientist |
| Date | June 6, 2020 |
| GitHub username | hiroshi-matsuda-rit |
| Website (optional) | |

.github/contributors/jonesmartins.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jones Martins |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-06-10 |
| GitHub username | jonesmartins |
| Website (optional) | |

.github/contributors/leomrocha.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Leonardo M. Rocha |
| Company name (if applicable) | |
| Title or role (if applicable) | Eng. |
| Date | 31/05/2020 |
| GitHub username | leomrocha |
| Website (optional) | |

.github/contributors/lfiedler.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Leander Fiedler |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 06 April 2020 |
| GitHub username | lfiedler |
| Website (optional) | |

.github/contributors/mahnerak.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Karen Hambardzumyan |
| Company name (if applicable) | YerevaNN |
| Title or role (if applicable) | Researcher |
| Date | 2020-06-19 |
| GitHub username | mahnerak |
| Website (optional)              | https://mahnerak.com/ |

.github/contributors/myavrum.md (new file)

(Standard spaCy contributor agreement; text identical to the copy above, signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Marat M. Yavrumyan |
| Company name (if applicable) | YSU, UD_Armenian Project |
| Title or role (if applicable) | Dr., Principal Investigator |
| Date | 2020-06-19 |
| GitHub username | myavrum |
| Website (optional) | http://armtreebank.yerevann.com/ |

.github/contributors/theudas.md (new file)

(Standard spaCy contributor agreement, identical to the copy above except that it names ExplosionAI UG (haftungsbeschränkt) as the contracting entity; signed as an individual.)
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Philipp Sodmann |
| Company name (if applicable) | Empolis |
| Title or role (if applicable) | |
| Date | 2017-05-06 |
| GitHub username | theudas |
| Website (optional) | |

.github/workflows/issue-manager.yml (new file)

@@ -0,0 +1,29 @@
name: Issue Manager

on:
  schedule:
    - cron: "0 0 * * *"
  issue_comment:
    types:
      - created
      - edited
  issues:
    types:
      - labeled

jobs:
  issue-manager:
    runs-on: ubuntu-latest
    steps:
      - uses: tiangolo/issue-manager@0.2.1
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          config: >
            {
              "resolved": {
                "delay": "P7D",
                "message": "This issue has been automatically closed because it was answered and there was no follow-up discussion.",
                "remove_label_on_comment": true,
                "remove_label_on_close": true
              }
            }

Makefile

@@ -5,8 +5,9 @@ VENV := ./env$(PYVER)
 version := $(shell "bin/get-version.sh")
 dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
-	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy_lookups_data
+	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core
 	chmod a+rx $@
+	cp $@ dist/spacy.pex
 dist/pytest.pex : wheelhouse/pytest-*.whl
 	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
@@ -14,7 +15,7 @@ dist/pytest.pex : wheelhouse/pytest-*.whl
 wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
 	$(VENV)/bin/pip wheel . -w ./wheelhouse
-	$(VENV)/bin/pip wheel spacy_lookups_data -w ./wheelhouse
+	$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core -w ./wheelhouse
 	touch $@
 wheelhouse/pytest-%.whl : $(VENV)/bin/pex
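
A hedged aside on the dependency additions in this hunk: bundling jieba, pkuseg, SudachiPy and SudachiDict-core into the pex appears intended to make Chinese and Japanese tokenization work out of the box. The snippet below is an illustrative sketch under that assumption, not part of the diff; it assumes those wheels are importable in the environment.

```python
# Sketch only: assumes jieba/pkuseg (Chinese) and SudachiPy + SudachiDict-core
# (Japanese) are installed, as bundled by the Makefile change above.
import spacy

zh = spacy.blank("zh")   # Chinese tokenization backed by jieba/pkuseg in v2.3
ja = spacy.blank("ja")   # Japanese tokenization backed by SudachiPy
print([t.text for t in zh("我喜欢自然语言处理")])
print([t.text for t in ja("自然言語処理が好きです")])
```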

README.md

@@ -6,12 +6,12 @@ spaCy is a library for advanced Natural Language Processing in Python and
 Cython. It's built on the very latest research, and was designed from day one to
 be used in real products. spaCy comes with
 [pretrained statistical models](https://spacy.io/models) and word vectors, and
-currently supports tokenization for **50+ languages**. It features
+currently supports tokenization for **60+ languages**. It features
 state-of-the-art speed, convolutional **neural network models** for tagging,
 parsing and **named entity recognition** and easy **deep learning** integration.
 It's commercial open-source software, released under the MIT license.

-💫 **Version 2.2 out now!**
+💫 **Version 2.3 out now!**
 [Check out the release notes here.](https://github.com/explosion/spaCy/releases)

 [![Azure Pipelines](<https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build+(3.x)>)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
@@ -31,7 +31,7 @@ It's commercial open-source software, released under the MIT license.
 | --------------- | -------------------------------------------------------------- |
 | [spaCy 101]     | New to spaCy? Here's everything you need to know!               |
 | [Usage Guides]  | How to use spaCy and its features.                              |
-| [New in v2.2]   | New features, backwards incompatibilities and migration guide.  |
+| [New in v2.3]   | New features, backwards incompatibilities and migration guide.  |
 | [API Reference] | The detailed reference for spaCy's API.                         |
 | [Models]        | Download statistical language models for spaCy.                 |
 | [Universe]      | Libraries, extensions, demos, books and courses.                |
@@ -39,7 +39,7 @@ It's commercial open-source software, released under the MIT license.
 | [Contribute]    | How to contribute to the spaCy project and code base.           |

 [spacy 101]: https://spacy.io/usage/spacy-101
-[new in v2.2]: https://spacy.io/usage/v2-2
+[new in v2.3]: https://spacy.io/usage/v2-3
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
@@ -119,12 +119,13 @@ of `v2.0.13`).
 pip install spacy
 ```

-To install additional data tables for lemmatization in **spaCy v2.2+** you can
-run `pip install spacy[lookups]` or install
+To install additional data tables for lemmatization and normalization in
+**spaCy v2.2+** you can run `pip install spacy[lookups]` or install
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
 separately. The lookups package is needed to create blank models with
-lemmatization data, and to lemmatize in languages that don't yet come with
-pretrained models and aren't powered by third-party libraries.
+lemmatization data for v2.2+ plus normalization data for v2.3+, and to
+lemmatize in languages that don't yet come with pretrained models and aren't
+powered by third-party libraries.

 When using pip it is generally recommended to install packages in a virtual
 environment to avoid modifying system state:
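
A minimal sketch of what the lookups extra enables, assuming `spacy` and `spacy-lookups-data` are installed; the sentence and the printed lemmas are illustrative only:

```python
import spacy

# Blank pipeline, no statistical model: lemmas come from the lookup tables
# shipped in spacy-lookups-data (installed via `pip install spacy[lookups]`).
nlp = spacy.blank("en")
doc = nlp("The cats were running")
print([token.lemma_ for token in doc])  # e.g. ['the', 'cat', 'be', 'run']
```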

View File

@ -14,7 +14,7 @@ import spacy
import spacy.util import spacy.util
from bin.ud import conll17_ud_eval from bin.ud import conll17_ud_eval
from spacy.tokens import Token, Doc from spacy.tokens import Token, Doc
from spacy.gold import GoldParse, Example from spacy.gold import Example
from spacy.util import compounding, minibatch, minibatch_by_words from spacy.util import compounding, minibatch, minibatch_by_words
from spacy.syntax.nonproj import projectivize from spacy.syntax.nonproj import projectivize
from spacy.matcher import Matcher from spacy.matcher import Matcher
@ -78,22 +78,21 @@ def read_data(
head = int(head) - 1 if head != "0" else id_ head = int(head) - 1 if head != "0" else id_
sent["words"].append(word) sent["words"].append(word)
sent["tags"].append(tag) sent["tags"].append(tag)
sent["morphology"].append(_parse_morph_string(morph)) sent["morphs"].append(_compile_morph_string(morph, pos))
sent["morphology"][-1].add("POS_%s" % pos)
sent["heads"].append(head) sent["heads"].append(head)
sent["deps"].append("ROOT" if dep == "root" else dep) sent["deps"].append("ROOT" if dep == "root" else dep)
sent["spaces"].append(space_after == "_") sent["spaces"].append(space_after == "_")
sent["entities"] = ["-"] * len(sent["words"]) sent["entities"] = ["-"] * len(sent["words"]) # TODO: doc-level format
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"]) sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
if oracle_segments: if oracle_segments:
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"])) docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
golds.append(GoldParse(docs[-1], **sent)) golds.append(sent)
assert golds[-1].morphology is not None assert golds[-1]["morphs"] is not None
sent_annots.append(sent) sent_annots.append(sent)
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length: if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
doc, gold = _make_gold(nlp, None, sent_annots) doc, gold = _make_gold(nlp, None, sent_annots)
assert gold.morphology is not None assert gold["morphs"] is not None
sent_annots = [] sent_annots = []
docs.append(doc) docs.append(doc)
golds.append(gold) golds.append(gold)
@@ -109,17 +108,10 @@ def read_data(
     return golds_to_gold_data(docs, golds)


-def _parse_morph_string(morph_string):
+def _compile_morph_string(morph_string, pos):
     if morph_string == '_':
-        return set()
-    output = []
-    replacements = {'1': 'one', '2': 'two', '3': 'three'}
-    for feature in morph_string.split('|'):
-        key, value = feature.split('=')
-        value = replacements.get(value, value)
-        value = value.split(',')[0]
-        output.append('%s_%s' % (key, value.lower()))
-    return set(output)
+        return f"POS={pos}"
+    return morph_string + f"|POS={pos}"
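
For reference, the replacement helper now returns a single FEATS-style string instead of a feature set. A quick illustration with made-up values:

```python
def _compile_morph_string(morph_string, pos):
    # Same logic as the new helper above: append POS to the UD FEATS string.
    if morph_string == "_":
        return f"POS={pos}"
    return morph_string + f"|POS={pos}"

print(_compile_morph_string("_", "NOUN"))                     # -> "POS=NOUN"
print(_compile_morph_string("Case=Nom|Number=Sing", "NOUN"))  # -> "Case=Nom|Number=Sing|POS=NOUN"
```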
def read_conllu(file_): def read_conllu(file_):
@ -151,28 +143,27 @@ def read_conllu(file_):
def _make_gold(nlp, text, sent_annots, drop_deps=0.0): def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
# Flatten the conll annotations, and adjust the head indices # Flatten the conll annotations, and adjust the head indices
flat = defaultdict(list) gold = defaultdict(list)
sent_starts = [] sent_starts = []
for sent in sent_annots: for sent in sent_annots:
flat["heads"].extend(len(flat["words"])+head for head in sent["heads"]) gold["heads"].extend(len(gold["words"])+head for head in sent["heads"])
for field in ["words", "tags", "deps", "morphology", "entities", "spaces"]: for field in ["words", "tags", "deps", "morphs", "entities", "spaces"]:
flat[field].extend(sent[field]) gold[field].extend(sent[field])
sent_starts.append(True) sent_starts.append(True)
sent_starts.extend([False] * (len(sent["words"]) - 1)) sent_starts.extend([False] * (len(sent["words"]) - 1))
# Construct text if necessary # Construct text if necessary
assert len(flat["words"]) == len(flat["spaces"]) assert len(gold["words"]) == len(gold["spaces"])
if text is None: if text is None:
text = "".join( text = "".join(
word + " " * space for word, space in zip(flat["words"], flat["spaces"]) word + " " * space for word, space in zip(gold["words"], gold["spaces"])
) )
doc = nlp.make_doc(text) doc = nlp.make_doc(text)
flat.pop("spaces") gold.pop("spaces")
gold = GoldParse(doc, **flat) gold["sent_starts"] = sent_starts
gold.sent_starts = sent_starts for i in range(len(gold["heads"])):
for i in range(len(gold.heads)):
if random.random() < drop_deps: if random.random() < drop_deps:
gold.heads[i] = None gold["heads"][i] = None
gold.labels[i] = None gold["labels"][i] = None
return doc, gold return doc, gold
@@ -183,15 +174,10 @@ def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
 def golds_to_gold_data(docs, golds):
-    """Get out the training data format used by begin_training, given the
-    GoldParse objects."""
+    """Get out the training data format used by begin_training"""
     data = []
     for doc, gold in zip(docs, golds):
-        example = Example(doc=doc)
-        example.add_doc_annotation(cats=gold.cats)
-        token_annotation_dict = gold.orig.to_dict()
-        example.add_token_annotation(**token_annotation_dict)
-        example.goldparse = gold
+        example = Example.from_dict(doc, dict(gold))
         data.append(example)
     return data
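
A hedged sketch of the `Example.from_dict` call used above, with a made-up two-token annotation dict whose field names mirror the ones collected by `_make_gold`:

```python
import spacy
from spacy.gold import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("She runs")
annotations = {
    "words": ["She", "runs"],
    "tags": ["PRP", "VBZ"],
    "heads": [1, 1],
    "deps": ["nsubj", "ROOT"],
}
example = Example.from_dict(doc, annotations)  # reference Doc is built from the dict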
@ -359,9 +345,8 @@ def initialize_pipeline(nlp, examples, config, device):
nlp.parser.add_multitask_objective("tag") nlp.parser.add_multitask_objective("tag")
if config.multitask_sent: if config.multitask_sent:
nlp.parser.add_multitask_objective("sent_start") nlp.parser.add_multitask_objective("sent_start")
for ex in examples: for eg in examples:
gold = ex.gold for tag in eg.get_aligned("TAG", as_string=True):
for tag in gold.tags:
if tag is not None: if tag is not None:
nlp.tagger.add_label(tag) nlp.tagger.add_label(tag)
if torch is not None and device != -1: if torch is not None and device != -1:
@ -495,10 +480,6 @@ def main(
Token.set_extension("begins_fused", default=False) Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False) Token.set_extension("inside_fused", default=False)
Token.set_extension("get_conllu_lines", method=get_token_conllu)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)
spacy.util.fix_random_seed() spacy.util.fix_random_seed()
lang.zh.Chinese.Defaults.use_jieba = False lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False lang.ja.Japanese.Defaults.use_janome = False
@ -541,10 +522,10 @@ def main(
else: else:
batches = minibatch(examples, size=batch_sizes) batches = minibatch(examples, size=batch_sizes)
losses = {} losses = {}
n_train_words = sum(len(ex.doc) for ex in examples) n_train_words = sum(len(eg.predicted) for eg in examples)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar: with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches: for batch in batches:
pbar.update(sum(len(ex.doc) for ex in batch)) pbar.update(sum(len(ex.predicted) for ex in batch))
nlp.parser.cfg["beam_update_prob"] = next(beam_prob) nlp.parser.cfg["beam_update_prob"] = next(beam_prob)
nlp.update( nlp.update(
batch, batch,

View File

@@ -5,17 +5,16 @@
 # data is passed in sentence-by-sentence via some prior preprocessing.
 gold_preproc = false
 # Limitations on training document length or number of examples.
-max_length = 0
+max_length = 5000
 limit = 0
 # Data augmentation
 orth_variant_level = 0.0
-noise_level = 0.0
 dropout = 0.1
 # Controls early-stopping. 0 or -1 mean unlimited.
 patience = 1600
 max_epochs = 0
 max_steps = 20000
-eval_frequency = 400
+eval_frequency = 200
 # Other settings
 seed = 0
 accumulate_gradient = 1
@@ -41,15 +40,15 @@ beta2 = 0.999
 L2_is_weight_decay = true
 L2 = 0.01
 grad_clip = 1.0
-use_averages = true
+use_averages = false
 eps = 1e-8
-learn_rate = 0.001
+#learn_rate = 0.001

-#[optimizer.learn_rate]
-#@schedules = "warmup_linear.v1"
-#warmup_steps = 250
-#total_steps = 20000
-#initial_rate = 0.001
+[optimizer.learn_rate]
+@schedules = "warmup_linear.v1"
+warmup_steps = 250
+total_steps = 20000
+initial_rate = 0.001

 [nlp]
 lang = "en"
@@ -58,15 +57,11 @@ vectors = null
 [nlp.pipeline.tok2vec]
 factory = "tok2vec"

-[nlp.pipeline.senter]
-factory = "senter"
-
 [nlp.pipeline.ner]
 factory = "ner"
 learn_tokens = false
 min_action_freq = 1
-beam_width = 1
-beam_update_prob = 1.0

 [nlp.pipeline.tagger]
 factory = "tagger"
@@ -74,16 +69,7 @@ factory = "tagger"
 [nlp.pipeline.parser]
 factory = "parser"
 learn_tokens = false
-min_action_freq = 1
-beam_width = 1
-beam_update_prob = 1.0
-
-[nlp.pipeline.senter.model]
-@architectures = "spacy.Tagger.v1"
-
-[nlp.pipeline.senter.model.tok2vec]
-@architectures = "spacy.Tok2VecTensors.v1"
-width = ${nlp.pipeline.tok2vec.model:width}
+min_action_freq = 30

 [nlp.pipeline.tagger.model]
 @architectures = "spacy.Tagger.v1"
@@ -96,8 +82,8 @@ width = ${nlp.pipeline.tok2vec.model:width}
 @architectures = "spacy.TransitionBasedParser.v1"
 nr_feature_tokens = 8
 hidden_width = 128
-maxout_pieces = 3
-use_upper = false
+maxout_pieces = 2
+use_upper = true

 [nlp.pipeline.parser.model.tok2vec]
 @architectures = "spacy.Tok2VecTensors.v1"
@@ -107,8 +93,8 @@ width = ${nlp.pipeline.tok2vec.model:width}
 @architectures = "spacy.TransitionBasedParser.v1"
 nr_feature_tokens = 3
 hidden_width = 128
-maxout_pieces = 3
-use_upper = false
+maxout_pieces = 2
+use_upper = true

 [nlp.pipeline.ner.model.tok2vec]
 @architectures = "spacy.Tok2VecTensors.v1"
@@ -117,10 +103,10 @@ width = ${nlp.pipeline.tok2vec.model:width}
 [nlp.pipeline.tok2vec.model]
 @architectures = "spacy.HashEmbedCNN.v1"
 pretrained_vectors = ${nlp:vectors}
-width = 256
-depth = 6
+width = 128
+depth = 4
 window_size = 1
-embed_size = 10000
+embed_size = 7000
 maxout_pieces = 3
 subword_features = true
-dropout = null
+dropout = ${training:dropout}

View File

@@ -9,7 +9,6 @@ max_length = 0
 limit = 0
 # Data augmentation
 orth_variant_level = 0.0
-noise_level = 0.0
 dropout = 0.1
 # Controls early-stopping. 0 or -1 mean unlimited.
 patience = 1600

View File

@ -0,0 +1,80 @@
# Training hyper-parameters and additional features.
[training]
# Whether to train on sequences with 'gold standard' sentence boundaries
# and tokens. If you set this to true, take care to ensure your run-time
# data is passed in sentence-by-sentence via some prior preprocessing.
gold_preproc = false
# Limitations on training document length or number of examples.
max_length = 5000
limit = 0
# Data augmentation
orth_variant_level = 0.0
dropout = 0.2
# Controls early-stopping. 0 or -1 mean unlimited.
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 500
# Other settings
seed = 0
accumulate_gradient = 1
use_pytorch_for_gpu_memory = false
# Control how scores are printed and checkpoints are evaluated.
scores = ["speed", "ents_p", "ents_r", "ents_f"]
score_weights = {"ents_f": 1.0}
# These settings are invalid for the transformer models.
init_tok2vec = null
discard_oversize = false
omit_extra_lookups = false
[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = false
L2 = 1e-6
grad_clip = 1.0
use_averages = true
eps = 1e-8
learn_rate = 0.001
#[optimizer.learn_rate]
#@schedules = "warmup_linear.v1"
#warmup_steps = 250
#total_steps = 20000
#initial_rate = 0.001
[nlp]
lang = "en"
vectors = null
[nlp.pipeline.ner]
factory = "ner"
learn_tokens = false
min_action_freq = 1
beam_width = 1
beam_update_prob = 1.0
[nlp.pipeline.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 64
maxout_pieces = 2
use_upper = true
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = ${nlp:vectors}
width = 96
depth = 4
window_size = 1
embed_size = 2000
maxout_pieces = 3
subword_features = true
dropout = ${training:dropout}
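
A hedged sketch of inspecting a config like the one above with Thinc's `Config` class (the file name is hypothetical; thinc `8.0.0a11` is the version pinned elsewhere in this diff):

```python
from thinc.api import Config

# Load the config file shown above and read back a couple of values.
config = Config().from_disk("ner_config.cfg")  # hypothetical path
print(config["training"]["dropout"])           # 0.2
print(config["nlp"]["pipeline"]["ner"]["model"]["@architectures"])  # spacy.TransitionBasedParser.v1
```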

View File

@@ -6,7 +6,6 @@ init_tok2vec = null
 vectors = null
 max_epochs = 100
 orth_variant_level = 0.0
-noise_level = 0.0
 gold_preproc = true
 max_length = 0
 use_gpu = 0

View File

@@ -6,7 +6,6 @@ init_tok2vec = null
 vectors = null
 max_epochs = 100
 orth_variant_level = 0.0
-noise_level = 0.0
 gold_preproc = true
 max_length = 0
 use_gpu = -1

View File

@ -12,7 +12,7 @@ import tqdm
import spacy import spacy
import spacy.util import spacy.util
from spacy.tokens import Token, Doc from spacy.tokens import Token, Doc
from spacy.gold import GoldParse, Example from spacy.gold import Example
from spacy.syntax.nonproj import projectivize from spacy.syntax.nonproj import projectivize
from collections import defaultdict from collections import defaultdict
from spacy.matcher import Matcher from spacy.matcher import Matcher
@ -33,31 +33,6 @@ random.seed(0)
numpy.random.seed(0) numpy.random.seed(0)
def minibatch_by_words(examples, size=5000):
random.shuffle(examples)
if isinstance(size, int):
size_ = itertools.repeat(size)
else:
size_ = size
examples = iter(examples)
while True:
batch_size = next(size_)
batch = []
while batch_size >= 0:
try:
example = next(examples)
except StopIteration:
if batch:
yield batch
return
batch_size -= len(example.doc)
batch.append(example)
if batch:
yield batch
else:
break
################ ################
# Data reading # # Data reading #
################ ################
@ -110,7 +85,7 @@ def read_data(
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"]) sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
if oracle_segments: if oracle_segments:
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"])) docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
golds.append(GoldParse(docs[-1], **sent)) golds.append(sent)
sent_annots.append(sent) sent_annots.append(sent)
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length: if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
@ -159,20 +134,19 @@ def read_conllu(file_):
def _make_gold(nlp, text, sent_annots): def _make_gold(nlp, text, sent_annots):
# Flatten the conll annotations, and adjust the head indices # Flatten the conll annotations, and adjust the head indices
flat = defaultdict(list) gold = defaultdict(list)
for sent in sent_annots: for sent in sent_annots:
flat["heads"].extend(len(flat["words"]) + head for head in sent["heads"]) gold["heads"].extend(len(gold["words"]) + head for head in sent["heads"])
for field in ["words", "tags", "deps", "entities", "spaces"]: for field in ["words", "tags", "deps", "entities", "spaces"]:
flat[field].extend(sent[field]) gold[field].extend(sent[field])
# Construct text if necessary # Construct text if necessary
assert len(flat["words"]) == len(flat["spaces"]) assert len(gold["words"]) == len(gold["spaces"])
if text is None: if text is None:
text = "".join( text = "".join(
word + " " * space for word, space in zip(flat["words"], flat["spaces"]) word + " " * space for word, space in zip(gold["words"], gold["spaces"])
) )
doc = nlp.make_doc(text) doc = nlp.make_doc(text)
flat.pop("spaces") gold.pop("spaces")
gold = GoldParse(doc, **flat)
return doc, gold return doc, gold
@ -182,15 +156,10 @@ def _make_gold(nlp, text, sent_annots):
def golds_to_gold_data(docs, golds): def golds_to_gold_data(docs, golds):
"""Get out the training data format used by begin_training, given the """Get out the training data format used by begin_training."""
GoldParse objects."""
data = [] data = []
for doc, gold in zip(docs, golds): for doc, gold in zip(docs, golds):
example = Example(doc=doc) example = Example.from_dict(doc, gold)
example.add_doc_annotation(cats=gold.cats)
token_annotation_dict = gold.orig.to_dict()
example.add_token_annotation(**token_annotation_dict)
example.goldparse = gold
data.append(example) data.append(example)
return data return data
@ -313,15 +282,15 @@ def initialize_pipeline(nlp, examples, config):
nlp.parser.add_multitask_objective("sent_start") nlp.parser.add_multitask_objective("sent_start")
nlp.parser.moves.add_action(2, "subtok") nlp.parser.moves.add_action(2, "subtok")
nlp.add_pipe(nlp.create_pipe("tagger")) nlp.add_pipe(nlp.create_pipe("tagger"))
for ex in examples: for eg in examples:
for tag in ex.gold.tags: for tag in eg.get_aligned("TAG", as_string=True):
if tag is not None: if tag is not None:
nlp.tagger.add_label(tag) nlp.tagger.add_label(tag)
# Replace labels that didn't make the frequency cutoff # Replace labels that didn't make the frequency cutoff
actions = set(nlp.parser.labels) actions = set(nlp.parser.labels)
label_set = set([act.split("-")[1] for act in actions if "-" in act]) label_set = set([act.split("-")[1] for act in actions if "-" in act])
for ex in examples: for eg in examples:
gold = ex.gold gold = eg.gold
for i, label in enumerate(gold.labels): for i, label in enumerate(gold.labels):
if label is not None and label not in label_set: if label is not None and label not in label_set:
gold.labels[i] = label.split("||")[0] gold.labels[i] = label.split("||")[0]
@ -415,13 +384,12 @@ def main(ud_dir, parses_dir, config, corpus, limit=0):
optimizer = initialize_pipeline(nlp, examples, config) optimizer = initialize_pipeline(nlp, examples, config)
for i in range(config.nr_epoch): for i in range(config.nr_epoch):
docs = [nlp.make_doc(example.doc.text) for example in examples] batches = spacy.minibatch_by_words(examples, size=config.batch_size)
batches = minibatch_by_words(examples, size=config.batch_size)
losses = {} losses = {}
n_train_words = sum(len(doc) for doc in docs) n_train_words = sum(len(eg.reference.doc) for eg in examples)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar: with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches: for batch in batches:
pbar.update(sum(len(ex.doc) for ex in batch)) pbar.update(sum(len(eg.reference.doc) for eg in batch))
nlp.update( nlp.update(
examples=batch, sgd=optimizer, drop=config.dropout, losses=losses, examples=batch, sgd=optimizer, drop=config.dropout, losses=losses,
) )

View File

@@ -30,7 +30,7 @@ ENTITIES = {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)}
     model=("Model name, should have pretrained word embeddings", "positional", None, str),
     output_dir=("Optional output directory", "option", "o", Path),
 )
-def main(model=None, output_dir=None):
+def main(model, output_dir=None):
     """Load the model and create the KB with pre-defined entity encodings.
     If an output_dir is provided, the KB will be stored there in a file 'kb'.
     The updated vocab will also be written to a directory in the output_dir."""

View File

@ -24,8 +24,10 @@ import random
import plac import plac
import spacy import spacy
import os.path import os.path
from spacy.gold.example import Example
from spacy.tokens import Doc from spacy.tokens import Doc
from spacy.gold import read_json_file, GoldParse from spacy.gold import read_json_file
random.seed(0) random.seed(0)
@ -59,17 +61,15 @@ def main(n_iter=10):
print(nlp.pipeline) print(nlp.pipeline)
print("Create data", len(TRAIN_DATA)) print("Create data", len(TRAIN_DATA))
optimizer = nlp.begin_training(get_examples=lambda: TRAIN_DATA) optimizer = nlp.begin_training()
for itn in range(n_iter): for itn in range(n_iter):
random.shuffle(TRAIN_DATA) random.shuffle(TRAIN_DATA)
losses = {} losses = {}
for example in TRAIN_DATA: for example_dict in TRAIN_DATA:
for token_annotation in example.token_annotations: doc = Doc(nlp.vocab, words=example_dict["words"])
doc = Doc(nlp.vocab, words=token_annotation.words) example = Example.from_dict(doc, example_dict)
gold = GoldParse.from_annotation(doc, example.doc_annotation, token_annotation)
nlp.update( nlp.update(
examples=[(doc, gold)], # 1 example examples=[example], # 1 example
drop=0.2, # dropout - make it harder to memorise data drop=0.2, # dropout - make it harder to memorise data
sgd=optimizer, # callable to update weights sgd=optimizer, # callable to update weights
losses=losses, losses=losses,
@ -77,9 +77,9 @@ def main(n_iter=10):
print(losses.get("nn_labeller", 0.0), losses["ner"]) print(losses.get("nn_labeller", 0.0), losses["ner"])
# test the trained model # test the trained model
for example in TRAIN_DATA: for example_dict in TRAIN_DATA:
if example.text is not None: if "text" in example_dict:
doc = nlp(example.text) doc = nlp(example_dict["text"])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents]) print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc]) print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

View File

@ -4,9 +4,10 @@ import random
import warnings import warnings
import srsly import srsly
import spacy import spacy
from spacy.gold import GoldParse from spacy.gold import Example
from spacy.util import minibatch, compounding from spacy.util import minibatch, compounding
# TODO: further fix & test this script for v.3 ? (read_gold_data is never called)
LABEL = "ANIMAL" LABEL = "ANIMAL"
TRAIN_DATA = [ TRAIN_DATA = [
@ -36,15 +37,13 @@ def read_raw_data(nlp, jsonl_loc):
def read_gold_data(nlp, gold_loc): def read_gold_data(nlp, gold_loc):
docs = [] examples = []
golds = []
for json_obj in srsly.read_jsonl(gold_loc): for json_obj in srsly.read_jsonl(gold_loc):
doc = nlp.make_doc(json_obj["text"]) doc = nlp.make_doc(json_obj["text"])
ents = [(ent["start"], ent["end"], ent["label"]) for ent in json_obj["spans"]] ents = [(ent["start"], ent["end"], ent["label"]) for ent in json_obj["spans"]]
gold = GoldParse(doc, entities=ents) example = Example.from_dict(doc, {"entities": ents})
docs.append(doc) examples.append(example)
golds.append(gold) return examples
return list(zip(docs, golds))
def main(model_name, unlabelled_loc): def main(model_name, unlabelled_loc):

View File

@@ -2,7 +2,7 @@
 # coding: utf-8
 """Using the parser to recognise your own semantics

-spaCy's parser component can be used to trained to predict any type of tree
+spaCy's parser component can be trained to predict any type of tree
 structure over your input text. You can also predict trees over whole documents
 or chat logs, with connections between the sentence-roots used to annotate
 discourse structure. In this example, we'll build a message parser for a common

View File

@@ -56,7 +56,7 @@ def main(model=None, output_dir=None, n_iter=100):
             print("Add label", ent[2])
             ner.add_label(ent[2])

-    with nlp.select_pipes(enable="ner") and warnings.catch_warnings():
+    with nlp.select_pipes(enable="simple_ner") and warnings.catch_warnings():
         # show warnings for misaligned entity spans once
         warnings.filterwarnings("once", category=UserWarning, module="spacy")

View File

@@ -19,7 +19,7 @@ from ml_datasets import loaders
 import spacy
 from spacy import util
 from spacy.util import minibatch, compounding
-from spacy.gold import Example, GoldParse
+from spacy.gold import Example


 @plac.annotations(
@@ -62,11 +62,10 @@ def main(config_path, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=Non
     train_examples = []
     for text, cats in zip(train_texts, train_cats):
         doc = nlp.make_doc(text)
-        gold = GoldParse(doc, cats=cats)
+        example = Example.from_dict(doc, {"cats": cats})
         for cat in cats:
             textcat.add_label(cat)
-        ex = Example.from_gold(gold, doc=doc)
-        train_examples.append(ex)
+        train_examples.append(example)

     with nlp.select_pipes(enable="textcat"):  # only train textcat
         optimizer = nlp.begin_training()

View File

@@ -6,7 +6,7 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc==8.0.0a9",
+    "thinc==8.0.0a11",
     "blis>=0.4.0,<0.5.0"
 ]
 build-backend = "setuptools.build_meta"

View File

@@ -1,17 +1,17 @@
 # Our libraries
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc==8.0.0a9
+thinc==8.0.0a11
 blis>=0.4.0,<0.5.0
 ml_datasets>=0.1.1
 murmurhash>=0.28.0,<1.1.0
-wasabi>=0.4.0,<1.1.0
-srsly>=2.0.0,<3.0.0
+wasabi>=0.7.0,<1.1.0
+srsly>=2.1.0,<3.0.0
 catalogue>=0.0.7,<1.1.0
+typer>=0.3.0,<1.0.0
 # Third party dependencies
 numpy>=1.15.0
 requests>=2.13.0,<3.0.0
-plac>=0.9.6,<1.2.0
 tqdm>=4.38.0,<5.0.0
 pydantic>=1.3.0,<2.0.0
 # Official Python utilities

View File

@@ -36,22 +36,21 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc==8.0.0a9
+    thinc==8.0.0a11
 install_requires =
     # Our libraries
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc==8.0.0a9
+    thinc==8.0.0a11
     blis>=0.4.0,<0.5.0
-    wasabi>=0.4.0,<1.1.0
-    srsly>=2.0.0,<3.0.0
+    wasabi>=0.7.0,<1.1.0
+    srsly>=2.1.0,<3.0.0
     catalogue>=0.0.7,<1.1.0
-    ml_datasets>=0.1.1
+    typer>=0.3.0,<1.0.0
     # Third-party dependencies
     tqdm>=4.38.0,<5.0.0
     numpy>=1.15.0
-    plac>=0.9.6,<1.2.0
     requests>=2.13.0,<3.0.0
     pydantic>=1.3.0,<2.0.0
     # Official Python utilities
@@ -61,7 +60,7 @@ install_requires =
 [options.extras_require]
 lookups =
-    spacy_lookups_data>=0.3.1,<0.4.0
+    spacy_lookups_data>=0.3.2,<0.4.0
 cuda =
     cupy>=5.0.0b4,<9.0.0
 cuda80 =
@@ -80,7 +79,8 @@ cuda102 =
     cupy-cuda102>=5.0.0b4,<9.0.0
 # Language tokenizers with external dependencies
 ja =
-    fugashi>=0.1.3
+    sudachipy>=0.4.5
+    sudachidict_core>=20200330
 ko =
     natto-py==0.9.0
 th =

View File

@@ -23,6 +23,8 @@ Options.docstrings = True
 PACKAGES = find_packages()
 MOD_NAMES = [
+    "spacy.gold.align",
+    "spacy.gold.example",
     "spacy.parts_of_speech",
     "spacy.strings",
     "spacy.lexeme",
@@ -37,11 +39,10 @@ MOD_NAMES = [
     "spacy.tokenizer",
     "spacy.syntax.nn_parser",
     "spacy.syntax._parser_model",
-    "spacy.syntax._beam_utils",
     "spacy.syntax.nonproj",
     "spacy.syntax.transition_system",
     "spacy.syntax.arc_eager",
-    "spacy.gold",
+    "spacy.gold.gold_io",
     "spacy.tokens.doc",
     "spacy.tokens.span",
     "spacy.tokens.token",
@@ -120,7 +121,7 @@ class build_ext_subclass(build_ext, build_ext_options):
 def clean(path):
     for path in path.glob("**/*"):
-        if path.is_file() and path.suffix in (".so", ".cpp"):
+        if path.is_file() and path.suffix in (".so", ".cpp", ".html"):
             print(f"Deleting {path.name}")
             path.unlink()

View File

@@ -8,7 +8,7 @@ warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
 from thinc.api import prefer_gpu, require_gpu

 from . import pipeline
-from .cli.info import info as cli_info
+from .cli.info import info
 from .glossary import explain
 from .about import __version__
 from .errors import Errors, Warnings
@@ -34,7 +34,3 @@ def load(name, **overrides):
 def blank(name, **kwargs):
     LangClass = util.get_lang_class(name)
     return LangClass(**kwargs)
-
-
-def info(model=None, markdown=False, silent=False):
-    return cli_info(model, markdown, silent)

View File

@@ -1,31 +1,4 @@
 if __name__ == "__main__":
-    import plac
-    import sys
-    from wasabi import msg
-    from spacy.cli import download, link, info, package, pretrain, convert
-    from spacy.cli import init_model, profile, evaluate, validate, debug_data
-    from spacy.cli import train_cli
-
-    commands = {
-        "download": download,
-        "link": link,
-        "info": info,
-        "train": train_cli,
-        "pretrain": pretrain,
-        "debug-data": debug_data,
-        "evaluate": evaluate,
-        "convert": convert,
-        "package": package,
-        "init-model": init_model,
-        "profile": profile,
-        "validate": validate,
-    }
-    if len(sys.argv) == 1:
-        msg.info("Available commands", ", ".join(commands), exits=1)
-    command = sys.argv.pop(1)
-    sys.argv[0] = f"spacy {command}"
-    if command in commands:
-        plac.call(commands[command], sys.argv[1:])
-    else:
-        available = f"Available: {', '.join(commands)}"
-        msg.fail(f"Unknown command: {command}", available, exits=1)
+    from spacy.cli import setup_cli
+
+    setup_cli()

View File

@@ -1,7 +1,8 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.0.dev9"
+__version__ = "3.0.0.dev12"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __shortcuts__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json"
+__projects__ = "https://github.com/explosion/spacy-boilerplates"

View File

@@ -1,19 +1,28 @@
 from wasabi import msg

+from ._app import app, setup_cli  # noqa: F401
+
+# These are the actual functions, NOT the wrapped CLI commands. The CLI commands
+# are registered automatically and won't have to be imported here.
 from .download import download  # noqa: F401
 from .info import info  # noqa: F401
 from .package import package  # noqa: F401
 from .profile import profile  # noqa: F401
-from .train_from_config import train_cli  # noqa: F401
+from .train import train_cli  # noqa: F401
 from .pretrain import pretrain  # noqa: F401
 from .debug_data import debug_data  # noqa: F401
 from .evaluate import evaluate  # noqa: F401
 from .convert import convert  # noqa: F401
 from .init_model import init_model  # noqa: F401
 from .validate import validate  # noqa: F401
+from .project import project_clone, project_assets, project_run  # noqa: F401
+from .project import project_run_all  # noqa: F401


+@app.command("link", no_args_is_help=True, deprecated=True, hidden=True)
 def link(*args, **kwargs):
+    """As of spaCy v3.0, model symlinks are deprecated. You can load models
+    using their full names or from a directory path."""
     msg.warn(
         "As of spaCy v3.0, model symlinks are deprecated. You can load models "
         "using their full names or from a directory path."

spacy/cli/_app.py (new file, 24 lines)
View File

@ -0,0 +1,24 @@
import typer
from typer.main import get_command
COMMAND = "python -m spacy"
NAME = "spacy"
HELP = """spaCy Command-line Interface
DOCS: https://spacy.io/api/cli
"""
app = typer.Typer(name=NAME, help=HELP)
# Wrappers for Typer's annotations. Initially created to set defaults and to
# keep the names short, but not needed at the moment.
Arg = typer.Argument
Opt = typer.Option
def setup_cli() -> None:
# Ensure that the help messages always display the correct prompt
command = get_command(app)
command(prog_name=COMMAND)
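
A hedged sketch of how a command hooks into the shared `app` object defined above. The `hello` command is invented purely for illustration; the `Arg`/`Opt` pattern mirrors the real commands elsewhere in this diff:

```python
from spacy.cli._app import app, Arg, Opt, setup_cli


@app.command("hello")
def hello_cli(
    # fmt: off
    name: str = Arg(..., help="Name to greet"),
    shout: bool = Opt(False, "--shout", "-S", help="Uppercase the greeting"),
    # fmt: on
):
    """Toy command showing the Arg/Opt wrappers (not part of spaCy)."""
    greeting = f"Hello, {name}!"
    print(greeting.upper() if shout else greeting)


if __name__ == "__main__":
    setup_cli()  # dispatches to the registered commands
```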

View File

@ -1,88 +1,115 @@
from typing import Optional
from enum import Enum
from pathlib import Path from pathlib import Path
from wasabi import Printer from wasabi import Printer
import srsly import srsly
import re import re
import sys
from .converters import conllu2json, iob2json, conll_ner2json from ._app import app, Arg, Opt
from .converters import ner_jsonl2json from ..gold import docs_to_json
from ..tokens import DocBin
from ..gold.converters import iob2docs, conll_ner2docs, json2docs
# Converters are matched by file extension except for ner/iob, which are # Converters are matched by file extension except for ner/iob, which are
# matched by file extension and content. To add a converter, add a new # matched by file extension and content. To add a converter, add a new
# entry to this dict with the file extension mapped to the converter function # entry to this dict with the file extension mapped to the converter function
# imported from /converters. # imported from /converters.
CONVERTERS = { CONVERTERS = {
"conllubio": conllu2json, # "conllubio": conllu2docs, TODO
"conllu": conllu2json, # "conllu": conllu2docs, TODO
"conll": conllu2json, # "conll": conllu2docs, TODO
"ner": conll_ner2json, "ner": conll_ner2docs,
"iob": iob2json, "iob": iob2docs,
"jsonl": ner_jsonl2json, "json": json2docs,
} }
# File types
FILE_TYPES = ("json", "jsonl", "msg") # File types that can be written to stdout
FILE_TYPES_STDOUT = ("json", "jsonl") FILE_TYPES_STDOUT = ("json")
def convert( class FileTypes(str, Enum):
json = "json"
spacy = "spacy"
@app.command("convert")
def convert_cli(
# fmt: off # fmt: off
input_file: ("Input file", "positional", None, str), input_path: str = Arg(..., help="Input file or directory", exists=True),
output_dir: ("Output directory. '-' for stdout.", "positional", None, str) = "-", output_dir: Path = Arg("-", help="Output directory. '-' for stdout.", allow_dash=True, exists=True),
file_type: (f"Type of data to produce: {FILE_TYPES}", "option", "t", str, FILE_TYPES) = "json", file_type: FileTypes = Opt("spacy", "--file-type", "-t", help="Type of data to produce"),
n_sents: ("Number of sentences per doc (0 to disable)", "option", "n", int) = 1, n_sents: int = Opt(1, "--n-sents", "-n", help="Number of sentences per doc (0 to disable)"),
seg_sents: ("Segment sentences (for -c ner)", "flag", "s") = False, seg_sents: bool = Opt(False, "--seg-sents", "-s", help="Segment sentences (for -c ner)"),
model: ("Model for sentence segmentation (for -s)", "option", "b", str) = None, model: Optional[str] = Opt(None, "--model", "-b", help="Model for sentence segmentation (for -s)"),
morphology: ("Enable appending morphology to tags", "flag", "m", bool) = False, morphology: bool = Opt(False, "--morphology", "-m", help="Enable appending morphology to tags"),
merge_subtokens: ("Merge CoNLL-U subtokens", "flag", "T", bool) = False, merge_subtokens: bool = Opt(False, "--merge-subtokens", "-T", help="Merge CoNLL-U subtokens"),
converter: (f"Converter: {tuple(CONVERTERS.keys())}", "option", "c", str) = "auto", converter: str = Opt("auto", "--converter", "-c", help=f"Converter: {tuple(CONVERTERS.keys())}"),
ner_map_path: ("NER tag mapping (as JSON-encoded dict of entity types)", "option", "N", Path) = None, ner_map: Optional[Path] = Opt(None, "--ner-map", "-N", help="NER tag mapping (as JSON-encoded dict of entity types)", exists=True),
lang: ("Language (if tokenizer required)", "option", "l", str) = None, lang: Optional[str] = Opt(None, "--lang", "-l", help="Language (if tokenizer required)"),
# fmt: on # fmt: on
): ):
""" """
Convert files into JSON format for use with train command and other Convert files into json or DocBin format for use with train command and other
experiment management functions. If no output_dir is specified, the data experiment management functions. If no output_dir is specified, the data
is written to stdout, so you can pipe them forward to a JSON file: is written to stdout, so you can pipe them forward to a JSON file:
$ spacy convert some_file.conllu > some_file.json $ spacy convert some_file.conllu > some_file.json
""" """
no_print = output_dir == "-" if isinstance(file_type, FileTypes):
msg = Printer(no_print=no_print) # We get an instance of the FileTypes from the CLI so we need its string value
input_path = Path(input_file) file_type = file_type.value
if file_type not in FILE_TYPES_STDOUT and output_dir == "-": input_path = Path(input_path)
# TODO: support msgpack via stdout in srsly? output_dir = "-" if output_dir == Path("-") else output_dir
msg.fail( cli_args = locals()
f"Can't write .{file_type} data to stdout", silent = output_dir == "-"
"Please specify an output directory.", msg = Printer(no_print=silent)
exits=1, verify_cli_args(msg, **cli_args)
converter = _get_converter(msg, converter, input_path)
convert(
input_path,
output_dir,
file_type=file_type,
n_sents=n_sents,
seg_sents=seg_sents,
model=model,
morphology=morphology,
merge_subtokens=merge_subtokens,
converter=converter,
ner_map=ner_map,
lang=lang,
silent=silent,
msg=msg,
) )
if not input_path.exists():
msg.fail("Input file not found", input_path, exits=1)
if output_dir != "-" and not Path(output_dir).exists(): def convert(
msg.fail("Output directory not found", output_dir, exits=1) input_path: Path,
input_data = input_path.open("r", encoding="utf-8").read() output_dir: Path,
if converter == "auto": *,
converter = input_path.suffix[1:] file_type: str = "json",
if converter == "ner" or converter == "iob": n_sents: int = 1,
converter_autodetect = autodetect_ner_format(input_data) seg_sents: bool = False,
if converter_autodetect == "ner": model: Optional[str] = None,
msg.info("Auto-detected token-per-line NER format") morphology: bool = False,
converter = converter_autodetect merge_subtokens: bool = False,
elif converter_autodetect == "iob": converter: str = "auto",
msg.info("Auto-detected sentence-per-line NER format") ner_map: Optional[Path] = None,
converter = converter_autodetect lang: Optional[str] = None,
else: silent: bool = True,
msg.warn( msg: Optional[Path] = None,
"Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert" ) -> None:
) if not msg:
if converter not in CONVERTERS: msg = Printer(no_print=silent)
msg.fail(f"Can't find converter for {converter}", exits=1) ner_map = srsly.read_json(ner_map) if ner_map is not None else None
ner_map = None
if ner_map_path is not None: for input_loc in walk_directory(input_path):
ner_map = srsly.read_json(ner_map_path) input_data = input_loc.open("r", encoding="utf-8").read()
# Use converter function to convert data # Use converter function to convert data
func = CONVERTERS[converter] func = CONVERTERS[converter]
data = func( docs = func(
input_data, input_data,
n_sents=n_sents, n_sents=n_sents,
seg_sents=seg_sents, seg_sents=seg_sents,
@ -90,29 +117,41 @@ def convert(
merge_subtokens=merge_subtokens, merge_subtokens=merge_subtokens,
lang=lang, lang=lang,
model=model, model=model,
no_print=no_print, no_print=silent,
ner_map=ner_map, ner_map=ner_map,
) )
if output_dir != "-": if output_dir == "-":
# Export data to a file _print_docs_to_stdout(docs, file_type)
suffix = f".{file_type}"
output_file = Path(output_dir) / Path(input_path.parts[-1]).with_suffix(suffix)
if file_type == "json":
srsly.write_json(output_file, data)
elif file_type == "jsonl":
srsly.write_jsonl(output_file, data)
elif file_type == "msg":
srsly.write_msgpack(output_file, data)
msg.good(f"Generated output file ({len(data)} documents): {output_file}")
else: else:
# Print to stdout if input_loc != input_path:
if file_type == "json": subpath = input_loc.relative_to(input_path)
srsly.write_json("-", data) output_file = Path(output_dir) / subpath.with_suffix(f".{file_type}")
elif file_type == "jsonl": else:
srsly.write_jsonl("-", data) output_file = Path(output_dir) / input_loc.parts[-1]
output_file = output_file.with_suffix(f".{file_type}")
_write_docs_to_file(docs, output_file, file_type)
msg.good(f"Generated output file ({len(docs)} documents): {output_file}")
def autodetect_ner_format(input_data): def _print_docs_to_stdout(docs, output_type):
if output_type == "json":
srsly.write_json("-", docs_to_json(docs))
else:
sys.stdout.buffer.write(DocBin(docs=docs).to_bytes())
def _write_docs_to_file(docs, output_file, output_type):
if not output_file.parent.exists():
output_file.parent.mkdir(parents=True)
if output_type == "json":
srsly.write_json(output_file, docs_to_json(docs))
else:
data = DocBin(docs=docs).to_bytes()
with output_file.open("wb") as file_:
file_.write(data)
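
A hedged sketch of the `DocBin` round-trip behind the new binary output path above (the sentences are made up):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = [nlp.make_doc("Berlin is a city"), nlp.make_doc("She runs")]

data = DocBin(docs=docs).to_bytes()                        # what _write_docs_to_file stores
docs_back = DocBin().from_bytes(data).get_docs(nlp.vocab)  # generator of reconstructed Docs
print([doc.text for doc in docs_back])
```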
def autodetect_ner_format(input_data: str) -> str:
# guess format from the first 20 lines # guess format from the first 20 lines
lines = input_data.split("\n")[:20] lines = input_data.split("\n")[:20]
format_guesses = {"ner": 0, "iob": 0} format_guesses = {"ner": 0, "iob": 0}
@ -129,3 +168,86 @@ def autodetect_ner_format(input_data):
if format_guesses["ner"] == 0 and format_guesses["iob"] > 0: if format_guesses["ner"] == 0 and format_guesses["iob"] > 0:
return "iob" return "iob"
return None return None
def walk_directory(path):
if not path.is_dir():
return [path]
paths = [path]
locs = []
seen = set()
for path in paths:
if str(path) in seen:
continue
seen.add(str(path))
if path.parts[-1].startswith("."):
continue
elif path.is_dir():
paths.extend(path.iterdir())
else:
locs.append(path)
return locs
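
A short usage note for `walk_directory` above: it recursively collects file paths, skips anything whose name starts with a dot, and returns a single-element list when given a plain file. The directory layout in the comment is hypothetical:

```python
from pathlib import Path

# Hypothetical layout: corpus/train.conllu, corpus/dev.conllu, corpus/.cache/...
for loc in walk_directory(Path("corpus")):
    print(loc)  # the two .conllu files; the hidden .cache directory is skipped
```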
def verify_cli_args(
msg,
input_path,
output_dir,
file_type,
n_sents,
seg_sents,
model,
morphology,
merge_subtokens,
converter,
ner_map,
lang,
):
input_path = Path(input_path)
if file_type not in FILE_TYPES_STDOUT and output_dir == "-":
# TODO: support msgpack via stdout in srsly?
msg.fail(
f"Can't write .{file_type} data to stdout",
"Please specify an output directory.",
exits=1,
)
if not input_path.exists():
msg.fail("Input file not found", input_path, exits=1)
if output_dir != "-" and not Path(output_dir).exists():
msg.fail("Output directory not found", output_dir, exits=1)
if input_path.is_dir():
input_locs = walk_directory(input_path)
if len(input_locs) == 0:
msg.fail("No input files in directory", input_path, exits=1)
file_types = list(set([loc.suffix[1:] for loc in input_locs]))
if len(file_types) >= 2:
file_types = ",".join(file_types)
msg.fail("All input files must be same type", file_types, exits=1)
converter = _get_converter(msg, converter, input_path)
if converter not in CONVERTERS:
msg.fail(f"Can't find converter for {converter}", exits=1)
return converter
def _get_converter(msg, converter, input_path):
if input_path.is_dir():
input_path = walk_directory(input_path)[0]
if converter == "auto":
converter = input_path.suffix[1:]
if converter == "ner" or converter == "iob":
with input_path.open() as file_:
input_data = file_.read()
converter_autodetect = autodetect_ner_format(input_data)
if converter_autodetect == "ner":
msg.info("Auto-detected token-per-line NER format")
converter = converter_autodetect
elif converter_autodetect == "iob":
msg.info("Auto-detected sentence-per-line NER format")
converter = converter_autodetect
else:
msg.warn(
"Can't automatically detect NER format. "
"Conversion may not succeed. "
"See https://spacy.io/api/cli#convert"
)
return converter

View File

@ -1,4 +0,0 @@
from .conllu2json import conllu2json # noqa: F401
from .iob2json import iob2json # noqa: F401
from .conll_ner2json import conll_ner2json # noqa: F401
from .jsonl2json import ner_jsonl2json # noqa: F401

View File

@ -1,65 +0,0 @@
from wasabi import Printer
from ...gold import iob_to_biluo
from ...util import minibatch
from .conll_ner2json import n_sents_info
def iob2json(input_data, n_sents=10, no_print=False, *args, **kwargs):
"""
Convert IOB files with one sentence per line and tags separated with '|'
into JSON format for use with train cli. IOB and IOB2 are accepted.
Sample formats:
I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
"""
msg = Printer(no_print=no_print)
docs = read_iob(input_data.split("\n"))
if n_sents > 0:
n_sents_info(msg, n_sents)
docs = merge_sentences(docs, n_sents)
return docs
def read_iob(raw_sents):
sentences = []
for line in raw_sents:
if not line.strip():
continue
tokens = [t.split("|") for t in line.split()]
if len(tokens[0]) == 3:
words, pos, iob = zip(*tokens)
elif len(tokens[0]) == 2:
words, iob = zip(*tokens)
pos = ["-"] * len(words)
else:
raise ValueError(
"The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert"
)
biluo = iob_to_biluo(iob)
sentences.append(
[
{"orth": w, "tag": p, "ner": ent}
for (w, p, ent) in zip(words, pos, biluo)
]
)
sentences = [{"tokens": sent} for sent in sentences]
paragraphs = [{"sentences": [sent]} for sent in sentences]
docs = [{"id": i, "paragraphs": [para]} for i, para in enumerate(paragraphs)]
return docs
def merge_sentences(docs, n_sents):
merged = []
for group in minibatch(docs, size=n_sents):
group = list(group)
first = group.pop(0)
to_extend = first["paragraphs"][0]["sentences"]
for sent in group:
to_extend.extend(sent["paragraphs"][0]["sentences"])
merged.append(first)
return merged

View File

@ -1,50 +0,0 @@
import srsly
from ...gold import docs_to_json
from ...util import get_lang_class, minibatch
def ner_jsonl2json(input_data, lang=None, n_sents=10, use_morphology=False, **_):
if lang is None:
raise ValueError("No --lang specified, but tokenization required")
json_docs = []
input_examples = [srsly.json_loads(line) for line in input_data.strip().split("\n")]
nlp = get_lang_class(lang)()
sentencizer = nlp.create_pipe("sentencizer")
for i, batch in enumerate(minibatch(input_examples, size=n_sents)):
docs = []
for record in batch:
raw_text = record["text"]
if "entities" in record:
ents = record["entities"]
else:
ents = record["spans"]
ents = [(e["start"], e["end"], e["label"]) for e in ents]
doc = nlp.make_doc(raw_text)
sentencizer(doc)
spans = [doc.char_span(s, e, label=L) for s, e, L in ents]
doc.ents = _cleanup_spans(spans)
docs.append(doc)
json_docs.append(docs_to_json(docs, id=i))
return json_docs
def _cleanup_spans(spans):
output = []
seen = set()
for span in spans:
if span is not None:
# Trim whitespace
while len(span) and span[0].is_space:
span = span[1:]
while len(span) and span[-1].is_space:
span = span[:-1]
if not len(span):
continue
for i in range(span.start, span.end):
if i in seen:
break
else:
output.append(span)
seen.update(range(span.start, span.end))
return output

View File

@ -1,11 +1,14 @@
from typing import Optional, List, Sequence, Dict, Any, Tuple
from pathlib import Path from pathlib import Path
from collections import Counter from collections import Counter
import sys import sys
import srsly import srsly
from wasabi import Printer, MESSAGES from wasabi import Printer, MESSAGES
from ..gold import GoldCorpus from ._app import app, Arg, Opt
from ..gold import Corpus, Example
from ..syntax import nonproj from ..syntax import nonproj
from ..language import Language
from ..util import load_model, get_lang_class from ..util import load_model, get_lang_class
@ -18,17 +21,18 @@ BLANK_MODEL_MIN_THRESHOLD = 100
BLANK_MODEL_THRESHOLD = 2000 BLANK_MODEL_THRESHOLD = 2000
def debug_data( @app.command("debug-data")
def debug_data_cli(
# fmt: off # fmt: off
lang: ("Model language", "positional", None, str), lang: str = Arg(..., help="Model language"),
train_path: ("Location of JSON-formatted training data", "positional", None, Path), train_path: Path = Arg(..., help="Location of JSON-formatted training data", exists=True),
dev_path: ("Location of JSON-formatted development data", "positional", None, Path), dev_path: Path = Arg(..., help="Location of JSON-formatted development data", exists=True),
tag_map_path: ("Location of JSON-formatted tag map", "option", "tm", Path) = None, tag_map_path: Optional[Path] = Opt(None, "--tag-map-path", "-tm", help="Location of JSON-formatted tag map", exists=True, dir_okay=False),
base_model: ("Name of model to update (optional)", "option", "b", str) = None, base_model: Optional[str] = Opt(None, "--base-model", "-b", help="Name of model to update (optional)"),
pipeline: ("Comma-separated names of pipeline components to train", "option", "p", str) = "tagger,parser,ner", pipeline: str = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of pipeline components to train"),
ignore_warnings: ("Ignore warnings, only show stats and errors", "flag", "IW", bool) = False, ignore_warnings: bool = Opt(False, "--ignore-warnings", "-IW", help="Ignore warnings, only show stats and errors"),
verbose: ("Print additional information and explanations", "flag", "V", bool) = False, verbose: bool = Opt(False, "--verbose", "-V", help="Print additional information and explanations"),
no_format: ("Don't pretty-print the results", "flag", "NF", bool) = False, no_format: bool = Opt(False, "--no-format", "-NF", help="Don't pretty-print the results"),
# fmt: on # fmt: on
): ):
""" """
@ -36,8 +40,36 @@ def debug_data(
stats, and find problems like invalid entity annotations, cyclic stats, and find problems like invalid entity annotations, cyclic
dependencies, low data labels and more. dependencies, low data labels and more.
""" """
msg = Printer(pretty=not no_format, ignore_warnings=ignore_warnings) debug_data(
lang,
train_path,
dev_path,
tag_map_path=tag_map_path,
base_model=base_model,
pipeline=[p.strip() for p in pipeline.split(",")],
ignore_warnings=ignore_warnings,
verbose=verbose,
no_format=no_format,
silent=False,
)
def debug_data(
lang: str,
train_path: Path,
dev_path: Path,
*,
tag_map_path: Optional[Path] = None,
base_model: Optional[str] = None,
pipeline: List[str] = ["tagger", "parser", "ner"],
ignore_warnings: bool = False,
verbose: bool = False,
no_format: bool = True,
silent: bool = True,
):
msg = Printer(
no_print=silent, pretty=not no_format, ignore_warnings=ignore_warnings
)
# Make sure all files and paths exists if they are needed
if not train_path.exists():
msg.fail("Training data not found", train_path, exits=1)
@@ -49,7 +81,6 @@ def debug_data(
tag_map = srsly.read_json(tag_map_path)
# Initialize the model and pipeline
if base_model:
nlp = load_model(base_model)
else:
@@ -68,12 +99,9 @@ def debug_data(
loading_train_error_message = ""
loading_dev_error_message = ""
with msg.loading("Loading corpus..."):
corpus = Corpus(train_path, dev_path)
try:
train_dataset = list(corpus.train_dataset(nlp))
except ValueError as e:
loading_train_error_message = f"Training data cannot be loaded: {e}"
try:
@@ -89,11 +117,9 @@ def debug_data(
msg.good("Corpus is loadable")
# Create all gold data here to avoid iterating over the train_dataset constantly
gold_train_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=True)
gold_train_unpreprocessed_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=False)
gold_dev_data = _compile_gold(dev_dataset, pipeline, nlp, make_proj=True)
train_texts = gold_train_data["texts"]
dev_texts = gold_dev_data["texts"]
@@ -446,7 +472,7 @@ def debug_data(
sys.exit(1)
def _load_file(file_path: Path, msg: Printer) -> None:
file_name = file_path.parts[-1]
if file_path.suffix == ".json":
with msg.loading(f"Loading {file_name}..."):
@@ -465,7 +491,9 @@ def _load_file(file_path, msg):
)
def _compile_gold(
examples: Sequence[Example], pipeline: List[str], nlp: Language, make_proj: bool
) -> Dict[str, Any]:
data = {
"ner": Counter(),
"cats": Counter(),
@@ -484,20 +512,20 @@ def _compile_gold(examples, pipeline, nlp):
"n_cats_multilabel": 0,
"texts": set(),
}
for eg in examples:
gold = eg.reference
doc = eg.predicted
valid_words = [x for x in gold if x is not None]
data["words"].update(valid_words)
data["n_words"] += len(valid_words)
data["n_misaligned_words"] += len(gold) - len(valid_words)
data["texts"].add(doc.text)
if len(nlp.vocab.vectors):
for word in valid_words:
if nlp.vocab.strings[word] not in nlp.vocab.vectors:
data["words_missing_vectors"].update([word])
if "ner" in pipeline:
for i, label in enumerate(eg.get_aligned_ner()):
if label is None:
continue
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
@@ -523,32 +551,34 @@ def _compile_gold(examples, pipeline, nlp):
if list(gold.cats.values()).count(1.0) != 1:
data["n_cats_multilabel"] += 1
if "tagger" in pipeline:
tags = eg.get_aligned("TAG", as_string=True)
data["tags"].update([x for x in tags if x is not None])
if "parser" in pipeline:
aligned_heads, aligned_deps = eg.get_aligned_parse(projectivize=make_proj)
data["deps"].update([x for x in aligned_deps if x is not None])
for i, (dep, head) in enumerate(zip(aligned_deps, aligned_heads)):
if head == i:
data["roots"].update([dep])
data["n_sents"] += 1
if nonproj.is_nonproj_tree(aligned_heads):
data["n_nonproj"] += 1
if nonproj.contains_cycle(aligned_heads):
data["n_cycles"] += 1
return data
def _format_labels(labels: List[Tuple[str, int]], counts: bool = False) -> str:
if counts:
return ", ".join([f"'{l}' ({c})" for l, c in labels])
return ", ".join([f"'{l}'" for l in labels])
def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
count = 0
for eg in data:
labels = [
label.split("-")[1]
for label in eg.get_aligned_ner()
if label not in ("O", "-", None)
]
if label not in labels:
@@ -556,7 +586,7 @@ def _get_examples_without_label(data, label):
return count
def _get_labels_from_model(nlp: Language, pipe_name: str) -> Sequence[str]:
if pipe_name not in nlp.pipe_names:
return set()
pipe = nlp.get_pipe(pipe_name)
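Since debug_data is now a plain function behind the Typer command, it can be driven from Python as well as from the CLI. A minimal sketch, assuming the function lives in spacy.cli.debug_data and using hypothetical data paths:

from pathlib import Path
from spacy.cli.debug_data import debug_data  # import path assumed

# Hypothetical training/dev files; silent=False makes the wasabi Printer show the report
debug_data(
    "en",
    Path("train.json"),
    Path("dev.json"),
    pipeline=["tagger", "parser", "ner"],
    silent=False,
)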
View File
@@ -1,23 +1,36 @@
from typing import Optional, Sequence, Union
import requests
import sys
from wasabi import msg
import typer
from ._app import app, Arg, Opt
from .. import about
from ..util import is_package, get_base_version, run_command
@app.command(
"download",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def download_cli(
# fmt: off
ctx: typer.Context,
model: str = Arg(..., help="Model to download (shortcut or name)"),
direct: bool = Opt(False, "--direct", "-d", "-D", help="Force direct download of name + version"),
# fmt: on
):
"""
Download compatible model from default download path using pip. If --direct
flag is set, the command expects the full model name with version.
For direct downloads, the compatibility check will be skipped. All
additional arguments provided to this command will be passed to `pip install`
on model installation.
"""
download(model, direct, *ctx.args)
def download(model: str, direct: bool = False, *pip_args) -> None:
if not is_package("spacy") and "--no-deps" not in pip_args:
msg.warn(
"Skipping model package dependencies and setting `--no-deps`. "
@@ -33,22 +46,20 @@ def download(
components = model.split("-")
model_name = "".join(components[:-1])
version = components[-1]
download_model(dl_tpl.format(m=model_name, v=version), pip_args)
else:
shortcuts = get_json(about.__shortcuts__, "available shortcuts")
model_name = shortcuts.get(model, model)
compatibility = get_compatibility()
version = get_version(model_name, compatibility)
download_model(dl_tpl.format(m=model_name, v=version), pip_args)
msg.good(
"Download and installation successful",
f"You can now load the model via spacy.load('{model_name}')",
)
def get_json(url: str, desc: str) -> Union[dict, list]:
r = requests.get(url)
if r.status_code != 200:
msg.fail(
@@ -62,7 +73,7 @@ def get_json(url, desc):
return r.json()
def get_compatibility() -> dict:
version = get_base_version(about.__version__)
comp_table = get_json(about.__compatibility__, "compatibility table")
comp = comp_table["spacy"]
@@ -71,7 +82,7 @@ def get_compatibility():
return comp[version]
def get_version(model: str, comp: dict) -> str:
model = get_base_version(model)
if model not in comp:
msg.fail(
@@ -81,10 +92,12 @@ def get_version(model, comp):
return comp[model][0]
def download_model(
filename: str, user_pip_args: Optional[Sequence[str]] = None
) -> None:
download_url = about.__download_url__ + "/" + filename
pip_args = ["--no-cache-dir"]
if user_pip_args:
pip_args.extend(user_pip_args)
cmd = [sys.executable, "-m", "pip", "install"] + pip_args + [download_url]
run_command(cmd)
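With the Typer context allowing extra args, anything after the model name on the command line is forwarded to pip install; programmatically the same goes through *pip_args. A minimal sketch, assuming the function is importable from spacy.cli.download and using a hypothetical package name:

from spacy.cli.download import download  # import path assumed

# direct=False is passed positionally so the remaining strings become *pip_args for pip install
download("en_core_web_sm", False, "--no-deps", "--quiet")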
View File
@@ -1,46 +1,75 @@
from typing import Optional, List, Dict
from timeit import default_timer as timer
from wasabi import Printer
from pathlib import Path
import re
import srsly
from ..gold import Corpus
from ..tokens import Doc
from ._app import app, Arg, Opt
from ..scorer import Scorer
from .. import util
from .. import displacy
@app.command("evaluate")
def evaluate_cli(
# fmt: off
model: str = Arg(..., help="Model name or path"),
data_path: Path = Arg(..., help="Location of JSON-formatted evaluation data", exists=True),
output: Optional[Path] = Opt(None, "--output", "-o", help="Output JSON file for metrics", dir_okay=False),
gpu_id: int = Opt(-1, "--gpu-id", "-g", help="Use GPU"),
gold_preproc: bool = Opt(False, "--gold-preproc", "-G", help="Use gold preprocessing"),
displacy_path: Optional[Path] = Opt(None, "--displacy-path", "-dp", help="Directory to output rendered parses as HTML", exists=True, file_okay=False),
displacy_limit: int = Opt(25, "--displacy-limit", "-dl", help="Limit of parses to render as HTML"),
# fmt: on
):
"""
Evaluate a model. To render a sample of parses in an HTML file, set an
output directory as the displacy_path argument.
"""
evaluate(
model,
data_path,
output=output,
gpu_id=gpu_id,
gold_preproc=gold_preproc,
displacy_path=displacy_path,
displacy_limit=displacy_limit,
silent=False,
)
def evaluate(
model: str,
data_path: Path,
output: Optional[Path],
gpu_id: int = -1,
gold_preproc: bool = False,
displacy_path: Optional[Path] = None,
displacy_limit: int = 25,
silent: bool = True,
) -> Scorer:
msg = Printer(no_print=silent, pretty=not silent)
util.fix_random_seed()
if gpu_id >= 0:
util.use_gpu(gpu_id)
util.set_env_log(False)
data_path = util.ensure_path(data_path)
output_path = util.ensure_path(output)
displacy_path = util.ensure_path(displacy_path)
if not data_path.exists():
msg.fail("Evaluation data not found", data_path, exits=1)
if displacy_path and not displacy_path.exists():
msg.fail("Visualization output directory not found", displacy_path, exits=1)
corpus = Corpus(data_path, data_path)
if model.startswith("blank:"):
nlp = util.get_lang_class(model.replace("blank:", ""))()
else:
nlp = util.load_model(model)
dev_dataset = list(corpus.dev_dataset(nlp, gold_preproc=gold_preproc))
begin = timer()
scorer = nlp.evaluate(dev_dataset, verbose=False)
end = timer()
nwords = sum(len(ex.predicted) for ex in dev_dataset)
results = {
"Time": f"{end - begin:.2f} s",
"Words": nwords,
@@ -60,10 +89,22 @@ def evaluate(
"Sent R": f"{scorer.sent_r:.2f}",
"Sent F": f"{scorer.sent_f:.2f}",
}
data = {re.sub(r"[\s/]", "_", k.lower()): v for k, v in results.items()}
msg.table(results, title="Results")
if scorer.ents_per_type:
data["ents_per_type"] = scorer.ents_per_type
print_ents_per_type(msg, scorer.ents_per_type)
if scorer.textcats_f_per_cat:
data["textcats_f_per_cat"] = scorer.textcats_f_per_cat
print_textcats_f_per_cat(msg, scorer.textcats_f_per_cat)
if scorer.textcats_auc_per_cat:
data["textcats_auc_per_cat"] = scorer.textcats_auc_per_cat
print_textcats_auc_per_cat(msg, scorer.textcats_auc_per_cat)
if displacy_path:
docs = [ex.predicted for ex in dev_dataset]
render_deps = "parser" in nlp.meta.get("pipeline", [])
render_ents = "ner" in nlp.meta.get("pipeline", [])
render_parses(
@@ -75,11 +116,21 @@ def evaluate(
ents=render_ents,
)
msg.good(f"Generated {displacy_limit} parses as HTML", displacy_path)
if output_path is not None:
srsly.write_json(output_path, data)
msg.good(f"Saved results to {output_path}")
return data
def render_parses(
docs: List[Doc],
output_path: Path,
model_name: str = "",
limit: int = 250,
deps: bool = True,
ents: bool = True,
):
docs[0].user_data["title"] = model_name
if ents:
html = displacy.render(docs[:limit], style="ent", page=True)
@@ -91,3 +142,40 @@ def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=T
)
with (output_path / "parses.html").open("w", encoding="utf8") as file_:
file_.write(html)
def print_ents_per_type(msg: Printer, scores: Dict[str, Dict[str, float]]) -> None:
data = [
(k, f"{v['p']:.2f}", f"{v['r']:.2f}", f"{v['f']:.2f}")
for k, v in scores.items()
]
msg.table(
data,
header=("", "P", "R", "F"),
aligns=("l", "r", "r", "r"),
title="NER (per type)",
)
def print_textcats_f_per_cat(msg: Printer, scores: Dict[str, Dict[str, float]]) -> None:
data = [
(k, f"{v['p']:.2f}", f"{v['r']:.2f}", f"{v['f']:.2f}")
for k, v in scores.items()
]
msg.table(
data,
header=("", "P", "R", "F"),
aligns=("l", "r", "r", "r"),
title="Textcat F (per type)",
)
def print_textcats_auc_per_cat(
msg: Printer, scores: Dict[str, Dict[str, float]]
) -> None:
msg.table(
[(k, f"{v['roc_auc_score']:.2f}") for k, v in scores.items()],
header=("", "ROC AUC"),
aligns=("l", "r"),
title="Textcat ROC AUC (per label)",
)
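Instead of the old return_scores flag, evaluate now always collects a dict of normalized result keys and can also write it to JSON via --output. A minimal sketch of programmatic use, with a hypothetical model and data paths (import path assumed):

from pathlib import Path
from spacy.cli.evaluate import evaluate  # import path assumed

# Keys are the table rows lowercased, with spaces/slashes replaced by underscores
scores = evaluate("en_core_web_sm", Path("dev.json"), output=Path("metrics.json"))
print(scores["words"], scores.get("sent_f"))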
View File
@@ -1,24 +1,80 @@
from typing import Optional, Dict, Any, Union
import platform
from pathlib import Path
from wasabi import Printer
import srsly
from ._app import app, Arg, Opt
from .. import util
from .. import about
@app.command("info")
def info_cli(
# fmt: off
model: Optional[str] = Arg(None, help="Optional model name"),
markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"),
silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"),
# fmt: on
):
"""
Print info about spaCy installation. If a model is specified as an argument,
print model information. Flag --markdown prints details in Markdown for easy
copy-pasting to GitHub issues.
"""
info(model, markdown=markdown, silent=silent)
def info(
model: Optional[str] = None, *, markdown: bool = False, silent: bool = True
) -> Union[str, dict]:
msg = Printer(no_print=silent, pretty=not silent)
if model: if model:
title = f"Info about model '{model}'"
data = info_model(model, silent=silent)
else:
title = "Info about spaCy"
data = info_spacy()
raw_data = {k.lower().replace(" ", "_"): v for k, v in data.items()}
if "Models" in data and isinstance(data["Models"], dict):
data["Models"] = ", ".join(f"{n} ({v})" for n, v in data["Models"].items())
markdown_data = get_markdown(data, title=title)
if markdown:
if not silent:
print(markdown_data)
return markdown_data
if not silent:
table_data = dict(data)
msg.table(table_data, title=title)
return raw_data
def info_spacy() -> Dict[str, Any]:
"""Generate info about the current spaCy installation.
RETURNS (dict): The spaCy info.
"""
all_models = {}
for pkg_name in util.get_installed_models():
package = pkg_name.replace("-", "_")
all_models[package] = util.get_package_version(pkg_name)
return {
"spaCy version": about.__version__,
"Location": str(Path(__file__).parent.parent),
"Platform": platform.platform(),
"Python version": platform.python_version(),
"Models": all_models,
}
def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
"""Generate info about a specific model.
model (str): Model name or path.
silent (bool): Don't print anything, just return.
RETURNS (dict): The model meta.
"""
msg = Printer(no_print=silent, pretty=not silent)
if util.is_package(model):
model_path = util.get_package_path(model)
else:
@@ -32,46 +88,22 @@ def info(
meta["source"] = str(model_path.resolve())
else:
meta["source"] = str(model_path)
return {k: v for k, v in meta.items() if k not in ("accuracy", "speed")}
def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
"""Get data in GitHub-flavoured Markdown format for issues etc.
data (dict or list of tuples): Label/value pairs.
title (str / None): Title, will be rendered as headline 2.
RETURNS (str): The Markdown string.
"""
markdown = []
for key, value in data.items():
if isinstance(value, str) and Path(value).exists():
continue
markdown.append(f"* **{key}:** {value}")
result = "\n{}\n".format("\n".join(markdown))
if title:
result = f"\n## {title}\n{result}"
return result
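info is split the same way: info_cli prints, while info returns either the raw dict (keys lowercased with underscores) or, with markdown=True, the Markdown string. A minimal sketch (import path assumed):

from spacy.cli.info import info  # import path assumed

data = info(silent=True)                       # e.g. data["spacy_version"], data["models"]
issue_text = info(markdown=True, silent=True)  # Markdown block for pasting into GitHub issues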
View File
@@ -1,3 +1,4 @@
from typing import Optional, List, Dict, Any, Union, IO
import math
from tqdm import tqdm
import numpy
@@ -9,10 +10,12 @@ import gzip
import zipfile
import srsly
import warnings
from wasabi import Printer
from ._app import app, Arg, Opt
from ..vectors import Vectors
from ..errors import Errors, Warnings
from ..language import Language
from ..util import ensure_path, get_lang_class, load_model, OOV_RANK
from ..lookups import Lookups
@@ -25,20 +28,21 @@ except ImportError:
DEFAULT_OOV_PROB = -20
@app.command("init-model")
def init_model_cli(
# fmt: off
lang: str = Arg(..., help="Model language"),
output_dir: Path = Arg(..., help="Model output directory"),
freqs_loc: Optional[Path] = Arg(None, help="Location of words frequencies file", exists=True),
clusters_loc: Optional[Path] = Opt(None, "--clusters-loc", "-c", help="Optional location of brown clusters data", exists=True),
jsonl_loc: Optional[Path] = Opt(None, "--jsonl-loc", "-j", help="Location of JSONL-formatted attributes file", exists=True),
vectors_loc: Optional[Path] = Opt(None, "--vectors-loc", "-v", help="Optional vectors file in Word2Vec format", exists=True),
prune_vectors: int = Opt(-1 , "--prune-vectors", "-V", help="Optional number of vectors to prune to"),
truncate_vectors: int = Opt(0, "--truncate-vectors", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
vectors_name: Optional[str] = Opt(None, "--vectors-name", "-vn", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
model_name: Optional[str] = Opt(None, "--model-name", "-mn", help="Optional name for the model meta"),
omit_extra_lookups: bool = Opt(False, "--omit-extra-lookups", "-OEL", help="Don't include extra lookups in model"),
base_model: Optional[str] = Opt(None, "--base-model", "-b", help="Base model (for languages with custom tokenizers)")
# fmt: on
):
"""
@@ -46,6 +50,38 @@ def init_model(
and word vectors. If vectors are provided in Word2Vec format, they can
be either a .txt or zipped as a .zip or .tar.gz.
"""
init_model(
lang,
output_dir,
freqs_loc=freqs_loc,
clusters_loc=clusters_loc,
jsonl_loc=jsonl_loc,
prune_vectors=prune_vectors,
truncate_vectors=truncate_vectors,
vectors_name=vectors_name,
model_name=model_name,
omit_extra_lookups=omit_extra_lookups,
base_model=base_model,
silent=False,
)
def init_model(
lang: str,
output_dir: Path,
freqs_loc: Optional[Path] = None,
clusters_loc: Optional[Path] = None,
jsonl_loc: Optional[Path] = None,
vectors_loc: Optional[Path] = None,
prune_vectors: int = -1,
truncate_vectors: int = 0,
vectors_name: Optional[str] = None,
model_name: Optional[str] = None,
omit_extra_lookups: bool = False,
base_model: Optional[str] = None,
silent: bool = True,
) -> Language:
msg = Printer(no_print=silent, pretty=not silent)
if jsonl_loc is not None:
if freqs_loc is not None or clusters_loc is not None:
settings = ["-j"]
@@ -68,7 +104,7 @@ def init_model(
freqs_loc = ensure_path(freqs_loc)
if freqs_loc is not None and not freqs_loc.exists():
msg.fail("Can't find words frequencies file", freqs_loc, exits=1)
lex_attrs = read_attrs_from_deprecated(msg, freqs_loc, clusters_loc)
with msg.loading("Creating model..."):
nlp = create_model(lang, lex_attrs, name=model_name, base_model=base_model)
@@ -83,7 +119,9 @@ def init_model(
msg.good("Successfully created model")
if vectors_loc is not None:
add_vectors(
msg, nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name
)
vec_added = len(nlp.vocab.vectors)
lex_added = len(nlp.vocab)
msg.good(
@@ -95,7 +133,7 @@ def init_model(
return nlp
def open_file(loc: Union[str, Path]) -> IO:
"""Handle .gz, .tar.gz or unzipped files"""
loc = ensure_path(loc)
if tarfile.is_tarfile(str(loc)):
@@ -111,7 +149,9 @@ def open_file(loc):
return loc.open("r", encoding="utf8")
def read_attrs_from_deprecated(
msg: Printer, freqs_loc: Optional[Path], clusters_loc: Optional[Path]
) -> List[Dict[str, Any]]:
if freqs_loc is not None:
with msg.loading("Counting frequencies..."):
probs, _ = read_freqs(freqs_loc)
@@ -139,7 +179,12 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc):
return lex_attrs
def create_model(
lang: str,
lex_attrs: List[Dict[str, Any]],
name: Optional[str] = None,
base_model: Optional[Union[str, Path]] = None,
) -> Language:
if base_model:
nlp = load_model(base_model)
# keep the tokenizer but remove any existing pipeline components due to
@@ -166,7 +211,14 @@ def create_model(lang, lex_attrs, name=None, base_model=None):
return nlp
def add_vectors(
msg: Printer,
nlp: Language,
vectors_loc: Optional[Path],
truncate_vectors: int,
prune_vectors: int,
name: Optional[str] = None,
) -> None:
vectors_loc = ensure_path(vectors_loc)
if vectors_loc and vectors_loc.parts[-1].endswith(".npz"):
nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb")))
@@ -176,7 +228,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
else:
if vectors_loc:
with msg.loading(f"Reading vectors from {vectors_loc}"):
vectors_data, vector_keys = read_vectors(msg, vectors_loc)
msg.good(f"Loaded vectors from {vectors_loc}")
else:
vectors_data, vector_keys = (None, None)
@@ -195,7 +247,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
nlp.vocab.prune_vectors(prune_vectors)
def read_vectors(msg: Printer, vectors_loc: Path, truncate_vectors: int = 0):
f = open_file(vectors_loc)
shape = tuple(int(size) for size in next(f).split())
if truncate_vectors >= 1:
@@ -215,7 +267,9 @@ def read_vectors(vectors_loc, truncate_vectors=0):
return vectors_data, vectors_keys
def read_freqs(
freqs_loc: Path, max_length: int = 100, min_doc_freq: int = 5, min_freq: int = 50
):
counts = PreshCounter()
total = 0
with freqs_loc.open() as f:
@@ -244,7 +298,7 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
return probs, oov_prob
def read_clusters(clusters_loc: Path) -> dict:
clusters = {}
if ftfy is None:
warnings.warn(Warnings.W004)
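init_model likewise returns the constructed Language object when called from Python, so the CLI wrapper only handles argument parsing and printing. A minimal sketch with hypothetical paths (import path assumed):

from pathlib import Path
from spacy.cli.init_model import init_model  # import path assumed

# Hypothetical frequencies file; silent=False shows the wasabi progress messages
nlp = init_model("en", Path("models/en_base"), freqs_loc=Path("freqs.txt"), silent=False)
print(nlp.lang, len(nlp.vocab))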
View File
@@ -1,19 +1,25 @@
from typing import Optional, Union, Any, Dict
import shutil
from pathlib import Path
from wasabi import Printer, get_raw_input
import srsly
import sys
from ._app import app, Arg, Opt
from ..schemas import validate, ModelMetaSchema
from .. import util
from .. import about
@app.command("package")
def package_cli(
# fmt: off
input_dir: Path = Arg(..., help="Directory with model data", exists=True, file_okay=False),
output_dir: Path = Arg(..., help="Output parent directory", exists=True, file_okay=False),
meta_path: Optional[Path] = Opt(None, "--meta-path", "--meta", "-m", help="Path to meta.json", exists=True, dir_okay=False),
create_meta: bool = Opt(False, "--create-meta", "-c", "-C", help="Create meta.json, even if one exists"),
version: Optional[str] = Opt(None, "--version", "-v", help="Package version to override meta"),
force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing model in output directory"),
# fmt: on
):
"""
@@ -23,6 +29,27 @@ def package(
set and a meta.json already exists in the output directory, the existing
values will be used as the defaults in the command-line prompt.
"""
package(
input_dir,
output_dir,
meta_path=meta_path,
version=version,
create_meta=create_meta,
force=force,
silent=False,
)
def package(
input_dir: Path,
output_dir: Path,
meta_path: Optional[Path] = None,
version: Optional[str] = None,
create_meta: bool = False,
force: bool = False,
silent: bool = True,
) -> None:
msg = Printer(no_print=silent, pretty=not silent)
input_path = util.ensure_path(input_dir)
output_path = util.ensure_path(output_dir)
meta_path = util.ensure_path(meta_path)
@@ -33,23 +60,23 @@ def package(
if meta_path and not meta_path.exists():
msg.fail("Can't find model meta.json", meta_path, exits=1)
meta_path = meta_path or input_dir / "meta.json"
if not meta_path.exists() or not meta_path.is_file():
msg.fail("Can't load model meta.json", meta_path, exits=1)
meta = srsly.read_json(meta_path)
meta = get_meta(input_dir, meta)
if version is not None:
meta["version"] = version
if not create_meta:  # only print if user doesn't want to overwrite
msg.good("Loaded meta.json from file", meta_path)
else:
meta = generate_meta(meta, msg)
errors = validate(ModelMetaSchema, meta)
if errors:
msg.fail("Invalid model meta.json", "\n".join(errors), exits=1)
model_name = meta["lang"] + "_" + meta["name"]
model_name_v = model_name + "-" + meta["version"]
main_path = output_dir / model_name_v
package_path = main_path / model_name
if package_path.exists():
@@ -63,32 +90,37 @@ def package(
exits=1,
)
Path.mkdir(package_path, parents=True)
shutil.copytree(str(input_dir), str(package_path / model_name_v))
create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2))
create_file(main_path / "setup.py", TEMPLATE_SETUP)
create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST)
create_file(package_path / "__init__.py", TEMPLATE_INIT)
msg.good(f"Successfully created package '{model_name_v}'", main_path)
with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "sdist"])
zip_file = main_path / "dist" / f"{model_name_v}.tar.gz"
msg.good(f"Successfully created zipped Python package", zip_file)
def create_file(file_path: Path, contents: str) -> None:
file_path.touch()
file_path.open("w", encoding="utf-8").write(contents)
def get_meta(
model_path: Union[str, Path], existing_meta: Dict[str, Any]
) -> Dict[str, Any]:
meta = {
"lang": "en",
"name": "model",
"version": "0.0.0",
"description": None,
"author": None,
"email": None,
"url": None,
"license": "MIT",
}
meta.update(existing_meta)
nlp = util.load_model_from_path(Path(model_path))
meta["spacy_version"] = util.get_model_version_range(about.__version__)
meta["pipeline"] = nlp.pipe_names
@@ -98,6 +130,23 @@ def generate_meta(model_path, existing_meta, msg):
"keys": nlp.vocab.vectors.n_keys,
"name": nlp.vocab.vectors.name,
}
if about.__title__ != "spacy":
meta["parent_package"] = about.__title__
return meta
def generate_meta(existing_meta: Dict[str, Any], msg: Printer) -> Dict[str, Any]:
meta = existing_meta or {}
settings = [
("lang", "Model language", meta.get("lang", "en")),
("name", "Model name", meta.get("name", "model")),
("version", "Model version", meta.get("version", "0.0.0")),
("description", "Model description", meta.get("description", None)),
("author", "Author", meta.get("author", None)),
("email", "Author email", meta.get("email", None)),
("url", "Author website", meta.get("url", None)),
("license", "License", meta.get("license", "MIT")),
]
msg.divider("Generating meta.json")
msg.text(
"Enter the package settings for your model. The following information "
@@ -106,8 +155,6 @@ def generate_meta(model_path, existing_meta, msg):
for setting, desc, default in settings:
response = get_raw_input(desc, default)
meta[setting] = default if response == "" and default else response
return meta
@@ -158,12 +205,12 @@ def setup_package():
setup(
name=model_name,
description=meta.get('description'),
author=meta.get('author'),
author_email=meta.get('email'),
url=meta.get('url'),
version=meta['version'],
license=meta.get('license'),
packages=[model_name],
package_data={model_name: list_files(model_dir)},
install_requires=list_requirements(meta),
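package now validates the meta against ModelMetaSchema, accepts a version override, and builds the sdist itself instead of telling the user to run setup.py. A minimal sketch of the programmatic equivalent, with hypothetical directories (import path assumed):

from pathlib import Path
from spacy.cli.package import package  # import path assumed

# Overrides the version from meta.json; the built .tar.gz ends up in the package's dist/ folder
package(Path("training/model-best"), Path("packages"), version="0.0.1", force=True, silent=False)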
View File
@@ -1,14 +1,15 @@
from typing import Optional
import random
import numpy
import time
import re
from collections import Counter
from pathlib import Path
from thinc.api import Linear, Maxout, chain, list2array, use_pytorch_for_gpu_memory
from wasabi import msg
import srsly
from ._app import app, Arg, Opt
from ..errors import Errors
from ..ml.models.multi_task import build_masked_language_model
from ..tokens import Doc
@@ -17,25 +18,17 @@ from .. import util
from ..gold import Example
@app.command("pretrain")
def pretrain_cli(
# fmt: off
texts_loc: Path = Arg(..., help="Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", exists=True),
vectors_model: str = Arg(..., help="Name or path to spaCy model with vectors to learn from"),
output_dir: Path = Arg(..., help="Directory to write models to on each epoch"),
config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False),
use_gpu: int = Opt(-1, "--use-gpu", "-g", help="Use GPU"),
resume_path: Optional[Path] = Opt(None, "--resume-path", "-r", help="Path to pretrained weights from which to resume pretraining"),
epoch_resume: Optional[int] = Opt(None, "--epoch-resume", "-er", help="The epoch to resume counting from when using '--resume_path'. Prevents unintended overwriting of existing weight files."),
# fmt: on
):
"""
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
@@ -52,6 +45,26 @@ def pretrain(
all settings are the same between pretraining and training. Ideally,
this is done by using the same config file for both commands.
"""
pretrain(
texts_loc,
vectors_model,
output_dir,
config_path,
use_gpu=use_gpu,
resume_path=resume_path,
epoch_resume=epoch_resume,
)
def pretrain(
texts_loc: Path,
vectors_model: str,
output_dir: Path,
config_path: Path,
use_gpu: int = -1,
resume_path: Optional[Path] = None,
epoch_resume: Optional[int] = None,
):
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
@@ -166,8 +179,7 @@ def pretrain(
skip_counter = 0
loss_func = pretrain_config["loss_func"]
for epoch in range(epoch_resume, pretrain_config["max_epochs"]):
batches = util.minibatch_by_words(texts, size=pretrain_config["batch_size"])
for batch_id, batch in enumerate(batches):
docs, count = make_docs(
nlp,
View File
@@ -1,3 +1,4 @@
from typing import Optional, Sequence, Union, Iterator
import tqdm
from pathlib import Path
import srsly
@@ -5,17 +6,19 @@ import cProfile
import pstats
import sys
import itertools
from wasabi import msg, Printer
from ._app import app, Arg, Opt
from ..language import Language
from ..util import load_model
@app.command("profile")
def profile_cli(
# fmt: off
model: str = Arg(..., help="Model to load"),
inputs: Optional[Path] = Arg(None, help="Location of input file. '-' for stdin.", exists=True, allow_dash=True),
n_texts: int = Opt(10000, "--n-texts", "-n", help="Maximum number of texts to use if available"),
# fmt: on
):
"""
@@ -24,6 +27,18 @@ def profile(
It can either be provided as a JSONL file, or be read from sys.stdin.
If no input file is specified, the IMDB dataset is loaded via Thinc.
"""
profile(model, inputs=inputs, n_texts=n_texts)
def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) -> None:
try:
import ml_datasets
except ImportError:
msg.fail(
"This command requires the ml_datasets library to be installed:"
"pip install ml_datasets",
exits=1,
)
if inputs is not None:
inputs = _read_inputs(inputs, msg)
if inputs is None:
@@ -43,12 +58,12 @@ def profile(
s.strip_dirs().sort_stats("time").print_stats()
def parse_texts(nlp: Language, texts: Sequence[str]) -> None:
for doc in nlp.pipe(tqdm.tqdm(texts), batch_size=16):
pass
def _read_inputs(loc: Union[Path, str], msg: Printer) -> Iterator[str]:
if loc == "-":
msg.info("Reading input from sys.stdin")
file_ = sys.stdin
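Because ml_datasets is only needed for the IMDB fallback, it is now imported lazily inside profile. A minimal sketch of profiling against your own input file so the optional dependency is never touched (hypothetical paths, import path assumed):

from pathlib import Path
from spacy.cli.profile import profile  # import path assumed

# texts.jsonl is a hypothetical JSONL file of input texts, as consumed by _read_inputs
profile("en_core_web_sm", inputs=Path("texts.jsonl"), n_texts=1000)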
704
spacy/cli/project.py Normal file
View File
@@ -0,0 +1,704 @@
from typing import List, Dict, Any, Optional, Sequence
import typer
import srsly
from pathlib import Path
from wasabi import msg
import subprocess
import os
import re
import shutil
import sys
import requests
import tqdm
from ._app import app, Arg, Opt, COMMAND, NAME
from .. import about
from ..schemas import ProjectConfigSchema, validate
from ..util import ensure_path, run_command, make_tempdir, working_dir
from ..util import get_hash, get_checksum, split_command
CONFIG_FILE = "project.yml"
DVC_CONFIG = "dvc.yaml"
DVC_DIR = ".dvc"
DIRS = [
"assets",
"metas",
"configs",
"packages",
"metrics",
"scripts",
"notebooks",
"training",
"corpus",
]
CACHES = [
Path.home() / ".torch",
Path.home() / ".caches" / "torch",
os.environ.get("TORCH_HOME"),
Path.home() / ".keras",
]
DVC_CONFIG_COMMENT = """# This file is auto-generated by spaCy based on your project.yml. Do not edit
# it directly and edit the project.yml instead and re-run the project."""
CLI_HELP = f"""Command-line interface for spaCy projects and working with project
templates. You'd typically start by cloning a project template to a local
directory and fetching its assets like datasets etc. See the project's
{CONFIG_FILE} for the available commands. Under the hood, spaCy uses DVC (Data
Version Control) to manage input and output files and to ensure steps are only
re-run if their inputs change.
"""
project_cli = typer.Typer(help=CLI_HELP, no_args_is_help=True)
@project_cli.callback(invoke_without_command=True)
def callback(ctx: typer.Context):
"""This runs before every project command and ensures DVC is installed."""
ensure_dvc()
################
# CLI COMMANDS #
################
@project_cli.command("clone")
def project_clone_cli(
# fmt: off
name: str = Arg(..., help="The name of the template to fetch"),
dest: Path = Arg(Path.cwd(), help="Where to download and work. Defaults to current working directory.", exists=False),
repo: str = Opt(about.__projects__, "--repo", "-r", help="The repository to look in."),
git: bool = Opt(False, "--git", "-G", help="Initialize project as a Git repo"),
no_init: bool = Opt(False, "--no-init", "-NI", help="Don't initialize the project with DVC"),
# fmt: on
):
"""Clone a project template from a repository. Calls into "git" and will
only download the files from the given subdirectory. The GitHub repo
defaults to the official spaCy template repo, but can be customized
(including using a private repo). Setting the --git flag will also
initialize the project directory as a Git repo. If the project is intended
to be a Git repo, it should be initialized with Git first, before
initializing DVC (Data Version Control). This allows DVC to integrate with
Git.
"""
if dest == Path.cwd():
dest = dest / name
project_clone(name, dest, repo=repo, git=git, no_init=no_init)
@project_cli.command("init")
def project_init_cli(
# fmt: off
path: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
git: bool = Opt(False, "--git", "-G", help="Initialize project as a Git repo"),
force: bool = Opt(False, "--force", "-F", help="Force initialization"),
# fmt: on
):
"""Initialize a project directory with DVC and optionally Git. This should
typically be taken care of automatically when you run the "project clone"
command, but you can also run it separately. If the project is intended to
be a Git repo, it should be initialized with Git first, before initializing
DVC. This allows DVC to integrate with Git.
"""
project_init(path, git=git, force=force, silent=True)
@project_cli.command("assets")
def project_assets_cli(
# fmt: off
project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
# fmt: on
):
"""Use DVC (Data Version Control) to fetch project assets. Assets are
defined in the "assets" section of the project config. If possible, DVC
will try to track the files so you can pull changes from upstream. It will
also try and store the checksum so the assets are versioned. If the file
can't be tracked or checked, it will be downloaded without DVC. If a checksum
is provided in the project config, the file is only downloaded if no local
file with the same checksum exists.
"""
project_assets(project_dir)
@project_cli.command(
"run-all",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def project_run_all_cli(
# fmt: off
ctx: typer.Context,
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
show_help: bool = Opt(False, "--help", help="Show help message and available subcommands")
# fmt: on
):
"""Run all commands defined in the project. This command will use DVC and
the defined outputs and dependencies in the project config to determine
which steps need to be re-run and where to start. This means you're only
re-generating data if the inputs have changed.
This command calls into "dvc repro" and all additional arguments are passed
to the "dvc repro" command: https://dvc.org/doc/command-reference/repro
"""
if show_help:
print_run_help(project_dir)
else:
project_run_all(project_dir, *ctx.args)
@project_cli.command(
"run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def project_run_cli(
# fmt: off
ctx: typer.Context,
subcommand: str = Arg(None, help="Name of command defined in project config"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
show_help: bool = Opt(False, "--help", help="Show help message and available subcommands")
# fmt: on
):
"""Run a named script defined in the project config. If the command is
part of the default pipeline defined in the "run" section, DVC is used to
determine whether the step should re-run if its inputs have changed, or
whether everything is up to date. If the script is not part of the default
pipeline, it will be called separately without DVC.
If DVC is used, the command calls into "dvc repro" and all additional
arguments are passed to the "dvc repro" command:
https://dvc.org/doc/command-reference/repro
"""
if show_help or not subcommand:
print_run_help(project_dir, subcommand)
else:
project_run(project_dir, subcommand, *ctx.args)
@project_cli.command("exec", hidden=True)
def project_exec_cli(
# fmt: off
subcommand: str = Arg(..., help="Name of command defined in project config"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
# fmt: on
):
"""Execute a command defined in the project config. This CLI command is
only called internally in auto-generated DVC pipelines, as a shortcut for
multi-step commands in the project config. You typically shouldn't have to
call it yourself. To run a command, call "run" or "run-all".
"""
project_exec(project_dir, subcommand)
@project_cli.command("update-dvc")
def project_update_dvc_cli(
# fmt: off
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
verbose: bool = Opt(False, "--verbose", "-V", help="Print more info"),
force: bool = Opt(False, "--force", "-F", help="Force update DVC config"),
# fmt: on
):
"""Update the auto-generated DVC config file. Uses the steps defined in the
"run" section of the project config. This typically happens automatically
when running a command, but can also be triggered manually if needed.
"""
config = load_project_config(project_dir)
updated = update_dvc_config(project_dir, config, verbose=verbose, force=force)
if updated:
msg.good(f"Updated DVC config from {CONFIG_FILE}")
else:
msg.info(f"No changes found in {CONFIG_FILE}, no update needed")
app.add_typer(project_cli, name="project")
#################
# CLI FUNCTIONS #
#################
def project_clone(
name: str,
dest: Path,
*,
repo: str = about.__projects__,
git: bool = False,
no_init: bool = False,
) -> None:
"""Clone a project template from a repository.
name (str): Name of subdirectory to clone.
dest (Path): Destination path of cloned project.
repo (str): URL of Git repo containing project templates.
git (bool): Initialize project as Git repo. Should be set to True if project
is intended as a repo, since it will allow DVC to integrate with Git.
no_init (bool): Don't initialize DVC and Git automatically. If True, the
"init" command or "git init" and "dvc init" need to be run manually.
"""
dest = ensure_path(dest)
check_clone(name, dest, repo)
project_dir = dest.resolve()
# We're using Git and sparse checkout to only clone the files we need
with make_tempdir() as tmp_dir:
cmd = f"git clone {repo} {tmp_dir} --no-checkout --depth 1 --config core.sparseCheckout=true"
try:
run_command(cmd)
except SystemExit:
err = f"Could not clone the repo '{repo}' into the temp dir '{tmp_dir}'"
msg.fail(err)
with (tmp_dir / ".git" / "info" / "sparse-checkout").open("w") as f:
f.write(name)
run_command(["git", "-C", str(tmp_dir), "fetch"])
run_command(["git", "-C", str(tmp_dir), "checkout"])
shutil.move(str(tmp_dir / Path(name).name), str(project_dir))
msg.good(f"Cloned project '{name}' from {repo} into {project_dir}")
for sub_dir in DIRS:
dir_path = project_dir / sub_dir
if not dir_path.exists():
dir_path.mkdir(parents=True)
if not no_init:
project_init(project_dir, git=git, force=True, silent=True)
msg.good(f"Your project is now ready!", dest)
print(f"To fetch the assets, run:\n{COMMAND} project assets {dest}")
def project_init(
project_dir: Path,
*,
git: bool = False,
force: bool = False,
silent: bool = False,
analytics: bool = False,
):
"""Initialize a project as a DVC and (optionally) as a Git repo.
project_dir (Path): Path to project directory.
git (bool): Also call "git init" to initialize directory as a Git repo.
silent (bool): Don't print any output (via DVC).
analytics (bool): Opt-in to DVC analytics (defaults to False).
"""
with working_dir(project_dir) as cwd:
if git:
run_command(["git", "init"])
init_cmd = ["dvc", "init"]
if silent:
init_cmd.append("--quiet")
if not git:
init_cmd.append("--no-scm")
if force:
init_cmd.append("--force")
run_command(init_cmd)
# We don't want to have analytics on by default; our users should
# opt-in explicitly. If they want it, they can always enable it.
if not analytics:
run_command(["dvc", "config", "core.analytics", "false"])
# Remove unused and confusing plot templates from .dvc directory
# TODO: maybe we shouldn't do this, but it's otherwise super confusing
# once you commit your changes via Git and it creates a bunch of files
# that have no purpose
plots_dir = cwd / DVC_DIR / "plots"
if plots_dir.exists():
shutil.rmtree(str(plots_dir))
config = load_project_config(cwd)
setup_check_dvc(cwd, config)
def project_assets(project_dir: Path) -> None:
"""Fetch assets for a project using DVC if possible.
project_dir (Path): Path to project directory.
"""
project_path = ensure_path(project_dir)
config = load_project_config(project_path)
setup_check_dvc(project_path, config)
assets = config.get("assets", {})
if not assets:
msg.warn(f"No assets specified in {CONFIG_FILE}", exits=0)
msg.info(f"Fetching {len(assets)} asset(s)")
variables = config.get("variables", {})
fetched_assets = []
for asset in assets:
url = asset["url"].format(**variables)
dest = asset["dest"].format(**variables)
fetched_path = fetch_asset(project_path, url, dest, asset.get("checksum"))
if fetched_path:
fetched_assets.append(str(fetched_path))
if fetched_assets:
with working_dir(project_path):
run_command(["dvc", "add", *fetched_assets, "--external"])
def fetch_asset(
project_path: Path, url: str, dest: Path, checksum: Optional[str] = None
) -> Optional[Path]:
"""Fetch an asset from a given URL or path. Will try to import the file
using DVC's import-url if possible (fully tracked and versioned) and falls
back to get-url (versioned) and a non-DVC download if necessary. If a
checksum is provided and a local file exists, it's only re-downloaded if the
checksum doesn't match.
project_path (Path): Path to project directory.
url (str): URL or path to asset.
checksum (Optional[str]): Optional expected checksum of local file.
RETURNS (Optional[Path]): The path to the fetched asset or None if fetching
the asset failed.
"""
url = convert_asset_url(url)
dest_path = (project_path / dest).resolve()
if dest_path.exists() and checksum:
# If there's already a file, check for checksum
# TODO: add support for caches (dvc import-url with local path)
if checksum == get_checksum(dest_path):
msg.good(f"Skipping download with matching checksum: {dest}")
return dest_path
with working_dir(project_path):
try:
# If these fail, we don't want to output an error or info message.
# Try importing with source tracking first (dvc import-url), then a
# plain DVC download (dvc get-url), then a regular non-DVC download.
try:
dvc_cmd = ["dvc", "import-url", url, str(dest_path)]
print(subprocess.check_output(dvc_cmd, stderr=subprocess.DEVNULL))
except subprocess.CalledProcessError:
dvc_cmd = ["dvc", "get-url", url, str(dest_path)]
print(subprocess.check_output(dvc_cmd, stderr=subprocess.DEVNULL))
except subprocess.CalledProcessError:
try:
download_file(url, dest_path)
except requests.exceptions.HTTPError as e:
msg.fail(f"Download failed: {dest}", e)
return None
if checksum and checksum != get_checksum(dest_path):
msg.warn(f"Checksum doesn't match value defined in {CONFIG_FILE}: {dest}")
msg.good(f"Fetched asset {dest}")
return dest_path
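# Sketch of the checksum short-circuit above, with a hypothetical MD5-based stand-in
# for the get_checksum() helper defined elsewhere in this module.
import hashlib
from pathlib import Path
from typing import Optional

def md5_checksum(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def needs_download(dest_path: Path, checksum: Optional[str]) -> bool:
    # Skip the download only if the file already exists AND its checksum matches
    # the value declared in the project config.
    if dest_path.exists() and checksum:
        return checksum != md5_checksum(dest_path)
    return True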
def project_run_all(project_dir: Path, *dvc_args) -> None:
"""Run all commands defined in the project using DVC.
project_dir (Path): Path to project directory.
*dvc_args: Other arguments passed to "dvc repro".
"""
config = load_project_config(project_dir)
setup_check_dvc(project_dir, config)
dvc_cmd = ["dvc", "repro", *dvc_args]
with working_dir(project_dir):
run_command(dvc_cmd)
def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None:
"""Simulate a CLI help prompt using the info available in the project config.
project_dir (Path): The project directory.
subcommand (Optional[str]): The subcommand or None. If a subcommand is
provided, the subcommand help is shown. Otherwise, the top-level help
and a list of available commands is printed.
"""
config = load_project_config(project_dir)
setup_check_dvc(project_dir, config)
config_commands = config.get("commands", [])
commands = {cmd["name"]: cmd for cmd in config_commands}
if subcommand:
validate_subcommand(commands.keys(), subcommand)
print(f"Usage: {COMMAND} project run {subcommand} {project_dir}")
help_text = commands[subcommand].get("help")
if help_text:
msg.text(f"\n{help_text}\n")
else:
print(f"\nAvailable commands in {CONFIG_FILE}")
print(f"Usage: {COMMAND} project run [COMMAND] {project_dir}")
msg.table([(cmd["name"], cmd.get("help", "")) for cmd in config_commands])
msg.text("Run all commands defined in the 'run' block of the project config:")
print(f"{COMMAND} project run-all {project_dir}")
def project_run(project_dir: Path, subcommand: str, *dvc_args) -> None:
"""Run a named script defined in the project config. If the script is part
of the default pipeline (defined in the "run" section), DVC is used to
execute the command, so it can determine whether to rerun it. It then
calls into "exec" to execute it.
project_dir (Path): Path to project directory.
subcommand (str): Name of command to run.
*dvc_args: Other arguments passed to "dvc repro".
"""
config = load_project_config(project_dir)
setup_check_dvc(project_dir, config)
config_commands = config.get("commands", [])
variables = config.get("variables", {})
commands = {cmd["name"]: cmd for cmd in config_commands}
validate_subcommand(commands.keys(), subcommand)
if subcommand in config.get("run", []):
# This is one of the pipeline commands tracked in DVC
dvc_cmd = ["dvc", "repro", subcommand, *dvc_args]
with working_dir(project_dir):
run_command(dvc_cmd)
else:
cmd = commands[subcommand]
# Deps in non-DVC commands aren't tracked, but if they're defined,
# make sure they exist before running the command
for dep in cmd.get("deps", []):
if not (project_dir / dep).exists():
err = f"Missing dependency specified by command '{subcommand}': {dep}"
msg.fail(err, exits=1)
with working_dir(project_dir):
run_commands(cmd["script"], variables)
def project_exec(project_dir: Path, subcommand: str):
"""Execute a command defined in the project config.
project_dir (Path): Path to project directory.
subcommand (str): Name of command to run.
"""
config = load_project_config(project_dir)
config_commands = config.get("commands", [])
variables = config.get("variables", {})
commands = {cmd["name"]: cmd for cmd in config_commands}
with working_dir(project_dir):
run_commands(commands[subcommand]["script"], variables)
###########
# HELPERS #
###########
def load_project_config(path: Path) -> Dict[str, Any]:
"""Load the project config file from a directory and validate it.
path (Path): The path to the project directory.
RETURNS (Dict[str, Any]): The loaded project config.
"""
config_path = path / CONFIG_FILE
if not config_path.exists():
msg.fail("Can't find project config", config_path, exits=1)
invalid_err = f"Invalid project config in {CONFIG_FILE}"
try:
config = srsly.read_yaml(config_path)
except ValueError as e:
msg.fail(invalid_err, e, exits=1)
errors = validate(ProjectConfigSchema, config)
if errors:
msg.fail(invalid_err, "\n".join(errors), exits=1)
return config
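# Illustrative shape of the dict returned by load_project_config() for a minimal
# project config, limited to the keys this module reads (variables, assets,
# commands, run). All names, paths and values are made up for the example.
example_config = {
    "variables": {"lang": "en", "name": "my_model"},
    "assets": [
        {
            "url": "https://example.com/data.jsonl",
            "dest": "assets/data.jsonl",
            "checksum": "abc123",  # optional
        }
    ],
    "commands": [
        {
            "name": "preprocess",
            "help": "Convert the raw data",
            "script": ["python scripts/preprocess.py assets/data.jsonl corpus/train.json"],
            "deps": ["assets/data.jsonl"],
            "outputs": ["corpus/train.json"],
        }
    ],
    "run": ["preprocess"],  # commands tracked via DVC and run in sequence
}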
def update_dvc_config(
path: Path,
config: Dict[str, Any],
verbose: bool = False,
silent: bool = False,
force: bool = False,
) -> bool:
"""Re-run the DVC commands in dry mode and update dvc.yaml file in the
project directory. The file is auto-generated based on the config. The
first line of the auto-generated file specifies the hash of the config
dict, so if any of the config values change, the DVC config is regenerated.
path (Path): The path to the project directory.
config (Dict[str, Any]): The loaded project config.
verbose (bool): Whether to print additional info (via DVC).
silent (bool): Don't output anything (via DVC).
force (bool): Force update, even if hashes match.
RETURNS (bool): Whether the DVC config file was updated.
"""
config_hash = get_hash(config)
path = path.resolve()
dvc_config_path = path / DVC_CONFIG
if dvc_config_path.exists():
# Check if the file was generated using the current config; if not, regenerate it
with dvc_config_path.open("r", encoding="utf8") as f:
ref_hash = f.readline().strip().replace("# ", "")
if ref_hash == config_hash and not force:
return False # Nothing has changed in project config, don't need to update
dvc_config_path.unlink()
variables = config.get("variables", {})
commands = []
# We only want to include commands that are part of the main list of "run"
# commands in project.yml and should be run in sequence
config_commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
for name in config.get("run", []):
validate_subcommand(config_commands.keys(), name)
command = config_commands[name]
deps = command.get("deps", [])
outputs = command.get("outputs", [])
outputs_no_cache = command.get("outputs_no_cache", [])
if not deps and not outputs and not outputs_no_cache:
continue
# Default to "." as the project path since dvc.yaml is auto-generated
# and we don't want arbitrary paths in there
project_cmd = ["python", "-m", NAME, "project", ".", "exec", name]
deps_cmd = [c for cl in [["-d", p] for p in deps] for c in cl]
outputs_cmd = [c for cl in [["-o", p] for p in outputs] for c in cl]
outputs_nc_cmd = [c for cl in [["-O", p] for p in outputs_no_cache] for c in cl]
dvc_cmd = ["dvc", "run", "-n", name, "-w", str(path), "--no-exec"]
if verbose:
dvc_cmd.append("--verbose")
if silent:
dvc_cmd.append("--quiet")
full_cmd = [*dvc_cmd, *deps_cmd, *outputs_cmd, *outputs_nc_cmd, *project_cmd]
commands.append(" ".join(full_cmd))
with working_dir(path):
run_commands(commands, variables, silent=True)
with dvc_config_path.open("r+", encoding="utf8") as f:
content = f.read()
f.seek(0, 0)
f.write(f"# {config_hash}\n{DVC_CONFIG_COMMENT}\n{content}")
return True
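# Worked example of the flattening comprehensions above: each dependency or output
# becomes a ("-d"/"-o"/"-O", path) pair, flattened into one argument list for "dvc run".
deps = ["assets/data.jsonl", "corpus/train.json"]
deps_cmd = [c for cl in [["-d", p] for p in deps] for c in cl]
assert deps_cmd == ["-d", "assets/data.jsonl", "-d", "corpus/train.json"]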
def ensure_dvc() -> None:
"""Ensure that the "dvc" command is available and show an error if not."""
try:
subprocess.run(["dvc", "--version"], stdout=subprocess.DEVNULL)
except Exception:
msg.fail(
"spaCy projects require DVC (Data Version Control) and the 'dvc' command",
"You can install the Python package from pip (pip install dvc) or "
"conda (conda install -c conda-forge dvc). For more details, see the "
"documentation: https://dvc.org/doc/install",
exits=1,
)
def setup_check_dvc(project_dir: Path, config: Dict[str, Any]) -> None:
"""Check that the project is set up correctly with DVC and update its
config if needed. Will raise an error if the project is not an initialized
DVC project.
project_dir (Path): The path to the project directory.
config (Dict[str, Any]): The loaded project config.
"""
if not project_dir.exists():
msg.fail(f"Can't find project directory: {project_dir}")
if not (project_dir / ".dvc").exists():
msg.fail(
"Project not initialized as a DVC project.",
f"Make sure that the project template was cloned correctly. To "
f"initialize the project directory manually, you can run: "
f"{COMMAND} project init {project_dir}",
exits=1,
)
with msg.loading("Updating DVC config..."):
updated = update_dvc_config(project_dir, config, silent=True)
if updated:
msg.good(f"Updated DVC config from changed {CONFIG_FILE}")
def run_commands(
commands: List[str] = tuple(), variables: Dict[str, str] = {}, silent: bool = False
) -> None:
"""Run a sequence of commands in a subprocess, in order.
commands (List[str]): The string commands.
variables (Dict[str, str]): Dictionary of variable names, mapped to their
values. Will be used to substitute format string variables in the
commands.
silent (bool): Don't print the commands.
"""
for command in commands:
# Substitute variables, e.g. "./{NAME}.json"
command = command.format(**variables)
command = split_command(command)
# Not sure if this is needed or a good idea. Motivation: users may often
# reference "python" or "pip" in their config commands, and we want those to
# resolve to the Python interpreter spaCy is running under (and the pip in
# the same env), not some other Python/pip. This also keeps commands portable
# if user 1 writes "python3" (because that's how their system is set up) and
# user 2, who doesn't have that alias, re-runs the command.
if len(command) and command[0] in ("python", "python3"):
command[0] = sys.executable
elif len(command) and command[0] in ("pip", "pip3"):
command = [sys.executable, "-m", "pip", *command[1:]]
if not silent:
print(f"Running command: {' '.join(command)}")
run_command(command)
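# Sketch of the per-command rewriting above, assuming split_command() behaves like
# shlex.split(); "python"/"pip" are remapped so commands run in spaCy's own environment.
# The command string and variable are made up for the example.
import shlex
import sys

command = "python scripts/train.py ./{name}.json".format(name="my_model")
parts = shlex.split(command)
if parts and parts[0] in ("python", "python3"):
    parts[0] = sys.executable
elif parts and parts[0] in ("pip", "pip3"):
    parts = [sys.executable, "-m", "pip", *parts[1:]]
print(parts)  # e.g. ['/usr/bin/python3', 'scripts/train.py', './my_model.json']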
def convert_asset_url(url: str) -> str:
"""Check and convert the asset URL if needed.
url (str): The asset URL.
RETURNS (str): The converted URL.
"""
# If the asset URL is a regular GitHub URL, it's likely a mistake
if re.match(r"(http(s?))://github.com", url):
converted = url.replace("github.com", "raw.githubusercontent.com")
converted = re.sub(r"/(tree|blob)/", "/", converted)
msg.warn(
"Downloading from a regular GitHub URL. This will only download "
"the source of the page, not the actual file. Converting the URL "
"to a raw URL.",
converted,
)
return converted
return url
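# Worked example of the conversion above: a regular GitHub "blob" URL is rewritten
# to its raw.githubusercontent.com equivalent so the actual file is downloaded.
import re

url = "https://github.com/user/repo/blob/master/data/file.json"
converted = url.replace("github.com", "raw.githubusercontent.com")
converted = re.sub(r"/(tree|blob)/", "/", converted)
assert converted == "https://raw.githubusercontent.com/user/repo/master/data/file.json"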
def check_clone(name: str, dest: Path, repo: str) -> None:
"""Check and validate that the destination path can be used to clone. Will
check that Git is available and that the destination path is suitable.
name (str): Name of the directory to clone from the repo.
dest (Path): Local destination of cloned directory.
repo (str): URL of the repo to clone from.
"""
try:
subprocess.run(["git", "--version"], stdout=subprocess.DEVNULL)
except Exception:
msg.fail(
f"Cloning spaCy project templates requires Git and the 'git' command. ",
f"To clone a project without Git, copy the files from the '{name}' "
f"directory in the {repo} to {dest} manually and then run:",
f"{COMMAND} project init {dest}",
exits=1,
)
if not dest:
msg.fail(f"Not a valid directory to clone project: {dest}", exits=1)
if dest.exists():
# Directory already exists (not allowed, clone needs to create it)
msg.fail(f"Can't clone project, directory already exists: {dest}", exits=1)
if not dest.parent.exists():
# We're not creating parents, parent dir should exist
msg.fail(
f"Can't clone project, parent directory doesn't exist: {dest.parent}",
exits=1,
)
def validate_subcommand(commands: Sequence[str], subcommand: str) -> None:
"""Check that a subcommand is valid and defined. Raises an error otherwise.
commands (Sequence[str]): The available commands.
subcommand (str): The subcommand.
"""
if subcommand not in commands:
msg.fail(
f"Can't find command '{subcommand}' in {CONFIG_FILE}. "
f"Available commands: {', '.join(commands)}",
exits=1,
)
def download_file(url: str, dest: Path, chunk_size: int = 1024) -> None:
"""Download a file using requests.
url (str): The URL of the file.
dest (Path): The destination path.
chunk_size (int): The size of chunks to read/write.
"""
response = requests.get(url, stream=True)
response.raise_for_status()
total = int(response.headers.get("content-length", 0))
progress_settings = {
"total": total,
"unit": "iB",
"unit_scale": True,
"unit_divisor": chunk_size,
"leave": False,
}
with dest.open("wb") as f, tqdm.tqdm(**progress_settings) as bar:
for data in response.iter_content(chunk_size=chunk_size):
size = f.write(data)
bar.update(size)
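# Example usage of download_file() above; the URL and destination are placeholders.
# The parent directory must exist, since the file is opened for writing directly.
from pathlib import Path

dest = Path("assets") / "some-large-file.bin"
dest.parent.mkdir(parents=True, exist_ok=True)
download_file("https://example.com/some-large-file.bin", dest)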


@ -2,9 +2,8 @@ from typing import Optional, Dict, List, Union, Sequence
from timeit import default_timer as timer from timeit import default_timer as timer
import srsly import srsly
from pydantic import BaseModel, FilePath
import plac
import tqdm import tqdm
from pydantic import BaseModel, FilePath
from pathlib import Path from pathlib import Path
from wasabi import msg from wasabi import msg
import thinc import thinc
@ -12,11 +11,17 @@ import thinc.schedules
from thinc.api import Model, use_pytorch_for_gpu_memory from thinc.api import Model, use_pytorch_for_gpu_memory
import random import random
from ..gold import GoldCorpus from ._app import app, Arg, Opt
from ..gold import Corpus
from ..lookups import Lookups from ..lookups import Lookups
from .. import util from .. import util
from ..errors import Errors from ..errors import Errors
from ..ml import models # don't remove - required to load the built-in architectures
# Don't remove - required to load the built-in architectures
from ..ml import models # noqa: F401
# from ..schemas import ConfigSchema # TODO: include?
registry = util.registry registry = util.registry
@ -114,41 +119,24 @@ class ConfigSchema(BaseModel):
extra = "allow" extra = "allow"
@plac.annotations( @app.command("train")
# fmt: off
train_path=("Location of JSON-formatted training data", "positional", None, Path),
dev_path=("Location of JSON-formatted development data", "positional", None, Path),
config_path=("Path to config file", "positional", None, Path),
output_path=("Output directory to store model in", "option", "o", Path),
init_tok2vec=(
"Path to pretrained weights for the tok2vec components. See 'spacy pretrain'. Experimental.", "option", "t2v",
Path),
raw_text=("Path to jsonl file with unlabelled text documents.", "option", "rt", Path),
verbose=("Display more information for debugging purposes", "flag", "VV", bool),
use_gpu=("Use GPU", "option", "g", int),
num_workers=("Parallel Workers", "option", "j", int),
strategy=("Distributed training strategy (requires spacy_ray)", "option", "strategy", str),
ray_address=(
"Address of the Ray cluster. Multi-node training (requires spacy_ray)",
"option", "address", str),
tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
# fmt: on
)
def train_cli( def train_cli(
train_path, # fmt: off
dev_path, train_path: Path = Arg(..., help="Location of JSON-formatted training data", exists=True),
config_path, dev_path: Path = Arg(..., help="Location of JSON-formatted development data", exists=True),
output_path=None, config_path: Path = Arg(..., help="Path to config file", exists=True),
init_tok2vec=None, output_path: Optional[Path] = Opt(None, "--output-path", "-o", help="Output directory to store model in"),
raw_text=None, code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
verbose=False, init_tok2vec: Optional[Path] = Opt(None, "--init-tok2vec", "-t2v", help="Path to pretrained weights for the tok2vec components. See 'spacy pretrain'. Experimental."),
use_gpu=-1, raw_text: Optional[Path] = Opt(None, "--raw-text", "-rt", help="Path to jsonl file with unlabelled text documents."),
num_workers=1, verbose: bool = Opt(False, "--verbose", "-VV", help="Display more information for debugging purposes"),
strategy="allreduce", use_gpu: int = Opt(-1, "--use-gpu", "-g", help="Use GPU"),
ray_address=None, num_workers: int = Opt(None, "-j", help="Parallel Workers"),
tag_map_path=None, strategy: str = Opt(None, "--strategy", help="Distributed training strategy (requires spacy_ray)"),
omit_extra_lookups=False, ray_address: str = Opt(None, "--address", help="Address of the Ray cluster. Multi-node training (requires spacy_ray)"),
tag_map_path: Optional[Path] = Opt(None, "--tag-map-path", "-tm", help="Location of JSON-formatted tag map"),
omit_extra_lookups: bool = Opt(False, "--omit-extra-lookups", "-OEL", help="Don't include extra lookups in model"),
# fmt: on
): ):
""" """
Train or update a spaCy model. Requires data to be formatted in spaCy's Train or update a spaCy model. Requires data to be formatted in spaCy's
@ -156,26 +144,8 @@ def train_cli(
command. command.
""" """
util.set_env_log(verbose) util.set_env_log(verbose)
verify_cli_args(**locals())
# Make sure all files and paths exists if they are needed
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
if not train_path or not train_path.exists():
msg.fail("Training data not found", train_path, exits=1)
if not dev_path or not dev_path.exists():
msg.fail("Development data not found", dev_path, exits=1)
if output_path is not None:
if not output_path.exists():
output_path.mkdir()
msg.good(f"Created output directory: {output_path}")
elif output_path.exists() and [p for p in output_path.iterdir() if p.is_dir()]:
msg.warn(
"Output directory is not empty.",
"This can lead to unintended side effects when saving the model. "
"Please use an empty directory or a different path instead. If "
"the specified output path doesn't exist, the directory will be "
"created for you.",
)
if raw_text is not None: if raw_text is not None:
raw_text = list(srsly.read_jsonl(raw_text)) raw_text = list(srsly.read_jsonl(raw_text))
tag_map = {} tag_map = {}
@ -184,8 +154,6 @@ def train_cli(
weights_data = None weights_data = None
if init_tok2vec is not None: if init_tok2vec is not None:
if not init_tok2vec.exists():
msg.fail("Can't find pretrained tok2vec", init_tok2vec, exits=1)
with init_tok2vec.open("rb") as file_: with init_tok2vec.open("rb") as file_:
weights_data = file_.read() weights_data = file_.read()
@ -214,17 +182,17 @@ def train_cli(
train(**train_args) train(**train_args)
def train( def train(
config_path, config_path: Path,
data_paths, data_paths: Dict[str, Path],
raw_text=None, raw_text: Optional[Path] = None,
output_path=None, output_path: Optional[Path] = None,
tag_map=None, tag_map: Optional[Path] = None,
weights_data=None, weights_data: Optional[bytes] = None,
omit_extra_lookups=False, omit_extra_lookups: bool = False,
disable_tqdm=False, disable_tqdm: bool = False,
remote_optimizer=None, remote_optimizer: Optimizer = None,
randomization_index=0 randomization_index: int = 0
): ) -> None:
msg.info(f"Loading config from: {config_path}") msg.info(f"Loading config from: {config_path}")
# Read the config first without creating objects, to get to the original nlp_config # Read the config first without creating objects, to get to the original nlp_config
config = util.load_config(config_path, create_objects=False) config = util.load_config(config_path, create_objects=False)
@ -243,69 +211,20 @@ def train(
if remote_optimizer: if remote_optimizer:
optimizer = remote_optimizer optimizer = remote_optimizer
limit = training["limit"] limit = training["limit"]
msg.info("Loading training corpus") corpus = Corpus(data_paths["train"], data_paths["dev"], limit=limit)
corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit)
# verify textcat config
if "textcat" in nlp_config["pipeline"]: if "textcat" in nlp_config["pipeline"]:
textcat_labels = set(nlp.get_pipe("textcat").labels) verify_textcat_config(nlp, nlp_config)
textcat_multilabel = not nlp_config["pipeline"]["textcat"]["model"]["exclusive_classes"]
# check whether the setting 'exclusive_classes' corresponds to the provided training data
if textcat_multilabel:
multilabel_found = False
for ex in corpus.train_examples:
cats = ex.doc_annotation.cats
textcat_labels.update(cats.keys())
if list(cats.values()).count(1.0) != 1:
multilabel_found = True
if not multilabel_found:
msg.warn(
"The textcat training instances look like they have "
"mutually exclusive classes. Set 'exclusive_classes' "
"to 'true' in the config to train a classifier with "
"mutually exclusive classes more accurately."
)
else:
for ex in corpus.train_examples:
cats = ex.doc_annotation.cats
textcat_labels.update(cats.keys())
if list(cats.values()).count(1.0) != 1:
msg.fail(
"Some textcat training instances do not have exactly "
"one positive label. Set 'exclusive_classes' "
"to 'false' in the config to train a classifier with classes "
"that are not mutually exclusive."
)
msg.info(f"Initialized textcat component for {len(textcat_labels)} unique labels")
nlp.get_pipe("textcat").labels = tuple(textcat_labels)
# if 'positive_label' is provided: double check whether it's in the data and the task is binary
if nlp_config["pipeline"]["textcat"].get("positive_label", None):
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
pos_label = nlp_config["pipeline"]["textcat"]["positive_label"]
if pos_label not in textcat_labels:
msg.fail(
f"The textcat's 'positive_label' config setting '{pos_label}' "
f"does not match any label in the training data.",
exits=1,
)
if len(textcat_labels) != 2:
msg.fail(
f"A textcat 'positive_label' '{pos_label}' was "
f"provided for training data that does not appear to be a "
f"binary classification problem with two labels.",
exits=1,
)
if training.get("resume", False): if training.get("resume", False):
msg.info("Resuming training") msg.info("Resuming training")
nlp.resume_training() nlp.resume_training()
else: else:
msg.info(f"Initializing the nlp pipeline: {nlp.pipe_names}") msg.info(f"Initializing the nlp pipeline: {nlp.pipe_names}")
nlp.begin_training( train_examples = list(corpus.train_dataset(
lambda: corpus.train_examples nlp,
) shuffle=False,
gold_preproc=training["gold_preproc"]
))
nlp.begin_training(lambda: train_examples)
# Update tag map with provided mapping # Update tag map with provided mapping
nlp.vocab.morphology.tag_map.update(tag_map) nlp.vocab.morphology.tag_map.update(tag_map)
@ -332,11 +251,11 @@ def train(
tok2vec = tok2vec.get(subpath) tok2vec = tok2vec.get(subpath)
if not tok2vec: if not tok2vec:
msg.fail( msg.fail(
f"Could not locate the tok2vec model at {tok2vec_path}.", f"Could not locate the tok2vec model at {tok2vec_path}.", exits=1,
exits=1,
) )
tok2vec.from_bytes(weights_data) tok2vec.from_bytes(weights_data)
msg.info("Loading training corpus")
train_batches = create_train_batches(nlp, corpus, training, randomization_index) train_batches = create_train_batches(nlp, corpus, training, randomization_index)
evaluate = create_evaluation_callback(nlp, optimizer, corpus, training) evaluate = create_evaluation_callback(nlp, optimizer, corpus, training)
@ -369,18 +288,15 @@ def train(
update_meta(training, nlp, info) update_meta(training, nlp, info)
nlp.to_disk(output_path / "model-best") nlp.to_disk(output_path / "model-best")
progress = tqdm.tqdm(**tqdm_args) progress = tqdm.tqdm(**tqdm_args)
# Clean up the objects to faciliate garbage collection.
for eg in batch:
eg.doc = None
eg.goldparse = None
eg.doc_annotation = None
eg.token_annotation = None
except Exception as e: except Exception as e:
if output_path is not None:
msg.warn( msg.warn(
f"Aborting and saving the final best model. " f"Aborting and saving the final best model. "
f"Encountered exception: {str(e)}", f"Encountered exception: {str(e)}",
exits=1, exits=1,
) )
else:
raise e
finally: finally:
if output_path is not None: if output_path is not None:
final_model_path = output_path / "model-final" final_model_path = output_path / "model-final"
@ -393,23 +309,22 @@ def train(
def create_train_batches(nlp, corpus, cfg, randomization_index): def create_train_batches(nlp, corpus, cfg, randomization_index):
epochs_todo = cfg.get("max_epochs", 0) max_epochs = cfg.get("max_epochs", 0)
while True: train_examples = list(corpus.train_dataset(
train_examples = list(
corpus.train_dataset(
nlp, nlp,
noise_level=0.0, # I think this is deprecated? shuffle=True,
orth_variant_level=cfg["orth_variant_level"],
gold_preproc=cfg["gold_preproc"], gold_preproc=cfg["gold_preproc"],
max_length=cfg["max_length"], max_length=cfg["max_length"]
ignore_misaligned=True, ))
)
) epoch = 0
while True:
if len(train_examples) == 0: if len(train_examples) == 0:
raise ValueError(Errors.E988) raise ValueError(Errors.E988)
for _ in range(randomization_index): for _ in range(randomization_index):
random.random() random.random()
random.shuffle(train_examples) random.shuffle(train_examples)
epoch += 1
batches = util.minibatch_by_words( batches = util.minibatch_by_words(
train_examples, train_examples,
size=cfg["batch_size"], size=cfg["batch_size"],
@ -418,15 +333,12 @@ def create_train_batches(nlp, corpus, cfg, randomization_index):
# make sure the minibatch_by_words result is not empty, or we'll have an infinite training loop # make sure the minibatch_by_words result is not empty, or we'll have an infinite training loop
try: try:
first = next(batches) first = next(batches)
yield first yield epoch, first
except StopIteration: except StopIteration:
raise ValueError(Errors.E986) raise ValueError(Errors.E986)
for batch in batches: for batch in batches:
yield batch yield epoch, batch
epochs_todo -= 1 if max_epochs >= 1 and epoch >= max_epochs:
# We intentionally compare exactly to 0 here, so that max_epochs < 1
# will not break.
if epochs_todo == 0:
break break
@ -437,7 +349,8 @@ def create_evaluation_callback(nlp, optimizer, corpus, cfg):
nlp, gold_preproc=cfg["gold_preproc"], ignore_misaligned=True nlp, gold_preproc=cfg["gold_preproc"], ignore_misaligned=True
) )
) )
n_words = sum(len(ex.doc) for ex in dev_examples)
n_words = sum(len(ex.predicted) for ex in dev_examples)
start_time = timer() start_time = timer()
if optimizer.averages: if optimizer.averages:
@ -453,7 +366,11 @@ def create_evaluation_callback(nlp, optimizer, corpus, cfg):
try: try:
weighted_score = sum(scores[s] * weights.get(s, 0.0) for s in weights) weighted_score = sum(scores[s] * weights.get(s, 0.0) for s in weights)
except KeyError as e: except KeyError as e:
raise KeyError(Errors.E983.format(dict_name='score_weights', key=str(e), keys=list(scores.keys()))) raise KeyError(
Errors.E983.format(
dict="score_weights", key=str(e), keys=list(scores.keys())
)
)
scores["speed"] = wps scores["speed"] = wps
return weighted_score, scores return weighted_score, scores
@ -494,7 +411,7 @@ def train_while_improving(
Every iteration, the function yields out a tuple with: Every iteration, the function yields out a tuple with:
* batch: A zipped sequence of Tuple[Doc, GoldParse] pairs. * batch: A list of Example objects.
* info: A dict with various information about the last update (see below). * info: A dict with various information about the last update (see below).
* is_best_checkpoint: A value in None, False, True, indicating whether this * is_best_checkpoint: A value in None, False, True, indicating whether this
was the best evaluation so far. You should use this to save the model was the best evaluation so far. You should use this to save the model
@ -526,7 +443,7 @@ def train_while_improving(
(nlp.make_doc(rt["text"]) for rt in raw_text), size=8 (nlp.make_doc(rt["text"]) for rt in raw_text), size=8
) )
for step, batch in enumerate(train_data): for step, (epoch, batch) in enumerate(train_data):
dropout = next(dropouts) dropout = next(dropouts)
with nlp.select_pipes(enable=to_enable): with nlp.select_pipes(enable=to_enable):
for subbatch in subdivide_batch(batch, accumulate_gradient): for subbatch in subdivide_batch(batch, accumulate_gradient):
@ -548,6 +465,7 @@ def train_while_improving(
score, other_scores = (None, None) score, other_scores = (None, None)
is_best_checkpoint = None is_best_checkpoint = None
info = { info = {
"epoch": epoch,
"step": step, "step": step,
"score": score, "score": score,
"other_scores": other_scores, "other_scores": other_scores,
@ -568,7 +486,7 @@ def train_while_improving(
def subdivide_batch(batch, accumulate_gradient): def subdivide_batch(batch, accumulate_gradient):
batch = list(batch) batch = list(batch)
batch.sort(key=lambda eg: len(eg.doc)) batch.sort(key=lambda eg: len(eg.predicted))
sub_len = len(batch) // accumulate_gradient sub_len = len(batch) // accumulate_gradient
start = 0 start = 0
for i in range(accumulate_gradient): for i in range(accumulate_gradient):
@ -586,9 +504,9 @@ def setup_printer(training, nlp):
score_widths = [max(len(col), 6) for col in score_cols] score_widths = [max(len(col), 6) for col in score_cols]
loss_cols = [f"Loss {pipe}" for pipe in nlp.pipe_names] loss_cols = [f"Loss {pipe}" for pipe in nlp.pipe_names]
loss_widths = [max(len(col), 8) for col in loss_cols] loss_widths = [max(len(col), 8) for col in loss_cols]
table_header = ["#"] + loss_cols + score_cols + ["Score"] table_header = ["E", "#"] + loss_cols + score_cols + ["Score"]
table_header = [col.upper() for col in table_header] table_header = [col.upper() for col in table_header]
table_widths = [6] + loss_widths + score_widths + [6] table_widths = [3, 6] + loss_widths + score_widths + [6]
table_aligns = ["r" for _ in table_widths] table_aligns = ["r" for _ in table_widths]
msg.row(table_header, widths=table_widths) msg.row(table_header, widths=table_widths)
@ -602,17 +520,25 @@ def setup_printer(training, nlp):
] ]
except KeyError as e: except KeyError as e:
raise KeyError( raise KeyError(
Errors.E983.format(dict_name='scores (losses)', key=str(e), keys=list(info["losses"].keys()))) Errors.E983.format(
dict="scores (losses)", key=str(e), keys=list(info["losses"].keys())
)
)
try: try:
scores = [ scores = [
"{0:.2f}".format(float(info["other_scores"][col])) "{0:.2f}".format(float(info["other_scores"][col])) for col in score_cols
for col in score_cols
] ]
except KeyError as e: except KeyError as e:
raise KeyError(Errors.E983.format(dict_name='scores (other)', key=str(e), keys=list(info["other_scores"].keys()))) raise KeyError(
Errors.E983.format(
dict="scores (other)",
key=str(e),
keys=list(info["other_scores"].keys()),
)
)
data = ( data = (
[info["step"]] + losses + scores + ["{0:.2f}".format(float(info["score"]))] [info["epoch"], info["step"]] + losses + scores + ["{0:.2f}".format(float(info["score"]))]
) )
msg.row(data, widths=table_widths, aligns=table_aligns) msg.row(data, widths=table_widths, aligns=table_aligns)
@ -626,3 +552,67 @@ def update_meta(training, nlp, info):
nlp.meta["performance"][metric] = info["other_scores"][metric] nlp.meta["performance"][metric] = info["other_scores"][metric]
for pipe_name in nlp.pipe_names: for pipe_name in nlp.pipe_names:
nlp.meta["performance"][f"{pipe_name}_loss"] = info["losses"][pipe_name] nlp.meta["performance"][f"{pipe_name}_loss"] = info["losses"][pipe_name]
def verify_cli_args(
train_path,
dev_path,
config_path,
output_path=None,
code_path=None,
init_tok2vec=None,
raw_text=None,
verbose=False,
use_gpu=-1,
tag_map_path=None,
omit_extra_lookups=False,
):
# Make sure all files and paths exists if they are needed
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
if not train_path or not train_path.exists():
msg.fail("Training data not found", train_path, exits=1)
if not dev_path or not dev_path.exists():
msg.fail("Development data not found", dev_path, exits=1)
if output_path is not None:
if not output_path.exists():
output_path.mkdir()
msg.good(f"Created output directory: {output_path}")
elif output_path.exists() and [p for p in output_path.iterdir() if p.is_dir()]:
msg.warn(
"Output directory is not empty.",
"This can lead to unintended side effects when saving the model. "
"Please use an empty directory or a different path instead. If "
"the specified output path doesn't exist, the directory will be "
"created for you.",
)
if code_path is not None:
if not code_path.exists():
msg.fail("Path to Python code not found", code_path, exits=1)
try:
util.import_file("python_code", code_path)
except Exception as e:
msg.fail(f"Couldn't load Python code: {code_path}", e, exits=1)
if init_tok2vec is not None and not init_tok2vec.exists():
msg.fail("Can't find pretrained tok2vec", init_tok2vec, exits=1)
def verify_textcat_config(nlp, nlp_config):
# if 'positive_label' is provided: double check whether it's in the data and
# the task is binary
if nlp_config["pipeline"]["textcat"].get("positive_label", None):
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
pos_label = nlp_config["pipeline"]["textcat"]["positive_label"]
if pos_label not in textcat_labels:
msg.fail(
f"The textcat's 'positive_label' config setting '{pos_label}' "
f"does not match any label in the training data.",
exits=1,
)
if len(textcat_labels) != 2:
msg.fail(
f"A textcat 'positive_label' '{pos_label}' was "
f"provided for training data that does not appear to be a "
f"binary classification problem with two labels.",
exits=1,
)


@ -1,18 +1,25 @@
from typing import Tuple
from pathlib import Path from pathlib import Path
import sys import sys
import requests import requests
from wasabi import msg from wasabi import msg, Printer
from ._app import app
from .. import about from .. import about
from ..util import get_package_version, get_installed_models, get_base_version from ..util import get_package_version, get_installed_models, get_base_version
from ..util import get_package_path, get_model_meta, is_compatible_version from ..util import get_package_path, get_model_meta, is_compatible_version
def validate(): @app.command("validate")
def validate_cli():
""" """
Validate that the currently installed version of spaCy is compatible Validate that the currently installed version of spaCy is compatible
with the installed models. Should be run after `pip install -U spacy`. with the installed models. Should be run after `pip install -U spacy`.
""" """
validate()
def validate() -> None:
model_pkgs, compat = get_model_pkgs() model_pkgs, compat = get_model_pkgs()
spacy_version = get_base_version(about.__version__) spacy_version = get_base_version(about.__version__)
current_compat = compat.get(spacy_version, {}) current_compat = compat.get(spacy_version, {})
@ -55,7 +62,8 @@ def validate():
sys.exit(1) sys.exit(1)
def get_model_pkgs(): def get_model_pkgs(silent: bool = False) -> Tuple[dict, dict]:
msg = Printer(no_print=silent, pretty=not silent)
with msg.loading("Loading compatibility table..."): with msg.loading("Loading compatibility table..."):
r = requests.get(about.__compatibility__) r = requests.get(about.__compatibility__)
if r.status_code != 200: if r.status_code != 200:
@ -93,7 +101,7 @@ def get_model_pkgs():
return pkgs, compat return pkgs, compat
def reformat_version(version): def reformat_version(version: str) -> str:
"""Hack to reformat old versions ending on '-alpha' to match pip format.""" """Hack to reformat old versions ending on '-alpha' to match pip format."""
if version.endswith("-alpha"): if version.endswith("-alpha"):
return version.replace("-alpha", "a0") return version.replace("-alpha", "a0")


@ -3,7 +3,7 @@ def add_codes(err_cls):
class ErrorsWithCodes(err_cls): class ErrorsWithCodes(err_cls):
def __getattribute__(self, code): def __getattribute__(self, code):
msg = super().__getattribute__(code) msg = super(ErrorsWithCodes, self).__getattribute__(code)
if code.startswith("__"): # python system attributes like __class__ if code.startswith("__"): # python system attributes like __class__
return msg return msg
else: else:
@ -111,8 +111,31 @@ class Warnings(object):
"`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`" "`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`"
" to check the alignment. Misaligned entities ('-') will be " " to check the alignment. Misaligned entities ('-') will be "
"ignored during training.") "ignored during training.")
W031 = ("Model '{model}' ({model_version}) requires spaCy {version} and "
"is incompatible with the current spaCy version ({current}). This "
"may lead to unexpected results or runtime errors. To resolve "
"this, download a newer compatible model or retrain your custom "
"model with the current spaCy version. For more details and "
"available updates, run: python -m spacy validate")
W032 = ("Unable to determine model compatibility for model '{model}' "
"({model_version}) with the current spaCy version ({current}). "
"This may lead to unexpected results or runtime errors. To resolve "
"this, download a newer compatible model or retrain your custom "
"model with the current spaCy version. For more details and "
"available updates, run: python -m spacy validate")
W033 = ("Training a new {model} using a model with no lexeme normalization "
"table. This may degrade the performance of the model to some "
"degree. If this is intentional or the language you're using "
"doesn't have a normalization table, please ignore this warning. "
"If this is surprising, make sure you have the spacy-lookups-data "
"package installed. The languages with lexeme normalization tables "
"are currently: da, de, el, en, id, lb, pt, ru, sr, ta, th.")
# TODO: fix numbering after merging develop into master # TODO: fix numbering after merging develop into master
W091 = ("Could not clean/remove the temp directory at {dir}: {msg}.")
W092 = ("Ignoring annotations for sentence starts, as dependency heads are set.")
W093 = ("Could not find any data to train the {name} on. Is your "
"input data correctly formatted ?")
W094 = ("Model '{model}' ({model_version}) specifies an under-constrained " W094 = ("Model '{model}' ({model_version}) specifies an under-constrained "
"spaCy version requirement: {version}. This can lead to compatibility " "spaCy version requirement: {version}. This can lead to compatibility "
"problems with older versions, or as new spaCy versions are " "problems with older versions, or as new spaCy versions are "
@ -133,7 +156,7 @@ class Warnings(object):
"so a default configuration was used.") "so a default configuration was used.")
W099 = ("Expected 'dict' type for the 'model' argument of pipe '{pipe}', " W099 = ("Expected 'dict' type for the 'model' argument of pipe '{pipe}', "
"but got '{type}' instead, so ignoring it.") "but got '{type}' instead, so ignoring it.")
W100 = ("Skipping unsupported morphological feature(s): {feature}. " W100 = ("Skipping unsupported morphological feature(s): '{feature}'. "
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or " "Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
"string \"Field1=Value1,Value2|Field2=Value3\".") "string \"Field1=Value1,Value2|Field2=Value3\".")
@ -161,18 +184,13 @@ class Errors(object):
"`nlp.select_pipes()`, you should remove them explicitly with " "`nlp.select_pipes()`, you should remove them explicitly with "
"`nlp.remove_pipe()` before the pipeline is restored. Names of " "`nlp.remove_pipe()` before the pipeline is restored. Names of "
"the new components: {names}") "the new components: {names}")
E009 = ("The `update` method expects same number of docs and golds, but "
"got: {n_docs} docs, {n_golds} golds.")
E010 = ("Word vectors set to length 0. This may be because you don't have " E010 = ("Word vectors set to length 0. This may be because you don't have "
"a model installed or loaded, or because your model doesn't " "a model installed or loaded, or because your model doesn't "
"include word vectors. For more info, see the docs:\n" "include word vectors. For more info, see the docs:\n"
"https://spacy.io/usage/models") "https://spacy.io/usage/models")
E011 = ("Unknown operator: '{op}'. Options: {opts}") E011 = ("Unknown operator: '{op}'. Options: {opts}")
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}") E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
E013 = ("Error selecting action in matcher")
E014 = ("Unknown tag ID: {tag}") E014 = ("Unknown tag ID: {tag}")
E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use "
"`force=True` to overwrite.")
E016 = ("MultitaskObjective target should be function or one of: dep, " E016 = ("MultitaskObjective target should be function or one of: dep, "
"tag, ent, dep_tag_offset, ent_tag.") "tag, ent, dep_tag_offset, ent_tag.")
E017 = ("Can only add unicode or bytes. Got type: {value_type}") E017 = ("Can only add unicode or bytes. Got type: {value_type}")
@ -180,21 +198,8 @@ class Errors(object):
"refers to an issue with the `Vocab` or `StringStore`.") "refers to an issue with the `Vocab` or `StringStore`.")
E019 = ("Can't create transition with unknown action ID: {action}. Action " E019 = ("Can't create transition with unknown action ID: {action}. Action "
"IDs are enumerated in spacy/syntax/{src}.pyx.") "IDs are enumerated in spacy/syntax/{src}.pyx.")
E020 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The tree is non-projective (i.e. it has "
"crossing arcs - see spacy/syntax/nonproj.pyx for definitions). "
"The ArcEager transition system only supports projective trees. "
"To learn non-projective representations, transform the data "
"before training and after parsing. Either pass "
"`make_projective=True` to the GoldParse class, or use "
"spacy.syntax.nonproj.preprocess_training_data.")
E021 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The GoldParse was projective. The transition "
"system has {n_actions} actions. State at failure: {state}")
E022 = ("Could not find a transition with the name '{name}' in the NER " E022 = ("Could not find a transition with the name '{name}' in the NER "
"model.") "model.")
E023 = ("Error cleaning up beam: The same state occurred twice at "
"memory address {addr} and position {i}.")
E024 = ("Could not find an optimal move to supervise the parser. Usually, " E024 = ("Could not find an optimal move to supervise the parser. Usually, "
"this means that the model can't be updated in a way that's valid " "this means that the model can't be updated in a way that's valid "
"and satisfies the correct annotations specified in the GoldParse. " "and satisfies the correct annotations specified in the GoldParse. "
@ -238,7 +243,6 @@ class Errors(object):
"offset {start}.") "offset {start}.")
E037 = ("Error calculating span: Can't find a token ending at character " E037 = ("Error calculating span: Can't find a token ending at character "
"offset {end}.") "offset {end}.")
E038 = ("Error finding sentence for span. Infinite loop detected.")
E039 = ("Array bounds exceeded while searching for root word. This likely " E039 = ("Array bounds exceeded while searching for root word. This likely "
"means the parse tree is in an invalid state. Please report this " "means the parse tree is in an invalid state. Please report this "
"issue here: http://github.com/explosion/spaCy/issues") "issue here: http://github.com/explosion/spaCy/issues")
@ -269,8 +273,6 @@ class Errors(object):
E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}") E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}")
E060 = ("Cannot add new key to vectors: the table is full. Current shape: " E060 = ("Cannot add new key to vectors: the table is full. Current shape: "
"({rows}, {cols}).") "({rows}, {cols}).")
E061 = ("Bad file name: {filename}. Example of a valid file name: "
"'vectors.128.f.bin'")
E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 " E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 "
"and 63 are occupied. You can replace one by specifying the " "and 63 are occupied. You can replace one by specifying the "
"`flag_id` explicitly, e.g. " "`flag_id` explicitly, e.g. "
@ -284,39 +286,17 @@ class Errors(object):
"Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}") "Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}")
E065 = ("Only one of the vector table's width and shape can be specified. " E065 = ("Only one of the vector table's width and shape can be specified. "
"Got width {width} and shape {shape}.") "Got width {width} and shape {shape}.")
E066 = ("Error creating model helper for extracting columns. Can only "
"extract columns by positive integer. Got: {value}.")
E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside " E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside "
"an entity) without a preceding 'B' (beginning of an entity). " "an entity) without a preceding 'B' (beginning of an entity). "
"Tag sequence:\n{tags}") "Tag sequence:\n{tags}")
E068 = ("Invalid BILUO tag: '{tag}'.") E068 = ("Invalid BILUO tag: '{tag}'.")
E069 = ("Invalid gold-standard parse tree. Found cycle between word "
"IDs: {cycle} (tokens: {cycle_tokens}) in the document starting "
"with tokens: {doc_tokens}.")
E070 = ("Invalid gold-standard data. Number of documents ({n_docs}) "
"does not align with number of annotations ({n_annots}).")
E071 = ("Error creating lexeme: specified orth ID ({orth}) does not " E071 = ("Error creating lexeme: specified orth ID ({orth}) does not "
"match the one in the vocab ({vocab_orth}).") "match the one in the vocab ({vocab_orth}).")
E072 = ("Error serializing lexeme: expected data length {length}, "
"got {bad_length}.")
E073 = ("Cannot assign vector of length {new_length}. Existing vectors " E073 = ("Cannot assign vector of length {new_length}. Existing vectors "
"are of length {length}. You can use `vocab.reset_vectors` to " "are of length {length}. You can use `vocab.reset_vectors` to "
"clear the existing vectors and resize the table.") "clear the existing vectors and resize the table.")
E074 = ("Error interpreting compiled match pattern: patterns are expected " E074 = ("Error interpreting compiled match pattern: patterns are expected "
"to end with the attribute {attr}. Got: {bad_attr}.") "to end with the attribute {attr}. Got: {bad_attr}.")
E075 = ("Error accepting match: length ({length}) > maximum length "
"({max_len}).")
E076 = ("Error setting tensor on Doc: tensor has {rows} rows, while Doc "
"has {words} words.")
E077 = ("Error computing {value}: number of Docs ({n_docs}) does not "
"equal number of GoldParse objects ({n_golds}) in batch.")
E078 = ("Error computing score: number of words in Doc ({words_doc}) does "
"not equal number of words in GoldParse ({words_gold}).")
E079 = ("Error computing states in beam: number of predicted beams "
"({pbeams}) does not equal number of gold beams ({gbeams}).")
E080 = ("Duplicate state found in beam: {key}.")
E081 = ("Error getting gradient in beam: number of histories ({n_hist}) "
"does not equal number of losses ({losses}).")
E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), " E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), "
"projective heads ({n_proj_heads}) and labels ({n_labels}) do not " "projective heads ({n_proj_heads}) and labels ({n_labels}) do not "
"match.") "match.")
@ -324,8 +304,6 @@ class Errors(object):
"`getter` (plus optional `setter`) is allowed. Got: {nr_defined}") "`getter` (plus optional `setter`) is allowed. Got: {nr_defined}")
E084 = ("Error assigning label ID {label} to span: not in StringStore.") E084 = ("Error assigning label ID {label} to span: not in StringStore.")
E085 = ("Can't create lexeme for string '{string}'.") E085 = ("Can't create lexeme for string '{string}'.")
E086 = ("Error deserializing lexeme '{string}': orth ID {orth_id} does "
"not match hash {hash_id} in StringStore.")
E087 = ("Unknown displaCy style: {style}.") E087 = ("Unknown displaCy style: {style}.")
E088 = ("Text of length {length} exceeds maximum of {max_length}. The " E088 = ("Text of length {length} exceeds maximum of {max_length}. The "
"v2.x parser and NER models require roughly 1GB of temporary " "v2.x parser and NER models require roughly 1GB of temporary "
@ -367,7 +345,6 @@ class Errors(object):
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A " E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
"token can only be part of one entity, so make sure the entities " "token can only be part of one entity, so make sure the entities "
"you're setting don't overlap.") "you're setting don't overlap.")
E104 = ("Can't find JSON schema for '{name}'.")
E105 = ("The Doc.print_tree() method is now deprecated. Please use " E105 = ("The Doc.print_tree() method is now deprecated. Please use "
"Doc.to_json() instead or write your own function.") "Doc.to_json() instead or write your own function.")
E106 = ("Can't find doc._.{attr} attribute specified in the underscore " E106 = ("Can't find doc._.{attr} attribute specified in the underscore "
@ -390,8 +367,6 @@ class Errors(object):
"practically no advantage over pickling the parent Doc directly. " "practically no advantage over pickling the parent Doc directly. "
"So instead of pickling the span, pickle the Doc it belongs to or " "So instead of pickling the span, pickle the Doc it belongs to or "
"use Span.as_doc to convert the span to a standalone Doc object.") "use Span.as_doc to convert the span to a standalone Doc object.")
E113 = ("The newly split token can only have one root (head = 0).")
E114 = ("The newly split token needs to have a root (head = 0).")
E115 = ("All subtokens must have associated heads.") E115 = ("All subtokens must have associated heads.")
E116 = ("Cannot currently add labels to pretrained text classifier. Add " E116 = ("Cannot currently add labels to pretrained text classifier. Add "
"labels before training begins. This functionality was available " "labels before training begins. This functionality was available "
@ -414,12 +389,9 @@ class Errors(object):
"equal to span length ({span_len}).") "equal to span length ({span_len}).")
E122 = ("Cannot find token to be split. Did it get merged?") E122 = ("Cannot find token to be split. Did it get merged?")
E123 = ("Cannot find head of token to be split. Did it get merged?") E123 = ("Cannot find head of token to be split. Did it get merged?")
E124 = ("Cannot read from file: {path}. Supported formats: {formats}")
E125 = ("Unexpected value: {value}") E125 = ("Unexpected value: {value}")
E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. " E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. "
"This is likely a bug in spaCy, so feel free to open an issue.") "This is likely a bug in spaCy, so feel free to open an issue.")
E127 = ("Cannot create phrase pattern representation for length 0. This "
"is likely a bug in spaCy.")
E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword " E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
"arguments to exclude fields from being serialized or deserialized " "arguments to exclude fields from being serialized or deserialized "
"is now deprecated. Please use the `exclude` argument instead. " "is now deprecated. Please use the `exclude` argument instead. "
@ -461,8 +433,6 @@ class Errors(object):
"provided {found}.") "provided {found}.")
E143 = ("Labels for component '{name}' not initialized. Did you forget to " E143 = ("Labels for component '{name}' not initialized. Did you forget to "
"call add_label()?") "call add_label()?")
E144 = ("Could not find parameter `{param}` when building the entity "
"linker model.")
E145 = ("Error reading `{param}` from input file.") E145 = ("Error reading `{param}` from input file.")
E146 = ("Could not access `{path}`.") E146 = ("Could not access `{path}`.")
E147 = ("Unexpected error in the {method} functionality of the " E147 = ("Unexpected error in the {method} functionality of the "
@ -474,8 +444,6 @@ class Errors(object):
"the component matches the model being loaded.") "the component matches the model being loaded.")
E150 = ("The language of the `nlp` object and the `vocab` should be the " E150 = ("The language of the `nlp` object and the `vocab` should be the "
"same, but found '{nlp}' and '{vocab}' respectively.") "same, but found '{nlp}' and '{vocab}' respectively.")
E151 = ("Trying to call nlp.update without required annotation types. "
"Expected top-level keys: {exp}. Got: {unexp}.")
E152 = ("The attribute {attr} is not supported for token patterns. " E152 = ("The attribute {attr} is not supported for token patterns. "
"Please use the option validate=True with Matcher, PhraseMatcher, " "Please use the option validate=True with Matcher, PhraseMatcher, "
"or EntityRuler for more details.") "or EntityRuler for more details.")
@ -512,11 +480,6 @@ class Errors(object):
"that case.") "that case.")
E166 = ("Can only merge DocBins with the same pre-defined attributes.\n" E166 = ("Can only merge DocBins with the same pre-defined attributes.\n"
"Current DocBin: {current}\nOther DocBin: {other}") "Current DocBin: {current}\nOther DocBin: {other}")
E167 = ("Unknown morphological feature: '{feat}' ({feat_id}). This can "
"happen if the tagger was trained with a different set of "
"morphological features. If you're using a pretrained model, make "
"sure that your models are up to date:\npython -m spacy validate")
E168 = ("Unknown field: {field}")
E169 = ("Can't find module: {module}") E169 = ("Can't find module: {module}")
E170 = ("Cannot apply transition {name}: invalid for the current state.") E170 = ("Cannot apply transition {name}: invalid for the current state.")
E171 = ("Matcher.add received invalid on_match callback argument: expected " E171 = ("Matcher.add received invalid on_match callback argument: expected "
@ -527,8 +490,6 @@ class Errors(object):
E173 = ("As of v2.2, the Lemmatizer is initialized with an instance of " E173 = ("As of v2.2, the Lemmatizer is initialized with an instance of "
"Lookups containing the lemmatization tables. See the docs for " "Lookups containing the lemmatization tables. See the docs for "
"details: https://spacy.io/api/lemmatizer#init") "details: https://spacy.io/api/lemmatizer#init")
E174 = ("Architecture '{name}' not found in registry. Available "
"names: {names}")
E175 = ("Can't remove rule for unknown match pattern ID: {key}") E175 = ("Can't remove rule for unknown match pattern ID: {key}")
E176 = ("Alias '{alias}' is not defined in the Knowledge Base.") E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
E177 = ("Ill-formed IOB input detected: {tag}") E177 = ("Ill-formed IOB input detected: {tag}")
@ -556,9 +517,6 @@ class Errors(object):
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.") "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
E186 = ("'{tok_a}' and '{tok_b}' are different texts.") E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
E187 = ("Only unicode strings are supported as labels.") E187 = ("Only unicode strings are supported as labels.")
E188 = ("Could not match the gold entity links to entities in the doc - "
"make sure the gold EL data refers to valid results of the "
"named entity recognizer in the `nlp` pipeline.")
E189 = ("Each argument to `get_doc` should be of equal length.") E189 = ("Each argument to `get_doc` should be of equal length.")
E190 = ("Token head out of range in `Doc.from_array()` for token index " E190 = ("Token head out of range in `Doc.from_array()` for token index "
"'{index}' with value '{value}' (equivalent to relative head " "'{index}' with value '{value}' (equivalent to relative head "
@ -578,12 +536,32 @@ class Errors(object):
E197 = ("Row out of bounds, unable to add row {row} for key {key}.") E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
E198 = ("Unable to return {n} most similar vectors for the current vectors " E198 = ("Unable to return {n} most similar vectors for the current vectors "
"table, which contains {n_rows} vectors.") "table, which contains {n_rows} vectors.")
E199 = ("Unable to merge 0-length span at doc[{start}:{end}].")
# TODO: fix numbering after merging develop into master # TODO: fix numbering after merging develop into master
E983 = ("Invalid key for '{dict_name}': {key}. Available keys: " E970 = ("Can not execute command '{str_command}'. Do you have '{tool}' installed?")
E971 = ("Found incompatible lengths in Doc.from_array: {array_length} for the "
"array and {doc_length} for the Doc itself.")
E972 = ("Example.__init__ got None for '{arg}'. Requires Doc.")
E973 = ("Unexpected type for NER data")
E974 = ("Unknown {obj} attribute: {key}")
E975 = ("The method Example.from_dict expects a Doc as first argument, "
"but got {type}")
E976 = ("The method Example.from_dict expects a dict as second argument, "
"but received None.")
E977 = ("Can not compare a MorphAnalysis with a string object. "
"This is likely a bug in spaCy, so feel free to open an issue.")
E978 = ("The {method} method of component {name} takes a list of Example objects, "
"but found {types} instead.")
E979 = ("Cannot convert {type} to an Example object.")
E980 = ("Each link annotation should refer to a dictionary with at most one "
"identifier mapping to 1.0, and all others to 0.0.")
E981 = ("The offsets of the annotations for 'links' need to refer exactly "
"to the offsets of the 'entities' annotations.")
E982 = ("The 'ent_iob' attribute of a Token should be an integer indexing "
"into {values}, but found {value}.")
E983 = ("Invalid key for '{dict}': {key}. Available keys: "
"{keys}") "{keys}")
E984 = ("Could not parse the {input} - double check the data is written "
"in the correct format as expected by spaCy.")
E985 = ("The pipeline component '{component}' is already available in the base " E985 = ("The pipeline component '{component}' is already available in the base "
"model. The settings in the component block in the config file are " "model. The settings in the component block in the config file are "
"being ignored. If you want to replace this component instead, set " "being ignored. If you want to replace this component instead, set "
@ -615,22 +593,13 @@ class Errors(object):
E997 = ("Tokenizer special cases are not allowed to modify the text. " E997 = ("Tokenizer special cases are not allowed to modify the text. "
"This would map '{chunk}' to '{orth}' given token attributes " "This would map '{chunk}' to '{orth}' given token attributes "
"'{token_attrs}'.") "'{token_attrs}'.")
E998 = ("To create GoldParse objects from Example objects without a "
"Doc, get_gold_parses() should be called with a Vocab object.")
E999 = ("Encountered an unexpected format for the dictionary holding "
"gold annotations: {gold_dict}")
@add_codes @add_codes
class TempErrors(object): class TempErrors(object):
T003 = ("Resizing pretrained Tagger models is not currently supported.") T003 = ("Resizing pretrained Tagger models is not currently supported.")
T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.")
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the " T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
"issue tracker: http://github.com/explosion/spaCy/issues") "issue tracker: http://github.com/explosion/spaCy/issues")
T008 = ("Bad configuration of Tagger. This is probably a bug within "
"spaCy. We changed the name of an internal attribute for loading "
"pretrained vectors, and the class has been passed the old name "
"(pretrained_dims) but not the new name (pretrained_vectors).")
# fmt: on # fmt: on

View File

@ -1,68 +0,0 @@
from cymem.cymem cimport Pool
from .typedefs cimport attr_t
from .syntax.transition_system cimport Transition
from .tokens import Doc
cdef struct GoldParseC:
int* tags
int* heads
int* has_dep
int* sent_start
attr_t* labels
int** brackets
Transition* ner
cdef class GoldParse:
cdef Pool mem
cdef GoldParseC c
cdef readonly TokenAnnotation orig
cdef int length
cdef public int loss
cdef public list words
cdef public list tags
cdef public list pos
cdef public list morphs
cdef public list lemmas
cdef public list sent_starts
cdef public list heads
cdef public list labels
cdef public dict orths
cdef public list ner
cdef public dict brackets
cdef public dict cats
cdef public dict links
cdef readonly list cand_to_gold
cdef readonly list gold_to_cand
cdef class TokenAnnotation:
cdef public list ids
cdef public list words
cdef public list tags
cdef public list pos
cdef public list morphs
cdef public list lemmas
cdef public list heads
cdef public list deps
cdef public list entities
cdef public list sent_starts
cdef public dict brackets_by_start
cdef class DocAnnotation:
cdef public object cats
cdef public object links
cdef class Example:
cdef public object doc
cdef public TokenAnnotation token_annotation
cdef public DocAnnotation doc_annotation
cdef public object goldparse

File diff suppressed because it is too large

0
spacy/gold/__init__.pxd Normal file
View File

11
spacy/gold/__init__.py Normal file
View File

@ -0,0 +1,11 @@
from .corpus import Corpus
from .example import Example
from .align import align
from .iob_utils import iob_to_biluo, biluo_to_iob
from .iob_utils import biluo_tags_from_offsets, offsets_from_biluo_tags
from .iob_utils import spans_from_biluo_tags
from .iob_utils import tags_to_entities
from .gold_io import docs_to_json
from .gold_io import read_json_file

8
spacy/gold/align.pxd Normal file
View File

@ -0,0 +1,8 @@
cdef class Alignment:
cdef public object cost
cdef public object i2j
cdef public object j2i
cdef public object i2j_multi
cdef public object j2i_multi
cdef public object cand_to_gold
cdef public object gold_to_cand

101
spacy/gold/align.pyx Normal file
View File

@ -0,0 +1,101 @@
import numpy
from ..errors import Errors, AlignmentError
cdef class Alignment:
def __init__(self, spacy_words, gold_words):
# Do many-to-one alignment for misaligned tokens.
# If we over-segment, we'll have one gold word that covers a sequence
# of predicted words
# If we under-segment, we'll have one predicted word that covers a
# sequence of gold words.
# If we "mis-segment", we'll have a sequence of predicted words covering
# a sequence of gold words. That's many-to-many -- we don't do that
# except for NER spans where the start and end can be aligned.
cost, i2j, j2i, i2j_multi, j2i_multi = align(spacy_words, gold_words)
self.cost = cost
self.i2j = i2j
self.j2i = j2i
self.i2j_multi = i2j_multi
self.j2i_multi = j2i_multi
self.cand_to_gold = [(j if j >= 0 else None) for j in i2j]
self.gold_to_cand = [(i if i >= 0 else None) for i in j2i]
def align(tokens_a, tokens_b):
"""Calculate alignment tables between two tokenizations.
tokens_a (List[str]): The candidate tokenization.
tokens_b (List[str]): The reference tokenization.
RETURNS: (tuple): A 5-tuple consisting of the following information:
* cost (int): The number of misaligned tokens.
* a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
to `tokens_b[6]`. If there's no one-to-one alignment for a token,
it has the value -1.
* b2a (List[int]): The same as `a2b`, but mapping the other direction.
* a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
the same token of `tokens_b`.
* b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
direction.
"""
tokens_a = _normalize_for_alignment(tokens_a)
tokens_b = _normalize_for_alignment(tokens_b)
cost = 0
a2b = numpy.empty(len(tokens_a), dtype="i")
b2a = numpy.empty(len(tokens_b), dtype="i")
a2b.fill(-1)
b2a.fill(-1)
a2b_multi = {}
b2a_multi = {}
i = 0
j = 0
offset_a = 0
offset_b = 0
while i < len(tokens_a) and j < len(tokens_b):
a = tokens_a[i][offset_a:]
b = tokens_b[j][offset_b:]
if a == b:
if offset_a == offset_b == 0:
a2b[i] = j
b2a[j] = i
elif offset_a == 0:
cost += 2
a2b_multi[i] = j
elif offset_b == 0:
cost += 2
b2a_multi[j] = i
offset_a = offset_b = 0
i += 1
j += 1
elif a == "":
assert offset_a == 0
cost += 1
i += 1
elif b == "":
assert offset_b == 0
cost += 1
j += 1
elif b.startswith(a):
cost += 1
if offset_a == 0:
a2b_multi[i] = j
i += 1
offset_a = 0
offset_b += len(a)
elif a.startswith(b):
cost += 1
if offset_b == 0:
b2a_multi[j] = i
j += 1
offset_b = 0
offset_a += len(b)
else:
assert "".join(tokens_a) != "".join(tokens_b)
raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b))
return cost, a2b, b2a, a2b_multi, b2a_multi
def _normalize_for_alignment(tokens):
return [w.replace(" ", "").lower() for w in tokens]
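For reference, a minimal sketch of how the alignment behaves when the reference tokenization over-segments a word. This is only an illustration; it assumes `align` is imported via the package `__init__.py` shown above.

from spacy.gold import align

cand = ["her", "genome", "sequencing", "."]      # candidate tokenization
gold = ["her", "genome", "sequenc", "ing", "."]  # reference tokenization
cost, a2b, b2a, a2b_multi, b2a_multi = align(cand, gold)
# "her", "genome" and "." align one-to-one, so a2b == [0, 1, -1, 4].
# "sequencing" covers two gold tokens, so b2a_multi == {2: 2, 3: 2}
# and the mismatch is reflected in a non-zero cost.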

111
spacy/gold/augment.py Normal file
View File

@ -0,0 +1,111 @@
import random
import itertools
def make_orth_variants_example(nlp, example, orth_variant_level=0.0): # TODO: naming
raw_text = example.text
orig_dict = example.to_dict()
variant_text, variant_token_annot = make_orth_variants(
nlp, raw_text, orig_dict["token_annotation"], orth_variant_level
)
doc = nlp.make_doc(variant_text)
orig_dict["token_annotation"] = variant_token_annot
return example.from_dict(doc, orig_dict)
def make_orth_variants(nlp, raw_text, orig_token_dict, orth_variant_level=0.0):
if random.random() >= orth_variant_level:
return raw_text, orig_token_dict
if not orig_token_dict:
return raw_text, orig_token_dict
raw = raw_text
token_dict = orig_token_dict
lower = False
if random.random() >= 0.5:
lower = True
if raw is not None:
raw = raw.lower()
ndsv = nlp.Defaults.single_orth_variants
ndpv = nlp.Defaults.paired_orth_variants
words = token_dict.get("words", [])
tags = token_dict.get("tags", [])
# keep unmodified if words or tags are not defined
if words and tags:
if lower:
words = [w.lower() for w in words]
# single variants
punct_choices = [random.choice(x["variants"]) for x in ndsv]
for word_idx in range(len(words)):
for punct_idx in range(len(ndsv)):
if (
tags[word_idx] in ndsv[punct_idx]["tags"]
and words[word_idx] in ndsv[punct_idx]["variants"]
):
words[word_idx] = punct_choices[punct_idx]
# paired variants
punct_choices = [random.choice(x["variants"]) for x in ndpv]
for word_idx in range(len(words)):
for punct_idx in range(len(ndpv)):
if tags[word_idx] in ndpv[punct_idx]["tags"] and words[
word_idx
] in itertools.chain.from_iterable(ndpv[punct_idx]["variants"]):
# backup option: random left vs. right from pair
pair_idx = random.choice([0, 1])
# best option: rely on paired POS tags like `` / ''
if len(ndpv[punct_idx]["tags"]) == 2:
pair_idx = ndpv[punct_idx]["tags"].index(tags[word_idx])
# next best option: rely on position in variants
# (may not be unambiguous, so order of variants matters)
else:
for pair in ndpv[punct_idx]["variants"]:
if words[word_idx] in pair:
pair_idx = pair.index(words[word_idx])
words[word_idx] = punct_choices[punct_idx][pair_idx]
token_dict["words"] = words
token_dict["tags"] = tags
# modify raw
if raw is not None:
variants = []
for single_variants in ndsv:
variants.extend(single_variants["variants"])
for paired_variants in ndpv:
variants.extend(
list(itertools.chain.from_iterable(paired_variants["variants"]))
)
# store variants in reverse length order to be able to prioritize
# longer matches (e.g., "---" before "--")
variants = sorted(variants, key=lambda x: len(x))
variants.reverse()
variant_raw = ""
raw_idx = 0
# add initial whitespace
while raw_idx < len(raw) and raw[raw_idx].isspace():
variant_raw += raw[raw_idx]
raw_idx += 1
for word in words:
match_found = False
# skip whitespace words
if word.isspace():
match_found = True
# add identical word
elif word not in variants and raw[raw_idx:].startswith(word):
variant_raw += word
raw_idx += len(word)
match_found = True
# add variant word
else:
for variant in variants:
if not match_found and raw[raw_idx:].startswith(variant):
raw_idx += len(variant)
variant_raw += word
match_found = True
# something went wrong, abort
# (add a warning message?)
if not match_found:
return raw_text, orig_token_dict
# add following whitespace
while raw_idx < len(raw) and raw[raw_idx].isspace():
variant_raw += raw[raw_idx]
raw_idx += 1
raw = variant_raw
return raw, token_dict
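A rough sketch of applying this augmenter to a single Example during training. Illustrative only: it assumes an existing `nlp` object whose `Defaults` define `single_orth_variants`/`paired_orth_variants` (for example the English defaults) and an existing `example`.

from spacy.gold.augment import make_orth_variants_example

# `nlp` and `example` are assumed to exist already (see note above).
augmented = make_orth_variants_example(nlp, example, orth_variant_level=0.5)
# Roughly half the time the text and token annotations are rewritten with
# alternative punctuation/casing variants; otherwise the original
# annotations are passed through unchanged.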

View File

@ -0,0 +1,6 @@
from .iob2docs import iob2docs # noqa: F401
from .conll_ner2docs import conll_ner2docs # noqa: F401
from .json2docs import json2docs
# TODO: Update this one
# from .conllu2docs import conllu2docs # noqa: F401

View File

@ -1,17 +1,18 @@
 from wasabi import Printer

+from .. import tags_to_entities
 from ...gold import iob_to_biluo
 from ...lang.xx import MultiLanguage
-from ...tokens.doc import Doc
+from ...tokens import Doc, Span
 from ...util import load_model


-def conll_ner2json(
+def conll_ner2docs(
     input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs
 ):
     """
     Convert files in the CoNLL-2003 NER format and similar
-    whitespace-separated columns into JSON format for use with train cli.
+    whitespace-separated columns into Doc objects.

     The first column is the tokens, the final column is the IOB tags. If an
     additional second column is present, the second column is the tags.
@ -81,17 +82,25 @@ def conll_ner2json(
             "No document delimiters found. Use `-n` to automatically group "
             "sentences into documents."
         )
+    if model:
+        nlp = load_model(model)
+    else:
+        nlp = MultiLanguage()
+
     output_docs = []
-    for doc in input_data.strip().split(doc_delimiter):
-        doc = doc.strip()
-        if not doc:
+    for conll_doc in input_data.strip().split(doc_delimiter):
+        conll_doc = conll_doc.strip()
+        if not conll_doc:
             continue
-        output_doc = []
-        for sent in doc.split("\n\n"):
-            sent = sent.strip()
-            if not sent:
+        words = []
+        sent_starts = []
+        pos_tags = []
+        biluo_tags = []
+        for conll_sent in conll_doc.split("\n\n"):
+            conll_sent = conll_sent.strip()
+            if not conll_sent:
                 continue
-            lines = [line.strip() for line in sent.split("\n") if line.strip()]
+            lines = [line.strip() for line in conll_sent.split("\n") if line.strip()]
             cols = list(zip(*[line.split() for line in lines]))
             if len(cols) < 2:
                 raise ValueError(
@ -99,25 +108,19 @@ def conll_ner2json(
                     "Try checking whitespace and delimiters. See "
                     "https://spacy.io/api/cli#convert"
                 )
-            words = cols[0]
-            iob_ents = cols[-1]
-            if len(cols) > 2:
-                tags = cols[1]
-            else:
-                tags = ["-"] * len(words)
-            biluo_ents = iob_to_biluo(iob_ents)
-            output_doc.append(
-                {
-                    "tokens": [
-                        {"orth": w, "tag": tag, "ner": ent}
-                        for (w, tag, ent) in zip(words, tags, biluo_ents)
-                    ]
-                }
-            )
-        output_docs.append(
-            {"id": len(output_docs), "paragraphs": [{"sentences": output_doc}]}
-        )
-        output_doc = []
+            length = len(cols[0])
+            words.extend(cols[0])
+            sent_starts.extend([True] + [False] * (length - 1))
+            biluo_tags.extend(iob_to_biluo(cols[-1]))
+            pos_tags.extend(cols[1] if len(cols) > 2 else ["-"] * length)
+
+        doc = Doc(nlp.vocab, words=words)
+        for i, token in enumerate(doc):
+            token.tag_ = pos_tags[i]
+            token.is_sent_start = sent_starts[i]
+        entities = tags_to_entities(biluo_tags)
+        doc.ents = [Span(doc, start=s, end=e + 1, label=L) for L, s, e in entities]
+        output_docs.append(doc)
     return output_docs

View File

@ -1,10 +1,10 @@
 import re

+from .conll_ner2docs import n_sents_info
 from ...gold import Example
-from ...gold import iob_to_biluo, spans_from_biluo_tags, biluo_tags_from_offsets
+from ...gold import iob_to_biluo, spans_from_biluo_tags
 from ...language import Language
 from ...tokens import Doc, Token
-from .conll_ner2json import n_sents_info
 from wasabi import Printer

@ -12,7 +12,6 @@ def conllu2json(
     input_data,
     n_sents=10,
     append_morphology=False,
-    lang=None,
     ner_map=None,
     merge_subtokens=False,
     no_print=False,
@ -44,10 +43,7 @@ def conllu2json(
             raw += example.text
         sentences.append(
             generate_sentence(
-                example.token_annotation,
-                has_ner_tags,
-                MISC_NER_PATTERN,
-                ner_map=ner_map,
+                example.to_dict(), has_ner_tags, MISC_NER_PATTERN, ner_map=ner_map,
             )
         )
         # Real-sized documents could be extracted using the comments on the
@ -145,21 +141,22 @@ def get_entities(lines, tag_pattern, ner_map=None):
     return iob_to_biluo(iob)


-def generate_sentence(token_annotation, has_ner_tags, tag_pattern, ner_map=None):
+def generate_sentence(example_dict, has_ner_tags, tag_pattern, ner_map=None):
     sentence = {}
     tokens = []
-    for i, id_ in enumerate(token_annotation.ids):
+    token_annotation = example_dict["token_annotation"]
+    for i, id_ in enumerate(token_annotation["ids"]):
         token = {}
         token["id"] = id_
-        token["orth"] = token_annotation.get_word(i)
-        token["tag"] = token_annotation.get_tag(i)
-        token["pos"] = token_annotation.get_pos(i)
-        token["lemma"] = token_annotation.get_lemma(i)
-        token["morph"] = token_annotation.get_morph(i)
-        token["head"] = token_annotation.get_head(i) - id_
-        token["dep"] = token_annotation.get_dep(i)
+        token["orth"] = token_annotation["words"][i]
+        token["tag"] = token_annotation["tags"][i]
+        token["pos"] = token_annotation["pos"][i]
+        token["lemma"] = token_annotation["lemmas"][i]
+        token["morph"] = token_annotation["morphs"][i]
+        token["head"] = token_annotation["heads"][i] - i
+        token["dep"] = token_annotation["deps"][i]
         if has_ner_tags:
-            token["ner"] = token_annotation.get_entity(i)
+            token["ner"] = example_dict["doc_annotation"]["entities"][i]
         tokens.append(token)
     sentence["tokens"] = tokens
     return sentence
@ -267,40 +264,25 @@ def example_from_conllu_sentence(
         doc = merge_conllu_subtokens(lines, doc)

     # create Example from custom Doc annotation
-    ids, words, tags, heads, deps = [], [], [], [], []
-    pos, lemmas, morphs, spaces = [], [], [], []
+    words, spaces, tags, morphs, lemmas = [], [], [], [], []
     for i, t in enumerate(doc):
-        ids.append(i)
         words.append(t._.merged_orth)
+        lemmas.append(t._.merged_lemma)
+        spaces.append(t._.merged_spaceafter)
+        morphs.append(t._.merged_morph)
         if append_morphology and t._.merged_morph:
             tags.append(t.tag_ + "__" + t._.merged_morph)
         else:
             tags.append(t.tag_)
-        pos.append(t.pos_)
-        morphs.append(t._.merged_morph)
-        lemmas.append(t._.merged_lemma)
-        heads.append(t.head.i)
-        deps.append(t.dep_)
-        spaces.append(t._.merged_spaceafter)
-    ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
-    ents = biluo_tags_from_offsets(doc, ent_offsets)
-    raw = ""
-    for word, space in zip(words, spaces):
-        raw += word
-        if space:
-            raw += " "
-    example = Example(doc=raw)
-    example.set_token_annotation(
-        ids=ids,
-        words=words,
-        tags=tags,
-        pos=pos,
-        morphs=morphs,
-        lemmas=lemmas,
-        heads=heads,
-        deps=deps,
-        entities=ents,
-    )
+
+    doc_x = Doc(vocab, words=words, spaces=spaces)
+    ref_dict = Example(doc_x, reference=doc).to_dict()
+    ref_dict["words"] = words
+    ref_dict["lemmas"] = lemmas
+    ref_dict["spaces"] = spaces
+    ref_dict["tags"] = tags
+    ref_dict["morphs"] = morphs
+    example = Example.from_dict(doc_x, ref_dict)
     return example

View File

@ -0,0 +1,64 @@
from wasabi import Printer
from .conll_ner2docs import n_sents_info
from ...gold import iob_to_biluo, tags_to_entities
from ...tokens import Doc, Span
from ...util import minibatch
def iob2docs(input_data, vocab, n_sents=10, no_print=False, *args, **kwargs):
"""
Convert IOB files with one sentence per line and tags separated with '|'
into Doc objects so they can be saved. IOB and IOB2 are accepted.
Sample formats:
I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
"""
msg = Printer(no_print=no_print)
if n_sents > 0:
n_sents_info(msg, n_sents)
docs = read_iob(input_data.split("\n"), vocab, n_sents)
return docs
def read_iob(raw_sents, vocab, n_sents):
docs = []
for group in minibatch(raw_sents, size=n_sents):
tokens = []
words = []
tags = []
iob = []
sent_starts = []
for line in group:
if not line.strip():
continue
sent_tokens = [t.split("|") for t in line.split()]
if len(sent_tokens[0]) == 3:
sent_words, sent_tags, sent_iob = zip(*sent_tokens)
elif len(sent_tokens[0]) == 2:
sent_words, sent_iob = zip(*sent_tokens)
sent_tags = ["-"] * len(sent_words)
else:
raise ValueError(
"The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert"
)
words.extend(sent_words)
tags.extend(sent_tags)
iob.extend(sent_iob)
tokens.extend(sent_tokens)
sent_starts.append(True)
sent_starts.extend([False for _ in sent_words[1:]])
doc = Doc(vocab, words=words)
for i, tag in enumerate(tags):
doc[i].tag_ = tag
for i, sent_start in enumerate(sent_starts):
doc[i].is_sent_start = sent_start
biluo = iob_to_biluo(iob)
entities = tags_to_entities(biluo)
doc.ents = [Span(doc, start=s, end=e+1, label=L) for (L, s, e) in entities]
docs.append(doc)
return docs
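A minimal usage sketch for this converter. It assumes the converters package is importable as `spacy.gold.converters` (per the package `__init__.py` shown above); the input string mirrors the sample formats in the docstring.

from spacy.vocab import Vocab
from spacy.gold.converters import iob2docs

data = "I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O"
docs = iob2docs(data, Vocab(), n_sents=10)
# One Doc is produced per group of up to n_sents lines; after the BILUO
# conversion, doc.ents should end up with "London" and "New York City"
# as GPE spans.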

View File

@ -0,0 +1,24 @@
import srsly
from ..gold_io import json_iterate, json_to_annotations
from ..example import annotations2doc
from ..example import _fix_legacy_dict_data, _parse_example_dict_data
from ...util import load_model
from ...lang.xx import MultiLanguage
def json2docs(input_data, model=None, **kwargs):
nlp = load_model(model) if model is not None else MultiLanguage()
if not isinstance(input_data, bytes):
if not isinstance(input_data, str):
input_data = srsly.json_dumps(input_data)
input_data = input_data.encode("utf8")
docs = []
for json_doc in json_iterate(input_data):
for json_para in json_to_annotations(json_doc):
example_dict = _fix_legacy_dict_data(json_para)
tok_dict, doc_dict = _parse_example_dict_data(example_dict)
if json_para.get("raw"):
assert tok_dict.get("SPACY")
doc = annotations2doc(nlp.vocab, tok_dict, doc_dict)
docs.append(doc)
return docs
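A short sketch of converting existing v2-style JSON training data with this helper. The file name is illustrative, the import path assumes `spacy.gold.converters`, and the data must follow the "paragraphs"/"sentences" schema that docs_to_json() produces.

import srsly
from spacy.gold.converters import json2docs

json_data = srsly.read_json("train.json")  # placeholder path
docs = json2docs(json_data)  # falls back to MultiLanguage() when no model is passed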

122
spacy/gold/corpus.py Normal file
View File

@ -0,0 +1,122 @@
import random
from .. import util
from .example import Example
from ..tokens import DocBin, Doc
class Corpus:
"""An annotated corpus, reading train and dev datasets from
the DocBin (.spacy) format.
DOCS: https://spacy.io/api/goldcorpus
"""
def __init__(self, train_loc, dev_loc, limit=0):
"""Create a Corpus.
train (str / Path): File or directory of training data.
dev (str / Path): File or directory of development data.
limit (int): Max. number of examples returned
RETURNS (Corpus): The newly created object.
"""
self.train_loc = train_loc
self.dev_loc = dev_loc
self.limit = limit
@staticmethod
def walk_corpus(path):
path = util.ensure_path(path)
if not path.is_dir():
return [path]
paths = [path]
locs = []
seen = set()
for path in paths:
if str(path) in seen:
continue
seen.add(str(path))
if path.parts[-1].startswith("."):
continue
elif path.is_dir():
paths.extend(path.iterdir())
elif path.parts[-1].endswith(".spacy"):
locs.append(path)
return locs
def make_examples(self, nlp, reference_docs, max_length=0):
for reference in reference_docs:
if len(reference) >= max_length >= 1:
if reference.is_sentenced:
for ref_sent in reference.sents:
yield Example(
nlp.make_doc(ref_sent.text),
ref_sent.as_doc()
)
else:
yield Example(
nlp.make_doc(reference.text),
reference
)
def make_examples_gold_preproc(self, nlp, reference_docs):
for reference in reference_docs:
if reference.is_sentenced:
ref_sents = [sent.as_doc() for sent in reference.sents]
else:
ref_sents = [reference]
for ref_sent in ref_sents:
yield Example(
Doc(
nlp.vocab,
words=[w.text for w in ref_sent],
spaces=[bool(w.whitespace_) for w in ref_sent]
),
ref_sent
)
def read_docbin(self, vocab, locs):
""" Yield training examples as example dicts """
i = 0
for loc in locs:
loc = util.ensure_path(loc)
if loc.parts[-1].endswith(".spacy"):
with loc.open("rb") as file_:
doc_bin = DocBin().from_bytes(file_.read())
docs = doc_bin.get_docs(vocab)
for doc in docs:
if len(doc):
yield doc
i += 1
if self.limit >= 1 and i >= self.limit:
break
def count_train(self, nlp):
"""Returns count of words in train examples"""
n = 0
i = 0
for example in self.train_dataset(nlp):
n += len(example.predicted)
if self.limit >= 0 and i >= self.limit:
break
i += 1
return n
def train_dataset(self, nlp, *, shuffle=True, gold_preproc=False,
max_length=0, **kwargs):
ref_docs = self.read_docbin(nlp.vocab, self.walk_corpus(self.train_loc))
if gold_preproc:
examples = self.make_examples_gold_preproc(nlp, ref_docs)
else:
examples = self.make_examples(nlp, ref_docs, max_length)
if shuffle:
examples = list(examples)
random.shuffle(examples)
yield from examples
def dev_dataset(self, nlp, *, gold_preproc=False, **kwargs):
ref_docs = self.read_docbin(nlp.vocab, self.walk_corpus(self.dev_loc))
if gold_preproc:
examples = self.make_examples_gold_preproc(nlp, ref_docs)
else:
examples = self.make_examples(nlp, ref_docs, max_length=0)
yield from examples
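A minimal sketch of reading a corpus with this class. The .spacy paths are placeholders and must point at serialized DocBin files.

import spacy
from spacy.gold import Corpus

nlp = spacy.blank("en")
corpus = Corpus("./train.spacy", "./dev.spacy", limit=0)  # placeholder paths
dev_examples = list(corpus.dev_dataset(nlp, gold_preproc=True))
# With gold_preproc=True each reference sentence becomes its own Example,
# whose predicted side is a Doc rebuilt from the reference tokens and spaces.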

8
spacy/gold/example.pxd Normal file
View File

@ -0,0 +1,8 @@
from ..tokens.doc cimport Doc
from .align cimport Alignment
cdef class Example:
cdef readonly Doc x
cdef readonly Doc y
cdef readonly Alignment _alignment

432
spacy/gold/example.pyx Normal file
View File

@ -0,0 +1,432 @@
import warnings
import numpy
from ..tokens.doc cimport Doc
from ..tokens.span cimport Span
from ..tokens.span import Span
from ..attrs import IDS
from .align cimport Alignment
from .iob_utils import biluo_to_iob, biluo_tags_from_offsets, biluo_tags_from_doc
from .iob_utils import spans_from_biluo_tags
from .align import Alignment
from ..errors import Errors, Warnings
from ..syntax import nonproj
cpdef Doc annotations2doc(vocab, tok_annot, doc_annot):
""" Create a Doc from dictionaries with token and doc annotations. Assumes ORTH & SPACY are set. """
attrs, array = _annot2array(vocab, tok_annot, doc_annot)
output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"])
if "entities" in doc_annot:
_add_entities_to_doc(output, doc_annot["entities"])
if array.size:
output = output.from_array(attrs, array)
# links are currently added with ENT_KB_ID on the token level
output.cats.update(doc_annot.get("cats", {}))
return output
cdef class Example:
def __init__(self, Doc predicted, Doc reference, *, Alignment alignment=None):
""" Doc can either be text, or an actual Doc """
if predicted is None:
raise TypeError(Errors.E972.format(arg="predicted"))
if reference is None:
raise TypeError(Errors.E972.format(arg="reference"))
self.x = predicted
self.y = reference
self._alignment = alignment
property predicted:
def __get__(self):
return self.x
def __set__(self, doc):
self.x = doc
property reference:
def __get__(self):
return self.y
def __set__(self, doc):
self.y = doc
def copy(self):
return Example(
self.x.copy(),
self.y.copy()
)
@classmethod
def from_dict(cls, Doc predicted, dict example_dict):
if example_dict is None:
raise ValueError(Errors.E976)
if not isinstance(predicted, Doc):
raise TypeError(Errors.E975.format(type=type(predicted)))
example_dict = _fix_legacy_dict_data(example_dict)
tok_dict, doc_dict = _parse_example_dict_data(example_dict)
if "ORTH" not in tok_dict:
tok_dict["ORTH"] = [tok.text for tok in predicted]
tok_dict["SPACY"] = [tok.whitespace_ for tok in predicted]
if not _has_field(tok_dict, "SPACY"):
spaces = _guess_spaces(predicted.text, tok_dict["ORTH"])
return Example(
predicted,
annotations2doc(predicted.vocab, tok_dict, doc_dict)
)
@property
def alignment(self):
if self._alignment is None:
spacy_words = [token.orth_ for token in self.predicted]
gold_words = [token.orth_ for token in self.reference]
if gold_words == []:
gold_words = spacy_words
self._alignment = Alignment(spacy_words, gold_words)
return self._alignment
def get_aligned(self, field, as_string=False):
"""Return an aligned array for a token attribute."""
i2j_multi = self.alignment.i2j_multi
cand_to_gold = self.alignment.cand_to_gold
vocab = self.reference.vocab
gold_values = self.reference.to_array([field])
output = [None] * len(self.predicted)
for i, gold_i in enumerate(cand_to_gold):
if self.predicted[i].text.isspace():
output[i] = None
if gold_i is None:
if i in i2j_multi:
output[i] = gold_values[i2j_multi[i]]
else:
output[i] = None
else:
output[i] = gold_values[gold_i]
if as_string and field not in ["ENT_IOB", "SENT_START"]:
output = [vocab.strings[o] if o is not None else o for o in output]
return output
def get_aligned_parse(self, projectivize=True):
cand_to_gold = self.alignment.cand_to_gold
gold_to_cand = self.alignment.gold_to_cand
aligned_heads = [None] * self.x.length
aligned_deps = [None] * self.x.length
heads = [token.head.i for token in self.y]
deps = [token.dep_ for token in self.y]
if projectivize:
heads, deps = nonproj.projectivize(heads, deps)
for cand_i in range(self.x.length):
gold_i = cand_to_gold[cand_i]
if gold_i is not None: # Alignment found
gold_head = gold_to_cand[heads[gold_i]]
if gold_head is not None:
aligned_heads[cand_i] = gold_head
aligned_deps[cand_i] = deps[gold_i]
return aligned_heads, aligned_deps
def get_aligned_ner(self):
if not self.y.is_nered:
return [None] * len(self.x) # should this be 'missing' instead of 'None' ?
x_text = self.x.text
# Get a list of entities, and make spans for non-entity tokens.
# We then work through the spans in order, trying to find them in
# the text and using that to get the offset. Any token that doesn't
# get a tag set this way is tagged None.
# This could maybe be improved? It at least feels easy to reason about.
y_spans = list(self.y.ents)
y_spans.sort()
x_text_offset = 0
x_spans = []
for y_span in y_spans:
if x_text.count(y_span.text) >= 1:
start_char = x_text.index(y_span.text) + x_text_offset
end_char = start_char + len(y_span.text)
x_span = self.x.char_span(start_char, end_char, label=y_span.label)
if x_span is not None:
x_spans.append(x_span)
x_text = self.x.text[end_char:]
x_text_offset = end_char
x_tags = biluo_tags_from_offsets(
self.x,
[(e.start_char, e.end_char, e.label_) for e in x_spans],
missing=None
)
gold_to_cand = self.alignment.gold_to_cand
for token in self.y:
if token.ent_iob_ == "O":
cand_i = gold_to_cand[token.i]
if cand_i is not None and x_tags[cand_i] is None:
x_tags[cand_i] = "O"
i2j_multi = self.alignment.i2j_multi
for i, tag in enumerate(x_tags):
if tag is None and i in i2j_multi:
gold_i = i2j_multi[i]
if gold_i is not None and self.y[gold_i].ent_iob_ == "O":
x_tags[i] = "O"
return x_tags
def to_dict(self):
return {
"doc_annotation": {
"cats": dict(self.reference.cats),
"entities": biluo_tags_from_doc(self.reference),
"links": self._links_to_dict()
},
"token_annotation": {
"ids": [t.i+1 for t in self.reference],
"words": [t.text for t in self.reference],
"tags": [t.tag_ for t in self.reference],
"lemmas": [t.lemma_ for t in self.reference],
"pos": [t.pos_ for t in self.reference],
"morphs": [t.morph_ for t in self.reference],
"heads": [t.head.i for t in self.reference],
"deps": [t.dep_ for t in self.reference],
"sent_starts": [int(bool(t.is_sent_start)) for t in self.reference]
}
}
def _links_to_dict(self):
links = {}
for ent in self.reference.ents:
if ent.kb_id_:
links[(ent.start_char, ent.end_char)] = {ent.kb_id_: 1.0}
return links
def split_sents(self):
""" Split the token annotations into multiple Examples based on
sent_starts and return a list of the new Examples"""
if not self.reference.is_sentenced:
return [self]
sent_starts = self.get_aligned("SENT_START")
sent_starts.append(1) # appending virtual start of a next sentence to facilitate search
output = []
pred_start = 0
for sent in self.reference.sents:
new_ref = sent.as_doc()
pred_end = sent_starts.index(1, pred_start+1) # find where the next sentence starts
new_pred = self.predicted[pred_start : pred_end].as_doc()
output.append(Example(new_pred, new_ref))
pred_start = pred_end
return output
property text:
def __get__(self):
return self.x.text
def __str__(self):
return str(self.to_dict())
def __repr__(self):
return str(self.to_dict())
def _annot2array(vocab, tok_annot, doc_annot):
attrs = []
values = []
for key, value in doc_annot.items():
if value:
if key == "entities":
pass
elif key == "links":
entities = doc_annot.get("entities", {})
if not entities:
raise ValueError(Errors.E981)
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], value, entities)
tok_annot["ENT_KB_ID"] = ent_kb_ids
elif key == "cats":
pass
else:
raise ValueError(Errors.E974.format(obj="doc", key=key))
for key, value in tok_annot.items():
if key not in IDS:
raise ValueError(Errors.E974.format(obj="token", key=key))
elif key in ["ORTH", "SPACY"]:
pass
elif key == "HEAD":
attrs.append(key)
values.append([h-i for i, h in enumerate(value)])
elif key == "SENT_START":
attrs.append(key)
values.append(value)
elif key == "MORPH":
attrs.append(key)
values.append([vocab.morphology.add(v) for v in value])
else:
attrs.append(key)
values.append([vocab.strings.add(v) for v in value])
array = numpy.asarray(values, dtype="uint64")
return attrs, array.T
def _add_entities_to_doc(doc, ner_data):
if ner_data is None:
return
elif ner_data == []:
doc.ents = []
elif isinstance(ner_data[0], tuple):
return _add_entities_to_doc(
doc,
biluo_tags_from_offsets(doc, ner_data)
)
elif isinstance(ner_data[0], str) or ner_data[0] is None:
return _add_entities_to_doc(
doc,
spans_from_biluo_tags(doc, ner_data)
)
elif isinstance(ner_data[0], Span):
# Ugh, this is super messy. Really hard to set O entities
doc.ents = ner_data
doc.ents = [span for span in ner_data if span.label_]
else:
raise ValueError(Errors.E973)
def _parse_example_dict_data(example_dict):
return (
example_dict["token_annotation"],
example_dict["doc_annotation"]
)
def _fix_legacy_dict_data(example_dict):
token_dict = example_dict.get("token_annotation", {})
doc_dict = example_dict.get("doc_annotation", {})
for key, value in example_dict.items():
if value:
if key in ("token_annotation", "doc_annotation"):
pass
elif key == "ids":
pass
elif key in ("cats", "links"):
doc_dict[key] = value
elif key in ("ner", "entities"):
doc_dict["entities"] = value
else:
token_dict[key] = value
# Remap keys
remapping = {
"words": "ORTH",
"tags": "TAG",
"pos": "POS",
"lemmas": "LEMMA",
"deps": "DEP",
"heads": "HEAD",
"sent_starts": "SENT_START",
"morphs": "MORPH",
"spaces": "SPACY",
}
old_token_dict = token_dict
token_dict = {}
for key, value in old_token_dict.items():
if key in ("text", "ids", "brackets"):
pass
elif key in remapping:
token_dict[remapping[key]] = value
else:
raise KeyError(Errors.E983.format(key=key, dict="token_annotation", keys=remapping.keys()))
text = example_dict.get("text", example_dict.get("raw"))
if _has_field(token_dict, "ORTH") and not _has_field(token_dict, "SPACY"):
token_dict["SPACY"] = _guess_spaces(text, token_dict["ORTH"])
if "HEAD" in token_dict and "SENT_START" in token_dict:
# If heads are set, we don't also redundantly specify SENT_START.
token_dict.pop("SENT_START")
warnings.warn(Warnings.W092)
return {
"token_annotation": token_dict,
"doc_annotation": doc_dict
}
def _has_field(annot, field):
if field not in annot:
return False
elif annot[field] is None:
return False
elif len(annot[field]) == 0:
return False
elif all([value is None for value in annot[field]]):
return False
else:
return True
def _parse_ner_tags(biluo_or_offsets, vocab, words, spaces):
if isinstance(biluo_or_offsets[0], (list, tuple)):
# Convert to biluo if necessary
# This is annoying but to convert the offsets we need a Doc
# that has the target tokenization.
reference = Doc(vocab, words=words, spaces=spaces)
biluo = biluo_tags_from_offsets(reference, biluo_or_offsets)
else:
biluo = biluo_or_offsets
ent_iobs = []
ent_types = []
for iob_tag in biluo_to_iob(biluo):
if iob_tag in (None, "-"):
ent_iobs.append("")
ent_types.append("")
else:
ent_iobs.append(iob_tag.split("-")[0])
if iob_tag.startswith("I") or iob_tag.startswith("B"):
ent_types.append(iob_tag.split("-", 1)[1])
else:
ent_types.append("")
return ent_iobs, ent_types
def _parse_links(vocab, words, links, entities):
reference = Doc(vocab, words=words)
starts = {token.idx: token.i for token in reference}
ends = {token.idx + len(token): token.i for token in reference}
ent_kb_ids = ["" for _ in reference]
entity_map = [(ent[0], ent[1]) for ent in entities]
# links annotations need to refer 1-1 to entity annotations - throw error otherwise
for index, annot_dict in links.items():
start_char, end_char = index
if (start_char, end_char) not in entity_map:
raise ValueError(Errors.E981)
for index, annot_dict in links.items():
true_kb_ids = []
for key, value in annot_dict.items():
if value == 1.0:
true_kb_ids.append(key)
if len(true_kb_ids) > 1:
raise ValueError(Errors.E980)
if len(true_kb_ids) == 1:
start_char, end_char = index
start_token = starts.get(start_char)
end_token = ends.get(end_char)
for i in range(start_token, end_token+1):
ent_kb_ids[i] = true_kb_ids[0]
return ent_kb_ids
def _guess_spaces(text, words):
if text is None:
return [True] * len(words)
spaces = []
text_pos = 0
# align words with text
for word in words:
try:
word_start = text[text_pos:].index(word)
except ValueError:
spaces.append(True)
continue
text_pos += word_start + len(word)
if text_pos < len(text) and text[text_pos] == " ":
spaces.append(True)
else:
spaces.append(False)
return spaces
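As a quick illustration of the dict format accepted above, a sketch using the legacy-style keys that _fix_legacy_dict_data remaps (token texts, tags, and character-offset entities):

import spacy
from spacy.gold import Example

nlp = spacy.blank("en")
predicted = nlp.make_doc("I like London")
example = Example.from_dict(
    predicted,
    {
        "words": ["I", "like", "London"],
        "tags": ["PRP", "VBP", "NNP"],
        "entities": [(7, 13, "GPE")],  # character offsets into the text
    },
)
ner = example.get_aligned_ner()  # -> ["O", "O", "U-GPE"]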

199
spacy/gold/gold_io.pyx Normal file
View File

@ -0,0 +1,199 @@
import warnings
import srsly
from .. import util
from ..errors import Warnings
from ..tokens import Doc
from .iob_utils import biluo_tags_from_offsets, tags_to_entities
import json
def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
"""Convert a list of Doc objects into the JSON-serializable format used by
the spacy train command.
docs (iterable / Doc): The Doc object(s) to convert.
doc_id (int): Id for the JSON.
RETURNS (dict): The data in spaCy's JSON format
- each input doc will be treated as a paragraph in the output doc
"""
if isinstance(docs, Doc):
docs = [docs]
json_doc = {"id": doc_id, "paragraphs": []}
for i, doc in enumerate(docs):
json_para = {'raw': doc.text, "sentences": [], "cats": [], "entities": [], "links": []}
for cat, val in doc.cats.items():
json_cat = {"label": cat, "value": val}
json_para["cats"].append(json_cat)
for ent in doc.ents:
ent_tuple = (ent.start_char, ent.end_char, ent.label_)
json_para["entities"].append(ent_tuple)
if ent.kb_id_:
link_dict = {(ent.start_char, ent.end_char): {ent.kb_id_: 1.0}}
json_para["links"].append(link_dict)
ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
biluo_tags = biluo_tags_from_offsets(doc, ent_offsets, missing=ner_missing_tag)
for j, sent in enumerate(doc.sents):
json_sent = {"tokens": [], "brackets": []}
for token in sent:
json_token = {"id": token.i, "orth": token.text, "space": token.whitespace_}
if doc.is_tagged:
json_token["tag"] = token.tag_
json_token["pos"] = token.pos_
json_token["morph"] = token.morph_
json_token["lemma"] = token.lemma_
if doc.is_parsed:
json_token["head"] = token.head.i-token.i
json_token["dep"] = token.dep_
json_sent["tokens"].append(json_token)
json_para["sentences"].append(json_sent)
json_doc["paragraphs"].append(json_para)
return json_doc
def read_json_file(loc, docs_filter=None, limit=None):
"""Read Example dictionaries from a json file or directory."""
loc = util.ensure_path(loc)
if loc.is_dir():
for filename in loc.iterdir():
yield from read_json_file(loc / filename, limit=limit)
else:
with loc.open("rb") as file_:
utf8_str = file_.read()
for json_doc in json_iterate(utf8_str):
if docs_filter is not None and not docs_filter(json_doc):
continue
for json_paragraph in json_to_annotations(json_doc):
yield json_paragraph
def json_to_annotations(doc):
"""Convert an item in the JSON-formatted training data to the format
used by Example.
doc (dict): One entry in the training data.
YIELDS (tuple): The reformatted data - one training example per paragraph
"""
for paragraph in doc["paragraphs"]:
example = {"text": paragraph.get("raw", None)}
words = []
spaces = []
ids = []
tags = []
ner_tags = []
pos = []
morphs = []
lemmas = []
heads = []
labels = []
sent_starts = []
brackets = []
for sent in paragraph["sentences"]:
sent_start_i = len(words)
for i, token in enumerate(sent["tokens"]):
words.append(token["orth"])
spaces.append(token.get("space", None))
ids.append(token.get('id', sent_start_i + i))
tags.append(token.get("tag", None))
pos.append(token.get("pos", None))
morphs.append(token.get("morph", None))
lemmas.append(token.get("lemma", None))
if "head" in token:
heads.append(token["head"] + sent_start_i + i)
else:
heads.append(None)
if "dep" in token:
labels.append(token["dep"])
# Ensure ROOT label is case-insensitive
if labels[-1].lower() == "root":
labels[-1] = "ROOT"
else:
labels.append(None)
ner_tags.append(token.get("ner", None))
if i == 0:
sent_starts.append(1)
else:
sent_starts.append(0)
if "brackets" in sent:
brackets.extend((b["first"] + sent_start_i,
b["last"] + sent_start_i, b["label"])
for b in sent["brackets"])
example["token_annotation"] = dict(
ids=ids,
words=words,
spaces=spaces,
sent_starts=sent_starts,
brackets=brackets
)
# avoid including dummy values that looks like gold info was present
if any(tags):
example["token_annotation"]["tags"] = tags
if any(pos):
example["token_annotation"]["pos"] = pos
if any(morphs):
example["token_annotation"]["morphs"] = morphs
if any(lemmas):
example["token_annotation"]["lemmas"] = lemmas
if any(head is not None for head in heads):
example["token_annotation"]["heads"] = heads
if any(labels):
example["token_annotation"]["deps"] = labels
cats = {}
for cat in paragraph.get("cats", {}):
cats[cat["label"]] = cat["value"]
example["doc_annotation"] = dict(
cats=cats,
entities=ner_tags,
links=paragraph.get("links", [])
)
yield example
def json_iterate(bytes utf8_str):
# We should've made these files jsonl...But since we didn't, parse out
# the docs one-by-one to reduce memory usage.
# It's okay to read in the whole file -- just don't parse it into JSON.
cdef long file_length = len(utf8_str)
if file_length > 2 ** 30:
warnings.warn(Warnings.W027.format(size=file_length))
raw = <char*>utf8_str
cdef int square_depth = 0
cdef int curly_depth = 0
cdef int inside_string = 0
cdef int escape = 0
cdef long start = -1
cdef char c
cdef char quote = ord('"')
cdef char backslash = ord("\\")
cdef char open_square = ord("[")
cdef char close_square = ord("]")
cdef char open_curly = ord("{")
cdef char close_curly = ord("}")
for i in range(file_length):
c = raw[i]
if escape:
escape = False
continue
if c == backslash:
escape = True
continue
if c == quote:
inside_string = not inside_string
continue
if inside_string:
continue
if c == open_square:
square_depth += 1
elif c == close_square:
square_depth -= 1
elif c == open_curly:
if square_depth == 1 and curly_depth == 0:
start = i
curly_depth += 1
elif c == close_curly:
curly_depth -= 1
if square_depth == 1 and curly_depth == 0:
substr = utf8_str[start : i + 1].decode("utf8")
yield srsly.json_loads(substr)
start = -1
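A rough round-trip sketch for the helpers above. Illustrative only: it assumes an existing `nlp` pipeline whose Docs carry sentence boundaries (docs_to_json() iterates doc.sents), and the file name is a placeholder.

import srsly
from spacy.gold import docs_to_json, read_json_file

# `nlp` is assumed to set sentence boundaries (e.g. via a parser or sentencizer).
doc = nlp("Anna lives in Berlin.")
json_doc = docs_to_json([doc], doc_id=0)
srsly.write_json("corpus.json", [json_doc])        # json_iterate expects a top-level list
annotations = list(read_json_file("corpus.json"))  # one annotation dict per paragraph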

209
spacy/gold/iob_utils.py Normal file
View File

@ -0,0 +1,209 @@
import warnings
from ..errors import Errors, Warnings
from ..tokens import Span
def iob_to_biluo(tags):
out = []
tags = list(tags)
while tags:
out.extend(_consume_os(tags))
out.extend(_consume_ent(tags))
return out
def biluo_to_iob(tags):
out = []
for tag in tags:
if tag is None:
out.append(tag)
else:
tag = tag.replace("U-", "B-", 1).replace("L-", "I-", 1)
out.append(tag)
return out
def _consume_os(tags):
while tags and tags[0] == "O":
yield tags.pop(0)
def _consume_ent(tags):
if not tags:
return []
tag = tags.pop(0)
target_in = "I" + tag[1:]
target_last = "L" + tag[1:]
length = 1
while tags and tags[0] in {target_in, target_last}:
length += 1
tags.pop(0)
label = tag[2:]
if length == 1:
if len(label) == 0:
raise ValueError(Errors.E177.format(tag=tag))
return ["U-" + label]
else:
start = "B-" + label
end = "L-" + label
middle = [f"I-{label}" for _ in range(1, length - 1)]
return [start] + middle + [end]
def biluo_tags_from_doc(doc, missing="O"):
return biluo_tags_from_offsets(
doc,
[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents],
missing=missing,
)
def biluo_tags_from_offsets(doc, entities, missing="O"):
"""Encode labelled spans into per-token tags, using the
Begin/In/Last/Unit/Out scheme (BILUO).
doc (Doc): The document that the entity offsets refer to. The output tags
will refer to the token boundaries within the document.
entities (iterable): A sequence of `(start, end, label)` triples. `start`
and `end` should be character-offset integers denoting the slice into
the original string.
RETURNS (list): A list of unicode strings, describing the tags. Each tag
string will be of the form either "", "O" or "{action}-{label}", where
action is one of "B", "I", "L", "U". The string "-" is used where the
entity offsets don't align with the tokenization in the `Doc` object.
The training algorithm will view these as missing values. "O" denotes a
non-entity token. "B" denotes the beginning of a multi-token entity,
"I" the inside of an entity of three or more tokens, and "L" the end
of an entity of two or more tokens. "U" denotes a single-token entity.
EXAMPLE:
>>> text = 'I like London.'
>>> entities = [(len('I like '), len('I like London'), 'LOC')]
>>> doc = nlp.tokenizer(text)
>>> tags = biluo_tags_from_offsets(doc, entities)
>>> assert tags == ["O", "O", 'U-LOC', "O"]
"""
# Ensure no overlapping entity labels exist
tokens_in_ents = {}
starts = {token.idx: token.i for token in doc}
ends = {token.idx + len(token): token.i for token in doc}
biluo = ["-" for _ in doc]
# Handle entity cases
for start_char, end_char, label in entities:
if not label:
for s in starts: # account for many-to-one
if s >= start_char and s < end_char:
biluo[starts[s]] = "O"
else:
for token_index in range(start_char, end_char):
if token_index in tokens_in_ents.keys():
raise ValueError(
Errors.E103.format(
span1=(
tokens_in_ents[token_index][0],
tokens_in_ents[token_index][1],
tokens_in_ents[token_index][2],
),
span2=(start_char, end_char, label),
)
)
tokens_in_ents[token_index] = (start_char, end_char, label)
start_token = starts.get(start_char)
end_token = ends.get(end_char)
# Only interested if the tokenization is correct
if start_token is not None and end_token is not None:
if start_token == end_token:
biluo[start_token] = f"U-{label}"
else:
biluo[start_token] = f"B-{label}"
for i in range(start_token + 1, end_token):
biluo[i] = f"I-{label}"
biluo[end_token] = f"L-{label}"
# Now distinguish the O cases from ones where we miss the tokenization
entity_chars = set()
for start_char, end_char, label in entities:
for i in range(start_char, end_char):
entity_chars.add(i)
for token in doc:
for i in range(token.idx, token.idx + len(token)):
if i in entity_chars:
break
else:
biluo[token.i] = missing
if "-" in biluo and missing != "-":
ent_str = str(entities)
warnings.warn(
Warnings.W030.format(
text=doc.text[:50] + "..." if len(doc.text) > 50 else doc.text,
entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str,
)
)
return biluo
def spans_from_biluo_tags(doc, tags):
"""Encode per-token tags following the BILUO scheme into Span object, e.g.
to overwrite the doc.ents.
doc (Doc): The document that the BILUO tags refer to.
entities (iterable): A sequence of BILUO tags with each tag describing one
token. Each tags string will be of the form of either "", "O" or
"{action}-{label}", where action is one of "B", "I", "L", "U".
RETURNS (list): A sequence of Span objects.
"""
token_offsets = tags_to_entities(tags)
spans = []
for label, start_idx, end_idx in token_offsets:
span = Span(doc, start_idx, end_idx + 1, label=label)
spans.append(span)
return spans
def offsets_from_biluo_tags(doc, tags):
"""Encode per-token tags following the BILUO scheme into entity offsets.
doc (Doc): The document that the BILUO tags refer to.
entities (iterable): A sequence of BILUO tags with each tag describing one
token. Each tags string will be of the form of either "", "O" or
"{action}-{label}", where action is one of "B", "I", "L", "U".
RETURNS (list): A sequence of `(start, end, label)` triples. `start` and
`end` will be character-offset integers denoting the slice into the
original string.
"""
spans = spans_from_biluo_tags(doc, tags)
return [(span.start_char, span.end_char, span.label_) for span in spans]
def tags_to_entities(tags):
""" Note that the end index returned by this function is inclusive.
To use it for Span creation, increment the end by 1."""
entities = []
start = None
for i, tag in enumerate(tags):
if tag is None:
continue
if tag.startswith("O"):
# TODO: We shouldn't be getting these malformed inputs. Fix this.
if start is not None:
start = None
else:
entities.append(("", i, i))
continue
elif tag == "-":
continue
elif tag.startswith("I"):
if start is None:
raise ValueError(Errors.E067.format(tags=tags[: i + 1]))
continue
if tag.startswith("U"):
entities.append((tag[2:], i, i))
elif tag.startswith("B"):
start = i
elif tag.startswith("L"):
entities.append((tag[2:], start, i))
start = None
else:
raise ValueError(Errors.E068.format(tag=tag))
return entities
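A small worked example of the two helpers the converters rely on; note that tags_to_entities() returns inclusive end indices, while Span expects an exclusive end.

from spacy.gold import iob_to_biluo, tags_to_entities

iob = ["B-LOC", "I-LOC", "B-GPE"]
biluo = iob_to_biluo(iob)           # ["B-LOC", "L-LOC", "U-GPE"]
entities = tags_to_entities(biluo)  # [("LOC", 0, 1), ("GPE", 2, 2)]
# To build spans: Span(doc, start, end + 1, label=label) for each triple.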

View File

@ -446,6 +446,8 @@ cdef class Writer:
             assert not path.isdir(loc), f"{loc} is directory"
         if isinstance(loc, Path):
             loc = bytes(loc)
+        if path.exists(loc):
+            assert not path.isdir(loc), "%s is directory." % loc
         cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
         self._fp = fopen(<char*>bytes_loc, 'wb')
         if not self._fp:
@ -487,10 +489,10 @@ cdef class Writer:
 cdef class Reader:
     def __init__(self, object loc):
-        assert path.exists(loc)
-        assert not path.isdir(loc)
         if isinstance(loc, Path):
             loc = bytes(loc)
+        assert path.exists(loc)
+        assert not path.isdir(loc)
         cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
         self._fp = fopen(<char*>bytes_loc, 'rb')
         if not self._fp:

View File

@ -20,29 +20,25 @@ def noun_chunks(doclike):
     conj = doc.vocab.strings.add("conj")
     nmod = doc.vocab.strings.add("nmod")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
+    prev_end = -1
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
             flag = False
             if word.pos == NOUN:
                 # check for patterns such as γραμμή παραγωγής
                 for potential_nmod in word.rights:
                     if potential_nmod.dep == nmod:
-                        seen.update(
-                            j for j in range(word.left_edge.i, potential_nmod.i + 1)
-                        )
+                        prev_end = potential_nmod.i
                         yield word.left_edge.i, potential_nmod.i + 1, np_label
                         flag = True
                         break
             if flag is False:
-                seen.update(j for j in range(word.left_edge.i, word.i + 1))
+                prev_end = word.i
                 yield word.left_edge.i, word.i + 1, np_label
         elif word.dep == conj:
             # covers the case: έχει όμορφα και έξυπνα παιδιά
@ -51,9 +47,7 @@ def noun_chunks(doclike):
             head = head.head
         # If the head is an NP, and we're coordinated to it, we're an NP
         if head.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.i + 1))
+            prev_end = word.i
             yield word.left_edge.i, word.i + 1, np_label

View File

@ -25,17 +25,15 @@ def noun_chunks(doclike):
     np_deps = [doc.vocab.strings.add(label) for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
+    prev_end = -1
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.i + 1))
+            prev_end = word.i
             yield word.left_edge.i, word.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@ -43,9 +41,7 @@ def noun_chunks(doclike):
             head = head.head
         # If the head is an NP, and we're coordinated to it, we're an NP
         if head.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.i + 1))
+            prev_end = word.i
             yield word.left_edge.i, word.i + 1, np_label

View File

@ -136,7 +136,19 @@ for pron in ["he", "she", "it"]:

 # W-words, relative pronouns, prepositions etc.

-for word in ["who", "what", "when", "where", "why", "how", "there", "that"]:
+for word in [
+    "who",
+    "what",
+    "when",
+    "where",
+    "why",
+    "how",
+    "there",
+    "that",
+    "this",
+    "these",
+    "those",
+]:
     for orth in [word, word.title()]:
         _exc[orth + "'s"] = [
             {ORTH: orth, LEMMA: word, NORM: word},
@ -396,6 +408,8 @@ _other_exc = {
         {ORTH: "Let", LEMMA: "let", NORM: "let"},
         {ORTH: "'s", LEMMA: PRON_LEMMA, NORM: "us"},
     ],
+    "c'mon": [{ORTH: "c'm", NORM: "come", LEMMA: "come"}, {ORTH: "on"}],
+    "C'mon": [{ORTH: "C'm", NORM: "come", LEMMA: "come"}, {ORTH: "on"}],
 }

 _exc.update(_other_exc)

View File

@ -14,5 +14,9 @@ sentences = [
     "El gato come pescado.",
     "Veo al hombre con el telescopio.",
     "La araña come moscas.",
-    "El pingüino incuba en su nido.",
+    "El pingüino incuba en su nido sobre el hielo.",
+    "¿Dónde estais?",
+    "¿Quién es el presidente Francés?",
+    "¿Dónde está encuentra la capital de Argentina?",
+    "¿Cuándo nació José de San Martín?",
 ]

View File

@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES
 from ..char_classes import LIST_ICONS, CURRENCY, LIST_UNITS, PUNCT
 from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA


@@ -7,8 +7,12 @@ _exc = {
 for exc_data in [
+    {ORTH: "n°", LEMMA: "número"},
+    {ORTH: "°C", LEMMA: "grados Celcius"},
     {ORTH: "aprox.", LEMMA: "aproximadamente"},
     {ORTH: "dna.", LEMMA: "docena"},
+    {ORTH: "dpto.", LEMMA: "departamento"},
+    {ORTH: "ej.", LEMMA: "ejemplo"},
     {ORTH: "esq.", LEMMA: "esquina"},
     {ORTH: "pág.", LEMMA: "página"},
     {ORTH: "p.ej.", LEMMA: "por ejemplo"},
@@ -16,6 +20,7 @@ for exc_data in [
     {ORTH: "Vd.", LEMMA: PRON_LEMMA, NORM: "usted"},
     {ORTH: "Uds.", LEMMA: PRON_LEMMA, NORM: "ustedes"},
     {ORTH: "Vds.", LEMMA: PRON_LEMMA, NORM: "ustedes"},
+    {ORTH: "vol.", NORM: "volúmen"},
 ]:
     _exc[exc_data[ORTH]] = [exc_data]
@@ -35,10 +40,14 @@ for h in range(1, 12 + 1):
 for orth in [
     "a.C.",
     "a.J.C.",
+    "d.C.",
+    "d.J.C.",
     "apdo.",
     "Av.",
     "Avda.",
     "Cía.",
+    "Dr.",
+    "Dra.",
     "EE.UU.",
     "etc.",
     "fig.",
@@ -54,9 +63,9 @@ for orth in [
     "Prof.",
     "Profa.",
     "q.e.p.d.",
-    "S.A.",
+    "Q.E.P.D." "S.A.",
     "S.L.",
-    "s.s.s.",
+    "S.R.L." "s.s.s.",
     "Sr.",
     "Sra.",
     "Srta.",


@@ -25,17 +25,15 @@ def noun_chunks(doclike):
     np_deps = [doc.vocab.strings.add(label) for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
+    prev_end = -1
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.i + 1))
+            prev_end = word.i
             yield word.left_edge.i, word.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -43,9 +41,7 @@ def noun_chunks(doclike):
             head = head.head
             # If the head is an NP, and we're coordinated to it, we're an NP
             if head.dep in np_deps:
-                if any(w.i in seen for w in word.subtree):
-                    continue
-                seen.update(j for j in range(word.left_edge.i, word.i + 1))
+                prev_end = word.i
                 yield word.left_edge.i, word.i + 1, np_label


@@ -531,7 +531,6 @@ FR_BASE_EXCEPTIONS = [
     "Beaumont-Hamel",
     "Beaumont-Louestault",
     "Beaumont-Monteux",
-    "Beaumont-Pied-de-Bœuf",
     "Beaumont-Pied-de-Bœuf",
     "Beaumont-Sardolles",
     "Beaumont-Village",
@@ -948,7 +947,7 @@ FR_BASE_EXCEPTIONS = [
     "Buxières-sous-les-Côtes",
     "Buzy-Darmont",
     "Byhleguhre-Byhlen",
     "Bœurs-en-Othe",
     "Bâle-Campagne",
     "Bâle-Ville",
     "Béard-Géovreissiat",
@@ -1586,11 +1585,11 @@ FR_BASE_EXCEPTIONS = [
     "Cruci-Falgardiens",
     "Cruquius-Oost",
     "Cruviers-Lascours",
     "Crèvecœur-en-Auge",
     "Crèvecœur-en-Brie",
     "Crèvecœur-le-Grand",
     "Crèvecœur-le-Petit",
     "Crèvecœur-sur-l'Escaut",
     "Crécy-Couvé",
     "Créon-d'Armagnac",
     "Cubjac-Auvézère-Val-d'Ans",
@@ -1616,7 +1615,7 @@ FR_BASE_EXCEPTIONS = [
     "Cuxac-Cabardès",
     "Cuxac-d'Aude",
     "Cuyk-Sainte-Agathe",
     "Cœuvres-et-Valsery",
     "Céaux-d'Allègre",
     "Céleste-Empire",
     "Cénac-et-Saint-Julien",
@@ -1679,7 +1678,7 @@ FR_BASE_EXCEPTIONS = [
     "Devrai-Gondragnières",
     "Dhuys et Morin-en-Brie",
     "Diane-Capelle",
     "Dieffenbach-lès-Wœrth",
     "Diekhusen-Fahrstedt",
     "Diennes-Aubigny",
     "Diensdorf-Radlow",
@@ -1752,7 +1751,7 @@ FR_BASE_EXCEPTIONS = [
     "Durdat-Larequille",
     "Durfort-Lacapelette",
     "Durfort-et-Saint-Martin-de-Sossenac",
     "Dœuil-sur-le-Mignon",
     "Dão-Lafões",
     "Débats-Rivière-d'Orpra",
     "Décines-Charpieu",
@@ -2687,8 +2686,8 @@ FR_BASE_EXCEPTIONS = [
     "Kuhlen-Wendorf",
     "KwaZulu-Natal",
     "Kyzyl-Arvat",
     "Kœur-la-Grande",
     "Kœur-la-Petite",
     "Kölln-Reisiek",
     "Königsbach-Stein",
     "Königshain-Wiederau",
@@ -4024,7 +4023,7 @@ FR_BASE_EXCEPTIONS = [
     "Marcilly-d'Azergues",
     "Marcillé-Raoul",
     "Marcillé-Robert",
     "Marcq-en-Barœul",
     "Marcy-l'Etoile",
     "Marcy-l'Étoile",
     "Mareil-Marly",
@@ -4258,7 +4257,7 @@ FR_BASE_EXCEPTIONS = [
     "Monlezun-d'Armagnac",
     "Monléon-Magnoac",
     "Monnetier-Mornex",
     "Mons-en-Barœul",
     "Monsempron-Libos",
     "Monsteroux-Milieu",
     "Montacher-Villegardin",
@@ -4348,7 +4347,7 @@ FR_BASE_EXCEPTIONS = [
     "Mornay-Berry",
     "Mortain-Bocage",
     "Morteaux-Couliboeuf",
-    "Morteaux-Coulibœuf",
     "Morteaux-Coulibœuf",
     "Mortes-Frontières",
     "Mory-Montcrux",
@@ -4391,7 +4390,7 @@ FR_BASE_EXCEPTIONS = [
     "Muncq-Nieurlet",
     "Murtin-Bogny",
     "Murtin-et-le-Châtelet",
     "Mœurs-Verdey",
     "Ménestérol-Montignac",
     "Ménil'muche",
     "Ménil-Annelles",
@@ -4612,7 +4611,7 @@ FR_BASE_EXCEPTIONS = [
     "Neuves-Maisons",
     "Neuvic-Entier",
     "Neuvicq-Montguyon",
     "Neuville-lès-Lœuilly",
     "Neuvy-Bouin",
     "Neuvy-Deux-Clochers",
     "Neuvy-Grandchamp",
@@ -4773,8 +4772,8 @@ FR_BASE_EXCEPTIONS = [
     "Nuncq-Hautecôte",
     "Nurieux-Volognat",
     "Nuthe-Urstromtal",
     "Nœux-les-Mines",
     "Nœux-lès-Auxi",
     "Nâves-Parmelan",
     "Nézignan-l'Evêque",
     "Nézignan-l'Évêque",
@@ -5343,7 +5342,7 @@ FR_BASE_EXCEPTIONS = [
     "Quincy-Voisins",
     "Quincy-sous-le-Mont",
     "Quint-Fonsegrives",
-    "Quœux-Haut-Maînil",
     "Quœux-Haut-Maînil",
     "Qwa-Qwa",
     "R.-V.",
@@ -5631,12 +5630,12 @@ FR_BASE_EXCEPTIONS = [
     "Saint Aulaye-Puymangou",
     "Saint Geniez d'Olt et d'Aubrac",
     "Saint Martin de l'If",
     "Saint-Denœux",
     "Saint-Jean-de-Bœuf",
     "Saint-Martin-le-Nœud",
     "Saint-Michel-Tubœuf",
     "Saint-Paul - Flaugnac",
     "Saint-Pierre-de-Bœuf",
     "Saint-Thegonnec Loc-Eguiner",
     "Sainte-Alvère-Saint-Laurent Les Bâtons",
     "Salignac-Eyvignes",
@@ -6208,7 +6207,7 @@ FR_BASE_EXCEPTIONS = [
     "Tite-Live",
     "Titisee-Neustadt",
     "Tobel-Tägerschen",
     "Togny-aux-Bœufs",
     "Tongre-Notre-Dame",
     "Tonnay-Boutonne",
     "Tonnay-Charente",
@@ -6336,7 +6335,7 @@ FR_BASE_EXCEPTIONS = [
     "Vals-près-le-Puy",
     "Valverde-Enrique",
     "Valzin-en-Petite-Montagne",
     "Vandœuvre-lès-Nancy",
     "Varces-Allières-et-Risset",
     "Varenne-l'Arconce",
     "Varenne-sur-le-Doubs",
@@ -6457,9 +6456,9 @@ FR_BASE_EXCEPTIONS = [
     "Villenave-d'Ornon",
     "Villequier-Aumont",
     "Villerouge-Termenès",
     "Villers-aux-Nœuds",
     "Villez-sur-le-Neubourg",
     "Villiers-en-Désœuvre",
     "Villieu-Loyes-Mollon",
     "Villingen-Schwenningen",
     "Villié-Morgon",
@@ -6467,7 +6466,7 @@ FR_BASE_EXCEPTIONS = [
     "Vilosnes-Haraumont",
     "Vilters-Wangs",
     "Vincent-Froideville",
-    "Vincy-Manœuvre",
     "Vincy-Manœuvre",
     "Vincy-Reuil-et-Magny",
     "Vindrac-Alayrac",
@@ -6511,8 +6510,8 @@ FR_BASE_EXCEPTIONS = [
     "Vrigne-Meusiens",
     "Vrijhoeve-Capelle",
     "Vuisternens-devant-Romont",
     "Vœlfling-lès-Bouzonville",
     "Vœuil-et-Giget",
     "Vélez-Blanco",
     "Vélez-Málaga",
     "Vélez-Rubio",
@@ -6615,7 +6614,7 @@ FR_BASE_EXCEPTIONS = [
     "Wust-Fischbeck",
     "Wutha-Farnroda",
     "Wy-dit-Joli-Village",
     "Wœlfling-lès-Sarreguemines",
     "Wünnewil-Flamatt",
     "X-SAMPA",
     "X-arbre",


@@ -24,17 +24,15 @@ def noun_chunks(doclike):
     np_deps = [doc.vocab.strings[label] for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
+    prev_end = -1
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
            continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -42,9 +40,7 @@ def noun_chunks(doclike):
             head = head.head
             # If the head is an NP, and we're coordinated to it, we're an NP
             if head.dep in np_deps:
-                if any(w.i in seen for w in word.subtree):
-                    continue
-                seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+                prev_end = word.right_edge.i
                 yield word.left_edge.i, word.right_edge.i + 1, np_label


@@ -1,7 +1,6 @@
 import re

 from .punctuation import ELISION, HYPHENS
-from ..tokenizer_exceptions import URL_PATTERN
 from ..char_classes import ALPHA_LOWER, ALPHA
 from ...symbols import ORTH, LEMMA
@@ -452,9 +451,6 @@ _regular_exp += [
     for hc in _hyphen_combination
 ]

-# URLs
-_regular_exp.append(URL_PATTERN)
-
 TOKENIZER_EXCEPTIONS = _exc
 TOKEN_MATCH = re.compile(


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from .stop_words import STOP_WORDS

 from ...language import Language


@@ -1,7 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 """
 Example sentences to test spaCy and its language models.


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 STOP_WORDS = set(
     """
 એમ


@@ -7,7 +7,6 @@ _concat_icons = CONCAT_ICONS.replace("\u00B0", "")

 _currency = r"\$¢£€¥฿"
 _quotes = CONCAT_QUOTES.replace("'", "")
-_units = UNITS.replace("%", "")

 _prefixes = (
     LIST_PUNCT
@@ -18,7 +17,8 @@ _prefixes = (
 )

 _suffixes = (
-    LIST_PUNCT
+    [r"\+"]
+    + LIST_PUNCT
     + LIST_ELLIPSES
     + LIST_QUOTES
     + [_concat_icons]
@@ -26,7 +26,7 @@ _suffixes = (
         r"(?<=[0-9])\+",
         r"(?<=°[FfCcKk])\.",
         r"(?<=[0-9])(?:[{c}])".format(c=_currency),
-        r"(?<=[0-9])(?:{u})".format(u=_units),
+        r"(?<=[0-9])(?:{u})".format(u=UNITS),
         r"(?<=[{al}{e}{q}(?:{c})])\.".format(
             al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, c=_currency
         ),
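Dropping the locally patched `_units` means the stock `UNITS` character class now feeds the number-suffix rule directly. A rough standalone check of just that regex (a sketch assuming `UNITS` can be imported from `spacy.lang.char_classes`; not the full Hungarian tokenizer):

import re
from spacy.lang.char_classes import UNITS

suffix_re = re.compile(r"(?<=[0-9])(?:{u})$".format(u=UNITS))
# if "%" is part of UNITS, both of these should report a match
print(bool(suffix_re.search("100%")), bool(suffix_re.search("100km")))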


@@ -1,7 +1,6 @@
 import re

 from ..punctuation import ALPHA_LOWER, CURRENCY
-from ..tokenizer_exceptions import URL_PATTERN
 from ...symbols import ORTH
@@ -646,4 +645,4 @@ _nums = r"(({ne})|({t})|({on})|({c}))({s})?".format(

 TOKENIZER_EXCEPTIONS = _exc
-TOKEN_MATCH = re.compile(r"^({u})|({n})$".format(u=URL_PATTERN, n=_nums)).match
+TOKEN_MATCH = re.compile(r"^{n}$".format(n=_nums)).match


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 """
 Example sentences to test spaCy and its language models.
 >>> from spacy.lang.hy.examples import sentences


@@ -1,12 +1,9 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from ...attrs import LIKE_NUM

 _num_words = [
-    "զրօ",
-    "մէկ",
+    "զրո",
+    "մեկ",
     "երկու",
     "երեք",
     "չորս",
@@ -28,10 +25,10 @@ _num_words = [
     "քսան" "երեսուն",
     "քառասուն",
     "հիսուն",
-    "վաթցսուն",
+    "վաթսուն",
     "յոթանասուն",
     "ութսուն",
-    "ինիսուն",
+    "իննսուն",
     "հարյուր",
     "հազար",
     "միլիոն",


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 STOP_WORDS = set(
     """
 նա


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from ...symbols import POS, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN
 from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ


@@ -24,17 +24,15 @@ def noun_chunks(doclike):
     np_deps = [doc.vocab.strings[label] for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
+    prev_end = -1
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -42,9 +40,7 @@ def noun_chunks(doclike):
             head = head.head
             # If the head is an NP, and we're coordinated to it, we're an NP
             if head.dep in np_deps:
-                if any(w.i in seen for w in word.subtree):
-                    continue
-                seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+                prev_end = word.right_edge.i
                 yield word.left_edge.i, word.right_edge.i + 1, np_label


@@ -1,111 +1,266 @@
-import re
-from collections import namedtuple
+import srsly
+from collections import namedtuple, OrderedDict

 from .stop_words import STOP_WORDS
+from .syntax_iterators import SYNTAX_ITERATORS
 from .tag_map import TAG_MAP
+from .tag_orth_map import TAG_ORTH_MAP
+from .tag_bigram_map import TAG_BIGRAM_MAP
 from ...attrs import LANG
-from ...language import Language
-from ...tokens import Doc
 from ...compat import copy_reg
+from ...errors import Errors
+from ...language import Language
+from ...symbols import POS
+from ...tokens import Doc
 from ...util import DummyTokenizer
+from ... import util
+
+# Hold the attributes we need with convenient names
+DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])

 # Handling for multiple spaces in a row is somewhat awkward, this simplifies
 # the flow by creating a dummy with the same interface.
-DummyNode = namedtuple("DummyNode", ["surface", "pos", "feature"])
-DummyNodeFeatures = namedtuple("DummyNodeFeatures", ["lemma"])
-DummySpace = DummyNode(" ", " ", DummyNodeFeatures(" "))
+DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
+DummySpace = DummyNode(" ", " ", " ")


-def try_fugashi_import():
-    """Fugashi is required for Japanese support, so check for it.
-    It it's not available blow up and explain how to fix it."""
+def try_sudachi_import(split_mode="A"):
+    """SudachiPy is required for Japanese support, so check for it.
+    It it's not available blow up and explain how to fix it.
+    split_mode should be one of these values: "A", "B", "C", None->"A"."""
     try:
-        import fugashi
-
-        return fugashi
+        from sudachipy import dictionary, tokenizer
+
+        split_mode = {
+            None: tokenizer.Tokenizer.SplitMode.A,
+            "A": tokenizer.Tokenizer.SplitMode.A,
+            "B": tokenizer.Tokenizer.SplitMode.B,
+            "C": tokenizer.Tokenizer.SplitMode.C,
+        }[split_mode]
+        tok = dictionary.Dictionary().create(mode=split_mode)
+        return tok
     except ImportError:
         raise ImportError(
-            "Japanese support requires Fugashi: " "https://github.com/polm/fugashi"
+            "Japanese support requires SudachiPy and SudachiDict-core "
+            "(https://github.com/WorksApplications/SudachiPy). "
+            "Install with `pip install sudachipy sudachidict_core` or "
+            "install spaCy with `pip install spacy[ja]`."
         )


-def resolve_pos(token):
+def resolve_pos(orth, pos, next_pos):
     """If necessary, add a field to the POS tag for UD mapping.
     Under Universal Dependencies, sometimes the same Unidic POS tag can
     be mapped differently depending on the literal token or its context
-    in the sentence. This function adds information to the POS tag to
-    resolve ambiguous mappings.
+    in the sentence. This function returns resolved POSs for both token
+    and next_token by tuple.
     """
-    # this is only used for consecutive ascii spaces
-    if token.surface == " ":
-        return "空白"
-
-    # TODO: This is a first take. The rules here are crude approximations.
-    # For many of these, full dependencies are needed to properly resolve
-    # PoS mappings.
-    if token.pos == "連体詞,*,*,*":
-        if re.match(r"[こそあど此其彼]の", token.surface):
-            return token.pos + ",DET"
-        if re.match(r"[こそあど此其彼]", token.surface):
-            return token.pos + ",PRON"
-        return token.pos + ",ADJ"
-    return token.pos
+    # Some tokens have their UD tag decided based on the POS of the following
+    # token.
+
+    # orth based rules
+    if pos[0] in TAG_ORTH_MAP:
+        orth_map = TAG_ORTH_MAP[pos[0]]
+        if orth in orth_map:
+            return orth_map[orth], None
+
+    # tag bi-gram mapping
+    if next_pos:
+        tag_bigram = pos[0], next_pos[0]
+        if tag_bigram in TAG_BIGRAM_MAP:
+            bipos = TAG_BIGRAM_MAP[tag_bigram]
+            if bipos[0] is None:
+                return TAG_MAP[pos[0]][POS], bipos[1]
+            else:
+                return bipos
+
+    return TAG_MAP[pos[0]][POS], None


-def get_words_and_spaces(tokenizer, text):
-    """Get the individual tokens that make up the sentence and handle white space.
-    Japanese doesn't usually use white space, and MeCab's handling of it for
-    multiple spaces in a row is somewhat awkward.
+# Use a mapping of paired punctuation to avoid splitting quoted sentences.
+pairpunct = {"「": "」", "『": "』", "【": "】"}
+
+
+def separate_sentences(doc):
+    """Given a doc, mark tokens that start sentences based on Unidic tags.
     """
-    tokens = tokenizer.parseToNodeList(text)
+    stack = []  # save paired punctuation
+    for i, token in enumerate(doc[:-2]):
+        # Set all tokens after the first to false by default. This is necessary
+        # for the doc code to be aware we've done sentencization, see
+        # `is_sentenced`.
+        token.sent_start = i == 0
+        if token.tag_:
+            if token.tag_ == "補助記号-括弧開":
+                ts = str(token)
+                if ts in pairpunct:
+                    stack.append(pairpunct[ts])
+                elif stack and ts == stack[-1]:
+                    stack.pop()
+            if token.tag_ == "補助記号-句点":
+                next_token = doc[i + 1]
+                if next_token.tag_ != token.tag_ and not stack:
+                    next_token.sent_start = True
+
+
+def get_dtokens(tokenizer, text):
+    tokens = tokenizer.tokenize(text)

     words = []
-    spaces = []
-    for token in tokens:
-        # If there's more than one space, spaces after the first become tokens
-        for ii in range(len(token.white_space) - 1):
-            words.append(DummySpace)
-            spaces.append(False)
-
-        words.append(token)
-        spaces.append(bool(token.white_space))
-    return words, spaces
+    for ti, token in enumerate(tokens):
+        tag = "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"])
+        inf = "-".join([xx for xx in token.part_of_speech()[4:] if xx != "*"])
+        dtoken = DetailedToken(token.surface(), (tag, inf), token.dictionary_form())
+        if ti > 0 and words[-1].pos[0] == "空白" and tag == "空白":
+            # don't add multiple space tokens in a row
+            continue
+        words.append(dtoken)
+
+    # remove empty tokens. These can be produced with characters like … that
+    # Sudachi normalizes internally.
+    words = [ww for ww in words if len(ww.surface) > 0]
+    return words
+
+
+def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
+    words = [x.surface for x in dtokens]
+    if "".join("".join(words).split()) != "".join(text.split()):
+        raise ValueError(Errors.E194.format(text=text, words=words))
+    text_words = []
+    text_lemmas = []
+    text_tags = []
+    text_spaces = []
+    text_pos = 0
+    # handle empty and whitespace-only texts
+    if len(words) == 0:
+        return text_words, text_lemmas, text_tags, text_spaces
+    elif len([word for word in words if not word.isspace()]) == 0:
+        assert text.isspace()
+        text_words = [text]
+        text_lemmas = [text]
+        text_tags = [gap_tag]
+        text_spaces = [False]
+        return text_words, text_lemmas, text_tags, text_spaces
+    # normalize words to remove all whitespace tokens
+    norm_words, norm_dtokens = zip(
+        *[
+            (word, dtokens)
+            for word, dtokens in zip(words, dtokens)
+            if not word.isspace()
+        ]
+    )
+    # align words with text
+    for word, dtoken in zip(norm_words, norm_dtokens):
+        try:
+            word_start = text[text_pos:].index(word)
+        except ValueError:
+            raise ValueError(Errors.E194.format(text=text, words=words))
+        if word_start > 0:
+            w = text[text_pos : text_pos + word_start]
+            text_words.append(w)
+            text_lemmas.append(w)
+            text_tags.append(gap_tag)
+            text_spaces.append(False)
+            text_pos += word_start
+        text_words.append(word)
+        text_lemmas.append(dtoken.lemma)
+        text_tags.append(dtoken.pos)
+        text_spaces.append(False)
+        text_pos += len(word)
+        if text_pos < len(text) and text[text_pos] == " ":
+            text_spaces[-1] = True
+            text_pos += 1
+    if text_pos < len(text):
+        w = text[text_pos:]
+        text_words.append(w)
+        text_lemmas.append(w)
+        text_tags.append(gap_tag)
+        text_spaces.append(False)
+    return text_words, text_lemmas, text_tags, text_spaces


 class JapaneseTokenizer(DummyTokenizer):
-    def __init__(self, cls, nlp=None):
+    def __init__(self, cls, nlp=None, config={}):
         self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-        self.tokenizer = try_fugashi_import().Tagger()
-        self.tokenizer.parseToNodeList("")  # see #2901
+        self.split_mode = config.get("split_mode", None)
+        self.tokenizer = try_sudachi_import(self.split_mode)

     def __call__(self, text):
-        dtokens, spaces = get_words_and_spaces(self.tokenizer, text)
-        words = [x.surface for x in dtokens]
+        dtokens = get_dtokens(self.tokenizer, text)
+
+        words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
         doc = Doc(self.vocab, words=words, spaces=spaces)
-        unidic_tags = []
-        for token, dtoken in zip(doc, dtokens):
-            unidic_tags.append(dtoken.pos)
-            token.tag_ = resolve_pos(dtoken)
+        next_pos = None
+        for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
+            token.tag_ = unidic_tag[0]
+            if next_pos:
+                token.pos = next_pos
+                next_pos = None
+            else:
+                token.pos, next_pos = resolve_pos(
+                    token.orth_,
+                    unidic_tag,
+                    unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None,
+                )
+
             # if there's no lemma info (it's an unk) just use the surface
-            token.lemma_ = dtoken.feature.lemma or dtoken.surface
+            token.lemma_ = lemma
         doc.user_data["unidic_tags"] = unidic_tags
+
         return doc

+    def _get_config(self):
+        config = OrderedDict((("split_mode", self.split_mode),))
+        return config
+
+    def _set_config(self, config={}):
+        self.split_mode = config.get("split_mode", None)
+
+    def to_bytes(self, **kwargs):
+        serializers = OrderedDict(
+            (("cfg", lambda: srsly.json_dumps(self._get_config())),)
+        )
+        return util.to_bytes(serializers, [])
+
+    def from_bytes(self, data, **kwargs):
+        deserializers = OrderedDict(
+            (("cfg", lambda b: self._set_config(srsly.json_loads(b))),)
+        )
+        util.from_bytes(data, deserializers, [])
+        self.tokenizer = try_sudachi_import(self.split_mode)
+        return self
+
+    def to_disk(self, path, **kwargs):
+        path = util.ensure_path(path)
+        serializers = OrderedDict(
+            (("cfg", lambda p: srsly.write_json(p, self._get_config())),)
+        )
+        return util.to_disk(path, serializers, [])
+
+    def from_disk(self, path, **kwargs):
+        path = util.ensure_path(path)
+        serializers = OrderedDict(
+            (("cfg", lambda p: self._set_config(srsly.read_json(p))),)
+        )
+        util.from_disk(path, serializers, [])
+        self.tokenizer = try_sudachi_import(self.split_mode)
+

 class JapaneseDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda _text: "ja"
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
+    syntax_iterators = SYNTAX_ITERATORS
     writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}

     @classmethod
-    def create_tokenizer(cls, nlp=None):
-        return JapaneseTokenizer(cls, nlp)
+    def create_tokenizer(cls, nlp=None, config={}):
+        return JapaneseTokenizer(cls, nlp, config)


 class Japanese(Language):
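The tokenizer now delegates segmentation to SudachiPy, and the `split_mode` config value selects how aggressively compounds are split. A small hedged sketch of what `try_sudachi_import` sets up (assumes `sudachipy` and `sudachidict_core` are installed, as the error message above instructs); A is the finest segmentation, C the coarsest:

from sudachipy import dictionary, tokenizer

tok = dictionary.Dictionary().create()
text = "選挙管理委員会"
for name in ("A", "B", "C"):
    mode = getattr(tokenizer.Tokenizer.SplitMode, name)
    print(name, [m.surface() for m in tok.tokenize(text, mode)])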

spacy/lang/ja/bunsetu.py (new file)

@@ -0,0 +1,176 @@
POS_PHRASE_MAP = {
"NOUN": "NP",
"NUM": "NP",
"PRON": "NP",
"PROPN": "NP",
"VERB": "VP",
"ADJ": "ADJP",
"ADV": "ADVP",
"CCONJ": "CCONJP",
}
# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
def yield_bunsetu(doc, debug=False):
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
prev = None
prev_tag = None
prev_dep = None
prev_head = None
for t in doc:
pos = t.pos_
pos_type = POS_PHRASE_MAP.get(pos, None)
tag = t.tag_
dep = t.dep_
head = t.head.i
if debug:
print(
t.i,
t.orth_,
pos,
pos_type,
dep,
head,
bunsetu_may_end,
phrase_type,
phrase,
bunsetu,
)
# DET is always an individual bunsetu
if pos == "DET":
if bunsetu:
yield bunsetu, phrase_type, phrase
yield [t], None, None
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
# PRON or Open PUNCT always splits bunsetu
elif tag == "補助記号-括弧開":
if bunsetu:
yield bunsetu, phrase_type, phrase
bunsetu = [t]
bunsetu_may_end = True
phrase_type = None
phrase = None
# bunsetu head not appeared
elif phrase_type is None:
if bunsetu and prev_tag == "補助記号-読点":
yield bunsetu, phrase_type, phrase
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
bunsetu.append(t)
if pos_type: # begin phrase
phrase = [t]
phrase_type = pos_type
if pos_type in {"ADVP", "CCONJP"}:
bunsetu_may_end = True
# entering new bunsetu
elif pos_type and (
pos_type != phrase_type
or bunsetu_may_end # different phrase type arises # same phrase type but bunsetu already ended
):
# exceptional case: NOUN to VERB
if (
phrase_type == "NP"
and pos_type == "VP"
and prev_dep == "compound"
and prev_head == t.i
):
bunsetu.append(t)
phrase_type = "VP"
phrase.append(t)
# exceptional case: VERB to NOUN
elif (
phrase_type == "VP"
and pos_type == "NP"
and (
prev_dep == "compound"
and prev_head == t.i
or dep == "compound"
and prev == head
or prev_dep == "nmod"
and prev_head == t.i
)
):
bunsetu.append(t)
phrase_type = "NP"
phrase.append(t)
else:
yield bunsetu, phrase_type, phrase
bunsetu = [t]
bunsetu_may_end = False
phrase_type = pos_type
phrase = [t]
# NOUN bunsetu
elif phrase_type == "NP":
bunsetu.append(t)
if not bunsetu_may_end and (
(
(pos_type == "NP" or pos == "SYM")
and (prev_head == t.i or prev_head == head)
and prev_dep in {"compound", "nummod"}
)
or (
pos == "PART"
and (prev == head or prev_head == head)
and dep == "mark"
)
):
phrase.append(t)
else:
bunsetu_may_end = True
# VERB bunsetu
elif phrase_type == "VP":
bunsetu.append(t)
if (
not bunsetu_may_end
and pos == "VERB"
and prev_head == t.i
and prev_dep == "compound"
):
phrase.append(t)
else:
bunsetu_may_end = True
# ADJ bunsetu
elif phrase_type == "ADJP" and tag != "連体詞":
bunsetu.append(t)
if not bunsetu_may_end and (
(
pos == "NOUN"
and (prev_head == t.i or prev_head == head)
and prev_dep in {"amod", "compound"}
)
or (
pos == "PART"
and (prev == head or prev_head == head)
and dep == "mark"
)
):
phrase.append(t)
else:
bunsetu_may_end = True
# other bunsetu
else:
bunsetu.append(t)
prev = t.i
prev_tag = t.tag_
prev_dep = t.dep_
prev_head = head
if bunsetu:
yield bunsetu, phrase_type, phrase
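`yield_bunsetu` walks an already parsed `Doc`, so it needs POS tags and dependencies. A hedged usage sketch (assumes a trained Japanese pipeline such as ja_core_news_sm is installed; the output shape follows the return-value comment at the top of the file):

import spacy
from spacy.lang.ja.bunsetu import yield_bunsetu

nlp = spacy.load("ja_core_news_sm")
doc = nlp("私は本を読みました。")
for tokens, phrase_type, phrase in yield_bunsetu(doc):
    print([t.orth_ for t in tokens], phrase_type)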


@@ -0,0 +1,54 @@
from ...symbols import NOUN, PROPN, PRON, VERB
# XXX this can probably be pruned a bit
labels = [
"nsubj",
"nmod",
"dobj",
"nsubjpass",
"pcomp",
"pobj",
"obj",
"obl",
"dative",
"appos",
"attr",
"ROOT",
]
def noun_chunks(obj):
"""
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
"""
doc = obj.doc # Ensure works on both Doc and Span.
np_deps = [doc.vocab.strings.add(label) for label in labels]
doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
seen = set()
for i, word in enumerate(obj):
if word.pos not in (NOUN, PROPN, PRON):
continue
# Prevent nested chunks from being produced
if word.i in seen:
continue
if word.dep in np_deps:
unseen = [w.i for w in word.subtree if w.i not in seen]
if not unseen:
continue
# this takes care of particles etc.
seen.update(j.i for j in word.subtree)
# This avoids duplicating embedded clauses
seen.update(range(word.i + 1))
# if the head of this is a verb, mark that and rights seen
# Don't do the subtree as that can hide other phrases
if word.head.pos == VERB:
seen.add(word.head.i)
seen.update(w.i for w in word.head.rights)
yield unseen[0], word.i + 1, np_label
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}


@@ -0,0 +1,28 @@
from ...symbols import ADJ, AUX, NOUN, PART, VERB
# mapping from tag bi-gram to pos of previous token
TAG_BIGRAM_MAP = {
# This covers only small part of AUX.
("形容詞-非自立可能", "助詞-終助詞"): (AUX, None),
("名詞-普通名詞-形状詞可能", "助動詞"): (ADJ, None),
# ("副詞", "名詞-普通名詞-形状詞可能"): (None, ADJ),
# This covers acl, advcl, obl and root, but has side effect for compound.
("名詞-普通名詞-サ変可能", "動詞-非自立可能"): (VERB, AUX),
# This covers almost all of the deps
("名詞-普通名詞-サ変形状詞可能", "動詞-非自立可能"): (VERB, AUX),
("名詞-普通名詞-副詞可能", "動詞-非自立可能"): (None, VERB),
("副詞", "動詞-非自立可能"): (None, VERB),
("形容詞-一般", "動詞-非自立可能"): (None, VERB),
("形容詞-非自立可能", "動詞-非自立可能"): (None, VERB),
("接頭辞", "動詞-非自立可能"): (None, VERB),
("助詞-係助詞", "動詞-非自立可能"): (None, VERB),
("助詞-副助詞", "動詞-非自立可能"): (None, VERB),
("助詞-格助詞", "動詞-非自立可能"): (None, VERB),
("補助記号-読点", "動詞-非自立可能"): (None, VERB),
("形容詞-一般", "接尾辞-名詞的-一般"): (None, PART),
("助詞-格助詞", "形状詞-助動詞語幹"): (None, NOUN),
("連体詞", "形状詞-助動詞語幹"): (None, NOUN),
("動詞-一般", "助詞-副助詞"): (None, PART),
("動詞-非自立可能", "助詞-副助詞"): (None, PART),
("助動詞", "助詞-副助詞"): (None, PART),
}


@@ -1,79 +1,68 @@
-from ...symbols import POS, PUNCT, INTJ, X, ADJ, AUX, ADP, PART, SCONJ, NOUN
-from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE
+from ...symbols import POS, PUNCT, INTJ, ADJ, AUX, ADP, PART, SCONJ, NOUN
+from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE, CCONJ

 TAG_MAP = {
     # Explanation of Unidic tags:
     # https://www.gavo.t.u-tokyo.ac.jp/~mine/japanese/nlp+slp/UNIDIC_manual.pdf
-    # Universal Dependencies Mapping:
+    # Universal Dependencies Mapping: (Some of the entries in this mapping are updated to v2.6 in the list below)
     # http://universaldependencies.org/ja/overview/morphology.html
     # http://universaldependencies.org/ja/pos/all.html
-    "記号,一般,*,*": {
-        POS: PUNCT
-    },  # this includes characters used to represent sounds like ドレミ
-    "記号,文字,*,*": {
-        POS: PUNCT
-    },  # this is for Greek and Latin characters used as sumbols, as in math
-    "感動詞,フィラー,*,*": {POS: INTJ},
-    "感動詞,一般,*,*": {POS: INTJ},
-    # this is specifically for unicode full-width space
-    "空白,*,*,*": {POS: X},
-    # This is used when sequential half-width spaces are present
+    "記号-一般": {POS: NOUN},  # this includes characters used to represent sounds like ドレミ
+    "記号-文字": {
+        POS: NOUN
+    },  # this is for Greek and Latin characters having some meanings, or used as symbols, as in math
+    "感動詞-フィラー": {POS: INTJ},
+    "感動詞-一般": {POS: INTJ},
     "空白": {POS: SPACE},
-    "形状詞,一般,*,*": {POS: ADJ},
-    "形状詞,タリ,*,*": {POS: ADJ},
-    "形状詞,助動詞語幹,*,*": {POS: ADJ},
-    "形容詞,一般,*,*": {POS: ADJ},
-    "形容詞,非自立可能,*,*": {POS: AUX},  # XXX ADJ if alone, AUX otherwise
-    "助詞,格助詞,*,*": {POS: ADP},
-    "助詞,係助詞,*,*": {POS: ADP},
-    "助詞,終助詞,*,*": {POS: PART},
-    "助詞,準体助詞,*,*": {POS: SCONJ},  # の as in 走るのが速い
-    "助詞,接続助詞,*,*": {POS: SCONJ},  # verb ending て
-    "助詞,副助詞,*,*": {POS: PART},  # ばかり, つつ after a verb
-    "助動詞,*,*,*": {POS: AUX},
-    "接続詞,*,*,*": {POS: SCONJ},  # XXX: might need refinement
-    "接頭辞,*,*,*": {POS: NOUN},
-    "接尾辞,形状詞的,*,*": {POS: ADJ},  # がち, チック
-    "接尾辞,形容詞的,*,*": {POS: ADJ},  # -らしい
-    "接尾辞,動詞的,*,*": {POS: NOUN},  # -じみ
-    "接尾辞,名詞的,サ変可能,*": {POS: NOUN},  # XXX see 名詞,普通名詞,サ変可能,*
-    "接尾辞,名詞的,一般,*": {POS: NOUN},
-    "接尾辞,名詞的,助数詞,*": {POS: NOUN},
-    "接尾辞,名詞的,副詞可能,*": {POS: NOUN},  # -後, -過ぎ
-    "代名詞,*,*,*": {POS: PRON},
-    "動詞,一般,*,*": {POS: VERB},
-    "動詞,非自立可能,*,*": {POS: VERB},  # XXX VERB if alone, AUX otherwise
-    "動詞,非自立可能,*,*,AUX": {POS: AUX},
-    "動詞,非自立可能,*,*,VERB": {POS: VERB},
-    "副詞,*,*,*": {POS: ADV},
-    "補助記号,ＡＡ,一般,*": {POS: SYM},  # text art
-    "補助記号,ＡＡ,顔文字,*": {POS: SYM},  # kaomoji
-    "補助記号,一般,*,*": {POS: SYM},
-    "補助記号,括弧開,*,*": {POS: PUNCT},  # open bracket
-    "補助記号,括弧閉,*,*": {POS: PUNCT},  # close bracket
-    "補助記号,句点,*,*": {POS: PUNCT},  # period or other EOS marker
-    "補助記号,読点,*,*": {POS: PUNCT},  # comma
-    "名詞,固有名詞,一般,*": {POS: PROPN},  # general proper noun
-    "名詞,固有名詞,人名,一般": {POS: PROPN},  # person's name
-    "名詞,固有名詞,人名,姓": {POS: PROPN},  # surname
-    "名詞,固有名詞,人名,名": {POS: PROPN},  # first name
-    "名詞,固有名詞,地名,一般": {POS: PROPN},  # place name
-    "名詞,固有名詞,地名,国": {POS: PROPN},  # country name
-    "名詞,助動詞語幹,*,*": {POS: AUX},
-    "名詞,数詞,*,*": {POS: NUM},  # includes Chinese numerals
-    "名詞,普通名詞,サ変可能,*": {POS: NOUN},  # XXX: sometimes VERB in UDv2; suru-verb noun
-    "名詞,普通名詞,サ変可能,*,NOUN": {POS: NOUN},
-    "名詞,普通名詞,サ変可能,*,VERB": {POS: VERB},
-    "名詞,普通名詞,サ変形状詞可能,*": {POS: NOUN},  # ex: 下手
-    "名詞,普通名詞,一般,*": {POS: NOUN},
-    "名詞,普通名詞,形状詞可能,*": {POS: NOUN},  # XXX: sometimes ADJ in UDv2
-    "名詞,普通名詞,形状詞可能,*,NOUN": {POS: NOUN},
-    "名詞,普通名詞,形状詞可能,*,ADJ": {POS: ADJ},
-    "名詞,普通名詞,助数詞可能,*": {POS: NOUN},  # counter / unit
-    "名詞,普通名詞,副詞可能,*": {POS: NOUN},
-    "連体詞,*,*,*": {POS: ADJ},  # XXX this has exceptions based on literal token
-    "連体詞,*,*,*,ADJ": {POS: ADJ},
-    "連体詞,*,*,*,PRON": {POS: PRON},
-    "連体詞,*,*,*,DET": {POS: DET},
+    "形状詞-一般": {POS: ADJ},
+    "形状詞-タリ": {POS: ADJ},
+    "形状詞-助動詞語幹": {POS: AUX},
+    "形容詞-一般": {POS: ADJ},
+    "形容詞-非自立可能": {POS: ADJ},  # XXX ADJ if alone, AUX otherwise
+    "助詞-格助詞": {POS: ADP},
+    "助詞-係助詞": {POS: ADP},
+    "助詞-終助詞": {POS: PART},
+    "助詞-準体助詞": {POS: SCONJ},  # の as in 走るのが速い
+    "助詞-接続助詞": {POS: SCONJ},  # verb ending て0
+    "助詞-副助詞": {POS: ADP},  # ばかり, つつ after a verb
+    "助動詞": {POS: AUX},
+    "接続詞": {POS: CCONJ},  # XXX: might need refinement
+    "接頭辞": {POS: NOUN},
+    "接尾辞-形状詞的": {POS: PART},  # がち, チック
+    "接尾辞-形容詞的": {POS: AUX},  # -らしい
+    "接尾辞-動詞的": {POS: PART},  # -じみ
+    "接尾辞-名詞的-サ変可能": {POS: NOUN},  # XXX see 名詞,普通名詞,サ変可能,*
+    "接尾辞-名詞的-一般": {POS: NOUN},
+    "接尾辞-名詞的-助数詞": {POS: NOUN},
+    "接尾辞-名詞的-副詞可能": {POS: NOUN},  # -後, -過ぎ
+    "代名詞": {POS: PRON},
+    "動詞-一般": {POS: VERB},
+    "動詞-非自立可能": {POS: AUX},  # XXX VERB if alone, AUX otherwise
+    "副詞": {POS: ADV},
+    "補助記号-ＡＡ-一般": {POS: SYM},  # text art
+    "補助記号-ＡＡ-顔文字": {POS: PUNCT},  # kaomoji
+    "補助記号-一般": {POS: SYM},
+    "補助記号-括弧開": {POS: PUNCT},  # open bracket
+    "補助記号-括弧閉": {POS: PUNCT},  # close bracket
+    "補助記号-句点": {POS: PUNCT},  # period or other EOS marker
+    "補助記号-読点": {POS: PUNCT},  # comma
+    "名詞-固有名詞-一般": {POS: PROPN},  # general proper noun
+    "名詞-固有名詞-人名-一般": {POS: PROPN},  # person's name
+    "名詞-固有名詞-人名-姓": {POS: PROPN},  # surname
+    "名詞-固有名詞-人名-名": {POS: PROPN},  # first name
+    "名詞-固有名詞-地名-一般": {POS: PROPN},  # place name
+    "名詞-固有名詞-地名-国": {POS: PROPN},  # country name
+    "名詞-助動詞語幹": {POS: AUX},
+    "名詞-数詞": {POS: NUM},  # includes Chinese numerals
+    "名詞-普通名詞-サ変可能": {POS: NOUN},  # XXX: sometimes VERB in UDv2; suru-verb noun
+    "名詞-普通名詞-サ変形状詞可能": {POS: NOUN},
+    "名詞-普通名詞-一般": {POS: NOUN},
+    "名詞-普通名詞-形状詞可能": {POS: NOUN},  # XXX: sometimes ADJ in UDv2
+    "名詞-普通名詞-助数詞可能": {POS: NOUN},  # counter / unit
+    "名詞-普通名詞-副詞可能": {POS: NOUN},
+    "連体詞": {POS: DET},  # XXX this has exceptions based on literal token
+    # GSD tags. These aren't in Unidic, but we need them for the GSD data.
+    "外国語": {POS: PROPN},  # Foreign words
+    "絵文字・記号等": {POS: SYM},  # emoji / kaomoji ^^;
 }


@@ -0,0 +1,22 @@
from ...symbols import DET, PART, PRON, SPACE, X
# mapping from tag bi-gram to pos of previous token
TAG_ORTH_MAP = {
"空白": {" ": SPACE, " ": X},
"助詞-副助詞": {"たり": PART},
"連体詞": {
"あの": DET,
"かの": DET,
"この": DET,
"その": DET,
"どの": DET,
"彼の": DET,
"此の": DET,
"其の": DET,
"ある": PRON,
"こんな": PRON,
"そんな": PRON,
"どんな": PRON,
"あらゆる": PRON,
},
}
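Both maps are consulted by `resolve_pos` in `__init__.py` above: the orth map overrides the tag by literal token first, then the bi-gram map adjusts it based on the following token's tag. A tiny hedged illustration of that lookup order (module paths assumed to sit alongside bunsetu.py under spacy/lang/ja; values are spaCy symbol IDs):

from spacy.lang.ja.tag_orth_map import TAG_ORTH_MAP
from spacy.lang.ja.tag_bigram_map import TAG_BIGRAM_MAP
from spacy.symbols import DET

print(TAG_ORTH_MAP["連体詞"]["この"] == DET)  # orth-based override wins first
print(TAG_BIGRAM_MAP[("副詞", "動詞-非自立可能")])  # (None, VERB): only the next token's POS is forced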


@@ -1,7 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 """
 Example sentences to test spaCy and its language models.


@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from .stop_words import STOP_WORDS

 from ...language import Language


@@ -1,7 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 """
 Example sentences to test spaCy and its language models.

Some files were not shown because too many files have changed in this diff.