Merge branch 'develop' into refactor/remove-symlinks

Ines Montani 2020-02-18 17:22:20 +01:00
commit a3335d36b8
189 changed files with 3690 additions and 736 deletions

.github/contributors/AlJohri.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Al Johri |
| Company name (if applicable) | N/A |
| Title or role (if applicable) | N/A |
| Date | December 27th, 2019 |
| GitHub username | AlJohri |
| Website (optional) | http://aljohri.com/ |

.github/contributors/Jan-711.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jan Jessewitsch |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 16.02.2020 |
| GitHub username | Jan-711 |
| Website (optional) | |

.github/contributors/ceteri.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ---------------------- |
| Name | Paco Nathan |
| Company name (if applicable) | Derwen, Inc. |
| Title or role (if applicable) | Managing Partner |
| Date | 2020-01-25 |
| GitHub username | ceteri |
| Website (optional) | https://derwen.ai/paco |

.github/contributors/drndos.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Filip Bednárik |
| Company name (if applicable) | Ardevop, s. r. o. |
| Title or role (if applicable) | IT Consultant |
| Date | 2020-01-26 |
| GitHub username | drndos |
| Website (optional) | https://ardevop.sk |

.github/contributors/iechevarria.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | --------------------- |
| Name | Ivan Echevarria |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-12-24 |
| GitHub username | iechevarria |
| Website (optional) | https://echevarria.io |

.github/contributors/iurshina.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Anastasiia Iurshina |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 28.12.2019 |
| GitHub username | iurshina |
| Website (optional) | |

.github/contributors/onlyanegg.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
- Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
- to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
- each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
| ----------------------------- | ---------------- |
| Name | Tyler Couto |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | January 29, 2020 |
| GitHub username | onlyanegg |
| Website (optional) | |

MANIFEST.in

@@ -1,5 +1,5 @@
 recursive-include include *.h
-recursive-include spacy *.pyx *.pxd *.txt
+recursive-include spacy *.txt *.pyx *.pxd
 include LICENSE
 include README.md
 include bin/spacy

bin/spacy

@@ -1 +1,2 @@
+#! /bin/sh
 python -m spacy "$@"

bin/wiki_entity_linking/README.md

@@ -7,16 +7,17 @@ Run `wikipedia_pretrain_kb.py`
 * WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
 * Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
 * You can set the filtering parameters for KB construction:
-* `max_per_alias`: (max) number of candidate entities in the KB per alias/synonym
-* `min_freq`: threshold of number of times an entity should occur in the corpus to be included in the KB
-* `min_pair`: threshold of number of times an entity+alias combination should occur in the corpus to be included in the KB
+* `max_per_alias` (`-a`): (max) number of candidate entities in the KB per alias/synonym
+* `min_freq` (`-f`): threshold of number of times an entity should occur in the corpus to be included in the KB
+* `min_pair` (`-c`): threshold of number of times an entity+alias combination should occur in the corpus to be included in the KB
 * Further parameters to set:
-* `descriptions_from_wikipedia`: whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
-* `entity_vector_length`: length of the pre-trained entity description vectors
-* `lang`: language for which to fetch Wikidata information (as the dump contains all languages)
+* `descriptions_from_wikipedia` (`-wp`): whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
+* `entity_vector_length` (`-v`): length of the pre-trained entity description vectors
+* `lang` (`-la`): language for which to fetch Wikidata information (as the dump contains all languages)
 
 Quick testing and rerunning:
-* When trying out the pipeline for a quick test, set `limit_prior`, `limit_train` and/or `limit_wd` to read only parts of the dumps instead of everything.
+* When trying out the pipeline for a quick test, set `limit_prior` (`-lp`), `limit_train` (`-lt`) and/or `limit_wd` (`-lw`) to read only parts of the dumps instead of everything.
+* e.g. set `-lt 20000 -lp 2000 -lw 3000 -f 1`
 * If you only want to (re)run certain parts of the pipeline, just remove the corresponding files and they will be recalculated or reparsed.
 
@@ -24,11 +25,13 @@ Quick testing and rerunning:
 Run `wikidata_train_entity_linker.py`
 * This takes the **KB directory** produced by Step 1, and trains an **Entity Linking model**
+* Specify the output directory (`-o`) in which the final, trained model will be saved
 * You can set the learning parameters for the EL training:
-* `epochs`: number of training iterations
-* `dropout`: dropout rate
-* `lr`: learning rate
-* `l2`: L2 regularization
+* `epochs` (`-e`): number of training iterations
+* `dropout` (`-p`): dropout rate
+* `lr` (`-n`): learning rate
+* `l2` (`-r`): L2 regularization
-* Specify the number of training and dev testing entities with `train_inst` and `dev_inst` respectively
+* Specify the number of training and dev testing articles with `train_articles` (`-t`) and `dev_articles` (`-d`) respectively
+* If not specified, the full dataset will be processed - this may take a LONG time !
 * Further parameters to set:
-* `labels_discard`: NER label types to discard during training
+* `labels_discard` (`-l`): NER label types to discard during training
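
The options above map onto the scripts' CLI flags. For reference, a step-2 training run might be driven like the sketch below; this is not part of the commit, the paths, article counts and discarded labels are placeholder assumptions, and the KB directory is assumed to be the script's positional argument.

```python
# Hypothetical invocation of the step-2 training script using the flags
# documented above; all paths and numbers are placeholders.
import subprocess

subprocess.run(
    [
        "python", "wikidata_train_entity_linker.py",  # script named in the README; exact path may differ
        "output/wiki_kb",          # KB directory produced by step 1 (assumed positional argument)
        "-o", "output/nel_model",  # output directory for the trained pipeline
        "-e", "10",                # epochs
        "-t", "5000",              # train_articles: articles used per epoch
        "-d", "500",               # dev_articles: articles used for evaluation
        "-l", "ORDINAL,CARDINAL",  # labels_discard: NER types to skip (example values)
    ],
    check=True,
)
```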

bin/wiki_entity_linking/entity_linker_evaluation.py

@@ -1,6 +1,8 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
 import logging
 import random
 from tqdm import tqdm
 from collections import defaultdict
@@ -92,102 +94,81 @@ class BaselineResults(object):
         self.random.update_metrics(ent_label, true_entity, random_candidate)
 
 
-def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True):
-    if baseline:
-        baseline_accuracies, counts = measure_baselines(dev_data, kb)
-        logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
-        logger.info(baseline_accuracies.report_performance("random"))
-        logger.info(baseline_accuracies.report_performance("prior"))
-        logger.info(baseline_accuracies.report_performance("oracle"))
-
-    if context:
-        # using only context
-        el_pipe.cfg["incl_context"] = True
-        el_pipe.cfg["incl_prior"] = False
-        results = get_eval_results(dev_data, el_pipe)
-        logger.info(results.report_metrics("context only"))
-
-        # measuring combined accuracy (prior + context)
-        el_pipe.cfg["incl_context"] = True
-        el_pipe.cfg["incl_prior"] = True
-        results = get_eval_results(dev_data, el_pipe)
-        logger.info(results.report_metrics("context and prior"))
+def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True, dev_limit=None):
+    counts = dict()
+    baseline_results = BaselineResults()
+    context_results = EvaluationResults()
+    combo_results = EvaluationResults()
+
+    for doc, gold in tqdm(dev_data, total=dev_limit, leave=False, desc='Processing dev data'):
+        if len(doc) > 0:
+            correct_ents = dict()
+            for entity, kb_dict in gold.links.items():
+                start, end = entity
+                for gold_kb, value in kb_dict.items():
+                    if value:
+                        # only evaluating on positive examples
+                        offset = _offset(start, end)
+                        correct_ents[offset] = gold_kb
+
+            if baseline:
+                _add_baseline(baseline_results, counts, doc, correct_ents, kb)
+
+            if context:
+                # using only context
+                el_pipe.cfg["incl_context"] = True
+                el_pipe.cfg["incl_prior"] = False
+                _add_eval_result(context_results, doc, correct_ents, el_pipe)
+
+                # measuring combined accuracy (prior + context)
+                el_pipe.cfg["incl_context"] = True
+                el_pipe.cfg["incl_prior"] = True
+                _add_eval_result(combo_results, doc, correct_ents, el_pipe)
+
+    if baseline:
+        logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
+        logger.info(baseline_results.report_performance("random"))
+        logger.info(baseline_results.report_performance("prior"))
+        logger.info(baseline_results.report_performance("oracle"))
+
+    if context:
+        logger.info(context_results.report_metrics("context only"))
+        logger.info(combo_results.report_metrics("context and prior"))
 
 
-def get_eval_results(data, el_pipe=None):
+def _add_eval_result(results, doc, correct_ents, el_pipe):
     """
     Evaluate the ent.kb_id_ annotations against the gold standard.
     Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
-    If the docs in the data require further processing with an entity linker, set el_pipe.
     """
-    docs = []
-    golds = []
-    for d, g in tqdm(data, leave=False):
-        if len(d) > 0:
-            golds.append(g)
-            if el_pipe is not None:
-                docs.append(el_pipe(d))
-            else:
-                docs.append(d)
-
-    results = EvaluationResults()
-    for doc, gold in zip(docs, golds):
-        try:
-            correct_entries_per_article = dict()
-            for entity, kb_dict in gold.links.items():
-                start, end = entity
-                for gold_kb, value in kb_dict.items():
-                    if value:
-                        # only evaluating on positive examples
-                        offset = _offset(start, end)
-                        correct_entries_per_article[offset] = gold_kb
-
-            for ent in doc.ents:
-                ent_label = ent.label_
-                pred_entity = ent.kb_id_
-                start = ent.start_char
-                end = ent.end_char
-                offset = _offset(start, end)
-                gold_entity = correct_entries_per_article.get(offset, None)
-                # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
-                if gold_entity is not None:
-                    results.update_metrics(ent_label, gold_entity, pred_entity)
-        except Exception as e:
-            logging.error("Error assessing accuracy " + str(e))
-
-    return results
+    try:
+        doc = el_pipe(doc)
+        for ent in doc.ents:
+            ent_label = ent.label_
+            start = ent.start_char
+            end = ent.end_char
+            offset = _offset(start, end)
+            gold_entity = correct_ents.get(offset, None)
+            # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
+            if gold_entity is not None:
+                pred_entity = ent.kb_id_
+                results.update_metrics(ent_label, gold_entity, pred_entity)
+    except Exception as e:
+        logging.error("Error assessing accuracy " + str(e))
 
 
-def measure_baselines(data, kb):
+def _add_baseline(baseline_results, counts, doc, correct_ents, kb):
     """
     Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound.
     Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
-    Also return a dictionary of counts by entity label.
     """
-    counts_d = dict()
-    baseline_results = BaselineResults()
-
-    docs = [d for d, g in data if len(d) > 0]
-    golds = [g for d, g in data if len(d) > 0]
-
-    for doc, gold in zip(docs, golds):
-        correct_entries_per_article = dict()
-        for entity, kb_dict in gold.links.items():
-            start, end = entity
-            for gold_kb, value in kb_dict.items():
-                # only evaluating on positive examples
-                if value:
-                    offset = _offset(start, end)
-                    correct_entries_per_article[offset] = gold_kb
-
-        for ent in doc.ents:
-            ent_label = ent.label_
-            start = ent.start_char
-            end = ent.end_char
-            offset = _offset(start, end)
-            gold_entity = correct_entries_per_article.get(offset, None)
-            # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
-            if gold_entity is not None:
+    for ent in doc.ents:
+        ent_label = ent.label_
+        start = ent.start_char
+        end = ent.end_char
+        offset = _offset(start, end)
+        gold_entity = correct_ents.get(offset, None)
+        # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
+        if gold_entity is not None:
@@ -207,8 +188,8 @@ def measure_baselines(data, kb):
-                prior_candidate = candidates[best_index].entity_
-                random_candidate = random.choice(candidates).entity_
-                current_count = counts_d.get(ent_label, 0)
-                counts_d[ent_label] = current_count+1
-                baseline_results.update_baselines(
-                    gold_entity,
+            prior_candidate = candidates[best_index].entity_
+            random_candidate = random.choice(candidates).entity_
+            current_count = counts.get(ent_label, 0)
+            counts[ent_label] = current_count+1
+            baseline_results.update_baselines(
+                gold_entity,
@@ -218,8 +199,6 @@ def measure_baselines(data, kb):
-                    oracle_candidate,
-                )
-
-    return baseline_results, counts_d
+                oracle_candidate,
+            )
 
 
 def _offset(start, end):
     return "{}_{}".format(start, end)

View File

@@ -40,7 +40,7 @@ logger = logging.getLogger(__name__)
     loc_prior_prob=("Location to file with prior probabilities", "option", "p", Path),
     loc_entity_defs=("Location to file with entity definitions", "option", "d", Path),
     loc_entity_desc=("Location to file with entity descriptions", "option", "s", Path),
-    descr_from_wp=("Flag for using wp descriptions not wd", "flag", "wp"),
+    descr_from_wp=("Flag for using descriptions from WP instead of WD (default False)", "flag", "wp"),
     limit_prior=("Threshold to limit lines read from WP for prior probabilities", "option", "lp", int),
     limit_train=("Threshold to limit lines read from WP for training set", "option", "lt", int),
    limit_wd=("Threshold to limit lines read from WD", "option", "lw", int),

bin/wiki_entity_linking/wikidata_train_entity_linker.py

@ -1,5 +1,5 @@
# coding: utf-8 # coding: utf-8
"""Script to take a previously created Knowledge Base and train an entity linking """Script that takes a previously created Knowledge Base and trains an entity linking
pipeline. The provided KB directory should hold the kb, the original nlp object and pipeline. The provided KB directory should hold the kb, the original nlp object and
its vocab used to create the KB, and a few auxiliary files such as the entity definitions, its vocab used to create the KB, and a few auxiliary files such as the entity definitions,
as created by the script `wikidata_create_kb`. as created by the script `wikidata_create_kb`.
@ -14,9 +14,16 @@ import logging
import spacy import spacy
from pathlib import Path from pathlib import Path
import plac import plac
from tqdm import tqdm
from bin.wiki_entity_linking import wikipedia_processor from bin.wiki_entity_linking import wikipedia_processor
from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_MODEL_DIR, KB_FILE, LOG_FORMAT, OUTPUT_MODEL_DIR from bin.wiki_entity_linking import (
TRAINING_DATA_FILE,
KB_MODEL_DIR,
KB_FILE,
LOG_FORMAT,
OUTPUT_MODEL_DIR,
)
from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance
from bin.wiki_entity_linking.kb_creator import read_kb from bin.wiki_entity_linking.kb_creator import read_kb
@ -33,8 +40,8 @@ logger = logging.getLogger(__name__)
dropout=("Dropout to prevent overfitting (default 0.5)", "option", "p", float), dropout=("Dropout to prevent overfitting (default 0.5)", "option", "p", float),
lr=("Learning rate (default 0.005)", "option", "n", float), lr=("Learning rate (default 0.005)", "option", "n", float),
l2=("L2 regularization", "option", "r", float), l2=("L2 regularization", "option", "r", float),
train_inst=("# training instances (default 90% of all)", "option", "t", int), train_articles=("# training articles (default 90% of all)", "option", "t", int),
dev_inst=("# test instances (default 10% of all)", "option", "d", int), dev_articles=("# dev test articles (default 10% of all)", "option", "d", int),
labels_discard=("NER labels to discard (default None)", "option", "l", str), labels_discard=("NER labels to discard (default None)", "option", "l", str),
) )
def main( def main(
@ -45,10 +52,15 @@ def main(
dropout=0.5, dropout=0.5,
lr=0.005, lr=0.005,
l2=1e-6, l2=1e-6,
train_inst=None, train_articles=None,
dev_inst=None, dev_articles=None,
labels_discard=None labels_discard=None,
): ):
if not output_dir:
logger.warning(
"No output dir specified so no results will be written, are you sure about this ?"
)
logger.info("Creating Entity Linker with Wikipedia and WikiData") logger.info("Creating Entity Linker with Wikipedia and WikiData")
output_dir = Path(output_dir) if output_dir else dir_kb output_dir = Path(output_dir) if output_dir else dir_kb
@ -64,47 +76,57 @@ def main(
# STEP 1 : load the NLP object # STEP 1 : load the NLP object
logger.info("STEP 1a: Loading model from {}".format(nlp_dir)) logger.info("STEP 1a: Loading model from {}".format(nlp_dir))
nlp = spacy.load(nlp_dir) nlp = spacy.load(nlp_dir)
logger.info("STEP 1b: Loading KB from {}".format(kb_path)) logger.info(
kb = read_kb(nlp, kb_path) "Original NLP pipeline has following pipeline components: {}".format(
nlp.pipe_names
)
)
# check that there is a NER component in the pipeline # check that there is a NER component in the pipeline
if "ner" not in nlp.pipe_names: if "ner" not in nlp.pipe_names:
raise ValueError("The `nlp` object should have a pretrained `ner` component.") raise ValueError("The `nlp` object should have a pretrained `ner` component.")
# STEP 2: read the training dataset previously created from WP logger.info("STEP 1b: Loading KB from {}".format(kb_path))
logger.info("STEP 2: Reading training dataset from {}".format(training_path)) kb = read_kb(nlp, kb_path)
# STEP 2: read the training dataset previously created from WP
logger.info("STEP 2: Reading training & dev dataset from {}".format(training_path))
train_indices, dev_indices = wikipedia_processor.read_training_indices(
training_path
)
logger.info(
"Training set has {} articles, limit set to roughly {} articles per epoch".format(
len(train_indices), train_articles if train_articles else "all"
)
)
logger.info(
"Dev set has {} articles, limit set to rougly {} articles for evaluation".format(
len(dev_indices), dev_articles if dev_articles else "all"
)
)
if dev_articles:
dev_indices = dev_indices[0:dev_articles]
# STEP 3: create and train an entity linking pipe
logger.info(
"STEP 3: Creating and training an Entity Linking pipe for {} epochs".format(
epochs
)
)
if labels_discard: if labels_discard:
labels_discard = [x.strip() for x in labels_discard.split(",")] labels_discard = [x.strip() for x in labels_discard.split(",")]
logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard)) logger.info(
"Discarding {} NER types: {}".format(len(labels_discard), labels_discard)
)
    else:
        labels_discard = []
-    train_data = wikipedia_processor.read_training(
-        nlp=nlp,
-        entity_file_path=training_path,
-        dev=False,
-        limit=train_inst,
-        kb=kb,
-        labels_discard=labels_discard
-    )
-
-    # for testing, get all pos instances (independently of KB)
-    dev_data = wikipedia_processor.read_training(
-        nlp=nlp,
-        entity_file_path=training_path,
-        dev=True,
-        limit=dev_inst,
-        kb=None,
-        labels_discard=labels_discard
-    )

    # STEP 3: create and train an entity linking pipe
    logger.info("STEP 3: Creating and training an Entity Linking pipe")
-    el_pipe = nlp.create_pipe(
-        name="entity_linker", config={"pretrained_vectors": nlp.vocab.vectors,
-                                      "labels_discard": labels_discard}
-    )
+    el_pipe = nlp.create_pipe(
+        name="entity_linker",
+        config={
+            "pretrained_vectors": nlp.vocab.vectors,
+            "labels_discard": labels_discard,
+        },
+    )
    el_pipe.set_kb(kb)
    nlp.add_pipe(el_pipe, last=True)
@@ -115,78 +137,96 @@ def main(
    optimizer.learn_rate = lr
    optimizer.L2 = l2

-    logger.info("Training on {} articles".format(len(train_data)))
-    logger.info("Dev testing on {} articles".format(len(dev_data)))
-
-    # baseline performance on dev data
    logger.info("Dev Baseline Accuracies:")
-    measure_performance(dev_data, kb, el_pipe, baseline=True, context=False)
+    dev_data = wikipedia_processor.read_el_docs_golds(
+        nlp=nlp,
+        entity_file_path=training_path,
+        dev=True,
+        line_ids=dev_indices,
+        kb=kb,
+        labels_discard=labels_discard,
+    )
+    measure_performance(
+        dev_data, kb, el_pipe, baseline=True, context=False, dev_limit=len(dev_indices)
+    )

    for itn in range(epochs):
-        random.shuffle(train_data)
+        random.shuffle(train_indices)
        losses = {}
-        batches = minibatch(train_data, size=compounding(4.0, 128.0, 1.001))
+        batches = minibatch(train_indices, size=compounding(8.0, 128.0, 1.001))
        batchnr = 0
+        articles_processed = 0

-        with nlp.disable_pipes(*other_pipes):
-            for batch in batches:
-                try:
-                    nlp.update(
-                        examples=batch,
-                        sgd=optimizer,
-                        drop=dropout,
-                        losses=losses,
-                    )
-                    batchnr += 1
-                except Exception as e:
-                    logger.error("Error updating batch:" + str(e))
+        # we either process the whole training file, or just a part each epoch
+        bar_total = len(train_indices)
+        if train_articles:
+            bar_total = train_articles
+
+        with tqdm(total=bar_total, leave=False, desc=f"Epoch {itn}") as pbar:
+            for batch in batches:
+                if not train_articles or articles_processed < train_articles:
+                    with nlp.disable_pipes("entity_linker"):
+                        train_batch = wikipedia_processor.read_el_docs_golds(
+                            nlp=nlp,
+                            entity_file_path=training_path,
+                            dev=False,
+                            line_ids=batch,
+                            kb=kb,
+                            labels_discard=labels_discard,
+                        )
+                        docs, golds = zip(*train_batch)
+                    try:
+                        with nlp.disable_pipes(*other_pipes):
+                            nlp.update(
+                                docs=docs,
+                                golds=golds,
+                                sgd=optimizer,
+                                drop=dropout,
+                                losses=losses,
+                            )
+                        batchnr += 1
+                        articles_processed += len(docs)
+                        pbar.update(len(docs))
+                    except Exception as e:
+                        logger.error("Error updating batch:" + str(e))
        if batchnr > 0:
-            logging.info("Epoch {}, train loss {}".format(itn, round(losses["entity_linker"] / batchnr, 2)))
-            measure_performance(dev_data, kb, el_pipe, baseline=False, context=True)
+            logging.info(
+                "Epoch {} trained on {} articles, train loss {}".format(
+                    itn, articles_processed, round(losses["entity_linker"] / batchnr, 2)
+                )
+            )
+            # re-read the dev_data (data is returned as a generator)
+            dev_data = wikipedia_processor.read_el_docs_golds(
+                nlp=nlp,
+                entity_file_path=training_path,
+                dev=True,
+                line_ids=dev_indices,
+                kb=kb,
+                labels_discard=labels_discard,
+            )
+            measure_performance(
+                dev_data,
+                kb,
+                el_pipe,
+                baseline=False,
+                context=True,
+                dev_limit=len(dev_indices),
+            )

-    # STEP 4: measure the performance of our trained pipe on an independent dev set
-    logger.info("STEP 4: Final performance measurement of Entity Linking pipe")
-    measure_performance(dev_data, kb, el_pipe)
-
-    # STEP 5: apply the EL pipe on a toy example
-    logger.info("STEP 5: Applying Entity Linking to toy example")
-    run_el_toy_example(nlp=nlp)
-
    if output_dir:
-        # STEP 6: write the NLP pipeline (now including an EL model) to file
-        logger.info("STEP 6: Writing trained NLP to {}".format(nlp_output_dir))
+        # STEP 4: write the NLP pipeline (now including an EL model) to file
+        logger.info(
+            "Final NLP pipeline has following pipeline components: {}".format(
+                nlp.pipe_names
+            )
+        )
+        logger.info("STEP 4: Writing trained NLP to {}".format(nlp_output_dir))
        nlp.to_disk(nlp_output_dir)
    logger.info("Done!")

-
-def check_kb(kb):
-    for mention in ("Bush", "Douglas Adams", "Homer", "Brazil", "China"):
-        candidates = kb.get_candidates(mention)
-
-        logger.info("generating candidates for " + mention + " :")
-        for c in candidates:
-            logger.info(" ".join([
-                str(c.prior_prob),
-                c.alias_,
-                "-->",
-                c.entity_ + " (freq=" + str(c.entity_freq) + ")"
-            ]))
-
-
-def run_el_toy_example(nlp):
-    text = (
-        "In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, "
-        "Douglas reminds us to always bring our towel, even in China or Brazil. "
-        "The main character in Doug's novel is the man Arthur Dent, "
-        "but Dougledydoug doesn't write about George Washington or Homer Simpson."
-    )
-    doc = nlp(text)
-    logger.info(text)
-    for ent in doc.ents:
-        logger.info(" ".join(["ent", ent.text, ent.label_, ent.kb_id_]))
-

 if __name__ == "__main__":
     logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
     plac.call(main)
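As a side note on the training loop above, this is a minimal, self-contained sketch of the minibatch/compounding batching pattern it relies on; the index list here is only a stand-in for the article line indices used by the script.

import random
from spacy.util import minibatch, compounding

train_indices = list(range(1000))  # stand-in for the training line indices

for epoch in range(2):
    random.shuffle(train_indices)
    # batch size grows from 8 towards 128, multiplied by 1.001 after every batch
    for batch in minibatch(train_indices, size=compounding(8.0, 128.0, 1.001)):
        pass  # each `batch` is a list of indices; the script reads and trains on them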
@@ -6,9 +6,6 @@ import bz2
 import logging
 import random
 import json
-from tqdm import tqdm
-from functools import partial

 from spacy.gold import GoldParse
 from bin.wiki_entity_linking import wiki_io as io
@@ -454,25 +451,40 @@ def _write_training_entities(outputfile, article_id, clean_text, entities):
         outputfile.write(line)


-def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
-    """ This method provides training examples that correspond to the entity annotations found by the nlp object.
+def read_training_indices(entity_file_path):
+    """ This method creates two lists of indices into the training file: one with indices for the
+    training examples, and one for the dev examples."""
+    train_indices = []
+    dev_indices = []
+    with entity_file_path.open("r", encoding="utf8") as file:
+        for i, line in enumerate(file):
+            example = json.loads(line)
+            article_id = example["article_id"]
+            clean_text = example["clean_text"]
+            if is_valid_article(clean_text):
+                if is_dev(article_id):
+                    dev_indices.append(i)
+                else:
+                    train_indices.append(i)
+    return train_indices, dev_indices
+
+
+def read_el_docs_golds(nlp, entity_file_path, dev, line_ids, kb, labels_discard=None):
+    """ This method provides training/dev examples that correspond to the entity annotations found by the nlp object.
     For training, it will include both positive and negative examples by using the candidate generator from the kb.
     For testing (kb=None), it will include all positive examples only."""
     if not labels_discard:
         labels_discard = []

-    data = []
-    num_entities = 0
-    get_gold_parse = partial(
-        _get_gold_parse, dev=dev, kb=kb, labels_discard=labels_discard
-    )
-
-    logger.info(
-        "Reading {} data with limit {}".format("dev" if dev else "train", limit)
-    )
+    texts = []
+    entities_list = []

     with entity_file_path.open("r", encoding="utf8") as file:
-        with tqdm(total=limit, leave=False) as pbar:
-            for i, line in enumerate(file):
+        for i, line in enumerate(file):
+            if i in line_ids:
                 example = json.loads(line)
                 article_id = example["article_id"]
                 clean_text = example["clean_text"]
@@ -481,16 +493,15 @@ def read_training(nlp, entity_file_path, dev, limit, kb, labels_discard=None):
                 if dev != is_dev(article_id) or not is_valid_article(clean_text):
                     continue

-                doc = nlp(clean_text)
-                gold = get_gold_parse(doc, entities)
-                if gold and len(gold.links) > 0:
-                    data.append((doc, gold))
-                    num_entities += len(gold.links)
-                    pbar.update(len(gold.links))
-                    if limit and num_entities >= limit:
-                        break
-    logger.info("Read {} entities in {} articles".format(num_entities, len(data)))
-    return data
+                texts.append(clean_text)
+                entities_list.append(entities)
+
+    docs = nlp.pipe(texts, batch_size=50)
+    for doc, entities in zip(docs, entities_list):
+        gold = _get_gold_parse(doc, entities, dev=dev, kb=kb, labels_discard=labels_discard)
+        if gold and len(gold.links) > 0:
+            yield doc, gold


 def _get_gold_parse(doc, entities, dev, kb, labels_discard):
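For orientation, a hedged usage sketch of the two helpers above: the function names and keyword arguments come from this diff, the module import mirrors the `bin.wiki_entity_linking` imports shown above, and the file path and blank pipeline are placeholders.

from pathlib import Path
import spacy
from bin.wiki_entity_linking import wikipedia_processor

nlp = spacy.blank("en")  # stand-in; the real script uses a model with an NER pipe
training_path = Path("gold_entities.jsonl")  # placeholder path to the processed training file

train_indices, dev_indices = wikipedia_processor.read_training_indices(training_path)

# read_el_docs_golds() is a generator: (doc, gold) pairs are streamed, not kept in memory
dev_data = wikipedia_processor.read_el_docs_golds(
    nlp=nlp,
    entity_file_path=training_path,
    dev=True,
    line_ids=dev_indices,
    kb=None,  # kb=None keeps only positive examples, as the docstring above states
    labels_discard=None,
)
for doc, gold in dev_data:
    pass  # evaluate or collect the streamed examples here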
@@ -26,12 +26,12 @@ DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
 HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""


-@st.cache(ignore_hash=True)
+@st.cache(allow_output_mutation=True)
 def load_model(name):
     return spacy.load(name)


-@st.cache(ignore_hash=True)
+@st.cache(allow_output_mutation=True)
 def process_text(model_name, text):
     nlp = load_model(model_name)
     return nlp(text)
@@ -79,7 +79,9 @@ if "ner" in nlp.pipe_names:
     st.header("Named Entities")
     st.sidebar.header("Named Entities")
     label_set = nlp.get_pipe("ner").labels
-    labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
+    labels = st.sidebar.multiselect(
+        "Entity labels", options=label_set, default=list(label_set)
+    )
     html = displacy.render(doc, style="ent", options={"ents": labels})
     # Newlines seem to mess with the rendering
     html = html.replace("\n", " ")
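A minimal sketch of the caching change above, assuming a Streamlit release of that era and an installed `en_core_web_sm` package: `allow_output_mutation=True` replaces the removed `ignore_hash` flag and tells `st.cache` not to hash or copy the returned spaCy objects.

import spacy
import streamlit as st

@st.cache(allow_output_mutation=True)
def load_model(name):
    # the returned nlp object cannot be hashed by st.cache, so output mutation is allowed
    return spacy.load(name)

nlp = load_model("en_core_web_sm")  # assumes this model package is installed
doc = nlp("Mark Zuckerberg is the CEO of Facebook.")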
@@ -32,27 +32,24 @@ DESC_WIDTH = 64  # dimension of output entity vectors

 @plac.annotations(
-    vocab_path=("Path to the vocab for the kb", "option", "v", Path),
-    model=("Model name, should have pretrained word embeddings", "option", "m", str),
+    model=("Model name, should have pretrained word embeddings", "positional", None, str),
     output_dir=("Optional output directory", "option", "o", Path),
     n_iter=("Number of training iterations", "option", "n", int),
 )
-def main(vocab_path=None, model=None, output_dir=None, n_iter=50):
+def main(model=None, output_dir=None, n_iter=50):
     """Load the model, create the KB and pretrain the entity encodings.
-    Either an nlp model or a vocab is needed to provide access to pretrained word embeddings.
     If an output_dir is provided, the KB will be stored there in a file 'kb'.
-    When providing an nlp model, the updated vocab will also be written to a directory in the output_dir."""
-    if model is None and vocab_path is None:
-        raise ValueError("Either the `nlp` model or the `vocab` should be specified.")
+    The updated vocab will also be written to a directory in the output_dir."""

-    if model is not None:
-        nlp = spacy.load(model)  # load existing spaCy model
-        print("Loaded model '%s'" % model)
-    else:
-        vocab = Vocab().from_disk(vocab_path)
-        # create blank Language class with specified vocab
-        nlp = spacy.blank("en", vocab=vocab)
-        print("Created blank 'en' model with vocab from '%s'" % vocab_path)
+    nlp = spacy.load(model)  # load existing spaCy model
+    print("Loaded model '%s'" % model)
+
+    # check the length of the nlp vectors
+    if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
+        raise ValueError(
+            "The `nlp` object should have access to pretrained word vectors, "
+            " cf. https://spacy.io/usage/models#languages."
+        )

     kb = KnowledgeBase(vocab=nlp.vocab)
@@ -103,8 +100,6 @@ def main(vocab_path=None, model=None, output_dir=None, n_iter=50):
         print()
         print("Saved KB to", kb_path)

-        # only storing the vocab if we weren't already reading it from file
-        if not vocab_path:
-            vocab_path = output_dir / "vocab"
-            kb.vocab.to_disk(vocab_path)
-            print("Saved vocab to", vocab_path)
+        vocab_path = output_dir / "vocab"
+        kb.vocab.to_disk(vocab_path)
+        print("Saved vocab to", vocab_path)
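A hedged sketch of the vectors guard above combined with basic KnowledgeBase usage, assuming the spaCy v2.2-style KB API; the model name, entity ID "Q42", frequency and zero vector are illustrative values only (64 mirrors the DESC_WIDTH constant shown above).

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_md")  # assumes a model shipping word vectors is installed
if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
    raise ValueError("The `nlp` object should have access to pretrained word vectors.")

kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
kb.add_entity(entity="Q42", freq=32, entity_vector=[0.0] * 64)  # illustrative entity
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[1.0])
print(kb.get_size_entities(), kb.get_size_aliases())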
@@ -131,7 +131,8 @@ def train_textcat(nlp, n_texts, n_iter=10):
     train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
+    pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train textcat
         optimizer = nlp.begin_training()
         textcat.model.tok2vec.from_bytes(tok2vec_weights)
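The same freeze-everything-but-one-component pattern in isolation, as a minimal sketch assuming the v2.x-style API used throughout these example scripts (the "POSITIVE" label is only there to make the snippet runnable):

import spacy

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
textcat.add_label("POSITIVE")
nlp.add_pipe(textcat)

pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes):  # everything except textcat (and the trf wrappers) is frozen
    optimizer = nlp.begin_training()
    print(nlp.pipe_names)  # only the components left enabled will be updated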
@@ -63,7 +63,8 @@ def main(model_name, unlabelled_loc):
     optimizer.b2 = 0.0
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
+    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     sizes = compounding(1.0, 4.0, 1.001)
     with nlp.disable_pipes(*other_pipes):
         for itn in range(n_iter):
@@ -113,7 +113,8 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
         TRAIN_DOCS.append((doc, annotation_clean))
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
+    pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train entity linker
         # reset and initialize the weights randomly
         optimizer = nlp.begin_training()
@@ -124,7 +124,8 @@ def main(model=None, output_dir=None, n_iter=15):
         for dep in annotations.get("deps", []):
             parser.add_label(dep)
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
+    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train parser
         optimizer = nlp.begin_training()
         for itn in range(n_iter):
@@ -55,7 +55,8 @@ def main(model=None, output_dir=None, n_iter=100):
             ner.add_label(ent[2])
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
+    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train NER
         # reset and initialize the weights randomly but only if we're
         # training a new model
@@ -95,7 +95,8 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
         optimizer = nlp.resume_training()
     move_names = list(ner.move_names)
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
+    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train NER
         sizes = compounding(1.0, 4.0, 1.001)
         # batch up the examples using spaCy's minibatch
@@ -65,7 +65,8 @@ def main(model=None, output_dir=None, n_iter=15):
         parser.add_label(dep)
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
+    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train parser
         optimizer = nlp.begin_training()
         for itn in range(n_iter):
@@ -68,7 +68,8 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None
     train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
     # get names of other pipes to disable them during training
-    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
+    pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
+    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     with nlp.disable_pipes(*other_pipes):  # only train textcat
         optimizer = nlp.begin_training()
         if init_tok2vec is not None:
@@ -49,6 +49,7 @@ install_requires =
     catalogue>=0.0.7,<1.1.0
     ml_datasets
     # Third-party dependencies
+    tqdm>=4.38.0,<5.0.0
     setuptools
     numpy>=1.15.0
     plac>=0.9.6,<1.2.0
@@ -5,7 +5,7 @@ warnings.filterwarnings("ignore", message="numpy.dtype size changed")
 warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

 # These are imported as part of the API
-from thinc.util import prefer_gpu, require_gpu
+from thinc.api import prefer_gpu, require_gpu

 from . import pipeline
 from .cli.info import info as cli_info
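The helpers re-exported above are also reachable from user code through the top-level package; a tiny sketch:

import spacy

# prefer_gpu() returns True and switches Thinc's ops to the GPU when one is
# available, otherwise it returns False and computation stays on the CPU
print("GPU" if spacy.prefer_gpu() else "CPU")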
@@ -92,3 +92,4 @@ cdef enum attr_id_t:
     LANG
     ENT_KB_ID = symbols.ENT_KB_ID
     MORPH
+    ENT_ID = symbols.ENT_ID
@@ -81,6 +81,7 @@ IDS = {
     "DEP": DEP,
     "ENT_IOB": ENT_IOB,
     "ENT_TYPE": ENT_TYPE,
+    "ENT_ID": ENT_ID,
     "ENT_KB_ID": ENT_KB_ID,
     "HEAD": HEAD,
     "SENT_START": SENT_START,
@ -9,8 +9,14 @@ from wasabi import Printer
def conllu2json( def conllu2json(
input_data, n_sents=10, append_morphology=False, lang=None, ner_map=None, input_data,
merge_subtokens=False, no_print=False, **_ n_sents=10,
append_morphology=False,
lang=None,
ner_map=None,
merge_subtokens=False,
no_print=False,
**_
): ):
""" """
Convert conllu files into JSON format for use with train cli. Convert conllu files into JSON format for use with train cli.
@ -26,9 +32,13 @@ def conllu2json(
docs = [] docs = []
raw = "" raw = ""
sentences = [] sentences = []
conll_data = read_conllx(input_data, append_morphology=append_morphology, conll_data = read_conllx(
ner_tag_pattern=MISC_NER_PATTERN, ner_map=ner_map, input_data,
merge_subtokens=merge_subtokens) append_morphology=append_morphology,
ner_tag_pattern=MISC_NER_PATTERN,
ner_map=ner_map,
merge_subtokens=merge_subtokens,
)
has_ner_tags = has_ner(input_data, ner_tag_pattern=MISC_NER_PATTERN) has_ner_tags = has_ner(input_data, ner_tag_pattern=MISC_NER_PATTERN)
for i, example in enumerate(conll_data): for i, example in enumerate(conll_data):
raw += example.text raw += example.text
@ -72,20 +82,28 @@ def has_ner(input_data, ner_tag_pattern):
return False return False
def read_conllx(input_data, append_morphology=False, merge_subtokens=False, def read_conllx(
ner_tag_pattern="", ner_map=None): input_data,
append_morphology=False,
merge_subtokens=False,
ner_tag_pattern="",
ner_map=None,
):
""" Yield examples, one for each sentence """ """ Yield examples, one for each sentence """
vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc
i = 0
for sent in input_data.strip().split("\n\n"): for sent in input_data.strip().split("\n\n"):
lines = sent.strip().split("\n") lines = sent.strip().split("\n")
if lines: if lines:
while lines[0].startswith("#"): while lines[0].startswith("#"):
lines.pop(0) lines.pop(0)
example = example_from_conllu_sentence(vocab, lines, example = example_from_conllu_sentence(
ner_tag_pattern, merge_subtokens=merge_subtokens, vocab,
lines,
ner_tag_pattern,
merge_subtokens=merge_subtokens,
append_morphology=append_morphology, append_morphology=append_morphology,
ner_map=ner_map) ner_map=ner_map,
)
yield example yield example
@ -157,8 +175,14 @@ def create_json_doc(raw, sentences, id_):
return doc return doc
def example_from_conllu_sentence(vocab, lines, ner_tag_pattern, def example_from_conllu_sentence(
merge_subtokens=False, append_morphology=False, ner_map=None): vocab,
lines,
ner_tag_pattern,
merge_subtokens=False,
append_morphology=False,
ner_map=None,
):
"""Create an Example from the lines for one CoNLL-U sentence, merging """Create an Example from the lines for one CoNLL-U sentence, merging
subtokens and appending morphology to tags if required. subtokens and appending morphology to tags if required.
@ -182,7 +206,6 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
in_subtok = False in_subtok = False
for i in range(len(lines)): for i in range(len(lines)):
line = lines[i] line = lines[i]
subtok_lines = []
parts = line.split("\t") parts = line.split("\t")
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
if "." in id_: if "." in id_:
@ -212,7 +235,7 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
subtok_word = "" subtok_word = ""
in_subtok = False in_subtok = False
id_ = int(id_) - 1 id_ = int(id_) - 1
head = (int(head) - 1) if head != "0" else id_ head = (int(head) - 1) if head not in ("0", "_") else id_
tag = pos if tag == "_" else tag tag = pos if tag == "_" else tag
morph = morph if morph != "_" else "" morph = morph if morph != "_" else ""
dep = "ROOT" if dep == "root" else dep dep = "ROOT" if dep == "root" else dep
@ -266,9 +289,17 @@ def example_from_conllu_sentence(vocab, lines, ner_tag_pattern,
if space: if space:
raw += " " raw += " "
example = Example(doc=raw) example = Example(doc=raw)
example.set_token_annotation(ids=ids, words=words, tags=tags, pos=pos, example.set_token_annotation(
morphs=morphs, lemmas=lemmas, heads=heads, ids=ids,
deps=deps, entities=ents) words=words,
tags=tags,
pos=pos,
morphs=morphs,
lemmas=lemmas,
heads=heads,
deps=deps,
entities=ents,
)
return example return example
@ -292,7 +323,7 @@ def merge_conllu_subtokens(lines, doc):
if token._.merged_morph: if token._.merged_morph:
for feature in token._.merged_morph.split("|"): for feature in token._.merged_morph.split("|"):
field, values = feature.split("=", 1) field, values = feature.split("=", 1)
if not field in morphs: if field not in morphs:
morphs[field] = set() morphs[field] = set()
for value in values.split(","): for value in values.split(","):
morphs[field].add(value) morphs[field].add(value)
@ -306,7 +337,9 @@ def merge_conllu_subtokens(lines, doc):
token._.merged_lemma = " ".join(lemmas) token._.merged_lemma = " ".join(lemmas)
token.tag_ = "_".join(tags) token.tag_ = "_".join(tags)
token._.merged_morph = "|".join(sorted(morphs.values())) token._.merged_morph = "|".join(sorted(morphs.values()))
token._.merged_spaceafter = True if subtok_span[-1].whitespace_ else False token._.merged_spaceafter = (
True if subtok_span[-1].whitespace_ else False
)
with doc.retokenize() as retokenizer: with doc.retokenize() as retokenizer:
for span in subtok_spans: for span in subtok_spans:
@ -166,6 +166,7 @@ def debug_data(
has_low_data_warning = False has_low_data_warning = False
has_no_neg_warning = False has_no_neg_warning = False
has_ws_ents_error = False has_ws_ents_error = False
has_punct_ents_warning = False
msg.divider("Named Entity Recognition") msg.divider("Named Entity Recognition")
msg.info( msg.info(
@ -190,6 +191,10 @@ def debug_data(
msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans") msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans")
has_ws_ents_error = True has_ws_ents_error = True
if gold_train_data["punct_ents"]:
msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation")
has_punct_ents_warning = True
for label in new_labels: for label in new_labels:
if label_counts[label] <= NEW_LABEL_THRESHOLD: if label_counts[label] <= NEW_LABEL_THRESHOLD:
msg.warn( msg.warn(
@ -209,6 +214,8 @@ def debug_data(
msg.good("Examples without occurrences available for all labels") msg.good("Examples without occurrences available for all labels")
if not has_ws_ents_error: if not has_ws_ents_error:
msg.good("No entities consisting of or starting/ending with whitespace") msg.good("No entities consisting of or starting/ending with whitespace")
if not has_punct_ents_warning:
msg.good("No entities consisting of or starting/ending with punctuation")
if has_low_data_warning: if has_low_data_warning:
msg.text( msg.text(
@ -229,6 +236,12 @@ def debug_data(
"with whitespace characters are considered invalid." "with whitespace characters are considered invalid."
) )
if has_punct_ents_warning:
msg.text(
"Entity spans consisting of or starting/ending "
"with punctuation can not be trained with a noise level > 0."
)
if "textcat" in pipeline: if "textcat" in pipeline:
msg.divider("Text Classification") msg.divider("Text Classification")
labels = [label for label in gold_train_data["cats"]] labels = [label for label in gold_train_data["cats"]]
@ -446,6 +459,7 @@ def _compile_gold(examples, pipeline):
"words": Counter(), "words": Counter(),
"roots": Counter(), "roots": Counter(),
"ws_ents": 0, "ws_ents": 0,
"punct_ents": 0,
"n_words": 0, "n_words": 0,
"n_misaligned_words": 0, "n_misaligned_words": 0,
"n_sents": 0, "n_sents": 0,
@ -469,6 +483,16 @@ def _compile_gold(examples, pipeline):
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space: if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
# "Illegal" whitespace entity # "Illegal" whitespace entity
data["ws_ents"] += 1 data["ws_ents"] += 1
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [
".",
"'",
"!",
"?",
",",
]:
# punctuation entity: could be replaced by whitespace when training with noise,
# so add a warning to alert the user to this unexpected side effect.
data["punct_ents"] += 1
if label.startswith(("B-", "U-")): if label.startswith(("B-", "U-")):
combined_label = label.split("-")[1] combined_label = label.split("-")[1]
data["ner"][combined_label] += 1 data["ner"][combined_label] += 1
@ -4,14 +4,12 @@ import time
import re import re
from collections import Counter from collections import Counter
from pathlib import Path from pathlib import Path
from thinc.layers import Linear, Maxout from thinc.api import Linear, Maxout, chain, list2array, prefer_gpu
from thinc.util import prefer_gpu from thinc.api import CosineDistance, L2Distance
from wasabi import msg from wasabi import msg
import srsly import srsly
from thinc.layers import chain, list2array
from thinc.loss import CosineDistance, L2Distance
from spacy.gold import Example from ..gold import Example
from ..errors import Errors from ..errors import Errors
from ..tokens import Doc from ..tokens import Doc
from ..attrs import ID, HEAD from ..attrs import ID, HEAD
@ -28,7 +26,7 @@ def pretrain(
vectors_model: ("Name or path to spaCy model with vectors to learn from", "positional", None, str), vectors_model: ("Name or path to spaCy model with vectors to learn from", "positional", None, str),
output_dir: ("Directory to write models to on each epoch", "positional", None, str), output_dir: ("Directory to write models to on each epoch", "positional", None, str),
width: ("Width of CNN layers", "option", "cw", int) = 96, width: ("Width of CNN layers", "option", "cw", int) = 96,
depth: ("Depth of CNN layers", "option", "cd", int) = 4, conv_depth: ("Depth of CNN layers", "option", "cd", int) = 4,
bilstm_depth: ("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int) = 0, bilstm_depth: ("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int) = 0,
cnn_pieces: ("Maxout size for CNN layers. 1 for Mish", "option", "cP", int) = 3, cnn_pieces: ("Maxout size for CNN layers. 1 for Mish", "option", "cP", int) = 3,
sa_depth: ("Depth of self-attention layers", "option", "sa", int) = 0, sa_depth: ("Depth of self-attention layers", "option", "sa", int) = 0,
@ -77,9 +75,15 @@ def pretrain(
msg.info("Using GPU" if has_gpu else "Not using GPU") msg.info("Using GPU" if has_gpu else "Not using GPU")
output_dir = Path(output_dir) output_dir = Path(output_dir)
if output_dir.exists() and [p for p in output_dir.iterdir()]:
msg.warn(
"Output directory is not empty",
"It is better to use an empty directory or refer to a new output path, "
"then the new directory will be created for you.",
)
if not output_dir.exists(): if not output_dir.exists():
output_dir.mkdir() output_dir.mkdir()
msg.good("Created output directory") msg.good(f"Created output directory: {output_dir}")
srsly.write_json(output_dir / "config.json", config) srsly.write_json(output_dir / "config.json", config)
msg.good("Saved settings to config.json") msg.good("Saved settings to config.json")
@ -107,7 +111,7 @@ def pretrain(
Tok2Vec( Tok2Vec(
width, width,
embed_rows, embed_rows,
conv_depth=depth, conv_depth=conv_depth,
pretrained_vectors=pretrained_vectors, pretrained_vectors=pretrained_vectors,
bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental. bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental.
subword_features=not use_chars, # Set to False for Chinese etc subword_features=not use_chars, # Set to False for Chinese etc
@ -1,7 +1,7 @@
import os import os
import tqdm import tqdm
from pathlib import Path from pathlib import Path
from thinc.backends import use_ops from thinc.api import use_ops
from timeit import default_timer as timer from timeit import default_timer as timer
import shutil import shutil
import srsly import srsly
@ -10,6 +10,7 @@ import contextlib
import random import random
from ..util import create_default_optimizer from ..util import create_default_optimizer
from ..util import use_gpu as set_gpu
from ..attrs import PROB, IS_OOV, CLUSTER, LANG from ..attrs import PROB, IS_OOV, CLUSTER, LANG
from ..gold import GoldCorpus from ..gold import GoldCorpus
from .. import util from .. import util
@ -26,6 +27,14 @@ def train(
base_model: ("Name of model to update (optional)", "option", "b", str) = None, base_model: ("Name of model to update (optional)", "option", "b", str) = None,
pipeline: ("Comma-separated names of pipeline components", "option", "p", str) = "tagger,parser,ner", pipeline: ("Comma-separated names of pipeline components", "option", "p", str) = "tagger,parser,ner",
vectors: ("Model to load vectors from", "option", "v", str) = None, vectors: ("Model to load vectors from", "option", "v", str) = None,
replace_components: ("Replace components from base model", "flag", "R", bool) = False,
width: ("Width of CNN layers of Tok2Vec component", "option", "cw", int) = 96,
conv_depth: ("Depth of CNN layers of Tok2Vec component", "option", "cd", int) = 4,
cnn_window: ("Window size for CNN layers of Tok2Vec component", "option", "cW", int) = 1,
cnn_pieces: ("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int) = 3,
use_chars: ("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool) = False,
bilstm_depth: ("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int) = 0,
embed_rows: ("Number of embedding rows of Tok2Vec component", "option", "er", int) = 2000,
n_iter: ("Number of iterations", "option", "n", int) = 30, n_iter: ("Number of iterations", "option", "n", int) = 30,
n_early_stopping: ("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int) = None, n_early_stopping: ("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int) = None,
n_examples: ("Number of examples", "option", "ns", int) = 0, n_examples: ("Number of examples", "option", "ns", int) = 0,
@ -80,6 +89,7 @@ def train(
) )
if not output_path.exists(): if not output_path.exists():
output_path.mkdir() output_path.mkdir()
msg.good(f"Created output directory: {output_path}")
tag_map = {} tag_map = {}
if tag_map_path is not None: if tag_map_path is not None:
@ -113,6 +123,21 @@ def train(
# training starts from a blank model, intitalize the language class. # training starts from a blank model, intitalize the language class.
pipeline = [p.strip() for p in pipeline.split(",")] pipeline = [p.strip() for p in pipeline.split(",")]
msg.text(f"Training pipeline: {pipeline}") msg.text(f"Training pipeline: {pipeline}")
disabled_pipes = None
pipes_added = False
msg.text(f"Training pipeline: {pipeline}")
if use_gpu >= 0:
activated_gpu = None
try:
activated_gpu = set_gpu(use_gpu)
except Exception as e:
msg.warn(f"Exception: {e}")
if activated_gpu is not None:
msg.text(f"Using GPU: {use_gpu}")
else:
msg.warn(f"Unable to activate GPU: {use_gpu}")
msg.text("Using CPU only")
use_gpu = -1
if base_model: if base_model:
msg.text(f"Starting with base model '{base_model}'") msg.text(f"Starting with base model '{base_model}'")
nlp = util.load_model(base_model) nlp = util.load_model(base_model)
@ -122,9 +147,8 @@ def train(
f"specified as `lang` argument ('{lang}') ", f"specified as `lang` argument ('{lang}') ",
exits=1, exits=1,
) )
nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
for pipe in pipeline: for pipe in pipeline:
if pipe not in nlp.pipe_names: pipe_cfg = {}
if pipe == "parser": if pipe == "parser":
pipe_cfg = {"learn_tokens": learn_tokens} pipe_cfg = {"learn_tokens": learn_tokens}
elif pipe == "textcat": elif pipe == "textcat":
@ -133,9 +157,14 @@ def train(
"architecture": textcat_arch, "architecture": textcat_arch,
"positive_label": textcat_positive_label, "positive_label": textcat_positive_label,
} }
else: if pipe not in nlp.pipe_names:
pipe_cfg = {} msg.text(f"Adding component to base model '{pipe}'")
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg)) nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
pipes_added = True
elif replace_components:
msg.text(f"Replacing component from base model '{pipe}'")
nlp.replace_pipe(pipe, nlp.create_pipe(pipe, config=pipe_cfg))
pipes_added = True
else: else:
if pipe == "textcat": if pipe == "textcat":
textcat_cfg = nlp.get_pipe("textcat").cfg textcat_cfg = nlp.get_pipe("textcat").cfg
@ -144,11 +173,6 @@ def train(
"architecture": textcat_cfg["architecture"], "architecture": textcat_cfg["architecture"],
"positive_label": textcat_cfg["positive_label"], "positive_label": textcat_cfg["positive_label"],
} }
pipe_cfg = {
"exclusive_classes": not textcat_multilabel,
"architecture": textcat_arch,
"positive_label": textcat_positive_label,
}
if base_cfg != pipe_cfg: if base_cfg != pipe_cfg:
msg.fail( msg.fail(
f"The base textcat model configuration does" f"The base textcat model configuration does"
@ -156,6 +180,10 @@ def train(
f"Existing cfg: {base_cfg}, provided cfg: {pipe_cfg}", f"Existing cfg: {base_cfg}, provided cfg: {pipe_cfg}",
exits=1, exits=1,
) )
msg.text(f"Extending component from base model '{pipe}'")
disabled_pipes = nlp.disable_pipes(
[p for p in nlp.pipe_names if p not in pipeline]
)
else: else:
msg.text(f"Starting with blank model '{lang}'") msg.text(f"Starting with blank model '{lang}'")
lang_cls = util.get_lang_class(lang) lang_cls = util.get_lang_class(lang)
@ -198,13 +226,20 @@ def train(
corpus = GoldCorpus(train_path, dev_path, limit=n_examples) corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
n_train_words = corpus.count_train() n_train_words = corpus.count_train()
if base_model: if base_model and not pipes_added:
# Start with an existing model, use default optimizer # Start with an existing model, use default optimizer
optimizer = create_default_optimizer() optimizer = create_default_optimizer()
else: else:
# Start with a blank model, call begin_training # Start with a blank model, call begin_training
optimizer = nlp.begin_training(lambda: corpus.train_examples, device=use_gpu) cfg = {"device": use_gpu}
cfg["conv_depth"] = conv_depth
cfg["token_vector_width"] = width
cfg["bilstm_depth"] = bilstm_depth
cfg["cnn_maxout_pieces"] = cnn_pieces
cfg["embed_size"] = embed_rows
cfg["conv_window"] = cnn_window
cfg["subword_features"] = not use_chars
optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg)
nlp._optimizer = None nlp._optimizer = None
# Load in pretrained weights # Load in pretrained weights
@ -214,7 +249,7 @@ def train(
# Verify textcat config # Verify textcat config
if "textcat" in pipeline: if "textcat" in pipeline:
textcat_labels = nlp.get_pipe("textcat").cfg["labels"] textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
if textcat_positive_label and textcat_positive_label not in textcat_labels: if textcat_positive_label and textcat_positive_label not in textcat_labels:
msg.fail( msg.fail(
f"The textcat_positive_label (tpl) '{textcat_positive_label}' " f"The textcat_positive_label (tpl) '{textcat_positive_label}' "
@ -327,12 +362,22 @@ def train(
for batch in util.minibatch_by_words(train_data, size=batch_sizes): for batch in util.minibatch_by_words(train_data, size=batch_sizes):
if not batch: if not batch:
continue continue
docs, golds = zip(*batch)
try:
nlp.update( nlp.update(
batch, docs,
golds,
sgd=optimizer, sgd=optimizer,
drop=next(dropout_rates), drop=next(dropout_rates),
losses=losses, losses=losses,
) )
except ValueError as e:
msg.warn("Error during training")
if init_tok2vec:
msg.warn(
"Did you provide the same parameters during 'train' as during 'pretrain'?"
)
msg.fail(f"Original error message: {e}", exits=1)
if raw_text: if raw_text:
# If raw text is available, perform 'rehearsal' updates, # If raw text is available, perform 'rehearsal' updates,
# which use unlabelled data to reduce overfitting. # which use unlabelled data to reduce overfitting.
@ -396,11 +441,16 @@ def train(
"cpu": cpu_wps, "cpu": cpu_wps,
"gpu": gpu_wps, "gpu": gpu_wps,
} }
meta["accuracy"] = scorer.scores meta.setdefault("accuracy", {})
for component in nlp.pipe_names:
for metric in _get_metrics(component):
meta["accuracy"][metric] = scorer.scores[metric]
else: else:
meta.setdefault("beam_accuracy", {}) meta.setdefault("beam_accuracy", {})
meta.setdefault("beam_speed", {}) meta.setdefault("beam_speed", {})
meta["beam_accuracy"][beam_width] = scorer.scores for component in nlp.pipe_names:
for metric in _get_metrics(component):
meta["beam_accuracy"][metric] = scorer.scores[metric]
meta["beam_speed"][beam_width] = { meta["beam_speed"][beam_width] = {
"nwords": nwords, "nwords": nwords,
"cpu": cpu_wps, "cpu": cpu_wps,
@ -453,13 +503,19 @@ def train(
f"Best score = {best_score}; Final iteration score = {current_score}" f"Best score = {best_score}; Final iteration score = {current_score}"
) )
break break
except Exception as e:
msg.warn(f"Aborting and saving final best model. Encountered exception: {e}")
finally: finally:
best_pipes = nlp.pipe_names
if disabled_pipes:
disabled_pipes.restore()
with nlp.use_params(optimizer.averages): with nlp.use_params(optimizer.averages):
final_model_path = output_path / "model-final" final_model_path = output_path / "model-final"
nlp.to_disk(final_model_path) nlp.to_disk(final_model_path)
final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
msg.good("Saved model to output directory", final_model_path) msg.good("Saved model to output directory", final_model_path)
with msg.loading("Creating best model..."): with msg.loading("Creating best model..."):
best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names) best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
msg.good("Created best model", best_model_path) msg.good("Created best model", best_model_path)
@ -519,15 +575,14 @@ def _load_pretrained_tok2vec(nlp, loc):
def _collate_best_model(meta, output_path, components): def _collate_best_model(meta, output_path, components):
bests = {} bests = {}
meta.setdefault("accuracy", {})
for component in components: for component in components:
bests[component] = _find_best(output_path, component) bests[component] = _find_best(output_path, component)
best_dest = output_path / "model-best" best_dest = output_path / "model-best"
shutil.copytree(str(output_path / "model-final"), str(best_dest)) shutil.copytree(str(output_path / "model-final"), str(best_dest))
for component, best_component_src in bests.items(): for component, best_component_src in bests.items():
shutil.rmtree(str(best_dest / component)) shutil.rmtree(str(best_dest / component))
shutil.copytree( shutil.copytree(str(best_component_src / component), str(best_dest / component))
str(best_component_src / component), str(best_dest / component)
)
accs = srsly.read_json(best_component_src / "accuracy.json") accs = srsly.read_json(best_component_src / "accuracy.json")
for metric in _get_metrics(component): for metric in _get_metrics(component):
meta["accuracy"][metric] = accs[metric] meta["accuracy"][metric] = accs[metric]
@ -550,13 +605,15 @@ def _find_best(experiment_dir, component):
def _get_metrics(component): def _get_metrics(component):
if component == "parser": if component == "parser":
return ("las", "uas", "token_acc", "sent_f") return ("las", "uas", "las_per_type", "token_acc", "sent_f")
elif component == "tagger": elif component == "tagger":
return ("tags_acc",) return ("tags_acc",)
elif component == "ner": elif component == "ner":
return ("ents_f", "ents_p", "ents_r") return ("ents_f", "ents_p", "ents_r", "enty_per_type")
elif component == "sentrec": elif component == "sentrec":
return ("sent_f", "sent_p", "sent_r") return ("sent_f", "sent_p", "sent_r")
elif component == "textcat":
return ("textcat_score",)
return ("token_acc",) return ("token_acc",)
@ -568,8 +625,12 @@ def _configure_training_output(pipeline, use_gpu, has_beam_widths):
row_head.extend(["Tag Loss ", " Tag % "]) row_head.extend(["Tag Loss ", " Tag % "])
output_stats.extend(["tag_loss", "tags_acc"]) output_stats.extend(["tag_loss", "tags_acc"])
elif pipe == "parser": elif pipe == "parser":
row_head.extend(["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"]) row_head.extend(
output_stats.extend(["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"]) ["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"]
)
output_stats.extend(
["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"]
)
elif pipe == "ner": elif pipe == "ner":
row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "]) row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "])
output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"]) output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"])
@ -1,19 +1,20 @@
from typing import Optional, Dict, List, Union, Sequence
import plac import plac
from thinc.util import require_gpu
from wasabi import msg from wasabi import msg
from pathlib import Path from pathlib import Path
import thinc import thinc
import thinc.schedules import thinc.schedules
from thinc.model import Model from thinc.api import Model
from spacy.gold import GoldCorpus
import spacy
from spacy.pipeline.tok2vec import Tok2VecListener
from typing import Optional, Dict, List, Union, Sequence
from pydantic import BaseModel, FilePath, StrictInt from pydantic import BaseModel, FilePath, StrictInt
import tqdm import tqdm
from ..ml import component_models # TODO: relative imports?
from .. import util import spacy
from spacy.gold import GoldCorpus
from spacy.pipeline.tok2vec import Tok2VecListener
from spacy.ml import component_models
from spacy import util
registry = util.registry registry = util.registry
@ -153,10 +154,9 @@ def create_tb_parser_model(
hidden_width: StrictInt = 64, hidden_width: StrictInt = 64,
maxout_pieces: StrictInt = 3, maxout_pieces: StrictInt = 3,
): ):
from thinc.layers import Linear, chain, list2array from thinc.api import Linear, chain, list2array, use_ops, zero_init
from spacy.ml._layers import PrecomputableAffine from spacy.ml._layers import PrecomputableAffine
from spacy.syntax._parser_model import ParserModel from spacy.syntax._parser_model import ParserModel
from thinc.api import use_ops, zero_init
token_vector_width = tok2vec.get_dim("nO") token_vector_width = tok2vec.get_dim("nO")
tok2vec = chain(tok2vec, list2array()) tok2vec = chain(tok2vec, list2array())
@ -221,13 +221,9 @@ def train_from_config_cli(
def train_from_config( def train_from_config(
config_path, config_path, data_paths, raw_text=None, meta_path=None, output_path=None,
data_paths,
raw_text=None,
meta_path=None,
output_path=None,
): ):
msg.info("Loading config from: {}".format(config_path)) msg.info(f"Loading config from: {config_path}")
config = util.load_from_config(config_path, create_objects=True) config = util.load_from_config(config_path, create_objects=True)
use_gpu = config["training"]["use_gpu"] use_gpu = config["training"]["use_gpu"]
if use_gpu >= 0: if use_gpu >= 0:
@ -241,9 +237,7 @@ def train_from_config(
msg.info("Loading training corpus") msg.info("Loading training corpus")
corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit) corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit)
msg.info("Initializing the nlp pipeline") msg.info("Initializing the nlp pipeline")
nlp.begin_training( nlp.begin_training(lambda: corpus.train_examples, device=use_gpu)
lambda: corpus.train_examples, device=use_gpu
)
train_batches = create_train_batches(nlp, corpus, config["training"]) train_batches = create_train_batches(nlp, corpus, config["training"])
evaluate = create_evaluation_callback(nlp, optimizer, corpus, config["training"]) evaluate = create_evaluation_callback(nlp, optimizer, corpus, config["training"])
@ -260,7 +254,7 @@ def train_from_config(
config["training"]["eval_frequency"], config["training"]["eval_frequency"],
) )
msg.info("Training. Initial learn rate: {}".format(optimizer.learn_rate)) msg.info(f"Training. Initial learn rate: {optimizer.learn_rate}")
print_row = setup_printer(config) print_row = setup_printer(config)
try: try:
@ -414,7 +408,7 @@ def subdivide_batch(batch):
def setup_printer(config): def setup_printer(config):
score_cols = config["training"]["scores"] score_cols = config["training"]["scores"]
score_widths = [max(len(col), 6) for col in score_cols] score_widths = [max(len(col), 6) for col in score_cols]
loss_cols = ["Loss {}".format(pipe) for pipe in config["nlp"]["pipeline"]] loss_cols = [f"Loss {pipe}" for pipe in config["nlp"]["pipeline"]]
loss_widths = [max(len(col), 8) for col in loss_cols] loss_widths = [max(len(col), 8) for col in loss_cols]
table_header = ["#"] + loss_cols + score_cols + ["Score"] table_header = ["#"] + loss_cols + score_cols + ["Score"]
table_header = [col.upper() for col in table_header] table_header = [col.upper() for col in table_header]
@@ -29,7 +29,7 @@ try:
 except ImportError:
     cupy = None

-from thinc.optimizers import Optimizer  # noqa: F401
+from thinc.api import Optimizer  # noqa: F401

 pickle = pickle
 copy_reg = copy_reg
@@ -51,9 +51,10 @@ def render(
         html = RENDER_WRAPPER(html)
     if jupyter or (jupyter is None and is_in_jupyter()):
         # return HTML rendered by IPython display()
+        # See #4840 for details on span wrapper to disable mathjax
         from IPython.core.display import display, HTML

-        return display(HTML(html))
+        return display(HTML('<span class="tex2jax_ignore">{}</span>'.format(html)))
     return html
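A small, hedged usage sketch of the two render paths above: passing jupyter=False forces the plain-HTML branch so the raw markup string is returned, while inside a notebook the output is wrapped in the tex2jax_ignore span so MathJax leaves it alone.

import spacy
from spacy import displacy

nlp = spacy.blank("en")
doc = nlp("Mark Zuckerberg is the CEO of Facebook.")

html = displacy.render(doc, style="dep", jupyter=False)  # raw HTML string, no notebook wrapper
print(html[:60])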
@@ -1,4 +1,3 @@
 # Setting explicit height and max-width: none on the SVG is required for
 # Jupyter to render it properly in a cell
@@ -75,10 +75,9 @@ class Warnings(object):
     W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
             "being serialized or deserialized is deprecated. Please use the "
             "`exclude` argument instead. For example: exclude=['{arg}'].")
-    W016 = ("The keyword argument `n_threads` on the is now deprecated, as "
-            "the v2.x models cannot release the global interpreter lock. "
-            "Future versions may introduce a `n_process` argument for "
-            "parallel inference via multiprocessing.")
+    W016 = ("The keyword argument `n_threads` is now deprecated. As of v2.2.2, "
+            "the argument `n_process` controls parallel inference via "
+            "multiprocessing.")
     W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
     W018 = ("Entity '{entity}' already exists in the Knowledge Base - "
             "ignoring the duplicate entry.")
@@ -170,7 +169,8 @@ class Errors(object):
             "and satisfies the correct annotations specified in the GoldParse. "
             "For example, are all labels added to the model? If you're "
             "training a named entity recognizer, also make sure that none of "
-            "your annotated entity spans have leading or trailing whitespace. "
+            "your annotated entity spans have leading or trailing whitespace "
+            "or punctuation. "
             "You can also use the experimental `debug-data` command to "
             "validate your JSON-formatted training data. For details, run:\n"
             "python -m spacy debug-data --help")
@@ -536,8 +536,8 @@ class Errors(object):
     E997 = ("Tokenizer special cases are not allowed to modify the text. "
             "This would map '{chunk}' to '{orth}' given token attributes "
             "'{token_attrs}'.")
-    E998 = ("Can only create GoldParse's from Example's without a Doc, "
-            "if get_gold_parses() is called with a Vocab object.")
+    E998 = ("Can only create GoldParse objects from Example objects without a "
+            "Doc if get_gold_parses() is called with a Vocab object.")
     E999 = ("Encountered an unexpected format for the dictionary holding "
             "gold annotations: {gold_dict}")
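The reworded W016 warning above points users at `n_process`; a minimal sketch of that replacement API (available since spaCy 2.2.2), using a blank pipeline so the snippet runs without any model installed:

import spacy

def main():
    nlp = spacy.blank("en")
    texts = ["First text.", "Second text.", "Third text.", "Fourth text."]
    # n_process runs the pipeline in separate worker processes
    for doc in nlp.pipe(texts, n_process=2, batch_size=2):
        print(len(doc))

if __name__ == "__main__":
    main()  # the __main__ guard keeps multiprocessing safe on all platforms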
@@ -1,4 +1,3 @@
 def explain(term):
     """Get a description for a given POS tag, dependency label or entity type.
@@ -1,6 +1,6 @@
 from cymem.cymem cimport Pool

-from spacy.tokens import Doc
+from .tokens import Doc
 from .typedefs cimport attr_t
 from .syntax.transition_system cimport Transition
@@ -65,5 +65,3 @@ cdef class Example:
     cdef public TokenAnnotation token_annotation
     cdef public DocAnnotation doc_annotation
     cdef public object goldparse
@@ -6,7 +6,7 @@ from libcpp.vector cimport vector
 from libc.stdint cimport int32_t, int64_t
 from libc.stdio cimport FILE

-from spacy.vocab cimport Vocab
+from .vocab cimport Vocab
 from .typedefs cimport hash_t
 from .structs cimport KBEntryC, AliasC
@@ -169,4 +169,3 @@ cdef class Reader:
     cdef int read_alias(self, int64_t* entry_index, float* prob) except -1
     cdef int _read(self, void* value, size_t size) except -1
@@ -1,4 +1,3 @@
 # Source: https://github.com/stopwords-iso/stopwords-af
 STOP_WORDS = set(
@@ -1,4 +1,3 @@
 # Source: https://github.com/Alir3z4/stop-words
 STOP_WORDS = set(
@@ -1,4 +1,3 @@
 """
 Example sentences to test spaCy and its language models.
@@ -1,4 +1,3 @@
 STOP_WORDS = set(
 """
 অতএব অথচ অথব অন অন অন অন অনতত অবধি অবশ অর অন অন অরধভ
@@ -1,4 +1,3 @@
 """
 Example sentences to test spaCy and its language models.
@@ -14,6 +14,17 @@ _tamil = r"\u0B80-\u0BFF"
 _telugu = r"\u0C00-\u0C7F"

+# from the final table in: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs
+_cjk = (
+    r"\u4E00-\u62FF\u6300-\u77FF\u7800-\u8CFF\u8D00-\u9FFF\u3400-\u4DBF"
+    r"\U00020000-\U000215FF\U00021600-\U000230FF\U00023100-\U000245FF"
+    r"\U00024600-\U000260FF\U00026100-\U000275FF\U00027600-\U000290FF"
+    r"\U00029100-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F"
+    r"\U0002B820-\U0002CEAF\U0002CEB0-\U0002EBEF\u2E80-\u2EFF\u2F00-\u2FDF"
+    r"\u2FF0-\u2FFF\u3000-\u303F\u31C0-\u31EF\u3200-\u32FF\u3300-\u33FF"
+    r"\uF900-\uFAFF\uFE30-\uFE4F\U0001F200-\U0001F2FF\U0002F800-\U0002FA1F"
+)
+
 # Latin standard
 _latin_u_standard = r"A-Z"
 _latin_l_standard = r"a-z"
@@ -212,6 +223,7 @@ _uncased = (
     + _tamil
     + _telugu
     + _hangul
+    + _cjk
 )

 ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
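A quick, hedged check of what the added ranges do, using only the standard `re` module and an abbreviated subset of the ranges from the hunk above:

import re

_cjk_subset = (
    r"\u4E00-\u62FF\u6300-\u77FF\u7800-\u8CFF\u8D00-\u9FFF\u3400-\u4DBF"
    r"\uF900-\uFAFF\uFE30-\uFE4F"
)
han = re.compile(r"[{}]".format(_cjk_subset))
print(bool(han.match("中")))  # True: CJK ideographs now fall inside the alphabetic classes
print(bool(han.match("a")))   # False: not in the CJK ranges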
@@ -1,4 +1,3 @@
 # Source: https://github.com/Alir3z4/stop-words
 STOP_WORDS = set(
@@ -1,4 +1,3 @@
 """
 Example sentences to test spaCy and its language models.
@@ -1,4 +1,3 @@
 """
 Example sentences to test spaCy and its language models.
@@ -1,4 +1,3 @@
 STOP_WORDS = set(
 """
 á a ab aber ach acht achte achten achter achtes ag alle allein allem allen
@@ -26,7 +25,7 @@ früher fünf fünfte fünften fünfter fünftes für
 gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
 geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
-gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
+gewesen gewollt geworden gibt ging gleich gross groß grosse große grossen
 großen grosser großer grosses großes gut gute guter gutes
 habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
@@ -44,9 +43,8 @@ kleines kommen kommt können könnt konnte könnte konnten kurz
 lang lange leicht leider lieber los
 machen macht machte mag magst man manche manchem manchen mancher manches mehr
-mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
-mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
-musste mussten
+mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
+mögen möglich mögt morgen muss muß müssen musst müsst musste mussten
 na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
 neuntes nicht nichts nie niemand niemandem niemanden noch nun nur
@@ -1,5 +1,5 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .tag_map_general import TAG_MAP
+from ..tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .lemmatizer import GreekLemmatizer
@@ -1,4 +1,3 @@
 def get_pos_from_wiktionary():
     import re
     from gensim.corpora.wikicorpus import extract_pages
@@ -1,4 +1,3 @@
 # These exceptions are used to add NORM values based on a token's ORTH value.
 # Norms are only set if no alternative is provided in the tokenizer exceptions.
@@ -1,4 +1,3 @@
 # Stop words
 # Link to greek stop words: https://www.translatum.gr/forum/index.php?topic=3550.0?topic=3550.0
 STOP_WORDS = set(
@ -1,24 +0,0 @@
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ...symbols import PUNCT, NUM, AUX, X, ADJ, VERB, PART, SPACE, CCONJ
TAG_MAP = {
    "ADJ": {POS: ADJ},
    "ADV": {POS: ADV},
    "INTJ": {POS: INTJ},
    "NOUN": {POS: NOUN},
    "PROPN": {POS: PROPN},
    "VERB": {POS: VERB},
    "ADP": {POS: ADP},
    "CCONJ": {POS: CCONJ},
    "SCONJ": {POS: SCONJ},
    "PART": {POS: PART},
    "PUNCT": {POS: PUNCT},
    "SYM": {POS: SYM},
    "NUM": {POS: NUM},
    "PRON": {POS: PRON},
    "AUX": {POS: AUX},
    "SPACE": {POS: SPACE},
    "DET": {POS: DET},
    "X": {POS: X},
}

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
_exc = {
    # Slang and abbreviations
    "cos": "because",

View File

@ -1,4 +1,3 @@
# Stop words
STOP_WORDS = set(
"""

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí

View File

@ -1,4 +1,3 @@
# Source: https://github.com/stopwords-iso/stopwords-et
STOP_WORDS = set(

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
verb_roots = """
#هست
آخت#آهنج

View File

@ -1,4 +1,3 @@
# Stop words from HAZM package
STOP_WORDS = set(
"""

View File

@ -1,9 +1,10 @@
-from ..char_classes import LIST_ELLIPSES, LIST_ICONS
+from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_HYPHENS
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..punctuation import TOKENIZER_SUFFIXES
_quotes = CONCAT_QUOTES.replace("'", "")
+DASHES = "|".join(x for x in LIST_HYPHENS if x != "-")
_infixes = (
    LIST_ELLIPSES
@ -11,11 +12,9 @@ _infixes = (
    + [
        r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
        r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
-        r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
-        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
-        r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}])(?:{d})(?=[{a}])".format(a=ALPHA, d=DASHES),
-        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}0-9])[<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
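
The hunk above swaps the hard-coded "--" infix for an alternation built from LIST_HYPHENS minus the plain ASCII hyphen, so dash-like characters between letters become split points while ordinary hyphenated words stay intact. A minimal re-only sketch of the same idea, using a hypothetical stand-in for spaCy's LIST_HYPHENS and a simplified letter class:

import re

# Hypothetical stand-in for LIST_HYPHENS (assumption, not spaCy's real list).
HYPHEN_LIKE = ["-", "\u2013", "\u2014", "\u2212"]  # hyphen, en dash, em dash, minus sign

# Mirror the diff: join every hyphen-like character except the plain ASCII "-".
DASHES = "|".join(re.escape(x) for x in HYPHEN_LIKE if x != "-")

ALPHA = "A-Za-zÀ-ÿ"  # simplified letter class, only for this sketch
infix = re.compile(r"(?<=[{a}])(?:{d})(?=[{a}])".format(a=ALPHA, d=DASHES))

print(infix.split("Helsinki\u2013Vantaa"))  # ['Helsinki', 'Vantaa']: en dash splits
print(infix.split("itse-oppinut"))          # ['itse-oppinut']: ASCII hyphen untouched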

View File

@ -1,4 +1,3 @@
# Source https://github.com/stopwords-iso/stopwords-fi/blob/master/stopwords-fi.txt
# Reformatted with some minor corrections
STOP_WORDS = set(

View File

@ -28,6 +28,9 @@ for exc_data in [
{ORTH: "myöh.", LEMMA: "myöhempi"}, {ORTH: "myöh.", LEMMA: "myöhempi"},
{ORTH: "n.", LEMMA: "noin"}, {ORTH: "n.", LEMMA: "noin"},
{ORTH: "nimim.", LEMMA: "nimimerkki"}, {ORTH: "nimim.", LEMMA: "nimimerkki"},
{ORTH: "n:o", LEMMA: "numero"},
{ORTH: "N:o", LEMMA: "numero"},
{ORTH: "nro", LEMMA: "numero"},
{ORTH: "ns.", LEMMA: "niin sanottu"}, {ORTH: "ns.", LEMMA: "niin sanottu"},
{ORTH: "nyk.", LEMMA: "nykyinen"}, {ORTH: "nyk.", LEMMA: "nykyinen"},
{ORTH: "oik.", LEMMA: "oikealla"}, {ORTH: "oik.", LEMMA: "oikealla"},

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
a à â abord absolument afin ah ai aie ailleurs ainsi ait allaient allo allons

View File

@ -1,4 +1,3 @@
# fmt: off
consonants = ["b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "z"]
broad_vowels = ["a", "á", "o", "ó", "u", "ú"]

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
# Source: https://github.com/taranjeet/hindi-tokenizer/blob/master/stopwords.txt, https://data.mendeley.com/datasets/bsr3frvvjc/1#file-a21d5092-99d7-45d8-b044-3ae9edd391c6
STOP_WORDS = set(

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
a abban ahhoz ahogy ahol aki akik akkor akár alatt amely amelyek amelyekben

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
# Source: https://github.com/Xangis/extra-stopwords
STOP_WORDS = set(

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
a abbastanza abbia abbiamo abbiano abbiate accidenti ad adesso affinche agl

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
ಹಲವ

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
# Source: https://github.com/stopwords-iso/stopwords-lv
STOP_WORDS = set(

View File

@ -1,4 +1,3 @@
# Source: https://github.com/stopwords-iso/stopwords-mr/blob/master/stopwords-mr.txt, https://github.com/6/stopwords-json/edit/master/dist/mr.json
STOP_WORDS = set(
"""

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
# These exceptions are used to add NORM values based on a token's ORTH value.
# Individual languages can also add their own exceptions and overwrite them -
# for example, British vs. American spelling in English.

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
_exc = {
    # Slang
    "прив": "привет",

View File

@ -1,4 +1,3 @@
""" """
Example sentences to test spaCy and its language models. Example sentences to test spaCy and its language models.

View File

@ -1,4 +1,3 @@
STOP_WORDS = set(
"""
අතර

View File

@ -1,11 +1,16 @@
from .stop_words import STOP_WORDS
+from .tag_map import TAG_MAP
+from .lex_attrs import LEX_ATTRS
from ...language import Language
from ...attrs import LANG
class SlovakDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "sk"
+    tag_map = TAG_MAP
    stop_words = STOP_WORDS
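
With LEX_ATTRS, TAG_MAP and the stop words attached to SlovakDefaults, a blank Slovak pipeline picks them up automatically. A hedged usage sketch against the spaCy 2.x API, assuming a build that ships the new sk data:

import spacy

nlp = spacy.blank("sk")  # constructed from SlovakDefaults
doc = nlp("Niekto mi povedal, že 500 eur je veľa peňazí.")

# like_num comes from the new lex_attrs, is_stop from the extended stop word list.
print([(t.text, t.like_num, t.is_stop) for t in doc])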

23
spacy/lang/sk/examples.py Normal file
View File

@ -0,0 +1,23 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.sk.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
    "Ardevop, s.r.o. je malá startup firma na území SR.",
    "Samojazdiace autá presúvajú poistnú zodpovednosť na výrobcov automobilov.",
    "Košice sú na východe.",
    "Bratislava je hlavné mesto Slovenskej republiky.",
    "Kde si?",
    "Kto je prezidentom Francúzska?",
    "Aké je hlavné mesto Slovenska?",
    "Kedy sa narodil Andrej Kiska?",
    "Včera som dostal 100€ na ruku.",
    "Dnes je nedeľa 26.1.2020.",
    "Narodil sa 15.4.1998 v Ružomberku.",
    "Niekto mi povedal, že 500 eur je veľa peňazí.",
    "Podaj mi ruku!",
]

View File

@ -0,0 +1,59 @@
from ...attrs import LIKE_NUM

_num_words = [
    "nula",
    "jeden",
    "dva",
    "tri",
    "štyri",
    "päť",
    "šesť",
    "sedem",
    "osem",
    "deväť",
    "desať",
    "jedenásť",
    "dvanásť",
    "trinásť",
    "štrnásť",
    "pätnásť",
    "šestnásť",
    "sedemnásť",
    "osemnásť",
    "devätnásť",
    "dvadsať",
    "tridsať",
    "štyridsať",
    "päťdesiat",
    "šesťdesiat",
    "sedemdesiat",
    "osemdesiat",
    "deväťdesiat",
    "sto",
    "tisíc",
    "milión",
    "miliarda",
    "bilión",
    "biliarda",
    "trilión",
    "triliarda",
    "kvadrilión",
]


def like_num(text):
    # Strip a leading sign before checking the rest of the string.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Drop separators so forms like "1,000" and "3.14" count as numbers.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "3/4".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Slovak number words from the list above.
    if text.lower() in _num_words:
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}
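
like_num strips a leading sign, drops "," and "." separators, and then accepts digit strings, simple fractions and the Slovak number words listed above. A quick illustration, assuming the module is importable as spacy.lang.sk.lex_attrs:

from spacy.lang.sk.lex_attrs import like_num

print(like_num("100"))      # True: plain digits
print(like_num("-2,5"))     # True: sign stripped, separator removed
print(like_num("3/4"))      # True: numerator and denominator are digits
print(like_num("Päť"))      # True: lower-cased match against _num_words
print(like_num("peniaze"))  # False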

View File

@ -1,5 +1,4 @@
# Source: https://github.com/Ardevop-sk/stopwords-sk
# Source: https://github.com/stopwords-iso/stopwords-sk
STOP_WORDS = set(
"""
@ -7,17 +6,41 @@ a
aby aby
aj aj
ak ak
akej
akejže
ako ako
akom
akomže
akou
akouže
akože
aká
akáže
aké
akého
akéhože
akému
akémuže
akéže
akú
akúže
aký aký
akých
akýchže
akým
akými
akýmiže
akýmže
akýže
ale ale
alebo alebo
and
ani ani
asi asi
avšak avšak
ba ba
bez bez
bezo
bol bol
bola bola
boli boli
@ -28,23 +51,32 @@ budeme
budete budete
budeš budeš
budú budú
buï
buď buď
by by
byť byť
cez cez
cezo
dnes dnes
do do
ešte ešte
for
ho ho
hoci hoci
i i
iba iba
ich ich
im im
inej
inom
iná
iné iné
iného
inému
iní
inú
iný iný
iných
iným
inými
ja ja
je je
jeho jeho
@ -53,80 +85,185 @@ jemu
ju ju
k k
kam kam
kamže
každou
každá každá
každé každé
každého
každému
každí každí
každú
každý každý
každých
každým
každými
kde kde
-kedže
-keï
+kej
+kejže
keď
keďže
kie
kieho
kiehože
kiemu
kiemuže
kieže
koho
kom
komu
kou
kouže
kto kto
ktorej
ktorou ktorou
ktorá ktorá
ktoré ktoré
ktorí ktorí
ktorú
ktorý ktorý
ktorých
ktorým
ktorými
ku ku
káže
kéže
kúže
kýho
kýhože
kým
kýmu
kýmuže
kýže
lebo lebo
leda
ledaže
len len
ma ma
majú
mal
mala
mali
mať mať
medzi medzi
menej
mi mi
mna
mne mne
mnou mnou
moja moja
moje moje
mojej
mojich
mojim
mojimi
mojou
moju
možno
mu mu
musia
musieť musieť
musí
musím
musíme
musíte
musíš
my my
mám
máme
máte máte
-mòa
+máš
môcť môcť
môj môj
môjho
môže môže
môžem
môžeme
môžete
môžeš
môžu
mňa
na na
nad nad
nado
najmä
nami nami
naša
naše
našej
naši naši
našich
našim
našimi
našou
ne
nech nech
neho neho
nej nej
nejakej
nejakom
nejakou
nejaká
nejaké
nejakého
nejakému
nejakú
nejaký
nejakých
nejakým
nejakými
nemu nemu
než než
nich nich
nie nie
niektorej
niektorom
niektorou
niektorá
niektoré
niektorého
niektorému
niektorú
niektorý niektorý
niektorých
niektorým
niektorými
nielen nielen
niečo
nim nim
nimi
nič nič
ničoho
ničom
ničomu
ničím
no no
nová
nové
noví
nový
nám nám
nás nás
náš náš
nášho
ním ním
o o
od od
odo odo
of
on on
ona ona
oni oni
ono ono
ony ony
oňho
po po
pod pod
podo
podľa podľa
pokiaľ pokiaľ
popod
popri
potom potom
poza
pre pre
pred pred
predo predo
@ -134,42 +271,56 @@ preto
pretože pretože
prečo prečo
pri pri
prvá
prvé
prví
prvý
práve práve
pýta
s s
sa sa
seba seba
sebe
sebou
sem sem
si si
sme sme
so so
som som
späť
ste ste
svoj svoj
svoja
svoje svoje
svojho
svojich svojich
svojim
svojimi
svojou
svoju
svojím svojím
svojími
ta ta
tak tak
takej
takejto
taká
takáto
také
takého
takéhoto
takému
takémuto
takéto
takí
takú
takúto
taký taký
takýto
takže takže
tam tam
te
teba teba
tebe tebe
tebou tebou
teda teda
tej tej
tejto
ten ten
tento tento
the
ti ti
tie tie
tieto tieto
@ -177,52 +328,97 @@ tiež
to to
toho toho
tohoto tohoto
tohto
tom tom
tomto tomto
tomu tomu
tomuto tomuto
toto toto
tou tou
touto
tu tu
tvoj tvoj
-tvojími
+tvoja
tvoje
tvojej
tvojho
tvoji
tvojich
tvojim
tvojimi
tvojím
ty ty
táto táto
títo
túto túto
tých
tým tým
tými
týmto týmto
u
v v
vami vami
vaša
vaše vaše
-veï
+vašej
vaši
vašich
vašim
vaším
veď
viac viac
vo vo
vy vy
vám vám
vás vás
váš váš
vášho
však však
všetci
všetka
všetko
všetky
všetok všetok
z z
za za
začo
začože
zo zo
a
áno áno
-èi
+čej
èo
èí
òom
òou
òu
či či
čia
čie
čieho
čiemu
čiu
čo čo
čoho
čom
čomu
čou
čože
čí
čím
čími
ďalšia ďalšia
ďalšie ďalšie
ďalšieho
ďalšiemu
ďalšiu
ďalšom
ďalšou
ďalší ďalší
ďalších
ďalším
ďalšími
ňom
ňou
ňu
že že
""".split() """.split()
) )

Some files were not shown because too many files have changed in this diff.