Merge pull request #5788 from explosion/master-tmp

Commit 311d0bde29

106  .github/contributors/PluieElectrique.md  (vendored, new file)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                 |
| ------------------------------ | --------------------- |
| Name                           | Pluie                 |
| Company name (if applicable)   |                       |
| Title or role (if applicable)  |                       |
| Date                           | 2020-06-18            |
| GitHub username                | PluieElectrique       |
| Website (optional)             |                       |

106  .github/contributors/abchapman93.md  (vendored, new file)

@@ -0,0 +1,106 @@

The file adds the standard spaCy contributor agreement, identical in wording to
`.github/contributors/PluieElectrique.md` above, signed as an individual
contributor, with the following contributor details:

| Field                          | Entry                 |
| ------------------------------ | --------------------- |
| Name                           | Alec Chapman          |
| Company name (if applicable)   |                       |
| Title or role (if applicable)  |                       |
| Date                           | 7/17/2020             |
| GitHub username                | abchapman93           |
| Website (optional)             |                       |

106  .github/contributors/gandersen101.md  (vendored, new file)

@@ -0,0 +1,106 @@

The file adds the standard spaCy contributor agreement, identical in wording to
`.github/contributors/PluieElectrique.md` above, signed as an individual
contributor, with the following contributor details:

| Field                          | Entry                 |
| ------------------------------ | --------------------- |
| Name                           | Grant Andersen        |
| Company name (if applicable)   |                       |
| Title or role (if applicable)  |                       |
| Date                           | 07.06.2020            |
| GitHub username                | gandersen101          |
| Website (optional)             |                       |

106  .github/contributors/jbesomi.md  (vendored, new file)

@@ -0,0 +1,106 @@

The file adds the standard spaCy contributor agreement, identical in wording to
`.github/contributors/PluieElectrique.md` above, signed as an individual
contributor, with the following contributor details:

| Field                          | Entry                 |
| ------------------------------ | --------------------- |
| Name                           | Jonathan B.           |
| Company name (if applicable)   | besomi.ai             |
| Title or role (if applicable)  | -                     |
| Date                           | 07.07.2020            |
| GitHub username                | jbesomi               |
| Website (optional)             | besomi.ai             |

106  .github/contributors/mikeizbicki.md  (vendored, new file)

@@ -0,0 +1,106 @@

The file adds the standard spaCy contributor agreement, identical in wording to
`.github/contributors/PluieElectrique.md` above, signed as an individual
contributor, with the following contributor details:

| Field                          | Entry                 |
| ------------------------------ | --------------------- |
| Name                           | Mike Izbicki          |
| Company name (if applicable)   |                       |
| Title or role (if applicable)  |                       |
| Date                           | 02 Jun 2020           |
| GitHub username                | mikeizbicki           |
| Website (optional)             | https://izbicki.me    |

106  .github/contributors/rameshhpathak.md  (vendored, new file)

@@ -0,0 +1,106 @@

The file adds the standard spaCy contributor agreement, identical in wording to
`.github/contributors/PluieElectrique.md` above, signed as an individual
contributor, with the following contributor details:

| Field                          | Entry                   |
| ------------------------------ | ----------------------- |
| Name                           | Ramesh Pathak           |
| Company name (if applicable)   | Diyo AI                 |
| Title or role (if applicable)  | AI Engineer             |
| Date                           | June 21, 2020           |
| GitHub username                | rameshhpathak           |
| Website (optional)             | rameshhpathak.github.io |

106  .github/contributors/richardliaw.md  (vendored, new file)

@@ -0,0 +1,106 @@

The file adds the standard spaCy contributor agreement, identical in wording to
`.github/contributors/PluieElectrique.md` above, signed as an individual
contributor, with the following contributor details:

| Field                          | Entry                 |
| ------------------------------ | --------------------- |
| Name                           | Richard Liaw          |
| Company name (if applicable)   |                       |
| Title or role (if applicable)  |                       |
| Date                           | 06/22/2020            |
| GitHub username                | richardliaw           |
| Website (optional)             |                       |

1  .gitignore  (vendored)

@@ -71,6 +71,7 @@ Pipfile.lock
 *.egg
 .eggs
 MANIFEST
+spacy/git_info.py

 # Temporary files
 *.~*

@@ -5,3 +5,4 @@ include README.md
 include pyproject.toml
 recursive-exclude spacy/lang *.json
 recursive-include spacy/lang *.json.gz
+recursive-include licenses *

4  Makefile

@@ -5,7 +5,7 @@ VENV := ./env$(PYVER)
 version := $(shell "bin/get-version.sh")

 dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
-	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core
+	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core
 	chmod a+rx $@
 	cp $@ dist/spacy.pex

@@ -15,7 +15,7 @@ dist/pytest.pex : wheelhouse/pytest-*.whl
 wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
 	$(VENV)/bin/pip wheel . -w ./wheelhouse
-	$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core -w ./wheelhouse
+	$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core -w ./wheelhouse
 	touch $@

 wheelhouse/pytest-%.whl : $(VENV)/bin/pex

@@ -16,8 +16,6 @@ from __future__ import unicode_literals, print_function
 import plac
 import random
 from pathlib import Path

-from spacy.vocab import Vocab
 import spacy
 from spacy.kb import KnowledgeBase

@@ -61,13 +59,13 @@ TRAIN_DATA = sample_train_data()
     output_dir=("Optional output directory", "option", "o", Path),
     n_iter=("Number of training iterations", "option", "n", int),
 )
-def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
+def main(kb_path, vocab_path, output_dir=None, n_iter=50):
     """Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
     The `vocab` should be the one used during creation of the KB."""
-    vocab = Vocab().from_disk(vocab_path)
     # create blank English model with correct vocab
-    nlp = spacy.blank("en", vocab=vocab)
-    nlp.vocab.vectors.name = "nel_vectors"
+    nlp = spacy.blank("en")
+    nlp.vocab.from_disk(vocab_path)
+    nlp.vocab.vectors.name = "spacy_pretrained_vectors"
     print("Created blank 'en' model with vocab from '%s'" % vocab_path)

     # Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.

@@ -96,7 +94,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
     # Convert the texts to docs to make sure we have doc.ents set for the training examples.
     # Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
     kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
     train_examples = []
     for text, annotation in TRAIN_DATA:
         with nlp.select_pipes(disable="entity_linker"):
             doc = nlp(text)

@@ -111,7 +109,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
                     "Removed", kb_id, "from training because it is not in the KB."
                 )
             annotation_clean["links"][offset] = new_dict
-        train_examples .append(Example.from_dict(doc, annotation_clean))
+        train_examples.append(Example.from_dict(doc, annotation_clean))

     with nlp.select_pipes(enable="entity_linker"):  # only train entity linker
         # reset and initialize the weights randomly

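As an aside, the vocab-sharing pattern the example now follows can be summarised
in a few lines. This is a hedged illustration only: the path is a placeholder,
and it simply restates the calls visible in the hunks above rather than any
additional API.

    import spacy

    # Placeholder path -- it must point at the vocab directory that was saved
    # when the knowledge base was created, so that the KB's entity strings
    # resolve against the same string store.
    vocab_path = "output/vocab"

    nlp = spacy.blank("en")              # start from a blank English pipeline
    nlp.vocab.from_disk(vocab_path)      # then load the shared vocab into it
    nlp.vocab.vectors.name = "spacy_pretrained_vectors"
    print("Created blank 'en' model with vocab from '%s'" % vocab_path)
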
52  setup.py

@@ -4,13 +4,14 @@ import sys
 import platform
 from distutils.command.build_ext import build_ext
 from distutils.sysconfig import get_python_inc
-import distutils.util
 from distutils import ccompiler, msvccompiler
 import numpy
 from pathlib import Path
 import shutil
 from Cython.Build import cythonize
 from Cython.Compiler import Options
+import os
+import subprocess


 ROOT = Path(__file__).parent

@@ -75,7 +76,6 @@ COPY_FILES = {

 def is_new_osx():
     """Check whether we're on OSX >= 10.7"""
-    name = distutils.util.get_platform()
     if sys.platform != "darwin":
         return False
     mac_ver = platform.mac_ver()[0]

@@ -118,6 +118,53 @@ class build_ext_subclass(build_ext, build_ext_options):
         build_ext.build_extensions(self)


+# Include the git version in the build (adapted from NumPy)
+# Copyright (c) 2005-2020, NumPy Developers.
+# BSD 3-Clause license, see licenses/3rd_party_licenses.txt
+def write_git_info_py(filename="spacy/git_info.py"):
+    def _minimal_ext_cmd(cmd):
+        # construct minimal environment
+        env = {}
+        for k in ["SYSTEMROOT", "PATH", "HOME"]:
+            v = os.environ.get(k)
+            if v is not None:
+                env[k] = v
+        # LANGUAGE is used on win32
+        env["LANGUAGE"] = "C"
+        env["LANG"] = "C"
+        env["LC_ALL"] = "C"
+        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, env=env)
+        return out
+
+    git_version = "Unknown"
+    if Path(".git").exists():
+        try:
+            out = _minimal_ext_cmd(["git", "rev-parse", "--short", "HEAD"])
+            git_version = out.strip().decode("ascii")
+        except Exception:
+            pass
+    elif Path(filename).exists():
+        # must be a source distribution, use existing version file
+        try:
+            a = open(filename, "r")
+            lines = a.readlines()
+            git_version = lines[-1].split('"')[1]
+        except Exception:
+            pass
+        finally:
+            a.close()
+
+    text = """# THIS FILE IS GENERATED FROM SPACY SETUP.PY
+#
+GIT_VERSION = "%(git_version)s"
+"""
+    a = open(filename, "w")
+    try:
+        a.write(text % {"git_version": git_version})
+    finally:
+        a.close()
+
+
 def clean(path):
     for path in path.glob("**/*"):
         if path.is_file() and path.suffix in (".so", ".cpp", ".html"):

@@ -126,6 +173,7 @@ def clean(path):


 def setup_package():
+    write_git_info_py()
     if len(sys.argv) > 1 and sys.argv[1] == "clean":
         return clean(PACKAGE_ROOT)

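For context, the generated `spacy/git_info.py` ends with a single `GIT_VERSION`
assignment, which is why the sdist fallback above reads the last line of an
existing file. A minimal, hypothetical sketch of consuming it; the
`describe_build` helper is invented for illustration and is not part of the
diff.

    from spacy.git_info import GIT_VERSION  # module written by write_git_info_py()

    def describe_build(release_version):
        # GIT_VERSION is a short commit hash, or "Unknown" for builds made
        # outside a git checkout without an existing git_info.py.
        return "%s (git %s)" % (release_version, GIT_VERSION)

    print(describe_build("x.y.z"))
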
@@ -31,6 +31,41 @@ class EnglishDefaults(Language.Defaults):
         {"tags": ["``", "''"], "variants": [('"', '"'), ("“", "”")]},
     ]

+    @classmethod
+    def is_base_form(cls, univ_pos, morphology=None):
+        """
+        Check whether we're dealing with an uninflected paradigm, so we can
+        avoid lemmatization entirely.
+
+        univ_pos (unicode / int): The token's universal part-of-speech tag.
+        morphology (dict): The token's morphological features following the
+            Universal Dependencies scheme.
+        """
+        if morphology is None:
+            morphology = {}
+        if univ_pos == "noun" and morphology.get("Number") == "sing":
+            return True
+        elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
+            return True
+        # This maps 'VBP' to base form -- probably just need 'IS_BASE'
+        # morphology
+        elif univ_pos == "verb" and (
+            morphology.get("VerbForm") == "fin"
+            and morphology.get("Tense") == "pres"
+            and morphology.get("Number") is None
+        ):
+            return True
+        elif univ_pos == "adj" and morphology.get("Degree") == "pos":
+            return True
+        elif morphology.get("VerbForm") == "inf":
+            return True
+        elif morphology.get("VerbForm") == "none":
+            return True
+        elif morphology.get("Degree") == "pos":
+            return True
+        else:
+            return False
+

 class English(Language):
     lang = "en"

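To make the branch logic above concrete, a small sketch of how the new
classmethod responds to a few morphology dictionaries. These calls only
exercise the code shown in the hunk; they assume `EnglishDefaults` is
importable from `spacy.lang.en` and are not taken from the test suite.

    from spacy.lang.en import EnglishDefaults

    # Singular nouns and infinitive verbs count as uninflected paradigms,
    # so lemmatization can be skipped for them.
    assert EnglishDefaults.is_base_form("noun", {"Number": "sing"}) is True
    assert EnglishDefaults.is_base_form("verb", {"VerbForm": "inf"}) is True

    # A plural noun is inflected, so it still goes through the lemmatizer.
    assert EnglishDefaults.is_base_form("noun", {"Number": "plur"}) is False
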
@@ -41,9 +41,6 @@ class FrenchLemmatizer(Lemmatizer):
             univ_pos = "sconj"
         else:
             return [self.lookup(string)]
-        # See Issue #435 for example of where this logic is requied.
-        if self.is_base_form(univ_pos, morphology):
-            return list(set([string.lower()]))
         index_table = self.lookups.get_table("lemma_index", {})
         exc_table = self.lookups.get_table("lemma_exc", {})
         rules_table = self.lookups.get_table("lemma_rules", {})

@@ -8,6 +8,6 @@ Example sentences to test spaCy and its language models.
 sentences = [
     "Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։",
     "Ո՞վ է Ֆրանսիայի նախագահը։",
-    "Որն է Միացյալ Նահանգների մայրաքաղաքը։",
+    "Ո՞րն է Միացյալ Նահանգների մայրաքաղաքը։",
     "Ե՞րբ է ծնվել Բարաք Օբաման։",
 ]

@@ -15,14 +15,15 @@ _num_words = [
     "տասը",
     "տասնմեկ",
     "տասներկու",
     "տասներեք",
     "տասնչորս",
     "տասնհինգ",
     "տասնվեց",
     "տասնյոթ",
     "տասնութ",
     "տասնինը",
-    "քսան" "երեսուն",
+    "քսան",
+    "երեսուն",
     "քառասուն",
     "հիսուն",
     "վաթսուն",

|
@@ -17,12 +17,9 @@ from ... import util


 # Hold the attributes we need with convenient names
-DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])
-
-# Handling for multiple spaces in a row is somewhat awkward, this simplifies
-# the flow by creating a dummy with the same interface.
-DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
-DummySpace = DummyNode(" ", " ", " ")
+DetailedToken = namedtuple(
+    "DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"]
+)


 def try_sudachi_import(split_mode="A"):
@@ -49,7 +46,7 @@ def try_sudachi_import(split_mode="A"):
     )


-def resolve_pos(orth, pos, next_pos):
+def resolve_pos(orth, tag, next_tag):
     """If necessary, add a field to the POS tag for UD mapping.
     Under Universal Dependencies, sometimes the same Unidic POS tag can
     be mapped differently depending on the literal token or its context
@@ -60,127 +57,80 @@ def resolve_pos(orth, pos, next_pos):
     # Some tokens have their UD tag decided based on the POS of the following
     # token.

-    # orth based rules
-    if pos[0] in TAG_ORTH_MAP:
-        orth_map = TAG_ORTH_MAP[pos[0]]
+    # apply orth based mapping
+    if tag in TAG_ORTH_MAP:
+        orth_map = TAG_ORTH_MAP[tag]
         if orth in orth_map:
-            return orth_map[orth], None
+            return orth_map[orth], None  # current_pos, next_pos

-    # tag bi-gram mapping
-    if next_pos:
-        tag_bigram = pos[0], next_pos[0]
+    # apply tag bi-gram mapping
+    if next_tag:
+        tag_bigram = tag, next_tag
         if tag_bigram in TAG_BIGRAM_MAP:
-            bipos = TAG_BIGRAM_MAP[tag_bigram]
-            if bipos[0] is None:
-                return TAG_MAP[pos[0]][POS], bipos[1]
+            current_pos, next_pos = TAG_BIGRAM_MAP[tag_bigram]
+            if current_pos is None:  # apply tag uni-gram mapping for current_pos
+                return (
+                    TAG_MAP[tag][POS],
+                    next_pos,
+                )  # only next_pos is identified by tag bi-gram mapping
             else:
-                return bipos
+                return current_pos, next_pos

-    return TAG_MAP[pos[0]][POS], None
+    # apply tag uni-gram mapping
+    return TAG_MAP[tag][POS], None


-# Use a mapping of paired punctuation to avoid splitting quoted sentences.
-pairpunct = {"「": "」", "『": "』", "【": "】"}
-
-
-def separate_sentences(doc):
-    """Given a doc, mark tokens that start sentences based on Unidic tags.
-    """
-
-    stack = []  # save paired punctuation
-
-    for i, token in enumerate(doc[:-2]):
-        # Set all tokens after the first to false by default. This is necessary
-        # for the doc code to be aware we've done sentencization, see
-        # `is_sentenced`.
-        token.sent_start = i == 0
-        if token.tag_:
-            if token.tag_ == "補助記号-括弧開":
-                ts = str(token)
-                if ts in pairpunct:
-                    stack.append(pairpunct[ts])
-                elif stack and ts == stack[-1]:
-                    stack.pop()
-
-            if token.tag_ == "補助記号-句点":
-                next_token = doc[i + 1]
-                if next_token.tag_ != token.tag_ and not stack:
-                    next_token.sent_start = True
-
-
-def get_dtokens(tokenizer, text):
-    tokens = tokenizer.tokenize(text)
-    words = []
-    for ti, token in enumerate(tokens):
-        tag = "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"])
-        inf = "-".join([xx for xx in token.part_of_speech()[4:] if xx != "*"])
-        dtoken = DetailedToken(token.surface(), (tag, inf), token.dictionary_form())
-        if ti > 0 and words[-1].pos[0] == "空白" and tag == "空白":
-            # don't add multiple space tokens in a row
-            continue
-        words.append(dtoken)
-
-    # remove empty tokens. These can be produced with characters like … that
-    # Sudachi normalizes internally.
-    words = [ww for ww in words if len(ww.surface) > 0]
-    return words
-
-
-def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
+def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
+    # Compare the content of tokens and text, first
     words = [x.surface for x in dtokens]
     if "".join("".join(words).split()) != "".join(text.split()):
         raise ValueError(Errors.E194.format(text=text, words=words))
-    text_words = []
-    text_lemmas = []
-    text_tags = []
+    text_dtokens = []
     text_spaces = []
     text_pos = 0
     # handle empty and whitespace-only texts
     if len(words) == 0:
-        return text_words, text_lemmas, text_tags, text_spaces
+        return text_dtokens, text_spaces
     elif len([word for word in words if not word.isspace()]) == 0:
         assert text.isspace()
-        text_words = [text]
-        text_lemmas = [text]
-        text_tags = [gap_tag]
+        text_dtokens = [DetailedToken(text, gap_tag, "", text, None, None)]
         text_spaces = [False]
-        return text_words, text_lemmas, text_tags, text_spaces
-    # normalize words to remove all whitespace tokens
-    norm_words, norm_dtokens = zip(
-        *[
-            (word, dtokens)
-            for word, dtokens in zip(words, dtokens)
-            if not word.isspace()
-        ]
-    )
-    # align words with text
-    for word, dtoken in zip(norm_words, norm_dtokens):
+        return text_dtokens, text_spaces
+
+    # align words and dtokens by referring text, and insert gap tokens for the space char spans
+    for word, dtoken in zip(words, dtokens):
+        # skip all space tokens
+        if word.isspace():
+            continue
+
         try:
             word_start = text[text_pos:].index(word)
         except ValueError:
             raise ValueError(Errors.E194.format(text=text, words=words))

+        # space token
         if word_start > 0:
             w = text[text_pos : text_pos + word_start]
-            text_words.append(w)
-            text_lemmas.append(w)
-            text_tags.append(gap_tag)
+            text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None))
             text_spaces.append(False)
             text_pos += word_start
-        text_words.append(word)
-        text_lemmas.append(dtoken.lemma)
-        text_tags.append(dtoken.pos)
+
+        # content word
+        text_dtokens.append(dtoken)
         text_spaces.append(False)
         text_pos += len(word)
+        # poll a space char after the word
         if text_pos < len(text) and text[text_pos] == " ":
             text_spaces[-1] = True
             text_pos += 1

+    # trailing space token
     if text_pos < len(text):
         w = text[text_pos:]
-        text_words.append(w)
-        text_lemmas.append(w)
-        text_tags.append(gap_tag)
+        text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None))
         text_spaces.append(False)
-    return text_words, text_lemmas, text_tags, text_spaces
+
+    return text_dtokens, text_spaces


 class JapaneseTokenizer(DummyTokenizer):
@@ -190,29 +140,96 @@ class JapaneseTokenizer(DummyTokenizer):
         self.tokenizer = try_sudachi_import(self.split_mode)

     def __call__(self, text):
-        dtokens = get_dtokens(self.tokenizer, text)
+        # convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces
+        sudachipy_tokens = self.tokenizer.tokenize(text)
+        dtokens = self._get_dtokens(sudachipy_tokens)
+        dtokens, spaces = get_dtokens_and_spaces(dtokens, text)

-        words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
+        # create Doc with tag bi-gram based part-of-speech identification rules
+        words, tags, inflections, lemmas, readings, sub_tokens_list = (
+            zip(*dtokens) if dtokens else [[]] * 6
+        )
+        sub_tokens_list = list(sub_tokens_list)
         doc = Doc(self.vocab, words=words, spaces=spaces)
-        next_pos = None
-        for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
-            token.tag_ = unidic_tag[0]
-            if next_pos:
+        next_pos = None  # for bi-gram rules
+        for idx, (token, dtoken) in enumerate(zip(doc, dtokens)):
+            token.tag_ = dtoken.tag
+            if next_pos:  # already identified in previous iteration
                 token.pos = next_pos
                 next_pos = None
             else:
                 token.pos, next_pos = resolve_pos(
                     token.orth_,
-                    unidic_tag,
-                    unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None,
+                    dtoken.tag,
+                    tags[idx + 1] if idx + 1 < len(tags) else None,
                 )

             # if there's no lemma info (it's an unk) just use the surface
-            token.lemma_ = lemma
-        doc.user_data["unidic_tags"] = unidic_tags
+            token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface
+        doc.user_data["inflections"] = inflections
+        doc.user_data["reading_forms"] = readings
+        doc.user_data["sub_tokens"] = sub_tokens_list

         return doc

+    def _get_dtokens(self, sudachipy_tokens, need_sub_tokens=True):
+        sub_tokens_list = (
+            self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None
+        )
+        dtokens = [
+            DetailedToken(
+                token.surface(),  # orth
+                "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"]),  # tag
+                ",".join([xx for xx in token.part_of_speech()[4:] if xx != "*"]),  # inf
+                token.dictionary_form(),  # lemma
+                token.reading_form(),  # user_data['reading_forms']
+                sub_tokens_list[idx]
+                if sub_tokens_list
+                else None,  # user_data['sub_tokens']
+            )
+            for idx, token in enumerate(sudachipy_tokens)
+            if len(token.surface()) > 0
+            # remove empty tokens which can be produced with characters like … that
+        ]
+        # Sudachi normalizes internally and outputs each space char as a token.
+        # This is the preparation for get_dtokens_and_spaces() to merge the continuous space tokens
+        return [
+            t
+            for idx, t in enumerate(dtokens)
+            if idx == 0
+            or not t.surface.isspace()
+            or t.tag != "空白"
+            or not dtokens[idx - 1].surface.isspace()
+            or dtokens[idx - 1].tag != "空白"
+        ]
+
+    def _get_sub_tokens(self, sudachipy_tokens):
+        if (
+            self.split_mode is None or self.split_mode == "A"
+        ):  # do nothing for default split mode
+            return None
+
+        sub_tokens_list = []  # list of (list of list of DetailedToken | None)
+        for token in sudachipy_tokens:
+            sub_a = token.split(self.tokenizer.SplitMode.A)
+            if len(sub_a) == 1:  # no sub tokens
+                sub_tokens_list.append(None)
+            elif self.split_mode == "B":
+                sub_tokens_list.append([self._get_dtokens(sub_a, False)])
+            else:  # "C"
+                sub_b = token.split(self.tokenizer.SplitMode.B)
+                if len(sub_a) == len(sub_b):
+                    dtokens = self._get_dtokens(sub_a, False)
+                    sub_tokens_list.append([dtokens, dtokens])
+                else:
+                    sub_tokens_list.append(
+                        [
+                            self._get_dtokens(sub_a, False),
+                            self._get_dtokens(sub_b, False),
+                        ]
+                    )
+        return sub_tokens_list
+
     def _get_config(self):
         config = OrderedDict((("split_mode", self.split_mode),))
         return config
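The rewritten `JapaneseTokenizer.__call__` above now attaches the extra UniDic information to `Doc.user_data`. A hedged usage sketch, assuming SudachiPy and its dictionary are installed:

```python
# Hypothetical usage sketch (not part of the diff itself).
import spacy

nlp = spacy.blank("ja")  # split_mode defaults to "A"
doc = nlp("取ってつけた")
print(doc.user_data["inflections"])    # per-token inflection strings
print(doc.user_data["reading_forms"])  # per-token katakana readings
print(doc.user_data["sub_tokens"])     # None entries under split mode "A"
```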
@ -1,176 +0,0 @@
|
||||||
POS_PHRASE_MAP = {
|
|
||||||
"NOUN": "NP",
|
|
||||||
"NUM": "NP",
|
|
||||||
"PRON": "NP",
|
|
||||||
"PROPN": "NP",
|
|
||||||
"VERB": "VP",
|
|
||||||
"ADJ": "ADJP",
|
|
||||||
"ADV": "ADVP",
|
|
||||||
"CCONJ": "CCONJP",
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
|
|
||||||
def yield_bunsetu(doc, debug=False):
|
|
||||||
bunsetu = []
|
|
||||||
bunsetu_may_end = False
|
|
||||||
phrase_type = None
|
|
||||||
phrase = None
|
|
||||||
prev = None
|
|
||||||
prev_tag = None
|
|
||||||
prev_dep = None
|
|
||||||
prev_head = None
|
|
||||||
for t in doc:
|
|
||||||
pos = t.pos_
|
|
||||||
pos_type = POS_PHRASE_MAP.get(pos, None)
|
|
||||||
tag = t.tag_
|
|
||||||
dep = t.dep_
|
|
||||||
head = t.head.i
|
|
||||||
if debug:
|
|
||||||
print(
|
|
||||||
t.i,
|
|
||||||
t.orth_,
|
|
||||||
pos,
|
|
||||||
pos_type,
|
|
||||||
dep,
|
|
||||||
head,
|
|
||||||
bunsetu_may_end,
|
|
||||||
phrase_type,
|
|
||||||
phrase,
|
|
||||||
bunsetu,
|
|
||||||
)
|
|
||||||
|
|
||||||
# DET is always an individual bunsetu
|
|
||||||
if pos == "DET":
|
|
||||||
if bunsetu:
|
|
||||||
yield bunsetu, phrase_type, phrase
|
|
||||||
yield [t], None, None
|
|
||||||
bunsetu = []
|
|
||||||
bunsetu_may_end = False
|
|
||||||
phrase_type = None
|
|
||||||
phrase = None
|
|
||||||
|
|
||||||
# PRON or Open PUNCT always splits bunsetu
|
|
||||||
elif tag == "補助記号-括弧開":
|
|
||||||
if bunsetu:
|
|
||||||
yield bunsetu, phrase_type, phrase
|
|
||||||
bunsetu = [t]
|
|
||||||
bunsetu_may_end = True
|
|
||||||
phrase_type = None
|
|
||||||
phrase = None
|
|
||||||
|
|
||||||
# bunsetu head not appeared
|
|
||||||
elif phrase_type is None:
|
|
||||||
if bunsetu and prev_tag == "補助記号-読点":
|
|
||||||
yield bunsetu, phrase_type, phrase
|
|
||||||
bunsetu = []
|
|
||||||
bunsetu_may_end = False
|
|
||||||
phrase_type = None
|
|
||||||
phrase = None
|
|
||||||
bunsetu.append(t)
|
|
||||||
if pos_type: # begin phrase
|
|
||||||
phrase = [t]
|
|
||||||
phrase_type = pos_type
|
|
||||||
if pos_type in {"ADVP", "CCONJP"}:
|
|
||||||
bunsetu_may_end = True
|
|
||||||
|
|
||||||
# entering new bunsetu
|
|
||||||
elif pos_type and (
|
|
||||||
pos_type != phrase_type
|
|
||||||
or bunsetu_may_end # different phrase type arises # same phrase type but bunsetu already ended
|
|
||||||
):
|
|
||||||
# exceptional case: NOUN to VERB
|
|
||||||
if (
|
|
||||||
phrase_type == "NP"
|
|
||||||
and pos_type == "VP"
|
|
||||||
and prev_dep == "compound"
|
|
||||||
and prev_head == t.i
|
|
||||||
):
|
|
||||||
bunsetu.append(t)
|
|
||||||
phrase_type = "VP"
|
|
||||||
phrase.append(t)
|
|
||||||
# exceptional case: VERB to NOUN
|
|
||||||
elif (
|
|
||||||
phrase_type == "VP"
|
|
||||||
and pos_type == "NP"
|
|
||||||
and (
|
|
||||||
prev_dep == "compound"
|
|
||||||
and prev_head == t.i
|
|
||||||
or dep == "compound"
|
|
||||||
and prev == head
|
|
||||||
or prev_dep == "nmod"
|
|
||||||
and prev_head == t.i
|
|
||||||
)
|
|
||||||
):
|
|
||||||
bunsetu.append(t)
|
|
||||||
phrase_type = "NP"
|
|
||||||
phrase.append(t)
|
|
||||||
else:
|
|
||||||
yield bunsetu, phrase_type, phrase
|
|
||||||
bunsetu = [t]
|
|
||||||
bunsetu_may_end = False
|
|
||||||
phrase_type = pos_type
|
|
||||||
phrase = [t]
|
|
||||||
|
|
||||||
# NOUN bunsetu
|
|
||||||
elif phrase_type == "NP":
|
|
||||||
bunsetu.append(t)
|
|
||||||
if not bunsetu_may_end and (
|
|
||||||
(
|
|
||||||
(pos_type == "NP" or pos == "SYM")
|
|
||||||
and (prev_head == t.i or prev_head == head)
|
|
||||||
and prev_dep in {"compound", "nummod"}
|
|
||||||
)
|
|
||||||
or (
|
|
||||||
pos == "PART"
|
|
||||||
and (prev == head or prev_head == head)
|
|
||||||
and dep == "mark"
|
|
||||||
)
|
|
||||||
):
|
|
||||||
phrase.append(t)
|
|
||||||
else:
|
|
||||||
bunsetu_may_end = True
|
|
||||||
|
|
||||||
# VERB bunsetu
|
|
||||||
elif phrase_type == "VP":
|
|
||||||
bunsetu.append(t)
|
|
||||||
if (
|
|
||||||
not bunsetu_may_end
|
|
||||||
and pos == "VERB"
|
|
||||||
and prev_head == t.i
|
|
||||||
and prev_dep == "compound"
|
|
||||||
):
|
|
||||||
phrase.append(t)
|
|
||||||
else:
|
|
||||||
bunsetu_may_end = True
|
|
||||||
|
|
||||||
# ADJ bunsetu
|
|
||||||
elif phrase_type == "ADJP" and tag != "連体詞":
|
|
||||||
bunsetu.append(t)
|
|
||||||
if not bunsetu_may_end and (
|
|
||||||
(
|
|
||||||
pos == "NOUN"
|
|
||||||
and (prev_head == t.i or prev_head == head)
|
|
||||||
and prev_dep in {"amod", "compound"}
|
|
||||||
)
|
|
||||||
or (
|
|
||||||
pos == "PART"
|
|
||||||
and (prev == head or prev_head == head)
|
|
||||||
and dep == "mark"
|
|
||||||
)
|
|
||||||
):
|
|
||||||
phrase.append(t)
|
|
||||||
else:
|
|
||||||
bunsetu_may_end = True
|
|
||||||
|
|
||||||
# other bunsetu
|
|
||||||
else:
|
|
||||||
bunsetu.append(t)
|
|
||||||
|
|
||||||
prev = t.i
|
|
||||||
prev_tag = t.tag_
|
|
||||||
prev_dep = t.dep_
|
|
||||||
prev_head = head
|
|
||||||
|
|
||||||
if bunsetu:
|
|
||||||
yield bunsetu, phrase_type, phrase
|
|
|
@@ -39,7 +39,11 @@ def check_spaces(text, tokens):
 class KoreanTokenizer(DummyTokenizer):
     def __init__(self, cls, nlp=None):
         self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-        self.Tokenizer = try_mecab_import()
+        MeCab = try_mecab_import()
+        self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
+
+    def __del__(self):
+        self.mecab_tokenizer.__del__()

     def __call__(self, text):
         dtokens = list(self.detailed_tokens(text))
@@ -55,17 +59,16 @@ class KoreanTokenizer(DummyTokenizer):
     def detailed_tokens(self, text):
         # 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3],
         # 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], *
-        with self.Tokenizer("-F%f[0],%f[7]") as tokenizer:
-            for node in tokenizer.parse(text, as_nodes=True):
-                if node.is_eos():
-                    break
-                surface = node.surface
-                feature = node.feature
-                tag, _, expr = feature.partition(",")
-                lemma, _, remainder = expr.partition("/")
-                if lemma == "*":
-                    lemma = surface
-                yield {"surface": surface, "lemma": lemma, "tag": tag}
+        for node in self.mecab_tokenizer.parse(text, as_nodes=True):
+            if node.is_eos():
+                break
+            surface = node.surface
+            feature = node.feature
+            tag, _, expr = feature.partition(",")
+            lemma, _, remainder = expr.partition("/")
+            if lemma == "*":
+                lemma = surface
+            yield {"surface": surface, "lemma": lemma, "tag": tag}


 class KoreanDefaults(Language.Defaults):
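A hedged sketch of the Korean tokenizer after this change, assuming natto-py and a MeCab-ko dictionary are installed; the point of the diff is that one `MeCab` instance is created in `__init__` and reused for every call:

```python
# Hypothetical usage sketch (not part of the diff itself).
import spacy

nlp = spacy.blank("ko")
doc = nlp("안녕하세요.")
print([(t.text, t.tag_, t.lemma_) for t in doc])
```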
spacy/lang/ne/__init__.py (new file, 23 lines)
@@ -0,0 +1,23 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
+
+from ...language import Language
+from ...attrs import LANG
+
+
+class NepaliDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
+    lex_attr_getters[LANG] = lambda text: "ne"  # Nepali language ISO code
+    stop_words = STOP_WORDS
+
+
+class Nepali(Language):
+    lang = "ne"
+    Defaults = NepaliDefaults
+
+
+__all__ = ["Nepali"]
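A small sketch of what the new Nepali language class enables, assuming nothing beyond the files added in this diff; the example sentence is taken from the new test file further below:

```python
# Hypothetical usage sketch (not part of the diff itself).
import spacy

nlp = spacy.blank("ne")
doc = nlp("म ठूलो हुँदै थिएँ ।")
print([(t.text, t.is_stop, t.like_num) for t in doc])
```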
spacy/lang/ne/examples.py (new file, 22 lines)
@@ -0,0 +1,22 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.ne.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+    "एप्पलले अमेरिकी स्टार्टअप १ अर्ब डलरमा किन्ने सोच्दै छ",
+    "स्वायत्त कारहरूले बीमा दायित्व निर्माताहरु तिर बदल्छन्",
+    "स्यान फ्रांसिस्कोले फुटपाथ वितरण रोबोटहरु प्रतिबंध गर्ने विचार गर्दै छ",
+    "लन्डन यूनाइटेड किंगडमको एक ठूलो शहर हो।",
+    "तिमी कहाँ छौ?",
+    "फ्रान्स को राष्ट्रपति को हो?",
+    "संयुक्त राज्यको राजधानी के हो?",
+    "बराक ओबामा कहिले कहिले जन्मेका हुन्?",
+]
98
spacy/lang/ne/lex_attrs.py
Normal file
98
spacy/lang/ne/lex_attrs.py
Normal file
|
@ -0,0 +1,98 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..norm_exceptions import BASE_NORMS
|
||||||
|
from ...attrs import NORM, LIKE_NUM
|
||||||
|
|
||||||
|
|
||||||
|
# fmt: off
|
||||||
|
_stem_suffixes = [
|
||||||
|
["ा", "ि", "ी", "ु", "ू", "ृ", "े", "ै", "ो", "ौ"],
|
||||||
|
["ँ", "ं", "्", "ः"],
|
||||||
|
["लाई", "ले", "बाट", "को", "मा", "हरू"],
|
||||||
|
["हरूलाई", "हरूले", "हरूबाट", "हरूको", "हरूमा"],
|
||||||
|
["इलो", "िलो", "नु", "ाउनु", "ई", "इन", "इन्", "इनन्"],
|
||||||
|
["एँ", "इँन्", "इस्", "इनस्", "यो", "एन", "यौं", "एनौं", "ए", "एनन्"],
|
||||||
|
["छु", "छौँ", "छस्", "छौ", "छ", "छन्", "छेस्", "छे", "छ्यौ", "छिन्", "हुन्छ"],
|
||||||
|
["दै", "दिन", "दिँन", "दैनस्", "दैन", "दैनौँ", "दैनौं", "दैनन्"],
|
||||||
|
["हुन्न", "न्न", "न्न्स्", "न्नौं", "न्नौ", "न्न्न्", "िई"],
|
||||||
|
["अ", "ओ", "ऊ", "अरी", "साथ", "वित्तिकै", "पूर्वक"],
|
||||||
|
["याइ", "ाइ", "बार", "वार", "चाँहि"],
|
||||||
|
["ने", "ेको", "ेकी", "ेका", "ेर", "दै", "तै", "िकन", "उ", "न", "नन्"]
|
||||||
|
]
|
||||||
|
# fmt: on
|
||||||
|
|
||||||
|
# reference 1: https://en.wikipedia.org/wiki/Numbers_in_Nepali_language
|
||||||
|
# reference 2: https://www.imnepal.com/nepali-numbers/
|
||||||
|
_num_words = [
|
||||||
|
"शुन्य",
|
||||||
|
"एक",
|
||||||
|
"दुई",
|
||||||
|
"तीन",
|
||||||
|
"चार",
|
||||||
|
"पाँच",
|
||||||
|
"छ",
|
||||||
|
"सात",
|
||||||
|
"आठ",
|
||||||
|
"नौ",
|
||||||
|
"दश",
|
||||||
|
"एघार",
|
||||||
|
"बाह्र",
|
||||||
|
"तेह्र",
|
||||||
|
"चौध",
|
||||||
|
"पन्ध्र",
|
||||||
|
"सोह्र",
|
||||||
|
"सोह्र",
|
||||||
|
"सत्र",
|
||||||
|
"अठार",
|
||||||
|
"उन्नाइस",
|
||||||
|
"बीस",
|
||||||
|
"तीस",
|
||||||
|
"चालीस",
|
||||||
|
"पचास",
|
||||||
|
"साठी",
|
||||||
|
"सत्तरी",
|
||||||
|
"असी",
|
||||||
|
"नब्बे",
|
||||||
|
"सय",
|
||||||
|
"हजार",
|
||||||
|
"लाख",
|
||||||
|
"करोड",
|
||||||
|
"अर्ब",
|
||||||
|
"खर्ब",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def norm(string):
|
||||||
|
# normalise base exceptions, e.g. punctuation or currency symbols
|
||||||
|
if string in BASE_NORMS:
|
||||||
|
return BASE_NORMS[string]
|
||||||
|
# set stem word as norm, if available, adapted from:
|
||||||
|
# https://github.com/explosion/spaCy/blob/master/spacy/lang/hi/lex_attrs.py
|
||||||
|
# https://www.researchgate.net/publication/237261579_Structure_of_Nepali_Grammar
|
||||||
|
for suffix_group in reversed(_stem_suffixes):
|
||||||
|
length = len(suffix_group[0])
|
||||||
|
if len(string) <= length:
|
||||||
|
break
|
||||||
|
for suffix in suffix_group:
|
||||||
|
if string.endswith(suffix):
|
||||||
|
return string[:-length]
|
||||||
|
return string
|
||||||
|
|
||||||
|
|
||||||
|
def like_num(text):
|
||||||
|
if text.startswith(("+", "-", "±", "~")):
|
||||||
|
text = text[1:]
|
||||||
|
text = text.replace(", ", "").replace(".", "")
|
||||||
|
if text.isdigit():
|
||||||
|
return True
|
||||||
|
if text.count("/") == 1:
|
||||||
|
num, denom = text.split("/")
|
||||||
|
if num.isdigit() and denom.isdigit():
|
||||||
|
return True
|
||||||
|
if text.lower() in _num_words:
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
LEX_ATTRS = {NORM: norm, LIKE_NUM: like_num}
|
498
spacy/lang/ne/stop_words.py
Normal file
498
spacy/lang/ne/stop_words.py
Normal file
|
@ -0,0 +1,498 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
# Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt
|
||||||
|
|
||||||
|
STOP_WORDS = set(
|
||||||
|
"""
|
||||||
|
अक्सर
|
||||||
|
अगाडि
|
||||||
|
अगाडी
|
||||||
|
अघि
|
||||||
|
अझै
|
||||||
|
अठार
|
||||||
|
अथवा
|
||||||
|
अनि
|
||||||
|
अनुसार
|
||||||
|
अन्तर्गत
|
||||||
|
अन्य
|
||||||
|
अन्यत्र
|
||||||
|
अन्यथा
|
||||||
|
अब
|
||||||
|
अरु
|
||||||
|
अरुलाई
|
||||||
|
अरू
|
||||||
|
अर्को
|
||||||
|
अर्थात
|
||||||
|
अर्थात्
|
||||||
|
अलग
|
||||||
|
अलि
|
||||||
|
अवस्था
|
||||||
|
अहिले
|
||||||
|
आए
|
||||||
|
आएका
|
||||||
|
आएको
|
||||||
|
आज
|
||||||
|
आजको
|
||||||
|
आठ
|
||||||
|
आत्म
|
||||||
|
आदि
|
||||||
|
आदिलाई
|
||||||
|
आफनो
|
||||||
|
आफू
|
||||||
|
आफूलाई
|
||||||
|
आफै
|
||||||
|
आफैँ
|
||||||
|
आफ्नै
|
||||||
|
आफ्नो
|
||||||
|
आयो
|
||||||
|
उ
|
||||||
|
उक्त
|
||||||
|
उदाहरण
|
||||||
|
उनको
|
||||||
|
उनलाई
|
||||||
|
उनले
|
||||||
|
उनि
|
||||||
|
उनी
|
||||||
|
उनीहरुको
|
||||||
|
उन्नाइस
|
||||||
|
उप
|
||||||
|
उसको
|
||||||
|
उसलाई
|
||||||
|
उसले
|
||||||
|
उहालाई
|
||||||
|
ऊ
|
||||||
|
एउटा
|
||||||
|
एउटै
|
||||||
|
एक
|
||||||
|
एकदम
|
||||||
|
एघार
|
||||||
|
ओठ
|
||||||
|
औ
|
||||||
|
औं
|
||||||
|
कता
|
||||||
|
कति
|
||||||
|
कतै
|
||||||
|
कम
|
||||||
|
कमसेकम
|
||||||
|
कसरि
|
||||||
|
कसरी
|
||||||
|
कसै
|
||||||
|
कसैको
|
||||||
|
कसैलाई
|
||||||
|
कसैले
|
||||||
|
कसैसँग
|
||||||
|
कस्तो
|
||||||
|
कहाँबाट
|
||||||
|
कहिलेकाहीं
|
||||||
|
का
|
||||||
|
काम
|
||||||
|
कारण
|
||||||
|
कि
|
||||||
|
किन
|
||||||
|
किनभने
|
||||||
|
कुन
|
||||||
|
कुनै
|
||||||
|
कुन्नी
|
||||||
|
कुरा
|
||||||
|
कृपया
|
||||||
|
के
|
||||||
|
केहि
|
||||||
|
केही
|
||||||
|
को
|
||||||
|
कोहि
|
||||||
|
कोहिपनि
|
||||||
|
कोही
|
||||||
|
कोहीपनि
|
||||||
|
क्रमशः
|
||||||
|
गए
|
||||||
|
गएको
|
||||||
|
गएर
|
||||||
|
गयौ
|
||||||
|
गरि
|
||||||
|
गरी
|
||||||
|
गरे
|
||||||
|
गरेका
|
||||||
|
गरेको
|
||||||
|
गरेर
|
||||||
|
गरौं
|
||||||
|
गर्छ
|
||||||
|
गर्छन्
|
||||||
|
गर्छु
|
||||||
|
गर्दा
|
||||||
|
गर्दै
|
||||||
|
गर्न
|
||||||
|
गर्नु
|
||||||
|
गर्नुपर्छ
|
||||||
|
गर्ने
|
||||||
|
गैर
|
||||||
|
घर
|
||||||
|
चार
|
||||||
|
चाले
|
||||||
|
चाहनुहुन्छ
|
||||||
|
चाहन्छु
|
||||||
|
चाहिं
|
||||||
|
चाहिए
|
||||||
|
चाहिंले
|
||||||
|
चाहीं
|
||||||
|
चाहेको
|
||||||
|
चाहेर
|
||||||
|
चोटी
|
||||||
|
चौथो
|
||||||
|
चौध
|
||||||
|
छ
|
||||||
|
छन
|
||||||
|
छन्
|
||||||
|
छु
|
||||||
|
छू
|
||||||
|
छैन
|
||||||
|
छैनन्
|
||||||
|
छौ
|
||||||
|
छौं
|
||||||
|
जता
|
||||||
|
जताततै
|
||||||
|
जना
|
||||||
|
जनाको
|
||||||
|
जनालाई
|
||||||
|
जनाले
|
||||||
|
जब
|
||||||
|
जबकि
|
||||||
|
जबकी
|
||||||
|
जसको
|
||||||
|
जसबाट
|
||||||
|
जसमा
|
||||||
|
जसरी
|
||||||
|
जसलाई
|
||||||
|
जसले
|
||||||
|
जस्ता
|
||||||
|
जस्तै
|
||||||
|
जस्तो
|
||||||
|
जस्तोसुकै
|
||||||
|
जहाँ
|
||||||
|
जान
|
||||||
|
जाने
|
||||||
|
जाहिर
|
||||||
|
जुन
|
||||||
|
जुनै
|
||||||
|
जे
|
||||||
|
जो
|
||||||
|
जोपनि
|
||||||
|
जोपनी
|
||||||
|
झैं
|
||||||
|
ठाउँमा
|
||||||
|
ठीक
|
||||||
|
ठूलो
|
||||||
|
त
|
||||||
|
तता
|
||||||
|
तत्काल
|
||||||
|
तथा
|
||||||
|
तथापि
|
||||||
|
तथापी
|
||||||
|
तदनुसार
|
||||||
|
तपाइ
|
||||||
|
तपाई
|
||||||
|
तपाईको
|
||||||
|
तब
|
||||||
|
तर
|
||||||
|
तर्फ
|
||||||
|
तल
|
||||||
|
तसरी
|
||||||
|
तापनि
|
||||||
|
तापनी
|
||||||
|
तिन
|
||||||
|
तिनि
|
||||||
|
तिनिहरुलाई
|
||||||
|
तिनी
|
||||||
|
तिनीहरु
|
||||||
|
तिनीहरुको
|
||||||
|
तिनीहरू
|
||||||
|
तिनीहरूको
|
||||||
|
तिनै
|
||||||
|
तिमी
|
||||||
|
तिर
|
||||||
|
तिरको
|
||||||
|
ती
|
||||||
|
तीन
|
||||||
|
तुरन्त
|
||||||
|
तुरुन्त
|
||||||
|
तुरुन्तै
|
||||||
|
तेश्रो
|
||||||
|
तेस्कारण
|
||||||
|
तेस्रो
|
||||||
|
तेह्र
|
||||||
|
तैपनि
|
||||||
|
तैपनी
|
||||||
|
त्यत्तिकै
|
||||||
|
त्यत्तिकैमा
|
||||||
|
त्यस
|
||||||
|
त्यसकारण
|
||||||
|
त्यसको
|
||||||
|
त्यसले
|
||||||
|
त्यसैले
|
||||||
|
त्यसो
|
||||||
|
त्यस्तै
|
||||||
|
त्यस्तो
|
||||||
|
त्यहाँ
|
||||||
|
त्यहिँ
|
||||||
|
त्यही
|
||||||
|
त्यहीँ
|
||||||
|
त्यहीं
|
||||||
|
त्यो
|
||||||
|
त्सपछि
|
||||||
|
त्सैले
|
||||||
|
थप
|
||||||
|
थरि
|
||||||
|
थरी
|
||||||
|
थाहा
|
||||||
|
थिए
|
||||||
|
थिएँ
|
||||||
|
थिएन
|
||||||
|
थियो
|
||||||
|
दर्ता
|
||||||
|
दश
|
||||||
|
दिए
|
||||||
|
दिएको
|
||||||
|
दिन
|
||||||
|
दिनुभएको
|
||||||
|
दिनुहुन्छ
|
||||||
|
दुइ
|
||||||
|
दुइवटा
|
||||||
|
दुई
|
||||||
|
देखि
|
||||||
|
देखिन्छ
|
||||||
|
देखियो
|
||||||
|
देखे
|
||||||
|
देखेको
|
||||||
|
देखेर
|
||||||
|
दोश्री
|
||||||
|
दोश्रो
|
||||||
|
दोस्रो
|
||||||
|
द्वारा
|
||||||
|
धन्न
|
||||||
|
धेरै
|
||||||
|
धौ
|
||||||
|
न
|
||||||
|
नगर्नु
|
||||||
|
नगर्नू
|
||||||
|
नजिकै
|
||||||
|
नत्र
|
||||||
|
नत्रभने
|
||||||
|
नभई
|
||||||
|
नभएको
|
||||||
|
नभनेर
|
||||||
|
नयाँ
|
||||||
|
नि
|
||||||
|
निकै
|
||||||
|
निम्ति
|
||||||
|
निम्न
|
||||||
|
निम्नानुसार
|
||||||
|
निर्दिष्ट
|
||||||
|
नै
|
||||||
|
नौ
|
||||||
|
पक्का
|
||||||
|
पक्कै
|
||||||
|
पछाडि
|
||||||
|
पछाडी
|
||||||
|
पछि
|
||||||
|
पछिल्लो
|
||||||
|
पछी
|
||||||
|
पटक
|
||||||
|
पनि
|
||||||
|
पन्ध्र
|
||||||
|
पर्छ
|
||||||
|
पर्थ्यो
|
||||||
|
पर्दैन
|
||||||
|
पर्ने
|
||||||
|
पर्नेमा
|
||||||
|
पर्याप्त
|
||||||
|
पहिले
|
||||||
|
पहिलो
|
||||||
|
पहिल्यै
|
||||||
|
पाँच
|
||||||
|
पांच
|
||||||
|
पाचौँ
|
||||||
|
पाँचौं
|
||||||
|
पिच्छे
|
||||||
|
पूर्व
|
||||||
|
पो
|
||||||
|
प्रति
|
||||||
|
प्रतेक
|
||||||
|
प्रत्यक
|
||||||
|
प्राय
|
||||||
|
प्लस
|
||||||
|
फरक
|
||||||
|
फेरि
|
||||||
|
फेरी
|
||||||
|
बढी
|
||||||
|
बताए
|
||||||
|
बने
|
||||||
|
बरु
|
||||||
|
बाट
|
||||||
|
बारे
|
||||||
|
बाहिर
|
||||||
|
बाहेक
|
||||||
|
बाह्र
|
||||||
|
बिच
|
||||||
|
बिचमा
|
||||||
|
बिरुद्ध
|
||||||
|
बिशेष
|
||||||
|
बिस
|
||||||
|
बीच
|
||||||
|
बीचमा
|
||||||
|
बीस
|
||||||
|
भए
|
||||||
|
भएँ
|
||||||
|
भएका
|
||||||
|
भएकालाई
|
||||||
|
भएको
|
||||||
|
भएन
|
||||||
|
भएर
|
||||||
|
भन
|
||||||
|
भने
|
||||||
|
भनेको
|
||||||
|
भनेर
|
||||||
|
भन्
|
||||||
|
भन्छन्
|
||||||
|
भन्छु
|
||||||
|
भन्दा
|
||||||
|
भन्दै
|
||||||
|
भन्नुभयो
|
||||||
|
भन्ने
|
||||||
|
भन्या
|
||||||
|
भयेन
|
||||||
|
भयो
|
||||||
|
भर
|
||||||
|
भरि
|
||||||
|
भरी
|
||||||
|
भा
|
||||||
|
भित्र
|
||||||
|
भित्री
|
||||||
|
भीत्र
|
||||||
|
म
|
||||||
|
मध्य
|
||||||
|
मध्ये
|
||||||
|
मलाई
|
||||||
|
मा
|
||||||
|
मात्र
|
||||||
|
मात्रै
|
||||||
|
माथि
|
||||||
|
माथी
|
||||||
|
मुख्य
|
||||||
|
मुनि
|
||||||
|
मुन्तिर
|
||||||
|
मेरो
|
||||||
|
मैले
|
||||||
|
यति
|
||||||
|
यथोचित
|
||||||
|
यदि
|
||||||
|
यद्ध्यपि
|
||||||
|
यद्यपि
|
||||||
|
यस
|
||||||
|
यसका
|
||||||
|
यसको
|
||||||
|
यसपछि
|
||||||
|
यसबाहेक
|
||||||
|
यसमा
|
||||||
|
यसरी
|
||||||
|
यसले
|
||||||
|
यसो
|
||||||
|
यस्तै
|
||||||
|
यस्तो
|
||||||
|
यहाँ
|
||||||
|
यहाँसम्म
|
||||||
|
यही
|
||||||
|
या
|
||||||
|
यी
|
||||||
|
यो
|
||||||
|
र
|
||||||
|
रही
|
||||||
|
रहेका
|
||||||
|
रहेको
|
||||||
|
रहेछ
|
||||||
|
राखे
|
||||||
|
राख्छ
|
||||||
|
राम्रो
|
||||||
|
रुपमा
|
||||||
|
रूप
|
||||||
|
रे
|
||||||
|
लगभग
|
||||||
|
लगायत
|
||||||
|
लाई
|
||||||
|
लाख
|
||||||
|
लागि
|
||||||
|
लागेको
|
||||||
|
ले
|
||||||
|
वटा
|
||||||
|
वरीपरी
|
||||||
|
वा
|
||||||
|
वाट
|
||||||
|
वापत
|
||||||
|
वास्तवमा
|
||||||
|
शायद
|
||||||
|
सक्छ
|
||||||
|
सक्ने
|
||||||
|
सँग
|
||||||
|
संग
|
||||||
|
सँगको
|
||||||
|
सँगसँगै
|
||||||
|
सँगै
|
||||||
|
संगै
|
||||||
|
सङ्ग
|
||||||
|
सङ्गको
|
||||||
|
सट्टा
|
||||||
|
सत्र
|
||||||
|
सधै
|
||||||
|
सबै
|
||||||
|
सबैको
|
||||||
|
सबैलाई
|
||||||
|
समय
|
||||||
|
समेत
|
||||||
|
सम्भव
|
||||||
|
सम्म
|
||||||
|
सय
|
||||||
|
सरह
|
||||||
|
सहित
|
||||||
|
सहितै
|
||||||
|
सही
|
||||||
|
साँच्चै
|
||||||
|
सात
|
||||||
|
साथ
|
||||||
|
साथै
|
||||||
|
सायद
|
||||||
|
सारा
|
||||||
|
सुनेको
|
||||||
|
सुनेर
|
||||||
|
सुरु
|
||||||
|
सुरुको
|
||||||
|
सुरुमै
|
||||||
|
सो
|
||||||
|
सोचेको
|
||||||
|
सोचेर
|
||||||
|
सोही
|
||||||
|
सोह्र
|
||||||
|
स्थित
|
||||||
|
स्पष्ट
|
||||||
|
हजार
|
||||||
|
हरे
|
||||||
|
हरेक
|
||||||
|
हामी
|
||||||
|
हामीले
|
||||||
|
हाम्रा
|
||||||
|
हाम्रो
|
||||||
|
हुँदैन
|
||||||
|
हुन
|
||||||
|
हुनत
|
||||||
|
हुनु
|
||||||
|
हुने
|
||||||
|
हुनेछ
|
||||||
|
हुन्
|
||||||
|
हुन्छ
|
||||||
|
हुन्थ्यो
|
||||||
|
हैन
|
||||||
|
हो
|
||||||
|
होइन
|
||||||
|
होकि
|
||||||
|
होला
|
||||||
|
""".split()
|
||||||
|
)
|
|
@@ -14,7 +14,7 @@ from .stop_words import STOP_WORDS
 from ... import util


-_PKUSEG_INSTALL_MSG = "install it with `pip install pkuseg==0.0.22` or from https://github.com/lancopku/pkuseg-python"
+_PKUSEG_INSTALL_MSG = "install it with `pip install pkuseg==0.0.25` or from https://github.com/lancopku/pkuseg-python"


 def try_jieba_import(segmenter):
@@ -32,6 +32,7 @@ from .lang.tag_map import TAG_MAP
 from .tokens import Doc
 from .lang.lex_attrs import LEX_ATTRS, is_stop
 from .errors import Errors, Warnings
+from .git_info import GIT_VERSION
 from . import util
 from . import about

@@ -44,7 +45,7 @@ class BaseDefaults:
     def create_lemmatizer(cls, nlp=None, lookups=None):
         if lookups is None:
             lookups = cls.create_lookups(nlp=nlp)
-        return Lemmatizer(lookups=lookups)
+        return Lemmatizer(lookups=lookups, is_base_form=cls.is_base_form)

     @classmethod
     def create_lookups(cls, nlp=None):
@@ -116,6 +117,7 @@ class BaseDefaults:
     tokenizer_exceptions = {}
     stop_words = set()
     morph_rules = {}
+    is_base_form = None
     lex_attr_getters = LEX_ATTRS
     syntax_iterators = {}
     resources = {}
@@ -212,6 +214,7 @@ class Language:
         self._meta.setdefault("email", "")
         self._meta.setdefault("url", "")
         self._meta.setdefault("license", "")
+        self._meta.setdefault("spacy_git_version", GIT_VERSION)
         self._meta["vectors"] = {
             "width": self.vocab.vectors_length,
             "vectors": len(self.vocab.vectors),
@@ -14,7 +14,7 @@ class Lemmatizer:
     def load(cls, *args, **kwargs):
         raise NotImplementedError(Errors.E172)

-    def __init__(self, lookups):
+    def __init__(self, lookups, is_base_form=None):
         """Initialize a Lemmatizer.

         lookups (Lookups): The lookups object containing the (optional) tables
@@ -22,6 +22,7 @@ class Lemmatizer:
         RETURNS (Lemmatizer): The newly constructed object.
         """
         self.lookups = lookups
+        self.is_base_form = is_base_form

     def __call__(self, string, univ_pos, morphology=None):
         """Lemmatize a string.
@@ -42,7 +43,7 @@ class Lemmatizer:
         if univ_pos in ("", "eol", "space"):
             return [string.lower()]
         # See Issue #435 for example of where this logic is requied.
-        if self.is_base_form(univ_pos, morphology):
+        if callable(self.is_base_form) and self.is_base_form(univ_pos, morphology):
             return [string.lower()]
         index_table = self.lookups.get_table("lemma_index", {})
         exc_table = self.lookups.get_table("lemma_exc", {})
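With `is_base_form` now passed in as an optional callback, a language-specific check can be combined with the shared `Lemmatizer`. A sketch, assuming the `EnglishDefaults.is_base_form` added earlier in this diff:

```python
# Hypothetical usage sketch (not part of the diff itself).
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
from spacy.lang.en import EnglishDefaults

# Tables omitted: the base-form path returns before they are consulted.
lemmatizer = Lemmatizer(Lookups(), is_base_form=EnglishDefaults.is_base_form)
# A singular noun is treated as an uninflected form and returned lowercased as-is.
assert lemmatizer("Tests", "noun", {"Number": "sing"}) == ["tests"]
```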
@@ -346,7 +346,7 @@ cdef class Lexeme:
     @property
     def is_oov(self):
         """RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
-        return self.orth in self.vocab.vectors
+        return self.orth not in self.vocab.vectors

     property is_stop:
         """RETURNS (bool): Whether the lexeme is a stop word."""
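The `is_oov` fix inverts the previous membership test, so the property now reads "has no word vector". A small sketch:

```python
# Hypothetical usage sketch (not part of the diff itself).
import numpy
from spacy.vocab import Vocab

vocab = Vocab()
vocab.set_vector("cat", numpy.ones((5,), dtype="f"))
assert vocab["cat"].is_oov is False      # has a vector
assert vocab["hamster"].is_oov is True   # no vector
```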
@@ -117,8 +117,7 @@ class Lookups:
         """
         self._tables = {}
         for key, value in srsly.msgpack_loads(bytes_data).items():
-            self._tables[key] = Table(key)
-            self._tables[key].update(value)
+            self._tables[key] = Table(key, value)
         return self

     def to_disk(self, path, filename="lookups.bin", **kwargs):
@@ -189,7 +188,7 @@ class Table(OrderedDict):
         self.name = name
         # Assume a default size of 1M items
         self.default_size = 1e6
-        size = len(data) if data and len(data) > 0 else self.default_size
+        size = max(len(data), 1) if data is not None else self.default_size
         self.bloom = BloomFilter.from_error_rate(size)
         if data:
             self.update(data)
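One way to exercise the changed `Lookups.from_bytes` path is a simple serialization round trip; this sketch uses only the public `Lookups` API:

```python
# Hypothetical usage sketch (not part of the diff itself).
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"dogs": "dog", "mice": "mouse"})
restored = Lookups().from_bytes(lookups.to_bytes())
table = restored.get_table("lemma_lookup")
assert table.get("dogs") == "dog"
assert "mice" in table
```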
@@ -781,6 +781,20 @@ class ClozeMultitask(Pipe):
         if losses is not None:
             losses[self.name] += loss

+    @staticmethod
+    def decode_utf8_predictions(char_array):
+        # The format alternates filling from start and end, and 255 is missing
+        words = []
+        char_array = char_array.reshape((char_array.shape[0], -1, 256))
+        nr_char = char_array.shape[1]
+        char_array = char_array.argmax(axis=-1)
+        for row in char_array:
+            starts = [chr(c) for c in row[::2] if c != 255]
+            ends = [chr(c) for c in row[1::2] if c != 255]
+            word = "".join(starts + list(reversed(ends)))
+            words.append(word)
+        return words
+

 @component("textcat", assigns=["doc.cats"], default_model=default_textcat)
 class TextCategorizer(Pipe):
@@ -949,6 +963,7 @@ cdef class DependencyParser(Parser):
     assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
     requires = []
     TransitionSystem = ArcEager
+    nr_feature = 8

     @property
     def postprocesses(self):
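The new `decode_utf8_predictions` helper assumes a character-prediction layout that alternates between characters filled from the start of the word and characters filled from the end, with class 255 meaning "no character". A self-contained sketch of that layout and the same decoding logic, inlined so it runs without a model (the helper names here are my own):

```python
# Hypothetical sketch (not part of the diff itself).
import numpy

def one_hot_row(codes, n_classes=256):
    # One row of scores: one-hot over 256 classes per character slot, flattened.
    row = numpy.zeros((len(codes), n_classes), dtype="f")
    row[numpy.arange(len(codes)), codes] = 1.0
    return row.reshape(-1)

# Two words, four character slots each: "word" and "to" (padded with 255).
rows = numpy.stack([
    one_hot_row([ord("w"), ord("d"), ord("o"), ord("r")]),
    one_hot_row([ord("t"), ord("o"), 255, 255]),
])
# Same decoding as the new static method above.
decoded = []
for row in rows.reshape((rows.shape[0], -1, 256)).argmax(axis=-1):
    starts = [chr(c) for c in row[::2] if c != 255]
    ends = [chr(c) for c in row[1::2] if c != 255]
    decoded.append("".join(starts + list(reversed(ends))))
assert decoded == ["word", "to"]
```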
@@ -167,6 +167,11 @@ def nb_tokenizer():
     return get_lang_class("nb").Defaults.create_tokenizer()


+@pytest.fixture(scope="session")
+def ne_tokenizer():
+    return get_lang_class("ne").Defaults.create_tokenizer()
+
+
 @pytest.fixture(scope="session")
 def nl_tokenizer():
     return get_lang_class("nl").Defaults.create_tokenizer()
@@ -102,10 +102,16 @@ def test_doc_api_getitem(en_tokenizer):
 )
 def test_doc_api_serialize(en_tokenizer, text):
     tokens = en_tokenizer(text)
+    tokens[0].lemma_ = "lemma"
+    tokens[0].norm_ = "norm"
+    tokens[0].ent_kb_id_ = "ent_kb_id"
     new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
     assert tokens.text == new_tokens.text
     assert [t.text for t in tokens] == [t.text for t in new_tokens]
     assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
+    assert new_tokens[0].lemma_ == "lemma"
+    assert new_tokens[0].norm_ == "norm"
+    assert new_tokens[0].ent_kb_id_ == "ent_kb_id"

     new_tokens = Doc(tokens.vocab).from_bytes(
         tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
@@ -1,7 +1,7 @@
 import pytest

 from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS
-from spacy.lang.ja import Japanese
+from spacy.lang.ja import Japanese, DetailedToken

 # fmt: off
 TOKENIZER_TESTS = [
@@ -93,6 +93,57 @@ def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c):
     assert len(nlp_c(text)) == len_c


+@pytest.mark.parametrize("text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c",
+    [
+        (
+            "選挙管理委員会",
+            [None, None, None, None],
+            [None, None, [
+                [
+                    DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
+                    DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
+                ]
+            ]],
+            [[
+                [
+                    DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
+                    DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
+                    DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
+                    DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
+                ], [
+                    DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
+                    DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
+                    DetailedToken(surface='委員会', tag='名詞-普通名詞-一般', inf='', lemma='委員会', reading='イインカイ', sub_tokens=None),
+                ]
+            ]]
+        ),
+    ]
+)
+def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c):
+    nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}})
+    nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
+    nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
+
+    assert ja_tokenizer(text).user_data["sub_tokens"] == sub_tokens_list_a
+    assert nlp_a(text).user_data["sub_tokens"] == sub_tokens_list_a
+    assert nlp_b(text).user_data["sub_tokens"] == sub_tokens_list_b
+    assert nlp_c(text).user_data["sub_tokens"] == sub_tokens_list_c
+
+
+@pytest.mark.parametrize("text,inflections,reading_forms",
+    [
+        (
+            "取ってつけた",
+            ("五段-ラ行,連用形-促音便", "", "下一段-カ行,連用形-一般", "助動詞-タ,終止形-一般"),
+            ("トッ", "テ", "ツケ", "タ"),
+        ),
+    ]
+)
+def test_ja_tokenizer_inflections_reading_forms(ja_tokenizer, text, inflections, reading_forms):
+    assert ja_tokenizer(text).user_data["inflections"] == inflections
+    assert ja_tokenizer(text).user_data["reading_forms"] == reading_forms
+
+
 def test_ja_tokenizer_emptyish_texts(ja_tokenizer):
     doc = ja_tokenizer("")
     assert len(doc) == 0
spacy/tests/lang/ne/__init__.py (new file, 0 lines)

spacy/tests/lang/ne/test_text.py (new file, 19 lines)
@@ -0,0 +1,19 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_ne_tokenizer_handlers_long_text(ne_tokenizer):
+    text = """मैले पाएको सर्टिफिकेटलाई म त बोक्रो सम्झन्छु र अभ्यास तब सुरु भयो, जब मैले कलेज पार गरेँ र जीवनको पढाइ सुरु गरेँ ।"""
+    tokens = ne_tokenizer(text)
+    assert len(tokens) == 24
+
+
+@pytest.mark.parametrize(
+    "text,length",
+    [("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)],
+)
+def test_ne_tokenizer_handles_cnts(ne_tokenizer, text, length):
+    tokens = ne_tokenizer(text)
+    assert len(tokens) == length
@@ -4,7 +4,9 @@ from spacy import util
 from spacy.gold import Example
 from spacy.lang.en import English
 from spacy.language import Language
-from spacy.tests.util import make_tempdir
+from spacy.symbols import POS, NOUN
+
+from ..util import make_tempdir


 def test_label_types():
@@ -15,6 +17,19 @@ def test_label_types():
         nlp.get_pipe("tagger").add_label(9)


+def test_tagger_begin_training_tag_map():
+    """Test that Tagger.begin_training() without gold tuples does not clobber
+    the tag map."""
+    nlp = Language()
+    tagger = nlp.create_pipe("tagger")
+    orig_tag_count = len(tagger.labels)
+    tagger.add_label("A", {"POS": "NOUN"})
+    nlp.add_pipe(tagger)
+    nlp.begin_training()
+    assert nlp.vocab.morphology.tag_map["A"] == {POS: NOUN}
+    assert orig_tag_count + 1 == len(nlp.get_pipe("tagger").labels)
+
+
 TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}

 MORPH_RULES = {"V": {"like": {"lemma": "luck"}}}
@@ -11,6 +11,7 @@ from spacy.lang.en import English
 from spacy.lemmatizer import Lemmatizer
 from spacy.lookups import Lookups
 from spacy.tokens import Doc, Span
+from spacy.lang.en import EnglishDefaults

 from ..util import get_doc, make_tempdir

@@ -164,7 +165,7 @@ def test_issue595():
     lookups.add_table("lemma_rules", {"verb": [["ed", "e"]]})
     lookups.add_table("lemma_index", {"verb": {}})
     lookups.add_table("lemma_exc", {"verb": {}})
-    lemmatizer = Lemmatizer(lookups)
+    lemmatizer = Lemmatizer(lookups, is_base_form=EnglishDefaults.is_base_form)
     vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
     doc = Doc(vocab, words=words)
     doc[2].tag_ = "VB"
@@ -57,7 +57,7 @@ def test_issue2626_2835(en_tokenizer, text):


 def test_issue2656(en_tokenizer):
-    """Test that tokenizer correctly splits of punctuation after numbers with
+    """Test that tokenizer correctly splits off punctuation after numbers with
     decimal points.
     """
     doc = en_tokenizer("I went for 40.3, and got home by 10.0.")
@@ -2,6 +2,7 @@ import pytest
 from spacy.tokens import Doc
 from spacy.language import Language
 from spacy.lookups import Lookups
+from spacy.lemmatizer import Lemmatizer


 def test_lemmatizer_reflects_lookups_changes():
@@ -46,3 +47,14 @@ def test_tagger_warns_no_lookups():
     with pytest.warns(None) as record:
         nlp.begin_training()
     assert not record.list
+
+
+def test_lemmatizer_without_is_base_form_implementation():
+    # Norwegian example from #5658
+    lookups = Lookups()
+    lookups.add_table("lemma_rules", {"noun": []})
+    lookups.add_table("lemma_index", {"noun": {}})
+    lookups.add_table("lemma_exc", {"noun": {"formuesskatten": ["formuesskatt"]}})
+
+    lemmatizer = Lemmatizer(lookups, is_base_form=None)
+    assert lemmatizer("Formuesskatten", "noun", {'Definite': 'def', 'Gender': 'masc', 'Number': 'sing'}) == ["formuesskatt"]
@@ -370,6 +370,6 @@ def test_vector_is_oov():
     data[1] = 2.0
     vocab.set_vector("cat", data[0])
     vocab.set_vector("dog", data[1])
-    assert vocab["cat"].is_oov is True
-    assert vocab["dog"].is_oov is True
-    assert vocab["hamster"].is_oov is False
+    assert vocab["cat"].is_oov is False
+    assert vocab["dog"].is_oov is False
+    assert vocab["hamster"].is_oov is True
@@ -1062,7 +1062,7 @@ cdef class Doc:

         DOCS: https://spacy.io/api/doc#to_bytes
         """
-        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM]  # TODO: ENT_KB_ID ?
+        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM, ENT_KB_ID]
         if self.is_tagged:
             array_head.extend([TAG, POS])
         # If doc parsed add head and dep attribute
@@ -1071,6 +1071,14 @@ cdef class Doc:
         # Otherwise add sent_start
         else:
             array_head.append(SENT_START)
+        strings = set()
+        for token in self:
+            strings.add(token.tag_)
+            strings.add(token.lemma_)
+            strings.add(token.dep_)
+            strings.add(token.ent_type_)
+            strings.add(token.ent_kb_id_)
+            strings.add(token.norm_)
         # Msgpack doesn't distinguish between lists and tuples, which is
         # vexing for user data. As a best guess, we *know* that within
         # keys, we must have tuples. In values we just have to hope
@@ -1082,6 +1090,7 @@ cdef class Doc:
             "sentiment": lambda: self.sentiment,
             "tensor": lambda: self.tensor,
             "cats": lambda: self.cats,
+            "strings": lambda: list(strings),
             "has_unknown_spaces": lambda: self.has_unknown_spaces
         }
         if "user_data" not in exclude and self.user_data:
@@ -1110,6 +1119,7 @@ cdef class Doc:
             "sentiment": lambda b: None,
             "tensor": lambda b: None,
             "cats": lambda b: None,
+            "strings": lambda b: None,
             "user_data_keys": lambda b: None,
             "user_data_values": lambda b: None,
             "has_unknown_spaces": lambda b: None
@@ -1130,6 +1140,9 @@ cdef class Doc:
             self.tensor = msg["tensor"]
         if "cats" not in exclude and "cats" in msg:
             self.cats = msg["cats"]
+        if "strings" not in exclude and "strings" in msg:
+            for s in msg["strings"]:
+                self.vocab.strings.add(s)
         if "has_unknown_spaces" not in exclude and "has_unknown_spaces" in msg:
             self.has_unknown_spaces = msg["has_unknown_spaces"]
         start = 0
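Shipping the attribute strings in `Doc.to_bytes` means a serialized `Doc` can be restored without losing string attributes such as lemmas. A hedged sketch of that round trip (restoring into a fresh `Vocab` is my own extrapolation of the new test earlier in this diff):

```python
# Hypothetical usage sketch (not part of the diff itself).
from spacy.lang.en import English
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = English()
doc = nlp("I like dogs")
doc[2].lemma_ = "dog"
restored = Doc(Vocab()).from_bytes(doc.to_bytes())
assert restored[2].lemma_ == "dog"
```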
@@ -923,7 +923,7 @@ cdef class Token:
     @property
     def is_oov(self):
         """RETURNS (bool): Whether the token is out-of-vocabulary."""
-        return self.c.lex.orth in self.vocab.vectors
+        return self.c.lex.orth not in self.vocab.vectors

     @property
     def is_stop(self):
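A hedged illustration (not from the diff) of the corrected behaviour, consistent with the `vocab["hamster"]` assertion near the top of this section: with no vectors loaded, every orth is absent from `vocab.vectors`, so `is_oov` now reports `True`.

```python
# Sketch, assuming spaCy v2.3 with a blank pipeline and no word vectors:
# after the fix, is_oov is True for tokens whose orth has no vector entry.
import spacy

nlp = spacy.blank("en")
doc = nlp("hamster")
assert doc[0].is_oov is True                 # Token.is_oov, fixed above
assert nlp.vocab["hamster"].is_oov is True   # Lexeme equivalent, as in the test
```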
@@ -187,6 +187,10 @@ def load_model_from_path(model_path, meta=False, **overrides):
         pipeline = nlp.Defaults.pipe_names
     elif pipeline in (False, None):
         pipeline = []
+    # skip "vocab" from overrides in component initialization since vocab is
+    # already configured from overrides when nlp is initialized above
+    if "vocab" in overrides:
+        del overrides["vocab"]
     for name in pipeline:
         if name not in disable:
             config = meta.get("pipeline_args", {}).get(name, {})
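A hedged usage sketch of what this `load_model_from_path` change supports (the model path below is hypothetical, not from the diff): a `Vocab` passed as an override is consumed when the `Language` object is created, so it is dropped before the pipeline components are initialized.

```python
# Sketch, assuming spaCy v2.3; "/path/to/model" is a hypothetical model
# directory. The vocab override configures the Language object; the change
# above then removes it so components aren't passed an unexpected "vocab".
import spacy
from spacy import util

base = spacy.blank("en")
nlp = util.load_model_from_path("/path/to/model", vocab=base.vocab)
```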
@@ -105,8 +105,8 @@ The Chinese language class supports three word segmentation options:
 > ```

 1. **Character segmentation:** Character segmentation is the default
-   segmentation option. It's enabled when you create a new `Chinese`
-   language class or call `spacy.blank("zh")`.
+   segmentation option. It's enabled when you create a new `Chinese` language
+   class or call `spacy.blank("zh")`.
 2. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word
    segmentation with the tokenizer option `{"segmenter": "jieba"}`.
 3. **PKUSeg**: As of spaCy v2.3.0, support for
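For reference, a hedged sketch (assuming the spaCy v2.3 `meta`-based tokenizer config used on this docs page) of the three segmentation options listed above:

```python
# Sketch, assuming spaCy v2.3: the segmenter is selected via the tokenizer
# config in the language meta.
from spacy.lang.zh import Chinese

# 1. Character segmentation (the default, same as spacy.blank("zh"))
nlp_char = Chinese()

# 2. Jieba word segmentation
nlp_jieba = Chinese(meta={"tokenizer": {"config": {"segmenter": "jieba"}}})

# 3. PKUSeg word segmentation (requires the pkuseg package and a model)
nlp_pkuseg = Chinese(
    meta={"tokenizer": {"config": {"segmenter": "pkuseg", "pkuseg_model": "default"}}}
)
```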
@@ -1,5 +1,58 @@
 {
     "resources": [
+        {
+            "id": "spacy-streamlit",
+            "title": "spacy-streamlit",
+            "slogan": "spaCy building blocks for Streamlit apps",
+            "github": "explosion/spacy-streamlit",
+            "description": "This package contains utilities for visualizing spaCy models and building interactive spaCy-powered apps with [Streamlit](https://streamlit.io). It includes various building blocks you can use in your own Streamlit app, like visualizers for **syntactic dependencies**, **named entities**, **text classification**, **semantic similarity** via word vectors, token attributes, and more.",
+            "pip": "spacy-streamlit",
+            "category": ["visualizers"],
+            "thumb": "https://i.imgur.com/mhEjluE.jpg",
+            "image": "https://user-images.githubusercontent.com/13643239/85388081-f2da8700-b545-11ea-9bd4-e303d3c5763c.png",
+            "code_example": [
+                "import spacy_streamlit",
+                "",
+                "models = [\"en_core_web_sm\", \"en_core_web_md\"]",
+                "default_text = \"Sundar Pichai is the CEO of Google.\"",
+                "spacy_streamlit.visualize(models, default_text)"
+            ],
+            "author": "Ines Montani",
+            "author_links": {
+                "twitter": "_inesmontani",
+                "github": "ines",
+                "website": "https://ines.io"
+            }
+        },
+        {
+            "id": "spaczz",
+            "title": "spaczz",
+            "slogan": "Fuzzy matching and more for spaCy.",
+            "description": "Spaczz provides fuzzy matching and multi-token regex matching functionality for spaCy. Spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.",
+            "github": "gandersen101/spaczz",
+            "pip": "spaczz",
+            "code_example": [
+                "import spacy",
+                "from spaczz.pipeline import SpaczzRuler",
+                "",
+                "nlp = spacy.blank('en')",
+                "ruler = SpaczzRuler(nlp)",
+                "ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])",
+                "nlp.add_pipe(ruler)",
+                "",
+                "doc = nlp('Oops, I spelled Bill Gatez wrong.')",
+                "print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])"
+            ],
+            "code_language": "python",
+            "url": "https://spaczz.readthedocs.io/en/latest/",
+            "author": "Grant Andersen",
+            "author_links": {
+                "twitter": "gandersen101",
+                "github": "gandersen101"
+            },
+            "category": ["pipeline"],
+            "tags": ["fuzzy-matching", "regex"]
+        },
         {
             "id": "spacy-universal-sentence-encoder",
             "title": "SpaCy - Universal Sentence Encoder",
@@ -1238,6 +1291,19 @@
             "youtube": "K1elwpgDdls",
             "category": ["videos"]
         },
+        {
+            "type": "education",
+            "id": "video-spacy-course-es",
+            "title": "NLP avanzado con spaCy · Un curso en línea gratis",
+            "description": "spaCy es un paquete moderno de Python para hacer Procesamiento de Lenguaje Natural de potencia industrial. En este curso en línea, interactivo y gratuito, aprenderás a usar spaCy para construir sistemas avanzados de comprensión de lenguaje natural usando enfoques basados en reglas y en machine learning.",
+            "url": "https://course.spacy.io/es",
+            "author": "Camila Gutiérrez",
+            "author_links": {
+                "twitter": "Mariacamilagl30"
+            },
+            "youtube": "RNiLVCE5d4k",
+            "category": ["videos"]
+        },
         {
             "type": "education",
             "id": "video-intro-to-nlp-episode-1",
@@ -1294,6 +1360,20 @@
             "youtube": "IqOJU1-_Fi0",
             "category": ["videos"]
         },
+        {
+            "type": "education",
+            "id": "video-intro-to-nlp-episode-5",
+            "title": "Intro to NLP with spaCy (5)",
+            "slogan": "Episode 5: Rules vs. Machine Learning",
+            "description": "In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recognition model from scratch.",
+            "author": "Vincent Warmerdam",
+            "author_links": {
+                "twitter": "fishnets88",
+                "github": "koaning"
+            },
+            "youtube": "f4sqeLRzkPg",
+            "category": ["videos"]
+        },
         {
             "type": "education",
             "id": "video-spacy-irl-entity-linking",
@@ -2348,6 +2428,56 @@
             },
             "category": ["pipeline", "conversational", "research"],
             "tags": ["spell check", "correction", "preprocessing", "translation", "correction"]
+        },
+        {
+            "id": "texthero",
+            "title": "Texthero",
+            "slogan": "Text preprocessing, representation and visualization from zero to hero.",
+            "description": "Texthero is a Python package to work with text data efficiently. It empowers NLP developers with a tool to quickly understand any text-based dataset and it provides a solid pipeline to clean and represent text data, from zero to hero.",
+            "github": "jbesomi/texthero",
+            "pip": "texthero",
+            "code_example": [
+                "import texthero as hero",
+                "import pandas as pd",
+                "",
+                "df = pd.read_csv('https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv')",
+                "df['named_entities'] = hero.named_entities(df['text'])",
+                "df.head()"
+            ],
+            "code_language": "python",
+            "url": "https://texthero.org",
+            "thumb": "https://texthero.org/img/T.png",
+            "image": "https://texthero.org/docs/assets/texthero.png",
+            "author": "Jonathan Besomi",
+            "author_links": {
+                "github": "jbesomi",
+                "website": "https://besomi.ai"
+            },
+            "category": ["standalone"]
+        },
+        {
+            "id": "cov-bsv",
+            "title": "VA COVID-19 NLP BSV",
+            "slogan": "spaCy pipeline for COVID-19 surveillance.",
+            "github": "abchapman93/VA_COVID-19_NLP_BSV",
+            "description": "A spaCy rule-based pipeline for identifying positive cases of COVID-19 from clinical text. A version of this system was deployed as part of the US Department of Veterans Affairs biosurveillance response to COVID-19.",
+            "pip": "cov-bsv",
+            "code_example": [
+                "import cov_bsv",
+                "",
+                "nlp = cov_bsv.load()",
+                "text = 'Pt tested for COVID-19. His wife was recently diagnosed with novel coronavirus. SARS-COV-2: Detected'",
+                "doc = nlp(text)",
+                "",
+                "print(doc.ents)",
+                "print(doc._.cov_classification)",
+                "cov_bsv.visualize_doc(doc)"
+            ],
+            "category": ["pipeline", "standalone", "biomedical", "scientific"],
+            "tags": ["clinical", "epidemiology", "covid-19", "surveillance"],
+            "author": "Alec Chapman",
+            "author_links": {
+                "github": "abchapman93"
+            }
         }
     ],

56916 website/package-lock.json (generated): file diff suppressed because it is too large.