Merge pull request #5788 from explosion/master-tmp

2025-10-17 17:24:14 +03:00 · 2020-07-20 15:39:24 +02:00 · 2020-07-20 15:39:24 +02:00 · 311d0bde29
commit 311d0bde29
parent c9da9605f7 d51db72e46
45 changed files with 30377 additions and 28795 deletions
--- a/.github/contributors/PluieElectrique.md
+++ b/.github/contributors/PluieElectrique.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [X] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Pluie                |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2020-06-18           |
+| GitHub username                | PluieElectrique      |
+| Website (optional)             |                      |
--- a/.github/contributors/abchapman93.md
+++ b/.github/contributors/abchapman93.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [X] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Alec Chapman         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 7/17/2020            |
+| GitHub username                | abchapman93          |
+| Website (optional)             |                      |
--- a/.github/contributors/gandersen101.md
+++ b/.github/contributors/gandersen101.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [ x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Grant Andersen       |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 07.06.2020           |
+| GitHub username                | gandersen101         |
+| Website (optional)             |                      |
--- a/.github/contributors/jbesomi.md
+++ b/.github/contributors/jbesomi.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Jonathan B.          |
+| Company name (if applicable)   | besomi.ai            |
+| Title or role (if applicable)  | -                    |
+| Date                           | 07.07.2020           |
+| GitHub username                | jbesomi              |
+| Website (optional)             | besomi.ai            |
--- a/.github/contributors/mikeizbicki.md
+++ b/.github/contributors/mikeizbicki.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Mike Izbicki         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 02 Jun 2020          |
+| GitHub username                | mikeizbicki          |
+| Website (optional)             | https://izbicki.me   |
--- a/.github/contributors/rameshhpathak.md
+++ b/.github/contributors/rameshhpathak.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Ramesh Pathak        |
+| Company name (if applicable)   | Diyo AI              |
+| Title or role (if applicable)  | AI Engineer          |
+| Date                           | June 21, 2020        |
+| GitHub username                | rameshhpathak        |
+| Website (optional)             | rameshhpathak.github.io |
--- a/.github/contributors/richardliaw.md
+++ b/.github/contributors/richardliaw.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Richard Liaw         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 06/22/2020           |
+| GitHub username                | richardliaw          |
+| Website (optional)             |                      |
--- a/.gitignore
+++ b/.gitignore
@ -71,6 +71,7 @@ Pipfile.lock
 *.egg
 .eggs
 MANIFEST
+spacy/git_info.py

 # Temporary files
 *.~*
--- a/MANIFEST.in
+++ b/MANIFEST.in
@ -5,3 +5,4 @@ include README.md
 include pyproject.toml
 recursive-exclude spacy/lang *.json
 recursive-include spacy/lang *.json.gz
+recursive-include licenses *
--- a/4
+++ b/4
@ -5,7 +5,7 @@ VENV := ./env$(PYVER)
 version := $(shell "bin/get-version.sh")

 dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
-	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core
+	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core
 	chmod a+rx $@
 	cp $@ dist/spacy.pex

@ -15,7 +15,7 @@ dist/pytest.pex : wheelhouse/pytest-*.whl

 wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
 	$(VENV)/bin/pip wheel . -w ./wheelhouse
-	$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.22 sudachipy sudachidict_core -w ./wheelhouse
+	$(VENV)/bin/pip wheel spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core -w ./wheelhouse
 	touch $@

 wheelhouse/pytest-%.whl : $(VENV)/bin/pex
--- a/examples/training/train_entity_linker.py
+++ b/examples/training/train_entity_linker.py
@ -16,8 +16,6 @@ from __future__ import unicode_literals, print_function
 import plac
 import random
 from pathlib import Path
-
-from spacy.vocab import Vocab
 import spacy
 from spacy.kb import KnowledgeBase

@ -61,13 +59,13 @@ TRAIN_DATA = sample_train_data()
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
 )
-def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
+def main(kb_path, vocab_path, output_dir=None, n_iter=50):
    """Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
    The `vocab` should be the one used during creation of the KB."""
-    vocab = Vocab().from_disk(vocab_path)
    # create blank English model with correct vocab
-    nlp = spacy.blank("en", vocab=vocab)
-    nlp.vocab.vectors.name = "nel_vectors"
+    nlp = spacy.blank("en")
+    nlp.vocab.from_disk(vocab_path)
+    nlp.vocab.vectors.name = "spacy_pretrained_vectors"
    print("Created blank 'en' model with vocab from '%s'" % vocab_path)

    # Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.
@ -96,7 +94,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
    # Convert the texts to docs to make sure we have doc.ents set for the training examples.
    # Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
    kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
-    train_examples  = []
+    train_examples = []
    for text, annotation in TRAIN_DATA:
        with nlp.select_pipes(disable="entity_linker"):
            doc = nlp(text)
@ -111,7 +109,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
                        "Removed", kb_id, "from training because it is not in the KB."
                    )
            annotation_clean["links"][offset] = new_dict
-        train_examples .append(Example.from_dict(doc, annotation_clean))
+        train_examples.append(Example.from_dict(doc, annotation_clean))

    with nlp.select_pipes(enable="entity_linker"):  # only train entity linker
        # reset and initialize the weights randomly
--- a/setup.py
+++ b/setup.py
@ -4,13 +4,14 @@ import sys
 import platform
 from distutils.command.build_ext import build_ext
 from distutils.sysconfig import get_python_inc
-import distutils.util
 from distutils import ccompiler, msvccompiler
 import numpy
 from pathlib import Path
 import shutil
 from Cython.Build import cythonize
 from Cython.Compiler import Options
+import os
+import subprocess


 ROOT = Path(__file__).parent
@ -75,7 +76,6 @@ COPY_FILES = {

 def is_new_osx():
    """Check whether we're on OSX >= 10.7"""
-    name = distutils.util.get_platform()
    if sys.platform != "darwin":
        return False
    mac_ver = platform.mac_ver()[0]
@ -118,6 +118,53 @@ class build_ext_subclass(build_ext, build_ext_options):
        build_ext.build_extensions(self)


+# Include the git version in the build (adapted from NumPy)
+# Copyright (c) 2005-2020, NumPy Developers.
+# BSD 3-Clause license, see licenses/3rd_party_licenses.txt
+def write_git_info_py(filename="spacy/git_info.py"):
+    def _minimal_ext_cmd(cmd):
+        # construct minimal environment
+        env = {}
+        for k in ["SYSTEMROOT", "PATH", "HOME"]:
+            v = os.environ.get(k)
+            if v is not None:
+                env[k] = v
+        # LANGUAGE is used on win32
+        env["LANGUAGE"] = "C"
+        env["LANG"] = "C"
+        env["LC_ALL"] = "C"
+        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, env=env)
+        return out
+
+    git_version = "Unknown"
+    if Path(".git").exists():
+        try:
+            out = _minimal_ext_cmd(["git", "rev-parse", "--short", "HEAD"])
+            git_version = out.strip().decode("ascii")
+        except Exception:
+            pass
+    elif Path(filename).exists():
+        # must be a source distribution, use existing version file
+        try:
+            a = open(filename, "r")
+            lines = a.readlines()
+            git_version = lines[-1].split('"')[1]
+        except Exception:
+            pass
+        finally:
+            a.close()
+
+    text = """# THIS FILE IS GENERATED FROM SPACY SETUP.PY
+#
+GIT_VERSION = "%(git_version)s"
+"""
+    a = open(filename, "w")
+    try:
+        a.write(text % {"git_version": git_version})
+    finally:
+        a.close()
+
+
 def clean(path):
    for path in path.glob("**/*"):
        if path.is_file() and path.suffix in (".so", ".cpp", ".html"):
@ -126,6 +173,7 @@ def clean(path):


 def setup_package():
+    write_git_info_py()
    if len(sys.argv) > 1 and sys.argv[1] == "clean":
        return clean(PACKAGE_ROOT)
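
The git-version logic added to `setup.py` above can be sketched as three small helpers. This is an illustrative restatement under stated assumptions, not the code from the commit: the names `read_git_version`, `render_git_info` and `parse_git_info` are hypothetical, and only the `git rev-parse --short HEAD` call, the minimal environment, and the generated `GIT_VERSION = "..."` line are taken from the diff.

```python
import os
import subprocess
from pathlib import Path


def read_git_version(repo_dir="."):
    """Return the short commit hash for repo_dir, or "Unknown" on any failure."""
    if not (Path(repo_dir) / ".git").exists():
        return "Unknown"
    # Pass only a minimal environment and force the C locale so that
    # git's output does not depend on the user's language settings.
    env = {k: os.environ[k] for k in ("SYSTEMROOT", "PATH", "HOME") if k in os.environ}
    env.update({"LANGUAGE": "C", "LANG": "C", "LC_ALL": "C"})
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            stderr=subprocess.STDOUT,
            env=env,
            cwd=str(repo_dir),
        )
        return out.strip().decode("ascii")
    except Exception:
        # git missing, not a repository, non-ASCII output, etc.
        return "Unknown"


def render_git_info(git_version):
    """Build the text of the generated module (hypothetical helper)."""
    return (
        "# THIS FILE IS GENERATED FROM SPACY SETUP.PY\n"
        "#\n"
        'GIT_VERSION = "%s"\n' % git_version
    )


def parse_git_info(text):
    """Recover the version from previously generated text (the sdist case)."""
    return text.splitlines()[-1].split('"')[1]
```

Splitting the rendering and parsing steps apart makes the round trip easy to check: whatever `render_git_info` writes, `parse_git_info` should read back unchanged, which is the property the source-distribution branch in the diff relies on.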

--- a/spacy/lang/en/init.py
+++ b/spacy/lang/en/init.py
@ -31,6 +31,41 @@ class EnglishDefaults(Language.Defaults):
        {"tags": ["``", "''"], "variants": [('"', '"'), ("“", "”")]},
    ]

+    @classmethod
+    def is_base_form(cls, univ_pos, morphology=None):
+        """
+        Check whether we're dealing with an uninflected paradigm, so we can
+        avoid lemmatization entirely.
+
+        univ_pos (unicode / int): The token's universal part-of-speech tag.
+        morphology (dict): The token's morphological features following the
+            Universal Dependencies scheme.
+        """
+        if morphology is None:
+            morphology = {}
+        if univ_pos == "noun" and morphology.get("Number") == "sing":
+            return True
+        elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
+            return True
+        # This maps 'VBP' to base form -- probably just need 'IS_BASE'
+        # morphology
+        elif univ_pos == "verb" and (
+            morphology.get("VerbForm") == "fin"
+            and morphology.get("Tense") == "pres"
+            and morphology.get("Number") is None
+        ):
+            return True
+        elif univ_pos == "adj" and morphology.get("Degree") == "pos":
+            return True
+        elif morphology.get("VerbForm") == "inf":
+            return True
+        elif morphology.get("VerbForm") == "none":
+            return True
+        elif morphology.get("Degree") == "pos":
+            return True
+        else:
+            return False
+

 class English(Language):
    lang = "en"
--- a/spacy/lang/fr/lemmatizer.py
+++ b/spacy/lang/fr/lemmatizer.py
@ -41,9 +41,6 @@ class FrenchLemmatizer(Lemmatizer):
            univ_pos = "sconj"
        else:
            return [self.lookup(string)]
-        # See Issue #435 for example of where this logic is requied.
-        if self.is_base_form(univ_pos, morphology):
-            return list(set([string.lower()]))
        index_table = self.lookups.get_table("lemma_index", {})
        exc_table = self.lookups.get_table("lemma_exc", {})
        rules_table = self.lookups.get_table("lemma_rules", {})
--- a/spacy/lang/hy/examples.py
+++ b/spacy/lang/hy/examples.py
@ -8,6 +8,6 @@ Example sentences to test spaCy and its language models.
 sentences = [
    "Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։",
    "Ո՞վ է Ֆրանսիայի նախագահը։",
-    "Որն է Միացյալ Նահանգների մայրաքաղաքը։",
+    "Ո՞րն է Միացյալ Նահանգների մայրաքաղաքը։",
    "Ե՞րբ է ծնվել Բարաք Օբաման։",
 ]
--- a/spacy/lang/hy/lex_attrs.py
+++ b/spacy/lang/hy/lex_attrs.py
@ -15,14 +15,15 @@ _num_words = [
    "տասը",
    "տասնմեկ",
    "տասներկու",
-    "տասներեք",
-    "տասնչորս",
-    "տասնհինգ",
-    "տասնվեց",
-    "տասնյոթ",
-    "տասնութ",
-    "տասնինը",
-    "քսան" "երեսուն",
+    "տասներեք",
+    "տասնչորս",
+    "տասնհինգ",
+    "տասնվեց",
+    "տասնյոթ",
+    "տասնութ",
+    "տասնինը",
+    "քսան",
+    "երեսուն",
    "քառասուն",
    "հիսուն",
    "վաթսուն",
--- a/spacy/lang/ja/init.py
+++ b/spacy/lang/ja/init.py
@ -17,12 +17,9 @@ from ... import util


 # Hold the attributes we need with convenient names
-DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])
-
-# Handling for multiple spaces in a row is somewhat awkward, this simplifies
-# the flow by creating a dummy with the same interface.
-DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
-DummySpace = DummyNode(" ", " ", " ")
+DetailedToken = namedtuple(
+    "DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"]
+)


 def try_sudachi_import(split_mode="A"):
@ -49,7 +46,7 @@ def try_sudachi_import(split_mode="A"):
        )


-def resolve_pos(orth, pos, next_pos):
+def resolve_pos(orth, tag, next_tag):
    """If necessary, add a field to the POS tag for UD mapping.
    Under Universal Dependencies, sometimes the same Unidic POS tag can
    be mapped differently depending on the literal token or its context
@ -60,127 +57,80 @@ def resolve_pos(orth, pos, next_pos):
    # Some tokens have their UD tag decided based on the POS of the following
    # token.

-    # orth based rules
-    if pos[0] in TAG_ORTH_MAP:
-        orth_map = TAG_ORTH_MAP[pos[0]]
+    # apply orth based mapping
+    if tag in TAG_ORTH_MAP:
+        orth_map = TAG_ORTH_MAP[tag]
        if orth in orth_map:
-            return orth_map[orth], None
+            return orth_map[orth], None  # current_pos, next_pos

-    # tag bi-gram mapping
-    if next_pos:
-        tag_bigram = pos[0], next_pos[0]
+    # apply tag bi-gram mapping
+    if next_tag:
+        tag_bigram = tag, next_tag
        if tag_bigram in TAG_BIGRAM_MAP:
-            bipos = TAG_BIGRAM_MAP[tag_bigram]
-            if bipos[0] is None:
-                return TAG_MAP[pos[0]][POS], bipos[1]
+            current_pos, next_pos = TAG_BIGRAM_MAP[tag_bigram]
+            if current_pos is None:  # apply tag uni-gram mapping for current_pos
+                return (
+                    TAG_MAP[tag][POS],
+                    next_pos,
+                )  # only next_pos is identified by tag bi-gram mapping
            else:
-                return bipos
+                return current_pos, next_pos

-    return TAG_MAP[pos[0]][POS], None
+    # apply tag uni-gram mapping
+    return TAG_MAP[tag][POS], None


-# Use a mapping of paired punctuation to avoid splitting quoted sentences.
-pairpunct = {"「": "」", "『": "』", "【": "】"}
-
-
-def separate_sentences(doc):
-    """Given a doc, mark tokens that start sentences based on Unidic tags.
-    """
-
-    stack = []  # save paired punctuation
-
-    for i, token in enumerate(doc[:-2]):
-        # Set all tokens after the first to false by default. This is necessary
-        # for the doc code to be aware we've done sentencization, see
-        # `is_sentenced`.
-        token.sent_start = i == 0
-        if token.tag_:
-            if token.tag_ == "補助記号-括弧開":
-                ts = str(token)
-                if ts in pairpunct:
-                    stack.append(pairpunct[ts])
-                elif stack and ts == stack[-1]:
-                    stack.pop()
-
-            if token.tag_ == "補助記号-句点":
-                next_token = doc[i + 1]
-                if next_token.tag_ != token.tag_ and not stack:
-                    next_token.sent_start = True
-
-
-def get_dtokens(tokenizer, text):
-    tokens = tokenizer.tokenize(text)
-    words = []
-    for ti, token in enumerate(tokens):
-        tag = "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"])
-        inf = "-".join([xx for xx in token.part_of_speech()[4:] if xx != "*"])
-        dtoken = DetailedToken(token.surface(), (tag, inf), token.dictionary_form())
-        if ti > 0 and words[-1].pos[0] == "空白" and tag == "空白":
-            # don't add multiple space tokens in a row
-            continue
-        words.append(dtoken)
-
-    # remove empty tokens. These can be produced with characters like … that
-    # Sudachi normalizes internally.
-    words = [ww for ww in words if len(ww.surface) > 0]
-    return words
-
-
-def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
+def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
+    # Compare the content of tokens and text, first
    words = [x.surface for x in dtokens]
    if "".join("".join(words).split()) != "".join(text.split()):
        raise ValueError(Errors.E194.format(text=text, words=words))
-    text_words = []
-    text_lemmas = []
-    text_tags = []
+
+    text_dtokens = []
    text_spaces = []
    text_pos = 0
    # handle empty and whitespace-only texts
    if len(words) == 0:
-        return text_words, text_lemmas, text_tags, text_spaces
+        return text_dtokens, text_spaces
    elif len([word for word in words if not word.isspace()]) == 0:
        assert text.isspace()
-        text_words = [text]
-        text_lemmas = [text]
-        text_tags = [gap_tag]
+        text_dtokens = [DetailedToken(text, gap_tag, "", text, None, None)]
        text_spaces = [False]
-        return text_words, text_lemmas, text_tags, text_spaces
-    # normalize words to remove all whitespace tokens
-    norm_words, norm_dtokens = zip(
-        *[
-            (word, dtokens)
-            for word, dtokens in zip(words, dtokens)
-            if not word.isspace()
-        ]
-    )
-    # align words with text
-    for word, dtoken in zip(norm_words, norm_dtokens):
+        return text_dtokens, text_spaces
+
+    # align words and dtokens against the text, inserting gap tokens for runs of space chars
+    for word, dtoken in zip(words, dtokens):
+        # skip all space tokens
+        if word.isspace():
+            continue
        try:
            word_start = text[text_pos:].index(word)
        except ValueError:
            raise ValueError(Errors.E194.format(text=text, words=words))
+
+        # space token
        if word_start > 0:
            w = text[text_pos : text_pos + word_start]
-            text_words.append(w)
-            text_lemmas.append(w)
-            text_tags.append(gap_tag)
+            text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None))
            text_spaces.append(False)
            text_pos += word_start
-        text_words.append(word)
-        text_lemmas.append(dtoken.lemma)
-        text_tags.append(dtoken.pos)
+
+        # content word
+        text_dtokens.append(dtoken)
        text_spaces.append(False)
        text_pos += len(word)
+        # poll a space char after the word
        if text_pos < len(text) and text[text_pos] == " ":
            text_spaces[-1] = True
            text_pos += 1
+
+    # trailing space token
    if text_pos < len(text):
        w = text[text_pos:]
-        text_words.append(w)
-        text_lemmas.append(w)
-        text_tags.append(gap_tag)
+        text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None))
        text_spaces.append(False)
-    return text_words, text_lemmas, text_tags, text_spaces
+
+    return text_dtokens, text_spaces


 class JapaneseTokenizer(DummyTokenizer):
@ -190,29 +140,96 @@ class JapaneseTokenizer(DummyTokenizer):
        self.tokenizer = try_sudachi_import(self.split_mode)

    def __call__(self, text):
-        dtokens = get_dtokens(self.tokenizer, text)
+        # convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces
+        sudachipy_tokens = self.tokenizer.tokenize(text)
+        dtokens = self._get_dtokens(sudachipy_tokens)
+        dtokens, spaces = get_dtokens_and_spaces(dtokens, text)

-        words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
+        # create Doc with tag bi-gram based part-of-speech identification rules
+        words, tags, inflections, lemmas, readings, sub_tokens_list = (
+            zip(*dtokens) if dtokens else [[]] * 6
+        )
+        sub_tokens_list = list(sub_tokens_list)
        doc = Doc(self.vocab, words=words, spaces=spaces)
-        next_pos = None
-        for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
-            token.tag_ = unidic_tag[0]
-            if next_pos:
+        next_pos = None  # for bi-gram rules
+        for idx, (token, dtoken) in enumerate(zip(doc, dtokens)):
+            token.tag_ = dtoken.tag
+            if next_pos:  # already identified in previous iteration
                token.pos = next_pos
                next_pos = None
            else:
                token.pos, next_pos = resolve_pos(
                    token.orth_,
-                    unidic_tag,
-                    unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None,
+                    dtoken.tag,
+                    tags[idx + 1] if idx + 1 < len(tags) else None,
                )
-
            # if there's no lemma info (it's an unk) just use the surface
-            token.lemma_ = lemma
-        doc.user_data["unidic_tags"] = unidic_tags
+            token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface
+
+        doc.user_data["inflections"] = inflections
+        doc.user_data["reading_forms"] = readings
+        doc.user_data["sub_tokens"] = sub_tokens_list

        return doc

+    def _get_dtokens(self, sudachipy_tokens, need_sub_tokens=True):
+        sub_tokens_list = (
+            self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None
+        )
+        dtokens = [
+            DetailedToken(
+                token.surface(),  # orth
+                "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"]),  # tag
+                ",".join([xx for xx in token.part_of_speech()[4:] if xx != "*"]),  # inf
+                token.dictionary_form(),  # lemma
+                token.reading_form(),  # user_data['reading_forms']
+                sub_tokens_list[idx]
+                if sub_tokens_list
+                else None,  # user_data['sub_tokens']
+            )
+            for idx, token in enumerate(sudachipy_tokens)
+            if len(token.surface()) > 0
+            # remove empty tokens, produced for characters like … that Sudachi normalizes internally
+        ]
+        # Sudachi also outputs each space char as a token. Drop repeated space tokens here so
+        # that get_dtokens_and_spaces() can merge each run of spaces into a single gap token
+        return [
+            t
+            for idx, t in enumerate(dtokens)
+            if idx == 0
+            or not t.surface.isspace()
+            or t.tag != "空白"
+            or not dtokens[idx - 1].surface.isspace()
+            or dtokens[idx - 1].tag != "空白"
+        ]
+
+    def _get_sub_tokens(self, sudachipy_tokens):
+        if (
+            self.split_mode is None or self.split_mode == "A"
+        ):  # do nothing for default split mode
+            return None
+
+        sub_tokens_list = []  # list of (list of list of DetailedToken | None)
+        for token in sudachipy_tokens:
+            sub_a = token.split(self.tokenizer.SplitMode.A)
+            if len(sub_a) == 1:  # no sub tokens
+                sub_tokens_list.append(None)
+            elif self.split_mode == "B":
+                sub_tokens_list.append([self._get_dtokens(sub_a, False)])
+            else:  # "C"
+                sub_b = token.split(self.tokenizer.SplitMode.B)
+                if len(sub_a) == len(sub_b):
+                    dtokens = self._get_dtokens(sub_a, False)
+                    sub_tokens_list.append([dtokens, dtokens])
+                else:
+                    sub_tokens_list.append(
+                        [
+                            self._get_dtokens(sub_a, False),
+                            self._get_dtokens(sub_b, False),
+                        ]
+                    )
+        return sub_tokens_list
+
    def _get_config(self):
        config = OrderedDict((("split_mode", self.split_mode),))
        return config
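
The alignment performed by `get_dtokens_and_spaces()` above can be sketched without SudachiPy or spaCy. This is a simplified illustration under stated assumptions: the function name `align_words_and_spaces` is hypothetical, tokens are plain `(surface, tag)` pairs instead of `DetailedToken` tuples, and content words get `None` as a placeholder tag.

```python
def align_words_and_spaces(words, text, gap_tag="空白"):
    """Align tokenizer output with the raw text.

    Returns (tokens, spaces): tokens is a list of (surface, tag) pairs and
    spaces[i] says whether token i is followed by a single space.
    """
    # The non-whitespace content of the words and the text must agree.
    if "".join("".join(words).split()) != "".join(text.split()):
        raise ValueError("words do not match text")

    tokens, spaces = [], []
    pos = 0
    if not words:
        return tokens, spaces
    if all(w.isspace() for w in words):
        # whitespace-only text becomes a single gap token
        return [(text, gap_tag)], [False]

    for word in words:
        if word.isspace():
            continue  # space tokens are rebuilt from the text below
        start = text.index(word, pos) - pos
        if start > 0:
            # extra whitespace before the word becomes a gap token
            tokens.append((text[pos : pos + start], gap_tag))
            spaces.append(False)
            pos += start
        tokens.append((word, None))
        spaces.append(False)
        pos += len(word)
        # one trailing space is stored as the token's space flag
        if pos < len(text) and text[pos] == " ":
            spaces[-1] = True
            pos += 1

    if pos < len(text):
        # whatever remains is trailing whitespace
        tokens.append((text[pos:], gap_tag))
        spaces.append(False)
    return tokens, spaces


tokens, spaces = align_words_and_spaces(["a", "b"], "a  b")
print(tokens, spaces)
```

The first space after a word is absorbed into that word's space flag, and only the surplus whitespace becomes a gap token, so concatenating each surface plus its optional space reproduces the original text.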
--- a/spacy/lang/ja/bunsetu.py
+++ b/spacy/lang/ja/bunsetu.py
@ -1,176 +0,0 @@
-POS_PHRASE_MAP = {
-    "NOUN": "NP",
-    "NUM": "NP",
-    "PRON": "NP",
-    "PROPN": "NP",
-    "VERB": "VP",
-    "ADJ": "ADJP",
-    "ADV": "ADVP",
-    "CCONJ": "CCONJP",
-}
-
-
-# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
-def yield_bunsetu(doc, debug=False):
-    bunsetu = []
-    bunsetu_may_end = False
-    phrase_type = None
-    phrase = None
-    prev = None
-    prev_tag = None
-    prev_dep = None
-    prev_head = None
-    for t in doc:
-        pos = t.pos_
-        pos_type = POS_PHRASE_MAP.get(pos, None)
-        tag = t.tag_
-        dep = t.dep_
-        head = t.head.i
-        if debug:
-            print(
-                t.i,
-                t.orth_,
-                pos,
-                pos_type,
-                dep,
-                head,
-                bunsetu_may_end,
-                phrase_type,
-                phrase,
-                bunsetu,
-            )
-
-        # DET is always an individual bunsetu
-        if pos == "DET":
-            if bunsetu:
-                yield bunsetu, phrase_type, phrase
-            yield [t], None, None
-            bunsetu = []
-            bunsetu_may_end = False
-            phrase_type = None
-            phrase = None
-
-        # PRON or Open PUNCT always splits bunsetu
-        elif tag == "補助記号-括弧開":
-            if bunsetu:
-                yield bunsetu, phrase_type, phrase
-            bunsetu = [t]
-            bunsetu_may_end = True
-            phrase_type = None
-            phrase = None
-
-        # bunsetu head not appeared
-        elif phrase_type is None:
-            if bunsetu and prev_tag == "補助記号-読点":
-                yield bunsetu, phrase_type, phrase
-                bunsetu = []
-                bunsetu_may_end = False
-                phrase_type = None
-                phrase = None
-            bunsetu.append(t)
-            if pos_type:  # begin phrase
-                phrase = [t]
-                phrase_type = pos_type
-                if pos_type in {"ADVP", "CCONJP"}:
-                    bunsetu_may_end = True
-
-        # entering new bunsetu
-        elif pos_type and (
-            pos_type != phrase_type
-            or bunsetu_may_end  # different phrase type arises  # same phrase type but bunsetu already ended
-        ):
-            # exceptional case: NOUN to VERB
-            if (
-                phrase_type == "NP"
-                and pos_type == "VP"
-                and prev_dep == "compound"
-                and prev_head == t.i
-            ):
-                bunsetu.append(t)
-                phrase_type = "VP"
-                phrase.append(t)
-            # exceptional case: VERB to NOUN
-            elif (
-                phrase_type == "VP"
-                and pos_type == "NP"
-                and (
-                    prev_dep == "compound"
-                    and prev_head == t.i
-                    or dep == "compound"
-                    and prev == head
-                    or prev_dep == "nmod"
-                    and prev_head == t.i
-                )
-            ):
-                bunsetu.append(t)
-                phrase_type = "NP"
-                phrase.append(t)
-            else:
-                yield bunsetu, phrase_type, phrase
-                bunsetu = [t]
-                bunsetu_may_end = False
-                phrase_type = pos_type
-                phrase = [t]
-
-        # NOUN bunsetu
-        elif phrase_type == "NP":
-            bunsetu.append(t)
-            if not bunsetu_may_end and (
-                (
-                    (pos_type == "NP" or pos == "SYM")
-                    and (prev_head == t.i or prev_head == head)
-                    and prev_dep in {"compound", "nummod"}
-                )
-                or (
-                    pos == "PART"
-                    and (prev == head or prev_head == head)
-                    and dep == "mark"
-                )
-            ):
-                phrase.append(t)
-            else:
-                bunsetu_may_end = True
-
-        # VERB bunsetu
-        elif phrase_type == "VP":
-            bunsetu.append(t)
-            if (
-                not bunsetu_may_end
-                and pos == "VERB"
-                and prev_head == t.i
-                and prev_dep == "compound"
-            ):
-                phrase.append(t)
-            else:
-                bunsetu_may_end = True
-
-        # ADJ bunsetu
-        elif phrase_type == "ADJP" and tag != "連体詞":
-            bunsetu.append(t)
-            if not bunsetu_may_end and (
-                (
-                    pos == "NOUN"
-                    and (prev_head == t.i or prev_head == head)
-                    and prev_dep in {"amod", "compound"}
-                )
-                or (
-                    pos == "PART"
-                    and (prev == head or prev_head == head)
-                    and dep == "mark"
-                )
-            ):
-                phrase.append(t)
-            else:
-                bunsetu_may_end = True
-
-        # other bunsetu
-        else:
-            bunsetu.append(t)
-
-        prev = t.i
-        prev_tag = t.tag_
-        prev_dep = t.dep_
-        prev_head = head
-
-    if bunsetu:
-        yield bunsetu, phrase_type, phrase
--- a/spacy/lang/ko/__init__.py
+++ b/spacy/lang/ko/__init__.py
@ -39,7 +39,11 @@ def check_spaces(text, tokens):
 class KoreanTokenizer(DummyTokenizer):
    def __init__(self, cls, nlp=None):
        self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-        self.Tokenizer = try_mecab_import()
+        MeCab = try_mecab_import()
+        self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
+
+    def __del__(self):
+        self.mecab_tokenizer.__del__()

    def __call__(self, text):
        dtokens = list(self.detailed_tokens(text))
@ -55,17 +59,16 @@ class KoreanTokenizer(DummyTokenizer):
    def detailed_tokens(self, text):
        # POS tag[0], semantic class[1], jongseong (final consonant) presence[2], reading[3],
        # type[4], start POS[5], end POS[6], expression[7], *
-        with self.Tokenizer("-F%f[0],%f[7]") as tokenizer:
-            for node in tokenizer.parse(text, as_nodes=True):
-                if node.is_eos():
-                    break
-                surface = node.surface
-                feature = node.feature
-                tag, _, expr = feature.partition(",")
-                lemma, _, remainder = expr.partition("/")
-                if lemma == "*":
-                    lemma = surface
-                yield {"surface": surface, "lemma": lemma, "tag": tag}
+        for node in self.mecab_tokenizer.parse(text, as_nodes=True):
+            if node.is_eos():
+                break
+            surface = node.surface
+            feature = node.feature
+            tag, _, expr = feature.partition(",")
+            lemma, _, remainder = expr.partition("/")
+            if lemma == "*":
+                lemma = surface
+            yield {"surface": surface, "lemma": lemma, "tag": tag}


 class KoreanDefaults(Language.Defaults):
--- a/spacy/lang/ne/__init__.py
+++ b/spacy/lang/ne/__init__.py
@ -0,0 +1,23 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
+
+from ...language import Language
+from ...attrs import LANG
+
+
+class NepaliDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
+    lex_attr_getters[LANG] = lambda text: "ne"  # Nepali language ISO code
+    stop_words = STOP_WORDS
+
+
+class Nepali(Language):
+    lang = "ne"
+    Defaults = NepaliDefaults
+
+
+__all__ = ["Nepali"]
--- a/spacy/lang/ne/examples.py
+++ b/spacy/lang/ne/examples.py
@ -0,0 +1,22 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.ne.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+    "एप्पलले अमेरिकी स्टार्टअप १ अर्ब डलरमा किन्ने सोच्दै छ",
+    "स्वायत्त कारहरूले बीमा दायित्व निर्माताहरु तिर बदल्छन्",
+    "स्यान फ्रांसिस्कोले फुटपाथ वितरण रोबोटहरु प्रतिबंध गर्ने विचार गर्दै छ",
+    "लन्डन यूनाइटेड किंगडमको एक ठूलो शहर हो।",
+    "तिमी कहाँ छौ?",
+    "फ्रान्स को राष्ट्रपति को हो?",
+    "संयुक्त राज्यको राजधानी के हो?",
+    "बराक ओबामा कहिले कहिले जन्मेका हुन्?",
+]
--- a/spacy/lang/ne/lex_attrs.py
+++ b/spacy/lang/ne/lex_attrs.py
@ -0,0 +1,98 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ..norm_exceptions import BASE_NORMS
+from ...attrs import NORM, LIKE_NUM
+
+
+# fmt: off
+_stem_suffixes = [
+    ["ा", "ि", "ी", "ु", "ू", "ृ", "े", "ै", "ो", "ौ"],
+    ["ँ", "ं", "्", "ः"],
+    ["लाई", "ले", "बाट", "को", "मा", "हरू"],
+    ["हरूलाई", "हरूले", "हरूबाट", "हरूको", "हरूमा"],
+    ["इलो", "िलो", "नु", "ाउनु", "ई", "इन", "इन्", "इनन्"],
+    ["एँ", "इँन्", "इस्", "इनस्", "यो", "एन", "यौं", "एनौं", "ए", "एनन्"],
+    ["छु", "छौँ", "छस्", "छौ", "छ", "छन्", "छेस्", "छे", "छ्यौ", "छिन्", "हुन्छ"],
+    ["दै", "दिन", "दिँन", "दैनस्", "दैन", "दैनौँ", "दैनौं", "दैनन्"],
+    ["हुन्न", "न्न", "न्न्स्", "न्नौं", "न्नौ", "न्न्न्", "िई"],
+    ["अ", "ओ", "ऊ", "अरी", "साथ", "वित्तिकै", "पूर्वक"],
+    ["याइ", "ाइ", "बार", "वार", "चाँहि"],
+    ["ने", "ेको", "ेकी", "ेका", "ेर", "दै", "तै", "िकन", "उ", "न", "नन्"]
+]
+# fmt: on
+
+# reference 1: https://en.wikipedia.org/wiki/Numbers_in_Nepali_language
+# reference 2: https://www.imnepal.com/nepali-numbers/
+_num_words = [
+    "शुन्य",
+    "एक",
+    "दुई",
+    "तीन",
+    "चार",
+    "पाँच",
+    "छ",
+    "सात",
+    "आठ",
+    "नौ",
+    "दश",
+    "एघार",
+    "बाह्र",
+    "तेह्र",
+    "चौध",
+    "पन्ध्र",
+    "सोह्र",
+    "सोह्र",
+    "सत्र",
+    "अठार",
+    "उन्नाइस",
+    "बीस",
+    "तीस",
+    "चालीस",
+    "पचास",
+    "साठी",
+    "सत्तरी",
+    "असी",
+    "नब्बे",
+    "सय",
+    "हजार",
+    "लाख",
+    "करोड",
+    "अर्ब",
+    "खर्ब",
+]
+
+
+def norm(string):
+    # normalise base exceptions, e.g. punctuation or currency symbols
+    if string in BASE_NORMS:
+        return BASE_NORMS[string]
+    # set stem word as norm, if available, adapted from:
+    # https://github.com/explosion/spaCy/blob/master/spacy/lang/hi/lex_attrs.py
+    # https://www.researchgate.net/publication/237261579_Structure_of_Nepali_Grammar
+    for suffix_group in reversed(_stem_suffixes):
+        length = len(suffix_group[0])
+        if len(string) <= length:
+            break
+        for suffix in suffix_group:
+            if string.endswith(suffix):
+                return string[: -len(suffix)]
+    return string
+
+
+def like_num(text):
+    if text.startswith(("+", "-", "±", "~")):
+        text = text[1:]
+    text = text.replace(",", "").replace(".", "")
+    if text.isdigit():
+        return True
+    if text.count("/") == 1:
+        num, denom = text.split("/")
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text.lower() in _num_words:
+        return True
+    return False
+
+
+LEX_ATTRS = {NORM: norm, LIKE_NUM: like_num}
--- a/spacy/lang/ne/stop_words.py
+++ b/spacy/lang/ne/stop_words.py
@ -0,0 +1,498 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+# Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt
+
+STOP_WORDS = set(
+    """
+अक्सर
+अगाडि
+अगाडी
+अघि
+अझै
+अठार
+अथवा
+अनि
+अनुसार
+अन्तर्गत
+अन्य
+अन्यत्र
+अन्यथा
+अब
+अरु
+अरुलाई
+अरू
+अर्को
+अर्थात
+अर्थात्
+अलग
+अलि
+अवस्था
+अहिले
+आए
+आएका
+आएको
+आज
+आजको
+आठ
+आत्म
+आदि
+आदिलाई
+आफनो
+आफू
+आफूलाई
+आफै
+आफैँ
+आफ्नै
+आफ्नो
+आयो
+उ
+उक्त
+उदाहरण
+उनको
+उनलाई
+उनले
+उनि
+उनी
+उनीहरुको
+उन्नाइस
+उप
+उसको
+उसलाई
+उसले
+उहालाई
+ऊ
+एउटा
+एउटै
+एक
+एकदम
+एघार
+ओठ
+औ
+औं
+कता
+कति
+कतै
+कम
+कमसेकम
+कसरि
+कसरी
+कसै
+कसैको
+कसैलाई
+कसैले
+कसैसँग
+कस्तो
+कहाँबाट
+कहिलेकाहीं
+का
+काम
+कारण
+कि
+किन
+किनभने
+कुन
+कुनै
+कुन्नी
+कुरा
+कृपया
+के
+केहि
+केही
+को
+कोहि
+कोहिपनि
+कोही
+कोहीपनि
+क्रमशः
+गए
+गएको
+गएर
+गयौ
+गरि
+गरी
+गरे
+गरेका
+गरेको
+गरेर
+गरौं
+गर्छ
+गर्छन्
+गर्छु
+गर्दा
+गर्दै
+गर्न
+गर्नु
+गर्नुपर्छ
+गर्ने
+गैर
+घर
+चार
+चाले
+चाहनुहुन्छ
+चाहन्छु
+चाहिं
+चाहिए
+चाहिंले
+चाहीं
+चाहेको
+चाहेर
+चोटी
+चौथो
+चौध
+छ
+छन
+छन्
+छु
+छू
+छैन
+छैनन्
+छौ
+छौं
+जता
+जताततै
+जना
+जनाको
+जनालाई
+जनाले
+जब
+जबकि
+जबकी
+जसको
+जसबाट
+जसमा
+जसरी
+जसलाई
+जसले
+जस्ता
+जस्तै
+जस्तो
+जस्तोसुकै
+जहाँ
+जान
+जाने
+जाहिर
+जुन
+जुनै
+जे
+जो
+जोपनि
+जोपनी
+झैं
+ठाउँमा
+ठीक
+ठूलो
+त
+तता
+तत्काल
+तथा
+तथापि
+तथापी
+तदनुसार
+तपाइ
+तपाई
+तपाईको
+तब
+तर
+तर्फ
+तल
+तसरी
+तापनि
+तापनी
+तिन
+तिनि
+तिनिहरुलाई
+तिनी
+तिनीहरु
+तिनीहरुको
+तिनीहरू
+तिनीहरूको
+तिनै
+तिमी
+तिर
+तिरको
+ती
+तीन
+तुरन्त
+तुरुन्त
+तुरुन्तै
+तेश्रो
+तेस्कारण
+तेस्रो
+तेह्र
+तैपनि
+तैपनी
+त्यत्तिकै
+त्यत्तिकैमा
+त्यस
+त्यसकारण
+त्यसको
+त्यसले
+त्यसैले
+त्यसो
+त्यस्तै
+त्यस्तो
+त्यहाँ
+त्यहिँ
+त्यही
+त्यहीँ
+त्यहीं
+त्यो
+त्सपछि
+त्सैले
+थप
+थरि
+थरी
+थाहा
+थिए
+थिएँ
+थिएन
+थियो
+दर्ता
+दश
+दिए
+दिएको
+दिन
+दिनुभएको
+दिनुहुन्छ
+दुइ
+दुइवटा
+दुई
+देखि
+देखिन्छ
+देखियो
+देखे
+देखेको
+देखेर
+दोश्री
+दोश्रो
+दोस्रो
+द्वारा
+धन्न
+धेरै
+धौ
+न
+नगर्नु
+नगर्नू
+नजिकै
+नत्र
+नत्रभने
+नभई
+नभएको
+नभनेर
+नयाँ
+नि
+निकै
+निम्ति
+निम्न
+निम्नानुसार
+निर्दिष्ट
+नै
+नौ
+पक्का
+पक्कै
+पछाडि
+पछाडी
+पछि
+पछिल्लो
+पछी
+पटक
+पनि
+पन्ध्र
+पर्छ
+पर्थ्यो
+पर्दैन
+पर्ने
+पर्नेमा
+पर्याप्त
+पहिले
+पहिलो
+पहिल्यै
+पाँच
+पांच
+पाचौँ
+पाँचौं
+पिच्छे
+पूर्व
+पो
+प्रति
+प्रतेक
+प्रत्यक
+प्राय
+प्लस
+फरक
+फेरि
+फेरी
+बढी
+बताए
+बने
+बरु
+बाट
+बारे
+बाहिर
+बाहेक
+बाह्र
+बिच
+बिचमा
+बिरुद्ध
+बिशेष
+बिस
+बीच
+बीचमा
+बीस
+भए
+भएँ
+भएका
+भएकालाई
+भएको
+भएन
+भएर
+भन
+भने
+भनेको
+भनेर
+भन्
+भन्छन्
+भन्छु
+भन्दा
+भन्दै
+भन्नुभयो
+भन्ने
+भन्या
+भयेन
+भयो
+भर
+भरि
+भरी
+भा
+भित्र
+भित्री
+भीत्र
+म
+मध्य
+मध्ये
+मलाई
+मा
+मात्र
+मात्रै
+माथि
+माथी
+मुख्य
+मुनि
+मुन्तिर
+मेरो
+मैले
+यति
+यथोचित
+यदि
+यद्ध्यपि
+यद्यपि
+यस
+यसका
+यसको
+यसपछि
+यसबाहेक
+यसमा
+यसरी
+यसले
+यसो
+यस्तै
+यस्तो
+यहाँ
+यहाँसम्म
+यही
+या
+यी
+यो
+र
+रही
+रहेका
+रहेको
+रहेछ
+राखे
+राख्छ
+राम्रो
+रुपमा
+रूप
+रे
+लगभग
+लगायत
+लाई
+लाख
+लागि
+लागेको
+ले
+वटा
+वरीपरी
+वा
+वाट
+वापत
+वास्तवमा
+शायद
+सक्छ
+सक्ने
+सँग
+संग
+सँगको
+सँगसँगै
+सँगै
+संगै
+सङ्ग
+सङ्गको
+सट्टा
+सत्र
+सधै
+सबै
+सबैको
+सबैलाई
+समय
+समेत
+सम्भव
+सम्म
+सय
+सरह
+सहित
+सहितै
+सही
+साँच्चै
+सात
+साथ
+साथै
+सायद
+सारा
+सुनेको
+सुनेर
+सुरु
+सुरुको
+सुरुमै
+सो
+सोचेको
+सोचेर
+सोही
+सोह्र
+स्थित
+स्पष्ट
+हजार
+हरे
+हरेक
+हामी
+हामीले
+हाम्रा
+हाम्रो
+हुँदैन
+हुन
+हुनत
+हुनु
+हुने
+हुनेछ
+हुन्
+हुन्छ
+हुन्थ्यो
+हैन
+हो
+होइन
+होकि
+होला
+""".split()
+)
--- a/spacy/lang/zh/__init__.py
+++ b/spacy/lang/zh/__init__.py
@ -14,7 +14,7 @@ from .stop_words import STOP_WORDS
 from ... import util


-_PKUSEG_INSTALL_MSG = "install it with `pip install pkuseg==0.0.22` or from https://github.com/lancopku/pkuseg-python"
+_PKUSEG_INSTALL_MSG = "install it with `pip install pkuseg==0.0.25` or from https://github.com/lancopku/pkuseg-python"


 def try_jieba_import(segmenter):
--- a/spacy/language.py
+++ b/spacy/language.py
@ -32,6 +32,7 @@ from .lang.tag_map import TAG_MAP
 from .tokens import Doc
 from .lang.lex_attrs import LEX_ATTRS, is_stop
 from .errors import Errors, Warnings
+from .git_info import GIT_VERSION
 from . import util
 from . import about

@ -44,7 +45,7 @@ class BaseDefaults:
    def create_lemmatizer(cls, nlp=None, lookups=None):
        if lookups is None:
            lookups = cls.create_lookups(nlp=nlp)
-        return Lemmatizer(lookups=lookups)
+        return Lemmatizer(lookups=lookups, is_base_form=cls.is_base_form)

    @classmethod
    def create_lookups(cls, nlp=None):
@ -116,6 +117,7 @@ class BaseDefaults:
    tokenizer_exceptions = {}
    stop_words = set()
    morph_rules = {}
+    is_base_form = None
    lex_attr_getters = LEX_ATTRS
    syntax_iterators = {}
    resources = {}
@ -212,6 +214,7 @@ class Language:
        self._meta.setdefault("email", "")
        self._meta.setdefault("url", "")
        self._meta.setdefault("license", "")
+        self._meta.setdefault("spacy_git_version", GIT_VERSION)
        self._meta["vectors"] = {
            "width": self.vocab.vectors_length,
            "vectors": len(self.vocab.vectors),
--- a/spacy/lemmatizer.py
+++ b/spacy/lemmatizer.py
@ -14,7 +14,7 @@ class Lemmatizer:
    def load(cls, *args, **kwargs):
        raise NotImplementedError(Errors.E172)

-    def __init__(self, lookups):
+    def __init__(self, lookups, is_base_form=None):
        """Initialize a Lemmatizer.

        lookups (Lookups): The lookups object containing the (optional) tables
@ -22,6 +22,7 @@ class Lemmatizer:
        RETURNS (Lemmatizer): The newly constructed object.
        """
        self.lookups = lookups
+        self.is_base_form = is_base_form

    def __call__(self, string, univ_pos, morphology=None):
        """Lemmatize a string.
@ -42,7 +43,7 @@ class Lemmatizer:
        if univ_pos in ("", "eol", "space"):
            return [string.lower()]
        # See Issue #435 for example of where this logic is required.
-        if self.is_base_form(univ_pos, morphology):
+        if callable(self.is_base_form) and self.is_base_form(univ_pos, morphology):
            return [string.lower()]
        index_table = self.lookups.get_table("lemma_index", {})
        exc_table = self.lookups.get_table("lemma_exc", {})
--- a/spacy/lexeme.pyx
+++ b/spacy/lexeme.pyx
@ -346,7 +346,7 @@ cdef class Lexeme:
    @property
    def is_oov(self):
        """RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
-        return self.orth in self.vocab.vectors
+        return self.orth not in self.vocab.vectors

    property is_stop:
        """RETURNS (bool): Whether the lexeme is a stop word."""
--- a/spacy/lookups.py
+++ b/spacy/lookups.py
@ -117,8 +117,7 @@ class Lookups:
        """
        self._tables = {}
        for key, value in srsly.msgpack_loads(bytes_data).items():
-            self._tables[key] = Table(key)
-            self._tables[key].update(value)
+            self._tables[key] = Table(key, value)
        return self

    def to_disk(self, path, filename="lookups.bin", **kwargs):
@ -189,7 +188,7 @@ class Table(OrderedDict):
        self.name = name
        # Assume a default size of 1M items
        self.default_size = 1e6
-        size = len(data) if data and len(data) > 0 else self.default_size
+        size = max(len(data), 1) if data is not None else self.default_size
        self.bloom = BloomFilter.from_error_rate(size)
        if data:
            self.update(data)
--- a/spacy/pipeline/pipes.pyx
+++ b/spacy/pipeline/pipes.pyx
@ -781,6 +781,20 @@ class ClozeMultitask(Pipe):
        if losses is not None:
            losses[self.name] += loss

+    @staticmethod
+    def decode_utf8_predictions(char_array):
+        # The format alternates filling from start and end, and 255 is missing
+        words = []
+        char_array = char_array.reshape((char_array.shape[0], -1, 256))
+        nr_char = char_array.shape[1]
+        char_array = char_array.argmax(axis=-1)
+        for row in char_array:
+            starts = [chr(c) for c in row[::2] if c != 255]
+            ends = [chr(c) for c in row[1::2] if c != 255]
+            word = "".join(starts + list(reversed(ends)))
+            words.append(word)
+        return words
+

 @component("textcat", assigns=["doc.cats"], default_model=default_textcat)
 class TextCategorizer(Pipe):
@ -949,6 +963,7 @@ cdef class DependencyParser(Parser):
    assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
    requires = []
    TransitionSystem = ArcEager
+    nr_feature = 8

    @property
    def postprocesses(self):
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@ -167,6 +167,11 @@ def nb_tokenizer():
    return get_lang_class("nb").Defaults.create_tokenizer()


+@pytest.fixture(scope="session")
+def ne_tokenizer():
+    return get_lang_class("ne").Defaults.create_tokenizer()
+
+
 @pytest.fixture(scope="session")
 def nl_tokenizer():
    return get_lang_class("nl").Defaults.create_tokenizer()
--- a/spacy/tests/doc/test_doc_api.py
+++ b/spacy/tests/doc/test_doc_api.py
@ -102,10 +102,16 @@ def test_doc_api_getitem(en_tokenizer):
 )
 def test_doc_api_serialize(en_tokenizer, text):
    tokens = en_tokenizer(text)
+    tokens[0].lemma_ = "lemma"
+    tokens[0].norm_ = "norm"
+    tokens[0].ent_kb_id_ = "ent_kb_id"
    new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
    assert tokens.text == new_tokens.text
    assert [t.text for t in tokens] == [t.text for t in new_tokens]
    assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
+    assert new_tokens[0].lemma_ == "lemma"
+    assert new_tokens[0].norm_ == "norm"
+    assert new_tokens[0].ent_kb_id_ == "ent_kb_id"

    new_tokens = Doc(tokens.vocab).from_bytes(
        tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
--- a/spacy/tests/lang/ja/test_tokenizer.py
+++ b/spacy/tests/lang/ja/test_tokenizer.py
@ -1,7 +1,7 @@
 import pytest

 from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS
-from spacy.lang.ja import Japanese
+from spacy.lang.ja import Japanese, DetailedToken

 # fmt: off
 TOKENIZER_TESTS = [
@ -93,6 +93,57 @@ def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c):
    assert len(nlp_c(text)) == len_c


+@pytest.mark.parametrize("text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c",
+    [
+        (
+            "選挙管理委員会",
+            [None, None, None, None],
+            [None, None, [
+                [
+                    DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
+                    DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
+                ]
+            ]],
+            [[
+                [
+                    DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
+                    DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
+                    DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
+                    DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
+                ], [
+                    DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
+                    DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
+                    DetailedToken(surface='委員会', tag='名詞-普通名詞-一般', inf='', lemma='委員会', reading='イインカイ', sub_tokens=None),
+                ]
+            ]]
+        ),
+    ]
+)
+def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c):
+    nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}})
+    nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
+    nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
+
+    assert ja_tokenizer(text).user_data["sub_tokens"] == sub_tokens_list_a
+    assert nlp_a(text).user_data["sub_tokens"] == sub_tokens_list_a
+    assert nlp_b(text).user_data["sub_tokens"] == sub_tokens_list_b
+    assert nlp_c(text).user_data["sub_tokens"] == sub_tokens_list_c
+
+
+@pytest.mark.parametrize("text,inflections,reading_forms",
+    [
+        (
+            "取ってつけた",
+            ("五段-ラ行,連用形-促音便", "", "下一段-カ行,連用形-一般", "助動詞-タ,終止形-一般"),
+            ("トッ", "テ", "ツケ", "タ"),
+        ),
+    ]
+)
+def test_ja_tokenizer_inflections_reading_forms(ja_tokenizer, text, inflections, reading_forms):
+    assert ja_tokenizer(text).user_data["inflections"] == inflections
+    assert ja_tokenizer(text).user_data["reading_forms"] == reading_forms
+
+
 def test_ja_tokenizer_emptyish_texts(ja_tokenizer):
    doc = ja_tokenizer("")
    assert len(doc) == 0
--- a/spacy/tests/lang/ne/__init__.py
+++ b/spacy/tests/lang/ne/__init__.py
--- a/spacy/tests/lang/ne/test_text.py
+++ b/spacy/tests/lang/ne/test_text.py
@ -0,0 +1,19 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_ne_tokenizer_handles_long_text(ne_tokenizer):
+    text = """मैले पाएको सर्टिफिकेटलाई म त बोक्रो सम्झन्छु र अभ्यास तब सुरु भयो, जब मैले कलेज पार गरेँ र जीवनको पढाइ सुरु गरेँ ।"""
+    tokens = ne_tokenizer(text)
+    assert len(tokens) == 24
+
+
+@pytest.mark.parametrize(
+    "text,length",
+    [("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)],
+)
+def test_ne_tokenizer_handles_cnts(ne_tokenizer, text, length):
+    tokens = ne_tokenizer(text)
+    assert len(tokens) == length
--- a/spacy/tests/pipeline/test_tagger.py
+++ b/spacy/tests/pipeline/test_tagger.py
@ -4,7 +4,9 @@ from spacy import util
 from spacy.gold import Example
 from spacy.lang.en import English
 from spacy.language import Language
-from spacy.tests.util import make_tempdir
+from spacy.symbols import POS, NOUN
+
+from ..util import make_tempdir


 def test_label_types():
@ -15,6 +17,19 @@ def test_label_types():
        nlp.get_pipe("tagger").add_label(9)


+def test_tagger_begin_training_tag_map():
+    """Test that Tagger.begin_training() without gold tuples does not clobber
+    the tag map."""
+    nlp = Language()
+    tagger = nlp.create_pipe("tagger")
+    orig_tag_count = len(tagger.labels)
+    tagger.add_label("A", {"POS": "NOUN"})
+    nlp.add_pipe(tagger)
+    nlp.begin_training()
+    assert nlp.vocab.morphology.tag_map["A"] == {POS: NOUN}
+    assert orig_tag_count + 1 == len(nlp.get_pipe("tagger").labels)
+
+
 TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}

 MORPH_RULES = {"V": {"like": {"lemma": "luck"}}}
--- a/spacy/tests/regression/test_issue1-1000.py
+++ b/spacy/tests/regression/test_issue1-1000.py
@ -11,6 +11,7 @@ from spacy.lang.en import English
 from spacy.lemmatizer import Lemmatizer
 from spacy.lookups import Lookups
 from spacy.tokens import Doc, Span
+from spacy.lang.en import EnglishDefaults

 from ..util import get_doc, make_tempdir

@ -164,7 +165,7 @@ def test_issue595():
    lookups.add_table("lemma_rules", {"verb": [["ed", "e"]]})
    lookups.add_table("lemma_index", {"verb": {}})
    lookups.add_table("lemma_exc", {"verb": {}})
-    lemmatizer = Lemmatizer(lookups)
+    lemmatizer = Lemmatizer(lookups, is_base_form=EnglishDefaults.is_base_form)
    vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
    doc = Doc(vocab, words=words)
    doc[2].tag_ = "VB"
--- a/spacy/tests/regression/test_issue2501-3000.py
+++ b/spacy/tests/regression/test_issue2501-3000.py
@ -57,7 +57,7 @@ def test_issue2626_2835(en_tokenizer, text):


 def test_issue2656(en_tokenizer):
-    """Test that tokenizer correctly splits of punctuation after numbers with
+    """Test that tokenizer correctly splits off punctuation after numbers with
    decimal points.
    """
    doc = en_tokenizer("I went for 40.3, and got home by 10.0.")
--- a/spacy/tests/test_lemmatizer.py
+++ b/spacy/tests/test_lemmatizer.py
@ -2,6 +2,7 @@ import pytest
 from spacy.tokens import Doc
 from spacy.language import Language
 from spacy.lookups import Lookups
+from spacy.lemmatizer import Lemmatizer


 def test_lemmatizer_reflects_lookups_changes():
@ -46,3 +47,14 @@ def test_tagger_warns_no_lookups():
    with pytest.warns(None) as record:
        nlp.begin_training()
        assert not record.list
+
+
+def test_lemmatizer_without_is_base_form_implementation():
+    # Norwegian example from #5658
+    lookups = Lookups()
+    lookups.add_table("lemma_rules", {"noun": []})
+    lookups.add_table("lemma_index", {"noun": {}})
+    lookups.add_table("lemma_exc", {"noun": {"formuesskatten": ["formuesskatt"]}})
+
+    lemmatizer = Lemmatizer(lookups, is_base_form=None)
+    assert lemmatizer("Formuesskatten", "noun", {'Definite': 'def', 'Gender': 'masc', 'Number': 'sing'}) == ["formuesskatt"]
--- a/spacy/tests/vocab_vectors/test_vectors.py
+++ b/spacy/tests/vocab_vectors/test_vectors.py
@ -370,6 +370,6 @@ def test_vector_is_oov():
    data[1] = 2.0
    vocab.set_vector("cat", data[0])
    vocab.set_vector("dog", data[1])
-    assert vocab["cat"].is_oov is True
-    assert vocab["dog"].is_oov is True
-    assert vocab["hamster"].is_oov is False
+    assert vocab["cat"].is_oov is False
+    assert vocab["dog"].is_oov is False
+    assert vocab["hamster"].is_oov is True
--- a/spacy/tokens/doc.pyx
+++ b/spacy/tokens/doc.pyx
@ -1062,7 +1062,7 @@ cdef class Doc:

        DOCS: https://spacy.io/api/doc#to_bytes
        """
-        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM]  # TODO: ENT_KB_ID ?
+        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM, ENT_KB_ID]
        if self.is_tagged:
            array_head.extend([TAG, POS])
        # If doc parsed add head and dep attribute
@ -1071,6 +1071,14 @@ cdef class Doc:
        # Otherwise add sent_start
        else:
            array_head.append(SENT_START)
+        strings = set()
+        for token in self:
+            strings.add(token.tag_)
+            strings.add(token.lemma_)
+            strings.add(token.dep_)
+            strings.add(token.ent_type_)
+            strings.add(token.ent_kb_id_)
+            strings.add(token.norm_)
        # Msgpack doesn't distinguish between lists and tuples, which is
        # vexing for user data. As a best guess, we *know* that within
        # keys, we must have tuples. In values we just have to hope
@ -1082,6 +1090,7 @@ cdef class Doc:
            "sentiment": lambda: self.sentiment,
            "tensor": lambda: self.tensor,
            "cats": lambda: self.cats,
+            "strings": lambda: list(strings),
            "has_unknown_spaces": lambda: self.has_unknown_spaces
        }
        if "user_data" not in exclude and self.user_data:
@ -1110,6 +1119,7 @@ cdef class Doc:
            "sentiment": lambda b: None,
            "tensor": lambda b: None,
            "cats": lambda b: None,
+            "strings": lambda b: None,
            "user_data_keys": lambda b: None,
            "user_data_values": lambda b: None,
            "has_unknown_spaces": lambda b: None
@ -1130,6 +1140,9 @@ cdef class Doc:
            self.tensor = msg["tensor"]
        if "cats" not in exclude and "cats" in msg:
            self.cats = msg["cats"]
+        if "strings" not in exclude and "strings" in msg:
+            for s in msg["strings"]:
+                self.vocab.strings.add(s)
        if "has_unknown_spaces" not in exclude and "has_unknown_spaces" in msg:
            self.has_unknown_spaces = msg["has_unknown_spaces"]
        start = 0
--- a/spacy/tokens/token.pyx
+++ b/spacy/tokens/token.pyx
@ -923,7 +923,7 @@ cdef class Token:
    @property
    def is_oov(self):
        """RETURNS (bool): Whether the token is out-of-vocabulary."""
-        return self.c.lex.orth in self.vocab.vectors
+        return self.c.lex.orth not in self.vocab.vectors

    @property
    def is_stop(self):
--- a/spacy/util.py
+++ b/spacy/util.py
@ -187,6 +187,10 @@ def load_model_from_path(model_path, meta=False, **overrides):
        pipeline = nlp.Defaults.pipe_names
    elif pipeline in (False, None):
        pipeline = []
+    # skip "vocab" from overrides in component initialization since vocab is
+    # already configured from overrides when nlp is initialized above
+    if "vocab" in overrides:
+        del overrides["vocab"]
    for name in pipeline:
        if name not in disable:
            config = meta.get("pipeline_args", {}).get(name, {})
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
@ -105,8 +105,8 @@ The Chinese language class supports three word segmentation options:
 > ```

 1. **Character segmentation:** Character segmentation is the default
-   segmentation option. It's enabled when you create a new `Chinese`
-   language class or call `spacy.blank("zh")`.
+   segmentation option. It's enabled when you create a new `Chinese` language
+   class or call `spacy.blank("zh")`.
 2. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word
   segmentation with the tokenizer option `{"segmenter": "jieba"}`.
 3. **PKUSeg**: As of spaCy v2.3.0, support for
--- a/website/meta/universe.json
+++ b/website/meta/universe.json
@ -1,5 +1,58 @@
 {
    "resources": [
+        {
+            "id": "spacy-streamlit",
+            "title": "spacy-streamlit",
+            "slogan": "spaCy building blocks for Streamlit apps",
+            "github": "explosion/spacy-streamlit",
+            "description": "This package contains utilities for visualizing spaCy models and building interactive spaCy-powered apps with [Streamlit](https://streamlit.io). It includes various building blocks you can use in your own Streamlit app, like visualizers for **syntactic dependencies**, **named entities**, **text classification**, **semantic similarity** via word vectors, token attributes, and more.",
+            "pip": "spacy-streamlit",
+            "category": ["visualizers"],
+            "thumb": "https://i.imgur.com/mhEjluE.jpg",
+            "image": "https://user-images.githubusercontent.com/13643239/85388081-f2da8700-b545-11ea-9bd4-e303d3c5763c.png",
+            "code_example": [
+                "import spacy_streamlit",
+                "",
+                "models = [\"en_core_web_sm\", \"en_core_web_md\"]",
+                "default_text = \"Sundar Pichai is the CEO of Google.\"",
+                "spacy_streamlit.visualize(models, default_text)"
+            ],
+            "author": "Ines Montani",
+            "author_links": {
+                "twitter": "_inesmontani",
+                "github": "ines",
+                "website": "https://ines.io"
+            }
+        },
+        {
+            "id": "spaczz",
+            "title": "spaczz",
+            "slogan": "Fuzzy matching and more for spaCy.",
+            "description": "Spaczz provides fuzzy matching and multi-token regex matching functionality for spaCy. Spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.",
+            "github": "gandersen101/spaczz",
+            "pip": "spaczz",
+            "code_example": [
+                "import spacy",
+                "from spaczz.pipeline import SpaczzRuler",
+                "",
+                "nlp = spacy.blank('en')",
+                "ruler = SpaczzRuler(nlp)",
+                "ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])",
+                "nlp.add_pipe(ruler)",
+                "",
+                "doc = nlp('Oops, I spelled Bill Gatez wrong.')",
+                "print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])"
+            ],
+            "code_language": "python",
+            "url": "https://spaczz.readthedocs.io/en/latest/",
+            "author": "Grant Andersen",
+            "author_links": {
+                "twitter": "gandersen101",
+                "github": "gandersen101"
+            },
+            "category": ["pipeline"],
+            "tags": ["fuzzy-matching", "regex"]
+        },
        {
            "id": "spacy-universal-sentence-encoder",
            "title": "SpaCy - Universal Sentence Encoder",
@ -1238,6 +1291,19 @@
            "youtube": "K1elwpgDdls",
            "category": ["videos"]
        },
+        {
+            "type": "education",
+            "id": "video-spacy-course-es",
+            "title": "NLP avanzado con spaCy · Un curso en línea gratis",
+            "description": "spaCy es un paquete moderno de Python para hacer Procesamiento de Lenguaje Natural de potencia industrial. En este curso en línea, interactivo y gratuito, aprenderás a usar spaCy para construir sistemas avanzados de comprensión de lenguaje natural usando enfoques basados en reglas y en machine learning.",
+            "url": "https://course.spacy.io/es",
+            "author": "Camila Gutiérrez",
+            "author_links": {
+                "twitter": "Mariacamilagl30"
+            },
+            "youtube": "RNiLVCE5d4k",
+            "category": ["videos"]
+        },
        {
            "type": "education",
            "id": "video-intro-to-nlp-episode-1",
@ -1294,6 +1360,20 @@
            "youtube": "IqOJU1-_Fi0",
            "category": ["videos"]
        },
+        {
+            "type": "education",
+            "id": "video-intro-to-nlp-episode-5",
+            "title": "Intro to NLP with spaCy (5)",
+            "slogan": "Episode 5: Rules vs. Machine Learning",
+            "description": "In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recognition model from scratch.",
+            "author": "Vincent Warmerdam",
+            "author_links": {
+                "twitter": "fishnets88",
+                "github": "koaning"
+            },
+            "youtube": "f4sqeLRzkPg",
+            "category": ["videos"]
+        },
        {
            "type": "education",
            "id": "video-spacy-irl-entity-linking",
@ -2348,6 +2428,56 @@
            },
            "category": ["pipeline", "conversational", "research"],
            "tags": ["spell check", "correction", "preprocessing", "translation", "correction"]
+        },
+        {
+            "id": "texthero",
+            "title": "Texthero",
+            "slogan": "Text preprocessing, representation and visualization from zero to hero.",
+            "description": "Texthero is a Python package to work with text data efficiently. It empowers NLP developers with a tool to quickly understand any text-based dataset and it provides a solid pipeline to clean and represent text data, from zero to hero.",
+            "github": "jbesomi/texthero",
+            "pip": "texthero",
+            "code_example": [
+                "import texthero as hero",
+                "import pandas as pd",
+                "",
+                "df = pd.read_csv('https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv')",
+                "df['named_entities'] = hero.named_entities(df['text'])",
+                "df.head()"
+            ],
+            "code_language": "python",
+            "url": "https://texthero.org",
+            "thumb": "https://texthero.org/img/T.png",
+            "image": "https://texthero.org/docs/assets/texthero.png",
+            "author": "Jonathan Besomi",
+            "author_links": {
+                "github": "jbesomi",
+                "website": "https://besomi.ai"
+            },
+            "category": ["standalone"]
+        },
+        {
+            "id": "cov-bsv",
+            "title": "VA COVID-19 NLP BSV",
+            "slogan": "spaCy pipeline for COVID-19 surveillance.",
+            "github": "abchapman93/VA_COVID-19_NLP_BSV",
+            "description": "A spaCy rule-based pipeline for identifying positive cases of COVID-19 from clinical text. A version of this system was deployed as part of the US Department of Veterans Affairs biosurveillance response to COVID-19.",
+            "pip": "cov-bsv",
+            "code_example": [
+                "import cov_bsv",
+                "",
+                "nlp = cov_bsv.load()",
+                "text = 'Pt tested for COVID-19. His wife was recently diagnosed with novel coronavirus. SARS-COV-2: Detected'",
+                "doc = nlp(text)",
+                "print(doc.ents)",
+                "print(doc._.cov_classification)",
+                "cov_bsv.visualize_doc(doc)"
+            ],
+            "category": ["pipeline", "standalone", "biomedical", "scientific"],
+            "tags": ["clinical", "epidemiology", "covid-19", "surveillance"],
+            "author": "Alec Chapman",
+            "author_links": {
+                "github": "abchapman93"
+            }
        }
    ],

--- a/website/package-lock.json
+++ b/website/package-lock.json