Merge branch 'master' into spacy.io

2025-11-03 01:17:52 +03:00 · 2019-07-12 14:30:49 +02:00
commit 69dbd59a13
parent 30d6c2ccc2 02e12b0852
109 changed files with 249446 additions and 1296 deletions
--- a/.github/contributors/ameyuuno.md
+++ b/.github/contributors/ameyuuno.md
@@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your 
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statements below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Alexey Kim           |
 | Company name (if applicable)   |                      |
 | Title or role (if applicable)  |                      |
 | Date                           | 2019-07-09           |
 | GitHub username                | ameyuuno             |
 | Website (optional)             | https://ameyuuno.io  |
--- a/.github/contributors/askhogan.md
+++ b/.github/contributors/askhogan.md
@@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI GmbH](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statements below. Please do NOT
 mark both statements:
    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Patrick Hogan        |
 | Company name (if applicable)   |                      |
 | Title or role (if applicable)  |                      |
 | Date                           | 7/7/2019             |
 | GitHub username                | askhogan@gmail.com   |
 | Website (optional)             |                      |
--- a/.github/contributors/cedar101.md
+++ b/.github/contributors/cedar101.md
@@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your 
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statements below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                    |
 |------------------------------- | ------------------------ |
 | Name                           | Kim, Baeg-il             |
 | Company name (if applicable)   |                          |
 | Title or role (if applicable)  |                          |
 | Date                           | 2019-07-03               |
 | GitHub username                | cedar101                 |
 | Website (optional)             |                          |
--- a/.github/contributors/khellan.md
+++ b/.github/contributors/khellan.md
@@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI GmbH](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statements below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Knut O. Hellan       |
 | Company name (if applicable)   |                      |
 | Title or role (if applicable)  |                      |
 | Date                           | 02.07.2019           |
 | GitHub username                | khellan              |
 | Website (optional)             | knuthellan.com       |
--- a/.github/contributors/kognate.md
+++ b/.github/contributors/kognate.md
@@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI GmbH](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statements below. Please do NOT
 mark both statements:
    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Joshua B. Smith      |
 | Company name (if applicable)   |                      |
 | Title or role (if applicable)  |                      |
 | Date                           | July 7, 2019         |
 | GitHub username                | kognate              |
 | Website (optional)             |                      |
--- a/.github/contributors/rokasramas.md
+++ b/.github/contributors/rokasramas.md
@@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI GmbH](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statements below. Please do NOT
 mark both statements:
    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [x] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                   |
 |------------------------------- | ----------------------- |
 | Name                           | Rokas Ramanauskas       |
 | Company name (if applicable)   | TokenMill               |
 | Title or role (if applicable)  | Software Engineer       |
 | Date                           | 2019-07-02              |
 | GitHub username                | rokasramas              |
 | Website (optional)             | http://www.tokenmill.lt |
--- a/.github/contributors/yashpatadia.md
+++ b/.github/contributors/yashpatadia.md
@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI GmbH](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statements below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           |     Yash Patadia     |
 | Company name (if applicable)   |                      |
 | Title or role (if applicable)  |                      |
 | Date                           |      11/07/2019      |
 | GitHub username                |       yash1994       |
 | Website (optional)             |                      |
--- a/.gitignore
+++ b/.gitignore
@ -56,6 +56,8 @@ parts/
 sdist/
 var/
 *.egg-info/
 pip-wheel-metadata/
 Pipfile.lock
 .installed.cfg
 *.egg
 .eggs
--- a/bin/init.py
+++ b/bin/init.py
--- a/bin/train_word_vectors.py
+++ b/bin/train_word_vectors.py
@ -5,7 +5,6 @@ import logging
 from pathlib import Path
 from collections import defaultdict
 from gensim.models import Word2Vec
 from preshed.counter import PreshCounter
 import plac
 import spacy
--- a/bin/ud/conll17_ud_eval.py
+++ b/bin/ud/conll17_ud_eval.py
@ -292,8 +292,8 @@ def evaluate(gold_ud, system_ud, deprel_weights=None, check_parse=True):
    def spans_score(gold_spans, system_spans):
        correct, gi, si = 0, 0, 0
-        undersegmented = list()
+        undersegmented = []
-        oversegmented = list()
+        oversegmented = []
        combo = 0
        previous_end_si_earlier = False
        previous_end_gi_earlier = False
--- a/bin/wiki_entity_linking/init.py
+++ b/bin/wiki_entity_linking/init.py
--- a/bin/wiki_entity_linking/kb_creator.py
+++ b/bin/wiki_entity_linking/kb_creator.py
@ -0,0 +1,171 @@
 # coding: utf-8
 from __future__ import unicode_literals
 from .train_descriptions import EntityEncoder
 from . import wikidata_processor as wd, wikipedia_processor as wp
 from spacy.kb import KnowledgeBase
 import csv
 import datetime
 INPUT_DIM = 300  # dimension of pre-trained input vectors
 DESC_WIDTH = 64  # dimension of output entity vectors
 def create_kb(nlp, max_entities_per_alias, min_entity_freq, min_occ,
              entity_def_output, entity_descr_output,
              count_input, prior_prob_input, wikidata_input):
    # Create the knowledge base from Wikidata entries
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=DESC_WIDTH)
    # disable this part of the pipeline when rerunning the KB generation from preprocessed files
    read_raw_data = True
    if read_raw_data:
        print()
        print(" * _read_wikidata_entities", datetime.datetime.now())
        title_to_id, id_to_descr = wd.read_wikidata_entities_json(wikidata_input)
        # write the title-ID and ID-description mappings to file
        _write_entity_files(entity_def_output, entity_descr_output, title_to_id, id_to_descr)
    else:
        # read the mappings from file
        title_to_id = get_entity_to_id(entity_def_output)
        id_to_descr = get_id_to_description(entity_descr_output)
    print()
    print(" * _get_entity_frequencies", datetime.datetime.now())
    print()
    entity_frequencies = wp.get_all_frequencies(count_input=count_input)
    # filter the entities in the KB by frequency, because there is too much data (8M entities) otherwise
    filtered_title_to_id = dict()
    entity_list = []
    description_list = []
    frequency_list = []
    for title, entity in title_to_id.items():
        freq = entity_frequencies.get(title, 0)
        desc = id_to_descr.get(entity, None)
        if desc and freq > min_entity_freq:
            entity_list.append(entity)
            description_list.append(desc)
            frequency_list.append(freq)
            filtered_title_to_id[title] = entity
    print("Kept", len(filtered_title_to_id.keys()), "out of", len(title_to_id.keys()),
          "titles with filter frequency", min_entity_freq)
    print()
    print(" * train entity encoder", datetime.datetime.now())
    print()
    encoder = EntityEncoder(nlp, INPUT_DIM, DESC_WIDTH)
    encoder.train(description_list=description_list, to_print=True)
    print()
    print(" * get entity embeddings", datetime.datetime.now())
    print()
    embeddings = encoder.apply_encoder(description_list)
    print()
    print(" * adding", len(entity_list), "entities", datetime.datetime.now())
    kb.set_entities(entity_list=entity_list, prob_list=frequency_list, vector_list=embeddings)
    print()
    print(" * adding aliases", datetime.datetime.now())
    print()
    _add_aliases(kb, title_to_id=filtered_title_to_id,
                 max_entities_per_alias=max_entities_per_alias, min_occ=min_occ,
                 prior_prob_input=prior_prob_input)
    print()
    print("kb size:", len(kb), kb.get_size_entities(), kb.get_size_aliases())
    print("done with kb", datetime.datetime.now())
    return kb
 def _write_entity_files(entity_def_output, entity_descr_output, title_to_id, id_to_descr):
    with open(entity_def_output, mode='w', encoding='utf8') as id_file:
        id_file.write("WP_title" + "|" + "WD_id" + "\n")
        for title, qid in title_to_id.items():
            id_file.write(title + "|" + str(qid) + "\n")
    with open(entity_descr_output, mode='w', encoding='utf8') as descr_file:
        descr_file.write("WD_id" + "|" + "description" + "\n")
        for qid, descr in id_to_descr.items():
            descr_file.write(str(qid) + "|" + descr + "\n")
 def get_entity_to_id(entity_def_output):
    entity_to_id = dict()
    with open(entity_def_output, 'r', encoding='utf8') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='|')
        # skip header
        next(csvreader)
        for row in csvreader:
            entity_to_id[row[0]] = row[1]
    return entity_to_id
 def get_id_to_description(entity_descr_output):
    id_to_desc = dict()
    with open(entity_descr_output, 'r', encoding='utf8') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='|')
        # skip header
        next(csvreader)
        for row in csvreader:
            id_to_desc[row[0]] = row[1]
    return id_to_desc
 def _add_aliases(kb, title_to_id, max_entities_per_alias, min_occ, prior_prob_input):
    wp_titles = title_to_id.keys()
    # adding aliases with prior probabilities
    # we can read this file sequentially, it's sorted by alias, and then by count
    with open(prior_prob_input, mode='r', encoding='utf8') as prior_file:
        # skip header
        prior_file.readline()
        line = prior_file.readline()
        previous_alias = None
        total_count = 0
        counts = []
        entities = []
        while line:
            splits = line.replace('\n', "").split(sep='|')
            new_alias = splits[0]
            count = int(splits[1])
            entity = splits[2]
            if new_alias != previous_alias and previous_alias:
                # done reading the previous alias --> output
                if len(entities) > 0:
                    selected_entities = []
                    prior_probs = []
                    for ent_count, ent_string in zip(counts, entities):
                        if ent_string in wp_titles:
                            wd_id = title_to_id[ent_string]
                            p_entity_givenalias = ent_count / total_count
                            selected_entities.append(wd_id)
                            prior_probs.append(p_entity_givenalias)
                    if selected_entities:
                        try:
                            kb.add_alias(alias=previous_alias, entities=selected_entities, probabilities=prior_probs)
                        except ValueError as e:
                            print(e)
                total_count = 0
                counts = []
                entities = []
            total_count += count
            if len(entities) < max_entities_per_alias and count >= min_occ:
                counts.append(count)
                entities.append(entity)
            previous_alias = new_alias
            line = prior_file.readline()
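
A minimal sketch of the prior-probability step in `_add_aliases`, assuming the rows have already been parsed into `(alias, count, title)` tuples sorted by alias and then by count. The function name `alias_priors`, its in-memory input, and the default parameter values are hypothetical and not part of this commit. The sketch only shows how p(entity | alias) can be derived from link counts.

```python
from itertools import groupby


def alias_priors(rows, title_to_id, max_entities_per_alias=10, min_occ=1):
    """Compute p(entity | alias) from (alias, count, title) rows sorted by alias.

    Hypothetical helper for illustration; it returns a dict instead of
    writing to a KnowledgeBase.
    """
    priors = {}
    for alias, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        # The denominator uses every count seen for the alias,
        # including candidates that are filtered out below.
        total = sum(count for _, count, _ in group)
        # Keep the first candidates that occur often enough.
        kept = [(count, title) for _, count, title in group if count >= min_occ]
        kept = kept[:max_entities_per_alias]
        # Only titles that map to a known ID are retained.
        selected = [
            (title_to_id[title], count / total)
            for count, title in kept
            if title in title_to_id
        ]
        if selected:
            priors[alias] = selected
    return priors


rows = [
    ("NYC", 8, "New York City"),
    ("NYC", 2, "New York (state)"),
    ("Paris", 5, "Paris"),
]
title_to_id = {"New York City": "Q60", "Paris": "Q90"}
priors = alias_priors(rows, title_to_id)
```

Grouping with `itertools.groupby` emits the final alias group without extra bookkeeping, which a hand-written "previous alias" loop has to handle separately after the last line.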
--- a/bin/wiki_entity_linking/train_descriptions.py
+++ b/bin/wiki_entity_linking/train_descriptions.py
@ -0,0 +1,121 @@
 # coding: utf-8
 from random import shuffle
 import numpy as np
 from spacy._ml import zero_init, create_default_optimizer
 from spacy.cli.pretrain import get_cossim_loss
 from thinc.v2v import Model
 from thinc.api import chain
 from thinc.neural._classes.affine import Affine
 class EntityEncoder:
    """
    Train the embeddings of entity descriptions to fit a fixed-size entity vector (e.g. 64D).
    This entity vector will be stored in the KB, for further downstream use in the entity model.
    """
    DROP = 0
    EPOCHS = 5
    STOP_THRESHOLD = 0.04
    BATCH_SIZE = 1000
    def __init__(self, nlp, input_dim, desc_width):
        self.nlp = nlp
        self.input_dim = input_dim
        self.desc_width = desc_width
        self.encoder = None  # set by _build_network(); checked in apply_encoder()
    def apply_encoder(self, description_list):
        if self.encoder is None:
            raise ValueError("Can not apply encoder before training it")
        batch_size = 100000
        start = 0
        stop = min(batch_size, len(description_list))
        encodings = []
        while start < len(description_list):
            docs = list(self.nlp.pipe(description_list[start:stop]))
            doc_embeddings = [self._get_doc_embedding(doc) for doc in docs]
            enc = self.encoder(np.asarray(doc_embeddings))
            encodings.extend(enc.tolist())
            start = start + batch_size
            stop = min(stop + batch_size, len(description_list))
        return encodings
    def train(self, description_list, to_print=False):
        processed, loss = self._train_model(description_list)
        if to_print:
            print("Trained on", processed, "entities across", self.EPOCHS, "epochs")
            print("Final loss:", loss)
    def _train_model(self, description_list):
        # TODO: when loss gets too low, a 'mean of empty slice' warning is thrown by numpy
        self._build_network(self.input_dim, self.desc_width)
        processed = 0
        loss = 1
        descriptions = description_list.copy()   # copy this list so that shuffling does not affect other functions
        for i in range(self.EPOCHS):
            shuffle(descriptions)
            batch_nr = 0
            start = 0
            stop = min(self.BATCH_SIZE, len(descriptions))
            while loss > self.STOP_THRESHOLD and start < len(descriptions):
                batch = []
                for descr in descriptions[start:stop]:
                    doc = self.nlp(descr)
                    doc_vector = self._get_doc_embedding(doc)
                    batch.append(doc_vector)
                loss = self._update(batch)
                print(i, batch_nr, loss)
                processed += len(batch)
                batch_nr += 1
                start = start + self.BATCH_SIZE
                stop = min(stop + self.BATCH_SIZE, len(descriptions))
        return processed, loss
    @staticmethod
    def _get_doc_embedding(doc):
        indices = np.zeros((len(doc),), dtype="i")
        for i, word in enumerate(doc):
            if word.orth in doc.vocab.vectors.key2row:
                indices[i] = doc.vocab.vectors.key2row[word.orth]
            else:
                indices[i] = 0
        word_vectors = doc.vocab.vectors.data[indices]
        doc_vector = np.mean(word_vectors, axis=0)
        return doc_vector
    def _build_network(self, orig_width, hidden_width):
        with Model.define_operators({">>": chain}):
            # very simple encoder-decoder model
            self.encoder = (
                Affine(hidden_width, orig_width)
            )
            self.model = self.encoder >> zero_init(Affine(orig_width, hidden_width, drop_factor=0.0))
        self.sgd = create_default_optimizer(self.model.ops)
    def _update(self, vectors):
        predictions, bp_model = self.model.begin_update(np.asarray(vectors), drop=self.DROP)
        loss, d_scores = self._get_loss(scores=predictions, golds=np.asarray(vectors))
        bp_model(d_scores, sgd=self.sgd)
        return loss / len(vectors)
    @staticmethod
    def _get_loss(golds, scores):
        loss, gradients = get_cossim_loss(scores, golds)
        return loss, gradients
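
Both `apply_encoder` and `_train_model` walk over the description list in fixed-size slices using manual `start`/`stop` counters. As a sketch only, the same slicing can be written as a small generator. The name `iter_batches` is hypothetical and does not appear in this commit.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Slicing past the end of a list is safe, so the last batch
        # is simply shorter when len(items) is not a multiple.
        yield items[start:start + batch_size]


descriptions = ["d1", "d2", "d3", "d4", "d5"]
batches = list(iter_batches(descriptions, 2))
```

A generator like this keeps the loop body free of index arithmetic, so the stop condition on the loss can be checked separately from the batching.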
--- a/bin/wiki_entity_linking/training_set_creator.py
+++ b/bin/wiki_entity_linking/training_set_creator.py
@ -0,0 +1,353 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import os
 import re
 import bz2
 import datetime
 from spacy.gold import GoldParse
 from bin.wiki_entity_linking import kb_creator
 """
 Process Wikipedia interlinks to generate a training dataset for the EL algorithm.
 Gold-standard entities are stored in one file in standoff format (by character offset).
 """
 ENTITY_FILE = "gold_entities.csv"
 def create_training(wikipedia_input, entity_def_input, training_output):
    wp_to_id = kb_creator.get_entity_to_id(entity_def_input)
    _process_wikipedia_texts(wikipedia_input, wp_to_id, training_output, limit=None)
 def _process_wikipedia_texts(wikipedia_input, wp_to_id, training_output, limit=None):
    """
    Read the XML wikipedia data to parse out training data:
    raw text data + positive instances
    """
    title_regex = re.compile(r'(?<=<title>).*(?=</title>)')
    id_regex = re.compile(r'(?<=<id>)\d*(?=</id>)')
    read_ids = set()
    entityfile_loc = training_output / ENTITY_FILE
    with open(entityfile_loc, mode="w", encoding='utf8') as entityfile:
        # write entity training header file
        _write_training_entity(outputfile=entityfile,
                               article_id="article_id",
                               alias="alias",
                               entity="WD_id",
                               start="start",
                               end="end")
        with bz2.open(wikipedia_input, mode='rb') as file:
            line = file.readline()
            cnt = 0
            article_text = ""
            article_title = None
            article_id = None
            reading_text = False
            reading_revision = False
            while line and (not limit or cnt < limit):
                if cnt % 1000000 == 0:
                    print(datetime.datetime.now(), "processed", cnt, "lines of Wikipedia dump")
                clean_line = line.strip().decode("utf-8")
                if clean_line == "<revision>":
                    reading_revision = True
                elif clean_line == "</revision>":
                    reading_revision = False
                # Start reading new page
                if clean_line == "<page>":
                    article_text = ""
                    article_title = None
                    article_id = None
                # finished reading this page
                elif clean_line == "</page>":
                    if article_id:
                        try:
                            _process_wp_text(wp_to_id, entityfile, article_id, article_title, article_text.strip(),
                                             training_output)
                        except Exception as e:
                            print("Error processing article", article_id, article_title, e)
                    else:
                        print("Done processing a page, but couldn't find an article_id ?", article_title)
                    article_text = ""
                    article_title = None
                    article_id = None
                    reading_text = False
                    reading_revision = False
                # start reading text within a page
                if "<text" in clean_line:
                    reading_text = True
                if reading_text:
                    article_text += " " + clean_line
                # stop reading text within a page (we assume a new page doesn't start on the same line)
                if "</text" in clean_line:
                    reading_text = False
                # read the ID of this article (outside the revision portion of the document)
                if not reading_revision:
                    ids = id_regex.search(clean_line)
                    if ids:
                        article_id = ids[0]
                        if article_id in read_ids:
                            print("Found duplicate article ID", article_id, clean_line)  # This should never happen ...
                        read_ids.add(article_id)
                # read the title of this article (outside the revision portion of the document)
                if not reading_revision:
                    titles = title_regex.search(clean_line)
                    if titles:
                        article_title = titles[0].strip()
                line = file.readline()
                cnt += 1
 text_regex = re.compile(r'(?<=<text xml:space=\"preserve\">).*(?=</text)')
 def _process_wp_text(wp_to_id, entityfile, article_id, article_title, article_text, training_output):
    found_entities = False
    # ignore meta Wikipedia pages
    if article_title.startswith("Wikipedia:"):
        return
    # remove the text tags
    text = text_regex.search(article_text).group(0)
    # stop processing if this is a redirect page
    if text.startswith("#REDIRECT"):
        return
    # get the raw text without markup etc, keeping only interwiki links
    clean_text = _get_clean_wp_text(text)
    # read the text char by char to get the right offsets for the interwiki links
    final_text = ""
    open_read = 0
    reading_text = True
    reading_entity = False
    reading_mention = False
    reading_special_case = False
    entity_buffer = ""
    mention_buffer = ""
    for index, letter in enumerate(clean_text):
        if letter == '[':
            open_read += 1
        elif letter == ']':
            open_read -= 1
        elif letter == '|':
            if reading_text:
                final_text += letter
            # switch from reading entity to mention in the [[entity|mention]] pattern
            elif reading_entity:
                reading_text = False
                reading_entity = False
                reading_mention = True
            else:
                reading_special_case = True
        else:
            if reading_entity:
                entity_buffer += letter
            elif reading_mention:
                mention_buffer += letter
            elif reading_text:
                final_text += letter
            else:
                raise ValueError("Not sure at point", clean_text[index-2:index+2])
        if open_read > 2:
            reading_special_case = True
        if open_read == 2 and reading_text:
            reading_text = False
            reading_entity = True
            reading_mention = False
        # we just finished reading an entity
        if open_read == 0 and not reading_text:
            if '#' in entity_buffer or entity_buffer.startswith(':'):
                reading_special_case = True
            # Ignore cases with nested structures like File: handles etc
            if not reading_special_case:
                if not mention_buffer:
                    mention_buffer = entity_buffer
                start = len(final_text)
                end = start + len(mention_buffer)
                qid = wp_to_id.get(entity_buffer, None)
                if qid:
                    _write_training_entity(outputfile=entityfile,
                                           article_id=article_id,
                                           alias=mention_buffer,
                                           entity=qid,
                                           start=start,
                                           end=end)
                found_entities = True
                final_text += mention_buffer
            entity_buffer = ""
            mention_buffer = ""
            reading_text = True
            reading_entity = False
            reading_mention = False
            reading_special_case = False
    if found_entities:
        _write_training_article(article_id=article_id, clean_text=final_text, training_output=training_output)
 info_regex = re.compile(r'{[^{]*?}')
 htlm_regex = re.compile(r'&lt;!--[^-]*--&gt;')
 category_regex = re.compile(r'\[\[Category:[^\[]*]]')
 file_regex = re.compile(r'\[\[File:[^[\]]+]]')
 ref_regex = re.compile(r'&lt;ref.*?&gt;')     # non-greedy
 ref_2_regex = re.compile(r'&lt;/ref.*?&gt;')  # non-greedy
 def _get_clean_wp_text(article_text):
    clean_text = article_text.strip()
    # remove bolding & italic markup
    clean_text = clean_text.replace('\'\'\'', '')
    clean_text = clean_text.replace('\'\'', '')
    # remove nested {{info}} statements by removing the inner/smallest ones first and iterating
    try_again = True
    previous_length = len(clean_text)
    while try_again:
        clean_text = info_regex.sub('', clean_text)  # non-greedy match excluding a nested {
        if len(clean_text) < previous_length:
            try_again = True
        else:
            try_again = False
        previous_length = len(clean_text)
    # remove HTML comments
    clean_text = htlm_regex.sub('', clean_text)
    # remove Category and File statements
    clean_text = category_regex.sub('', clean_text)
    clean_text = file_regex.sub('', clean_text)
    # remove multiple =
    while '==' in clean_text:
        clean_text = clean_text.replace("==", "=")
    clean_text = clean_text.replace(". =", ".")
    clean_text = clean_text.replace(" = ", ". ")
    clean_text = clean_text.replace("= ", ".")
    clean_text = clean_text.replace(" =", "")
    # remove refs (non-greedy match)
    clean_text = ref_regex.sub('', clean_text)
    clean_text = ref_2_regex.sub('', clean_text)
    # remove additional wikiformatting
    clean_text = re.sub(r'&lt;blockquote&gt;', '', clean_text)
    clean_text = re.sub(r'&lt;/blockquote&gt;', '', clean_text)
    # change special characters back to normal ones
    clean_text = clean_text.replace(r'&lt;', '<')
    clean_text = clean_text.replace(r'&gt;', '>')
    clean_text = clean_text.replace(r'&quot;', '"')
    clean_text = clean_text.replace(r'&amp;nbsp;', ' ')
    clean_text = clean_text.replace(r'&amp;', '&')
    # remove multiple spaces
    while '  ' in clean_text:
        clean_text = clean_text.replace('  ', ' ')
    return clean_text.strip()
 def _write_training_article(article_id, clean_text, training_output):
    file_loc = training_output / (str(article_id) + ".txt")
    with open(file_loc, mode='w', encoding='utf8') as outputfile:
        outputfile.write(clean_text)
 def _write_training_entity(outputfile, article_id, alias, entity, start, end):
    outputfile.write(article_id + "|" + alias + "|" + entity + "|" + str(start) + "|" + str(end) + "\n")
 def is_dev(article_id):
    return article_id.endswith("3")
 def read_training(nlp, training_dir, dev, limit):
    # This method provides training examples that correspond to the entity annotations found by the nlp object
    entityfile_loc = training_dir / ENTITY_FILE
    data = []
    # assume the data is written sequentially, so we can reuse the article docs
    current_article_id = None
    current_doc = None
    ents_by_offset = dict()
    skip_articles = set()
    total_entities = 0
    with open(entityfile_loc, mode='r', encoding='utf8') as file:
        for line in file:
            if not limit or len(data) < limit:
                fields = line.replace('\n', "").split(sep='|')
                article_id = fields[0]
                alias = fields[1]
                wp_title = fields[2]
                start = fields[3]
                end = fields[4]
                if dev == is_dev(article_id) and article_id != "article_id" and article_id not in skip_articles:
                    if not current_doc or (current_article_id != article_id):
                        # parse the new article text
                        file_name = article_id + ".txt"
                        try:
                            with open(os.path.join(training_dir, file_name), mode="r", encoding='utf8') as f:
                                text = f.read()
                                if len(text) < 30000:   # threshold for convenience / speed of processing
                                    current_doc = nlp(text)
                                    current_article_id = article_id
                                    ents_by_offset = dict()
                                    for ent in current_doc.ents:
                                        sent_length = len(ent.sent)
                                        # custom filtering to avoid too long or too short sentences
                                        if 5 < sent_length < 100:
                                            ents_by_offset[str(ent.start_char) + "_" + str(ent.end_char)] = ent
                                else:
                                    skip_articles.add(article_id)
                                    current_doc = None
                        except Exception as e:
                            print("Problem parsing article", article_id, e)
                            skip_articles.add(article_id)
                            raise e
                    # repeat checking this condition in case an exception was thrown
                    if current_doc and (current_article_id == article_id):
                        found_ent = ents_by_offset.get(start + "_" + end,  None)
                        if found_ent:
                            if found_ent.text != alias:
                                skip_articles.add(article_id)
                                current_doc = None
                            else:
                                sent = found_ent.sent.as_doc()
                                # currently feeding the gold data one entity per sentence at a time
                                gold_start = int(start) - found_ent.sent.start_char
                                gold_end = int(end) - found_ent.sent.start_char
                                gold_entities = [(gold_start, gold_end, wp_title)]
                                gold = GoldParse(doc=sent, links=gold_entities)
                                data.append((sent, gold))
                                total_entities += 1
                                if len(data) % 2500 == 0:
                                    print(" -read", total_entities, "entities")
    print(" -read", total_entities, "entities")
    return data
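Review note on the gold offsets in `read_training`: the training file stores character offsets relative to the whole article, while `GoldParse` receives a sentence-level doc, so the offsets are shifted by the sentence's `start_char`. A minimal sketch of that arithmetic with plain strings (the sample text and variable names are made up for illustration, and no spaCy objects are involved):

```python
# Hypothetical article text; the entity sits in the second sentence.
text = "It is a novel. The author is Douglas Adams."
mention = "Douglas Adams"

# Article-level character offsets, as stored in the training file.
ent_start = text.index(mention)
ent_end = ent_start + len(mention)

# Start of the sentence containing the entity (found_ent.sent.start_char in the diff).
sent_start = text.index("The author")
sentence = text[sent_start:]

# Shift to sentence-relative offsets, as done for gold_start / gold_end.
gold_start = ent_start - sent_start
gold_end = ent_end - sent_start

print(sentence[gold_start:gold_end])
```

If the shifted offsets do not slice out the mention text, the sentence boundary and the stored offsets disagree, which is the situation the `found_ent.text != alias` check guards against.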
--- a/bin/wiki_entity_linking/wikidata_processor.py
+++ b/bin/wiki_entity_linking/wikidata_processor.py
@ -0,0 +1,119 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import bz2
 import json
 import datetime
 def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False):
    # Read the JSON wiki data and parse out the entities. Takes about 7h30 to parse 55M lines.
    # get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/
    lang = 'en'
    site_filter = 'enwiki'
    # properties filter (currently disabled to get ALL data)
    prop_filter = dict()
    # prop_filter = {'P31': {'Q5', 'Q15632617'}}     # currently defined as OR: one property suffices to be selected
    title_to_id = dict()
    id_to_descr = dict()
    # parse appropriate fields - depending on what we need in the KB
    parse_properties = False
    parse_sitelinks = True
    parse_labels = False
    parse_descriptions = True
    parse_aliases = False
    parse_claims = False
    with bz2.open(wikidata_file, mode='rb') as file:
        line = file.readline()
        cnt = 0
        while line and (not limit or cnt < limit):
            if cnt % 500000 == 0:
                print(datetime.datetime.now(), "processed", cnt, "lines of WikiData dump")
            clean_line = line.strip()
            if clean_line.endswith(b","):
                clean_line = clean_line[:-1]
            if len(clean_line) > 1:
                obj = json.loads(clean_line)
                entry_type = obj["type"]
                if entry_type == "item":
                    # filtering records on their properties (currently disabled to get ALL data)
                    # keep = False
                    keep = True
                    claims = obj["claims"]
                    if parse_claims:
                        for prop, value_set in prop_filter.items():
                            claim_property = claims.get(prop, None)
                            if claim_property:
                                for cp in claim_property:
                                    cp_id = cp['mainsnak'].get('datavalue', {}).get('value', {}).get('id')
                                    cp_rank = cp['rank']
                                    if cp_rank != "deprecated" and cp_id in value_set:
                                        keep = True
                    if keep:
                        unique_id = obj["id"]
                        if to_print:
                            print("ID:", unique_id)
                            print("type:", entry_type)
                        # parsing all properties that refer to other entities
                        if parse_properties:
                            for prop, claim_property in claims.items():
                                cp_dicts = [cp['mainsnak']['datavalue'].get('value') for cp in claim_property
                                            if cp['mainsnak'].get('datavalue')]
                                cp_values = [cp_dict.get('id') for cp_dict in cp_dicts if isinstance(cp_dict, dict)
                                             if cp_dict.get('id') is not None]
                                if cp_values:
                                    if to_print:
                                        print("prop:", prop, cp_values)
                        found_link = False
                        if parse_sitelinks:
                            site_value = obj["sitelinks"].get(site_filter, None)
                            if site_value:
                                site = site_value['title']
                                if to_print:
                                    print(site_filter, ":", site)
                                title_to_id[site] = unique_id
                                found_link = True
                        if parse_labels:
                            labels = obj["labels"]
                            if labels:
                                lang_label = labels.get(lang, None)
                                if lang_label:
                                    if to_print:
                                        print("label (" + lang + "):", lang_label["value"])
                        if found_link and parse_descriptions:
                            descriptions = obj["descriptions"]
                            if descriptions:
                                lang_descr = descriptions.get(lang, None)
                                if lang_descr:
                                    if to_print:
                                        print("description (" + lang + "):", lang_descr["value"])
                                    id_to_descr[unique_id] = lang_descr["value"]
                        if parse_aliases:
                            aliases = obj["aliases"]
                            if aliases:
                                lang_aliases = aliases.get(lang, None)
                                if lang_aliases:
                                    for item in lang_aliases:
                                        if to_print:
                                            print("alias (" + lang + "):", item["value"])
                        if to_print:
                            print()
            line = file.readline()
            cnt += 1
    return title_to_id, id_to_descr
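Review note on the per-line handling in `read_wikidata_entities_json`: the dump is one large JSON array with one entity per line, so each line is stripped, a trailing comma is dropped, and the bare `[` / `]` lines are skipped before calling `json.loads`. A minimal sketch under those assumptions; `parse_dump_line` is a hypothetical helper and the sample lines are hand-written, not taken from a real dump:

```python
import json


def parse_dump_line(line, lang="en", site="enwiki"):
    """Return (entity_id, title, description) for an item with a sitelink, else None."""
    clean = line.strip()
    if clean.endswith(b","):
        clean = clean[:-1]
    if len(clean) <= 1:  # the "[" or "]" that opens or closes the JSON array
        return None
    obj = json.loads(clean)
    if obj.get("type") != "item":
        return None
    sitelink = obj.get("sitelinks", {}).get(site)
    if not sitelink:
        return None
    descr = obj.get("descriptions", {}).get(lang, {}).get("value")
    return obj["id"], sitelink["title"], descr


# Hand-written sample in the same line-per-entity layout.
sample = [
    b"[\n",
    b'{"type": "item", "id": "Q42", "sitelinks": {"enwiki": {"title": "Douglas Adams"}}, '
    b'"descriptions": {"en": {"value": "English writer"}}},\n',
    b'{"type": "property", "id": "P31"},\n',
    b"]\n",
]
parsed = [parse_dump_line(line) for line in sample]
print(parsed)
```

Unlike the function in the diff, this sketch uses `.get` for every field, so a record without `sitelinks` or `descriptions` is skipped instead of raising a `KeyError`.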
--- a/bin/wiki_entity_linking/wikipedia_processor.py
+++ b/bin/wiki_entity_linking/wikipedia_processor.py
@ -0,0 +1,182 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import re
 import bz2
 import csv
 import datetime
 """
 Process a Wikipedia dump to calculate entity frequencies and prior probabilities in combination with certain mentions.
 Write these results to file for downstream KB and training data generation.
 """
 map_alias_to_link = dict()
 # these will/should be matched ignoring case
 wiki_namespaces = ["b", "betawikiversity", "Book", "c", "Category", "Commons",
                   "d", "dbdump", "download", "Draft", "Education", "Foundation",
                   "Gadget", "Gadget definition", "gerrit", "File", "Help", "Image", "Incubator",
                   "m", "mail", "mailarchive", "media", "MediaWiki", "MediaWiki talk", "Mediawikiwiki",
                   "MediaZilla", "Meta", "Metawikipedia", "Module",
                   "mw", "n", "nost", "oldwikisource", "outreach", "outreachwiki", "otrs", "OTRSwiki",
                   "Portal", "phab", "Phabricator", "Project", "q", "quality", "rev",
                   "s", "spcom", "Special", "species", "Strategy", "sulutil", "svn",
                   "Talk", "Template", "Template talk", "Testwiki", "ticket", "TimedText", "Toollabs", "tools",
                   "tswiki", "User", "User talk", "v", "voy",
                   "w", "Wikibooks", "Wikidata", "wikiHow", "Wikinvest", "wikilivres", "Wikimedia", "Wikinews",
                   "Wikipedia", "Wikipedia talk", "Wikiquote", "Wikisource", "Wikispecies", "Wikitech",
                   "Wikiversity", "Wikivoyage", "wikt", "wiktionary", "wmf", "wmania", "WP"]
 # find the links
 link_regex = re.compile(r'\[\[[^\[\]]*\]\]')
 # match on interwiki links, e.g. `en:` or `:fr:`
 ns_regex = r":?" + "[a-z][a-z]" + ":"
 # match on Namespace: optionally preceded by a :
 for ns in wiki_namespaces:
    ns_regex += "|" + ":?" + ns + ":"
 ns_regex = re.compile(ns_regex, re.IGNORECASE)
 def read_wikipedia_prior_probs(wikipedia_input, prior_prob_output):
    """
    Read the XML wikipedia data and parse out intra-wiki links to estimate prior probabilities.
    The full file takes about 2h to parse 1100M lines.
    It works relatively fast because it runs line by line, regardless of which article the intra-wiki link is from.
    """
    with bz2.open(wikipedia_input, mode='rb') as file:
        line = file.readline()
        cnt = 0
        while line:
            if cnt % 5000000 == 0:
                print(datetime.datetime.now(), "processed", cnt, "lines of Wikipedia dump")
            clean_line = line.strip().decode("utf-8")
            aliases, entities, normalizations = get_wp_links(clean_line)
            for alias, entity, norm in zip(aliases, entities, normalizations):
                _store_alias(alias, entity, normalize_alias=norm, normalize_entity=True)
            line = file.readline()
            cnt += 1
    # write all aliases and their entities and count occurrences to file
    with open(prior_prob_output, mode='w', encoding='utf8') as outputfile:
        outputfile.write("alias" + "|" + "count" + "|" + "entity" + "\n")
        for alias, alias_dict in sorted(map_alias_to_link.items(), key=lambda x: x[0]):
            for entity, count in sorted(alias_dict.items(), key=lambda x: x[1], reverse=True):
                outputfile.write(alias + "|" + str(count) + "|" + entity + "\n")
 def _store_alias(alias, entity, normalize_alias=False, normalize_entity=True):
    alias = alias.strip()
    entity = entity.strip()
    # remove everything after # as this is not part of the title but refers to a specific paragraph
    if normalize_entity:
        # wikipedia titles are always capitalized
        entity = _capitalize_first(entity.split("#")[0])
    if normalize_alias:
        alias = alias.split("#")[0]
    if alias and entity:
        alias_dict = map_alias_to_link.get(alias, dict())
        entity_count = alias_dict.get(entity, 0)
        alias_dict[entity] = entity_count + 1
        map_alias_to_link[alias] = alias_dict
 def get_wp_links(text):
    aliases = []
    entities = []
    normalizations = []
    matches = link_regex.findall(text)
    for match in matches:
        match = match[2:][:-2].replace("_", " ").strip()
        if ns_regex.match(match):
            pass  # ignore namespaces at the beginning of the string
        # this is a simple [[link]], with the alias the same as the entity title
        elif "|" not in match:
            aliases.append(match)
            entities.append(match)
            normalizations.append(True)
        # in wiki format, the link is written as [[entity|alias]]
        else:
            splits = match.split("|")
            entity = splits[0].strip()
            alias = splits[1].strip()
            # specific wiki format  [[alias (specification)|]]
            if len(alias) == 0 and "(" in entity:
                alias = entity.split("(")[0]
                aliases.append(alias)
                entities.append(entity)
                normalizations.append(False)
            else:
                aliases.append(alias)
                entities.append(entity)
                normalizations.append(False)
    return aliases, entities, normalizations
 def _capitalize_first(text):
    if not text:
        return None
    result = text[0].capitalize()
    if len(result) > 0:
        result += text[1:]
    return result
 def write_entity_counts(prior_prob_input, count_output, to_print=False):
    # Write entity counts for quick access later
    entity_to_count = dict()
    total_count = 0
    with open(prior_prob_input, mode='r', encoding='utf8') as prior_file:
        # skip header
        prior_file.readline()
        line = prior_file.readline()
        while line:
            splits = line.replace('\n', "").split(sep='|')
            # alias = splits[0]
            count = int(splits[1])
            entity = splits[2]
            current_count = entity_to_count.get(entity, 0)
            entity_to_count[entity] = current_count + count
            total_count += count
            line = prior_file.readline()
    with open(count_output, mode='w', encoding='utf8') as entity_file:
        entity_file.write("entity" + "|" + "count" + "\n")
        for entity, count in entity_to_count.items():
            entity_file.write(entity + "|" + str(count) + "\n")
    if to_print:
        for entity, count in entity_to_count.items():
            print("Entity count:", entity, count)
        print("Total count:", total_count)
 def get_all_frequencies(count_input):
    entity_to_count = dict()
    with open(count_input, 'r', encoding='utf8') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='|')
        # skip header
        next(csvreader)
        for row in csvreader:
            entity_to_count[row[0]] = int(row[1])
    return entity_to_count
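Review note on how the `alias|count|entity` rows turn into prior probabilities: for a given alias, the prior of an entity is its link count divided by the total count over all entities linked with that alias. The sketch below is a simplified stand-in, not the code in this diff: `extract_links` and `prior_probs` are hypothetical helpers, namespace filtering and first-letter capitalization are left out, and the sample markup is invented.

```python
import re
from collections import Counter, defaultdict

# Same pattern as link_regex in the diff: [[...]] without nested brackets.
link_regex = re.compile(r"\[\[[^\[\]]*\]\]")


def extract_links(text):
    """Yield (alias, entity) pairs from [[entity]] and [[entity|alias]] links."""
    for match in link_regex.findall(text):
        inner = match[2:-2].replace("_", " ").strip()
        entity, _, alias = inner.partition("|")
        entity = entity.split("#")[0].strip()  # drop the section anchor
        alias = alias.strip() or entity  # plain [[link]]: alias equals the title
        if alias and entity:
            yield alias, entity


def prior_probs(counts, alias):
    """Relative frequency of each entity among all links written as `alias`."""
    entity_counts = counts.get(alias, {})
    total = sum(entity_counts.values())
    return {entity: n / total for entity, n in entity_counts.items()}


# Invented sample markup, not taken from a real dump.
sample = ("[[Douglas Adams|Adams]] and [[Adams (band)|Adams]]; "
          "[[Douglas Adams#Career|Adams]] is [[Douglas_Adams]].")
counts = defaultdict(Counter)
for alias, entity in extract_links(sample):
    counts[alias][entity] += 1

adams = prior_probs(counts, "Adams")
print(adams)
```

Because the priors are ratios, scaling every count by the same factor leaves them unchanged; the absolute counts still matter for frequency thresholds such as `MIN_ENTITY_FREQ` and `MIN_PAIR_OCC`.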
--- a/examples/information_extraction/entity_relations.py
+++ b/examples/information_extraction/entity_relations.py
@ -51,7 +51,6 @@ def filter_spans(spans):
 def extract_currency_relations(doc):
    # Merge entities and noun chunks into one token
    seen_tokens = set()
    spans = list(doc.ents) + list(doc.noun_chunks)
    spans = filter_spans(spans)
    with doc.retokenize() as retokenizer:
--- a/examples/pipeline/dummy_entity_linking.py
+++ b/examples/pipeline/dummy_entity_linking.py
@ -9,26 +9,26 @@ from spacy.kb import KnowledgeBase
 def create_kb(vocab):
-    kb = KnowledgeBase(vocab=vocab)
+    kb = KnowledgeBase(vocab=vocab, entity_vector_length=1)
    # adding entities
    entity_0 = "Q1004791_Douglas"
    print("adding entity", entity_0)
-    kb.add_entity(entity=entity_0, prob=0.5)
+    kb.add_entity(entity=entity_0, prob=0.5, entity_vector=[0])
    entity_1 = "Q42_Douglas_Adams"
    print("adding entity", entity_1)
-    kb.add_entity(entity=entity_1, prob=0.5)
+    kb.add_entity(entity=entity_1, prob=0.5, entity_vector=[1])
    entity_2 = "Q5301561_Douglas_Haig"
    print("adding entity", entity_2)
-    kb.add_entity(entity=entity_2, prob=0.5)
+    kb.add_entity(entity=entity_2, prob=0.5, entity_vector=[2])
    # adding aliases
    print()
    alias_0 = "Douglas"
    print("adding alias", alias_0)
-    kb.add_alias(alias=alias_0, entities=[entity_0, entity_1, entity_2], probabilities=[0.1, 0.6, 0.2])
+    kb.add_alias(alias=alias_0, entities=[entity_0, entity_1, entity_2], probabilities=[0.6, 0.1, 0.2])
    alias_1 = "Douglas Adams"
    print("adding alias", alias_1)
@ -41,8 +41,12 @@ def create_kb(vocab):
 def add_el(kb, nlp):
-    el_pipe = nlp.create_pipe(name='entity_linker', config={"kb": kb})
+    el_pipe = nlp.create_pipe(name='entity_linker', config={"context_width": 64})
    el_pipe.set_kb(kb)
    nlp.add_pipe(el_pipe, last=True)
    nlp.begin_training()
    el_pipe.context_weight = 0
    el_pipe.prior_weight = 1
    for alias in ["Douglas Adams", "Douglas"]:
        candidates = nlp.linker.kb.get_candidates(alias)
@ -66,6 +70,6 @@ def add_el(kb, nlp):
 if __name__ == "__main__":
-    nlp = spacy.load('en_core_web_sm')
+    my_nlp = spacy.load('en_core_web_sm')
-    my_kb = create_kb(nlp.vocab)
+    my_kb = create_kb(my_nlp.vocab)
-    add_el(my_kb, nlp)
+    add_el(my_kb, my_nlp)
--- a/examples/pipeline/wikidata_entity_linking.py
+++ b/examples/pipeline/wikidata_entity_linking.py
@ -0,0 +1,442 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import random
 import datetime
 from pathlib import Path
 from bin.wiki_entity_linking import training_set_creator, kb_creator, wikipedia_processor as wp
 from bin.wiki_entity_linking.kb_creator import DESC_WIDTH
 import spacy
 from spacy.kb import KnowledgeBase
 from spacy.util import minibatch, compounding
 """
 Demonstrate how to build a knowledge base from WikiData and run an Entity Linking algorithm.
 """
 ROOT_DIR = Path("C:/Users/Sofie/Documents/data/")
 OUTPUT_DIR = ROOT_DIR / 'wikipedia'
 TRAINING_DIR = OUTPUT_DIR / 'training_data_nel'
 PRIOR_PROB = OUTPUT_DIR / 'prior_prob.csv'
 ENTITY_COUNTS = OUTPUT_DIR / 'entity_freq.csv'
 ENTITY_DEFS = OUTPUT_DIR / 'entity_defs.csv'
 ENTITY_DESCR = OUTPUT_DIR / 'entity_descriptions.csv'
 KB_FILE = OUTPUT_DIR / 'kb_1' / 'kb'
 NLP_1_DIR = OUTPUT_DIR / 'nlp_1'
 NLP_2_DIR = OUTPUT_DIR / 'nlp_2'
 # get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/
 WIKIDATA_JSON = ROOT_DIR / 'wikidata' / 'wikidata-20190304-all.json.bz2'
 # get enwiki-latest-pages-articles-multistream.xml.bz2 from https://dumps.wikimedia.org/enwiki/latest/
 ENWIKI_DUMP = ROOT_DIR / 'wikipedia' / 'enwiki-20190320-pages-articles-multistream.xml.bz2'
 # KB construction parameters
 MAX_CANDIDATES = 10
 MIN_ENTITY_FREQ = 20
 MIN_PAIR_OCC = 5
 # model training parameters
 EPOCHS = 10
 DROPOUT = 0.5
 LEARN_RATE = 0.005
 L2 = 1e-6
 CONTEXT_WIDTH = 128
 def run_pipeline():
    # set the appropriate booleans to define which parts of the pipeline should be (re)run
    print("START", datetime.datetime.now())
    print()
    nlp_1 = spacy.load('en_core_web_lg')
    nlp_2 = None
    kb_2 = None
    # one-time methods to create KB and write to file
    to_create_prior_probs = False
    to_create_entity_counts = False
    to_create_kb = False
    # read KB back in from file
    to_read_kb = True
    to_test_kb = False
    # create training dataset
    create_wp_training = False
    # train the EL pipe
    train_pipe = True
    measure_performance = True
    # test the EL pipe on a simple example
    to_test_pipeline = True
    # write the NLP object, read back in and test again
    to_write_nlp = True
    to_read_nlp = True
    test_from_file = False
    # STEP 1 : create prior probabilities from WP (run only once)
    if to_create_prior_probs:
        print("STEP 1: to_create_prior_probs", datetime.datetime.now())
        wp.read_wikipedia_prior_probs(wikipedia_input=ENWIKI_DUMP, prior_prob_output=PRIOR_PROB)
        print()
    # STEP 2 : deduce entity frequencies from WP (run only once)
    if to_create_entity_counts:
        print("STEP 2: to_create_entity_counts", datetime.datetime.now())
        wp.write_entity_counts(prior_prob_input=PRIOR_PROB, count_output=ENTITY_COUNTS, to_print=False)
        print()
    # STEP 3 : create KB and write to file (run only once)
    if to_create_kb:
        print("STEP 3a: to_create_kb", datetime.datetime.now())
        kb_1 = kb_creator.create_kb(nlp_1,
                                    max_entities_per_alias=MAX_CANDIDATES,
                                    min_entity_freq=MIN_ENTITY_FREQ,
                                    min_occ=MIN_PAIR_OCC,
                                    entity_def_output=ENTITY_DEFS,
                                    entity_descr_output=ENTITY_DESCR,
                                    count_input=ENTITY_COUNTS,
                                    prior_prob_input=PRIOR_PROB,
                                    wikidata_input=WIKIDATA_JSON)
        print("kb entities:", kb_1.get_size_entities())
        print("kb aliases:", kb_1.get_size_aliases())
        print()
        print("STEP 3b: write KB and NLP", datetime.datetime.now())
        kb_1.dump(KB_FILE)
        nlp_1.to_disk(NLP_1_DIR)
        print()
    # STEP 4 : read KB back in from file
    if to_read_kb:
        print("STEP 4: to_read_kb", datetime.datetime.now())
        nlp_2 = spacy.load(NLP_1_DIR)
        kb_2 = KnowledgeBase(vocab=nlp_2.vocab, entity_vector_length=DESC_WIDTH)
        kb_2.load_bulk(KB_FILE)
        print("kb entities:", kb_2.get_size_entities())
        print("kb aliases:", kb_2.get_size_aliases())
        print()
        # test KB
        if to_test_kb:
            check_kb(kb_2)
            print()
    # STEP 5: create a training dataset from WP
    if create_wp_training:
        print("STEP 5: create training dataset", datetime.datetime.now())
        training_set_creator.create_training(wikipedia_input=ENWIKI_DUMP,
                                             entity_def_input=ENTITY_DEFS,
                                             training_output=TRAINING_DIR)
    # STEP 6: create and train the entity linking pipe
    if train_pipe:
        print("STEP 6: training Entity Linking pipe", datetime.datetime.now())
        type_to_int = {label: i for i, label in enumerate(nlp_2.entity.labels)}
        print(" -analysing", len(type_to_int), "different entity types")
        el_pipe = nlp_2.create_pipe(name='entity_linker',
                                    config={"context_width": CONTEXT_WIDTH,
                                            "pretrained_vectors": nlp_2.vocab.vectors.name,
                                            "type_to_int": type_to_int})
        el_pipe.set_kb(kb_2)
        nlp_2.add_pipe(el_pipe, last=True)
        other_pipes = [pipe for pipe in nlp_2.pipe_names if pipe != "entity_linker"]
        with nlp_2.disable_pipes(*other_pipes):  # only train Entity Linking
            optimizer = nlp_2.begin_training()
            optimizer.learn_rate = LEARN_RATE
            optimizer.L2 = L2
        # define the size (nr of entities) of training and dev set
        train_limit = 5000
        dev_limit = 5000
        train_data = training_set_creator.read_training(nlp=nlp_2,
                                                        training_dir=TRAINING_DIR,
                                                        dev=False,
                                                        limit=train_limit)
        print("Training on", len(train_data), "entities")
        print()
        dev_data = training_set_creator.read_training(nlp=nlp_2,
                                                      training_dir=TRAINING_DIR,
                                                      dev=True,
                                                      limit=dev_limit)
        print("Dev testing on", len(dev_data), "entities")
        print()
        if not train_data:
            print("Did not find any training data")
        else:
            for itn in range(EPOCHS):
                random.shuffle(train_data)
                losses = {}
                batches = minibatch(train_data, size=compounding(4.0, 128.0, 1.001))
                batchnr = 0
                with nlp_2.disable_pipes(*other_pipes):
                    for batch in batches:
                        try:
                            docs, golds = zip(*batch)
                            nlp_2.update(
                                docs,
                                golds,
                                sgd=optimizer,
                                drop=DROPOUT,
                                losses=losses,
                            )
                            batchnr += 1
                        except Exception as e:
                            print("Error updating batch:", e)
                if batchnr > 0:
                    el_pipe.cfg["context_weight"] = 1
                    el_pipe.cfg["prior_weight"] = 1
                    dev_acc_context, dev_acc_context_dict = _measure_accuracy(dev_data, el_pipe)
                    losses['entity_linker'] = losses['entity_linker'] / batchnr
                    print("Epoch, train loss", itn, round(losses['entity_linker'], 2),
                          " / dev acc avg", round(dev_acc_context, 3))
        # STEP 7: measure the performance of our trained pipe on an independent dev set
        if len(dev_data) and measure_performance:
            print()
            print("STEP 7: performance measurement of Entity Linking pipe", datetime.datetime.now())
            print()
            counts, acc_r, acc_r_label, acc_p, acc_p_label, acc_o, acc_o_label = _measure_baselines(dev_data, kb_2)
            print("dev counts:", sorted(counts.items(), key=lambda x: x[0]))
            print("dev acc oracle:", round(acc_o, 3), [(x, round(y, 3)) for x, y in acc_o_label.items()])
            print("dev acc random:", round(acc_r, 3), [(x, round(y, 3)) for x, y in acc_r_label.items()])
            print("dev acc prior:", round(acc_p, 3), [(x, round(y, 3)) for x, y in acc_p_label.items()])
            # using only context
            el_pipe.cfg["context_weight"] = 1
            el_pipe.cfg["prior_weight"] = 0
            dev_acc_context, dev_acc_context_dict = _measure_accuracy(dev_data, el_pipe)
            print("dev acc context avg:", round(dev_acc_context, 3),
                  [(x, round(y, 3)) for x, y in dev_acc_context_dict.items()])
            # measuring combined accuracy (prior + context)
            el_pipe.cfg["context_weight"] = 1
            el_pipe.cfg["prior_weight"] = 1
            dev_acc_combo, dev_acc_combo_dict = _measure_accuracy(dev_data, el_pipe, error_analysis=False)
            print("dev acc combo avg:", round(dev_acc_combo, 3),
                  [(x, round(y, 3)) for x, y in dev_acc_combo_dict.items()])
        # STEP 8: apply the EL pipe on a toy example
        if to_test_pipeline:
            print()
            print("STEP 8: applying Entity Linking to toy example", datetime.datetime.now())
            print()
            run_el_toy_example(nlp=nlp_2)
        # STEP 9: write the NLP pipeline (including entity linker) to file
        if to_write_nlp:
            print()
            print("STEP 9: testing NLP IO", datetime.datetime.now())
            print()
            print("writing to", NLP_2_DIR)
            nlp_2.to_disk(NLP_2_DIR)
            print()
    # verify that the IO has gone correctly
    if to_read_nlp:
        print("reading from", NLP_2_DIR)
        nlp_3 = spacy.load(NLP_2_DIR)
        print("running toy example with NLP 3")
        run_el_toy_example(nlp=nlp_3)
    # testing performance with an NLP model from file
    if test_from_file:
        nlp_2 = spacy.load(NLP_1_DIR)
        nlp_3 = spacy.load(NLP_2_DIR)
        el_pipe = nlp_3.get_pipe("entity_linker")
        dev_limit = 5000
        dev_data = training_set_creator.read_training(nlp=nlp_2,
                                                      training_dir=TRAINING_DIR,
                                                      dev=True,
                                                      limit=dev_limit)
        print("Dev testing from file on", len(dev_data), "entities")
        print()
        dev_acc_combo, dev_acc_combo_dict = _measure_accuracy(dev_data, el_pipe=el_pipe, error_analysis=False)
        print("dev acc combo avg:", round(dev_acc_combo, 3),
              [(x, round(y, 3)) for x, y in dev_acc_combo_dict.items()])
    print()
    print("STOP", datetime.datetime.now())
 def _measure_accuracy(data, el_pipe=None, error_analysis=False):
    # If the docs in the data require further processing with an entity linker, set el_pipe
    correct_by_label = dict()
    incorrect_by_label = dict()
    docs = [d for d, g in data if len(d) > 0]
    if el_pipe is not None:
        docs = list(el_pipe.pipe(docs))
    golds = [g for d, g in data if len(d) > 0]
    for doc, gold in zip(docs, golds):
        try:
            correct_entries_per_article = dict()
            for entity in gold.links:
                start, end, gold_kb = entity
                correct_entries_per_article[str(start) + "-" + str(end)] = gold_kb
            for ent in doc.ents:
                ent_label = ent.label_
                pred_entity = ent.kb_id_
                start = ent.start_char
                end = ent.end_char
                gold_entity = correct_entries_per_article.get(str(start) + "-" + str(end), None)
                # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
                if gold_entity is not None:
                    if gold_entity == pred_entity:
                        correct = correct_by_label.get(ent_label, 0)
                        correct_by_label[ent_label] = correct + 1
                    else:
                        incorrect = incorrect_by_label.get(ent_label, 0)
                        incorrect_by_label[ent_label] = incorrect + 1
                        if error_analysis:
                            print(ent.text, "in", doc)
                            print("Predicted", pred_entity, "should have been", gold_entity)
                            print()
        except Exception as e:
            print("Error assessing accuracy", e)
    acc, acc_by_label = calculate_acc(correct_by_label, incorrect_by_label)
    return acc, acc_by_label
 def _measure_baselines(data, kb):
    # Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound
    counts_by_label = dict()
    random_correct_by_label = dict()
    random_incorrect_by_label = dict()
    oracle_correct_by_label = dict()
    oracle_incorrect_by_label = dict()
    prior_correct_by_label = dict()
    prior_incorrect_by_label = dict()
    docs = [d for d, g in data if len(d) > 0]
    golds = [g for d, g in data if len(d) > 0]
    for doc, gold in zip(docs, golds):
        try:
            correct_entries_per_article = dict()
            for entity in gold.links:
                start, end, gold_kb = entity
                correct_entries_per_article[str(start) + "-" + str(end)] = gold_kb
            for ent in doc.ents:
                ent_label = ent.label_
                start = ent.start_char
                end = ent.end_char
                gold_entity = correct_entries_per_article.get(str(start) + "-" + str(end), None)
                # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
                if gold_entity is not None:
                    counts_by_label[ent_label] = counts_by_label.get(ent_label, 0) + 1
                    candidates = kb.get_candidates(ent.text)
                    oracle_candidate = ""
                    best_candidate = ""
                    random_candidate = ""
                    if candidates:
                        scores = []
                        for c in candidates:
                            scores.append(c.prior_prob)
                            if c.entity_ == gold_entity:
                                oracle_candidate = c.entity_
                        best_index = scores.index(max(scores))
                        best_candidate = candidates[best_index].entity_
                        random_candidate = random.choice(candidates).entity_
                    if gold_entity == best_candidate:
                        prior_correct_by_label[ent_label] = prior_correct_by_label.get(ent_label, 0) + 1
                    else:
                        prior_incorrect_by_label[ent_label] = prior_incorrect_by_label.get(ent_label, 0) + 1
                    if gold_entity == random_candidate:
                        random_correct_by_label[ent_label] = random_correct_by_label.get(ent_label, 0) + 1
                    else:
                        random_incorrect_by_label[ent_label] = random_incorrect_by_label.get(ent_label, 0) + 1
                    if gold_entity == oracle_candidate:
                        oracle_correct_by_label[ent_label] = oracle_correct_by_label.get(ent_label, 0) + 1
                    else:
                        oracle_incorrect_by_label[ent_label] = oracle_incorrect_by_label.get(ent_label, 0) + 1
        except Exception as e:
            print("Error assessing accuracy", e)
    acc_prior, acc_prior_by_label = calculate_acc(prior_correct_by_label, prior_incorrect_by_label)
    acc_rand, acc_rand_by_label = calculate_acc(random_correct_by_label, random_incorrect_by_label)
    acc_oracle, acc_oracle_by_label = calculate_acc(oracle_correct_by_label, oracle_incorrect_by_label)
    return counts_by_label, acc_rand, acc_rand_by_label, acc_prior, acc_prior_by_label, acc_oracle, acc_oracle_by_label
 def calculate_acc(correct_by_label, incorrect_by_label):
    acc_by_label = dict()
    total_correct = 0
    total_incorrect = 0
    all_keys = set()
    all_keys.update(correct_by_label.keys())
    all_keys.update(incorrect_by_label.keys())
    for label in sorted(all_keys):
        correct = correct_by_label.get(label, 0)
        incorrect = incorrect_by_label.get(label, 0)
        total_correct += correct
        total_incorrect += incorrect
        if correct == incorrect == 0:
            acc_by_label[label] = 0
        else:
            acc_by_label[label] = correct / (correct + incorrect)
    acc = 0
    if not (total_correct == total_incorrect == 0):
        acc = total_correct / (total_correct + total_incorrect)
    return acc, acc_by_label
 def check_kb(kb):
    for mention in ("Bush", "Douglas Adams", "Homer", "Brazil", "China"):
        candidates = kb.get_candidates(mention)
        print("generating candidates for " + mention + " :")
        for c in candidates:
            print(" ", c.prior_prob, c.alias_, "-->", c.entity_ + " (freq=" + str(c.entity_freq) + ")")
        print()
 def run_el_toy_example(nlp):
    text = "In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, " \
           "Douglas reminds us to always bring our towel, even in China or Brazil. " \
           "The main character in Doug's novel is the man Arthur Dent, " \
           "but Douglas doesn't write about George Washington or Homer Simpson."
    doc = nlp(text)
    print(text)
    for ent in doc.ents:
        print(" ent", ent.text, ent.label_, ent.kb_id_)
    print()
 if __name__ == "__main__":
    run_pipeline()
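As a rough illustration of how the per-label counters above turn into accuracies, here is a standalone sketch that mirrors `calculate_acc`. The label names and counts are invented for the example and are not results from the pipeline.

```python
def calculate_acc(correct_by_label, incorrect_by_label):
    """Return (overall accuracy, per-label accuracy dict)."""
    acc_by_label = {}
    total_correct = 0
    total_incorrect = 0
    # Consider every label that appears in either counter
    for label in sorted(set(correct_by_label) | set(incorrect_by_label)):
        correct = correct_by_label.get(label, 0)
        incorrect = incorrect_by_label.get(label, 0)
        total_correct += correct
        total_incorrect += incorrect
        total = correct + incorrect
        # Avoid dividing by zero for labels without any observations
        acc_by_label[label] = correct / total if total else 0
    grand_total = total_correct + total_incorrect
    acc = total_correct / grand_total if grand_total else 0
    return acc, acc_by_label


# Hypothetical counts, for illustration only
correct = {"PERSON": 3, "GPE": 1}
incorrect = {"PERSON": 1, "ORG": 2}
acc, acc_by_label = calculate_acc(correct, incorrect)
print(acc, acc_by_label)
```

The overall figure is a micro-average: labels with more mentions weigh more than rare ones, which is why the per-label dict is returned alongside it.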
--- a/pyproject.toml
+++ b/pyproject.toml
@ -5,6 +5,6 @@ requires = ["setuptools",
            "cymem>=2.0.2,<2.1.0",
            "preshed>=2.0.1,<2.1.0",
            "murmurhash>=0.28.0,<1.1.0",
-            "thinc==7.0.0.dev6",
+            "thinc>=7.0.8,<7.1.0",
            ]
 build-backend = "setuptools.build_meta"
--- a/requirements.txt
+++ b/requirements.txt
@ -1,7 +1,7 @@
 # Our libraries
 cymem>=2.0.2,<2.1.0
 preshed>=2.0.1,<2.1.0
-thinc>=7.0.2,<7.1.0
+thinc>=7.0.8,<7.1.0
 blis>=0.2.2,<0.3.0
 murmurhash>=0.28.0,<1.1.0
 wasabi>=0.2.0,<1.1.0
--- a/setup.py
+++ b/setup.py
@ -228,7 +228,7 @@ def setup_package():
                "murmurhash>=0.28.0,<1.1.0",
                "cymem>=2.0.2,<2.1.0",
                "preshed>=2.0.1,<2.1.0",
-                "thinc>=7.0.2,<7.1.0",
+                "thinc>=7.0.8,<7.1.0",
                "blis>=0.2.2,<0.3.0",
                "plac<1.0.0,>=0.9.6",
                "requests>=2.13.0,<3.0.0",
@ -246,6 +246,7 @@ def setup_package():
                "cuda100": ["thinc_gpu_ops>=0.0.1,<0.1.0", "cupy-cuda100>=5.0.0b4"],
                # Language tokenizers with external dependencies
                "ja": ["mecab-python3==0.7"],
                "ko": ["natto-py==0.9.0"],
            },
            python_requires=">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*",
            classifiers=[
--- a/spacy/_ml.py
+++ b/spacy/_ml.py
@ -24,7 +24,7 @@ from thinc.neural._classes.affine import _set_dimensions_if_needed
 import thinc.extra.load_nlp
 from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
-from .errors import Errors
+from .errors import Errors, user_warning, Warnings
 from . import util
 try:
@ -299,7 +299,17 @@ def link_vectors_to_models(vocab):
    data = ops.asarray(vectors.data)
    # Set an entry here, so that vectors are accessed by StaticVectors
    # (unideal, I know)
-    thinc.extra.load_nlp.VECTORS[(ops.device, vectors.name)] = data
+    key = (ops.device, vectors.name)
    if key in thinc.extra.load_nlp.VECTORS:
        if thinc.extra.load_nlp.VECTORS[key].shape != data.shape:
            # This is a hack to avoid the problem in #3853: rename the new
            # vectors and warn the user, so the loaded table isn't overwritten.
            old_name = vectors.name
            new_name = vectors.name + "_%d" % data.shape[0]
            user_warning(Warnings.W019.format(old=old_name, new=new_name))
            vectors.name = new_name
            key = (ops.device, vectors.name)
    thinc.extra.load_nlp.VECTORS[key] = data
 def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
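The renaming step in `link_vectors_to_models` can be sketched without thinc. In this minimal example the module-level dict `VECTORS` is a stand-in for `thinc.extra.load_nlp.VECTORS`, the helper `register_vectors` is hypothetical, and only array shapes are stored instead of the arrays themselves.

```python
# Stand-in for thinc.extra.load_nlp.VECTORS (assumption: a plain dict keyed
# by (device, name)); we store shapes rather than real arrays.
VECTORS = {}


def register_vectors(device, name, shape):
    """Register a vectors table and return the name it ended up under."""
    key = (device, name)
    if key in VECTORS and VECTORS[key] != shape:
        # Same name but different shape: derive a new name from the row
        # count so the previously registered table is left untouched.
        name = name + "_%d" % shape[0]
        key = (device, name)
    VECTORS[key] = shape
    return name


first = register_vectors("cpu", "en_vectors", (1000, 300))
second = register_vectors("cpu", "en_vectors", (500, 300))
third = register_vectors("cpu", "en_vectors", (1000, 300))
print(first, second, third)
```

Re-registering a table with an identical shape keeps the original name, so only a genuine clash triggers the rename (and, in the real function, the W019 warning).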
@ -652,6 +662,51 @@ def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False,
    return model
 def build_nel_encoder(embed_width, hidden_width, ner_types, **cfg):
    # TODO proper error
    if "entity_width" not in cfg:
        raise ValueError("entity_width not found")
    if "context_width" not in cfg:
        raise ValueError("context_width not found")
    conv_depth = cfg.get("conv_depth", 2)
    cnn_maxout_pieces = cfg.get("cnn_maxout_pieces", 3)
    pretrained_vectors = cfg.get("pretrained_vectors")  # self.nlp.vocab.vectors.name
    context_width = cfg.get("context_width")
    entity_width = cfg.get("entity_width")
    with Model.define_operators({">>": chain, "**": clone}):
        model = (
            Affine(entity_width, entity_width + context_width + 1 + ner_types)
            >> Affine(1, entity_width, drop_factor=0.0)
            >> logistic
        )
        # context encoder
        tok2vec = (
            Tok2Vec(
                width=hidden_width,
                embed_size=embed_width,
                pretrained_vectors=pretrained_vectors,
                cnn_maxout_pieces=cnn_maxout_pieces,
                subword_features=True,
                conv_depth=conv_depth,
                bilstm_depth=0,
            )
            >> flatten_add_lengths
            >> Pooling(mean_pool)
            >> Residual(zero_init(Maxout(hidden_width, hidden_width)))
            >> zero_init(Affine(context_width, hidden_width))
        )
        model.tok2vec = tok2vec
    model.tok2vec = tok2vec
    model.tok2vec.nO = context_width
    model.nO = 1
    return model
@layerize
 def flatten(seqs, drop=0.0):
    ops = Model.ops
--- a/spacy/about.py
+++ b/spacy/about.py
@ -4,13 +4,13 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "2.1.4"
+__version__ = "2.1.5"
 __summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
 __uri__ = "https://spacy.io"
 __author__ = "Explosion AI"
 __email__ = "contact@explosion.ai"
 __license__ = "MIT"
-__release__ = False
+__release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
--- a/spacy/attrs.pxd
+++ b/spacy/attrs.pxd
@ -82,6 +82,7 @@ cdef enum attr_id_t:
    DEP
    ENT_IOB
    ENT_TYPE
    ENT_KB_ID
    HEAD
    SENT_START
    SPACY
--- a/spacy/attrs.pyx
+++ b/spacy/attrs.pyx
@ -84,6 +84,7 @@ IDS = {
    "DEP": DEP,
    "ENT_IOB": ENT_IOB,
    "ENT_TYPE": ENT_TYPE,
    "ENT_KB_ID": ENT_KB_ID,
    "HEAD": HEAD,
    "SENT_START": SENT_START,
    "SPACY": SPACY,
--- a/spacy/cli/pretrain.py
+++ b/spacy/cli/pretrain.py
@ -5,6 +5,7 @@ import plac
 import random
 import numpy
 import time
 import re
 from collections import Counter
 from pathlib import Path
 from thinc.v2v import Affine, Maxout
@ -65,6 +66,13 @@ from .train import _load_pretrained_tok2vec
        "t2v",
        Path,
    ),
    epoch_start=(
        "The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been "
        "renamed. Prevents unintended overwriting of existing weight files.",
        "option",
        "es",
        int
    ),
 )
 def pretrain(
    texts_loc,
@ -83,6 +91,7 @@ def pretrain(
    seed=0,
    n_save_every=None,
    init_tok2vec=None,
    epoch_start=None,
 ):
    """
    Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
@ -151,9 +160,29 @@ def pretrain(
    if init_tok2vec is not None:
        components = _load_pretrained_tok2vec(nlp, init_tok2vec)
        msg.text("Loaded pretrained tok2vec for: {}".format(components))
        # Parse the epoch number from the given weight file
        model_name = re.search(r"model\d+\.bin", str(init_tok2vec))
        if model_name:
            # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
            epoch_start = int(model_name.group(0)[5:][:-4]) + 1
        else:
            if epoch_start is None:
                msg.fail(
                    "You have to use the '--epoch-start' argument when using a renamed weight file for "
                    "'--init-tok2vec'", exits=True
                )
            elif epoch_start < 0:
                msg.fail(
                    "The argument '--epoch-start' has to be greater than or equal to 0. '%d' is invalid" % epoch_start,
                    exits=True
                )
    else:
        # Without '--init-tok2vec' the '--epoch-start' argument is ignored
        epoch_start = 0
    optimizer = create_default_optimizer(model.ops)
    tracker = ProgressTracker(frequency=10000)
-    msg.divider("Pre-training tok2vec layer")
+    msg.divider("Pre-training tok2vec layer - starting at epoch %d" % epoch_start)
    row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")}
    msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings)
@ -174,7 +203,7 @@ def pretrain(
                file_.write(srsly.json_dumps(log) + "\n")
    skip_counter = 0
-    for epoch in range(n_iter):
+    for epoch in range(epoch_start, n_iter + epoch_start):
        for batch_id, batch in enumerate(
            util.minibatch_by_words(((text, None) for text in texts), size=batch_size)
        ):
@ -272,7 +301,7 @@ def get_vectors_loss(ops, docs, prediction, objective="L2"):
    elif objective == "cosine":
        loss, d_target = get_cossim_loss(prediction, target)
    else:
-        raise ValueError(Errors.E139.format(loss_func=objective))
+        raise ValueError(Errors.E142.format(loss_func=objective))
    return loss, d_target
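The way `pretrain` picks its starting epoch can be summarised in a small standalone function. This is only a sketch: `resolve_epoch_start` is a hypothetical helper, it raises `ValueError` where the CLI calls `msg.fail`, it uses a regex capture group instead of string slicing, and it treats an explicit `0` as a valid start.

```python
import re


def resolve_epoch_start(init_tok2vec, epoch_start=None):
    """Decide at which epoch number pre-training should start counting."""
    if init_tok2vec is None:
        # Without initial weights, any '--epoch-start' value is ignored
        return 0
    match = re.search(r"model(\d+)\.bin", str(init_tok2vec))
    if match:
        # Default weight file name: continue after the stored epoch
        return int(match.group(1)) + 1
    if epoch_start is None:
        raise ValueError("'--epoch-start' is required for a renamed weight file")
    if epoch_start < 0:
        raise ValueError("'--epoch-start' must be >= 0, got %d" % epoch_start)
    return epoch_start


print(resolve_epoch_start(None, 5))
print(resolve_epoch_start("output/model12.bin"))
print(resolve_epoch_start("renamed_weights.bin", 4))
```

The training loop then runs over `range(epoch_start, n_iter + epoch_start)`, so weight files saved by a resumed run continue the numbering rather than overwriting earlier ones.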
--- a/spacy/errors.py
+++ b/spacy/errors.py
@ -82,6 +82,8 @@ class Warnings(object):
            "parallel inference via multiprocessing.")
    W017 = ("Alias '{alias}' already exists in the Knowledge base.")
    W018 = ("Entity '{entity}' already exists in the Knowledge base.")
    W019 = ("Changing vectors name from {old} to {new}, to avoid clash with "
            "previously loaded vectors. See Issue #3853.")
@add_codes
@ -399,7 +401,11 @@ class Errors(object):
    E138 = ("Invalid JSONL format for raw text '{text}'. Make sure the input includes either the "
            "`text` or `tokens` key. For more info, see the docs:\n"
            "https://spacy.io/api/cli#pretrain-jsonl")
-    E139 = ("Unsupported loss_function '{loss_func}'. Use either 'L2' or 'cosine'")
+    E139 = ("Knowledge base for component '{name}' not initialized. Did you forget to call set_kb()?")
    E140 = ("The list of entities, prior probabilities and entity vectors should be of equal length.")
    E141 = ("Entity vectors should be of length {required} instead of the provided {found}.")
    E142 = ("Unsupported loss_function '{loss_func}'. Use either 'L2' or 'cosine'")
    E143 = ("Labels for component '{name}' not initialized. Did you forget to call add_label()?")
@add_codes
--- a/spacy/gold.pxd
+++ b/spacy/gold.pxd
@ -31,6 +31,7 @@ cdef class GoldParse:
    cdef public list ents
    cdef public dict brackets
    cdef public object cats
    cdef public list links
    cdef readonly list cand_to_gold
    cdef readonly list gold_to_cand
--- a/spacy/gold.pyx
+++ b/spacy/gold.pyx
@ -427,7 +427,7 @@ cdef class GoldParse:
    def __init__(self, doc, annot_tuples=None, words=None, tags=None,
                 heads=None, deps=None, entities=None, make_projective=False,
-                 cats=None, **_):
+                 cats=None, links=None, **_):
        """Create a GoldParse.
        doc (Doc): The document the annotations refer to.
@ -450,6 +450,8 @@ cdef class GoldParse:
            examples of a label to have the value 0.0. Labels not in the
            dictionary are treated as missing - the gradient for those labels
            will be zero.
        links (iterable): A sequence of `(start_char, end_char, kb_id)` tuples,
            representing the external ID of an entity in a knowledge base.
        RETURNS (GoldParse): The newly constructed object.
        """
        if words is None:
@ -485,6 +487,7 @@ cdef class GoldParse:
        self.c.ner = <Transition*>self.mem.alloc(len(doc), sizeof(Transition))
        self.cats = {} if cats is None else dict(cats)
        self.links = links
        self.words = [None] * len(doc)
        self.tags = [None] * len(doc)
        self.heads = [None] * len(doc)
--- a/spacy/kb.pxd
+++ b/spacy/kb.pxd
@ -1,53 +1,27 @@
 """Knowledge-base for entity or concept linking."""
 from cymem.cymem cimport Pool
 from preshed.maps cimport PreshMap
 from libcpp.vector cimport vector
 from libc.stdint cimport int32_t, int64_t
 from libc.stdio cimport FILE
 from spacy.vocab cimport Vocab
 from .typedefs cimport hash_t
-
+from .structs cimport KBEntryC, AliasC
-# Internal struct, for storage and disambiguation. This isn't what we return
+ctypedef vector[KBEntryC] entry_vec
-# to the user as the answer to "here's your entity". It's the minimum number
+ctypedef vector[AliasC] alias_vec
-# of bits we need to keep track of the answers.
+ctypedef vector[float] float_vec
-cdef struct _EntryC:
+ctypedef vector[float_vec] float_matrix
    # The hash of this entry's unique ID and name in the kB
    hash_t entity_hash
    # Allows retrieval of one or more vectors.
    # Each element of vector_rows should be an index into a vectors table.
    # Every entry should have the same number of vectors, so we can avoid storing
    # the number of vectors in each knowledge-base struct
    int32_t* vector_rows
    # Allows retrieval of a struct of non-vector features. We could make this a
    # pointer, but we have 32 bits left over in the struct after prob, so we'd
    # like this to only be 32 bits. We can also set this to -1, for the common
    # case where there are no features.
    int32_t feats_row
    # log probability of entity, based on corpus frequency
    float prob
 # Each alias struct stores a list of Entry pointers with their prior probabilities
 # for this specific mention/alias.
 cdef struct _AliasC:
    # All entry candidates for this alias
    vector[int64_t] entry_indices
    # Prior probability P(entity|alias) - should sum up to (at most) 1.
    vector[float] probs
 # Object used by the Entity Linker that summarizes one entity-alias candidate combination.
 cdef class Candidate:
    cdef readonly KnowledgeBase kb
    cdef hash_t entity_hash
    cdef float entity_freq
    cdef vector[float] entity_vector
    cdef hash_t alias_hash
    cdef float prior_prob
@ -55,9 +29,10 @@ cdef class Candidate:
 cdef class KnowledgeBase:
    cdef Pool mem
    cpdef readonly Vocab vocab
    cdef int64_t entity_vector_length
    # This maps 64bit keys (hash of unique entity string)
-    # to 64bit values (position of the _EntryC struct in the _entries vector).
+    # to 64bit values (position of the KBEntryC struct in the _entries vector).
    # The PreshMap is pretty space efficient, as it uses open addressing. So
    # the only overhead is the vacancy rate, which is approximately 30%.
    cdef PreshMap _entry_index
@ -66,7 +41,7 @@ cdef class KnowledgeBase:
    # over allocation.
    # In total we end up with (N*128*1.3)+(N*128*1.3) bits for N entries.
    # Storing 1m entries would take 41.6mb under this scheme.
-    cdef vector[_EntryC] _entries
+    cdef entry_vec _entries
    # This maps 64bit keys (hash of unique alias string)
    # to 64bit values (position of the AliasC struct in the _aliases_table vector).
@ -76,7 +51,7 @@ cdef class KnowledgeBase:
    # should be P(entity | mention), which is pretty important to know.
    # We can pack both pieces of information into a 64-bit value, to keep things
    # efficient.
-    cdef vector[_AliasC] _aliases_table
+    cdef alias_vec _aliases_table
    # This is the part which might take more space: storing various
    # categorical features for the entries, and storing vectors for disambiguation
@ -87,7 +62,7 @@ cdef class KnowledgeBase:
    # model, that embeds different features of the entities into vectors. We'll
    # still want some per-entity features, like the Wikipedia text or entity
    # co-occurrence. Hopefully those vectors can be narrow, e.g. 64 dimensions.
-    cdef object _vectors_table
+    cdef float_matrix _vectors_table
    # It's very useful to track categorical features, at least for output, even
    # if they're not useful in the model itself. For instance, we should be
@ -96,53 +71,102 @@ cdef class KnowledgeBase:
    # optional data, we can let users configure a DB as the backend for this.
    cdef object _features_table
    cdef inline int64_t c_add_vector(self, vector[float] entity_vector) nogil:
        """Add an entity vector to the vectors table."""
        cdef int64_t new_index = self._vectors_table.size()
        self._vectors_table.push_back(entity_vector)
        return new_index
    cdef inline int64_t c_add_entity(self, hash_t entity_hash, float prob,
-                                     int32_t* vector_rows, int feats_row):
+                                     int32_t vector_index, int feats_row) nogil:
-        """Add an entry to the knowledge base."""
+        """Add an entry to the vector of entries.
-        # This is what we'll map the hash key to. It's where the entry will sit
+        After calling this method, make sure to also update the _entry_index using the return value."""
        # This is what we'll map the entity hash key to. It's where the entry will sit
        # in the vector of entries, so we can get it later.
        cdef int64_t new_index = self._entries.size()
-        self._entries.push_back(
+
-            _EntryC(
+        # Avoid struct initializer to enable nogil, cf https://github.com/cython/cython/issues/1642
-                entity_hash=entity_hash,
+        cdef KBEntryC entry
-                vector_rows=vector_rows,
+        entry.entity_hash = entity_hash
-                feats_row=feats_row,
+        entry.vector_index = vector_index
-                prob=prob
+        entry.feats_row = feats_row
-            ))
+        entry.prob = prob
-        self._entry_index[entity_hash] = new_index
+
        self._entries.push_back(entry)
        return new_index
-    cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs):
+    cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs) nogil:
-        """Connect a mention to a list of potential entities with their prior probabilities ."""
+        """Connect a mention to a list of potential entities with their prior probabilities.
        After calling this method, make sure to also update the _alias_index using the return value."""
        # This is what we'll map the alias hash key to. It's where the alias will be defined
        # in the vector of aliases.
        cdef int64_t new_index = self._aliases_table.size()
-        self._aliases_table.push_back(
+        # Avoid struct initializer to enable nogil
-            _AliasC(
+        cdef AliasC alias
-                entry_indices=entry_indices,
+        alias.entry_indices = entry_indices
-                probs=probs
+        alias.probs = probs
-            ))
+
-        self._alias_index[alias_hash] = new_index
+        self._aliases_table.push_back(alias)
        return new_index
-    cdef inline _create_empty_vectors(self):
+    cdef inline void _create_empty_vectors(self, hash_t dummy_hash) nogil:
        """ 
-        Making sure the first element of each vector is a dummy,
+        Initializing the vectors and making sure the first element of each vector is a dummy,
        because the PreshMap maps pointing to indices in these vectors cannot contain 0 as a value
        cf. https://github.com/explosion/preshed/issues/17
        """
        cdef int32_t dummy_value = 0
-        self.vocab.strings.add("")
+
-        self._entries.push_back(
+        # Avoid struct initializer to enable nogil
-            _EntryC(
+        cdef KBEntryC entry
-                entity_hash=self.vocab.strings[""],
+        entry.entity_hash = dummy_hash
-                vector_rows=&dummy_value,
+        entry.vector_index = dummy_value
-                feats_row=dummy_value,
+        entry.feats_row = dummy_value
-                prob=dummy_value
+        entry.prob = dummy_value
-            ))
+
-        self._aliases_table.push_back(
+        # Avoid struct initializer to enable nogil
-            _AliasC(
+        cdef vector[int64_t] dummy_entry_indices
-                entry_indices=[dummy_value],
+        dummy_entry_indices.push_back(0)
-                probs=[dummy_value]
+        cdef vector[float] dummy_probs
-            ))
+        dummy_probs.push_back(0)
        cdef AliasC alias
        alias.entry_indices = dummy_entry_indices
        alias.probs = dummy_probs
        self._entries.push_back(entry)
        self._aliases_table.push_back(alias)
    cpdef load_bulk(self, loc)
    cpdef set_entities(self, entity_list, prob_list, vector_list)
 cdef class Writer:
    cdef FILE* _fp
    cdef int write_header(self, int64_t nr_entries, int64_t entity_vector_length) except -1
    cdef int write_vector_element(self, float element) except -1
    cdef int write_entry(self, hash_t entry_hash, float entry_prob, int32_t vector_index) except -1
    cdef int write_alias_length(self, int64_t alias_length) except -1
    cdef int write_alias_header(self, hash_t alias_hash, int64_t candidate_length) except -1
    cdef int write_alias(self, int64_t entry_index, float prob) except -1
    cdef int _write(self, void* value, size_t size) except -1
 cdef class Reader:
    cdef FILE* _fp
    cdef int read_header(self, int64_t* nr_entries, int64_t* entity_vector_length) except -1
    cdef int read_vector_element(self, float* element) except -1
    cdef int read_entry(self, hash_t* entity_hash, float* prob, int32_t* vector_index) except -1
    cdef int read_alias_length(self, int64_t* alias_length) except -1
    cdef int read_alias_header(self, hash_t* alias_hash, int64_t* candidate_length) except -1
    cdef int read_alias(self, int64_t* entry_index, float* prob) except -1
    cdef int _read(self, void* value, size_t size) except -1
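The dummy element at position 0 is easier to picture in plain Python. The sketch below assumes a lookup that yields 0 for unknown keys, which is why 0 cannot double as a real position; `TinyIndex` is a hypothetical class, with a dict standing in for the `PreshMap` and a list for the C++ vector.

```python
class TinyIndex:
    """Toy key -> position index in which position 0 is reserved."""

    def __init__(self):
        self._entries = [None]  # slot 0 is a dummy, never a real entry
        self._index = {}        # key hash -> position in self._entries

    def add(self, key, entry):
        # The new entry sits at the current end of the vector
        new_index = len(self._entries)
        self._entries.append(entry)
        # The caller-visible index is updated with the returned position
        self._index[key] = new_index
        return new_index

    def get(self, key):
        # Unknown keys resolve to position 0, i.e. the dummy
        return self._entries[self._index.get(key, 0)]

    def __len__(self):
        # Count real entries via the index, so the dummy is not included
        return len(self._index)


kb = TinyIndex()
position = kb.add("Q42", {"prob": 0.9})
print(position, len(kb), kb.get("Q42"), kb.get("unknown"))
```

This also shows why the size is better derived from the index than from the vector length minus one: the index only ever holds real entries.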
--- a/spacy/kb.pyx
+++ b/spacy/kb.pyx
@ -1,13 +1,30 @@
 # cython: infer_types=True
 # cython: profile=True
 # coding: utf8
 from spacy.errors import Errors, Warnings, user_warning
 from pathlib import Path
 from cymem.cymem cimport Pool
 from preshed.maps cimport PreshMap
 from cpython.exc cimport PyErr_SetFromErrno
 from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek
 from libc.stdint cimport int32_t, int64_t
 from .typedefs cimport hash_t
 from os import path
 from libcpp.vector cimport vector
 cdef class Candidate:
-    def __init__(self, KnowledgeBase kb, entity_hash, alias_hash, prior_prob):
+    def __init__(self, KnowledgeBase kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob):
        self.kb = kb
        self.entity_hash = entity_hash
        self.entity_freq = entity_freq
        self.entity_vector = entity_vector
        self.alias_hash = alias_hash
        self.prior_prob = prior_prob
@ -19,7 +36,7 @@ cdef class Candidate:
    @property
    def entity_(self):
        """RETURNS (unicode): ID/name of this entity in the KB"""
-        return self.kb.vocab.strings[self.entity]
+        return self.kb.vocab.strings[self.entity_hash]
    @property
    def alias(self):
@ -29,7 +46,15 @@ cdef class Candidate:
    @property
    def alias_(self):
        """RETURNS (unicode): ID of the original alias"""
-        return self.kb.vocab.strings[self.alias]
+        return self.kb.vocab.strings[self.alias_hash]
    @property
    def entity_freq(self):
        return self.entity_freq
    @property
    def entity_vector(self):
        return self.entity_vector
    @property
    def prior_prob(self):
@ -38,26 +63,41 @@ cdef class Candidate:
 cdef class KnowledgeBase:
-    def __init__(self, Vocab vocab):
+    def __init__(self, Vocab vocab, entity_vector_length):
        self.vocab = vocab
        self.mem = Pool()
        self.entity_vector_length = entity_vector_length
        self._entry_index = PreshMap()
        self._alias_index = PreshMap()
-        self.mem = Pool()
+
-        self._create_empty_vectors()
+        self.vocab.strings.add("")
        self._create_empty_vectors(dummy_hash=self.vocab.strings[""])
    @property
    def entity_vector_length(self):
        """RETURNS (int64): length of the entity vectors"""
        return self.entity_vector_length
    def __len__(self):
        return self.get_size_entities()
    def get_size_entities(self):
-        return self._entries.size() - 1  # not counting dummy element on index 0
+        return len(self._entry_index)
    def get_entity_strings(self):
        return [self.vocab.strings[x] for x in self._entry_index]
    def get_size_aliases(self):
-        return self._aliases_table.size() - 1 # not counting dummy element on index 0
+        return len(self._alias_index)
-    def add_entity(self, unicode entity, float prob=0.5, vectors=None, features=None):
+    def get_alias_strings(self):
        return [self.vocab.strings[x] for x in self._alias_index]
    def add_entity(self, unicode entity, float prob, vector[float] entity_vector):
        """
-        Add an entity to the KB.
+        Add an entity to the KB, specifying its log probability based on corpus frequency.
-        Return the hash of the entity ID at the end
+        Return the hash of the entity ID/name at the end.
        """
        cdef hash_t entity_hash = self.vocab.strings.add(entity)
@ -66,40 +106,72 @@ cdef class KnowledgeBase:
            user_warning(Warnings.W018.format(entity=entity))
            return
-        cdef int32_t dummy_value = 342
+        # Raise an error if the provided entity vector is not of the correct length
-        self.c_add_entity(entity_hash=entity_hash, prob=prob,
+        if len(entity_vector) != self.entity_vector_length:
-                          vector_rows=&dummy_value, feats_row=dummy_value)
+            raise ValueError(Errors.E141.format(found=len(entity_vector), required=self.entity_vector_length))
-        # TODO self._vectors_table.get_pointer(vectors),
+
-        # self._features_table.get(features))
+        vector_index = self.c_add_vector(entity_vector=entity_vector)
        new_index = self.c_add_entity(entity_hash=entity_hash,
                                      prob=prob,
                                      vector_index=vector_index,
                                      feats_row=-1)  # Features table currently not implemented
        self._entry_index[entity_hash] = new_index
        return entity_hash
    cpdef set_entities(self, entity_list, prob_list, vector_list):
        if len(entity_list) != len(prob_list) or len(entity_list) != len(vector_list):
            raise ValueError(Errors.E140)
        nr_entities = len(entity_list)
        self._entry_index = PreshMap(nr_entities+1)
        self._entries = entry_vec(nr_entities+1)
        i = 0
        cdef KBEntryC entry
        while i < nr_entities:
            entity_vector = vector_list[i]
            if len(entity_vector) != self.entity_vector_length:
                raise ValueError(Errors.E141.format(found=len(entity_vector), required=self.entity_vector_length))
            entity_hash = self.vocab.strings.add(entity_list[i])
            entry.entity_hash = entity_hash
            entry.prob = prob_list[i]
            vector_index = self.c_add_vector(entity_vector=vector_list[i])
            entry.vector_index = vector_index
            entry.feats_row = -1   # Features table currently not implemented
            self._entries[i+1] = entry
            self._entry_index[entity_hash] = i+1
            i += 1
    def add_alias(self, unicode alias, entities, probabilities):
        """
        For a given alias, add its potential entities and prior probabilities to the KB.
        Return the alias_hash at the end
        """
        # Throw an error if the length of entities and probabilities are not the same
        if not len(entities) == len(probabilities):
            raise ValueError(Errors.E132.format(alias=alias,
                                                entities_length=len(entities),
                                                probabilities_length=len(probabilities)))
-        # Throw an error if the probabilities sum up to more than 1
+        # Throw an error if the probabilities sum up to more than 1 (allow for some rounding errors)
        prob_sum = sum(probabilities)
-        if prob_sum > 1:
+        if prob_sum > 1.00001:
            raise ValueError(Errors.E133.format(alias=alias, sum=prob_sum))
        cdef hash_t alias_hash = self.vocab.strings.add(alias)
-        # Return if this alias was added before
+        # Check whether this alias was added before
        if alias_hash in self._alias_index:
            user_warning(Warnings.W017.format(alias=alias))
            return
        cdef hash_t entity_hash
        cdef vector[int64_t] entry_indices
        cdef vector[float] probs
@ -112,20 +184,295 @@ cdef class KnowledgeBase:
            entry_indices.push_back(int(entry_index))
            probs.push_back(float(prob))
-        self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs)
+        new_index = self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs)
        self._alias_index[alias_hash] = new_index
        return alias_hash
    def get_candidates(self, unicode alias):
        """ TODO: where to put this functionality ?"""
        cdef hash_t alias_hash = self.vocab.strings[alias]
        alias_index = <int64_t>self._alias_index.get(alias_hash)
        alias_entry = self._aliases_table[alias_index]
        return [Candidate(kb=self,
                          entity_hash=self._entries[entry_index].entity_hash,
                          entity_freq=self._entries[entry_index].prob,
                          entity_vector=self._vectors_table[self._entries[entry_index].vector_index],
                          alias_hash=alias_hash,
                          prior_prob=prob)
                for (entry_index, prob) in zip(alias_entry.entry_indices, alias_entry.probs)
                if entry_index != 0]
    def dump(self, loc):
        cdef Writer writer = Writer(loc)
        writer.write_header(self.get_size_entities(), self.entity_vector_length)
        # dumping the entity vectors in their original order
        i = 0
        for entity_vector in self._vectors_table:
            for element in entity_vector:
                writer.write_vector_element(element)
            i = i+1
        # dumping the entry records in the order in which they are in the _entries vector.
        # index 0 is a dummy object not stored in the _entry_index and can be ignored.
        i = 1
        for entry_hash, entry_index in sorted(self._entry_index.items(), key=lambda x: x[1]):
            entry = self._entries[entry_index]
            assert entry.entity_hash == entry_hash
            assert entry_index == i
            writer.write_entry(entry.entity_hash, entry.prob, entry.vector_index)
            i = i+1
        writer.write_alias_length(self.get_size_aliases())
        # dumping the aliases in the order in which they are in the _aliases_table vector.
        # index 0 is a dummy object not stored in the _alias_index and can be ignored.
        i = 1
        for alias_hash, alias_index in sorted(self._alias_index.items(), key=lambda x: x[1]):
            alias = self._aliases_table[alias_index]
            assert alias_index == i
            candidate_length = len(alias.entry_indices)
            writer.write_alias_header(alias_hash, candidate_length)
            for j in range(0, candidate_length):
                writer.write_alias(alias.entry_indices[j], alias.probs[j])
            i = i+1
        writer.close()
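The order in which `dump` writes its records can be mimicked with Python's `struct` module. This is a sketch under assumptions, not the actual `Writer`: field widths follow the `Writer` signatures in `kb.pxd` (64-bit hashes and counts, 32-bit floats and vector indices), the little-endian unpadded layout is chosen only for the example, and `dump_kb` plus the sample values are hypothetical.

```python
import io
import struct


def dump_kb(vectors, entries, aliases):
    """Serialise toy KB data in the same record order as KnowledgeBase.dump."""
    buf = io.BytesIO()
    vector_length = len(vectors[0]) if vectors else 0
    # Header: number of entries, entity vector length
    buf.write(struct.pack("<qq", len(entries), vector_length))
    # Entity vectors, element by element, in their original order
    for vec in vectors:
        buf.write(struct.pack("<%df" % len(vec), *vec))
    # Entry records: entity hash, probability, index into the vectors table
    for entity_hash, prob, vector_index in entries:
        buf.write(struct.pack("<Qfi", entity_hash, prob, vector_index))
    # Number of aliases, then one header plus candidate list per alias
    buf.write(struct.pack("<q", len(aliases)))
    for alias_hash, candidates in aliases:
        buf.write(struct.pack("<Qq", alias_hash, len(candidates)))
        for entry_index, prob in candidates:
            buf.write(struct.pack("<qf", entry_index, prob))
    return buf.getvalue()


# Made-up sample data
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
entries = [(111, 0.9, 0), (222, 0.1, 1)]
aliases = [(333, [(1, 0.8), (2, 0.2)])]
data = dump_kb(vectors, entries, aliases)
print(len(data), struct.unpack_from("<qq", data, 0))
```

Because the header carries both counts up front, a reader can pre-allocate its tables before reading the vectors and entries, which is what `load_bulk` does.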
    cpdef load_bulk(self, loc):
        cdef hash_t entity_hash
        cdef hash_t alias_hash
        cdef int64_t entry_index
        cdef float prob
        cdef int32_t vector_index
        cdef KBEntryC entry
        cdef AliasC alias
        cdef float vector_element
        cdef Reader reader = Reader(loc)
        # STEP 0: load header and initialize KB
        cdef int64_t nr_entities
        cdef int64_t entity_vector_length
        reader.read_header(&nr_entities, &entity_vector_length)
        self.entity_vector_length = entity_vector_length
        self._entry_index = PreshMap(nr_entities+1)
        self._entries = entry_vec(nr_entities+1)
        self._vectors_table = float_matrix(nr_entities+1)
        # STEP 1: load entity vectors
        cdef int i = 0
        cdef int j = 0
        while i < nr_entities:
            entity_vector = float_vec(entity_vector_length)
            j = 0
            while j < entity_vector_length:
                reader.read_vector_element(&vector_element)
                entity_vector[j] = vector_element
                j = j+1
            self._vectors_table[i] = entity_vector
            i = i+1
        # STEP 2: load entities
        # we assume that the entity data was written in sequence
        # index 0 is a dummy object not stored in the _entry_index and can be ignored.
        i = 1
        while i <= nr_entities:
            reader.read_entry(&entity_hash, &prob, &vector_index)
            entry.entity_hash = entity_hash
            entry.prob = prob
            entry.vector_index = vector_index
            entry.feats_row = -1    # Features table currently not implemented
            self._entries[i] = entry
            self._entry_index[entity_hash] = i
            i += 1
        # check that all entities were read in properly
        assert nr_entities == self.get_size_entities()
        # STEP 3: load aliases
        cdef int64_t nr_aliases
        reader.read_alias_length(&nr_aliases)
        self._alias_index = PreshMap(nr_aliases+1)
        self._aliases_table = alias_vec(nr_aliases+1)
        cdef int64_t nr_candidates
        cdef vector[int64_t] entry_indices
        cdef vector[float] probs
        i = 1
        # we assume the alias data was written in sequence
        # index 0 is a dummy object not stored in the _alias_index and can be ignored.
        while i <= nr_aliases:
            reader.read_alias_header(&alias_hash, &nr_candidates)
            entry_indices = vector[int64_t](nr_candidates)
            probs = vector[float](nr_candidates)
            for j in range(0, nr_candidates):
                reader.read_alias(&entry_index, &prob)
                entry_indices[j] = entry_index
                probs[j] = prob
            alias.entry_indices = entry_indices
            alias.probs = probs
            self._aliases_table[i] = alias
            self._alias_index[alias_hash] = i
            i += 1
        # check that all aliases were read in properly
        assert nr_aliases == self.get_size_aliases()
 cdef class Writer:
    def __init__(self, object loc):
        if path.exists(loc):
            assert not path.isdir(loc), "%s is directory." % loc
        if isinstance(loc, Path):
            loc = bytes(loc)
        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
        self._fp = fopen(<char*>bytes_loc, 'wb')
        assert self._fp != NULL
        fseek(self._fp, 0, 0)
    def close(self):
        cdef size_t status = fclose(self._fp)
        assert status == 0
    cdef int write_header(self, int64_t nr_entries, int64_t entity_vector_length) except -1:
        self._write(&nr_entries, sizeof(nr_entries))
        self._write(&entity_vector_length, sizeof(entity_vector_length))
    cdef int write_vector_element(self, float element) except -1:
        self._write(&element, sizeof(element))
    cdef int write_entry(self, hash_t entry_hash, float entry_prob, int32_t vector_index) except -1:
        self._write(&entry_hash, sizeof(entry_hash))
        self._write(&entry_prob, sizeof(entry_prob))
        self._write(&vector_index, sizeof(vector_index))
        # Features table currently not implemented and not written to file
    cdef int write_alias_length(self, int64_t alias_length) except -1:
        self._write(&alias_length, sizeof(alias_length))
    cdef int write_alias_header(self, hash_t alias_hash, int64_t candidate_length) except -1:
        self._write(&alias_hash, sizeof(alias_hash))
        self._write(&candidate_length, sizeof(candidate_length))
    cdef int write_alias(self, int64_t entry_index, float prob) except -1:
        self._write(&entry_index, sizeof(entry_index))
        self._write(&prob, sizeof(prob))
    cdef int _write(self, void* value, size_t size) except -1:
        status = fwrite(value, size, 1, self._fp)
        assert status == 1, status
 cdef class Reader:
    def __init__(self, object loc):
        assert path.exists(loc)
        assert not path.isdir(loc)
        if isinstance(loc, Path):
            loc = bytes(loc)
        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
        self._fp = fopen(<char*>bytes_loc, 'rb')
        if not self._fp:
            PyErr_SetFromErrno(IOError)
        status = fseek(self._fp, 0, 0)  # this can be 0 if there is no header
    def __dealloc__(self):
        fclose(self._fp)
    cdef int read_header(self, int64_t* nr_entries, int64_t* entity_vector_length) except -1:
        status = self._read(nr_entries, sizeof(int64_t))
        if status < 1:
            if feof(self._fp):
                return 0  # end of file
            raise IOError("error reading header from input file")
        status = self._read(entity_vector_length, sizeof(int64_t))
        if status < 1:
            if feof(self._fp):
                return 0  # end of file
            raise IOError("error reading header from input file")
    cdef int read_vector_element(self, float* element) except -1:
        status = self._read(element, sizeof(float))
        if status < 1:
            if feof(self._fp):
                return 0  # end of file
            raise IOError("error reading entity vector from input file")
    cdef int read_entry(self, hash_t* entity_hash, float* prob, int32_t* vector_index) except -1:
        status = self._read(entity_hash, sizeof(hash_t))
        if status < 1:
            if feof(self._fp):
                return 0  # end of file
            raise IOError("error reading entity hash from input file")
        status = self._read(prob, sizeof(float))
        if status < 1:
            if feof(self._fp):
                return 0  # end of file
            raise IOError("error reading entity prob from input file")
        status = self._read(vector_index, sizeof(int32_t))
        if status < 1:
            if feof(self._fp):
                return 0  # end of file
            raise IOError("error reading entity vector index from input file")
        if feof(self._fp):
            return 0
        else:
            return 1
    cdef int read_alias_length(self, int64_t* alias_length) except -1:
        status = self._read(alias_length, sizeof(int64_t))
        if status < 1:
            if feof(self._fp):
                return 0  # end of file
            raise IOError("error reading alias length from input file")
    cdef int read_alias_header(self, hash_t* alias_hash, int64_t* candidate_length) except -1:
        status = self._read(alias_hash, sizeof(hash_t))
        if status < 1:
            if feof(self._fp):
                return 0  # end of file
            raise IOError("error reading alias hash from input file")
        status = self._read(candidate_length, sizeof(int64_t))
        if status < 1:
            if feof(self._fp):
                return 0  # end of file
            raise IOError("error reading candidate length from input file")
    cdef int read_alias(self, int64_t* entry_index, float* prob) except -1:
        status = self._read(entry_index, sizeof(int64_t))
        if status < 1:
            if feof(self._fp):
                return 0  # end of file
            raise IOError("error reading entry index for alias from input file")
        status = self._read(prob, sizeof(float))
        if status < 1:
            if feof(self._fp):
                return 0  # end of file
            raise IOError("error reading prob for entity/alias from input file")
    cdef int _read(self, void* value, size_t size) except -1:
        status = fread(value, size, 1, self._fp)
        return status
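
The `Writer` and `Reader` classes above imply a simple flat binary layout: a header, the entity vectors, the entry records, an alias count, and then one record per alias followed by its candidates. The following is a minimal pure-Python sketch of that layout, not the library's implementation. It assumes `hash_t` is an unsigned 64-bit integer and that fields are written back to back without padding; `write_kb` and `read_kb` are hypothetical helper names used only for illustration.

```python
import io
import struct

# "=" means native byte order, standard sizes, no padding. This mirrors
# field-by-field fwrite calls (assumption: hash_t is an unsigned 64-bit int).
HEADER = struct.Struct("=qq")        # nr_entities, entity_vector_length
ENTRY = struct.Struct("=Qfi")        # entity_hash, prob, vector_index
COUNT = struct.Struct("=q")          # nr_aliases
ALIAS_HEADER = struct.Struct("=Qq")  # alias_hash, candidate_length
CANDIDATE = struct.Struct("=qf")     # entry_index (1-based), prob


def write_kb(fp, vectors, entries, aliases):
    """Write vectors, entries and aliases in the order used by dump()."""
    vector_length = len(vectors[0]) if vectors else 0
    fp.write(HEADER.pack(len(entries), vector_length))
    for vector in vectors:
        fp.write(struct.pack("=%df" % vector_length, *vector))
    for entity_hash, prob, vector_index in entries:
        fp.write(ENTRY.pack(entity_hash, prob, vector_index))
    fp.write(COUNT.pack(len(aliases)))
    for alias_hash, candidates in aliases:
        fp.write(ALIAS_HEADER.pack(alias_hash, len(candidates)))
        for entry_index, prob in candidates:
            fp.write(CANDIDATE.pack(entry_index, prob))


def _unpack(fp, fmt):
    """Read exactly one record, raising on a truncated file."""
    data = fp.read(fmt.size)
    if len(data) != fmt.size:
        raise IOError("unexpected end of file")
    return fmt.unpack(data)


def read_kb(fp):
    """Read the sections back in the order used by load_bulk()."""
    nr_entities, vector_length = _unpack(fp, HEADER)
    vec_fmt = struct.Struct("=%df" % vector_length)
    vectors = [list(_unpack(fp, vec_fmt)) for _ in range(nr_entities)]
    entries = [_unpack(fp, ENTRY) for _ in range(nr_entities)]
    (nr_aliases,) = _unpack(fp, COUNT)
    aliases = []
    for _ in range(nr_aliases):
        alias_hash, nr_candidates = _unpack(fp, ALIAS_HEADER)
        candidates = [_unpack(fp, CANDIDATE) for _ in range(nr_candidates)]
        aliases.append((alias_hash, candidates))
    return vectors, entries, aliases
```

Because the format stores no field names or version marker, a reader only works if it consumes the sections in exactly the order the writer produced them, which is why `load_bulk` asserts the entity and alias counts after each step.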
--- a/spacy/lang/char_classes.py
+++ b/spacy/lang/char_classes.py
@ -9,6 +9,8 @@ _bengali = r"\u0980-\u09FF"
 _hebrew = r"\u0591-\u05F4\uFB1D-\uFB4F"
 _hindi = r"\u0900-\u097F"
 # Latin standard
 _latin_u_standard = r"A-Z"
 _latin_l_standard = r"a-z"
@ -193,7 +195,7 @@ _ukrainian = r"а-щюяіїєґА-ЩЮЯІЇЄҐ"
 _upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper
 _lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower
-_uncased = _bengali + _hebrew + _persian + _sinhala
+_uncased = _bengali + _hebrew + _persian + _sinhala + _hindi
 ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
 ALPHA_LOWER = group_chars(_lower + _uncased)
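
As a rough illustration of what adding `_hindi` to the uncased set buys, the range string can be dropped into a regex character class on its own. This sketch does not reproduce `group_chars`; `is_devanagari` is a hypothetical helper, and it assumes Python 3, where the `re` module interprets `\uXXXX` escapes inside patterns.

```python
import re

# Same range string as in the diff: the Devanagari Unicode block.
_hindi = r"\u0900-\u097F"

# Wrap the range in a character class to match a single character.
hindi_char = re.compile("[" + _hindi + "]")


def is_devanagari(text):
    """Return True if text is non-empty and every character is in the block."""
    return bool(text) and all(hindi_char.match(ch) for ch in text)
```

With the range included in the alphabetic classes, tokenizer rules built from `ALPHA` can treat Hindi text as letters rather than as unknown symbols.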
--- a/spacy/lang/id/examples.py
+++ b/spacy/lang/id/examples.py
@ -5,7 +5,7 @@ from __future__ import unicode_literals
 """
 Example sentences to test spaCy and its language models.
->>> from spacy.lang.en.examples import sentences
+>>> from spacy.lang.id.examples import sentences
 >>> docs = nlp.pipe(sentences)
 """
--- a/spacy/lang/ko/__init__.py
+++ b/spacy/lang/ko/__init__.py
@ -0,0 +1,120 @@
 # encoding: utf8
 from __future__ import unicode_literals, print_function
 import re
 import sys
 from .stop_words import STOP_WORDS
 from .tag_map import TAG_MAP
 from ...attrs import LANG
 from ...language import Language
 from ...tokens import Doc
 from ...compat import copy_reg
 from ...util import DummyTokenizer
 from ...compat import is_python3, is_python_pre_3_5
 is_python_post_3_7 = is_python3 and sys.version_info[1] >= 7
 # fmt: off
 if is_python_pre_3_5:
    from collections import namedtuple
    Morpheme = namedtuple("Morpheme", "surface lemma tag")
 elif is_python_post_3_7:
    from dataclasses import dataclass
    @dataclass(frozen=True)
    class Morpheme:
        surface: str
        lemma: str
        tag: str
 else:
    from typing import NamedTuple
    class Morpheme(NamedTuple):
        surface: str
        lemma: str
        tag: str
 def try_mecab_import():
    try:
        from natto import MeCab
        return MeCab
    except ImportError:
        raise ImportError(
            "Korean support requires [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
            "[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), "
            "and [natto-py](https://github.com/buruzaemon/natto-py)"
        )
 # fmt: on
 def check_spaces(text, tokens):
    # Escape each token so punctuation such as "(" or "?" is matched literally.
    token_pattern = re.compile(r"\s?".join("({})".format(re.escape(t)) for t in tokens))
    m = token_pattern.match(text)
    if m is not None:
        for i in range(1, m.lastindex):
            yield m.end(i) < m.start(i + 1)
        yield False
 class KoreanTokenizer(DummyTokenizer):
    def __init__(self, cls, nlp=None):
        self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
        self.Tokenizer = try_mecab_import()
    def __call__(self, text):
        dtokens = list(self.detailed_tokens(text))
        surfaces = [dt.surface for dt in dtokens]
        doc = Doc(self.vocab, words=surfaces, spaces=list(check_spaces(text, surfaces)))
        for token, dtoken in zip(doc, dtokens):
            first_tag, sep, eomi_tags = dtoken.tag.partition("+")
            token.tag_ = first_tag  # stem(어간) or pre-final(선어말 어미)
            token.lemma_ = dtoken.lemma
        doc.user_data["full_tags"] = [dt.tag for dt in dtokens]
        return doc
    def detailed_tokens(self, text):
        # 품사 태그(POS)[0], 의미 부류(semantic class)[1],	종성 유무(jongseong)[2], 읽기(reading)[3],
        # 타입(type)[4], 첫번째 품사(start pos)[5],	마지막 품사(end pos)[6], 표현(expression)[7], *
        with self.Tokenizer("-F%f[0],%f[7]") as tokenizer:
            for node in tokenizer.parse(text, as_nodes=True):
                if node.is_eos():
                    break
                surface = node.surface
                feature = node.feature
                tag, _, expr = feature.partition(",")
                lemma, _, remainder = expr.partition("/")
                if lemma == "*":
                    lemma = surface
                yield Morpheme(surface, lemma, tag)
 class KoreanDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda _text: "ko"
    stop_words = STOP_WORDS
    tag_map = TAG_MAP
    writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
    @classmethod
    def create_tokenizer(cls, nlp=None):
        return KoreanTokenizer(cls, nlp)
 class Korean(Language):
    lang = "ko"
    Defaults = KoreanDefaults
    def make_doc(self, text):
        return self.tokenizer(text)
 def pickle_korean(instance):
    return Korean, tuple()
 copy_reg.pickle(Korean, pickle_korean)
 __all__ = ["Korean"]
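
The whitespace recovery in `check_spaces` can be tried without MeCab installed. The sketch below is a standalone variant, not the code from the diff: it escapes each token before building the pattern and skips the empty-token case, both of which are assumptions about how one might harden the function.

```python
import re


def check_spaces(text, tokens):
    """Yield, for each token, whether it is followed by whitespace in text."""
    # One capture group per token, with optional whitespace between groups.
    pattern = re.compile(r"\s?".join("({})".format(re.escape(t)) for t in tokens))
    m = pattern.match(text)
    if m is not None and m.lastindex:
        for i in range(1, m.lastindex):
            # A gap between the end of group i and the start of group i + 1
            # means the original text had whitespace there.
            yield m.end(i) < m.start(i + 1)
        # The last token is never followed by a space.
        yield False
```

For example, `list(check_spaces("영국의 수도", ["영국", "의", "수도"]))` gives one flag per morpheme, which is the shape `Doc(..., spaces=...)` expects.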
--- a/spacy/lang/ko/examples.py
+++ b/spacy/lang/ko/examples.py
@ -0,0 +1,15 @@
 # coding: utf8
 from __future__ import unicode_literals
 """
 Example sentences to test spaCy and its language models.
 >>> from spacy.lang.ko.examples import sentences
 >>> docs = nlp.pipe(sentences)
 """
 sentences = [
    "애플이 영국의 신생 기업을 10억 달러에 구매를 고려중이다.",
    "자동 운전 자동차의 손해 배상 책임에 자동차 메이커에 일정한 부담을 요구하겠다.",
    "자동 배달 로봇이 보도를 주행하는 것을 샌프란시스코시가 금지를 검토중이라고 합니다.",
    "런던은 영국의 수도이자 가장 큰 도시입니다."
 ]
--- a/spacy/lang/ko/stop_words.py
+++ b/spacy/lang/ko/stop_words.py
@ -0,0 +1,68 @@
 # coding: utf8
 from __future__ import unicode_literals
 STOP_WORDS = set("""
 이
 있
 하
 것
 들
 그
 되
 수
 이
 보
 않
 없
 나
 주
 아니
 등
 같
 때
 년
 가
 한
 지
 오
 말
 일
 그렇
 위하
 때문
 그것
 두
 말하
 알
 그러나
 받
 못하
 일
 그런
 또
 더
 많
 그리고
 좋
 크
 시키
 그러
 하나
 살
 데
 안
 어떤
 번
 나
 다른
 어떻
 들
 이렇
 점
 싶
 말
 좀
 원
 잘
 놓
 """.split())
--- a/spacy/lang/ko/tag_map.py
+++ b/spacy/lang/ko/tag_map.py
@ -0,0 +1,59 @@
 # encoding: utf8
 from __future__ import unicode_literals
 from ...symbols import POS, PUNCT, INTJ, X, SYM, ADJ, AUX, ADP, CONJ, NOUN, PRON
 from ...symbols import VERB, ADV, PROPN, NUM, DET
 # 은전한닢(mecab-ko-dic)의 품사 태그를 universal pos tag로 대응시킴
 # https://docs.google.com/spreadsheets/d/1-9blXKjtjeKZqsf4NzHeYJCrr49-nXeRF6D80udfcwY/edit#gid=589544265
 # https://universaldependencies.org/u/pos/
 TAG_MAP = {
    # J.{1,2} 조사
    "JKS": {POS: ADP},
    "JKC": {POS: ADP},
    "JKG": {POS: ADP},
    "JKO": {POS: ADP},
    "JKB": {POS: ADP},
    "JKV": {POS: ADP},
    "JKQ": {POS: ADP},
    "JX": {POS: ADP},  # 보조사
    "JC": {POS: CONJ},  # 접속 조사
    "MAJ": {POS: CONJ},  # 접속 부사
    "MAG": {POS: ADV},  # 일반 부사
    "MM": {POS: DET},  # 관형사
    "XPN": {POS: X},  # 접두사
    # XS. 접미사
    "XSN": {POS: X},
    "XSV": {POS: X},
    "XSA": {POS: X},
    "XR": {POS: X},  # 어근
    # E.{1,2} 어미
    "EP": {POS: X},
    "EF": {POS: X},
    "EC": {POS: X},
    "ETN": {POS: X},
    "ETM": {POS: X},
    "IC": {POS: INTJ},  # 감탄사
    "VV": {POS: VERB},  # 동사
    "VA": {POS: ADJ},  # 형용사
    "VX": {POS: AUX},  # 보조 용언
    "VCP": {POS: ADP},  # 긍정 지정사(이다)
    "VCN": {POS: ADJ},  # 부정 지정사(아니다)
    "NNG": {POS: NOUN},  # 일반 명사(general noun)
    "NNB": {POS: NOUN},  # 의존 명사
    "NNBC": {POS: NOUN},  # 의존 명사(단위: unit)
    "NNP": {POS: PROPN},  # 고유 명사(proper noun)
    "NP": {POS: PRON},  # 대명사
    "NR": {POS: NUM},  # 수사(numerals)
    "SN": {POS: NUM},  # 숫자
    # S.{1,2} 부호
    # 문장 부호
    "SF": {POS: PUNCT},  # period or other EOS marker
    "SE": {POS: PUNCT},
    "SC": {POS: PUNCT},  # comma, etc.
    "SSO": {POS: PUNCT},  # open bracket
    "SSC": {POS: PUNCT},  # close bracket
    "SY": {POS: SYM},  # 기타 기호
    "SL": {POS: X},  # 외국어
    "SH": {POS: X},  # 한자
 }
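
Since mecab-ko-dic can return compound tags such as `VV+EC`, and the tokenizer keeps only the part before the first `+`, a lookup against a map like the one above can be sketched as follows. Plain strings stand in for spaCy's `POS` symbols here, and `coarse_pos` and `SAMPLE_TAG_MAP` are hypothetical names, not part of the library.

```python
# A small excerpt in the same shape as TAG_MAP, with string stand-ins
# for the symbol constants (assumption for illustration only).
SAMPLE_TAG_MAP = {
    "VV": {"pos": "VERB"},
    "EC": {"pos": "X"},
    "NNG": {"pos": "NOUN"},
}


def coarse_pos(full_tag, tag_map=SAMPLE_TAG_MAP, default="X"):
    """Map a possibly compound tag to a coarse POS via its first component."""
    first_tag, _sep, _rest = full_tag.partition("+")
    return tag_map.get(first_tag, {}).get("pos", default)
```

Under this scheme `coarse_pos("VV+EC")` resolves through `VV`, so the ending tags only survive in `doc.user_data["full_tags"]`.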
--- a/spacy/lang/lt/__init__.py
+++ b/spacy/lang/lt/__init__.py
@ -1,15 +1,37 @@
 # coding: utf8
 from __future__ import unicode_literals
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP
 from .lemmatizer import LOOKUP
 from .morph_rules import MORPH_RULES
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG
+from ...attrs import LANG, NORM
 from ...util import update_exc, add_lookups
 def _return_lt(_):
    return "lt"
 class LithuanianDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
-    lex_attr_getters[LANG] = lambda text: "lt"
+    lex_attr_getters[LANG] = _return_lt
    lex_attr_getters[NORM] = add_lookups(
        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
    )
    lex_attr_getters.update(LEX_ATTRS)
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    tag_map = TAG_MAP
    morph_rules = MORPH_RULES
    lemma_lookup = LOOKUP
 class Lithuanian(Language):
--- a/spacy/lang/lt/examples.py
+++ b/spacy/lang/lt/examples.py
@ -0,0 +1,22 @@
 # coding: utf8
 from __future__ import unicode_literals
 """
 Example sentences to test spaCy and its language models.
 >>> from spacy.lang.lt.examples import sentences
 >>> docs = nlp.pipe(sentences)
 """
 sentences = [
    "Jaunikis pirmąją vestuvinę naktį iškeitė į areštinės gultą",
    "Bepiločiai automobiliai išnaikins vairavimo mokyklas, autoservisus ir eismo nelaimes",
    "Vilniuje galvojama uždrausti naudoti skėčius",
    "Londonas yra didelis miestas Jungtinėje Karalystėje",
    "Kur tu?",
    "Kas yra Prancūzijos prezidentas?",
    "Kokia yra Jungtinių Amerikos Valstijų sostinė?",
    "Kada gimė Dalia Grybauskaitė?",
 ]
--- a/spacy/lang/lt/lemmatizer.py
+++ b/spacy/lang/lt/lemmatizer.py
--- a/spacy/lang/lt/lex_attrs.py
+++ b/spacy/lang/lt/lex_attrs.py
--- a/spacy/lang/lt/morph_rules.py
+++ b/spacy/lang/lt/morph_rules.py
--- a/spacy/lang/lt/stop_words.py
+++ b/spacy/lang/lt/stop_words.py
--- a/spacy/lang/lt/tag_map.py
+++ b/spacy/lang/lt/tag_map.py
--- a/spacy/lang/lt/tokenizer_exceptions.py
+++ b/spacy/lang/lt/tokenizer_exceptions.py
@ -0,0 +1,268 @@
 # coding: utf8
 from __future__ import unicode_literals
 from ...symbols import ORTH
 _exc = {}
 for orth in [
    "G.",
    "J. E.",
    "J. Em.",
    "J.E.",
    "J.Em.",
    "K.",
    "N.",
    "V.",
    "Vt.",
    "a.",
    "a.k.",
    "a.s.",
    "adv.",
    "akad.",
    "aklg.",
    "akt.",
    "al.",
    "ang.",
    "angl.",
    "aps.",
    "apskr.",
    "apyg.",
    "arbat.",
    "asist.",
    "asm.",
    "asm.k.",
    "asmv.",
    "atk.",
    "atsak.",
    "atsisk.",
    "atsisk.sąsk.",
    "atv.",
    "aut.",
    "avd.",
    "b.k.",
    "baud.",
    "biol.",
    "bkl.",
    "bot.",
    "bt.",
    "buv.",
    "ch.",
    "chem.",
    "corp.",
    "d.",
    "dab.",
    "dail.",
    "dek.",
    "deš.",
    "dir.",
    "dirig.",
    "doc.",
    "dol.",
    "dr.",
    "drp.",
    "dvit.",
    "dėst.",
    "dš.",
    "dž.",
    "e.b.",
    "e.bankas",
    "e.p.",
    "e.parašas",
    "e.paštas",
    "e.v.",
    "e.valdžia",
    "egz.",
    "eil.",
    "ekon.",
    "el.",
    "el.bankas",
    "el.p.",
    "el.parašas",
    "el.paštas",
    "el.valdžia",
    "etc.",
    "ež.",
    "fak.",
    "faks.",
    "feat.",
    "filol.",
    "filos.",
    "g.",
    "gen.",
    "geol.",
    "gerb.",
    "gim.",
    "gr.",
    "gv.",
    "gyd.",
    "gyv.",
    "habil.",
    "inc.",
    "insp.",
    "inž.",
    "ir pan.",
    "ir t. t.",
    "isp.",
    "istor.",
    "it.",
    "just.",
    "k.",
    "k. a.",
    "k.a.",
    "kab.",
    "kand.",
    "kart.",
    "kat.",
    "ketv.",
    "kh.",
    "kl.",
    "kln.",
    "km.",
    "kn.",
    "koresp.",
    "kpt.",
    "kr.",
    "kt.",
    "kub.",
    "kun.",
    "kv.",
    "kyš.",
    "l. e. p.",
    "l.e.p.",
    "lenk.",
    "liet.",
    "lot.",
    "lt.",
    "ltd.",
    "ltn.",
    "m.",
    "m.e..",
    "m.m.",
    "mat.",
    "med.",
    "mgnt.",
    "mgr.",
    "min.",
    "mjr.",
    "ml.",
    "mln.",
    "mlrd.",
    "mob.",
    "mok.",
    "moksl.",
    "mokyt.",
    "mot.",
    "mr.",
    "mst.",
    "mstl.",
    "mėn.",
    "nkt.",
    "no.",
    "nr.",
    "ntk.",
    "nuotr.",
    "op.",
    "org.",
    "orig.",
    "p.",
    "p.d.",
    "p.m.e.",
    "p.s.",
    "pab.",
    "pan.",
    "past.",
    "pav.",
    "pavad.",
    "per.",
    "perd.",
    "pirm.",
    "pl.",
    "plg.",
    "plk.",
    "pr.",
    "pr.Kr.",
    "pranc.",
    "proc.",
    "prof.",
    "prom.",
    "prot.",
    "psl.",
    "pss.",
    "pvz.",
    "pšt.",
    "r.",
    "raj.",
    "red.",
    "rez.",
    "rež.",
    "rus.",
    "rš.",
    "s.",
    "sav.",
    "saviv.",
    "sek.",
    "sekr.",
    "sen.",
    "sh.",
    "sk.",
    "skg.",
    "skv.",
    "skyr.",
    "sp.",
    "spec.",
    "sr.",
    "st.",
    "str.",
    "stud.",
    "sąs.",
    "t.",
    "t. p.",
    "t. y.",
    "t.p.",
    "t.t.",
    "t.y.",
    "techn.",
    "tel.",
    "teol.",
    "th.",
    "tir.",
    "trit.",
    "trln.",
    "tšk.",
    "tūks.",
    "tūkst.",
    "up.",
    "upl.",
    "v.s.",
    "vad.",
    "val.",
    "valg.",
    "ved.",
    "vert.",
    "vet.",
    "vid.",
    "virš.",
    "vlsč.",
    "vnt.",
    "vok.",
    "vs.",
    "vtv.",
    "vv.",
    "vyr.",
    "vyresn.",
    "zool.",
    "Įn",
    "įl.",
    "š.m.",
    "šnek.",
    "šv.",
    "švč.",
    "ž.ū.",
    "žin.",
    "žml.",
    "žr.",
 ]:
    _exc[orth] = [{ORTH: orth}]
 TOKENIZER_EXCEPTIONS = _exc
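
Long lists of string literals like the one above are sensitive to a missing comma, because Python silently concatenates adjacent literals instead of raising an error. A short demonstration, using a few abbreviations purely as sample data:

```python
# Every entry is followed by a comma: three separate exceptions.
with_commas = ["t.t.", "t.y.", "techn."]

# The comma after "t.y." is missing, so Python joins the two adjacent
# literals into one string and the list quietly loses an entry.
missing_comma = ["t.t.", "t.y." "techn."]
```

Comparing `len()` of the list against the expected number of entries is a cheap way to catch this in a test.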
--- a/spacy/lang/nb/lemmatizer/_lemma_rules.py
+++ b/spacy/lang/nb/lemmatizer/_lemma_rules.py
@ -22,6 +22,7 @@ NOUN_RULES = [
 VERB_RULES = [
    ["er", "e"],  # vasker -> vaske
    ["et", "e"],  # vasket -> vaske
    ["a", "e"],  # vaska -> vaske
    ["es", "e"],  # vaskes -> vaske
    ["te", "e"],  # stekte -> steke
    ["år", "å"],  # får -> få
--- a/spacy/lang/nb/tokenizer_exceptions.py
+++ b/spacy/lang/nb/tokenizer_exceptions.py
@ -10,7 +10,15 @@ _exc = {}
 for exc_data in [
    {ORTH: "jan.", LEMMA: "januar"},
    {ORTH: "feb.", LEMMA: "februar"},
    {ORTH: "mar.", LEMMA: "mars"},
    {ORTH: "apr.", LEMMA: "april"},
    {ORTH: "jun.", LEMMA: "juni"},
    {ORTH: "jul.", LEMMA: "juli"},
    {ORTH: "aug.", LEMMA: "august"},
    {ORTH: "sep.", LEMMA: "september"},
    {ORTH: "okt.", LEMMA: "oktober"},
    {ORTH: "nov.", LEMMA: "november"},
    {ORTH: "des.", LEMMA: "desember"},
 ]:
    _exc[exc_data[ORTH]] = [exc_data]
@ -18,11 +26,13 @@ for exc_data in [
 for orth in [
    "adm.dir.",
    "a.m.",
    "andelsnr",
    "Aq.",
    "b.c.",
    "bl.a.",
    "bla.",
    "bm.",
    "bnr.",
    "bto.",
    "ca.",
    "cand.mag.",
@ -41,6 +51,7 @@ for orth in [
    "el.",
    "e.l.",
    "et.",
    "etc.",
    "etg.",
    "ev.",
    "evt.",
@ -76,6 +87,7 @@ for orth in [
    "kgl.res.",
    "kl.",
    "komm.",
    "kr.",
    "kst.",
    "lø.",
    "ma.",
@ -106,6 +118,7 @@ for orth in [
    "o.l.",
    "on.",
    "op.",
    "org.",
    "osv.",
    "ovf.",
    "p.",
@ -130,6 +143,7 @@ for orth in [
    "sep.",
    "siviling.",
    "sms.",
    "snr.",
    "spm.",
    "sr.",
    "sst.",
--- a/spacy/lang/sq/examples.py
+++ b/spacy/lang/sq/examples.py
@ -0,0 +1,18 @@
 # coding: utf8
 from __future__ import unicode_literals
 """
 Example sentences to test spaCy and its language models.
 >>> from spacy.lang.sq.examples import sentences
 >>> docs = nlp.pipe(sentences)
 """
 sentences = [
    "Apple po shqyrton blerjen e nje shoqërie të U.K. për 1 miliard dollarë",
    "Makinat autonome ndryshojnë përgjegjësinë e sigurimit ndaj prodhuesve",
    "San Francisko konsideron ndalimin e robotëve të shpërndarjes",
    "Londra është një qytet i madh në Mbretërinë e Bashkuar.",
 ]
--- a/spacy/matcher/matcher.pyx
+++ b/spacy/matcher/matcher.pyx
@ -262,13 +262,13 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None,
 cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
    # There have been a few bugs here.
    # The code was originally designed to always have pattern[1].attrs.value
    # be the ent_id when we get to the end of a pattern. However, Issue #2671
    # showed this wasn't the case when we had a reject-and-continue before a
-    # match. I still don't really understand what's going on here, but this
-    # workaround does resolve the issue.
-    while pattern.attrs.attr != ID and \
-            (pattern.nr_attr > 0 or pattern.nr_extra_attr > 0 or pattern.nr_py > 0):
+    # match.
+    # The patch to #2671 was wrong though, which came up in #3839.
+    while pattern.attrs.attr != ID:
        pattern += 1
    return pattern.attrs.value
--- a/spacy/pipeline/entityruler.py
+++ b/spacy/pipeline/entityruler.py
@ -1,15 +1,17 @@
 # coding: utf8
 from __future__ import unicode_literals
-from collections import defaultdict
+from collections import defaultdict, OrderedDict
 import srsly
 from ..errors import Errors
 from ..compat import basestring_
-from ..util import ensure_path
+from ..util import ensure_path, to_disk, from_disk
 from ..tokens import Span
 from ..matcher import Matcher, PhraseMatcher
 DEFAULT_ENT_ID_SEP = "||"
 class EntityRuler(object):
    """The EntityRuler lets you add spans to the `Doc.ents` using token-based
@ -24,7 +26,7 @@ class EntityRuler(object):
    name = "entity_ruler"
-    def __init__(self, nlp, **cfg):
+    def __init__(self, nlp, phrase_matcher_attr=None, **cfg):
        """Initialize the entity ruler. If patterns are supplied here, they
        need to be a list of dictionaries with a `"label"` and `"pattern"`
        key. A pattern can either be a token pattern (list) or a phrase pattern
@ -32,6 +34,8 @@ class EntityRuler(object):
        nlp (Language): The shared nlp object to pass the vocab to the matchers
            and process phrase patterns.
        phrase_matcher_attr (int / unicode): Token attribute to match on, passed
            to the internal PhraseMatcher as `attr`
        patterns (iterable): Optional patterns to load in.
        overwrite_ents (bool): If existing entities are present, e.g. entities
            added by the model, overwrite them by matches if necessary.
@ -47,8 +51,15 @@ class EntityRuler(object):
        self.token_patterns = defaultdict(list)
        self.phrase_patterns = defaultdict(list)
        self.matcher = Matcher(nlp.vocab)
-        self.phrase_matcher = PhraseMatcher(nlp.vocab)
-        self.ent_id_sep = cfg.get("ent_id_sep", "||")
+        if phrase_matcher_attr is not None:
+            self.phrase_matcher_attr = phrase_matcher_attr
+            self.phrase_matcher = PhraseMatcher(
+                nlp.vocab, attr=self.phrase_matcher_attr
+            )
+        else:
+            self.phrase_matcher_attr = None
+            self.phrase_matcher = PhraseMatcher(nlp.vocab)
+        self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
        patterns = cfg.get("patterns")
        if patterns is not None:
            self.add_patterns(patterns)
@ -212,8 +223,18 @@ class EntityRuler(object):
        DOCS: https://spacy.io/api/entityruler#from_bytes
        """
-        patterns = srsly.msgpack_loads(patterns_bytes)
-        self.add_patterns(patterns)
+        cfg = srsly.msgpack_loads(patterns_bytes)
+        if isinstance(cfg, dict):
+            self.add_patterns(cfg.get("patterns", cfg))
+            self.overwrite = cfg.get("overwrite", False)
+            self.phrase_matcher_attr = cfg.get("phrase_matcher_attr", None)
+            if self.phrase_matcher_attr is not None:
+                self.phrase_matcher = PhraseMatcher(
+                    self.nlp.vocab, attr=self.phrase_matcher_attr
+                )
+            self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
+        else:
+            self.add_patterns(cfg)
        return self
    def to_bytes(self, **kwargs):
@ -223,7 +244,16 @@ class EntityRuler(object):
        DOCS: https://spacy.io/api/entityruler#to_bytes
        """
-        return srsly.msgpack_dumps(self.patterns)
+
        serial = OrderedDict(
            (
                ("overwrite", self.overwrite),
                ("ent_id_sep", self.ent_id_sep),
                ("phrase_matcher_attr", self.phrase_matcher_attr),
                ("patterns", self.patterns),
            )
        )
        return srsly.msgpack_dumps(serial)
    def from_disk(self, path, **kwargs):
        """Load the entity ruler from a file. Expects a file containing
@ -236,21 +266,52 @@ class EntityRuler(object):
        DOCS: https://spacy.io/api/entityruler#from_disk
        """
        path = ensure_path(path)
-        path = path.with_suffix(".jsonl")
-        patterns = srsly.read_jsonl(path)
-        self.add_patterns(patterns)
+        depr_patterns_path = path.with_suffix(".jsonl")
+        if depr_patterns_path.is_file():
+            patterns = srsly.read_jsonl(depr_patterns_path)
+            self.add_patterns(patterns)
+        else:
+            cfg = {}
+            deserializers = {
+                "patterns": lambda p: self.add_patterns(
+                    srsly.read_jsonl(p.with_suffix(".jsonl"))
+                ),
+                "cfg": lambda p: cfg.update(srsly.read_json(p)),
+            }
+            from_disk(path, deserializers, {})
+            self.overwrite = cfg.get("overwrite", False)
+            self.phrase_matcher_attr = cfg.get("phrase_matcher_attr")
+            self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
+            if self.phrase_matcher_attr is not None:
+                self.phrase_matcher = PhraseMatcher(
+                    self.nlp.vocab, attr=self.phrase_matcher_attr
+                )
        return self
    def to_disk(self, path, **kwargs):
        """Save the entity ruler patterns to a directory. The patterns will be
        saved as newline-delimited JSON (JSONL).
-        path (unicode / Path): The JSONL file to load.
+        path (unicode / Path): The JSONL file to save.
        **kwargs: Other config parameters, mostly for consistency.
        RETURNS (EntityRuler): The loaded entity ruler.
        DOCS: https://spacy.io/api/entityruler#to_disk
        """
        path = ensure_path(path)
-        path = path.with_suffix(".jsonl")
-        srsly.write_jsonl(path, self.patterns)
+        cfg = {
+            "overwrite": self.overwrite,
+            "phrase_matcher_attr": self.phrase_matcher_attr,
+            "ent_id_sep": self.ent_id_sep,
+        }
+        serializers = {
+            "patterns": lambda p: srsly.write_jsonl(
+                p.with_suffix(".jsonl"), self.patterns
+            ),
+            "cfg": lambda p: srsly.write_json(p, cfg),
+        }
+        if path.suffix == ".jsonl":  # user wants to save only JSONL
+            srsly.write_jsonl(path, self.patterns)
+        else:
+            to_disk(path, serializers, {})
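
The serialization changes above keep older data loadable: a payload that decodes to a plain list is treated as patterns only, while a dict also carries the ruler's settings. The sketch below illustrates that branching in isolation. It uses `json` as a stand-in for `srsly.msgpack_dumps` / `msgpack_loads`, and `dumps_ruler` / `loads_ruler` are hypothetical helpers rather than spaCy functions.

```python
import json
from collections import OrderedDict

DEFAULT_ENT_ID_SEP = "||"


def dumps_ruler(patterns, overwrite=False, phrase_matcher_attr=None,
                ent_id_sep=DEFAULT_ENT_ID_SEP):
    """Serialize patterns together with the ruler settings (new format)."""
    serial = OrderedDict(
        (
            ("overwrite", overwrite),
            ("ent_id_sep", ent_id_sep),
            ("phrase_matcher_attr", phrase_matcher_attr),
            ("patterns", patterns),
        )
    )
    return json.dumps(serial).encode("utf8")


def loads_ruler(data):
    """Accept both the new dict format and the legacy bare pattern list."""
    cfg = json.loads(data.decode("utf8"))
    if isinstance(cfg, dict):
        return {
            "patterns": cfg.get("patterns", []),
            "overwrite": cfg.get("overwrite", False),
            "phrase_matcher_attr": cfg.get("phrase_matcher_attr"),
            "ent_id_sep": cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP),
        }
    # Legacy payload: only a list of patterns, so fall back to defaults.
    return {
        "patterns": cfg,
        "overwrite": False,
        "phrase_matcher_attr": None,
        "ent_id_sep": DEFAULT_ENT_ID_SEP,
    }
```

Checking the decoded type rather than adding a version field keeps the old byte strings valid without any migration step, at the cost of making the format harder to extend a second time.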
--- a/spacy/pipeline/pipes.pyx
+++ b/spacy/pipeline/pipes.pyx
@ -3,16 +3,18 @@
 # coding: utf8
 from __future__ import unicode_literals
 cimport numpy as np
 import numpy
 import srsly
 import random
 from collections import OrderedDict
 from thinc.api import chain
 from thinc.v2v import Affine, Maxout, Softmax
 from thinc.misc import LayerNorm
-from thinc.neural.util import to_categorical, copy_array
+from thinc.neural.util import to_categorical
 from thinc.neural.util import get_array_module
 from spacy.kb import KnowledgeBase
 from ..cli.pretrain import get_cossim_loss
 from .functions import merge_subtokens
 from ..tokens.doc cimport Doc
 from ..syntax.nn_parser cimport Parser
@ -24,9 +26,9 @@ from ..vocab cimport Vocab
 from ..syntax import nonproj
 from ..attrs import POS, ID
 from ..parts_of_speech import X
-from .._ml import Tok2Vec, build_tagger_model
+from .._ml import Tok2Vec, build_tagger_model, cosine
 from .._ml import build_text_classifier, build_simple_cnn_text_classifier
-from .._ml import build_bow_text_classifier
+from .._ml import build_bow_text_classifier, build_nel_encoder
 from .._ml import link_vectors_to_models, zero_init, flatten
 from .._ml import masked_language_model, create_default_optimizer
 from ..errors import Errors, TempErrors
@ -229,7 +231,7 @@ class Tensorizer(Pipe):
        vocab (Vocab): A `Vocab` instance. The model must share the same
            `Vocab` instance with the `Doc` objects it will process.
-        model (Model): A `Model` instance or `True` allocate one later.
+        model (Model): A `Model` instance or `True` to allocate one later.
        **cfg: Config parameters.
        EXAMPLE:
@ -294,7 +296,7 @@ class Tensorizer(Pipe):
        docs (iterable): A batch of `Doc` objects.
        golds (iterable): A batch of `GoldParse` objects.
-        drop (float): The droput rate.
+        drop (float): The dropout rate.
        sgd (callable): An optimizer.
        RETURNS (dict): Results from the update.
        """
@ -386,7 +388,7 @@ class Tagger(Pipe):
    def predict(self, docs):
        self.require_model()
        if not any(len(doc) for doc in docs):
-            # Handle case where there are no tokens in any docs.
+            # Handle cases where there are no tokens in any docs.
            n_labels = len(self.labels)
            guesses = [self.model.ops.allocate((0, n_labels)) for doc in docs]
            tokvecs = self.model.ops.allocate((0, self.model.tok2vec.nO))
@ -900,6 +902,11 @@ class TextCategorizer(Pipe):
    def labels(self):
        return tuple(self.cfg.setdefault("labels", []))
    def require_labels(self):
        """Raise an error if the component's model has no labels defined."""
        if not self.labels:
            raise ValueError(Errors.E143.format(name=self.name))
    @labels.setter
    def labels(self, value):
        self.cfg["labels"] = tuple(value)
@ -929,6 +936,7 @@ class TextCategorizer(Pipe):
                doc.cats[label] = float(scores[i, j])
    def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None):
        self.require_model()
        scores, bp_scores = self.model.begin_update(docs, drop=drop)
        loss, d_scores = self.get_loss(docs, golds, scores)
        bp_scores(d_scores, sgd=sgd)
@ -983,6 +991,7 @@ class TextCategorizer(Pipe):
    def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
        if self.model is True:
            self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
            self.require_labels()
            self.model = self.Model(len(self.labels), **self.cfg)
            link_vectors_to_models(self.vocab)
        if sgd is None:
@ -1001,7 +1010,7 @@ cdef class DependencyParser(Parser):
    @property
    def postprocesses(self):
-        return [nonproj.deprojectivize, merge_subtokens]
+        return [nonproj.deprojectivize]
    def add_multitask_objective(self, target):
        if target == "cloze":
@ -1063,52 +1072,252 @@ cdef class EntityRecognizer(Parser):
 class EntityLinker(Pipe):
    """Pipeline component for named entity linking.
    DOCS: TODO
    """
    name = 'entity_linker'
    @classmethod
-    def Model(cls, nr_class=1, **cfg):
+    def Model(cls, **cfg):
-        # TODO: non-dummy EL implementation
+        embed_width = cfg.get("embed_width", 300)
-        return None
+        hidden_width = cfg.get("hidden_width", 128)
        type_to_int = cfg.get("type_to_int", dict())
-    def __init__(self, model=True, **cfg):
+        model = build_nel_encoder(embed_width=embed_width, hidden_width=hidden_width, ner_types=len(type_to_int), **cfg)
-        self.model = False
+        return model
    def __init__(self, vocab, **cfg):
        self.vocab = vocab
        self.model = True
        self.kb = None
        self.cfg = dict(cfg)
-        self.kb = self.cfg["kb"]
+        self.sgd_context = None
    def set_kb(self, kb):
        self.kb = kb
    def require_model(self):
        # Raise an error if the component's model is not initialized.
        if getattr(self, "model", None) in (None, True, False):
            raise ValueError(Errors.E109.format(name=self.name))
    def require_kb(self):
        # Raise an error if the knowledge base is not initialized.
        if getattr(self, "kb", None) in (None, True, False):
            raise ValueError(Errors.E139.format(name=self.name))
    def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
        self.require_kb()
        self.cfg["entity_width"] = self.kb.entity_vector_length
        if self.model is True:
            self.model = self.Model(**self.cfg)
            self.sgd_context = self.create_optimizer()
        if sgd is None:
            sgd = self.create_optimizer()
        return sgd
    def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None):
        self.require_model()
        self.require_kb()
        if losses is not None:
            losses.setdefault(self.name, 0.0)
        if not docs or not golds:
            return 0
        if len(docs) != len(golds):
            raise ValueError(Errors.E077.format(value="EL training", n_docs=len(docs),
                                                n_golds=len(golds)))
        if isinstance(docs, Doc):
            docs = [docs]
            golds = [golds]
        context_docs = []
        entity_encodings = []
        cats = []
        priors = []
        type_vectors = []
        type_to_int = self.cfg.get("type_to_int", dict())
        for doc, gold in zip(docs, golds):
            ents_by_offset = dict()
            for ent in doc.ents:
                ents_by_offset[str(ent.start_char) + "_" + str(ent.end_char)] = ent
            for entity in gold.links:
                start, end, gold_kb = entity
                mention = doc.text[start:end]
                gold_ent = ents_by_offset[str(start) + "_" + str(end)]
                assert gold_ent is not None
                type_vector = [0 for i in range(len(type_to_int))]
                if len(type_to_int) > 0:
                    type_vector[type_to_int[gold_ent.label_]] = 1
                candidates = self.kb.get_candidates(mention)
                random.shuffle(candidates)
                nr_neg = 0
                for c in candidates:
                    kb_id = c.entity_
                    entity_encoding = c.entity_vector
                    entity_encodings.append(entity_encoding)
                    context_docs.append(doc)
                    type_vectors.append(type_vector)
                    if self.cfg.get("prior_weight", 1) > 0:
                        priors.append([c.prior_prob])
                    else:
                        priors.append([0])
                    if kb_id == gold_kb:
                        cats.append([1])
                    else:
                        nr_neg += 1
                        cats.append([0])
        if len(entity_encodings) > 0:
            assert len(priors) == len(entity_encodings) == len(context_docs) == len(cats) == len(type_vectors)
            context_encodings, bp_context = self.model.tok2vec.begin_update(context_docs, drop=drop)
            entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32")
            mention_encodings = [list(context_encodings[i]) + list(entity_encodings[i]) + priors[i] + type_vectors[i]
                                 for i in range(len(entity_encodings))]
            pred, bp_mention = self.model.begin_update(self.model.ops.asarray(mention_encodings, dtype="float32"), drop=drop)
            cats = self.model.ops.asarray(cats, dtype="float32")
            loss, d_scores = self.get_loss(prediction=pred, golds=cats, docs=None)
            mention_gradient = bp_mention(d_scores, sgd=sgd)
            context_gradients = [list(x[0:self.cfg.get("context_width")]) for x in mention_gradient]
            bp_context(self.model.ops.asarray(context_gradients, dtype="float32"), sgd=self.sgd_context)
            if losses is not None:
                losses[self.name] += loss
            return loss
        return 0
    def get_loss(self, docs, golds, prediction):
        d_scores = (prediction - golds)
        loss = (d_scores ** 2).sum()
        loss = loss / len(golds)
        return loss, d_scores
    def get_loss_old(self, docs, golds, scores):
        # this loss function assumes we're only using positive examples
        loss, gradients = get_cossim_loss(yh=scores, y=golds)
        loss = loss / len(golds)
        return loss, gradients
    def __call__(self, doc):
-        self.set_annotations([doc], scores=None, tensors=None)
+        entities, kb_ids = self.predict([doc])
        self.set_annotations([doc], entities, kb_ids)
        return doc
    def pipe(self, stream, batch_size=128, n_threads=-1):
        """Apply the pipe to a stream of documents.
        Both __call__ and pipe should delegate to the `predict()`
        and `set_annotations()` methods.
        """
        for docs in util.minibatch(stream, size=batch_size):
            docs = list(docs)
-            self.set_annotations(docs, scores=None, tensors=None)
+            entities, kb_ids = self.predict(docs)
            self.set_annotations(docs, entities, kb_ids)
            yield from docs
-    def set_annotations(self, docs, scores, tensors=None):
+    def predict(self, docs):
-        """
+        self.require_model()
-        Currently implemented as taking the KB entry with highest prior probability for each named entity
+        self.require_kb()
-        TODO: actually use context etc
-        """
-        for i, doc in enumerate(docs):
-            for ent in doc.ents:
-                candidates = self.kb.get_candidates(ent.text)
-                if candidates:
-                    best_candidate = max(candidates, key=lambda c: c.prior_prob)
-                    for token in ent:
-                        token.ent_kb_id_ = best_candidate.entity_
-    def get_loss(self, docs, golds, scores):
+        final_entities = []
-        # TODO
+        final_kb_ids = []
-        pass
+
        if not docs:
            return final_entities, final_kb_ids
        if isinstance(docs, Doc):
            docs = [docs]
        context_encodings = self.model.tok2vec(docs)
        xp = get_array_module(context_encodings)
        type_to_int = self.cfg.get("type_to_int", dict())
        for i, doc in enumerate(docs):
            if len(doc) > 0:
                context_encoding = context_encodings[i]
                for ent in doc.ents:
                    type_vector = [0 for i in range(len(type_to_int))]
                    if len(type_to_int) > 0:
                        type_vector[type_to_int[ent.label_]] = 1
                    candidates = self.kb.get_candidates(ent.text)
                    if candidates:
                        random.shuffle(candidates)
                        # this will set the prior probabilities to 0 (just like in training) if their weight is 0
                        prior_probs = xp.asarray([[c.prior_prob] for c in candidates])
                        prior_probs *= self.cfg.get("prior_weight", 1)
                        scores = prior_probs
                        if self.cfg.get("context_weight", 1) > 0:
                            entity_encodings = xp.asarray([c.entity_vector for c in candidates])
                            assert len(entity_encodings) == len(prior_probs)
                            mention_encodings = [list(context_encoding) + list(entity_encodings[i])
                                                 + list(prior_probs[i]) + type_vector
                                                 for i in range(len(entity_encodings))]
                            scores = self.model(self.model.ops.asarray(mention_encodings, dtype="float32"))
                        # TODO: thresholding
                        best_index = scores.argmax()
                        best_candidate = candidates[best_index]
                        final_entities.append(ent)
                        final_kb_ids.append(best_candidate.entity_)
        return final_entities, final_kb_ids
    def set_annotations(self, docs, entities, kb_ids=None):
        for entity, kb_id in zip(entities, kb_ids):
            for token in entity:
                token.ent_kb_id_ = kb_id
    def to_disk(self, path, exclude=tuple(), **kwargs):
        serialize = OrderedDict()
        serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
        serialize["vocab"] = lambda p: self.vocab.to_disk(p)
        serialize["kb"] = lambda p: self.kb.dump(p)
        if self.model not in (None, True, False):
            serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
        exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
        util.to_disk(path, serialize, exclude)
    def from_disk(self, path, exclude=tuple(), **kwargs):
        def load_model(p):
            if self.model is True:
                self.model = self.Model(**self.cfg)
            self.model.from_bytes(p.open("rb").read())
        def load_kb(p):
            kb = KnowledgeBase(vocab=self.vocab, entity_vector_length=self.cfg["entity_width"])
            kb.load_bulk(p)
            self.set_kb(kb)
        deserialize = OrderedDict()
        deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
        deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
        deserialize["kb"] = load_kb
        deserialize["model"] = load_model
        exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
        util.from_disk(path, deserialize, exclude)
        return self
    def rehearse(self, docs, sgd=None, losses=None, **config):
        raise NotImplementedError
    def add_label(self, label):
-        # TODO
+        raise NotImplementedError
-        pass
 class Sentencizer(object):
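The `EntityLinker.update` and `get_loss` code in this file builds one input row per candidate by concatenating four feature groups, then scores it with a squared-error loss. The sketch below restates that arithmetic with plain Python lists. The function names `one_hot`, `mention_encoding` and `squared_error` are illustrative and do not exist in spaCy. As in the diff, the returned `d_scores` is the raw difference `prediction - gold`, not a rescaled derivative.

```python
def one_hot(label, type_to_int):
    """Encode an entity label as a one-hot list (empty if no types are configured)."""
    vector = [0] * len(type_to_int)
    if type_to_int:
        vector[type_to_int[label]] = 1
    return vector


def mention_encoding(context, entity_vector, prior_prob, type_vector):
    """Concatenate context, entity vector, prior and type features into one row."""
    return list(context) + list(entity_vector) + [prior_prob] + list(type_vector)


def squared_error(predictions, golds):
    """Return (summed squared error / number of examples, per-example difference)."""
    d_scores = [p - g for p, g in zip(predictions, golds)]
    loss = sum(d * d for d in d_scores) / len(golds)
    return loss, d_scores
```

Because the context encoding occupies the first positions of each row, the diff can slice the first `context_width` entries of the mention gradient to backpropagate into the context encoder.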
--- a/spacy/scorer.py
+++ b/spacy/scorer.py
@ -52,6 +52,7 @@ class Scorer(object):
        self.labelled = PRFScore()
        self.tags = PRFScore()
        self.ner = PRFScore()
        self.ner_per_ents = dict()
        self.eval_punct = eval_punct
    @property
@ -91,6 +92,15 @@ class Scorer(object):
        """RETURNS (float): Named entity accuracy (F-score)."""
        return self.ner.fscore * 100
    @property
    def ents_per_type(self):
        """RETURNS (dict): Scores per entity label.
        """
        return {
            k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100}
            for k, v in self.ner_per_ents.items()
        }
    @property
    def scores(self):
        """RETURNS (dict): All scores with keys `uas`, `las`, `ents_p`,
@ -102,6 +112,7 @@ class Scorer(object):
            "ents_p": self.ents_p,
            "ents_r": self.ents_r,
            "ents_f": self.ents_f,
            "ents_per_type": self.ents_per_type,
            "tags_acc": self.tags_acc,
            "token_acc": self.token_acc,
        }
@ -149,13 +160,31 @@ class Scorer(object):
                    cand_deps.add((gold_i, gold_head, token.dep_.lower()))
        if "-" not in [token[-1] for token in gold.orig_annot]:
            cand_ents = set()
            current_ent = {k.label_: set() for k in doc.ents}
            current_gold = {k.label_: set() for k in doc.ents}
            for ent in doc.ents:
                if ent.label_ not in self.ner_per_ents:
                    self.ner_per_ents[ent.label_] = PRFScore()
                first = gold.cand_to_gold[ent.start]
                last = gold.cand_to_gold[ent.end - 1]
                if first is None or last is None:
                    self.ner.fp += 1
                    self.ner_per_ents[ent.label_].fp += 1
                else:
                    cand_ents.add((ent.label_, first, last))
                    current_ent[ent.label_].add(
                        tuple(x for x in cand_ents if x[0] == ent.label_)
                    )
                    current_gold[ent.label_].add(
                        tuple(x for x in gold_ents if x[0] == ent.label_)
                    )
            # Scores per ent
            for k, v in self.ner_per_ents.items():
                if k in current_ent:
                    v.score_set(current_ent[k], current_gold[k])
            # Score for all ents
            self.ner.score_set(cand_ents, gold_ents)
        self.tags.score_set(cand_tags, gold_tags)
        self.labelled.score_set(cand_deps, gold_deps)
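The per-label scoring added to `Scorer` can be sketched without spaCy. This is a simplified illustration under stated assumptions: `PRF` is a stand-in for `PRFScore` that counts true positives, false positives and false negatives from two sets, and `score_per_type` is a hypothetical helper. Unlike the diff, it compares the per-label subsets of `(label, start, end)` tuples directly and takes labels from both the candidate and gold sets.

```python
class PRF:
    """Stand-in precision/recall/F-score accumulator (not spaCy's PRFScore)."""

    def __init__(self):
        self.tp = self.fp = self.fn = 0

    def score_set(self, cand, gold):
        self.tp += len(cand & gold)
        self.fp += len(cand - gold)
        self.fn += len(gold - cand)

    @property
    def precision(self):
        return self.tp / (self.tp + self.fp) if self.tp + self.fp else 0.0

    @property
    def recall(self):
        return self.tp / (self.tp + self.fn) if self.tp + self.fn else 0.0

    @property
    def fscore(self):
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if p + r else 0.0


def score_per_type(cand_ents, gold_ents):
    """Score (label, start, end) tuples separately for each label."""
    scores = {}
    for label in {ent[0] for ent in cand_ents | gold_ents}:
        prf = PRF()
        prf.score_set(
            {ent for ent in cand_ents if ent[0] == label},
            {ent for ent in gold_ents if ent[0] == label},
        )
        scores[label] = {"p": prf.precision * 100, "r": prf.recall * 100, "f": prf.fscore * 100}
    return scores
```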
--- a/spacy/structs.pxd
+++ b/spacy/structs.pxd
@ -3,6 +3,10 @@ from libc.stdint cimport uint8_t, uint32_t, int32_t, uint64_t
 from .typedefs cimport flags_t, attr_t, hash_t
 from .parts_of_speech cimport univ_pos_t
 from libcpp.vector cimport vector
 from libc.stdint cimport int32_t, int64_t
 cdef struct LexemeC:
    flags_t flags
@ -72,3 +76,32 @@ cdef struct TokenC:
    attr_t ent_type # TODO: Is there a better way to do this? Multiple sources of truth..
    attr_t ent_kb_id
    hash_t ent_id
 # Internal struct, for storage and disambiguation of entities.
 cdef struct KBEntryC:
    # The hash of this entry's unique ID/name in the KB
    hash_t entity_hash
    # Allows retrieval of the entity vector, as an index into a vectors table of the KB.
    # Can be expanded later to refer to multiple rows (compositional model to reduce storage footprint).
    int32_t vector_index
    # Allows retrieval of a struct of non-vector features.
    # This is currently not implemented and set to -1 for the common case where there are no features.
    int32_t feats_row
    # log probability of entity, based on corpus frequency
    float prob
 # Each alias struct stores a list of Entry pointers with their prior probabilities
 # for this specific mention/alias.
 cdef struct AliasC:
    # All entry candidates for this alias
    vector[int64_t] entry_indices
    # Prior probability P(entity|alias) - should sum up to (at most) 1.
    vector[float] probs
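The relationship between `KBEntryC` and `AliasC` is easier to see in a small pure-Python model. The sketch below is an assumption-laden toy, not the `KnowledgeBase` implementation: `ToyKB`, `Entry` and `Alias` are hypothetical names, and the validation rules are inferred from the comments above and the tests later in this commit (equal-length lists, known entities, priors summing to at most 1, fixed vector length).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    entity: str          # unique ID, stored as a hash in the C struct
    prob: float          # entity probability, as passed to add_entity
    vector: List[float]  # stands in for the row referenced by vector_index


@dataclass
class Alias:
    entry_indices: List[int] = field(default_factory=list)
    probs: List[float] = field(default_factory=list)  # P(entity|alias)


class ToyKB:
    def __init__(self, entity_vector_length):
        self.entity_vector_length = entity_vector_length
        self.entries = []
        self.index = {}
        self.aliases = {}

    def add_entity(self, entity, prob, entity_vector):
        if len(entity_vector) != self.entity_vector_length:
            raise ValueError("entity vector has the wrong length")
        self.index[entity] = len(self.entries)
        self.entries.append(Entry(entity, prob, list(entity_vector)))

    def add_alias(self, alias, entities, probabilities):
        if len(entities) != len(probabilities):
            raise ValueError("entities and probabilities differ in length")
        if sum(probabilities) > 1.00001:  # small tolerance for float error
            raise ValueError("prior probabilities sum to more than 1")
        if any(e not in self.index for e in entities):
            raise ValueError("alias refers to an unknown entity")
        self.aliases[alias] = Alias([self.index[e] for e in entities], list(probabilities))

    def get_candidates(self, alias):
        found = self.aliases.get(alias)
        if found is None:
            return []
        return [(self.entries[i].entity, p) for i, p in zip(found.entry_indices, found.probs)]
```

Storing indices rather than copies of the entries means several aliases can share one entry, which is the point of separating the two structs.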
--- a/spacy/symbols.pxd
+++ b/spacy/symbols.pxd
@ -81,6 +81,7 @@ cdef enum symbol_t:
    DEP
    ENT_IOB
    ENT_TYPE
    ENT_KB_ID
    HEAD
    SENT_START
    SPACY
--- a/spacy/symbols.pyx
+++ b/spacy/symbols.pyx
@ -86,6 +86,7 @@ IDS = {
    "DEP": DEP,
    "ENT_IOB": ENT_IOB,
    "ENT_TYPE": ENT_TYPE,
    "ENT_KB_ID": ENT_KB_ID,
    "HEAD": HEAD,
    "SENT_START": SENT_START,
    "SPACY": SPACY,
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@ -124,6 +124,22 @@ def ja_tokenizer():
    return get_lang_class("ja").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
 def ko_tokenizer():
    pytest.importorskip("natto")
    return get_lang_class("ko").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
 def lt_tokenizer():
    return get_lang_class("lt").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
 def lt_lemmatizer():
    return get_lang_class("lt").Defaults.create_lemmatizer()
@pytest.fixture(scope="session")
 def nb_tokenizer():
    return get_lang_class("nb").Defaults.create_tokenizer()
--- a/spacy/tests/lang/ko/init.py
+++ b/spacy/tests/lang/ko/init.py
--- a/spacy/tests/lang/ko/test_lemmatization.py
+++ b/spacy/tests/lang/ko/test_lemmatization.py
@ -0,0 +1,12 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import pytest
@pytest.mark.parametrize(
    "word,lemma", [("새로운", "새롭"), ("빨간", "빨갛"), ("클수록", "크"), ("뭡니까", "뭣"), ("됐다", "되")]
 )
 def test_ko_lemmatizer_assigns(ko_tokenizer, word, lemma):
    test_lemma = ko_tokenizer(word)[0].lemma_
    assert test_lemma == lemma
--- a/spacy/tests/lang/ko/test_tokenizer.py
+++ b/spacy/tests/lang/ko/test_tokenizer.py
@ -0,0 +1,46 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import pytest
 # fmt: off
 TOKENIZER_TESTS = [("서울 타워 근처에 살고 있습니다.", "서울 타워 근처 에 살 고 있 습니다 ."),
                   ("영등포구에 있는 맛집 좀 알려주세요.", "영등포구 에 있 는 맛집 좀 알려 주 세요 .")]
 TAG_TESTS = [("서울 타워 근처에 살고 있습니다.",
              "NNP NNG NNG JKB VV EC VX EF SF"),
             ("영등포구에 있는 맛집 좀 알려주세요.",
              "NNP JKB VV ETM NNG MAG VV VX EP SF")]
 FULL_TAG_TESTS = [("영등포구에 있는 맛집 좀 알려주세요.",
                   "NNP JKB VV ETM NNG MAG VV+EC VX EP+EF SF")]
 POS_TESTS = [("서울 타워 근처에 살고 있습니다.",
              "PROPN NOUN NOUN ADP VERB X AUX X PUNCT"),
             ("영등포구에 있는 맛집 좀 알려주세요.",
              "PROPN ADP VERB X NOUN ADV VERB AUX X PUNCT")]
 # fmt: on
@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS)
 def test_ko_tokenizer(ko_tokenizer, text, expected_tokens):
    tokens = [token.text for token in ko_tokenizer(text)]
    assert tokens == expected_tokens.split()
@pytest.mark.parametrize("text,expected_tags", TAG_TESTS)
 def test_ko_tokenizer_tags(ko_tokenizer, text, expected_tags):
    tags = [token.tag_ for token in ko_tokenizer(text)]
    assert tags == expected_tags.split()
@pytest.mark.parametrize("text,expected_tags", FULL_TAG_TESTS)
 def test_ko_tokenizer_full_tags(ko_tokenizer, text, expected_tags):
    tags = ko_tokenizer(text).user_data["full_tags"]
    assert tags == expected_tags.split()
@pytest.mark.parametrize("text,expected_pos", POS_TESTS)
 def test_ko_tokenizer_pos(ko_tokenizer, text, expected_pos):
    pos = [token.pos_ for token in ko_tokenizer(text)]
    assert pos == expected_pos.split()
--- a/spacy/tests/lang/lt/init.py
+++ b/spacy/tests/lang/lt/init.py
--- a/spacy/tests/lang/lt/test_lemmatizer.py
+++ b/spacy/tests/lang/lt/test_lemmatizer.py
@ -0,0 +1,15 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import pytest
@pytest.mark.parametrize("tokens,lemmas", [
    (["Galime", "vadinti", "gerovės", "valstybe", ",", "turime", "išvystytą", "socialinę", "apsaugą", ",",
      "sveikatos", "apsaugą", "ir", "prieinamą", "švietimą", "."],
     ["galėti", "vadintas", "gerovė", "valstybė", ",", "turėti", "išvystytas", "socialinis",
      "apsauga", ",", "sveikata", "apsauga", "ir", "prieinamas", "švietimas", "."]),
    (["taip", ",", "uoliai", "tyrinėjau", "ir", "pasirinkau", "geriausią", "variantą", "."],
     ["taip", ",", "uolus", "tyrinėti", "ir", "pasirinkti", "geras", "variantas", "."])])
 def test_lt_lemmatizer(lt_lemmatizer, tokens, lemmas):
    assert lemmas == [lt_lemmatizer.lookup(token) for token in tokens]
--- a/spacy/tests/lang/lt/test_text.py
+++ b/spacy/tests/lang/lt/test_text.py
@ -0,0 +1,56 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import pytest
 def test_lt_tokenizer_handles_long_text(lt_tokenizer):
    text = """Tokios sausros kriterijus atitinka pirmadienį atlikti skaičiavimai, palyginus faktinį ir žemiausią vidutinį daugiametį vandens lygį. Nustatyta, kad iš 48 šalies vandens matavimo stočių 28-iose stotyse vandens lygis yra žemesnis arba lygus žemiausiam vidutiniam daugiamečiam šiltojo laikotarpio vandens lygiui."""
    tokens = lt_tokenizer(text)
    assert len(tokens) == 42
@pytest.mark.parametrize(
    "text,length",
    [
        (
            "177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.",
            15,
        ),
        (
            "ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.",
            16,
        ),
    ],
 )
 def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
    tokens = lt_tokenizer(text)
    assert len(tokens) == length
@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
 def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
    tokens = lt_tokenizer(text)
    assert len(tokens) == 1
@pytest.mark.parametrize(
    "text,match",
    [
        ("10", True),
        ("1", True),
        ("10,000", True),
        ("10,00", True),
        ("999.0", True),
        ("vienas", True),
        ("du", True),
        ("milijardas", True),
        ("šuo", False),
        (",", False),
        ("1/2", True),
    ],
 )
 def test_lt_lex_attrs_like_number(lt_tokenizer, text, match):
    tokens = lt_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match
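The `like_num` cases in the test above suggest the behavior a number-like check needs to cover: digits with separators, simple fractions, and number words. The following is a hedged sketch of one way to satisfy those cases. It is not the Lithuanian `lex_attrs` code from this commit, and `_NUM_WORDS` is a tiny illustrative subset chosen only to cover the test inputs.

```python
# Illustrative subset; a real word list would be much longer.
_NUM_WORDS = {"vienas", "du", "milijardas"}


def like_num(text):
    """Return True if the text looks like a number (sketch, see assumptions above)."""
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Drop thousands and decimal separators before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _NUM_WORDS
```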
--- a/spacy/tests/matcher/test_matcher_api.py
+++ b/spacy/tests/matcher/test_matcher_api.py
@ -5,7 +5,6 @@ import pytest
 import re
 from spacy.matcher import Matcher, DependencyMatcher
 from spacy.tokens import Doc, Token
 from ..util import get_doc
@pytest.fixture
@ -288,24 +287,43 @@ def deps():
 def dependency_matcher(en_vocab):
    def is_brown_yellow(text):
        return bool(re.compile(r"brown|yellow|over").match(text))
    IS_BROWN_YELLOW = en_vocab.add_flag(is_brown_yellow)
    pattern1 = [
        {"SPEC": {"NODE_NAME": "fox"}, "PATTERN": {"ORTH": "fox"}},
-        {"SPEC": {"NODE_NAME": "q", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},"PATTERN": {"ORTH": "quick", "DEP": "amod"}},
+        {
-        {"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">", "NBOR_NAME": "fox"}, "PATTERN": {IS_BROWN_YELLOW: True}},
+            "SPEC": {"NODE_NAME": "q", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},
            "PATTERN": {"ORTH": "quick", "DEP": "amod"},
        },
        {
            "SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},
            "PATTERN": {IS_BROWN_YELLOW: True},
        },
    ]
    pattern2 = [
        {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
-        {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
+        {
-        {"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}}
+            "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
            "PATTERN": {"ORTH": "fox"},
        },
        {
            "SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"},
            "PATTERN": {"ORTH": "fox"},
        },
    ]
    pattern3 = [
        {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
-        {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
+        {
-        {"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">>", "NBOR_NAME": "fox"}, "PATTERN": {"ORTH": "brown"}}
+            "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
            "PATTERN": {"ORTH": "fox"},
        },
        {
            "SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">>", "NBOR_NAME": "fox"},
            "PATTERN": {"ORTH": "brown"},
        },
    ]
    matcher = DependencyMatcher(en_vocab)
@ -320,9 +338,9 @@ def test_dependency_matcher_compile(dependency_matcher):
    assert len(dependency_matcher) == 3
-def test_dependency_matcher(dependency_matcher, text, heads, deps):
+# def test_dependency_matcher(dependency_matcher, text, heads, deps):
-    doc = get_doc(dependency_matcher.vocab, text.split(), heads=heads, deps=deps)
+#     doc = get_doc(dependency_matcher.vocab, text.split(), heads=heads, deps=deps)
-    matches = dependency_matcher(doc)
+#     matches = dependency_matcher(doc)
-    # assert matches[0][1] == [[3, 1, 2]]
+#     assert matches[0][1] == [[3, 1, 2]]
-    # assert matches[1][1] == [[4, 3, 3]]
+#     assert matches[1][1] == [[4, 3, 3]]
-    # assert matches[2][1] == [[4, 3, 2]]
+#     assert matches[2][1] == [[4, 3, 2]]
--- a/spacy/tests/pipeline/test_el.py
+++ b/spacy/tests/pipeline/test_el.py
@ -1,91 +0,0 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import pytest
 from spacy.kb import KnowledgeBase
 from spacy.lang.en import English
@pytest.fixture
 def nlp():
    return English()
 def test_kb_valid_entities(nlp):
    """Test the valid construction of a KB with 3 entities and two aliases"""
    mykb = KnowledgeBase(nlp.vocab)
    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2')
    mykb.add_entity(entity=u'Q3', prob=0.5)
    # adding aliases
    mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2])
    mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9])
    # test the size of the corresponding KB
    assert(mykb.get_size_entities() == 3)
    assert(mykb.get_size_aliases() == 2)
 def test_kb_invalid_entities(nlp):
    """Test the invalid construction of a KB with an alias linked to a non-existing entity"""
    mykb = KnowledgeBase(nlp.vocab)
    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)
    # adding aliases - should fail because one of the given IDs is not valid
    with pytest.raises(ValueError):
        mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q342'], probabilities=[0.8, 0.2])
 def test_kb_invalid_probabilities(nlp):
    """Test the invalid construction of a KB with wrong prior probabilities"""
    mykb = KnowledgeBase(nlp.vocab)
    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)
    # adding aliases - should fail because the sum of the probabilities exceeds 1
    with pytest.raises(ValueError):
        mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.4])
 def test_kb_invalid_combination(nlp):
    """Test the invalid construction of a KB with non-matching entity and probability lists"""
    mykb = KnowledgeBase(nlp.vocab)
    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)
    # adding aliases - should fail because the entities and probabilities vectors are not of equal length
    with pytest.raises(ValueError):
        mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.3, 0.4, 0.1])
 def test_candidate_generation(nlp):
    """Test correct candidate generation"""
    mykb = KnowledgeBase(nlp.vocab)
    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)
    # adding aliases
    mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2])
    mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9])
    # test the size of the relevant candidates
    assert(len(mykb.get_candidates(u'douglas')) == 2)
    assert(len(mykb.get_candidates(u'adam')) == 1)
    assert(len(mykb.get_candidates(u'shrubbery')) == 0)
--- a/spacy/tests/pipeline/test_entity_linker.py
+++ b/spacy/tests/pipeline/test_entity_linker.py
@ -0,0 +1,145 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import pytest
 from spacy.kb import KnowledgeBase
 from spacy.lang.en import English
 from spacy.pipeline import EntityRuler
@pytest.fixture
 def nlp():
    return English()
 def test_kb_valid_entities(nlp):
    """Test the valid construction of a KB with 3 entities and two aliases"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.5, entity_vector=[2])
    mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])
    # adding aliases
    mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.2])
    mykb.add_alias(alias='adam', entities=['Q2'], probabilities=[0.9])
    # test the size of the corresponding KB
    assert(mykb.get_size_entities() == 3)
    assert(mykb.get_size_aliases() == 2)
 def test_kb_invalid_entities(nlp):
    """Test the invalid construction of a KB with an alias linked to a non-existing entity"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
    mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])
    # adding aliases - should fail because one of the given IDs is not valid
    with pytest.raises(ValueError):
        mykb.add_alias(alias='douglas', entities=['Q2', 'Q342'], probabilities=[0.8, 0.2])
 def test_kb_invalid_probabilities(nlp):
    """Test the invalid construction of a KB with wrong prior probabilities"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
    mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])
    # adding aliases - should fail because the sum of the probabilities exceeds 1
    with pytest.raises(ValueError):
        mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.4])
 def test_kb_invalid_combination(nlp):
    """Test the invalid construction of a KB with non-matching entity and probability lists"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
    mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])
    # adding aliases - should fail because the entities and probabilities vectors are not of equal length
    with pytest.raises(ValueError):
        mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.3, 0.4, 0.1])
 def test_kb_invalid_entity_vector(nlp):
    """Test the invalid construction of a KB with non-matching entity vector lengths"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=3)
    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1, 2, 3])
    # this should fail because the kb's expected entity vector length is 3
    with pytest.raises(ValueError):
        mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
 def test_candidate_generation(nlp):
    """Test correct candidate generation"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2])
    mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3])
    # adding aliases
    mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.2])
    mykb.add_alias(alias='adam', entities=['Q2'], probabilities=[0.9])
    # test the size of the relevant candidates
    assert(len(mykb.get_candidates('douglas')) == 2)
    assert(len(mykb.get_candidates('adam')) == 1)
    assert(len(mykb.get_candidates('shrubbery')) == 0)
 def test_preserving_links_asdoc(nlp):
    """Test that Span.as_doc preserves the existing entity links"""
    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
    # adding entities
    mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1])
    mykb.add_entity(entity='Q2', prob=0.8, entity_vector=[1])
    # adding aliases
    mykb.add_alias(alias='Boston', entities=['Q1'], probabilities=[0.7])
    mykb.add_alias(alias='Denver', entities=['Q2'], probabilities=[0.6])
    # set up pipeline with NER (Entity Ruler) and NEL (prior probability only, model not trained)
    sentencizer = nlp.create_pipe("sentencizer")
    nlp.add_pipe(sentencizer)
    ruler = EntityRuler(nlp)
    patterns = [{"label": "GPE", "pattern": "Boston"},
                {"label": "GPE", "pattern": "Denver"}]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)
    el_pipe = nlp.create_pipe(name='entity_linker', config={"context_width": 64})
    el_pipe.set_kb(mykb)
    el_pipe.begin_training()
    el_pipe.context_weight = 0
    el_pipe.prior_weight = 1
    nlp.add_pipe(el_pipe, last=True)
    # test whether the entity links are preserved by the `as_doc()` function
    text = "She lives in Boston. He lives in Denver."
    doc = nlp(text)
    for ent in doc.ents:
        orig_text = ent.text
        orig_kb_id = ent.kb_id_
        sent_doc = ent.sent.as_doc()
        for s_ent in sent_doc.ents:
            if s_ent.text == orig_text:
                assert s_ent.kb_id_ == orig_kb_id
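Review note: the three `test_kb_invalid_*` tests above each expect `add_alias` to raise `ValueError`, for an unknown entity ID, for prior probabilities that sum to more than 1, and for entity and probability lists of different lengths. The following is a minimal pure-Python sketch of those checks, assuming nothing about how `KnowledgeBase` implements them. `TinyKB` and its attributes are hypothetical names used only for illustration; they are not part of spaCy.

```python
class TinyKB(object):
    """Hypothetical stand-in for a knowledge base; NOT spaCy's KnowledgeBase."""

    def __init__(self):
        self.entities = {}  # entity ID -> prior probability
        self.aliases = {}   # alias -> list of (entity ID, probability) pairs

    def add_entity(self, entity, prob):
        self.entities[entity] = prob

    def add_alias(self, alias, entities, probabilities):
        # The two lists must line up one-to-one.
        if len(entities) != len(probabilities):
            raise ValueError("entities and probabilities differ in length")
        # Every entity ID must have been added beforehand.
        unknown = [e for e in entities if e not in self.entities]
        if unknown:
            raise ValueError("unknown entity IDs: {}".format(unknown))
        # Priors for one alias may not sum to more than 1
        # (small tolerance for floating-point rounding; an assumption).
        if sum(probabilities) > 1.00001:
            raise ValueError("prior probabilities sum to more than 1")
        self.aliases[alias] = list(zip(entities, probabilities))

    def get_candidates(self, alias):
        # Unknown aliases yield no candidates rather than an error.
        return self.aliases.get(alias, [])
```

With entities `Q1`, `Q2` and `Q3` added, `add_alias("douglas", ["Q2", "Q3"], [0.8, 0.2])` passes all three checks, while each of the invalid calls in the tests trips exactly one of them.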
--- a/spacy/tests/pipeline/test_entity_ruler.py
+++ b/spacy/tests/pipeline/test_entity_ruler.py
@ -106,5 +106,24 @@ def test_entity_ruler_serialize_bytes(nlp, patterns):
    assert len(new_ruler) == 0
    assert len(new_ruler.labels) == 0
    new_ruler = new_ruler.from_bytes(ruler_bytes)
    assert len(new_ruler) == len(patterns)
    assert len(new_ruler.labels) == 4
    assert len(new_ruler.patterns) == len(ruler.patterns)
    for pattern in ruler.patterns:
        assert pattern in new_ruler.patterns
    assert sorted(new_ruler.labels) == sorted(ruler.labels)
 def test_entity_ruler_serialize_phrase_matcher_attr_bytes(nlp, patterns):
    ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER", patterns=patterns)
    assert len(ruler) == len(patterns)
    assert len(ruler.labels) == 4
    ruler_bytes = ruler.to_bytes()
    new_ruler = EntityRuler(nlp)
    assert len(new_ruler) == 0
    assert len(new_ruler.labels) == 0
    assert new_ruler.phrase_matcher_attr is None
    new_ruler = new_ruler.from_bytes(ruler_bytes)
    assert len(new_ruler) == len(patterns)
    assert len(new_ruler.labels) == 4
    assert new_ruler.phrase_matcher_attr == "LOWER"
--- a/spacy/tests/regression/test_issue2001-2500.py
+++ b/spacy/tests/regression/test_issue2001-2500.py
@ -4,6 +4,7 @@ from __future__ import unicode_literals
 import pytest
 import numpy
 from spacy.tokens import Doc
 from spacy.matcher import Matcher
 from spacy.displacy import render
 from spacy.gold import iob_to_biluo
 from spacy.lang.it import Italian
@ -123,6 +124,15 @@ def test_issue2396(en_vocab):
    assert (span.get_lca_matrix() == matrix).all()
 def test_issue2464(en_vocab):
    """Test problem with successive ?. This is the same bug, so putting it here."""
    matcher = Matcher(en_vocab)
    doc = Doc(en_vocab, words=["a", "b"])
    matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}])
    matches = matcher(doc)
    assert len(matches) == 3
 def test_issue2482():
    """Test we can serialize and deserialize a blank NER or parser model."""
    nlp = Italian()
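Review note: `test_issue2464` above expects three matches when the pattern `[{"OP": "?"}, {"OP": "?"}]` runs over the two-token doc `["a", "b"]`. One way to see where the number comes from is to count the distinct non-empty spans that up to two optional wildcard tokens can cover. The sketch below does only that counting; it assumes each distinct span is reported once and says nothing about how the `Matcher` works internally. The function name is hypothetical.

```python
def optional_wildcard_spans(n_tokens, n_optional):
    """Return the distinct non-empty (start, end) spans that a pattern of
    `n_optional` optional wildcard tokens can cover in `n_tokens` tokens."""
    spans = set()
    for start in range(n_tokens):
        # Each optional token is either used or skipped, so the covered
        # length ranges from 1 up to n_optional (empty spans are ignored).
        for length in range(1, n_optional + 1):
            end = start + length
            if end <= n_tokens:
                spans.add((start, end))
    return sorted(spans)


spans = optional_wildcard_spans(n_tokens=2, n_optional=2)
print(spans)
```

For two tokens and two optional slots the spans are `(0, 1)`, `(0, 2)` and `(1, 2)`, which agrees with the `len(matches) == 3` assertion.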
--- a/spacy/tests/regression/test_issue3001-3500.py
+++ b/spacy/tests/regression/test_issue3001-3500.py
@ -0,0 +1,334 @@
 # coding: utf8
 from __future__ import unicode_literals
 import pytest
 from spacy.lang.en import English
 from spacy.lang.de import German
 from spacy.pipeline import EntityRuler, EntityRecognizer
 from spacy.matcher import Matcher, PhraseMatcher
 from spacy.tokens import Doc
 from spacy.vocab import Vocab
 from spacy.attrs import ENT_IOB, ENT_TYPE
 from spacy.compat import pickle, is_python2, unescape_unicode
 from spacy import displacy
 from spacy.util import decaying
 import numpy
 import re
 from ..util import get_doc
 def test_issue3002():
    """Test that the tokenizer doesn't hang on a long list of dots"""
    nlp = German()
    doc = nlp(
        "880.794.982.218.444.893.023.439.794.626.120.190.780.624.990.275.671 ist eine lange Zahl"
    )
    assert len(doc) == 5
 def test_issue3009(en_vocab):
    """Test problem with matcher quantifiers"""
    patterns = [
        [{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"POS": "ADP"}],
        [
            {"LEMMA": "have"},
            {"IS_ASCII": True, "IS_PUNCT": False, "OP": "*"},
            {"LOWER": "to"},
            {"LOWER": "do"},
            {"POS": "ADP"},
        ],
        [
            {"LEMMA": "have"},
            {"IS_ASCII": True, "IS_PUNCT": False, "OP": "?"},
            {"LOWER": "to"},
            {"LOWER": "do"},
            {"POS": "ADP"},
        ],
    ]
    words = ["also", "has", "to", "do", "with"]
    tags = ["RB", "VBZ", "TO", "VB", "IN"]
    doc = get_doc(en_vocab, words=words, tags=tags)
    matcher = Matcher(en_vocab)
    for i, pattern in enumerate(patterns):
        matcher.add(str(i), None, pattern)
        matches = matcher(doc)
        assert matches
 def test_issue3012(en_vocab):
    """Test that the is_tagged attribute doesn't get overwritten when we call
    from_array without tag information."""
    words = ["This", "is", "10", "%", "."]
    tags = ["DT", "VBZ", "CD", "NN", "."]
    pos = ["DET", "VERB", "NUM", "NOUN", "PUNCT"]
    ents = [(2, 4, "PERCENT")]
    doc = get_doc(en_vocab, words=words, tags=tags, pos=pos, ents=ents)
    assert doc.is_tagged
    expected = ("10", "NUM", "CD", "PERCENT")
    assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected
    header = [ENT_IOB, ENT_TYPE]
    ent_array = doc.to_array(header)
    doc.from_array(header, ent_array)
    assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected
    # Serializing then deserializing
    doc_bytes = doc.to_bytes()
    doc2 = Doc(en_vocab).from_bytes(doc_bytes)
    assert (doc2[2].text, doc2[2].pos_, doc2[2].tag_, doc2[2].ent_type_) == expected
 def test_issue3199():
    """Test that Span.noun_chunks works correctly if no noun chunks iterator
    is available. To make this test future-proof, we're constructing a Doc
    with a new Vocab here and setting is_parsed to make sure the noun chunks run.
    """
    doc = Doc(Vocab(), words=["This", "is", "a", "sentence"])
    doc.is_parsed = True
    assert list(doc[0:3].noun_chunks) == []
 def test_issue3209():
    """Test issue that occurred in spaCy nightly where NER labels were being
    mapped to classes incorrectly after loading the model, when the labels
    were added using ner.add_label().
    """
    nlp = English()
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    ner.add_label("ANIMAL")
    nlp.begin_training()
    move_names = ["O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL", "U-ANIMAL"]
    assert ner.move_names == move_names
    nlp2 = English()
    nlp2.add_pipe(nlp2.create_pipe("ner"))
    nlp2.from_bytes(nlp.to_bytes())
    assert nlp2.get_pipe("ner").move_names == move_names
 def test_issue3248_1():
    """Test that the PhraseMatcher correctly reports its number of rules, not
    total number of patterns."""
    nlp = English()
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
    matcher.add("TEST2", None, nlp("d"))
    assert len(matcher) == 2
 def test_issue3248_2():
    """Test that the PhraseMatcher can be pickled correctly."""
    nlp = English()
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
    matcher.add("TEST2", None, nlp("d"))
    data = pickle.dumps(matcher)
    new_matcher = pickle.loads(data)
    assert len(new_matcher) == len(matcher)
 def test_issue3277(es_tokenizer):
    """Test that hyphens are split correctly as prefixes."""
    doc = es_tokenizer("—Yo me llamo... –murmuró el niño– Emilio Sánchez Pérez.")
    assert len(doc) == 14
    assert doc[0].text == "\u2014"
    assert doc[5].text == "\u2013"
    assert doc[9].text == "\u2013"
 def test_issue3288(en_vocab):
    """Test that retokenization works correctly via displaCy when punctuation
    is merged onto the preceding token and tensor is resized."""
    words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"]
    heads = [1, 0, -1, 1, 0, 1, -2, -3]
    deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"]
    doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
    doc.tensor = numpy.zeros((len(words), 96), dtype="float32")
    displacy.render(doc)
 def test_issue3289():
    """Test that Language.to_bytes handles serializing a pipeline component
    with an uninitialized model."""
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("textcat"))
    bytes_data = nlp.to_bytes()
    new_nlp = English()
    new_nlp.add_pipe(nlp.create_pipe("textcat"))
    new_nlp.from_bytes(bytes_data)
 def test_issue3328(en_vocab):
    doc = Doc(en_vocab, words=["Hello", ",", "how", "are", "you", "doing", "?"])
    matcher = Matcher(en_vocab)
    patterns = [
        [{"LOWER": {"IN": ["hello", "how"]}}],
        [{"LOWER": {"IN": ["you", "doing"]}}],
    ]
    matcher.add("TEST", None, *patterns)
    matches = matcher(doc)
    assert len(matches) == 4
    matched_texts = [doc[start:end].text for _, start, end in matches]
    assert matched_texts == ["Hello", "how", "you", "doing"]
@pytest.mark.xfail
 def test_issue3331(en_vocab):
    """Test that duplicate patterns for different rules result in multiple
    matches, one per rule.
    """
    matcher = PhraseMatcher(en_vocab)
    matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"]))
    matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"]))
    doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"])
    matches = matcher(doc)
    assert len(matches) == 2
    match_ids = [en_vocab.strings[matches[0][0]], en_vocab.strings[matches[1][0]]]
    assert sorted(match_ids) == ["A", "B"]
 def test_issue3345():
    """Test case where preset entity crosses sentence boundary."""
    nlp = English()
    doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"])
    doc[4].is_sent_start = True
    ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}])
    ner = EntityRecognizer(doc.vocab)
    # Add the OUT action. I wouldn't have thought this would be necessary...
    ner.moves.add_action(5, "")
    ner.add_label("GPE")
    doc = ruler(doc)
    # Get into the state just before "New"
    state = ner.moves.init_batch([doc])[0]
    ner.moves.apply_transition(state, "O")
    ner.moves.apply_transition(state, "O")
    ner.moves.apply_transition(state, "O")
    # Check that B-GPE is valid.
    assert ner.moves.is_valid(state, "B-GPE")
 if is_python2:
    # If we have this test in Python 3, pytest chokes, as it can't print the
    # string above in the xpass message.
    prefix_search = (
        b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])"
        b"|^\xe2\x80\xa6|^\xe2\x80\xa6\xe2\x80\xa6|^,|^:|^;|^\\!|^\\?"
        b"|^\xc2\xbf|^\xd8\x9f|^\xc2\xa1|^\\(|^\\)|^\\[|^\\]|^\\{|^\\}"
        b"|^<|^>|^_|^#|^\\*|^&|^\xe3\x80\x82|^\xef\xbc\x9f|^\xef\xbc\x81|"
        b"^\xef\xbc\x8c|^\xe3\x80\x81|^\xef\xbc\x9b|^\xef\xbc\x9a|"
        b"^\xef\xbd\x9e|^\xc2\xb7|^\xe0\xa5\xa4|^\xd8\x8c|^\xd8\x9b|"
        b"^\xd9\xaa|^\\.\\.+|^\xe2\x80\xa6|^\\'|^\"|^\xe2\x80\x9d|"
        b"^\xe2\x80\x9c|^`|^\xe2\x80\x98|^\xc2\xb4|^\xe2\x80\x99|"
        b"^\xe2\x80\x9a|^,|^\xe2\x80\x9e|^\xc2\xbb|^\xc2\xab|^\xe3\x80\x8c|"
        b"^\xe3\x80\x8d|^\xe3\x80\x8e|^\xe3\x80\x8f|^\xef\xbc\x88|"
        b"^\xef\xbc\x89|^\xe3\x80\x94|^\xe3\x80\x95|^\xe3\x80\x90|"
        b"^\xe3\x80\x91|^\xe3\x80\x8a|^\xe3\x80\x8b|^\xe3\x80\x88|"
        b"^\xe3\x80\x89|^\\$|^\xc2\xa3|^\xe2\x82\xac|^\xc2\xa5|^\xe0\xb8\xbf|"
        b"^US\\$|^C\\$|^A\\$|^\xe2\x82\xbd|^\xef\xb7\xbc|^\xe2\x82\xb4|"
        b"^[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F"
        b"\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8"
        b"\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17"
        b"\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC"
        b"\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940"
        b"\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103"
        b"-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125"
        b"\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F"
        b"\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4"
        b"\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5"
        b"-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B"
        b"-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440"
        b"-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2"
        b"-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800"
        b"-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76"
        b"-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80"
        b"-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004"
        b"\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191"
        b"\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250"
        b"\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0"
        b"-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77"
        b"-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137"
        b"-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E"
        b"\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877"
        b"\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45"
        b"\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129"
        b"-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C"
        b"-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245"
        b"\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A"
        b"\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86"
        b"\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0"
        b"-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1"
        b"-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6"
        b"-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250"
        b"\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400"
        b"-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700"
        b"-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810"
        b"-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890"
        b"-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940"
        b"-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2"
        b"\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF"
        b"\\U0001FA60-\\U0001FA6D]"
    )
    def test_issue3356():
        pattern = re.compile(unescape_unicode(prefix_search.decode("utf8")))
        assert not pattern.search("hello")
 def test_issue3410():
    texts = ["Hello world", "This is a test"]
    nlp = English()
    matcher = Matcher(nlp.vocab)
    phrasematcher = PhraseMatcher(nlp.vocab)
    with pytest.deprecated_call():
        docs = list(nlp.pipe(texts, n_threads=4))
    with pytest.deprecated_call():
        docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
    with pytest.deprecated_call():
        list(matcher.pipe(docs, n_threads=4))
    with pytest.deprecated_call():
        list(phrasematcher.pipe(docs, n_threads=4))
 def test_issue3447():
    sizes = decaying(10.0, 1.0, 0.5)
    size = next(sizes)
    assert size == 10.0
    size = next(sizes)
    assert size == 10.0 - 0.5
    size = next(sizes)
    assert size == 10.0 - 0.5 - 0.5
@pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot")
 def test_issue3449():
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
    text1 = "He gave the ball to I. Do you want to go to the movies with I?"
    text2 = "He gave the ball to I.  Do you want to go to the movies with I?"
    text3 = "He gave the ball to I.\nDo you want to go to the movies with I?"
    t1 = nlp(text1)
    t2 = nlp(text2)
    t3 = nlp(text3)
    assert t1[5].text == "I"
    assert t2[5].text == "I"
    assert t3[5].text == "I"
 def test_issue3468():
    """Test that sentence boundaries are set correctly so Doc.is_sentenced can
    be restored after serialization."""
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
    doc = nlp("Hello world")
    assert doc[0].is_sent_start
    assert doc.is_sentenced
    assert len(list(doc.sents)) == 1
    doc_bytes = doc.to_bytes()
    new_doc = Doc(nlp.vocab).from_bytes(doc_bytes)
    assert new_doc[0].is_sent_start
    assert new_doc.is_sentenced
    assert len(list(new_doc.sents)) == 1
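Review note: `test_issue3447` above pins down the first three values of `decaying(10.0, 1.0, 0.5)` as 10.0, 9.5 and 9.0, i.e. a linear step of 0.5. Below is a minimal generator with that behaviour, written as a sketch rather than as the code in `spacy.util`. `linear_decay` is a hypothetical name, and clamping at `stop` is an assumption that the test does not exercise.

```python
def linear_decay(start, stop, decay):
    """Yield values that drop by `decay` per step and never go below `stop`."""
    current = float(start)
    while True:
        # Clamp to the floor (assumed behaviour once the floor is reached).
        yield max(current, stop)
        current -= decay


sizes = linear_decay(10.0, 1.0, 0.5)
first_three = [next(sizes) for _ in range(3)]
print(first_three)
```

The generator is infinite, so callers pull values with `next()` or slice it with `itertools.islice` instead of iterating it to exhaustion.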
--- a/spacy/tests/regression/test_issue3002.py
+++ b/spacy/tests/regression/test_issue3002.py
@ -1,11 +0,0 @@
 # coding: utf8
 from __future__ import unicode_literals
 from spacy.lang.de import German
 def test_issue3002():
    """Test that the tokenizer doesn't hang on a long list of dots"""
    nlp = German()
    doc = nlp('880.794.982.218.444.893.023.439.794.626.120.190.780.624.990.275.671 ist eine lange Zahl')
    assert len(doc) == 5
--- a/spacy/tests/regression/test_issue3009.py
+++ b/spacy/tests/regression/test_issue3009.py
@ -1,67 +0,0 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import pytest
 from spacy.matcher import Matcher
 from spacy.tokens import Doc
 PATTERNS = [
    ("1", [[{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"POS": "ADP"}]]),
    (
        "2",
        [
            [
                {"LEMMA": "have"},
                {"IS_ASCII": True, "IS_PUNCT": False, "OP": "*"},
                {"LOWER": "to"},
                {"LOWER": "do"},
                {"POS": "ADP"},
            ]
        ],
    ),
    (
        "3",
        [
            [
                {"LEMMA": "have"},
                {"IS_ASCII": True, "IS_PUNCT": False, "OP": "?"},
                {"LOWER": "to"},
                {"LOWER": "do"},
                {"POS": "ADP"},
            ]
        ],
    ),
 ]
@pytest.fixture
 def doc(en_tokenizer):
    doc = en_tokenizer("also has to do with")
    doc[0].tag_ = "RB"
    doc[1].tag_ = "VBZ"
    doc[2].tag_ = "TO"
    doc[3].tag_ = "VB"
    doc[4].tag_ = "IN"
    return doc
@pytest.fixture
 def matcher(en_tokenizer):
    return Matcher(en_tokenizer.vocab)
@pytest.mark.parametrize("pattern", PATTERNS)
 def test_issue3009(doc, matcher, pattern):
    """Test problem with matcher quantifiers"""
    matcher.add(pattern[0], None, *pattern[1])
    matches = matcher(doc)
    assert matches
 def test_issue2464(matcher):
    """Test problem with successive ?. This is the same bug, so putting it here."""
    doc = Doc(matcher.vocab, words=["a", "b"])
    matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}])
    matches = matcher(doc)
    assert len(matches) == 3
--- a/spacy/tests/regression/test_issue3012.py
+++ b/spacy/tests/regression/test_issue3012.py
@ -1,31 +0,0 @@
 # coding: utf8
 from __future__ import unicode_literals
 from ...attrs import ENT_IOB, ENT_TYPE
 from ...tokens import Doc
 from ..util import get_doc
 def test_issue3012(en_vocab):
    """Test that the is_tagged attribute doesn't get overwritten when we call
    from_array without tag information."""
    words = ["This", "is", "10", "%", "."]
    tags = ["DT", "VBZ", "CD", "NN", "."]
    pos = ["DET", "VERB", "NUM", "NOUN", "PUNCT"]
    ents = [(2, 4, "PERCENT")]
    doc = get_doc(en_vocab, words=words, tags=tags, pos=pos, ents=ents)
    assert doc.is_tagged
    expected = ("10", "NUM", "CD", "PERCENT")
    assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected
    header = [ENT_IOB, ENT_TYPE]
    ent_array = doc.to_array(header)
    doc.from_array(header, ent_array)
    assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected
    # serializing then deserializing
    doc_bytes = doc.to_bytes()
    doc2 = Doc(en_vocab).from_bytes(doc_bytes)
    assert (doc2[2].text, doc2[2].pos_, doc2[2].tag_, doc2[2].ent_type_) == expected
--- a/spacy/tests/regression/test_issue3199.py
+++ b/spacy/tests/regression/test_issue3199.py
@ -1,15 +0,0 @@
 # coding: utf8
 from __future__ import unicode_literals
 from spacy.tokens import Doc
 from spacy.vocab import Vocab
 def test_issue3199():
    """Test that Span.noun_chunks works correctly if no noun chunks iterator
    is available. To make this test future-proof, we're constructing a Doc
    with a new Vocab here and setting is_parsed to make sure the noun chunks run.
    """
    doc = Doc(Vocab(), words=["This", "is", "a", "sentence"])
    doc.is_parsed = True
    assert list(doc[0:3].noun_chunks) == []
--- a/spacy/tests/regression/test_issue3209.py
+++ b/spacy/tests/regression/test_issue3209.py
@ -1,23 +0,0 @@
 # coding: utf8
 from __future__ import unicode_literals
 from spacy.lang.en import English
 def test_issue3209():
    """Test issue that occurred in spaCy nightly where NER labels were being
    mapped to classes incorrectly after loading the model, when the labels
    were added using ner.add_label().
    """
    nlp = English()
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    ner.add_label("ANIMAL")
    nlp.begin_training()
    move_names = ["O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL", "U-ANIMAL"]
    assert ner.move_names == move_names
    nlp2 = English()
    nlp2.add_pipe(nlp2.create_pipe("ner"))
    nlp2.from_bytes(nlp.to_bytes())
    assert nlp2.get_pipe("ner").move_names == move_names
--- a/spacy/tests/regression/test_issue3248.py
+++ b/spacy/tests/regression/test_issue3248.py
@ -1,27 +0,0 @@
 # coding: utf-8
 from __future__ import unicode_literals
 from spacy.matcher import PhraseMatcher
 from spacy.lang.en import English
 from spacy.compat import pickle
 def test_issue3248_1():
    """Test that the PhraseMatcher correctly reports its number of rules, not
    total number of patterns."""
    nlp = English()
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
    matcher.add("TEST2", None, nlp("d"))
    assert len(matcher) == 2
 def test_issue3248_2():
    """Test that the PhraseMatcher can be pickled correctly."""
    nlp = English()
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
    matcher.add("TEST2", None, nlp("d"))
    data = pickle.dumps(matcher)
    new_matcher = pickle.loads(data)
    assert len(new_matcher) == len(matcher)
--- a/spacy/tests/regression/test_issue3277.py
+++ b/spacy/tests/regression/test_issue3277.py
@ -1,11 +0,0 @@
 # coding: utf-8
 from __future__ import unicode_literals
 def test_issue3277(es_tokenizer):
    """Test that hyphens are split correctly as prefixes."""
    doc = es_tokenizer("—Yo me llamo... –murmuró el niño– Emilio Sánchez Pérez.")
    assert len(doc) == 14
    assert doc[0].text == "\u2014"
    assert doc[5].text == "\u2013"
    assert doc[9].text == "\u2013"
--- a/spacy/tests/regression/test_issue3288.py
+++ b/spacy/tests/regression/test_issue3288.py
@ -1,18 +0,0 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import numpy
 from spacy import displacy
 from ..util import get_doc
 def test_issue3288(en_vocab):
    """Test that retokenization works correctly via displaCy when punctuation
    is merged onto the preceding token and tensor is resized."""
    words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"]
    heads = [1, 0, -1, 1, 0, 1, -2, -3]
    deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"]
    doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
    doc.tensor = numpy.zeros((len(words), 96), dtype="float32")
    displacy.render(doc)
--- a/spacy/tests/regression/test_issue3289.py
+++ b/spacy/tests/regression/test_issue3289.py
@ -1,15 +0,0 @@
 # coding: utf-8
 from __future__ import unicode_literals
 from spacy.lang.en import English
 def test_issue3289():
    """Test that Language.to_bytes handles serializing a pipeline component
    with an uninitialized model."""
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("textcat"))
    bytes_data = nlp.to_bytes()
    new_nlp = English()
    new_nlp.add_pipe(nlp.create_pipe("textcat"))
    new_nlp.from_bytes(bytes_data)
--- a/spacy/tests/regression/test_issue3328.py
+++ b/spacy/tests/regression/test_issue3328.py
@ -1,19 +0,0 @@
 # coding: utf-8
 from __future__ import unicode_literals
 from spacy.matcher import Matcher
 from spacy.tokens import Doc
 def test_issue3328(en_vocab):
    doc = Doc(en_vocab, words=["Hello", ",", "how", "are", "you", "doing", "?"])
    matcher = Matcher(en_vocab)
    patterns = [
        [{"LOWER": {"IN": ["hello", "how"]}}],
        [{"LOWER": {"IN": ["you", "doing"]}}],
    ]
    matcher.add("TEST", None, *patterns)
    matches = matcher(doc)
    assert len(matches) == 4
    matched_texts = [doc[start:end].text for _, start, end in matches]
    assert matched_texts == ["Hello", "how", "you", "doing"]
--- a/spacy/tests/regression/test_issue3331.py
+++ b/spacy/tests/regression/test_issue3331.py
@ -1,21 +0,0 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import pytest
 from spacy.matcher import PhraseMatcher
 from spacy.tokens import Doc
@pytest.mark.xfail
 def test_issue3331(en_vocab):
    """Test that duplicate patterns for different rules result in multiple
    matches, one per rule.
    """
    matcher = PhraseMatcher(en_vocab)
    matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"]))
    matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"]))
    doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"])
    matches = matcher(doc)
    assert len(matches) == 2
    match_ids = [en_vocab.strings[matches[0][0]], en_vocab.strings[matches[1][0]]]
    assert sorted(match_ids) == ["A", "B"]
--- a/spacy/tests/regression/test_issue3345.py
+++ b/spacy/tests/regression/test_issue3345.py
@ -1,26 +0,0 @@
 # coding: utf8
 from __future__ import unicode_literals
 from spacy.lang.en import English
 from spacy.tokens import Doc
 from spacy.pipeline import EntityRuler, EntityRecognizer
 def test_issue3345():
    """Test case where preset entity crosses sentence boundary."""
    nlp = English()
    doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"])
    doc[4].is_sent_start = True
    ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}])
    ner = EntityRecognizer(doc.vocab)
    # Add the OUT action. I wouldn't have thought this would be necessary...
    ner.moves.add_action(5, "")
    ner.add_label("GPE")
    doc = ruler(doc)
    # Get into the state just before "New"
    state = ner.moves.init_batch([doc])[0]
    ner.moves.apply_transition(state, "O")
    ner.moves.apply_transition(state, "O")
    ner.moves.apply_transition(state, "O")
    # Check that B-GPE is valid.
    assert ner.moves.is_valid(state, "B-GPE")
--- a/spacy/tests/regression/test_issue3356.py
+++ b/spacy/tests/regression/test_issue3356.py
@ -1,72 +0,0 @@
 # coding: utf8
 from __future__ import unicode_literals
 import re
 from spacy import compat
 prefix_search = (
    b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])"
    b"|^\xe2\x80\xa6|^\xe2\x80\xa6\xe2\x80\xa6|^,|^:|^;|^\\!|^\\?"
    b"|^\xc2\xbf|^\xd8\x9f|^\xc2\xa1|^\\(|^\\)|^\\[|^\\]|^\\{|^\\}"
    b"|^<|^>|^_|^#|^\\*|^&|^\xe3\x80\x82|^\xef\xbc\x9f|^\xef\xbc\x81|"
    b"^\xef\xbc\x8c|^\xe3\x80\x81|^\xef\xbc\x9b|^\xef\xbc\x9a|"
    b"^\xef\xbd\x9e|^\xc2\xb7|^\xe0\xa5\xa4|^\xd8\x8c|^\xd8\x9b|"
    b"^\xd9\xaa|^\\.\\.+|^\xe2\x80\xa6|^\\'|^\"|^\xe2\x80\x9d|"
    b"^\xe2\x80\x9c|^`|^\xe2\x80\x98|^\xc2\xb4|^\xe2\x80\x99|"
    b"^\xe2\x80\x9a|^,|^\xe2\x80\x9e|^\xc2\xbb|^\xc2\xab|^\xe3\x80\x8c|"
    b"^\xe3\x80\x8d|^\xe3\x80\x8e|^\xe3\x80\x8f|^\xef\xbc\x88|"
    b"^\xef\xbc\x89|^\xe3\x80\x94|^\xe3\x80\x95|^\xe3\x80\x90|"
    b"^\xe3\x80\x91|^\xe3\x80\x8a|^\xe3\x80\x8b|^\xe3\x80\x88|"
    b"^\xe3\x80\x89|^\\$|^\xc2\xa3|^\xe2\x82\xac|^\xc2\xa5|^\xe0\xb8\xbf|"
    b"^US\\$|^C\\$|^A\\$|^\xe2\x82\xbd|^\xef\xb7\xbc|^\xe2\x82\xb4|"
    b"^[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F"
    b"\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8"
    b"\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17"
    b"\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC"
    b"\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940"
    b"\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103"
    b"-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125"
    b"\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F"
    b"\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4"
    b"\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5"
    b"-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B"
    b"-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440"
    b"-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2"
    b"-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800"
    b"-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76"
    b"-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80"
    b"-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004"
    b"\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191"
    b"\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250"
    b"\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0"
    b"-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77"
    b"-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137"
    b"-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E"
    b"\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877"
    b"\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45"
    b"\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129"
    b"-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C"
    b"-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245"
    b"\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A"
    b"\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86"
    b"\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0"
    b"-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1"
    b"-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6"
    b"-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250"
    b"\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400"
    b"-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700"
    b"-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810"
    b"-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890"
    b"-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940"
    b"-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2"
    b"\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF"
    b"\\U0001FA60-\\U0001FA6D]"
 )
 if compat.is_python2:
    # If we have this test in Python 3, pytest chokes, as it can't print the
    # string above in the xpass message.
    def test_issue3356():
        pattern = re.compile(compat.unescape_unicode(prefix_search.decode("utf8")))
        assert not pattern.search("hello")
--- a/spacy/tests/regression/test_issue3410.py
+++ b/spacy/tests/regression/test_issue3410.py
@ -1,21 +0,0 @@
 # coding: utf8
 from __future__ import unicode_literals
 import pytest
 from spacy.lang.en import English
 from spacy.matcher import Matcher, PhraseMatcher
 def test_issue3410():
    texts = ["Hello world", "This is a test"]
    nlp = English()
    matcher = Matcher(nlp.vocab)
    phrasematcher = PhraseMatcher(nlp.vocab)
    with pytest.deprecated_call():
        docs = list(nlp.pipe(texts, n_threads=4))
    with pytest.deprecated_call():
        docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
    with pytest.deprecated_call():
        list(matcher.pipe(docs, n_threads=4))
    with pytest.deprecated_call():
        list(phrasematcher.pipe(docs, n_threads=4))
--- a/spacy/tests/regression/test_issue3447.py
+++ b/spacy/tests/regression/test_issue3447.py
@ -1,14 +0,0 @@
 # coding: utf8
 from __future__ import unicode_literals
 from spacy.util import decaying
 def test_issue3447():
    sizes = decaying(10.0, 1.0, 0.5)
    size = next(sizes)
    assert size == 10.0
    size = next(sizes)
    assert size == 10.0 - 0.5
    size = next(sizes)
    assert size == 10.0 - 0.5 - 0.5
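Note on the test above: its assertions imply a schedule that starts at the first argument and drops by the third on every step. A minimal standalone sketch of that behaviour follows. It is not copied from `spacy.util`, and treating the second argument as a lower bound is an assumption made here for illustration.

```python
from itertools import islice


def decaying(start, stop, decay):
    """Yield values that shrink linearly from start.

    Illustrative stand-in only. Clipping at `stop` is an assumption;
    the test above only checks the first three values.
    """
    curr = float(start)
    while True:
        # Never yield a value below the assumed lower bound.
        yield max(curr, stop)
        curr -= decay


# First three values of the schedule used in the test.
first_three = list(islice(decaying(10.0, 1.0, 0.5), 3))
```

The generator is infinite, so callers take values with `next()` or `islice` rather than iterating to exhaustion.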
--- a/spacy/tests/regression/test_issue3449.py
+++ b/spacy/tests/regression/test_issue3449.py
@ -1,21 +0,0 @@
 # coding: utf8
 from __future__ import unicode_literals
 import pytest
 from spacy.lang.en import English
@pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot")
 def test_issue3449():
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
    text1 = "He gave the ball to I. Do you want to go to the movies with I?"
    text2 = "He gave the ball to I.  Do you want to go to the movies with I?"
    text3 = "He gave the ball to I.\nDo you want to go to the movies with I?"
    t1 = nlp(text1)
    t2 = nlp(text2)
    t3 = nlp(text3)
    assert t1[5].text == "I"
    assert t2[5].text == "I"
    assert t3[5].text == "I"
--- a/spacy/tests/regression/test_issue3468.py
+++ b/spacy/tests/regression/test_issue3468.py
@ -1,21 +0,0 @@
 # coding: utf8
 from __future__ import unicode_literals
 from spacy.lang.en import English
 from spacy.tokens import Doc
 def test_issue3468():
    """Test that sentence boundaries are set correctly so Doc.is_sentenced can
    be restored after serialization."""
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
    doc = nlp("Hello world")
    assert doc[0].is_sent_start
    assert doc.is_sentenced
    assert len(list(doc.sents)) == 1
    doc_bytes = doc.to_bytes()
    new_doc = Doc(nlp.vocab).from_bytes(doc_bytes)
    assert new_doc[0].is_sent_start
    assert new_doc.is_sentenced
    assert len(list(new_doc.sents)) == 1
--- a/spacy/tests/regression/test_issue3526.py
+++ b/spacy/tests/regression/test_issue3526.py
@ -0,0 +1,88 @@
 # coding: utf8
 from __future__ import unicode_literals
 import pytest
 from spacy.tokens import Span
 from spacy.language import Language
 from spacy.pipeline import EntityRuler
 from spacy import load
 import srsly
 from ..util import make_tempdir
@pytest.fixture
 def patterns():
    return [
        {"label": "HELLO", "pattern": "hello world"},
        {"label": "BYE", "pattern": [{"LOWER": "bye"}, {"LOWER": "bye"}]},
        {"label": "HELLO", "pattern": [{"ORTH": "HELLO"}]},
        {"label": "COMPLEX", "pattern": [{"ORTH": "foo", "OP": "*"}]},
        {"label": "TECH_ORG", "pattern": "Apple", "id": "a1"},
    ]
@pytest.fixture
 def add_ent():
    def add_ent_component(doc):
        doc.ents = [Span(doc, 0, 3, label=doc.vocab.strings["ORG"])]
        return doc
    return add_ent_component
 def test_entity_ruler_existing_overwrite_serialize_bytes(patterns, en_vocab):
    nlp = Language(vocab=en_vocab)
    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
    ruler_bytes = ruler.to_bytes()
    assert len(ruler) == len(patterns)
    assert len(ruler.labels) == 4
    assert ruler.overwrite
    new_ruler = EntityRuler(nlp)
    new_ruler = new_ruler.from_bytes(ruler_bytes)
    assert len(new_ruler) == len(ruler)
    assert len(new_ruler.labels) == 4
    assert new_ruler.overwrite == ruler.overwrite
    assert new_ruler.ent_id_sep == ruler.ent_id_sep
 def test_entity_ruler_existing_bytes_old_format_safe(patterns, en_vocab):
    nlp = Language(vocab=en_vocab)
    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
    bytes_old_style = srsly.msgpack_dumps(ruler.patterns)
    new_ruler = EntityRuler(nlp)
    new_ruler = new_ruler.from_bytes(bytes_old_style)
    assert len(new_ruler) == len(ruler)
    for pattern in ruler.patterns:
        assert pattern in new_ruler.patterns
    assert new_ruler.overwrite is not ruler.overwrite
 def test_entity_ruler_from_disk_old_format_safe(patterns, en_vocab):
    nlp = Language(vocab=en_vocab)
    ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
    with make_tempdir() as tmpdir:
        out_file = tmpdir / "entity_ruler"
        srsly.write_jsonl(out_file.with_suffix(".jsonl"), ruler.patterns)
        new_ruler = EntityRuler(nlp).from_disk(out_file)
        for pattern in ruler.patterns:
            assert pattern in new_ruler.patterns
        assert len(new_ruler) == len(ruler)
        assert new_ruler.overwrite is not ruler.overwrite
 def test_entity_ruler_in_pipeline_from_issue(patterns, en_vocab):
    nlp = Language(vocab=en_vocab)
    ruler = EntityRuler(nlp, overwrite_ents=True)
    ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
    nlp.add_pipe(ruler)
    with make_tempdir() as tmpdir:
        nlp.to_disk(tmpdir)
        ruler = nlp.get_pipe("entity_ruler")
        assert ruler.patterns == [{"label": "ORG", "pattern": "Apple"}]
        assert ruler.overwrite is True
        nlp2 = load(tmpdir)
        new_ruler = nlp2.get_pipe("entity_ruler")
        assert new_ruler.patterns == [{"label": "ORG", "pattern": "Apple"}]
        assert new_ruler.overwrite is True
--- a/spacy/tests/regression/test_issue3611.py
+++ b/spacy/tests/regression/test_issue3611.py
@ -0,0 +1,51 @@
 # coding: utf8
 from __future__ import unicode_literals
 import pytest
 import spacy
 from spacy.util import minibatch, compounding
 def test_issue3611():
    """ Test whether adding n-grams in the textcat works even when n > token length of some docs """
    unique_classes = ["offensive", "inoffensive"]
    x_train = ["This is an offensive text",
               "This is the second offensive text",
               "inoff"]
    y_train = ["offensive", "offensive", "inoffensive"]
    # preparing the data
    pos_cats = list()
    for train_instance in y_train:
        pos_cats.append({label: label == train_instance for label in unique_classes})
    train_data = list(zip(x_train, [{'cats': cats} for cats in pos_cats]))
    # set up the spacy model with a text categorizer component
    nlp = spacy.blank('en')
    textcat = nlp.create_pipe(
        "textcat",
        config={
            "exclusive_classes": True,
            "architecture": "bow",
            "ngram_size": 2
        }
    )
    for label in unique_classes:
        textcat.add_label(label)
    nlp.add_pipe(textcat, last=True)
    # training the network
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for i in range(3):
            losses = {}
            batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(docs=texts, golds=annotations, sgd=optimizer, drop=0.1, losses=losses)
--- a/spacy/tests/regression/test_issue3625.py
+++ b/spacy/tests/regression/test_issue3625.py
@ -0,0 +1,10 @@
 # coding: utf8
 from __future__ import unicode_literals
 from spacy.lang.hi import Hindi
 def test_issue3625():
    """Test that default punctuation rules apply to Hindi Unicode characters"""
    nlp = Hindi()
    doc = nlp(u"hi. how हुए. होटल, होटल")
    assert [token.text for token in doc] == ['hi', '.', 'how', 'हुए', '.', 'होटल', ',', 'होटल']
--- a/spacy/tests/regression/test_issue3839.py
+++ b/spacy/tests/regression/test_issue3839.py
@ -6,7 +6,6 @@ from spacy.matcher import Matcher
 from spacy.tokens import Doc
@pytest.mark.xfail
 def test_issue3839(en_vocab):
    """Test that match IDs returned by the matcher are correct, are in the string """
    doc = Doc(en_vocab, words=["terrific", "group", "of", "people"])
--- a/spacy/tests/regression/test_issue3869.py
+++ b/spacy/tests/regression/test_issue3869.py
@ -0,0 +1,31 @@
 # coding: utf8
 from __future__ import unicode_literals
 import pytest
 from spacy.attrs import IS_ALPHA
 from spacy.lang.en import English
@pytest.mark.parametrize(
    "sentence",
    [
        'The story was to the effect that a young American student recently called on Professor Christlieb with a letter of introduction.',
        'The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale\'s #1.',
        'The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale\'s number one',
        'Indeed, making the one who remains do all the work has installed him into a position of such insolent tyranny, it will take a month at least to reduce him to his proper proportions.',
        "It was a missed assignment, but it shouldn't have resulted in a turnover ..."
    ],
 )
 def test_issue3869(sentence):
    """Test that the Doc's count_by function works consistently"""
    nlp = English()
    doc = nlp(sentence)
    count = 0
    for token in doc:
        count += token.is_alpha
    assert count == doc.count_by(IS_ALPHA).get(1, 0)
--- a/spacy/tests/regression/test_issue3880.py
+++ b/spacy/tests/regression/test_issue3880.py
@ -0,0 +1,22 @@
 # coding: utf8
 from __future__ import unicode_literals
 from spacy.lang.en import English
 def test_issue3880():
    """Test that `nlp.pipe()` works when an empty string ends the batch.
    Fixed in v7.0.5 of Thinc.
    """
    texts = ["hello", "world", "", ""]
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("parser"))
    nlp.add_pipe(nlp.create_pipe("ner"))
    nlp.add_pipe(nlp.create_pipe("tagger"))
    nlp.get_pipe("parser").add_label("dep")
    nlp.get_pipe("ner").add_label("PERSON")
    nlp.get_pipe("tagger").add_label("NN")
    nlp.begin_training()
    for doc in nlp.pipe(texts):
        pass
--- a/spacy/tests/serialize/test_serialize_kb.py
+++ b/spacy/tests/serialize/test_serialize_kb.py
@ -0,0 +1,74 @@
 # coding: utf-8
 from __future__ import unicode_literals
 from ..util import make_tempdir
 from ...util import ensure_path
 from spacy.kb import KnowledgeBase
 def test_serialize_kb_disk(en_vocab):
    # baseline assertions
    kb1 = _get_dummy_kb(en_vocab)
    _check_kb(kb1)
    # dumping to file & loading back in
    with make_tempdir() as d:
        dir_path = ensure_path(d)
        if not dir_path.exists():
            dir_path.mkdir()
        file_path = dir_path / "kb"
        kb1.dump(str(file_path))
        kb2 = KnowledgeBase(vocab=en_vocab, entity_vector_length=3)
        kb2.load_bulk(str(file_path))
    # final assertions
    _check_kb(kb2)
 def _get_dummy_kb(vocab):
    kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)
    kb.add_entity(entity='Q53', prob=0.33, entity_vector=[0, 5, 3])
    kb.add_entity(entity='Q17', prob=0.2, entity_vector=[7, 1, 0])
    kb.add_entity(entity='Q007', prob=0.7, entity_vector=[0, 0, 7])
    kb.add_entity(entity='Q44', prob=0.4, entity_vector=[4, 4, 4])
    kb.add_alias(alias='double07', entities=['Q17', 'Q007'], probabilities=[0.1, 0.9])
    kb.add_alias(alias='guy', entities=['Q53', 'Q007', 'Q17', 'Q44'], probabilities=[0.3, 0.3, 0.2, 0.1])
    kb.add_alias(alias='random', entities=['Q007'], probabilities=[1.0])
    return kb
 def _check_kb(kb):
    # check entities
    assert kb.get_size_entities() == 4
    for entity_string in ['Q53', 'Q17', 'Q007', 'Q44']:
        assert entity_string in kb.get_entity_strings()
    for entity_string in ['', 'Q0']:
        assert entity_string not in kb.get_entity_strings()
    # check aliases
    assert kb.get_size_aliases() == 3
    for alias_string in ['double07', 'guy', 'random']:
        assert alias_string in kb.get_alias_strings()
    for alias_string in ['nothingness', '', 'randomnoise']:
        assert alias_string not in kb.get_alias_strings()
    # check candidates & probabilities
    candidates = sorted(kb.get_candidates('double07'), key=lambda x: x.entity_)
    assert len(candidates) == 2
    assert candidates[0].entity_ == 'Q007'
    assert 0.6999 < candidates[0].entity_freq < 0.701
    assert candidates[0].entity_vector == [0, 0, 7]
    assert candidates[0].alias_ == 'double07'
    assert 0.899 < candidates[0].prior_prob < 0.901
    assert candidates[1].entity_ == 'Q17'
    assert 0.199 < candidates[1].entity_freq < 0.201
    assert candidates[1].entity_vector == [7, 1, 0]
    assert candidates[1].alias_ == 'double07'
    assert 0.099 < candidates[1].prior_prob < 0.101
--- a/spacy/tokens/_serialize.py
+++ b/spacy/tokens/_serialize.py
@ -11,29 +11,27 @@ from ..tokens import Doc
 from ..attrs import SPACY, ORTH
-class Binder(object):
+class DocBox(object):
    """Serialize analyses from a collection of doc objects."""
-    def __init__(self, attrs=None):
+    def __init__(self, attrs=None, store_user_data=False):
-        """Create a Binder object, to hold serialized annotations.
+        """Create a DocBox object, to hold serialized annotations.
        attrs (list): List of attributes to serialize. 'orth' and 'spacy' are
            always serialized, so they're not required. Defaults to None.
        """
        attrs = attrs or []
        self.attrs = list(attrs)
        # Ensure ORTH is always attrs[0]
-        if ORTH in self.attrs:
-            self.attrs.pop(ORTH)
-        if SPACY in self.attrs:
-            self.attrs.pop(SPACY)
+        self.attrs = [attr for attr in attrs if attr != ORTH and attr != SPACY]
        self.attrs.insert(0, ORTH)
        self.tokens = []
        self.spaces = []
        self.user_data = []
        self.strings = set()
        self.store_user_data = store_user_data
    def add(self, doc):
-        """Add a doc's annotations to the binder for serialization."""
+        """Add a doc's annotations to the DocBox for serialization."""
        array = doc.to_array(self.attrs)
        if len(array.shape) == 1:
            array = array.reshape((array.shape[0], 1))
@ -43,27 +41,35 @@ class Binder(object):
        spaces = spaces.reshape((spaces.shape[0], 1))
        self.spaces.append(numpy.asarray(spaces, dtype=bool))
        self.strings.update(w.text for w in doc)
        if self.store_user_data:
            self.user_data.append(srsly.msgpack_dumps(doc.user_data))
    def get_docs(self, vocab):
        """Recover Doc objects from the annotations, using the given vocab."""
        for string in self.strings:
            vocab[string]
        orth_col = self.attrs.index(ORTH)
-        for tokens, spaces in zip(self.tokens, self.spaces):
+        for i in range(len(self.tokens)):
            tokens = self.tokens[i]
            spaces = self.spaces[i]
            words = [vocab.strings[orth] for orth in tokens[:, orth_col]]
            doc = Doc(vocab, words=words, spaces=spaces)
            doc = doc.from_array(self.attrs, tokens)
            if self.store_user_data:
                doc.user_data.update(srsly.msgpack_loads(self.user_data[i]))
            yield doc
    def merge(self, other):
-        """Extend the annotations of this binder with the annotations from another."""
+        """Extend the annotations of this DocBox with the annotations from another."""
        assert self.attrs == other.attrs
        self.tokens.extend(other.tokens)
        self.spaces.extend(other.spaces)
        self.strings.update(other.strings)
        if self.store_user_data:
            self.user_data.extend(other.user_data)
    def to_bytes(self):
-        """Serialize the binder's annotations into a byte string."""
+        """Serialize the DocBox's annotations into a byte string."""
        for tokens in self.tokens:
            assert len(tokens.shape) == 2, tokens.shape
        lengths = [len(tokens) for tokens in self.tokens]
@ -74,10 +80,12 @@ class Binder(object):
            "lengths": numpy.asarray(lengths, dtype="int32").tobytes("C"),
            "strings": list(self.strings),
        }
        if self.store_user_data:
            msg["user_data"] = self.user_data
        return gzip.compress(srsly.msgpack_dumps(msg))
    def from_bytes(self, string):
-        """Deserialize the binder's annotations from a byte string."""
+        """Deserialize the DocBox's annotations from a byte string."""
        msg = srsly.msgpack_loads(gzip.decompress(string))
        self.attrs = msg["attrs"]
        self.strings = set(msg["strings"])
@ -89,29 +97,38 @@ class Binder(object):
        flat_spaces = flat_spaces.reshape((flat_spaces.size, 1))
        self.tokens = NumpyOps().unflatten(flat_tokens, lengths)
        self.spaces = NumpyOps().unflatten(flat_spaces, lengths)
        if self.store_user_data and "user_data" in msg:
            self.user_data = list(msg["user_data"])
        for tokens in self.tokens:
            assert len(tokens.shape) == 2, tokens.shape
        return self
-def merge_bytes(binder_strings):
-    """Concatenate multiple serialized binders into one byte string."""
-    output = None
-    for byte_string in binder_strings:
-        binder = Binder().from_bytes(byte_string)
-        if output is None:
-            output = binder
-        else:
-            output.merge(binder)
-    return output.to_bytes()
+def merge_boxes(boxes):
+    merged = None
+    for byte_string in boxes:
+        if byte_string is not None:
+            box = DocBox(store_user_data=True).from_bytes(byte_string)
+            if merged is None:
+                merged = box
+            else:
+                merged.merge(box)
+    if merged is not None:
+        return merged.to_bytes()
+    else:
+        return b""
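Note on the merge logic: the new function skips `None` inputs and falls back to an empty byte string when nothing could be merged. The sketch below reproduces only that control flow with the standard library. `pack`, `unpack` and `merge_packed` are hypothetical names, and `json` stands in for the msgpack serialization used in the diff.

```python
import gzip
import json


def pack(docs):
    """Serialize a list of token lists (json is a stand-in for msgpack)."""
    return gzip.compress(json.dumps(docs).encode("utf8"))


def unpack(byte_string):
    """Inverse of pack()."""
    return json.loads(gzip.decompress(byte_string).decode("utf8"))


def merge_packed(byte_strings):
    """Merge serialized payloads, skipping None entries.

    Returns b"" when no payload was available, mirroring the
    fallback branch added in the diff.
    """
    merged = None
    for byte_string in byte_strings:
        if byte_string is None:
            continue
        docs = unpack(byte_string)
        if merged is None:
            merged = docs
        else:
            merged.extend(docs)
    return pack(merged) if merged is not None else b""
```

Returning `b""` means a caller has to check for an empty result before trying to deserialize it.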
-def pickle_binder(binder):
+def pickle_box(box):
-    return (unpickle_binder, (binder.to_bytes(),))
+    return (unpickle_box, (box.to_bytes(),))
-def unpickle_binder(byte_string):
+def unpickle_box(byte_string):
-    return Binder().from_bytes(byte_string)
+    return DocBox().from_bytes(byte_string)
-copy_reg.pickle(Binder, pickle_binder, unpickle_binder)
+copy_reg.pickle(DocBox, pickle_box, unpickle_box)
 # Compatibility, as we had named it this previously.
 Binder = DocBox
 __all__ = ["DocBox"]
--- a/spacy/tokens/doc.pxd
+++ b/spacy/tokens/doc.pxd
@ -1,6 +1,5 @@
 from cymem.cymem cimport Pool
 cimport numpy as np
 from preshed.counter cimport PreshCounter
 from ..vocab cimport Vocab
 from ..structs cimport TokenC, LexemeC
--- a/spacy/tokens/doc.pyx
+++ b/spacy/tokens/doc.pyx
@ -9,6 +9,7 @@ cimport cython
 cimport numpy as np
 from libc.string cimport memcpy, memset
 from libc.math cimport sqrt
 from collections import Counter
 import numpy
 import numpy.linalg
@ -22,7 +23,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
 from ..typedefs cimport attr_t, flags_t
 from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
 from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
-from ..attrs cimport ENT_TYPE, SENT_START, attr_id_t
+from ..attrs cimport ENT_TYPE, ENT_KB_ID, SENT_START, attr_id_t
 from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
 from ..attrs import intify_attrs, IDS
@ -64,6 +65,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
        return token.ent_iob
    elif feat_name == ENT_TYPE:
        return token.ent_type
    elif feat_name == ENT_KB_ID:
        return token.ent_kb_id
    else:
        return Lexeme.get_struct_attr(token.lex, feat_name)
@ -85,13 +88,14 @@ cdef class Doc:
    Python-level `Token` and `Span` objects are views of this array, i.e.
    they don't own the data themselves.
-    EXAMPLE: Construction 1
+    EXAMPLE:
        Construction 1
        >>> doc = nlp(u'Some text')
        Construction 2
        >>> from spacy.tokens import Doc
        >>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'],
-                      spaces=[True, False, False])
+        >>>           spaces=[True, False, False])
    DOCS: https://spacy.io/api/doc
    """
@ -237,6 +241,8 @@ cdef class Doc:
            return True
        if self.is_parsed:
            return True
        if len(self) < 2:
            return True
        for i in range(1, self.length):
            if self.c[i].sent_start == -1 or self.c[i].sent_start == 1:
                return True
@ -248,6 +254,8 @@ cdef class Doc:
        *any* of the tokens has a named entity tag set (even if the others are
        unknown values).
        """
        if len(self) == 0:
            return True
        for i in range(self.length):
            if self.c[i].ent_iob != 0:
                return True
@ -690,7 +698,7 @@ cdef class Doc:
        # Handle 1d case
        return output if len(attr_ids) >= 2 else output.reshape((self.length,))
-    def count_by(self, attr_id_t attr_id, exclude=None, PreshCounter counts=None):
+    def count_by(self, attr_id_t attr_id, exclude=None, object counts=None):
        """Count the frequencies of a given attribute. Produces a dict of
        `{attribute (int): count (ints)}` frequencies, keyed by the values of
        the given attribute ID.
@ -705,19 +713,18 @@ cdef class Doc:
        cdef size_t count
        if counts is None:
-            counts = PreshCounter()
+            counts = Counter()
            output_dict = True
        else:
            output_dict = False
        # Take this check out of the loop, for a bit of extra speed
        if exclude is None:
            for i in range(self.length):
-                counts.inc(get_token_attr(&self.c[i], attr_id), 1)
+                counts[get_token_attr(&self.c[i], attr_id)] += 1
        else:
            for i in range(self.length):
                if not exclude(self[i]):
-                    attr = get_token_attr(&self.c[i], attr_id)
-                    counts.inc(attr, 1)
+                    counts[get_token_attr(&self.c[i], attr_id)] += 1
        if output_dict:
            return dict(counts)
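Note on the `Counter` change: the hunk replaces `counts.inc(...)` with item assignment on a `collections.Counter` and returns a plain dict only when the caller did not supply a counter. A pure-Python sketch of that pattern, operating on a list of values instead of token structs (an assumption made so the example is standalone):

```python
from collections import Counter


def count_by(values, exclude=None, counts=None):
    """Tally values in a Counter.

    Returns a plain dict when no counter was passed in; otherwise
    updates the caller's counter in place and returns None.
    """
    output_dict = counts is None
    if output_dict:
        counts = Counter()
    for value in values:
        if exclude is None or not exclude(value):
            counts[value] += 1
    return dict(counts) if output_dict else None


flags = [1, 1, 0, 1]
totals = count_by(flags)
without_zero = count_by(flags, exclude=lambda v: v == 0)

# Accumulate across two calls by sharing one counter.
running = Counter()
count_by(flags, counts=running)
count_by(flags, counts=running)
```

Missing keys in a `Counter` default to zero, which is why `counts[value] += 1` needs no prior initialization.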
@ -850,7 +857,7 @@ cdef class Doc:
        DOCS: https://spacy.io/api/doc#to_bytes
        """
-        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]
+        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]  # TODO: ENT_KB_ID ?
        if self.is_tagged:
            array_head.append(TAG)
        # If doc parsed add head and dep attribute
@ -1004,6 +1011,7 @@ cdef class Doc:
        """
        cdef unicode tag, lemma, ent_type
        deprecation_warning(Warnings.W013.format(obj="Doc"))
        # TODO: ENT_KB_ID ?
        if len(args) == 3:
            deprecation_warning(Warnings.W003)
            tag, lemma, ent_type = args
--- a/spacy/tokens/span.pyx
+++ b/spacy/tokens/span.pyx
@ -210,7 +210,7 @@ cdef class Span:
        words = [t.text for t in self]
        spaces = [bool(t.whitespace_) for t in self]
        cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces)
-        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]
+        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_KB_ID]
        if self.doc.is_tagged:
            array_head.append(TAG)
        # If doc parsed add head and dep attribute
--- a/spacy/tokens/token.pxd
+++ b/spacy/tokens/token.pxd
@ -53,6 +53,8 @@ cdef class Token:
            return token.ent_iob
        elif feat_name == ENT_TYPE:
            return token.ent_type
        elif feat_name == ENT_KB_ID:
            return token.ent_kb_id
        elif feat_name == SENT_START:
            return token.sent_start
        else:
@ -79,5 +81,7 @@ cdef class Token:
            token.ent_iob = value
        elif feat_name == ENT_TYPE:
            token.ent_type = value
        elif feat_name == ENT_KB_ID:
            token.ent_kb_id = value
        elif feat_name == SENT_START:
            token.sent_start = value
--- a/website/docs/api/annotation.md
+++ b/website/docs/api/annotation.md
@ -520,7 +520,9 @@ spaCy takes training data in JSON format. The built-in
 [`convert`](/api/cli#convert) command helps you convert the `.conllu` format
 used by the
 [Universal Dependencies corpora](https://github.com/UniversalDependencies) to
-spaCy's training format.
+spaCy's training format. To convert one or more existing `Doc` objects to
 spaCy's JSON format, you can use the
 [`gold.docs_to_json`](/api/goldparse#docs_to_json) helper.
 > #### Annotating entities
 >