Mirror of https://github.com/explosion/spaCy.git
Synced 2024-12-25 01:16:28 +03:00

Commit 46568f40a7: Merge branch 'master' into tmp/sync
.github/contributors/Baciccin.md (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” in one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
| ------------------------------ | ------------------------ |
| Name                           | Giovanni Battista Parodi |
| Company name (if applicable)   |                          |
| Title or role (if applicable)  |                          |
| Date                           | 2020-03-19               |
| GitHub username                | Baciccin                 |
| Website (optional)             |                          |
.github/contributors/MisterKeefe.md (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

*(Standard SCA text, identical to Baciccin.md above; **"us"** is
[ExplosionAI GmbH](https://explosion.ai/legal).)*

* [ ] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry            |
| ------------------------------ | ---------------- |
| Name                           | Tom Keefe        |
| Company name (if applicable)   | /                |
| Title or role (if applicable)  | /                |
| Date                           | 18 February 2020 |
| GitHub username                | MisterKeefe      |
| Website (optional)             | /                |
.github/contributors/Tiljander.md (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

*(Standard SCA text, identical to Baciccin.md above, except that this copy
names [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal) as
**"us"**.)*

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry            |
| ------------------------------ | ---------------- |
| Name                           | Henrik Tiljander |
| Company name (if applicable)   |                  |
| Title or role (if applicable)  |                  |
| Date                           | 24/3/2020        |
| GitHub username                | Tiljander        |
| Website (optional)             |                  |
.github/contributors/dhpollack.md (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

*(Standard SCA text, identical to Baciccin.md above; **"us"** is
[ExplosionAI GmbH](https://explosion.ai/legal).)*

* [X] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry         |
| ------------------------------ | ------------- |
| Name                           | David Pollack |
| Company name (if applicable)   |               |
| Title or role (if applicable)  |               |
| Date                           | Mar 5. 2020   |
| GitHub username                | dhpollack     |
| Website (optional)             |               |
.github/contributors/guerda.md (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

*(Standard SCA text, identical to Baciccin.md above; **"us"** is
[ExplosionAI GmbH](https://explosion.ai/legal).)*

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry           |
| ------------------------------ | --------------- |
| Name                           | Philip Gillißen |
| Company name (if applicable)   |                 |
| Title or role (if applicable)  |                 |
| Date                           | 2020-03-24      |
| GitHub username                | guerda          |
| Website (optional)             |                 |
.github/contributors/mabraham.md (vendored, new file)
@@ -0,0 +1,89 @@
*(Truncated copy of the standard SCA: the title and preamble are missing and
the text begins at "## Contributor Agreement"; sections 1-7 are otherwise
identical to Baciccin.md above.)*

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry |
| ------------------------------ | ----- |
| Name                           |       |
| Company name (if applicable)   |       |
| Title or role (if applicable)  |       |
| Date                           |       |
| GitHub username                |       |
| Website (optional)             |       |
.github/contributors/merrcury.md (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

*(Standard SCA text, identical to Baciccin.md above; **"us"** is
[ExplosionAI GmbH](https://explosion.ai/legal).)*

* [X] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry         |
| ------------------------------ | ------------- |
| Name                           | Himanshu Garg |
| Company name (if applicable)   |               |
| Title or role (if applicable)  |               |
| Date                           | 2020-03-10    |
| GitHub username                | merrcury      |
| Website (optional)             |               |
.github/contributors/pinealan.md (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

*(Standard SCA text, identical to Baciccin.md above; **"us"** is
[ExplosionAI GmbH](https://explosion.ai/legal).)*

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry               |
| ------------------------------ | ------------------- |
| Name                           | Alan Chan           |
| Company name (if applicable)   |                     |
| Title or role (if applicable)  |                     |
| Date                           | 2020-03-15          |
| GitHub username                | pinealan            |
| Website (optional)             | http://pinealan.xyz |
.github/contributors/sloev.md (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

*(Standard SCA text, identical to Baciccin.md above, except that this copy
names [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal) as
**"us"**.)*

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                   |
| ------------------------------ | ----------------------- |
| Name                           | Johannes Valbjørn       |
| Company name (if applicable)   |                         |
| Title or role (if applicable)  |                         |
| Date                           | 2020-03-13              |
| GitHub username                | sloev                   |
| Website (optional)             | https://sloev.github.io |
.gitignore (vendored)
@@ -46,6 +46,7 @@ __pycache__/
 .venv
 env3.6/
 venv/
+env3.*/
 .dev
 .denv
 .pypyenv
@@ -62,6 +63,7 @@ lib64/
 parts/
 sdist/
 var/
+wheelhouse/
 *.egg-info/
 pip-wheel-metadata/
 Pipfile.lock
LICENSE
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (C) 2016-2019 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
+Copyright (C) 2016-2020 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
Makefile
@@ -1,28 +1,37 @@
 SHELL := /bin/bash
-sha = $(shell "git" "rev-parse" "--short" "HEAD")
-version = $(shell "bin/get-version.sh")
-wheel = spacy-$(version)-cp36-cp36m-linux_x86_64.whl
+PYVER := 3.6
+VENV := ./env$(PYVER)
 
-dist/spacy.pex : dist/spacy-$(sha).pex
-	cp dist/spacy-$(sha).pex dist/spacy.pex
-	chmod a+rx dist/spacy.pex
+version := $(shell "bin/get-version.sh")
 
-dist/spacy-$(sha).pex : dist/$(wheel)
-	env3.6/bin/python -m pip install pex==1.5.3
-	env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
+dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
+	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy_lookups_data
+	chmod a+rx $@
 
-dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
-	python3.6 -m venv env3.6
-	source env3.6/bin/activate
-	env3.6/bin/pip install wheel
-	env3.6/bin/pip install -r requirements.txt --no-cache-dir
-	env3.6/bin/python setup.py build_ext --inplace
-	env3.6/bin/python setup.py sdist
-	env3.6/bin/python setup.py bdist_wheel
+dist/pytest.pex : wheelhouse/pytest-*.whl
+	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
+	chmod a+rx $@
 
-.PHONY : clean
+wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
+	$(VENV)/bin/pip wheel . -w ./wheelhouse
+	$(VENV)/bin/pip wheel jsonschema spacy_lookups_data -w ./wheelhouse
+	touch $@
+
+wheelhouse/pytest-%.whl : $(VENV)/bin/pex
+	$(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse
+
+$(VENV)/bin/pex :
+	python$(PYVER) -m venv $(VENV)
+	$(VENV)/bin/pip install -U pip setuptools pex wheel
+
+.PHONY : clean test
+
+test : dist/spacy-$(version).pex dist/pytest.pex
+	( . $(VENV)/bin/activate ; \
+	PEX_PATH=dist/spacy-$(version).pex ./dist/pytest.pex --pyargs spacy -x ; )
 
 clean : setup.py
-	source env3.6/bin/activate
 	rm -rf dist/*
+	rm -rf ./wheelhouse
+	rm -rf $(VENV)
 	python setup.py clean --all
@@ -2,7 +2,7 @@
 
 ### Step 1: Create a Knowledge Base (KB) and training data
 
-Run `wikipedia_pretrain_kb.py`
+Run `wikidata_pretrain_kb.py`
 * This takes as input the locations of a **Wikipedia and a Wikidata dump**, and produces a **KB directory** + **training file**
 * WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
 * Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
@@ -88,8 +88,8 @@ def read_text(bz2_loc, n=10000):
             break
 
 
-def get_matches(tokenizer, phrases, texts, max_length=6):
-    matcher = PhraseMatcher(tokenizer.vocab, max_length=max_length)
+def get_matches(tokenizer, phrases, texts):
+    matcher = PhraseMatcher(tokenizer.vocab)
     matcher.add("Phrase", None, *phrases)
     for text in texts:
         doc = tokenizer(text)
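The `get_matches` change above tracks an API update: `PhraseMatcher` no longer takes a `max_length` argument. A minimal sketch of the updated usage, assuming the v2-style `matcher.add` signature shown in the diff:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# no max_length argument anymore -- the matcher sizes itself
matcher = PhraseMatcher(nlp.vocab)
phrases = [nlp.make_doc(text) for text in ["New York", "machine learning"]]
matcher.add("Phrase", None, *phrases)  # same add() call as in the diff
doc = nlp.make_doc("She moved to New York to study machine learning")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```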
@@ -59,7 +59,7 @@ install_requires =
 
 [options.extras_require]
 lookups =
-    spacy_lookups_data>=0.0.5<0.2.0
+    spacy_lookups_data>=0.0.5,<0.2.0
 cuda =
     cupy>=5.0.0b4
 cuda80 =
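The one-character fix above matters: PEP 440 version specifiers are comma-separated, so `>=0.0.5<0.2.0` does not parse as two constraints. A quick check with the `packaging` library (the library is an assumption for illustration, not part of the diff):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet(">=0.0.5,<0.2.0")  # the corrected, comma-separated form
print(Version("0.1.0") in spec)  # True
print(Version("0.2.0") in spec)  # False -- upper bound is exclusive
```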
@ -93,3 +93,5 @@ cdef enum attr_id_t:
|
||||||
ENT_KB_ID = symbols.ENT_KB_ID
|
ENT_KB_ID = symbols.ENT_KB_ID
|
||||||
MORPH
|
MORPH
|
||||||
ENT_ID = symbols.ENT_ID
|
ENT_ID = symbols.ENT_ID
|
||||||
|
|
||||||
|
IDX
|
||||||
|
|
|
@@ -89,6 +89,7 @@ IDS = {
     "PROB": PROB,
     "LANG": LANG,
     "MORPH": MORPH,
+    "IDX": IDX
 }
@@ -405,12 +405,10 @@ def train(
                     losses=losses,
                 )
             except ValueError as e:
-                msg.warn("Error during training")
+                err = "Error during training"
                 if init_tok2vec:
-                    msg.warn(
-                        "Did you provide the same parameters during 'train' as during 'pretrain'?"
-                    )
-                msg.fail(f"Original error message: {e}", exits=1)
+                    err += " Did you provide the same parameters during 'train' as during 'pretrain'?"
+                msg.fail(err, f"Original error message: {e}", exits=1)
             if raw_text:
                 # If raw text is available, perform 'rehearsal' updates,
                 # which use unlabelled data to reduce overfitting.
@@ -545,7 +543,40 @@ def train(
         with nlp.use_params(optimizer.averages):
             final_model_path = output_path / "model-final"
             nlp.to_disk(final_model_path)
-            final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
+            meta_loc = output_path / "model-final" / "meta.json"
+            final_meta = srsly.read_json(meta_loc)
+            final_meta.setdefault("accuracy", {})
+            final_meta["accuracy"].update(meta.get("accuracy", {}))
+            final_meta.setdefault("speed", {})
+            final_meta["speed"].setdefault("cpu", None)
+            final_meta["speed"].setdefault("gpu", None)
+            meta.setdefault("speed", {})
+            meta["speed"].setdefault("cpu", None)
+            meta["speed"].setdefault("gpu", None)
+            # combine cpu and gpu speeds with the base model speeds
+            if final_meta["speed"]["cpu"] and meta["speed"]["cpu"]:
+                speed = _get_total_speed(
+                    [final_meta["speed"]["cpu"], meta["speed"]["cpu"]]
+                )
+                final_meta["speed"]["cpu"] = speed
+            if final_meta["speed"]["gpu"] and meta["speed"]["gpu"]:
+                speed = _get_total_speed(
+                    [final_meta["speed"]["gpu"], meta["speed"]["gpu"]]
+                )
+                final_meta["speed"]["gpu"] = speed
+            # if there were no speeds to update, overwrite with meta
+            if (
+                final_meta["speed"]["cpu"] is None
+                and final_meta["speed"]["gpu"] is None
+            ):
+                final_meta["speed"].update(meta["speed"])
+            # note: beam speeds are not combined with the base model
+            if has_beam_widths:
+                final_meta.setdefault("beam_accuracy", {})
+                final_meta["beam_accuracy"].update(meta.get("beam_accuracy", {}))
+                final_meta.setdefault("beam_speed", {})
+                final_meta["beam_speed"].update(meta.get("beam_speed", {}))
+            srsly.write_json(meta_loc, final_meta)
             msg.good("Saved model to output directory", final_model_path)
             with msg.loading("Creating best model..."):
                 best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
@@ -630,6 +661,8 @@ def _find_best(experiment_dir, component):
         if epoch_model.is_dir() and epoch_model.parts[-1] != "model-final":
             accs = srsly.read_json(epoch_model / "accuracy.json")
             scores = [accs.get(metric, 0.0) for metric in _get_metrics(component)]
+            # remove per_type dicts from score list for max() comparison
+            scores = [score for score in scores if isinstance(score, float)]
             accuracies.append((scores, epoch_model))
     if accuracies:
         return max(accuracies)[1]
@@ -641,13 +674,13 @@ def _get_metrics(component):
     if component == "parser":
         return ("las", "uas", "las_per_type", "token_acc", "sent_f")
     elif component == "tagger":
-        return ("tags_acc",)
+        return ("tags_acc", "token_acc")
     elif component == "ner":
-        return ("ents_f", "ents_p", "ents_r", "ents_per_type")
+        return ("ents_f", "ents_p", "ents_r", "ents_per_type", "token_acc")
     elif component == "senter":
         return ("sent_f", "sent_p", "sent_r")
     elif component == "textcat":
-        return ("textcat_score",)
+        return ("textcat_score", "token_acc")
     return ("token_acc",)
@@ -714,3 +747,12 @@ def _get_progress(
     if beam_width is not None:
         result.insert(1, beam_width)
     return result
+
+
+def _get_total_speed(speeds):
+    seconds_per_word = 0.0
+    for words_per_second in speeds:
+        if words_per_second is None:
+            return None
+        seconds_per_word += 1.0 / words_per_second
+    return 1.0 / seconds_per_word
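`_get_total_speed` combines words-per-second figures harmonically: the pipeline stages run one after another, so their seconds-per-word add up. A worked example (the numbers are made up for illustration):

```python
def _get_total_speed(speeds):
    seconds_per_word = 0.0
    for words_per_second in speeds:
        if words_per_second is None:
            return None
        seconds_per_word += 1.0 / words_per_second
    return 1.0 / seconds_per_word

# base model at 10,000 wps plus new components at 40,000 wps:
# 1/10000 + 1/40000 = 1/8000, so the combined throughput is 8,000 wps,
# lower than either part, as expected for sequential stages
print(_get_total_speed([10000, 40000]))  # 8000.0
```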
@@ -142,10 +142,17 @@ def parse_deps(orig_doc, options={}):
         for span, tag, lemma, ent_type in spans:
             attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
             retokenizer.merge(span, attrs=attrs)
-    if options.get("fine_grained"):
-        words = [{"text": w.text, "tag": w.tag_} for w in doc]
-    else:
-        words = [{"text": w.text, "tag": w.pos_} for w in doc]
+    fine_grained = options.get("fine_grained")
+    add_lemma = options.get("add_lemma")
+    words = [
+        {
+            "text": w.text,
+            "tag": w.tag_ if fine_grained else w.pos_,
+            "lemma": w.lemma_ if add_lemma else None,
+        }
+        for w in doc
+    ]
 
     arcs = []
     for word in doc:
         if word.i < word.head.i:
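The rewritten `parse_deps` reads two options, `fine_grained` and `add_lemma`, straight from the `options` dict. A usage sketch (the model name is an assumption; any pipeline with a tagger, parser, and lemmatizer would do):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # assumed to be installed
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
# fine_grained -> show token.tag_ instead of token.pos_
# add_lemma    -> render the extra lemma row via TPL_DEP_WORDS_LEMMA below
svg = displacy.render(doc, style="dep",
                      options={"fine_grained": True, "add_lemma": True})
```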
@@ -1,6 +1,12 @@
 import uuid
 
-from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
+from .templates import (
+    TPL_DEP_SVG,
+    TPL_DEP_WORDS,
+    TPL_DEP_WORDS_LEMMA,
+    TPL_DEP_ARCS,
+    TPL_ENTS,
+)
 from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
 from ..util import minify_html, escape_html, registry
 from ..errors import Errors
@@ -80,7 +86,10 @@ class DependencyRenderer(object):
         self.width = self.offset_x + len(words) * self.distance
         self.height = self.offset_y + 3 * self.word_spacing
         self.id = render_id
-        words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
+        words = [
+            self.render_word(w["text"], w["tag"], w.get("lemma", None), i)
+            for i, w in enumerate(words)
+        ]
         arcs = [
             self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
             for i, a in enumerate(arcs)
@@ -98,7 +107,9 @@ class DependencyRenderer(object):
             lang=self.lang,
         )
 
-    def render_word(self, text, tag, i):
+    def render_word(
+        self, text, tag, lemma, i,
+    ):
         """Render individual word.
 
         text (unicode): Word text.
@@ -111,6 +122,10 @@ class DependencyRenderer(object):
         if self.direction == "rtl":
             x = self.width - x
         html_text = escape_html(text)
+        if lemma is not None:
+            return TPL_DEP_WORDS_LEMMA.format(
+                text=html_text, tag=tag, lemma=lemma, x=x, y=y
+            )
         return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
 
     def render_arrow(self, label, start, end, direction, i):
@ -14,6 +14,15 @@ TPL_DEP_WORDS = """
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
TPL_DEP_WORDS_LEMMA = """
|
||||||
|
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
|
||||||
|
<tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
|
||||||
|
<tspan class="displacy-lemma" dy="2em" fill="currentColor" x="{x}">{lemma}</tspan>
|
||||||
|
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
|
||||||
|
</text>
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
TPL_DEP_ARCS = """
|
TPL_DEP_ARCS = """
|
||||||
<g class="displacy-arrow">
|
<g class="displacy-arrow">
|
||||||
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
||||||
|
|
|
@@ -96,7 +96,10 @@ class Warnings(object):
     W027 = ("Found a large training file of {size} bytes. Note that it may "
             "be more efficient to split your training data into multiple "
             "smaller JSON files instead.")
-    W028 = ("Skipping unsupported morphological feature(s): {feature}. "
+    W028 = ("Doc.from_array was called with a vector of type '{type}', "
+            "but is expecting one of type 'uint64' instead. This may result "
+            "in problems with the vocab further on in the pipeline.")
+    W029 = ("Skipping unsupported morphological feature(s): {feature}. "
             "Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
             "string \"Field1=Value1,Value2|Field2=Value3\".")
@@ -531,6 +534,15 @@ class Errors(object):
     E188 = ("Could not match the gold entity links to entities in the doc - "
             "make sure the gold EL data refers to valid results of the "
             "named entity recognizer in the `nlp` pipeline.")
+    E189 = ("Each argument to `get_doc` should be of equal length.")
+    E190 = ("Token head out of range in `Doc.from_array()` for token index "
+            "'{index}' with value '{value}' (equivalent to relative head "
+            "index: '{rel_head_index}'). The head indices should be relative "
+            "to the current token index rather than absolute indices in the "
+            "array.")
+    E191 = ("Invalid head: the head token must be from the same doc as the "
+            "token itself.")
 
     # TODO: fix numbering after merging develop into master
     E993 = ("The config for 'nlp' should include either a key 'name' to "
             "refer to an existing model by name or path, or a key 'lang' "
@@ -66,6 +66,7 @@ for orth in [
     "A/S",
     "B.C.",
     "BK.",
+    "B.T.",
     "Dr.",
     "Boul.",
     "Chr.",
@@ -75,6 +76,7 @@ for orth in [
     "Hf.",
     "i/s",
     "I/S",
+    "Inc.",
     "Kprs.",
     "L.A.",
     "Ll.",
@@ -145,6 +147,7 @@ for orth in [
     "bygn.",
     "c/o",
     "ca.",
+    "cm.",
     "cand.",
     "d.d.",
     "d.m.",
@@ -168,10 +171,12 @@ for orth in [
     "dl.",
     "do.",
     "dobb.",
+    "dr.",
     "dr.h.c",
     "dr.phil.",
     "ds.",
     "dvs.",
+    "d.v.s.",
     "e.b.",
     "e.l.",
     "e.o.",
@@ -293,10 +298,14 @@ for orth in [
     "kap.",
     "kbh.",
     "kem.",
+    "kg.",
+    "kgs.",
     "kgl.",
     "kl.",
     "kld.",
+    "km.",
     "km/t",
+    "km/t.",
     "knsp.",
     "komm.",
     "kons.",
@@ -307,6 +316,7 @@ for orth in [
     "kt.",
     "ktr.",
     "kv.",
+    "kvm.",
     "kvt.",
     "l.c.",
     "lab.",
@@ -353,6 +363,7 @@ for orth in [
     "nto.",
     "nuv.",
     "o/m",
+    "o/m.",
     "o.a.",
     "o.fl.",
     "o.h.",
@@ -522,6 +533,7 @@ for orth in [
     "vejl.",
     "vh.",
     "vha.",
+    "vind.",
     "vs.",
     "vsa.",
     "vær.",
@@ -1,5 +1,6 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .norm_exceptions import NORM_EXCEPTIONS
+from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
 from .punctuation import TOKENIZER_INFIXES
 from .tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
@@ -19,6 +20,8 @@ class GermanDefaults(Language.Defaults):
         Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
     )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    prefixes = TOKENIZER_PREFIXES
+    suffixes = TOKENIZER_SUFFIXES
     infixes = TOKENIZER_INFIXES
     tag_map = TAG_MAP
     stop_words = STOP_WORDS
@@ -1,7 +1,29 @@
-from ..char_classes import LIST_ELLIPSES, LIST_ICONS
+from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
+from ..char_classes import CURRENCY, UNITS, PUNCT
 from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
+from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
+
+
+_prefixes = ["``"] + BASE_TOKENIZER_PREFIXES
+
+_suffixes = (
+    ["''", "/"]
+    + LIST_PUNCT
+    + LIST_ELLIPSES
+    + LIST_QUOTES
+    + LIST_ICONS
+    + [
+        r"(?<=[0-9])\+",
+        r"(?<=°[FfCcKk])\.",
+        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
+        r"(?<=[0-9])(?:{u})".format(u=UNITS),
+        r"(?<=[{al}{e}{p}(?:{q})])\.".format(
+            al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
+        ),
+        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
+    ]
+)
 
 _quotes = CONCAT_QUOTES.replace("'", "")
 
 _infixes = (
@@ -12,6 +34,7 @@ _infixes = (
     r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
     r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
     r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
+    r"(?<=[0-9{a}])\/(?=[0-9{a}])".format(a=ALPHA),
     r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
     r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
     r"(?<=[0-9])-(?=[0-9])",
@@ -19,4 +42,6 @@ _infixes = (
 )
 
 
+TOKENIZER_PREFIXES = _prefixes
+TOKENIZER_SUFFIXES = _suffixes
 TOKENIZER_INFIXES = _infixes
@@ -157,6 +157,8 @@ for exc_data in [
 
 
 for orth in [
+    "``",
+    "''",
     "A.C.",
     "a.D.",
     "A.D.",
@@ -172,10 +174,13 @@ for orth in [
     "biol.",
     "Biol.",
     "ca.",
+    "CDU/CSU",
     "Chr.",
     "Cie.",
+    "c/o",
     "co.",
     "Co.",
+    "d'",
     "D.C.",
     "Dipl.-Ing.",
     "Dipl.",
@@ -200,12 +205,18 @@ for orth in [
     "i.G.",
     "i.Tr.",
     "i.V.",
+    "I.",
+    "II.",
+    "III.",
+    "IV.",
+    "Inc.",
     "Ing.",
     "jr.",
     "Jr.",
     "jun.",
     "jur.",
     "K.O.",
+    "L'",
     "L.A.",
     "lat.",
     "M.A.",
30
spacy/lang/eu/__init__.py
Normal file

@@ -0,0 +1,30 @@
# coding: utf8
from __future__ import unicode_literals

from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_SUFFIXES
from .tag_map import TAG_MAP

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG


class BasqueDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "eu"

    tokenizer_exceptions = BASE_EXCEPTIONS
    tag_map = TAG_MAP
    stop_words = STOP_WORDS
    suffixes = TOKENIZER_SUFFIXES


class Basque(Language):
    lang = "eu"
    Defaults = BasqueDefaults


__all__ = ["Basque"]
14
spacy/lang/eu/examples.py
Normal file

@@ -0,0 +1,14 @@
# coding: utf8
from __future__ import unicode_literals

"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.eu.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

sentences = [
    "bilbon ko castinga egin da eta nik jakin ez zuetako inork egin al du edota parte hartu duen ezagunik ba al du",
    "gaur telebistan entzunda denok martetik gatoz hortaz martzianoak gara beno nire ustez batzuk beste batzuk baino martzianoagoak dira",
]
79
spacy/lang/eu/lex_attrs.py
Normal file

@@ -0,0 +1,79 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM

# Source http://mylanguages.org/basque_numbers.php


_num_words = """
bat
bi
hiru
lau
bost
sei
zazpi
zortzi
bederatzi
hamar
hamaika
hamabi
hamahiru
hamalau
hamabost
hamasei
hamazazpi
Hemezortzi
hemeretzi
hogei
ehun
mila
milioi
""".split()

# source https://www.google.com/intl/ur/inputtools/try/

_ordinal_words = """
lehen
bigarren
hirugarren
laugarren
bosgarren
seigarren
zazpigarren
zortzigarren
bederatzigarren
hamargarren
hamaikagarren
hamabigarren
hamahirugarren
hamalaugarren
hamabosgarren
hamaseigarren
hamazazpigarren
hamazortzigarren
hemeretzigarren
hogeigarren
behin
""".split()


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    if text in _ordinal_words:
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}
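A quick sanity check of the `like_num` hook above (a hedged sketch: `spacy.blank("eu")` only resolves once Basque is registered, which is exactly what this new package does):

```python
import spacy

nlp = spacy.blank("eu")  # assumes a spaCy build that includes the new eu package
doc = nlp("hiru 3 10,5 bigarren")
# "hiru" matches _num_words, "bigarren" matches _ordinal_words,
# "3" is a digit, and "10,5" passes after the comma is stripped
print([(token.text, token.like_num) for token in doc])
```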
7
spacy/lang/eu/punctuation.py
Normal file

@@ -0,0 +1,7 @@
# coding: utf8
from __future__ import unicode_literals

from ..punctuation import TOKENIZER_SUFFIXES


_suffixes = TOKENIZER_SUFFIXES
108
spacy/lang/eu/stop_words.py
Normal file

@@ -0,0 +1,108 @@
# encoding: utf8
from __future__ import unicode_literals

# Source: https://github.com/stopwords-iso/stopwords-eu
# https://www.ranks.nl/stopwords/basque
# https://www.mustgo.com/worldlanguages/basque/
STOP_WORDS = set(
    """
al
anitz
arabera
asko
baina
bat
batean
batek
bati
batzuei
batzuek
batzuetan
batzuk
bera
beraiek
berau
berauek
bere
berori
beroriek
beste
bezala
da
dago
dira
ditu
du
dute
edo
egin
ere
eta
eurak
ez
gainera
gu
gutxi
guzti
haiei
haiek
haietan
hainbeste
hala
han
handik
hango
hara
hari
hark
hartan
hau
hauei
hauek
hauetan
hemen
hemendik
hemengo
hi
hona
honek
honela
honetan
honi
hor
hori
horiei
horiek
horietan
horko
horra
horrek
horrela
horretan
horri
hortik
hura
izan
ni
noiz
nola
non
nondik
nongo
nor
nora
ze
zein
zen
zenbait
zenbat
zer
zergatik
ziren
zituen
zu
zuek
zuen
zuten
""".split()
)
71
spacy/lang/eu/tag_map.py
Normal file

@@ -0,0 +1,71 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON

TAG_MAP = {
    ".": {POS: PUNCT, "PunctType": "peri"},
    ",": {POS: PUNCT, "PunctType": "comm"},
    "-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
    "-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
    "``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
    '""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
    "''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
    ":": {POS: PUNCT},
    "$": {POS: SYM, "Other": {"SymType": "currency"}},
    "#": {POS: SYM, "Other": {"SymType": "numbersign"}},
    "AFX": {POS: ADJ, "Hyph": "yes"},
    "CC": {POS: CCONJ, "ConjType": "coor"},
    "CD": {POS: NUM, "NumType": "card"},
    "DT": {POS: DET},
    "EX": {POS: ADV, "AdvType": "ex"},
    "FW": {POS: X, "Foreign": "yes"},
    "HYPH": {POS: PUNCT, "PunctType": "dash"},
    "IN": {POS: ADP},
    "JJ": {POS: ADJ, "Degree": "pos"},
    "JJR": {POS: ADJ, "Degree": "comp"},
    "JJS": {POS: ADJ, "Degree": "sup"},
    "LS": {POS: PUNCT, "NumType": "ord"},
    "MD": {POS: VERB, "VerbType": "mod"},
    "NIL": {POS: ""},
    "NN": {POS: NOUN, "Number": "sing"},
    "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
    "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
    "NNS": {POS: NOUN, "Number": "plur"},
    "PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
    "POS": {POS: PART, "Poss": "yes"},
    "PRP": {POS: PRON, "PronType": "prs"},
    "PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
    "RB": {POS: ADV, "Degree": "pos"},
    "RBR": {POS: ADV, "Degree": "comp"},
    "RBS": {POS: ADV, "Degree": "sup"},
    "RP": {POS: PART},
    "SP": {POS: SPACE},
    "SYM": {POS: SYM},
    "TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
    "UH": {POS: INTJ},
    "VB": {POS: VERB, "VerbForm": "inf"},
    "VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
    "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
    "VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
    "VBZ": {
        POS: VERB,
        "VerbForm": "fin",
        "Tense": "pres",
        "Number": "sing",
        "Person": 3,
    },
    "WDT": {POS: ADJ, "PronType": "int|rel"},
    "WP": {POS: NOUN, "PronType": "int|rel"},
    "WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
    "WRB": {POS: ADV, "PronType": "int|rel"},
    "ADD": {POS: X},
    "NFP": {POS: PUNCT},
    "GW": {POS: X},
    "XX": {POS: X},
    "BES": {POS: VERB},
    "HVS": {POS: VERB},
    "_SP": {POS: SPACE},
}
@@ -11,6 +11,7 @@ for exc_data in [
     {ORTH: "alv.", LEMMA: "arvonlisävero"},
     {ORTH: "ark.", LEMMA: "arkisin"},
     {ORTH: "as.", LEMMA: "asunto"},
+    {ORTH: "eaa.", LEMMA: "ennen ajanlaskun alkua"},
     {ORTH: "ed.", LEMMA: "edellinen"},
     {ORTH: "esim.", LEMMA: "esimerkki"},
     {ORTH: "huom.", LEMMA: "huomautus"},
@@ -24,6 +25,7 @@ for exc_data in [
     {ORTH: "läh.", LEMMA: "lähettäjä"},
     {ORTH: "miel.", LEMMA: "mieluummin"},
     {ORTH: "milj.", LEMMA: "miljoona"},
+    {ORTH: "Mm.", LEMMA: "muun muassa"},
     {ORTH: "mm.", LEMMA: "muun muassa"},
     {ORTH: "myöh.", LEMMA: "myöhempi"},
     {ORTH: "n.", LEMMA: "noin"},
@@ -1,5 +1,6 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH
-from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_SUFFIXES
 from .tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
@@ -24,6 +25,7 @@ class FrenchDefaults(Language.Defaults):
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     tag_map = TAG_MAP
     stop_words = STOP_WORDS
+    prefixes = TOKENIZER_PREFIXES
     infixes = TOKENIZER_INFIXES
     suffixes = TOKENIZER_SUFFIXES
     token_match = TOKEN_MATCH
@@ -1,12 +1,23 @@
-from ..punctuation import TOKENIZER_INFIXES
+from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
 from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
 from ..char_classes import CONCAT_QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
+from ..char_classes import merge_chars
 
 
-ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")
-HYPHENS = r"- – — ‐ ‑".strip().replace(" ", "").replace("\n", "")
+ELISION = "' ’".replace(" ", "")
+HYPHENS = r"- – — ‐ ‑".replace(" ", "")
+_prefixes_elision = "d l n"
+_prefixes_elision += " " + _prefixes_elision.upper()
+_hyphen_suffixes = "ce clés elle en il ils je là moi nous on t vous"
+_hyphen_suffixes += " " + _hyphen_suffixes.upper()
+
+
+_prefixes = TOKENIZER_PREFIXES + [
+    r"(?:({pe})[{el}])(?=[{a}])".format(
+        a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
+    )
+]
 
 _suffixes = (
     LIST_PUNCT
     + LIST_ELLIPSES
@@ -14,7 +25,6 @@ _suffixes = (
     + [
         r"(?<=[0-9])\+",
         r"(?<=°[FfCcKk])\.",  # °C. -> ["°C", "."]
-        r"(?<=[0-9])°[FfCcKk]",  # 4°C -> ["4", "°C"]
         r"(?<=[0-9])%",  # 4% -> ["4", "%"]
         r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
         r"(?<=[0-9])(?:{u})".format(u=UNITS),
@@ -22,14 +32,17 @@ _suffixes = (
             al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES
         ),
         r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
+        r"(?<=[{a}])[{h}]({hs})".format(
+            a=ALPHA, h=HYPHENS, hs=merge_chars(_hyphen_suffixes)
+        ),
     ]
 )
 
 
 _infixes = TOKENIZER_INFIXES + [
     r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
 ]
 
 
+TOKENIZER_PREFIXES = _prefixes
 TOKENIZER_SUFFIXES = _suffixes
 TOKENIZER_INFIXES = _infixes
@@ -3,7 +3,7 @@ import re
 from .punctuation import ELISION, HYPHENS
 from ..tokenizer_exceptions import URL_PATTERN
 from ..char_classes import ALPHA_LOWER, ALPHA
-from ...symbols import ORTH, LEMMA, TAG
+from ...symbols import ORTH, LEMMA
 
 # not using the large _tokenizer_exceptions_list by default as it slows down the tokenizer
 # from ._tokenizer_exceptions_list import FR_BASE_EXCEPTIONS
@@ -53,7 +53,28 @@ for exc_data in [
     _exc[exc_data[ORTH]] = [exc_data]
 
 
-for orth in ["etc."]:
+for orth in [
+    "après-midi",
+    "au-delà",
+    "au-dessus",
+    "celle-ci",
+    "celles-ci",
+    "celui-ci",
+    "cf.",
+    "ci-dessous",
+    "elle-même",
+    "en-dessous",
+    "etc.",
+    "jusque-là",
+    "lui-même",
+    "MM.",
+    "No.",
+    "peut-être",
+    "pp.",
+    "quelques-uns",
+    "rendez-vous",
+    "Vol.",
+]:
     _exc[orth] = [{ORTH: orth}]
 
 
@@ -69,7 +90,7 @@ for verb, verb_lemma in [
     for pronoun in ["elle", "il", "on"]:
         token = f"{orth}-t-{pronoun}"
         _exc[token] = [
-            {LEMMA: verb_lemma, ORTH: orth, TAG: "VERB"},
+            {LEMMA: verb_lemma, ORTH: orth},  # , TAG: "VERB"},
             {LEMMA: "t", ORTH: "-t"},
             {LEMMA: pronoun, ORTH: "-" + pronoun},
         ]
@@ -78,7 +99,7 @@ for verb, verb_lemma in [("est", "être")]:
     for orth in [verb, verb.title()]:
         token = f"{orth}-ce"
         _exc[token] = [
-            {LEMMA: verb_lemma, ORTH: orth, TAG: "VERB"},
+            {LEMMA: verb_lemma, ORTH: orth},  # , TAG: "VERB"},
             {LEMMA: "ce", ORTH: "-ce"},
         ]
 
@@ -86,12 +107,29 @@ for verb, verb_lemma in [("est", "être")]:
 for pre, pre_lemma in [("qu'", "que"), ("n'", "ne")]:
     for orth in [pre, pre.title()]:
         _exc[f"{orth}est-ce"] = [
-            {LEMMA: pre_lemma, ORTH: orth, TAG: "ADV"},
-            {LEMMA: "être", ORTH: "est", TAG: "VERB"},
+            {LEMMA: pre_lemma, ORTH: orth},
+            {LEMMA: "être", ORTH: "est"},
             {LEMMA: "ce", ORTH: "-ce"},
         ]
 
 
+for verb, pronoun in [("est", "il"), ("EST", "IL")]:
+    token = "{}-{}".format(verb, pronoun)
+    _exc[token] = [
+        {LEMMA: "être", ORTH: verb},
+        {LEMMA: pronoun, ORTH: "-" + pronoun},
+    ]
+
+
+for s, verb, pronoun in [("s", "est", "il"), ("S", "EST", "IL")]:
+    token = "{}'{}-{}".format(s, verb, pronoun)
+    _exc[token] = [
+        {LEMMA: "se", ORTH: s + "'"},
+        {LEMMA: "être", ORTH: verb},
+        {LEMMA: pronoun, ORTH: "-" + pronoun},
+    ]
+
+
 _infixes_exc = []
 orig_elision = "'"
 orig_hyphen = "-"
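A small check of the new French special cases (a hedged sketch; it requires a spaCy build that includes this change):

```python
import spacy

nlp = spacy.blank("fr")
# "est-il" is now an explicit exception, so it splits into
# verb + pronoun subtokens instead of staying a single token
print([t.text for t in nlp("Est-ce vrai ? est-il parti ?")])
```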
@@ -1,7 +1,7 @@
 from .stop_words import STOP_WORDS
 from .tag_map import TAG_MAP
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .punctuation import TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
 
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
@@ -19,6 +19,7 @@ class ItalianDefaults(Language.Defaults):
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
+    prefixes = TOKENIZER_PREFIXES
     infixes = TOKENIZER_INFIXES
@@ -1,12 +1,29 @@
-from ..punctuation import TOKENIZER_INFIXES
-from ..char_classes import ALPHA
+from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
+from ..char_classes import LIST_ELLIPSES, LIST_ICONS
+from ..char_classes import ALPHA, HYPHENS, CONCAT_QUOTES
+from ..char_classes import ALPHA_LOWER, ALPHA_UPPER
 
 
-ELISION = " ' ’ ".strip().replace(" ", "")
+ELISION = "'’"
 
 
-_infixes = TOKENIZER_INFIXES + [
-    r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
-]
+_prefixes = [r"'[0-9][0-9]", r"[0-9]+°"] + BASE_TOKENIZER_PREFIXES
+
+
+_infixes = (
+    LIST_ELLIPSES
+    + LIST_ICONS
+    + [
+        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
+        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
+            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
+        ),
+        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}])(?:{h})(?=[{al}])".format(a=ALPHA, h=HYPHENS, al=ALPHA_LOWER),
+        r"(?<=[{a}0-9])[:<>=\/](?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}][{el}])(?=[{a}0-9\"])".format(a=ALPHA, el=ELISION),
+    ]
+)
 
+TOKENIZER_PREFIXES = _prefixes
 TOKENIZER_INFIXES = _infixes
@@ -1,5 +1,55 @@
 from ...symbols import ORTH, LEMMA
 
-_exc = {"po'": [{ORTH: "po'", LEMMA: "poco"}]}
+_exc = {
+    "all'art.": [{ORTH: "all'"}, {ORTH: "art."}],
+    "dall'art.": [{ORTH: "dall'"}, {ORTH: "art."}],
+    "dell'art.": [{ORTH: "dell'"}, {ORTH: "art."}],
+    "L'art.": [{ORTH: "L'"}, {ORTH: "art."}],
+    "l'art.": [{ORTH: "l'"}, {ORTH: "art."}],
+    "nell'art.": [{ORTH: "nell'"}, {ORTH: "art."}],
+    "po'": [{ORTH: "po'", LEMMA: "poco"}],
+    "sett..": [{ORTH: "sett."}, {ORTH: "."}],
+}
+
+for orth in [
+    "..",
+    "....",
+    "al.",
+    "all-path",
+    "art.",
+    "Art.",
+    "artt.",
+    "att.",
+    "by-pass",
+    "c.d.",
+    "centro-sinistra",
+    "check-up",
+    "Civ.",
+    "cm.",
+    "Cod.",
+    "col.",
+    "Cost.",
+    "d.C.",
+    'de"',
+    "distr.",
+    "E'",
+    "ecc.",
+    "e-mail",
+    "e/o",
+    "etc.",
+    "Jr.",
+    "n°",
+    "nord-est",
+    "pag.",
+    "Proc.",
+    "prof.",
+    "sett.",
+    "s.p.a.",
+    "ss.",
+    "St.",
+    "tel.",
+    "week-end",
+]:
+    _exc[orth] = [{ORTH: orth}]
+
 TOKENIZER_EXCEPTIONS = _exc
@@ -2,11 +2,13 @@ from ..char_classes import LIST_ELLIPSES, LIST_ICONS, ALPHA, ALPHA_LOWER, ALPHA_
 
 ELISION = " ' ’ ".strip().replace(" ", "")
 
+abbrev = ("d", "D")
+
 _infixes = (
     LIST_ELLIPSES
     + LIST_ICONS
     + [
-        r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
+        r"(?<=^[{ab}][{el}])(?=[{a}])".format(ab=abbrev, a=ALPHA, el=ELISION),
         r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
         r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
         r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
@@ -7,6 +7,8 @@ _exc = {}
 
 # translate / delete what is not necessary
 for exc_data in [
+    {ORTH: "’t", LEMMA: "et", NORM: "et"},
+    {ORTH: "’T", LEMMA: "et", NORM: "et"},
     {ORTH: "'t", LEMMA: "et", NORM: "et"},
     {ORTH: "'T", LEMMA: "et", NORM: "et"},
     {ORTH: "wgl.", LEMMA: "wannechgelift", NORM: "wannechgelift"},
31
spacy/lang/lij/__init__.py
Normal file

@@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals

from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups


class LigurianDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "lij"
    lex_attr_getters[NORM] = add_lookups(
        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
    )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    infixes = TOKENIZER_INFIXES


class Ligurian(Language):
    lang = "lij"
    Defaults = LigurianDefaults


__all__ = ["Ligurian"]
18
spacy/lang/lij/examples.py
Normal file

@@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals

"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.lij.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

sentences = [
    "Sciusciâ e sciorbî no se peu.",
    "Graçie di çetroin, che me son arrivæ.",
    "Vegnime apreuvo, che ve fasso pescâ di òmmi.",
    "Bella pe sempre l'ægua inta conchetta quande unn'agoggia d'ægua a se â trapaña.",
]
15
spacy/lang/lij/punctuation.py
Normal file

@@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals

from ..punctuation import TOKENIZER_INFIXES
from ..char_classes import ALPHA


ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")


_infixes = TOKENIZER_INFIXES + [
    r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
]


TOKENIZER_INFIXES = _infixes
43
spacy/lang/lij/stop_words.py
Normal file

@@ -0,0 +1,43 @@
# coding: utf8
from __future__ import unicode_literals


STOP_WORDS = set(
    """
a à â a-a a-e a-i a-o aiva aloa an ancheu ancon apreuvo ascì atra atre atri atro avanti avei

bella belle belli bello ben

ch' che chì chi ciù co-a co-e co-i co-o comm' comme con cösa coscì cöse

d' da da-a da-e da-i da-o dapeu de delongo derê di do doe doî donde dòppo

é e ê ea ean emmo en ëse

fin fiña

gh' ghe guæei

i î in insemme int' inta inte inti into

l' lê lì lô

m' ma manco me megio meno mezo mi

na n' ne ni ninte nisciun nisciuña no

o ò ô oua

parte pe pe-a pe-i pe-e pe-o perché pittin pö primma pròpio

quæ quand' quande quarche quella quelle quelli quello

s' sce scê sci sciâ sciô sciù se segge seu sò solo son sott' sta stæta stæte stæti stæto ste sti sto

tanta tante tanti tanto te ti torna tra tròppo tutta tutte tutti tutto

un uña unn' unna

za zu
""".split()
)
|
52
spacy/lang/lij/tokenizer_exceptions.py
Normal file
52
spacy/lang/lij/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,52 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
from ...symbols import ORTH, LEMMA
|
||||||
|
|
||||||
|
_exc = {}
|
||||||
|
|
||||||
|
for raw, lemma in [
|
||||||
|
("a-a", "a-o"),
|
||||||
|
("a-e", "a-o"),
|
||||||
|
("a-o", "a-o"),
|
||||||
|
("a-i", "a-o"),
|
||||||
|
("co-a", "co-o"),
|
||||||
|
("co-e", "co-o"),
|
||||||
|
("co-i", "co-o"),
|
||||||
|
("co-o", "co-o"),
|
||||||
|
("da-a", "da-o"),
|
||||||
|
("da-e", "da-o"),
|
||||||
|
("da-i", "da-o"),
|
||||||
|
("da-o", "da-o"),
|
||||||
|
("pe-a", "pe-o"),
|
||||||
|
("pe-e", "pe-o"),
|
||||||
|
("pe-i", "pe-o"),
|
||||||
|
("pe-o", "pe-o"),
|
||||||
|
]:
|
||||||
|
for orth in [raw, raw.capitalize()]:
|
||||||
|
_exc[orth] = [{ORTH: orth, LEMMA: lemma}]
|
||||||
|
|
||||||
|
# Prefix + prepositions with à (e.g. "sott'a-o")
|
||||||
|
|
||||||
|
for prep, prep_lemma in [
|
||||||
|
("a-a", "a-o"),
|
||||||
|
("a-e", "a-o"),
|
||||||
|
("a-o", "a-o"),
|
||||||
|
("a-i", "a-o"),
|
||||||
|
]:
|
||||||
|
for prefix, prefix_lemma in [
|
||||||
|
("sott'", "sotta"),
|
||||||
|
("sott’", "sotta"),
|
||||||
|
("contr'", "contra"),
|
||||||
|
("contr’", "contra"),
|
||||||
|
("ch'", "che"),
|
||||||
|
("ch’", "che"),
|
||||||
|
("s'", "se"),
|
||||||
|
("s’", "se"),
|
||||||
|
]:
|
||||||
|
for prefix_orth in [prefix, prefix.capitalize()]:
|
||||||
|
_exc[prefix_orth + prep] = [
|
||||||
|
{ORTH: prefix_orth, LEMMA: prefix_lemma},
|
||||||
|
{ORTH: prep, LEMMA: prep_lemma},
|
||||||
|
]
|
||||||
|
|
||||||
|
TOKENIZER_EXCEPTIONS = _exc
|
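The nested loops produce entries like `sott'a-o` → `sott'` + `a-o`. A minimal check of the generated dict (the import path assumes a spaCy install that ships this new module):

```python
from spacy.lang.lij.tokenizer_exceptions import TOKENIZER_EXCEPTIONS

# two subtokens, each carrying its own lemma, exactly as built above
# (the dict keys ORTH and LEMMA print as integer symbol IDs)
print(TOKENIZER_EXCEPTIONS["sott'a-o"])
```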
|
@ -1,3 +1,4 @@
|
||||||
|
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
|
||||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
from .stop_words import STOP_WORDS
|
from .stop_words import STOP_WORDS
|
||||||
from .lex_attrs import LEX_ATTRS
|
from .lex_attrs import LEX_ATTRS
|
||||||
|
@ -23,7 +24,13 @@ class LithuanianDefaults(Language.Defaults):
|
||||||
)
|
)
|
||||||
lex_attr_getters.update(LEX_ATTRS)
|
lex_attr_getters.update(LEX_ATTRS)
|
||||||
|
|
||||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
infixes = TOKENIZER_INFIXES
|
||||||
|
suffixes = TOKENIZER_SUFFIXES
|
||||||
|
mod_base_exceptions = {
|
||||||
|
exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".")
|
||||||
|
}
|
||||||
|
del mod_base_exceptions["8)"]
|
||||||
|
tokenizer_exceptions = update_exc(mod_base_exceptions, TOKENIZER_EXCEPTIONS)
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
tag_map = TAG_MAP
|
tag_map = TAG_MAP
|
||||||
morph_rules = MORPH_RULES
|
morph_rules = MORPH_RULES
|
||||||
|
|
29
spacy/lang/lt/punctuation.py
Normal file

@@ -0,0 +1,29 @@
# coding: utf8
from __future__ import unicode_literals

from ..char_classes import LIST_ICONS, LIST_ELLIPSES
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
from ..char_classes import HYPHENS
from ..punctuation import TOKENIZER_SUFFIXES


_infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)


_suffixes = ["\."] + list(TOKENIZER_SUFFIXES)


TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes
|
@@ -3,262 +3,264 @@ from ...symbols import ORTH
 _exc = {}
 
 for orth in [
-    "G.",
-    "J. E.",
-    "J. Em.",
-    "J.E.",
-    "J.Em.",
-    "K.",
-    "N.",
-    "V.",
-    "Vt.",
-    "a.",
-    "a.k.",
-    "a.s.",
-    "adv.",
-    "akad.",
-    "aklg.",
-    "akt.",
-    "al.",
-    "ang.",
-    "angl.",
-    "aps.",
-    "apskr.",
-    "apyg.",
-    "arbat.",
-    "asist.",
-    "asm.",
-    "asm.k.",
-    "asmv.",
-    "atk.",
-    "atsak.",
-    "atsisk.",
-    "atsisk.sąsk.",
-    "atv.",
-    "aut.",
-    "avd.",
-    "b.k.",
-    "baud.",
-    "biol.",
-    "bkl.",
-    "bot.",
-    "bt.",
-    "buv.",
-    "ch.",
-    "chem.",
-    "corp.",
-    "d.",
-    "dab.",
-    "dail.",
-    "dek.",
-    "deš.",
-    "dir.",
-    "dirig.",
-    "doc.",
-    "dol.",
-    "dr.",
-    "drp.",
-    "dvit.",
-    "dėst.",
-    "dš.",
-    "dž.",
-    "e.b.",
-    "e.bankas",
-    "e.p.",
-    "e.parašas",
-    "e.paštas",
-    "e.v.",
-    "e.valdžia",
-    "egz.",
-    "eil.",
-    "ekon.",
-    "el.",
-    "el.bankas",
-    "el.p.",
-    "el.parašas",
-    "el.paštas",
-    "el.valdžia",
-    "etc.",
-    "ež.",
-    "fak.",
-    "faks.",
-    "feat.",
-    "filol.",
-    "filos.",
-    "g.",
-    "gen.",
-    "geol.",
-    "gerb.",
-    "gim.",
-    "gr.",
-    "gv.",
-    "gyd.",
-    "gyv.",
-    "habil.",
-    "inc.",
-    "insp.",
-    "inž.",
-    "ir pan.",
-    "ir t. t.",
-    "isp.",
-    "istor.",
-    "it.",
-    "just.",
-    "k.",
-    "k. a.",
-    "k.a.",
-    "kab.",
-    "kand.",
-    "kart.",
-    "kat.",
-    "ketv.",
-    "kh.",
-    "kl.",
-    "kln.",
-    "km.",
-    "kn.",
-    "koresp.",
-    "kpt.",
-    "kr.",
-    "kt.",
-    "kub.",
-    "kun.",
-    "kv.",
-    "kyš.",
-    "l. e. p.",
-    "l.e.p.",
-    "lenk.",
-    "liet.",
-    "lot.",
-    "lt.",
-    "ltd.",
-    "ltn.",
-    "m.",
-    "m.e..",
-    "m.m.",
-    "mat.",
-    "med.",
-    "mgnt.",
-    "mgr.",
-    "min.",
-    "mjr.",
-    "ml.",
-    "mln.",
-    "mlrd.",
-    "mob.",
-    "mok.",
-    "moksl.",
-    "mokyt.",
-    "mot.",
-    "mr.",
-    "mst.",
-    "mstl.",
-    "mėn.",
-    "nkt.",
-    "no.",
-    "nr.",
-    "ntk.",
-    "nuotr.",
-    "op.",
-    "org.",
-    "orig.",
-    "p.",
-    "p.d.",
-    "p.m.e.",
-    "p.s.",
-    "pab.",
-    "pan.",
-    "past.",
-    "pav.",
-    "pavad.",
-    "per.",
-    "perd.",
-    "pirm.",
-    "pl.",
-    "plg.",
-    "plk.",
-    "pr.",
-    "pr.Kr.",
-    "pranc.",
-    "proc.",
-    "prof.",
-    "prom.",
-    "prot.",
-    "psl.",
-    "pss.",
-    "pvz.",
-    "pšt.",
-    "r.",
-    "raj.",
-    "red.",
-    "rez.",
-    "rež.",
-    "rus.",
-    "rš.",
-    "s.",
-    "sav.",
-    "saviv.",
-    "sek.",
-    "sekr.",
-    "sen.",
-    "sh.",
-    "sk.",
-    "skg.",
-    "skv.",
-    "skyr.",
-    "sp.",
-    "spec.",
-    "sr.",
-    "st.",
-    "str.",
-    "stud.",
-    "sąs.",
-    "t.",
-    "t. p.",
-    "t. y.",
-    "t.p.",
-    "t.t.",
-    "t.y.",
-    "techn.",
-    "tel.",
-    "teol.",
-    "th.",
-    "tir.",
-    "trit.",
-    "trln.",
-    "tšk.",
-    "tūks.",
-    "tūkst.",
-    "up.",
-    "upl.",
-    "v.s.",
-    "vad.",
-    "val.",
-    "valg.",
-    "ved.",
-    "vert.",
-    "vet.",
-    "vid.",
-    "virš.",
-    "vlsč.",
-    "vnt.",
-    "vok.",
-    "vs.",
-    "vtv.",
-    "vv.",
-    "vyr.",
-    "vyresn.",
-    "zool.",
-    "Įn",
-    "įl.",
-    "š.m.",
-    "šnek.",
-    "šv.",
-    "švč.",
-    "ž.ū.",
-    "žin.",
-    "žml.",
-    "žr.",
+    "n-tosios",
+    "?!",
+    # "G.",
+    # "J. E.",
+    # "J. Em.",
+    # "J.E.",
+    # "J.Em.",
+    # "K.",
+    # "N.",
+    # "V.",
+    # "Vt.",
+    # "a.",
+    # "a.k.",
+    # "a.s.",
+    # "adv.",
+    # "akad.",
+    # "aklg.",
+    # "akt.",
+    # "al.",
+    # "ang.",
+    # "angl.",
+    # "aps.",
+    # "apskr.",
+    # "apyg.",
+    # "arbat.",
+    # "asist.",
+    # "asm.",
+    # "asm.k.",
+    # "asmv.",
+    # "atk.",
+    # "atsak.",
+    # "atsisk.",
+    # "atsisk.sąsk.",
+    # "atv.",
+    # "aut.",
+    # "avd.",
+    # "b.k.",
+    # "baud.",
+    # "biol.",
+    # "bkl.",
+    # "bot.",
+    # "bt.",
+    # "buv.",
+    # "ch.",
+    # "chem.",
+    # "corp.",
+    # "d.",
+    # "dab.",
+    # "dail.",
+    # "dek.",
+    # "deš.",
+    # "dir.",
+    # "dirig.",
+    # "doc.",
+    # "dol.",
+    # "dr.",
+    # "drp.",
+    # "dvit.",
+    # "dėst.",
+    # "dš.",
+    # "dž.",
+    # "e.b.",
+    # "e.bankas",
+    # "e.p.",
+    # "e.parašas",
+    # "e.paštas",
+    # "e.v.",
+    # "e.valdžia",
+    # "egz.",
+    # "eil.",
+    # "ekon.",
+    # "el.",
+    # "el.bankas",
+    # "el.p.",
+    # "el.parašas",
+    # "el.paštas",
+    # "el.valdžia",
+    # "etc.",
+    # "ež.",
+    # "fak.",
+    # "faks.",
+    # "feat.",
+    # "filol.",
+    # "filos.",
+    # "g.",
+    # "gen.",
+    # "geol.",
+    # "gerb.",
+    # "gim.",
+    # "gr.",
+    # "gv.",
+    # "gyd.",
+    # "gyv.",
+    # "habil.",
+    # "inc.",
+    # "insp.",
+    # "inž.",
+    # "ir pan.",
+    # "ir t. t.",
+    # "isp.",
+    # "istor.",
+    # "it.",
+    # "just.",
+    # "k.",
+    # "k. a.",
+    # "k.a.",
+    # "kab.",
+    # "kand.",
+    # "kart.",
+    # "kat.",
+    # "ketv.",
+    # "kh.",
+    # "kl.",
+    # "kln.",
+    # "km.",
+    # "kn.",
+    # "koresp.",
+    # "kpt.",
+    # "kr.",
+    # "kt.",
+    # "kub.",
+    # "kun.",
+    # "kv.",
+    # "kyš.",
+    # "l. e. p.",
+    # "l.e.p.",
+    # "lenk.",
+    # "liet.",
+    # "lot.",
+    # "lt.",
+    # "ltd.",
+    # "ltn.",
+    # "m.",
+    # "m.e..",
+    # "m.m.",
+    # "mat.",
+    # "med.",
+    # "mgnt.",
+    # "mgr.",
+    # "min.",
+    # "mjr.",
+    # "ml.",
+    # "mln.",
+    # "mlrd.",
+    # "mob.",
+    # "mok.",
+    # "moksl.",
+    # "mokyt.",
+    # "mot.",
+    # "mr.",
+    # "mst.",
+    # "mstl.",
+    # "mėn.",
+    # "nkt.",
+    # "no.",
+    # "nr.",
+    # "ntk.",
+    # "nuotr.",
+    # "op.",
+    # "org.",
+    # "orig.",
+    # "p.",
+    # "p.d.",
+    # "p.m.e.",
+    # "p.s.",
+    # "pab.",
+    # "pan.",
+    # "past.",
+    # "pav.",
+    # "pavad.",
+    # "per.",
+    # "perd.",
+    # "pirm.",
+    # "pl.",
+    # "plg.",
+    # "plk.",
+    # "pr.",
+    # "pr.Kr.",
+    # "pranc.",
+    # "proc.",
+    # "prof.",
+    # "prom.",
+    # "prot.",
+    # "psl.",
+    # "pss.",
+    # "pvz.",
+    # "pšt.",
+    # "r.",
+    # "raj.",
+    # "red.",
+    # "rez.",
+    # "rež.",
+    # "rus.",
+    # "rš.",
+    # "s.",
+    # "sav.",
+    # "saviv.",
+    # "sek.",
+    # "sekr.",
+    # "sen.",
+    # "sh.",
+    # "sk.",
+    # "skg.",
+    # "skv.",
+    # "skyr.",
+    # "sp.",
+    # "spec.",
+    # "sr.",
+    # "st.",
+    # "str.",
+    # "stud.",
+    # "sąs.",
+    # "t.",
+    # "t. p.",
+    # "t. y.",
+    # "t.p.",
+    # "t.t.",
+    # "t.y.",
+    # "techn.",
+    # "tel.",
+    # "teol.",
+    # "th.",
+    # "tir.",
+    # "trit.",
+    # "trln.",
+    # "tšk.",
+    # "tūks.",
+    # "tūkst.",
+    # "up.",
+    # "upl.",
+    # "v.s.",
+    # "vad.",
+    # "val.",
+    # "valg.",
+    # "ved.",
+    # "vert.",
+    # "vet.",
+    # "vid.",
+    # "virš.",
+    # "vlsč.",
+    # "vnt.",
+    # "vok.",
+    # "vs.",
+    # "vtv.",
+    # "vv.",
+    # "vyr.",
+    # "vyresn.",
+    # "zool.",
+    # "Įn",
+    # "įl.",
+    # "š.m.",
+    # "šnek.",
+    # "šv.",
+    # "švč.",
+    # "ž.ū.",
+    # "žin.",
+    # "žml.",
+    # "žr.",
 ]:
     _exc[orth] = [{ORTH: orth}]
@@ -1,4 +1,6 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_SUFFIXES
 from .stop_words import STOP_WORDS
 from .morph_rules import MORPH_RULES
 from .syntax_iterators import SYNTAX_ITERATORS
@@ -18,6 +20,9 @@ class NorwegianDefaults(Language.Defaults):
         Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
     )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    prefixes = TOKENIZER_PREFIXES
+    infixes = TOKENIZER_INFIXES
+    suffixes = TOKENIZER_SUFFIXES
     stop_words = STOP_WORDS
     morph_rules = MORPH_RULES
     tag_map = TAG_MAP
@@ -1,13 +1,29 @@
-from ..char_classes import LIST_ELLIPSES, LIST_ICONS
+from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
 from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
-from ..punctuation import TOKENIZER_SUFFIXES
+from ..char_classes import CURRENCY, PUNCT, UNITS, LIST_CURRENCY
 
-# Punctuation stolen from Danish
+
+# Punctuation adapted from Danish
 _quotes = CONCAT_QUOTES.replace("'", "")
+_list_punct = [x for x in LIST_PUNCT if x != "#"]
+_list_icons = [x for x in LIST_ICONS if x != "°"]
+_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
+_list_quotes = [x for x in LIST_QUOTES if x != "\\'"]
+
+
+_prefixes = (
+    ["§", "%", "=", "—", "–", r"\+(?![0-9])"]
+    + _list_punct
+    + LIST_ELLIPSES
+    + LIST_QUOTES
+    + LIST_CURRENCY
+    + LIST_ICONS
+)
+
 
 _infixes = (
     LIST_ELLIPSES
-    + LIST_ICONS
+    + _list_icons
     + [
         r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
         r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
@@ -18,13 +34,26 @@ _infixes = (
     ]
 )
 
-_suffixes = [
-    suffix
-    for suffix in TOKENIZER_SUFFIXES
-    if suffix not in ["'s", "'S", "’s", "’S", r"\'"]
-]
-_suffixes += [r"(?<=[^sSxXzZ])\'"]
+_suffixes = (
+    LIST_PUNCT
+    + LIST_ELLIPSES
+    + _list_quotes
+    + _list_icons
+    + ["—", "–"]
+    + [
+        r"(?<=[0-9])\+",
+        r"(?<=°[FfCcKk])\.",
+        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
+        r"(?<=[0-9])(?:{u})".format(u=UNITS),
+        r"(?<=[{al}{e}{p}(?:{q})])\.".format(
+            al=ALPHA_LOWER, e=r"%²\-\+", q=_quotes, p=PUNCT
+        ),
+        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
+    ]
+    + [r"(?<=[^sSxXzZ])'"]
+)
 
 
+TOKENIZER_PREFIXES = _prefixes
 TOKENIZER_INFIXES = _infixes
 TOKENIZER_SUFFIXES = _suffixes
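One detail worth calling out in the rewritten Norwegian suffixes: the final rule only splits a trailing apostrophe when the preceding letter is not s/x/z, so genitives like "Nils'" stay intact. A standalone sketch of just that regex (example words are illustrative):

    import re

    suffix = r"(?<=[^sSxXzZ])'"
    assert re.search(suffix, "Nils'") is None       # genitive apostrophe kept
    assert re.search(suffix, "hagen'") is not None  # otherwise split off as a suffix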
@@ -21,57 +21,80 @@ for exc_data in [
 
 
 for orth in [
-    "adm.dir.",
-    "a.m.",
-    "andelsnr",
+    "Ap.",
     "Aq.",
+    "Ca.",
+    "Chr.",
+    "Co.",
+    "Co.",
+    "Dr.",
+    "F.eks.",
+    "Fr.p.",
+    "Frp.",
+    "Grl.",
+    "Kr.",
+    "Kr.F.",
+    "Kr.F.s",
+    "Mr.",
+    "Mrs.",
+    "Pb.",
+    "Pr.",
+    "Sp.",
+    "Sp.",
+    "St.",
+    "a.m.",
+    "ad.",
+    "adm.dir.",
+    "andelsnr",
     "b.c.",
     "bl.a.",
     "bla.",
     "bm.",
     "bnr.",
     "bto.",
+    "c.c.",
     "ca.",
     "cand.mag.",
-    "c.c.",
     "co.",
     "d.d.",
-    "dept.",
     "d.m.",
-    "dr.philos.",
-    "dvs.",
     "d.y.",
-    "E. coli",
+    "dept.",
+    "dr.",
+    "dr.med.",
+    "dr.philos.",
+    "dr.psychol.",
+    "dvs.",
+    "e.Kr.",
+    "e.l.",
     "eg.",
     "ekskl.",
-    "e.Kr.",
     "el.",
-    "e.l.",
     "et.",
     "etc.",
     "etg.",
     "ev.",
     "evt.",
     "f.",
+    "f.Kr.",
     "f.eks.",
+    "f.o.m.",
     "fhv.",
     "fk.",
-    "f.Kr.",
-    "f.o.m.",
     "foreg.",
     "fork.",
     "fv.",
     "fvt.",
     "g.",
-    "gt.",
     "gl.",
     "gno.",
     "gnr.",
     "grl.",
+    "gt.",
+    "h.r.adv.",
     "hhv.",
     "hoh.",
     "hr.",
-    "h.r.adv.",
     "ifb.",
     "ifm.",
     "iht.",
@@ -80,39 +103,45 @@ for orth in [
     "jf.",
     "jr.",
     "jun.",
+    "juris.",
     "kfr.",
+    "kgl.",
     "kgl.res.",
     "kl.",
     "komm.",
     "kr.",
     "kst.",
+    "lat.",
     "lø.",
+    "m.a.o.",
+    "m.fl.",
+    "m.m.",
+    "m.v.",
     "ma.",
     "mag.art.",
-    "m.a.o.",
     "md.",
     "mfl.",
+    "mht.",
     "mill.",
     "min.",
-    "m.m.",
     "mnd.",
     "moh.",
-    "Mr.",
+    "mrd.",
     "muh.",
     "mv.",
     "mva.",
+    "n.å.",
     "ndf.",
     "no.",
     "nov.",
     "nr.",
     "nto.",
     "nyno.",
-    "n.å.",
     "o.a.",
+    "o.l.",
     "off.",
     "ofl.",
     "okt.",
-    "o.l.",
     "on.",
     "op.",
     "org.",
@@ -120,14 +149,15 @@ for orth in [
     "ovf.",
     "p.",
     "p.a.",
-    "Pb.",
+    "p.g.a.",
+    "p.m.",
+    "p.t.",
     "pga.",
     "ph.d.",
     "pkt.",
-    "p.m.",
     "pr.",
     "pst.",
-    "p.t.",
+    "pt.",
     "red.anm.",
     "ref.",
     "res.",
@@ -136,6 +166,10 @@ for orth in [
     "rv.",
     "s.",
     "s.d.",
+    "s.k.",
+    "s.k.",
+    "s.u.",
+    "s.å.",
     "sen.",
     "sep.",
     "siviling.",
@@ -145,16 +179,17 @@ for orth in [
     "sr.",
     "sst.",
     "st.",
-    "stip.",
-    "stk.",
     "st.meld.",
     "st.prp.",
+    "stip.",
+    "stk.",
     "stud.",
-    "s.u.",
     "sv.",
-    "sø.",
-    "s.å.",
     "såk.",
+    "sø.",
+    "t.h.",
+    "t.o.m.",
+    "t.v.",
     "temp.",
     "ti.",
     "tils.",
@@ -162,7 +197,6 @@ for orth in [
     "tl;dr",
     "tlf.",
     "to.",
-    "t.o.m.",
     "ult.",
     "utg.",
     "v.",
@@ -176,8 +210,10 @@ for orth in [
     "vol.",
     "vs.",
     "vsa.",
+    "©NTB",
     "årg.",
     "årh.",
+    "§§",
 ]:
     _exc[orth] = [{ORTH: orth}]
@@ -1,69 +1,47 @@
-from ...symbols import ORTH, NORM
+from ...symbols import ORTH
 
 
-_exc = {
-    "às": [{ORTH: "à", NORM: "a"}, {ORTH: "s", NORM: "as"}],
-    "ao": [{ORTH: "a"}, {ORTH: "o"}],
-    "aos": [{ORTH: "a"}, {ORTH: "os"}],
-    "àquele": [{ORTH: "à", NORM: "a"}, {ORTH: "quele", NORM: "aquele"}],
-    "àquela": [{ORTH: "à", NORM: "a"}, {ORTH: "quela", NORM: "aquela"}],
-    "àqueles": [{ORTH: "à", NORM: "a"}, {ORTH: "queles", NORM: "aqueles"}],
-    "àquelas": [{ORTH: "à", NORM: "a"}, {ORTH: "quelas", NORM: "aquelas"}],
-    "àquilo": [{ORTH: "à", NORM: "a"}, {ORTH: "quilo", NORM: "aquilo"}],
-    "aonde": [{ORTH: "a"}, {ORTH: "onde"}],
-}
-
-
-# Contractions
-_per_pron = ["ele", "ela", "eles", "elas"]
-_dem_pron = [
-    "este",
-    "esta",
-    "estes",
-    "estas",
-    "isto",
-    "esse",
-    "essa",
-    "esses",
-    "essas",
-    "isso",
-    "aquele",
-    "aquela",
-    "aqueles",
-    "aquelas",
-    "aquilo",
-]
-_und_pron = ["outro", "outra", "outros", "outras"]
-_adv = ["aqui", "aí", "ali", "além"]
-
-
-for orth in _per_pron + _dem_pron + _und_pron + _adv:
-    _exc["d" + orth] = [{ORTH: "d", NORM: "de"}, {ORTH: orth}]
-
-for orth in _per_pron + _dem_pron + _und_pron:
-    _exc["n" + orth] = [{ORTH: "n", NORM: "em"}, {ORTH: orth}]
+_exc = {}
 
 
 for orth in [
     "Adm.",
+    "Art.",
+    "art.",
+    "Av.",
+    "av.",
+    "Cia.",
+    "dom.",
     "Dr.",
+    "dr.",
     "e.g.",
     "E.g.",
     "E.G.",
+    "e/ou",
+    "ed.",
+    "eng.",
+    "etc.",
+    "Fund.",
     "Gen.",
     "Gov.",
     "i.e.",
     "I.e.",
     "I.E.",
+    "Inc.",
     "Jr.",
+    "km/h",
     "Ltd.",
+    "Mr.",
     "p.m.",
     "Ph.D.",
     "Rep.",
     "Rev.",
+    "S/A",
     "Sen.",
     "Sr.",
+    "sr.",
     "Sra.",
+    "sra.",
     "vs.",
     "tel.",
     "pág.",
@@ -1,5 +1,7 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
+from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_SUFFIXES
 
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
@@ -21,6 +23,9 @@ class RomanianDefaults(Language.Defaults):
     )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
+    prefixes = TOKENIZER_PREFIXES
+    suffixes = TOKENIZER_SUFFIXES
+    infixes = TOKENIZER_INFIXES
     tag_map = TAG_MAP
 
164 spacy/lang/ro/punctuation.py Normal file
@@ -0,0 +1,164 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import itertools
+
+from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
+from ..char_classes import LIST_ICONS, CURRENCY
+from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
+
+
+_list_icons = [x for x in LIST_ICONS if x != "°"]
+_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
+
+
+_ro_variants = {
+    "Ă": ["Ă", "A"],
+    "Â": ["Â", "A"],
+    "Î": ["Î", "I"],
+    "Ș": ["Ș", "Ş", "S"],
+    "Ț": ["Ț", "Ţ", "T"],
+}
+
+
+def _make_ro_variants(tokens):
+    variants = []
+    for token in tokens:
+        upper_token = token.upper()
+        upper_char_variants = [_ro_variants.get(c, [c]) for c in upper_token]
+        upper_variants = ["".join(x) for x in itertools.product(*upper_char_variants)]
+        for variant in upper_variants:
+            variants.extend([variant, variant.lower(), variant.title()])
+    return sorted(list(set(variants)))
+
+
+# UD_Romanian-RRT closed class prefixes
+# POS: ADP|AUX|CCONJ|DET|NUM|PART|PRON|SCONJ
+_ud_rrt_prefixes = [
+    "a-",
+    "c-",
+    "ce-",
+    "cu-",
+    "d-",
+    "de-",
+    "dintr-",
+    "e-",
+    "făr-",
+    "i-",
+    "l-",
+    "le-",
+    "m-",
+    "mi-",
+    "n-",
+    "ne-",
+    "p-",
+    "pe-",
+    "prim-",
+    "printr-",
+    "s-",
+    "se-",
+    "te-",
+    "v-",
+    "într-",
+    "ș-",
+    "și-",
+    "ți-",
+]
+_ud_rrt_prefix_variants = _make_ro_variants(_ud_rrt_prefixes)
+
+
+# UD_Romanian-RRT closed class suffixes without NUM
+# POS: ADP|AUX|CCONJ|DET|PART|PRON|SCONJ
+_ud_rrt_suffixes = [
+    "-a",
+    "-aceasta",
+    "-ai",
+    "-al",
+    "-ale",
+    "-alta",
+    "-am",
+    "-ar",
+    "-astea",
+    "-atâta",
+    "-au",
+    "-aș",
+    "-ați",
+    "-i",
+    "-ilor",
+    "-l",
+    "-le",
+    "-lea",
+    "-mea",
+    "-meu",
+    "-mi",
+    "-mă",
+    "-n",
+    "-ndărătul",
+    "-ne",
+    "-o",
+    "-oi",
+    "-or",
+    "-s",
+    "-se",
+    "-si",
+    "-te",
+    "-ul",
+    "-ului",
+    "-un",
+    "-uri",
+    "-urile",
+    "-urilor",
+    "-veți",
+    "-vă",
+    "-ăștia",
+    "-și",
+    "-ți",
+]
+_ud_rrt_suffix_variants = _make_ro_variants(_ud_rrt_suffixes)
+
+
+_prefixes = (
+    ["§", "%", "=", "—", "–", r"\+(?![0-9])"]
+    + _ud_rrt_prefix_variants
+    + LIST_PUNCT
+    + LIST_ELLIPSES
+    + LIST_QUOTES
+    + LIST_CURRENCY
+    + LIST_ICONS
+)
+
+
+_suffixes = (
+    _ud_rrt_suffix_variants
+    + LIST_PUNCT
+    + LIST_ELLIPSES
+    + LIST_QUOTES
+    + _list_icons
+    + ["—", "–"]
+    + [
+        r"(?<=[0-9])\+",
+        r"(?<=°[FfCcKk])\.",
+        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
+        r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
+            al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
+        ),
+        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
+    ]
+)
+
+
+_infixes = (
+    LIST_ELLIPSES
+    + _list_icons
+    + [
+        r"(?<=[0-9])[+\*^](?=[0-9-])",
+        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
+            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
+        ),
+        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
+    ]
+)
+
+
+TOKENIZER_PREFIXES = _prefixes
+TOKENIZER_SUFFIXES = _suffixes
+TOKENIZER_INFIXES = _infixes
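To see what _make_ro_variants produces, here is a small usage sketch (assuming this commit's module layout; the exact list order comes from sorted):

    from spacy.lang.ro.punctuation import _make_ro_variants

    # "ș-" expands over Ș (comma below), Ş (cedilla) and the ASCII fallback S,
    # each in upper, lower and title case, then deduplicated.
    assert set(_make_ro_variants(["ș-"])) == {"Ș-", "ș-", "Ş-", "ş-", "S-", "s-"}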
@@ -1,4 +1,5 @@
 from ...symbols import ORTH
+from .punctuation import _make_ro_variants
 
 
 _exc = {}
@@ -42,8 +43,52 @@ for orth in [
     "dpdv",
     "șamd.",
     "ș.a.m.d.",
+    # below: from UD_Romanian-RRT:
+    "A.c.",
+    "A.f.",
+    "A.r.",
+    "Al.",
+    "Art.",
+    "Aug.",
+    "Bd.",
+    "Dem.",
+    "Dr.",
+    "Fig.",
+    "Fr.",
+    "Gh.",
+    "Gr.",
+    "Lt.",
+    "Nr.",
+    "Obs.",
+    "Prof.",
+    "Sf.",
+    "a.m.",
+    "a.r.",
+    "alin.",
+    "art.",
+    "d-l",
+    "d-lui",
+    "d-nei",
+    "ex.",
+    "fig.",
+    "ian.",
+    "lit.",
+    "lt.",
+    "p.a.",
+    "p.m.",
+    "pct.",
+    "prep.",
+    "sf.",
+    "tel.",
+    "univ.",
+    "îngr.",
+    "într-adevăr",
+    "Șt.",
+    "ș.a.",
 ]:
-    _exc[orth] = [{ORTH: orth}]
+    # note: does not distinguish capitalized-only exceptions from others
+    for variant in _make_ro_variants([orth]):
+        _exc[variant] = [{ORTH: variant}]
 
 
 TOKENIZER_EXCEPTIONS = _exc
@@ -13,6 +13,7 @@ import multiprocessing as mp
 from itertools import chain, cycle
 
 from .tokenizer import Tokenizer
+from .tokens.underscore import Underscore
 from .vocab import Vocab
 from .lemmatizer import Lemmatizer
 from .lookups import Lookups
@@ -874,7 +875,10 @@ class Language(object):
             sender.send()
 
         procs = [
-            mp.Process(target=_apply_pipes, args=(self.make_doc, pipes, rch, sch))
+            mp.Process(
+                target=_apply_pipes,
+                args=(self.make_doc, pipes, rch, sch, Underscore.get_state()),
+            )
             for rch, sch in zip(texts_q, bytedocs_send_ch)
         ]
         for proc in procs:
@@ -1146,16 +1150,19 @@ def _pipe(examples, proc, kwargs):
         yield ex
 
 
-def _apply_pipes(make_doc, pipes, reciever, sender):
+def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors):
     """Worker for Language.pipe
 
     receiver (multiprocessing.Connection): Pipe to receive text. Usually
         created by `multiprocessing.Pipe()`
     sender (multiprocessing.Connection): Pipe to send doc. Usually created by
        `multiprocessing.Pipe()`
+    underscore_state (tuple): The data in the Underscore class of the parent
+    vectors (dict): The global vectors data, copied from the parent
     """
+    Underscore.load_state(underscore_state)
     while True:
-        texts = reciever.get()
+        texts = receiver.get()
        docs = (make_doc(text) for text in texts)
        for pipe in pipes:
            docs = pipe(docs)
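Background for the Language.pipe change above: custom extension attributes are registered in plain class-level dicts on Underscore, which are not inherited by spawned worker processes, so the parent snapshots that state and each worker restores it before processing. A minimal sketch of the pattern (the worker body is elided):

    from spacy.tokens.underscore import Underscore

    # Parent process: snapshot the extension registries before starting workers.
    state = Underscore.get_state()

    def worker(underscore_state):
        # Child process: re-register the extensions, then run the pipeline.
        Underscore.load_state(underscore_state)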
@@ -664,6 +664,8 @@ def _get_attr_values(spec, string_store):
             continue
         if attr == "TEXT":
             attr = "ORTH"
+        if attr == "IS_SENT_START":
+            attr = "SENT_START"
         attr = IDS.get(attr)
     if isinstance(value, basestring):
         value = string_store.add(value)
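With the alias above, Matcher patterns can use the token-attribute spelling IS_SENT_START instead of the internal SENT_START ID. A hedged usage sketch against the spaCy 2.x Matcher API (the pattern name is illustrative):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")
    matcher = Matcher(nlp.vocab)
    # Both spellings now validate; previously "IS_SENT_START" was not recognized.
    matcher.add("SENT_INITIAL_HI", None, [{"IS_SENT_START": True}, {"LOWER": "hi"}])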
@@ -365,7 +365,7 @@ class Tensorizer(Pipe):
         return sgd
 
 
-@component("tagger", assigns=["token.tag", "token.pos"])
+@component("tagger", assigns=["token.tag", "token.pos", "token.lemma"])
 class Tagger(Pipe):
     """Pipeline component for part-of-speech tagging.
@@ -464,3 +464,5 @@ cdef enum symbol_t:
     ENT_KB_ID
     MORPH
     ENT_ID
+
+    IDX
@@ -89,6 +89,7 @@ IDS = {
     "SPACY": SPACY,
     "PROB": PROB,
     "LANG": LANG,
+    "IDX": IDX,
 
     "ADJ": ADJ,
     "ADP": ADP,
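Together, these two symbol changes expose token character offsets through the array API; the regression test further down exercises the same path. A minimal sketch:

    from spacy.tokens import Doc
    from spacy.vocab import Vocab

    doc = Doc(Vocab(), words=["An", "example", "sentence"])
    offsets = doc.to_array("IDX")  # token start characters: [0, 3, 11]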
@@ -80,6 +80,11 @@ def es_tokenizer():
     return get_lang_class("es").Defaults.create_tokenizer()
 
 
+@pytest.fixture(scope="session")
+def eu_tokenizer():
+    return get_lang_class("eu").Defaults.create_tokenizer()
+
+
 @pytest.fixture(scope="session")
 def fi_tokenizer():
     return get_lang_class("fi").Defaults.create_tokenizer()
@@ -63,3 +63,39 @@ def test_doc_array_to_from_string_attrs(en_vocab, attrs):
     words = ["An", "example", "sentence"]
     doc = Doc(en_vocab, words=words)
     Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
+
+
+def test_doc_array_idx(en_vocab):
+    """Test that Doc.to_array can retrieve token start indices"""
+    words = ["An", "example", "sentence"]
+    offsets = Doc(en_vocab, words=words).to_array("IDX")
+    assert offsets[0] == 0
+    assert offsets[1] == 3
+    assert offsets[2] == 11
+
+
+def test_doc_from_array_heads_in_bounds(en_vocab):
+    """Test that Doc.from_array doesn't set heads that are out of bounds."""
+    words = ["This", "is", "a", "sentence", "."]
+    doc = Doc(en_vocab, words=words)
+    for token in doc:
+        token.head = doc[0]
+
+    # correct
+    arr = doc.to_array(["HEAD"])
+    doc_from_array = Doc(en_vocab, words=words)
+    doc_from_array.from_array(["HEAD"], arr)
+
+    # head before start
+    arr = doc.to_array(["HEAD"])
+    arr[0] = -1
+    doc_from_array = Doc(en_vocab, words=words)
+    with pytest.raises(ValueError):
+        doc_from_array.from_array(["HEAD"], arr)
+
+    # head after end
+    arr = doc.to_array(["HEAD"])
+    arr[0] = 5
+    doc_from_array = Doc(en_vocab, words=words)
+    with pytest.raises(ValueError):
+        doc_from_array.from_array(["HEAD"], arr)
@@ -145,10 +145,9 @@ def test_doc_api_runtime_error(en_tokenizer):
     # Example that caused run-time error while parsing Reddit
     # fmt: off
     text = "67% of black households are single parent \n\n72% of all black babies born out of wedlock \n\n50% of all black kids don\u2019t finish high school"
-    deps = ["nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "",
-            "nummod", "prep", "det", "amod", "pobj", "acl", "prep", "prep",
-            "pobj", "", "nummod", "prep", "det", "amod", "pobj", "aux", "neg",
-            "ROOT", "amod", "dobj"]
+    deps = ["nummod", "nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "", "nummod", "appos", "prep", "det",
+            "amod", "pobj", "acl", "prep", "prep", "pobj",
+            "", "nummod", "nsubj", "prep", "det", "amod", "pobj", "aux", "neg", "ccomp", "amod", "dobj"]
     # fmt: on
     tokens = en_tokenizer(text)
     doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
@@ -272,19 +271,9 @@ def test_doc_is_nered(en_vocab):
 def test_doc_from_array_sent_starts(en_vocab):
     words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
     heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
-    deps = [
-        "ROOT",
-        "dep",
-        "dep",
-        "dep",
-        "dep",
-        "dep",
-        "ROOT",
-        "dep",
-        "dep",
-        "dep",
-        "dep",
-    ]
+    # fmt: off
+    deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
+    # fmt: on
     doc = Doc(en_vocab, words=words)
     for i, (dep, head) in enumerate(zip(deps, heads)):
         doc[i].dep_ = dep
@@ -164,6 +164,11 @@ def test_doc_token_api_head_setter(en_tokenizer):
     assert doc[4].left_edge.i == 0
     assert doc[2].left_edge.i == 0
 
+    # head token must be from the same document
+    doc2 = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
+    with pytest.raises(ValueError):
+        doc[0].head = doc2[0]
+
 
 def test_is_sent_start(en_tokenizer):
     doc = en_tokenizer("This is a sentence. This is another.")
@@ -211,7 +216,7 @@ def test_token_api_conjuncts_chain(en_vocab):
 def test_token_api_conjuncts_simple(en_vocab):
     words = "They came and went .".split()
     heads = [1, 0, -1, -2, -1]
-    deps = ["nsubj", "ROOT", "cc", "conj"]
+    deps = ["nsubj", "ROOT", "cc", "conj", "dep"]
     doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
     assert [w.text for w in doc[1].conjuncts] == ["went"]
     assert [w.text for w in doc[3].conjuncts] == ["came"]
@@ -4,6 +4,15 @@ from spacy.tokens import Doc, Span, Token
 from spacy.tokens.underscore import Underscore
 
 
+@pytest.fixture(scope="function", autouse=True)
+def clean_underscore():
+    # reset the Underscore object after the test, to avoid having state copied across tests
+    yield
+    Underscore.doc_extensions = {}
+    Underscore.span_extensions = {}
+    Underscore.token_extensions = {}
+
+
 def test_create_doc_underscore():
     doc = Mock()
     doc.doc = doc
@@ -55,7 +55,8 @@ def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
     ("Kristiansen c/o Madsen", 3),
     ("Sprogteknologi a/s", 2),
     ("De boede i A/B Bellevue", 5),
-    ("Rotorhastigheden er 3400 o/m.", 5),
+    # note: skipping due to weirdness in UD_Danish-DDT
+    # ("Rotorhastigheden er 3400 o/m.", 5),
     ("Jeg købte billet t/r.", 5),
     ("Murerarbejdsmand m/k søges", 3),
     ("Netværket kører over TCP/IP", 4),
22 spacy/tests/lang/eu/test_text.py Normal file
@@ -0,0 +1,22 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_eu_tokenizer_handles_long_text(eu_tokenizer):
+    text = """ta nere guitarra estrenatu ondoren"""
+    tokens = eu_tokenizer(text)
+    assert len(tokens) == 5
+
+
+@pytest.mark.parametrize(
+    "text,length",
+    [
+        ("milesker ederra joan zen hitzaldia plazer hutsa", 7),
+        ("astelehen guztia sofan pasau biot", 5),
+    ],
+)
+def test_eu_tokenizer_handles_cnts(eu_tokenizer, text, length):
+    tokens = eu_tokenizer(text)
+    assert len(tokens) == length
@@ -7,12 +7,22 @@ ABBREVIATION_TESTS = [
         ["Hyvää", "uutta", "vuotta", "t.", "siht.", "Niemelä", "!"],
     ),
     ("Paino on n. 2.2 kg", ["Paino", "on", "n.", "2.2", "kg"]),
+    (
+        "Vuonna 1 eaa. tapahtui kauheita.",
+        ["Vuonna", "1", "eaa.", "tapahtui", "kauheita", "."],
+    ),
 ]
 
 HYPHENATED_TESTS = [
     (
-        "1700-luvulle sijoittuva taide-elokuva",
-        ["1700-luvulle", "sijoittuva", "taide-elokuva"],
+        "1700-luvulle sijoittuva taide-elokuva Wikimedia-säätiön Varsinais-Suomen",
+        [
+            "1700-luvulle",
+            "sijoittuva",
+            "taide-elokuva",
+            "Wikimedia-säätiön",
+            "Varsinais-Suomen",
+        ],
     )
 ]
@@ -23,6 +33,7 @@ ABBREVIATION_INFLECTION_TESTS = [
     ),
     ("ALV:n osuus on 24 %.", ["ALV:n", "osuus", "on", "24", "%", "."]),
     ("Hiihtäjä oli kilpailun 14:s.", ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]),
+    ("EU:n toimesta tehtiin jotain.", ["EU:n", "toimesta", "tehtiin", "jotain", "."]),
 ]
@@ -12,11 +12,11 @@ def test_lt_tokenizer_handles_long_text(lt_tokenizer):
     [
         (
             "177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.",
-            15,
+            17,
         ),
         (
             "ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.",
-            16,
+            18,
         ),
     ],
 )
@@ -28,7 +28,7 @@ def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
 @pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
 def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
     tokens = lt_tokenizer(text)
-    assert len(tokens) == 1
+    assert len(tokens) == 2
 
 
 @pytest.mark.parametrize(
@@ -4,6 +4,8 @@ from mock import Mock
 from spacy.matcher import Matcher, DependencyMatcher
 from spacy.tokens import Doc, Token
 
+from ..doc.test_underscore import clean_underscore  # noqa: F401
+
 
 @pytest.fixture
 def matcher(en_vocab):
@@ -197,6 +199,7 @@ def test_matcher_any_token_operator(en_vocab):
     assert matches[2] == "test hello world"
 
 
+@pytest.mark.usefixtures("clean_underscore")
 def test_matcher_extension_attribute(en_vocab):
     matcher = Matcher(en_vocab)
     get_is_fruit = lambda token: token.text in ("apple", "banana")
@@ -31,6 +31,8 @@ TEST_PATTERNS = [
     ([{"NORM": "a"}, {"POS": {"IN": ["NOUN"]}}], 0, 0),
     ([{"_": {"foo": {"NOT_IN": ["bar", "baz"]}, "a": 5, "b": {">": 10}}}], 0, 0),
     ([{"orth": "foo"}], 0, 0),  # prev: xfail
+    ([{"IS_SENT_START": True}], 0, 0),
+    ([{"SENT_START": True}], 0, 0),
 ]
@@ -31,23 +31,23 @@ BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
 @pytest.fixture
 def heads():
     # fmt: off
-    return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, -10, 2, 1, -3, -1, -15,
-            -1, 1, 4, -1, 1, -3, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
-            -4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, 3, 1, 1, -14,
-            1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 2, 1,
-            0, -1, 1, -2, -1, 2, 1, -4, -8, 0, 1, -2, -1, -1, 3, -1, 1, -6,
-            9, 1, 7, -1, 1, -2, 3, 2, 1, -10, -1, 1, -2, -22, -1, 1, 0, -1,
-            2, 1, -4, -1, -2, -1, 1, -2, -6, -7, 1, -9, -1, 2, -1, -3, -1,
-            3, 2, 1, -4, -19, -24, 3, 2, 1, -4, -1, 1, 2, -1, -5, -34, 1, 0,
-            -1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, -3, -1,
-            -1, 3, 2, 1, 0, -1, -2, 7, -1, 5, 1, 3, -1, 1, -10, -1, -2, 1,
-            -2, -15, 1, 0, -1, -1, 2, 1, -3, -1, -1, -2, -1, 1, -2, -12, 1,
-            1, 0, 1, -2, -1, -2, -3, 9, -1, 2, -1, -4, 2, 1, -3, -4, -15, 2,
-            1, -3, -1, 2, 1, -3, -8, -9, -1, -2, -1, -4, 1, -2, -3, 1, -2,
-            -19, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
-            0, -1, 1, -2, -4, 1, 0, -1, -1, 2, -1, -3, 1, -2, 1, -2, 3, 1,
-            1, -4, -1, -2, 2, 1, -5, -19, -1, 1, 1, 0, 1, 6, -1, 1, -3, -1,
-            -1, -8, -9, -1]
+    return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, 2, 1, -12, -1, -2,
+            -1, 1, 4, 3, 1, 1, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
+            -4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, -11, 1, 1, -14,
+            1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 1, 1,
+            0, -1, 1, -2, -1, 2, 1, -4, -8, 18, 1, -2, -1, -1, 3, -1, 1, 10,
+            9, 1, 7, -1, 1, -2, 3, 2, 1, 0, -1, 1, -2, -4, -1, 1, 0, -1,
+            2, 1, -4, -1, 2, 1, 1, 1, -6, -11, 1, 20, -1, 2, -1, -3, -1,
+            3, 2, 1, -4, -10, -11, 3, 2, 1, -4, -1, 1, -3, -1, 0, -1, 1, 0,
+            -1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, 6, -1,
+            -1, 3, 2, 1, 0, -1, -2, 7, -1, 2, 1, 3, -1, 1, -10, -1, -2, 1,
+            -2, -5, 1, 0, -1, -1, 1, -2, -5, -1, -1, -2, -1, 1, -2, -12, 1,
+            1, 0, 1, -2, -1, -4, -5, 18, -1, 2, -1, -4, 2, 1, -3, -4, -5, 2,
+            1, -3, -1, 2, 1, -3, -17, -24, -1, -2, -1, -4, 1, -2, -3, 1, -2,
+            -10, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
+            0, -1, 1, -2, -4, 1, 0, -1, -1, 2, -1, -3, 1, -2, 1, -2, 3, 1,
+            1, -4, -1, -2, 2, 1, -3, -19, -1, 1, 1, 0, 0, 6, 5, 1, 3, -1,
+            -1, 0, -1, -1]
     # fmt: on
@@ -48,7 +48,7 @@ def test_issue2203(en_vocab):
     tag_ids = [en_vocab.strings.add(tag) for tag in tags]
     lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
     doc = Doc(en_vocab, words=words)
-    # Work around lemma corrpution problem and set lemmas after tags
+    # Work around lemma corruption problem and set lemmas after tags
     doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
     doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
     assert [t.tag_ for t in doc] == tags
@@ -121,7 +121,7 @@ def test_issue2772(en_vocab):
     words = "When we write or communicate virtually , we can hide our true feelings .".split()
     # A tree with a non-projective (i.e. crossing) arc
     # The arcs (0, 4) and (2, 9) cross.
-    heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, -1, -2, -1]
+    heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, 2, 1, -3, -4]
     deps = ["dep"] * len(heads)
     doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
     assert doc[1].is_sent_start is None
@@ -24,7 +24,7 @@ def test_issue4590(en_vocab):
 
     text = "The quick brown fox jumped over the lazy fox"
     heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
-    deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
+    deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"]
 
     doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
25 spacy/tests/regression/test_issue4725.py Normal file
@@ -0,0 +1,25 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import numpy
+
+from spacy.lang.en import English
+from spacy.vocab import Vocab
+
+
+def test_issue4725():
+    # ensures that this runs correctly and doesn't hang or crash because of the global vectors
+    vocab = Vocab(vectors_name="test_vocab_add_vector")
+    data = numpy.ndarray((5, 3), dtype="f")
+    data[0] = 1.0
+    data[1] = 2.0
+    vocab.set_vector("cat", data[0])
+    vocab.set_vector("dog", data[1])
+
+    nlp = English(vocab=vocab)
+    ner = nlp.create_pipe("ner")
+    nlp.add_pipe(ner)
+    nlp.begin_training()
+    docs = ["Kurt is in London."] * 10
+    for _ in nlp.pipe(docs, batch_size=2, n_process=2):
+        pass
43 spacy/tests/regression/test_issue4903.py Normal file
@@ -0,0 +1,43 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from spacy.lang.en import English
+from spacy.tokens import Span, Doc
+
+
+class CustomPipe:
+    name = "my_pipe"
+
+    def __init__(self):
+        Span.set_extension("my_ext", getter=self._get_my_ext)
+        Doc.set_extension("my_ext", default=None)
+
+    def __call__(self, doc):
+        gathered_ext = []
+        for sent in doc.sents:
+            sent_ext = self._get_my_ext(sent)
+            sent._.set("my_ext", sent_ext)
+            gathered_ext.append(sent_ext)
+
+        doc._.set("my_ext", "\n".join(gathered_ext))
+
+        return doc
+
+    @staticmethod
+    def _get_my_ext(span):
+        return str(span.end)
+
+
+def test_issue4903():
+    # ensures that this runs correctly and doesn't hang or crash on Windows / macOS
+    nlp = English()
+    custom_component = CustomPipe()
+    nlp.add_pipe(nlp.create_pipe("sentencizer"))
+    nlp.add_pipe(custom_component, after="sentencizer")
+
+    text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
+    docs = list(nlp.pipe(text, n_process=2))
+    assert docs[0].text == "I like bananas."
+    assert docs[1].text == "Do you like them?"
+    assert docs[2].text == "No, I prefer wasabi."
@@ -2,7 +2,7 @@ import pytest
 from spacy.language import Language
 
 
-def test_evaluate():
+def test_issue4924():
     nlp = Language()
     docs_golds = [("", {})]
     with pytest.raises(ValueError):
35 spacy/tests/regression/test_issue5048.py Normal file
@@ -0,0 +1,35 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import numpy
+from spacy.tokens import Doc
+from spacy.attrs import DEP, POS, TAG
+
+from ..util import get_doc
+
+
+def test_issue5048(en_vocab):
+    words = ["This", "is", "a", "sentence"]
+    pos_s = ["DET", "VERB", "DET", "NOUN"]
+    spaces = [" ", " ", " ", ""]
+    deps_s = ["dep", "adj", "nn", "atm"]
+    tags_s = ["DT", "VBZ", "DT", "NN"]
+
+    strings = en_vocab.strings
+
+    for w in words:
+        strings.add(w)
+    deps = [strings.add(d) for d in deps_s]
+    pos = [strings.add(p) for p in pos_s]
+    tags = [strings.add(t) for t in tags_s]
+
+    attrs = [POS, DEP, TAG]
+    array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64")
+
+    doc = Doc(en_vocab, words=words, spaces=spaces)
+    doc.from_array(attrs, array)
+    v1 = [(token.text, token.pos_, token.tag_) for token in doc]
+
+    doc2 = get_doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s)
+    v2 = [(token.text, token.pos_, token.tag_) for token in doc2]
+    assert v1 == v2
46 spacy/tests/regression/test_issue5082.py Normal file
@@ -0,0 +1,46 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import numpy as np
+from spacy.lang.en import English
+from spacy.pipeline import EntityRuler
+
+
+def test_issue5082():
+    # Ensure the 'merge_entities' pipeline does something sensible for the vectors of the merged tokens
+    nlp = English()
+    vocab = nlp.vocab
+    array1 = np.asarray([0.1, 0.5, 0.8], dtype=np.float32)
+    array2 = np.asarray([-0.2, -0.6, -0.9], dtype=np.float32)
+    array3 = np.asarray([0.3, -0.1, 0.7], dtype=np.float32)
+    array4 = np.asarray([0.5, 0, 0.3], dtype=np.float32)
+    array34 = np.asarray([0.4, -0.05, 0.5], dtype=np.float32)
+
+    vocab.set_vector("I", array1)
+    vocab.set_vector("like", array2)
+    vocab.set_vector("David", array3)
+    vocab.set_vector("Bowie", array4)
+
+    text = "I like David Bowie"
+    ruler = EntityRuler(nlp)
+    patterns = [
+        {"label": "PERSON", "pattern": [{"LOWER": "david"}, {"LOWER": "bowie"}]}
+    ]
+    ruler.add_patterns(patterns)
+    nlp.add_pipe(ruler)
+
+    parsed_vectors_1 = [t.vector for t in nlp(text)]
+    assert len(parsed_vectors_1) == 4
+    np.testing.assert_array_equal(parsed_vectors_1[0], array1)
+    np.testing.assert_array_equal(parsed_vectors_1[1], array2)
+    np.testing.assert_array_equal(parsed_vectors_1[2], array3)
+    np.testing.assert_array_equal(parsed_vectors_1[3], array4)
+
+    merge_ents = nlp.create_pipe("merge_entities")
+    nlp.add_pipe(merge_ents)
+
+    parsed_vectors_2 = [t.vector for t in nlp(text)]
+    assert len(parsed_vectors_2) == 3
+    np.testing.assert_array_equal(parsed_vectors_2[0], array1)
+    np.testing.assert_array_equal(parsed_vectors_2[1], array2)
+    np.testing.assert_array_equal(parsed_vectors_2[2], array34)
@@ -12,12 +12,19 @@ def load_tokenizer(b):
 
 
 def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
-    """Test that custom tokenizer with not all functions defined can be
-    serialized and deserialized correctly (see #2494)."""
+    """Test that custom tokenizer with not all functions defined or empty
+    properties can be serialized and deserialized correctly (see #2494,
+    #4991)."""
     tokenizer = Tokenizer(en_vocab, suffix_search=en_tokenizer.suffix_search)
     tokenizer_bytes = tokenizer.to_bytes()
     Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
+
+    tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]})
+    tokenizer.rules = {}
+    tokenizer_bytes = tokenizer.to_bytes()
+    tokenizer_reloaded = Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
+    assert tokenizer_reloaded.rules == {}
 
 
 @pytest.mark.skip(reason="Currently unreliable across platforms")
 @pytest.mark.parametrize("text", ["I💜you", "they’re", "“hello”"])
@@ -28,10 +28,10 @@ def test_displacy_parse_deps(en_vocab):
     deps = displacy.parse_deps(doc)
     assert isinstance(deps, dict)
     assert deps["words"] == [
-        {"text": "This", "tag": "DET"},
-        {"text": "is", "tag": "AUX"},
-        {"text": "a", "tag": "DET"},
-        {"text": "sentence", "tag": "NOUN"},
+        {"lemma": None, "text": words[0], "tag": pos[0]},
+        {"lemma": None, "text": words[1], "tag": pos[1]},
+        {"lemma": None, "text": words[2], "tag": pos[2]},
+        {"lemma": None, "text": words[3], "tag": pos[3]},
     ]
     assert deps["arcs"] == [
         {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
@@ -72,7 +72,7 @@ def test_displacy_rtl():
     deps = ["foo", "bar", "foo", "baz"]
     heads = [1, 0, 1, -2]
     nlp = Persian()
-    doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
+    doc = get_doc(nlp.vocab, words=words, tags=pos, heads=heads, deps=deps)
     doc.ents = [Span(doc, 1, 3, label="TEST")]
     html = displacy.render(doc, page=True, style="dep")
     assert "direction: rtl" in html
@@ -4,8 +4,10 @@ import shutil
 import contextlib
 import srsly
 from pathlib import Path
 
+from spacy import Errors
 from spacy.tokens import Doc, Span
-from spacy.attrs import POS, HEAD, DEP
+from spacy.attrs import POS, TAG, HEAD, DEP, LEMMA
 
 
 @contextlib.contextmanager
@@ -22,30 +24,56 @@ def make_tempdir():
     shutil.rmtree(str(d))
 
 
-def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None):
+def get_doc(
+    vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None, lemmas=None
+):
     """Create Doc object from given vocab, words and annotations."""
-    pos = pos or [""] * len(words)
-    tags = tags or [""] * len(words)
-    heads = heads or [0] * len(words)
-    deps = deps or [""] * len(words)
-    for value in deps + tags + pos:
+    if deps and not heads:
+        heads = [0] * len(deps)
+    headings = []
+    values = []
+    annotations = [pos, heads, deps, lemmas, tags]
+    possible_headings = [POS, HEAD, DEP, LEMMA, TAG]
+    for a, annot in enumerate(annotations):
+        if annot is not None:
+            if len(annot) != len(words):
+                raise ValueError(Errors.E189)
+            headings.append(possible_headings[a])
+            if annot is not heads:
+                values.extend(annot)
+    for value in values:
         vocab.strings.add(value)
 
     doc = Doc(vocab, words=words)
-    attrs = doc.to_array([POS, HEAD, DEP])
-    for i, (p, head, dep) in enumerate(zip(pos, heads, deps)):
-        attrs[i, 0] = doc.vocab.strings[p]
-        attrs[i, 1] = head
-        attrs[i, 2] = doc.vocab.strings[dep]
-    doc.from_array([POS, HEAD, DEP], attrs)
+
+    # if there are any other annotations, set them
+    if headings:
+        attrs = doc.to_array(headings)
+
+        j = 0
+        for annot in annotations:
+            if annot:
+                if annot is heads:
+                    for i in range(len(words)):
+                        if attrs.ndim == 1:
+                            attrs[i] = heads[i]
+                        else:
+                            attrs[i, j] = heads[i]
+                else:
+                    for i in range(len(words)):
+                        if attrs.ndim == 1:
+                            attrs[i] = doc.vocab.strings[annot[i]]
+                        else:
+                            attrs[i, j] = doc.vocab.strings[annot[i]]
+                j += 1
+        doc.from_array(headings, attrs)
+
+    # finally, set the entities
     if ents:
         doc.ents = [
             Span(doc, start, end, label=doc.vocab.strings[label])
            for start, end, label in ents
         ]
-    if tags:
-        for token in doc:
-            token.tag_ = tags[token.i]
     return doc
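Hypothetical usage of the rewritten helper above, showing the new validation and `lemmas` support:

```python
from spacy.lang.en import English

vocab = English().vocab
doc = get_doc(vocab, words=["Hello", "world"], tags=["UH", "NN"], lemmas=["hello", "world"])
assert doc[0].lemma_ == "hello"
# annotation lists must now match the number of words:
# get_doc(vocab, words=["Hello", "world"], tags=["UH"]) raises ValueError (E189)
```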
@@ -86,8 +114,7 @@ def assert_docs_equal(doc1, doc2):
 
     assert [t.head.i for t in doc1] == [t.head.i for t in doc2]
     assert [t.dep for t in doc1] == [t.dep for t in doc2]
-    if doc1.is_parsed and doc2.is_parsed:
-        assert [s for s in doc1.sents] == [s for s in doc2.sents]
+    assert [t.is_sent_start for t in doc1] == [t.is_sent_start for t in doc2]
 
     assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
     assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
@@ -699,6 +699,7 @@ cdef class Tokenizer:
 
         DOCS: https://spacy.io/api/tokenizer#to_disk
         """
+        path = util.ensure_path(path)
         with path.open("wb") as file_:
             file_.write(self.to_bytes(**kwargs))
 
@@ -712,6 +713,7 @@ cdef class Tokenizer:
 
         DOCS: https://spacy.io/api/tokenizer#from_disk
         """
+        path = util.ensure_path(path)
         with path.open("rb") as file_:
             bytes_data = file_.read()
         self.from_bytes(bytes_data, **kwargs)
@@ -756,21 +758,20 @@ cdef class Tokenizer:
         }
         exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
         msg = util.from_bytes(bytes_data, deserializers, exclude)
-        if data.get("prefix_search"):
+        if "prefix_search" in data and isinstance(data["prefix_search"], str):
             self.prefix_search = re.compile(data["prefix_search"]).search
-        if data.get("suffix_search"):
+        if "suffix_search" in data and isinstance(data["suffix_search"], str):
             self.suffix_search = re.compile(data["suffix_search"]).search
-        if data.get("infix_finditer"):
+        if "infix_finditer" in data and isinstance(data["infix_finditer"], str):
             self.infix_finditer = re.compile(data["infix_finditer"]).finditer
-        if data.get("token_match"):
+        if "token_match" in data and isinstance(data["token_match"], str):
             self.token_match = re.compile(data["token_match"]).match
-        if data.get("rules"):
+        if "rules" in data and isinstance(data["rules"], dict):
             # make sure to hard reset the cache to remove data from the default exceptions
             self._rules = {}
             self._flush_cache()
             self._flush_specials()
-            self._load_special_cases(data.get("rules", {}))
+            self._load_special_cases(data["rules"])
 
         return self
@@ -213,6 +213,10 @@ def _merge(Doc doc, merges):
         new_orth = ''.join([t.text_with_ws for t in spans[token_index]])
         if spans[token_index][-1].whitespace_:
             new_orth = new_orth[:-len(spans[token_index][-1].whitespace_)]
+        # add the vector of the (merged) entity to the vocab
+        if not doc.vocab.get_vector(new_orth).any():
+            if doc.vocab.vectors_length > 0:
+                doc.vocab.set_vector(new_orth, span.vector)
         token = tokens[token_index]
         lex = doc.vocab.get(doc.mem, new_orth)
         token.lex = lex
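Assuming a merge goes through this code path and the pipeline has word vectors (the model name is an assumption), the user-visible effect is roughly:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # assumes a model with vectors
doc = nlp("New York is big")
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])
# the merged orth "New York" now gets the span's vector in the vocab
assert doc.vocab.get_vector("New York").any()
```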
@@ -19,7 +19,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
 from ..typedefs cimport attr_t, flags_t
 from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
 from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
-from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, attr_id_t
+from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
 from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
 
 from ..attrs import intify_attrs, IDS
@@ -68,6 +68,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
         return token.ent_id
     elif feat_name == ENT_KB_ID:
         return token.ent_kb_id
+    elif feat_name == IDX:
+        return token.idx
     else:
         return Lexeme.get_struct_attr(token.lex, feat_name)
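With `IDX` wired into `get_token_attr`, character offsets can be exported through the array API like any other attribute; a small sketch:

```python
import spacy
from spacy.attrs import ORTH, IDX

nlp = spacy.blank("en")
doc = nlp("Hello world")
arr = doc.to_array([ORTH, IDX])
# the IDX column holds each token's character offset
assert list(arr[:, 1]) == [token.idx for token in doc]  # [0, 6]
```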
@@ -253,7 +255,7 @@ cdef class Doc:
     def is_nered(self):
         """Check if the document has named entities set. Will return True if
         *any* of the tokens has a named entity tag set (even if the others are
-        unknown values).
+        unknown values), or if the document is empty.
         """
         if len(self) == 0:
             return True
@@ -778,10 +780,12 @@ cdef class Doc:
         # Allow strings, e.g. 'lemma' or 'LEMMA'
         attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
                  for id_ in attrs]
+        if array.dtype != numpy.uint64:
+            user_warning(Warnings.W028.format(type=array.dtype))
+
         if SENT_START in attrs and HEAD in attrs:
             raise ValueError(Errors.E032)
-        cdef int i, col
+        cdef int i, col, abs_head_index
         cdef attr_id_t attr_id
         cdef TokenC* tokens = self.c
         cdef int length = len(array)
@@ -795,6 +799,14 @@ cdef class Doc:
             attr_ids[i] = attr_id
         if len(array.shape) == 1:
             array = array.reshape((array.size, 1))
+        # Check that all heads are within the document bounds
+        if HEAD in attrs:
+            col = attrs.index(HEAD)
+            for i in range(length):
+                # cast index to signed int
+                abs_head_index = numpy.int32(array[i, col]) + i
+                if abs_head_index < 0 or abs_head_index >= length:
+                    raise ValueError(Errors.E190.format(index=i, value=array[i, col], rel_head_index=numpy.int32(array[i, col])))
         # Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
         if TAG in attrs:
             col = attrs.index(TAG)
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/doc#to_bytes
|
DOCS: https://spacy.io/api/doc#to_bytes
|
||||||
"""
|
"""
|
||||||
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID] # TODO: ENT_KB_ID ?
|
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM] # TODO: ENT_KB_ID ?
|
||||||
if self.is_tagged:
|
if self.is_tagged:
|
||||||
array_head.extend([TAG, POS])
|
array_head.extend([TAG, POS])
|
||||||
# If doc parsed add head and dep attribute
|
# If doc parsed add head and dep attribute
|
||||||
|
@@ -1166,6 +1178,7 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
         heads_within_sents = _set_lr_kids_and_edges(tokens, length, loop_count)
         if loop_count > 10:
             warnings.warn(Warnings.W026)
+            break
         loop_count += 1
     # Set sentence starts
     for i in range(length):
@@ -626,6 +626,9 @@ cdef class Token:
         # This function sets the head of self to new_head and updates the
         # counters for left/right dependents and left/right corner for the
         # new and the old head
+        # Check that token is from the same document
+        if self.doc != new_head.doc:
+            raise ValueError(Errors.E191)
         # Do nothing if old head is new head
         if self.i + self.c.head == new_head.i:
             return
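Sketch: assigning a head token that belongs to another `Doc` now raises `E191`.

```python
import spacy

nlp = spacy.blank("en")
doc1 = nlp("I like apples")
doc2 = nlp("I like oranges")
try:
    doc1[0].head = doc2[1]
except ValueError as err:
    print(err)  # [E191] the head token must come from the same document
```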
@@ -76,6 +76,14 @@ class Underscore(object):
     def _get_key(self, name):
         return ("._.", name, self._start, self._end)
 
+    @classmethod
+    def get_state(cls):
+        return cls.token_extensions, cls.span_extensions, cls.doc_extensions
+
+    @classmethod
+    def load_state(cls, state):
+        cls.token_extensions, cls.span_extensions, cls.doc_extensions = state
+
 
 def get_ext_args(**kwargs):
     """Validate and convert arguments. Reused in Doc, Token and Span."""
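Roughly what the new classmethods enable: snapshotting and restoring all registered extensions, e.g. to re-register them in a worker process. The extension name below is illustrative.

```python
from spacy.tokens import Doc
from spacy.tokens.underscore import Underscore

Doc.set_extension("my_attr", default=None)
state = Underscore.get_state()   # (token, span, doc) extension registries
Underscore.load_state(state)     # restore them elsewhere
assert "my_attr" in Underscore.doc_extensions
```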
@@ -349,44 +349,6 @@ cdef class Vectors:
                 for i in range(len(queries)) ], dtype="uint64")
         return (keys, best_rows, scores)
 
-    def from_glove(self, path):
-        """Load GloVe vectors from a directory. Assumes binary format,
-        that the vocab is in a vocab.txt, and that vectors are named
-        vectors.{size}.[fd].bin, e.g. vectors.128.f.bin for 128d float32
-        vectors, vectors.300.d.bin for 300d float64 (double) vectors, etc.
-        By default GloVe outputs 64-bit vectors.
-
-        path (unicode / Path): The path to load the GloVe vectors from.
-        RETURNS: A `StringStore` object, holding the key-to-string mapping.
-
-        DOCS: https://spacy.io/api/vectors#from_glove
-        """
-        path = util.ensure_path(path)
-        width = None
-        for name in path.iterdir():
-            if name.parts[-1].startswith("vectors"):
-                _, dims, dtype, _2 = name.parts[-1].split('.')
-                width = int(dims)
-                break
-        else:
-            raise IOError(Errors.E061.format(filename=path))
-        bin_loc = path / f"vectors.{dims}.{dtype}.bin"
-        xp = get_array_module(self.data)
-        self.data = None
-        with bin_loc.open("rb") as file_:
-            self.data = xp.fromfile(file_, dtype=dtype)
-            if dtype != "float32":
-                self.data = xp.ascontiguousarray(self.data, dtype="float32")
-        if self.data.ndim == 1:
-            self.data = self.data.reshape((self.data.size//width, width))
-        n = 0
-        strings = StringStore()
-        with (path / "vocab.txt").open("r") as file_:
-            for i, line in enumerate(file_):
-                key = strings.add(line.strip())
-                self.add(key, row=i)
-        return strings
-
     def to_disk(self, path, **kwargs):
         """Save the current state to a directory.
 
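With `from_glove` removed, vectors have to come in through the regular `Vectors` API, or via the `init-model` CLI with `--vectors-loc`. A rough replacement sketch for GloVe's plain-text format follows; the file name and shape are assumptions, not an official recipe:

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(shape=(400000, 300))  # rows/width must match the vector file
with open("glove.6B.300d.txt", "r", encoding="utf8") as file_:  # hypothetical path
    for line in file_:
        pieces = line.rstrip().split(" ")
        # first field is the word, the rest are the vector components
        vectors.add(pieces[0], vector=numpy.asarray(pieces[1:], dtype="float32"))
```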
@@ -109,9 +109,9 @@ links) and check whether they are compatible with the currently installed
 version of spaCy. Should be run after upgrading spaCy via `pip install -U spacy`
 to ensure that all installed models are can be used with the new version. The
 command is also useful to detect out-of-sync model links resulting from links
-created in different virtual environments. It will a list of models, the
-installed versions, the latest compatible version (if out of date) and the
-commands for updating.
+created in different virtual environments. It will show a list of models and
+their installed versions. If any model is out of date, the latest compatible
+versions and command for updating are shown.
 
 > #### Automated validation
 >
@@ -176,7 +176,7 @@ All output files generated by this command are compatible with
 
 ## Debug data {#debug-data new="2.2"}
 
-Analyze, debug and validate your training and development data, get useful
+Analyze, debug, and validate your training and development data. Get useful
 stats, and find problems like invalid entity annotations, cyclic dependencies,
 low data labels and more.
 
@@ -184,16 +184,17 @@ low data labels and more.
 $ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pipeline] [--ignore-warnings] [--verbose] [--no-format]
 ```
 
 | Argument | Type | Description |
-| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------- |
+| ------------------------------------------------------ | ---------- | -------------------------------------------------------------------------------------------------- |
 | `lang` | positional | Model language. |
 | `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
 | `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
+| `--tag-map-path`, `-tm` <Tag variant="new">2.2.3</Tag> | option | Location of JSON-formatted tag map. |
 | `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
 | `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
 | `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
 | `--verbose`, `-V` | flag | Print additional information and explanations. |
 | --no-format, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
 
 <Accordion title="Example output">
@@ -368,6 +369,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
 | `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
 | `--base-model`, `-b` <Tag variant="new">2.1</Tag> | option | Optional name of base model to update. Can be any loadable spaCy model. |
 | `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
+| `--replace-components`, `-R` | flag | Replace components from the base model. |
 | `--vectors`, `-v` | option | Model to load vectors from. |
 | `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
 | `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |
@@ -378,6 +380,13 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
 | `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
 | `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` |
 | `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` |
+| `--width`, `-cw` <Tag variant="new">2.2.4</Tag> | option | Width of CNN layers of `Tok2Vec` component. |
+| `--conv-depth`, `-cd` <Tag variant="new">2.2.4</Tag> | option | Depth of CNN layers of `Tok2Vec` component. |
+| `--cnn-window`, `-cW` <Tag variant="new">2.2.4</Tag> | option | Window size for CNN layers of `Tok2Vec` component. |
+| `--cnn-pieces`, `-cP` <Tag variant="new">2.2.4</Tag> | option | Maxout size for CNN layers of `Tok2Vec` component. |
+| `--use-chars`, `-chr` <Tag variant="new">2.2.4</Tag> | flag | Whether to use character-based embedding of `Tok2Vec` component. |
+| `--bilstm-depth`, `-lstm` <Tag variant="new">2.2.4</Tag> | option | Depth of BiLSTM layers of `Tok2Vec` component (requires PyTorch). |
+| `--embed-rows`, `-er` <Tag variant="new">2.2.4</Tag> | option | Number of embedding rows of `Tok2Vec` component. |
 | `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. |
 | `--orth-variant-level`, `-ovl` <Tag variant="new">2.2</Tag> | option | Float indicating the orthography variation for data augmentation (e.g. `0.3` for making 30% of occurrences of some tokens subject to replacement). |
 | `--gold-preproc`, `-G` | flag | Use gold preprocessing. |
@@ -385,6 +394,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
 | `--textcat-multilabel`, `-TML` <Tag variant="new">2.2</Tag> | flag | Text classification classes aren't mutually exclusive (multilabel). |
 | `--textcat-arch`, `-ta` <Tag variant="new">2.2</Tag> | option | Text classification model architecture. Defaults to `"bow"`. |
 | `--textcat-positive-label`, `-tpl` <Tag variant="new">2.2</Tag> | option | Text classification positive label for binary classes with two labels. |
+| `--tag-map-path`, `-tm` <Tag variant="new">2.2.4</Tag> | option | Location of JSON-formatted tag map. |
 | `--verbose`, `-VV` <Tag variant="new">2.0.13</Tag> | flag | Show more detailed messages during training. |
 | `--help`, `-h` | flag | Show help message and available arguments. |
 | **CREATES** | model, pickle | A spaCy model on each epoch. |
@@ -7,9 +7,10 @@ source: spacy/tokens/doc.pyx
 
 A `Doc` is a sequence of [`Token`](/api/token) objects. Access sentences and
 named entities, export annotations to numpy arrays, losslessly serialize to
-compressed binary strings. The `Doc` object holds an array of [`TokenC`](/api/cython-structs#tokenc) structs.
-The Python-level `Token` and [`Span`](/api/span) objects are views of this
-array, i.e. they don't own the data themselves.
+compressed binary strings. The `Doc` object holds an array of
+[`TokenC`](/api/cython-structs#tokenc) structs. The Python-level `Token` and
+[`Span`](/api/span) objects are views of this array, i.e. they don't own the
+data themselves.
 
 ## Doc.\_\_init\_\_ {#init tag="method"}
@@ -197,13 +198,14 @@ the character indices don't map to a valid span.
 > assert span.text == "New York"
 > ```
 
 | Name | Type | Description |
-| ----------- | ---------------------------------------- | ------------------------------------------------------- |
+| ------------------------------------ | ---------------------------------------- | --------------------------------------------------------------------- |
 | `start` | int | The index of the first character of the span. |
 | `end` | int | The index of the last character after the span. |
-| `label` | uint64 / unicode | A label to attach to the Span, e.g. for named entities. |
+| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
+| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
 | `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
 | **RETURNS** | `Span` | The newly constructed object or `None`. |
 
 ## Doc.similarity {#similarity tag="method" model="vectors"}
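Sketch of the new `kb_id` argument on `Doc.char_span`; the Wikidata ID here is illustrative:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like New York")
span = doc.char_span(7, 15, label="GPE", kb_id="Q60")
assert span.text == "New York"
assert span.kb_id_ == "Q60"
```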
@@ -655,10 +657,10 @@ The L2 norm of the document's vector representation.
 | `user_data` | - | A generic storage area, for user custom data. |
 | `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
 | `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
-| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
-| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
-| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
-| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if _any_ of the tokens has an entity tag set, even if the others are unknown. |
+| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
+| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
+| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
+| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. |
 | `sentiment` | float | The document's positivity/negativity score, if available. |
 | `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
 | `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
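The empty-`Doc` edge case documented above, as a quick sketch:

```python
import spacy

nlp = spacy.blank("en")
empty = nlp("")
# with nothing left to annotate, all of these report True
assert empty.is_tagged and empty.is_parsed
assert empty.is_sentenced and empty.is_nered
```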
@@ -83,7 +83,8 @@ Find matches in the `Doc` and add them to the `doc.ents`. Typically, this
 happens automatically after the component has been added to the pipeline using
 [`nlp.add_pipe`](/api/language#add_pipe). If the entity ruler was initialized
 with `overwrite_ents=True`, existing entities will be replaced if they overlap
-with the matches.
+with the matches. When matches overlap in a Doc, the entity ruler prioritizes longer
+patterns over shorter ones, and if they are equal, the match occurring first in the Doc is chosen.
 
 > #### Example
 >
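A sketch of the priority rule described above: the longer of two overlapping matches wins, and ties go to the earlier match.

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.blank("en")
ruler = EntityRuler(nlp)
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "PRODUCT", "pattern": "Apple Watch"},  # longer match wins
])
nlp.add_pipe(ruler)
doc = nlp("I bought an Apple Watch")
assert [(ent.text, ent.label_) for ent in doc.ents] == [("Apple Watch", "PRODUCT")]
```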
@@ -172,6 +172,28 @@ Remove a previously registered extension.
 | `name` | unicode | Name of the extension. |
 | **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
 
+## Span.char_span {#char_span tag="method" new="2.2.4"}
+
+Create a `Span` object from the slice `span.text[start:end]`. Returns `None` if
+the character indices don't map to a valid span.
+
+> #### Example
+>
+> ```python
+> doc = nlp("I like New York")
+> span = doc[1:4].char_span(5, 13, label="GPE")
+> assert span.text == "New York"
+> ```
+
+| Name | Type | Description |
+| ----------- | ---------------------------------------- | --------------------------------------------------------------------- |
+| `start` | int | The index of the first character of the span. |
+| `end` | int | The index of the last character after the span. |
+| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
+| `kb_id` | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
+| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
+| **RETURNS** | `Span` | The newly constructed object or `None`. |
+
 ## Span.similarity {#similarity tag="method" model="vectors"}
 
 Make a semantic similarity estimate. The default estimate is cosine similarity
@@ -293,10 +315,10 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
 > assert doc2.text == "New York"
 > ```
 
 | Name | Type | Description |
-| ----------------- | ----- | ---------------------------------------------------- |
+| ---------------- | ----- | ---------------------------------------------------- |
 | `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
 | **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |
 
 ## Span.root {#root tag="property" model="parser"}
@@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
 | `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
 | `lower` | int | Lowercase form of the token. |
 | `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
-| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
-| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
+| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
+| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
 | `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
 | `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
 | `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
@@ -236,21 +236,22 @@ If a setting is not present in the options, the default value will be used.
 > displacy.serve(doc, style="dep", options=options)
 > ```
 
 | Name | Type | Description | Default |
-| ------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
+| ------------------------------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
 | `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
+| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemmas in a separate row below the token texts. | `False` |
 | `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
 | `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
 | `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
 | `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` |
 | `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` |
 | `font` | unicode | Font name or font family for all text. | `'Arial'` |
 | `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
 | `arrow_stroke` | int | Width of arrow path in px. | `2` |
 | `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
 | `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
 | `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
 | `distance` | int | Distance between words in px. | `175` / `150` (compact) |
 
 #### Named Entity Visualizer options {#displacy_options-ent}
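Sketch of the new `add_lemma` option; assumes a model that sets lemmas, e.g. `en_core_web_sm`:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # assumed installed
doc = nlp("This is a sentence")
# the rendered markup now includes a lemma row under each token
html = displacy.render(doc, style="dep", options={"add_lemma": True, "compact": True})
```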
@@ -326,25 +326,6 @@ performed in chunks, to avoid consuming too much memory. You can set the
 | `sort` | bool | Whether to sort the entries returned by score. Defaults to `True`. |
 | **RETURNS** | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. |
 
-## Vectors.from_glove {#from_glove tag="method"}
-
-Load [GloVe](https://nlp.stanford.edu/projects/glove/) vectors from a directory.
-Assumes binary format, that the vocab is in a `vocab.txt`, and that vectors are
-named `vectors.{size}.[fd.bin]`, e.g. `vectors.128.f.bin` for 128d float32
-vectors, `vectors.300.d.bin` for 300d float64 (double) vectors, etc. By default
-GloVe outputs 64-bit vectors.
-
-> #### Example
->
-> ```python
-> vectors = Vectors()
-> vectors.from_glove("/path/to/glove_vectors")
-> ```
-
-| Name   | Type             | Description                              |
-| ------ | ---------------- | ---------------------------------------- |
-| `path` | unicode / `Path` | The path to load the GloVe vectors from. |
-
 ## Vectors.to_disk {#to_disk tag="method"}
 
 Save the current state to a directory.
@@ -622,13 +622,13 @@ categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.
 In order to use this, you'll need training and evaluation data in the
 [JSON format](/api/annotation#json-input) spaCy expects for training.
 
-You can now train the model using a corpus for your language annotated with If
-your data is in one of the supported formats, the easiest solution might be to
-use the [`spacy convert`](/api/cli#convert) command-line utility. This supports
-several popular formats, including the IOB format for named entity recognition,
-the JSONL format produced by our annotation tool [Prodigy](https://prodi.gy),
-and the [CoNLL-U](http://universaldependencies.org/docs/format.html) format used
-by the [Universal Dependencies](http://universaldependencies.org/) corpus.
+If your data is in one of the supported formats, the easiest solution might be
+to use the [`spacy convert`](/api/cli#convert) command-line utility. This
+supports several popular formats, including the IOB format for named entity
+recognition, the JSONL format produced by our annotation tool
+[Prodigy](https://prodi.gy), and the
+[CoNLL-U](http://universaldependencies.org/docs/format.html) format used by the
+[Universal Dependencies](http://universaldependencies.org/) corpus.
 
 One thing to keep in mind is that spaCy expects to train its models from **whole
 documents**, not just single sentences. If your corpus only contains single
@@ -968,7 +968,10 @@ pattern. The entity ruler accepts two types of patterns:
 The [`EntityRuler`](/api/entityruler) is a pipeline component that's typically
 added via [`nlp.add_pipe`](/api/language#add_pipe). When the `nlp` object is
 called on a text, it will find matches in the `doc` and add them as entities to
-the `doc.ents`, using the specified pattern label as the entity label.
+the `doc.ents`, using the specified pattern label as the entity label. If any
+matches were to overlap, the pattern matching most tokens takes priority. If
+they also happen to be equally long, then the match occurring first in the Doc is
+chosen.
 
 ```python
 ### {executable="true"}
@@ -1119,7 +1122,7 @@ entityruler = EntityRuler(nlp)
 patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
 
 other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
-with nlp.disable_pipes(*disable_pipes):
+with nlp.disable_pipes(*other_pipes):
     entityruler.add_patterns(patterns)
 ```
@@ -94,7 +94,7 @@ docs = list(doc_bin.get_docs(nlp.vocab))
 
 If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as
 well, which includes the values of
-[extension attributes](/processing-pipelines#custom-components-attributes) (if
+[extension attributes](/usage/processing-pipelines#custom-components-attributes) (if
 they're serializable with msgpack).
 
 <Infobox title="Important note on serializing extension attributes" variant="warning">