Mirror of https://github.com/explosion/spaCy.git (synced 2024-12-24 17:06:29 +03:00)

Merge branch 'master' into tmp/sync

Commit 46568f40a7
.github/contributors/Baciccin.md (new file, vendored, 106 lines)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). The SCA applies to any contribution that you make to any product or project managed by us (the **"project"**), and sets out the intellectual property rights you grant to us in the contributed materials. The term **"us"** shall mean [ExplosionAI GmbH](https://explosion.ai/legal). The term **"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested below and include the filled-in version with your first pull request, under the folder [`.github/contributors/`](/.github/contributors/). The name of the file should be your GitHub username, with the extension `.md`. For example, the user example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements;
   * you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
   * you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
   * you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
   * you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
   * at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this SCA;
   * to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
   * each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws. You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. We may publicly disclose your participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT mark both statements:

   * [x] I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect to my contributions.
   * [ ] I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
| ----- | ----- |
| Name | Giovanni Battista Parodi |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-03-19 |
| GitHub username | Baciccin |
| Website (optional) | |
.github/contributors/MisterKeefe.md (new file, vendored, 106 lines)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). The SCA applies to any contribution that you make to any product or project managed by us (the **"project"**), and sets out the intellectual property rights you grant to us in the contributed materials. The term **"us"** shall mean [ExplosionAI GmbH](https://explosion.ai/legal). The term **"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested below and include the filled-in version with your first pull request, under the folder [`.github/contributors/`](/.github/contributors/). The name of the file should be your GitHub username, with the extension `.md`. For example, the user example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements;
   * you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
   * you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
   * you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
   * you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
   * at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this SCA;
   * to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
   * each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws. You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. We may publicly disclose your participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT mark both statements:

   * [ ] I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect to my contributions.
   * [ ] I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
| ----- | ----- |
| Name | Tom Keefe |
| Company name (if applicable) | / |
| Title or role (if applicable) | / |
| Date | 18 February 2020 |
| GitHub username | MisterKeefe |
| Website (optional) | / |
.github/contributors/Tiljander.md (new file, vendored, 106 lines)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). The SCA applies to any contribution that you make to any product or project managed by us (the **"project"**), and sets out the intellectual property rights you grant to us in the contributed materials. The term **"us"** shall mean [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term **"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested below and include the filled-in version with your first pull request, under the folder [`.github/contributors/`](/.github/contributors/). The name of the file should be your GitHub username, with the extension `.md`. For example, the user example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements;
   * you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
   * you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
   * you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
   * you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
   * at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this SCA;
   * to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
   * each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws. You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. We may publicly disclose your participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT mark both statements:

   * [x] I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect to my contributions.
   * [ ] I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
| ----- | ----- |
| Name | Henrik Tiljander |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 24/3/2020 |
| GitHub username | Tiljander |
| Website (optional) | |
.github/contributors/dhpollack.md (new file, vendored, 106 lines)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). The SCA applies to any contribution that you make to any product or project managed by us (the **"project"**), and sets out the intellectual property rights you grant to us in the contributed materials. The term **"us"** shall mean [ExplosionAI GmbH](https://explosion.ai/legal). The term **"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested below and include the filled-in version with your first pull request, under the folder [`.github/contributors/`](/.github/contributors/). The name of the file should be your GitHub username, with the extension `.md`. For example, the user example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements;
   * you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
   * you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
   * you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
   * you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
   * at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this SCA;
   * to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
   * each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws. You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. We may publicly disclose your participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT mark both statements:

   * [X] I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect to my contributions.
   * [ ] I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
| ----- | ----- |
| Name | David Pollack |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Mar 5. 2020 |
| GitHub username | dhpollack |
| Website (optional) | |
.github/contributors/guerda.md (new file, vendored, 106 lines)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). The SCA applies to any contribution that you make to any product or project managed by us (the **"project"**), and sets out the intellectual property rights you grant to us in the contributed materials. The term **"us"** shall mean [ExplosionAI GmbH](https://explosion.ai/legal). The term **"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested below and include the filled-in version with your first pull request, under the folder [`.github/contributors/`](/.github/contributors/). The name of the file should be your GitHub username, with the extension `.md`. For example, the user example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements;
   * you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
   * you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
   * you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
   * you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
   * at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this SCA;
   * to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
   * each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws. You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. We may publicly disclose your participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT mark both statements:

   * [x] I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect to my contributions.
   * [ ] I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
| ----- | ----- |
| Name | Philip Gillißen |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-03-24 |
| GitHub username | guerda |
| Website (optional) | |
.github/contributors/mabraham.md (new file, vendored, 89 lines)

@@ -0,0 +1,89 @@

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements;
   * you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
   * you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
   * you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
   * you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
   * at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this SCA;
   * to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
   * each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws. You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. We may publicly disclose your participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT mark both statements:

   * [x] I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect to my contributions.
   * [ ] I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
| ----- | ----- |
| Name | |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | |
| GitHub username | |
| Website (optional) | |
.github/contributors/merrcury.md (new file, vendored, 106 lines)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). The SCA applies to any contribution that you make to any product or project managed by us (the **"project"**), and sets out the intellectual property rights you grant to us in the contributed materials. The term **"us"** shall mean [ExplosionAI GmbH](https://explosion.ai/legal). The term **"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested below and include the filled-in version with your first pull request, under the folder [`.github/contributors/`](/.github/contributors/). The name of the file should be your GitHub username, with the extension `.md`. For example, the user example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements;
   * you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
   * you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
   * you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
   * you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
   * at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this SCA;
   * to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
   * each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws. You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. We may publicly disclose your participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT mark both statements:

   * [X] I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect to my contributions.
   * [ ] I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
| ----- | ----- |
| Name | Himanshu Garg |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-03-10 |
| GitHub username | merrcury |
| Website (optional) | |
.github/contributors/pinealan.md (new file, vendored, 106 lines)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). The SCA applies to any contribution that you make to any product or project managed by us (the **"project"**), and sets out the intellectual property rights you grant to us in the contributed materials. The term **"us"** shall mean [ExplosionAI GmbH](https://explosion.ai/legal). The term **"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested below and include the filled-in version with your first pull request, under the folder [`.github/contributors/`](/.github/contributors/). The name of the file should be your GitHub username, with the extension `.md`. For example, the user example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements;
   * you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
   * you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
   * you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
   * you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
   * at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this SCA;
   * to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
   * each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws. You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. We may publicly disclose your participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT mark both statements:

   * [x] I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect to my contributions.
   * [ ] I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
| ----- | ----- |
| Name | Alan Chan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-03-15 |
| GitHub username | pinealan |
| Website (optional) | http://pinealan.xyz |
.github/contributors/sloev.md (new file, vendored, 106 lines)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). The SCA applies to any contribution that you make to any product or project managed by us (the **"project"**), and sets out the intellectual property rights you grant to us in the contributed materials. The term **"us"** shall mean [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term **"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested below and include the filled-in version with your first pull request, under the folder [`.github/contributors/`](/.github/contributors/). The name of the file should be your GitHub username, with the extension `.md`. For example, the user example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements;
   * you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
   * you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
   * you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
   * you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
   * at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this SCA;
   * to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
   * each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws. You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. We may publicly disclose your participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT mark both statements:

   * [x] I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect to my contributions.
   * [ ] I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
| ----- | ----- |
| Name | Johannes Valbjørn |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-03-13 |
| GitHub username | sloev |
| Website (optional) | https://sloev.github.io |
.gitignore (vendored, 2 lines changed)

@@ -46,6 +46,7 @@ __pycache__/
.venv
env3.6/
venv/
env3.*/
.dev
.denv
.pypyenv

@@ -62,6 +63,7 @@ lib64/
parts/
sdist/
var/
wheelhouse/
*.egg-info/
pip-wheel-metadata/
Pipfile.lock
LICENSE (2 lines changed)

@@ -1,6 +1,6 @@
The MIT License (MIT)

Copyright (C) 2016-2019 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
Copyright (C) 2016-2020 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
Makefile (47 lines changed)

@@ -1,28 +1,37 @@
SHELL := /bin/bash
sha = $(shell "git" "rev-parse" "--short" "HEAD")
version = $(shell "bin/get-version.sh")
wheel = spacy-$(version)-cp36-cp36m-linux_x86_64.whl
PYVER := 3.6
VENV := ./env$(PYVER)

dist/spacy.pex : dist/spacy-$(sha).pex
cp dist/spacy-$(sha).pex dist/spacy.pex
chmod a+rx dist/spacy.pex
version := $(shell "bin/get-version.sh")

dist/spacy-$(sha).pex : dist/$(wheel)
env3.6/bin/python -m pip install pex==1.5.3
env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy_lookups_data
chmod a+rx $@

dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
python3.6 -m venv env3.6
source env3.6/bin/activate
env3.6/bin/pip install wheel
env3.6/bin/pip install -r requirements.txt --no-cache-dir
env3.6/bin/python setup.py build_ext --inplace
env3.6/bin/python setup.py sdist
env3.6/bin/python setup.py bdist_wheel
dist/pytest.pex : wheelhouse/pytest-*.whl
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
chmod a+rx $@

.PHONY : clean
wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
$(VENV)/bin/pip wheel . -w ./wheelhouse
$(VENV)/bin/pip wheel jsonschema spacy_lookups_data -w ./wheelhouse
touch $@

wheelhouse/pytest-%.whl : $(VENV)/bin/pex
$(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse

$(VENV)/bin/pex :
python$(PYVER) -m venv $(VENV)
$(VENV)/bin/pip install -U pip setuptools pex wheel

.PHONY : clean test

test : dist/spacy-$(version).pex dist/pytest.pex
( . $(VENV)/bin/activate ; \
PEX_PATH=dist/spacy-$(version).pex ./dist/pytest.pex --pyargs spacy -x ; )

clean : setup.py
source env3.6/bin/activate
rm -rf dist/*
rm -rf ./wheelhouse
rm -rf $(VENV)
python setup.py clean --all
@@ -2,7 +2,7 @@

### Step 1: Create a Knowledge Base (KB) and training data

Run `wikipedia_pretrain_kb.py`
Run `wikidata_pretrain_kb.py`
* This takes as input the locations of a **Wikipedia and a Wikidata dump**, and produces a **KB directory** + **training file**
* WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
* Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
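Purely as an illustration of the download step described above (not a script from the repository), a minimal Python sketch that fetches the two dumps, assuming the files sit directly under the directories listed in the bullet points:

    from urllib.request import urlretrieve

    # Assumed full URLs: the directories from the bullets above plus the file names given there
    DUMPS = {
        "latest-all.json.bz2": "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2",
        "enwiki-latest-pages-articles-multistream.xml.bz2": "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2",
    }

    for filename, url in DUMPS.items():
        print(f"Downloading {url} ...")
        urlretrieve(url, filename)  # both dumps are tens of GB; expect a long download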
@@ -88,8 +88,8 @@ def read_text(bz2_loc, n=10000):
break


def get_matches(tokenizer, phrases, texts, max_length=6):
matcher = PhraseMatcher(tokenizer.vocab, max_length=max_length)
def get_matches(tokenizer, phrases, texts):
matcher = PhraseMatcher(tokenizer.vocab)
matcher.add("Phrase", None, *phrases)
for text in texts:
doc = tokenizer(text)
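The hunk drops the `max_length` argument from the `PhraseMatcher` call. A minimal, self-contained sketch of the same usage pattern (the blank pipeline, phrases and sample text are illustrative assumptions; the `None` callback follows the older `add()` signature shown above):

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.blank("en")  # tokenizer-only pipeline, standing in for `tokenizer`
    phrases = [nlp(text) for text in ["machine learning", "New York"]]

    matcher = PhraseMatcher(nlp.vocab)      # no max_length, matching the new signature
    matcher.add("Phrase", None, *phrases)   # older add() API, as in the hunk
    doc = nlp("She studies machine learning in New York.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)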
@@ -59,7 +59,7 @@ install_requires =

[options.extras_require]
lookups =
spacy_lookups_data>=0.0.5<0.2.0
spacy_lookups_data>=0.0.5,<0.2.0
cuda =
cupy>=5.0.0b4
cuda80 =
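The change inserts the comma that separates the two version constraints; without it the string reads as a single malformed specifier rather than ">=0.0.5" and "<0.2.0". A quick, hypothetical check with the third-party `packaging` library (not something the repository's own tooling runs) illustrates the difference:

    from packaging.specifiers import InvalidSpecifier, SpecifierSet

    try:
        SpecifierSet(">=0.0.5<0.2.0")       # old value: missing comma
    except InvalidSpecifier as err:
        print("rejected:", err)

    print("0.1.0" in SpecifierSet(">=0.0.5,<0.2.0"))  # corrected value -> True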
@@ -93,3 +93,5 @@ cdef enum attr_id_t:
ENT_KB_ID = symbols.ENT_KB_ID
MORPH
ENT_ID = symbols.ENT_ID

IDX
@@ -89,6 +89,7 @@ IDS = {
"PROB": PROB,
"LANG": LANG,
"MORPH": MORPH,
"IDX": IDX
}
@@ -405,12 +405,10 @@ def train(
losses=losses,
)
except ValueError as e:
msg.warn("Error during training")
err = "Error during training"
if init_tok2vec:
msg.warn(
"Did you provide the same parameters during 'train' as during 'pretrain'?"
)
msg.fail(f"Original error message: {e}", exits=1)
err += " Did you provide the same parameters during 'train' as during 'pretrain'?"
msg.fail(err, f"Original error message: {e}", exits=1)
if raw_text:
# If raw text is available, perform 'rehearsal' updates,
# which use unlabelled data to reduce overfitting.
@ -545,7 +543,40 @@ def train(
|
|||
with nlp.use_params(optimizer.averages):
|
||||
final_model_path = output_path / "model-final"
|
||||
nlp.to_disk(final_model_path)
|
||||
final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
|
||||
meta_loc = output_path / "model-final" / "meta.json"
|
||||
final_meta = srsly.read_json(meta_loc)
|
||||
final_meta.setdefault("accuracy", {})
|
||||
final_meta["accuracy"].update(meta.get("accuracy", {}))
|
||||
final_meta.setdefault("speed", {})
|
||||
final_meta["speed"].setdefault("cpu", None)
|
||||
final_meta["speed"].setdefault("gpu", None)
|
||||
meta.setdefault("speed", {})
|
||||
meta["speed"].setdefault("cpu", None)
|
||||
meta["speed"].setdefault("gpu", None)
|
||||
# combine cpu and gpu speeds with the base model speeds
|
||||
if final_meta["speed"]["cpu"] and meta["speed"]["cpu"]:
|
||||
speed = _get_total_speed(
|
||||
[final_meta["speed"]["cpu"], meta["speed"]["cpu"]]
|
||||
)
|
||||
final_meta["speed"]["cpu"] = speed
|
||||
if final_meta["speed"]["gpu"] and meta["speed"]["gpu"]:
|
||||
speed = _get_total_speed(
|
||||
[final_meta["speed"]["gpu"], meta["speed"]["gpu"]]
|
||||
)
|
||||
final_meta["speed"]["gpu"] = speed
|
||||
# if there were no speeds to update, overwrite with meta
|
||||
if (
|
||||
final_meta["speed"]["cpu"] is None
|
||||
and final_meta["speed"]["gpu"] is None
|
||||
):
|
||||
final_meta["speed"].update(meta["speed"])
|
||||
# note: beam speeds are not combined with the base model
|
||||
if has_beam_widths:
|
||||
final_meta.setdefault("beam_accuracy", {})
|
||||
final_meta["beam_accuracy"].update(meta.get("beam_accuracy", {}))
|
||||
final_meta.setdefault("beam_speed", {})
|
||||
final_meta["beam_speed"].update(meta.get("beam_speed", {}))
|
||||
srsly.write_json(meta_loc, final_meta)
|
||||
msg.good("Saved model to output directory", final_model_path)
|
||||
with msg.loading("Creating best model..."):
|
||||
best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
|
||||
|
@ -630,6 +661,8 @@ def _find_best(experiment_dir, component):
|
|||
if epoch_model.is_dir() and epoch_model.parts[-1] != "model-final":
|
||||
accs = srsly.read_json(epoch_model / "accuracy.json")
|
||||
scores = [accs.get(metric, 0.0) for metric in _get_metrics(component)]
|
||||
# remove per_type dicts from score list for max() comparison
|
||||
scores = [score for score in scores if isinstance(score, float)]
|
||||
accuracies.append((scores, epoch_model))
|
||||
if accuracies:
|
||||
return max(accuracies)[1]
|
||||
|
@ -641,13 +674,13 @@ def _get_metrics(component):
|
|||
if component == "parser":
|
||||
return ("las", "uas", "las_per_type", "token_acc", "sent_f")
|
||||
elif component == "tagger":
|
||||
return ("tags_acc",)
|
||||
return ("tags_acc", "token_acc")
|
||||
elif component == "ner":
|
||||
return ("ents_f", "ents_p", "ents_r", "ents_per_type")
|
||||
return ("ents_f", "ents_p", "ents_r", "ents_per_type", "token_acc")
|
||||
elif component == "senter":
|
||||
return ("sent_f", "sent_p", "sent_r")
|
||||
elif component == "textcat":
|
||||
return ("textcat_score",)
|
||||
return ("textcat_score", "token_acc")
|
||||
return ("token_acc",)
|
||||
|
||||
|
||||
|
@ -714,3 +747,12 @@ def _get_progress(
|
|||
if beam_width is not None:
|
||||
result.insert(1, beam_width)
|
||||
return result
|
||||
|
||||
|
||||
def _get_total_speed(speeds):
|
||||
seconds_per_word = 0.0
|
||||
for words_per_second in speeds:
|
||||
if words_per_second is None:
|
||||
return None
|
||||
seconds_per_word += 1.0 / words_per_second
|
||||
return 1.0 / seconds_per_word
|
||||
|
|
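The new `_get_total_speed` helper folds several words-per-second measurements into one by summing seconds per word, so the combined figure behaves like a harmonic combination and any missing measurement poisons the result. A small worked check (same function body as above, made-up numbers):

def _get_total_speed(speeds):
    seconds_per_word = 0.0
    for words_per_second in speeds:
        if words_per_second is None:
            return None
        seconds_per_word += 1.0 / words_per_second
    return 1.0 / seconds_per_word

print(_get_total_speed([10000, 10000]))  # 5000.0: two equally fast stages halve throughput
print(_get_total_speed([20000, None]))   # None: an unmeasured stage propagates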
|
@ -142,10 +142,17 @@ def parse_deps(orig_doc, options={}):
|
|||
for span, tag, lemma, ent_type in spans:
|
||||
attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
|
||||
retokenizer.merge(span, attrs=attrs)
|
||||
if options.get("fine_grained"):
|
||||
words = [{"text": w.text, "tag": w.tag_} for w in doc]
|
||||
else:
|
||||
words = [{"text": w.text, "tag": w.pos_} for w in doc]
|
||||
fine_grained = options.get("fine_grained")
|
||||
add_lemma = options.get("add_lemma")
|
||||
words = [
|
||||
{
|
||||
"text": w.text,
|
||||
"tag": w.tag_ if fine_grained else w.pos_,
|
||||
"lemma": w.lemma_ if add_lemma else None,
|
||||
}
|
||||
for w in doc
|
||||
]
|
||||
|
||||
arcs = []
|
||||
for word in doc:
|
||||
if word.i < word.head.i:
|
||||
|
|
|
@ -1,6 +1,12 @@
|
|||
import uuid
|
||||
|
||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
|
||||
from .templates import (
|
||||
TPL_DEP_SVG,
|
||||
TPL_DEP_WORDS,
|
||||
TPL_DEP_WORDS_LEMMA,
|
||||
TPL_DEP_ARCS,
|
||||
TPL_ENTS,
|
||||
)
|
||||
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
||||
from ..util import minify_html, escape_html, registry
|
||||
from ..errors import Errors
|
||||
|
@ -80,7 +86,10 @@ class DependencyRenderer(object):
|
|||
self.width = self.offset_x + len(words) * self.distance
|
||||
self.height = self.offset_y + 3 * self.word_spacing
|
||||
self.id = render_id
|
||||
words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
|
||||
words = [
|
||||
self.render_word(w["text"], w["tag"], w.get("lemma", None), i)
|
||||
for i, w in enumerate(words)
|
||||
]
|
||||
arcs = [
|
||||
self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
|
||||
for i, a in enumerate(arcs)
|
||||
|
@ -98,7 +107,9 @@ class DependencyRenderer(object):
|
|||
lang=self.lang,
|
||||
)
|
||||
|
||||
def render_word(self, text, tag, i):
|
||||
def render_word(
|
||||
self, text, tag, lemma, i,
|
||||
):
|
||||
"""Render individual word.
|
||||
|
||||
text (unicode): Word text.
|
||||
|
@ -111,6 +122,10 @@ class DependencyRenderer(object):
|
|||
if self.direction == "rtl":
|
||||
x = self.width - x
|
||||
html_text = escape_html(text)
|
||||
if lemma is not None:
|
||||
return TPL_DEP_WORDS_LEMMA.format(
|
||||
text=html_text, tag=tag, lemma=lemma, x=x, y=y
|
||||
)
|
||||
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
|
||||
|
||||
def render_arrow(self, label, start, end, direction, i):
|
||||
|
|
|
@ -14,6 +14,15 @@ TPL_DEP_WORDS = """
|
|||
"""
|
||||
|
||||
|
||||
TPL_DEP_WORDS_LEMMA = """
|
||||
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
|
||||
<tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
|
||||
<tspan class="displacy-lemma" dy="2em" fill="currentColor" x="{x}">{lemma}</tspan>
|
||||
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
|
||||
</text>
|
||||
"""
|
||||
|
||||
|
||||
TPL_DEP_ARCS = """
|
||||
<g class="displacy-arrow">
|
||||
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
||||
|
|
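The `lemma` plumbing above is driven by the new `add_lemma` display option, alongside the existing `fine_grained` one. A hedged usage sketch; the model name is an assumption and any pipeline with a tagger and parser would do:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # assumed model; needs tagger + parser for tags and lemmas
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# add_lemma renders the extra <tspan class="displacy-lemma"> row per token;
# fine_grained switches the tag row from the coarse POS to the fine-grained tag.
html = displacy.render(doc, style="dep", options={"add_lemma": True, "fine_grained": True})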
|
@ -96,7 +96,10 @@ class Warnings(object):
|
|||
W027 = ("Found a large training file of {size} bytes. Note that it may "
|
||||
"be more efficient to split your training data into multiple "
|
||||
"smaller JSON files instead.")
|
||||
W028 = ("Skipping unsupported morphological feature(s): {feature}. "
|
||||
W028 = ("Doc.from_array was called with a vector of type '{type}', "
|
||||
"but is expecting one of type 'uint64' instead. This may result "
|
||||
"in problems with the vocab further on in the pipeline.")
|
||||
W029 = ("Skipping unsupported morphological feature(s): {feature}. "
|
||||
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
|
||||
"string \"Field1=Value1,Value2|Field2=Value3\".")
|
||||
|
||||
|
@ -531,6 +534,15 @@ class Errors(object):
|
|||
E188 = ("Could not match the gold entity links to entities in the doc - "
|
||||
"make sure the gold EL data refers to valid results of the "
|
||||
"named entity recognizer in the `nlp` pipeline.")
|
||||
E189 = ("Each argument to `get_doc` should be of equal length.")
|
||||
E190 = ("Token head out of range in `Doc.from_array()` for token index "
|
||||
"'{index}' with value '{value}' (equivalent to relative head "
|
||||
"index: '{rel_head_index}'). The head indices should be relative "
|
||||
"to the current token index rather than absolute indices in the "
|
||||
"array.")
|
||||
E191 = ("Invalid head: the head token must be from the same doc as the "
|
||||
"token itself.")
|
||||
|
||||
# TODO: fix numbering after merging develop into master
|
||||
E993 = ("The config for 'nlp' should include either a key 'name' to "
|
||||
"refer to an existing model by name or path, or a key 'lang' "
|
||||
|
|
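W028 and E190 both concern `Doc.from_array`: it expects a uint64 array whose HEAD column holds offsets relative to each token, which is exactly the format `Doc.to_array` produces. A minimal round-trip sketch under that assumption (blank pipeline, so there is no real parse and all heads are 0):

import spacy
from spacy.tokens import Doc
from spacy.attrs import HEAD, DEP

nlp = spacy.blank("en")
src = nlp("She ate the pizza")

# to_array() returns a uint64 array with HEAD stored as relative offsets,
# the format from_array() expects back (absolute indices trigger E190).
array = src.to_array([HEAD, DEP])

dst = Doc(nlp.vocab, words=[t.text for t in src])
dst.from_array([HEAD, DEP], array)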
|
@ -66,6 +66,7 @@ for orth in [
|
|||
"A/S",
|
||||
"B.C.",
|
||||
"BK.",
|
||||
"B.T.",
|
||||
"Dr.",
|
||||
"Boul.",
|
||||
"Chr.",
|
||||
|
@ -75,6 +76,7 @@ for orth in [
|
|||
"Hf.",
|
||||
"i/s",
|
||||
"I/S",
|
||||
"Inc.",
|
||||
"Kprs.",
|
||||
"L.A.",
|
||||
"Ll.",
|
||||
|
@ -145,6 +147,7 @@ for orth in [
|
|||
"bygn.",
|
||||
"c/o",
|
||||
"ca.",
|
||||
"cm.",
|
||||
"cand.",
|
||||
"d.d.",
|
||||
"d.m.",
|
||||
|
@ -168,10 +171,12 @@ for orth in [
|
|||
"dl.",
|
||||
"do.",
|
||||
"dobb.",
|
||||
"dr.",
|
||||
"dr.h.c",
|
||||
"dr.phil.",
|
||||
"ds.",
|
||||
"dvs.",
|
||||
"d.v.s.",
|
||||
"e.b.",
|
||||
"e.l.",
|
||||
"e.o.",
|
||||
|
@ -293,10 +298,14 @@ for orth in [
|
|||
"kap.",
|
||||
"kbh.",
|
||||
"kem.",
|
||||
"kg.",
|
||||
"kgs.",
|
||||
"kgl.",
|
||||
"kl.",
|
||||
"kld.",
|
||||
"km.",
|
||||
"km/t",
|
||||
"km/t.",
|
||||
"knsp.",
|
||||
"komm.",
|
||||
"kons.",
|
||||
|
@ -307,6 +316,7 @@ for orth in [
|
|||
"kt.",
|
||||
"ktr.",
|
||||
"kv.",
|
||||
"kvm.",
|
||||
"kvt.",
|
||||
"l.c.",
|
||||
"lab.",
|
||||
|
@ -353,6 +363,7 @@ for orth in [
|
|||
"nto.",
|
||||
"nuv.",
|
||||
"o/m",
|
||||
"o/m.",
|
||||
"o.a.",
|
||||
"o.fl.",
|
||||
"o.h.",
|
||||
|
@ -522,6 +533,7 @@ for orth in [
|
|||
"vejl.",
|
||||
"vh.",
|
||||
"vha.",
|
||||
"vind.",
|
||||
"vs.",
|
||||
"vsa.",
|
||||
"vær.",
|
||||
|
|
|
@ -1,5 +1,6 @@
|
|||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .norm_exceptions import NORM_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
|
@ -19,6 +20,8 @@ class GermanDefaults(Language.Defaults):
|
|||
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
|
|
|
@ -1,7 +1,29 @@
|
|||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
|
||||
from ..char_classes import CURRENCY, UNITS, PUNCT
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
|
||||
|
||||
|
||||
_prefixes = ["``"] + BASE_TOKENIZER_PREFIXES
|
||||
|
||||
_suffixes = (
|
||||
["''", "/"]
|
||||
+ LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.",
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||
r"(?<=[{al}{e}{p}(?:{q})])\.".format(
|
||||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
|
||||
),
|
||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||
]
|
||||
)
|
||||
|
||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||
|
||||
_infixes = (
|
||||
|
@ -12,6 +34,7 @@ _infixes = (
|
|||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[0-9{a}])\/(?=[0-9{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
|
||||
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[0-9])-(?=[0-9])",
|
||||
|
@ -19,4 +42,6 @@ _infixes = (
|
|||
)
|
||||
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
|
|
|
@ -157,6 +157,8 @@ for exc_data in [
|
|||
|
||||
|
||||
for orth in [
|
||||
"``",
|
||||
"''",
|
||||
"A.C.",
|
||||
"a.D.",
|
||||
"A.D.",
|
||||
|
@ -172,10 +174,13 @@ for orth in [
|
|||
"biol.",
|
||||
"Biol.",
|
||||
"ca.",
|
||||
"CDU/CSU",
|
||||
"Chr.",
|
||||
"Cie.",
|
||||
"c/o",
|
||||
"co.",
|
||||
"Co.",
|
||||
"d'",
|
||||
"D.C.",
|
||||
"Dipl.-Ing.",
|
||||
"Dipl.",
|
||||
|
@ -200,12 +205,18 @@ for orth in [
|
|||
"i.G.",
|
||||
"i.Tr.",
|
||||
"i.V.",
|
||||
"I.",
|
||||
"II.",
|
||||
"III.",
|
||||
"IV.",
|
||||
"Inc.",
|
||||
"Ing.",
|
||||
"jr.",
|
||||
"Jr.",
|
||||
"jun.",
|
||||
"jur.",
|
||||
"K.O.",
|
||||
"L'",
|
||||
"L.A.",
|
||||
"lat.",
|
||||
"M.A.",
|
||||
|
|
spacy/lang/eu/__init__.py (new file, 30 lines)
|
@ -0,0 +1,30 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG
|
||||
|
||||
|
||||
class BasqueDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda text: "eu"
|
||||
|
||||
tokenizer_exceptions = BASE_EXCEPTIONS
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
|
||||
|
||||
class Basque(Language):
|
||||
lang = "eu"
|
||||
Defaults = BasqueDefaults
|
||||
|
||||
|
||||
__all__ = ["Basque"]
|
spacy/lang/eu/examples.py (new file, 14 lines)
|
@ -0,0 +1,14 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
>>> from spacy.lang.eu.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
sentences = [
|
||||
"bilbon ko castinga egin da eta nik jakin ez zuetako inork egin al du edota parte hartu duen ezagunik ba al du",
|
||||
"gaur telebistan entzunda denok martetik gatoz hortaz martzianoak gara beno nire ustez batzuk beste batzuk baino martzianoagoak dira",
|
||||
]
|
spacy/lang/eu/lex_attrs.py (new file, 79 lines)
|
@ -0,0 +1,79 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
# Source http://mylanguages.org/basque_numbers.php
|
||||
|
||||
|
||||
_num_words = """
|
||||
bat
|
||||
bi
|
||||
hiru
|
||||
lau
|
||||
bost
|
||||
sei
|
||||
zazpi
|
||||
zortzi
|
||||
bederatzi
|
||||
hamar
|
||||
hamaika
|
||||
hamabi
|
||||
hamahiru
|
||||
hamalau
|
||||
hamabost
|
||||
hamasei
|
||||
hamazazpi
|
||||
Hemezortzi
|
||||
hemeretzi
|
||||
hogei
|
||||
ehun
|
||||
mila
|
||||
milioi
|
||||
""".split()
|
||||
|
||||
# source https://www.google.com/intl/ur/inputtools/try/
|
||||
|
||||
_ordinal_words = """
|
||||
lehen
|
||||
bigarren
|
||||
hirugarren
|
||||
laugarren
|
||||
bosgarren
|
||||
seigarren
|
||||
zazpigarren
|
||||
zortzigarren
|
||||
bederatzigarren
|
||||
hamargarren
|
||||
hamaikagarren
|
||||
hamabigarren
|
||||
hamahirugarren
|
||||
hamalaugarren
|
||||
hamabosgarren
|
||||
hamaseigarren
|
||||
hamazazpigarren
|
||||
hamazortzigarren
|
||||
hemeretzigarren
|
||||
hogeigarren
|
||||
behin
|
||||
""".split()
|
||||
|
||||
|
||||
def like_num(text):
|
||||
if text.startswith(("+", "-", "±", "~")):
|
||||
text = text[1:]
|
||||
text = text.replace(",", "").replace(".", "")
|
||||
if text.isdigit():
|
||||
return True
|
||||
if text.count("/") == 1:
|
||||
num, denom = text.split("/")
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
if text in _num_words:
|
||||
return True
|
||||
if text in _ordinal_words:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
LEX_ATTRS = {LIKE_NUM: like_num}
|
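A few spot checks of the Basque `like_num` hook above (assuming this branch is installed so it can be imported from `spacy.lang.eu.lex_attrs`); the expected values follow directly from the word lists:

from spacy.lang.eu.lex_attrs import like_num

print(like_num("hamahiru"))   # True: "thirteen" is in _num_words
print(like_num("bosgarren"))  # True: "fifth" is in _ordinal_words
print(like_num("3/4"))        # True: simple digit fractions count as number-like
print(like_num("etxe"))       # False: an ordinary noun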
spacy/lang/eu/punctuation.py (new file, 7 lines)
|
@ -0,0 +1,7 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..punctuation import TOKENIZER_SUFFIXES
|
||||
|
||||
|
||||
_suffixes = TOKENIZER_SUFFIXES
|
spacy/lang/eu/stop_words.py (new file, 108 lines)
|
@ -0,0 +1,108 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# Source: https://github.com/stopwords-iso/stopwords-eu
|
||||
# https://www.ranks.nl/stopwords/basque
|
||||
# https://www.mustgo.com/worldlanguages/basque/
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
al
|
||||
anitz
|
||||
arabera
|
||||
asko
|
||||
baina
|
||||
bat
|
||||
batean
|
||||
batek
|
||||
bati
|
||||
batzuei
|
||||
batzuek
|
||||
batzuetan
|
||||
batzuk
|
||||
bera
|
||||
beraiek
|
||||
berau
|
||||
berauek
|
||||
bere
|
||||
berori
|
||||
beroriek
|
||||
beste
|
||||
bezala
|
||||
da
|
||||
dago
|
||||
dira
|
||||
ditu
|
||||
du
|
||||
dute
|
||||
edo
|
||||
egin
|
||||
ere
|
||||
eta
|
||||
eurak
|
||||
ez
|
||||
gainera
|
||||
gu
|
||||
gutxi
|
||||
guzti
|
||||
haiei
|
||||
haiek
|
||||
haietan
|
||||
hainbeste
|
||||
hala
|
||||
han
|
||||
handik
|
||||
hango
|
||||
hara
|
||||
hari
|
||||
hark
|
||||
hartan
|
||||
hau
|
||||
hauei
|
||||
hauek
|
||||
hauetan
|
||||
hemen
|
||||
hemendik
|
||||
hemengo
|
||||
hi
|
||||
hona
|
||||
honek
|
||||
honela
|
||||
honetan
|
||||
honi
|
||||
hor
|
||||
hori
|
||||
horiei
|
||||
horiek
|
||||
horietan
|
||||
horko
|
||||
horra
|
||||
horrek
|
||||
horrela
|
||||
horretan
|
||||
horri
|
||||
hortik
|
||||
hura
|
||||
izan
|
||||
ni
|
||||
noiz
|
||||
nola
|
||||
non
|
||||
nondik
|
||||
nongo
|
||||
nor
|
||||
nora
|
||||
ze
|
||||
zein
|
||||
zen
|
||||
zenbait
|
||||
zenbat
|
||||
zer
|
||||
zergatik
|
||||
ziren
|
||||
zituen
|
||||
zu
|
||||
zuek
|
||||
zuen
|
||||
zuten
|
||||
""".split()
|
||||
)
|
spacy/lang/eu/tag_map.py (new file, 71 lines)
|
@ -0,0 +1,71 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
|
||||
|
||||
TAG_MAP = {
|
||||
".": {POS: PUNCT, "PunctType": "peri"},
|
||||
",": {POS: PUNCT, "PunctType": "comm"},
|
||||
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
|
||||
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
|
||||
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
|
||||
'""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||
":": {POS: PUNCT},
|
||||
"$": {POS: SYM, "Other": {"SymType": "currency"}},
|
||||
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
|
||||
"AFX": {POS: ADJ, "Hyph": "yes"},
|
||||
"CC": {POS: CCONJ, "ConjType": "coor"},
|
||||
"CD": {POS: NUM, "NumType": "card"},
|
||||
"DT": {POS: DET},
|
||||
"EX": {POS: ADV, "AdvType": "ex"},
|
||||
"FW": {POS: X, "Foreign": "yes"},
|
||||
"HYPH": {POS: PUNCT, "PunctType": "dash"},
|
||||
"IN": {POS: ADP},
|
||||
"JJ": {POS: ADJ, "Degree": "pos"},
|
||||
"JJR": {POS: ADJ, "Degree": "comp"},
|
||||
"JJS": {POS: ADJ, "Degree": "sup"},
|
||||
"LS": {POS: PUNCT, "NumType": "ord"},
|
||||
"MD": {POS: VERB, "VerbType": "mod"},
|
||||
"NIL": {POS: ""},
|
||||
"NN": {POS: NOUN, "Number": "sing"},
|
||||
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
|
||||
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
|
||||
"NNS": {POS: NOUN, "Number": "plur"},
|
||||
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
|
||||
"POS": {POS: PART, "Poss": "yes"},
|
||||
"PRP": {POS: PRON, "PronType": "prs"},
|
||||
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
|
||||
"RB": {POS: ADV, "Degree": "pos"},
|
||||
"RBR": {POS: ADV, "Degree": "comp"},
|
||||
"RBS": {POS: ADV, "Degree": "sup"},
|
||||
"RP": {POS: PART},
|
||||
"SP": {POS: SPACE},
|
||||
"SYM": {POS: SYM},
|
||||
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
|
||||
"UH": {POS: INTJ},
|
||||
"VB": {POS: VERB, "VerbForm": "inf"},
|
||||
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
|
||||
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
|
||||
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
|
||||
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
|
||||
"VBZ": {
|
||||
POS: VERB,
|
||||
"VerbForm": "fin",
|
||||
"Tense": "pres",
|
||||
"Number": "sing",
|
||||
"Person": 3,
|
||||
},
|
||||
"WDT": {POS: ADJ, "PronType": "int|rel"},
|
||||
"WP": {POS: NOUN, "PronType": "int|rel"},
|
||||
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
|
||||
"WRB": {POS: ADV, "PronType": "int|rel"},
|
||||
"ADD": {POS: X},
|
||||
"NFP": {POS: PUNCT},
|
||||
"GW": {POS: X},
|
||||
"XX": {POS: X},
|
||||
"BES": {POS: VERB},
|
||||
"HVS": {POS: VERB},
|
||||
"_SP": {POS: SPACE},
|
||||
}
|
|
@ -11,6 +11,7 @@ for exc_data in [
|
|||
{ORTH: "alv.", LEMMA: "arvonlisävero"},
|
||||
{ORTH: "ark.", LEMMA: "arkisin"},
|
||||
{ORTH: "as.", LEMMA: "asunto"},
|
||||
{ORTH: "eaa.", LEMMA: "ennen ajanlaskun alkua"},
|
||||
{ORTH: "ed.", LEMMA: "edellinen"},
|
||||
{ORTH: "esim.", LEMMA: "esimerkki"},
|
||||
{ORTH: "huom.", LEMMA: "huomautus"},
|
||||
|
@ -24,6 +25,7 @@ for exc_data in [
|
|||
{ORTH: "läh.", LEMMA: "lähettäjä"},
|
||||
{ORTH: "miel.", LEMMA: "mieluummin"},
|
||||
{ORTH: "milj.", LEMMA: "miljoona"},
|
||||
{ORTH: "Mm.", LEMMA: "muun muassa"},
|
||||
{ORTH: "mm.", LEMMA: "muun muassa"},
|
||||
{ORTH: "myöh.", LEMMA: "myöhempi"},
|
||||
{ORTH: "n.", LEMMA: "noin"},
|
||||
|
|
|
@ -1,5 +1,6 @@
|
|||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH
|
||||
from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
|
@ -24,6 +25,7 @@ class FrenchDefaults(Language.Defaults):
|
|||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
token_match = TOKEN_MATCH
|
||||
|
|
|
@ -1,12 +1,23 @@
|
|||
from ..punctuation import TOKENIZER_INFIXES
|
||||
from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
|
||||
from ..char_classes import CONCAT_QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||
from ..char_classes import merge_chars
|
||||
|
||||
|
||||
ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")
|
||||
HYPHENS = r"- – — ‐ ‑".strip().replace(" ", "").replace("\n", "")
|
||||
ELISION = "' ’".replace(" ", "")
|
||||
HYPHENS = r"- – — ‐ ‑".replace(" ", "")
|
||||
_prefixes_elision = "d l n"
|
||||
_prefixes_elision += " " + _prefixes_elision.upper()
|
||||
_hyphen_suffixes = "ce clés elle en il ils je là moi nous on t vous"
|
||||
_hyphen_suffixes += " " + _hyphen_suffixes.upper()
|
||||
|
||||
|
||||
_prefixes = TOKENIZER_PREFIXES + [
|
||||
r"(?:({pe})[{el}])(?=[{a}])".format(
|
||||
a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
|
||||
)
|
||||
]
|
||||
|
||||
_suffixes = (
|
||||
LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
|
@ -14,7 +25,6 @@ _suffixes = (
|
|||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.", # °C. -> ["°C", "."]
|
||||
r"(?<=[0-9])°[FfCcKk]", # 4°C -> ["4", "°C"]
|
||||
r"(?<=[0-9])%", # 4% -> ["4", "%"]
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||
|
@ -22,14 +32,17 @@ _suffixes = (
|
|||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES
|
||||
),
|
||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||
r"(?<=[{a}])[{h}]({hs})".format(
|
||||
a=ALPHA, h=HYPHENS, hs=merge_chars(_hyphen_suffixes)
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
_infixes = TOKENIZER_INFIXES + [
|
||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
|
||||
]
|
||||
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
|
|
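The net effect of the new French prefix and suffix rules is that elided articles (d'/l'/n' and their uppercase forms) split off the front of the following word and clitic pronouns split off after a hyphen. A quick check with a blank French pipeline built from this code; the outputs shown are the expected splits, not captured logs:

import spacy

nlp = spacy.blank("fr")
print([t.text for t in nlp("l'avion")])       # expected: ["l'", "avion"]
print([t.text for t in nlp("Donne-moi ça")])  # expected: ["Donne", "-moi", "ça"]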
|
@ -3,7 +3,7 @@ import re
|
|||
from .punctuation import ELISION, HYPHENS
|
||||
from ..tokenizer_exceptions import URL_PATTERN
|
||||
from ..char_classes import ALPHA_LOWER, ALPHA
|
||||
from ...symbols import ORTH, LEMMA, TAG
|
||||
from ...symbols import ORTH, LEMMA
|
||||
|
||||
# not using the large _tokenizer_exceptions_list by default as it slows down the tokenizer
|
||||
# from ._tokenizer_exceptions_list import FR_BASE_EXCEPTIONS
|
||||
|
@ -53,7 +53,28 @@ for exc_data in [
|
|||
_exc[exc_data[ORTH]] = [exc_data]
|
||||
|
||||
|
||||
for orth in ["etc."]:
|
||||
for orth in [
|
||||
"après-midi",
|
||||
"au-delà",
|
||||
"au-dessus",
|
||||
"celle-ci",
|
||||
"celles-ci",
|
||||
"celui-ci",
|
||||
"cf.",
|
||||
"ci-dessous",
|
||||
"elle-même",
|
||||
"en-dessous",
|
||||
"etc.",
|
||||
"jusque-là",
|
||||
"lui-même",
|
||||
"MM.",
|
||||
"No.",
|
||||
"peut-être",
|
||||
"pp.",
|
||||
"quelques-uns",
|
||||
"rendez-vous",
|
||||
"Vol.",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
|
||||
|
@ -69,7 +90,7 @@ for verb, verb_lemma in [
|
|||
for pronoun in ["elle", "il", "on"]:
|
||||
token = f"{orth}-t-{pronoun}"
|
||||
_exc[token] = [
|
||||
{LEMMA: verb_lemma, ORTH: orth, TAG: "VERB"},
|
||||
{LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"},
|
||||
{LEMMA: "t", ORTH: "-t"},
|
||||
{LEMMA: pronoun, ORTH: "-" + pronoun},
|
||||
]
|
||||
|
@ -78,7 +99,7 @@ for verb, verb_lemma in [("est", "être")]:
|
|||
for orth in [verb, verb.title()]:
|
||||
token = f"{orth}-ce"
|
||||
_exc[token] = [
|
||||
{LEMMA: verb_lemma, ORTH: orth, TAG: "VERB"},
|
||||
{LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"},
|
||||
{LEMMA: "ce", ORTH: "-ce"},
|
||||
]
|
||||
|
||||
|
@ -86,12 +107,29 @@ for verb, verb_lemma in [("est", "être")]:
|
|||
for pre, pre_lemma in [("qu'", "que"), ("n'", "ne")]:
|
||||
for orth in [pre, pre.title()]:
|
||||
_exc[f"{orth}est-ce"] = [
|
||||
{LEMMA: pre_lemma, ORTH: orth, TAG: "ADV"},
|
||||
{LEMMA: "être", ORTH: "est", TAG: "VERB"},
|
||||
{LEMMA: pre_lemma, ORTH: orth},
|
||||
{LEMMA: "être", ORTH: "est"},
|
||||
{LEMMA: "ce", ORTH: "-ce"},
|
||||
]
|
||||
|
||||
|
||||
for verb, pronoun in [("est", "il"), ("EST", "IL")]:
|
||||
token = "{}-{}".format(verb, pronoun)
|
||||
_exc[token] = [
|
||||
{LEMMA: "être", ORTH: verb},
|
||||
{LEMMA: pronoun, ORTH: "-" + pronoun},
|
||||
]
|
||||
|
||||
|
||||
for s, verb, pronoun in [("s", "est", "il"), ("S", "EST", "IL")]:
|
||||
token = "{}'{}-{}".format(s, verb, pronoun)
|
||||
_exc[token] = [
|
||||
{LEMMA: "se", ORTH: s + "'"},
|
||||
{LEMMA: "être", ORTH: verb},
|
||||
{LEMMA: pronoun, ORTH: "-" + pronoun},
|
||||
]
|
||||
|
||||
|
||||
_infixes_exc = []
|
||||
orig_elision = "'"
|
||||
orig_hyphen = "-"
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
from .stop_words import STOP_WORDS
|
||||
from .tag_map import TAG_MAP
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
|
@ -19,6 +19,7 @@ class ItalianDefaults(Language.Defaults):
|
|||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
tag_map = TAG_MAP
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
|
||||
|
||||
|
|
|
@ -1,12 +1,29 @@
|
|||
from ..punctuation import TOKENIZER_INFIXES
|
||||
from ..char_classes import ALPHA
|
||||
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||
from ..char_classes import ALPHA, HYPHENS, CONCAT_QUOTES
|
||||
from ..char_classes import ALPHA_LOWER, ALPHA_UPPER
|
||||
|
||||
|
||||
ELISION = " ' ’ ".strip().replace(" ", "")
|
||||
ELISION = "'’"
|
||||
|
||||
|
||||
_infixes = TOKENIZER_INFIXES + [
|
||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
|
||||
_prefixes = [r"'[0-9][0-9]", r"[0-9]+°"] + BASE_TOKENIZER_PREFIXES
|
||||
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
|
||||
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
|
||||
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||
),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])(?:{h})(?=[{al}])".format(a=ALPHA, h=HYPHENS, al=ALPHA_LOWER),
|
||||
r"(?<=[{a}0-9])[:<>=\/](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}][{el}])(?=[{a}0-9\"])".format(a=ALPHA, el=ELISION),
|
||||
]
|
||||
)
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
|
|
|
@ -1,5 +1,55 @@
|
|||
from ...symbols import ORTH, LEMMA
|
||||
|
||||
_exc = {"po'": [{ORTH: "po'", LEMMA: "poco"}]}
|
||||
_exc = {
|
||||
"all'art.": [{ORTH: "all'"}, {ORTH: "art."}],
|
||||
"dall'art.": [{ORTH: "dall'"}, {ORTH: "art."}],
|
||||
"dell'art.": [{ORTH: "dell'"}, {ORTH: "art."}],
|
||||
"L'art.": [{ORTH: "L'"}, {ORTH: "art."}],
|
||||
"l'art.": [{ORTH: "l'"}, {ORTH: "art."}],
|
||||
"nell'art.": [{ORTH: "nell'"}, {ORTH: "art."}],
|
||||
"po'": [{ORTH: "po'", LEMMA: "poco"}],
|
||||
"sett..": [{ORTH: "sett."}, {ORTH: "."}],
|
||||
}
|
||||
|
||||
for orth in [
|
||||
"..",
|
||||
"....",
|
||||
"al.",
|
||||
"all-path",
|
||||
"art.",
|
||||
"Art.",
|
||||
"artt.",
|
||||
"att.",
|
||||
"by-pass",
|
||||
"c.d.",
|
||||
"centro-sinistra",
|
||||
"check-up",
|
||||
"Civ.",
|
||||
"cm.",
|
||||
"Cod.",
|
||||
"col.",
|
||||
"Cost.",
|
||||
"d.C.",
|
||||
'de"',
|
||||
"distr.",
|
||||
"E'",
|
||||
"ecc.",
|
||||
"e-mail",
|
||||
"e/o",
|
||||
"etc.",
|
||||
"Jr.",
|
||||
"n°",
|
||||
"nord-est",
|
||||
"pag.",
|
||||
"Proc.",
|
||||
"prof.",
|
||||
"sett.",
|
||||
"s.p.a.",
|
||||
"ss.",
|
||||
"St.",
|
||||
"tel.",
|
||||
"week-end",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
||||
|
|
|
@ -2,11 +2,13 @@ from ..char_classes import LIST_ELLIPSES, LIST_ICONS, ALPHA, ALPHA_LOWER, ALPHA_
|
|||
|
||||
ELISION = " ' ’ ".strip().replace(" ", "")
|
||||
|
||||
abbrev = ("d", "D")
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
|
||||
r"(?<=^[{ab}][{el}])(?=[{a}])".format(ab=abbrev, a=ALPHA, el=ELISION),
|
||||
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
|
|
|
@ -7,6 +7,8 @@ _exc = {}
|
|||
|
||||
# translate / delete what is not necessary
|
||||
for exc_data in [
|
||||
{ORTH: "’t", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "’T", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "'t", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "'T", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "wgl.", LEMMA: "wannechgelift", NORM: "wannechgelift"},
|
||||
|
|
spacy/lang/lij/__init__.py (new file, 31 lines)
|
@ -0,0 +1,31 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG, NORM
|
||||
from ...util import update_exc, add_lookups
|
||||
|
||||
|
||||
class LigurianDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters[LANG] = lambda text: "lij"
|
||||
lex_attr_getters[NORM] = add_lookups(
|
||||
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
infixes = TOKENIZER_INFIXES
|
||||
|
||||
|
||||
class Ligurian(Language):
|
||||
lang = "lij"
|
||||
Defaults = LigurianDefaults
|
||||
|
||||
|
||||
__all__ = ["Ligurian"]
|
spacy/lang/lij/examples.py (new file, 18 lines)
|
@ -0,0 +1,18 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
>>> from spacy.lang.lij.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
|
||||
sentences = [
|
||||
"Sciusciâ e sciorbî no se peu.",
|
||||
"Graçie di çetroin, che me son arrivæ.",
|
||||
"Vegnime apreuvo, che ve fasso pescâ di òmmi.",
|
||||
"Bella pe sempre l'ægua inta conchetta quande unn'agoggia d'ægua a se â trapaña.",
|
||||
]
|
spacy/lang/lij/punctuation.py (new file, 15 lines)
|
@ -0,0 +1,15 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..punctuation import TOKENIZER_INFIXES
|
||||
from ..char_classes import ALPHA
|
||||
|
||||
|
||||
ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")
|
||||
|
||||
|
||||
_infixes = TOKENIZER_INFIXES + [
|
||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
|
||||
]
|
||||
|
||||
TOKENIZER_INFIXES = _infixes
|
spacy/lang/lij/stop_words.py (new file, 43 lines)
|
@ -0,0 +1,43 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
a à â a-a a-e a-i a-o aiva aloa an ancheu ancon apreuvo ascì atra atre atri atro avanti avei
|
||||
|
||||
bella belle belli bello ben
|
||||
|
||||
ch' che chì chi ciù co-a co-e co-i co-o comm' comme con cösa coscì cöse
|
||||
|
||||
d' da da-a da-e da-i da-o dapeu de delongo derê di do doe doî donde dòppo
|
||||
|
||||
é e ê ea ean emmo en ëse
|
||||
|
||||
fin fiña
|
||||
|
||||
gh' ghe guæei
|
||||
|
||||
i î in insemme int' inta inte inti into
|
||||
|
||||
l' lê lì lô
|
||||
|
||||
m' ma manco me megio meno mezo mi
|
||||
|
||||
na n' ne ni ninte nisciun nisciuña no
|
||||
|
||||
o ò ô oua
|
||||
|
||||
parte pe pe-a pe-i pe-e pe-o perché pittin pö primma pròpio
|
||||
|
||||
quæ quand' quande quarche quella quelle quelli quello
|
||||
|
||||
s' sce scê sci sciâ sciô sciù se segge seu sò solo son sott' sta stæta stæte stæti stæto ste sti sto
|
||||
|
||||
tanta tante tanti tanto te ti torna tra tròppo tutta tutte tutti tutto
|
||||
|
||||
un uña unn' unna
|
||||
|
||||
za zu
|
||||
""".split()
|
||||
)
|
spacy/lang/lij/tokenizer_exceptions.py (new file, 52 lines)
|
@ -0,0 +1,52 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
from ...symbols import ORTH, LEMMA
|
||||
|
||||
_exc = {}
|
||||
|
||||
for raw, lemma in [
|
||||
("a-a", "a-o"),
|
||||
("a-e", "a-o"),
|
||||
("a-o", "a-o"),
|
||||
("a-i", "a-o"),
|
||||
("co-a", "co-o"),
|
||||
("co-e", "co-o"),
|
||||
("co-i", "co-o"),
|
||||
("co-o", "co-o"),
|
||||
("da-a", "da-o"),
|
||||
("da-e", "da-o"),
|
||||
("da-i", "da-o"),
|
||||
("da-o", "da-o"),
|
||||
("pe-a", "pe-o"),
|
||||
("pe-e", "pe-o"),
|
||||
("pe-i", "pe-o"),
|
||||
("pe-o", "pe-o"),
|
||||
]:
|
||||
for orth in [raw, raw.capitalize()]:
|
||||
_exc[orth] = [{ORTH: orth, LEMMA: lemma}]
|
||||
|
||||
# Prefix + prepositions with à (e.g. "sott'a-o")
|
||||
|
||||
for prep, prep_lemma in [
|
||||
("a-a", "a-o"),
|
||||
("a-e", "a-o"),
|
||||
("a-o", "a-o"),
|
||||
("a-i", "a-o"),
|
||||
]:
|
||||
for prefix, prefix_lemma in [
|
||||
("sott'", "sotta"),
|
||||
("sott’", "sotta"),
|
||||
("contr'", "contra"),
|
||||
("contr’", "contra"),
|
||||
("ch'", "che"),
|
||||
("ch’", "che"),
|
||||
("s'", "se"),
|
||||
("s’", "se"),
|
||||
]:
|
||||
for prefix_orth in [prefix, prefix.capitalize()]:
|
||||
_exc[prefix_orth + prep] = [
|
||||
{ORTH: prefix_orth, LEMMA: prefix_lemma},
|
||||
{ORTH: prep, LEMMA: prep_lemma},
|
||||
]
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
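The two nested loops above expand every prefix/preposition pairing into a two-token exception. One generated entry written out for clarity; the values follow from the tuples above, and the check assumes this branch is installed:

from spacy.lang.lij.tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from spacy.symbols import ORTH, LEMMA

# prefix "sott'" (lemma "sotta") combined with the preposition "a-o"
assert TOKENIZER_EXCEPTIONS["sott'a-o"] == [
    {ORTH: "sott'", LEMMA: "sotta"},
    {ORTH: "a-o", LEMMA: "a-o"},
]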
|
@ -1,3 +1,4 @@
|
|||
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
|
@ -23,7 +24,13 @@ class LithuanianDefaults(Language.Defaults):
|
|||
)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
infixes = TOKENIZER_INFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
mod_base_exceptions = {
|
||||
exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".")
|
||||
}
|
||||
del mod_base_exceptions["8)"]
|
||||
tokenizer_exceptions = update_exc(mod_base_exceptions, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
tag_map = TAG_MAP
|
||||
morph_rules = MORPH_RULES
|
||||
|
|
spacy/lang/lt/punctuation.py (new file, 29 lines)
|
@ -0,0 +1,29 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..char_classes import LIST_ICONS, LIST_ELLIPSES
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
|
||||
from ..char_classes import HYPHENS
|
||||
from ..punctuation import TOKENIZER_SUFFIXES
|
||||
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[0-9])[+\*^](?=[0-9-])",
|
||||
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
|
||||
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||
),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
|
||||
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
_suffixes = [r"\."] + list(TOKENIZER_SUFFIXES)
|
||||
|
||||
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
|
@ -3,262 +3,264 @@ from ...symbols import ORTH
|
|||
_exc = {}
|
||||
|
||||
for orth in [
|
||||
"G.",
|
||||
"J. E.",
|
||||
"J. Em.",
|
||||
"J.E.",
|
||||
"J.Em.",
|
||||
"K.",
|
||||
"N.",
|
||||
"V.",
|
||||
"Vt.",
|
||||
"a.",
|
||||
"a.k.",
|
||||
"a.s.",
|
||||
"adv.",
|
||||
"akad.",
|
||||
"aklg.",
|
||||
"akt.",
|
||||
"al.",
|
||||
"ang.",
|
||||
"angl.",
|
||||
"aps.",
|
||||
"apskr.",
|
||||
"apyg.",
|
||||
"arbat.",
|
||||
"asist.",
|
||||
"asm.",
|
||||
"asm.k.",
|
||||
"asmv.",
|
||||
"atk.",
|
||||
"atsak.",
|
||||
"atsisk.",
|
||||
"atsisk.sąsk.",
|
||||
"atv.",
|
||||
"aut.",
|
||||
"avd.",
|
||||
"b.k.",
|
||||
"baud.",
|
||||
"biol.",
|
||||
"bkl.",
|
||||
"bot.",
|
||||
"bt.",
|
||||
"buv.",
|
||||
"ch.",
|
||||
"chem.",
|
||||
"corp.",
|
||||
"d.",
|
||||
"dab.",
|
||||
"dail.",
|
||||
"dek.",
|
||||
"deš.",
|
||||
"dir.",
|
||||
"dirig.",
|
||||
"doc.",
|
||||
"dol.",
|
||||
"dr.",
|
||||
"drp.",
|
||||
"dvit.",
|
||||
"dėst.",
|
||||
"dš.",
|
||||
"dž.",
|
||||
"e.b.",
|
||||
"e.bankas",
|
||||
"e.p.",
|
||||
"e.parašas",
|
||||
"e.paštas",
|
||||
"e.v.",
|
||||
"e.valdžia",
|
||||
"egz.",
|
||||
"eil.",
|
||||
"ekon.",
|
||||
"el.",
|
||||
"el.bankas",
|
||||
"el.p.",
|
||||
"el.parašas",
|
||||
"el.paštas",
|
||||
"el.valdžia",
|
||||
"etc.",
|
||||
"ež.",
|
||||
"fak.",
|
||||
"faks.",
|
||||
"feat.",
|
||||
"filol.",
|
||||
"filos.",
|
||||
"g.",
|
||||
"gen.",
|
||||
"geol.",
|
||||
"gerb.",
|
||||
"gim.",
|
||||
"gr.",
|
||||
"gv.",
|
||||
"gyd.",
|
||||
"gyv.",
|
||||
"habil.",
|
||||
"inc.",
|
||||
"insp.",
|
||||
"inž.",
|
||||
"ir pan.",
|
||||
"ir t. t.",
|
||||
"isp.",
|
||||
"istor.",
|
||||
"it.",
|
||||
"just.",
|
||||
"k.",
|
||||
"k. a.",
|
||||
"k.a.",
|
||||
"kab.",
|
||||
"kand.",
|
||||
"kart.",
|
||||
"kat.",
|
||||
"ketv.",
|
||||
"kh.",
|
||||
"kl.",
|
||||
"kln.",
|
||||
"km.",
|
||||
"kn.",
|
||||
"koresp.",
|
||||
"kpt.",
|
||||
"kr.",
|
||||
"kt.",
|
||||
"kub.",
|
||||
"kun.",
|
||||
"kv.",
|
||||
"kyš.",
|
||||
"l. e. p.",
|
||||
"l.e.p.",
|
||||
"lenk.",
|
||||
"liet.",
|
||||
"lot.",
|
||||
"lt.",
|
||||
"ltd.",
|
||||
"ltn.",
|
||||
"m.",
|
||||
"m.e..",
|
||||
"m.m.",
|
||||
"mat.",
|
||||
"med.",
|
||||
"mgnt.",
|
||||
"mgr.",
|
||||
"min.",
|
||||
"mjr.",
|
||||
"ml.",
|
||||
"mln.",
|
||||
"mlrd.",
|
||||
"mob.",
|
||||
"mok.",
|
||||
"moksl.",
|
||||
"mokyt.",
|
||||
"mot.",
|
||||
"mr.",
|
||||
"mst.",
|
||||
"mstl.",
|
||||
"mėn.",
|
||||
"nkt.",
|
||||
"no.",
|
||||
"nr.",
|
||||
"ntk.",
|
||||
"nuotr.",
|
||||
"op.",
|
||||
"org.",
|
||||
"orig.",
|
||||
"p.",
|
||||
"p.d.",
|
||||
"p.m.e.",
|
||||
"p.s.",
|
||||
"pab.",
|
||||
"pan.",
|
||||
"past.",
|
||||
"pav.",
|
||||
"pavad.",
|
||||
"per.",
|
||||
"perd.",
|
||||
"pirm.",
|
||||
"pl.",
|
||||
"plg.",
|
||||
"plk.",
|
||||
"pr.",
|
||||
"pr.Kr.",
|
||||
"pranc.",
|
||||
"proc.",
|
||||
"prof.",
|
||||
"prom.",
|
||||
"prot.",
|
||||
"psl.",
|
||||
"pss.",
|
||||
"pvz.",
|
||||
"pšt.",
|
||||
"r.",
|
||||
"raj.",
|
||||
"red.",
|
||||
"rez.",
|
||||
"rež.",
|
||||
"rus.",
|
||||
"rš.",
|
||||
"s.",
|
||||
"sav.",
|
||||
"saviv.",
|
||||
"sek.",
|
||||
"sekr.",
|
||||
"sen.",
|
||||
"sh.",
|
||||
"sk.",
|
||||
"skg.",
|
||||
"skv.",
|
||||
"skyr.",
|
||||
"sp.",
|
||||
"spec.",
|
||||
"sr.",
|
||||
"st.",
|
||||
"str.",
|
||||
"stud.",
|
||||
"sąs.",
|
||||
"t.",
|
||||
"t. p.",
|
||||
"t. y.",
|
||||
"t.p.",
|
||||
"t.t.",
|
||||
"t.y.",
|
||||
"techn.",
|
||||
"tel.",
|
||||
"teol.",
|
||||
"th.",
|
||||
"tir.",
|
||||
"trit.",
|
||||
"trln.",
|
||||
"tšk.",
|
||||
"tūks.",
|
||||
"tūkst.",
|
||||
"up.",
|
||||
"upl.",
|
||||
"v.s.",
|
||||
"vad.",
|
||||
"val.",
|
||||
"valg.",
|
||||
"ved.",
|
||||
"vert.",
|
||||
"vet.",
|
||||
"vid.",
|
||||
"virš.",
|
||||
"vlsč.",
|
||||
"vnt.",
|
||||
"vok.",
|
||||
"vs.",
|
||||
"vtv.",
|
||||
"vv.",
|
||||
"vyr.",
|
||||
"vyresn.",
|
||||
"zool.",
|
||||
"Įn",
|
||||
"įl.",
|
||||
"š.m.",
|
||||
"šnek.",
|
||||
"šv.",
|
||||
"švč.",
|
||||
"ž.ū.",
|
||||
"žin.",
|
||||
"žml.",
|
||||
"žr.",
|
||||
"n-tosios",
|
||||
"?!",
|
||||
# "G.",
|
||||
# "J. E.",
|
||||
# "J. Em.",
|
||||
# "J.E.",
|
||||
# "J.Em.",
|
||||
# "K.",
|
||||
# "N.",
|
||||
# "V.",
|
||||
# "Vt.",
|
||||
# "a.",
|
||||
# "a.k.",
|
||||
# "a.s.",
|
||||
# "adv.",
|
||||
# "akad.",
|
||||
# "aklg.",
|
||||
# "akt.",
|
||||
# "al.",
|
||||
# "ang.",
|
||||
# "angl.",
|
||||
# "aps.",
|
||||
# "apskr.",
|
||||
# "apyg.",
|
||||
# "arbat.",
|
||||
# "asist.",
|
||||
# "asm.",
|
||||
# "asm.k.",
|
||||
# "asmv.",
|
||||
# "atk.",
|
||||
# "atsak.",
|
||||
# "atsisk.",
|
||||
# "atsisk.sąsk.",
|
||||
# "atv.",
|
||||
# "aut.",
|
||||
# "avd.",
|
||||
# "b.k.",
|
||||
# "baud.",
|
||||
# "biol.",
|
||||
# "bkl.",
|
||||
# "bot.",
|
||||
# "bt.",
|
||||
# "buv.",
|
||||
# "ch.",
|
||||
# "chem.",
|
||||
# "corp.",
|
||||
# "d.",
|
||||
# "dab.",
|
||||
# "dail.",
|
||||
# "dek.",
|
||||
# "deš.",
|
||||
# "dir.",
|
||||
# "dirig.",
|
||||
# "doc.",
|
||||
# "dol.",
|
||||
# "dr.",
|
||||
# "drp.",
|
||||
# "dvit.",
|
||||
# "dėst.",
|
||||
# "dš.",
|
||||
# "dž.",
|
||||
# "e.b.",
|
||||
# "e.bankas",
|
||||
# "e.p.",
|
||||
# "e.parašas",
|
||||
# "e.paštas",
|
||||
# "e.v.",
|
||||
# "e.valdžia",
|
||||
# "egz.",
|
||||
# "eil.",
|
||||
# "ekon.",
|
||||
# "el.",
|
||||
# "el.bankas",
|
||||
# "el.p.",
|
||||
# "el.parašas",
|
||||
# "el.paštas",
|
||||
# "el.valdžia",
|
||||
# "etc.",
|
||||
# "ež.",
|
||||
# "fak.",
|
||||
# "faks.",
|
||||
# "feat.",
|
||||
# "filol.",
|
||||
# "filos.",
|
||||
# "g.",
|
||||
# "gen.",
|
||||
# "geol.",
|
||||
# "gerb.",
|
||||
# "gim.",
|
||||
# "gr.",
|
||||
# "gv.",
|
||||
# "gyd.",
|
||||
# "gyv.",
|
||||
# "habil.",
|
||||
# "inc.",
|
||||
# "insp.",
|
||||
# "inž.",
|
||||
# "ir pan.",
|
||||
# "ir t. t.",
|
||||
# "isp.",
|
||||
# "istor.",
|
||||
# "it.",
|
||||
# "just.",
|
||||
# "k.",
|
||||
# "k. a.",
|
||||
# "k.a.",
|
||||
# "kab.",
|
||||
# "kand.",
|
||||
# "kart.",
|
||||
# "kat.",
|
||||
# "ketv.",
|
||||
# "kh.",
|
||||
# "kl.",
|
||||
# "kln.",
|
||||
# "km.",
|
||||
# "kn.",
|
||||
# "koresp.",
|
||||
# "kpt.",
|
||||
# "kr.",
|
||||
# "kt.",
|
||||
# "kub.",
|
||||
# "kun.",
|
||||
# "kv.",
|
||||
# "kyš.",
|
||||
# "l. e. p.",
|
||||
# "l.e.p.",
|
||||
# "lenk.",
|
||||
# "liet.",
|
||||
# "lot.",
|
||||
# "lt.",
|
||||
# "ltd.",
|
||||
# "ltn.",
|
||||
# "m.",
|
||||
# "m.e..",
|
||||
# "m.m.",
|
||||
# "mat.",
|
||||
# "med.",
|
||||
# "mgnt.",
|
||||
# "mgr.",
|
||||
# "min.",
|
||||
# "mjr.",
|
||||
# "ml.",
|
||||
# "mln.",
|
||||
# "mlrd.",
|
||||
# "mob.",
|
||||
# "mok.",
|
||||
# "moksl.",
|
||||
# "mokyt.",
|
||||
# "mot.",
|
||||
# "mr.",
|
||||
# "mst.",
|
||||
# "mstl.",
|
||||
# "mėn.",
|
||||
# "nkt.",
|
||||
# "no.",
|
||||
# "nr.",
|
||||
# "ntk.",
|
||||
# "nuotr.",
|
||||
# "op.",
|
||||
# "org.",
|
||||
# "orig.",
|
||||
# "p.",
|
||||
# "p.d.",
|
||||
# "p.m.e.",
|
||||
# "p.s.",
|
||||
# "pab.",
|
||||
# "pan.",
|
||||
# "past.",
|
||||
# "pav.",
|
||||
# "pavad.",
|
||||
# "per.",
|
||||
# "perd.",
|
||||
# "pirm.",
|
||||
# "pl.",
|
||||
# "plg.",
|
||||
# "plk.",
|
||||
# "pr.",
|
||||
# "pr.Kr.",
|
||||
# "pranc.",
|
||||
# "proc.",
|
||||
# "prof.",
|
||||
# "prom.",
|
||||
# "prot.",
|
||||
# "psl.",
|
||||
# "pss.",
|
||||
# "pvz.",
|
||||
# "pšt.",
|
||||
# "r.",
|
||||
# "raj.",
|
||||
# "red.",
|
||||
# "rez.",
|
||||
# "rež.",
|
||||
# "rus.",
|
||||
# "rš.",
|
||||
# "s.",
|
||||
# "sav.",
|
||||
# "saviv.",
|
||||
# "sek.",
|
||||
# "sekr.",
|
||||
# "sen.",
|
||||
# "sh.",
|
||||
# "sk.",
|
||||
# "skg.",
|
||||
# "skv.",
|
||||
# "skyr.",
|
||||
# "sp.",
|
||||
# "spec.",
|
||||
# "sr.",
|
||||
# "st.",
|
||||
# "str.",
|
||||
# "stud.",
|
||||
# "sąs.",
|
||||
# "t.",
|
||||
# "t. p.",
|
||||
# "t. y.",
|
||||
# "t.p.",
|
||||
# "t.t.",
|
||||
# "t.y.",
|
||||
# "techn.",
|
||||
# "tel.",
|
||||
# "teol.",
|
||||
# "th.",
|
||||
# "tir.",
|
||||
# "trit.",
|
||||
# "trln.",
|
||||
# "tšk.",
|
||||
# "tūks.",
|
||||
# "tūkst.",
|
||||
# "up.",
|
||||
# "upl.",
|
||||
# "v.s.",
|
||||
# "vad.",
|
||||
# "val.",
|
||||
# "valg.",
|
||||
# "ved.",
|
||||
# "vert.",
|
||||
# "vet.",
|
||||
# "vid.",
|
||||
# "virš.",
|
||||
# "vlsč.",
|
||||
# "vnt.",
|
||||
# "vok.",
|
||||
# "vs.",
|
||||
# "vtv.",
|
||||
# "vv.",
|
||||
# "vyr.",
|
||||
# "vyresn.",
|
||||
# "zool.",
|
||||
# "Įn",
|
||||
# "įl.",
|
||||
# "š.m.",
|
||||
# "šnek.",
|
||||
# "šv.",
|
||||
# "švč.",
|
||||
# "ž.ū.",
|
||||
# "žin.",
|
||||
# "žml.",
|
||||
# "žr.",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
|
|
|
@ -1,4 +1,6 @@
|
|||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
from .stop_words import STOP_WORDS
|
||||
from .morph_rules import MORPH_RULES
|
||||
from .syntax_iterators import SYNTAX_ITERATORS
|
||||
|
@ -18,6 +20,9 @@ class NorwegianDefaults(Language.Defaults):
|
|||
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
stop_words = STOP_WORDS
|
||||
morph_rules = MORPH_RULES
|
||||
tag_map = TAG_MAP
|
||||
|
|
|
@ -1,13 +1,29 @@
|
|||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||
from ..punctuation import TOKENIZER_SUFFIXES
|
||||
from ..char_classes import CURRENCY, PUNCT, UNITS, LIST_CURRENCY
|
||||
|
||||
# Punctuation stolen from Danish
|
||||
|
||||
# Punctuation adapted from Danish
|
||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||
_list_punct = [x for x in LIST_PUNCT if x != "#"]
|
||||
_list_icons = [x for x in LIST_ICONS if x != "°"]
|
||||
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
|
||||
_list_quotes = [x for x in LIST_QUOTES if x != "\\'"]
|
||||
|
||||
|
||||
_prefixes = (
|
||||
["§", "%", "=", "—", "–", r"\+(?![0-9])"]
|
||||
+ _list_punct
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ LIST_CURRENCY
|
||||
+ LIST_ICONS
|
||||
)
|
||||
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
+ _list_icons
|
||||
+ [
|
||||
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
|
@ -18,13 +34,26 @@ _infixes = (
|
|||
]
|
||||
)
|
||||
|
||||
_suffixes = [
|
||||
suffix
|
||||
for suffix in TOKENIZER_SUFFIXES
|
||||
if suffix not in ["'s", "'S", "’s", "’S", r"\'"]
|
||||
_suffixes = (
|
||||
LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ _list_quotes
|
||||
+ _list_icons
|
||||
+ ["—", "–"]
|
||||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.",
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||
r"(?<=[{al}{e}{p}(?:{q})])\.".format(
|
||||
al=ALPHA_LOWER, e=r"%²\-\+", q=_quotes, p=PUNCT
|
||||
),
|
||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||
]
|
||||
_suffixes += [r"(?<=[^sSxXzZ])\'"]
|
||||
+ [r"(?<=[^sSxXzZ])'"]
|
||||
)
|
||||
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
|
|
|
@ -21,57 +21,80 @@ for exc_data in [
|
|||
|
||||
|
||||
for orth in [
|
||||
"adm.dir.",
|
||||
"a.m.",
|
||||
"andelsnr",
|
||||
"Ap.",
|
||||
"Aq.",
|
||||
"Ca.",
|
||||
"Chr.",
|
||||
"Co.",
|
||||
"Co.",
|
||||
"Dr.",
|
||||
"F.eks.",
|
||||
"Fr.p.",
|
||||
"Frp.",
|
||||
"Grl.",
|
||||
"Kr.",
|
||||
"Kr.F.",
|
||||
"Kr.F.s",
|
||||
"Mr.",
|
||||
"Mrs.",
|
||||
"Pb.",
|
||||
"Pr.",
|
||||
"Sp.",
|
||||
"Sp.",
|
||||
"St.",
|
||||
"a.m.",
|
||||
"ad.",
|
||||
"adm.dir.",
|
||||
"andelsnr",
|
||||
"b.c.",
|
||||
"bl.a.",
|
||||
"bla.",
|
||||
"bm.",
|
||||
"bnr.",
|
||||
"bto.",
|
||||
"c.c.",
|
||||
"ca.",
|
||||
"cand.mag.",
|
||||
"c.c.",
|
||||
"co.",
|
||||
"d.d.",
|
||||
"dept.",
|
||||
"d.m.",
|
||||
"dr.philos.",
|
||||
"dvs.",
|
||||
"d.y.",
|
||||
"E. coli",
|
||||
"dept.",
|
||||
"dr.",
|
||||
"dr.med.",
|
||||
"dr.philos.",
|
||||
"dr.psychol.",
|
||||
"dvs.",
|
||||
"e.Kr.",
|
||||
"e.l.",
|
||||
"eg.",
|
||||
"ekskl.",
|
||||
"e.Kr.",
|
||||
"el.",
|
||||
"e.l.",
|
||||
"et.",
|
||||
"etc.",
|
||||
"etg.",
|
||||
"ev.",
|
||||
"evt.",
|
||||
"f.",
|
||||
"f.Kr.",
|
||||
"f.eks.",
|
||||
"f.o.m.",
|
||||
"fhv.",
|
||||
"fk.",
|
||||
"f.Kr.",
|
||||
"f.o.m.",
|
||||
"foreg.",
|
||||
"fork.",
|
||||
"fv.",
|
||||
"fvt.",
|
||||
"g.",
|
||||
"gt.",
|
||||
"gl.",
|
||||
"gno.",
|
||||
"gnr.",
|
||||
"grl.",
|
||||
"gt.",
|
||||
"h.r.adv.",
|
||||
"hhv.",
|
||||
"hoh.",
|
||||
"hr.",
|
||||
"h.r.adv.",
|
||||
"ifb.",
|
||||
"ifm.",
|
||||
"iht.",
|
||||
|
@ -80,39 +103,45 @@ for orth in [
|
|||
"jf.",
|
||||
"jr.",
|
||||
"jun.",
|
||||
"juris.",
|
||||
"kfr.",
|
||||
"kgl.",
|
||||
"kgl.res.",
|
||||
"kl.",
|
||||
"komm.",
|
||||
"kr.",
|
||||
"kst.",
|
||||
"lat.",
|
||||
"lø.",
|
||||
"m.a.o.",
|
||||
"m.fl.",
|
||||
"m.m.",
|
||||
"m.v.",
|
||||
"ma.",
|
||||
"mag.art.",
|
||||
"m.a.o.",
|
||||
"md.",
|
||||
"mfl.",
|
||||
"mht.",
|
||||
"mill.",
|
||||
"min.",
|
||||
"m.m.",
|
||||
"mnd.",
|
||||
"moh.",
|
||||
"Mr.",
|
||||
"mrd.",
|
||||
"muh.",
|
||||
"mv.",
|
||||
"mva.",
|
||||
"n.å.",
|
||||
"ndf.",
|
||||
"no.",
|
||||
"nov.",
|
||||
"nr.",
|
||||
"nto.",
|
||||
"nyno.",
|
||||
"n.å.",
|
||||
"o.a.",
|
||||
"o.l.",
|
||||
"off.",
|
||||
"ofl.",
|
||||
"okt.",
|
||||
"o.l.",
|
||||
"on.",
|
||||
"op.",
|
||||
"org.",
|
||||
|
@ -120,14 +149,15 @@ for orth in [
|
|||
"ovf.",
|
||||
"p.",
|
||||
"p.a.",
|
||||
"Pb.",
|
||||
"p.g.a.",
|
||||
"p.m.",
|
||||
"p.t.",
|
||||
"pga.",
|
||||
"ph.d.",
|
||||
"pkt.",
|
||||
"p.m.",
|
||||
"pr.",
|
||||
"pst.",
|
||||
"p.t.",
|
||||
"pt.",
|
||||
"red.anm.",
|
||||
"ref.",
|
||||
"res.",
|
||||
|
@ -136,6 +166,10 @@ for orth in [
|
|||
"rv.",
|
||||
"s.",
|
||||
"s.d.",
|
||||
"s.k.",
|
||||
"s.k.",
|
||||
"s.u.",
|
||||
"s.å.",
|
||||
"sen.",
|
||||
"sep.",
|
||||
"siviling.",
|
||||
|
@ -145,16 +179,17 @@ for orth in [
|
|||
"sr.",
|
||||
"sst.",
|
||||
"st.",
|
||||
"stip.",
|
||||
"stk.",
|
||||
"st.meld.",
|
||||
"st.prp.",
|
||||
"stip.",
|
||||
"stk.",
|
||||
"stud.",
|
||||
"s.u.",
|
||||
"sv.",
|
||||
"sø.",
|
||||
"s.å.",
|
||||
"såk.",
|
||||
"sø.",
|
||||
"t.h.",
|
||||
"t.o.m.",
|
||||
"t.v.",
|
||||
"temp.",
|
||||
"ti.",
|
||||
"tils.",
|
||||
|
@ -162,7 +197,6 @@ for orth in [
|
|||
"tl;dr",
|
||||
"tlf.",
|
||||
"to.",
|
||||
"t.o.m.",
|
||||
"ult.",
|
||||
"utg.",
|
||||
"v.",
|
||||
|
@ -176,8 +210,10 @@ for orth in [
|
|||
"vol.",
|
||||
"vs.",
|
||||
"vsa.",
|
||||
"©NTB",
|
||||
"årg.",
|
||||
"årh.",
|
||||
"§§",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
|
|
|
@ -1,69 +1,47 @@
|
|||
from ...symbols import ORTH, NORM
|
||||
from ...symbols import ORTH
|
||||
|
||||
|
||||
_exc = {
|
||||
"às": [{ORTH: "à", NORM: "a"}, {ORTH: "s", NORM: "as"}],
|
||||
"ao": [{ORTH: "a"}, {ORTH: "o"}],
|
||||
"aos": [{ORTH: "a"}, {ORTH: "os"}],
|
||||
"àquele": [{ORTH: "à", NORM: "a"}, {ORTH: "quele", NORM: "aquele"}],
|
||||
"àquela": [{ORTH: "à", NORM: "a"}, {ORTH: "quela", NORM: "aquela"}],
|
||||
"àqueles": [{ORTH: "à", NORM: "a"}, {ORTH: "queles", NORM: "aqueles"}],
|
||||
"àquelas": [{ORTH: "à", NORM: "a"}, {ORTH: "quelas", NORM: "aquelas"}],
|
||||
"àquilo": [{ORTH: "à", NORM: "a"}, {ORTH: "quilo", NORM: "aquilo"}],
|
||||
"aonde": [{ORTH: "a"}, {ORTH: "onde"}],
|
||||
}
|
||||
|
||||
|
||||
# Contractions
|
||||
_per_pron = ["ele", "ela", "eles", "elas"]
|
||||
_dem_pron = [
|
||||
"este",
|
||||
"esta",
|
||||
"estes",
|
||||
"estas",
|
||||
"isto",
|
||||
"esse",
|
||||
"essa",
|
||||
"esses",
|
||||
"essas",
|
||||
"isso",
|
||||
"aquele",
|
||||
"aquela",
|
||||
"aqueles",
|
||||
"aquelas",
|
||||
"aquilo",
|
||||
]
|
||||
_und_pron = ["outro", "outra", "outros", "outras"]
|
||||
_adv = ["aqui", "aí", "ali", "além"]
|
||||
|
||||
|
||||
for orth in _per_pron + _dem_pron + _und_pron + _adv:
|
||||
_exc["d" + orth] = [{ORTH: "d", NORM: "de"}, {ORTH: orth}]
|
||||
|
||||
for orth in _per_pron + _dem_pron + _und_pron:
|
||||
_exc["n" + orth] = [{ORTH: "n", NORM: "em"}, {ORTH: orth}]
|
||||
_exc = {}
|
||||
|
||||
|
||||
for orth in [
|
||||
"Adm.",
|
||||
"Art.",
|
||||
"art.",
|
||||
"Av.",
|
||||
"av.",
|
||||
"Cia.",
|
||||
"dom.",
|
||||
"Dr.",
|
||||
"dr.",
|
||||
"e.g.",
|
||||
"E.g.",
|
||||
"E.G.",
|
||||
"e/ou",
|
||||
"ed.",
|
||||
"eng.",
|
||||
"etc.",
|
||||
"Fund.",
|
||||
"Gen.",
|
||||
"Gov.",
|
||||
"i.e.",
|
||||
"I.e.",
|
||||
"I.E.",
|
||||
"Inc.",
|
||||
"Jr.",
|
||||
"km/h",
|
||||
"Ltd.",
|
||||
"Mr.",
|
||||
"p.m.",
|
||||
"Ph.D.",
|
||||
"Rep.",
|
||||
"Rev.",
|
||||
"S/A",
|
||||
"Sen.",
|
||||
"Sr.",
|
||||
"sr.",
|
||||
"Sra.",
|
||||
"sra.",
|
||||
"vs.",
|
||||
"tel.",
|
||||
"pág.",
|
||||
|
|
|
@ -1,5 +1,7 @@
|
|||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .stop_words import STOP_WORDS
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
|
@ -21,6 +23,9 @@ class RomanianDefaults(Language.Defaults):
|
|||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
tag_map = TAG_MAP
|
||||
|
||||
|
||||
|
|
spacy/lang/ro/punctuation.py (new file, 164 lines)
|
@ -0,0 +1,164 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import itertools
|
||||
|
||||
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
|
||||
from ..char_classes import LIST_ICONS, CURRENCY
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
|
||||
|
||||
|
||||
_list_icons = [x for x in LIST_ICONS if x != "°"]
|
||||
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
|
||||
|
||||
|
||||
_ro_variants = {
|
||||
"Ă": ["Ă", "A"],
|
||||
"Â": ["Â", "A"],
|
||||
"Î": ["Î", "I"],
|
||||
"Ș": ["Ș", "Ş", "S"],
|
||||
"Ț": ["Ț", "Ţ", "T"],
|
||||
}
|
||||
|
||||
|
||||
def _make_ro_variants(tokens):
|
||||
variants = []
|
||||
for token in tokens:
|
||||
upper_token = token.upper()
|
||||
upper_char_variants = [_ro_variants.get(c, [c]) for c in upper_token]
|
||||
upper_variants = ["".join(x) for x in itertools.product(*upper_char_variants)]
|
||||
for variant in upper_variants:
|
||||
variants.extend([variant, variant.lower(), variant.title()])
|
||||
return sorted(list(set(variants)))
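For reference, a minimal standalone sketch (not part of the diff) of what the variant expansion above produces; the `_ro_variants` table and `_make_ro_variants` are copied verbatim from this new module so the snippet runs without the patched spaCy on the path:

```python
# Standalone sketch of the diacritic-variant expansion defined above.
import itertools

_ro_variants = {
    "Ă": ["Ă", "A"],
    "Â": ["Â", "A"],
    "Î": ["Î", "I"],
    "Ș": ["Ș", "Ş", "S"],
    "Ț": ["Ț", "Ţ", "T"],
}


def _make_ro_variants(tokens):
    variants = []
    for token in tokens:
        upper_token = token.upper()
        upper_char_variants = [_ro_variants.get(c, [c]) for c in upper_token]
        upper_variants = ["".join(x) for x in itertools.product(*upper_char_variants)]
        for variant in upper_variants:
            # keep upper-, lower- and title-cased forms of every spelling variant
            variants.extend([variant, variant.lower(), variant.title()])
    return sorted(list(set(variants)))


# "și-" expands to comma-below, cedilla and plain-S spellings in three casings:
print(_make_ro_variants(["și-"]))
# ['SI-', 'Si-', 'si-', 'ŞI-', 'Şi-', 'şi-', 'ȘI-', 'Și-', 'și-']
```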
|
||||
|
||||
|
||||
# UD_Romanian-RRT closed class prefixes
|
||||
# POS: ADP|AUX|CCONJ|DET|NUM|PART|PRON|SCONJ
|
||||
_ud_rrt_prefixes = [
|
||||
"a-",
|
||||
"c-",
|
||||
"ce-",
|
||||
"cu-",
|
||||
"d-",
|
||||
"de-",
|
||||
"dintr-",
|
||||
"e-",
|
||||
"făr-",
|
||||
"i-",
|
||||
"l-",
|
||||
"le-",
|
||||
"m-",
|
||||
"mi-",
|
||||
"n-",
|
||||
"ne-",
|
||||
"p-",
|
||||
"pe-",
|
||||
"prim-",
|
||||
"printr-",
|
||||
"s-",
|
||||
"se-",
|
||||
"te-",
|
||||
"v-",
|
||||
"într-",
|
||||
"ș-",
|
||||
"și-",
|
||||
"ți-",
|
||||
]
|
||||
_ud_rrt_prefix_variants = _make_ro_variants(_ud_rrt_prefixes)
|
||||
|
||||
|
||||
# UD_Romanian-RRT closed class suffixes without NUM
|
||||
# POS: ADP|AUX|CCONJ|DET|PART|PRON|SCONJ
|
||||
_ud_rrt_suffixes = [
|
||||
"-a",
|
||||
"-aceasta",
|
||||
"-ai",
|
||||
"-al",
|
||||
"-ale",
|
||||
"-alta",
|
||||
"-am",
|
||||
"-ar",
|
||||
"-astea",
|
||||
"-atâta",
|
||||
"-au",
|
||||
"-aș",
|
||||
"-ați",
|
||||
"-i",
|
||||
"-ilor",
|
||||
"-l",
|
||||
"-le",
|
||||
"-lea",
|
||||
"-mea",
|
||||
"-meu",
|
||||
"-mi",
|
||||
"-mă",
|
||||
"-n",
|
||||
"-ndărătul",
|
||||
"-ne",
|
||||
"-o",
|
||||
"-oi",
|
||||
"-or",
|
||||
"-s",
|
||||
"-se",
|
||||
"-si",
|
||||
"-te",
|
||||
"-ul",
|
||||
"-ului",
|
||||
"-un",
|
||||
"-uri",
|
||||
"-urile",
|
||||
"-urilor",
|
||||
"-veți",
|
||||
"-vă",
|
||||
"-ăștia",
|
||||
"-și",
|
||||
"-ți",
|
||||
]
|
||||
_ud_rrt_suffix_variants = _make_ro_variants(_ud_rrt_suffixes)
|
||||
|
||||
|
||||
_prefixes = (
|
||||
["§", "%", "=", "—", "–", r"\+(?![0-9])"]
|
||||
+ _ud_rrt_prefix_variants
|
||||
+ LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ LIST_CURRENCY
|
||||
+ LIST_ICONS
|
||||
)
|
||||
|
||||
|
||||
_suffixes = (
|
||||
_ud_rrt_suffix_variants
|
||||
+ LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ _list_icons
|
||||
+ ["—", "–"]
|
||||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.",
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
|
||||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
|
||||
),
|
||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||
]
|
||||
)
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ _list_icons
|
||||
+ [
|
||||
r"(?<=[0-9])[+\*^](?=[0-9-])",
|
||||
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
|
||||
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||
),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
]
|
||||
)
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
TOKENIZER_INFIXES = _infixes
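As a hedged illustration of how these three exported lists are typically consumed (the `compile_*` helpers are standard spaCy 2.x utilities in `spacy.util`; the `spacy.lang.ro.punctuation` import path only exists once this new file is in place):

```python
# Illustrative only: compile the new Romanian rules the way Language.Defaults does.
# Assumes a spaCy checkout that already contains this file.
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex
from spacy.lang.ro.punctuation import (
    TOKENIZER_INFIXES,
    TOKENIZER_PREFIXES,
    TOKENIZER_SUFFIXES,
)

prefix_search = compile_prefix_regex(TOKENIZER_PREFIXES).search
suffix_search = compile_suffix_regex(TOKENIZER_SUFFIXES).search
infix_finditer = compile_infix_regex(TOKENIZER_INFIXES).finditer

# e.g. the clitic prefix "s-" (from the UD_Romanian-RRT list above) is split off
print(prefix_search("s-a"))
```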
|
|
@ -1,4 +1,5 @@
|
|||
from ...symbols import ORTH
|
||||
from .punctuation import _make_ro_variants
|
||||
|
||||
|
||||
_exc = {}
|
||||
|
@ -42,8 +43,52 @@ for orth in [
|
|||
"dpdv",
|
||||
"șamd.",
|
||||
"ș.a.m.d.",
|
||||
# below: from UD_Romanian-RRT:
|
||||
"A.c.",
|
||||
"A.f.",
|
||||
"A.r.",
|
||||
"Al.",
|
||||
"Art.",
|
||||
"Aug.",
|
||||
"Bd.",
|
||||
"Dem.",
|
||||
"Dr.",
|
||||
"Fig.",
|
||||
"Fr.",
|
||||
"Gh.",
|
||||
"Gr.",
|
||||
"Lt.",
|
||||
"Nr.",
|
||||
"Obs.",
|
||||
"Prof.",
|
||||
"Sf.",
|
||||
"a.m.",
|
||||
"a.r.",
|
||||
"alin.",
|
||||
"art.",
|
||||
"d-l",
|
||||
"d-lui",
|
||||
"d-nei",
|
||||
"ex.",
|
||||
"fig.",
|
||||
"ian.",
|
||||
"lit.",
|
||||
"lt.",
|
||||
"p.a.",
|
||||
"p.m.",
|
||||
"pct.",
|
||||
"prep.",
|
||||
"sf.",
|
||||
"tel.",
|
||||
"univ.",
|
||||
"îngr.",
|
||||
"într-adevăr",
|
||||
"Șt.",
|
||||
"ș.a.",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
# note: does not distinguish capitalized-only exceptions from others
|
||||
for variant in _make_ro_variants([orth]):
|
||||
_exc[variant] = [{ORTH: variant}]
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
||||
|
|
|
@ -13,6 +13,7 @@ import multiprocessing as mp
|
|||
from itertools import chain, cycle
|
||||
|
||||
from .tokenizer import Tokenizer
|
||||
from .tokens.underscore import Underscore
|
||||
from .vocab import Vocab
|
||||
from .lemmatizer import Lemmatizer
|
||||
from .lookups import Lookups
|
||||
|
@ -874,7 +875,10 @@ class Language(object):
|
|||
sender.send()
|
||||
|
||||
procs = [
|
||||
mp.Process(target=_apply_pipes, args=(self.make_doc, pipes, rch, sch))
|
||||
mp.Process(
|
||||
target=_apply_pipes,
|
||||
args=(self.make_doc, pipes, rch, sch, Underscore.get_state()),
|
||||
)
|
||||
for rch, sch in zip(texts_q, bytedocs_send_ch)
|
||||
]
|
||||
for proc in procs:
|
||||
|
@@ -1146,16 +1150,19 @@ def _pipe(examples, proc, kwargs):
        yield ex


def _apply_pipes(make_doc, pipes, reciever, sender):
def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors):
    """Worker for Language.pipe

    receiver (multiprocessing.Connection): Pipe to receive text. Usually
        created by `multiprocessing.Pipe()`
    sender (multiprocessing.Connection): Pipe to send doc. Usually created by
        `multiprocessing.Pipe()`
    underscore_state (tuple): The data in the Underscore class of the parent
    vectors (dict): The global vectors data, copied from the parent
    """
    Underscore.load_state(underscore_state)
    while True:
        texts = reciever.get()
        texts = receiver.get()
        docs = (make_doc(text) for text in texts)
        for pipe in pipes:
            docs = pipe(docs)
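A rough usage sketch of the code path these hunks change: with the extra `underscore_state` argument, custom extension attributes registered in the parent are also visible inside `Language.pipe` worker processes. The blank "en" pipeline and the texts below are illustrative only (spaCy 2.2.x API):

```python
# Hedged sketch: exercises the multiprocessing branch of Language.pipe (spaCy 2.2.x).
import spacy
from spacy.tokens import Token


def main():
    Token.set_extension("my_flag", default=False, force=True)
    nlp = spacy.blank("en")
    texts = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."] * 2
    # Worker processes now receive Underscore.get_state(), so the extension
    # registered above also exists in the children.
    for doc in nlp.pipe(texts, n_process=2, batch_size=2):
        print(doc.text, doc[0]._.my_flag)


if __name__ == "__main__":
    # guard needed because the workers are started via multiprocessing
    main()
```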
|
|
|
@ -664,6 +664,8 @@ def _get_attr_values(spec, string_store):
|
|||
continue
|
||||
if attr == "TEXT":
|
||||
attr = "ORTH"
|
||||
if attr == "IS_SENT_START":
|
||||
attr = "SENT_START"
|
||||
attr = IDS.get(attr)
|
||||
if isinstance(value, basestring):
|
||||
value = string_store.add(value)
|
||||
|
|
|
@ -365,7 +365,7 @@ class Tensorizer(Pipe):
|
|||
return sgd
|
||||
|
||||
|
||||
@component("tagger", assigns=["token.tag", "token.pos"])
|
||||
@component("tagger", assigns=["token.tag", "token.pos", "token.lemma"])
|
||||
class Tagger(Pipe):
|
||||
"""Pipeline component for part-of-speech tagging.
|
||||
|
||||
|
|
|
@ -464,3 +464,5 @@ cdef enum symbol_t:
|
|||
ENT_KB_ID
|
||||
MORPH
|
||||
ENT_ID
|
||||
|
||||
IDX
|
|
@ -89,6 +89,7 @@ IDS = {
|
|||
"SPACY": SPACY,
|
||||
"PROB": PROB,
|
||||
"LANG": LANG,
|
||||
"IDX": IDX,
|
||||
|
||||
"ADJ": ADJ,
|
||||
"ADP": ADP,
|
||||
|
|
|
@ -80,6 +80,11 @@ def es_tokenizer():
|
|||
return get_lang_class("es").Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def eu_tokenizer():
|
||||
return get_lang_class("eu").Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def fi_tokenizer():
|
||||
return get_lang_class("fi").Defaults.create_tokenizer()
|
||||
|
|
|
@ -63,3 +63,39 @@ def test_doc_array_to_from_string_attrs(en_vocab, attrs):
|
|||
words = ["An", "example", "sentence"]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
|
||||
|
||||
|
||||
def test_doc_array_idx(en_vocab):
|
||||
"""Test that Doc.to_array can retrieve token start indices"""
|
||||
words = ["An", "example", "sentence"]
|
||||
offsets = Doc(en_vocab, words=words).to_array("IDX")
|
||||
assert offsets[0] == 0
|
||||
assert offsets[1] == 3
|
||||
assert offsets[2] == 11
|
||||
|
||||
|
||||
def test_doc_from_array_heads_in_bounds(en_vocab):
|
||||
"""Test that Doc.from_array doesn't set heads that are out of bounds."""
|
||||
words = ["This", "is", "a", "sentence", "."]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
for token in doc:
|
||||
token.head = doc[0]
|
||||
|
||||
# correct
|
||||
arr = doc.to_array(["HEAD"])
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
||||
# head before start
|
||||
arr = doc.to_array(["HEAD"])
|
||||
arr[0] = -1
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
with pytest.raises(ValueError):
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
||||
# head after end
|
||||
arr = doc.to_array(["HEAD"])
|
||||
arr[0] = 5
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
with pytest.raises(ValueError):
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
|
|
@ -145,10 +145,9 @@ def test_doc_api_runtime_error(en_tokenizer):
|
|||
# Example that caused run-time error while parsing Reddit
|
||||
# fmt: off
|
||||
text = "67% of black households are single parent \n\n72% of all black babies born out of wedlock \n\n50% of all black kids don\u2019t finish high school"
|
||||
deps = ["nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "",
|
||||
"nummod", "prep", "det", "amod", "pobj", "acl", "prep", "prep",
|
||||
"pobj", "", "nummod", "prep", "det", "amod", "pobj", "aux", "neg",
|
||||
"ROOT", "amod", "dobj"]
|
||||
deps = ["nummod", "nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "", "nummod", "appos", "prep", "det",
|
||||
"amod", "pobj", "acl", "prep", "prep", "pobj",
|
||||
"", "nummod", "nsubj", "prep", "det", "amod", "pobj", "aux", "neg", "ccomp", "amod", "dobj"]
|
||||
# fmt: on
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
||||
|
@ -272,19 +271,9 @@ def test_doc_is_nered(en_vocab):
|
|||
def test_doc_from_array_sent_starts(en_vocab):
|
||||
words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
|
||||
heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
|
||||
deps = [
|
||||
"ROOT",
|
||||
"dep",
|
||||
"dep",
|
||||
"dep",
|
||||
"dep",
|
||||
"dep",
|
||||
"ROOT",
|
||||
"dep",
|
||||
"dep",
|
||||
"dep",
|
||||
"dep",
|
||||
]
|
||||
# fmt: off
|
||||
deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
|
||||
# fmt: on
|
||||
doc = Doc(en_vocab, words=words)
|
||||
for i, (dep, head) in enumerate(zip(deps, heads)):
|
||||
doc[i].dep_ = dep
|
||||
|
|
|
@ -164,6 +164,11 @@ def test_doc_token_api_head_setter(en_tokenizer):
|
|||
assert doc[4].left_edge.i == 0
|
||||
assert doc[2].left_edge.i == 0
|
||||
|
||||
# head token must be from the same document
|
||||
doc2 = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
with pytest.raises(ValueError):
|
||||
doc[0].head = doc2[0]
|
||||
|
||||
|
||||
def test_is_sent_start(en_tokenizer):
|
||||
doc = en_tokenizer("This is a sentence. This is another.")
|
||||
|
@ -211,7 +216,7 @@ def test_token_api_conjuncts_chain(en_vocab):
|
|||
def test_token_api_conjuncts_simple(en_vocab):
|
||||
words = "They came and went .".split()
|
||||
heads = [1, 0, -1, -2, -1]
|
||||
deps = ["nsubj", "ROOT", "cc", "conj"]
|
||||
deps = ["nsubj", "ROOT", "cc", "conj", "dep"]
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
assert [w.text for w in doc[1].conjuncts] == ["went"]
|
||||
assert [w.text for w in doc[3].conjuncts] == ["came"]
|
||||
|
|
|
@@ -4,6 +4,15 @@ from spacy.tokens import Doc, Span, Token
from spacy.tokens.underscore import Underscore


@pytest.fixture(scope="function", autouse=True)
def clean_underscore():
    # reset the Underscore object after the test, to avoid having state copied across tests
    yield
    Underscore.doc_extensions = {}
    Underscore.span_extensions = {}
    Underscore.token_extensions = {}


def test_create_doc_underscore():
    doc = Mock()
    doc.doc = doc
|
|
|
@ -55,7 +55,8 @@ def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
|
|||
("Kristiansen c/o Madsen", 3),
|
||||
("Sprogteknologi a/s", 2),
|
||||
("De boede i A/B Bellevue", 5),
|
||||
("Rotorhastigheden er 3400 o/m.", 5),
|
||||
# note: skipping due to weirdness in UD_Danish-DDT
|
||||
# ("Rotorhastigheden er 3400 o/m.", 5),
|
||||
("Jeg købte billet t/r.", 5),
|
||||
("Murerarbejdsmand m/k søges", 3),
|
||||
("Netværket kører over TCP/IP", 4),
|
||||
|
|
spacy/tests/lang/eu/test_text.py (new file, 22 lines)
|
@@ -0,0 +1,22 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


def test_eu_tokenizer_handles_long_text(eu_tokenizer):
    text = """ta nere guitarra estrenatu ondoren"""
    tokens = eu_tokenizer(text)
    assert len(tokens) == 5


@pytest.mark.parametrize(
    "text,length",
    [
        ("milesker ederra joan zen hitzaldia plazer hutsa", 7),
        ("astelehen guztia sofan pasau biot", 5),
    ],
)
def test_eu_tokenizer_handles_cnts(eu_tokenizer, text, length):
    tokens = eu_tokenizer(text)
    assert len(tokens) == length
|
@ -7,12 +7,22 @@ ABBREVIATION_TESTS = [
|
|||
["Hyvää", "uutta", "vuotta", "t.", "siht.", "Niemelä", "!"],
|
||||
),
|
||||
("Paino on n. 2.2 kg", ["Paino", "on", "n.", "2.2", "kg"]),
|
||||
(
|
||||
"Vuonna 1 eaa. tapahtui kauheita.",
|
||||
["Vuonna", "1", "eaa.", "tapahtui", "kauheita", "."],
|
||||
),
|
||||
]
|
||||
|
||||
HYPHENATED_TESTS = [
|
||||
(
|
||||
"1700-luvulle sijoittuva taide-elokuva",
|
||||
["1700-luvulle", "sijoittuva", "taide-elokuva"],
|
||||
"1700-luvulle sijoittuva taide-elokuva Wikimedia-säätiön Varsinais-Suomen",
|
||||
[
|
||||
"1700-luvulle",
|
||||
"sijoittuva",
|
||||
"taide-elokuva",
|
||||
"Wikimedia-säätiön",
|
||||
"Varsinais-Suomen",
|
||||
],
|
||||
)
|
||||
]
|
||||
|
||||
|
@ -23,6 +33,7 @@ ABBREVIATION_INFLECTION_TESTS = [
|
|||
),
|
||||
("ALV:n osuus on 24 %.", ["ALV:n", "osuus", "on", "24", "%", "."]),
|
||||
("Hiihtäjä oli kilpailun 14:s.", ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]),
|
||||
("EU:n toimesta tehtiin jotain.", ["EU:n", "toimesta", "tehtiin", "jotain", "."]),
|
||||
]
|
||||
|
||||
|
||||
|
|
|
@ -12,11 +12,11 @@ def test_lt_tokenizer_handles_long_text(lt_tokenizer):
|
|||
[
|
||||
(
|
||||
"177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.",
|
||||
15,
|
||||
17,
|
||||
),
|
||||
(
|
||||
"ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.",
|
||||
16,
|
||||
18,
|
||||
),
|
||||
],
|
||||
)
|
||||
|
@ -28,7 +28,7 @@ def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
|
|||
@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
|
||||
def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
|
||||
tokens = lt_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert len(tokens) == 2
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
|
|
|
@ -4,6 +4,8 @@ from mock import Mock
|
|||
from spacy.matcher import Matcher, DependencyMatcher
|
||||
from spacy.tokens import Doc, Token
|
||||
|
||||
from ..doc.test_underscore import clean_underscore # noqa: F401
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def matcher(en_vocab):
|
||||
|
@ -197,6 +199,7 @@ def test_matcher_any_token_operator(en_vocab):
|
|||
assert matches[2] == "test hello world"
|
||||
|
||||
|
||||
@pytest.mark.usefixtures("clean_underscore")
|
||||
def test_matcher_extension_attribute(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
get_is_fruit = lambda token: token.text in ("apple", "banana")
|
||||
|
|
|
@ -31,6 +31,8 @@ TEST_PATTERNS = [
|
|||
([{"NORM": "a"}, {"POS": {"IN": ["NOUN"]}}], 0, 0),
|
||||
([{"_": {"foo": {"NOT_IN": ["bar", "baz"]}, "a": 5, "b": {">": 10}}}], 0, 0),
|
||||
([{"orth": "foo"}], 0, 0), # prev: xfail
|
||||
([{"IS_SENT_START": True}], 0, 0),
|
||||
([{"SENT_START": True}], 0, 0),
|
||||
]
|
||||
|
||||
|
||||
|
|
|
@ -31,23 +31,23 @@ BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
|
|||
@pytest.fixture
|
||||
def heads():
|
||||
# fmt: off
|
||||
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, -10, 2, 1, -3, -1, -15,
|
||||
-1, 1, 4, -1, 1, -3, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
|
||||
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, 3, 1, 1, -14,
|
||||
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 2, 1,
|
||||
0, -1, 1, -2, -1, 2, 1, -4, -8, 0, 1, -2, -1, -1, 3, -1, 1, -6,
|
||||
9, 1, 7, -1, 1, -2, 3, 2, 1, -10, -1, 1, -2, -22, -1, 1, 0, -1,
|
||||
2, 1, -4, -1, -2, -1, 1, -2, -6, -7, 1, -9, -1, 2, -1, -3, -1,
|
||||
3, 2, 1, -4, -19, -24, 3, 2, 1, -4, -1, 1, 2, -1, -5, -34, 1, 0,
|
||||
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, -3, -1,
|
||||
-1, 3, 2, 1, 0, -1, -2, 7, -1, 5, 1, 3, -1, 1, -10, -1, -2, 1,
|
||||
-2, -15, 1, 0, -1, -1, 2, 1, -3, -1, -1, -2, -1, 1, -2, -12, 1,
|
||||
1, 0, 1, -2, -1, -2, -3, 9, -1, 2, -1, -4, 2, 1, -3, -4, -15, 2,
|
||||
1, -3, -1, 2, 1, -3, -8, -9, -1, -2, -1, -4, 1, -2, -3, 1, -2,
|
||||
-19, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
|
||||
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, 2, 1, -12, -1, -2,
|
||||
-1, 1, 4, 3, 1, 1, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
|
||||
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, -11, 1, 1, -14,
|
||||
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 1, 1,
|
||||
0, -1, 1, -2, -1, 2, 1, -4, -8, 18, 1, -2, -1, -1, 3, -1, 1, 10,
|
||||
9, 1, 7, -1, 1, -2, 3, 2, 1, 0, -1, 1, -2, -4, -1, 1, 0, -1,
|
||||
2, 1, -4, -1, 2, 1, 1, 1, -6, -11, 1, 20, -1, 2, -1, -3, -1,
|
||||
3, 2, 1, -4, -10, -11, 3, 2, 1, -4, -1, 1, -3, -1, 0, -1, 1, 0,
|
||||
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, 6, -1,
|
||||
-1, 3, 2, 1, 0, -1, -2, 7, -1, 2, 1, 3, -1, 1, -10, -1, -2, 1,
|
||||
-2, -5, 1, 0, -1, -1, 1, -2, -5, -1, -1, -2, -1, 1, -2, -12, 1,
|
||||
1, 0, 1, -2, -1, -4, -5, 18, -1, 2, -1, -4, 2, 1, -3, -4, -5, 2,
|
||||
1, -3, -1, 2, 1, -3, -17, -24, -1, -2, -1, -4, 1, -2, -3, 1, -2,
|
||||
-10, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
|
||||
0, -1, 1, -2, -4, 1, 0, -1, -1, 2, -1, -3, 1, -2, 1, -2, 3, 1,
|
||||
1, -4, -1, -2, 2, 1, -5, -19, -1, 1, 1, 0, 1, 6, -1, 1, -3, -1,
|
||||
-1, -8, -9, -1]
|
||||
1, -4, -1, -2, 2, 1, -3, -19, -1, 1, 1, 0, 0, 6, 5, 1, 3, -1,
|
||||
-1, 0, -1, -1]
|
||||
# fmt: on
|
||||
|
||||
|
||||
|
|
|
@ -48,7 +48,7 @@ def test_issue2203(en_vocab):
|
|||
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
|
||||
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
# Work around lemma corrpution problem and set lemmas after tags
|
||||
# Work around lemma corruption problem and set lemmas after tags
|
||||
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
|
||||
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
|
||||
assert [t.tag_ for t in doc] == tags
|
||||
|
|
|
@ -121,7 +121,7 @@ def test_issue2772(en_vocab):
|
|||
words = "When we write or communicate virtually , we can hide our true feelings .".split()
|
||||
# A tree with a non-projective (i.e. crossing) arc
|
||||
# The arcs (0, 4) and (2, 9) cross.
|
||||
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, -1, -2, -1]
|
||||
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, 2, 1, -3, -4]
|
||||
deps = ["dep"] * len(heads)
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
assert doc[1].is_sent_start is None
|
||||
|
|
|
@ -24,7 +24,7 @@ def test_issue4590(en_vocab):
|
|||
|
||||
text = "The quick brown fox jumped over the lazy fox"
|
||||
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
|
||||
deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
|
||||
deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"]
|
||||
|
||||
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
|
||||
|
||||
|
|
spacy/tests/regression/test_issue4725.py (new file, 25 lines)
|
@@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals

import numpy

from spacy.lang.en import English
from spacy.vocab import Vocab


def test_issue4725():
    # ensures that this runs correctly and doesn't hang or crash because of the global vectors
    vocab = Vocab(vectors_name="test_vocab_add_vector")
    data = numpy.ndarray((5, 3), dtype="f")
    data[0] = 1.0
    data[1] = 2.0
    vocab.set_vector("cat", data[0])
    vocab.set_vector("dog", data[1])

    nlp = English(vocab=vocab)
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    nlp.begin_training()
    docs = ["Kurt is in London."] * 10
    for _ in nlp.pipe(docs, batch_size=2, n_process=2):
        pass
spacy/tests/regression/test_issue4903.py (new file, 43 lines)
|
@ -0,0 +1,43 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Span, Doc
|
||||
|
||||
|
||||
class CustomPipe:
|
||||
name = "my_pipe"
|
||||
|
||||
def __init__(self):
|
||||
Span.set_extension("my_ext", getter=self._get_my_ext)
|
||||
Doc.set_extension("my_ext", default=None)
|
||||
|
||||
def __call__(self, doc):
|
||||
gathered_ext = []
|
||||
for sent in doc.sents:
|
||||
sent_ext = self._get_my_ext(sent)
|
||||
sent._.set("my_ext", sent_ext)
|
||||
gathered_ext.append(sent_ext)
|
||||
|
||||
doc._.set("my_ext", "\n".join(gathered_ext))
|
||||
|
||||
return doc
|
||||
|
||||
@staticmethod
|
||||
def _get_my_ext(span):
|
||||
return str(span.end)
|
||||
|
||||
|
||||
def test_issue4903():
|
||||
# ensures that this runs correctly and doesn't hang or crash on Windows / macOS
|
||||
|
||||
nlp = English()
|
||||
custom_component = CustomPipe()
|
||||
nlp.add_pipe(nlp.create_pipe("sentencizer"))
|
||||
nlp.add_pipe(custom_component, after="sentencizer")
|
||||
|
||||
text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
|
||||
docs = list(nlp.pipe(text, n_process=2))
|
||||
assert docs[0].text == "I like bananas."
|
||||
assert docs[1].text == "Do you like them?"
|
||||
assert docs[2].text == "No, I prefer wasabi."
|
|
@@ -2,7 +2,7 @@ import pytest
from spacy.language import Language


def test_evaluate():
def test_issue4924():
    nlp = Language()
    docs_golds = [("", {})]
    with pytest.raises(ValueError):
|
|
spacy/tests/regression/test_issue5048.py (new file, 35 lines)
|
@ -0,0 +1,35 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy
|
||||
from spacy.tokens import Doc
|
||||
from spacy.attrs import DEP, POS, TAG
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
def test_issue5048(en_vocab):
|
||||
words = ["This", "is", "a", "sentence"]
|
||||
pos_s = ["DET", "VERB", "DET", "NOUN"]
|
||||
spaces = [" ", " ", " ", ""]
|
||||
deps_s = ["dep", "adj", "nn", "atm"]
|
||||
tags_s = ["DT", "VBZ", "DT", "NN"]
|
||||
|
||||
strings = en_vocab.strings
|
||||
|
||||
for w in words:
|
||||
strings.add(w)
|
||||
deps = [strings.add(d) for d in deps_s]
|
||||
pos = [strings.add(p) for p in pos_s]
|
||||
tags = [strings.add(t) for t in tags_s]
|
||||
|
||||
attrs = [POS, DEP, TAG]
|
||||
array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64")
|
||||
|
||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||
doc.from_array(attrs, array)
|
||||
v1 = [(token.text, token.pos_, token.tag_) for token in doc]
|
||||
|
||||
doc2 = get_doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s)
|
||||
v2 = [(token.text, token.pos_, token.tag_) for token in doc2]
|
||||
assert v1 == v2
|
spacy/tests/regression/test_issue5082.py (new file, 46 lines)
|
@ -0,0 +1,46 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy as np
|
||||
from spacy.lang.en import English
|
||||
from spacy.pipeline import EntityRuler
|
||||
|
||||
|
||||
def test_issue5082():
|
||||
# Ensure the 'merge_entities' pipeline does something sensible for the vectors of the merged tokens
|
||||
nlp = English()
|
||||
vocab = nlp.vocab
|
||||
array1 = np.asarray([0.1, 0.5, 0.8], dtype=np.float32)
|
||||
array2 = np.asarray([-0.2, -0.6, -0.9], dtype=np.float32)
|
||||
array3 = np.asarray([0.3, -0.1, 0.7], dtype=np.float32)
|
||||
array4 = np.asarray([0.5, 0, 0.3], dtype=np.float32)
|
||||
array34 = np.asarray([0.4, -0.05, 0.5], dtype=np.float32)
|
||||
|
||||
vocab.set_vector("I", array1)
|
||||
vocab.set_vector("like", array2)
|
||||
vocab.set_vector("David", array3)
|
||||
vocab.set_vector("Bowie", array4)
|
||||
|
||||
text = "I like David Bowie"
|
||||
ruler = EntityRuler(nlp)
|
||||
patterns = [
|
||||
{"label": "PERSON", "pattern": [{"LOWER": "david"}, {"LOWER": "bowie"}]}
|
||||
]
|
||||
ruler.add_patterns(patterns)
|
||||
nlp.add_pipe(ruler)
|
||||
|
||||
parsed_vectors_1 = [t.vector for t in nlp(text)]
|
||||
assert len(parsed_vectors_1) == 4
|
||||
np.testing.assert_array_equal(parsed_vectors_1[0], array1)
|
||||
np.testing.assert_array_equal(parsed_vectors_1[1], array2)
|
||||
np.testing.assert_array_equal(parsed_vectors_1[2], array3)
|
||||
np.testing.assert_array_equal(parsed_vectors_1[3], array4)
|
||||
|
||||
merge_ents = nlp.create_pipe("merge_entities")
|
||||
nlp.add_pipe(merge_ents)
|
||||
|
||||
parsed_vectors_2 = [t.vector for t in nlp(text)]
|
||||
assert len(parsed_vectors_2) == 3
|
||||
np.testing.assert_array_equal(parsed_vectors_2[0], array1)
|
||||
np.testing.assert_array_equal(parsed_vectors_2[1], array2)
|
||||
np.testing.assert_array_equal(parsed_vectors_2[2], array34)
|
|
@ -12,12 +12,19 @@ def load_tokenizer(b):
|
|||
|
||||
|
||||
def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
|
||||
"""Test that custom tokenizer with not all functions defined can be
|
||||
serialized and deserialized correctly (see #2494)."""
|
||||
"""Test that custom tokenizer with not all functions defined or empty
|
||||
properties can be serialized and deserialized correctly (see #2494,
|
||||
#4991)."""
|
||||
tokenizer = Tokenizer(en_vocab, suffix_search=en_tokenizer.suffix_search)
|
||||
tokenizer_bytes = tokenizer.to_bytes()
|
||||
Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
|
||||
|
||||
tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]})
|
||||
tokenizer.rules = {}
|
||||
tokenizer_bytes = tokenizer.to_bytes()
|
||||
tokenizer_reloaded = Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
|
||||
assert tokenizer_reloaded.rules == {}
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="Currently unreliable across platforms")
|
||||
@pytest.mark.parametrize("text", ["I💜you", "they’re", "“hello”"])
|
||||
|
|
|
@ -28,10 +28,10 @@ def test_displacy_parse_deps(en_vocab):
|
|||
deps = displacy.parse_deps(doc)
|
||||
assert isinstance(deps, dict)
|
||||
assert deps["words"] == [
|
||||
{"text": "This", "tag": "DET"},
|
||||
{"text": "is", "tag": "AUX"},
|
||||
{"text": "a", "tag": "DET"},
|
||||
{"text": "sentence", "tag": "NOUN"},
|
||||
{"lemma": None, "text": words[0], "tag": pos[0]},
|
||||
{"lemma": None, "text": words[1], "tag": pos[1]},
|
||||
{"lemma": None, "text": words[2], "tag": pos[2]},
|
||||
{"lemma": None, "text": words[3], "tag": pos[3]},
|
||||
]
|
||||
assert deps["arcs"] == [
|
||||
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
|
||||
|
@ -72,7 +72,7 @@ def test_displacy_rtl():
|
|||
deps = ["foo", "bar", "foo", "baz"]
|
||||
heads = [1, 0, 1, -2]
|
||||
nlp = Persian()
|
||||
doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
|
||||
doc = get_doc(nlp.vocab, words=words, tags=pos, heads=heads, deps=deps)
|
||||
doc.ents = [Span(doc, 1, 3, label="TEST")]
|
||||
html = displacy.render(doc, page=True, style="dep")
|
||||
assert "direction: rtl" in html
|
||||
|
|
|
@ -4,8 +4,10 @@ import shutil
|
|||
import contextlib
|
||||
import srsly
|
||||
from pathlib import Path
|
||||
|
||||
from spacy import Errors
|
||||
from spacy.tokens import Doc, Span
|
||||
from spacy.attrs import POS, HEAD, DEP
|
||||
from spacy.attrs import POS, TAG, HEAD, DEP, LEMMA
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
|
@ -22,30 +24,56 @@ def make_tempdir():
|
|||
shutil.rmtree(str(d))
|
||||
|
||||
|
||||
def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None):
|
||||
def get_doc(
|
||||
vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None, lemmas=None
|
||||
):
|
||||
"""Create Doc object from given vocab, words and annotations."""
|
||||
pos = pos or [""] * len(words)
|
||||
tags = tags or [""] * len(words)
|
||||
heads = heads or [0] * len(words)
|
||||
deps = deps or [""] * len(words)
|
||||
for value in deps + tags + pos:
|
||||
if deps and not heads:
|
||||
heads = [0] * len(deps)
|
||||
headings = []
|
||||
values = []
|
||||
annotations = [pos, heads, deps, lemmas, tags]
|
||||
possible_headings = [POS, HEAD, DEP, LEMMA, TAG]
|
||||
for a, annot in enumerate(annotations):
|
||||
if annot is not None:
|
||||
if len(annot) != len(words):
|
||||
raise ValueError(Errors.E189)
|
||||
headings.append(possible_headings[a])
|
||||
if annot is not heads:
|
||||
values.extend(annot)
|
||||
for value in values:
|
||||
vocab.strings.add(value)
|
||||
|
||||
doc = Doc(vocab, words=words)
|
||||
attrs = doc.to_array([POS, HEAD, DEP])
|
||||
for i, (p, head, dep) in enumerate(zip(pos, heads, deps)):
|
||||
attrs[i, 0] = doc.vocab.strings[p]
|
||||
attrs[i, 1] = head
|
||||
attrs[i, 2] = doc.vocab.strings[dep]
|
||||
doc.from_array([POS, HEAD, DEP], attrs)
|
||||
|
||||
# if there are any other annotations, set them
|
||||
if headings:
|
||||
attrs = doc.to_array(headings)
|
||||
|
||||
j = 0
|
||||
for annot in annotations:
|
||||
if annot:
|
||||
if annot is heads:
|
||||
for i in range(len(words)):
|
||||
if attrs.ndim == 1:
|
||||
attrs[i] = heads[i]
|
||||
else:
|
||||
attrs[i, j] = heads[i]
|
||||
else:
|
||||
for i in range(len(words)):
|
||||
if attrs.ndim == 1:
|
||||
attrs[i] = doc.vocab.strings[annot[i]]
|
||||
else:
|
||||
attrs[i, j] = doc.vocab.strings[annot[i]]
|
||||
j += 1
|
||||
doc.from_array(headings, attrs)
|
||||
|
||||
# finally, set the entities
|
||||
if ents:
|
||||
doc.ents = [
|
||||
Span(doc, start, end, label=doc.vocab.strings[label])
|
||||
for start, end, label in ents
|
||||
]
|
||||
if tags:
|
||||
for token in doc:
|
||||
token.tag_ = tags[token.i]
|
||||
return doc
|
||||
|
||||
|
||||
|
@ -86,8 +114,7 @@ def assert_docs_equal(doc1, doc2):
|
|||
|
||||
assert [t.head.i for t in doc1] == [t.head.i for t in doc2]
|
||||
assert [t.dep for t in doc1] == [t.dep for t in doc2]
|
||||
if doc1.is_parsed and doc2.is_parsed:
|
||||
assert [s for s in doc1.sents] == [s for s in doc2.sents]
|
||||
assert [t.is_sent_start for t in doc1] == [t.is_sent_start for t in doc2]
|
||||
|
||||
assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
|
||||
assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
|
||||
|
|
|
@ -699,6 +699,7 @@ cdef class Tokenizer:
|
|||
|
||||
DOCS: https://spacy.io/api/tokenizer#to_disk
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
with path.open("wb") as file_:
|
||||
file_.write(self.to_bytes(**kwargs))
|
||||
|
||||
|
@ -712,6 +713,7 @@ cdef class Tokenizer:
|
|||
|
||||
DOCS: https://spacy.io/api/tokenizer#from_disk
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
with path.open("rb") as file_:
|
||||
bytes_data = file_.read()
|
||||
self.from_bytes(bytes_data, **kwargs)
|
||||
|
@ -756,21 +758,20 @@ cdef class Tokenizer:
|
|||
}
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
if data.get("prefix_search"):
|
||||
if "prefix_search" in data and isinstance(data["prefix_search"], str):
|
||||
self.prefix_search = re.compile(data["prefix_search"]).search
|
||||
if data.get("suffix_search"):
|
||||
if "suffix_search" in data and isinstance(data["suffix_search"], str):
|
||||
self.suffix_search = re.compile(data["suffix_search"]).search
|
||||
if data.get("infix_finditer"):
|
||||
if "infix_finditer" in data and isinstance(data["infix_finditer"], str):
|
||||
self.infix_finditer = re.compile(data["infix_finditer"]).finditer
|
||||
if data.get("token_match"):
|
||||
if "token_match" in data and isinstance(data["token_match"], str):
|
||||
self.token_match = re.compile(data["token_match"]).match
|
||||
if data.get("rules"):
|
||||
if "rules" in data and isinstance(data["rules"], dict):
|
||||
# make sure to hard reset the cache to remove data from the default exceptions
|
||||
self._rules = {}
|
||||
self._flush_cache()
|
||||
self._flush_specials()
|
||||
self._load_special_cases(data.get("rules", {}))
|
||||
|
||||
self._load_special_cases(data["rules"])
|
||||
return self
|
||||
|
||||
|
||||
|
|
|
@ -213,6 +213,10 @@ def _merge(Doc doc, merges):
|
|||
new_orth = ''.join([t.text_with_ws for t in spans[token_index]])
|
||||
if spans[token_index][-1].whitespace_:
|
||||
new_orth = new_orth[:-len(spans[token_index][-1].whitespace_)]
|
||||
# add the vector of the (merged) entity to the vocab
|
||||
if not doc.vocab.get_vector(new_orth).any():
|
||||
if doc.vocab.vectors_length > 0:
|
||||
doc.vocab.set_vector(new_orth, span.vector)
|
||||
token = tokens[token_index]
|
||||
lex = doc.vocab.get(doc.mem, new_orth)
|
||||
token.lex = lex
|
||||
|
|
|
@ -19,7 +19,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
|
|||
from ..typedefs cimport attr_t, flags_t
|
||||
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
|
||||
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
|
||||
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, attr_id_t
|
||||
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
|
||||
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
||||
|
||||
from ..attrs import intify_attrs, IDS
|
||||
|
@ -68,6 +68,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
|
|||
return token.ent_id
|
||||
elif feat_name == ENT_KB_ID:
|
||||
return token.ent_kb_id
|
||||
elif feat_name == IDX:
|
||||
return token.idx
|
||||
else:
|
||||
return Lexeme.get_struct_attr(token.lex, feat_name)
|
||||
|
||||
|
@ -253,7 +255,7 @@ cdef class Doc:
|
|||
def is_nered(self):
|
||||
"""Check if the document has named entities set. Will return True if
|
||||
*any* of the tokens has a named entity tag set (even if the others are
|
||||
unknown values).
|
||||
unknown values), or if the document is empty.
|
||||
"""
|
||||
if len(self) == 0:
|
||||
return True
|
||||
|
@ -778,10 +780,12 @@ cdef class Doc:
|
|||
# Allow strings, e.g. 'lemma' or 'LEMMA'
|
||||
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
|
||||
for id_ in attrs]
|
||||
if array.dtype != numpy.uint64:
|
||||
user_warning(Warnings.W028.format(type=array.dtype))
|
||||
|
||||
if SENT_START in attrs and HEAD in attrs:
|
||||
raise ValueError(Errors.E032)
|
||||
cdef int i, col
|
||||
cdef int i, col, abs_head_index
|
||||
cdef attr_id_t attr_id
|
||||
cdef TokenC* tokens = self.c
|
||||
cdef int length = len(array)
|
||||
|
@ -795,6 +799,14 @@ cdef class Doc:
|
|||
attr_ids[i] = attr_id
|
||||
if len(array.shape) == 1:
|
||||
array = array.reshape((array.size, 1))
|
||||
# Check that all heads are within the document bounds
|
||||
if HEAD in attrs:
|
||||
col = attrs.index(HEAD)
|
||||
for i in range(length):
|
||||
# cast index to signed int
|
||||
abs_head_index = numpy.int32(array[i, col]) + i
|
||||
if abs_head_index < 0 or abs_head_index >= length:
|
||||
raise ValueError(Errors.E190.format(index=i, value=array[i, col], rel_head_index=numpy.int32(array[i, col])))
|
||||
# Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
|
||||
if TAG in attrs:
|
||||
col = attrs.index(TAG)
|
||||
|
@ -865,7 +877,7 @@ cdef class Doc:
|
|||
|
||||
DOCS: https://spacy.io/api/doc#to_bytes
|
||||
"""
|
||||
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID] # TODO: ENT_KB_ID ?
|
||||
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM] # TODO: ENT_KB_ID ?
|
||||
if self.is_tagged:
|
||||
array_head.extend([TAG, POS])
|
||||
# If doc parsed add head and dep attribute
|
||||
|
@ -1166,6 +1178,7 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
|
|||
heads_within_sents = _set_lr_kids_and_edges(tokens, length, loop_count)
|
||||
if loop_count > 10:
|
||||
warnings.warn(Warnings.W026)
|
||||
break
|
||||
loop_count += 1
|
||||
# Set sentence starts
|
||||
for i in range(length):
|
||||
|
|
|
@ -626,6 +626,9 @@ cdef class Token:
|
|||
# This function sets the head of self to new_head and updates the
|
||||
# counters for left/right dependents and left/right corner for the
|
||||
# new and the old head
|
||||
# Check that token is from the same document
|
||||
if self.doc != new_head.doc:
|
||||
raise ValueError(Errors.E191)
|
||||
# Do nothing if old head is new head
|
||||
if self.i + self.c.head == new_head.i:
|
||||
return
|
||||
|
|
|
@@ -76,6 +76,14 @@ class Underscore(object):
    def _get_key(self, name):
        return ("._.", name, self._start, self._end)

    @classmethod
    def get_state(cls):
        return cls.token_extensions, cls.span_extensions, cls.doc_extensions

    @classmethod
    def load_state(cls, state):
        cls.token_extensions, cls.span_extensions, cls.doc_extensions = state


def get_ext_args(**kwargs):
    """Validate and convert arguments. Reused in Doc, Token and Span."""
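A small sketch (assuming this patch is applied) of how the two new classmethods hand the extension registry to a worker process:

```python
# Sketch of the state hand-off enabled by get_state()/load_state() above.
# Assumes a spaCy 2.2.x checkout containing this patch.
from spacy.tokens import Token
from spacy.tokens.underscore import Underscore

Token.set_extension("my_flag", default=False, force=True)

state = Underscore.get_state()  # (token_extensions, span_extensions, doc_extensions)

# ...inside a freshly spawned worker process, before any docs are produced:
Underscore.load_state(state)
assert Token.has_extension("my_flag")
```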
|
|
|
@ -349,44 +349,6 @@ cdef class Vectors:
|
|||
for i in range(len(queries)) ], dtype="uint64")
|
||||
return (keys, best_rows, scores)
|
||||
|
||||
def from_glove(self, path):
|
||||
"""Load GloVe vectors from a directory. Assumes binary format,
|
||||
that the vocab is in a vocab.txt, and that vectors are named
|
||||
vectors.{size}.[fd].bin, e.g. vectors.128.f.bin for 128d float32
|
||||
vectors, vectors.300.d.bin for 300d float64 (double) vectors, etc.
|
||||
By default GloVe outputs 64-bit vectors.
|
||||
|
||||
path (unicode / Path): The path to load the GloVe vectors from.
|
||||
RETURNS: A `StringStore` object, holding the key-to-string mapping.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#from_glove
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
width = None
|
||||
for name in path.iterdir():
|
||||
if name.parts[-1].startswith("vectors"):
|
||||
_, dims, dtype, _2 = name.parts[-1].split('.')
|
||||
width = int(dims)
|
||||
break
|
||||
else:
|
||||
raise IOError(Errors.E061.format(filename=path))
|
||||
bin_loc = path / f"vectors.{dims}.{dtype}.bin"
|
||||
xp = get_array_module(self.data)
|
||||
self.data = None
|
||||
with bin_loc.open("rb") as file_:
|
||||
self.data = xp.fromfile(file_, dtype=dtype)
|
||||
if dtype != "float32":
|
||||
self.data = xp.ascontiguousarray(self.data, dtype="float32")
|
||||
if self.data.ndim == 1:
|
||||
self.data = self.data.reshape((self.data.size//width, width))
|
||||
n = 0
|
||||
strings = StringStore()
|
||||
with (path / "vocab.txt").open("r") as file_:
|
||||
for i, line in enumerate(file_):
|
||||
key = strings.add(line.strip())
|
||||
self.add(key, row=i)
|
||||
return strings
|
||||
|
||||
def to_disk(self, path, **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
|
|
|
@ -109,9 +109,9 @@ links) and check whether they are compatible with the currently installed
|
|||
version of spaCy. Should be run after upgrading spaCy via `pip install -U spacy`
|
||||
to ensure that all installed models can be used with the new version. The
|
||||
command is also useful to detect out-of-sync model links resulting from links
|
||||
created in different virtual environments. It will a list of models, the
|
||||
installed versions, the latest compatible version (if out of date) and the
|
||||
commands for updating.
|
||||
created in different virtual environments. It will show a list of models and
|
||||
their installed versions. If any model is out of date, the latest compatible
|
||||
versions and command for updating are shown.
|
||||
|
||||
> #### Automated validation
|
||||
>
|
||||
|
@ -176,7 +176,7 @@ All output files generated by this command are compatible with
|
|||
|
||||
## Debug data {#debug-data new="2.2"}
|
||||
|
||||
Analyze, debug and validate your training and development data, get useful
|
||||
Analyze, debug, and validate your training and development data. Get useful
|
||||
stats, and find problems like invalid entity annotations, cyclic dependencies,
|
||||
low data labels and more.
|
||||
|
||||
|
@ -185,10 +185,11 @@ $ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pi
|
|||
```
|
||||
|
||||
| Argument | Type | Description |
|
||||
| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------- |
|
||||
| ------------------------------------------------------ | ---------- | -------------------------------------------------------------------------------------------------- |
|
||||
| `lang` | positional | Model language. |
|
||||
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
|
||||
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
|
||||
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.3</Tag> | option | Location of JSON-formatted tag map. |
|
||||
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
|
||||
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
|
||||
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
|
||||
|
@ -368,6 +369,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
|||
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
|
||||
| `--base-model`, `-b` <Tag variant="new">2.1</Tag> | option | Optional name of base model to update. Can be any loadable spaCy model. |
|
||||
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
|
||||
| `--replace-components`, `-R` | flag | Replace components from the base model. |
|
||||
| `--vectors`, `-v` | option | Model to load vectors from. |
|
||||
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
|
||||
| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |
|
||||
|
@ -378,6 +380,13 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
|||
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
|
||||
| `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` |
|
||||
| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` |
|
||||
| `--width`, `-cw` <Tag variant="new">2.2.4</Tag> | option | Width of CNN layers of `Tok2Vec` component. |
|
||||
| `--conv-depth`, `-cd` <Tag variant="new">2.2.4</Tag> | option | Depth of CNN layers of `Tok2Vec` component. |
|
||||
| `--cnn-window`, `-cW` <Tag variant="new">2.2.4</Tag> | option | Window size for CNN layers of `Tok2Vec` component. |
|
||||
| `--cnn-pieces`, `-cP` <Tag variant="new">2.2.4</Tag> | option | Maxout size for CNN layers of `Tok2Vec` component. |
|
||||
| `--use-chars`, `-chr` <Tag variant="new">2.2.4</Tag> | flag | Whether to use character-based embedding of `Tok2Vec` component. |
|
||||
| `--bilstm-depth`, `-lstm` <Tag variant="new">2.2.4</Tag> | option | Depth of BiLSTM layers of `Tok2Vec` component (requires PyTorch). |
|
||||
| `--embed-rows`, `-er` <Tag variant="new">2.2.4</Tag> | option | Number of embedding rows of `Tok2Vec` component. |
|
||||
| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. |
|
||||
| `--orth-variant-level`, `-ovl` <Tag variant="new">2.2</Tag> | option | Float indicating the orthography variation for data augmentation (e.g. `0.3` for making 30% of occurrences of some tokens subject to replacement). |
|
||||
| `--gold-preproc`, `-G` | flag | Use gold preprocessing. |
|
||||
|
@ -385,6 +394,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
|||
| `--textcat-multilabel`, `-TML` <Tag variant="new">2.2</Tag> | flag | Text classification classes aren't mutually exclusive (multilabel). |
|
||||
| `--textcat-arch`, `-ta` <Tag variant="new">2.2</Tag> | option | Text classification model architecture. Defaults to `"bow"`. |
|
||||
| `--textcat-positive-label`, `-tpl` <Tag variant="new">2.2</Tag> | option | Text classification positive label for binary classes with two labels. |
|
||||
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.4</Tag> | option | Location of JSON-formatted tag map. |
|
||||
| `--verbose`, `-VV` <Tag variant="new">2.0.13</Tag> | flag | Show more detailed messages during training. |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | model, pickle | A spaCy model on each epoch. |
|
||||
|
|
|
@ -7,9 +7,10 @@ source: spacy/tokens/doc.pyx
|
|||
|
||||
A `Doc` is a sequence of [`Token`](/api/token) objects. Access sentences and
|
||||
named entities, export annotations to numpy arrays, losslessly serialize to
|
||||
compressed binary strings. The `Doc` object holds an array of [`TokenC`](/api/cython-structs#tokenc) structs.
|
||||
The Python-level `Token` and [`Span`](/api/span) objects are views of this
|
||||
array, i.e. they don't own the data themselves.
|
||||
compressed binary strings. The `Doc` object holds an array of
|
||||
[`TokenC`](/api/cython-structs#tokenc) structs. The Python-level `Token` and
|
||||
[`Span`](/api/span) objects are views of this array, i.e. they don't own the
|
||||
data themselves.
|
||||
|
||||
## Doc.\_\_init\_\_ {#init tag="method"}
|
||||
|
||||
|
@ -198,10 +199,11 @@ the character indices don't map to a valid span.
|
|||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------------------------------- | ------------------------------------------------------- |
|
||||
| ------------------------------------ | ---------------------------------------- | --------------------------------------------------------------------- |
|
||||
| `start` | int | The index of the first character of the span. |
|
||||
| `end` | int | The index of the last character after the span. |
|
||||
| `label` | uint64 / unicode | A label to attach to the Span, e.g. for named entities. |
|
||||
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
|
||||
| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
|
||||
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
|
||||
| **RETURNS** | `Span` | The newly constructed object or `None`. |
|
||||
|
||||
|
@ -655,10 +657,10 @@ The L2 norm of the document's vector representation.
|
|||
| `user_data` | - | A generic storage area, for user custom data. |
|
||||
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
|
||||
| `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
|
||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
|
||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
|
||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
|
||||
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if _any_ of the tokens has an entity tag set, even if the others are unknown. |
|
||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
|
||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
|
||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
|
||||
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. |
|
||||
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
||||
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
||||
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
||||
|
|
|
@ -83,7 +83,8 @@ Find matches in the `Doc` and add them to the `doc.ents`. Typically, this
|
|||
happens automatically after the component has been added to the pipeline using
|
||||
[`nlp.add_pipe`](/api/language#add_pipe). If the entity ruler was initialized
|
||||
with `overwrite_ents=True`, existing entities will be replaced if they overlap
|
||||
with the matches.
|
||||
with the matches. When matches overlap in a Doc, the entity ruler prioritizes longer
|
||||
patterns over shorter, and if equal the match occurring first in the Doc is chosen.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
|
|
@ -172,6 +172,28 @@ Remove a previously registered extension.
|
|||
| `name` | unicode | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
||||
|
||||
## Span.char_span {#char_span tag="method" new="2.2.4"}
|
||||
|
||||
Create a `Span` object from the slice `span.text[start:end]`. Returns `None` if
|
||||
the character indices don't map to a valid span.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> doc = nlp("I like New York")
|
||||
> span = doc[1:4].char_span(5, 13, label="GPE")
|
||||
> assert span.text == "New York"
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------------------------------- | --------------------------------------------------------------------- |
|
||||
| `start` | int | The index of the first character of the span. |
|
||||
| `end` | int | The index of the last character after the span. |
|
||||
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
|
||||
| `kb_id` | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
|
||||
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
|
||||
| **RETURNS** | `Span` | The newly constructed object or `None`. |
|
||||
|
||||
## Span.similarity {#similarity tag="method" model="vectors"}
|
||||
|
||||
Make a semantic similarity estimate. The default estimate is cosine similarity
|
||||
|
@ -294,7 +316,7 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
|
|||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------------- | ----- | ---------------------------------------------------- |
|
||||
| ---------------- | ----- | ---------------------------------------------------- |
|
||||
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
|
||||
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |
|
||||
|
||||
|
|
|
@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
|
|||
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
||||
| `lower` | int | Lowercase form of the token. |
|
||||
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
|
||||
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
||||
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
||||
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
||||
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
||||
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
|
||||
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
|
||||
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
|
||||
|
|
|
@ -237,8 +237,9 @@ If a setting is not present in the options, the default value will be used.
|
|||
> ```
|
||||
|
||||
| Name | Type | Description | Default |
|
||||
| ------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
|
||||
| ------------------------------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
|
||||
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
|
||||
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemma's in a separate row below the token texts. | `False` |
|
||||
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
|
||||
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
|
||||
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
|
||||
|
|
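For the new `add_lemma` option documented above, a hedged usage sketch (spaCy 2.2.4+; `en_core_web_sm` stands in for any installed model that assigns lemmas):

```python
# Illustrative sketch of the add_lemma displacy option (spaCy 2.2.4+).
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # any model that sets Token.lemma_
doc = nlp("This is a sentence.")
html = displacy.render(doc, style="dep", options={"add_lemma": True, "compact": True})
print(html[:80])
```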
|
@ -326,25 +326,6 @@ performed in chunks, to avoid consuming too much memory. You can set the
|
|||
| `sort` | bool | Whether to sort the entries returned by score. Defaults to `True`. |
|
||||
| **RETURNS** | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. |
|
||||
|
||||
## Vectors.from_glove {#from_glove tag="method"}
|
||||
|
||||
Load [GloVe](https://nlp.stanford.edu/projects/glove/) vectors from a directory.
|
||||
Assumes binary format, that the vocab is in a `vocab.txt`, and that vectors are
|
||||
named `vectors.{size}.[fd.bin]`, e.g. `vectors.128.f.bin` for 128d float32
|
||||
vectors, `vectors.300.d.bin` for 300d float64 (double) vectors, etc. By default
|
||||
GloVe outputs 64-bit vectors.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> vectors = Vectors()
|
||||
> vectors.from_glove("/path/to/glove_vectors")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | ---------------------------------------- |
|
||||
| `path` | unicode / `Path` | The path to load the GloVe vectors from. |
|
||||
|
||||
## Vectors.to_disk {#to_disk tag="method"}
|
||||
|
||||
Save the current state to a directory.
|
||||
|
|
|
@ -622,13 +622,13 @@ categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.
|
|||
In order to use this, you'll need training and evaluation data in the
|
||||
[JSON format](/api/annotation#json-input) spaCy expects for training.
|
||||
|
||||
You can now train the model using a corpus for your language annotated with If
|
||||
your data is in one of the supported formats, the easiest solution might be to
|
||||
use the [`spacy convert`](/api/cli#convert) command-line utility. This supports
|
||||
several popular formats, including the IOB format for named entity recognition,
|
||||
the JSONL format produced by our annotation tool [Prodigy](https://prodi.gy),
|
||||
and the [CoNLL-U](http://universaldependencies.org/docs/format.html) format used
|
||||
by the [Universal Dependencies](http://universaldependencies.org/) corpus.
|
||||
If your data is in one of the supported formats, the easiest solution might be
|
||||
to use the [`spacy convert`](/api/cli#convert) command-line utility. This
|
||||
supports several popular formats, including the IOB format for named entity
|
||||
recognition, the JSONL format produced by our annotation tool
|
||||
[Prodigy](https://prodi.gy), and the
|
||||
[CoNLL-U](http://universaldependencies.org/docs/format.html) format used by the
|
||||
[Universal Dependencies](http://universaldependencies.org/) corpus.
|
||||
|
||||
One thing to keep in mind is that spaCy expects to train its models from **whole
|
||||
documents**, not just single sentences. If your corpus only contains single
|
||||
|
|
|
@ -968,7 +968,10 @@ pattern. The entity ruler accepts two types of patterns:
|
|||
The [`EntityRuler`](/api/entityruler) is a pipeline component that's typically
|
||||
added via [`nlp.add_pipe`](/api/language#add_pipe). When the `nlp` object is
|
||||
called on a text, it will find matches in the `doc` and add them as entities to
|
||||
the `doc.ents`, using the specified pattern label as the entity label.
|
||||
the `doc.ents`, using the specified pattern label as the entity label. If any
|
||||
matches were to overlap, the pattern matching most tokens takes priority. If
|
||||
they also happen to be equally long, then the match occurring first in the Doc is
|
||||
chosen.
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
|
@ -1119,7 +1122,7 @@ entityruler = EntityRuler(nlp)
|
|||
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
|
||||
|
||||
other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
|
||||
with nlp.disable_pipes(*disable_pipes):
|
||||
with nlp.disable_pipes(*other_pipes):
|
||||
entityruler.add_patterns(patterns)
|
||||
```
|
||||
|
||||
|
|
|
@ -94,7 +94,7 @@ docs = list(doc_bin.get_docs(nlp.vocab))
|
|||
|
||||
If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as
|
||||
well, which includes the values of
|
||||
[extension attributes](/processing-pipelines#custom-components-attributes) (if
|
||||
[extension attributes](/usage/processing-pipelines#custom-components-attributes) (if
|
||||
they're serializable with msgpack).
|
||||
|
||||
<Infobox title="Important note on serializing extension attributes" variant="warning">
|
||||
|
|