Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-26 01:04:34 +03:00)

Commit 730fa493a4: Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match
.github/contributors/ilivans.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
| ------------------------------ | ------------------------ |
| Name                           | Ilia Ivanov              |
| Company name (if applicable)   | Chattermill              |
| Title or role (if applicable)  | DL Engineer              |
| Date                           | 2020-05-14               |
| GitHub username                | ilivans                  |
| Website (optional)             |                          |
.github/contributors/kevinlu1248.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

The same spaCy Contributor Agreement text as in `.github/contributors/ilivans.md`
above, with **"us"** meaning [ExplosionAI GmbH](https://explosion.ai/legal),
signed as follows:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Kevin Lu             |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  | Student              |
| Date                           |                      |
| GitHub username                | kevinlu1248          |
| Website (optional)             |                      |
.github/contributors/lfiedler.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

The same spaCy Contributor Agreement text as in `.github/contributors/ilivans.md`
above, with **"us"** meaning [ExplosionAI GmbH](https://explosion.ai/legal),
signed as follows:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Leander Fiedler      |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 06 April 2020        |
| GitHub username                | lfiedler             |
| Website (optional)             |                      |
.github/contributors/osori.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

The same spaCy Contributor Agreement text as in `.github/contributors/ilivans.md`
above, with **"us"** meaning [ExplosionAI GmbH](https://explosion.ai/legal),
signed as follows:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Ilkyu Ju             |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2020-05-17           |
| GitHub username                | osori                |
| Website (optional)             |                      |
.github/contributors/thoppe.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

The same spaCy Contributor Agreement text as in `.github/contributors/ilivans.md`
above, with **"us"** meaning [ExplosionAI GmbH](https://explosion.ai/legal),
signed as follows:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                    |
| ------------------------------ | ------------------------ |
| Name                           | Travis Hoppe             |
| Company name (if applicable)   |                          |
| Title or role (if applicable)  | Data Scientist           |
| Date                           | 07 May 2020              |
| GitHub username                | thoppe                   |
| Website (optional)             | http://thoppe.github.io/ |
.github/contributors/vishnupriyavr.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

The same spaCy Contributor Agreement text as in `.github/contributors/ilivans.md`
above, with **"us"** meaning
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal), signed as follows:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Vishnu Priya VR      |
| Company name (if applicable)   | Uniphore             |
| Title or role (if applicable)  | NLP/AI Engineer      |
| Date                           | 2020-05-03           |
| GitHub username                | vishnupriyavr        |
| Website (optional)             |                      |
```diff
@@ -1,6 +1,7 @@
 """Prevent catastrophic forgetting with rehearsal updates."""
 import plac
 import random
+import warnings
 import srsly
 import spacy
 from spacy.gold import GoldParse
@@ -66,7 +67,10 @@ def main(model_name, unlabelled_loc):
     pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
     sizes = compounding(1.0, 4.0, 1.001)
-    with nlp.disable_pipes(*other_pipes):
+    with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
+        # show warnings for misaligned entity spans once
+        warnings.filterwarnings("once", category=UserWarning, module='spacy')
+
         for itn in range(n_iter):
             random.shuffle(TRAIN_DATA)
             random.shuffle(raw_docs)
```
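The `with ... and ...` line above deserves a brief note: in Python, `with a and b:` enters only one of the two objects as a context manager (whichever the `and` expression evaluates to). The example still behaves as intended because `nlp.disable_pipes(...)` in spaCy v2.x takes effect as soon as it is called; the pipes are simply not restored automatically when the block ends. A more explicit equivalent, as a sketch:

```python
import warnings

import spacy

nlp = spacy.blank("en")
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

# Nest the two context managers instead of combining them with `and`:
# the pipes are disabled for the duration of the block and restored afterwards,
# and the warning filter is scoped to the block as well.
with nlp.disable_pipes(*other_pipes):
    with warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module="spacy")
        pass  # training loop goes here
```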
```diff
@@ -64,7 +64,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
     """Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
     The `vocab` should be the one used during creation of the KB."""
     vocab = Vocab().from_disk(vocab_path)
-    # create blank Language class with correct vocab
+    # create blank English model with correct vocab
     nlp = spacy.blank("en", vocab=vocab)
     nlp.vocab.vectors.name = "spacy_pretrained_vectors"
     print("Created blank 'en' model with vocab from '%s'" % vocab_path)
```
```diff
@@ -8,12 +8,13 @@ For more details, see the documentation:
 * NER: https://spacy.io/usage/linguistic-features#named-entities

 Compatible with: spaCy v2.0.0+
-Last tested with: v2.1.0
+Last tested with: v2.2.4
 """
 from __future__ import unicode_literals, print_function

 import plac
 import random
+import warnings
 from pathlib import Path
 import spacy
 from spacy.util import minibatch, compounding
@@ -57,7 +58,11 @@ def main(model=None, output_dir=None, n_iter=100):
     # get names of other pipes to disable them during training
     pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
-    with nlp.disable_pipes(*other_pipes):  # only train NER
+    # only train NER
+    with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
+        # show warnings for misaligned entity spans once
+        warnings.filterwarnings("once", category=UserWarning, module='spacy')
+
         # reset and initialize the weights randomly – but only if we're
         # training a new model
         if model is None:
```
```diff
@@ -24,12 +24,13 @@ For more details, see the documentation:
 * NER: https://spacy.io/usage/linguistic-features#named-entities

 Compatible with: spaCy v2.1.0+
-Last tested with: v2.1.0
+Last tested with: v2.2.4
 """
 from __future__ import unicode_literals, print_function

 import plac
 import random
+import warnings
 from pathlib import Path
 import spacy
 from spacy.util import minibatch, compounding
@@ -97,7 +98,11 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
     # get names of other pipes to disable them during training
     pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
-    with nlp.disable_pipes(*other_pipes):  # only train NER
+    # only train NER
+    with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
+        # show warnings for misaligned entity spans once
+        warnings.filterwarnings("once", category=UserWarning, module='spacy')
+
         sizes = compounding(1.0, 4.0, 1.001)
         # batch up the examples using spaCy's minibatch
         for itn in range(n_iter):
```
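The `UserWarning` being filtered in these training examples is typically the misaligned-entity warning (see the new W030 message further down). A minimal sketch (assuming spaCy v2.x) of checking an annotation's alignment up front with the helper that warning recommends:

```python
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp.make_doc("Who is Shaka Khan?")
# Character-offset annotation for "Shaka Khan"; tokens that cannot be aligned
# would show up as "-" and be ignored during training.
tags = biluo_tags_from_offsets(doc, [(7, 17, "PERSON")])
print(tags)  # ['O', 'O', 'B-PERSON', 'L-PERSON', 'O']
```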
```diff
@@ -59,7 +59,7 @@ install_requires =

 [options.extras_require]
 lookups =
-    spacy_lookups_data>=0.0.5,<0.2.0
+    spacy_lookups_data>=0.3.1,<0.4.0
 cuda =
     cupy>=5.0.0b4,<9.0.0
 cuda80 =
```
|
||||||
break
|
break
|
||||||
|
|
||||||
|
|
||||||
def link_vectors_to_models(vocab):
|
def link_vectors_to_models(vocab, skip_rank=False):
|
||||||
vectors = vocab.vectors
|
vectors = vocab.vectors
|
||||||
if vectors.name is None:
|
if vectors.name is None:
|
||||||
vectors.name = VECTORS_KEY
|
vectors.name = VECTORS_KEY
|
||||||
if vectors.data.size != 0:
|
if vectors.data.size != 0:
|
||||||
warnings.warn(Warnings.W020.format(shape=vectors.data.shape))
|
warnings.warn(Warnings.W020.format(shape=vectors.data.shape))
|
||||||
ops = Model.ops
|
ops = Model.ops
|
||||||
|
if not skip_rank:
|
||||||
for word in vocab:
|
for word in vocab:
|
||||||
if word.orth in vectors.key2row:
|
if word.orth in vectors.key2row:
|
||||||
word.rank = vectors.key2row[word.orth]
|
word.rank = vectors.key2row[word.orth]
|
||||||
|
|
|
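For context on what the rank assignment guarded by the new `skip_rank` flag does, a small sketch (assuming spaCy v2.x and a vectors-enabled package such as `en_core_web_md`; the package choice is only for illustration):

```python
import spacy

nlp = spacy.load("en_core_web_md")
word = nlp.vocab["apple"]
# key2row maps a lexeme's hash to its row in the vectors table; link_vectors_to_models
# copies that row index onto lexeme.rank so vectors can be addressed by row.
print(word.rank, nlp.vocab.vectors.key2row.get(word.orth))
```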
```diff
@@ -15,7 +15,7 @@ cdef enum attr_id_t:
     LIKE_NUM
     LIKE_EMAIL
     IS_STOP
-    IS_OOV
+    IS_OOV_DEPRECATED
     IS_BRACKET
     IS_QUOTE
     IS_LEFT_PUNCT
```
```diff
@@ -16,7 +16,7 @@ IDS = {
     "LIKE_NUM": LIKE_NUM,
     "LIKE_EMAIL": LIKE_EMAIL,
     "IS_STOP": IS_STOP,
-    "IS_OOV": IS_OOV,
+    "IS_OOV_DEPRECATED": IS_OOV_DEPRECATED,
     "IS_BRACKET": IS_BRACKET,
     "IS_QUOTE": IS_QUOTE,
     "IS_LEFT_PUNCT": IS_LEFT_PUNCT,
```
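Renaming the old flag to `IS_OOV_DEPRECATED` frees `IS_OOV` up for new semantics. A small sketch of the user-facing effect (an assumption based on the related v2.3 changes, not something shown in this diff): `is_oov` is now derived from the vectors table rather than from the stored lexeme flag.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("hello world")
# A blank pipeline has no vectors table, so every token is reported as OOV
# under the vector-based definition.
print([(token.text, token.is_oov) for token in doc])
```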
```diff
@@ -187,12 +187,17 @@ def debug_data(
         n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
         msg.warn(
             "{} words in training data without vectors ({:0.2f}%)".format(
-                n_missing_vectors,
-                n_missing_vectors / gold_train_data["n_words"],
+                n_missing_vectors, n_missing_vectors / gold_train_data["n_words"],
             ),
         )
         msg.text(
-            "10 most common words without vectors: {}".format(_format_labels(gold_train_data["words_missing_vectors"].most_common(10), counts=True)), show=verbose,
+            "10 most common words without vectors: {}".format(
+                _format_labels(
+                    gold_train_data["words_missing_vectors"].most_common(10),
+                    counts=True,
+                )
+            ),
+            show=verbose,
         )
     else:
         msg.info("No word vectors present in the model")
```
```diff
@@ -2,7 +2,6 @@
 from __future__ import unicode_literals, division, print_function

 import plac
-import spacy
 from timeit import default_timer as timer
 from wasabi import msg

@@ -45,7 +44,7 @@ def evaluate(
         msg.fail("Visualization output directory not found", displacy_path, exits=1)
     corpus = GoldCorpus(data_path, data_path)
     if model.startswith("blank:"):
-        nlp = spacy.blank(model.replace("blank:", ""))
+        nlp = util.get_lang_class(model.replace("blank:", ""))()
     else:
         nlp = util.load_model(model)
     dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
```
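Both constructions produce an empty pipeline for the given language code; a minimal sketch of the equivalence (assuming spaCy v2.x). Calling `util.get_lang_class` directly lets the CLI drop its `import spacy`, which is exactly what the hunk above does.

```python
import spacy
from spacy import util

nlp_a = spacy.blank("en")            # thin wrapper around the language class
nlp_b = util.get_lang_class("en")()  # what the evaluate command now calls directly
print(type(nlp_a) is type(nlp_b))    # True: both are blank English pipelines
```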
```diff
@@ -17,7 +17,9 @@ from wasabi import msg

 from ..vectors import Vectors
 from ..errors import Errors, Warnings
-from ..util import ensure_path, get_lang_class, OOV_RANK
+from ..util import ensure_path, get_lang_class, load_model, OOV_RANK
+from ..lookups import Lookups


 try:
     import ftfy
@@ -49,6 +51,8 @@ DEFAULT_OOV_PROB = -20
         str,
     ),
     model_name=("Optional name for the model meta", "option", "mn", str),
+    omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
+    base_model=("Base model (for languages with custom tokenizers)", "option", "b", str),
 )
 def init_model(
     lang,
@@ -61,6 +65,8 @@ def init_model(
     prune_vectors=-1,
     vectors_name=None,
     model_name=None,
+    omit_extra_lookups=False,
+    base_model=None,
 ):
     """
     Create a new model from raw data, like word frequencies, Brown clusters
@@ -92,7 +98,16 @@ def init_model(
         lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc)

     with msg.loading("Creating model..."):
-        nlp = create_model(lang, lex_attrs, name=model_name)
+        nlp = create_model(lang, lex_attrs, name=model_name, base_model=base_model)
+
+    # Create empty extra lexeme tables so the data from spacy-lookups-data
+    # isn't loaded if these features are accessed
+    if omit_extra_lookups:
+        nlp.vocab.lookups_extra = Lookups()
+        nlp.vocab.lookups_extra.add_table("lexeme_cluster")
+        nlp.vocab.lookups_extra.add_table("lexeme_prob")
+        nlp.vocab.lookups_extra.add_table("lexeme_settings")
+
     msg.good("Successfully created model")
     if vectors_loc is not None:
         add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name)
@@ -152,20 +167,23 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc):
     return lex_attrs


-def create_model(lang, lex_attrs, name=None):
-    lang_class = get_lang_class(lang)
-    nlp = lang_class()
+def create_model(lang, lex_attrs, name=None, base_model=None):
+    if base_model:
+        nlp = load_model(base_model)
+        # keep the tokenizer but remove any existing pipeline components due to
+        # potentially conflicting vectors
+        for pipe in nlp.pipe_names:
+            nlp.remove_pipe(pipe)
+    else:
+        lang_class = get_lang_class(lang)
+        nlp = lang_class()
     for lexeme in nlp.vocab:
         lexeme.rank = OOV_RANK
-    lex_added = 0
     for attrs in lex_attrs:
         if "settings" in attrs:
             continue
         lexeme = nlp.vocab[attrs["orth"]]
         lexeme.set_attrs(**attrs)
-        lexeme.is_oov = False
-        lex_added += 1
-        lex_added += 1
     if len(nlp.vocab):
         oov_prob = min(lex.prob for lex in nlp.vocab) - 1
     else:
@@ -181,7 +199,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
     if vectors_loc and vectors_loc.parts[-1].endswith(".npz"):
         nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb")))
         for lex in nlp.vocab:
-            if lex.rank:
+            if lex.rank and lex.rank != OOV_RANK:
                 nlp.vocab.vectors.add(lex.orth, row=lex.rank)
     else:
         if vectors_loc:
@@ -193,8 +211,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
             if vector_keys is not None:
                 for word in vector_keys:
                     if word not in nlp.vocab:
-                        lexeme = nlp.vocab[word]
-                        lexeme.is_oov = False
+                        nlp.vocab[word]
             if vectors_data is not None:
                 nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
     if name is None:
```
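A sketch of what the new `base_model` path buys you (the package name below is hypothetical): the base model's tokenizer and language defaults are kept, while any trained components are removed so they cannot conflict with the freshly initialized vocab and vectors, mirroring the loop added to `create_model` above.

```python
import spacy

# Hypothetical package with a custom tokenizer; stands in for whatever is passed
# via the new base_model option.
nlp = spacy.load("xx_custom_tokenizer_model")
for pipe in list(nlp.pipe_names):
    nlp.remove_pipe(pipe)
print(nlp.pipe_names)  # [] (only the tokenizer, vocab and language defaults remain)
```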
```diff
@@ -15,9 +15,9 @@ import random

 from .._ml import create_default_optimizer
 from ..util import use_gpu as set_gpu
-from ..attrs import PROB, IS_OOV, CLUSTER, LANG
 from ..gold import GoldCorpus
 from ..compat import path2str
+from ..lookups import Lookups
 from .. import util
 from .. import about

@@ -58,6 +58,7 @@ from .. import about
     textcat_arch=("Textcat model architecture", "option", "ta", str),
     textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str),
     tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
+    omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
     verbose=("Display more information for debug", "flag", "VV", bool),
     debug=("Run data diagnostics before training", "flag", "D", bool),
     # fmt: on
@@ -97,6 +98,7 @@ def train(
     textcat_arch="bow",
     textcat_positive_label=None,
     tag_map_path=None,
+    omit_extra_lookups=False,
     verbose=False,
     debug=False,
 ):
@@ -248,6 +250,14 @@ def train(
         # Update tag map with provided mapping
         nlp.vocab.morphology.tag_map.update(tag_map)

+    # Create empty extra lexeme tables so the data from spacy-lookups-data
+    # isn't loaded if these features are accessed
+    if omit_extra_lookups:
+        nlp.vocab.lookups_extra = Lookups()
+        nlp.vocab.lookups_extra.add_table("lexeme_cluster")
+        nlp.vocab.lookups_extra.add_table("lexeme_prob")
+        nlp.vocab.lookups_extra.add_table("lexeme_settings")
+
     if vectors:
         msg.text("Loading vector from model '{}'".format(vectors))
         _load_vectors(nlp, vectors)
@@ -630,15 +640,6 @@ def _create_progress_bar(total):

 def _load_vectors(nlp, vectors):
     util.load_model(vectors, vocab=nlp.vocab)
-    for lex in nlp.vocab:
-        values = {}
-        for attr, func in nlp.vocab.lex_attr_getters.items():
-            # These attrs are expected to be set by data. Others should
-            # be set by calling the language functions.
-            if attr not in (CLUSTER, PROB, IS_OOV, LANG):
-                values[lex.vocab.strings[attr]] = func(lex.orth_)
-        lex.set_attrs(**values)
-        lex.is_oov = False


 def _load_pretrained_tok2vec(nlp, loc):
```
@@ -1,12 +1,16 @@
 # coding: utf8
 from __future__ import unicode_literals


 def add_codes(err_cls):
     """Add error codes to string messages via class attribute names."""

-    class ErrorsWithCodes(object):
+    class ErrorsWithCodes(err_cls):
         def __getattribute__(self, code):
-            msg = getattr(err_cls, code)
+            msg = super(ErrorsWithCodes, self).__getattribute__(code)
+            if code.startswith("__"):  # python system attributes like __class__
+                return msg
+            else:
                 return "[{code}] {msg}".format(code=code, msg=msg)

     return ErrorsWithCodes()
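To make the add_codes change above concrete, a small self-contained sketch of how the wrapper behaves (the Demo class is made up for illustration and is not part of spaCy):

    def add_codes(err_cls):
        class ErrorsWithCodes(err_cls):
            def __getattribute__(self, code):
                msg = super(ErrorsWithCodes, self).__getattribute__(code)
                if code.startswith("__"):  # leave __class__, __dict__, ... alone
                    return msg
                return "[{code}] {msg}".format(code=code, msg=msg)
        return ErrorsWithCodes()

    @add_codes
    class Demo(object):
        E001 = "Something failed: {detail}"

    print(Demo.E001.format(detail="no parser"))  # [E001] Something failed: no parser
    print(Demo.__class__.__name__)               # ErrorsWithCodes (dunder passthrough)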
@@ -106,6 +110,11 @@ class Warnings(object):
            "in problems with the vocab further on in the pipeline.")
    W029 = ("Unable to align tokens with entities from character offsets. "
            "Discarding entity annotation for the text: {text}.")
+   W030 = ("Some entities could not be aligned in the text \"{text}\" with "
+           "entities \"{entities}\". Use "
+           "`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`"
+           " to check the alignment. Misaligned entities ('-') will be "
+           "ignored during training.")


 @add_codes
@@ -555,6 +564,9 @@ class Errors(object):
    E195 = ("Matcher can be called on {good} only, got {got}.")
    E196 = ("Refusing to write to token.is_sent_end. Sentence boundaries can "
            "only be fixed with token.is_sent_start.")
+   E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
+   E198 = ("Unable to return {n} most similar vectors for the current vectors "
+           "table, which contains {n_rows} vectors.")


 @add_codes
@@ -658,7 +658,15 @@ cdef class GoldParse:
            entdoc = None

        # avoid allocating memory if the doc does not contain any tokens
-       if self.length > 0:
+       if self.length == 0:
+           self.words = []
+           self.tags = []
+           self.heads = []
+           self.labels = []
+           self.ner = []
+           self.morphology = []
+
+       else:
            if words is None:
                words = [token.text for token in doc]
            if tags is None:
@@ -949,6 +957,12 @@ def biluo_tags_from_offsets(doc, entities, missing="O"):
                break
        else:
            biluo[token.i] = missing
+   if "-" in biluo:
+       ent_str = str(entities)
+       warnings.warn(Warnings.W030.format(
+           text=doc.text[:50] + "..." if len(doc.text) > 50 else doc.text,
+           entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str
+       ))
    return biluo

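The W030 message added above tells users to inspect the alignment themselves; a short sketch of that check (spaCy 2.x API, with made-up example text and offsets):

    import spacy
    from spacy.gold import biluo_tags_from_offsets

    nlp = spacy.blank("en")
    text = "I like London."
    print(biluo_tags_from_offsets(nlp.make_doc(text), [(7, 13, "GPE")]))
    # ['O', 'O', 'U-GPE', 'O'] -- the offsets line up with the token "London"
    print(biluo_tags_from_offsets(nlp.make_doc(text), [(2, 8, "GPE")]))
    # misaligned offsets come back as '-' and are ignored during training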
@@ -6,7 +6,7 @@ from libcpp.vector cimport vector
 from libc.stdint cimport int32_t, int64_t
 from libc.stdio cimport FILE

-from spacy.vocab cimport Vocab
+from .vocab cimport Vocab
 from .typedefs cimport hash_t

 from .structs cimport KBEntryC, AliasC

@@ -169,4 +169,3 @@ cdef class Reader:
    cdef int read_alias(self, int64_t* entry_index, float* prob) except -1

    cdef int _read(self, void* value, size_t size) except -1
-
25
spacy/kb.pyx

@@ -1,23 +1,20 @@
 # cython: infer_types=True
 # cython: profile=True
 # coding: utf8
-import warnings
-
-from spacy.errors import Errors, Warnings
-
-from pathlib import Path
 from cymem.cymem cimport Pool
 from preshed.maps cimport PreshMap

 from cpython.exc cimport PyErr_SetFromErrno

 from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek
 from libc.stdint cimport int32_t, int64_t
+from libcpp.vector cimport vector
+
+import warnings
+from os import path
+from pathlib import Path

 from .typedefs cimport hash_t

-from os import path
-from libcpp.vector cimport vector
+from .errors import Errors, Warnings


 cdef class Candidate:
@@ -448,10 +445,10 @@ cdef class KnowledgeBase:

 cdef class Writer:
    def __init__(self, object loc):
-       if path.exists(loc):
-           assert not path.isdir(loc), "%s is directory." % loc
        if isinstance(loc, Path):
            loc = bytes(loc)
+       if path.exists(loc):
+           assert not path.isdir(loc), "%s is directory." % loc
        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
        self._fp = fopen(<char*>bytes_loc, 'wb')
        if not self._fp:
@@ -493,10 +490,10 @@ cdef class Writer:

 cdef class Reader:
    def __init__(self, object loc):
-       assert path.exists(loc)
-       assert not path.isdir(loc)
        if isinstance(loc, Path):
            loc = bytes(loc)
+       assert path.exists(loc)
+       assert not path.isdir(loc)
        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
        self._fp = fopen(<char*>bytes_loc, 'rb')
        if not self._fp:

@@ -586,5 +583,3 @@ cdef class Reader:
    cdef int _read(self, void* value, size_t size) except -1:
        status = fread(value, size, 1, self._fp)
        return status
-
-
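The Writer/Reader changes above only run the path checks after a pathlib.Path has been converted to bytes; a minimal sketch of that normalization order (normalize_loc is a hypothetical helper written for illustration, not spaCy code):

    from os import path
    from pathlib import Path

    def normalize_loc(loc):
        # Convert Path objects first so the os.path checks always see str/bytes.
        if isinstance(loc, Path):
            loc = bytes(loc)
        assert path.exists(loc)
        assert not path.isdir(loc), "%s is directory." % loc
        return loc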
@@ -2,7 +2,6 @@
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS

@@ -10,19 +9,15 @@ from .morph_rules import MORPH_RULES
 from ..tag_map import TAG_MAP

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class DanishDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "da"
-   lex_attr_getters[NORM] = add_lookups(
-       Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-   )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    morph_rules = MORPH_RULES
    infixes = TOKENIZER_INFIXES
@@ -1,527 +0,0 @@
-# coding: utf8
-"""
-Special-case rules for normalizing tokens to improve the model's predictions.
-For example 'mysterium' vs 'mysterie' and similar.
-"""
-from __future__ import unicode_literals
-
-
-# Sources:
-# 1: https://dsn.dk/retskrivning/om-retskrivningsordbogen/mere-om-retskrivningsordbogen-2012/endrede-stave-og-ordformer/
-# 2: http://www.tjerry-korrektur.dk/ord-med-flere-stavemaader/
-
-_exc = {
-    # Alternative spelling
-    "a-kraft-værk": "a-kraftværk", # 1
-    "ålborg": "aalborg", # 2
-    "århus": "aarhus",
-    "accessoirer": "accessoires", # 1
[... the remainder of the roughly 500 removed Danish normalization entries, "affektert" through "øvehefte", is omitted here for length; the whole file is deleted in this commit ...]
-    "øvehefte": "øvehæfte", # 1
-}
-
-
-NORM_EXCEPTIONS = {}
-
-for string, norm in _exc.items():
-    NORM_EXCEPTIONS[string] = norm
-    NORM_EXCEPTIONS[string.title()] = norm
@@ -6,7 +6,7 @@ Source: https://forkortelse.dk/ and various others.

 from __future__ import unicode_literals

-from ...symbols import ORTH, LEMMA, NORM, TAG, PUNCT
+from ...symbols import ORTH, LEMMA, NORM


 _exc = {}

@@ -52,7 +52,7 @@ for exc_data in [
    {ORTH: "Ons.", LEMMA: "onsdag"},
    {ORTH: "Fre.", LEMMA: "fredag"},
    {ORTH: "Lør.", LEMMA: "lørdag"},
-   {ORTH: "og/eller", LEMMA: "og/eller", NORM: "og/eller", TAG: "CC"},
+   {ORTH: "og/eller", LEMMA: "og/eller", NORM: "og/eller"},
 ]:
    _exc[exc_data[ORTH]] = [exc_data]

@@ -577,7 +577,7 @@ for h in range(1, 31 + 1):
    for period in ["."]:
        _exc["%d%s" % (h, period)] = [{ORTH: "%d." % h}]

-_custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: ".", TAG: PUNCT}]}
+_custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: "."}]}
 _exc.update(_custom_base_exc)

 TOKENIZER_EXCEPTIONS = _exc
@@ -2,7 +2,6 @@
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
 from .punctuation import TOKENIZER_INFIXES
 from .tag_map import TAG_MAP

@@ -10,18 +9,14 @@ from .stop_words import STOP_WORDS
 from .syntax_iterators import SYNTAX_ITERATORS

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class GermanDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "de"
-   lex_attr_getters[NORM] = add_lookups(
-       Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
-   )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    prefixes = TOKENIZER_PREFIXES
    suffixes = TOKENIZER_SUFFIXES
@@ -1,16 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-# Here we only want to include the absolute most common words. Otherwise,
-# this list would get impossibly long for German – especially considering the
-# old vs. new spelling rules, and all possible cases.
-
-
-_exc = {"daß": "dass"}
-
-
-NORM_EXCEPTIONS = {}
-
-for string, norm in _exc.items():
-    NORM_EXCEPTIONS[string] = norm
-    NORM_EXCEPTIONS[string.title()] = norm
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """

@@ -27,13 +28,17 @@ def noun_chunks(obj):
        "og",
        "app",
    ]
-   doc = obj.doc  # Ensure works on both Doc and Span.
+   doc = doclike.doc  # Ensure works on both Doc and Span.
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    np_label = doc.vocab.strings.add("NP")
    np_deps = set(doc.vocab.strings.add(label) for label in labels)
    close_app = doc.vocab.strings.add("nk")

    rbracket = 0
-   for i, word in enumerate(obj):
+   for i, word in enumerate(doclike):
        if i < rbracket:
            continue
        if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
@@ -10,21 +10,16 @@ from .lemmatizer import GreekLemmatizer
 from .syntax_iterators import SYNTAX_ITERATORS
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
 from ...lookups import Lookups
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class GreekDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "el"
-   lex_attr_getters[NORM] = add_lookups(
-       Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-   )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    tag_map = TAG_MAP

File diff suppressed because it is too large
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
    """
    Detect base noun phrases. Works on both Doc and Span.
    """

@@ -13,34 +14,34 @@ def noun_chunks(obj):
    # obj tag corrects some DEP tagger mistakes.
    # Further improvement of the models will eliminate the need for this tag.
    labels = ["nsubj", "obj", "iobj", "appos", "ROOT", "obl"]
-   doc = obj.doc  # Ensure works on both Doc and Span.
+   doc = doclike.doc  # Ensure works on both Doc and Span.
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    nmod = doc.vocab.strings.add("nmod")
    np_label = doc.vocab.strings.add("NP")
-   seen = set()
-   for i, word in enumerate(obj):
+   prev_end = -1
+   for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
-       if word.i in seen:
+       if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
-           if any(w.i in seen for w in word.subtree):
-               continue
            flag = False
            if word.pos == NOUN:
                # check for patterns such as γραμμή παραγωγής
                for potential_nmod in word.rights:
                    if potential_nmod.dep == nmod:
-                       seen.update(
-                           j for j in range(word.left_edge.i, potential_nmod.i + 1)
-                       )
+                       prev_end = potential_nmod.i
                        yield word.left_edge.i, potential_nmod.i + 1, np_label
                        flag = True
                        break
            if flag is False:
-               seen.update(j for j in range(word.left_edge.i, word.i + 1))
+               prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            # covers the case: έχει όμορφα και έξυπνα παιδιά

@@ -49,9 +50,7 @@ def noun_chunks(obj):
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
-               if any(w.i in seen for w in word.subtree):
-                   continue
-               seen.update(j for j in range(word.left_edge.i, word.i + 1))
+               prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label
@@ -2,7 +2,6 @@
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS

@@ -10,10 +9,9 @@ from .morph_rules import MORPH_RULES
 from .syntax_iterators import SYNTAX_ITERATORS

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 def _return_en(_):

@@ -24,9 +22,6 @@ class EnglishDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = _return_en
-   lex_attr_getters[NORM] = add_lookups(
-       Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-   )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    tag_map = TAG_MAP
    stop_words = STOP_WORDS

File diff suppressed because it is too large
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """

@@ -19,21 +20,23 @@ def noun_chunks(obj):
        "attr",
        "ROOT",
    ]
-   doc = obj.doc  # Ensure works on both Doc and Span.
+   doc = doclike.doc  # Ensure works on both Doc and Span.
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
-   seen = set()
-   for i, word in enumerate(obj):
+   prev_end = -1
+   for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
-       if word.i in seen:
+       if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
-           if any(w.i in seen for w in word.subtree):
-               continue
-           seen.update(j for j in range(word.left_edge.i, word.i + 1))
+           prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            head = word.head

@@ -41,9 +44,7 @@ def noun_chunks(obj):
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
-               if any(w.i in seen for w in word.subtree):
-                   continue
-               seen.update(j for j in range(word.left_edge.i, word.i + 1))
+               prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label
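The seen-set to prev_end change above (mirrored in the other noun_chunks iterators in this commit) keeps only the end index of the last emitted chunk instead of a set of all covered tokens; a tiny standalone sketch of that bookkeeping with made-up (left_edge, word_index) pairs:

    def non_overlapping(candidates):
        prev_end = -1
        for left_edge, i in candidates:
            if left_edge <= prev_end:  # would nest inside the previous chunk
                continue
            prev_end = i
            yield left_edge, i + 1

    print(list(non_overlapping([(0, 1), (1, 1), (3, 5), (4, 5)])))
    # [(0, 2), (3, 6)] -- nested candidates are skipped, no index set needed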
@@ -77,12 +77,12 @@ for pron in ["i", "you", "he", "she", "it", "we", "they"]:

    _exc[orth + "'d"] = [
        {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
-       {ORTH: "'d", LEMMA: "would", NORM: "would", TAG: "MD"},
+       {ORTH: "'d", NORM: "'d"},
    ]

    _exc[orth + "d"] = [
        {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
-       {ORTH: "d", LEMMA: "would", NORM: "would", TAG: "MD"},
+       {ORTH: "d", NORM: "'d"},
    ]

    _exc[orth + "'d've"] = [

@@ -195,7 +195,10 @@ for word in ["who", "what", "when", "where", "why", "how", "there", "that"]:
        {ORTH: "'d", NORM: "'d"},
    ]

-   _exc[orth + "d"] = [{ORTH: orth, LEMMA: word, NORM: word}, {ORTH: "d"}]
+   _exc[orth + "d"] = [
+       {ORTH: orth, LEMMA: word, NORM: word},
+       {ORTH: "d", NORM: "'d"},
+   ]

    _exc[orth + "'d've"] = [
        {ORTH: orth, LEMMA: word, NORM: word},
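At runtime, entries like the ones above become tokenizer special cases; a hedged sketch using the public add_special_case API (spaCy 2.x; "whod" is one of the keys generated by the loop shown in the diff):

    import spacy
    from spacy.attrs import ORTH, NORM

    nlp = spacy.blank("en")
    nlp.tokenizer.add_special_case("whod", [{ORTH: "who"}, {ORTH: "d", NORM: "'d"}])
    doc = nlp("whod have thought")
    print([(t.text, t.norm_) for t in doc[:2]])  # [('who', 'who'), ('d', "'d")]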
@@ -5,7 +5,6 @@ from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES
 from ..char_classes import LIST_ICONS, CURRENCY, LIST_UNITS, PUNCT
 from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
 from ..char_classes import merge_chars
-from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES


 _list_units = [u for u in LIST_UNITS if u != "%"]
@@ -2,10 +2,15 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON, VERB, AUX
+from ...errors import Errors


-def noun_chunks(obj):
-   doc = obj.doc
+def noun_chunks(doclike):
+   doc = doclike.doc
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    if not len(doc):
        return
    np_label = doc.vocab.strings.add("NP")

@@ -16,7 +21,7 @@ def noun_chunks(obj):
    np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
    stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
    token = doc[0]
-   while token and token.i < len(doc):
+   while token and token.i < len(doclike):
        if token.pos in [PROPN, NOUN, PRON]:
            left, right = noun_bounds(
                doc, token, np_left_deps, np_right_deps, stop_deps
@@ -10,6 +10,7 @@ from .lex_attrs import LEX_ATTRS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .tag_map import TAG_MAP
 from .punctuation import TOKENIZER_SUFFIXES
+from .syntax_iterators import SYNTAX_ITERATORS


 class PersianDefaults(Language.Defaults):

@@ -24,6 +25,7 @@ class PersianDefaults(Language.Defaults):
    tag_map = TAG_MAP
    suffixes = TOKENIZER_SUFFIXES
    writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
+   syntax_iterators = SYNTAX_ITERATORS


 class Persian(Language):
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """

@@ -19,21 +20,23 @@ def noun_chunks(obj):
        "attr",
        "ROOT",
    ]
-   doc = obj.doc  # Ensure works on both Doc and Span.
+   doc = doclike.doc  # Ensure works on both Doc and Span.
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
-   seen = set()
-   for i, word in enumerate(obj):
+   prev_end = -1
+   for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
-       if word.i in seen:
+       if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
-           if any(w.i in seen for w in word.subtree):
-               continue
-           seen.update(j for j in range(word.left_edge.i, word.i + 1))
+           prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            head = word.head

@@ -41,9 +44,7 @@ def noun_chunks(obj):
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
-               if any(w.i in seen for w in word.subtree):
-                   continue
-               seen.update(j for j in range(word.left_edge.i, word.i + 1))
+               prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """

@@ -18,21 +19,23 @@ def noun_chunks(obj):
        "nmod",
        "nmod:poss",
    ]
-   doc = obj.doc  # Ensure works on both Doc and Span.
+   doc = doclike.doc  # Ensure works on both Doc and Span.
+
+   if not doc.is_parsed:
+       raise ValueError(Errors.E029)
+
    np_deps = [doc.vocab.strings[label] for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
-   seen = set()
-   for i, word in enumerate(obj):
+   prev_end = -1
+   for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
-       if word.i in seen:
+       if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
-           if any(w.i in seen for w in word.subtree):
-               continue
-           seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+           prev_end = word.right_edge.i
            yield word.left_edge.i, word.right_edge.i + 1, np_label
        elif word.dep == conj:
            head = word.head

@@ -40,9 +43,7 @@ def noun_chunks(obj):
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
-               if any(w.i in seen for w in word.subtree):
-                   continue
-               seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+               prev_end = word.right_edge.i
                yield word.left_edge.i, word.right_edge.i + 1, np_label
@@ -1,11 +1,12 @@
+# coding: utf8
+from __future__ import unicode_literals
+
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP


 from ...attrs import LANG
 from ...language import Language
-from ...tokens import Doc


 class ArmenianDefaults(Language.Defaults):
@@ -1,6 +1,6 @@
+# coding: utf8
 from __future__ import unicode_literals


 """
 Example sentences to test spaCy and its language models.

 >>> from spacy.lang.hy.examples import sentences
@@ -1,3 +1,4 @@
+# coding: utf8
 from __future__ import unicode_literals

 from ...attrs import LIKE_NUM
@@ -1,6 +1,6 @@
+# coding: utf8
 from __future__ import unicode_literals


 STOP_WORDS = set(
     """
 նա
@@ -1,7 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...symbols import POS, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN
+from ...symbols import POS, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN
 from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ

 TAG_MAP = {
@@ -716,7 +716,7 @@ TAG_MAP = {
         POS: NOUN,
         "Animacy": "Nhum",
         "Case": "Dat",
-        "Number": "Coll",
+        # "Number": "Coll",
         "Number": "Sing",
         "Person": "1",
     },
@@ -815,7 +815,7 @@ TAG_MAP = {
         "Animacy": "Nhum",
         "Case": "Nom",
         "Definite": "Def",
-        "Number": "Plur",
+        # "Number": "Plur",
         "Number": "Sing",
         "Poss": "Yes",
     },
@@ -880,7 +880,7 @@ TAG_MAP = {
         POS: NOUN,
         "Animacy": "Nhum",
         "Case": "Nom",
-        "Number": "Plur",
+        # "Number": "Plur",
         "Number": "Sing",
         "Person": "2",
     },
@@ -1223,9 +1223,9 @@ TAG_MAP = {
     "PRON_Case=Nom|Number=Sing|Number=Plur|Person=3|Person=1|PronType=Emp": {
         POS: PRON,
         "Case": "Nom",
-        "Number": "Sing",
+        # "Number": "Sing",
         "Number": "Plur",
-        "Person": "3",
+        # "Person": "3",
         "Person": "1",
         "PronType": "Emp",
     },
@@ -4,25 +4,20 @@ from __future__ import unicode_literals
 from .stop_words import STOP_WORDS
 from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_PREFIXES, TOKENIZER_INFIXES
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS
 from .syntax_iterators import SYNTAX_ITERATORS
 from .tag_map import TAG_MAP

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class IndonesianDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: "id"
     lex_attr_getters.update(LEX_ATTRS)
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-    )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
     prefixes = TOKENIZER_PREFIXES
@@ -1,532 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

# Daftar kosakata yang sering salah dieja
# https://id.wikipedia.org/wiki/Wikipedia:Daftar_kosakata_bahasa_Indonesia_yang_sering_salah_dieja
_exc = {
    # Slang and abbreviations
    "silahkan": "silakan", "yg": "yang", "kalo": "kalau", "cawu": "caturwulan", "ok": "oke",
    "gak": "tidak", "enggak": "tidak", "nggak": "tidak", "ndak": "tidak", "ngga": "tidak",
    "dgn": "dengan", "tdk": "tidak", "jg": "juga", "klo": "kalau", "denger": "dengar",
    "pinter": "pintar", "krn": "karena", "nemuin": "menemukan", "jgn": "jangan", "udah": "sudah",
    "sy": "saya", "udh": "sudah", "dapetin": "mendapatkan", "ngelakuin": "melakukan",
    "ngebuat": "membuat", "membikin": "membuat", "bikin": "buat",
    # Daftar kosakata yang sering salah dieja
    "malpraktik": "malapraktik", "malfungsi": "malafungsi", "malserap": "malaserap",
    "maladaptasi": "malaadaptasi", "malsuai": "malasuai", "maldistribusi": "maladistribusi",
    "malgizi": "malagizi", "malsikap": "malasikap", "memperhatikan": "memerhatikan",
    "akte": "akta", "cemilan": "camilan", "esei": "esai", "frase": "frasa", "kafeteria": "kafetaria",
    "ketapel": "katapel", "kenderaan": "kendaraan", "menejemen": "manajemen", "menejer": "manajer",
    "mesjid": "masjid", "rebo": "rabu", "seksama": "saksama", "senggama": "sanggama",
    "sekedar": "sekadar", "seprei": "seprai", "semedi": "semadi", "samadi": "semadi",
    "amandemen": "amendemen", "algoritma": "algoritme", "aritmatika": "aritmetika",
    "metoda": "metode", "materai": "meterai", "meterei": "meterai", "kalendar": "kalender",
    "kadaluwarsa": "kedaluwarsa", "katagori": "kategori", "parlamen": "parlemen",
    "sekular": "sekuler", "selular": "seluler", "sirkular": "sirkuler", "survai": "survei",
    "survey": "survei", "aktuil": "aktual", "formil": "formal", "trotoir": "trotoar",
    "komersiil": "komersial", "komersil": "komersial", "tradisionil": "tradisionial",
    "orisinil": "orisinal", "orijinil": "orisinal", "afdol": "afdal", "antri": "antre",
    "apotik": "apotek", "atlit": "atlet", "atmosfir": "atmosfer", "cidera": "cedera",
    "cendikiawan": "cendekiawan", "cepet": "cepat", "cinderamata": "cenderamata", "debet": "debit",
    "difinisi": "definisi", "dekrit": "dekret", "disain": "desain", "diskripsi": "deskripsi",
    "diskotik": "diskotek", "eksim": "eksem", "exim": "eksem", "faidah": "faedah",
    "ekstrim": "ekstrem", "ekstrimis": "ekstremis", "komplit": "komplet", "konkrit": "konkret",
    "kongkrit": "konkret", "kongkret": "konkret", "kridit": "kredit", "musium": "museum",
    "pinalti": "penalti", "piranti": "peranti", "pinsil": "pensil", "personil": "personel",
    "sistim": "sistem", "teoritis": "teoretis", "vidio": "video", "cengkeh": "cengkih",
    "desertasi": "disertasi", "hakekat": "hakikat", "intelejen": "intelijen", "kaedah": "kaidah",
    "kempes": "kempis", "kementrian": "kementerian", "ledeng": "leding", "nasehat": "nasihat",
    "penasehat": "penasihat", "praktek": "praktik", "praktekum": "praktikum", "resiko": "risiko",
    "retsleting": "ritsleting", "senen": "senin", "amuba": "ameba", "punggawa": "penggawa",
    "surban": "serban", "nomer": "nomor", "sorban": "serban", "bis": "bus",
    "agribisnis": "agrobisnis", "kantung": "kantong", "khutbah": "khotbah", "mandur": "mandor",
    "rubuh": "roboh", "pastur": "pastor", "supir": "sopir", "goncang": "guncang", "goa": "gua",
    "kaos": "kaus", "kokoh": "kukuh", "komulatif": "kumulatif", "kolomnis": "kolumnis",
    "korma": "kurma", "lobang": "lubang", "limo": "limusin", "limosin": "limusin",
    "mangkok": "mangkuk", "saos": "saus", "sop": "sup", "sorga": "surga", "tegor": "tegur",
    "telor": "telur", "obrak-abrik": "ubrak-abrik", "ekwivalen": "ekuivalen",
    "frekwensi": "frekuensi", "konsekwensi": "konsekuensi", "kwadran": "kuadran",
    "kwadrat": "kuadrat", "kwalifikasi": "kualifikasi", "kwalitas": "kualitas",
    "kwalitet": "kualitas", "kwalitatif": "kualitatif", "kwantitas": "kuantitas",
    "kwantitatif": "kuantitatif", "kwantum": "kuantum", "kwartal": "kuartal", "kwintal": "kuintal",
    "kwitansi": "kuitansi", "kwatir": "khawatir", "kuatir": "khawatir", "jadual": "jadwal",
    "hirarki": "hierarki", "karir": "karier", "aktip": "aktif", "daptar": "daftar",
    "efektip": "efektif", "epektif": "efektif", "epektip": "efektif", "Pebruari": "Februari",
    "pisik": "fisik", "pondasi": "fondasi", "photo": "foto", "photokopi": "fotokopi",
    "hapal": "hafal", "insap": "insaf", "insyaf": "insaf", "konperensi": "konferensi",
    "kreatip": "kreatif", "kreativ": "kreatif", "maap": "maaf", "napsu": "nafsu",
    "negatip": "negatif", "negativ": "negatif", "objektip": "objektif", "obyektip": "objektif",
    "obyektif": "objektif", "pasip": "pasif", "pasiv": "pasif", "positip": "positif",
    "positiv": "positif", "produktip": "produktif", "produktiv": "produktif", "sarap": "saraf",
    "sertipikat": "sertifikat", "subjektip": "subjektif", "subyektip": "subjektif",
    "subyektif": "subjektif", "tarip": "tarif", "transitip": "transitif", "transitiv": "transitif",
    "faham": "paham", "fikir": "pikir", "berfikir": "berpikir", "telefon": "telepon",
    "telfon": "telepon", "telpon": "telepon", "tilpon": "telepon", "nafas": "napas",
    "bernafas": "bernapas", "pernafasan": "pernapasan", "vermak": "permak", "vulpen": "pulpen",
    "aktifis": "aktivis", "konfeksi": "konveksi", "motifasi": "motivasi", "Nopember": "November",
    "propinsi": "provinsi", "babtis": "baptis", "jerembab": "jerembap", "lembab": "lembap",
    "sembab": "sembap", "saptu": "sabtu", "tekat": "tekad", "bejad": "bejat", "nekad": "nekat",
    "otoped": "otopet", "skuad": "skuat", "jenius": "genius", "marjin": "margin",
    "marjinal": "marginal", "obyek": "objek", "subyek": "subjek", "projek": "proyek",
    "azas": "asas", "ijasah": "ijazah", "jenasah": "jenazah", "plasa": "plaza", "bathin": "batin",
    "Katholik": "Katolik", "orthografi": "ortografi", "pathogen": "patogen", "theologi": "teologi",
    "ijin": "izin", "rejeki": "rezeki", "rejim": "rezim", "jaman": "zaman", "jamrud": "zamrud",
    "jinah": "zina", "perjinahan": "perzinaan", "anugrah": "anugerah",
    "cendrawasih": "cenderawasih", "jendral": "jenderal", "kripik": "keripik", "krupuk": "kerupuk",
    "ksatria": "kesatria", "mentri": "menteri", "negri": "negeri", "Prancis": "Perancis",
    "sebrang": "seberang", "menyebrang": "menyeberang", "Sumatra": "Sumatera",
    "trampil": "terampil", "isteri": "istri", "justeru": "justru", "perajurit": "prajurit",
    "putera": "putra", "puteri": "putri", "samudera": "samudra", "sastera": "sastra",
    "sutera": "sutra", "terompet": "trompet", "iklas": "ikhlas", "iktisar": "ikhtisar",
    "kafilah": "khafilah", "kawatir": "khawatir", "kotbah": "khotbah", "kusyuk": "khusyuk",
    "makluk": "makhluk", "mahluk": "makhluk", "mahkluk": "makhluk", "nahkoda": "nakhoda",
    "nakoda": "nakhoda", "tahta": "takhta", "takhyul": "takhayul", "tahyul": "takhayul",
    "tahayul": "takhayul", "akhli": "ahli", "anarkhi": "anarki", "kharisma": "karisma",
    "kharismatik": "karismatik", "mahsud": "maksud", "makhsud": "maksud", "rakhmat": "rahmat",
    "tekhnik": "teknik", "tehnik": "teknik", "tehnologi": "teknologi", "ikhwal": "ihwal",
    "expor": "ekspor", "extra": "ekstra", "komplex": "komplek", "sex": "seks", "taxi": "taksi",
    "extasi": "ekstasi", "syaraf": "saraf", "syurga": "surga", "mashur": "masyhur",
    "masyur": "masyhur", "mahsyur": "masyhur", "mashyur": "masyhur", "muadzin": "muazin",
    "adzan": "azan", "ustadz": "ustaz", "ustad": "ustaz", "ustadzah": "ustaz", "dzikir": "zikir",
    "dzuhur": "zuhur", "dhuhur": "zuhur", "zhuhur": "zuhur", "analisa": "analisis",
    "diagnosa": "diagnosis", "hipotesa": "hipotesis", "sintesa": "sintesis",
    "aktiviti": "aktivitas", "aktifitas": "aktivitas", "efektifitas": "efektivitas",
    "komuniti": "komunitas", "kreatifitas": "kreativitas", "produktifitas": "produktivitas",
    "realiti": "realitas", "realita": "realitas", "selebriti": "selebritas",
    "spotifitas": "sportivitas", "universiti": "universitas", "utiliti": "utilitas",
    "validiti": "validitas", "dilokalisir": "dilokalisasi", "didramatisir": "didramatisasi",
    "dipolitisir": "dipolitisasi", "dinetralisir": "dinetralisasi", "dikonfrontir": "dikonfrontasi",
    "mendominir": "mendominasi", "koordinir": "koordinasi", "proklamir": "proklamasi",
    "terorganisir": "terorganisasi", "terealisir": "terealisasi", "robah": "ubah",
    "dirubah": "diubah", "merubah": "mengubah", "terlanjur": "telanjur", "terlantar": "telantar",
    "penglepasan": "pelepasan", "pelihatan": "penglihatan", "pemukiman": "permukiman",
    "pengrumahan": "perumahan", "penyewaan": "persewaan", "menyintai": "mencintai",
    "menyolok": "mencolok", "contek": "sontek", "mencontek": "menyontek", "pungkir": "mungkir",
    "dipungkiri": "dimungkiri", "kupungkiri": "kumungkiri", "kaupungkiri": "kaumungkiri",
    "nampak": "tampak", "nampaknya": "tampaknya", "nongkrong": "tongkrong",
    "berternak": "beternak", "berterbangan": "beterbangan", "berserta": "beserta",
    "berperkara": "beperkara", "berpergian": "bepergian", "berkerja": "bekerja",
    "berberapa": "beberapa", "terbersit": "tebersit", "terpercaya": "tepercaya",
    "terperdaya": "teperdaya", "terpercik": "tepercik", "terpergok": "tepergok",
    "aksesoris": "aksesori", "handal": "andal", "hantar": "antar", "panutan": "anutan",
    "atsiri": "asiri", "bhakti": "bakti", "china": "cina", "dharma": "darma",
    "diktaktor": "diktator", "eksport": "ekspor", "hembus": "embus", "hadits": "hadis",
    "hadist": "hadits", "harafiah": "harfiah", "himbau": "imbau", "import": "impor",
    "inget": "ingat", "hisap": "isap", "interprestasi": "interpretasi", "kangker": "kanker",
    "konggres": "kongres", "lansekap": "lanskap", "maghrib": "magrib", "emak": "mak",
    "moderen": "modern", "pasport": "paspor", "perduli": "peduli", "ramadhan": "ramadan",
    "rapih": "rapi", "Sansekerta": "Sanskerta", "shalat": "salat", "sholat": "salat",
    "silahkan": "silakan", "standard": "standar", "hutang": "utang", "zinah": "zina",
    "ambulan": "ambulans", "antartika": "sntarktika", "arteri": "arteria", "asik": "asyik",
    "australi": "australia", "denga": "dengan", "depo": "depot", "detil": "detail",
    "ensiklopedi": "ensiklopedia", "elit": "elite", "frustasi": "frustrasi", "gladi": "geladi",
    "greget": "gereget", "itali": "italia", "karna": "karena", "klenteng": "kelenteng",
    "erling": "kerling", "kontruksi": "konstruksi", "masal": "massal", "merk": "merek",
    "respon": "respons", "diresponi": "direspons", "skak": "sekak", "stir": "setir",
    "singapur": "singapura", "standarisasi": "standardisasi", "varitas": "varietas",
    "amphibi": "amfibi", "anjlog": "anjlok", "alpukat": "avokad", "alpokat": "avokad",
    "bolpen": "pulpen", "cabe": "cabai", "cabay": "cabai", "ceret": "cerek",
    "differensial": "diferensial", "duren": "durian", "faksimili": "faksimile",
    "faksimil": "faksimile", "graha": "gerha", "goblog": "goblok", "gombrong": "gombroh",
    "horden": "gorden", "korden": "gorden", "gubug": "gubuk", "imaginasi": "imajinasi",
    "jerigen": "jeriken", "jirigen": "jeriken", "carut-marut": "karut-marut", "kwota": "kuota",
    "mahzab": "mazhab", "mempesona": "memesona", "milyar": "miliar", "missi": "misi",
    "nenas": "nanas", "negoisasi": "negosiasi", "automotif": "otomotif", "pararel": "paralel",
    "paska": "pasca", "prosen": "persen", "pete": "petai", "petay": "petai",
    "proffesor": "profesor", "rame": "ramai", "rapot": "rapor", "rileks": "relaks",
    "rileksasi": "relaksasi", "renumerasi": "remunerasi", "seketaris": "sekretaris",
    "sekertaris": "sekretaris", "sensorik": "sensoris", "sentausa": "sentosa",
    "strawberi": "stroberi", "strawbery": "stroberi", "taqwa": "takwa", "tauco": "taoco",
    "tauge": "taoge", "toge": "taoge", "tauladan": "teladan", "taubat": "tobat",
    "trilyun": "triliun", "vissi": "visi", "coklat": "cokelat", "narkotika": "narkotik",
    "oase": "oasis", "politisi": "politikus", "terong": "terung", "wool": "wol",
    "himpit": "impit", "mujizat": "mukjizat", "mujijat": "mukjizat", "yag": "yang",
}


NORM_EXCEPTIONS = {}

for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
     """
     Detect base noun phrases from a dependency parse. Works on both Doc and Span.
     """
@@ -18,21 +19,23 @@ def noun_chunks(obj):
         "nmod",
         "nmod:poss",
     ]
-    doc = obj.doc  # Ensure works on both Doc and Span.
+    doc = doclike.doc  # Ensure works on both Doc and Span.
+
+    if not doc.is_parsed:
+        raise ValueError(Errors.E029)
+
     np_deps = [doc.vocab.strings[label] for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
-    for i, word in enumerate(obj):
+    prev_end = -1
+    for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -40,9 +43,7 @@ def noun_chunks(obj):
                 head = head.head
             # If the head is an NP, and we're coordinated to it, we're an NP
             if head.dep in np_deps:
-                if any(w.i in seen for w in word.subtree):
-                    continue
-                seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+                prev_end = word.right_edge.i
                 yield word.left_edge.i, word.right_edge.i + 1, np_label
@@ -9,8 +9,8 @@ Example sentences to test spaCy and its language models.
 """

 sentences = [
-    "애플이 영국의 신생 기업을 10억 달러에 구매를 고려중이다.",
-    "자동 운전 자동차의 손해 배상 책임에 자동차 메이커에 일정한 부담을 요구하겠다.",
-    "자동 배달 로봇이 보도를 주행하는 것을 샌프란시스코시가 금지를 검토중이라고 합니다.",
+    "애플이 영국의 스타트업을 10억 달러에 인수하는 것을 알아보고 있다.",
+    "자율주행 자동차의 손해 배상 책임이 제조 업체로 옮겨 가다",
+    "샌프란시스코 시가 자동 배달 로봇의 보도 주행 금지를 검토 중이라고 합니다.",
     "런던은 영국의 수도이자 가장 큰 도시입니다.",
 ]
@@ -2,26 +2,21 @@
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .punctuation import TOKENIZER_INFIXES
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP
 from .stop_words import STOP_WORDS

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class LuxembourgishDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda text: "lb"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
-    )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
@@ -1,16 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

# TODO
# norm execptions: find a possibility to deal with the zillions of spelling
# variants (vläicht = vlaicht, vleicht, viläicht, viläischt, etc. etc.)
# here one could include the most common spelling mistakes

_exc = {"dass": "datt", "viläicht": "vläicht"}


NORM_EXCEPTIONS = {}

for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm
@@ -186,10 +186,6 @@ def suffix(string):
     return string[-3:]


-def cluster(string):
-    return 0
-
-
 def is_alpha(string):
     return string.isalpha()

@@ -218,20 +214,11 @@ def is_stop(string, stops=set()):
     return string.lower() in stops


-def is_oov(string):
-    return True
-
-
-def get_prob(string):
-    return -20.0
-
-
 LEX_ATTRS = {
     attrs.LOWER: lower,
     attrs.NORM: lower,
     attrs.PREFIX: prefix,
     attrs.SUFFIX: suffix,
-    attrs.CLUSTER: cluster,
     attrs.IS_ALPHA: is_alpha,
     attrs.IS_DIGIT: is_digit,
     attrs.IS_LOWER: is_lower,
@@ -239,8 +226,6 @@ LEX_ATTRS = {
     attrs.IS_TITLE: is_title,
     attrs.IS_UPPER: is_upper,
     attrs.IS_STOP: is_stop,
-    attrs.IS_OOV: is_oov,
-    attrs.PROB: get_prob,
     attrs.LIKE_EMAIL: like_email,
     attrs.LIKE_NUM: like_num,
     attrs.IS_PUNCT: is_punct,
@@ -55,7 +55,7 @@ _num_words = [
    "തൊണ്ണൂറ് ",
    "നുറ് ",
    "ആയിരം ",
-    "പത്തുലക്ഷം"
+    "പത്തുലക്ഷം",
 ]
@@ -3,7 +3,6 @@ from __future__ import unicode_literals


 STOP_WORDS = set(
-
     """
 അത്
 ഇത്
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
     """
     Detect base noun phrases from a dependency parse. Works on both Doc and Span.
     """
@@ -18,21 +19,23 @@ def noun_chunks(obj):
         "nmod",
         "nmod:poss",
     ]
-    doc = obj.doc  # Ensure works on both Doc and Span.
+    doc = doclike.doc  # Ensure works on both Doc and Span.
+
+    if not doc.is_parsed:
+        raise ValueError(Errors.E029)
+
     np_deps = [doc.vocab.strings[label] for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
-    for i, word in enumerate(obj):
+    prev_end = -1
+    for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -40,9 +43,7 @@ def noun_chunks(obj):
                 head = head.head
             # If the head is an NP, and we're coordinated to it, we're an NP
             if head.dep in np_deps:
-                if any(w.i in seen for w in word.subtree):
-                    continue
-                seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+                prev_end = word.right_edge.i
                 yield word.left_edge.i, word.right_edge.i + 1, np_label
@@ -1,17 +1,19 @@
 # coding: utf8
 from __future__ import unicode_literals

-from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .punctuation import TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_SUFFIXES
 from .tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
+from .lemmatizer import PolishLemmatizer

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
 from ...language import Language
 from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...util import add_lookups
+from ...lookups import Lookups


 class PolishDefaults(Language.Defaults):
@@ -21,10 +23,21 @@ class PolishDefaults(Language.Defaults):
     lex_attr_getters[NORM] = add_lookups(
         Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
     )
-    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    mod_base_exceptions = {
+        exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".")
+    }
+    tokenizer_exceptions = mod_base_exceptions
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
+    prefixes = TOKENIZER_PREFIXES
     infixes = TOKENIZER_INFIXES
+    suffixes = TOKENIZER_SUFFIXES
+
+    @classmethod
+    def create_lemmatizer(cls, nlp=None, lookups=None):
+        if lookups is None:
+            lookups = Lookups()
+        return PolishLemmatizer(lookups)


 class Polish(Language):
File diff suppressed because it is too large
106
spacy/lang/pl/lemmatizer.py
Normal file
@@ -0,0 +1,106 @@
# coding: utf-8
from __future__ import unicode_literals

from ...lemmatizer import Lemmatizer
from ...parts_of_speech import NAMES


class PolishLemmatizer(Lemmatizer):
    # This lemmatizer implements lookup lemmatization based on
    # the Morfeusz dictionary (morfeusz.sgjp.pl/en) by Institute of Computer Science PAS
    # It utilizes some prefix based improvements for
    # verb and adjectives lemmatization, as well as case-sensitive
    # lemmatization for nouns
    def __init__(self, lookups, *args, **kwargs):
        # this lemmatizer is lookup based, so it does not require an index, exceptionlist, or rules
        super(PolishLemmatizer, self).__init__(lookups)
        self.lemma_lookups = {}
        for tag in [
            "ADJ",
            "ADP",
            "ADV",
            "AUX",
            "NOUN",
            "NUM",
            "PART",
            "PRON",
            "VERB",
            "X",
        ]:
            self.lemma_lookups[tag] = self.lookups.get_table(
                "lemma_lookup_" + tag.lower(), {}
            )
        self.lemma_lookups["DET"] = self.lemma_lookups["X"]
        self.lemma_lookups["PROPN"] = self.lemma_lookups["NOUN"]

    def __call__(self, string, univ_pos, morphology=None):
        if isinstance(univ_pos, int):
            univ_pos = NAMES.get(univ_pos, "X")
        univ_pos = univ_pos.upper()

        if univ_pos == "NOUN":
            return self.lemmatize_noun(string, morphology)

        if univ_pos != "PROPN":
            string = string.lower()

        if univ_pos == "ADJ":
            return self.lemmatize_adj(string, morphology)
        elif univ_pos == "VERB":
            return self.lemmatize_verb(string, morphology)

        lemma_dict = self.lemma_lookups.get(univ_pos, {})
        return [lemma_dict.get(string, string.lower())]

    def lemmatize_adj(self, string, morphology):
        # this method utilizes different procedures for adjectives
        # with 'nie' and 'naj' prefixes
        lemma_dict = self.lemma_lookups["ADJ"]

        if string[:3] == "nie":
            search_string = string[3:]
            if search_string[:3] == "naj":
                naj_search_string = search_string[3:]
                if naj_search_string in lemma_dict:
                    return [lemma_dict[naj_search_string]]
            if search_string in lemma_dict:
                return [lemma_dict[search_string]]

        if string[:3] == "naj":
            naj_search_string = string[3:]
            if naj_search_string in lemma_dict:
                return [lemma_dict[naj_search_string]]

        return [lemma_dict.get(string, string)]

    def lemmatize_verb(self, string, morphology):
        # this method utilizes a different procedure for verbs
        # with 'nie' prefix
        lemma_dict = self.lemma_lookups["VERB"]

        if string[:3] == "nie":
            search_string = string[3:]
            if search_string in lemma_dict:
                return [lemma_dict[search_string]]

        return [lemma_dict.get(string, string)]

    def lemmatize_noun(self, string, morphology):
        # this method is case-sensitive, in order to work
        # for incorrectly tagged proper names
        lemma_dict = self.lemma_lookups["NOUN"]

        if string != string.lower():
            if string.lower() in lemma_dict:
                return [lemma_dict[string.lower()]]
            elif string in lemma_dict:
                return [lemma_dict[string]]
            return [string.lower()]

        return [lemma_dict.get(string, string)]

    def lookup(self, string, orth=None):
        return string.lower()

    def lemmatize(self, string, index, exceptions, rules):
        raise NotImplementedError
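A rough usage sketch for the lemmatizer added above, assuming a spaCy version that contains this file. The lookup-table contents below are made up purely for illustration; in practice the `lemma_lookup_*` tables are provided by the lookups data package rather than filled in by hand:

# Rough usage sketch for PolishLemmatizer; the table contents are illustrative.
from spacy.lookups import Lookups
from spacy.lang.pl.lemmatizer import PolishLemmatizer

lookups = Lookups()
lookups.add_table("lemma_lookup_adj", {"dobry": "dobry", "lepszy": "dobry"})
lookups.add_table("lemma_lookup_verb", {"robić": "robić"})

lemmatizer = PolishLemmatizer(lookups)

# the "naj" prefix is stripped before the ADJ table lookup
print(lemmatizer("najlepszy", "ADJ"))  # ['dobry']
# the "nie" prefix is stripped before the VERB table lookup
print(lemmatizer("nierobić", "VERB"))  # ['robić']

Tables that are not registered fall back to empty dicts via `get_table(name, {})`, so the lemmatizer degrades to returning the lowercased form for unknown words.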
@@ -1,23 +0,0 @@
Copyright (c) 2019, Marcin Miłkowski
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -1,22 +1,48 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ..char_classes import LIST_ELLIPSES, CONCAT_ICONS
+from ..char_classes import LIST_ELLIPSES, LIST_PUNCT, LIST_HYPHENS
+from ..char_classes import LIST_ICONS, LIST_QUOTES, CURRENCY, UNITS, PUNCT
 from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
+from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES

 _quotes = CONCAT_QUOTES.replace("'", "")

+_prefixes = _prefixes = [
+    r"(długo|krótko|jedno|dwu|trzy|cztero)-"
+] + BASE_TOKENIZER_PREFIXES
+
 _infixes = (
     LIST_ELLIPSES
-    + [CONCAT_ICONS]
+    + LIST_ICONS
+    + LIST_HYPHENS
     + [
-        r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
+        r"(?<=[0-9{al}])\.(?=[0-9{au}])".format(al=ALPHA, au=ALPHA_UPPER),
         r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
-        r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
-        r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}])[:<>=\/](?=[{a}])".format(a=ALPHA),
         r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
-        r"(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])".format(a=ALPHA, q=CONCAT_QUOTES),
+        r"(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])".format(a=ALPHA, q=_quotes),
     ]
 )

+_suffixes = (
+    ["''", "’’", r"\.", "…"]
+    + LIST_PUNCT
+    + LIST_QUOTES
+    + LIST_ICONS
+    + [
+        r"(?<=[0-9])\+",
+        r"(?<=°[FfCcKk])\.",
+        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
+        r"(?<=[0-9])(?:{u})".format(u=UNITS),
+        r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
+            al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
+        ),
+        r"(?<=[{au}])\.".format(au=ALPHA_UPPER),
+    ]
+)
+
+
+TOKENIZER_PREFIXES = _prefixes
 TOKENIZER_INFIXES = _infixes
+TOKENIZER_SUFFIXES = _suffixes
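A small sketch of how a prefix pattern like the one added above ends up being applied, using spaCy's own `compile_prefix_regex` helper; the example string is illustrative and this is only a standalone check of the regex, not of the full tokenizer pipeline:

# Sketch: compile a prefix list and test it against a hyphenated Polish word.
from spacy.util import compile_prefix_regex

prefixes = [r"(długo|krótko|jedno|dwu|trzy|cztero)-"]
prefix_re = compile_prefix_regex(prefixes)

match = prefix_re.search("krótko-terminowy")
print(match.group(0) if match else None)  # "krótko-"

In the tokenizer itself the matched prefix is split off as its own token before the remaining string is processed by the infix and suffix rules.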
@@ -1,26 +0,0 @@
# encoding: utf8
from __future__ import unicode_literals

from ._tokenizer_exceptions_list import PL_BASE_EXCEPTIONS
from ...symbols import POS, ADV, NOUN, ORTH, LEMMA, ADJ


_exc = {}

for exc_data in [
    {ORTH: "m.in.", LEMMA: "między innymi", POS: ADV},
    {ORTH: "inż.", LEMMA: "inżynier", POS: NOUN},
    {ORTH: "mgr.", LEMMA: "magister", POS: NOUN},
    {ORTH: "tzn.", LEMMA: "to znaczy", POS: ADV},
    {ORTH: "tj.", LEMMA: "to jest", POS: ADV},
    {ORTH: "tzw.", LEMMA: "tak zwany", POS: ADJ},
]:
    _exc[exc_data[ORTH]] = [exc_data]

for orth in ["w.", "r."]:
    _exc[orth] = [{ORTH: orth}]

for orth in PL_BASE_EXCEPTIONS:
    _exc[orth] = [{ORTH: orth}]

TOKENIZER_EXCEPTIONS = _exc
@@ -5,22 +5,17 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP
-from .norm_exceptions import NORM_EXCEPTIONS

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class PortugueseDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: "pt"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-    )
     lex_attr_getters.update(LEX_ATTRS)
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
@@ -1,23 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

# These exceptions are used to add NORM values based on a token's ORTH value.
# Individual languages can also add their own exceptions and overwrite them -
# for example, British vs. American spelling in English.

# Norms are only set if no alternative is provided in the tokenizer exceptions.
# Note that this does not change any other token attributes. Its main purpose
# is to normalise the word representations so that equivalent tokens receive
# similar representations. For example: $ and € are very different, but they're
# both currency symbols. By normalising currency symbols to $, all symbols are
# seen as similar, no matter how common they are in the training data.


NORM_EXCEPTIONS = {
    "R$": "$",  # Real
    "r$": "$",  # Real
    "Cz$": "$",  # Cruzado
    "cz$": "$",  # Cruzado
    "NCz$": "$",  # Cruzado Novo
    "ncz$": "$",  # Cruzado Novo
}
@@ -3,26 +3,21 @@ from __future__ import unicode_literals, print_function

 from .stop_words import STOP_WORDS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP
 from .lemmatizer import RussianLemmatizer

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
-from ...util import update_exc, add_lookups
+from ...util import update_exc
 from ...language import Language
 from ...lookups import Lookups
-from ...attrs import LANG, NORM
+from ...attrs import LANG


 class RussianDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda text: "ru"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-    )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
@@ -1,36 +0,0 @@
# coding: utf8
from __future__ import unicode_literals


_exc = {
    # Slang
    "прив": "привет", "дарова": "привет", "дак": "так", "дык": "так", "здарова": "привет",
    "пакедава": "пока", "пакедаво": "пока", "ща": "сейчас", "спс": "спасибо",
    "пжлст": "пожалуйста", "плиз": "пожалуйста", "ладненько": "ладно", "лады": "ладно",
    "лан": "ладно", "ясн": "ясно", "всм": "всмысле", "хош": "хочешь", "хаюшки": "привет",
    "оч": "очень", "че": "что", "чо": "что", "шо": "что",
}


NORM_EXCEPTIONS = {}

for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm
@@ -3,22 +3,17 @@ from __future__ import unicode_literals

 from .stop_words import STOP_WORDS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class SerbianDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda text: "sr"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-    )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
@@ -1,26 +0,0 @@
# coding: utf8
from __future__ import unicode_literals


_exc = {
    # Slang
    "ћале": "отац", "кева": "мајка", "смор": "досада", "кец": "јединица", "тебра": "брат",
    "штребер": "ученик", "факс": "факултет", "профа": "професор", "бус": "аутобус",
    "пискарало": "службеник", "бакутанер": "бака", "џибер": "простак",
}


NORM_EXCEPTIONS = {}

for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm
@@ -40,7 +40,7 @@ _num_words = [
    "miljard",
    "biljon",
    "biljard",
-    "kvadriljon"
+    "kvadriljon",
 ]
@@ -2,9 +2,10 @@
 from __future__ import unicode_literals

 from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors


-def noun_chunks(obj):
+def noun_chunks(doclike):
     """
     Detect base noun phrases from a dependency parse. Works on both Doc and Span.
     """
@@ -19,21 +20,23 @@ def noun_chunks(obj):
         "nmod",
         "nmod:poss",
     ]
-    doc = obj.doc  # Ensure works on both Doc and Span.
+    doc = doclike.doc  # Ensure works on both Doc and Span.
+
+    if not doc.is_parsed:
+        raise ValueError(Errors.E029)
+
     np_deps = [doc.vocab.strings[label] for label in labels]
     conj = doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
-    for i, word in enumerate(obj):
+    prev_end = -1
+    for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
         # Prevent nested chunks from being produced
-        if word.i in seen:
+        if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
         elif word.dep == conj:
             head = word.head
@@ -41,9 +44,7 @@ def noun_chunks(obj):
             head = head.head
         # If the head is an NP, and we're coordinated to it, we're an NP
         if head.dep in np_deps:
-            if any(w.i in seen for w in word.subtree):
-                continue
-            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
+            prev_end = word.right_edge.i
             yield word.left_edge.i, word.right_edge.i + 1, np_label
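For reference, the hunks above replace the old `seen` set with a single `prev_end` index: the iterator only remembers the right edge of the most recently yielded chunk and skips any candidate whose left edge falls inside it. A minimal standalone sketch of that check (the candidate tuples are invented for illustration, not taken from the diff):

def filter_overlapping_chunks(candidates):
    # candidates: (left_edge, right_edge) token-index pairs in document order
    prev_end = -1
    for left, right in candidates:
        if left <= prev_end:
            # nested inside the previously yielded chunk -> skip
            continue
        prev_end = right
        yield left, right

print(list(filter_overlapping_chunks([(0, 2), (1, 2), (3, 5)])))  # [(0, 2), (3, 5)]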
@@ -1,7 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA, PUNCT, TAG
+from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA

 _exc = {}

@@ -155,6 +155,6 @@ for orth in ABBREVIATIONS:
 # Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."),
 # should be tokenized as two separate tokens.
 for orth in ["i", "m"]:
-    _exc[orth + "."] = [{ORTH: orth, LEMMA: orth, NORM: orth}, {ORTH: ".", TAG: PUNCT}]
+    _exc[orth + "."] = [{ORTH: orth, LEMMA: orth, NORM: orth}, {ORTH: "."}]

 TOKENIZER_EXCEPTIONS = _exc
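To illustrate what such a special-case entry does (a hedged example; the hunk itself does not name the language, though the abbreviations look Swedish): the entry for "m." tells the tokenizer to emit the abbreviation and the trailing period as two separate tokens, and after this change the period token no longer carries a hard-coded tag.

from spacy.lang.sv import Swedish  # assumed language for this file

nlp = Swedish()
doc = nlp("Vi gick mer än 2000 m.")
print([t.text for t in doc][-2:])  # expected: ['m', '.']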
@@ -1,139 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-_exc = {
-    # Regional words normal
-    # Sri Lanka - wikipeadia
-    "இங்க": "இங்கே",
-    "வாங்க": "வாருங்கள்",
-    "ஒண்டு": "ஒன்று",
-    "கண்டு": "கன்று",
-    "கொண்டு": "கொன்று",
-    "பண்டி": "பன்றி",
-    "பச்ச": "பச்சை",
-    "அம்பது": "ஐம்பது",
-    "வெச்ச": "வைத்து",
-    "வச்ச": "வைத்து",
-    "வச்சி": "வைத்து",
-    "வாளைப்பழம்": "வாழைப்பழம்",
-    "மண்ணு": "மண்",
-    "பொன்னு": "பொன்",
-    "சாவல்": "சேவல்",
-    "அங்கால": "அங்கு ",
-    "அசுப்பு": "நடமாட்டம்",
-    "எழுவான் கரை": "எழுவான்கரை",
-    "ஓய்யாரம்": "எழில் ",
-    "ஒளும்பு": "எழும்பு",
-    "ஓர்மை": "துணிவு",
-    "கச்சை": "கோவணம்",
-    "கடப்பு": "தெருவாசல்",
-    "சுள்ளி": "காய்ந்த குச்சி",
-    "திறாவுதல்": "தடவுதல்",
-    "நாசமறுப்பு": "தொல்லை",
-    "பரிசாரி": "வைத்தியன்",
-    "பறவாதி": "பேராசைக்காரன்",
-    "பிசினி": "உலோபி ",
-    "விசர்": "பைத்தியம்",
-    "ஏனம்": "பாத்திரம்",
-    "ஏலா": "இயலாது",
-    "ஒசில்": "அழகு",
-    "ஒள்ளுப்பம்": "கொஞ்சம்",
-    # Srilankan and indian
-    "குத்துமதிப்பு": "",
-    "நூனாயம்": "நூல்நயம்",
-    "பைய": "மெதுவாக",
-    "மண்டை": "தலை",
-    "வெள்ளனே": "சீக்கிரம்",
-    "உசுப்பு": "எழுப்பு",
-    "ஆணம்": "குழம்பு",
-    "உறக்கம்": "தூக்கம்",
-    "பஸ்": "பேருந்து",
-    "களவு": "திருட்டு ",
-    # relationship
-    "புருசன்": "கணவன்",
-    "பொஞ்சாதி": "மனைவி",
-    "புள்ள": "பிள்ளை",
-    "பிள்ள": "பிள்ளை",
-    "ஆம்பிளப்புள்ள": "ஆண் பிள்ளை",
-    "பொம்பிளப்புள்ள": "பெண் பிள்ளை",
-    "அண்ணாச்சி": "அண்ணா",
-    "அக்காச்சி": "அக்கா",
-    "தங்கச்சி": "தங்கை",
-    # difference words
-    "பொடியன்": "சிறுவன்",
-    "பொட்டை": "சிறுமி",
-    "பிறகு": "பின்பு",
-    "டக்கென்டு": "விரைவாக",
-    "கெதியா": "விரைவாக",
-    "கிறுகி": "திரும்பி",
-    "போயித்து வாறன்": "போய் வருகிறேன்",
-    "வருவாங்களா": "வருவார்களா",
-    # regular spokens
-    "சொல்லு": "சொல்",
-    "கேளு": "கேள்",
-    "சொல்லுங்க": "சொல்லுங்கள்",
-    "கேளுங்க": "கேளுங்கள்",
-    "நீங்கள்": "நீ",
-    "உன்": "உன்னுடைய",
-    # Portugeese formal words
-    "அலவாங்கு": "கடப்பாரை",
-    "ஆசுப்பத்திரி": "மருத்துவமனை",
-    "உரோதை": "சில்லு",
-    "கடுதாசி": "கடிதம்",
-    "கதிரை": "நாற்காலி",
-    "குசினி": "அடுக்களை",
-    "கோப்பை": "கிண்ணம்",
-    "சப்பாத்து": "காலணி",
-    "தாச்சி": "இரும்புச் சட்டி",
-    "துவாய்": "துவாலை",
-    "தவறணை": "மதுக்கடை",
-    "பீப்பா": "மரத்தாழி",
-    "யன்னல்": "சாளரம்",
-    "வாங்கு": "மரஇருக்கை",
-    # Dutch formal words
-    "இறாக்கை": "பற்சட்டம்",
-    "இலாட்சி": "இழுப்பறை",
-    "கந்தோர்": "பணிமனை",
-    "நொத்தாரிசு": "ஆவண எழுத்துபதிவாளர்",
-    # English formal words
-    "இஞ்சினியர்": "பொறியியலாளர்",
-    "சூப்பு": "ரசம்",
-    "செக்": "காசோலை",
-    "சேட்டு": "மேற்ச்சட்டை",
-    "மார்க்கட்டு": "சந்தை",
-    "விண்ணன்": "கெட்டிக்காரன்",
-    # Arabic formal words
-    "ஈமான்": "நம்பிக்கை",
-    "சுன்னத்து": "விருத்தசேதனம்",
-    "செய்த்தான்": "பிசாசு",
-    "மவுத்து": "இறப்பு",
-    "ஹலால்": "அங்கீகரிக்கப்பட்டது",
-    "கறாம்": "நிராகரிக்கப்பட்டது",
-    # Persian, Hindustanian and hindi formal words
-    "சுமார்": "கிட்டத்தட்ட",
-    "சிப்பாய்": "போர்வீரன்",
-    "சிபார்சு": "சிபாரிசு",
-    "ஜமீன்": "பணக்காரா்",
-    "அசல்": "மெய்யான",
-    "அந்தஸ்து": "கௌரவம்",
-    "ஆஜர்": "சமா்ப்பித்தல்",
-    "உசார்": "எச்சரிக்கை",
-    "அச்சா": "நல்ல",
-    # English words used in text conversations
-    "bcoz": "ஏனெனில்",
-    "bcuz": "ஏனெனில்",
-    "fav": "விருப்பமான",
-    "morning": "காலை வணக்கம்",
-    "gdeveng": "மாலை வணக்கம்",
-    "gdnyt": "இரவு வணக்கம்",
-    "gdnit": "இரவு வணக்கம்",
-    "plz": "தயவு செய்து",
-    "pls": "தயவு செய்து",
-    "thx": "நன்றி",
-    "thanx": "நன்றி",
-}
-
-NORM_EXCEPTIONS = {}
-
-for string, norm in _exc.items():
-    NORM_EXCEPTIONS[string] = norm
@@ -4,14 +4,12 @@ from __future__ import unicode_literals
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS

-from ..norm_exceptions import BASE_NORMS
-from ...attrs import LANG, NORM
+from ...attrs import LANG
 from ...language import Language
 from ...tokens import Doc
-from ...util import DummyTokenizer, add_lookups
+from ...util import DummyTokenizer


 class ThaiTokenizer(DummyTokenizer):
@@ -37,9 +35,6 @@ class ThaiDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda _text: "th"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-    )
     tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS)
     tag_map = TAG_MAP
     stop_words = STOP_WORDS
@@ -1,113 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-
-_exc = {
-    # Conjugation and Diversion invalid to Tonal form (ผันอักษรและเสียงไม่ตรงกับรูปวรรณยุกต์)
-    "สนุ๊กเกอร์": "สนุกเกอร์",
-    "โน้ต": "โน้ต",
-    # Misspelled because of being lazy or hustle (สะกดผิดเพราะขี้เกียจพิมพ์ หรือเร่งรีบ)
-    "โทสับ": "โทรศัพท์",
-    "พุ่งนี้": "พรุ่งนี้",
-    # Strange (ให้ดูแปลกตา)
-    "ชะมะ": "ใช่ไหม",
-    "ชิมิ": "ใช่ไหม",
-    "ชะ": "ใช่ไหม",
-    "ช่ายมะ": "ใช่ไหม",
-    "ป่าว": "เปล่า",
-    "ป่ะ": "เปล่า",
-    "ปล่าว": "เปล่า",
-    "คัย": "ใคร",
-    "ไค": "ใคร",
-    "คราย": "ใคร",
-    "เตง": "ตัวเอง",
-    "ตะเอง": "ตัวเอง",
-    "รึ": "หรือ",
-    "เหรอ": "หรือ",
-    "หรา": "หรือ",
-    "หรอ": "หรือ",
-    "ชั้น": "ฉัน",
-    "ชั้ล": "ฉัน",
-    "ช้าน": "ฉัน",
-    "เทอ": "เธอ",
-    "เทอร์": "เธอ",
-    "เทอว์": "เธอ",
-    "แกร": "แก",
-    "ป๋ม": "ผม",
-    "บ่องตง": "บอกตรงๆ",
-    "ถ่ามตง": "ถามตรงๆ",
-    "ต่อมตง": "ตอบตรงๆ",
-    "เพิ่ล": "เพื่อน",
-    "จอบอ": "จอบอ",
-    "ดั้ย": "ได้",
-    "ขอบคุง": "ขอบคุณ",
-    "ยังงัย": "ยังไง",
-    "Inw": "เทพ",
-    "uou": "นอน",
-    "Lกรีeu": "เกรียน",
-    # Misspelled to express emotions (คำที่สะกดผิดเพื่อแสดงอารมณ์)
-    "เปงราย": "เป็นอะไร",
-    "เปนรัย": "เป็นอะไร",
-    "เปงรัย": "เป็นอะไร",
-    "เป็นอัลไล": "เป็นอะไร",
-    "ทามมาย": "ทำไม",
-    "ทามมัย": "ทำไม",
-    "จังรุย": "จังเลย",
-    "จังเยย": "จังเลย",
-    "จุงเบย": "จังเลย",
-    "ไม่รู้": "มะรุ",
-    "เฮ่ย": "เฮ้ย",
-    "เห้ย": "เฮ้ย",
-    "น่าร็อค": "น่ารัก",
-    "น่าร๊าก": "น่ารัก",
-    "ตั้ลล๊าก": "น่ารัก",
-    "คือร๊ะ": "คืออะไร",
-    "โอป่ะ": "โอเคหรือเปล่า",
-    "น่ามคาน": "น่ารำคาญ",
-    "น่ามสาร": "น่าสงสาร",
-    "วงวาร": "สงสาร",
-    "บับว่า": "แบบว่า",
-    "อัลไล": "อะไร",
-    "อิจ": "อิจฉา",
-    # Reduce rough words or Avoid to software filter (คำที่สะกดผิดเพื่อลดความหยาบของคำ หรืออาจใช้หลีกเลี่ยงการกรองคำหยาบของซอฟต์แวร์)
-    "กรู": "กู",
-    "กุ": "กู",
-    "กรุ": "กู",
-    "ตู": "กู",
-    "ตรู": "กู",
-    "มรึง": "มึง",
-    "เมิง": "มึง",
-    "มืง": "มึง",
-    "มุง": "มึง",
-    "สาด": "สัตว์",
-    "สัส": "สัตว์",
-    "สัก": "สัตว์",
-    "แสรด": "สัตว์",
-    "โคโตะ": "โคตร",
-    "โคด": "โคตร",
-    "โครต": "โคตร",
-    "โคตะระ": "โคตร",
-    "พ่อง": "พ่อมึง",
-    "แม่เมิง": "แม่มึง",
-    "เชี่ย": "เหี้ย",
-    # Imitate words (คำเลียนเสียง โดยส่วนใหญ่จะเพิ่มทัณฑฆาต หรือซ้ำตัวอักษร)
-    "แอร๊ยย": "อ๊าย",
-    "อร๊ายยย": "อ๊าย",
-    "มันส์": "มัน",
-    "วู๊วววววววว์": "วู้",
-    # Acronym (แบบคำย่อ)
-    "หมาลัย": "มหาวิทยาลัย",
-    "วิดวะ": "วิศวะ",
-    "สินสาด ": "ศิลปศาสตร์",
-    "สินกำ ": "ศิลปกรรมศาสตร์",
-    "เสารีย์ ": "อนุเสาวรีย์ชัยสมรภูมิ",
-    "เมกา ": "อเมริกา",
-    "มอไซค์ ": "มอเตอร์ไซค์",
-}
-
-
-NORM_EXCEPTIONS = {}
-
-for string, norm in _exc.items():
-    NORM_EXCEPTIONS[string] = norm
-    NORM_EXCEPTIONS[string.title()] = norm
@@ -38,7 +38,6 @@ TAG_MAP = {
     "NNPC": {POS: PROPN},
     "NNC": {POS: NOUN},
     "PSP": {POS: ADP},
-
     ".": {POS: PUNCT},
     ",": {POS: PUNCT},
     "-LRB-": {POS: PUNCT},
@@ -104,6 +104,23 @@ class ChineseTokenizer(DummyTokenizer):
         (words, spaces) = util.get_words_and_spaces(words, text)
         return Doc(self.vocab, words=words, spaces=spaces)

+    def pkuseg_update_user_dict(self, words, reset=False):
+        if self.pkuseg_seg:
+            if reset:
+                try:
+                    import pkuseg
+
+                    self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(None)
+                except ImportError:
+                    if self.use_pkuseg:
+                        msg = (
+                            "pkuseg not installed: unable to reset pkuseg "
+                            "user dict. Please " + _PKUSEG_INSTALL_MSG
+                        )
+                        raise ImportError(msg)
+            for word in words:
+                self.pkuseg_seg.preprocesser.insert(word.strip(), "")
+
     def _get_config(self):
         config = OrderedDict(
             (
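A rough usage sketch of the new user-dictionary hook (assumes pkuseg and its default model are installed; the words are arbitrary examples):

from spacy.lang.zh import Chinese

cfg = {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})
nlp.tokenizer.pkuseg_update_user_dict(["自然语言处理", "机器学习"])  # add user words
nlp.tokenizer.pkuseg_update_user_dict([], reset=True)  # reset the user dict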
@@ -152,21 +169,16 @@ class ChineseTokenizer(DummyTokenizer):
         return util.to_bytes(serializers, [])

     def from_bytes(self, data, **kwargs):
-        pkuseg_features_b = b""
-        pkuseg_weights_b = b""
-        pkuseg_processors_data = None
+        pkuseg_data = {"features_b": b"", "weights_b": b"", "processors_data": None}

         def deserialize_pkuseg_features(b):
-            nonlocal pkuseg_features_b
-            pkuseg_features_b = b
+            pkuseg_data["features_b"] = b

         def deserialize_pkuseg_weights(b):
-            nonlocal pkuseg_weights_b
-            pkuseg_weights_b = b
+            pkuseg_data["weights_b"] = b

         def deserialize_pkuseg_processors(b):
-            nonlocal pkuseg_processors_data
-            pkuseg_processors_data = srsly.msgpack_loads(b)
+            pkuseg_data["processors_data"] = srsly.msgpack_loads(b)

         deserializers = OrderedDict(
             (
@@ -178,13 +190,13 @@ class ChineseTokenizer(DummyTokenizer):
         )
         util.from_bytes(data, deserializers, [])

-        if pkuseg_features_b and pkuseg_weights_b:
+        if pkuseg_data["features_b"] and pkuseg_data["weights_b"]:
             with tempfile.TemporaryDirectory() as tempdir:
                 tempdir = Path(tempdir)
                 with open(tempdir / "features.pkl", "wb") as fileh:
-                    fileh.write(pkuseg_features_b)
+                    fileh.write(pkuseg_data["features_b"])
                 with open(tempdir / "weights.npz", "wb") as fileh:
-                    fileh.write(pkuseg_weights_b)
+                    fileh.write(pkuseg_data["weights_b"])
                 try:
                     import pkuseg
                 except ImportError:
@@ -193,13 +205,9 @@ class ChineseTokenizer(DummyTokenizer):
                         + _PKUSEG_INSTALL_MSG
                     )
                 self.pkuseg_seg = pkuseg.pkuseg(str(tempdir))
-            if pkuseg_processors_data:
-                (
-                    user_dict,
-                    do_process,
-                    common_words,
-                    other_words,
-                ) = pkuseg_processors_data
+            if pkuseg_data["processors_data"]:
+                processors_data = pkuseg_data["processors_data"]
+                (user_dict, do_process, common_words, other_words) = processors_data
                 self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(user_dict)
                 self.pkuseg_seg.postprocesser.do_process = do_process
                 self.pkuseg_seg.postprocesser.common_words = set(common_words)
@@ -4,10 +4,7 @@ from __future__ import absolute_import, unicode_literals
 import random
 import itertools
 import warnings
-
 from thinc.extra import load_nlp
-
-from spacy.util import minibatch
 import weakref
 import functools
 from collections import OrderedDict
@@ -28,10 +25,11 @@ from .compat import izip, basestring_, is_python2, class_types
 from .gold import GoldParse
 from .scorer import Scorer
 from ._ml import link_vectors_to_models, create_default_optimizer
-from .attrs import IS_STOP, LANG
+from .attrs import IS_STOP, LANG, NORM
 from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
 from .lang.punctuation import TOKENIZER_INFIXES
 from .lang.tokenizer_exceptions import TOKEN_MATCH, TOKEN_MATCH_WITH_AFFIXES
+from .lang.norm_exceptions import BASE_NORMS
 from .lang.tag_map import TAG_MAP
 from .tokens import Doc
 from .lang.lex_attrs import LEX_ATTRS, is_stop
@@ -77,6 +75,11 @@ class BaseDefaults(object):
             lemmatizer=lemmatizer,
             lookups=lookups,
         )
+        vocab.lex_attr_getters[NORM] = util.add_lookups(
+            vocab.lex_attr_getters.get(NORM, LEX_ATTRS[NORM]),
+            BASE_NORMS,
+            vocab.lookups.get_table("lexeme_norm"),
+        )
         for tag_str, exc in cls.morph_rules.items():
             for orth_str, attrs in exc.items():
                 vocab.morphology.add_special_case(tag_str, orth_str, attrs)
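For context, `util.add_lookups` chains extra lookup tables in front of a default attribute getter. Roughly (a simplified stand-in, not the actual spaCy implementation):

def add_lookups(default_getter, *tables):
    def get_norm(string):
        # prefer entries from the extra tables, in order
        for table in tables:
            if string in table:
                return table[string]
        # otherwise fall back to the default lexical attribute getter
        return default_getter(string)
    return get_norm

base_norms = {"dont": "do not"}
lexeme_norm = {"gonna": "going to"}
norm_getter = add_lookups(lambda s: s.lower(), base_norms, lexeme_norm)
print(norm_getter("gonna"))  # 'going to'
print(norm_getter("Hello"))  # 'hello' (default getter)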
@@ -417,7 +420,7 @@ class Language(object):

     def __call__(self, text, disable=[], component_cfg=None):
         """Apply the pipeline to some text. The text can span multiple sentences,
-        and can contain arbtrary whitespace. Alignment into the original string
+        and can contain arbitrary whitespace. Alignment into the original string
         is preserved.

         text (unicode): The text to be processed.
@@ -849,7 +852,7 @@ class Language(object):
            *[mp.Pipe(False) for _ in range(n_process)]
        )

-        batch_texts = minibatch(texts, batch_size)
+        batch_texts = util.minibatch(texts, batch_size)
         # Sender sends texts to the workers.
         # This is necessary to properly handle infinite length of texts.
         # (In this case, all data cannot be sent to the workers at once)
@@ -907,9 +910,8 @@ class Language(object):
         serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(
             p, exclude=["vocab"]
         )
-        serializers["meta.json"] = lambda p: p.open("w").write(
-            srsly.json_dumps(self.meta)
-        )
+        serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
         for name, proc in self.pipeline:
             if not hasattr(proc, "name"):
                 continue
@@ -973,7 +975,9 @@ class Language(object):
         serializers = OrderedDict()
         serializers["vocab"] = lambda: self.vocab.to_bytes()
         serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
-        serializers["meta.json"] = lambda: srsly.json_dumps(OrderedDict(sorted(self.meta.items())))
+        serializers["meta.json"] = lambda: srsly.json_dumps(
+            OrderedDict(sorted(self.meta.items()))
+        )
         for name, proc in self.pipeline:
             if name in exclude:
                 continue
@@ -1075,7 +1079,7 @@ def _fix_pretrained_vectors_name(nlp):
     else:
         raise ValueError(Errors.E092)
     if nlp.vocab.vectors.size != 0:
-        link_vectors_to_models(nlp.vocab)
+        link_vectors_to_models(nlp.vocab, skip_rank=True)
     for name, proc in nlp.pipeline:
         if not hasattr(proc, "cfg"):
             continue
@@ -6,6 +6,7 @@ from collections import OrderedDict
 from .symbols import NOUN, VERB, ADJ, PUNCT, PROPN
 from .errors import Errors
 from .lookups import Lookups
+from .parts_of_speech import NAMES as UPOS_NAMES


 class Lemmatizer(object):
@@ -43,17 +44,11 @@ class Lemmatizer(object):
         lookup_table = self.lookups.get_table("lemma_lookup", {})
         if "lemma_rules" not in self.lookups:
             return [lookup_table.get(string, string)]
-        if univ_pos in (NOUN, "NOUN", "noun"):
-            univ_pos = "noun"
-        elif univ_pos in (VERB, "VERB", "verb"):
-            univ_pos = "verb"
-        elif univ_pos in (ADJ, "ADJ", "adj"):
-            univ_pos = "adj"
-        elif univ_pos in (PUNCT, "PUNCT", "punct"):
-            univ_pos = "punct"
-        elif univ_pos in (PROPN, "PROPN"):
-            return [string]
-        else:
+        if isinstance(univ_pos, int):
+            univ_pos = UPOS_NAMES.get(univ_pos, "X")
+        univ_pos = univ_pos.lower()
+
+        if univ_pos in ("", "eol", "space"):
             return [string.lower()]
         # See Issue #435 for example of where this logic is requied.
         if self.is_base_form(univ_pos, morphology):
@@ -61,6 +56,11 @@ class Lemmatizer(object):
         index_table = self.lookups.get_table("lemma_index", {})
         exc_table = self.lookups.get_table("lemma_exc", {})
         rules_table = self.lookups.get_table("lemma_rules", {})
+        if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))):
+            if univ_pos == "propn":
+                return [string]
+            else:
+                return [string.lower()]
         lemmas = self.lemmatize(
             string,
             index_table.get(univ_pos, {}),
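A small sketch of the new normalization step for `univ_pos` (assumes spaCy v2.x, where `spacy.parts_of_speech.NAMES` maps the internal POS IDs to UPOS strings):

from spacy.parts_of_speech import NAMES as UPOS_NAMES
from spacy.symbols import VERB

univ_pos = VERB  # an integer symbol, e.g. coming from the tagger
if isinstance(univ_pos, int):
    univ_pos = UPOS_NAMES.get(univ_pos, "X")
univ_pos = univ_pos.lower()
print(univ_pos)  # expected: 'verb'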
@@ -1,8 +1,8 @@
 from .typedefs cimport attr_t, hash_t, flags_t, len_t, tag_t
 from .attrs cimport attr_id_t
-from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, CLUSTER, LANG
+from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, LANG

-from .structs cimport LexemeC, SerializedLexemeC
+from .structs cimport LexemeC
 from .strings cimport StringStore
 from .vocab cimport Vocab

@@ -24,22 +24,6 @@ cdef class Lexeme:
         self.vocab = vocab
         self.orth = lex.orth

-    @staticmethod
-    cdef inline SerializedLexemeC c_to_bytes(const LexemeC* lex) nogil:
-        cdef SerializedLexemeC lex_data
-        buff = <const unsigned char*>&lex.flags
-        end = <const unsigned char*>&lex.sentiment + sizeof(lex.sentiment)
-        for i in range(sizeof(lex_data.data)):
-            lex_data.data[i] = buff[i]
-        return lex_data
-
-    @staticmethod
-    cdef inline void c_from_bytes(LexemeC* lex, SerializedLexemeC lex_data) nogil:
-        buff = <unsigned char*>&lex.flags
-        end = <unsigned char*>&lex.sentiment + sizeof(lex.sentiment)
-        for i in range(sizeof(lex_data.data)):
-            buff[i] = lex_data.data[i]
-
     @staticmethod
     cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil:
         if name < (sizeof(flags_t) * 8):
@@ -56,8 +40,6 @@ cdef class Lexeme:
             lex.prefix = value
         elif name == SUFFIX:
             lex.suffix = value
-        elif name == CLUSTER:
-            lex.cluster = value
         elif name == LANG:
             lex.lang = value
@@ -84,8 +66,6 @@ cdef class Lexeme:
             return lex.suffix
         elif feat_name == LENGTH:
             return lex.length
-        elif feat_name == CLUSTER:
-            return lex.cluster
         elif feat_name == LANG:
             return lex.lang
         else:
@@ -17,7 +17,7 @@ from .typedefs cimport attr_t, flags_t
 from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
 from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
 from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT
-from .attrs cimport IS_CURRENCY, IS_OOV, PROB
+from .attrs cimport IS_CURRENCY

 from .attrs import intify_attrs
 from .errors import Errors, Warnings
@@ -89,11 +89,10 @@ cdef class Lexeme:
         cdef attr_id_t attr
         attrs = intify_attrs(attrs)
         for attr, value in attrs.items():
-            if attr == PROB:
-                self.c.prob = value
-            elif attr == CLUSTER:
-                self.c.cluster = int(value)
-            elif isinstance(value, int) or isinstance(value, long):
+            # skip PROB, e.g. from lexemes.jsonl
+            if isinstance(value, float):
+                continue
+            elif isinstance(value, (int, long)):
                 Lexeme.set_struct_attr(self.c, attr, value)
             else:
                 Lexeme.set_struct_attr(self.c, attr, self.vocab.strings.add(value))
@@ -137,34 +136,6 @@ cdef class Lexeme:
         xp = get_array_module(vector)
         return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))

-    def to_bytes(self):
-        lex_data = Lexeme.c_to_bytes(self.c)
-        start = <const char*>&self.c.flags
-        end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment)
-        if (end-start) != sizeof(lex_data.data):
-            raise ValueError(Errors.E072.format(length=end-start,
-                                                bad_length=sizeof(lex_data.data)))
-        byte_string = b"\0" * sizeof(lex_data.data)
-        byte_chars = <char*>byte_string
-        for i in range(sizeof(lex_data.data)):
-            byte_chars[i] = lex_data.data[i]
-        if len(byte_string) != sizeof(lex_data.data):
-            raise ValueError(Errors.E072.format(length=len(byte_string),
-                                                bad_length=sizeof(lex_data.data)))
-        return byte_string
-
-    def from_bytes(self, bytes byte_string):
-        # This method doesn't really have a use-case --- wrote it for testing.
-        # Possibly delete? It puts the Lexeme out of synch with the vocab.
-        cdef SerializedLexemeC lex_data
-        if len(byte_string) != sizeof(lex_data.data):
-            raise ValueError(Errors.E072.format(length=len(byte_string),
-                                                bad_length=sizeof(lex_data.data)))
-        for i in range(len(byte_string)):
-            lex_data.data[i] = byte_string[i]
-        Lexeme.c_from_bytes(self.c, lex_data)
-        self.orth = self.c.orth
-
     @property
     def has_vector(self):
         """RETURNS (bool): Whether a word vector is associated with the object.
@@ -208,10 +179,14 @@ cdef class Lexeme:
         """RETURNS (float): A scalar value indicating the positivity or
             negativity of the lexeme."""
         def __get__(self):
-            return self.c.sentiment
+            sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment", {})
+            return sentiment_table.get(self.c.orth, 0.0)

-        def __set__(self, float sentiment):
-            self.c.sentiment = sentiment
+        def __set__(self, float x):
+            if "lexeme_sentiment" not in self.vocab.lookups:
+                self.vocab.lookups.add_table("lexeme_sentiment")
+            sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment")
+            sentiment_table[self.c.orth] = x

     @property
     def orth_(self):
@@ -241,6 +216,10 @@ cdef class Lexeme:
             return self.c.norm

         def __set__(self, attr_t x):
+            if "lexeme_norm" not in self.vocab.lookups:
+                self.vocab.lookups.add_table("lexeme_norm")
+            norm_table = self.vocab.lookups.get_table("lexeme_norm")
+            norm_table[self.c.orth] = self.vocab.strings[x]
             self.c.norm = x

     property shape:
@@ -276,10 +255,12 @@ cdef class Lexeme:
     property cluster:
         """RETURNS (int): Brown cluster ID."""
         def __get__(self):
-            return self.c.cluster
+            cluster_table = self.vocab.load_extra_lookups("lexeme_cluster")
+            return cluster_table.get(self.c.orth, 0)

-        def __set__(self, attr_t x):
-            self.c.cluster = x
+        def __set__(self, int x):
+            cluster_table = self.vocab.load_extra_lookups("lexeme_cluster")
+            cluster_table[self.c.orth] = x

     property lang:
         """RETURNS (uint64): Language of the parent vocabulary."""
@@ -293,10 +274,14 @@ cdef class Lexeme:
         """RETURNS (float): Smoothed log probability estimate of the lexeme's
             type."""
         def __get__(self):
-            return self.c.prob
+            prob_table = self.vocab.load_extra_lookups("lexeme_prob")
+            settings_table = self.vocab.load_extra_lookups("lexeme_settings")
+            default_oov_prob = settings_table.get("oov_prob", -20.0)
+            return prob_table.get(self.c.orth, default_oov_prob)

         def __set__(self, float x):
-            self.c.prob = x
+            prob_table = self.vocab.load_extra_lookups("lexeme_prob")
+            prob_table[self.c.orth] = x

     property lower_:
         """RETURNS (unicode): Lowercase form of the word."""
@@ -314,7 +299,7 @@ cdef class Lexeme:
             return self.vocab.strings[self.c.norm]

         def __set__(self, unicode x):
-            self.c.norm = self.vocab.strings.add(x)
+            self.norm = self.vocab.strings.add(x)

     property shape_:
         """RETURNS (unicode): Transform of the word's string, to show
@@ -362,13 +347,10 @@ cdef class Lexeme:
         def __set__(self, flags_t x):
             self.c.flags = x

-    property is_oov:
+    @property
+    def is_oov(self):
         """RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
-        def __get__(self):
-            return Lexeme.c_check_flag(self.c, IS_OOV)
-
-        def __set__(self, attr_t x):
-            Lexeme.c_set_flag(self.c, IS_OOV, x)
+        return self.orth in self.vocab.vectors

     property is_stop:
         """RETURNS (bool): Whether the lexeme is a stop word."""
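In short, attributes like `norm`, `cluster`, `prob` and `sentiment` now read from and write to lookup tables on the shared vocab instead of fields on the C struct. A hedged example of the write-through behavior for the norm (assumes a v2.3-style spaCy install):

import spacy

nlp = spacy.blank("en")
lex = nlp.vocab["colour"]
lex.norm_ = "color"  # also records the value in the "lexeme_norm" table
norm_table = nlp.vocab.lookups.get_table("lexeme_norm")
print(norm_table[lex.orth])  # expected: 'color'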
@@ -124,7 +124,7 @@ class Lookups(object):
             self._tables[key].update(value)
         return self

-    def to_disk(self, path, **kwargs):
+    def to_disk(self, path, filename="lookups.bin", **kwargs):
         """Save the lookups to a directory as lookups.bin. Expects a path to a
         directory, which will be created if it doesn't exist.

@@ -136,11 +136,11 @@ class Lookups(object):
         path = ensure_path(path)
         if not path.exists():
             path.mkdir()
-        filepath = path / "lookups.bin"
+        filepath = path / filename
         with filepath.open("wb") as file_:
             file_.write(self.to_bytes())

-    def from_disk(self, path, **kwargs):
+    def from_disk(self, path, filename="lookups.bin", **kwargs):
         """Load lookups from a directory containing a lookups.bin. Will skip
         loading if the file doesn't exist.

@@ -150,7 +150,7 @@ class Lookups(object):
         DOCS: https://spacy.io/api/lookups#from_disk
         """
         path = ensure_path(path)
-        filepath = path / "lookups.bin"
+        filepath = path / filename
         if filepath.exists():
             with filepath.open("rb") as file_:
                 data = file_.read()
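A quick usage sketch of the new `filename` argument (paths and table contents are made up):

from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"went": "go"})
lookups.to_disk("/tmp/lookups_demo", filename="lemma_lookups.bin")

reloaded = Lookups()
reloaded.from_disk("/tmp/lookups_demo", filename="lemma_lookups.bin")
print(reloaded.get_table("lemma_lookup")["went"])  # expected: 'go'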
@@ -213,28 +213,28 @@ cdef class Matcher:
         else:
             yield doc

-    def __call__(self, object doc_or_span):
+    def __call__(self, object doclike):
         """Find all token sequences matching the supplied pattern.

-        doc_or_span (Doc or Span): The document to match over.
+        doclike (Doc or Span): The document to match over.
         RETURNS (list): A list of `(key, start, end)` tuples,
             describing the matches. A match tuple describes a span
             `doc[start:end]`. The `label_id` and `key` are both integers.
         """
-        if isinstance(doc_or_span, Doc):
-            doc = doc_or_span
+        if isinstance(doclike, Doc):
+            doc = doclike
             length = len(doc)
-        elif isinstance(doc_or_span, Span):
-            doc = doc_or_span.doc
-            length = doc_or_span.end - doc_or_span.start
+        elif isinstance(doclike, Span):
+            doc = doclike.doc
+            length = doclike.end - doclike.start
         else:
-            raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doc_or_span).__name__))
+            raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
         if len(set([LEMMA, POS, TAG]) & self._seen_attrs) > 0 \
                 and not doc.is_tagged:
             raise ValueError(Errors.E155.format())
         if DEP in self._seen_attrs and not doc.is_parsed:
             raise ValueError(Errors.E156.format())
-        matches = find_matches(&self.patterns[0], self.patterns.size(), doc_or_span, length,
+        matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
                                extensions=self._extensions, predicates=self._extra_predicates)
         for i, (key, start, end) in enumerate(matches):
             on_match = self._callbacks.get(key, None)
@@ -257,7 +257,7 @@ def unpickle_matcher(vocab, patterns, callbacks):
     return matcher


-cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int length, extensions=None, predicates=tuple()):
+cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()):
     """Find matches in a doc, with a compiled array of patterns. Matches are
     returned as a list of (id, start, end) tuples.

@@ -286,7 +286,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int lengt
     else:
         nr_extra_attr = 0
         extra_attr_values = <attr_t*>mem.alloc(length, sizeof(attr_t))
-    for i, token in enumerate(doc_or_span):
+    for i, token in enumerate(doclike):
         for name, index in extensions.items():
             value = token._.get(name)
             if isinstance(value, basestring):
@@ -298,7 +298,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int lengt
         for j in range(n):
             states.push_back(PatternStateC(patterns[j], i, 0))
         transition_states(states, matches, predicate_cache,
-                          doc_or_span[i], extra_attr_values, predicates)
+                          doclike[i], extra_attr_values, predicates)
         extra_attr_values += nr_extra_attr
         predicate_cache += len(predicates)
     # Handle matches that end in 0-width patterns
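The rename to `doclike` reflects that the matcher accepts both `Doc` and `Span`; a minimal hedged example (pattern and text invented):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])
doc = nlp("Hello world! Hello again.")
span = doc[0:3]       # a Span is accepted just like a Doc
print(matcher(span))  # list of (match_id, start, end) tuples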
@@ -203,7 +203,7 @@ class Pipe(object):
         serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
         serialize["vocab"] = lambda p: self.vocab.to_disk(p)
         if self.model not in (None, True, False):
-            serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
+            serialize["model"] = lambda p: self.model.to_disk(p)
         exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
         util.to_disk(path, serialize, exclude)

@@ -626,7 +626,7 @@ class Tagger(Pipe):
         serialize = OrderedDict((
             ("vocab", lambda p: self.vocab.to_disk(p)),
             ("tag_map", lambda p: srsly.write_msgpack(p, tag_map)),
-            ("model", lambda p: p.open("wb").write(self.model.to_bytes())),
+            ("model", lambda p: self.model.to_disk(p)),
             ("cfg", lambda p: srsly.write_json(p, self.cfg))
         ))
         exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
@@ -1395,7 +1395,7 @@ class EntityLinker(Pipe):
         serialize["vocab"] = lambda p: self.vocab.to_disk(p)
         serialize["kb"] = lambda p: self.kb.dump(p)
         if self.model not in (None, True, False):
-            serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
+            serialize["model"] = lambda p: self.model.to_disk(p)
         exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
         util.to_disk(path, serialize, exclude)
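These serialization hunks swap the manual `p.open("wb").write(self.model.to_bytes())` pattern for the model's own `to_disk`; from the user's side the API is unchanged, e.g. (a hedged round-trip on a blank pipeline):

import spacy

nlp = spacy.blank("en")
nlp.to_disk("/tmp/blank_en")  # components delegate to model.to_disk internally
nlp2 = spacy.load("/tmp/blank_en")
print(nlp2.pipe_names)        # [] for a blank pipeline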
@@ -23,29 +23,6 @@ cdef struct LexemeC:
     attr_t prefix
     attr_t suffix

-    attr_t cluster
-
-    float prob
-    float sentiment
-
-
-cdef struct SerializedLexemeC:
-    unsigned char[8 + 8*10 + 4 + 4] data
-    # sizeof(flags_t)  # flags
-    # + sizeof(attr_t)  # lang
-    # + sizeof(attr_t)  # id
-    # + sizeof(attr_t)  # length
-    # + sizeof(attr_t)  # orth
-    # + sizeof(attr_t)  # lower
-    # + sizeof(attr_t)  # norm
-    # + sizeof(attr_t)  # shape
-    # + sizeof(attr_t)  # prefix
-    # + sizeof(attr_t)  # suffix
-    # + sizeof(attr_t)  # cluster
-    # + sizeof(float)  # prob
-    # + sizeof(float)  # cluster
-    # + sizeof(float)  # l2_norm
-

 cdef struct SpanC:
     hash_t id
@@ -12,7 +12,7 @@ cdef enum symbol_t:
     LIKE_NUM
     LIKE_EMAIL
     IS_STOP
-    IS_OOV
+    IS_OOV_DEPRECATED
    IS_BRACKET
    IS_QUOTE
    IS_LEFT_PUNCT
@@ -17,7 +17,7 @@ IDS = {
     "LIKE_NUM": LIKE_NUM,
     "LIKE_EMAIL": LIKE_EMAIL,
     "IS_STOP": IS_STOP,
-    "IS_OOV": IS_OOV,
+    "IS_OOV_DEPRECATED": IS_OOV_DEPRECATED,
     "IS_BRACKET": IS_BRACKET,
     "IS_QUOTE": IS_QUOTE,
     "IS_LEFT_PUNCT": IS_LEFT_PUNCT,
@@ -9,7 +9,6 @@ import numpy
 cimport cython.parallel
 import numpy.random
 cimport numpy as np
-from itertools import islice
 from cpython.ref cimport PyObject, Py_XDECREF
 from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno
 from libc.math cimport exp
@@ -621,15 +620,15 @@ cdef class Parser:
             self.model, cfg = self.Model(self.moves.n_moves, **cfg)
             if sgd is None:
                 sgd = self.create_optimizer()
-            doc_sample = []
-            gold_sample = []
-            for raw_text, annots_brackets in islice(get_gold_tuples(), 1000):
+            docs = []
+            golds = []
+            for raw_text, annots_brackets in get_gold_tuples():
                 for annots, brackets in annots_brackets:
                     ids, words, tags, heads, deps, ents = annots
-                    doc_sample.append(Doc(self.vocab, words=words))
-                    gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags,
-                                                 heads=heads, deps=deps, entities=ents))
-            self.model.begin_training(doc_sample, gold_sample)
+                    docs.append(Doc(self.vocab, words=words))
+                    golds.append(GoldParse(docs[-1], words=words, tags=tags,
+                                           heads=heads, deps=deps, entities=ents))
+            self.model.begin_training(docs, golds)
             if pipeline is not None:
                 self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg)
             link_vectors_to_models(self.vocab)
@@ -88,6 +88,11 @@ def eu_tokenizer():
     return get_lang_class("eu").Defaults.create_tokenizer()


+@pytest.fixture(scope="session")
+def fa_tokenizer():
+    return get_lang_class("fa").Defaults.create_tokenizer()
+
+
 @pytest.fixture(scope="session")
 def fi_tokenizer():
     return get_lang_class("fi").Defaults.create_tokenizer()
@@ -107,6 +112,7 @@ def ga_tokenizer():
 def gu_tokenizer():
     return get_lang_class("gu").Defaults.create_tokenizer()

+
 @pytest.fixture(scope="session")
 def he_tokenizer():
     return get_lang_class("he").Defaults.create_tokenizer()
@@ -241,7 +247,9 @@ def yo_tokenizer():

 @pytest.fixture(scope="session")
 def zh_tokenizer_char():
-    return get_lang_class("zh").Defaults.create_tokenizer(config={"use_jieba": False, "use_pkuseg": False})
+    return get_lang_class("zh").Defaults.create_tokenizer(
+        config={"use_jieba": False, "use_pkuseg": False}
+    )


 @pytest.fixture(scope="session")
@@ -253,7 +261,9 @@ def zh_tokenizer_jieba():
 @pytest.fixture(scope="session")
 def zh_tokenizer_pkuseg():
     pytest.importorskip("pkuseg")
-    return get_lang_class("zh").Defaults.create_tokenizer(config={"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True})
+    return get_lang_class("zh").Defaults.create_tokenizer(
+        config={"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}
+    )


 @pytest.fixture(scope="session")
@@ -50,7 +50,9 @@ def test_create_from_words_and_text(vocab):
     assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
     assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
     assert doc.text == text
-    assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()]
+    assert [t.text for t in doc if not t.text.isspace()] == [
+        word for word in words if not word.isspace()
+    ]

     # partial whitespace in words
     words = [" ", "'", "dogs", "'", "\n\n", "run", " "]
@@ -60,7 +62,9 @@ def test_create_from_words_and_text(vocab):
     assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
     assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
     assert doc.text == text
-    assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()]
+    assert [t.text for t in doc if not t.text.isspace()] == [
+        word for word in words if not word.isspace()
+    ]

     # non-standard whitespace tokens
     words = [" ", " ", "'", "dogs", "'", "\n\n", "run"]
@@ -70,7 +74,9 @@ def test_create_from_words_and_text(vocab):
     assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
     assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
     assert doc.text == text
-    assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()]
+    assert [t.text for t in doc if not t.text.isspace()] == [
+        word for word in words if not word.isspace()
+    ]

     # mismatch between words and text
     with pytest.raises(ValueError):
@@ -181,6 +181,7 @@ def test_is_sent_start(en_tokenizer):
     doc.is_parsed = True
     assert len(list(doc.sents)) == 2

+
 def test_is_sent_end(en_tokenizer):
     doc = en_tokenizer("This is a sentence. This is another.")
     assert doc[4].is_sent_end is None
@@ -213,6 +214,7 @@ def test_token0_has_sent_start_true():
     assert doc[1].is_sent_start is None
     assert not doc.is_sentenced

+
 def test_tokenlast_has_sent_end_true():
     doc = Doc(Vocab(), words=["hello", "world"])
     assert doc[0].is_sent_end is None
@@ -37,14 +37,6 @@ def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
     assert tokens[7].text == "."


-@pytest.mark.parametrize(
-    "text,norm", [("akvarium", "akvarie"), ("bedstemoder", "bedstemor")]
-)
-def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
-    tokens = da_tokenizer(text)
-    assert tokens[0].norm_ == norm
-
-
 @pytest.mark.parametrize(
     "text,n_tokens",
     [
@@ -22,17 +22,3 @@ def test_de_tokenizer_handles_exc_in_text(de_tokenizer):
     assert len(tokens) == 6
     assert tokens[2].text == "z.Zt."
     assert tokens[2].lemma_ == "zur Zeit"
-
-
-@pytest.mark.parametrize(
-    "text,norms", [("vor'm", ["vor", "dem"]), ("du's", ["du", "es"])]
-)
-def test_de_tokenizer_norm_exceptions(de_tokenizer, text, norms):
-    tokens = de_tokenizer(text)
-    assert [token.norm_ for token in tokens] == norms
-
-
-@pytest.mark.parametrize("text,norm", [("daß", "dass")])
-def test_de_lex_attrs_norm_exceptions(de_tokenizer, text, norm):
-    tokens = de_tokenizer(text)
-    assert tokens[0].norm_ == norm
16
spacy/tests/lang/de/test_noun_chunks.py
Normal file
@@ -0,0 +1,16 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_de(de_tokenizer):
+    """Test that noun_chunks raises Value Error for 'de' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = de_tokenizer("Er lag auf seinem")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
16
spacy/tests/lang/el/test_noun_chunks.py
Normal file
@@ -0,0 +1,16 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_el(el_tokenizer):
+    """Test that noun_chunks raises Value Error for 'el' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = el_tokenizer("είναι χώρα της νοτιοανατολικής")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
|
@ -118,6 +118,7 @@ def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms):
|
||||||
assert [token.norm_ for token in tokens] == norms
|
assert [token.norm_ for token in tokens] == norms
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip
|
||||||
@pytest.mark.parametrize(
|
@pytest.mark.parametrize(
|
||||||
"text,norm", [("radicalised", "radicalized"), ("cuz", "because")]
|
"text,norm", [("radicalised", "radicalized"), ("cuz", "because")]
|
||||||
)
|
)
|
||||||
|
|
|
@@ -6,9 +6,24 @@ from spacy.attrs import HEAD, DEP
 from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root
 from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS
 
+import pytest
+
+
 from ...util import get_doc
 
 
+def test_noun_chunks_is_parsed(en_tokenizer):
+    """Test that noun_chunks raises Value Error for 'en' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = en_tokenizer("This is a sentence")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
+
+
 def test_en_noun_chunks_not_nested(en_vocab):
     words = ["Peter", "has", "chronic", "command", "and", "control", "issues"]
     heads = [1, 0, 4, 3, -1, -2, -5]
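The guard that the new test above exercises can also be observed outside the test suite. A minimal sketch, assuming spaCy v2.x behaviour where Doc.noun_chunks raises ValueError when no dependency parse is present:

import pytest
import spacy


def test_noun_chunks_need_a_parse():
    nlp = spacy.blank("en")          # tokenizer only, no parser in the pipeline
    doc = nlp("This is a sentence")  # so the Doc never gets a dependency parse
    with pytest.raises(ValueError):
        list(doc.noun_chunks)        # noun_chunks requires the parse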
16 spacy/tests/lang/es/test_noun_chunks.py Normal file
@@ -0,0 +1,16 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_es(es_tokenizer):
+    """Test that noun_chunks raises Value Error for 'es' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = es_tokenizer("en Oxford este verano")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
0 spacy/tests/lang/fa/__init__.py Normal file
17 spacy/tests/lang/fa/test_noun_chunks.py Normal file
@@ -0,0 +1,17 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_fa(fa_tokenizer):
+    """Test that noun_chunks raises Value Error for 'fa' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+
+    doc = fa_tokenizer("این یک جمله نمونه می باشد.")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
16 spacy/tests/lang/fr/test_noun_chunks.py Normal file
@@ -0,0 +1,16 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_fr(fr_tokenizer):
+    """Test that noun_chunks raises Value Error for 'fr' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = fr_tokenizer("trouver des travaux antérieurs")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
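The five new files above are essentially the same test repeated per language. Not part of this diff, but as an illustration of the shared pattern, here is a hypothetical consolidated version that parametrizes over the language-specific tokenizer fixtures from spacy/tests/conftest.py, reusing the sample texts from the files above:

import pytest

# (fixture name, sample text) pairs taken from the per-language tests above
SAMPLES = [
    ("de_tokenizer", "Er lag auf seinem"),
    ("el_tokenizer", "είναι χώρα της νοτιοανατολικής"),
    ("es_tokenizer", "en Oxford este verano"),
    ("fa_tokenizer", "این یک جمله نمونه می باشد."),
    ("fr_tokenizer", "trouver des travaux antérieurs"),
]


@pytest.mark.parametrize("fixture_name,text", SAMPLES)
def test_noun_chunks_require_parse(request, fixture_name, text):
    # Resolve the language-specific tokenizer fixture by name.
    tokenizer = request.getfixturevalue(fixture_name)
    doc = tokenizer(text)
    doc.is_parsed = False  # force an unparsed Doc, as in the tests above
    with pytest.raises(ValueError):
        list(doc.noun_chunks)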
@@ -3,17 +3,16 @@ from __future__ import unicode_literals
 
 import pytest
 
 
 def test_gu_tokenizer_handlers_long_text(gu_tokenizer):
     text = """પશ્ચિમ ભારતમાં આવેલું ગુજરાત રાજ્ય જે વ્યક્તિઓની માતૃભૂમિ છે"""
     tokens = gu_tokenizer(text)
     assert len(tokens) == 9
 
 
 @pytest.mark.parametrize(
     "text,length",
-    [
-        ("ગુજરાતીઓ ખાવાના શોખીન માનવામાં આવે છે", 6),
-        ("ખેતરની ખેડ કરવામાં આવે છે.", 5),
-    ],
+    [("ગુજરાતીઓ ખાવાના શોખીન માનવામાં આવે છે", 6), ("ખેતરની ખેડ કરવામાં આવે છે.", 5)],
 )
 def test_gu_tokenizer_handles_cnts(gu_tokenizer, text, length):
     tokens = gu_tokenizer(text)
Some files were not shown because too many files have changed in this diff.