Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-12 02:06:31 +03:00)

Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match

This commit is contained in commit 730fa493a4.
106  .github/contributors/ilivans.md  vendored  Normal file
@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
| ------------------------------ | ------------------------ |
| Name                           | Ilia Ivanov              |
| Company name (if applicable)   | Chattermill              |
| Title or role (if applicable)  | DL Engineer              |
| Date                           | 2020-05-14               |
| GitHub username                | ilivans                  |
| Website (optional)             |                          |
106  .github/contributors/kevinlu1248.md  vendored  Normal file

@@ -0,0 +1,106 @@

# spaCy contributor agreement

(Same contributor agreement text as above, with **"us"** defined as
[ExplosionAI GmbH](https://explosion.ai/legal).)

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Kevin Lu             |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  | Student              |
| Date                           |                      |
| GitHub username                | kevinlu1248          |
| Website (optional)             |                      |
106  .github/contributors/lfiedler.md  vendored  Normal file

@@ -0,0 +1,106 @@

# spaCy contributor agreement

(Same contributor agreement text as above, with **"us"** defined as
[ExplosionAI GmbH](https://explosion.ai/legal).)

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Leander Fiedler      |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 06 April 2020        |
| GitHub username                | lfiedler             |
| Website (optional)             |                      |
106  .github/contributors/osori.md  vendored  Normal file

@@ -0,0 +1,106 @@

# spaCy contributor agreement

(Same contributor agreement text as above, with **"us"** defined as
[ExplosionAI GmbH](https://explosion.ai/legal).)

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Ilkyu Ju             |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2020-05-17           |
| GitHub username                | osori                |
| Website (optional)             |                      |
106  .github/contributors/thoppe.md  vendored  Normal file

@@ -0,0 +1,106 @@

# spaCy contributor agreement

(Same contributor agreement text as above, with **"us"** defined as
[ExplosionAI GmbH](https://explosion.ai/legal).)

## Contributor Details

| Field                          | Entry                    |
| ------------------------------ | ------------------------ |
| Name                           | Travis Hoppe             |
| Company name (if applicable)   |                          |
| Title or role (if applicable)  | Data Scientist           |
| Date                           | 07 May 2020              |
| GitHub username                | thoppe                   |
| Website (optional)             | http://thoppe.github.io/ |
106  .github/contributors/vishnupriyavr.md  vendored  Normal file

@@ -0,0 +1,106 @@

# spaCy contributor agreement

(Same contributor agreement text as above, with **"us"** defined as
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal).)

## Contributor Details

| Field                          | Entry                    |
| ------------------------------ | ------------------------ |
| Name                           | Vishnu Priya VR          |
| Company name (if applicable)   | Uniphore                 |
| Title or role (if applicable)  | NLP/AI Engineer          |
| Date                           | 2020-05-03               |
| GitHub username                | vishnupriyavr            |
| Website (optional)             |                          |
@@ -1,6 +1,7 @@
"""Prevent catastrophic forgetting with rehearsal updates."""
import plac
import random
import warnings
import srsly
import spacy
from spacy.gold import GoldParse

@@ -66,7 +67,10 @@ def main(model_name, unlabelled_loc):
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    sizes = compounding(1.0, 4.0, 1.001)
    with nlp.disable_pipes(*other_pipes):
    with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            random.shuffle(raw_docs)
@@ -64,7 +64,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
    """Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
    The `vocab` should be the one used during creation of the KB."""
    vocab = Vocab().from_disk(vocab_path)
    # create blank Language class with correct vocab
    # create blank English model with correct vocab
    nlp = spacy.blank("en", vocab=vocab)
    nlp.vocab.vectors.name = "spacy_pretrained_vectors"
    print("Created blank 'en' model with vocab from '%s'" % vocab_path)
@@ -8,12 +8,13 @@ For more details, see the documentation:
* NER: https://spacy.io/usage/linguistic-features#named-entities

Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
Last tested with: v2.2.4
"""
from __future__ import unicode_literals, print_function

import plac
import random
import warnings
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

@@ -57,7 +58,11 @@ def main(model=None, output_dir=None, n_iter=100):
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train NER
    # only train NER
    with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
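The example-script hunks above (and the new-entity-type example below) all make the same change: the training loop is wrapped in `warnings.catch_warnings()` with `warnings.filterwarnings("once", ...)`, so spaCy's misaligned-entity warnings show up once instead of on every example. A minimal, self-contained sketch of the pattern, written with the two context managers nested explicitly and using made-up placeholder training data rather than the examples' real `TRAIN_DATA`:

```python
import random
import warnings

import spacy

# Hypothetical stand-in data; the real scripts define their own TRAIN_DATA.
TRAIN_DATA = [("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]})]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("PERSON")

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
optimizer = nlp.begin_training()

with nlp.disable_pipes(*other_pipes):  # only train NER
    with warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module="spacy")
        for itn in range(10):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update([text], [annotations], sgd=optimizer, losses=losses)
```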
@@ -24,12 +24,13 @@ For more details, see the documentation:
* NER: https://spacy.io/usage/linguistic-features#named-entities

Compatible with: spaCy v2.1.0+
Last tested with: v2.1.0
Last tested with: v2.2.4
"""
from __future__ import unicode_literals, print_function

import plac
import random
import warnings
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

@@ -97,7 +98,11 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train NER
    # only train NER
    with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
@@ -59,7 +59,7 @@ install_requires =

[options.extras_require]
lookups =
    spacy_lookups_data>=0.0.5,<0.2.0
    spacy_lookups_data>=0.3.1,<0.4.0
cuda =
    cupy>=5.0.0b4,<9.0.0
cuda80 =
13  spacy/_ml.py

@@ -279,18 +279,19 @@ class PrecomputableAffine(Model):
            break


def link_vectors_to_models(vocab):
def link_vectors_to_models(vocab, skip_rank=False):
    vectors = vocab.vectors
    if vectors.name is None:
        vectors.name = VECTORS_KEY
        if vectors.data.size != 0:
            warnings.warn(Warnings.W020.format(shape=vectors.data.shape))
    ops = Model.ops
    for word in vocab:
        if word.orth in vectors.key2row:
            word.rank = vectors.key2row[word.orth]
        else:
            word.rank = util.OOV_RANK
    if not skip_rank:
        for word in vocab:
            if word.orth in vectors.key2row:
                word.rank = vectors.key2row[word.orth]
            else:
                word.rank = util.OOV_RANK
    data = ops.asarray(vectors.data)
    # Set an entry here, so that vectors are accessed by StaticVectors
    # (unideal, I know)
@@ -15,7 +15,7 @@ cdef enum attr_id_t:
    LIKE_NUM
    LIKE_EMAIL
    IS_STOP
    IS_OOV
    IS_OOV_DEPRECATED
    IS_BRACKET
    IS_QUOTE
    IS_LEFT_PUNCT

@@ -16,7 +16,7 @@ IDS = {
    "LIKE_NUM": LIKE_NUM,
    "LIKE_EMAIL": LIKE_EMAIL,
    "IS_STOP": IS_STOP,
    "IS_OOV": IS_OOV,
    "IS_OOV_DEPRECATED": IS_OOV_DEPRECATED,
    "IS_BRACKET": IS_BRACKET,
    "IS_QUOTE": IS_QUOTE,
    "IS_LEFT_PUNCT": IS_LEFT_PUNCT,
@@ -187,12 +187,17 @@ def debug_data(
        n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
        msg.warn(
            "{} words in training data without vectors ({:0.2f}%)".format(
                n_missing_vectors,
                n_missing_vectors / gold_train_data["n_words"],
                n_missing_vectors, n_missing_vectors / gold_train_data["n_words"],
            ),
        )
        msg.text(
            "10 most common words without vectors: {}".format(_format_labels(gold_train_data["words_missing_vectors"].most_common(10), counts=True)), show=verbose,
            "10 most common words without vectors: {}".format(
                _format_labels(
                    gold_train_data["words_missing_vectors"].most_common(10),
                    counts=True,
                )
            ),
            show=verbose,
        )
    else:
        msg.info("No word vectors present in the model")
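For context, the counter being reformatted here (`words_missing_vectors`) tallies training words that have no entry in the model's vectors table. A rough, illustrative sketch of how such a tally could be produced from training `Doc` objects; the helper name and data flow are assumptions, not spaCy's internals:

```python
from collections import Counter

def count_words_missing_vectors(docs):
    """Tally tokens whose text has no vector in the pipeline's vocab."""
    missing = Counter()
    for doc in docs:
        for token in doc:
            if not token.has_vector:
                missing[token.text] += 1
    return missing

# The debug-data report then uses counts like these:
# n_missing = sum(missing.values())
# top_ten = missing.most_common(10)
```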
@@ -2,7 +2,6 @@
from __future__ import unicode_literals, division, print_function

import plac
import spacy
from timeit import default_timer as timer
from wasabi import msg

@@ -45,7 +44,7 @@ def evaluate(
        msg.fail("Visualization output directory not found", displacy_path, exits=1)
    corpus = GoldCorpus(data_path, data_path)
    if model.startswith("blank:"):
        nlp = spacy.blank(model.replace("blank:", ""))
        nlp = util.get_lang_class(model.replace("blank:", ""))()
    else:
        nlp = util.load_model(model)
    dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
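The change above swaps `spacy.blank(lang)` for `util.get_lang_class(lang)()` when the model argument has the `blank:xx` form. Both construct an empty pipeline for a language; a small sketch of the equivalence, with the language code chosen arbitrarily:

```python
import spacy
from spacy.util import get_lang_class

# Two ways to get an empty pipeline for a language code such as "da"
nlp_blank = spacy.blank("da")
nlp_cls = get_lang_class("da")()

assert type(nlp_blank) is type(nlp_cls)
```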
@@ -17,7 +17,9 @@ from wasabi import msg

from ..vectors import Vectors
from ..errors import Errors, Warnings
from ..util import ensure_path, get_lang_class, OOV_RANK
from ..util import ensure_path, get_lang_class, load_model, OOV_RANK
from ..lookups import Lookups


try:
    import ftfy

@@ -49,6 +51,8 @@ DEFAULT_OOV_PROB = -20
        str,
    ),
    model_name=("Optional name for the model meta", "option", "mn", str),
    omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
    base_model=("Base model (for languages with custom tokenizers)", "option", "b", str),
)
def init_model(
    lang,

@@ -61,6 +65,8 @@ def init_model(
    prune_vectors=-1,
    vectors_name=None,
    model_name=None,
    omit_extra_lookups=False,
    base_model=None,
):
    """
    Create a new model from raw data, like word frequencies, Brown clusters

@@ -92,7 +98,16 @@ def init_model(
        lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc)

    with msg.loading("Creating model..."):
        nlp = create_model(lang, lex_attrs, name=model_name)
        nlp = create_model(lang, lex_attrs, name=model_name, base_model=base_model)

    # Create empty extra lexeme tables so the data from spacy-lookups-data
    # isn't loaded if these features are accessed
    if omit_extra_lookups:
        nlp.vocab.lookups_extra = Lookups()
        nlp.vocab.lookups_extra.add_table("lexeme_cluster")
        nlp.vocab.lookups_extra.add_table("lexeme_prob")
        nlp.vocab.lookups_extra.add_table("lexeme_settings")

    msg.good("Successfully created model")
    if vectors_loc is not None:
        add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name)

@@ -152,20 +167,23 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc):
    return lex_attrs


def create_model(lang, lex_attrs, name=None):
    lang_class = get_lang_class(lang)
    nlp = lang_class()
def create_model(lang, lex_attrs, name=None, base_model=None):
    if base_model:
        nlp = load_model(base_model)
        # keep the tokenizer but remove any existing pipeline components due to
        # potentially conflicting vectors
        for pipe in nlp.pipe_names:
            nlp.remove_pipe(pipe)
    else:
        lang_class = get_lang_class(lang)
        nlp = lang_class()
    for lexeme in nlp.vocab:
        lexeme.rank = OOV_RANK
    lex_added = 0
    for attrs in lex_attrs:
        if "settings" in attrs:
            continue
        lexeme = nlp.vocab[attrs["orth"]]
        lexeme.set_attrs(**attrs)
        lexeme.is_oov = False
        lex_added += 1
        lex_added += 1
    if len(nlp.vocab):
        oov_prob = min(lex.prob for lex in nlp.vocab) - 1
    else:

@@ -181,7 +199,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None):
    if vectors_loc and vectors_loc.parts[-1].endswith(".npz"):
        nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb")))
        for lex in nlp.vocab:
            if lex.rank:
            if lex.rank and lex.rank != OOV_RANK:
                nlp.vocab.vectors.add(lex.orth, row=lex.rank)
    else:
        if vectors_loc:

@@ -193,8 +211,7 @@
        if vector_keys is not None:
            for word in vector_keys:
                if word not in nlp.vocab:
                    lexeme = nlp.vocab[word]
                    lexeme.is_oov = False
                    nlp.vocab[word]
        if vectors_data is not None:
            nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
    if name is None:
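The new `base_model` option in `create_model` loads an existing model so its tokenizer can be reused (useful for languages with custom tokenizers), then strips its pipeline components before the vocab is filled. A standalone sketch of that branch under the same assumptions as the diff; `base_model_name` is a placeholder for whatever installed or on-disk model you pass in:

```python
import spacy
from spacy.util import get_lang_class

def make_base_nlp(lang, base_model_name=None):
    """Return an empty pipeline, optionally keeping a base model's tokenizer."""
    if base_model_name:
        nlp = spacy.load(base_model_name)
        # keep the tokenizer but remove any existing pipeline components,
        # since their weights may conflict with the new vocab/vectors
        for pipe_name in list(nlp.pipe_names):
            nlp.remove_pipe(pipe_name)
    else:
        nlp = get_lang_class(lang)()
    return nlp
```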
@@ -15,9 +15,9 @@ import random

from .._ml import create_default_optimizer
from ..util import use_gpu as set_gpu
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
from ..gold import GoldCorpus
from ..compat import path2str
from ..lookups import Lookups
from .. import util
from .. import about

@@ -58,6 +58,7 @@ from .. import about
    textcat_arch=("Textcat model architecture", "option", "ta", str),
    textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str),
    tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
    omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool),
    verbose=("Display more information for debug", "flag", "VV", bool),
    debug=("Run data diagnostics before training", "flag", "D", bool),
    # fmt: on

@@ -97,6 +98,7 @@ def train(
    textcat_arch="bow",
    textcat_positive_label=None,
    tag_map_path=None,
    omit_extra_lookups=False,
    verbose=False,
    debug=False,
):

@@ -248,6 +250,14 @@ def train(
        # Update tag map with provided mapping
        nlp.vocab.morphology.tag_map.update(tag_map)

    # Create empty extra lexeme tables so the data from spacy-lookups-data
    # isn't loaded if these features are accessed
    if omit_extra_lookups:
        nlp.vocab.lookups_extra = Lookups()
        nlp.vocab.lookups_extra.add_table("lexeme_cluster")
        nlp.vocab.lookups_extra.add_table("lexeme_prob")
        nlp.vocab.lookups_extra.add_table("lexeme_settings")

    if vectors:
        msg.text("Loading vector from model '{}'".format(vectors))
        _load_vectors(nlp, vectors)

@@ -630,15 +640,6 @@ def _create_progress_bar(total):

def _load_vectors(nlp, vectors):
    util.load_model(vectors, vocab=nlp.vocab)
    for lex in nlp.vocab:
        values = {}
        for attr, func in nlp.vocab.lex_attr_getters.items():
            # These attrs are expected to be set by data. Others should
            # be set by calling the language functions.
            if attr not in (CLUSTER, PROB, IS_OOV, LANG):
                values[lex.vocab.strings[attr]] = func(lex.orth_)
        lex.set_attrs(**values)
        lex.is_oov = False


def _load_pretrained_tok2vec(nlp, loc):
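The `omit_extra_lookups` flag seen in both `init_model` and `train` works by pre-registering empty tables on `vocab.lookups_extra`, so spaCy never falls back to loading those tables from the spacy-lookups-data package. The mechanism is just the `Lookups` container; a minimal sketch:

```python
from spacy.lookups import Lookups

lookups = Lookups()
for table_name in ("lexeme_cluster", "lexeme_prob", "lexeme_settings"):
    # Register an empty table so nothing is loaded from spacy-lookups-data
    lookups.add_table(table_name)

assert lookups.has_table("lexeme_prob")
# Assigning this object to nlp.vocab.lookups_extra (as in the diff) means
# lexeme.prob, lexeme.cluster etc. read from these empty tables instead.
```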
@@ -1,13 +1,17 @@
# coding: utf8
from __future__ import unicode_literals


def add_codes(err_cls):
    """Add error codes to string messages via class attribute names."""

    class ErrorsWithCodes(object):
    class ErrorsWithCodes(err_cls):
        def __getattribute__(self, code):
            msg = getattr(err_cls, code)
            return "[{code}] {msg}".format(code=code, msg=msg)
            msg = super(ErrorsWithCodes, self).__getattribute__(code)
            if code.startswith("__"):  # python system attributes like __class__
                return msg
            else:
                return "[{code}] {msg}".format(code=code, msg=msg)

    return ErrorsWithCodes()

@@ -106,6 +110,11 @@ class Warnings(object):
            "in problems with the vocab further on in the pipeline.")
    W029 = ("Unable to align tokens with entities from character offsets. "
            "Discarding entity annotation for the text: {text}.")
    W030 = ("Some entities could not be aligned in the text \"{text}\" with "
            "entities \"{entities}\". Use "
            "`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`"
            " to check the alignment. Misaligned entities ('-') will be "
            "ignored during training.")


@add_codes

@@ -555,6 +564,9 @@ class Errors(object):
    E195 = ("Matcher can be called on {good} only, got {got}.")
    E196 = ("Refusing to write to token.is_sent_end. Sentence boundaries can "
            "only be fixed with token.is_sent_start.")
    E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
    E198 = ("Unable to return {n} most similar vectors for the current vectors "
            "table, which contains {n_rows} vectors.")


@add_codes
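The `add_codes` change makes `ErrorsWithCodes` subclass the wrapped class and pass Python dunder attributes through untouched, so `__class__` lookups (and pickling) behave normally while ordinary attribute access still returns the `[CODE] message` string. A self-contained sketch of the same pattern; the message class here is a made-up demo, not spaCy's real error table:

```python
def add_codes(err_cls):
    """Add error codes to string messages via class attribute names."""

    class ErrorsWithCodes(err_cls):
        def __getattribute__(self, code):
            msg = super(ErrorsWithCodes, self).__getattribute__(code)
            if code.startswith("__"):  # python system attributes like __class__
                return msg
            return "[{code}] {msg}".format(code=code, msg=msg)

    return ErrorsWithCodes()


@add_codes
class DemoErrors(object):
    E001 = "Something went wrong: {detail}"


print(DemoErrors.E001.format(detail="oops"))  # "[E001] Something went wrong: oops"
```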
@@ -658,7 +658,15 @@ cdef class GoldParse:
            entdoc = None

        # avoid allocating memory if the doc does not contain any tokens
        if self.length > 0:
        if self.length == 0:
            self.words = []
            self.tags = []
            self.heads = []
            self.labels = []
            self.ner = []
            self.morphology = []

        else:
            if words is None:
                words = [token.text for token in doc]
            if tags is None:

@@ -949,6 +957,12 @@ def biluo_tags_from_offsets(doc, entities, missing="O"):
                break
        else:
            biluo[token.i] = missing
    if "-" in biluo:
        ent_str = str(entities)
        warnings.warn(Warnings.W030.format(
            text=doc.text[:50] + "..." if len(doc.text) > 50 else doc.text,
            entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str
        ))
    return biluo
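The new W030 warning points users at `biluo_tags_from_offsets` to debug misaligned entity spans. A short sketch of that check, using a deliberately misaligned character span (the text and offsets are made up for illustration; '-' marks tokens that could not be aligned):

```python
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp.make_doc("I flew to San Francisco Valley")

# Aligned offsets: "San Francisco Valley" covers whole tokens
print(biluo_tags_from_offsets(doc, [(10, 30, "LOC")]))
# ['O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC']

# Misaligned offsets: the span starts in the middle of "San", so the affected
# tokens come back as '-' and, with this change, W030 is emitted
print(biluo_tags_from_offsets(doc, [(11, 30, "LOC")]))
```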
@@ -6,7 +6,7 @@ from libcpp.vector cimport vector
from libc.stdint cimport int32_t, int64_t
from libc.stdio cimport FILE

from spacy.vocab cimport Vocab
from .vocab cimport Vocab
from .typedefs cimport hash_t

from .structs cimport KBEntryC, AliasC

@@ -113,7 +113,7 @@ cdef class KnowledgeBase:
        return new_index

    cdef inline void _create_empty_vectors(self, hash_t dummy_hash) nogil:
        """
        """
        Initializing the vectors and making sure the first element of each vector is a dummy,
        because the PreshMap maps pointing to indices in these vectors can not contain 0 as value
        cf. https://github.com/explosion/preshed/issues/17

@@ -169,4 +169,3 @@ cdef class Reader:
    cdef int read_alias(self, int64_t* entry_index, float* prob) except -1

    cdef int _read(self, void* value, size_t size) except -1
25  spacy/kb.pyx

@@ -1,23 +1,20 @@
# cython: infer_types=True
# cython: profile=True
# coding: utf8
import warnings

from spacy.errors import Errors, Warnings

from pathlib import Path
from cymem.cymem cimport Pool
from preshed.maps cimport PreshMap

from cpython.exc cimport PyErr_SetFromErrno

from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek
from libc.stdint cimport int32_t, int64_t
from libcpp.vector cimport vector

import warnings
from os import path
from pathlib import Path

from .typedefs cimport hash_t

from os import path
from libcpp.vector cimport vector
from .errors import Errors, Warnings


cdef class Candidate:

@@ -448,10 +445,10 @@ cdef class KnowledgeBase:

cdef class Writer:
    def __init__(self, object loc):
        if path.exists(loc):
            assert not path.isdir(loc), "%s is directory." % loc
        if isinstance(loc, Path):
            loc = bytes(loc)
        if path.exists(loc):
            assert not path.isdir(loc), "%s is directory." % loc
        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
        self._fp = fopen(<char*>bytes_loc, 'wb')
        if not self._fp:

@@ -493,10 +490,10 @@ cdef class Writer:

cdef class Reader:
    def __init__(self, object loc):
        assert path.exists(loc)
        assert not path.isdir(loc)
        if isinstance(loc, Path):
            loc = bytes(loc)
        assert path.exists(loc)
        assert not path.isdir(loc)
        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
        self._fp = fopen(<char*>bytes_loc, 'rb')
        if not self._fp:

@@ -586,5 +583,3 @@ cdef class Reader:
    cdef int _read(self, void* value, size_t size) except -1:
        status = fread(value, size, 1, self._fp)
        return status
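The Writer/Reader changes let the knowledge base's serialization accept `pathlib.Path` objects as well as plain strings. A quick round-trip sketch, assuming spaCy v2.2+ and a writable scratch path; the entity ID, frequency and vectors are made up:

```python
from pathlib import Path

from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

vocab = Vocab()
kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", freq=12, entity_vector=[1.0, 2.0, 3.0])
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[1.0])

kb_path = Path("/tmp/my_kb")  # a Path now works here, not just a str
kb.dump(kb_path)

kb2 = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb2.load_bulk(kb_path)
print(kb2.get_size_entities())  # 1
```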
@@ -2,7 +2,6 @@
from __future__ import unicode_literals

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS

@@ -10,19 +9,15 @@ from .morph_rules import MORPH_RULES
from ..tag_map import TAG_MAP

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
from ...attrs import LANG
from ...util import update_exc


class DanishDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "da"
    lex_attr_getters[NORM] = add_lookups(
        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
    )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    morph_rules = MORPH_RULES
    infixes = TOKENIZER_INFIXES
@@ -1,527 +0,0 @@
# coding: utf8
"""
Special-case rules for normalizing tokens to improve the model's predictions.
For example 'mysterium' vs 'mysterie' and similar.
"""
from __future__ import unicode_literals


# Sources:
# 1: https://dsn.dk/retskrivning/om-retskrivningsordbogen/mere-om-retskrivningsordbogen-2012/endrede-stave-og-ordformer/
# 2: http://www.tjerry-korrektur.dk/ord-med-flere-stavemaader/

_exc = {
    # Alternative spelling
    "a-kraft-værk": "a-kraftværk",  # 1
    "ålborg": "aalborg",  # 2
    "århus": "aarhus",
    "accessoirer": "accessoires",  # 1
    "affektert": "affekteret",  # 1
    "afrikander": "afrikaaner",  # 1
    "aftabuere": "aftabuisere",  # 1
    "aftabuering": "aftabuisering",  # 1
    "akvarium": "akvarie",  # 1
    "alenefader": "alenefar",  # 1
    "alenemoder": "alenemor",  # 1
    "alkoholambulatorium": "alkoholambulatorie",  # 1
    "ambulatorium": "ambulatorie",  # 1
    "ananassene": "ananasserne",  # 2
    "anførelsestegn": "anførselstegn",  # 1
    "anseelig": "anselig",  # 2
    "antioxydant": "antioxidant",  # 1
    "artrig": "artsrig",  # 1
    "auditorium": "auditorie",  # 1
    "avocado": "avokado",  # 2
    "bagerst": "bagest",  # 2
    "bagstræv": "bagstræb",  # 1
    "bagstræver": "bagstræber",  # 1
    "bagstræverisk": "bagstræberisk",  # 1
    "balde": "balle",  # 2
    "barselorlov": "barselsorlov",  # 1
    "barselvikar": "barselsvikar",  # 1
    "baskien": "baskerlandet",  # 1
    "bayrisk": "bayersk",  # 1
    "bedstefader": "bedstefar",  # 1
    "bedstemoder": "bedstemor",  # 1
    "behefte": "behæfte",  # 1
    "beheftelse": "behæftelse",  # 1
    "bidragydende": "bidragsydende",  # 1
    "bidragyder": "bidragsyder",  # 1
    "billiondel": "billiontedel",  # 1
    "blaseret": "blasert",  # 1
    "bleskifte": "bleskift",  # 1
    "blodbroder": "blodsbroder",  # 2
    "blyantspidser": "blyantsspidser",  # 2
    "boligministerium": "boligministerie",  # 1
    "borhul": "borehul",  # 1
    "broder": "bror",  # 2
    "buldog": "bulldog",  # 2
    "bådhus": "bådehus",  # 1
    "børnepleje": "barnepleje",  # 1
    "børneseng": "barneseng",  # 1
    "børnestol": "barnestol",  # 1
    "cairo": "kairo",  # 1
    "cambodia": "cambodja",  # 1
    "cambodianer": "cambodjaner",  # 1
    "cambodiansk": "cambodjansk",  # 1
    "camouflage": "kamuflage",  # 2
    "campylobacter": "kampylobakter",  # 1
    "centeret": "centret",  # 2
    "chefskahyt": "chefkahyt",  # 1
    "chefspost": "chefpost",  # 1
    "chefssekretær": "chefsekretær",  # 1
    "chefsstol": "chefstol",  # 1
    "cirkulærskrivelse": "cirkulæreskrivelse",  # 1
    "cognacsglas": "cognacglas",  # 1
    "columnist": "kolumnist",  # 1
    "cricket": "kricket",  # 2
    "dagplejemoder": "dagplejemor",  # 1
    "damaskesdug": "damaskdug",  # 1
    "damp-barn": "dampbarn",  # 1
    "delfinarium": "delfinarie",  # 1
    "dentallaboratorium": "dentallaboratorie",  # 1
    "diaramme": "diasramme",  # 1
    "diaré": "diarré",  # 1
    "dioxyd": "dioxid",  # 1
    "dommedagsprædiken": "dommedagspræken",  # 1
    "donut": "doughnut",  # 2
    "driftmæssig": "driftsmæssig",  # 1
    "driftsikker": "driftssikker",  # 1
    "driftsikring": "driftssikring",  # 1
    "drikkejogurt": "drikkeyoghurt",  # 1
    "drivein": "drive-in",  # 1
    "driveinbiograf": "drive-in-biograf",  # 1
    "drøvel": "drøbel",  # 1
    "dødskriterium": "dødskriterie",  # 1
    "e-mail-adresse": "e-mailadresse",  # 1
    "e-post-adresse": "e-postadresse",  # 1
    "egypten": "ægypten",  # 2
    "ekskommunicere": "ekskommunikere",  # 1
    "eksperimentarium": "eksperimentarie",  # 1
    "elsass": "Alsace",  # 1
    "elsasser": "alsacer",  # 1
    "elsassisk": "alsacisk",  # 1
    "elvetal": "ellevetal",  # 1
    "elvetiden": "ellevetiden",  # 1
    "elveårig": "elleveårig",  # 1
    "elveårs": "elleveårs",  # 1
    "elveårsbarn": "elleveårsbarn",  # 1
    "elvte": "ellevte",  # 1
    "elvtedel": "ellevtedel",  # 1
    "energiministerium": "energiministerie",  # 1
    "erhvervsministerium": "erhvervsministerie",  # 1
    "espaliere": "spaliere",  # 2
    "evangelium": "evangelie",  # 1
    "fagministerium": "fagministerie",  # 1
    "fakse": "faxe",  # 1
    "fangstkvota": "fangstkvote",  # 1
    "fader": "far",  # 2
    "farbroder": "farbror",  # 1
    "farfader": "farfar",  # 1
    "farmoder": "farmor",  # 1
    "federal": "føderal",  # 1
    "federalisering": "føderalisering",  # 1
    "federalisme": "føderalisme",  # 1
|
||||
"federalist": "føderalist", # 1
|
||||
"federalistisk": "føderalistisk", # 1
|
||||
"federation": "føderation", # 1
|
||||
"federativ": "føderativ", # 1
|
||||
"fejlbeheftet": "fejlbehæftet", # 1
|
||||
"femetagers": "femetages", # 2
|
||||
"femhundredekroneseddel": "femhundredkroneseddel", # 2
|
||||
"filmpremiere": "filmpræmiere", # 2
|
||||
"finansimperium": "finansimperie", # 1
|
||||
"finansministerium": "finansministerie", # 1
|
||||
"firehjulstræk": "firhjulstræk", # 2
|
||||
"fjernstudium": "fjernstudie", # 1
|
||||
"formalier": "formalia", # 1
|
||||
"formandsskift": "formandsskifte", # 1
|
||||
"fornemst": "fornemmest", # 2
|
||||
"fornuftparti": "fornuftsparti", # 1
|
||||
"fornuftstridig": "fornuftsstridig", # 1
|
||||
"fornuftvæsen": "fornuftsvæsen", # 1
|
||||
"fornuftægteskab": "fornuftsægteskab", # 1
|
||||
"forretningsministerium": "forretningsministerie", # 1
|
||||
"forskningsministerium": "forskningsministerie", # 1
|
||||
"forstudium": "forstudie", # 1
|
||||
"forsvarsministerium": "forsvarsministerie", # 1
|
||||
"frilægge": "fritlægge", # 1
|
||||
"frilæggelse": "fritlæggelse", # 1
|
||||
"frilægning": "fritlægning", # 1
|
||||
"fristille": "fritstille", # 1
|
||||
"fristilling": "fritstilling", # 1
|
||||
"fuldttegnet": "fuldtegnet", # 1
|
||||
"fødestedskriterium": "fødestedskriterie", # 1
|
||||
"fødevareministerium": "fødevareministerie", # 1
|
||||
"følesløs": "følelsesløs", # 1
|
||||
"følgeligt": "følgelig", # 1
|
||||
"førne": "førn", # 1
|
||||
"gearskift": "gearskifte", # 2
|
||||
"gladeligt": "gladelig", # 1
|
||||
"glosehefte": "glosehæfte", # 1
|
||||
"glædeløs": "glædesløs", # 1
|
||||
"gonoré": "gonorré", # 1
|
||||
"grangiveligt": "grangivelig", # 1
|
||||
"grundliggende": "grundlæggende", # 2
|
||||
"grønsag": "grøntsag", # 2
|
||||
"gudbenådet": "gudsbenådet", # 1
|
||||
"gudfader": "gudfar", # 1
|
||||
"gudmoder": "gudmor", # 1
|
||||
"gulvmop": "gulvmoppe", # 1
|
||||
"gymnasium": "gymnasie", # 1
|
||||
"hackning": "hacking", # 1
|
||||
"halvbroder": "halvbror", # 1
|
||||
"halvelvetiden": "halvellevetiden", # 1
|
||||
"handelsgymnasium": "handelsgymnasie", # 1
|
||||
"hefte": "hæfte", # 1
|
||||
"hefteklamme": "hæfteklamme", # 1
|
||||
"heftelse": "hæftelse", # 1
|
||||
"heftemaskine": "hæftemaskine", # 1
|
||||
"heftepistol": "hæftepistol", # 1
|
||||
"hefteplaster": "hæfteplaster", # 1
|
||||
"heftestraf": "hæftestraf", # 1
|
||||
"heftning": "hæftning", # 1
|
||||
"helbroder": "helbror", # 1
|
||||
"hjemmeklasse": "hjemklasse", # 1
|
||||
"hjulspin": "hjulspind", # 1
|
||||
"huggevåben": "hugvåben", # 1
|
||||
"hulmurisolering": "hulmursisolering", # 1
|
||||
"hurtiggående": "hurtigtgående", # 2
|
||||
"hurtigttørrende": "hurtigtørrende", # 2
|
||||
"husmoder": "husmor", # 1
|
||||
"hydroxyd": "hydroxid", # 1
|
||||
"håndmikser": "håndmixer", # 1
|
||||
"højtaler": "højttaler", # 2
|
||||
"hønemoder": "hønemor", # 1
|
||||
"ide": "idé", # 2
|
||||
"imperium": "imperie", # 1
|
||||
"imponerthed": "imponerethed", # 1
|
||||
"inbox": "indboks", # 2
|
||||
"indenrigsministerium": "indenrigsministerie", # 1
|
||||
"indhefte": "indhæfte", # 1
|
||||
"indheftning": "indhæftning", # 1
|
||||
"indicium": "indicie", # 1
|
||||
"indkassere": "inkassere", # 2
|
||||
"iota": "jota", # 1
|
||||
"jobskift": "jobskifte", # 1
|
||||
"jogurt": "yoghurt", # 1
|
||||
"jukeboks": "jukebox", # 1
|
||||
"justitsministerium": "justitsministerie", # 1
|
||||
"kalorifere": "kalorifer", # 1
|
||||
"kandidatstipendium": "kandidatstipendie", # 1
|
||||
"kannevas": "kanvas", # 1
|
||||
"kaperssauce": "kaperssovs", # 1
|
||||
"kigge": "kikke", # 2
|
||||
"kirkeministerium": "kirkeministerie", # 1
|
||||
"klapmydse": "klapmyds", # 1
|
||||
"klimakterium": "klimakterie", # 1
|
||||
"klogeligt": "klogelig", # 1
|
||||
"knivblad": "knivsblad", # 1
|
||||
"kollegaer": "kolleger", # 2
|
||||
"kollegium": "kollegie", # 1
|
||||
"kollegiehefte": "kollegiehæfte", # 1
|
||||
"kollokviumx": "kollokvium", # 1
|
||||
"kommissorium": "kommissorie", # 1
|
||||
"kompendium": "kompendie", # 1
|
||||
"komplicerthed": "komplicerethed", # 1
|
||||
"konfederation": "konføderation", # 1
|
||||
"konfedereret": "konfødereret", # 1
|
||||
"konferensstudium": "konferensstudie", # 1
|
||||
"konservatorium": "konservatorie", # 1
|
||||
"konsulere": "konsultere", # 1
|
||||
"kradsbørstig": "krasbørstig", # 2
|
||||
"kravsspecifikation": "kravspecifikation", # 1
|
||||
"krematorium": "krematorie", # 1
|
||||
"krep": "crepe", # 1
|
||||
"krepnylon": "crepenylon", # 1
|
||||
"kreppapir": "crepepapir", # 1
|
||||
"kricket": "cricket", # 2
|
||||
"kriterium": "kriterie", # 1
|
||||
"kroat": "kroater", # 2
|
||||
"kroki": "croquis", # 1
|
||||
"kronprinsepar": "kronprinspar", # 2
|
||||
"kropdoven": "kropsdoven", # 1
|
||||
"kroplus": "kropslus", # 1
|
||||
"krøllefedt": "krølfedt", # 1
|
||||
"kulturministerium": "kulturministerie", # 1
|
||||
"kuponhefte": "kuponhæfte", # 1
|
||||
"kvota": "kvote", # 1
|
||||
"kvotaordning": "kvoteordning", # 1
|
||||
"laboratorium": "laboratorie", # 1
|
||||
"laksfarve": "laksefarve", # 1
|
||||
"laksfarvet": "laksefarvet", # 1
|
||||
"laksrød": "lakserød", # 1
|
||||
"laksyngel": "lakseyngel", # 1
|
||||
"laksørred": "lakseørred", # 1
|
||||
"landbrugsministerium": "landbrugsministerie", # 1
|
||||
"landskampstemning": "landskampsstemning", # 1
|
||||
"langust": "languster", # 1
|
||||
"lappegrejer": "lappegrej", # 1
|
||||
"lavløn": "lavtløn", # 1
|
||||
"lillebroder": "lillebror", # 1
|
||||
"linear": "lineær", # 1
|
||||
"loftlampe": "loftslampe", # 2
|
||||
"log-in": "login", # 1
|
||||
"login": "log-in", # 2
|
||||
"lovmedholdig": "lovmedholdelig", # 1
|
||||
"ludder": "luder", # 2
|
||||
"lysholder": "lyseholder", # 1
|
||||
"lægeskifte": "lægeskift", # 1
|
||||
"lærvillig": "lærevillig", # 1
|
||||
"løgsauce": "løgsovs", # 1
|
||||
"madmoder": "madmor", # 1
|
||||
"majonæse": "mayonnaise", # 1
|
||||
"mareridtagtig": "mareridtsagtig", # 1
|
||||
"margen": "margin", # 2
|
||||
"martyrium": "martyrie", # 1
|
||||
"mellemstatlig": "mellemstatslig", # 1
|
||||
"menneskene": "menneskerne", # 2
|
||||
"metropolis": "metropol", # 1
|
||||
"miks": "mix", # 1
|
||||
"mikse": "mixe", # 1
|
||||
"miksepult": "mixerpult", # 1
|
||||
"mikser": "mixer", # 1
|
||||
"mikserpult": "mixerpult", # 1
|
||||
"mikslån": "mixlån", # 1
|
||||
"miksning": "mixning", # 1
|
||||
"miljøministerium": "miljøministerie", # 1
|
||||
"milliarddel": "milliardtedel", # 1
|
||||
"milliondel": "milliontedel", # 1
|
||||
"ministerium": "ministerie", # 1
|
||||
"mop": "moppe", # 1
|
||||
"moder": "mor", # 2
|
||||
"moratorium": "moratorie", # 1
|
||||
"morbroder": "morbror", # 1
|
||||
"morfader": "morfar", # 1
|
||||
"mormoder": "mormor", # 1
|
||||
"musikkonservatorium": "musikkonservatorie", # 1
|
||||
"muslingskal": "muslingeskal", # 1
|
||||
"mysterium": "mysterie", # 1
|
||||
"naturalieydelse": "naturalydelse", # 1
|
||||
"naturalieøkonomi": "naturaløkonomi", # 1
|
||||
"navnebroder": "navnebror", # 1
|
||||
"nerium": "nerie", # 1
|
||||
"nådeløs": "nådesløs", # 1
|
||||
"nærforestående": "nærtforestående", # 1
|
||||
"nærstående": "nærtstående", # 1
|
||||
"observatorium": "observatorie", # 1
|
||||
"oldefader": "oldefar", # 1
|
||||
"oldemoder": "oldemor", # 1
|
||||
"opgraduere": "opgradere", # 1
|
||||
"opgraduering": "opgradering", # 1
|
||||
"oratorium": "oratorie", # 1
|
||||
"overbookning": "overbooking", # 1
|
||||
"overpræsidium": "overpræsidie", # 1
|
||||
"overstatlig": "overstatslig", # 1
|
||||
"oxyd": "oxid", # 1
|
||||
"oxydere": "oxidere", # 1
|
||||
"oxydering": "oxidering", # 1
|
||||
"pakkenellike": "pakkenelliker", # 1
|
||||
"papirtynd": "papirstynd", # 1
|
||||
"pastoralseminarium": "pastoralseminarie", # 1
|
||||
"peanutsene": "peanuttene", # 2
|
||||
"penalhus": "pennalhus", # 2
|
||||
"pensakrav": "pensumkrav", # 1
|
||||
"pepperoni": "peperoni", # 1
|
||||
"peruaner": "peruvianer", # 1
|
||||
"petrole": "petrol", # 1
|
||||
"piltast": "piletast", # 1
|
||||
"piltaste": "piletast", # 1
|
||||
"planetarium": "planetarie", # 1
|
||||
"plasteret": "plastret", # 2
|
||||
"plastic": "plastik", # 2
|
||||
"play-off-kamp": "playoffkamp", # 1
|
||||
"plejefader": "plejefar", # 1
|
||||
"plejemoder": "plejemor", # 1
|
||||
"podium": "podie", # 2
|
||||
"praha": "prag", # 2
|
||||
"preciøs": "pretiøs", # 2
|
||||
"privilegium": "privilegie", # 1
|
||||
"progredere": "progrediere", # 1
|
||||
"præsidium": "præsidie", # 1
|
||||
"psykodelisk": "psykedelisk", # 1
|
||||
"pudsegrejer": "pudsegrej", # 1
|
||||
"referensgruppe": "referencegruppe", # 1
|
||||
"referensramme": "referenceramme", # 1
|
||||
"refugium": "refugie", # 1
|
||||
"registeret": "registret", # 2
|
||||
"remedium": "remedie", # 1
|
||||
"remiks": "remix", # 1
|
||||
"reservert": "reserveret", # 1
|
||||
"ressortministerium": "ressortministerie", # 1
|
||||
"ressource": "resurse", # 2
|
||||
"resætte": "resette", # 1
|
||||
"rettelig": "retteligt", # 1
|
||||
"rettetaste": "rettetast", # 1
|
||||
"returtaste": "returtast", # 1
|
||||
"risici": "risikoer", # 2
|
||||
"roll-on": "rollon", # 1
|
||||
"rollehefte": "rollehæfte", # 1
|
||||
"rostbøf": "roastbeef", # 1
|
||||
"rygsæksturist": "rygsækturist", # 1
|
||||
"rødstjært": "rødstjert", # 1
|
||||
"saddel": "sadel", # 2
|
||||
"samaritan": "samaritaner", # 2
|
||||
"sanatorium": "sanatorie", # 1
|
||||
"sauce": "sovs", # 1
|
||||
"scanning": "skanning", # 2
|
||||
"sceneskifte": "sceneskift", # 1
|
||||
"scilla": "skilla", # 1
|
||||
"sejflydende": "sejtflydende", # 1
|
||||
"selvstudium": "selvstudie", # 1
|
||||
"seminarium": "seminarie", # 1
|
||||
"sennepssauce": "sennepssovs ", # 1
|
||||
"servitutbeheftet": "servitutbehæftet", # 1
|
||||
"sit-in": "sitin", # 1
|
||||
"skatteministerium": "skatteministerie", # 1
|
||||
"skifer": "skiffer", # 2
|
||||
"skyldsfølelse": "skyldfølelse", # 1
|
||||
"skysauce": "skysovs", # 1
|
||||
"sladdertaske": "sladretaske", # 2
|
||||
"sladdervorn": "sladrevorn", # 2
|
||||
"slagsbroder": "slagsbror", # 1
|
||||
"slettetaste": "slettetast", # 1
|
||||
"smørsauce": "smørsovs", # 1
|
||||
"snitsel": "schnitzel", # 1
|
||||
"snobbeeffekt": "snobeffekt", # 2
|
||||
"socialministerium": "socialministerie", # 1
|
||||
"solarium": "solarie", # 1
|
||||
"soldebroder": "soldebror", # 1
|
||||
"spagetti": "spaghetti", # 1
|
||||
"spagettistrop": "spaghettistrop", # 1
|
||||
"spagettiwestern": "spaghettiwestern", # 1
|
||||
"spin-off": "spinoff", # 1
|
||||
"spinnefiskeri": "spindefiskeri", # 1
|
||||
"spolorm": "spoleorm", # 1
|
||||
"sproglaboratorium": "sproglaboratorie", # 1
|
||||
"spækbræt": "spækkebræt", # 2
|
||||
"stand-in": "standin", # 1
|
||||
"stand-up-comedy": "standupcomedy", # 1
|
||||
"stand-up-komiker": "standupkomiker", # 1
|
||||
"statsministerium": "statsministerie", # 1
|
||||
"stedbroder": "stedbror", # 1
|
||||
"stedfader": "stedfar", # 1
|
||||
"stedmoder": "stedmor", # 1
|
||||
"stilehefte": "stilehæfte", # 1
|
||||
"stipendium": "stipendie", # 1
|
||||
"stjært": "stjert", # 1
|
||||
"stjærthage": "stjerthage", # 1
|
||||
"storebroder": "storebror", # 1
|
||||
"stortå": "storetå", # 1
|
||||
"strabads": "strabadser", # 1
|
||||
"strømlinjet": "strømlinet", # 1
|
||||
"studium": "studie", # 1
|
||||
"stænkelap": "stænklap", # 1
|
||||
"sundhedsministerium": "sundhedsministerie", # 1
|
||||
"suppositorium": "suppositorie", # 1
|
||||
"svejts": "schweiz", # 1
|
||||
"svejtser": "schweizer", # 1
|
||||
"svejtserfranc": "schweizerfranc", # 1
|
||||
"svejtserost": "schweizerost", # 1
|
||||
"svejtsisk": "schweizisk", # 1
|
||||
"svigerfader": "svigerfar", # 1
|
||||
"svigermoder": "svigermor", # 1
|
||||
"svirebroder": "svirebror", # 1
|
||||
"symposium": "symposie", # 1
|
||||
"sælarium": "sælarie", # 1
|
||||
"søreme": "sørme", # 2
|
||||
"søterritorium": "søterritorie", # 1
|
||||
"t-bone-steak": "t-bonesteak", # 1
|
||||
"tabgivende": "tabsgivende", # 1
|
||||
"tabuere": "tabuisere", # 1
|
||||
"tabuering": "tabuisering", # 1
|
||||
"tackle": "takle", # 2
|
||||
"tackling": "takling", # 2
|
||||
"taifun": "tyfon", # 1
|
||||
"take-off": "takeoff", # 1
|
||||
"taknemlig": "taknemmelig", # 2
|
||||
"talehørelærer": "tale-høre-lærer", # 1
|
||||
"talehøreundervisning": "tale-høre-undervisning", # 1
|
||||
"tandstik": "tandstikker", # 1
|
||||
"tao": "dao", # 1
|
||||
"taoisme": "daoisme", # 1
|
||||
"taoist": "daoist", # 1
|
||||
"taoistisk": "daoistisk", # 1
|
||||
"taverne": "taverna", # 1
|
||||
"teateret": "teatret", # 2
|
||||
"tekno": "techno", # 1
|
||||
"temposkifte": "temposkift", # 1
|
||||
"terrarium": "terrarie", # 1
|
||||
"territorium": "territorie", # 1
|
||||
"tesis": "tese", # 1
|
||||
"tidsstudium": "tidsstudie", # 1
|
||||
"tipoldefader": "tipoldefar", # 1
|
||||
"tipoldemoder": "tipoldemor", # 1
|
||||
"tomatsauce": "tomatsovs", # 1
|
||||
"tonart": "toneart", # 1
|
||||
"trafikministerium": "trafikministerie", # 1
|
||||
"tredve": "tredive", # 1
|
||||
"tredver": "trediver", # 1
|
||||
"tredveårig": "trediveårig", # 1
|
||||
"tredveårs": "trediveårs", # 1
|
||||
"tredveårsfødselsdag": "trediveårsfødselsdag", # 1
|
||||
"tredvte": "tredivte", # 1
|
||||
"tredvtedel": "tredivtedel", # 1
|
||||
"troldunge": "troldeunge", # 1
|
||||
"trommestikke": "trommestik", # 1
|
||||
"trubadur": "troubadour", # 2
|
||||
"trøstepræmie": "trøstpræmie", # 2
|
||||
"tummerum": "trummerum", # 1
|
||||
"tumultuarisk": "tumultarisk", # 1
|
||||
"tunghørighed": "tunghørhed", # 1
|
||||
"tus": "tusch", # 2
|
||||
"tusind": "tusinde", # 2
|
||||
"tvillingbroder": "tvillingebror", # 1
|
||||
"tvillingbror": "tvillingebror", # 1
|
||||
"tvillingebroder": "tvillingebror", # 1
|
||||
"ubeheftet": "ubehæftet", # 1
|
||||
"udenrigsministerium": "udenrigsministerie", # 1
|
||||
"udhulning": "udhuling", # 1
|
||||
"udslaggivende": "udslagsgivende", # 1
|
||||
"udspekulert": "udspekuleret", # 1
|
||||
"udviklingsministerium": "udviklingsministerie", # 1
|
||||
"uforpligtigende": "uforpligtende", # 1
|
||||
"uheldvarslende": "uheldsvarslende", # 1
|
||||
"uimponerthed": "uimponerethed", # 1
|
||||
"undervisningsministerium": "undervisningsministerie", # 1
|
||||
"unægtelig": "unægteligt", # 1
|
||||
"urinale": "urinal", # 1
|
||||
"uvederheftig": "uvederhæftig", # 1
|
||||
"vabel": "vable", # 2
|
||||
"vadi": "wadi", # 1
|
||||
"vaklevorn": "vakkelvorn", # 1
|
||||
"vanadin": "vanadium", # 1
|
||||
"vaselin": "vaseline", # 1
|
||||
"vederheftig": "vederhæftig", # 1
|
||||
"vedhefte": "vedhæfte", # 1
|
||||
"velar": "velær", # 1
|
||||
"videndeling": "vidensdeling", # 2
|
||||
"vinkelanførelsestegn": "vinkelanførselstegn", # 1
|
||||
"vipstjært": "vipstjert", # 1
|
||||
"vismut": "bismut", # 1
|
||||
"visvas": "vissevasse", # 1
|
||||
"voksværk": "vokseværk", # 1
|
||||
"værtdyr": "værtsdyr", # 1
|
||||
"værtplante": "værtsplante", # 1
|
||||
"wienersnitsel": "wienerschnitzel", # 1
|
||||
"yderliggående": "yderligtgående", # 2
|
||||
"zombi": "zombie", # 1
|
||||
"ægbakke": "æggebakke", # 1
|
||||
"ægformet": "æggeformet", # 1
|
||||
"ægleder": "æggeleder", # 1
|
||||
"ækvilibrist": "ekvilibrist", # 2
|
||||
"æselsøre": "æseløre", # 1
|
||||
"øjehule": "øjenhule", # 1
|
||||
"øjelåg": "øjenlåg", # 1
|
||||
"øjeåbner": "øjenåbner", # 1
|
||||
"økonomiministerium": "økonomiministerie", # 1
|
||||
"ørenring": "ørering", # 2
|
||||
"øvehefte": "øvehæfte", # 1
}


NORM_EXCEPTIONS = {}

for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm

@@ -6,7 +6,7 @@ Source: https://forkortelse.dk/ and various others.

from __future__ import unicode_literals

from ...symbols import ORTH, LEMMA, NORM, TAG, PUNCT
from ...symbols import ORTH, LEMMA, NORM


_exc = {}

@@ -52,7 +52,7 @@ for exc_data in [
    {ORTH: "Ons.", LEMMA: "onsdag"},
    {ORTH: "Fre.", LEMMA: "fredag"},
    {ORTH: "Lør.", LEMMA: "lørdag"},
    {ORTH: "og/eller", LEMMA: "og/eller", NORM: "og/eller", TAG: "CC"},
    {ORTH: "og/eller", LEMMA: "og/eller", NORM: "og/eller"},
]:
    _exc[exc_data[ORTH]] = [exc_data]


@@ -577,7 +577,7 @@ for h in range(1, 31 + 1):
    for period in ["."]:
        _exc["%d%s" % (h, period)] = [{ORTH: "%d." % h}]

_custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: ".", TAG: PUNCT}]}
_custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: "."}]}
_exc.update(_custom_base_exc)

TOKENIZER_EXCEPTIONS = _exc

@@ -2,7 +2,6 @@
from __future__ import unicode_literals

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
from .punctuation import TOKENIZER_INFIXES
from .tag_map import TAG_MAP
@@ -10,18 +9,14 @@ from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
from ...attrs import LANG
from ...util import update_exc


class GermanDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "de"
    lex_attr_getters[NORM] = add_lookups(
        Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
    )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    prefixes = TOKENIZER_PREFIXES
    suffixes = TOKENIZER_SUFFIXES

@@ -1,16 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

# Here we only want to include the absolute most common words. Otherwise,
# this list would get impossibly long for German – especially considering the
# old vs. new spelling rules, and all possible cases.


_exc = {"daß": "dass"}


NORM_EXCEPTIONS = {}

for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm

@@ -47,7 +47,7 @@ kleines kommen kommt können könnt konnte könnte konnten kurz
lang lange leicht leider lieber los

machen macht machte mag magst man manche manchem manchen mancher manches mehr
mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
mögen möglich mögt morgen muss muß müssen musst müsst musste mussten

na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter

@@ -2,9 +2,10 @@
from __future__ import unicode_literals

from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors


def noun_chunks(obj):
def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """
@@ -27,13 +28,17 @@ def noun_chunks(obj):
        "og",
        "app",
    ]
    doc = obj.doc  # Ensure works on both Doc and Span.
    doc = doclike.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_label = doc.vocab.strings.add("NP")
    np_deps = set(doc.vocab.strings.add(label) for label in labels)
    close_app = doc.vocab.strings.add("nk")

    rbracket = 0
    for i, word in enumerate(obj):
    for i, word in enumerate(doclike):
        if i < rbracket:
            continue
        if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:

@@ -10,21 +10,16 @@ from .lemmatizer import GreekLemmatizer
from .syntax_iterators import SYNTAX_ITERATORS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...lookups import Lookups
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
from ...attrs import LANG
from ...util import update_exc


class GreekDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "el"
    lex_attr_getters[NORM] = add_lookups(
        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
    )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    tag_map = TAG_MAP

File diff suppressed because it is too large

@@ -2,9 +2,10 @@
from __future__ import unicode_literals

from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors


def noun_chunks(obj):
def noun_chunks(doclike):
    """
    Detect base noun phrases. Works on both Doc and Span.
    """
@@ -13,34 +14,34 @@ def noun_chunks(obj):
    # obj tag corrects some DEP tagger mistakes.
    # Further improvement of the models will eliminate the need for this tag.
    labels = ["nsubj", "obj", "iobj", "appos", "ROOT", "obl"]
    doc = obj.doc  # Ensure works on both Doc and Span.
    doc = doclike.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    nmod = doc.vocab.strings.add("nmod")
    np_label = doc.vocab.strings.add("NP")
    seen = set()
    for i, word in enumerate(obj):
    prev_end = -1
    for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.i in seen:
        if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
            if any(w.i in seen for w in word.subtree):
                continue
            flag = False
            if word.pos == NOUN:
                # check for patterns such as γραμμή παραγωγής
                for potential_nmod in word.rights:
                    if potential_nmod.dep == nmod:
                        seen.update(
                            j for j in range(word.left_edge.i, potential_nmod.i + 1)
                        )
                        prev_end = potential_nmod.i
                        yield word.left_edge.i, potential_nmod.i + 1, np_label
                        flag = True
                        break
            if flag is False:
                seen.update(j for j in range(word.left_edge.i, word.i + 1))
                prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            # covers the case: έχει όμορφα και έξυπνα παιδιά
@@ -49,9 +50,7 @@ def noun_chunks(obj):
            head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                if any(w.i in seen for w in word.subtree):
                    continue
                seen.update(j for j in range(word.left_edge.i, word.i + 1))
                prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label

@@ -2,7 +2,6 @@
from __future__ import unicode_literals

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
@@ -10,10 +9,9 @@ from .morph_rules import MORPH_RULES
from .syntax_iterators import SYNTAX_ITERATORS

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
from ...attrs import LANG
from ...util import update_exc


def _return_en(_):

@@ -24,9 +22,6 @@ class EnglishDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = _return_en
    lex_attr_getters[NORM] = add_lookups(
        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
    )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    tag_map = TAG_MAP
    stop_words = STOP_WORDS

File diff suppressed because it is too large

@@ -2,9 +2,10 @@
from __future__ import unicode_literals

from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors


def noun_chunks(obj):
def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """
@@ -19,21 +20,23 @@ def noun_chunks(obj):
        "attr",
        "ROOT",
    ]
    doc = obj.doc  # Ensure works on both Doc and Span.
    doc = doclike.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
    seen = set()
    for i, word in enumerate(obj):
    prev_end = -1
    for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.i in seen:
        if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
            if any(w.i in seen for w in word.subtree):
                continue
            seen.update(j for j in range(word.left_edge.i, word.i + 1))
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            head = word.head
@@ -41,9 +44,7 @@ def noun_chunks(obj):
            head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                if any(w.i in seen for w in word.subtree):
                    continue
                seen.update(j for j in range(word.left_edge.i, word.i + 1))
                prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label

@@ -77,12 +77,12 @@ for pron in ["i", "you", "he", "she", "it", "we", "they"]:

        _exc[orth + "'d"] = [
            {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
            {ORTH: "'d", LEMMA: "would", NORM: "would", TAG: "MD"},
            {ORTH: "'d", NORM: "'d"},
        ]

        _exc[orth + "d"] = [
            {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
            {ORTH: "d", LEMMA: "would", NORM: "would", TAG: "MD"},
            {ORTH: "d", NORM: "'d"},
        ]

        _exc[orth + "'d've"] = [

@@ -195,7 +195,10 @@ for word in ["who", "what", "when", "where", "why", "how", "there", "that"]:
            {ORTH: "'d", NORM: "'d"},
        ]

        _exc[orth + "d"] = [{ORTH: orth, LEMMA: word, NORM: word}, {ORTH: "d"}]
        _exc[orth + "d"] = [
            {ORTH: orth, LEMMA: word, NORM: word},
            {ORTH: "d", NORM: "'d"},
        ]

        _exc[orth + "'d've"] = [
            {ORTH: orth, LEMMA: word, NORM: word},

@@ -5,7 +5,6 @@ from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES
from ..char_classes import LIST_ICONS, CURRENCY, LIST_UNITS, PUNCT
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
from ..char_classes import merge_chars
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES


_list_units = [u for u in LIST_UNITS if u != "%"]

@@ -2,10 +2,15 @@
from __future__ import unicode_literals

from ...symbols import NOUN, PROPN, PRON, VERB, AUX
from ...errors import Errors


def noun_chunks(obj):
    doc = obj.doc
def noun_chunks(doclike):
    doc = doclike.doc

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    if not len(doc):
        return
    np_label = doc.vocab.strings.add("NP")
@@ -16,7 +21,7 @@ def noun_chunks(obj):
    np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
    stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
    token = doc[0]
    while token and token.i < len(doc):
    while token and token.i < len(doclike):
        if token.pos in [PROPN, NOUN, PRON]:
            left, right = noun_bounds(
                doc, token, np_left_deps, np_right_deps, stop_deps

@@ -10,6 +10,7 @@ from .lex_attrs import LEX_ATTRS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP
from .punctuation import TOKENIZER_SUFFIXES
from .syntax_iterators import SYNTAX_ITERATORS


class PersianDefaults(Language.Defaults):

@@ -24,6 +25,7 @@ class PersianDefaults(Language.Defaults):
    tag_map = TAG_MAP
    suffixes = TOKENIZER_SUFFIXES
    writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
    syntax_iterators = SYNTAX_ITERATORS


class Persian(Language):

@@ -2,9 +2,10 @@
from __future__ import unicode_literals

from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors


def noun_chunks(obj):
def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """
@@ -19,21 +20,23 @@ def noun_chunks(obj):
        "attr",
        "ROOT",
    ]
    doc = obj.doc  # Ensure works on both Doc and Span.
    doc = doclike.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
    seen = set()
    for i, word in enumerate(obj):
    prev_end = -1
    for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.i in seen:
        if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
            if any(w.i in seen for w in word.subtree):
                continue
            seen.update(j for j in range(word.left_edge.i, word.i + 1))
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            head = word.head
@@ -41,9 +44,7 @@ def noun_chunks(obj):
            head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                if any(w.i in seen for w in word.subtree):
                    continue
                seen.update(j for j in range(word.left_edge.i, word.i + 1))
                prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label

@@ -2,9 +2,10 @@
from __future__ import unicode_literals

from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors


def noun_chunks(obj):
def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """
@@ -18,21 +19,23 @@ def noun_chunks(obj):
        "nmod",
        "nmod:poss",
    ]
    doc = obj.doc  # Ensure works on both Doc and Span.
    doc = doclike.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_deps = [doc.vocab.strings[label] for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
    seen = set()
    for i, word in enumerate(obj):
    prev_end = -1
    for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.i in seen:
        if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
            if any(w.i in seen for w in word.subtree):
                continue
            seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
            prev_end = word.right_edge.i
            yield word.left_edge.i, word.right_edge.i + 1, np_label
        elif word.dep == conj:
            head = word.head
@@ -40,9 +43,7 @@ def noun_chunks(obj):
            head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                if any(w.i in seen for w in word.subtree):
                    continue
                seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
                prev_end = word.right_edge.i
                yield word.left_edge.i, word.right_edge.i + 1, np_label

@@ -457,5 +457,5 @@ _regular_exp += [

TOKENIZER_EXCEPTIONS = _exc
TOKEN_MATCH = re.compile(
    "(?iu)" + "|".join("(?:{})".format(m) for m in _regular_exp)
    "(?iu)" + "|".join("(?:{})".format(m) for m in _regular_exp)
).match

@@ -3,7 +3,7 @@ from __future__ import unicode_literals

STOP_WORDS = set(
    """
એમ
એમ
આ
એ
રહી
@@ -24,7 +24,7 @@ STOP_WORDS = set(
તેમને
તેમના
તેમણે
તેમનું
તેમનું
તેમાં
અને
અહીં
@@ -33,12 +33,12 @@ STOP_WORDS = set(
થાય
જે
ને
કે
કે
ના
ની
નો
ને
નું
નું
શું
માં
પણ
@@ -69,12 +69,12 @@ STOP_WORDS = set(
કોઈ
કેમ
કર્યો
કર્યુ
કર્યુ
કરે
સૌથી
ત્યારબાદ
ત્યારબાદ
તથા
દ્વારા
દ્વારા
જુઓ
જાઓ
જ્યારે

@@ -1,11 +1,12 @@
# coding: utf8
from __future__ import unicode_literals

from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP


from ...attrs import LANG
from ...language import Language
from ...tokens import Doc


class ArmenianDefaults(Language.Defaults):

@@ -1,6 +1,6 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.hy.examples import sentences

@@ -1,3 +1,4 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM

@@ -1,6 +1,6 @@
# coding: utf8
from __future__ import unicode_literals


STOP_WORDS = set(
    """
նա
@@ -105,6 +105,6 @@ STOP_WORDS = set(
յուրաքանչյուր
այս
մեջ
թ
թ
""".split()
)

@@ -1,7 +1,7 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import POS, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN
from ...symbols import POS, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN
from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ

TAG_MAP = {
@@ -716,7 +716,7 @@ TAG_MAP = {
        POS: NOUN,
        "Animacy": "Nhum",
        "Case": "Dat",
        "Number": "Coll",
        # "Number": "Coll",
        "Number": "Sing",
        "Person": "1",
    },
@@ -815,7 +815,7 @@ TAG_MAP = {
        "Animacy": "Nhum",
        "Case": "Nom",
        "Definite": "Def",
        "Number": "Plur",
        # "Number": "Plur",
        "Number": "Sing",
        "Poss": "Yes",
    },
@@ -880,7 +880,7 @@ TAG_MAP = {
        POS: NOUN,
        "Animacy": "Nhum",
        "Case": "Nom",
        "Number": "Plur",
        # "Number": "Plur",
        "Number": "Sing",
        "Person": "2",
    },
@@ -1223,9 +1223,9 @@ TAG_MAP = {
    "PRON_Case=Nom|Number=Sing|Number=Plur|Person=3|Person=1|PronType=Emp": {
        POS: PRON,
        "Case": "Nom",
        "Number": "Sing",
        # "Number": "Sing",
        "Number": "Plur",
        "Person": "3",
        # "Person": "3",
        "Person": "1",
        "PronType": "Emp",
    },

@@ -4,25 +4,20 @@ from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .norm_exceptions import NORM_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from .syntax_iterators import SYNTAX_ITERATORS
from .tag_map import TAG_MAP

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
from ...attrs import LANG
from ...util import update_exc


class IndonesianDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "id"
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[NORM] = add_lookups(
        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
    )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    prefixes = TOKENIZER_PREFIXES

@@ -1,532 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

# Daftar kosakata yang sering salah dieja
# https://id.wikipedia.org/wiki/Wikipedia:Daftar_kosakata_bahasa_Indonesia_yang_sering_salah_dieja
_exc = {
# Slang and abbreviations
"silahkan": "silakan",
|
||||
"yg": "yang",
|
||||
"kalo": "kalau",
|
||||
"cawu": "caturwulan",
|
||||
"ok": "oke",
|
||||
"gak": "tidak",
|
||||
"enggak": "tidak",
|
||||
"nggak": "tidak",
|
||||
"ndak": "tidak",
|
||||
"ngga": "tidak",
|
||||
"dgn": "dengan",
|
||||
"tdk": "tidak",
|
||||
"jg": "juga",
|
||||
"klo": "kalau",
|
||||
"denger": "dengar",
|
||||
"pinter": "pintar",
|
||||
"krn": "karena",
|
||||
"nemuin": "menemukan",
|
||||
"jgn": "jangan",
|
||||
"udah": "sudah",
|
||||
"sy": "saya",
|
||||
"udh": "sudah",
|
||||
"dapetin": "mendapatkan",
|
||||
"ngelakuin": "melakukan",
|
||||
"ngebuat": "membuat",
|
||||
"membikin": "membuat",
|
||||
"bikin": "buat",
|
||||
# Daftar kosakata yang sering salah dieja
|
||||
"malpraktik": "malapraktik",
|
||||
"malfungsi": "malafungsi",
|
||||
"malserap": "malaserap",
|
||||
"maladaptasi": "malaadaptasi",
|
||||
"malsuai": "malasuai",
|
||||
"maldistribusi": "maladistribusi",
|
||||
"malgizi": "malagizi",
|
||||
"malsikap": "malasikap",
|
||||
"memperhatikan": "memerhatikan",
|
||||
"akte": "akta",
|
||||
"cemilan": "camilan",
|
||||
"esei": "esai",
|
||||
"frase": "frasa",
|
||||
"kafeteria": "kafetaria",
|
||||
"ketapel": "katapel",
|
||||
"kenderaan": "kendaraan",
|
||||
"menejemen": "manajemen",
|
||||
"menejer": "manajer",
|
||||
"mesjid": "masjid",
|
||||
"rebo": "rabu",
|
||||
"seksama": "saksama",
|
||||
"senggama": "sanggama",
|
||||
"sekedar": "sekadar",
|
||||
"seprei": "seprai",
|
||||
"semedi": "semadi",
|
||||
"samadi": "semadi",
|
||||
"amandemen": "amendemen",
|
||||
"algoritma": "algoritme",
|
||||
"aritmatika": "aritmetika",
|
||||
"metoda": "metode",
|
||||
"materai": "meterai",
|
||||
"meterei": "meterai",
|
||||
"kalendar": "kalender",
|
||||
"kadaluwarsa": "kedaluwarsa",
|
||||
"katagori": "kategori",
|
||||
"parlamen": "parlemen",
|
||||
"sekular": "sekuler",
|
||||
"selular": "seluler",
|
||||
"sirkular": "sirkuler",
|
||||
"survai": "survei",
|
||||
"survey": "survei",
|
||||
"aktuil": "aktual",
|
||||
"formil": "formal",
|
||||
"trotoir": "trotoar",
|
||||
"komersiil": "komersial",
|
||||
"komersil": "komersial",
|
||||
"tradisionil": "tradisionial",
|
||||
"orisinil": "orisinal",
|
||||
"orijinil": "orisinal",
|
||||
"afdol": "afdal",
|
||||
"antri": "antre",
|
||||
"apotik": "apotek",
|
||||
"atlit": "atlet",
|
||||
"atmosfir": "atmosfer",
|
||||
"cidera": "cedera",
|
||||
"cendikiawan": "cendekiawan",
|
||||
"cepet": "cepat",
|
||||
"cinderamata": "cenderamata",
|
||||
"debet": "debit",
|
||||
"difinisi": "definisi",
|
||||
"dekrit": "dekret",
|
||||
"disain": "desain",
|
||||
"diskripsi": "deskripsi",
|
||||
"diskotik": "diskotek",
|
||||
"eksim": "eksem",
|
||||
"exim": "eksem",
|
||||
"faidah": "faedah",
|
||||
"ekstrim": "ekstrem",
|
||||
"ekstrimis": "ekstremis",
|
||||
"komplit": "komplet",
|
||||
"konkrit": "konkret",
|
||||
"kongkrit": "konkret",
|
||||
"kongkret": "konkret",
|
||||
"kridit": "kredit",
|
||||
"musium": "museum",
|
||||
"pinalti": "penalti",
|
||||
"piranti": "peranti",
|
||||
"pinsil": "pensil",
|
||||
"personil": "personel",
|
||||
"sistim": "sistem",
|
||||
"teoritis": "teoretis",
|
||||
"vidio": "video",
|
||||
"cengkeh": "cengkih",
|
||||
"desertasi": "disertasi",
|
||||
"hakekat": "hakikat",
|
||||
"intelejen": "intelijen",
|
||||
"kaedah": "kaidah",
|
||||
"kempes": "kempis",
|
||||
"kementrian": "kementerian",
|
||||
"ledeng": "leding",
|
||||
"nasehat": "nasihat",
|
||||
"penasehat": "penasihat",
|
||||
"praktek": "praktik",
|
||||
"praktekum": "praktikum",
|
||||
"resiko": "risiko",
|
||||
"retsleting": "ritsleting",
|
||||
"senen": "senin",
|
||||
"amuba": "ameba",
|
||||
"punggawa": "penggawa",
|
||||
"surban": "serban",
|
||||
"nomer": "nomor",
|
||||
"sorban": "serban",
|
||||
"bis": "bus",
|
||||
"agribisnis": "agrobisnis",
|
||||
"kantung": "kantong",
|
||||
"khutbah": "khotbah",
|
||||
"mandur": "mandor",
|
||||
"rubuh": "roboh",
|
||||
"pastur": "pastor",
|
||||
"supir": "sopir",
|
||||
"goncang": "guncang",
|
||||
"goa": "gua",
|
||||
"kaos": "kaus",
|
||||
"kokoh": "kukuh",
|
||||
"komulatif": "kumulatif",
|
||||
"kolomnis": "kolumnis",
|
||||
"korma": "kurma",
|
||||
"lobang": "lubang",
|
||||
"limo": "limusin",
|
||||
"limosin": "limusin",
|
||||
"mangkok": "mangkuk",
|
||||
"saos": "saus",
|
||||
"sop": "sup",
|
||||
"sorga": "surga",
|
||||
"tegor": "tegur",
|
||||
"telor": "telur",
|
||||
"obrak-abrik": "ubrak-abrik",
|
||||
"ekwivalen": "ekuivalen",
|
||||
"frekwensi": "frekuensi",
|
||||
"konsekwensi": "konsekuensi",
|
||||
"kwadran": "kuadran",
|
||||
"kwadrat": "kuadrat",
|
||||
"kwalifikasi": "kualifikasi",
|
||||
"kwalitas": "kualitas",
|
||||
"kwalitet": "kualitas",
|
||||
"kwalitatif": "kualitatif",
|
||||
"kwantitas": "kuantitas",
|
||||
"kwantitatif": "kuantitatif",
|
||||
"kwantum": "kuantum",
|
||||
"kwartal": "kuartal",
|
||||
"kwintal": "kuintal",
|
||||
"kwitansi": "kuitansi",
|
||||
"kwatir": "khawatir",
|
||||
"kuatir": "khawatir",
|
||||
"jadual": "jadwal",
|
||||
"hirarki": "hierarki",
|
||||
"karir": "karier",
|
||||
"aktip": "aktif",
|
||||
"daptar": "daftar",
|
||||
"efektip": "efektif",
|
||||
"epektif": "efektif",
|
||||
"epektip": "efektif",
|
||||
"Pebruari": "Februari",
|
||||
"pisik": "fisik",
|
||||
"pondasi": "fondasi",
|
||||
"photo": "foto",
|
||||
"photokopi": "fotokopi",
|
||||
"hapal": "hafal",
|
||||
"insap": "insaf",
|
||||
"insyaf": "insaf",
|
||||
"konperensi": "konferensi",
|
||||
"kreatip": "kreatif",
|
||||
"kreativ": "kreatif",
|
||||
"maap": "maaf",
|
||||
"napsu": "nafsu",
|
||||
"negatip": "negatif",
|
||||
"negativ": "negatif",
|
||||
"objektip": "objektif",
|
||||
"obyektip": "objektif",
|
||||
"obyektif": "objektif",
|
||||
"pasip": "pasif",
|
||||
"pasiv": "pasif",
|
||||
"positip": "positif",
|
||||
"positiv": "positif",
|
||||
"produktip": "produktif",
|
||||
"produktiv": "produktif",
|
||||
"sarap": "saraf",
|
||||
"sertipikat": "sertifikat",
|
||||
"subjektip": "subjektif",
|
||||
"subyektip": "subjektif",
|
||||
"subyektif": "subjektif",
|
||||
"tarip": "tarif",
|
||||
"transitip": "transitif",
|
||||
"transitiv": "transitif",
|
||||
"faham": "paham",
|
||||
"fikir": "pikir",
|
||||
"berfikir": "berpikir",
|
||||
"telefon": "telepon",
|
||||
"telfon": "telepon",
|
||||
"telpon": "telepon",
|
||||
"tilpon": "telepon",
|
||||
"nafas": "napas",
|
||||
"bernafas": "bernapas",
|
||||
"pernafasan": "pernapasan",
|
||||
"vermak": "permak",
|
||||
"vulpen": "pulpen",
|
||||
"aktifis": "aktivis",
|
||||
"konfeksi": "konveksi",
|
||||
"motifasi": "motivasi",
|
||||
"Nopember": "November",
|
||||
"propinsi": "provinsi",
|
||||
"babtis": "baptis",
|
||||
"jerembab": "jerembap",
|
||||
"lembab": "lembap",
|
||||
"sembab": "sembap",
|
||||
"saptu": "sabtu",
|
||||
"tekat": "tekad",
|
||||
"bejad": "bejat",
|
||||
"nekad": "nekat",
|
||||
"otoped": "otopet",
|
||||
"skuad": "skuat",
|
||||
"jenius": "genius",
|
||||
"marjin": "margin",
|
||||
"marjinal": "marginal",
|
||||
"obyek": "objek",
|
||||
"subyek": "subjek",
|
||||
"projek": "proyek",
|
||||
"azas": "asas",
|
||||
"ijasah": "ijazah",
|
||||
"jenasah": "jenazah",
|
||||
"plasa": "plaza",
|
||||
"bathin": "batin",
|
||||
"Katholik": "Katolik",
|
||||
"orthografi": "ortografi",
|
||||
"pathogen": "patogen",
|
||||
"theologi": "teologi",
|
||||
"ijin": "izin",
|
||||
"rejeki": "rezeki",
|
||||
"rejim": "rezim",
|
||||
"jaman": "zaman",
|
||||
"jamrud": "zamrud",
|
||||
"jinah": "zina",
|
||||
"perjinahan": "perzinaan",
|
||||
"anugrah": "anugerah",
|
||||
"cendrawasih": "cenderawasih",
|
||||
"jendral": "jenderal",
|
||||
"kripik": "keripik",
|
||||
"krupuk": "kerupuk",
|
||||
"ksatria": "kesatria",
|
||||
"mentri": "menteri",
|
||||
"negri": "negeri",
|
||||
"Prancis": "Perancis",
|
||||
"sebrang": "seberang",
|
||||
"menyebrang": "menyeberang",
|
||||
"Sumatra": "Sumatera",
|
||||
"trampil": "terampil",
|
||||
"isteri": "istri",
|
||||
"justeru": "justru",
|
||||
"perajurit": "prajurit",
|
||||
"putera": "putra",
|
||||
"puteri": "putri",
|
||||
"samudera": "samudra",
|
||||
"sastera": "sastra",
|
||||
"sutera": "sutra",
|
||||
"terompet": "trompet",
|
||||
"iklas": "ikhlas",
|
||||
"iktisar": "ikhtisar",
|
||||
"kafilah": "khafilah",
|
||||
"kawatir": "khawatir",
|
||||
"kotbah": "khotbah",
|
||||
"kusyuk": "khusyuk",
|
||||
"makluk": "makhluk",
|
||||
"mahluk": "makhluk",
|
||||
"mahkluk": "makhluk",
|
||||
"nahkoda": "nakhoda",
|
||||
"nakoda": "nakhoda",
|
||||
"tahta": "takhta",
|
||||
"takhyul": "takhayul",
|
||||
"tahyul": "takhayul",
|
||||
"tahayul": "takhayul",
|
||||
"akhli": "ahli",
|
||||
"anarkhi": "anarki",
|
||||
"kharisma": "karisma",
|
||||
"kharismatik": "karismatik",
|
||||
"mahsud": "maksud",
|
||||
"makhsud": "maksud",
|
||||
"rakhmat": "rahmat",
|
||||
"tekhnik": "teknik",
|
||||
"tehnik": "teknik",
|
||||
"tehnologi": "teknologi",
|
||||
"ikhwal": "ihwal",
|
||||
"expor": "ekspor",
|
||||
"extra": "ekstra",
|
||||
"komplex": "komplek",
|
||||
"sex": "seks",
|
||||
"taxi": "taksi",
|
||||
"extasi": "ekstasi",
|
||||
"syaraf": "saraf",
|
||||
"syurga": "surga",
|
||||
"mashur": "masyhur",
|
||||
"masyur": "masyhur",
|
||||
"mahsyur": "masyhur",
|
||||
"mashyur": "masyhur",
|
||||
"muadzin": "muazin",
|
||||
"adzan": "azan",
|
||||
"ustadz": "ustaz",
|
||||
"ustad": "ustaz",
|
||||
"ustadzah": "ustaz",
|
||||
"dzikir": "zikir",
|
||||
"dzuhur": "zuhur",
|
||||
"dhuhur": "zuhur",
|
||||
"zhuhur": "zuhur",
|
||||
"analisa": "analisis",
|
||||
"diagnosa": "diagnosis",
|
||||
"hipotesa": "hipotesis",
|
||||
"sintesa": "sintesis",
|
||||
"aktiviti": "aktivitas",
|
||||
"aktifitas": "aktivitas",
|
||||
"efektifitas": "efektivitas",
|
||||
"komuniti": "komunitas",
|
||||
"kreatifitas": "kreativitas",
|
||||
"produktifitas": "produktivitas",
|
||||
"realiti": "realitas",
|
||||
"realita": "realitas",
|
||||
"selebriti": "selebritas",
|
||||
"spotifitas": "sportivitas",
|
||||
"universiti": "universitas",
|
||||
"utiliti": "utilitas",
|
||||
"validiti": "validitas",
|
||||
"dilokalisir": "dilokalisasi",
|
||||
"didramatisir": "didramatisasi",
|
||||
"dipolitisir": "dipolitisasi",
|
||||
"dinetralisir": "dinetralisasi",
|
||||
"dikonfrontir": "dikonfrontasi",
|
||||
"mendominir": "mendominasi",
|
||||
"koordinir": "koordinasi",
|
||||
"proklamir": "proklamasi",
|
||||
"terorganisir": "terorganisasi",
|
||||
"terealisir": "terealisasi",
|
||||
"robah": "ubah",
|
||||
"dirubah": "diubah",
|
||||
"merubah": "mengubah",
|
||||
"terlanjur": "telanjur",
|
||||
"terlantar": "telantar",
|
||||
"penglepasan": "pelepasan",
|
||||
"pelihatan": "penglihatan",
|
||||
"pemukiman": "permukiman",
|
||||
"pengrumahan": "perumahan",
|
||||
"penyewaan": "persewaan",
|
||||
"menyintai": "mencintai",
|
||||
"menyolok": "mencolok",
|
||||
"contek": "sontek",
|
||||
"mencontek": "menyontek",
|
||||
"pungkir": "mungkir",
|
||||
"dipungkiri": "dimungkiri",
|
||||
"kupungkiri": "kumungkiri",
|
||||
"kaupungkiri": "kaumungkiri",
|
||||
"nampak": "tampak",
|
||||
"nampaknya": "tampaknya",
|
||||
"nongkrong": "tongkrong",
|
||||
"berternak": "beternak",
|
||||
"berterbangan": "beterbangan",
|
||||
"berserta": "beserta",
|
||||
"berperkara": "beperkara",
|
||||
"berpergian": "bepergian",
|
||||
"berkerja": "bekerja",
|
||||
"berberapa": "beberapa",
|
||||
"terbersit": "tebersit",
|
||||
"terpercaya": "tepercaya",
|
||||
"terperdaya": "teperdaya",
|
||||
"terpercik": "tepercik",
|
||||
"terpergok": "tepergok",
|
||||
"aksesoris": "aksesori",
|
||||
"handal": "andal",
|
||||
"hantar": "antar",
|
||||
"panutan": "anutan",
|
||||
"atsiri": "asiri",
|
||||
"bhakti": "bakti",
|
||||
"china": "cina",
|
||||
"dharma": "darma",
|
||||
"diktaktor": "diktator",
|
||||
"eksport": "ekspor",
|
||||
"hembus": "embus",
|
||||
"hadits": "hadis",
|
||||
"hadist": "hadits",
|
||||
"harafiah": "harfiah",
|
||||
"himbau": "imbau",
|
||||
"import": "impor",
|
||||
"inget": "ingat",
|
||||
"hisap": "isap",
|
||||
"interprestasi": "interpretasi",
|
||||
"kangker": "kanker",
|
||||
"konggres": "kongres",
|
||||
"lansekap": "lanskap",
|
||||
"maghrib": "magrib",
|
||||
"emak": "mak",
|
||||
"moderen": "modern",
|
||||
"pasport": "paspor",
|
||||
"perduli": "peduli",
|
||||
"ramadhan": "ramadan",
|
||||
"rapih": "rapi",
|
||||
"Sansekerta": "Sanskerta",
|
||||
"shalat": "salat",
|
||||
"sholat": "salat",
|
||||
"silahkan": "silakan",
|
||||
"standard": "standar",
|
||||
"hutang": "utang",
|
||||
"zinah": "zina",
|
||||
"ambulan": "ambulans",
|
||||
"antartika": "sntarktika",
|
||||
"arteri": "arteria",
|
||||
"asik": "asyik",
|
||||
"australi": "australia",
|
||||
"denga": "dengan",
|
||||
"depo": "depot",
|
||||
"detil": "detail",
|
||||
"ensiklopedi": "ensiklopedia",
|
||||
"elit": "elite",
|
||||
"frustasi": "frustrasi",
|
||||
"gladi": "geladi",
|
||||
"greget": "gereget",
|
||||
"itali": "italia",
|
||||
"karna": "karena",
|
||||
"klenteng": "kelenteng",
|
||||
"erling": "kerling",
|
||||
"kontruksi": "konstruksi",
|
||||
"masal": "massal",
|
||||
"merk": "merek",
|
||||
"respon": "respons",
|
||||
"diresponi": "direspons",
|
||||
"skak": "sekak",
|
||||
"stir": "setir",
|
||||
"singapur": "singapura",
|
||||
"standarisasi": "standardisasi",
|
||||
"varitas": "varietas",
|
||||
"amphibi": "amfibi",
|
||||
"anjlog": "anjlok",
|
||||
"alpukat": "avokad",
|
||||
"alpokat": "avokad",
|
||||
"bolpen": "pulpen",
|
||||
"cabe": "cabai",
|
||||
"cabay": "cabai",
|
||||
"ceret": "cerek",
|
||||
"differensial": "diferensial",
|
||||
"duren": "durian",
|
||||
"faksimili": "faksimile",
|
||||
"faksimil": "faksimile",
|
||||
"graha": "gerha",
|
||||
"goblog": "goblok",
|
||||
"gombrong": "gombroh",
|
||||
"horden": "gorden",
|
||||
"korden": "gorden",
|
||||
"gubug": "gubuk",
|
||||
"imaginasi": "imajinasi",
|
||||
"jerigen": "jeriken",
|
||||
"jirigen": "jeriken",
|
||||
"carut-marut": "karut-marut",
|
||||
"kwota": "kuota",
|
||||
"mahzab": "mazhab",
|
||||
"mempesona": "memesona",
|
||||
"milyar": "miliar",
|
||||
"missi": "misi",
|
||||
"nenas": "nanas",
|
||||
"negoisasi": "negosiasi",
|
||||
"automotif": "otomotif",
|
||||
"pararel": "paralel",
|
||||
"paska": "pasca",
|
||||
"prosen": "persen",
|
||||
"pete": "petai",
|
||||
"petay": "petai",
|
||||
"proffesor": "profesor",
|
||||
"rame": "ramai",
|
||||
"rapot": "rapor",
|
||||
"rileks": "relaks",
|
||||
"rileksasi": "relaksasi",
|
||||
"renumerasi": "remunerasi",
|
||||
"seketaris": "sekretaris",
|
||||
"sekertaris": "sekretaris",
|
||||
"sensorik": "sensoris",
|
||||
"sentausa": "sentosa",
|
||||
"strawberi": "stroberi",
|
||||
"strawbery": "stroberi",
|
||||
"taqwa": "takwa",
|
||||
"tauco": "taoco",
|
||||
"tauge": "taoge",
|
||||
"toge": "taoge",
|
||||
"tauladan": "teladan",
|
||||
"taubat": "tobat",
|
||||
"trilyun": "triliun",
|
||||
"vissi": "visi",
|
||||
"coklat": "cokelat",
|
||||
"narkotika": "narkotik",
|
||||
"oase": "oasis",
|
||||
"politisi": "politikus",
|
||||
"terong": "terung",
|
||||
"wool": "wol",
|
||||
"himpit": "impit",
|
||||
"mujizat": "mukjizat",
|
||||
"mujijat": "mukjizat",
|
||||
"yag": "yang",
|
||||
}
|
||||
|
||||
NORM_EXCEPTIONS = {}
|
||||
|
||||
for string, norm in _exc.items():
|
||||
NORM_EXCEPTIONS[string] = norm
|
||||
NORM_EXCEPTIONS[string.title()] = norm
|
|
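The loop above registers each misspelling under both its original and title-cased form. As a rough illustration (the `normalize` helper below is hypothetical, not spaCy API; the two entries are copied from the table above), such a table is meant to be consulted per token, falling back to the surface form:

```python
# Minimal sketch: apply a norm-exceptions table that was expanded
# with title-cased keys (helper function is illustrative only).
_exc = {"silahkan": "silakan", "standard": "standar"}

NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm

def normalize(token_text):
    # Fall back to the surface form when no exception is listed.
    return NORM_EXCEPTIONS.get(token_text, token_text)

print(normalize("Silahkan"))  # -> "silakan" (title-cased variant is covered)
print(normalize("kucing"))    # -> "kucing" (unchanged)
```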
@ -2,9 +2,10 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import NOUN, PROPN, PRON
|
||||
from ...errors import Errors
|
||||
|
||||
|
||||
def noun_chunks(obj):
|
||||
def noun_chunks(doclike):
|
||||
"""
|
||||
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
|
||||
"""
|
||||
|
@ -18,21 +19,23 @@ def noun_chunks(obj):
|
|||
"nmod",
|
||||
"nmod:poss",
|
||||
]
|
||||
doc = obj.doc # Ensure works on both Doc and Span.
|
||||
doc = doclike.doc # Ensure works on both Doc and Span.
|
||||
|
||||
if not doc.is_parsed:
|
||||
raise ValueError(Errors.E029)
|
||||
|
||||
np_deps = [doc.vocab.strings[label] for label in labels]
|
||||
conj = doc.vocab.strings.add("conj")
|
||||
np_label = doc.vocab.strings.add("NP")
|
||||
seen = set()
|
||||
for i, word in enumerate(obj):
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if word.pos not in (NOUN, PROPN, PRON):
|
||||
continue
|
||||
# Prevent nested chunks from being produced
|
||||
if word.i in seen:
|
||||
if word.left_edge.i <= prev_end:
|
||||
continue
|
||||
if word.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
elif word.dep == conj:
|
||||
head = word.head
|
||||
|
@ -40,9 +43,7 @@ def noun_chunks(obj):
|
|||
head = head.head
|
||||
# If the head is an NP, and we're coordinated to it, we're an NP
|
||||
if head.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
|
||||
|
||||
|
|
|
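The `noun_chunks` change above swaps the `seen` index set for a single `prev_end` marker: a candidate is skipped if its left edge falls inside the previously yielded span. A minimal standalone sketch of that bookkeeping, assuming candidates arrive left to right as they do in the iterator:

```python
# Sketch of the prev_end bookkeeping: yield only spans that start
# past the right edge of the previously emitted span.
def filter_nested(candidate_spans):
    prev_end = -1
    for left_edge, right_edge in candidate_spans:  # visited left to right
        if left_edge <= prev_end:
            continue  # nested in (or overlapping) the last span: skip
        prev_end = right_edge
        yield left_edge, right_edge + 1

print(list(filter_nested([(0, 2), (1, 1), (3, 5)])))  # [(0, 3), (3, 6)]
```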
@@ -9,8 +9,8 @@ Example sentences to test spaCy and its language models.
"""

sentences = [
    "애플이 영국의 신생 기업을 10억 달러에 구매를 고려중이다.",
    "자동 운전 자동차의 손해 배상 책임에 자동차 메이커에 일정한 부담을 요구하겠다.",
    "자동 배달 로봇이 보도를 주행하는 것을 샌프란시스코시가 금지를 검토중이라고 합니다.",
    "애플이 영국의 스타트업을 10억 달러에 인수하는 것을 알아보고 있다.",
    "자율주행 자동차의 손해 배상 책임이 제조 업체로 옮겨 가다",
    "샌프란시스코 시가 자동 배달 로봇의 보도 주행 금지를 검토 중이라고 합니다.",
    "런던은 영국의 수도이자 가장 큰 도시입니다.",
]
|
||||
|
|
|
@ -2,26 +2,21 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .norm_exceptions import NORM_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG, NORM
|
||||
from ...util import update_exc, add_lookups
|
||||
from ...attrs import LANG
|
||||
from ...util import update_exc
|
||||
|
||||
|
||||
class LuxembourgishDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda text: "lb"
|
||||
lex_attr_getters[NORM] = add_lookups(
|
||||
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
tag_map = TAG_MAP
|
||||
|
|
|
@ -1,16 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# TODO
|
||||
# norm exceptions: find a possibility to deal with the zillions of spelling
|
||||
# variants (vläicht = vlaicht, vleicht, viläicht, viläischt, etc. etc.)
|
||||
# here one could include the most common spelling mistakes
|
||||
|
||||
_exc = {"dass": "datt", "viläicht": "vläicht"}
|
||||
|
||||
|
||||
NORM_EXCEPTIONS = {}
|
||||
|
||||
for string, norm in _exc.items():
|
||||
NORM_EXCEPTIONS[string] = norm
|
||||
NORM_EXCEPTIONS[string.title()] = norm
|
|
@ -186,10 +186,6 @@ def suffix(string):
|
|||
return string[-3:]
|
||||
|
||||
|
||||
def cluster(string):
|
||||
return 0
|
||||
|
||||
|
||||
def is_alpha(string):
|
||||
return string.isalpha()
|
||||
|
||||
|
@ -218,20 +214,11 @@ def is_stop(string, stops=set()):
|
|||
return string.lower() in stops
|
||||
|
||||
|
||||
def is_oov(string):
|
||||
return True
|
||||
|
||||
|
||||
def get_prob(string):
|
||||
return -20.0
|
||||
|
||||
|
||||
LEX_ATTRS = {
|
||||
attrs.LOWER: lower,
|
||||
attrs.NORM: lower,
|
||||
attrs.PREFIX: prefix,
|
||||
attrs.SUFFIX: suffix,
|
||||
attrs.CLUSTER: cluster,
|
||||
attrs.IS_ALPHA: is_alpha,
|
||||
attrs.IS_DIGIT: is_digit,
|
||||
attrs.IS_LOWER: is_lower,
|
||||
|
@ -239,8 +226,6 @@ LEX_ATTRS = {
|
|||
attrs.IS_TITLE: is_title,
|
||||
attrs.IS_UPPER: is_upper,
|
||||
attrs.IS_STOP: is_stop,
|
||||
attrs.IS_OOV: is_oov,
|
||||
attrs.PROB: get_prob,
|
||||
attrs.LIKE_EMAIL: like_email,
|
||||
attrs.LIKE_NUM: like_num,
|
||||
attrs.IS_PUNCT: is_punct,
|
||||
|
|
|
@@ -55,7 +55,7 @@ _num_words = [
    "തൊണ്ണൂറ് ",
    "നുറ് ",
    "ആയിരം ",
    "പത്തുലക്ഷം"
    "പത്തുലക്ഷം",
]
|
||||
|
||||
|
||||
|
|
|
@ -3,7 +3,6 @@ from __future__ import unicode_literals
|
|||
|
||||
|
||||
STOP_WORDS = set(
|
||||
|
||||
"""
|
||||
അത്
|
||||
ഇത്
|
||||
|
|
|
@ -2,9 +2,10 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import NOUN, PROPN, PRON
|
||||
from ...errors import Errors
|
||||
|
||||
|
||||
def noun_chunks(obj):
|
||||
def noun_chunks(doclike):
|
||||
"""
|
||||
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
|
||||
"""
|
||||
|
@ -18,21 +19,23 @@ def noun_chunks(obj):
|
|||
"nmod",
|
||||
"nmod:poss",
|
||||
]
|
||||
doc = obj.doc # Ensure works on both Doc and Span.
|
||||
doc = doclike.doc # Ensure works on both Doc and Span.
|
||||
|
||||
if not doc.is_parsed:
|
||||
raise ValueError(Errors.E029)
|
||||
|
||||
np_deps = [doc.vocab.strings[label] for label in labels]
|
||||
conj = doc.vocab.strings.add("conj")
|
||||
np_label = doc.vocab.strings.add("NP")
|
||||
seen = set()
|
||||
for i, word in enumerate(obj):
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if word.pos not in (NOUN, PROPN, PRON):
|
||||
continue
|
||||
# Prevent nested chunks from being produced
|
||||
if word.i in seen:
|
||||
if word.left_edge.i <= prev_end:
|
||||
continue
|
||||
if word.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
elif word.dep == conj:
|
||||
head = word.head
|
||||
|
@ -40,9 +43,7 @@ def noun_chunks(obj):
|
|||
head = head.head
|
||||
# If the head is an NP, and we're coordinated to it, we're an NP
|
||||
if head.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
|
||||
|
||||
|
|
|
@ -1,17 +1,19 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .lemmatizer import PolishLemmatizer
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG, NORM
|
||||
from ...util import update_exc, add_lookups
|
||||
from ...util import add_lookups
|
||||
from ...lookups import Lookups
|
||||
|
||||
|
||||
class PolishDefaults(Language.Defaults):
|
||||
|
@ -21,10 +23,21 @@ class PolishDefaults(Language.Defaults):
|
|||
lex_attr_getters[NORM] = add_lookups(
|
||||
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
mod_base_exceptions = {
|
||||
exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".")
|
||||
}
|
||||
tokenizer_exceptions = mod_base_exceptions
|
||||
stop_words = STOP_WORDS
|
||||
tag_map = TAG_MAP
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
|
||||
@classmethod
|
||||
def create_lemmatizer(cls, nlp=None, lookups=None):
|
||||
if lookups is None:
|
||||
lookups = Lookups()
|
||||
return PolishLemmatizer(lookups)
|
||||
|
||||
|
||||
class Polish(Language):
|
||||
|
|
File diff suppressed because it is too large
spacy/lang/pl/lemmatizer.py (new file, 106 lines)
|
@ -0,0 +1,106 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...lemmatizer import Lemmatizer
|
||||
from ...parts_of_speech import NAMES
|
||||
|
||||
|
||||
class PolishLemmatizer(Lemmatizer):
|
||||
# This lemmatizer implements lookup lemmatization based on
|
||||
# the Morfeusz dictionary (morfeusz.sgjp.pl/en) by Institute of Computer Science PAS
|
||||
# It utilizes some prefix based improvements for
|
||||
# verb and adjectives lemmatization, as well as case-sensitive
|
||||
# lemmatization for nouns
|
||||
def __init__(self, lookups, *args, **kwargs):
|
||||
# this lemmatizer is lookup based, so it does not require an index, exceptionlist, or rules
|
||||
super(PolishLemmatizer, self).__init__(lookups)
|
||||
self.lemma_lookups = {}
|
||||
for tag in [
|
||||
"ADJ",
|
||||
"ADP",
|
||||
"ADV",
|
||||
"AUX",
|
||||
"NOUN",
|
||||
"NUM",
|
||||
"PART",
|
||||
"PRON",
|
||||
"VERB",
|
||||
"X",
|
||||
]:
|
||||
self.lemma_lookups[tag] = self.lookups.get_table(
|
||||
"lemma_lookup_" + tag.lower(), {}
|
||||
)
|
||||
self.lemma_lookups["DET"] = self.lemma_lookups["X"]
|
||||
self.lemma_lookups["PROPN"] = self.lemma_lookups["NOUN"]
|
||||
|
||||
def __call__(self, string, univ_pos, morphology=None):
|
||||
if isinstance(univ_pos, int):
|
||||
univ_pos = NAMES.get(univ_pos, "X")
|
||||
univ_pos = univ_pos.upper()
|
||||
|
||||
if univ_pos == "NOUN":
|
||||
return self.lemmatize_noun(string, morphology)
|
||||
|
||||
if univ_pos != "PROPN":
|
||||
string = string.lower()
|
||||
|
||||
if univ_pos == "ADJ":
|
||||
return self.lemmatize_adj(string, morphology)
|
||||
elif univ_pos == "VERB":
|
||||
return self.lemmatize_verb(string, morphology)
|
||||
|
||||
lemma_dict = self.lemma_lookups.get(univ_pos, {})
|
||||
return [lemma_dict.get(string, string.lower())]
|
||||
|
||||
def lemmatize_adj(self, string, morphology):
|
||||
# this method utilizes different procedures for adjectives
|
||||
# with 'nie' and 'naj' prefixes
|
||||
lemma_dict = self.lemma_lookups["ADJ"]
|
||||
|
||||
if string[:3] == "nie":
|
||||
search_string = string[3:]
|
||||
if search_string[:3] == "naj":
|
||||
naj_search_string = search_string[3:]
|
||||
if naj_search_string in lemma_dict:
|
||||
return [lemma_dict[naj_search_string]]
|
||||
if search_string in lemma_dict:
|
||||
return [lemma_dict[search_string]]
|
||||
|
||||
if string[:3] == "naj":
|
||||
naj_search_string = string[3:]
|
||||
if naj_search_string in lemma_dict:
|
||||
return [lemma_dict[naj_search_string]]
|
||||
|
||||
return [lemma_dict.get(string, string)]
|
||||
|
||||
def lemmatize_verb(self, string, morphology):
|
||||
# this method utilizes a different procedure for verbs
|
||||
# with 'nie' prefix
|
||||
lemma_dict = self.lemma_lookups["VERB"]
|
||||
|
||||
if string[:3] == "nie":
|
||||
search_string = string[3:]
|
||||
if search_string in lemma_dict:
|
||||
return [lemma_dict[search_string]]
|
||||
|
||||
return [lemma_dict.get(string, string)]
|
||||
|
||||
def lemmatize_noun(self, string, morphology):
|
||||
# this method is case-sensitive, in order to work
|
||||
# for incorrectly tagged proper names
|
||||
lemma_dict = self.lemma_lookups["NOUN"]
|
||||
|
||||
if string != string.lower():
|
||||
if string.lower() in lemma_dict:
|
||||
return [lemma_dict[string.lower()]]
|
||||
elif string in lemma_dict:
|
||||
return [lemma_dict[string]]
|
||||
return [string.lower()]
|
||||
|
||||
return [lemma_dict.get(string, string)]
|
||||
|
||||
def lookup(self, string, orth=None):
|
||||
return string.lower()
|
||||
|
||||
def lemmatize(self, string, index, exceptions, rules):
|
||||
raise NotImplementedError
|
|
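The adjective branch of the new `PolishLemmatizer` strips the productive `nie`/`naj` prefixes before consulting the lookup table. A self-contained sketch of that lookup order, with a tiny made-up table (the entry is illustrative only):

```python
# Sketch of the adjective lookup order: naj- inside nie-, then nie-,
# then naj-, then the plain form, finally the string itself.
ADJ_LOOKUP = {"lepszy": "dobry"}  # illustrative entry only

def lemmatize_adj(string, lemma_dict):
    if string.startswith("nie"):
        rest = string[3:]
        if rest.startswith("naj") and rest[3:] in lemma_dict:
            return lemma_dict[rest[3:]]
        if rest in lemma_dict:
            return lemma_dict[rest]
    if string.startswith("naj") and string[3:] in lemma_dict:
        return lemma_dict[string[3:]]
    return lemma_dict.get(string, string)

print(lemmatize_adj("nienajlepszy", ADJ_LOOKUP))  # -> "dobry"
print(lemmatize_adj("dobry", ADJ_LOOKUP))         # -> "dobry" (unchanged)
```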
@ -1,23 +0,0 @@
|
|||
|
||||
Copyright (c) 2019, Marcin Miłkowski
|
||||
All rights reserved.
|
||||
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are met:
|
||||
|
||||
1. Redistributions of source code must retain the above copyright notice, this
|
||||
list of conditions and the following disclaimer.
|
||||
2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
this list of conditions and the following disclaimer in the documentation
|
||||
and/or other materials provided with the distribution.
|
||||
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
||||
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
||||
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
|
||||
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
||||
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
||||
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
|
||||
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
||||
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|
@ -1,22 +1,48 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..char_classes import LIST_ELLIPSES, CONCAT_ICONS
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_PUNCT, LIST_HYPHENS
|
||||
from ..char_classes import LIST_ICONS, LIST_QUOTES, CURRENCY, UNITS, PUNCT
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
|
||||
|
||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||
|
||||
_prefixes = [
|
||||
r"(długo|krótko|jedno|dwu|trzy|cztero)-"
|
||||
] + BASE_TOKENIZER_PREFIXES
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ [CONCAT_ICONS]
|
||||
+ LIST_ICONS
|
||||
+ LIST_HYPHENS
|
||||
+ [
|
||||
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||
r"(?<=[0-9{al}])\.(?=[0-9{au}])".format(al=ALPHA, au=ALPHA_UPPER),
|
||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])[:<>=\/](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])".format(a=ALPHA, q=CONCAT_QUOTES),
|
||||
r"(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])".format(a=ALPHA, q=_quotes),
|
||||
]
|
||||
)
|
||||
|
||||
_suffixes = (
|
||||
["''", "’’", r"\.", "…"]
|
||||
+ LIST_PUNCT
|
||||
+ LIST_QUOTES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.",
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||
r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
|
||||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
|
||||
),
|
||||
r"(?<=[{au}])\.".format(au=ALPHA_UPPER),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
|
|
|
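For context, prefix/infix lists like `TOKENIZER_PREFIXES` and `TOKENIZER_INFIXES` above are compiled into the regexes the tokenizer consumes; a hedged sketch using spaCy's util helpers (this mirrors what `create_tokenizer` already does, and the Polish sample text is illustrative):

```python
# Sketch: compile prefix/infix definitions into tokenizer regexes.
import spacy
from spacy.util import compile_infix_regex, compile_prefix_regex

nlp = spacy.blank("pl")
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)

nlp.tokenizer.prefix_search = prefix_re.search
nlp.tokenizer.infix_finditer = infix_re.finditer
print([t.text for t in nlp("dwu-osobowy pokój, proszę")])
```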
@ -1,26 +0,0 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ._tokenizer_exceptions_list import PL_BASE_EXCEPTIONS
|
||||
from ...symbols import POS, ADV, NOUN, ORTH, LEMMA, ADJ
|
||||
|
||||
|
||||
_exc = {}
|
||||
|
||||
for exc_data in [
|
||||
{ORTH: "m.in.", LEMMA: "między innymi", POS: ADV},
|
||||
{ORTH: "inż.", LEMMA: "inżynier", POS: NOUN},
|
||||
{ORTH: "mgr.", LEMMA: "magister", POS: NOUN},
|
||||
{ORTH: "tzn.", LEMMA: "to znaczy", POS: ADV},
|
||||
{ORTH: "tj.", LEMMA: "to jest", POS: ADV},
|
||||
{ORTH: "tzw.", LEMMA: "tak zwany", POS: ADJ},
|
||||
]:
|
||||
_exc[exc_data[ORTH]] = [exc_data]
|
||||
|
||||
for orth in ["w.", "r."]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
for orth in PL_BASE_EXCEPTIONS:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
|
@ -5,22 +5,17 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
|||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .tag_map import TAG_MAP
|
||||
from .norm_exceptions import NORM_EXCEPTIONS
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG, NORM
|
||||
from ...util import update_exc, add_lookups
|
||||
from ...attrs import LANG
|
||||
from ...util import update_exc
|
||||
|
||||
|
||||
class PortugueseDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters[LANG] = lambda text: "pt"
|
||||
lex_attr_getters[NORM] = add_lookups(
|
||||
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
|
||||
)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
|
|
|
@ -1,23 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# These exceptions are used to add NORM values based on a token's ORTH value.
|
||||
# Individual languages can also add their own exceptions and overwrite them -
|
||||
# for example, British vs. American spelling in English.
|
||||
|
||||
# Norms are only set if no alternative is provided in the tokenizer exceptions.
|
||||
# Note that this does not change any other token attributes. Its main purpose
|
||||
# is to normalise the word representations so that equivalent tokens receive
|
||||
# similar representations. For example: $ and € are very different, but they're
|
||||
# both currency symbols. By normalising currency symbols to $, all symbols are
|
||||
# seen as similar, no matter how common they are in the training data.
|
||||
|
||||
|
||||
NORM_EXCEPTIONS = {
|
||||
"R$": "$", # Real
|
||||
"r$": "$", # Real
|
||||
"Cz$": "$", # Cruzado
|
||||
"cz$": "$", # Cruzado
|
||||
"NCz$": "$", # Cruzado Novo
|
||||
"ncz$": "$", # Cruzado Novo
|
||||
}
|
|
@ -3,26 +3,21 @@ from __future__ import unicode_literals, print_function
|
|||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .norm_exceptions import NORM_EXCEPTIONS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .tag_map import TAG_MAP
|
||||
from .lemmatizer import RussianLemmatizer
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
from ...util import update_exc, add_lookups
|
||||
from ...util import update_exc
|
||||
from ...language import Language
|
||||
from ...lookups import Lookups
|
||||
from ...attrs import LANG, NORM
|
||||
from ...attrs import LANG
|
||||
|
||||
|
||||
class RussianDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda text: "ru"
|
||||
lex_attr_getters[NORM] = add_lookups(
|
||||
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
tag_map = TAG_MAP
|
||||
|
|
|
@ -1,36 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
_exc = {
|
||||
# Slang
|
||||
"прив": "привет",
|
||||
"дарова": "привет",
|
||||
"дак": "так",
|
||||
"дык": "так",
|
||||
"здарова": "привет",
|
||||
"пакедава": "пока",
|
||||
"пакедаво": "пока",
|
||||
"ща": "сейчас",
|
||||
"спс": "спасибо",
|
||||
"пжлст": "пожалуйста",
|
||||
"плиз": "пожалуйста",
|
||||
"ладненько": "ладно",
|
||||
"лады": "ладно",
|
||||
"лан": "ладно",
|
||||
"ясн": "ясно",
|
||||
"всм": "всмысле",
|
||||
"хош": "хочешь",
|
||||
"хаюшки": "привет",
|
||||
"оч": "очень",
|
||||
"че": "что",
|
||||
"чо": "что",
|
||||
"шо": "что",
|
||||
}
|
||||
|
||||
|
||||
NORM_EXCEPTIONS = {}
|
||||
|
||||
for string, norm in _exc.items():
|
||||
NORM_EXCEPTIONS[string] = norm
|
||||
NORM_EXCEPTIONS[string.title()] = norm
|
|
@ -3,22 +3,17 @@ from __future__ import unicode_literals
|
|||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .norm_exceptions import NORM_EXCEPTIONS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG, NORM
|
||||
from ...util import update_exc, add_lookups
|
||||
from ...attrs import LANG
|
||||
from ...util import update_exc
|
||||
|
||||
|
||||
class SerbianDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda text: "sr"
|
||||
lex_attr_getters[NORM] = add_lookups(
|
||||
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
|
||||
|
|
|
@ -1,26 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
_exc = {
|
||||
# Slang
|
||||
"ћале": "отац",
|
||||
"кева": "мајка",
|
||||
"смор": "досада",
|
||||
"кец": "јединица",
|
||||
"тебра": "брат",
|
||||
"штребер": "ученик",
|
||||
"факс": "факултет",
|
||||
"профа": "професор",
|
||||
"бус": "аутобус",
|
||||
"пискарало": "службеник",
|
||||
"бакутанер": "бака",
|
||||
"џибер": "простак",
|
||||
}
|
||||
|
||||
|
||||
NORM_EXCEPTIONS = {}
|
||||
|
||||
for string, norm in _exc.items():
|
||||
NORM_EXCEPTIONS[string] = norm
|
||||
NORM_EXCEPTIONS[string.title()] = norm
|
|
@@ -40,7 +40,7 @@ _num_words = [
    "miljard",
    "biljon",
    "biljard",
    "kvadriljon"
    "kvadriljon",
]
|
||||
|
||||
|
||||
|
|
|
@ -2,9 +2,10 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import NOUN, PROPN, PRON
|
||||
from ...errors import Errors
|
||||
|
||||
|
||||
def noun_chunks(obj):
|
||||
def noun_chunks(doclike):
|
||||
"""
|
||||
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
|
||||
"""
|
||||
|
@ -19,21 +20,23 @@ def noun_chunks(obj):
|
|||
"nmod",
|
||||
"nmod:poss",
|
||||
]
|
||||
doc = obj.doc # Ensure works on both Doc and Span.
|
||||
doc = doclike.doc # Ensure works on both Doc and Span.
|
||||
|
||||
if not doc.is_parsed:
|
||||
raise ValueError(Errors.E029)
|
||||
|
||||
np_deps = [doc.vocab.strings[label] for label in labels]
|
||||
conj = doc.vocab.strings.add("conj")
|
||||
np_label = doc.vocab.strings.add("NP")
|
||||
seen = set()
|
||||
for i, word in enumerate(obj):
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if word.pos not in (NOUN, PROPN, PRON):
|
||||
continue
|
||||
# Prevent nested chunks from being produced
|
||||
if word.i in seen:
|
||||
if word.left_edge.i <= prev_end:
|
||||
continue
|
||||
if word.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
elif word.dep == conj:
|
||||
head = word.head
|
||||
|
@ -41,9 +44,7 @@ def noun_chunks(obj):
|
|||
head = head.head
|
||||
# If the head is an NP, and we're coordinated to it, we're an NP
|
||||
if head.dep in np_deps:
|
||||
if any(w.i in seen for w in word.subtree):
|
||||
continue
|
||||
seen.update(j for j in range(word.left_edge.i, word.right_edge.i + 1))
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
|
||||
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA, PUNCT, TAG
|
||||
from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA
|
||||
|
||||
_exc = {}
|
||||
|
||||
|
@ -155,6 +155,6 @@ for orth in ABBREVIATIONS:
|
|||
# Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."),
|
||||
# should be tokenized as two separate tokens.
|
||||
for orth in ["i", "m"]:
|
||||
_exc[orth + "."] = [{ORTH: orth, LEMMA: orth, NORM: orth}, {ORTH: ".", TAG: PUNCT}]
|
||||
_exc[orth + "."] = [{ORTH: orth, LEMMA: orth, NORM: orth}, {ORTH: "."}]
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
||||
|
|
|
@ -1,139 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
_exc = {
|
||||
# Regional words normal
|
||||
# Sri Lanka - wikipedia
|
||||
"இங்க": "இங்கே",
|
||||
"வாங்க": "வாருங்கள்",
|
||||
"ஒண்டு": "ஒன்று",
|
||||
"கண்டு": "கன்று",
|
||||
"கொண்டு": "கொன்று",
|
||||
"பண்டி": "பன்றி",
|
||||
"பச்ச": "பச்சை",
|
||||
"அம்பது": "ஐம்பது",
|
||||
"வெச்ச": "வைத்து",
|
||||
"வச்ச": "வைத்து",
|
||||
"வச்சி": "வைத்து",
|
||||
"வாளைப்பழம்": "வாழைப்பழம்",
|
||||
"மண்ணு": "மண்",
|
||||
"பொன்னு": "பொன்",
|
||||
"சாவல்": "சேவல்",
|
||||
"அங்கால": "அங்கு ",
|
||||
"அசுப்பு": "நடமாட்டம்",
|
||||
"எழுவான் கரை": "எழுவான்கரை",
|
||||
"ஓய்யாரம்": "எழில் ",
|
||||
"ஒளும்பு": "எழும்பு",
|
||||
"ஓர்மை": "துணிவு",
|
||||
"கச்சை": "கோவணம்",
|
||||
"கடப்பு": "தெருவாசல்",
|
||||
"சுள்ளி": "காய்ந்த குச்சி",
|
||||
"திறாவுதல்": "தடவுதல்",
|
||||
"நாசமறுப்பு": "தொல்லை",
|
||||
"பரிசாரி": "வைத்தியன்",
|
||||
"பறவாதி": "பேராசைக்காரன்",
|
||||
"பிசினி": "உலோபி ",
|
||||
"விசர்": "பைத்தியம்",
|
||||
"ஏனம்": "பாத்திரம்",
|
||||
"ஏலா": "இயலாது",
|
||||
"ஒசில்": "அழகு",
|
||||
"ஒள்ளுப்பம்": "கொஞ்சம்",
|
||||
# Sri Lankan and Indian
|
||||
"குத்துமதிப்பு": "",
|
||||
"நூனாயம்": "நூல்நயம்",
|
||||
"பைய": "மெதுவாக",
|
||||
"மண்டை": "தலை",
|
||||
"வெள்ளனே": "சீக்கிரம்",
|
||||
"உசுப்பு": "எழுப்பு",
|
||||
"ஆணம்": "குழம்பு",
|
||||
"உறக்கம்": "தூக்கம்",
|
||||
"பஸ்": "பேருந்து",
|
||||
"களவு": "திருட்டு ",
|
||||
# relationship
|
||||
"புருசன்": "கணவன்",
|
||||
"பொஞ்சாதி": "மனைவி",
|
||||
"புள்ள": "பிள்ளை",
|
||||
"பிள்ள": "பிள்ளை",
|
||||
"ஆம்பிளப்புள்ள": "ஆண் பிள்ளை",
|
||||
"பொம்பிளப்புள்ள": "பெண் பிள்ளை",
|
||||
"அண்ணாச்சி": "அண்ணா",
|
||||
"அக்காச்சி": "அக்கா",
|
||||
"தங்கச்சி": "தங்கை",
|
||||
# difference words
|
||||
"பொடியன்": "சிறுவன்",
|
||||
"பொட்டை": "சிறுமி",
|
||||
"பிறகு": "பின்பு",
|
||||
"டக்கென்டு": "விரைவாக",
|
||||
"கெதியா": "விரைவாக",
|
||||
"கிறுகி": "திரும்பி",
|
||||
"போயித்து வாறன்": "போய் வருகிறேன்",
|
||||
"வருவாங்களா": "வருவார்களா",
|
||||
# regular spoken forms
|
||||
"சொல்லு": "சொல்",
|
||||
"கேளு": "கேள்",
|
||||
"சொல்லுங்க": "சொல்லுங்கள்",
|
||||
"கேளுங்க": "கேளுங்கள்",
|
||||
"நீங்கள்": "நீ",
|
||||
"உன்": "உன்னுடைய",
|
||||
# Portuguese formal words
|
||||
"அலவாங்கு": "கடப்பாரை",
|
||||
"ஆசுப்பத்திரி": "மருத்துவமனை",
|
||||
"உரோதை": "சில்லு",
|
||||
"கடுதாசி": "கடிதம்",
|
||||
"கதிரை": "நாற்காலி",
|
||||
"குசினி": "அடுக்களை",
|
||||
"கோப்பை": "கிண்ணம்",
|
||||
"சப்பாத்து": "காலணி",
|
||||
"தாச்சி": "இரும்புச் சட்டி",
|
||||
"துவாய்": "துவாலை",
|
||||
"தவறணை": "மதுக்கடை",
|
||||
"பீப்பா": "மரத்தாழி",
|
||||
"யன்னல்": "சாளரம்",
|
||||
"வாங்கு": "மரஇருக்கை",
|
||||
# Dutch formal words
|
||||
"இறாக்கை": "பற்சட்டம்",
|
||||
"இலாட்சி": "இழுப்பறை",
|
||||
"கந்தோர்": "பணிமனை",
|
||||
"நொத்தாரிசு": "ஆவண எழுத்துபதிவாளர்",
|
||||
# English formal words
|
||||
"இஞ்சினியர்": "பொறியியலாளர்",
|
||||
"சூப்பு": "ரசம்",
|
||||
"செக்": "காசோலை",
|
||||
"சேட்டு": "மேற்ச்சட்டை",
|
||||
"மார்க்கட்டு": "சந்தை",
|
||||
"விண்ணன்": "கெட்டிக்காரன்",
|
||||
# Arabic formal words
|
||||
"ஈமான்": "நம்பிக்கை",
|
||||
"சுன்னத்து": "விருத்தசேதனம்",
|
||||
"செய்த்தான்": "பிசாசு",
|
||||
"மவுத்து": "இறப்பு",
|
||||
"ஹலால்": "அங்கீகரிக்கப்பட்டது",
|
||||
"கறாம்": "நிராகரிக்கப்பட்டது",
|
||||
# Persian, Hindustanian and hindi formal words
|
||||
"சுமார்": "கிட்டத்தட்ட",
|
||||
"சிப்பாய்": "போர்வீரன்",
|
||||
"சிபார்சு": "சிபாரிசு",
|
||||
"ஜமீன்": "பணக்காரா்",
|
||||
"அசல்": "மெய்யான",
|
||||
"அந்தஸ்து": "கௌரவம்",
|
||||
"ஆஜர்": "சமா்ப்பித்தல்",
|
||||
"உசார்": "எச்சரிக்கை",
|
||||
"அச்சா": "நல்ல",
|
||||
# English words used in text conversations
|
||||
"bcoz": "ஏனெனில்",
|
||||
"bcuz": "ஏனெனில்",
|
||||
"fav": "விருப்பமான",
|
||||
"morning": "காலை வணக்கம்",
|
||||
"gdeveng": "மாலை வணக்கம்",
|
||||
"gdnyt": "இரவு வணக்கம்",
|
||||
"gdnit": "இரவு வணக்கம்",
|
||||
"plz": "தயவு செய்து",
|
||||
"pls": "தயவு செய்து",
|
||||
"thx": "நன்றி",
|
||||
"thanx": "நன்றி",
|
||||
}
|
||||
|
||||
NORM_EXCEPTIONS = {}
|
||||
|
||||
for string, norm in _exc.items():
|
||||
NORM_EXCEPTIONS[string] = norm
|
|
@ -4,14 +4,12 @@ from __future__ import unicode_literals
|
|||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
from .norm_exceptions import NORM_EXCEPTIONS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
from ...attrs import LANG, NORM
|
||||
from ...attrs import LANG
|
||||
from ...language import Language
|
||||
from ...tokens import Doc
|
||||
from ...util import DummyTokenizer, add_lookups
|
||||
from ...util import DummyTokenizer
|
||||
|
||||
|
||||
class ThaiTokenizer(DummyTokenizer):
|
||||
|
@ -37,9 +35,6 @@ class ThaiDefaults(Language.Defaults):
|
|||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda _text: "th"
|
||||
lex_attr_getters[NORM] = add_lookups(
|
||||
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
|
||||
)
|
||||
tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS)
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
|
|
|
@ -1,113 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
_exc = {
|
||||
# Conjugation and Diversion invalid to Tonal form (ผันอักษรและเสียงไม่ตรงกับรูปวรรณยุกต์)
|
||||
"สนุ๊กเกอร์": "สนุกเกอร์",
|
||||
"โน้ต": "โน้ต",
|
||||
# Misspelled out of laziness or haste (สะกดผิดเพราะขี้เกียจพิมพ์ หรือเร่งรีบ)
|
||||
"โทสับ": "โทรศัพท์",
|
||||
"พุ่งนี้": "พรุ่งนี้",
|
||||
# Strange (ให้ดูแปลกตา)
|
||||
"ชะมะ": "ใช่ไหม",
|
||||
"ชิมิ": "ใช่ไหม",
|
||||
"ชะ": "ใช่ไหม",
|
||||
"ช่ายมะ": "ใช่ไหม",
|
||||
"ป่าว": "เปล่า",
|
||||
"ป่ะ": "เปล่า",
|
||||
"ปล่าว": "เปล่า",
|
||||
"คัย": "ใคร",
|
||||
"ไค": "ใคร",
|
||||
"คราย": "ใคร",
|
||||
"เตง": "ตัวเอง",
|
||||
"ตะเอง": "ตัวเอง",
|
||||
"รึ": "หรือ",
|
||||
"เหรอ": "หรือ",
|
||||
"หรา": "หรือ",
|
||||
"หรอ": "หรือ",
|
||||
"ชั้น": "ฉัน",
|
||||
"ชั้ล": "ฉัน",
|
||||
"ช้าน": "ฉัน",
|
||||
"เทอ": "เธอ",
|
||||
"เทอร์": "เธอ",
|
||||
"เทอว์": "เธอ",
|
||||
"แกร": "แก",
|
||||
"ป๋ม": "ผม",
|
||||
"บ่องตง": "บอกตรงๆ",
|
||||
"ถ่ามตง": "ถามตรงๆ",
|
||||
"ต่อมตง": "ตอบตรงๆ",
|
||||
"เพิ่ล": "เพื่อน",
|
||||
"จอบอ": "จอบอ",
|
||||
"ดั้ย": "ได้",
|
||||
"ขอบคุง": "ขอบคุณ",
|
||||
"ยังงัย": "ยังไง",
|
||||
"Inw": "เทพ",
|
||||
"uou": "นอน",
|
||||
"Lกรีeu": "เกรียน",
|
||||
# Misspelled to express emotions (คำที่สะกดผิดเพื่อแสดงอารมณ์)
|
||||
"เปงราย": "เป็นอะไร",
|
||||
"เปนรัย": "เป็นอะไร",
|
||||
"เปงรัย": "เป็นอะไร",
|
||||
"เป็นอัลไล": "เป็นอะไร",
|
||||
"ทามมาย": "ทำไม",
|
||||
"ทามมัย": "ทำไม",
|
||||
"จังรุย": "จังเลย",
|
||||
"จังเยย": "จังเลย",
|
||||
"จุงเบย": "จังเลย",
|
||||
"ไม่รู้": "มะรุ",
|
||||
"เฮ่ย": "เฮ้ย",
|
||||
"เห้ย": "เฮ้ย",
|
||||
"น่าร็อค": "น่ารัก",
|
||||
"น่าร๊าก": "น่ารัก",
|
||||
"ตั้ลล๊าก": "น่ารัก",
|
||||
"คือร๊ะ": "คืออะไร",
|
||||
"โอป่ะ": "โอเคหรือเปล่า",
|
||||
"น่ามคาน": "น่ารำคาญ",
|
||||
"น่ามสาร": "น่าสงสาร",
|
||||
"วงวาร": "สงสาร",
|
||||
"บับว่า": "แบบว่า",
|
||||
"อัลไล": "อะไร",
|
||||
"อิจ": "อิจฉา",
|
||||
# Misspellings that soften rude words or evade profanity filters (คำที่สะกดผิดเพื่อลดความหยาบของคำ หรืออาจใช้หลีกเลี่ยงการกรองคำหยาบของซอฟต์แวร์)
|
||||
"กรู": "กู",
|
||||
"กุ": "กู",
|
||||
"กรุ": "กู",
|
||||
"ตู": "กู",
|
||||
"ตรู": "กู",
|
||||
"มรึง": "มึง",
|
||||
"เมิง": "มึง",
|
||||
"มืง": "มึง",
|
||||
"มุง": "มึง",
|
||||
"สาด": "สัตว์",
|
||||
"สัส": "สัตว์",
|
||||
"สัก": "สัตว์",
|
||||
"แสรด": "สัตว์",
|
||||
"โคโตะ": "โคตร",
|
||||
"โคด": "โคตร",
|
||||
"โครต": "โคตร",
|
||||
"โคตะระ": "โคตร",
|
||||
"พ่อง": "พ่อมึง",
|
||||
"แม่เมิง": "แม่มึง",
|
||||
"เชี่ย": "เหี้ย",
|
||||
# Imitate words (คำเลียนเสียง โดยส่วนใหญ่จะเพิ่มทัณฑฆาต หรือซ้ำตัวอักษร)
|
||||
"แอร๊ยย": "อ๊าย",
|
||||
"อร๊ายยย": "อ๊าย",
|
||||
"มันส์": "มัน",
|
||||
"วู๊วววววววว์": "วู้",
|
||||
# Acronym (แบบคำย่อ)
|
||||
"หมาลัย": "มหาวิทยาลัย",
|
||||
"วิดวะ": "วิศวะ",
|
||||
"สินสาด ": "ศิลปศาสตร์",
|
||||
"สินกำ ": "ศิลปกรรมศาสตร์",
|
||||
"เสารีย์ ": "อนุเสาวรีย์ชัยสมรภูมิ",
|
||||
"เมกา ": "อเมริกา",
|
||||
"มอไซค์ ": "มอเตอร์ไซค์",
|
||||
}
|
||||
|
||||
|
||||
NORM_EXCEPTIONS = {}
|
||||
|
||||
for string, norm in _exc.items():
|
||||
NORM_EXCEPTIONS[string] = norm
|
||||
NORM_EXCEPTIONS[string.title()] = norm
|
|
@ -38,7 +38,6 @@ TAG_MAP = {
|
|||
"NNPC": {POS: PROPN},
|
||||
"NNC": {POS: NOUN},
|
||||
"PSP": {POS: ADP},
|
||||
|
||||
".": {POS: PUNCT},
|
||||
",": {POS: PUNCT},
|
||||
"-LRB-": {POS: PUNCT},
|
||||
|
|
|
@ -104,6 +104,23 @@ class ChineseTokenizer(DummyTokenizer):
|
|||
(words, spaces) = util.get_words_and_spaces(words, text)
|
||||
return Doc(self.vocab, words=words, spaces=spaces)
|
||||
|
||||
def pkuseg_update_user_dict(self, words, reset=False):
|
||||
if self.pkuseg_seg:
|
||||
if reset:
|
||||
try:
|
||||
import pkuseg
|
||||
|
||||
self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(None)
|
||||
except ImportError:
|
||||
if self.use_pkuseg:
|
||||
msg = (
|
||||
"pkuseg not installed: unable to reset pkuseg "
|
||||
"user dict. Please " + _PKUSEG_INSTALL_MSG
|
||||
)
|
||||
raise ImportError(msg)
|
||||
for word in words:
|
||||
self.pkuseg_seg.preprocesser.insert(word.strip(), "")
|
||||
|
||||
def _get_config(self):
|
||||
config = OrderedDict(
|
||||
(
|
||||
|
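A hedged usage sketch of the new `pkuseg_update_user_dict` hook; it assumes the `pkuseg` package is installed and a pkuseg-enabled tokenizer config (the `pkuseg_model` value is an assumption and may differ in your setup):

```python
# Sketch: add entries to (and reset) the pkuseg user dictionary at runtime.
from spacy.lang.zh import Chinese

cfg = {"use_jieba": False, "use_pkuseg": True, "pkuseg_model": "default"}
zh_tokenizer = Chinese.Defaults.create_tokenizer(config=cfg)

zh_tokenizer.pkuseg_update_user_dict(["自然语言处理"])  # add a custom word
zh_tokenizer.pkuseg_update_user_dict([], reset=True)     # clear the user dict
```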
@ -152,21 +169,16 @@ class ChineseTokenizer(DummyTokenizer):
|
|||
return util.to_bytes(serializers, [])
|
||||
|
||||
def from_bytes(self, data, **kwargs):
|
||||
pkuseg_features_b = b""
|
||||
pkuseg_weights_b = b""
|
||||
pkuseg_processors_data = None
|
||||
pkuseg_data = {"features_b": b"", "weights_b": b"", "processors_data": None}
|
||||
|
||||
def deserialize_pkuseg_features(b):
|
||||
nonlocal pkuseg_features_b
|
||||
pkuseg_features_b = b
|
||||
pkuseg_data["features_b"] = b
|
||||
|
||||
def deserialize_pkuseg_weights(b):
|
||||
nonlocal pkuseg_weights_b
|
||||
pkuseg_weights_b = b
|
||||
pkuseg_data["weights_b"] = b
|
||||
|
||||
def deserialize_pkuseg_processors(b):
|
||||
nonlocal pkuseg_processors_data
|
||||
pkuseg_processors_data = srsly.msgpack_loads(b)
|
||||
pkuseg_data["processors_data"] = srsly.msgpack_loads(b)
|
||||
|
||||
deserializers = OrderedDict(
|
||||
(
|
||||
|
@ -178,13 +190,13 @@ class ChineseTokenizer(DummyTokenizer):
|
|||
)
|
||||
util.from_bytes(data, deserializers, [])
|
||||
|
||||
if pkuseg_features_b and pkuseg_weights_b:
|
||||
if pkuseg_data["features_b"] and pkuseg_data["weights_b"]:
|
||||
with tempfile.TemporaryDirectory() as tempdir:
|
||||
tempdir = Path(tempdir)
|
||||
with open(tempdir / "features.pkl", "wb") as fileh:
|
||||
fileh.write(pkuseg_features_b)
|
||||
fileh.write(pkuseg_data["features_b"])
|
||||
with open(tempdir / "weights.npz", "wb") as fileh:
|
||||
fileh.write(pkuseg_weights_b)
|
||||
fileh.write(pkuseg_data["weights_b"])
|
||||
try:
|
||||
import pkuseg
|
||||
except ImportError:
|
||||
|
@ -193,13 +205,9 @@ class ChineseTokenizer(DummyTokenizer):
|
|||
+ _PKUSEG_INSTALL_MSG
|
||||
)
|
||||
self.pkuseg_seg = pkuseg.pkuseg(str(tempdir))
|
||||
if pkuseg_processors_data:
|
||||
(
|
||||
user_dict,
|
||||
do_process,
|
||||
common_words,
|
||||
other_words,
|
||||
) = pkuseg_processors_data
|
||||
if pkuseg_data["processors_data"]:
|
||||
processors_data = pkuseg_data["processors_data"]
|
||||
(user_dict, do_process, common_words, other_words) = processors_data
|
||||
self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(user_dict)
|
||||
self.pkuseg_seg.postprocesser.do_process = do_process
|
||||
self.pkuseg_seg.postprocesser.common_words = set(common_words)
|
||||
|
|
|
@ -4,10 +4,7 @@ from __future__ import absolute_import, unicode_literals
|
|||
import random
|
||||
import itertools
|
||||
import warnings
|
||||
|
||||
from thinc.extra import load_nlp
|
||||
|
||||
from spacy.util import minibatch
|
||||
import weakref
|
||||
import functools
|
||||
from collections import OrderedDict
|
||||
|
@ -28,10 +25,11 @@ from .compat import izip, basestring_, is_python2, class_types
|
|||
from .gold import GoldParse
|
||||
from .scorer import Scorer
|
||||
from ._ml import link_vectors_to_models, create_default_optimizer
|
||||
from .attrs import IS_STOP, LANG
|
||||
from .attrs import IS_STOP, LANG, NORM
|
||||
from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
|
||||
from .lang.punctuation import TOKENIZER_INFIXES
|
||||
from .lang.tokenizer_exceptions import TOKEN_MATCH, TOKEN_MATCH_WITH_AFFIXES
|
||||
from .lang.norm_exceptions import BASE_NORMS
|
||||
from .lang.tag_map import TAG_MAP
|
||||
from .tokens import Doc
|
||||
from .lang.lex_attrs import LEX_ATTRS, is_stop
|
||||
|
@ -77,6 +75,11 @@ class BaseDefaults(object):
|
|||
lemmatizer=lemmatizer,
|
||||
lookups=lookups,
|
||||
)
|
||||
vocab.lex_attr_getters[NORM] = util.add_lookups(
|
||||
vocab.lex_attr_getters.get(NORM, LEX_ATTRS[NORM]),
|
||||
BASE_NORMS,
|
||||
vocab.lookups.get_table("lexeme_norm"),
|
||||
)
|
||||
for tag_str, exc in cls.morph_rules.items():
|
||||
for orth_str, attrs in exc.items():
|
||||
vocab.morphology.add_special_case(tag_str, orth_str, attrs)
|
||||
|
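The hunk above now layers NORM getters from `BASE_NORMS` plus a `lexeme_norm` lookups table instead of per-language `NORM_EXCEPTIONS` modules. Conceptually, `util.add_lookups` chains tables in front of a default getter; a simplified re-implementation (not the actual spaCy function) for illustration:

```python
# Conceptual sketch of add_lookups(default, *tables): consult each
# table in order, then fall back to the default getter.
def add_lookups(default_getter, *tables):
    def getter(string):
        for table in tables:
            if string in table:
                return table[string]
        return default_getter(string)
    return getter

base_norms = {"’s": "'s"}
lexeme_norm = {"gonna": "going to"}
norm_getter = add_lookups(lambda s: s.lower(), base_norms, lexeme_norm)
print(norm_getter("gonna"))  # -> "going to"
print(norm_getter("Hello"))  # -> "hello" (default getter fallback)
```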
@ -417,7 +420,7 @@ class Language(object):
|
|||
|
||||
def __call__(self, text, disable=[], component_cfg=None):
|
||||
"""Apply the pipeline to some text. The text can span multiple sentences,
|
||||
and can contain arbtrary whitespace. Alignment into the original string
|
||||
and can contain arbitrary whitespace. Alignment into the original string
|
||||
is preserved.
|
||||
|
||||
text (unicode): The text to be processed.
|
||||
|
@ -849,7 +852,7 @@ class Language(object):
|
|||
*[mp.Pipe(False) for _ in range(n_process)]
|
||||
)
|
||||
|
||||
batch_texts = minibatch(texts, batch_size)
|
||||
batch_texts = util.minibatch(texts, batch_size)
|
||||
# Sender sends texts to the workers.
|
||||
# This is necessary to properly handle infinite length of texts.
|
||||
# (In this case, all data cannot be sent to the workers at once)
|
||||
|
@ -907,9 +910,8 @@ class Language(object):
|
|||
serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(
|
||||
p, exclude=["vocab"]
|
||||
)
|
||||
serializers["meta.json"] = lambda p: p.open("w").write(
|
||||
srsly.json_dumps(self.meta)
|
||||
)
|
||||
serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
|
||||
|
||||
for name, proc in self.pipeline:
|
||||
if not hasattr(proc, "name"):
|
||||
continue
|
||||
|
@ -973,7 +975,9 @@ class Language(object):
|
|||
serializers = OrderedDict()
|
||||
serializers["vocab"] = lambda: self.vocab.to_bytes()
|
||||
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
|
||||
serializers["meta.json"] = lambda: srsly.json_dumps(OrderedDict(sorted(self.meta.items())))
|
||||
serializers["meta.json"] = lambda: srsly.json_dumps(
|
||||
OrderedDict(sorted(self.meta.items()))
|
||||
)
|
||||
for name, proc in self.pipeline:
|
||||
if name in exclude:
|
||||
continue
|
||||
|
@ -1075,7 +1079,7 @@ def _fix_pretrained_vectors_name(nlp):
|
|||
else:
|
||||
raise ValueError(Errors.E092)
|
||||
if nlp.vocab.vectors.size != 0:
|
||||
link_vectors_to_models(nlp.vocab)
|
||||
link_vectors_to_models(nlp.vocab, skip_rank=True)
|
||||
for name, proc in nlp.pipeline:
|
||||
if not hasattr(proc, "cfg"):
|
||||
continue
|
||||
|
|
|
@ -6,6 +6,7 @@ from collections import OrderedDict
|
|||
from .symbols import NOUN, VERB, ADJ, PUNCT, PROPN
|
||||
from .errors import Errors
|
||||
from .lookups import Lookups
|
||||
from .parts_of_speech import NAMES as UPOS_NAMES
|
||||
|
||||
|
||||
class Lemmatizer(object):
|
||||
|
@ -43,17 +44,11 @@ class Lemmatizer(object):
|
|||
lookup_table = self.lookups.get_table("lemma_lookup", {})
|
||||
if "lemma_rules" not in self.lookups:
|
||||
return [lookup_table.get(string, string)]
|
||||
if univ_pos in (NOUN, "NOUN", "noun"):
|
||||
univ_pos = "noun"
|
||||
elif univ_pos in (VERB, "VERB", "verb"):
|
||||
univ_pos = "verb"
|
||||
elif univ_pos in (ADJ, "ADJ", "adj"):
|
||||
univ_pos = "adj"
|
||||
elif univ_pos in (PUNCT, "PUNCT", "punct"):
|
||||
univ_pos = "punct"
|
||||
elif univ_pos in (PROPN, "PROPN"):
|
||||
return [string]
|
||||
else:
|
||||
if isinstance(univ_pos, int):
|
||||
univ_pos = UPOS_NAMES.get(univ_pos, "X")
|
||||
univ_pos = univ_pos.lower()
|
||||
|
||||
if univ_pos in ("", "eol", "space"):
|
||||
return [string.lower()]
|
||||
# See Issue #435 for example of where this logic is required.
|
||||
if self.is_base_form(univ_pos, morphology):
|
||||
|
@ -61,6 +56,11 @@ class Lemmatizer(object):
|
|||
index_table = self.lookups.get_table("lemma_index", {})
|
||||
exc_table = self.lookups.get_table("lemma_exc", {})
|
||||
rules_table = self.lookups.get_table("lemma_rules", {})
|
||||
if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))):
|
||||
if univ_pos == "propn":
|
||||
return [string]
|
||||
else:
|
||||
return [string.lower()]
|
||||
lemmas = self.lemmatize(
|
||||
string,
|
||||
index_table.get(univ_pos, {}),
|
||||
|
|
|
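The rewritten lemmatizer block falls back to a plain form when no index/exception/rule tables exist for the POS: proper nouns keep their casing, everything else is lower-cased. A small sketch of that decision with empty (hypothetical) tables:

```python
# Sketch of the fallback used when no rule data exists for a POS.
def fallback_lemma(string, univ_pos, index, exceptions, rules):
    has_data = any((index.get(univ_pos), exceptions.get(univ_pos), rules.get(univ_pos)))
    if not has_data:
        return [string] if univ_pos == "propn" else [string.lower()]
    return None  # the rule-based path would run instead

print(fallback_lemma("Berlin", "propn", {}, {}, {}))  # -> ['Berlin']
print(fallback_lemma("Dogs", "noun", {}, {}, {}))     # -> ['dogs']
```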
@ -1,8 +1,8 @@
|
|||
from .typedefs cimport attr_t, hash_t, flags_t, len_t, tag_t
|
||||
from .attrs cimport attr_id_t
|
||||
from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, CLUSTER, LANG
|
||||
from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, LANG
|
||||
|
||||
from .structs cimport LexemeC, SerializedLexemeC
|
||||
from .structs cimport LexemeC
|
||||
from .strings cimport StringStore
|
||||
from .vocab cimport Vocab
|
||||
|
||||
|
@ -24,22 +24,6 @@ cdef class Lexeme:
|
|||
self.vocab = vocab
|
||||
self.orth = lex.orth
|
||||
|
||||
@staticmethod
|
||||
cdef inline SerializedLexemeC c_to_bytes(const LexemeC* lex) nogil:
|
||||
cdef SerializedLexemeC lex_data
|
||||
buff = <const unsigned char*>&lex.flags
|
||||
end = <const unsigned char*>&lex.sentiment + sizeof(lex.sentiment)
|
||||
for i in range(sizeof(lex_data.data)):
|
||||
lex_data.data[i] = buff[i]
|
||||
return lex_data
|
||||
|
||||
@staticmethod
|
||||
cdef inline void c_from_bytes(LexemeC* lex, SerializedLexemeC lex_data) nogil:
|
||||
buff = <unsigned char*>&lex.flags
|
||||
end = <unsigned char*>&lex.sentiment + sizeof(lex.sentiment)
|
||||
for i in range(sizeof(lex_data.data)):
|
||||
buff[i] = lex_data.data[i]
|
||||
|
||||
@staticmethod
|
||||
cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil:
|
||||
if name < (sizeof(flags_t) * 8):
|
||||
|
@ -56,8 +40,6 @@ cdef class Lexeme:
|
|||
lex.prefix = value
|
||||
elif name == SUFFIX:
|
||||
lex.suffix = value
|
||||
elif name == CLUSTER:
|
||||
lex.cluster = value
|
||||
elif name == LANG:
|
||||
lex.lang = value
|
||||
|
||||
|
@ -84,8 +66,6 @@ cdef class Lexeme:
|
|||
return lex.suffix
|
||||
elif feat_name == LENGTH:
|
||||
return lex.length
|
||||
elif feat_name == CLUSTER:
|
||||
return lex.cluster
|
||||
elif feat_name == LANG:
|
||||
return lex.lang
|
||||
else:
|
||||
|
|
|
@ -17,7 +17,7 @@ from .typedefs cimport attr_t, flags_t
|
|||
from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
|
||||
from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
|
||||
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT
|
||||
from .attrs cimport IS_CURRENCY, IS_OOV, PROB
|
||||
from .attrs cimport IS_CURRENCY
|
||||
|
||||
from .attrs import intify_attrs
|
||||
from .errors import Errors, Warnings
|
||||
|
@ -89,12 +89,11 @@ cdef class Lexeme:
|
|||
cdef attr_id_t attr
|
||||
attrs = intify_attrs(attrs)
|
||||
for attr, value in attrs.items():
|
||||
if attr == PROB:
|
||||
self.c.prob = value
|
||||
elif attr == CLUSTER:
|
||||
self.c.cluster = int(value)
|
||||
elif isinstance(value, int) or isinstance(value, long):
|
||||
Lexeme.set_struct_attr(self.c, attr, value)
|
||||
# skip PROB, e.g. from lexemes.jsonl
|
||||
if isinstance(value, float):
|
||||
continue
|
||||
elif isinstance(value, (int, long)):
|
||||
Lexeme.set_struct_attr(self.c, attr, value)
|
||||
else:
|
||||
Lexeme.set_struct_attr(self.c, attr, self.vocab.strings.add(value))
|
||||
|
||||
|
@ -137,34 +136,6 @@ cdef class Lexeme:
|
|||
xp = get_array_module(vector)
|
||||
return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))
|
||||
|
||||
def to_bytes(self):
|
||||
lex_data = Lexeme.c_to_bytes(self.c)
|
||||
start = <const char*>&self.c.flags
|
||||
end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment)
|
||||
if (end-start) != sizeof(lex_data.data):
|
||||
raise ValueError(Errors.E072.format(length=end-start,
|
||||
bad_length=sizeof(lex_data.data)))
|
||||
byte_string = b"\0" * sizeof(lex_data.data)
|
||||
byte_chars = <char*>byte_string
|
||||
for i in range(sizeof(lex_data.data)):
|
||||
byte_chars[i] = lex_data.data[i]
|
||||
if len(byte_string) != sizeof(lex_data.data):
|
||||
raise ValueError(Errors.E072.format(length=len(byte_string),
|
||||
bad_length=sizeof(lex_data.data)))
|
||||
return byte_string
|
||||
|
||||
def from_bytes(self, bytes byte_string):
|
||||
# This method doesn't really have a use-case --- wrote it for testing.
|
||||
# Possibly delete? It puts the Lexeme out of synch with the vocab.
|
||||
cdef SerializedLexemeC lex_data
|
||||
if len(byte_string) != sizeof(lex_data.data):
|
||||
raise ValueError(Errors.E072.format(length=len(byte_string),
|
||||
bad_length=sizeof(lex_data.data)))
|
||||
for i in range(len(byte_string)):
|
||||
lex_data.data[i] = byte_string[i]
|
||||
Lexeme.c_from_bytes(self.c, lex_data)
|
||||
self.orth = self.c.orth
|
||||
|
||||
@property
|
||||
def has_vector(self):
|
||||
"""RETURNS (bool): Whether a word vector is associated with the object.
|
||||
|
@ -208,10 +179,14 @@ cdef class Lexeme:
|
|||
"""RETURNS (float): A scalar value indicating the positivity or
|
||||
negativity of the lexeme."""
|
||||
def __get__(self):
|
||||
return self.c.sentiment
|
||||
sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment", {})
|
||||
return sentiment_table.get(self.c.orth, 0.0)
|
||||
|
||||
def __set__(self, float sentiment):
|
||||
self.c.sentiment = sentiment
|
||||
def __set__(self, float x):
|
||||
if "lexeme_sentiment" not in self.vocab.lookups:
|
||||
self.vocab.lookups.add_table("lexeme_sentiment")
|
||||
sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment")
|
||||
sentiment_table[self.c.orth] = x
|
||||
|
||||
@property
|
||||
def orth_(self):
|
||||
|
@ -238,9 +213,13 @@ cdef class Lexeme:
|
|||
lexeme text.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.c.norm
|
||||
return self.c.norm
|
||||
|
||||
def __set__(self, attr_t x):
|
||||
if "lexeme_norm" not in self.vocab.lookups:
|
||||
self.vocab.lookups.add_table("lexeme_norm")
|
||||
norm_table = self.vocab.lookups.get_table("lexeme_norm")
|
||||
norm_table[self.c.orth] = self.vocab.strings[x]
|
||||
self.c.norm = x
|
||||
|
||||
property shape:
|
||||
|
@ -276,10 +255,12 @@ cdef class Lexeme:
|
|||
property cluster:
|
||||
"""RETURNS (int): Brown cluster ID."""
|
||||
def __get__(self):
|
||||
return self.c.cluster
|
||||
cluster_table = self.vocab.load_extra_lookups("lexeme_cluster")
|
||||
return cluster_table.get(self.c.orth, 0)
|
||||
|
||||
def __set__(self, attr_t x):
|
||||
self.c.cluster = x
|
||||
def __set__(self, int x):
|
||||
cluster_table = self.vocab.load_extra_lookups("lexeme_cluster")
|
||||
cluster_table[self.c.orth] = x
|
||||
|
||||
property lang:
|
||||
"""RETURNS (uint64): Language of the parent vocabulary."""
|
||||
|
@ -293,10 +274,14 @@ cdef class Lexeme:
|
|||
"""RETURNS (float): Smoothed log probability estimate of the lexeme's
|
||||
type."""
|
||||
def __get__(self):
|
||||
return self.c.prob
|
||||
prob_table = self.vocab.load_extra_lookups("lexeme_prob")
|
||||
settings_table = self.vocab.load_extra_lookups("lexeme_settings")
|
||||
default_oov_prob = settings_table.get("oov_prob", -20.0)
|
||||
return prob_table.get(self.c.orth, default_oov_prob)
|
||||
|
||||
def __set__(self, float x):
|
||||
self.c.prob = x
|
||||
prob_table = self.vocab.load_extra_lookups("lexeme_prob")
|
||||
prob_table[self.c.orth] = x
|
||||
|
||||
property lower_:
|
||||
"""RETURNS (unicode): Lowercase form of the word."""
|
||||
|
@ -314,7 +299,7 @@ cdef class Lexeme:
|
|||
return self.vocab.strings[self.c.norm]
|
||||
|
||||
def __set__(self, unicode x):
|
||||
self.c.norm = self.vocab.strings.add(x)
|
||||
self.norm = self.vocab.strings.add(x)
|
||||
|
||||
property shape_:
|
||||
"""RETURNS (unicode): Transform of the word's string, to show
|
||||
|
@ -362,13 +347,10 @@ cdef class Lexeme:
|
|||
def __set__(self, flags_t x):
|
||||
self.c.flags = x
|
||||
|
||||
property is_oov:
|
||||
@property
|
||||
def is_oov(self):
|
||||
"""RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c, IS_OOV)
|
||||
|
||||
def __set__(self, attr_t x):
|
||||
Lexeme.c_set_flag(self.c, IS_OOV, x)
|
||||
return self.orth in self.vocab.vectors
|
||||
|
||||
property is_stop:
|
||||
"""RETURNS (bool): Whether the lexeme is a stop word."""
|
||||
|
|
|
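With the changes above, `sentiment` (and similarly `cluster`/`prob`) is stored in vocab lookups tables rather than on the C struct, and `is_oov` now simply reports whether the lexeme has a vector. A hedged usage sketch (values are illustrative):

```python
# Sketch: lexical attributes now resolve through vocab lookups tables.
import spacy

nlp = spacy.blank("en")
lex = nlp.vocab["cat"]

lex.sentiment = 0.5   # written to the "lexeme_sentiment" lookups table
print(lex.sentiment)  # -> 0.5, read back from that table
print(lex.is_oov)     # True here: "cat" has no entry in vocab.vectors
```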
@ -124,7 +124,7 @@ class Lookups(object):
|
|||
self._tables[key].update(value)
|
||||
return self
|
||||
|
||||
def to_disk(self, path, **kwargs):
|
||||
def to_disk(self, path, filename="lookups.bin", **kwargs):
|
||||
"""Save the lookups to a directory as lookups.bin. Expects a path to a
|
||||
directory, which will be created if it doesn't exist.
|
||||
|
||||
|
@ -136,11 +136,11 @@ class Lookups(object):
|
|||
path = ensure_path(path)
|
||||
if not path.exists():
|
||||
path.mkdir()
|
||||
filepath = path / "lookups.bin"
|
||||
filepath = path / filename
|
||||
with filepath.open("wb") as file_:
|
||||
file_.write(self.to_bytes())
|
||||
|
||||
def from_disk(self, path, **kwargs):
|
||||
def from_disk(self, path, filename="lookups.bin", **kwargs):
|
||||
"""Load lookups from a directory containing a lookups.bin. Will skip
|
||||
loading if the file doesn't exist.
|
||||
|
||||
|
@ -150,7 +150,7 @@ class Lookups(object):
|
|||
DOCS: https://spacy.io/api/lookups#from_disk
|
||||
"""
|
||||
path = ensure_path(path)
|
||||
filepath = path / "lookups.bin"
|
||||
filepath = path / filename
|
||||
if filepath.exists():
|
||||
with filepath.open("rb") as file_:
|
||||
data = file_.read()
|
||||
|
|
|
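The new `filename` argument lets several lookups files live side by side in one directory. A hedged usage sketch (the directory path and table contents are illustrative):

```python
# Sketch: write and re-load a Lookups object under a custom filename.
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"dogs": "dog"})
lookups.to_disk("/tmp/model-dir", filename="lookups_extra.bin")

reloaded = Lookups()
reloaded.from_disk("/tmp/model-dir", filename="lookups_extra.bin")
print(reloaded.get_table("lemma_lookup")["dogs"])  # -> "dog"
```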
@ -213,28 +213,28 @@ cdef class Matcher:
|
|||
else:
|
||||
yield doc
|
||||
|
||||
def __call__(self, object doc_or_span):
|
||||
def __call__(self, object doclike):
|
||||
"""Find all token sequences matching the supplied pattern.
|
||||
|
||||
doc_or_span (Doc or Span): The document to match over.
|
||||
doclike (Doc or Span): The document to match over.
|
||||
RETURNS (list): A list of `(key, start, end)` tuples,
|
||||
describing the matches. A match tuple describes a span
|
||||
`doc[start:end]`. The `label_id` and `key` are both integers.
|
||||
"""
|
||||
if isinstance(doc_or_span, Doc):
|
||||
doc = doc_or_span
|
||||
if isinstance(doclike, Doc):
|
||||
doc = doclike
|
||||
length = len(doc)
|
||||
elif isinstance(doc_or_span, Span):
|
||||
doc = doc_or_span.doc
|
||||
length = doc_or_span.end - doc_or_span.start
|
||||
elif isinstance(doclike, Span):
|
||||
doc = doclike.doc
|
||||
length = doclike.end - doclike.start
|
||||
else:
|
||||
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doc_or_span).__name__))
|
||||
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
|
||||
if len(set([LEMMA, POS, TAG]) & self._seen_attrs) > 0 \
|
||||
and not doc.is_tagged:
|
||||
raise ValueError(Errors.E155.format())
|
||||
if DEP in self._seen_attrs and not doc.is_parsed:
|
||||
raise ValueError(Errors.E156.format())
|
||||
matches = find_matches(&self.patterns[0], self.patterns.size(), doc_or_span, length,
|
||||
matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
|
||||
extensions=self._extensions, predicates=self._extra_predicates)
|
||||
for i, (key, start, end) in enumerate(matches):
|
||||
on_match = self._callbacks.get(key, None)
|
||||
|
@ -257,7 +257,7 @@ def unpickle_matcher(vocab, patterns, callbacks):
|
|||
return matcher
|
||||
|
||||
|
||||
cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int length, extensions=None, predicates=tuple()):
|
||||
cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()):
|
||||
"""Find matches in a doc, with a compiled array of patterns. Matches are
|
||||
returned as a list of (id, start, end) tuples.
|
||||
|
||||
|
@ -286,7 +286,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int lengt
|
|||
else:
|
||||
nr_extra_attr = 0
|
||||
extra_attr_values = <attr_t*>mem.alloc(length, sizeof(attr_t))
|
||||
for i, token in enumerate(doc_or_span):
|
||||
for i, token in enumerate(doclike):
|
||||
for name, index in extensions.items():
|
||||
value = token._.get(name)
|
||||
if isinstance(value, basestring):
|
||||
|
@ -298,7 +298,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doc_or_span, int lengt
|
|||
for j in range(n):
|
||||
states.push_back(PatternStateC(patterns[j], i, 0))
|
||||
transition_states(states, matches, predicate_cache,
|
||||
doc_or_span[i], extra_attr_values, predicates)
|
||||
doclike[i], extra_attr_values, predicates)
|
||||
extra_attr_values += nr_extra_attr
|
||||
predicate_cache += len(predicates)
|
||||
# Handle matches that end in 0-width patterns
|
||||
|
|
|
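The `doclike` rename reflects that the matcher accepts either a `Doc` or a `Span`. A hedged usage sketch using the long-standing 2.x `Matcher.add` signature (pattern and text are illustrative):

```python
# Sketch: the same Matcher can be called on a Doc or on a Span slice of it.
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Old-style add signature, valid throughout spaCy 2.x.
matcher.add("HELLO_WORLD", None, [{"LOWER": "hello"}, {"LOWER": "world"}])

doc = nlp("Hello world! Hello world again.")
print(matcher(doc))      # matches over the whole Doc
print(matcher(doc[3:]))  # the same matcher called on a Span
```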
@@ -203,7 +203,7 @@ class Pipe(object):
         serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
         serialize["vocab"] = lambda p: self.vocab.to_disk(p)
         if self.model not in (None, True, False):
-            serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
+            serialize["model"] = lambda p: self.model.to_disk(p)
         exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
         util.to_disk(path, serialize, exclude)

@@ -626,7 +626,7 @@ class Tagger(Pipe):
         serialize = OrderedDict((
             ("vocab", lambda p: self.vocab.to_disk(p)),
             ("tag_map", lambda p: srsly.write_msgpack(p, tag_map)),
-            ("model", lambda p: p.open("wb").write(self.model.to_bytes())),
+            ("model", lambda p: self.model.to_disk(p)),
             ("cfg", lambda p: srsly.write_json(p, self.cfg))
         ))
         exclude = util.get_serialization_exclude(serialize, exclude, kwargs)

@@ -1395,7 +1395,7 @@ class EntityLinker(Pipe):
         serialize["vocab"] = lambda p: self.vocab.to_disk(p)
         serialize["kb"] = lambda p: self.kb.dump(p)
         if self.model not in (None, True, False):
-            serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
+            serialize["model"] = lambda p: self.model.to_disk(p)
         exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
         util.to_disk(path, serialize, exclude)
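All three pipes switch from writing `model.to_bytes()` into a file themselves to delegating to `model.to_disk(p)`. A simplified sketch of the callback-dict serialization pattern these hunks use (not the actual Pipe code; `component` is a stand-in object with `cfg`, `vocab`, and `model` attributes):

from pathlib import Path
import srsly

def pipe_to_disk_sketch(component, path, exclude=tuple()):
    # Each key maps to a writer that serialises one artifact to its own path.
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    serialize = {
        "cfg": lambda p: srsly.write_json(p, component.cfg),
        "vocab": lambda p: component.vocab.to_disk(p),
    }
    if component.model not in (None, True, False):
        # New behaviour in the diff: the model writes its own file, instead of
        # p.open("wb").write(component.model.to_bytes())
        serialize["model"] = lambda p: component.model.to_disk(p)
    for name, writer in serialize.items():
        if name not in exclude:
            writer(path / name)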
@@ -23,29 +23,6 @@ cdef struct LexemeC:
     attr_t prefix
     attr_t suffix
-
-    attr_t cluster
-
-    float prob
-    float sentiment
-
-
-cdef struct SerializedLexemeC:
-    unsigned char[8 + 8*10 + 4 + 4] data
-    # sizeof(flags_t)  # flags
-    # + sizeof(attr_t)  # lang
-    # + sizeof(attr_t)  # id
-    # + sizeof(attr_t)  # length
-    # + sizeof(attr_t)  # orth
-    # + sizeof(attr_t)  # lower
-    # + sizeof(attr_t)  # norm
-    # + sizeof(attr_t)  # shape
-    # + sizeof(attr_t)  # prefix
-    # + sizeof(attr_t)  # suffix
-    # + sizeof(attr_t)  # cluster
-    # + sizeof(float)  # prob
-    # + sizeof(float)  # cluster
-    # + sizeof(float)  # l2_norm


 cdef struct SpanC:
     hash_t id
@@ -12,7 +12,7 @@ cdef enum symbol_t:
     LIKE_NUM
     LIKE_EMAIL
     IS_STOP
-    IS_OOV
+    IS_OOV_DEPRECATED
     IS_BRACKET
     IS_QUOTE
     IS_LEFT_PUNCT
@@ -17,7 +17,7 @@ IDS = {
     "LIKE_NUM": LIKE_NUM,
     "LIKE_EMAIL": LIKE_EMAIL,
     "IS_STOP": IS_STOP,
-    "IS_OOV": IS_OOV,
+    "IS_OOV_DEPRECATED": IS_OOV_DEPRECATED,
     "IS_BRACKET": IS_BRACKET,
     "IS_QUOTE": IS_QUOTE,
     "IS_LEFT_PUNCT": IS_LEFT_PUNCT,
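The `IS_OOV` flag is renamed to `IS_OOV_DEPRECATED` in both the symbol enum and the name-to-ID table. A quick, hedged illustration of how these tables are used from Python (the exact module that owns this particular `IDS` dict is not shown in the hunk; the snippet uses the public spacy.attrs table):

from spacy.attrs import IDS

# Attribute names resolve to stable integer IDs; patterns and the C-level
# code work with the IDs, while Python-facing APIs use the names.
print(IDS["IS_STOP"])
print("IS_OOV" in IDS, "IS_OOV_DEPRECATED" in IDS)  # depends on the installed spaCy version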
@@ -9,7 +9,6 @@ import numpy
 cimport cython.parallel
 import numpy.random
 cimport numpy as np
-from itertools import islice
 from cpython.ref cimport PyObject, Py_XDECREF
 from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno
 from libc.math cimport exp

@@ -621,15 +620,15 @@ cdef class Parser:
         self.model, cfg = self.Model(self.moves.n_moves, **cfg)
         if sgd is None:
             sgd = self.create_optimizer()
-        doc_sample = []
-        gold_sample = []
-        for raw_text, annots_brackets in islice(get_gold_tuples(), 1000):
+        docs = []
+        golds = []
+        for raw_text, annots_brackets in get_gold_tuples():
             for annots, brackets in annots_brackets:
                 ids, words, tags, heads, deps, ents = annots
-                doc_sample.append(Doc(self.vocab, words=words))
-                gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags,
-                                             heads=heads, deps=deps, entities=ents))
-        self.model.begin_training(doc_sample, gold_sample)
+                docs.append(Doc(self.vocab, words=words))
+                golds.append(GoldParse(docs[-1], words=words, tags=tags,
+                                       heads=heads, deps=deps, entities=ents))
+        self.model.begin_training(docs, golds)
         if pipeline is not None:
             self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg)
         link_vectors_to_models(self.vocab)
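The parser hunk swaps an `islice`-limited sample of the gold data for a full pass and drops the now-unused import. For reference, a tiny standalone illustration of what the `islice` cap did, independent of the Parser code:

from itertools import islice

def first_n(items, n=1000):
    # Materialise at most n items from a possibly very large iterator.
    return list(islice(items, n))

print(first_n(iter(range(10**9)), 5))  # [0, 1, 2, 3, 4]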
@@ -88,6 +88,11 @@ def eu_tokenizer():
     return get_lang_class("eu").Defaults.create_tokenizer()


+@pytest.fixture(scope="session")
+def fa_tokenizer():
+    return get_lang_class("fa").Defaults.create_tokenizer()
+
+
 @pytest.fixture(scope="session")
 def fi_tokenizer():
     return get_lang_class("fi").Defaults.create_tokenizer()

@@ -107,6 +112,7 @@ def ga_tokenizer():
 def gu_tokenizer():
     return get_lang_class("gu").Defaults.create_tokenizer()

+
 @pytest.fixture(scope="session")
 def he_tokenizer():
     return get_lang_class("he").Defaults.create_tokenizer()

@@ -241,7 +247,9 @@ def yo_tokenizer():

 @pytest.fixture(scope="session")
 def zh_tokenizer_char():
-    return get_lang_class("zh").Defaults.create_tokenizer(config={"use_jieba": False, "use_pkuseg": False})
+    return get_lang_class("zh").Defaults.create_tokenizer(
+        config={"use_jieba": False, "use_pkuseg": False}
+    )


 @pytest.fixture(scope="session")

@@ -253,7 +261,9 @@ def zh_tokenizer_jieba():
 @pytest.fixture(scope="session")
 def zh_tokenizer_pkuseg():
     pytest.importorskip("pkuseg")
-    return get_lang_class("zh").Defaults.create_tokenizer(config={"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True})
+    return get_lang_class("zh").Defaults.create_tokenizer(
+        config={"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}
+    )


 @pytest.fixture(scope="session")
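The new session-scoped `fa_tokenizer` fixture follows the same pattern as the existing conftest fixtures. A hedged example of how a language test module would consume it (the Persian sample text and assertion are illustrative, not from the diff):

import pytest


def test_fa_tokenizer_handles_text(fa_tokenizer):
    # The fixture builds the tokenizer once per test session.
    tokens = fa_tokenizer("این یک جمله است")
    assert len(tokens) > 0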
@@ -50,7 +50,9 @@ def test_create_from_words_and_text(vocab):
     assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
     assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
     assert doc.text == text
-    assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()]
+    assert [t.text for t in doc if not t.text.isspace()] == [
+        word for word in words if not word.isspace()
+    ]

     # partial whitespace in words
     words = [" ", "'", "dogs", "'", "\n\n", "run", " "]

@@ -60,7 +62,9 @@ def test_create_from_words_and_text(vocab):
     assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
     assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
     assert doc.text == text
-    assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()]
+    assert [t.text for t in doc if not t.text.isspace()] == [
+        word for word in words if not word.isspace()
+    ]

     # non-standard whitespace tokens
     words = [" ", " ", "'", "dogs", "'", "\n\n", "run"]

@@ -70,7 +74,9 @@ def test_create_from_words_and_text(vocab):
     assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "]
     assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
     assert doc.text == text
-    assert [t.text for t in doc if not t.text.isspace()] == [word for word in words if not word.isspace()]
+    assert [t.text for t in doc if not t.text.isspace()] == [
+        word for word in words if not word.isspace()
+    ]

     # mismatch between words and text
     with pytest.raises(ValueError):
@@ -181,6 +181,7 @@ def test_is_sent_start(en_tokenizer):
     doc.is_parsed = True
     assert len(list(doc.sents)) == 2


 def test_is_sent_end(en_tokenizer):
     doc = en_tokenizer("This is a sentence. This is another.")
     assert doc[4].is_sent_end is None

@@ -213,6 +214,7 @@ def test_token0_has_sent_start_true():
     assert doc[1].is_sent_start is None
     assert not doc.is_sentenced


 def test_tokenlast_has_sent_end_true():
     doc = Doc(Vocab(), words=["hello", "world"])
     assert doc[0].is_sent_end is None
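These tests exercise the `is_sent_start` / `is_sent_end` token flags. A small hedged sketch of the relationship the tests rely on (behaviour as of the 2.x API; the concrete values here are an illustration, not taken from the diff):

from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["hello", "world"])
assert doc[0].is_sent_end is None   # nothing is set yet
doc[1].is_sent_start = True         # mark a sentence boundary by hand
assert doc[0].is_sent_end           # the end flag mirrors the next token's start flag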
@@ -37,14 +37,6 @@ def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
     assert tokens[7].text == "."


-@pytest.mark.parametrize(
-    "text,norm", [("akvarium", "akvarie"), ("bedstemoder", "bedstemor")]
-)
-def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
-    tokens = da_tokenizer(text)
-    assert tokens[0].norm_ == norm
-
-
 @pytest.mark.parametrize(
     "text,n_tokens",
     [
@@ -22,17 +22,3 @@ def test_de_tokenizer_handles_exc_in_text(de_tokenizer):
     assert len(tokens) == 6
     assert tokens[2].text == "z.Zt."
     assert tokens[2].lemma_ == "zur Zeit"
-
-
-@pytest.mark.parametrize(
-    "text,norms", [("vor'm", ["vor", "dem"]), ("du's", ["du", "es"])]
-)
-def test_de_tokenizer_norm_exceptions(de_tokenizer, text, norms):
-    tokens = de_tokenizer(text)
-    assert [token.norm_ for token in tokens] == norms
-
-
-@pytest.mark.parametrize("text,norm", [("daß", "dass")])
-def test_de_lex_attrs_norm_exceptions(de_tokenizer, text, norm):
-    tokens = de_tokenizer(text)
-    assert tokens[0].norm_ == norm
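The deleted Danish and German norm-exception tests fit the 2.3-era move of norm exceptions out of the hard-coded tokenizer data and into lookup tables (the separate spacy-lookups-data package). A hedged way to check the norms on an installed setup; the output depends on whether the lookups package is present:

import spacy

nlp = spacy.blank("de")
doc = nlp("daß")
# With spacy-lookups-data installed this is expected to normalise to "dass";
# without it, norm_ simply falls back to the lower-cased text.
print([t.norm_ for t in doc])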
16 spacy/tests/lang/de/test_noun_chunks.py Normal file

@@ -0,0 +1,16 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_de(de_tokenizer):
+    """Test that noun_chunks raises Value Error for 'de' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = de_tokenizer("Er lag auf seinem")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
16 spacy/tests/lang/el/test_noun_chunks.py Normal file

@@ -0,0 +1,16 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_el(el_tokenizer):
+    """Test that noun_chunks raises Value Error for 'el' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = el_tokenizer("είναι χώρα της νοτιοανατολικής")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
@@ -118,6 +118,7 @@ def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms):
     assert [token.norm_ for token in tokens] == norms


+@pytest.mark.skip
 @pytest.mark.parametrize(
     "text,norm", [("radicalised", "radicalized"), ("cuz", "because")]
 )
@@ -6,9 +6,24 @@ from spacy.attrs import HEAD, DEP
 from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root
 from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS

+import pytest
+
+
 from ...util import get_doc


+def test_noun_chunks_is_parsed(en_tokenizer):
+    """Test that noun_chunks raises Value Error for 'en' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = en_tokenizer("This is a sentence")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
+
+
 def test_en_noun_chunks_not_nested(en_vocab):
     words = ["Peter", "has", "chronic", "command", "and", "control", "issues"]
     heads = [1, 0, 4, 3, -1, -2, -5]
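The new tests pin down the guard: iterating `doc.noun_chunks` on an unparsed Doc raises a ValueError. For contrast, the normal usage needs a pipeline with a dependency parser (the model name below is an assumption; any trained pipeline with a parser works):

import spacy

nlp = spacy.load("en_core_web_sm")   # requires a model with a dependency parser
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
print([chunk.text for chunk in doc.noun_chunks])
# e.g. ['Autonomous cars', 'insurance liability', 'manufacturers']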
16 spacy/tests/lang/es/test_noun_chunks.py Normal file

@@ -0,0 +1,16 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_noun_chunks_is_parsed_es(es_tokenizer):
+    """Test that noun_chunks raises Value Error for 'es' language if Doc is not parsed.
+    To check this test, we're constructing a Doc
+    with a new Vocab here and forcing is_parsed to 'False'
+    to make sure the noun chunks don't run.
+    """
+    doc = es_tokenizer("en Oxford este verano")
+    doc.is_parsed = False
+    with pytest.raises(ValueError):
+        list(doc.noun_chunks)
@@ -62,4 +62,4 @@ def test_lex_attrs_like_number(es_tokenizer, text, match):
 @pytest.mark.parametrize("word", ["once"])
 def test_es_lex_attrs_capitals(word):
     assert like_num(word)
-    assert like_num(word.upper())
\ No newline at end of file
+    assert like_num(word.upper())
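The assertions above exercise the Spanish `like_num` lexical attribute. A quick standalone check mirroring them (import path as used by the surrounding test module):

from spacy.lang.es.lex_attrs import like_num

assert like_num("once")      # "once" = eleven in Spanish
assert like_num("ONCE")      # the check is case-insensitive
assert like_num("11")        # plain digits also count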
Some files were not shown because too many files have changed in this diff.