Merge master into develop. Big merge, many conflicts -- need to review

This commit is contained in:
Matthew Honnibal 2018-04-29 14:49:26 +02:00
commit 2c4a6d66fa
155 changed files with 699309 additions and 2553 deletions

View File

@@ -78,7 +78,7 @@ took place before the date you sign these terms.
 * each contribution shall be in compliance with U.S. export control laws and
 other applicable export and import laws. You agree to notify us if you
 become aware of any circumstance which would make any of the foregoing
 representations inaccurate in any respect. We may publicly disclose your
 participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
@@ -87,11 +87,11 @@ U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statement below. Please do NOT
 mark both statements:
-* [x] I am signing on behalf of myself as an individual and no other person
+* [ ] I am signing on behalf of myself as an individual and no other person
 or entity, including my employer, has or will have rights with respect to my
 contributions.
-* [x] I am signing on behalf of my employer or a legal entity and I have the
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
 actual authority to contractually bind that entity.
 ## Contributor Details

View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Matthew Upson |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-04-24 |
| GitHub username | ivyleavedtoadflax |
| Website (optional) | www.machinegurning.com |

106 .github/contributors/katrinleinweber.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Katrin Leinweber |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-03-30 |
| GitHub username | katrinleinweber |
| Website (optional) | |

106 .github/contributors/miroli.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Robin Linderborg |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-04-23 |
| GitHub username | miroli |
| Website (optional) | |

106 .github/contributors/mollerhoj.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jens Dahl Mollerhoj |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 4/04/2018 |
| GitHub username | mollerhoj |
| Website (optional) | |

106 .github/contributors/skrcode.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Suraj Rajan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 31/Mar/2018 |
| GitHub username | skrcode |
| Website (optional) | |

106 .github/contributors/trungtv.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Viet-Trung Tran |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-03-28 |
| GitHub username | trungtv |
| Website (optional) | https://datalab.vn |

6 CITATION Normal file
View File

@@ -0,0 +1,6 @@
@ARTICLE{spacy2,
AUTHOR = {Honnibal, Matthew AND Montani, Ines},
TITLE = {spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing},
YEAR = {2017},
JOURNAL = {To appear}
}

View File

@@ -73,28 +73,8 @@ so it only becomes visible on click, making the issue easier to read and follow.
 ### Issue labels
 To distinguish issues that are opened by us, the maintainers, we usually add a
-💫 to the title. We also use the following system to tag our issues and pull
-requests:
+💫 to the title. [See this page](https://github.com/explosion/spaCy/labels)
+for an overview of the system we use to tag our issues and pull requests.
-| Issue label | Description |
-| --- | --- |
-| [`bug`](https://github.com/explosion/spaCy/labels/bug) | Bugs and behaviour differing from documentation |
-| [`enhancement`](https://github.com/explosion/spaCy/labels/enhancement) | Feature requests and improvements |
-| [`install`](https://github.com/explosion/spaCy/labels/install) | Installation problems |
-| [`performance`](https://github.com/explosion/spaCy/labels/performance) | Accuracy, speed and memory use problems |
-| [`tests`](https://github.com/explosion/spaCy/labels/tests) | Missing or incorrect [tests](spacy/tests) |
-| [`docs`](https://github.com/explosion/spaCy/labels/docs), [`examples`](https://github.com/explosion/spaCy/labels/examples) | Issues related to the [documentation](https://spacy.io/docs) and [examples](spacy/examples) |
-| [`training`](https://github.com/explosion/spaCy/labels/training) | Issues related to training and updating models |
-| [`models`](https://github.com/explosion/spacy-models), `language / [name]` | Issues related to the specific [models](https://github.com/explosion/spacy-models), languages and data |
-| [`linux`](https://github.com/explosion/spaCy/labels/linux), [`osx`](https://github.com/explosion/spaCy/labels/osx), [`windows`](https://github.com/explosion/spaCy/labels/windows) | Issues related to the specific operating systems |
-| [`pip`](https://github.com/explosion/spaCy/labels/pip), [`conda`](https://github.com/explosion/spaCy/labels/conda) | Issues related to the specific package managers |
-| [`compat`](https://github.com/explosion/spaCy/labels/compat) | Cross-platform and cross-Python compatibility issues |
-| [`wip`](https://github.com/explosion/spaCy/labels/wip) | Work in progress, mostly used for pull requests |
-| [`v1`](https://github.com/explosion/spaCy/labels/v1) | Reports related to spaCy v1.x |
-| [`duplicate`](https://github.com/explosion/spaCy/labels/duplicate) | Duplicates, i.e. issues that have been reported before |
-| [`third-party`](https://github.com/explosion/spaCy/labels/third-party) | Issues related to third-party packages and services |
-| [`meta`](https://github.com/explosion/spaCy/labels/meta) | Meta topics, e.g. repo organisation and issue management |
-| [`help wanted`](https://github.com/explosion/spaCy/labels/help%20wanted), [`help wanted (easy)`](https://github.com/explosion/spaCy/labels/help%20wanted%20%28easy%29) | Requests for contributions |
 ## Contributing to the code base
@@ -220,7 +200,7 @@ All Python code must be written in an **intersection of Python 2 and Python 3**.
 This is easy in Cython, but somewhat ugly in Python. Logic that deals with
 Python or platform compatibility should only live in
 [`spacy.compat`](spacy/compat.py). To distinguish them from the builtin
-functions, replacement functions are suffixed with an undersocre, for example
+functions, replacement functions are suffixed with an underscore, for example
 `unicode_`. If you need to access the user's version or platform information,
 for example to show more specific error messages, you can use the `is_config()`
 helper function.

View File

@@ -12,11 +12,11 @@ integration. It's commercial open-source software, released under the MIT licens
 💫 **Version 2.0 out now!** `Check out the new features here. <https://spacy.io/usage/v2>`_
-.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square
+.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis
     :target: https://travis-ci.org/explosion/spaCy
     :alt: Build Status
-.. image:: https://img.shields.io/appveyor/ci/explosion/spaCy/master.svg?style=flat-square
+.. image:: https://img.shields.io/appveyor/ci/explosion/spaCy/master.svg?style=flat-square&logo=appveyor
     :target: https://ci.appveyor.com/project/explosion/spaCy
     :alt: Appveyor Build Status
@@ -28,11 +28,11 @@ integration. It's commercial open-source software, released under the MIT licens
     :target: https://pypi.python.org/pypi/spacy
     :alt: pypi Version
-.. image:: https://anaconda.org/conda-forge/spacy/badges/version.svg
+.. image:: https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square
     :target: https://anaconda.org/conda-forge/spacy
     :alt: conda Version
-.. image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg?style=flat-square
+.. image:: https://img.shields.io/badge/chat-join%20%E2%86%92-09a3d5.svg?style=flat-square&logo=gitter-white
     :target: https://gitter.im/explosion/spaCy
     :alt: spaCy on Gitter
@@ -49,7 +49,7 @@ integration. It's commercial open-source software, released under the MIT licens
 `New in v2.0`_ New features, backwards incompatibilities and migration guide.
 `API Reference`_ The detailed reference for spaCy's API.
 `Models`_ Download statistical language models for spaCy.
-`Resources`_ Libraries, extensions, demos, books and courses.
+`Universe`_ Libraries, extensions, demos, books and courses.
 `Changelog`_ Changes and version history.
 `Contribute`_ How to contribute to the spaCy project and code base.
 =================== ===
@@ -59,7 +59,7 @@ integration. It's commercial open-source software, released under the MIT licens
 .. _Usage Guides: https://spacy.io/usage/
 .. _API Reference: https://spacy.io/api/
 .. _Models: https://spacy.io/models
-.. _Resources: https://spacy.io/usage/resources
+.. _Universe: https://spacy.io/universe
 .. _Changelog: https://spacy.io/usage/#changelog
 .. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
@@ -308,18 +308,20 @@ VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
 Run tests
 =========
-spaCy comes with an `extensive test suite <spacy/tests>`_. First, find out where
-spaCy is installed:
+spaCy comes with an `extensive test suite <spacy/tests>`_. In order to run the
+tests, you'll usually want to clone the repository and build spaCy from source.
+This will also install the required development dependencies and test utilities
+defined in the ``requirements.txt``.
+Alternatively, you can find out where spaCy is installed and run ``pytest`` on
+that directory. Don't forget to also install the test utilities via spaCy's
+``requirements.txt``:
 .. code:: bash
     python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
+    pip install -r path/to/requirements.txt
-Then run ``pytest`` on that directory. The flags ``--vectors``, ``--slow``
-and ``--model`` are optional and enable additional tests:
-.. code:: bash
-    # make sure you are using recent pytest version
-    python -m pip install -U pytest
     python -m pytest <spacy-directory>
-See `the documentation <https://spacy.io/usage/#tests>`_ for more details and
-examples.

View File

@@ -9,6 +9,7 @@ coordinates. Can be extended with more details from the API.
 * Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components
 Compatible with: spaCy v2.0.0+
+Prerequisites: pip install requests
 """
 from __future__ import unicode_literals, print_function

View File

@@ -81,7 +81,6 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
     else:
         nlp = spacy.blank('en')  # create blank Language class
         print("Created blank 'en' model")
     # Add entity recognizer to model if it's not in the pipeline
     # nlp.create_pipe works for built-ins that are registered with spaCy
     if 'ner' not in nlp.pipe_names:
@@ -92,11 +91,18 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
         ner = nlp.get_pipe('ner')
     ner.add_label(LABEL)   # add new entity label to entity recognizer
+    if model is None:
+        optimizer = nlp.begin_training()
+    else:
+        # Note that 'begin_training' initializes the models, so it'll zero out
+        # existing entity types.
+        optimizer = nlp.entity.create_optimizer()
     # get names of other pipes to disable them during training
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
     with nlp.disable_pipes(*other_pipes):  # only train NER
-        optimizer = nlp.begin_training()
         for itn in range(n_iter):
             random.shuffle(TRAIN_DATA)
             losses = {}
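
The hunk above changes how the example obtains its optimizer: `begin_training()` re-initialises the model's weights, so it is now only used for a blank model, while an existing model gets `nlp.entity.create_optimizer()` so its known entity types are preserved. A minimal sketch of the resulting pattern, assuming the spaCy v2.0 API; `LABEL`, `TRAIN_DATA` and `n_iter` are placeholder values, not taken from the diff:

```python
import random
import spacy

LABEL = 'ANIMAL'                    # hypothetical new entity label
TRAIN_DATA = [
    ("Horses are too tall and they pretend to care about your feelings",
     {'entities': [(0, 6, LABEL)]}),
]
n_iter = 20

model = None                        # or the name/path of an existing model
nlp = spacy.load(model) if model is not None else spacy.blank('en')
if 'ner' not in nlp.pipe_names:
    nlp.add_pipe(nlp.create_pipe('ner'))
ner = nlp.get_pipe('ner')
ner.add_label(LABEL)                # register the new entity label

if model is None:
    optimizer = nlp.begin_training()
else:
    # begin_training() would zero out the entity types the loaded model
    # already knows, so only create an optimizer for it.
    optimizer = nlp.entity.create_optimizer()

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):   # train only the NER component
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer,
                       drop=0.35, losses=losses)
        print(itn, losses)
```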

View File

@@ -1,6 +1,6 @@
 #!/usr/bin/env python
 # coding: utf8
-"""Train a multi-label convolutional neural network text classifier on the
+"""Train a convolutional neural network text classifier on the
 IMDB dataset, using the TextCategorizer component. The dataset will be loaded
 automatically via Thinc's built-in dataset loader. The model is added to
 spacy.pipeline, and predictions are available via `doc.cats`. For more details,

View File

@@ -9,10 +9,6 @@ cytoolz>=0.9.0,<0.10.0
 plac<1.0.0,>=0.9.6
 ujson>=1.35
 dill>=0.2,<0.3
-requests>=2.13.0,<3.0.0
 regex==2017.4.5
-ftfy>=4.4.2,<5.0.0
 pytest>=3.0.6,<4.0.0
 mock>=2.0.0,<3.0.0
-msgpack-python==0.5.4
-msgpack-numpy==0.4.1

View File

@@ -38,6 +38,7 @@ MOD_NAMES = [
     'spacy.tokens.doc',
     'spacy.tokens.span',
     'spacy.tokens.token',
+    'spacy.tokens._retokenize',
     'spacy.matcher',
     'spacy.syntax.ner',
     'spacy.symbols',
@@ -195,11 +196,6 @@ def setup_package():
             'pathlib',
             'ujson>=1.35',
             'dill>=0.2,<0.3',
-            'requests>=2.13.0,<3.0.0',
-            'regex==2017.4.5',
-            'ftfy>=4.4.2,<5.0.0',
-            'msgpack-python==0.5.4',
-            'msgpack-numpy==0.4.1'],
         setup_requires=['wheel'],
         classifiers=[
             'Development Status :: 5 - Production/Stable',

View File

@@ -4,18 +4,14 @@ from __future__ import unicode_literals
 from .cli.info import info as cli_info
 from .glossary import explain
 from .about import __version__
+from .errors import Warnings, deprecation_warning
 from . import util
 def load(name, **overrides):
     depr_path = overrides.get('path')
     if depr_path not in (True, False, None):
-        util.deprecated(
-            "As of spaCy v2.0, the keyword argument `path=` is deprecated. "
-            "You can now call spacy.load with the path as its first argument, "
-            "and the model's meta.json will be used to determine the language "
-            "to load. For example:\nnlp = spacy.load('{}')".format(depr_path),
-            'error')
+        deprecation_warning(Warnings.W001.format(path=depr_path))
     return util.load_model(name, **overrides)
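
This hunk routes the old `path=` keyword through the new structured warnings (`Warnings.W001` plus `deprecation_warning()`) instead of a hand-written `util.deprecated()` message. A short sketch of the calling convention it concerns, assuming the v2.0 API; the model names and paths below are placeholders:

```python
import spacy

# Preferred since v2.0: pass the package name, shortcut link or model path as
# the first argument; the model's meta.json determines the language to load.
nlp = spacy.load('en_core_web_sm')
nlp_from_path = spacy.load('/path/to/model')   # placeholder path

# Deprecated: the old keyword form now emits Warnings.W001 via
# deprecation_warning() before the call proceeds.
nlp_old = spacy.load('en', path='/path/to/model')
```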

View File

@@ -23,6 +23,7 @@ from thinc.neural._classes.affine import _set_dimensions_if_needed
 import thinc.extra.load_nlp
 from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
+from .errors import Errors
 from . import util
@@ -157,7 +158,7 @@ class PrecomputableAffine(Model):
             sgd(self._mem.weights, self._mem.gradient, key=self.id)
             return dXf.reshape((dXf.shape[0], self.nF, self.nI))
         return Yf, backward
     def _add_padding(self, Yf):
         Yf_padded = self.ops.xp.vstack((self.pad, Yf))
         return Yf_padded
@@ -225,6 +226,11 @@ class PrecomputableAffine(Model):
 def link_vectors_to_models(vocab):
     vectors = vocab.vectors
+    if vectors.name is None:
+        vectors.name = VECTORS_KEY
+        print(
+            "Warning: Unnamed vectors -- this won't allow multiple vectors "
+            "models to be loaded. (Shape: (%d, %d))" % vectors.data.shape)
     ops = Model.ops
     for word in vocab:
         if word.orth in vectors.key2row:
@@ -234,11 +240,11 @@ def link_vectors_to_models(vocab):
     data = ops.asarray(vectors.data)
     # Set an entry here, so that vectors are accessed by StaticVectors
     # (unideal, I know)
-    thinc.extra.load_nlp.VECTORS[(ops.device, VECTORS_KEY)] = data
+    thinc.extra.load_nlp.VECTORS[(ops.device, vectors.name)] = data
 def Tok2Vec(width, embed_size, **kwargs):
-    pretrained_dims = kwargs.get('pretrained_dims', 0)
+    pretrained_vectors = kwargs.get('pretrained_vectors', None)
     cnn_maxout_pieces = kwargs.get('cnn_maxout_pieces', 2)
     cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
     with Model.define_operators({'>>': chain, '|': concatenate, '**': clone,
@@ -251,16 +257,16 @@ def Tok2Vec(width, embed_size, **kwargs):
                           name='embed_suffix')
         shape = HashEmbed(width, embed_size//2, column=cols.index(SHAPE),
                           name='embed_shape')
-        if pretrained_dims is not None and pretrained_dims >= 1:
-            glove = StaticVectors(VECTORS_KEY, width, column=cols.index(ID))
+        if pretrained_vectors is not None:
+            glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
             embed = uniqued(
                 (glove | norm | prefix | suffix | shape)
-                >> LN(Maxout(width, width*5, pieces=3)), column=5)
+                >> LN(Maxout(width, width*5, pieces=3)), column=cols.index(ORTH))
         else:
             embed = uniqued(
                 (norm | prefix | suffix | shape)
-                >> LN(Maxout(width, width*4, pieces=3)), column=5)
+                >> LN(Maxout(width, width*4, pieces=3)), column=cols.index(ORTH))
         convolution = Residual(
             ExtractWindow(nW=1)
@@ -318,10 +324,10 @@ def _divide_array(X, size):
 def get_col(idx):
-    assert idx >= 0, idx
+    if idx < 0:
+        raise IndexError(Errors.E066.format(value=idx))
     def forward(X, drop=0.):
-        assert idx >= 0, idx
         if isinstance(X, numpy.ndarray):
             ops = NumpyOps()
         else:
@@ -329,7 +335,6 @@ def get_col(idx):
         output = ops.xp.ascontiguousarray(X[:, idx], dtype=X.dtype)
         def backward(y, sgd=None):
-            assert idx >= 0, idx
             dX = ops.allocate(X.shape)
             dX[:, idx] += y
             return dX
@@ -416,13 +421,13 @@ def build_tagger_model(nr_class, **cfg):
         token_vector_width = cfg['token_vector_width']
     else:
         token_vector_width = util.env_opt('token_vector_width', 128)
-    pretrained_dims = cfg.get('pretrained_dims', 0)
+    pretrained_vectors = cfg.get('pretrained_vectors')
     with Model.define_operators({'>>': chain, '+': add}):
         if 'tok2vec' in cfg:
             tok2vec = cfg['tok2vec']
         else:
             tok2vec = Tok2Vec(token_vector_width, embed_size,
-                              pretrained_dims=pretrained_dims)
+                              pretrained_vectors=pretrained_vectors)
         softmax = with_flatten(Softmax(nr_class, token_vector_width))
         model = (
             tok2vec
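
These hunks (from spaCy's `_ml.py` module, which defines `Tok2Vec` and `link_vectors_to_models`) replace the global `VECTORS_KEY`/`pretrained_dims` convention with vector tables addressed by `vectors.name`. A rough sketch of how the new keyword is meant to be used, assuming a model with word vectors is installed; the model and table names below are illustrative, not taken from the diff:

```python
import spacy
from spacy._ml import Tok2Vec, link_vectors_to_models

nlp = spacy.load('en_core_web_md')             # any model shipping word vectors
if nlp.vocab.vectors.name is None:
    nlp.vocab.vectors.name = 'en_md.vectors'   # illustrative; None triggers the warning above
link_vectors_to_models(nlp.vocab)              # registers the data under (device, vectors.name)

# Tok2Vec now receives the *name* of the vectors table rather than a dimension count.
tok2vec = Tok2Vec(width=96, embed_size=2000,
                  pretrained_vectors=nlp.vocab.vectors.name)
```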

View File

@@ -11,7 +11,6 @@ __email__ = 'contact@explosion.ai'
 __license__ = 'MIT'
 __release__ = False
-__docs_models__ = 'https://spacy.io/usage/models'
 __download_url__ = 'https://github.com/explosion/spacy-models/releases/download'
 __compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json'
 __shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json'

74 spacy/cli/_messages.py Normal file
View File

@@ -0,0 +1,74 @@
# coding: utf8
from __future__ import unicode_literals
class Messages(object):
M001 = ("Download successful but linking failed")
M002 = ("Creating a shortcut link for 'en' didn't work (maybe you "
"don't have admin permissions?), but you can still load the "
"model via its full package name: nlp = spacy.load('{name}')")
M003 = ("Server error ({code}: {desc})")
M004 = ("Couldn't fetch {desc}. Please find a model for your spaCy "
"installation (v{version}), and download it manually. For more "
"details, see the documentation: https://spacy.io/usage/models")
M005 = ("Compatibility error")
M006 = ("No compatible models found for v{version} of spaCy.")
M007 = ("No compatible model found for '{name}' (spaCy v{version}).")
M008 = ("Can't locate model data")
M009 = ("The data should be located in {path}")
M010 = ("Can't find the spaCy data path to create model symlink")
M011 = ("Make sure a directory `/data` exists within your spaCy "
"installation and try again. The data directory should be "
"located here:")
M012 = ("Link '{name}' already exists")
M013 = ("To overwrite an existing link, use the --force flag.")
M014 = ("Can't overwrite symlink '{name}'")
M015 = ("This can happen if your data directory contains a directory or "
"file of the same name.")
M016 = ("Error: Couldn't link model to '{name}'")
M017 = ("Creating a symlink in spacy/data failed. Make sure you have the "
"required permissions and try re-running the command as admin, or "
"use a virtualenv. You can still import the model as a module and "
"call its load() method, or create the symlink manually.")
M018 = ("Linking successful")
M019 = ("You can now load the model via spacy.load('{name}')")
M020 = ("Can't find model meta.json")
M021 = ("Couldn't fetch compatibility table.")
M022 = ("Can't find spaCy v{version} in compatibility table")
M023 = ("Installed models (spaCy v{version})")
M024 = ("No models found in your current environment.")
M025 = ("Use the following commands to update the model packages:")
M026 = ("The following models are not available for spaCy "
"v{version}: {models}")
M027 = ("You may also want to overwrite the incompatible links using the "
"`python -m spacy link` command with `--force`, or remove them "
"from the data directory. Data path: {path}")
M028 = ("Input file not found")
M029 = ("Output directory not found")
M030 = ("Unknown format")
M031 = ("Can't find converter for {converter}")
M032 = ("Generated output file {name}")
M033 = ("Created {n_docs} documents")
M034 = ("Evaluation data not found")
M035 = ("Visualization output directory not found")
M036 = ("Generated {n} parses as HTML")
M037 = ("Can't find words frequencies file")
M038 = ("Successfully compiled vocab")
M039 = ("{entries} entries, {vectors} vectors")
M040 = ("Output directory not found")
M041 = ("Loaded meta.json from file")
M042 = ("Successfully created package '{name}'")
M043 = ("To build the package, run `python setup.py sdist` in this "
"directory.")
M044 = ("Package directory already exists")
M045 = ("Please delete the directory and try again, or use the `--force` "
"flag to overwrite existing directories.")
M046 = ("Generating meta.json")
M047 = ("Enter the package settings for your model. The following "
"information will be read from your model data: pipeline, vectors.")
M048 = ("No '{key}' setting found in meta.json")
M049 = ("This setting is required to build your package.")
M050 = ("Training data not found")
M051 = ("Development data not found")
M052 = ("Not a valid meta.json format")
M053 = ("Expected dict but got: {meta_type}")
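
The new `_messages.py` centralises the CLI's user-facing strings as `.format()` templates. The converter hunks below show the intended call pattern; in short (the converter name here is made up for illustration):

```python
from spacy.cli._messages import Messages
from spacy.util import prints

# Templates are plain strings with named fields, filled in at the call site
# and passed to the CLI printer. `exits=1` aborts after printing.
prints(Messages.M031.format(converter='foo'),
       title=Messages.M030, exits=1)
```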

View File

@ -5,6 +5,7 @@ import plac
from pathlib import Path from pathlib import Path
from .converters import conllu2json, iob2json, conll_ner2json from .converters import conllu2json, iob2json, conll_ner2json
from ._messages import Messages
from ..util import prints from ..util import prints
# Converters are matched by file extension. To add a converter, add a new # Converters are matched by file extension. To add a converter, add a new
@ -32,14 +33,14 @@ def convert(input_file, output_dir, n_sents=1, morphology=False, converter='auto
input_path = Path(input_file) input_path = Path(input_file)
output_path = Path(output_dir) output_path = Path(output_dir)
if not input_path.exists(): if not input_path.exists():
prints(input_path, title="Input file not found", exits=1) prints(input_path, title=Messages.M028, exits=1)
if not output_path.exists(): if not output_path.exists():
prints(output_path, title="Output directory not found", exits=1) prints(output_path, title=Messages.M029, exits=1)
if converter == 'auto': if converter == 'auto':
converter = input_path.suffix[1:] converter = input_path.suffix[1:]
if converter not in CONVERTERS: if converter not in CONVERTERS:
prints("Can't find converter for %s" % converter, prints(Messages.M031.format(converter=converter),
title="Unknown format", exits=1) title=Messages.M030, exits=1)
func = CONVERTERS[converter] func = CONVERTERS[converter]
func(input_path, output_path, func(input_path, output_path,
n_sents=n_sents, use_morphology=morphology) n_sents=n_sents, use_morphology=morphology)
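
Converters are looked up in the CONVERTERS dict by file extension (input_path.suffix[1:]), so supporting an extra format only needs a new entry. A hypothetical sketch, not part of this commit, assuming the command lives in spacy.cli.convert:

    from spacy.cli.convert import CONVERTERS

    def foo2json(input_path, output_path, n_sents=10, use_morphology=False):
        # hypothetical converter for .foo files; a real one writes spaCy's JSON training format
        raise NotImplementedError

    CONVERTERS['foo'] = foo2json  # `python -m spacy convert data.foo out/` would now dispatch here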

View File

@ -1,6 +1,7 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from .._messages import Messages
from ...compat import json_dumps, path2str from ...compat import json_dumps, path2str
from ...util import prints from ...util import prints
from ...gold import iob_to_biluo from ...gold import iob_to_biluo
@ -18,8 +19,8 @@ def conll_ner2json(input_path, output_path, n_sents=10, use_morphology=False):
output_file = output_path / output_filename output_file = output_path / output_filename
with output_file.open('w', encoding='utf-8') as f: with output_file.open('w', encoding='utf-8') as f:
f.write(json_dumps(docs)) f.write(json_dumps(docs))
prints("Created %d documents" % len(docs), prints(Messages.M033.format(n_docs=len(docs)),
title="Generated output file %s" % path2str(output_file)) title=Messages.M032.format(name=path2str(output_file)))
def read_conll_ner(input_path): def read_conll_ner(input_path):

View File

@ -1,6 +1,7 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from .._messages import Messages
from ...compat import json_dumps, path2str from ...compat import json_dumps, path2str
from ...util import prints from ...util import prints
@ -32,8 +33,8 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
output_file = output_path / output_filename output_file = output_path / output_filename
with output_file.open('w', encoding='utf-8') as f: with output_file.open('w', encoding='utf-8') as f:
f.write(json_dumps(docs)) f.write(json_dumps(docs))
prints("Created %d documents" % len(docs), prints(Messages.M033.format(n_docs=len(docs)),
title="Generated output file %s" % path2str(output_file)) title=Messages.M032.format(name=path2str(output_file)))
def read_conllx(input_path, use_morphology=False, n=0): def read_conllx(input_path, use_morphology=False, n=0):

View File

@ -2,6 +2,7 @@
from __future__ import unicode_literals from __future__ import unicode_literals
from cytoolz import partition_all, concat from cytoolz import partition_all, concat
from .._messages import Messages
from ...compat import json_dumps, path2str from ...compat import json_dumps, path2str
from ...util import prints from ...util import prints
from ...gold import iob_to_biluo from ...gold import iob_to_biluo
@ -18,8 +19,8 @@ def iob2json(input_path, output_path, n_sents=10, *a, **k):
output_file = output_path / output_filename output_file = output_path / output_filename
with output_file.open('w', encoding='utf-8') as f: with output_file.open('w', encoding='utf-8') as f:
f.write(json_dumps(docs)) f.write(json_dumps(docs))
prints("Created %d documents" % len(docs), prints(Messages.M033.format(n_docs=len(docs)),
title="Generated output file %s" % path2str(output_file)) title=Messages.M032.format(name=path2str(output_file)))
def read_iob(raw_sents): def read_iob(raw_sents):

View File

@ -2,13 +2,15 @@
from __future__ import unicode_literals from __future__ import unicode_literals
import plac import plac
import requests
import os import os
import subprocess import subprocess
import sys import sys
import ujson
from .link import link from .link import link
from ._messages import Messages
from ..util import prints, get_package_path from ..util import prints, get_package_path
from ..compat import url_read, HTTPError
from .. import about from .. import about
@ -31,9 +33,7 @@ def download(model, direct=False):
version = get_version(model_name, compatibility) version = get_version(model_name, compatibility)
dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name, dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
v=version)) v=version))
if dl != 0: if dl != 0: # if download subprocess doesn't return 0, exit
# if download subprocess doesn't return 0, exit with the respective
# exit code before doing anything else
sys.exit(dl) sys.exit(dl)
try: try:
# Get package path here because link uses # Get package path here because link uses
@ -47,22 +47,16 @@ def download(model, direct=False):
# Dirty, but since spacy.download and the auto-linking is # Dirty, but since spacy.download and the auto-linking is
# mostly a convenience wrapper, it's best to show a success # mostly a convenience wrapper, it's best to show a success
# message and loading instructions, even if linking fails. # message and loading instructions, even if linking fails.
prints( prints(Messages.M001.format(name=model_name), title=Messages.M002)
"Creating a shortcut link for 'en' didn't work (maybe "
"you don't have admin permissions?), but you can still "
"load the model via its full package name:",
"nlp = spacy.load('%s')" % model_name,
title="Download successful but linking failed")
def get_json(url, desc): def get_json(url, desc):
r = requests.get(url) try:
if r.status_code != 200: data = url_read(url)
msg = ("Couldn't fetch %s. Please find a model for your spaCy " except HTTPError as e:
"installation (v%s), and download it manually.") prints(Messages.M004.format(desc, about.__version__),
prints(msg % (desc, about.__version__), about.__docs_models__, title=Messages.M003.format(e.code, e.reason), exits=1)
title="Server error (%d)" % r.status_code, exits=1) return ujson.loads(data)
return r.json()
def get_compatibility(): def get_compatibility():
@ -71,17 +65,16 @@ def get_compatibility():
comp_table = get_json(about.__compatibility__, "compatibility table") comp_table = get_json(about.__compatibility__, "compatibility table")
comp = comp_table['spacy'] comp = comp_table['spacy']
if version not in comp: if version not in comp:
prints("No compatible models found for v%s of spaCy." % version, prints(Messages.M006.format(version=version), title=Messages.M005,
title="Compatibility error", exits=1) exits=1)
return comp[version] return comp[version]
def get_version(model, comp): def get_version(model, comp):
model = model.rsplit('.dev', 1)[0] model = model.rsplit('.dev', 1)[0]
if model not in comp: if model not in comp:
version = about.__version__ prints(Messages.M007.format(name=model, version=about.__version__),
msg = "No compatible model found for '%s' (spaCy v%s)." title=Messages.M005, exits=1)
prints(msg % (model, version), title="Compatibility error", exits=1)
return comp[model][0] return comp[model][0]

View File

@ -4,6 +4,7 @@ from __future__ import unicode_literals, division, print_function
import plac import plac
from timeit import default_timer as timer from timeit import default_timer as timer
from ._messages import Messages
from ..gold import GoldCorpus from ..gold import GoldCorpus
from ..util import prints from ..util import prints
from .. import util from .. import util
@ -33,10 +34,9 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
data_path = util.ensure_path(data_path) data_path = util.ensure_path(data_path)
displacy_path = util.ensure_path(displacy_path) displacy_path = util.ensure_path(displacy_path)
if not data_path.exists(): if not data_path.exists():
prints(data_path, title="Evaluation data not found", exits=1) prints(data_path, title=Messages.M034, exits=1)
if displacy_path and not displacy_path.exists(): if displacy_path and not displacy_path.exists():
prints(displacy_path, title="Visualization output directory not found", prints(displacy_path, title=Messages.M035, exits=1)
exits=1)
corpus = GoldCorpus(data_path, data_path) corpus = GoldCorpus(data_path, data_path)
nlp = util.load_model(model) nlp = util.load_model(model)
dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc)) dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
@ -52,8 +52,7 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
render_ents = 'ner' in nlp.meta.get('pipeline', []) render_ents = 'ner' in nlp.meta.get('pipeline', [])
render_parses(docs, displacy_path, model_name=model, render_parses(docs, displacy_path, model_name=model,
limit=displacy_limit, deps=render_deps, ents=render_ents) limit=displacy_limit, deps=render_deps, ents=render_ents)
msg = "Generated %s parses as HTML" % displacy_limit prints(displacy_path, title=Messages.M036.format(n=displacy_limit))
prints(displacy_path, title=msg)
def render_parses(docs, output_path, model_name='', limit=250, deps=True, def render_parses(docs, output_path, model_name='', limit=250, deps=True,

View File

@ -5,15 +5,17 @@ import plac
import platform import platform
from pathlib import Path from pathlib import Path
from ._messages import Messages
from ..compat import path2str from ..compat import path2str
from .. import about
from .. import util from .. import util
from .. import about
@plac.annotations( @plac.annotations(
model=("optional: shortcut link of model", "positional", None, str), model=("optional: shortcut link of model", "positional", None, str),
markdown=("generate Markdown for GitHub issues", "flag", "md", str)) markdown=("generate Markdown for GitHub issues", "flag", "md", str),
def info(model=None, markdown=False): silent=("don't print anything (just return)", "flag", "s"))
def info(model=None, markdown=False, silent=False):
"""Print info about spaCy installation. If a model shortcut link is """Print info about spaCy installation. If a model shortcut link is
specified as an argument, print model information. Flag --markdown specified as an argument, print model information. Flag --markdown
prints details in Markdown for easy copy-pasting to GitHub issues. prints details in Markdown for easy copy-pasting to GitHub issues.
@ -25,21 +27,24 @@ def info(model=None, markdown=False):
model_path = util.get_data_path() / model model_path = util.get_data_path() / model
meta_path = model_path / 'meta.json' meta_path = model_path / 'meta.json'
if not meta_path.is_file(): if not meta_path.is_file():
util.prints(meta_path, title="Can't find model meta.json", exits=1) util.prints(meta_path, title=Messages.M020, exits=1)
meta = util.read_json(meta_path) meta = util.read_json(meta_path)
if model_path.resolve() != model_path: if model_path.resolve() != model_path:
meta['link'] = path2str(model_path) meta['link'] = path2str(model_path)
meta['source'] = path2str(model_path.resolve()) meta['source'] = path2str(model_path.resolve())
else: else:
meta['source'] = path2str(model_path) meta['source'] = path2str(model_path)
print_info(meta, 'model %s' % model, markdown) if not silent:
else: print_info(meta, 'model %s' % model, markdown)
data = {'spaCy version': about.__version__, return meta
'Location': path2str(Path(__file__).parent.parent), data = {'spaCy version': about.__version__,
'Platform': platform.platform(), 'Location': path2str(Path(__file__).parent.parent),
'Python version': platform.python_version(), 'Platform': platform.platform(),
'Models': list_models()} 'Python version': platform.python_version(),
'Models': list_models()}
if not silent:
print_info(data, 'spaCy', markdown) print_info(data, 'spaCy', markdown)
return data
def print_info(data, title, markdown): def print_info(data, title, markdown):
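
Since info() now returns the data it collects instead of only printing it, other code can reuse the metadata. A small sketch, assuming info is re-exported from spacy.cli and a shortcut link named 'en' exists:

    from spacy.cli import info

    meta = info('en', silent=True)   # meta.json of the model behind the 'en' shortcut link
    data = info(silent=True)         # spaCy version, location, platform, Python version, models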

View File

@ -12,10 +12,16 @@ import tarfile
import gzip import gzip
import zipfile import zipfile
from ..compat import fix_text from ._messages import Messages
from ..vectors import Vectors from ..vectors import Vectors
from ..errors import Errors, Warnings, user_warning
from ..util import prints, ensure_path, get_lang_class from ..util import prints, ensure_path, get_lang_class
try:
import ftfy
except ImportError:
ftfy = None
@plac.annotations( @plac.annotations(
lang=("model language", "positional", None, str), lang=("model language", "positional", None, str),
@ -23,27 +29,26 @@ from ..util import prints, ensure_path, get_lang_class
freqs_loc=("location of words frequencies file", "positional", None, Path), freqs_loc=("location of words frequencies file", "positional", None, Path),
clusters_loc=("optional: location of brown clusters data", clusters_loc=("optional: location of brown clusters data",
"option", "c", str), "option", "c", str),
vectors_loc=("optional: location of vectors file in GenSim text format", vectors_loc=("optional: location of vectors file in Word2Vec format "
"option", "v", str), "(either as .txt or zipped as .zip or .tar.gz)", "option",
"v", str),
prune_vectors=("optional: number of vectors to prune to", prune_vectors=("optional: number of vectors to prune to",
"option", "V", int) "option", "V", int)
) )
def init_model(lang, output_dir, freqs_loc=None, clusters_loc=None, vectors_loc=None, prune_vectors=-1): def init_model(lang, output_dir, freqs_loc=None, clusters_loc=None,
vectors_loc=None, prune_vectors=-1):
""" """
Create a new model from raw data, like word frequencies, Brown clusters Create a new model from raw data, like word frequencies, Brown clusters
and word vectors. and word vectors.
""" """
if freqs_loc is not None and not freqs_loc.exists(): if freqs_loc is not None and not freqs_loc.exists():
prints(freqs_loc, title="Can't find words frequencies file", exits=1) prints(freqs_loc, title=Messages.M037, exits=1)
clusters_loc = ensure_path(clusters_loc) clusters_loc = ensure_path(clusters_loc)
vectors_loc = ensure_path(vectors_loc) vectors_loc = ensure_path(vectors_loc)
probs, oov_prob = read_freqs(freqs_loc) if freqs_loc is not None else ({}, -20) probs, oov_prob = read_freqs(freqs_loc) if freqs_loc is not None else ({}, -20)
vectors_data, vector_keys = read_vectors(vectors_loc) if vectors_loc else (None, None) vectors_data, vector_keys = read_vectors(vectors_loc) if vectors_loc else (None, None)
clusters = read_clusters(clusters_loc) if clusters_loc else {} clusters = read_clusters(clusters_loc) if clusters_loc else {}
nlp = create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, prune_vectors) nlp = create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, prune_vectors)
if not output_dir.exists(): if not output_dir.exists():
output_dir.mkdir() output_dir.mkdir()
nlp.to_disk(output_dir) nlp.to_disk(output_dir)
@ -71,7 +76,6 @@ def create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, pru
nlp = lang_class() nlp = lang_class()
for lexeme in nlp.vocab: for lexeme in nlp.vocab:
lexeme.rank = 0 lexeme.rank = 0
lex_added = 0 lex_added = 0
for i, (word, prob) in enumerate(tqdm(sorted(probs.items(), key=lambda item: item[1], reverse=True))): for i, (word, prob) in enumerate(tqdm(sorted(probs.items(), key=lambda item: item[1], reverse=True))):
lexeme = nlp.vocab[word] lexeme = nlp.vocab[word]
@ -91,15 +95,13 @@ def create_model(lang, probs, oov_prob, clusters, vectors_data, vector_keys, pru
lexeme = nlp.vocab[word] lexeme = nlp.vocab[word]
lexeme.is_oov = False lexeme.is_oov = False
lex_added += 1 lex_added += 1
if len(vectors_data): if len(vectors_data):
nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys) nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
if prune_vectors >= 1: if prune_vectors >= 1:
nlp.vocab.prune_vectors(prune_vectors) nlp.vocab.prune_vectors(prune_vectors)
vec_added = len(nlp.vocab.vectors) vec_added = len(nlp.vocab.vectors)
prints(Messages.M039.format(entries=lex_added, vectors=vec_added),
prints("{} entries, {} vectors".format(lex_added, vec_added), title=Messages.M038)
title="Sucessfully compiled vocab")
return nlp return nlp
@ -114,8 +116,7 @@ def read_vectors(vectors_loc):
pieces = line.rsplit(' ', vectors_data.shape[1]+1) pieces = line.rsplit(' ', vectors_data.shape[1]+1)
word = pieces.pop(0) word = pieces.pop(0)
if len(pieces) != vectors_data.shape[1]: if len(pieces) != vectors_data.shape[1]:
print(word, repr(line)) raise ValueError(Errors.E094.format(line_num=i, loc=vectors_loc))
raise ValueError("Bad line in file")
vectors_data[i] = numpy.asarray(pieces, dtype='f') vectors_data[i] = numpy.asarray(pieces, dtype='f')
vectors_keys.append(word) vectors_keys.append(word)
return vectors_data, vectors_keys return vectors_data, vectors_keys
@ -150,11 +151,14 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
def read_clusters(clusters_loc): def read_clusters(clusters_loc):
print("Reading clusters...") print("Reading clusters...")
clusters = {} clusters = {}
if ftfy is None:
user_warning(Warnings.W004)
with clusters_loc.open() as f: with clusters_loc.open() as f:
for line in tqdm(f): for line in tqdm(f):
try: try:
cluster, word, freq = line.split() cluster, word, freq = line.split()
word = fix_text(word) if ftfy is not None:
word = ftfy.fix_text(word)
except ValueError: except ValueError:
continue continue
# If the clusterer has only seen the word a few times, its # If the clusterer has only seen the word a few times, its

View File

@ -4,6 +4,7 @@ from __future__ import unicode_literals
import plac import plac
from pathlib import Path from pathlib import Path
from ._messages import Messages
from ..compat import symlink_to, path2str from ..compat import symlink_to, path2str
from ..util import prints from ..util import prints
from .. import util from .. import util
@ -24,40 +25,29 @@ def link(origin, link_name, force=False, model_path=None):
else: else:
model_path = Path(origin) if model_path is None else Path(model_path) model_path = Path(origin) if model_path is None else Path(model_path)
if not model_path.exists(): if not model_path.exists():
prints("The data should be located in %s" % path2str(model_path), prints(Messages.M009.format(path=path2str(model_path)),
title="Can't locate model data", exits=1) title=Messages.M008, exits=1)
data_path = util.get_data_path() data_path = util.get_data_path()
if not data_path or not data_path.exists(): if not data_path or not data_path.exists():
spacy_loc = Path(__file__).parent.parent spacy_loc = Path(__file__).parent.parent
prints("Make sure a directory `/data` exists within your spaCy " prints(Messages.M011, spacy_loc, title=Messages.M010, exits=1)
"installation and try again. The data directory should be "
"located here:", path2str(spacy_loc), exits=1,
title="Can't find the spaCy data path to create model symlink")
link_path = util.get_data_path() / link_name link_path = util.get_data_path() / link_name
if link_path.is_symlink() and not force: if link_path.is_symlink() and not force:
prints("To overwrite an existing link, use the --force flag.", prints(Messages.M013, title=Messages.M012.format(name=link_name),
title="Link %s already exists" % link_name, exits=1) exits=1)
elif link_path.is_symlink(): # does a symlink exist? elif link_path.is_symlink(): # does a symlink exist?
# NB: It's important to check for is_symlink here and not for exists, # NB: It's important to check for is_symlink here and not for exists,
# because invalid/outdated symlinks would return False otherwise. # because invalid/outdated symlinks would return False otherwise.
link_path.unlink() link_path.unlink()
elif link_path.exists(): # does it exist otherwise? elif link_path.exists(): # does it exist otherwise?
# NB: Check this last because valid symlinks also "exist". # NB: Check this last because valid symlinks also "exist".
prints("This can happen if your data directory contains a directory " prints(Messages.M015, link_path,
"or file of the same name.", link_path, title=Messages.M014.format(name=link_name), exits=1)
title="Can't overwrite symlink %s" % link_name, exits=1) msg = "%s --> %s" % (path2str(model_path), path2str(link_path))
try: try:
symlink_to(link_path, model_path) symlink_to(link_path, model_path)
except: except:
# This is quite dirty, but just making sure other errors are caught. # This is quite dirty, but just making sure other errors are caught.
prints("Creating a symlink in spacy/data failed. Make sure you have " prints(Messages.M017, msg, title=Messages.M016.format(name=link_name))
"the required permissions and try re-running the command as "
"admin, or use a virtualenv. You can still import the model as "
"a module and call its load() method, or create the symlink "
"manually.",
"%s --> %s" % (path2str(model_path), path2str(link_path)),
title="Error: Couldn't link model to '%s'" % link_name)
raise raise
prints("%s --> %s" % (path2str(model_path), path2str(link_path)), prints(msg, Messages.M019.format(name=link_name), title=Messages.M018)
"You can now load the model via spacy.load('%s')" % link_name,
title="Linking successful")

View File

@ -5,6 +5,7 @@ import plac
import shutil import shutil
from pathlib import Path from pathlib import Path
from ._messages import Messages
from ..compat import path2str, json_dumps from ..compat import path2str, json_dumps
from ..util import prints from ..util import prints
from .. import util from .. import util
@ -31,17 +32,17 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False,
output_path = util.ensure_path(output_dir) output_path = util.ensure_path(output_dir)
meta_path = util.ensure_path(meta_path) meta_path = util.ensure_path(meta_path)
if not input_path or not input_path.exists(): if not input_path or not input_path.exists():
prints(input_path, title="Model directory not found", exits=1) prints(input_path, title=Messages.M008, exits=1)
if not output_path or not output_path.exists(): if not output_path or not output_path.exists():
prints(output_path, title="Output directory not found", exits=1) prints(output_path, title=Messages.M040, exits=1)
if meta_path and not meta_path.exists(): if meta_path and not meta_path.exists():
prints(meta_path, title="meta.json not found", exits=1) prints(meta_path, title=Messages.M020, exits=1)
meta_path = meta_path or input_path / 'meta.json' meta_path = meta_path or input_path / 'meta.json'
if meta_path.is_file(): if meta_path.is_file():
meta = util.read_json(meta_path) meta = util.read_json(meta_path)
if not create_meta: # only print this if user doesn't want to overwrite if not create_meta: # only print this if user doesn't want to overwrite
prints(meta_path, title="Loaded meta.json from file") prints(meta_path, title=Messages.M041)
else: else:
meta = generate_meta(input_dir, meta) meta = generate_meta(input_dir, meta)
meta = validate_meta(meta, ['lang', 'name', 'version']) meta = validate_meta(meta, ['lang', 'name', 'version'])
@ -57,9 +58,8 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False,
create_file(main_path / 'setup.py', TEMPLATE_SETUP) create_file(main_path / 'setup.py', TEMPLATE_SETUP)
create_file(main_path / 'MANIFEST.in', TEMPLATE_MANIFEST) create_file(main_path / 'MANIFEST.in', TEMPLATE_MANIFEST)
create_file(package_path / '__init__.py', TEMPLATE_INIT) create_file(package_path / '__init__.py', TEMPLATE_INIT)
prints(main_path, "To build the package, run `python setup.py sdist` in " prints(main_path, Messages.M043,
"this directory.", title=Messages.M042.format(name=model_name_v))
title="Successfully created package '%s'" % model_name_v)
def create_dirs(package_path, force): def create_dirs(package_path, force):
@ -67,10 +67,7 @@ def create_dirs(package_path, force):
if force: if force:
shutil.rmtree(path2str(package_path)) shutil.rmtree(path2str(package_path))
else: else:
prints(package_path, "Please delete the directory and try again, " prints(package_path, Messages.M045, title=Messages.M044, exits=1)
"or use the --force flag to overwrite existing "
"directories.", title="Package directory already exists",
exits=1)
Path.mkdir(package_path, parents=True) Path.mkdir(package_path, parents=True)
@ -97,9 +94,7 @@ def generate_meta(model_path, existing_meta):
meta['vectors'] = {'width': nlp.vocab.vectors_length, meta['vectors'] = {'width': nlp.vocab.vectors_length,
'vectors': len(nlp.vocab.vectors), 'vectors': len(nlp.vocab.vectors),
'keys': nlp.vocab.vectors.n_keys} 'keys': nlp.vocab.vectors.n_keys}
prints("Enter the package settings for your model. The following " prints(Messages.M047, title=Messages.M046)
"information will be read from your model data: pipeline, vectors.",
title="Generating meta.json")
for setting, desc, default in settings: for setting, desc, default in settings:
response = util.get_raw_input(desc, default) response = util.get_raw_input(desc, default)
meta[setting] = default if response == '' and default else response meta[setting] = default if response == '' and default else response
@ -111,8 +106,7 @@ def generate_meta(model_path, existing_meta):
def validate_meta(meta, keys): def validate_meta(meta, keys):
for key in keys: for key in keys:
if key not in meta or meta[key] == '': if key not in meta or meta[key] == '':
prints("This setting is required to build your package.", prints(Messages.M049, title=Messages.M048.format(key=key), exits=1)
title='No "%s" setting found in meta.json' % key, exits=1)
return meta return meta

View File

@ -7,6 +7,7 @@ import tqdm
from thinc.neural._classes.model import Model from thinc.neural._classes.model import Model
from timeit import default_timer as timer from timeit import default_timer as timer
from ._messages import Messages
from ..attrs import PROB, IS_OOV, CLUSTER, LANG from ..attrs import PROB, IS_OOV, CLUSTER, LANG
from ..gold import GoldCorpus from ..gold import GoldCorpus
from ..util import prints, minibatch, minibatch_by_words from ..util import prints, minibatch, minibatch_by_words
@ -52,15 +53,15 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
dev_path = util.ensure_path(dev_data) dev_path = util.ensure_path(dev_data)
meta_path = util.ensure_path(meta_path) meta_path = util.ensure_path(meta_path)
if not train_path.exists(): if not train_path.exists():
prints(train_path, title="Training data not found", exits=1) prints(train_path, title=Messages.M050, exits=1)
if dev_path and not dev_path.exists(): if dev_path and not dev_path.exists():
prints(dev_path, title="Development data not found", exits=1) prints(dev_path, title=Messages.M051, exits=1)
if meta_path is not None and not meta_path.exists(): if meta_path is not None and not meta_path.exists():
prints(meta_path, title="meta.json not found", exits=1) prints(meta_path, title=Messages.M020, exits=1)
meta = util.read_json(meta_path) if meta_path else {} meta = util.read_json(meta_path) if meta_path else {}
if not isinstance(meta, dict): if not isinstance(meta, dict):
prints("Expected dict but got: {}".format(type(meta)), prints(Messages.M053.format(meta_type=type(meta)),
title="Not a valid meta.json format", exits=1) title=Messages.M052, exits=1)
meta.setdefault('lang', lang) meta.setdefault('lang', lang)
meta.setdefault('name', 'unnamed') meta.setdefault('name', 'unnamed')
@ -94,6 +95,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
meta['pipeline'] = pipeline meta['pipeline'] = pipeline
nlp.meta.update(meta) nlp.meta.update(meta)
if vectors: if vectors:
print("Load vectors model", vectors)
util.load_model(vectors, vocab=nlp.vocab) util.load_model(vectors, vocab=nlp.vocab)
for lex in nlp.vocab: for lex in nlp.vocab:
values = {} values = {}

View File

@ -1,12 +1,13 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals, print_function from __future__ import unicode_literals, print_function
import requests
import pkg_resources import pkg_resources
from pathlib import Path from pathlib import Path
import sys import sys
import ujson
from ..compat import path2str, locale_escape from ._messages import Messages
from ..compat import path2str, locale_escape, url_read, HTTPError
from ..util import prints, get_data_path, read_json from ..util import prints, get_data_path, read_json
from .. import about from .. import about
@ -15,16 +16,16 @@ def validate():
"""Validate that the currently installed version of spaCy is compatible """Validate that the currently installed version of spaCy is compatible
with the installed models. Should be run after `pip install -U spacy`. with the installed models. Should be run after `pip install -U spacy`.
""" """
r = requests.get(about.__compatibility__) try:
if r.status_code != 200: data = url_read(about.__compatibility__)
prints("Couldn't fetch compatibility table.", except HTTPError as e:
title="Server error (%d)" % r.status_code, exits=1) title = Messages.M003.format(code=e.code, desc=e.reason)
compat = r.json()['spacy'] prints(Messages.M021, title=title, exits=1)
compat = ujson.loads(data)['spacy']
current_compat = compat.get(about.__version__) current_compat = compat.get(about.__version__)
if not current_compat: if not current_compat:
prints(about.__compatibility__, exits=1, prints(about.__compatibility__, exits=1,
title="Can't find spaCy v{} in compatibility table" title=Messages.M022.format(version=about.__version__))
.format(about.__version__))
all_models = set() all_models = set()
for spacy_v, models in dict(compat).items(): for spacy_v, models in dict(compat).items():
all_models.update(models.keys()) all_models.update(models.keys())
@ -41,7 +42,7 @@ def validate():
update_models = [m for m in incompat_models if m in current_compat] update_models = [m for m in incompat_models if m in current_compat]
prints(path2str(Path(__file__).parent.parent), prints(path2str(Path(__file__).parent.parent),
title="Installed models (spaCy v{})".format(about.__version__)) title=Messages.M023.format(version=about.__version__))
if model_links or model_pkgs: if model_links or model_pkgs:
print(get_row('TYPE', 'NAME', 'MODEL', 'VERSION', '')) print(get_row('TYPE', 'NAME', 'MODEL', 'VERSION', ''))
for name, data in model_pkgs.items(): for name, data in model_pkgs.items():
@ -49,23 +50,16 @@ def validate():
for name, data in model_links.items(): for name, data in model_links.items():
print(get_model_row(current_compat, name, data, 'link')) print(get_model_row(current_compat, name, data, 'link'))
else: else:
prints("No models found in your current environment.", exits=0) prints(Messages.M024, exits=0)
if update_models: if update_models:
cmd = ' python -m spacy download {}' cmd = ' python -m spacy download {}'
print("\n Use the following commands to update the model packages:") print("\n " + Messages.M025)
print('\n'.join([cmd.format(pkg) for pkg in update_models])) print('\n'.join([cmd.format(pkg) for pkg in update_models]))
if na_models: if na_models:
prints("The following models are not available for spaCy v{}: {}" prints(Messages.M025.format(version=about.__version__,
.format(about.__version__, ', '.join(na_models))) models=', '.join(na_models)))
if incompat_links: if incompat_links:
prints("You may also want to overwrite the incompatible links using " prints(Messages.M027.format(path=path2str(get_data_path())))
"the `python -m spacy link` command with `--force`, or remove "
"them from the data directory. Data path: {}"
.format(path2str(get_data_path())))
if incompat_models or incompat_links: if incompat_models or incompat_links:
sys.exit(1) sys.exit(1)

View File

@ -1,7 +1,6 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
import ftfy
import sys import sys
import ujson import ujson
import itertools import itertools
@ -34,11 +33,20 @@ try:
except ImportError: except ImportError:
from thinc.neural.optimizers import Adam as Optimizer from thinc.neural.optimizers import Adam as Optimizer
try:
import urllib.request
except ImportError:
import urllib2 as urllib
try:
from urllib.error import HTTPError
except ImportError:
from urllib2 import HTTPError
pickle = pickle pickle = pickle
copy_reg = copy_reg copy_reg = copy_reg
CudaStream = CudaStream CudaStream = CudaStream
cupy = cupy cupy = cupy
fix_text = ftfy.fix_text
copy_array = copy_array copy_array = copy_array
izip = getattr(itertools, 'izip', zip) izip = getattr(itertools, 'izip', zip)
@ -58,6 +66,7 @@ if is_python2:
input_ = raw_input # noqa: F821 input_ = raw_input # noqa: F821
json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False).decode('utf8') json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False).decode('utf8')
path2str = lambda path: str(path).decode('utf8') path2str = lambda path: str(path).decode('utf8')
url_open = urllib.urlopen
elif is_python3: elif is_python3:
bytes_ = bytes bytes_ = bytes
@ -66,6 +75,16 @@ elif is_python3:
input_ = input input_ = input
json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False) json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False)
path2str = lambda path: str(path) path2str = lambda path: str(path)
url_open = urllib.request.urlopen
def url_read(url):
file_ = url_open(url)
code = file_.getcode()
if code != 200:
raise HTTPError(url, code, "Cannot GET url", [], file_)
data = file_.read()
return data
def b_to_str(b_str): def b_to_str(b_str):

View File

@ -4,6 +4,7 @@ from __future__ import unicode_literals
from .render import DependencyRenderer, EntityRenderer from .render import DependencyRenderer, EntityRenderer
from ..tokens import Doc from ..tokens import Doc
from ..compat import b_to_str from ..compat import b_to_str
from ..errors import Errors, Warnings, user_warning
from ..util import prints, is_in_jupyter from ..util import prints, is_in_jupyter
@ -27,7 +28,7 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
factories = {'dep': (DependencyRenderer, parse_deps), factories = {'dep': (DependencyRenderer, parse_deps),
'ent': (EntityRenderer, parse_ents)} 'ent': (EntityRenderer, parse_ents)}
if style not in factories: if style not in factories:
raise ValueError("Unknown style: %s" % style) raise ValueError(Errors.E087.format(style=style))
if isinstance(docs, Doc) or isinstance(docs, dict): if isinstance(docs, Doc) or isinstance(docs, dict):
docs = [docs] docs = [docs]
renderer, converter = factories[style] renderer, converter = factories[style]
@ -57,12 +58,12 @@ def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
render(docs, style=style, page=page, minify=minify, options=options, render(docs, style=style, page=page, minify=minify, options=options,
manual=manual) manual=manual)
httpd = simple_server.make_server('0.0.0.0', port, app) httpd = simple_server.make_server('0.0.0.0', port, app)
prints("Using the '%s' visualizer" % style, prints("Using the '{}' visualizer".format(style),
title="Serving on port %d..." % port) title="Serving on port {}...".format(port))
try: try:
httpd.serve_forever() httpd.serve_forever()
except KeyboardInterrupt: except KeyboardInterrupt:
prints("Shutting down server on port %d." % port) prints("Shutting down server on port {}.".format(port))
finally: finally:
httpd.server_close() httpd.server_close()
@ -83,6 +84,12 @@ def parse_deps(orig_doc, options={}):
RETURNS (dict): Generated dependency parse keyed by words and arcs. RETURNS (dict): Generated dependency parse keyed by words and arcs.
""" """
doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes()) doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
if not doc.is_parsed:
user_warning(Warnings.W005)
if options.get('collapse_phrases', False):
for np in list(doc.noun_chunks):
np.merge(tag=np.root.tag_, lemma=np.root.lemma_,
ent_type=np.root.ent_type_)
if options.get('collapse_punct', True): if options.get('collapse_punct', True):
spans = [] spans = []
for word in doc[:-1]: for word in doc[:-1]:
@ -120,6 +127,8 @@ def parse_ents(doc, options={}):
""" """
ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_} ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
for ent in doc.ents] for ent in doc.ents]
if not ents:
user_warning(Warnings.W006)
title = (doc.user_data.get('title', None) title = (doc.user_data.get('title', None)
if hasattr(doc, 'user_data') else None) if hasattr(doc, 'user_data') else None)
return {'text': doc.text, 'ents': ents, 'title': title} return {'text': doc.text, 'ents': ents, 'title': title}
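
The new collapse_phrases option merges each noun chunk into its root token before rendering, which can declutter the dependency visualization. A usage sketch, assuming any installed model that supports dependency parsing:

    import spacy
    from spacy import displacy

    nlp = spacy.load('en_core_web_sm')   # any model with a parser
    doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
    html = displacy.render(doc, style='dep', page=True,
                           options={'collapse_phrases': True, 'collapse_punct': True})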

spacy/errors.py (new file, 313 lines)
View File

@ -0,0 +1,313 @@
# coding: utf8
from __future__ import unicode_literals
import os
import warnings
import inspect
def add_codes(err_cls):
"""Add error codes to string messages via class attribute names."""
class ErrorsWithCodes(object):
def __getattribute__(self, code):
msg = getattr(err_cls, code)
return '[{code}] {msg}'.format(code=code, msg=msg)
return ErrorsWithCodes()
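
At attribute lookup the decorated class is replaced by an ErrorsWithCodes instance, so every access returns the message with its code prepended. For example, with the Errors table defined below:

    from spacy.errors import Errors

    print(Errors.E087.format(style='foo'))
    # [E087] Unknown displaCy style: foo.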
@add_codes
class Warnings(object):
W001 = ("As of spaCy v2.0, the keyword argument `path=` is deprecated. "
"You can now call spacy.load with the path as its first argument, "
"and the model's meta.json will be used to determine the language "
"to load. For example:\nnlp = spacy.load('{path}')")
W002 = ("Tokenizer.from_list is now deprecated. Create a new Doc object "
"instead and pass in the strings as the `words` keyword argument, "
"for example:\nfrom spacy.tokens import Doc\n"
"doc = Doc(nlp.vocab, words=[...])")
W003 = ("Positional arguments to Doc.merge are deprecated. Instead, use "
"the keyword arguments, for example tag=, lemma= or ent_type=.")
W004 = ("No text fixing enabled. Run `pip install ftfy` to enable fixing "
"using ftfy.fix_text if necessary.")
W005 = ("Doc object not parsed. This means displaCy won't be able to "
"generate a dependency visualization for it. Make sure the Doc "
"was processed with a model that supports dependency parsing, and "
"not just a language class like `English()`. For more info, see "
"the docs:\nhttps://spacy.io/usage/models")
W006 = ("No entities to visualize found in Doc object. If this is "
"surprising to you, make sure the Doc was processed using a model "
"that supports named entity recognition, and check the `doc.ents` "
"property manually if necessary.")
@add_codes
class Errors(object):
E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
E002 = ("Can't find factory for '{name}'. This usually happens when spaCy "
"calls `nlp.create_pipe` with a component name that's not built "
"in - for example, when constructing the pipeline from a model's "
"meta.json. If you're using a custom component, you can write to "
"`Language.factories['{name}']` or remove it from the model meta "
"and add it via `nlp.add_pipe` instead.")
E003 = ("Not a valid pipeline component. Expected callable, but "
"got {component} (name: '{name}').")
E004 = ("If you meant to add a built-in component, use `create_pipe`: "
"`nlp.add_pipe(nlp.create_pipe('{component}'))`")
E005 = ("Pipeline component '{name}' returned None. If you're using a "
"custom component, maybe you forgot to return the processed Doc?")
E006 = ("Invalid constraints. You can only set one of the following: "
"before, after, first, last.")
E007 = ("'{name}' already exists in pipeline. Existing names: {opts}")
E008 = ("Some current components would be lost when restoring previous "
"pipeline state. If you added components after calling "
"`nlp.disable_pipes()`, you should remove them explicitly with "
"`nlp.remove_pipe()` before the pipeline is restored. Names of "
"the new components: {names}")
E009 = ("The `update` method expects same number of docs and golds, but "
"got: {n_docs} docs, {n_golds} golds.")
E010 = ("Word vectors set to length 0. This may be because you don't have "
"a model installed or loaded, or because your model doesn't "
"include word vectors. For more info, see the docs:\n"
"https://spacy.io/usage/models")
E011 = ("Unknown operator: '{op}'. Options: {opts}")
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
E013 = ("Error selecting action in matcher")
E014 = ("Uknown tag ID: {tag}")
E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use "
"`force=True` to overwrite.")
E016 = ("MultitaskObjective target should be function or one of: dep, "
"tag, ent, dep_tag_offset, ent_tag.")
E017 = ("Can only add unicode or bytes. Got type: {value_type}")
E018 = ("Can't retrieve string for hash '{hash_value}'.")
E019 = ("Can't create transition with unknown action ID: {action}. Action "
"IDs are enumerated in spacy/syntax/{src}.pyx.")
E020 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The tree is non-projective (i.e. it has "
"crossing arcs - see spacy/syntax/nonproj.pyx for definitions). "
"The ArcEager transition system only supports projective trees. "
"To learn non-projective representations, transform the data "
"before training and after parsing. Either pass "
"`make_projective=True` to the GoldParse class, or use "
"spacy.syntax.nonproj.preprocess_training_data.")
E021 = ("Could not find a gold-standard action to supervise the "
"dependency parser. The GoldParse was projective. The transition "
"system has {n_actions} actions. State at failure: {state}")
E022 = ("Could not find a transition with the name '{name}' in the NER "
"model.")
E023 = ("Error cleaning up beam: The same state occurred twice at "
"memory address {addr} and position {i}.")
E024 = ("Could not find an optimal move to supervise the parser. Usually, "
"this means the GoldParse was not correct. For example, are all "
"labels added to the model?")
E025 = ("String is too long: {length} characters. Max is 2**30.")
E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
"length {length}.")
E027 = ("Arguments 'words' and 'spaces' should be sequences of the same "
"length, or 'spaces' should be left default at None. spaces "
"should be a sequence of booleans, with True meaning that the "
"word owns a ' ' character following it.")
E028 = ("orths_and_spaces expects either a list of unicode string or a "
"list of (unicode, bool) tuples. Got bytes instance: {value}")
E029 = ("noun_chunks requires the dependency parse, which requires a "
"statistical model to be installed and loaded. For more info, see "
"the documentation:\nhttps://spacy.io/usage/models")
E030 = ("Sentence boundaries unset. You can add the 'sentencizer' "
"component to the pipeline with: "
"nlp.add_pipe(nlp.create_pipe('sentencizer')) "
"Alternatively, add the dependency parser, or set sentence "
"boundaries by setting doc[i].is_sent_start.")
E031 = ("Invalid token: empty string ('') at position {i}.")
E032 = ("Conflicting attributes specified in doc.from_array(): "
"(HEAD, SENT_START). The HEAD attribute currently sets sentence "
"boundaries implicitly, based on the tree structure. This means "
"the HEAD attribute would potentially override the sentence "
"boundaries set by SENT_START.")
E033 = ("Cannot load into non-empty Doc of length {length}.")
E034 = ("Doc.merge received {n_args} non-keyword arguments. Expected "
"either 3 arguments (deprecated), or 0 (use keyword arguments).\n"
"Arguments supplied:\n{args}\nKeyword arguments:{kwargs}")
E035 = ("Error creating span with start {start} and end {end} for Doc of "
"length {length}.")
E036 = ("Error calculating span: Can't find a token starting at character "
"offset {start}.")
E037 = ("Error calculating span: Can't find a token ending at character "
"offset {end}.")
E038 = ("Error finding sentence for span. Infinite loop detected.")
E039 = ("Array bounds exceeded while searching for root word. This likely "
"means the parse tree is in an invalid state. Please report this "
"issue here: http://github.com/explosion/spaCy/issues")
E040 = ("Attempt to access token at {i}, max length {max_length}.")
E041 = ("Invalid comparison operator: {op}. Likely a Cython bug?")
E042 = ("Error accessing doc[{i}].nbor({j}), for doc of length {length}.")
E043 = ("Refusing to write to token.sent_start if its document is parsed, "
"because this may cause inconsistent state.")
E044 = ("Invalid value for token.sent_start: {value}. Must be one of: "
"None, True, False")
E045 = ("Possibly infinite loop encountered while looking for {attr}.")
E046 = ("Can't retrieve unregistered extension attribute '{name}'. Did "
"you forget to call the `set_extension` method?")
E047 = ("Can't assign a value to unregistered extension attribute "
"'{name}'. Did you forget to call the `set_extension` method?")
E048 = ("Can't import language {lang} from spacy.lang.")
E049 = ("Can't find spaCy data directory: '{path}'. Check your "
"installation and permissions, or use spacy.util.set_data_path "
"to customise the location if necessary.")
E050 = ("Can't find model '{name}'. It doesn't seem to be a shortcut "
"link, a Python package or a valid path to a data directory.")
E051 = ("Cant' load '{name}'. If you're using a shortcut link, make sure "
"it points to a valid package (not just a data directory).")
E052 = ("Can't find model directory: {path}")
E053 = ("Could not read meta.json from {path}")
E054 = ("No valid '{setting}' setting found in model meta.json.")
E055 = ("Invalid ORTH value in exception:\nKey: {key}\nOrths: {orths}")
E056 = ("Invalid tokenizer exception: ORTH values combined don't match "
"original string.\nKey: {key}\nOrths: {orths}")
E057 = ("Stepped slices not supported in Span objects. Try: "
"list(tokens)[start:stop:step] instead.")
E058 = ("Could not retrieve vector for key {key}.")
E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}")
E060 = ("Cannot add new key to vectors: the table is full. Current shape: "
"({rows}, {cols}).")
E061 = ("Bad file name: {filename}. Example of a valid file name: "
"'vectors.128.f.bin'")
E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 "
"and 63 are occupied. You can replace one by specifying the "
"`flag_id` explicitly, e.g. "
"`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA`.")
E063 = ("Invalid value for flag_id: {value}. Flag IDs must be between 1 "
"and 63 (inclusive).")
E064 = ("Error fetching a Lexeme from the Vocab. When looking up a "
"string, the lexeme returned had an orth ID that did not match "
"the query string. This means that the cached lexeme structs are "
"mismatched to the string encoding table. The mismatched:\n"
"Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}")
E065 = ("Only one of the vector table's width and shape can be specified. "
"Got width {width} and shape {shape}.")
E066 = ("Error creating model helper for extracting columns. Can only "
"extract columns by positive integer. Got: {value}.")
E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside "
"an entity) without a preceding 'B' (beginning of an entity). "
"Tag sequence:\n{tags}")
E068 = ("Invalid BILUO tag: '{tag}'.")
E069 = ("Invalid gold-standard parse tree. Found cycle between word "
"IDs: {cycle}")
E070 = ("Invalid gold-standard data. Number of documents ({n_docs}) "
"does not align with number of annotations ({n_annots}).")
E071 = ("Error creating lexeme: specified orth ID ({orth}) does not "
"match the one in the vocab ({vocab_orth}).")
E072 = ("Error serializing lexeme: expected data length {length}, "
"got {bad_length}.")
E073 = ("Cannot assign vector of length {new_length}. Existing vectors "
"are of length {length}. You can use `vocab.reset_vectors` to "
"clear the existing vectors and resize the table.")
E074 = ("Error interpreting compiled match pattern: patterns are expected "
"to end with the attribute {attr}. Got: {bad_attr}.")
E075 = ("Error accepting match: length ({length}) > maximum length "
"({max_len}).")
E076 = ("Error setting tensor on Doc: tensor has {rows} rows, while Doc "
"has {words} words.")
E077 = ("Error computing {value}: number of Docs ({n_docs}) does not "
"equal number of GoldParse objects ({n_golds}) in batch.")
E078 = ("Error computing score: number of words in Doc ({words_doc}) does "
"not equal number of words in GoldParse ({words_gold}).")
E079 = ("Error computing states in beam: number of predicted beams "
"({pbeams}) does not equal number of gold beams ({gbeams}).")
E080 = ("Duplicate state found in beam: {key}.")
E081 = ("Error getting gradient in beam: number of histories ({n_hist}) "
"does not equal number of losses ({losses}).")
E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), "
"projective heads ({n_proj_heads}) and labels ({n_labels}) do not "
"match.")
E083 = ("Error setting extension: only one of `default`, `method`, or "
"`getter` (plus optional `setter`) is allowed. Got: {nr_defined}")
E084 = ("Error assigning label ID {label} to span: not in StringStore.")
E085 = ("Can't create lexeme for string '{string}'.")
E086 = ("Error deserializing lexeme '{string}': orth ID {orth_id} does "
"not match hash {hash_id} in StringStore.")
E087 = ("Unknown displaCy style: {style}.")
E088 = ("Text of length {length} exceeds maximum of {max_length}. The "
"v2.x parser and NER models require roughly 1GB of temporary "
"memory per 100,000 characters in the input. This means long "
"texts may cause memory allocation errors. If you're not using "
"the parser or NER, it's probably safe to increase the "
"`nlp.max_length` limit. The limit is in number of characters, so "
"you can check whether your inputs are too long by checking "
"`len(text)`.")
E089 = ("Extensions can't have a setter argument without a getter "
"argument. Check the keyword arguments on `set_extension`.")
E090 = ("Extension '{name}' already exists on {obj}. To overwrite the "
"existing extension, set `force=True` on `{obj}.set_extension`.")
E091 = ("Invalid extension attribute {name}: expected callable or None, "
"but got: {value}")
E092 = ("Could not find or assign name for word vectors. Ususally, the "
"name is read from the model's meta.json in vector.name. "
"Alternatively, it is built from the 'lang' and 'name' keys in "
"the meta.json. Vector names are required to avoid issue #1660.")
E093 = ("token.ent_iob values make invalid sequence: I without B\n{seq}")
E094 = ("Error reading line {line_num} in vectors file {loc}.")
@add_codes
class TempErrors(object):
T001 = ("Max length currently 10 for phrase matching")
T002 = ("Pattern length ({doc_len}) >= phrase_matcher.max_length "
"({max_len}). Length can be set on initialization, up to 10.")
T003 = ("Resizing pre-trained Tagger models is not currently supported.")
T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.")
T005 = ("Currently history size is hard-coded to 0. Received: {value}.")
T006 = ("Currently history width is hard-coded to 0. Received: {value}.")
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
"issue tracker: http://github.com/explosion/spaCy/issues")
T008 = ("Bad configuration of Tagger. This is probably a bug within "
"spaCy. We changed the name of an internal attribute for loading "
"pre-trained vectors, and the class has been passed the old name "
"(pretrained_dims) but not the new name (pretrained_vectors).")
class ModelsWarning(UserWarning):
pass
WARNINGS = {
'user': UserWarning,
'deprecation': DeprecationWarning,
'models': ModelsWarning,
}
def _get_warn_types(arg):
if arg == '': # don't show any warnings
return []
if not arg or arg == 'all': # show all available warnings
return WARNINGS.keys()
return [w_type.strip() for w_type in arg.split(',')
if w_type.strip() in WARNINGS]
SPACY_WARNING_FILTER = os.environ.get('SPACY_WARNING_FILTER', 'always')
SPACY_WARNING_TYPES = _get_warn_types(os.environ.get('SPACY_WARNING_TYPES'))
def user_warning(message):
_warn(message, 'user')
def deprecation_warning(message):
_warn(message, 'deprecation')
def models_warning(message):
_warn(message, 'models')
def _warn(message, warn_type='user'):
"""
message (unicode): The message to display.
category (Warning): The Warning to show.
"""
if warn_type in SPACY_WARNING_TYPES:
category = WARNINGS[warn_type]
stack = inspect.stack()[-1]
with warnings.catch_warnings():
warnings.simplefilter(SPACY_WARNING_FILTER, category)
warnings.warn_explicit(message, category, stack[1], stack[2])
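
Pipeline code calls the helpers above; whether a warning is actually shown is controlled by the SPACY_WARNING_TYPES and SPACY_WARNING_FILTER environment variables read at import time. A minimal sketch:

    from spacy.errors import Warnings, user_warning

    # emits a UserWarning "[W004] No text fixing enabled. ..." unless 'user'
    # warnings are excluded, e.g. via SPACY_WARNING_TYPES="models,deprecation"
    user_warning(Warnings.W004)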

View File

@ -17,6 +17,7 @@ import ujson
from . import _align from . import _align
from .syntax import nonproj from .syntax import nonproj
from .tokens import Doc from .tokens import Doc
from .errors import Errors
from . import util from . import util
from .util import minibatch, itershuffle from .util import minibatch, itershuffle
from .compat import json_dumps from .compat import json_dumps
@ -37,7 +38,8 @@ def tags_to_entities(tags):
elif tag == '-': elif tag == '-':
continue continue
elif tag.startswith('I'): elif tag.startswith('I'):
assert start is not None, tags[:i] if start is None:
raise ValueError(Errors.E067.format(tags=tags[:i]))
continue continue
if tag.startswith('U'): if tag.startswith('U'):
entities.append((tag[2:], i, i)) entities.append((tag[2:], i, i))
@ -47,7 +49,7 @@ def tags_to_entities(tags):
entities.append((tag[2:], start, i)) entities.append((tag[2:], start, i))
start = None start = None
else: else:
raise Exception(tag) raise ValueError(Errors.E068.format(tag=tag))
return entities return entities
@ -225,7 +227,9 @@ class GoldCorpus(object):
@classmethod @classmethod
def _make_golds(cls, docs, paragraph_tuples, make_projective): def _make_golds(cls, docs, paragraph_tuples, make_projective):
assert len(docs) == len(paragraph_tuples) if len(docs) != len(paragraph_tuples):
raise ValueError(Errors.E070.format(n_docs=len(docs),
n_annots=len(paragraph_tuples)))
if len(docs) == 1: if len(docs) == 1:
return [GoldParse.from_annot_tuples(docs[0], return [GoldParse.from_annot_tuples(docs[0],
paragraph_tuples[0][0], paragraph_tuples[0][0],
@ -525,7 +529,7 @@ cdef class GoldParse:
cycle = nonproj.contains_cycle(self.heads) cycle = nonproj.contains_cycle(self.heads)
if cycle is not None: if cycle is not None:
raise Exception("Cycle found: %s" % cycle) raise ValueError(Errors.E069.format(cycle=cycle))
def __len__(self): def __len__(self):
"""Get the number of gold-standard tokens. """Get the number of gold-standard tokens.

View File

@ -8,6 +8,7 @@ from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .morph_rules import MORPH_RULES from .morph_rules import MORPH_RULES
from ..tag_map import TAG_MAP from ..tag_map import TAG_MAP
from .lemmatizer import LOOKUP
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS from ..norm_exceptions import BASE_NORMS
@ -28,6 +29,7 @@ class DanishDefaults(Language.Defaults):
suffixes = TOKENIZER_SUFFIXES suffixes = TOKENIZER_SUFFIXES
tag_map = TAG_MAP tag_map = TAG_MAP
stop_words = STOP_WORDS stop_words = STOP_WORDS
lemma_lookup = LOOKUP
class Danish(Language): class Danish(Language):

spacy/lang/da/lemmatizer.py (new file, 692,415 lines)

File diff suppressed because it is too large

View File

@ -286069,7 +286069,6 @@ LOOKUP = {
"sonnolente": "sonnolento", "sonnolente": "sonnolento",
"sonnolenti": "sonnolento", "sonnolenti": "sonnolento",
"sonnolenze": "sonnolenza", "sonnolenze": "sonnolenza",
"sono": "sonare",
"sonora": "sonoro", "sonora": "sonoro",
"sonore": "sonoro", "sonore": "sonoro",
"sonori": "sonoro", "sonori": "sonoro",
@ -333681,6 +333680,7 @@ LOOKUP = {
"zurliniane": "zurliniano", "zurliniane": "zurliniano",
"zurliniani": "zurliniano", "zurliniani": "zurliniano",
"àncore": "àncora", "àncore": "àncora",
"sono": "essere",
"è": "essere", "è": "essere",
"èlites": "èlite", "èlites": "èlite",
"ère": "èra", "ère": "èra",

View File

@ -190262,7 +190262,6 @@ LOOKUP = {
"gämserna": "gäms", "gämserna": "gäms",
"gämsernas": "gäms", "gämsernas": "gäms",
"gämsers": "gäms", "gämsers": "gäms",
"gäng": "gänga",
"gängad": "gänga", "gängad": "gänga",
"gängade": "gängad", "gängade": "gängad",
"gängades": "gängad", "gängades": "gängad",
@ -651423,7 +651422,6 @@ LOOKUP = {
"åpnasts": "åpen", "åpnasts": "åpen",
"åpne": "åpen", "åpne": "åpen",
"åpnes": "åpen", "åpnes": "åpen",
"år": "åra",
"åran": "åra", "åran": "åra",
"årans": "åra", "årans": "åra",
"åras": "åra", "åras": "åra",

View File

@@ -1,19 +1,53 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...attrs import LANG
+from ...attrs import LANG, NORM
+from ..norm_exceptions import BASE_NORMS
 from ...language import Language
 from ...tokens import Doc
+from .stop_words import STOP_WORDS
+from ...util import update_exc, add_lookups
+from .lex_attrs import LEX_ATTRS
+#from ..tokenizer_exceptions import BASE_EXCEPTIONS
+#from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS


 class VietnameseDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: 'vi'  # for pickling
+    # add more norm exception dictionaries here
+    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
+    # overwrite functions for lexical attributes
+    lex_attr_getters.update(LEX_ATTRS)
+    # merge base exceptions and custom tokenizer exceptions
+    #tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    stop_words = STOP_WORDS
+    use_pyvi = True


 class Vietnamese(Language):
     lang = 'vi'
     Defaults = VietnameseDefaults  # override defaults

+    def make_doc(self, text):
+        if self.Defaults.use_pyvi:
+            try:
+                from pyvi import ViTokenizer
+            except ImportError:
+                msg = ("Pyvi not installed. Either set Vietnamese.use_pyvi = False, "
+                       "or install it https://pypi.python.org/pypi/pyvi")
+                raise ImportError(msg)
+            words, spaces = ViTokenizer.spacy_tokenize(text)
+            return Doc(self.vocab, words=words, spaces=spaces)
+        else:
+            words = []
+            spaces = []
+            doc = self.tokenizer(text)
+            for token in self.tokenizer(text):
+                words.extend(list(token.text))
+                spaces.extend([False]*len(token.text))
+                spaces[-1] = bool(token.whitespace_)
+            return Doc(self.vocab, words=words, spaces=spaces)

 __all__ = ['Vietnamese']
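For orientation, here is a minimal usage sketch of the new Vietnamese class. It assumes pyvi is installed and that the package is importable as spacy.lang.vi, mirroring the other language packages:

```python
# Minimal sketch, assuming pyvi is installed and spacy.lang.vi is importable
from spacy.lang.vi import Vietnamese

nlp = Vietnamese()
doc = nlp.make_doc(u'Tôi là sinh viên')      # word segmentation via pyvi
print([t.text for t in doc])

# Fall back to the per-character tokenization branch shown above
Vietnamese.Defaults.use_pyvi = False
doc = nlp.make_doc(u'Tôi là sinh viên')
```

Note that the ImportError message refers to Vietnamese.use_pyvi, while make_doc reads the flag from self.Defaults.use_pyvi; the sketch sets it on Defaults to match the code path actually checked.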


@ -0,0 +1,26 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = ['không', 'một', 'hai', 'ba', 'bốn', 'năm', 'sáu', 'bẩy',
'tám', 'chín', 'mười', 'trăm', 'tỷ']
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
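A quick check of the predicate above, called directly; the import path is an assumption based on the package layout (the new file is presumably spacy/lang/vi/lex_attrs.py):

```python
# Assumes the module above lives at spacy.lang.vi.lex_attrs
from spacy.lang.vi.lex_attrs import like_num

assert like_num(u'1.000')          # digit string with separators stripped
assert like_num(u'3/4')            # simple fraction
assert like_num(u'mười')           # number word from _num_words
assert not like_num(u'xin chào')   # ordinary word
```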

spacy/lang/vi/stop_words.py (new file, 1,951 lines; diff suppressed because it is too large)

spacy/lang/vi/tag_map.py (new file, 36 lines)

@ -0,0 +1,36 @@
# coding: utf8
from __future__ import unicode_literals
from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ
# Add a tag map
# Documentation: https://spacy.io/docs/usage/adding-languages#tag-map
# Universal Dependencies: http://universaldependencies.org/u/pos/all.html
# The keys of the tag map should be strings in your tag set. The dictionary must
# have an entry POS whose value is one of the Universal Dependencies tags.
# Optionally, you can also include morphological features or other attributes.
TAG_MAP = {
"ADV": {POS: ADV},
"NOUN": {POS: NOUN},
"ADP": {POS: ADP},
"PRON": {POS: PRON},
"SCONJ": {POS: SCONJ},
"PROPN": {POS: PROPN},
"DET": {POS: DET},
"SYM": {POS: SYM},
"INTJ": {POS: INTJ},
"PUNCT": {POS: PUNCT},
"NUM": {POS: NUM},
"AUX": {POS: AUX},
"X": {POS: X},
"CONJ": {POS: CONJ},
"CCONJ": {POS: CCONJ},
"ADJ": {POS: ADJ},
"VERB": {POS: VERB},
"PART": {POS: PART},
"SP": {POS: SPACE}
}
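As the comments above note, entries may also carry morphological features alongside the POS value. A hypothetical entry, not part of this file, might look like:

```python
# Hypothetical tag-map entry with an extra morphological feature (illustration only)
TAG_MAP["NOUN__Number=Plur"] = {POS: NOUN, "Number": "plur"}
```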


@ -28,6 +28,7 @@ from .lang.punctuation import TOKENIZER_INFIXES
from .lang.tokenizer_exceptions import TOKEN_MATCH from .lang.tokenizer_exceptions import TOKEN_MATCH
from .lang.tag_map import TAG_MAP from .lang.tag_map import TAG_MAP
from .lang.lex_attrs import LEX_ATTRS, is_stop from .lang.lex_attrs import LEX_ATTRS, is_stop
from .errors import Errors
from . import util from . import util
from . import about from . import about
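Throughout this merge, literal error strings are replaced by codes looked up on an Errors catalogue. A minimal sketch of the pattern follows; the real codes and messages live in spacy/errors.py, and the wording below is illustrative only:

```python
# Sketch of the Errors catalogue pattern; message text here is illustrative
class Errors(object):
    E001 = "No component '{name}' found in pipeline. Available names: {opts}"
    E002 = "Can't find factory for '{name}'."

# Call sites format the template with keyword arguments, e.g.:
# raise KeyError(Errors.E001.format(name='ner', opts=['tagger', 'parser']))
```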
@ -112,7 +113,7 @@ class Language(object):
'merge_subtokens': lambda nlp, **cfg: merge_subtokens, 'merge_subtokens': lambda nlp, **cfg: merge_subtokens,
} }
def __init__(self, vocab=True, make_doc=True, meta={}, **kwargs): def __init__(self, vocab=True, make_doc=True, max_length=10**6, meta={}, **kwargs):
"""Initialise a Language object. """Initialise a Language object.
vocab (Vocab): A `Vocab` object. If `True`, a vocab is created via vocab (Vocab): A `Vocab` object. If `True`, a vocab is created via
@ -127,6 +128,15 @@ class Language(object):
string occurs in both, the component is not loaded. string occurs in both, the component is not loaded.
meta (dict): Custom meta data for the Language class. Is written to by meta (dict): Custom meta data for the Language class. Is written to by
models to add model meta data. models to add model meta data.
max_length (int): Maximum number of characters in a single text. The
    current v2 models may run out of memory on extremely long texts, due
    to large internal allocations. You should segment these texts into
    meaningful units, e.g. paragraphs, subsections etc., before passing
    them to spaCy. The default maximum length is 1,000,000 characters
    (1MB). As a rule of thumb, if all pipeline components are enabled,
    spaCy's default models currently require roughly 1GB of temporary
    memory per 100,000 characters in one text.
RETURNS (Language): The newly constructed object. RETURNS (Language): The newly constructed object.
""" """
self._meta = dict(meta) self._meta = dict(meta)
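Since the limit is stored on the instance (self.max_length below), it can also be adjusted after construction; a small sketch under that assumption:

```python
# Sketch: adjusting the new max_length guard; values are illustrative
from spacy.lang.en import English

nlp = English()                  # any Language subclass behaves the same
nlp.max_length = 2 * 10**6       # default is 10**6 characters
doc = nlp(u'...')                # texts >= max_length now raise Errors.E088
```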
@ -134,12 +144,15 @@ class Language(object):
if vocab is True: if vocab is True:
factory = self.Defaults.create_vocab factory = self.Defaults.create_vocab
vocab = factory(self, **meta.get('vocab', {})) vocab = factory(self, **meta.get('vocab', {}))
if vocab.vectors.name is None:
vocab.vectors.name = meta.get('vectors', {}).get('name')
self.vocab = vocab self.vocab = vocab
if make_doc is True: if make_doc is True:
factory = self.Defaults.create_tokenizer factory = self.Defaults.create_tokenizer
make_doc = factory(self, **meta.get('tokenizer', {})) make_doc = factory(self, **meta.get('tokenizer', {}))
self.tokenizer = make_doc self.tokenizer = make_doc
self.pipeline = [] self.pipeline = []
self.max_length = max_length
self._optimizer = None self._optimizer = None
@property @property
@ -159,7 +172,8 @@ class Language(object):
self._meta.setdefault('license', '') self._meta.setdefault('license', '')
self._meta['vectors'] = {'width': self.vocab.vectors_length, self._meta['vectors'] = {'width': self.vocab.vectors_length,
'vectors': len(self.vocab.vectors), 'vectors': len(self.vocab.vectors),
'keys': self.vocab.vectors.n_keys} 'keys': self.vocab.vectors.n_keys,
'name': self.vocab.vectors.name}
self._meta['pipeline'] = self.pipe_names self._meta['pipeline'] = self.pipe_names
return self._meta return self._meta
@ -205,8 +219,7 @@ class Language(object):
for pipe_name, component in self.pipeline: for pipe_name, component in self.pipeline:
if pipe_name == name: if pipe_name == name:
return component return component
msg = "No component '{}' found in pipeline. Available names: {}" raise KeyError(Errors.E001.format(name=name, opts=self.pipe_names))
raise KeyError(msg.format(name, self.pipe_names))
def create_pipe(self, name, config=dict()): def create_pipe(self, name, config=dict()):
"""Create a pipeline component from a factory. """Create a pipeline component from a factory.
@ -216,7 +229,7 @@ class Language(object):
RETURNS (callable): Pipeline component. RETURNS (callable): Pipeline component.
""" """
if name not in self.factories: if name not in self.factories:
raise KeyError("Can't find factory for '{}'.".format(name)) raise KeyError(Errors.E002.format(name=name))
factory = self.factories[name] factory = self.factories[name]
return factory(self, **config) return factory(self, **config)
@ -241,12 +254,9 @@ class Language(object):
>>> nlp.add_pipe(component, name='custom_name', last=True) >>> nlp.add_pipe(component, name='custom_name', last=True)
""" """
if not hasattr(component, '__call__'): if not hasattr(component, '__call__'):
msg = ("Not a valid pipeline component. Expected callable, but " msg = Errors.E003.format(component=repr(component), name=name)
"got {}. ".format(repr(component)))
if isinstance(component, basestring_) and component in self.factories: if isinstance(component, basestring_) and component in self.factories:
msg += ("If you meant to add a built-in component, use " msg += Errors.E004.format(component=component)
"create_pipe: nlp.add_pipe(nlp.create_pipe('{}'))"
.format(component))
raise ValueError(msg) raise ValueError(msg)
if name is None: if name is None:
if hasattr(component, 'name'): if hasattr(component, 'name'):
@ -259,11 +269,9 @@ class Language(object):
else: else:
name = repr(component) name = repr(component)
if name in self.pipe_names: if name in self.pipe_names:
raise ValueError("'{}' already exists in pipeline.".format(name)) raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2: if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
msg = ("Invalid constraints. You can only set one of the " raise ValueError(Errors.E006)
"following: before, after, first, last.")
raise ValueError(msg)
pipe = (name, component) pipe = (name, component)
if last or not any([first, before, after]): if last or not any([first, before, after]):
self.pipeline.append(pipe) self.pipeline.append(pipe)
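Putting the positioning logic together, typical calls look like the sketch below; the component and names are illustrative, and only one of before/after/first/last may be given, otherwise Errors.E006 is raised:

```python
# Illustrative add_pipe calls on a blank pipeline
from spacy.lang.en import English

def my_component(doc):
    return doc

nlp = English()
nlp.add_pipe(my_component, last=True)                      # append (the default)
nlp.add_pipe(my_component, name='custom', before='my_component')
# nlp.add_pipe(my_component, first=True, last=True)        # ValueError: Errors.E006
```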
@ -274,9 +282,8 @@ class Language(object):
elif after and after in self.pipe_names: elif after and after in self.pipe_names:
self.pipeline.insert(self.pipe_names.index(after) + 1, pipe) self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
else: else:
msg = "Can't find '{}' in pipeline. Available names: {}" raise ValueError(Errors.E001.format(name=before or after,
unfound = before or after opts=self.pipe_names))
raise ValueError(msg.format(unfound, self.pipe_names))
def has_pipe(self, name): def has_pipe(self, name):
"""Check if a component name is present in the pipeline. Equivalent to """Check if a component name is present in the pipeline. Equivalent to
@ -294,8 +301,7 @@ class Language(object):
component (callable): Pipeline component. component (callable): Pipeline component.
""" """
if name not in self.pipe_names: if name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}" raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
raise ValueError(msg.format(name, self.pipe_names))
self.pipeline[self.pipe_names.index(name)] = (name, component) self.pipeline[self.pipe_names.index(name)] = (name, component)
def rename_pipe(self, old_name, new_name): def rename_pipe(self, old_name, new_name):
@ -305,11 +311,9 @@ class Language(object):
new_name (unicode): New name of the component. new_name (unicode): New name of the component.
""" """
if old_name not in self.pipe_names: if old_name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}" raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names))
raise ValueError(msg.format(old_name, self.pipe_names))
if new_name in self.pipe_names: if new_name in self.pipe_names:
msg = "'{}' already exists in pipeline. Existing names: {}" raise ValueError(Errors.E007.format(name=new_name, opts=self.pipe_names))
raise ValueError(msg.format(new_name, self.pipe_names))
i = self.pipe_names.index(old_name) i = self.pipe_names.index(old_name)
self.pipeline[i] = (new_name, self.pipeline[i][1]) self.pipeline[i] = (new_name, self.pipeline[i][1])
@ -320,8 +324,7 @@ class Language(object):
RETURNS (tuple): A `(name, component)` tuple of the removed component. RETURNS (tuple): A `(name, component)` tuple of the removed component.
""" """
if name not in self.pipe_names: if name not in self.pipe_names:
msg = "Can't find '{}' in pipeline. Available names: {}" raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
raise ValueError(msg.format(name, self.pipe_names))
return self.pipeline.pop(self.pipe_names.index(name)) return self.pipeline.pop(self.pipe_names.index(name))
def __call__(self, text, disable=[]): def __call__(self, text, disable=[]):
@ -338,11 +341,18 @@ class Language(object):
>>> tokens[0].text, tokens[0].head.tag_ >>> tokens[0].text, tokens[0].head.tag_
('An', 'NN') ('An', 'NN')
""" """
if len(text) >= self.max_length:
raise ValueError(Errors.E088.format(length=len(text),
max_length=self.max_length))
doc = self.make_doc(text) doc = self.make_doc(text)
for name, proc in self.pipeline: for name, proc in self.pipeline:
if name in disable: if name in disable:
continue continue
if not hasattr(proc, '__call__'):
raise ValueError(Errors.E003.format(component=type(proc), name=name))
doc = proc(doc) doc = proc(doc)
if doc is None:
raise ValueError(Errors.E005.format(name=name))
return doc return doc
def disable_pipes(self, *names): def disable_pipes(self, *names):
@ -384,8 +394,7 @@ class Language(object):
>>> state = nlp.update(docs, golds, sgd=optimizer) >>> state = nlp.update(docs, golds, sgd=optimizer)
""" """
if len(docs) != len(golds): if len(docs) != len(golds):
raise IndexError("Update expects same number of docs and golds " raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds)))
"Got: %d, %d" % (len(docs), len(golds)))
if len(docs) == 0: if len(docs) == 0:
return return
if sgd is None: if sgd is None:
@ -458,6 +467,8 @@ class Language(object):
else: else:
device = None device = None
link_vectors_to_models(self.vocab) link_vectors_to_models(self.vocab)
if self.vocab.vectors.data.shape[1]:
cfg['pretrained_vectors'] = self.vocab.vectors.name
if sgd is None: if sgd is None:
sgd = create_default_optimizer(Model.ops) sgd = create_default_optimizer(Model.ops)
self._optimizer = sgd self._optimizer = sgd
@ -626,9 +637,10 @@ class Language(object):
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
deserializers = OrderedDict(( deserializers = OrderedDict((
('vocab', lambda p: self.vocab.from_disk(p)), ('meta.json', lambda p: self.meta.update(util.read_json(p))),
('vocab', lambda p: (
self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)), ('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
('meta.json', lambda p: self.meta.update(util.read_json(p)))
)) ))
for name, proc in self.pipeline: for name, proc in self.pipeline:
if name in disable: if name in disable:
@ -671,9 +683,10 @@ class Language(object):
RETURNS (Language): The `Language` object. RETURNS (Language): The `Language` object.
""" """
deserializers = OrderedDict(( deserializers = OrderedDict((
('vocab', lambda b: self.vocab.from_bytes(b)), ('meta', lambda b: self.meta.update(ujson.loads(b))),
('vocab', lambda b: (
self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self))),
('tokenizer', lambda b: self.tokenizer.from_bytes(b, vocab=False)), ('tokenizer', lambda b: self.tokenizer.from_bytes(b, vocab=False)),
('meta', lambda b: self.meta.update(ujson.loads(b)))
)) ))
for i, (name, proc) in enumerate(self.pipeline): for i, (name, proc) in enumerate(self.pipeline):
if name in disable: if name in disable:
@ -685,6 +698,27 @@ class Language(object):
return self return self
def _fix_pretrained_vectors_name(nlp):
# TODO: Replace this once we handle vectors consistently as static
# data
if 'vectors' in nlp.meta and nlp.meta['vectors'].get('name'):
nlp.vocab.vectors.name = nlp.meta['vectors']['name']
elif not nlp.vocab.vectors.size:
nlp.vocab.vectors.name = None
elif 'name' in nlp.meta and 'lang' in nlp.meta:
vectors_name = '%s_%s.vectors' % (nlp.meta['lang'], nlp.meta['name'])
nlp.vocab.vectors.name = vectors_name
else:
raise ValueError(Errors.E092)
if nlp.vocab.vectors.size != 0:
link_vectors_to_models(nlp.vocab)
for name, proc in nlp.pipeline:
if not hasattr(proc, 'cfg'):
continue
proc.cfg.setdefault('deprecation_fixes', {})
proc.cfg['deprecation_fixes']['vectors_name'] = nlp.vocab.vectors.name
class DisabledPipes(list): class DisabledPipes(list):
"""Manager for temporary pipeline disabling.""" """Manager for temporary pipeline disabling."""
def __init__(self, nlp, *names): def __init__(self, nlp, *names):
@ -711,14 +745,7 @@ class DisabledPipes(list):
if unexpected: if unexpected:
# Don't change the pipeline if we're raising an error. # Don't change the pipeline if we're raising an error.
self.nlp.pipeline = current self.nlp.pipeline = current
msg = ( raise ValueError(Errors.E008.format(names=unexpected))
"Some current components would be lost when restoring "
"previous pipeline state. If you added components after "
"calling nlp.disable_pipes(), you should remove them "
"explicitly with nlp.remove_pipe() before the pipeline is "
"restore. Names of the new components: %s"
)
raise ValueError(msg % unexpected)
self[:] = [] self[:] = []
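DisabledPipes is what nlp.disable_pipes() returns; a usage sketch, assuming nlp is a loaded pipeline with 'tagger', 'parser' and 'ner' components and that DisabledPipes also implements the context-manager protocol as in the released library:

```python
# Sketch: temporarily disabling components; names assume a full pipeline
with nlp.disable_pipes('tagger', 'parser'):
    doc = nlp(u'Only the remaining components run here.')

# Or restore explicitly
disabled = nlp.disable_pipes('ner')
doc = nlp(u'...')
disabled.restore()
```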


@ -15,7 +15,7 @@ from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV
from .attrs cimport PROB from .attrs cimport PROB
from .attrs import intify_attrs from .attrs import intify_attrs
from . import about from .errors import Errors
memset(&EMPTY_LEXEME, 0, sizeof(LexemeC)) memset(&EMPTY_LEXEME, 0, sizeof(LexemeC))
@ -37,7 +37,8 @@ cdef class Lexeme:
self.vocab = vocab self.vocab = vocab
self.orth = orth self.orth = orth
self.c = <LexemeC*><void*>vocab.get_by_orth(vocab.mem, orth) self.c = <LexemeC*><void*>vocab.get_by_orth(vocab.mem, orth)
assert self.c.orth == orth if self.c.orth != orth:
raise ValueError(Errors.E071.format(orth=orth, vocab_orth=self.c.orth))
def __richcmp__(self, other, int op): def __richcmp__(self, other, int op):
if other is None: if other is None:
@ -129,20 +130,25 @@ cdef class Lexeme:
lex_data = Lexeme.c_to_bytes(self.c) lex_data = Lexeme.c_to_bytes(self.c)
start = <const char*>&self.c.flags start = <const char*>&self.c.flags
end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment) end = <const char*>&self.c.sentiment + sizeof(self.c.sentiment)
assert (end-start) == sizeof(lex_data.data), (end-start, sizeof(lex_data.data)) if (end-start) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=end-start,
bad_length=sizeof(lex_data.data)))
byte_string = b'\0' * sizeof(lex_data.data) byte_string = b'\0' * sizeof(lex_data.data)
byte_chars = <char*>byte_string byte_chars = <char*>byte_string
for i in range(sizeof(lex_data.data)): for i in range(sizeof(lex_data.data)):
byte_chars[i] = lex_data.data[i] byte_chars[i] = lex_data.data[i]
assert len(byte_string) == sizeof(lex_data.data), (len(byte_string), if len(byte_string) != sizeof(lex_data.data):
sizeof(lex_data.data)) raise ValueError(Errors.E072.format(length=len(byte_string),
bad_length=sizeof(lex_data.data)))
return byte_string return byte_string
def from_bytes(self, bytes byte_string): def from_bytes(self, bytes byte_string):
# This method doesn't really have a use-case --- wrote it for testing. # This method doesn't really have a use-case --- wrote it for testing.
# Possibly delete? It puts the Lexeme out of synch with the vocab. # Possibly delete? It puts the Lexeme out of synch with the vocab.
cdef SerializedLexemeC lex_data cdef SerializedLexemeC lex_data
assert len(byte_string) == sizeof(lex_data.data) if len(byte_string) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=len(byte_string),
bad_length=sizeof(lex_data.data)))
for i in range(len(byte_string)): for i in range(len(byte_string)):
lex_data.data[i] = byte_string[i] lex_data.data[i] = byte_string[i]
Lexeme.c_from_bytes(self.c, lex_data) Lexeme.c_from_bytes(self.c, lex_data)
@ -169,16 +175,13 @@ cdef class Lexeme:
def __get__(self): def __get__(self):
cdef int length = self.vocab.vectors_length cdef int length = self.vocab.vectors_length
if length == 0: if length == 0:
raise ValueError( raise ValueError(Errors.E010)
"Word vectors set to length 0. This may be because you "
"don't have a model installed or loaded, or because your "
"model doesn't include word vectors. For more info, see "
"the documentation: \n%s\n" % about.__docs_models__
)
return self.vocab.get_vector(self.c.orth) return self.vocab.get_vector(self.c.orth)
def __set__(self, vector): def __set__(self, vector):
assert len(vector) == self.vocab.vectors_length if len(vector) != self.vocab.vectors_length:
raise ValueError(Errors.E073.format(new_length=len(vector),
length=self.vocab.vectors_length))
self.vocab.set_vector(self.c.orth, vector) self.vocab.set_vector(self.c.orth, vector)
property rank: property rank:


@ -13,6 +13,8 @@ from .vocab cimport Vocab
from .tokens.doc cimport Doc from .tokens.doc cimport Doc
from .tokens.doc cimport get_token_attr from .tokens.doc cimport get_token_attr
from .attrs cimport ID, attr_id_t, NULL_ATTR from .attrs cimport ID, attr_id_t, NULL_ATTR
from .errors import Errors, TempErrors
from .attrs import IDS from .attrs import IDS
from .attrs import FLAG61 as U_ENT from .attrs import FLAG61 as U_ENT
from .attrs import FLAG60 as B2_ENT from .attrs import FLAG60 as B2_ENT
@ -321,6 +323,8 @@ cdef attr_t get_pattern_key(const TokenPatternC* pattern) nogil:
while pattern.nr_attr != 0: while pattern.nr_attr != 0:
pattern += 1 pattern += 1
id_attr = pattern[0].attrs[0] id_attr = pattern[0].attrs[0]
if id_attr.attr != ID:
raise ValueError(Errors.E074.format(attr=ID, bad_attr=id_attr.attr))
return id_attr.value return id_attr.value
def _convert_strings(token_specs, string_store): def _convert_strings(token_specs, string_store):
@ -341,8 +345,8 @@ def _convert_strings(token_specs, string_store):
if value in operators: if value in operators:
ops = operators[value] ops = operators[value]
else: else:
msg = "Unknown operator '%s'. Options: %s" keys = ', '.join(operators.keys())
raise KeyError(msg % (value, ', '.join(operators.keys()))) raise KeyError(Errors.E011.format(op=value, opts=keys))
if isinstance(attr, basestring): if isinstance(attr, basestring):
attr = IDS.get(attr.upper()) attr = IDS.get(attr.upper())
if isinstance(value, basestring): if isinstance(value, basestring):
@ -429,9 +433,7 @@ cdef class Matcher:
""" """
for pattern in patterns: for pattern in patterns:
if len(pattern) == 0: if len(pattern) == 0:
msg = ("Cannot add pattern for zero tokens to matcher.\n" raise ValueError(Errors.E012.format(key=key))
"key: {key}\n")
raise ValueError(msg.format(key=key))
key = self._normalize_key(key) key = self._normalize_key(key)
for pattern in patterns: for pattern in patterns:
specs = _convert_strings(pattern, self.vocab.strings) specs = _convert_strings(pattern, self.vocab.strings)
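For reference, the calls being validated here follow the v2 Matcher API; a hedged sketch with illustrative pattern content:

```python
# Illustrative Matcher usage; an empty pattern list now raises Errors.E012
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
matcher.add('HELLO_WORLD', None, [{'LOWER': 'hello'}, {'LOWER': 'world'}])
matches = matcher(nlp(u'Hello world!'))
# matcher.add('EMPTY', None, [])   # ValueError: zero-token pattern
```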


@ -9,6 +9,7 @@ from .attrs import LEMMA, intify_attrs
from .parts_of_speech cimport SPACE from .parts_of_speech cimport SPACE
from .parts_of_speech import IDS as POS_IDS from .parts_of_speech import IDS as POS_IDS
from .lexeme cimport Lexeme from .lexeme cimport Lexeme
from .errors import Errors
def _normalize_props(props): def _normalize_props(props):
@ -93,7 +94,7 @@ cdef class Morphology:
cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1: cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1:
if tag_id > self.n_tags: if tag_id > self.n_tags:
raise ValueError("Unknown tag ID: %s" % tag_id) raise ValueError(Errors.E014.format(tag=tag_id))
# TODO: It's pretty arbitrary to put this logic here. I guess the # TODO: It's pretty arbitrary to put this logic here. I guess the
# justification is that this is where the specific word and the tag # justification is that this is where the specific word and the tag
# interact. Still, we should have a better way to enforce this rule, or # interact. Still, we should have a better way to enforce this rule, or
@ -129,7 +130,7 @@ cdef class Morphology:
tag (unicode): The part-of-speech tag to key the exception. tag (unicode): The part-of-speech tag to key the exception.
orth (unicode): The word-form to key the exception. orth (unicode): The word-form to key the exception.
""" """
# TODO: Currently we've assumed that we know the number of tags -- # TODO: Currently we've assumed that we know the number of tags --
# RichTagC is an array, and _cache is a PreshMapArray # RichTagC is an array, and _cache is a PreshMapArray
# This is really bad: it makes the morphology typed to the tagger # This is really bad: it makes the morphology typed to the tagger
# classes, which is all wrong. # classes, which is all wrong.
@ -147,9 +148,7 @@ cdef class Morphology:
elif force: elif force:
memset(cached, 0, sizeof(cached[0])) memset(cached, 0, sizeof(cached[0]))
else: else:
raise ValueError( raise ValueError(Errors.E015.format(tag=tag_str, orth=orth_str))
"Conflicting morphology exception for (%s, %s). Use "
"force=True to overwrite." % (tag_str, orth_str))
cached.tag = rich_tag cached.tag = rich_tag
# TODO: Refactor this to take arbitrary attributes. # TODO: Refactor this to take arbitrary attributes.


@ -8,7 +8,9 @@ cimport numpy as np
import cytoolz import cytoolz
from collections import OrderedDict from collections import OrderedDict
import ujson import ujson
import msgpack
from .util import msgpack
from .util import msgpack_numpy
from thinc.api import chain from thinc.api import chain
from thinc.v2v import Affine, SELU, Softmax from thinc.v2v import Affine, SELU, Softmax
@ -32,6 +34,7 @@ from .parts_of_speech import X
from ._ml import Tok2Vec, build_text_classifier, build_tagger_model from ._ml import Tok2Vec, build_text_classifier, build_tagger_model
from ._ml import link_vectors_to_models, zero_init, flatten from ._ml import link_vectors_to_models, zero_init, flatten
from ._ml import create_default_optimizer from ._ml import create_default_optimizer
from .errors import Errors, TempErrors
from . import util from . import util
@ -77,7 +80,7 @@ def merge_noun_chunks(doc):
RETURNS (Doc): The Doc object with merged noun chunks. RETURNS (Doc): The Doc object with merged noun chunks.
""" """
if not doc.is_parsed: if not doc.is_parsed:
return return doc
spans = [(np.start_char, np.end_char, np.root.tag, np.root.dep) spans = [(np.start_char, np.end_char, np.root.tag, np.root.dep)
for np in doc.noun_chunks] for np in doc.noun_chunks]
for start, end, tag, dep in spans: for start, end, tag, dep in spans:
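Returning the Doc matters because this function is used as a pipeline component, and Language.__call__ above now raises Errors.E005 when a component returns None. A usage sketch, assuming 'merge_noun_chunks' is registered as a factory alongside the 'merge_subtokens' entry shown earlier and that nlp has a parser:

```python
# Sketch: adding the noun-chunk merger after the parser
nlp.add_pipe(nlp.create_pipe('merge_noun_chunks'), after='parser')
doc = nlp(u'The quick brown fox jumps over the lazy dog.')
```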
@ -214,8 +217,10 @@ class Pipe(object):
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, **exclude):
"""Load the pipe from a bytestring.""" """Load the pipe from a bytestring."""
def load_model(b): def load_model(b):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True: if self.model is True:
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
self.model = self.Model(**self.cfg) self.model = self.Model(**self.cfg)
self.model.from_bytes(b) self.model.from_bytes(b)
@ -239,8 +244,10 @@ class Pipe(object):
def from_disk(self, path, **exclude): def from_disk(self, path, **exclude):
"""Load the pipe from disk.""" """Load the pipe from disk."""
def load_model(p): def load_model(p):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True: if self.model is True:
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
self.model = self.Model(**self.cfg) self.model = self.Model(**self.cfg)
self.model.from_bytes(p.open('rb').read()) self.model.from_bytes(p.open('rb').read())
@ -298,7 +305,6 @@ class Tensorizer(Pipe):
self.model = model self.model = model
self.input_models = [] self.input_models = []
self.cfg = dict(cfg) self.cfg = dict(cfg)
self.cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
self.cfg.setdefault('cnn_maxout_pieces', 3) self.cfg.setdefault('cnn_maxout_pieces', 3)
def __call__(self, doc): def __call__(self, doc):
@ -343,7 +349,8 @@ class Tensorizer(Pipe):
tensors (object): Vector representation for each token in the docs. tensors (object): Vector representation for each token in the docs.
""" """
for doc, tensor in zip(docs, tensors): for doc, tensor in zip(docs, tensors):
assert tensor.shape[0] == len(doc) if tensor.shape[0] != len(doc):
raise ValueError(Errors.E076.format(rows=tensor.shape[0], words=len(doc)))
doc.tensor = tensor doc.tensor = tensor
def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None): def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None):
@ -415,8 +422,6 @@ class Tagger(Pipe):
self.model = model self.model = model
self.cfg = OrderedDict(sorted(cfg.items())) self.cfg = OrderedDict(sorted(cfg.items()))
self.cfg.setdefault('cnn_maxout_pieces', 2) self.cfg.setdefault('cnn_maxout_pieces', 2)
self.cfg.setdefault('pretrained_dims',
self.vocab.vectors.data.shape[1])
@property @property
def labels(self): def labels(self):
@ -477,7 +482,7 @@ class Tagger(Pipe):
doc.extend_tensor(tensors[i].get()) doc.extend_tensor(tensors[i].get())
else: else:
doc.extend_tensor(tensors[i]) doc.extend_tensor(tensors[i])
doc.is_tagged = True doc.is_tagged = True
def update(self, docs, golds, drop=0., sgd=None, losses=None): def update(self, docs, golds, drop=0., sgd=None, losses=None):
if losses is not None and self.name not in losses: if losses is not None and self.name not in losses:
@ -527,8 +532,8 @@ class Tagger(Pipe):
vocab.morphology = Morphology(vocab.strings, new_tag_map, vocab.morphology = Morphology(vocab.strings, new_tag_map,
vocab.morphology.lemmatizer, vocab.morphology.lemmatizer,
exc=vocab.morphology.exc) exc=vocab.morphology.exc)
self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
if self.model is True: if self.model is True:
self.cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
link_vectors_to_models(self.vocab) link_vectors_to_models(self.vocab)
if sgd is None: if sgd is None:
@ -537,6 +542,8 @@ class Tagger(Pipe):
@classmethod @classmethod
def Model(cls, n_tags, **cfg): def Model(cls, n_tags, **cfg):
if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'):
raise ValueError(TempErrors.T008)
return build_tagger_model(n_tags, **cfg) return build_tagger_model(n_tags, **cfg)
def add_label(self, label, values=None): def add_label(self, label, values=None):
@ -552,9 +559,7 @@ class Tagger(Pipe):
# copy_array(larger.W[:smaller.nO], smaller.W) # copy_array(larger.W[:smaller.nO], smaller.W)
# copy_array(larger.b[:smaller.nO], smaller.b) # copy_array(larger.b[:smaller.nO], smaller.b)
# self.model._layers[-1] = larger # self.model._layers[-1] = larger
raise ValueError( raise ValueError(TempErrors.T003)
"Resizing pre-trained Tagger models is not "
"currently supported.")
tag_map = dict(self.vocab.morphology.tag_map) tag_map = dict(self.vocab.morphology.tag_map)
if values is None: if values is None:
values = {POS: "X"} values = {POS: "X"}
@ -584,6 +589,10 @@ class Tagger(Pipe):
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, **exclude):
def load_model(b): def load_model(b):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True: if self.model is True:
token_vector_width = util.env_opt( token_vector_width = util.env_opt(
'token_vector_width', 'token_vector_width',
@ -609,7 +618,6 @@ class Tagger(Pipe):
return self return self
def to_disk(self, path, **exclude): def to_disk(self, path, **exclude):
self.cfg.setdefault('pretrained_dims', self.vocab.vectors.data.shape[1])
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items())) tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
serialize = OrderedDict(( serialize = OrderedDict((
('vocab', lambda p: self.vocab.to_disk(p)), ('vocab', lambda p: self.vocab.to_disk(p)),
@ -622,6 +630,9 @@ class Tagger(Pipe):
def from_disk(self, path, **exclude): def from_disk(self, path, **exclude):
def load_model(p): def load_model(p):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True: if self.model is True:
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
with p.open('rb') as file_: with p.open('rb') as file_:
@ -669,12 +680,9 @@ class MultitaskObjective(Tagger):
elif hasattr(target, '__call__'): elif hasattr(target, '__call__'):
self.make_label = target self.make_label = target
else: else:
raise ValueError("MultitaskObjective target should be function or " raise ValueError(Errors.E016)
"one of: dep, tag, ent, sent_start, dep_tag_offset, ent_tag.")
self.cfg = dict(cfg) self.cfg = dict(cfg)
self.cfg.setdefault('cnn_maxout_pieces', 2) self.cfg.setdefault('cnn_maxout_pieces', 2)
self.cfg.setdefault('pretrained_dims',
self.vocab.vectors.data.shape[1])
@property @property
def labels(self): def labels(self):
@ -723,7 +731,9 @@ class MultitaskObjective(Tagger):
return tokvecs, scores return tokvecs, scores
def get_loss(self, docs, golds, scores): def get_loss(self, docs, golds, scores):
assert len(docs) == len(golds) if len(docs) != len(golds):
raise ValueError(Errors.E077.format(value='loss', n_docs=len(docs),
n_golds=len(golds)))
cdef int idx = 0 cdef int idx = 0
correct = numpy.zeros((scores.shape[0],), dtype='i') correct = numpy.zeros((scores.shape[0],), dtype='i')
guesses = scores.argmax(axis=1) guesses = scores.argmax(axis=1)
@ -962,16 +972,17 @@ class TextCategorizer(Pipe):
self.labels.append(label) self.labels.append(label)
return 1 return 1
-    def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None):
+    def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None,
+                       **kwargs):
         if pipeline and getattr(pipeline[0], 'name', None) == 'tensorizer':
             token_vector_width = pipeline[0].model.nO
         else:
             token_vector_width = 64
         if self.model is True:
-            self.cfg['pretrained_dims'] = self.vocab.vectors_length
-            self.cfg['nr_class'] = len(self.labels)
-            self.cfg['width'] = token_vector_width
-            self.model = self.Model(**self.cfg)
+            self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
+            self.model = self.Model(len(self.labels), token_vector_width,
+                                    **self.cfg)
         link_vectors_to_models(self.vocab)
if sgd is None: if sgd is None:
sgd = self.create_optimizer() sgd = self.create_optimizer()


@ -2,6 +2,7 @@
from __future__ import division, print_function, unicode_literals from __future__ import division, print_function, unicode_literals
from .gold import tags_to_entities, GoldParse from .gold import tags_to_entities, GoldParse
from .errors import Errors
class PRFScore(object): class PRFScore(object):
@ -85,8 +86,7 @@ class Scorer(object):
     def score(self, tokens, gold, verbose=False, punct_labels=('p', 'punct')):
         if len(tokens) != len(gold):
-            gold = GoldParse.from_annot_tuples(tokens, zip(*gold.orig_annot))
-            assert len(tokens) == len(gold)
+            raise ValueError(Errors.E078.format(words_doc=len(tokens), words_gold=len(gold)))
gold_deps = set() gold_deps = set()
gold_tags = set() gold_tags = set()
gold_ents = set(tags_to_entities([annot[-1] gold_ents = set(tags_to_entities([annot[-1]


@ -13,6 +13,7 @@ from .symbols import IDS as SYMBOLS_BY_STR
from .symbols import NAMES as SYMBOLS_BY_INT from .symbols import NAMES as SYMBOLS_BY_INT
from .typedefs cimport hash_t from .typedefs cimport hash_t
from .compat import json_dumps from .compat import json_dumps
from .errors import Errors
from . import util from . import util
@ -59,7 +60,6 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
string.p = <unsigned char*>mem.alloc(length + 1, sizeof(unsigned char)) string.p = <unsigned char*>mem.alloc(length + 1, sizeof(unsigned char))
string.p[0] = length string.p[0] = length
memcpy(&string.p[1], chars, length) memcpy(&string.p[1], chars, length)
assert string.s[0] >= sizeof(string.s) or string.s[0] == 0, string.s[0]
return string return string
else: else:
i = 0 i = 0
@ -69,7 +69,6 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
string.p[i] = 255 string.p[i] = 255
string.p[n_length_bytes-1] = length % 255 string.p[n_length_bytes-1] = length % 255
memcpy(&string.p[n_length_bytes], chars, length) memcpy(&string.p[n_length_bytes], chars, length)
assert string.s[0] >= sizeof(string.s) or string.s[0] == 0, string.s[0]
return string return string
@ -115,7 +114,7 @@ cdef class StringStore:
self.hits.insert(key) self.hits.insert(key)
utf8str = <Utf8Str*>self._map.get(key) utf8str = <Utf8Str*>self._map.get(key)
if utf8str is NULL: if utf8str is NULL:
raise KeyError(string_or_id) raise KeyError(Errors.E018.format(hash_value=string_or_id))
else: else:
return decode_Utf8Str(utf8str) return decode_Utf8Str(utf8str)
@ -136,8 +135,7 @@ cdef class StringStore:
key = hash_utf8(string, len(string)) key = hash_utf8(string, len(string))
self._intern_utf8(string, len(string)) self._intern_utf8(string, len(string))
else: else:
raise TypeError( raise TypeError(Errors.E017.format(value_type=type(string)))
"Can only add unicode or bytes. Got type: %s" % type(string))
return key return key
def __len__(self): def __len__(self):
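The behaviour changed here is on the string store; a small round-trip sketch of the API being guarded:

```python
# Sketch: StringStore round trip; unknown hashes now raise a KeyError (Errors.E018)
from spacy.strings import StringStore

strings = StringStore()
key = strings.add(u'coffee')        # returns the 64-bit hash
assert strings[key] == u'coffee'
# strings.add(3.14)                 # non-string -> TypeError (Errors.E017)
```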


@ -10,6 +10,7 @@ from thinc.extra.search cimport MaxViolation
from .transition_system cimport TransitionSystem, Transition from .transition_system cimport TransitionSystem, Transition
from ..gold cimport GoldParse from ..gold cimport GoldParse
from ..errors import Errors
from .stateclass cimport StateC, StateClass from .stateclass cimport StateC, StateClass
@ -220,7 +221,8 @@ def get_states(pbeams, gbeams, beam_map, nr_update):
p_indices = [] p_indices = []
g_indices = [] g_indices = []
cdef Beam pbeam, gbeam cdef Beam pbeam, gbeam
assert len(pbeams) == len(gbeams) if len(pbeams) != len(gbeams):
raise ValueError(Errors.E079.format(pbeams=len(pbeams), gbeams=len(gbeams)))
for eg_id, (pbeam, gbeam) in enumerate(zip(pbeams, gbeams)): for eg_id, (pbeam, gbeam) in enumerate(zip(pbeams, gbeams)):
p_indices.append([]) p_indices.append([])
g_indices.append([]) g_indices.append([])
@ -228,7 +230,8 @@ def get_states(pbeams, gbeams, beam_map, nr_update):
state = StateClass.borrow(<StateC*>pbeam.at(i)) state = StateClass.borrow(<StateC*>pbeam.at(i))
if not state.is_final(): if not state.is_final():
key = tuple([eg_id] + pbeam.histories[i]) key = tuple([eg_id] + pbeam.histories[i])
assert key not in seen, (key, seen) if key in seen:
raise ValueError(Errors.E080.format(key=key))
seen[key] = len(states) seen[key] = len(states)
p_indices[-1].append(len(states)) p_indices[-1].append(len(states))
states.append(state) states.append(state)
@ -271,7 +274,8 @@ def get_gradient(nr_class, beam_maps, histories, losses):
for i in range(nr_step): for i in range(nr_step):
grads.append(numpy.zeros((max(beam_maps[i].values())+1, nr_class), grads.append(numpy.zeros((max(beam_maps[i].values())+1, nr_class),
dtype='f')) dtype='f'))
assert len(histories) == len(losses) if len(histories) != len(losses):
raise ValueError(Errors.E081.format(n_hist=len(histories), losses=len(losses)))
for eg_id, hists in enumerate(histories): for eg_id, hists in enumerate(histories):
for loss, hist in zip(losses[eg_id], hists): for loss, hist in zip(losses[eg_id], hists):
if loss == 0.0 or numpy.isnan(loss): if loss == 0.0 or numpy.isnan(loss):


@ -16,6 +16,7 @@ from . import nonproj
from .transition_system cimport move_cost_func_t, label_cost_func_t from .transition_system cimport move_cost_func_t, label_cost_func_t
from ..gold cimport GoldParse, GoldParseC from ..gold cimport GoldParse, GoldParseC
from ..structs cimport TokenC from ..structs cimport TokenC
from ..errors import Errors
# Calculate cost as gold/not gold. We don't use scalar value anyway. # Calculate cost as gold/not gold. We don't use scalar value anyway.
cdef int BINARY_COSTS = 1 cdef int BINARY_COSTS = 1
@ -484,7 +485,7 @@ cdef class ArcEager(TransitionSystem):
t.do = Break.transition t.do = Break.transition
t.get_cost = Break.cost t.get_cost = Break.cost
else: else:
raise Exception(move) raise ValueError(Errors.E019.format(action=move, src='arc_eager'))
return t return t
cdef int initialize_state(self, StateC* st) nogil: cdef int initialize_state(self, StateC* st) nogil:
@@ -556,35 +557,13 @@ cdef class ArcEager(TransitionSystem):
                 is_valid[i] = False
                 costs[i] = 9000
         if n_gold < 1:
-            # Check label set --- leading cause
-            label_set = set([self.strings[self.c[i].label] for i in range(self.n_moves)])
-            for label_str in gold.labels:
-                if label_str is not None and label_str not in label_set:
-                    raise ValueError("Cannot get gold parser action: unknown label: %s" % label_str)
-            # Check projectivity --- other leading cause
-            if nonproj.is_nonproj_tree(gold.heads):
-                raise ValueError(
-                    "Could not find a gold-standard action to supervise the "
-                    "dependency parser. Likely cause: the tree is "
-                    "non-projective (i.e. it has crossing arcs -- see "
-                    "spacy/syntax/nonproj.pyx for definitions). The ArcEager "
-                    "transition system only supports projective trees. To "
-                    "learn non-projective representations, transform the data "
-                    "before training and after parsing. Either pass "
-                    "make_projective=True to the GoldParse class, or use "
-                    "spacy.syntax.nonproj.preprocess_training_data.")
+            # Check projectivity --- leading cause
+            if is_nonproj_tree(gold.heads):
+                raise ValueError(Errors.E020)
             else:
-                print(gold.orig_annot)
-                print(gold.words)
-                print(gold.heads)
-                print(gold.labels)
-                print(gold.sent_starts)
-                raise ValueError(
-                    "Could not find a gold-standard action to supervise the"
-                    "dependency parser. The GoldParse was projective. The "
-                    "transition system has %d actions. State at failure: %s"
-                    % (self.n_moves, stcls.print_state(gold.words)))
-        assert n_gold >= 1
+                failure_state = stcls.print_state(gold.words)
+                raise ValueError(Errors.E021.format(n_actions=self.n_moves,
+                                                    state=failure_state))
def get_beam_annot(self, Beam beam): def get_beam_annot(self, Beam beam):
length = (<StateC*>beam.at(0)).length length = (<StateC*>beam.at(0)).length


@ -10,6 +10,7 @@ from ._state cimport StateC
from .transition_system cimport Transition from .transition_system cimport Transition
from .transition_system cimport do_func_t from .transition_system cimport do_func_t
from ..gold cimport GoldParseC, GoldParse from ..gold cimport GoldParseC, GoldParse
from ..errors import Errors
cdef enum: cdef enum:
@ -81,9 +82,7 @@ cdef class BiluoPushDown(TransitionSystem):
for (ids, words, tags, heads, labels, biluo), _ in sents: for (ids, words, tags, heads, labels, biluo), _ in sents:
for i, ner_tag in enumerate(biluo): for i, ner_tag in enumerate(biluo):
if ner_tag != 'O' and ner_tag != '-': if ner_tag != 'O' and ner_tag != '-':
if ner_tag.count('-') != 1: _, label = ner_tag.split('-', 1)
raise ValueError(ner_tag)
_, label = ner_tag.split('-')
for action in (BEGIN, IN, LAST, UNIT): for action in (BEGIN, IN, LAST, UNIT):
actions[action][label] += 1 actions[action][label] += 1
return actions return actions
@ -170,7 +169,7 @@ cdef class BiluoPushDown(TransitionSystem):
if self.c[i].move == move and self.c[i].label == label: if self.c[i].move == move and self.c[i].label == label:
return self.c[i] return self.c[i]
else: else:
raise KeyError(name) raise KeyError(Errors.E022.format(name=name))
cdef Transition init_transition(self, int clas, int move, attr_t label) except *: cdef Transition init_transition(self, int clas, int move, attr_t label) except *:
# TODO: Apparent Cython bug here when we try to use the Transition() # TODO: Apparent Cython bug here when we try to use the Transition()
@ -205,7 +204,7 @@ cdef class BiluoPushDown(TransitionSystem):
t.do = Out.transition t.do = Out.transition
t.get_cost = Out.cost t.get_cost = Out.cost
else: else:
raise Exception(move) raise ValueError(Errors.E019.format(action=move, src='ner'))
return t return t
def add_action(self, int action, label_name, freq=None): def add_action(self, int action, label_name, freq=None):
@ -227,7 +226,6 @@ cdef class BiluoPushDown(TransitionSystem):
self._size *= 2 self._size *= 2
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0])) self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id) self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
assert self.c[self.n_moves].label == label_id
self.n_moves += 1 self.n_moves += 1
if self.labels.get(action, []): if self.labels.get(action, []):
freq = min(0, min(self.labels[action].values())) freq = min(0, min(self.labels[action].values()))


@ -35,6 +35,7 @@ from .._ml import link_vectors_to_models, create_default_optimizer
from ..compat import json_dumps, copy_array from ..compat import json_dumps, copy_array
from ..tokens.doc cimport Doc from ..tokens.doc cimport Doc
from ..gold cimport GoldParse from ..gold cimport GoldParse
from ..errors import Errors, TempErrors
from .. import util from .. import util
from .stateclass cimport StateClass from .stateclass cimport StateClass
from ._state cimport StateC from ._state cimport StateC
@ -244,7 +245,7 @@ cdef class Parser:
def Model(cls, nr_class, **cfg): def Model(cls, nr_class, **cfg):
depth = util.env_opt('parser_hidden_depth', cfg.get('hidden_depth', 1)) depth = util.env_opt('parser_hidden_depth', cfg.get('hidden_depth', 1))
if depth != 1: if depth != 1:
raise ValueError("Currently parser depth is hard-coded to 1.") raise ValueError(TempErrors.T004.format(value=depth))
parser_maxout_pieces = util.env_opt('parser_maxout_pieces', parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
cfg.get('maxout_pieces', 2)) cfg.get('maxout_pieces', 2))
token_vector_width = util.env_opt('token_vector_width', token_vector_width = util.env_opt('token_vector_width',
@ -254,11 +255,12 @@ cdef class Parser:
hist_size = util.env_opt('history_feats', cfg.get('hist_size', 0)) hist_size = util.env_opt('history_feats', cfg.get('hist_size', 0))
hist_width = util.env_opt('history_width', cfg.get('hist_width', 0)) hist_width = util.env_opt('history_width', cfg.get('hist_width', 0))
if hist_size != 0: if hist_size != 0:
raise ValueError("Currently history size is hard-coded to 0") raise ValueError(TempErrors.T005.format(value=hist_size))
if hist_width != 0: if hist_width != 0:
raise ValueError("Currently history width is hard-coded to 0") raise ValueError(TempErrors.T006.format(value=hist_width))
pretrained_vectors = cfg.get('pretrained_vectors', None)
tok2vec = Tok2Vec(token_vector_width, embed_size, tok2vec = Tok2Vec(token_vector_width, embed_size,
pretrained_dims=cfg.get('pretrained_dims', 0)) pretrained_vectors=pretrained_vectors)
tok2vec = chain(tok2vec, flatten) tok2vec = chain(tok2vec, flatten)
lower = PrecomputableAffine(hidden_width, lower = PrecomputableAffine(hidden_width,
nF=cls.nr_feature, nI=token_vector_width, nF=cls.nr_feature, nI=token_vector_width,
@ -277,6 +279,7 @@ cdef class Parser:
'token_vector_width': token_vector_width, 'token_vector_width': token_vector_width,
'hidden_width': hidden_width, 'hidden_width': hidden_width,
'maxout_pieces': parser_maxout_pieces, 'maxout_pieces': parser_maxout_pieces,
'pretrained_vectors': pretrained_vectors,
'hist_size': hist_size, 'hist_size': hist_size,
'hist_width': hist_width 'hist_width': hist_width
} }
@ -296,9 +299,9 @@ cdef class Parser:
         unless True (default), in which case a new instance is created with
         `Parser.Moves()`.
     model (object): Defines how the parse-state is created, updated and
-        evaluated. The value is set to the .model attribute unless True
-        (default), in which case a new instance is created with
-        `Parser.Model()`.
+        evaluated. The value is set to the .model attribute. If set to True
+        (default), a new instance will be created with `Parser.Model()`
+        in parser.begin_training(), parser.from_disk() or parser.from_bytes().
     **cfg: Arbitrary configuration parameters. Set to the `.cfg` attribute
""" """
self.vocab = vocab self.vocab = vocab
@ -310,8 +313,7 @@ cdef class Parser:
cfg['beam_width'] = util.env_opt('beam_width', 1) cfg['beam_width'] = util.env_opt('beam_width', 1)
if 'beam_density' not in cfg: if 'beam_density' not in cfg:
cfg['beam_density'] = util.env_opt('beam_density', 0.0) cfg['beam_density'] = util.env_opt('beam_density', 0.0)
if 'pretrained_dims' not in cfg: cfg.setdefault('cnn_maxout_pieces', 3)
cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
self.cfg = cfg self.cfg = cfg
self.model = model self.model = model
self._multitasks = [] self._multitasks = []
@ -551,8 +553,13 @@ cdef class Parser:
def update(self, docs, golds, drop=0., sgd=None, losses=None): def update(self, docs, golds, drop=0., sgd=None, losses=None):
if not any(self.moves.has_gold(gold) for gold in golds): if not any(self.moves.has_gold(gold) for gold in golds):
return None return None
assert len(docs) == len(golds) if len(docs) != len(golds):
if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() >= 0.0: raise ValueError(Errors.E077.format(value='update', n_docs=len(docs),
n_golds=len(golds)))
# The probability we use beam update, instead of falling back to
# a greedy update
beam_update_prob = 1-self.cfg.get('beam_update_prob', 0.5)
if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() >= beam_update_prob:
return self.update_beam(docs, golds, return self.update_beam(docs, golds,
self.cfg['beam_width'], self.cfg['beam_density'], self.cfg['beam_width'], self.cfg['beam_density'],
drop=drop, sgd=sgd, losses=losses) drop=drop, sgd=sgd, losses=losses)
@ -620,7 +627,7 @@ cdef class Parser:
break break
self._make_updates(d_tokvecs, self._make_updates(d_tokvecs,
bp_tokvecs, backprops, sgd, cuda_stream) bp_tokvecs, backprops, sgd, cuda_stream)
def update_beam(self, docs, golds, width=None, density=None, def update_beam(self, docs, golds, width=None, density=None,
drop=0., sgd=None, losses=None): drop=0., sgd=None, losses=None):
if not any(self.moves.has_gold(gold) for gold in golds): if not any(self.moves.has_gold(gold) for gold in golds):
@ -634,7 +641,6 @@ cdef class Parser:
if losses is not None and self.name not in losses: if losses is not None and self.name not in losses:
losses[self.name] = 0. losses[self.name] = 0.
lengths = [len(d) for d in docs] lengths = [len(d) for d in docs]
assert min(lengths) >= 1
states = self.moves.init_batch(docs) states = self.moves.init_batch(docs)
for gold in golds: for gold in golds:
self.moves.preprocess_gold(gold) self.moves.preprocess_gold(gold)
@ -846,7 +852,6 @@ cdef class Parser:
self.moves.initialize_actions(actions) self.moves.initialize_actions(actions)
cfg.setdefault('token_vector_width', 128) cfg.setdefault('token_vector_width', 128)
if self.model is True: if self.model is True:
cfg['pretrained_dims'] = self.vocab.vectors_length
self.model, cfg = self.Model(self.moves.n_moves, **cfg) self.model, cfg = self.Model(self.moves.n_moves, **cfg)
if sgd is None: if sgd is None:
sgd = self.create_optimizer() sgd = self.create_optimizer()
@ -910,9 +915,11 @@ cdef class Parser:
} }
util.from_disk(path, deserializers, exclude) util.from_disk(path, deserializers, exclude)
if 'model' not in exclude: if 'model' not in exclude:
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
path = util.ensure_path(path) path = util.ensure_path(path)
if self.model is True: if self.model is True:
self.cfg.setdefault('pretrained_dims', self.vocab.vectors_length)
self.model, cfg = self.Model(**self.cfg) self.model, cfg = self.Model(**self.cfg)
else: else:
cfg = {} cfg = {}
@ -955,12 +962,13 @@ cdef class Parser:
)) ))
msg = util.from_bytes(bytes_data, deserializers, exclude) msg = util.from_bytes(bytes_data, deserializers, exclude)
if 'model' not in exclude: if 'model' not in exclude:
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
if self.model is True: if self.model is True:
self.model, cfg = self.Model(**self.cfg) self.model, cfg = self.Model(**self.cfg)
cfg['pretrained_dims'] = self.vocab.vectors_length
else: else:
cfg = {} cfg = {}
cfg['pretrained_dims'] = self.vocab.vectors_length
if 'tok2vec_model' in msg: if 'tok2vec_model' in msg:
self.model[0].from_bytes(msg['tok2vec_model']) self.model[0].from_bytes(msg['tok2vec_model'])
if 'lower_model' in msg: if 'lower_model' in msg:
@ -1033,15 +1041,11 @@ def _cleanup(Beam beam):
del state del state
seen.add(addr) seen.add(addr)
else: else:
print(i, addr) raise ValueError(Errors.E023.format(addr=addr, i=i))
print(seen)
raise Exception
addr = <size_t>beam._states[i].content addr = <size_t>beam._states[i].content
if addr not in seen: if addr not in seen:
state = <StateC*>addr state = <StateC*>addr
del state del state
seen.add(addr) seen.add(addr)
else: else:
print(i, addr) raise ValueError(Errors.E023.format(addr=addr, i=i))
print(seen)
raise Exception


@ -10,6 +10,7 @@ from __future__ import unicode_literals
from copy import copy from copy import copy
from ..tokens.doc cimport Doc, set_children_from_heads from ..tokens.doc cimport Doc, set_children_from_heads
from ..errors import Errors
DELIMITER = '||' DELIMITER = '||'
@ -146,7 +147,10 @@ cpdef deprojectivize(Doc doc):
def _decorate(heads, proj_heads, labels): def _decorate(heads, proj_heads, labels):
# uses decoration scheme HEAD from Nivre & Nilsson 2005 # uses decoration scheme HEAD from Nivre & Nilsson 2005
assert(len(heads) == len(proj_heads) == len(labels)) if (len(heads) != len(proj_heads)) or (len(proj_heads) != len(labels)):
raise ValueError(Errors.E082.format(n_heads=len(heads),
n_proj_heads=len(proj_heads),
n_labels=len(labels)))
deco_labels = [] deco_labels = []
for tokenid, head in enumerate(heads): for tokenid, head in enumerate(heads):
if head != proj_heads[tokenid]: if head != proj_heads[tokenid]:


@ -12,6 +12,7 @@ from ..structs cimport TokenC
from .stateclass cimport StateClass from .stateclass cimport StateClass
from ..typedefs cimport attr_t from ..typedefs cimport attr_t
from ..compat import json_dumps from ..compat import json_dumps
from ..errors import Errors
from .. import util from .. import util
@ -73,10 +74,7 @@ cdef class TransitionSystem:
action.do(state.c, action.label) action.do(state.c, action.label)
break break
else: else:
print(gold.words) raise ValueError(Errors.E024)
print(gold.ner)
print(history)
raise ValueError("Could not find gold move")
return history return history
cdef int initialize_state(self, StateC* state) nogil: cdef int initialize_state(self, StateC* state) nogil:
@ -123,17 +121,7 @@ cdef class TransitionSystem:
else: else:
costs[i] = 9000 costs[i] = 9000
if n_gold <= 0: if n_gold <= 0:
print(gold.words) raise ValueError(Errors.E024)
print(gold.ner)
print([gold.c.ner[i].clas for i in range(gold.length)])
print([gold.c.ner[i].move for i in range(gold.length)])
print([gold.c.ner[i].label for i in range(gold.length)])
print("Self labels",
[self.c[i].label for i in range(self.n_moves)])
raise ValueError(
"Could not find a gold-standard action to supervise "
"the entity recognizer. The transition system has "
"%d actions." % (self.n_moves))
def get_class_name(self, int clas): def get_class_name(self, int clas):
act = self.c[clas] act = self.c[clas]
@ -171,7 +159,6 @@ cdef class TransitionSystem:
self._size *= 2 self._size *= 2
self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0])) self.c = <Transition*>self.mem.realloc(self.c, self._size * sizeof(self.c[0]))
self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id) self.c[self.n_moves] = self.init_transition(self.n_moves, action, label_id)
assert self.c[self.n_moves].label == label_id
self.n_moves += 1 self.n_moves += 1
if self.labels.get(action, []): if self.labels.get(action, []):
new_freq = min(self.labels[action].values()) new_freq = min(self.labels[action].values())


@ -19,7 +19,9 @@ _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
_models = {'en': ['en_core_web_sm'], _models = {'en': ['en_core_web_sm'],
'de': ['de_core_news_md'], 'de': ['de_core_news_md'],
'fr': ['fr_core_news_sm'], 'fr': ['fr_core_news_sm'],
'xx': ['xx_ent_web_md']} 'xx': ['xx_ent_web_md'],
'en_core_web_md': ['en_core_web_md'],
'es_core_news_md': ['es_core_news_md']}
# only used for tests that require loading the models # only used for tests that require loading the models
@ -183,6 +185,9 @@ def pytest_addoption(parser):
for lang in _languages + ['all']: for lang in _languages + ['all']:
parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang) parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
for model in _models:
if model not in _languages:
parser.addoption("--%s" % model, action="store_true", help="Use %s model" % model)
def pytest_runtest_setup(item): def pytest_runtest_setup(item):

View File

@ -0,0 +1,13 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('string,lemma', [('affaldsgruppernes', 'affaldsgruppe'),
('detailhandelsstrukturernes', 'detailhandelsstruktur'),
('kolesterols', 'kolesterol'),
('åsyns', 'åsyn')])
def test_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
tokens = da_tokenizer(string)
assert tokens[0].lemma_ == lemma

View File

@ -0,0 +1,12 @@
from __future__ import unicode_literals
import pytest
from ...util import load_model
@pytest.mark.models("en_core_web_md")
@pytest.mark.models("es_core_news_md")
def test_models_with_different_vectors():
nlp = load_model('en_core_web_md')
doc = nlp(u'hello world')
nlp2 = load_model('es_core_news_md')
doc2 = nlp2(u'hola')
doc = nlp(u'hello world')

View File

@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from ...pipeline import EntityRecognizer
from ...vocab import Vocab
@pytest.mark.parametrize('label', ['U-JOB-NAME'])
def test_issue1967(label):
ner = EntityRecognizer(Vocab())
entry = ([0], ['word'], ['tag'], [0], ['dep'], [label])
gold_parses = [(None, [(entry, None)])]
ner.moves.get_actions(gold_parses=gold_parses)

View File

@ -17,6 +17,7 @@ def meta_data():
'email': 'email-in-fixture', 'email': 'email-in-fixture',
'url': 'url-in-fixture', 'url': 'url-in-fixture',
'license': 'license-in-fixture', 'license': 'license-in-fixture',
'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None}
} }

View File

@ -10,8 +10,8 @@ from ..gold import GoldParse
def test_textcat_learns_multilabel(): def test_textcat_learns_multilabel():
random.seed(0) random.seed(1)
numpy.random.seed(0) numpy.random.seed(1)
docs = [] docs = []
nlp = English() nlp = English()
vocab = nlp.vocab vocab = nlp.vocab

View File

@ -1,4 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from mock import Mock from mock import Mock
from ..vocab import Vocab
from ..tokens import Doc, Span, Token
from ..tokens.underscore import Underscore from ..tokens.underscore import Underscore
@ -51,3 +58,42 @@ def test_token_underscore_method():
None, None) None, None)
token._ = Underscore(Underscore.token_extensions, token, start=token.idx) token._ = Underscore(Underscore.token_extensions, token, start=token.idx)
assert token._.hello() == 'cheese' assert token._.hello() == 'cheese'
@pytest.mark.parametrize('obj', [Doc, Span, Token])
def test_doc_underscore_remove_extension(obj):
ext_name = 'to_be_removed'
obj.set_extension(ext_name, default=False)
assert obj.has_extension(ext_name)
obj.remove_extension(ext_name)
assert not obj.has_extension(ext_name)
@pytest.mark.parametrize('obj', [Doc, Span, Token])
def test_underscore_raises_for_dup(obj):
obj.set_extension('test', default=None)
with pytest.raises(ValueError):
obj.set_extension('test', default=None)
@pytest.mark.parametrize('invalid_kwargs', [
{'getter': None, 'setter': lambda: None},
{'default': None, 'method': lambda: None, 'getter': lambda: None},
{'setter': lambda: None},
{'default': None, 'method': lambda: None},
{'getter': True}])
def test_underscore_raises_for_invalid(invalid_kwargs):
invalid_kwargs['force'] = True
with pytest.raises(ValueError):
Doc.set_extension('test', **invalid_kwargs)
@pytest.mark.parametrize('valid_kwargs', [
{'getter': lambda: None},
{'getter': lambda: None, 'setter': lambda: None},
{'default': 'hello'},
{'default': None},
{'method': lambda: None}])
def test_underscore_accepts_valid(valid_kwargs):
valid_kwargs['force'] = True
Doc.set_extension('test', **valid_kwargs)

View File

@ -28,12 +28,38 @@ def vectors():
def data(): def data():
return numpy.asarray([[0.0, 1.0, 2.0], [3.0, -2.0, 4.0]], dtype='f') return numpy.asarray([[0.0, 1.0, 2.0], [3.0, -2.0, 4.0]], dtype='f')
@pytest.fixture
def resize_data():
return numpy.asarray([[0.0, 1.0], [2.0, 3.0]], dtype='f')
@pytest.fixture() @pytest.fixture()
def vocab(en_vocab, vectors): def vocab(en_vocab, vectors):
add_vecs_to_vocab(en_vocab, vectors) add_vecs_to_vocab(en_vocab, vectors)
return en_vocab return en_vocab
def test_init_vectors_with_resize_shape(strings, resize_data):
v = Vectors(shape=(len(strings), 3))
v.resize(shape=resize_data.shape)
assert v.shape == resize_data.shape
assert v.shape != (len(strings), 3)
def test_init_vectors_with_resize_data(data, resize_data):
v = Vectors(data=data)
v.resize(shape=resize_data.shape)
assert v.shape == resize_data.shape
assert v.shape != data.shape
def test_get_vector_resize(strings, data, resize_data):
v = Vectors(data=data)
v.resize(shape=resize_data.shape)
strings = [hash_string(s) for s in strings]
for i, string in enumerate(strings):
v.add(string, row=i)
assert list(v[strings[0]]) == list(resize_data[0])
assert list(v[strings[0]]) != list(resize_data[1])
assert list(v[strings[1]]) != list(resize_data[0])
assert list(v[strings[1]]) == list(resize_data[1])
def test_init_vectors_with_data(strings, data): def test_init_vectors_with_data(strings, data):
v = Vectors(data=data) v = Vectors(data=data)

View File

@ -13,6 +13,7 @@ cimport cython
from .tokens.doc cimport Doc from .tokens.doc cimport Doc
from .strings cimport hash_string from .strings cimport hash_string
from .errors import Errors, Warnings, deprecation_warning
from . import util from . import util
@ -63,11 +64,7 @@ cdef class Tokenizer:
return (self.__class__, args, None, None) return (self.__class__, args, None, None)
cpdef Doc tokens_from_list(self, list strings): cpdef Doc tokens_from_list(self, list strings):
util.deprecated( deprecation_warning(Warnings.W002)
"Tokenizer.from_list is now deprecated. Create a new Doc "
"object instead and pass in the strings as the `words` keyword "
"argument, for example:\nfrom spacy.tokens import Doc\n"
"doc = Doc(nlp.vocab, words=[...])")
return Doc(self.vocab, words=strings) return Doc(self.vocab, words=strings)
@cython.boundscheck(False) @cython.boundscheck(False)
@ -78,8 +75,7 @@ cdef class Tokenizer:
RETURNS (Doc): A container for linguistic annotations. RETURNS (Doc): A container for linguistic annotations.
""" """
if len(string) >= (2 ** 30): if len(string) >= (2 ** 30):
msg = "String is too long: %d characters. Max is 2**30." raise ValueError(Errors.E025.format(length=len(string)))
raise ValueError(msg % len(string))
cdef int length = len(string) cdef int length = len(string)
cdef Doc doc = Doc(self.vocab) cdef Doc doc = Doc(self.vocab)
if length == 0: if length == 0:

View File

@ -0,0 +1,129 @@
# coding: utf8
# cython: infer_types=True
# cython: bounds_check=False
# cython: profile=True
from __future__ import unicode_literals
from libc.string cimport memcpy, memset
from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end
from .span cimport Span
from .token cimport Token
from ..lexeme cimport Lexeme, EMPTY_LEXEME
from ..structs cimport LexemeC, TokenC
from ..attrs cimport *
cdef class Retokenizer:
'''Helper class for doc.retokenize() context manager.'''
cdef Doc doc
cdef list merges
cdef list splits
def __init__(self, doc):
self.doc = doc
self.merges = []
self.splits = []
def merge(self, Span span, attrs=None):
'''Mark a span for merging. The attrs will be applied to the resulting
token.'''
self.merges.append((span.start_char, span.end_char, attrs))
def split(self, Token token, orths, attrs=None):
'''Mark a Token for splitting, into the specified orths. The attrs
will be applied to each subtoken.'''
self.splits.append((token.start_char, orths, attrs))
def __enter__(self):
self.merges = []
self.splits = []
return self
def __exit__(self, *args):
# Do the actual merging here
for start_char, end_char, attrs in self.merges:
start = token_by_start(self.doc.c, self.doc.length, start_char)
end = token_by_end(self.doc.c, self.doc.length, end_char)
_merge(self.doc, start, end+1, attrs)
for start_char, orths, attrs in self.splits:
raise NotImplementedError
def _merge(Doc doc, int start, int end, attributes):
"""Retokenize the document, such that the span at
`doc.text[start_idx : end_idx]` is merged into a single token. If
`start_idx` and `end_idx` do not mark start and end token boundaries,
the document remains unchanged.
start_idx (int): Character index of the start of the slice to merge.
end_idx (int): Character index after the end of the slice to merge.
**attributes: Attributes to assign to the merged token. By default,
attributes are inherited from the syntactic root of the span.
RETURNS (Token): The newly merged token, or `None` if the start and end
indices did not fall at token boundaries.
"""
cdef Span span = doc[start:end]
cdef int start_char = span.start_char
cdef int end_char = span.end_char
# Get LexemeC for newly merged token
new_orth = ''.join([t.text_with_ws for t in span])
if span[-1].whitespace_:
new_orth = new_orth[:-len(span[-1].whitespace_)]
cdef const LexemeC* lex = doc.vocab.get(doc.mem, new_orth)
# House the new merged token where it starts
cdef TokenC* token = &doc.c[start]
token.spacy = doc.c[end-1].spacy
for attr_name, attr_value in attributes.items():
if attr_name == TAG:
doc.vocab.morphology.assign_tag(token, attr_value)
else:
Token.set_struct_attr(token, attr_name, attr_value)
# Make sure ent_iob remains consistent
if doc.c[end].ent_iob == 1 and token.ent_iob in (0, 2):
if token.ent_type == doc.c[end].ent_type:
token.ent_iob = 3
else:
# If they're not the same entity type, let them be two entities
doc.c[end].ent_iob = 3
# Begin by setting all the head indices to absolute token positions
# This is easier to work with for now than the offsets
# Before thinking of something simpler, beware the case where a
# dependency bridges over the entity. Here the alignment of the
# tokens changes.
span_root = span.root.i
token.dep = span.root.dep
# We update token.lex after keeping span root and dep, since
# setting token.lex will change span.start and span.end properties
# as it modifies the character offsets in the doc
token.lex = lex
for i in range(doc.length):
doc.c[i].head += i
# Set the head of the merged token, and its dep relation, from the Span
token.head = doc.c[span_root].head
# Adjust deps before shrinking tokens
# Tokens which point into the merged token should now point to it
# Subtract the offset from all tokens which point to >= end
offset = (end - start) - 1
for i in range(doc.length):
head_idx = doc.c[i].head
if start <= head_idx < end:
doc.c[i].head = start
elif head_idx >= end:
doc.c[i].head -= offset
# Now compress the token array
for i in range(end, doc.length):
doc.c[i - offset] = doc.c[i]
for i in range(doc.length - offset, doc.length):
memset(&doc.c[i], 0, sizeof(TokenC))
doc.c[i].lex = &EMPTY_LEXEME
doc.length -= offset
for i in range(doc.length):
# ...And, set heads back to a relative position
doc.c[i].head -= i
# Set the left/right children, left/right edges
set_children_from_heads(doc.c, doc.length)
# Clear the cached Python objects
# Return the merged Python object
return doc[start]

View File

@ -28,6 +28,8 @@ cdef int token_by_start(const TokenC* tokens, int length, int start_char) except
cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2 cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2
cdef int set_children_from_heads(TokenC* tokens, int length) except -1
cdef class Doc: cdef class Doc:
cdef readonly Pool mem cdef readonly Pool mem
cdef readonly Vocab vocab cdef readonly Vocab vocab

View File

@ -31,18 +31,19 @@ from ..attrs cimport ENT_TYPE, SENT_START
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
from ..util import normalize_slice from ..util import normalize_slice
from ..compat import is_config, copy_reg, pickle, basestring_ from ..compat import is_config, copy_reg, pickle, basestring_
from .. import about from ..errors import Errors, Warnings, deprecation_warning
from .. import util from .. import util
from .underscore import Underscore from .underscore import Underscore, get_ext_args
from ._retokenize import Retokenizer
DEF PADDING = 5 DEF PADDING = 5
cdef int bounds_check(int i, int length, int padding) except -1: cdef int bounds_check(int i, int length, int padding) except -1:
if (i + padding) < 0: if (i + padding) < 0:
raise IndexError raise IndexError(Errors.E026.format(i=i, length=length))
if (i - padding) >= length: if (i - padding) >= length:
raise IndexError raise IndexError(Errors.E026.format(i=i, length=length))
cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil: cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
@ -94,11 +95,10 @@ cdef class Doc:
spaces=[True, False, False]) spaces=[True, False, False])
""" """
@classmethod @classmethod
def set_extension(cls, name, default=None, method=None, def set_extension(cls, name, **kwargs):
getter=None, setter=None): if cls.has_extension(name) and not kwargs.get('force', False):
nr_defined = sum(t is not None for t in (default, getter, setter, method)) raise ValueError(Errors.E090.format(name=name, obj='Doc'))
assert nr_defined == 1 Underscore.doc_extensions[name] = get_ext_args(**kwargs)
Underscore.doc_extensions[name] = (default, method, getter, setter)
@classmethod @classmethod
def get_extension(cls, name): def get_extension(cls, name):
@ -108,6 +108,12 @@ cdef class Doc:
def has_extension(cls, name): def has_extension(cls, name):
return name in Underscore.doc_extensions return name in Underscore.doc_extensions
@classmethod
def remove_extension(cls, name):
if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name))
return Underscore.doc_extensions.pop(name)
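A short usage sketch for the extension changes above, assuming a spaCy v2.x install that includes this commit (`spacy.blank` and the `._` proxy are assumed to be available as in released v2 builds):

```python
import spacy
from spacy.tokens import Doc

# Register an extension with a default value and read it via the ._ proxy.
Doc.set_extension('is_greeting', default=False)
nlp = spacy.blank('en')
doc = nlp(u'hello world')
assert doc._.is_greeting is False

# Re-registering an existing name now raises ValueError unless force=True.
Doc.set_extension('is_greeting', default=True, force=True)

# remove_extension() unregisters the name and returns the stored
# (default, method, getter, setter) tuple.
Doc.remove_extension('is_greeting')
assert not Doc.has_extension('is_greeting')
```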
def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None, def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None,
orths_and_spaces=None): orths_and_spaces=None):
"""Create a Doc object. """Create a Doc object.
@ -154,11 +160,7 @@ cdef class Doc:
if spaces is None: if spaces is None:
spaces = [True] * len(words) spaces = [True] * len(words)
elif len(spaces) != len(words): elif len(spaces) != len(words):
raise ValueError( raise ValueError(Errors.E027)
"Arguments 'words' and 'spaces' should be sequences of "
"the same length, or 'spaces' should be left default at "
"None. spaces should be a sequence of booleans, with True "
"meaning that the word owns a ' ' character following it.")
orths_and_spaces = zip(words, spaces) orths_and_spaces = zip(words, spaces)
if orths_and_spaces is not None: if orths_and_spaces is not None:
for orth_space in orths_and_spaces: for orth_space in orths_and_spaces:
@ -166,10 +168,7 @@ cdef class Doc:
orth = orth_space orth = orth_space
has_space = True has_space = True
elif isinstance(orth_space, bytes): elif isinstance(orth_space, bytes):
raise ValueError( raise ValueError(Errors.E028.format(value=orth_space))
"orths_and_spaces expects either List(unicode) or "
"List((unicode, bool)). "
"Got bytes instance: %s" % (str(orth_space)))
else: else:
orth, has_space = orth_space orth, has_space = orth_space
# Note that we pass self.mem here --- we have ownership, if LexemeC # Note that we pass self.mem here --- we have ownership, if LexemeC
@ -319,7 +318,7 @@ cdef class Doc:
break break
else: else:
return 1.0 return 1.0
if self.vector_norm == 0 or other.vector_norm == 0: if self.vector_norm == 0 or other.vector_norm == 0:
return 0.0 return 0.0
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm) return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
@ -437,10 +436,7 @@ cdef class Doc:
if token.ent_iob == 1: if token.ent_iob == 1:
if start == -1: if start == -1:
seq = ['%s|%s' % (t.text, t.ent_iob_) for t in self[i-5:i+5]] seq = ['%s|%s' % (t.text, t.ent_iob_) for t in self[i-5:i+5]]
raise ValueError( raise ValueError(Errors.E093.format(seq=' '.join(seq)))
"token.ent_iob values make invalid sequence: "
"I without B\n"
"{seq}".format(seq=' '.join(seq)))
elif token.ent_iob == 2 or token.ent_iob == 0: elif token.ent_iob == 2 or token.ent_iob == 0:
if start != -1: if start != -1:
output.append(Span(self, start, i, label=label)) output.append(Span(self, start, i, label=label))
@ -503,19 +499,16 @@ cdef class Doc:
""" """
def __get__(self): def __get__(self):
if not self.is_parsed: if not self.is_parsed:
raise ValueError( raise ValueError(Errors.E029)
"noun_chunks requires the dependency parse, which "
"requires a statistical model to be installed and loaded. "
"For more info, see the "
"documentation: \n%s\n" % about.__docs_models__)
# Accumulate the result before beginning to iterate over it. This # Accumulate the result before beginning to iterate over it. This
# prevents the tokenisation from being changed out from under us # prevents the tokenisation from being changed out from under us
# during the iteration. The tricky thing here is that Span accepts # during the iteration. The tricky thing here is that Span accepts
# its tokenisation changing, so it's okay once we have the Span # its tokenisation changing, so it's okay once we have the Span
# objects. See Issue #375. # objects. See Issue #375.
spans = [] spans = []
for start, end, label in self.noun_chunks_iterator(self): if self.noun_chunks_iterator is not None:
spans.append(Span(self, start, end, label=label)) for start, end, label in self.noun_chunks_iterator(self):
spans.append(Span(self, start, end, label=label))
for span in spans: for span in spans:
yield span yield span
@ -532,12 +525,7 @@ cdef class Doc:
""" """
def __get__(self): def __get__(self):
if not self.is_sentenced: if not self.is_sentenced:
raise ValueError( raise ValueError(Errors.E030)
"Sentence boundaries unset. You can add the 'sentencizer' "
"component to the pipeline with: "
"nlp.add_pipe(nlp.create_pipe('sentencizer')) "
"Alternatively, add the dependency parser, or set "
"sentence boundaries by setting doc[i].sent_start")
if 'sents' in self.user_hooks: if 'sents' in self.user_hooks:
yield from self.user_hooks['sents'](self) yield from self.user_hooks['sents'](self)
else: else:
@ -567,7 +555,8 @@ cdef class Doc:
t.idx = (t-1).idx + (t-1).lex.length + (t-1).spacy t.idx = (t-1).idx + (t-1).lex.length + (t-1).spacy
t.l_edge = self.length t.l_edge = self.length
t.r_edge = self.length t.r_edge = self.length
assert t.lex.orth != 0 if t.lex.orth == 0:
raise ValueError(Errors.E031.format(i=self.length))
t.spacy = has_space t.spacy = has_space
self.length += 1 self.length += 1
return t.idx + t.lex.length + t.spacy return t.idx + t.lex.length + t.spacy
@ -683,13 +672,7 @@ cdef class Doc:
def from_array(self, attrs, array): def from_array(self, attrs, array):
if SENT_START in attrs and HEAD in attrs: if SENT_START in attrs and HEAD in attrs:
raise ValueError( raise ValueError(Errors.E032)
"Conflicting attributes specified in doc.from_array(): "
"(HEAD, SENT_START)\n"
"The HEAD attribute currently sets sentence boundaries "
"implicitly, based on the tree structure. This means the HEAD "
"attribute would potentially override the sentence boundaries "
"set by SENT_START.")
cdef int i, col cdef int i, col
cdef attr_id_t attr_id cdef attr_id_t attr_id
cdef TokenC* tokens = self.c cdef TokenC* tokens = self.c
@ -827,7 +810,7 @@ cdef class Doc:
RETURNS (Doc): Itself. RETURNS (Doc): Itself.
""" """
if self.length != 0: if self.length != 0:
raise ValueError("Cannot load into non-empty Doc") raise ValueError(Errors.E033.format(length=self.length))
deserializers = { deserializers = {
'text': lambda b: None, 'text': lambda b: None,
'array_head': lambda b: None, 'array_head': lambda b: None,
@ -878,7 +861,7 @@ cdef class Doc:
computed by the models in the pipeline. Let's say a computed by the models in the pipeline. Let's say a
document with 30 words has a tensor with 128 dimensions document with 30 words has a tensor with 128 dimensions
per word. doc.tensor.shape will be (30, 128). After per word. doc.tensor.shape will be (30, 128). After
calling doc.extend_tensor with an array of hape (30, 64), calling doc.extend_tensor with an array of shape (30, 64),
doc.tensor == (30, 192). doc.tensor == (30, 192).
''' '''
xp = get_array_module(self.tensor) xp = get_array_module(self.tensor)
@ -888,6 +871,18 @@ cdef class Doc:
else: else:
self.tensor = xp.hstack((self.tensor, tensor)) self.tensor = xp.hstack((self.tensor, tensor))
def retokenize(self):
'''Context manager to handle retokenization of the Doc.
Modifications to the Doc's tokenization are stored, and then
made all at once when the context manager exits. This is
much more efficient, and less error-prone.
All views of the Doc (Span and Token) created before the
retokenization are invalidated, although they may accidentally
continue to work.
'''
return Retokenizer(self)
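A usage sketch for the new `retokenize()` context manager, assuming a spaCy v2.x build that includes this change. Attribute overrides are passed via the `attrs` dict; it is left empty here, so the merged token keeps the attributes of the span root.

```python
import spacy

nlp = spacy.blank('en')
doc = nlp(u'New York is big')
assert len(doc) == 4

with doc.retokenize() as retokenizer:
    # Mark doc[0:2] ("New York") for merging; the merge is applied on exit.
    retokenizer.merge(doc[0:2], attrs={})

assert len(doc) == 3
assert doc[0].text == u'New York'
```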
def merge(self, int start_idx, int end_idx, *args, **attributes): def merge(self, int start_idx, int end_idx, *args, **attributes):
"""Retokenize the document, such that the span at """Retokenize the document, such that the span at
`doc.text[start_idx : end_idx]` is merged into a single token. If `doc.text[start_idx : end_idx]` is merged into a single token. If
@ -903,10 +898,7 @@ cdef class Doc:
""" """
cdef unicode tag, lemma, ent_type cdef unicode tag, lemma, ent_type
if len(args) == 3: if len(args) == 3:
util.deprecated( deprecation_warning(Warnings.W003)
"Positional arguments to Doc.merge are deprecated. Instead, "
"use the keyword arguments, for example tag=, lemma= or "
"ent_type=.")
tag, lemma, ent_type = args tag, lemma, ent_type = args
attributes[TAG] = tag attributes[TAG] = tag
attributes[LEMMA] = lemma attributes[LEMMA] = lemma
@ -920,13 +912,9 @@ cdef class Doc:
if 'ent_type' in attributes: if 'ent_type' in attributes:
attributes[ENT_TYPE] = attributes['ent_type'] attributes[ENT_TYPE] = attributes['ent_type']
elif args: elif args:
raise ValueError( raise ValueError(Errors.E034.format(n_args=len(args),
"Doc.merge received %d non-keyword arguments. Expected either " args=repr(args),
"3 arguments (deprecated), or 0 (use keyword arguments). " kwargs=repr(attributes)))
"Arguments supplied:\n%s\n"
"Keyword arguments: %s\n" % (len(args), repr(args),
repr(attributes)))
# More deprecated attribute handling =/ # More deprecated attribute handling =/
if 'label' in attributes: if 'label' in attributes:
attributes['ent_type'] = attributes.pop('label') attributes['ent_type'] = attributes.pop('label')
@ -941,66 +929,8 @@ cdef class Doc:
return None return None
# Currently we have the token index, we want the range-end index # Currently we have the token index, we want the range-end index
end += 1 end += 1
cdef Span span = self[start:end] with self.retokenize() as retokenizer:
# Get LexemeC for newly merged token retokenizer.merge(self[start:end], attrs=attributes)
new_orth = ''.join([t.text_with_ws for t in span])
if span[-1].whitespace_:
new_orth = new_orth[:-len(span[-1].whitespace_)]
cdef const LexemeC* lex = self.vocab.get(self.mem, new_orth)
# House the new merged token where it starts
cdef TokenC* token = &self.c[start]
token.spacy = self.c[end-1].spacy
for attr_name, attr_value in attributes.items():
if attr_name == TAG:
self.vocab.morphology.assign_tag(token, attr_value)
else:
Token.set_struct_attr(token, attr_name, attr_value)
# Make sure ent_iob remains consistent
if self.c[end].ent_iob == 1 and token.ent_iob in (0, 2):
if token.ent_type == self.c[end].ent_type:
token.ent_iob = 3
else:
# If they're not the same entity type, let them be two entities
self.c[end].ent_iob = 3
# Begin by setting all the head indices to absolute token positions
# This is easier to work with for now than the offsets
# Before thinking of something simpler, beware the case where a
# dependency bridges over the entity. Here the alignment of the
# tokens changes.
span_root = span.root.i
token.dep = span.root.dep
# We update token.lex after keeping span root and dep, since
# setting token.lex will change span.start and span.end properties
# as it modifies the character offsets in the doc
token.lex = lex
for i in range(self.length):
self.c[i].head += i
# Set the head of the merged token, and its dep relation, from the Span
token.head = self.c[span_root].head
# Adjust deps before shrinking tokens
# Tokens which point into the merged token should now point to it
# Subtract the offset from all tokens which point to >= end
offset = (end - start) - 1
for i in range(self.length):
head_idx = self.c[i].head
if start <= head_idx < end:
self.c[i].head = start
elif head_idx >= end:
self.c[i].head -= offset
# Now compress the token array
for i in range(end, self.length):
self.c[i - offset] = self.c[i]
for i in range(self.length - offset, self.length):
memset(&self.c[i], 0, sizeof(TokenC))
self.c[i].lex = &EMPTY_LEXEME
self.length -= offset
for i in range(self.length):
# ...And, set heads back to a relative position
self.c[i].head -= i
# Set the left/right children, left/right edges
set_children_from_heads(self.c, self.length)
# Clear the cached Python objects
# Return the merged Python object
return self[start] return self[start]
def print_tree(self, light=False, flat=False): def print_tree(self, light=False, flat=False):

View File

@ -8,7 +8,7 @@ from ..symbols import HEAD, TAG, DEP, ENT_IOB, ENT_TYPE
def merge_ents(doc): def merge_ents(doc):
"""Helper: merge adjacent entities into single tokens; modifies the doc.""" """Helper: merge adjacent entities into single tokens; modifies the doc."""
for ent in doc.ents: for ent in doc.ents:
ent.merge(ent.root.tag_, ent.text, ent.label_) ent.merge(tag=ent.root.tag_, lemma=ent.text, ent_type=ent.label_)
return doc return doc

View File

@ -16,16 +16,17 @@ from ..util import normalize_slice
from ..attrs cimport IS_PUNCT, IS_SPACE from ..attrs cimport IS_PUNCT, IS_SPACE
from ..lexeme cimport Lexeme from ..lexeme cimport Lexeme
from ..compat import is_config from ..compat import is_config
from .. import about from ..errors import Errors, TempErrors
from .underscore import Underscore from .underscore import Underscore, get_ext_args
cdef class Span: cdef class Span:
"""A slice from a Doc object.""" """A slice from a Doc object."""
@classmethod @classmethod
def set_extension(cls, name, default=None, method=None, def set_extension(cls, name, **kwargs):
getter=None, setter=None): if cls.has_extension(name) and not kwargs.get('force', False):
Underscore.span_extensions[name] = (default, method, getter, setter) raise ValueError(Errors.E090.format(name=name, obj='Span'))
Underscore.span_extensions[name] = get_ext_args(**kwargs)
@classmethod @classmethod
def get_extension(cls, name): def get_extension(cls, name):
@ -35,6 +36,12 @@ cdef class Span:
def has_extension(cls, name): def has_extension(cls, name):
return name in Underscore.span_extensions return name in Underscore.span_extensions
@classmethod
def remove_extension(cls, name):
if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name))
return Underscore.span_extensions.pop(name)
def __cinit__(self, Doc doc, int start, int end, attr_t label=0, def __cinit__(self, Doc doc, int start, int end, attr_t label=0,
vector=None, vector_norm=None): vector=None, vector_norm=None):
"""Create a `Span` object from the slice `doc[start : end]`. """Create a `Span` object from the slice `doc[start : end]`.
@ -48,8 +55,7 @@ cdef class Span:
RETURNS (Span): The newly constructed object. RETURNS (Span): The newly constructed object.
""" """
if not (0 <= start <= end <= len(doc)): if not (0 <= start <= end <= len(doc)):
raise IndexError raise IndexError(Errors.E035.format(start=start, end=end, length=len(doc)))
self.doc = doc self.doc = doc
self.start = start self.start = start
self.start_char = self.doc[start].idx if start < self.doc.length else 0 self.start_char = self.doc[start].idx if start < self.doc.length else 0
@ -58,7 +64,8 @@ cdef class Span:
self.end_char = self.doc[end - 1].idx + len(self.doc[end - 1]) self.end_char = self.doc[end - 1].idx + len(self.doc[end - 1])
else: else:
self.end_char = 0 self.end_char = 0
assert label in doc.vocab.strings, label if label not in doc.vocab.strings:
raise ValueError(Errors.E084.format(label=label))
self.label = label self.label = label
self._vector = vector self._vector = vector
self._vector_norm = vector_norm self._vector_norm = vector_norm
@ -267,11 +274,10 @@ cdef class Span:
or (self.doc.c[self.end-1].idx + self.doc.c[self.end-1].lex.length) != self.end_char: or (self.doc.c[self.end-1].idx + self.doc.c[self.end-1].lex.length) != self.end_char:
start = token_by_start(self.doc.c, self.doc.length, self.start_char) start = token_by_start(self.doc.c, self.doc.length, self.start_char)
if self.start == -1: if self.start == -1:
raise IndexError("Error calculating span: Can't find start") raise IndexError(Errors.E036.format(start=self.start_char))
end = token_by_end(self.doc.c, self.doc.length, self.end_char) end = token_by_end(self.doc.c, self.doc.length, self.end_char)
if end == -1: if end == -1:
raise IndexError("Error calculating span: Can't find end") raise IndexError(Errors.E037.format(end=self.end_char))
self.start = start self.start = start
self.end = end + 1 self.end = end + 1
@ -294,12 +300,11 @@ cdef class Span:
cdef int i cdef int i
if self.doc.is_parsed: if self.doc.is_parsed:
root = &self.doc.c[self.start] root = &self.doc.c[self.start]
n = 0
while root.head != 0: while root.head != 0:
root += root.head root += root.head
n += 1 n += 1
if n >= self.doc.length: if n >= self.doc.length:
raise RuntimeError raise RuntimeError(Errors.E038)
return self.doc[root.l_edge:root.r_edge + 1] return self.doc[root.l_edge:root.r_edge + 1]
elif self.doc.is_sentenced: elif self.doc.is_sentenced:
# find start of the sentence # find start of the sentence
@ -314,13 +319,7 @@ cdef class Span:
n += 1 n += 1
if n >= self.doc.length: if n >= self.doc.length:
break break
#
return self.doc[start:end] return self.doc[start:end]
else:
raise ValueError(
"Access to sentence requires either the dependency parse "
"or sentence boundaries to be set by setting " +
"doc[i].is_sent_start = True")
property has_vector: property has_vector:
"""RETURNS (bool): Whether a word vector is associated with the object. """RETURNS (bool): Whether a word vector is associated with the object.
@ -402,11 +401,7 @@ cdef class Span:
""" """
def __get__(self): def __get__(self):
if not self.doc.is_parsed: if not self.doc.is_parsed:
raise ValueError( raise ValueError(Errors.E029)
"noun_chunks requires the dependency parse, which "
"requires a statistical model to be installed and loaded. "
"For more info, see the "
"documentation: \n%s\n" % about.__docs_models__)
# Accumulate the result before beginning to iterate over it. This # Accumulate the result before beginning to iterate over it. This
# prevents the tokenisation from being changed out from under us # prevents the tokenisation from being changed out from under us
# during the iteration. The tricky thing here is that Span accepts # during the iteration. The tricky thing here is that Span accepts
@ -552,9 +547,7 @@ cdef class Span:
return self.root.ent_id return self.root.ent_id
def __set__(self, hash_t key): def __set__(self, hash_t key):
raise NotImplementedError( raise NotImplementedError(TempErrors.T007.format(attr='ent_id'))
"Can't yet set ent_id from Span. Vote for this feature on "
"the issue tracker: http://github.com/explosion/spaCy/issues")
property ent_id_: property ent_id_:
"""RETURNS (unicode): The (string) entity ID.""" """RETURNS (unicode): The (string) entity ID."""
@ -562,9 +555,7 @@ cdef class Span:
return self.root.ent_id_ return self.root.ent_id_
def __set__(self, hash_t key): def __set__(self, hash_t key):
raise NotImplementedError( raise NotImplementedError(TempErrors.T007.format(attr='ent_id_'))
"Can't yet set ent_id_ from Span. Vote for this feature on the "
"issue tracker: http://github.com/explosion/spaCy/issues")
property orth_: property orth_:
"""Verbatim text content (identical to Span.text). Exists mostly for """Verbatim text content (identical to Span.text). Exists mostly for
@ -612,9 +603,5 @@ cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
token += token.head token += token.head
n += 1 n += 1
if n >= sent_length: if n >= sent_length:
raise RuntimeError( raise RuntimeError(Errors.E039)
"Array bounds exceeded while searching for root word. This "
"likely means the parse tree is in an invalid state. Please "
"report this issue here: "
"http://github.com/explosion/spaCy/issues")
return n return n

View File

@ -6,6 +6,7 @@ from ..typedefs cimport attr_t, flags_t
from ..parts_of_speech cimport univ_pos_t from ..parts_of_speech cimport univ_pos_t
from .doc cimport Doc from .doc cimport Doc
from ..lexeme cimport Lexeme from ..lexeme cimport Lexeme
from ..errors import Errors
cdef class Token: cdef class Token:
@ -17,8 +18,7 @@ cdef class Token:
@staticmethod @staticmethod
cdef inline Token cinit(Vocab vocab, const TokenC* token, int offset, Doc doc): cdef inline Token cinit(Vocab vocab, const TokenC* token, int offset, Doc doc):
if offset < 0 or offset >= doc.length: if offset < 0 or offset >= doc.length:
msg = "Attempt to access token at %d, max length %d" raise IndexError(Errors.E040.format(i=offset, max_length=doc.length))
raise IndexError(msg % (offset, doc.length))
cdef Token self = Token.__new__(Token, vocab, doc, offset) cdef Token self = Token.__new__(Token, vocab, doc, offset)
return self return self

View File

@ -19,18 +19,19 @@ from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM
from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX
from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP
from ..compat import is_config from ..compat import is_config
from ..errors import Errors
from .. import util from .. import util
from .. import about from .underscore import Underscore, get_ext_args
from .underscore import Underscore
cdef class Token: cdef class Token:
"""An individual token i.e. a word, punctuation symbol, whitespace, """An individual token i.e. a word, punctuation symbol, whitespace,
etc.""" etc."""
@classmethod @classmethod
def set_extension(cls, name, default=None, method=None, def set_extension(cls, name, **kwargs):
getter=None, setter=None): if cls.has_extension(name) and not kwargs.get('force', False):
Underscore.token_extensions[name] = (default, method, getter, setter) raise ValueError(Errors.E090.format(name=name, obj='Token'))
Underscore.token_extensions[name] = get_ext_args(**kwargs)
@classmethod @classmethod
def get_extension(cls, name): def get_extension(cls, name):
@ -40,6 +41,12 @@ cdef class Token:
def has_extension(cls, name): def has_extension(cls, name):
return name in Underscore.span_extensions return name in Underscore.span_extensions
@classmethod
def remove_extension(cls, name):
if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name))
return Underscore.token_extensions.pop(name)
def __cinit__(self, Vocab vocab, Doc doc, int offset): def __cinit__(self, Vocab vocab, Doc doc, int offset):
"""Construct a `Token` object. """Construct a `Token` object.
@ -106,7 +113,7 @@ cdef class Token:
elif op == 5: elif op == 5:
return my >= their return my >= their
else: else:
raise ValueError(op) raise ValueError(Errors.E041.format(op=op))
@property @property
def _(self): def _(self):
@ -135,8 +142,7 @@ cdef class Token:
RETURNS (Token): The token at position `self.doc[self.i+i]`. RETURNS (Token): The token at position `self.doc[self.i+i]`.
""" """
if self.i+i < 0 or (self.i+i >= len(self.doc)): if self.i+i < 0 or (self.i+i >= len(self.doc)):
msg = "Error accessing doc[%d].nbor(%d), for doc of length %d" raise IndexError(Errors.E042.format(i=self.i, j=i, length=len(self.doc)))
raise IndexError(msg % (self.i, i, len(self.doc)))
return self.doc[self.i+i] return self.doc[self.i+i]
def similarity(self, other): def similarity(self, other):
@ -354,14 +360,7 @@ cdef class Token:
property sent_start: property sent_start:
def __get__(self): def __get__(self):
# Raising a deprecation warning causes errors for autocomplete # Raising a deprecation warning here causes errors for autocomplete
#util.deprecated(
# "Token.sent_start is now deprecated. Use Token.is_sent_start "
# "instead, which returns a boolean value or None if the answer "
# "is unknown instead of a misleading 0 for False and 1 for "
# "True. It also fixes a quirk in the old logic that would "
# "always set the property to 0 for the first word of the "
# "document.")
# Handle broken backwards compatibility case: doc[0].sent_start # Handle broken backwards compatibility case: doc[0].sent_start
# was False. # was False.
if self.i == 0: if self.i == 0:
@ -386,9 +385,7 @@ cdef class Token:
def __set__(self, value): def __set__(self, value):
if self.doc.is_parsed: if self.doc.is_parsed:
raise ValueError( raise ValueError(Errors.E043)
"Refusing to write to token.sent_start if its document "
"is parsed, because this may cause inconsistent state.")
if value is None: if value is None:
self.c.sent_start = 0 self.c.sent_start = 0
elif value is True: elif value is True:
@ -396,8 +393,7 @@ cdef class Token:
elif value is False: elif value is False:
self.c.sent_start = -1 self.c.sent_start = -1
else: else:
raise ValueError("Invalid value for token.sent_start. Must be " raise ValueError(Errors.E044.format(value=value))
"one of: None, True, False")
property lefts: property lefts:
"""The leftward immediate children of the word, in the syntactic """The leftward immediate children of the word, in the syntactic
@ -415,8 +411,7 @@ cdef class Token:
nr_iter += 1 nr_iter += 1
# This is ugly, but it's a way to guard out infinite loops # This is ugly, but it's a way to guard out infinite loops
if nr_iter >= 10000000: if nr_iter >= 10000000:
raise RuntimeError("Possibly infinite loop encountered " raise RuntimeError(Errors.E045.format(attr='token.lefts'))
"while looking for token.lefts")
property rights: property rights:
"""The rightward immediate children of the word, in the syntactic """The rightward immediate children of the word, in the syntactic
@ -434,8 +429,7 @@ cdef class Token:
ptr -= 1 ptr -= 1
nr_iter += 1 nr_iter += 1
if nr_iter >= 10000000: if nr_iter >= 10000000:
raise RuntimeError("Possibly infinite loop encountered " raise RuntimeError(Errors.E045.format(attr='token.rights'))
"while looking for token.rights")
tokens.reverse() tokens.reverse()
for t in tokens: for t in tokens:
yield t yield t

View File

@ -3,6 +3,8 @@ from __future__ import unicode_literals
import functools import functools
from ..errors import Errors
class Underscore(object): class Underscore(object):
doc_extensions = {} doc_extensions = {}
@ -23,7 +25,7 @@ class Underscore(object):
def __getattr__(self, name): def __getattr__(self, name):
if name not in self._extensions: if name not in self._extensions:
raise AttributeError(name) raise AttributeError(Errors.E046.format(name=name))
default, method, getter, setter = self._extensions[name] default, method, getter, setter = self._extensions[name]
if getter is not None: if getter is not None:
return getter(self._obj) return getter(self._obj)
@ -34,7 +36,7 @@ class Underscore(object):
def __setattr__(self, name, value): def __setattr__(self, name, value):
if name not in self._extensions: if name not in self._extensions:
raise AttributeError(name) raise AttributeError(Errors.E047.format(name=name))
default, method, getter, setter = self._extensions[name] default, method, getter, setter = self._extensions[name]
if setter is not None: if setter is not None:
return setter(self._obj, value) return setter(self._obj, value)
@ -52,3 +54,24 @@ class Underscore(object):
def _get_key(self, name): def _get_key(self, name):
return ('._.', name, self._start, self._end) return ('._.', name, self._start, self._end)
def get_ext_args(**kwargs):
"""Validate and convert arguments. Reused in Doc, Token and Span."""
default = kwargs.get('default')
getter = kwargs.get('getter')
setter = kwargs.get('setter')
method = kwargs.get('method')
if getter is None and setter is not None:
raise ValueError(Errors.E089)
valid_opts = ('default' in kwargs, method is not None, getter is not None)
nr_defined = sum(t is True for t in valid_opts)
if nr_defined != 1:
raise ValueError(Errors.E083.format(nr_defined=nr_defined))
if setter is not None and not hasattr(setter, '__call__'):
raise ValueError(Errors.E091.format(name='setter', value=repr(setter)))
if getter is not None and not hasattr(getter, '__call__'):
raise ValueError(Errors.E091.format(name='getter', value=repr(getter)))
if method is not None and not hasattr(method, '__call__'):
raise ValueError(Errors.E091.format(name='method', value=repr(method)))
return (default, method, getter, setter)
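A sketch of the argument rules enforced by `get_ext_args()`, exercised through the public `set_extension()` API (assumes a spaCy v2.x install with this commit): exactly one of `default`, `method` or `getter` may be given, and `setter` requires a `getter`; `getter`, `setter` and `method` must be callables.

```python
from spacy.tokens import Doc

# Valid combinations: exactly one of default / method / getter,
# optionally a setter alongside a getter.
Doc.set_extension('ok_default', default=None)
Doc.set_extension('ok_getter', getter=lambda doc: len(doc))
Doc.set_extension('ok_pair', getter=lambda doc: len(doc),
                  setter=lambda doc, value: None)

try:
    Doc.set_extension('bad_setter_only', setter=lambda doc, value: None)
except ValueError:
    pass  # a setter without a getter is rejected

try:
    Doc.set_extension('bad_two_given', default=None, method=lambda doc: None)
except ValueError:
    pass  # more than one of default/method/getter was given
```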

View File

@ -11,8 +11,6 @@ import sys
import textwrap import textwrap
import random import random
from collections import OrderedDict from collections import OrderedDict
import inspect
import warnings
from thinc.neural._classes.model import Model from thinc.neural._classes.model import Model
from thinc.neural.ops import NumpyOps from thinc.neural.ops import NumpyOps
import functools import functools
@ -23,10 +21,12 @@ import numpy.random
from .symbols import ORTH from .symbols import ORTH
from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_ from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_
from .compat import import_file from .compat import import_file
from .errors import Errors
import msgpack # Import these directly from Thinc, so that we're sure we always have the
import msgpack_numpy # same version.
msgpack_numpy.patch() from thinc.neural._classes.model import msgpack
from thinc.neural._classes.model import msgpack_numpy
LANGUAGES = {} LANGUAGES = {}
@ -50,8 +50,7 @@ def get_lang_class(lang):
try: try:
module = importlib.import_module('.lang.%s' % lang, 'spacy') module = importlib.import_module('.lang.%s' % lang, 'spacy')
except ImportError: except ImportError:
msg = "Can't import language %s from spacy.lang." raise ImportError(Errors.E048.format(lang=lang))
raise ImportError(msg % lang)
LANGUAGES[lang] = getattr(module, module.__all__[0]) LANGUAGES[lang] = getattr(module, module.__all__[0])
return LANGUAGES[lang] return LANGUAGES[lang]
@ -108,7 +107,7 @@ def load_model(name, **overrides):
""" """
data_path = get_data_path() data_path = get_data_path()
if not data_path or not data_path.exists(): if not data_path or not data_path.exists():
raise IOError("Can't find spaCy data path: %s" % path2str(data_path)) raise IOError(Errors.E049.format(path=path2str(data_path)))
if isinstance(name, basestring_): # in data dir / shortcut if isinstance(name, basestring_): # in data dir / shortcut
if name in set([d.name for d in data_path.iterdir()]): if name in set([d.name for d in data_path.iterdir()]):
return load_model_from_link(name, **overrides) return load_model_from_link(name, **overrides)
@ -118,7 +117,7 @@ def load_model(name, **overrides):
return load_model_from_path(Path(name), **overrides) return load_model_from_path(Path(name), **overrides)
elif hasattr(name, 'exists'): # Path or Path-like to model data elif hasattr(name, 'exists'): # Path or Path-like to model data
return load_model_from_path(name, **overrides) return load_model_from_path(name, **overrides)
raise IOError("Can't find model '%s'" % name) raise IOError(Errors.E050.format(name=name))
def load_model_from_link(name, **overrides): def load_model_from_link(name, **overrides):
@ -127,9 +126,7 @@ def load_model_from_link(name, **overrides):
try: try:
cls = import_file(name, path) cls = import_file(name, path)
except AttributeError: except AttributeError:
raise IOError( raise IOError(Errors.E051.format(name=name))
"Cant' load '%s'. If you're using a shortcut link, make sure it "
"points to a valid package (not just a data directory)." % name)
return cls.load(**overrides) return cls.load(**overrides)
@ -173,8 +170,7 @@ def load_model_from_init_py(init_file, **overrides):
data_dir = '%s_%s-%s' % (meta['lang'], meta['name'], meta['version']) data_dir = '%s_%s-%s' % (meta['lang'], meta['name'], meta['version'])
data_path = model_path / data_dir data_path = model_path / data_dir
if not model_path.exists(): if not model_path.exists():
msg = "Can't find model directory: %s" raise IOError(Errors.E052.format(path=path2str(data_path)))
raise ValueError(msg % path2str(data_path))
return load_model_from_path(data_path, meta, **overrides) return load_model_from_path(data_path, meta, **overrides)
@ -186,16 +182,14 @@ def get_model_meta(path):
""" """
model_path = ensure_path(path) model_path = ensure_path(path)
if not model_path.exists(): if not model_path.exists():
msg = "Can't find model directory: %s" raise IOError(Errors.E052.format(path=path2str(model_path)))
raise ValueError(msg % path2str(model_path))
meta_path = model_path / 'meta.json' meta_path = model_path / 'meta.json'
if not meta_path.is_file(): if not meta_path.is_file():
raise IOError("Could not read meta.json from %s" % meta_path) raise IOError(Errors.E053.format(path=meta_path))
meta = read_json(meta_path) meta = read_json(meta_path)
for setting in ['lang', 'name', 'version']: for setting in ['lang', 'name', 'version']:
if setting not in meta or not meta[setting]: if setting not in meta or not meta[setting]:
msg = "No valid '%s' setting found in model meta.json" raise ValueError(Errors.E054.format(setting=setting))
raise ValueError(msg % setting)
return meta return meta
@ -344,13 +338,10 @@ def update_exc(base_exceptions, *addition_dicts):
for orth, token_attrs in additions.items(): for orth, token_attrs in additions.items():
if not all(isinstance(attr[ORTH], unicode_) if not all(isinstance(attr[ORTH], unicode_)
for attr in token_attrs): for attr in token_attrs):
msg = "Invalid ORTH value in exception: key='%s', orths='%s'" raise ValueError(Errors.E055.format(key=orth, orths=token_attrs))
raise ValueError(msg % (orth, token_attrs))
described_orth = ''.join(attr[ORTH] for attr in token_attrs) described_orth = ''.join(attr[ORTH] for attr in token_attrs)
if orth != described_orth: if orth != described_orth:
msg = ("Invalid tokenizer exception: ORTH values combined " raise ValueError(Errors.E056.format(key=orth, orths=described_orth))
"don't match original string. key='%s', orths='%s'")
raise ValueError(msg % (orth, described_orth))
exc.update(additions) exc.update(additions)
exc = expand_exc(exc, "'", "") exc = expand_exc(exc, "'", "")
return exc return exc
@ -380,8 +371,7 @@ def expand_exc(excs, search, replace):
def normalize_slice(length, start, stop, step=None): def normalize_slice(length, start, stop, step=None):
if not (step is None or step == 1): if not (step is None or step == 1):
raise ValueError("Stepped slices not supported in Span objects." raise ValueError(Errors.E057)
"Try: list(tokens)[start:stop:step] instead.")
if start is None: if start is None:
start = 0 start = 0
elif start < 0: elif start < 0:
@ -392,7 +382,6 @@ def normalize_slice(length, start, stop, step=None):
elif stop < 0: elif stop < 0:
stop += length stop += length
stop = min(length, max(start, stop)) stop = min(length, max(start, stop))
assert 0 <= start <= stop <= length
return start, stop return start, stop
@ -552,18 +541,6 @@ def from_disk(path, readers, exclude):
return path return path
def deprecated(message, filter='always'):
"""Show a deprecation warning.
message (unicode): The message to display.
filter (unicode): Filter value.
"""
stack = inspect.stack()[-1]
with warnings.catch_warnings():
warnings.simplefilter(filter, DeprecationWarning)
warnings.warn_explicit(message, DeprecationWarning, stack[1], stack[2])
def print_table(data, title=None): def print_table(data, title=None):
"""Print data in table format. """Print data in table format.

View File

@ -1,24 +1,43 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
import functools
import numpy import numpy
from collections import OrderedDict from collections import OrderedDict
import msgpack
import msgpack_numpy from .util import msgpack
msgpack_numpy.patch() from .util import msgpack_numpy
cimport numpy as np cimport numpy as np
from thinc.neural.util import get_array_module from thinc.neural.util import get_array_module
from thinc.neural._classes.model import Model from thinc.neural._classes.model import Model
from .strings cimport StringStore, hash_string from .strings cimport StringStore, hash_string
from .compat import basestring_, path2str from .compat import basestring_, path2str
from .errors import Errors
from . import util from . import util
from cython.operator cimport dereference as deref
from libcpp.set cimport set as cppset
def unpickle_vectors(bytes_data): def unpickle_vectors(bytes_data):
return Vectors().from_bytes(bytes_data) return Vectors().from_bytes(bytes_data)
class GlobalRegistry(object):
'''Global store of vectors, to avoid repeatedly loading the data.'''
data = {}
@classmethod
def register(cls, name, data):
cls.data[name] = data
return functools.partial(cls.get, name)
@classmethod
def get(cls, name):
return cls.data[name]
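A usage sketch for the `GlobalRegistry` added above (assumes this build of `spacy.vectors`; the name below is illustrative): register a vectors array once under a name, then fetch the same object later through the returned partial instead of reloading it from disk.

```python
import numpy
from spacy.vectors import GlobalRegistry

data = numpy.zeros((2, 300), dtype='f')
loader = GlobalRegistry.register(u'demo_vectors', data)

assert loader() is data                           # functools.partial(get, name)
assert GlobalRegistry.get(u'demo_vectors') is data
```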
cdef class Vectors: cdef class Vectors:
"""Store, save and load word vectors. """Store, save and load word vectors.
@ -31,18 +50,21 @@ cdef class Vectors:
the table need to be assigned --- so len(list(vectors.keys())) may be the table need to be assigned --- so len(list(vectors.keys())) may be
greater or smaller than vectors.shape[0]. greater or smaller than vectors.shape[0].
""" """
cdef public object name
cdef public object data cdef public object data
cdef public object key2row cdef public object key2row
cdef public object _unset cdef cppset[int] _unset
def __init__(self, *, shape=None, data=None, keys=None): def __init__(self, *, shape=None, data=None, keys=None, name=None):
"""Create a new vector store. """Create a new vector store.
shape (tuple): Size of the table, as (# entries, # columns) shape (tuple): Size of the table, as (# entries, # columns)
data (numpy.ndarray): The vector data. data (numpy.ndarray): The vector data.
keys (iterable): A sequence of keys, aligned with the data. keys (iterable): A sequence of keys, aligned with the data.
name (string): A name to identify the vectors table.
RETURNS (Vectors): The newly created object. RETURNS (Vectors): The newly created object.
""" """
self.name = name
if data is None: if data is None:
if shape is None: if shape is None:
shape = (0,0) shape = (0,0)
@ -50,9 +72,9 @@ cdef class Vectors:
self.data = data self.data = data
self.key2row = OrderedDict() self.key2row = OrderedDict()
if self.data is not None: if self.data is not None:
self._unset = set(range(self.data.shape[0])) self._unset = cppset[int]({i for i in range(self.data.shape[0])})
else: else:
self._unset = set() self._unset = cppset[int]()
if keys is not None: if keys is not None:
for i, key in enumerate(keys): for i, key in enumerate(keys):
self.add(key, row=i) self.add(key, row=i)
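A brief sketch of the new `name` argument and row assignment (assumes this build of `spacy.vectors`; the table name is illustrative). A named table is what lets a deserialized `Vocab` re-link its vectors to models, as in the `vocab.pyx` changes further down.

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(shape=(2, 300), name=u'en_example.vectors')
vectors.add(u'cat', vector=numpy.random.uniform(-1, 1, (300,)))
vectors.add(u'dog', vector=numpy.random.uniform(-1, 1, (300,)))

assert vectors.name == u'en_example.vectors'
assert vectors.is_full            # both rows of the (2, 300) table are assigned
```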
@ -74,7 +96,7 @@ cdef class Vectors:
@property @property
def is_full(self): def is_full(self):
"""RETURNS (bool): `True` if no slots are available for new keys.""" """RETURNS (bool): `True` if no slots are available for new keys."""
return len(self._unset) == 0 return self._unset.size() == 0
@property @property
def n_keys(self): def n_keys(self):
@ -93,7 +115,7 @@ cdef class Vectors:
""" """
i = self.key2row[key] i = self.key2row[key]
if i is None: if i is None:
raise KeyError(key) raise KeyError(Errors.E058.format(key=key))
else: else:
return self.data[i] return self.data[i]
@ -105,8 +127,8 @@ cdef class Vectors:
""" """
i = self.key2row[key] i = self.key2row[key]
self.data[i] = vector self.data[i] = vector
if i in self._unset: if self._unset.count(i):
self._unset.remove(i) self._unset.erase(self._unset.find(i))
def __iter__(self): def __iter__(self):
"""Iterate over the keys in the table. """Iterate over the keys in the table.
@ -145,7 +167,7 @@ cdef class Vectors:
xp = get_array_module(self.data) xp = get_array_module(self.data)
self.data = xp.resize(self.data, shape) self.data = xp.resize(self.data, shape)
filled = {row for row in self.key2row.values()} filled = {row for row in self.key2row.values()}
self._unset = {row for row in range(shape[0]) if row not in filled} self._unset = cppset[int]({row for row in range(shape[0]) if row not in filled})
removed_items = [] removed_items = []
for key, row in list(self.key2row.items()): for key, row in list(self.key2row.items()):
if row >= shape[0]: if row >= shape[0]:
@ -169,7 +191,7 @@ cdef class Vectors:
YIELDS (ndarray): A vector in the table. YIELDS (ndarray): A vector in the table.
""" """
for row, vector in enumerate(range(self.data.shape[0])): for row, vector in enumerate(range(self.data.shape[0])):
if row not in self._unset: if not self._unset.count(row):
yield vector yield vector
def items(self): def items(self):
@ -194,7 +216,8 @@ cdef class Vectors:
RETURNS: The requested key, keys, row or rows. RETURNS: The requested key, keys, row or rows.
""" """
if sum(arg is None for arg in (key, keys, row, rows)) != 3: if sum(arg is None for arg in (key, keys, row, rows)) != 3:
raise ValueError("One (and only one) keyword arg must be set.") bad_kwargs = {'key': key, 'keys': keys, 'row': row, 'rows': rows}
raise ValueError(Errors.E059.format(kwargs=bad_kwargs))
xp = get_array_module(self.data) xp = get_array_module(self.data)
if key is not None: if key is not None:
if isinstance(key, basestring_): if isinstance(key, basestring_):
@ -233,14 +256,14 @@ cdef class Vectors:
row = self.key2row[key] row = self.key2row[key]
elif row is None: elif row is None:
if self.is_full: if self.is_full:
raise ValueError("Cannot add new key to vectors -- full") raise ValueError(Errors.E060.format(rows=self.data.shape[0],
row = min(self._unset) cols=self.data.shape[1]))
row = deref(self._unset.begin())
self.key2row[key] = row self.key2row[key] = row
if vector is not None: if vector is not None:
self.data[row] = vector self.data[row] = vector
if row in self._unset: if self._unset.count(row):
self._unset.remove(row) self._unset.erase(self._unset.find(row))
return row return row
def most_similar(self, queries, *, batch_size=1024): def most_similar(self, queries, *, batch_size=1024):
@ -297,7 +320,7 @@ cdef class Vectors:
width = int(dims) width = int(dims)
break break
else: else:
raise IOError("Expected file named e.g. vectors.128.f.bin") raise IOError(Errors.E061.format(filename=path))
bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims, bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims,
dtype=dtype) dtype=dtype)
xp = get_array_module(self.data) xp = get_array_module(self.data)
@ -346,8 +369,8 @@ cdef class Vectors:
with path.open('rb') as file_: with path.open('rb') as file_:
self.key2row = msgpack.load(file_) self.key2row = msgpack.load(file_)
for key, row in self.key2row.items(): for key, row in self.key2row.items():
if row in self._unset: if self._unset.count(row):
self._unset.remove(row) self._unset.erase(self._unset.find(row))
def load_keys(path): def load_keys(path):
if path.exists(): if path.exists():

View File

@ -16,6 +16,7 @@ from .attrs cimport PROB, LANG, ORTH, TAG
from .structs cimport SerializedLexemeC from .structs cimport SerializedLexemeC
from .compat import copy_reg, basestring_ from .compat import copy_reg, basestring_
from .errors import Errors
from .lemmatizer import Lemmatizer from .lemmatizer import Lemmatizer
from .attrs import intify_attrs from .attrs import intify_attrs
from .vectors import Vectors from .vectors import Vectors
@ -100,15 +101,9 @@ cdef class Vocab:
flag_id = bit flag_id = bit
break break
else: else:
raise ValueError( raise ValueError(Errors.E062)
"Cannot find empty bit for new lexical flag. All bits "
"between 0 and 63 are occupied. You can replace one by "
"specifying the flag_id explicitly, e.g. "
"`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA`.")
elif flag_id >= 64 or flag_id < 1: elif flag_id >= 64 or flag_id < 1:
raise ValueError( raise ValueError(Errors.E063.format(value=flag_id))
"Invalid value for flag_id: %d. Flag IDs must be between "
"1 and 63 (inclusive)" % flag_id)
for lex in self: for lex in self:
lex.set_flag(flag_id, flag_getter(lex.orth_)) lex.set_flag(flag_id, flag_getter(lex.orth_))
self.lex_attr_getters[flag_id] = flag_getter self.lex_attr_getters[flag_id] = flag_getter
@ -127,8 +122,9 @@ cdef class Vocab:
cdef size_t addr cdef size_t addr
if lex != NULL: if lex != NULL:
if lex.orth != self.strings[string]: if lex.orth != self.strings[string]:
raise LookupError.mismatched_strings( raise KeyError(Errors.E064.format(string=lex.orth,
lex.orth, self.strings[string], string) orth=self.strings[string],
orth_id=string))
return lex return lex
else: else:
return self._new_lexeme(mem, string) return self._new_lexeme(mem, string)
@ -171,7 +167,8 @@ cdef class Vocab:
if not is_oov: if not is_oov:
key = hash_string(string) key = hash_string(string)
self._add_lex_to_vocab(key, lex) self._add_lex_to_vocab(key, lex)
assert lex != NULL, string if lex == NULL:
raise ValueError(Errors.E085.format(string=string))
return lex return lex
cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex) except -1: cdef int _add_lex_to_vocab(self, hash_t key, const LexemeC* lex) except -1:
@ -254,7 +251,7 @@ cdef class Vocab:
width, you have to call this to change the size of the vectors. width, you have to call this to change the size of the vectors.
""" """
if width is not None and shape is not None: if width is not None and shape is not None:
raise ValueError("Only one of width and shape can be specified") raise ValueError(Errors.E065.format(width=width, shape=shape))
elif shape is not None: elif shape is not None:
self.vectors = Vectors(shape=shape) self.vectors = Vectors(shape=shape)
else: else:
@ -381,7 +378,8 @@ cdef class Vocab:
self.lexemes_from_bytes(file_.read()) self.lexemes_from_bytes(file_.read())
if self.vectors is not None: if self.vectors is not None:
self.vectors.from_disk(path, exclude='strings.json') self.vectors.from_disk(path, exclude='strings.json')
link_vectors_to_models(self) if self.vectors.name is not None:
link_vectors_to_models(self)
return self return self
def to_bytes(self, **exclude): def to_bytes(self, **exclude):
@ -421,6 +419,8 @@ cdef class Vocab:
('vectors', lambda b: serialize_vectors(b)) ('vectors', lambda b: serialize_vectors(b))
)) ))
util.from_bytes(bytes_data, setters, exclude) util.from_bytes(bytes_data, setters, exclude)
if self.vectors.name is not None:
link_vectors_to_models(self)
return self return self
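
With the lines added above, deserialization now re-links vectors to models whenever the vectors table carries a name. A minimal round-trip sketch of the `to_bytes`/`from_bytes` API touched here, using a blank `Vocab` purely for illustration:

    from spacy.vocab import Vocab

    vocab = Vocab()
    data = vocab.to_bytes()

    # Restore into a fresh Vocab; if the vectors have a name, they are
    # re-linked to the models as in the hunk above
    restored = Vocab().from_bytes(data)
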
def lexemes_to_bytes(self): def lexemes_to_bytes(self):
@ -468,7 +468,10 @@ cdef class Vocab:
if ptr == NULL: if ptr == NULL:
continue continue
py_str = self.strings[lexeme.orth] py_str = self.strings[lexeme.orth]
assert self.strings[py_str] == lexeme.orth, (py_str, lexeme.orth) if self.strings[py_str] != lexeme.orth:
raise ValueError(Errors.E086.format(string=py_str,
orth_id=lexeme.orth,
hash_id=self.strings[py_str]))
key = hash_string(py_str) key = hash_string(py_str)
self._by_hash.set(key, lexeme) self._by_hash.set(key, lexeme)
self._by_orth.set(lexeme.orth, lexeme) self._by_orth.set(lexeme.orth, lexeme)
@ -509,16 +512,3 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
copy_reg.pickle(Vocab, pickle_vocab, unpickle_vocab) copy_reg.pickle(Vocab, pickle_vocab, unpickle_vocab)
class LookupError(Exception):
@classmethod
def mismatched_strings(cls, id_, id_string, original_string):
return cls(
"Error fetching a Lexeme from the Vocab. When looking up a "
"string, the lexeme returned had an orth ID that did not match "
"the query string. This means that the cached lexeme structs are "
"mismatched to the string encoding table. The mismatched:\n"
"Query string: {}\n"
"Orth cached: {}\n"
"Orth ID: {}".format(repr(original_string), repr(id_string), id_))

View File

@ -1,7 +1,7 @@
{ {
"globals": { "globals": {
"title": "spaCy", "title": "spaCy",
"description": "spaCy is a free open-source library featuring state-of-the-art speed and accuracy and a powerful Python API.", "description": "spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.",
"SITENAME": "spaCy", "SITENAME": "spaCy",
"SLOGAN": "Industrial-strength Natural Language Processing in Python", "SLOGAN": "Industrial-strength Natural Language Processing in Python",
@ -10,10 +10,13 @@
"COMPANY": "Explosion AI", "COMPANY": "Explosion AI",
"COMPANY_URL": "https://explosion.ai", "COMPANY_URL": "https://explosion.ai",
"DEMOS_URL": "https://demos.explosion.ai", "DEMOS_URL": "https://explosion.ai/demos",
"MODELS_REPO": "explosion/spacy-models", "MODELS_REPO": "explosion/spacy-models",
"KERNEL_BINDER": "ines/spacy-binder",
"KERNEL_PYTHON": "python3",
"SPACY_VERSION": "2.0", "SPACY_VERSION": "2.0",
"BINDER_VERSION": "2.0.11",
"SOCIAL": { "SOCIAL": {
"twitter": "spacy_io", "twitter": "spacy_io",
@ -26,7 +29,8 @@
"NAVIGATION": { "NAVIGATION": {
"Usage": "/usage", "Usage": "/usage",
"Models": "/models", "Models": "/models",
"API": "/api" "API": "/api",
"Universe": "/universe"
}, },
"FOOTER": { "FOOTER": {
@ -34,7 +38,7 @@
"Usage": "/usage", "Usage": "/usage",
"Models": "/models", "Models": "/models",
"API Reference": "/api", "API Reference": "/api",
"Resources": "/usage/resources" "Universe": "/universe"
}, },
"Support": { "Support": {
"Issue Tracker": "https://github.com/explosion/spaCy/issues", "Issue Tracker": "https://github.com/explosion/spaCy/issues",
@ -82,8 +86,8 @@
} }
], ],
"V_CSS": "2.0.1", "V_CSS": "2.1.2",
"V_JS": "2.0.1", "V_JS": "2.1.0",
"DEFAULT_SYNTAX": "python", "DEFAULT_SYNTAX": "python",
"ANALYTICS": "UA-58931649-1", "ANALYTICS": "UA-58931649-1",
"MAILCHIMP": { "MAILCHIMP": {

View File

@ -15,12 +15,39 @@
- MODEL_META = public.models._data.MODEL_META - MODEL_META = public.models._data.MODEL_META
- MODEL_LICENSES = public.models._data.MODEL_LICENSES - MODEL_LICENSES = public.models._data.MODEL_LICENSES
- MODEL_BENCHMARKS = public.models._data.MODEL_BENCHMARKS - MODEL_BENCHMARKS = public.models._data.MODEL_BENCHMARKS
- EXAMPLE_SENT_LANGS = public.models._data.EXAMPLE_SENT_LANGS
- EXAMPLE_SENTENCES = public.models._data.EXAMPLE_SENTENCES - EXAMPLE_SENTENCES = public.models._data.EXAMPLE_SENTENCES
- IS_PAGE = (SECTION != "index") && !landing - IS_PAGE = (SECTION != "index") && !landing
- IS_MODELS = (SECTION == "models" && LANGUAGES[current.source]) - IS_MODELS = (SECTION == "models" && LANGUAGES[current.source])
- HAS_MODELS = IS_MODELS && CURRENT_MODELS.length - HAS_MODELS = IS_MODELS && CURRENT_MODELS.length
//- Get page URL
- function getPageUrl() {
- var path = current.path;
- if(path[path.length - 1] == 'index') path = path.slice(0, path.length - 1);
- return `${SITE_URL}/${path.join('/')}`;
- }
//- Get pretty page title depending on section
- function getPageTitle() {
- var sections = ['api', 'usage', 'models'];
- if (sections.includes(SECTION)) {
- var titleSection = (SECTION == "api") ? 'API' : SECTION.charAt(0).toUpperCase() + SECTION.slice(1);
- return `${title} · ${SITENAME} ${titleSection} Documentation`;
- }
- else if (SECTION != 'index') return `${title} · ${SITENAME}`;
- return `${SITENAME} · ${SLOGAN}`;
- }
//- Get social image based on section and settings
- function getPageImage() {
- var img = (SECTION == 'api') ? 'api' : 'default';
- return `${SITE_URL}/assets/img/social/preview_${preview || img}.jpg`;
- }
//- Add prefixes to items of an array (for modifier CSS classes) //- Add prefixes to items of an array (for modifier CSS classes)
array - [array] list of class names or options, e.g. ["foot"] array - [array] list of class names or options, e.g. ["foot"]

View File

@ -7,7 +7,7 @@ include _functions
id - [string] anchor assigned to section (used for breadcrumb navigation) id - [string] anchor assigned to section (used for breadcrumb navigation)
mixin section(id) mixin section(id)
section.o-section(id="section-" + id data-section=id) section.o-section(id=id ? "section-" + id : null data-section=id)&attributes(attributes)
block block
@ -143,7 +143,7 @@ mixin aside-wrapper(label, emoji)
mixin aside(label, emoji) mixin aside(label, emoji)
+aside-wrapper(label, emoji) +aside-wrapper(label, emoji)
.c-aside__text.u-text-small .c-aside__text.u-text-small&attributes(attributes)
block block
@ -154,7 +154,7 @@ mixin aside(label, emoji)
prompt - [string] prompt displayed before first line, e.g. "$" prompt - [string] prompt displayed before first line, e.g. "$"
mixin aside-code(label, language, prompt) mixin aside-code(label, language, prompt)
+aside-wrapper(label) +aside-wrapper(label)&attributes(attributes)
+code(false, language, prompt).o-no-block +code(false, language, prompt).o-no-block
block block
@ -165,7 +165,7 @@ mixin aside-code(label, language, prompt)
argument to be able to wrap it for spacing argument to be able to wrap it for spacing
mixin infobox(label, emoji) mixin infobox(label, emoji)
aside.o-box.o-block.u-text-small aside.o-box.o-block.u-text-small&attributes(attributes)
if label if label
h3.u-heading.u-text-label.u-color-theme h3.u-heading.u-text-label.u-color-theme
if emoji if emoji
@ -242,7 +242,9 @@ mixin button(url, trusted, ...style)
wrap - [boolean] wrap text and disable horizontal scrolling wrap - [boolean] wrap text and disable horizontal scrolling
mixin code(label, language, prompt, height, icon, wrap) mixin code(label, language, prompt, height, icon, wrap)
pre.c-code-block.o-block(class="lang-#{(language || DEFAULT_SYNTAX)}" class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes) - var lang = (language != "none") ? (language || DEFAULT_SYNTAX) : null
- var lang_class = (language != "none") ? "lang-" + (language || DEFAULT_SYNTAX) : null
pre.c-code-block.o-block(data-language=lang class=lang_class class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes)
if label if label
h4.u-text-label.u-text-label--dark=label h4.u-text-label.u-text-label--dark=label
if icon if icon
@ -253,6 +255,15 @@ mixin code(label, language, prompt, height, icon, wrap)
code.c-code-block__content(class=wrap ? "u-wrap" : null data-prompt=prompt) code.c-code-block__content(class=wrap ? "u-wrap" : null data-prompt=prompt)
block block
//- Executable code
mixin code-exec(label, large)
- label = (label || "Editable code example") + " (experimental)"
+terminal-wrapper(label, !large)
figure.thebelab-wrapper
span.thebelab-wrapper__text.u-text-tiny v#{BINDER_VERSION} &middot; Python 3 &middot; via #[+a("https://mybinder.org/").u-hide-link Binder]
+code(data-executable="true")&attributes(attributes)
block
//- Wrapper for code blocks to display old/new versions //- Wrapper for code blocks to display old/new versions
@ -658,12 +669,16 @@ mixin qs(data, style)
//- Terminal-style code window //- Terminal-style code window
label - [string] title displayed in top bar of terminal window label - [string] title displayed in top bar of terminal window
mixin terminal(label, button_text, button_url) mixin terminal-wrapper(label, small)
.x-terminal .x-terminal(class=small ? "x-terminal--small" : null)
.x-terminal__icons: span .x-terminal__icons(class=small ? "x-terminal__icons--small" : null): span
.u-padding-small.u-text-label.u-text-center=label .u-padding-small.u-text-center(class=small ? "u-text-tiny" : "u-text")
strong=label
block
+code.x-terminal__code mixin terminal(label, button_text, button_url, exec)
+terminal-wrapper(label)
+code.x-terminal__code(data-executable=exec ? "" : null)
block block
if button_text && button_url if button_text && button_url

View File

@ -10,10 +10,7 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null)
li.c-nav__menu__item(class=is_active ? "is-active" : null) li.c-nav__menu__item(class=is_active ? "is-active" : null)
+a(url)(tabindex=is_active ? "-1" : null)=item +a(url)(tabindex=is_active ? "-1" : null)=item
li.c-nav__menu__item.u-hidden-xs li.c-nav__menu__item
+a("https://survey.spacy.io", true) User Survey 2018
li.c-nav__menu__item.u-hidden-xs
+a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)] +a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)]
progress.c-progress.js-progress(value="0" max="1") progress.c-progress.js-progress(value="0" max="1")

View File

@ -1,77 +1,110 @@
//- 💫 INCLUDES > MODELS PAGE TEMPLATE //- 💫 INCLUDES > MODELS PAGE TEMPLATE
for id in CURRENT_MODELS for id in CURRENT_MODELS
- var comps = getModelComponents(id)
+section(id) +section(id)
+grid("vcenter").o-no-block(id=id) section(data-vue=id data-model=id)
+grid-col("two-thirds") +grid("vcenter").o-no-block(id=id)
+h(2) +grid-col("two-thirds")
+a("#" + id).u-permalink=id +h(2)
+a("#" + id).u-permalink=id
+grid-col("third").u-text-right +grid-col("third").u-text-right
.u-color-subtle.u-text-tiny .u-color-subtle.u-text-tiny
+button(gh("spacy-models") + "/releases", true, "secondary", "small")(data-tpl=id data-tpl-key="download") +button(gh("spacy-models") + "/releases", true, "secondary", "small")(v-bind:href="releaseUrl")
| Release details | Release details
.u-padding-small Latest: #[code(data-tpl=id data-tpl-key="version") n/a] .u-padding-small Latest: #[code(v-text="version") n/a]
+aside-code("Installation", "bash", "$"). +aside-code("Installation", "bash", "$").
python -m spacy download #{id} python -m spacy download #{id}
- var comps = getModelComponents(id) p(v-if="description" v-text="description")
p(data-tpl=id data-tpl-key="description") +infobox(v-if="error")
div(data-tpl=id data-tpl-key="error")
+infobox
| Unable to load model details from GitHub. To find out more | Unable to load model details from GitHub. To find out more
| about this model, see the overview of the | about this model, see the overview of the
| #[+a(gh("spacy-models") + "/releases") latest model releases]. | #[+a(gh("spacy-models") + "/releases") latest model releases].
+table.o-block-small(data-tpl=id data-tpl-key="table") +table.o-block-small(v-bind:data-loading="loading")
+row
+cell #[+label Language]
+cell #[+tag=comps.lang] #{LANGUAGES[comps.lang]}
for comp, label in {"Type": comps.type, "Genre": comps.genre}
+row +row
+cell #[+label=label] +cell #[+label Language]
+cell #[+tag=comp] #{MODEL_META[comp]} +cell #[+tag=comps.lang] #{LANGUAGES[comps.lang]}
+row for comp, label in {"Type": comps.type, "Genre": comps.genre}
+cell #[+label Size] +row
+cell #[+tag=comps.size] #[span(data-tpl=id data-tpl-key="size") #[em n/a]] +cell #[+label=label]
+cell #[+tag=comp] #{MODEL_META[comp]}
+row
+cell #[+label Size]
+cell #[+tag=comps.size] #[span(v-text="sizeFull" v-if="sizeFull")] #[em(v-else="") n/a]
each label in ["Pipeline", "Vectors", "Sources", "Author", "License"] +row(v-if="pipeline && pipeline.length" v-cloak="")
- var field = label.toLowerCase()
if field == "vectors"
- field = "vecs"
+row
+cell.u-nowrap
+label=label
if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle]
+cell +cell
span(data-tpl=id data-tpl-key=field) #[em n/a] +label Pipeline #[+help(MODEL_META.pipeline).u-color-subtle]
+cell
span(v-for="(pipe, index) in pipeline" v-if="pipeline")
code(v-text="pipe")
span(v-if="index != pipeline.length - 1") ,&nbsp;
+row(data-tpl=id data-tpl-key="compat-wrapper" hidden="") +row(v-if="vectors" v-cloak="")
+cell +cell
+label Compat #[+help("Latest compatible model version for your spaCy installation").u-color-subtle] +label Vectors #[+help(MODEL_META.vectors).u-color-subtle]
+cell +cell(v-text="vectors")
.o-field.u-float-left
select.o-field__select.u-text-small(data-tpl=id data-tpl-key="compat")
div(data-tpl=id data-tpl-key="compat-versions") &nbsp;
section(data-tpl=id data-tpl-key="benchmarks" hidden="") +row(v-if="sources && sources.length" v-cloak="")
+grid.o-block-small +cell
+label Sources #[+help(MODEL_META.sources).u-color-subtle]
+cell
span(v-for="(source, index) in sources") {{ source }}
span(v-if="index != sources.length - 1") ,&nbsp;
+row(v-if="author" v-cloak="")
+cell #[+label Author]
+cell
+a("")(v-bind:href="url" v-if="url" v-text="author")
span(v-else="" v-text="author") {{ model.author }}
+row(v-if="license" v-cloak="")
+cell #[+label License]
+cell
+a("")(v-bind:href="modelLicenses[license]" v-if="modelLicenses[license]") {{ license }}
span(v-else="") {{ license }}
+row(v-cloak="")
+cell #[+label Compat #[+help(MODEL_META.compat).u-color-subtle]]
+cell
.o-field.u-float-left
select.o-field__select.u-text-small(v-model="spacyVersion")
option(v-for="version in orderedCompat" v-bind:value="version") spaCy v{{ version }}
code(v-if="compatVersion" v-text="compatVersion")
em(v-else="") not compatible
+grid.o-block-small(v-cloak="" v-if="hasAccuracy")
for keys, label in MODEL_BENCHMARKS for keys, label in MODEL_BENCHMARKS
.u-flex-full.u-padding-small(data-tpl=id data-tpl-key=label.toLowerCase() hidden="") .u-flex-full.u-padding-small
+table.o-block-small +table.o-block-small
+row("head") +row("head")
+head-cell(colspan="2")=(MODEL_META["benchmark_" + label] || label) +head-cell(colspan="2")=(MODEL_META["benchmark_" + label] || label)
for label, field in keys for label, field in keys
+row(hidden="") +row
+cell.u-nowrap +cell.u-nowrap
+label=label +label=label
if MODEL_META[field] if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle] | #[+help(MODEL_META[field]).u-color-subtle]
+cell("num")(data-tpl=id data-tpl-key=field) +cell("num")
| n/a span(v-if="#{field}" v-text="#{field}")
em(v-if="!#{field}") n/a
p.u-text-small.u-color-dark(v-if="notes" v-text="notes" v-cloak="")
if comps.size == "sm" && EXAMPLE_SENT_LANGS.includes(comps.lang)
section
+code-exec("Test the model live").
import spacy
from spacy.lang.#{comps.lang}.examples import sentences
nlp = spacy.load('#{id}')
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
print(token.text, token.pos_, token.dep_)
p.u-text-small.u-color-dark(data-tpl=id data-tpl-key="notes")

View File

@ -1,86 +1,33 @@
//- 💫 INCLUDES > SCRIPTS //- 💫 INCLUDES > SCRIPTS
if quickstart if IS_PAGE || SECTION == "index"
script(src="/assets/js/vendor/quickstart.min.js") script(type="text/x-thebe-config")
| { bootstrap: true, binderOptions: { repo: "#{KERNEL_BINDER}"},
| kernelOptions: { name: "#{KERNEL_PYTHON}" }}
if IS_PAGE - scripts = ["vendor/prism.min", "vendor/vue.min"]
script(src="/assets/js/vendor/in-view.min.js") - if (SECTION == "universe") scripts.push("vendor/vue-markdown.min")
- if (quickstart) scripts.push("vendor/quickstart.min")
- if (IS_PAGE) scripts.push("vendor/in-view.min")
- if (IS_PAGE || SECTION == "index") scripts.push("vendor/thebelab.custom.min")
for script in scripts
script(src="/assets/js/" + script + ".js")
script(src="/assets/js/main.js?v#{V_JS}" type=(environment == "deploy") ? null : "module")
if environment == "deploy" if environment == "deploy"
script(async src="https://www.google-analytics.com/analytics.js") script(src="https://www.google-analytics.com/analytics.js", async)
script
script(src="/assets/js/vendor/prism.min.js")
if compare_models
script(src="/assets/js/vendor/chart.min.js")
script
if quickstart
| new Quickstart("#qs");
if environment == "deploy"
| window.ga=window.ga||function(){ | window.ga=window.ga||function(){
| (ga.q=ga.q||[]).push(arguments)}; ga.l=+new Date; | (ga.q=ga.q||[]).push(arguments)}; ga.l=+new Date;
| ga('create', '#{ANALYTICS}', 'auto'); ga('send', 'pageview'); | ga('create', '#{ANALYTICS}', 'auto'); ga('send', 'pageview');
if IS_PAGE if IS_PAGE
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
script
| ((window.gitter = {}).chat = {}).options = { | ((window.gitter = {}).chat = {}).options = {
| useStyles: false, | useStyles: false,
| activationElement: '.js-gitter-button', | activationElement: '.js-gitter-button',
| targetElement: '.js-gitter', | targetElement: '.js-gitter',
| room: '!{SOCIAL.gitter}' | room: '!{SOCIAL.gitter}'
| }; | };
if IS_PAGE
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
//- JS modules slightly hacky, but necessary to dynamically instantiate the
classes with data from the Harp JSON files, while still being able to
support older browsers that can't handle JS modules. More details:
https://medium.com/dev-channel/es6-modules-in-chrome-canary-m60-ba588dfb8ab7
- ProgressBar = "new ProgressBar('.js-progress');"
- Accordion = "new Accordion('.js-accordion');"
- Changelog = "new Changelog('" + SOCIAL.github + "', 'spacy');"
- NavHighlighter = "new NavHighlighter('data-section', 'data-nav');"
- GitHubEmbed = "new GitHubEmbed('" + SOCIAL.github + "', 'data-gh-embed');"
- ModelLoader = "new ModelLoader('" + MODELS_REPO + "'," + JSON.stringify(CURRENT_MODELS) + "," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + ");"
- ModelComparer = "new ModelComparer('" + MODELS_REPO + "'," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + "," + JSON.stringify(LANGUAGES) + "," + JSON.stringify(MODEL_META) + "," + JSON.stringify(default_models || false) + ");"
if environment == "deploy"
//- DEPLOY: use compiled rollup.js and instantiate classes directly
script(src="/assets/js/rollup.js?v#{V_JS}")
script
!=ProgressBar
if changelog
!=Changelog
if IS_PAGE
!=NavHighlighter
!=GitHubEmbed
!=Accordion
if HAS_MODELS
!=ModelLoader
if compare_models
!=ModelComparer
else
//- DEVELOPMENT: Use ES6 modules
script(type="module")
| import ProgressBar from '/assets/js/progress.js';
!=ProgressBar
if changelog
| import Changelog from '/assets/js/changelog.js';
!=Changelog
if IS_PAGE
| import NavHighlighter from '/assets/js/nav-highlighter.js';
!=NavHighlighter
| import GitHubEmbed from '/assets/js/github-embed.js';
!=GitHubEmbed
| import Accordion from '/assets/js/accordion.js';
!=Accordion
if HAS_MODELS
| import { ModelLoader } from '/assets/js/models.js';
!=ModelLoader
if compare_models
| import { ModelComparer } from '/assets/js/models.js';
!=ModelComparer

View File

@ -7,6 +7,12 @@ svg(style="position: absolute; visibility: hidden; width: 0; height: 0;" width="
symbol#svg_github(viewBox="0 0 27 32") symbol#svg_github(viewBox="0 0 27 32")
path(d="M13.714 2.286q3.732 0 6.884 1.839t4.991 4.991 1.839 6.884q0 4.482-2.616 8.063t-6.759 4.955q-0.482 0.089-0.714-0.125t-0.232-0.536q0-0.054 0.009-1.366t0.009-2.402q0-1.732-0.929-2.536 1.018-0.107 1.83-0.321t1.679-0.696 1.446-1.188 0.946-1.875 0.366-2.688q0-2.125-1.411-3.679 0.661-1.625-0.143-3.643-0.5-0.161-1.446 0.196t-1.643 0.786l-0.679 0.429q-1.661-0.464-3.429-0.464t-3.429 0.464q-0.286-0.196-0.759-0.482t-1.491-0.688-1.518-0.241q-0.804 2.018-0.143 3.643-1.411 1.554-1.411 3.679 0 1.518 0.366 2.679t0.938 1.875 1.438 1.196 1.679 0.696 1.83 0.321q-0.696 0.643-0.875 1.839-0.375 0.179-0.804 0.268t-1.018 0.089-1.17-0.384-0.991-1.116q-0.339-0.571-0.866-0.929t-0.884-0.429l-0.357-0.054q-0.375 0-0.518 0.080t-0.089 0.205 0.161 0.25 0.232 0.214l0.125 0.089q0.393 0.179 0.777 0.679t0.563 0.911l0.179 0.411q0.232 0.679 0.786 1.098t1.196 0.536 1.241 0.125 0.991-0.063l0.411-0.071q0 0.679 0.009 1.58t0.009 0.973q0 0.321-0.232 0.536t-0.714 0.125q-4.143-1.375-6.759-4.955t-2.616-8.063q0-3.732 1.839-6.884t4.991-4.991 6.884-1.839zM5.196 21.982q0.054-0.125-0.125-0.214-0.179-0.054-0.232 0.036-0.054 0.125 0.125 0.214 0.161 0.107 0.232-0.036zM5.75 22.589q0.125-0.089-0.036-0.286-0.179-0.161-0.286-0.054-0.125 0.089 0.036 0.286 0.179 0.179 0.286 0.054zM6.286 23.393q0.161-0.125 0-0.339-0.143-0.232-0.304-0.107-0.161 0.089 0 0.321t0.304 0.125zM7.036 24.143q0.143-0.143-0.071-0.339-0.214-0.214-0.357-0.054-0.161 0.143 0.071 0.339 0.214 0.214 0.357 0.054zM8.054 24.589q0.054-0.196-0.232-0.286-0.268-0.071-0.339 0.125t0.232 0.268q0.268 0.107 0.339-0.107zM9.179 24.679q0-0.232-0.304-0.196-0.286 0-0.286 0.196 0 0.232 0.304 0.196 0.286 0 0.286-0.196zM10.214 24.5q-0.036-0.196-0.321-0.161-0.286 0.054-0.25 0.268t0.321 0.143 0.25-0.25z") path(d="M13.714 2.286q3.732 0 6.884 1.839t4.991 4.991 1.839 6.884q0 4.482-2.616 8.063t-6.759 4.955q-0.482 0.089-0.714-0.125t-0.232-0.536q0-0.054 0.009-1.366t0.009-2.402q0-1.732-0.929-2.536 1.018-0.107 1.83-0.321t1.679-0.696 1.446-1.188 0.946-1.875 0.366-2.688q0-2.125-1.411-3.679 0.661-1.625-0.143-3.643-0.5-0.161-1.446 0.196t-1.643 0.786l-0.679 0.429q-1.661-0.464-3.429-0.464t-3.429 0.464q-0.286-0.196-0.759-0.482t-1.491-0.688-1.518-0.241q-0.804 2.018-0.143 3.643-1.411 1.554-1.411 3.679 0 1.518 0.366 2.679t0.938 1.875 1.438 1.196 1.679 0.696 1.83 0.321q-0.696 0.643-0.875 1.839-0.375 0.179-0.804 0.268t-1.018 0.089-1.17-0.384-0.991-1.116q-0.339-0.571-0.866-0.929t-0.884-0.429l-0.357-0.054q-0.375 0-0.518 0.080t-0.089 0.205 0.161 0.25 0.232 0.214l0.125 0.089q0.393 0.179 0.777 0.679t0.563 0.911l0.179 0.411q0.232 0.679 0.786 1.098t1.196 0.536 1.241 0.125 0.991-0.063l0.411-0.071q0 0.679 0.009 1.58t0.009 0.973q0 0.321-0.232 0.536t-0.714 0.125q-4.143-1.375-6.759-4.955t-2.616-8.063q0-3.732 1.839-6.884t4.991-4.991 6.884-1.839zM5.196 21.982q0.054-0.125-0.125-0.214-0.179-0.054-0.232 0.036-0.054 0.125 0.125 0.214 0.161 0.107 0.232-0.036zM5.75 22.589q0.125-0.089-0.036-0.286-0.179-0.161-0.286-0.054-0.125 0.089 0.036 0.286 0.179 0.179 0.286 0.054zM6.286 23.393q0.161-0.125 0-0.339-0.143-0.232-0.304-0.107-0.161 0.089 0 0.321t0.304 0.125zM7.036 24.143q0.143-0.143-0.071-0.339-0.214-0.214-0.357-0.054-0.161 0.143 0.071 0.339 0.214 0.214 0.357 0.054zM8.054 24.589q0.054-0.196-0.232-0.286-0.268-0.071-0.339 0.125t0.232 0.268q0.268 0.107 0.339-0.107zM9.179 24.679q0-0.232-0.304-0.196-0.286 0-0.286 0.196 0 0.232 0.304 0.196 0.286 0 0.286-0.196zM10.214 24.5q-0.036-0.196-0.321-0.161-0.286 0.054-0.25 0.268t0.321 0.143 0.25-0.25z")
symbol#svg_twitter(viewBox="0 0 30 32")
path(d="M28.929 7.286q-1.196 1.75-2.893 2.982 0.018 0.25 0.018 0.75 0 2.321-0.679 4.634t-2.063 4.437-3.295 3.759-4.607 2.607-5.768 0.973q-4.839 0-8.857-2.589 0.625 0.071 1.393 0.071 4.018 0 7.161-2.464-1.875-0.036-3.357-1.152t-2.036-2.848q0.589 0.089 1.089 0.089 0.768 0 1.518-0.196-2-0.411-3.313-1.991t-1.313-3.67v-0.071q1.214 0.679 2.607 0.732-1.179-0.786-1.875-2.054t-0.696-2.75q0-1.571 0.786-2.911 2.161 2.661 5.259 4.259t6.634 1.777q-0.143-0.679-0.143-1.321 0-2.393 1.688-4.080t4.080-1.688q2.5 0 4.214 1.821 1.946-0.375 3.661-1.393-0.661 2.054-2.536 3.179 1.661-0.179 3.321-0.893z")
symbol#svg_website(viewBox="0 0 32 32")
path(d="M22.658 10.988h5.172c0.693 1.541 1.107 3.229 1.178 5.012h-5.934c-0.025-1.884-0.181-3.544-0.416-5.012zM20.398 3.896c2.967 1.153 5.402 3.335 6.928 6.090h-4.836c-0.549-2.805-1.383-4.799-2.092-6.090zM16.068 9.986v-6.996c1.066 0.047 2.102 0.216 3.092 0.493 0.75 1.263 1.719 3.372 2.33 6.503h-5.422zM9.489 22.014c-0.234-1.469-0.396-3.119-0.421-5.012h5.998v5.012h-5.577zM9.479 10.988h5.587v5.012h-6.004c0.025-1.886 0.183-3.543 0.417-5.012zM11.988 3.461c0.987-0.266 2.015-0.435 3.078-0.469v6.994h-5.422c0.615-3.148 1.591-5.265 2.344-6.525zM3.661 9.986c1.551-2.8 4.062-4.993 7.096-6.131-0.715 1.29-1.559 3.295-2.114 6.131h-4.982zM8.060 16h-6.060c0.066-1.781 0.467-3.474 1.158-5.012h5.316c-0.233 1.469-0.39 3.128-0.414 5.012zM8.487 22.014h-5.29c-0.694-1.543-1.139-3.224-1.204-5.012h6.071c0.024 1.893 0.188 3.541 0.423 5.012zM8.651 23.016c0.559 2.864 1.416 4.867 2.134 6.142-3.045-1.133-5.557-3.335-7.11-6.142h4.976zM15.066 23.016v6.994c-1.052-0.033-2.067-0.199-3.045-0.46-0.755-1.236-1.736-3.363-2.356-6.534h5.401zM21.471 23.016c-0.617 3.152-1.592 5.271-2.344 6.512-0.979 0.271-2.006 0.418-3.059 0.465v-6.977h5.403zM16.068 17.002h5.998c-0.023 1.893-0.188 3.542-0.422 5.012h-5.576v-5.012zM22.072 16h-6.004v-5.012h5.586c0.235 1.469 0.393 3.126 0.418 5.012zM23.070 17.002h5.926c-0.066 1.787-0.506 3.468-1.197 5.012h-5.152c0.234-1.471 0.398-3.119 0.423-5.012zM27.318 23.016c-1.521 2.766-3.967 4.949-6.947 6.1 0.715-1.276 1.561-3.266 2.113-6.1h4.834z")
symbol#svg_code(viewBox="0 0 20 20") symbol#svg_code(viewBox="0 0 20 20")
path(d="M5.719 14.75c-0.236 0-0.474-0.083-0.664-0.252l-5.060-4.498 5.341-4.748c0.412-0.365 1.044-0.33 1.411 0.083s0.33 1.045-0.083 1.412l-3.659 3.253 3.378 3.002c0.413 0.367 0.45 0.999 0.083 1.412-0.197 0.223-0.472 0.336-0.747 0.336zM14.664 14.748l5.341-4.748-5.060-4.498c-0.413-0.367-1.045-0.33-1.411 0.083s-0.33 1.045 0.083 1.412l3.378 3.003-3.659 3.252c-0.413 0.367-0.45 0.999-0.083 1.412 0.197 0.223 0.472 0.336 0.747 0.336 0.236 0 0.474-0.083 0.664-0.252zM9.986 16.165l2-12c0.091-0.545-0.277-1.060-0.822-1.151-0.547-0.092-1.061 0.277-1.15 0.822l-2 12c-0.091 0.545 0.277 1.060 0.822 1.151 0.056 0.009 0.11 0.013 0.165 0.013 0.48 0 0.904-0.347 0.985-0.835z") path(d="M5.719 14.75c-0.236 0-0.474-0.083-0.664-0.252l-5.060-4.498 5.341-4.748c0.412-0.365 1.044-0.33 1.411 0.083s0.33 1.045-0.083 1.412l-3.659 3.253 3.378 3.002c0.413 0.367 0.45 0.999 0.083 1.412-0.197 0.223-0.472 0.336-0.747 0.336zM14.664 14.748l5.341-4.748-5.060-4.498c-0.413-0.367-1.045-0.33-1.411 0.083s-0.33 1.045 0.083 1.412l3.378 3.003-3.659 3.252c-0.413 0.367-0.45 0.999-0.083 1.412 0.197 0.223 0.472 0.336 0.747 0.336 0.236 0 0.474-0.083 0.664-0.252zM9.986 16.165l2-12c0.091-0.545-0.277-1.060-0.822-1.151-0.547-0.092-1.061 0.277-1.15 0.822l-2 12c-0.091 0.545 0.277 1.060 0.822 1.151 0.056 0.009 0.11 0.013 0.165 0.013 0.48 0 0.904-0.347 0.985-0.835z")

View File

@ -3,23 +3,15 @@
include _includes/_mixins include _includes/_mixins
- title = IS_MODELS ? LANGUAGES[current.source] || title : title - title = IS_MODELS ? LANGUAGES[current.source] || title : title
- social_title = (SECTION == "index") ? SITENAME + " - " + SLOGAN : title + " - " + SITENAME
- social_img = SITE_URL + "/assets/img/social/preview_" + (preview || ALPHA ? "alpha" : "default") + ".jpg" - PAGE_URL = getPageUrl()
- PAGE_TITLE = getPageTitle()
- PAGE_IMAGE = getPageImage()
doctype html doctype html
html(lang="en") html(lang="en")
head head
title title=PAGE_TITLE
if SECTION == "api" || SECTION == "usage" || SECTION == "models"
- var title_section = (SECTION == "api") ? "API" : SECTION.charAt(0).toUpperCase() + SECTION.slice(1)
| #{title} | #{SITENAME} #{title_section} Documentation
else if SECTION != "index"
| #{title} | #{SITENAME}
else
| #{SITENAME} - #{SLOGAN}
meta(charset="utf-8") meta(charset="utf-8")
meta(name="viewport" content="width=device-width, initial-scale=1.0") meta(name="viewport" content="width=device-width, initial-scale=1.0")
meta(name="referrer" content="always") meta(name="referrer" content="always")
@ -27,23 +19,24 @@ html(lang="en")
meta(property="og:type" content="website") meta(property="og:type" content="website")
meta(property="og:site_name" content=sitename) meta(property="og:site_name" content=sitename)
meta(property="og:url" content="#{SITE_URL}/#{current.path.join('/')}") meta(property="og:url" content=PAGE_URL)
meta(property="og:title" content=social_title) meta(property="og:title" content=PAGE_TITLE)
meta(property="og:description" content=description) meta(property="og:description" content=description)
meta(property="og:image" content=social_img) meta(property="og:image" content=PAGE_IMAGE)
meta(name="twitter:card" content="summary_large_image") meta(name="twitter:card" content="summary_large_image")
meta(name="twitter:site" content="@" + SOCIAL.twitter) meta(name="twitter:site" content="@" + SOCIAL.twitter)
meta(name="twitter:title" content=social_title) meta(name="twitter:title" content=PAGE_TITLE)
meta(name="twitter:description" content=description) meta(name="twitter:description" content=description)
meta(name="twitter:image" content=social_img) meta(name="twitter:image" content=PAGE_IMAGE)
link(rel="shortcut icon" href="/assets/img/favicon.ico") link(rel="shortcut icon" href="/assets/img/favicon.ico")
link(rel="icon" type="image/x-icon" href="/assets/img/favicon.ico") link(rel="icon" type="image/x-icon" href="/assets/img/favicon.ico")
if SECTION == "api" if SECTION == "api"
link(href="/assets/css/style_green.css?v#{V_CSS}" rel="stylesheet") link(href="/assets/css/style_green.css?v#{V_CSS}" rel="stylesheet")
else if SECTION == "universe"
link(href="/assets/css/style_purple.css?v#{V_CSS}" rel="stylesheet")
else else
link(href="/assets/css/style.css?v#{V_CSS}" rel="stylesheet") link(href="/assets/css/style.css?v#{V_CSS}" rel="stylesheet")
@ -54,6 +47,9 @@ html(lang="en")
if !landing if !landing
include _includes/_page-docs include _includes/_page-docs
else if SECTION == "universe"
!=yield
else else
main!=yield main!=yield
include _includes/_footer include _includes/_footer

View File

@ -29,7 +29,7 @@ p
+ud-row("NUM", "numeral", "1, 2017, one, seventy-seven, IV, MMXIV") +ud-row("NUM", "numeral", "1, 2017, one, seventy-seven, IV, MMXIV")
+ud-row("PART", "particle", "'s, not, ") +ud-row("PART", "particle", "'s, not, ")
+ud-row("PRON", "pronoun", "I, you, he, she, myself, themselves, somebody") +ud-row("PRON", "pronoun", "I, you, he, she, myself, themselves, somebody")
+ud-row("PROPN", "proper noun", "Mary, John, Londin, NATO, HBO") +ud-row("PROPN", "proper noun", "Mary, John, London, NATO, HBO")
+ud-row("PUNCT", "punctuation", "., (, ), ?") +ud-row("PUNCT", "punctuation", "., (, ), ?")
+ud-row("SCONJ", "subordinating conjunction", "if, while, that") +ud-row("SCONJ", "subordinating conjunction", "if, while, that")
+ud-row("SYM", "symbol", "$, %, §, ©, +, −, ×, ÷, =, :), 😝") +ud-row("SYM", "symbol", "$, %, §, ©, +, −, ×, ÷, =, :), 😝")

View File

@ -1,5 +1,13 @@
//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE //- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE
p
| spaCy's statistical models have been custom-designed to give a
| high-performance mix of speed and accuracy. The current architecture
| hasn't been published yet, but in the meantime we prepared a video that
| explains how the models work, with particular focus on NER.
+youtube("sqDHBH9IjRU")
p p
| The parsing model is a blend of recent results. The two recent | The parsing model is a blend of recent results. The two recent
| inspirations have been the work of Eli Kiperwasser and Yoav Goldberg at | inspirations have been the work of Eli Kiperwasser and Yoav Goldberg at
@ -44,7 +52,7 @@ p
+cell First two words of the buffer. +cell First two words of the buffer.
+row +row
+cell.u-nowrap +cell
| #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1], | #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
| #[code B1L1]#[br] | #[code B1L1]#[br]
| #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2], | #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
@ -54,7 +62,7 @@ p
| #[code S2], #[code B0] and #[code B1]. | #[code S2], #[code B0] and #[code B1].
+row +row
+cell.u-nowrap +cell
| #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1], | #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
| #[code B1R1]#[br] | #[code B1R1]#[br]
| #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2], | #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],

View File

@ -6,8 +6,7 @@ p
| but somewhat ugly in Python. Logic that deals with Python or platform | but somewhat ugly in Python. Logic that deals with Python or platform
| compatibility only lives in #[code spacy.compat]. To distinguish them from | compatibility only lives in #[code spacy.compat]. To distinguish them from
| the builtin functions, replacement functions are suffixed with an | the builtin functions, replacement functions are suffixed with an
| undersocre, e.e #[code unicode_]. For specific checks, spaCy uses the | undersocre, e.e #[code unicode_].
| #[code six] and #[code ftfy] packages.
+aside-code("Example"). +aside-code("Example").
from spacy.compat import unicode_, json_dumps from spacy.compat import unicode_, json_dumps
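
A tiny usage sketch of the two helpers imported in the example above; the data dict is illustrative, and `unicode_` simply aliases `str` on Python 3 and `unicode` on Python 2:

    from spacy.compat import unicode_, json_dumps

    data = {"text": unicode_("café")}
    print(json_dumps(data))  # consistent JSON serialization on Python 2 and 3
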

View File

@ -533,8 +533,10 @@ p
+cell option +cell option
+cell +cell
| Optional location of vectors file. Should be a tab-separated | Optional location of vectors file. Should be a tab-separated
| file where the first column contains the word and the remaining | file in Word2Vec format where the first column contains the word
| columns the values. | and the remaining columns the values. File can be provided in
| #[code .txt] format or as a zipped text file in #[code .zip] or
| #[code .tar.gz] format.
+row +row
+cell #[code --prune-vectors], #[code -V] +cell #[code --prune-vectors], #[code -V]
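
The vectors file described above is plain text: one word per line, followed by its values. A small sketch that writes such a toy file; the words and numbers are made up, and whether a Word2Vec-style header line is required is not shown in this diff:

    rows = [
        ("apple",  [0.12, -0.34, 0.56]),
        ("orange", [0.11, -0.30, 0.49]),
    ]
    with open("vectors.txt", "w", encoding="utf8") as f:
        for word, values in rows:
            f.write(word + "\t" + "\t".join(str(v) for v in values) + "\n")
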

View File

@ -31,6 +31,7 @@
$grid-gutter: 2rem $grid-gutter: 2rem
margin-top: $grid-gutter margin-top: $grid-gutter
min-width: 0 // hack to prevent overflow
@include breakpoint(min, lg) @include breakpoint(min, lg)
display: flex display: flex

View File

@ -60,6 +60,13 @@
padding-bottom: 4rem padding-bottom: 4rem
border-bottom: 1px dotted $color-subtle border-bottom: 1px dotted $color-subtle
&.o-section--small
overflow: auto
&:not(:last-child)
margin-bottom: 3.5rem
padding-bottom: 2rem
.o-block .o-block
margin-bottom: 4rem margin-bottom: 4rem
@ -142,6 +149,14 @@
.o-badge .o-badge
border-radius: 1em border-radius: 1em
.o-thumb
@include size(100px)
overflow: hidden
border-radius: 50%
&.o-thumb--small
@include size(35px)
//- SVG //- SVG

View File

@ -103,6 +103,9 @@
&:hover &:hover
color: $color-theme-dark color: $color-theme-dark
.u-hand
cursor: pointer
.u-hide-link.u-hide-link .u-hide-link.u-hide-link
border: none border: none
color: inherit color: inherit
@ -224,6 +227,7 @@
$spinner-size: 75px $spinner-size: 75px
$spinner-bar: 8px $spinner-bar: 8px
min-height: $spinner-size * 2
position: relative position: relative
& > * & > *
@ -245,10 +249,19 @@
//- Hidden elements //- Hidden elements
.u-hidden .u-hidden,
display: none [v-cloak]
display: none !important
@each $breakpoint in (xs, sm, md) @each $breakpoint in (xs, sm, md)
.u-hidden-#{$breakpoint}.u-hidden-#{$breakpoint} .u-hidden-#{$breakpoint}.u-hidden-#{$breakpoint}
@include breakpoint(max, $breakpoint) @include breakpoint(max, $breakpoint)
display: none display: none
//- Transitions
.u-fade-enter-active
transition: opacity 0.5s
.u-fade-enter
opacity: 0

View File

@ -2,7 +2,8 @@
//- Code block //- Code block
.c-code-block .c-code-block,
.thebelab-cell
background: $color-front background: $color-front
color: darken($color-back, 20) color: darken($color-back, 20)
padding: 0.75em 0 padding: 0.75em 0
@ -13,11 +14,11 @@
white-space: pre white-space: pre
direction: ltr direction: ltr
&.c-code-block--has-icon .c-code-block--has-icon
padding: 0 padding: 0
display: flex display: flex
border-top-left-radius: 0 border-top-left-radius: 0
border-bottom-left-radius: 0 border-bottom-left-radius: 0
.c-code-block__icon .c-code-block__icon
padding: 0 0 0 1rem padding: 0 0 0 1rem
@ -28,26 +29,66 @@
&.c-code-block__icon--border &.c-code-block__icon--border
border-left: 6px solid border-left: 6px solid
//- Code block content //- Code block content
.c-code-block__content .c-code-block__content,
.thebelab-input,
.jp-OutputArea
display: block display: block
font: normal normal 1.1rem/#{1.9} $font-code font: normal normal 1.1rem/#{1.9} $font-code
padding: 1em 2em padding: 1em 2em
&[data-prompt]:before, .c-code-block__content[data-prompt]:before,
content: attr(data-prompt) content: attr(data-prompt)
margin-right: 0.65em margin-right: 0.65em
display: inline-block display: inline-block
vertical-align: middle vertical-align: middle
opacity: 0.5 opacity: 0.5
//- Thebelab
[data-executable]
margin-bottom: 0
.thebelab-input.thebelab-input
padding: 3em 2em 1em
.jp-OutputArea
&:not(:empty)
padding: 2rem 2rem 1rem
border-top: 1px solid $color-dark
margin-top: 2rem
.entities, svg
white-space: initial
font-family: inherit
.entities
font-size: 1.35rem
.jp-OutputArea pre
font: inherit
.jp-OutputPrompt.jp-OutputArea-prompt
padding-top: 0.5em
margin-right: 1rem
font-family: inherit
font-weight: bold
.thebelab-run-button
@extend .u-text-label, .u-text-label--dark
.thebelab-wrapper
position: relative
.thebelab-wrapper__text
@include position(absolute, top, right, 1.25rem, 1.25rem)
color: $color-subtle-dark
z-index: 10
//- Code //- Code
code code, .CodeMirror, .jp-RenderedText, .jp-OutputArea
-webkit-font-smoothing: subpixel-antialiased -webkit-font-smoothing: subpixel-antialiased
-moz-osx-font-smoothing: auto -moz-osx-font-smoothing: auto
@ -73,7 +114,7 @@ code
text-shadow: none text-shadow: none
//- Syntax Highlighting //- Syntax Highlighting (Prism)
[class*="language-"] .token [class*="language-"] .token
&.comment, &.prolog, &.doctype, &.cdata, &.punctuation &.comment, &.prolog, &.doctype, &.cdata, &.punctuation
@ -103,3 +144,50 @@ code
&.italic &.italic
font-style: italic font-style: italic
//- Syntax Highlighting (CodeMirror)
.CodeMirror.cm-s-default
background: $color-front
color: darken($color-back, 20)
.CodeMirror-selected
background: $color-theme
color: $color-back
.CodeMirror-cursor
border-left-color: currentColor
.cm-variable-2
color: inherit
font-style: italic
.cm-comment
color: map-get($syntax-highlighting, comment)
.cm-keyword, .cm-builtin
color: map-get($syntax-highlighting, keyword)
.cm-operator
color: map-get($syntax-highlighting, operator)
.cm-string
color: map-get($syntax-highlighting, selector)
.cm-number
color: map-get($syntax-highlighting, number)
.cm-def
color: map-get($syntax-highlighting, function)
//- Syntax highlighting (Jupyter)
.jp-RenderedText pre
.ansi-cyan-fg
color: map-get($syntax-highlighting, function)
.ansi-green-fg
color: $color-green
.ansi-red-fg
color: map-get($syntax-highlighting, operator)

View File

@ -8,10 +8,20 @@
width: 100% width: 100%
position: relative position: relative
&.x-terminal--small
background: $color-dark
color: $color-subtle
border-radius: 4px
margin-bottom: 4rem
.x-terminal__icons .x-terminal__icons
display: none
position: absolute position: absolute
padding: 10px padding: 10px
@include breakpoint(min, sm)
display: block
&:before, &:before,
&:after, &:after,
span span
@ -32,6 +42,12 @@
content: "" content: ""
background: $color-yellow background: $color-yellow
&.x-terminal__icons--small
&:before,
&:after,
span
@include size(10px)
.x-terminal__code .x-terminal__code
margin: 0 margin: 0
border: none border: none

View File

@ -9,7 +9,7 @@
display: flex display: flex
justify-content: space-between justify-content: space-between
flex-flow: row nowrap flex-flow: row nowrap
padding: 0 2rem 0 1rem padding: 0 0 0 1rem
z-index: 30 z-index: 30
width: 100% width: 100%
box-shadow: $box-shadow box-shadow: $box-shadow
@ -21,11 +21,20 @@
.c-nav__menu .c-nav__menu
@include size(100%) @include size(100%)
display: flex display: flex
justify-content: flex-end
flex-flow: row nowrap flex-flow: row nowrap
border-color: inherit border-color: inherit
flex: 1 flex: 1
@include breakpoint(max, sm)
@include scroll-shadow-base($color-front)
overflow-x: auto
overflow-y: hidden
-webkit-overflow-scrolling: touch
@include breakpoint(min, md)
justify-content: flex-end
.c-nav__menu__item .c-nav__menu__item
display: flex display: flex
align-items: center align-items: center
@ -39,6 +48,14 @@
&:not(:first-child) &:not(:first-child)
margin-left: 2em margin-left: 2em
&:last-child
@include scroll-shadow-cover(right, $color-back)
padding-right: 2rem
&:first-child
@include scroll-shadow-cover(left, $color-back)
padding-left: 2rem
&.is-active &.is-active
color: $color-dark color: $color-dark
pointer-events: none pointer-events: none

View File

@ -26,7 +26,7 @@ $font-code: Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace
// Colors // Colors
$colors: ( blue: #09a3d5, green: #05b083 ) $colors: ( blue: #09a3d5, green: #05b083, purple: #6542d1 )
$color-back: #fff !default $color-back: #fff !default
$color-front: #1a1e23 !default $color-front: #1a1e23 !default

View File

@ -0,0 +1,4 @@
//- 💫 STYLESHEET (PURPLE)
$theme: purple
@import style

Binary file not shown. (After: 204 KiB)

Binary file not shown. (After: 8.5 KiB)

Binary file not shown. (After: 144 KiB)

Binary file not shown. (After: 118 KiB)

Some files were not shown because too many files have changed in this diff.