Merge branch 'develop' into feature/refactor-parser

2026-03-05 04:11:26 +03:00 · 2018-05-15 18:39:21 +02:00 · 2018-05-15 18:39:21 +02:00 · dc1a479fbd
commit dc1a479fbd
parent 13faf4e1ea 546dd99cdf
67 changed files with 316912 additions and 102 deletions
--- a/.github/ISSUE_TEMPLATE/01_bugs.md
+++ b/.github/ISSUE_TEMPLATE/01_bugs.md
@@ -0,0 +1,15 @@
+---
+name: "\U0001F6A8 Bug Report"
+about: Did you come across a bug or unexpected behaviour differing from the docs?
+
+---
+
+## How to reproduce the behaviour
+<!-- Include a code example or the steps that led to the problem. Please try to be as specific as possible. -->
+
+## Your Environment
+<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->
+* Operating System:
+* Python Version Used:
+* spaCy Version Used:
+* Environment Information:
--- a/.github/ISSUE_TEMPLATE/02_install.md
+++ b/.github/ISSUE_TEMPLATE/02_install.md
@@ -0,0 +1,21 @@
+---
+name: "\U000023F3 Installation Problem"
+about: Do you have problems installing spaCy, and none of the suggestions in the docs
+  and other issues helped?
+
+---
+<!-- Before submitting an issue, make sure to check the docs and closed issues to see if any of the solutions work for you. Installation problems can often be related to Python environment issues and problems with compilation. -->
+
+## How to reproduce the problem
+<!-- Include the details of how the problem occurred. Which command did you run to install spaCy? Did you come across an error? What else did you try? -->
+
+```bash
+# copy-paste the error message here
+```
+
+## Your Environment
+<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->
+* Operating System:
+* Python Version Used:
+* spaCy Version Used:
+* Environment Information:
--- a/.github/ISSUE_TEMPLATE/03_request.md
+++ b/.github/ISSUE_TEMPLATE/03_request.md
@@ -0,0 +1,11 @@
+---
+name: "\U0001F381 Feature Request"
+about: Do you have an idea for an improvement, a new feature or a plugin?
+
+---
+
+## Feature description
+<!-- Please describe the feature: Which area of the library is it related to? What specific solution would you like? -->
+
+## Could the feature be a [custom component](https://spacy.io/usage/processing-pipelines#custom-components) or [spaCy plugin](https://spacy.io/universe)?
+If so, we will tag it as [`project idea`](https://github.com/explosion/spaCy/labels/project%20idea) so other users can take it on.
--- a/.github/ISSUE_TEMPLATE/04_docs.md
+++ b/.github/ISSUE_TEMPLATE/04_docs.md
@@ -0,0 +1,10 @@
+---
+name: "\U0001F4DA Documentation"
+about: Did you spot a mistake in the docs, is anything unclear or do you have a
+  suggestion?
+
+---
+<!-- Describe the problem or suggestion here. If you've found a mistake and you know the answer, feel free to submit a pull request straight away: https://github.com/explosion/spaCy/pulls -->
+
+## Which page or section is this issue related to?
+<!-- Please include the URL and/or source. -->
--- a/.github/ISSUE_TEMPLATE/05_other.md
+++ b/.github/ISSUE_TEMPLATE/05_other.md
@@ -0,0 +1,15 @@
+---
+name: "\U0001F4AC Anything else?"
+about: For general usage questions or help with your code, please consider
+  posting on StackOverflow instead.
+
+---
+
+<!-- Describe your issue here. Please keep in mind that the GitHub issue tracker is mostly intended for reports related to the spaCy code base and source, and for bugs and feature requests. If you're looking for help with your code, consider posting a question on StackOverflow instead: http://stackoverflow.com/questions/tagged/spacy -->
+
+## Your Environment
+<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->
+* Operating System:
+* Python Version Used:
+* spaCy Version Used:
+* Environment Information:
--- a/.github/contributors/LRAbbade.md
+++ b/.github/contributors/LRAbbade.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Lucas Riêra Abbade |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2018-05-08 |
+| GitHub username                | LRAbbade |
+| Website (optional)             |                      |
--- a/.github/contributors/alexvy86.md
+++ b/.github/contributors/alexvy86.md
@ -0,0 +1,87 @@
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Alejandro Villarreal |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2018-05-01           |
+| GitHub username                | alexvy86             |
+| Website (optional)             |                      |
--- a/.github/contributors/bellabie
+++ b/.github/contributors/bellabie
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           |       B Cavello      |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           |      2018-05-06      |
+| GitHub username                |       bellabie       |
+| Website (optional)             |     bcavello.com     |
--- a/.github/contributors/janimo.md
+++ b/.github/contributors/janimo.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [ ] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Jani Monoses         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 5/10/2018            |
+| GitHub username                | janimo               |
+| Website (optional)             |                      |
--- a/.github/contributors/knoxdw.md
+++ b/.github/contributors/knoxdw.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Douglas Knox         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2018-04-27           |
+| GitHub username                | knoxdw               |
+| Website (optional)             |                      |
--- a/.github/contributors/mauryaland.md
+++ b/.github/contributors/mauryaland.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Amaury Fouret        |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 05/08/2018           |
+| GitHub username                | mauryaland           |
+| Website (optional)             |                      |
--- a/.github/contributors/mn3mos.md
+++ b/.github/contributors/mn3mos.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Gaëtan PRUVOST       |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 13/04/2018           |
+| GitHub username                | mn3mos               |
+| Website (optional)             |                      |
--- a/.github/contributors/tzano.md
+++ b/.github/contributors/tzano.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your 
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Tahar Zanouda        |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 09-05-2018           |
+| GitHub username                | tzano                |
+| Website (optional)             |                      |
--- a/.github/contributors/vishnumenon.md
+++ b/.github/contributors/vishnumenon.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Vishnu Menon         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 12 May 2018          |
+| GitHub username                | vishnumenon          |
+| Website (optional)             |                      |
--- a/.github/lock.yml
+++ b/.github/lock.yml
@ -0,0 +1,19 @@
+# Configuration for lock-threads - https://github.com/dessant/lock-threads
+
+# Number of days of inactivity before a closed issue or pull request is locked
+daysUntilLock: 30
+
+# Issues and pull requests with these labels will not be locked. Set to `[]` to disable
+exemptLabels: []
+
+# Label to add before locking, such as `outdated`. Set to `false` to disable
+lockLabel: false
+
+# Comment to post before locking. Set to `false` to disable
+lockComment: >
+  This thread has been automatically locked since there has not been
+  any recent activity after it was closed. Please open a new issue for
+  related bugs.
+
+# Limit to only `issues` or `pulls`
+only: issues
--- a/.github/no-response.yml
+++ b/.github/no-response.yml
@ -0,0 +1,13 @@
+# Configuration for probot-no-response - https://github.com/probot/no-response
+
+# Number of days of inactivity before an Issue is closed for lack of response
+daysUntilClose: 14
+# Label requiring a response
+responseRequiredLabel: more-info-needed
+# Comment to post when closing an Issue for lack of response. Set to `false` to disable
+closeComment: >
+  This issue has been automatically closed because there has been no response
+  to a request for more information from the original author. With only the
+  information that is currently in the issue, there's not enough information
+  to take action. If you're the original author, feel free to reopen the issue
+  if you have or find the answers needed to investigate further.
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -73,7 +73,7 @@ so it only becomes visible on click, making the issue easier to read and follow.
 ### Issue labels

 To distinguish issues that are opened by us, the maintainers, we usually add a
-💫 to the title. [See this page](https://github.com/explosion/spaCy/labels) 
+💫 to the title. [See this page](https://github.com/explosion/spaCy/labels)
 for an overview of the system we use to tag our issues and pull requests.

 ## Contributing to the code base
--- a/README.rst
+++ b/README.rst
@ -199,6 +199,11 @@ or manually by pointing pip to a path or URL.
    # pip install .tar.gz archive from path or URL
    pip install /Users/you/en_core_web_sm-2.0.0.tar.gz

+If you run into SSL certificate problems, the SSL customization options are described in the help::
+
+    # help for the download command
+    python -m spacy download --help
+
 Loading and using models
 ------------------------

--- a/examples/pipeline/custom_component_countries_api.py
+++ b/examples/pipeline/custom_component_countries_api.py
@ -68,9 +68,9 @@ class RESTCountriesComponent(object):
        # the matches, so we're only setting a default value, not a getter.
        # If no default value is set, it defaults to None.
        Token.set_extension('is_country', default=False)
-        Token.set_extension('country_capital')
-        Token.set_extension('country_latlng')
-        Token.set_extension('country_flag')
+        Token.set_extension('country_capital', default=False)
+        Token.set_extension('country_latlng', default=False)
+        Token.set_extension('country_flag', default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_country == True.
--- a/spacy/cli/download.py
+++ b/spacy/cli/download.py
@ -17,19 +17,39 @@ from .. import about
@plac.annotations(
    model=("model to download (shortcut or name)", "positional", None, str),
    direct=("force direct download. Needs model name with version and won't "
-            "perform compatibility check", "flag", "d", bool))
-def download(model, direct=False):
+            "perform compatibility check", "flag", "d", bool),
+    insecure=("insecure mode - disables the verification of certificates",
+              "flag", "i", bool),
+    ca_file=("specify a certificate authority file to use for certificate "
+             "validation. Ignored if --insecure is used", "option", "c"))
+def download(model, direct=False, insecure=False, ca_file=None):
    """
    Download compatible model from default download path using pip. Model
    can be shortcut, model name or, if --direct flag is set, full model name
    with version.
+    The optional --insecure flag can be used to disable SSL verification.
+    The --ca-file option can be used to provide a local CA file
+    used for certificate verification.
    """
+
+    # ssl_verify is the argument handed to the 'verify' parameter
+    # of the requests package. It must be either None, a boolean,
+    # or a string containing the path to a CA file
+    ssl_verify = None
+    if insecure:
+        ca_file = None
+        ssl_verify = False
+    else:
+        if ca_file is not None:
+            ssl_verify = ca_file
+
+    # Download the model
    if direct:
        dl = download_model('{m}/{m}.tar.gz'.format(m=model))
    else:
-        shortcuts = get_json(about.__shortcuts__, "available shortcuts")
+        shortcuts = get_json(about.__shortcuts__, "available shortcuts", ssl_verify)
        model_name = shortcuts.get(model, model)
-        compatibility = get_compatibility()
+        compatibility = get_compatibility(ssl_verify)
        version = get_version(model_name, compatibility)
        dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
                                                            v=version))
@ -41,8 +61,7 @@ def download(model, direct=False):
            # package, which fails if model was just installed via
            # subprocess
            package_path = get_package_path(model_name)
-            link(model_name, model, force=True,
-                    model_path=package_path)
+            link(model_name, model, force=True, model_path=package_path)
        except:
            # Dirty, but since spacy.download and the auto-linking is
            # mostly a convenience wrapper, it's best to show a success
@ -50,19 +69,19 @@ def download(model, direct=False):
            prints(Messages.M001.format(name=model_name), title=Messages.M002)


-def get_json(url, desc):
+def get_json(url, desc, ssl_verify):
    try:
-        data = url_read(url)
+        data = url_read(url, verify=ssl_verify)
    except HTTPError as e:
        prints(Messages.M004.format(desc, about.__version__),
               title=Messages.M003.format(e.code, e.reason), exits=1)
    return ujson.loads(data)


-def get_compatibility():
+def get_compatibility(ssl_verify):
    version = about.__version__
    version = version.rsplit('.dev', 1)[0]
-    comp_table = get_json(about.__compatibility__, "compatibility table")
+    comp_table = get_json(about.__compatibility__, "compatibility table", ssl_verify)
    comp = comp_table['spacy']
    if version not in comp:
        prints(Messages.M006.format(version=version), title=Messages.M005,
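
The SSL handling added to `download` above reduces to one small decision. A minimal sketch, assuming a hypothetical helper name `resolve_ssl_verify` (it is not part of this commit), of how the two new options collapse into the single value that is passed on as `verify`:

```python
def resolve_ssl_verify(insecure=False, ca_file=None):
    """Return the value to hand to the `verify` parameter.

    Hypothetical helper mirroring the branch logic in the diff:
    --insecure wins over --ca-file, and None means "use the default".
    """
    if insecure:
        # Certificate verification is switched off entirely.
        return False
    # Either a path to a CA file, or None if nothing was specified.
    return ca_file


print(resolve_ssl_verify())                                  # None
print(resolve_ssl_verify(ca_file="/path/to/ca.pem"))         # /path/to/ca.pem
print(resolve_ssl_verify(insecure=True, ca_file="/x.pem"))   # False
```

The `--insecure` flag takes precedence here, which matches the help text stating that `--ca-file` is ignored when `--insecure` is used.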
--- a/spacy/cli/ud_train.py
+++ b/spacy/cli/ud_train.py
@ -124,13 +124,16 @@ def read_conllu(file_):
    return docs


-def _make_gold(nlp, text, sent_annots):
+def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
    # Flatten the conll annotations, and adjust the head indices
    flat = defaultdict(list)
+    sent_starts = []
    for sent in sent_annots:
        flat['heads'].extend(len(flat['words'])+head for head in sent['heads'])
        for field in ['words', 'tags', 'deps', 'entities', 'spaces']:
            flat[field].extend(sent[field])
+        sent_starts.append(True)
+        sent_starts.extend([False] * (len(sent['words'])-1))
    # Construct text if necessary
    assert len(flat['words']) == len(flat['spaces'])
    if text is None:
@ -138,6 +141,12 @@ def _make_gold(nlp, text, sent_annots):
    doc = nlp.make_doc(text)
    flat.pop('spaces')
    gold = GoldParse(doc, **flat)
+    gold.sent_starts = sent_starts
+    for i in range(len(gold.heads)):
+        if random.random() < drop_deps:
+            gold.heads[i] = None
+            gold.labels[i] = None
+
    return doc, gold

 #############################
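
The sentence-start bookkeeping added to `_make_gold` can be looked at in isolation. The sketch below assumes a hypothetical standalone function `flatten_sent_starts` and plain dicts with a `'words'` key in place of the real CoNLL-U annotations; it only illustrates the flattening step, not the `GoldParse` construction:

```python
def flatten_sent_starts(sent_annots):
    """Build one flag per token: True for the first token of each sentence."""
    sent_starts = []
    for sent in sent_annots:
        # The first token of every sentence opens that sentence ...
        sent_starts.append(True)
        # ... and all remaining tokens in it do not.
        sent_starts.extend([False] * (len(sent['words']) - 1))
    return sent_starts


annots = [{'words': ['I', 'like', 'tea']}, {'words': ['Me', 'too']}]
print(flatten_sent_starts(annots))  # [True, False, False, True, False]
```

As in the diff, a sentence with no words would still contribute a single `True`, so this sketch assumes every sentence has at least one token.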
--- a/spacy/gold.pyx
+++ b/spacy/gold.pyx
@ -545,9 +545,20 @@ cdef class GoldParse:
        """
        return not nonproj.is_nonproj_tree(self.heads)

-    @property
-    def sent_starts(self):
-        return [self.c.sent_start[i] for i in range(self.length)]
+    property sent_starts:
+        def __get__(self):
+            return [self.c.sent_start[i] for i in range(self.length)]
+
+        def __set__(self, sent_starts):
+            for gold_i, is_sent_start in enumerate(sent_starts):
+                i = self.gold_to_cand[gold_i]
+                if i is not None:
+                    if is_sent_start in (1, True):
+                        self.c.sent_start[i] = 1
+                    elif is_sent_start in (-1, False):
+                        self.c.sent_start[i] = -1
+                    else:
+                        self.c.sent_start[i] = 0


 def biluo_tags_from_offsets(doc, entities, missing='O'):
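
The value mapping performed by the new `sent_starts` setter can be sketched in plain Python. The function name `normalize_sent_start` is hypothetical and the alignment through `gold_to_cand` is left out; only the `if`/`elif`/`else` chain from the diff is reproduced:

```python
def normalize_sent_start(is_sent_start):
    """Map a user-supplied flag to the integer stored per token."""
    if is_sent_start in (1, True):
        return 1     # token starts a sentence
    elif is_sent_start in (-1, False):
        return -1    # token does not start a sentence
    else:
        return 0     # unknown / missing annotation


print(normalize_sent_start(True))   # 1
print(normalize_sent_start(False))  # -1
print(normalize_sent_start(None))   # 0
print(normalize_sent_start(0))      # -1
```

Because `0 == False` in Python, an integer `0` matches the second branch and is stored as `-1`. Under this logic `None` appears to be the way to mark a token as unannotated.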
--- a/spacy/lang/ar/__init__.py
+++ b/spacy/lang/ar/__init__.py
@ -0,0 +1,31 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
+from .punctuation import TOKENIZER_SUFFIXES
+
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ..norm_exceptions import BASE_NORMS
+from ...language import Language
+from ...attrs import LANG, NORM
+from ...util import update_exc, add_lookups
+
+
+class ArabicDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
+    lex_attr_getters[LANG] = lambda text: 'ar'
+    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    stop_words = STOP_WORDS
+    suffixes = TOKENIZER_SUFFIXES
+
+
+class Arabic(Language):
+    lang = 'ar'
+    Defaults = ArabicDefaults
+
+
+__all__ = ['Arabic']
--- a/spacy/lang/ar/examples.py
+++ b/spacy/lang/ar/examples.py
@ -0,0 +1,20 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.ar.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+sentences = [
+    "نال الكاتب خالد توفيق  جائزة الرواية العربية في معرض الشارقة الدولي للكتاب",
+    "أين تقع دمشق ؟",
+    "كيف حالك ؟",
+    "هل يمكن ان نلتقي على الساعة الثانية عشرة ظهرا ؟",
+    "ماهي أبرز التطورات السياسية، الأمنية والاجتماعية في العالم ؟",
+    "هل بالإمكان أن نلتقي غدا؟",
+    "هناك نحو 382 مليون شخص مصاب بداء السكَّري في العالم",
+    "كشفت دراسة حديثة أن الخيل تقرأ تعبيرات الوجه وتستطيع أن تتذكر مشاعر الناس وعواطفهم"
+]
--- a/spacy/lang/ar/lex_attrs.py
+++ b/spacy/lang/ar/lex_attrs.py
@ -0,0 +1,95 @@
+# coding: utf8
+from __future__ import unicode_literals
+from ...attrs import LIKE_NUM
+
+_num_words = set("""
+صفر
+واحد
+إثنان
+اثنان
+ثلاثة
+ثلاثه
+أربعة
+أربعه
+خمسة
+خمسه
+ستة
+سته
+سبعة
+سبعه
+ثمانية
+ثمانيه
+تسعة
+تسعه
+عشرة
+عشره
+عشرون
+عشرين
+ثلاثون
+ثلاثين
+اربعون
+اربعين
+أربعون
+أربعين
+خمسون
+خمسين
+ستون
+ستين
+سبعون
+سبعين
+ثمانون
+ثمانين
+تسعون
+تسعين
+مائتين
+مائتان
+ثلاثمائة
+خمسمائة
+سبعمائة
+الف
+آلاف
+ملايين
+مليون
+مليار
+مليارات
+""".split())
+
+_ordinal_words = set("""
+اول
+أول
+حاد
+واحد
+ثان
+ثاني
+ثالث
+رابع
+خامس
+سادس
+سابع
+ثامن
+تاسع
+عاشر
+""".split())
+
+
+def like_num(text):
+    """
+    Check whether the text resembles a number.
+    """
+    text = text.replace(',', '').replace('.', '')
+    if text.isdigit():
+        return True
+    if text.count('/') == 1:
+        num, denom = text.split('/')
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text in _num_words:
+        return True
+    if text in _ordinal_words:
+        return True
+    return False
+
+
+LEX_ATTRS = {
+    LIKE_NUM: like_num
+}
--- a/spacy/lang/ar/punctuation.py
+++ b/spacy/lang/ar/punctuation.py
@ -0,0 +1,15 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ..punctuation import TOKENIZER_INFIXES
+from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
+from ..char_classes import QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
+
+_suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES +
+             [r'(?<=[0-9])\+',
+              # Arabic is written from Right-To-Left
+              r'(?<=[0-9])(?:{})'.format(CURRENCY),
+              r'(?<=[0-9])(?:{})'.format(UNITS),
+              r'(?<=[{au}][{au}])\.'.format(au=ALPHA_UPPER)])
+
+TOKENIZER_SUFFIXES = _suffixes
--- a/spacy/lang/ar/stop_words.py
+++ b/spacy/lang/ar/stop_words.py
@ -0,0 +1,229 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+STOP_WORDS = set("""
+من
+نحو
+لعل
+بما
+بين
+وبين
+ايضا
+وبينما
+تحت
+مثلا
+لدي
+عنه
+مع
+هي
+وهذا
+واذا
+هذان
+انه
+بينما
+أمسى
+وسوف
+ولم
+لذلك
+إلى
+منه
+منها
+كما
+ظل
+هنا
+به
+كذلك
+اما
+هما
+بعد
+بينهم
+التي
+أبو
+اذا
+بدلا
+لها
+أمام
+يلي
+حين
+ضد
+الذي
+قد
+صار
+إذا
+مابرح
+قبل
+كل
+وليست
+الذين
+لهذا
+وثي
+انهم
+باللتي
+مافتئ
+ولا
+بهذه
+بحيث
+كيف
+وله
+علي
+بات
+لاسيما
+حتى
+وقد
+و
+أما
+فيها
+بهذا
+لذا
+حيث
+لقد
+إن
+فإن
+اول
+ليت
+فاللتي
+ولقد
+لسوف
+هذه
+ولماذا
+معه
+الحالي
+بإن
+حول
+في
+عليه
+مايزال
+ولعل
+أنه
+أضحى
+اي
+ستكون
+لن
+أن
+ضمن
+وعلى
+امسى
+الي
+ذات
+ولايزال
+ذلك
+فقد
+هم
+أي
+عند
+ابن
+أو
+فهو
+فانه
+سوف
+ما
+آل
+كلا
+عنها
+وكذلك
+ليست
+لم
+وأن
+ماذا
+لو
+وهل
+اللتي
+ولذا
+يمكن
+فيه
+الا
+عليها
+وبينهم
+يوم
+وبما
+لما
+فكان
+اضحى
+اصبح
+لهم
+بها
+او
+الذى
+الى
+إلي
+قال
+والتي
+لازال
+أصبح
+ولهذا
+مثل
+وكانت
+لكنه
+بذلك
+هذا
+لماذا
+قالت
+فقط
+لكن
+مما
+وكل
+وان
+وأبو
+ومن
+كان
+مازال
+هل
+بينهن
+هو
+وما
+على
+وهو
+لأن
+واللتي
+والذي
+دون
+عن
+وايضا
+هناك
+بلا
+جدا
+ثم
+منذ
+اللذين
+لايزال
+بعض
+مساء
+تكون
+فلا
+بيننا
+لا
+ولكن
+إذ
+وأثناء
+ليس
+ومع
+فيهم
+ولسوف
+بل
+تلك
+أحد
+وهي
+وكان
+ومنها
+وفي
+ماانفك
+اليوم
+وماذا
+هؤلاء
+وليس
+له
+أثناء
+بد
+اليه
+كأن
+اليها
+بتلك
+يكون
+ولما
+هن
+والى
+كانت
+وقبل
+ان
+لدى
+""".split())
--- a/spacy/lang/ar/tokenizer_exceptions.py
+++ b/spacy/lang/ar/tokenizer_exceptions.py
@ -0,0 +1,47 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
+import re
+
+_exc = {}
+
+# time
+for exc_data in [
+    {LEMMA: "قبل الميلاد", ORTH: "ق.م"},
+    {LEMMA: "بعد الميلاد", ORTH: "ب. م"},
+    {LEMMA: "ميلادي", ORTH: ".م"},
+    {LEMMA: "هجري", ORTH: ".هـ"},
+    {LEMMA: "توفي", ORTH: ".ت"}]:
+    _exc[exc_data[ORTH]] = [exc_data]
+
+# scientific abv.
+for exc_data in [
+    {LEMMA: "صلى الله عليه وسلم", ORTH: "صلعم"},
+    {LEMMA: "الشارح", ORTH: "الشـ"},
+    {LEMMA: "الظاهر", ORTH: "الظـ"},
+    {LEMMA: "أيضًا", ORTH: "أيضـ"},
+    {LEMMA: "إلى آخره", ORTH: "إلخ"},
+    {LEMMA: "انتهى", ORTH: "اهـ"},
+    {LEMMA: "حدّثنا", ORTH: "ثنا"},
+    {LEMMA: "حدثني", ORTH: "ثنى"},
+    {LEMMA: "أنبأنا", ORTH: "أنا"},
+    {LEMMA: "أخبرنا", ORTH: "نا"},
+    {LEMMA: "مصدر سابق", ORTH: "م. س"},
+    {LEMMA: "مصدر نفسه", ORTH: "م. ن"}]:
+    _exc[exc_data[ORTH]] = [exc_data]
+
+# other abv.
+for exc_data in [
+    {LEMMA: "دكتور", ORTH: "د."},
+    {LEMMA: "أستاذ دكتور", ORTH: "أ.د"},
+    {LEMMA: "أستاذ", ORTH: "أ."},
+    {LEMMA: "بروفيسور", ORTH: "ب."}]:
+    _exc[exc_data[ORTH]] = [exc_data]
+
+for exc_data in [
+    {LEMMA: "تلفون", ORTH: "ت."},
+    {LEMMA: "صندوق بريد", ORTH: "ص.ب"}]:
+    _exc[exc_data[ORTH]] = [exc_data]
+
+TOKENIZER_EXCEPTIONS = _exc
--- a/spacy/lang/char_classes.py
+++ b/spacy/lang/char_classes.py
@ -3,13 +3,11 @@ from __future__ import unicode_literals

 import regex as re

-
 re.DEFAULT_VERSION = re.VERSION1
 merge_char_classes = lambda classes: '[{}]'.format('||'.join(classes))
 split_chars = lambda char: list(char.strip().split(' '))
 merge_chars = lambda char: char.strip().replace(' ', '|')

-
 _bengali = r'[\p{L}&&\p{Bengali}]'
 _hebrew = r'[\p{L}&&\p{Hebrew}]'
 _latin_lower = r'[\p{Ll}&&\p{Latin}]'
@ -27,11 +25,11 @@ ALPHA = merge_char_classes(_upper + _lower + _uncased)
 ALPHA_LOWER = merge_char_classes(_lower + _uncased)
 ALPHA_UPPER = merge_char_classes(_upper + _uncased)

-
 _units = ('km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft '
          'kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb '
          'TB T G M K % км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм '
-          'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб')
+          'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб '
+          'كم كم² كم³ م م² م³ سم سم² سم³ مم مم² مم³ كم غرام جرام جم كغ ملغ كوب اكواب')
 _currency = r'\$ £ € ¥ ฿ US\$ C\$ A\$ ₽ ﷼'

 # These expressions contain various unicode variations, including characters
@ -45,7 +43,6 @@ _hyphens = '- – — -- --- —— ~'
 # Details: https://www.compart.com/en/unicode/category/So
 _other_symbols = r'[\p{So}]'

-
 UNITS = merge_chars(_units)
 CURRENCY = merge_chars(_currency)
 QUOTES = merge_chars(_quotes)
--- a/spacy/lang/fr/stop_words.py
+++ b/spacy/lang/fr/stop_words.py
@ -11,14 +11,14 @@ avais avait avant avec avoir avons ayant

 bah bas basee bat beau beaucoup bien bigre boum bravo brrr

-ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui
+c' c’ ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui
 celui-ci celui-là cent cependant certain certaine certaines certains certes ces
 cet cette ceux ceux-ci ceux-là chacun chacune chaque cher chers chez chiche
 chut chère chères ci cinq cinquantaine cinquante cinquantième cinquième clac
 clic combien comme comment comparable comparables compris concernant contre
 couic crac

-da dans de debout dedans dehors deja delà depuis dernier derniere derriere
+d' d’ da dans de debout dedans dehors deja delà depuis dernier derniere derriere
 derrière des desormais desquelles desquels dessous dessus deux deuxième
 deuxièmement devant devers devra different differentes differents différent
 différente différentes différents dire directe directement dit dite dits divers
@ -37,16 +37,16 @@ gens
 ha hein hem hep hi ho holà hop hormis hors hou houp hue hui huit huitième hum
 hurrah hé hélas i il ils importe

-je jusqu jusque juste
+j' j’ je jusqu jusque juste

-la laisser laquelle las le lequel les lesquelles lesquels leur leurs longtemps
+l' l’ la laisser laquelle las le lequel les lesquelles lesquels leur leurs longtemps
 lors lorsque lui lui-meme lui-même là lès

-ma maint maintenant mais malgre malgré maximale me meme memes merci mes mien
+m' m’ ma maint maintenant mais malgre malgré maximale me meme memes merci mes mien
 mienne miennes miens mille mince minimale moi moi-meme moi-même moindres moins
 mon moyennant multiple multiples même mêmes

-na naturel naturelle naturelles ne neanmoins necessaire necessairement neuf
+n' n’ na naturel naturelle naturelles ne neanmoins necessaire necessairement neuf
 neuvième ni nombreuses nombreux non nos notamment notre nous nous-mêmes nouveau
 nul néanmoins nôtre nôtres

@ -60,21 +60,21 @@ plusieurs plutôt possessif possessifs possible possibles pouah pour pourquoi
 pourrais pourrait pouvait prealable precisement premier première premièrement
 pres probable probante procedant proche près psitt pu puis puisque pur pure

-qu quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt
+qu' qu’ quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt
 quatrième quatrièmement que quel quelconque quelle quelles quelqu'un quelque
 quelques quels qui quiconque quinze quoi quoique

 rare rarement rares relative relativement remarquable rend rendre restant reste
 restent restrictif retour revoici revoilà rien

-sa sacrebleu sait sans sapristi sauf se sein seize selon semblable semblaient
+s' s’ sa sacrebleu sait sans sapristi sauf se sein seize selon semblable semblaient
 semble semblent sent sept septième sera seraient serait seront ses seul seule
 seulement si sien sienne siennes siens sinon six sixième soi soi-même soit
 soixante son sont sous souvent specifique specifiques speculatif stop
 strictement subtiles suffisant suffisante suffit suis suit suivant suivante
 suivantes suivants suivre superpose sur surtout

-ta tac tant tardive te tel telle tellement telles tels tenant tend tenir tente
+t' t’ ta tac tant tardive te tel telle tellement telles tels tenant tend tenir tente
 tes tic tien tienne tiennes tiens toc toi toi-même ton touchant toujours tous
 tout toute toutefois toutes treize trente tres trois troisième troisièmement
 trop très tsoin tsouin tu té
--- a/spacy/lang/ja/__init__.py
+++ b/spacy/lang/ja/__init__.py
@ -3,23 +3,87 @@ from __future__ import unicode_literals, print_function

 from ...language import Language
 from ...attrs import LANG
-from ...tokens import Doc
+from ...tokens import Doc, Token
 from ...tokenizer import Tokenizer
+from .tag_map import TAG_MAP

+import re
+from collections import namedtuple
+
+ShortUnitWord = namedtuple('ShortUnitWord', ['surface', 'lemma', 'pos'])
+
+# XXX Is this the right place for this?
+Token.set_extension('mecab_tag', default=None)
+
+def try_mecab_import():
+    """Mecab is required for Japanese support, so check for it.
+
+    If it's not available, blow up and explain how to fix it."""
+    try:
+        import MeCab
+        return MeCab
+    except ImportError:
+        raise ImportError("Japanese support requires MeCab: "
+                          "https://github.com/SamuraiT/mecab-python3")
+
+def resolve_pos(token):
+    """If necessary, add a field to the POS tag for UD mapping.
+
+    Under Universal Dependencies, sometimes the same Unidic POS tag can
+    be mapped differently depending on the literal token or its context
+    in the sentence. This function adds information to the POS tag to 
+    resolve ambiguous mappings.
+    """
+
+    # NOTE: This is a first take. The rules here are crude approximations.
+    # For many of these, full dependencies are needed to properly resolve
+    # PoS mappings.
+
+    if token.pos == '連体詞,*,*,*':
+        if re.match('^[こそあど此其彼]の', token.surface):
+            return token.pos + ',DET'
+        if re.match('^[こそあど此其彼]', token.surface):
+            return token.pos + ',PRON'
+        else:
+            return token.pos + ',ADJ'
+    return token.pos
+
+def detailed_tokens(tokenizer, text):
+    """Format Mecab output into a nice data structure, based on Janome."""
+
+    node = tokenizer.parseToNode(text)
+    node = node.next # first node is beginning of sentence and empty, skip it
+    words = []
+    while node.posid != 0:
+        surface = node.surface
+        base = surface # a default value. Updated if available later.
+        parts = node.feature.split(',')
+        pos = ','.join(parts[0:4])
+
+        if len(parts) > 7:
+            # this information is only available for words in the tokenizer dictionary
+            reading = parts[6]
+            base = parts[7]
+
+        words.append(ShortUnitWord(surface, base, pos))
+        node = node.next
+    return words

 class JapaneseTokenizer(object):
    def __init__(self, cls, nlp=None):
        self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-        try:
-            from janome.tokenizer import Tokenizer
-        except ImportError:
-            raise ImportError("The Japanese tokenizer requires the Janome "
-                              "library: https://github.com/mocobeta/janome")
-        self.tokenizer = Tokenizer()
+
+        MeCab = try_mecab_import()
+        self.tokenizer = MeCab.Tagger()

    def __call__(self, text):
-        words = [x.surface for x in self.tokenizer.tokenize(text)]
-        return Doc(self.vocab, words=words, spaces=[False]*len(words))
+        dtokens = detailed_tokens(self.tokenizer, text)
+        words = [x.surface for x in dtokens]
+        doc = Doc(self.vocab, words=words, spaces=[False]*len(words))
+        for token, dtoken in zip(doc, dtokens):
+            token._.mecab_tag = dtoken.pos
+            token.tag_ = resolve_pos(dtoken)
+        return doc

    # add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
    # allow serialization (see #1557)
@ -53,6 +117,7 @@ class JapaneseCharacterSegmenter(object):
 class JapaneseDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'ja'
+    tag_map = TAG_MAP
    use_janome = True

    @classmethod
@ -62,13 +127,12 @@ class JapaneseDefaults(Language.Defaults):
        else:
            return JapaneseCharacterSegmenter(cls, nlp.vocab)

-
 class Japanese(Language):
    lang = 'ja'
    Defaults = JapaneseDefaults
+    Tokenizer = JapaneseTokenizer

    def make_doc(self, text):
        return self.tokenizer(text)

-
 __all__ = ['Japanese']
--- a/spacy/lang/ja/tag_map.py
+++ b/spacy/lang/ja/tag_map.py
@ -0,0 +1,88 @@
+# encoding: utf8
+from __future__ import unicode_literals
+
+from ...symbols import *
+
+TAG_MAP = {
+    # Explanation of Unidic tags:
+    # https://www.gavo.t.u-tokyo.ac.jp/~mine/japanese/nlp+slp/UNIDIC_manual.pdf
+
+    # Universal Dependencies Mapping:
+    # http://universaldependencies.org/ja/overview/morphology.html
+    # http://universaldependencies.org/ja/pos/all.html
+
+    "記号,一般,*,*":{POS: PUNCT}, # this includes characters used to represent sounds like ドレミ
+    "記号,文字,*,*":{POS: PUNCT}, # this is for Greek and Latin characters used as symbols, as in math
+
+    "感動詞,フィラー,*,*": {POS: INTJ},
+    "感動詞,一般,*,*": {POS: INTJ},
+
+    # this is specifically for unicode full-width space
+    "空白,*,*,*": {POS: X}, 
+
+    "形状詞,一般,*,*":{POS: ADJ},
+    "形状詞,タリ,*,*":{POS: ADJ}, 
+    "形状詞,助動詞語幹,*,*":{POS: ADJ}, 
+    "形容詞,一般,*,*":{POS: ADJ},
+    "形容詞,非自立可能,*,*":{POS: AUX}, # XXX ADJ if alone, AUX otherwise
+
+    "助詞,格助詞,*,*":{POS: ADP}, 
+    "助詞,係助詞,*,*":{POS: ADP}, 
+    "助詞,終助詞,*,*":{POS: PART}, 
+    "助詞,準体助詞,*,*":{POS: SCONJ}, # の as in 走るのが速い
+    "助詞,接続助詞,*,*":{POS: SCONJ}, # verb ending て
+    "助詞,副助詞,*,*":{POS: PART},  # ばかり, つつ after a verb
+    "助動詞,*,*,*":{POS: AUX},
+    "接続詞,*,*,*":{POS: SCONJ}, # XXX: might need refinement
+
+    "接頭辞,*,*,*":{POS: NOUN}, 
+    "接尾辞,形状詞的,*,*":{POS: ADJ}, # がち, チック 
+    "接尾辞,形容詞的,*,*":{POS: ADJ}, # -らしい
+    "接尾辞,動詞的,*,*":{POS: NOUN},  # -じみ
+    "接尾辞,名詞的,サ変可能,*":{POS: NOUN}, # XXX see 名詞,普通名詞,サ変可能,*
+    "接尾辞,名詞的,一般,*":{POS: NOUN},
+    "接尾辞,名詞的,助数詞,*":{POS: NOUN}, 
+    "接尾辞,名詞的,副詞可能,*":{POS: NOUN}, # -後, -過ぎ
+
+    "代名詞,*,*,*":{POS: PRON},
+    "動詞,一般,*,*":{POS: VERB},
+    "動詞,非自立可能,*,*":{POS: VERB}, # XXX VERB if alone, AUX otherwise
+    "動詞,非自立可能,*,*,AUX":{POS: AUX},
+    "動詞,非自立可能,*,*,VERB":{POS: VERB},
+    "副詞,*,*,*":{POS: ADV},
+
+    "補助記号,ＡＡ,一般,*":{POS: SYM}, # text art
+    "補助記号,ＡＡ,顔文字,*":{POS: SYM}, # kaomoji
+    "補助記号,一般,*,*":{POS: SYM}, 
+    "補助記号,括弧開,*,*":{POS: PUNCT}, # open bracket
+    "補助記号,括弧閉,*,*":{POS: PUNCT}, # close bracket
+    "補助記号,句点,*,*":{POS: PUNCT}, # period or other EOS marker
+    "補助記号,読点,*,*":{POS: PUNCT}, # comma
+
+    "名詞,固有名詞,一般,*":{POS: PROPN}, # general proper noun
+    "名詞,固有名詞,人名,一般":{POS: PROPN}, # person's name
+    "名詞,固有名詞,人名,姓":{POS: PROPN}, # surname
+    "名詞,固有名詞,人名,名":{POS: PROPN}, # first name
+    "名詞,固有名詞,地名,一般":{POS: PROPN}, # place name
+    "名詞,固有名詞,地名,国":{POS: PROPN}, # country name
+
+    "名詞,助動詞語幹,*,*":{POS: AUX}, 
+    "名詞,数詞,*,*":{POS: NUM}, # includes Chinese numerals
+
+    "名詞,普通名詞,サ変可能,*":{POS: NOUN}, # XXX: sometimes VERB in UDv2; suru-verb noun
+    "名詞,普通名詞,サ変可能,*,NOUN":{POS: NOUN}, 
+    "名詞,普通名詞,サ変可能,*,VERB":{POS: VERB}, 
+
+    "名詞,普通名詞,サ変形状詞可能,*":{POS: NOUN}, # ex: 下手
+    "名詞,普通名詞,一般,*":{POS: NOUN}, 
+    "名詞,普通名詞,形状詞可能,*":{POS: NOUN}, # XXX: sometimes ADJ in UDv2
+    "名詞,普通名詞,形状詞可能,*,NOUN":{POS: NOUN}, 
+    "名詞,普通名詞,形状詞可能,*,ADJ":{POS: ADJ}, 
+    "名詞,普通名詞,助数詞可能,*":{POS: NOUN}, # counter / unit
+    "名詞,普通名詞,副詞可能,*":{POS: NOUN},
+
+    "連体詞,*,*,*":{POS: ADJ}, # XXX this has exceptions based on literal token
+    "連体詞,*,*,*,ADJ":{POS: ADJ}, 
+    "連体詞,*,*,*,PRON":{POS: PRON}, 
+    "連体詞,*,*,*,DET":{POS: DET}, 
+}
--- a/spacy/lang/pt/lex_attrs.py
+++ b/spacy/lang/pt/lex_attrs.py
@ -6,10 +6,10 @@ from ...attrs import LIKE_NUM

 _num_words = ['zero', 'um', 'dois', 'três', 'quatro', 'cinco', 'seis', 'sete',
              'oito', 'nove', 'dez', 'onze', 'doze', 'treze', 'catorze',
-              'quinze', 'dezasseis', 'dezassete', 'dezoito', 'dezanove', 'vinte',
+              'quinze', 'dezesseis', 'dezasseis', 'dezessete', 'dezassete', 'dezoito', 'dezenove', 'dezanove', 'vinte',
              'trinta', 'quarenta', 'cinquenta', 'sessenta', 'setenta',
-              'oitenta', 'noventa', 'cem', 'mil', 'milhão', 'bilião', 'trilião',
-              'quadrilião']
+              'oitenta', 'noventa', 'cem', 'mil', 'milhão', 'bilhão', 'bilião', 'trilhão', 'trilião',
+              'quatrilhão']

 _ordinal_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto',
                  'sétimo', 'oitavo', 'nono', 'décimo', 'vigésimo', 'trigésimo',
--- a/spacy/lang/ro/__init__.py
+++ b/spacy/lang/ro/__init__.py
@ -3,6 +3,7 @@ from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
+from .lemmatizer import LOOKUP

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
@ -17,6 +18,7 @@ class RomanianDefaults(Language.Defaults):
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
+    lemma_lookup = LOOKUP


 class Romanian(Language):
--- a/spacy/lang/ro/examples.py
+++ b/spacy/lang/ro/examples.py
@ -0,0 +1,23 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.ro import Romanian
+>>> from spacy.lang.ro.examples import sentences
+>>> nlp = Romanian()
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+    "Apple plănuiește să cumpere o companie britanică pentru un miliard de dolari",
+    "Municipalitatea din San Francisco ia în calcul interzicerea roboților curieri pe trotuar",
+    "Londra este un oraș mare în Regatul Unit",
+    "Unde ești?",
+    "Cine este președintele Franței?",
+    "Care este capitala Statelor Unite?",
+    "Când s-a născut Barack Obama?"
+]
--- a/spacy/lang/ro/lemmatizer.py
+++ b/spacy/lang/ro/lemmatizer.py
--- a/spacy/lang/ro/stop_words.py
+++ b/spacy/lang/ro/stop_words.py
@ -28,6 +28,8 @@ acestia
 acestui
 aceşti
 aceştia
+acești
+aceștia
 acolo
 acord
 acum
@ -51,6 +53,7 @@ altfel
 alti
 altii
 altul
+alături
 am
 anume
 apoi
@ -80,11 +83,15 @@ au
 avea
 avem
 aveţi
+aveți
 avut
 azi
 aş
 aşadar
 aţi
+aș
+așadar
+ați
 b
 ba
 bine
@ -136,11 +143,13 @@ cât
 câte
 câtva
 câţi
+câți
 cînd
 cît
 cîte
 cîtva
 cîţi
+cîți
 că
 căci
 cărei
@ -167,6 +176,7 @@ departe
 desi
 despre
 deşi
+deși
 din
 dinaintea
 dintr
@ -191,6 +201,7 @@ este
 eu
 exact
 eşti
+ești
 f
 face
 fara
@ -203,6 +214,7 @@ fii
 fim
 fiu
 fiţi
+fiți
 foarte
 fost
 frumos
@ -210,6 +222,7 @@ fără
 g
 geaba
 graţie
+grație
 h
 halbă
 i
@ -259,6 +272,8 @@ multi
 multă
 mulţi
 mulţumesc
+mulți
+mulțumesc
 mâine
 mîine
 mă
@ -274,6 +289,7 @@ nimeri
 nimic
 niste
 nişte
+niște
 noastre
 noastră
 noi
@ -284,6 +300,7 @@ nou
 noua
 nouă
 noştri
+noștri
 nu
 numai
 o
@ -322,6 +339,9 @@ putini
 puţin
 puţina
 puţină
+puțin
+puțina
+puțină
 până
 pînă
 r
@ -343,11 +363,13 @@ sub
 sunt
 suntem
 sunteţi
+sunteți
 sus
 sută
 sînt
 sîntem
 sînteţi
+sînteți
 să
 săi
 său
@ -367,7 +389,9 @@ toti
 totul
 totusi
 totuşi
+totuși
 toţi
+toți
 trei
 treia
 treilea
@ -404,6 +428,7 @@ vor
 vostru
 vouă
 voştri
+voștri
 vreme
 vreo
 vreun
@ -428,15 +453,23 @@ zice
 întrucât
 întrucît
 îţi
+îți
 ăla
 ălea
 ăsta
 ăstea
 ăştia
+ăștia
 şapte
 şase
 şi
 ştiu
 ţi
 ţie
+șapte
+șase
+și
+știu
+ți
+ție
 """.split())
--- a/spacy/syntax/arc_eager.pyx
+++ b/spacy/syntax/arc_eager.pyx
@ -58,9 +58,9 @@ cdef weight_t push_cost(StateClass stcls, const GoldParseC* gold, int target) no
    cdef int i, S_i
    for i in range(stcls.stack_depth()):
        S_i = stcls.S(i)
-        if gold.heads[target] == S_i:
+        if gold.has_dep[target] and gold.heads[target] == S_i:
            cost += 1
-        if gold.heads[S_i] == target and (NON_MONOTONIC or not stcls.has_head(S_i)):
+        if gold.has_dep[S_i] and gold.heads[S_i] == target and (NON_MONOTONIC or not stcls.has_head(S_i)):
            cost += 1
        if BINARY_COSTS and cost >= 1:
            return cost
@ -73,10 +73,12 @@ cdef weight_t pop_cost(StateClass stcls, const GoldParseC* gold, int target) nog
    cdef int i, B_i
    for i in range(stcls.buffer_length()):
        B_i = stcls.B(i)
-        cost += gold.heads[B_i] == target
-        cost += gold.heads[target] == B_i
-        if gold.heads[B_i] == B_i or gold.heads[B_i] < target:
-            break
+        if gold.has_dep[B_i]:
+            cost += gold.heads[B_i] == target
+            if gold.heads[B_i] == B_i or gold.heads[B_i] < target:
+                break
+        if gold.has_dep[target]:
+            cost += gold.heads[target] == B_i
        if BINARY_COSTS and cost >= 1:
            return cost
    if Break.is_valid(stcls.c, 0) and Break.move_cost(stcls, gold) == 0:
@ -107,7 +109,10 @@ cdef bint arc_is_gold(const GoldParseC* gold, int head, int child) nogil:

 cdef bint label_is_gold(const GoldParseC* gold, int head, int child, attr_t label) nogil:
    if not gold.has_dep[child]:
-        return True
+        if label == SUBTOK_LABEL:
+            return False
+        else:
+            return True
    elif label == 0:
        return True
    elif gold.labels[child] == label:
@ -167,7 +172,7 @@ cdef class Reduce:
            # Decrement cost for the arcs e save
            for i in range(1, st.stack_depth()):
                S_i = st.S(i)
-                if gold.heads[st.S(0)] == S_i:
+                if gold.has_dep[st.S(0)] and gold.heads[st.S(0)] == S_i:
                    cost -= 1
                if gold.heads[S_i] == st.S(0):
                    cost -= 1
@ -208,8 +213,10 @@ cdef class LeftArc:
            # Account for deps we might lose between S0 and stack
            if not s.has_head(s.S(0)):
                for i in range(1, s.stack_depth()):
-                    cost += gold.heads[s.S(i)] == s.S(0)
-                    cost += gold.heads[s.S(0)] == s.S(i)
+                    if gold.has_dep[s.S(i)]:
+                        cost += gold.heads[s.S(i)] == s.S(0)
+                    if gold.has_dep[s.S(0)]:
+                        cost += gold.heads[s.S(0)] == s.S(i)
            return cost + pop_cost(s, gold, s.S(0)) + arc_cost(s, gold, s.B(0), s.S(0))

    @staticmethod
@ -284,18 +291,20 @@ cdef class Break:
            S_i = s.S(i)
            for j in range(s.buffer_length()):
                B_i = s.B(j)
-                cost += gold.heads[S_i] == B_i
-                cost += gold.heads[B_i] == S_i
+                if gold.has_dep[S_i]:
+                    cost += gold.heads[S_i] == B_i
+                if gold.has_dep[B_i]:
+                    cost += gold.heads[B_i] == S_i
                if cost != 0:
                    return cost
        # Check for sentence boundary --- if it's here, we can't have any deps
        # between stack and buffer, so rest of action is irrelevant.
-        s0_root = _get_root(s.S(0), gold)
-        b0_root = _get_root(s.B(0), gold)
-        if s0_root != b0_root or s0_root == -1 or b0_root == -1:
+        if not gold.has_dep[s.S(0)] or not gold.has_dep[s.B(0)]:
            return cost
+        if gold.sent_start[s.B_(0).l_edge] == -1:
+            return cost+1
        else:
-            return cost + 1
+            return cost

    @staticmethod
    cdef inline weight_t label_cost(StateClass s, const GoldParseC* gold, attr_t label) nogil:
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@ -15,7 +15,8 @@ from .. import util
 # here if it's using spaCy's tokenizer (not a different library)
 # TODO: re-implement generic tokenizer tests
 _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
-              'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'xx']
+              'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'ar', 'xx']
+
 _models = {'en': ['en_core_web_sm'],
           'de': ['de_core_news_md'],
           'fr': ['fr_core_news_sm'],
@ -50,8 +51,8 @@ def RU(request):

 #@pytest.fixture(params=_languages)
 #def tokenizer(request):
-    #lang = util.get_lang_class(request.param)
-    #return lang.Defaults.create_tokenizer()
+#lang = util.get_lang_class(request.param)
+#return lang.Defaults.create_tokenizer()


 @pytest.fixture
@ -100,6 +101,11 @@ def fi_tokenizer():
    return util.get_lang_class('fi').Defaults.create_tokenizer()


+@pytest.fixture
+def ro_tokenizer():
+    return util.get_lang_class('ro').Defaults.create_tokenizer()
+
+
 @pytest.fixture
 def id_tokenizer():
    return util.get_lang_class('id').Defaults.create_tokenizer()
@ -135,10 +141,9 @@ def da_tokenizer():

 @pytest.fixture
 def ja_tokenizer():
-    janome = pytest.importorskip("janome")
+    MeCab = pytest.importorskip("MeCab")
    return util.get_lang_class('ja').Defaults.create_tokenizer()

-
 @pytest.fixture
 def th_tokenizer():
    pythainlp = pytest.importorskip("pythainlp")
@ -148,6 +153,9 @@ def th_tokenizer():
 def tr_tokenizer():
    return util.get_lang_class('tr').Defaults.create_tokenizer()

+@pytest.fixture
+def ar_tokenizer():
+    return util.get_lang_class('ar').Defaults.create_tokenizer()

 @pytest.fixture
 def ru_tokenizer():
@ -162,7 +170,7 @@ def stringstore():

 @pytest.fixture
 def en_entityrecognizer():
-     return util.get_lang_class('en').Defaults.create_entity()
+    return util.get_lang_class('en').Defaults.create_entity()


 @pytest.fixture
@ -177,11 +185,11 @@ def text_file_b():

 def pytest_addoption(parser):
    parser.addoption("--models", action="store_true",
-        help="include tests that require full models")
+                     help="include tests that require full models")
    parser.addoption("--vectors", action="store_true",
-        help="include word vectors tests")
+                     help="include word vectors tests")
    parser.addoption("--slow", action="store_true",
-        help="include slow tests")
+                     help="include slow tests")

    for lang in _languages + ['all']:
        parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
--- a/spacy/tests/lang/ar/__init__.py
+++ b/spacy/tests/lang/ar/__init__.py
--- a/spacy/tests/lang/ar/test_exceptions.py
+++ b/spacy/tests/lang/ar/test_exceptions.py
@ -0,0 +1,26 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+@pytest.mark.parametrize('text',
+                         ["ق.م", "إلخ", "ص.ب", "ت."])
+def test_ar_tokenizer_handles_abbr(ar_tokenizer, text):
+    tokens = ar_tokenizer(text)
+    assert len(tokens) == 1
+
+
+def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
+    text = u"تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
+    tokens = ar_tokenizer(text)
+    assert len(tokens) == 7
+    assert tokens[6].text == "ق.م"
+    assert tokens[6].lemma_ == "قبل الميلاد"
+
+
+def test_ar_tokenizer_handles_units_in_text(ar_tokenizer):
+    text = u"يبلغ طول مضيق طارق 14كم "
+    tokens = ar_tokenizer(text)
+    print([(tokens[i].text, tokens[i].suffix_) for i in range(len(tokens))])
+    assert len(tokens) == 6
--- a/spacy/tests/lang/ar/test_text.py
+++ b/spacy/tests/lang/ar/test_text.py
@ -0,0 +1,13 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+def test_tokenizer_handles_long_text(ar_tokenizer):
+    text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين.
+     ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها،
+      فتمكن من نيل شهادة في الفلسفة. ألف محفوظ على مدار حياته الكثير من الأعمال الأدبية، و في مقدمتها ثلاثيته الشهيرة.
+      و قد نجح في الحصول على جائزة نوبل للآداب، ليكون بذلك العربي الوحيد الذي فاز بها."""
+
+    tokens = ar_tokenizer(text)
+    assert tokens[3].is_stop
+    assert len(tokens) == 77
--- a/spacy/tests/lang/ja/test_tokenizer.py
+++ b/spacy/tests/lang/ja/test_tokenizer.py
@ -5,15 +5,41 @@ import pytest


 TOKENIZER_TESTS = [
-        ("日本語だよ", ['日本語', 'だ', 'よ']),
+        ("日本語だよ", ['日本', '語', 'だ', 'よ']),
        ("東京タワーの近くに住んでいます。", ['東京', 'タワー', 'の', '近く', 'に', '住ん', 'で', 'い', 'ます', '。']),
        ("吾輩は猫である。", ['吾輩', 'は', '猫', 'で', 'ある', '。']),
-        ("月に代わって、お仕置きよ!", ['月', 'に', '代わっ', 'て', '、', 'お仕置き', 'よ', '!']),
+        ("月に代わって、お仕置きよ!", ['月', 'に', '代わっ', 'て', '、', 'お', '仕置き', 'よ', '!']),
        ("すもももももももものうち", ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち'])
 ]

+TAG_TESTS = [
+        ("日本語だよ", ['名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']),
+        ("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']),
+        ("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']),
+        ("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点']),
+        ("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能'])
+]
+
+POS_TESTS = [
+        ('日本語だよ', ['PROPN', 'NOUN', 'AUX', 'PART']),
+        ('東京タワーの近くに住んでいます。', ['PROPN', 'NOUN', 'ADP', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'VERB', 'AUX', 'PUNCT']),
+        ('吾輩は猫である。', ['PRON', 'ADP', 'NOUN', 'AUX', 'VERB', 'PUNCT']),
+        ('月に代わって、お仕置きよ!', ['NOUN', 'ADP', 'VERB', 'SCONJ', 'PUNCT', 'NOUN', 'NOUN', 'PART', 'PUNCT']),
+        ('すもももももももものうち', ['NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN'])
+]
+

 @pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS)
 def test_japanese_tokenizer(ja_tokenizer, text, expected_tokens):
    tokens = [token.text for token in ja_tokenizer(text)]
    assert tokens == expected_tokens
+
+@pytest.mark.parametrize('text,expected_tags', TAG_TESTS)
+def test_japanese_tokenizer_tags(ja_tokenizer, text, expected_tags):
+    tags = [token.tag_ for token in ja_tokenizer(text)]
+    assert tags == expected_tags
+
+@pytest.mark.parametrize('text,expected_pos', POS_TESTS)
+def test_japanese_tokenizer_pos(ja_tokenizer, text, expected_pos):
+    pos = [token.pos_ for token in ja_tokenizer(text)]
+    assert pos == expected_pos
--- a/spacy/tests/lang/ro/__init__.py
+++ b/spacy/tests/lang/ro/__init__.py
--- a/spacy/tests/lang/ro/test_lemmatizer.py
+++ b/spacy/tests/lang/ro/test_lemmatizer.py
@ -0,0 +1,13 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+@pytest.mark.parametrize('string,lemma', [('câini', 'câine'),
+                                          ('expedițiilor', 'expediție'),
+                                          ('pensete', 'pensetă'),
+                                          ('erau', 'fi')])
+def test_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma):
+    tokens = ro_tokenizer(string)
+    assert tokens[0].lemma_ == lemma
--- a/spacy/tests/regression/test_issue2219.py
+++ b/spacy/tests/regression/test_issue2219.py
@ -0,0 +1,18 @@
+# coding: utf8
+from __future__ import unicode_literals
+from ..util import add_vecs_to_vocab, get_doc
+import pytest
+
+@pytest.fixture
+def vectors():
+    return [("a", [1, 2, 3]), ("letter", [4, 5, 6])]
+
+@pytest.fixture
+def vocab(en_vocab, vectors):
+    add_vecs_to_vocab(en_vocab, vectors)
+    return en_vocab
+
+def test_issue2219(vocab, vectors):
+    [(word1, vec1), (word2, vec2)] = vectors
+    doc = get_doc(vocab, words=[word1, word2])
+    assert doc[0].similarity(doc[1]) == doc[1].similarity(doc[0])
--- a/spacy/tokens/token.pyx
+++ b/spacy/tokens/token.pyx
@ -155,7 +155,7 @@ cdef class Token:
        """
        if 'similarity' in self.doc.user_token_hooks:
            return self.doc.user_token_hooks['similarity'](self)
-        if hasattr(other, '__len__') and len(other) == 1:
+        if hasattr(other, '__len__') and len(other) == 1 and hasattr(other, "__getitem__"):
            if self.c.lex.orth == getattr(other[0], 'orth', None):
                return 1.0
        elif hasattr(other, 'orth'):
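
The extra `__getitem__` check in this hunk matters because a `Token` reports its length in characters but cannot be indexed, so a one-character token such as "a" passed the old `len(other) == 1` test and then failed on `other[0]`. A minimal pure-Python sketch of that situation follows; `FakeToken`, `FakeSpan`, `is_single_item_sequence` and `same_orth` are hypothetical stand-ins written for illustration, not spaCy's actual Cython types or methods.

```python
class FakeToken:
    """Stand-in for a token: has a length (in characters) but no indexing."""

    def __init__(self, text):
        self.text = text
        self.orth = text  # the real attribute is an integer ID; a string is enough here

    def __len__(self):
        return len(self.text)


class FakeSpan:
    """Stand-in for a span: a sequence of tokens that supports indexing."""

    def __init__(self, tokens):
        self.tokens = tokens

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, i):
        return self.tokens[i]


def is_single_item_sequence(other):
    # Mirrors the patched condition: length 1 AND indexable.
    return (hasattr(other, '__len__') and len(other) == 1
            and hasattr(other, '__getitem__'))


def same_orth(token, other):
    # Simplified version of the shortcut at the top of the similarity method.
    if is_single_item_sequence(other):
        return token.orth == getattr(other[0], 'orth', None)
    if hasattr(other, 'orth'):
        return token.orth == other.orth
    return False


a = FakeToken("a")
letter = FakeToken("letter")
print(is_single_item_sequence(a))            # False: length 1, but not indexable
print(is_single_item_sequence(FakeSpan([a])))  # True
print(same_orth(a, a), same_orth(a, letter))   # True False
```

Without the `hasattr(other, '__getitem__')` clause, the first branch would be taken for `a` and `other[0]` would raise a `TypeError` in this sketch.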
--- a/website/README.md
+++ b/website/README.md
@ -27,8 +27,6 @@ The docs can always use another example or more detail, and they should always b

 While all page content lives in the `.jade` files, article meta (page titles, sidebars etc.) is stored as JSON. Each folder contains a `_data.json` with all required meta for its files.

-For simplicity, all sites linked in the [tutorials](https://spacy.io/docs/usage/tutorials) and [showcase](https://spacy.io/docs/usage/showcase) are also stored as JSON. So in order to edit those pages, there's no need to dig into the Jade files – simply edit the [`_data.json`](docs/usage/_data.json).
-
 ### Markup language and conventions

 Jade/Pug is a whitespace-sensitive markup language that compiles to HTML. Indentation is used to nest elements, and for template logic, like `if`/`else` or `for`, mainly used to iterate over objects and arrays in the meta data. It also allows inline JavaScript expressions.
--- a/website/_harp.json
+++ b/website/_harp.json
@ -12,8 +12,6 @@
        "COMPANY_URL": "https://explosion.ai",
        "DEMOS_URL": "https://explosion.ai/demos",
        "MODELS_REPO": "explosion/spacy-models",
-        "KERNEL_BINDER": "ines/spacy-binder",
-        "KERNEL_PYTHON": "python3",

        "SPACY_VERSION": "2.0",
        "BINDER_VERSION": "2.0.11",
@ -87,7 +85,7 @@
        ],

        "V_CSS": "2.1.3",
-        "V_JS": "2.1.1",
+        "V_JS": "2.1.2",
        "DEFAULT_SYNTAX": "python",
        "ANALYTICS": "UA-58931649-1",
        "MAILCHIMP": {
--- a/website/api/_annotation/_named-entities.jade
+++ b/website/api/_annotation/_named-entities.jade
@ -15,7 +15,7 @@ p
        +cell Nationalities or religious or political groups.

    +row
-        +cell #[code FACILITY]
+        +cell #[code FAC]
        +cell Buildings, airports, highways, bridges, etc.

    +row
--- a/website/api/doc.jade
+++ b/website/api/doc.jade
@ -149,7 +149,7 @@ p

 +aside-code("Example").
    from spacy.tokens import Doc
-    city_getter = lambda doc: doc.text in ('New York', 'Paris', 'Berlin')
+    city_getter = lambda doc: any(city in doc.text for city in ('New York', 'Paris', 'Berlin'))
    Doc.set_extension('has_city', getter=city_getter)
    doc = nlp(u'I like New York')
    assert doc._.has_city
--- a/website/api/span.jade
+++ b/website/api/span.jade
@ -127,7 +127,7 @@ p

 +aside-code("Example").
    from spacy.tokens import Span
-    city_getter = lambda span: span.text in ('New York', 'Paris', 'Berlin')
+    city_getter = lambda span: any(city in span.text for city in ('New York', 'Paris', 'Berlin'))
    Span.set_extension('has_city', getter=city_getter)
    doc = nlp(u'I like New York in Autumn')
    assert doc[1:4]._.has_city
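
The getter change in the `Doc` and `Span` examples replaces an exact-match test with a substring test. The difference can be seen on plain strings, without spaCy; `CITIES`, `exact_getter` and `substring_getter` are illustrative names chosen for this sketch only.

```python
CITIES = ('New York', 'Paris', 'Berlin')


def exact_getter(text):
    # Old behaviour: true only if the whole text IS one of the city names.
    return text in CITIES


def substring_getter(text):
    # New behaviour: true if any city name occurs somewhere in the text.
    return any(city in text for city in CITIES)


print(exact_getter('I like New York'))      # False
print(substring_getter('I like New York'))  # True
print(exact_getter('Paris'))                # True
```

With the old getter, the `assert doc._.has_city` line in the example would fail for any text longer than a bare city name.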
--- a/website/assets/js/main.js
+++ b/website/assets/js/main.js
@ -47,7 +47,7 @@ import initUniverse from './universe.vue.js';
 */
 {
    if (window.Juniper) {
-        new Juniper({ repo: 'ines/spacy-binder' });
+        new Juniper({ repo: 'ines/spacy-io-binder' });
    }
 }

--- a/website/assets/js/vendor/juniper.min.js
+++ b/website/assets/js/vendor/juniper.min.js
--- a/website/universe/universe.json
+++ b/website/universe/universe.json
@ -445,6 +445,29 @@
            },
            "category": ["visualizers"]
        },
+        {
+            "id": "scattertext",
+            "slogan": "Beautiful visualizations of how language differs among document types",
+            "description": "A tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in a sexy, interactive scatter plot with non-overlapping term labels. Exploratory data analysis just got more fun.",
+            "github": "JasonKessler/scattertext",
+            "image": "https://jasonkessler.github.io/2012conventions0.0.2.2.png",
+            "code_example": [
+                "import spacy",
+                "import scattertext as st",
+                "",
+                "nlp = spacy.load('en')",
+                "corpus = st.CorpusFromPandas(convention_df,",
+                "                             category_col='party',",
+                "                             text_col='text',",
+                "                             nlp=nlp).build()"
+            ],
+            "author": "Jason Kessler",
+            "author_links": {
+                "github": "JasonKessler",
+                "twitter": "jasonkessler"
+            },
+            "category": ["visualizers"]
+        },
        {
            "id": "rasa",
            "title": "Rasa NLU",
--- a/website/usage/_adding-languages/_language-data.jade
+++ b/website/usage/_adding-languages/_language-data.jade
@ -4,7 +4,7 @@ p
    |  The individual components #[strong expose variables] that can be imported
    |  within a language module, and added to the language's #[code Defaults].
    |  Some components, like the punctuation rules, usually don't need much
-    |  customisation and can simply be imported from the global rules. Others,
+    |  customisation and can be imported from the global rules. Others,
    |  like the tokenizer and norm exceptions, are very specific and will make
    |  a big difference to spaCy's performance on the particular language and
    |  training a language model.
--- a/website/usage/_data.json
+++ b/website/usage/_data.json
@ -92,6 +92,7 @@
            "Dependency Parse": "dependency-parse",
            "Named Entities": "named-entities",
            "Tokenization": "tokenization",
+            "Sentence Segmentation": "sbd",
            "Rule-based Matching": "rule-based-matching"
        }
    },
--- a/website/usage/_install/_troubleshooting.jade
+++ b/website/usage/_install/_troubleshooting.jade
@ -39,7 +39,7 @@ p
    |  this. The above error mostly occurs when doing a system-wide installation,
    |  which will create the symlinks in a system directory. Run the
    |  #[code download] or #[code link] command as administrator (on Windows,
-    |  simply right-click on your terminal or shell ans select "Run as
+    |  you can right-click on your terminal or shell and select "Run as
    |  Administrator"), or use a #[code virtualenv] to install spaCy in a user
    |  directory, instead of doing a system-wide installation.

--- a/website/usage/_linguistic-features/_dependency-parse.jade
+++ b/website/usage/_linguistic-features/_dependency-parse.jade
@ -220,8 +220,8 @@ p

 p
    |  The best way to understand spaCy's dependency parser is interactively.
-    |  To make this easier, spaCy v2.0+ comes with a visualization module. Simply
-    |  pass a #[code Doc] or a list of #[code Doc] objects to
+    |  To make this easier, spaCy v2.0+ comes with a visualization module. You
+    |  can pass a #[code Doc] or a list of #[code Doc] objects to
    |  displaCy and run #[+api("top-level#displacy.serve") #[code displacy.serve]] to
    |  run the web server, or #[+api("top-level#displacy.render") #[code displacy.render]]
    |  to generate the raw markup. If you want to know how to write rules that
--- a/website/usage/_linguistic-features/_named-entities.jade
+++ b/website/usage/_linguistic-features/_named-entities.jade
@ -195,7 +195,7 @@ p
    |  lets you explore an entity recognition model's behaviour interactively.
    |  If you're training a model, it's very useful to run the visualization
    |  yourself. To help you do that, spaCy v2.0+ comes with a visualization
-    |  module. Simply pass a #[code Doc] or a list of #[code Doc] objects to
+    |  module. You can pass a #[code Doc] or a list of #[code Doc] objects to
    |  displaCy and run #[+api("top-level#displacy.serve") #[code displacy.serve]] to
    |  run the web server, or #[+api("top-level#displacy.render") #[code displacy.render]]
    |  to generate the raw markup.
--- a/website/usage/_linguistic-features/_sentence-segmentation.jade
+++ b/website/usage/_linguistic-features/_sentence-segmentation.jade
@ -0,0 +1,129 @@
+//- 💫 DOCS > USAGE > LINGUISTIC FEATURES > SENTENCE SEGMENTATION
+
+p
+    |  A #[+api("doc") #[code Doc]] object's sentences are available via the
+    |  #[code Doc.sents] property. Unlike other libraries, spaCy uses the
+    |  dependency parse to determine sentence boundaries. This is usually more
+    |  accurate than a rule-based approach, but it also means you'll need a
+    |  #[strong statistical model] and accurate predictions. If your
+    |  texts are closer to general-purpose news or web text, this should work
+    |  well out-of-the-box. For social media or conversational text that
+    |  doesn't follow the same rules, your application may benefit from a custom
+    |  rule-based implementation. You can either plug a rule-based component
+    |  into your #[+a("/usage/processing-pipelines") processing pipeline] or use
+    |  the #[code SentenceSegmenter] component with a custom strategy.
+
++h(3, "sbd-parser") Default: Using the dependency parse
+    +tag-model("dependency parser")
+
+p
+    |  To view a #[code Doc]'s sentences, you can iterate over
+    |  #[code Doc.sents], a generator that yields
+    |  #[+api("span") #[code Span]] objects.
+
++code-exec.
+    import spacy
+
+    nlp = spacy.load('en_core_web_sm')
+    doc = nlp(u"This is a sentence. This is another sentence.")
+    for sent in doc.sents:
+        print(sent.text)
+
++h(3, "sbd-manual") Setting boundaries manually
+
+p
+    |  spaCy's dependency parser respects already set boundaries, so you can
+    |  preprocess your #[code Doc] using custom rules #[em before] it's
+    |  parsed. This can be done by adding a
+    |  #[+a("/usage/processing-pipelines") custom pipeline component]. Depending
+    |  on your text, this may also improve accuracy, since the parser is
+    |  constrained to predict parses consistent with the sentence boundaries.
+
++infobox("Important note", "⚠️")
+    |  To prevent inconsistent state, you can only set boundaries #[em before] a
+    |  document is parsed (and #[code Doc.is_parsed] is #[code False]). To
+    |  ensure that your component is added in the right place, you can set
+    |  #[code before='parser'] or #[code first=True] when adding it to the
+    |  pipeline using #[+api("language#add_pipe") #[code nlp.add_pipe]].
+
+p
+    |  Here's an example of a component that implements a pre-processing rule
+    |  for splitting on #[code '...'] tokens. The component is added before
+    |  the parser, which is then used to further segment the text. This
+    |  approach can be useful if you want to implement #[em additional] rules
+    |  specific to your data, while still being able to take advantage of
+    |  dependency-based sentence segmentation.
+
++code-exec.
+    import spacy
+
+    text = u"this is a sentence...hello...and another sentence."
+
+    nlp = spacy.load('en_core_web_sm')
+    doc = nlp(text)
+    print('Before:', [sent.text for sent in doc.sents])
+
+    def set_custom_boundaries(doc):
+        for token in doc[:-1]:
+            if token.text == '...':
+                doc[token.i+1].is_sent_start = True
+        return doc
+
+    nlp.add_pipe(set_custom_boundaries, before='parser')
+    doc = nlp(text)
+    print('After:', [sent.text for sent in doc.sents])
+
++h(3, "sbd-component") Rule-based pipeline component
+
+p
+    |  The #[code sentencizer] component is a
+    |  #[+a("/usage/processing-pipelines") pipeline component] that splits
+    |  sentences on punctuation like #[code &period;], #[code !] or #[code ?].
+    |  You can plug it into your pipeline if you only need sentence boundaries
+    |  without the dependency parse. Note that #[code Doc.sents] will
+    |  #[strong raise an error] if no sentence boundaries are set.
+
++code-exec.
+    import spacy
+    from spacy.lang.en import English
+
+    nlp = English()  # just the language with no model
+    sbd = nlp.create_pipe('sentencizer')   # or: nlp.create_pipe('sbd')
+    nlp.add_pipe(sbd)
+    doc = nlp(u"This is a sentence. This is another sentence.")
+    for sent in doc.sents:
+        print(sent.text)
+
++h(3, "sbd-custom") Custom rule-based strategy
+
+p
+    |  If you want to implement your own strategy that differs from the default
+    |  rule-based approach of splitting on punctuation, you can also instantiate
+    |  the #[code SentenceSegmenter] directly and pass in your own strategy.
+    |  The strategy should be a function that takes a #[code Doc] object and
+    |  yields a #[code Span] for each sentence. Here's an example of a custom
+    |  segmentation strategy for splitting on newlines only:
+
++code-exec.
+    from spacy.lang.en import English
+    from spacy.pipeline import SentenceSegmenter
+
+    def split_on_newlines(doc):
+        start = 0
+        seen_newline = False
+        for word in doc:
+            if seen_newline and not word.is_space:
+                yield doc[start:word.i]
+                start = word.i
+                seen_newline = False
+            elif '\n' in word.text:
+                seen_newline = True
+        if start &lt; len(doc):
+            yield doc[start:len(doc)]
+
+    nlp = English()  # just the language with no model
+    sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
+    nlp.add_pipe(sbd)
+    doc = nlp(u"This is a sentence\n\nThis is another sentence\nAnd more")
+    for sent in doc.sents:
+        print([token.text for token in sent])
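
The control flow of the `split_on_newlines` strategy can be tried out without spaCy. The sketch below is an analogue under stated assumptions: it works on a plain list of token strings rather than a `Doc`, yields `(start, end)` index pairs instead of `Span` objects, and treats any whitespace-only string containing a newline as a break marker. The token list is written by hand and is not the output of a real tokenizer.

```python
def split_on_newlines(words):
    """Yield (start, end) index pairs, one per newline-delimited segment."""
    start = 0
    seen_newline = False
    for i, word in enumerate(words):
        if seen_newline and not word.isspace():
            # First non-space token after a newline opens a new segment.
            yield (start, i)
            start = i
            seen_newline = False
        elif word.isspace() and '\n' in word:
            seen_newline = True
    if start < len(words):
        # Emit whatever is left after the last boundary.
        yield (start, len(words))


# Hand-written token list (assumed, not produced by a tokenizer).
words = ['This', 'is', 'a', 'sentence', '\n\n',
         'This', 'is', 'another', 'sentence', '\n',
         'And', 'more']
segments = [words[s:e] for s, e in split_on_newlines(words)]
for segment in segments:
    print(segment)
```

As in the strategy it mirrors, the newline token stays attached to the end of the preceding segment, and the final check emits the trailing segment even when the input does not end in a newline.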
--- a/website/usage/_linguistic-features/_tokenization.jade
+++ b/website/usage/_linguistic-features/_tokenization.jade
@ -274,7 +274,7 @@ p
        |  In spaCy v1.x, you had to add a custom tokenizer by passing it to the
        |  #[code make_doc] keyword argument, or by passing a tokenizer "factory"
        |  to #[code create_make_doc]. This was unnecessarily complicated. Since
-        |  spaCy v2.0, you can simply write to #[code nlp.tokenizer]. If your
+        |  spaCy v2.0, you can write to #[code nlp.tokenizer] instead. If your
        |  tokenizer needs the vocab, you can write a function and use
        |  #[code nlp.vocab].

--- a/website/usage/_models/_install.jade
+++ b/website/usage/_models/_install.jade
@ -19,15 +19,15 @@ include _install-basics
 +h(3, "download-pip") Installation via pip

 p
-    | To download a model directly using #[+a("https://pypi.python.org/pypi/pip") pip],
-    |  simply point #[code pip install] to the URL or local path of the archive
+    |  To download a model directly using #[+a("https://pypi.python.org/pypi/pip") pip],
+    |  point #[code pip install] to the URL or local path of the archive
    |  file. To find the direct link to a model, head over to the
    |  #[+a(gh("spacy-models") + "/releases") model releases], right click on the archive
    |  link and copy it to your clipboard.

 +code(false, "bash").
    # with external URL
-    pip install #{gh("spacy-models")}/releases/download/en_core_web_md-1.2.0/en_core_web_md-1.2.0.tar.gz
+    pip install #{gh("spacy-models")}/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz

    # with local file
    pip install /Users/you/en_core_web_md-1.2.0.tar.gz
@ -69,7 +69,7 @@ p

 p
    |  You can place the #[strong model package directory] anywhere on your
-    |  local file system. To use it with spaCy, simply assign it a name by
+    |  local file system. To use it with spaCy, assign it a name by
    |  creating a #[+a("#usage") shortcut link] for the data directory.

 +h(3, "usage") Using models with spaCy
--- a/website/usage/_models/_production.jade
+++ b/website/usage/_models/_production.jade
@ -26,7 +26,7 @@ p
 p
    |  Because all models are valid Python packages, you can add them to your
    |  application's #[code requirements.txt]. If you're running your own
-    |  internal PyPi installation, you can simply upload the models there. pip's
+    |  internal PyPi installation, you can upload the models there. pip's
    |  #[+a("https://pip.pypa.io/en/latest/reference/pip_install/#requirements-file-format") requirements file format]
    |  supports both package names to download via a PyPi server, as well as direct
    |  URLs.
--- a/website/usage/_spacy-101/_tokenization.jade
+++ b/website/usage/_spacy-101/_tokenization.jade
@ -5,7 +5,7 @@ p
    |  segments it into words, punctuation and so on. This is done by applying
    |  rules specific to each language. For example, punctuation at the end of a
    |  sentence should be split off – whereas "U.K." should remain one token.
-    |  Each #[code Doc] consists of individual tokens, and we can simply iterate
+    |  Each #[code Doc] consists of individual tokens, and we can iterate
    |  over them:

 +code-exec.
--- a/website/usage/_visualizers/_html.jade
+++ b/website/usage/_visualizers/_html.jade
@ -72,10 +72,11 @@ p
    |  you want to visualize output from other libraries, like
    |  #[+a("http://www.nltk.org") NLTK] or
    |  #[+a("https://github.com/tensorflow/models/tree/master/research/syntaxnet") SyntaxNet].
-    |  Simply convert the dependency parse or recognised entities to displaCy's
-    |  format and set #[code manual=True] on either #[code render()] or
-    |  #[code serve()]. When setting #[code ents] manually, make sure to supply
-    |  them in the right order, i.e. starting with the lowest start position.
+    |  If you set #[code manual=True] on either #[code render()] or
+    |  #[code serve()], you can pass in data in displaCy's format (instead of
+    |  #[code Doc] objects). When setting #[code ents] manually, make sure to
+    |  supply them in the right order, i.e. starting with the lowest start
+    |  position.

 +aside-code("Example").
    ex = [{'text': 'But Google is starting from behind.',
@ -109,7 +110,7 @@ p
    |  If you want to use the visualizers as part of a web application, for
    |  example to create something like our
    |  #[+a(DEMOS_URL + "/displacy") online demo], it's not recommended to
-    |  simply wrap and serve the displaCy renderer. Instead, you should only
+    |  wrap and serve the displaCy renderer directly. Instead, you should only
    |  rely on the server to perform spaCy's processing capabilities, and use
    |  #[+a(gh("displacy")) displaCy.js] to render the JSON-formatted output.

--- a/website/usage/linguistic-features.jade
+++ b/website/usage/linguistic-features.jade
@ -33,6 +33,10 @@ p
    +h(2, "tokenization") Tokenization
    include _linguistic-features/_tokenization

++section("sbd")
+    +h(2, "sbd") Sentence Segmentation
+    include _linguistic-features/_sentence-segmentation
+
 +section("rule-based-matching")
    +h(2, "rule-based-matching") Rule-based matching
    include _linguistic-features/_rule-based-matching