Merge branch 'develop' into nightly.spacy.io

Ines Montani 2020-09-08 14:28:18 +02:00
commit fa101a1bb6
75 changed files with 3535 additions and 390 deletions

107
.github/contributors/bittlingmayer.md vendored Normal file
View File

@@ -0,0 +1,107 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Adam Bittlingmayer |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 12 Aug 2020 |
| GitHub username | bittlingmayer |
| Website (optional) | |

106
.github/contributors/graue70.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Thomas |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-08-11 |
| GitHub username | graue70 |
| Website (optional) | |

106
.github/contributors/holubvl3.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Vladimir Holubec |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 30.07.2020 |
| GitHub username | holubvl3 |
| Website (optional) | |

106
.github/contributors/idoshr.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ido Shraga |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 20-09-2020 |
| GitHub username | idoshr |
| Website (optional) | |

106
.github/contributors/jgutix.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Juan Gutiérrez |
| Company name (if applicable) | Ojtli |
| Title or role (if applicable) | |
| Date | 2020-08-28 |
| GitHub username | jgutix |
| Website (optional) | ojtli.app |

106
.github/contributors/leyendecker.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ---------------------------- |
| Name | Gustavo Zadrozny Leyendecker |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | July 29, 2020 |
| GitHub username | leyendecker |
| Website (optional) | |

106
.github/contributors/lizhe2004.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Zhe li |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-07-24 |
| GitHub username | lizhe2004 |
| Website (optional)            | http://www.huahuaxia.net |

106
.github/contributors/snsten.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Shashank Shekhar |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-08-23 |
| GitHub username | snsten |
| Website (optional) | |

106
.github/contributors/solarmist.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------- |
| Name | Joshua Olson |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-07-22 |
| GitHub username | solarmist |
| Website (optional) | http://blog.solarmist.net |

106
.github/contributors/tilusnet.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Attila Szász |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 12 Aug 2020 |
| GitHub username | tilusnet |
| Website (optional) | |

View File

@@ -0,0 +1,38 @@
Third Party Licenses for spaCy
==============================
NumPy
-----
* Files: setup.py
Copyright (c) 2005-2020, NumPy Developers.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
* Neither the name of the NumPy Developers nor the names of any
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

View File

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy-nightly"
-__version__ = "3.0.0a13"
+__version__ = "3.0.0a14"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

View File

@@ -40,5 +40,6 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
             url = storage.pull(output_path, command_hash=cmd_hash)
             yield url, output_path
-        if cmd.get("outputs") and all(loc.exists() for loc in cmd["outputs"]):
+        out_locs = [project_dir / out for out in cmd.get("outputs", [])]
+        if all(loc.exists() for loc in out_locs):
             update_lockfile(project_dir, cmd)

View File

@@ -45,10 +45,19 @@ def project_push(project_dir: Path, remote: str):
         )
         for output_path in cmd.get("outputs", []):
             output_loc = project_dir / output_path
-            if output_loc.exists():
+            if output_loc.exists() and _is_not_empty_dir(output_loc):
                 url = storage.push(
                     output_path,
                     command_hash=cmd_hash,
                     content_hash=get_content_hash(output_loc),
                 )
                 yield output_path, url
+
+
+def _is_not_empty_dir(loc: Path):
+    if not loc.is_dir():
+        return True
+    elif any(_is_not_empty_dir(child) for child in loc.iterdir()):
+        return True
+    else:
+        return False
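For context, the new check means outputs that exist only as empty directories are no longer pushed to the remote. A small standalone demo of the helper's behaviour (the function body is copied from the hunk above; the temporary directory and file names are made up for illustration):

    from pathlib import Path
    import tempfile

    def _is_not_empty_dir(loc: Path):
        # Copied from the hunk above: a file counts as content, a directory
        # counts only if some descendant is eventually a file.
        if not loc.is_dir():
            return True
        elif any(_is_not_empty_dir(child) for child in loc.iterdir()):
            return True
        else:
            return False

    with tempfile.TemporaryDirectory() as tmp:
        output_loc = Path(tmp) / "training"
        output_loc.mkdir()
        print(_is_not_empty_dir(output_loc))       # False: project_push skips it
        (output_loc / "model.bin").write_bytes(b"\x00")
        print(_is_not_empty_dir(output_loc))       # True: pushed as before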

View File

@@ -186,11 +186,14 @@ accumulate_gradient = {{ transformer["size_factor"] }}
 [training.optimizer]
 @optimizers = "Adam.v1"

+{% if use_transformer -%}
 [training.optimizer.learn_rate]
 @schedules = "warmup_linear.v1"
 warmup_steps = 250
 total_steps = 20000
 initial_rate = 5e-5
+{% endif %}

 [training.train_corpus]
 @readers = "spacy.Corpus.v1"

View File

@@ -329,7 +329,11 @@ class EntityRenderer:
             else:
                 markup += entity
             offset = end
-        markup += escape_html(text[offset:])
+        fragments = text[offset:].split("\n")
+        for i, fragment in enumerate(fragments):
+            markup += escape_html(fragment)
+            if len(fragments) > 1 and i != len(fragments) - 1:
+                markup += "</br>"
         markup = TPL_ENTS.format(content=markup, dir=self.direction)
         if title:
             markup = TPL_TITLE.format(title=title) + markup
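A rough sketch of the visible effect of this change: newlines in the text that follows the last entity are now emitted as break tags instead of disappearing into the escaped markup. The blank pipeline and the hand-set entity span below are assumptions for illustration only:

    import spacy
    from spacy import displacy

    nlp = spacy.blank("en")
    doc = nlp("Alice visited Berlin.\nThen she flew home.")
    doc.ents = [doc.char_span(14, 20, label="GPE")]  # "Berlin", set by hand for the demo

    html = displacy.render(doc, style="ent", page=False)
    # The "\n" after the first sentence now survives as a break tag.
    print("</br>" in html)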

View File

@@ -76,6 +76,10 @@ class Warnings:
             "If this is surprising, make sure you have the spacy-lookups-data "
             "package installed. The languages with lexeme normalization tables "
             "are currently: {langs}")
+    W034 = ("Please install the package spacy-lookups-data in order to include "
+            "the default lexeme normalization table for the language '{lang}'.")
+    W035 = ('Discarding subpattern "{pattern}" due to an unrecognized '
+            "attribute or operator.")

     # TODO: fix numbering after merging develop into master
     W090 = ("Could not locate any binary .spacy files in path '{path}'.")
@@ -284,12 +288,12 @@ class Errors:
             "Span objects, or dicts if set to manual=True.")
     E097 = ("Invalid pattern: expected token pattern (list of dicts) or "
             "phrase pattern (string) but got:\n{pattern}")
-    E098 = ("Invalid pattern specified: expected both SPEC and PATTERN.")
-    E099 = ("First node of pattern should be a root node. The root should "
-            "only contain NODE_NAME.")
-    E100 = ("Nodes apart from the root should contain NODE_NAME, NBOR_NAME and "
-            "NBOR_RELOP.")
-    E101 = ("NODE_NAME should be a new node and NBOR_NAME should already have "
+    E098 = ("Invalid pattern: expected both RIGHT_ID and RIGHT_ATTRS.")
+    E099 = ("Invalid pattern: the first node of pattern should be an anchor "
+            "node. The node should only contain RIGHT_ID and RIGHT_ATTRS.")
+    E100 = ("Nodes other than the anchor node should all contain LEFT_ID, "
+            "REL_OP and RIGHT_ID.")
+    E101 = ("RIGHT_ID should be a new node and LEFT_ID should already have "
             "have been declared in previous edges.")
     E102 = ("Can't merge non-disjoint spans. '{token}' is already part of "
             "tokens to merge. If you want to find the longest non-overlapping "
@@ -474,6 +478,9 @@ class Errors:
     E198 = ("Unable to return {n} most similar vectors for the current vectors "
             "table, which contains {n_rows} vectors.")
     E199 = ("Unable to merge 0-length span at doc[{start}:{end}].")
+    E200 = ("Specifying a base model with a pretrained component '{component}' "
+            "can not be combined with adding a pretrained Tok2Vec layer.")
+    E201 = ("Span index out of range.")

     # TODO: fix numbering after merging develop into master
     E925 = ("Invalid color values for displaCy visualizer: expected dictionary "
@@ -654,6 +661,9 @@ class Errors:
             "'{chunk}'. Tokenizer exceptions are only allowed to specify "
             "`ORTH` and `NORM`.")
     E1006 = ("Unable to initialize {name} model with 0 labels.")
+    E1007 = ("Unsupported DependencyMatcher operator '{op}'.")
+    E1008 = ("Invalid pattern: each pattern should be a list of dicts. Check "
+            "that you are providing a list of patterns as `List[List[dict]]`.")


 @add_codes

View File

@ -1,9 +1,11 @@
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from ...language import Language from ...language import Language
class CzechDefaults(Language.Defaults): class CzechDefaults(Language.Defaults):
stop_words = STOP_WORDS stop_words = STOP_WORDS
lex_attr_getters = LEX_ATTRS
class Czech(Language): class Czech(Language):

38
spacy/lang/cs/examples.py Normal file
View File

@ -0,0 +1,38 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.cs.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Máma mele maso.",
"Příliš žluťoučký kůň úpěl ďábelské ódy.",
"ArcGIS je geografický informační systém určený pro práci s prostorovými daty.",
"Může data vytvářet a spravovat, ale především je dokáže analyzovat, najít v nich nové vztahy a vše přehledně vizualizovat.",
"Dnes je krásné počasí.",
"Nestihl autobus, protože pozdě vstal z postele.",
"Než budeš jíst, jdi si umýt ruce.",
"Dnes je neděle.",
"Škola začíná v 8:00.",
"Poslední autobus jede v jedenáct hodin večer.",
"V roce 2020 se téměř zastavila světová ekonomika.",
"Praha je hlavní město České republiky.",
"Kdy půjdeš ven?",
"Kam pojedete na dovolenou?",
"Kolik stojí iPhone 12?",
"Průměrná mzda je 30000 Kč.",
"1. ledna 1993 byla založena Česká republika.",
"Co se stalo 21.8.1968?",
"Moje telefonní číslo je 712 345 678.",
"Můj pes má blechy.",
"Když bude přes noc více než 20°, tak nás čeká tropická noc.",
"Kolik bylo letos tropických nocí?",
"Jak to mám udělat?",
"Bydlíme ve čtvrtém patře.",
"Vysílají 30. sezonu seriálu Simpsonovi.",
"Adresa ČVUT je Thákurova 7, 166 29, Praha 6.",
"Jaké PSČ má Praha 1?",
"PSČ Prahy 1 je 110 00.",
"Za 20 minut jede vlak.",
]

View File

@ -0,0 +1,61 @@
from ...attrs import LIKE_NUM
_num_words = [
"nula",
"jedna",
"dva",
"tři",
"čtyři",
"pět",
"šest",
"sedm",
"osm",
"devět",
"deset",
"jedenáct",
"dvanáct",
"třináct",
"čtrnáct",
"patnáct",
"šestnáct",
"sedmnáct",
"osmnáct",
"devatenáct",
"dvacet",
"třicet",
"čtyřicet",
"padesát",
"šedesát",
"sedmdesát",
"osmdesát",
"devadesát",
"sto",
"tisíc",
"milion",
"miliarda",
"bilion",
"biliarda",
"trilion",
"triliarda",
"kvadrilion",
"kvadriliarda",
"kvintilion",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}
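
With the new lex_attrs wired into CzechDefaults above, LIKE_NUM is available for a blank Czech pipeline. A small illustrative example (the sentence is chosen arbitrarily):

import spacy

nlp = spacy.blank("cs")
doc = nlp("Koupil jsem dvacet rohlíků za 30 Kč.")
# like_num covers digits as well as the Czech number words listed above
print([t.text for t in doc if t.like_num])  # ['dvacet', '30']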

View File

@ -1,14 +1,23 @@
# Source: https://github.com/Alir3z4/stop-words # Source: https://github.com/Alir3z4/stop-words
# Source: https://github.com/stopwords-iso/stopwords-cs/blob/master/stopwords-cs.txt
STOP_WORDS = set( STOP_WORDS = set(
""" """
ačkoli a
aby
ahoj ahoj
ačkoli
ale ale
alespoň
anebo anebo
ani
aniž
ano ano
atd.
atp.
asi asi
aspoň aspoň
během během
bez bez
beze beze
@ -21,12 +30,14 @@ budeš
budete budete
budou budou
budu budu
by
byl byl
byla byla
byli byli
bylo bylo
byly byly
bys bys
být
čau čau
chce chce
chceme chceme
@ -35,14 +46,21 @@ chcete
chci chci
chtějí chtějí
chtít chtít
chut' chuť
chuti chuti
co co
což
cz
či
článek
článku
články
čtrnáct čtrnáct
čtyři čtyři
dál dál
dále dále
daleko daleko
další
děkovat děkovat
děkujeme děkujeme
děkuji děkuji
@ -50,6 +68,7 @@ den
deset deset
devatenáct devatenáct
devět devět
dnes
do do
dobrý dobrý
docela docela
@ -57,9 +76,15 @@ dva
dvacet dvacet
dvanáct dvanáct
dvě dvě
email
ho
hodně hodně
i
jak jak
jakmile
jako
jakož
jde jde
je je
jeden jeden
@ -69,25 +94,39 @@ jedno
jednou jednou
jedou jedou
jeho jeho
jehož
jej
její její
jejich jejich
jejichž
jehož
jelikož
jemu jemu
jen jen
jenom jenom
jenž
jež
ještě ještě
jestli jestli
jestliže jestliže
ještě
ji
jich jich
jím jím
jim
jimi jimi
jinak jinak
jsem jiné
již
jsi jsi
jsme jsme
jsem
jsou jsou
jste jste
k
kam kam
každý
kde kde
kdo kdo
kdy kdy
@ -96,10 +135,13 @@ ke
kolik kolik
kromě kromě
která která
kterak
kterou
které které
kteří kteří
který který
kvůli kvůli
ku
mají mají
málo málo
@ -110,8 +152,10 @@ máte
mezi mezi
mi
mít mít
mne
mně mně
mnou mnou
moc moc
@ -134,6 +178,7 @@ nás
náš náš
naše naše
naši naši
načež
ne ne
nebo nebo
@ -141,6 +186,7 @@ nebyl
nebyla nebyla
nebyli nebyli
nebyly nebyly
nechť
něco něco
nedělá nedělá
nedělají nedělají
@ -150,6 +196,7 @@ neděláš
neděláte neděláte
nějak nějak
nejsi nejsi
nejsou
někde někde
někdo někdo
nemají nemají
@ -157,15 +204,22 @@ nemáme
nemáte nemáte
neměl neměl
němu němu
němuž
není není
nestačí nestačí
nevadí nevadí
nové
nový
noví
než než
nic nic
nich nich
ním ním
nimi nimi
nula nula
o
od od
ode ode
on on
@ -179,22 +233,37 @@ pak
patnáct patnáct
pět pět
po po
pod
pokud
pořád pořád
pouze
potom potom
pozdě pozdě
pravé
před před
přede
přes přes
přese přece
pro pro
proč proč
prosím prosím
prostě prostě
proto
proti proti
první
právě
protože protože
při
přičemž
rovně rovně
s
se se
sedm sedm
sedmnáct sedmnáct
si
sice
skoro
sic
šest šest
šestnáct šestnáct
skoro skoro
@ -203,41 +272,69 @@ smí
snad snad
spolu spolu
sta sta
svůj
své
svá
svých
svým
svými
svůj
sté sté
sto sto
strana
ta ta
tady tady
tak tak
takhle takhle
taky taky
také
takže
tam tam
tamhle támhle
tamhleto támhleto
tamto tamto
tebe tebe
tebou tebou
ted' teď
tedy tedy
ten ten
tento
této
ti ti
tím
tímto
tisíc tisíc
tisíce tisíce
to to
tobě tobě
tohle tohle
tohoto
tom
tomto
tomu
tomuto
toto toto
třeba třeba
tři tři
třináct třináct
trošku trošku
trochu
tu
tuto
tvá tvá
tvé tvé
tvoje tvoje
tvůj tvůj
ty ty
tyto
těm
těma
těmi
u
určitě určitě
v
vám vám
vámi vámi
vás vás
@ -247,13 +344,19 @@ vaši
ve ve
večer večer
vedle vedle
více
vlastně vlastně
však
všechen
všechno všechno
všichni všichni
vůbec vůbec
vy vy
vždy vždy
z
zda
za za
zde
zač zač
zatímco zatímco
ze ze

View File

View File

@ -8,6 +8,14 @@ _num_words = [
"fifty", "sixty", "seventy", "eighty", "ninety", "hundred", "thousand", "fifty", "sixty", "seventy", "eighty", "ninety", "hundred", "thousand",
"million", "billion", "trillion", "quadrillion", "gajillion", "bazillion" "million", "billion", "trillion", "quadrillion", "gajillion", "bazillion"
] ]
_ordinal_words = [
"first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth",
"ninth", "tenth", "eleventh", "twelfth", "thirteenth", "fourteenth",
"fifteenth", "sixteenth", "seventeenth", "eighteenth", "nineteenth",
"twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth",
"eightieth", "ninetieth", "hundredth", "thousandth", "millionth", "billionth",
"trillionth", "quadrillionth", "gajillionth", "bazillionth"
]
# fmt: on # fmt: on
@ -21,8 +29,15 @@ def like_num(text: str) -> bool:
num, denom = text.split("/") num, denom = text.split("/")
if num.isdigit() and denom.isdigit(): if num.isdigit() and denom.isdigit():
return True return True
if text.lower() in _num_words: text_lower = text.lower()
if text_lower in _num_words:
return True return True
# Check ordinal number
if text_lower in _ordinal_words:
return True
if text_lower.endswith("th"):
if text_lower[:-2].isdigit():
return True
return False return False
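
The English change above extends like_num to ordinals, both the spelled-out forms and digits followed by "th". A brief sketch of the effect:

from spacy.lang.en.lex_attrs import like_num

print(like_num("third"))      # True: listed in _ordinal_words
print(like_num("Millionth"))  # True: the lookup lower-cases first
print(like_num("100th"))      # True: digits followed by "th"
print(like_num("101st"))      # False: only the "th" suffix is special-cased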

View File

@ -19,8 +19,7 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
np_left_deps = [doc.vocab.strings.add(label) for label in left_labels] np_left_deps = [doc.vocab.strings.add(label) for label in left_labels]
np_right_deps = [doc.vocab.strings.add(label) for label in right_labels] np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
stop_deps = [doc.vocab.strings.add(label) for label in stop_labels] stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
token = doc[0] for token in doclike:
while token and token.i < len(doclike):
if token.pos in [PROPN, NOUN, PRON]: if token.pos in [PROPN, NOUN, PRON]:
left, right = noun_bounds( left, right = noun_bounds(
doc, token, np_left_deps, np_right_deps, stop_deps doc, token, np_left_deps, np_right_deps, stop_deps

View File

@ -1,9 +1,11 @@
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from ...language import Language from ...language import Language
class HebrewDefaults(Language.Defaults): class HebrewDefaults(Language.Defaults):
stop_words = STOP_WORDS stop_words = STOP_WORDS
lex_attr_getters = LEX_ATTRS
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True} writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}

View File

@ -0,0 +1,95 @@
from ...attrs import LIKE_NUM
_num_words = [
"אפס",
"אחד",
"אחת",
"שתיים",
"שתים",
"שניים",
"שנים",
"שלוש",
"שלושה",
"ארבע",
"ארבעה",
"חמש",
"חמישה",
"שש",
"שישה",
"שבע",
"שבעה",
"שמונה",
"תשע",
"תשעה",
"עשר",
"עשרה",
"אחד עשר",
"אחת עשרה",
"שנים עשר",
"שתים עשרה",
"שלושה עשר",
"שלוש עשרה",
"ארבעה עשר",
"ארבע עשרה",
"חמישה עשר",
"חמש עשרה",
"ששה עשר",
"שש עשרה",
"שבעה עשר",
"שבע עשרה",
"שמונה עשר",
"שמונה עשרה",
"תשעה עשר",
"תשע עשרה",
"עשרים",
"שלושים",
"ארבעים",
"חמישים",
"שישים",
"שבעים",
"שמונים",
"תשעים",
"מאה",
"אלף",
"מליון",
"מליארד",
"טריליון",
]
_ordinal_words = [
"ראשון",
"שני",
"שלישי",
"רביעי",
"חמישי",
"שישי",
"שביעי",
"שמיני",
"תשיעי",
"עשירי",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
    # Check ordinal number
if text in _ordinal_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -39,7 +39,6 @@ STOP_WORDS = set(
בין בין
עם עם
עד עד
נגר
על על
אל אל
מול מול
@ -58,7 +57,7 @@ STOP_WORDS = set(
עליך עליך
עלינו עלינו
עליכם עליכם
לעיכן עליכן
עליהם עליהם
עליהן עליהן
כל כל
@ -67,8 +66,8 @@ STOP_WORDS = set(
כך כך
ככה ככה
כזה כזה
כזאת
זה זה
זות
אותי אותי
אותה אותה
אותם אותם
@ -91,7 +90,7 @@ STOP_WORDS = set(
איתכן איתכן
יהיה יהיה
תהיה תהיה
היתי הייתי
היתה היתה
היה היה
להיות להיות
@ -101,8 +100,6 @@ STOP_WORDS = set(
עצמם עצמם
עצמן עצמן
עצמנו עצמנו
עצמהם
עצמהן
מי מי
מה מה
איפה איפה
@ -153,6 +150,7 @@ STOP_WORDS = set(
לאו לאו
אי אי
כלל כלל
בעד
נגד נגד
אם אם
עם עם
@ -196,7 +194,6 @@ STOP_WORDS = set(
אשר אשר
ואילו ואילו
למרות למרות
אס
כמו כמו
כפי כפי
אז אז
@ -204,8 +201,8 @@ STOP_WORDS = set(
כן כן
לכן לכן
לפיכך לפיכך
מאד
עז עז
מאוד
מעט מעט
מעטים מעטים
במידה במידה

View File

@ -15,4 +15,6 @@ sentences = [
"फ्रांस के राष्ट्रपति कौन हैं?", "फ्रांस के राष्ट्रपति कौन हैं?",
"संयुक्त राज्यों की राजधानी क्या है?", "संयुक्त राज्यों की राजधानी क्या है?",
"बराक ओबामा का जन्म कब हुआ था?", "बराक ओबामा का जन्म कब हुआ था?",
"जवाहरलाल नेहरू भारत के पहले प्रधानमंत्री हैं।",
"राजेंद्र प्रसाद, भारत के पहले राष्ट्रपति, दो कार्यकाल के लिए कार्यालय रखने वाले एकमात्र व्यक्ति हैं।",
] ]

View File

@ -254,7 +254,7 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
return text_dtokens, text_spaces return text_dtokens, text_spaces
# align words and dtokens by referring text, and insert gap tokens for the space char spans # align words and dtokens by referring text, and insert gap tokens for the space char spans
for word, dtoken in zip(words, dtokens): for i, (word, dtoken) in enumerate(zip(words, dtokens)):
# skip all space tokens # skip all space tokens
if word.isspace(): if word.isspace():
continue continue
@ -275,7 +275,7 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
text_spaces.append(False) text_spaces.append(False)
text_pos += len(word) text_pos += len(word)
# poll a space char after the word # poll a space char after the word
if text_pos < len(text) and text[text_pos] == " ": if i + 1 < len(dtokens) and dtokens[i + 1].surface == " ":
text_spaces[-1] = True text_spaces[-1] = True
text_pos += 1 text_pos += 1

View File

@ -8,7 +8,7 @@ from .. import attrs
_like_email = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)").match _like_email = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)").match
_tlds = set( _tlds = set(
"com|org|edu|gov|net|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|" "com|org|edu|gov|net|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|"
"name|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|" "name|pro|tel|travel|xyz|icu|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|"
"ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|" "ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|"
"cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|" "cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|"
"ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|" "ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|"

View File

@ -1,7 +1,3 @@
# coding: utf8
from __future__ import unicode_literals
# Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt # Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt
STOP_WORDS = set( STOP_WORDS = set(

16
spacy/lang/sa/__init__.py Normal file
View File

@ -0,0 +1,16 @@
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from ...language import Language
class SanskritDefaults(Language.Defaults):
lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS
class Sanskrit(Language):
lang = "sa"
Defaults = SanskritDefaults
__all__ = ["Sanskrit"]

15
spacy/lang/sa/examples.py Normal file
View File

@ -0,0 +1,15 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.sa.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"अभ्यावहति कल्याणं विविधं वाक् सुभाषिता ।",
"मनसि व्याकुले चक्षुः पश्यन्नपि न पश्यति ।",
"यस्य बुद्धिर्बलं तस्य निर्बुद्धेस्तु कुतो बलम्?",
"परो अपि हितवान् बन्धुः बन्धुः अपि अहितः परः ।",
"अहितः देहजः व्याधिः हितम् आरण्यं औषधम् ॥",
]

127
spacy/lang/sa/lex_attrs.py Normal file
View File

@ -0,0 +1,127 @@
from ...attrs import LIKE_NUM
# reference 1: https://en.wikibooks.org/wiki/Sanskrit/Numbers
_num_words = [
"एकः",
"द्वौ",
"त्रयः",
"चत्वारः",
"पञ्च",
"षट्",
"सप्त",
"अष्ट",
"नव",
"दश",
"एकादश",
"द्वादश",
"त्रयोदश",
"चतुर्दश",
"पञ्चदश",
"षोडश",
"सप्तदश",
"अष्टादश",
"एकान्नविंशति",
"विंशति",
"एकाविंशति",
"द्वाविंशति",
"त्रयोविंशति",
"चतुर्विंशति",
"पञ्चविंशति",
"षड्विंशति",
"सप्तविंशति",
"अष्टाविंशति",
"एकान्नत्रिंशत्",
"त्रिंशत्",
"एकत्रिंशत्",
"द्वात्रिंशत्",
"त्रयत्रिंशत्",
"चतुस्त्रिंशत्",
"पञ्चत्रिंशत्",
"षट्त्रिंशत्",
"सप्तत्रिंशत्",
"अष्टात्रिंशत्",
"एकोनचत्वारिंशत्",
"चत्वारिंशत्",
"एकचत्वारिंशत्",
"द्वाचत्वारिंशत्",
"त्रयश्चत्वारिंशत्",
"चतुश्चत्वारिंशत्",
"पञ्चचत्वारिंशत्",
"षट्चत्वारिंशत्",
"सप्तचत्वारिंशत्",
"अष्टाचत्वारिंशत्",
"एकोनपञ्चाशत्",
"पञ्चाशत्",
"एकपञ्चाशत्",
"द्विपञ्चाशत्",
"त्रिपञ्चाशत्",
"चतुःपञ्चाशत्",
"पञ्चपञ्चाशत्",
"षट्पञ्चाशत्",
"सप्तपञ्चाशत्",
"अष्टपञ्चाशत्",
"एकोनषष्ठिः",
"षष्ठिः",
"एकषष्ठिः",
"द्विषष्ठिः",
"त्रिषष्ठिः",
"चतुःषष्ठिः",
"पञ्चषष्ठिः",
"षट्षष्ठिः",
"सप्तषष्ठिः",
"अष्टषष्ठिः",
"एकोनसप्ततिः",
"सप्ततिः",
"एकसप्ततिः",
"द्विसप्ततिः",
"त्रिसप्ततिः",
"चतुःसप्ततिः",
"पञ्चसप्ततिः",
"षट्सप्ततिः",
"सप्तसप्ततिः",
"अष्टसप्ततिः",
"एकोनाशीतिः",
"अशीतिः",
"एकाशीतिः",
"द्वशीतिः",
"त्र्यशीतिः",
"चतुरशीतिः",
"पञ्चाशीतिः",
"षडशीतिः",
"सप्ताशीतिः",
"अष्टाशीतिः",
"एकोननवतिः",
"नवतिः",
"एकनवतिः",
"द्विनवतिः",
"त्रिनवतिः",
"चतुर्नवतिः",
"पञ्चनवतिः",
"षण्णवतिः",
"सप्तनवतिः",
"अष्टनवतिः",
"एकोनशतम्",
"शतम्",
]
def like_num(text):
"""
Check if text resembles a number
"""
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

515
spacy/lang/sa/stop_words.py Normal file
View File

@ -0,0 +1,515 @@
# Source: https://gist.github.com/Akhilesh28/fe8b8e180f64b72e64751bc31cb6d323
STOP_WORDS = set(
"""
अहम
आव
वयम
आव
अस
मय
आव
असि
महयम
आव
असमभयम
मत
आव
असमत
मम
आवय
असकम
मयि
आवय
अस
वम
यम
वय
ि
यम
मभयम
वत
मत
तव
वय
कम
वयि
वय
तम
तस
तस
तस
तय
तसि
तय
तय
ि
तस
तस
तस
तय
तस
तय
तत
ि
तत
ि
तय
ि
तस
तस
तस
तय
तस
तय
अयम
इम
इम
इमम
इम
इम
अन
आभ
एभि
अस
आभ
एभ
अस
आभ
एभ
अस
अनय
एष
असि
अनय
एष
इयम
इम
इम
इम
इम
इम
अनय
आभ
आभि
अस
आभ
आभ
अस
आभ
आभ
अस
अनय
आस
अस
अनय
आस
इदम
इम
इमि
इदम
इम
इमि
अन
आभ
एभि
अस
आभ
एभ
अस
आभ
एभ
अस
अनय
एष
असि
अनय
एष
एष
एत
एत
एतम एनम
एत एन
एत एन
एत
एत
एत
एतस
एत
एत
एतस
एत
एत
एतस
एतसि
एत
एतसि
एतसि
एत
एष
एत
एत
एत एन
एत एन
एत एन
एतय एनय
एत
एति
एतस
एत
एत
एतस
एत
एत
एतस
एतय एनय
एत
एतस
एतय एनय
एत
एतत एतद
एत
एति
एतत एतद एनत एनद
एत एन
एति एनि
एत एन
एत
एत
एतस
एत
एत
एतस
एत
एत
एतस
एतय एनय
एत
एतसि
एतय एनय
एत
अस
अम
अम
अम
अम
अम
अम
अम
अमि
अम
अम
अम
अम
अम
अम
अम
अम
अम
अमि
अम
अम
अस
अम
अम
अम
अम
अम
अम
अम
अमि
अम
अम
अम
अम
अम
अम
अम
अम
अम
अम
अम
अम
अम
अम
अमि
अम
अम
अमि
अम
अम
अमि
अम
अम
अम
अम
अम
अम
अम
अम
अम
अमि
अम
अम
कम
कस
कस
कस
कय
कसि
कय
कय
ि
कस
कस
कस
कय
कस
कय
ि
ि
ि
ि
कस
कस
कस
कय
कसि
कय
भव
भवन
भवन
भवनतम
भवन
भवत
भवत
भवद
भवदि
भवत
भवद
भवद
भवत
भवद
भवद
भवत
भवत
भवत
भवति
भवत
भवत
भवत
भवत
भवत
भवत
भवत
भवत
भवत
भवत
भवति
भवत
भवत
भवति
भवत
भवत
भवति
भवत
भवत
भवत
भवत
भवत
भवत
भवत
भवत
भवनि
भवत
भवत
भवनि
भवत
भवद
भवदि
भवत
भवद
भवद
भवत
भवद
भवद
भवत
भवत
भवत
भवति
भवत
भवत
अय
अर
अर
अवि
अस
अस
अहह
अहवस
आम
आरयहलम
आह
आह
इस
उम
उव
चमत
टसत
ि
फत
बत
वट
यवसभति यवस
अति
अधि
अन
अप
अपि
अभि
अव
उद
उप
ि
ि
पर
परि
रति
ि
सम
अथव उत
अनयथ
इव
यदि
परन
यत करण ि यतस यदरथम यदर यरि यथ यतरणम ि
यथ यतस
यदयपि
अवध वति
रक
अह
एव
एवम
कचि
ि
पत
चण
तत
नकि
नह
नम
यस
मकि
मकि
यत
गपत
शशवत
पत
हन
ि
""".split()
)

View File

@ -34,13 +34,13 @@ URL_PATTERN = (
r"|" r"|"
# host & domain names # host & domain names
# mods: match is case-sensitive, so include [A-Z] # mods: match is case-sensitive, so include [A-Z]
"(?:" # noqa: E131 r"(?:" # noqa: E131
"(?:" r"(?:"
"[A-Za-z0-9\u00a1-\uffff]" r"[A-Za-z0-9\u00a1-\uffff]"
"[A-Za-z0-9\u00a1-\uffff_-]{0,62}" r"[A-Za-z0-9\u00a1-\uffff_-]{0,62}"
")?" r")?"
"[A-Za-z0-9\u00a1-\uffff]\." r"[A-Za-z0-9\u00a1-\uffff]\."
")+" r")+"
# TLD identifier # TLD identifier
# mods: use ALPHA_LOWER instead of a wider range so that this doesn't match # mods: use ALPHA_LOWER instead of a wider range so that this doesn't match
# strings like "lower.Upper", which can be split on "." by infixes in some # strings like "lower.Upper", which can be split on "." by infixes in some
@ -128,6 +128,8 @@ emoticons = set(
:-] :-]
[: [:
[-: [-:
[=
=]
:o) :o)
(o: (o:
:} :}
@ -159,6 +161,8 @@ emoticons = set(
=| =|
:| :|
:-| :-|
]=
=[
:1 :1
:P :P
:-P :-P

View File

@ -1,9 +1,8 @@
from typing import Optional, Any, Dict, Callable, Iterable, Union, List, Pattern from typing import Optional, Any, Dict, Callable, Iterable, Union, List, Pattern
from typing import Tuple, Iterator, Optional from typing import Tuple, Iterator
from dataclasses import dataclass from dataclasses import dataclass
import random import random
import itertools import itertools
import weakref
import functools import functools
from contextlib import contextmanager from contextlib import contextmanager
from copy import deepcopy from copy import deepcopy
@ -1378,8 +1377,6 @@ class Language:
docs = (self.make_doc(text) for text in texts) docs = (self.make_doc(text) for text in texts)
for pipe in pipes: for pipe in pipes:
docs = pipe(docs) docs = pipe(docs)
nr_seen = 0
for doc in docs: for doc in docs:
yield doc yield doc

View File

@ -1,16 +1,16 @@
# cython: infer_types=True, profile=True # cython: infer_types=True, profile=True
from cymem.cymem cimport Pool from typing import List
from preshed.maps cimport PreshMap
from libcpp cimport bool
import numpy import numpy
from cymem.cymem cimport Pool
from .matcher cimport Matcher from .matcher cimport Matcher
from ..vocab cimport Vocab from ..vocab cimport Vocab
from ..tokens.doc cimport Doc from ..tokens.doc cimport Doc
from .matcher import unpickle_matcher
from ..errors import Errors from ..errors import Errors
from ..tokens import Span
DELIMITER = "||" DELIMITER = "||"
@ -22,36 +22,52 @@ cdef class DependencyMatcher:
"""Match dependency parse tree based on pattern rules.""" """Match dependency parse tree based on pattern rules."""
cdef Pool mem cdef Pool mem
cdef readonly Vocab vocab cdef readonly Vocab vocab
cdef readonly Matcher token_matcher cdef readonly Matcher matcher
cdef public object _patterns cdef public object _patterns
cdef public object _raw_patterns
cdef public object _keys_to_token cdef public object _keys_to_token
cdef public object _root cdef public object _root
cdef public object _entities
cdef public object _callbacks cdef public object _callbacks
cdef public object _nodes cdef public object _nodes
cdef public object _tree cdef public object _tree
cdef public object _ops
def __init__(self, vocab): def __init__(self, vocab, *, validate=False):
"""Create the DependencyMatcher. """Create the DependencyMatcher.
vocab (Vocab): The vocabulary object, which must be shared with the vocab (Vocab): The vocabulary object, which must be shared with the
documents the matcher will operate on. documents the matcher will operate on.
validate (bool): Whether patterns should be validated, passed to
Matcher as `validate`
""" """
size = 20 size = 20
# TODO: make matcher work with validation self.matcher = Matcher(vocab, validate=validate)
self.token_matcher = Matcher(vocab, validate=False)
self._keys_to_token = {} self._keys_to_token = {}
self._patterns = {} self._patterns = {}
self._raw_patterns = {}
self._root = {} self._root = {}
self._nodes = {} self._nodes = {}
self._tree = {} self._tree = {}
self._entities = {}
self._callbacks = {} self._callbacks = {}
self.vocab = vocab self.vocab = vocab
self.mem = Pool() self.mem = Pool()
self._ops = {
"<": self.dep,
">": self.gov,
"<<": self.dep_chain,
">>": self.gov_chain,
".": self.imm_precede,
".*": self.precede,
";": self.imm_follow,
";*": self.follow,
"$+": self.imm_right_sib,
"$-": self.imm_left_sib,
"$++": self.right_sib,
"$--": self.left_sib,
}
def __reduce__(self): def __reduce__(self):
data = (self.vocab, self._patterns,self._tree, self._callbacks) data = (self.vocab, self._raw_patterns, self._callbacks)
return (unpickle_matcher, data, None, None) return (unpickle_matcher, data, None, None)
def __len__(self): def __len__(self):
@ -74,54 +90,61 @@ cdef class DependencyMatcher:
idx = 0 idx = 0
visited_nodes = {} visited_nodes = {}
for relation in pattern: for relation in pattern:
if "PATTERN" not in relation or "SPEC" not in relation: if not isinstance(relation, dict):
raise ValueError(Errors.E1008)
if "RIGHT_ATTRS" not in relation and "RIGHT_ID" not in relation:
raise ValueError(Errors.E098.format(key=key)) raise ValueError(Errors.E098.format(key=key))
if idx == 0: if idx == 0:
if not( if not(
"NODE_NAME" in relation["SPEC"] "RIGHT_ID" in relation
and "NBOR_RELOP" not in relation["SPEC"] and "REL_OP" not in relation
and "NBOR_NAME" not in relation["SPEC"] and "LEFT_ID" not in relation
): ):
raise ValueError(Errors.E099.format(key=key)) raise ValueError(Errors.E099.format(key=key))
visited_nodes[relation["SPEC"]["NODE_NAME"]] = True visited_nodes[relation["RIGHT_ID"]] = True
else: else:
if not( if not(
"NODE_NAME" in relation["SPEC"] "RIGHT_ID" in relation
and "NBOR_RELOP" in relation["SPEC"] and "RIGHT_ATTRS" in relation
and "NBOR_NAME" in relation["SPEC"] and "REL_OP" in relation
and "LEFT_ID" in relation
): ):
raise ValueError(Errors.E100.format(key=key)) raise ValueError(Errors.E100.format(key=key))
if ( if (
relation["SPEC"]["NODE_NAME"] in visited_nodes relation["RIGHT_ID"] in visited_nodes
or relation["SPEC"]["NBOR_NAME"] not in visited_nodes or relation["LEFT_ID"] not in visited_nodes
): ):
raise ValueError(Errors.E101.format(key=key)) raise ValueError(Errors.E101.format(key=key))
visited_nodes[relation["SPEC"]["NODE_NAME"]] = True if relation["REL_OP"] not in self._ops:
visited_nodes[relation["SPEC"]["NBOR_NAME"]] = True raise ValueError(Errors.E1007.format(op=relation["REL_OP"]))
visited_nodes[relation["RIGHT_ID"]] = True
visited_nodes[relation["LEFT_ID"]] = True
idx = idx + 1 idx = idx + 1
def add(self, key, patterns, *_patterns, on_match=None): def add(self, key, patterns, *, on_match=None):
"""Add a new matcher rule to the matcher. """Add a new matcher rule to the matcher.
key (str): The match ID. key (str): The match ID.
patterns (list): The patterns to add for the given key. patterns (list): The patterns to add for the given key.
on_match (callable): Optional callback executed on match. on_match (callable): Optional callback executed on match.
""" """
if patterns is None or hasattr(patterns, "__call__"): # old API if on_match is not None and not hasattr(on_match, "__call__"):
on_match = patterns raise ValueError(Errors.E171.format(arg_type=type(on_match)))
patterns = _patterns if patterns is None or not isinstance(patterns, List): # old API
raise ValueError(Errors.E948.format(arg_type=type(patterns)))
for pattern in patterns: for pattern in patterns:
if len(pattern) == 0: if len(pattern) == 0:
raise ValueError(Errors.E012.format(key=key)) raise ValueError(Errors.E012.format(key=key))
self.validate_input(pattern,key) self.validate_input(pattern, key)
key = self._normalize_key(key) key = self._normalize_key(key)
self._raw_patterns.setdefault(key, [])
self._raw_patterns[key].extend(patterns)
_patterns = [] _patterns = []
for pattern in patterns: for pattern in patterns:
token_patterns = [] token_patterns = []
for i in range(len(pattern)): for i in range(len(pattern)):
token_pattern = [pattern[i]["PATTERN"]] token_pattern = [pattern[i]["RIGHT_ATTRS"]]
token_patterns.append(token_pattern) token_patterns.append(token_pattern)
# self.patterns.append(token_patterns)
_patterns.append(token_patterns) _patterns.append(token_patterns)
self._patterns.setdefault(key, []) self._patterns.setdefault(key, [])
self._callbacks[key] = on_match self._callbacks[key] = on_match
@ -135,7 +158,7 @@ cdef class DependencyMatcher:
# TODO: Better ways to hash edges in pattern? # TODO: Better ways to hash edges in pattern?
for j in range(len(_patterns[i])): for j in range(len(_patterns[i])):
k = self._normalize_key(unicode(key) + DELIMITER + unicode(i) + DELIMITER + unicode(j)) k = self._normalize_key(unicode(key) + DELIMITER + unicode(i) + DELIMITER + unicode(j))
self.token_matcher.add(k, [_patterns[i][j]]) self.matcher.add(k, [_patterns[i][j]])
_keys_to_token[k] = j _keys_to_token[k] = j
_keys_to_token_list.append(_keys_to_token) _keys_to_token_list.append(_keys_to_token)
self._keys_to_token.setdefault(key, []) self._keys_to_token.setdefault(key, [])
@ -144,14 +167,14 @@ cdef class DependencyMatcher:
for pattern in patterns: for pattern in patterns:
nodes = {} nodes = {}
for i in range(len(pattern)): for i in range(len(pattern)):
nodes[pattern[i]["SPEC"]["NODE_NAME"]] = i nodes[pattern[i]["RIGHT_ID"]] = i
_nodes_list.append(nodes) _nodes_list.append(nodes)
self._nodes.setdefault(key, []) self._nodes.setdefault(key, [])
self._nodes[key].extend(_nodes_list) self._nodes[key].extend(_nodes_list)
# Create an object tree to traverse later on. This data structure # Create an object tree to traverse later on. This data structure
# enables easy tree pattern match. Doc-Token based tree cannot be # enables easy tree pattern match. Doc-Token based tree cannot be
# reused since it is memory-heavy and tightly coupled with the Doc. # reused since it is memory-heavy and tightly coupled with the Doc.
self.retrieve_tree(patterns, _nodes_list,key) self.retrieve_tree(patterns, _nodes_list, key)
def retrieve_tree(self, patterns, _nodes_list, key): def retrieve_tree(self, patterns, _nodes_list, key):
_heads_list = [] _heads_list = []
@ -161,13 +184,13 @@ cdef class DependencyMatcher:
root = -1 root = -1
for j in range(len(patterns[i])): for j in range(len(patterns[i])):
token_pattern = patterns[i][j] token_pattern = patterns[i][j]
if ("NBOR_RELOP" not in token_pattern["SPEC"]): if ("REL_OP" not in token_pattern):
heads[j] = ('root', j) heads[j] = ('root', j)
root = j root = j
else: else:
heads[j] = ( heads[j] = (
token_pattern["SPEC"]["NBOR_RELOP"], token_pattern["REL_OP"],
_nodes_list[i][token_pattern["SPEC"]["NBOR_NAME"]] _nodes_list[i][token_pattern["LEFT_ID"]]
) )
_heads_list.append(heads) _heads_list.append(heads)
_root_list.append(root) _root_list.append(root)
@ -202,11 +225,21 @@ cdef class DependencyMatcher:
RETURNS (tuple): The rule, as an (on_match, patterns) tuple. RETURNS (tuple): The rule, as an (on_match, patterns) tuple.
""" """
key = self._normalize_key(key) key = self._normalize_key(key)
if key not in self._patterns: if key not in self._raw_patterns:
return default return default
return (self._callbacks[key], self._patterns[key]) return (self._callbacks[key], self._raw_patterns[key])
def __call__(self, Doc doc): def remove(self, key):
key = self._normalize_key(key)
if not key in self._patterns:
raise ValueError(Errors.E175.format(key=key))
self._patterns.pop(key)
self._raw_patterns.pop(key)
self._nodes.pop(key)
self._tree.pop(key)
self._root.pop(key)
def __call__(self, object doclike):
"""Find all token sequences matching the supplied pattern. """Find all token sequences matching the supplied pattern.
doclike (Doc or Span): The document to match over. doclike (Doc or Span): The document to match over.
@ -214,8 +247,14 @@ cdef class DependencyMatcher:
describing the matches. A match tuple describes a span describing the matches. A match tuple describes a span
`doc[start:end]`. The `label_id` and `key` are both integers. `doc[start:end]`. The `label_id` and `key` are both integers.
""" """
if isinstance(doclike, Doc):
doc = doclike
elif isinstance(doclike, Span):
doc = doclike.as_doc()
else:
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
matched_key_trees = [] matched_key_trees = []
matches = self.token_matcher(doc) matches = self.matcher(doc)
for key in list(self._patterns.keys()): for key in list(self._patterns.keys()):
_patterns_list = self._patterns[key] _patterns_list = self._patterns[key]
_keys_to_token_list = self._keys_to_token[key] _keys_to_token_list = self._keys_to_token[key]
@ -244,26 +283,26 @@ cdef class DependencyMatcher:
length = len(_nodes) length = len(_nodes)
matched_trees = [] matched_trees = []
self.recurse(_tree,id_to_position,_node_operator_map,0,[],matched_trees) self.recurse(_tree, id_to_position, _node_operator_map, 0, [], matched_trees)
matched_key_trees.append((key,matched_trees)) for matched_tree in matched_trees:
matched_key_trees.append((key, matched_tree))
for i, (ent_id, nodes) in enumerate(matched_key_trees): for i, (match_id, nodes) in enumerate(matched_key_trees):
on_match = self._callbacks.get(ent_id) on_match = self._callbacks.get(match_id)
if on_match is not None: if on_match is not None:
on_match(self, doc, i, matched_key_trees) on_match(self, doc, i, matched_key_trees)
return matched_key_trees return matched_key_trees
def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visited_nodes,matched_trees): def recurse(self, tree, id_to_position, _node_operator_map, int patternLength, visited_nodes, matched_trees):
cdef bool isValid; cdef bint isValid;
if(patternLength == len(id_to_position.keys())): if patternLength == len(id_to_position.keys()):
isValid = True isValid = True
for node in range(patternLength): for node in range(patternLength):
if(node in tree): if node in tree:
for idx, (relop,nbor) in enumerate(tree[node]): for idx, (relop,nbor) in enumerate(tree[node]):
computed_nbors = numpy.asarray(_node_operator_map[visited_nodes[node]][relop]) computed_nbors = numpy.asarray(_node_operator_map[visited_nodes[node]][relop])
isNbor = False isNbor = False
for computed_nbor in computed_nbors: for computed_nbor in computed_nbors:
if(computed_nbor.i == visited_nodes[nbor]): if computed_nbor.i == visited_nodes[nbor]:
isNbor = True isNbor = True
isValid = isValid & isNbor isValid = isValid & isNbor
if(isValid): if(isValid):
@ -271,14 +310,14 @@ cdef class DependencyMatcher:
return return
allPatternNodes = numpy.asarray(id_to_position[patternLength]) allPatternNodes = numpy.asarray(id_to_position[patternLength])
for patternNode in allPatternNodes: for patternNode in allPatternNodes:
self.recurse(tree,id_to_position,_node_operator_map,patternLength+1,visited_nodes+[patternNode],matched_trees) self.recurse(tree, id_to_position, _node_operator_map, patternLength+1, visited_nodes+[patternNode], matched_trees)
# Given a node and an edge operator, to return the list of nodes # Given a node and an edge operator, to return the list of nodes
# from the doc that belong to node+operator. This is used to store # from the doc that belong to node+operator. This is used to store
# all the results beforehand to prevent unnecessary computation while # all the results beforehand to prevent unnecessary computation while
# pattern matching # pattern matching
# _node_operator_map[node][operator] = [...] # _node_operator_map[node][operator] = [...]
def get_node_operator_map(self,doc,tree,id_to_position,nodes,root): def get_node_operator_map(self, doc, tree, id_to_position, nodes, root):
_node_operator_map = {} _node_operator_map = {}
all_node_indices = nodes.values() all_node_indices = nodes.values()
all_operators = [] all_operators = []
@ -295,24 +334,14 @@ cdef class DependencyMatcher:
_node_operator_map[node] = {} _node_operator_map[node] = {}
for operator in all_operators: for operator in all_operators:
_node_operator_map[node][operator] = [] _node_operator_map[node][operator] = []
# Used to invoke methods for each operator
switcher = {
"<": self.dep,
">": self.gov,
"<<": self.dep_chain,
">>": self.gov_chain,
".": self.imm_precede,
"$+": self.imm_right_sib,
"$-": self.imm_left_sib,
"$++": self.right_sib,
"$--": self.left_sib
}
for operator in all_operators: for operator in all_operators:
for node in all_nodes: for node in all_nodes:
_node_operator_map[node][operator] = switcher.get(operator)(doc,node) _node_operator_map[node][operator] = self._ops.get(operator)(doc, node)
return _node_operator_map return _node_operator_map
def dep(self, doc, node): def dep(self, doc, node):
if doc[node].head == doc[node]:
return []
return [doc[node].head] return [doc[node].head]
def gov(self,doc,node): def gov(self,doc,node):
@ -322,36 +351,51 @@ cdef class DependencyMatcher:
return list(doc[node].ancestors) return list(doc[node].ancestors)
def gov_chain(self, doc, node): def gov_chain(self, doc, node):
return list(doc[node].subtree) return [t for t in doc[node].subtree if t != doc[node]]
def imm_precede(self, doc, node): def imm_precede(self, doc, node):
if node > 0: sent = self._get_sent(doc[node])
if node < len(doc) - 1 and doc[node + 1] in sent:
return [doc[node + 1]]
return []
def precede(self, doc, node):
sent = self._get_sent(doc[node])
return [doc[i] for i in range(node + 1, sent.end)]
def imm_follow(self, doc, node):
sent = self._get_sent(doc[node])
if node > 0 and doc[node - 1] in sent:
return [doc[node - 1]] return [doc[node - 1]]
return [] return []
def follow(self, doc, node):
sent = self._get_sent(doc[node])
return [doc[i] for i in range(sent.start, node)]
def imm_right_sib(self, doc, node): def imm_right_sib(self, doc, node):
for child in list(doc[node].head.children): for child in list(doc[node].head.children):
if child.i == node - 1: if child.i == node + 1:
return [doc[child.i]] return [doc[child.i]]
return [] return []
def imm_left_sib(self, doc, node): def imm_left_sib(self, doc, node):
for child in list(doc[node].head.children): for child in list(doc[node].head.children):
if child.i == node + 1: if child.i == node - 1:
return [doc[child.i]] return [doc[child.i]]
return [] return []
def right_sib(self, doc, node): def right_sib(self, doc, node):
candidate_children = [] candidate_children = []
for child in list(doc[node].head.children): for child in list(doc[node].head.children):
if child.i < node: if child.i > node:
candidate_children.append(doc[child.i]) candidate_children.append(doc[child.i])
return candidate_children return candidate_children
def left_sib(self, doc, node): def left_sib(self, doc, node):
candidate_children = [] candidate_children = []
for child in list(doc[node].head.children): for child in list(doc[node].head.children):
if child.i > node: if child.i < node:
candidate_children.append(doc[child.i]) candidate_children.append(doc[child.i])
return candidate_children return candidate_children
@ -360,3 +404,15 @@ cdef class DependencyMatcher:
return self.vocab.strings.add(key) return self.vocab.strings.add(key)
else: else:
return key return key
def _get_sent(self, token):
root = (list(token.ancestors) or [token])[-1]
return token.doc[root.left_edge.i:root.right_edge.i + 1]
def unpickle_matcher(vocab, patterns, callbacks):
matcher = DependencyMatcher(vocab)
for key, pattern in patterns.items():
callback = callbacks.get(key, None)
matcher.add(key, pattern, on_match=callback)
return matcher
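
Taken together, the dependencymatcher.pyx changes above rename the pattern keys (SPEC, PATTERN, NODE_NAME and NBOR_* become RIGHT_ATTRS, RIGHT_ID, LEFT_ID and REL_OP), add the ".*", ";" and ";*" operators plus a remove() method, and accept a Span as input. A compact usage sketch of the new API (the pipeline name en_core_web_sm and the example sentence are assumptions, not part of the diff):

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")  # any pipeline with a dependency parser
pattern = [
    # anchor node: only RIGHT_ID and RIGHT_ATTRS
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    # every other node: LEFT_ID, REL_OP, RIGHT_ID and RIGHT_ATTRS
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
matcher = DependencyMatcher(nlp.vocab, validate=True)
matcher.add("VERB_SUBJECT", [pattern])
doc = nlp("The quick brown fox jumped over the lazy dog.")
for match_id, token_ids in matcher(doc):
    # token_ids follow the order of the nodes in the pattern
    print([doc[i].text for i in token_ids])  # e.g. ['jumped', 'fox']
matcher.remove("VERB_SUBJECT")  # removal is new in this change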

View File

@ -829,9 +829,11 @@ def _get_extra_predicates(spec, extra_predicates):
attr = "ORTH" attr = "ORTH"
attr = IDS.get(attr.upper()) attr = IDS.get(attr.upper())
if isinstance(value, dict): if isinstance(value, dict):
processed = False
value_with_upper_keys = {k.upper(): v for k, v in value.items()}
for type_, cls in predicate_types.items(): for type_, cls in predicate_types.items():
if type_ in value: if type_ in value_with_upper_keys:
predicate = cls(len(extra_predicates), attr, value[type_], type_) predicate = cls(len(extra_predicates), attr, value_with_upper_keys[type_], type_)
# Don't create a redundant predicates. # Don't create a redundant predicates.
# This helps with efficiency, as we're caching the results. # This helps with efficiency, as we're caching the results.
if predicate.key in seen_predicates: if predicate.key in seen_predicates:
@ -840,6 +842,9 @@ def _get_extra_predicates(spec, extra_predicates):
extra_predicates.append(predicate) extra_predicates.append(predicate)
output.append(predicate.i) output.append(predicate.i)
seen_predicates[predicate.key] = predicate.i seen_predicates[predicate.key] = predicate.i
processed = True
if not processed:
warnings.warn(Warnings.W035.format(pattern=value))
return output return output
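
The matcher.pyx change above upper-cases the keys of rich-value subpatterns before looking up a predicate type, and reports unrecognised keys (warning W035) instead of silently ignoring them. A small illustrative sketch (the pattern and text are invented):

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
# lower-case predicate keys such as "regex" are now normalised, so both
# of these rules behave identically
matcher.add("UPPER", [[{"TEXT": {"REGEX": "^sp"}}]])
matcher.add("lower", [[{"TEXT": {"regex": "^sp"}}]])
doc = nlp("spaCy patterns accept lower-case predicate keys now")
print(len(matcher(doc)))  # 2: both rules match the token "spaCy"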

View File

@ -156,7 +156,7 @@ cdef class DependencyParser(Parser):
results = {} results = {}
results.update(Scorer.score_spans(examples, "sents", **kwargs)) results.update(Scorer.score_spans(examples, "sents", **kwargs))
kwargs.setdefault("getter", dep_getter) kwargs.setdefault("getter", dep_getter)
kwargs.setdefault("ignore_label", ("p", "punct")) kwargs.setdefault("ignore_labels", ("p", "punct"))
results.update(Scorer.score_deps(examples, "dep", **kwargs)) results.update(Scorer.score_deps(examples, "dep", **kwargs))
del results["sents_per_type"] del results["sents_per_type"]
return results return results

View File

@ -133,7 +133,7 @@ class EntityRuler:
matches = set( matches = set(
[(m_id, start, end) for m_id, start, end in matches if start != end] [(m_id, start, end) for m_id, start, end in matches if start != end]
) )
get_sort_key = lambda m: (m[2] - m[1], m[1]) get_sort_key = lambda m: (m[2] - m[1], -m[1])
matches = sorted(matches, key=get_sort_key, reverse=True) matches = sorted(matches, key=get_sort_key, reverse=True)
entities = list(doc.ents) entities = list(doc.ents)
new_entities = [] new_entities = []

View File

@ -57,12 +57,13 @@ def validate_token_pattern(obj: list) -> List[str]:
class TokenPatternString(BaseModel): class TokenPatternString(BaseModel):
REGEX: Optional[StrictStr] REGEX: Optional[StrictStr] = Field(None, alias="regex")
IN: Optional[List[StrictStr]] IN: Optional[List[StrictStr]] = Field(None, alias="in")
NOT_IN: Optional[List[StrictStr]] NOT_IN: Optional[List[StrictStr]] = Field(None, alias="not_in")
class Config: class Config:
extra = "forbid" extra = "forbid"
allow_population_by_field_name = True # allow alias and field name
@validator("*", pre=True, each_item=True, allow_reuse=True) @validator("*", pre=True, each_item=True, allow_reuse=True)
def raise_for_none(cls, v): def raise_for_none(cls, v):
@ -72,9 +73,9 @@ class TokenPatternString(BaseModel):
class TokenPatternNumber(BaseModel): class TokenPatternNumber(BaseModel):
REGEX: Optional[StrictStr] = None REGEX: Optional[StrictStr] = Field(None, alias="regex")
IN: Optional[List[StrictInt]] = None IN: Optional[List[StrictInt]] = Field(None, alias="in")
NOT_IN: Optional[List[StrictInt]] = None NOT_IN: Optional[List[StrictInt]] = Field(None, alias="not_in")
EQ: Union[StrictInt, StrictFloat] = Field(None, alias="==") EQ: Union[StrictInt, StrictFloat] = Field(None, alias="==")
NEQ: Union[StrictInt, StrictFloat] = Field(None, alias="!=") NEQ: Union[StrictInt, StrictFloat] = Field(None, alias="!=")
GEQ: Union[StrictInt, StrictFloat] = Field(None, alias=">=") GEQ: Union[StrictInt, StrictFloat] = Field(None, alias=">=")
@ -84,6 +85,7 @@ class TokenPatternNumber(BaseModel):
class Config: class Config:
extra = "forbid" extra = "forbid"
allow_population_by_field_name = True # allow alias and field name
@validator("*", pre=True, each_item=True, allow_reuse=True) @validator("*", pre=True, each_item=True, allow_reuse=True)
def raise_for_none(cls, v): def raise_for_none(cls, v):

View File

@ -44,6 +44,11 @@ def ca_tokenizer():
return get_lang_class("ca")().tokenizer return get_lang_class("ca")().tokenizer
@pytest.fixture(scope="session")
def cs_tokenizer():
return get_lang_class("cs")().tokenizer
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def da_tokenizer(): def da_tokenizer():
return get_lang_class("da")().tokenizer return get_lang_class("da")().tokenizer
@ -204,6 +209,11 @@ def ru_lemmatizer():
return get_lang_class("ru")().add_pipe("lemmatizer") return get_lang_class("ru")().add_pipe("lemmatizer")
@pytest.fixture(scope="session")
def sa_tokenizer():
return get_lang_class("sa")().tokenizer
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def sr_tokenizer(): def sr_tokenizer():
return get_lang_class("sr")().tokenizer return get_lang_class("sr")().tokenizer

View File

@ -162,11 +162,36 @@ def test_spans_are_hashable(en_tokenizer):
def test_spans_by_character(doc): def test_spans_by_character(doc):
span1 = doc[1:-2] span1 = doc[1:-2]
# default and specified alignment mode "strict"
span2 = doc.char_span(span1.start_char, span1.end_char, label="GPE") span2 = doc.char_span(span1.start_char, span1.end_char, label="GPE")
assert span1.start_char == span2.start_char assert span1.start_char == span2.start_char
assert span1.end_char == span2.end_char assert span1.end_char == span2.end_char
assert span2.label_ == "GPE" assert span2.label_ == "GPE"
span2 = doc.char_span(
span1.start_char, span1.end_char, label="GPE", alignment_mode="strict"
)
assert span1.start_char == span2.start_char
assert span1.end_char == span2.end_char
assert span2.label_ == "GPE"
# alignment mode "contract"
span2 = doc.char_span(
span1.start_char - 3, span1.end_char, label="GPE", alignment_mode="contract"
)
assert span1.start_char == span2.start_char
assert span1.end_char == span2.end_char
assert span2.label_ == "GPE"
# alignment mode "expand"
span2 = doc.char_span(
span1.start_char + 1, span1.end_char, label="GPE", alignment_mode="expand"
)
assert span1.start_char == span2.start_char
assert span1.end_char == span2.end_char
assert span2.label_ == "GPE"
def test_span_to_array(doc): def test_span_to_array(doc):
span = doc[1:-2] span = doc[1:-2]

View File

View File

@ -0,0 +1,23 @@
import pytest
@pytest.mark.parametrize(
"text,match",
[
("10", True),
("1", True),
("10.000", True),
("1000", True),
("999,0", True),
("devatenáct", True),
("osmdesát", True),
("kvadrilion", True),
("Pes", False),
(",", False),
("1/2", True),
],
)
def test_lex_attrs_like_number(cs_tokenizer, text, match):
tokens = cs_tokenizer(text)
assert len(tokens) == 1
assert tokens[0].like_num == match

View File

@ -56,6 +56,11 @@ def test_lex_attrs_like_number(en_tokenizer, text, match):
assert tokens[0].like_num == match assert tokens[0].like_num == match
@pytest.mark.parametrize("word", ["third", "Millionth", "100th", "Hundredth"])
def test_en_lex_attrs_like_number_for_ordinal(word):
assert like_num(word)
@pytest.mark.parametrize("word", ["eleven"]) @pytest.mark.parametrize("word", ["eleven"])
def test_en_lex_attrs_capitals(word): def test_en_lex_attrs_capitals(word):
assert like_num(word) assert like_num(word)

View File

@ -1,4 +1,5 @@
import pytest import pytest
from spacy.lang.he.lex_attrs import like_num
@pytest.mark.parametrize( @pytest.mark.parametrize(
@ -39,3 +40,30 @@ def test_he_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
def test_he_tokenizer_handles_punct(he_tokenizer, text, expected_tokens): def test_he_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
tokens = he_tokenizer(text) tokens = he_tokenizer(text)
assert expected_tokens == [token.text for token in tokens] assert expected_tokens == [token.text for token in tokens]
@pytest.mark.parametrize(
"text,match",
[
("10", True),
("1", True),
("10,000", True),
("10,00", True),
("999.0", True),
("אחד", True),
("שתיים", True),
("מליון", True),
("כלב", False),
(",", False),
("1/2", True),
],
)
def test_lex_attrs_like_number(he_tokenizer, text, match):
tokens = he_tokenizer(text)
assert len(tokens) == 1
assert tokens[0].like_num == match
@pytest.mark.parametrize("word", ["שלישי", "מליון", "עשירי", "מאה", "עשר", "אחד עשר"])
def test_he_lex_attrs_like_number_for_ordinal(word):
assert like_num(word)

View File

@ -1,6 +1,3 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest import pytest

View File

View File

@ -0,0 +1,42 @@
import pytest
def test_sa_tokenizer_handles_long_text(sa_tokenizer):
text = """नानाविधानि दिव्यानि नानावर्णाकृतीनि च।।"""
tokens = sa_tokenizer(text)
assert len(tokens) == 6
@pytest.mark.parametrize(
"text,length",
[
("श्री भगवानुवाच पश्य मे पार्थ रूपाणि शतशोऽथ सहस्रशः।", 9,),
("गुणान् सर्वान् स्वभावो मूर्ध्नि वर्तते ।", 6),
],
)
def test_sa_tokenizer_handles_cnts(sa_tokenizer, text, length):
tokens = sa_tokenizer(text)
assert len(tokens) == length
@pytest.mark.parametrize(
"text,match",
[
("10", True),
("1", True),
("10.000", True),
("1000", True),
("999,0", True),
("एकः ", True),
("दश", True),
("पञ्चदश", True),
("चत्वारिंशत् ", True),
("कूपे", False),
(",", False),
("1/2", True),
],
)
def test_lex_attrs_like_number(sa_tokenizer, text, match):
tokens = sa_tokenizer(text)
assert len(tokens) == 1
assert tokens[0].like_num == match

View File

@ -0,0 +1,334 @@
import pytest
import pickle
import re
import copy
from mock import Mock
from spacy.matcher import DependencyMatcher
from ..util import get_doc
@pytest.fixture
def doc(en_vocab):
text = "The quick brown fox jumped over the lazy fox"
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "pobj", "det", "amod"]
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
return doc
@pytest.fixture
def patterns(en_vocab):
def is_brown_yellow(text):
return bool(re.compile(r"brown|yellow").match(text))
IS_BROWN_YELLOW = en_vocab.add_flag(is_brown_yellow)
pattern1 = [
{"RIGHT_ID": "fox", "RIGHT_ATTRS": {"ORTH": "fox"}},
{
"LEFT_ID": "fox",
"REL_OP": ">",
"RIGHT_ID": "q",
"RIGHT_ATTRS": {"ORTH": "quick", "DEP": "amod"},
},
{
"LEFT_ID": "fox",
"REL_OP": ">",
"RIGHT_ID": "r",
"RIGHT_ATTRS": {IS_BROWN_YELLOW: True},
},
]
pattern2 = [
{"RIGHT_ID": "jumped", "RIGHT_ATTRS": {"ORTH": "jumped"}},
{
"LEFT_ID": "jumped",
"REL_OP": ">",
"RIGHT_ID": "fox1",
"RIGHT_ATTRS": {"ORTH": "fox"},
},
{
"LEFT_ID": "jumped",
"REL_OP": ".",
"RIGHT_ID": "over",
"RIGHT_ATTRS": {"ORTH": "over"},
},
]
pattern3 = [
{"RIGHT_ID": "jumped", "RIGHT_ATTRS": {"ORTH": "jumped"}},
{
"LEFT_ID": "jumped",
"REL_OP": ">",
"RIGHT_ID": "fox",
"RIGHT_ATTRS": {"ORTH": "fox"},
},
{
"LEFT_ID": "fox",
"REL_OP": ">>",
"RIGHT_ID": "r",
"RIGHT_ATTRS": {"ORTH": "brown"},
},
]
pattern4 = [
{"RIGHT_ID": "jumped", "RIGHT_ATTRS": {"ORTH": "jumped"}},
{
"LEFT_ID": "jumped",
"REL_OP": ">",
"RIGHT_ID": "fox",
"RIGHT_ATTRS": {"ORTH": "fox"},
}
]
pattern5 = [
{"RIGHT_ID": "jumped", "RIGHT_ATTRS": {"ORTH": "jumped"}},
{
"LEFT_ID": "jumped",
"REL_OP": ">>",
"RIGHT_ID": "fox",
"RIGHT_ATTRS": {"ORTH": "fox"},
},
]
return [pattern1, pattern2, pattern3, pattern4, pattern5]
@pytest.fixture
def dependency_matcher(en_vocab, patterns, doc):
matcher = DependencyMatcher(en_vocab)
mock = Mock()
for i in range(1, len(patterns) + 1):
if i == 1:
matcher.add("pattern1", [patterns[0]], on_match=mock)
else:
matcher.add("pattern" + str(i), [patterns[i - 1]])
return matcher
def test_dependency_matcher(dependency_matcher, doc, patterns):
assert len(dependency_matcher) == 5
assert "pattern3" in dependency_matcher
assert dependency_matcher.get("pattern3") == (None, [patterns[2]])
matches = dependency_matcher(doc)
assert len(matches) == 6
assert matches[0][1] == [3, 1, 2]
assert matches[1][1] == [4, 3, 5]
assert matches[2][1] == [4, 3, 2]
assert matches[3][1] == [4, 3]
assert matches[4][1] == [4, 3]
assert matches[5][1] == [4, 8]
span = doc[0:6]
matches = dependency_matcher(span)
assert len(matches) == 5
assert matches[0][1] == [3, 1, 2]
assert matches[1][1] == [4, 3, 5]
assert matches[2][1] == [4, 3, 2]
assert matches[3][1] == [4, 3]
assert matches[4][1] == [4, 3]
def test_dependency_matcher_pickle(en_vocab, patterns, doc):
matcher = DependencyMatcher(en_vocab)
for i in range(1, len(patterns) + 1):
matcher.add("pattern" + str(i), [patterns[i - 1]])
matches = matcher(doc)
assert matches[0][1] == [3, 1, 2]
assert matches[1][1] == [4, 3, 5]
assert matches[2][1] == [4, 3, 2]
assert matches[3][1] == [4, 3]
assert matches[4][1] == [4, 3]
assert matches[5][1] == [4, 8]
b = pickle.dumps(matcher)
matcher_r = pickle.loads(b)
assert len(matcher) == len(matcher_r)
matches = matcher_r(doc)
assert matches[0][1] == [3, 1, 2]
assert matches[1][1] == [4, 3, 5]
assert matches[2][1] == [4, 3, 2]
assert matches[3][1] == [4, 3]
assert matches[4][1] == [4, 3]
assert matches[5][1] == [4, 8]
def test_dependency_matcher_pattern_validation(en_vocab):
pattern = [
{"RIGHT_ID": "fox", "RIGHT_ATTRS": {"ORTH": "fox"}},
{
"LEFT_ID": "fox",
"REL_OP": ">",
"RIGHT_ID": "q",
"RIGHT_ATTRS": {"ORTH": "quick", "DEP": "amod"},
},
{
"LEFT_ID": "fox",
"REL_OP": ">",
"RIGHT_ID": "r",
"RIGHT_ATTRS": {"ORTH": "brown"},
},
]
matcher = DependencyMatcher(en_vocab)
# original pattern is valid
matcher.add("FOUNDED", [pattern])
# individual pattern not wrapped in a list
with pytest.raises(ValueError):
matcher.add("FOUNDED", pattern)
# no anchor node
with pytest.raises(ValueError):
matcher.add("FOUNDED", [pattern[1:]])
# required keys missing
with pytest.raises(ValueError):
pattern2 = copy.deepcopy(pattern)
del pattern2[0]["RIGHT_ID"]
matcher.add("FOUNDED", [pattern2])
with pytest.raises(ValueError):
pattern2 = copy.deepcopy(pattern)
del pattern2[1]["RIGHT_ID"]
matcher.add("FOUNDED", [pattern2])
with pytest.raises(ValueError):
pattern2 = copy.deepcopy(pattern)
del pattern2[1]["RIGHT_ATTRS"]
matcher.add("FOUNDED", [pattern2])
with pytest.raises(ValueError):
pattern2 = copy.deepcopy(pattern)
del pattern2[1]["LEFT_ID"]
matcher.add("FOUNDED", [pattern2])
with pytest.raises(ValueError):
pattern2 = copy.deepcopy(pattern)
del pattern2[1]["REL_OP"]
matcher.add("FOUNDED", [pattern2])
# invalid operator
with pytest.raises(ValueError):
pattern2 = copy.deepcopy(pattern)
pattern2[1]["REL_OP"] = "!!!"
matcher.add("FOUNDED", [pattern2])
# duplicate node name
with pytest.raises(ValueError):
pattern2 = copy.deepcopy(pattern)
pattern2[1]["RIGHT_ID"] = "fox"
matcher.add("FOUNDED", [pattern2])
def test_dependency_matcher_callback(en_vocab, doc):
pattern = [
{"RIGHT_ID": "quick", "RIGHT_ATTRS": {"ORTH": "quick"}},
]
matcher = DependencyMatcher(en_vocab)
mock = Mock()
matcher.add("pattern", [pattern], on_match=mock)
matches = matcher(doc)
mock.assert_called_once_with(matcher, doc, 0, matches)
# check that matches with and without callback are the same (#4590)
matcher2 = DependencyMatcher(en_vocab)
matcher2.add("pattern", [pattern])
matches2 = matcher2(doc)
assert matches == matches2
@pytest.mark.parametrize(
"op,num_matches", [(".", 8), (".*", 20), (";", 8), (";*", 20),]
)
def test_dependency_matcher_precedence_ops(en_vocab, op, num_matches):
# two sentences to test that all matches are within the same sentence
doc = get_doc(
en_vocab,
words=["a", "b", "c", "d", "e"] * 2,
heads=[0, -1, -2, -3, -4] * 2,
deps=["dep"] * 10,
)
match_count = 0
for text in ["a", "b", "c", "d", "e"]:
pattern = [
{"RIGHT_ID": "1", "RIGHT_ATTRS": {"ORTH": text}},
{"LEFT_ID": "1", "REL_OP": op, "RIGHT_ID": "2", "RIGHT_ATTRS": {},},
]
matcher = DependencyMatcher(en_vocab)
matcher.add("A", [pattern])
matches = matcher(doc)
match_count += len(matches)
for match in matches:
match_id, token_ids = match
# token_ids[0] op token_ids[1]
if op == ".":
assert token_ids[0] == token_ids[1] - 1
elif op == ";":
assert token_ids[0] == token_ids[1] + 1
elif op == ".*":
assert token_ids[0] < token_ids[1]
elif op == ";*":
assert token_ids[0] > token_ids[1]
# all tokens are within the same sentence
assert doc[token_ids[0]].sent == doc[token_ids[1]].sent
assert match_count == num_matches
@pytest.mark.parametrize(
"left,right,op,num_matches",
[
("fox", "jumped", "<", 1),
("the", "lazy", "<", 0),
("jumped", "jumped", "<", 0),
("fox", "jumped", ">", 0),
("fox", "lazy", ">", 1),
("lazy", "lazy", ">", 0),
("fox", "jumped", "<<", 2),
("jumped", "fox", "<<", 0),
("the", "fox", "<<", 2),
("fox", "jumped", ">>", 0),
("over", "the", ">>", 1),
("fox", "the", ">>", 2),
("fox", "jumped", ".", 1),
("lazy", "fox", ".", 1),
("the", "fox", ".", 0),
("the", "the", ".", 0),
("fox", "jumped", ";", 0),
("lazy", "fox", ";", 0),
("the", "fox", ";", 0),
("the", "the", ";", 0),
("quick", "fox", ".*", 2),
("the", "fox", ".*", 3),
("the", "the", ".*", 1),
("fox", "jumped", ";*", 1),
("quick", "fox", ";*", 0),
("the", "fox", ";*", 1),
("the", "the", ";*", 1),
("quick", "brown", "$+", 1),
("brown", "quick", "$+", 0),
("brown", "brown", "$+", 0),
("quick", "brown", "$-", 0),
("brown", "quick", "$-", 1),
("brown", "brown", "$-", 0),
("the", "brown", "$++", 1),
("brown", "the", "$++", 0),
("brown", "brown", "$++", 0),
("the", "brown", "$--", 0),
("brown", "the", "$--", 1),
("brown", "brown", "$--", 0),
],
)
def test_dependency_matcher_ops(en_vocab, doc, left, right, op, num_matches):
right_id = right
if left == right:
right_id = right + "2"
pattern = [
{"RIGHT_ID": left, "RIGHT_ATTRS": {"LOWER": left}},
{
"LEFT_ID": left,
"REL_OP": op,
"RIGHT_ID": right_id,
"RIGHT_ATTRS": {"LOWER": right},
},
]
matcher = DependencyMatcher(en_vocab)
matcher.add("pattern", [pattern])
matches = matcher(doc)
assert len(matches) == num_matches

View File

@ -1,7 +1,6 @@
 import pytest
-import re
 from mock import Mock
-from spacy.matcher import Matcher, DependencyMatcher
+from spacy.matcher import Matcher
 from spacy.tokens import Doc, Token, Span
 from ..doc.test_underscore import clean_underscore  # noqa: F401
@ -292,84 +291,6 @@ def test_matcher_extension_set_membership(en_vocab):
assert len(matches) == 0
@pytest.fixture
def text():
return "The quick brown fox jumped over the lazy fox"
@pytest.fixture
def heads():
return [3, 2, 1, 1, 0, -1, 2, 1, -3]
@pytest.fixture
def deps():
return ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
@pytest.fixture
def dependency_matcher(en_vocab):
def is_brown_yellow(text):
return bool(re.compile(r"brown|yellow|over").match(text))
IS_BROWN_YELLOW = en_vocab.add_flag(is_brown_yellow)
pattern1 = [
{"SPEC": {"NODE_NAME": "fox"}, "PATTERN": {"ORTH": "fox"}},
{
"SPEC": {"NODE_NAME": "q", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},
"PATTERN": {"ORTH": "quick", "DEP": "amod"},
},
{
"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},
"PATTERN": {IS_BROWN_YELLOW: True},
},
]
pattern2 = [
{"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
{
"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
"PATTERN": {"ORTH": "fox"},
},
{
"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"},
"PATTERN": {"ORTH": "fox"},
},
]
pattern3 = [
{"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
{
"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
"PATTERN": {"ORTH": "fox"},
},
{
"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">>", "NBOR_NAME": "fox"},
"PATTERN": {"ORTH": "brown"},
},
]
matcher = DependencyMatcher(en_vocab)
matcher.add("pattern1", [pattern1])
matcher.add("pattern2", [pattern2])
matcher.add("pattern3", [pattern3])
return matcher
def test_dependency_matcher_compile(dependency_matcher):
assert len(dependency_matcher) == 3
# def test_dependency_matcher(dependency_matcher, text, heads, deps):
# doc = get_doc(dependency_matcher.vocab, text.split(), heads=heads, deps=deps)
# matches = dependency_matcher(doc)
# assert matches[0][1] == [[3, 1, 2]]
# assert matches[1][1] == [[4, 3, 3]]
# assert matches[2][1] == [[4, 3, 2]]
def test_matcher_basic_check(en_vocab):
matcher = Matcher(en_vocab)
# Potential mistake: pass in pattern instead of list of patterns


@ -59,3 +59,12 @@ def test_minimal_pattern_validation(en_vocab, pattern, n_errors, n_min_errors):
matcher.add("TEST", [pattern]) matcher.add("TEST", [pattern])
elif n_errors == 0: elif n_errors == 0:
matcher.add("TEST", [pattern]) matcher.add("TEST", [pattern])
def test_pattern_errors(en_vocab):
matcher = Matcher(en_vocab)
# normalize "regex" to upper like "text"
matcher.add("TEST1", [[{"text": {"regex": "regex"}}]])
# error if subpattern attribute isn't recognized and processed
with pytest.raises(MatchPatternError):
matcher.add("TEST2", [[{"TEXT": {"XX": "xx"}}]])


@ -150,3 +150,15 @@ def test_entity_ruler_properties(nlp, patterns):
ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
assert sorted(ruler.labels) == sorted(["HELLO", "BYE", "COMPLEX", "TECH_ORG"])
assert sorted(ruler.ent_ids) == ["a1", "a2"]
def test_entity_ruler_overlapping_spans(nlp):
ruler = EntityRuler(nlp)
patterns = [
{"label": "FOOBAR", "pattern": "foo bar"},
{"label": "BARBAZ", "pattern": "bar baz"},
]
ruler.add_patterns(patterns)
doc = ruler(nlp.make_doc("foo bar baz"))
assert len(doc.ents) == 1
assert doc.ents[0].label_ == "FOOBAR"


@ -71,6 +71,6 @@ def test_overfitting_IO():
def test_tagger_requires_labels():
nlp = English()
-tagger = nlp.add_pipe("tagger")
+nlp.add_pipe("tagger")
with pytest.raises(ValueError):
-optimizer = nlp.begin_training()
+nlp.begin_training()


@ -38,32 +38,6 @@ def test_gold_misaligned(en_tokenizer, text, words):
Example.from_dict(doc, {"words": words})
def test_issue4590(en_vocab):
"""Test that matches param in on_match method are the same as matches run with no on_match method"""
pattern = [
{"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
{
"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
"PATTERN": {"ORTH": "fox"},
},
{
"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"},
"PATTERN": {"ORTH": "fox"},
},
]
on_match = Mock()
matcher = DependencyMatcher(en_vocab)
matcher.add("pattern", on_match, pattern)
text = "The quick brown fox jumped over the lazy fox"
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"]
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
matches = matcher(doc)
on_match_args = on_match.call_args
assert on_match_args[0][3] == matches
def test_issue4651_with_phrase_matcher_attr():
"""Test that the EntityRuler PhraseMatcher is deserialized correctly using
the method from_disk when the EntityRuler argument phrase_matcher_attr is


@ -0,0 +1,23 @@
from spacy.lang.en import English
from spacy.tokens import Span
from spacy import displacy
SAMPLE_TEXT = """First line
Second line, with ent
Third line
Fourth line
"""
def test_issue5838():
# Displacy's EntityRenderer break line
# not working after last entity
nlp = English()
doc = nlp(SAMPLE_TEXT)
doc.ents = [Span(doc, 7, 8, label="test")]
html = displacy.render(doc, style="ent")
found = html.count("</br>")
assert found == 4


@ -0,0 +1,27 @@
from spacy.lang.en import English
from spacy.pipeline import merge_entities
def test_issue5918():
# Test edge case when merging entities.
nlp = English()
ruler = nlp.add_pipe("entity_ruler")
patterns = [
{"label": "ORG", "pattern": "Digicon Inc"},
{"label": "ORG", "pattern": "Rotan Mosle Inc's"},
{"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"},
]
ruler.add_patterns(patterns)
text = """
Digicon Inc said it has completed the previously-announced disposition
of its computer systems division to an investment group led by
Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate.
"""
doc = nlp(text)
assert len(doc.ents) == 3
# make it so that the third span's head is within the entity (ent_iob=I)
# bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents.
doc[29].head = doc[33]
doc = merge_entities(doc)
assert len(doc.ents) == 3


@ -135,6 +135,7 @@ TRAIN_DATA = [
("Eat blue ham", {"tags": ["V", "J", "N"]}), ("Eat blue ham", {"tags": ["V", "J", "N"]}),
] ]
def test_tok2vec_listener(): def test_tok2vec_listener():
orig_config = Config().from_str(cfg_string) orig_config = Config().from_str(cfg_string)
nlp, config = util.load_model_from_config(orig_config, auto_fill=True, validate=True) nlp, config = util.load_model_from_config(orig_config, auto_fill=True, validate=True)


@ -29,6 +29,7 @@ NAUGHTY_STRINGS = [
r"₀₁₂", r"₀₁₂",
r"⁰⁴⁵₀₁₂", r"⁰⁴⁵₀₁₂",
r"ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็", r"ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็",
r" ̄ ̄",
# Two-Byte Characters
r"田中さんにあげて下さい",
r"パーティーへ行かないか",


@ -15,7 +15,7 @@ def test_tokenizer_splits_double_space(tokenizer, text):
@pytest.mark.parametrize("text", ["lorem ipsum "]) @pytest.mark.parametrize("text", ["lorem ipsum "])
def test_tokenizer_handles_double_trainling_ws(tokenizer, text): def test_tokenizer_handles_double_trailing_ws(tokenizer, text):
tokens = tokenizer(text) tokens = tokenizer(text)
assert repr(tokens.text_with_ws) == repr(text) assert repr(tokens.text_with_ws) == repr(text)


@ -169,6 +169,8 @@ def _merge(Doc doc, merges):
spans.append(span)
# House the new merged token where it starts
token = &doc.c[start]
+start_ent_iob = doc.c[start].ent_iob
+start_ent_type = doc.c[start].ent_type
# Initially set attributes to attributes of span root
token.tag = doc.c[span.root.i].tag
token.pos = doc.c[span.root.i].pos
@ -181,8 +183,8 @@
merged_iob = 3
# If start token is I-ENT and previous token is of the same
# type, then I-ENT (could check I-ENT from start to span root)
-if doc.c[start].ent_iob == 1 and start > 0 \
-and doc.c[start].ent_type == token.ent_type \
+if start_ent_iob == 1 and start > 0 \
+and start_ent_type == token.ent_type \
and doc.c[start - 1].ent_type == token.ent_type:
merged_iob = 1
token.ent_iob = merged_iob


@ -336,17 +336,25 @@ cdef class Doc:
def doc(self):
return self

-def char_span(self, int start_idx, int end_idx, label=0, kb_id=0, vector=None):
+def char_span(self, int start_idx, int end_idx, label=0, kb_id=0, vector=None, alignment_mode="strict"):
-"""Create a `Span` object from the slice `doc.text[start : end]`.
+"""Create a `Span` object from the slice
+`doc.text[start_idx : end_idx]`. Returns None if no valid `Span` can be
+created.

doc (Doc): The parent document.
-start (int): The index of the first character of the span.
-end (int): The index of the first character after the span.
+start_idx (int): The index of the first character of the span.
+end_idx (int): The index of the first character after the span.
label (uint64 or string): A label to attach to the Span, e.g. for
named entities.
-kb_id (uint64 or string): An ID from a KB to capture the meaning of a named entity.
+kb_id (uint64 or string): An ID from a KB to capture the meaning of a
+named entity.
vector (ndarray[ndim=1, dtype='float32']): A meaning representation of
the span.
+alignment_mode (str): How character indices are aligned to token
+boundaries. Options: "strict" (character indices must be aligned
+with token boundaries), "contract" (span of all tokens completely
+within the character span), "expand" (span of all tokens at least
+partially covered by the character span). Defaults to "strict".
RETURNS (Span): The newly constructed object.

DOCS: https://nightly.spacy.io/api/doc#char_span
@ -355,12 +363,29 @@ cdef class Doc:
label = self.vocab.strings.add(label)
if not isinstance(kb_id, int):
kb_id = self.vocab.strings.add(kb_id)
-cdef int start = token_by_start(self.c, self.length, start_idx)
-if start == -1:
+if alignment_mode not in ("strict", "contract", "expand"):
+alignment_mode = "strict"
+cdef int start = token_by_char(self.c, self.length, start_idx)
+if start < 0 or (alignment_mode == "strict" and start_idx != self[start].idx):
return None
-cdef int end = token_by_end(self.c, self.length, end_idx)
-if end == -1:
+# end_idx is exclusive, so find the token at one char before
+cdef int end = token_by_char(self.c, self.length, end_idx - 1)
+if end < 0 or (alignment_mode == "strict" and end_idx != self[end].idx + len(self[end])):
return None
+# Adjust start and end by alignment_mode
+if alignment_mode == "contract":
+if self[start].idx < start_idx:
+start += 1
+if end_idx < self[end].idx + len(self[end]):
+end -= 1
+# if no tokens are completely within the span, return None
+if end < start:
+return None
+elif alignment_mode == "expand":
+# Don't consider the trailing whitespace to be part of the previous
+# token
+if start_idx == self[start].idx + len(self[start]):
+start += 1
# Currently we have the token index, we want the range-end index
end += 1
cdef Span span = Span(self, start, end, label=label, kb_id=kb_id, vector=vector)
@ -1268,23 +1293,35 @@ cdef class Doc:
cdef int token_by_start(const TokenC* tokens, int length, int start_char) except -2:
-cdef int i
-for i in range(length):
-if tokens[i].idx == start_char:
-return i
+cdef int i = token_by_char(tokens, length, start_char)
+if i >= 0 and tokens[i].idx == start_char:
+return i
else:
return -1

cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2:
-cdef int i
-for i in range(length):
-if tokens[i].idx + tokens[i].lex.length == end_char:
+# end_char is exclusive, so find the token at one char before
+cdef int i = token_by_char(tokens, length, end_char - 1)
+if i >= 0 and tokens[i].idx + tokens[i].lex.length == end_char:
return i
else:
return -1

+cdef int token_by_char(const TokenC* tokens, int length, int char_idx) except -2:
+cdef int start = 0, mid, end = length - 1
+while start <= end:
+mid = (start + end) / 2
+if char_idx < tokens[mid].idx:
+end = mid - 1
+elif char_idx >= tokens[mid].idx + tokens[mid].lex.length + tokens[mid].spacy:
+start = mid + 1
+else:
+return mid
+return -1

cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
cdef TokenC* head
cdef TokenC* child


@ -1,65 +1,91 @@
---
title: DependencyMatcher
-teaser: Match sequences of tokens, based on the dependency parse
+teaser: Match subtrees within a dependency parse
tag: class
+new: 3
source: spacy/matcher/dependencymatcher.pyx
---

The `DependencyMatcher` follows the same API as the [`Matcher`](/api/matcher)
and [`PhraseMatcher`](/api/phrasematcher) and lets you match on dependency trees
-using the
-[Semgrex syntax](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html).
-It requires a trained [`DependencyParser`](/api/parser) or other component that
-sets the `Token.dep` attribute.
+using
+[Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html).
+It requires a pretrained [`DependencyParser`](/api/parser) or other component
+that sets the `Token.dep` and `Token.head` attributes. See the
+[usage guide](/usage/rule-based-matching#dependencymatcher) for examples.

## Pattern format {#patterns}
> ```json > ```python
> ### Example > ### Example
> # pattern: "[subject] ... initially founded"
> [ > [
> # anchor token: founded
> { > {
> "SPEC": {"NODE_NAME": "founded"}, > "RIGHT_ID": "founded",
> "PATTERN": {"ORTH": "founded"} > "RIGHT_ATTRS": {"ORTH": "founded"}
> }, > },
> # founded -> subject
> { > {
> "SPEC": { > "LEFT_ID": "founded",
> "NODE_NAME": "founder", > "REL_OP": ">",
> "NBOR_RELOP": ">", > "RIGHT_ID": "subject",
> "NBOR_NAME": "founded" > "RIGHT_ATTRS": {"DEP": "nsubj"}
> },
> "PATTERN": {"DEP": "nsubj"}
> }, > },
> # "founded" follows "initially"
> { > {
> "SPEC": { > "LEFT_ID": "founded",
> "NODE_NAME": "object", > "REL_OP": ";",
> "NBOR_RELOP": ">", > "RIGHT_ID": "initially",
> "NBOR_NAME": "founded" > "RIGHT_ATTRS": {"ORTH": "initially"}
> },
> "PATTERN": {"DEP": "dobj"}
> } > }
> ] > ]
> ``` > ```
A pattern added to the `DependencyMatcher` consists of a list of dictionaries, A pattern added to the `DependencyMatcher` consists of a list of dictionaries,
with each dictionary describing a node to match. Each pattern should have the with each dictionary describing a token to match. Except for the first
following top-level keys: dictionary, which defines an anchor token using only `RIGHT_ID` and
`RIGHT_ATTRS`, each pattern should have the following keys:
| Name | Description | | Name | Description |
| --------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `PATTERN` | The token attributes to match in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ | | `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ |
| `SPEC` | The relationships of the nodes in the subtree that should be matched. ~~Dict[str, str]~~ | | `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ |
| `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ |
| `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |
The `SPEC` includes the following fields: <Infobox title="Designing dependency matcher patterns" emoji="📖">
| Name | Description | For examples of how to construct dependency matcher patterns for different types
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | of relations, see the usage guide on
| `NODE_NAME` | A unique name for this node to refer to it in other specs. ~~str~~ | [dependency matching](/usage/rule-based-matching#dependencymatcher).
| `NBOR_RELOP` | A [Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html) operator that describes how the two nodes are related. ~~str~~ |
| `NBOR_NAME` | The unique name of the node that this node is connected to. ~~str~~ | </Infobox>
### Operators
The following operators are supported by the `DependencyMatcher`, most of which
come directly from
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html):
| Symbol | Description |
| --------- | -------------------------------------------------------------------------------------------------------------------- |
| `A < B` | `A` is the immediate dependent of `B`. |
| `A > B` | `A` is the immediate head of `B`. |
| `A << B` | `A` is the dependent in a chain to `B` following dep &rarr; head paths. |
| `A >> B` | `A` is the head in a chain to `B` following head &rarr; dep paths. |
| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree. |
| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. |
| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. |
| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. |
| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`. |
## DependencyMatcher.\_\_init\_\_ {#init tag="method"} ## DependencyMatcher.\_\_init\_\_ {#init tag="method"}
Create a rule-based `DependencyMatcher`. Create a `DependencyMatcher`.
> #### Example > #### Example
> >
@ -68,13 +94,15 @@ Create a rule-based `DependencyMatcher`.
> matcher = DependencyMatcher(nlp.vocab) > matcher = DependencyMatcher(nlp.vocab)
> ``` > ```
| Name | Description | | Name | Description |
| ------- | ----------------------------------------------------------------------------------------------------- | | -------------- | ----------------------------------------------------------------------------------------------------- |
| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ | | `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ |
| _keyword-only_ | |
| `validate` | Validate all patterns added to this matcher. ~~bool~~ |
## DependencyMatcher.\_\_call\_\_ {#call tag="method"}

-Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
+Find all tokens matching the supplied patterns on the `Doc` or `Span`.
> #### Example > #### Example
> >
@ -82,36 +110,32 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
> from spacy.matcher import DependencyMatcher > from spacy.matcher import DependencyMatcher
> >
> matcher = DependencyMatcher(nlp.vocab) > matcher = DependencyMatcher(nlp.vocab)
> pattern = [ > pattern = [{"RIGHT_ID": "founded_id",
> {"SPEC": {"NODE_NAME": "founded"}, "PATTERN": {"ORTH": "founded"}}, > "RIGHT_ATTRS": {"ORTH": "founded"}}]
> {"SPEC": {"NODE_NAME": "founder", "NBOR_RELOP": ">", "NBOR_NAME": "founded"}, "PATTERN": {"DEP": "nsubj"}}, > matcher.add("FOUNDED", [pattern])
> ]
> matcher.add("Founder", [pattern])
> doc = nlp("Bill Gates founded Microsoft.") > doc = nlp("Bill Gates founded Microsoft.")
> matches = matcher(doc) > matches = matcher(doc)
> ``` > ```
| Name | Description | | Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | | `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. ~~List[Tuple[int, int, int]]~~ | | **RETURNS** | A list of `(match_id, token_ids)` tuples, describing the matches. The `match_id` is the ID of the match pattern and `token_ids` is a list of token indices matched by the pattern, where the position of each token in the list corresponds to the position of the node specification in the pattern. ~~List[Tuple[int, List[int]]]~~ |
## DependencyMatcher.\_\_len\_\_ {#len tag="method"} ## DependencyMatcher.\_\_len\_\_ {#len tag="method"}
Get the number of rules (edges) added to the dependency matcher. Note that this Get the number of rules added to the dependency matcher. Note that this only
only returns the number of rules (identical with the number of IDs), not the returns the number of rules (identical with the number of IDs), not the number
number of individual patterns. of individual patterns.
> #### Example > #### Example
> >
> ```python > ```python
> matcher = DependencyMatcher(nlp.vocab) > matcher = DependencyMatcher(nlp.vocab)
> assert len(matcher) == 0 > assert len(matcher) == 0
> pattern = [ > pattern = [{"RIGHT_ID": "founded_id",
> {"SPEC": {"NODE_NAME": "founded"}, "PATTERN": {"ORTH": "founded"}}, > "RIGHT_ATTRS": {"ORTH": "founded"}}]
> {"SPEC": {"NODE_NAME": "START_ENTITY", "NBOR_RELOP": ">", "NBOR_NAME": "founded"}, "PATTERN": {"DEP": "nsubj"}}, > matcher.add("FOUNDED", [pattern])
> ]
> matcher.add("Rule", [pattern])
> assert len(matcher) == 1 > assert len(matcher) == 1
> ``` > ```
@ -126,10 +150,10 @@ Check whether the matcher contains rules for a match ID.
> #### Example > #### Example
> >
> ```python > ```python
> matcher = Matcher(nlp.vocab) > matcher = DependencyMatcher(nlp.vocab)
> assert "Rule" not in matcher > assert "FOUNDED" not in matcher
> matcher.add("Rule", [pattern]) > matcher.add("FOUNDED", [pattern])
> assert "Rule" in matcher > assert "FOUNDED" in matcher
> ``` > ```
| Name | Description | | Name | Description |
@ -152,33 +176,15 @@ will be overwritten.
> print('Matched!', matches) > print('Matched!', matches)
> >
> matcher = DependencyMatcher(nlp.vocab) > matcher = DependencyMatcher(nlp.vocab)
> matcher.add("TEST_PATTERNS", patterns) > matcher.add("FOUNDED", patterns, on_match=on_match)
> ``` > ```
| Name | Description | | Name | Description |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `match_id` | An ID for the thing you're matching. ~~str~~ | | `match_id` | An ID for the patterns. ~~str~~ |
| `patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a `"PATTERN"` and `"SPEC"`. ~~List[List[Dict[str, dict]]]~~ | | `patterns` | A list of match patterns. A pattern consists of a list of dicts, where each dict describes a token in the tree. ~~List[List[Dict[str, Union[str, Dict]]]]~~ |
| _keyword-only_ | | | | _keyword-only_ | | |
| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ | | `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[DependencyMatcher, Doc, int, List[Tuple], Any]]~~ |
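For reference, here is a minimal sketch of an `on_match` callback consistent with the signature in the table above; the `patterns` variable is assumed to be a list of dependency patterns as in the example:

```python
def on_match(matcher, doc, i, matches):
    # matches[i] is the (match_id, token_ids) tuple this callback was fired for
    match_id, token_ids = matches[i]
    print("Matched tokens:", [doc[token_id].text for token_id in token_ids])

matcher.add("FOUNDED", patterns, on_match=on_match)
```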
## DependencyMatcher.remove {#remove tag="method"}
Remove a rule from the matcher. A `KeyError` is raised if the match ID does not
exist.
> #### Example
>
> ```python
> matcher.add("Rule", [pattern]])
> assert "Rule" in matcher
> matcher.remove("Rule")
> assert "Rule" not in matcher
> ```
| Name | Description |
| ----- | --------------------------------- |
| `key` | The ID of the match rule. ~~str~~ |
## DependencyMatcher.get {#get tag="method"} ## DependencyMatcher.get {#get tag="method"}
@ -188,11 +194,29 @@ Retrieve the pattern stored for a key. Returns the rule as an
> #### Example > #### Example
> >
> ```python > ```python
> matcher.add("Rule", [pattern], on_match=on_match) > matcher.add("FOUNDED", patterns, on_match=on_match)
> on_match, patterns = matcher.get("Rule") > on_match, patterns = matcher.get("FOUNDED")
> ``` > ```
| Name | Description | | Name | Description |
| ----------- | --------------------------------------------------------------------------------------------- | | ----------- | ----------------------------------------------------------------------------------------------------------- |
| `key` | The ID of the match rule. ~~str~~ | | `key` | The ID of the match rule. ~~str~~ |
| **RETURNS** | The rule, as an `(on_match, patterns)` tuple. ~~Tuple[Optional[Callable], List[List[dict]]]~~ | | **RETURNS** | The rule, as an `(on_match, patterns)` tuple. ~~Tuple[Optional[Callable], List[List[Union[Dict, Tuple]]]]~~ |
## DependencyMatcher.remove {#remove tag="method"}
Remove a rule from the dependency matcher. A `KeyError` is raised if the match
ID does not exist.
> #### Example
>
> ```python
> matcher.add("FOUNDED", patterns)
> assert "FOUNDED" in matcher
> matcher.remove("FOUNDED")
> assert "FOUNDED" not in matcher
> ```
| Name | Description |
| ----- | --------------------------------- |
| `key` | The ID of the match rule. ~~str~~ |


@ -186,8 +186,9 @@ Remove a previously registered extension.
## Doc.char_span {#char_span tag="method" new="2"} ## Doc.char_span {#char_span tag="method" new="2"}
-Create a `Span` object from the slice `doc.text[start:end]`. Returns `None` if
-the character indices don't map to a valid span.
+Create a `Span` object from the slice `doc.text[start_idx:end_idx]`. Returns
+`None` if the character indices don't map to a valid span using the default
+alignment mode `"strict"`.
> #### Example > #### Example
> >
@ -197,14 +198,15 @@ the character indices don't map to a valid span.
> assert span.text == "New York" > assert span.text == "New York"
> ``` > ```
| Name | Description |
| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `start` | The index of the first character of the span. ~~int~~ |
| `end` | The index of the last character after the span. ~~int~~ |
| `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ |
| `kb_id` <Tag variant="new">2.2</Tag> | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ |
| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
| `alignment_mode` | How character indices snap to token boundaries. Options: `"strict"` (character indices must match token boundaries exactly), `"contract"` (span of all tokens completely within the character span), `"expand"` (span of all tokens at least partially covered by the character span). Defaults to `"strict"`. ~~str~~ |
| **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ |
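A minimal sketch of how the alignment modes behave, based on the `char_span` implementation in this commit (character offsets refer to the example sentence below):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like New York in Autumn.")

print(doc.char_span(7, 15))                             # New York: offsets match token boundaries
print(doc.char_span(7, 13))                             # None: "strict" requires exact boundaries
print(doc.char_span(7, 13, alignment_mode="contract"))  # New: only tokens fully inside the characters
print(doc.char_span(7, 13, alignment_mode="expand"))    # New York: tokens at least partially covered
```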
## Doc.similarity {#similarity tag="method" model="vectors"} ## Doc.similarity {#similarity tag="method" model="vectors"}



@ -0,0 +1,58 @@
<svg xmlns="http://www.w3.org/2000/svg" xlink="http://www.w3.org/1999/xlink" xml:lang="en" id="c3124cc3e661444cb9d4175a5b7c09d1-0" class="displacy" width="925" height="399.5" direction="ltr" style="max-width: none; height: 399.5px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr">
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="309.5">
<tspan class="displacy-word" fill="currentColor" x="50">Smith</tspan>
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="50"></tspan>
</text>
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="309.5">
<tspan class="displacy-word" fill="currentColor" x="225">founded</tspan>
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="225"></tspan>
</text>
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="309.5">
<tspan class="displacy-word" fill="currentColor" x="400">a</tspan>
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="400"></tspan>
</text>
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="309.5">
<tspan class="displacy-word" fill="currentColor" x="575">healthcare</tspan>
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="575"></tspan>
</text>
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="309.5">
<tspan class="displacy-word" fill="currentColor" x="750">company</tspan>
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="750"></tspan>
</text>
<g class="displacy-arrow">
<path class="displacy-arc" id="arrow-c3124cc3e661444cb9d4175a5b7c09d1-0-0" stroke-width="2px" d="M70,264.5 C70,177.0 215.0,177.0 215.0,264.5" fill="none" stroke="currentColor"></path>
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
<textPath xlink:href="#arrow-c3124cc3e661444cb9d4175a5b7c09d1-0-0" class="displacy-label" startOffset="50%" side="left" fill="currentColor" text-anchor="middle">nsubj</textPath>
</text>
<path class="displacy-arrowhead" d="M70,266.5 L62,254.5 78,254.5" fill="currentColor"></path>
</g>
<g class="displacy-arrow">
<path class="displacy-arc" id="arrow-c3124cc3e661444cb9d4175a5b7c09d1-0-1" stroke-width="2px" d="M420,264.5 C420,89.5 745.0,89.5 745.0,264.5" fill="none" stroke="currentColor"></path>
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
<textPath xlink:href="#arrow-c3124cc3e661444cb9d4175a5b7c09d1-0-1" class="displacy-label" startOffset="50%" side="left" fill="currentColor" text-anchor="middle">det</textPath>
</text>
<path class="displacy-arrowhead" d="M420,266.5 L412,254.5 428,254.5" fill="currentColor"></path>
</g>
<g class="displacy-arrow">
<path class="displacy-arc" id="arrow-c3124cc3e661444cb9d4175a5b7c09d1-0-2" stroke-width="2px" d="M595,264.5 C595,177.0 740.0,177.0 740.0,264.5" fill="none" stroke="currentColor"></path>
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
<textPath xlink:href="#arrow-c3124cc3e661444cb9d4175a5b7c09d1-0-2" class="displacy-label" startOffset="50%" side="left" fill="currentColor" text-anchor="middle">compound</textPath>
</text>
<path class="displacy-arrowhead" d="M595,266.5 L587,254.5 603,254.5" fill="currentColor"></path>
</g>
<g class="displacy-arrow">
<path class="displacy-arc" id="arrow-c3124cc3e661444cb9d4175a5b7c09d1-0-3" stroke-width="2px" d="M245,264.5 C245,2.0 750.0,2.0 750.0,264.5" fill="none" stroke="currentColor"></path>
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
<textPath xlink:href="#arrow-c3124cc3e661444cb9d4175a5b7c09d1-0-3" class="displacy-label" startOffset="50%" side="left" fill="currentColor" text-anchor="middle">dobj</textPath>
</text>
<path class="displacy-arrowhead" d="M750.0,266.5 L758.0,254.5 742.0,254.5" fill="currentColor"></path>
</g>
</svg>



@ -1021,7 +1021,7 @@ expressions for example,
[`compile_suffix_regex`](/api/top-level#util.compile_suffix_regex):

```python
-suffixes = nlp.Defaults.suffixes + (r'''-+$''',)
+suffixes = nlp.Defaults.suffixes + [r'''-+$''',]
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```


@ -330,7 +330,7 @@ custom component entirely (more details on this in the section on
```python
nlp.remove_pipe("parser")
nlp.rename_pipe("ner", "entityrecognizer")
-nlp.replace_pipe("tagger", my_custom_tagger)
+nlp.replace_pipe("tagger", "my_custom_tagger")
```

The `Language` object exposes different [attributes](/api/language#attributes)


@ -4,6 +4,7 @@ teaser: Find phrases and tokens, and match entities
menu:
- ['Token Matcher', 'matcher']
- ['Phrase Matcher', 'phrasematcher']
- ['Dependency Matcher', 'dependencymatcher']
- ['Entity Ruler', 'entityruler']
- ['Models & Rules', 'models-rules']
---
@ -939,10 +940,10 @@ object patterns as efficiently as possible and without running any of the other
pipeline components. If the token attributes you want to match on are set by a
pipeline component, **make sure that the pipeline component runs** when you
create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc`
-objects need to have part-of-speech tags set by the `tagger`. You can either
-call the `nlp` object on your pattern texts instead of `nlp.make_doc`, or use
-[`nlp.select_pipes`](/api/language#select_pipes) to disable components
-selectively.
+objects need to have part-of-speech tags set by the `tagger` or `morphologizer`.
+You can either call the `nlp` object on your pattern texts instead of
+`nlp.make_doc`, or use [`nlp.select_pipes`](/api/language#select_pipes) to
+disable components selectively.

</Infobox>

@ -973,10 +974,287 @@ to match phrases with the same sequence of punctuation and non-punctuation
tokens as the pattern. But this can easily get confusing and doesn't have much
of an advantage over writing one or two token patterns.
## Dependency Matcher {#dependencymatcher new="3" model="parser"}
The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within
the dependency parse using
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
operators. It requires a model containing a parser such as the
[`DependencyParser`](/api/dependencyparser). Instead of defining a list of
adjacent tokens as in `Matcher` patterns, the `DependencyMatcher` patterns match
tokens in the dependency parse and specify the relations between them.
> ```python
> ### Example
> from spacy.matcher import DependencyMatcher
>
> # "[subject] ... initially founded"
> pattern = [
> # anchor token: founded
> {
> "RIGHT_ID": "founded",
> "RIGHT_ATTRS": {"ORTH": "founded"}
> },
> # founded -> subject
> {
> "LEFT_ID": "founded",
> "REL_OP": ">",
> "RIGHT_ID": "subject",
> "RIGHT_ATTRS": {"DEP": "nsubj"}
> },
> # "founded" follows "initially"
> {
> "LEFT_ID": "founded",
> "REL_OP": ";",
> "RIGHT_ID": "initially",
> "RIGHT_ATTRS": {"ORTH": "initially"}
> }
> ]
>
> matcher = DependencyMatcher(nlp.vocab)
> matcher.add("FOUNDED", [pattern])
> matches = matcher(doc)
> ```
A pattern added to the dependency matcher consists of a **list of
dictionaries**, with each dictionary describing a **token to match** and its
**relation to an existing token** in the pattern. Except for the first
dictionary, which defines an anchor token using only `RIGHT_ID` and
`RIGHT_ATTRS`, each pattern should have the following keys:
| Name | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ |
| `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ |
| `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ |
| `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |
Each additional token added to the pattern is linked to an existing token
`LEFT_ID` by the relation `REL_OP`. The new token is given the name `RIGHT_ID`
and described by the attributes `RIGHT_ATTRS`.
<Infobox title="Important note" variant="warning">
Because the unique token **names** in `LEFT_ID` and `RIGHT_ID` are used to
identify tokens, the order of the dicts in the patterns is important: a token
name needs to be defined as `RIGHT_ID` in one dict in the pattern **before** it
can be used as `LEFT_ID` in another dict.
</Infobox>
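As a quick sanity check, here is a sketch of a pattern that gets the order wrong: `founded` is used as `LEFT_ID` before it has been defined as a `RIGHT_ID`, so the matcher is expected to reject it with a `ValueError` when the pattern is added (the exact error message may differ):

```python
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.blank("en")
matcher = DependencyMatcher(nlp.vocab)

bad_pattern = [
    # "founded" hasn't been defined as a RIGHT_ID yet, so this dict is invalid
    {"LEFT_ID": "founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"RIGHT_ID": "founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
]
try:
    matcher.add("FOUNDED", [bad_pattern])
except ValueError as err:
    print("Pattern rejected:", err)
```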
### Dependency matcher operators {#dependencymatcher-operators}
The following operators are supported by the `DependencyMatcher`, most of which
come directly from
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html):
| Symbol | Description |
| --------- | -------------------------------------------------------------------------------------------------------------------- |
| `A < B` | `A` is the immediate dependent of `B`. |
| `A > B` | `A` is the immediate head of `B`. |
| `A << B` | `A` is the dependent in a chain to `B` following dep &rarr; head paths. |
| `A >> B` | `A` is the head in a chain to `B` following head &rarr; dep paths. |
| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree. |
| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. |
| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. |
| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. |
| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`. |
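As a small illustration of the sibling operators, the sketch below pairs `quick` with `brown` via `$++`; it assumes that `en_core_web_sm` parses both words as modifiers of `fox`, which makes them siblings:

```python
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    {"RIGHT_ID": "quick", "RIGHT_ATTRS": {"LOWER": "quick"}},
    # "brown" is a right sibling of "quick": same head, higher token index
    {"LEFT_ID": "quick", "REL_OP": "$++", "RIGHT_ID": "brown", "RIGHT_ATTRS": {"LOWER": "brown"}},
]
matcher.add("SIBLINGS", [pattern])

doc = nlp("The quick brown fox jumped over the lazy dog.")
for match_id, token_ids in matcher(doc):
    print([doc[token_id].text for token_id in token_ids])  # expected: ['quick', 'brown']
```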
### Designing dependency matcher patterns {#dependencymatcher-patterns}
Let's say we want to find sentences describing who founded what kind of company:
- _Smith founded a healthcare company in 2005._
- _Williams initially founded an insurance company in 1987._
- _Lee, an experienced CEO, has founded two AI startups._
The dependency parse for "Smith founded a healthcare company" shows types of
relations and tokens we want to match:
> #### Visualizing the parse
>
> The [`displacy` visualizer](/usage/visualizer) lets you render `Doc` objects
> and their dependency parse and part-of-speech tags:
>
> ```python
> import spacy
> from spacy import displacy
>
> nlp = spacy.load("en_core_web_sm")
> doc = nlp("Smith founded a healthcare company")
> displacy.serve(doc)
> ```
import DisplaCyDepFoundedHtml from 'images/displacy-dep-founded.html'
<Iframe title="displaCy visualization of dependencies" html={DisplaCyDepFoundedHtml} height={450} />
The relations we're interested in are:
- the founder is the **subject** (`nsubj`) of the token with the text `founded`
- the company is the **object** (`dobj`) of `founded`
- the kind of company may be an **adjective** (`amod`, not shown above) or a
**compound** (`compound`)
The first step is to pick an **anchor token** for the pattern. Since it's the
root of the dependency parse, `founded` is a good choice here. It is often
easier to construct patterns when all dependency relation operators point from
the head to the children. In this example, we'll only use `>`, which connects a
head to an immediate dependent as `head > child`.
The simplest dependency matcher pattern will identify and name a single token in
the tree:
```python
### {executable="true"}
import spacy
from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)
pattern = [
{
"RIGHT_ID": "anchor_founded", # unique name
"RIGHT_ATTRS": {"ORTH": "founded"} # token pattern for "founded"
}
]
matcher.add("FOUNDED", [pattern])
doc = nlp("Smith founded two companies.")
matches = matcher(doc)
print(matches) # [(4851363122962674176, [1])]
```
Now that we have a named anchor token (`anchor_founded`), we can add the founder
as the immediate dependent (`>`) of `founded` with the dependency label `nsubj`:
```python
### Step 1 {highlight="8,10"}
pattern = [
{
"RIGHT_ID": "anchor_founded",
"RIGHT_ATTRS": {"ORTH": "founded"}
},
{
"LEFT_ID": "anchor_founded",
"REL_OP": ">",
"RIGHT_ID": "subject",
"RIGHT_ATTRS": {"DEP": "nsubj"},
}
# ...
]
```
The direct object (`dobj`) is added in the same way:
```python
### Step 2 {highlight=""}
pattern = [
#...
{
"LEFT_ID": "anchor_founded",
"REL_OP": ">",
"RIGHT_ID": "founded_object",
"RIGHT_ATTRS": {"DEP": "dobj"},
}
# ...
]
```
When the subject and object tokens are added, they are required to have names
under the key `RIGHT_ID`, which are allowed to be any unique string, e.g.
`founded_subject`. These names can then be used as `LEFT_ID` to **link new
tokens into the pattern**. For the final part of our pattern, we'll specify that
the token `founded_object` should have a modifier with the dependency relation
`amod` or `compound`:
```python
### Step 3 {highlight="7"}
pattern = [
# ...
{
"LEFT_ID": "founded_object",
"REL_OP": ">",
"RIGHT_ID": "founded_object_modifier",
"RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
}
]
```
You can picture the process of creating a dependency matcher pattern as defining
an anchor token on the left and building up the pattern by linking tokens
one-by-one on the right using relation operators. To create a valid pattern,
each new token needs to be linked to an existing token on its left. As for
`founded` in this example, a token may be linked to more than one token on its
right:
![Dependency matcher pattern](../images/dep-match-diagram.svg)
The full pattern comes together as shown in the example below:
```python
### {executable="true"}
import spacy
from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)
pattern = [
{
"RIGHT_ID": "anchor_founded",
"RIGHT_ATTRS": {"ORTH": "founded"}
},
{
"LEFT_ID": "anchor_founded",
"REL_OP": ">",
"RIGHT_ID": "subject",
"RIGHT_ATTRS": {"DEP": "nsubj"},
},
{
"LEFT_ID": "anchor_founded",
"REL_OP": ">",
"RIGHT_ID": "founded_object",
"RIGHT_ATTRS": {"DEP": "dobj"},
},
{
"LEFT_ID": "founded_object",
"REL_OP": ">",
"RIGHT_ID": "founded_object_modifier",
"RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
}
]
matcher.add("FOUNDED", [pattern])
doc = nlp("Lee, an experienced CEO, has founded two AI startups.")
matches = matcher(doc)
print(matches) # [(4851363122962674176, [6, 0, 10, 9])]
# Each token_id corresponds to one pattern dict
match_id, token_ids = matches[0]
for i in range(len(token_ids)):
print(pattern[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text)
```
<Infobox title="Important note on speed" variant="warning">
The dependency matcher may be slow when token patterns can potentially match
many tokens in the sentence or when relation operators allow longer paths in the
dependency parse, e.g. `<<`, `>>`, `.*` and `;*`.
To improve the matcher speed, try to make your token patterns and operators as
specific as possible. For example, use `>` instead of `>>` if possible and use
token patterns that include dependency labels and other token attributes instead
of patterns such as `{}` that match any token in the sentence.
</Infobox>
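To make the advice above concrete, here is a sketch contrasting the two pattern shapes (both are valid; the second is just cheaper to evaluate):

```python
# Potentially slow: ">>" explores whole head chains and "{}" matches any token
general_pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">>", "RIGHT_ID": "anything", "RIGHT_ATTRS": {}},
]

# Usually faster: ">" only checks immediate children and the child is constrained
specific_pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
```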
## Rule-based entity recognition {#entityruler new="2.1"}
-The [`EntityRuler`](/api/entityruler) is an exciting new component that lets you
-add named entities based on pattern dictionaries, and makes it easy to combine
+The [`EntityRuler`](/api/entityruler) is a component that lets you add named
+entities based on pattern dictionaries, which makes it easy to combine
rule-based and statistical named entity recognition for even more powerful
pipelines.


@ -26,6 +26,7 @@ menu:
- [End-to-end project workflows](#features-projects)
- [New built-in components](#features-pipeline-components)
- [New custom component API](#features-components)
- [Dependency matching](#features-dep-matcher)
- [Python type hints](#features-types)
- [New methods & attributes](#new-methods)
- [New & updated documentation](#new-docs)
@ -201,6 +202,41 @@ aren't set.

</Infobox>
### Dependency matching {#features-dep-matcher}
<!-- TODO: improve summary -->
> #### Example
>
> ```python
> from spacy.matcher import DependencyMatcher
>
> matcher = DependencyMatcher(nlp.vocab)
> pattern = [
> {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
> {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}}
> ]
> matcher.add("FOUNDED", [pattern])
> ```
The new [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns
within the dependency parse using
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
operators. It follows the same API as the token-based [`Matcher`](/api/matcher).
A pattern added to the dependency matcher consists of a **list of
dictionaries**, with each dictionary describing a **token to match** and its
**relation to an existing token** in the pattern.
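A minimal usage sketch for the example above (assuming an `nlp` pipeline that includes a dependency parser, e.g. `en_core_web_sm`, and the `matcher` from the example):

```python
doc = nlp("Bill Gates founded Microsoft.")
for match_id, token_ids in matcher(doc):
    # token_ids line up with the dicts in the pattern: [anchor_founded, subject]
    print(nlp.vocab.strings[match_id], [doc[token_id].text for token_id in token_ids])
```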
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:**
[Dependency matching](/usage/rule-based-matching#dependencymatcher),
- **API:** [`DependencyMatcher`](/api/dependencymatcher),
- **Implementation:**
[`spacy/matcher/dependencymatcher.pyx`](https://github.com/explosion/spaCy/tree/develop/spacy/matcher/dependencymatcher.pyx)
</Infobox>
### Type hints and type-based data validation {#features-types}

> #### Example

@ -306,14 +342,16 @@ format for documenting argument and return types.
[Custom tokenizers](/usage/linguistic-features#custom-tokenizer),
[Morphology](/usage/linguistic-features#morphology),
[Lemmatization](/usage/linguistic-features#lemmatization),
-[Mapping & Exceptions](/usage/linguistic-features#mappings-exceptions)
+[Mapping & Exceptions](/usage/linguistic-features#mappings-exceptions),
+[Dependency matching](/usage/rule-based-matching#dependencymatcher)
- **API Reference: ** [Library architecture](/api),
[Model architectures](/api/architectures), [Data formats](/api/data-formats)
- **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
[`Transformer`](/api/transformer), [`Lemmatizer`](/api/lemmatizer),
[`Morphologizer`](/api/morphologizer),
[`AttributeRuler`](/api/attributeruler),
-[`SentenceRecognizer`](/api/sentencerecognizer), [`Pipe`](/api/pipe),
+[`SentenceRecognizer`](/api/sentencerecognizer),
+[`DependencyMatcher`](/api/dependencymatcher), [`Pipe`](/api/pipe),
[`Corpus`](/api/corpus)

</Infobox>


@ -1,5 +1,30 @@
{ {
"resources": [ "resources": [
{
"id": "spacy-sentence-bert",
"title": "spaCy - sentence-transformers",
"slogan": "Pipelines for pretrained sentence-transformers (BERT, RoBERTa, XLM-RoBERTa & Co.) directly within spaCy",
"description": "This library lets you use the embeddings from [sentence-transformers](https://github.com/UKPLab/sentence-transformers) of Docs, Spans and Tokens directly from spaCy. Most models are for the english language but three of them are multilingual.",
"github": "MartinoMensio/spacy-sentence-bert",
"pip": "spacy-sentence-bert",
"code_example": [
"import spacy_sentence_bert",
"# load one of the models listed at https://github.com/MartinoMensio/spacy-sentence-bert/",
"nlp = spacy_sentence_bert.load_model('en_roberta_large_nli_stsb_mean_tokens')",
"# get two documents",
"doc_1 = nlp('Hi there, how are you?')",
"doc_2 = nlp('Hello there, how are you doing today?')",
"# use the similarity method that is based on the vectors, on Doc, Span or Token",
"print(doc_1.similarity(doc_2[0:7]))"
],
"category": ["models", "pipeline"],
"author": "Martino Mensio",
"author_links": {
"twitter": "MartinoMensio",
"github": "MartinoMensio",
"website": "https://martinomensio.github.io"
}
},
{ {
"id": "spacy-streamlit", "id": "spacy-streamlit",
"title": "spacy-streamlit", "title": "spacy-streamlit",
@ -55,13 +80,14 @@
}, },
{ {
"id": "spacy-universal-sentence-encoder", "id": "spacy-universal-sentence-encoder",
"title": "SpaCy - Universal Sentence Encoder", "title": "spaCy - Universal Sentence Encoder",
"slogan": "Make use of Google's Universal Sentence Encoder directly within SpaCy", "slogan": "Make use of Google's Universal Sentence Encoder directly within spaCy",
"description": "This library lets you use Universal Sentence Encoder embeddings of Docs, Spans and Tokens directly from TensorFlow Hub", "description": "This library lets you use Universal Sentence Encoder embeddings of Docs, Spans and Tokens directly from TensorFlow Hub",
"github": "MartinoMensio/spacy-universal-sentence-encoder-tfhub", "github": "MartinoMensio/spacy-universal-sentence-encoder",
"pip": "spacy-universal-sentence-encoder",
"code_example": [ "code_example": [
"import spacy_universal_sentence_encoder", "import spacy_universal_sentence_encoder",
"load one of the models: ['en_use_md', 'en_use_lg', 'xx_use_md', 'xx_use_lg']", "# load one of the models: ['en_use_md', 'en_use_lg', 'xx_use_md', 'xx_use_lg']",
"nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')", "nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')",
"# get two documents", "# get two documents",
"doc_1 = nlp('Hi there, how are you?')", "doc_1 = nlp('Hi there, how are you?')",
@ -1436,7 +1462,7 @@
"id": "podcast-init", "id": "podcast-init",
"title": "Podcast.__init__ #87: spaCy with Matthew Honnibal", "title": "Podcast.__init__ #87: spaCy with Matthew Honnibal",
"slogan": "December 2017", "slogan": "December 2017",
"description": "As the amount of text available on the internet and in businesses continues to increase, the need for fast and accurate language analysis becomes more prominent. This week Matthew Honnibal, the creator of SpaCy, talks about his experiences researching natural language processing and creating a library to make his findings accessible to industry.", "description": "As the amount of text available on the internet and in businesses continues to increase, the need for fast and accurate language analysis becomes more prominent. This week Matthew Honnibal, the creator of spaCy, talks about his experiences researching natural language processing and creating a library to make his findings accessible to industry.",
"iframe": "https://www.pythonpodcast.com/wp-content/plugins/podlove-podcasting-plugin-for-wordpress/lib/modules/podlove_web_player/player_v4/dist/share.html?episode=https://www.pythonpodcast.com/?podlove_player4=176", "iframe": "https://www.pythonpodcast.com/wp-content/plugins/podlove-podcasting-plugin-for-wordpress/lib/modules/podlove_web_player/player_v4/dist/share.html?episode=https://www.pythonpodcast.com/?podlove_player4=176",
"iframe_height": 200, "iframe_height": 200,
"thumb": "https://i.imgur.com/rpo6BuY.png", "thumb": "https://i.imgur.com/rpo6BuY.png",
@ -1452,7 +1478,7 @@
"id": "podcast-init2", "id": "podcast-init2",
"title": "Podcast.__init__ #256: An Open Source Toolchain For NLP From Explosion AI", "title": "Podcast.__init__ #256: An Open Source Toolchain For NLP From Explosion AI",
"slogan": "March 2020", "slogan": "March 2020",
"description": "The state of the art in natural language processing is a constantly moving target. With the rise of deep learning, previously cutting edge techniques have given way to robust language models. Through it all the team at Explosion AI have built a strong presence with the trifecta of SpaCy, Thinc, and Prodigy to support fast and flexible data labeling to feed deep learning models and performant and scalable text processing. In this episode founder and open source author Matthew Honnibal shares his experience growing a business around cutting edge open source libraries for the machine learning developent process.", "description": "The state of the art in natural language processing is a constantly moving target. With the rise of deep learning, previously cutting edge techniques have given way to robust language models. Through it all the team at Explosion AI have built a strong presence with the trifecta of spaCy, Thinc, and Prodigy to support fast and flexible data labeling to feed deep learning models and performant and scalable text processing. In this episode founder and open source author Matthew Honnibal shares his experience growing a business around cutting edge open source libraries for the machine learning developent process.",
"iframe": "https://cdn.podlove.org/web-player/share.html?episode=https%3A%2F%2Fwww.pythonpodcast.com%2F%3Fpodlove_player4%3D614", "iframe": "https://cdn.podlove.org/web-player/share.html?episode=https%3A%2F%2Fwww.pythonpodcast.com%2F%3Fpodlove_player4%3D614",
"iframe_height": 200, "iframe_height": 200,
"thumb": "https://i.imgur.com/rpo6BuY.png", "thumb": "https://i.imgur.com/rpo6BuY.png",
@ -1483,7 +1509,7 @@
"id": "twimlai-podcast", "id": "twimlai-podcast",
"title": "TWiML & AI: Practical NLP with spaCy and Prodigy", "title": "TWiML & AI: Practical NLP with spaCy and Prodigy",
"slogan": "May 2019", "slogan": "May 2019",
"description": "\"Ines and I caught up to discuss her various projects, including the aforementioned SpaCy, an open-source NLP library built with a focus on industry and production use cases. In our conversation, Ines gives us an overview of the SpaCy Library, a look at some of the use cases that excite her, and the Spacy community and contributors. We also discuss her work with Prodigy, an annotation service tool that uses continuous active learning to train models, and finally, what other exciting projects she is working on.\"", "description": "\"Ines and I caught up to discuss her various projects, including the aforementioned spaCy, an open-source NLP library built with a focus on industry and production use cases. In our conversation, Ines gives us an overview of the spaCy Library, a look at some of the use cases that excite her, and the Spacy community and contributors. We also discuss her work with Prodigy, an annotation service tool that uses continuous active learning to train models, and finally, what other exciting projects she is working on.\"",
"thumb": "https://i.imgur.com/ng2F5gK.png", "thumb": "https://i.imgur.com/ng2F5gK.png",
"url": "https://twimlai.com/twiml-talk-262-practical-natural-language-processing-with-spacy-and-prodigy-w-ines-montani", "url": "https://twimlai.com/twiml-talk-262-practical-natural-language-processing-with-spacy-and-prodigy-w-ines-montani",
"iframe": "https://html5-player.libsyn.com/embed/episode/id/9691514/height/90/theme/custom/thumbnail/no/preload/no/direction/backward/render-playlist/no/custom-color/3e85b1/", "iframe": "https://html5-player.libsyn.com/embed/episode/id/9691514/height/90/theme/custom/thumbnail/no/preload/no/direction/backward/render-playlist/no/custom-color/3e85b1/",
@ -1515,7 +1541,7 @@
"id": "practical-ai-podcast", "id": "practical-ai-podcast",
"title": "Practical AI: Modern NLP with spaCy", "title": "Practical AI: Modern NLP with spaCy",
"slogan": "December 2019", "slogan": "December 2019",
"description": "\"SpaCy is awesome for NLP! Its easy to use, has widespread adoption, is open source, and integrates the latest language models. Ines Montani and Matthew Honnibal (core developers of spaCy and co-founders of Explosion) join us to discuss the history of the project, its capabilities, and the latest trends in NLP. We also dig into the practicalities of taking NLP workflows to production. You dont want to miss this episode!\"", "description": "\"spaCy is awesome for NLP! Its easy to use, has widespread adoption, is open source, and integrates the latest language models. Ines Montani and Matthew Honnibal (core developers of spaCy and co-founders of Explosion) join us to discuss the history of the project, its capabilities, and the latest trends in NLP. We also dig into the practicalities of taking NLP workflows to production. You dont want to miss this episode!\"",
"thumb": "https://i.imgur.com/jn8Bcdw.png", "thumb": "https://i.imgur.com/jn8Bcdw.png",
"url": "https://changelog.com/practicalai/68", "url": "https://changelog.com/practicalai/68",
"author": "Daniel Whitenack & Chris Benson", "author": "Daniel Whitenack & Chris Benson",
@ -1770,26 +1796,33 @@
{
"id": "spacy-conll",
"title": "spacy_conll",
-"slogan": "Parse text with spaCy and gets its output in CoNLL-U format",
+"slogan": "Parsing to CoNLL with spaCy, spacy-stanza, and spacy-udpipe",
-"description": "This module allows you to parse a text to CoNLL-U format. It contains a pipeline component for spaCy that adds CoNLL-U properties to a Doc and its sentences. It can also be used as a command-line tool.",
+"description": "This module allows you to parse text into CoNLL-U format. You can use it as a command line tool, or embed it in your own scripts by adding it as a custom pipeline component to a spaCy, spacy-stanfordnlp, spacy-stanza, or spacy-udpipe pipeline. It also provides an easy-to-use function to quickly initialize a parser. CoNLL-related properties are added to Doc elements, sentence Spans, and Tokens.",
"code_example": [
-"import spacy",
-"from spacy_conll import ConllFormatter",
-"",
-"nlp = spacy.load('en')",
-"conllformatter = ConllFormatter(nlp)",
-"nlp.add_pipe(conllformatter, after='parser')",
-"doc = nlp('I like cookies. Do you?')",
-"conll = doc._.conll",
-"print(doc._.conll_str_headers)",
-"print(doc._.conll_str)"
+"from spacy_conll import init_parser",
+"",
+"",
+"# Initialise English parser, already including the ConllFormatter as a pipeline component.",
+"# Indicate that we want to get the CoNLL headers in the string output.",
+"# `use_gpu` and `verbose` are specific to stanza (and stanfordnlp). These keyword arguments",
+"# are passed on to their Pipeline() initialisation",
+"nlp = init_parser(\"stanza\",",
+" \"en\",",
+" parser_opts={\"use_gpu\": True, \"verbose\": False},",
+" include_headers=True)",
+"# Parse a given string",
+"doc = nlp(\"A cookie is a baked or cooked food that is typically small, flat and sweet. It usually contains flour, sugar and some type of oil or fat.\")",
+"",
+"# Get the CoNLL representation of the whole document, including headers",
+"conll = doc._.conll_str",
+"print(conll)"
],
"code_language": "python",
"author": "Bram Vanroy",
"author_links": {
"github": "BramVanroy",
"twitter": "BramVanroy",
-"website": "https://bramvanroy.be"
+"website": "http://bramvanroy.be"
},
"github": "BramVanroy/spacy_conll",
"category": ["standalone", "pipeline"],
@ -1935,6 +1968,28 @@
"category": ["pipeline"], "category": ["pipeline"],
"tags": ["inflection", "lemmatizer"] "tags": ["inflection", "lemmatizer"]
}, },
{
"id": "amrlib",
"slogan": "A python library that makes AMR parsing, generation and visualization simple.",
"description": "amrlib is a python module and spaCy add-in for Abstract Meaning Representation (AMR). The system can parse sentences to AMR graphs or generate text from existing graphs. It includes a GUI for visualization and experimentation.",
"github": "bjascob/amrlib",
"pip": "amrlib",
"code_example": [
"import spacy",
"import amrlib",
"amrlib.setup_spacy_extension()",
"nlp = spacy.load('en_core_web_sm')",
"doc = nlp('This is a test of the spaCy extension. The test has multiple sentences.')",
"graphs = doc._.to_amr()",
"for graph in graphs:",
" print(graph)"
],
"author": "Brad Jascob",
"author_links": {
"github": "bjascob"
},
"category": ["pipeline"]
},
{
"id": "blackstone",
"title": "Blackstone",
@ -2138,7 +2193,7 @@
"category": ["scientific"], "category": ["scientific"],
"tags": ["sentence segmentation"], "tags": ["sentence segmentation"],
"code_example": [ "code_example": [
"from pysbd.util import PySBDFactory", "from pysbd.utils import PySBDFactory",
"", "",
"nlp = spacy.blank('en')", "nlp = spacy.blank('en')",
"nlp.add_pipe(PySBDFactory(nlp))", "nlp.add_pipe(PySBDFactory(nlp))",

View File

@ -6,7 +6,14 @@ import { navigate } from 'gatsby'
import classes from '../styles/dropdown.module.sass'
export default function Dropdown({ defaultValue, className, onChange, children }) {
-    const defaultOnChange = ({ target }) => navigate(target.value)
+    const defaultOnChange = ({ target }) => {
+        const isExternal = /((http(s?)):\/\/|mailto:)/gi.test(target.value)
+        if (isExternal) {
+            window.location.href = target.value
+        } else {
+            navigate(target.value)
+        }
+    }
    return (
        <select
            defaultValue={defaultValue}
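The updated onChange handler above routes absolute URLs through the browser and everything else through Gatsby's client-side navigation. As a rough standalone sketch of that check (plain JavaScript; the isExternalUrl helper and the sample values are illustrative, not part of the component):

// Same pattern as the handler above, wrapped in a tiny helper.
// Absolute http(s)/mailto values go to window.location.href,
// everything else is treated as an internal route for navigate().
const isExternalUrl = value => /((http(s?)):\/\/|mailto:)/gi.test(value)

console.log(isExternalUrl('https://explosion.ai'))      // true  -> window.location.href
console.log(isExternalUrl('mailto:hello@example.com'))  // true  -> window.location.href
console.log(isExternalUrl('/usage/projects'))           // false -> navigate(value)

Because the regex literal is evaluated on every call, the global flag's lastIndex state never leaks between invocations.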

View File

@ -28,7 +28,7 @@ const CODE_EXAMPLE = `# pip install spacy
import spacy
-# Load English tokenizer, tagger, parser, NER and word vectors
+# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")
# Process whole documents