diff --git a/.github/contributors/bittlingmayer.md b/.github/contributors/bittlingmayer.md
new file mode 100644
index 000000000..69ec98a00
--- /dev/null
+++ b/.github/contributors/bittlingmayer.md
@@ -0,0 +1,107 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Adam Bittlingmayer   |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 12 Aug 2020          |
+| GitHub username                | bittlingmayer        |
+| Website (optional)             |                      |
+
diff --git a/.github/contributors/graue70.md b/.github/contributors/graue70.md
new file mode 100644
index 000000000..7f9aa037b
--- /dev/null
+++ b/.github/contributors/graue70.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Thomas               |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2020-08-11           |
+| GitHub username                | graue70              |
+| Website (optional)             |                      |
diff --git a/.github/contributors/holubvl3.md b/.github/contributors/holubvl3.md
new file mode 100644
index 000000000..f2047b103
--- /dev/null
+++ b/.github/contributors/holubvl3.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Vladimir Holubec     |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 30.07.2020           |
+| GitHub username                | holubvl3             |
+| Website (optional)             |                      |
diff --git a/.github/contributors/idoshr.md b/.github/contributors/idoshr.md
new file mode 100644
index 000000000..26e901530
--- /dev/null
+++ b/.github/contributors/idoshr.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Ido Shraga           |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 20-09-2020           |
+| GitHub username                | idoshr               |
+| Website (optional)             |                      |
diff --git a/.github/contributors/jgutix.md b/.github/contributors/jgutix.md
new file mode 100644
index 000000000..4bda9486b
--- /dev/null
+++ b/.github/contributors/jgutix.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Juan Gutiérrez       |
+| Company name (if applicable)   | Ojtli                |
+| Title or role (if applicable)  |                      |
+| Date                           | 2020-08-28           |
+| GitHub username                | jgutix               |
+| Website (optional)             | ojtli.app            |
diff --git a/.github/contributors/leyendecker.md b/.github/contributors/leyendecker.md
new file mode 100644
index 000000000..74e6cdd80
--- /dev/null
+++ b/.github/contributors/leyendecker.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                        |
+|------------------------------- | ---------------------------- |
+| Name                           | Gustavo Zadrozny Leyendecker |
+| Company name (if applicable)   |                              |
+| Title or role (if applicable)  |                              |
+| Date                           | July 29, 2020                |
+| GitHub username                | leyendecker                  |
+| Website (optional)             |                              |
diff --git a/.github/contributors/lizhe2004.md b/.github/contributors/lizhe2004.md
new file mode 100644
index 000000000..6011506d6
--- /dev/null
+++ b/.github/contributors/lizhe2004.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                    |
+|------------------------------- | ------------------------ |
+| Name                           | Zhe li                   |
+| Company name (if applicable)   |                          |
+| Title or role (if applicable)  |                          |
+| Date                           | 2020-07-24               |
+| GitHub username                | lizhe2004                |
+| Website (optional)             | http://www.huahuaxia.net |
diff --git a/.github/contributors/snsten.md b/.github/contributors/snsten.md
new file mode 100644
index 000000000..0d7c28835
--- /dev/null
+++ b/.github/contributors/snsten.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Shashank Shekhar     |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2020-08-23           |
+| GitHub username                | snsten               |
+| Website (optional)             |                      |
diff --git a/.github/contributors/solarmist.md b/.github/contributors/solarmist.md
new file mode 100644
index 000000000..6bfb21696
--- /dev/null
+++ b/.github/contributors/solarmist.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                     |
+|------------------------------- | ------------------------- |
+| Name                           | Joshua Olson              |
+| Company name (if applicable)   |                           |
+| Title or role (if applicable)  |                           |
+| Date                           | 2020-07-22                |
+| GitHub username                | solarmist                 |
+| Website (optional)             | http://blog.solarmist.net |
diff --git a/.github/contributors/tilusnet.md b/.github/contributors/tilusnet.md
new file mode 100644
index 000000000..1618bac2e
--- /dev/null
+++ b/.github/contributors/tilusnet.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Attila Szász         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 12 Aug 2020          |
+| GitHub username                | tilusnet             |
+| Website (optional)             |                      |
diff --git a/licenses/3rd_party_licenses.txt b/licenses/3rd_party_licenses.txt
new file mode 100644
index 000000000..0aeef5507
--- /dev/null
+++ b/licenses/3rd_party_licenses.txt
@@ -0,0 +1,38 @@
+Third Party Licenses for spaCy
+==============================
+
+NumPy
+-----
+
+* Files: setup.py
+
+Copyright (c) 2005-2020, NumPy Developers.
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+
+    * Redistributions of source code must retain the above copyright
+       notice, this list of conditions and the following disclaimer.
+
+    * Redistributions in binary form must reproduce the above
+       copyright notice, this list of conditions and the following
+       disclaimer in the documentation and/or other materials provided
+       with the distribution.
+
+    * Neither the name of the NumPy Developers nor the names of any
+       contributors may be used to endorse or promote products derived
+       from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/spacy/about.py b/spacy/about.py
index 3fe720dbc..7d0e85a17 100644
--- a/spacy/about.py
+++ b/spacy/about.py
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy-nightly"
-__version__ = "3.0.0a13"
+__version__ = "3.0.0a14"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
diff --git a/spacy/cli/project/pull.py b/spacy/cli/project/pull.py
index 655e2f459..edcd410bd 100644
--- a/spacy/cli/project/pull.py
+++ b/spacy/cli/project/pull.py
@@ -40,5 +40,6 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
             url = storage.pull(output_path, command_hash=cmd_hash)
             yield url, output_path
-        if cmd.get("outputs") and all(loc.exists() for loc in cmd["outputs"]):
+        out_locs = [project_dir / out for out in cmd.get("outputs", [])]
+        if all(loc.exists() for loc in out_locs):
             update_lockfile(project_dir, cmd)
diff --git a/spacy/cli/project/push.py b/spacy/cli/project/push.py
index fcee2231a..26495412d 100644
--- a/spacy/cli/project/push.py
+++ b/spacy/cli/project/push.py
@@ -45,10 +45,19 @@ def project_push(project_dir: Path, remote: str):
         )
         for output_path in cmd.get("outputs", []):
             output_loc = project_dir / output_path
-            if output_loc.exists():
+            if output_loc.exists() and _is_not_empty_dir(output_loc):
                 url = storage.push(
                     output_path,
                     command_hash=cmd_hash,
                     content_hash=get_content_hash(output_loc),
                 )
                 yield output_path, url
+
+
+def _is_not_empty_dir(loc: Path):
+    if not loc.is_dir():
+        return True
+    elif any(_is_not_empty_dir(child) for child in loc.iterdir()):
+        return True
+    else:
+        return False
diff --git a/spacy/cli/templates/quickstart_training.jinja b/spacy/cli/templates/quickstart_training.jinja
index 43c852d13..199aae217 100644
--- a/spacy/cli/templates/quickstart_training.jinja
+++ b/spacy/cli/templates/quickstart_training.jinja
@@ -186,11 +186,14 @@ accumulate_gradient = {{ transformer["size_factor"] }}
 
 [training.optimizer]
 @optimizers = "Adam.v1"
+
+{% if use_transformer -%}
 [training.optimizer.learn_rate]
 @schedules = "warmup_linear.v1"
 warmup_steps = 250
 total_steps = 20000
 initial_rate = 5e-5
+{% endif %}
 
 [training.train_corpus]
 @readers = "spacy.Corpus.v1"
diff --git a/spacy/displacy/render.py b/spacy/displacy/render.py
index 984971812..ba56beca3 100644
--- a/spacy/displacy/render.py
+++ b/spacy/displacy/render.py
@@ -329,7 +329,11 @@ class EntityRenderer:
             else:
                 markup += entity
                 offset = end
-        markup += escape_html(text[offset:])
+        fragments = text[offset:].split("\n")
+        for i, fragment in enumerate(fragments):
+            markup += escape_html(fragment)
+            if len(fragments) > 1 and i != len(fragments) - 1:
+                markup += "</br>"
         markup = TPL_ENTS.format(content=markup, dir=self.direction)
         if title:
             markup = TPL_TITLE.format(title=title) + markup
diff --git a/spacy/errors.py b/spacy/errors.py
index f3058d2b4..bad3e83e4 100644
--- a/spacy/errors.py
+++ b/spacy/errors.py
@@ -76,6 +76,10 @@
             "If this is surprising, make sure you have the spacy-lookups-data "
             "package installed. The languages with lexeme normalization tables "
             "are currently: {langs}")
+    W034 = ("Please install the package spacy-lookups-data in order to include "
+            "the default lexeme normalization table for the language '{lang}'.")
+    W035 = ('Discarding subpattern "{pattern}" due to an unrecognized '
+            "attribute or operator.")
 
     # TODO: fix numbering after merging develop into master
     W090 = ("Could not locate any binary .spacy files in path '{path}'.")
@@ -284,12 +288,12 @@
             "Span objects, or dicts if set to manual=True.")
     E097 = ("Invalid pattern: expected token pattern (list of dicts) or "
             "phrase pattern (string) but got:\n{pattern}")
-    E098 = ("Invalid pattern specified: expected both SPEC and PATTERN.")
-    E099 = ("First node of pattern should be a root node. The root should "
-            "only contain NODE_NAME.")
-    E100 = ("Nodes apart from the root should contain NODE_NAME, NBOR_NAME and "
-            "NBOR_RELOP.")
-    E101 = ("NODE_NAME should be a new node and NBOR_NAME should already have "
+    E098 = ("Invalid pattern: expected both RIGHT_ID and RIGHT_ATTRS.")
+    E099 = ("Invalid pattern: the first node of pattern should be an anchor "
+            "node. The node should only contain RIGHT_ID and RIGHT_ATTRS.")
+    E100 = ("Nodes other than the anchor node should all contain LEFT_ID, "
+            "REL_OP and RIGHT_ID.")
+    E101 = ("RIGHT_ID should be a new node and LEFT_ID should already have "
             "have been declared in previous edges.")
     E102 = ("Can't merge non-disjoint spans. '{token}' is already part of "
             "tokens to merge. If you want to find the longest non-overlapping "
@@ -474,6 +478,9 @@
     E198 = ("Unable to return {n} most similar vectors for the current vectors "
             "table, which contains {n_rows} vectors.")
     E199 = ("Unable to merge 0-length span at doc[{start}:{end}].")
+    E200 = ("Specifying a base model with a pretrained component '{component}' "
+            "can not be combined with adding a pretrained Tok2Vec layer.")
+    E201 = ("Span index out of range.")
 
     # TODO: fix numbering after merging develop into master
     E925 = ("Invalid color values for displaCy visualizer: expected dictionary "
@@ -654,6 +661,9 @@
             "'{chunk}'. Tokenizer exceptions are only allowed to specify "
             "`ORTH` and `NORM`.")
     E1006 = ("Unable to initialize {name} model with 0 labels.")
+    E1007 = ("Unsupported DependencyMatcher operator '{op}'.")
+    E1008 = ("Invalid pattern: each pattern should be a list of dicts. Check "
+             "that you are providing a list of patterns as `List[List[dict]]`.")
 
 
 @add_codes
diff --git a/spacy/lang/cs/__init__.py b/spacy/lang/cs/__init__.py
index a4b546b13..0c35e2288 100644
--- a/spacy/lang/cs/__init__.py
+++ b/spacy/lang/cs/__init__.py
@@ -1,9 +1,11 @@
 from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
 from ...language import Language
 
 
 class CzechDefaults(Language.Defaults):
     stop_words = STOP_WORDS
+    lex_attr_getters = LEX_ATTRS
 
 
 class Czech(Language):
diff --git a/spacy/lang/cs/examples.py b/spacy/lang/cs/examples.py
new file mode 100644
index 000000000..a30b5ac14
--- /dev/null
+++ b/spacy/lang/cs/examples.py
@@ -0,0 +1,38 @@
+"""
+Example sentences to test spaCy and its language models.
+>>> from spacy.lang.cs.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+    "Máma mele maso.",
+    "Příliš žluťoučký kůň úpěl ďábelské ódy.",
+    "ArcGIS je geografický informační systém určený pro práci s prostorovými daty.",
+    "Může data vytvářet a spravovat, ale především je dokáže analyzovat, najít v nich nové vztahy a vše přehledně vizualizovat.",
+    "Dnes je krásné počasí.",
+    "Nestihl autobus, protože pozdě vstal z postele.",
+    "Než budeš jíst, jdi si umýt ruce.",
+    "Dnes je neděle.",
+    "Škola začíná v 8:00.",
+    "Poslední autobus jede v jedenáct hodin večer.",
+    "V roce 2020 se téměř zastavila světová ekonomika.",
+    "Praha je hlavní město České republiky.",
+    "Kdy půjdeš ven?",
+    "Kam pojedete na dovolenou?",
+    "Kolik stojí iPhone 12?",
+    "Průměrná mzda je 30000 Kč.",
+    "1. ledna 1993 byla založena Česká republika.",
+    "Co se stalo 21.8.1968?",
+    "Moje telefonní číslo je 712 345 678.",
+    "Můj pes má blechy.",
+    "Když bude přes noc více než 20°, tak nás čeká tropická noc.",
+    "Kolik bylo letos tropických nocí?",
+    "Jak to mám udělat?",
+    "Bydlíme ve čtvrtém patře.",
+    "Vysílají 30. sezonu seriálu Simpsonovi.",
+    "Adresa ČVUT je Thákurova 7, 166 29, Praha 6.",
+    "Jaké PSČ má Praha 1?",
+    "PSČ Prahy 1 je 110 00.",
+    "Za 20 minut jede vlak.",
+]
diff --git a/spacy/lang/cs/lex_attrs.py b/spacy/lang/cs/lex_attrs.py
new file mode 100644
index 000000000..530d1d5eb
--- /dev/null
+++ b/spacy/lang/cs/lex_attrs.py
@@ -0,0 +1,61 @@
+from ...attrs import LIKE_NUM
+
+_num_words = [
+    "nula",
+    "jedna",
+    "dva",
+    "tři",
+    "čtyři",
+    "pět",
+    "šest",
+    "sedm",
+    "osm",
+    "devět",
+    "deset",
+    "jedenáct",
+    "dvanáct",
+    "třináct",
+    "čtrnáct",
+    "patnáct",
+    "šestnáct",
+    "sedmnáct",
+    "osmnáct",
+    "devatenáct",
+    "dvacet",
+    "třicet",
+    "čtyřicet",
+    "padesát",
+    "šedesát",
+    "sedmdesát",
+    "osmdesát",
+    "devadesát",
+    "sto",
+    "tisíc",
+    "milion",
+    "miliarda",
+    "bilion",
+    "biliarda",
+    "trilion",
+    "triliarda",
+    "kvadrilion",
+    "kvadriliarda",
+    "kvintilion",
+]
+
+
+def like_num(text):
+    if text.startswith(("+", "-", "±", "~")):
+        text = text[1:]
+    text = text.replace(",", "").replace(".", "")
+    if text.isdigit():
+        return True
+    if text.count("/") == 1:
+        num, denom = text.split("/")
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text.lower() in _num_words:
+        return True
+    return False
+
+
+LEX_ATTRS = {LIKE_NUM: like_num}
diff --git a/spacy/lang/cs/stop_words.py b/spacy/lang/cs/stop_words.py
index 70aab030b..f61f424f6 100644
--- a/spacy/lang/cs/stop_words.py
+++ b/spacy/lang/cs/stop_words.py
@@ -1,14 +1,23 @@
 # Source: https://github.com/Alir3z4/stop-words
+# Source: https://github.com/stopwords-iso/stopwords-cs/blob/master/stopwords-cs.txt
 
 STOP_WORDS = set(
     """
-ačkoli
+a
+aby
 ahoj
+ačkoli
 ale
+alespoň
 anebo
+ani
+aniž
 ano
+atd.
+atp.
asi aspoň +až během bez beze @@ -21,12 +30,14 @@ budeš budete budou budu +by byl byla byli bylo byly bys +být čau chce chceme @@ -35,14 +46,21 @@ chcete chci chtějí chtít -chut' +chuť chuti co +což +cz +či +článek +článku +články čtrnáct čtyři dál dále daleko +další děkovat děkujeme děkuji @@ -50,6 +68,7 @@ den deset devatenáct devět +dnes do dobrý docela @@ -57,9 +76,15 @@ dva dvacet dvanáct dvě +email +ho hodně +i já jak +jakmile +jako +jakož jde je jeden @@ -69,25 +94,39 @@ jedno jednou jedou jeho +jehož +jej její jejich +jejichž +jehož +jelikož jemu jen jenom +jenž +jež ještě jestli jestliže +ještě +ji jí jich jím +jim jimi jinak -jsem +jiné +již jsi jsme +jsem jsou jste +k kam +každý kde kdo kdy @@ -96,10 +135,13 @@ ke kolik kromě která +kterak +kterou které kteří který kvůli +ku má mají málo @@ -110,8 +152,10 @@ máte mé mě mezi +mi mí mít +mne mně mnou moc @@ -134,6 +178,7 @@ nás náš naše naši +načež ne ně nebo @@ -141,6 +186,7 @@ nebyl nebyla nebyli nebyly +nechť něco nedělá nedělají @@ -150,6 +196,7 @@ neděláš neděláte nějak nejsi +nejsou někde někdo nemají @@ -157,15 +204,22 @@ nemáme nemáte neměl němu +němuž není nestačí +ně nevadí +nové +nový +noví než nic nich +ní ním nimi nula +o od ode on @@ -179,22 +233,37 @@ pak patnáct pět po +pod +pokud pořád +pouze potom pozdě +pravé před +přede přes -přese +přece pro proč prosím prostě +proto proti +první +právě protože +při +přičemž rovně +s se sedm sedmnáct +si +sice +skoro +sic šest šestnáct skoro @@ -203,41 +272,69 @@ smí snad spolu sta +svůj +své +svá +svých +svým +svými +svůj sté sto +strana ta tady tak takhle taky +také +takže tam -tamhle -tamhleto +támhle +támhleto tamto tě tebe tebou -ted' +teď tedy ten +tento +této ti +tím +tímto tisíc tisíce to tobě tohle +tohoto +tom +tomto +tomu +tomuto toto třeba tři třináct trošku +trochu +tu +tuto tvá tvé tvoje tvůj ty +tyto +těm +těma +těmi +u určitě už +v vám vámi vás @@ -247,13 +344,19 @@ vaši ve večer vedle +více vlastně +však +všechen všechno všichni vůbec vy vždy +z +zda za +zde zač zatímco ze diff --git a/spacy/lang/cs/test_text.py b/spacy/lang/cs/test_text.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/lang/en/lex_attrs.py b/spacy/lang/en/lex_attrs.py index 975e6b392..fcc7c6bf2 100644 --- a/spacy/lang/en/lex_attrs.py +++ b/spacy/lang/en/lex_attrs.py @@ -8,6 +8,14 @@ _num_words = [ "fifty", "sixty", "seventy", "eighty", "ninety", "hundred", "thousand", "million", "billion", "trillion", "quadrillion", "gajillion", "bazillion" ] +_ordinal_words = [ + "first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth", + "ninth", "tenth", "eleventh", "twelfth", "thirteenth", "fourteenth", + "fifteenth", "sixteenth", "seventeenth", "eighteenth", "nineteenth", + "twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth", + "eightieth", "ninetieth", "hundredth", "thousandth", "millionth", "billionth", + "trillionth", "quadrillionth", "gajillionth", "bazillionth" +] # fmt: on @@ -21,8 +29,15 @@ def like_num(text: str) -> bool: num, denom = text.split("/") if num.isdigit() and denom.isdigit(): return True - if text.lower() in _num_words: + text_lower = text.lower() + if text_lower in _num_words: return True + # Check ordinal number + if text_lower in _ordinal_words: + return True + if text_lower.endswith("th"): + if text_lower[:-2].isdigit(): + return True return False diff --git a/spacy/lang/es/syntax_iterators.py b/spacy/lang/es/syntax_iterators.py index c33412693..427f1f203 100644 --- a/spacy/lang/es/syntax_iterators.py +++ 
b/spacy/lang/es/syntax_iterators.py @@ -19,8 +19,7 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]: np_left_deps = [doc.vocab.strings.add(label) for label in left_labels] np_right_deps = [doc.vocab.strings.add(label) for label in right_labels] stop_deps = [doc.vocab.strings.add(label) for label in stop_labels] - token = doc[0] - while token and token.i < len(doclike): + for token in doclike: if token.pos in [PROPN, NOUN, PRON]: left, right = noun_bounds( doc, token, np_left_deps, np_right_deps, stop_deps diff --git a/spacy/lang/he/__init__.py b/spacy/lang/he/__init__.py index 70bd9cf45..e0adc3293 100644 --- a/spacy/lang/he/__init__.py +++ b/spacy/lang/he/__init__.py @@ -1,9 +1,11 @@ from .stop_words import STOP_WORDS +from .lex_attrs import LEX_ATTRS from ...language import Language class HebrewDefaults(Language.Defaults): stop_words = STOP_WORDS + lex_attr_getters = LEX_ATTRS writing_system = {"direction": "rtl", "has_case": False, "has_letters": True} diff --git a/spacy/lang/he/lex_attrs.py b/spacy/lang/he/lex_attrs.py new file mode 100644 index 000000000..2953e7592 --- /dev/null +++ b/spacy/lang/he/lex_attrs.py @@ -0,0 +1,95 @@ +from ...attrs import LIKE_NUM + +_num_words = [ + "אפס", + "אחד", + "אחת", + "שתיים", + "שתים", + "שניים", + "שנים", + "שלוש", + "שלושה", + "ארבע", + "ארבעה", + "חמש", + "חמישה", + "שש", + "שישה", + "שבע", + "שבעה", + "שמונה", + "תשע", + "תשעה", + "עשר", + "עשרה", + "אחד עשר", + "אחת עשרה", + "שנים עשר", + "שתים עשרה", + "שלושה עשר", + "שלוש עשרה", + "ארבעה עשר", + "ארבע עשרה", + "חמישה עשר", + "חמש עשרה", + "ששה עשר", + "שש עשרה", + "שבעה עשר", + "שבע עשרה", + "שמונה עשר", + "שמונה עשרה", + "תשעה עשר", + "תשע עשרה", + "עשרים", + "שלושים", + "ארבעים", + "חמישים", + "שישים", + "שבעים", + "שמונים", + "תשעים", + "מאה", + "אלף", + "מליון", + "מליארד", + "טריליון", +] + + +_ordinal_words = [ + "ראשון", + "שני", + "שלישי", + "רביעי", + "חמישי", + "שישי", + "שביעי", + "שמיני", + "תשיעי", + "עשירי", +] + + +def like_num(text): + if text.startswith(("+", "-", "±", "~")): + text = text[1:] + text = text.replace(",", "").replace(".", "") + if text.isdigit(): + return True + + if text.count("/") == 1: + num, denom = text.split("/") + if num.isdigit() and denom.isdigit(): + return True + + if text in _num_words: + return True + + # CHeck ordinal number + if text in _ordinal_words: + return True + return False + + +LEX_ATTRS = {LIKE_NUM: like_num} diff --git a/spacy/lang/he/stop_words.py b/spacy/lang/he/stop_words.py index 2745460a7..23bb5176d 100644 --- a/spacy/lang/he/stop_words.py +++ b/spacy/lang/he/stop_words.py @@ -39,7 +39,6 @@ STOP_WORDS = set( בין עם עד -נגר על אל מול @@ -58,7 +57,7 @@ STOP_WORDS = set( עליך עלינו עליכם -לעיכן +עליכן עליהם עליהן כל @@ -67,8 +66,8 @@ STOP_WORDS = set( כך ככה כזה +כזאת זה -זות אותי אותה אותם @@ -91,7 +90,7 @@ STOP_WORDS = set( איתכן יהיה תהיה -היתי +הייתי היתה היה להיות @@ -101,8 +100,6 @@ STOP_WORDS = set( עצמם עצמן עצמנו -עצמהם -עצמהן מי מה איפה @@ -153,6 +150,7 @@ STOP_WORDS = set( לאו אי כלל +בעד נגד אם עם @@ -196,7 +194,6 @@ STOP_WORDS = set( אשר ואילו למרות -אס כמו כפי אז @@ -204,8 +201,8 @@ STOP_WORDS = set( כן לכן לפיכך -מאד עז +מאוד מעט מעטים במידה diff --git a/spacy/lang/hi/examples.py b/spacy/lang/hi/examples.py index ecb0b328c..1443b4908 100644 --- a/spacy/lang/hi/examples.py +++ b/spacy/lang/hi/examples.py @@ -15,4 +15,6 @@ sentences = [ "फ्रांस के राष्ट्रपति कौन हैं?", "संयुक्त राज्यों की राजधानी क्या है?", "बराक ओबामा का जन्म कब हुआ था?", + "जवाहरलाल नेहरू भारत के पहले प्रधानमंत्री हैं।", + "राजेंद्र 
प्रसाद, भारत के पहले राष्ट्रपति, दो कार्यकाल के लिए कार्यालय रखने वाले एकमात्र व्यक्ति हैं।", ] diff --git a/spacy/lang/ja/__init__.py b/spacy/lang/ja/__init__.py index 051415455..117514c09 100644 --- a/spacy/lang/ja/__init__.py +++ b/spacy/lang/ja/__init__.py @@ -254,7 +254,7 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"): return text_dtokens, text_spaces # align words and dtokens by referring text, and insert gap tokens for the space char spans - for word, dtoken in zip(words, dtokens): + for i, (word, dtoken) in enumerate(zip(words, dtokens)): # skip all space tokens if word.isspace(): continue @@ -275,7 +275,7 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"): text_spaces.append(False) text_pos += len(word) # poll a space char after the word - if text_pos < len(text) and text[text_pos] == " ": + if i + 1 < len(dtokens) and dtokens[i + 1].surface == " ": text_spaces[-1] = True text_pos += 1 diff --git a/spacy/lang/lex_attrs.py b/spacy/lang/lex_attrs.py index 088a05ef4..12016c273 100644 --- a/spacy/lang/lex_attrs.py +++ b/spacy/lang/lex_attrs.py @@ -8,7 +8,7 @@ from .. import attrs _like_email = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)").match _tlds = set( "com|org|edu|gov|net|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|" - "name|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|" + "name|pro|tel|travel|xyz|icu|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|" "ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|" "cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|" "ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|" diff --git a/spacy/lang/ne/stop_words.py b/spacy/lang/ne/stop_words.py index f008697d0..8470297b9 100644 --- a/spacy/lang/ne/stop_words.py +++ b/spacy/lang/ne/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt STOP_WORDS = set( diff --git a/spacy/lang/sa/__init__.py b/spacy/lang/sa/__init__.py new file mode 100644 index 000000000..345137817 --- /dev/null +++ b/spacy/lang/sa/__init__.py @@ -0,0 +1,16 @@ +from .stop_words import STOP_WORDS +from .lex_attrs import LEX_ATTRS +from ...language import Language + + +class SanskritDefaults(Language.Defaults): + lex_attr_getters = LEX_ATTRS + stop_words = STOP_WORDS + + +class Sanskrit(Language): + lang = "sa" + Defaults = SanskritDefaults + + +__all__ = ["Sanskrit"] diff --git a/spacy/lang/sa/examples.py b/spacy/lang/sa/examples.py new file mode 100644 index 000000000..60243c04c --- /dev/null +++ b/spacy/lang/sa/examples.py @@ -0,0 +1,15 @@ +""" +Example sentences to test spaCy and its language models. 
+ +>>> from spacy.lang.sa.examples import sentences +>>> docs = nlp.pipe(sentences) +""" + + +sentences = [ + "अभ्यावहति कल्याणं विविधं वाक् सुभाषिता ।", + "मनसि व्याकुले चक्षुः पश्यन्नपि न पश्यति ।", + "यस्य बुद्धिर्बलं तस्य निर्बुद्धेस्तु कुतो बलम्?", + "परो अपि हितवान् बन्धुः बन्धुः अपि अहितः परः ।", + "अहितः देहजः व्याधिः हितम् आरण्यं औषधम् ॥", +] diff --git a/spacy/lang/sa/lex_attrs.py b/spacy/lang/sa/lex_attrs.py new file mode 100644 index 000000000..f2b51650b --- /dev/null +++ b/spacy/lang/sa/lex_attrs.py @@ -0,0 +1,127 @@ +from ...attrs import LIKE_NUM + +# reference 1: https://en.wikibooks.org/wiki/Sanskrit/Numbers + +_num_words = [ + "एकः", + "द्वौ", + "त्रयः", + "चत्वारः", + "पञ्च", + "षट्", + "सप्त", + "अष्ट", + "नव", + "दश", + "एकादश", + "द्वादश", + "त्रयोदश", + "चतुर्दश", + "पञ्चदश", + "षोडश", + "सप्तदश", + "अष्टादश", + "एकान्नविंशति", + "विंशति", + "एकाविंशति", + "द्वाविंशति", + "त्रयोविंशति", + "चतुर्विंशति", + "पञ्चविंशति", + "षड्विंशति", + "सप्तविंशति", + "अष्टाविंशति", + "एकान्नत्रिंशत्", + "त्रिंशत्", + "एकत्रिंशत्", + "द्वात्रिंशत्", + "त्रयत्रिंशत्", + "चतुस्त्रिंशत्", + "पञ्चत्रिंशत्", + "षट्त्रिंशत्", + "सप्तत्रिंशत्", + "अष्टात्रिंशत्", + "एकोनचत्वारिंशत्", + "चत्वारिंशत्", + "एकचत्वारिंशत्", + "द्वाचत्वारिंशत्", + "त्रयश्चत्वारिंशत्", + "चतुश्चत्वारिंशत्", + "पञ्चचत्वारिंशत्", + "षट्चत्वारिंशत्", + "सप्तचत्वारिंशत्", + "अष्टाचत्वारिंशत्", + "एकोनपञ्चाशत्", + "पञ्चाशत्", + "एकपञ्चाशत्", + "द्विपञ्चाशत्", + "त्रिपञ्चाशत्", + "चतुःपञ्चाशत्", + "पञ्चपञ्चाशत्", + "षट्पञ्चाशत्", + "सप्तपञ्चाशत्", + "अष्टपञ्चाशत्", + "एकोनषष्ठिः", + "षष्ठिः", + "एकषष्ठिः", + "द्विषष्ठिः", + "त्रिषष्ठिः", + "चतुःषष्ठिः", + "पञ्चषष्ठिः", + "षट्षष्ठिः", + "सप्तषष्ठिः", + "अष्टषष्ठिः", + "एकोनसप्ततिः", + "सप्ततिः", + "एकसप्ततिः", + "द्विसप्ततिः", + "त्रिसप्ततिः", + "चतुःसप्ततिः", + "पञ्चसप्ततिः", + "षट्सप्ततिः", + "सप्तसप्ततिः", + "अष्टसप्ततिः", + "एकोनाशीतिः", + "अशीतिः", + "एकाशीतिः", + "द्वशीतिः", + "त्र्यशीतिः", + "चतुरशीतिः", + "पञ्चाशीतिः", + "षडशीतिः", + "सप्ताशीतिः", + "अष्टाशीतिः", + "एकोननवतिः", + "नवतिः", + "एकनवतिः", + "द्विनवतिः", + "त्रिनवतिः", + "चतुर्नवतिः", + "पञ्चनवतिः", + "षण्णवतिः", + "सप्तनवतिः", + "अष्टनवतिः", + "एकोनशतम्", + "शतम्", +] + + +def like_num(text): + """ + Check if text resembles a number + """ + if text.startswith(("+", "-", "±", "~")): + text = text[1:] + text = text.replace(",", "").replace(".", "") + if text.isdigit(): + return True + if text.count("/") == 1: + num, denom = text.split("/") + if num.isdigit() and denom.isdigit(): + return True + if text in _num_words: + return True + return False + + +LEX_ATTRS = {LIKE_NUM: like_num} diff --git a/spacy/lang/sa/stop_words.py b/spacy/lang/sa/stop_words.py new file mode 100644 index 000000000..30302a14d --- /dev/null +++ b/spacy/lang/sa/stop_words.py @@ -0,0 +1,515 @@ +# Source: https://gist.github.com/Akhilesh28/fe8b8e180f64b72e64751bc31cb6d323 + +STOP_WORDS = set( + """ +अहम् +आवाम् +वयम् +माम् मा +आवाम् +अस्मान् नः +मया +आवाभ्याम् +अस्माभिस् +मह्यम् मे +आवाभ्याम् नौ +अस्मभ्यम् नः +मत् +आवाभ्याम् +अस्मत् +मम मे +आवयोः +अस्माकम् नः +मयि +आवयोः +अस्मासु +त्वम् +युवाम् +यूयम् +त्वाम् त्वा +युवाम् वाम् +युष्मान् वः +त्वया +युवाभ्याम् +युष्माभिः +तुभ्यम् ते +युवाभ्याम् वाम् +युष्मभ्यम् वः +त्वत् +युवाभ्याम् +युष्मत् +तव ते +युवयोः वाम् +युष्माकम् वः +त्वयि +युवयोः +युष्मासु +सः +तौ +ते +तम् +तौ +तान् +तेन +ताभ्याम् +तैः +तस्मै +ताभ्याम् +तेभ्यः +तस्मात् +ताभ्याम् +तेभ्यः +तस्य +तयोः +तेषाम् +तस्मिन् +तयोः +तेषु +सा +ते +ताः +ताम् +ते +ताः +तया +ताभ्याम् +ताभिः +तस्यै +ताभ्याम् +ताभ्यः +तस्याः +ताभ्याम् +ताभ्यः 
+तस्य +तयोः +तासाम् +तस्याम् +तयोः +तासु +तत् +ते +तानि +तत् +ते +तानि +तया +ताभ्याम् +ताभिः +तस्यै +ताभ्याम् +ताभ्यः +तस्याः +ताभ्याम् +ताभ्यः +तस्य +तयोः +तासाम् +तस्याम् +तयोः +तासु +अयम् +इमौ +इमे +इमम् +इमौ +इमान् +अनेन +आभ्याम् +एभिः +अस्मै +आभ्याम् +एभ्यः +अस्मात् +आभ्याम् +एभ्यः +अस्य +अनयोः +एषाम् +अस्मिन् +अनयोः +एषु +इयम् +इमे +इमाः +इमाम् +इमे +इमाः +अनया +आभ्याम् +आभिः +अस्यै +आभ्याम् +आभ्यः +अस्याः +आभ्याम् +आभ्यः +अस्याः +अनयोः +आसाम् +अस्याम् +अनयोः +आसु +इदम् +इमे +इमानि +इदम् +इमे +इमानि +अनेन +आभ्याम् +एभिः +अस्मै +आभ्याम् +एभ्यः +अस्मात् +आभ्याम् +एभ्यः +अस्य +अनयोः +एषाम् +अस्मिन् +अनयोः +एषु +एषः +एतौ +एते +एतम् एनम् +एतौ एनौ +एतान् एनान् +एतेन +एताभ्याम् +एतैः +एतस्मै +एताभ्याम् +एतेभ्यः +एतस्मात् +एताभ्याम् +एतेभ्यः +एतस्य +एतस्मिन् +एतेषाम् +एतस्मिन् +एतस्मिन् +एतेषु +एषा +एते +एताः +एताम् एनाम् +एते एने +एताः एनाः +एतया एनया +एताभ्याम् +एताभिः +एतस्यै +एताभ्याम् +एताभ्यः +एतस्याः +एताभ्याम् +एताभ्यः +एतस्याः +एतयोः एनयोः +एतासाम् +एतस्याम् +एतयोः एनयोः +एतासु +एतत् एतद् +एते +एतानि +एतत् एतद् एनत् एनद् +एते एने +एतानि एनानि +एतेन एनेन +एताभ्याम् +एतैः +एतस्मै +एताभ्याम् +एतेभ्यः +एतस्मात् +एताभ्याम् +एतेभ्यः +एतस्य +एतयोः एनयोः +एतेषाम् +एतस्मिन् +एतयोः एनयोः +एतेषु +असौ +अमू +अमी +अमूम् +अमू +अमून् +अमुना +अमूभ्याम् +अमीभिः +अमुष्मै +अमूभ्याम् +अमीभ्यः +अमुष्मात् +अमूभ्याम् +अमीभ्यः +अमुष्य +अमुयोः +अमीषाम् +अमुष्मिन् +अमुयोः +अमीषु +असौ +अमू +अमूः +अमूम् +अमू +अमूः +अमुया +अमूभ्याम् +अमूभिः +अमुष्यै +अमूभ्याम् +अमूभ्यः +अमुष्याः +अमूभ्याम् +अमूभ्यः +अमुष्याः +अमुयोः +अमूषाम् +अमुष्याम् +अमुयोः +अमूषु +अमु +अमुनी +अमूनि +अमु +अमुनी +अमूनि +अमुना +अमूभ्याम् +अमीभिः +अमुष्मै +अमूभ्याम् +अमीभ्यः +अमुष्मात् +अमूभ्याम् +अमीभ्यः +अमुष्य +अमुयोः +अमीषाम् +अमुष्मिन् +अमुयोः +अमीषु +कः +कौ +के +कम् +कौ +कान् +केन +काभ्याम् +कैः +कस्मै +काभ्याम् +केभ्य +कस्मात् +काभ्याम् +केभ्य +कस्य +कयोः +केषाम् +कस्मिन् +कयोः +केषु +का +के +काः +काम् +के +काः +कया +काभ्याम् +काभिः +कस्यै +काभ्याम् +काभ्यः +कस्याः +काभ्याम् +काभ्यः +कस्याः +कयोः +कासाम् +कस्याम् +कयोः +कासु +किम् +के +कानि +किम् +के +कानि +केन +काभ्याम् +कैः +कस्मै +काभ्याम् +केभ्य +कस्मात् +काभ्याम् +केभ्य +कस्य +कयोः +केषाम् +कस्मिन् +कयोः +केषु +भवान् +भवन्तौ +भवन्तः +भवन्तम् +भवन्तौ +भवतः +भवता +भवद्भ्याम् +भवद्भिः +भवते +भवद्भ्याम् +भवद्भ्यः +भवतः +भवद्भ्याम् +भवद्भ्यः +भवतः +भवतोः +भवताम् +भवति +भवतोः +भवत्सु +भवती +भवत्यौ +भवत्यः +भवतीम् +भवत्यौ +भवतीः +भवत्या +भवतीभ्याम् +भवतीभिः +भवत्यै +भवतीभ्याम् +भवतीभिः +भवत्याः +भवतीभ्याम् +भवतीभिः +भवत्याः +भवत्योः +भवतीनाम् +भवत्याम् +भवत्योः +भवतीषु +भवत् +भवती +भवन्ति +भवत् +भवती +भवन्ति +भवता +भवद्भ्याम् +भवद्भिः +भवते +भवद्भ्याम् +भवद्भ्यः +भवतः +भवद्भ्याम् +भवद्भ्यः +भवतः +भवतोः +भवताम् +भवति +भवतोः +भवत्सु +अये +अरे +अरेरे +अविधा +असाधुना +अस्तोभ +अहह +अहावस् +आम् +आर्यहलम् +आह +आहो +इस् +उम् +उवे +काम् +कुम् +चमत् +टसत् +दृन् +धिक् +पाट् +फत् +फाट् +फुडुत् +बत +बाल् +वट् +व्यवस्तोभति व्यवस्तुभ् +षाट् +स्तोभ +हुम्मा +हूम् +अति +अधि +अनु +अप +अपि +अभि +अव +आ +उद् +उप +नि +निर् +परा +परि +प्र +प्रति +वि +सम् +अथवा उत +अन्यथा +इव +च +चेत् यदि +तु परन्तु +यतः करणेन हि यतस् यदर्थम् यदर्थे यर्हि यथा यत्कारणम् येन ही हिन +यथा यतस् +यद्यपि +यात् अवधेस् यावति +येन प्रकारेण +स्थाने +अह +एव +एवम् +कच्चित् +कु +कुवित् +कूपत् +च +चण् +चेत् +तत्र +नकिम् +नह +नुनम् +नेत् +भूयस् +मकिम् +मकिर् +यत्र +युगपत् +वा +शश्वत् +सूपत् +ह +हन्त +हि +""".split() +) diff --git a/spacy/lang/tokenizer_exceptions.py b/spacy/lang/tokenizer_exceptions.py index 2532ae104..960302513 100644 --- a/spacy/lang/tokenizer_exceptions.py +++ b/spacy/lang/tokenizer_exceptions.py @@ -34,13 +34,13 @@ URL_PATTERN 
= ( r"|" # host & domain names # mods: match is case-sensitive, so include [A-Z] - "(?:" # noqa: E131 - "(?:" - "[A-Za-z0-9\u00a1-\uffff]" - "[A-Za-z0-9\u00a1-\uffff_-]{0,62}" - ")?" - "[A-Za-z0-9\u00a1-\uffff]\." - ")+" + r"(?:" # noqa: E131 + r"(?:" + r"[A-Za-z0-9\u00a1-\uffff]" + r"[A-Za-z0-9\u00a1-\uffff_-]{0,62}" + r")?" + r"[A-Za-z0-9\u00a1-\uffff]\." + r")+" # TLD identifier # mods: use ALPHA_LOWER instead of a wider range so that this doesn't match # strings like "lower.Upper", which can be split on "." by infixes in some @@ -128,6 +128,8 @@ emoticons = set( :-] [: [-: +[= +=] :o) (o: :} @@ -159,6 +161,8 @@ emoticons = set( =| :| :-| +]= +=[ :1 :P :-P diff --git a/spacy/language.py b/spacy/language.py index 3b307e3f4..cd84e30a4 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -1,9 +1,8 @@ from typing import Optional, Any, Dict, Callable, Iterable, Union, List, Pattern -from typing import Tuple, Iterator, Optional +from typing import Tuple, Iterator from dataclasses import dataclass import random import itertools -import weakref import functools from contextlib import contextmanager from copy import deepcopy @@ -1378,8 +1377,6 @@ class Language: docs = (self.make_doc(text) for text in texts) for pipe in pipes: docs = pipe(docs) - - nr_seen = 0 for doc in docs: yield doc diff --git a/spacy/matcher/dependencymatcher.pyx b/spacy/matcher/dependencymatcher.pyx index e0a54e6f1..067b2167c 100644 --- a/spacy/matcher/dependencymatcher.pyx +++ b/spacy/matcher/dependencymatcher.pyx @@ -1,16 +1,16 @@ # cython: infer_types=True, profile=True -from cymem.cymem cimport Pool -from preshed.maps cimport PreshMap -from libcpp cimport bool +from typing import List import numpy +from cymem.cymem cimport Pool + from .matcher cimport Matcher from ..vocab cimport Vocab from ..tokens.doc cimport Doc -from .matcher import unpickle_matcher from ..errors import Errors +from ..tokens import Span DELIMITER = "||" @@ -22,36 +22,52 @@ cdef class DependencyMatcher: """Match dependency parse tree based on pattern rules.""" cdef Pool mem cdef readonly Vocab vocab - cdef readonly Matcher token_matcher + cdef readonly Matcher matcher cdef public object _patterns + cdef public object _raw_patterns cdef public object _keys_to_token cdef public object _root - cdef public object _entities cdef public object _callbacks cdef public object _nodes cdef public object _tree + cdef public object _ops - def __init__(self, vocab): + def __init__(self, vocab, *, validate=False): """Create the DependencyMatcher. vocab (Vocab): The vocabulary object, which must be shared with the documents the matcher will operate on. 
+ validate (bool): Whether patterns should be validated, passed to + Matcher as `validate` """ size = 20 - # TODO: make matcher work with validation - self.token_matcher = Matcher(vocab, validate=False) + self.matcher = Matcher(vocab, validate=validate) self._keys_to_token = {} self._patterns = {} + self._raw_patterns = {} self._root = {} self._nodes = {} self._tree = {} - self._entities = {} self._callbacks = {} self.vocab = vocab self.mem = Pool() + self._ops = { + "<": self.dep, + ">": self.gov, + "<<": self.dep_chain, + ">>": self.gov_chain, + ".": self.imm_precede, + ".*": self.precede, + ";": self.imm_follow, + ";*": self.follow, + "$+": self.imm_right_sib, + "$-": self.imm_left_sib, + "$++": self.right_sib, + "$--": self.left_sib, + } def __reduce__(self): - data = (self.vocab, self._patterns,self._tree, self._callbacks) + data = (self.vocab, self._raw_patterns, self._callbacks) return (unpickle_matcher, data, None, None) def __len__(self): @@ -74,54 +90,61 @@ cdef class DependencyMatcher: idx = 0 visited_nodes = {} for relation in pattern: - if "PATTERN" not in relation or "SPEC" not in relation: + if not isinstance(relation, dict): + raise ValueError(Errors.E1008) + if "RIGHT_ATTRS" not in relation and "RIGHT_ID" not in relation: raise ValueError(Errors.E098.format(key=key)) if idx == 0: if not( - "NODE_NAME" in relation["SPEC"] - and "NBOR_RELOP" not in relation["SPEC"] - and "NBOR_NAME" not in relation["SPEC"] + "RIGHT_ID" in relation + and "REL_OP" not in relation + and "LEFT_ID" not in relation ): raise ValueError(Errors.E099.format(key=key)) - visited_nodes[relation["SPEC"]["NODE_NAME"]] = True + visited_nodes[relation["RIGHT_ID"]] = True else: if not( - "NODE_NAME" in relation["SPEC"] - and "NBOR_RELOP" in relation["SPEC"] - and "NBOR_NAME" in relation["SPEC"] + "RIGHT_ID" in relation + and "RIGHT_ATTRS" in relation + and "REL_OP" in relation + and "LEFT_ID" in relation ): raise ValueError(Errors.E100.format(key=key)) if ( - relation["SPEC"]["NODE_NAME"] in visited_nodes - or relation["SPEC"]["NBOR_NAME"] not in visited_nodes + relation["RIGHT_ID"] in visited_nodes + or relation["LEFT_ID"] not in visited_nodes ): raise ValueError(Errors.E101.format(key=key)) - visited_nodes[relation["SPEC"]["NODE_NAME"]] = True - visited_nodes[relation["SPEC"]["NBOR_NAME"]] = True + if relation["REL_OP"] not in self._ops: + raise ValueError(Errors.E1007.format(op=relation["REL_OP"])) + visited_nodes[relation["RIGHT_ID"]] = True + visited_nodes[relation["LEFT_ID"]] = True idx = idx + 1 - def add(self, key, patterns, *_patterns, on_match=None): + def add(self, key, patterns, *, on_match=None): """Add a new matcher rule to the matcher. key (str): The match ID. patterns (list): The patterns to add for the given key. on_match (callable): Optional callback executed on match. 
""" - if patterns is None or hasattr(patterns, "__call__"): # old API - on_match = patterns - patterns = _patterns + if on_match is not None and not hasattr(on_match, "__call__"): + raise ValueError(Errors.E171.format(arg_type=type(on_match))) + if patterns is None or not isinstance(patterns, List): # old API + raise ValueError(Errors.E948.format(arg_type=type(patterns))) for pattern in patterns: if len(pattern) == 0: raise ValueError(Errors.E012.format(key=key)) - self.validate_input(pattern,key) + self.validate_input(pattern, key) key = self._normalize_key(key) + self._raw_patterns.setdefault(key, []) + self._raw_patterns[key].extend(patterns) _patterns = [] for pattern in patterns: token_patterns = [] for i in range(len(pattern)): - token_pattern = [pattern[i]["PATTERN"]] + token_pattern = [pattern[i]["RIGHT_ATTRS"]] token_patterns.append(token_pattern) - # self.patterns.append(token_patterns) _patterns.append(token_patterns) self._patterns.setdefault(key, []) self._callbacks[key] = on_match @@ -135,7 +158,7 @@ cdef class DependencyMatcher: # TODO: Better ways to hash edges in pattern? for j in range(len(_patterns[i])): k = self._normalize_key(unicode(key) + DELIMITER + unicode(i) + DELIMITER + unicode(j)) - self.token_matcher.add(k, [_patterns[i][j]]) + self.matcher.add(k, [_patterns[i][j]]) _keys_to_token[k] = j _keys_to_token_list.append(_keys_to_token) self._keys_to_token.setdefault(key, []) @@ -144,14 +167,14 @@ cdef class DependencyMatcher: for pattern in patterns: nodes = {} for i in range(len(pattern)): - nodes[pattern[i]["SPEC"]["NODE_NAME"]] = i + nodes[pattern[i]["RIGHT_ID"]] = i _nodes_list.append(nodes) self._nodes.setdefault(key, []) self._nodes[key].extend(_nodes_list) # Create an object tree to traverse later on. This data structure # enables easy tree pattern match. Doc-Token based tree cannot be # reused since it is memory-heavy and tightly coupled with the Doc. - self.retrieve_tree(patterns, _nodes_list,key) + self.retrieve_tree(patterns, _nodes_list, key) def retrieve_tree(self, patterns, _nodes_list, key): _heads_list = [] @@ -161,13 +184,13 @@ cdef class DependencyMatcher: root = -1 for j in range(len(patterns[i])): token_pattern = patterns[i][j] - if ("NBOR_RELOP" not in token_pattern["SPEC"]): + if ("REL_OP" not in token_pattern): heads[j] = ('root', j) root = j else: heads[j] = ( - token_pattern["SPEC"]["NBOR_RELOP"], - _nodes_list[i][token_pattern["SPEC"]["NBOR_NAME"]] + token_pattern["REL_OP"], + _nodes_list[i][token_pattern["LEFT_ID"]] ) _heads_list.append(heads) _root_list.append(root) @@ -202,11 +225,21 @@ cdef class DependencyMatcher: RETURNS (tuple): The rule, as an (on_match, patterns) tuple. """ key = self._normalize_key(key) - if key not in self._patterns: + if key not in self._raw_patterns: return default - return (self._callbacks[key], self._patterns[key]) + return (self._callbacks[key], self._raw_patterns[key]) - def __call__(self, Doc doc): + def remove(self, key): + key = self._normalize_key(key) + if not key in self._patterns: + raise ValueError(Errors.E175.format(key=key)) + self._patterns.pop(key) + self._raw_patterns.pop(key) + self._nodes.pop(key) + self._tree.pop(key) + self._root.pop(key) + + def __call__(self, object doclike): """Find all token sequences matching the supplied pattern. doclike (Doc or Span): The document to match over. @@ -214,8 +247,14 @@ cdef class DependencyMatcher: describing the matches. A match tuple describes a span `doc[start:end]`. The `label_id` and `key` are both integers. 
""" + if isinstance(doclike, Doc): + doc = doclike + elif isinstance(doclike, Span): + doc = doclike.as_doc() + else: + raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__)) matched_key_trees = [] - matches = self.token_matcher(doc) + matches = self.matcher(doc) for key in list(self._patterns.keys()): _patterns_list = self._patterns[key] _keys_to_token_list = self._keys_to_token[key] @@ -244,26 +283,26 @@ cdef class DependencyMatcher: length = len(_nodes) matched_trees = [] - self.recurse(_tree,id_to_position,_node_operator_map,0,[],matched_trees) - matched_key_trees.append((key,matched_trees)) - - for i, (ent_id, nodes) in enumerate(matched_key_trees): - on_match = self._callbacks.get(ent_id) + self.recurse(_tree, id_to_position, _node_operator_map, 0, [], matched_trees) + for matched_tree in matched_trees: + matched_key_trees.append((key, matched_tree)) + for i, (match_id, nodes) in enumerate(matched_key_trees): + on_match = self._callbacks.get(match_id) if on_match is not None: on_match(self, doc, i, matched_key_trees) return matched_key_trees - def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visited_nodes,matched_trees): - cdef bool isValid; - if(patternLength == len(id_to_position.keys())): + def recurse(self, tree, id_to_position, _node_operator_map, int patternLength, visited_nodes, matched_trees): + cdef bint isValid; + if patternLength == len(id_to_position.keys()): isValid = True for node in range(patternLength): - if(node in tree): + if node in tree: for idx, (relop,nbor) in enumerate(tree[node]): computed_nbors = numpy.asarray(_node_operator_map[visited_nodes[node]][relop]) isNbor = False for computed_nbor in computed_nbors: - if(computed_nbor.i == visited_nodes[nbor]): + if computed_nbor.i == visited_nodes[nbor]: isNbor = True isValid = isValid & isNbor if(isValid): @@ -271,14 +310,14 @@ cdef class DependencyMatcher: return allPatternNodes = numpy.asarray(id_to_position[patternLength]) for patternNode in allPatternNodes: - self.recurse(tree,id_to_position,_node_operator_map,patternLength+1,visited_nodes+[patternNode],matched_trees) + self.recurse(tree, id_to_position, _node_operator_map, patternLength+1, visited_nodes+[patternNode], matched_trees) # Given a node and an edge operator, to return the list of nodes # from the doc that belong to node+operator. This is used to store # all the results beforehand to prevent unnecessary computation while # pattern matching # _node_operator_map[node][operator] = [...] 
- def get_node_operator_map(self,doc,tree,id_to_position,nodes,root): + def get_node_operator_map(self, doc, tree, id_to_position, nodes, root): _node_operator_map = {} all_node_indices = nodes.values() all_operators = [] @@ -295,24 +334,14 @@ cdef class DependencyMatcher: _node_operator_map[node] = {} for operator in all_operators: _node_operator_map[node][operator] = [] - # Used to invoke methods for each operator - switcher = { - "<": self.dep, - ">": self.gov, - "<<": self.dep_chain, - ">>": self.gov_chain, - ".": self.imm_precede, - "$+": self.imm_right_sib, - "$-": self.imm_left_sib, - "$++": self.right_sib, - "$--": self.left_sib - } for operator in all_operators: for node in all_nodes: - _node_operator_map[node][operator] = switcher.get(operator)(doc,node) + _node_operator_map[node][operator] = self._ops.get(operator)(doc, node) return _node_operator_map def dep(self, doc, node): + if doc[node].head == doc[node]: + return [] return [doc[node].head] def gov(self,doc,node): @@ -322,36 +351,51 @@ cdef class DependencyMatcher: return list(doc[node].ancestors) def gov_chain(self, doc, node): - return list(doc[node].subtree) + return [t for t in doc[node].subtree if t != doc[node]] def imm_precede(self, doc, node): - if node > 0: + sent = self._get_sent(doc[node]) + if node < len(doc) - 1 and doc[node + 1] in sent: + return [doc[node + 1]] + return [] + + def precede(self, doc, node): + sent = self._get_sent(doc[node]) + return [doc[i] for i in range(node + 1, sent.end)] + + def imm_follow(self, doc, node): + sent = self._get_sent(doc[node]) + if node > 0 and doc[node - 1] in sent: return [doc[node - 1]] return [] + def follow(self, doc, node): + sent = self._get_sent(doc[node]) + return [doc[i] for i in range(sent.start, node)] + def imm_right_sib(self, doc, node): for child in list(doc[node].head.children): - if child.i == node - 1: + if child.i == node + 1: return [doc[child.i]] return [] def imm_left_sib(self, doc, node): for child in list(doc[node].head.children): - if child.i == node + 1: + if child.i == node - 1: return [doc[child.i]] return [] def right_sib(self, doc, node): candidate_children = [] for child in list(doc[node].head.children): - if child.i < node: + if child.i > node: candidate_children.append(doc[child.i]) return candidate_children def left_sib(self, doc, node): candidate_children = [] for child in list(doc[node].head.children): - if child.i > node: + if child.i < node: candidate_children.append(doc[child.i]) return candidate_children @@ -360,3 +404,15 @@ cdef class DependencyMatcher: return self.vocab.strings.add(key) else: return key + + def _get_sent(self, token): + root = (list(token.ancestors) or [token])[-1] + return token.doc[root.left_edge.i:root.right_edge.i + 1] + + +def unpickle_matcher(vocab, patterns, callbacks): + matcher = DependencyMatcher(vocab) + for key, pattern in patterns.items(): + callback = callbacks.get(key, None) + matcher.add(key, pattern, on_match=callback) + return matcher diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index a170c7a6b..079cac788 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -829,9 +829,11 @@ def _get_extra_predicates(spec, extra_predicates): attr = "ORTH" attr = IDS.get(attr.upper()) if isinstance(value, dict): + processed = False + value_with_upper_keys = {k.upper(): v for k, v in value.items()} for type_, cls in predicate_types.items(): - if type_ in value: - predicate = cls(len(extra_predicates), attr, value[type_], type_) + if type_ in value_with_upper_keys: + 
predicate = cls(len(extra_predicates), attr, value_with_upper_keys[type_], type_) # Don't create a redundant predicates. # This helps with efficiency, as we're caching the results. if predicate.key in seen_predicates: @@ -840,6 +842,9 @@ def _get_extra_predicates(spec, extra_predicates): extra_predicates.append(predicate) output.append(predicate.i) seen_predicates[predicate.key] = predicate.i + processed = True + if not processed: + warnings.warn(Warnings.W035.format(pattern=value)) return output diff --git a/spacy/pipeline/dep_parser.pyx b/spacy/pipeline/dep_parser.pyx index e001920a6..eee4ed535 100644 --- a/spacy/pipeline/dep_parser.pyx +++ b/spacy/pipeline/dep_parser.pyx @@ -156,7 +156,7 @@ cdef class DependencyParser(Parser): results = {} results.update(Scorer.score_spans(examples, "sents", **kwargs)) kwargs.setdefault("getter", dep_getter) - kwargs.setdefault("ignore_label", ("p", "punct")) + kwargs.setdefault("ignore_labels", ("p", "punct")) results.update(Scorer.score_deps(examples, "dep", **kwargs)) del results["sents_per_type"] return results diff --git a/spacy/pipeline/entityruler.py b/spacy/pipeline/entityruler.py index 9a87c8589..4f4ff230e 100644 --- a/spacy/pipeline/entityruler.py +++ b/spacy/pipeline/entityruler.py @@ -133,7 +133,7 @@ class EntityRuler: matches = set( [(m_id, start, end) for m_id, start, end in matches if start != end] ) - get_sort_key = lambda m: (m[2] - m[1], m[1]) + get_sort_key = lambda m: (m[2] - m[1], -m[1]) matches = sorted(matches, key=get_sort_key, reverse=True) entities = list(doc.ents) new_entities = [] diff --git a/spacy/schemas.py b/spacy/schemas.py index be8db6a99..59af53301 100644 --- a/spacy/schemas.py +++ b/spacy/schemas.py @@ -57,12 +57,13 @@ def validate_token_pattern(obj: list) -> List[str]: class TokenPatternString(BaseModel): - REGEX: Optional[StrictStr] - IN: Optional[List[StrictStr]] - NOT_IN: Optional[List[StrictStr]] + REGEX: Optional[StrictStr] = Field(None, alias="regex") + IN: Optional[List[StrictStr]] = Field(None, alias="in") + NOT_IN: Optional[List[StrictStr]] = Field(None, alias="not_in") class Config: extra = "forbid" + allow_population_by_field_name = True # allow alias and field name @validator("*", pre=True, each_item=True, allow_reuse=True) def raise_for_none(cls, v): @@ -72,9 +73,9 @@ class TokenPatternString(BaseModel): class TokenPatternNumber(BaseModel): - REGEX: Optional[StrictStr] = None - IN: Optional[List[StrictInt]] = None - NOT_IN: Optional[List[StrictInt]] = None + REGEX: Optional[StrictStr] = Field(None, alias="regex") + IN: Optional[List[StrictInt]] = Field(None, alias="in") + NOT_IN: Optional[List[StrictInt]] = Field(None, alias="not_in") EQ: Union[StrictInt, StrictFloat] = Field(None, alias="==") NEQ: Union[StrictInt, StrictFloat] = Field(None, alias="!=") GEQ: Union[StrictInt, StrictFloat] = Field(None, alias=">=") @@ -84,6 +85,7 @@ class TokenPatternNumber(BaseModel): class Config: extra = "forbid" + allow_population_by_field_name = True # allow alias and field name @validator("*", pre=True, each_item=True, allow_reuse=True) def raise_for_none(cls, v): diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index 1c0595672..e17199a08 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -44,6 +44,11 @@ def ca_tokenizer(): return get_lang_class("ca")().tokenizer +@pytest.fixture(scope="session") +def cs_tokenizer(): + return get_lang_class("cs")().tokenizer + + @pytest.fixture(scope="session") def da_tokenizer(): return get_lang_class("da")().tokenizer @@ -204,6 +209,11 @@ def 
ru_lemmatizer(): return get_lang_class("ru")().add_pipe("lemmatizer") +@pytest.fixture(scope="session") +def sa_tokenizer(): + return get_lang_class("sa")().tokenizer + + @pytest.fixture(scope="session") def sr_tokenizer(): return get_lang_class("sr")().tokenizer diff --git a/spacy/tests/doc/test_span.py b/spacy/tests/doc/test_span.py index 79e8f31c0..1e9623484 100644 --- a/spacy/tests/doc/test_span.py +++ b/spacy/tests/doc/test_span.py @@ -162,11 +162,36 @@ def test_spans_are_hashable(en_tokenizer): def test_spans_by_character(doc): span1 = doc[1:-2] + + # default and specified alignment mode "strict" span2 = doc.char_span(span1.start_char, span1.end_char, label="GPE") assert span1.start_char == span2.start_char assert span1.end_char == span2.end_char assert span2.label_ == "GPE" + span2 = doc.char_span( + span1.start_char, span1.end_char, label="GPE", alignment_mode="strict" + ) + assert span1.start_char == span2.start_char + assert span1.end_char == span2.end_char + assert span2.label_ == "GPE" + + # alignment mode "contract" + span2 = doc.char_span( + span1.start_char - 3, span1.end_char, label="GPE", alignment_mode="contract" + ) + assert span1.start_char == span2.start_char + assert span1.end_char == span2.end_char + assert span2.label_ == "GPE" + + # alignment mode "expand" + span2 = doc.char_span( + span1.start_char + 1, span1.end_char, label="GPE", alignment_mode="expand" + ) + assert span1.start_char == span2.start_char + assert span1.end_char == span2.end_char + assert span2.label_ == "GPE" + def test_span_to_array(doc): span = doc[1:-2] diff --git a/spacy/tests/lang/cs/__init__.py b/spacy/tests/lang/cs/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/lang/cs/test_text.py b/spacy/tests/lang/cs/test_text.py new file mode 100644 index 000000000..b834111b9 --- /dev/null +++ b/spacy/tests/lang/cs/test_text.py @@ -0,0 +1,23 @@ +import pytest + + +@pytest.mark.parametrize( + "text,match", + [ + ("10", True), + ("1", True), + ("10.000", True), + ("1000", True), + ("999,0", True), + ("devatenáct", True), + ("osmdesát", True), + ("kvadrilion", True), + ("Pes", False), + (",", False), + ("1/2", True), + ], +) +def test_lex_attrs_like_number(cs_tokenizer, text, match): + tokens = cs_tokenizer(text) + assert len(tokens) == 1 + assert tokens[0].like_num == match diff --git a/spacy/tests/lang/en/test_text.py b/spacy/tests/lang/en/test_text.py index 4d4d0a643..733e814f7 100644 --- a/spacy/tests/lang/en/test_text.py +++ b/spacy/tests/lang/en/test_text.py @@ -56,6 +56,11 @@ def test_lex_attrs_like_number(en_tokenizer, text, match): assert tokens[0].like_num == match +@pytest.mark.parametrize("word", ["third", "Millionth", "100th", "Hundredth"]) +def test_en_lex_attrs_like_number_for_ordinal(word): + assert like_num(word) + + @pytest.mark.parametrize("word", ["eleven"]) def test_en_lex_attrs_capitals(word): assert like_num(word) diff --git a/spacy/tests/lang/he/test_tokenizer.py b/spacy/tests/lang/he/test_tokenizer.py index 3131014a3..3716f7e3b 100644 --- a/spacy/tests/lang/he/test_tokenizer.py +++ b/spacy/tests/lang/he/test_tokenizer.py @@ -1,4 +1,5 @@ import pytest +from spacy.lang.he.lex_attrs import like_num @pytest.mark.parametrize( @@ -39,3 +40,30 @@ def test_he_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens): def test_he_tokenizer_handles_punct(he_tokenizer, text, expected_tokens): tokens = he_tokenizer(text) assert expected_tokens == [token.text for token in tokens] + + +@pytest.mark.parametrize( + "text,match", + [ + ("10", True), + 
("1", True), + ("10,000", True), + ("10,00", True), + ("999.0", True), + ("אחד", True), + ("שתיים", True), + ("מליון", True), + ("כלב", False), + (",", False), + ("1/2", True), + ], +) +def test_lex_attrs_like_number(he_tokenizer, text, match): + tokens = he_tokenizer(text) + assert len(tokens) == 1 + assert tokens[0].like_num == match + + +@pytest.mark.parametrize("word", ["שלישי", "מליון", "עשירי", "מאה", "עשר", "אחד עשר"]) +def test_he_lex_attrs_like_number_for_ordinal(word): + assert like_num(word) diff --git a/spacy/tests/lang/ne/test_text.py b/spacy/tests/lang/ne/test_text.py index 794f8fbdc..7dd971132 100644 --- a/spacy/tests/lang/ne/test_text.py +++ b/spacy/tests/lang/ne/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/sa/__init__.py b/spacy/tests/lang/sa/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/lang/sa/test_text.py b/spacy/tests/lang/sa/test_text.py new file mode 100644 index 000000000..41257a4d8 --- /dev/null +++ b/spacy/tests/lang/sa/test_text.py @@ -0,0 +1,42 @@ +import pytest + + +def test_sa_tokenizer_handles_long_text(sa_tokenizer): + text = """नानाविधानि दिव्यानि नानावर्णाकृतीनि च।।""" + tokens = sa_tokenizer(text) + assert len(tokens) == 6 + + +@pytest.mark.parametrize( + "text,length", + [ + ("श्री भगवानुवाच पश्य मे पार्थ रूपाणि शतशोऽथ सहस्रशः।", 9,), + ("गुणान् सर्वान् स्वभावो मूर्ध्नि वर्तते ।", 6), + ], +) +def test_sa_tokenizer_handles_cnts(sa_tokenizer, text, length): + tokens = sa_tokenizer(text) + assert len(tokens) == length + + +@pytest.mark.parametrize( + "text,match", + [ + ("10", True), + ("1", True), + ("10.000", True), + ("1000", True), + ("999,0", True), + ("एकः ", True), + ("दश", True), + ("पञ्चदश", True), + ("चत्वारिंशत् ", True), + ("कूपे", False), + (",", False), + ("1/2", True), + ], +) +def test_lex_attrs_like_number(sa_tokenizer, text, match): + tokens = sa_tokenizer(text) + assert len(tokens) == 1 + assert tokens[0].like_num == match diff --git a/spacy/tests/matcher/test_dependency_matcher.py b/spacy/tests/matcher/test_dependency_matcher.py new file mode 100644 index 000000000..72005cc82 --- /dev/null +++ b/spacy/tests/matcher/test_dependency_matcher.py @@ -0,0 +1,334 @@ +import pytest +import pickle +import re +import copy +from mock import Mock +from spacy.matcher import DependencyMatcher +from ..util import get_doc + + +@pytest.fixture +def doc(en_vocab): + text = "The quick brown fox jumped over the lazy fox" + heads = [3, 2, 1, 1, 0, -1, 2, 1, -3] + deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "pobj", "det", "amod"] + doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps) + return doc + + +@pytest.fixture +def patterns(en_vocab): + def is_brown_yellow(text): + return bool(re.compile(r"brown|yellow").match(text)) + + IS_BROWN_YELLOW = en_vocab.add_flag(is_brown_yellow) + + pattern1 = [ + {"RIGHT_ID": "fox", "RIGHT_ATTRS": {"ORTH": "fox"}}, + { + "LEFT_ID": "fox", + "REL_OP": ">", + "RIGHT_ID": "q", + "RIGHT_ATTRS": {"ORTH": "quick", "DEP": "amod"}, + }, + { + "LEFT_ID": "fox", + "REL_OP": ">", + "RIGHT_ID": "r", + "RIGHT_ATTRS": {IS_BROWN_YELLOW: True}, + }, + ] + + pattern2 = [ + {"RIGHT_ID": "jumped", "RIGHT_ATTRS": {"ORTH": "jumped"}}, + { + "LEFT_ID": "jumped", + "REL_OP": ">", + "RIGHT_ID": "fox1", + "RIGHT_ATTRS": {"ORTH": "fox"}, + }, + { + "LEFT_ID": "jumped", + "REL_OP": ".", + "RIGHT_ID": "over", + "RIGHT_ATTRS": {"ORTH": "over"}, + }, + ] + + pattern3 = [ + {"RIGHT_ID": "jumped", 
"RIGHT_ATTRS": {"ORTH": "jumped"}}, + { + "LEFT_ID": "jumped", + "REL_OP": ">", + "RIGHT_ID": "fox", + "RIGHT_ATTRS": {"ORTH": "fox"}, + }, + { + "LEFT_ID": "fox", + "REL_OP": ">>", + "RIGHT_ID": "r", + "RIGHT_ATTRS": {"ORTH": "brown"}, + }, + ] + + pattern4 = [ + {"RIGHT_ID": "jumped", "RIGHT_ATTRS": {"ORTH": "jumped"}}, + { + "LEFT_ID": "jumped", + "REL_OP": ">", + "RIGHT_ID": "fox", + "RIGHT_ATTRS": {"ORTH": "fox"}, + } + ] + + pattern5 = [ + {"RIGHT_ID": "jumped", "RIGHT_ATTRS": {"ORTH": "jumped"}}, + { + "LEFT_ID": "jumped", + "REL_OP": ">>", + "RIGHT_ID": "fox", + "RIGHT_ATTRS": {"ORTH": "fox"}, + }, + ] + + return [pattern1, pattern2, pattern3, pattern4, pattern5] + + +@pytest.fixture +def dependency_matcher(en_vocab, patterns, doc): + matcher = DependencyMatcher(en_vocab) + mock = Mock() + for i in range(1, len(patterns) + 1): + if i == 1: + matcher.add("pattern1", [patterns[0]], on_match=mock) + else: + matcher.add("pattern" + str(i), [patterns[i - 1]]) + + return matcher + + +def test_dependency_matcher(dependency_matcher, doc, patterns): + assert len(dependency_matcher) == 5 + assert "pattern3" in dependency_matcher + assert dependency_matcher.get("pattern3") == (None, [patterns[2]]) + matches = dependency_matcher(doc) + assert len(matches) == 6 + assert matches[0][1] == [3, 1, 2] + assert matches[1][1] == [4, 3, 5] + assert matches[2][1] == [4, 3, 2] + assert matches[3][1] == [4, 3] + assert matches[4][1] == [4, 3] + assert matches[5][1] == [4, 8] + + span = doc[0:6] + matches = dependency_matcher(span) + assert len(matches) == 5 + assert matches[0][1] == [3, 1, 2] + assert matches[1][1] == [4, 3, 5] + assert matches[2][1] == [4, 3, 2] + assert matches[3][1] == [4, 3] + assert matches[4][1] == [4, 3] + + +def test_dependency_matcher_pickle(en_vocab, patterns, doc): + matcher = DependencyMatcher(en_vocab) + for i in range(1, len(patterns) + 1): + matcher.add("pattern" + str(i), [patterns[i - 1]]) + + matches = matcher(doc) + assert matches[0][1] == [3, 1, 2] + assert matches[1][1] == [4, 3, 5] + assert matches[2][1] == [4, 3, 2] + assert matches[3][1] == [4, 3] + assert matches[4][1] == [4, 3] + assert matches[5][1] == [4, 8] + + b = pickle.dumps(matcher) + matcher_r = pickle.loads(b) + + assert len(matcher) == len(matcher_r) + matches = matcher_r(doc) + assert matches[0][1] == [3, 1, 2] + assert matches[1][1] == [4, 3, 5] + assert matches[2][1] == [4, 3, 2] + assert matches[3][1] == [4, 3] + assert matches[4][1] == [4, 3] + assert matches[5][1] == [4, 8] + + +def test_dependency_matcher_pattern_validation(en_vocab): + pattern = [ + {"RIGHT_ID": "fox", "RIGHT_ATTRS": {"ORTH": "fox"}}, + { + "LEFT_ID": "fox", + "REL_OP": ">", + "RIGHT_ID": "q", + "RIGHT_ATTRS": {"ORTH": "quick", "DEP": "amod"}, + }, + { + "LEFT_ID": "fox", + "REL_OP": ">", + "RIGHT_ID": "r", + "RIGHT_ATTRS": {"ORTH": "brown"}, + }, + ] + + matcher = DependencyMatcher(en_vocab) + # original pattern is valid + matcher.add("FOUNDED", [pattern]) + # individual pattern not wrapped in a list + with pytest.raises(ValueError): + matcher.add("FOUNDED", pattern) + # no anchor node + with pytest.raises(ValueError): + matcher.add("FOUNDED", [pattern[1:]]) + # required keys missing + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + del pattern2[0]["RIGHT_ID"] + matcher.add("FOUNDED", [pattern2]) + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + del pattern2[1]["RIGHT_ID"] + matcher.add("FOUNDED", [pattern2]) + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + del 
pattern2[1]["RIGHT_ATTRS"] + matcher.add("FOUNDED", [pattern2]) + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + del pattern2[1]["LEFT_ID"] + matcher.add("FOUNDED", [pattern2]) + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + del pattern2[1]["REL_OP"] + matcher.add("FOUNDED", [pattern2]) + # invalid operator + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + pattern2[1]["REL_OP"] = "!!!" + matcher.add("FOUNDED", [pattern2]) + # duplicate node name + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + pattern2[1]["RIGHT_ID"] = "fox" + matcher.add("FOUNDED", [pattern2]) + + +def test_dependency_matcher_callback(en_vocab, doc): + pattern = [ + {"RIGHT_ID": "quick", "RIGHT_ATTRS": {"ORTH": "quick"}}, + ] + + matcher = DependencyMatcher(en_vocab) + mock = Mock() + matcher.add("pattern", [pattern], on_match=mock) + matches = matcher(doc) + mock.assert_called_once_with(matcher, doc, 0, matches) + + # check that matches with and without callback are the same (#4590) + matcher2 = DependencyMatcher(en_vocab) + matcher2.add("pattern", [pattern]) + matches2 = matcher2(doc) + assert matches == matches2 + + +@pytest.mark.parametrize( + "op,num_matches", [(".", 8), (".*", 20), (";", 8), (";*", 20),] +) +def test_dependency_matcher_precedence_ops(en_vocab, op, num_matches): + # two sentences to test that all matches are within the same sentence + doc = get_doc( + en_vocab, + words=["a", "b", "c", "d", "e"] * 2, + heads=[0, -1, -2, -3, -4] * 2, + deps=["dep"] * 10, + ) + match_count = 0 + for text in ["a", "b", "c", "d", "e"]: + pattern = [ + {"RIGHT_ID": "1", "RIGHT_ATTRS": {"ORTH": text}}, + {"LEFT_ID": "1", "REL_OP": op, "RIGHT_ID": "2", "RIGHT_ATTRS": {},}, + ] + matcher = DependencyMatcher(en_vocab) + matcher.add("A", [pattern]) + matches = matcher(doc) + match_count += len(matches) + for match in matches: + match_id, token_ids = match + # token_ids[0] op token_ids[1] + if op == ".": + assert token_ids[0] == token_ids[1] - 1 + elif op == ";": + assert token_ids[0] == token_ids[1] + 1 + elif op == ".*": + assert token_ids[0] < token_ids[1] + elif op == ";*": + assert token_ids[0] > token_ids[1] + # all tokens are within the same sentence + assert doc[token_ids[0]].sent == doc[token_ids[1]].sent + assert match_count == num_matches + + +@pytest.mark.parametrize( + "left,right,op,num_matches", + [ + ("fox", "jumped", "<", 1), + ("the", "lazy", "<", 0), + ("jumped", "jumped", "<", 0), + ("fox", "jumped", ">", 0), + ("fox", "lazy", ">", 1), + ("lazy", "lazy", ">", 0), + ("fox", "jumped", "<<", 2), + ("jumped", "fox", "<<", 0), + ("the", "fox", "<<", 2), + ("fox", "jumped", ">>", 0), + ("over", "the", ">>", 1), + ("fox", "the", ">>", 2), + ("fox", "jumped", ".", 1), + ("lazy", "fox", ".", 1), + ("the", "fox", ".", 0), + ("the", "the", ".", 0), + ("fox", "jumped", ";", 0), + ("lazy", "fox", ";", 0), + ("the", "fox", ";", 0), + ("the", "the", ";", 0), + ("quick", "fox", ".*", 2), + ("the", "fox", ".*", 3), + ("the", "the", ".*", 1), + ("fox", "jumped", ";*", 1), + ("quick", "fox", ";*", 0), + ("the", "fox", ";*", 1), + ("the", "the", ";*", 1), + ("quick", "brown", "$+", 1), + ("brown", "quick", "$+", 0), + ("brown", "brown", "$+", 0), + ("quick", "brown", "$-", 0), + ("brown", "quick", "$-", 1), + ("brown", "brown", "$-", 0), + ("the", "brown", "$++", 1), + ("brown", "the", "$++", 0), + ("brown", "brown", "$++", 0), + ("the", "brown", "$--", 0), + ("brown", "the", "$--", 1), + ("brown", "brown", "$--", 0), + ], +) +def 
test_dependency_matcher_ops(en_vocab, doc, left, right, op, num_matches): + right_id = right + if left == right: + right_id = right + "2" + pattern = [ + {"RIGHT_ID": left, "RIGHT_ATTRS": {"LOWER": left}}, + { + "LEFT_ID": left, + "REL_OP": op, + "RIGHT_ID": right_id, + "RIGHT_ATTRS": {"LOWER": right}, + }, + ] + + matcher = DependencyMatcher(en_vocab) + matcher.add("pattern", [pattern]) + matches = matcher(doc) + assert len(matches) == num_matches diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index 8310c4466..e0f335a19 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -1,7 +1,6 @@ import pytest -import re from mock import Mock -from spacy.matcher import Matcher, DependencyMatcher +from spacy.matcher import Matcher from spacy.tokens import Doc, Token, Span from ..doc.test_underscore import clean_underscore # noqa: F401 @@ -292,84 +291,6 @@ def test_matcher_extension_set_membership(en_vocab): assert len(matches) == 0 -@pytest.fixture -def text(): - return "The quick brown fox jumped over the lazy fox" - - -@pytest.fixture -def heads(): - return [3, 2, 1, 1, 0, -1, 2, 1, -3] - - -@pytest.fixture -def deps(): - return ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"] - - -@pytest.fixture -def dependency_matcher(en_vocab): - def is_brown_yellow(text): - return bool(re.compile(r"brown|yellow|over").match(text)) - - IS_BROWN_YELLOW = en_vocab.add_flag(is_brown_yellow) - - pattern1 = [ - {"SPEC": {"NODE_NAME": "fox"}, "PATTERN": {"ORTH": "fox"}}, - { - "SPEC": {"NODE_NAME": "q", "NBOR_RELOP": ">", "NBOR_NAME": "fox"}, - "PATTERN": {"ORTH": "quick", "DEP": "amod"}, - }, - { - "SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">", "NBOR_NAME": "fox"}, - "PATTERN": {IS_BROWN_YELLOW: True}, - }, - ] - - pattern2 = [ - {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}}, - { - "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, - "PATTERN": {"ORTH": "fox"}, - }, - { - "SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, - "PATTERN": {"ORTH": "fox"}, - }, - ] - - pattern3 = [ - {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}}, - { - "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, - "PATTERN": {"ORTH": "fox"}, - }, - { - "SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">>", "NBOR_NAME": "fox"}, - "PATTERN": {"ORTH": "brown"}, - }, - ] - - matcher = DependencyMatcher(en_vocab) - matcher.add("pattern1", [pattern1]) - matcher.add("pattern2", [pattern2]) - matcher.add("pattern3", [pattern3]) - - return matcher - - -def test_dependency_matcher_compile(dependency_matcher): - assert len(dependency_matcher) == 3 - - -# def test_dependency_matcher(dependency_matcher, text, heads, deps): -# doc = get_doc(dependency_matcher.vocab, text.split(), heads=heads, deps=deps) -# matches = dependency_matcher(doc) -# assert matches[0][1] == [[3, 1, 2]] -# assert matches[1][1] == [[4, 3, 3]] -# assert matches[2][1] == [[4, 3, 2]] - - def test_matcher_basic_check(en_vocab): matcher = Matcher(en_vocab) # Potential mistake: pass in pattern instead of list of patterns diff --git a/spacy/tests/matcher/test_pattern_validation.py b/spacy/tests/matcher/test_pattern_validation.py index 5dea3dde2..4d21aea81 100644 --- a/spacy/tests/matcher/test_pattern_validation.py +++ b/spacy/tests/matcher/test_pattern_validation.py @@ -59,3 +59,12 @@ def test_minimal_pattern_validation(en_vocab, pattern, n_errors, n_min_errors): matcher.add("TEST", [pattern]) 
elif n_errors == 0: matcher.add("TEST", [pattern]) + + +def test_pattern_errors(en_vocab): + matcher = Matcher(en_vocab) + # normalize "regex" to upper like "text" + matcher.add("TEST1", [[{"text": {"regex": "regex"}}]]) + # error if subpattern attribute isn't recognized and processed + with pytest.raises(MatchPatternError): + matcher.add("TEST2", [[{"TEXT": {"XX": "xx"}}]]) diff --git a/spacy/tests/pipeline/test_entity_ruler.py b/spacy/tests/pipeline/test_entity_ruler.py index e4e1631b1..d70d0326e 100644 --- a/spacy/tests/pipeline/test_entity_ruler.py +++ b/spacy/tests/pipeline/test_entity_ruler.py @@ -150,3 +150,15 @@ def test_entity_ruler_properties(nlp, patterns): ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True) assert sorted(ruler.labels) == sorted(["HELLO", "BYE", "COMPLEX", "TECH_ORG"]) assert sorted(ruler.ent_ids) == ["a1", "a2"] + + +def test_entity_ruler_overlapping_spans(nlp): + ruler = EntityRuler(nlp) + patterns = [ + {"label": "FOOBAR", "pattern": "foo bar"}, + {"label": "BARBAZ", "pattern": "bar baz"}, + ] + ruler.add_patterns(patterns) + doc = ruler(nlp.make_doc("foo bar baz")) + assert len(doc.ents) == 1 + assert doc.ents[0].label_ == "FOOBAR" diff --git a/spacy/tests/pipeline/test_tagger.py b/spacy/tests/pipeline/test_tagger.py index a1aa7e1e1..540301eac 100644 --- a/spacy/tests/pipeline/test_tagger.py +++ b/spacy/tests/pipeline/test_tagger.py @@ -71,6 +71,6 @@ def test_overfitting_IO(): def test_tagger_requires_labels(): nlp = English() - tagger = nlp.add_pipe("tagger") + nlp.add_pipe("tagger") with pytest.raises(ValueError): - optimizer = nlp.begin_training() + nlp.begin_training() diff --git a/spacy/tests/regression/test_issue4501-5000.py b/spacy/tests/regression/test_issue4501-5000.py index 39533f70a..d83a2c718 100644 --- a/spacy/tests/regression/test_issue4501-5000.py +++ b/spacy/tests/regression/test_issue4501-5000.py @@ -38,32 +38,6 @@ def test_gold_misaligned(en_tokenizer, text, words): Example.from_dict(doc, {"words": words}) -def test_issue4590(en_vocab): - """Test that matches param in on_match method are the same as matches run with no on_match method""" - pattern = [ - {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}}, - { - "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, - "PATTERN": {"ORTH": "fox"}, - }, - { - "SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, - "PATTERN": {"ORTH": "fox"}, - }, - ] - - on_match = Mock() - matcher = DependencyMatcher(en_vocab) - matcher.add("pattern", on_match, pattern) - text = "The quick brown fox jumped over the lazy fox" - heads = [3, 2, 1, 1, 0, -1, 2, 1, -3] - deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"] - doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps) - matches = matcher(doc) - on_match_args = on_match.call_args - assert on_match_args[0][3] == matches - - def test_issue4651_with_phrase_matcher_attr(): """Test that the EntityRuler PhraseMatcher is deserialized correctly using the method from_disk when the EntityRuler argument phrase_matcher_attr is diff --git a/spacy/tests/regression/test_issue5838.py b/spacy/tests/regression/test_issue5838.py new file mode 100644 index 000000000..4e4d98beb --- /dev/null +++ b/spacy/tests/regression/test_issue5838.py @@ -0,0 +1,23 @@ +from spacy.lang.en import English +from spacy.tokens import Span +from spacy import displacy + + +SAMPLE_TEXT = """First line +Second line, with ent +Third line +Fourth line +""" + + +def test_issue5838(): + # Displacy's 
EntityRenderer line breaks were + # not working after the last entity + + nlp = English() + doc = nlp(SAMPLE_TEXT) + doc.ents = [Span(doc, 7, 8, label="test")] + + html = displacy.render(doc, style="ent") + found = html.count("</br>
") + assert found == 4 diff --git a/spacy/tests/regression/test_issue5918.py b/spacy/tests/regression/test_issue5918.py new file mode 100644 index 000000000..66280f012 --- /dev/null +++ b/spacy/tests/regression/test_issue5918.py @@ -0,0 +1,27 @@ +from spacy.lang.en import English +from spacy.pipeline import merge_entities + + +def test_issue5918(): + # Test edge case when merging entities. + nlp = English() + ruler = nlp.add_pipe("entity_ruler") + patterns = [ + {"label": "ORG", "pattern": "Digicon Inc"}, + {"label": "ORG", "pattern": "Rotan Mosle Inc's"}, + {"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"}, + ] + ruler.add_patterns(patterns) + + text = """ + Digicon Inc said it has completed the previously-announced disposition + of its computer systems division to an investment group led by + Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate. + """ + doc = nlp(text) + assert len(doc.ents) == 3 + # make it so that the third span's head is within the entity (ent_iob=I) + # bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents. + doc[29].head = doc[33] + doc = merge_entities(doc) + assert len(doc.ents) == 3 diff --git a/spacy/tests/test_tok2vec.py b/spacy/tests/test_tok2vec.py index 1068b662d..9f0f4b74a 100644 --- a/spacy/tests/test_tok2vec.py +++ b/spacy/tests/test_tok2vec.py @@ -135,6 +135,7 @@ TRAIN_DATA = [ ("Eat blue ham", {"tags": ["V", "J", "N"]}), ] + def test_tok2vec_listener(): orig_config = Config().from_str(cfg_string) nlp, config = util.load_model_from_config(orig_config, auto_fill=True, validate=True) diff --git a/spacy/tests/tokenizer/test_naughty_strings.py b/spacy/tests/tokenizer/test_naughty_strings.py index e93d5654f..b22dabb9d 100644 --- a/spacy/tests/tokenizer/test_naughty_strings.py +++ b/spacy/tests/tokenizer/test_naughty_strings.py @@ -29,6 +29,7 @@ NAUGHTY_STRINGS = [ r"₀₁₂", r"⁰⁴⁵₀₁₂", r"ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็", + r" ̄ ̄", # Two-Byte Characters r"田中さんにあげて下さい", r"パーティーへ行かないか", diff --git a/spacy/tests/tokenizer/test_whitespace.py b/spacy/tests/tokenizer/test_whitespace.py index c7b9d7c6d..d68bb9e4e 100644 --- a/spacy/tests/tokenizer/test_whitespace.py +++ b/spacy/tests/tokenizer/test_whitespace.py @@ -15,7 +15,7 @@ def test_tokenizer_splits_double_space(tokenizer, text): @pytest.mark.parametrize("text", ["lorem ipsum "]) -def test_tokenizer_handles_double_trainling_ws(tokenizer, text): +def test_tokenizer_handles_double_trailing_ws(tokenizer, text): tokens = tokenizer(text) assert repr(tokens.text_with_ws) == repr(text) diff --git a/spacy/tokens/_retokenize.pyx b/spacy/tokens/_retokenize.pyx index c5fac2299..9323bb579 100644 --- a/spacy/tokens/_retokenize.pyx +++ b/spacy/tokens/_retokenize.pyx @@ -169,6 +169,8 @@ def _merge(Doc doc, merges): spans.append(span) # House the new merged token where it starts token = &doc.c[start] + start_ent_iob = doc.c[start].ent_iob + start_ent_type = doc.c[start].ent_type # Initially set attributes to attributes of span root token.tag = doc.c[span.root.i].tag token.pos = doc.c[span.root.i].pos @@ -181,8 +183,8 @@ def _merge(Doc doc, merges): merged_iob = 3 # If start token is I-ENT and previous token is of the same # type, then I-ENT (could check I-ENT from start to span root) - if doc.c[start].ent_iob == 1 and start 
> 0 \ - and doc.c[start].ent_type == token.ent_type \ + if start_ent_iob == 1 and start > 0 \ + and start_ent_type == token.ent_type \ and doc.c[start - 1].ent_type == token.ent_type: merged_iob = 1 token.ent_iob = merged_iob diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index 29bdf85ab..3f8c735fb 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -336,17 +336,25 @@ cdef class Doc: def doc(self): return self - def char_span(self, int start_idx, int end_idx, label=0, kb_id=0, vector=None): - """Create a `Span` object from the slice `doc.text[start : end]`. + def char_span(self, int start_idx, int end_idx, label=0, kb_id=0, vector=None, alignment_mode="strict"): + """Create a `Span` object from the slice + `doc.text[start_idx : end_idx]`. Returns None if no valid `Span` can be + created. doc (Doc): The parent document. - start (int): The index of the first character of the span. - end (int): The index of the first character after the span. + start_idx (int): The index of the first character of the span. + end_idx (int): The index of the first character after the span. label (uint64 or string): A label to attach to the Span, e.g. for named entities. - kb_id (uint64 or string): An ID from a KB to capture the meaning of a named entity. + kb_id (uint64 or string): An ID from a KB to capture the meaning of a + named entity. vector (ndarray[ndim=1, dtype='float32']): A meaning representation of the span. + alignment_mode (str): How character indices are aligned to token + boundaries. Options: "strict" (character indices must be aligned + with token boundaries), "contract" (span of all tokens completely + within the character span), "expand" (span of all tokens at least + partially covered by the character span). Defaults to "strict". RETURNS (Span): The newly constructed object. 
DOCS: https://nightly.spacy.io/api/doc#char_span @@ -355,12 +363,29 @@ cdef class Doc: label = self.vocab.strings.add(label) if not isinstance(kb_id, int): kb_id = self.vocab.strings.add(kb_id) - cdef int start = token_by_start(self.c, self.length, start_idx) - if start == -1: + if alignment_mode not in ("strict", "contract", "expand"): + alignment_mode = "strict" + cdef int start = token_by_char(self.c, self.length, start_idx) + if start < 0 or (alignment_mode == "strict" and start_idx != self[start].idx): return None - cdef int end = token_by_end(self.c, self.length, end_idx) - if end == -1: + # end_idx is exclusive, so find the token at one char before + cdef int end = token_by_char(self.c, self.length, end_idx - 1) + if end < 0 or (alignment_mode == "strict" and end_idx != self[end].idx + len(self[end])): return None + # Adjust start and end by alignment_mode + if alignment_mode == "contract": + if self[start].idx < start_idx: + start += 1 + if end_idx < self[end].idx + len(self[end]): + end -= 1 + # if no tokens are completely within the span, return None + if end < start: + return None + elif alignment_mode == "expand": + # Don't consider the trailing whitespace to be part of the previous + # token + if start_idx == self[start].idx + len(self[start]): + start += 1 # Currently we have the token index, we want the range-end index end += 1 cdef Span span = Span(self, start, end, label=label, kb_id=kb_id, vector=vector) @@ -1268,23 +1293,35 @@ cdef class Doc: cdef int token_by_start(const TokenC* tokens, int length, int start_char) except -2: - cdef int i - for i in range(length): - if tokens[i].idx == start_char: - return i + cdef int i = token_by_char(tokens, length, start_char) + if i >= 0 and tokens[i].idx == start_char: + return i else: return -1 cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2: - cdef int i - for i in range(length): - if tokens[i].idx + tokens[i].lex.length == end_char: - return i + # end_char is exclusive, so find the token at one char before + cdef int i = token_by_char(tokens, length, end_char - 1) + if i >= 0 and tokens[i].idx + tokens[i].lex.length == end_char: + return i else: return -1 +cdef int token_by_char(const TokenC* tokens, int length, int char_idx) except -2: + cdef int start = 0, mid, end = length - 1 + while start <= end: + mid = (start + end) / 2 + if char_idx < tokens[mid].idx: + end = mid - 1 + elif char_idx >= tokens[mid].idx + tokens[mid].lex.length + tokens[mid].spacy: + start = mid + 1 + else: + return mid + return -1 + + cdef int set_children_from_heads(TokenC* tokens, int length) except -1: cdef TokenC* head cdef TokenC* child diff --git a/website/docs/api/dependencymatcher.md b/website/docs/api/dependencymatcher.md index b0395cc42..c90a715d9 100644 --- a/website/docs/api/dependencymatcher.md +++ b/website/docs/api/dependencymatcher.md @@ -1,65 +1,91 @@ --- title: DependencyMatcher -teaser: Match sequences of tokens, based on the dependency parse +teaser: Match subtrees within a dependency parse tag: class +new: 3 source: spacy/matcher/dependencymatcher.pyx --- The `DependencyMatcher` follows the same API as the [`Matcher`](/api/matcher) and [`PhraseMatcher`](/api/phrasematcher) and lets you match on dependency trees -using the -[Semgrex syntax](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). -It requires a trained [`DependencyParser`](/api/parser) or other component that -sets the `Token.dep` attribute. 
+using +[Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). +It requires a pretrained [`DependencyParser`](/api/parser) or other component +that sets the `Token.dep` and `Token.head` attributes. See the +[usage guide](/usage/rule-based-matching#dependencymatcher) for examples. ## Pattern format {#patterns} -> ```json +> ```python > ### Example +> # pattern: "[subject] ... initially founded" > [ +> # anchor token: founded > { -> "SPEC": {"NODE_NAME": "founded"}, -> "PATTERN": {"ORTH": "founded"} +> "RIGHT_ID": "founded", +> "RIGHT_ATTRS": {"ORTH": "founded"} > }, +> # founded -> subject > { -> "SPEC": { -> "NODE_NAME": "founder", -> "NBOR_RELOP": ">", -> "NBOR_NAME": "founded" -> }, -> "PATTERN": {"DEP": "nsubj"} +> "LEFT_ID": "founded", +> "REL_OP": ">", +> "RIGHT_ID": "subject", +> "RIGHT_ATTRS": {"DEP": "nsubj"} > }, +> # "founded" follows "initially" > { -> "SPEC": { -> "NODE_NAME": "object", -> "NBOR_RELOP": ">", -> "NBOR_NAME": "founded" -> }, -> "PATTERN": {"DEP": "dobj"} +> "LEFT_ID": "founded", +> "REL_OP": ";", +> "RIGHT_ID": "initially", +> "RIGHT_ATTRS": {"ORTH": "initially"} > } > ] > ``` A pattern added to the `DependencyMatcher` consists of a list of dictionaries, -with each dictionary describing a node to match. Each pattern should have the -following top-level keys: +with each dictionary describing a token to match. Except for the first +dictionary, which defines an anchor token using only `RIGHT_ID` and +`RIGHT_ATTRS`, each pattern should have the following keys: -| Name | Description | -| --------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | -| `PATTERN` | The token attributes to match in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ | -| `SPEC` | The relationships of the nodes in the subtree that should be matched. ~~Dict[str, str]~~ | +| Name | Description | +| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ | +| `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ | +| `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ | +| `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ | -The `SPEC` includes the following fields: + -| Name | Description | -| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `NODE_NAME` | A unique name for this node to refer to it in other specs. ~~str~~ | -| `NBOR_RELOP` | A [Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html) operator that describes how the two nodes are related. ~~str~~ | -| `NBOR_NAME` | The unique name of the node that this node is connected to. ~~str~~ | +For examples of how to construct dependency matcher patterns for different types +of relations, see the usage guide on +[dependency matching](/usage/rule-based-matching#dependencymatcher). 
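To see how these keys fit together in code, here is a minimal end-to-end sketch; it assumes the `en_core_web_sm` pipeline is installed, and the token names `founded` and `subject` are just illustrative labels:

```python
import spacy
from spacy.matcher import DependencyMatcher

# Assumption: any pipeline that sets Token.dep and Token.head works here
nlp = spacy.load("en_core_web_sm")
pattern = [
    # The anchor token is defined first, with RIGHT_ID and RIGHT_ATTRS only
    {"RIGHT_ID": "founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    # Each later dict links a new token (RIGHT_ID) to an existing one (LEFT_ID)
    {
        "LEFT_ID": "founded",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("FOUNDED", [pattern])
doc = nlp("Smith founded a healthcare company.")
for match_id, token_ids in matcher(doc):
    # token_ids are token indices, in the same order as the dicts in the pattern
    for i, token_id in enumerate(token_ids):
        print(pattern[i]["RIGHT_ID"], "=>", doc[token_id].text)
```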
+ + + +### Operators + +The following operators are supported by the `DependencyMatcher`, most of which +come directly from +[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html): + +| Symbol | Description | +| --------- | -------------------------------------------------------------------------------------------------------------------- | +| `A < B` | `A` is the immediate dependent of `B`. | +| `A > B` | `A` is the immediate head of `B`. | +| `A << B` | `A` is the dependent in a chain to `B` following dep → head paths. | +| `A >> B` | `A` is the head in a chain to `B` following head → dep paths. | +| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree. | +| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. | +| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. | +| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. | +| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`. | ## DependencyMatcher.\_\_init\_\_ {#init tag="method"} -Create a rule-based `DependencyMatcher`. +Create a `DependencyMatcher`. > #### Example > @@ -68,13 +94,15 @@ Create a rule-based `DependencyMatcher`. > matcher = DependencyMatcher(nlp.vocab) > ``` -| Name | Description | -| ------- | ----------------------------------------------------------------------------------------------------- | -| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ | +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------------- | +| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ | +| _keyword-only_ | | +| `validate` | Validate all patterns added to this matcher. ~~bool~~ | ## DependencyMatcher.\_\_call\_\_ {#call tag="method"} -Find all token sequences matching the supplied patterns on the `Doc` or `Span`. +Find all tokens matching the supplied patterns on the `Doc` or `Span`. > #### Example > > ```python @@ -82,36 +110,32 @@
> from spacy.matcher import DependencyMatcher > > matcher = DependencyMatcher(nlp.vocab) -> pattern = [ -> {"SPEC": {"NODE_NAME": "founded"}, "PATTERN": {"ORTH": "founded"}}, -> {"SPEC": {"NODE_NAME": "founder", "NBOR_RELOP": ">", "NBOR_NAME": "founded"}, "PATTERN": {"DEP": "nsubj"}}, -> ] -> matcher.add("Founder", [pattern]) +> pattern = [{"RIGHT_ID": "founded_id", +> "RIGHT_ATTRS": {"ORTH": "founded"}}] +> matcher.add("FOUNDED", [pattern]) > doc = nlp("Bill Gates founded Microsoft.") > matches = matcher(doc) > ``` -| Name | Description | -| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | -| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. ~~List[Tuple[int, int, int]]~~ | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | +| **RETURNS** | A list of `(match_id, token_ids)` tuples, describing the matches. The `match_id` is the ID of the match pattern and `token_ids` is a list of token indices matched by the pattern, where the position of each token in the list corresponds to the position of the node specification in the pattern. ~~List[Tuple[int, List[int]]]~~ | ## DependencyMatcher.\_\_len\_\_ {#len tag="method"} -Get the number of rules (edges) added to the dependency matcher. Note that this -only returns the number of rules (identical with the number of IDs), not the -number of individual patterns. +Get the number of rules added to the dependency matcher. Note that this only +returns the number of rules (identical with the number of IDs), not the number +of individual patterns. > #### Example > > ```python > matcher = DependencyMatcher(nlp.vocab) > assert len(matcher) == 0 -> pattern = [ -> {"SPEC": {"NODE_NAME": "founded"}, "PATTERN": {"ORTH": "founded"}}, -> {"SPEC": {"NODE_NAME": "START_ENTITY", "NBOR_RELOP": ">", "NBOR_NAME": "founded"}, "PATTERN": {"DEP": "nsubj"}}, -> ] -> matcher.add("Rule", [pattern]) +> pattern = [{"RIGHT_ID": "founded_id", +> "RIGHT_ATTRS": {"ORTH": "founded"}}] +> matcher.add("FOUNDED", [pattern]) > assert len(matcher) == 1 > ``` @@ -126,10 +150,10 @@ Check whether the matcher contains rules for a match ID. > #### Example > > ```python -> matcher = Matcher(nlp.vocab) -> assert "Rule" not in matcher -> matcher.add("Rule", [pattern]) -> assert "Rule" in matcher +> matcher = DependencyMatcher(nlp.vocab) +> assert "FOUNDED" not in matcher +> matcher.add("FOUNDED", [pattern]) +> assert "FOUNDED" in matcher > ``` | Name | Description | @@ -152,33 +176,15 @@ will be overwritten. 
> print('Matched!', matches) > > matcher = DependencyMatcher(nlp.vocab) -> matcher.add("TEST_PATTERNS", patterns) +> matcher.add("FOUNDED", patterns, on_match=on_match) > ``` -| Name | Description | -| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `match_id` | An ID for the thing you're matching. ~~str~~ | -| `patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a `"PATTERN"` and `"SPEC"`. ~~List[List[Dict[str, dict]]]~~ | -| _keyword-only_ | | | -| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ | - -## DependencyMatcher.remove {#remove tag="method"} - -Remove a rule from the matcher. A `KeyError` is raised if the match ID does not -exist. - -> #### Example -> -> ```python -> matcher.add("Rule", [pattern]]) -> assert "Rule" in matcher -> matcher.remove("Rule") -> assert "Rule" not in matcher -> ``` - -| Name | Description | -| ----- | --------------------------------- | -| `key` | The ID of the match rule. ~~str~~ | +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `match_id` | An ID for the patterns. ~~str~~ | +| `patterns` | A list of match patterns. A pattern consists of a list of dicts, where each dict describes a token in the tree. ~~List[List[Dict[str, Union[str, Dict]]]]~~ | +| _keyword-only_ | | | +| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[DependencyMatcher, Doc, int, List[Tuple]], Any]]~~ | ## DependencyMatcher.get {#get tag="method"} @@ -188,11 +194,29 @@ Retrieve the pattern stored for a key. Returns the rule as an > #### Example > > ```python -> matcher.add("Rule", [pattern], on_match=on_match) -> on_match, patterns = matcher.get("Rule") +> matcher.add("FOUNDED", patterns, on_match=on_match) +> on_match, patterns = matcher.get("FOUNDED") > ``` -| Name | Description | -| ----------- | --------------------------------------------------------------------------------------------- | -| `key` | The ID of the match rule. ~~str~~ | -| **RETURNS** | The rule, as an `(on_match, patterns)` tuple. ~~Tuple[Optional[Callable], List[List[dict]]]~~ | +| Name | Description | +| ----------- | ----------------------------------------------------------------------------------------------------------- | +| `key` | The ID of the match rule. ~~str~~ | +| **RETURNS** | The rule, as an `(on_match, patterns)` tuple. ~~Tuple[Optional[Callable], List[List[Union[Dict, Tuple]]]]~~ | + +## DependencyMatcher.remove {#remove tag="method"} + +Remove a rule from the dependency matcher. A `KeyError` is raised if the match +ID does not exist. + +> #### Example +> +> ```python +> matcher.add("FOUNDED", patterns) +> assert "FOUNDED" in matcher +> matcher.remove("FOUNDED") +> assert "FOUNDED" not in matcher +> ``` + +| Name | Description | +| ----- | --------------------------------- | +| `key` | The ID of the match rule. ~~str~~ | diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md index 3c4825f0d..88dc62c2a 100644 --- a/website/docs/api/doc.md +++ b/website/docs/api/doc.md @@ -186,8 +186,9 @@ Remove a previously registered extension.
## Doc.char_span {#char_span tag="method" new="2"} -Create a `Span` object from the slice `doc.text[start:end]`. Returns `None` if -the character indices don't map to a valid span. +Create a `Span` object from the slice `doc.text[start_idx:end_idx]`. Returns +`None` if the character indices don't map to a valid span using the default +alignment mode `"strict"`. > #### Example > @@ -197,14 +198,15 @@ the character indices don't map to a valid span. > assert span.text == "New York" > ``` -| Name | Description | -| ------------------------------------ | ----------------------------------------------------------------------------------------- | -| `start` | The index of the first character of the span. ~~int~~ | -| `end` | The index of the last character after the span. ~int~~ | -| `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ | -| `kb_id` 2.2 | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ | -| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | -| **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ | +| Name | Description | +| ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `start_idx` | The index of the first character of the span. ~~int~~ | +| `end_idx` | The index of the first character after the span. ~~int~~ | +| `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ | +| `kb_id` 2.2 | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ | +| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | +| `alignment_mode` | How character indices are aligned to token boundaries. Options: `"strict"` (character indices must be aligned with token boundaries), `"contract"` (span of all tokens completely within the character span), `"expand"` (span of all tokens at least partially covered by the character span). Defaults to `"strict"`. ~~str~~ | +| **RETURNS** | The newly constructed object or `None`.
~~Optional[Span]~~ | ## Doc.similarity {#similarity tag="method" model="vectors"} diff --git a/website/docs/images/dep-match-diagram.svg b/website/docs/images/dep-match-diagram.svg new file mode 100644 index 000000000..676be4137 --- /dev/null +++ b/website/docs/images/dep-match-diagram.svg @@ -0,0 +1,39 @@ + [SVG diagram: structure of a dependency matcher pattern, with named tokens linked by relation operators] diff --git a/website/docs/images/displacy-dep-founded.html b/website/docs/images/displacy-dep-founded.html new file mode 100644 index 000000000..e22984ee1 --- /dev/null +++ b/website/docs/images/displacy-dep-founded.html @@ -0,0 +1,58 @@ + [displaCy rendering of "Smith founded a healthcare company" with the arcs nsubj, det, compound and dobj] diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index ff08d547c..b36e9b71f 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -1021,7 +1021,7 @@ expressions – for example, [`compile_suffix_regex`](/api/top-level#util.compile_suffix_regex): ```python -suffixes = nlp.Defaults.suffixes + (r'''-+$''',) +suffixes = nlp.Defaults.suffixes + [r'''-+$''',] suffix_regex = spacy.util.compile_suffix_regex(suffixes) nlp.tokenizer.suffix_search = suffix_regex.search ``` diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md index 2885d9f50..0da350f27 100644 --- a/website/docs/usage/processing-pipelines.md +++ b/website/docs/usage/processing-pipelines.md @@ -330,7 +330,7 @@ custom component entirely (more details on this in the section on ```python nlp.remove_pipe("parser") nlp.rename_pipe("ner", "entityrecognizer") -nlp.replace_pipe("tagger", my_custom_tagger) +nlp.replace_pipe("tagger", "my_custom_tagger") ``` The `Language` object exposes different [attributes](/api/language#attributes) diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index fb54c9936..01d60ddb8 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -4,6 +4,7 @@ teaser: Find phrases and tokens, and match entities menu: - ['Token Matcher', 'matcher'] - ['Phrase Matcher', 'phrasematcher'] + - ['Dependency Matcher', 'dependencymatcher'] - ['Entity Ruler', 'entityruler'] - ['Models & Rules', 'models-rules'] --- @@ -939,10 +940,10 @@ object patterns as efficiently as possible and without running any of the other pipeline components. If the token attributes you want to match on are set by a pipeline component, **make sure that the pipeline component runs** when you create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc` -objects need to have part-of-speech tags set by the `tagger`. You can either -call the `nlp` object on your pattern texts instead of `nlp.make_doc`, or use -[`nlp.select_pipes`](/api/language#select_pipes) to disable components -selectively. +objects need to have part-of-speech tags set by the `tagger` or `morphologizer`. +You can either call the `nlp` object on your pattern texts instead of +`nlp.make_doc`, or use [`nlp.select_pipes`](/api/language#select_pipes) to +disable components selectively. @@ -973,10 +974,287 @@ to match phrases with the same sequence of punctuation and non-punctuation tokens as the pattern. But this can easily get confusing and doesn't have much of an advantage over writing one or two token patterns.
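To make that comparison concrete, the sketch below writes the same kind of shape-based IP match as plain token patterns; the shape strings and the example sentence are assumptions, and the variants you list should mirror the tokens you actually expect:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  # shape matching needs no statistical components
matcher = Matcher(nlp.vocab)
# One pattern per expected shape variant, e.g. "192.168.1.1" has shape "ddd.ddd.d.d"
matcher.add("IP", [[{"SHAPE": "ddd.ddd.d.d"}], [{"SHAPE": "ddd.ddd.d.dd"}]])
doc = nlp("The router is usually reachable at 192.168.1.1")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```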
+## Dependency Matcher {#dependencymatcher new="3" model="parser"} + +The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within +the dependency parse using +[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html) +operators. It requires a model containing a parser such as the +[`DependencyParser`](/api/dependencyparser). Instead of defining a list of +adjacent tokens as in `Matcher` patterns, the `DependencyMatcher` patterns match +tokens in the dependency parse and specify the relations between them. + +> ```python +> ### Example +> from spacy.matcher import DependencyMatcher +> +> # "[subject] ... initially founded" +> pattern = [ +> # anchor token: founded +> { +> "RIGHT_ID": "founded", +> "RIGHT_ATTRS": {"ORTH": "founded"} +> }, +> # founded -> subject +> { +> "LEFT_ID": "founded", +> "REL_OP": ">", +> "RIGHT_ID": "subject", +> "RIGHT_ATTRS": {"DEP": "nsubj"} +> }, +> # "founded" follows "initially" +> { +> "LEFT_ID": "founded", +> "REL_OP": ";", +> "RIGHT_ID": "initially", +> "RIGHT_ATTRS": {"ORTH": "initially"} +> } +> ] +> +> matcher = DependencyMatcher(nlp.vocab) +> matcher.add("FOUNDED", [pattern]) +> matches = matcher(doc) +> ``` + +A pattern added to the dependency matcher consists of a **list of +dictionaries**, with each dictionary describing a **token to match** and its +**relation to an existing token** in the pattern. Except for the first +dictionary, which defines an anchor token using only `RIGHT_ID` and +`RIGHT_ATTRS`, each pattern should have the following keys: + +| Name | Description | +| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ | +| `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ | +| `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ | +| `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ | + +Each additional token added to the pattern is linked to an existing token +`LEFT_ID` by the relation `REL_OP`. The new token is given the name `RIGHT_ID` +and described by the attributes `RIGHT_ATTRS`. + + + +Because the unique token **names** in `LEFT_ID` and `RIGHT_ID` are used to +identify tokens, the order of the dicts in the patterns is important: a token +name needs to be defined as `RIGHT_ID` in one dict in the pattern **before** it +can be used as `LEFT_ID` in another dict. + + + +### Dependency matcher operators {#dependencymatcher-operators} + +The following operators are supported by the `DependencyMatcher`, most of which +come directly from +[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html): + +| Symbol | Description | +| --------- | -------------------------------------------------------------------------------------------------------------------- | +| `A < B` | `A` is the immediate dependent of `B`. | +| `A > B` | `A` is the immediate head of `B`. | +| `A << B` | `A` is the dependent in a chain to `B` following dep → head paths. | +| `A >> B` | `A` is the head in a chain to `B` following head → dep paths. | +| `A . B` | `A` immediately precedes `B`, i.e. 
`A.i == B.i - 1`, and both are within the same dependency tree. | +| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. | +| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. | +| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. | +| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`. | + +### Designing dependency matcher patterns {#dependencymatcher-patterns} + +Let's say we want to find sentences describing who founded what kind of company: + +- _Smith founded a healthcare company in 2005._ +- _Williams initially founded an insurance company in 1987._ +- _Lee, an experienced CEO, has founded two AI startups._ + +The dependency parse for "Smith founded a healthcare company" shows types of +relations and tokens we want to match: + +> #### Visualizing the parse +> +> The [`displacy` visualizer](/usage/visualizer) lets you render `Doc` objects +> and their dependency parse and part-of-speech tags: +> +> ```python +> import spacy +> from spacy import displacy +> +> nlp = spacy.load("en_core_web_sm") +> doc = nlp("Smith founded a healthcare company") +> displacy.serve(doc) +> ``` + +import DisplaCyDepFoundedHtml from 'images/displacy-dep-founded.html' + +
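Before translating the parse into a pattern, it can also help to print the relations directly, since keys like `DEP` in `RIGHT_ATTRS` have to match the labels your model actually predicts. A small sketch, again assuming `en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Smith founded a healthcare company")
for token in doc:
    # prints e.g. "Smith nsubj founded" and "company dobj founded"
    print(token.text, token.dep_, token.head.text)
```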