diff --git a/.github/contributors/MiniLau.md b/.github/contributors/MiniLau.md new file mode 100644 index 000000000..14d6fe328 --- /dev/null +++ b/.github/contributors/MiniLau.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution.
The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Desausoi Laurent | +| Company name (if applicable) | / | +| Title or role (if applicable) | / | +| Date | 22 November 2019 | +| GitHub username | MiniLau | +| Website (optional) | / | diff --git a/.github/contributors/Mlawrence95.md b/.github/contributors/Mlawrence95.md new file mode 100644 index 000000000..505d6c16f --- /dev/null +++ b/.github/contributors/Mlawrence95.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [ x ] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity.
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Mike Lawrence | +| Company name (if applicable) | NA | +| Title or role (if applicable) | NA | +| Date | April 17, 2020 | +| GitHub username | Mlawrence95 | +| Website (optional) | | diff --git a/.github/contributors/YohannesDatasci.md b/.github/contributors/YohannesDatasci.md new file mode 100644 index 000000000..129c45576 --- /dev/null +++ b/.github/contributors/YohannesDatasci.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3.
With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [X] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Yohannes | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2020-04-02 | +| GitHub username | YohannesDatasci | +| Website (optional) | | \ No newline at end of file diff --git a/.github/contributors/chopeen.md b/.github/contributors/chopeen.md new file mode 100644 index 000000000..d293c9845 --- /dev/null +++ b/.github/contributors/chopeen.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). 
The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA.
+ +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Marek Grzenkowicz | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2020.04.10 | +| GitHub username | chopeen | +| Website (optional) | | diff --git a/.github/contributors/elben10 b/.github/contributors/elben10 new file mode 100644 index 000000000..1eb4656dc --- /dev/null +++ b/.github/contributors/elben10 @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity.
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Jakob Jul Elben | +| Company name (if applicable) | N/A | +| Title or role (if applicable) | N/A | +| Date | April 16th, 2020 | +| GitHub username | elben10 | +| Website (optional) | N/A | diff --git a/.github/contributors/ilivans.md b/.github/contributors/ilivans.md new file mode 100644 index 000000000..d471fde48 --- /dev/null +++ b/.github/contributors/ilivans.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3.
With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | ------------------------ | +| Name | Ilia Ivanov | +| Company name (if applicable) | Chattermill | +| Title or role (if applicable) | DL Engineer | +| Date | 2020-05-14 | +| GitHub username | ilivans | +| Website (optional) | | diff --git a/.github/contributors/jacse.md b/.github/contributors/jacse.md new file mode 100644 index 000000000..7face10c3 --- /dev/null +++ b/.github/contributors/jacse.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). 
The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA.
+ +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Jacob Lauritzen | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2020-03-30 | +| GitHub username | jacse | +| Website (optional) | | diff --git a/.github/contributors/kevinlu1248.md b/.github/contributors/kevinlu1248.md new file mode 100644 index 000000000..fc974ec95 --- /dev/null +++ b/.github/contributors/kevinlu1248.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity.
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Kevin Lu| +| Company name (if applicable) | | +| Title or role (if applicable) | Student| +| Date | | +| GitHub username | kevinlu1248| +| Website (optional) | | diff --git a/.github/contributors/koaning.md b/.github/contributors/koaning.md new file mode 100644 index 000000000..ddb28cab0 --- /dev/null +++ b/.github/contributors/koaning.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3.
With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | ------------------------ | +| Name | Vincent D. Warmerdam | +| Company name (if applicable) | | +| Title or role (if applicable) | Data Person | +| Date | 2020-03-01 | +| GitHub username | koaning | +| Website (optional) | https://koaning.io | diff --git a/.github/contributors/laszabine.md b/.github/contributors/laszabine.md new file mode 100644 index 000000000..c1a4a3a6b --- /dev/null +++ b/.github/contributors/laszabine.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). 
The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA.
+ +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Sabine Laszakovits | +| Company name (if applicable) | Austrian Academy of Sciences | +| Title or role (if applicable) | Data analyst | +| Date | 2020-04-16 | +| GitHub username | laszabine | +| Website (optional) | https://sabine.laszakovits.net | diff --git a/.github/contributors/leicmi.md b/.github/contributors/leicmi.md new file mode 100644 index 000000000..6a65a48f2 --- /dev/null +++ b/.github/contributors/leicmi.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Michael Leichtfried |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 30.03.2020 |
+| GitHub username | leicmi |
+| Website (optional) | |
diff --git a/.github/contributors/louisguitton.md b/.github/contributors/louisguitton.md
new file mode 100644
index 000000000..8c5f30df6
--- /dev/null
+++ b/.github/contributors/louisguitton.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Louis Guitton |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2020-04-25 |
+| GitHub username | louisguitton |
+| Website (optional) | https://guitton.co/ |
diff --git a/.github/contributors/michael-k.md b/.github/contributors/michael-k.md
new file mode 100644
index 000000000..4ecc5be85
--- /dev/null
+++ b/.github/contributors/michael-k.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [X] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Michael Käufl |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2020-04-23 |
+| GitHub username | michael-k |
+| Website (optional) | |
diff --git a/.github/contributors/nikhilsaldanha.md b/.github/contributors/nikhilsaldanha.md
new file mode 100644
index 000000000..f8d37d709
--- /dev/null
+++ b/.github/contributors/nikhilsaldanha.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Nikhil Saldanha |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2020-03-17 |
+| GitHub username | nikhilsaldanha |
+| Website (optional) | |
diff --git a/.github/contributors/osori.md b/.github/contributors/osori.md
new file mode 100644
index 000000000..93b5c7dd4
--- /dev/null
+++ b/.github/contributors/osori.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Ilkyu Ju |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2020-05-17 |
+| GitHub username | osori |
+| Website (optional) | |
diff --git a/.github/contributors/paoloq.md b/.github/contributors/paoloq.md
new file mode 100644
index 000000000..0fac70c9a
--- /dev/null
+++ b/.github/contributors/paoloq.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Paolo Arduin |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 9 April 2020 |
+| GitHub username | paoloq |
+| Website (optional) | |
diff --git a/.github/contributors/punitvara.md b/.github/contributors/punitvara.md
new file mode 100644
index 000000000..dde810453
--- /dev/null
+++ b/.github/contributors/punitvara.md
@@ -0,0 +1,107 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | ------------------------ |
+| Name | Punit Vara |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2020-04-26 |
+| GitHub username | punitvara |
+| Website (optional) | https://punitvara.com |
+
diff --git a/.github/contributors/sabiqueqb.md b/.github/contributors/sabiqueqb.md
new file mode 100644
index 000000000..da0f2f2a2
--- /dev/null
+++ b/.github/contributors/sabiqueqb.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [ ] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [x] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Sabique Ahammed Lava |
+| Company name (if applicable) | QBurst |
+| Title or role (if applicable) | Senior Engineer |
+| Date | 24 Apr 2020 |
+| GitHub username | sabiqueqb |
+| Website (optional) | |
diff --git a/.github/contributors/sebastienharinck.md b/.github/contributors/sebastienharinck.md
new file mode 100644
index 000000000..e0fddeba5
--- /dev/null
+++ b/.github/contributors/sebastienharinck.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [ ] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [x] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------------------------------- |
+| Name | Sébastien Harinck |
+| Company name (if applicable) | Odaxiom |
+| Title or role (if applicable) | ML Engineer |
+| Date | 2020-04-15 |
+| GitHub username | sebastienharinck |
+| Website (optional) | [https://odaxiom.com](https://odaxiom.com) |
\ No newline at end of file
diff --git a/.github/contributors/thomasthiebaud.md b/.github/contributors/thomasthiebaud.md
new file mode 100644
index 000000000..bdbf0ec50
--- /dev/null
+++ b/.github/contributors/thomasthiebaud.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+   object code, patch, tool, sample, graphic, specification, manual,
+   documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+   registrations, in your contribution:
+
+   * you hereby assign to us joint ownership, and to the extent that such
+     assignment is or becomes invalid, ineffective or unenforceable, you hereby
+     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+     royalty-free, unrestricted license to exercise all rights under those
+     copyrights. This includes, at our option, the right to sublicense these same
+     rights to third parties through multiple levels of sublicensees or other
+     licensing arrangements;
+
+   * you agree that each of us can do all things in relation to your
+     contribution as if each of us were the sole owners, and if one of us makes
+     a derivative work of your contribution, the one who makes the derivative
+     work (or has it made) will be the sole owner of that derivative work;
+
+   * you agree that you will not assert any moral rights in your contribution
+     against us, our licensees or transferees;
+
+   * you agree that we may register a copyright in your contribution and
+     exercise all ownership rights associated with it; and
+
+   * you agree that neither of us has any duty to consult with, obtain the
+     consent of, pay or render an accounting to the other for any use or
+     distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+   to any third party, you hereby grant to us a perpetual, irrevocable,
+   non-exclusive, worldwide, no-charge, royalty-free license to:
+
+   * make, have made, use, sell, offer to sell, import, and otherwise transfer
+     your contribution in whole or in part, alone or in combination with or
+     included in any product, work or materials arising out of the project to
+     which your contribution was submitted, and
+
+   * at our option, to sublicense these same rights to third parties through
+     multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+   contribution. The rights that you grant to us under these terms are effective
+   on the date you first submitted a contribution to us, even if your submission
+   took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+   - Each contribution that you submit is and shall be an original work of
+     authorship and you can legally grant the rights set out in this SCA;
+
+   - to the best of your knowledge, each contribution will not violate any
+     third party's copyrights, trademarks, patents, or other intellectual
+     property rights; and
+
+   - each contribution shall be in compliance with U.S. export control laws and
+     other applicable export and import laws. You agree to notify us if you
+     become aware of any circumstance which would make any of the foregoing
+     representations inaccurate in any respect. We may publicly disclose your
+     participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+   U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+   mark both statements:
+
+   * [x] I am signing on behalf of myself as an individual and no other person
+     or entity, including my employer, has or will have rights with respect to my
+     contributions.
+
+   * [ ] I am signing on behalf of my employer or a legal entity and I have the
+     actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+| ----------------------------- | --------------- |
+| Name | Thomas Thiebaud |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2020-04-07 |
+| GitHub username | thomasthiebaud |
+| Website (optional) | |
diff --git a/.github/contributors/thoppe.md b/.github/contributors/thoppe.md
new file mode 100644
index 000000000..9271a2601
--- /dev/null
+++ b/.github/contributors/thoppe.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Travis Hoppe |
+| Company name (if applicable) | |
+| Title or role (if applicable) | Data Scientist |
+| Date | 07 May 2020 |
+| GitHub username | thoppe |
+| Website (optional) | http://thoppe.github.io/ |
diff --git a/.github/contributors/tommilligan.md b/.github/contributors/tommilligan.md
new file mode 100644
index 000000000..475df5afa
--- /dev/null
+++ b/.github/contributors/tommilligan.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+   object code, patch, tool, sample, graphic, specification, manual,
+   documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+   registrations, in your contribution:
+
+   * you hereby assign to us joint ownership, and to the extent that such
+     assignment is or becomes invalid, ineffective or unenforceable, you hereby
+     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+     royalty-free, unrestricted license to exercise all rights under those
+     copyrights. This includes, at our option, the right to sublicense these same
+     rights to third parties through multiple levels of sublicensees or other
+     licensing arrangements;
+
+   * you agree that each of us can do all things in relation to your
+     contribution as if each of us were the sole owners, and if one of us makes
+     a derivative work of your contribution, the one who makes the derivative
+     work (or has it made) will be the sole owner of that derivative work;
+
+   * you agree that you will not assert any moral rights in your contribution
+     against us, our licensees or transferees;
+
+   * you agree that we may register a copyright in your contribution and
+     exercise all ownership rights associated with it; and
+
+   * you agree that neither of us has any duty to consult with, obtain the
+     consent of, pay or render an accounting to the other for any use or
+     distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+   to any third party, you hereby grant to us a perpetual, irrevocable,
+   non-exclusive, worldwide, no-charge, royalty-free license to:
+
+   * make, have made, use, sell, offer to sell, import, and otherwise transfer
+     your contribution in whole or in part, alone or in combination with or
+     included in any product, work or materials arising out of the project to
+     which your contribution was submitted, and
+
+   * at our option, to sublicense these same rights to third parties through
+     multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+   contribution. The rights that you grant to us under these terms are effective
+   on the date you first submitted a contribution to us, even if your submission
+   took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+   - Each contribution that you submit is and shall be an original work of
+     authorship and you can legally grant the rights set out in this SCA;
+
+   - to the best of your knowledge, each contribution will not violate any
+     third party's copyrights, trademarks, patents, or other intellectual
+     property rights; and
+
+   - each contribution shall be in compliance with U.S. export control laws and
+     other applicable export and import laws. You agree to notify us if you
+     become aware of any circumstance which would make any of the foregoing
+     representations inaccurate in any respect. We may publicly disclose your
+     participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+   U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+   mark both statements:
+
+   * [x] I am signing on behalf of myself as an individual and no other person
+     or entity, including my employer, has or will have rights with respect to my
+     contributions.
+
+   * [ ] I am signing on behalf of my employer or a legal entity and I have the
+     actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+| ----------------------------- | ------------ |
+| Name | Tom Milligan |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2020-03-24 |
+| GitHub username | tommilligan |
+| Website (optional) | |
diff --git a/.github/contributors/umarbutler.md b/.github/contributors/umarbutler.md
new file mode 100644
index 000000000..8df825152
--- /dev/null
+++ b/.github/contributors/umarbutler.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | ------------------------ | +| Name | Umar Butler | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2020-04-09 | +| GitHub username | umarbutler | +| Website (optional) | https://umarbutler.com | diff --git a/.github/contributors/vishnupriyavr.md b/.github/contributors/vishnupriyavr.md new file mode 100644 index 000000000..73657a772 --- /dev/null +++ b/.github/contributors/vishnupriyavr.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. 
With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | ------------------------ | +| Name | Vishnu Priya VR | +| Company name (if applicable) | Uniphore | +| Title or role (if applicable) | NLP/AI Engineer | +| Date | 2020-05-03 | +| GitHub username | vishnupriyavr | +| Website (optional) | | diff --git a/.github/contributors/vondersam.md b/.github/contributors/vondersam.md new file mode 100644 index 000000000..8add70330 --- /dev/null +++ b/.github/contributors/vondersam.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). 
The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. 
+ +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | ------------------------| +| Name | Samuel Rodríguez Medina | +| Company name (if applicable) | | +| Title or role (if applicable) | Computational linguist | +| Date | 28 April 2020 | +| GitHub username | vondersam | +| Website (optional) | | diff --git a/examples/training/pretrain_kb.py b/examples/training/create_kb.py similarity index 75% rename from examples/training/pretrain_kb.py rename to examples/training/create_kb.py index 54c68f653..cbdb5c05b 100644 --- a/examples/training/pretrain_kb.py +++ b/examples/training/create_kb.py @@ -1,15 +1,15 @@ #!/usr/bin/env python # coding: utf8 -"""Example of defining and (pre)training spaCy's knowledge base, +"""Example of defining a knowledge base in spaCy, which is needed to implement entity linking functionality. For more details, see the documentation: * Knowledge base: https://spacy.io/api/kb * Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking -Compatible with: spaCy v2.2.3 -Last tested with: v2.2.3 +Compatible with: spaCy v2.2.4 +Last tested with: v2.2.4 """ from __future__ import unicode_literals, print_function @@ -20,24 +20,18 @@ from spacy.vocab import Vocab import spacy from spacy.kb import KnowledgeBase -from bin.wiki_entity_linking.train_descriptions import EntityEncoder - # Q2146908 (Russ Cochran): American golfer # Q7381115 (Russ Cochran): publisher ENTITIES = {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)} -INPUT_DIM = 300 # dimension of pretrained input vectors -DESC_WIDTH = 64 # dimension of output entity vectors - @plac.annotations( model=("Model name, should have pretrained word embeddings", "positional", None, str), output_dir=("Optional output directory", "option", "o", Path), - n_iter=("Number of training iterations", "option", "n", int), ) -def main(model=None, output_dir=None, n_iter=50): - """Load the model, create the KB and pretrain the entity encodings. +def main(model=None, output_dir=None): + """Load the model and create the KB with pre-defined entity encodings. If an output_dir is provided, the KB will be stored there in a file 'kb'. The updated vocab will also be written to a directory in the output_dir.""" @@ -51,33 +45,23 @@ def main(model=None, output_dir=None, n_iter=50): " cf. https://spacy.io/usage/models#languages." ) - kb = KnowledgeBase(vocab=nlp.vocab) + # You can change the dimension of vectors in your KB by using an encoder that changes the dimensionality. + # For simplicity, we'll just use the original vector dimension here instead. 
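+    # As a hypothetical sketch (not used below), such an encoder could be a
+    # fixed random projection from the pretrained vector width down to e.g.
+    # the 64 dimensions the removed EntityEncoder produced:
+    #
+    #     import numpy
+    #     width = nlp.vocab.vectors.shape[1]
+    #     proj = numpy.random.RandomState(0).normal(size=(width, 64)).astype("f")
+    #     encoded = [nlp(desc).vector @ proj for desc, freq in ENTITIES.values()]
+    #
+    # The KB would then be created with entity_vector_length=64 instead.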
+    vectors_dim = nlp.vocab.vectors.shape[1]
+    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=vectors_dim)

     # set up the data
     entity_ids = []
-    descriptions = []
+    descr_embeddings = []
     freqs = []
     for key, value in ENTITIES.items():
         desc, freq = value
         entity_ids.append(key)
-        descriptions.append(desc)
+        descr_embeddings.append(nlp(desc).vector)
         freqs.append(freq)

-    # training entity description encodings
-    # this part can easily be replaced with a custom entity encoder
-    encoder = EntityEncoder(
-        nlp=nlp,
-        input_dim=INPUT_DIM,
-        desc_width=DESC_WIDTH,
-        epochs=n_iter,
-    )
-    encoder.train(description_list=descriptions, to_print=True)
-
-    # get the pretrained entity vectors
-    embeddings = encoder.apply_encoder(descriptions)
-
     # set the entities, can also be done by calling `kb.add_entity` for each entity
-    kb.set_entities(entity_list=entity_ids, freq_list=freqs, vector_list=embeddings)
+    kb.set_entities(entity_list=entity_ids, freq_list=freqs, vector_list=descr_embeddings)

     # adding aliases, the entities need to be defined in the KB beforehand
     kb.add_alias(
@@ -113,8 +97,8 @@ def main(model=None, output_dir=None, n_iter=50):
         vocab2 = Vocab().from_disk(vocab_path)
         kb2 = KnowledgeBase(vocab=vocab2)
         kb2.load_bulk(kb_path)
-        _print_kb(kb2)
         print()
+        _print_kb(kb2)


 def _print_kb(kb):
@@ -126,6 +110,5 @@ if __name__ == "__main__":
     plac.call(main)

     # Expected output:
-
     # 2 kb entities: ['Q2146908', 'Q7381115']
     # 1 kb aliases: ['Russ Cochran']
diff --git a/examples/training/rehearsal.py b/examples/training/rehearsal.py
index 24fc67ebb..98a96643b 100644
--- a/examples/training/rehearsal.py
+++ b/examples/training/rehearsal.py
@@ -1,6 +1,7 @@
 """Prevent catastrophic forgetting with rehearsal updates."""
 import plac
 import random
+import warnings
 import srsly
 import spacy
 from spacy.gold import GoldParse
@@ -63,7 +64,10 @@ def main(model_name, unlabelled_loc):
     optimizer.b2 = 0.0

     sizes = compounding(1.0, 4.0, 1.001)
-    with nlp.select_pipes(enable="ner"):
+    with nlp.select_pipes(enable="ner"), warnings.catch_warnings():
+        # show warnings for misaligned entity spans once
+        warnings.filterwarnings("once", category=UserWarning, module="spacy")
+
         for itn in range(n_iter):
             random.shuffle(TRAIN_DATA)
             random.shuffle(raw_docs)
diff --git a/examples/training/train_entity_linker.py b/examples/training/train_entity_linker.py
index 2da1db26d..b82ff5bb4 100644
--- a/examples/training/train_entity_linker.py
+++ b/examples/training/train_entity_linker.py
@@ -1,15 +1,15 @@
 #!/usr/bin/env python
 # coding: utf8
-"""Example of training spaCy's entity linker, starting off with an
-existing model and a pre-defined knowledge base.
+"""Example of training spaCy's entity linker, starting off with a predefined
+knowledge base and corresponding vocab, and a blank English model.

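+The vocab should be the same one that was used when the KB was created
+(the create_kb.py example writes both the KB and the vocab to its output
+directory).
+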
For more details, see the documentation:
 * Training: https://spacy.io/usage/training
 * Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking

-Compatible with: spaCy v2.2.3
-Last tested with: v2.2.3
+Compatible with: spaCy v2.2.4
+Last tested with: v2.2.4
 """
 from __future__ import unicode_literals, print_function

@@ -17,13 +17,10 @@ import plac
 import random
 from pathlib import Path

-import srsly
 from spacy.vocab import Vocab
-
 import spacy
 from spacy.kb import KnowledgeBase
 from spacy.pipeline import EntityRuler
-from spacy.tokens import Span
 from spacy.util import minibatch, compounding


@@ -66,18 +63,20 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
     """Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
     The `vocab` should be the one used during creation of the KB."""
     vocab = Vocab().from_disk(vocab_path)
-    # create blank Language class with correct vocab
+    # create blank English model with correct vocab
     nlp = spacy.blank("en", vocab=vocab)
     nlp.vocab.vectors.name = "nel_vectors"
     print("Created blank 'en' model with vocab from '%s'" % vocab_path)

     # Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.
-    nlp.add_pipe(nlp.create_pipe('sentencizer'))
+    nlp.add_pipe(nlp.create_pipe("sentencizer"))

     # Add a custom component to recognize "Russ Cochran" as an entity for the example training data.
     # Note that in a realistic application, an actual NER algorithm should be used instead.
     ruler = EntityRuler(nlp)
-    patterns = [{"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]}]
+    patterns = [
+        {"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]}
+    ]
     ruler.add_patterns(patterns)
     nlp.add_pipe(ruler)

diff --git a/examples/training/train_ner.py b/examples/training/train_ner.py
index f0f3affe7..f439fda23 100644
--- a/examples/training/train_ner.py
+++ b/examples/training/train_ner.py
@@ -8,12 +8,13 @@ For more details, see the documentation:
 * NER: https://spacy.io/usage/linguistic-features#named-entities

 Compatible with: spaCy v2.0.0+
-Last tested with: v2.1.0
+Last tested with: v2.2.4
 """
 from __future__ import unicode_literals, print_function

 import plac
 import random
+import warnings
 from pathlib import Path
 import spacy
 from spacy.util import minibatch, compounding
@@ -55,12 +56,17 @@ def main(model=None, output_dir=None, n_iter=100):
             print("Add label", ent[2])
             ner.add_label(ent[2])

-    with nlp.select_pipes(enable="ner"):  # only train NER
+    with nlp.select_pipes(enable="ner"), warnings.catch_warnings():
+        # show warnings for misaligned entity spans once
+        warnings.filterwarnings("once", category=UserWarning, module="spacy")
+
         # reset and initialize the weights randomly – but only if we're
         # training a new model
         if model is None:
             nlp.begin_training()
-        print("Transitions", list(enumerate(nlp.get_pipe("simple_ner").get_tag_names())))
+        print(
+            "Transitions", list(enumerate(nlp.get_pipe("simple_ner").get_tag_names()))
+        )
         for itn in range(n_iter):
             random.shuffle(TRAIN_DATA)
             losses = {}
diff --git a/examples/training/train_new_entity_type.py b/examples/training/train_new_entity_type.py
index 445c3fc27..5124d0a2c 100644
--- a/examples/training/train_new_entity_type.py
+++ b/examples/training/train_new_entity_type.py
@@ -24,12 +24,13 @@ For more details, see the documentation:
 * NER: https://spacy.io/usage/linguistic-features#named-entities

 Compatible with: spaCy v2.1.0+
-Last tested with: v2.1.0
+Last tested with: v2.2.4
 """
 from __future__ import unicode_literals, print_function

 import plac
 import random
+import warnings
 from pathlib import Path
 import spacy
 from spacy.util import minibatch, compounding
@@ -94,8 +95,10 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
     else:
         optimizer = nlp.resume_training()
     move_names = list(ner.move_names)
+    with nlp.select_pipes(enable="ner"), warnings.catch_warnings():
+        # show warnings for misaligned entity spans once
+        warnings.filterwarnings("once", category=UserWarning, module="spacy")

-    with nlp.select_pipes(enable="ner"):  # only train NER
         sizes = compounding(1.0, 4.0, 1.001)
         # batch up the examples using spaCy's minibatch
         for itn in range(n_iter):
diff --git a/netlify.toml b/netlify.toml
index 45bd2c3b6..be809f1d4 100644
--- a/netlify.toml
+++ b/netlify.toml
@@ -7,42 +7,42 @@ redirects = [
     {from = "https://alpha.spacy.io/*", to = "https://spacy.io", force = true},
     {from = "http://alpha.spacy.io/*", to = "https://spacy.io", force = true},
     # Old demos
-    {from = "/demos/*", to = "https://explosion.ai/demos/:splat"},
+    {from = "/demos/*", to = "https://explosion.ai/demos/:splat", force = true},
     # Old blog
-    {from = "/blog/*", to = "https://explosion.ai/blog/:splat"},
-    {from = "/feed", to = "https://explosion.ai/feed"},
-    {from = "/feed.xml", to = "https://explosion.ai/feed"},
+    {from = "/blog/*", to = "https://explosion.ai/blog/:splat", force = true},
+    {from = "/feed", to = "https://explosion.ai/feed", force = true},
+    {from = "/feed.xml", to = "https://explosion.ai/feed", force = true},
     # Old documentation pages (1.x)
-    {from = "/docs/usage/processing-text", to = "/usage/linguistic-features"},
-    {from = "/docs/usage/deep-learning", to = "/usage/training"},
-    {from = "/docs/usage/pos-tagging", to = "/usage/linguistic-features#pos-tagging"},
-    {from = "/docs/usage/dependency-parse", to = "/usage/linguistic-features#dependency-parse"},
-    {from = "/docs/usage/entity-recognition", to = "/usage/linguistic-features#named-entities"},
-    {from = "/docs/usage/word-vectors-similarities", to = "/usage/vectors-similarity"},
-    {from = "/docs/usage/customizing-tokenizer", to = "/usage/linguistic-features#tokenization"},
-    {from = "/docs/usage/language-processing-pipeline", to = "/usage/processing-pipelines"},
-    {from = "/docs/usage/customizing-pipeline", to = "/usage/processing-pipelines"},
-    {from = "/docs/usage/training-ner", to = "/usage/training#ner"},
-    {from = "/docs/usage/tutorials", to = "/usage/examples"},
-    {from = "/docs/usage/data-model", to = "/api"},
-    {from = "/docs/usage/cli", to = "/api/cli"},
-    {from = "/docs/usage/lightning-tour", to = "/usage/spacy-101#lightning-tour"},
-    {from = "/docs/api/language-models", to = "/usage/models#languages"},
-    {from = "/docs/api/spacy", to = "/docs/api/top-level"},
-    {from = "/docs/api/displacy", to = "/api/top-level#displacy"},
-    {from = "/docs/api/util", to = "/api/top-level#util"},
-    {from = "/docs/api/features", to = "/models/#architecture"},
-    {from = "/docs/api/philosophy", to = "/usage/spacy-101"},
-    {from = "/docs/usage/showcase", to = "/universe"},
-    {from = "/tutorials/load-new-word-vectors", to = "/usage/vectors-similarity#custom"},
-    {from = "/tutorials", to = "/usage/examples"},
+    {from = "/docs/usage/processing-text", to = "/usage/linguistic-features", force = true},
+    {from = "/docs/usage/deep-learning", to = "/usage/training", force = true},
+    {from = "/docs/usage/pos-tagging", to = "/usage/linguistic-features#pos-tagging", force = true},
+    {from = "/docs/usage/dependency-parse", to = 
"/usage/linguistic-features#dependency-parse", force = true}, + {from = "/docs/usage/entity-recognition", to = "/usage/linguistic-features#named-entities", force = true}, + {from = "/docs/usage/word-vectors-similarities", to = "/usage/vectors-similarity", force = true}, + {from = "/docs/usage/customizing-tokenizer", to = "/usage/linguistic-features#tokenization", force = true}, + {from = "/docs/usage/language-processing-pipeline", to = "/usage/processing-pipelines", force = true}, + {from = "/docs/usage/customizing-pipeline", to = "/usage/processing-pipelines", force = true}, + {from = "/docs/usage/training-ner", to = "/usage/training#ner", force = true}, + {from = "/docs/usage/tutorials", to = "/usage/examples", force = true}, + {from = "/docs/usage/data-model", to = "/api", force = true}, + {from = "/docs/usage/cli", to = "/api/cli", force = true}, + {from = "/docs/usage/lightning-tour", to = "/usage/spacy-101#lightning-tour", force = true}, + {from = "/docs/api/language-models", to = "/usage/models#languages", force = true}, + {from = "/docs/api/spacy", to = "/docs/api/top-level", force = true}, + {from = "/docs/api/displacy", to = "/api/top-level#displacy", force = true}, + {from = "/docs/api/util", to = "/api/top-level#util", force = true}, + {from = "/docs/api/features", to = "/models/#architecture", force = true}, + {from = "/docs/api/philosophy", to = "/usage/spacy-101", force = true}, + {from = "/docs/usage/showcase", to = "/universe", force = true}, + {from = "/tutorials/load-new-word-vectors", to = "/usage/vectors-similarity#custom", force = true}, + {from = "/tutorials", to = "/usage/examples", force = true}, # Rewrite all other docs pages to / {from = "/docs/*", to = "/:splat"}, # Updated documentation pages - {from = "/usage/resources", to = "/universe"}, - {from = "/usage/lightning-tour", to = "/usage/spacy-101#lightning-tour"}, - {from = "/usage/linguistic-features#rule-based-matching", to = "/usage/rule-based-matching"}, - {from = "/models/comparison", to = "/models"}, + {from = "/usage/resources", to = "/universe", force = true}, + {from = "/usage/lightning-tour", to = "/usage/spacy-101#lightning-tour", force = true}, + {from = "/usage/linguistic-features#rule-based-matching", to = "/usage/rule-based-matching", force = true}, + {from = "/models/comparison", to = "/models", force = true}, {from = "/api/#section-cython", to = "/api/cython", force = true}, {from = "/api/#cython", to = "/api/cython", force = true}, {from = "/api/sentencesegmenter", to="/api/sentencizer"}, diff --git a/setup.cfg b/setup.cfg index ae09d071c..c19b8d857 100644 --- a/setup.cfg +++ b/setup.cfg @@ -61,19 +61,23 @@ install_requires = [options.extras_require] lookups = - spacy_lookups_data>=0.0.5,<0.2.0 + spacy_lookups_data>=0.3.1,<0.4.0 cuda = - cupy>=5.0.0b4 + cupy>=5.0.0b4,<9.0.0 cuda80 = - cupy-cuda80>=5.0.0b4 + cupy-cuda80>=5.0.0b4,<9.0.0 cuda90 = - cupy-cuda90>=5.0.0b4 + cupy-cuda90>=5.0.0b4,<9.0.0 cuda91 = - cupy-cuda91>=5.0.0b4 + cupy-cuda91>=5.0.0b4,<9.0.0 cuda92 = - cupy-cuda92>=5.0.0b4 + cupy-cuda92>=5.0.0b4,<9.0.0 cuda100 = - cupy-cuda100>=5.0.0b4 + cupy-cuda100>=5.0.0b4,<9.0.0 +cuda101 = + cupy-cuda101>=5.0.0b4,<9.0.0 +cuda102 = + cupy-cuda102>=5.0.0b4,<9.0.0 # Language tokenizers with external dependencies ja = fugashi>=0.1.3 diff --git a/spacy/attrs.pxd b/spacy/attrs.pxd index 20c42f066..33d5372de 100644 --- a/spacy/attrs.pxd +++ b/spacy/attrs.pxd @@ -15,7 +15,7 @@ cdef enum attr_id_t: LIKE_NUM LIKE_EMAIL IS_STOP - IS_OOV + IS_OOV_DEPRECATED IS_BRACKET IS_QUOTE IS_LEFT_PUNCT @@ -95,3 
+95,4 @@ cdef enum attr_id_t:
     ENT_ID = symbols.ENT_ID

     IDX
+    SENT_END
\ No newline at end of file
diff --git a/spacy/attrs.pyx b/spacy/attrs.pyx
index 338b5db1b..b15db7599 100644
--- a/spacy/attrs.pyx
+++ b/spacy/attrs.pyx
@@ -13,7 +13,7 @@ IDS = {
     "LIKE_NUM": LIKE_NUM,
     "LIKE_EMAIL": LIKE_EMAIL,
     "IS_STOP": IS_STOP,
-    "IS_OOV": IS_OOV,
+    "IS_OOV_DEPRECATED": IS_OOV_DEPRECATED,
     "IS_BRACKET": IS_BRACKET,
     "IS_QUOTE": IS_QUOTE,
     "IS_LEFT_PUNCT": IS_LEFT_PUNCT,
@@ -85,6 +85,7 @@ IDS = {
     "ENT_KB_ID": ENT_KB_ID,
     "HEAD": HEAD,
     "SENT_START": SENT_START,
+    "SENT_END": SENT_END,
     "SPACY": SPACY,
     "PROB": PROB,
     "LANG": LANG,
diff --git a/spacy/cli/debug_data.py b/spacy/cli/debug_data.py
index 1705bf446..21f49956d 100644
--- a/spacy/cli/debug_data.py
+++ b/spacy/cli/debug_data.py
@@ -89,11 +89,11 @@ def debug_data(
     msg.good("Corpus is loadable")

     # Create all gold data here to avoid iterating over the train_dataset constantly
-    gold_train_data = _compile_gold(train_dataset, pipeline)
+    gold_train_data = _compile_gold(train_dataset, pipeline, nlp)
     gold_train_unpreprocessed_data = _compile_gold(
         train_dataset_unpreprocessed, pipeline
     )
-    gold_dev_data = _compile_gold(dev_dataset, pipeline)
+    gold_dev_data = _compile_gold(dev_dataset, pipeline, nlp)

     train_texts = gold_train_data["texts"]
     dev_texts = gold_dev_data["texts"]
@@ -151,6 +151,21 @@
             f"{len(nlp.vocab.vectors)} vectors ({nlp.vocab.vectors.n_keys} "
             f"unique keys, {nlp.vocab.vectors_length} dimensions)"
         )
+        n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
+        msg.warn(
+            "{} words in training data without vectors ({:0.2f}%)".format(
+                n_missing_vectors, 100 * n_missing_vectors / gold_train_data["n_words"],
+            ),
+        )
+        msg.text(
+            "10 most common words without vectors: {}".format(
+                _format_labels(
+                    gold_train_data["words_missing_vectors"].most_common(10),
+                    counts=True,
+                )
+            ),
+            show=verbose,
+        )
     else:
         msg.info("No word vectors present in the model")

@@ -450,7 +465,7 @@ def _load_file(file_path, msg):
     )


-def _compile_gold(examples, pipeline):
+def _compile_gold(examples, pipeline, nlp):
     data = {
         "ner": Counter(),
         "cats": Counter(),
@@ -462,6 +477,7 @@
         "punct_ents": 0,
         "n_words": 0,
         "n_misaligned_words": 0,
+        "words_missing_vectors": Counter(),
         "n_sents": 0,
         "n_nonproj": 0,
         "n_cycles": 0,
@@ -476,6 +492,10 @@
         data["n_words"] += len(valid_words)
         data["n_misaligned_words"] += len(gold.words) - len(valid_words)
         data["texts"].add(doc.text)
+        if len(nlp.vocab.vectors):
+            for word in valid_words:
+                if nlp.vocab.strings[word] not in nlp.vocab.vectors:
+                    data["words_missing_vectors"].update([word])
         if "ner" in pipeline:
             for i, label in enumerate(gold.ner):
                 if label is None:
diff --git a/spacy/cli/evaluate.py b/spacy/cli/evaluate.py
index 94813e732..735e304f9 100644
--- a/spacy/cli/evaluate.py
+++ b/spacy/cli/evaluate.py
@@ -32,7 +32,10 @@ def evaluate(
     if displacy_path and not displacy_path.exists():
         msg.fail("Visualization output directory not found", displacy_path, exits=1)
     corpus = GoldCorpus(data_path, data_path)
-    nlp = util.load_model(model)
+    if model.startswith("blank:"):
+        nlp = util.get_lang_class(model.replace("blank:", ""))()
+    else:
+        nlp = util.load_model(model)
     dev_dataset = list(corpus.dev_dataset(nlp, gold_preproc=gold_preproc))
     begin = timer()
     scorer = nlp.evaluate(dev_dataset, verbose=False)
diff --git a/spacy/cli/init_model.py b/spacy/cli/init_model.py
index 4b4949179..700fa43de 100644
--- a/spacy/cli/init_model.py
+++ 
b/spacy/cli/init_model.py @@ -8,12 +8,13 @@ import tarfile import gzip import zipfile import srsly -from wasabi import msg import warnings +from wasabi import msg from ..vectors import Vectors from ..errors import Errors, Warnings -from ..util import ensure_path, get_lang_class +from ..util import ensure_path, get_lang_class, load_model, OOV_RANK +from ..lookups import Lookups try: import ftfy @@ -33,8 +34,11 @@ def init_model( jsonl_loc: ("Location of JSONL-formatted attributes file", "option", "j", Path) = None, vectors_loc: ("Optional vectors file in Word2Vec format", "option", "v", str) = None, prune_vectors: ("Optional number of vectors to prune to", "option", "V", int) = -1, + truncate_vectors: ("Optional number of vectors to truncate to when reading in vectors file", "option", "t", int) = 0, vectors_name: ("Optional name for the word vectors, e.g. en_core_web_lg.vectors", "option", "vn", str) = None, model_name: ("Optional name for the model meta", "option", "mn", str) = None, + omit_extra_lookups: ("Don't include extra lookups in model", "flag", "OEL", bool) = False, + base_model: ("Base model (for languages with custom tokenizers)", "option", "b", str) = None # fmt: on ): """ @@ -67,10 +71,19 @@ def init_model( lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc) with msg.loading("Creating model..."): - nlp = create_model(lang, lex_attrs, name=model_name) + nlp = create_model(lang, lex_attrs, name=model_name, base_model=base_model) + + # Create empty extra lexeme tables so the data from spacy-lookups-data + # isn't loaded if these features are accessed + if omit_extra_lookups: + nlp.vocab.lookups_extra = Lookups() + nlp.vocab.lookups_extra.add_table("lexeme_cluster") + nlp.vocab.lookups_extra.add_table("lexeme_prob") + nlp.vocab.lookups_extra.add_table("lexeme_settings") + msg.good("Successfully created model") if vectors_loc is not None: - add_vectors(nlp, vectors_loc, prune_vectors, vectors_name) + add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name) vec_added = len(nlp.vocab.vectors) lex_added = len(nlp.vocab) msg.good( @@ -126,20 +139,23 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc): return lex_attrs -def create_model(lang, lex_attrs, name=None): - lang_class = get_lang_class(lang) - nlp = lang_class() +def create_model(lang, lex_attrs, name=None, base_model=None): + if base_model: + nlp = load_model(base_model) + # keep the tokenizer but remove any existing pipeline components due to + # potentially conflicting vectors + for pipe in nlp.pipe_names: + nlp.remove_pipe(pipe) + else: + lang_class = get_lang_class(lang) + nlp = lang_class() for lexeme in nlp.vocab: - lexeme.rank = 0 - lex_added = 0 + lexeme.rank = OOV_RANK for attrs in lex_attrs: if "settings" in attrs: continue lexeme = nlp.vocab[attrs["orth"]] lexeme.set_attrs(**attrs) - lexeme.is_oov = False - lex_added += 1 - lex_added += 1 if len(nlp.vocab): oov_prob = min(lex.prob for lex in nlp.vocab) - 1 else: @@ -150,12 +166,12 @@ def create_model(lang, lex_attrs, name=None): return nlp -def add_vectors(nlp, vectors_loc, prune_vectors, name=None): +def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None): vectors_loc = ensure_path(vectors_loc) if vectors_loc and vectors_loc.parts[-1].endswith(".npz"): nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb"))) for lex in nlp.vocab: - if lex.rank: + if lex.rank and lex.rank != OOV_RANK: nlp.vocab.vectors.add(lex.orth, row=lex.rank) else: if vectors_loc: @@ -167,8 +183,7 @@ def add_vectors(nlp, 
vectors_loc, prune_vectors, name=None):
         if vector_keys is not None:
             for word in vector_keys:
                 if word not in nlp.vocab:
-                    lexeme = nlp.vocab[word]
-                    lexeme.is_oov = False
+                    nlp.vocab[word]
         if vectors_data is not None:
             nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
     if name is None:
@@ -180,9 +195,11 @@ def add_vectors(nlp, vectors_loc, prune_vectors, name=None):
         nlp.vocab.prune_vectors(prune_vectors)


-def read_vectors(vectors_loc):
+def read_vectors(vectors_loc, truncate_vectors=0):
     f = open_file(vectors_loc)
     shape = tuple(int(size) for size in next(f).split())
+    if truncate_vectors >= 1:
+        shape = (truncate_vectors, shape[1])
     vectors_data = numpy.zeros(shape=shape, dtype="f")
     vectors_keys = []
     for i, line in enumerate(tqdm(f)):
@@ -193,6 +210,8 @@ def read_vectors(vectors_loc):
             msg.fail(Errors.E094.format(line_num=i, loc=vectors_loc), exits=1)
         vectors_data[i] = numpy.asarray(pieces, dtype="f")
         vectors_keys.append(word)
+        if i == truncate_vectors - 1:
+            break
     return vectors_data, vectors_keys

diff --git a/spacy/cli/train.py b/spacy/cli/train.py
index 590ce4f13..cbe977cad 100644
--- a/spacy/cli/train.py
+++ b/spacy/cli/train.py
@@ -11,8 +11,8 @@ import random

 from ..util import create_default_optimizer
 from ..util import use_gpu as set_gpu
-from ..attrs import PROB, IS_OOV, CLUSTER, LANG
 from ..gold import GoldCorpus
+from ..lookups import Lookups
 from .. import util
 from .. import about

@@ -46,6 +46,7 @@ def train(
     textcat_arch: ("Textcat model architecture", "option", "ta", str) = "bow",
     textcat_positive_label: ("Textcat positive label for binary classes with two labels", "option", "tpl", str) = None,
     tag_map_path: ("Location of JSON-formatted tag map", "option", "tm", Path) = None,
+    omit_extra_lookups: ("Don't include extra lookups in model", "flag", "OEL", bool) = False,
     verbose: ("Display more information for debug", "flag", "VV", bool) = False,
     debug: ("Run data diagnostics before training", "flag", "D", bool) = False,
     # fmt: on
@@ -111,7 +112,7 @@ def train(
         eval_beam_widths.sort()
         has_beam_widths = eval_beam_widths != [1]

-    default_dir = Path(__file__).parent.parent / "ml" / "models" / "defaults"
+    default_dir = Path(__file__).parent.parent / "pipeline" / "defaults"

     # Set up the base model and pipeline. If a base model is specified, load
     # the model and make sure the pipeline matches the pipeline setting. If
@@ -252,6 +253,18 @@ def train(
         # Update tag map with provided mapping
         nlp.vocab.morphology.tag_map.update(tag_map)

+    # Create empty extra lexeme tables so the data from spacy-lookups-data
+    # isn't loaded if these features are accessed
+    if omit_extra_lookups:
+        nlp.vocab.lookups_extra = Lookups()
+        nlp.vocab.lookups_extra.add_table("lexeme_cluster")
+        nlp.vocab.lookups_extra.add_table("lexeme_prob")
+        nlp.vocab.lookups_extra.add_table("lexeme_settings")
+
+    if vectors:
+        msg.text("Loading vectors from model '{}'".format(vectors))
+        _load_vectors(nlp, vectors)
+
     # Multitask objectives
     multitask_options = [("parser", parser_multitasks), ("ner", entity_multitasks)]
     for pipe_name, multitasks in multitask_options:
@@ -355,7 +368,7 @@ def train(
             if len(textcat_labels) == 2:
                 msg.warn(
                     "If the textcat component is a binary classifier with "
-                    "exclusive classes, provide '--textcat_positive_label' for "
+                    "exclusive classes, provide '--textcat-positive-label' for "
                     "an evaluation on the positive class."
) msg.text( @@ -445,22 +458,25 @@ def train( cpu_wps = nwords / (end_time - start_time) else: gpu_wps = nwords / (end_time - start_time) - with use_ops("numpy"): - nlp_loaded = util.load_model_from_path(epoch_model_path) - for name, component in nlp_loaded.pipeline: - if hasattr(component, "cfg"): - component.cfg["beam_width"] = beam_width - dev_dataset = list( - corpus.dev_dataset( - nlp_loaded, - gold_preproc=gold_preproc, - ignore_misaligned=True, + # Evaluate on CPU in the first iteration only (for + # timing) when GPU is enabled + if i == 0: + with use_ops("numpy"): + nlp_loaded = util.load_model_from_path(epoch_model_path) + for name, component in nlp_loaded.pipeline: + if hasattr(component, "cfg"): + component.cfg["beam_width"] = beam_width + dev_dataset = list( + corpus.dev_dataset( + nlp_loaded, + gold_preproc=gold_preproc, + ignore_misaligned=True, + ) ) - ) - start_time = timer() - scorer = nlp_loaded.evaluate(dev_dataset, verbose=verbose) - end_time = timer() - cpu_wps = nwords / (end_time - start_time) + start_time = timer() + scorer = nlp_loaded.evaluate(dev_dataset, verbose=verbose) + end_time = timer() + cpu_wps = nwords / (end_time - start_time) acc_loc = output_path / f"model{i}" / "accuracy.json" srsly.write_json(acc_loc, scorer.scores) @@ -536,7 +552,7 @@ def train( ) break except Exception as e: - msg.warn(f"Aborting and saving final best model. Encountered exception: {e}") + msg.warn(f"Aborting and saving final best model. Encountered exception: {e}", exits=1) finally: best_pipes = nlp.pipe_names if disabled_pipes: @@ -614,17 +630,7 @@ def _create_progress_bar(total): def _load_vectors(nlp, vectors): - loaded_model = util.load_model(vectors, vocab=nlp.vocab) - for lex in nlp.vocab: - values = {} - for attr, func in nlp.vocab.lex_attr_getters.items(): - # These attrs are expected to be set by data. Others should - # be set by calling the language functions. - if attr not in (CLUSTER, PROB, IS_OOV, LANG): - values[lex.vocab.strings[attr]] = func(lex.orth_) - lex.set_attrs(**values) - lex.is_oov = False - return loaded_model + util.load_model(vectors, vocab=nlp.vocab) def _load_pretrained_tok2vec(nlp, loc): diff --git a/spacy/errors.py b/spacy/errors.py index da2cfdf04..6184c078c 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -1,10 +1,13 @@ def add_codes(err_cls): """Add error codes to string messages via class attribute names.""" - class ErrorsWithCodes(object): + class ErrorsWithCodes(err_cls): def __getattribute__(self, code): - msg = getattr(err_cls, code) - return f"[{code}] {msg}" + msg = super().__getattribute__(code) + if code.startswith("__"): # python system attributes like __class__ + return msg + else: + return "[{code}] {msg}".format(code=code, msg=msg) return ErrorsWithCodes() @@ -88,6 +91,8 @@ class Warnings(object): "or the language you're using doesn't have lemmatization data, " "you can ignore this warning. If this is surprising, make sure you " "have the spacy-lookups-data package installed.") + W023 = ("Multiprocessing of Language.pipe is not supported in Python 2. " + "'n_process' will be set to 1.") W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in " "the Knowledge Base.") W025 = ("'{name}' requires '{attr}' to be assigned, but none of the " @@ -99,9 +104,13 @@ class Warnings(object): W028 = ("Doc.from_array was called with a vector of type '{type}', " "but is expecting one of type 'uint64' instead. 
This may result "
            "in problems with the vocab further on in the pipeline.")
-    W029 = ("Skipping unsupported morphological feature(s): {feature}. "
-            "Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
-            "string \"Field1=Value1,Value2|Field2=Value3\".")
+    W029 = ("Unable to align tokens with entities from character offsets. "
+            "Discarding entity annotation for the text: {text}.")
+    W030 = ("Some entities could not be aligned in the text \"{text}\" with "
+            "entities \"{entities}\". Use "
+            "`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`"
+            " to check the alignment. Misaligned entities ('-') will be "
+            "ignored during training.")

     # TODO: fix numbering after merging develop into master
     W095 = ("Model '{model}' ({model_version}) requires spaCy {version} and is "
@@ -118,6 +127,9 @@ class Warnings(object):
             "so a default configuration was used.")
     W099 = ("Expected 'dict' type for the 'model' argument of pipe '{pipe}', "
            "but got '{type}' instead, so ignoring it.")
+    W100 = ("Skipping unsupported morphological feature(s): {feature}. "
+            "Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
+            "string \"Field1=Value1,Value2|Field2=Value3\".")


 @add_codes
@@ -551,6 +563,17 @@ class Errors(object):
             "array.")
     E191 = ("Invalid head: the head token must be from the same doc as the "
             "token itself.")
+    E192 = ("Unable to resize vectors in place with cupy.")
+    E193 = ("Unable to resize vectors in place if the resized vector dimension "
+            "({new_dim}) is not the same as the current vector dimension "
+            "({curr_dim}).")
+    E194 = ("Unable to align mismatched text '{text}' and words '{words}'.")
+    E195 = ("Matcher can be called on {good} only, got {got}.")
+    E196 = ("Refusing to write to token.is_sent_end. Sentence boundaries can "
+            "only be fixed with token.is_sent_start.")
+    E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
+    E198 = ("Unable to return {n} most similar vectors for the current vectors "
+            "table, which contains {n_rows} vectors.")

     # TODO: fix numbering after merging develop into master
diff --git a/spacy/gold.pyx b/spacy/gold.pyx
index ecbd13354..1e58f0635 100644
--- a/spacy/gold.pyx
+++ b/spacy/gold.pyx
@@ -47,13 +47,27 @@ def tags_to_entities(tags):
     return entities


+def merge_sents(sents):
+    m_deps = [[], [], [], [], [], []]
+    m_cats = {}
+    m_brackets = []
+    i = 0
+    for (ids, words, tags, heads, labels, ner), (cats, brackets) in sents:
+        m_deps[0].extend(id_ + i for id_ in ids)
+        m_deps[1].extend(words)
+        m_deps[2].extend(tags)
+        m_deps[3].extend(head + i for head in heads)
+        m_deps[4].extend(labels)
+        m_deps[5].extend(ner)
+        m_brackets.extend((b["first"] + i, b["last"] + i, b["label"])
+                          for b in brackets)
+        m_cats.update(cats)
+        i += len(ids)
+    return [(m_deps, (m_cats, m_brackets))]
+
+
 def _normalize_for_alignment(tokens):
-    tokens = [w.replace(" ", "").lower() for w in tokens]
-    output = []
-    for token in tokens:
-        token = token.replace(" ", "").lower()
-        output.append(token)
-    return output
+    return [w.replace(" ", "").lower() for w in tokens]


 def align(tokens_a, tokens_b):
@@ -348,6 +362,7 @@ def make_orth_variants(nlp, example, orth_variant_level=0.0):
     if not example.token_annotation:
         return example
     raw = example.text
+    lower = False
     if random.random() >= 0.5:
         lower = True
         if raw is not None:
@@ -415,8 +430,11 @@ def make_orth_variants(nlp, example, orth_variant_level=0.0):
                 raw_idx += 1
             for word in variant_example.token_annotation.words:
                 match_found = False
+                # skip whitespace words
+                if word.isspace():
+                    
match_found = True # add identical word - if word not in variants and raw[raw_idx:].startswith(word): + elif word not in variants and raw[raw_idx:].startswith(word): variant_raw += word raw_idx += len(word) match_found = True @@ -1031,8 +1049,17 @@ cdef class GoldParse: self.cats = {} if cats is None else dict(cats) self.links = {} if links is None else dict(links) + # temporary doc for aligning entity annotation + entdoc = None + # avoid allocating memory if the doc does not contain any tokens if self.length == 0: + self.words = [] + self.tags = [] + self.heads = [] + self.labels = [] + self.ner = [] + self.morphs = [] # set a minimal orig so that the scorer can score an empty doc self.orig = TokenAnnotation(ids=[]) else: @@ -1062,7 +1089,25 @@ cdef class GoldParse: entities = [(ent if ent is not None else "-") for ent in entities] if not isinstance(entities[0], str): # Assume we have entities specified by character offset. - entities = biluo_tags_from_offsets(doc, entities) + # Create a temporary Doc corresponding to provided words + # (to preserve gold tokenization) and text (to preserve + # character offsets). + entdoc_words, entdoc_spaces = util.get_words_and_spaces(words, doc.text) + entdoc = Doc(doc.vocab, words=entdoc_words, spaces=entdoc_spaces) + entdoc_entities = biluo_tags_from_offsets(entdoc, entities) + # There may be some additional whitespace tokens in the + # temporary doc, so check that the annotations align with + # the provided words while building a list of BILUO labels. + entities = [] + words_offset = 0 + for i in range(len(entdoc_words)): + if words[i + words_offset] == entdoc_words[i]: + entities.append(entdoc_entities[i]) + else: + words_offset -= 1 + if len(entities) != len(words): + warnings.warn(Warnings.W029.format(text=doc.text)) + entities = ["-" for _ in words] # These are filled by the tagger/parser/entity recogniser self.c.tags = self.mem.alloc(len(doc), sizeof(int)) @@ -1092,7 +1137,8 @@ cdef class GoldParse: # If we under-segment, we'll have one predicted word that covers a # sequence of gold words. # If we "mis-segment", we'll have a sequence of predicted words covering - # a sequence of gold words. That's many-to-many -- we don't do that. + # a sequence of gold words. That's many-to-many -- we don't do that + # except for NER spans where the start and end can be aligned. cost, i2j, j2i, i2j_multi, j2i_multi = align([t.orth_ for t in doc], words) self.cand_to_gold = [(j if j >= 0 else None) for j in i2j] @@ -1123,7 +1169,6 @@ cdef class GoldParse: self.lemmas[i] = lemmas[i2j_multi[i]] self.sent_starts[i] = sent_starts[i2j_multi[i]] is_last = i2j_multi[i] != i2j_multi.get(i+1) - is_first = i2j_multi[i] != i2j_multi.get(i-1) # Set next word in multi-token span as head, until last if not is_last: self.heads[i] = i+1 @@ -1133,30 +1178,10 @@ cdef class GoldParse: if head_i: self.heads[i] = self.gold_to_cand[head_i] self.labels[i] = deps[i2j_multi[i]] - # Now set NER...This is annoying because if we've split - # got an entity word split into two, we need to adjust the - # BILUO tags. We can't have BB or LL etc. - # Case 1: O -- easy. ner_tag = entities[i2j_multi[i]] - if ner_tag == "O": - self.ner[i] = "O" - # Case 2: U. This has to become a B I* L sequence. - elif ner_tag.startswith("U-"): - if is_first: - self.ner[i] = ner_tag.replace("U-", "B-", 1) - elif is_last: - self.ner[i] = ner_tag.replace("U-", "L-", 1) - else: - self.ner[i] = ner_tag.replace("U-", "I-", 1) - # Case 3: L. If not last, change to I. 
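+                # (The U-/L-/I- re-tagging cases that used to live here are
+                # gone: entities over misaligned tokens are instead realigned
+                # at the character level further down in this method.)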
- elif ner_tag.startswith("L-"): - if is_last: - self.ner[i] = ner_tag - else: - self.ner[i] = ner_tag.replace("L-", "I-", 1) - # Case 4: I. Stays correct - elif ner_tag.startswith("I-"): - self.ner[i] = ner_tag + # Assign O/- for many-to-one O/- NER tags + if ner_tag in ("O", "-"): + self.ner[i] = ner_tag else: self.words[i] = words[gold_i] self.tags[i] = tags[gold_i] @@ -1170,6 +1195,39 @@ cdef class GoldParse: self.heads[i] = self.gold_to_cand[heads[gold_i]] self.labels[i] = deps[gold_i] self.ner[i] = entities[gold_i] + # Assign O/- for one-to-many O/- NER tags + for j, cand_j in enumerate(self.gold_to_cand): + if cand_j is None: + if j in j2i_multi: + i = j2i_multi[j] + ner_tag = entities[j] + if ner_tag in ("O", "-"): + self.ner[i] = ner_tag + + # If there is entity annotation and some tokens remain unaligned, + # align all entities at the character level to account for all + # possible token misalignments within the entity spans + if any([e not in ("O", "-") for e in entities]) and None in self.ner: + # If the temporary entdoc wasn't created above, initialize it + if not entdoc: + entdoc_words, entdoc_spaces = util.get_words_and_spaces(words, doc.text) + entdoc = Doc(doc.vocab, words=entdoc_words, spaces=entdoc_spaces) + # Get offsets based on gold words and BILUO entities + entdoc_offsets = offsets_from_biluo_tags(entdoc, entities) + aligned_offsets = [] + aligned_spans = [] + # Filter offsets to identify those that align with doc tokens + for offset in entdoc_offsets: + span = doc.char_span(offset[0], offset[1]) + if span and not span.text.isspace(): + aligned_offsets.append(offset) + aligned_spans.append(span) + # Convert back to BILUO for doc tokens and assign NER for all + # aligned spans + biluo_tags = biluo_tags_from_offsets(doc, aligned_offsets, missing=None) + for span in aligned_spans: + for i in range(span.start, span.end): + self.ner[i] = biluo_tags[i] # Prevent whitespace that isn't within entities from being tagged as # an entity. @@ -1303,6 +1361,12 @@ def biluo_tags_from_offsets(doc, entities, missing="O"): break else: biluo[token.i] = missing + if "-" in biluo: + ent_str = str(entities) + warnings.warn(Warnings.W030.format( + text=doc.text[:50] + "..." if len(doc.text) > 50 else doc.text, + entities=ent_str[:50] + "..." 
if len(ent_str) > 50 else ent_str
+        ))
     return biluo

diff --git a/spacy/lang/da/__init__.py b/spacy/lang/da/__init__.py
index 6d1e33986..e0f0061ec 100644
--- a/spacy/lang/da/__init__.py
+++ b/spacy/lang/da/__init__.py
@@ -1,24 +1,19 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .norm_exceptions import NORM_EXCEPTIONS
 from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .morph_rules import MORPH_RULES
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ..norm_exceptions import BASE_NORMS
 from ...language import Language
-from ...attrs import LANG, NORM
-from ...util import update_exc, add_lookups
+from ...attrs import LANG
+from ...util import update_exc


 class DanishDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda text: "da"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
-    )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     morph_rules = MORPH_RULES
     infixes = TOKENIZER_INFIXES
diff --git a/spacy/lang/da/examples.py b/spacy/lang/da/examples.py
index 80b2b925b..efa1a7c0e 100644
--- a/spacy/lang/da/examples.py
+++ b/spacy/lang/da/examples.py
@@ -5,10 +5,13 @@ Example sentences to test spaCy and its language models.
 >>> docs = nlp.pipe(sentences)
 """

-
 sentences = [
-    "Apple overvejer at købe et britisk startup for 1 milliard dollar",
-    "Selvkørende biler flytter forsikringsansvaret over på producenterne",
-    "San Francisco overvejer at forbyde udbringningsrobotter på fortov",
-    "London er en stor by i Storbritannien",
+    "Apple overvejer at købe et britisk startup for 1 milliard dollar.",
+    "Selvkørende biler flytter forsikringsansvaret over på producenterne.",
+    "San Francisco overvejer at forbyde udbringningsrobotter på fortovet.",
+    "London er en storby i Storbritannien.",
+    "Hvor er du?",
+    "Hvem er Frankrigs præsident?",
+    "Hvad er hovedstaden i USA?",
+    "Hvornår blev Barack Obama født?",
 ]
diff --git a/spacy/lang/da/norm_exceptions.py b/spacy/lang/da/norm_exceptions.py
deleted file mode 100644
index c689500f4..000000000
--- a/spacy/lang/da/norm_exceptions.py
+++ /dev/null
@@ -1,524 +0,0 @@
-"""
-Special-case rules for normalizing tokens to improve the model's predictions.
-For example 'mysterium' vs 'mysterie' and similar.
-""" - -# Sources: -# 1: https://dsn.dk/retskrivning/om-retskrivningsordbogen/mere-om-retskrivningsordbogen-2012/endrede-stave-og-ordformer/ -# 2: http://www.tjerry-korrektur.dk/ord-med-flere-stavemaader/ - -_exc = { - # Alternative spelling - "a-kraft-værk": "a-kraftværk", # 1 - "ålborg": "aalborg", # 2 - "århus": "aarhus", - "accessoirer": "accessoires", # 1 - "affektert": "affekteret", # 1 - "afrikander": "afrikaaner", # 1 - "aftabuere": "aftabuisere", # 1 - "aftabuering": "aftabuisering", # 1 - "akvarium": "akvarie", # 1 - "alenefader": "alenefar", # 1 - "alenemoder": "alenemor", # 1 - "alkoholambulatorium": "alkoholambulatorie", # 1 - "ambulatorium": "ambulatorie", # 1 - "ananassene": "ananasserne", # 2 - "anførelsestegn": "anførselstegn", # 1 - "anseelig": "anselig", # 2 - "antioxydant": "antioxidant", # 1 - "artrig": "artsrig", # 1 - "auditorium": "auditorie", # 1 - "avocado": "avokado", # 2 - "bagerst": "bagest", # 2 - "bagstræv": "bagstræb", # 1 - "bagstræver": "bagstræber", # 1 - "bagstræverisk": "bagstræberisk", # 1 - "balde": "balle", # 2 - "barselorlov": "barselsorlov", # 1 - "barselvikar": "barselsvikar", # 1 - "baskien": "baskerlandet", # 1 - "bayrisk": "bayersk", # 1 - "bedstefader": "bedstefar", # 1 - "bedstemoder": "bedstemor", # 1 - "behefte": "behæfte", # 1 - "beheftelse": "behæftelse", # 1 - "bidragydende": "bidragsydende", # 1 - "bidragyder": "bidragsyder", # 1 - "billiondel": "billiontedel", # 1 - "blaseret": "blasert", # 1 - "bleskifte": "bleskift", # 1 - "blodbroder": "blodsbroder", # 2 - "blyantspidser": "blyantsspidser", # 2 - "boligministerium": "boligministerie", # 1 - "borhul": "borehul", # 1 - "broder": "bror", # 2 - "buldog": "bulldog", # 2 - "bådhus": "bådehus", # 1 - "børnepleje": "barnepleje", # 1 - "børneseng": "barneseng", # 1 - "børnestol": "barnestol", # 1 - "cairo": "kairo", # 1 - "cambodia": "cambodja", # 1 - "cambodianer": "cambodjaner", # 1 - "cambodiansk": "cambodjansk", # 1 - "camouflage": "kamuflage", # 2 - "campylobacter": "kampylobakter", # 1 - "centeret": "centret", # 2 - "chefskahyt": "chefkahyt", # 1 - "chefspost": "chefpost", # 1 - "chefssekretær": "chefsekretær", # 1 - "chefsstol": "chefstol", # 1 - "cirkulærskrivelse": "cirkulæreskrivelse", # 1 - "cognacsglas": "cognacglas", # 1 - "columnist": "kolumnist", # 1 - "cricket": "kricket", # 2 - "dagplejemoder": "dagplejemor", # 1 - "damaskesdug": "damaskdug", # 1 - "damp-barn": "dampbarn", # 1 - "delfinarium": "delfinarie", # 1 - "dentallaboratorium": "dentallaboratorie", # 1 - "diaramme": "diasramme", # 1 - "diaré": "diarré", # 1 - "dioxyd": "dioxid", # 1 - "dommedagsprædiken": "dommedagspræken", # 1 - "donut": "doughnut", # 2 - "driftmæssig": "driftsmæssig", # 1 - "driftsikker": "driftssikker", # 1 - "driftsikring": "driftssikring", # 1 - "drikkejogurt": "drikkeyoghurt", # 1 - "drivein": "drive-in", # 1 - "driveinbiograf": "drive-in-biograf", # 1 - "drøvel": "drøbel", # 1 - "dødskriterium": "dødskriterie", # 1 - "e-mail-adresse": "e-mailadresse", # 1 - "e-post-adresse": "e-postadresse", # 1 - "egypten": "ægypten", # 2 - "ekskommunicere": "ekskommunikere", # 1 - "eksperimentarium": "eksperimentarie", # 1 - "elsass": "Alsace", # 1 - "elsasser": "alsacer", # 1 - "elsassisk": "alsacisk", # 1 - "elvetal": "ellevetal", # 1 - "elvetiden": "ellevetiden", # 1 - "elveårig": "elleveårig", # 1 - "elveårs": "elleveårs", # 1 - "elveårsbarn": "elleveårsbarn", # 1 - "elvte": "ellevte", # 1 - "elvtedel": "ellevtedel", # 1 - "energiministerium": "energiministerie", # 1 - "erhvervsministerium": 
"erhvervsministerie", # 1 - "espaliere": "spaliere", # 2 - "evangelium": "evangelie", # 1 - "fagministerium": "fagministerie", # 1 - "fakse": "faxe", # 1 - "fangstkvota": "fangstkvote", # 1 - "fader": "far", # 2 - "farbroder": "farbror", # 1 - "farfader": "farfar", # 1 - "farmoder": "farmor", # 1 - "federal": "føderal", # 1 - "federalisering": "føderalisering", # 1 - "federalisme": "føderalisme", # 1 - "federalist": "føderalist", # 1 - "federalistisk": "føderalistisk", # 1 - "federation": "føderation", # 1 - "federativ": "føderativ", # 1 - "fejlbeheftet": "fejlbehæftet", # 1 - "femetagers": "femetages", # 2 - "femhundredekroneseddel": "femhundredkroneseddel", # 2 - "filmpremiere": "filmpræmiere", # 2 - "finansimperium": "finansimperie", # 1 - "finansministerium": "finansministerie", # 1 - "firehjulstræk": "firhjulstræk", # 2 - "fjernstudium": "fjernstudie", # 1 - "formalier": "formalia", # 1 - "formandsskift": "formandsskifte", # 1 - "fornemst": "fornemmest", # 2 - "fornuftparti": "fornuftsparti", # 1 - "fornuftstridig": "fornuftsstridig", # 1 - "fornuftvæsen": "fornuftsvæsen", # 1 - "fornuftægteskab": "fornuftsægteskab", # 1 - "forretningsministerium": "forretningsministerie", # 1 - "forskningsministerium": "forskningsministerie", # 1 - "forstudium": "forstudie", # 1 - "forsvarsministerium": "forsvarsministerie", # 1 - "frilægge": "fritlægge", # 1 - "frilæggelse": "fritlæggelse", # 1 - "frilægning": "fritlægning", # 1 - "fristille": "fritstille", # 1 - "fristilling": "fritstilling", # 1 - "fuldttegnet": "fuldtegnet", # 1 - "fødestedskriterium": "fødestedskriterie", # 1 - "fødevareministerium": "fødevareministerie", # 1 - "følesløs": "følelsesløs", # 1 - "følgeligt": "følgelig", # 1 - "førne": "førn", # 1 - "gearskift": "gearskifte", # 2 - "gladeligt": "gladelig", # 1 - "glosehefte": "glosehæfte", # 1 - "glædeløs": "glædesløs", # 1 - "gonoré": "gonorré", # 1 - "grangiveligt": "grangivelig", # 1 - "grundliggende": "grundlæggende", # 2 - "grønsag": "grøntsag", # 2 - "gudbenådet": "gudsbenådet", # 1 - "gudfader": "gudfar", # 1 - "gudmoder": "gudmor", # 1 - "gulvmop": "gulvmoppe", # 1 - "gymnasium": "gymnasie", # 1 - "hackning": "hacking", # 1 - "halvbroder": "halvbror", # 1 - "halvelvetiden": "halvellevetiden", # 1 - "handelsgymnasium": "handelsgymnasie", # 1 - "hefte": "hæfte", # 1 - "hefteklamme": "hæfteklamme", # 1 - "heftelse": "hæftelse", # 1 - "heftemaskine": "hæftemaskine", # 1 - "heftepistol": "hæftepistol", # 1 - "hefteplaster": "hæfteplaster", # 1 - "heftestraf": "hæftestraf", # 1 - "heftning": "hæftning", # 1 - "helbroder": "helbror", # 1 - "hjemmeklasse": "hjemklasse", # 1 - "hjulspin": "hjulspind", # 1 - "huggevåben": "hugvåben", # 1 - "hulmurisolering": "hulmursisolering", # 1 - "hurtiggående": "hurtigtgående", # 2 - "hurtigttørrende": "hurtigtørrende", # 2 - "husmoder": "husmor", # 1 - "hydroxyd": "hydroxid", # 1 - "håndmikser": "håndmixer", # 1 - "højtaler": "højttaler", # 2 - "hønemoder": "hønemor", # 1 - "ide": "idé", # 2 - "imperium": "imperie", # 1 - "imponerthed": "imponerethed", # 1 - "inbox": "indboks", # 2 - "indenrigsministerium": "indenrigsministerie", # 1 - "indhefte": "indhæfte", # 1 - "indheftning": "indhæftning", # 1 - "indicium": "indicie", # 1 - "indkassere": "inkassere", # 2 - "iota": "jota", # 1 - "jobskift": "jobskifte", # 1 - "jogurt": "yoghurt", # 1 - "jukeboks": "jukebox", # 1 - "justitsministerium": "justitsministerie", # 1 - "kalorifere": "kalorifer", # 1 - "kandidatstipendium": "kandidatstipendie", # 1 - "kannevas": "kanvas", # 1 - "kaperssauce": 
"kaperssovs", # 1 - "kigge": "kikke", # 2 - "kirkeministerium": "kirkeministerie", # 1 - "klapmydse": "klapmyds", # 1 - "klimakterium": "klimakterie", # 1 - "klogeligt": "klogelig", # 1 - "knivblad": "knivsblad", # 1 - "kollegaer": "kolleger", # 2 - "kollegium": "kollegie", # 1 - "kollegiehefte": "kollegiehæfte", # 1 - "kollokviumx": "kollokvium", # 1 - "kommissorium": "kommissorie", # 1 - "kompendium": "kompendie", # 1 - "komplicerthed": "komplicerethed", # 1 - "konfederation": "konføderation", # 1 - "konfedereret": "konfødereret", # 1 - "konferensstudium": "konferensstudie", # 1 - "konservatorium": "konservatorie", # 1 - "konsulere": "konsultere", # 1 - "kradsbørstig": "krasbørstig", # 2 - "kravsspecifikation": "kravspecifikation", # 1 - "krematorium": "krematorie", # 1 - "krep": "crepe", # 1 - "krepnylon": "crepenylon", # 1 - "kreppapir": "crepepapir", # 1 - "kricket": "cricket", # 2 - "kriterium": "kriterie", # 1 - "kroat": "kroater", # 2 - "kroki": "croquis", # 1 - "kronprinsepar": "kronprinspar", # 2 - "kropdoven": "kropsdoven", # 1 - "kroplus": "kropslus", # 1 - "krøllefedt": "krølfedt", # 1 - "kulturministerium": "kulturministerie", # 1 - "kuponhefte": "kuponhæfte", # 1 - "kvota": "kvote", # 1 - "kvotaordning": "kvoteordning", # 1 - "laboratorium": "laboratorie", # 1 - "laksfarve": "laksefarve", # 1 - "laksfarvet": "laksefarvet", # 1 - "laksrød": "lakserød", # 1 - "laksyngel": "lakseyngel", # 1 - "laksørred": "lakseørred", # 1 - "landbrugsministerium": "landbrugsministerie", # 1 - "landskampstemning": "landskampsstemning", # 1 - "langust": "languster", # 1 - "lappegrejer": "lappegrej", # 1 - "lavløn": "lavtløn", # 1 - "lillebroder": "lillebror", # 1 - "linear": "lineær", # 1 - "loftlampe": "loftslampe", # 2 - "log-in": "login", # 1 - "login": "log-in", # 2 - "lovmedholdig": "lovmedholdelig", # 1 - "ludder": "luder", # 2 - "lysholder": "lyseholder", # 1 - "lægeskifte": "lægeskift", # 1 - "lærvillig": "lærevillig", # 1 - "løgsauce": "løgsovs", # 1 - "madmoder": "madmor", # 1 - "majonæse": "mayonnaise", # 1 - "mareridtagtig": "mareridtsagtig", # 1 - "margen": "margin", # 2 - "martyrium": "martyrie", # 1 - "mellemstatlig": "mellemstatslig", # 1 - "menneskene": "menneskerne", # 2 - "metropolis": "metropol", # 1 - "miks": "mix", # 1 - "mikse": "mixe", # 1 - "miksepult": "mixerpult", # 1 - "mikser": "mixer", # 1 - "mikserpult": "mixerpult", # 1 - "mikslån": "mixlån", # 1 - "miksning": "mixning", # 1 - "miljøministerium": "miljøministerie", # 1 - "milliarddel": "milliardtedel", # 1 - "milliondel": "milliontedel", # 1 - "ministerium": "ministerie", # 1 - "mop": "moppe", # 1 - "moder": "mor", # 2 - "moratorium": "moratorie", # 1 - "morbroder": "morbror", # 1 - "morfader": "morfar", # 1 - "mormoder": "mormor", # 1 - "musikkonservatorium": "musikkonservatorie", # 1 - "muslingskal": "muslingeskal", # 1 - "mysterium": "mysterie", # 1 - "naturalieydelse": "naturalydelse", # 1 - "naturalieøkonomi": "naturaløkonomi", # 1 - "navnebroder": "navnebror", # 1 - "nerium": "nerie", # 1 - "nådeløs": "nådesløs", # 1 - "nærforestående": "nærtforestående", # 1 - "nærstående": "nærtstående", # 1 - "observatorium": "observatorie", # 1 - "oldefader": "oldefar", # 1 - "oldemoder": "oldemor", # 1 - "opgraduere": "opgradere", # 1 - "opgraduering": "opgradering", # 1 - "oratorium": "oratorie", # 1 - "overbookning": "overbooking", # 1 - "overpræsidium": "overpræsidie", # 1 - "overstatlig": "overstatslig", # 1 - "oxyd": "oxid", # 1 - "oxydere": "oxidere", # 1 - "oxydering": "oxidering", # 1 - "pakkenellike": 
"pakkenelliker", # 1 - "papirtynd": "papirstynd", # 1 - "pastoralseminarium": "pastoralseminarie", # 1 - "peanutsene": "peanuttene", # 2 - "penalhus": "pennalhus", # 2 - "pensakrav": "pensumkrav", # 1 - "pepperoni": "peperoni", # 1 - "peruaner": "peruvianer", # 1 - "petrole": "petrol", # 1 - "piltast": "piletast", # 1 - "piltaste": "piletast", # 1 - "planetarium": "planetarie", # 1 - "plasteret": "plastret", # 2 - "plastic": "plastik", # 2 - "play-off-kamp": "playoffkamp", # 1 - "plejefader": "plejefar", # 1 - "plejemoder": "plejemor", # 1 - "podium": "podie", # 2 - "praha": "prag", # 2 - "preciøs": "pretiøs", # 2 - "privilegium": "privilegie", # 1 - "progredere": "progrediere", # 1 - "præsidium": "præsidie", # 1 - "psykodelisk": "psykedelisk", # 1 - "pudsegrejer": "pudsegrej", # 1 - "referensgruppe": "referencegruppe", # 1 - "referensramme": "referenceramme", # 1 - "refugium": "refugie", # 1 - "registeret": "registret", # 2 - "remedium": "remedie", # 1 - "remiks": "remix", # 1 - "reservert": "reserveret", # 1 - "ressortministerium": "ressortministerie", # 1 - "ressource": "resurse", # 2 - "resætte": "resette", # 1 - "rettelig": "retteligt", # 1 - "rettetaste": "rettetast", # 1 - "returtaste": "returtast", # 1 - "risici": "risikoer", # 2 - "roll-on": "rollon", # 1 - "rollehefte": "rollehæfte", # 1 - "rostbøf": "roastbeef", # 1 - "rygsæksturist": "rygsækturist", # 1 - "rødstjært": "rødstjert", # 1 - "saddel": "sadel", # 2 - "samaritan": "samaritaner", # 2 - "sanatorium": "sanatorie", # 1 - "sauce": "sovs", # 1 - "scanning": "skanning", # 2 - "sceneskifte": "sceneskift", # 1 - "scilla": "skilla", # 1 - "sejflydende": "sejtflydende", # 1 - "selvstudium": "selvstudie", # 1 - "seminarium": "seminarie", # 1 - "sennepssauce": "sennepssovs ", # 1 - "servitutbeheftet": "servitutbehæftet", # 1 - "sit-in": "sitin", # 1 - "skatteministerium": "skatteministerie", # 1 - "skifer": "skiffer", # 2 - "skyldsfølelse": "skyldfølelse", # 1 - "skysauce": "skysovs", # 1 - "sladdertaske": "sladretaske", # 2 - "sladdervorn": "sladrevorn", # 2 - "slagsbroder": "slagsbror", # 1 - "slettetaste": "slettetast", # 1 - "smørsauce": "smørsovs", # 1 - "snitsel": "schnitzel", # 1 - "snobbeeffekt": "snobeffekt", # 2 - "socialministerium": "socialministerie", # 1 - "solarium": "solarie", # 1 - "soldebroder": "soldebror", # 1 - "spagetti": "spaghetti", # 1 - "spagettistrop": "spaghettistrop", # 1 - "spagettiwestern": "spaghettiwestern", # 1 - "spin-off": "spinoff", # 1 - "spinnefiskeri": "spindefiskeri", # 1 - "spolorm": "spoleorm", # 1 - "sproglaboratorium": "sproglaboratorie", # 1 - "spækbræt": "spækkebræt", # 2 - "stand-in": "standin", # 1 - "stand-up-comedy": "standupcomedy", # 1 - "stand-up-komiker": "standupkomiker", # 1 - "statsministerium": "statsministerie", # 1 - "stedbroder": "stedbror", # 1 - "stedfader": "stedfar", # 1 - "stedmoder": "stedmor", # 1 - "stilehefte": "stilehæfte", # 1 - "stipendium": "stipendie", # 1 - "stjært": "stjert", # 1 - "stjærthage": "stjerthage", # 1 - "storebroder": "storebror", # 1 - "stortå": "storetå", # 1 - "strabads": "strabadser", # 1 - "strømlinjet": "strømlinet", # 1 - "studium": "studie", # 1 - "stænkelap": "stænklap", # 1 - "sundhedsministerium": "sundhedsministerie", # 1 - "suppositorium": "suppositorie", # 1 - "svejts": "schweiz", # 1 - "svejtser": "schweizer", # 1 - "svejtserfranc": "schweizerfranc", # 1 - "svejtserost": "schweizerost", # 1 - "svejtsisk": "schweizisk", # 1 - "svigerfader": "svigerfar", # 1 - "svigermoder": "svigermor", # 1 - "svirebroder": "svirebror", # 1 - 
"symposium": "symposie", # 1 - "sælarium": "sælarie", # 1 - "søreme": "sørme", # 2 - "søterritorium": "søterritorie", # 1 - "t-bone-steak": "t-bonesteak", # 1 - "tabgivende": "tabsgivende", # 1 - "tabuere": "tabuisere", # 1 - "tabuering": "tabuisering", # 1 - "tackle": "takle", # 2 - "tackling": "takling", # 2 - "taifun": "tyfon", # 1 - "take-off": "takeoff", # 1 - "taknemlig": "taknemmelig", # 2 - "talehørelærer": "tale-høre-lærer", # 1 - "talehøreundervisning": "tale-høre-undervisning", # 1 - "tandstik": "tandstikker", # 1 - "tao": "dao", # 1 - "taoisme": "daoisme", # 1 - "taoist": "daoist", # 1 - "taoistisk": "daoistisk", # 1 - "taverne": "taverna", # 1 - "teateret": "teatret", # 2 - "tekno": "techno", # 1 - "temposkifte": "temposkift", # 1 - "terrarium": "terrarie", # 1 - "territorium": "territorie", # 1 - "tesis": "tese", # 1 - "tidsstudium": "tidsstudie", # 1 - "tipoldefader": "tipoldefar", # 1 - "tipoldemoder": "tipoldemor", # 1 - "tomatsauce": "tomatsovs", # 1 - "tonart": "toneart", # 1 - "trafikministerium": "trafikministerie", # 1 - "tredve": "tredive", # 1 - "tredver": "trediver", # 1 - "tredveårig": "trediveårig", # 1 - "tredveårs": "trediveårs", # 1 - "tredveårsfødselsdag": "trediveårsfødselsdag", # 1 - "tredvte": "tredivte", # 1 - "tredvtedel": "tredivtedel", # 1 - "troldunge": "troldeunge", # 1 - "trommestikke": "trommestik", # 1 - "trubadur": "troubadour", # 2 - "trøstepræmie": "trøstpræmie", # 2 - "tummerum": "trummerum", # 1 - "tumultuarisk": "tumultarisk", # 1 - "tunghørighed": "tunghørhed", # 1 - "tus": "tusch", # 2 - "tusind": "tusinde", # 2 - "tvillingbroder": "tvillingebror", # 1 - "tvillingbror": "tvillingebror", # 1 - "tvillingebroder": "tvillingebror", # 1 - "ubeheftet": "ubehæftet", # 1 - "udenrigsministerium": "udenrigsministerie", # 1 - "udhulning": "udhuling", # 1 - "udslaggivende": "udslagsgivende", # 1 - "udspekulert": "udspekuleret", # 1 - "udviklingsministerium": "udviklingsministerie", # 1 - "uforpligtigende": "uforpligtende", # 1 - "uheldvarslende": "uheldsvarslende", # 1 - "uimponerthed": "uimponerethed", # 1 - "undervisningsministerium": "undervisningsministerie", # 1 - "unægtelig": "unægteligt", # 1 - "urinale": "urinal", # 1 - "uvederheftig": "uvederhæftig", # 1 - "vabel": "vable", # 2 - "vadi": "wadi", # 1 - "vaklevorn": "vakkelvorn", # 1 - "vanadin": "vanadium", # 1 - "vaselin": "vaseline", # 1 - "vederheftig": "vederhæftig", # 1 - "vedhefte": "vedhæfte", # 1 - "velar": "velær", # 1 - "videndeling": "vidensdeling", # 2 - "vinkelanførelsestegn": "vinkelanførselstegn", # 1 - "vipstjært": "vipstjert", # 1 - "vismut": "bismut", # 1 - "visvas": "vissevasse", # 1 - "voksværk": "vokseværk", # 1 - "værtdyr": "værtsdyr", # 1 - "værtplante": "værtsplante", # 1 - "wienersnitsel": "wienerschnitzel", # 1 - "yderliggående": "yderligtgående", # 2 - "zombi": "zombie", # 1 - "ægbakke": "æggebakke", # 1 - "ægformet": "æggeformet", # 1 - "ægleder": "æggeleder", # 1 - "ækvilibrist": "ekvilibrist", # 2 - "æselsøre": "æseløre", # 1 - "øjehule": "øjenhule", # 1 - "øjelåg": "øjenlåg", # 1 - "øjeåbner": "øjenåbner", # 1 - "økonomiministerium": "økonomiministerie", # 1 - "ørenring": "ørering", # 2 - "øvehefte": "øvehæfte", # 1 -} - - -NORM_EXCEPTIONS = {} - -for string, norm in _exc.items(): - NORM_EXCEPTIONS[string] = norm - NORM_EXCEPTIONS[string.title()] = norm diff --git a/spacy/lang/da/tokenizer_exceptions.py b/spacy/lang/da/tokenizer_exceptions.py index dc1b5275b..36d03bde3 100644 --- a/spacy/lang/da/tokenizer_exceptions.py +++ b/spacy/lang/da/tokenizer_exceptions.py 
@@ -2,7 +2,7 @@ Tokenizer Exceptions. Source: https://forkortelse.dk/ and various others. """ -from ...symbols import ORTH, LEMMA, NORM, TAG, PUNCT +from ...symbols import ORTH, LEMMA, NORM _exc = {} @@ -48,7 +48,7 @@ for exc_data in [ {ORTH: "Ons.", LEMMA: "onsdag"}, {ORTH: "Fre.", LEMMA: "fredag"}, {ORTH: "Lør.", LEMMA: "lørdag"}, - {ORTH: "og/eller", LEMMA: "og/eller", NORM: "og/eller", TAG: "CC"}, + {ORTH: "og/eller", LEMMA: "og/eller", NORM: "og/eller"}, ]: _exc[exc_data[ORTH]] = [exc_data] @@ -573,7 +573,7 @@ for h in range(1, 31 + 1): for period in ["."]: _exc[f"{h}{period}"] = [{ORTH: f"{h}."}] -_custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: ".", TAG: PUNCT}]} +_custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: "."}]} _exc.update(_custom_base_exc) TOKENIZER_EXCEPTIONS = _exc diff --git a/spacy/lang/de/__init__.py b/spacy/lang/de/__init__.py index b72d640b2..25785d125 100644 --- a/spacy/lang/de/__init__.py +++ b/spacy/lang/de/__init__.py @@ -1,5 +1,4 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .norm_exceptions import NORM_EXCEPTIONS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES from .punctuation import TOKENIZER_INFIXES from .tag_map import TAG_MAP @@ -7,18 +6,14 @@ from .stop_words import STOP_WORDS from .syntax_iterators import SYNTAX_ITERATORS from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups +from ...attrs import LANG +from ...util import update_exc class GermanDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters[LANG] = lambda text: "de" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS - ) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) prefixes = TOKENIZER_PREFIXES suffixes = TOKENIZER_SUFFIXES diff --git a/spacy/lang/de/norm_exceptions.py b/spacy/lang/de/norm_exceptions.py deleted file mode 100644 index 6ad5b62a7..000000000 --- a/spacy/lang/de/norm_exceptions.py +++ /dev/null @@ -1,13 +0,0 @@ -# Here we only want to include the absolute most common words. Otherwise, -# this list would get impossibly long for German – especially considering the -# old vs. new spelling rules, and all possible cases. - - -_exc = {"daß": "dass"} - - -NORM_EXCEPTIONS = {} - -for string, norm in _exc.items(): - NORM_EXCEPTIONS[string] = norm - NORM_EXCEPTIONS[string.title()] = norm diff --git a/spacy/lang/de/syntax_iterators.py b/spacy/lang/de/syntax_iterators.py index 410d2f0b4..e322e1add 100644 --- a/spacy/lang/de/syntax_iterators.py +++ b/spacy/lang/de/syntax_iterators.py @@ -1,7 +1,8 @@ from ...symbols import NOUN, PROPN, PRON +from ...errors import Errors -def noun_chunks(obj): +def noun_chunks(doclike): """ Detect base noun phrases from a dependency parse. Works on both Doc and Span. """ @@ -24,13 +25,17 @@ def noun_chunks(obj): "og", "app", ] - doc = obj.doc # Ensure works on both Doc and Span. + doc = doclike.doc # Ensure works on both Doc and Span. 
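# --- An aside (not part of this diff): the doclike convention works because
# Doc.doc returns the Doc itself while Span.doc returns the parent Doc, so
# the is_parsed guard added just below covers both input types. A minimal
# sketch, assuming a bare pipeline with no parser:
from spacy.lang.de import German

nlp = German()  # tokenizer only, so the dependency parse is never set
doc = nlp("Der Hund schläft")
span = doc[0:2]
assert doc.doc is doc and span.doc is doc
assert not doc.is_parsed  # noun_chunks on doc or span would raise E029
# ---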
+ + if not doc.is_parsed: + raise ValueError(Errors.E029) + np_label = doc.vocab.strings.add("NP") np_deps = set(doc.vocab.strings.add(label) for label in labels) close_app = doc.vocab.strings.add("nk") rbracket = 0 - for i, word in enumerate(obj): + for i, word in enumerate(doclike): if i < rbracket: continue if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps: diff --git a/spacy/lang/el/__init__.py b/spacy/lang/el/__init__.py index 95920a68f..5269199b3 100644 --- a/spacy/lang/el/__init__.py +++ b/spacy/lang/el/__init__.py @@ -6,21 +6,16 @@ from .lemmatizer import GreekLemmatizer from .syntax_iterators import SYNTAX_ITERATORS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES from ..tokenizer_exceptions import BASE_EXCEPTIONS -from .norm_exceptions import NORM_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language from ...lookups import Lookups -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups +from ...attrs import LANG +from ...util import update_exc class GreekDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters.update(LEX_ATTRS) lex_attr_getters[LANG] = lambda text: "el" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS - ) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) stop_words = STOP_WORDS tag_map = TAG_MAP diff --git a/spacy/lang/el/norm_exceptions.py b/spacy/lang/el/norm_exceptions.py deleted file mode 100644 index aa774c19b..000000000 --- a/spacy/lang/el/norm_exceptions.py +++ /dev/null @@ -1,2638 +0,0 @@ -# These exceptions are used to add NORM values based on a token's ORTH value. -# Norms are only set if no alternative is provided in the tokenizer exceptions. 
- -_exc = { - "αγιορίτης": "αγιορείτης", - "αγόρι": "αγώρι", - "έωλος": "αίολος", - "αλλοίθωρος": "αλλήθωρος", - "αλλοιώς": "αλλιώς", - "αλλοιώτικος": "αλλκότικος", - "αναµιγνύω": "αναµειγνύω", - "ανάµιξη": "ανάµειξη", - "ανανδρεία": "ανανδρία", - "αναφιλυτό": "αναφιλητό", - "ανελλειπώς": "ανελλιπώς", - "ανεξιθρησκεία": "ανεξιθρησκία", - "αντικρυνός": "αντικρινός", - "απάγκιο": "απάγκεω", - "αρµατωλός": "αρµατολός", - "αρρώστεια": "αρρώστια", - "ατόφιος": "ατόφυος", - "αφίνω": "αφήνω", - "χιβάδα": "χηβάδα", - "αχρηστεία": "αχρηστία", - "βαρυγκωµώ": "βαρυγγωµώ", - "βεβαρυµένος": "βεβαρηµένος", - "βερύκοκκο": "βερίκοκο", - "βλήτο": "βλίτο", - "βογκώ": "βογγώ", - "βραδυά": "βραδιά", - "βραδυάζει": "βραδίάζει", - "Βρεταννία": "Βρετανία", - "Βρεττανία": "Βρετανία", - "βολοδέρνω": "βωλοδέρνω", - "γέλοιο": "γέλιο", - "γκάµα": "γκάµµα", - "γλύφω": "γλείφω", - "γλήνα": "γλίνα", - "διαφήµηση": "διαφήµιση", - "δικλείδα": "δικλίδα", - "διοξείδιο": "διοξίδιο", - "διορία": "διωρία", - "δυόροφος": "διώροφος", - "δυόµισυ": "δυόµισι", - "διόσµος": "δυόσμος", - "δυσφήμιση": "δυσφήµηση", - "δοσίλογος": "δωσίλογος", - "εγχείριση": "εγχείρηση", - "ειδωλολατρεία": "ειδωλολατρία", - "εληά": "ελιά", - "ελιξίριο": "ελιξήριο", - "έλκυθρο": "έλκηθρο", - "ελλειπής": "ελλίπής", - "ενάµισυς": "ενάµισης", - "ενάµισυ": "ενάµισι", - "ενανθρώπιση": "ενανθρώπηση", - "έξη": "έξι", - "επί τούτο": "επί τούτω", - "εταιρία": "εταιρεία", - "εφορεία": "εφορία", - "ζηλειάρης": "ζηλιάρης", - "Θεοφάνεια": "Θεοφάνια", - "καυγάς": "καβγάς", - "καθίκι": "καθοίκι", - "καινούριος": "καινούργιος", - "κακάβι": "κακκάβι", - "κακαβιά": "κακκαβιά", - "καµµία": "καµία", - "κανέλα": "Καννέλα", - "κανονιοφόρος": "κανονιοφόρος", - "καντίλι": "καντήλι", - "κατεβοδώνω": "κατευοδώνω", - "κοίτοµαι": "κείτοµαι", - "κελαϊδώ": "κελαηδώ", - "κυάλια": "κιάλια", - "κλύδωνας": "κλήδονας", - "κλωτσώ": "κλοτσώ", - "κολλιτσίδα": "κολλητσίδα", - "κουκί": "κουκκί", - "κουλός": "κουλλός", - "κρεββάτι": "κρεβάτι", - "κροκόδειλος": "κροκόδιλος", - "κοβιός": "κωβιός", - "λάκισα": "λάκησα", - "λιµέρι": "ληµέρι", - "λώξυγγας": "λόξυγγας", - "µαγγούρα": "µαγκούρα", - "µαζή": "μαζί", - "µακρυά": "µακριά", - "µαµή": "µαµµή", - "µαµόθρεφτος": "µαµµόθρεφτος", - "µίγµα": "µείγµα", - "µίξη": "µείξη", - "µετώπη": "µετόπη", - "µυρολόι": "µοιρολόι", - "µοτοσικλέτα": "µοτοσυκλέτα", - "µπαλωµατής": "µπαλλωµατής", - "µιζίθρα": "µυζήθρα", - "νεοτερίζω": "νεωτερίζω", - "νεοτερισµός": "νεωτερισμός", - "νεοτεριστής": "νεωτεριστής", - "νινί": "νηνί", - "νοιώθω": "νιώθω", - "νονός": "νοννός", - "ξενιτιά": "ξενιτειά", - "ξαίρω": "ξέρω", - "ξίγκι": "ξίγγι", - "ξείδι": "ξίδι", - "ξώβεργα": "ξόβεργα", - "ξιπάζω": "ξυπάζω", - "ξιπασµένος": "ξυπασµένος", - "ξυπόλητος": "ξυπόλυτος", - "ξωκλήσι": "ξωκκλήσι", - "οξυά": "οξιά", - "ορθοπεδικός": "ορθοπαιδικός", - "ωχ": "οχ", - "παπάς": "παππάς", - "παραγιός": "παραγυιός", - "περηφάνεια": "περηφάνια", - "πιλάλα": "πηλάλα", - "πίννα": "πίνα", - "πηρούνι": "πιρούνι", - "πιτσιλώ": "πιτσυλώ", - "πιτσιλίζω": "πιτσυλίζω", - "πλατυάζω": "πλατειάζω", - "πληµµυρίδα": "πληµυρίδα", - "πληγούρι": "πλιγούρι", - "πωπώ": "ποπό", - "πουγγί": "πουγκί", - "πρίγκηπας": "πρίγκιπας", - "προάστειο": "προάστιο", - "προεδρεία": "προεδρία", - "πρίµα": "πράµα", - "πρωτήτερα": "πρωτύτερα", - "προτύτερα": "πρωτύτερα", - "πόρωση": "πώρωση", - "ρεβύθι": "ρεβίθι", - "ρέγγα": "ρέΥκα", - "ρηγώνω": "ριγώνω", - "ρωµανικός": "ροµανικός", - "ρίζι": "ρύζι", - "Ρώσσος": "Ρώσος", - "σακκούλα": "σακούλα", - "συνάφι": "σινάφι", - "σειρίτι": "σιρίτι", - "σιφόνι": "σιφώνι", - "συχαίνοµαι": 
"σιχαίνοµαι", - "σκιρόδεµα": "σκυρόδεµα", - "σπάγγος": "σπάγκος", - "στυλιάρι": "στειλιάρι", - "στοιβάδα": "στιβάδα", - "στίβα": "στοίβα", - "στριµώνω": "στρυµώνω", - "στριμώχνω": "στρυμώχνω", - "συγχύζω": "συγχίζω", - "σηκώτι": "συκώτι", - "σιναγρίδα": "συναγρίδα", - "συνοδεία": "συνοδία", - "σίφιλη": "σύφιλη", - "τανιέµαι": "τανυέµαι", - "τανίζω": "τανύζω", - "τέσσερις": "τέσσερεις", - "τζιτζιφιά": "τζιτζυφιά", - "τόνος": "τόννος", - "τοπείο": "τοπίο", - "τρέλλα": "τρέλα", - "τσαγγάρης": "τσαγκάρης", - "τσανάκα": "τσαννάκα", - "τσανακογλείφτης": "τσαννακογλείφτης", - "τσιτώνω": "τσητώνω", - "τσιγκλώ": "τσυγκλώ", - "τσίµα": "τσύµα", - "υννί": "υνί", - "υπερηφάνια": "υπερηφάνεια", - "υπόχρεως": "υπόχρεος", - "φάκελλος": "φάκελος", - "φείδι": "φίδι", - "φιλονεικώ": "φιλονικώ", - "φιλονεικία": "φιλονικία", - "φυρί-φυρί": "φιρί-φιρί", - "φτιάνω": "φτειάχνω", - "φτιάχνω": "φτειάχνω", - "φτώχεια": "φτώχια", - "φυσαλίδα": "φυσαλλίδα", - "χάνος": "χάννος", - "χυνόπωρο": "χινόπωρο", - "χεινόπωρο": "χινόπωρο", - "χιµίζω": "χυµίζω", - "χιμίζω": "χυμιζώ", - "γκωλ": "γκολ", - "αιρκοντίσιον": "ερκοντίσιον", - "καρµπυρατέρ": "καρµπφατέρ", - "κυλόττα": "κιλότα", - "κλή ρινγκ": "κλίρινγκ", - "κωλγκέρλ": "κολγκέρλ", - "κοµπιναιζόν": "κοµπινεζόν", - "κοπυράιτ": "κοπιράιτ", - "µυλαίδη": "µιλέδη", - "µποϋκοτάζ": "µποϊκοτάζ", - "πέναλτυ": "πέναλτι", - "πορτραίτο": "πορτρέτο", - "ρεστωράν": "ρεστοράν", - "ροσµπήφ": "ροσµπίφ", - "σαντιγύ": "σαντιγί", - "στριπτήζ": "στριπτίζ", - "ταµπλώ": "ταµπλό", - "τζόκεϋ": "τζόκεϊ", - "φουτµπώλ": "φουτµπόλ", - "τρόλλεϋ": "τρόλεϊ", - "χίππυ": "χίπι", - "φέρρυ-µπωτ": "φεριµπότ", - "χειρούργος": "χειρουργός", - "αβαείο": "αββαείο", - "αβάς": "αββάς", - "αβάσκαµα": "βάσκαµα", - "αβασκανία": "βασκανία", - "αβάφτιστος": "αβάπτιστος", - "αβάφτιστη": "αβάπτιστη", - "αβάφτιστο": "αβάπτιστο", - "αβγίλα": "αβγουλίλα", - "αυτί": "αφτί", - "αβδέλλα": "βδέλλα", - "Αβράµ": "'Αβραάµ", - "αγγινάρα": "αγκινάρα", - "αγγόνα": "εγγονή", - "αγγόνι": "εγγόνι", - "αγγονός": "εγγονός", - "άγειρτος": "άγερτος", - "άγειρτη": "άγερτη", - "άγειρτο": "άγερτο", - "αγέρας": "αέρας", - "αγκλέουρας": "αγλέορας", - "αγκλίτοα": "γκλίτσα", - "Αγκόλα": "Ανγκόλα", - "αγκορά": "ανγκορά", - "αγκοστοίιρα": "ανγκοστούρα", - "άγνεστος": "άγνεθος", - "άγνεστη": "άγνεθη", - "άγνεστο": "άγνεθο", - "αγώρι": "αγόρι", - "αγωρίστικος": "αγορίστικος", - "αγωρίστικη": "αγορίστικη", - "αγωρίστικο": "αγορίστικο", - "αγωροκόριτσο": "αγοροκόριστο", - "αγουρόλαδο": "αγουρέλαιο", - "αγροικώ": "γροικώ", - "αδάµαντας": "αδάµας", - "αδερφή": "αδελφή", - "αδέρφι": "αδέλφι", - "αδερφικός": "αδελφικός", - "αδερφική": "αδελφική", - "αδερφικό": "αδελφικό", - "αδερφοποιτός": "αδελφοποιτός", - "αδερφός": "αδελφός", - "αδερφοσύνη": "αδελφοσύνη", - "αέρι": "αγέρι", - "αερόµπικ": "αεροβική", - "αεροστρόβιλος": "αεριοστρόβιλος", - "αητός": "αετός", - "αιµατοποσία": "αιµοποσία", - "άιντε": "άντε", - "αισθηµατισµός": "συναισθηµατισµός", - "αιτιακός": "αιτιώδης", - "αιτιακή": "αιτιώδης", - "αιτιακό": "αιτιώδες", - "ακατανόµαστος": "ακατονόµαστος", - "ακατανόμαστη": "ακατονόμαστη", - "ακατονόμαστο": "ακατανόμαστο", - "ακέραιος": "ακέριος", - "ακέραια": "ακέρια", - "ακέραιο": "ακέριο", - "άκρον": "άκρο", - "ακτύπητος": "αχτύπητος", - "ακτύπητη": "αχτύπητη", - "ακτύπητο": "αχτύπητο", - "ακυριολεκτώ": "ακυρολεκτώ", - "ακυριολεξία": "ακυρολεξία", - "αλάτι": "άλας", - "αλατένιος": "αλάτινος", - "αλατένια": "αλάτινη", - "αλατένιο": "αλάτινο", - "αλαφραίνω": "ελαφρώνω", - "αλαφριός": "ελαφρύς", - "αλαφριό": "ελαφρύ", - "αλαφρόµυαλος": "ελαφρόµυαλος", - 
"αλαφρόμυαλη": "ελαφρόμυαλη", - "αλαφρόμυαλο": "ελαφρόμυαλο", - "αλείβω": "αλείφω", - "άλευρο": "αλεύρι", - "αλησµονησιά": "λησµονιά", - "αλκολίκι": "αλκοολίκι", - "αλλέως": "αλλιώς", - "αλληλοεπίδραση": "αλληλεπίδραση", - "αλλήθωρος": "αλλοίθωρος", - "αλλήθωρη": "αλλοίθωρη", - "αλλήθωρο": "αλλοίθωρο", - "αλλοίµονο": "αλίµονο", - "αµνηστεία": "αµνηστία", - "αµπαρόριζα": "αρµπαρόριζα", - "αµπέχωνο": "αµπέχονο", - "αµυγδαλάτος": "αµυγδαλωτός", - "αμυγδαλάτη": "αμυγδαλωτή", - "αμυγδαλάτο": "αμυγδαλωτό", - "αµυγδαλόλαδο": "αµυγδαλέλαιο", - "αµφίλογος": "αµφιλεγόµενος", - "αμφίλογη": "αμφιλεγόμενη", - "αμφίλογο": "αμφιλεγόμενο", - "αναβατός": "ανεβατός", - "αναβατή": "ανεβατή", - "αναβατό": "ανεβατό", - "αναδεχτός": "αναδεκτός", - "αναθρέφω": "ανατρέφω", - "ανακατώνω": "ανακατεύω", - "ανακάτωση": "ανακάτεµα", - "αναλίσκω": "αναλώνω", - "αναμειγνύω": "αναμιγνύω", - "αναμείκτης": "αναμίκτης", - "ανάµεικτος": "ανάµικτος", - "ανάμεικτη": "ανάμικτη", - "ανάμεικτο": "ανάμικτο", - "αναπαµός": "ανάπαυση", - "αναπαρασταίνω": "αναπαριστάνω", - "ανάπρωρος": "ανάπλωρος", - "ανάπρωρη": "ανάπλωρη", - "ανάπρωρο": "ανάπλωρο", - "αναπτυγµένος": "ανεπτυγμένος", - "αναπτυγµένη": "ανεπτυγμένη", - "αναπτυγµένο": "ανεπτυγμένο", - "άναστρος": "ανάστερος", - "αναστυλώνω": "αναστηλώνω", - "αναστύλωση": "αναστήλωση", - "ανεγνωρισµένος": "αναγνωρισµένος", - "αναγνωρισμένη": "αναγνωρισµένη", - "αναγνωρισμένο": "αναγνωρισµένο", - "ανέµυαλος": "άμυαλος", - "ανέμυαλη": "άμυαλη", - "ανέμυαλο": "άμυαλο", - "ανεπάντεχος": "αναπάντεχος", - "ανεπάντεχη": "αναπάντεχη", - "ανεπάντεχο": "αναπάντεχο", - "ανεψιά": "ανιψιά", - "ανεψιός": "ανιψιός", - "ανήρ": "άνδρας", - "ανηφόρι": "ανήφορος", - "ανηψιά": "ανιψιά", - "ανηψιός": "ανιψιός", - "άνθιση": "άνθηση", - "ανταλλάζω": "ανταλλάσσω", - "ανταπεξέρχοµαι": "αντεπεξέρχοµαι", - "αντζούγια": "αντσούγια", - "αντιεισαγγελέας": "αντεισαγγελέας", - "αντικατασταίνω": "αντικαθιστώ", - "αντικρύζω": "αντικρίζω", - "αντιµολία": "αντιµωλία", - "αντιπροσωπεία": "αντιπροσωπία", - "αντισταµινικό": "αντιισταµινικός", - "αντίχτυπος": "αντίκτυπος", - "άντρας": "άνδρας", - "αντρόγυνο": "ανδρόγυνο", - "αντρώνω": "ανδρώνω", - "άξια": "άξιος", - "απακούµπι": "αποκούµπι", - "απαλάµη": "παλάµη", - "Απαλάχια": "Αππαλάχια", - "απάνω": "επάνω", - "απέδρασα": "αποδιδράσκω", - "απλούς": "απλός", - "απλούν": "απλό", - "απόγαιο": "απόγειο", - "αποδείχνω": "αποδεικνύω", - "αποθαµός": "πεθαµός", - "αποθανατίζω": "απαθανατίζω", - "αποκεντροποίηση": "αποκέντρωση", - "απολαυή": "απολαβή", - "αποξεραίνω": "αποξηραίνω", - "απόξυοη": "απόξεση", - "απόξω": "απέξω", - "απόσχω": "απέχω", - "αποτίω": "αποτίνω", - "αποτυχαίνω": "αποτυγχάνω", - "αποχαιρετίζω": "αποχαιρετώ", - "απόχτηµα": "απόκτηµα", - "απόχτηση": "απόκτηση", - "αποχτώ": "αποκτώ", - "Απρίλης": "Απρίλιος", - "αρκαντάσης": "καρντάσης", - "αρµάρι": "ερµάριο", - "άρµη": "άλµη", - "αρµοστεία": "αρµοστία", - "άρµπουρο": "άλµπουρο", - "αρµύρα": "αλµύρα", - "αρµυρίκι": "αλµυρίκι", - "άρρην": "άρρεν", - "αρσανάς": "ταρσανάς", - "αρτύνω": "αρταίνω", - "αρχινίζω": "αρχίζω", - "αρχινώ": "αρχίζω", - "αρχίτερα": "αρχύτερα", - "ασκηµάδα": "ασχήµια", - "ασκηµαίνω": "ασχηµαίνω", - "ασκήµια": "ασχήµια", - "ασκηµίζω": "ασχηµίζω", - "άσσος": "άσος", - "αστράπτω": "αστράφτω", - "αστράπτω": "αστράφτω", - "αταχτώ": "ατακτώ", - "ατσάλινος": "ατσαλένιος", - "ατσάλινη": "ατσαλένια", - "ατσάλινο": "ατσαλένιο", - "Ατσιγγάνος": "Τσιγγάνος", - "Ατσίγγανος": "Τσιγγάνος", - "αυγαταίνω": "αβγατίζω", - "αυγατίζω": "αβγατίζω", - "αυγό": "αβγό", - "αυγοειδής": "αυγοειδής", - "αυγοειδές": 
"αβγοειδές", - "αυγοθήκη": "αβγοθήκη", - "αυγοκόβω": "αβγοκόβω", - "αυγοτάραχο": "αβγοτάραχο", - "αύλακας": "αυλάκι", - "αυτί": "αφτί", - "αυτιάζοµαι": "αφτιάζοµαι", - "αφορεσµός": "αφορισµός", - "άφρονας": "άφρων", - "αχείλι": "χείλι", - "άχερο": "άχυρο", - "αχερώνας": "αχυρώνας", - "αχιβάδα": "αχηβάδα", - "αχτίδα": "ακτίνα", - "βαβουίνος": "µπαµπουίνος", - "Βαγγέλης": "Ευάγγελος", - "βαγγέλιο": "ευαγγέλιο", - "Βάγια": "Βάί'α", - "βαζιβουζούκος": "βασιβουζούκος", - "βαθύνω": "βαθαίνω", - "βάιο": "βάγιο", - "βακαλάος": "µπακαλιάρος", - "βαλάντιο": "βαλλάντιο", - "βαλαντώνω": "βαλλαντώνω", - "βάνω": "βάζω", - "βαρειά": "βαριά", - "βαριεστίζω": "βαργεστώ", - "βαριεστώ": "βαργεστώ", - "βαρώ": "βαράω", - "βαρώνος": "βαρόνος", - "βασιλέας": "βασιλιάς", - "βασµούλος": "γασµούλος", - "Βαυαρία": "Βαβαρία", - "Βαυαροκρατία": "Βαβαροκρατία", - "βαφτίζω": "βαπτίζω", - "βάφτιση": "βάπτιση", - "βάφτισµα": "βάπτισµα", - "βαφτιστής": "βαπτιστής", - "βαφτιστικός": "βαπτιστικός", - "βαφτιστική": "βαπτιστική", - "βαφτιστικιά": "βαπτιστική", - "βαφτιστικό": "βαπτιστικό", - "βδοµάδα": "εβδοµάδα", - "βεγόνια": "µπιγκόνια", - "βελανίδι": "βαλανίδι", - "βελανιδιά": "βαλανιδιά", - "βενζίνα": "βενζίνη", - "βεράτιο": "µπεράτι", - "βερόκοκο": "βερίκοκο", - "βιγόνια": "µπιγκόνια", - "βλάφτω": "βλάπτω", - "βλογιά": "ευλογιά", - "βλογάω": "ευλογώ", - "βογγίζω": "βογγώ", - "βόγγος": "βογγητό", - "βογκητό": "βογγητό", - "βοδάµαξα": "βοϊδάµαξα", - "βόλλεϋ": "βόλεϊ", - "βολοκοπώ": "βωλοκοπώ", - "βόλος": "βώλος", - "βουβάλι": "βούβαλος", - "βουή": "βοή", - "βούλα": "βούλλα", - "βούλωµα": "βούλλωµα", - "βουλώνω": "βουλλώνω", - "βουρβόλακας": "βρικόλακας", - "βουρκόλακας": "βρικόλακας", - "βους": "βόδι", - "βραδι": "βράδυ", - "βρυκόλακας": "βρικόλακας", - "βρώµα": "βρόµα", - "βρώµη": "βρόµη", - "βρωµιά": "βροµιά", - "βρωµίζω": "βροµίζω", - "βρώµιο": "βρόµιο", - "βρωµώ": "βροµώ", - "βωξίτης": "βοξίτης", - "γάβρος": "γαύρος", - "γαϊδάρα": "γαϊδούρα", - "γαίµα": "αίµα", - "γαλακτόπιτα": "γαλατόπιτα", - "γάµα": "γάµµα", - "γαµβρός": "γαµπρός", - "γαρίφαλο": "γαρύφαλλο", - "γαρούφαλλο": "γαρύφαλλο", - "γαυγίζω": "γαβγίζω", - "γελάδα": "αγελάδα", - "γελέκο": "γιλέκο", - "γένοµαι": "γίνοµαι", - "γενότυπος": "γονότυπος", - "Γένουα": "Γένοβα", - "γεράζω": "γερνώ", - "γέρακας": "γεράκι", - "γερατειά": "γηρατειά", - "γεροκοµείο": "γηροκοµείο", - "γεροκοµώ": "γηροκοµώ", - "Γεσθηµανή": "Γεθσηµανή", - "γεώδης": "γαιώδης", - "γαιώδες": "γαιώδες", - "γηρασµός": "γήρανση", - "Γιάννενα": "Ιωάννινα", - "Γιάννινα": "Ιωάννινα", - "γιάνω": "γιαίνω", - "γιαουρτλού": "γιογουρτλού", - "Γιαπωνέζος": "Ιαπωνέζος", - "γιγαντεύω": "γιγαντώνω", - "γιεγιές": "γεγές", - "Γιεν": "γεν", - "γιέσµαν": "γέσµαν", - "γιόκας": "γυιόκας", - "γιορτασµός": "εορτασµός", - "γιος": "γυιος", - "Γιούλης": "Ιούλιος", - "Γιούνης": "Ιούνιος", - "γιοφύρι": "γεφύρι", - "Γιώργος": "Γεώργιος", - "γιωτ": "γιοτ", - "γιωτακισµός": "ιωτακισµός", - "γκάγκστερ": "γκάνγκστερ", - "γκαγκστερισµός": "γκανγκστερισµός", - "γκαµήλα": "καµήλα", - "γκεµπελίσκος": "γκαιµπελίσκος", - "γκιουβέτσι": "γιουβέτσι", - "γκιώνης": "γκιόνης", - "γκλοµπ": "κλοµπ", - "γκογκ": "γκονγκ", - "Γκιόνα": "Γκιώνα", - "γκόρφι": "γκόλφι", - "γκρα": "γκρας", - "Γκράβαρα": "Κράβαρα", - "γκυ": "γκι", - "γλαϋξ": "γλαύκα", - "γλιτώνω": "γλυτώνω", - "γλύκισµα": "γλύκυσµα", - "γλυστρώ": "γλιστρώ", - "γλωσσίδα": "γλωττίδα", - "γνέφαλλο": "γνάφαλλο", - "γνοιάζοµαι": "νοιάζοµαι", - "γόµα": "γόµµα", - "γόνα": "γόνατο", - "γονιός": "γονέας", - "γόπα": "γώπα", - "γούµενος": "ηγούµενος", - "γουµένισσα": "ηγουµένη", - "γουώκµαν": 
"γουόκµαν", - "γραία": "γριά", - "Γράµος": "Γράµµος", - "γρασίδι": "γρασσίδι", - "γρεγολεβάντες": "γραιγολεβάντες", - "γρέγος": "γραίγος", - "γρικώ": "γροικώ", - "Γροιλανδία": "Γροιλανδία", - "γρίνια": "γκρίνια", - "γροθοκοπώ": "γρονθοκοπώ", - "γρούµπος": "γρόµπος", - "γυαλοπωλείο": "υαλοπωλείο", - "γυρνώ": "γυρίζω", - "γόρωθε": "γύροθε", - "γωβιός": "κωβιός", - "δάγκάµα": "δάγκωµα", - "δαγκαµατιά": "δαγκωµατιά", - "δαγκανιά": "δαγκωνιά", - "δαιµονοπληξία": "δαιµονιόπληκτος", - "δαίµων": "δαίµονας", - "δακτυλήθρα": "δαχτυλήθρα", - "δακτυλίδι": "δαχτυλίδι", - "∆αυίδ": "∆αβίδ", - "δαχτυλογραφία": "δακτυλογραφία", - "δαχτυλογράφος": "δακτυλογράφος", - "δεικνύω": "δείχνω", - "δείλι": "δειλινό", - "δείχτης": "δείκτης", - "δελής": "ντελής", - "δενδρογαλή": "δεντρογαλιά", - "δεντρολίβανο": "δενδρολίβανο", - "δεντροστοιχία": "δενδροστοιχία", - "δεντροφυτεία": "δενδροφυτεία", - "δεντροφυτεύω": "δενδροφυτεύω", - "δεντρόφυτος": "δενδρόφυτος", - "δεξής": "δεξιό", - "δερµατώδης": "δερµατοειδής", - "δερματώδες": "δερµατοειδές", - "δέσποτας": "δεσπότης", - "δεφτέρι": "τεφτέρι", - "διαβατάρης": "διαβάτης", - "διάβηκα": "διαβαίνω", - "διαβιβρώσκω": "διαβρώνω", - "διαθρέψω": "διατρέφω", - "διακόνεµα": "διακονιά", - "διάολος": "διάβολος", - "∆ιαµαντής": "Αδαµάντιος", - "διαολιά": "διαβολιά", - "διαολογυναίκα": "διαβολογυναίκα", - "διαολοθήλυκο": "διαβολοθήλυκο", - "διαολόκαιρος": "διαβολόκαιρος", - "διαολοκόριτσο": "διαβολοκόριτσο", - "διαολόπαιδο": "διαβολόπαιδο", - "διάολος": "διάβολος", - "διασκελιά": "δρασκελιά", - "διαχύνω": "διαχέω", - "δίδω": "δίνω", - "δίκηο": "δίκιο", - "δοβλέτι": "ντοβλέτι", - "δοσίλογος": "δωσίλογος", - "δράχνω": "αδράχνω", - "δρέπανο": "δρεπάνι", - "δρόσος": "δροσιά", - "δώνω": "δίνω", - "εγγίζω": "αγγίζω", - "εδώθε": "δώθε", - "εδωνά": "εδωδά", - "εικοσάρι": "εικοσάρικο", - "εικών": "εικόνα", - "εισαγάγω": "εισάγω", - "εισήγαγα": "εισάγω", - "εισήχθην": "εισάγω", - "έκαμα": "έκανα", - "εκατόν": "εκατό", - "εκατοστάρης": "κατοστάρης", - "εκατοστάρι": "κατοστάρι", - "εκατοστάρικο": "κατοστάρικο", - "εκλαίρ": "εκλέρ", - "Ελδοράδο": "Ελντοράντο", - "ελευθεροτεκτονισµός": "τεκτονισµός", - "ελευτεριά": "ελευθερία", - "Ελεφαντοστού Ακτή": "Ακτή Ελεφαντοστού", - "ελληνικάδικο": "ελληνάδικο", - "Ελπίδα": "Ελπίς", - "εµορφιά": "οµορφιά", - "εµορφάδα": "οµορφιά", - "έµπορας": "έµπορος", - "εµώ": "εξεµώ", - "ένδεκα": "έντεκα", - "ενενήκοντα": "ενενήντα", - "ενωρίς": "νωρίς", - "εξανέστην": "εξανίσταµαι", - "εξήκοντα": "εξήντα", - "έξις": "έξη", - "εξωκκλήσι": "ξωκκλήσι", - "εξωµερίτης": "ξωµερίτης", - "επανωφόρι": "πανωφόρι", - "επιµειξία": "επιµιξία", - "επίστοµα": "απίστοµα", - "επτάζυµο": "εφτάζυµο", - "επταήµερος": "εφταηµερος", - "επταθέσιος": "εφταθέσιος", - "επταµελής": "εφταµελης", - "επταµηνία": "εφταµηνία", - "επταµηνίτικος": "εφταµηνίτικος", - "επταπλασιάζω": "εφταπλασιάζω", - "επταπλάσιος": "εφταπλάσιος", - "επτασύλλαβος": "εφτασύλλαβος", - "επτατάξιος": "εφτατάξιος", - "επτάτοµος": "εφτάτοµος", - "επτάφυλλος": "εφτάφυλλος", - "επτάχρονα": "εφτάχρονα", - "επτάχρονος": "εφτάχρονος", - "επταψήφιος": "εφταψήφιος", - "επτάωρος": "εφτάωρος", - "επταώροφος": "εφταώροφος", - "έργον": "έργο", - "ευκή": "ευχή", - "ευρό": "ευρώ", - "ευσπλαχνίζοµαι": "σπλαχνίζοµαι", - "εφεντης": "αφέντης", - "εφηµεριακός": "εφηµέριος", - "εφημεριακή": "εφηµέρια", - "εφημεριακό": "εφηµέριο", - "εφτά": "επτά", - "εφταετία": "επταετία", - "εφτακόσια": "επτακόσια", - "εφτακόσιοι": "επτακόσιοι", - "εφτακοσιοστός": "επτακοσιοστός", - "εχθές": "χθες", - "ζάπι": "ζάφτι", - "ζαχαριάζω": "ζαχαρώνω", - 
"ζαχαροµύκητας": "σακχαροµύκητας", - "ζεµανφού": "ζαµανφού", - "ζεµανφουτισµός": "ζαµανφουτισµός", - "ζέστα": "ζέστη", - "ζεύλα": "ζεύγλα", - "Ζηλανδία": "Νέα Ζηλανδία", - "ζήλεια": "ζήλια", - "ζιµπούλι": "ζουµπούλι", - "ζο": "ζώο", - "ζουρλαµάρα": "ζούρλα", - "ζωοφόρος": "ζωφόρος", - "ηλεκτροκόλληση": "ηλεκτροσυγκόλληση", - "ηλεκτροοπτική": "ηλεκτροπτική", - "ήλιο": "ήλιον", - "ηµιόροφος": "ηµιώροφος", - "θαλάµι": "θαλάµη", - "θάµα": "θαύµα", - "θαµπώνω": "θαµβώνω", - "θάµπος": "θάµβος", - "θάφτω": "θάβω", - "θεοψία": "θεοπτία", - "θέσει": "θέση", - "θηλειά": "θηλιά", - "Θόδωρος": "Θεόδωρος", - "θρύβω": "θρύπτω", - "θυµούµαι": "θυµάµαι", - "Ιαµάϊκή": "Τζαµάικα", - "ιατρεύω": "γιατρεύω", - "ιατρός": "γιατρός", - "ιατροσόφιο": "γιατροσόφι", - "I.Q.": "αϊ-κιού", - "ινατι": "γινάτι", - "ιονίζω": "ιοντίζω", - "ιονιστής": "ιοντιστής", - "ιονόσφαιρα": "ιοντόσφαιρα", - "Ιούλης": "Ιούλιος", - "ίσασµα": "ίσιωµα", - "ισιάζω": "ισιώνω", - "ίσκιος": "ήσκιος", - "ισκιώνω": "ησκιώνω", - "ίσωµα": "ίσιωµα", - "ισώνω": "ισιώνω", - "ιχθύαση": "ιχθύωση", - "ιώτα": "γιώτα", - "καββαλισµός": "καβαλισµός", - "κάβουρος": "κάβουρας", - "καδής": "κατής", - "καδρίλια": "καντρίλια", - "Καζακστάν": "Καζαχστάν", - "καθέκλα": "καρέκλα", - "κάθησα": "κάθισα", - "[1766]. καθίκι": "καθοίκι", - "καΐλα": "καήλα", - "καϊξής": "καϊκτσής", - "καλδέρα": "καλντέρα", - "καλεντάρι": "καλαντάρι", - "καλήν εσπέρα": "καλησπέρα", - "καλιά": "καλειά", - "καλιακούδα": "καλοιακούδα", - "κάλλια": "κάλλιο", - "καλλιά": "κάλλιο", - "καλόγηρος": "καλόγερος", - "καλόρχεται": "καλοέρχεται", - "καλσόν": "καλτσόν", - "καλυµµαύκι": "καµιλαύκι", - "καλύµπρα": "καλίµπρα", - "καλωσύνη": "καλοσύνη", - "καµαρωτός": "καµαρότος", - "καµηλαύκι": "καµιλαύκι", - "καµτσίκι": "καµουτσίκι", - "καναβάτσο": "κανναβάτσο", - "κανακίζω": "κανακεύω", - "κανάτα": "καννάτα", - "κανατάς": "καννατάς", - "κανάτι": "καννάτι", - "κανελής": "καννελής", - "κανελιά": "καννελή", - "κανελί": "καννελή", - "κανελονι": "καννελόνι", - "κανελλόνι": "καννελόνι", - "κανένας": "κανείς", - "κάνη": "κάννη", - "κανί": "καννί", - "κάνναβης": "κάνναβις", - "καννιβαλισµός": "κανιβαλισµός", - "καννίβαλος": "κανίβαλος", - "κανοκιάλι": "καννοκιάλι", - "κανόνι": "καννόνι", - "κανονιά": "καννονιά", - "κανονίδι": "καννονίδι", - "κανονιέρης": "καννονιέρης", - "κανονιοβολητής": "καννονιοβολητής", - "κανονιοβολισµός": "καννονιοβολισµός", - "κανονιοβολώ": "καννονιοβολώ", - "κανονιοστάσιο": "καννονιοστάσιο", - "κανονιοστοιχία": "καννονιοστοιχία", - "κανονοθυρίδα": "καννονοθυρίδα", - "κάνουλα": "κάννουλα", - "κανών": "κανόνας", - "κάπα": "κάππα", - "κάπαρη": "κάππαρη", - "καπαρντίνα": "καµπαρντίνα", - "καραβόσκοινο": "καραβόσχοινο", - "καρένα": "καρίνα", - "κάρκάδο": "κάκαδο", - "καροτίνη": "καρωτίνη", - "καρότο": "καρώτο", - "καροτόζουµο": "καρωτόζουµο", - "καροτοσαλάτα": "καρωτοσαλάτα", - "καρπούµαι": "καρπώνοµαι", - "καρρώ": "καρό", - "κάρυ": "κάρι", - "καρυοφύλλι": "καριοφίλι", - "καταΐφι": "κανταΐφι", - "κατακάθηµαι": "κατακάθοµαι", - "κατάντια": "κατάντηµα", - "κατασκοπεία": "κατασκοπία", - "καταφτάνω": "καταφθάνω", - "καταχράσθηκα": "καταχράστηκα", - "κατάχτηση": "κατάκτηση", - "καταχτητής": "κατακτητής", - "καταχτώ": "κατακτώ", - "καταχωρώ": "καταχωρίζω", - "κατέβαλα": "καταβάλλω", - "Κατερίνα": "Αικατερίνη", - "κατοστίζω": "εκατοστίζω", - "κάτου": "κάτω", - "κατρουλιό": "κατουρλιό", - "καυναδίζω": "καβγαδίζω", - "καϋµός": "καηµός", - "'κεί": "εκεί", - "κείθε": "εκείθε", - "καψόνι": "καψώνι", - "καψύλλιο": "καψούλι", - "κελάρης": "κελλάρης", - "κελί": "κελλί", - "κεντήτρια": "κεντήστρα", - 
"κεσέµι": "γκεσέµι", - "κέσιο": "καίσιο", - "κηπάριο": "κήπος", - "κινάρα": "αγκινάρα", - "κιοφτές": "κεφτές", - "κλαίγω": "κλαίω", - "κλαπάτσα": "χλαπάτσα", - "κλασσικίζω": "κλασικίζω", - "κλασσικιστής": "κλασικιστής", - "κλέπτης": "κλέφτης", - "κληθρα": "σκλήθρα", - "κλήρινγκ": "κλίρινγκ", - "κλιπ": "βιντεοκλίπ", - "κλωσά": "κλώσσα", - "κλωτσιά": "κλοτσιά", - "κογκλάβιο": "κονκλάβιο", - "κογκρέσο": "κονγκρέσο", - "κοιµίσης": "κοίµησης", - "κοιµούµαι": "κοιµάµαι", - "κοιτώ": "κοιτάζω", - "κοιτάω": "κοιτάζω", - "κόκαλο": "κόκκαλο", - "κοκίτης": "κοκκύτης", - "κοκκίαση": "κοκκίωση", - "κοκκοφοίνικας": "κοκοφοίνικας", - "κολάζ": "κολλάζ", - "κολαντρίζω": "κουλαντρίζω", - "κολαρίζω": "κολλαρίζω", - "κολεχτίβα": "κολεκτίβα", - "κολεχτιβισµός": "κολεκτιβισµός", - "κολιγιά": "κολληγιά", - "κολίγος": "κολλήγας", - "κολίγας": "κολλήγας", - "κολικόπονος": "κωλικόπονος", - "κολιός": "κολοιός", - "κολιτσίνα": "κολτσίνα", - "κολυµπήθρα": "κολυµβήθρα", - "κολώνα": "κολόνα", - "κολώνια": "κολόνια", - "κοµβόι": "κονβόι", - "κόµις": "κόµης", - "κόµισσα": "κόµης", - "κόµιτας": "κόµης", - "κοµιτεία": "κοµητεία", - "κόµµατα": "κοµµάτι", - "κοµµούνα": "κοµούνα", - "κοµµουναλισµός": "κοµουναλισµός", - "κοµµούνι": "κοµούνι", - "κοµµουνίζω": "κοµουνίζω", - "κοµµουνισµός": "κοµουνισµός", - "κοµµουνιστής": "κοµουνιστής", - "κονδυλοειδής": "κονδυλώδης", - "κονδυλοειδές": "κονδυλώδες", - "κονσέρτο": "κοντσέρτο", - "κόντραµπαντιέρης": "κοντραµπατζής", - "κοντσίνα": "κολτσίνα", - "κονφορµισµός": "κοµφορµισµός", - "κονφορµιστής": "κομφορμιστής", - "κοπελιά": "κοπέλα", - "κοπλιµέντο": "κοµπλιµέντο", - "κόπτω": "κόβω", - "κόπυραιτ": "κοπιράιτ", - "Κοριτσα": "Κορυτσά", - "κοριτσόπουλο": "κορίτσι", - "κορνέτο": "κορνέτα", - "κορνιζώνω": "κορνιζάρω", - "κορόιδεµα": "κοροϊδία", - "κορόνα": "κορώνα", - "κορφή": "κορυφή", - "κοσάρι": "εικοσάρικο", - "κοσάρικο": "εικοσάρικο", - "κοσµετολογία": "κοσµητολογία", - "κοτάω": "κοτώ", - "κουβαρνταλίκι": "χουβαρνταλίκι", - "κουβαρντάς": "χουβαρντάς", - "κουβερνάντα": "γκουβερνάντα", - "κούκος": "κούκκος", - "κουλλουρτζής": "κουλλουράς", - "κουλούρας": "κουλλουράς", - "κουλούρι": "κουλλούρι", - "κουλουριάζω": "κουλλουριάζω", - "κουλουρτζής": "κουλλουράς", - "κουρδιστής": "χορδιστής", - "κουρντιστής": "χορδιστής", - "κουρντίζω": "κουρδίζω", - "κουρντιστήρι": "κουρδιστήρι", - "κουστούµι": "κοστούµι", - "κουτεπιέ": "κουντεπιέ", - "κόφτης": "κόπτης", - "κόχη": "κόγχη", - "κοψοχείλης": "κοψαχείλης", - "κρεµάζω": "κρεµώ", - "κροντήρι": "κρωντήρι", - "κροµµύδι": "κρεµµύδι", - "κροµµυδίλα": "κρεµµυδίλα", - "κρουσταλλιάζω": "κρυσταλλιάζω", - "κτένα": "χτένα", - "κτενάκι": "χτενάκι", - "κτένι": "χτένι", - "κτενίζω": "χτενίζω", - "κτένισµα": "χτένισµα", - "κτίριο": "κτήριο", - "κυλίω": "κυλώ", - "κυττάζω": "κοιτάζω", - "κωλ-γκέρλ": "κολ-γκέρλ", - "κωλοµπαράς": "κολοµπαράς", - "κωσταντινάτο": "κωνσταντινάτο", - "Κώστας": "Κωνσταντίνος", - "κώχη": "κόγχη", - "λάβδα": "λάµβδα", - "λαγούτο": "λαούτο", - "λαγύνι": "λαγήνι", - "λαίδη": "λέδη", - "λαϊκάντζα": "λαϊκούρα", - "λαιµά": "λαιµός", - "λαΐνι": "λαγήνι", - "λαµπράδα": "λαµπρότητα", - "λάρος": "γλάρος", - "λατόµι": "λατοµείο", - "λαύδανο": "λάβδανο", - "λαυράκι": "λαβράκι", - "λαφίνα": "ελαφίνα", - "λαφόπουλο": "ελαφόπουλο", - "λειβάδι": "λιβάδι", - "Λειβαδιά": "Λιβάδια", - "λεϊµόνι": "λεµόνι", - "λεϊµονιά": "λεµονιά", - "Λειψία": "Λιψία", - "λέοντας": "λέων", - "λεπτά": "λεφτά", - "λεπτύνω": "λεπταίνω", - "λευκαστής": "λευκαντής", - "Λευτέρης": "Ελευθέριος", - "λευτερώνω": "ελευθερώνω", - "λέω": "λέγω", - "λιανεµπόριο": "λειανεµπόριο", - 
"λιανίζω": "λειανίζω", - "λιανοτούφεκο": "λειανοτούφεκο", - "λιανοντούφεκο": "λειανοντούφεκο", - "λιανοπούληµα": "λειανοπούληµα", - "λιανοπωλητής": "λειανοπωλητής", - "λιανοτράγουδο": "λειανοτράγουδο", - "λιγοψυχία": "ολιγοψυχία", - "λιθρίνι": "λυθρίνι", - "λιµένας": "λιµάνι", - "λίµπρα": "λίβρα", - "λιοβολιά": "ηλιοβολία", - "λιόδεντρο": "ελαιόδεντρο", - "λιόλαδο": "ελαιόλαδο", - "λιόσπορος": "ηλιόσπορος", - "λιοτρίβειο": "ελαιοτριβείο", - "λιοτρόπι": "ηλιοτρόπιο", - "λιόφως": "ηλιόφως", - "λιχουδιά": "λειχουδιά", - "λιώνω": "λειώνω", - "λογιωτατίζω": "λογιοτατίζω", - "λογιώτατος": "λογιότατος", - "λόγκος": "λόγγος", - "λόξιγκας": "λόξυγγας", - "λοτόµος": "υλοτόµος", - "Λουµπλιάνα": "Λιουµπλιάνα", - "λούω": "λούζω", - "λύγξ": "λύγκας", - "λυµφατισµός": "λεµφατισµός", - "λυντσάρω": "λιντσάρω", - "λυσσιακό": "λυσσακό", - "λυώνω": "λειώνω", - "Λωξάντρα": "Λοξάντρα", - "λωρένσιο": "λορένσιο", - "λωρίδα": "λουρίδα", - "µαγγάνιο": "µαγκάνιο", - "µαγγιώρος": "µαγκιόρος", - "µαγειριά": "µαγεριά", - "µάγειρος": "µάγειρας", - "µόγερας": "µάγειρας", - "µαγιώ": "µαγιό", - "µαγκανοπήγαδο": "µαγγανοπήγαδο", - "µαγκώνω": "µαγγώνω", - "µαγνόλια": "µανόλια", - "Μαγυάρος": "Μαγιάρος", - "µαζύ": "µαζί", - "µαζώνω": "µαζεύω", - "µαιζονέτα": "µεζονέτα", - "µαιτρ": "µετρ", - "µαιτρέσα": "µετρέσα", - "µακριός": "µακρύς", - "μακριά": "µακρυά", - "μακριό": "µακρύ", - "µαλάσσω": "µαλάζω", - "µαµά": "µαµµά", - "µαµouδι": "µαµούνι", - "µάνα": "µάννα", - "µανδαρινέα": "µανταρινιά", - "µανδήλι": "µαντήλι", - "µάνδρα": "µάντρα", - "µανές": "αµανές", - "Μανόλης": "Εµµανουήλ", - "µαντζούνι": "µατζούνι", - "µαντζουράνα": "µατζουράνα", - "µαντίλα": "µαντήλα", - "µαντίλι": "µαντήλι", - "µαντµαζέλ": "µαµαζέλ", - "µαντρίζω": "µαντρώνω", - "µαντώ": "µαντό", - "Μανώλης": "Εµµανουήλ", - "µάρτυς": "µάρτυρας", - "µασκάλη": "µασχάλη", - "µατοκυλίζω": "αιµατοκυλίζω", - "µατοκύλισµα": "αιµατοκυλίζω", - "µατσέτα": "µασέτα", - "µαυράδα": "µαυρίλα", - "μεγαλόπολη": "µεγαλούπολη", - "µεγαλοσπληνία": "σπληνοµεγαλία", - "µέγγενη": "µέγκενη", - "μείκτης": "µίκτης", - "µελίγγι": "µηλίγγι", - "µεντελισµός": "µενδελισµός", - "µενχίρ": "µενίρ", - "µέρα": "ηµέρα", - "µεράδι": "µοιράδι", - "µερεύω": "ηµερεύω", - "µέρµηγκας": "µυρµήγκι", - "µερµήγκι": "µυρµήγκι", - "µερσίνα": "µυρσίνη", - "µερσίνη": "µυρσίνη", - "µέρωµα": "ηµερώνω", - "µερώνω": "ηµερώνω", - "µέσον": "µέσο", - "µεσοούρανα": "µεσούρανα", - "µεταλίκι": "µεταλλίκι", - "µεταπούληση": "µεταπώληση", - "µεταπουλω": "µεταπωλώ", - "µετοχιάριος": "µετοχάρης", - "µητάτο": "µιτάτο", - "µητριά": "µητρυιά", - "µητριός": "µητρυιός", - "Μιανµάρ": "Μυανµάρ", - "Μίκι Μάους": "Μίκυ Μάους", - "µικρύνω": "µικραίνω", - "µινουέτο": "µενουέτο", - "µιξοπαρθένα": "µειξοπαρθένα", - "µισοφόρι": "µεσοφόρι", - "µίτζα": "µίζα", - "µολογώ": "οµολογώ", - "μολογάω": "οµολογώ", - "µοµία": "µούµια", - "µοµιοποίηση": "µουµιοποίηση", - "µονάρχιδος": "µόνορχις", - "µονιάζω": "µονοιάζω", - "µορφιά": "οµορφιά", - "µορφονιός": "οµορφονιός", - "µοσκάρι": "µοσχάρι", - "µοσκοβολιά": "µοσκοβολιά", - "µοσκοβολώ": "µοσχοβολώ", - "µοσκοκαρυδιά": "µοσχοκαρυδιά", - "µοσκοκάρυδο": "µοσχοκάρυδο", - "µοσκοκάρφι": "µοσχοκάρφι", - "µοσκολίβανο": "µοσχολίβανο", - "µοσκοµπίζελο": "µοσχοµπίζελο", - "µοσκοµυρίζω": "µοσχοµυρίζω", - "µοσκοπουλώ": "µοσχοπουλώ", - "µόσκος": "µόσχος", - "µοσκοσάπουνο": "µοσχοσάπουνο", - "µοσκοστάφυλο": "µοσχοστάφυλο", - "µόσχειος": "µοσχαρήσιος", - "μόσχειο": "µοσχαρήσιο", - "µουλώνω": "µουλαρώνω", - "µουρταδέλα": "µορταδέλα", - "µουσικάντης": "µουζικάντης", - "µουσσώνας": "µουσώνας", - "µουστάκα": "µουστάκι", - 
"µουστακοφόρος": "µυστακοφόρος", - "µπαγάζια": "µπαγκάζια", - "πάγκα": "µπάνκα", - "µπαγκαδορος": "µπανκαδόρος", - "µπογκέρης": "µπανκέρης", - "µπάγκος": "πάγκος", - "µπαιν-µαρί": "µπεν-µαρί", - "µπαλάντα": "µπαλλάντα", - "µπαλαντέζα": "µπαλλαντέζα", - "µπαλαντέρ": "µπαλλαντέρ", - "µπαλάντζα": "παλάντζα", - "µπαλένα": "µπαλαίνα", - "µπαλέτο": "µπαλλέτο", - "µπάλος": "µπάλλος", - "µπάλσαµο": "βάλσαµο", - "µπαλσάµωµα": "βαλσάµωµα", - "µπαλσαµώνω": "βαλσαµώνω", - "µπάλωµα": "µπάλλωµα", - "µπαλώνω": "µπαλλώνω", - "µπαµπάκι": "βαµβάκι", - "µπαµπακόσπορος": "βαµβακόσπορος", - "Μπάµπης": "Χαραλάµπης", - "µπάµπω": "βάβω", - "µπανέλα": "µπαναίλα", - "µπαρµπρίζ": "παρµπρίζ", - "µπατίστα": "βατίστα", - "µπαχτσές": "µπαξές", - "µπαχτσίσι": "µπαξίσι", - "µπεζεβέγκης": "πεζεβέγκης", - "µπελτές": "πελτές", - "µπεντόνι": "µπιντόνι", - "µπερδουκλώνω": "µπουρδουκλώνω", - "µπερκέτι": "µπερεκέτι", - "µπετόνι": "µπιτόνι", - "µπεχαβιορισµός": "µπιχεβιορισµός", - "µπεχλιβάνης": "πεχλιβάνης", - "µπιγκουτί": "µπικουτί", - "µπιµπίλα": "µπιρµπίλα", - "µπιµπλό": "µπιµπελό", - "µπιρσίµι": "µπρισίµι", - "µπις": "µπιζ", - "µπιστόλα": "πιστόλα", - "µπιστόλι": "πιστόλι", - "µπιστολιά": "πιστολιά", - "µπιτόνι": "µπιντόνι", - "µπογιάρος": "βογιάρος", - "µπονάτσα": "µπουνάτσα", - "µπονατσάρει": "µπουνατσάρει", - "µπουά": "µποά", - "µπουκαµβίλια": "βουκαµβίλια", - "µποϋκοταζ": "µποϊκοτάζ", - "µποϋκοτάρω": "µποϊκοτάρω", - "µπουλβάρ": "βουλεβάρτο", - "µπουρδέλο": "µπορντέλο", - "µπουρµπουάρ": "πουρµπουάρ", - "µπρίζα": "πρίζα", - "µπριτζόλα": "µπριζόλα", - "µπρος": "εµπρός", - "µπύρα": "µπίρα", - "µπυραρία": "µπιραρία", - "µπυροποσία": "µπιροποσία", - "µυγδαλιά": "αµυγδαλιά", - "µύγδαλο": "αµύγδαλο", - "µυλόρδος": "µιλόρδος", - "μυρουδιά": "µυρωδιά", - "µυτζήθρα": "µυζήθρα", - "µύωψ": "µύωπας", - "µώλος": "µόλος", - "νέθω": "γνέθω", - "νι": "νυ", - "νίκελ": "νικέλιο", - "νοµεύς": "νοµέας", - "νοστιµίζω": "νοστιµεύω", - "νουννός": "νοννός", - "νταβάνι": "ταβάνι", - "ντάβανος": "τάβανος", - "νταβανόσκουπα": "ταβανόσκουπα", - "νταβούλι": "νταούλι", - "νταλαβέρι": "νταραβέρι", - "νταµπλάς": "ταµπλάς", - "ντελαπάρω": "ντεραπάρω", - "ντενεκές": "τενεκές", - "ντερβεναγος": "δερβέναγας", - "ντερβένι": "δερβένι", - "ντερβίσης": "δερβίσης", - "ντερβισόπαιδο": "δερβισόπαιδο", - "ντοκυµανταίρ": "ντοκιµαντέρ", - "ντουνρού": "ντογρού", - "ντουζ": "ντους", - "ντουζιέρα": "ντουσιέρα", - "Ντούµα": "∆ούµα", - "ντούπλεξ": "ντούµπλεξ", - "ντουφέκι": "τουφέκι", - "ντουφεκίδι": "τουφεκίδι", - "ντουφεκίζω": "τουφεκίζω", - "ντουφεξής": "τουφεξής", - "νύκτα": "νύχτα", - "νυκτωδία": "νυχτωδία", - "νωµατάρχης": "ενωµοτάρχης", - "ξανεµίζω": "εξανεµίζω", - "ξεγνοιάζω": "ξενοιάζω", - "ξεγνοιασιά": "ξενοιασιά", - "ξελαφρώνω": "ξαλαφρώνω", - "ξεπίτηδες": "επίτηδες", - "ξεπιτούτου": "εξεπιτούτου", - "ξεσκάζω": "ξεσκάω", - "ξεσπάζω": "ξεσπώ", - "ξεσχίζω": "ξεσκίζω", - "ξέσχισµα": "ξεσκίζω", - "ξευτελίζω": "εξευτελίζω", - "ξεφτίζω": "ξεφτύζω", - "ξεφτίλα": "ξευτίλα", - "ξεφτίλας": "ξευτίλας", - "ξεφτιλίζω": "ξευτιλίζω", - "ξεχάνω": "ξεχνώ", - "ξηγώ": "εξηγώ", - "ξηροφαγία": "ξεροφαγία", - "ξηροφαγιά": "ξεροφαγία", - "ξι": "ξει", - "ξιπασιά": "ξυπασιά", - "ξίπασµα": "ξύπασµα", - "ξιπολησιά": "ξυπολυσιά", - "ξιπολιέµαι": "ξυπολιέµαι", - "εξοµολόγηση": "ξομολόγηση", - "ξοµολογητής": "εξοµολογητής", - "ξοµολόγος": "εξοµολόγος", - "ξοµολογώ": "εξοµολογώ", - "ξουράφι": "ξυράφι", - "ξουράφια": "ξυραφιά", - "ξόφληση": "εξόφληση", - "ξύγγι": "ξίγγι", - "ξύγκι": "ξίγγι", - "ξύδι": "ξίδι", - "ξυλοσκίστης": "ξυλοσχίστης", - "ξυλώνω": "ξηλώνω", - "ξυνωρίδα": "συνωρίδα", - 
"ξώθυρα": "εξώθυρα", - "ξώπορτα": "εξώπορτα", - "ξώφυλλο": "εξώφυλλο", - "οδοντογιατρός": "οδοντίατρος", - "οδοντόπονος": "πονόδοντος", - "οικογενειακά": "οικογενειακώς", - "οικοκυρά": "νοικοκυρά", - "οκτάς": "οκτάδα", - "οκταετής": "οχταετής", - "οκταετές": "οχταετές", - "οκταετία": "οχταετία", - "οµοιάζω": "µοιάζω", - "οµοιώνω": "εξοµοιώνω", - "οµόµετρο": "ωµόµετρο", - "οµορφάδα": "οµορφιά", - "οµπρός": "εµπρός", - "ονείρεµα": "όνειρο", - "οξείδιο": "οξίδιο", - "οξειδοαναγωγή": "οξιδοαναγωγή", - "οξειδώνω": "οξιδώνω", - "οξείδωση": "οξίδωση", - "οξειδωτής": "οξιδωτής", - "οξιζενέ": "οξυζενέ", - "οπίσω": "πίσω", - "οργιά": "οργυιά", - "όρνεο": "όρνιο", - "όρνις": "όρνιθα", - "ορρός": "ορός", - "όσµωση": "ώσµωση", - "οστεΐτιδα": "οστίτιδα", - "οστεογονία": "οστεογένεση", - "οφίτσιο": "οφίκιο", - "οφφίκιο": "οφίκιο", - "οχτάβα": "οκτάβα", - "οχτάδα": "οκτάδα", - "οχταετία": "οκταετία", - "οχτακόσια": "οκτακόσια", - "οχτακόσιοι": "οκτακόσιοι", - "οχτακόσιες": "οκτακόσιες", - "οχτακόσια": "οκτακόσια", - "όχτρητα": "έχθρητα", - "οχτώ": "οκτώ", - "Οχτώβρης": "Οκτώβριος", - "οψιανός": "οψιδιανός", - "παγαίνω": "πηγαίνω", - "παγόνι": "παγώνι", - "παιγνίδι": "παιχνίδι", - "παίδαρος": "παίδαρος", - "παίχτης": "παίκτης", - "παλικαράς": "παλληκαράς", - "παλικάρι": "παλληκάρι", - "παλικαριά": "παλληκαριά", - "παλικαροσύνη": "παλληκαροσύνη", - "παλληκαρίστίκος": "παλληκαρήσιος", - "παλληκαρίστικη": "παλληκαρήσια", - "παλληκαρίστικο": "παλληκαρήσιο", - "παλληκαροσύνη": "παλληκαριά", - "πανταλόνι": "παντελόνι", - "παντατίφ": "πανταντίφ", - "πανταχούσα": "απανταχούσα", - "Πάντοβα": "Πάδοβα", - "παντούφλα": "παντόφλα", - "παντοχή": "απαντοχή", - "πανψυχισµός": "παµψυχισµός", - "πάνω": "επάνω", - "παπαδάκι": "παππαδάκι", - "παπαδαρειό": "παππαδαρειό", - "παπαδιά": "παππαδιά", - "παπαδοκόρη": "παππαδοκόρη", - "παπαδοκρατούµαι": "παππαδοκρατούµαι", - "παπαδολόι": "παππαδολόι", - "παπαδοπαίδι": "παππαδοπαίδι", - "παπαδοπούλα": "παππαδοπούλα", - "Παπαδόπουλο": "παππαδόπουλο", - "παπατζής": "παππατζής", - "παπατρέχας": "παππατρέχας", - "παραγιάς": "παραγυιός", - "παρανυχίδα": "παρωνυχίδα", - "παρεισφρύω": "παρεισφρέω", - "παρεννοώ": "παρανοώ", - "παρ' ολίγο": "παραλίγο", - "πασαβιόλα": "µπασαβιόλα", - "πασάλειµµα": "πασσάλειµµα", - "πασαλείφω": "πασσαλείφω", - "πασκίζω": "πασχίζω", - "παστρουµάς": "παστουρµάς", - "πατερµά": "πατερηµά", - "πατήρ": "πατέρας", - "πατούνα": "πατούσα", - "πατριός": "πατρυιός", - "πάτρονας": "πάτρωνας", - "πάψη": "παύση", - "πεθυµώ": "επιθυµώ", - "πείρος": "πίρος", - "πελέκι": "πέλεκυς", - "πελεκίζω": "πελεκώ", - "πελλόγρα": "πελάγρα", - "πεντήκοντα": "πενήντα", - "πεντόβολα": "πεντόβωλα", - "πεντόδραχµο": "πεντάδραχµο", - "περβολάρης": "περιβολάρης", - "περβόλι": "περιβόλι", - "περδικλώνω": "πεδικλώνω", - "περηφανεύοµαι": "υπερηφανεύοµαι", - "περηφάνια": "υπερηφάνεια", - "περικόβω": "περικόπτω", - "περιπατώ": "περπατώ", - "περιστεριώνας": "περιστερώνας", - "περιτάµω": "περιτέµνω", - "περιφάνεια": "περηφάνια", - "περιφράζω": "περιφράσσω", - "περιχαράζω": "περιχαράσσω", - "περιχέω": "περιχύνω", - "περντάχι": "µπερντάχι", - "πέρπυρο": "υπέρπυρο", - "πέρσι": "πέρυσι", - "πετούγια": "µπετούγια", - "πευκιάς": "πευκώνας", - "πηγεµός": "πηγαιµός", - "πηγούνι": "πιγούνι", - "πήτα": "πίτα", - "πήχυς": "πήχης", - "πι": "πει", - "πιζάµα": "πιτζάµα", - "πιθαµή": "σπιθαµή", - "πιθώνω": "απιθώνω", - "πίκρισµα": "πικρίζω", - "πιλαλώ": "πηλαλώ", - "Πιλάτος": "Πόντιος Πιλάτος", - "πιοτό": "ποτό", - "πιπίζω": "πιππίζω", - "πιρέξ": "πυρέξ", - "πίστοµα": "απίστοµα", - "πιτσιλάδα": "πιτσυλάδα", - 
"πιτσιλιά": "πιτσυλιά", - "πίττα": "πίτα", - "πίτυρον": "πίτουρο", - "πλάγι": "πλάι", - "πλανάρω": "πλανίζω", - "πλάσσω": "πλάθω", - "πλειονοψηφία": "πλειοψηφία", - "πλείονοψηφώ": "πλειοψηφώ", - "πλεξίδα": "πλεξούδα", - "πλερωµή": "πληρωµή", - "πλερώνω": "πληρώνω", - "πλέυ µπόυ": "πλεϊµπόι", - "πλέχτης": "πλέκτης", - "πληµµύρα": "πληµύρα", - "πνιγµός": "πνίξιµο", - "πνευµονόκοκκος": "πνευµονιόκοκκος", - "ποιµήν": "ποιµένας", - "πόλις": "πόλη", - "πόλιτσµαν": "πόλισµαν", - "πολιτσµάνος": "πόλισµαν", - "πολύµπριζο": "πολύπριζο", - "πολυπάω": "πολυπηγαίνω", - "πολύπους": "πολύποδας", - "Πόρτο Ρίκο": "Πουέρτο Ρίκο", - "ποταπαγόρευση": "ποτοαπαγόρευση", - "πούντρα": "πούδρα", - "πράµα": "πράγµα", - "πρεβάζι": "περβάζι", - "πρέπον": "πρέπων", - "προαγάγω": "προάγω", - "προδίνω": "προδίδω", - "προιξ": "προίκα", - "προποτζής": "προπατζής", - "προσαγάγω": "προσάγω", - "πρόσµιξη": "πρόσµειξη", - "προσφύγω": "προσφεύγω", - "προφθάνω": "προφταίνω", - "προφυλάω": "προφυλάσσω", - "προψές": "προχθές", - "πρύµη": "πρύµνη", - "πταρνίζοµαι": "φταρνίζοµαι", - "πτελέα": "φτελιά", - "πτέρνα": "φτέρνα", - "πτερυγίζω": "φτερουγίζω", - "πτιφούρ": "πετιφούρ", - "πτι-φούρ": "πετιφούρ", - "πτωχαίνω": "φτωχαίνω", - "πτώχεια": "φτώχια", - "πυκνά": "πυκνός", - "πυλωτή": "πιλοτή", - "πύο": "πύον", - "πυρογενής": "πυριγενής", - "πυρογενές": "πυριγενές", - "πυτζάµα": "πιτζάµα", - "ραγκλόν": "ρεγκλάν", - "ραγού": "ραγκού", - "ραΐζω": "ραγίζω", - "ραίντνκεν": "ρέντγκεν", - "ράντζο": "ράντσο", - "ράπτω": "ράβω", - "ρεβανί": "ραβανί", - "ρέγγε": "ρέγκε", - "Ρεγγίνα": "Ρεγκίνα", - "ρεµούλκα": "ρυµούλκα", - "ασκέρι": "ασκέρι", - "ρεοβάση": "ρευµατοβάση", - "ρεπανάκι": "ραπανάκι", - "ρεπάνι": "ραπάνι", - "ρεύω": "ρέβω", - "ρήγα": "ρίγα", - "ρηµοκκλήσι": "ερηµοκκλήσι", - "ριγκ": "ρινγκ", - "ριζότο": "ρυζότο", - "ροβίθι": "ρεβίθι", - "ροβιθιά": "ρεβιθιά", - "ροδακινιά": "ρωδακινιά", - "ροδάκινο": "ρωδάκινο", - "ρόιδι": "ρόδι", - "ροϊδιά": "ροδιά", - "ρόιδο": "ρόδι", - "ροοστάτης": "ρεοστάτης", - "ροφώ": "ρουφώ", - "ρωδιός": "ερωδιός", - "ρωθωνίζω": "ρουθουνίζω", - "ρωµαντισµός": "ροµαντισµός", - "Ρωσσία": "Ρωσία", - "ρωτώ": "ερωτώ", - "σάζω": "σιάζω", - "σαιζλόνγκ": "σεζλόνγκ", - "σαιζόν": "σεζόν", - "σαγολαίφα": "σακολαίβα", - "σάκκα": "σάκα", - "σακκάκι": "σακάκι", - "σακκάς": "σακάς", - "σακκί": "σακί", - "σακκίδιο": "σακίδιο", - "σακκοβελόνα": "σακοβελόνα", - "σακκογκόλιθος": "σακογκόλιθος", - "σακκοειδής": "σακοειδής", - "σακκοειδές": "σακοειδες", - "σακκοράφα": "σακοράφα", - "σάκκος": "σάκος", - "σακκουλα": "σακούλα", - "σακκουλάκι": "σακούλι", - "σακκουλεύοµαι": "σακουλεύοµαι", - "σακκούλι": "σακούλι", - "σακκουλιάζω": "σακουλιάζω", - "σακχαροδιαβήτης": "ζαχαροδιαβήτης", - "σάκχαροκαλάµο": "ζαχαροκάλαµο", - "σακχαροποιία": "ζαχαροποιία", - "σακχαρότευτλον": "ζαχαρότευτλο", - "σαλιαρίστρα": "σαλιάρα", - "σαλπιστής": "σαλπιγκτής", - "σαντακρούτα": "σατακρούτα", - "σαντάλι": "σανδάλι", - "σάνταλο": "σανδάλι", - "σάρρα": "σάρα", - "σαφρίδι": "σαυρίδι", - "σαχάνι": "σαγάνι", - "σβολιάζω": "σβωλιάζω", - "σβώλιασμα": "σβόλιασµα", - "σβόλος": "σβώλος", - "σβύνω": "σβήνω", - "σγουρώνω": "σγουραίνω", - "σενκόντο": "σεκόντο", - "σεγκούνα": "σιγκούνα", - "σεγόντο": "σεκόντο", - "Σειληνός": "Σιληνός", - "σείρακας": "σείρικας", - "σειρήτι": "σιρίτι", - "σεκονταρω": "σιγοντάρω", - "σεγκοντάρω": "σιγοντάρω", - "σελιλόιντ": "σελουλόιντ", - "σέλλα": "σέλα", - "σεξπιριστής": "σαιξπηριστής", - "Σεράγεβο": "Σαράγεβο", - "σεστέτο": "σεξτέτο", - "σετέτο": "σεπτέτο", - "σέχτα": "σέκτα", - "σεχταρισµός": "σεκταρισµός", - "σηµαφόρος": "σηµατοφόρος", 
- "σήριαλ": "σίριαλ", - "σηψίνη": "σηπτίνη", - "σιγάρο": "τσιγάρο", - "σιγαροθήκη": "τσιγαροθήκη", - "σίγλος": "σίκλος", - "σιγόντο": "σεκόντο", - "Σίδνεϊ": "Σύδνεϋ", - "σίελος": "σίαλος", - "σινθεσάιζερ": "συνθεσάιζερ", - "σιντέφι": "σεντέφι", - "σιορ": "σινιόρ", - "σιρυΐάνι": "σεργιάνι", - "σιρµαγιά": "σερµαγιά", - "σίτα": "σήτα", - "σταρέµπορος": "σιτέµπορος", - "σκανδαλιά": "σκανταλιά", - "σκάνταλο": "σκάνδαλο", - "σκάπτω": "σκάβω", - "σκάρα": "σχάρα", - "σκαρµός": "σκαλµός", - "σκάφτω": "σκάβω", - "σκεβρώνω": "σκευρώνω", - "σκερπάνι": "σκεπάρνι", - "σκίζα": "σχίζα", - "σκίζω": "σχίζω", - "σκίνος": "σχίνος", - "σκίσιµο": "σχίσιµο", - "σκισµάδα": "σχισµάδα", - "σκισµή": "σχισµή", - "σκλήρωση": "σκλήρυνση", - "σκοινάκι": "σχοινάκι", - "σκονί": "σχοινί", - "σκοινί": "σχοινί", - "σκοίνος": "σχοίνος", - "σκολάω": "σχολώ", - "σκολειαρόπαιδο": "σχολειαρόπαιδο", - "σκολειαρούδι": "σχολειαρούδι", - "σκολειό": "σχολείο", - "σκόλη": "σχόλη", - "σκολιαρόπαιδο": "σχολειαρόπαιδο", - "σκολιαρούδι": "σχολειαρούδι", - "σκολιό": "σχολειό", - "σκολνώ": "σχολώ", - "σκολώ": "σχολώ", - "Σκοτία": "Σκωτία", - "σκότισµα": "σκοτισµός", - "Σκοτσέζος": "Σκωτσέζος", - "σκουντούφληµα": "σκουντούφλα", - "σκώληξ": "σκουλήκι", - "σκώτι": "συκώτι", - "σοβαντεπί": "σοβατεπί", - "σοβατίζω": "σοβαντίζω", - "σοροκολεβάντες": "σιροκολεβάντες", - "σορόκος": "σιρόκος", - "σοροπιάζω": "σιροπιάζω", - "σουβατίζω": "σοβαντίζω", - "σουβαντίζω": "σοβαντίζω", - "σουβάς": "σοβάς", - "σουβατεπί": "σοβαντεπί", - "σοβατεπί": "σοβαντεπί", - "σουµιέ": "σοµιέ", - "σούρσιµο": "σύρσιµο", - "σουσπασιόν": "σισπανσιόν", - "σοφεράρω": "σοφάρω", - "σπαής": "σπαχής", - "σπαράσσω": "σπαράζω", - "σπερµατσετο": "σπαρµατσέτο", - "σπερµίνη": "σπερµατίνη", - "σπερµοβλάστη": "σπερµατοβλάστη", - "σπερµογονία": "σπερµατογονία", - "σπερµοδότης": "σπερµατοδότης", - "σπερµοδόχος": "σπερµατοδόχος", - "σπερμοδόχο": "σπερµατοδόχο", - "σπερµοθήκη": "σπερµατοθήκη", - "σπερµοκτόνος": "σπερµατοκτόνος", - "σπερμοκτόνο": "σπερµατοκτόνο", - "σπερµοτοξίνη": "σπερµατοτοξίνη", - "σπερµοφάγος": "σπερµατοφάγος", - "σπερμοφάγο": "σπερµατοφάγο", - "σπερµοφόρος": "σπερµατοφόρος", - "σπερμοφόρο": "σπερµατοφόρο", - "σπινάρω": "σπινιάρω", - "σπιράλ": "σπειράλ", - "σπλάχνο": "σπλάγχνο", - "σπογγίζω": "σφουγγίζω", - "σπω": "σπάζω", - "Στάθης": "Ευστάθιος", - "στάλαµα": "στάλαγµα", - "σταλαµατιά": "σταλαγµατιά", - "σταλαξιά": "σταλαγµατιά", - "σταλίτσα": "σταλιά", - "σταρήθρα": "σιταρήθρα", - "στάρι": "σιτάρι", - "σταρότοπος": "σιταρότοπος", - "σταχολογώ": "σταχυολογώ", - "στειρεύω": "στερεύω", - "στειροποιώ": "στειρώνω", - "Στέλιος": "Στυλιανός", - "Στέλλα": "Στυλιανή", - "στεναχώρια": "στενοχώρια", - "στεναχωρώ": "στενοχωρώ", - "στένω": "στήνω", - "στέριωµα": "στερέωµα", - "στεριώνω": "στερεώνω", - "στέρξιµο": "στέργω", - "στιλ": "στυλ", - "στιλάκι": "στυλάκι", - "στιλιζάρω": "στυλιζάρω", - "στιλίστας": "στυλίστας", - "στιλό": "στυλό", - "στιφάδο": "στυφάδο", - "στορίζω": "ιστορώ", - "στόρισµα": "ιστόρηση", - "στραβοµάρα": "στραβωµάρα", - "στραγγουλίζω": "στραγγαλίζω", - "Στρατής": "Ευστράτιος", - "στρατί": "στράτα", - "στρατοποίηση": "στρατιωτικοποίηση", - "Στράτος": "Ευστράτιος", - "στρένω": "στέργω", - "στριµόκωλα": "στρυµόκωλα", - "στριµωξίδι": "στρυµωξίδι", - "στριµώχνω": "στρυµώχνω", - "στύβω": "στείβω", - "στυπώνω": "στουπώνω", - "σύγνεφο": "σύννεφο", - "συγνώµη": "συγγνώµη", - "συδαυλίζω": "συνδαυλίζω", - "συµπαρασέρνω": "συµπαρασύρω", - "συµπεθεριά": "συµπεθεριό", - "δεκαέξι": "δεκάξι", - "συνήθιο": "συνήθειο", - "συντάµω": "συντέµνω", - "συντριβάνι": "σιντριβάνι", - 
"συνυφάδα": "συννυφάδα", - "συφορά": "συµφορά", - "συχώρεση": "συγχώρηση", - "συχωρώ": "συγχωρώ", - "συχωροχάρτι": "συγχωροχάρτι", - "σφαλνώ": "σφαλίζω", - "σφεντάµι": "σφένδαµνος", - "σφερδούκλι": "σπερδούκλι", - "σφόνδυλος": "σπόνδυλος", - "σωβινισµός": "σοβινισµός", - "σωβινιστής": "σοβινιστής", - "σώνω": "σώζω", - "σωρείτης": "σωρίτης", - "σωτάρω": "σοτάρω", - "σωτέ": "σοτέ", - "Σωτήρης": "Σωτήριος", - "σωφέρ": "σοφέρ", - "ταβατούρι": "νταβαντούρι", - "ταβερνούλα": "ταβέρνα", - "ταβλάς": "ταµπλάς", - "ταγιαδόρος": "ταλιαδόρος", - "ταγίζω": "ταΐζω", - "τάγισµα": "τάισµα", - "ταγκό": "τανγκό", - "ταή": "ταγή", - "τάλαρο": "τάλιρο", - "τάλληρο": "τάλιρο", - "ταµίευση": "αποταµίευση", - "ταµιεύω": "αποταµιεύω", - "ταµώ": "τέµνω", - "ταξείδι": "ταξίδι", - "ταπεραµέντο": "ταµπεραµέντο", - "ταράσσω": "ταράζω", - "ταχτοποίηση": "τακτοποίηση", - "ταχτοποιώ": "τακτοποιώ", - "τελάλης": "ντελάλης", - "τελολογία": "τελεολογία", - "τεριρέµ": "τερερέµ", - "τερραίν": "τερέν", - "τέσσαρα": "τέσσερα", - "τετράς": "τετράδα", - "τζέντζερης": "τέντζερης", - "τζετζερέδια": "τεντζερέδια", - "τζιριτζάντζουλα": "τζυριτζάτζουλα", - "τζίρος": "τζύρος", - "τζιτζιµπίρα": "τσιτσιµπίρα", - "τηκ": "τικ", - "τηλοµοιοτύπηµα": "τηλεοµοιοτύπηµα", - "τηλοµοιοτυπία": "τηλεοµοιοτυπία", - "τηλοµοιοτυπώ": "τηλεοµοιοτυπώ", - "τιτιβίζω": "τιττυβίζω", - "τµήθηκα": "τέµνω", - "τµήσω": "τέµνω", - "Τόκιο": "Τόκυο", - "τοµάτα": "ντοµάτα", - "τοµατιά": "ντοµατιά", - "τοµατοπολτός": "ντοµατοπολτός", - "τοµατοσαλάτα": "ντοµατοσαλάτα", - "τονθορύζω": "υποτονθορύζω", - "τορβάς": "ντορβάς", - "τορνάρω": "τορνεύω", - "τορπίλα": "τορπίλη", - "τούνδρα": "τούντρα", - "Τουρκάλα": "Τούρκος", - "τράβαλα": "ντράβαλα", - "τραΐ": "τραγί", - "τραινάρισµα": "τρενάρισµα", - "τραινάρω": "τρενάρω", - "τραίνο": "τρένο", - "τρακόσοι": "τριακόσιοι", - "τραπεζάκι": "τραπέζι", - "τρέµουλο": "τρεµούλα", - "τρέψω": "τρέπω", - "τριάµισι": "τρεισήµισι", - "τρικλίζω": "τρεκλίζω", - "τρίκλισµα": "τρέκλισµα", - "τρίπλα": "ντρίπλα", - "τριπλαδόρος": "ντριπλαδόρος", - "τριπλάρω": "ντριπλάρω", - "τρίπους": "τρίποδας", - "τρόπις": "τρόπιδα", - "τρυκ": "τρικ", - "τσαγγαράδικο": "τσαγκαράδικο", - "τσογγάρης": "τσαγκάρης", - "τσαγγάρικο": "τσαγκάρικο", - "τσαγγαροδευτέρα": "τσαγκαροδευτέρα", - "τσάµπα": "τζάµπα", - "τσαµπατζής": "τζαµπατζής", - "τσαντίζω": "τσατίζω", - "τσαντίλα": "τσατίλα", - "τσαντίλας": "τσατίλας", - "τσάντισµα": "τσάτισµα", - "τσίβα": "τζίβα", - "τσίκλα": "τσίχλα", - "τσιµεντώνω": "τσιµεντάρω", - "τσιπούρα": "τσιππούρα", - "τσιρίζω": "τσυρίζω", - "τσιριτσάντζουλα": "τζιριτζάντζουλα", - "τσιρότο": "τσηρώτο", - "τσίτα": "τσήτα", - "τσιτσιρίζω": "τσυτσυρίζω", - "τσιτσίρισµα": "τσυτσυρίζω", - "τσίτωµα": "τσήτωµα", - "τσοµπάνος": "τσοµπάνης", - "τσοπάνης": "τσοµπάνης", - "τσοπανόπουλο": "τσοµπανόπουλο", - "τσοπάνος": "τσοµπάνης", - "τσύνορο": "τσίνορο", - "τυράγνισµα": "τυράννισµα", - "τυραγνω": "τυραννώ", - "τυφεκίζω": "τουφεκίζω", - "τυφεκισµός": "τουφεκισµός", - "υαλόχαρτον": "γυαλόχαρτο", - "υαλόχαρτο": "γυαλόχαρτο", - "υάρδα": "γιάρδα", - "ύβρη": "ύβρις", - "υδατοσκοπια": "υδροσκοπία", - "υδραέριο": "υδαταέριο", - "ύελος": "ύαλος", - "Υόρκη Νέα": "Νέα Υόρκη", - "υποδείχνω": "υποδεικνύω", - "υπόδεσις": "υπόδηση", - "υποκάµισο": "πουκάµισο", - "φαγκρί": "φαγγρί", - "φαγοκύτωση": "φαγοκυττάρωση", - "ψόγουσα": "φαγέδαινα", - "φαγωµός": "φαγωµάρα", - "φάδι": "υφάδι", - "φαινοµεναλισµός": "φαινοµενοκρατία", - "φαινοµενισµός": "φαινοµενοκρατία", - "φαίνω": "υφαίνω", - "φαλακρώνω": "φαλακραίνω", - "φαµίλια": "φαµελιά", - "φαµφάρα": "φανφάρα", - 
"φαµφαρονισµος": "φανφαρονισµός", - "φαµφαρόνος": "φανφαρόνος", - "φαράκλα": "φαλάκρα", - "φαρµασόνος": "φραµασόνος", - "φαρµπαλάς": "φραµπαλάς", - "φασουλάδα": "φασολάδα", - "φασουλάκια": "φασολάκια", - "φασουλιά": "φασολιά", - "φασούλι": "φασόλι", - "φελόνι": "φαιλόνιο", - "φελώ": "ωφελώ", - "φεουδαλισµός": "φεουδαρχισµός", - "φερµάνι": "φιρµάνι", - "φέτος": "εφέτος", - "φθήνια": "φτήνια", - "Φιλανδία": "Φινλανδία", - "φιλενάδα": "φιλαινάδα", - "φιλιστρίνι": "φινιστρίνι", - "φιλόφρονας": "φιλόφρων", - "φιντάνι": "φυντάνι", - "φιορντ": "φιόρδ", - "φίσκα": "φύσκα", - "φκειάνω": "φτειάχνω", - "φκιάνω": "φτειάχνω", - "φκειασιδι": "φτειασίδι", - "φκειασίδωµα": "φτειασίδωµα", - "φκειασιδώνω": "φτειασιδώνω", - "φκιασιδι": "φτειασίδι", - "φκιασίδωµα": "φτειασίδωµα", - "φκιασιδώνω": "φτειασιδώνω", - "φκυάρι": "φτυάρι", - "Φλάνδρα": "Φλαµανδία", - "φλισκούνι": "φλησκούνι", - "φλοίδα": "φλούδα", - "φλοµιάζω": "φλοµώνω", - "φλορίνι": "φιορίνι", - "φλυτζάνι": "φλιτζάνι", - "φοβούµαι": "φοβάµαι", - "φονεύς": "φονιάς", - "φόντα": "φόντο", - "φουσέκι": "φισέκι", - "φούχτα": "χούφτα", - "φουχτώνω": "χουφτώνω", - "Φραγκφούρτη": "Φρανκφούρτη", - "φράσσω": "φράζω", - "Φρίντα": "Φρειδερίκη", - "Φροσύνη": "Ευφροσύνη", - "Φρόσω": "Ευφροσύνη", - "φροϋδισµος": "φροϊδισµός", - "φρουµάζω": "φριµάζω", - "φρούµασµα": "φρίµασµα", - "φτάνω": "φθάνω", - "φταρνίζοµαι": "φτερνίζοµαι", - "φτειάνω": "φτειάχνω", - "φτηνά": "φθηνά", - "φτηναίνω": "φθηναίνω", - "φτιασίδι": "φτειασίδι", - "φτιασιδώνοµαι": "φτειασιδώνοµαι", - "φτωχοκοµείο": "πτωχοκοµείο", - "φυγάδας": "φυγάς", - "φύγω": "φεύγω", - "φυλάγω": "φυλάσσω", - "φυλλαράκι": "φύλλο", - "φυλλόδεντρο": "φιλόδεντρο", - "φυλώ": "φυλάσσω", - "φυσέκι": "φισέκι", - "φυσεκλίκι": "φισεκλίκι", - "φυσιοθεραπεία": "φυσικοθεραπεία", - "φυστίκι": "φιστίκι", - "φυστικιά": "φιστικιά", - "φύω": "φύοµαι", - "φχαριστώ": "ευχαριστώ", - "φωβισµός": "φοβισµός", - "φωβιστής": "φοβισµός", - "Φώτης": "Φώτιος", - "φωτογραφώ": "φωτογραφίζω", - "φωτοβολή": ", φωτοβολία", - "χάβω": "χάφτω", - "χαΐδεµα": "χαϊδεύω", - "χάιδι": "χάδι", - "χαλνώ": "χαλώ", - "χαλυβώνω": "χαλυβδώνω", - "χάµου": "χάµω", - "χαµψίνι": "χαµσίνι", - "χάνδρα": "χάντρα", - "χαντζής": "χανιτζής", - "χαραµατιά": "χαραγµατιά", - "χάραξ": "χάρακας", - "χάροντας": "χάρος", - "χατζάρα": "χαντζάρα", - "χατζάρι": "χαντζάρι", - "χεγκελιανισµός": "εγελιανισµός", - "χειρόβολο": "χερόβολο", - "χειροµάχηµα": "χεροµαχώ", - "χειροµάχισσα": "χεροµάχος", - "χειροµάχος": "χεροµάχος", - "χειροµαχώ": "χεροµαχώ", - "χέρα": "χέρι", - "χερόµυλος": "χειρόµυλος", - "χεροπόδαρα": "χειροπόδαρα", - "χηνάρι": "χήνα", - "χι": "χει", - "χιµώ": "χυµώ", - "χιών": "χιόνι", - "χλεµπάνια": "πλεµπάγια", - "χλοΐζω": "χλοάζω", - "χλόισµα": "χλόασµα", - "χνώτο": "χνότο", - "χορδίζω": "κουρδίζω", - "χόρδισµα": "κούρδισμα", - "χοχλάζω": "κοχλάζω", - "χοχλακιάζω": "κοχλάζω", - "χοχλακίζω": "κοχλάζω", - "χοχλακώ": "κοχλάζω", - "χρεογραφο": "χρεώγραφο", - "χρεοκοπία": "χρεωκοπία", - "χρεοκοπώ": "χρεωκοπώ", - "χρεολυσία": "χρεωλυσία", - "χρεολύσιο": "χρεωλύσιο", - "χρεόλυτρο": "χρεώλυτρο", - "χρεοπιστώνω": "πιστοχρεώνω", - "χρεοπίστωση": "πιστοχρεώνω", - "χρεοστάσιο": "χρεωστάσιο", - "χρεοφειλέτης": "χρεωφειλέτης", - "Χρήστος": "Χρίστος", - "χρωµατόσωµα": "χρωµόσωµα", - "χρωµογόνος": "χρωµατογόνος", - "χρωµογόνο": "χρωµατογόνο", - "χρωµοφόρος": "χρωµατοφόρος", - "χρωµοφόρο": "χρωµατοφόρο", - "χτες": "χθες", - "χτήµα": "κτήµα", - "χτίζω": "κτίζω", - "χτίσιµο": "κτίσιµο", - "χτίσµα": "κτίσµα", - "χτίστης": "κτίστης", - "χτύπηµα": "κτύπηµα", - "χτύπος": "κτύπος", - "χτυπώ": 
"κτυπώ", - "χυµίζω": "χυµώ", - "χωλ": "χολ", - "χώνεψη": "χώνευση", - "χωριατοσύνη": "χωριατιά", - "ψένω": "ψήνω", - "ψηλαφώ": "ψηλαφίζω", - "ψηφιδοθέτης": "ψηφοθέτης", - "ψιττακίαση": "ψιττάκωση", - "ψίχαλο": "ψίχουλο", - "ψυχεδελισµός": "ψυχεδέλεια", - "ψυχογιός": "ψυχογυιός", - "ψώριασµα": "ψωριάζω", - "ωγκρατέν": "ογκρατέν", - "ωράριο": "οράριο", - "ώς": "έως", - "ωτασπίδα": "ωτοασπίδα", - "ωτοστόπ": "οτοστόπ", - "ωφελιµοκρατία": "ωφελιµισµός", - "ωχαδερφισµός": "οχαδερφισµός", - "ώχου": "όχου", - "άγυρτος": "άγειρτος", - "άγυρτη": "άγειρτη", - "άγυρτο": "άγειρτο", - "ανηµέρευτος": "ανηµέρωτος", - "ανηµέρευτη": "ανηµέρωτη", - "ανηµέρευτο": "ανηµέρωτο", - "ανοικτός": "ανοιχτός", - "ανοικτή": "ανοιχτή", - "ανοικτό": "ανοιχτό", - "αντιελληνικός": "ανθελληνικός", - "αντιελληνική": "ανθελληνική", - "αντιελληνικό": "ανθελληνικό", - "αντιεπιστηµονικος": "αντεπιστηµονικός", - "αντιεπιστηµονικη": "αντεπιστηµονική", - "αντιεπιστηµονικο": "αντεπιστηµονικό", - "αξόφλητος": "ανεξόφλητος", - "αξόφλητη": "ανεξόφλητη", - "αξόφλητο": "ανεξόφλητο", - "άπαιχτος": "άπαικτος", - "άπαιχτη": "άπαικτη", - "άπαιχτο": "άπαικτο", - "απηρχαιωµένος": "απαρχαιωµένος", - "απηρχαιωµένη": "απαρχαιωµένη", - "απηρχαιωµένο": "απαρχαιωµένο", - "άπιωτος": "άπιοτος", - "άπιωτη": "άπιοτη", - "άπιωτο": "άπιοτο", - "άπραχτος": "άπρακτος", - "άπραχτη": "άπρακτη", - "άπραχτο": "άπρακτο", - "άραχλος": "άραχνος", - "άραχλη": "άραχνη", - "άραχλο": "άραχνο", - "αρήγωτος": "αρίγωτος", - "αρήγωτη": "αρίγωτη", - "αρήγωτο": "αρίγωτο", - "αρµενικός": "αρµενιακός", - "αρµενική": "αρµενιακή", - "αρµενικό": "αρµενιακό", - "αρµυρός": "αλµυρός", - "αρµυρή": "αλµυρή", - "αρµυρό": "αλµυρό", - "άσβεστος": "άσβηστος", - "άσβεστη": "άσβηστη", - "άσβεστο": "άσβηστο", - "άσκηµος": "άσχηµος", - "άσκηµη": "άσχηµη", - "άσκηµο": "άσχηµο", - "άστυφτος": "άστειφτος", - "άστυφτη": "άστειφτη", - "άστυφτο": "άστειφτο", - "ασυχώρετος": "ασυγχώρητος", - "ασυχώρετη": "ασυγχώρητη", - "ασυχώρετο": "ασυγχώρητο", - "άταχτος": "άτακτος", - "άταχτη": "άτακτη", - "άταχτο": "άτακτο", - "άφκιαστος": "άφτειαχτος", - "άφκιαστη": "άφτειαχτη", - "άφκιαστο": "άφτειαχτο", - "άφκειαστος": "άφτειαχτος", - "άφκειαστη": "άφτειαχτη", - "άφκειαστο": "άφτειαχτο", - "άφταστος": "άφθαστος", - "άφταστη": "άφθαστη", - "άφταστο": "άφθαστο", - "άφτερος": "άπτερος", - "άφτερη": "άπτερη", - "άφτερο": "άπτερο", - "αχτιδωτος": "ακτινωτός", - "αχτιδωτη": "ακτινωτή", - "αχτιδωτο": "ακτινωτό", - "άχτιστος": "άκτιστος", - "άχτιστη": "άκτιστη", - "άχτιστο": "άκτιστο", - "βιωτικός": "βιοτικός", - "βιωτική": "βιοτική", - "βιωτικό": "βιοτικό", - "βλάστηµος": "βλάσφηµος", - "βλάστηµη": "βλάσφηµη", - "βλάστηµο": "βλάσφηµο", - "βλογηµένος": "ευλογηµένος", - "βλογηµένη": "ευλογηµένη", - "βλογηµένο": "ευλογηµένο", - "βοϊδινός": "βοδινός", - "βοϊδινή": "βοδινή", - "βοϊδινό": "βοδινό", - "βορινός": "βορεινός", - "βορινή": "βορεινή", - "βορινό": "βορεινό", - "βρωµερός": "βροµερός", - "βρωµερή": "βροµερή", - "βρωµερό": "βροµερό", - "βρώµικος": "βρόµικος", - "βρώµικη": "βρόµικη", - "βρώµικο": "βρόµικο", - "γαλατερός": "γαλακτερός", - "γαλατερή": "γαλακτερή", - "γαλατερό": "γαλακτερό", - "γδυµνός": "γυµνός", - "γδυµνή": "γυµνή", - "γδυµνό": "γυµνό", - "γελαδινός": "αγελαδινός", - "γελαδινή": "αγελαδινή", - "γελαδινό": "αγελαδινό", - "γερτός": "γειρτός", - "γερτή": "γειρτή", - "γερτό": "γειρτό", - "γιοµάτος": "γεµάτος", - "γιοµάτη": "γεµάτη", - "γιοµάτο": "γεµάτο", - "γκεµπελικός": "γκαιµπελικός", - "γκεµπελική": "γκαιµπελική", - "γκεµπελικό": "γκαιµπελικό", - "γλήγορος": "γρήγορος", - "γλήγορη": "γρήγορη", - 
"γλήγορο": "γρήγορο", - "γρανίτινος": "γρανιτένιος", - "γρανίτινη": "γρανιτένιη", - "γρανίτινο": "γρανιτένιο", - "γραφτός": "γραπτός", - "γραφτή": "γραπτή", - "γραφτό": "γραπτό", - "γυρτός": "γειρτός", - "γυρτή": "γειρτή", - "γυρτό": "γειρτό", - "δαιµονόπληκτος": "δαιµονιόπληκτος", - "δαιµονόπληκτη": "δαιµονιόπληκτη", - "δαιµονόπληκτο": "δαιµονιόπληκτο", - "δερµικός": "δερµατικός", - "δερµική": "δερµατική", - "δερµικό": "δερµατικό", - "δεχτός": "δεκτός", - "δεχτή": "δεκτή", - "δεχτό": "δεκτό", - "διαλεκτός": "διαλεχτός", - "διαλεκτή": "διαλεχτή", - "διαλεκτό": "διαλεχτό", - "διαολεµένος": "διαβολεµένος", - "διαολεµένη": "διαβολεµένη", - "διαολεµένο": "διαβολεµένο", - "δυσέλεγκτος": "δυσεξέλεγκτος", - "δυσέλεγκτη": "δυσεξέλεγκτη", - "δυσέλεγκτο": "δυσεξέλεγκτο", - "δυσλεκτικός": "δυσλεξικός", - "δυσλεκτική": "δυσλεξική", - "δυσλεκτικό": "δυσλεξικό", - "εκδοµένος": "εκδεδοµένος", - "εκδοµένη": "εκδεδοµένη", - "εκδοµένο": "εκδεδοµένο", - "ελεύτερος": "ελεύθερος", - "ελεύτερη": "ελεύθερη", - "ελεύτερο": "ελεύθερο", - "εξώφθαλµος": "εξόφθαλµος", - "εξώφθαλµη": "εξόφθαλµη", - "εξώφθαλµο": "εξόφθαλµο", - "επανωτός": "απανωτός", - "επανωτή": "απανωτή", - "επανωτό": "απανωτό", - "επεξηγητικος": "επεξηγηµατικός", - "επεξηγητικη": "επεξηγηµατική", - "επεξηγητικο": "επεξηγηµατικό", - "έρµος": "έρηµος", - "έρµη": "έρηµη", - "έρµο": "έρηµο", - "ετερόκλητος": "ετερόκλιτος", - "ετερόκλητη": "ετερόκλιτη", - "ετερόκλητο": "ετερόκλιτο", - "ετούτος": "τούτος", - "ετούτη": "τούτη", - "ετούτο": "τούτο", - "εφετεινός": "εφετινός", - "εφετεινή": "εφετινή", - "εφετεινό": "εφετινό", - "εφταήµερος": "επταήµερος", - "εφταήµερη": "επταήµερη", - "εφταήµερο": "επταήµερο", - "ζάµπλουτος": "ζάπλουτος", - "ζάµπλουτη": "ζάπλουτη", - "ζάµπλουτο": "ζάπλουτο", - "ζαχαράτος": "ζαχαρωτός", - "ζαχαράτη": "ζαχαρωτή", - "ζαχαράτο": "ζαχαρωτό", - "θαµβός": "θαµπός", - "θαµβή": "θαµπή", - "θαµβό": "θαµπό", - "θραψερός": "θρεψερός", - "θραψερή": "θρεψερή", - "θραψερό": "θρεψερό", - "ιονικός": "ιοντικός", - "ιονική": "ιοντική", - "ιονικό": "ιοντικό", - "καββαλιστικός": "καβαλιστικός", - "καββαλιστική": "καβαλιστική", - "καββαλιστικό": "καβαλιστικό", - "καλλίτερος": "καλύτερος", - "καλλίτερη": "καλύτερη", - "καλλίτερο": "καλύτερο", - "καταχτητικός": "κατακτητικός", - "καταχτητική": "κατακτητική", - "καταχτητικό": "κατακτητικό", - "καταψυγµένος": "κατεψυγµένος", - "καταψυγµένη": "κατεψυγµένη", - "καταψυγµένο": "κατεψυγµένο", - "καυδιανός": "καβδιανός", - "καυδιανή": "καβδιανή", - "καυδιανό": "καβδιανό", - "καϋµένος": "καηµένος", - "καϋµένη": "καηµένη", - "καϋµένο": "καηµένο", - "κέδρινος": "κέδρος", - "κέδρινη": "κέδρη", - "κέδρινο": "κέδρο", - "κεραµεικος": "κεραµικός", - "κεραµεικη": "κεραµική", - "κεραµεικο": "κεραµικό", - "κλασσικός": "κλασικός", - "κλασσική": "κλασική", - "κλασσικό": "κλασικό", - "κόλαριστός": "κολλαριστός", - "κόλαριστή": "κολλαριστή", - "κόλαριστό": "κολλαριστό", - "κοµµουνιστικός": "κοµουνιστικός", - "κοµµουνιστική": "κοµουνιστική", - "κοµµουνιστικό": "κοµουνιστικό", - "κοράλλινος": "κοραλλένιος", - "κοράλλινη": "κοραλλένιη", - "κοράλλινο": "κοραλλένιο", - "κτυπητός": "χτυπητός", - "κτυπητή": "χτυπητή", - "κτυπητό": "χτυπητό", - "κωφός": "κουφός", - "κωφή": "κουφή", - "κωφό": "κουφό", - "λειπανάβατος": "λειψανάβατος", - "λειπανάβατη": "λειψανάβατη", - "λειπανάβατο": "λειψανάβατο", - "λιανικός": "λειανικός", - "λιανική": "λειανική", - "λιανικό": "λειανικό", - "λιανός": "λειανός", - "λιανή": "λειανή", - "λιανό": "λειανό", - "λιγοήµερος": "ολιγοήµερος", - "λιγοήµερη": "ολιγοήµερη", - "λιγοήµερο": "ολιγοήµερο", - 
"λιγόκαρδος": "ολιγόκαρδος", - "λιγόκαρδη": "ολιγόκαρδη", - "λιγόκαρδο": "ολιγόκαρδο", - "λιγόλογος": "ολιγόλογος", - "λιγόλογη": "ολιγόλογη", - "λιγόλογο": "ολιγόλογο", - "λιγόπιστος": "ολιγόπιστος", - "λιγόπιστη": "ολιγόπιστη", - "λιγόπιστο": "ολιγόπιστο", - "λιγόψυχος": "ολιγοψυχία", - "λιγόψυχοςή": "ολιγοψυχίαη", - "λιγόψυχοςό": "ολιγοψυχίαο", - "λιόλουστος": "ηλιόλουστος", - "λιόλουστη": "ηλιόλουστη", - "λιόλουστο": "ηλιόλουστο", - "λιόµορφος": "ηλιόµορφος", - "λιόµορφη": "ηλιόµορφη", - "λιόµορφο": "ηλιόµορφο", - "λιόχαρος": "ηλιόχαρος", - "λιόχαρη": "ηλιόχαρη", - "λιόχαρο": "ηλιόχαρο", - "λιπανάβατος": "λειψανάβατος", - "λιπανάβατη": "λειψανάβατη", - "λιπανάβατο": "λειψανάβατο", - "λυµφατικός": "λεµφατικός", - "λυµφατική": "λεµφατική", - "λυµφατικό": "λεµφατικό", - "µαυριδερός": "µαυρειδερός", - "µαυριδερή": "µαυρειδερή", - "µαυριδερό": "µαυρειδερό", - "µεικτός": "µικτός", - "µεικτή": "µικτή", - "µεικτό": "µικτό", - "µελαψός": "µελαµψός", - "µελαψή": "µελαµψή", - "µελαψό": "µελαµψό", - "µετάξινος": "µεταξένιος", - "µετάξινη": "µεταξένιη", - "µετάξινο": "µεταξένιο", - "µιξοβάρβαρος": "µειξοβάρβαρος", - "µιξοβάρβαρη": "µειξοβάρβαρη", - "µιξοβάρβαρο": "µειξοβάρβαρο", - "µοσκαναθρεµµένος": "µοσχαναθρεµµένος", - "µοσκαναθρεµµένη": "µοσχαναθρεµµένη", - "µοσκαναθρεµµένο": "µοσχαναθρεµµένο", - "µουλωχτός": "µουλλωχτός", - "µουλωχτή": "µουλλωχτή", - "µουλωχτό": "µουλλωχτό", - "µπαµπακερός": "βαµβακερός", - "µπαµπακερή": "βαµβακερή", - "µπαµπακερό": "βαµβακερό", - "νεόχτιστος": "νεόκτιστος", - "νεόχτιστη": "νεόκτιστη", - "νεόχτιστο": "νεόκτιστο", - "νηστίσιµος": "νηστήσιµος", - "νηστίσιµη": "νηστήσιµη", - "νηστίσιµο": "νηστήσιµο", - "νιογέννητος": "νεογέννητος", - "νιογέννητη": "νεογέννητη", - "νιογέννητο": "νεογέννητο", - "νυκτερινός": "νυχτερινός", - "νυκτερινή": "νυχτερινή", - "νυκτερινό": "νυχτερινό", - "ξιπόλητος": "ξυπόλυτος", - "ξιπόλητη": "ξυπόλυτη", - "ξιπόλητο": "ξυπόλυτο", - "ξυνός": "ξινός", - "ξυνή": "ξινή", - "ξυνό": "ξινό", - "ξωτικός": "εξωτικός", - "ξωτική": "εξωτική", - "ξωτικό": "εξωτικό", - "οικονοµίστικος": "οικονοµικίστικος", - "οικονοµίστικη": "οικονοµικίστικη", - "οικονοµίστικο": "οικονοµικίστικο", - "οκταγωνικός": "οχταγωνικός", - "οκταγωνική": "οχταγωνική", - "οκταγωνικό": "οχταγωνικό", - "οκτάγωνος": "οχτάγωνος", - "οκτάγωνη": "οχτάγωνη", - "οκτάγωνο": "οχτάγωνο", - "οκτάεδρος": "οχτάεδρος", - "οκτάεδρη": "οχτάεδρη", - "οκτάεδρο": "οχτάεδρο", - "οκτάκιλος": "οχτάκιλος", - "οκτάκιλη": "οχτάκιλη", - "οκτάκιλο": "οχτάκιλο", - "οξειδώσιµος": "οξιδώσιµος", - "οξειδώσιµη": "οξιδώσιµη", - "οξειδώσιµο": "οξιδώσιµο", - "ορεχτικός": "ορεκτικός", - "ορεχτική": "ορεκτική", - "ορεχτικό": "ορεκτικό", - "οχταγωνικός": "οκταγωνικός", - "οχταγωνική": "οκταγωνική", - "οχταγωνικό": "οκταγωνικό", - "οχτάγωνος": "οκτάγωνος", - "οχτάγωνη": "οκτάγωνη", - "οχτάγωνο": "οκτάγωνο", - "οχτάεδρος": "οκτάεδρος", - "οχτάεδρη": "οκτάεδρη", - "οχτάεδρο": "οκτάεδρο", - "οχτακοσιοστός": "οκτακοσιοστός", - "οχτακοσιοστή": "οκτακοσιοστή", - "οχτακοσιοστό": "οκτακοσιοστό", - "οχτάπλευρος": "οκτάπλευρος", - "οχτάπλευρη": "οκτάπλευρη", - "οχτάπλευρο": "οκτάπλευρο", - "οχτάστηλος": "οκτάστηλος", - "οχτάστηλη": "οκτάστηλη", - "οχτάστηλο": "οκτάστηλο", - "οχτάστιχος": "οκτάστιχος", - "οχτάστιχη": "οκτάστιχη", - "οχτάστιχο": "οκτάστιχο", - "οχτάωρος": "οκτάωρος", - "οχτάωρη": "οκτάωρη", - "οχτάωρο": "οκτάωρο", - "οχτωβριανός": "οκτωβριανός", - "οχτωβριανή": "οκτωβριανή", - "οχτωβριανό": "οκτωβριανό", - "παιδιακίστικος": "παιδιάστικος", - "παιδιακίστικη": "παιδιάστικη", - "παιδιακίστικο": "παιδιάστικο", - 
"πανέρµος": "πανέρηµος", - "πανέρµη": "πανέρηµη", - "πανέρµο": "πανέρηµο", - "παπαδικός": "παππαδικός", - "παπαδική": "παππαδική", - "παπαδικό": "παππαδικό", - "παπαδίστικος": "παππαδίστικος", - "παπαδίστικη": "παππαδίστικη", - "παπαδίστικο": "παππαδίστικο", - "παραεκκλησιαστικός": "παρεκκλησιαστικός", - "παραεκκλησιαστική": "παρεκκλησιαστική", - "παραεκκλησιαστικό": "παρεκκλησιαστικό", - "πειρακτικός": "πειραχτικός", - "πειρακτική": "πειραχτική", - "πειρακτικό": "πειραχτικό", - "περήφανος": "υπερήφανος", - "περήφανη": "υπερήφανη", - "περήφανο": "υπερήφανο", - "περσότερος": "περισσότερος", - "περσότερη": "περισσότερη", - "περσότερο": "περισσότερο", - "πεταγµένος": "πεταµένος", - "πεταγµένη": "πεταµένη", - "πεταγµένο": "πεταµένο", - "πηκτός": "πηχτός", - "πηκτή": "πηχτή", - "πηκτό": "πηχτό", - "πιτσιλιστός": "πιτσυλιστός", - "πιτσιλιστή": "πιτσυλιστή", - "πιτσιλιστό": "πιτσυλιστό", - "πλεχτικός": "πλεκτικός", - "πλεχτική": "πλεκτική", - "πλεχτικό": "πλεκτικό", - "πλεχτός": "πλεκτός", - "πλεχτή": "πλεκτή", - "πλεχτό": "πλεκτό", - "προσεχτικός": "προσεκτικός", - "προσεχτική": "προσεκτική", - "προσεχτικό": "προσεκτικό", - "προψεσινός": "προχθεσινός", - "προψεσινή": "προχθεσινή", - "προψεσινό": "προχθεσινό", - "πτερωτός": "φτερωτός", - "πτερωτή": "φτερωτή", - "πτερωτό": "φτερωτό", - "πτωχικός": "φτωχικός", - "πτωχική": "φτωχική", - "πτωχικό": "φτωχικό", - "ραφτικός": "ραπτικός", - "ραφτική": "ραπτική", - "ραφτικό": "ραπτικό", - "ραφτός": "ραπτός", - "ραφτή": "ραπτή", - "ραφτό": "ραπτό", - "ρούσικος": "ρωσικός", - "ρούσικη": "ρωσική", - "ρούσικο": "ρωσικό", - "ρωµαντικός": "ροµαντικός", - "ρωµαντική": "ροµαντική", - "ρωµαντικό": "ροµαντικό", - "σειληνικός": "σιληνικός", - "σειληνική": "σιληνική", - "σειληνικό": "σιληνικό", - "σειριακός": "σειραϊκός", - "σειριακή": "σειραϊκή", - "σειριακό": "σειραϊκό", - "σεξπιρικός": "σαιξπηρικός", - "σεξπιρική": "σαιξπηρική", - "σεξπιρικό": "σαιξπηρικό", - "σιδηρόφρακτος": "σιδερόφραχτος", - "σιδηρόφρακτη": "σιδερόφραχτη", - "σιδηρόφρακτο": "σιδερόφραχτο", - "σκεβρός": "σκευρός", - "σκεβρή": "σκευρή", - "σκεβρό": "σκευρό", - "σκεφτικός": "σκεπτικός", - "σκεφτική": "σκεπτική", - "σκεφτικό": "σκεπτικό", - "σκιστός": "σχιστός", - "σκιστή": "σχιστή", - "σκιστό": "σχιστό", - "σκολιανός": "σχολιανός", - "σκολιανή": "σχολιανή", - "σκολιανό": "σχολιανό", - "σκοτσέζικος": "σκοτσέζικος", - "σκοτσέζικη": "σκοτσέζικη", - "σκοτσέζικο": "σκοτσέζικο", - "σµυρνιώτικος": "σµυρναίικος", - "σµυρνιώτικη": "σµυρναίικη", - "σµυρνιώτικο": "σµυρναίικο", - "σοροπιαστός": "σιροπιαστός", - "σοροπιαστή": "σιροπιαστή", - "σοροπιαστό": "σιροπιαστό", - "σπερνός": "εσπερινός", - "σπερνή": "εσπερινή", - "σπερνό": "εσπερινό", - "σταρόχρωµος": "σιταρόχρωµος", - "σταρόχρωµη": "σιταρόχρωµη", - "σταρόχρωµο": "σιταρόχρωµο", - "στενάχωρος": "στενόχωρος", - "στενάχωρη": "στενόχωρη", - "στενάχωρο": "στενόχωρο", - "στιλιστικός": "στυλιστικός", - "στιλιστική": "στυλιστική", - "στιλιστικό": "στυλιστικό", - "στριµόκωλος": "στρυµόκωλος", - "στριµόκωλη": "στρυµόκωλη", - "στριµόκωλο": "στρυµόκωλο", - "στριµωχτός": "στρυµωχτός", - "στριµωχτή": "στρυµωχτή", - "στριµωχτό": "στρυµωχτό", - "στριφνός": "στρυφνός", - "στριφνή": "στρυφνή", - "στριφνό": "στρυφνό", - "σύµµεικτος": "σύµµικτος", - "σύµµεικτη": "σύµµικτη", - "σύµµεικτο": "σύµµικτο", - "σύµψυχος": "σύψυχος", - "σύµψυχη": "σύψυχη", - "σύµψυχο": "σύψυχο", - "συντεθειµένος": "συνθέτω", - "συντεθειµένοςή": "συνθέτωη", - "συντεθειµένοςό": "συνθέτωο", - "συφοριασµένος": "συμφοριασμένος", - "συφοριασµένη": "συμφοριασμένη", - "συφοριασµένο": "συμφοριασμένο", - 
"συχωριανός": "συγχωριανός", - "συχωριανή": "συγχωριανή", - "συχωριανό": "συγχωριανό", - "ταγκός": "ταγγός", - "ταγκή": "ταγγή", - "ταµιευτικός": "αποταµιευτικός", - "ταµιευτική": "αποταµιευτική", - "ταµιευτικό": "αποταµιευτικό", - "ταχτικός": "τακτικός", - "ταχτική": "τακτική", - "ταχτικό": "τακτικό", - "τελολογικός": "τελεολογικός", - "τελολογική": "τελεολογική", - "τελολογικό": "τελεολογικό", - "τραγικοκωµικός": "κωµικοτραγικός", - "τραγικοκωµική": "κωµικοτραγική", - "τραγικοκωµικό": "κωµικοτραγικό", - "τρελλός": "τρελός", - "τρελλή": "τρελή", - "τρελλό": "τρελό", - "τσεβδός": "τσευδός", - "τσεβδή": "τσευδή", - "τσεβδό": "τσευδό", - "τσιριχτός": "τσυριχτός", - "τσιριχτή": "τσυριχτή", - "τσιριχτό": "τσυριχτό", - "τσιτωτός": "τσητωτός", - "τσιτωτή": "τσητωτή", - "τσιτωτό": "τσητωτό", - "υποµονητικός": "υποµονετικός", - "υποµονητική": "υποµονετική", - "υποµονητικό": "υποµονετικό", - "φαµφαρονικός": "φανφαρονίστικος", - "φαµφαρονική": "φανφαρονίστικη", - "φαµφαρονικό": "φανφαρονίστικο", - "φαµφαρονίστικος": "φανφαρονίστικος", - "φαµφαρονίστικη": "φανφαρονίστικη", - "φαµφαρονίστικο": "φανφαρονίστικο", - "φαντός": "υφαντός", - "φαντή": "υφαντή", - "φαντό": "υφαντό", - "φανφαρονικός": "φανφαρονιστικός", - "φανφαρονική": "φανφαρονιστική", - "φανφαρονικό": "φανφαρονιστικό", - "φαρακλός": "φαλακρός", - "φαρακλή": "φαλακρή", - "φαρακλό": "φαλακρό", - "φεγγαροφώτιστος": "φεγγαρόφωτος", - "φεγγαροφώτιστη": "φεγγαρόφωτη", - "φεγγαροφώτιστο": "φεγγαρόφωτο", - "φεουδαλικός": "φεουδαρχικός", - "φεουδαλική": "φεουδαρχική", - "φεουδαλικό": "φεουδαρχικό", - "φλοκάτος": "φλοκωτός", - "φλοκάτη": "φλοκωτή", - "φλοκάτο": "φλοκωτό", - "φριχτός": "φρικτός", - "φριχτή": "φρικτή", - "φριχτό": "φρικτό", - "φροϋδικός": "φροϊδικός", - "φροϋδική": "φροϊδική", - "φροϋδικό": "φροϊδικό", - "φτειαστός": "φτειαχτός", - "φτειαστή": "φτειαχτή", - "φτειαστό": "φτειαχτό", - "φτηνός": "φθηνός", - "φτηνή": "φθηνή", - "φτηνό": "φθηνό", - "φυσιοθεραπευτικός": "φυσιοθεραπευτικός", - "φυσιοθεραπευτική": "φυσιοθεραπευτική", - "φυσιοθεραπευτικό": "φυσιοθεραπευτικό", - "φωβιστικός": "φοβιστικός", - "φωβιστική": "φοβιστική", - "φωβιστικό": "φοβιστικό", - "χαδεµένος": "χαϊδεµένος", - "χαδεµένη": "χαϊδεµένη", - "χαδεµένο": "χαϊδεµένο", - "χειλόφωνος": "χειλεόφωνος", - "χειλόφωνη": "χειλεόφωνη", - "χειλόφωνο": "χειλεόφωνο", - "χειροδύναµος": "χεροδύναµος", - "χειροδύναµη": "χεροδύναµη", - "χειροδύναµο": "χεροδύναµο", - "χηράµενος": "χηρευάµενος", - "χηράµενη": "χηρευάµενη", - "χηράµενο": "χηρευάµενο", - "χλωµός": "χλοµός", - "χλωµή": "χλοµή", - "χλωµό": "χλοµό", - "χνουδάτος": "χνουδωτός", - "χνουδάτη": "χνουδωτή", - "χνουδάτο": "χνουδωτό", - "χονδρός": "χοντρός", - "χονδρή": "χοντρή", - "χονδρό": "χοντρό", - "χουβαρντάδικος": "χουβαρντάς", - "χουβαρντάδικοςή": "χουβαρντάςη", - "χουβαρντάδικοςό": "χουβαρντάςο", - "χρεολυτικός": "χρεωλυτικός", - "χρεολυτική": "χρεωλυτική", - "χρεολυτικό": "χρεωλυτικό", - "χρησµοδοτικός": "χρησµοδοσία", - "χρησµοδοτική": "χρησµοδοσίαη", - "χρησµοδοτικό": "χρησµοδοσίαο", - "χρυσόπλεχτος": "χρυσόπλεκτος", - "χρυσόπλεχτη": "χρυσόπλεκτη", - "χρυσόπλεχτο": "χρυσόπλεκτο", - "χτεσινός": "χθεσινός", - "χτεσινή": "χθεσινή", - "χτεσινό": "χθεσινό", - "χτιστός": "κτιστός", - "χτιστή": "κτιστή", - "χτιστό": "κτιστό", - "αντρείος": "ανδρείος", - "αντρεία": "ανδρεία", - "αντρείο": "ανδρείο", - "αποποµπαίος": "αποδιοποµπαίος", - "αποποµπαία": "αποδιοποµπαία", - "αποποµπαίο": "αποδιοποµπαίο", - "γεραλεος": "γηραλέος", - "γεραλεα": "γηραλέα", - "γεραλεο": "γηραλέο", - "εντόπιος": "ντόπιος", - "εντόπια": "ντόπια", - 
"εντόπιο": "ντόπιο", - "εφταπλάσιος": "επταπλάσιος", - "εφταπλάσια": "επταπλάσια", - "εφταπλάσιο": "επταπλάσιο", - "ζούφιος": "τζούφιος", - "ζούφια": "τζούφια", - "ζούφιο": "τζούφιο", - "καθάριος": "καθάρειος", - "καθάρια": "καθάρεια", - "καθάριο": "καθάρειο", - "λαφήσιος": "ελαφήσιος", - "λαφήσια": "ελαφήσια", - "λαφήσιο": "ελαφήσιο", - "οκταθέσιος": "οχταθέσιος", - "οκταθέσια": "οχταθέσια", - "οκταθέσιο": "οχταθέσιο", - "ονυχαίος": "ονυχιαίος", - "ονυχαία": "ονυχιαία", - "ονυχαίο": "ονυχιαίο", - "οχταπλάσιος": "οκταπλάσιος", - "οχταπλάσια": "οκταπλάσια", - "οχταπλάσιο": "οκταπλάσιο", - "βοϊδήσιος": "βοδινός", - "βοϊδήσια": "βοδινή", - "βοϊδήσιο": "βοδινό", - "καλαµποκίσιος": "καλαµποκήσιος", - "καλαµποκίσια": "καλαµποκήσια", - "καλαµποκίσιο": "καλαµποκήσιο", - "κεφαλίσιος": "κεφαλήσιος", - "κεφαλίσια": "κεφαλήσια", - "κεφαλίσιο": "κεφαλήσιο", - "κρουσταλλένιος": "κρυσταλλένιος", - "κρουσταλλένια": "κρυσταλλένια", - "κρουσταλλένιο": "κρυσταλλένιο", - "µοσκαρήσιος": "µοσχαρήσιος", - "µοσκαρήσια": "µοσχαρήσια", - "µοσκαρήσιο": "µοσχαρήσιο", - "παλικαρήσιος": "παλληκαρήσιος", - "παλικαρήσια": "παλληκαρήσια", - "παλικαρήσιο": "παλληκαρήσιο", - "πετρένιος": "πέτρινος", - "πετρένια": "πέτρινη", - "πετρένιο": "πέτρινο", - "σιταρένιος": "σταρένιος", - "σιταρένια": "σταρένια", - "σιταρένιο": "σταρένιο", - "σκυλίσιος": "σκυλήσιος", - "σκυλίσια": "σκυλήσια", - "σκυλίσιο": "σκυλήσιο", - "χελίσιος": "χελήσιος", - "χελίσια": "χελήσια", - "χελίσιο": "χελήσιο", - "χελωνίσιος": "χελωνήσιος", - "χελωνίσια": "χελωνήσια", - "χελωνίσιο": "χελωνήσιο", - "γουρσούζης": "γρουσούζης", - "γουρσούζα": "γρουσούζα", - "γουρσούζικο": "γρουσούζικο", - "γρινιάρης": "γκρινιάρης", - "γρινιάρα": "γκρινιάρα", - "γρινιάρικο": "γκρινιάρικο", - "λιχούδης": "λειχούδης", - "λιχούδα": "λειχούδα", - "λιχούδικο": "λειχούδικο", - "µαργιόλής": "µαριόλης", - "µαργιόλήςα": "µαριόλα", - "µαργιόλήςικο": "µαριόλικο", - "ξεκουτιάρης": "ξεκούτης", - "ξεκουτιάρα": "ξεκούτα", - "ξεκουτιάρικο": "ξεκούτικο", - "σκανδαλιάρης": "σκανταλιάρης", - "σκανδαλιάρα": "σκανταλιάρα", - "σκανδαλιάρικο": "σκανταλιάρικο", - "τσιγκούνης": "τσιγγούνης", - "τσιγκούνα": "τσιγγούνα", - "τσιγκούνικο": "τσιγγούνικο", -} - -NORM_EXCEPTIONS = {} - -for string, norm in _exc.items(): - NORM_EXCEPTIONS[string] = norm diff --git a/spacy/lang/el/syntax_iterators.py b/spacy/lang/el/syntax_iterators.py index 988a36c80..ea3af576c 100644 --- a/spacy/lang/el/syntax_iterators.py +++ b/spacy/lang/el/syntax_iterators.py @@ -1,7 +1,8 @@ from ...symbols import NOUN, PROPN, PRON +from ...errors import Errors -def noun_chunks(obj): +def noun_chunks(doclike): """ Detect base noun phrases. Works on both Doc and Span. """ @@ -10,13 +11,17 @@ def noun_chunks(obj): # obj tag corrects some DEP tagger mistakes. # Further improvement of the models will eliminate the need for this tag. labels = ["nsubj", "obj", "iobj", "appos", "ROOT", "obl"] - doc = obj.doc # Ensure works on both Doc and Span. + doc = doclike.doc # Ensure works on both Doc and Span. 
+ + if not doc.is_parsed: + raise ValueError(Errors.E029) + np_deps = [doc.vocab.strings.add(label) for label in labels] conj = doc.vocab.strings.add("conj") nmod = doc.vocab.strings.add("nmod") np_label = doc.vocab.strings.add("NP") seen = set() - for i, word in enumerate(obj): + for i, word in enumerate(doclike): if word.pos not in (NOUN, PROPN, PRON): continue # Prevent nested chunks from being produced diff --git a/spacy/lang/en/__init__.py b/spacy/lang/en/__init__.py index fa01e2b60..de09ec1e7 100644 --- a/spacy/lang/en/__init__.py +++ b/spacy/lang/en/__init__.py @@ -1,5 +1,4 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .norm_exceptions import NORM_EXCEPTIONS from .tag_map import TAG_MAP from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS @@ -7,10 +6,9 @@ from .morph_rules import MORPH_RULES from .syntax_iterators import SYNTAX_ITERATORS from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups +from ...attrs import LANG +from ...util import update_exc def _return_en(_): @@ -21,9 +19,6 @@ class EnglishDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters.update(LEX_ATTRS) lex_attr_getters[LANG] = _return_en - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS - ) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tag_map = TAG_MAP stop_words = STOP_WORDS diff --git a/spacy/lang/en/norm_exceptions.py b/spacy/lang/en/norm_exceptions.py deleted file mode 100644 index 4125cd37b..000000000 --- a/spacy/lang/en/norm_exceptions.py +++ /dev/null @@ -1,1764 +0,0 @@ -_exc = { - # Slang and abbreviations - "cos": "because", - "cuz": "because", - "fav": "favorite", - "fave": "favorite", - "misc": "miscellaneous", - "plz": "please", - "pls": "please", - "thx": "thanks", - # US vs. 
UK spelling - "accessorise": "accessorize", - "accessorised": "accessorized", - "accessorises": "accessorizes", - "accessorising": "accessorizing", - "acclimatisation": "acclimatization", - "acclimatise": "acclimatize", - "acclimatised": "acclimatized", - "acclimatises": "acclimatizes", - "acclimatising": "acclimatizing", - "accoutrements": "accouterments", - "aeon": "eon", - "aeons": "eons", - "aerogramme": "aerogram", - "aerogrammes": "aerograms", - "aeroplane": "airplane", - "aeroplanes ": "airplanes ", - "aesthete": "esthete", - "aesthetes": "esthetes", - "aesthetic": "esthetic", - "aesthetically": "esthetically", - "aesthetics": "esthetics", - "aetiology": "etiology", - "ageing": "aging", - "aggrandisement": "aggrandizement", - "agonise": "agonize", - "agonised": "agonized", - "agonises": "agonizes", - "agonising": "agonizing", - "agonisingly": "agonizingly", - "almanack": "almanac", - "almanacks": "almanacs", - "aluminium": "aluminum", - "amortisable": "amortizable", - "amortisation": "amortization", - "amortisations": "amortizations", - "amortise": "amortize", - "amortised": "amortized", - "amortises": "amortizes", - "amortising": "amortizing", - "amphitheatre": "amphitheater", - "amphitheatres": "amphitheaters", - "anaemia": "anemia", - "anaemic": "anemic", - "anaesthesia": "anesthesia", - "anaesthetic": "anesthetic", - "anaesthetics": "anesthetics", - "anaesthetise": "anesthetize", - "anaesthetised": "anesthetized", - "anaesthetises": "anesthetizes", - "anaesthetising": "anesthetizing", - "anaesthetist": "anesthetist", - "anaesthetists": "anesthetists", - "anaesthetize": "anesthetize", - "anaesthetized": "anesthetized", - "anaesthetizes": "anesthetizes", - "anaesthetizing": "anesthetizing", - "analogue": "analog", - "analogues": "analogs", - "analyse": "analyze", - "analysed": "analyzed", - "analyses": "analyzes", - "analysing": "analyzing", - "anglicise": "anglicize", - "anglicised": "anglicized", - "anglicises": "anglicizes", - "anglicising": "anglicizing", - "annualised": "annualized", - "antagonise": "antagonize", - "antagonised": "antagonized", - "antagonises": "antagonizes", - "antagonising": "antagonizing", - "apologise": "apologize", - "apologised": "apologized", - "apologises": "apologizes", - "apologising": "apologizing", - "appal": "appall", - "appals": "appalls", - "appetiser": "appetizer", - "appetisers": "appetizers", - "appetising": "appetizing", - "appetisingly": "appetizingly", - "arbour": "arbor", - "arbours": "arbors", - "archaeological": "archeological", - "archaeologically": "archeologically", - "archaeologist": "archeologist", - "archaeologists": "archeologists", - "archaeology": "archeology", - "ardour": "ardor", - "armour": "armor", - "armoured": "armored", - "armourer": "armorer", - "armourers": "armorers", - "armouries": "armories", - "armoury": "armory", - "artefact": "artifact", - "artefacts": "artifacts", - "authorise": "authorize", - "authorised": "authorized", - "authorises": "authorizes", - "authorising": "authorizing", - "axe": "ax", - "backpedalled": "backpedaled", - "backpedalling": "backpedaling", - "bannister": "banister", - "bannisters": "banisters", - "baptise": "baptize", - "baptised": "baptized", - "baptises": "baptizes", - "baptising": "baptizing", - "bastardise": "bastardize", - "bastardised": "bastardized", - "bastardises": "bastardizes", - "bastardising": "bastardizing", - "battleaxe": "battleax", - "baulk": "balk", - "baulked": "balked", - "baulking": "balking", - "baulks": "balks", - "bedevilled": "bedeviled", - "bedevilling": 
"bedeviling", - "behaviour": "behavior", - "behavioural": "behavioral", - "behaviourism": "behaviorism", - "behaviourist": "behaviorist", - "behaviourists": "behaviorists", - "behaviours": "behaviors", - "behove": "behoove", - "behoved": "behooved", - "behoves": "behooves", - "bejewelled": "bejeweled", - "belabour": "belabor", - "belaboured": "belabored", - "belabouring": "belaboring", - "belabours": "belabors", - "bevelled": "beveled", - "bevvies": "bevies", - "bevvy": "bevy", - "biassed": "biased", - "biassing": "biasing", - "bingeing": "binging", - "bougainvillaea": "bougainvillea", - "bougainvillaeas": "bougainvilleas", - "bowdlerise": "bowdlerize", - "bowdlerised": "bowdlerized", - "bowdlerises": "bowdlerizes", - "bowdlerising": "bowdlerizing", - "breathalyse": "breathalyze", - "breathalysed": "breathalyzed", - "breathalyser": "breathalyzer", - "breathalysers": "breathalyzers", - "breathalyses": "breathalyzes", - "breathalysing": "breathalyzing", - "brutalise": "brutalize", - "brutalised": "brutalized", - "brutalises": "brutalizes", - "brutalising": "brutalizing", - "buses": "busses", - "busing": "bussing", - "caesarean": "cesarean", - "caesareans": "cesareans", - "calibre": "caliber", - "calibres": "calibers", - "calliper": "caliper", - "callipers": "calipers", - "callisthenics": "calisthenics", - "canalise": "canalize", - "canalised": "canalized", - "canalises": "canalizes", - "canalising": "canalizing", - "cancellation": "cancelation", - "cancellations": "cancelations", - "cancelled": "canceled", - "cancelling": "canceling", - "candour": "candor", - "cannibalise": "cannibalize", - "cannibalised": "cannibalized", - "cannibalises": "cannibalizes", - "cannibalising": "cannibalizing", - "canonise": "canonize", - "canonised": "canonized", - "canonises": "canonizes", - "canonising": "canonizing", - "capitalise": "capitalize", - "capitalised": "capitalized", - "capitalises": "capitalizes", - "capitalising": "capitalizing", - "caramelise": "caramelize", - "caramelised": "caramelized", - "caramelises": "caramelizes", - "caramelising": "caramelizing", - "carbonise": "carbonize", - "carbonised": "carbonized", - "carbonises": "carbonizes", - "carbonising": "carbonizing", - "carolled": "caroled", - "carolling": "caroling", - "catalogue": "catalog", - "catalogued": "cataloged", - "catalogues": "catalogs", - "cataloguing": "cataloging", - "catalyse": "catalyze", - "catalysed": "catalyzed", - "catalyses": "catalyzes", - "catalysing": "catalyzing", - "categorise": "categorize", - "categorised": "categorized", - "categorises": "categorizes", - "categorising": "categorizing", - "cauterise": "cauterize", - "cauterised": "cauterized", - "cauterises": "cauterizes", - "cauterising": "cauterizing", - "cavilled": "caviled", - "cavilling": "caviling", - "centigramme": "centigram", - "centigrammes": "centigrams", - "centilitre": "centiliter", - "centilitres": "centiliters", - "centimetre": "centimeter", - "centimetres": "centimeters", - "centralise": "centralize", - "centralised": "centralized", - "centralises": "centralizes", - "centralising": "centralizing", - "centre": "center", - "centred": "centered", - "centrefold": "centerfold", - "centrefolds": "centerfolds", - "centrepiece": "centerpiece", - "centrepieces": "centerpieces", - "centres": "centers", - "channelled": "channeled", - "channelling": "channeling", - "characterise": "characterize", - "characterised": "characterized", - "characterises": "characterizes", - "characterising": "characterizing", - "cheque": "check", - "chequebook": "checkbook", - 
"chequebooks": "checkbooks", - "chequered": "checkered", - "cheques": "checks", - "chilli": "chili", - "chimaera": "chimera", - "chimaeras": "chimeras", - "chiselled": "chiseled", - "chiselling": "chiseling", - "circularise": "circularize", - "circularised": "circularized", - "circularises": "circularizes", - "circularising": "circularizing", - "civilise": "civilize", - "civilised": "civilized", - "civilises": "civilizes", - "civilising": "civilizing", - "clamour": "clamor", - "clamoured": "clamored", - "clamouring": "clamoring", - "clamours": "clamors", - "clangour": "clangor", - "clarinettist": "clarinetist", - "clarinettists": "clarinetists", - "collectivise": "collectivize", - "collectivised": "collectivized", - "collectivises": "collectivizes", - "collectivising": "collectivizing", - "colonisation": "colonization", - "colonise": "colonize", - "colonised": "colonized", - "coloniser": "colonizer", - "colonisers": "colonizers", - "colonises": "colonizes", - "colonising": "colonizing", - "colour": "color", - "colourant": "colorant", - "colourants": "colorants", - "coloured": "colored", - "coloureds": "coloreds", - "colourful": "colorful", - "colourfully": "colorfully", - "colouring": "coloring", - "colourize": "colorize", - "colourized": "colorized", - "colourizes": "colorizes", - "colourizing": "colorizing", - "colourless": "colorless", - "colours": "colors", - "commercialise": "commercialize", - "commercialised": "commercialized", - "commercialises": "commercializes", - "commercialising": "commercializing", - "compartmentalise": "compartmentalize", - "compartmentalised": "compartmentalized", - "compartmentalises": "compartmentalizes", - "compartmentalising": "compartmentalizing", - "computerise": "computerize", - "computerised": "computerized", - "computerises": "computerizes", - "computerising": "computerizing", - "conceptualise": "conceptualize", - "conceptualised": "conceptualized", - "conceptualises": "conceptualizes", - "conceptualising": "conceptualizing", - "connexion": "connection", - "connexions": "connections", - "contextualise": "contextualize", - "contextualised": "contextualized", - "contextualises": "contextualizes", - "contextualising": "contextualizing", - "cosier": "cozier", - "cosies": "cozies", - "cosiest": "coziest", - "cosily": "cozily", - "cosiness": "coziness", - "cosy": "cozy", - "councillor": "councilor", - "councillors": "councilors", - "counselled": "counseled", - "counselling": "counseling", - "counsellor": "counselor", - "counsellors": "counselors", - "crenellated": "crenelated", - "criminalise": "criminalize", - "criminalised": "criminalized", - "criminalises": "criminalizes", - "criminalising": "criminalizing", - "criticise": "criticize", - "criticised": "criticized", - "criticises": "criticizes", - "criticising": "criticizing", - "crueller": "crueler", - "cruellest": "cruelest", - "crystallisation": "crystallization", - "crystallise": "crystallize", - "crystallised": "crystallized", - "crystallises": "crystallizes", - "crystallising": "crystallizing", - "cudgelled": "cudgeled", - "cudgelling": "cudgeling", - "customise": "customize", - "customised": "customized", - "customises": "customizes", - "customising": "customizing", - "cypher": "cipher", - "cyphers": "ciphers", - "decentralisation": "decentralization", - "decentralise": "decentralize", - "decentralised": "decentralized", - "decentralises": "decentralizes", - "decentralising": "decentralizing", - "decriminalisation": "decriminalization", - "decriminalise": "decriminalize", - "decriminalised": 
"decriminalized", - "decriminalises": "decriminalizes", - "decriminalising": "decriminalizing", - "defence": "defense", - "defenceless": "defenseless", - "defences": "defenses", - "dehumanisation": "dehumanization", - "dehumanise": "dehumanize", - "dehumanised": "dehumanized", - "dehumanises": "dehumanizes", - "dehumanising": "dehumanizing", - "demeanour": "demeanor", - "demilitarisation": "demilitarization", - "demilitarise": "demilitarize", - "demilitarised": "demilitarized", - "demilitarises": "demilitarizes", - "demilitarising": "demilitarizing", - "demobilisation": "demobilization", - "demobilise": "demobilize", - "demobilised": "demobilized", - "demobilises": "demobilizes", - "demobilising": "demobilizing", - "democratisation": "democratization", - "democratise": "democratize", - "democratised": "democratized", - "democratises": "democratizes", - "democratising": "democratizing", - "demonise": "demonize", - "demonised": "demonized", - "demonises": "demonizes", - "demonising": "demonizing", - "demoralisation": "demoralization", - "demoralise": "demoralize", - "demoralised": "demoralized", - "demoralises": "demoralizes", - "demoralising": "demoralizing", - "denationalisation": "denationalization", - "denationalise": "denationalize", - "denationalised": "denationalized", - "denationalises": "denationalizes", - "denationalising": "denationalizing", - "deodorise": "deodorize", - "deodorised": "deodorized", - "deodorises": "deodorizes", - "deodorising": "deodorizing", - "depersonalise": "depersonalize", - "depersonalised": "depersonalized", - "depersonalises": "depersonalizes", - "depersonalising": "depersonalizing", - "deputise": "deputize", - "deputised": "deputized", - "deputises": "deputizes", - "deputising": "deputizing", - "desensitisation": "desensitization", - "desensitise": "desensitize", - "desensitised": "desensitized", - "desensitises": "desensitizes", - "desensitising": "desensitizing", - "destabilisation": "destabilization", - "destabilise": "destabilize", - "destabilised": "destabilized", - "destabilises": "destabilizes", - "destabilising": "destabilizing", - "dialled": "dialed", - "dialling": "dialing", - "dialogue": "dialog", - "dialogues": "dialogs", - "diarrhoea": "diarrhea", - "digitise": "digitize", - "digitised": "digitized", - "digitises": "digitizes", - "digitising": "digitizing", - "disc": "disk", - "discolour": "discolor", - "discoloured": "discolored", - "discolouring": "discoloring", - "discolours": "discolors", - "discs": "disks", - "disembowelled": "disemboweled", - "disembowelling": "disemboweling", - "disfavour": "disfavor", - "dishevelled": "disheveled", - "dishonour": "dishonor", - "dishonourable": "dishonorable", - "dishonourably": "dishonorably", - "dishonoured": "dishonored", - "dishonouring": "dishonoring", - "dishonours": "dishonors", - "disorganisation": "disorganization", - "disorganised": "disorganized", - "distil": "distill", - "distils": "distills", - "doin": "doing", - "doin'": "doing", - "dramatisation": "dramatization", - "dramatisations": "dramatizations", - "dramatise": "dramatize", - "dramatised": "dramatized", - "dramatises": "dramatizes", - "dramatising": "dramatizing", - "draught": "draft", - "draughtboard": "draftboard", - "draughtboards": "draftboards", - "draughtier": "draftier", - "draughtiest": "draftiest", - "draughts": "drafts", - "draughtsman": "draftsman", - "draughtsmanship": "draftsmanship", - "draughtsmen": "draftsmen", - "draughtswoman": "draftswoman", - "draughtswomen": "draftswomen", - "draughty": "drafty", - "drivelled": 
"driveled", - "drivelling": "driveling", - "duelled": "dueled", - "duelling": "dueling", - "economise": "economize", - "economised": "economized", - "economises": "economizes", - "economising": "economizing", - "edoema": "edema ", - "editorialise": "editorialize", - "editorialised": "editorialized", - "editorialises": "editorializes", - "editorialising": "editorializing", - "empathise": "empathize", - "empathised": "empathized", - "empathises": "empathizes", - "empathising": "empathizing", - "emphasise": "emphasize", - "emphasised": "emphasized", - "emphasises": "emphasizes", - "emphasising": "emphasizing", - "enamelled": "enameled", - "enamelling": "enameling", - "enamoured": "enamored", - "encyclopaedia": "encyclopedia", - "encyclopaedias": "encyclopedias", - "encyclopaedic": "encyclopedic", - "endeavour": "endeavor", - "endeavoured": "endeavored", - "endeavouring": "endeavoring", - "endeavours": "endeavors", - "energise": "energize", - "energised": "energized", - "energises": "energizes", - "energising": "energizing", - "enrol": "enroll", - "enrols": "enrolls", - "enthral": "enthrall", - "enthrals": "enthralls", - "epaulette": "epaulet", - "epaulettes": "epaulets", - "epicentre": "epicenter", - "epicentres": "epicenters", - "epilogue": "epilog", - "epilogues": "epilogs", - "epitomise": "epitomize", - "epitomised": "epitomized", - "epitomises": "epitomizes", - "epitomising": "epitomizing", - "equalisation": "equalization", - "equalise": "equalize", - "equalised": "equalized", - "equaliser": "equalizer", - "equalisers": "equalizers", - "equalises": "equalizes", - "equalising": "equalizing", - "eulogise": "eulogize", - "eulogised": "eulogized", - "eulogises": "eulogizes", - "eulogising": "eulogizing", - "evangelise": "evangelize", - "evangelised": "evangelized", - "evangelises": "evangelizes", - "evangelising": "evangelizing", - "exorcise": "exorcize", - "exorcised": "exorcized", - "exorcises": "exorcizes", - "exorcising": "exorcizing", - "extemporisation": "extemporization", - "extemporise": "extemporize", - "extemporised": "extemporized", - "extemporises": "extemporizes", - "extemporising": "extemporizing", - "externalisation": "externalization", - "externalisations": "externalizations", - "externalise": "externalize", - "externalised": "externalized", - "externalises": "externalizes", - "externalising": "externalizing", - "factorise": "factorize", - "factorised": "factorized", - "factorises": "factorizes", - "factorising": "factorizing", - "faecal": "fecal", - "faeces": "feces", - "familiarisation": "familiarization", - "familiarise": "familiarize", - "familiarised": "familiarized", - "familiarises": "familiarizes", - "familiarising": "familiarizing", - "fantasise": "fantasize", - "fantasised": "fantasized", - "fantasises": "fantasizes", - "fantasising": "fantasizing", - "favour": "favor", - "favourable": "favorable", - "favourably": "favorably", - "favoured": "favored", - "favouring": "favoring", - "favourite": "favorite", - "favourites": "favorites", - "favouritism": "favoritism", - "favours": "favors", - "feminise": "feminize", - "feminised": "feminized", - "feminises": "feminizes", - "feminising": "feminizing", - "fertilisation": "fertilization", - "fertilise": "fertilize", - "fertilised": "fertilized", - "fertiliser": "fertilizer", - "fertilisers": "fertilizers", - "fertilises": "fertilizes", - "fertilising": "fertilizing", - "fervour": "fervor", - "fibre": "fiber", - "fibreglass": "fiberglass", - "fibres": "fibers", - "fictionalisation": "fictionalization", - "fictionalisations": 
"fictionalizations", - "fictionalise": "fictionalize", - "fictionalised": "fictionalized", - "fictionalises": "fictionalizes", - "fictionalising": "fictionalizing", - "fillet": "filet", - "filleted ": "fileted ", - "filleting": "fileting", - "fillets ": "filets ", - "finalisation": "finalization", - "finalise": "finalize", - "finalised": "finalized", - "finalises": "finalizes", - "finalising": "finalizing", - "flautist": "flutist", - "flautists": "flutists", - "flavour": "flavor", - "flavoured": "flavored", - "flavouring": "flavoring", - "flavourings": "flavorings", - "flavourless": "flavorless", - "flavours": "flavors", - "flavoursome": "flavorsome", - "flyer / flier ": "flier / flyer ", - "foetal": "fetal", - "foetid": "fetid", - "foetus": "fetus", - "foetuses": "fetuses", - "formalisation": "formalization", - "formalise": "formalize", - "formalised": "formalized", - "formalises": "formalizes", - "formalising": "formalizing", - "fossilisation": "fossilization", - "fossilise": "fossilize", - "fossilised": "fossilized", - "fossilises": "fossilizes", - "fossilising": "fossilizing", - "fraternisation": "fraternization", - "fraternise": "fraternize", - "fraternised": "fraternized", - "fraternises": "fraternizes", - "fraternising": "fraternizing", - "fulfil": "fulfill", - "fulfilment": "fulfillment", - "fulfils": "fulfills", - "funnelled": "funneled", - "funnelling": "funneling", - "galvanise": "galvanize", - "galvanised": "galvanized", - "galvanises": "galvanizes", - "galvanising": "galvanizing", - "gambolled": "gamboled", - "gambolling": "gamboling", - "gaol": "jail", - "gaolbird": "jailbird", - "gaolbirds": "jailbirds", - "gaolbreak": "jailbreak", - "gaolbreaks": "jailbreaks", - "gaoled": "jailed", - "gaoler": "jailer", - "gaolers": "jailers", - "gaoling": "jailing", - "gaols": "jails", - "gases": "gasses", - "gauge": "gage", - "gauged": "gaged", - "gauges": "gages", - "gauging": "gaging", - "generalisation": "generalization", - "generalisations": "generalizations", - "generalise": "generalize", - "generalised": "generalized", - "generalises": "generalizes", - "generalising": "generalizing", - "ghettoise": "ghettoize", - "ghettoised": "ghettoized", - "ghettoises": "ghettoizes", - "ghettoising": "ghettoizing", - "gipsies": "gypsies", - "glamorise": "glamorize", - "glamorised": "glamorized", - "glamorises": "glamorizes", - "glamorising": "glamorizing", - "glamour": "glamor", - "globalisation": "globalization", - "globalise": "globalize", - "globalised": "globalized", - "globalises": "globalizes", - "globalising": "globalizing", - "glueing ": "gluing ", - "goin": "going", - "goin'": "going", - "goitre": "goiter", - "goitres": "goiters", - "gonorrhoea": "gonorrhea", - "gramme": "gram", - "grammes": "grams", - "gravelled": "graveled", - "grey": "gray", - "greyed": "grayed", - "greying": "graying", - "greyish": "grayish", - "greyness": "grayness", - "greys": "grays", - "grovelled": "groveled", - "grovelling": "groveling", - "groyne": "groin", - "groynes ": "groins", - "gruelling": "grueling", - "gruellingly": "gruelingly", - "gryphon": "griffin", - "gryphons": "griffins", - "gynaecological": "gynecological", - "gynaecologist": "gynecologist", - "gynaecologists": "gynecologists", - "gynaecology": "gynecology", - "haematological": "hematological", - "haematologist": "hematologist", - "haematologists": "hematologists", - "haematology": "hematology", - "haemoglobin": "hemoglobin", - "haemophilia": "hemophilia", - "haemophiliac": "hemophiliac", - "haemophiliacs": "hemophiliacs", - "haemorrhage": 
"hemorrhage", - "haemorrhaged": "hemorrhaged", - "haemorrhages": "hemorrhages", - "haemorrhaging": "hemorrhaging", - "haemorrhoids": "hemorrhoids", - "harbour": "harbor", - "harboured": "harbored", - "harbouring": "harboring", - "harbours": "harbors", - "harmonisation": "harmonization", - "harmonise": "harmonize", - "harmonised": "harmonized", - "harmonises": "harmonizes", - "harmonising": "harmonizing", - "havin": "having", - "havin'": "having", - "homoeopath": "homeopath", - "homoeopathic": "homeopathic", - "homoeopaths": "homeopaths", - "homoeopathy": "homeopathy", - "homogenise": "homogenize", - "homogenised": "homogenized", - "homogenises": "homogenizes", - "homogenising": "homogenizing", - "honour": "honor", - "honourable": "honorable", - "honourably": "honorably", - "honoured": "honored", - "honouring": "honoring", - "honours": "honors", - "hospitalisation": "hospitalization", - "hospitalise": "hospitalize", - "hospitalised": "hospitalized", - "hospitalises": "hospitalizes", - "hospitalising": "hospitalizing", - "humanise": "humanize", - "humanised": "humanized", - "humanises": "humanizes", - "humanising": "humanizing", - "humour": "humor", - "humoured": "humored", - "humouring": "humoring", - "humourless": "humorless", - "humours": "humors", - "hybridise": "hybridize", - "hybridised": "hybridized", - "hybridises": "hybridizes", - "hybridising": "hybridizing", - "hypnotise": "hypnotize", - "hypnotised": "hypnotized", - "hypnotises": "hypnotizes", - "hypnotising": "hypnotizing", - "hypothesise": "hypothesize", - "hypothesised": "hypothesized", - "hypothesises": "hypothesizes", - "hypothesising": "hypothesizing", - "idealisation": "idealization", - "idealise": "idealize", - "idealised": "idealized", - "idealises": "idealizes", - "idealising": "idealizing", - "idolise": "idolize", - "idolised": "idolized", - "idolises": "idolizes", - "idolising": "idolizing", - "immobilisation": "immobilization", - "immobilise": "immobilize", - "immobilised": "immobilized", - "immobiliser": "immobilizer", - "immobilisers": "immobilizers", - "immobilises": "immobilizes", - "immobilising": "immobilizing", - "immortalise": "immortalize", - "immortalised": "immortalized", - "immortalises": "immortalizes", - "immortalising": "immortalizing", - "immunisation": "immunization", - "immunise": "immunize", - "immunised": "immunized", - "immunises": "immunizes", - "immunising": "immunizing", - "impanelled": "impaneled", - "impanelling": "impaneling", - "imperilled": "imperiled", - "imperilling": "imperiling", - "individualise": "individualize", - "individualised": "individualized", - "individualises": "individualizes", - "individualising": "individualizing", - "industrialise": "industrialize", - "industrialised": "industrialized", - "industrialises": "industrializes", - "industrialising": "industrializing", - "inflexion": "inflection", - "inflexions": "inflections", - "initialise": "initialize", - "initialised": "initialized", - "initialises": "initializes", - "initialising": "initializing", - "initialled": "initialed", - "initialling": "initialing", - "instal": "install", - "instalment": "installment", - "instalments": "installments", - "instals": "installs", - "instil": "instill", - "instils": "instills", - "institutionalisation": "institutionalization", - "institutionalise": "institutionalize", - "institutionalised": "institutionalized", - "institutionalises": "institutionalizes", - "institutionalising": "institutionalizing", - "intellectualise": "intellectualize", - "intellectualised": "intellectualized", - 
"intellectualises": "intellectualizes", - "intellectualising": "intellectualizing", - "internalisation": "internalization", - "internalise": "internalize", - "internalised": "internalized", - "internalises": "internalizes", - "internalising": "internalizing", - "internationalisation": "internationalization", - "internationalise": "internationalize", - "internationalised": "internationalized", - "internationalises": "internationalizes", - "internationalising": "internationalizing", - "ionisation": "ionization", - "ionise": "ionize", - "ionised": "ionized", - "ioniser": "ionizer", - "ionisers": "ionizers", - "ionises": "ionizes", - "ionising": "ionizing", - "italicise": "italicize", - "italicised": "italicized", - "italicises": "italicizes", - "italicising": "italicizing", - "itemise": "itemize", - "itemised": "itemized", - "itemises": "itemizes", - "itemising": "itemizing", - "jeopardise": "jeopardize", - "jeopardised": "jeopardized", - "jeopardises": "jeopardizes", - "jeopardising": "jeopardizing", - "jewelled": "jeweled", - "jeweller": "jeweler", - "jewellers": "jewelers", - "jewellery": "jewelry", - "judgement ": "judgment", - "kilogramme": "kilogram", - "kilogrammes": "kilograms", - "kilometre": "kilometer", - "kilometres": "kilometers", - "labelled": "labeled", - "labelling": "labeling", - "labour": "labor", - "laboured": "labored", - "labourer": "laborer", - "labourers": "laborers", - "labouring": "laboring", - "labours": "labors", - "lacklustre": "lackluster", - "legalisation": "legalization", - "legalise": "legalize", - "legalised": "legalized", - "legalises": "legalizes", - "legalising": "legalizing", - "legitimise": "legitimize", - "legitimised": "legitimized", - "legitimises": "legitimizes", - "legitimising": "legitimizing", - "leukaemia": "leukemia", - "levelled": "leveled", - "leveller": "leveler", - "levellers": "levelers", - "levelling": "leveling", - "libelled": "libeled", - "libelling": "libeling", - "libellous": "libelous", - "liberalisation": "liberalization", - "liberalise": "liberalize", - "liberalised": "liberalized", - "liberalises": "liberalizes", - "liberalising": "liberalizing", - "licence": "license", - "licenced": "licensed", - "licences": "licenses", - "licencing": "licensing", - "likeable": "likable ", - "lionisation": "lionization", - "lionise": "lionize", - "lionised": "lionized", - "lionises": "lionizes", - "lionising": "lionizing", - "liquidise": "liquidize", - "liquidised": "liquidized", - "liquidiser": "liquidizer", - "liquidisers": "liquidizers", - "liquidises": "liquidizes", - "liquidising": "liquidizing", - "litre": "liter", - "litres": "liters", - "localise": "localize", - "localised": "localized", - "localises": "localizes", - "localising": "localizing", - "lovin": "loving", - "lovin'": "loving", - "louvre": "louver", - "louvred": "louvered", - "louvres": "louvers ", - "lustre": "luster", - "magnetise": "magnetize", - "magnetised": "magnetized", - "magnetises": "magnetizes", - "magnetising": "magnetizing", - "manoeuvrability": "maneuverability", - "manoeuvrable": "maneuverable", - "manoeuvre": "maneuver", - "manoeuvred": "maneuvered", - "manoeuvres": "maneuvers", - "manoeuvring": "maneuvering", - "manoeuvrings": "maneuverings", - "marginalisation": "marginalization", - "marginalise": "marginalize", - "marginalised": "marginalized", - "marginalises": "marginalizes", - "marginalising": "marginalizing", - "marshalled": "marshaled", - "marshalling": "marshaling", - "marvelled": "marveled", - "marvelling": "marveling", - "marvellous": "marvelous", - 
"marvellously": "marvelously", - "materialisation": "materialization", - "materialise": "materialize", - "materialised": "materialized", - "materialises": "materializes", - "materialising": "materializing", - "maximisation": "maximization", - "maximise": "maximize", - "maximised": "maximized", - "maximises": "maximizes", - "maximising": "maximizing", - "meagre": "meager", - "mechanisation": "mechanization", - "mechanise": "mechanize", - "mechanised": "mechanized", - "mechanises": "mechanizes", - "mechanising": "mechanizing", - "mediaeval": "medieval", - "memorialise": "memorialize", - "memorialised": "memorialized", - "memorialises": "memorializes", - "memorialising": "memorializing", - "memorise": "memorize", - "memorised": "memorized", - "memorises": "memorizes", - "memorising": "memorizing", - "mesmerise": "mesmerize", - "mesmerised": "mesmerized", - "mesmerises": "mesmerizes", - "mesmerising": "mesmerizing", - "metabolise": "metabolize", - "metabolised": "metabolized", - "metabolises": "metabolizes", - "metabolising": "metabolizing", - "metre": "meter", - "metres": "meters", - "micrometre": "micrometer", - "micrometres": "micrometers", - "militarise": "militarize", - "militarised": "militarized", - "militarises": "militarizes", - "militarising": "militarizing", - "milligramme": "milligram", - "milligrammes": "milligrams", - "millilitre": "milliliter", - "millilitres": "milliliters", - "millimetre": "millimeter", - "millimetres": "millimeters", - "miniaturisation": "miniaturization", - "miniaturise": "miniaturize", - "miniaturised": "miniaturized", - "miniaturises": "miniaturizes", - "miniaturising": "miniaturizing", - "minibuses": "minibusses ", - "minimise": "minimize", - "minimised": "minimized", - "minimises": "minimizes", - "minimising": "minimizing", - "misbehaviour": "misbehavior", - "misdemeanour": "misdemeanor", - "misdemeanours": "misdemeanors", - "misspelt": "misspelled ", - "mitre": "miter", - "mitres": "miters", - "mobilisation": "mobilization", - "mobilise": "mobilize", - "mobilised": "mobilized", - "mobilises": "mobilizes", - "mobilising": "mobilizing", - "modelled": "modeled", - "modeller": "modeler", - "modellers": "modelers", - "modelling": "modeling", - "modernise": "modernize", - "modernised": "modernized", - "modernises": "modernizes", - "modernising": "modernizing", - "moisturise": "moisturize", - "moisturised": "moisturized", - "moisturiser": "moisturizer", - "moisturisers": "moisturizers", - "moisturises": "moisturizes", - "moisturising": "moisturizing", - "monologue": "monolog", - "monologues": "monologs", - "monopolisation": "monopolization", - "monopolise": "monopolize", - "monopolised": "monopolized", - "monopolises": "monopolizes", - "monopolising": "monopolizing", - "moralise": "moralize", - "moralised": "moralized", - "moralises": "moralizes", - "moralising": "moralizing", - "motorised": "motorized", - "mould": "mold", - "moulded": "molded", - "moulder": "molder", - "mouldered": "moldered", - "mouldering": "moldering", - "moulders": "molders", - "mouldier": "moldier", - "mouldiest": "moldiest", - "moulding": "molding", - "mouldings": "moldings", - "moulds": "molds", - "mouldy": "moldy", - "moult": "molt", - "moulted": "molted", - "moulting": "molting", - "moults": "molts", - "moustache": "mustache", - "moustached": "mustached", - "moustaches": "mustaches", - "moustachioed": "mustachioed", - "multicoloured": "multicolored", - "nationalisation": "nationalization", - "nationalisations": "nationalizations", - "nationalise": "nationalize", - "nationalised": 
"nationalized", - "nationalises": "nationalizes", - "nationalising": "nationalizing", - "naturalisation": "naturalization", - "naturalise": "naturalize", - "naturalised": "naturalized", - "naturalises": "naturalizes", - "naturalising": "naturalizing", - "neighbour": "neighbor", - "neighbourhood": "neighborhood", - "neighbourhoods": "neighborhoods", - "neighbouring": "neighboring", - "neighbourliness": "neighborliness", - "neighbourly": "neighborly", - "neighbours": "neighbors", - "neutralisation": "neutralization", - "neutralise": "neutralize", - "neutralised": "neutralized", - "neutralises": "neutralizes", - "neutralising": "neutralizing", - "normalisation": "normalization", - "normalise": "normalize", - "normalised": "normalized", - "normalises": "normalizes", - "normalising": "normalizing", - "odour": "odor", - "odourless": "odorless", - "odours": "odors", - "oesophagus": "esophagus", - "oesophaguses": "esophaguses", - "oestrogen": "estrogen", - "offence": "offense", - "offences": "offenses", - "omelette": "omelet", - "omelettes": "omelets", - "optimise": "optimize", - "optimised": "optimized", - "optimises": "optimizes", - "optimising": "optimizing", - "organisation": "organization", - "organisational": "organizational", - "organisations": "organizations", - "organise": "organize", - "organised": "organized", - "organiser": "organizer", - "organisers": "organizers", - "organises": "organizes", - "organising": "organizing", - "orthopaedic": "orthopedic", - "orthopaedics": "orthopedics", - "ostracise": "ostracize", - "ostracised": "ostracized", - "ostracises": "ostracizes", - "ostracising": "ostracizing", - "outmanoeuvre": "outmaneuver", - "outmanoeuvred": "outmaneuvered", - "outmanoeuvres": "outmaneuvers", - "outmanoeuvring": "outmaneuvering", - "overemphasise": "overemphasize", - "overemphasised": "overemphasized", - "overemphasises": "overemphasizes", - "overemphasising": "overemphasizing", - "oxidisation": "oxidization", - "oxidise": "oxidize", - "oxidised": "oxidized", - "oxidises": "oxidizes", - "oxidising": "oxidizing", - "paederast": "pederast", - "paederasts": "pederasts", - "paediatric": "pediatric", - "paediatrician": "pediatrician", - "paediatricians": "pediatricians", - "paediatrics": "pediatrics", - "paedophile": "pedophile", - "paedophiles": "pedophiles", - "paedophilia": "pedophilia", - "palaeolithic": "paleolithic", - "palaeontologist": "paleontologist", - "palaeontologists": "paleontologists", - "palaeontology": "paleontology", - "panelled": "paneled", - "panelling": "paneling", - "panellist": "panelist", - "panellists": "panelists", - "paralyse": "paralyze", - "paralysed": "paralyzed", - "paralyses": "paralyzes", - "paralysing": "paralyzing", - "parcelled": "parceled", - "parcelling": "parceling", - "parlour": "parlor", - "parlours": "parlors", - "particularise": "particularize", - "particularised": "particularized", - "particularises": "particularizes", - "particularising": "particularizing", - "passivisation": "passivization", - "passivise": "passivize", - "passivised": "passivized", - "passivises": "passivizes", - "passivising": "passivizing", - "pasteurisation": "pasteurization", - "pasteurise": "pasteurize", - "pasteurised": "pasteurized", - "pasteurises": "pasteurizes", - "pasteurising": "pasteurizing", - "patronise": "patronize", - "patronised": "patronized", - "patronises": "patronizes", - "patronising": "patronizing", - "patronisingly": "patronizingly", - "pedalled": "pedaled", - "pedalling": "pedaling", - "pedestrianisation": "pedestrianization", - 
"pedestrianise": "pedestrianize", - "pedestrianised": "pedestrianized", - "pedestrianises": "pedestrianizes", - "pedestrianising": "pedestrianizing", - "penalise": "penalize", - "penalised": "penalized", - "penalises": "penalizes", - "penalising": "penalizing", - "pencilled": "penciled", - "pencilling": "penciling", - "personalise": "personalize", - "personalised": "personalized", - "personalises": "personalizes", - "personalising": "personalizing", - "pharmacopoeia": "pharmacopeia", - "pharmacopoeias": "pharmacopeias", - "philosophise": "philosophize", - "philosophised": "philosophized", - "philosophises": "philosophizes", - "philosophising": "philosophizing", - "philtre": "filter", - "philtres": "filters", - "phoney ": "phony ", - "plagiarise": "plagiarize", - "plagiarised": "plagiarized", - "plagiarises": "plagiarizes", - "plagiarising": "plagiarizing", - "plough": "plow", - "ploughed": "plowed", - "ploughing": "plowing", - "ploughman": "plowman", - "ploughmen": "plowmen", - "ploughs": "plows", - "ploughshare": "plowshare", - "ploughshares": "plowshares", - "polarisation": "polarization", - "polarise": "polarize", - "polarised": "polarized", - "polarises": "polarizes", - "polarising": "polarizing", - "politicisation": "politicization", - "politicise": "politicize", - "politicised": "politicized", - "politicises": "politicizes", - "politicising": "politicizing", - "popularisation": "popularization", - "popularise": "popularize", - "popularised": "popularized", - "popularises": "popularizes", - "popularising": "popularizing", - "pouffe": "pouf", - "pouffes": "poufs", - "practise": "practice", - "practised": "practiced", - "practises": "practices", - "practising ": "practicing ", - "praesidium": "presidium", - "praesidiums ": "presidiums ", - "pressurisation": "pressurization", - "pressurise": "pressurize", - "pressurised": "pressurized", - "pressurises": "pressurizes", - "pressurising": "pressurizing", - "pretence": "pretense", - "pretences": "pretenses", - "primaeval": "primeval", - "prioritisation": "prioritization", - "prioritise": "prioritize", - "prioritised": "prioritized", - "prioritises": "prioritizes", - "prioritising": "prioritizing", - "privatisation": "privatization", - "privatisations": "privatizations", - "privatise": "privatize", - "privatised": "privatized", - "privatises": "privatizes", - "privatising": "privatizing", - "professionalisation": "professionalization", - "professionalise": "professionalize", - "professionalised": "professionalized", - "professionalises": "professionalizes", - "professionalising": "professionalizing", - "programme": "program", - "programmes": "programs", - "prologue": "prolog", - "prologues": "prologs", - "propagandise": "propagandize", - "propagandised": "propagandized", - "propagandises": "propagandizes", - "propagandising": "propagandizing", - "proselytise": "proselytize", - "proselytised": "proselytized", - "proselytiser": "proselytizer", - "proselytisers": "proselytizers", - "proselytises": "proselytizes", - "proselytising": "proselytizing", - "psychoanalyse": "psychoanalyze", - "psychoanalysed": "psychoanalyzed", - "psychoanalyses": "psychoanalyzes", - "psychoanalysing": "psychoanalyzing", - "publicise": "publicize", - "publicised": "publicized", - "publicises": "publicizes", - "publicising": "publicizing", - "pulverisation": "pulverization", - "pulverise": "pulverize", - "pulverised": "pulverized", - "pulverises": "pulverizes", - "pulverising": "pulverizing", - "pummelled": "pummel", - "pummelling": "pummeled", - "pyjama": "pajama", - 
"pyjamas": "pajamas", - "pzazz": "pizzazz", - "quarrelled": "quarreled", - "quarrelling": "quarreling", - "radicalise": "radicalize", - "radicalised": "radicalized", - "radicalises": "radicalizes", - "radicalising": "radicalizing", - "rancour": "rancor", - "randomise": "randomize", - "randomised": "randomized", - "randomises": "randomizes", - "randomising": "randomizing", - "rationalisation": "rationalization", - "rationalisations": "rationalizations", - "rationalise": "rationalize", - "rationalised": "rationalized", - "rationalises": "rationalizes", - "rationalising": "rationalizing", - "ravelled": "raveled", - "ravelling": "raveling", - "realisable": "realizable", - "realisation": "realization", - "realisations": "realizations", - "realise": "realize", - "realised": "realized", - "realises": "realizes", - "realising": "realizing", - "recognisable": "recognizable", - "recognisably": "recognizably", - "recognisance": "recognizance", - "recognise": "recognize", - "recognised": "recognized", - "recognises": "recognizes", - "recognising": "recognizing", - "reconnoitre": "reconnoiter", - "reconnoitred": "reconnoitered", - "reconnoitres": "reconnoiters", - "reconnoitring": "reconnoitering", - "refuelled": "refueled", - "refuelling": "refueling", - "regularisation": "regularization", - "regularise": "regularize", - "regularised": "regularized", - "regularises": "regularizes", - "regularising": "regularizing", - "remodelled": "remodeled", - "remodelling": "remodeling", - "remould": "remold", - "remoulded": "remolded", - "remoulding": "remolding", - "remoulds": "remolds", - "reorganisation": "reorganization", - "reorganisations": "reorganizations", - "reorganise": "reorganize", - "reorganised": "reorganized", - "reorganises": "reorganizes", - "reorganising": "reorganizing", - "revelled": "reveled", - "reveller": "reveler", - "revellers": "revelers", - "revelling": "reveling", - "revitalise": "revitalize", - "revitalised": "revitalized", - "revitalises": "revitalizes", - "revitalising": "revitalizing", - "revolutionise": "revolutionize", - "revolutionised": "revolutionized", - "revolutionises": "revolutionizes", - "revolutionising": "revolutionizing", - "rhapsodise": "rhapsodize", - "rhapsodised": "rhapsodized", - "rhapsodises": "rhapsodizes", - "rhapsodising": "rhapsodizing", - "rigour": "rigor", - "rigours": "rigors", - "ritualised": "ritualized", - "rivalled": "rivaled", - "rivalling": "rivaling", - "romanticise": "romanticize", - "romanticised": "romanticized", - "romanticises": "romanticizes", - "romanticising": "romanticizing", - "rumour": "rumor", - "rumoured": "rumored", - "rumours": "rumors", - "sabre": "saber", - "sabres": "sabers", - "saltpetre": "saltpeter", - "sanitise": "sanitize", - "sanitised": "sanitized", - "sanitises": "sanitizes", - "sanitising": "sanitizing", - "satirise": "satirize", - "satirised": "satirized", - "satirises": "satirizes", - "satirising": "satirizing", - "saviour": "savior", - "saviours": "saviors", - "savour": "savor", - "savoured": "savored", - "savouries": "savories", - "savouring": "savoring", - "savours": "savors", - "savoury": "savory", - "scandalise": "scandalize", - "scandalised": "scandalized", - "scandalises": "scandalizes", - "scandalising": "scandalizing", - "sceptic": "skeptic", - "sceptical": "skeptical", - "sceptically": "skeptically", - "scepticism": "skepticism", - "sceptics": "skeptics", - "sceptre": "scepter", - "sceptres": "scepters", - "scrutinise": "scrutinize", - "scrutinised": "scrutinized", - "scrutinises": "scrutinizes", - 
"scrutinising": "scrutinizing", - "secularisation": "secularization", - "secularise": "secularize", - "secularised": "secularized", - "secularises": "secularizes", - "secularising": "secularizing", - "sensationalise": "sensationalize", - "sensationalised": "sensationalized", - "sensationalises": "sensationalizes", - "sensationalising": "sensationalizing", - "sensitise": "sensitize", - "sensitised": "sensitized", - "sensitises": "sensitizes", - "sensitising": "sensitizing", - "sentimentalise": "sentimentalize", - "sentimentalised": "sentimentalized", - "sentimentalises": "sentimentalizes", - "sentimentalising": "sentimentalizing", - "sepulchre": "sepulcher", - "sepulchres": "sepulchers ", - "serialisation": "serialization", - "serialisations": "serializations", - "serialise": "serialize", - "serialised": "serialized", - "serialises": "serializes", - "serialising": "serializing", - "sermonise": "sermonize", - "sermonised": "sermonized", - "sermonises": "sermonizes", - "sermonising": "sermonizing", - "sheikh ": "sheik ", - "shovelled": "shoveled", - "shovelling": "shoveling", - "shrivelled": "shriveled", - "shrivelling": "shriveling", - "signalise": "signalize", - "signalised": "signalized", - "signalises": "signalizes", - "signalising": "signalizing", - "signalled": "signaled", - "signalling": "signaling", - "smoulder": "smolder", - "smouldered": "smoldered", - "smouldering": "smoldering", - "smoulders": "smolders", - "snivelled": "sniveled", - "snivelling": "sniveling", - "snorkelled": "snorkeled", - "snorkelling": "snorkeling", - "snowplough": "snowplow", - "snowploughs": "snowplow", - "socialisation": "socialization", - "socialise": "socialize", - "socialised": "socialized", - "socialises": "socializes", - "socialising": "socializing", - "sodomise": "sodomize", - "sodomised": "sodomized", - "sodomises": "sodomizes", - "sodomising": "sodomizing", - "solemnise": "solemnize", - "solemnised": "solemnized", - "solemnises": "solemnizes", - "solemnising": "solemnizing", - "sombre": "somber", - "specialisation": "specialization", - "specialisations": "specializations", - "specialise": "specialize", - "specialised": "specialized", - "specialises": "specializes", - "specialising": "specializing", - "spectre": "specter", - "spectres": "specters", - "spiralled": "spiraled", - "spiralling": "spiraling", - "splendour": "splendor", - "splendours": "splendors", - "squirrelled": "squirreled", - "squirrelling": "squirreling", - "stabilisation": "stabilization", - "stabilise": "stabilize", - "stabilised": "stabilized", - "stabiliser": "stabilizer", - "stabilisers": "stabilizers", - "stabilises": "stabilizes", - "stabilising": "stabilizing", - "standardisation": "standardization", - "standardise": "standardize", - "standardised": "standardized", - "standardises": "standardizes", - "standardising": "standardizing", - "stencilled": "stenciled", - "stencilling": "stenciling", - "sterilisation": "sterilization", - "sterilisations": "sterilizations", - "sterilise": "sterilize", - "sterilised": "sterilized", - "steriliser": "sterilizer", - "sterilisers": "sterilizers", - "sterilises": "sterilizes", - "sterilising": "sterilizing", - "stigmatisation": "stigmatization", - "stigmatise": "stigmatize", - "stigmatised": "stigmatized", - "stigmatises": "stigmatizes", - "stigmatising": "stigmatizing", - "storey": "story", - "storeys": "stories", - "subsidisation": "subsidization", - "subsidise": "subsidize", - "subsidised": "subsidized", - "subsidiser": "subsidizer", - "subsidisers": "subsidizers", - "subsidises": 
"subsidizes", - "subsidising": "subsidizing", - "succour": "succor", - "succoured": "succored", - "succouring": "succoring", - "succours": "succors", - "sulphate": "sulfate", - "sulphates": "sulfates", - "sulphide": "sulfide", - "sulphides": "sulfides", - "sulphur": "sulfur", - "sulphurous": "sulfurous", - "summarise": "summarize", - "summarised": "summarized", - "summarises": "summarizes", - "summarising": "summarizing", - "swivelled": "swiveled", - "swivelling": "swiveling", - "symbolise": "symbolize", - "symbolised": "symbolized", - "symbolises": "symbolizes", - "symbolising": "symbolizing", - "sympathise": "sympathize", - "sympathised": "sympathized", - "sympathiser": "sympathizer", - "sympathisers": "sympathizers", - "sympathises": "sympathizes", - "sympathising": "sympathizing", - "synchronisation": "synchronization", - "synchronise": "synchronize", - "synchronised": "synchronized", - "synchronises": "synchronizes", - "synchronising": "synchronizing", - "synthesise": "synthesize", - "synthesised": "synthesized", - "synthesiser": "synthesizer", - "synthesisers": "synthesizers", - "synthesises": "synthesizes", - "synthesising": "synthesizing", - "syphon": "siphon", - "syphoned": "siphoned", - "syphoning": "siphoning", - "syphons": "siphons", - "systematisation": "systematization", - "systematise": "systematize", - "systematised": "systematized", - "systematises": "systematizes", - "systematising": "systematizing", - "tantalise": "tantalize", - "tantalised": "tantalized", - "tantalises": "tantalizes", - "tantalising": "tantalizing", - "tantalisingly": "tantalizingly", - "tasselled": "tasseled", - "technicolour": "technicolor", - "temporise": "temporize", - "temporised": "temporized", - "temporises": "temporizes", - "temporising": "temporizing", - "tenderise": "tenderize", - "tenderised": "tenderized", - "tenderises": "tenderizes", - "tenderising": "tenderizing", - "terrorise": "terrorize", - "terrorised": "terrorized", - "terrorises": "terrorizes", - "terrorising": "terrorizing", - "theatre": "theater", - "theatregoer": "theatergoer", - "theatregoers": "theatergoers", - "theatres": "theaters", - "theorise": "theorize", - "theorised": "theorized", - "theorises": "theorizes", - "theorising": "theorizing", - "tonne": "ton", - "tonnes": "tons", - "towelled": "toweled", - "towelling": "toweling", - "toxaemia": "toxemia", - "tranquillise": "tranquilize", - "tranquillised": "tranquilized", - "tranquilliser": "tranquilizer", - "tranquillisers": "tranquilizers", - "tranquillises": "tranquilizes", - "tranquillising": "tranquilizing", - "tranquillity": "tranquility", - "tranquillize": "tranquilize", - "tranquillized": "tranquilized", - "tranquillizer": "tranquilizer", - "tranquillizers": "tranquilizers", - "tranquillizes": "tranquilizes", - "tranquillizing": "tranquilizing", - "tranquilly": "tranquility", - "transistorised": "transistorized", - "traumatise": "traumatize", - "traumatised": "traumatized", - "traumatises": "traumatizes", - "traumatising": "traumatizing", - "travelled": "traveled", - "traveller": "traveler", - "travellers": "travelers", - "travelling": "traveling", - "travelogue": "travelog", - "travelogues ": "travelogs ", - "trialled": "trialed", - "trialling": "trialing", - "tricolour": "tricolor", - "tricolours": "tricolors", - "trivialise": "trivialize", - "trivialised": "trivialized", - "trivialises": "trivializes", - "trivialising": "trivializing", - "tumour": "tumor", - "tumours": "tumors", - "tunnelled": "tunneled", - "tunnelling": "tunneling", - "tyrannise": "tyrannize", - 
"tyrannised": "tyrannized", - "tyrannises": "tyrannizes", - "tyrannising": "tyrannizing", - "tyre": "tire", - "tyres": "tires", - "unauthorised": "unauthorized", - "uncivilised": "uncivilized", - "underutilised": "underutilized", - "unequalled": "unequaled", - "unfavourable": "unfavorable", - "unfavourably": "unfavorably", - "unionisation": "unionization", - "unionise": "unionize", - "unionised": "unionized", - "unionises": "unionizes", - "unionising": "unionizing", - "unorganised": "unorganized", - "unravelled": "unraveled", - "unravelling": "unraveling", - "unrecognisable": "unrecognizable", - "unrecognised": "unrecognized", - "unrivalled": "unrivaled", - "unsavoury": "unsavory", - "untrammelled": "untrammeled", - "urbanisation": "urbanization", - "urbanise": "urbanize", - "urbanised": "urbanized", - "urbanises": "urbanizes", - "urbanising": "urbanizing", - "utilisable": "utilizable", - "utilisation": "utilization", - "utilise": "utilize", - "utilised": "utilized", - "utilises": "utilizes", - "utilising": "utilizing", - "valour": "valor", - "vandalise": "vandalize", - "vandalised": "vandalized", - "vandalises": "vandalizes", - "vandalising": "vandalizing", - "vaporisation": "vaporization", - "vaporise": "vaporize", - "vaporised": "vaporized", - "vaporises": "vaporizes", - "vaporising": "vaporizing", - "vapour": "vapor", - "vapours": "vapors", - "verbalise": "verbalize", - "verbalised": "verbalized", - "verbalises": "verbalizes", - "verbalising": "verbalizing", - "victimisation": "victimization", - "victimise": "victimize", - "victimised": "victimized", - "victimises": "victimizes", - "victimising": "victimizing", - "videodisc": "videodisk", - "videodiscs": "videodisks", - "vigour": "vigor", - "visualisation": "visualization", - "visualisations": "visualizations", - "visualise": "visualize", - "visualised": "visualized", - "visualises": "visualizes", - "visualising": "visualizing", - "vocalisation": "vocalization", - "vocalisations": "vocalizations", - "vocalise": "vocalize", - "vocalised": "vocalized", - "vocalises": "vocalizes", - "vocalising": "vocalizing", - "vulcanised": "vulcanized", - "vulgarisation": "vulgarization", - "vulgarise": "vulgarize", - "vulgarised": "vulgarized", - "vulgarises": "vulgarizes", - "vulgarising": "vulgarizing", - "waggon": "wagon", - "waggons": "wagons", - "watercolour": "watercolor", - "watercolours": "watercolors", - "weaselled": "weaseled", - "weaselling": "weaseling", - "westernisation": "westernization", - "westernise": "westernize", - "westernised": "westernized", - "westernises": "westernizes", - "westernising": "westernizing", - "womanise": "womanize", - "womanised": "womanized", - "womaniser": "womanizer", - "womanisers": "womanizers", - "womanises": "womanizes", - "womanising": "womanizing", - "woollen": "woolen", - "woollens": "woolens", - "woollies": "woolies", - "woolly": "wooly", - "worshipped ": "worshiped", - "worshipping ": "worshiping ", - "worshipper": "worshiper", - "yodelled": "yodeled", - "yodelling": "yodeling", - "yoghourt": "yogurt", - "yoghourts": "yogurts", - "yoghurt": "yogurt", - "yoghurts": "yogurts", -} - - -NORM_EXCEPTIONS = {} - -for string, norm in _exc.items(): - NORM_EXCEPTIONS[string] = norm - NORM_EXCEPTIONS[string.title()] = norm diff --git a/spacy/lang/en/syntax_iterators.py b/spacy/lang/en/syntax_iterators.py index 86695cf6f..c41120afb 100644 --- a/spacy/lang/en/syntax_iterators.py +++ b/spacy/lang/en/syntax_iterators.py @@ -1,7 +1,8 @@ from ...symbols import NOUN, PROPN, PRON +from ...errors import Errors -def 
noun_chunks(obj): +def noun_chunks(doclike): """ Detect base noun phrases from a dependency parse. Works on both Doc and Span. """ @@ -16,12 +17,16 @@ def noun_chunks(obj): "attr", "ROOT", ] - doc = obj.doc # Ensure works on both Doc and Span. + doc = doclike.doc # Ensure works on both Doc and Span. + + if not doc.is_parsed: + raise ValueError(Errors.E029) + np_deps = [doc.vocab.strings.add(label) for label in labels] conj = doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") seen = set() - for i, word in enumerate(obj): + for i, word in enumerate(doclike): if word.pos not in (NOUN, PROPN, PRON): continue # Prevent nested chunks from being produced diff --git a/spacy/lang/en/tokenizer_exceptions.py b/spacy/lang/en/tokenizer_exceptions.py index 3e8075ec4..908ac3940 100644 --- a/spacy/lang/en/tokenizer_exceptions.py +++ b/spacy/lang/en/tokenizer_exceptions.py @@ -74,12 +74,12 @@ for pron in ["i", "you", "he", "she", "it", "we", "they"]: _exc[orth + "'d"] = [ {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "'d", LEMMA: "would", NORM: "would", TAG: "MD"}, + {ORTH: "'d", NORM: "'d"}, ] _exc[orth + "d"] = [ {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "d", LEMMA: "would", NORM: "would", TAG: "MD"}, + {ORTH: "d", NORM: "'d"}, ] _exc[orth + "'d've"] = [ @@ -192,7 +192,10 @@ for word in ["who", "what", "when", "where", "why", "how", "there", "that"]: {ORTH: "'d", NORM: "'d"}, ] - _exc[orth + "d"] = [{ORTH: orth, LEMMA: word, NORM: word}, {ORTH: "d"}] + _exc[orth + "d"] = [ + {ORTH: orth, LEMMA: word, NORM: word}, + {ORTH: "d", NORM: "'d"}, + ] _exc[orth + "'d've"] = [ {ORTH: orth, LEMMA: word, NORM: word}, diff --git a/spacy/lang/es/__init__.py b/spacy/lang/es/__init__.py index 060bd8fc6..f3b1f756e 100644 --- a/spacy/lang/es/__init__.py +++ b/spacy/lang/es/__init__.py @@ -3,6 +3,7 @@ from .tag_map import TAG_MAP from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .syntax_iterators import SYNTAX_ITERATORS +from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..norm_exceptions import BASE_NORMS @@ -20,6 +21,8 @@ class SpanishDefaults(Language.Defaults): ) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tag_map = TAG_MAP + infixes = TOKENIZER_INFIXES + suffixes = TOKENIZER_SUFFIXES stop_words = STOP_WORDS syntax_iterators = SYNTAX_ITERATORS diff --git a/spacy/lang/es/lex_attrs.py b/spacy/lang/es/lex_attrs.py index d2a3c891a..988dbaba1 100644 --- a/spacy/lang/es/lex_attrs.py +++ b/spacy/lang/es/lex_attrs.py @@ -23,6 +23,15 @@ _num_words = [ "dieciocho", "diecinueve", "veinte", + "veintiuno", + "veintidós", + "veintitrés", + "veinticuatro", + "veinticinco", + "veintiséis", + "veintisiete", + "veintiocho", + "veintinueve", "treinta", "cuarenta", "cincuenta", diff --git a/spacy/lang/es/punctuation.py b/spacy/lang/es/punctuation.py new file mode 100644 index 000000000..f989221c2 --- /dev/null +++ b/spacy/lang/es/punctuation.py @@ -0,0 +1,47 @@ +# coding: utf8 +from __future__ import unicode_literals + +from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES +from ..char_classes import LIST_ICONS, CURRENCY, LIST_UNITS, PUNCT +from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA +from ..char_classes import merge_chars + + +_list_units = [u for u in LIST_UNITS if u != "%"] +_units = merge_chars(" ".join(_list_units)) +_concat_quotes = CONCAT_QUOTES + "—–" + + +_suffixes = ( + ["—", "–"] + + LIST_PUNCT + + 
LIST_ELLIPSES + + LIST_QUOTES + + LIST_ICONS + + [ + r"(?<=[0-9])\+", + r"(?<=°[FfCcKk])\.", + r"(?<=[0-9])(?:{c})".format(c=CURRENCY), + r"(?<=[0-9])(?:{u})".format(u=_units), + r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format( + al=ALPHA_LOWER, e=r"%²\-\+", q=_concat_quotes, p=PUNCT + ), + r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER), + ] +) + +_infixes = ( + LIST_ELLIPSES + + LIST_ICONS + + [ + r"(?<=[0-9])[+\*^](?=[0-9-])", + r"(?<=[{al}{q}])\.(?=[{au}{q}])".format( + al=ALPHA_LOWER, au=ALPHA_UPPER, q=_concat_quotes + ), + r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA), + r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), + ] +) + +TOKENIZER_SUFFIXES = _suffixes +TOKENIZER_INFIXES = _infixes diff --git a/spacy/lang/es/syntax_iterators.py b/spacy/lang/es/syntax_iterators.py index e998cd1d6..3c65bd441 100644 --- a/spacy/lang/es/syntax_iterators.py +++ b/spacy/lang/es/syntax_iterators.py @@ -1,8 +1,13 @@ from ...symbols import NOUN, PROPN, PRON, VERB, AUX +from ...errors import Errors -def noun_chunks(obj): - doc = obj.doc +def noun_chunks(doclike): + doc = doclike.doc + + if not doc.is_parsed: + raise ValueError(Errors.E029) + if not len(doc): return np_label = doc.vocab.strings.add("NP") @@ -13,7 +18,7 @@ def noun_chunks(obj): np_right_deps = [doc.vocab.strings.add(label) for label in right_labels] stop_deps = [doc.vocab.strings.add(label) for label in stop_labels] token = doc[0] - while token and token.i < len(doc): + while token and token.i < len(doclike): if token.pos in [PROPN, NOUN, PRON]: left, right = noun_bounds( doc, token, np_left_deps, np_right_deps, stop_deps diff --git a/spacy/lang/es/tokenizer_exceptions.py b/spacy/lang/es/tokenizer_exceptions.py index d5eb42e29..7836f1c43 100644 --- a/spacy/lang/es/tokenizer_exceptions.py +++ b/spacy/lang/es/tokenizer_exceptions.py @@ -39,14 +39,16 @@ for orth in [ "Av.", "Avda.", "Cía.", + "EE.UU.", "etc.", + "fig.", "Gob.", "Gral.", "Ing.", "J.C.", + "km/h", "Lic.", "m.n.", - "no.", "núm.", "P.D.", "Prof.", diff --git a/spacy/lang/fa/__init__.py b/spacy/lang/fa/__init__.py index aa02855e9..4c5a7074c 100644 --- a/spacy/lang/fa/__init__.py +++ b/spacy/lang/fa/__init__.py @@ -7,6 +7,7 @@ from .lex_attrs import LEX_ATTRS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tag_map import TAG_MAP from .punctuation import TOKENIZER_SUFFIXES +from .syntax_iterators import SYNTAX_ITERATORS class PersianDefaults(Language.Defaults): @@ -21,6 +22,7 @@ class PersianDefaults(Language.Defaults): tag_map = TAG_MAP suffixes = TOKENIZER_SUFFIXES writing_system = {"direction": "rtl", "has_case": False, "has_letters": True} + syntax_iterators = SYNTAX_ITERATORS class Persian(Language): diff --git a/spacy/lang/fa/syntax_iterators.py b/spacy/lang/fa/syntax_iterators.py index 86695cf6f..c41120afb 100644 --- a/spacy/lang/fa/syntax_iterators.py +++ b/spacy/lang/fa/syntax_iterators.py @@ -1,7 +1,8 @@ from ...symbols import NOUN, PROPN, PRON +from ...errors import Errors -def noun_chunks(obj): +def noun_chunks(doclike): """ Detect base noun phrases from a dependency parse. Works on both Doc and Span. """ @@ -16,12 +17,16 @@ def noun_chunks(obj): "attr", "ROOT", ] - doc = obj.doc # Ensure works on both Doc and Span. + doc = doclike.doc # Ensure works on both Doc and Span. 
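The same refactor recurs in each of these syntax-iterator hunks: noun_chunks now takes a doclike argument (a Doc or a Span), resolves it to the underlying Doc, and, through the guard added just below, raises Errors.E029 instead of silently yielding nothing when the Doc has no dependency parse. A minimal sketch of the resulting behaviour, assuming the en_core_web_sm model is installed (the sample sentence is illustrative):

import spacy

# With the parser disabled, doc.is_parsed stays False, so the new guard fires.
nlp = spacy.load("en_core_web_sm", disable=["parser"])
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
try:
    list(doc.noun_chunks)  # delegates to the noun_chunks iterator in this diff
except ValueError as err:
    print(err)  # E029: noun_chunks requires the dependency parse ...

# With the parser enabled, base noun phrases are yielded as before.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
print([chunk.text for chunk in doc.noun_chunks])
# ['Autonomous cars', 'insurance liability', 'manufacturers']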
+ + if not doc.is_parsed: + raise ValueError(Errors.E029) + np_deps = [doc.vocab.strings.add(label) for label in labels] conj = doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") seen = set() - for i, word in enumerate(obj): + for i, word in enumerate(doclike): if word.pos not in (NOUN, PROPN, PRON): continue # Prevent nested chunks from being produced diff --git a/spacy/lang/fr/syntax_iterators.py b/spacy/lang/fr/syntax_iterators.py index 96636b0b7..c09b0e840 100644 --- a/spacy/lang/fr/syntax_iterators.py +++ b/spacy/lang/fr/syntax_iterators.py @@ -1,7 +1,8 @@ from ...symbols import NOUN, PROPN, PRON +from ...errors import Errors -def noun_chunks(obj): +def noun_chunks(doclike): """ Detect base noun phrases from a dependency parse. Works on both Doc and Span. """ @@ -15,12 +16,16 @@ def noun_chunks(obj): "nmod", "nmod:poss", ] - doc = obj.doc # Ensure works on both Doc and Span. + doc = doclike.doc # Ensure works on both Doc and Span. + + if not doc.is_parsed: + raise ValueError(Errors.E029) + np_deps = [doc.vocab.strings[label] for label in labels] conj = doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") seen = set() - for i, word in enumerate(obj): + for i, word in enumerate(doclike): if word.pos not in (NOUN, PROPN, PRON): continue # Prevent nested chunks from being produced diff --git a/spacy/lang/fr/tokenizer_exceptions.py b/spacy/lang/fr/tokenizer_exceptions.py index 2a87e7d7a..7bf4922d8 100644 --- a/spacy/lang/fr/tokenizer_exceptions.py +++ b/spacy/lang/fr/tokenizer_exceptions.py @@ -458,5 +458,5 @@ _regular_exp.append(URL_PATTERN) TOKENIZER_EXCEPTIONS = _exc TOKEN_MATCH = re.compile( - "|".join("(?:{})".format(m) for m in _regular_exp), re.IGNORECASE | re.UNICODE + "(?iu)" + "|".join("(?:{})".format(m) for m in _regular_exp) ).match diff --git a/spacy/lang/gu/__init__.py b/spacy/lang/gu/__init__.py new file mode 100644 index 000000000..1f080c7c2 --- /dev/null +++ b/spacy/lang/gu/__init__.py @@ -0,0 +1,18 @@ +# coding: utf8 +from __future__ import unicode_literals + +from .stop_words import STOP_WORDS + +from ...language import Language + + +class GujaratiDefaults(Language.Defaults): + stop_words = STOP_WORDS + + +class Gujarati(Language): + lang = "gu" + Defaults = GujaratiDefaults + + +__all__ = ["Gujarati"] diff --git a/spacy/lang/gu/examples.py b/spacy/lang/gu/examples.py new file mode 100644 index 000000000..202a8d022 --- /dev/null +++ b/spacy/lang/gu/examples.py @@ -0,0 +1,22 @@ +# coding: utf8 +from __future__ import unicode_literals + + +""" +Example sentences to test spaCy and its language models. 
+ +>>> from spacy.lang.gu.examples import sentences +>>> docs = nlp.pipe(sentences) +""" + + +sentences = [ + "લોકશાહી એ સરકારનું એક એવું તંત્ર છે જ્યાં નાગરિકો મત દ્વારા સત્તાનો ઉપયોગ કરે છે.", + "તે ગુજરાત રાજ્યના ધરમપુર શહેરમાં આવેલું હતું", + "કર્ણદેવ પહેલો સોલંકી વંશનો રાજા હતો", + "તેજપાળને બે પત્ની હતી", + "ગુજરાતમાં ભારતીય જનતા પક્ષનો ઉદય આ સમયગાળા દરમિયાન થયો", + "આંદોલનકારીઓએ ચીમનભાઇ પટેલના રાજીનામાની માંગણી કરી.", + "અહિયાં શું જોડાય છે?", + "મંદિરનો પૂર્વાભિમુખ ભાગ નાના મંડપ સાથે થોડો લંબચોરસ આકારનો છે.", +] diff --git a/spacy/lang/gu/stop_words.py b/spacy/lang/gu/stop_words.py new file mode 100644 index 000000000..85d33763d --- /dev/null +++ b/spacy/lang/gu/stop_words.py @@ -0,0 +1,91 @@ +# coding: utf8 +from __future__ import unicode_literals + +STOP_WORDS = set( + """ +એમ +આ +એ +રહી +છે +છો +હતા +હતું +હતી +હોય +હતો +શકે +તે +તેના +તેનું +તેને +તેની +તેઓ +તેમને +તેમના +તેમણે +તેમનું +તેમાં +અને +અહીં +થી +થઈ +થાય +જે + ને +કે +ના +ની +નો +ને +નું +શું +માં +પણ +પર +જેવા +જેવું +જાય +જેમ +જેથી +માત્ર +માટે +પરથી +આવ્યું +એવી +આવી +રીતે +સુધી +થાય +થઈ +સાથે +લાગે +હોવા +છતાં +રહેલા +કરી +કરે +કેટલા +કોઈ +કેમ +કર્યો +કર્યુ +કરે +સૌથી +ત્યારબાદ +તથા +દ્વારા +જુઓ +જાઓ +જ્યારે +ત્યારે +શકો +નથી +હવે +અથવા +થતો +દર +એટલો +પરંતુ +""".split() +) diff --git a/spacy/lang/hy/__init__.py b/spacy/lang/hy/__init__.py new file mode 100644 index 000000000..6aaa965bb --- /dev/null +++ b/spacy/lang/hy/__init__.py @@ -0,0 +1,26 @@ +# coding: utf8 +from __future__ import unicode_literals + +from .stop_words import STOP_WORDS +from .lex_attrs import LEX_ATTRS +from .tag_map import TAG_MAP + +from ...attrs import LANG +from ...language import Language + + +class ArmenianDefaults(Language.Defaults): + lex_attr_getters = dict(Language.Defaults.lex_attr_getters) + lex_attr_getters[LANG] = lambda text: "hy" + + lex_attr_getters.update(LEX_ATTRS) + stop_words = STOP_WORDS + tag_map = TAG_MAP + + +class Armenian(Language): + lang = "hy" + Defaults = ArmenianDefaults + + +__all__ = ["Armenian"] diff --git a/spacy/lang/hy/examples.py b/spacy/lang/hy/examples.py new file mode 100644 index 000000000..323f77b1c --- /dev/null +++ b/spacy/lang/hy/examples.py @@ -0,0 +1,16 @@ +# coding: utf8 +from __future__ import unicode_literals + +""" +Example sentences to test spaCy and its language models. 
+>>> from spacy.lang.hy.examples import sentences +>>> docs = nlp.pipe(sentences) +""" + + +sentences = [ + "Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։", + "Ո՞վ է Ֆրանսիայի նախագահը։", + "Որն է Միացյալ Նահանգների մայրաքաղաքը։", + "Ե՞րբ է ծնվել Բարաք Օբաման։", +] diff --git a/spacy/lang/hy/lex_attrs.py b/spacy/lang/hy/lex_attrs.py new file mode 100644 index 000000000..910625fb8 --- /dev/null +++ b/spacy/lang/hy/lex_attrs.py @@ -0,0 +1,59 @@ +# coding: utf8 +from __future__ import unicode_literals + +from ...attrs import LIKE_NUM + + +_num_words = [ + "զրօ", + "մէկ", + "երկու", + "երեք", + "չորս", + "հինգ", + "վեց", + "յոթ", + "ութ", + "ինը", + "տասը", + "տասնմեկ", + "տասներկու", + "տասներեք", + "տասնչորս", + "տասնհինգ", + "տասնվեց", + "տասնյոթ", + "տասնութ", + "տասնինը", + "քսան", "երեսուն", + "քառասուն", + "հիսուն", + "վաթսուն", + "յոթանասուն", + "ութսուն", + "իննսուն", + "հարյուր", + "հազար", + "միլիոն", + "միլիարդ", + "տրիլիոն", + "քվինտիլիոն", +] + + +def like_num(text): + if text.startswith(("+", "-", "±", "~")): + text = text[1:] + text = text.replace(",", "").replace(".", "") + if text.isdigit(): + return True + if text.count("/") == 1: + num, denom = text.split("/") + if num.isdigit() and denom.isdigit(): + return True + if text.lower() in _num_words: + return True + return False + + +LEX_ATTRS = {LIKE_NUM: like_num} diff --git a/spacy/lang/hy/stop_words.py b/spacy/lang/hy/stop_words.py new file mode 100644 index 000000000..d75aad6e2 --- /dev/null +++ b/spacy/lang/hy/stop_words.py @@ -0,0 +1,110 @@ +# coding: utf8 +from __future__ import unicode_literals + +STOP_WORDS = set( + """ +նա +ողջը +այստեղ +ենք +նա +էիր +որպես +ուրիշ +բոլորը +այն +այլ +նույնչափ +էի +մի +և +ողջ +ես +ոմն +հետ +նրանք +ամենքը +ըստ +ինչ-ինչ +այսպես +համայն +մի +նաև +նույնքան +դա +ովևէ +համար +այնտեղ +էին +որոնք +սույն +ինչ-որ +ամենը +նույնպիսի +ու +իր +որոշ +միևնույն +ի +այնպիսի +մենք +ամեն ոք +նույն +երբևէ +այն +որևէ +ին +այդպես +նրա +որը +վրա +դու +էինք +այդպիսի +էիք +յուրաքանչյուրը +եմ +պիտի +այդ +ամբողջը +հետո +եք +ամեն +այլ +կամ +այսքան +որ +այնպես +այսինչ +բոլոր +է +մեկնումեկը +այդչափ +այնքան +ամբողջ +երբևիցե +այնչափ +ամենայն +մյուս +այնինչ +իսկ +այդտեղ +այս +սա +են +ամեն ինչ +որևիցե +ում +մեկը +այդ +դուք +այսչափ +այդքան +այսպիսի +էր +յուրաքանչյուր +այս +մեջ +թ +""".split() +) diff --git a/spacy/lang/hy/tag_map.py b/spacy/lang/hy/tag_map.py new file mode 100644 index 000000000..722270110 --- /dev/null +++ b/spacy/lang/hy/tag_map.py @@ -0,0 +1,2478 @@ +# coding: utf8 +from __future__ import unicode_literals + +from ...symbols import POS, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN +from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ + +TAG_MAP = { + "ADJ_Abbr=Yes": {POS: ADJ, "Abbr": "Yes"}, + "ADJ_Degree=Pos|NumForm=Word|NumType=Ord": { + POS: ADJ, + "Degree": "Pos", + "NumForm": "Word", + "NumType": "Ord", + }, + "ADJ_Degree=Pos": {POS: ADJ, "Degree": "Pos"}, + "ADJ_Degree=Pos|Style=Coll": {POS: ADJ, "Degree": "Pos", "Style": "Coll"}, + "ADJ_Degree=Pos|Style=Expr": {POS: ADJ, "Degree": "Pos", "Style": "Expr"}, + "ADJ_Degree=Sup": {POS: ADJ, "Degree": "Sup"}, + "ADJ_NumForm=Digit|NumType=Ord": {POS: ADJ, "NumForm": "Digit", "NumType": "Ord"}, + "ADJ_NumForm=Word|NumType=Card": {POS: ADJ, "NumForm": "Word", "NumType": "Card"}, + "ADJ_NumForm=Word|NumType=Ord": {POS: ADJ, "NumForm": "Word", "NumType": "Ord"}, + "ADJ_Style=Coll": {POS: ADJ, "Style": "Coll"}, + "ADJ_Style=Expr": {POS: ADJ, "Style": "Expr"}, + "ADP_AdpType=Post|Case=Dat": {POS: ADP, "AdpType": "Post", "Case": "Dat"}, +
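The hy lex_attrs hunk earlier in this diff wires like_num into the LIKE_NUM lexical attribute: a token counts as number-like if it is all digits once a leading sign and any "," or "." separators are stripped, if it is a simple fraction, or if it is one of the Armenian number words in _num_words. A short sketch of that logic; the sample words are illustrative assumptions:

from spacy.lang.hy.lex_attrs import like_num

print(like_num("15"))      # True: plain digits
print(like_num("-3,500"))  # True: the sign is dropped, separators stripped
print(like_num("1/2"))     # True: numerator and denominator are both digits
print(like_num("երկու"))   # True: "two" is listed in _num_words
print(like_num("տուն"))    # False: "house" is not a number word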
"ADP_AdpType=Post|Case=Nom": {POS: ADP, "AdpType": "Post", "Case": "Nom"}, + "ADP_AdpType=Post|Number=Plur|Person=3": { + POS: ADP, + "AdpType": "Post", + "Number": "Plur", + "Person": "3", + }, + "ADP_AdpType=Post": {POS: ADP, "AdpType": "Post"}, + "ADP_AdpType=Prep": {POS: ADP, "AdpType": "Prep"}, + "ADP_AdpType=Prep|Style=Arch": {POS: ADP, "AdpType": "Prep", "Style": "Arch"}, + "ADV_Degree=Cmp": {POS: ADV, "Degree": "Cmp"}, + "ADV_Degree=Pos": {POS: ADV, "Degree": "Pos"}, + "ADV_Degree=Sup": {POS: ADV, "Degree": "Sup"}, + "ADV_Distance=Dist|PronType=Dem": {POS: ADV, "Distance": "Dist", "PronType": "Dem"}, + "ADV_Distance=Dist|PronType=Exc": {POS: ADV, "Distance": "Dist", "PronType": "Exc"}, + "ADV_Distance=Med|PronType=Dem": {POS: ADV, "Distance": "Med", "PronType": "Dem"}, + "ADV_Distance=Med|PronType=Dem|Style=Coll": { + POS: ADV, + "Distance": "Med", + "PronType": "Dem", + "Style": "Coll", + }, + "ADV_NumForm=Word|NumType=Card|PronType=Tot": { + POS: ADV, + "NumForm": "Word", + "NumType": "Card", + "PronType": "Tot", + }, + "ADV_PronType=Dem": {POS: ADV, "PronType": "Dem"}, + "ADV_PronType=Exc": {POS: ADV, "PronType": "Exc"}, + "ADV_PronType=Ind": {POS: ADV, "PronType": "Ind"}, + "ADV_PronType=Int": {POS: ADV, "PronType": "Int"}, + "ADV_PronType=Int|Style=Coll": {POS: ADV, "PronType": "Int", "Style": "Coll"}, + "ADV_PronType=Rel": {POS: ADV, "PronType": "Rel"}, + "ADV_Style=Coll": {POS: ADV, "Style": "Coll"}, + "ADV_Style=Rare": {POS: ADV, "Style": "Rare"}, + "AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Plur", + "Person": "1", + "Polarity": "Neg", + "Tense": "Pres", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Plur", + "Person": "2", + "Polarity": "Pos", + "Tense": "Pres", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Tense=Imp|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Plur", + "Person": "3", + "Polarity": "Neg", + "Tense": "Imp", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Tense=Pres|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Plur", + "Person": "3", + "Polarity": "Neg", + "Tense": "Pres", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Tense=Imp|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Plur", + "Person": "3", + "Polarity": "Pos", + "Tense": "Imp", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Plur", + "Person": "3", + "Polarity": "Pos", + "Tense": "Pres", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Imp|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "1", + "Polarity": "Neg", + "Tense": "Imp", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "1", + "Polarity": "Neg", + "Tense": "Pres", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Imp|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": 
"Sing", + "Person": "1", + "Polarity": "Pos", + "Tense": "Imp", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "1", + "Polarity": "Pos", + "Tense": "Pres", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Polarity=Neg|Tense=Pres|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "2", + "Polarity": "Neg", + "Tense": "Pres", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "2", + "Polarity": "Pos", + "Tense": "Pres", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Imp|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Neg", + "Tense": "Imp", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Pres|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Neg", + "Tense": "Pres", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Imp|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Tense": "Imp", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin": { + POS: AUX, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Tense": "Pres", + "VerbForm": "Fin", + }, + "AUX_Aspect=Imp|VerbForm=Part": {POS: AUX, "Aspect": "Imp", "VerbForm": "Part"}, + "AUX_Aspect=Perf|VerbForm=Part": {POS: AUX, "Aspect": "Perf", "VerbForm": "Part"}, + "AUX_Aspect=Prosp|VerbForm=Part": {POS: AUX, "Aspect": "Prosp", "VerbForm": "Part"}, + "AUX_Polarity=Pos": {POS: AUX, "Polarity": "Pos"}, + "CCONJ_ConjType=Comp": {POS: CCONJ, "ConjType": "Comp"}, + "CCONJ_ConjType=Comp|Style=Coll": {POS: CCONJ, "ConjType": "Comp", "Style": "Coll"}, + "DET_Case=Gen|Distance=Med|Number=Plur|Poss=Yes|PronType=Dem": { + POS: DET, + "Case": "Gen", + "Distance": "Med", + "Number": "Plur", + "Poss": "Yes", + "PronType": "Dem", + }, + "DET_Case=Gen|Distance=Med|Number=Sing|Poss=Yes|PronType=Dem": { + POS: DET, + "Case": "Gen", + "Distance": "Med", + "Number": "Sing", + "Poss": "Yes", + "PronType": "Dem", + }, + "DET_Case=Gen|Number=Plur|Person=1|Poss=Yes|PronType=Prs": { + POS: DET, + "Case": "Gen", + "Number": "Plur", + "Person": "1", + "Poss": "Yes", + "PronType": "Prs", + }, + "DET_Case=Gen|Number=Plur|Person=2|Polite=Infm|Poss=Yes|PronType=Prs": { + POS: DET, + "Case": "Gen", + "Number": "Plur", + "Person": "2", + "Polite": "Infm", + "Poss": "Yes", + "PronType": "Prs", + }, + "DET_Case=Gen|Number=Plur|Person=3|Poss=Yes|PronType=Emp": { + POS: DET, + "Case": "Gen", + "Number": "Plur", + "Person": "3", + "Poss": "Yes", + "PronType": "Emp", + }, + "DET_Case=Gen|Number=Plur|Person=3|Poss=Yes|PronType=Emp|Reflex=Yes": { + POS: DET, + "Case": "Gen", + "Number": "Plur", + "Person": "3", + "Poss": "Yes", + "PronType": "Emp", + "Reflex": "Yes", + }, + "DET_Case=Gen|Number=Sing|Person=1|Poss=Yes|PronType=Prs": { + POS: DET, + "Case": "Gen", + "Number": "Sing", + "Person": "1", + "Poss": "Yes", + "PronType": "Prs", + }, + 
"DET_Case=Gen|Number=Sing|Person=2|Polite=Infm|Poss=Yes|PronType=Prs": { + POS: DET, + "Case": "Gen", + "Number": "Sing", + "Person": "2", + "Polite": "Infm", + "Poss": "Yes", + "PronType": "Prs", + }, + "DET_Case=Gen|Number=Sing|Person=3|Poss=Yes|PronType=Emp": { + POS: DET, + "Case": "Gen", + "Number": "Sing", + "Person": "3", + "Poss": "Yes", + "PronType": "Emp", + }, + "DET_Case=Gen|Number=Sing|Person=3|Poss=Yes|PronType=Emp|Reflex=Yes": { + POS: DET, + "Case": "Gen", + "Number": "Sing", + "Person": "3", + "Poss": "Yes", + "PronType": "Emp", + "Reflex": "Yes", + }, + "DET_Case=Gen|Number=Sing|Person=3|Poss=Yes|PronType=Prs": { + POS: DET, + "Case": "Gen", + "Number": "Sing", + "Person": "3", + "Poss": "Yes", + "PronType": "Prs", + }, + "DET_Case=Gen|Number=Sing|Poss=Yes|PronType=Rel": { + POS: DET, + "Case": "Gen", + "Number": "Sing", + "Poss": "Yes", + "PronType": "Rel", + }, + "DET_Distance=Dist|PronType=Dem": {POS: DET, "Distance": "Dist", "PronType": "Dem"}, + "DET_Distance=Dist|PronType=Dem|Style=Coll": { + POS: DET, + "Distance": "Dist", + "PronType": "Dem", + "Style": "Coll", + }, + "DET_Distance=Dist|PronType=Dem|Style=Vrnc": { + POS: DET, + "Distance": "Dist", + "PronType": "Dem", + "Style": "Vrnc", + }, + "DET_Distance=Med|PronType=Dem": {POS: DET, "Distance": "Med", "PronType": "Dem"}, + "DET_Distance=Med|PronType=Dem|Style=Coll": { + POS: DET, + "Distance": "Med", + "PronType": "Dem", + "Style": "Coll", + }, + "DET_Distance=Prox|PronType=Dem": {POS: DET, "Distance": "Prox", "PronType": "Dem"}, + "DET_Distance=Prox|PronType=Dem|Style=Coll": { + POS: DET, + "Distance": "Prox", + "PronType": "Dem", + "Style": "Coll", + }, + "DET_PronType=Art": {POS: DET, "PronType": "Art"}, + "DET_PronType=Exc": {POS: DET, "PronType": "Exc"}, + "DET_PronType=Ind": {POS: DET, "PronType": "Ind"}, + "DET_PronType=Int": {POS: DET, "PronType": "Int"}, + "DET_PronType=Tot": {POS: DET, "PronType": "Tot"}, + "DET_PronType=Tot|Style=Arch": {POS: DET, "PronType": "Tot", "Style": "Arch"}, + "INTJ_Style=Vrnc": {POS: INTJ, "Style": "Vrnc"}, + "NOUN_Abbr=Yes|Animacy=Nhum|Case=Dat|Definite=Ind|Number=Plur": { + POS: NOUN, + "Abbr": "Yes", + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Plur", + }, + "NOUN_Abbr=Yes|Animacy=Nhum|Case=Nom|Definite=Ind|Number=Sing": { + POS: NOUN, + "Abbr": "Yes", + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Sing", + }, + "NOUN_Animacy=Hum|Case=Abl|Definite=Ind|Number=Plur": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Abl", + "Definite": "Ind", + "Number": "Plur", + }, + "NOUN_Animacy=Hum|Case=Abl|Definite=Ind|Number=Plur|Style=Slng": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Abl", + "Definite": "Ind", + "Number": "Plur", + "Style": "Slng", + }, + "NOUN_Animacy=Hum|Case=Abl|Definite=Ind|Number=Sing": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Abl", + "Definite": "Ind", + "Number": "Sing", + }, + "NOUN_Animacy=Hum|Case=Dat|Definite=Def|Number=Plur": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Def", + "Number": "Plur", + }, + "NOUN_Animacy=Hum|Case=Dat|Definite=Def|Number=Sing": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Def", + "Number": "Sing", + }, + "NOUN_Animacy=Hum|Case=Dat|Definite=Def|Number=Sing|Style=Slng": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Def", + "Number": "Sing", + "Style": "Slng", + }, + "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Assoc": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Ind", + "Number": 
"Assoc", + }, + "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Plur": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Plur", + }, + "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Plur|Style=Coll": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Plur", + "Style": "Coll", + }, + "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Plur|Style=Slng": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Plur", + "Style": "Slng", + }, + "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Sing": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Sing", + }, + "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Sing|Style=Arch": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Sing", + "Style": "Arch", + }, + "NOUN_Animacy=Hum|Case=Dat|Number=Sing|Number=Sing|Person=1": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Dat", + "Number": "Sing", + "Number": "Sing", + "Person": "1", + }, + "NOUN_Animacy=Hum|Case=Dat|Number=Sing|Number=Sing|Person=1|Style=Coll": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Dat", + "Number": "Sing", + "Number": "Sing", + "Person": "1", + "Style": "Coll", + }, + "NOUN_Animacy=Hum|Case=Ins|Definite=Ind|Number=Sing": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Ins", + "Definite": "Ind", + "Number": "Sing", + }, + "NOUN_Animacy=Hum|Case=Nom|Definite=Def|Number=Plur": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Def", + "Number": "Plur", + }, + "NOUN_Animacy=Hum|Case=Nom|Definite=Def|Number=Plur|Style=Slng": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Def", + "Number": "Plur", + "Style": "Slng", + }, + "NOUN_Animacy=Hum|Case=Nom|Definite=Def|Number=Sing": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Def", + "Number": "Sing", + }, + "NOUN_Animacy=Hum|Case=Nom|Definite=Def|Number=Sing|Style=Coll": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Def", + "Number": "Sing", + "Style": "Coll", + }, + "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Assoc": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Assoc", + }, + "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Plur": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Plur", + }, + "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Plur|Style=Coll": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Plur", + "Style": "Coll", + }, + "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Plur|Style=Slng": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Plur", + "Style": "Slng", + }, + "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Plur|Typo=Yes": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Plur", + "Typo": "Yes", + }, + "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Sing": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Sing", + }, + "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Sing|Style=Coll": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Sing", + "Style": "Coll", + }, + "NOUN_Animacy=Hum|Case=Nom|Number=Sing|Number=Sing|Person=1": { + POS: NOUN, + "Animacy": "Hum", + "Case": "Nom", + "Number": "Sing", + "Number": "Sing", + "Person": "1", + }, + "NOUN_Animacy=Nhum|Case=Abl|Definite=Ind|Number=Coll": { + POS: NOUN, + "Animacy": 
"Nhum", + "Case": "Abl", + "Definite": "Ind", + "Number": "Coll", + }, + "NOUN_Animacy=Nhum|Case=Abl|Definite=Ind|Number=Plur": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Abl", + "Definite": "Ind", + "Number": "Plur", + }, + "NOUN_Animacy=Nhum|Case=Abl|Definite=Ind|Number=Sing": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Abl", + "Definite": "Ind", + "Number": "Sing", + }, + "NOUN_Animacy=Nhum|Case=Abl|Definite=Ind|Number=Sing|Style=Arch": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Abl", + "Definite": "Ind", + "Number": "Sing", + "Style": "Arch", + }, + "NOUN_Animacy=Nhum|Case=Abl|Number=Sing|Number=Sing|Person=2": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Abl", + "Number": "Sing", + "Number": "Sing", + "Person": "2", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Coll": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Def", + "Number": "Coll", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Plur": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Def", + "Number": "Plur", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Sing|NumForm=Digit": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Def", + "Number": "Sing", + "NumForm": "Digit", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Sing|NumForm=Word": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Def", + "Number": "Sing", + "NumForm": "Word", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Sing": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Def", + "Number": "Sing", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Sing|Style=Rare": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Def", + "Number": "Sing", + "Style": "Rare", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Sing|Style=Vrnc": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Def", + "Number": "Sing", + "Style": "Vrnc", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Coll": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Coll", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Plur": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Plur", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Sing|NumForm=Digit": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Sing", + "NumForm": "Digit", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Sing": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Sing", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Sing|Style=Coll": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Sing", + "Style": "Coll", + }, + "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Sing|Style=Vrnc": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Ind", + "Number": "Sing", + "Style": "Vrnc", + }, + "NOUN_Animacy=Nhum|Case=Dat|Number=Coll|Number=Sing|Person=1": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + # "Number": "Coll", + "Number": "Sing", + "Person": "1", + }, + "NOUN_Animacy=Nhum|Case=Dat|Number=Sing|Number=Sing|Person=1": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Number": "Sing", + "Number": "Sing", + "Person": "1", + }, + "NOUN_Animacy=Nhum|Case=Dat|Number=Sing|Number=Sing|Person=2": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Dat", + "Number": "Sing", + "Number": "Sing", + "Person": "2", + }, + 
"NOUN_Animacy=Nhum|Case=Gen|Definite=Ind|Number=Sing|Style=Arch": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Gen", + "Definite": "Ind", + "Number": "Sing", + "Style": "Arch", + }, + "NOUN_Animacy=Nhum|Case=Ins|Definite=Ind|Number=Coll": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Ins", + "Definite": "Ind", + "Number": "Coll", + }, + "NOUN_Animacy=Nhum|Case=Ins|Definite=Ind|Number=Plur": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Ins", + "Definite": "Ind", + "Number": "Plur", + }, + "NOUN_Animacy=Nhum|Case=Ins|Definite=Ind|Number=Sing": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Ins", + "Definite": "Ind", + "Number": "Sing", + }, + "NOUN_Animacy=Nhum|Case=Ins|Definite=Ind|Number=Sing|Style=Coll": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Ins", + "Definite": "Ind", + "Number": "Sing", + "Style": "Coll", + }, + "NOUN_Animacy=Nhum|Case=Ins|Number=Sing|Number=Sing|Person=1": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Ins", + "Number": "Sing", + "Number": "Sing", + "Person": "1", + }, + "NOUN_Animacy=Nhum|Case=Loc|Definite=Ind|Number=Plur": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Loc", + "Definite": "Ind", + "Number": "Plur", + }, + "NOUN_Animacy=Nhum|Case=Loc|Definite=Ind|Number=Sing": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Loc", + "Definite": "Ind", + "Number": "Sing", + }, + "NOUN_Animacy=Nhum|Case=Loc|Number=Sing|Number=Sing|Person=2": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Loc", + "Number": "Sing", + "Number": "Sing", + "Person": "2", + }, + "NOUN_Animacy=Nhum|Case=Nom|Definite=Def|Number=Coll": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Def", + "Number": "Coll", + }, + "NOUN_Animacy=Nhum|Case=Nom|Definite=Def|Number=Plur|Number=Sing|Poss=Yes": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Def", + # "Number": "Plur", + "Number": "Sing", + "Poss": "Yes", + }, + "NOUN_Animacy=Nhum|Case=Nom|Definite=Def|Number=Plur": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Def", + "Number": "Plur", + }, + "NOUN_Animacy=Nhum|Case=Nom|Definite=Def|Number=Sing|NumForm=Digit": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Def", + "Number": "Sing", + "NumForm": "Digit", + }, + "NOUN_Animacy=Nhum|Case=Nom|Definite=Def|Number=Sing": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Def", + "Number": "Sing", + }, + "NOUN_Animacy=Nhum|Case=Nom|Definite=Ind|Number=Coll": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Coll", + }, + "NOUN_Animacy=Nhum|Case=Nom|Definite=Ind|Number=Coll|Typo=Yes": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Coll", + "Typo": "Yes", + }, + "NOUN_Animacy=Nhum|Case=Nom|Definite=Ind|Number=Plur": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Plur", + }, + "NOUN_Animacy=Nhum|Case=Nom|Definite=Ind|Number=Sing": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Ind", + "Number": "Sing", + }, + "NOUN_Animacy=Nhum|Case=Nom|Definite=Ind": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Ind", + }, + "NOUN_Animacy=Nhum|Case=Nom|Number=Plur|Number=Sing|Person=2": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + # "Number": "Plur", + "Number": "Sing", + "Person": "2", + }, + "NOUN_Animacy=Nhum|Case=Nom|Number=Sing|Number=Sing|Person=1": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Number": "Sing", + "Number": "Sing", + "Person": "1", + }, + 
"NOUN_Animacy=Nhum|Case=Nom|Number=Sing|Number=Sing|Person=2": { + POS: NOUN, + "Animacy": "Nhum", + "Case": "Nom", + "Number": "Sing", + "Number": "Sing", + "Person": "2", + }, + "NUM_NumForm=Digit|NumType=Card": {POS: NUM, "NumForm": "Digit", "NumType": "Card"}, + "NUM_NumForm=Digit|NumType=Frac|Typo=Yes": { + POS: NUM, + "NumForm": "Digit", + "NumType": "Frac", + "Typo": "Yes", + }, + "NUM_NumForm=Digit|NumType=Range": { + POS: NUM, + "NumForm": "Digit", + "NumType": "Range", + }, + "NUM_NumForm=Word|NumType=Card": {POS: NUM, "NumForm": "Word", "NumType": "Card"}, + "NUM_NumForm=Word|NumType=Dist": {POS: NUM, "NumForm": "Word", "NumType": "Dist"}, + "NUM_NumForm=Word|NumType=Range": {POS: NUM, "NumForm": "Word", "NumType": "Range"}, + "PART_Polarity=Neg": {POS: PART, "Polarity": "Neg"}, + "PRON_Case=Abl|Definite=Ind|Number=Sing|Person=3|PronType=Prs": { + POS: PRON, + "Case": "Abl", + "Definite": "Ind", + "Number": "Sing", + "Person": "3", + "PronType": "Prs", + }, + "PRON_Case=Abl|Number=Plur|Person=3|PronType=Prs": { + POS: PRON, + "Case": "Abl", + "Number": "Plur", + "Person": "3", + "PronType": "Prs", + }, + "PRON_Case=Abl|Number=Sing|Person=2|Polite=Infm|PronType=Prs": { + POS: PRON, + "Case": "Abl", + "Number": "Sing", + "Person": "2", + "Polite": "Infm", + "PronType": "Prs", + }, + "PRON_Case=Dat|Definite=Def|Distance=Dist|Number=Sing|PronType=Dem": { + POS: PRON, + "Case": "Dat", + "Definite": "Def", + "Distance": "Dist", + "Number": "Sing", + "PronType": "Dem", + }, + "PRON_Case=Dat|Definite=Def|Number=Sing|Person=3|PronType=Prs": { + POS: PRON, + "Case": "Dat", + "Definite": "Def", + "Number": "Sing", + "Person": "3", + "PronType": "Prs", + }, + "PRON_Case=Dat|Definite=Ind|Number=Sing|PronType=Int": { + POS: PRON, + "Case": "Dat", + "Definite": "Ind", + "Number": "Sing", + "PronType": "Int", + }, + "PRON_Case=Dat|Distance=Dist|Number=Sing|PronType=Dem": { + POS: PRON, + "Case": "Dat", + "Distance": "Dist", + "Number": "Sing", + "PronType": "Dem", + }, + "PRON_Case=Dat|Distance=Med|Number=Plur|PronType=Dem": { + POS: PRON, + "Case": "Dat", + "Distance": "Med", + "Number": "Plur", + "PronType": "Dem", + }, + "PRON_Case=Dat|Number=Plur|Person=1|PronType=Prs": { + POS: PRON, + "Case": "Dat", + "Number": "Plur", + "Person": "1", + "PronType": "Prs", + }, + "PRON_Case=Dat|Number=Plur|Person=2|Polite=Infm|PronType=Prs": { + POS: PRON, + "Case": "Dat", + "Number": "Plur", + "Person": "2", + "Polite": "Infm", + "PronType": "Prs", + }, + "PRON_Case=Dat|Number=Plur|Person=3|PronType=Emp|Reflex=Yes": { + POS: PRON, + "Case": "Dat", + "Number": "Plur", + "Person": "3", + "PronType": "Emp", + "Reflex": "Yes", + }, + "PRON_Case=Dat|Number=Plur|Person=3|PronType=Prs": { + POS: PRON, + "Case": "Dat", + "Number": "Plur", + "Person": "3", + "PronType": "Prs", + }, + "PRON_Case=Dat|Number=Plur|PronType=Rcp": { + POS: PRON, + "Case": "Dat", + "Number": "Plur", + "PronType": "Rcp", + }, + "PRON_Case=Dat|Number=Sing|Person=1|PronType=Prs": { + POS: PRON, + "Case": "Dat", + "Number": "Sing", + "Person": "1", + "PronType": "Prs", + }, + "PRON_Case=Dat|Number=Sing|Person=2|Polite=Infm|PronType=Prs": { + POS: PRON, + "Case": "Dat", + "Number": "Sing", + "Person": "2", + "Polite": "Infm", + "PronType": "Prs", + }, + "PRON_Case=Dat|Number=Sing|Person=3|PronType=Emp": { + POS: PRON, + "Case": "Dat", + "Number": "Sing", + "Person": "3", + "PronType": "Emp", + }, + "PRON_Case=Dat|Number=Sing|Person=3|PronType=Emp|Reflex=Yes": { + POS: PRON, + "Case": "Dat", + "Number": "Sing", + "Person": "3", + "PronType": 
"Emp", + "Reflex": "Yes", + }, + "PRON_Case=Dat|Number=Sing|PronType=Int": { + POS: PRON, + "Case": "Dat", + "Number": "Sing", + "PronType": "Int", + }, + "PRON_Case=Dat|Number=Sing|PronType=Rel": { + POS: PRON, + "Case": "Dat", + "Number": "Sing", + "PronType": "Rel", + }, + "PRON_Case=Dat|PronType=Tot": {POS: PRON, "Case": "Dat", "PronType": "Tot"}, + "PRON_Case=Gen|Distance=Med|Number=Sing|PronType=Dem": { + POS: PRON, + "Case": "Gen", + "Distance": "Med", + "Number": "Sing", + "PronType": "Dem", + }, + "PRON_Case=Gen|Number=Plur|Person=1|PronType=Prs": { + POS: PRON, + "Case": "Gen", + "Number": "Plur", + "Person": "1", + "PronType": "Prs", + }, + "PRON_Case=Gen|Number=Sing|Person=2|PronType=Prs": { + POS: PRON, + "Case": "Gen", + "Number": "Sing", + "Person": "2", + "PronType": "Prs", + }, + "PRON_Case=Gen|Number=Sing|Person=3|PronType=Prs": { + POS: PRON, + "Case": "Gen", + "Number": "Sing", + "Person": "3", + "PronType": "Prs", + }, + "PRON_Case=Gen|PronType=Tot": {POS: PRON, "Case": "Gen", "PronType": "Tot"}, + "PRON_Case=Ins|Definite=Ind|Number=Sing|PronType=Rel": { + POS: PRON, + "Case": "Ins", + "Definite": "Ind", + "Number": "Sing", + "PronType": "Rel", + }, + "PRON_Case=Ins|Distance=Med|Number=Sing|PronType=Dem": { + POS: PRON, + "Case": "Ins", + "Distance": "Med", + "Number": "Sing", + "PronType": "Dem", + }, + "PRON_Case=Loc|Definite=Ind|Number=Sing|PronType=Rel": { + POS: PRON, + "Case": "Loc", + "Definite": "Ind", + "Number": "Sing", + "PronType": "Rel", + }, + "PRON_Case=Loc|Distance=Med|Number=Sing|PronType=Dem": { + POS: PRON, + "Case": "Loc", + "Distance": "Med", + "Number": "Sing", + "PronType": "Dem", + }, + "PRON_Case=Nom|Definite=Def|Distance=Dist|Number=Plur|PronType=Dem": { + POS: PRON, + "Case": "Nom", + "Definite": "Def", + "Distance": "Dist", + "Number": "Plur", + "PronType": "Dem", + }, + "PRON_Case=Nom|Definite=Def|Distance=Med|Number=Sing|PronType=Dem|Style=Coll": { + POS: PRON, + "Case": "Nom", + "Definite": "Def", + "Distance": "Med", + "Number": "Sing", + "PronType": "Dem", + "Style": "Coll", + }, + "PRON_Case=Nom|Definite=Def|Number=Sing|PronType=Int": { + POS: PRON, + "Case": "Nom", + "Definite": "Def", + "Number": "Sing", + "PronType": "Int", + }, + "PRON_Case=Nom|Definite=Def|Number=Sing|PronType=Rel": { + POS: PRON, + "Case": "Nom", + "Definite": "Def", + "Number": "Sing", + "PronType": "Rel", + }, + "PRON_Case=Nom|Definite=Ind|Number=Sing|PronType=Int": { + POS: PRON, + "Case": "Nom", + "Definite": "Ind", + "Number": "Sing", + "PronType": "Int", + }, + "PRON_Case=Nom|Definite=Ind|Number=Sing|PronType=Neg": { + POS: PRON, + "Case": "Nom", + "Definite": "Ind", + "Number": "Sing", + "PronType": "Neg", + }, + "PRON_Case=Nom|Definite=Ind|Number=Sing|PronType=Rel": { + POS: PRON, + "Case": "Nom", + "Definite": "Ind", + "Number": "Sing", + "PronType": "Rel", + }, + "PRON_Case=Nom|Distance=Dist|Number=Plur|Person=1|PronType=Dem": { + POS: PRON, + "Case": "Nom", + "Distance": "Dist", + "Number": "Plur", + "Person": "1", + "PronType": "Dem", + }, + "PRON_Case=Nom|Distance=Med|Number=Plur|PronType=Dem": { + POS: PRON, + "Case": "Nom", + "Distance": "Med", + "Number": "Plur", + "PronType": "Dem", + }, + "PRON_Case=Nom|Distance=Med|Number=Sing|PronType=Dem": { + POS: PRON, + "Case": "Nom", + "Distance": "Med", + "Number": "Sing", + "PronType": "Dem", + }, + "PRON_Case=Nom|Distance=Prox|Number=Sing|PronType=Dem": { + POS: PRON, + "Case": "Nom", + "Distance": "Prox", + "Number": "Sing", + "PronType": "Dem", + }, + 
"PRON_Case=Nom|Number=Plur|Person=1|PronType=Prs": { + POS: PRON, + "Case": "Nom", + "Number": "Plur", + "Person": "1", + "PronType": "Prs", + }, + "PRON_Case=Nom|Number=Plur|Person=3|PronType=Emp": { + POS: PRON, + "Case": "Nom", + "Number": "Plur", + "Person": "3", + "PronType": "Emp", + }, + "PRON_Case=Nom|Number=Plur|Person=3|PronType=Prs": { + POS: PRON, + "Case": "Nom", + "Number": "Plur", + "Person": "3", + "PronType": "Prs", + }, + "PRON_Case=Nom|Number=Plur|PronType=Rel": { + POS: PRON, + "Case": "Nom", + "Number": "Plur", + "PronType": "Rel", + }, + "PRON_Case=Nom|Number=Sing|Number=Plur|Person=3|Person=1|PronType=Emp": { + POS: PRON, + "Case": "Nom", + # "Number": "Sing", + "Number": "Plur", + # "Person": "3", + "Person": "1", + "PronType": "Emp", + }, + "PRON_Case=Nom|Number=Sing|Person=1|PronType=Int": { + POS: PRON, + "Case": "Nom", + "Number": "Sing", + "Person": "1", + "PronType": "Int", + }, + "PRON_Case=Nom|Number=Sing|Person=1|PronType=Prs": { + POS: PRON, + "Case": "Nom", + "Number": "Sing", + "Person": "1", + "PronType": "Prs", + }, + "PRON_Case=Nom|Number=Sing|Person=2|Polite=Infm|PronType=Prs": { + POS: PRON, + "Case": "Nom", + "Number": "Sing", + "Person": "2", + "Polite": "Infm", + "PronType": "Prs", + }, + "PRON_Case=Nom|Number=Sing|Person=3|PronType=Emp": { + POS: PRON, + "Case": "Nom", + "Number": "Sing", + "Person": "3", + "PronType": "Emp", + }, + "PRON_Case=Nom|Number=Sing|Person=3|PronType=Prs": { + POS: PRON, + "Case": "Nom", + "Number": "Sing", + "Person": "3", + "PronType": "Prs", + }, + "PRON_Case=Nom|Number=Sing|PronType=Int": { + POS: PRON, + "Case": "Nom", + "Number": "Sing", + "PronType": "Int", + }, + "PRON_Case=Nom|Number=Sing|PronType=Rel": { + POS: PRON, + "Case": "Nom", + "Number": "Sing", + "PronType": "Rel", + }, + "PRON_Case=Nom|Person=1|PronType=Tot": { + POS: PRON, + "Case": "Nom", + "Person": "1", + "PronType": "Tot", + }, + "PRON_Case=Nom|PronType=Ind": {POS: PRON, "Case": "Nom", "PronType": "Ind"}, + "PRON_Case=Nom|PronType=Tot": {POS: PRON, "Case": "Nom", "PronType": "Tot"}, + "PRON_Distance=Dist|Number=Sing|PronType=Dem": { + POS: PRON, + "Distance": "Dist", + "Number": "Sing", + "PronType": "Dem", + }, + "PRON_Distance=Med|PronType=Dem|Style=Coll": { + POS: PRON, + "Distance": "Med", + "PronType": "Dem", + "Style": "Coll", + }, + "PRON_Distance=Prox|PronType=Dem|Style=Coll": { + POS: PRON, + "Distance": "Prox", + "PronType": "Dem", + "Style": "Coll", + }, + "PRON_Number=Plur|PronType=Rel": {POS: PRON, "Number": "Plur", "PronType": "Rel"}, + "PROPN_Abbr=Yes|Animacy=Hum|Case=Nom|Definite=Ind|NameType=Giv|Number=Sing": { + POS: PROPN, + "Abbr": "Yes", + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Ind", + "NameType": "Giv", + "Number": "Sing", + }, + "PROPN_Abbr=Yes|Animacy=Nhum|Case=Nom|Definite=Ind|NameType=Com|Number=Sing": { + POS: PROPN, + "Abbr": "Yes", + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Ind", + "NameType": "Com", + "Number": "Sing", + }, + "PROPN_Animacy=Hum|Case=Dat|Definite=Def|NameType=Sur|Number=Sing": { + POS: PROPN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Def", + "NameType": "Sur", + "Number": "Sing", + }, + "PROPN_Animacy=Hum|Case=Dat|Definite=Ind|NameType=Prs|Number=Sing": { + POS: PROPN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Ind", + "NameType": "Prs", + "Number": "Sing", + }, + "PROPN_Animacy=Hum|Case=Dat|Definite=Ind|NameType=Sur|Number=Sing": { + POS: PROPN, + "Animacy": "Hum", + "Case": "Dat", + "Definite": "Ind", + "NameType": "Sur", + "Number": "Sing", + }, + 
"PROPN_Animacy=Hum|Case=Nom|Definite=Def|NameType=Giv|Number=Sing": { + POS: PROPN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Def", + "NameType": "Giv", + "Number": "Sing", + }, + "PROPN_Animacy=Hum|Case=Nom|Definite=Def|NameType=Sur|Number=Sing": { + POS: PROPN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Def", + "NameType": "Sur", + "Number": "Sing", + }, + "PROPN_Animacy=Hum|Case=Nom|Definite=Ind|NameType=Giv|Number=Sing": { + POS: PROPN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Ind", + "NameType": "Giv", + "Number": "Sing", + }, + "PROPN_Animacy=Hum|Case=Nom|Definite=Ind|NameType=Sur|Number=Sing": { + POS: PROPN, + "Animacy": "Hum", + "Case": "Nom", + "Definite": "Ind", + "NameType": "Sur", + "Number": "Sing", + }, + "PROPN_Animacy=Nhum|Case=Abl|Definite=Ind|NameType=Geo|Number=Coll": { + POS: PROPN, + "Animacy": "Nhum", + "Case": "Abl", + "Definite": "Ind", + "NameType": "Geo", + "Number": "Coll", + }, + "PROPN_Animacy=Nhum|Case=Abl|Definite=Ind|NameType=Geo|Number=Sing": { + POS: PROPN, + "Animacy": "Nhum", + "Case": "Abl", + "Definite": "Ind", + "NameType": "Geo", + "Number": "Sing", + }, + "PROPN_Animacy=Nhum|Case=Abl|Definite=Ind|Number=Plur": { + POS: PROPN, + "Animacy": "Nhum", + "Case": "Abl", + "Definite": "Ind", + "Number": "Plur", + }, + "PROPN_Animacy=Nhum|Case=Dat|Definite=Ind|NameType=Geo|Number=Sing": { + POS: PROPN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Ind", + "NameType": "Geo", + "Number": "Sing", + }, + "PROPN_Animacy=Nhum|Case=Dat|Definite=Ind|NameType=Geo|Number=Sing|Style=Coll": { + POS: PROPN, + "Animacy": "Nhum", + "Case": "Dat", + "Definite": "Ind", + "NameType": "Geo", + "Number": "Sing", + "Style": "Coll", + }, + "PROPN_Animacy=Nhum|Case=Loc|Definite=Ind|NameType=Geo|Number=Sing": { + POS: PROPN, + "Animacy": "Nhum", + "Case": "Loc", + "Definite": "Ind", + "NameType": "Geo", + "Number": "Sing", + }, + "PROPN_Animacy=Nhum|Case=Nom|Definite=Def|NameType=Geo|Number=Sing": { + POS: PROPN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Def", + "NameType": "Geo", + "Number": "Sing", + }, + "PROPN_Animacy=Nhum|Case=Nom|Definite=Def|NameType=Pro|Number=Sing|Style=Coll": { + POS: PROPN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Def", + "NameType": "Pro", + "Number": "Sing", + "Style": "Coll", + }, + "PROPN_Animacy=Nhum|Case=Nom|Definite=Ind|NameType=Geo|Number=Coll": { + POS: PROPN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Ind", + "NameType": "Geo", + "Number": "Coll", + }, + "PROPN_Animacy=Nhum|Case=Nom|Definite=Ind|NameType=Geo|Number=Sing": { + POS: PROPN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Ind", + "NameType": "Geo", + "Number": "Sing", + }, + "PROPN_Animacy=Nhum|Case=Nom|Definite=Ind|NameType=Geo|Number=Sing|Style=Vrnc": { + POS: PROPN, + "Animacy": "Nhum", + "Case": "Nom", + "Definite": "Ind", + "NameType": "Geo", + "Number": "Sing", + "Style": "Vrnc", + }, + "SCONJ_Style=Coll": {POS: SCONJ, "Style": "Coll"}, + "VERB_Aspect=Dur|Polarity=Neg|Subcat=Intr|VerbForm=Part|Voice=Pass": { + POS: VERB, + "Aspect": "Dur", + "Polarity": "Neg", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Pass", + }, + "VERB_Aspect=Dur|Polarity=Pos|Subcat=Intr|VerbForm=Part|Voice=Mid": { + POS: VERB, + "Aspect": "Dur", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Mid", + }, + "VERB_Aspect=Dur|Polarity=Pos|Subcat=Intr|VerbForm=Part|Voice=Pass": { + POS: VERB, + "Aspect": "Dur", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Pass", + }, + 
"VERB_Aspect=Dur|Polarity=Pos|Subcat=Tran|VerbForm=Part|Voice=Act": { + POS: VERB, + "Aspect": "Dur", + "Polarity": "Pos", + "Subcat": "Tran", + "VerbForm": "Part", + "Voice": "Act", + }, + "VERB_Aspect=Dur|Polarity=Pos|Subcat=Tran|VerbForm=Part|Voice=Mid": { + POS: VERB, + "Aspect": "Dur", + "Polarity": "Pos", + "Subcat": "Tran", + "VerbForm": "Part", + "Voice": "Mid", + }, + "VERB_Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Polarity=Neg|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Plur", + "Person": "1", + "Polarity": "Neg", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Plur", + "Person": "1", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Plur", + "Person": "3", + "Polarity": "Neg", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Plur", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Subcat=Tran|Tense=Imp|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "1", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Imp", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Polarity=Neg|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "2", + "Polarity": "Neg", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Neg", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Neg", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Imp|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Imp", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Imp", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Imp", + "Mood": "Ind", 
+ "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Imp|Style=Coll|Subcat=Intr|VerbForm=Part|Voice=Mid": { + POS: VERB, + "Aspect": "Imp", + "Style": "Coll", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Mid", + }, + "VERB_Aspect=Imp|Style=Vrnc|Subcat=Intr|VerbForm=Part|Voice=Mid": { + POS: VERB, + "Aspect": "Imp", + "Style": "Vrnc", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Mid", + }, + "VERB_Aspect=Imp|Subcat=Intr|VerbForm=Part": { + POS: VERB, + "Aspect": "Imp", + "Subcat": "Intr", + "VerbForm": "Part", + }, + "VERB_Aspect=Imp|Subcat=Intr|VerbForm=Part|Voice=Act": { + POS: VERB, + "Aspect": "Imp", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Act", + }, + "VERB_Aspect=Imp|Subcat=Intr|VerbForm=Part|Voice=Mid": { + POS: VERB, + "Aspect": "Imp", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Mid", + }, + "VERB_Aspect=Imp|Subcat=Intr|VerbForm=Part|Voice=Pass": { + POS: VERB, + "Aspect": "Imp", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Pass", + }, + "VERB_Aspect=Imp|Subcat=Tran|VerbForm=Part|Voice=Act": { + POS: VERB, + "Aspect": "Imp", + "Subcat": "Tran", + "VerbForm": "Part", + "Voice": "Act", + }, + "VERB_Aspect=Imp|Subcat=Tran|VerbForm=Part|Voice=Cau": { + POS: VERB, + "Aspect": "Imp", + "Subcat": "Tran", + "VerbForm": "Part", + "Voice": "Cau", + }, + "VERB_Aspect=Iter|Case=Ins|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { + POS: VERB, + "Aspect": "Iter", + "Case": "Ins", + "Definite": "Ind", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Gdv", + "Voice": "Mid", + }, + "VERB_Aspect=Iter|Case=Ins|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Tran|VerbForm=Gdv|Voice=Act": { + POS: VERB, + "Aspect": "Iter", + "Case": "Ins", + "Definite": "Ind", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Tran", + "VerbForm": "Gdv", + "Voice": "Act", + }, + "VERB_Aspect=Iter": {POS: VERB, "Aspect": "Iter"}, + "VERB_Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Plur", + "Person": "3", + "Polarity": "Neg", + "Subcat": "Intr", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Plur", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Plur", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Sing", + "Person": "1", + "Polarity": "Neg", + "Subcat": "Intr", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Style=Vrnc|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Sing", + "Person": "1", + "Polarity": "Pos", + "Style": "Vrnc", + "Subcat": "Tran", + "Tense": "Past", + 
"VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Sing", + "Person": "1", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Sing", + "Person": "1", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Polarity=Pos|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Sing", + "Person": "2", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Style=Vrnc|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Neg", + "Style": "Vrnc", + "Subcat": "Intr", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Neg", + "Subcat": "Tran", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Perf", + "Mood": "Ind", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Past", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Perf|Polarity=Neg|Subcat=Intr|VerbForm=Part|Voice=Pass": { + POS: VERB, + "Aspect": "Perf", + "Polarity": "Neg", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Pass", + }, + "VERB_Aspect=Perf|Polarity=Pos|Subcat=Intr|VerbForm=Part|Voice=Mid": { + POS: VERB, + "Aspect": "Perf", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Mid", + }, + "VERB_Aspect=Perf|Polarity=Pos|Subcat=Intr|VerbForm=Part|Voice=Pass": { + POS: VERB, + "Aspect": "Perf", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Pass", + }, + "VERB_Aspect=Perf|Polarity=Pos|Subcat=Tran|VerbForm=Part|Voice=Act": { + POS: VERB, + "Aspect": "Perf", + "Polarity": "Pos", + "Subcat": "Tran", + "VerbForm": "Part", + "Voice": "Act", + }, + "VERB_Aspect=Perf|Polarity=Pos|Subcat=Tran|VerbForm=Part|Voice=Pass": { + POS: VERB, + "Aspect": "Perf", + "Polarity": "Pos", + "Subcat": "Tran", + "VerbForm": "Part", + "Voice": "Pass", + }, + "VERB_Aspect=Perf|Polarity=Pos|VerbForm=Part|Voice=Act": { + POS: VERB, + 
"Aspect": "Perf", + "Polarity": "Pos", + "VerbForm": "Part", + "Voice": "Act", + }, + "VERB_Aspect=Perf|Subcat=Intr|VerbForm=Part|Voice=Mid": { + POS: VERB, + "Aspect": "Perf", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Mid", + }, + "VERB_Aspect=Perf|Subcat=Intr|VerbForm=Part|Voice=Pass": { + POS: VERB, + "Aspect": "Perf", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Pass", + }, + "VERB_Aspect=Perf|Subcat=Tran|VerbForm=Part|Voice=Act": { + POS: VERB, + "Aspect": "Perf", + "Subcat": "Tran", + "VerbForm": "Part", + "Voice": "Act", + }, + "VERB_Aspect=Perf|Subcat=Tran|VerbForm=Part|Voice=Cau": { + POS: VERB, + "Aspect": "Perf", + "Subcat": "Tran", + "VerbForm": "Part", + "Voice": "Cau", + }, + "VERB_Aspect=Prog|Subcat=Intr|VerbForm=Conv|Voice=Mid": { + POS: VERB, + "Aspect": "Prog", + "Subcat": "Intr", + "VerbForm": "Conv", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Connegative=Yes|Mood=Cnd|Subcat=Tran|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Connegative": "Yes", + "Mood": "Cnd", + "Subcat": "Tran", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Cnd|Number=Plur|Person=3|Polarity=Pos|Style=Vrnc|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Cnd", + "Number": "Plur", + "Person": "3", + "Polarity": "Pos", + "Style": "Vrnc", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Cnd|Number=Plur|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Cnd", + "Number": "Plur", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=1|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Cnd", + "Number": "Sing", + "Person": "1", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=2|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Cnd", + "Number": "Sing", + "Person": "2", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Cnd", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Pass": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Cnd", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Pass", + }, + "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Imp|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Cnd", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Imp", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Cnd", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", 
+ "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Imp|Number=Sing|Person=2|Subcat=Intr|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Imp", + "Number": "Sing", + "Person": "2", + "Subcat": "Intr", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Mood=Imp|Number=Sing|Person=2|Subcat=Tran|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Imp", + "Number": "Sing", + "Person": "2", + "Subcat": "Tran", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Plur|Person=1|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Plur", + "Person": "1", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Plur|Person=3|Polarity=Neg|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Plur", + "Person": "3", + "Polarity": "Neg", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Plur|Person=3|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Plur", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=1|Polarity=Neg|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Sing", + "Person": "1", + "Polarity": "Neg", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=1|Polarity=Neg|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Sing", + "Person": "1", + "Polarity": "Neg", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=1|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Sing", + "Person": "1", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=1|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Sing", + "Person": "1", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=2|Polarity=Pos|Subcat=Tran|Tense=Imp|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Sing", + "Person": "2", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Imp", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=2|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Sing", + "Person": "2", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Imp|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Imp", + "VerbForm": "Fin", + "Voice": 
"Mid", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Intr", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|VerbForm=Fin|Voice=Pass": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Fin", + "Voice": "Pass", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Imp|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Imp", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Number": "Sing", + "Person": "3", + "Polarity": "Pos", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Mood=Sub|Person=1|Polarity=Neg|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Mood": "Sub", + "Person": "1", + "Polarity": "Neg", + "Subcat": "Tran", + "Tense": "Pres", + "VerbForm": "Fin", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Polarity=Pos|Subcat=Intr|VerbForm=Part|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Polarity=Pos|Subcat=Tran|VerbForm=Part|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Polarity": "Pos", + "Subcat": "Tran", + "VerbForm": "Part", + "Voice": "Act", + }, + "VERB_Aspect=Prosp|Subcat=Intr|VerbForm=Part|Voice=Mid": { + POS: VERB, + "Aspect": "Prosp", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Mid", + }, + "VERB_Aspect=Prosp|Subcat=Intr|VerbForm=Part|Voice=Pass": { + POS: VERB, + "Aspect": "Prosp", + "Subcat": "Intr", + "VerbForm": "Part", + "Voice": "Pass", + }, + "VERB_Aspect=Prosp|Subcat=Tran|VerbForm=Part|Voice=Act": { + POS: VERB, + "Aspect": "Prosp", + "Subcat": "Tran", + "VerbForm": "Part", + "Voice": "Act", + }, + "VERB_Case=Abl|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { + POS: VERB, + "Case": "Abl", + "Definite": "Ind", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Gdv", + "Voice": "Mid", + }, + "VERB_Case=Abl|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Pass": { + POS: VERB, + "Case": "Abl", + "Definite": "Ind", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Gdv", + "Voice": "Pass", + }, + "VERB_Case=Abl|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Tran|VerbForm=Gdv|Voice=Act": { + POS: VERB, + "Case": "Abl", + "Definite": "Ind", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Tran", + "VerbForm": "Gdv", + "Voice": "Act", + }, + "VERB_Case=Dat|Definite=Def|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { + POS: VERB, + "Case": "Dat", + "Definite": "Def", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Gdv", + "Voice": "Mid", + }, + "VERB_Case=Dat|Definite=Ind|Number=Coll|Polarity=Neg|Subcat=Intr|VerbForm=Gdv|Voice=Pass": { + POS: VERB, + "Case": "Dat", + "Definite": "Ind", + "Number": "Coll", + "Polarity": "Neg", + "Subcat": 
"Intr", + "VerbForm": "Gdv", + "Voice": "Pass", + }, + "VERB_Case=Dat|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { + POS: VERB, + "Case": "Dat", + "Definite": "Ind", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Gdv", + "Voice": "Mid", + }, + "VERB_Case=Dat|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Tran|VerbForm=Gdv|Voice=Act": { + POS: VERB, + "Case": "Dat", + "Definite": "Ind", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Tran", + "VerbForm": "Gdv", + "Voice": "Act", + }, + "VERB_Case=Ins|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { + POS: VERB, + "Case": "Ins", + "Definite": "Ind", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Gdv", + "Voice": "Mid", + }, + "VERB_Case=Ins|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Tran|VerbForm=Gdv|Voice=Act": { + POS: VERB, + "Case": "Ins", + "Definite": "Ind", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Tran", + "VerbForm": "Gdv", + "Voice": "Act", + }, + "VERB_Case=Nom|Definite=Def|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { + POS: VERB, + "Case": "Nom", + "Definite": "Def", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Gdv", + "Voice": "Mid", + }, + "VERB_Case=Nom|Definite=Def|Number=Coll|Polarity=Pos|Subcat=Tran|VerbForm=Gdv|Voice=Act": { + POS: VERB, + "Case": "Nom", + "Definite": "Def", + "Number": "Coll", + "Polarity": "Pos", + "Subcat": "Tran", + "VerbForm": "Gdv", + "Voice": "Act", + }, + "VERB_Mood=Imp|Number=Sing|Person=2|Subcat=Intr|VerbForm=Fin|Voice=Mid": { + POS: VERB, + "Mood": "Imp", + "Number": "Sing", + "Person": "2", + "Subcat": "Intr", + "VerbForm": "Fin", + "Voice": "Mid", + }, + "VERB_Polarity=Neg|Subcat=Intr|VerbForm=Inf|Voice=Mid": { + POS: VERB, + "Polarity": "Neg", + "Subcat": "Intr", + "VerbForm": "Inf", + "Voice": "Mid", + }, + "VERB_Polarity=Pos|Style=Coll|Subcat=Tran|VerbForm=Inf|Voice=Act": { + POS: VERB, + "Polarity": "Pos", + "Style": "Coll", + "Subcat": "Tran", + "VerbForm": "Inf", + "Voice": "Act", + }, + "VERB_Polarity=Pos|Style=Vrnc|Subcat=Tran|VerbForm=Inf|Voice=Act": { + POS: VERB, + "Polarity": "Pos", + "Style": "Vrnc", + "Subcat": "Tran", + "VerbForm": "Inf", + "Voice": "Act", + }, + "VERB_Polarity=Pos|Subcat=Intr|VerbForm=Inf|Voice=Mid": { + POS: VERB, + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Inf", + "Voice": "Mid", + }, + "VERB_Polarity=Pos|Subcat=Intr|VerbForm=Inf|Voice=Pass": { + POS: VERB, + "Polarity": "Pos", + "Subcat": "Intr", + "VerbForm": "Inf", + "Voice": "Pass", + }, + "VERB_Polarity=Pos|Subcat=Tran|Typo=Yes|VerbForm=Inf|Voice=Act": { + POS: VERB, + "Polarity": "Pos", + "Subcat": "Tran", + "Typo": "Yes", + "VerbForm": "Inf", + "Voice": "Act", + }, + "VERB_Polarity=Pos|Subcat=Tran|VerbForm=Inf|Voice=Act": { + POS: VERB, + "Polarity": "Pos", + "Subcat": "Tran", + "VerbForm": "Inf", + "Voice": "Act", + }, + "X_Foreign=Yes": {POS: X, "Foreign": "Yes"}, + "X_Style=Vrnc": {POS: X, "Style": "Vrnc"}, +} diff --git a/spacy/lang/id/__init__.py b/spacy/lang/id/__init__.py index 89f874abe..db576f4eb 100644 --- a/spacy/lang/id/__init__.py +++ b/spacy/lang/id/__init__.py @@ -1,25 +1,20 @@ from .stop_words import STOP_WORDS from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .norm_exceptions import NORM_EXCEPTIONS from .lex_attrs import LEX_ATTRS from .syntax_iterators import SYNTAX_ITERATORS from .tag_map import 
TAG_MAP from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups +from ...attrs import LANG +from ...util import update_exc class IndonesianDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters[LANG] = lambda text: "id" lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS - ) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) stop_words = STOP_WORDS prefixes = TOKENIZER_PREFIXES diff --git a/spacy/lang/id/norm_exceptions.py b/spacy/lang/id/norm_exceptions.py deleted file mode 100644 index 63d2081e9..000000000 --- a/spacy/lang/id/norm_exceptions.py +++ /dev/null @@ -1,529 +0,0 @@ -# Daftar kosakata yang sering salah dieja -# https://id.wikipedia.org/wiki/Wikipedia:Daftar_kosakata_bahasa_Indonesia_yang_sering_salah_dieja -_exc = { - # Slang and abbreviations - "silahkan": "silakan", - "yg": "yang", - "kalo": "kalau", - "cawu": "caturwulan", - "ok": "oke", - "gak": "tidak", - "enggak": "tidak", - "nggak": "tidak", - "ndak": "tidak", - "ngga": "tidak", - "dgn": "dengan", - "tdk": "tidak", - "jg": "juga", - "klo": "kalau", - "denger": "dengar", - "pinter": "pintar", - "krn": "karena", - "nemuin": "menemukan", - "jgn": "jangan", - "udah": "sudah", - "sy": "saya", - "udh": "sudah", - "dapetin": "mendapatkan", - "ngelakuin": "melakukan", - "ngebuat": "membuat", - "membikin": "membuat", - "bikin": "buat", - # Daftar kosakata yang sering salah dieja - "malpraktik": "malapraktik", - "malfungsi": "malafungsi", - "malserap": "malaserap", - "maladaptasi": "malaadaptasi", - "malsuai": "malasuai", - "maldistribusi": "maladistribusi", - "malgizi": "malagizi", - "malsikap": "malasikap", - "memperhatikan": "memerhatikan", - "akte": "akta", - "cemilan": "camilan", - "esei": "esai", - "frase": "frasa", - "kafeteria": "kafetaria", - "ketapel": "katapel", - "kenderaan": "kendaraan", - "menejemen": "manajemen", - "menejer": "manajer", - "mesjid": "masjid", - "rebo": "rabu", - "seksama": "saksama", - "senggama": "sanggama", - "sekedar": "sekadar", - "seprei": "seprai", - "semedi": "semadi", - "samadi": "semadi", - "amandemen": "amendemen", - "algoritma": "algoritme", - "aritmatika": "aritmetika", - "metoda": "metode", - "materai": "meterai", - "meterei": "meterai", - "kalendar": "kalender", - "kadaluwarsa": "kedaluwarsa", - "katagori": "kategori", - "parlamen": "parlemen", - "sekular": "sekuler", - "selular": "seluler", - "sirkular": "sirkuler", - "survai": "survei", - "survey": "survei", - "aktuil": "aktual", - "formil": "formal", - "trotoir": "trotoar", - "komersiil": "komersial", - "komersil": "komersial", - "tradisionil": "tradisionial", - "orisinil": "orisinal", - "orijinil": "orisinal", - "afdol": "afdal", - "antri": "antre", - "apotik": "apotek", - "atlit": "atlet", - "atmosfir": "atmosfer", - "cidera": "cedera", - "cendikiawan": "cendekiawan", - "cepet": "cepat", - "cinderamata": "cenderamata", - "debet": "debit", - "difinisi": "definisi", - "dekrit": "dekret", - "disain": "desain", - "diskripsi": "deskripsi", - "diskotik": "diskotek", - "eksim": "eksem", - "exim": "eksem", - "faidah": "faedah", - "ekstrim": "ekstrem", - "ekstrimis": "ekstremis", - "komplit": "komplet", - "konkrit": "konkret", - "kongkrit": "konkret", - "kongkret": "konkret", - "kridit": "kredit", - "musium": "museum", - 
"pinalti": "penalti", - "piranti": "peranti", - "pinsil": "pensil", - "personil": "personel", - "sistim": "sistem", - "teoritis": "teoretis", - "vidio": "video", - "cengkeh": "cengkih", - "desertasi": "disertasi", - "hakekat": "hakikat", - "intelejen": "intelijen", - "kaedah": "kaidah", - "kempes": "kempis", - "kementrian": "kementerian", - "ledeng": "leding", - "nasehat": "nasihat", - "penasehat": "penasihat", - "praktek": "praktik", - "praktekum": "praktikum", - "resiko": "risiko", - "retsleting": "ritsleting", - "senen": "senin", - "amuba": "ameba", - "punggawa": "penggawa", - "surban": "serban", - "nomer": "nomor", - "sorban": "serban", - "bis": "bus", - "agribisnis": "agrobisnis", - "kantung": "kantong", - "khutbah": "khotbah", - "mandur": "mandor", - "rubuh": "roboh", - "pastur": "pastor", - "supir": "sopir", - "goncang": "guncang", - "goa": "gua", - "kaos": "kaus", - "kokoh": "kukuh", - "komulatif": "kumulatif", - "kolomnis": "kolumnis", - "korma": "kurma", - "lobang": "lubang", - "limo": "limusin", - "limosin": "limusin", - "mangkok": "mangkuk", - "saos": "saus", - "sop": "sup", - "sorga": "surga", - "tegor": "tegur", - "telor": "telur", - "obrak-abrik": "ubrak-abrik", - "ekwivalen": "ekuivalen", - "frekwensi": "frekuensi", - "konsekwensi": "konsekuensi", - "kwadran": "kuadran", - "kwadrat": "kuadrat", - "kwalifikasi": "kualifikasi", - "kwalitas": "kualitas", - "kwalitet": "kualitas", - "kwalitatif": "kualitatif", - "kwantitas": "kuantitas", - "kwantitatif": "kuantitatif", - "kwantum": "kuantum", - "kwartal": "kuartal", - "kwintal": "kuintal", - "kwitansi": "kuitansi", - "kwatir": "khawatir", - "kuatir": "khawatir", - "jadual": "jadwal", - "hirarki": "hierarki", - "karir": "karier", - "aktip": "aktif", - "daptar": "daftar", - "efektip": "efektif", - "epektif": "efektif", - "epektip": "efektif", - "Pebruari": "Februari", - "pisik": "fisik", - "pondasi": "fondasi", - "photo": "foto", - "photokopi": "fotokopi", - "hapal": "hafal", - "insap": "insaf", - "insyaf": "insaf", - "konperensi": "konferensi", - "kreatip": "kreatif", - "kreativ": "kreatif", - "maap": "maaf", - "napsu": "nafsu", - "negatip": "negatif", - "negativ": "negatif", - "objektip": "objektif", - "obyektip": "objektif", - "obyektif": "objektif", - "pasip": "pasif", - "pasiv": "pasif", - "positip": "positif", - "positiv": "positif", - "produktip": "produktif", - "produktiv": "produktif", - "sarap": "saraf", - "sertipikat": "sertifikat", - "subjektip": "subjektif", - "subyektip": "subjektif", - "subyektif": "subjektif", - "tarip": "tarif", - "transitip": "transitif", - "transitiv": "transitif", - "faham": "paham", - "fikir": "pikir", - "berfikir": "berpikir", - "telefon": "telepon", - "telfon": "telepon", - "telpon": "telepon", - "tilpon": "telepon", - "nafas": "napas", - "bernafas": "bernapas", - "pernafasan": "pernapasan", - "vermak": "permak", - "vulpen": "pulpen", - "aktifis": "aktivis", - "konfeksi": "konveksi", - "motifasi": "motivasi", - "Nopember": "November", - "propinsi": "provinsi", - "babtis": "baptis", - "jerembab": "jerembap", - "lembab": "lembap", - "sembab": "sembap", - "saptu": "sabtu", - "tekat": "tekad", - "bejad": "bejat", - "nekad": "nekat", - "otoped": "otopet", - "skuad": "skuat", - "jenius": "genius", - "marjin": "margin", - "marjinal": "marginal", - "obyek": "objek", - "subyek": "subjek", - "projek": "proyek", - "azas": "asas", - "ijasah": "ijazah", - "jenasah": "jenazah", - "plasa": "plaza", - "bathin": "batin", - "Katholik": "Katolik", - "orthografi": "ortografi", - "pathogen": "patogen", - 
"theologi": "teologi", - "ijin": "izin", - "rejeki": "rezeki", - "rejim": "rezim", - "jaman": "zaman", - "jamrud": "zamrud", - "jinah": "zina", - "perjinahan": "perzinaan", - "anugrah": "anugerah", - "cendrawasih": "cenderawasih", - "jendral": "jenderal", - "kripik": "keripik", - "krupuk": "kerupuk", - "ksatria": "kesatria", - "mentri": "menteri", - "negri": "negeri", - "Prancis": "Perancis", - "sebrang": "seberang", - "menyebrang": "menyeberang", - "Sumatra": "Sumatera", - "trampil": "terampil", - "isteri": "istri", - "justeru": "justru", - "perajurit": "prajurit", - "putera": "putra", - "puteri": "putri", - "samudera": "samudra", - "sastera": "sastra", - "sutera": "sutra", - "terompet": "trompet", - "iklas": "ikhlas", - "iktisar": "ikhtisar", - "kafilah": "khafilah", - "kawatir": "khawatir", - "kotbah": "khotbah", - "kusyuk": "khusyuk", - "makluk": "makhluk", - "mahluk": "makhluk", - "mahkluk": "makhluk", - "nahkoda": "nakhoda", - "nakoda": "nakhoda", - "tahta": "takhta", - "takhyul": "takhayul", - "tahyul": "takhayul", - "tahayul": "takhayul", - "akhli": "ahli", - "anarkhi": "anarki", - "kharisma": "karisma", - "kharismatik": "karismatik", - "mahsud": "maksud", - "makhsud": "maksud", - "rakhmat": "rahmat", - "tekhnik": "teknik", - "tehnik": "teknik", - "tehnologi": "teknologi", - "ikhwal": "ihwal", - "expor": "ekspor", - "extra": "ekstra", - "komplex": "komplek", - "sex": "seks", - "taxi": "taksi", - "extasi": "ekstasi", - "syaraf": "saraf", - "syurga": "surga", - "mashur": "masyhur", - "masyur": "masyhur", - "mahsyur": "masyhur", - "mashyur": "masyhur", - "muadzin": "muazin", - "adzan": "azan", - "ustadz": "ustaz", - "ustad": "ustaz", - "ustadzah": "ustaz", - "dzikir": "zikir", - "dzuhur": "zuhur", - "dhuhur": "zuhur", - "zhuhur": "zuhur", - "analisa": "analisis", - "diagnosa": "diagnosis", - "hipotesa": "hipotesis", - "sintesa": "sintesis", - "aktiviti": "aktivitas", - "aktifitas": "aktivitas", - "efektifitas": "efektivitas", - "komuniti": "komunitas", - "kreatifitas": "kreativitas", - "produktifitas": "produktivitas", - "realiti": "realitas", - "realita": "realitas", - "selebriti": "selebritas", - "spotifitas": "sportivitas", - "universiti": "universitas", - "utiliti": "utilitas", - "validiti": "validitas", - "dilokalisir": "dilokalisasi", - "didramatisir": "didramatisasi", - "dipolitisir": "dipolitisasi", - "dinetralisir": "dinetralisasi", - "dikonfrontir": "dikonfrontasi", - "mendominir": "mendominasi", - "koordinir": "koordinasi", - "proklamir": "proklamasi", - "terorganisir": "terorganisasi", - "terealisir": "terealisasi", - "robah": "ubah", - "dirubah": "diubah", - "merubah": "mengubah", - "terlanjur": "telanjur", - "terlantar": "telantar", - "penglepasan": "pelepasan", - "pelihatan": "penglihatan", - "pemukiman": "permukiman", - "pengrumahan": "perumahan", - "penyewaan": "persewaan", - "menyintai": "mencintai", - "menyolok": "mencolok", - "contek": "sontek", - "mencontek": "menyontek", - "pungkir": "mungkir", - "dipungkiri": "dimungkiri", - "kupungkiri": "kumungkiri", - "kaupungkiri": "kaumungkiri", - "nampak": "tampak", - "nampaknya": "tampaknya", - "nongkrong": "tongkrong", - "berternak": "beternak", - "berterbangan": "beterbangan", - "berserta": "beserta", - "berperkara": "beperkara", - "berpergian": "bepergian", - "berkerja": "bekerja", - "berberapa": "beberapa", - "terbersit": "tebersit", - "terpercaya": "tepercaya", - "terperdaya": "teperdaya", - "terpercik": "tepercik", - "terpergok": "tepergok", - "aksesoris": "aksesori", - "handal": "andal", - "hantar": "antar", - 
"panutan": "anutan", - "atsiri": "asiri", - "bhakti": "bakti", - "china": "cina", - "dharma": "darma", - "diktaktor": "diktator", - "eksport": "ekspor", - "hembus": "embus", - "hadits": "hadis", - "hadist": "hadits", - "harafiah": "harfiah", - "himbau": "imbau", - "import": "impor", - "inget": "ingat", - "hisap": "isap", - "interprestasi": "interpretasi", - "kangker": "kanker", - "konggres": "kongres", - "lansekap": "lanskap", - "maghrib": "magrib", - "emak": "mak", - "moderen": "modern", - "pasport": "paspor", - "perduli": "peduli", - "ramadhan": "ramadan", - "rapih": "rapi", - "Sansekerta": "Sanskerta", - "shalat": "salat", - "sholat": "salat", - "silahkan": "silakan", - "standard": "standar", - "hutang": "utang", - "zinah": "zina", - "ambulan": "ambulans", - "antartika": "sntarktika", - "arteri": "arteria", - "asik": "asyik", - "australi": "australia", - "denga": "dengan", - "depo": "depot", - "detil": "detail", - "ensiklopedi": "ensiklopedia", - "elit": "elite", - "frustasi": "frustrasi", - "gladi": "geladi", - "greget": "gereget", - "itali": "italia", - "karna": "karena", - "klenteng": "kelenteng", - "erling": "kerling", - "kontruksi": "konstruksi", - "masal": "massal", - "merk": "merek", - "respon": "respons", - "diresponi": "direspons", - "skak": "sekak", - "stir": "setir", - "singapur": "singapura", - "standarisasi": "standardisasi", - "varitas": "varietas", - "amphibi": "amfibi", - "anjlog": "anjlok", - "alpukat": "avokad", - "alpokat": "avokad", - "bolpen": "pulpen", - "cabe": "cabai", - "cabay": "cabai", - "ceret": "cerek", - "differensial": "diferensial", - "duren": "durian", - "faksimili": "faksimile", - "faksimil": "faksimile", - "graha": "gerha", - "goblog": "goblok", - "gombrong": "gombroh", - "horden": "gorden", - "korden": "gorden", - "gubug": "gubuk", - "imaginasi": "imajinasi", - "jerigen": "jeriken", - "jirigen": "jeriken", - "carut-marut": "karut-marut", - "kwota": "kuota", - "mahzab": "mazhab", - "mempesona": "memesona", - "milyar": "miliar", - "missi": "misi", - "nenas": "nanas", - "negoisasi": "negosiasi", - "automotif": "otomotif", - "pararel": "paralel", - "paska": "pasca", - "prosen": "persen", - "pete": "petai", - "petay": "petai", - "proffesor": "profesor", - "rame": "ramai", - "rapot": "rapor", - "rileks": "relaks", - "rileksasi": "relaksasi", - "renumerasi": "remunerasi", - "seketaris": "sekretaris", - "sekertaris": "sekretaris", - "sensorik": "sensoris", - "sentausa": "sentosa", - "strawberi": "stroberi", - "strawbery": "stroberi", - "taqwa": "takwa", - "tauco": "taoco", - "tauge": "taoge", - "toge": "taoge", - "tauladan": "teladan", - "taubat": "tobat", - "trilyun": "triliun", - "vissi": "visi", - "coklat": "cokelat", - "narkotika": "narkotik", - "oase": "oasis", - "politisi": "politikus", - "terong": "terung", - "wool": "wol", - "himpit": "impit", - "mujizat": "mukjizat", - "mujijat": "mukjizat", - "yag": "yang", -} - -NORM_EXCEPTIONS = {} - -for string, norm in _exc.items(): - NORM_EXCEPTIONS[string] = norm - NORM_EXCEPTIONS[string.title()] = norm diff --git a/spacy/lang/id/syntax_iterators.py b/spacy/lang/id/syntax_iterators.py index 96636b0b7..c09b0e840 100644 --- a/spacy/lang/id/syntax_iterators.py +++ b/spacy/lang/id/syntax_iterators.py @@ -1,7 +1,8 @@ from ...symbols import NOUN, PROPN, PRON +from ...errors import Errors -def noun_chunks(obj): +def noun_chunks(doclike): """ Detect base noun phrases from a dependency parse. Works on both Doc and Span. 
""" @@ -15,12 +16,16 @@ def noun_chunks(obj): "nmod", "nmod:poss", ] - doc = obj.doc # Ensure works on both Doc and Span. + doc = doclike.doc # Ensure works on both Doc and Span. + + if not doc.is_parsed: + raise ValueError(Errors.E029) + np_deps = [doc.vocab.strings[label] for label in labels] conj = doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") seen = set() - for i, word in enumerate(obj): + for i, word in enumerate(doclike): if word.pos not in (NOUN, PROPN, PRON): continue # Prevent nested chunks from being produced diff --git a/spacy/lang/kn/examples.py b/spacy/lang/kn/examples.py new file mode 100644 index 000000000..d82630432 --- /dev/null +++ b/spacy/lang/kn/examples.py @@ -0,0 +1,22 @@ +# coding: utf8 +from __future__ import unicode_literals + + +""" +Example sentences to test spaCy and its language models. + +>>> from spacy.lang.en.examples import sentences +>>> docs = nlp.pipe(sentences) +""" + + +sentences = [ + "ಆಪಲ್ ಒಂದು ಯು.ಕೆ. ಸ್ಟಾರ್ಟ್ಅಪ್ ಅನ್ನು ೧ ಶತಕೋಟಿ ಡಾಲರ್ಗಳಿಗೆ ಖರೀದಿಸಲು ನೋಡುತ್ತಿದೆ.", + "ಸ್ವಾಯತ್ತ ಕಾರುಗಳು ವಿಮಾ ಹೊಣೆಗಾರಿಕೆಯನ್ನು ತಯಾರಕರ ಕಡೆಗೆ ಬದಲಾಯಿಸುತ್ತವೆ.", + "ಕಾಲುದಾರಿ ವಿತರಣಾ ರೋಬೋಟ್‌ಗಳನ್ನು ನಿಷೇಧಿಸುವುದನ್ನು ಸ್ಯಾನ್ ಫ್ರಾನ್ಸಿಸ್ಕೊ ​​ಪರಿಗಣಿಸುತ್ತದೆ.", + "ಲಂಡನ್ ಯುನೈಟೆಡ್ ಕಿಂಗ್‌ಡಂನ ದೊಡ್ಡ ನಗರ.", + "ನೀನು ಎಲ್ಲಿದಿಯಾ?", + "ಫ್ರಾನ್ಸಾದ ಅಧ್ಯಕ್ಷರು ಯಾರು?", + "ಯುನೈಟೆಡ್ ಸ್ಟೇಟ್ಸ್ನ ರಾಜಧಾನಿ ಯಾವುದು?", + "ಬರಾಕ್ ಒಬಾಮ ಯಾವಾಗ ಜನಿಸಿದರು?", +] diff --git a/spacy/lang/ko/examples.py b/spacy/lang/ko/examples.py index cc0a66c0a..edb755eaa 100644 --- a/spacy/lang/ko/examples.py +++ b/spacy/lang/ko/examples.py @@ -6,8 +6,8 @@ Example sentences to test spaCy and its language models. """ sentences = [ - "애플이 영국의 신생 기업을 10억 달러에 구매를 고려중이다.", - "자동 운전 자동차의 손해 배상 책임에 자동차 메이커에 일정한 부담을 요구하겠다.", - "자동 배달 로봇이 보도를 주행하는 것을 샌프란시스코시가 금지를 검토중이라고 합니다.", + "애플이 영국의 스타트업을 10억 달러에 인수하는 것을 알아보고 있다.", + "자율주행 자동차의 손해 배상 책임이 제조 업체로 옮겨 가다", + "샌프란시스코 시가 자동 배달 로봇의 보도 주행 금지를 검토 중이라고 합니다.", "런던은 영국의 수도이자 가장 큰 도시입니다.", ] diff --git a/spacy/lang/lb/__init__.py b/spacy/lang/lb/__init__.py index afcf77f33..8235e7610 100644 --- a/spacy/lang/lb/__init__.py +++ b/spacy/lang/lb/__init__.py @@ -1,24 +1,19 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .norm_exceptions import NORM_EXCEPTIONS from .punctuation import TOKENIZER_INFIXES from .lex_attrs import LEX_ATTRS from .tag_map import TAG_MAP from .stop_words import STOP_WORDS from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups +from ...attrs import LANG +from ...util import update_exc class LuxembourgishDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters.update(LEX_ATTRS) lex_attr_getters[LANG] = lambda text: "lb" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS - ) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) stop_words = STOP_WORDS tag_map = TAG_MAP diff --git a/spacy/lang/lb/norm_exceptions.py b/spacy/lang/lb/norm_exceptions.py deleted file mode 100644 index afc384228..000000000 --- a/spacy/lang/lb/norm_exceptions.py +++ /dev/null @@ -1,13 +0,0 @@ -# TODO -# norm execptions: find a possibility to deal with the zillions of spelling -# variants (vläicht = vlaicht, vleicht, viläicht, viläischt, etc. etc.) 
-# here one could include the most common spelling mistakes - -_exc = {"dass": "datt", "viläicht": "vläicht"} - - -NORM_EXCEPTIONS = {} - -for string, norm in _exc.items(): - NORM_EXCEPTIONS[string] = norm - NORM_EXCEPTIONS[string.title()] = norm diff --git a/spacy/lang/lex_attrs.py b/spacy/lang/lex_attrs.py index 339290d4a..0310b2b36 100644 --- a/spacy/lang/lex_attrs.py +++ b/spacy/lang/lex_attrs.py @@ -183,10 +183,6 @@ def suffix(string): return string[-3:] -def cluster(string): - return 0 - - def is_alpha(string): return string.isalpha() @@ -215,20 +211,11 @@ def is_stop(string, stops=set()): return string.lower() in stops -def is_oov(string): - return True - - -def get_prob(string): - return -20.0 - - LEX_ATTRS = { attrs.LOWER: lower, attrs.NORM: lower, attrs.PREFIX: prefix, attrs.SUFFIX: suffix, - attrs.CLUSTER: cluster, attrs.IS_ALPHA: is_alpha, attrs.IS_DIGIT: is_digit, attrs.IS_LOWER: is_lower, @@ -236,8 +223,6 @@ LEX_ATTRS = { attrs.IS_TITLE: is_title, attrs.IS_UPPER: is_upper, attrs.IS_STOP: is_stop, - attrs.IS_OOV: is_oov, - attrs.PROB: get_prob, attrs.LIKE_EMAIL: like_email, attrs.LIKE_NUM: like_num, attrs.IS_PUNCT: is_punct, diff --git a/spacy/lang/ml/__init__.py b/spacy/lang/ml/__init__.py new file mode 100644 index 000000000..d052ded1b --- /dev/null +++ b/spacy/lang/ml/__init__.py @@ -0,0 +1,18 @@ +# coding: utf8 +from __future__ import unicode_literals + +from .stop_words import STOP_WORDS + +from ...language import Language + + +class MalayalamDefaults(Language.Defaults): + stop_words = STOP_WORDS + + +class Malayalam(Language): + lang = "ml" + Defaults = MalayalamDefaults + + +__all__ = ["Malayalam"] diff --git a/spacy/lang/ml/examples.py b/spacy/lang/ml/examples.py new file mode 100644 index 000000000..a2a0ed10e --- /dev/null +++ b/spacy/lang/ml/examples.py @@ -0,0 +1,19 @@ +# coding: utf8 +from __future__ import unicode_literals + + +""" +Example sentences to test spaCy and its language models. + +>>> from spacy.lang.ml.examples import sentences +>>> docs = nlp.pipe(sentences) +""" + + +sentences = [ + "അനാവശ്യമായി കണ്ണിലും മൂക്കിലും വായിലും സ്പർശിക്കാതിരിക്കുക", + "പൊതുരംഗത്ത് മലയാള ഭാഷയുടെ സമഗ്രപുരോഗതി ലക്ഷ്യമാക്കി പ്രവർത്തിക്കുന്ന സംഘടനയായ മലയാളഐക്യവേദിയുടെ വിദ്യാർത്ഥിക്കൂട്ടായ്മയാണ് വിദ്യാർത്ഥി മലയാളവേദി", + "എന്താണ്‌ കവാടങ്ങൾ?", + "ചുരുക്കത്തിൽ വിക്കിപീഡിയയുടെ ഉള്ളടക്കത്തിലേക്കുള്ള പടിപ്പുരകളാണ്‌‌ കവാടങ്ങൾ. 
അവ ലളിതവും വായനക്കാരനെ ആകർഷിക്കുന്നതുമായിരിക്കും", + "പതിനൊന്നുപേർ വീതമുള്ള രണ്ടു ടീമുകൾ കളിക്കുന്ന സംഘകായിക വിനോദമാണു ക്രിക്കറ്റ്", +] diff --git a/spacy/lang/ml/lex_attrs.py b/spacy/lang/ml/lex_attrs.py new file mode 100644 index 000000000..468ad88f8 --- /dev/null +++ b/spacy/lang/ml/lex_attrs.py @@ -0,0 +1,80 @@ +# coding: utf8 +from __future__ import unicode_literals + +from ...attrs import LIKE_NUM + + +# reference 2: https://www.omniglot.com/language/numbers/malayalam.htm + +_num_words = [ + "പൂജ്യം ", + "ഒന്ന് ", + "രണ്ട് ", + "മൂന്ന് ", + "നാല്‌ ", + "അഞ്ച് ", + "ആറ് ", + "ഏഴ് ", + "എട്ട് ", + "ഒന്‍പത് ", + "പത്ത് ", + "പതിനൊന്ന്", + "പന്ത്രണ്ട്", + "പതി മൂന്നു", + "പതിനാല്", + "പതിനഞ്ച്", + "പതിനാറ്", + "പതിനേഴ്", + "പതിനെട്ട്", + "പത്തൊമ്പതു", + "ഇരുപത്", + "ഇരുപത്തിഒന്ന്", + "ഇരുപത്തിരണ്ട്‌", + "ഇരുപത്തിമൂന്ന്", + "ഇരുപത്തിനാല്", + "ഇരുപത്തിഅഞ്ചു", + "ഇരുപത്തിആറ്", + "ഇരുപത്തിഏഴ്", + "ഇരുപത്തിഎട്ടു", + "ഇരുപത്തിഒന്‍പത്", + "മുപ്പത്", + "മുപ്പത്തിഒന്ന്", + "മുപ്പത്തിരണ്ട്", + "മുപ്പത്തിമൂന്ന്", + "മുപ്പത്തിനാല്", + "മുപ്പത്തിഅഞ്ചു", + "മുപ്പത്തിആറ്", + "മുപ്പത്തിഏഴ്", + "മുപ്പത്തിഎട്ട്", + "മുപ്പത്തിഒന്‍പതു", + "നാല്‍പത്‌ ", + "അന്‍പത് ", + "അറുപത് ", + "എഴുപത് ", + "എണ്‍പത് ", + "തൊണ്ണൂറ് ", + "നുറ് ", + "ആയിരം ", + "പത്തുലക്ഷം", +] + + +def like_num(text): + """ + Check if text resembles a number + """ + if text.startswith(("+", "-", "±", "~")): + text = text[1:] + text = text.replace(",", "").replace(".", "") + if text.isdigit(): + return True + if text.count("/") == 1: + num, denom = text.split("/") + if num.isdigit() and denom.isdigit(): + return True + if text in _num_words: + return True + return False + + +LEX_ATTRS = {LIKE_NUM: like_num} diff --git a/spacy/lang/ml/stop_words.py b/spacy/lang/ml/stop_words.py new file mode 100644 index 000000000..8bd6a7e02 --- /dev/null +++ b/spacy/lang/ml/stop_words.py @@ -0,0 +1,17 @@ +# coding: utf8 +from __future__ import unicode_literals + + +STOP_WORDS = set( + """ +അത് +ഇത് +ആയിരുന്നു +ആകുന്നു +വരെ +അന്നേരം +അന്ന് +ഇന്ന് +ആണ് +""".split() +) diff --git a/spacy/lang/nb/syntax_iterators.py b/spacy/lang/nb/syntax_iterators.py index 96636b0b7..c09b0e840 100644 --- a/spacy/lang/nb/syntax_iterators.py +++ b/spacy/lang/nb/syntax_iterators.py @@ -1,7 +1,8 @@ from ...symbols import NOUN, PROPN, PRON +from ...errors import Errors -def noun_chunks(obj): +def noun_chunks(doclike): """ Detect base noun phrases from a dependency parse. Works on both Doc and Span. """ @@ -15,12 +16,16 @@ def noun_chunks(obj): "nmod", "nmod:poss", ] - doc = obj.doc # Ensure works on both Doc and Span. + doc = doclike.doc # Ensure works on both Doc and Span. 
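The new Malayalam `like_num` is a pure function, so it can be sanity-checked in isolation. A small hedged example (note that several `_num_words` entries above carry a trailing space, so only the space-suffixed spelling matches exactly):

    from spacy.lang.ml.lex_attrs import like_num

    print(like_num("10"))        # True: plain digits
    print(like_num("-10,000"))   # True: leading sign stripped, separators removed
    print(like_num("3/4"))       # True: numerator and denominator are both digits
    print(like_num("പത്ത് "))    # True: exact match against the listed word, trailing space included
    print(like_num("പത്ത്"))     # False: there is no entry without the space

Since spaCy tokens never include trailing whitespace, the space-suffixed entries can never match real tokens as written; trimming the entries, or comparing against `text.strip()`, would make them reachable.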
+ + if not doc.is_parsed: + raise ValueError(Errors.E029) + np_deps = [doc.vocab.strings[label] for label in labels] conj = doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") seen = set() - for i, word in enumerate(obj): + for i, word in enumerate(doclike): if word.pos not in (NOUN, PROPN, PRON): continue # Prevent nested chunks from being produced diff --git a/spacy/lang/nl/__init__.py b/spacy/lang/nl/__init__.py index c12b08d77..e99665e1d 100644 --- a/spacy/lang/nl/__init__.py +++ b/spacy/lang/nl/__init__.py @@ -2,7 +2,8 @@ from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .tag_map import TAG_MAP from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES +from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES +from .punctuation import TOKENIZER_SUFFIXES from .lemmatizer import DutchLemmatizer from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..norm_exceptions import BASE_NORMS @@ -22,6 +23,7 @@ class DutchDefaults(Language.Defaults): tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) stop_words = STOP_WORDS tag_map = TAG_MAP + prefixes = TOKENIZER_PREFIXES infixes = TOKENIZER_INFIXES suffixes = TOKENIZER_SUFFIXES diff --git a/spacy/lang/nl/punctuation.py b/spacy/lang/nl/punctuation.py index 3f3be61f8..d9dd2a6e3 100644 --- a/spacy/lang/nl/punctuation.py +++ b/spacy/lang/nl/punctuation.py @@ -1,7 +1,11 @@ -from ..char_classes import LIST_ELLIPSES, LIST_ICONS +from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_UNITS, merge_chars +from ..char_classes import LIST_PUNCT, LIST_QUOTES, CURRENCY, PUNCT from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER -from ..punctuation import TOKENIZER_SUFFIXES as DEFAULT_TOKENIZER_SUFFIXES +from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES + + +_prefixes = [",,"] + BASE_TOKENIZER_PREFIXES # Copied from `de` package. Main purpose is to ensure that hyphens are not @@ -19,20 +23,33 @@ _infixes = ( r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA), r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes), r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA), - r"(?<=[0-9])-(?=[0-9])", ] ) -# Remove "'s" suffix from suffix list. In Dutch, "'s" is a plural ending when -# it occurs as a suffix and a clitic for "eens" in standalone use. To avoid -# ambiguity it's better to just leave it attached when it occurs as a suffix. -default_suffix_blacklist = ("'s", "'S", "’s", "’S") -_suffixes = [ - suffix - for suffix in DEFAULT_TOKENIZER_SUFFIXES - if suffix not in default_suffix_blacklist -] +_list_units = [u for u in LIST_UNITS if u != "%"] +_units = merge_chars(" ".join(_list_units)) +_suffixes = ( + ["''"] + + LIST_PUNCT + + LIST_ELLIPSES + + LIST_QUOTES + + LIST_ICONS + + ["—", "–"] + + [ + r"(?<=[0-9])\+", + r"(?<=°[FfCcKk])\.", + r"(?<=[0-9])(?:{c})".format(c=CURRENCY), + r"(?<=[0-9])(?:{u})".format(u=_units), + r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format( + al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT + ), + r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER), + ] +) + + +TOKENIZER_PREFIXES = _prefixes TOKENIZER_INFIXES = _infixes TOKENIZER_SUFFIXES = _suffixes diff --git a/spacy/lang/nl/tokenizer_exceptions.py b/spacy/lang/nl/tokenizer_exceptions.py index 12ab8aef5..df69c7a8a 100644 --- a/spacy/lang/nl/tokenizer_exceptions.py +++ b/spacy/lang/nl/tokenizer_exceptions.py @@ -13,317 +13,1585 @@ from ...symbols import ORTH # are extremely domain-specific. 
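The net effect of the Dutch punctuation changes above: `,,` joins the prefixes, `''` and the dash characters `—` and `–` join the suffixes, and `%` is dropped from the unit suffixes so it stays attached to numbers. A rough hedged check, assuming `spacy.blank("nl")` picks up the new tables:

    import spacy

    nlp = spacy.blank("nl")
    print([t.text for t in nlp(",,Goedemorgen''")])  # expected: [',,', 'Goedemorgen', "''"]
    print([t.text for t in nlp("100%")])             # expected: ['100%'], '%' no longer split off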
Tokenizer performance may benefit from some # slight pruning, although no performance regression has been observed so far. -# fmt: off -abbrevs = ['a.2d.', 'a.a.', 'a.a.j.b.', 'a.f.t.', 'a.g.j.b.', - 'a.h.v.', 'a.h.w.', 'a.hosp.', 'a.i.', 'a.j.b.', 'a.j.t.', - 'a.m.', 'a.m.r.', 'a.p.m.', 'a.p.r.', 'a.p.t.', 'a.s.', - 'a.t.d.f.', 'a.u.b.', 'a.v.a.', 'a.w.', 'aanbev.', - 'aanbev.comm.', 'aant.', 'aanv.st.', 'aanw.', 'vnw.', - 'aanw.vnw.', 'abd.', 'abm.', 'abs.', 'acc.act.', - 'acc.bedr.m.', 'acc.bedr.t.', 'achterv.', 'act.dr.', - 'act.dr.fam.', 'act.fisc.', 'act.soc.', 'adm.akk.', - 'adm.besl.', 'adm.lex.', 'adm.onderr.', 'adm.ov.', 'adv.', - 'adv.', 'gen.', 'adv.bl.', 'afd.', 'afl.', 'aggl.verord.', - 'agr.', 'al.', 'alg.', 'alg.richts.', 'amén.', 'ann.dr.', - 'ann.dr.lg.', 'ann.dr.sc.pol.', 'ann.ét.eur.', - 'ann.fac.dr.lg.', 'ann.jur.créd.', - 'ann.jur.créd.règl.coll.', 'ann.not.', 'ann.parl.', - 'ann.prat.comm.', 'app.', 'arb.', 'aud.', 'arbbl.', - 'arbh.', 'arbit.besl.', 'arbrb.', 'arr.', 'arr.cass.', - 'arr.r.v.st.', 'arr.verbr.', 'arrondrb.', 'art.', 'artw.', - 'aud.', 'b.', 'b.', 'b.&w.', 'b.a.', 'b.a.s.', 'b.b.o.', - 'b.best.dep.', 'b.br.ex.', 'b.coll.fr.gem.comm.', - 'b.coll.vl.gem.comm.', 'b.d.cult.r.', 'b.d.gem.ex.', - 'b.d.gem.reg.', 'b.dep.', 'b.e.b.', 'b.f.r.', - 'b.fr.gem.ex.', 'b.fr.gem.reg.', 'b.i.h.', 'b.inl.j.d.', - 'b.inl.s.reg.', 'b.j.', 'b.l.', 'b.o.z.', 'b.prov.r.', - 'b.r.h.', 'b.s.', 'b.sr.', 'b.stb.', 'b.t.i.r.', - 'b.t.s.z.', 'b.t.w.rev.', 'b.v.', - 'b.ver.coll.gem.gem.comm.', 'b.verg.r.b.', 'b.versl.', - 'b.vl.ex.', 'b.voorl.reg.', 'b.w.', 'b.w.gew.ex.', - 'b.z.d.g.', 'b.z.v.', 'bab.', 'bedr.org.', 'begins.', - 'beheersov.', 'bekendm.comm.', 'bel.', 'bel.besch.', - 'bel.w.p.', 'beleidsov.', 'belg.', 'grondw.', 'ber.', - 'ber.w.', 'besch.', 'besl.', 'beslagr.', 'bestuurswet.', - 'bet.', 'betr.', 'betr.', 'vnw.', 'bevest.', 'bew.', - 'bijbl.', 'ind.', 'eig.', 'bijbl.n.bijdr.', 'bijl.', - 'bijv.', 'bijw.', 'bijz.decr.', 'bin.b.', 'bkh.', 'bl.', - 'blz.', 'bm.', 'bn.', 'rh.', 'bnw.', 'bouwr.', 'br.parl.', - 'bs.', 'bull.', 'bull.adm.pénit.', 'bull.ass.', - 'bull.b.m.m.', 'bull.bel.', 'bull.best.strafinr.', - 'bull.bmm.', 'bull.c.b.n.', 'bull.c.n.c.', 'bull.cbn.', - 'bull.centr.arb.', 'bull.cnc.', 'bull.contr.', - 'bull.doc.min.fin.', 'bull.f.e.b.', 'bull.feb.', - 'bull.fisc.fin.r.', 'bull.i.u.m.', - 'bull.inf.ass.secr.soc.', 'bull.inf.i.e.c.', - 'bull.inf.i.n.a.m.i.', 'bull.inf.i.r.e.', 'bull.inf.iec.', - 'bull.inf.inami.', 'bull.inf.ire.', 'bull.inst.arb.', - 'bull.ium.', 'bull.jur.imm.', 'bull.lég.b.', 'bull.off.', - 'bull.trim.b.dr.comp.', 'bull.us.', 'bull.v.b.o.', - 'bull.vbo.', 'bv.', 'bw.', 'bxh.', 'byz.', 'c.', 'c.a.', - 'c.a.-a.', 'c.a.b.g.', 'c.c.', 'c.c.i.', 'c.c.s.', - 'c.conc.jur.', 'c.d.e.', 'c.d.p.k.', 'c.e.', 'c.ex.', - 'c.f.', 'c.h.a.', 'c.i.f.', 'c.i.f.i.c.', 'c.j.', 'c.l.', - 'c.n.', 'c.o.d.', 'c.p.', 'c.pr.civ.', 'c.q.', 'c.r.', - 'c.r.a.', 'c.s.', 'c.s.a.', 'c.s.q.n.', 'c.v.', 'c.v.a.', - 'c.v.o.', 'ca.', 'cadeaust.', 'cah.const.', - 'cah.dr.europ.', 'cah.dr.immo.', 'cah.dr.jud.', 'cal.', - '2d.', 'cal.', '3e.', 'cal.', 'rprt.', 'cap.', 'carg.', - 'cass.', 'cass.', 'verw.', 'cert.', 'cf.', 'ch.', 'chron.', - 'chron.d.s.', 'chron.dr.not.', 'cie.', 'cie.', - 'verz.schr.', 'cir.', 'circ.', 'circ.z.', 'cit.', - 'cit.loc.', 'civ.', 'cl.et.b.', 'cmt.', 'co.', - 'cognoss.v.', 'coll.', 'v.', 'b.', 'colp.w.', 'com.', - 'com.', 'cas.', 'com.v.min.', 'comm.', 'comm.', 'v.', - 'comm.bijz.ov.', 'comm.erf.', 'comm.fin.', 'comm.ger.', - 'comm.handel.', 
'comm.pers.', 'comm.pub.', 'comm.straf.', - 'comm.v.', 'comm.venn.', 'comm.verz.', 'comm.voor.', - 'comp.', 'compt.w.', 'computerr.', 'con.m.', 'concl.', - 'concr.', 'conf.', 'confl.w.', 'confl.w.huwbetr.', 'cons.', - 'conv.', 'coöp.', 'ver.', 'corr.', 'corr.bl.', - 'cour.fisc.', 'cour.immo.', 'cridon.', 'crim.', 'cur.', - 'cur.', 'crt.', 'curs.', 'd.', 'd.-g.', 'd.a.', 'd.a.v.', - 'd.b.f.', 'd.c.', 'd.c.c.r.', 'd.d.', 'd.d.p.', 'd.e.t.', - 'd.gem.r.', 'd.h.', 'd.h.z.', 'd.i.', 'd.i.t.', 'd.j.', - 'd.l.r.', 'd.m.', 'd.m.v.', 'd.o.v.', 'd.parl.', 'd.w.z.', - 'dact.', 'dat.', 'dbesch.', 'dbesl.', 'decr.', 'decr.d.', - 'decr.fr.', 'decr.vl.', 'decr.w.', 'def.', 'dep.opv.', - 'dep.rtl.', 'derg.', 'desp.', 'det.mag.', 'deurw.regl.', - 'dez.', 'dgl.', 'dhr.', 'disp.', 'diss.', 'div.', - 'div.act.', 'div.bel.', 'dl.', 'dln.', 'dnotz.', 'doc.', - 'hist.', 'doc.jur.b.', 'doc.min.fin.', 'doc.parl.', - 'doctr.', 'dpl.', 'dpl.besl.', 'dr.', 'dr.banc.fin.', - 'dr.circ.', 'dr.inform.', 'dr.mr.', 'dr.pén.entr.', - 'dr.q.m.', 'drs.', 'dtp.', 'dwz.', 'dyn.', 'e.', 'e.a.', - 'e.b.', 'tek.mod.', 'e.c.', 'e.c.a.', 'e.d.', 'e.e.', - 'e.e.a.', 'e.e.g.', 'e.g.', 'e.g.a.', 'e.h.a.', 'e.i.', - 'e.j.', 'e.m.a.', 'e.n.a.c.', 'e.o.', 'e.p.c.', 'e.r.c.', - 'e.r.f.', 'e.r.h.', 'e.r.o.', 'e.r.p.', 'e.r.v.', - 'e.s.r.a.', 'e.s.t.', 'e.v.', 'e.v.a.', 'e.w.', 'e&o.e.', - 'ec.pol.r.', 'econ.', 'ed.', 'ed(s).', 'eff.', 'eig.', - 'eig.mag.', 'eil.', 'elektr.', 'enmb.', 'enz.', 'err.', - 'etc.', 'etq.', 'eur.', 'parl.', 'eur.t.s.', 'ev.', 'evt.', - 'ex.', 'ex.crim.', 'exec.', 'f.', 'f.a.o.', 'f.a.q.', - 'f.a.s.', 'f.i.b.', 'f.j.f.', 'f.o.b.', 'f.o.r.', 'f.o.s.', - 'f.o.t.', 'f.r.', 'f.supp.', 'f.suppl.', 'fa.', 'facs.', - 'fasc.', 'fg.', 'fid.ber.', 'fig.', 'fin.verh.w.', 'fisc.', - 'fisc.', 'tijdschr.', 'fisc.act.', 'fisc.koer.', 'fl.', - 'form.', 'foro.', 'it.', 'fr.', 'fr.cult.r.', 'fr.gem.r.', - 'fr.parl.', 'fra.', 'ft.', 'g.', 'g.a.', 'g.a.v.', - 'g.a.w.v.', 'g.g.d.', 'g.m.t.', 'g.o.', 'g.omt.e.', 'g.p.', - 'g.s.', 'g.v.', 'g.w.w.', 'geb.', 'gebr.', 'gebrs.', - 'gec.', 'gec.decr.', 'ged.', 'ged.st.', 'gedipl.', - 'gedr.st.', 'geh.', 'gem.', 'gem.', 'gem.', - 'gem.gem.comm.', 'gem.st.', 'gem.stem.', 'gem.w.', - 'gemeensch.optr.', 'gemeensch.standp.', 'gemeensch.strat.', - 'gemeent.', 'gemeent.b.', 'gemeent.regl.', - 'gemeent.verord.', 'geol.', 'geopp.', 'gepubl.', - 'ger.deurw.', 'ger.w.', 'gerekw.', 'gereq.', 'gesch.', - 'get.', 'getr.', 'gev.m.', 'gev.maatr.', 'gew.', 'ghert.', - 'gir.eff.verk.', 'gk.', 'gr.', 'gramm.', 'grat.w.', - 'grootb.w.', 'grs.', 'grvm.', 'grw.', 'gst.', 'gw.', - 'h.a.', 'h.a.v.o.', 'h.b.o.', 'h.e.a.o.', 'h.e.g.a.', - 'h.e.geb.', 'h.e.gestr.', 'h.l.', 'h.m.', 'h.o.', 'h.r.', - 'h.t.l.', 'h.t.m.', 'h.w.geb.', 'hand.', 'handelsn.w.', - 'handelspr.', 'handelsr.w.', 'handelsreg.w.', 'handv.', - 'harv.l.rev.', 'hc.', 'herald.', 'hert.', 'herz.', - 'hfdst.', 'hfst.', 'hgrw.', 'hhr.', 'hist.', 'hooggel.', - 'hoogl.', 'hosp.', 'hpw.', 'hr.', 'hr.', 'ms.', 'hr.ms.', - 'hregw.', 'hrg.', 'hst.', 'huis.just.', 'huisv.w.', - 'huurbl.', 'hv.vn.', 'hw.', 'hyp.w.', 'i.b.s.', 'i.c.', - 'i.c.m.h.', 'i.e.', 'i.f.', 'i.f.p.', 'i.g.v.', 'i.h.', - 'i.h.a.', 'i.h.b.', 'i.l.pr.', 'i.o.', 'i.p.o.', 'i.p.r.', - 'i.p.v.', 'i.pl.v.', 'i.r.d.i.', 'i.s.m.', 'i.t.t.', - 'i.v.', 'i.v.m.', 'i.v.s.', 'i.w.tr.', 'i.z.', 'ib.', - 'ibid.', 'icip-ing.cons.', 'iem.', 'indic.soc.', 'indiv.', - 'inf.', 'inf.i.d.a.c.', 'inf.idac.', 'inf.r.i.z.i.v.', - 'inf.riziv.', 'inf.soc.secr.', 'ing.', 'ing.', 'cons.', - 'ing.cons.', 'inst.', 'int.', 
'int.', 'rechtsh.', - 'strafz.', 'interm.', 'intern.fisc.act.', - 'intern.vervoerr.', 'inv.', 'inv.', 'f.', 'inv.w.', - 'inv.wet.', 'invord.w.', 'inz.', 'ir.', 'irspr.', 'iwtr.', - 'j.', 'j.-cl.', 'j.c.b.', 'j.c.e.', 'j.c.fl.', 'j.c.j.', - 'j.c.p.', 'j.d.e.', 'j.d.f.', 'j.d.s.c.', 'j.dr.jeun.', - 'j.j.d.', 'j.j.p.', 'j.j.pol.', 'j.l.', 'j.l.m.b.', - 'j.l.o.', 'j.p.a.', 'j.r.s.', 'j.t.', 'j.t.d.e.', - 'j.t.dr.eur.', 'j.t.o.', 'j.t.t.', 'jaarl.', 'jb.hand.', - 'jb.kred.', 'jb.kred.c.s.', 'jb.l.r.b.', 'jb.lrb.', - 'jb.markt.', 'jb.mens.', 'jb.t.r.d.', 'jb.trd.', - 'jeugdrb.', 'jeugdwerkg.w.', 'jg.', 'jis.', 'jl.', - 'journ.jur.', 'journ.prat.dr.fisc.fin.', 'journ.proc.', - 'jrg.', 'jur.', 'jur.comm.fl.', 'jur.dr.soc.b.l.n.', - 'jur.f.p.e.', 'jur.fpe.', 'jur.niv.', 'jur.trav.brux.', - 'jurambt.', 'jv.cass.', 'jv.h.r.j.', 'jv.hrj.', 'jw.', - 'k.', 'k.', 'k.b.', 'k.g.', 'k.k.', 'k.m.b.o.', 'k.o.o.', - 'k.v.k.', 'k.v.v.v.', 'kadasterw.', 'kaderb.', 'kador.', - 'kbo-nr.', 'kg.', 'kh.', 'kiesw.', 'kind.bes.v.', 'kkr.', - 'koopv.', 'kr.', 'krankz.w.', 'ksbel.', 'kt.', 'ktg.', - 'ktr.', 'kvdm.', 'kw.r.', 'kymr.', 'kzr.', 'kzw.', 'l.', - 'l.b.', 'l.b.o.', 'l.bas.', 'l.c.', 'l.gew.', 'l.j.', - 'l.k.', 'l.l.', 'l.o.', 'l.r.b.', 'l.u.v.i.', 'l.v.r.', - 'l.v.w.', 'l.w.', "l'exp.-compt.b..", 'l’exp.-compt.b.', - 'landinr.w.', 'landscrt.', 'lat.', 'law.ed.', 'lett.', - 'levensverz.', 'lgrs.', 'lidw.', 'limb.rechtsl.', 'lit.', - 'litt.', 'liw.', 'liwet.', 'lk.', 'll.', 'll.(l.)l.r.', - 'loonw.', 'losbl.', 'ltd.', 'luchtv.', 'luchtv.w.', 'm.', - 'm.', 'not.', 'm.a.v.o.', 'm.a.w.', 'm.b.', 'm.b.o.', - 'm.b.r.', 'm.b.t.', 'm.d.g.o.', 'm.e.a.o.', 'm.e.r.', - 'm.h.', 'm.h.d.', 'm.i.v.', 'm.j.t.', 'm.k.', 'm.m.', - 'm.m.a.', 'm.m.h.h.', 'm.m.v.', 'm.n.', 'm.not.fisc.', - 'm.nt.', 'm.o.', 'm.r.', 'm.s.a.', 'm.u.p.', 'm.v.a.', - 'm.v.h.n.', 'm.v.t.', 'm.z.', 'maatr.teboekgest.luchtv.', - 'maced.', 'mand.', 'max.', 'mbl.not.', 'me.', 'med.', - 'med.', 'v.b.o.', 'med.b.u.f.r.', 'med.bufr.', 'med.vbo.', - 'meerv.', 'meetbr.w.', 'mém.adm.', 'mgr.', 'mgrs.', 'mhd.', - 'mi.verantw.', 'mil.', 'mil.bed.', 'mil.ger.', 'min.', - 'min.', 'aanbev.', 'min.', 'circ.', 'min.', 'fin.', - 'min.j.omz.', 'min.just.circ.', 'mitt.', 'mnd.', 'mod.', - 'mon.', 'mouv.comm.', 'mr.', 'ms.', 'muz.', 'mv.', 'n.', - 'chr.', 'n.a.', 'n.a.g.', 'n.a.v.', 'n.b.', 'n.c.', - 'n.chr.', 'n.d.', 'n.d.r.', 'n.e.a.', 'n.g.', 'n.h.b.c.', - 'n.j.', 'n.j.b.', 'n.j.w.', 'n.l.', 'n.m.', 'n.m.m.', - 'n.n.', 'n.n.b.', 'n.n.g.', 'n.n.k.', 'n.o.m.', 'n.o.t.k.', - 'n.rapp.', 'n.tijd.pol.', 'n.v.', 'n.v.d.r.', 'n.v.d.v.', - 'n.v.o.b.', 'n.v.t.', 'nat.besch.w.', 'nat.omb.', - 'nat.pers.', 'ned.cult.r.', 'neg.verkl.', 'nhd.', 'wisk.', - 'njcm-bull.', 'nl.', 'nnd.', 'no.', 'not.fisc.m.', - 'not.w.', 'not.wet.', 'nr.', 'nrs.', 'nste.', 'nt.', - 'numism.', 'o.', 'o.a.', 'o.b.', 'o.c.', 'o.g.', 'o.g.v.', - 'o.i.', 'o.i.d.', 'o.m.', 'o.o.', 'o.o.d.', 'o.o.v.', - 'o.p.', 'o.r.', 'o.regl.', 'o.s.', 'o.t.s.', 'o.t.t.', - 'o.t.t.t.', 'o.t.t.z.', 'o.tk.t.', 'o.v.t.', 'o.v.t.t.', - 'o.v.tk.t.', 'o.v.v.', 'ob.', 'obsv.', 'octr.', - 'octr.gem.regl.', 'octr.regl.', 'oe.', 'off.pol.', 'ofra.', - 'ohd.', 'omb.', 'omnil.', 'omz.', 'on.ww.', 'onderr.', - 'onfrank.', 'onteig.w.', 'ontw.', 'b.w.', 'onuitg.', - 'onz.', 'oorl.w.', 'op.cit.', 'opin.pa.', 'opm.', 'or.', - 'ord.br.', 'ord.gem.', 'ors.', 'orth.', 'os.', 'osm.', - 'ov.', 'ov.w.i.', 'ov.w.ii.', 'ov.ww.', 'overg.w.', - 'overw.', 'ovkst.', 'oz.', 'p.', 'p.a.', 'p.a.o.', - 'p.b.o.', 'p.e.', 'p.g.', 'p.j.', 'p.m.', 'p.m.a.', 'p.o.', 
- 'p.o.j.t.', 'p.p.', 'p.v.', 'p.v.s.', 'pachtw.', 'pag.', - 'pan.', 'pand.b.', 'pand.pér.', 'parl.gesch.', - 'parl.gesch.', 'inv.', 'parl.st.', 'part.arb.', 'pas.', - 'pasin.', 'pat.', 'pb.c.', 'pb.l.', 'pens.', - 'pensioenverz.', 'per.ber.i.b.r.', 'per.ber.ibr.', 'pers.', - 'st.', 'pft.', 'pk.', 'pktg.', 'plv.', 'po.', 'pol.', - 'pol.off.', 'pol.r.', 'pol.w.', 'postbankw.', 'postw.', - 'pp.', 'pr.', 'preadv.', 'pres.', 'prf.', 'prft.', 'prg.', - 'prijz.w.', 'proc.', 'procesregl.', 'prof.', 'prot.', - 'prov.', 'prov.b.', 'prov.instr.h.m.g.', 'prov.regl.', - 'prov.verord.', 'prov.w.', 'publ.', 'pun.', 'pw.', - 'q.b.d.', 'q.e.d.', 'q.q.', 'q.r.', 'r.', 'r.a.b.g.', - 'r.a.c.e.', 'r.a.j.b.', 'r.b.d.c.', 'r.b.d.i.', 'r.b.s.s.', - 'r.c.', 'r.c.b.', 'r.c.d.c.', 'r.c.j.b.', 'r.c.s.j.', - 'r.cass.', 'r.d.c.', 'r.d.i.', 'r.d.i.d.c.', 'r.d.j.b.', - 'r.d.j.p.', 'r.d.p.c.', 'r.d.s.', 'r.d.t.i.', 'r.e.', - 'r.f.s.v.p.', 'r.g.a.r.', 'r.g.c.f.', 'r.g.d.c.', 'r.g.f.', - 'r.g.z.', 'r.h.a.', 'r.i.c.', 'r.i.d.a.', 'r.i.e.j.', - 'r.i.n.', 'r.i.s.a.', 'r.j.d.a.', 'r.j.i.', 'r.k.', 'r.l.', - 'r.l.g.b.', 'r.med.', 'r.med.rechtspr.', 'r.n.b.', 'r.o.', - 'r.ov.', 'r.p.', 'r.p.d.b.', 'r.p.o.t.', 'r.p.r.j.', - 'r.p.s.', 'r.r.d.', 'r.r.s.', 'r.s.', 'r.s.v.p.', - 'r.stvb.', 'r.t.d.f.', 'r.t.d.h.', 'r.t.l.', - 'r.trim.dr.eur.', 'r.v.a.', 'r.verkb.', 'r.w.', 'r.w.d.', - 'rap.ann.c.a.', 'rap.ann.c.c.', 'rap.ann.c.e.', - 'rap.ann.c.s.j.', 'rap.ann.ca.', 'rap.ann.cass.', - 'rap.ann.cc.', 'rap.ann.ce.', 'rap.ann.csj.', 'rapp.', - 'rb.', 'rb.kh.', 'rdn.', 'rdnr.', 're.pers.', 'rec.', - 'rec.c.i.j.', 'rec.c.j.c.e.', 'rec.cij.', 'rec.cjce.', - 'rec.gén.enr.not.', 'rechtsk.t.', 'rechtspl.zeem.', - 'rechtspr.arb.br.', 'rechtspr.b.f.e.', 'rechtspr.bfe.', - 'rechtspr.soc.r.b.l.n.', 'recl.reg.', 'rect.', 'red.', - 'reg.', 'reg.huiz.bew.', 'reg.w.', 'registr.w.', 'regl.', - 'regl.', 'r.v.k.', 'regl.besl.', 'regl.onderr.', - 'regl.r.t.', 'rep.', 'rép.fisc.', 'rép.not.', 'rep.r.j.', - 'rep.rj.', 'req.', 'res.', 'resp.', 'rev.', 'rev.', - 'comp.', 'rev.', 'trim.', 'civ.', 'rev.', 'trim.', 'comm.', - 'rev.acc.trav.', 'rev.adm.', 'rev.b.compt.', - 'rev.b.dr.const.', 'rev.b.dr.intern.', 'rev.b.séc.soc.', - 'rev.banc.fin.', 'rev.comm.', 'rev.cons.prud.', - 'rev.dr.b.', 'rev.dr.commun.', 'rev.dr.étr.', - 'rev.dr.fam.', 'rev.dr.intern.comp.', 'rev.dr.mil.', - 'rev.dr.min.', 'rev.dr.pén.', 'rev.dr.pén.mil.', - 'rev.dr.rur.', 'rev.dr.u.l.b.', 'rev.dr.ulb.', 'rev.exp.', - 'rev.faill.', 'rev.fisc.', 'rev.gd.', 'rev.hist.dr.', - 'rev.i.p.c.', 'rev.ipc.', 'rev.not.b.', - 'rev.prat.dr.comm.', 'rev.prat.not.b.', 'rev.prat.soc.', - 'rev.rec.', 'rev.rw.', 'rev.trav.', 'rev.trim.d.h.', - 'rev.trim.dr.fam.', 'rev.urb.', 'richtl.', 'riv.dir.int.', - 'riv.dir.int.priv.proc.', 'rk.', 'rln.', 'roln.', 'rom.', - 'rondz.', 'rov.', 'rtl.', 'rubr.', 'ruilv.wet.', - 'rv.verdr.', 'rvkb.', 's.', 's.', 's.a.', 's.b.n.', - 's.ct.', 's.d.', 's.e.c.', 's.e.et.o.', 's.e.w.', - 's.exec.rept.', 's.hrg.', 's.j.b.', 's.l.', 's.l.e.a.', - 's.l.n.d.', 's.p.a.', 's.s.', 's.t.', 's.t.b.', 's.v.', - 's.v.p.', 'samenw.', 'sc.', 'sch.', 'scheidsr.uitspr.', - 'schepel.besl.', 'secr.comm.', 'secr.gen.', 'sect.soc.', - 'sess.', 'cas.', 'sir.', 'soc.', 'best.', 'soc.', 'handv.', - 'soc.', 'verz.', 'soc.act.', 'soc.best.', 'soc.kron.', - 'soc.r.', 'soc.sw.', 'soc.weg.', 'sofi-nr.', 'somm.', - 'somm.ann.', 'sp.c.c.', 'sr.', 'ss.', 'st.doc.b.c.n.a.r.', - 'st.doc.bcnar.', 'st.vw.', 'stagever.', 'stas.', 'stat.', - 'stb.', 'stbl.', 'stcrt.', 'stud.dipl.', 'su.', 'subs.', - 'subst.', 
'succ.w.', 'suppl.', 'sv.', 'sw.', 't.', 't.a.', - 't.a.a.', 't.a.n.', 't.a.p.', 't.a.s.n.', 't.a.v.', - 't.a.v.w.', 't.aann.', 't.acc.', 't.agr.r.', 't.app.', - 't.b.b.r.', 't.b.h.', 't.b.m.', 't.b.o.', 't.b.p.', - 't.b.r.', 't.b.s.', 't.b.v.', 't.bankw.', 't.belg.not.', - 't.desk.', 't.e.m.', 't.e.p.', 't.f.r.', 't.fam.', - 't.fin.r.', 't.g.r.', 't.g.t.', 't.g.v.', 't.gem.', - 't.gez.', 't.huur.', 't.i.n.', 't.j.k.', 't.l.l.', - 't.l.v.', 't.m.', 't.m.r.', 't.m.w.', 't.mil.r.', - 't.mil.strafr.', 't.not.', 't.o.', 't.o.r.b.', 't.o.v.', - 't.ontv.', 't.p.r.', 't.pol.', 't.r.', 't.r.g.', - 't.r.o.s.', 't.r.v.', 't.s.r.', 't.strafr.', 't.t.', - 't.u.', 't.v.c.', 't.v.g.', 't.v.m.r.', 't.v.o.', 't.v.v.', - 't.v.v.d.b.', 't.v.w.', 't.verz.', 't.vred.', 't.vreemd.', - 't.w.', 't.w.k.', 't.w.v.', 't.w.v.r.', 't.wrr.', 't.z.', - 't.z.t.', 't.z.v.', 'taalk.', 'tar.burg.z.', 'td.', - 'techn.', 'telecomm.', 'toel.', 'toel.st.v.w.', 'toep.', - 'toep.regl.', 'tom.', 'top.', 'trans.b.', 'transp.r.', - 'trb.', 'trib.', 'trib.civ.', 'trib.gr.inst.', 'ts.', - 'ts.', 'best.', 'ts.', 'verv.', 'turnh.rechtsl.', 'tvpol.', - 'tvpr.', 'tvrechtsgesch.', 'tw.', 'u.', 'u.a.', 'u.a.r.', - 'u.a.v.', 'u.c.', 'u.c.c.', 'u.g.', 'u.p.', 'u.s.', - 'u.s.d.c.', 'uitdr.', 'uitl.w.', 'uitv.besch.div.b.', - 'uitv.besl.', 'uitv.besl.', 'succ.w.', 'uitv.besl.bel.rv.', - 'uitv.besl.l.b.', 'uitv.reg.', 'inv.w.', 'uitv.reg.bel.d.', - 'uitv.reg.afd.verm.', 'uitv.reg.lb.', 'uitv.reg.succ.w.', - 'univ.', 'univ.verkl.', 'v.', 'v.', 'chr.', 'v.a.', - 'v.a.v.', 'v.c.', 'v.chr.', 'v.h.', 'v.huw.verm.', 'v.i.', - 'v.i.o.', 'v.k.a.', 'v.m.', 'v.o.f.', 'v.o.n.', - 'v.onderh.verpl.', 'v.p.', 'v.r.', 'v.s.o.', 'v.t.t.', - 'v.t.t.t.', 'v.tk.t.', 'v.toep.r.vert.', 'v.v.b.', - 'v.v.g.', 'v.v.t.', 'v.v.t.t.', 'v.v.tk.t.', 'v.w.b.', - 'v.z.m.', 'vb.', 'vb.bo.', 'vbb.', 'vc.', 'vd.', 'veldw.', - 'ver.k.', 'ver.verg.gem.', 'gem.comm.', 'verbr.', 'verd.', - 'verdr.', 'verdr.v.', 'tek.mod.', 'verenw.', 'verg.', - 'verg.fr.gem.', 'comm.', 'verkl.', 'verkl.herz.gw.', - 'verl.', 'deelw.', 'vern.', 'verord.', 'vers.r.', - 'versch.', 'versl.c.s.w.', 'versl.csw.', 'vert.', 'verw.', - 'verz.', 'verz.w.', 'verz.wett.besl.', - 'verz.wett.decr.besl.', 'vgl.', 'vid.', 'viss.w.', - 'vl.parl.', 'vl.r.', 'vl.t.gez.', 'vl.w.reg.', - 'vl.w.succ.', 'vlg.', 'vn.', 'vnl.', 'vnw.', 'vo.', - 'vo.bl.', 'voegw.', 'vol.', 'volg.', 'volt.', 'deelw.', - 'voorl.', 'voorz.', 'vord.w.', 'vorst.d.', 'vr.', 'vred.', - 'vrg.', 'vnw.', 'vrijgrs.', 'vs.', 'vt.', 'vw.', 'vz.', - 'vzngr.', 'vzr.', 'w.', 'w.a.', 'w.b.r.', 'w.c.h.', - 'w.conf.huw.', 'w.conf.huwelijksb.', 'w.consum.kr.', - 'w.f.r.', 'w.g.', 'w.gew.r.', 'w.ident.pl.', 'w.just.doc.', - 'w.kh.', 'w.l.r.', 'w.l.v.', 'w.mil.straf.spr.', 'w.n.', - 'w.not.ambt.', 'w.o.', 'w.o.d.huurcomm.', 'w.o.d.k.', - 'w.openb.manif.', 'w.parl.', 'w.r.', 'w.reg.', 'w.succ.', - 'w.u.b.', 'w.uitv.pl.verord.', 'w.v.', 'w.v.k.', - 'w.v.m.s.', 'w.v.r.', 'w.v.w.', 'w.venn.', 'wac.', 'wd.', - 'wetb.', 'n.v.h.', 'wgb.', 'winkelt.w.', 'wisk.', - 'wka-verkl.', 'wnd.', 'won.w.', 'woningw.', 'woonr.w.', - 'wrr.', 'wrr.ber.', 'wrsch.', 'ws.', 'wsch.', 'wsr.', - 'wtvb.', 'ww.', 'x.d.', 'z.a.', 'z.g.', 'z.i.', 'z.j.', - 'z.o.z.', 'z.p.', 'z.s.m.', 'zg.', 'zgn.', 'zn.', 'znw.', - 'zr.', 'zr.', 'ms.', 'zr.ms.'] -# fmt: on +abbrevs = [ + "a.2d.", + "a.a.", + "a.a.j.b.", + "a.f.t.", + "a.g.j.b.", + "a.h.v.", + "a.h.w.", + "a.hosp.", + "a.i.", + "a.j.b.", + "a.j.t.", + "a.m.", + "a.m.r.", + "a.p.m.", + "a.p.r.", + "a.p.t.", + "a.s.", + "a.t.d.f.", + "a.u.b.", + 
"a.v.a.", + "a.w.", + "aanbev.", + "aanbev.comm.", + "aant.", + "aanv.st.", + "aanw.", + "vnw.", + "aanw.vnw.", + "abd.", + "abm.", + "abs.", + "acc.act.", + "acc.bedr.m.", + "acc.bedr.t.", + "achterv.", + "act.dr.", + "act.dr.fam.", + "act.fisc.", + "act.soc.", + "adm.akk.", + "adm.besl.", + "adm.lex.", + "adm.onderr.", + "adm.ov.", + "adv.", + "adv.", + "gen.", + "adv.bl.", + "afd.", + "afl.", + "aggl.verord.", + "agr.", + "al.", + "alg.", + "alg.richts.", + "amén.", + "ann.dr.", + "ann.dr.lg.", + "ann.dr.sc.pol.", + "ann.ét.eur.", + "ann.fac.dr.lg.", + "ann.jur.créd.", + "ann.jur.créd.règl.coll.", + "ann.not.", + "ann.parl.", + "ann.prat.comm.", + "app.", + "arb.", + "aud.", + "arbbl.", + "arbh.", + "arbit.besl.", + "arbrb.", + "arr.", + "arr.cass.", + "arr.r.v.st.", + "arr.verbr.", + "arrondrb.", + "art.", + "artw.", + "aud.", + "b.", + "b.", + "b.&w.", + "b.a.", + "b.a.s.", + "b.b.o.", + "b.best.dep.", + "b.br.ex.", + "b.coll.fr.gem.comm.", + "b.coll.vl.gem.comm.", + "b.d.cult.r.", + "b.d.gem.ex.", + "b.d.gem.reg.", + "b.dep.", + "b.e.b.", + "b.f.r.", + "b.fr.gem.ex.", + "b.fr.gem.reg.", + "b.i.h.", + "b.inl.j.d.", + "b.inl.s.reg.", + "b.j.", + "b.l.", + "b.o.z.", + "b.prov.r.", + "b.r.h.", + "b.s.", + "b.sr.", + "b.stb.", + "b.t.i.r.", + "b.t.s.z.", + "b.t.w.rev.", + "b.v.", + "b.ver.coll.gem.gem.comm.", + "b.verg.r.b.", + "b.versl.", + "b.vl.ex.", + "b.voorl.reg.", + "b.w.", + "b.w.gew.ex.", + "b.z.d.g.", + "b.z.v.", + "bab.", + "bedr.org.", + "begins.", + "beheersov.", + "bekendm.comm.", + "bel.", + "bel.besch.", + "bel.w.p.", + "beleidsov.", + "belg.", + "grondw.", + "ber.", + "ber.w.", + "besch.", + "besl.", + "beslagr.", + "bestuurswet.", + "bet.", + "betr.", + "betr.", + "vnw.", + "bevest.", + "bew.", + "bijbl.", + "ind.", + "eig.", + "bijbl.n.bijdr.", + "bijl.", + "bijv.", + "bijw.", + "bijz.decr.", + "bin.b.", + "bkh.", + "bl.", + "blz.", + "bm.", + "bn.", + "rh.", + "bnw.", + "bouwr.", + "br.parl.", + "bs.", + "bull.", + "bull.adm.pénit.", + "bull.ass.", + "bull.b.m.m.", + "bull.bel.", + "bull.best.strafinr.", + "bull.bmm.", + "bull.c.b.n.", + "bull.c.n.c.", + "bull.cbn.", + "bull.centr.arb.", + "bull.cnc.", + "bull.contr.", + "bull.doc.min.fin.", + "bull.f.e.b.", + "bull.feb.", + "bull.fisc.fin.r.", + "bull.i.u.m.", + "bull.inf.ass.secr.soc.", + "bull.inf.i.e.c.", + "bull.inf.i.n.a.m.i.", + "bull.inf.i.r.e.", + "bull.inf.iec.", + "bull.inf.inami.", + "bull.inf.ire.", + "bull.inst.arb.", + "bull.ium.", + "bull.jur.imm.", + "bull.lég.b.", + "bull.off.", + "bull.trim.b.dr.comp.", + "bull.us.", + "bull.v.b.o.", + "bull.vbo.", + "bv.", + "bw.", + "bxh.", + "byz.", + "c.", + "c.a.", + "c.a.-a.", + "c.a.b.g.", + "c.c.", + "c.c.i.", + "c.c.s.", + "c.conc.jur.", + "c.d.e.", + "c.d.p.k.", + "c.e.", + "c.ex.", + "c.f.", + "c.h.a.", + "c.i.f.", + "c.i.f.i.c.", + "c.j.", + "c.l.", + "c.n.", + "c.o.d.", + "c.p.", + "c.pr.civ.", + "c.q.", + "c.r.", + "c.r.a.", + "c.s.", + "c.s.a.", + "c.s.q.n.", + "c.v.", + "c.v.a.", + "c.v.o.", + "ca.", + "cadeaust.", + "cah.const.", + "cah.dr.europ.", + "cah.dr.immo.", + "cah.dr.jud.", + "cal.", + "2d.", + "cal.", + "3e.", + "cal.", + "rprt.", + "cap.", + "carg.", + "cass.", + "cass.", + "verw.", + "cert.", + "cf.", + "ch.", + "chron.", + "chron.d.s.", + "chron.dr.not.", + "cie.", + "cie.", + "verz.schr.", + "cir.", + "circ.", + "circ.z.", + "cit.", + "cit.loc.", + "civ.", + "cl.et.b.", + "cmt.", + "co.", + "cognoss.v.", + "coll.", + "v.", + "b.", + "colp.w.", + "com.", + "com.", + "cas.", + "com.v.min.", + "comm.", + "comm.", + "v.", + "comm.bijz.ov.", 
+ "comm.erf.", + "comm.fin.", + "comm.ger.", + "comm.handel.", + "comm.pers.", + "comm.pub.", + "comm.straf.", + "comm.v.", + "comm.venn.", + "comm.verz.", + "comm.voor.", + "comp.", + "compt.w.", + "computerr.", + "con.m.", + "concl.", + "concr.", + "conf.", + "confl.w.", + "confl.w.huwbetr.", + "cons.", + "conv.", + "coöp.", + "ver.", + "corr.", + "corr.bl.", + "cour.fisc.", + "cour.immo.", + "cridon.", + "crim.", + "cur.", + "cur.", + "crt.", + "curs.", + "d.", + "d.-g.", + "d.a.", + "d.a.v.", + "d.b.f.", + "d.c.", + "d.c.c.r.", + "d.d.", + "d.d.p.", + "d.e.t.", + "d.gem.r.", + "d.h.", + "d.h.z.", + "d.i.", + "d.i.t.", + "d.j.", + "d.l.r.", + "d.m.", + "d.m.v.", + "d.o.v.", + "d.parl.", + "d.w.z.", + "dact.", + "dat.", + "dbesch.", + "dbesl.", + "dec.", + "decr.", + "decr.d.", + "decr.fr.", + "decr.vl.", + "decr.w.", + "def.", + "dep.opv.", + "dep.rtl.", + "derg.", + "desp.", + "det.mag.", + "deurw.regl.", + "dez.", + "dgl.", + "dhr.", + "disp.", + "diss.", + "div.", + "div.act.", + "div.bel.", + "dl.", + "dln.", + "dnotz.", + "doc.", + "hist.", + "doc.jur.b.", + "doc.min.fin.", + "doc.parl.", + "doctr.", + "dpl.", + "dpl.besl.", + "dr.", + "dr.banc.fin.", + "dr.circ.", + "dr.inform.", + "dr.mr.", + "dr.pén.entr.", + "dr.q.m.", + "drs.", + "ds.", + "dtp.", + "dwz.", + "dyn.", + "e.", + "e.a.", + "e.b.", + "tek.mod.", + "e.c.", + "e.c.a.", + "e.d.", + "e.e.", + "e.e.a.", + "e.e.g.", + "e.g.", + "e.g.a.", + "e.h.a.", + "e.i.", + "e.j.", + "e.m.a.", + "e.n.a.c.", + "e.o.", + "e.p.c.", + "e.r.c.", + "e.r.f.", + "e.r.h.", + "e.r.o.", + "e.r.p.", + "e.r.v.", + "e.s.r.a.", + "e.s.t.", + "e.v.", + "e.v.a.", + "e.w.", + "e&o.e.", + "ec.pol.r.", + "econ.", + "ed.", + "ed(s).", + "eff.", + "eig.", + "eig.mag.", + "eil.", + "elektr.", + "enmb.", + "enz.", + "err.", + "etc.", + "etq.", + "eur.", + "parl.", + "eur.t.s.", + "ev.", + "evt.", + "ex.", + "ex.crim.", + "exec.", + "f.", + "f.a.o.", + "f.a.q.", + "f.a.s.", + "f.i.b.", + "f.j.f.", + "f.o.b.", + "f.o.r.", + "f.o.s.", + "f.o.t.", + "f.r.", + "f.supp.", + "f.suppl.", + "fa.", + "facs.", + "fasc.", + "fg.", + "fid.ber.", + "fig.", + "fin.verh.w.", + "fisc.", + "fisc.", + "tijdschr.", + "fisc.act.", + "fisc.koer.", + "fl.", + "form.", + "foro.", + "it.", + "fr.", + "fr.cult.r.", + "fr.gem.r.", + "fr.parl.", + "fra.", + "ft.", + "g.", + "g.a.", + "g.a.v.", + "g.a.w.v.", + "g.g.d.", + "g.m.t.", + "g.o.", + "g.omt.e.", + "g.p.", + "g.s.", + "g.v.", + "g.w.w.", + "geb.", + "gebr.", + "gebrs.", + "gec.", + "gec.decr.", + "ged.", + "ged.st.", + "gedipl.", + "gedr.st.", + "geh.", + "gem.", + "gem.", + "gem.", + "gem.gem.comm.", + "gem.st.", + "gem.stem.", + "gem.w.", + "gemeensch.optr.", + "gemeensch.standp.", + "gemeensch.strat.", + "gemeent.", + "gemeent.b.", + "gemeent.regl.", + "gemeent.verord.", + "geol.", + "geopp.", + "gepubl.", + "ger.deurw.", + "ger.w.", + "gerekw.", + "gereq.", + "gesch.", + "get.", + "getr.", + "gev.m.", + "gev.maatr.", + "gew.", + "ghert.", + "gir.eff.verk.", + "gk.", + "gr.", + "gramm.", + "grat.w.", + "grootb.w.", + "grs.", + "grvm.", + "grw.", + "gst.", + "gw.", + "h.a.", + "h.a.v.o.", + "h.b.o.", + "h.e.a.o.", + "h.e.g.a.", + "h.e.geb.", + "h.e.gestr.", + "h.l.", + "h.m.", + "h.o.", + "h.r.", + "h.t.l.", + "h.t.m.", + "h.w.geb.", + "hand.", + "handelsn.w.", + "handelspr.", + "handelsr.w.", + "handelsreg.w.", + "handv.", + "harv.l.rev.", + "hc.", + "herald.", + "hert.", + "herz.", + "hfdst.", + "hfst.", + "hgrw.", + "hhr.", + "hist.", + "hooggel.", + "hoogl.", + "hosp.", + "hpw.", + "hr.", + "hr.", + "ms.", + "hr.ms.", + 
"hregw.", + "hrg.", + "hst.", + "huis.just.", + "huisv.w.", + "huurbl.", + "hv.vn.", + "hw.", + "hyp.w.", + "i.b.s.", + "i.c.", + "i.c.m.h.", + "i.e.", + "i.f.", + "i.f.p.", + "i.g.v.", + "i.h.", + "i.h.a.", + "i.h.b.", + "i.l.pr.", + "i.o.", + "i.p.o.", + "i.p.r.", + "i.p.v.", + "i.pl.v.", + "i.r.d.i.", + "i.s.m.", + "i.t.t.", + "i.v.", + "i.v.m.", + "i.v.s.", + "i.w.tr.", + "i.z.", + "ib.", + "ibid.", + "icip-ing.cons.", + "iem.", + "inc.", + "indic.soc.", + "indiv.", + "inf.", + "inf.i.d.a.c.", + "inf.idac.", + "inf.r.i.z.i.v.", + "inf.riziv.", + "inf.soc.secr.", + "ing.", + "ing.", + "cons.", + "ing.cons.", + "inst.", + "int.", + "int.", + "rechtsh.", + "strafz.", + "interm.", + "intern.fisc.act.", + "intern.vervoerr.", + "inv.", + "inv.", + "f.", + "inv.w.", + "inv.wet.", + "invord.w.", + "inz.", + "ir.", + "irspr.", + "iwtr.", + "j.", + "j.-cl.", + "j.c.b.", + "j.c.e.", + "j.c.fl.", + "j.c.j.", + "j.c.p.", + "j.d.e.", + "j.d.f.", + "j.d.s.c.", + "j.dr.jeun.", + "j.j.d.", + "j.j.p.", + "j.j.pol.", + "j.l.", + "j.l.m.b.", + "j.l.o.", + "j.p.a.", + "j.r.s.", + "j.t.", + "j.t.d.e.", + "j.t.dr.eur.", + "j.t.o.", + "j.t.t.", + "jaarl.", + "jb.hand.", + "jb.kred.", + "jb.kred.c.s.", + "jb.l.r.b.", + "jb.lrb.", + "jb.markt.", + "jb.mens.", + "jb.t.r.d.", + "jb.trd.", + "jeugdrb.", + "jeugdwerkg.w.", + "jhr.", + "jg.", + "jis.", + "jl.", + "journ.jur.", + "journ.prat.dr.fisc.fin.", + "journ.proc.", + "jr.", + "jrg.", + "jur.", + "jur.comm.fl.", + "jur.dr.soc.b.l.n.", + "jur.f.p.e.", + "jur.fpe.", + "jur.niv.", + "jur.trav.brux.", + "jurambt.", + "jv.cass.", + "jv.h.r.j.", + "jv.hrj.", + "jw.", + "k.", + "k.", + "k.b.", + "k.g.", + "k.k.", + "k.m.b.o.", + "k.o.o.", + "k.v.k.", + "k.v.v.v.", + "kadasterw.", + "kaderb.", + "kador.", + "kbo-nr.", + "kg.", + "kh.", + "kiesw.", + "kind.bes.v.", + "kkr.", + "kon.", + "koopv.", + "kr.", + "krankz.w.", + "ksbel.", + "kt.", + "ktg.", + "ktr.", + "kvdm.", + "kw.r.", + "kymr.", + "kzr.", + "kzw.", + "l.", + "l.b.", + "l.b.o.", + "l.bas.", + "l.c.", + "l.gew.", + "l.j.", + "l.k.", + "l.l.", + "l.o.", + "l.p.", + "l.r.b.", + "l.u.v.i.", + "l.v.r.", + "l.v.w.", + "l.w.", + "l'exp.-compt.b..", + "l’exp.-compt.b.", + "landinr.w.", + "landscrt.", + "lat.", + "law.ed.", + "lett.", + "levensverz.", + "lgrs.", + "lidw.", + "limb.rechtsl.", + "lit.", + "litt.", + "liw.", + "liwet.", + "lk.", + "ll.", + "ll.(l.)l.r.", + "loonw.", + "losbl.", + "ltd.", + "luchtv.", + "luchtv.w.", + "m.", + "m.", + "not.", + "m.a.v.o.", + "m.a.w.", + "m.b.", + "m.b.o.", + "m.b.r.", + "m.b.t.", + "m.d.g.o.", + "m.e.a.o.", + "m.e.r.", + "m.h.", + "m.h.d.", + "m.i.v.", + "m.j.t.", + "m.k.", + "m.m.", + "m.m.a.", + "m.m.h.h.", + "m.m.v.", + "m.n.", + "m.not.fisc.", + "m.nt.", + "m.o.", + "m.r.", + "m.s.a.", + "m.u.p.", + "m.v.a.", + "m.v.h.n.", + "m.v.t.", + "m.z.", + "maatr.teboekgest.luchtv.", + "maced.", + "mand.", + "max.", + "mbl.not.", + "me.", + "med.", + "med.", + "v.b.o.", + "med.b.u.f.r.", + "med.bufr.", + "med.vbo.", + "meerv.", + "meetbr.w.", + "mej.", + "mevr.", + "mém.adm.", + "mgr.", + "mgrs.", + "mhd.", + "mi.verantw.", + "mil.", + "mil.bed.", + "mil.ger.", + "min.", + "min.", + "aanbev.", + "min.", + "circ.", + "min.", + "fin.", + "min.j.omz.", + "min.just.circ.", + "mitt.", + "mln.", + "mnd.", + "mod.", + "mon.", + "mouv.comm.", + "mr.", + "ms.", + "muz.", + "mv.", + "n.", + "chr.", + "n.a.", + "n.a.g.", + "n.a.v.", + "n.b.", + "n.c.", + "n.chr.", + "n.d.", + "n.d.r.", + "n.e.a.", + "n.g.", + "n.h.b.c.", + "n.j.", + "n.j.b.", + "n.j.w.", + "n.l.", + "n.m.", + "n.m.m.", 
+ "n.n.", + "n.n.b.", + "n.n.g.", + "n.n.k.", + "n.o.m.", + "n.o.t.k.", + "n.rapp.", + "n.tijd.pol.", + "n.v.", + "n.v.d.r.", + "n.v.d.v.", + "n.v.o.b.", + "n.v.t.", + "nat.besch.w.", + "nat.omb.", + "nat.pers.", + "ned.", + "ned.cult.r.", + "neg.verkl.", + "nhd.", + "wisk.", + "njcm-bull.", + "nl.", + "nnd.", + "no.", + "not.fisc.m.", + "not.w.", + "not.wet.", + "nr.", + "nrs.", + "nste.", + "nt.", + "numism.", + "o.", + "o.a.", + "o.b.", + "o.c.", + "o.g.", + "o.g.v.", + "o.i.", + "o.i.d.", + "o.m.", + "o.o.", + "o.o.d.", + "o.o.v.", + "o.p.", + "o.r.", + "o.regl.", + "o.s.", + "o.t.s.", + "o.t.t.", + "o.t.t.t.", + "o.t.t.z.", + "o.tk.t.", + "o.v.t.", + "o.v.t.t.", + "o.v.tk.t.", + "o.v.v.", + "ob.", + "obsv.", + "octr.", + "octr.gem.regl.", + "octr.regl.", + "oe.", + "off.pol.", + "ofra.", + "ohd.", + "omb.", + "omnil.", + "omz.", + "on.ww.", + "onderr.", + "onfrank.", + "onteig.w.", + "ontw.", + "b.w.", + "onuitg.", + "onz.", + "oorl.w.", + "op.cit.", + "opin.pa.", + "opm.", + "or.", + "ord.br.", + "ord.gem.", + "ors.", + "orth.", + "os.", + "osm.", + "ov.", + "ov.w.i.", + "ov.w.ii.", + "ov.ww.", + "overg.w.", + "overw.", + "ovkst.", + "oz.", + "p.", + "p.a.", + "p.a.o.", + "p.b.o.", + "p.e.", + "p.g.", + "p.j.", + "p.m.", + "p.m.a.", + "p.o.", + "p.o.j.t.", + "p.p.", + "p.v.", + "p.v.s.", + "pachtw.", + "pag.", + "pan.", + "pand.b.", + "pand.pér.", + "parl.gesch.", + "parl.gesch.", + "inv.", + "parl.st.", + "part.arb.", + "pas.", + "pasin.", + "pat.", + "pb.c.", + "pb.l.", + "pct.", + "pens.", + "pensioenverz.", + "per.ber.i.b.r.", + "per.ber.ibr.", + "pers.", + "st.", + "pft.", + "pk.", + "pktg.", + "plv.", + "po.", + "pol.", + "pol.off.", + "pol.r.", + "pol.w.", + "postbankw.", + "postw.", + "pp.", + "pr.", + "preadv.", + "pres.", + "prf.", + "prft.", + "prg.", + "prijz.w.", + "proc.", + "procesregl.", + "prof.", + "prot.", + "prov.", + "prov.b.", + "prov.instr.h.m.g.", + "prov.regl.", + "prov.verord.", + "prov.w.", + "publ.", + "pun.", + "pw.", + "q.b.d.", + "q.e.d.", + "q.q.", + "q.r.", + "r.", + "r.a.b.g.", + "r.a.c.e.", + "r.a.j.b.", + "r.b.d.c.", + "r.b.d.i.", + "r.b.s.s.", + "r.c.", + "r.c.b.", + "r.c.d.c.", + "r.c.j.b.", + "r.c.s.j.", + "r.cass.", + "r.d.c.", + "r.d.i.", + "r.d.i.d.c.", + "r.d.j.b.", + "r.d.j.p.", + "r.d.p.c.", + "r.d.s.", + "r.d.t.i.", + "r.e.", + "r.f.s.v.p.", + "r.g.a.r.", + "r.g.c.f.", + "r.g.d.c.", + "r.g.f.", + "r.g.z.", + "r.h.a.", + "r.i.c.", + "r.i.d.a.", + "r.i.e.j.", + "r.i.n.", + "r.i.s.a.", + "r.j.d.a.", + "r.j.i.", + "r.k.", + "r.l.", + "r.l.g.b.", + "r.med.", + "r.med.rechtspr.", + "r.n.b.", + "r.o.", + "r.ov.", + "r.p.", + "r.p.d.b.", + "r.p.o.t.", + "r.p.r.j.", + "r.p.s.", + "r.r.d.", + "r.r.s.", + "r.s.", + "r.s.v.p.", + "r.stvb.", + "r.t.d.f.", + "r.t.d.h.", + "r.t.l.", + "r.trim.dr.eur.", + "r.v.a.", + "r.verkb.", + "r.w.", + "r.w.d.", + "rap.ann.c.a.", + "rap.ann.c.c.", + "rap.ann.c.e.", + "rap.ann.c.s.j.", + "rap.ann.ca.", + "rap.ann.cass.", + "rap.ann.cc.", + "rap.ann.ce.", + "rap.ann.csj.", + "rapp.", + "rb.", + "rb.kh.", + "rdn.", + "rdnr.", + "re.pers.", + "rec.", + "rec.c.i.j.", + "rec.c.j.c.e.", + "rec.cij.", + "rec.cjce.", + "rec.gén.enr.not.", + "rechtsk.t.", + "rechtspl.zeem.", + "rechtspr.arb.br.", + "rechtspr.b.f.e.", + "rechtspr.bfe.", + "rechtspr.soc.r.b.l.n.", + "recl.reg.", + "rect.", + "red.", + "reg.", + "reg.huiz.bew.", + "reg.w.", + "registr.w.", + "regl.", + "regl.", + "r.v.k.", + "regl.besl.", + "regl.onderr.", + "regl.r.t.", + "rep.", + "rép.fisc.", + "rép.not.", + "rep.r.j.", + "rep.rj.", + "req.", + "res.", + 
"resp.", + "rev.", + "rev.", + "comp.", + "rev.", + "trim.", + "civ.", + "rev.", + "trim.", + "comm.", + "rev.acc.trav.", + "rev.adm.", + "rev.b.compt.", + "rev.b.dr.const.", + "rev.b.dr.intern.", + "rev.b.séc.soc.", + "rev.banc.fin.", + "rev.comm.", + "rev.cons.prud.", + "rev.dr.b.", + "rev.dr.commun.", + "rev.dr.étr.", + "rev.dr.fam.", + "rev.dr.intern.comp.", + "rev.dr.mil.", + "rev.dr.min.", + "rev.dr.pén.", + "rev.dr.pén.mil.", + "rev.dr.rur.", + "rev.dr.u.l.b.", + "rev.dr.ulb.", + "rev.exp.", + "rev.faill.", + "rev.fisc.", + "rev.gd.", + "rev.hist.dr.", + "rev.i.p.c.", + "rev.ipc.", + "rev.not.b.", + "rev.prat.dr.comm.", + "rev.prat.not.b.", + "rev.prat.soc.", + "rev.rec.", + "rev.rw.", + "rev.trav.", + "rev.trim.d.h.", + "rev.trim.dr.fam.", + "rev.urb.", + "richtl.", + "riv.dir.int.", + "riv.dir.int.priv.proc.", + "rk.", + "rln.", + "roln.", + "rom.", + "rondz.", + "rov.", + "rtl.", + "rubr.", + "ruilv.wet.", + "rv.verdr.", + "rvkb.", + "s.", + "s.", + "s.a.", + "s.b.n.", + "s.ct.", + "s.d.", + "s.e.c.", + "s.e.et.o.", + "s.e.w.", + "s.exec.rept.", + "s.hrg.", + "s.j.b.", + "s.l.", + "s.l.e.a.", + "s.l.n.d.", + "s.p.a.", + "s.s.", + "s.t.", + "s.t.b.", + "s.v.", + "s.v.p.", + "samenw.", + "sc.", + "sch.", + "scheidsr.uitspr.", + "schepel.besl.", + "sec.", + "secr.comm.", + "secr.gen.", + "sect.soc.", + "sess.", + "cas.", + "sir.", + "soc.", + "best.", + "soc.", + "handv.", + "soc.", + "verz.", + "soc.act.", + "soc.best.", + "soc.kron.", + "soc.r.", + "soc.sw.", + "soc.weg.", + "sofi-nr.", + "somm.", + "somm.ann.", + "sp.c.c.", + "sr.", + "ss.", + "st.doc.b.c.n.a.r.", + "st.doc.bcnar.", + "st.vw.", + "stagever.", + "stas.", + "stat.", + "stb.", + "stbl.", + "stcrt.", + "stud.dipl.", + "su.", + "subs.", + "subst.", + "succ.w.", + "suppl.", + "sv.", + "sw.", + "t.", + "t.a.", + "t.a.a.", + "t.a.n.", + "t.a.p.", + "t.a.s.n.", + "t.a.v.", + "t.a.v.w.", + "t.aann.", + "t.acc.", + "t.agr.r.", + "t.app.", + "t.b.b.r.", + "t.b.h.", + "t.b.m.", + "t.b.o.", + "t.b.p.", + "t.b.r.", + "t.b.s.", + "t.b.v.", + "t.bankw.", + "t.belg.not.", + "t.desk.", + "t.e.m.", + "t.e.p.", + "t.f.r.", + "t.fam.", + "t.fin.r.", + "t.g.r.", + "t.g.t.", + "t.g.v.", + "t.gem.", + "t.gez.", + "t.huur.", + "t.i.n.", + "t.j.k.", + "t.l.l.", + "t.l.v.", + "t.m.", + "t.m.r.", + "t.m.w.", + "t.mil.r.", + "t.mil.strafr.", + "t.not.", + "t.o.", + "t.o.r.b.", + "t.o.v.", + "t.ontv.", + "t.p.r.", + "t.pol.", + "t.r.", + "t.r.g.", + "t.r.o.s.", + "t.r.v.", + "t.s.r.", + "t.strafr.", + "t.t.", + "t.u.", + "t.v.c.", + "t.v.g.", + "t.v.m.r.", + "t.v.o.", + "t.v.v.", + "t.v.v.d.b.", + "t.v.w.", + "t.verz.", + "t.vred.", + "t.vreemd.", + "t.w.", + "t.w.k.", + "t.w.v.", + "t.w.v.r.", + "t.wrr.", + "t.z.", + "t.z.t.", + "t.z.v.", + "taalk.", + "tar.burg.z.", + "td.", + "techn.", + "telecomm.", + "th.", + "toel.", + "toel.st.v.w.", + "toep.", + "toep.regl.", + "tom.", + "top.", + "trans.b.", + "transp.r.", + "trb.", + "trib.", + "trib.civ.", + "trib.gr.inst.", + "ts.", + "ts.", + "best.", + "ts.", + "verv.", + "turnh.rechtsl.", + "tvpol.", + "tvpr.", + "tvrechtsgesch.", + "tw.", + "u.", + "u.a.", + "u.a.r.", + "u.a.v.", + "u.c.", + "u.c.c.", + "u.g.", + "u.p.", + "u.s.", + "u.s.d.c.", + "uitdr.", + "uitl.w.", + "uitv.besch.div.b.", + "uitv.besl.", + "uitv.besl.", + "succ.w.", + "uitv.besl.bel.rv.", + "uitv.besl.l.b.", + "uitv.reg.", + "inv.w.", + "uitv.reg.bel.d.", + "uitv.reg.afd.verm.", + "uitv.reg.lb.", + "uitv.reg.succ.w.", + "univ.", + "univ.verkl.", + "v.", + "v.", + "chr.", + "v.a.", + "v.a.v.", + "v.c.", + "v.C.", + "v.Chr.", 
+ "v.chr.", + "v.d.", + "v.h.", + "v.huw.verm.", + "v.i.", + "v.i.o.", + "v.j.", + "v.k.a.", + "v.m.", + "v.o.f.", + "v.o.n.", + "v.onderh.verpl.", + "v.p.", + "v.r.", + "v.s.o.", + "v.t.t.", + "v.t.t.t.", + "v.tk.t.", + "v.toep.r.vert.", + "v.v.b.", + "v.v.g.", + "v.v.t.", + "v.v.t.t.", + "v.v.tk.t.", + "v.w.b.", + "v.z.m.", + "vb.", + "vb.bo.", + "vbb.", + "vc.", + "vd.", + "veldw.", + "ver.k.", + "ver.verg.gem.", + "gem.comm.", + "verbr.", + "verd.", + "verdr.", + "verdr.v.", + "tek.mod.", + "verenw.", + "verg.", + "verg.fr.gem.", + "comm.", + "verkl.", + "verkl.herz.gw.", + "verl.", + "deelw.", + "vern.", + "verord.", + "vers.r.", + "versch.", + "versl.c.s.w.", + "versl.csw.", + "vert.", + "verw.", + "verz.", + "verz.w.", + "verz.wett.besl.", + "verz.wett.decr.besl.", + "vgl.", + "vid.", + "viss.w.", + "vl.parl.", + "vl.r.", + "vl.t.gez.", + "vl.w.reg.", + "vl.w.succ.", + "vlg.", + "vn.", + "vnl.", + "vnw.", + "vo.", + "vo.bl.", + "voegw.", + "vol.", + "volg.", + "volt.", + "deelw.", + "voorl.", + "voorz.", + "vord.w.", + "vorst.d.", + "vr.", + "vred.", + "vrg.", + "vnw.", + "vrijgrs.", + "vs.", + "vt.", + "vw.", + "vz.", + "vzngr.", + "vzr.", + "w.", + "w.a.", + "w.b.r.", + "w.c.h.", + "w.conf.huw.", + "w.conf.huwelijksb.", + "w.consum.kr.", + "w.f.r.", + "w.g.", + "w.gew.r.", + "w.ident.pl.", + "w.just.doc.", + "w.kh.", + "w.l.r.", + "w.l.v.", + "w.mil.straf.spr.", + "w.n.", + "w.not.ambt.", + "w.o.", + "w.o.d.huurcomm.", + "w.o.d.k.", + "w.openb.manif.", + "w.parl.", + "w.r.", + "w.reg.", + "w.succ.", + "w.u.b.", + "w.uitv.pl.verord.", + "w.v.", + "w.v.k.", + "w.v.m.s.", + "w.v.r.", + "w.v.w.", + "w.venn.", + "wac.", + "wd.", + "wetb.", + "n.v.h.", + "wgb.", + "winkelt.w.", + "wisk.", + "wka-verkl.", + "wnd.", + "won.w.", + "woningw.", + "woonr.w.", + "wrr.", + "wrr.ber.", + "wrsch.", + "ws.", + "wsch.", + "wsr.", + "wtvb.", + "ww.", + "x.d.", + "z.a.", + "z.g.", + "z.i.", + "z.j.", + "z.o.z.", + "z.p.", + "z.s.m.", + "zg.", + "zgn.", + "zn.", + "znw.", + "zr.", + "zr.", + "ms.", + "zr.ms.", + "'m", + "'n", + "'ns", + "'s", + "'t", +] _exc = {} for orth in abbrevs: diff --git a/spacy/lang/pl/__init__.py b/spacy/lang/pl/__init__.py index a03ead1ff..660931ffd 100644 --- a/spacy/lang/pl/__init__.py +++ b/spacy/lang/pl/__init__.py @@ -1,14 +1,16 @@ -from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .punctuation import TOKENIZER_INFIXES +from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES +from .punctuation import TOKENIZER_SUFFIXES from .tag_map import TAG_MAP from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS +from .lemmatizer import PolishLemmatizer from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..norm_exceptions import BASE_NORMS from ...language import Language from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups +from ...util import add_lookups +from ...lookups import Lookups class PolishDefaults(Language.Defaults): @@ -18,10 +20,21 @@ class PolishDefaults(Language.Defaults): lex_attr_getters[NORM] = add_lookups( Language.Defaults.lex_attr_getters[NORM], BASE_NORMS ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) + mod_base_exceptions = { + exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".") + } + tokenizer_exceptions = mod_base_exceptions stop_words = STOP_WORDS tag_map = TAG_MAP + prefixes = TOKENIZER_PREFIXES infixes = TOKENIZER_INFIXES + suffixes = TOKENIZER_SUFFIXES + + @classmethod + def create_lemmatizer(cls, nlp=None, lookups=None): + if lookups is None: 
+ lookups = Lookups() + return PolishLemmatizer(lookups) class Polish(Language): diff --git a/spacy/lang/pl/_tokenizer_exceptions_list.py b/spacy/lang/pl/_tokenizer_exceptions_list.py deleted file mode 100644 index 965318442..000000000 --- a/spacy/lang/pl/_tokenizer_exceptions_list.py +++ /dev/null @@ -1,1439 +0,0 @@ -# The following list consists of: -# - exceptions generated from polish_srx_rules [1] -# (https://github.com/milekpl/polish_srx_rules) -# - abbreviations parsed from Wikipedia -# - some manually added exceptions -# -# [1] M. Miłkowski and J. Lipski, -# "Using SRX Standard for Sentence Segmentation," in LTC 2009, -# Lecture Notes in Artificial Intelligence 6562, -# Z. Vetulani, Ed. Berlin Heidelberg: Springer-Verlag, 2011, pp. 172–182. -PL_BASE_EXCEPTIONS = [ - "0.", - "1.", - "10.", - "2.", - "3.", - "4.", - "5.", - "6.", - "7.", - "8.", - "9.", - "A.A.", - "A.B.", - "A.C.", - "A.D.", - "A.E.", - "A.F.", - "A.G.", - "A.H.", - "A.I.", - "A.J.", - "A.K.", - "A.L.", - "A.M.", - "A.N.", - "A.O.", - "A.P.", - "A.R.", - "A.S.", - "A.T.", - "A.U.", - "A.W.", - "A.Y.", - "A.Z.", - "A.Ó.", - "A.Ą.", - "A.Ć.", - "A.Ę.", - "A.Ł.", - "A.Ń.", - "A.Ś.", - "A.Ź.", - "A.Ż.", - "Ad.", - "Adw.", - "Al.", - "Art.", - "B.A.", - "B.B.", - "B.C.", - "B.D.", - "B.E.", - "B.F.", - "B.G.", - "B.H.", - "B.I.", - "B.J.", - "B.K.", - "B.L.", - "B.M.", - "B.N.", - "B.O.", - "B.P.", - "B.R.", - "B.S.", - "B.T.", - "B.U.", - "B.W.", - "B.Y.", - "B.Z.", - "B.Ó.", - "B.Ą.", - "B.Ć.", - "B.Ę.", - "B.Ł.", - "B.Ń.", - "B.Ś.", - "B.Ź.", - "B.Ż.", - "D.A.", - "D.B.", - "D.C.", - "D.D.", - "D.E.", - "D.F.", - "D.G.", - "D.H.", - "D.I.", - "D.J.", - "D.K.", - "D.L.", - "D.M.", - "D.N.", - "D.O.", - "D.P.", - "D.R.", - "D.S.", - "D.T.", - "D.U.", - "D.W.", - "D.Y.", - "D.Z.", - "D.Ó.", - "D.Ą.", - "D.Ć.", - "D.Ę.", - "D.Ł.", - "D.Ń.", - "D.Ś.", - "D.Ź.", - "D.Ż.", - "Dh.", - "Doc.", - "Dr.", - "Dyr.", - "Dyw.", - "Dz.U.", - "E.A.", - "E.B.", - "E.C.", - "E.D.", - "E.E.", - "E.F.", - "E.G.", - "E.H.", - "E.I.", - "E.J.", - "E.K.", - "E.L.", - "E.M.", - "E.N.", - "E.O.", - "E.P.", - "E.R.", - "E.S.", - "E.T.", - "E.U.", - "E.W.", - "E.Y.", - "E.Z.", - "E.Ó.", - "E.Ą.", - "E.Ć.", - "E.Ę.", - "E.Ł.", - "E.Ń.", - "E.Ś.", - "E.Ź.", - "E.Ż.", - "F.A.", - "F.B.", - "F.C.", - "F.D.", - "F.E.", - "F.F.", - "F.G.", - "F.H.", - "F.I.", - "F.J.", - "F.K.", - "F.L.", - "F.M.", - "F.N.", - "F.O.", - "F.P.", - "F.R.", - "F.S.", - "F.T.", - "F.U.", - "F.W.", - "F.Y.", - "F.Z.", - "F.Ó.", - "F.Ą.", - "F.Ć.", - "F.Ę.", - "F.Ł.", - "F.Ń.", - "F.Ś.", - "F.Ź.", - "F.Ż.", - "G.A.", - "G.B.", - "G.C.", - "G.D.", - "G.E.", - "G.F.", - "G.G.", - "G.H.", - "G.I.", - "G.J.", - "G.K.", - "G.L.", - "G.M.", - "G.N.", - "G.O.", - "G.P.", - "G.R.", - "G.S.", - "G.T.", - "G.U.", - "G.W.", - "G.Y.", - "G.Z.", - "G.Ó.", - "G.Ą.", - "G.Ć.", - "G.Ę.", - "G.Ł.", - "G.Ń.", - "G.Ś.", - "G.Ź.", - "G.Ż.", - "H.A.", - "H.B.", - "H.C.", - "H.D.", - "H.E.", - "H.F.", - "H.G.", - "H.H.", - "H.I.", - "H.J.", - "H.K.", - "H.L.", - "H.M.", - "H.N.", - "H.O.", - "H.P.", - "H.R.", - "H.S.", - "H.T.", - "H.U.", - "H.W.", - "H.Y.", - "H.Z.", - "H.Ó.", - "H.Ą.", - "H.Ć.", - "H.Ę.", - "H.Ł.", - "H.Ń.", - "H.Ś.", - "H.Ź.", - "H.Ż.", - "Hr.", - "I.A.", - "I.B.", - "I.C.", - "I.D.", - "I.E.", - "I.F.", - "I.G.", - "I.H.", - "I.I.", - "I.J.", - "I.K.", - "I.L.", - "I.M.", - "I.N.", - "I.O.", - "I.P.", - "I.R.", - "I.S.", - "I.T.", - "I.U.", - "I.W.", - "I.Y.", - "I.Z.", - "I.Ó.", - "I.Ą.", - "I.Ć.", - "I.Ę.", - "I.Ł.", - "I.Ń.", - "I.Ś.", - "I.Ź.", - "I.Ż.", - "Inż.", - 
"J.A.", - "J.B.", - "J.C.", - "J.D.", - "J.E.", - "J.F.", - "J.G.", - "J.H.", - "J.I.", - "J.J.", - "J.K.", - "J.L.", - "J.M.", - "J.N.", - "J.O.", - "J.P.", - "J.R.", - "J.S.", - "J.T.", - "J.U.", - "J.W.", - "J.Y.", - "J.Z.", - "J.Ó.", - "J.Ą.", - "J.Ć.", - "J.Ę.", - "J.Ł.", - "J.Ń.", - "J.Ś.", - "J.Ź.", - "J.Ż.", - "K.A.", - "K.B.", - "K.C.", - "K.D.", - "K.E.", - "K.F.", - "K.G.", - "K.H.", - "K.I.", - "K.J.", - "K.K.", - "K.L.", - "K.M.", - "K.N.", - "K.O.", - "K.P.", - "K.R.", - "K.S.", - "K.T.", - "K.U.", - "K.W.", - "K.Y.", - "K.Z.", - "K.Ó.", - "K.Ą.", - "K.Ć.", - "K.Ę.", - "K.Ł.", - "K.Ń.", - "K.Ś.", - "K.Ź.", - "K.Ż.", - "Ks.", - "L.A.", - "L.B.", - "L.C.", - "L.D.", - "L.E.", - "L.F.", - "L.G.", - "L.H.", - "L.I.", - "L.J.", - "L.K.", - "L.L.", - "L.M.", - "L.N.", - "L.O.", - "L.P.", - "L.R.", - "L.S.", - "L.T.", - "L.U.", - "L.W.", - "L.Y.", - "L.Z.", - "L.Ó.", - "L.Ą.", - "L.Ć.", - "L.Ę.", - "L.Ł.", - "L.Ń.", - "L.Ś.", - "L.Ź.", - "L.Ż.", - "Lek.", - "M.A.", - "M.B.", - "M.C.", - "M.D.", - "M.E.", - "M.F.", - "M.G.", - "M.H.", - "M.I.", - "M.J.", - "M.K.", - "M.L.", - "M.M.", - "M.N.", - "M.O.", - "M.P.", - "M.R.", - "M.S.", - "M.T.", - "M.U.", - "M.W.", - "M.Y.", - "M.Z.", - "M.Ó.", - "M.Ą.", - "M.Ć.", - "M.Ę.", - "M.Ł.", - "M.Ń.", - "M.Ś.", - "M.Ź.", - "M.Ż.", - "Mat.", - "Mec.", - "Mojż.", - "N.A.", - "N.B.", - "N.C.", - "N.D.", - "N.E.", - "N.F.", - "N.G.", - "N.H.", - "N.I.", - "N.J.", - "N.K.", - "N.L.", - "N.M.", - "N.N.", - "N.O.", - "N.P.", - "N.R.", - "N.S.", - "N.T.", - "N.U.", - "N.W.", - "N.Y.", - "N.Z.", - "N.Ó.", - "N.Ą.", - "N.Ć.", - "N.Ę.", - "N.Ł.", - "N.Ń.", - "N.Ś.", - "N.Ź.", - "N.Ż.", - "Na os.", - "Nadkom.", - "Najśw.", - "Nb.", - "Np.", - "O.A.", - "O.B.", - "O.C.", - "O.D.", - "O.E.", - "O.F.", - "O.G.", - "O.H.", - "O.I.", - "O.J.", - "O.K.", - "O.L.", - "O.M.", - "O.N.", - "O.O.", - "O.P.", - "O.R.", - "O.S.", - "O.T.", - "O.U.", - "O.W.", - "O.Y.", - "O.Z.", - "O.Ó.", - "O.Ą.", - "O.Ć.", - "O.Ę.", - "O.Ł.", - "O.Ń.", - "O.Ś.", - "O.Ź.", - "O.Ż.", - "OO.", - "Oo.", - "P.A.", - "P.B.", - "P.C.", - "P.D.", - "P.E.", - "P.F.", - "P.G.", - "P.H.", - "P.I.", - "P.J.", - "P.K.", - "P.L.", - "P.M.", - "P.N.", - "P.O.", - "P.P.", - "P.R.", - "P.S.", - "P.T.", - "P.U.", - "P.W.", - "P.Y.", - "P.Z.", - "P.Ó.", - "P.Ą.", - "P.Ć.", - "P.Ę.", - "P.Ł.", - "P.Ń.", - "P.Ś.", - "P.Ź.", - "P.Ż.", - "Podkom.", - "Przyp.", - "Ps.", - "Pt.", - "Płk.", - "R.A.", - "R.B.", - "R.C.", - "R.D.", - "R.E.", - "R.F.", - "R.G.", - "R.H.", - "R.I.", - "R.J.", - "R.K.", - "R.L.", - "R.M.", - "R.N.", - "R.O.", - "R.P.", - "R.R.", - "R.S.", - "R.T.", - "R.U.", - "R.W.", - "R.Y.", - "R.Z.", - "R.Ó.", - "R.Ą.", - "R.Ć.", - "R.Ę.", - "R.Ł.", - "R.Ń.", - "R.Ś.", - "R.Ź.", - "R.Ż.", - "Red.", - "Reż.", - "Ryc.", - "Rys.", - "S.A.", - "S.B.", - "S.C.", - "S.D.", - "S.E.", - "S.F.", - "S.G.", - "S.H.", - "S.I.", - "S.J.", - "S.K.", - "S.L.", - "S.M.", - "S.N.", - "S.O.", - "S.P.", - "S.R.", - "S.S.", - "S.T.", - "S.U.", - "S.W.", - "S.Y.", - "S.Z.", - "S.Ó.", - "S.Ą.", - "S.Ć.", - "S.Ę.", - "S.Ł.", - "S.Ń.", - "S.Ś.", - "S.Ź.", - "S.Ż.", - "Sp.", - "Spółdz.", - "Stow.", - "Stoł.", - "Sz.P.", - "Szer.", - "T.A.", - "T.B.", - "T.C.", - "T.D.", - "T.E.", - "T.F.", - "T.G.", - "T.H.", - "T.I.", - "T.J.", - "T.K.", - "T.L.", - "T.M.", - "T.N.", - "T.O.", - "T.P.", - "T.R.", - "T.S.", - "T.T.", - "T.U.", - "T.W.", - "T.Y.", - "T.Z.", - "T.Ó.", - "T.Ą.", - "T.Ć.", - "T.Ę.", - "T.Ł.", - "T.Ń.", - "T.Ś.", - "T.Ź.", - "T.Ż.", - "Tow.", - "Tzw.", - "U.A.", - "U.B.", - "U.C.", - "U.D.", - "U.E.", - 
"U.F.", - "U.G.", - "U.H.", - "U.I.", - "U.J.", - "U.K.", - "U.L.", - "U.M.", - "U.N.", - "U.O.", - "U.P.", - "U.R.", - "U.S.", - "U.T.", - "U.U.", - "U.W.", - "U.Y.", - "U.Z.", - "U.Ó.", - "U.Ą.", - "U.Ć.", - "U.Ę.", - "U.Ł.", - "U.Ń.", - "U.Ś.", - "U.Ź.", - "U.Ż.", - "W.A.", - "W.B.", - "W.C.", - "W.D.", - "W.E.", - "W.F.", - "W.G.", - "W.H.", - "W.I.", - "W.J.", - "W.K.", - "W.L.", - "W.M.", - "W.N.", - "W.O.", - "W.P.", - "W.R.", - "W.S.", - "W.T.", - "W.U.", - "W.W.", - "W.Y.", - "W.Z.", - "W.Ó.", - "W.Ą.", - "W.Ć.", - "W.Ę.", - "W.Ł.", - "W.Ń.", - "W.Ś.", - "W.Ź.", - "W.Ż.", - "Y.A.", - "Y.B.", - "Y.C.", - "Y.D.", - "Y.E.", - "Y.F.", - "Y.G.", - "Y.H.", - "Y.I.", - "Y.J.", - "Y.K.", - "Y.L.", - "Y.M.", - "Y.N.", - "Y.O.", - "Y.P.", - "Y.R.", - "Y.S.", - "Y.T.", - "Y.U.", - "Y.W.", - "Y.Y.", - "Y.Z.", - "Y.Ó.", - "Y.Ą.", - "Y.Ć.", - "Y.Ę.", - "Y.Ł.", - "Y.Ń.", - "Y.Ś.", - "Y.Ź.", - "Y.Ż.", - "Z.A.", - "Z.B.", - "Z.C.", - "Z.D.", - "Z.E.", - "Z.F.", - "Z.G.", - "Z.H.", - "Z.I.", - "Z.J.", - "Z.K.", - "Z.L.", - "Z.M.", - "Z.N.", - "Z.O.", - "Z.P.", - "Z.R.", - "Z.S.", - "Z.T.", - "Z.U.", - "Z.W.", - "Z.Y.", - "Z.Z.", - "Z.Ó.", - "Z.Ą.", - "Z.Ć.", - "Z.Ę.", - "Z.Ł.", - "Z.Ń.", - "Z.Ś.", - "Z.Ź.", - "Z.Ż.", - "Zob.", - "a.", - "ad.", - "adw.", - "afr.", - "ags.", - "akad.", - "al.", - "alb.", - "am.", - "amer.", - "ang.", - "aor.", - "ap.", - "apost.", - "arch.", - "arcyks.", - "art.", - "artyst.", - "asp.", - "astr.", - "aust.", - "austr.", - "austral.", - "b.", - "bałt.", - "bdb.", - "belg.", - "białorus.", - "białost.", - "bm.", - "bot.", - "bp.", - "br.", - "bryg.", - "bryt.", - "bułg.", - "bł.", - "c.b.d.o.", - "c.k.", - "c.o.", - "cbdu.", - "cd.", - "cdn.", - "centr.", - "ces.", - "chem.", - "chir.", - "chiń.", - "chor.", - "chorw.", - "cieśn.", - "cnd.", - "cyg.", - "cyt.", - "cyw.", - "cz.", - "czes.", - "czw.", - "czyt.", - "d.", - "daw.", - "dcn.", - "dekl.", - "demokr.", - "det.", - "dh.", - "diec.", - "dk.", - "dn.", - "doc.", - "doktor h.c.", - "dol.", - "dolnośl.", - "dost.", - "dosł.", - "dot.", - "dr h.c.", - "dr hab.", - "dr.", - "ds.", - "dst.", - "duszp.", - "dypl.", - "dyr.", - "dyw.", - "dł.", - "egz.", - "ekol.", - "ekon.", - "elektr.", - "em.", - "ent.", - "est.", - "europ.", - "ew.", - "fab.", - "farm.", - "fot.", - "fr.", - "franc.", - "g.", - "gastr.", - "gat.", - "gd.", - "gen.", - "geogr.", - "geol.", - "gimn.", - "gm.", - "godz.", - "gorz.", - "gosp.", - "gosp.-polit.", - "gr.", - "gram.", - "grub.", - "górn.", - "głęb.", - "h.c.", - "hab.", - "hist.", - "hiszp.", - "hitl.", - "hm.", - "hot.", - "hr.", - "i in.", - "i s.", - "id.", - "ie.", - "im.", - "in.", - "inż.", - "iron.", - "itd.", - "itp.", - "j.", - "j.a.", - "jez.", - "jn.", - "jw.", - "jwt.", - "k.", - "k.k.", - "k.o.", - "k.p.a.", - "k.p.c.", - "k.r.", - "k.r.o.", - "kard.", - "kark.", - "kasz.", - "kat.", - "katol.", - "kier.", - "kk.", - "kl.", - "kol.", - "kpc.", - "kpt.", - "kr.", - "krak.", - "kryt.", - "ks.", - "książk.", - "kuj.", - "kult.", - "kł.", - "l.", - "laic.", - "lek.", - "lit.", - "lp.", - "lub.", - "m.", - "m.b.", - "m.in.", - "m.p.", - "m.st.", - "mar.", - "maz.", - "małop.", - "mec.", - "med.", - "mgr.", - "min.", - "mn.", - "mn.w.", - "muz.", - "mł.", - "n.", - "n.e.", - "n.p.m.", - "n.p.u.", - "na os.", - "nadkom.", - "najśw.", - "nb.", - "niedz.", - "niem.", - "norw.", - "np.", - "nt.", - "nż.", - "o s.", - "o.", - "oO.", - "ob.", - "odc.", - "odp.", - "ok.", - "oo.", - "op.", - "os.", - "p.", - "p.a.", - "p.f.", - "p.f.v.", - "p.n.e.", - "p.o.", - "p.p.", - "p.p.m.", - 
"p.r.", - "p.r.v.", - "phm.", - "pie.", - "pl.", - "pn.", - "pocz.", - "pod.", - "podgat.", - "podkarp.", - "podkom.", - "poet.", - "poj.", - "pok.", - "pol.", - "pom.", - "pon.", - "poprz.", - "por.", - "port.", - "posp.", - "pow.", - "poz.", - "poł.", - "pp.", - "ppanc.", - "ppor.", - "ppoż.", - "prawdop.", - "proc.", - "prof.", - "prok.", - "przed Chr.", - "przyp.", - "ps.", - "pseud.", - "pt.", - "pw.", - "półn.", - "płd.", - "płk.", - "płn.", - "r.", - "r.ż.", - "red.", - "reż.", - "ros.", - "rozdz.", - "rtg.", - "rtm.", - "rub.", - "rum.", - "ryc.", - "rys.", - "rz.", - "s.", - "serb.", - "sierż.", - "skr.", - "sob.", - "sp.", - "społ.", - "spółdz.", - "spółgł.", - "st.", - "st.rus.", - "stow.", - "stoł.", - "str.", - "sud.", - "szczec.", - "szer.", - "szt.", - "szw.", - "szwajc.", - "słow.", - "t.", - "t.j.", - "tatrz.", - "tel.", - "tj.", - "tow.", - "trl.", - "tryb.", - "ts.", - "tur.", - "tys.", - "tzn.", - "tzw.", - "tłum.", - "u s.", - "ub.", - "ukr.", - "ul.", - "up.", - "ur.", - "v.v.", - "vs.", - "w.", - "warm.", - "wlk.", - "wlkp.", - "woj.", - "wroc.", - "ws.", - "wsch.", - "wt.", - "ww.", - "wyb.", - "wyd.", - "wyj.", - "wym.", - "wyst.", - "wył.", - "wyż.", - "wzgl.", - "wędr.", - "węg.", - "wł.", - "x.", - "xx.", - "zach.", - "zagr.", - "zak.", - "zakł.", - "zal.", - "zam.", - "zast.", - "zaw.", - "zazw.", - "zał.", - "zdr.", - "zew.", - "zewn.", - "ziel.", - "zm.", - "zn.", - "zob.", - "zool.", - "zw.", - "ząbk.", - "Ó.A.", - "Ó.B.", - "Ó.C.", - "Ó.D.", - "Ó.E.", - "Ó.F.", - "Ó.G.", - "Ó.H.", - "Ó.I.", - "Ó.J.", - "Ó.K.", - "Ó.L.", - "Ó.M.", - "Ó.N.", - "Ó.O.", - "Ó.P.", - "Ó.R.", - "Ó.S.", - "Ó.T.", - "Ó.U.", - "Ó.W.", - "Ó.Y.", - "Ó.Z.", - "Ó.Ó.", - "Ó.Ą.", - "Ó.Ć.", - "Ó.Ę.", - "Ó.Ł.", - "Ó.Ń.", - "Ó.Ś.", - "Ó.Ź.", - "Ó.Ż.", - "Ą.A.", - "Ą.B.", - "Ą.C.", - "Ą.D.", - "Ą.E.", - "Ą.F.", - "Ą.G.", - "Ą.H.", - "Ą.I.", - "Ą.J.", - "Ą.K.", - "Ą.L.", - "Ą.M.", - "Ą.N.", - "Ą.O.", - "Ą.P.", - "Ą.R.", - "Ą.S.", - "Ą.T.", - "Ą.U.", - "Ą.W.", - "Ą.Y.", - "Ą.Z.", - "Ą.Ó.", - "Ą.Ą.", - "Ą.Ć.", - "Ą.Ę.", - "Ą.Ł.", - "Ą.Ń.", - "Ą.Ś.", - "Ą.Ź.", - "Ą.Ż.", - "Ć.A.", - "Ć.B.", - "Ć.C.", - "Ć.D.", - "Ć.E.", - "Ć.F.", - "Ć.G.", - "Ć.H.", - "Ć.I.", - "Ć.J.", - "Ć.K.", - "Ć.L.", - "Ć.M.", - "Ć.N.", - "Ć.O.", - "Ć.P.", - "Ć.R.", - "Ć.S.", - "Ć.T.", - "Ć.U.", - "Ć.W.", - "Ć.Y.", - "Ć.Z.", - "Ć.Ó.", - "Ć.Ą.", - "Ć.Ć.", - "Ć.Ę.", - "Ć.Ł.", - "Ć.Ń.", - "Ć.Ś.", - "Ć.Ź.", - "Ć.Ż.", - "ćw.", - "ćwicz.", - "Ę.A.", - "Ę.B.", - "Ę.C.", - "Ę.D.", - "Ę.E.", - "Ę.F.", - "Ę.G.", - "Ę.H.", - "Ę.I.", - "Ę.J.", - "Ę.K.", - "Ę.L.", - "Ę.M.", - "Ę.N.", - "Ę.O.", - "Ę.P.", - "Ę.R.", - "Ę.S.", - "Ę.T.", - "Ę.U.", - "Ę.W.", - "Ę.Y.", - "Ę.Z.", - "Ę.Ó.", - "Ę.Ą.", - "Ę.Ć.", - "Ę.Ę.", - "Ę.Ł.", - "Ę.Ń.", - "Ę.Ś.", - "Ę.Ź.", - "Ę.Ż.", - "Ł.A.", - "Ł.B.", - "Ł.C.", - "Ł.D.", - "Ł.E.", - "Ł.F.", - "Ł.G.", - "Ł.H.", - "Ł.I.", - "Ł.J.", - "Ł.K.", - "Ł.L.", - "Ł.M.", - "Ł.N.", - "Ł.O.", - "Ł.P.", - "Ł.R.", - "Ł.S.", - "Ł.T.", - "Ł.U.", - "Ł.W.", - "Ł.Y.", - "Ł.Z.", - "Ł.Ó.", - "Ł.Ą.", - "Ł.Ć.", - "Ł.Ę.", - "Ł.Ł.", - "Ł.Ń.", - "Ł.Ś.", - "Ł.Ź.", - "Ł.Ż.", - "Łuk.", - "łac.", - "łot.", - "łow.", - "Ń.A.", - "Ń.B.", - "Ń.C.", - "Ń.D.", - "Ń.E.", - "Ń.F.", - "Ń.G.", - "Ń.H.", - "Ń.I.", - "Ń.J.", - "Ń.K.", - "Ń.L.", - "Ń.M.", - "Ń.N.", - "Ń.O.", - "Ń.P.", - "Ń.R.", - "Ń.S.", - "Ń.T.", - "Ń.U.", - "Ń.W.", - "Ń.Y.", - "Ń.Z.", - "Ń.Ó.", - "Ń.Ą.", - "Ń.Ć.", - "Ń.Ę.", - "Ń.Ł.", - "Ń.Ń.", - "Ń.Ś.", - "Ń.Ź.", - "Ń.Ż.", - "Ś.A.", - "Ś.B.", - "Ś.C.", - "Ś.D.", - "Ś.E.", - "Ś.F.", - "Ś.G.", - "Ś.H.", - "Ś.I.", - "Ś.J.", - "Ś.K.", - 
"Ś.L.", - "Ś.M.", - "Ś.N.", - "Ś.O.", - "Ś.P.", - "Ś.R.", - "Ś.S.", - "Ś.T.", - "Ś.U.", - "Ś.W.", - "Ś.Y.", - "Ś.Z.", - "Ś.Ó.", - "Ś.Ą.", - "Ś.Ć.", - "Ś.Ę.", - "Ś.Ł.", - "Ś.Ń.", - "Ś.Ś.", - "Ś.Ź.", - "Ś.Ż.", - "ŚW.", - "Śp.", - "Św.", - "śW.", - "śl.", - "śp.", - "śr.", - "św.", - "Ź.A.", - "Ź.B.", - "Ź.C.", - "Ź.D.", - "Ź.E.", - "Ź.F.", - "Ź.G.", - "Ź.H.", - "Ź.I.", - "Ź.J.", - "Ź.K.", - "Ź.L.", - "Ź.M.", - "Ź.N.", - "Ź.O.", - "Ź.P.", - "Ź.R.", - "Ź.S.", - "Ź.T.", - "Ź.U.", - "Ź.W.", - "Ź.Y.", - "Ź.Z.", - "Ź.Ó.", - "Ź.Ą.", - "Ź.Ć.", - "Ź.Ę.", - "Ź.Ł.", - "Ź.Ń.", - "Ź.Ś.", - "Ź.Ź.", - "Ź.Ż.", - "Ż.A.", - "Ż.B.", - "Ż.C.", - "Ż.D.", - "Ż.E.", - "Ż.F.", - "Ż.G.", - "Ż.H.", - "Ż.I.", - "Ż.J.", - "Ż.K.", - "Ż.L.", - "Ż.M.", - "Ż.N.", - "Ż.O.", - "Ż.P.", - "Ż.R.", - "Ż.S.", - "Ż.T.", - "Ż.U.", - "Ż.W.", - "Ż.Y.", - "Ż.Z.", - "Ż.Ó.", - "Ż.Ą.", - "Ż.Ć.", - "Ż.Ę.", - "Ż.Ł.", - "Ż.Ń.", - "Ż.Ś.", - "Ż.Ź.", - "Ż.Ż.", - "ż.", - "żarg.", - "żart.", - "żyd.", - "żyw.", -] diff --git a/spacy/lang/pl/lemmatizer.py b/spacy/lang/pl/lemmatizer.py new file mode 100644 index 000000000..cd555b9c2 --- /dev/null +++ b/spacy/lang/pl/lemmatizer.py @@ -0,0 +1,106 @@ +# coding: utf-8 +from __future__ import unicode_literals + +from ...lemmatizer import Lemmatizer +from ...parts_of_speech import NAMES + + +class PolishLemmatizer(Lemmatizer): + # This lemmatizer implements lookup lemmatization based on + # the Morfeusz dictionary (morfeusz.sgjp.pl/en) by Institute of Computer Science PAS + # It utilizes some prefix based improvements for + # verb and adjectives lemmatization, as well as case-sensitive + # lemmatization for nouns + def __init__(self, lookups, *args, **kwargs): + # this lemmatizer is lookup based, so it does not require an index, exceptionlist, or rules + super().__init__(lookups) + self.lemma_lookups = {} + for tag in [ + "ADJ", + "ADP", + "ADV", + "AUX", + "NOUN", + "NUM", + "PART", + "PRON", + "VERB", + "X", + ]: + self.lemma_lookups[tag] = self.lookups.get_table( + "lemma_lookup_" + tag.lower(), {} + ) + self.lemma_lookups["DET"] = self.lemma_lookups["X"] + self.lemma_lookups["PROPN"] = self.lemma_lookups["NOUN"] + + def __call__(self, string, univ_pos, morphology=None): + if isinstance(univ_pos, int): + univ_pos = NAMES.get(univ_pos, "X") + univ_pos = univ_pos.upper() + + if univ_pos == "NOUN": + return self.lemmatize_noun(string, morphology) + + if univ_pos != "PROPN": + string = string.lower() + + if univ_pos == "ADJ": + return self.lemmatize_adj(string, morphology) + elif univ_pos == "VERB": + return self.lemmatize_verb(string, morphology) + + lemma_dict = self.lemma_lookups.get(univ_pos, {}) + return [lemma_dict.get(string, string.lower())] + + def lemmatize_adj(self, string, morphology): + # this method utilizes different procedures for adjectives + # with 'nie' and 'naj' prefixes + lemma_dict = self.lemma_lookups["ADJ"] + + if string[:3] == "nie": + search_string = string[3:] + if search_string[:3] == "naj": + naj_search_string = search_string[3:] + if naj_search_string in lemma_dict: + return [lemma_dict[naj_search_string]] + if search_string in lemma_dict: + return [lemma_dict[search_string]] + + if string[:3] == "naj": + naj_search_string = string[3:] + if naj_search_string in lemma_dict: + return [lemma_dict[naj_search_string]] + + return [lemma_dict.get(string, string)] + + def lemmatize_verb(self, string, morphology): + # this method utilizes a different procedure for verbs + # with 'nie' prefix + lemma_dict = self.lemma_lookups["VERB"] + + if string[:3] == "nie": + search_string = 
string[3:] + if search_string in lemma_dict: + return [lemma_dict[search_string]] + + return [lemma_dict.get(string, string)] + + def lemmatize_noun(self, string, morphology): + # this method is case-sensitive, in order to work + # for incorrectly tagged proper names + lemma_dict = self.lemma_lookups["NOUN"] + + if string != string.lower(): + if string.lower() in lemma_dict: + return [lemma_dict[string.lower()]] + elif string in lemma_dict: + return [lemma_dict[string]] + return [string.lower()] + + return [lemma_dict.get(string, string)] + + def lookup(self, string, orth=None): + return string.lower() + + def lemmatize(self, string, index, exceptions, rules): + raise NotImplementedError diff --git a/spacy/lang/pl/polish_srx_rules_LICENSE.txt b/spacy/lang/pl/polish_srx_rules_LICENSE.txt deleted file mode 100644 index 995a1b0f7..000000000 --- a/spacy/lang/pl/polish_srx_rules_LICENSE.txt +++ /dev/null @@ -1,23 +0,0 @@ - -Copyright (c) 2019, Marcin Miłkowski -All rights reserved. - -Redistribution and use in source and binary forms, with or without -modification, are permitted provided that the following conditions are met: - -1. Redistributions of source code must retain the above copyright notice, this - list of conditions and the following disclaimer. -2. Redistributions in binary form must reproduce the above copyright notice, - this list of conditions and the following disclaimer in the documentation - and/or other materials provided with the distribution. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND -ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED -WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE -DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR -ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES -(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; -LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND -ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS -SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
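To make the behaviour of the new `PolishLemmatizer` above concrete, here is a minimal usage sketch. The one-entry tables are toy data standing in for the full Morfeusz-derived `lemma_lookup_*` tables that released Polish models ship; any table not provided falls back to an empty dict via `get_table(name, {})`:

```python
from spacy.lookups import Lookups
from spacy.lang.pl.lemmatizer import PolishLemmatizer

lookups = Lookups()
# toy data; real models provide one table per coarse POS tag
lookups.add_table("lemma_lookup_adj", {"ładna": "ładny"})
lookups.add_table("lemma_lookup_noun", {"psa": "pies"})

lemmatizer = PolishLemmatizer(lookups)
# the "nie" prefix is stripped before the adjective lookup
assert lemmatizer("nieładna", "ADJ") == ["ładny"]
# noun lookup is case-sensitive, with a lowercased fallback for
# incorrectly tagged proper names
assert lemmatizer("psa", "NOUN") == ["pies"]
assert lemmatizer("Psa", "NOUN") == ["pies"]
```

Note the table sharing at the end of `__init__`: DET reuses the X table and PROPN reuses the NOUN table, which keeps determiners and proper nouns working without duplicating data.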
\ No newline at end of file diff --git a/spacy/lang/pl/punctuation.py b/spacy/lang/pl/punctuation.py index eea28de11..31e56b9ae 100644 --- a/spacy/lang/pl/punctuation.py +++ b/spacy/lang/pl/punctuation.py @@ -1,19 +1,45 @@ -from ..char_classes import LIST_ELLIPSES, CONCAT_ICONS +from ..char_classes import LIST_ELLIPSES, LIST_PUNCT, LIST_HYPHENS +from ..char_classes import LIST_ICONS, LIST_QUOTES, CURRENCY, UNITS, PUNCT from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER +from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES _quotes = CONCAT_QUOTES.replace("'", "") +_prefixes = [ + r"(długo|krótko|jedno|dwu|trzy|cztero)-" +] + BASE_TOKENIZER_PREFIXES + _infixes = ( LIST_ELLIPSES - + [CONCAT_ICONS] + + LIST_ICONS + + LIST_HYPHENS + [ - r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER), + r"(?<=[0-9{al}])\.(?=[0-9{au}])".format(al=ALPHA, au=ALPHA_UPPER), r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA), - r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA), - r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA), + r"(?<=[{a}])[:<>=\/](?=[{a}])".format(a=ALPHA), r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA), - r"(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])".format(a=ALPHA, q=CONCAT_QUOTES), + r"(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])".format(a=ALPHA, q=_quotes), ] ) +_suffixes = ( + ["''", "’’", r"\.", "…"] + + LIST_PUNCT + + LIST_QUOTES + + LIST_ICONS + + [ + r"(?<=[0-9])\+", + r"(?<=°[FfCcKk])\.", + r"(?<=[0-9])(?:{c})".format(c=CURRENCY), + r"(?<=[0-9])(?:{u})".format(u=UNITS), + r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format( + al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT + ), + r"(?<=[{au}])\.".format(au=ALPHA_UPPER), + ] +) + + +TOKENIZER_PREFIXES = _prefixes TOKENIZER_INFIXES = _infixes +TOKENIZER_SUFFIXES = _suffixes diff --git a/spacy/lang/pl/tokenizer_exceptions.py b/spacy/lang/pl/tokenizer_exceptions.py deleted file mode 100644 index 39f3017ed..000000000 --- a/spacy/lang/pl/tokenizer_exceptions.py +++ /dev/null @@ -1,23 +0,0 @@ -from ._tokenizer_exceptions_list import PL_BASE_EXCEPTIONS -from ...symbols import POS, ADV, NOUN, ORTH, LEMMA, ADJ - - -_exc = {} - -for exc_data in [ - {ORTH: "m.in.", LEMMA: "między innymi", POS: ADV}, - {ORTH: "inż.", LEMMA: "inżynier", POS: NOUN}, - {ORTH: "mgr.", LEMMA: "magister", POS: NOUN}, - {ORTH: "tzn.", LEMMA: "to znaczy", POS: ADV}, - {ORTH: "tj.", LEMMA: "to jest", POS: ADV}, - {ORTH: "tzw.", LEMMA: "tak zwany", POS: ADJ}, -]: - _exc[exc_data[ORTH]] = [exc_data] - -for orth in ["w.", "r."]: - _exc[orth] = [{ORTH: orth}] - -for orth in PL_BASE_EXCEPTIONS: - _exc[orth] = [{ORTH: orth}] - -TOKENIZER_EXCEPTIONS = _exc diff --git a/spacy/lang/pt/__init__.py b/spacy/lang/pt/__init__.py index 0557e8b31..d212d1e39 100644 --- a/spacy/lang/pt/__init__.py +++ b/spacy/lang/pt/__init__.py @@ -2,22 +2,17 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .tag_map import TAG_MAP -from .norm_exceptions import NORM_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups +from ...attrs import LANG +from ...util import update_exc class PortugueseDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters[LANG] = lambda text: "pt" - lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS - ) lex_attr_getters.update(LEX_ATTRS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) stop_words = STOP_WORDS diff --git a/spacy/lang/pt/norm_exceptions.py b/spacy/lang/pt/norm_exceptions.py deleted file mode 100644 index e115b0385..000000000 --- a/spacy/lang/pt/norm_exceptions.py +++ /dev/null @@ -1,20 +0,0 @@ -# These exceptions are used to add NORM values based on a token's ORTH value. -# Individual languages can also add their own exceptions and overwrite them - -# for example, British vs. American spelling in English. - -# Norms are only set if no alternative is provided in the tokenizer exceptions. -# Note that this does not change any other token attributes. Its main purpose -# is to normalise the word representations so that equivalent tokens receive -# similar representations. For example: $ and € are very different, but they're -# both currency symbols. By normalising currency symbols to $, all symbols are -# seen as similar, no matter how common they are in the training data. - - -NORM_EXCEPTIONS = { - "R$": "$", # Real - "r$": "$", # Real - "Cz$": "$", # Cruzado - "cz$": "$", # Cruzado - "NCz$": "$", # Cruzado Novo - "ncz$": "$", # Cruzado Novo -} diff --git a/spacy/lang/ru/__init__.py b/spacy/lang/ru/__init__.py index d25e8048b..52cab1db1 100644 --- a/spacy/lang/ru/__init__.py +++ b/spacy/lang/ru/__init__.py @@ -1,25 +1,20 @@ from .stop_words import STOP_WORDS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .norm_exceptions import NORM_EXCEPTIONS from .lex_attrs import LEX_ATTRS from .tag_map import TAG_MAP from .lemmatizer import RussianLemmatizer from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS -from ...util import update_exc, add_lookups +from ...util import update_exc from ...language import Language from ...lookups import Lookups -from ...attrs import LANG, NORM +from ...attrs import LANG class RussianDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters.update(LEX_ATTRS) lex_attr_getters[LANG] = lambda text: "ru" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS - ) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) stop_words = STOP_WORDS tag_map = TAG_MAP diff --git a/spacy/lang/ru/norm_exceptions.py b/spacy/lang/ru/norm_exceptions.py deleted file mode 100644 index 0975bf5b8..000000000 --- a/spacy/lang/ru/norm_exceptions.py +++ /dev/null @@ -1,32 +0,0 @@ -_exc = { - # Slang - "прив": "привет", - "дарова": "привет", - "дак": "так", - "дык": "так", - "здарова": "привет", - "пакедава": "пока", - "пакедаво": "пока", - "ща": "сейчас", - "спс": "спасибо", - "пжлст": "пожалуйста", - "плиз": "пожалуйста", - "ладненько": "ладно", - "лады": "ладно", - "лан": "ладно", - "ясн": "ясно", - "всм": "всмысле", - "хош": "хочешь", - "хаюшки": "привет", - "оч": "очень", - "че": "что", - "чо": "что", - "шо": "что", -} - - -NORM_EXCEPTIONS = {} - -for string, norm in _exc.items(): - NORM_EXCEPTIONS[string] = norm - NORM_EXCEPTIONS[string.title()] = norm diff --git a/spacy/lang/sr/__init__.py b/spacy/lang/sr/__init__.py index 151cc231c..7f2172707 100644 --- a/spacy/lang/sr/__init__.py +++ b/spacy/lang/sr/__init__.py @@ -1,21 +1,16 @@ from .stop_words import STOP_WORDS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .norm_exceptions import NORM_EXCEPTIONS from .lex_attrs import LEX_ATTRS from 
..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups +from ...attrs import LANG +from ...util import update_exc class SerbianDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters.update(LEX_ATTRS) lex_attr_getters[LANG] = lambda text: "sr" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS - ) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) stop_words = STOP_WORDS diff --git a/spacy/lang/sr/norm_exceptions.py b/spacy/lang/sr/norm_exceptions.py deleted file mode 100644 index 723ab84c0..000000000 --- a/spacy/lang/sr/norm_exceptions.py +++ /dev/null @@ -1,22 +0,0 @@ -_exc = { - # Slang - "ћале": "отац", - "кева": "мајка", - "смор": "досада", - "кец": "јединица", - "тебра": "брат", - "штребер": "ученик", - "факс": "факултет", - "профа": "професор", - "бус": "аутобус", - "пискарало": "службеник", - "бакутанер": "бака", - "џибер": "простак", -} - - -NORM_EXCEPTIONS = {} - -for string, norm in _exc.items(): - NORM_EXCEPTIONS[string] = norm - NORM_EXCEPTIONS[string.title()] = norm diff --git a/spacy/lang/sv/__init__.py b/spacy/lang/sv/__init__.py index d400eae4d..8179b1c84 100644 --- a/spacy/lang/sv/__init__.py +++ b/spacy/lang/sv/__init__.py @@ -1,6 +1,7 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tag_map import TAG_MAP from .stop_words import STOP_WORDS +from .lex_attrs import LEX_ATTRS from .morph_rules import MORPH_RULES # Punctuation stolen from Danish @@ -16,6 +17,7 @@ from .syntax_iterators import SYNTAX_ITERATORS class SwedishDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) + lex_attr_getters.update(LEX_ATTRS) lex_attr_getters[LANG] = lambda text: "sv" lex_attr_getters[NORM] = add_lookups( Language.Defaults.lex_attr_getters[NORM], BASE_NORMS diff --git a/spacy/lang/sv/lex_attrs.py b/spacy/lang/sv/lex_attrs.py new file mode 100644 index 000000000..24d06a97a --- /dev/null +++ b/spacy/lang/sv/lex_attrs.py @@ -0,0 +1,62 @@ +# coding: utf8 +from __future__ import unicode_literals + +from ...attrs import LIKE_NUM + + +_num_words = [ + "noll", + "en", + "ett", + "två", + "tre", + "fyra", + "fem", + "sex", + "sju", + "åtta", + "nio", + "tio", + "elva", + "tolv", + "tretton", + "fjorton", + "femton", + "sexton", + "sjutton", + "arton", + "nitton", + "tjugo", + "trettio", + "fyrtio", + "femtio", + "sextio", + "sjuttio", + "åttio", + "nittio", + "hundra", + "tusen", + "miljon", + "miljard", + "biljon", + "biljard", + "kvadriljon", +] + + +def like_num(text): + if text.startswith(("+", "-", "±", "~")): + text = text[1:] + text = text.replace(",", "").replace(".", "") + if text.isdigit(): + return True + if text.count("/") == 1: + num, denom = text.split("/") + if num.isdigit() and denom.isdigit(): + return True + if text.lower() in _num_words: + return True + return False + + +LEX_ATTRS = {LIKE_NUM: like_num} diff --git a/spacy/lang/sv/syntax_iterators.py b/spacy/lang/sv/syntax_iterators.py index 021d5d2f5..ec92c08d3 100644 --- a/spacy/lang/sv/syntax_iterators.py +++ b/spacy/lang/sv/syntax_iterators.py @@ -1,7 +1,8 @@ from ...symbols import NOUN, PROPN, PRON +from ...errors import Errors -def noun_chunks(obj): +def noun_chunks(doclike): """ Detect base noun phrases from a dependency parse. Works on both Doc and Span. 
""" @@ -16,12 +17,16 @@ def noun_chunks(obj): "nmod", "nmod:poss", ] - doc = obj.doc # Ensure works on both Doc and Span. + doc = doclike.doc # Ensure works on both Doc and Span. + + if not doc.is_parsed: + raise ValueError(Errors.E029) + np_deps = [doc.vocab.strings[label] for label in labels] conj = doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") seen = set() - for i, word in enumerate(obj): + for i, word in enumerate(doclike): if word.pos not in (NOUN, PROPN, PRON): continue # Prevent nested chunks from being produced diff --git a/spacy/lang/sv/tokenizer_exceptions.py b/spacy/lang/sv/tokenizer_exceptions.py index 834a088ad..a78a51f31 100644 --- a/spacy/lang/sv/tokenizer_exceptions.py +++ b/spacy/lang/sv/tokenizer_exceptions.py @@ -1,4 +1,4 @@ -from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA, PUNCT, TAG +from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA _exc = {} @@ -152,6 +152,6 @@ for orth in ABBREVIATIONS: # Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), # should be tokenized as two separate tokens. for orth in ["i", "m"]: - _exc[orth + "."] = [{ORTH: orth, LEMMA: orth, NORM: orth}, {ORTH: ".", TAG: PUNCT}] + _exc[orth + "."] = [{ORTH: orth, LEMMA: orth, NORM: orth}, {ORTH: "."}] TOKENIZER_EXCEPTIONS = _exc diff --git a/spacy/lang/ta/norm_exceptions.py b/spacy/lang/ta/norm_exceptions.py deleted file mode 100644 index 8eaf0aa74..000000000 --- a/spacy/lang/ta/norm_exceptions.py +++ /dev/null @@ -1,136 +0,0 @@ -_exc = { - # Regional words normal - # Sri Lanka - wikipeadia - "இங்க": "இங்கே", - "வாங்க": "வாருங்கள்", - "ஒண்டு": "ஒன்று", - "கண்டு": "கன்று", - "கொண்டு": "கொன்று", - "பண்டி": "பன்றி", - "பச்ச": "பச்சை", - "அம்பது": "ஐம்பது", - "வெச்ச": "வைத்து", - "வச்ச": "வைத்து", - "வச்சி": "வைத்து", - "வாளைப்பழம்": "வாழைப்பழம்", - "மண்ணு": "மண்", - "பொன்னு": "பொன்", - "சாவல்": "சேவல்", - "அங்கால": "அங்கு ", - "அசுப்பு": "நடமாட்டம்", - "எழுவான் கரை": "எழுவான்கரை", - "ஓய்யாரம்": "எழில் ", - "ஒளும்பு": "எழும்பு", - "ஓர்மை": "துணிவு", - "கச்சை": "கோவணம்", - "கடப்பு": "தெருவாசல்", - "சுள்ளி": "காய்ந்த குச்சி", - "திறாவுதல்": "தடவுதல்", - "நாசமறுப்பு": "தொல்லை", - "பரிசாரி": "வைத்தியன்", - "பறவாதி": "பேராசைக்காரன்", - "பிசினி": "உலோபி ", - "விசர்": "பைத்தியம்", - "ஏனம்": "பாத்திரம்", - "ஏலா": "இயலாது", - "ஒசில்": "அழகு", - "ஒள்ளுப்பம்": "கொஞ்சம்", - # Srilankan and indian - "குத்துமதிப்பு": "", - "நூனாயம்": "நூல்நயம்", - "பைய": "மெதுவாக", - "மண்டை": "தலை", - "வெள்ளனே": "சீக்கிரம்", - "உசுப்பு": "எழுப்பு", - "ஆணம்": "குழம்பு", - "உறக்கம்": "தூக்கம்", - "பஸ்": "பேருந்து", - "களவு": "திருட்டு ", - # relationship - "புருசன்": "கணவன்", - "பொஞ்சாதி": "மனைவி", - "புள்ள": "பிள்ளை", - "பிள்ள": "பிள்ளை", - "ஆம்பிளப்புள்ள": "ஆண் பிள்ளை", - "பொம்பிளப்புள்ள": "பெண் பிள்ளை", - "அண்ணாச்சி": "அண்ணா", - "அக்காச்சி": "அக்கா", - "தங்கச்சி": "தங்கை", - # difference words - "பொடியன்": "சிறுவன்", - "பொட்டை": "சிறுமி", - "பிறகு": "பின்பு", - "டக்கென்டு": "விரைவாக", - "கெதியா": "விரைவாக", - "கிறுகி": "திரும்பி", - "போயித்து வாறன்": "போய் வருகிறேன்", - "வருவாங்களா": "வருவார்களா", - # regular spokens - "சொல்லு": "சொல்", - "கேளு": "கேள்", - "சொல்லுங்க": "சொல்லுங்கள்", - "கேளுங்க": "கேளுங்கள்", - "நீங்கள்": "நீ", - "உன்": "உன்னுடைய", - # Portugeese formal words - "அலவாங்கு": "கடப்பாரை", - "ஆசுப்பத்திரி": "மருத்துவமனை", - "உரோதை": "சில்லு", - "கடுதாசி": "கடிதம்", - "கதிரை": "நாற்காலி", - "குசினி": "அடுக்களை", - "கோப்பை": "கிண்ணம்", - "சப்பாத்து": "காலணி", - "தாச்சி": "இரும்புச் சட்டி", - "துவாய்": "துவாலை", - "தவறணை": "மதுக்கடை", - "பீப்பா": "மரத்தாழி", - "யன்னல்": 
"சாளரம்", - "வாங்கு": "மரஇருக்கை", - # Dutch formal words - "இறாக்கை": "பற்சட்டம்", - "இலாட்சி": "இழுப்பறை", - "கந்தோர்": "பணிமனை", - "நொத்தாரிசு": "ஆவண எழுத்துபதிவாளர்", - # English formal words - "இஞ்சினியர்": "பொறியியலாளர்", - "சூப்பு": "ரசம்", - "செக்": "காசோலை", - "சேட்டு": "மேற்ச்சட்டை", - "மார்க்கட்டு": "சந்தை", - "விண்ணன்": "கெட்டிக்காரன்", - # Arabic formal words - "ஈமான்": "நம்பிக்கை", - "சுன்னத்து": "விருத்தசேதனம்", - "செய்த்தான்": "பிசாசு", - "மவுத்து": "இறப்பு", - "ஹலால்": "அங்கீகரிக்கப்பட்டது", - "கறாம்": "நிராகரிக்கப்பட்டது", - # Persian, Hindustanian and hindi formal words - "சுமார்": "கிட்டத்தட்ட", - "சிப்பாய்": "போர்வீரன்", - "சிபார்சு": "சிபாரிசு", - "ஜமீன்": "பணக்காரா்", - "அசல்": "மெய்யான", - "அந்தஸ்து": "கௌரவம்", - "ஆஜர்": "சமா்ப்பித்தல்", - "உசார்": "எச்சரிக்கை", - "அச்சா": "நல்ல", - # English words used in text conversations - "bcoz": "ஏனெனில்", - "bcuz": "ஏனெனில்", - "fav": "விருப்பமான", - "morning": "காலை வணக்கம்", - "gdeveng": "மாலை வணக்கம்", - "gdnyt": "இரவு வணக்கம்", - "gdnit": "இரவு வணக்கம்", - "plz": "தயவு செய்து", - "pls": "தயவு செய்து", - "thx": "நன்றி", - "thanx": "நன்றி", -} - -NORM_EXCEPTIONS = {} - -for string, norm in _exc.items(): - NORM_EXCEPTIONS[string] = norm diff --git a/spacy/lang/th/__init__.py b/spacy/lang/th/__init__.py index 950a77818..4333afcc9 100644 --- a/spacy/lang/th/__init__.py +++ b/spacy/lang/th/__init__.py @@ -1,14 +1,12 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tag_map import TAG_MAP from .stop_words import STOP_WORDS -from .norm_exceptions import NORM_EXCEPTIONS from .lex_attrs import LEX_ATTRS -from ..norm_exceptions import BASE_NORMS -from ...attrs import LANG, NORM +from ...attrs import LANG from ...language import Language from ...tokens import Doc -from ...util import DummyTokenizer, add_lookups +from ...util import DummyTokenizer class ThaiTokenizer(DummyTokenizer): @@ -34,9 +32,6 @@ class ThaiDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters.update(LEX_ATTRS) lex_attr_getters[LANG] = lambda _text: "th" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS - ) tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS) tag_map = TAG_MAP stop_words = STOP_WORDS diff --git a/spacy/lang/th/norm_exceptions.py b/spacy/lang/th/norm_exceptions.py deleted file mode 100644 index b8ddbab16..000000000 --- a/spacy/lang/th/norm_exceptions.py +++ /dev/null @@ -1,109 +0,0 @@ -_exc = { - # Conjugation and Diversion invalid to Tonal form (ผันอักษรและเสียงไม่ตรงกับรูปวรรณยุกต์) - "สนุ๊กเกอร์": "สนุกเกอร์", - "โน้ต": "โน้ต", - # Misspelled because of being lazy or hustle (สะกดผิดเพราะขี้เกียจพิมพ์ หรือเร่งรีบ) - "โทสับ": "โทรศัพท์", - "พุ่งนี้": "พรุ่งนี้", - # Strange (ให้ดูแปลกตา) - "ชะมะ": "ใช่ไหม", - "ชิมิ": "ใช่ไหม", - "ชะ": "ใช่ไหม", - "ช่ายมะ": "ใช่ไหม", - "ป่าว": "เปล่า", - "ป่ะ": "เปล่า", - "ปล่าว": "เปล่า", - "คัย": "ใคร", - "ไค": "ใคร", - "คราย": "ใคร", - "เตง": "ตัวเอง", - "ตะเอง": "ตัวเอง", - "รึ": "หรือ", - "เหรอ": "หรือ", - "หรา": "หรือ", - "หรอ": "หรือ", - "ชั้น": "ฉัน", - "ชั้ล": "ฉัน", - "ช้าน": "ฉัน", - "เทอ": "เธอ", - "เทอร์": "เธอ", - "เทอว์": "เธอ", - "แกร": "แก", - "ป๋ม": "ผม", - "บ่องตง": "บอกตรงๆ", - "ถ่ามตง": "ถามตรงๆ", - "ต่อมตง": "ตอบตรงๆ", - "เพิ่ล": "เพื่อน", - "จอบอ": "จอบอ", - "ดั้ย": "ได้", - "ขอบคุง": "ขอบคุณ", - "ยังงัย": "ยังไง", - "Inw": "เทพ", - "uou": "นอน", - "Lกรีeu": "เกรียน", - # Misspelled to express emotions (คำที่สะกดผิดเพื่อแสดงอารมณ์) - "เปงราย": "เป็นอะไร", - "เปนรัย": 
"เป็นอะไร", - "เปงรัย": "เป็นอะไร", - "เป็นอัลไล": "เป็นอะไร", - "ทามมาย": "ทำไม", - "ทามมัย": "ทำไม", - "จังรุย": "จังเลย", - "จังเยย": "จังเลย", - "จุงเบย": "จังเลย", - "ไม่รู้": "มะรุ", - "เฮ่ย": "เฮ้ย", - "เห้ย": "เฮ้ย", - "น่าร็อค": "น่ารัก", - "น่าร๊าก": "น่ารัก", - "ตั้ลล๊าก": "น่ารัก", - "คือร๊ะ": "คืออะไร", - "โอป่ะ": "โอเคหรือเปล่า", - "น่ามคาน": "น่ารำคาญ", - "น่ามสาร": "น่าสงสาร", - "วงวาร": "สงสาร", - "บับว่า": "แบบว่า", - "อัลไล": "อะไร", - "อิจ": "อิจฉา", - # Reduce rough words or Avoid to software filter (คำที่สะกดผิดเพื่อลดความหยาบของคำ หรืออาจใช้หลีกเลี่ยงการกรองคำหยาบของซอฟต์แวร์) - "กรู": "กู", - "กุ": "กู", - "กรุ": "กู", - "ตู": "กู", - "ตรู": "กู", - "มรึง": "มึง", - "เมิง": "มึง", - "มืง": "มึง", - "มุง": "มึง", - "สาด": "สัตว์", - "สัส": "สัตว์", - "สัก": "สัตว์", - "แสรด": "สัตว์", - "โคโตะ": "โคตร", - "โคด": "โคตร", - "โครต": "โคตร", - "โคตะระ": "โคตร", - "พ่อง": "พ่อมึง", - "แม่เมิง": "แม่มึง", - "เชี่ย": "เหี้ย", - # Imitate words (คำเลียนเสียง โดยส่วนใหญ่จะเพิ่มทัณฑฆาต หรือซ้ำตัวอักษร) - "แอร๊ยย": "อ๊าย", - "อร๊ายยย": "อ๊าย", - "มันส์": "มัน", - "วู๊วววววววว์": "วู้", - # Acronym (แบบคำย่อ) - "หมาลัย": "มหาวิทยาลัย", - "วิดวะ": "วิศวะ", - "สินสาด ": "ศิลปศาสตร์", - "สินกำ ": "ศิลปกรรมศาสตร์", - "เสารีย์ ": "อนุเสาวรีย์ชัยสมรภูมิ", - "เมกา ": "อเมริกา", - "มอไซค์ ": "มอเตอร์ไซค์", -} - - -NORM_EXCEPTIONS = {} - -for string, norm in _exc.items(): - NORM_EXCEPTIONS[string] = norm - NORM_EXCEPTIONS[string.title()] = norm diff --git a/spacy/lang/tokenizer_exceptions.py b/spacy/lang/tokenizer_exceptions.py index ee58a7b09..3bb299d6d 100644 --- a/spacy/lang/tokenizer_exceptions.py +++ b/spacy/lang/tokenizer_exceptions.py @@ -55,7 +55,7 @@ URL_PATTERN = ( # fmt: on ).strip() -TOKEN_MATCH = re.compile(URL_PATTERN, re.UNICODE).match +TOKEN_MATCH = re.compile("(?u)" + URL_PATTERN).match BASE_EXCEPTIONS = {} diff --git a/spacy/lang/ur/tag_map.py b/spacy/lang/ur/tag_map.py index d990fd46a..4ae0d7014 100644 --- a/spacy/lang/ur/tag_map.py +++ b/spacy/lang/ur/tag_map.py @@ -1,63 +1,90 @@ +from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, SCONJ from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON TAG_MAP = { - ".": {POS: PUNCT, "PunctType": "peri"}, - ",": {POS: PUNCT, "PunctType": "comm"}, - "-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"}, - "-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"}, - "``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"}, - '""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"}, - "''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"}, + "JJ-Ez": {POS: ADJ}, + "INJC": {POS: X}, + "QFC": {POS: DET}, + "UNK": {POS: X}, + "NSTC": {POS: ADV}, + "NST": {POS: ADV}, + "VMC": {POS: VERB}, + "PRPC": {POS: PRON}, + "RBC": {POS: ADV}, + "PSPC": {POS: ADP}, + "INJ": {POS: X}, + "JJZ": {POS: ADJ}, + "CCC": {POS: SCONJ}, + "NN-Ez": {POS: NOUN}, + "ECH": {POS: NOUN}, + "WQ": {POS: DET}, + "RDP": {POS: ADJ}, + "JJC": {POS: ADJ}, + "NEG": {POS: PART}, + "NNZ": {POS: NOUN}, + "QO": {POS: ADJ}, + "INTFC": {POS: ADV}, + "INTF": {POS: ADV}, + "NFC": {POS: ADP}, + "QCC": {POS: NUM}, + "QC": {POS: NUM}, + "QF": {POS: DET}, + "VAUX": {POS: AUX}, + "VM": {POS: VERB}, + "DEM": {POS: DET}, + "NNPC": {POS: PROPN}, + "NNC": {POS: NOUN}, + "PSP": {POS: ADP}, + ".": {POS: PUNCT}, + ",": {POS: PUNCT}, + "-LRB-": {POS: PUNCT}, + "-RRB-": {POS: PUNCT}, + "``": {POS: PUNCT}, + '""': {POS: PUNCT}, + "''": {POS: PUNCT}, ":": {POS: PUNCT}, - 
"$": {POS: SYM, "SymType": "currency"}, - "#": {POS: SYM, "SymType": "numbersign"}, - "AFX": {POS: ADJ, "Hyph": "yes"}, - "CC": {POS: CCONJ, "ConjType": "coor"}, - "CD": {POS: NUM, "NumType": "card"}, + "$": {POS: SYM}, + "#": {POS: SYM}, + "AFX": {POS: ADJ}, + "CC": {POS: CCONJ}, + "CD": {POS: NUM}, "DT": {POS: DET}, - "EX": {POS: ADV, "AdvType": "ex"}, - "FW": {POS: X, "Foreign": "yes"}, - "HYPH": {POS: PUNCT, "PunctType": "dash"}, + "EX": {POS: ADV}, + "FW": {POS: X}, + "HYPH": {POS: PUNCT}, "IN": {POS: ADP}, - "JJ": {POS: ADJ, "Degree": "pos"}, - "JJR": {POS: ADJ, "Degree": "comp"}, - "JJS": {POS: ADJ, "Degree": "sup"}, - "LS": {POS: PUNCT, "NumType": "ord"}, - "MD": {POS: VERB, "VerbType": "mod"}, + "JJ": {POS: ADJ}, + "JJR": {POS: ADJ}, + "JJS": {POS: ADJ}, + "LS": {POS: PUNCT}, + "MD": {POS: VERB}, "NIL": {POS: ""}, - "NN": {POS: NOUN, "Number": "sing"}, - "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"}, - "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"}, - "NNS": {POS: NOUN, "Number": "plur"}, - "PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"}, - "POS": {POS: PART, "Poss": "yes"}, - "PRP": {POS: PRON, "PronType": "prs"}, - "PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"}, - "RB": {POS: ADV, "Degree": "pos"}, - "RBR": {POS: ADV, "Degree": "comp"}, - "RBS": {POS: ADV, "Degree": "sup"}, + "NN": {POS: NOUN}, + "NNP": {POS: PROPN}, + "NNPS": {POS: PROPN}, + "NNS": {POS: NOUN}, + "PDT": {POS: ADJ}, + "POS": {POS: PART}, + "PRP": {POS: PRON}, + "PRP$": {POS: ADJ}, + "RB": {POS: ADV}, + "RBR": {POS: ADV}, + "RBS": {POS: ADV}, "RP": {POS: PART}, "SP": {POS: SPACE}, "SYM": {POS: SYM}, - "TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"}, + "TO": {POS: PART}, "UH": {POS: INTJ}, - "VB": {POS: VERB, "VerbForm": "inf"}, - "VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"}, - "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"}, - "VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"}, - "VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"}, - "VBZ": { - POS: VERB, - "VerbForm": "fin", - "Tense": "pres", - "Number": "sing", - "Person": "3", - }, - "WDT": {POS: ADJ, "PronType": "int|rel"}, - "WP": {POS: NOUN, "PronType": "int|rel"}, - "WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"}, - "WRB": {POS: ADV, "PronType": "int|rel"}, + "VB": {POS: VERB}, + "VBD": {POS: VERB}, + "VBG": {POS: VERB}, + "VBN": {POS: VERB}, + "VBP": {POS: VERB}, + "VBZ": {POS: VERB}, + "WDT": {POS: ADJ}, + "WP": {POS: NOUN}, + "WP$": {POS: ADJ}, + "WRB": {POS: ADV}, "ADD": {POS: X}, "NFP": {POS: PUNCT}, "GW": {POS: X}, diff --git a/spacy/lang/zh/__init__.py b/spacy/lang/zh/__init__.py index e427dc6d2..fc7573f8d 100644 --- a/spacy/lang/zh/__init__.py +++ b/spacy/lang/zh/__init__.py @@ -1,3 +1,7 @@ +import tempfile +import srsly +from pathlib import Path +from collections import OrderedDict from ...attrs import LANG from ...language import Language from ...tokens import Doc @@ -6,75 +10,274 @@ from ..tokenizer_exceptions import BASE_EXCEPTIONS from .lex_attrs import LEX_ATTRS from .stop_words import STOP_WORDS from .tag_map import TAG_MAP +from ... import util + + +_PKUSEG_INSTALL_MSG = "install it with `pip install pkuseg==0.0.22` or from https://github.com/lancopku/pkuseg-python" def try_jieba_import(use_jieba): try: import jieba + # segment a short text to have jieba initialize its cache in advance + list(jieba.cut("作为", cut_all=False)) + return jieba except ImportError: if use_jieba: msg = ( - "Jieba not installed. 
Either set Chinese.use_jieba = False, " - "or install it https://github.com/fxsjy/jieba" + "Jieba not installed. Either set the default to False with " + "`from spacy.lang.zh import ChineseDefaults; ChineseDefaults.use_jieba = False`, " + "or install it with `pip install jieba` or from " + "https://github.com/fxsjy/jieba" ) raise ImportError(msg) +def try_pkuseg_import(use_pkuseg, pkuseg_model, pkuseg_user_dict): + try: + import pkuseg + + if pkuseg_model: + return pkuseg.pkuseg(pkuseg_model, pkuseg_user_dict) + elif use_pkuseg: + msg = ( + "Chinese.use_pkuseg is True but no pkuseg model was specified. " + "Please provide the name of a pretrained model " + "or the path to a model with " + '`Chinese(meta={"tokenizer": {"config": {"pkuseg_model": name_or_path}}}).' + ) + raise ValueError(msg) + except ImportError: + if use_pkuseg: + msg = ( + "pkuseg not installed. Either set Chinese.use_pkuseg = False, " + "or " + _PKUSEG_INSTALL_MSG + ) + raise ImportError(msg) + except FileNotFoundError: + if use_pkuseg: + msg = "Unable to load pkuseg model from: " + pkuseg_model + raise FileNotFoundError(msg) + + class ChineseTokenizer(DummyTokenizer): - def __init__(self, cls, nlp=None): + def __init__(self, cls, nlp=None, config={}): + self.use_jieba = config.get("use_jieba", cls.use_jieba) + self.use_pkuseg = config.get("use_pkuseg", cls.use_pkuseg) + self.require_pkuseg = config.get("require_pkuseg", False) self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp) - self.use_jieba = cls.use_jieba self.jieba_seg = try_jieba_import(self.use_jieba) + self.pkuseg_seg = try_pkuseg_import( + self.use_pkuseg, + pkuseg_model=config.get("pkuseg_model", None), + pkuseg_user_dict=config.get("pkuseg_user_dict", "default"), + ) + # remove relevant settings from config so they're not also saved in + # Language.meta + for key in ["use_jieba", "use_pkuseg", "require_pkuseg", "pkuseg_model"]: + if key in config: + del config[key] self.tokenizer = Language.Defaults().create_tokenizer(nlp) def __call__(self, text): - # use jieba - if self.use_jieba: - jieba_words = list( - [x for x in self.jieba_seg.cut(text, cut_all=False) if x] - ) - words = [jieba_words[0]] - spaces = [False] - for i in range(1, len(jieba_words)): - word = jieba_words[i] - if word.isspace(): - # second token in adjacent whitespace following a - # non-space token - if spaces[-1]: - words.append(word) - spaces.append(False) - # first space token following non-space token - elif word == " " and not words[-1].isspace(): - spaces[-1] = True - # token is non-space whitespace or any whitespace following - # a whitespace token - else: - # extend previous whitespace token with more whitespace - if words[-1].isspace(): - words[-1] += word - # otherwise it's a new whitespace token - else: - words.append(word) - spaces.append(False) - else: - words.append(word) - spaces.append(False) + use_jieba = self.use_jieba + use_pkuseg = self.use_pkuseg + if self.require_pkuseg: + use_jieba = False + use_pkuseg = True + if use_jieba: + words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x]) + (words, spaces) = util.get_words_and_spaces(words, text) + return Doc(self.vocab, words=words, spaces=spaces) + elif use_pkuseg: + words = self.pkuseg_seg.cut(text) + (words, spaces) = util.get_words_and_spaces(words, text) + return Doc(self.vocab, words=words, spaces=spaces) + else: + # split into individual characters + words = list(text) + (words, spaces) = util.get_words_and_spaces(words, text) return Doc(self.vocab, words=words, spaces=spaces) - # 
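The rewritten `ChineseTokenizer.__call__` above delegates whitespace reconciliation to `util.get_words_and_spaces` for all three modes (jieba, pkuseg, per-character fallback), and configuration now flows in through the tokenizer config on the model meta, as the pkuseg error message above spells out. A usage sketch; the model name `"default"` is an assumption standing in for whatever pretrained pkuseg model name or path you have, and pkuseg itself must be installed (`pip install pkuseg==0.0.22`):

```python
from spacy.lang.zh import Chinese

cfg = {"tokenizer": {"config": {"use_jieba": False,
                                "use_pkuseg": True,
                                "pkuseg_model": "default"}}}
nlp = Chinese(meta=cfg)

# user-dict entries are inserted into pkuseg's preprocesser trie
nlp.tokenizer.pkuseg_update_user_dict(["自然语言处理"])
doc = nlp("自然语言处理很有趣")
print([t.text for t in doc])
```

The `to_bytes`/`to_disk` machinery below then round-trips the pkuseg feature extractor and model weights through a temporary directory, along with the preprocesser trie and the postprocesser word sets, so a configured segmenter survives model serialization.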
split into individual characters - words = [] - spaces = [] - for token in self.tokenizer(text): - if token.text.isspace(): - words.append(token.text) - spaces.append(False) - else: - words.extend(list(token.text)) - spaces.extend([False] * len(token.text)) - spaces[-1] = bool(token.whitespace_) - return Doc(self.vocab, words=words, spaces=spaces) + def pkuseg_update_user_dict(self, words, reset=False): + if self.pkuseg_seg: + if reset: + try: + import pkuseg + + self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(None) + except ImportError: + if self.use_pkuseg: + msg = ( + "pkuseg not installed: unable to reset pkuseg " + "user dict. Please " + _PKUSEG_INSTALL_MSG + ) + raise ImportError(msg) + for word in words: + self.pkuseg_seg.preprocesser.insert(word.strip(), "") + + def _get_config(self): + config = OrderedDict( + ( + ("use_jieba", self.use_jieba), + ("use_pkuseg", self.use_pkuseg), + ("require_pkuseg", self.require_pkuseg), + ) + ) + return config + + def _set_config(self, config={}): + self.use_jieba = config.get("use_jieba", False) + self.use_pkuseg = config.get("use_pkuseg", False) + self.require_pkuseg = config.get("require_pkuseg", False) + + def to_bytes(self, **kwargs): + pkuseg_features_b = b"" + pkuseg_weights_b = b"" + pkuseg_processors_data = None + if self.pkuseg_seg: + with tempfile.TemporaryDirectory() as tempdir: + self.pkuseg_seg.feature_extractor.save(tempdir) + self.pkuseg_seg.model.save(tempdir) + tempdir = Path(tempdir) + with open(tempdir / "features.pkl", "rb") as fileh: + pkuseg_features_b = fileh.read() + with open(tempdir / "weights.npz", "rb") as fileh: + pkuseg_weights_b = fileh.read() + pkuseg_processors_data = ( + _get_pkuseg_trie_data(self.pkuseg_seg.preprocesser.trie), + self.pkuseg_seg.postprocesser.do_process, + sorted(list(self.pkuseg_seg.postprocesser.common_words)), + sorted(list(self.pkuseg_seg.postprocesser.other_words)), + ) + serializers = OrderedDict( + ( + ("cfg", lambda: srsly.json_dumps(self._get_config())), + ("pkuseg_features", lambda: pkuseg_features_b), + ("pkuseg_weights", lambda: pkuseg_weights_b), + ( + "pkuseg_processors", + lambda: srsly.msgpack_dumps(pkuseg_processors_data), + ), + ) + ) + return util.to_bytes(serializers, []) + + def from_bytes(self, data, **kwargs): + pkuseg_data = {"features_b": b"", "weights_b": b"", "processors_data": None} + + def deserialize_pkuseg_features(b): + pkuseg_data["features_b"] = b + + def deserialize_pkuseg_weights(b): + pkuseg_data["weights_b"] = b + + def deserialize_pkuseg_processors(b): + pkuseg_data["processors_data"] = srsly.msgpack_loads(b) + + deserializers = OrderedDict( + ( + ("cfg", lambda b: self._set_config(srsly.json_loads(b))), + ("pkuseg_features", deserialize_pkuseg_features), + ("pkuseg_weights", deserialize_pkuseg_weights), + ("pkuseg_processors", deserialize_pkuseg_processors), + ) + ) + util.from_bytes(data, deserializers, []) + + if pkuseg_data["features_b"] and pkuseg_data["weights_b"]: + with tempfile.TemporaryDirectory() as tempdir: + tempdir = Path(tempdir) + with open(tempdir / "features.pkl", "wb") as fileh: + fileh.write(pkuseg_data["features_b"]) + with open(tempdir / "weights.npz", "wb") as fileh: + fileh.write(pkuseg_data["weights_b"]) + try: + import pkuseg + except ImportError: + raise ImportError( + "pkuseg not installed. 
To use this model, " + + _PKUSEG_INSTALL_MSG + ) + self.pkuseg_seg = pkuseg.pkuseg(str(tempdir)) + if pkuseg_data["processors_data"]: + processors_data = pkuseg_data["processors_data"] + (user_dict, do_process, common_words, other_words) = processors_data + self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(user_dict) + self.pkuseg_seg.postprocesser.do_process = do_process + self.pkuseg_seg.postprocesser.common_words = set(common_words) + self.pkuseg_seg.postprocesser.other_words = set(other_words) + + return self + + def to_disk(self, path, **kwargs): + path = util.ensure_path(path) + + def save_pkuseg_model(path): + if self.pkuseg_seg: + if not path.exists(): + path.mkdir(parents=True) + self.pkuseg_seg.model.save(path) + self.pkuseg_seg.feature_extractor.save(path) + + def save_pkuseg_processors(path): + if self.pkuseg_seg: + data = ( + _get_pkuseg_trie_data(self.pkuseg_seg.preprocesser.trie), + self.pkuseg_seg.postprocesser.do_process, + sorted(list(self.pkuseg_seg.postprocesser.common_words)), + sorted(list(self.pkuseg_seg.postprocesser.other_words)), + ) + srsly.write_msgpack(path, data) + + serializers = OrderedDict( + ( + ("cfg", lambda p: srsly.write_json(p, self._get_config())), + ("pkuseg_model", lambda p: save_pkuseg_model(p)), + ("pkuseg_processors", lambda p: save_pkuseg_processors(p)), + ) + ) + return util.to_disk(path, serializers, []) + + def from_disk(self, path, **kwargs): + path = util.ensure_path(path) + + def load_pkuseg_model(path): + try: + import pkuseg + except ImportError: + if self.use_pkuseg: + raise ImportError( + "pkuseg not installed. To use this model, " + + _PKUSEG_INSTALL_MSG + ) + if path.exists(): + self.pkuseg_seg = pkuseg.pkuseg(path) + + def load_pkuseg_processors(path): + try: + import pkuseg + except ImportError: + if self.use_pkuseg: + raise ImportError(self._pkuseg_install_msg) + if self.pkuseg_seg: + data = srsly.read_msgpack(path) + (user_dict, do_process, common_words, other_words) = data + self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(user_dict) + self.pkuseg_seg.postprocesser.do_process = do_process + self.pkuseg_seg.postprocesser.common_words = set(common_words) + self.pkuseg_seg.postprocesser.other_words = set(other_words) + + serializers = OrderedDict( + ( + ("cfg", lambda p: self._set_config(srsly.read_json(p))), + ("pkuseg_model", lambda p: load_pkuseg_model(p)), + ("pkuseg_processors", lambda p: load_pkuseg_processors(p)), + ) + ) + util.from_disk(path, serializers, []) class ChineseDefaults(Language.Defaults): @@ -86,10 +289,11 @@ class ChineseDefaults(Language.Defaults): tag_map = TAG_MAP writing_system = {"direction": "ltr", "has_case": False, "has_letters": False} use_jieba = True + use_pkuseg = False @classmethod - def create_tokenizer(cls, nlp=None): - return ChineseTokenizer(cls, nlp) + def create_tokenizer(cls, nlp=None, config={}): + return ChineseTokenizer(cls, nlp, config=config) class Chinese(Language): @@ -100,4 +304,13 @@ class Chinese(Language): return self.tokenizer(text) +def _get_pkuseg_trie_data(node, path=""): + data = [] + for c, child_node in sorted(node.children.items()): + data.extend(_get_pkuseg_trie_data(child_node, path + c)) + if node.isword: + data.append((path, node.usertag)) + return data + + __all__ = ["Chinese"] diff --git a/spacy/language.py b/spacy/language.py index 61d69b63e..f281fa1ba 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -22,10 +22,11 @@ from .pipe_analysis import count_pipeline_interdependencies from .gold import Example from .scorer import Scorer from .util import 
link_vectors_to_models, create_default_optimizer, registry -from .attrs import IS_STOP, LANG +from .attrs import IS_STOP, LANG, NORM from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES from .lang.punctuation import TOKENIZER_INFIXES from .lang.tokenizer_exceptions import TOKEN_MATCH +from .lang.norm_exceptions import BASE_NORMS from .lang.tag_map import TAG_MAP from .tokens import Doc from .lang.lex_attrs import LEX_ATTRS, is_stop @@ -71,6 +72,11 @@ class BaseDefaults(object): lemmatizer=lemmatizer, lookups=lookups, ) + vocab.lex_attr_getters[NORM] = util.add_lookups( + vocab.lex_attr_getters.get(NORM, LEX_ATTRS[NORM]), + BASE_NORMS, + vocab.lookups.get_table("lexeme_norm"), + ) for tag_str, exc in cls.morph_rules.items(): for orth_str, attrs in exc.items(): vocab.morphology.add_special_case(tag_str, orth_str, attrs) @@ -1137,7 +1143,7 @@ def _fix_pretrained_vectors_name(nlp): else: raise ValueError(Errors.E092) if nlp.vocab.vectors.size != 0: - link_vectors_to_models(nlp.vocab) + link_vectors_to_models(nlp.vocab, skip_rank=True) for name, proc in nlp.pipeline: if not hasattr(proc, "cfg"): continue diff --git a/spacy/lemmatizer.py b/spacy/lemmatizer.py index aeedbde84..c4944407f 100644 --- a/spacy/lemmatizer.py +++ b/spacy/lemmatizer.py @@ -1,6 +1,7 @@ from .symbols import NOUN, VERB, ADJ, PUNCT, PROPN from .errors import Errors from .lookups import Lookups +from .parts_of_speech import NAMES as UPOS_NAMES class Lemmatizer(object): @@ -38,17 +39,11 @@ class Lemmatizer(object): lookup_table = self.lookups.get_table("lemma_lookup", {}) if "lemma_rules" not in self.lookups: return [lookup_table.get(string, string)] - if univ_pos in (NOUN, "NOUN", "noun"): - univ_pos = "noun" - elif univ_pos in (VERB, "VERB", "verb"): - univ_pos = "verb" - elif univ_pos in (ADJ, "ADJ", "adj"): - univ_pos = "adj" - elif univ_pos in (PUNCT, "PUNCT", "punct"): - univ_pos = "punct" - elif univ_pos in (PROPN, "PROPN"): - return [string] - else: + if isinstance(univ_pos, int): + univ_pos = UPOS_NAMES.get(univ_pos, "X") + univ_pos = univ_pos.lower() + + if univ_pos in ("", "eol", "space"): return [string.lower()] # See Issue #435 for example of where this logic is requied. 
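The `Lemmatizer.__call__` rework above replaces the hand-written POS `elif` chain with a single normalization step through `parts_of_speech.NAMES`, so integer POS IDs and string tags behave identically; lookup-only pipelines also short-circuit straight to the lookup table before any rule machinery runs. A small sketch with a toy one-entry table (the NAMES-based mapping only comes into play once index/exception/rule tables are present):

```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
from spacy.symbols import VERB

lookups = Lookups()
lookups.add_table("lemma_lookup", {"striped": "stripe"})
lemmatizer = Lemmatizer(lookups)

# without a "lemma_rules" table, both calls take the lookup short-circuit,
# and the integer symbol gives the same result as the string tag
assert lemmatizer("striped", "VERB") == ["stripe"]
assert lemmatizer("striped", VERB) == ["stripe"]
```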
if self.is_base_form(univ_pos, morphology): @@ -56,6 +51,11 @@ class Lemmatizer(object): index_table = self.lookups.get_table("lemma_index", {}) exc_table = self.lookups.get_table("lemma_exc", {}) rules_table = self.lookups.get_table("lemma_rules", {}) + if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))): + if univ_pos == "propn": + return [string] + else: + return [string.lower()] lemmas = self.lemmatize( string, index_table.get(univ_pos, {}), @@ -93,8 +93,6 @@ class Lemmatizer(object): return True elif morphology.get("VerbForm") == "none": return True - elif morphology.get("VerbForm") == "inf": - return True elif morphology.get("Degree") == "pos": return True else: diff --git a/spacy/lexeme.pxd b/spacy/lexeme.pxd index e73f1e700..c99b6912a 100644 --- a/spacy/lexeme.pxd +++ b/spacy/lexeme.pxd @@ -2,13 +2,15 @@ from numpy cimport ndarray from .typedefs cimport attr_t, hash_t, flags_t, len_t, tag_t from .attrs cimport attr_id_t -from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, CLUSTER, LANG -from .structs cimport LexemeC, SerializedLexemeC +from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, LANG + +from .structs cimport LexemeC from .strings cimport StringStore from .vocab cimport Vocab cdef LexemeC EMPTY_LEXEME +cdef attr_t OOV_RANK cdef class Lexeme: cdef LexemeC* c @@ -22,22 +24,6 @@ cdef class Lexeme: self.vocab = vocab self.orth = lex.orth - @staticmethod - cdef inline SerializedLexemeC c_to_bytes(const LexemeC* lex) nogil: - cdef SerializedLexemeC lex_data - buff = &lex.flags - end = &lex.sentiment + sizeof(lex.sentiment) - for i in range(sizeof(lex_data.data)): - lex_data.data[i] = buff[i] - return lex_data - - @staticmethod - cdef inline void c_from_bytes(LexemeC* lex, SerializedLexemeC lex_data) nogil: - buff = &lex.flags - end = &lex.sentiment + sizeof(lex.sentiment) - for i in range(sizeof(lex_data.data)): - buff[i] = lex_data.data[i] - @staticmethod cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil: if name < (sizeof(flags_t) * 8): @@ -54,8 +40,6 @@ cdef class Lexeme: lex.prefix = value elif name == SUFFIX: lex.suffix = value - elif name == CLUSTER: - lex.cluster = value elif name == LANG: lex.lang = value @@ -82,8 +66,6 @@ cdef class Lexeme: return lex.suffix elif feat_name == LENGTH: return lex.length - elif feat_name == CLUSTER: - return lex.cluster elif feat_name == LANG: return lex.lang else: diff --git a/spacy/lexeme.pyx b/spacy/lexeme.pyx index 911112d50..fc3b30a6d 100644 --- a/spacy/lexeme.pyx +++ b/spacy/lexeme.pyx @@ -9,17 +9,20 @@ import numpy from thinc.api import get_array_module import warnings +from libc.stdint cimport UINT64_MAX from .typedefs cimport attr_t, flags_t from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT -from .attrs cimport IS_CURRENCY, IS_OOV, PROB +from .attrs cimport IS_CURRENCY from .attrs import intify_attrs from .errors import Errors, Warnings +OOV_RANK = UINT64_MAX memset(&EMPTY_LEXEME, 0, sizeof(LexemeC)) +EMPTY_LEXEME.id = OOV_RANK cdef class Lexeme: @@ -83,12 +86,11 @@ cdef class Lexeme: cdef attr_id_t attr attrs = intify_attrs(attrs) for attr, value in attrs.items(): - if attr == PROB: - self.c.prob = value - elif attr == CLUSTER: - self.c.cluster = int(value) - elif isinstance(value, int) or isinstance(value, long): - 
Lexeme.set_struct_attr(self.c, attr, value) + # skip PROB, e.g. from lexemes.jsonl + if isinstance(value, float): + continue + elif isinstance(value, (int, long)): + Lexeme.set_struct_attr(self.c, attr, value) else: Lexeme.set_struct_attr(self.c, attr, self.vocab.strings.add(value)) @@ -131,34 +133,6 @@ cdef class Lexeme: xp = get_array_module(vector) return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)) - def to_bytes(self): - lex_data = Lexeme.c_to_bytes(self.c) - start = &self.c.flags - end = &self.c.sentiment + sizeof(self.c.sentiment) - if (end-start) != sizeof(lex_data.data): - raise ValueError(Errors.E072.format(length=end-start, - bad_length=sizeof(lex_data.data))) - byte_string = b"\0" * sizeof(lex_data.data) - byte_chars = byte_string - for i in range(sizeof(lex_data.data)): - byte_chars[i] = lex_data.data[i] - if len(byte_string) != sizeof(lex_data.data): - raise ValueError(Errors.E072.format(length=len(byte_string), - bad_length=sizeof(lex_data.data))) - return byte_string - - def from_bytes(self, bytes byte_string): - # This method doesn't really have a use-case --- wrote it for testing. - # Possibly delete? It puts the Lexeme out of synch with the vocab. - cdef SerializedLexemeC lex_data - if len(byte_string) != sizeof(lex_data.data): - raise ValueError(Errors.E072.format(length=len(byte_string), - bad_length=sizeof(lex_data.data))) - for i in range(len(byte_string)): - lex_data.data[i] = byte_string[i] - Lexeme.c_from_bytes(self.c, lex_data) - self.orth = self.c.orth - @property def has_vector(self): """RETURNS (bool): Whether a word vector is associated with the object. @@ -202,10 +176,14 @@ cdef class Lexeme: """RETURNS (float): A scalar value indicating the positivity or negativity of the lexeme.""" def __get__(self): - return self.c.sentiment + sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment", {}) + return sentiment_table.get(self.c.orth, 0.0) - def __set__(self, float sentiment): - self.c.sentiment = sentiment + def __set__(self, float x): + if "lexeme_sentiment" not in self.vocab.lookups: + self.vocab.lookups.add_table("lexeme_sentiment") + sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment") + sentiment_table[self.c.orth] = x @property def orth_(self): @@ -232,9 +210,13 @@ cdef class Lexeme: lexeme text. 
""" def __get__(self): - return self.c.norm + return self.c.norm def __set__(self, attr_t x): + if "lexeme_norm" not in self.vocab.lookups: + self.vocab.lookups.add_table("lexeme_norm") + norm_table = self.vocab.lookups.get_table("lexeme_norm") + norm_table[self.c.orth] = self.vocab.strings[x] self.c.norm = x property shape: @@ -270,10 +252,12 @@ cdef class Lexeme: property cluster: """RETURNS (int): Brown cluster ID.""" def __get__(self): - return self.c.cluster + cluster_table = self.vocab.load_extra_lookups("lexeme_cluster") + return cluster_table.get(self.c.orth, 0) - def __set__(self, attr_t x): - self.c.cluster = x + def __set__(self, int x): + cluster_table = self.vocab.load_extra_lookups("lexeme_cluster") + cluster_table[self.c.orth] = x property lang: """RETURNS (uint64): Language of the parent vocabulary.""" @@ -287,10 +271,14 @@ cdef class Lexeme: """RETURNS (float): Smoothed log probability estimate of the lexeme's type.""" def __get__(self): - return self.c.prob + prob_table = self.vocab.load_extra_lookups("lexeme_prob") + settings_table = self.vocab.load_extra_lookups("lexeme_settings") + default_oov_prob = settings_table.get("oov_prob", -20.0) + return prob_table.get(self.c.orth, default_oov_prob) def __set__(self, float x): - self.c.prob = x + prob_table = self.vocab.load_extra_lookups("lexeme_prob") + prob_table[self.c.orth] = x property lower_: """RETURNS (str): Lowercase form of the word.""" @@ -308,7 +296,7 @@ cdef class Lexeme: return self.vocab.strings[self.c.norm] def __set__(self, unicode x): - self.c.norm = self.vocab.strings.add(x) + self.norm = self.vocab.strings.add(x) property shape_: """RETURNS (str): Transform of the word's string, to show @@ -356,13 +344,10 @@ cdef class Lexeme: def __set__(self, flags_t x): self.c.flags = x - property is_oov: + @property + def is_oov(self): """RETURNS (bool): Whether the lexeme is out-of-vocabulary.""" - def __get__(self): - return Lexeme.c_check_flag(self.c, IS_OOV) - - def __set__(self, attr_t x): - Lexeme.c_set_flag(self.c, IS_OOV, x) + return self.orth in self.vocab.vectors property is_stop: """RETURNS (bool): Whether the lexeme is a stop word.""" diff --git a/spacy/lookups.py b/spacy/lookups.py index 5661897e1..d6aa5f9a0 100644 --- a/spacy/lookups.py +++ b/spacy/lookups.py @@ -121,7 +121,7 @@ class Lookups(object): self._tables[key].update(value) return self - def to_disk(self, path, **kwargs): + def to_disk(self, path, filename="lookups.bin", **kwargs): """Save the lookups to a directory as lookups.bin. Expects a path to a directory, which will be created if it doesn't exist. @@ -133,11 +133,11 @@ class Lookups(object): path = ensure_path(path) if not path.exists(): path.mkdir() - filepath = path / "lookups.bin" + filepath = path / filename with filepath.open("wb") as file_: file_.write(self.to_bytes()) - def from_disk(self, path, **kwargs): + def from_disk(self, path, filename="lookups.bin", **kwargs): """Load lookups from a directory containing a lookups.bin. Will skip loading if the file doesn't exist. 
@@ -147,7 +147,7 @@ class Lookups(object): DOCS: https://spacy.io/api/lookups#from_disk """ path = ensure_path(path) - filepath = path / "lookups.bin" + filepath = path / filename if filepath.exists(): with filepath.open("rb") as file_: data = file_.read() diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index 868465b8d..158730e60 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -11,7 +11,8 @@ import warnings from ..typedefs cimport attr_t from ..structs cimport TokenC from ..vocab cimport Vocab -from ..tokens.doc cimport Doc, get_token_attr +from ..tokens.doc cimport Doc, get_token_attr_for_matcher +from ..tokens.span cimport Span from ..tokens.token cimport Token from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA @@ -205,22 +206,29 @@ cdef class Matcher: else: yield doc - def __call__(self, Doc doc): + def __call__(self, object doclike): """Find all token sequences matching the supplied pattern. - doc (Doc): The document to match over. + doclike (Doc or Span): The document to match over. RETURNS (list): A list of `(key, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `label_id` and `key` are both integers. """ + if isinstance(doclike, Doc): + doc = doclike + length = len(doc) + elif isinstance(doclike, Span): + doc = doclike.doc + length = doclike.end - doclike.start + else: + raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__)) if len(set([LEMMA, POS, TAG]) & self._seen_attrs) > 0 \ and not doc.is_tagged: raise ValueError(Errors.E155.format()) if DEP in self._seen_attrs and not doc.is_parsed: raise ValueError(Errors.E156.format()) - matches = find_matches(&self.patterns[0], self.patterns.size(), doc, - extensions=self._extensions, - predicates=self._extra_predicates) + matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length, + extensions=self._extensions, predicates=self._extra_predicates) for i, (key, start, end) in enumerate(matches): on_match = self._callbacks.get(key, None) if on_match is not None: @@ -242,9 +250,7 @@ def unpickle_matcher(vocab, patterns, callbacks): return matcher - -cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None, - predicates=tuple()): +cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()): """Find matches in a doc, with a compiled array of patterns. Matches are returned as a list of (id, start, end) tuples. 
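With __call__ and find_matches now parameterized on doclike, the matcher accepts a Span as well as a Doc, mirroring the new test_matcher_span test further below. A minimal usage sketch (vocabulary, pattern and sample text are illustrative):

    from spacy.matcher import Matcher
    from spacy.tokens import Doc
    from spacy.vocab import Vocab

    vocab = Vocab()
    matcher = Matcher(vocab)
    matcher.add("JAVA", [[{"ORTH": "Java"}]])
    doc = Doc(vocab, words="JavaScript is good but Java is better".split())
    assert len(matcher(doc)) == 1      # a full Doc works as before
    assert len(matcher(doc[4:])) == 1  # a Span is now accepted directly
    assert len(matcher(doc[:3])) == 0  # "Java" lies outside this Span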
@@ -262,18 +268,18 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None, cdef int i, j, nr_extra_attr cdef Pool mem = Pool() output = [] - if doc.length == 0: + if length == 0: # avoid any processing or mem alloc if the document is empty return output if len(predicates) > 0: - predicate_cache = mem.alloc(doc.length * len(predicates), sizeof(char)) + predicate_cache = mem.alloc(length * len(predicates), sizeof(char)) if extensions is not None and len(extensions) >= 1: nr_extra_attr = max(extensions.values()) + 1 - extra_attr_values = mem.alloc(doc.length * nr_extra_attr, sizeof(attr_t)) + extra_attr_values = mem.alloc(length * nr_extra_attr, sizeof(attr_t)) else: nr_extra_attr = 0 - extra_attr_values = mem.alloc(doc.length, sizeof(attr_t)) - for i, token in enumerate(doc): + extra_attr_values = mem.alloc(length, sizeof(attr_t)) + for i, token in enumerate(doclike): for name, index in extensions.items(): value = token._.get(name) if isinstance(value, basestring): @@ -281,11 +287,11 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None, extra_attr_values[i * nr_extra_attr + index] = value # Main loop cdef int nr_predicate = len(predicates) - for i in range(doc.length): + for i in range(length): for j in range(n): states.push_back(PatternStateC(patterns[j], i, 0)) transition_states(states, matches, predicate_cache, - doc[i], extra_attr_values, predicates) + doclike[i], extra_attr_values, predicates) extra_attr_values += nr_extra_attr predicate_cache += len(predicates) # Handle matches that end in 0-width patterns @@ -536,7 +542,7 @@ cdef char get_is_match(PatternStateC state, spec = state.pattern if spec.nr_attr > 0: for attr in spec.attrs[:spec.nr_attr]: - if get_token_attr(token, attr.attr) != attr.value: + if get_token_attr_for_matcher(token, attr.attr) != attr.value: return 0 for i in range(spec.nr_extra_attr): if spec.extra_attrs[i].value != extra_attrs[spec.extra_attrs[i].index]: @@ -705,7 +711,7 @@ class _RegexPredicate(object): if self.is_extension: value = token._.get(self.attr) else: - value = token.vocab.strings[get_token_attr(token.c, self.attr)] + value = token.vocab.strings[get_token_attr_for_matcher(token.c, self.attr)] return bool(self.value.search(value)) @@ -726,7 +732,7 @@ class _SetMemberPredicate(object): if self.is_extension: value = get_string_id(token._.get(self.attr)) else: - value = get_token_attr(token.c, self.attr) + value = get_token_attr_for_matcher(token.c, self.attr) if self.predicate == "IN": return value in self.value else: @@ -753,7 +759,7 @@ class _ComparisonPredicate(object): if self.is_extension: value = token._.get(self.attr) else: - value = get_token_attr(token.c, self.attr) + value = get_token_attr_for_matcher(token.c, self.attr) if self.predicate == "==": return value == self.value if self.predicate == "!=": @@ -774,6 +780,7 @@ def _get_extra_predicates(spec, extra_predicates): "IN": _SetMemberPredicate, "NOT_IN": _SetMemberPredicate, "==": _ComparisonPredicate, + "!=": _ComparisonPredicate, ">=": _ComparisonPredicate, "<=": _ComparisonPredicate, ">": _ComparisonPredicate, diff --git a/spacy/morphology.pyx b/spacy/morphology.pyx index 5dcf81ea7..3e369fb3e 100644 --- a/spacy/morphology.pyx +++ b/spacy/morphology.pyx @@ -42,7 +42,7 @@ def _normalize_props(props): elif isinstance(key, (int, str)) and isinstance(value, (int, str)): out[key] = value else: - warnings.warn(Warnings.W029.format(feature={key: value})) + warnings.warn(Warnings.W100.format(feature={key: value})) return out @@ -112,7 +112,7 @@ 
cdef class Morphology: return tag_ptr.key features = self.feats_to_dict(features) if not isinstance(features, dict): - warnings.warn(Warnings.W029.format(feature=features)) + warnings.warn(Warnings.W100.format(feature=features)) features = {} features = _normalize_props(features) string_features = {self.strings.as_string(field): self.strings.as_string(values) for field, values in features.items()} diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx index f75ed1659..42110efb0 100644 --- a/spacy/pipeline/pipes.pyx +++ b/spacy/pipeline/pipes.pyx @@ -1537,7 +1537,8 @@ class Sentencizer(Pipe): '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', - '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈'] + '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', + '。', '。'] def __init__(self, punct_chars=None, **kwargs): """Initialize the sentencizer. diff --git a/spacy/schemas.py b/spacy/schemas.py index 3b6313db8..3024326dd 100644 --- a/spacy/schemas.py +++ b/spacy/schemas.py @@ -62,6 +62,7 @@ class TokenPatternNumber(BaseModel): IN: Optional[List[StrictInt]] = None NOT_IN: Optional[List[StrictInt]] = None EQ: Union[StrictInt, StrictFloat] = Field(None, alias="==") + NEQ: Union[StrictInt, StrictFloat] = Field(None, alias="!=") GEQ: Union[StrictInt, StrictFloat] = Field(None, alias=">=") LEQ: Union[StrictInt, StrictFloat] = Field(None, alias="<=") GT: Union[StrictInt, StrictFloat] = Field(None, alias=">") diff --git a/spacy/structs.pxd b/spacy/structs.pxd index f140a4220..a01244d7e 100644 --- a/spacy/structs.pxd +++ b/spacy/structs.pxd @@ -21,29 +21,6 @@ cdef struct LexemeC: attr_t prefix attr_t suffix - attr_t cluster - - float prob - float sentiment - - -cdef struct SerializedLexemeC: - unsigned char[8 + 8*10 + 4 + 4] data - # sizeof(flags_t) # flags - # + sizeof(attr_t) # lang - # + sizeof(attr_t) # id - # + sizeof(attr_t) # length - # + sizeof(attr_t) # orth - # + sizeof(attr_t) # lower - # + sizeof(attr_t) # norm - # + sizeof(attr_t) # shape - # + sizeof(attr_t) # prefix - # + sizeof(attr_t) # suffix - # + sizeof(attr_t) # cluster - # + sizeof(float) # prob - # + sizeof(float) # cluster - # + sizeof(float) # l2_norm - cdef struct SpanC: hash_t id @@ -82,6 +59,50 @@ cdef struct TokenC: cdef struct MorphAnalysisC: hash_t key int length + + attr_t abbr + attr_t adp_type + attr_t adv_type + attr_t animacy + attr_t aspect + attr_t case + attr_t conj_type + attr_t connegative + attr_t definite + attr_t degree + attr_t derivation + attr_t echo + attr_t foreign + attr_t gender + attr_t hyph + attr_t inf_form + attr_t mood + attr_t negative + attr_t number + attr_t name_type + attr_t noun_type + attr_t num_form + attr_t num_type + attr_t num_value + attr_t part_form + attr_t part_type + attr_t person + attr_t polite + attr_t polarity + attr_t poss + attr_t prefix + attr_t prep_case + attr_t pron_type + attr_t punct_side + attr_t punct_type + attr_t reflex + attr_t style + attr_t style_variant + attr_t tense + attr_t typo + attr_t verb_form + attr_t voice + attr_t verb_type attr_t* fields attr_t* features diff --git a/spacy/symbols.pxd b/spacy/symbols.pxd index 627827ddd..e516f3ed9 100644 --- a/spacy/symbols.pxd +++ b/spacy/symbols.pxd @@ -12,7 +12,7 @@ cdef enum symbol_t: LIKE_NUM LIKE_EMAIL IS_STOP - IS_OOV + IS_OOV_DEPRECATED IS_BRACKET IS_QUOTE IS_LEFT_PUNCT @@ -465,4 +465,4 @@ cdef enum symbol_t: MORPH ENT_ID - IDX \ No 
newline at end of file + IDX diff --git a/spacy/symbols.pyx b/spacy/symbols.pyx index 452c767a0..28bbc9fc3 100644 --- a/spacy/symbols.pyx +++ b/spacy/symbols.pyx @@ -13,7 +13,7 @@ IDS = { "LIKE_NUM": LIKE_NUM, "LIKE_EMAIL": LIKE_EMAIL, "IS_STOP": IS_STOP, - "IS_OOV": IS_OOV, + "IS_OOV_DEPRECATED": IS_OOV_DEPRECATED, "IS_BRACKET": IS_BRACKET, "IS_QUOTE": IS_QUOTE, "IS_LEFT_PUNCT": IS_LEFT_PUNCT, diff --git a/spacy/syntax/nn_parser.pyx b/spacy/syntax/nn_parser.pyx index f8e819268..fcaff444e 100644 --- a/spacy/syntax/nn_parser.pyx +++ b/spacy/syntax/nn_parser.pyx @@ -526,7 +526,7 @@ cdef class Parser: oracle_actions = self.moves.get_oracle_sequence(doc, gold) n_moves.append(len(oracle_actions)) return good_states, good_golds, max(n_moves, default=0) * 2 - + def _init_gold_batch(self, whole_examples, min_length=5, max_length=500): """Make a square batch, of length equal to the shortest doc. A long doc will get multiple states. Let's say we have a doc of length 2*N, diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index 2ff001e0c..d75db26b6 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -85,6 +85,11 @@ def eu_tokenizer(): return get_lang_class("eu").Defaults.create_tokenizer() +@pytest.fixture(scope="session") +def fa_tokenizer(): + return get_lang_class("fa").Defaults.create_tokenizer() + + @pytest.fixture(scope="session") def fi_tokenizer(): return get_lang_class("fi").Defaults.create_tokenizer() @@ -100,6 +105,11 @@ def ga_tokenizer(): return get_lang_class("ga").Defaults.create_tokenizer() +@pytest.fixture(scope="session") +def gu_tokenizer(): + return get_lang_class("gu").Defaults.create_tokenizer() + + @pytest.fixture(scope="session") def he_tokenizer(): return get_lang_class("he").Defaults.create_tokenizer() @@ -147,6 +157,11 @@ def lt_tokenizer(): return get_lang_class("lt").Defaults.create_tokenizer() +@pytest.fixture(scope="session") +def ml_tokenizer(): + return get_lang_class("ml").Defaults.create_tokenizer() + + @pytest.fixture(scope="session") def nb_tokenizer(): return get_lang_class("nb").Defaults.create_tokenizer() @@ -228,6 +243,26 @@ def yo_tokenizer(): @pytest.fixture(scope="session") -def zh_tokenizer(): +def zh_tokenizer_char(): + return get_lang_class("zh").Defaults.create_tokenizer( + config={"use_jieba": False, "use_pkuseg": False} + ) + + +@pytest.fixture(scope="session") +def zh_tokenizer_jieba(): pytest.importorskip("jieba") return get_lang_class("zh").Defaults.create_tokenizer() + + +@pytest.fixture(scope="session") +def zh_tokenizer_pkuseg(): + pytest.importorskip("pkuseg") + return get_lang_class("zh").Defaults.create_tokenizer( + config={"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True} + ) + + +@pytest.fixture(scope="session") +def hy_tokenizer(): + return get_lang_class("hy").Defaults.create_tokenizer() diff --git a/spacy/tests/doc/test_creation.py b/spacy/tests/doc/test_creation.py index d986d160c..3ee833aa8 100644 --- a/spacy/tests/doc/test_creation.py +++ b/spacy/tests/doc/test_creation.py @@ -3,6 +3,7 @@ from spacy.vocab import Vocab from spacy.tokens import Doc from spacy.lemmatizer import Lemmatizer from spacy.lookups import Lookups +from spacy import util @pytest.fixture @@ -35,3 +36,47 @@ def test_lookup_lemmatization(vocab): assert doc[0].lemma_ == "dog" assert doc[1].text == "dogses" assert doc[1].lemma_ == "dogses" + + +def test_create_from_words_and_text(vocab): + # no whitespace in words + words = ["'", "dogs", "'", "run"] + text = " 'dogs'\n\nrun " + (words, spaces) = 
util.get_words_and_spaces(words, text) + doc = Doc(vocab, words=words, spaces=spaces) + assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "] + assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""] + assert doc.text == text + assert [t.text for t in doc if not t.text.isspace()] == [ + word for word in words if not word.isspace() + ] + + # partial whitespace in words + words = [" ", "'", "dogs", "'", "\n\n", "run", " "] + text = " 'dogs'\n\nrun " + (words, spaces) = util.get_words_and_spaces(words, text) + doc = Doc(vocab, words=words, spaces=spaces) + assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "] + assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""] + assert doc.text == text + assert [t.text for t in doc if not t.text.isspace()] == [ + word for word in words if not word.isspace() + ] + + # non-standard whitespace tokens + words = [" ", " ", "'", "dogs", "'", "\n\n", "run"] + text = " 'dogs'\n\nrun " + (words, spaces) = util.get_words_and_spaces(words, text) + doc = Doc(vocab, words=words, spaces=spaces) + assert [t.text for t in doc] == [" ", "'", "dogs", "'", "\n\n", "run", " "] + assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""] + assert doc.text == text + assert [t.text for t in doc if not t.text.isspace()] == [ + word for word in words if not word.isspace() + ] + + # mismatch between words and text + with pytest.raises(ValueError): + words = [" ", " ", "'", "dogs", "'", "\n\n", "run"] + text = " 'dogs'\n\nrun " + (words, spaces) = util.get_words_and_spaces(words + ["away"], text) diff --git a/spacy/tests/doc/test_token_api.py b/spacy/tests/doc/test_token_api.py index af227294b..be56c9b71 100644 --- a/spacy/tests/doc/test_token_api.py +++ b/spacy/tests/doc/test_token_api.py @@ -179,6 +179,15 @@ def test_is_sent_start(en_tokenizer): assert len(list(doc.sents)) == 2 +def test_is_sent_end(en_tokenizer): + doc = en_tokenizer("This is a sentence. This is another.") + assert doc[4].is_sent_end is None + doc[5].is_sent_start = True + assert doc[4].is_sent_end is True + doc.is_parsed = True + assert len(list(doc.sents)) == 2 + + def test_set_pos(): doc = Doc(Vocab(), words=["hello", "world"]) doc[0].pos_ = "NOUN" @@ -203,6 +212,13 @@ def test_token0_has_sent_start_true(): assert not doc.is_sentenced +def test_tokenlast_has_sent_end_true(): + doc = Doc(Vocab(), words=["hello", "world"]) + assert doc[0].is_sent_end is None + assert doc[1].is_sent_end is True + assert not doc.is_sentenced + + def test_token_api_conjuncts_chain(en_vocab): words = "The boy and the girl and the man went .".split() heads = [1, 7, -1, 1, -3, -1, 1, -3, 0, -1] diff --git a/spacy/tests/lang/da/test_exceptions.py b/spacy/tests/lang/da/test_exceptions.py index f7837e127..bd9f2710e 100644 --- a/spacy/tests/lang/da/test_exceptions.py +++ b/spacy/tests/lang/da/test_exceptions.py @@ -34,14 +34,6 @@ def test_da_tokenizer_handles_custom_base_exc(da_tokenizer): assert tokens[7].text == "." 
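The deletions that follow here, and in the de and lb test files below, track the spacy/language.py change above: norm exceptions are no longer hard-coded per language but layered through util.add_lookups over BASE_NORMS and the vocab's "lexeme_norm" lookups table, so a bare tokenizer without lookups data no longer applies them. A rough sketch of how that chaining behaves (the two dicts are hypothetical stand-ins for BASE_NORMS and a lexeme_norm table):

    from spacy.util import add_lookups

    # add_lookups(default, *tables) builds a getter that consults each table
    # in the order given and falls back to the default function otherwise
    base_norms = {"’": "'"}       # stand-in for BASE_NORMS
    lang_norms = {"daß": "dass"}  # stand-in for a lexeme_norm table
    norm = add_lookups(lambda string: string.lower(), base_norms, lang_norms)
    assert norm("’") == "'"
    assert norm("daß") == "dass"
    assert norm("Dogs") == "dogs"  # default getter fallback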
-@pytest.mark.parametrize( - "text,norm", [("akvarium", "akvarie"), ("bedstemoder", "bedstemor")] -) -def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm): - tokens = da_tokenizer(text) - assert tokens[0].norm_ == norm - - @pytest.mark.parametrize( "text,n_tokens", [ diff --git a/spacy/tests/lang/de/test_exceptions.py b/spacy/tests/lang/de/test_exceptions.py index a4614f6c4..a1bbaf58b 100644 --- a/spacy/tests/lang/de/test_exceptions.py +++ b/spacy/tests/lang/de/test_exceptions.py @@ -19,17 +19,3 @@ def test_de_tokenizer_handles_exc_in_text(de_tokenizer): assert len(tokens) == 6 assert tokens[2].text == "z.Zt." assert tokens[2].lemma_ == "zur Zeit" - - -@pytest.mark.parametrize( - "text,norms", [("vor'm", ["vor", "dem"]), ("du's", ["du", "es"])] -) -def test_de_tokenizer_norm_exceptions(de_tokenizer, text, norms): - tokens = de_tokenizer(text) - assert [token.norm_ for token in tokens] == norms - - -@pytest.mark.parametrize("text,norm", [("daß", "dass")]) -def test_de_lex_attrs_norm_exceptions(de_tokenizer, text, norm): - tokens = de_tokenizer(text) - assert tokens[0].norm_ == norm diff --git a/spacy/tests/lang/de/test_noun_chunks.py b/spacy/tests/lang/de/test_noun_chunks.py new file mode 100644 index 000000000..8d76ddd79 --- /dev/null +++ b/spacy/tests/lang/de/test_noun_chunks.py @@ -0,0 +1,16 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +def test_noun_chunks_is_parsed_de(de_tokenizer): + """Test that noun_chunks raises ValueError for 'de' language if Doc is not parsed. + To check this test, we're constructing a Doc + with a new Vocab here and forcing is_parsed to 'False' + to make sure the noun chunks don't run. + """ + doc = de_tokenizer("Er lag auf seinem") + doc.is_parsed = False + with pytest.raises(ValueError): + list(doc.noun_chunks) diff --git a/spacy/tests/lang/el/test_noun_chunks.py b/spacy/tests/lang/el/test_noun_chunks.py new file mode 100644 index 000000000..4f24865d0 --- /dev/null +++ b/spacy/tests/lang/el/test_noun_chunks.py @@ -0,0 +1,16 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +def test_noun_chunks_is_parsed_el(el_tokenizer): + """Test that noun_chunks raises ValueError for 'el' language if Doc is not parsed. + To check this test, we're constructing a Doc + with a new Vocab here and forcing is_parsed to 'False' + to make sure the noun chunks don't run.
+ """ + doc = el_tokenizer("είναι χώρα της νοτιοανατολικής") + doc.is_parsed = False + with pytest.raises(ValueError): + list(doc.noun_chunks) diff --git a/spacy/tests/lang/en/test_exceptions.py b/spacy/tests/lang/en/test_exceptions.py index b2e941dab..ce0dac50b 100644 --- a/spacy/tests/lang/en/test_exceptions.py +++ b/spacy/tests/lang/en/test_exceptions.py @@ -115,6 +115,7 @@ def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms): assert [token.norm_ for token in tokens] == norms +@pytest.mark.skip @pytest.mark.parametrize( "text,norm", [("radicalised", "radicalized"), ("cuz", "because")] ) diff --git a/spacy/tests/lang/en/test_noun_chunks.py b/spacy/tests/lang/en/test_noun_chunks.py index 6739b5137..2d3362317 100644 --- a/spacy/tests/lang/en/test_noun_chunks.py +++ b/spacy/tests/lang/en/test_noun_chunks.py @@ -3,9 +3,24 @@ from spacy.attrs import HEAD, DEP from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS +import pytest + + from ...util import get_doc +def test_noun_chunks_is_parsed(en_tokenizer): + """Test that noun_chunks raises Value Error for 'en' language if Doc is not parsed. + To check this test, we're constructing a Doc + with a new Vocab here and forcing is_parsed to 'False' + to make sure the noun chunks don't run. + """ + doc = en_tokenizer("This is a sentence") + doc.is_parsed = False + with pytest.raises(ValueError): + list(doc.noun_chunks) + + def test_en_noun_chunks_not_nested(en_vocab): words = ["Peter", "has", "chronic", "command", "and", "control", "issues"] heads = [1, 0, 4, 3, -1, -2, -5] diff --git a/spacy/tests/lang/es/test_noun_chunks.py b/spacy/tests/lang/es/test_noun_chunks.py new file mode 100644 index 000000000..66bbd8c3a --- /dev/null +++ b/spacy/tests/lang/es/test_noun_chunks.py @@ -0,0 +1,16 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +def test_noun_chunks_is_parsed_es(es_tokenizer): + """Test that noun_chunks raises Value Error for 'es' language if Doc is not parsed. + To check this test, we're constructing a Doc + with a new Vocab here and forcing is_parsed to 'False' + to make sure the noun chunks don't run. 
+ """ + doc = es_tokenizer("en Oxford este verano") + doc.is_parsed = False + with pytest.raises(ValueError): + list(doc.noun_chunks) diff --git a/spacy/tests/lang/es/test_text.py b/spacy/tests/lang/es/test_text.py index af7b0212d..96f6bcab5 100644 --- a/spacy/tests/lang/es/test_text.py +++ b/spacy/tests/lang/es/test_text.py @@ -1,4 +1,5 @@ import pytest +from spacy.lang.es.lex_attrs import like_num def test_es_tokenizer_handles_long_text(es_tokenizer): @@ -30,3 +31,32 @@ en Montevideo y que pregona las bondades de la vida austera.""" def test_es_tokenizer_handles_cnts(es_tokenizer, text, length): tokens = es_tokenizer(text) assert len(tokens) == length + + +@pytest.mark.parametrize( + "text,match", + [ + ("10", True), + ("1", True), + ("10.000", True), + ("1000", True), + ("999,0", True), + ("uno", True), + ("dos", True), + ("billón", True), + ("veintiséis", True), + ("perro", False), + (",", False), + ("1/2", True), + ], +) +def test_lex_attrs_like_number(es_tokenizer, text, match): + tokens = es_tokenizer(text) + assert len(tokens) == 1 + assert tokens[0].like_num == match + + +@pytest.mark.parametrize("word", ["once"]) +def test_es_lex_attrs_capitals(word): + assert like_num(word) + assert like_num(word.upper()) diff --git a/spacy/tests/lang/eu/__init__.py b/spacy/tests/lang/eu/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/lang/fa/__init__.py b/spacy/tests/lang/fa/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/lang/fa/test_noun_chunks.py b/spacy/tests/lang/fa/test_noun_chunks.py new file mode 100644 index 000000000..a98aae061 --- /dev/null +++ b/spacy/tests/lang/fa/test_noun_chunks.py @@ -0,0 +1,17 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +def test_noun_chunks_is_parsed_fa(fa_tokenizer): + """Test that noun_chunks raises Value Error for 'fa' language if Doc is not parsed. + To check this test, we're constructing a Doc + with a new Vocab here and forcing is_parsed to 'False' + to make sure the noun chunks don't run. + """ + + doc = fa_tokenizer("این یک جمله نمونه می باشد.") + doc.is_parsed = False + with pytest.raises(ValueError): + list(doc.noun_chunks) diff --git a/spacy/tests/lang/fr/test_noun_chunks.py b/spacy/tests/lang/fr/test_noun_chunks.py new file mode 100644 index 000000000..ea93a5a35 --- /dev/null +++ b/spacy/tests/lang/fr/test_noun_chunks.py @@ -0,0 +1,16 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +def test_noun_chunks_is_parsed_fr(fr_tokenizer): + """Test that noun_chunks raises Value Error for 'fr' language if Doc is not parsed. + To check this test, we're constructing a Doc + with a new Vocab here and forcing is_parsed to 'False' + to make sure the noun chunks don't run. 
+ """ + doc = fr_tokenizer("trouver des travaux antérieurs") + doc.is_parsed = False + with pytest.raises(ValueError): + list(doc.noun_chunks) diff --git a/spacy/tests/lang/gu/__init__.py b/spacy/tests/lang/gu/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/lang/gu/test_text.py b/spacy/tests/lang/gu/test_text.py new file mode 100644 index 000000000..aa8d442a2 --- /dev/null +++ b/spacy/tests/lang/gu/test_text.py @@ -0,0 +1,19 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +def test_gu_tokenizer_handlers_long_text(gu_tokenizer): + text = """પશ્ચિમ ભારતમાં આવેલું ગુજરાત રાજ્ય જે વ્યક્તિઓની માતૃભૂમિ છે""" + tokens = gu_tokenizer(text) + assert len(tokens) == 9 + + +@pytest.mark.parametrize( + "text,length", + [("ગુજરાતીઓ ખાવાના શોખીન માનવામાં આવે છે", 6), ("ખેતરની ખેડ કરવામાં આવે છે.", 5)], +) +def test_gu_tokenizer_handles_cnts(gu_tokenizer, text, length): + tokens = gu_tokenizer(text) + assert len(tokens) == length diff --git a/spacy/tests/lang/hy/__init__.py b/spacy/tests/lang/hy/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/lang/hy/test_text.py b/spacy/tests/lang/hy/test_text.py new file mode 100644 index 000000000..cbdb77e4e --- /dev/null +++ b/spacy/tests/lang/hy/test_text.py @@ -0,0 +1,11 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest +from spacy.lang.hy.lex_attrs import like_num + + +@pytest.mark.parametrize("word", ["հիսուն"]) +def test_hy_lex_attrs_capitals(word): + assert like_num(word) + assert like_num(word.upper()) diff --git a/spacy/tests/lang/hy/test_tokenizer.py b/spacy/tests/lang/hy/test_tokenizer.py new file mode 100644 index 000000000..3eeb8b54e --- /dev/null +++ b/spacy/tests/lang/hy/test_tokenizer.py @@ -0,0 +1,48 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest + + +# TODO add test cases with valid punctuation signs. + +hy_tokenize_text_test = [ + ( + "Մետաղագիտությունը պայմանականորեն բաժանվում է տեսականի և կիրառականի (տեխնիկական)", + [ + "Մետաղագիտությունը", + "պայմանականորեն", + "բաժանվում", + "է", + "տեսականի", + "և", + "կիրառականի", + "(", + "տեխնիկական", + ")", + ], + ), + ( + "Գետաբերանը գտնվում է Օմոլոնա գետի ձախ ափից 726 կմ հեռավորության վրա", + [ + "Գետաբերանը", + "գտնվում", + "է", + "Օմոլոնա", + "գետի", + "ձախ", + "ափից", + "726", + "կմ", + "հեռավորության", + "վրա", + ], + ), +] + + +@pytest.mark.parametrize("text,expected_tokens", hy_tokenize_text_test) +def test_ga_tokenizer_handles_exception_cases(hy_tokenizer, text, expected_tokens): + tokens = hy_tokenizer(text) + token_list = [token.text for token in tokens if not token.is_space] + assert expected_tokens == token_list diff --git a/spacy/tests/lang/id/test_noun_chunks.py b/spacy/tests/lang/id/test_noun_chunks.py new file mode 100644 index 000000000..add76f9b9 --- /dev/null +++ b/spacy/tests/lang/id/test_noun_chunks.py @@ -0,0 +1,16 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +def test_noun_chunks_is_parsed_id(id_tokenizer): + """Test that noun_chunks raises Value Error for 'id' language if Doc is not parsed. + To check this test, we're constructing a Doc + with a new Vocab here and forcing is_parsed to 'False' + to make sure the noun chunks don't run. 
+ """ + doc = id_tokenizer("sebelas") + doc.is_parsed = False + with pytest.raises(ValueError): + list(doc.noun_chunks) diff --git a/spacy/tests/lang/lb/test_exceptions.py b/spacy/tests/lang/lb/test_exceptions.py index 5b5005ae7..d941a854b 100644 --- a/spacy/tests/lang/lb/test_exceptions.py +++ b/spacy/tests/lang/lb/test_exceptions.py @@ -19,9 +19,3 @@ def test_lb_tokenizer_handles_exc_in_text(lb_tokenizer): assert len(tokens) == 9 assert tokens[1].text == "'t" assert tokens[1].lemma_ == "et" - - -@pytest.mark.parametrize("text,norm", [("dass", "datt"), ("viläicht", "vläicht")]) -def test_lb_norm_exceptions(lb_tokenizer, text, norm): - tokens = lb_tokenizer(text) - assert tokens[0].norm_ == norm diff --git a/spacy/tests/lang/ml/__init__.py b/spacy/tests/lang/ml/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/lang/ml/test_text.py b/spacy/tests/lang/ml/test_text.py new file mode 100644 index 000000000..2883cf5bb --- /dev/null +++ b/spacy/tests/lang/ml/test_text.py @@ -0,0 +1,25 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +def test_ml_tokenizer_handles_long_text(ml_tokenizer): + text = """അനാവശ്യമായി കണ്ണിലും മൂക്കിലും വായിലും സ്പർശിക്കാതിരിക്കുക""" + tokens = ml_tokenizer(text) + assert len(tokens) == 5 + + +@pytest.mark.parametrize( + "text,length", + [ + ( + "എന്നാൽ അച്ചടിയുടെ ആവിർഭാവം ലിപിയിൽ കാര്യമായ മാറ്റങ്ങൾ വരുത്തിയത് കൂട്ടക്ഷരങ്ങളെ അണുഅക്ഷരങ്ങളായി പിരിച്ചുകൊണ്ടായിരുന്നു", + 10, + ), + ("പരമ്പരാഗതമായി മലയാളം ഇടത്തുനിന്ന് വലത്തോട്ടാണ് എഴുതുന്നത്", 5), + ], +) +def test_ml_tokenizer_handles_cnts(ml_tokenizer, text, length): + tokens = ml_tokenizer(text) + assert len(tokens) == length diff --git a/spacy/tests/lang/nb/test_noun_chunks.py b/spacy/tests/lang/nb/test_noun_chunks.py new file mode 100644 index 000000000..653491a64 --- /dev/null +++ b/spacy/tests/lang/nb/test_noun_chunks.py @@ -0,0 +1,16 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +def test_noun_chunks_is_parsed_nb(nb_tokenizer): + """Test that noun_chunks raises Value Error for 'nb' language if Doc is not parsed. + To check this test, we're constructing a Doc + with a new Vocab here and forcing is_parsed to 'False' + to make sure the noun chunks don't run. + """ + doc = nb_tokenizer("Smørsausen brukes bl.a. til") + doc.is_parsed = False + with pytest.raises(ValueError): + list(doc.noun_chunks) diff --git a/spacy/tests/lang/pl/test_tokenizer.py b/spacy/tests/lang/pl/test_tokenizer.py index a04b4fdcb..44b1be9a6 100644 --- a/spacy/tests/lang/pl/test_tokenizer.py +++ b/spacy/tests/lang/pl/test_tokenizer.py @@ -1,49 +1,15 @@ import pytest DOT_TESTS = [ - ("tel.", ["tel."]), - ("np.", ["np."]), - ("godz. 
21:37", ["godz.", "21:37"]), - ("inż.", ["inż."]), - ("gosp.-polit.", ["gosp.-polit."]), - ("ppoż", ["ppoż"]), - ("płn", ["płn"]), - ("ul.", ["ul."]), - ("jw.", ["jw."]), - ("itd.", ["itd."]), - ("cdn.", ["cdn."]), - ("itp.", ["itp."]), - ("10,- zł", ["10,-", "zł"]), + ("tel.", ["tel", "."]), ("0 zł 99 gr", ["0", "zł", "99", "gr"]), - ("0,99 rub.", ["0,99", "rub."]), - ("dol.", ["dol."]), - ("1000 m n.p.m.", ["1000", "m", "n.p.m."]), - ("m.in.", ["m.in."]), - ("p.n.e.", ["p.n.e."]), - ("Sz.P.", ["Sz.P."]), - ("p.o.", ["p.o."]), - ("k.o.", ["k.o."]), - ("m.st.", ["m.st."]), - ("dra.", ["dra", "."]), - ("pp.", ["pp."]), - ("oo.", ["oo."]), ] HYPHEN_TESTS = [ - ("5-fluoropentylo-3-pirydynyloindol", ["5-fluoropentylo-3-pirydynyloindol"]), - ("NESS-040C5", ["NESS-040C5"]), - ("JTE-7-31", ["JTE-7-31"]), - ("BAY-59-3074", ["BAY-59-3074"]), - ("BAY-38-7271", ["BAY-38-7271"]), - ("STS-135", ["STS-135"]), - ("5F-PB-22", ["5F-PB-22"]), ("cztero-", ["cztero-"]), ("jedno-", ["jedno-"]), ("dwu-", ["dwu-"]), ("trzy-", ["trzy-"]), - ("b-adoratorzy", ["b-adoratorzy"]), - ("2-3-4 drzewa", ["2-3-4", "drzewa"]), - ("b-drzewa", ["b-drzewa"]), ] diff --git a/spacy/tests/lang/sv/test_exceptions.py b/spacy/tests/lang/sv/test_exceptions.py index 5d3acf3d5..e6cae4d2b 100644 --- a/spacy/tests/lang/sv/test_exceptions.py +++ b/spacy/tests/lang/sv/test_exceptions.py @@ -44,15 +44,15 @@ def test_sv_tokenizer_handles_ambiguous_abbr(sv_tokenizer, text): def test_sv_tokenizer_handles_exc_in_text(sv_tokenizer): - text = "Det er bl.a. ikke meningen" + text = "Det är bl.a. inte meningen" tokens = sv_tokenizer(text) assert len(tokens) == 5 assert tokens[2].text == "bl.a." def test_sv_tokenizer_handles_custom_base_exc(sv_tokenizer): - text = "Her er noget du kan kigge i." + text = "Här är något du kan titta på." tokens = sv_tokenizer(text) assert len(tokens) == 8 - assert tokens[6].text == "i" + assert tokens[6].text == "på" assert tokens[7].text == "." diff --git a/spacy/tests/lang/sv/test_lex_attrs.py b/spacy/tests/lang/sv/test_lex_attrs.py new file mode 100644 index 000000000..abe6b0f7b --- /dev/null +++ b/spacy/tests/lang/sv/test_lex_attrs.py @@ -0,0 +1,33 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.lang.sv.lex_attrs import like_num + + +@pytest.mark.parametrize( + "text,match", + [ + ("10", True), + ("1", True), + ("10.000", True), + ("10.00", True), + ("999,0", True), + ("en", True), + ("två", True), + ("miljard", True), + ("hund", False), + (",", False), + ("1/2", True), + ], +) +def test_lex_attrs_like_number(sv_tokenizer, text, match): + tokens = sv_tokenizer(text) + assert len(tokens) == 1 + assert tokens[0].like_num == match + + +@pytest.mark.parametrize("word", ["elva"]) +def test_sv_lex_attrs_capitals(word): + assert like_num(word) + assert like_num(word.upper()) diff --git a/spacy/tests/lang/sv/test_noun_chunks.py b/spacy/tests/lang/sv/test_noun_chunks.py index ad335c317..f352ca648 100644 --- a/spacy/tests/lang/sv/test_noun_chunks.py +++ b/spacy/tests/lang/sv/test_noun_chunks.py @@ -1,7 +1,20 @@ import pytest + from ...util import get_doc +def test_noun_chunks_is_parsed_sv(sv_tokenizer): + """Test that noun_chunks raises Value Error for 'sv' language if Doc is not parsed. + To check this test, we're constructing a Doc + with a new Vocab here and forcing is_parsed to 'False' + to make sure the noun chunks don't run. 
+ """ + doc = sv_tokenizer("Studenten läste den bästa boken") + doc.is_parsed = False + with pytest.raises(ValueError): + list(doc.noun_chunks) + + SV_NP_TEST_EXAMPLES = [ ( "En student läste en bok", # A student read a book diff --git a/spacy/tests/lang/zh/test_serialize.py b/spacy/tests/lang/zh/test_serialize.py new file mode 100644 index 000000000..56f092ed8 --- /dev/null +++ b/spacy/tests/lang/zh/test_serialize.py @@ -0,0 +1,48 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.lang.zh import Chinese +from ...util import make_tempdir + + +def zh_tokenizer_serialize(zh_tokenizer): + tokenizer_bytes = zh_tokenizer.to_bytes() + nlp = Chinese(meta={"tokenizer": {"config": {"use_jieba": False}}}) + nlp.tokenizer.from_bytes(tokenizer_bytes) + assert tokenizer_bytes == nlp.tokenizer.to_bytes() + + with make_tempdir() as d: + file_path = d / "tokenizer" + zh_tokenizer.to_disk(file_path) + nlp = Chinese(meta={"tokenizer": {"config": {"use_jieba": False}}}) + nlp.tokenizer.from_disk(file_path) + assert tokenizer_bytes == nlp.tokenizer.to_bytes() + + +def test_zh_tokenizer_serialize_char(zh_tokenizer_char): + zh_tokenizer_serialize(zh_tokenizer_char) + + +def test_zh_tokenizer_serialize_jieba(zh_tokenizer_jieba): + zh_tokenizer_serialize(zh_tokenizer_jieba) + + +def test_zh_tokenizer_serialize_pkuseg(zh_tokenizer_pkuseg): + zh_tokenizer_serialize(zh_tokenizer_pkuseg) + + +@pytest.mark.slow +def test_zh_tokenizer_serialize_pkuseg_with_processors(zh_tokenizer_pkuseg): + nlp = Chinese( + meta={ + "tokenizer": { + "config": { + "use_jieba": False, + "use_pkuseg": True, + "pkuseg_model": "medicine", + } + } + } + ) + zh_tokenizer_serialize(nlp.tokenizer) diff --git a/spacy/tests/lang/zh/test_text.py b/spacy/tests/lang/zh/test_text.py index d9a65732e..148257329 100644 --- a/spacy/tests/lang/zh/test_text.py +++ b/spacy/tests/lang/zh/test_text.py @@ -15,7 +15,7 @@ import pytest (",", False), ], ) -def test_lex_attrs_like_number(zh_tokenizer, text, match): - tokens = zh_tokenizer(text) +def test_lex_attrs_like_number(zh_tokenizer_jieba, text, match): + tokens = zh_tokenizer_jieba(text) assert len(tokens) == 1 assert tokens[0].like_num == match diff --git a/spacy/tests/lang/zh/test_tokenizer.py b/spacy/tests/lang/zh/test_tokenizer.py index f71785337..7af8a7604 100644 --- a/spacy/tests/lang/zh/test_tokenizer.py +++ b/spacy/tests/lang/zh/test_tokenizer.py @@ -1,28 +1,59 @@ import pytest +from spacy.lang.zh import _get_pkuseg_trie_data # fmt: off -TOKENIZER_TESTS = [ - ("作为语言而言,为世界使用人数最多的语言,目前世界有五分之一人口做为母语。", +TEXTS = ("作为语言而言,为世界使用人数最多的语言,目前世界有五分之一人口做为母语。",) +JIEBA_TOKENIZER_TESTS = [ + (TEXTS[0], ['作为', '语言', '而言', ',', '为', '世界', '使用', '人', '数最多', '的', '语言', ',', '目前', '世界', '有', '五分之一', '人口', '做', '为', '母语', '。']), ] +PKUSEG_TOKENIZER_TESTS = [ + (TEXTS[0], + ['作为', '语言', '而言', ',', '为', '世界', '使用', '人数', '最多', + '的', '语言', ',', '目前', '世界', '有', '五分之一', '人口', '做为', + '母语', '。']), +] # fmt: on -@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS) -def test_zh_tokenizer(zh_tokenizer, text, expected_tokens): - zh_tokenizer.use_jieba = False - tokens = [token.text for token in zh_tokenizer(text)] +@pytest.mark.parametrize("text", TEXTS) +def test_zh_tokenizer_char(zh_tokenizer_char, text): + tokens = [token.text for token in zh_tokenizer_char(text)] assert tokens == list(text) - zh_tokenizer.use_jieba = True - tokens = [token.text for token in zh_tokenizer(text)] + +@pytest.mark.parametrize("text,expected_tokens", JIEBA_TOKENIZER_TESTS) +def 
test_zh_tokenizer_jieba(zh_tokenizer_jieba, text, expected_tokens): + tokens = [token.text for token in zh_tokenizer_jieba(text)] assert tokens == expected_tokens -def test_extra_spaces(zh_tokenizer): +@pytest.mark.parametrize("text,expected_tokens", PKUSEG_TOKENIZER_TESTS) +def test_zh_tokenizer_pkuseg(zh_tokenizer_pkuseg, text, expected_tokens): + tokens = [token.text for token in zh_tokenizer_pkuseg(text)] + assert tokens == expected_tokens + + +def test_zh_tokenizer_pkuseg_user_dict(zh_tokenizer_pkuseg): + user_dict = _get_pkuseg_trie_data(zh_tokenizer_pkuseg.pkuseg_seg.preprocesser.trie) + zh_tokenizer_pkuseg.pkuseg_update_user_dict(["nonsense_asdf"]) + updated_user_dict = _get_pkuseg_trie_data( + zh_tokenizer_pkuseg.pkuseg_seg.preprocesser.trie + ) + assert len(user_dict) == len(updated_user_dict) - 1 + + # reset user dict + zh_tokenizer_pkuseg.pkuseg_update_user_dict([], reset=True) + reset_user_dict = _get_pkuseg_trie_data( + zh_tokenizer_pkuseg.pkuseg_seg.preprocesser.trie + ) + assert len(reset_user_dict) == 0 + + +def test_extra_spaces(zh_tokenizer_char): # note: three spaces after "I" - tokens = zh_tokenizer("I like cheese.") + tokens = zh_tokenizer_char("I like cheese.") assert tokens[1].orth_ == " " diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index 7020b3e4f..98542e80f 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -3,7 +3,6 @@ import re from mock import Mock from spacy.matcher import Matcher, DependencyMatcher from spacy.tokens import Doc, Token - from ..doc.test_underscore import clean_underscore # noqa: F401 @@ -262,14 +261,25 @@ def test_matcher_regex_shape(en_vocab): assert len(matches) == 0 -def test_matcher_compare_length(en_vocab): +@pytest.mark.parametrize( + "cmp, bad", + [ + ("==", ["a", "aaa"]), + ("!=", ["aa"]), + (">=", ["a"]), + ("<=", ["aaa"]), + (">", ["a", "aa"]), + ("<", ["aa", "aaa"]), + ], +) +def test_matcher_compare_length(en_vocab, cmp, bad): matcher = Matcher(en_vocab) - pattern = [{"LENGTH": {">=": 2}}] + pattern = [{"LENGTH": {cmp: 2}}] matcher.add("LENGTH_COMPARE", [pattern]) doc = Doc(en_vocab, words=["a", "aa", "aaa"]) matches = matcher(doc) - assert len(matches) == 2 - doc = Doc(en_vocab, words=["a"]) + assert len(matches) == len(doc) - len(bad) + doc = Doc(en_vocab, words=bad) matches = matcher(doc) assert len(matches) == 0 @@ -456,3 +466,13 @@ def test_matcher_callback(en_vocab): doc = Doc(en_vocab, words=["This", "is", "a", "test", "."]) matches = matcher(doc) mock.assert_called_once_with(matcher, doc, 0, matches) + + +def test_matcher_span(matcher): + text = "JavaScript is good but Java is better" + doc = Doc(matcher.vocab, words=text.split()) + span_js = doc[:3] + span_java = doc[4:] + assert len(matcher(doc)) == 2 + assert len(matcher(span_js)) == 1 + assert len(matcher(span_java)) == 1 diff --git a/spacy/tests/parser/test_ner.py b/spacy/tests/parser/test_ner.py index 9656d3a10..8e41a16c0 100644 --- a/spacy/tests/parser/test_ner.py +++ b/spacy/tests/parser/test_ner.py @@ -1,17 +1,16 @@ import pytest - from spacy import util from spacy.lang.en import English from spacy.pipeline.defaults import default_ner - from spacy.pipeline import EntityRecognizer, EntityRuler from spacy.vocab import Vocab from spacy.syntax.ner import BiluoPushDown from spacy.gold import GoldParse - -from spacy.tests.util import make_tempdir from spacy.tokens import Doc +from ..util import make_tempdir + + TRAIN_DATA = [ ("Who is Shaka Khan?", {"entities": [(7, 17, 
"PERSON")]}), ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}), @@ -181,6 +180,27 @@ def test_accept_blocked_token(): assert ner2.moves.is_valid(state2, "U-") +def test_train_empty(): + """Test that training an empty text does not throw errors.""" + train_data = [ + ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}), + ("", {"entities": []}), + ] + + nlp = English() + ner = nlp.create_pipe("ner") + ner.add_label("PERSON") + nlp.add_pipe(ner, last=True) + + nlp.begin_training() + for itn in range(2): + losses = {} + batches = util.minibatch(train_data) + for batch in batches: + texts, annotations = zip(*batch) + nlp.update(train_data, losses=losses) + + def test_overwrite_token(): nlp = English() ner1 = nlp.create_pipe("ner") diff --git a/spacy/tests/pipeline/test_sentencizer.py b/spacy/tests/pipeline/test_sentencizer.py index 0432b00e0..5c00b97ce 100644 --- a/spacy/tests/pipeline/test_sentencizer.py +++ b/spacy/tests/pipeline/test_sentencizer.py @@ -11,7 +11,9 @@ def test_sentencizer(en_vocab): doc = sentencizer(doc) assert doc.is_sentenced sent_starts = [t.is_sent_start for t in doc] + sent_ends = [t.is_sent_end for t in doc] assert sent_starts == [True, False, True, False, False, False, False] + assert sent_ends == [False, True, False, False, False, False, True] assert len(list(doc.sents)) == 2 @@ -49,13 +51,14 @@ def test_sentencizer_empty_docs(): @pytest.mark.parametrize( - "words,sent_starts,n_sents", + "words,sent_starts,sent_ends,n_sents", [ # The expected result here is that the duplicate punctuation gets merged # onto the same sentence and no one-token sentence is created for them. ( ["Hello", "!", ".", "Test", ".", ".", "ok"], [True, False, False, True, False, False, True], + [False, False, True, False, False, True, True], 3, ), # We also want to make sure ¡ and ¿ aren't treated as sentence end @@ -63,32 +66,36 @@ def test_sentencizer_empty_docs(): ( ["¡", "Buen", "día", "!", "Hola", ",", "¿", "qué", "tal", "?"], [True, False, False, False, True, False, False, False, False, False], + [False, False, False, True, False, False, False, False, False, True], 2, ), # The Token.is_punct check ensures that quotes are handled as well ( ['"', "Nice", "!", '"', "I", "am", "happy", "."], [True, False, False, False, True, False, False, False], + [False, False, False, True, False, False, False, True], 2, ), ], ) -def test_sentencizer_complex(en_vocab, words, sent_starts, n_sents): +def test_sentencizer_complex(en_vocab, words, sent_starts, sent_ends, n_sents): doc = Doc(en_vocab, words=words) sentencizer = Sentencizer() doc = sentencizer(doc) assert doc.is_sentenced assert [t.is_sent_start for t in doc] == sent_starts + assert [t.is_sent_end for t in doc] == sent_ends assert len(list(doc.sents)) == n_sents @pytest.mark.parametrize( - "punct_chars,words,sent_starts,n_sents", + "punct_chars,words,sent_starts,sent_ends,n_sents", [ ( ["~", "?"], ["Hello", "world", "~", "A", ".", "B", "."], [True, False, False, True, False, False, False], + [False, False, True, False, False, False, True], 2, ), # Even thought it's not common, the punct_chars should be able to @@ -97,16 +104,20 @@ def test_sentencizer_complex(en_vocab, words, sent_starts, n_sents): [".", "ö"], ["Hello", ".", "Test", "ö", "Ok", "."], [True, False, True, False, True, False], + [False, True, False, True, False, True], 3, ), ], ) -def test_sentencizer_custom_punct(en_vocab, punct_chars, words, sent_starts, n_sents): +def test_sentencizer_custom_punct( + en_vocab, punct_chars, words, sent_starts, 
sent_ends, n_sents +): doc = Doc(en_vocab, words=words) sentencizer = Sentencizer(punct_chars=punct_chars) doc = sentencizer(doc) assert doc.is_sentenced assert [t.is_sent_start for t in doc] == sent_starts + assert [t.is_sent_end for t in doc] == sent_ends assert len(list(doc.sents)) == n_sents diff --git a/spacy/tests/pipeline/test_senter.py b/spacy/tests/pipeline/test_senter.py index 197fdca6e..041da2c9f 100644 --- a/spacy/tests/pipeline/test_senter.py +++ b/spacy/tests/pipeline/test_senter.py @@ -12,14 +12,21 @@ def test_label_types(): with pytest.raises(NotImplementedError): nlp.get_pipe("senter").add_label("A") + SENT_STARTS = [0] * 14 SENT_STARTS[0] = 1 SENT_STARTS[5] = 1 SENT_STARTS[9] = 1 TRAIN_DATA = [ - ("I like green eggs. Eat blue ham. I like purple eggs.", {"sent_starts": SENT_STARTS}), - ("She likes purple eggs. They hate ham. You like yellow eggs.", {"sent_starts": SENT_STARTS}), + ( + "I like green eggs. Eat blue ham. I like purple eggs.", + {"sent_starts": SENT_STARTS}, + ), + ( + "She likes purple eggs. They hate ham. You like yellow eggs.", + {"sent_starts": SENT_STARTS}, + ), ] @@ -36,7 +43,7 @@ def test_overfitting_IO(): assert losses["senter"] < 0.001 # test the trained model - test_text = "I like purple eggs. They eat ham. You like yellow eggs." + test_text = TRAIN_DATA[0][0] doc = nlp(test_text) gold_sent_starts = [0] * 14 gold_sent_starts[0] = 1 diff --git a/spacy/tests/regression/test_issue3001-3500.py b/spacy/tests/regression/test_issue3001-3500.py index 240163d6e..9ff118a1f 100644 --- a/spacy/tests/regression/test_issue3001-3500.py +++ b/spacy/tests/regression/test_issue3001-3500.py @@ -227,7 +227,7 @@ def test_issue3410(): def test_issue3412(): data = numpy.asarray([[0, 0, 0], [1, 2, 3], [9, 8, 7]], dtype="f") - vectors = Vectors(data=data) + vectors = Vectors(data=data, keys=["A", "B", "C"]) keys, best_rows, scores = vectors.most_similar( numpy.asarray([[9, 8, 7], [0, 0, 0]], dtype="f") ) diff --git a/spacy/tests/regression/test_issue5137.py b/spacy/tests/regression/test_issue5137.py new file mode 100644 index 000000000..e9fd268c8 --- /dev/null +++ b/spacy/tests/regression/test_issue5137.py @@ -0,0 +1,34 @@ +import spacy +from spacy.language import Language +from spacy.lang.en import English +from spacy.tests.util import make_tempdir + + +def test_issue5137(): + class MyComponent(object): + name = "my_component" + + def __init__(self, nlp, **cfg): + self.nlp = nlp + self.categories = cfg.get("categories", "all_categories") + + def __call__(self, doc): + pass + + def to_disk(self, path, **kwargs): + pass + + def from_disk(self, path, **cfg): + pass + + factory = lambda nlp, model, **cfg: MyComponent(nlp, **cfg) + Language.factories["my_component"] = factory + + nlp = English() + nlp.add_pipe(nlp.create_pipe("my_component")) + assert nlp.get_pipe("my_component").categories == "all_categories" + + with make_tempdir() as tmpdir: + nlp.to_disk(tmpdir) + nlp2 = spacy.load(tmpdir, categories="my_categories") + assert nlp2.get_pipe("my_component").categories == "my_categories" diff --git a/spacy/tests/serialize/test_serialize_vocab_strings.py b/spacy/tests/serialize/test_serialize_vocab_strings.py index 359a0657f..d3e82296e 100644 --- a/spacy/tests/serialize/test_serialize_vocab_strings.py +++ b/spacy/tests/serialize/test_serialize_vocab_strings.py @@ -1,4 +1,5 @@ import pytest +import pickle from spacy.vocab import Vocab from spacy.strings import StringStore @@ -7,6 +8,7 @@ from ..util import make_tempdir test_strings = [([], []), (["rats", "are", "cute"], ["i", 
"like", "rats"])] test_strings_attrs = [(["rats", "are", "cute"], "Hello")] +default_strings = ("_SP", "POS=SPACE") @pytest.mark.xfail @@ -33,8 +35,8 @@ def test_serialize_vocab_roundtrip_bytes(strings1, strings2): assert vocab1.to_bytes() == vocab1_b new_vocab1 = Vocab().from_bytes(vocab1_b) assert new_vocab1.to_bytes() == vocab1_b - assert len(new_vocab1) == len(strings1) - assert sorted([lex.text for lex in new_vocab1]) == sorted(strings1) + assert len(new_vocab1.strings) == len(strings1) + 2 # adds _SP and POS=SPACE + assert sorted([s for s in new_vocab1.strings]) == sorted(strings1 + list(default_strings)) @pytest.mark.parametrize("strings1,strings2", test_strings) @@ -48,12 +50,17 @@ def test_serialize_vocab_roundtrip_disk(strings1, strings2): vocab2.to_disk(file_path2) vocab1_d = Vocab().from_disk(file_path1) vocab2_d = Vocab().from_disk(file_path2) - assert list(vocab1_d) == list(vocab1) - assert list(vocab2_d) == list(vocab2) + # check strings rather than lexemes, which are only reloaded on demand + assert strings1 == [s for s in vocab1_d.strings if s not in default_strings] + assert strings2 == [s for s in vocab2_d.strings if s not in default_strings] if strings1 == strings2: - assert list(vocab1_d) == list(vocab2_d) + assert [s for s in vocab1_d.strings if s not in default_strings] == [ + s for s in vocab2_d.strings if s not in default_strings + ] else: - assert list(vocab1_d) != list(vocab2_d) + assert [s for s in vocab1_d.strings if s not in default_strings] != [ + s for s in vocab2_d.strings if s not in default_strings + ] @pytest.mark.parametrize("strings,lex_attr", test_strings_attrs) @@ -71,9 +78,8 @@ def test_serialize_vocab_lex_attrs_bytes(strings, lex_attr): def test_deserialize_vocab_seen_entries(strings, lex_attr): # Reported in #2153 vocab = Vocab(strings=strings) - length = len(vocab) vocab.from_bytes(vocab.to_bytes()) - assert len(vocab) == length + assert len(vocab.strings) == len(strings) + 2 # adds _SP and POS=SPACE @pytest.mark.parametrize("strings,lex_attr", test_strings_attrs) @@ -124,3 +130,12 @@ def test_serialize_stringstore_roundtrip_disk(strings1, strings2): assert list(sstore1_d) == list(sstore2_d) else: assert list(sstore1_d) != list(sstore2_d) + + +@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs) +def test_pickle_vocab(strings, lex_attr): + vocab = Vocab(strings=strings) + vocab[strings[0]].norm_ = lex_attr + vocab_pickled = pickle.dumps(vocab) + vocab_unpickled = pickle.loads(vocab_pickled) + assert vocab.to_bytes() == vocab_unpickled.to_bytes() diff --git a/spacy/tests/test_errors.py b/spacy/tests/test_errors.py new file mode 100644 index 000000000..1bd4eec7f --- /dev/null +++ b/spacy/tests/test_errors.py @@ -0,0 +1,17 @@ +from inspect import isclass + +import pytest + +from spacy.errors import add_codes + + +@add_codes +class Errors(object): + E001 = "error description" + + +def test_add_codes(): + assert Errors.E001 == "[E001] error description" + with pytest.raises(AttributeError): + Errors.E002 + assert isclass(Errors.__class__) diff --git a/spacy/tests/test_gold.py b/spacy/tests/test_gold.py index 0754fb5bc..982c0d910 100644 --- a/spacy/tests/test_gold.py +++ b/spacy/tests/test_gold.py @@ -5,11 +5,12 @@ from spacy.gold import GoldCorpus, docs_to_json, Example, DocAnnotation from spacy.lang.en import English from spacy.syntax.nonproj import is_nonproj_tree from spacy.tokens import Doc -from spacy.util import compounding, minibatch -from .util import make_tempdir +from spacy.util import get_words_and_spaces, compounding, minibatch 
import pytest import srsly +from .util import make_tempdir + @pytest.fixture def doc(): @@ -137,10 +138,80 @@ def test_gold_biluo_misalign(en_vocab): spaces = [True, True, True, True, True, False] doc = Doc(en_vocab, words=words, spaces=spaces) entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] - tags = biluo_tags_from_offsets(doc, entities) + with pytest.warns(UserWarning): + tags = biluo_tags_from_offsets(doc, entities) assert tags == ["O", "O", "O", "-", "-", "-"] +def test_gold_biluo_different_tokenization(en_vocab, en_tokenizer): + # one-to-many + words = ["I", "flew to", "San Francisco Valley", "."] + spaces = [True, True, False, False] + doc = Doc(en_vocab, words=words, spaces=spaces) + entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] + gp = GoldParse( + doc, + words=["I", "flew", "to", "San", "Francisco", "Valley", "."], + entities=entities, + ) + assert gp.ner == ["O", "O", "U-LOC", "O"] + + # many-to-one + words = ["I", "flew", "to", "San", "Francisco", "Valley", "."] + spaces = [True, True, True, True, True, False, False] + doc = Doc(en_vocab, words=words, spaces=spaces) + entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] + gp = GoldParse( + doc, words=["I", "flew to", "San Francisco Valley", "."], entities=entities + ) + assert gp.ner == ["O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"] + + # misaligned + words = ["I flew", "to", "San Francisco", "Valley", "."] + spaces = [True, True, True, False, False] + doc = Doc(en_vocab, words=words, spaces=spaces) + entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] + gp = GoldParse( + doc, words=["I", "flew to", "San", "Francisco Valley", "."], entities=entities, + ) + assert gp.ner == ["O", "O", "B-LOC", "L-LOC", "O"] + + # additional whitespace tokens in GoldParse words + words, spaces = get_words_and_spaces( + ["I", "flew", "to", "San Francisco", "Valley", "."], + "I flew to San Francisco Valley.", + ) + doc = Doc(en_vocab, words=words, spaces=spaces) + entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] + gp = GoldParse( + doc, + words=["I", "flew", " ", "to", "San Francisco Valley", "."], + entities=entities, + ) + assert gp.ner == ["O", "O", "O", "O", "B-LOC", "L-LOC", "O"] + + # from issue #4791 + data = ( + "I'll return the ₹54 amount", + { + "words": ["I", "'ll", "return", "the", "₹", "54", "amount"], + "entities": [(16, 19, "MONEY")], + }, + ) + gp = GoldParse(en_tokenizer(data[0]), **data[1]) + assert gp.ner == ["O", "O", "O", "O", "U-MONEY", "O"] + + data = ( + "I'll return the $54 amount", + { + "words": ["I", "'ll", "return", "the", "$", "54", "amount"], + "entities": [(16, 19, "MONEY")], + }, + ) + gp = GoldParse(en_tokenizer(data[0]), **data[1]) + assert gp.ner == ["O", "O", "O", "O", "B-MONEY", "L-MONEY", "O"] + + def test_roundtrip_offsets_biluo_conversion(en_tokenizer): text = "I flew to Silicon Valley via London." 
biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] diff --git a/spacy/tests/test_lemmatizer.py b/spacy/tests/test_lemmatizer.py index c2534ca22..1779ff933 100644 --- a/spacy/tests/test_lemmatizer.py +++ b/spacy/tests/test_lemmatizer.py @@ -23,7 +23,7 @@ def test_lemmatizer_reflects_lookups_changes(): nlp_bytes = nlp.to_bytes() new_nlp.from_bytes(nlp_bytes) # Make sure we have the previously saved lookup table - assert len(new_nlp.vocab.lookups) == 1 + assert "lemma_lookup" in new_nlp.vocab.lookups assert len(new_nlp.vocab.lookups.get_table("lemma_lookup")) == 2 assert new_nlp.vocab.lookups.get_table("lemma_lookup")["hello"] == "world" assert Doc(new_nlp.vocab, words=["foo"])[0].lemma_ == "bar" diff --git a/spacy/tests/vocab_vectors/test_lexeme.py b/spacy/tests/vocab_vectors/test_lexeme.py index e033aa7c6..4288f427c 100644 --- a/spacy/tests/vocab_vectors/test_lexeme.py +++ b/spacy/tests/vocab_vectors/test_lexeme.py @@ -1,5 +1,7 @@ import pytest +import numpy from spacy.attrs import IS_ALPHA, IS_DIGIT +from spacy.util import OOV_RANK @pytest.mark.parametrize("text1,prob1,text2,prob2", [("NOUN", -1, "opera", -2)]) @@ -55,14 +57,8 @@ def test_vocab_lexeme_add_flag_provided_id(en_vocab): assert en_vocab["dogs"].check_flag(is_len4) is True -def test_lexeme_bytes_roundtrip(en_vocab): - one = en_vocab["one"] - alpha = en_vocab["alpha"] - assert one.orth != alpha.orth - assert one.lower != alpha.lower - alpha.from_bytes(one.to_bytes()) - - assert one.orth_ == alpha.orth_ - assert one.orth == alpha.orth - assert one.lower == alpha.lower - assert one.lower_ == alpha.lower_ +def test_vocab_lexeme_oov_rank(en_vocab): + """Test that default rank is OOV_RANK.""" + lex = en_vocab["word"] + assert OOV_RANK == numpy.iinfo(numpy.uint64).max + assert lex.rank == OOV_RANK diff --git a/spacy/tests/vocab_vectors/test_lookups.py b/spacy/tests/vocab_vectors/test_lookups.py index fff3d24ef..d8c7651e4 100644 --- a/spacy/tests/vocab_vectors/test_lookups.py +++ b/spacy/tests/vocab_vectors/test_lookups.py @@ -116,12 +116,11 @@ def test_lookups_to_from_bytes_via_vocab(): table_name = "test" vocab = Vocab() vocab.lookups.add_table(table_name, {"foo": "bar", "hello": "world"}) - assert len(vocab.lookups) == 1 assert table_name in vocab.lookups vocab_bytes = vocab.to_bytes() new_vocab = Vocab() new_vocab.from_bytes(vocab_bytes) - assert len(new_vocab.lookups) == 1 + assert len(new_vocab.lookups) == len(vocab.lookups) assert table_name in new_vocab.lookups table = new_vocab.lookups.get_table(table_name) assert len(table) == 2 @@ -134,13 +133,12 @@ def test_lookups_to_from_disk_via_vocab(): table_name = "test" vocab = Vocab() vocab.lookups.add_table(table_name, {"foo": "bar", "hello": "world"}) - assert len(vocab.lookups) == 1 assert table_name in vocab.lookups with make_tempdir() as tmpdir: vocab.to_disk(tmpdir) new_vocab = Vocab() new_vocab.from_disk(tmpdir) - assert len(new_vocab.lookups) == 1 + assert len(new_vocab.lookups) == len(vocab.lookups) assert table_name in new_vocab.lookups table = new_vocab.lookups.get_table(table_name) assert len(table) == 2 diff --git a/spacy/tests/vocab_vectors/test_vectors.py b/spacy/tests/vocab_vectors/test_vectors.py index 011cd16b1..cc95252a6 100644 --- a/spacy/tests/vocab_vectors/test_vectors.py +++ b/spacy/tests/vocab_vectors/test_vectors.py @@ -1,13 +1,13 @@ import pytest import numpy -from numpy.testing import assert_allclose +from numpy.testing import assert_allclose, assert_equal from spacy.vocab import Vocab from spacy.vectors import Vectors from spacy.tokenizer 
import Tokenizer from spacy.strings import hash_string from spacy.tokens import Doc -from ..util import add_vecs_to_vocab, get_cosine +from ..util import add_vecs_to_vocab, get_cosine, make_tempdir @pytest.fixture @@ -55,6 +55,11 @@ def most_similar_vectors_data(): ) +@pytest.fixture +def most_similar_vectors_keys(): + return ["a", "b", "c", "d"] + + @pytest.fixture def resize_data(): return numpy.asarray([[0.0, 1.0], [2.0, 3.0]], dtype="f") @@ -85,17 +90,28 @@ def test_init_vectors_with_resize_data(data, resize_data): assert v.shape != data.shape -def test_get_vector_resize(strings, data, resize_data): - v = Vectors(data=data) - v.resize(shape=resize_data.shape) +def test_get_vector_resize(strings, data): strings = [hash_string(s) for s in strings] + + # decrease vector dimension (truncate) + v = Vectors(data=data) + resized_dim = v.shape[1] - 1 + v.resize(shape=(v.shape[0], resized_dim)) for i, string in enumerate(strings): v.add(string, row=i) - assert list(v[strings[0]]) == list(resize_data[0]) - assert list(v[strings[0]]) != list(resize_data[1]) - assert list(v[strings[1]]) != list(resize_data[0]) - assert list(v[strings[1]]) == list(resize_data[1]) + assert list(v[strings[0]]) == list(data[0, :resized_dim]) + assert list(v[strings[1]]) == list(data[1, :resized_dim]) + + # increase vector dimension (pad with zeros) + v = Vectors(data=data) + resized_dim = v.shape[1] + 1 + v.resize(shape=(v.shape[0], resized_dim)) + for i, string in enumerate(strings): + v.add(string, row=i) + + assert list(v[strings[0]]) == list(data[0]) + [0] + assert list(v[strings[1]]) == list(data[1]) + [0] def test_init_vectors_with_data(strings, data): @@ -131,11 +147,14 @@ def test_set_vector(strings, data): assert list(v[strings[0]]) != list(orig[0]) -def test_vectors_most_similar(most_similar_vectors_data): - v = Vectors(data=most_similar_vectors_data) +def test_vectors_most_similar(most_similar_vectors_data, most_similar_vectors_keys): + v = Vectors(data=most_similar_vectors_data, keys=most_similar_vectors_keys) _, best_rows, _ = v.most_similar(v.data, batch_size=2, n=2, sort=True) assert all(row[0] == i for i, row in enumerate(best_rows)) + with pytest.raises(ValueError): + v.most_similar(v.data, batch_size=2, n=10, sort=True) + def test_vectors_most_similar_identical(): """Test that most similar identical vectors are assigned a score of 1.0.""" @@ -292,6 +311,9 @@ def test_vocab_add_vector(): dog = vocab["dog"] assert list(dog.vector) == [2.0, 2.0, 2.0] + with pytest.raises(ValueError): + vocab.vectors.add(vocab["hamster"].orth, row=1000000) + def test_vocab_prune_vectors(): vocab = Vocab(vectors_name="test_vocab_prune_vectors") @@ -311,3 +333,43 @@ def test_vocab_prune_vectors(): neighbour, similarity = list(remap.values())[0] assert neighbour == "cat", remap assert_allclose(similarity, get_cosine(data[0], data[2]), atol=1e-4, rtol=1e-3) + + +def test_vectors_serialize(): + data = numpy.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f") + v = Vectors(data=data, keys=["A", "B", "C"]) + b = v.to_bytes() + v_r = Vectors() + v_r.from_bytes(b) + assert_equal(v.data, v_r.data) + assert v.key2row == v_r.key2row + v.resize((5, 4)) + v_r.resize((5, 4)) + row = v.add("D", vector=numpy.asarray([1, 2, 3, 4], dtype="f")) + row_r = v_r.add("D", vector=numpy.asarray([1, 2, 3, 4], dtype="f")) + assert row == row_r + assert_equal(v.data, v_r.data) + assert v.is_full == v_r.is_full + with make_tempdir() as d: + v.to_disk(d) + v_r.from_disk(d) + assert_equal(v.data, v_r.data) + assert v.key2row == v_r.key2row + 
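# after the disk round-trip, resizing both instances and adding the same key again should keep them in lockstep, since from_disk re-syncs the set of unset rows +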
v.resize((5, 4)) + v_r.resize((5, 4)) + row = v.add("D", vector=numpy.asarray([10, 20, 30, 40], dtype="f")) + row_r = v_r.add("D", vector=numpy.asarray([10, 20, 30, 40], dtype="f")) + assert row == row_r + assert_equal(v.data, v_r.data) + + +def test_vector_is_oov(): + vocab = Vocab(vectors_name="test_vocab_is_oov") + data = numpy.ndarray((5, 3), dtype="f") + data[0] = 1.0 + data[1] = 2.0 + vocab.set_vector("cat", data[0]) + vocab.set_vector("dog", data[1]) + assert vocab["cat"].is_oov is False + assert vocab["dog"].is_oov is False + assert vocab["hamster"].is_oov is True diff --git a/spacy/tokens/doc.pxd b/spacy/tokens/doc.pxd index 050a6b898..42918ab6d 100644 --- a/spacy/tokens/doc.pxd +++ b/spacy/tokens/doc.pxd @@ -8,6 +8,7 @@ from ..attrs cimport attr_id_t cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil +cdef attr_t get_token_attr_for_matcher(const TokenC* token, attr_id_t feat_name) nogil ctypedef const LexemeC* const_Lexeme_ptr diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index 31c1e8c82..debab6aeb 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -74,6 +74,16 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil: return Lexeme.get_struct_attr(token.lex, feat_name) +cdef attr_t get_token_attr_for_matcher(const TokenC* token, attr_id_t feat_name) nogil: + if feat_name == SENT_START: + if token.sent_start == 1: + return True + else: + return False + else: + return get_token_attr(token, feat_name) + + def _get_chunker(lang): try: cls = util.get_lang_class(lang) @@ -582,8 +592,7 @@ cdef class Doc: DOCS: https://spacy.io/api/doc#noun_chunks """ - if not self.is_parsed: - raise ValueError(Errors.E029) + # Accumulate the result before beginning to iterate over it. This # prevents the tokenisation from being changed out from under us # during the iteration. 
The tricky thing here is that Span accepts diff --git a/spacy/tokens/span.pyx b/spacy/tokens/span.pyx index 59323c393..b8f79f8a6 100644 --- a/spacy/tokens/span.pyx +++ b/spacy/tokens/span.pyx @@ -385,19 +385,9 @@ cdef class Span: return self.doc.user_span_hooks["sent"](self) # This should raise if not parsed / no custom sentence boundaries self.doc.sents - # If doc is parsed we can use the deps to find the sentence - # otherwise we use the `sent_start` token attribute + # Use `sent_start` token attribute to find sentence boundaries cdef int n = 0 - cdef int i - if self.doc.is_parsed: - root = &self.doc.c[self.start] - while root.head != 0: - root += root.head - n += 1 - if n >= self.doc.length: - raise RuntimeError(Errors.E038) - return self.doc[root.l_edge:root.r_edge + 1] - elif self.doc.is_sentenced: + if self.doc.is_sentenced: # Find start of the sentence start = self.start while self.doc.c[start].sent_start != 1 and start > 0: diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx index 0d1e82322..320cfaad5 100644 --- a/spacy/tokens/token.pyx +++ b/spacy/tokens/token.pyx @@ -14,7 +14,7 @@ from ..typedefs cimport hash_t from ..lexeme cimport Lexeme from ..attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE from ..attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT -from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM, LIKE_EMAIL +from ..attrs cimport IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM, LIKE_EMAIL from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP from ..symbols cimport conj @@ -261,7 +261,7 @@ cdef class Token: @property def prob(self): """RETURNS (float): Smoothed log probability estimate of token type.""" - return self.c.lex.prob + return self.vocab[self.c.lex.orth].prob @property def sentiment(self): @@ -269,7 +269,7 @@ cdef class Token: negativity of the token.""" if "sentiment" in self.doc.user_token_hooks: return self.doc.user_token_hooks["sentiment"](self) - return self.c.lex.sentiment + return self.vocab[self.c.lex.orth].sentiment @property def lang(self): @@ -288,7 +288,7 @@ cdef class Token: @property def cluster(self): """RETURNS (int): Brown cluster ID.""" - return self.c.lex.cluster + return self.vocab[self.c.lex.orth].cluster @property def orth(self): @@ -496,6 +496,28 @@ cdef class Token: else: raise ValueError(Errors.E044.format(value=value)) + property is_sent_end: + """A boolean value indicating whether the token ends a sentence. + `None` if unknown. Defaults to `True` for the last token in the `Doc`. + + RETURNS (bool / None): Whether the token ends a sentence. + None if unknown. 
+ +         DOCS: https://spacy.io/api/token#is_sent_end +         """ +         def __get__(self): +             if self.i + 1 == len(self.doc): +                 return True +             elif self.doc[self.i+1].is_sent_start is None: +                 return None +             elif self.doc[self.i+1].is_sent_start is True: +                 return True +             else: +                 return False + +         def __set__(self, value): +             raise ValueError(Errors.E196) + @property def lefts(self): """The leftward immediate children of the word, in the syntactic @@ -903,7 +925,7 @@ cdef class Token: @property def is_oov(self): """RETURNS (bool): Whether the token is out-of-vocabulary.""" - return Lexeme.c_check_flag(self.c.lex, IS_OOV) + return self.c.lex.orth not in self.vocab.vectors @property def is_stop(self): diff --git a/spacy/util.py b/spacy/util.py index 2d732e2b7..97cc5a8d7 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -10,6 +10,7 @@ from thinc.api import NumpyOps, get_current_ops, Adam, require_gpu, Config import functools import itertools import numpy.random +import numpy import srsly import catalogue import sys @@ -35,6 +36,7 @@ from . import about _PRINT_ENV = False +OOV_RANK = numpy.iinfo(numpy.uint64).max class registry(thinc.registry): @@ -882,6 +884,36 @@ def get_serialization_exclude(serializers, exclude, kwargs): return exclude +def get_words_and_spaces(words, text): + if "".join("".join(words).split()) != "".join(text.split()): + raise ValueError(Errors.E194.format(text=text, words=words)) + text_words = [] + text_spaces = [] + text_pos = 0 + # normalize words to remove all whitespace tokens + norm_words = [word for word in words if not word.isspace()] + # align words with text + for word in norm_words: + try: + word_start = text[text_pos:].index(word) + except ValueError: + raise ValueError(Errors.E194.format(text=text, words=words)) + if word_start > 0: + text_words.append(text[text_pos : text_pos + word_start]) + text_spaces.append(False) + text_pos += word_start + text_words.append(word) + text_spaces.append(False) + text_pos += len(word) + if text_pos < len(text) and text[text_pos] == " ": + text_spaces[-1] = True + text_pos += 1 + if text_pos < len(text): + text_words.append(text[text_pos:]) + text_spaces.append(False) + return (text_words, text_spaces) + + class SimpleFrozenDict(dict): """Simplified implementation of a frozen dict, mainly used as default function or method argument (for arguments that should default to empty diff --git a/spacy/vectors.pyx b/spacy/vectors.pyx index 0ed2462c6..4537d612d 100644 --- a/spacy/vectors.pyx +++ b/spacy/vectors.pyx @@ -192,13 +192,20 @@ cdef class Vectors: DOCS: https://spacy.io/api/vectors#resize """ + xp = get_array_module(self.data) if inplace: - self.data.resize(shape, refcheck=False) + if shape[1] != self.data.shape[1]: + raise ValueError(Errors.E193.format(new_dim=shape[1], curr_dim=self.data.shape[1])) + if xp == numpy: + self.data.resize(shape, refcheck=False) + else: + raise ValueError(Errors.E192) else: - xp = get_array_module(self.data) - self.data = xp.resize(self.data, shape) - filled = {row for row in self.key2row.values()} - self._unset = cppset[int]({row for row in range(shape[0]) if row not in filled}) + resized_array = xp.zeros(shape, dtype=self.data.dtype) + copy_shape = (min(shape[0], self.data.shape[0]), min(shape[1], self.data.shape[1])) + resized_array[:copy_shape[0], :copy_shape[1]] = self.data[:copy_shape[0], :copy_shape[1]] + self.data = resized_array + self._sync_unset() removed_items = [] for key, row in list(self.key2row.items()): if row >= shape[0]: @@ -289,11 +296,14 @@ cdef class Vectors: raise 
ValueError(Errors.E060.format(rows=self.data.shape[0], cols=self.data.shape[1])) row = deref(self._unset.begin()) - self.key2row[key] = row + if row < self.data.shape[0]: + self.key2row[key] = row + else: + raise ValueError(Errors.E197.format(row=row, key=key)) if vector is not None: self.data[row] = vector - if self._unset.count(row): - self._unset.erase(self._unset.find(row)) + if self._unset.count(row): + self._unset.erase(self._unset.find(row)) return row def most_similar(self, queries, *, batch_size=1024, n=1, sort=True): @@ -312,11 +322,14 @@ cdef class Vectors: RETURNS (tuple): The most similar entries as a `(keys, best_rows, scores)` tuple. """ + filled = sorted(list({row for row in self.key2row.values()})) + if len(filled) < n: + raise ValueError(Errors.E198.format(n=n, n_rows=len(filled))) xp = get_array_module(self.data) - norms = xp.linalg.norm(self.data, axis=1, keepdims=True) + norms = xp.linalg.norm(self.data[filled], axis=1, keepdims=True) norms[norms == 0] = 1 - vectors = self.data / norms + vectors = self.data[filled] / norms best_rows = xp.zeros((queries.shape[0], n), dtype='i') scores = xp.zeros((queries.shape[0], n), dtype='f') @@ -338,7 +351,8 @@ cdef class Vectors: scores[i:i+batch_size] = scores[sorted_index] best_rows[i:i+batch_size] = best_rows[sorted_index] - xp = get_array_module(self.data) + for i, j in numpy.ndindex(best_rows.shape): + best_rows[i, j] = filled[best_rows[i, j]] # Round values really close to 1 or -1 scores = xp.around(scores, decimals=4, out=scores) # Account for numerical error we want to return in range -1, 1 @@ -401,6 +415,7 @@ cdef class Vectors: "vectors": load_vectors, } util.from_disk(path, serializers, []) + self._sync_unset() return self def to_bytes(self, **kwargs): @@ -443,4 +458,9 @@ cdef class Vectors: "vectors": deserialize_weights } util.from_bytes(data, deserializers, []) + self._sync_unset() return self + + def _sync_unset(self): + filled = {row for row in self.key2row.values()} + self._unset = cppset[int]({row for row in range(self.data.shape[0]) if row not in filled}) diff --git a/spacy/vocab.pxd b/spacy/vocab.pxd index a95ffb11a..49f5bf415 100644 --- a/spacy/vocab.pxd +++ b/spacy/vocab.pxd @@ -29,6 +29,7 @@ cdef class Vocab: cpdef public Morphology morphology cpdef public object vectors cpdef public object lookups + cpdef public object lookups_extra cdef readonly int length cdef public object data_dir cdef public object lex_attr_getters diff --git a/spacy/vocab.pyx b/spacy/vocab.pyx index 3a82ab72d..19896f07b 100644 --- a/spacy/vocab.pyx +++ b/spacy/vocab.pyx @@ -4,12 +4,11 @@ from libc.string cimport memcpy import srsly from thinc.api import get_array_module -from .lexeme cimport EMPTY_LEXEME +from .lexeme cimport EMPTY_LEXEME, OOV_RANK from .lexeme cimport Lexeme from .typedefs cimport attr_t from .tokens.token cimport Token -from .attrs cimport PROB, LANG, ORTH, TAG, POS -from .structs cimport SerializedLexemeC +from .attrs cimport LANG, ORTH, TAG, POS from .compat import copy_reg from .errors import Errors @@ -19,6 +18,8 @@ from .vectors import Vectors from .util import link_vectors_to_models from .lookups import Lookups from . 
import util +from .lang.norm_exceptions import BASE_NORMS +from .lang.lex_attrs import LEX_ATTRS cdef class Vocab: @@ -29,8 +30,8 @@ cdef class Vocab: DOCS: https://spacy.io/api/vocab """ def __init__(self, lex_attr_getters=None, tag_map=None, lemmatizer=None, - strings=tuple(), lookups=None, oov_prob=-20., vectors_name=None, - **deprecated_kwargs): + strings=tuple(), lookups=None, lookups_extra=None, + oov_prob=-20., vectors_name=None, **deprecated_kwargs): """Create the vocabulary. lex_attr_getters (dict): A dictionary mapping attribute IDs to @@ -41,15 +42,20 @@ cdef class Vocab: strings (StringStore): StringStore that maps strings to integers, and vice versa. lookups (Lookups): Container for large lookup tables and dictionaries. - name (str): Optional name to identify the vectors table. + lookups_extra (Lookups): Container for optional lookup tables and dictionaries. + name (unicode): Optional name to identify the vectors table. RETURNS (Vocab): The newly constructed object. """ lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {} tag_map = tag_map if tag_map is not None else {} if lookups in (None, True, False): lookups = Lookups() + if "lexeme_norm" not in lookups: + lookups.add_table("lexeme_norm") if lemmatizer in (None, True, False): lemmatizer = Lemmatizer(lookups) + if lookups_extra in (None, True, False): + lookups_extra = Lookups() self.cfg = {'oov_prob': oov_prob} self.mem = Pool() self._by_orth = PreshMap() @@ -62,6 +68,7 @@ cdef class Vocab: self.morphology = Morphology(self.strings, tag_map, lemmatizer) self.vectors = Vectors(name=vectors_name) self.lookups = lookups + self.lookups_extra = lookups_extra @property def lang(self): @@ -97,7 +104,7 @@ cdef class Vocab: See also: `Lexeme.set_flag`, `Lexeme.check_flag`, `Token.set_flag`, `Token.check_flag`. - flag_getter (callable): A function `f(str) -> bool`, to get the + flag_getter (callable): A function `f(unicode) -> bool`, to get the flag value. flag_id (int): An integer between 1 and 63 (inclusive), specifying the bit at which the flag will be stored. If -1, the lowest @@ -162,17 +169,15 @@ cdef class Vocab: lex.orth = self.strings.add(string) lex.length = len(string) if self.vectors is not None: - lex.id = self.vectors.key2row.get(lex.orth, 0) + lex.id = self.vectors.key2row.get(lex.orth, OOV_RANK) else: - lex.id = 0 + lex.id = OOV_RANK if self.lex_attr_getters is not None: for attr, func in self.lex_attr_getters.items(): value = func(string) if isinstance(value, unicode): value = self.strings.add(value) - if attr == PROB: - lex.prob = value - elif value is not None: + if value is not None: Lexeme.set_struct_attr(lex, attr, value) if not is_oov: self._add_lex_to_vocab(lex.orth, lex) @@ -187,7 +192,7 @@ cdef class Vocab: def __contains__(self, key): """Check whether the string or int key has an entry in the vocabulary. - string (str): The ID string. + string (unicode): The ID string. RETURNS (bool) Whether the string has an entry in the vocabulary. 
DOCS: https://spacy.io/api/vocab#contains @@ -312,11 +317,11 @@ cdef class Vocab: priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth) for lex in self if lex.orth in self.vectors.key2row] priority.sort() - indices = xp.asarray([i for (prob, i, key) in priority], dtype="i") + indices = xp.asarray([i for (prob, i, key) in priority], dtype="uint64") keys = xp.asarray([key for (prob, i, key) in priority], dtype="uint64") keep = xp.ascontiguousarray(self.vectors.data[indices[:nr_row]]) toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]]) - self.vectors = Vectors(data=keep, keys=keys, name=self.vectors.name) + self.vectors = Vectors(data=keep, keys=keys[:nr_row], name=self.vectors.name) syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size) remap = {} for i, key in enumerate(keys[nr_row:]): @@ -336,7 +341,7 @@ cdef class Vocab: If `minn` is defined, then the resulting vector uses Fasttext's subword features by average over ngrams of `orth`. - orth (int / str): The hash value of a word, or its unicode string. + orth (int / unicode): The hash value of a word, or its unicode string. minn (int): Minimum n-gram length used for Fasttext's ngram computation. Defaults to the length of `orth`. maxn (int): Maximum n-gram length used for Fasttext's ngram computation. @@ -389,7 +394,7 @@ cdef class Vocab: """Set a vector for a word in the vocabulary. Words can be referenced by string or int ID. - orth (int / str): The word. + orth (int / unicode): The word. vector (numpy.ndarray[ndim=1, dtype='float32']): The vector to set. DOCS: https://spacy.io/api/vocab#set_vector @@ -403,15 +408,15 @@ cdef class Vocab: else: width = self.vectors.shape[1] self.vectors.resize((new_rows, width)) - lex = self[orth] # Adds words to vocab - self.vectors.add(orth, vector=vector) - self.vectors.add(orth, vector=vector) + lex = self[orth] # Add word to vocab if necessary + row = self.vectors.add(orth, vector=vector) + lex.rank = row def has_vector(self, orth): """Check whether a word has a vector. Returns False if no vectors have been loaded. Words can be looked up by string or int ID. - orth (int / str): The word. + orth (int / unicode): The word. RETURNS (bool): Whether the word has a vector. DOCS: https://spacy.io/api/vocab#has_vector @@ -423,7 +428,7 @@ cdef class Vocab: def to_disk(self, path, exclude=tuple(), **kwargs): """Save the current state to a directory. - path (str / Path): A path to a directory, which will be created if + path (unicode or Path): A path to a directory, which will be created if it doesn't exist. exclude (list): String names of serialization fields to exclude. @@ -432,36 +437,32 @@ cdef class Vocab: path = util.ensure_path(path) if not path.exists(): path.mkdir() - setters = ["strings", "lexemes", "vectors"] + setters = ["strings", "vectors"] exclude = util.get_serialization_exclude(setters, exclude, kwargs) if "strings" not in exclude: self.strings.to_disk(path / "strings.json") - if "lexemes" not in exclude: - with (path / "lexemes.bin").open("wb") as file_: - file_.write(self.lexemes_to_bytes()) if "vectors" not in "exclude" and self.vectors is not None: self.vectors.to_disk(path) if "lookups" not in "exclude" and self.lookups is not None: self.lookups.to_disk(path) + if "lookups_extra" not in "exclude" and self.lookups_extra is not None: + self.lookups_extra.to_disk(path, filename="lookups_extra.bin") def from_disk(self, path, exclude=tuple(), **kwargs): """Loads state from a directory. Modifies the object in place and returns it. 
- path (str / Path): A path to a directory. + path (unicode or Path): A path to a directory. exclude (list): String names of serialization fields to exclude. RETURNS (Vocab): The modified `Vocab` object. DOCS: https://spacy.io/api/vocab#to_disk """ path = util.ensure_path(path) - getters = ["strings", "lexemes", "vectors"] + getters = ["strings", "vectors"] exclude = util.get_serialization_exclude(getters, exclude, kwargs) if "strings" not in exclude: self.strings.from_disk(path / "strings.json") # TODO: add exclude? - if "lexemes" not in exclude: - with (path / "lexemes.bin").open("rb") as file_: - self.lexemes_from_bytes(file_.read()) if "vectors" not in exclude: if self.vectors is not None: self.vectors.from_disk(path, exclude=["strings"]) @@ -469,6 +470,14 @@ cdef class Vocab: link_vectors_to_models(self) if "lookups" not in exclude: self.lookups.from_disk(path) + if "lookups_extra" not in exclude: + self.lookups_extra.from_disk(path, filename="lookups_extra.bin") + if "lexeme_norm" in self.lookups: + self.lex_attr_getters[NORM] = util.add_lookups( + self.lex_attr_getters.get(NORM, LEX_ATTRS[NORM]), self.lookups.get_table("lexeme_norm") + ) + self.length = 0 + self._by_orth = PreshMap() return self def to_bytes(self, exclude=tuple(), **kwargs): @@ -487,9 +496,9 @@ cdef class Vocab: getters = { "strings": lambda: self.strings.to_bytes(), - "lexemes": lambda: self.lexemes_to_bytes(), "vectors": deserialize_vectors, - "lookups": lambda: self.lookups.to_bytes() + "lookups": lambda: self.lookups.to_bytes(), + "lookups_extra": lambda: self.lookups_extra.to_bytes() } exclude = util.get_serialization_exclude(getters, exclude, kwargs) return util.to_bytes(getters, exclude) @@ -513,97 +522,61 @@ cdef class Vocab: "strings": lambda b: self.strings.from_bytes(b), "lexemes": lambda b: self.lexemes_from_bytes(b), "vectors": lambda b: serialize_vectors(b), - "lookups": lambda b: self.lookups.from_bytes(b) + "lookups": lambda b: self.lookups.from_bytes(b), + "lookups_extra": lambda b: self.lookups_extra.from_bytes(b) } exclude = util.get_serialization_exclude(setters, exclude, kwargs) util.from_bytes(bytes_data, setters, exclude) + if "lexeme_norm" in self.lookups: + self.lex_attr_getters[NORM] = util.add_lookups( + self.lex_attr_getters.get(NORM, LEX_ATTRS[NORM]), self.lookups.get_table("lexeme_norm") + ) + self.length = 0 + self._by_orth = PreshMap() if self.vectors.name is not None: link_vectors_to_models(self) return self - def lexemes_to_bytes(self): - cdef hash_t key - cdef size_t addr - cdef LexemeC* lexeme = NULL - cdef SerializedLexemeC lex_data - cdef int size = 0 - for key, addr in self._by_orth.items(): - if addr == 0: - continue - size += sizeof(lex_data.data) - byte_string = b"\0" * size - byte_ptr = byte_string - cdef int j - cdef int i = 0 - for key, addr in self._by_orth.items(): - if addr == 0: - continue - lexeme = addr - lex_data = Lexeme.c_to_bytes(lexeme) - for j in range(sizeof(lex_data.data)): - byte_ptr[i] = lex_data.data[j] - i += 1 - return byte_string - - def lexemes_from_bytes(self, bytes bytes_data): - """Load the binary vocabulary data from the given string.""" - cdef LexemeC* lexeme - cdef hash_t key - cdef unicode py_str - cdef int i = 0 - cdef int j = 0 - cdef SerializedLexemeC lex_data - chunk_size = sizeof(lex_data.data) - cdef void* ptr - cdef unsigned char* bytes_ptr = bytes_data - for i in range(0, len(bytes_data), chunk_size): - lexeme = self.mem.alloc(1, sizeof(LexemeC)) - for j in range(sizeof(lex_data.data)): - lex_data.data[j] = bytes_ptr[i+j] - 
Lexeme.c_from_bytes(lexeme, lex_data) - prev_entry = self._by_orth.get(lexeme.orth) - if prev_entry != NULL: - memcpy(prev_entry, lexeme, sizeof(LexemeC)) - continue - ptr = self.strings._map.get(lexeme.orth) - if ptr == NULL: - continue - py_str = self.strings[lexeme.orth] - if self.strings[py_str] != lexeme.orth: - raise ValueError(Errors.E086.format(string=py_str, - orth_id=lexeme.orth, - hash_id=self.strings[py_str])) - self._by_orth.set(lexeme.orth, lexeme) - self.length += 1 - def _reset_cache(self, keys, strings): # I'm not sure this made sense. Disable it for now. raise NotImplementedError + def load_extra_lookups(self, table_name): + if table_name not in self.lookups_extra: + if self.lang + "_extra" in util.registry.lookups: + tables = util.registry.lookups.get(self.lang + "_extra") + for name, filename in tables.items(): + if table_name == name: + data = util.load_language_data(filename) + self.lookups_extra.add_table(name, data) + if table_name not in self.lookups_extra: + self.lookups_extra.add_table(table_name) + return self.lookups_extra.get_table(table_name) + + def pickle_vocab(vocab): sstore = vocab.strings vectors = vocab.vectors morph = vocab.morphology - length = vocab.length data_dir = vocab.data_dir lex_attr_getters = srsly.pickle_dumps(vocab.lex_attr_getters) - lexemes_data = vocab.lexemes_to_bytes() + lookups = vocab.lookups + lookups_extra = vocab.lookups_extra return (unpickle_vocab, - (sstore, vectors, morph, data_dir, lex_attr_getters, lexemes_data, length)) + (sstore, vectors, morph, data_dir, lex_attr_getters, lookups, lookups_extra)) def unpickle_vocab(sstore, vectors, morphology, data_dir, - lex_attr_getters, bytes lexemes_data, int length): + lex_attr_getters, lookups, lookups_extra): cdef Vocab vocab = Vocab() - vocab.length = length vocab.vectors = vectors vocab.strings = sstore vocab.morphology = morphology vocab.data_dir = data_dir vocab.lex_attr_getters = srsly.pickle_loads(lex_attr_getters) - vocab.lexemes_from_bytes(lexemes_data) - vocab.length = length + vocab.lookups = lookups + vocab.lookups_extra = lookups_extra return vocab diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index d507e13ec..aacfb414c 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -189,7 +189,7 @@ $ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pi | `lang` | positional | Model language. | | `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. | | `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. | -| `--tag-map-path`, `-tm` 2.2.3 | option | Location of JSON-formatted tag map. | +| `--tag-map-path`, `-tm` 2.2.4 | option | Location of JSON-formatted tag map. | | `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. | | `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. | | `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. | @@ -457,7 +457,7 @@ improvement. 
$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width] [--depth] [--cnn-window] [--cnn-pieces] [--use-chars] [--sa-depth] [--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] -[--min-length] [--seed] [--n-iter] [--use-vectors] [--n-save_every] +[--min-length] [--seed] [--n-iter] [--use-vectors] [--n-save-every] [--init-tok2vec] [--epoch-start] ``` @@ -547,7 +547,8 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc] | `output_dir` | positional | Model output directory. Will be created if it doesn't exist. | | `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) with lexical attributes. | | `--vectors-loc`, `-v` | option | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. | -| `--prune-vectors`, `-V` | flag | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. | +| `--truncate-vectors`, `-t` | option | Number of vectors to truncate to when reading in vectors file. Defaults to `0` for no truncation. | +| `--prune-vectors`, `-V` | option | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. | | `--vectors-name`, `-vn` | option | Name to assign to the word vectors in the `meta.json`, e.g. `en_core_web_md.vectors`. | | **CREATES** | model | A spaCy model containing the vocab and vectors. | diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md index 75491358d..50fb10756 100644 --- a/website/docs/api/doc.md +++ b/website/docs/api/doc.md @@ -653,7 +653,7 @@ The L2 norm of the document's vector representation. | `mem` | `Pool` | The document's local memory heap, for all C data it owns. | | `vocab` | `Vocab` | The store of lexical types. | | `tensor` 2 | `ndarray` | Container for dense vector representations. | -| `cats` 2 | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. | +| `cats` 2 | dict | Maps a label to a score for categories applied to the document. The label is a string and the score should be a float. | | `user_data` | - | A generic storage area, for user custom data. | | `lang` 2.1 | int | Language of the document's vocabulary. | | `lang_` 2.1 | str | Language of the document's vocabulary. | diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md index 1d0c1de3a..c9a81f6f1 100644 --- a/website/docs/api/entityrecognizer.md +++ b/website/docs/api/entityrecognizer.md @@ -105,7 +105,7 @@ Apply the pipeline's model to a batch of docs, without modifying them. | Name | Type | Description | | ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `docs` | iterable | The documents to predict. | -| **RETURNS** | tuple | A `(scores, tensors)` tuple where `scores` is the model's prediction for each document and `tensors` is the token representations used to predict the scores. 
Each tensor is an array with one row for each token in the document. | +| **RETURNS** | list | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal). | ## EntityRecognizer.set_annotations {#set_annotations tag="method"} diff --git a/website/docs/api/goldparse.md b/website/docs/api/goldparse.md index 2f841eedd..379913ba2 100644 --- a/website/docs/api/goldparse.md +++ b/website/docs/api/goldparse.md @@ -7,12 +7,10 @@ source: spacy/gold.pyx ## GoldParse.\_\_init\_\_ {#init tag="method"} -Create a `GoldParse`. Unlike annotations in `entities`, label annotations in -`cats` can overlap, i.e. a single word can be covered by multiple labelled -spans. The [`TextCategorizer`](/api/textcategorizer) component expects true -examples of a label to have the value `1.0`, and negative examples of a label to -have the value `0.0`. Labels not in the dictionary are treated as missing – the -gradient for those labels will be zero. +Create a `GoldParse`. The [`TextCategorizer`](/api/textcategorizer) component +expects true examples of a label to have the value `1.0`, and negative examples +of a label to have the value `0.0`. Labels not in the dictionary are treated as +missing – the gradient for those labels will be zero. | Name | Type | Description | | ----------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -22,8 +20,8 @@ gradient for those labels will be zero. | `heads` | iterable | A sequence of integers, representing syntactic head offsets. | | `deps` | iterable | A sequence of strings, representing the syntactic relation types. | | `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. | -| `cats` | dict | Labels for text classification. Each key in the dictionary may be a string or an int, or a `(start_char, end_char, label)` tuple, indicating that the label is applied to only part of the document (usually a sentence). | -| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either 1.0 (positive) or 0.0 (negative). | +| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). | +| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). | | **RETURNS** | `GoldParse` | The newly constructed object. | ## GoldParse.\_\_len\_\_ {#len tag="method"} @@ -53,7 +51,7 @@ Whether the provided syntactic annotations form a projective dependency tree. | `ner` | list | The named entity annotations as BILUO tags. | | `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. | | `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. | -| `cats` 2 | list | Entries in the list should be either a label, or a `(start, end, label)` triple. The tuple form is used for categories applied to spans of the document. 
| +| `cats` 2 | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. | | `links` 2.2 | dict | Keys in the dictionary are `(start_char, end_char)` triples, and the values are dictionaries with `kb_id:value` entries. | ## Utilities {#util} diff --git a/website/docs/api/language.md b/website/docs/api/language.md index 496c89776..e1991f260 100644 --- a/website/docs/api/language.md +++ b/website/docs/api/language.md @@ -136,7 +136,7 @@ Evaluate a model's pipeline components. | Name | Type | Description | | -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects or `(text, annotations)` of raw text and a dict (see [simple training style](/usage/training#training-simple-style)). | +| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects, such that the `Doc` objects contain the predictions and the `GoldParse` objects the correct annotations. Alternatively, `(text, annotations)` tuples of raw text and a dict (see [simple training style](/usage/training#training-simple-style)). | | `verbose` | bool | Print debugging information. | | `batch_size` | int | The batch size to use. | | `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. | diff --git a/website/docs/api/token.md b/website/docs/api/token.md index 1accbe062..69dac23d6 100644 --- a/website/docs/api/token.md +++ b/website/docs/api/token.md @@ -58,7 +58,7 @@ For details, see the documentation on | Name | Type | Description | | --------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------- | -| `name` | str | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `token._.my_attr`. | +| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `token._.my_attr`. | | `default` | - | Optional default value of the attribute if no getter or method is defined. | | `method` | callable | Set a custom method on the object, for example `token._.compare(other_token)`. | | `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. | @@ -80,10 +80,10 @@ Look up a previously registered extension by name. Returns a 4-tuple > assert extension == (False, None, None, None) > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------- | -| `name` | str | Name of the extension. | -| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. | +| Name | Type | Description | +| ----------- | ------- | ------------------------------------------------------------- | +| `name` | unicode | Name of the extension. | +| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. | ## Token.has_extension {#has_extension tag="classmethod" new="2"} @@ -97,10 +97,10 @@ Check whether an extension has been registered on the `Token` class. > assert Token.has_extension("is_fruit") > ``` -| Name | Type | Description | -| ----------- | ---- | ------------------------------------------ | -| `name` | str | Name of the extension to check. 
| -| **RETURNS** | bool | Whether the extension has been registered. | +| Name | Type | Description | +| ----------- | ------- | ------------------------------------------ | +| `name` | unicode | Name of the extension to check. | +| **RETURNS** | bool | Whether the extension has been registered. | ## Token.remove_extension {#remove_extension tag="classmethod" new=""2.0.11""} @@ -115,10 +115,10 @@ Remove a previously registered extension. > assert not Token.has_extension("is_fruit") > ``` -| Name | Type | Description | -| ----------- | ----- | --------------------------------------------------------------------- | -| `name` | str | Name of the extension. | -| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. | +| Name | Type | Description | +| ----------- | ------- | --------------------------------------------------------------------- | +| `name` | unicode | Name of the extension. | +| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. | ## Token.check_flag {#check_flag tag="method"} @@ -351,9 +351,25 @@ property to `0` for the first word of the document. - assert doc[4].sent_start == 1 + assert doc[4].is_sent_start == True ``` - +## Token.is_sent_end {#is_sent_end tag="property" new="2"} + +A boolean value indicating whether the token ends a sentence. `None` if +unknown. Defaults to `True` for the last token in the `Doc`. + +> #### Example +> +> ```python +> doc = nlp("Give it back! He pleaded.") +> assert doc[3].is_sent_end +> assert not doc[4].is_sent_end +> ``` + +| Name | Type | Description | +| ----------- | ---- | ------------------------------------ | +| **RETURNS** | bool | Whether the token ends a sentence. | + ## Token.has_vector {#has_vector tag="property" model="vectors"} A boolean value indicating whether a word vector is associated with the token. @@ -408,71 +424,71 @@ The L2 norm of the token's vector representation. ## Attributes {#attributes} -| Name | Type | Description | -| -------------------------------------------- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The parent document. | -| `sent` 2.0.12 | `Span` | The sentence span that this token is a part of. | -| `text` | str | Verbatim text content. | -| `text_with_ws` | str | Text content, with trailing space character if present. | -| `whitespace_` | str | Trailing space character if present. | -| `orth` | int | ID of the verbatim text content. | -| `orth_` | str | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. | -| `vocab` | `Vocab` | The vocab object of the parent `Doc`. | -| `tensor` 2.1.7 | `ndarray` | The tokens's slice of the parent `Doc`'s tensor. | -| `head` | `Token` | The syntactic parent, or "governor", of this token. | -| `left_edge` | `Token` | The leftmost token of this token's syntactic descendants. | -| `right_edge` | `Token` | The rightmost token of this token's syntactic descendants. | -| `i` | int | The index of the token within the parent document. | -| `ent_type` | int | Named entity type. | -| `ent_type_` | str | Named entity type. | -| `ent_iob` | int | IOB code of named entity tag. 
`3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. | -| `ent_iob_` | str | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. | -| `ent_kb_id` 2.2 | int | Knowledge base ID that refers to the named entity this token is a part of, if any. | -| `ent_kb_id_` 2.2 | str | Knowledge base ID that refers to the named entity this token is a part of, if any. | -| `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | -| `ent_id_` | str | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | -| `lemma` | int | Base form of the token, with no inflectional suffixes. | -| `lemma_` | str | Base form of the token, with no inflectional suffixes. | -| `norm` | int | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). | -| `norm_` | str | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). | -| `lower` | int | Lowercase form of the token. | -| `lower_` | str | Lowercase form of the token text. Equivalent to `Token.text.lower()`. | +| Name | Type | Description | +| -------------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doc` | `Doc` | The parent document. | +| `sent` 2.0.12 | `Span` | The sentence span that this token is a part of. | +| `text` | unicode | Verbatim text content. | +| `text_with_ws` | unicode | Text content, with trailing space character if present. | +| `whitespace_` | unicode | Trailing space character if present. | +| `orth` | int | ID of the verbatim text content. | +| `orth_` | unicode | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. | +| `vocab` | `Vocab` | The vocab object of the parent `Doc`. | +| `tensor` 2.1.7 | `ndarray` | The tokens's slice of the parent `Doc`'s tensor. | +| `head` | `Token` | The syntactic parent, or "governor", of this token. | +| `left_edge` | `Token` | The leftmost token of this token's syntactic descendants. | +| `right_edge` | `Token` | The rightmost token of this token's syntactic descendants. | +| `i` | int | The index of the token within the parent document. | +| `ent_type` | int | Named entity type. | +| `ent_type_` | unicode | Named entity type. | +| `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. | +| `ent_iob_` | unicode | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. | +| `ent_kb_id` 2.2 | int | Knowledge base ID that refers to the named entity this token is a part of, if any. 
| +| `ent_kb_id_` 2.2 | unicode | Knowledge base ID that refers to the named entity this token is a part of, if any. | +| `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | +| `ent_id_` | unicode | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | +| `lemma` | int | Base form of the token, with no inflectional suffixes. | +| `lemma_` | unicode | Base form of the token, with no inflectional suffixes. | +| `norm` | int | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). | +| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). | +| `lower` | int | Lowercase form of the token. | +| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. | | `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. | -| `shape_` | str | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. | -| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. | -| `prefix_` | str | A length-N substring from the start of the token. Defaults to `N=1`. | -| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. | -| `suffix_` | str | Length-N substring from the end of the token. Defaults to `N=3`. | -| `is_alpha` | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. | -| `is_ascii` | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. | -| `is_digit` | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. | -| `is_lower` | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. | -| `is_upper` | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. | -| `is_title` | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. | -| `is_punct` | bool | Is the token punctuation? | -| `is_left_punct` | bool | Is the token a left punctuation mark, e.g. `'('` ? | -| `is_right_punct` | bool | Is the token a right punctuation mark, e.g. `')'` ? | -| `is_space` | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. | -| `is_bracket` | bool | Is the token a bracket? | -| `is_quote` | bool | Is the token a quotation mark? | -| `is_currency` 2.0.8 | bool | Is the token a currency symbol? | -| `like_url` | bool | Does the token resemble a URL? | -| `like_num` | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. | -| `like_email` | bool | Does the token resemble an email address? | -| `is_oov` | bool | Is the token out-of-vocabulary? 
| -| `is_stop` | bool | Is the token part of a "stop list"? | -| `pos` | int | Coarse-grained part-of-speech. | -| `pos_` | str | Coarse-grained part-of-speech. | -| `tag` | int | Fine-grained part-of-speech. | -| `tag_` | str | Fine-grained part-of-speech. | -| `dep` | int | Syntactic dependency relation. | -| `dep_` | str | Syntactic dependency relation. | -| `lang` | int | Language of the parent document's vocabulary. | -| `lang_` | str | Language of the parent document's vocabulary. | -| `prob` | float | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). | -| `idx` | int | The character offset of the token within the parent document. | -| `sentiment` | float | A scalar value indicating the positivity or negativity of the token. | -| `lex_id` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. | -| `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. | -| `cluster` | int | Brown cluster ID. | -| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). | +| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. | +| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. | +| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. | +| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. | +| `suffix_` | unicode | Length-N substring from the end of the token. Defaults to `N=3`. | +| `is_alpha` | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. | +| `is_ascii` | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. | +| `is_digit` | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. | +| `is_lower` | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. | +| `is_upper` | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. | +| `is_title` | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. | +| `is_punct` | bool | Is the token punctuation? | +| `is_left_punct` | bool | Is the token a left punctuation mark, e.g. `'('` ? | +| `is_right_punct` | bool | Is the token a right punctuation mark, e.g. `')'` ? | +| `is_space` | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. | +| `is_bracket` | bool | Is the token a bracket? | +| `is_quote` | bool | Is the token a quotation mark? | +| `is_currency` 2.0.8 | bool | Is the token a currency symbol? | +| `like_url` | bool | Does the token resemble a URL? | +| `like_num` | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. | +| `like_email` | bool | Does the token resemble an email address? | +| `is_oov` | bool | Is the token out-of-vocabulary? | +| `is_stop` | bool | Is the token part of a "stop list"? | +| `pos` | int | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). 
| +| `pos_` | unicode | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). | +| `tag` | int | Fine-grained part-of-speech. | +| `tag_` | unicode | Fine-grained part-of-speech. | +| `dep` | int | Syntactic dependency relation. | +| `dep_` | unicode | Syntactic dependency relation. | +| `lang` | int | Language of the parent document's vocabulary. | +| `lang_` | unicode | Language of the parent document's vocabulary. | +| `prob` | float | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). | +| `idx` | int | The character offset of the token within the parent document. | +| `sentiment` | float | A scalar value indicating the positivity or negativity of the token. | +| `lex_id` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. | +| `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. | +| `cluster` | int | Brown cluster ID. | +| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). | diff --git a/website/docs/api/vectors.md b/website/docs/api/vectors.md index d4c0269ef..939cc8655 100644 --- a/website/docs/api/vectors.md +++ b/website/docs/api/vectors.md @@ -111,7 +111,7 @@ Check whether a key has been mapped to a vector entry in the table. > > ```python > cat_id = nlp.vocab.strings["cat"] -> nlp.vectors.add(cat_id, numpy.random.uniform(-1, 1, (300,))) +> nlp.vocab.vectors.add(cat_id, numpy.random.uniform(-1, 1, (300,))) > assert cat_id in vectors > ``` @@ -315,7 +315,7 @@ performed in chunks, to avoid consuming too much memory. You can set the > > ```python > queries = numpy.asarray([numpy.random.uniform(-1, 1, (300,))]) -> most_similar = nlp.vectors.most_similar(queries, n=10) +> most_similar = nlp.vocab.vectors.most_similar(queries, n=10) > ``` | Name | Type | Description | diff --git a/website/docs/usage/101/_pos-deps.md b/website/docs/usage/101/_pos-deps.md index 9d04d6ffc..1a438e424 100644 --- a/website/docs/usage/101/_pos-deps.md +++ b/website/docs/usage/101/_pos-deps.md @@ -25,7 +25,7 @@ for token in doc: > - **Text:** The original word text. > - **Lemma:** The base form of the word. -> - **POS:** The simple part-of-speech tag. +> - **POS:** The simple [UPOS](https://universaldependencies.org/docs/u/pos/) part-of-speech tag. > - **Tag:** The detailed part-of-speech tag. > - **Dep:** Syntactic dependency, i.e. the relation between tokens. > - **Shape:** The word shape – capitalization, punctuation, digits. diff --git a/website/docs/usage/examples.md b/website/docs/usage/examples.md index 180b02ff4..854b2d42b 100644 --- a/website/docs/usage/examples.md +++ b/website/docs/usage/examples.md @@ -111,6 +111,27 @@ start. https://github.com/explosion/spaCy/tree/master/examples/training/train_new_entity_type.py ``` +### Creating a Knowledge Base for Named Entity Linking {#kb} + +This example shows how to create a knowledge base in spaCy, +which is needed to implement entity linking functionality. +It requires as input a spaCy model with pretrained word vectors, +and it stores the KB to file (if an `output_dir` is provided). 
+
+```python
+https://github.com/explosion/spaCy/tree/master/examples/training/create_kb.py
+```
+
+### Training spaCy's Named Entity Linker {#nel}
+
+This example shows how to train spaCy's entity linker with your own custom
+examples, starting off with a predefined knowledge base and its vocab, and
+using a blank `English` class.
+
+```python
+https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py
+```
+
 ### Training spaCy's Dependency Parser {#parser}
 
 This example shows how to update spaCy's dependency parser, starting off with an
@@ -162,7 +183,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.p
 
 This script lets you load any spaCy model containing word vectors into
 [TensorBoard](https://projector.tensorflow.org/) to create an
-[embedding visualization](https://www.tensorflow.org/versions/r1.1/get_started/embedding_viz).
+[embedding visualization](https://github.com/tensorflow/tensorboard/blob/master/docs/tensorboard_projector_plugin.ipynb).
 
 ```python
 https://github.com/explosion/spaCy/tree/master/examples/vectors_tensorboard.py
diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md
index 6ea2b0721..473ffded8 100644
--- a/website/docs/usage/index.md
+++ b/website/docs/usage/index.md
@@ -133,10 +133,10 @@ support, we've been grateful to use the work of Chainer's
 interface for GPU arrays.
 
 spaCy can be installed on GPU by specifying `spacy[cuda]`, `spacy[cuda90]`,
-`spacy[cuda91]`, `spacy[cuda92]` or `spacy[cuda100]`. If you know your cuda
-version, using the more explicit specifier allows cupy to be installed via
-wheel, saving some compilation time. The specifiers should install
-[`cupy`](https://cupy.chainer.org).
+`spacy[cuda91]`, `spacy[cuda92]`, `spacy[cuda100]`, `spacy[cuda101]` or
+`spacy[cuda102]`. If you know your CUDA version, using the more explicit
+specifier allows cupy to be installed via wheel, saving some compilation time.
+The specifiers should install [`cupy`](https://cupy.chainer.org).
 
 ```bash
 $ pip install -U spacy[cuda92]
diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index 420e8263a..4b3c61b9d 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -579,9 +579,7 @@ import DisplacyEntHtml from 'images/displacy-ent2.html'
 
 To ground the named entities into the "real world", spaCy provides functionality
 to perform entity linking, which resolves a textual entity to a unique
-identifier from a knowledge base (KB). The
-[processing scripts](https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking)
-we provide use WikiData identifiers, but you can create your own
+identifier from a knowledge base (KB). You can create your own
 [`KnowledgeBase`](/api/kb) and
 [train a new Entity Linking model](/usage/training#entity-linker) using that
 custom-made KB.
@@ -1303,7 +1301,7 @@ with doc.retokenize() as retokenizer:
 
 ### Overwriting custom extension attributes {#retokenization-extensions}
 
 If you've registered custom
-[extension attributes](/usage/processing-pipelines##custom-components-attributes),
+[extension attributes](/usage/processing-pipelines#custom-components-attributes),
 you can overwrite them during tokenization by providing a dictionary of
 attribute names mapped to new values as the `"_"` key in the `attrs`.
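For instance, overwriting such an attribute while merging might look like the following sketch (the `is_musician` extension is a hypothetical example, and the small English model is assumed):

```python
import spacy
from spacy.tokens import Token

# register a hypothetical custom extension attribute
Token.set_extension("is_musician", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like David Bowie")

with doc.retokenize() as retokenizer:
    # merge "David Bowie" and overwrite the extension on the merged token
    retokenizer.merge(doc[2:4], attrs={"_": {"is_musician": True}})

assert doc[2]._.is_musician
```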
 For merging, you need to provide one dictionary of attributes for the resulting
diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md
index 588782986..c0dbfc732 100644
--- a/website/docs/usage/saving-loading.md
+++ b/website/docs/usage/saving-loading.md
@@ -216,7 +216,7 @@ class CustomComponent(object):
         # Add something to the component's data
         self.data.append(data)
 
-    def to_disk(self, path):
+    def to_disk(self, path, **kwargs):
         # This will receive the directory path + /my_component
         data_path = path / "data.json"
         with data_path.open("w", encoding="utf8") as f:
@@ -461,7 +461,7 @@ model. When you save out a model using `nlp.to_disk` and the component exposes
 a `to_disk` method, it will be called with the disk path.
 
 ```python
-def to_disk(self, path):
+def to_disk(self, path, **kwargs):
     snek_path = path / "snek.txt"
     with snek_path.open("w", encoding="utf8") as snek_file:
         snek_file.write(self.snek)
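A matching `from_disk` hook, sketched here under the assumption that it restores `self.snek` from the same `snek.txt` file, receives the extra keyword arguments in the same way:

```python
def from_disk(self, path, **kwargs):
    snek_path = path / "snek.txt"
    with snek_path.open("r", encoding="utf8") as snek_file:
        self.snek = snek_file.read()
    return self
```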
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index a10c60357..55d4accba 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -347,9 +347,9 @@ your data** to find a solution that works best for you.
 ### Updating the Named Entity Recognizer {#example-train-ner}
 
 This example shows how to update spaCy's entity recognizer with your own
-examples, starting off with an existing, pretrained model, or from scratch
-using a blank `Language` class. To do this, you'll need **example texts** and
-the **character offsets** and **labels** of each entity contained in the texts.
+examples, starting off with an existing, pretrained model, or from scratch using
+a blank `Language` class. To do this, you'll need **example texts** and the
+**character offsets** and **labels** of each entity contained in the texts.
 
 ```python
 https://github.com/explosion/spaCy/tree/master/examples/training/train_ner.py
@@ -440,8 +440,8 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_parser.py
    training the parser.
 2. **Add the dependency labels** to the parser using the
    [`add_label`](/api/dependencyparser#add_label) method. If you're starting off
-   with a pretrained spaCy model, this is usually not necessary – but it
-   doesn't hurt either, just to be safe.
+   with a pretrained spaCy model, this is usually not necessary – but it doesn't
+   hurt either, just to be safe.
 3. **Shuffle and loop over** the examples. For each example, **update the
    model** by calling [`nlp.update`](/api/language#update), which steps through
    the words of the input. At each word, it makes a **prediction**. It then
@@ -605,39 +605,38 @@ To train an entity linking model, you first need to define a knowledge base
 
 A KB consists of a list of entities with unique identifiers. Each such entity
 has an entity vector that will be used to measure similarity with the context in
-which an entity is used. These vectors are pretrained and stored in the KB
-before the entity linking model will be trained.
+which an entity is used. These vectors have a fixed length and are stored in the
+KB.
 
 The following example shows how to build a knowledge base from scratch, given a
-list of entities and potential aliases. The script further demonstrates how to
-pretrain and store the entity vectors. To run this example, the script needs
-access to a `vocab` instance or an `nlp` model with pretrained word embeddings.
+list of entities and potential aliases. The script requires an `nlp` model with
+pretrained word vectors to obtain an encoding of an entity's description as its
+vector.
 
 ```python
-https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
+https://github.com/explosion/spaCy/tree/master/examples/training/create_kb.py
 ```
 
 #### Step by step guide {#step-by-step-kb}
 
-1. **Load the model** you want to start with, or create an **empty model** using
-   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language and
-   a pre-defined [`vocab`](/api/vocab) object.
-2. **Pretrain the entity embeddings** by running the descriptions of the
-   entities through a simple encoder-decoder network. The current implementation
-   requires the `nlp` model to have access to pretrained word embeddings, but a
-   custom implementation of this encoding step can also be used.
-3. **Construct the KB** by defining all entities with their pretrained vectors,
-   and all aliases with their prior probabilities.
+1. **Load the model** you want to start with. It should contain pretrained word
+   vectors.
+2. **Obtain the entity embeddings** by running the descriptions of the entities
+   through the `nlp` model and taking the average of all words with
+   `nlp(desc).vector`. At this point, a custom encoding step can also be used.
+3. **Construct the KB** by defining all entities with their embeddings, and all
+   aliases with their prior probabilities.
 4. **Save** the KB using [`kb.dump`](/api/kb#dump).
-5. **Test** the KB to make sure the entities were added correctly.
+5. **Print** the contents of the KB to make sure the entities were added
+   correctly.
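Concretely, steps 2, 4 and 5 of this guide might look like the following sketch (the entity, its description and the paths are hypothetical; a model with 300-dimensional vectors is assumed):

```python
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_md")  # provides pretrained word vectors
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

# step 2: the entity vector is the averaged encoding of the description
desc = "Douglas Adams, English writer and humorist"
kb.add_entity(entity="Q42", freq=42, entity_vector=nlp(desc).vector)
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

# step 4: save the KB, along with the vocab it was created with
kb.dump("/path/to/kb")
nlp.vocab.to_disk("/path/to/vocab")

# step 5: print the contents to check the entities were added correctly
print("Entities in the KB:", kb.get_entity_strings())
print("Aliases in the KB:", kb.get_alias_strings())
```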
 
 ### Training an entity linking model {#entity-linker-model}
 
 This example shows how to create an entity linker pipe using a previously
-created knowledge base. The entity linker pipe is then trained with your own
-examples. To do so, you'll need to provide **example texts**, and the
-**character offsets** and **knowledge base identifiers** of each entity
-contained in the texts.
+created knowledge base. The entity linker is then trained with a set of custom
+examples. To do so, you need to provide **example texts**, and the **character
+offsets** and **knowledge base identifiers** of each entity contained in the
+texts.
 
 ```python
 https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py
@@ -647,14 +646,16 @@
 
 1. **Load the KB** you want to start with, and specify the path to the `Vocab`
    object that was used to create this KB. Then, create an **empty model** using
-   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language.
-   Don't forget to add the KB to the entity linker, and to add the entity linker
-   to the pipeline. In practical applications, you will want a more advanced
-   pipeline including also a component for
-   [named entity recognition](/usage/training#ner). If you're using a model with
-   additional components, make sure to disable all other pipeline components
-   during training using [`nlp.select_pipes`](/api/language#select_pipes).
-   This way, you'll only be training the entity linker.
+   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. Add
+   a component for recognizing sentences and one for identifying relevant
+   entities. In practical applications, you will want a more advanced pipeline
+   including also a component for
+   [named entity recognition](/usage/training#ner). Then, create a new entity
+   linker component, add the KB to it, and then add the entity linker to the
+   pipeline. If you're using a model with additional components, make sure to
+   disable all other pipeline components during training using
+   [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
+   training the entity linker.
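A sketch of this first step, using the v2.x API (the KB and vocab paths are hypothetical, and the KB is assumed to use 300-dimensional entity vectors):

```python
import spacy
from spacy.kb import KnowledgeBase

# create a blank model and load the vocab the KB was created with
nlp = spacy.blank("en")
nlp.vocab.from_disk("/path/to/vocab")

# a simple component for recognizing sentences
nlp.add_pipe(nlp.create_pipe("sentencizer"))

# load the KB and hook it up to a new entity linker component
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)
kb.load_bulk("/path/to/kb")
entity_linker = nlp.create_pipe("entity_linker")
entity_linker.set_kb(kb)
nlp.add_pipe(entity_linker, last=True)

# training examples map character offsets to KB identifiers (hypothetical)
TRAIN_DATA = [
    ("Douglas Adams wrote five novels.", {"links": {(0, 13): {"Q42": 1.0}}}),
]
```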
2. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the annotated examples of the input. For each combination of a mention in
diff --git a/website/meta/universe.json b/website/meta/universe.json
index 70aace8c0..7bd954c2f 100644
--- a/website/meta/universe.json
+++ b/website/meta/universe.json
@@ -1,5 +1,32 @@
 {
     "resources": [
+        {
+            "id": "whatlies",
+            "title": "whatlies",
+            "slogan": "Make interactive visualisations to figure out 'what lies' in word embeddings.",
+            "description": "This small library offers tools to make it easier to visualise both word embeddings and operations on them. It supports spaCy prebuilt models as a first-class citizen, and also offers support for sense2vec. There's a convenient API to perform linear algebra as well as support for popular transformations like PCA/UMAP/etc.",
+            "github": "rasahq/whatlies",
+            "pip": "whatlies",
+            "thumb": "https://i.imgur.com/rOkOiLv.png",
+            "image": "https://raw.githubusercontent.com/RasaHQ/whatlies/master/docs/gif-two.gif",
+            "code_example": [
+                "from whatlies import EmbeddingSet",
+                "from whatlies.language import SpacyLanguage",
+                "",
+                "lang = SpacyLanguage('en_core_web_md')",
+                "words = ['cat', 'dog', 'fish', 'kitten', 'man', 'woman', 'king', 'queen', 'doctor', 'nurse']",
+                "",
+                "emb = lang[words]",
+                "emb.plot_interactive(x_axis='man', y_axis='woman')"
+            ],
+            "category": ["visualizers", "research"],
+            "author": "Vincent D. Warmerdam",
+            "author_links": {
+                "twitter": "fishnets88",
+                "github": "koaning",
+                "website": "https://koaning.io"
+            }
+        },
         {
             "id": "spacy-stanza",
             "title": "spacy-stanza",
@@ -87,7 +114,13 @@
             " text = f(text)",
             "print(text)"
         ],
-        "category": ["scientific"]
+        "category": ["scientific", "biomedical"],
+        "author": "Travis Hoppe",
+        "author_links": {
+            "github": "thoppe",
+            "twitter": "metasemantic",
+            "website": "http://thoppe.github.io/"
+        }
     },
     {
         "id": "Chatterbot",
         "title": "Chatterbot",
@@ -337,7 +370,7 @@
             "entity = Entity(keywords_list=['python', 'product manager', 'java platform'])",
             "nlp.add_pipe(entity, last=True)",
             "",
-            "doc = nlp(u\"I am a product manager for a java and python.\")",
+            "doc = nlp(\"I am a product manager for a java and python.\")",
             "assert doc._.has_entities == True",
             "assert doc[0]._.is_entity == False",
             "assert doc[3]._.entity_desc == 'product manager'",
@@ -642,7 +675,7 @@
         "tags": ["chatbots"]
     },
     {
-        "id": "tochtext",
+        "id": "torchtext",
         "title": "torchtext",
         "slogan": "Data loaders and abstractions for text and NLP",
         "github": "pytorch/text",
@@ -1620,6 +1653,20 @@
         },
         "category": ["standalone", "research"]
     },
+    {
+        "id": "pic2phrase_bot",
+        "title": "pic2phrase_bot: Photo Description Generator",
+        "slogan": "A bot that generates descriptions for submitted photos, in a human-like manner.",
+        "description": "pic2phrase_bot runs inside the Telegram messenger and can be used to generate a phrase describing a submitted photo, employing computer vision, web scraping, and syntactic dependency analysis powered by spaCy.",
+        "thumb": "https://i.imgur.com/ggVI02O.jpg",
+        "image": "https://i.imgur.com/z1yhWQR.jpg",
+        "url": "https://telegram.me/pic2phrase_bot",
+        "author": "Yuli Vasiliev",
+        "author_links": {
+            "twitter": "VasilievYuli"
+        },
+        "category": ["standalone", "conversational"]
+    },
     {
         "id": "gracyql",
         "title": "gracyql",
@@ -2067,6 +2114,102 @@
             "predict_output = clf.predict(predict_input)"
         ],
         "category": ["standalone"]
+    },
+    {
+        "id": "spacy_fastlang",
+        "title": "Spacy FastLang",
+        "slogan": "Language detection done fast",
+        "description": "Fast language detection using FastText and spaCy.",
+        "github": "thomasthiebaud/spacy-fastlang",
+        "pip": "spacy_fastlang",
+        "code_example": [
+            "import spacy",
+            "from spacy_fastlang import LanguageDetector",
+            "",
+            "nlp = spacy.load('en_core_web_sm')",
+            "nlp.add_pipe(LanguageDetector())",
+            "doc = nlp(\"Life is like a box of chocolates. You never know what you're gonna get.\")",
+            "",
+            "assert doc._.language == 'en'",
+            "assert doc._.language_score >= 0.8"
+        ],
+        "author": "Thomas Thiebaud",
+        "author_links": {
+            "github": "thomasthiebaud"
+        },
+        "category": ["pipeline"]
+    },
+    {
+        "id": "mlflow",
+        "title": "MLflow",
+        "slogan": "An open source platform for the machine learning lifecycle",
+        "description": "MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: Tracking, Projects, Models and Registry.",
+        "github": "mlflow/mlflow",
+        "pip": "mlflow",
+        "thumb": "https://www.mlflow.org/docs/latest/_static/MLflow-logo-final-black.png",
+        "image": "",
+        "url": "https://mlflow.org/",
+        "author": "Databricks",
+        "author_links": {
+            "github": "databricks",
+            "twitter": "databricks",
+            "website": "https://databricks.com/"
+        },
+        "category": ["standalone", "apis"],
+        "code_example": [
+            "import mlflow",
+            "import mlflow.spacy",
+            "import spacy",
+            "",
+            "# MLflow Tracking",
+            "nlp = spacy.load('my_best_model_path/output/model-best')",
+            "with mlflow.start_run(run_name='Spacy'):",
+            "    mlflow.set_tag('model_flavor', 'spacy')",
+            "    mlflow.spacy.log_model(spacy_model=nlp, artifact_path='model')",
+            "    mlflow.log_metric('accuracy', 0.72)",
+            "    my_run_id = mlflow.active_run().info.run_id",
+            "",
+            "",
+            "# MLflow Models",
+            "model_uri = f'runs:/{my_run_id}/model'",
+            "nlp2 = mlflow.spacy.load_model(model_uri=model_uri)"
+        ]
+    },
+    {
+        "id": "pyate",
+        "title": "PyATE",
+        "slogan": "Python Automated Term Extraction",
+        "description": "PyATE is a term extraction library written in Python, using spaCy POS tagging and implementing the Basic, Combo Basic, C-Value, TermExtractor, and Weirdness algorithms.",
+        "github": "kevinlu1248/pyate",
+        "pip": "pyate",
+        "code_example": [
+            "import spacy",
+            "from pyate.term_extraction_pipeline import TermExtractionPipeline",
+            "",
+            "nlp = spacy.load('en_core_web_sm')",
+            "nlp.add_pipe(TermExtractionPipeline())",
+            "# source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1994795/",
+            "string = 'Central to the development of cancer are genetic changes that endow these “cancer cells” with many of the hallmarks of cancer, such as self-sufficient growth and resistance to anti-growth and pro-death signals. However, while the genetic changes that occur within cancer cells themselves, such as activated oncogenes or dysfunctional tumor suppressors, are responsible for many aspects of cancer development, they are not sufficient. Tumor promotion and progression are dependent on ancillary processes provided by cells of the tumor environment but that are not necessarily cancerous themselves. Inflammation has long been associated with the development of cancer. This review will discuss the reflexive relationship between cancer and inflammation with particular focus on how considering the role of inflammation in physiologic processes such as the maintenance of tissue homeostasis and repair may provide a logical framework for understanding the connection between the inflammatory response and cancer.'",
+            "",
+            "doc = nlp(string)",
+            "print(doc._.combo_basic.sort_values(ascending=False).head(5))",
+            "\"\"\"\"\"\"",
+            "dysfunctional tumor 1.443147",
+            "tumor suppressors 1.443147",
+            "genetic changes 1.386294",
+            "cancer cells 1.386294",
+            "dysfunctional tumor suppressors 1.298612",
+            "\"\"\"\"\"\""
+        ],
+        "code_language": "python",
+        "url": "https://github.com/kevinlu1248/pyate",
+        "author": "Kevin Lu",
+        "author_links": {
+            "twitter": "kevinlu1248",
+            "github": "kevinlu1248",
+            "website": "https://github.com/kevinlu1248/pyate"
+        },
+        "category": ["pipeline", "research"],
+        "tags": ["term_extraction"]
+    }
 ],
diff --git a/website/src/components/landing.js b/website/src/components/landing.js
index 16c342e3f..fb03d2845 100644
--- a/website/src/components/landing.js
+++ b/website/src/components/landing.js
@@ -46,10 +46,17 @@ export const LandingGrid = ({ cols = 3, blocks = false, children }) => (
 
 export const LandingCol = ({ children }) => <div className={classes.col}>{children}</div>
 
-export const LandingCard = ({ title, children }) => (
+export const LandingCard = ({ title, button, url, children }) => (
     <div className={classes.card}>
-        {title && <H3>{title}</H3>}
-        {children}
+        <section className={classes.cardText}>
+            {title && <H3>{title}</H3>}
+            <p>{children}</p>
+        </section>
+        {button && url && (
+            <footer>
+                <LandingButton to={url}>{button}</LandingButton>
+            </footer>
+        )}
     </div>
 )
diff --git a/website/src/styles/landing.module.sass b/website/src/styles/landing.module.sass
index d7340229b..e36e36c0a 100644
--- a/website/src/styles/landing.module.sass
+++ b/website/src/styles/landing.module.sass
@@ -49,12 +49,17 @@ margin-bottom: -25rem
 
 .card
+    display: flex
+    flex-direction: column
     padding: 3rem 2.5rem
     background: var(--color-back)
     border-radius: var(--border-radius)
     box-shadow: var(--box-shadow)
     margin-bottom: 3rem
 
+.card-text
+    flex: 100%
+
 .button
     width: 100%
diff --git a/website/src/widgets/landing.js b/website/src/widgets/landing.js
index 2dc5d40dc..9aeec0cdc 100644
--- a/website/src/widgets/landing.js
+++ b/website/src/widgets/landing.js
@@ -79,34 +79,28 @@ const Landing = ({ data }) => {
                     in Python
                 </H2>
-            <LandingCard title="Get things done">
-                <p>
-                    spaCy is designed to help you do real work — to build real products, or
-                    gather real insights. The library respects your time, and tries to avoid
-                    wasting it. It's easy to install, and its API is simple and productive. We
-                    like to think of spaCy as the Ruby on Rails of Natural Language Processing.
-                </p>
-                <LandingButton to="/usage/spacy-101">Get started</LandingButton>
-            </LandingCard>
+            <LandingCard title="Get things done" url="/usage/spacy-101" button="Get started">
+                spaCy is designed to help you do real work — to build real products, or gather
+                real insights. The library respects your time, and tries to avoid wasting it.
+                It's easy to install, and its API is simple and productive. We like to think of
+                spaCy as the Ruby on Rails of Natural Language Processing.
+            </LandingCard>
 
-            <LandingCard title="Blazing fast">
-                <p>
-                    spaCy excels at large-scale information extraction tasks. It's written from
-                    the ground up in carefully memory-managed Cython. Independent research in
-                    2015 found spaCy to be the fastest in the world. If your application needs
-                    to process entire web dumps, spaCy is the library you want to be using.
-                </p>
-                <LandingButton to="/usage/facts-figures">Facts & Figures</LandingButton>
-            </LandingCard>
+            <LandingCard title="Blazing fast" url="/usage/facts-figures" button="Facts & Figures">
+                spaCy excels at large-scale information extraction tasks. It's written from the
+                ground up in carefully memory-managed Cython. Independent research in 2015 found
+                spaCy to be the fastest in the world. If your application needs to process
+                entire web dumps, spaCy is the library you want to be using.
+            </LandingCard>
 
-            <LandingCard title="Deep learning">
-                <p>
-                    spaCy is the best way to prepare text for deep learning. It interoperates
-                    seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of
-                    Python's awesome AI ecosystem. With spaCy, you can easily construct
-                    linguistically sophisticated statistical models for a variety of NLP
-                    problems.
-                </p>
-                <LandingButton to="/usage/training">Read more</LandingButton>
-            </LandingCard>
+            <LandingCard title="Deep learning" url="/usage/training" button="Read more">
+                spaCy is the best way to prepare text for deep learning. It interoperates
+                seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of
+                Python's awesome AI ecosystem. With spaCy, you can easily construct
+                linguistically sophisticated statistical models for a variety of NLP problems.