diff --git a/.github/contributors/BigstickCarpet.md b/.github/contributors/BigstickCarpet.md new file mode 100644 index 000000000..07b356495 --- /dev/null +++ b/.github/contributors/BigstickCarpet.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [ X] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | James Messinger | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | May 23, 2018 | +| GitHub username | BigstickCarpet | +| Website (optional) | | diff --git a/.github/contributors/ansgar-t.md b/.github/contributors/ansgar-t.md new file mode 100644 index 000000000..a45dea3f2 --- /dev/null +++ b/.github/contributors/ansgar-t.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). 
+The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Ansgar Tümmers | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2018-05-26 | +| GitHub username | ansgar-t | +| Website (optional) | | diff --git a/.github/contributors/aristorinjuang.md b/.github/contributors/aristorinjuang.md new file mode 100644 index 000000000..17cb692a6 --- /dev/null +++ b/.github/contributors/aristorinjuang.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). 
+The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [x] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------------- | +| Name | Aristo Rinjuang | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | May 22, 2018 | +| GitHub username | aristorinjuang | +| Website (optional) | https://aristorinjuang.com | diff --git a/.github/contributors/armsp.md b/.github/contributors/armsp.md new file mode 100644 index 000000000..63d1367e4 --- /dev/null +++ b/.github/contributors/armsp.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. 
Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Shantam | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 21/5/2018 | +| GitHub username | armsp | +| Website (optional) | | diff --git a/.github/contributors/idealley.md b/.github/contributors/idealley.md new file mode 100644 index 000000000..9aa7d4a1b --- /dev/null +++ b/.github/contributors/idealley.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. 
Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [x] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Pouyt Samuel | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 26.05.2018 | +| GitHub username | Idealley | +| Website (optional) | | diff --git a/.github/contributors/mpszumowski.md b/.github/contributors/mpszumowski.md new file mode 100644 index 000000000..226c400c6 --- /dev/null +++ b/.github/contributors/mpszumowski.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. 
Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Maciej Szumowski | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 30.05.2018 | +| GitHub username | mpszumowski | +| Website (optional) | | diff --git a/spacy/cli/train.py b/spacy/cli/train.py index 8fdf44dcb..5d6e2d55c 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -118,7 +118,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0, optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu) nlp._optimizer = None - print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %") + print("Itn. Dep Loss NER Loss UAS NER P. NER R. NER F. Tag % Token % CPU WPS GPU WPS") try: for i in range(n_iter): train_docs = corpus.train_docs(nlp, noise_level=0.0, @@ -208,17 +208,17 @@ def print_progress(itn, losses, dev_scores, cpu_wps=0.0, gpu_wps=0.0): scores.update(dev_scores) scores['cpu_wps'] = cpu_wps scores['gpu_wps'] = gpu_wps or 0.0 - tpl = '\t'.join(( - '{:d}', - '{dep_loss:.3f}', - '{ner_loss:.3f}', - '{uas:.3f}', - '{ents_p:.3f}', - '{ents_r:.3f}', - '{ents_f:.3f}', - '{tags_acc:.3f}', - '{token_acc:.3f}', - '{cpu_wps:.1f}', + tpl = ''.join(( + '{:<6d}', + '{dep_loss:<10.3f}', + '{ner_loss:<10.3f}', + '{uas:<8.3f}', + '{ents_p:<8.3f}', + '{ents_r:<8.3f}', + '{ents_f:<8.3f}', + '{tags_acc:<8.3f}', + '{token_acc:<9.3f}', + '{cpu_wps:<9.1f}', '{gpu_wps:.1f}', )) print(tpl.format(itn, **scores)) diff --git a/spacy/displacy/render.py b/spacy/displacy/render.py index 4a494591c..fa84bf87d 100644 --- a/spacy/displacy/render.py +++ b/spacy/displacy/render.py @@ -3,7 +3,7 @@ from __future__ import unicode_literals from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS from .templates import TPL_ENT, TPL_ENTS, TPL_FIGURE, TPL_TITLE, TPL_PAGE -from ..util import minify_html +from ..util import minify_html, escape_html class DependencyRenderer(object): @@ -84,7 
+84,9 @@ class DependencyRenderer(object): """ y = self.offset_y+self.word_spacing x = self.offset_x+i*self.distance - return TPL_DEP_WORDS.format(text=text, tag=tag, x=x, y=y) + html_text = escape_html(text) + return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y) + def render_arrow(self, label, start, end, direction, i): """Render individual arrow. diff --git a/spacy/errors.py b/spacy/errors.py index ad0518fca..b096ab719 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -302,7 +302,7 @@ def _get_warn_excl(arg): return [w_id.strip() for w_id in arg.split(',')] -SPACY_WARNING_FILTER = os.environ.get('SPACY_WARNING_FILTER', 'always') +SPACY_WARNING_FILTER = os.environ.get('SPACY_WARNING_FILTER') SPACY_WARNING_TYPES = _get_warn_types(os.environ.get('SPACY_WARNING_TYPES')) SPACY_WARNING_IGNORE = _get_warn_excl(os.environ.get('SPACY_WARNING_IGNORE')) @@ -329,5 +329,6 @@ def _warn(message, warn_type='user'): category = WARNINGS[warn_type] stack = inspect.stack()[-1] with warnings.catch_warnings(): - warnings.simplefilter(SPACY_WARNING_FILTER, category) + if SPACY_WARNING_FILTER: + warnings.simplefilter(SPACY_WARNING_FILTER, category) warnings.warn_explicit(message, category, stack[1], stack[2]) diff --git a/spacy/gold.pyx b/spacy/gold.pyx index ace5e6b88..0aaa14836 100644 --- a/spacy/gold.pyx +++ b/spacy/gold.pyx @@ -369,12 +369,14 @@ def _consume_os(tags): def _consume_ent(tags): if not tags: return [] - target = tags.pop(0).replace('B', 'I') + tag = tags.pop(0) + target_in = 'I' + tag[1:] + target_last = 'L' + tag[1:] length = 1 - while tags and tags[0] == target: + while tags and tags[0] in {target_in, target_last}: length += 1 tags.pop(0) - label = target[2:] + label = tag[2:] if length == 1: return ['U-' + label] else: diff --git a/spacy/lang/id/lex_attrs.py b/spacy/lang/id/lex_attrs.py index 235cee438..39f7042eb 100644 --- a/spacy/lang/id/lex_attrs.py +++ b/spacy/lang/id/lex_attrs.py @@ -4,19 +4,10 @@ from __future__ import unicode_literals from
...attrs import LIKE_NUM -_num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', - 'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', - 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty', - 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', - 'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion', - 'gajillion', 'bazillion', - 'nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh', - 'delapan', 'sembilan', 'sepuluh', 'sebelas', 'duabelas', 'tigabelas', - 'empatbelas', 'limabelas', 'enambelas', 'tujuhbelas', 'delapanbelas', - 'sembilanbelas', 'duapuluh', 'seratus', 'seribu', 'sejuta', - 'ribu', 'rb', 'juta', 'jt', 'miliar', 'biliun', 'triliun', - 'kuadriliun', 'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun', - 'noniliun', 'desiliun'] +_num_words = ['nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh', + 'delapan', 'sembilan', 'sepuluh', 'sebelas', 'belas', 'puluh', + 'ratus', 'ribu', 'juta', 'miliar', 'biliun', 'triliun', 'kuadriliun', + 'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun', 'noniliun', 'desiliun'] def like_num(text): diff --git a/spacy/lang/id/norm_exceptions.py b/spacy/lang/id/norm_exceptions.py index cb168dfeb..2468efbcd 100644 --- a/spacy/lang/id/norm_exceptions.py +++ b/spacy/lang/id/norm_exceptions.py @@ -1,14 +1,7 @@ # coding: utf8 from __future__ import unicode_literals -_exc = { - "Rp": "$", - "IDR": "$", - "RMB": "$", - "USD": "$", - "AUD": "$", - "GBP": "$", -} +_exc = {} NORM_EXCEPTIONS = {} diff --git a/spacy/lang/id/tokenizer_exceptions.py b/spacy/lang/id/tokenizer_exceptions.py index 3bba57e4c..1e5282e52 100644 --- a/spacy/lang/id/tokenizer_exceptions.py +++ b/spacy/lang/id/tokenizer_exceptions.py @@ -5,7 +5,7 @@ import regex as re from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS from ..tokenizer_exceptions import URL_PATTERN -from ...symbols import ORTH +from ...symbols import ORTH, LEMMA, NORM _exc = {} @@ -29,17 
+29,58 @@ for orth in ID_BASE_EXCEPTIONS: orth_caps = '-'.join([part.upper() for part in orth.split('-')]) _exc[orth_caps] = [{ORTH: orth_caps}] +for exc_data in [ + {ORTH: "CKG", LEMMA: "Cakung", NORM: "Cakung"}, + {ORTH: "CGP", LEMMA: "Grogol Petamburan", NORM: "Grogol Petamburan"}, + {ORTH: "KSU", LEMMA: "Kepulauan Seribu Utara", NORM: "Kepulauan Seribu Utara"}, + {ORTH: "KYB", LEMMA: "Kebayoran Baru", NORM: "Kebayoran Baru"}, + {ORTH: "TJP", LEMMA: "Tanjungpriok", NORM: "Tanjungpriok"}, + {ORTH: "TNA", LEMMA: "Tanah Abang", NORM: "Tanah Abang"}, + + {ORTH: "BEK", LEMMA: "Bengkayang", NORM: "Bengkayang"}, + {ORTH: "KTP", LEMMA: "Ketapang", NORM: "Ketapang"}, + {ORTH: "MPW", LEMMA: "Mempawah", NORM: "Mempawah"}, + {ORTH: "NGP", LEMMA: "Nanga Pinoh", NORM: "Nanga Pinoh"}, + {ORTH: "NBA", LEMMA: "Ngabang", NORM: "Ngabang"}, + {ORTH: "PTK", LEMMA: "Pontianak", NORM: "Pontianak"}, + {ORTH: "PTS", LEMMA: "Putussibau", NORM: "Putussibau"}, + {ORTH: "SBS", LEMMA: "Sambas", NORM: "Sambas"}, + {ORTH: "SAG", LEMMA: "Sanggau", NORM: "Sanggau"}, + {ORTH: "SED", LEMMA: "Sekadau", NORM: "Sekadau"}, + {ORTH: "SKW", LEMMA: "Singkawang", NORM: "Singkawang"}, + {ORTH: "STG", LEMMA: "Sintang", NORM: "Sintang"}, + {ORTH: "SKD", LEMMA: "Sukadane", NORM: "Sukadane"}, + {ORTH: "SRY", LEMMA: "Sungai Raya", NORM: "Sungai Raya"}, + + {ORTH: "Jan.", LEMMA: "Januari", NORM: "Januari"}, + {ORTH: "Feb.", LEMMA: "Februari", NORM: "Februari"}, + {ORTH: "Mar.", LEMMA: "Maret", NORM: "Maret"}, + {ORTH: "Apr.", LEMMA: "April", NORM: "April"}, + {ORTH: "Jun.", LEMMA: "Juni", NORM: "Juni"}, + {ORTH: "Jul.", LEMMA: "Juli", NORM: "Juli"}, + {ORTH: "Agu.", LEMMA: "Agustus", NORM: "Agustus"}, + {ORTH: "Ags.", LEMMA: "Agustus", NORM: "Agustus"}, + {ORTH: "Sep.", LEMMA: "September", NORM: "September"}, + {ORTH: "Okt.", LEMMA: "Oktober", NORM: "Oktober"}, + {ORTH: "Nov.", LEMMA: "November", NORM: "November"}, + {ORTH: "Des.", LEMMA: "Desember", NORM: "Desember"}]: + _exc[exc_data[ORTH]] = [exc_data] for 
orth in [ - "'d", "a.m.", "Adm.", "Bros.", "co.", "Co.", "Corp.", "D.C.", "Dr.", "e.g.", - "E.g.", "E.G.", "Gen.", "Gov.", "i.e.", "I.e.", "I.E.", "Inc.", "Jr.", - "Ltd.", "Md.", "Messrs.", "Mo.", "Mont.", "Mr.", "Mrs.", "Ms.", "p.m.", - "Ph.D.", "Rep.", "Rev.", "Sen.", "St.", "vs.", + "A.AB.", "A.Ma.", "A.Md.", "A.Md.Keb.", "A.Md.Kep.", "A.P.", "B.A.", "B.Ch.E.", "B.Sc.", "Dr.", "Dra.", "Drs.", "Hj.", "Ka.", "Kp.", - "M.Ag.", "M.Hum.", "M.Kes,", "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Sc.", - "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.", "S.Ag.", - "S.E.", "S.H.", "S.Hut.", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", - "S.Pd.", "S.Pol.", "S.Psi.", "S.S.", "S.Sos.", "S.T.", "S.Tekp.", "S.Th.", + "M.AB", "M.Ag.", "M.AP", "M.Arl", "M.A.R.S", "M.Hum.", "M.I.Kom.", "M.Kes.", + "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Psi.", "M.Psi.T.", "M.Sc.", "M.SArl", + "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.", + "S.AB", "S.AP", "S.Adm", "S.Ag.", "S.Agr", "S.Ant", "S.Arl", "S.Ars", + "S.A.R.S", "S.Ds", "S.E.", "S.E.I.", "S.Farm", "S.Gz.", "S.H.", "S.Han", + "S.H.Int", "S.Hum", "S.Hut.", "S.In.", "S.IK.", "S.I.Kom.", "S.I.P", "S.IP", + "S.P.", "S.Pt", "S.Psi", "S.Ptk", "S.Keb", "S.Ked", "S.Kep", "S.KG", "S.KH", + "S.Kel", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", "S.KPM", "S.Mb", "S.Mat", + "S.Par", "S.Pd.", "S.Pd.I.", "S.Pd.SD", "S.Pol.", "S.Psi.", "S.S.", "S.SArl.", + "S.Sn", "S.Si.", "S.Si.Teol.", "S.SI.", "S.ST.", "S.ST.Han", "S.STP", "S.Sos.", + "S.Sy.", "S.T.", "S.T.Han", "S.Th.", "S.Th.I", "S.TI.", "S.T.P.", "S.TrK", + "S.Tekp.", "S.Th.", "a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.", "dll.", "dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.", "i/o", "n.b.", "p.p.",
"pjs.", "s.d.", "tel.", "u.p.", diff --git a/spacy/lang/ro/lex_attrs.py b/spacy/lang/ro/lex_attrs.py new file mode 100644 index 000000000..48027186b --- /dev/null +++ b/spacy/lang/ro/lex_attrs.py @@ -0,0 +1,42 @@ +# coding: utf8 +from __future__ import unicode_literals + +from ...attrs import LIKE_NUM + + +_num_words = set(""" +zero unu doi două trei patru cinci șase șapte opt nouă zece +unsprezece doisprezece douăsprezece treisprezece patrusprezece cincisprezece șaisprezece șaptesprezece optsprezece nouăsprezece +douăzeci treizeci patruzeci cincizeci șaizeci șaptezeci optzeci nouăzeci +sută mie milion miliard bilion trilion cvadrilion catralion cvintilion sextilion septilion enșpemii +""".split()) + +_ordinal_words = set(""" +primul doilea treilea patrulea cincilea șaselea șaptelea optulea nouălea zecelea +prima doua treia patra cincia șasea șaptea opta noua zecea +unsprezecelea doisprezecelea treisprezecelea patrusprezecelea cincisprezecelea șaisprezecelea șaptesprezecelea optsprezecelea nouăsprezecelea +unsprezecea douăsprezecea treisprezecea patrusprezecea cincisprezecea șaisprezecea șaptesprezecea optsprezecea nouăsprezecea +douăzecilea treizecilea patruzecilea cincizecilea șaizecilea șaptezecilea optzecilea nouăzecilea sutălea +douăzecea treizecea patruzecea cincizecea șaizecea șaptezecea optzecea nouăzecea suta +miilea mielea mia milionulea milioana miliardulea miliardelea miliarda enșpemia +""".split()) + + +def like_num(text): + text = text.replace(',', '').replace('.', '') + if text.isdigit(): + return True + if text.count('/') == 1: + num, denom = text.split('/') + if num.isdigit() and denom.isdigit(): + return True + if text.lower() in _num_words: + return True + if text.lower() in _ordinal_words: + return True + return False + + +LEX_ATTRS = { + LIKE_NUM: like_num +} diff --git a/spacy/lang/ro/tokenizer_exceptions.py b/spacy/lang/ro/tokenizer_exceptions.py index 42ccd6a93..bc501c32a 100644 --- a/spacy/lang/ro/tokenizer_exceptions.py +++ 
b/spacy/lang/ro/tokenizer_exceptions.py @@ -9,8 +9,9 @@ _exc = {} # Source: https://en.wiktionary.org/wiki/Category:Romanian_abbreviations for orth in [ - "1-a", "1-ul", "10-a", "10-lea", "2-a", "3-a", "3-lea", "6-lea", - "d-voastră", "dvs.", "Rom.", "str."]: + "1-a", "2-a", "3-a", "4-a", "5-a", "6-a", "7-a", "8-a", "9-a", "10-a", "11-a", "12-a", + "1-ul", "2-lea", "3-lea", "4-lea", "5-lea", "6-lea", "7-lea", "8-lea", "9-lea", "10-lea", "11-lea", "12-lea", + "d-voastră", "dvs.", "ing.", "dr.", "Rom.", "str.", "nr.", "etc.", "d.p.d.v.", "dpdv", "șamd.", "ș.a.m.d."]: _exc[orth] = [{ORTH: orth}] diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index 67f7479d1..f703489e3 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -15,12 +15,12 @@ from .. import util # here if it's using spaCy's tokenizer (not a different library) # TODO: re-implement generic tokenizer tests _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id', - 'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'ar', 'xx'] + 'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'xx'] _models = {'en': ['en_core_web_sm'], - 'de': ['de_core_news_md'], + 'de': ['de_core_news_sm'], 'fr': ['fr_core_news_sm'], - 'xx': ['xx_ent_web_md'], + 'xx': ['xx_ent_web_sm'], 'en_core_web_md': ['en_core_web_md'], 'es_core_news_md': ['es_core_news_md']} diff --git a/spacy/tests/lang/ro/test_tokenizer.py b/spacy/tests/lang/ro/test_tokenizer.py new file mode 100644 index 000000000..e754eaeae --- /dev/null +++ b/spacy/tests/lang/ro/test_tokenizer.py @@ -0,0 +1,25 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest + +DEFAULT_TESTS = [ + ('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']), + ('Teste, etc.', ['Teste', ',', 'etc.']), + ('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']), + ('Și d.p.d.v. 
al...', ['Și', 'd.p.d.v.', 'al', '...']) +] + +NUMBER_TESTS = [ + ('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']), + ('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.']) +] + +TESTCASES = DEFAULT_TESTS + NUMBER_TESTS + + +@pytest.mark.parametrize('text,expected_tokens', TESTCASES) +def test_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens): + tokens = ro_tokenizer(text) + token_list = [token.text for token in tokens if not token.is_space] + assert expected_tokens == token_list diff --git a/spacy/tests/regression/test_issue2361.py b/spacy/tests/regression/test_issue2361.py new file mode 100644 index 000000000..a2ed38077 --- /dev/null +++ b/spacy/tests/regression/test_issue2361.py @@ -0,0 +1,14 @@ +from __future__ import unicode_literals +import pytest + +from ...displacy import render +from ..util import get_doc + +def test_issue2361(de_tokenizer): + tokens = de_tokenizer('< > & " ') + html = render(get_doc(tokens.vocab, [t.text for t in tokens])) + + assert '&lt;' in html + assert '&gt;' in html + assert '&amp;' in html + assert '&quot;' in html diff --git a/spacy/tests/regression/test_issue2385.py b/spacy/tests/regression/test_issue2385.py new file mode 100644 index 000000000..b3e4ba11a --- /dev/null +++ b/spacy/tests/regression/test_issue2385.py @@ -0,0 +1,34 @@ +# coding: utf-8 +import pytest + +from ...gold import iob_to_biluo + + +@pytest.mark.xfail +@pytest.mark.parametrize('tags', [('B-ORG', 'L-ORG'), + ('B-PERSON', 'I-PERSON', 'L-PERSON'), + ('U-BRAWLER', 'U-BRAWLER')]) +def test_issue2385_biluo(tags): + """already biluo format""" + assert iob_to_biluo(tags) == list(tags) + + +@pytest.mark.xfail +@pytest.mark.parametrize('tags', [('B-BRAWLER', 'I-BRAWLER', 'I-BRAWLER')]) +def test_issue2385_iob_bcharacter(tags): + """fix bug in labels with a 'b' character""" + assert iob_to_biluo(tags) == ['B-BRAWLER', 'I-BRAWLER', 'L-BRAWLER'] + + +@pytest.mark.xfail +@pytest.mark.parametrize('tags', [('I-ORG', 'I-ORG', 'B-ORG')]) +def test_issue2385_iob1(tags): + """maintain
support for iob1 format""" + assert iob_to_biluo(tags) == ['B-ORG', 'L-ORG', 'U-ORG'] + + +@pytest.mark.xfail +@pytest.mark.parametrize('tags', [('B-PERSON', 'I-PERSON', 'B-PERSON')]) +def test_issue2385_iob2(tags): + """maintain support for iob2 format""" + assert iob_to_biluo(tags) == ['B-PERSON', 'L-PERSON', 'U-PERSON'] diff --git a/spacy/util.py b/spacy/util.py index 80adf7257..fbf35950c 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -632,6 +632,20 @@ def minify_html(html): return html.strip().replace('    ', '').replace('\n', '') +def escape_html(text): + """Replace <, >, &, " with their HTML encoded representation. Intended to + prevent HTML errors in rendered displaCy markup. + + text (unicode): The original text. + RETURNS (unicode): Equivalent text to be safely used within HTML. + """ + text = text.replace('&', '&amp;') + text = text.replace('<', '&lt;') + text = text.replace('>', '&gt;') + text = text.replace('"', '&quot;') + return text + + def use_gpu(gpu_id): try: import cupy.cuda.device diff --git a/website/api/_annotation/_training.jade b/website/api/_annotation/_training.jade index 9bd59cdae..8658866aa 100644 --- a/website/api/_annotation/_training.jade +++ b/website/api/_annotation/_training.jade @@ -53,7 +53,7 @@ p +tag-new(2) p - | The populate a model's vocabulary, you can use the + | To populate a model's vocabulary, you can use the | #[+api("cli#vocab") #[code spacy vocab]] command and load in a | #[+a("https://jsonlines.readthedocs.io/en/latest/") newline-delimited JSON] | (JSONL) file containing one lexical entry per line.
The first line diff --git a/website/models/_data.json b/website/models/_data.json index 06164d299..9c3cb8880 100644 --- a/website/models/_data.json +++ b/website/models/_data.json @@ -116,6 +116,7 @@ "hr": "Croatian", "tr": "Turkish", "he": "Hebrew", + "ar": "Arabic", "fa": "Persian", "ga": "Irish", "bn": "Bengali", diff --git a/website/universe/universe.json b/website/universe/universe.json index 21ec0177c..3910785e4 100644 --- a/website/universe/universe.json +++ b/website/universe/universe.json @@ -872,8 +872,37 @@ }, "category": ["standalone"], "tags": [ "question-answering", "elasticsearch"] + }, + { + "id": "self-attentive-parser", + "title": "Berkeley Neural Parser", + "slogan": "Constituency Parsing with a Self-Attentive Encoder (ACL 2018)", + "description": "A Python implementation of the parsers described in *\"Constituency Parsing with a Self-Attentive Encoder\"* from ACL 2018.", + "url": "https://arxiv.org/abs/1805.01052", + "github": "nikitakit/self-attentive-parser", + "pip": "benepar", + "code_example": [ + "import spacy", + "from benepar.spacy_plugin import BeneparComponent", + "", + "nlp = spacy.load('en')", + "nlp.add_pipe(BeneparComponent('benepar_en'))", + "doc = nlp(u\"The time for action is now. It's never too late to do something.\")", + "sent = list(doc.sents)[0]", + "print(sent._.parse_string)", + "# (S (NP (NP (DT The) (NN time)) (PP (IN for) (NP (NN action)))) (VP (VBZ is) (ADVP (RB now))) (.
.))", + "print(sent._.labels)", + "# ('S',)", + "print(list(sent._.children)[0])", + "# The time for action" + ], + "author": "Nikita Kitaev", + "author_links": { + "github": "nikitakit", + "website": "http://kitaev.io" + }, + "category": ["research", "pipeline"] } - ], "projectCats": { "pipeline": { diff --git a/website/usage/_install/_quickstart.jade b/website/usage/_install/_quickstart.jade index f2d4383a4..976e7d4ad 100644 --- a/website/usage/_install/_quickstart.jade +++ b/website/usage/_install/_quickstart.jade @@ -16,7 +16,9 @@ +qs({package: 'source'}) git clone https://github.com/explosion/spaCy +qs({package: 'source'}) cd spaCy - +qs({package: 'source'}) export PYTHONPATH=`pwd` + +qs({package: 'source', os: 'mac'}) export PYTHONPATH=`pwd` + +qs({package: 'source', os: 'linux'}) export PYTHONPATH=`pwd` + +qs({package: 'source', os: 'windows'}) set PYTHONPATH=/path/to/spaCy +qs({package: 'source'}) pip install -r requirements.txt +qs({package: 'source'}) python setup.py build_ext --inplace diff --git a/website/usage/_linguistic-features/_rule-based-matching.jade b/website/usage/_linguistic-features/_rule-based-matching.jade index c0d418d46..9abda6e0f 100644 --- a/website/usage/_linguistic-features/_rule-based-matching.jade +++ b/website/usage/_linguistic-features/_rule-based-matching.jade @@ -184,7 +184,7 @@ p p | In versions before v2.1.0, the semantics of the #[code +] and #[code *] operators - | behave inconsistently. They were usually interpretted + | behave inconsistently. They were usually interpreted | "greedily", i.e. longer matches are returned where possible. However, if | you specify two #[code +] and #[code *] patterns in a row and their | matches overlap, the first operator will behave non-greedily.
This quirk @@ -260,41 +260,6 @@ p doc = nlp(u"This is a text about Google I/O 2015.") matches = matcher(doc) -p - | In addition to mentions of "Google I/O", your data also contains some - | annoying pre-processing artefacts, like leftover HTML line breaks - | (e.g. #[code <br>] or #[code <BR/>]). While you're at it, - | you want to merge those into one token and flag them, to make sure you - | can easily ignore them later. So you add a second pattern and pass in a - | function #[code merge_and_flag]: - -+code-exec. - import spacy - from spacy.matcher import Matcher - from spacy.tokens import Token - - nlp = spacy.load('en_core_web_sm') - matcher = Matcher(nlp.vocab) - # register a new token extension to flag bad HTML - Token.set_extension('bad_html', default=False) - - def merge_and_flag(matcher, doc, i, matches): - match_id, start, end = matches[i] - span = doc[start : end] - span.merge(is_stop=True) # merge (and mark it as a stop word, just in case) - for token in span: - token._.bad_html = True # mark token as bad HTML - print(span.text) - - matcher.add('BAD_HTML', merge_and_flag, - [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}], - [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}]) - - doc = nlp(u"Hello<br>world!") - matches = matcher(doc) - for token in doc: - print(token.text, token._.bad_html) - +aside("Tip: Visualizing matches") | When working with entities, you can use #[+api("top-level#displacy") displaCy] | to quickly generate a NER visualization from your updated #[code Doc], @@ -315,7 +280,7 @@ p | that was matched, and invoke it. +code. - doc = nlp(LOTS_OF_TEXT) + doc = nlp(YOUR_TEXT_HERE) matcher(doc) p @@ -348,6 +313,70 @@ p | A list of #[code (match_id, start, end)] tuples, describing the | matches. A match tuple describes a span #[code doc[start:end]]. ++h(3, "matcher-pipeline") Using custom pipeline components + +p + | Let's say your data also contains some annoying pre-processing artefacts, + | like leftover HTML line breaks (e.g. 
#[code <br>] or + | #[code <BR/>]). To make your text easier to analyse, you want to + | merge those into one token and flag them, to make sure you + | can ignore them later. Ideally, this should all be done automatically + | as you process the text. You can achieve this by adding a + | #[+a("/usage/processing-pipelines#custom-components") custom pipeline component] + | that's called on each #[code Doc] object, merges the leftover HTML spans + | and sets an attribute #[code bad_html] on the token. + ++code-exec. + import spacy + from spacy.matcher import Matcher + from spacy.tokens import Token + + # we're using a class because the component needs to be initialised with + # the shared vocab via the nlp object + class BadHTMLMerger(object): + def __init__(self, nlp): + # register a new token extension to flag bad HTML + Token.set_extension('bad_html', default=False) + self.matcher = Matcher(nlp.vocab) + self.matcher.add('BAD_HTML', None, + [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}], + [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}]) + + def __call__(self, doc): + # this method is invoked when the component is called on a Doc + matches = self.matcher(doc) + spans = [] # collect the matched spans here + for match_id, start, end in matches: + spans.append(doc[start:end]) + for span in spans: + span.merge() # merge + for token in span: + token._.bad_html = True # mark token as bad HTML + doc.vocab[token.text].is_stop = True # mark lexeme as stop word + return doc + + nlp = spacy.load('en_core_web_sm') + html_merger = BadHTMLMerger(nlp) + nlp.add_pipe(html_merger, last=True) # add component to the pipeline + doc = nlp(u"Hello<br>world! <br/> This is a test.") + for token in doc: + print(token.text, token._.bad_html) + +p + | Instead of hard-coding the patterns into the component, you could also + | make it take a path to a JSON file containing the patterns. This lets + | you reuse the component with different patterns, depending on your + | application: + ++code. 
+ html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json') + ++infobox + | For more details and examples of how to + | #[strong create custom pipeline components] and + | #[strong extension attributes], see the + | #[+a("/usage/processing-pipelines") usage guide]. + +h(3, "regex") Using regular expressions p diff --git a/website/usage/_vectors-similarity/_custom.jade b/website/usage/_vectors-similarity/_custom.jade index fef22ae71..f5ad402a3 100644 --- a/website/usage/_vectors-similarity/_custom.jade +++ b/website/usage/_vectors-similarity/_custom.jade @@ -52,7 +52,7 @@ p +code(false, "bash"). wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz - python -m spacy init-model /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz + python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz p | This will output a spaCy model in the directory