diff --git a/.github/contributors/EARL_GREYT.md b/.github/contributors/EARL_GREYT.md new file mode 100644 index 000000000..3ee7d4f41 --- /dev/null +++ b/.github/contributors/EARL_GREYT.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | David Weßling | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 27.09.19 | +| GitHub username | EarlGreyT | +| Website (optional) | | diff --git a/.github/contributors/Hazoom.md b/.github/contributors/Hazoom.md new file mode 100644 index 000000000..762cb5bef --- /dev/null +++ b/.github/contributors/Hazoom.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). 
+The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Moshe Hazoom | +| Company name (if applicable) | Amenity Analytics | +| Title or role (if applicable) | NLP Engineer | +| Date | 2019-09-15 | +| GitHub username | Hazoom | +| Website (optional) | | diff --git a/.github/contributors/jaydeepborkar.md b/.github/contributors/jaydeepborkar.md new file mode 100644 index 000000000..32199d596 --- /dev/null +++ b/.github/contributors/jaydeepborkar.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. 
Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [ ] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Jaydeep Borkar | +| Company name (if applicable) | Pune University, India | +| Title or role (if applicable) | CS Undergrad | +| Date | 9/26/2019 | +| GitHub username | jaydeepborkar | +| Website (optional) | http://jaydeepborkar.github.io | diff --git a/.github/contributors/seanBE.md b/.github/contributors/seanBE.md new file mode 100644 index 000000000..5e4b4de0a --- /dev/null +++ b/.github/contributors/seanBE.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. 
Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | ------------------------- | +| Name | Sean Löfgren | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2019-09-17 | +| GitHub username | seanBE | +| Website (optional) | http://seanbe.github.io | diff --git a/.github/contributors/zqianem.md b/.github/contributors/zqianem.md new file mode 100644 index 000000000..13f6ab214 --- /dev/null +++ b/.github/contributors/zqianem.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. 
Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Em Zhan | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2019-09-25 | +| GitHub username | zqianem | +| Website (optional) | | diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 8b02b7055..3c2b56cd3 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -73,9 +73,8 @@ issue body. A few more tips: ### Issue labels -To distinguish issues that are opened by us, the maintainers, we usually add a -💫 to the title. [See this page](https://github.com/explosion/spaCy/labels) -for an overview of the system we use to tag our issues and pull requests. +[See this page](https://github.com/explosion/spaCy/labels) for an overview of +the system we use to tag our issues and pull requests. ## Contributing to the code base diff --git a/Makefile b/Makefile index 2834096b7..0f5c31ca6 100644 --- a/Makefile +++ b/Makefile @@ -1,7 +1,17 @@ SHELL := /bin/bash sha = $(shell "git" "rev-parse" "--short" "HEAD") +version = $(shell "bin/get-version.sh") +wheel = spacy-$(version)-cp36-cp36m-linux_x86_64.whl -dist/spacy.pex : spacy/*.py* spacy/*/*.py* +dist/spacy.pex : dist/spacy-$(sha).pex + cp dist/spacy-$(sha).pex dist/spacy.pex + chmod a+rx dist/spacy.pex + +dist/spacy-$(sha).pex : dist/$(wheel) + env3.6/bin/python -m pip install pex==1.5.3 + env3.6/bin/pex pytest dist/$(wheel) -e spacy -o dist/spacy-$(sha).pex + +dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py* python3.6 -m venv env3.6 source env3.6/bin/activate env3.6/bin/pip install wheel @@ -9,10 +19,6 @@ dist/spacy.pex : spacy/*.py* spacy/*/*.py* env3.6/bin/python setup.py build_ext --inplace env3.6/bin/python setup.py sdist env3.6/bin/python setup.py bdist_wheel - env3.6/bin/python -m pip install pex==1.5.3 - env3.6/bin/pex pytest dist/*.whl -e spacy -o dist/spacy-$(sha).pex - cp dist/spacy-$(sha).pex dist/spacy.pex - chmod a+rx dist/spacy.pex .PHONY : clean diff --git a/README.md 
b/README.md index 27a49f465..6bdbc7e46 100644 --- a/README.md +++ b/README.md @@ -49,9 +49,12 @@ It's commercial open-source software, released under the MIT license. ## 💬 Where to ask questions The spaCy project is maintained by [@honnibal](https://github.com/honnibal) -and [@ines](https://github.com/ines). Please understand that we won't be able -to provide individual support via email. We also believe that help is much more -valuable if it's shared publicly, so that more people can benefit from it. +and [@ines](https://github.com/ines), along with core contributors +[@svlandeg](https://github.com/svlandeg) and +[@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't +be able to provide individual support via email. We also believe that help is +much more valuable if it's shared publicly, so that more people can benefit +from it. | Type | Platforms | | ------------------------ | ------------------------------------------------------ | @@ -172,8 +175,8 @@ python -m spacy download en_core_web_sm python -m spacy download en # pip install .tar.gz archive from path or URL -pip install /Users/you/en_core_web_sm-2.1.0.tar.gz -pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz +pip install /Users/you/en_core_web_sm-2.2.0.tar.gz +pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz ``` ### Loading and using models diff --git a/azure-pipelines.yml b/azure-pipelines.yml index c5fa563be..c23995de6 100644 --- a/azure-pipelines.yml +++ b/azure-pipelines.yml @@ -79,14 +79,24 @@ jobs: # Downgrading pip is necessary to prevent a wheel version incompatiblity. # Might be fixed in the future or some other way, so investigate again. 
- script: | - python -m pip install --upgrade pip==18.1 + python -m pip install -U pip==18.1 setuptools pip install -r requirements.txt displayName: 'Install dependencies' - script: | python setup.py build_ext --inplace - pip install -e . - displayName: 'Build and install' + python setup.py sdist --formats=gztar + displayName: 'Compile and build sdist' - - script: python -m pytest --tb=native spacy + - task: DeleteFiles@1 + inputs: + contents: 'spacy' + displayName: 'Delete source directory' + + - bash: | + SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1) + pip install dist/$SDIST + displayName: 'Install from sdist' + + - script: python -m pytest --pyargs spacy displayName: 'Run tests' diff --git a/bin/get-version.sh b/bin/get-version.sh new file mode 100755 index 000000000..5a12ddd7a --- /dev/null +++ b/bin/get-version.sh @@ -0,0 +1,12 @@ +#!/usr/bin/env bash + +set -e + +version=$(grep "__version__ = " spacy/about.py) +version=${version/__version__ = } +version=${version/\'/} +version=${version/\'/} +version=${version/\"/} +version=${version/\"/} + +echo $version diff --git a/bin/ud/run_eval.py b/bin/ud/run_eval.py index 171687980..2da476721 100644 --- a/bin/ud/run_eval.py +++ b/bin/ud/run_eval.py @@ -7,14 +7,16 @@ import datetime from pathlib import Path import xml.etree.ElementTree as ET -from spacy.cli.ud import conll17_ud_eval -from spacy.cli.ud.ud_train import write_conllu +import conll17_ud_eval +from ud_train import write_conllu from spacy.lang.lex_attrs import word_shape from spacy.util import get_lang_class # All languages in spaCy - in UD format (note that Norwegian is 'no' instead of 'nb') -ALL_LANGUAGES = "ar, ca, da, de, el, en, es, fa, fi, fr, ga, he, hi, hr, hu, id, " \ - "it, ja, no, nl, pl, pt, ro, ru, sv, tr, ur, vi, zh" +ALL_LANGUAGES = ("af, ar, bg, bn, ca, cs, da, de, el, en, es, et, fa, fi, fr," + "ga, he, hi, hr, hu, id, is, it, ja, kn, ko, lt, lv, mr, no," + "nl, pl, pt, ro, ru, si, sk, sl, sq, sr, sv, ta, te, th, tl," + 
"tr, tt, uk, ur, vi, zh") # Non-parsing tasks that will be evaluated (works for default models) EVAL_NO_PARSE = ['Tokens', 'Words', 'Lemmas', 'Sentences', 'Feats'] @@ -73,10 +75,10 @@ def _contains_blinded_text(stats_xml): tree = ET.parse(stats_xml) root = tree.getroot() total_tokens = int(root.find('size/total/tokens').text) - unique_lemmas = int(root.find('lemmas').get('unique')) + unique_forms = int(root.find('forms').get('unique')) # assume the corpus is largely blinded when there are less than 1% unique tokens - return (unique_lemmas / total_tokens) < 0.01 + return (unique_forms / total_tokens) < 0.01 def fetch_all_treebanks(ud_dir, languages, corpus, best_per_language): @@ -262,22 +264,26 @@ def main(out_path, ud_dir, check_parse=False, langs=ALL_LANGUAGES, exclude_train if not exclude_trained_models: if 'de' in models: models['de'].append(load_model('de_core_news_sm')) - if 'es' in models: - models['es'].append(load_model('es_core_news_sm')) - models['es'].append(load_model('es_core_news_md')) - if 'pt' in models: - models['pt'].append(load_model('pt_core_news_sm')) - if 'it' in models: - models['it'].append(load_model('it_core_news_sm')) - if 'nl' in models: - models['nl'].append(load_model('nl_core_news_sm')) + models['de'].append(load_model('de_core_news_md')) + if 'el' in models: + models['el'].append(load_model('el_core_news_sm')) + models['el'].append(load_model('el_core_news_md')) if 'en' in models: models['en'].append(load_model('en_core_web_sm')) models['en'].append(load_model('en_core_web_md')) models['en'].append(load_model('en_core_web_lg')) + if 'es' in models: + models['es'].append(load_model('es_core_news_sm')) + models['es'].append(load_model('es_core_news_md')) if 'fr' in models: models['fr'].append(load_model('fr_core_news_sm')) models['fr'].append(load_model('fr_core_news_md')) + if 'it' in models: + models['it'].append(load_model('it_core_news_sm')) + if 'nl' in models: + models['nl'].append(load_model('nl_core_news_sm')) + if 'pt' in 
models: + models['pt'].append(load_model('pt_core_news_sm')) with out_path.open(mode='w', encoding='utf-8') as out_file: run_all_evals(models, treebanks, out_file, check_parse, print_freq_tasks) diff --git a/bin/ud/ud_run_test.py b/bin/ud/ud_run_test.py index 1c529c831..de01cf350 100644 --- a/bin/ud/ud_run_test.py +++ b/bin/ud/ud_run_test.py @@ -109,15 +109,13 @@ def write_conllu(docs, file_): merger = Matcher(docs[0].vocab) merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}]) for i, doc in enumerate(docs): - matches = merger(doc) + matches = [] + if doc.is_parsed: + matches = merger(doc) spans = [doc[start : end + 1] for _, start, end in matches] with doc.retokenize() as retokenizer: for span in spans: retokenizer.merge(span) - # TODO: This shouldn't be necessary? Should be handled in merge - for word in doc: - if word.i == word.head.i: - word.dep_ = "ROOT" file_.write("# newdoc id = {i}\n".format(i=i)) for j, sent in enumerate(doc.sents): file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j)) diff --git a/bin/ud/ud_train.py b/bin/ud/ud_train.py index 8f699db4f..c1a1501d9 100644 --- a/bin/ud/ud_train.py +++ b/bin/ud/ud_train.py @@ -25,7 +25,7 @@ import itertools import random import numpy.random -from . 
import conll17_ud_eval +import conll17_ud_eval from spacy import lang from spacy.lang import zh @@ -82,6 +82,8 @@ def read_data( head = int(head) - 1 if head != "0" else id_ sent["words"].append(word) sent["tags"].append(tag) + sent["morphology"].append(_parse_morph_string(morph)) + sent["morphology"][-1].add("POS_%s" % pos) sent["heads"].append(head) sent["deps"].append("ROOT" if dep == "root" else dep) sent["spaces"].append(space_after == "_") @@ -90,10 +92,12 @@ def read_data( if oracle_segments: docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"])) golds.append(GoldParse(docs[-1], **sent)) + assert golds[-1].morphology is not None sent_annots.append(sent) if raw_text and max_doc_length and len(sent_annots) >= max_doc_length: doc, gold = _make_gold(nlp, None, sent_annots) + assert gold.morphology is not None sent_annots = [] docs.append(doc) golds.append(gold) @@ -108,6 +112,17 @@ def read_data( return docs, golds return docs, golds +def _parse_morph_string(morph_string): + if morph_string == '_': + return set() + output = [] + replacements = {'1': 'one', '2': 'two', '3': 'three'} + for feature in morph_string.split('|'): + key, value = feature.split('=') + value = replacements.get(value, value) + value = value.split(',')[0] + output.append('%s_%s' % (key, value.lower())) + return set(output) def read_conllu(file_): docs = [] @@ -141,8 +156,8 @@ def _make_gold(nlp, text, sent_annots, drop_deps=0.0): flat = defaultdict(list) sent_starts = [] for sent in sent_annots: - flat["heads"].extend(len(flat["words"]) + head for head in sent["heads"]) - for field in ["words", "tags", "deps", "entities", "spaces"]: + flat["heads"].extend(len(flat["words"])+head for head in sent["heads"]) + for field in ["words", "tags", "deps", "morphology", "entities", "spaces"]: flat[field].extend(sent[field]) sent_starts.append(True) sent_starts.extend([False] * (len(sent["words"]) - 1)) @@ -214,11 +229,18 @@ def write_conllu(docs, file_): merger = Matcher(docs[0].vocab) 
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}]) for i, doc in enumerate(docs): - matches = merger(doc) + matches = [] + if doc.is_parsed: + matches = merger(doc) spans = [doc[start : end + 1] for _, start, end in matches] + seen_tokens = set() with doc.retokenize() as retokenizer: for span in spans: - retokenizer.merge(span) + span_tokens = set(range(span.start, span.end)) + if not span_tokens.intersection(seen_tokens): + retokenizer.merge(span) + seen_tokens.update(span_tokens) + file_.write("# newdoc id = {i}\n".format(i=i)) for j, sent in enumerate(doc.sents): file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j)) @@ -241,27 +263,29 @@ def write_conllu(docs, file_): def print_progress(itn, losses, ud_scores): fields = { "dep_loss": losses.get("parser", 0.0), + "morph_loss": losses.get("morphologizer", 0.0), "tag_loss": losses.get("tagger", 0.0), "words": ud_scores["Words"].f1 * 100, "sents": ud_scores["Sentences"].f1 * 100, "tags": ud_scores["XPOS"].f1 * 100, "uas": ud_scores["UAS"].f1 * 100, "las": ud_scores["LAS"].f1 * 100, + "morph": ud_scores["Feats"].f1 * 100, } - header = ["Epoch", "Loss", "LAS", "UAS", "TAG", "SENT", "WORD"] + header = ["Epoch", "P.Loss", "M.Loss", "LAS", "UAS", "TAG", "MORPH", "SENT", "WORD"] if itn == 0: print("\t".join(header)) - tpl = "\t".join( - ( - "{:d}", - "{dep_loss:.1f}", - "{las:.1f}", - "{uas:.1f}", - "{tags:.1f}", - "{sents:.1f}", - "{words:.1f}", - ) - ) + tpl = "\t".join(( + "{:d}", + "{dep_loss:.1f}", + "{morph_loss:.1f}", + "{las:.1f}", + "{uas:.1f}", + "{tags:.1f}", + "{morph:.1f}", + "{sents:.1f}", + "{words:.1f}", + )) print(tpl.format(itn, **fields)) @@ -282,25 +306,27 @@ def get_token_conllu(token, i): head = 0 else: head = i + (token.head.i - token.i) + 1 - fields = [ - str(i + 1), - token.text, - token.lemma_, - token.pos_, - token.tag_, - "_", - str(head), - token.dep_.lower(), - "_", - "_", - ] + features = list(token.morph) + feat_str = [] + replacements = {"one": "1", "two": "2", "three": "3"} + for 
feat in features: + if not feat.startswith("begin") and not feat.startswith("end"): + key, value = feat.split("_", 1) + value = replacements.get(value, value) + feat_str.append("%s=%s" % (key, value.title())) + if not feat_str: + feat_str = "_" + else: + feat_str = "|".join(feat_str) + fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, feat_str, + str(head), token.dep_.lower(), "_", "_"] lines.append("\t".join(fields)) return "\n".join(lines) -Token.set_extension("get_conllu_lines", method=get_token_conllu) -Token.set_extension("begins_fused", default=False) -Token.set_extension("inside_fused", default=False) +Token.set_extension("get_conllu_lines", method=get_token_conllu, force=True) +Token.set_extension("begins_fused", default=False, force=True) +Token.set_extension("inside_fused", default=False, force=True) ################## @@ -324,7 +350,8 @@ def load_nlp(corpus, config, vectors=None): def initialize_pipeline(nlp, docs, golds, config, device): - nlp.add_pipe(nlp.create_pipe("tagger")) + nlp.add_pipe(nlp.create_pipe("tagger", config={"set_morphology": False})) + nlp.add_pipe(nlp.create_pipe("morphologizer")) nlp.add_pipe(nlp.create_pipe("parser")) if config.multitask_tag: nlp.parser.add_multitask_objective("tag") @@ -524,14 +551,12 @@ def main( out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i) with nlp.use_params(optimizer.averages): if use_oracle_segments: - parsed_docs, scores = evaluate( - nlp, paths.dev.conllu, paths.dev.conllu, out_path - ) + parsed_docs, scores = evaluate(nlp, paths.dev.conllu, + paths.dev.conllu, out_path) else: - parsed_docs, scores = evaluate( - nlp, paths.dev.text, paths.dev.conllu, out_path - ) - print_progress(i, losses, scores) + parsed_docs, scores = evaluate(nlp, paths.dev.text, + paths.dev.conllu, out_path) + print_progress(i, losses, scores) def _render_parses(i, to_render): diff --git a/examples/pipeline/dummy_entity_linking.py b/examples/pipeline/dummy_entity_linking.py new file mode 100644 
index 000000000..e69de29bb diff --git a/examples/pipeline/wikidata_entity_linking.py b/examples/pipeline/wikidata_entity_linking.py new file mode 100644 index 000000000..e69de29bb diff --git a/examples/training/pretrain_kb.py b/examples/training/pretrain_kb.py index d5281ad42..2c494d5c4 100644 --- a/examples/training/pretrain_kb.py +++ b/examples/training/pretrain_kb.py @@ -8,8 +8,8 @@ For more details, see the documentation: * Knowledge base: https://spacy.io/api/kb * Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking -Compatible with: spaCy vX.X -Last tested with: vX.X +Compatible with: spaCy v2.2 +Last tested with: v2.2 """ from __future__ import unicode_literals, print_function @@ -73,7 +73,6 @@ def main(vocab_path=None, model=None, output_dir=None, n_iter=50): input_dim=INPUT_DIM, desc_width=DESC_WIDTH, epochs=n_iter, - threshold=0.001, ) encoder.train(description_list=descriptions, to_print=True) diff --git a/examples/training/textcat_example_data/CC0.txt b/examples/training/textcat_example_data/CC0.txt new file mode 100644 index 000000000..0e259d42c --- /dev/null +++ b/examples/training/textcat_example_data/CC0.txt @@ -0,0 +1,121 @@ +Creative Commons Legal Code + +CC0 1.0 Universal + + CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE + LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN + ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS + INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES + REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS + PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM + THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED + HEREUNDER. 
+ +Statement of Purpose + +The laws of most jurisdictions throughout the world automatically confer +exclusive Copyright and Related Rights (defined below) upon the creator +and subsequent owner(s) (each and all, an "owner") of an original work of +authorship and/or a database (each, a "Work"). + +Certain owners wish to permanently relinquish those rights to a Work for +the purpose of contributing to a commons of creative, cultural and +scientific works ("Commons") that the public can reliably and without fear +of later claims of infringement build upon, modify, incorporate in other +works, reuse and redistribute as freely as possible in any form whatsoever +and for any purposes, including without limitation commercial purposes. +These owners may contribute to the Commons to promote the ideal of a free +culture and the further production of creative, cultural and scientific +works, or to gain reputation or greater distribution for their Work in +part through the use and efforts of others. + +For these and/or other purposes and motivations, and without any +expectation of additional consideration or compensation, the person +associating CC0 with a Work (the "Affirmer"), to the extent that he or she +is an owner of Copyright and Related Rights in the Work, voluntarily +elects to apply CC0 to the Work and publicly distribute the Work under its +terms, with knowledge of his or her Copyright and Related Rights in the +Work and the meaning and intended legal effect of CC0 on those rights. + +1. Copyright and Related Rights. A Work made available under CC0 may be +protected by copyright and related or neighboring rights ("Copyright and +Related Rights"). Copyright and Related Rights include, but are not +limited to, the following: + + i. the right to reproduce, adapt, distribute, perform, display, + communicate, and translate a Work; + ii. moral rights retained by the original author(s) and/or performer(s); +iii. 
publicity and privacy rights pertaining to a person's image or + likeness depicted in a Work; + iv. rights protecting against unfair competition in regards to a Work, + subject to the limitations in paragraph 4(a), below; + v. rights protecting the extraction, dissemination, use and reuse of data + in a Work; + vi. database rights (such as those arising under Directive 96/9/EC of the + European Parliament and of the Council of 11 March 1996 on the legal + protection of databases, and under any national implementation + thereof, including any amended or successor version of such + directive); and +vii. other similar, equivalent or corresponding rights throughout the + world based on applicable law or treaty, and any national + implementations thereof. + +2. Waiver. To the greatest extent permitted by, but not in contravention +of, applicable law, Affirmer hereby overtly, fully, permanently, +irrevocably and unconditionally waives, abandons, and surrenders all of +Affirmer's Copyright and Related Rights and associated claims and causes +of action, whether now known or unknown (including existing as well as +future claims and causes of action), in the Work (i) in all territories +worldwide, (ii) for the maximum duration provided by applicable law or +treaty (including future time extensions), (iii) in any current or future +medium and for any number of copies, and (iv) for any purpose whatsoever, +including without limitation commercial, advertising or promotional +purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each +member of the public at large and to the detriment of Affirmer's heirs and +successors, fully intending that such Waiver shall not be subject to +revocation, rescission, cancellation, termination, or any other legal or +equitable action to disrupt the quiet enjoyment of the Work by the public +as contemplated by Affirmer's express Statement of Purpose. + +3. Public License Fallback. 
Should any part of the Waiver for any reason +be judged legally invalid or ineffective under applicable law, then the +Waiver shall be preserved to the maximum extent permitted taking into +account Affirmer's express Statement of Purpose. In addition, to the +extent the Waiver is so judged Affirmer hereby grants to each affected +person a royalty-free, non transferable, non sublicensable, non exclusive, +irrevocable and unconditional license to exercise Affirmer's Copyright and +Related Rights in the Work (i) in all territories worldwide, (ii) for the +maximum duration provided by applicable law or treaty (including future +time extensions), (iii) in any current or future medium and for any number +of copies, and (iv) for any purpose whatsoever, including without +limitation commercial, advertising or promotional purposes (the +"License"). The License shall be deemed effective as of the date CC0 was +applied by Affirmer to the Work. Should any part of the License for any +reason be judged legally invalid or ineffective under applicable law, such +partial invalidity or ineffectiveness shall not invalidate the remainder +of the License, and in such case Affirmer hereby affirms that he or she +will not (i) exercise any of his or her remaining Copyright and Related +Rights in the Work or (ii) assert any associated claims and causes of +action with respect to the Work, in either case contrary to Affirmer's +express Statement of Purpose. + +4. Limitations and Disclaimers. + + a. No trademark or patent rights held by Affirmer are waived, abandoned, + surrendered, licensed or otherwise affected by this document. + b. 
Affirmer offers the Work as-is and makes no representations or + warranties of any kind concerning the Work, express, implied, + statutory or otherwise, including without limitation warranties of + title, merchantability, fitness for a particular purpose, non + infringement, or the absence of latent or other defects, accuracy, or + the present or absence of errors, whether or not discoverable, all to + the greatest extent permissible under applicable law. + c. Affirmer disclaims responsibility for clearing rights of other persons + that may apply to the Work or any use thereof, including without + limitation any person's Copyright and Related Rights in the Work. + Further, Affirmer disclaims responsibility for obtaining any necessary + consents, permissions or other rights required for any use of the + Work. + d. Affirmer understands and acknowledges that Creative Commons is not a + party to this document and has no duty or obligation with respect to + this CC0 or use of the Work. diff --git a/examples/training/textcat_example_data/CC_BY-SA-3.0.txt b/examples/training/textcat_example_data/CC_BY-SA-3.0.txt new file mode 100644 index 000000000..604209a80 --- /dev/null +++ b/examples/training/textcat_example_data/CC_BY-SA-3.0.txt @@ -0,0 +1,359 @@ +Creative Commons Legal Code + +Attribution-ShareAlike 3.0 Unported + + CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE + LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN + ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS + INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES + REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR + DAMAGES RESULTING FROM ITS USE. + +License + +THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE +COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY +COPYRIGHT AND/OR OTHER APPLICABLE LAW. 
ANY USE OF THE WORK OTHER THAN AS +AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED. + +BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE +TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY +BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS +CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND +CONDITIONS. + +1. Definitions + + a. "Adaptation" means a work based upon the Work, or upon the Work and + other pre-existing works, such as a translation, adaptation, + derivative work, arrangement of music or other alterations of a + literary or artistic work, or phonogram or performance and includes + cinematographic adaptations or any other form in which the Work may be + recast, transformed, or adapted including in any form recognizably + derived from the original, except that a work that constitutes a + Collection will not be considered an Adaptation for the purpose of + this License. For the avoidance of doubt, where the Work is a musical + work, performance or phonogram, the synchronization of the Work in + timed-relation with a moving image ("synching") will be considered an + Adaptation for the purpose of this License. + b. "Collection" means a collection of literary or artistic works, such as + encyclopedias and anthologies, or performances, phonograms or + broadcasts, or other works or subject matter other than works listed + in Section 1(f) below, which, by reason of the selection and + arrangement of their contents, constitute intellectual creations, in + which the Work is included in its entirety in unmodified form along + with one or more other contributions, each constituting separate and + independent works in themselves, which together are assembled into a + collective whole. A work that constitutes a Collection will not be + considered an Adaptation (as defined below) for the purposes of this + License. + c. 
"Creative Commons Compatible License" means a license that is listed + at https://creativecommons.org/compatiblelicenses that has been + approved by Creative Commons as being essentially equivalent to this + License, including, at a minimum, because that license: (i) contains + terms that have the same purpose, meaning and effect as the License + Elements of this License; and, (ii) explicitly permits the relicensing + of adaptations of works made available under that license under this + License or a Creative Commons jurisdiction license with the same + License Elements as this License. + d. "Distribute" means to make available to the public the original and + copies of the Work or Adaptation, as appropriate, through sale or + other transfer of ownership. + e. "License Elements" means the following high-level license attributes + as selected by Licensor and indicated in the title of this License: + Attribution, ShareAlike. + f. "Licensor" means the individual, individuals, entity or entities that + offer(s) the Work under the terms of this License. + g. "Original Author" means, in the case of a literary or artistic work, + the individual, individuals, entity or entities who created the Work + or if no individual or entity can be identified, the publisher; and in + addition (i) in the case of a performance the actors, singers, + musicians, dancers, and other persons who act, sing, deliver, declaim, + play in, interpret or otherwise perform literary or artistic works or + expressions of folklore; (ii) in the case of a phonogram the producer + being the person or legal entity who first fixes the sounds of a + performance or other sounds; and, (iii) in the case of broadcasts, the + organization that transmits the broadcast. + h. 
"Work" means the literary and/or artistic work offered under the terms + of this License including without limitation any production in the + literary, scientific and artistic domain, whatever may be the mode or + form of its expression including digital form, such as a book, + pamphlet and other writing; a lecture, address, sermon or other work + of the same nature; a dramatic or dramatico-musical work; a + choreographic work or entertainment in dumb show; a musical + composition with or without words; a cinematographic work to which are + assimilated works expressed by a process analogous to cinematography; + a work of drawing, painting, architecture, sculpture, engraving or + lithography; a photographic work to which are assimilated works + expressed by a process analogous to photography; a work of applied + art; an illustration, map, plan, sketch or three-dimensional work + relative to geography, topography, architecture or science; a + performance; a broadcast; a phonogram; a compilation of data to the + extent it is protected as a copyrightable work; or a work performed by + a variety or circus performer to the extent it is not otherwise + considered a literary or artistic work. + i. "You" means an individual or entity exercising rights under this + License who has not previously violated the terms of this License with + respect to the Work, or who has received express permission from the + Licensor to exercise rights under this License despite a previous + violation. + j. 
"Publicly Perform" means to perform public recitations of the Work and + to communicate to the public those public recitations, by any means or + process, including by wire or wireless means or public digital + performances; to make available to the public Works in such a way that + members of the public may access these Works from a place and at a + place individually chosen by them; to perform the Work to the public + by any means or process and the communication to the public of the + performances of the Work, including by public digital performance; to + broadcast and rebroadcast the Work by any means including signs, + sounds or images. + k. "Reproduce" means to make copies of the Work by any means including + without limitation by sound or visual recordings and the right of + fixation and reproducing fixations of the Work, including storage of a + protected performance or phonogram in digital form or other electronic + medium. + +2. Fair Dealing Rights. Nothing in this License is intended to reduce, +limit, or restrict any uses free from copyright or rights arising from +limitations or exceptions that are provided for in connection with the +copyright protection under copyright law or other applicable laws. + +3. License Grant. Subject to the terms and conditions of this License, +Licensor hereby grants You a worldwide, royalty-free, non-exclusive, +perpetual (for the duration of the applicable copyright) license to +exercise the rights in the Work as stated below: + + a. to Reproduce the Work, to incorporate the Work into one or more + Collections, and to Reproduce the Work as incorporated in the + Collections; + b. to create and Reproduce Adaptations provided that any such Adaptation, + including any translation in any medium, takes reasonable steps to + clearly label, demarcate or otherwise identify that changes were made + to the original Work. 
For example, a translation could be marked "The + original work was translated from English to Spanish," or a + modification could indicate "The original work has been modified."; + c. to Distribute and Publicly Perform the Work including as incorporated + in Collections; and, + d. to Distribute and Publicly Perform Adaptations. + e. For the avoidance of doubt: + + i. Non-waivable Compulsory License Schemes. In those jurisdictions in + which the right to collect royalties through any statutory or + compulsory licensing scheme cannot be waived, the Licensor + reserves the exclusive right to collect such royalties for any + exercise by You of the rights granted under this License; + ii. Waivable Compulsory License Schemes. In those jurisdictions in + which the right to collect royalties through any statutory or + compulsory licensing scheme can be waived, the Licensor waives the + exclusive right to collect such royalties for any exercise by You + of the rights granted under this License; and, + iii. Voluntary License Schemes. The Licensor waives the right to + collect royalties, whether individually or, in the event that the + Licensor is a member of a collecting society that administers + voluntary licensing schemes, via that society, from any exercise + by You of the rights granted under this License. + +The above rights may be exercised in all media and formats whether now +known or hereafter devised. The above rights include the right to make +such modifications as are technically necessary to exercise the rights in +other media and formats. Subject to Section 8(f), all rights not expressly +granted by Licensor are hereby reserved. + +4. Restrictions. The license granted in Section 3 above is expressly made +subject to and limited by the following restrictions: + + a. You may Distribute or Publicly Perform the Work only under the terms + of this License. 
You must include a copy of, or the Uniform Resource + Identifier (URI) for, this License with every copy of the Work You + Distribute or Publicly Perform. You may not offer or impose any terms + on the Work that restrict the terms of this License or the ability of + the recipient of the Work to exercise the rights granted to that + recipient under the terms of the License. You may not sublicense the + Work. You must keep intact all notices that refer to this License and + to the disclaimer of warranties with every copy of the Work You + Distribute or Publicly Perform. When You Distribute or Publicly + Perform the Work, You may not impose any effective technological + measures on the Work that restrict the ability of a recipient of the + Work from You to exercise the rights granted to that recipient under + the terms of the License. This Section 4(a) applies to the Work as + incorporated in a Collection, but this does not require the Collection + apart from the Work itself to be made subject to the terms of this + License. If You create a Collection, upon notice from any Licensor You + must, to the extent practicable, remove from the Collection any credit + as required by Section 4(c), as requested. If You create an + Adaptation, upon notice from any Licensor You must, to the extent + practicable, remove from the Adaptation any credit as required by + Section 4(c), as requested. + b. You may Distribute or Publicly Perform an Adaptation only under the + terms of: (i) this License; (ii) a later version of this License with + the same License Elements as this License; (iii) a Creative Commons + jurisdiction license (either this or a later license version) that + contains the same License Elements as this License (e.g., + Attribution-ShareAlike 3.0 US)); (iv) a Creative Commons Compatible + License. If you license the Adaptation under one of the licenses + mentioned in (iv), you must comply with the terms of that license. 
If + you license the Adaptation under the terms of any of the licenses + mentioned in (i), (ii) or (iii) (the "Applicable License"), you must + comply with the terms of the Applicable License generally and the + following provisions: (I) You must include a copy of, or the URI for, + the Applicable License with every copy of each Adaptation You + Distribute or Publicly Perform; (II) You may not offer or impose any + terms on the Adaptation that restrict the terms of the Applicable + License or the ability of the recipient of the Adaptation to exercise + the rights granted to that recipient under the terms of the Applicable + License; (III) You must keep intact all notices that refer to the + Applicable License and to the disclaimer of warranties with every copy + of the Work as included in the Adaptation You Distribute or Publicly + Perform; (IV) when You Distribute or Publicly Perform the Adaptation, + You may not impose any effective technological measures on the + Adaptation that restrict the ability of a recipient of the Adaptation + from You to exercise the rights granted to that recipient under the + terms of the Applicable License. This Section 4(b) applies to the + Adaptation as incorporated in a Collection, but this does not require + the Collection apart from the Adaptation itself to be made subject to + the terms of the Applicable License. + c. 
If You Distribute, or Publicly Perform the Work or any Adaptations or + Collections, You must, unless a request has been made pursuant to + Section 4(a), keep intact all copyright notices for the Work and + provide, reasonable to the medium or means You are utilizing: (i) the + name of the Original Author (or pseudonym, if applicable) if supplied, + and/or if the Original Author and/or Licensor designate another party + or parties (e.g., a sponsor institute, publishing entity, journal) for + attribution ("Attribution Parties") in Licensor's copyright notice, + terms of service or by other reasonable means, the name of such party + or parties; (ii) the title of the Work if supplied; (iii) to the + extent reasonably practicable, the URI, if any, that Licensor + specifies to be associated with the Work, unless such URI does not + refer to the copyright notice or licensing information for the Work; + and (iv) , consistent with Ssection 3(b), in the case of an + Adaptation, a credit identifying the use of the Work in the Adaptation + (e.g., "French translation of the Work by Original Author," or + "Screenplay based on original Work by Original Author"). The credit + required by this Section 4(c) may be implemented in any reasonable + manner; provided, however, that in the case of a Adaptation or + Collection, at a minimum such credit will appear, if a credit for all + contributing authors of the Adaptation or Collection appears, then as + part of these credits and in a manner at least as prominent as the + credits for the other contributing authors. 
For the avoidance of + doubt, You may only use the credit required by this Section for the + purpose of attribution in the manner set out above and, by exercising + Your rights under this License, You may not implicitly or explicitly + assert or imply any connection with, sponsorship or endorsement by the + Original Author, Licensor and/or Attribution Parties, as appropriate, + of You or Your use of the Work, without the separate, express prior + written permission of the Original Author, Licensor and/or Attribution + Parties. + d. Except as otherwise agreed in writing by the Licensor or as may be + otherwise permitted by applicable law, if You Reproduce, Distribute or + Publicly Perform the Work either by itself or as part of any + Adaptations or Collections, You must not distort, mutilate, modify or + take other derogatory action in relation to the Work which would be + prejudicial to the Original Author's honor or reputation. Licensor + agrees that in those jurisdictions (e.g. Japan), in which any exercise + of the right granted in Section 3(b) of this License (the right to + make Adaptations) would be deemed to be a distortion, mutilation, + modification or other derogatory action prejudicial to the Original + Author's honor and reputation, the Licensor will waive or not assert, + as appropriate, this Section, to the fullest extent permitted by the + applicable national law, to enable You to reasonably exercise Your + right under Section 3(b) of this License (right to make Adaptations) + but not otherwise. + +5. 
Representations, Warranties and Disclaimer + +UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, LICENSOR +OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY +KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE, +INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY, +FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF +LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS, +WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION +OF IMPLIED WARRANTIES, SO SUCH EXCLUSION MAY NOT APPLY TO YOU. + +6. Limitation on Liability. EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE +LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR +ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES +ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, EVEN IF LICENSOR HAS +BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. + +7. Termination + + a. This License and the rights granted hereunder will terminate + automatically upon any breach by You of the terms of this License. + Individuals or entities who have received Adaptations or Collections + from You under this License, however, will not have their licenses + terminated provided such individuals or entities remain in full + compliance with those licenses. Sections 1, 2, 5, 6, 7, and 8 will + survive any termination of this License. + b. Subject to the above terms and conditions, the license granted here is + perpetual (for the duration of the applicable copyright in the Work). 
+ Notwithstanding the above, Licensor reserves the right to release the + Work under different license terms or to stop distributing the Work at + any time; provided, however that any such election will not serve to + withdraw this License (or any other license that has been, or is + required to be, granted under the terms of this License), and this + License will continue in full force and effect unless terminated as + stated above. + +8. Miscellaneous + + a. Each time You Distribute or Publicly Perform the Work or a Collection, + the Licensor offers to the recipient a license to the Work on the same + terms and conditions as the license granted to You under this License. + b. Each time You Distribute or Publicly Perform an Adaptation, Licensor + offers to the recipient a license to the original Work on the same + terms and conditions as the license granted to You under this License. + c. If any provision of this License is invalid or unenforceable under + applicable law, it shall not affect the validity or enforceability of + the remainder of the terms of this License, and without further action + by the parties to this agreement, such provision shall be reformed to + the minimum extent necessary to make such provision valid and + enforceable. + d. No term or provision of this License shall be deemed waived and no + breach consented to unless such waiver or consent shall be in writing + and signed by the party to be charged with such waiver or consent. + e. This License constitutes the entire agreement between the parties with + respect to the Work licensed here. There are no understandings, + agreements or representations with respect to the Work not specified + here. Licensor shall not be bound by any additional provisions that + may appear in any communication from You. This License may not be + modified without the mutual written agreement of the Licensor and You. + f. 
The rights granted under, and the subject matter referenced, in this + License were drafted utilizing the terminology of the Berne Convention + for the Protection of Literary and Artistic Works (as amended on + September 28, 1979), the Rome Convention of 1961, the WIPO Copyright + Treaty of 1996, the WIPO Performances and Phonograms Treaty of 1996 + and the Universal Copyright Convention (as revised on July 24, 1971). + These rights and subject matter take effect in the relevant + jurisdiction in which the License terms are sought to be enforced + according to the corresponding provisions of the implementation of + those treaty provisions in the applicable national law. If the + standard suite of rights granted under applicable copyright law + includes additional rights not granted under this License, such + additional rights are deemed to be included in the License; this + License is not intended to restrict the license of any rights under + applicable law. + + +Creative Commons Notice + + Creative Commons is not a party to this License, and makes no warranty + whatsoever in connection with the Work. Creative Commons will not be + liable to You or any party on any legal theory for any damages + whatsoever, including without limitation any general, special, + incidental or consequential damages arising in connection to this + license. Notwithstanding the foregoing two (2) sentences, if Creative + Commons has expressly identified itself as the Licensor hereunder, it + shall have all rights and obligations of Licensor. + + Except for the limited purpose of indicating to the public that the + Work is licensed under the CCPL, Creative Commons does not authorize + the use by either party of the trademark "Creative Commons" or any + related trademark or logo of Creative Commons without the prior + written consent of Creative Commons. 
Any permitted use will be in + compliance with Creative Commons' then-current trademark usage + guidelines, as may be published on its website or otherwise made + available upon request from time to time. For the avoidance of doubt, + this trademark restriction does not form part of the License. + + Creative Commons may be contacted at https://creativecommons.org/. diff --git a/examples/training/textcat_example_data/CC_BY-SA-4.0.txt b/examples/training/textcat_example_data/CC_BY-SA-4.0.txt new file mode 100644 index 000000000..a73481c4b --- /dev/null +++ b/examples/training/textcat_example_data/CC_BY-SA-4.0.txt @@ -0,0 +1,428 @@ +Attribution-ShareAlike 4.0 International + +======================================================================= + +Creative Commons Corporation ("Creative Commons") is not a law firm and +does not provide legal services or legal advice. Distribution of +Creative Commons public licenses does not create a lawyer-client or +other relationship. Creative Commons makes its licenses and related +information available on an "as-is" basis. Creative Commons gives no +warranties regarding its licenses, any material licensed under their +terms and conditions, or any related information. Creative Commons +disclaims all liability for damages resulting from their use to the +fullest extent possible. + +Using Creative Commons Public Licenses + +Creative Commons public licenses provide a standard set of terms and +conditions that creators and other rights holders may use to share +original works of authorship and other material subject to copyright +and certain other rights specified in the public license below. The +following considerations are for informational purposes only, are not +exhaustive, and do not form part of our licenses. + + Considerations for licensors: Our public licenses are + intended for use by those authorized to give the public + permission to use material in ways otherwise restricted by + copyright and certain other rights. 
Our licenses are + irrevocable. Licensors should read and understand the terms + and conditions of the license they choose before applying it. + Licensors should also secure all rights necessary before + applying our licenses so that the public can reuse the + material as expected. Licensors should clearly mark any + material not subject to the license. This includes other CC- + licensed material, or material used under an exception or + limitation to copyright. More considerations for licensors: + wiki.creativecommons.org/Considerations_for_licensors + + Considerations for the public: By using one of our public + licenses, a licensor grants the public permission to use the + licensed material under specified terms and conditions. If + the licensor's permission is not necessary for any reason--for + example, because of any applicable exception or limitation to + copyright--then that use is not regulated by the license. Our + licenses grant only permissions under copyright and certain + other rights that a licensor has authority to grant. Use of + the licensed material may still be restricted for other + reasons, including because others have copyright or other + rights in the material. A licensor may make special requests, + such as asking that all changes be marked or described. + Although not required by our licenses, you are encouraged to + respect those requests where reasonable. More considerations + for the public: + wiki.creativecommons.org/Considerations_for_licensees + +======================================================================= + +Creative Commons Attribution-ShareAlike 4.0 International Public +License + +By exercising the Licensed Rights (defined below), You accept and agree +to be bound by the terms and conditions of this Creative Commons +Attribution-ShareAlike 4.0 International Public License ("Public +License"). 
To the extent this Public License may be interpreted as a +contract, You are granted the Licensed Rights in consideration of Your +acceptance of these terms and conditions, and the Licensor grants You +such rights in consideration of benefits the Licensor receives from +making the Licensed Material available under these terms and +conditions. + + +Section 1 -- Definitions. + + a. Adapted Material means material subject to Copyright and Similar + Rights that is derived from or based upon the Licensed Material + and in which the Licensed Material is translated, altered, + arranged, transformed, or otherwise modified in a manner requiring + permission under the Copyright and Similar Rights held by the + Licensor. For purposes of this Public License, where the Licensed + Material is a musical work, performance, or sound recording, + Adapted Material is always produced where the Licensed Material is + synched in timed relation with a moving image. + + b. Adapter's License means the license You apply to Your Copyright + and Similar Rights in Your contributions to Adapted Material in + accordance with the terms and conditions of this Public License. + + c. BY-SA Compatible License means a license listed at + creativecommons.org/compatiblelicenses, approved by Creative + Commons as essentially the equivalent of this Public License. + + d. Copyright and Similar Rights means copyright and/or similar rights + closely related to copyright including, without limitation, + performance, broadcast, sound recording, and Sui Generis Database + Rights, without regard to how the rights are labeled or + categorized. For purposes of this Public License, the rights + specified in Section 2(b)(1)-(2) are not Copyright and Similar + Rights. + + e. 
Effective Technological Measures means those measures that, in the + absence of proper authority, may not be circumvented under laws + fulfilling obligations under Article 11 of the WIPO Copyright + Treaty adopted on December 20, 1996, and/or similar international + agreements. + + f. Exceptions and Limitations means fair use, fair dealing, and/or + any other exception or limitation to Copyright and Similar Rights + that applies to Your use of the Licensed Material. + + g. License Elements means the license attributes listed in the name + of a Creative Commons Public License. The License Elements of this + Public License are Attribution and ShareAlike. + + h. Licensed Material means the artistic or literary work, database, + or other material to which the Licensor applied this Public + License. + + i. Licensed Rights means the rights granted to You subject to the + terms and conditions of this Public License, which are limited to + all Copyright and Similar Rights that apply to Your use of the + Licensed Material and that the Licensor has authority to license. + + j. Licensor means the individual(s) or entity(ies) granting rights + under this Public License. + + k. Share means to provide material to the public by any means or + process that requires permission under the Licensed Rights, such + as reproduction, public display, public performance, distribution, + dissemination, communication, or importation, and to make material + available to the public including in ways that members of the + public may access the material from a place and at a time + individually chosen by them. + + l. Sui Generis Database Rights means rights other than copyright + resulting from Directive 96/9/EC of the European Parliament and of + the Council of 11 March 1996 on the legal protection of databases, + as amended and/or succeeded, as well as other essentially + equivalent rights anywhere in the world. + + m. 
You means the individual or entity exercising the Licensed Rights + under this Public License. Your has a corresponding meaning. + + +Section 2 -- Scope. + + a. License grant. + + 1. Subject to the terms and conditions of this Public License, + the Licensor hereby grants You a worldwide, royalty-free, + non-sublicensable, non-exclusive, irrevocable license to + exercise the Licensed Rights in the Licensed Material to: + + a. reproduce and Share the Licensed Material, in whole or + in part; and + + b. produce, reproduce, and Share Adapted Material. + + 2. Exceptions and Limitations. For the avoidance of doubt, where + Exceptions and Limitations apply to Your use, this Public + License does not apply, and You do not need to comply with + its terms and conditions. + + 3. Term. The term of this Public License is specified in Section + 6(a). + + 4. Media and formats; technical modifications allowed. The + Licensor authorizes You to exercise the Licensed Rights in + all media and formats whether now known or hereafter created, + and to make technical modifications necessary to do so. The + Licensor waives and/or agrees not to assert any right or + authority to forbid You from making technical modifications + necessary to exercise the Licensed Rights, including + technical modifications necessary to circumvent Effective + Technological Measures. For purposes of this Public License, + simply making modifications authorized by this Section 2(a) + (4) never produces Adapted Material. + + 5. Downstream recipients. + + a. Offer from the Licensor -- Licensed Material. Every + recipient of the Licensed Material automatically + receives an offer from the Licensor to exercise the + Licensed Rights under the terms and conditions of this + Public License. + + b. Additional offer from the Licensor -- Adapted Material. 
+ Every recipient of Adapted Material from You + automatically receives an offer from the Licensor to + exercise the Licensed Rights in the Adapted Material + under the conditions of the Adapter's License You apply. + + c. No downstream restrictions. You may not offer or impose + any additional or different terms or conditions on, or + apply any Effective Technological Measures to, the + Licensed Material if doing so restricts exercise of the + Licensed Rights by any recipient of the Licensed + Material. + + 6. No endorsement. Nothing in this Public License constitutes or + may be construed as permission to assert or imply that You + are, or that Your use of the Licensed Material is, connected + with, or sponsored, endorsed, or granted official status by, + the Licensor or others designated to receive attribution as + provided in Section 3(a)(1)(A)(i). + + b. Other rights. + + 1. Moral rights, such as the right of integrity, are not + licensed under this Public License, nor are publicity, + privacy, and/or other similar personality rights; however, to + the extent possible, the Licensor waives and/or agrees not to + assert any such rights held by the Licensor to the limited + extent necessary to allow You to exercise the Licensed + Rights, but not otherwise. + + 2. Patent and trademark rights are not licensed under this + Public License. + + 3. To the extent possible, the Licensor waives any right to + collect royalties from You for the exercise of the Licensed + Rights, whether directly or through a collecting society + under any voluntary or waivable statutory or compulsory + licensing scheme. In all other cases the Licensor expressly + reserves any right to collect such royalties. + + +Section 3 -- License Conditions. + +Your exercise of the Licensed Rights is expressly made subject to the +following conditions. + + a. Attribution. + + 1. If You Share the Licensed Material (including in modified + form), You must: + + a. 
retain the following if it is supplied by the Licensor + with the Licensed Material: + + i. identification of the creator(s) of the Licensed + Material and any others designated to receive + attribution, in any reasonable manner requested by + the Licensor (including by pseudonym if + designated); + + ii. a copyright notice; + + iii. a notice that refers to this Public License; + + iv. a notice that refers to the disclaimer of + warranties; + + v. a URI or hyperlink to the Licensed Material to the + extent reasonably practicable; + + b. indicate if You modified the Licensed Material and + retain an indication of any previous modifications; and + + c. indicate the Licensed Material is licensed under this + Public License, and include the text of, or the URI or + hyperlink to, this Public License. + + 2. You may satisfy the conditions in Section 3(a)(1) in any + reasonable manner based on the medium, means, and context in + which You Share the Licensed Material. For example, it may be + reasonable to satisfy the conditions by providing a URI or + hyperlink to a resource that includes the required + information. + + 3. If requested by the Licensor, You must remove any of the + information required by Section 3(a)(1)(A) to the extent + reasonably practicable. + + b. ShareAlike. + + In addition to the conditions in Section 3(a), if You Share + Adapted Material You produce, the following conditions also apply. + + 1. The Adapter's License You apply must be a Creative Commons + license with the same License Elements, this version or + later, or a BY-SA Compatible License. + + 2. You must include the text of, or the URI or hyperlink to, the + Adapter's License You apply. You may satisfy this condition + in any reasonable manner based on the medium, means, and + context in which You Share Adapted Material. + + 3. 
You may not offer or impose any additional or different terms + or conditions on, or apply any Effective Technological + Measures to, Adapted Material that restrict exercise of the + rights granted under the Adapter's License You apply. + + +Section 4 -- Sui Generis Database Rights. + +Where the Licensed Rights include Sui Generis Database Rights that +apply to Your use of the Licensed Material: + + a. for the avoidance of doubt, Section 2(a)(1) grants You the right + to extract, reuse, reproduce, and Share all or a substantial + portion of the contents of the database; + + b. if You include all or a substantial portion of the database + contents in a database in which You have Sui Generis Database + Rights, then the database in which You have Sui Generis Database + Rights (but not its individual contents) is Adapted Material, + + including for purposes of Section 3(b); and + c. You must comply with the conditions in Section 3(a) if You Share + all or a substantial portion of the contents of the database. + +For the avoidance of doubt, this Section 4 supplements and does not +replace Your obligations under this Public License where the Licensed +Rights include other Copyright and Similar Rights. + + +Section 5 -- Disclaimer of Warranties and Limitation of Liability. + + a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE + EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS + AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF + ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, + IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, + WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR + PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, + ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT + KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT + ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. + + b. 
TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE + TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, + NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, + INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, + COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR + USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN + ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR + DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR + IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. + + c. The disclaimer of warranties and limitation of liability provided + above shall be interpreted in a manner that, to the extent + possible, most closely approximates an absolute disclaimer and + waiver of all liability. + + +Section 6 -- Term and Termination. + + a. This Public License applies for the term of the Copyright and + Similar Rights licensed here. However, if You fail to comply with + this Public License, then Your rights under this Public License + terminate automatically. + + b. Where Your right to use the Licensed Material has terminated under + Section 6(a), it reinstates: + + 1. automatically as of the date the violation is cured, provided + it is cured within 30 days of Your discovery of the + violation; or + + 2. upon express reinstatement by the Licensor. + + For the avoidance of doubt, this Section 6(b) does not affect any + right the Licensor may have to seek remedies for Your violations + of this Public License. + + c. For the avoidance of doubt, the Licensor may also offer the + Licensed Material under separate terms or conditions or stop + distributing the Licensed Material at any time; however, doing so + will not terminate this Public License. + + d. Sections 1, 5, 6, 7, and 8 survive termination of this Public + License. + + +Section 7 -- Other Terms and Conditions. + + a. 
The Licensor shall not be bound by any additional or different + terms or conditions communicated by You unless expressly agreed. + + b. Any arrangements, understandings, or agreements regarding the + Licensed Material not stated herein are separate from and + independent of the terms and conditions of this Public License. + + +Section 8 -- Interpretation. + + a. For the avoidance of doubt, this Public License does not, and + shall not be interpreted to, reduce, limit, restrict, or impose + conditions on any use of the Licensed Material that could lawfully + be made without permission under this Public License. + + b. To the extent possible, if any provision of this Public License is + deemed unenforceable, it shall be automatically reformed to the + minimum extent necessary to make it enforceable. If the provision + cannot be reformed, it shall be severed from this Public License + without affecting the enforceability of the remaining terms and + conditions. + + c. No term or condition of this Public License will be waived and no + failure to comply consented to unless expressly agreed to by the + Licensor. + + d. Nothing in this Public License constitutes or may be interpreted + as a limitation upon, or waiver of, any privileges and immunities + that apply to the Licensor or You, including from the legal + processes of any jurisdiction or authority. + + +======================================================================= + +Creative Commons is not a party to its public +licenses. Notwithstanding, Creative Commons may elect to apply one of +its public licenses to material it publishes and in those instances +will be considered the “Licensor.” The text of the Creative Commons +public licenses is dedicated to the public domain under the CC0 Public +Domain Dedication. 
Except for the limited purpose of indicating that +material is shared under a Creative Commons public license or as +otherwise permitted by the Creative Commons policies published at +creativecommons.org/policies, Creative Commons does not authorize the +use of the trademark "Creative Commons" or any other trademark or logo +of Creative Commons without its prior written consent including, +without limitation, in connection with any unauthorized modifications +to any of its public licenses or any other arrangements, +understandings, or agreements concerning use of licensed material. For +the avoidance of doubt, this paragraph does not form part of the +public licenses. + +Creative Commons may be contacted at creativecommons.org. + diff --git a/examples/training/textcat_example_data/README.md b/examples/training/textcat_example_data/README.md new file mode 100644 index 000000000..1165f0293 --- /dev/null +++ b/examples/training/textcat_example_data/README.md @@ -0,0 +1,34 @@ +## Examples of textcat training data + +spaCy JSON training files were generated from JSONL with: + +``` +python textcatjsonl_to_trainjson.py -m en file.jsonl . +``` + +`cooking.json` is an example with two mutually exclusive labels: + +* `baking` +* `not_baking` + +`jigsaw-toxic-comment.json` is an example with multiple labels per instance: + +* `insult` +* `obscene` +* `severe_toxic` +* `toxic` + +### Data Sources + +* `cooking.jsonl`: https://cooking.stackexchange.com. The meta IDs link to the + original question as `https://cooking.stackexchange.com/questions/ID`, e.g., + `https://cooking.stackexchange.com/questions/2` for the first instance. 
+* `jigsaw-toxic-comment.jsonl`: [Jigsaw Toxic Comments Classification + Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) + +### Data Licenses + +* `cooking.jsonl`: CC BY-SA 4.0 ([`CC_BY-SA-4.0.txt`](CC_BY-SA-4.0.txt)) +* `jigsaw-toxic-comment.jsonl`: + * text: CC BY-SA 3.0 ([`CC_BY-SA-3.0.txt`](CC_BY-SA-3.0.txt)) + * annotation: CC0 ([`CC0.txt`](CC0.txt)) diff --git a/examples/training/textcat_example_data/cooking.json b/examples/training/textcat_example_data/cooking.json new file mode 100644 index 000000000..4bad4db79 --- /dev/null +++ b/examples/training/textcat_example_data/cooking.json @@ -0,0 +1,3487 @@ +[ + { + "id":0, + "paragraphs":[ + { + "raw":"How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven by laying the strips out on a cookie sheet. When using this method, how long should I cook the bacon for, and at what temperature?\n", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"How", + "ner":"O" + }, + { + "id":1, + "orth":"should", + "ner":"O" + }, + { + "id":2, + "orth":"I", + "ner":"O" + }, + { + "id":3, + "orth":"cook", + "ner":"O" + }, + { + "id":4, + "orth":"bacon", + "ner":"O" + }, + { + "id":5, + "orth":"in", + "ner":"O" + }, + { + "id":6, + "orth":"an", + "ner":"O" + }, + { + "id":7, + "orth":"oven", + "ner":"O" + }, + { + "id":8, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":9, + "orth":"\n", + "ner":"O" + }, + { + "id":10, + "orth":"I", + "ner":"O" + }, + { + "id":11, + "orth":"'ve", + "ner":"O" + }, + { + "id":12, + "orth":"heard", + "ner":"O" + }, + { + "id":13, + "orth":"of", + "ner":"O" + }, + { + "id":14, + "orth":"people", + "ner":"O" + }, + { + "id":15, + "orth":"cooking", + "ner":"O" + }, + { + "id":16, + "orth":"bacon", + "ner":"O" + }, + { + "id":17, + "orth":"in", + "ner":"O" + }, + { + "id":18, + "orth":"an", + "ner":"O" + }, + { + "id":19, + "orth":"oven", + "ner":"O" + }, + { + "id":20, + "orth":"by", + "ner":"O" + }, 
+ { + "id":21, + "orth":"laying", + "ner":"O" + }, + { + "id":22, + "orth":"the", + "ner":"O" + }, + { + "id":23, + "orth":"strips", + "ner":"O" + }, + { + "id":24, + "orth":"out", + "ner":"O" + }, + { + "id":25, + "orth":"on", + "ner":"O" + }, + { + "id":26, + "orth":"a", + "ner":"O" + }, + { + "id":27, + "orth":"cookie", + "ner":"O" + }, + { + "id":28, + "orth":"sheet", + "ner":"O" + }, + { + "id":29, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":30, + "orth":"When", + "ner":"O" + }, + { + "id":31, + "orth":"using", + "ner":"O" + }, + { + "id":32, + "orth":"this", + "ner":"O" + }, + { + "id":33, + "orth":"method", + "ner":"O" + }, + { + "id":34, + "orth":",", + "ner":"O" + }, + { + "id":35, + "orth":"how", + "ner":"O" + }, + { + "id":36, + "orth":"long", + "ner":"O" + }, + { + "id":37, + "orth":"should", + "ner":"O" + }, + { + "id":38, + "orth":"I", + "ner":"O" + }, + { + "id":39, + "orth":"cook", + "ner":"O" + }, + { + "id":40, + "orth":"the", + "ner":"O" + }, + { + "id":41, + "orth":"bacon", + "ner":"O" + }, + { + "id":42, + "orth":"for", + "ner":"O" + }, + { + "id":43, + "orth":",", + "ner":"O" + }, + { + "id":44, + "orth":"and", + "ner":"O" + }, + { + "id":45, + "orth":"at", + "ner":"O" + }, + { + "id":46, + "orth":"what", + "ner":"O" + }, + { + "id":47, + "orth":"temperature", + "ner":"O" + }, + { + "id":48, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":49, + "orth":"\n", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"baking", + "value":0.0 + }, + { + "label":"not_baking", + "value":1.0 + } + ] + }, + { + "raw":"What is the difference between white and brown eggs?\nI always use brown extra large eggs, but I can't honestly say why I do this other than habit at this point. 
Are there any distinct advantages or disadvantages like flavor, shelf life, etc?\n", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"What", + "ner":"O" + }, + { + "id":1, + "orth":"is", + "ner":"O" + }, + { + "id":2, + "orth":"the", + "ner":"O" + }, + { + "id":3, + "orth":"difference", + "ner":"O" + }, + { + "id":4, + "orth":"between", + "ner":"O" + }, + { + "id":5, + "orth":"white", + "ner":"O" + }, + { + "id":6, + "orth":"and", + "ner":"O" + }, + { + "id":7, + "orth":"brown", + "ner":"O" + }, + { + "id":8, + "orth":"eggs", + "ner":"O" + }, + { + "id":9, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":10, + "orth":"\n", + "ner":"O" + }, + { + "id":11, + "orth":"I", + "ner":"O" + }, + { + "id":12, + "orth":"always", + "ner":"O" + }, + { + "id":13, + "orth":"use", + "ner":"O" + }, + { + "id":14, + "orth":"brown", + "ner":"O" + }, + { + "id":15, + "orth":"extra", + "ner":"O" + }, + { + "id":16, + "orth":"large", + "ner":"O" + }, + { + "id":17, + "orth":"eggs", + "ner":"O" + }, + { + "id":18, + "orth":",", + "ner":"O" + }, + { + "id":19, + "orth":"but", + "ner":"O" + }, + { + "id":20, + "orth":"I", + "ner":"O" + }, + { + "id":21, + "orth":"ca", + "ner":"O" + }, + { + "id":22, + "orth":"n't", + "ner":"O" + }, + { + "id":23, + "orth":"honestly", + "ner":"O" + }, + { + "id":24, + "orth":"say", + "ner":"O" + }, + { + "id":25, + "orth":"why", + "ner":"O" + }, + { + "id":26, + "orth":"I", + "ner":"O" + }, + { + "id":27, + "orth":"do", + "ner":"O" + }, + { + "id":28, + "orth":"this", + "ner":"O" + }, + { + "id":29, + "orth":"other", + "ner":"O" + }, + { + "id":30, + "orth":"than", + "ner":"O" + }, + { + "id":31, + "orth":"habit", + "ner":"O" + }, + { + "id":32, + "orth":"at", + "ner":"O" + }, + { + "id":33, + "orth":"this", + "ner":"O" + }, + { + "id":34, + "orth":"point", + "ner":"O" + }, + { + "id":35, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":36, + "orth":"Are", + "ner":"O" + 
}, + { + "id":37, + "orth":"there", + "ner":"O" + }, + { + "id":38, + "orth":"any", + "ner":"O" + }, + { + "id":39, + "orth":"distinct", + "ner":"O" + }, + { + "id":40, + "orth":"advantages", + "ner":"O" + }, + { + "id":41, + "orth":"or", + "ner":"O" + }, + { + "id":42, + "orth":"disadvantages", + "ner":"O" + }, + { + "id":43, + "orth":"like", + "ner":"O" + }, + { + "id":44, + "orth":"flavor", + "ner":"O" + }, + { + "id":45, + "orth":",", + "ner":"O" + }, + { + "id":46, + "orth":"shelf", + "ner":"O" + }, + { + "id":47, + "orth":"life", + "ner":"O" + }, + { + "id":48, + "orth":",", + "ner":"O" + }, + { + "id":49, + "orth":"etc", + "ner":"O" + }, + { + "id":50, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":51, + "orth":"\n", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"baking", + "value":0.0 + }, + { + "label":"not_baking", + "value":1.0 + } + ] + }, + { + "raw":"What is the difference between baking soda and baking powder?\nAnd can I use one in place of the other in certain recipes?\n", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"What", + "ner":"O" + }, + { + "id":1, + "orth":"is", + "ner":"O" + }, + { + "id":2, + "orth":"the", + "ner":"O" + }, + { + "id":3, + "orth":"difference", + "ner":"O" + }, + { + "id":4, + "orth":"between", + "ner":"O" + }, + { + "id":5, + "orth":"baking", + "ner":"O" + }, + { + "id":6, + "orth":"soda", + "ner":"O" + }, + { + "id":7, + "orth":"and", + "ner":"O" + }, + { + "id":8, + "orth":"baking", + "ner":"O" + }, + { + "id":9, + "orth":"powder", + "ner":"O" + }, + { + "id":10, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":11, + "orth":"\n", + "ner":"O" + }, + { + "id":12, + "orth":"And", + "ner":"O" + }, + { + "id":13, + "orth":"can", + "ner":"O" + }, + { + "id":14, + "orth":"I", + "ner":"O" + }, + { + "id":15, + "orth":"use", + "ner":"O" + }, + { + "id":16, + "orth":"one", + "ner":"O" + }, + { + "id":17, + 
"orth":"in", + "ner":"O" + }, + { + "id":18, + "orth":"place", + "ner":"O" + }, + { + "id":19, + "orth":"of", + "ner":"O" + }, + { + "id":20, + "orth":"the", + "ner":"O" + }, + { + "id":21, + "orth":"other", + "ner":"O" + }, + { + "id":22, + "orth":"in", + "ner":"O" + }, + { + "id":23, + "orth":"certain", + "ner":"O" + }, + { + "id":24, + "orth":"recipes", + "ner":"O" + }, + { + "id":25, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":26, + "orth":"\n", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"baking", + "value":0.0 + }, + { + "label":"not_baking", + "value":1.0 + } + ] + }, + { + "raw":"In a tomato sauce recipe, how can I cut the acidity?\nIt seems that every time I make a tomato sauce for pasta, the sauce is a little bit too acid for my taste. I've tried using sugar or sodium bicarbonate, but I'm not satisfied with the results.\n", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"In", + "ner":"O" + }, + { + "id":1, + "orth":"a", + "ner":"O" + }, + { + "id":2, + "orth":"tomato", + "ner":"O" + }, + { + "id":3, + "orth":"sauce", + "ner":"O" + }, + { + "id":4, + "orth":"recipe", + "ner":"O" + }, + { + "id":5, + "orth":",", + "ner":"O" + }, + { + "id":6, + "orth":"how", + "ner":"O" + }, + { + "id":7, + "orth":"can", + "ner":"O" + }, + { + "id":8, + "orth":"I", + "ner":"O" + }, + { + "id":9, + "orth":"cut", + "ner":"O" + }, + { + "id":10, + "orth":"the", + "ner":"O" + }, + { + "id":11, + "orth":"acidity", + "ner":"O" + }, + { + "id":12, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":13, + "orth":"\n", + "ner":"O" + }, + { + "id":14, + "orth":"It", + "ner":"O" + }, + { + "id":15, + "orth":"seems", + "ner":"O" + }, + { + "id":16, + "orth":"that", + "ner":"O" + }, + { + "id":17, + "orth":"every", + "ner":"O" + }, + { + "id":18, + "orth":"time", + "ner":"O" + }, + { + "id":19, + "orth":"I", + "ner":"O" + }, + { + "id":20, + "orth":"make", + 
"ner":"O" + }, + { + "id":21, + "orth":"a", + "ner":"O" + }, + { + "id":22, + "orth":"tomato", + "ner":"O" + }, + { + "id":23, + "orth":"sauce", + "ner":"O" + }, + { + "id":24, + "orth":"for", + "ner":"O" + }, + { + "id":25, + "orth":"pasta", + "ner":"O" + }, + { + "id":26, + "orth":",", + "ner":"O" + }, + { + "id":27, + "orth":"the", + "ner":"O" + }, + { + "id":28, + "orth":"sauce", + "ner":"O" + }, + { + "id":29, + "orth":"is", + "ner":"O" + }, + { + "id":30, + "orth":"a", + "ner":"O" + }, + { + "id":31, + "orth":"little", + "ner":"O" + }, + { + "id":32, + "orth":"bit", + "ner":"O" + }, + { + "id":33, + "orth":"too", + "ner":"O" + }, + { + "id":34, + "orth":"acid", + "ner":"O" + }, + { + "id":35, + "orth":"for", + "ner":"O" + }, + { + "id":36, + "orth":"my", + "ner":"O" + }, + { + "id":37, + "orth":"taste", + "ner":"O" + }, + { + "id":38, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":39, + "orth":"I", + "ner":"O" + }, + { + "id":40, + "orth":"'ve", + "ner":"O" + }, + { + "id":41, + "orth":"tried", + "ner":"O" + }, + { + "id":42, + "orth":"using", + "ner":"O" + }, + { + "id":43, + "orth":"sugar", + "ner":"O" + }, + { + "id":44, + "orth":"or", + "ner":"O" + }, + { + "id":45, + "orth":"sodium", + "ner":"O" + }, + { + "id":46, + "orth":"bicarbonate", + "ner":"O" + }, + { + "id":47, + "orth":",", + "ner":"O" + }, + { + "id":48, + "orth":"but", + "ner":"O" + }, + { + "id":49, + "orth":"I", + "ner":"O" + }, + { + "id":50, + "orth":"'m", + "ner":"O" + }, + { + "id":51, + "orth":"not", + "ner":"O" + }, + { + "id":52, + "orth":"satisfied", + "ner":"O" + }, + { + "id":53, + "orth":"with", + "ner":"O" + }, + { + "id":54, + "orth":"the", + "ner":"O" + }, + { + "id":55, + "orth":"results", + "ner":"O" + }, + { + "id":56, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":57, + "orth":"\n", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"baking", + "value":0.0 + }, + { 
+ "label":"not_baking", + "value":1.0 + } + ] + }, + { + "raw":"What ingredients (available in specific regions) can I substitute for parsley?\nI have a recipe that calls for fresh parsley. I have substituted other fresh herbs for their dried equivalents but I don't have fresh or dried parsley. Is there something else (ex another dried herb) that I can use instead of parsley?\nI know it is used mainly for looks rather than taste but I have a pasta recipe that calls for 2 tablespoons of parsley in the sauce and then another 2 tablespoons on top when it is done. I know the parsley on top is more for looks but there must be something about the taste otherwise it would call for parsley within the sauce as well.\nI would especially like to hear about substitutes available in Southeast Asia and other parts of the world where the obvious answers (such as cilantro) are not widely available.\n", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"What", + "ner":"O" + }, + { + "id":1, + "orth":"ingredients", + "ner":"O" + }, + { + "id":2, + "orth":"(", + "ner":"O" + }, + { + "id":3, + "orth":"available", + "ner":"O" + }, + { + "id":4, + "orth":"in", + "ner":"O" + }, + { + "id":5, + "orth":"specific", + "ner":"O" + }, + { + "id":6, + "orth":"regions", + "ner":"O" + }, + { + "id":7, + "orth":")", + "ner":"O" + }, + { + "id":8, + "orth":"can", + "ner":"O" + }, + { + "id":9, + "orth":"I", + "ner":"O" + }, + { + "id":10, + "orth":"substitute", + "ner":"O" + }, + { + "id":11, + "orth":"for", + "ner":"O" + }, + { + "id":12, + "orth":"parsley", + "ner":"O" + }, + { + "id":13, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":14, + "orth":"\n", + "ner":"O" + }, + { + "id":15, + "orth":"I", + "ner":"O" + }, + { + "id":16, + "orth":"have", + "ner":"O" + }, + { + "id":17, + "orth":"a", + "ner":"O" + }, + { + "id":18, + "orth":"recipe", + "ner":"O" + }, + { + "id":19, + "orth":"that", + "ner":"O" + }, + { + "id":20, + "orth":"calls", + 
"ner":"O" + }, + { + "id":21, + "orth":"for", + "ner":"O" + }, + { + "id":22, + "orth":"fresh", + "ner":"O" + }, + { + "id":23, + "orth":"parsley", + "ner":"O" + }, + { + "id":24, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":25, + "orth":"I", + "ner":"O" + }, + { + "id":26, + "orth":"have", + "ner":"O" + }, + { + "id":27, + "orth":"substituted", + "ner":"O" + }, + { + "id":28, + "orth":"other", + "ner":"O" + }, + { + "id":29, + "orth":"fresh", + "ner":"O" + }, + { + "id":30, + "orth":"herbs", + "ner":"O" + }, + { + "id":31, + "orth":"for", + "ner":"O" + }, + { + "id":32, + "orth":"their", + "ner":"O" + }, + { + "id":33, + "orth":"dried", + "ner":"O" + }, + { + "id":34, + "orth":"equivalents", + "ner":"O" + }, + { + "id":35, + "orth":"but", + "ner":"O" + }, + { + "id":36, + "orth":"I", + "ner":"O" + }, + { + "id":37, + "orth":"do", + "ner":"O" + }, + { + "id":38, + "orth":"n't", + "ner":"O" + }, + { + "id":39, + "orth":"have", + "ner":"O" + }, + { + "id":40, + "orth":"fresh", + "ner":"O" + }, + { + "id":41, + "orth":"or", + "ner":"O" + }, + { + "id":42, + "orth":"dried", + "ner":"O" + }, + { + "id":43, + "orth":"parsley", + "ner":"O" + }, + { + "id":44, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":45, + "orth":"Is", + "ner":"O" + }, + { + "id":46, + "orth":"there", + "ner":"O" + }, + { + "id":47, + "orth":"something", + "ner":"O" + }, + { + "id":48, + "orth":"else", + "ner":"O" + }, + { + "id":49, + "orth":"(", + "ner":"O" + }, + { + "id":50, + "orth":"ex", + "ner":"O" + }, + { + "id":51, + "orth":"another", + "ner":"O" + }, + { + "id":52, + "orth":"dried", + "ner":"O" + }, + { + "id":53, + "orth":"herb", + "ner":"O" + }, + { + "id":54, + "orth":")", + "ner":"O" + }, + { + "id":55, + "orth":"that", + "ner":"O" + }, + { + "id":56, + "orth":"I", + "ner":"O" + }, + { + "id":57, + "orth":"can", + "ner":"O" + }, + { + "id":58, + "orth":"use", + "ner":"O" + }, + { + "id":59, + 
"orth":"instead", + "ner":"O" + }, + { + "id":60, + "orth":"of", + "ner":"O" + }, + { + "id":61, + "orth":"parsley", + "ner":"O" + }, + { + "id":62, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":63, + "orth":"\n", + "ner":"O" + }, + { + "id":64, + "orth":"I", + "ner":"O" + }, + { + "id":65, + "orth":"know", + "ner":"O" + }, + { + "id":66, + "orth":"it", + "ner":"O" + }, + { + "id":67, + "orth":"is", + "ner":"O" + }, + { + "id":68, + "orth":"used", + "ner":"O" + }, + { + "id":69, + "orth":"mainly", + "ner":"O" + }, + { + "id":70, + "orth":"for", + "ner":"O" + }, + { + "id":71, + "orth":"looks", + "ner":"O" + }, + { + "id":72, + "orth":"rather", + "ner":"O" + }, + { + "id":73, + "orth":"than", + "ner":"O" + }, + { + "id":74, + "orth":"taste", + "ner":"O" + }, + { + "id":75, + "orth":"but", + "ner":"O" + }, + { + "id":76, + "orth":"I", + "ner":"O" + }, + { + "id":77, + "orth":"have", + "ner":"O" + }, + { + "id":78, + "orth":"a", + "ner":"O" + }, + { + "id":79, + "orth":"pasta", + "ner":"O" + }, + { + "id":80, + "orth":"recipe", + "ner":"O" + }, + { + "id":81, + "orth":"that", + "ner":"O" + }, + { + "id":82, + "orth":"calls", + "ner":"O" + }, + { + "id":83, + "orth":"for", + "ner":"O" + }, + { + "id":84, + "orth":"2", + "ner":"O" + }, + { + "id":85, + "orth":"tablespoons", + "ner":"O" + }, + { + "id":86, + "orth":"of", + "ner":"O" + }, + { + "id":87, + "orth":"parsley", + "ner":"O" + }, + { + "id":88, + "orth":"in", + "ner":"O" + }, + { + "id":89, + "orth":"the", + "ner":"O" + }, + { + "id":90, + "orth":"sauce", + "ner":"O" + }, + { + "id":91, + "orth":"and", + "ner":"O" + }, + { + "id":92, + "orth":"then", + "ner":"O" + }, + { + "id":93, + "orth":"another", + "ner":"O" + }, + { + "id":94, + "orth":"2", + "ner":"O" + }, + { + "id":95, + "orth":"tablespoons", + "ner":"O" + }, + { + "id":96, + "orth":"on", + "ner":"O" + }, + { + "id":97, + "orth":"top", + "ner":"O" + }, + { + "id":98, + "orth":"when", + "ner":"O" + }, + { + 
"id":99, + "orth":"it", + "ner":"O" + }, + { + "id":100, + "orth":"is", + "ner":"O" + }, + { + "id":101, + "orth":"done", + "ner":"O" + }, + { + "id":102, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":103, + "orth":"I", + "ner":"O" + }, + { + "id":104, + "orth":"know", + "ner":"O" + }, + { + "id":105, + "orth":"the", + "ner":"O" + }, + { + "id":106, + "orth":"parsley", + "ner":"O" + }, + { + "id":107, + "orth":"on", + "ner":"O" + }, + { + "id":108, + "orth":"top", + "ner":"O" + }, + { + "id":109, + "orth":"is", + "ner":"O" + }, + { + "id":110, + "orth":"more", + "ner":"O" + }, + { + "id":111, + "orth":"for", + "ner":"O" + }, + { + "id":112, + "orth":"looks", + "ner":"O" + }, + { + "id":113, + "orth":"but", + "ner":"O" + }, + { + "id":114, + "orth":"there", + "ner":"O" + }, + { + "id":115, + "orth":"must", + "ner":"O" + }, + { + "id":116, + "orth":"be", + "ner":"O" + }, + { + "id":117, + "orth":"something", + "ner":"O" + }, + { + "id":118, + "orth":"about", + "ner":"O" + }, + { + "id":119, + "orth":"the", + "ner":"O" + }, + { + "id":120, + "orth":"taste", + "ner":"O" + }, + { + "id":121, + "orth":"otherwise", + "ner":"O" + }, + { + "id":122, + "orth":"it", + "ner":"O" + }, + { + "id":123, + "orth":"would", + "ner":"O" + }, + { + "id":124, + "orth":"call", + "ner":"O" + }, + { + "id":125, + "orth":"for", + "ner":"O" + }, + { + "id":126, + "orth":"parsley", + "ner":"O" + }, + { + "id":127, + "orth":"within", + "ner":"O" + }, + { + "id":128, + "orth":"the", + "ner":"O" + }, + { + "id":129, + "orth":"sauce", + "ner":"O" + }, + { + "id":130, + "orth":"as", + "ner":"O" + }, + { + "id":131, + "orth":"well", + "ner":"O" + }, + { + "id":132, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":133, + "orth":"\n", + "ner":"O" + }, + { + "id":134, + "orth":"I", + "ner":"O" + }, + { + "id":135, + "orth":"would", + "ner":"O" + }, + { + "id":136, + "orth":"especially", + "ner":"O" + }, + { + "id":137, 
+ "orth":"like", + "ner":"O" + }, + { + "id":138, + "orth":"to", + "ner":"O" + }, + { + "id":139, + "orth":"hear", + "ner":"O" + }, + { + "id":140, + "orth":"about", + "ner":"O" + }, + { + "id":141, + "orth":"substitutes", + "ner":"O" + }, + { + "id":142, + "orth":"available", + "ner":"O" + }, + { + "id":143, + "orth":"in", + "ner":"O" + }, + { + "id":144, + "orth":"Southeast", + "ner":"O" + }, + { + "id":145, + "orth":"Asia", + "ner":"O" + }, + { + "id":146, + "orth":"and", + "ner":"O" + }, + { + "id":147, + "orth":"other", + "ner":"O" + }, + { + "id":148, + "orth":"parts", + "ner":"O" + }, + { + "id":149, + "orth":"of", + "ner":"O" + }, + { + "id":150, + "orth":"the", + "ner":"O" + }, + { + "id":151, + "orth":"world", + "ner":"O" + }, + { + "id":152, + "orth":"where", + "ner":"O" + }, + { + "id":153, + "orth":"the", + "ner":"O" + }, + { + "id":154, + "orth":"obvious", + "ner":"O" + }, + { + "id":155, + "orth":"answers", + "ner":"O" + }, + { + "id":156, + "orth":"(", + "ner":"O" + }, + { + "id":157, + "orth":"such", + "ner":"O" + }, + { + "id":158, + "orth":"as", + "ner":"O" + }, + { + "id":159, + "orth":"cilantro", + "ner":"O" + }, + { + "id":160, + "orth":")", + "ner":"O" + }, + { + "id":161, + "orth":"are", + "ner":"O" + }, + { + "id":162, + "orth":"not", + "ner":"O" + }, + { + "id":163, + "orth":"widely", + "ner":"O" + }, + { + "id":164, + "orth":"available", + "ner":"O" + }, + { + "id":165, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":166, + "orth":"\n", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"baking", + "value":0.0 + }, + { + "label":"not_baking", + "value":1.0 + } + ] + }, + { + "raw":"What is the internal temperature a steak should be cooked to for Rare/Medium Rare/Medium/Well?\nI'd like to know when to take my steaks off the grill and please everybody.\n", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"What", + "ner":"O" + }, + { + "id":1, + "orth":"is", + "ner":"O" 
+ }, + { + "id":2, + "orth":"the", + "ner":"O" + }, + { + "id":3, + "orth":"internal", + "ner":"O" + }, + { + "id":4, + "orth":"temperature", + "ner":"O" + }, + { + "id":5, + "orth":"a", + "ner":"O" + }, + { + "id":6, + "orth":"steak", + "ner":"O" + }, + { + "id":7, + "orth":"should", + "ner":"O" + }, + { + "id":8, + "orth":"be", + "ner":"O" + }, + { + "id":9, + "orth":"cooked", + "ner":"O" + }, + { + "id":10, + "orth":"to", + "ner":"O" + }, + { + "id":11, + "orth":"for", + "ner":"O" + }, + { + "id":12, + "orth":"Rare", + "ner":"O" + }, + { + "id":13, + "orth":"/", + "ner":"O" + }, + { + "id":14, + "orth":"Medium", + "ner":"O" + }, + { + "id":15, + "orth":"Rare", + "ner":"O" + }, + { + "id":16, + "orth":"/", + "ner":"O" + }, + { + "id":17, + "orth":"Medium", + "ner":"O" + }, + { + "id":18, + "orth":"/", + "ner":"O" + }, + { + "id":19, + "orth":"Well", + "ner":"O" + }, + { + "id":20, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":21, + "orth":"\n", + "ner":"O" + }, + { + "id":22, + "orth":"I", + "ner":"O" + }, + { + "id":23, + "orth":"'d", + "ner":"O" + }, + { + "id":24, + "orth":"like", + "ner":"O" + }, + { + "id":25, + "orth":"to", + "ner":"O" + }, + { + "id":26, + "orth":"know", + "ner":"O" + }, + { + "id":27, + "orth":"when", + "ner":"O" + }, + { + "id":28, + "orth":"to", + "ner":"O" + }, + { + "id":29, + "orth":"take", + "ner":"O" + }, + { + "id":30, + "orth":"my", + "ner":"O" + }, + { + "id":31, + "orth":"steaks", + "ner":"O" + }, + { + "id":32, + "orth":"off", + "ner":"O" + }, + { + "id":33, + "orth":"the", + "ner":"O" + }, + { + "id":34, + "orth":"grill", + "ner":"O" + }, + { + "id":35, + "orth":"and", + "ner":"O" + }, + { + "id":36, + "orth":"please", + "ner":"O" + }, + { + "id":37, + "orth":"everybody", + "ner":"O" + }, + { + "id":38, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":39, + "orth":"\n", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + 
"label":"baking", + "value":0.0 + }, + { + "label":"not_baking", + "value":1.0 + } + ] + }, + { + "raw":"How should I poach an egg?\nWhat's the best method to poach an egg without it turning into an eggy soupy mess?\n", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"How", + "ner":"O" + }, + { + "id":1, + "orth":"should", + "ner":"O" + }, + { + "id":2, + "orth":"I", + "ner":"O" + }, + { + "id":3, + "orth":"poach", + "ner":"O" + }, + { + "id":4, + "orth":"an", + "ner":"O" + }, + { + "id":5, + "orth":"egg", + "ner":"O" + }, + { + "id":6, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":7, + "orth":"\n", + "ner":"O" + }, + { + "id":8, + "orth":"What", + "ner":"O" + }, + { + "id":9, + "orth":"'s", + "ner":"O" + }, + { + "id":10, + "orth":"the", + "ner":"O" + }, + { + "id":11, + "orth":"best", + "ner":"O" + }, + { + "id":12, + "orth":"method", + "ner":"O" + }, + { + "id":13, + "orth":"to", + "ner":"O" + }, + { + "id":14, + "orth":"poach", + "ner":"O" + }, + { + "id":15, + "orth":"an", + "ner":"O" + }, + { + "id":16, + "orth":"egg", + "ner":"O" + }, + { + "id":17, + "orth":"without", + "ner":"O" + }, + { + "id":18, + "orth":"it", + "ner":"O" + }, + { + "id":19, + "orth":"turning", + "ner":"O" + }, + { + "id":20, + "orth":"into", + "ner":"O" + }, + { + "id":21, + "orth":"an", + "ner":"O" + }, + { + "id":22, + "orth":"eggy", + "ner":"O" + }, + { + "id":23, + "orth":"soupy", + "ner":"O" + }, + { + "id":24, + "orth":"mess", + "ner":"O" + }, + { + "id":25, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":26, + "orth":"\n", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"baking", + "value":0.0 + }, + { + "label":"not_baking", + "value":1.0 + } + ] + }, + { + "raw":"How can I make my Ice Cream \"creamier\"\nMy ice cream doesn't feel creamy enough. 
I got the recipe from Good Eats, and I can't tell if it's just the recipe or maybe that I'm just not getting my \"batter\" cold enough before I try to make it (I let it chill overnight in the refrigerator, but it doesn't always come out of the machine looking like \"soft serve\" as he said on the show - it's usually a little thinner).\nRecipe: http://www.foodnetwork.com/recipes/alton-brown/serious-vanilla-ice-cream-recipe/index.html\nThanks!\n", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"How", + "ner":"O" + }, + { + "id":1, + "orth":"can", + "ner":"O" + }, + { + "id":2, + "orth":"I", + "ner":"O" + }, + { + "id":3, + "orth":"make", + "ner":"O" + }, + { + "id":4, + "orth":"my", + "ner":"O" + }, + { + "id":5, + "orth":"Ice", + "ner":"O" + }, + { + "id":6, + "orth":"Cream", + "ner":"O" + }, + { + "id":7, + "orth":"\"", + "ner":"O" + }, + { + "id":8, + "orth":"creamier", + "ner":"O" + }, + { + "id":9, + "orth":"\"", + "ner":"O" + }, + { + "id":10, + "orth":"\n", + "ner":"O" + }, + { + "id":11, + "orth":"My", + "ner":"O" + }, + { + "id":12, + "orth":"ice", + "ner":"O" + }, + { + "id":13, + "orth":"cream", + "ner":"O" + }, + { + "id":14, + "orth":"does", + "ner":"O" + }, + { + "id":15, + "orth":"n't", + "ner":"O" + }, + { + "id":16, + "orth":"feel", + "ner":"O" + }, + { + "id":17, + "orth":"creamy", + "ner":"O" + }, + { + "id":18, + "orth":"enough", + "ner":"O" + }, + { + "id":19, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":20, + "orth":" ", + "ner":"O" + }, + { + "id":21, + "orth":"I", + "ner":"O" + }, + { + "id":22, + "orth":"got", + "ner":"O" + }, + { + "id":23, + "orth":"the", + "ner":"O" + }, + { + "id":24, + "orth":"recipe", + "ner":"O" + }, + { + "id":25, + "orth":"from", + "ner":"O" + }, + { + "id":26, + "orth":"Good", + "ner":"O" + }, + { + "id":27, + "orth":"Eats", + "ner":"O" + }, + { + "id":28, + "orth":",", + "ner":"O" + }, + { + "id":29, + "orth":"and", + "ner":"O" + }, + { + "id":30, + "orth":"I", + 
"ner":"O" + }, + { + "id":31, + "orth":"ca", + "ner":"O" + }, + { + "id":32, + "orth":"n't", + "ner":"O" + }, + { + "id":33, + "orth":"tell", + "ner":"O" + }, + { + "id":34, + "orth":"if", + "ner":"O" + }, + { + "id":35, + "orth":"it", + "ner":"O" + }, + { + "id":36, + "orth":"'s", + "ner":"O" + }, + { + "id":37, + "orth":"just", + "ner":"O" + }, + { + "id":38, + "orth":"the", + "ner":"O" + }, + { + "id":39, + "orth":"recipe", + "ner":"O" + }, + { + "id":40, + "orth":"or", + "ner":"O" + }, + { + "id":41, + "orth":"maybe", + "ner":"O" + }, + { + "id":42, + "orth":"that", + "ner":"O" + }, + { + "id":43, + "orth":"I", + "ner":"O" + }, + { + "id":44, + "orth":"'m", + "ner":"O" + }, + { + "id":45, + "orth":"just", + "ner":"O" + }, + { + "id":46, + "orth":"not", + "ner":"O" + }, + { + "id":47, + "orth":"getting", + "ner":"O" + }, + { + "id":48, + "orth":"my", + "ner":"O" + }, + { + "id":49, + "orth":"\"", + "ner":"O" + }, + { + "id":50, + "orth":"batter", + "ner":"O" + }, + { + "id":51, + "orth":"\"", + "ner":"O" + }, + { + "id":52, + "orth":"cold", + "ner":"O" + }, + { + "id":53, + "orth":"enough", + "ner":"O" + }, + { + "id":54, + "orth":"before", + "ner":"O" + }, + { + "id":55, + "orth":"I", + "ner":"O" + }, + { + "id":56, + "orth":"try", + "ner":"O" + }, + { + "id":57, + "orth":"to", + "ner":"O" + }, + { + "id":58, + "orth":"make", + "ner":"O" + }, + { + "id":59, + "orth":"it", + "ner":"O" + }, + { + "id":60, + "orth":"(", + "ner":"O" + }, + { + "id":61, + "orth":"I", + "ner":"O" + }, + { + "id":62, + "orth":"let", + "ner":"O" + }, + { + "id":63, + "orth":"it", + "ner":"O" + }, + { + "id":64, + "orth":"chill", + "ner":"O" + }, + { + "id":65, + "orth":"overnight", + "ner":"O" + }, + { + "id":66, + "orth":"in", + "ner":"O" + }, + { + "id":67, + "orth":"the", + "ner":"O" + }, + { + "id":68, + "orth":"refrigerator", + "ner":"O" + }, + { + "id":69, + "orth":",", + "ner":"O" + }, + { + "id":70, + "orth":"but", + "ner":"O" + }, + { + "id":71, + "orth":"it", + "ner":"O" + }, 
+ { + "id":72, + "orth":"does", + "ner":"O" + }, + { + "id":73, + "orth":"n't", + "ner":"O" + }, + { + "id":74, + "orth":"always", + "ner":"O" + }, + { + "id":75, + "orth":"come", + "ner":"O" + }, + { + "id":76, + "orth":"out", + "ner":"O" + }, + { + "id":77, + "orth":"of", + "ner":"O" + }, + { + "id":78, + "orth":"the", + "ner":"O" + }, + { + "id":79, + "orth":"machine", + "ner":"O" + }, + { + "id":80, + "orth":"looking", + "ner":"O" + }, + { + "id":81, + "orth":"like", + "ner":"O" + }, + { + "id":82, + "orth":"\"", + "ner":"O" + }, + { + "id":83, + "orth":"soft", + "ner":"O" + }, + { + "id":84, + "orth":"serve", + "ner":"O" + }, + { + "id":85, + "orth":"\"", + "ner":"O" + }, + { + "id":86, + "orth":"as", + "ner":"O" + }, + { + "id":87, + "orth":"he", + "ner":"O" + }, + { + "id":88, + "orth":"said", + "ner":"O" + }, + { + "id":89, + "orth":"on", + "ner":"O" + }, + { + "id":90, + "orth":"the", + "ner":"O" + }, + { + "id":91, + "orth":"show", + "ner":"O" + }, + { + "id":92, + "orth":"-", + "ner":"O" + }, + { + "id":93, + "orth":"it", + "ner":"O" + }, + { + "id":94, + "orth":"'s", + "ner":"O" + }, + { + "id":95, + "orth":"usually", + "ner":"O" + }, + { + "id":96, + "orth":"a", + "ner":"O" + }, + { + "id":97, + "orth":"little", + "ner":"O" + }, + { + "id":98, + "orth":"thinner", + "ner":"O" + }, + { + "id":99, + "orth":")", + "ner":"O" + }, + { + "id":100, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":101, + "orth":"\n", + "ner":"O" + }, + { + "id":102, + "orth":"Recipe", + "ner":"O" + }, + { + "id":103, + "orth":":", + "ner":"O" + }, + { + "id":104, + "orth":"http://www.foodnetwork.com/recipes/alton-brown/serious-vanilla-ice-cream-recipe/index.html", + "ner":"O" + }, + { + "id":105, + "orth":"\n", + "ner":"O" + }, + { + "id":106, + "orth":"Thanks", + "ner":"O" + }, + { + "id":107, + "orth":"!", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":108, + "orth":"\n", + "ner":"O" + } + ], + "brackets":[ + 
+ ] + } + ], + "cats":[ + { + "label":"baking", + "value":0.0 + }, + { + "label":"not_baking", + "value":1.0 + } + ] + }, + { + "raw":"How long and at what temperature do the various parts of a chicken need to be cooked?\nI'm interested in baking thighs, legs, breasts and wings. How long do each of these items need to bake and at what temperature?\n", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"How", + "ner":"O" + }, + { + "id":1, + "orth":"long", + "ner":"O" + }, + { + "id":2, + "orth":"and", + "ner":"O" + }, + { + "id":3, + "orth":"at", + "ner":"O" + }, + { + "id":4, + "orth":"what", + "ner":"O" + }, + { + "id":5, + "orth":"temperature", + "ner":"O" + }, + { + "id":6, + "orth":"do", + "ner":"O" + }, + { + "id":7, + "orth":"the", + "ner":"O" + }, + { + "id":8, + "orth":"various", + "ner":"O" + }, + { + "id":9, + "orth":"parts", + "ner":"O" + }, + { + "id":10, + "orth":"of", + "ner":"O" + }, + { + "id":11, + "orth":"a", + "ner":"O" + }, + { + "id":12, + "orth":"chicken", + "ner":"O" + }, + { + "id":13, + "orth":"need", + "ner":"O" + }, + { + "id":14, + "orth":"to", + "ner":"O" + }, + { + "id":15, + "orth":"be", + "ner":"O" + }, + { + "id":16, + "orth":"cooked", + "ner":"O" + }, + { + "id":17, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":18, + "orth":"\n", + "ner":"O" + }, + { + "id":19, + "orth":"I", + "ner":"O" + }, + { + "id":20, + "orth":"'m", + "ner":"O" + }, + { + "id":21, + "orth":"interested", + "ner":"O" + }, + { + "id":22, + "orth":"in", + "ner":"O" + }, + { + "id":23, + "orth":"baking", + "ner":"O" + }, + { + "id":24, + "orth":"thighs", + "ner":"O" + }, + { + "id":25, + "orth":",", + "ner":"O" + }, + { + "id":26, + "orth":"legs", + "ner":"O" + }, + { + "id":27, + "orth":",", + "ner":"O" + }, + { + "id":28, + "orth":"breasts", + "ner":"O" + }, + { + "id":29, + "orth":"and", + "ner":"O" + }, + { + "id":30, + "orth":"wings", + "ner":"O" + }, + { + "id":31, + "orth":".", + "ner":"O" + } + ], + 
"brackets":[ + + ] + }, + { + "tokens":[ + { + "id":32, + "orth":" ", + "ner":"O" + }, + { + "id":33, + "orth":"How", + "ner":"O" + }, + { + "id":34, + "orth":"long", + "ner":"O" + }, + { + "id":35, + "orth":"do", + "ner":"O" + }, + { + "id":36, + "orth":"each", + "ner":"O" + }, + { + "id":37, + "orth":"of", + "ner":"O" + }, + { + "id":38, + "orth":"these", + "ner":"O" + }, + { + "id":39, + "orth":"items", + "ner":"O" + }, + { + "id":40, + "orth":"need", + "ner":"O" + }, + { + "id":41, + "orth":"to", + "ner":"O" + }, + { + "id":42, + "orth":"bake", + "ner":"O" + }, + { + "id":43, + "orth":"and", + "ner":"O" + }, + { + "id":44, + "orth":"at", + "ner":"O" + }, + { + "id":45, + "orth":"what", + "ner":"O" + }, + { + "id":46, + "orth":"temperature", + "ner":"O" + }, + { + "id":47, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":48, + "orth":"\n", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"baking", + "value":1.0 + }, + { + "label":"not_baking", + "value":0.0 + } + ] + }, + { + "raw":"Do I need to sift flour that is labeled sifted?\nIs there really an advantage to sifting flour that I bought that was labeled 'sifted'?\n", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"Do", + "ner":"O" + }, + { + "id":1, + "orth":"I", + "ner":"O" + }, + { + "id":2, + "orth":"need", + "ner":"O" + }, + { + "id":3, + "orth":"to", + "ner":"O" + }, + { + "id":4, + "orth":"sift", + "ner":"O" + }, + { + "id":5, + "orth":"flour", + "ner":"O" + }, + { + "id":6, + "orth":"that", + "ner":"O" + }, + { + "id":7, + "orth":"is", + "ner":"O" + }, + { + "id":8, + "orth":"labeled", + "ner":"O" + }, + { + "id":9, + "orth":"sifted", + "ner":"O" + }, + { + "id":10, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":11, + "orth":"\n", + "ner":"O" + }, + { + "id":12, + "orth":"Is", + "ner":"O" + }, + { + "id":13, + "orth":"there", + "ner":"O" + }, + { + "id":14, + "orth":"really", + 
"ner":"O" + }, + { + "id":15, + "orth":"an", + "ner":"O" + }, + { + "id":16, + "orth":"advantage", + "ner":"O" + }, + { + "id":17, + "orth":"to", + "ner":"O" + }, + { + "id":18, + "orth":"sifting", + "ner":"O" + }, + { + "id":19, + "orth":"flour", + "ner":"O" + }, + { + "id":20, + "orth":"that", + "ner":"O" + }, + { + "id":21, + "orth":"I", + "ner":"O" + }, + { + "id":22, + "orth":"bought", + "ner":"O" + }, + { + "id":23, + "orth":"that", + "ner":"O" + }, + { + "id":24, + "orth":"was", + "ner":"O" + }, + { + "id":25, + "orth":"labeled", + "ner":"O" + }, + { + "id":26, + "orth":"'", + "ner":"O" + }, + { + "id":27, + "orth":"sifted", + "ner":"O" + }, + { + "id":28, + "orth":"'", + "ner":"O" + }, + { + "id":29, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":30, + "orth":"\n", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"baking", + "value":1.0 + }, + { + "label":"not_baking", + "value":0.0 + } + ] + } + ] + } +] \ No newline at end of file diff --git a/examples/training/textcat_example_data/cooking.jsonl b/examples/training/textcat_example_data/cooking.jsonl new file mode 100644 index 000000000..cfdc9be87 --- /dev/null +++ b/examples/training/textcat_example_data/cooking.jsonl @@ -0,0 +1,10 @@ +{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "2"}, "text": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven by laying the strips out on a cookie sheet. When using this method, how long should I cook the bacon for, and at what temperature?\n"} +{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "3"}, "text": "What is the difference between white and brown eggs?\nI always use brown extra large eggs, but I can't honestly say why I do this other than habit at this point. 
Are there any distinct advantages or disadvantages like flavor, shelf life, etc?\n"} +{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "4"}, "text": "What is the difference between baking soda and baking powder?\nAnd can I use one in place of the other in certain recipes?\n"} +{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "5"}, "text": "In a tomato sauce recipe, how can I cut the acidity?\nIt seems that every time I make a tomato sauce for pasta, the sauce is a little bit too acid for my taste. I've tried using sugar or sodium bicarbonate, but I'm not satisfied with the results.\n"} +{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "6"}, "text": "What ingredients (available in specific regions) can I substitute for parsley?\nI have a recipe that calls for fresh parsley. I have substituted other fresh herbs for their dried equivalents but I don't have fresh or dried parsley. Is there something else (ex another dried herb) that I can use instead of parsley?\nI know it is used mainly for looks rather than taste but I have a pasta recipe that calls for 2 tablespoons of parsley in the sauce and then another 2 tablespoons on top when it is done. 
I know the parsley on top is more for looks but there must be something about the taste otherwise it would call for parsley within the sauce as well.\nI would especially like to hear about substitutes available in Southeast Asia and other parts of the world where the obvious answers (such as cilantro) are not widely available.\n"} +{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "9"}, "text": "What is the internal temperature a steak should be cooked to for Rare/Medium Rare/Medium/Well?\nI'd like to know when to take my steaks off the grill and please everybody.\n"} +{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "11"}, "text": "How should I poach an egg?\nWhat's the best method to poach an egg without it turning into an eggy soupy mess?\n"} +{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "12"}, "text": "How can I make my Ice Cream \"creamier\"\nMy ice cream doesn't feel creamy enough. I got the recipe from Good Eats, and I can't tell if it's just the recipe or maybe that I'm just not getting my \"batter\" cold enough before I try to make it (I let it chill overnight in the refrigerator, but it doesn't always come out of the machine looking like \"soft serve\" as he said on the show - it's usually a little thinner).\nRecipe: http://www.foodnetwork.com/recipes/alton-brown/serious-vanilla-ice-cream-recipe/index.html\nThanks!\n"} +{"cats": {"baking": 1.0, "not_baking": 0.0}, "meta": {"id": "17"}, "text": "How long and at what temperature do the various parts of a chicken need to be cooked?\nI'm interested in baking thighs, legs, breasts and wings. 
How long do each of these items need to bake and at what temperature?\n"} +{"cats": {"baking": 1.0, "not_baking": 0.0}, "meta": {"id": "27"}, "text": "Do I need to sift flour that is labeled sifted?\nIs there really an advantage to sifting flour that I bought that was labeled 'sifted'?\n"} diff --git a/examples/training/textcat_example_data/jigsaw-toxic-comment.json b/examples/training/textcat_example_data/jigsaw-toxic-comment.json new file mode 100644 index 000000000..0c8d8f8e0 --- /dev/null +++ b/examples/training/textcat_example_data/jigsaw-toxic-comment.json @@ -0,0 +1,2987 @@ +[ + { + "id":0, + "paragraphs":[ + { + "raw":"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"Explanation", + "ner":"O" + }, + { + "id":1, + "orth":"\n", + "ner":"O" + }, + { + "id":2, + "orth":"Why", + "ner":"O" + }, + { + "id":3, + "orth":"the", + "ner":"O" + }, + { + "id":4, + "orth":"edits", + "ner":"O" + }, + { + "id":5, + "orth":"made", + "ner":"O" + }, + { + "id":6, + "orth":"under", + "ner":"O" + }, + { + "id":7, + "orth":"my", + "ner":"O" + }, + { + "id":8, + "orth":"username", + "ner":"O" + }, + { + "id":9, + "orth":"Hardcore", + "ner":"O" + }, + { + "id":10, + "orth":"Metallica", + "ner":"O" + }, + { + "id":11, + "orth":"Fan", + "ner":"O" + }, + { + "id":12, + "orth":"were", + "ner":"O" + }, + { + "id":13, + "orth":"reverted", + "ner":"O" + }, + { + "id":14, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":15, + "orth":"They", + "ner":"O" + }, + { + "id":16, + "orth":"were", + "ner":"O" + }, + { + "id":17, + "orth":"n't", + "ner":"O" + }, + { + "id":18, + "orth":"vandalisms", + "ner":"O" + }, + { + "id":19, + "orth":",", + "ner":"O" + }, + { + "id":20, + 
"orth":"just", + "ner":"O" + }, + { + "id":21, + "orth":"closure", + "ner":"O" + }, + { + "id":22, + "orth":"on", + "ner":"O" + }, + { + "id":23, + "orth":"some", + "ner":"O" + }, + { + "id":24, + "orth":"GAs", + "ner":"O" + }, + { + "id":25, + "orth":"after", + "ner":"O" + }, + { + "id":26, + "orth":"I", + "ner":"O" + }, + { + "id":27, + "orth":"voted", + "ner":"O" + }, + { + "id":28, + "orth":"at", + "ner":"O" + }, + { + "id":29, + "orth":"New", + "ner":"O" + }, + { + "id":30, + "orth":"York", + "ner":"O" + }, + { + "id":31, + "orth":"Dolls", + "ner":"O" + }, + { + "id":32, + "orth":"FAC", + "ner":"O" + }, + { + "id":33, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":34, + "orth":"And", + "ner":"O" + }, + { + "id":35, + "orth":"please", + "ner":"O" + }, + { + "id":36, + "orth":"do", + "ner":"O" + }, + { + "id":37, + "orth":"n't", + "ner":"O" + }, + { + "id":38, + "orth":"remove", + "ner":"O" + }, + { + "id":39, + "orth":"the", + "ner":"O" + }, + { + "id":40, + "orth":"template", + "ner":"O" + }, + { + "id":41, + "orth":"from", + "ner":"O" + }, + { + "id":42, + "orth":"the", + "ner":"O" + }, + { + "id":43, + "orth":"talk", + "ner":"O" + }, + { + "id":44, + "orth":"page", + "ner":"O" + }, + { + "id":45, + "orth":"since", + "ner":"O" + }, + { + "id":46, + "orth":"I", + "ner":"O" + }, + { + "id":47, + "orth":"'m", + "ner":"O" + }, + { + "id":48, + "orth":"retired", + "ner":"O" + }, + { + "id":49, + "orth":"now.89.205.38.27", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"insult", + "value":0 + }, + { + "label":"obscene", + "value":0 + }, + { + "label":"severe_toxic", + "value":0 + }, + { + "label":"toxic", + "value":0 + } + ] + }, + { + "raw":"I'm Sorry \n\nI'm sorry I screwed around with someones talk page. It was very bad to do. I know how having the templates on their talk page helps you assert your dominance over them. I know I should bow down to the almighty administrators. 
But then again, I'm going to go play outside....with your mom. 76.122.79.82", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"I", + "ner":"O" + }, + { + "id":1, + "orth":"'m", + "ner":"O" + }, + { + "id":2, + "orth":"Sorry", + "ner":"O" + }, + { + "id":3, + "orth":"\n\n", + "ner":"O" + }, + { + "id":4, + "orth":"I", + "ner":"O" + }, + { + "id":5, + "orth":"'m", + "ner":"O" + }, + { + "id":6, + "orth":"sorry", + "ner":"O" + }, + { + "id":7, + "orth":"I", + "ner":"O" + }, + { + "id":8, + "orth":"screwed", + "ner":"O" + }, + { + "id":9, + "orth":"around", + "ner":"O" + }, + { + "id":10, + "orth":"with", + "ner":"O" + }, + { + "id":11, + "orth":"someones", + "ner":"O" + }, + { + "id":12, + "orth":"talk", + "ner":"O" + }, + { + "id":13, + "orth":"page", + "ner":"O" + }, + { + "id":14, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":15, + "orth":" ", + "ner":"O" + }, + { + "id":16, + "orth":"It", + "ner":"O" + }, + { + "id":17, + "orth":"was", + "ner":"O" + }, + { + "id":18, + "orth":"very", + "ner":"O" + }, + { + "id":19, + "orth":"bad", + "ner":"O" + }, + { + "id":20, + "orth":"to", + "ner":"O" + }, + { + "id":21, + "orth":"do", + "ner":"O" + }, + { + "id":22, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":23, + "orth":" ", + "ner":"O" + }, + { + "id":24, + "orth":"I", + "ner":"O" + }, + { + "id":25, + "orth":"know", + "ner":"O" + }, + { + "id":26, + "orth":"how", + "ner":"O" + }, + { + "id":27, + "orth":"having", + "ner":"O" + }, + { + "id":28, + "orth":"the", + "ner":"O" + }, + { + "id":29, + "orth":"templates", + "ner":"O" + }, + { + "id":30, + "orth":"on", + "ner":"O" + }, + { + "id":31, + "orth":"their", + "ner":"O" + }, + { + "id":32, + "orth":"talk", + "ner":"O" + }, + { + "id":33, + "orth":"page", + "ner":"O" + }, + { + "id":34, + "orth":"helps", + "ner":"O" + }, + { + "id":35, + "orth":"you", + "ner":"O" + }, + { + "id":36, + "orth":"assert", + "ner":"O" + }, + { + 
"id":37, + "orth":"your", + "ner":"O" + }, + { + "id":38, + "orth":"dominance", + "ner":"O" + }, + { + "id":39, + "orth":"over", + "ner":"O" + }, + { + "id":40, + "orth":"them", + "ner":"O" + }, + { + "id":41, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":42, + "orth":" ", + "ner":"O" + }, + { + "id":43, + "orth":"I", + "ner":"O" + }, + { + "id":44, + "orth":"know", + "ner":"O" + }, + { + "id":45, + "orth":"I", + "ner":"O" + }, + { + "id":46, + "orth":"should", + "ner":"O" + }, + { + "id":47, + "orth":"bow", + "ner":"O" + }, + { + "id":48, + "orth":"down", + "ner":"O" + }, + { + "id":49, + "orth":"to", + "ner":"O" + }, + { + "id":50, + "orth":"the", + "ner":"O" + }, + { + "id":51, + "orth":"almighty", + "ner":"O" + }, + { + "id":52, + "orth":"administrators", + "ner":"O" + }, + { + "id":53, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":54, + "orth":" ", + "ner":"O" + }, + { + "id":55, + "orth":"But", + "ner":"O" + }, + { + "id":56, + "orth":"then", + "ner":"O" + }, + { + "id":57, + "orth":"again", + "ner":"O" + }, + { + "id":58, + "orth":",", + "ner":"O" + }, + { + "id":59, + "orth":"I", + "ner":"O" + }, + { + "id":60, + "orth":"'m", + "ner":"O" + }, + { + "id":61, + "orth":"going", + "ner":"O" + }, + { + "id":62, + "orth":"to", + "ner":"O" + }, + { + "id":63, + "orth":"go", + "ner":"O" + }, + { + "id":64, + "orth":"play", + "ner":"O" + }, + { + "id":65, + "orth":"outside", + "ner":"O" + }, + { + "id":66, + "orth":"....", + "ner":"O" + }, + { + "id":67, + "orth":"with", + "ner":"O" + }, + { + "id":68, + "orth":"your", + "ner":"O" + }, + { + "id":69, + "orth":"mom", + "ner":"O" + }, + { + "id":70, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":71, + "orth":" ", + "ner":"O" + }, + { + "id":72, + "orth":"76.122.79.82", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"insult", + "value":0 + }, + { + 
"label":"obscene", + "value":0 + }, + { + "label":"severe_toxic", + "value":0 + }, + { + "label":"toxic", + "value":1 + } + ] + }, + { + "raw":"Stupid peace of shit stop deleting my stuff asshole go die and fall in a hole go to hell!", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"Stupid", + "ner":"O" + }, + { + "id":1, + "orth":"peace", + "ner":"O" + }, + { + "id":2, + "orth":"of", + "ner":"O" + }, + { + "id":3, + "orth":"shit", + "ner":"O" + }, + { + "id":4, + "orth":"stop", + "ner":"O" + }, + { + "id":5, + "orth":"deleting", + "ner":"O" + }, + { + "id":6, + "orth":"my", + "ner":"O" + }, + { + "id":7, + "orth":"stuff", + "ner":"O" + }, + { + "id":8, + "orth":"asshole", + "ner":"O" + }, + { + "id":9, + "orth":"go", + "ner":"O" + }, + { + "id":10, + "orth":"die", + "ner":"O" + }, + { + "id":11, + "orth":"and", + "ner":"O" + }, + { + "id":12, + "orth":"fall", + "ner":"O" + }, + { + "id":13, + "orth":"in", + "ner":"O" + }, + { + "id":14, + "orth":"a", + "ner":"O" + }, + { + "id":15, + "orth":"hole", + "ner":"O" + }, + { + "id":16, + "orth":"go", + "ner":"O" + }, + { + "id":17, + "orth":"to", + "ner":"O" + }, + { + "id":18, + "orth":"hell", + "ner":"O" + }, + { + "id":19, + "orth":"!", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"insult", + "value":1 + }, + { + "label":"obscene", + "value":1 + }, + { + "label":"severe_toxic", + "value":1 + }, + { + "label":"toxic", + "value":1 + } + ] + }, + { + "raw":"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. 
(talk) 21:51, January 11, 2016 (UTC)", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"D'aww", + "ner":"O" + }, + { + "id":1, + "orth":"!", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":2, + "orth":"He", + "ner":"O" + }, + { + "id":3, + "orth":"matches", + "ner":"O" + }, + { + "id":4, + "orth":"this", + "ner":"O" + }, + { + "id":5, + "orth":"background", + "ner":"O" + }, + { + "id":6, + "orth":"colour", + "ner":"O" + }, + { + "id":7, + "orth":"I", + "ner":"O" + }, + { + "id":8, + "orth":"'m", + "ner":"O" + }, + { + "id":9, + "orth":"seemingly", + "ner":"O" + }, + { + "id":10, + "orth":"stuck", + "ner":"O" + }, + { + "id":11, + "orth":"with", + "ner":"O" + }, + { + "id":12, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":13, + "orth":"Thanks", + "ner":"O" + }, + { + "id":14, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":15, + "orth":" ", + "ner":"O" + }, + { + "id":16, + "orth":"(", + "ner":"O" + }, + { + "id":17, + "orth":"talk", + "ner":"O" + }, + { + "id":18, + "orth":")", + "ner":"O" + }, + { + "id":19, + "orth":"21:51", + "ner":"O" + }, + { + "id":20, + "orth":",", + "ner":"O" + }, + { + "id":21, + "orth":"January", + "ner":"O" + }, + { + "id":22, + "orth":"11", + "ner":"O" + }, + { + "id":23, + "orth":",", + "ner":"O" + }, + { + "id":24, + "orth":"2016", + "ner":"O" + }, + { + "id":25, + "orth":"(", + "ner":"O" + }, + { + "id":26, + "orth":"UTC", + "ner":"O" + }, + { + "id":27, + "orth":")", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"insult", + "value":0 + }, + { + "label":"obscene", + "value":0 + }, + { + "label":"severe_toxic", + "value":0 + }, + { + "label":"toxic", + "value":0 + } + ] + }, + { + "raw":"GET FUCKED UP. GET FUCKEEED UP. GOT A DRINK THAT YOU CANT PUT DOWN???/ GET FUCK UP GET FUCKED UP. 
I'M FUCKED UP RIGHT NOW!", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"GET", + "ner":"O" + }, + { + "id":1, + "orth":"FUCKED", + "ner":"O" + }, + { + "id":2, + "orth":"UP", + "ner":"O" + }, + { + "id":3, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":4, + "orth":"GET", + "ner":"O" + }, + { + "id":5, + "orth":"FUCKEEED", + "ner":"O" + }, + { + "id":6, + "orth":"UP", + "ner":"O" + }, + { + "id":7, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":8, + "orth":" ", + "ner":"O" + }, + { + "id":9, + "orth":"GOT", + "ner":"O" + }, + { + "id":10, + "orth":"A", + "ner":"O" + }, + { + "id":11, + "orth":"DRINK", + "ner":"O" + }, + { + "id":12, + "orth":"THAT", + "ner":"O" + }, + { + "id":13, + "orth":"YOU", + "ner":"O" + }, + { + "id":14, + "orth":"CANT", + "ner":"O" + }, + { + "id":15, + "orth":"PUT", + "ner":"O" + }, + { + "id":16, + "orth":"DOWN???/", + "ner":"O" + }, + { + "id":17, + "orth":"GET", + "ner":"O" + }, + { + "id":18, + "orth":"FUCK", + "ner":"O" + }, + { + "id":19, + "orth":"UP", + "ner":"O" + }, + { + "id":20, + "orth":"GET", + "ner":"O" + }, + { + "id":21, + "orth":"FUCKED", + "ner":"O" + }, + { + "id":22, + "orth":"UP", + "ner":"O" + }, + { + "id":23, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":24, + "orth":" ", + "ner":"O" + }, + { + "id":25, + "orth":"I'M", + "ner":"O" + }, + { + "id":26, + "orth":"FUCKED", + "ner":"O" + }, + { + "id":27, + "orth":"UP", + "ner":"O" + }, + { + "id":28, + "orth":"RIGHT", + "ner":"O" + }, + { + "id":29, + "orth":"NOW", + "ner":"O" + }, + { + "id":30, + "orth":"!", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"insult", + "value":0 + }, + { + "label":"obscene", + "value":1 + }, + { + "label":"severe_toxic", + "value":0 + }, + { + "label":"toxic", + "value":1 + } + ] + }, + { + "raw":"Hey man, I'm really not trying to edit war. 
It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"Hey", + "ner":"O" + }, + { + "id":1, + "orth":"man", + "ner":"O" + }, + { + "id":2, + "orth":",", + "ner":"O" + }, + { + "id":3, + "orth":"I", + "ner":"O" + }, + { + "id":4, + "orth":"'m", + "ner":"O" + }, + { + "id":5, + "orth":"really", + "ner":"O" + }, + { + "id":6, + "orth":"not", + "ner":"O" + }, + { + "id":7, + "orth":"trying", + "ner":"O" + }, + { + "id":8, + "orth":"to", + "ner":"O" + }, + { + "id":9, + "orth":"edit", + "ner":"O" + }, + { + "id":10, + "orth":"war", + "ner":"O" + }, + { + "id":11, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":12, + "orth":"It", + "ner":"O" + }, + { + "id":13, + "orth":"'s", + "ner":"O" + }, + { + "id":14, + "orth":"just", + "ner":"O" + }, + { + "id":15, + "orth":"that", + "ner":"O" + }, + { + "id":16, + "orth":"this", + "ner":"O" + }, + { + "id":17, + "orth":"guy", + "ner":"O" + }, + { + "id":18, + "orth":"is", + "ner":"O" + }, + { + "id":19, + "orth":"constantly", + "ner":"O" + }, + { + "id":20, + "orth":"removing", + "ner":"O" + }, + { + "id":21, + "orth":"relevant", + "ner":"O" + }, + { + "id":22, + "orth":"information", + "ner":"O" + }, + { + "id":23, + "orth":"and", + "ner":"O" + }, + { + "id":24, + "orth":"talking", + "ner":"O" + }, + { + "id":25, + "orth":"to", + "ner":"O" + }, + { + "id":26, + "orth":"me", + "ner":"O" + }, + { + "id":27, + "orth":"through", + "ner":"O" + }, + { + "id":28, + "orth":"edits", + "ner":"O" + }, + { + "id":29, + "orth":"instead", + "ner":"O" + }, + { + "id":30, + "orth":"of", + "ner":"O" + }, + { + "id":31, + "orth":"my", + "ner":"O" + }, + { + "id":32, + "orth":"talk", + "ner":"O" + }, + { + "id":33, + "orth":"page", + "ner":"O" + }, + { + "id":34, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + 
] + }, + { + "tokens":[ + { + "id":35, + "orth":"He", + "ner":"O" + }, + { + "id":36, + "orth":"seems", + "ner":"O" + }, + { + "id":37, + "orth":"to", + "ner":"O" + }, + { + "id":38, + "orth":"care", + "ner":"O" + }, + { + "id":39, + "orth":"more", + "ner":"O" + }, + { + "id":40, + "orth":"about", + "ner":"O" + }, + { + "id":41, + "orth":"the", + "ner":"O" + }, + { + "id":42, + "orth":"formatting", + "ner":"O" + }, + { + "id":43, + "orth":"than", + "ner":"O" + }, + { + "id":44, + "orth":"the", + "ner":"O" + }, + { + "id":45, + "orth":"actual", + "ner":"O" + }, + { + "id":46, + "orth":"info", + "ner":"O" + }, + { + "id":47, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"insult", + "value":0 + }, + { + "label":"obscene", + "value":0 + }, + { + "label":"severe_toxic", + "value":0 + }, + { + "label":"toxic", + "value":0 + } + ] + }, + { + "raw":"\"\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of \"\"types of accidents\"\" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. 
It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport \"", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"\"", + "ner":"O" + }, + { + "id":1, + "orth":"\n", + "ner":"O" + }, + { + "id":2, + "orth":"More", + "ner":"O" + }, + { + "id":3, + "orth":"\n", + "ner":"O" + }, + { + "id":4, + "orth":"I", + "ner":"O" + }, + { + "id":5, + "orth":"ca", + "ner":"O" + }, + { + "id":6, + "orth":"n't", + "ner":"O" + }, + { + "id":7, + "orth":"make", + "ner":"O" + }, + { + "id":8, + "orth":"any", + "ner":"O" + }, + { + "id":9, + "orth":"real", + "ner":"O" + }, + { + "id":10, + "orth":"suggestions", + "ner":"O" + }, + { + "id":11, + "orth":"on", + "ner":"O" + }, + { + "id":12, + "orth":"improvement", + "ner":"O" + }, + { + "id":13, + "orth":"-", + "ner":"O" + }, + { + "id":14, + "orth":"I", + "ner":"O" + }, + { + "id":15, + "orth":"wondered", + "ner":"O" + }, + { + "id":16, + "orth":"if", + "ner":"O" + }, + { + "id":17, + "orth":"the", + "ner":"O" + }, + { + "id":18, + "orth":"section", + "ner":"O" + }, + { + "id":19, + "orth":"statistics", + "ner":"O" + }, + { + "id":20, + "orth":"should", + "ner":"O" + }, + { + "id":21, + "orth":"be", + "ner":"O" + }, + { + "id":22, + "orth":"later", + "ner":"O" + }, + { + "id":23, + "orth":"on", + "ner":"O" + }, + { + "id":24, + "orth":",", + "ner":"O" + }, + { + "id":25, + "orth":"or", + "ner":"O" + }, + { + "id":26, + "orth":"a", + "ner":"O" + }, + { + "id":27, + "orth":"subsection", + "ner":"O" + }, + { + "id":28, + "orth":"of", + "ner":"O" + }, + { + "id":29, + "orth":"\"", + "ner":"O" + }, + { + "id":30, + "orth":"\"", + "ner":"O" + }, + { + "id":31, + "orth":"types", + "ner":"O" + }, + { + "id":32, + "orth":"of", + "ner":"O" + }, + { + "id":33, + "orth":"accidents", + "ner":"O" + }, + { + "id":34, + "orth":"\"", + "ner":"O" + }, + { + "id":35, + "orth":"\"", + "ner":"O" + }, + { + "id":36, + "orth":" ", + "ner":"O" + }, + { + "id":37, + "orth":"-I", + "ner":"O" + }, + { + "id":38, + "orth":"think", + 
"ner":"O" + }, + { + "id":39, + "orth":"the", + "ner":"O" + }, + { + "id":40, + "orth":"references", + "ner":"O" + }, + { + "id":41, + "orth":"may", + "ner":"O" + }, + { + "id":42, + "orth":"need", + "ner":"O" + }, + { + "id":43, + "orth":"tidying", + "ner":"O" + }, + { + "id":44, + "orth":"so", + "ner":"O" + }, + { + "id":45, + "orth":"that", + "ner":"O" + }, + { + "id":46, + "orth":"they", + "ner":"O" + }, + { + "id":47, + "orth":"are", + "ner":"O" + }, + { + "id":48, + "orth":"all", + "ner":"O" + }, + { + "id":49, + "orth":"in", + "ner":"O" + }, + { + "id":50, + "orth":"the", + "ner":"O" + }, + { + "id":51, + "orth":"exact", + "ner":"O" + }, + { + "id":52, + "orth":"same", + "ner":"O" + }, + { + "id":53, + "orth":"format", + "ner":"O" + }, + { + "id":54, + "orth":"ie", + "ner":"O" + }, + { + "id":55, + "orth":"date", + "ner":"O" + }, + { + "id":56, + "orth":"format", + "ner":"O" + }, + { + "id":57, + "orth":"etc", + "ner":"O" + }, + { + "id":58, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":59, + "orth":"I", + "ner":"O" + }, + { + "id":60, + "orth":"can", + "ner":"O" + }, + { + "id":61, + "orth":"do", + "ner":"O" + }, + { + "id":62, + "orth":"that", + "ner":"O" + }, + { + "id":63, + "orth":"later", + "ner":"O" + }, + { + "id":64, + "orth":"on", + "ner":"O" + }, + { + "id":65, + "orth":",", + "ner":"O" + }, + { + "id":66, + "orth":"if", + "ner":"O" + }, + { + "id":67, + "orth":"no", + "ner":"O" + }, + { + "id":68, + "orth":"-", + "ner":"O" + }, + { + "id":69, + "orth":"one", + "ner":"O" + }, + { + "id":70, + "orth":"else", + "ner":"O" + }, + { + "id":71, + "orth":"does", + "ner":"O" + }, + { + "id":72, + "orth":"first", + "ner":"O" + }, + { + "id":73, + "orth":"-", + "ner":"O" + }, + { + "id":74, + "orth":"if", + "ner":"O" + }, + { + "id":75, + "orth":"you", + "ner":"O" + }, + { + "id":76, + "orth":"have", + "ner":"O" + }, + { + "id":77, + "orth":"any", + "ner":"O" + }, + { + "id":78, + "orth":"preferences", + "ner":"O" 
+ }, + { + "id":79, + "orth":"for", + "ner":"O" + }, + { + "id":80, + "orth":"formatting", + "ner":"O" + }, + { + "id":81, + "orth":"style", + "ner":"O" + }, + { + "id":82, + "orth":"on", + "ner":"O" + }, + { + "id":83, + "orth":"references", + "ner":"O" + }, + { + "id":84, + "orth":"or", + "ner":"O" + }, + { + "id":85, + "orth":"want", + "ner":"O" + }, + { + "id":86, + "orth":"to", + "ner":"O" + }, + { + "id":87, + "orth":"do", + "ner":"O" + }, + { + "id":88, + "orth":"it", + "ner":"O" + }, + { + "id":89, + "orth":"yourself", + "ner":"O" + }, + { + "id":90, + "orth":"please", + "ner":"O" + }, + { + "id":91, + "orth":"let", + "ner":"O" + }, + { + "id":92, + "orth":"me", + "ner":"O" + }, + { + "id":93, + "orth":"know", + "ner":"O" + }, + { + "id":94, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":95, + "orth":"\n\n", + "ner":"O" + }, + { + "id":96, + "orth":"There", + "ner":"O" + }, + { + "id":97, + "orth":"appears", + "ner":"O" + }, + { + "id":98, + "orth":"to", + "ner":"O" + }, + { + "id":99, + "orth":"be", + "ner":"O" + }, + { + "id":100, + "orth":"a", + "ner":"O" + }, + { + "id":101, + "orth":"backlog", + "ner":"O" + }, + { + "id":102, + "orth":"on", + "ner":"O" + }, + { + "id":103, + "orth":"articles", + "ner":"O" + }, + { + "id":104, + "orth":"for", + "ner":"O" + }, + { + "id":105, + "orth":"review", + "ner":"O" + }, + { + "id":106, + "orth":"so", + "ner":"O" + }, + { + "id":107, + "orth":"I", + "ner":"O" + }, + { + "id":108, + "orth":"guess", + "ner":"O" + }, + { + "id":109, + "orth":"there", + "ner":"O" + }, + { + "id":110, + "orth":"may", + "ner":"O" + }, + { + "id":111, + "orth":"be", + "ner":"O" + }, + { + "id":112, + "orth":"a", + "ner":"O" + }, + { + "id":113, + "orth":"delay", + "ner":"O" + }, + { + "id":114, + "orth":"until", + "ner":"O" + }, + { + "id":115, + "orth":"a", + "ner":"O" + }, + { + "id":116, + "orth":"reviewer", + "ner":"O" + }, + { + "id":117, + "orth":"turns", + "ner":"O" + }, + { + "id":118, + 
"orth":"up", + "ner":"O" + }, + { + "id":119, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":120, + "orth":"It", + "ner":"O" + }, + { + "id":121, + "orth":"'s", + "ner":"O" + }, + { + "id":122, + "orth":"listed", + "ner":"O" + }, + { + "id":123, + "orth":"in", + "ner":"O" + }, + { + "id":124, + "orth":"the", + "ner":"O" + }, + { + "id":125, + "orth":"relevant", + "ner":"O" + }, + { + "id":126, + "orth":"form", + "ner":"O" + }, + { + "id":127, + "orth":"eg", + "ner":"O" + }, + { + "id":128, + "orth":"Wikipedia", + "ner":"O" + }, + { + "id":129, + "orth":":", + "ner":"O" + }, + { + "id":130, + "orth":"Good_article_nominations#Transport", + "ner":"O" + }, + { + "id":131, + "orth":" ", + "ner":"O" + }, + { + "id":132, + "orth":"\"", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"insult", + "value":0 + }, + { + "label":"obscene", + "value":0 + }, + { + "label":"severe_toxic", + "value":0 + }, + { + "label":"toxic", + "value":0 + } + ] + }, + { + "raw":"You, sir, are my hero. 
Any chance you remember what page that's on?", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"You", + "ner":"O" + }, + { + "id":1, + "orth":",", + "ner":"O" + }, + { + "id":2, + "orth":"sir", + "ner":"O" + }, + { + "id":3, + "orth":",", + "ner":"O" + }, + { + "id":4, + "orth":"are", + "ner":"O" + }, + { + "id":5, + "orth":"my", + "ner":"O" + }, + { + "id":6, + "orth":"hero", + "ner":"O" + }, + { + "id":7, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":8, + "orth":"Any", + "ner":"O" + }, + { + "id":9, + "orth":"chance", + "ner":"O" + }, + { + "id":10, + "orth":"you", + "ner":"O" + }, + { + "id":11, + "orth":"remember", + "ner":"O" + }, + { + "id":12, + "orth":"what", + "ner":"O" + }, + { + "id":13, + "orth":"page", + "ner":"O" + }, + { + "id":14, + "orth":"that", + "ner":"O" + }, + { + "id":15, + "orth":"'s", + "ner":"O" + }, + { + "id":16, + "orth":"on", + "ner":"O" + }, + { + "id":17, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"insult", + "value":0 + }, + { + "label":"obscene", + "value":0 + }, + { + "label":"severe_toxic", + "value":0 + }, + { + "label":"toxic", + "value":0 + } + ] + }, + { + "raw":"\"\n\nCongratulations from me as well, use the tools well. 
\u00a0\u00b7 talk \"", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"\"", + "ner":"O" + }, + { + "id":1, + "orth":"\n\n", + "ner":"O" + }, + { + "id":2, + "orth":"Congratulations", + "ner":"O" + }, + { + "id":3, + "orth":"from", + "ner":"O" + }, + { + "id":4, + "orth":"me", + "ner":"O" + }, + { + "id":5, + "orth":"as", + "ner":"O" + }, + { + "id":6, + "orth":"well", + "ner":"O" + }, + { + "id":7, + "orth":",", + "ner":"O" + }, + { + "id":8, + "orth":"use", + "ner":"O" + }, + { + "id":9, + "orth":"the", + "ner":"O" + }, + { + "id":10, + "orth":"tools", + "ner":"O" + }, + { + "id":11, + "orth":"well", + "ner":"O" + }, + { + "id":12, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":13, + "orth":"\u00a0", + "ner":"O" + }, + { + "id":14, + "orth":"\u00b7", + "ner":"O" + }, + { + "id":15, + "orth":"talk", + "ner":"O" + }, + { + "id":16, + "orth":"\"", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"insult", + "value":0 + }, + { + "label":"obscene", + "value":0 + }, + { + "label":"severe_toxic", + "value":0 + }, + { + "label":"toxic", + "value":0 + } + ] + }, + { + "raw":"Why can't you believe how fat Artie is? Did you see him on his recent appearence on the Tonight Show with Jay Leno? He looks absolutely AWFUL! If I had to put money on it, I'd say that Artie Lange is a can't miss candidate for the 2007 Dead pool! \n\n \nKindly keep your malicious fingers off of my above comment, . 
Everytime you remove it, I will repost it!!!", + "sentences":[ + { + "tokens":[ + { + "id":0, + "orth":"Why", + "ner":"O" + }, + { + "id":1, + "orth":"ca", + "ner":"O" + }, + { + "id":2, + "orth":"n't", + "ner":"O" + }, + { + "id":3, + "orth":"you", + "ner":"O" + }, + { + "id":4, + "orth":"believe", + "ner":"O" + }, + { + "id":5, + "orth":"how", + "ner":"O" + }, + { + "id":6, + "orth":"fat", + "ner":"O" + }, + { + "id":7, + "orth":"Artie", + "ner":"O" + }, + { + "id":8, + "orth":"is", + "ner":"O" + }, + { + "id":9, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":10, + "orth":"Did", + "ner":"O" + }, + { + "id":11, + "orth":"you", + "ner":"O" + }, + { + "id":12, + "orth":"see", + "ner":"O" + }, + { + "id":13, + "orth":"him", + "ner":"O" + }, + { + "id":14, + "orth":"on", + "ner":"O" + }, + { + "id":15, + "orth":"his", + "ner":"O" + }, + { + "id":16, + "orth":"recent", + "ner":"O" + }, + { + "id":17, + "orth":"appearence", + "ner":"O" + }, + { + "id":18, + "orth":"on", + "ner":"O" + }, + { + "id":19, + "orth":"the", + "ner":"O" + }, + { + "id":20, + "orth":"Tonight", + "ner":"O" + }, + { + "id":21, + "orth":"Show", + "ner":"O" + }, + { + "id":22, + "orth":"with", + "ner":"O" + }, + { + "id":23, + "orth":"Jay", + "ner":"O" + }, + { + "id":24, + "orth":"Leno", + "ner":"O" + }, + { + "id":25, + "orth":"?", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":26, + "orth":"He", + "ner":"O" + }, + { + "id":27, + "orth":"looks", + "ner":"O" + }, + { + "id":28, + "orth":"absolutely", + "ner":"O" + }, + { + "id":29, + "orth":"AWFUL", + "ner":"O" + }, + { + "id":30, + "orth":"!", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":31, + "orth":"If", + "ner":"O" + }, + { + "id":32, + "orth":"I", + "ner":"O" + }, + { + "id":33, + "orth":"had", + "ner":"O" + }, + { + "id":34, + "orth":"to", + "ner":"O" + }, + { + "id":35, + "orth":"put", + "ner":"O" + }, + { + "id":36, + "orth":"money", + 
"ner":"O" + }, + { + "id":37, + "orth":"on", + "ner":"O" + }, + { + "id":38, + "orth":"it", + "ner":"O" + }, + { + "id":39, + "orth":",", + "ner":"O" + }, + { + "id":40, + "orth":"I", + "ner":"O" + }, + { + "id":41, + "orth":"'d", + "ner":"O" + }, + { + "id":42, + "orth":"say", + "ner":"O" + }, + { + "id":43, + "orth":"that", + "ner":"O" + }, + { + "id":44, + "orth":"Artie", + "ner":"O" + }, + { + "id":45, + "orth":"Lange", + "ner":"O" + }, + { + "id":46, + "orth":"is", + "ner":"O" + }, + { + "id":47, + "orth":"a", + "ner":"O" + }, + { + "id":48, + "orth":"ca", + "ner":"O" + }, + { + "id":49, + "orth":"n't", + "ner":"O" + }, + { + "id":50, + "orth":"miss", + "ner":"O" + }, + { + "id":51, + "orth":"candidate", + "ner":"O" + }, + { + "id":52, + "orth":"for", + "ner":"O" + }, + { + "id":53, + "orth":"the", + "ner":"O" + }, + { + "id":54, + "orth":"2007", + "ner":"O" + }, + { + "id":55, + "orth":"Dead", + "ner":"O" + }, + { + "id":56, + "orth":"pool", + "ner":"O" + }, + { + "id":57, + "orth":"!", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":58, + "orth":" \n\n \n", + "ner":"O" + }, + { + "id":59, + "orth":"Kindly", + "ner":"O" + }, + { + "id":60, + "orth":"keep", + "ner":"O" + }, + { + "id":61, + "orth":"your", + "ner":"O" + }, + { + "id":62, + "orth":"malicious", + "ner":"O" + }, + { + "id":63, + "orth":"fingers", + "ner":"O" + }, + { + "id":64, + "orth":"off", + "ner":"O" + }, + { + "id":65, + "orth":"of", + "ner":"O" + }, + { + "id":66, + "orth":"my", + "ner":"O" + }, + { + "id":67, + "orth":"above", + "ner":"O" + }, + { + "id":68, + "orth":"comment", + "ner":"O" + }, + { + "id":69, + "orth":",", + "ner":"O" + }, + { + "id":70, + "orth":".", + "ner":"O" + } + ], + "brackets":[ + + ] + }, + { + "tokens":[ + { + "id":71, + "orth":"Everytime", + "ner":"O" + }, + { + "id":72, + "orth":"you", + "ner":"O" + }, + { + "id":73, + "orth":"remove", + "ner":"O" + }, + { + "id":74, + "orth":"it", + "ner":"O" + }, + { + "id":75, + "orth":",", + 
"ner":"O" + }, + { + "id":76, + "orth":"I", + "ner":"O" + }, + { + "id":77, + "orth":"will", + "ner":"O" + }, + { + "id":78, + "orth":"repost", + "ner":"O" + }, + { + "id":79, + "orth":"it", + "ner":"O" + }, + { + "id":80, + "orth":"!", + "ner":"O" + }, + { + "id":81, + "orth":"!", + "ner":"O" + }, + { + "id":82, + "orth":"!", + "ner":"O" + } + ], + "brackets":[ + + ] + } + ], + "cats":[ + { + "label":"insult", + "value":0 + }, + { + "label":"obscene", + "value":0 + }, + { + "label":"severe_toxic", + "value":0 + }, + { + "label":"toxic", + "value":1 + } + ] + } + ] + } +] \ No newline at end of file diff --git a/examples/training/textcat_example_data/jigsaw-toxic-comment.jsonl b/examples/training/textcat_example_data/jigsaw-toxic-comment.jsonl new file mode 100644 index 000000000..ac31b6255 --- /dev/null +++ b/examples/training/textcat_example_data/jigsaw-toxic-comment.jsonl @@ -0,0 +1,10 @@ +{"meta": {"id": "0000997932d777bf"}, "text": "Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}} +{"meta": {"id": "001956c382006abd"}, "text": "I'm Sorry \n\nI'm sorry I screwed around with someones talk page. It was very bad to do. I know how having the templates on their talk page helps you assert your dominance over them. I know I should bow down to the almighty administrators. But then again, I'm going to go play outside....with your mom. 76.122.79.82", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 1}} +{"meta": {"id": "0020e7119b96eeeb"}, "text": "Stupid peace of shit stop deleting my stuff asshole go die and fall in a hole go to hell!", "cats": {"insult": 1, "obscene": 1, "severe_toxic": 1, "toxic": 1}} +{"meta": {"id": "000103f0d9cfb60f"}, "text": "D'aww! 
He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}} +{"meta": {"id": "001dc38a83d420cf"}, "text": "GET FUCKED UP. GET FUCKEEED UP. GOT A DRINK THAT YOU CANT PUT DOWN???/ GET FUCK UP GET FUCKED UP. I'M FUCKED UP RIGHT NOW!", "cats": {"insult": 0, "obscene": 1, "severe_toxic": 0, "toxic": 1}} +{"meta": {"id": "000113f07ec002fd"}, "text": "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}} +{"meta": {"id": "0001b41b1c6bb37e"}, "text": "\"\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of \"\"types of accidents\"\" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport \"", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}} +{"meta": {"id": "0001d958c54c6e35"}, "text": "You, sir, are my hero. Any chance you remember what page that's on?", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}} +{"meta": {"id": "00025465d4725e87"}, "text": "\"\n\nCongratulations from me as well, use the tools well.  · talk \"", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}} +{"meta": {"id": "002264ea4d5f2887"}, "text": "Why can't you believe how fat Artie is? 
Did you see him on his recent appearence on the Tonight Show with Jay Leno? He looks absolutely AWFUL! If I had to put money on it, I'd say that Artie Lange is a can't miss candidate for the 2007 Dead pool! \n\n \nKindly keep your malicious fingers off of my above comment, . Everytime you remove it, I will repost it!!!", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 1}} diff --git a/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py b/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py new file mode 100644 index 000000000..339ce39be --- /dev/null +++ b/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py @@ -0,0 +1,53 @@ +from pathlib import Path +import plac +import spacy +from spacy.gold import docs_to_json +import srsly +import sys + +@plac.annotations( + model=("Model name. Defaults to 'en'.", "option", "m", str), + input_file=("Input file (jsonl)", "positional", None, Path), + output_dir=("Output directory", "positional", None, Path), + n_texts=("Number of texts to convert", "option", "t", int), +) +def convert(model='en', input_file=None, output_dir=None, n_texts=0): + # Load model with tokenizer + sentencizer only + nlp = spacy.load(model) + nlp.disable_pipes(*nlp.pipe_names) + sentencizer = nlp.create_pipe("sentencizer") + nlp.add_pipe(sentencizer, first=True) + + texts = [] + cats = [] + count = 0 + + if not input_file.exists(): + print("Input file not found:", input_file) + sys.exit(1) + else: + with open(input_file) as fileh: + for line in fileh: + data = srsly.json_loads(line) + texts.append(data["text"]) + cats.append(data["cats"]) + + if output_dir is not None: + output_dir = Path(output_dir) + if not output_dir.exists(): + output_dir.mkdir() + else: + output_dir = Path(".") + + docs = [] + for i, doc in enumerate(nlp.pipe(texts)): + doc.cats = cats[i] + docs.append(doc) + if n_texts > 0 and count == n_texts: + break + count += 1 + + srsly.write_json(output_dir / 
input_file.with_suffix(".json"), [docs_to_json(docs)]) + +if __name__ == "__main__": + plac.call(convert) diff --git a/examples/training/train_entity_linker.py b/examples/training/train_entity_linker.py index 12ed531a6..d2b2c2417 100644 --- a/examples/training/train_entity_linker.py +++ b/examples/training/train_entity_linker.py @@ -8,8 +8,8 @@ For more details, see the documentation: * Training: https://spacy.io/usage/training * Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking -Compatible with: spaCy vX.X -Last tested with: vX.X +Compatible with: spaCy v2.2 +Last tested with: v2.2 """ from __future__ import unicode_literals, print_function diff --git a/examples/training/training-data.json b/examples/training/training-data.json index 2565ce149..1f57e1fd9 100644 --- a/examples/training/training-data.json +++ b/examples/training/training-data.json @@ -8,7 +8,7 @@ { "tokens": [ { - "head": 44, + "head": 4, "dep": "prep", "tag": "IN", "orth": "In", diff --git a/fabfile.py b/fabfile.py index 0e69551c3..56570e8e0 100644 --- a/fabfile.py +++ b/fabfile.py @@ -10,113 +10,145 @@ import sys PWD = path.dirname(__file__) -ENV = environ['VENV_DIR'] if 'VENV_DIR' in environ else '.env' +ENV = environ["VENV_DIR"] if "VENV_DIR" in environ else ".env" VENV_DIR = Path(PWD) / ENV @contextlib.contextmanager -def virtualenv(name, create=False, python='/usr/bin/python3.6'): +def virtualenv(name, create=False, python="/usr/bin/python3.6"): python = Path(python).resolve() env_path = VENV_DIR if create: if env_path.exists(): shutil.rmtree(str(env_path)) - local('{python} -m venv {env_path}'.format(python=python, env_path=VENV_DIR)) + local("{python} -m venv {env_path}".format(python=python, env_path=VENV_DIR)) + def wrapped_local(cmd, env_vars=[], capture=False, direct=False): - return local('source {}/bin/activate && {}'.format(env_path, cmd), - shell='/bin/bash', capture=False) + return local( + "source {}/bin/activate && {}".format(env_path, cmd), + 
shell="/bin/bash", + capture=False, + ) + yield wrapped_local -def env(lang='python3.6'): +def env(lang="python3.6"): if VENV_DIR.exists(): - local('rm -rf {env}'.format(env=VENV_DIR)) - if lang.startswith('python3'): - local('{lang} -m venv {env}'.format(lang=lang, env=VENV_DIR)) + local("rm -rf {env}".format(env=VENV_DIR)) + if lang.startswith("python3"): + local("{lang} -m venv {env}".format(lang=lang, env=VENV_DIR)) else: - local('{lang} -m pip install virtualenv --no-cache-dir'.format(lang=lang)) - local('{lang} -m virtualenv {env} --no-cache-dir'.format(lang=lang, env=VENV_DIR)) + local("{lang} -m pip install virtualenv --no-cache-dir".format(lang=lang)) + local( + "{lang} -m virtualenv {env} --no-cache-dir".format(lang=lang, env=VENV_DIR) + ) with virtualenv(VENV_DIR) as venv_local: - print(venv_local('python --version', capture=True)) - venv_local('pip install --upgrade setuptools --no-cache-dir') - venv_local('pip install pytest --no-cache-dir') - venv_local('pip install wheel --no-cache-dir') - venv_local('pip install -r requirements.txt --no-cache-dir') - venv_local('pip install pex --no-cache-dir') - + print(venv_local("python --version", capture=True)) + venv_local("pip install --upgrade setuptools --no-cache-dir") + venv_local("pip install pytest --no-cache-dir") + venv_local("pip install wheel --no-cache-dir") + venv_local("pip install -r requirements.txt --no-cache-dir") + venv_local("pip install pex --no-cache-dir") def install(): with virtualenv(VENV_DIR) as venv_local: - venv_local('pip install dist/*.tar.gz') + venv_local("pip install dist/*.tar.gz") def make(): with lcd(path.dirname(__file__)): - local('export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace', - shell='/bin/bash') + local( + "export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace", + shell="/bin/bash", + ) + def sdist(): with virtualenv(VENV_DIR) as venv_local: with lcd(path.dirname(__file__)): - 
local('python -m pip install -U setuptools') - local('python setup.py sdist') + local("python -m pip install -U setuptools srsly") + local("python setup.py sdist") + def wheel(): with virtualenv(VENV_DIR) as venv_local: with lcd(path.dirname(__file__)): - venv_local('python setup.py bdist_wheel') + venv_local("python setup.py bdist_wheel") + def pex(): with virtualenv(VENV_DIR) as venv_local: with lcd(path.dirname(__file__)): - sha = local('git rev-parse --short HEAD', capture=True) - venv_local('pex dist/*.whl -e spacy -o dist/spacy-%s.pex' % sha, - direct=True) + sha = local("git rev-parse --short HEAD", capture=True) + venv_local( + "pex dist/*.whl -e spacy -o dist/spacy-%s.pex" % sha, direct=True + ) def clean(): with lcd(path.dirname(__file__)): - local('rm -f dist/*.whl') - local('rm -f dist/*.pex') + local("rm -f dist/*.whl") + local("rm -f dist/*.pex") with virtualenv(VENV_DIR) as venv_local: - venv_local('python setup.py clean --all') + venv_local("python setup.py clean --all") def test(): with virtualenv(VENV_DIR) as venv_local: with lcd(path.dirname(__file__)): - venv_local('pytest -x spacy/tests') + venv_local("pytest -x spacy/tests") + def train(): - args = environ.get('SPACY_TRAIN_ARGS', '') + args = environ.get("SPACY_TRAIN_ARGS", "") with virtualenv(VENV_DIR) as venv_local: - venv_local('spacy train {args}'.format(args=args)) + venv_local("spacy train {args}".format(args=args)) -def conll17(treebank_dir, experiment_dir, vectors_dir, config, corpus=''): - is_not_clean = local('git status --porcelain', capture=True) +def conll17(treebank_dir, experiment_dir, vectors_dir, config, corpus=""): + is_not_clean = local("git status --porcelain", capture=True) if is_not_clean: print("Repository is not clean") print(is_not_clean) sys.exit(1) - git_sha = local('git rev-parse --short HEAD', capture=True) - config_checksum = local('sha256sum {config}'.format(config=config), capture=True) - experiment_dir = Path(experiment_dir) / 
'{}--{}'.format(config_checksum[:6], git_sha) + git_sha = local("git rev-parse --short HEAD", capture=True) + config_checksum = local("sha256sum {config}".format(config=config), capture=True) + experiment_dir = Path(experiment_dir) / "{}--{}".format( + config_checksum[:6], git_sha + ) if not experiment_dir.exists(): experiment_dir.mkdir() - test_data_dir = Path(treebank_dir) / 'ud-test-v2.0-conll2017' + test_data_dir = Path(treebank_dir) / "ud-test-v2.0-conll2017" assert test_data_dir.exists() assert test_data_dir.is_dir() if corpus: corpora = [corpus] else: - corpora = ['UD_English', 'UD_Chinese', 'UD_Japanese', 'UD_Vietnamese'] + corpora = ["UD_English", "UD_Chinese", "UD_Japanese", "UD_Vietnamese"] - local('cp {config} {experiment_dir}/config.json'.format(config=config, experiment_dir=experiment_dir)) + local( + "cp {config} {experiment_dir}/config.json".format( + config=config, experiment_dir=experiment_dir + ) + ) with virtualenv(VENV_DIR) as venv_local: for corpus in corpora: - venv_local('spacy ud-train {treebank_dir} {experiment_dir} {config} {corpus} -v {vectors_dir}'.format( - treebank_dir=treebank_dir, experiment_dir=experiment_dir, config=config, corpus=corpus, vectors_dir=vectors_dir)) - venv_local('spacy ud-run-test {test_data_dir} {experiment_dir} {corpus}'.format( - test_data_dir=test_data_dir, experiment_dir=experiment_dir, config=config, corpus=corpus)) + venv_local( + "spacy ud-train {treebank_dir} {experiment_dir} {config} {corpus} -v {vectors_dir}".format( + treebank_dir=treebank_dir, + experiment_dir=experiment_dir, + config=config, + corpus=corpus, + vectors_dir=vectors_dir, + ) + ) + venv_local( + "spacy ud-run-test {test_data_dir} {experiment_dir} {corpus}".format( + test_data_dir=test_data_dir, + experiment_dir=experiment_dir, + config=config, + corpus=corpus, + ) + ) diff --git a/requirements.txt b/requirements.txt index a6d721e96..ebe660b97 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,8 +1,8 @@ # Our libraries 
cymem>=2.0.2,<2.1.0 -preshed>=2.0.1,<2.1.0 -thinc>=7.0.8,<7.1.0 -blis>=0.2.2,<0.3.0 +preshed>=3.0.2,<3.1.0 +thinc>=7.1.1,<7.2.0 +blis>=0.4.0,<0.5.0 murmurhash>=0.28.0,<1.1.0 wasabi>=0.2.0,<1.1.0 srsly>=0.1.0,<1.1.0 diff --git a/setup.py b/setup.py index 984de2250..abe3fb509 100755 --- a/setup.py +++ b/setup.py @@ -27,7 +27,7 @@ def is_new_osx(): return False -PACKAGE_DATA = {"": ["*.pyx", "*.pxd", "*.txt", "*.tokens", "*.json"]} +PACKAGE_DATA = {"": ["*.pyx", "*.pxd", "*.txt", "*.tokens", "*.json", "*.json.gz"]} PACKAGES = find_packages() @@ -43,6 +43,7 @@ MOD_NAMES = [ "spacy.kb", "spacy.morphology", "spacy.pipeline.pipes", + "spacy.pipeline.morphologizer", "spacy.syntax.stateclass", "spacy.syntax._state", "spacy.tokenizer", @@ -56,6 +57,7 @@ MOD_NAMES = [ "spacy.tokens.doc", "spacy.tokens.span", "spacy.tokens.token", + "spacy.tokens.morphanalysis", "spacy.tokens._retokenize", "spacy.matcher.matcher", "spacy.matcher.phrasematcher", @@ -245,9 +247,9 @@ def setup_package(): "numpy>=1.15.0", "murmurhash>=0.28.0,<1.1.0", "cymem>=2.0.2,<2.1.0", - "preshed>=2.0.1,<2.1.0", - "thinc>=7.0.8,<7.1.0", - "blis>=0.2.2,<0.3.0", + "preshed>=3.0.2,<3.1.0", + "thinc>=7.1.1,<7.2.0", + "blis>=0.4.0,<0.5.0", "plac<1.0.0,>=0.9.6", "requests>=2.13.0,<3.0.0", "wasabi>=0.2.0,<1.1.0", @@ -281,7 +283,6 @@ def setup_package(): "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", - "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", diff --git a/spacy/_ml.py b/spacy/_ml.py index 660d20c46..6104324ab 100644 --- a/spacy/_ml.py +++ b/spacy/_ml.py @@ -15,7 +15,7 @@ from thinc.api import uniqued, wrap, noop from thinc.api import with_square_sequences from thinc.linear.linear import LinearModel from thinc.neural.ops import NumpyOps, CupyOps -from thinc.neural.util import get_array_module +from thinc.neural.util import 
get_array_module, copy_array from thinc.neural.optimizers import Adam from thinc import describe @@ -286,10 +286,7 @@ def link_vectors_to_models(vocab): if vectors.name is None: vectors.name = VECTORS_KEY if vectors.data.size != 0: - print( - "Warning: Unnamed vectors -- this won't allow multiple vectors " - "models to be loaded. (Shape: (%d, %d))" % vectors.data.shape - ) + user_warning(Warnings.W020.format(shape=vectors.data.shape)) ops = Model.ops for word in vocab: if word.orth in vectors.key2row: @@ -323,6 +320,9 @@ def Tok2Vec(width, embed_size, **kwargs): pretrained_vectors = kwargs.get("pretrained_vectors", None) cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3) subword_features = kwargs.get("subword_features", True) + char_embed = kwargs.get("char_embed", False) + if char_embed: + subword_features = False conv_depth = kwargs.get("conv_depth", 4) bilstm_depth = kwargs.get("bilstm_depth", 0) cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH] @@ -362,6 +362,14 @@ def Tok2Vec(width, embed_size, **kwargs): >> LN(Maxout(width, width * 4, pieces=3)), column=cols.index(ORTH), ) + elif char_embed: + embed = concatenate_lists( + CharacterEmbed(nM=64, nC=8), + FeatureExtracter(cols) >> with_flatten(norm), + ) + reduce_dimensions = LN( + Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces) + ) else: embed = norm @@ -369,9 +377,15 @@ def Tok2Vec(width, embed_size, **kwargs): ExtractWindow(nW=1) >> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces)) ) - tok2vec = FeatureExtracter(cols) >> with_flatten( - embed >> convolution ** conv_depth, pad=conv_depth - ) + if char_embed: + tok2vec = embed >> with_flatten( + reduce_dimensions >> convolution ** conv_depth, pad=conv_depth + ) + else: + tok2vec = FeatureExtracter(cols) >> with_flatten( + embed >> convolution ** conv_depth, pad=conv_depth + ) + if bilstm_depth >= 1: tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth) # Work around thinc API limitations :(. 
TODO: Revise in Thinc 7 @@ -504,6 +518,46 @@ def getitem(i): return layerize(getitem_fwd) +@describe.attributes( + W=Synapses("Weights matrix", lambda obj: (obj.nO, obj.nI), lambda W, ops: None) +) +class MultiSoftmax(Affine): + """Neural network layer that predicts several multi-class attributes at once. + For instance, we might predict one class with 6 variables, and another with 5. + We predict the 11 neurons required for this, and then softmax them such + that columns 0-6 make a probability distribution and coumns 6-11 make another. + """ + + name = "multisoftmax" + + def __init__(self, out_sizes, nI=None, **kwargs): + Model.__init__(self, **kwargs) + self.out_sizes = out_sizes + self.nO = sum(out_sizes) + self.nI = nI + + def predict(self, input__BI): + output__BO = self.ops.affine(self.W, self.b, input__BI) + i = 0 + for out_size in self.out_sizes: + self.ops.softmax(output__BO[:, i : i + out_size], inplace=True) + i += out_size + return output__BO + + def begin_update(self, input__BI, drop=0.0): + output__BO = self.predict(input__BI) + + def finish_update(grad__BO, sgd=None): + self.d_W += self.ops.gemm(grad__BO, input__BI, trans1=True) + self.d_b += grad__BO.sum(axis=0) + grad__BI = self.ops.gemm(grad__BO, self.W) + if sgd is not None: + sgd(self._mem.weights, self._mem.gradient, key=self.id) + return grad__BI + + return output__BO, finish_update + + def build_tagger_model(nr_class, **cfg): embed_size = util.env_opt("embed_size", 2000) if "token_vector_width" in cfg: @@ -530,6 +584,33 @@ def build_tagger_model(nr_class, **cfg): return model +def build_morphologizer_model(class_nums, **cfg): + embed_size = util.env_opt("embed_size", 7000) + if "token_vector_width" in cfg: + token_vector_width = cfg["token_vector_width"] + else: + token_vector_width = util.env_opt("token_vector_width", 128) + pretrained_vectors = cfg.get("pretrained_vectors") + char_embed = cfg.get("char_embed", True) + with Model.define_operators({">>": chain, "+": add, "**": clone}): + if 
"tok2vec" in cfg: + tok2vec = cfg["tok2vec"] + else: + tok2vec = Tok2Vec( + token_vector_width, + embed_size, + char_embed=char_embed, + pretrained_vectors=pretrained_vectors, + ) + softmax = with_flatten(MultiSoftmax(class_nums, token_vector_width)) + softmax.out_sizes = class_nums + model = tok2vec >> softmax + model.nI = None + model.tok2vec = tok2vec + model.softmax = softmax + return model + + @layerize def SpacyVectors(docs, drop=0.0): batch = [] @@ -720,7 +801,8 @@ def concatenate_lists(*layers, **kwargs): # pragma: no cover concat = concatenate(*layers) def concatenate_lists_fwd(Xs, drop=0.0): - drop *= drop_factor + if drop is not None: + drop *= drop_factor lengths = ops.asarray([len(X) for X in Xs], dtype="i") flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop) ys = ops.unflatten(flat_y, lengths) @@ -810,6 +892,67 @@ def _replace_word(word, random_words, mask="[MASK]"): return word +def _uniform_init(lo, hi): + def wrapped(W, ops): + copy_array(W, ops.xp.random.uniform(lo, hi, W.shape)) + + return wrapped + + +@describe.attributes( + nM=Dimension("Vector dimensions"), + nC=Dimension("Number of characters per word"), + vectors=Synapses( + "Embed matrix", lambda obj: (obj.nC, obj.nV, obj.nM), _uniform_init(-0.1, 0.1) + ), + d_vectors=Gradient("vectors"), +) +class CharacterEmbed(Model): + def __init__(self, nM=None, nC=None, **kwargs): + Model.__init__(self, **kwargs) + self.nM = nM + self.nC = nC + + @property + def nO(self): + return self.nM * self.nC + + @property + def nV(self): + return 256 + + def begin_update(self, docs, drop=0.0): + if not docs: + return [] + ids = [] + output = [] + weights = self.vectors + # This assists in indexing; it's like looping over this dimension. + # Still consider this weird witch craft...But thanks to Mark Neumann + # for the tip. 
+ nCv = self.ops.xp.arange(self.nC) + for doc in docs: + doc_ids = doc.to_utf8_array(nr_char=self.nC) + doc_vectors = self.ops.allocate((len(doc), self.nC, self.nM)) + # Let's say I have a 2d array of indices, and a 3d table of data. What numpy + # incantation do I chant to get + # output[i, j, k] == data[j, ids[i, j], k]? + doc_vectors[:, nCv] = weights[nCv, doc_ids[:, nCv]] + output.append(doc_vectors.reshape((len(doc), self.nO))) + ids.append(doc_ids) + + def backprop_character_embed(d_vectors, sgd=None): + gradient = self.d_vectors + for doc_ids, d_doc_vectors in zip(ids, d_vectors): + d_doc_vectors = d_doc_vectors.reshape((len(doc_ids), self.nC, self.nM)) + gradient[nCv, doc_ids[:, nCv]] += d_doc_vectors[:, nCv] + if sgd is not None: + sgd(self._mem.weights, self._mem.gradient, key=self.id) + return None + + return output, backprop_character_embed + + def get_cossim_loss(yh, y): # Add a small constant to avoid 0 vectors yh = yh + 1e-8 diff --git a/spacy/about.py b/spacy/about.py index 9587c9071..7bb8e7ead 100644 --- a/spacy/about.py +++ b/spacy/about.py @@ -1,16 +1,12 @@ -# inspired from: -# https://python-packaging-user-guide.readthedocs.org/en/latest/single_source_version/ -# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py # fmt: off - __title__ = "spacy" -__version__ = "2.1.8" -__summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython" +__version__ = "2.2.0.dev15" +__summary__ = "Industrial-strength Natural Language Processing (NLP) in Python" __uri__ = "https://spacy.io" -__author__ = "Explosion AI" +__author__ = "Explosion" __email__ = "contact@explosion.ai" __license__ = "MIT" -__release__ = True +__release__ = False __download_url__ = "https://github.com/explosion/spacy-models/releases/download" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" diff --git a/spacy/attrs.pyx b/spacy/attrs.pyx index 8eeea363f..40236630a 100644 --- a/spacy/attrs.pyx 
+++ b/spacy/attrs.pyx @@ -144,8 +144,12 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False): for name, value in stringy_attrs.items(): if isinstance(name, int): int_key = name - else: + elif name in IDS: + int_key = IDS[name] + elif name.upper() in IDS: int_key = IDS[name.upper()] + else: + continue if strings_map is not None and isinstance(value, basestring): if hasattr(strings_map, 'add'): value = strings_map.add(value) diff --git a/spacy/cli/debug_data.py b/spacy/cli/debug_data.py index 0a9a0f7ef..b649e6666 100644 --- a/spacy/cli/debug_data.py +++ b/spacy/cli/debug_data.py @@ -34,12 +34,6 @@ BLANK_MODEL_THRESHOLD = 2000 str, ), ignore_warnings=("Ignore warnings, only show stats and errors", "flag", "IW", bool), - ignore_validation=( - "Don't exit if JSON format validation fails", - "flag", - "IV", - bool, - ), verbose=("Print additional information and explanations", "flag", "V", bool), no_format=("Don't pretty-print the results", "flag", "NF", bool), ) @@ -50,10 +44,14 @@ def debug_data( base_model=None, pipeline="tagger,parser,ner", ignore_warnings=False, - ignore_validation=False, verbose=False, no_format=False, ): + """ + Analyze, debug and validate your training and development data, get useful + stats, and find problems like invalid entity annotations, cyclic + dependencies, low data labels and more. 
+ """ msg = Printer(pretty=not no_format, ignore_warnings=ignore_warnings) # Make sure all files and paths exists if they are needed @@ -72,21 +70,9 @@ def debug_data( msg.divider("Data format validation") - # Validate data format using the JSON schema + # TODO: Validate data format using the JSON schema # TODO: update once the new format is ready # TODO: move validation to GoldCorpus in order to be able to load from dir - train_data_errors = [] # TODO: validate_json - dev_data_errors = [] # TODO: validate_json - if not train_data_errors: - msg.good("Training data JSON format is valid") - if not dev_data_errors: - msg.good("Development data JSON format is valid") - for error in train_data_errors: - msg.fail("Training data: {}".format(error)) - for error in dev_data_errors: - msg.fail("Develoment data: {}".format(error)) - if (train_data_errors or dev_data_errors) and not ignore_validation: - sys.exit(1) # Create the gold corpus to be able to better analyze data loading_train_error_message = "" @@ -284,7 +270,7 @@ def debug_data( if "textcat" in pipeline: msg.divider("Text Classification") - labels = [label for label in gold_train_data["textcat"]] + labels = [label for label in gold_train_data["cats"]] model_labels = _get_labels_from_model(nlp, "textcat") new_labels = [l for l in labels if l not in model_labels] existing_labels = [l for l in labels if l in model_labels] @@ -295,13 +281,45 @@ def debug_data( ) if new_labels: labels_with_counts = _format_labels( - gold_train_data["textcat"].most_common(), counts=True + gold_train_data["cats"].most_common(), counts=True ) msg.text("New: {}".format(labels_with_counts), show=verbose) if existing_labels: msg.text( "Existing: {}".format(_format_labels(existing_labels)), show=verbose ) + if set(gold_train_data["cats"]) != set(gold_dev_data["cats"]): + msg.fail( + "The train and dev labels are not the same. " + "Train labels: {}. 
" + "Dev labels: {}.".format( + _format_labels(gold_train_data["cats"]), + _format_labels(gold_dev_data["cats"]), + ) + ) + if gold_train_data["n_cats_multilabel"] > 0: + msg.info( + "The train data contains instances without " + "mutually-exclusive classes. Use '--textcat-multilabel' " + "when training." + ) + if gold_dev_data["n_cats_multilabel"] == 0: + msg.warn( + "Potential train/dev mismatch: the train data contains " + "instances without mutually-exclusive classes while the " + "dev data does not." + ) + else: + msg.info( + "The train data contains only instances with " + "mutually-exclusive classes." + ) + if gold_dev_data["n_cats_multilabel"] > 0: + msg.fail( + "Train/dev mismatch: the dev data contains instances " + "without mutually-exclusive classes while the train data " + "contains only instances with mutually-exclusive classes." + ) if "tagger" in pipeline: msg.divider("Part-of-speech Tagging") @@ -330,6 +348,7 @@ def debug_data( ) if "parser" in pipeline: + has_low_data_warning = False msg.divider("Dependency Parsing") # profile sentence length @@ -518,6 +537,7 @@ def _compile_gold(train_docs, pipeline): "n_sents": 0, "n_nonproj": 0, "n_cycles": 0, + "n_cats_multilabel": 0, "texts": set(), } for doc, gold in train_docs: @@ -540,6 +560,8 @@ def _compile_gold(train_docs, pipeline): data["ner"]["-"] += 1 if "textcat" in pipeline: data["cats"].update(gold.cats) + if list(gold.cats.values()).count(1.0) != 1: + data["n_cats_multilabel"] += 1 if "tagger" in pipeline: data["tags"].update([x for x in gold.tags if x is not None]) if "parser" in pipeline: diff --git a/spacy/cli/download.py b/spacy/cli/download.py index 8a993178a..64ab03a75 100644 --- a/spacy/cli/download.py +++ b/spacy/cli/download.py @@ -28,6 +28,16 @@ def download(model, direct=False, *pip_args): can be shortcut, model name or, if --direct flag is set, full model name with version. For direct downloads, the compatibility check will be skipped. 
""" + if not require_package("spacy") and "--no-deps" not in pip_args: + msg.warn( + "Skipping model package dependencies and setting `--no-deps`. " + "You don't seem to have the spaCy package itself installed " + "(maybe because you've built from source?), so installing the " + "model dependencies would cause spaCy to be downloaded, which " + "probably isn't what you want. If the model package has other " + "dependencies, you'll have to install them manually." + ) + pip_args = pip_args + ("--no-deps",) dl_tpl = "{m}-{v}/{m}-{v}.tar.gz#egg={m}=={v}" if direct: components = model.split("-") @@ -72,12 +82,15 @@ def download(model, direct=False, *pip_args): # is_package check currently fails, because pkg_resources.working_set # is not refreshed automatically (see #3923). We're trying to work # around this here be requiring the package explicitly. - try: - pkg_resources.working_set.require(model_name) - except: # noqa: E722 - # Maybe it's possible to remove this – mostly worried about cross- - # platform and cross-Python copmpatibility here - pass + require_package(model_name) + + +def require_package(name): + try: + pkg_resources.working_set.require(name) + return True + except: # noqa: E722 + return False def get_json(url, desc): @@ -117,7 +130,7 @@ def get_version(model, comp): def download_model(filename, user_pip_args=None): download_url = about.__download_url__ + "/" + filename - pip_args = ["--no-cache-dir", "--no-deps"] + pip_args = ["--no-cache-dir"] if user_pip_args: pip_args.extend(user_pip_args) cmd = [sys.executable, "-m", "pip", "install"] + pip_args + [download_url] diff --git a/spacy/cli/evaluate.py b/spacy/cli/evaluate.py index 0a57ef2da..1114ada08 100644 --- a/spacy/cli/evaluate.py +++ b/spacy/cli/evaluate.py @@ -61,6 +61,7 @@ def evaluate( "NER P": "%.2f" % scorer.ents_p, "NER R": "%.2f" % scorer.ents_r, "NER F": "%.2f" % scorer.ents_f, + "Textcat": "%.2f" % scorer.textcat_score, } msg.table(results, title="Results") diff --git 
a/spacy/cli/init_model.py b/spacy/cli/init_model.py index 955b420aa..c285a12a6 100644 --- a/spacy/cli/init_model.py +++ b/spacy/cli/init_model.py @@ -35,6 +35,13 @@ msg = Printer() clusters_loc=("Optional location of brown clusters data", "option", "c", str), vectors_loc=("Optional vectors file in Word2Vec format", "option", "v", str), prune_vectors=("Optional number of vectors to prune to", "option", "V", int), + vectors_name=( + "Optional name for the word vectors, e.g. en_core_web_lg.vectors", + "option", + "vn", + str, + ), + model_name=("Optional name for the model meta", "option", "mn", str), ) def init_model( lang, @@ -44,6 +51,8 @@ def init_model( jsonl_loc=None, vectors_loc=None, prune_vectors=-1, + vectors_name=None, + model_name=None, ): """ Create a new model from raw data, like word frequencies, Brown clusters @@ -75,10 +84,10 @@ def init_model( lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc) with msg.loading("Creating model..."): - nlp = create_model(lang, lex_attrs) + nlp = create_model(lang, lex_attrs, name=model_name) msg.good("Successfully created model") if vectors_loc is not None: - add_vectors(nlp, vectors_loc, prune_vectors) + add_vectors(nlp, vectors_loc, prune_vectors, vectors_name) vec_added = len(nlp.vocab.vectors) lex_added = len(nlp.vocab) msg.good( @@ -138,7 +147,7 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc): return lex_attrs -def create_model(lang, lex_attrs): +def create_model(lang, lex_attrs, name=None): lang_class = get_lang_class(lang) nlp = lang_class() for lexeme in nlp.vocab: @@ -157,10 +166,12 @@ def create_model(lang, lex_attrs): else: oov_prob = DEFAULT_OOV_PROB nlp.vocab.cfg.update({"oov_prob": oov_prob}) + if name: + nlp.meta["name"] = name return nlp -def add_vectors(nlp, vectors_loc, prune_vectors): +def add_vectors(nlp, vectors_loc, prune_vectors, name=None): vectors_loc = ensure_path(vectors_loc) if vectors_loc and vectors_loc.parts[-1].endswith(".npz"): nlp.vocab.vectors = 
Vectors(data=numpy.load(vectors_loc.open("rb"))) @@ -181,7 +192,10 @@ def add_vectors(nlp, vectors_loc, prune_vectors): lexeme.is_oov = False if vectors_data is not None: nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys) - nlp.vocab.vectors.name = "%s_model.vectors" % nlp.meta["lang"] + if name is None: + nlp.vocab.vectors.name = "%s_model.vectors" % nlp.meta["lang"] + else: + nlp.vocab.vectors.name = name nlp.meta["vectors"]["name"] = nlp.vocab.vectors.name if prune_vectors >= 1: nlp.vocab.prune_vectors(prune_vectors) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index fe30e1a3c..2588a81a2 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -21,54 +21,35 @@ from .. import about @plac.annotations( + # fmt: off lang=("Model language", "positional", None, str), output_path=("Output directory to store model in", "positional", None, Path), train_path=("Location of JSON-formatted training data", "positional", None, Path), dev_path=("Location of JSON-formatted development data", "positional", None, Path), - raw_text=( - "Path to jsonl file with unlabelled text documents.", - "option", - "rt", - Path, - ), + raw_text=("Path to jsonl file with unlabelled text documents.", "option", "rt", Path), base_model=("Name of model to update (optional)", "option", "b", str), pipeline=("Comma-separated names of pipeline components", "option", "p", str), vectors=("Model to load vectors from", "option", "v", str), n_iter=("Number of iterations", "option", "n", int), - n_early_stopping=( - "Maximum number of training epochs without dev accuracy improvement", - "option", - "ne", - int, - ), + n_early_stopping=("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int), n_examples=("Number of examples", "option", "ns", int), use_gpu=("Use GPU", "option", "g", int), version=("Model version", "option", "V", str), meta_path=("Optional path to meta.json to use as base.", "option", "m", Path), - init_tok2vec=( - "Path to pretrained 
weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", - "option", - "t2v", - Path, - ), - parser_multitasks=( - "Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'", - "option", - "pt", - str, - ), - entity_multitasks=( - "Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'", - "option", - "et", - str, - ), + init_tok2vec=("Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", "option", "t2v", Path), + parser_multitasks=("Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'", "option", "pt", str), + entity_multitasks=("Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'", "option", "et", str), noise_level=("Amount of corruption for data augmentation", "option", "nl", float), + orth_variant_level=("Amount of orthography variation for data augmentation", "option", "ovl", float), eval_beam_widths=("Beam widths to evaluate, e.g. 4,8", "option", "bw", str), gold_preproc=("Use gold preprocessing", "flag", "G", bool), learn_tokens=("Make parser learn gold-standard tokenization", "flag", "T", bool), + textcat_multilabel=("Textcat classes aren't mutually exclusive (multilabel)", "flag", "TML", bool), + textcat_arch=("Textcat model architecture", "option", "ta", str), + textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str), verbose=("Display more information for debug", "flag", "VV", bool), debug=("Run data diagnostics before training", "flag", "D", bool), + # fmt: on ) def train( lang, @@ -89,9 +70,13 @@ def train( parser_multitasks="", entity_multitasks="", noise_level=0.0, + orth_variant_level=0.0, eval_beam_widths="", gold_preproc=False, learn_tokens=False, + textcat_multilabel=False, + textcat_arch="bow", + textcat_positive_label=None, verbose=False, debug=False, ): @@ -177,9 +162,37 @@ def train( if pipe not in nlp.pipe_names: if pipe == "parser": pipe_cfg = {"learn_tokens": learn_tokens} + elif pipe == "textcat": + 
pipe_cfg = { + "exclusive_classes": not textcat_multilabel, + "architecture": textcat_arch, + "positive_label": textcat_positive_label, + } else: pipe_cfg = {} nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg)) + else: + if pipe == "textcat": + textcat_cfg = nlp.get_pipe("textcat").cfg + base_cfg = { + "exclusive_classes": textcat_cfg["exclusive_classes"], + "architecture": textcat_cfg["architecture"], + "positive_label": textcat_cfg["positive_label"], + } + pipe_cfg = { + "exclusive_classes": not textcat_multilabel, + "architecture": textcat_arch, + "positive_label": textcat_positive_label, + } + if base_cfg != pipe_cfg: + msg.fail( + "The base textcat model configuration does " + "not match the provided training options. " + "Existing cfg: {}, provided cfg: {}".format( + base_cfg, pipe_cfg + ), + exits=1, + ) else: msg.text("Starting with blank model '{}'".format(lang)) lang_cls = util.get_lang_class(lang) @@ -187,6 +200,12 @@ def train( for pipe in pipeline: if pipe == "parser": pipe_cfg = {"learn_tokens": learn_tokens} + elif pipe == "textcat": + pipe_cfg = { + "exclusive_classes": not textcat_multilabel, + "architecture": textcat_arch, + "positive_label": textcat_positive_label, + } else: pipe_cfg = {} nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg)) @@ -227,12 +246,89 @@ def train( components = _load_pretrained_tok2vec(nlp, init_tok2vec) msg.text("Loaded pretrained tok2vec for: {}".format(components)) + # Verify textcat config + if "textcat" in pipeline: + textcat_labels = nlp.get_pipe("textcat").cfg["labels"] + if textcat_positive_label and textcat_positive_label not in textcat_labels: + msg.fail( + "The textcat_positive_label (tpl) '{}' does not match any " + "label in the training data.".format(textcat_positive_label), + exits=1, + ) + if textcat_positive_label and len(textcat_labels) != 2: + msg.fail( + "A textcat_positive_label (tpl) '{}' was provided for training " + "data that does not appear to be a binary classification " + "problem with two
labels.".format(textcat_positive_label), + exits=1, + ) + train_docs = corpus.train_docs( + nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0 + ) + train_labels = set() + if textcat_multilabel: + multilabel_found = False + for text, gold in train_docs: + train_labels.update(gold.cats.keys()) + if list(gold.cats.values()).count(1.0) != 1: + multilabel_found = True + if not multilabel_found and not base_model: + msg.warn( + "The textcat training instances look like they have " + "mutually-exclusive classes. Remove the flag " + "'--textcat-multilabel' to train a classifier with " + "mutually-exclusive classes." + ) + if not textcat_multilabel: + for text, gold in train_docs: + train_labels.update(gold.cats.keys()) + if list(gold.cats.values()).count(1.0) != 1 and not base_model: + msg.warn( + "Some textcat training instances do not have exactly " + "one positive label. Modifying training options to " + "include the flag '--textcat-multilabel' for classes " + "that are not mutually exclusive." + ) + nlp.get_pipe("textcat").cfg["exclusive_classes"] = False + textcat_multilabel = True + break + if base_model and set(textcat_labels) != train_labels: + msg.fail( + "Cannot extend textcat model using data with different " + "labels. Base model labels: {}, training data labels: " + "{}.".format(textcat_labels, list(train_labels)), + exits=1, + ) + if textcat_multilabel: + msg.text( + "Textcat evaluation score: ROC AUC score macro-averaged across " + "the labels '{}'".format(", ".join(textcat_labels)) + ) + elif textcat_positive_label and len(textcat_labels) == 2: + msg.text( + "Textcat evaluation score: F1-score for the " + "label '{}'".format(textcat_positive_label) + ) + elif len(textcat_labels) > 1: + if len(textcat_labels) == 2: + msg.warn( + "If the textcat component is a binary classifier with " + "exclusive classes, provide '--textcat_positive_label' for " + "an evaluation on the positive class." 
+ ) + msg.text( + "Textcat evaluation score: F1-score macro-averaged across " + "the labels '{}'".format(", ".join(textcat_labels)) + ) + else: + msg.fail( + "Unsupported textcat configuration. Use `spacy debug-data` " + "for more information." + ) + # fmt: off - row_head = ["Itn", "Dep Loss", "NER Loss", "UAS", "NER P", "NER R", "NER F", "Tag %", "Token %", "CPU WPS", "GPU WPS"] - row_widths = [3, 10, 10, 7, 7, 7, 7, 7, 7, 7, 7] - if has_beam_widths: - row_head.insert(1, "Beam W.") - row_widths.insert(1, 7) + row_head, output_stats = _configure_training_output(pipeline, use_gpu, has_beam_widths) + row_widths = [len(w) for w in row_head] row_settings = {"widths": row_widths, "aligns": tuple(["r" for i in row_head]), "spacing": 2} # fmt: on print("") @@ -243,7 +339,11 @@ def train( best_score = 0.0 for i in range(n_iter): train_docs = corpus.train_docs( - nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0 + nlp, + noise_level=noise_level, + orth_variant_level=orth_variant_level, + gold_preproc=gold_preproc, + max_length=0, ) if raw_text: random.shuffle(raw_text) @@ -286,7 +386,7 @@ def train( ) nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs) start_time = timer() - scorer = nlp_loaded.evaluate(dev_docs, debug) + scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose) end_time = timer() if use_gpu < 0: gpu_wps = None @@ -302,7 +402,7 @@ def train( corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc) ) start_time = timer() - scorer = nlp_loaded.evaluate(dev_docs) + scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose) end_time = timer() cpu_wps = nwords / (end_time - start_time) acc_loc = output_path / ("model%d" % i) / "accuracy.json" @@ -336,6 +436,7 @@ def train( } meta.setdefault("name", "model%d" % i) meta.setdefault("version", version) + meta["labels"] = nlp.meta["labels"] meta_loc = output_path / ("model%d" % i) / "meta.json" srsly.write_json(meta_loc, meta) util.set_env_log(verbose) @@ -344,10 +445,19 @@ def train( i, losses, 
scorer.scores, + output_stats, beam_width=beam_width if has_beam_widths else None, cpu_wps=cpu_wps, gpu_wps=gpu_wps, ) + if i == 0 and "textcat" in pipeline: + textcats_per_cat = scorer.scores.get("textcats_per_cat", {}) + for cat, cat_score in textcats_per_cat.items(): + if cat_score.get("roc_auc_score", 0) < 0: + msg.warn( + "Textcat ROC AUC score is undefined due to " + "only one value in label '{}'.".format(cat) + ) msg.row(progress, **row_settings) # Early stopping if n_early_stopping is not None: @@ -388,6 +498,8 @@ def _score_for_model(meta): mean_acc.append((acc["uas"] + acc["las"]) / 2) if "ner" in pipes: mean_acc.append((acc["ents_p"] + acc["ents_r"] + acc["ents_f"]) / 3) + if "textcat" in pipes: + mean_acc.append(acc["textcat_score"]) return sum(mean_acc) / len(mean_acc) @@ -471,40 +583,55 @@ def _get_metrics(component): return ("token_acc",) -def _get_progress(itn, losses, dev_scores, beam_width=None, cpu_wps=0.0, gpu_wps=0.0): +def _configure_training_output(pipeline, use_gpu, has_beam_widths): + row_head = ["Itn"] + output_stats = [] + for pipe in pipeline: + if pipe == "tagger": + row_head.extend(["Tag Loss ", " Tag % "]) + output_stats.extend(["tag_loss", "tags_acc"]) + elif pipe == "parser": + row_head.extend(["Dep Loss ", " UAS ", " LAS "]) + output_stats.extend(["dep_loss", "uas", "las"]) + elif pipe == "ner": + row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "]) + output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"]) + elif pipe == "textcat": + row_head.extend(["Textcat Loss", "Textcat"]) + output_stats.extend(["textcat_loss", "textcat_score"]) + row_head.extend(["Token %", "CPU WPS"]) + output_stats.extend(["token_acc", "cpu_wps"]) + + if use_gpu >= 0: + row_head.extend(["GPU WPS"]) + output_stats.extend(["gpu_wps"]) + + if has_beam_widths: + row_head.insert(1, "Beam W.") + return row_head, output_stats + + +def _get_progress( + itn, losses, dev_scores, output_stats, beam_width=None, cpu_wps=0.0, gpu_wps=0.0 +): scores = {} - 
for col in [ - "dep_loss", - "tag_loss", - "uas", - "tags_acc", - "token_acc", - "ents_p", - "ents_r", - "ents_f", - "cpu_wps", - "gpu_wps", - ]: - scores[col] = 0.0 + for stat in output_stats: + scores[stat] = 0.0 scores["dep_loss"] = losses.get("parser", 0.0) scores["ner_loss"] = losses.get("ner", 0.0) scores["tag_loss"] = losses.get("tagger", 0.0) - scores.update(dev_scores) + scores["textcat_loss"] = losses.get("textcat", 0.0) scores["cpu_wps"] = cpu_wps scores["gpu_wps"] = gpu_wps or 0.0 - result = [ - itn, - "{:.3f}".format(scores["dep_loss"]), - "{:.3f}".format(scores["ner_loss"]), - "{:.3f}".format(scores["uas"]), - "{:.3f}".format(scores["ents_p"]), - "{:.3f}".format(scores["ents_r"]), - "{:.3f}".format(scores["ents_f"]), - "{:.3f}".format(scores["tags_acc"]), - "{:.3f}".format(scores["token_acc"]), - "{:.0f}".format(scores["cpu_wps"]), - "{:.0f}".format(scores["gpu_wps"]), - ] + scores.update(dev_scores) + formatted_scores = [] + for stat in output_stats: + format_spec = "{:.3f}" + if stat.endswith("_wps"): + format_spec = "{:.0f}" + formatted_scores.append(format_spec.format(scores[stat])) + result = [itn + 1] + result.extend(formatted_scores) if beam_width is not None: result.insert(1, beam_width) return result diff --git a/spacy/errors.py b/spacy/errors.py index 587a6e700..30c7a5f48 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -84,6 +84,10 @@ class Warnings(object): W018 = ("Entity '{entity}' already exists in the Knowledge base.") W019 = ("Changing vectors name from {old} to {new}, to avoid clash with " "previously loaded vectors. See Issue #3853.") + W020 = ("Unnamed vectors. This won't allow multiple vectors models to be " + "loaded. (Shape: {shape})") + W021 = ("Unexpected hash collision in PhraseMatcher. Matches may be " + "incorrect. Modify PhraseMatcher._terminal_hash to fix.") @add_codes @@ -118,7 +122,7 @@ class Errors(object): E011 = ("Unknown operator: '{op}'. 
Options: {opts}") E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}") E013 = ("Error selecting action in matcher") - E014 = ("Uknown tag ID: {tag}") + E014 = ("Unknown tag ID: {tag}") E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use " "`force=True` to overwrite.") E016 = ("MultitaskObjective target should be function or one of: dep, " @@ -457,6 +461,25 @@ class Errors(object): E160 = ("Can't find language data file: {path}") E161 = ("Found an internal inconsistency when predicting entity links. " "This is likely a bug in spaCy, so feel free to open an issue.") + E162 = ("Cannot evaluate textcat model on data with different labels.\n" + "Labels in model: {model_labels}\nLabels in evaluation " + "data: {eval_labels}") + E163 = ("cumsum was found to be unstable: its last element does not " + "correspond to sum") + E164 = ("x is neither increasing nor decreasing: {}.") + E165 = ("Only one class present in y_true. ROC AUC score is not defined in " + "that case.") + E166 = ("Can only merge DocBins with the same pre-defined attributes.\n" + "Current DocBin: {current}\nOther DocBin: {other}") + E167 = ("Unknown morphological feature: '{feat}' ({feat_id}). This can " + "happen if the tagger was trained with a different set of " + "morphological features. 
If you're using a pre-trained model, make " + "sure that your models are up to date:\npython -m spacy validate") + E168 = ("Unknown field: {field}") + E169 = ("Can't find module: {module}") + E170 = ("Cannot apply transition {name}: invalid for the current state.") + E171 = ("Matcher.add received invalid on_match callback argument: expected " + "callable or None, but got: {arg_type}") @add_codes diff --git a/spacy/glossary.py b/spacy/glossary.py index ff38e7138..52abc7bb5 100644 --- a/spacy/glossary.py +++ b/spacy/glossary.py @@ -307,4 +307,10 @@ GLOSSARY = { # https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf "PER": "Named person or family.", "MISC": "Miscellaneous entities, e.g. events, nationalities, products or works of art", + # https://github.com/ltgoslo/norne + "EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.", + "PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas", + "DRV": "Words (and phrases?) that are derived from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')", + "GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'", + "GPE_ORG": "Geo-political entity, with an organisation sense, e.g.
'Spain declined to meet with Belgium'", } diff --git a/spacy/gold.pxd b/spacy/gold.pxd index a3123f7fa..20a25a939 100644 --- a/spacy/gold.pxd +++ b/spacy/gold.pxd @@ -24,6 +24,7 @@ cdef class GoldParse: cdef public int loss cdef public list words cdef public list tags + cdef public list morphology cdef public list heads cdef public list labels cdef public dict orths diff --git a/spacy/gold.pyx b/spacy/gold.pyx index f6ec8d3fa..4cc44f757 100644 --- a/spacy/gold.pyx +++ b/spacy/gold.pyx @@ -7,6 +7,7 @@ import random import numpy import tempfile import shutil +import itertools from pathlib import Path import srsly @@ -56,6 +57,7 @@ def tags_to_entities(tags): def merge_sents(sents): m_deps = [[], [], [], [], [], []] m_brackets = [] + m_cats = sents.pop() i = 0 for (ids, words, tags, heads, labels, ner), brackets in sents: m_deps[0].extend(id_ + i for id_ in ids) @@ -67,6 +69,7 @@ def merge_sents(sents): m_brackets.extend((b["first"] + i, b["last"] + i, b["label"]) for b in brackets) i += len(ids) + m_deps.append(m_cats) return [(m_deps, m_brackets)] @@ -198,6 +201,7 @@ class GoldCorpus(object): n = 0 i = 0 for raw_text, paragraph_tuples in self.train_tuples: + cats = paragraph_tuples.pop() for sent_tuples, brackets in paragraph_tuples: n += len(sent_tuples[1]) if self.limit and i >= self.limit: @@ -206,13 +210,14 @@ class GoldCorpus(object): return n def train_docs(self, nlp, gold_preproc=False, max_length=None, - noise_level=0.0): + noise_level=0.0, orth_variant_level=0.0): locs = list((self.tmp_dir / 'train').iterdir()) random.shuffle(locs) train_tuples = self.read_tuples(locs, limit=self.limit) gold_docs = self.iter_gold_docs(nlp, train_tuples, gold_preproc, max_length=max_length, noise_level=noise_level, + orth_variant_level=orth_variant_level, make_projective=True) yield from gold_docs @@ -226,43 +231,132 @@ class GoldCorpus(object): @classmethod def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None, - noise_level=0.0, make_projective=False): + 
noise_level=0.0, orth_variant_level=0.0, make_projective=False): for raw_text, paragraph_tuples in tuples: if gold_preproc: raw_text = None else: paragraph_tuples = merge_sents(paragraph_tuples) - docs = cls._make_docs(nlp, raw_text, paragraph_tuples, gold_preproc, - noise_level=noise_level) + docs, paragraph_tuples = cls._make_docs(nlp, raw_text, + paragraph_tuples, gold_preproc, noise_level=noise_level, + orth_variant_level=orth_variant_level) golds = cls._make_golds(docs, paragraph_tuples, make_projective) for doc, gold in zip(docs, golds): if (not max_length) or len(doc) < max_length: yield doc, gold @classmethod - def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0): + def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0, orth_variant_level=0.0): if raw_text is not None: + raw_text, paragraph_tuples = make_orth_variants(nlp, raw_text, paragraph_tuples, orth_variant_level=orth_variant_level) raw_text = add_noise(raw_text, noise_level) - return [nlp.make_doc(raw_text)] + return [nlp.make_doc(raw_text)], paragraph_tuples else: + docs = [] + raw_text, paragraph_tuples = make_orth_variants(nlp, None, paragraph_tuples, orth_variant_level=orth_variant_level) return [Doc(nlp.vocab, words=add_noise(sent_tuples[1], noise_level)) - for (sent_tuples, brackets) in paragraph_tuples] + for (sent_tuples, brackets) in paragraph_tuples], paragraph_tuples + @classmethod def _make_golds(cls, docs, paragraph_tuples, make_projective): if len(docs) != len(paragraph_tuples): n_annots = len(paragraph_tuples) raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots)) - if len(docs) == 1: - return [GoldParse.from_annot_tuples(docs[0], paragraph_tuples[0][0], - make_projective=make_projective)] - else: - return [GoldParse.from_annot_tuples(doc, sent_tuples, + return [GoldParse.from_annot_tuples(doc, sent_tuples, make_projective=make_projective) for doc, (sent_tuples, brackets) in zip(docs, paragraph_tuples)] +def 
make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): + if random.random() >= orth_variant_level: + return raw, paragraph_tuples + if random.random() >= 0.5: + lower = True + if raw is not None: + raw = raw.lower() + ndsv = nlp.Defaults.single_orth_variants + ndpv = nlp.Defaults.paired_orth_variants + # modify words in paragraph_tuples + variant_paragraph_tuples = [] + for sent_tuples, brackets in paragraph_tuples: + ids, words, tags, heads, labels, ner, cats = sent_tuples + if lower: + words = [w.lower() for w in words] + # single variants + punct_choices = [random.choice(x["variants"]) for x in ndsv] + for word_idx in range(len(words)): + for punct_idx in range(len(ndsv)): + if tags[word_idx] in ndsv[punct_idx]["tags"] \ + and words[word_idx] in ndsv[punct_idx]["variants"]: + words[word_idx] = punct_choices[punct_idx] + # paired variants + punct_choices = [random.choice(x["variants"]) for x in ndpv] + for word_idx in range(len(words)): + for punct_idx in range(len(ndpv)): + if tags[word_idx] in ndpv[punct_idx]["tags"] \ + and words[word_idx] in itertools.chain.from_iterable(ndpv[punct_idx]["variants"]): + # backup option: random left vs. 
right from pair + pair_idx = random.choice([0, 1]) + # best option: rely on paired POS tags like `` / '' + if len(ndpv[punct_idx]["tags"]) == 2: + pair_idx = ndpv[punct_idx]["tags"].index(tags[word_idx]) + # next best option: rely on position in variants + # (may not be unambiguous, so order of variants matters) + else: + for pair in ndpv[punct_idx]["variants"]: + if words[word_idx] in pair: + pair_idx = pair.index(words[word_idx]) + words[word_idx] = punct_choices[punct_idx][pair_idx] + + variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner, cats), brackets)) + # modify raw to match variant_paragraph_tuples + if raw is not None: + variants = [] + for single_variants in ndsv: + variants.extend(single_variants["variants"]) + for paired_variants in ndpv: + variants.extend(list(itertools.chain.from_iterable(paired_variants["variants"]))) + # store variants in reverse length order to be able to prioritize + # longer matches (e.g., "---" before "--") + variants = sorted(variants, key=lambda x: len(x)) + variants.reverse() + variant_raw = "" + raw_idx = 0 + # add initial whitespace + while raw_idx < len(raw) and re.match("\s", raw[raw_idx]): + variant_raw += raw[raw_idx] + raw_idx += 1 + for sent_tuples, brackets in variant_paragraph_tuples: + ids, words, tags, heads, labels, ner, cats = sent_tuples + for word in words: + match_found = False + # add identical word + if word not in variants and raw[raw_idx:].startswith(word): + variant_raw += word + raw_idx += len(word) + match_found = True + # add variant word + else: + for variant in variants: + if not match_found and \ + raw[raw_idx:].startswith(variant): + raw_idx += len(variant) + variant_raw += word + match_found = True + # something went wrong, abort + # (add a warning message?) 
+ if not match_found: + return raw, paragraph_tuples + # add following whitespace + while raw_idx < len(raw) and re.match("\s", raw[raw_idx]): + variant_raw += raw[raw_idx] + raw_idx += 1 + return variant_raw, variant_paragraph_tuples + return raw, variant_paragraph_tuples + + def add_noise(orig, noise_level): if random.random() >= noise_level: return orig @@ -277,12 +371,8 @@ def add_noise(orig, noise_level): def _corrupt(c, noise_level): if random.random() >= noise_level: return c - elif c == " ": - return "\n" - elif c == "\n": - return " " elif c in [".", "'", "!", "?", ","]: - return "" + return "\n" else: return c.lower() @@ -330,6 +420,10 @@ def json_to_tuple(doc): sents.append([ [ids, words, tags, heads, labels, ner], sent.get("brackets", [])]) + cats = {} + for cat in paragraph.get("cats", {}): + cats[cat["label"]] = cat["value"] + sents.append(cats) if sents: yield [paragraph.get("raw", None), sents] @@ -443,11 +537,12 @@ cdef class GoldParse: """ @classmethod def from_annot_tuples(cls, doc, annot_tuples, make_projective=False): - _, words, tags, heads, deps, entities = annot_tuples + _, words, tags, heads, deps, entities, cats = annot_tuples return cls(doc, words=words, tags=tags, heads=heads, deps=deps, - entities=entities, make_projective=make_projective) + entities=entities, cats=cats, + make_projective=make_projective) - def __init__(self, doc, annot_tuples=None, words=None, tags=None, + def __init__(self, doc, annot_tuples=None, words=None, tags=None, morphology=None, heads=None, deps=None, entities=None, make_projective=False, cats=None, links=None, **_): """Create a GoldParse. 
@@ -482,11 +577,13 @@ cdef class GoldParse: if words is None: words = [token.text for token in doc] if tags is None: - tags = [None for _ in doc] + tags = [None for _ in words] if heads is None: - heads = [None for token in doc] + heads = [None for _ in words] if deps is None: - deps = [None for _ in doc] + deps = [None for _ in words] + if morphology is None: + morphology = [None for _ in words] if entities is None: entities = ["-" for _ in doc] elif len(entities) == 0: @@ -498,7 +595,6 @@ cdef class GoldParse: if not isinstance(entities[0], basestring): # Assume we have entities specified by character offset. entities = biluo_tags_from_offsets(doc, entities) - self.mem = Pool() self.loss = 0 self.length = len(doc) @@ -518,6 +614,7 @@ cdef class GoldParse: self.heads = [None] * len(doc) self.labels = [None] * len(doc) self.ner = [None] * len(doc) + self.morphology = [None] * len(doc) # This needs to be done before we align the words if make_projective and heads is not None and deps is not None: @@ -544,11 +641,13 @@ cdef class GoldParse: self.tags[i] = "_SP" self.heads[i] = None self.labels[i] = None - self.ner[i] = "O" + self.ner[i] = None + self.morphology[i] = set() if gold_i is None: if i in i2j_multi: self.words[i] = words[i2j_multi[i]] self.tags[i] = tags[i2j_multi[i]] + self.morphology[i] = morphology[i2j_multi[i]] is_last = i2j_multi[i] != i2j_multi.get(i+1) is_first = i2j_multi[i] != i2j_multi.get(i-1) # Set next word in multi-token span as head, until last @@ -585,6 +684,7 @@ cdef class GoldParse: else: self.words[i] = words[gold_i] self.tags[i] = tags[gold_i] + self.morphology[i] = morphology[gold_i] if heads[gold_i] is None: self.heads[i] = None else: @@ -592,9 +692,20 @@ cdef class GoldParse: self.labels[i] = deps[gold_i] self.ner[i] = entities[gold_i] + # Prevent whitespace that isn't within entities from being tagged as + # an entity. 
+ for i in range(len(self.ner)): + if self.tags[i] == "_SP": + prev_ner = self.ner[i-1] if i >= 1 else None + next_ner = self.ner[i+1] if (i+1) < len(self.ner) else None + if prev_ner == "O" or next_ner == "O": + self.ner[i] = "O" + cycle = nonproj.contains_cycle(self.heads) if cycle is not None: - raise ValueError(Errors.E069.format(cycle=cycle, cycle_tokens=" ".join(["'{}'".format(self.words[tok_id]) for tok_id in cycle]), doc_tokens=" ".join(words[:50]))) + raise ValueError(Errors.E069.format(cycle=cycle, + cycle_tokens=" ".join(["'{}'".format(self.words[tok_id]) for tok_id in cycle]), + doc_tokens=" ".join(words[:50]))) def __len__(self): """Get the number of gold-standard tokens. @@ -638,7 +749,10 @@ def docs_to_json(docs, id=0): docs = [docs] json_doc = {"id": id, "paragraphs": []} for i, doc in enumerate(docs): - json_para = {'raw': doc.text, "sentences": []} + json_para = {'raw': doc.text, "sentences": [], "cats": []} + for cat, val in doc.cats.items(): + json_cat = {"label": cat, "value": val} + json_para["cats"].append(json_cat) ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents] biluo_tags = biluo_tags_from_offsets(doc, ent_offsets) for j, sent in enumerate(doc.sents): diff --git a/spacy/kb.pyx b/spacy/kb.pyx index 176ac17de..6cbc06e2c 100644 --- a/spacy/kb.pyx +++ b/spacy/kb.pyx @@ -24,7 +24,7 @@ cdef class Candidate: algorithm which will disambiguate the various candidates to the correct one. Each candidate (alias, entity) pair is assigned to a certain prior probability. 
- DOCS: https://spacy.io/api/candidate + DOCS: https://spacy.io/api/kb/#candidate_init """ def __init__(self, KnowledgeBase kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob): diff --git a/spacy/lang/char_classes.py b/spacy/lang/char_classes.py index 131bdcd51..cb5b50ffc 100644 --- a/spacy/lang/char_classes.py +++ b/spacy/lang/char_classes.py @@ -201,7 +201,9 @@ _ukrainian = r"а-щюяіїєґА-ЩЮЯІЇЄҐ" _upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper _lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower -_uncased = _bengali + _hebrew + _persian + _sinhala + _hindi + _kannada + _tamil + _telugu +_uncased = ( + _bengali + _hebrew + _persian + _sinhala + _hindi + _kannada + _tamil + _telugu +) ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased) ALPHA_LOWER = group_chars(_lower + _uncased) diff --git a/spacy/lang/de/__init__.py b/spacy/lang/de/__init__.py index 1b5aee6a8..b96069235 100644 --- a/spacy/lang/de/__init__.py +++ b/spacy/lang/de/__init__.py @@ -27,6 +27,20 @@ class GermanDefaults(Language.Defaults): stop_words = STOP_WORDS syntax_iterators = SYNTAX_ITERATORS resources = {"lemma_lookup": "lemma_lookup.json"} + single_orth_variants = [ + {"tags": ["$("], "variants": ["…", "..."]}, + {"tags": ["$("], "variants": ["-", "—", "–", "--", "---", "——"]}, + ] + paired_orth_variants = [ + { + "tags": ["$("], + "variants": [("'", "'"), (",", "'"), ("‚", "‘"), ("›", "‹"), ("‹", "›")], + }, + { + "tags": ["$("], + "variants": [("``", "''"), ('"', '"'), ("„", "“"), ("»", "«"), ("«", "»")], + }, + ] class German(Language): diff --git a/spacy/lang/de/tag_map.py b/spacy/lang/de/tag_map.py index 3bb6247c4..394478145 100644 --- a/spacy/lang/de/tag_map.py +++ b/spacy/lang/de/tag_map.py @@ -10,7 +10,7 @@ TAG_MAP = { "$,": {POS: PUNCT, "PunctType": "comm"}, "$.": {POS: PUNCT, "PunctType": "peri"}, "ADJA": {POS: ADJ}, - "ADJD": {POS: ADJ, "Variant": "short"}, + "ADJD": 
{POS: ADJ}, "ADV": {POS: ADV}, "APPO": {POS: ADP, "AdpType": "post"}, "APPR": {POS: ADP, "AdpType": "prep"}, @@ -32,7 +32,7 @@ TAG_MAP = { "PDAT": {POS: DET, "PronType": "dem"}, "PDS": {POS: PRON, "PronType": "dem"}, "PIAT": {POS: DET, "PronType": "ind|neg|tot"}, - "PIDAT": {POS: DET, "AdjType": "pdt", "PronType": "ind|neg|tot"}, + "PIDAT": {POS: DET, "PronType": "ind|neg|tot"}, "PIS": {POS: PRON, "PronType": "ind|neg|tot"}, "PPER": {POS: PRON, "PronType": "prs"}, "PPOSAT": {POS: DET, "Poss": "yes", "PronType": "prs"}, @@ -42,7 +42,7 @@ TAG_MAP = { "PRF": {POS: PRON, "PronType": "prs", "Reflex": "yes"}, "PTKA": {POS: PART}, "PTKANT": {POS: PART, "PartType": "res"}, - "PTKNEG": {POS: PART, "Polarity": "Neg"}, + "PTKNEG": {POS: PART, "Polarity": "neg"}, "PTKVZ": {POS: PART, "PartType": "vbp"}, "PTKZU": {POS: PART, "PartType": "inf"}, "PWAT": {POS: DET, "PronType": "int"}, diff --git a/spacy/lang/el/lemmatizer/__init__.py b/spacy/lang/el/lemmatizer/__init__.py index c0ce5c2ad..994bf9c16 100644 --- a/spacy/lang/el/lemmatizer/__init__.py +++ b/spacy/lang/el/lemmatizer/__init__.py @@ -46,9 +46,10 @@ class GreekLemmatizer(object): ) return lemmas - def lookup(self, string): - if string in self.lookup_table: - return self.lookup_table[string] + def lookup(self, string, orth=None): + key = orth if orth is not None else string + if key in self.lookup_table: + return self.lookup_table[key] return string diff --git a/spacy/lang/en/__init__.py b/spacy/lang/en/__init__.py index 7d00c749c..e4c745c83 100644 --- a/spacy/lang/en/__init__.py +++ b/spacy/lang/en/__init__.py @@ -38,6 +38,14 @@ class EnglishDefaults(Language.Defaults): "lemma_index": "lemmatizer/lemma_index.json", "lemma_exc": "lemmatizer/lemma_exc.json", } + single_orth_variants = [ + {"tags": ["NFP"], "variants": ["…", "..."]}, + {"tags": [":"], "variants": ["-", "—", "–", "--", "---", "——"]}, + ] + paired_orth_variants = [ + {"tags": ["``", "''"], "variants": [("'", "'"), ("‘", "’")]}, + {"tags": ["``", "''"], 
"variants": [('"', '"'), ("“", "”")]}, + ] class English(Language): diff --git a/spacy/lang/en/lemmatizer/lemma_lookup.json b/spacy/lang/en/lemmatizer/lemma_lookup.json index d0f92c37c..15d41e4ba 100644 --- a/spacy/lang/en/lemmatizer/lemma_lookup.json +++ b/spacy/lang/en/lemmatizer/lemma_lookup.json @@ -20574,7 +20574,7 @@ "lengthier": "lengthy", "lengthiest": "lengthy", "lengths": "length", - "lenses": "lense", + "lenses": "lens", "lent": "lend", "lenticels": "lenticel", "lentils": "lentil", diff --git a/spacy/lang/en/morph_rules.py b/spacy/lang/en/morph_rules.py index 198182ff0..5ed4eac59 100644 --- a/spacy/lang/en/morph_rules.py +++ b/spacy/lang/en/morph_rules.py @@ -3,55 +3,59 @@ from __future__ import unicode_literals from ...symbols import LEMMA, PRON_LEMMA +# Several entries here look pretty suspicious. These will get the POS SCONJ +# given the tag IN, when an adpositional reading seems much more likely for +# a lot of these prepositions. I'm not sure what I was running in 04395ffa4 +# when I did this? It doesn't seem right. 
_subordinating_conjunctions = [ "that", "if", "as", "because", - "of", - "for", - "before", - "in", + # "of", + # "for", + # "before", + # "in", "while", - "after", + # "after", "since", "like", - "with", + # "with", "so", - "to", - "by", - "on", - "about", + # "to", + # "by", + # "on", + # "about", "than", "whether", "although", - "from", + # "from", "though", - "until", + # "until", "unless", "once", - "without", - "at", - "into", + # "without", + # "at", + # "into", "cause", - "over", + # "over", "upon", "till", "whereas", - "beyond", + # "beyond", "whilst", "except", "despite", "wether", - "then", + # "then", "but", "becuse", "whie", - "below", - "against", + # "below", + # "against", "it", "w/out", - "toward", + # "toward", "albeit", "save", "besides", @@ -63,16 +67,17 @@ _subordinating_conjunctions = [ "out", "near", "seince", - "towards", + # "towards", "tho", "sice", "will", ] -_relative_pronouns = ["this", "that", "those", "these"] +# This seems kind of wrong too? +# _relative_pronouns = ["this", "that", "those", "these"] MORPH_RULES = { - "DT": {word: {"POS": "PRON"} for word in _relative_pronouns}, + # "DT": {word: {"POS": "PRON"} for word in _relative_pronouns}, "IN": {word: {"POS": "SCONJ"} for word in _subordinating_conjunctions}, "NN": { "something": {"POS": "PRON"}, diff --git a/spacy/lang/en/tag_map.py b/spacy/lang/en/tag_map.py index 246258f57..9bd884a3a 100644 --- a/spacy/lang/en/tag_map.py +++ b/spacy/lang/en/tag_map.py @@ -14,10 +14,10 @@ TAG_MAP = { '""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"}, "''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"}, ":": {POS: PUNCT}, - "$": {POS: SYM, "Other": {"SymType": "currency"}}, - "#": {POS: SYM, "Other": {"SymType": "numbersign"}}, - "AFX": {POS: X, "Hyph": "yes"}, - "CC": {POS: CCONJ, "ConjType": "coor"}, + "$": {POS: SYM}, + "#": {POS: SYM}, + "AFX": {POS: ADJ, "Hyph": "yes"}, + "CC": {POS: CCONJ, "ConjType": "comp"}, "CD": {POS: NUM, "NumType": "card"}, "DT": {POS: DET}, "EX": 
{POS: PRON, "AdvType": "ex"}, @@ -34,7 +34,7 @@ TAG_MAP = { "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"}, "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"}, "NNS": {POS: NOUN, "Number": "plur"}, - "PDT": {POS: DET, "AdjType": "pdt", "PronType": "prn"}, + "PDT": {POS: DET}, "POS": {POS: PART, "Poss": "yes"}, "PRP": {POS: PRON, "PronType": "prs"}, "PRP$": {POS: PRON, "PronType": "prs", "Poss": "yes"}, @@ -56,12 +56,12 @@ TAG_MAP = { "VerbForm": "fin", "Tense": "pres", "Number": "sing", - "Person": 3, + "Person": "three", }, - "WDT": {POS: PRON, "PronType": "int|rel"}, - "WP": {POS: PRON, "PronType": "int|rel"}, - "WP$": {POS: PRON, "Poss": "yes", "PronType": "int|rel"}, - "WRB": {POS: ADV, "PronType": "int|rel"}, + "WDT": {POS: PRON}, + "WP": {POS: PRON}, + "WP$": {POS: PRON, "Poss": "yes"}, + "WRB": {POS: ADV}, "ADD": {POS: X}, "NFP": {POS: PUNCT}, "GW": {POS: X}, diff --git a/spacy/lang/en/tokenizer_exceptions.py b/spacy/lang/en/tokenizer_exceptions.py index 21e664f7f..c45197771 100644 --- a/spacy/lang/en/tokenizer_exceptions.py +++ b/spacy/lang/en/tokenizer_exceptions.py @@ -30,14 +30,7 @@ for pron in ["i"]: for orth in [pron, pron.title()]: _exc[orth + "'m"] = [ {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - { - ORTH: "'m", - LEMMA: "be", - NORM: "am", - TAG: "VBP", - "tenspect": 1, - "number": 1, - }, + {ORTH: "'m", LEMMA: "be", NORM: "am", TAG: "VBP"}, ] _exc[orth + "m"] = [ diff --git a/spacy/lang/fr/lemmatizer/__init__.py b/spacy/lang/fr/lemmatizer/__init__.py index a0a0d2021..dfd822188 100644 --- a/spacy/lang/fr/lemmatizer/__init__.py +++ b/spacy/lang/fr/lemmatizer/__init__.py @@ -114,9 +114,9 @@ class FrenchLemmatizer(object): def punct(self, string, morphology=None): return self(string, "punct", morphology) - def lookup(self, string): - if string in self.lookup_table: - return self.lookup_table[string][0] + def lookup(self, string, orth=None): + if orth is not None and orth in self.lookup_table: + return 
self.lookup_table[orth][0] return string diff --git a/spacy/lang/hi/stop_words.py b/spacy/lang/hi/stop_words.py index 430a18a22..efad18c84 100644 --- a/spacy/lang/hi/stop_words.py +++ b/spacy/lang/hi/stop_words.py @@ -2,7 +2,8 @@ from __future__ import unicode_literals -# Source: https://github.com/taranjeet/hindi-tokenizer/blob/master/stopwords.txt +# Source: https://github.com/taranjeet/hindi-tokenizer/blob/master/stopwords.txt, https://data.mendeley.com/datasets/bsr3frvvjc/1#file-a21d5092-99d7-45d8-b044-3ae9edd391c6 + STOP_WORDS = set( """ अंदर @@ -18,6 +19,7 @@ STOP_WORDS = set( अंदर आदि आप +अगर इंहिं इंहें इंहों @@ -171,6 +173,9 @@ STOP_WORDS = set( मानो मे में +मैं +मुझको +मेरा यदि यह यहाँ @@ -227,6 +232,7 @@ STOP_WORDS = set( है हैं हो +हूँ होता होति होती diff --git a/spacy/lang/ja/__init__.py b/spacy/lang/ja/__init__.py index 3a6074bba..056a6893b 100644 --- a/spacy/lang/ja/__init__.py +++ b/spacy/lang/ja/__init__.py @@ -37,6 +37,11 @@ def resolve_pos(token): in the sentence. This function adds information to the POS tag to resolve ambiguous mappings. """ + + # this is only used for consecutive ascii spaces + if token.pos == "空白": + return "空白" + # TODO: This is a first take. The rules here are crude approximations. # For many of these, full dependencies are needed to properly resolve # PoS mappings. @@ -54,6 +59,7 @@ def detailed_tokens(tokenizer, text): node = tokenizer.parseToNode(text) node = node.next # first node is beginning of sentence and empty, skip it words = [] + spaces = [] while node.posid != 0: surface = node.surface base = surface # a default value. Updated if available later. @@ -64,8 +70,20 @@ def detailed_tokens(tokenizer, text): # dictionary base = parts[7] words.append(ShortUnitWord(surface, base, pos)) + + # The way MeCab stores spaces is that the rlength of the next token is + # the length of that token plus any preceding whitespace, **in bytes**. + # also note that this is only for half-width / ascii spaces. 
Full width + # spaces just become tokens. + scount = node.next.rlength - node.next.length + spaces.append(bool(scount)) + while scount > 1: + words.append(ShortUnitWord(" ", " ", "空白")) + spaces.append(False) + scount -= 1 + node = node.next - return words + return words, spaces class JapaneseTokenizer(DummyTokenizer): @@ -75,9 +93,8 @@ class JapaneseTokenizer(DummyTokenizer): self.tokenizer.parseToNode("") # see #2901 def __call__(self, text): - dtokens = detailed_tokens(self.tokenizer, text) + dtokens, spaces = detailed_tokens(self.tokenizer, text) words = [x.surface for x in dtokens] - spaces = [False] * len(words) doc = Doc(self.vocab, words=words, spaces=spaces) mecab_tags = [] for token, dtoken in zip(doc, dtokens): diff --git a/spacy/lang/ja/tag_map.py b/spacy/lang/ja/tag_map.py index 6b114eb10..4ff0a35ee 100644 --- a/spacy/lang/ja/tag_map.py +++ b/spacy/lang/ja/tag_map.py @@ -2,7 +2,7 @@ from __future__ import unicode_literals from ...symbols import POS, PUNCT, INTJ, X, ADJ, AUX, ADP, PART, SCONJ, NOUN -from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET +from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE TAG_MAP = { @@ -21,6 +21,8 @@ TAG_MAP = { "感動詞,一般,*,*": {POS: INTJ}, # this is specifically for unicode full-width space "空白,*,*,*": {POS: X}, + # This is used when sequential half-width spaces are present + "空白": {POS: SPACE}, "形状詞,一般,*,*": {POS: ADJ}, "形状詞,タリ,*,*": {POS: ADJ}, "形状詞,助動詞語幹,*,*": {POS: ADJ}, diff --git a/spacy/lang/ko/__init__.py b/spacy/lang/ko/__init__.py index c8cd9c3fd..ec79a95ab 100644 --- a/spacy/lang/ko/__init__.py +++ b/spacy/lang/ko/__init__.py @@ -1,8 +1,6 @@ # encoding: utf8 from __future__ import unicode_literals, print_function -import sys - from .stop_words import STOP_WORDS from .tag_map import TAG_MAP from ...attrs import LANG @@ -10,35 +8,12 @@ from ...language import Language from ...tokens import Doc from ...compat import copy_reg from ...util import DummyTokenizer -from ...compat import 
is_python3, is_python_pre_3_5 - -is_python_post_3_7 = is_python3 and sys.version_info[1] >= 7 - -# fmt: off -if is_python_pre_3_5: - from collections import namedtuple - Morpheme = namedtuple("Morpheme", "surface lemma tag") -elif is_python_post_3_7: - from dataclasses import dataclass - - @dataclass(frozen=True) - class Morpheme: - surface: str - lemma: str - tag: str -else: - from typing import NamedTuple - - class Morpheme(NamedTuple): - - surface = str("") - lemma = str("") - tag = str("") def try_mecab_import(): try: from natto import MeCab + return MeCab except ImportError: raise ImportError( @@ -46,6 +21,8 @@ def try_mecab_import(): "[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), " "and [natto-py](https://github.com/buruzaemon/natto-py)" ) + + # fmt: on @@ -69,13 +46,13 @@ class KoreanTokenizer(DummyTokenizer): def __call__(self, text): dtokens = list(self.detailed_tokens(text)) - surfaces = [dt.surface for dt in dtokens] + surfaces = [dt["surface"] for dt in dtokens] doc = Doc(self.vocab, words=surfaces, spaces=list(check_spaces(text, surfaces))) for token, dtoken in zip(doc, dtokens): - first_tag, sep, eomi_tags = dtoken.tag.partition("+") + first_tag, sep, eomi_tags = dtoken["tag"].partition("+") token.tag_ = first_tag # stem(어간) or pre-final(선어말 어미) - token.lemma_ = dtoken.lemma - doc.user_data["full_tags"] = [dt.tag for dt in dtokens] + token.lemma_ = dtoken["lemma"] + doc.user_data["full_tags"] = [dt["tag"] for dt in dtokens] return doc def detailed_tokens(self, text): @@ -91,7 +68,7 @@ class KoreanTokenizer(DummyTokenizer): lemma, _, remainder = expr.partition("/") if lemma == "*": lemma = surface - yield Morpheme(surface, lemma, tag) + yield {"surface": surface, "lemma": lemma, "tag": tag} class KoreanDefaults(Language.Defaults): diff --git a/spacy/lang/lt/tag_map.py b/spacy/lang/lt/tag_map.py index eab231b2c..6ea4f8ae0 100644 --- a/spacy/lang/lt/tag_map.py +++ b/spacy/lang/lt/tag_map.py @@ -1605,7 +1605,7 @@ TAG_MAP = { POS: VERB, 
"Mood": "Imp", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Pos", "VerbForm": "Fin", }, @@ -1613,7 +1613,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Pos", "VerbForm": "Fin", }, @@ -1621,7 +1621,7 @@ TAG_MAP = { POS: VERB, "Mood": "Imp", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Reflex": "Yes", "VerbForm": "Fin", @@ -1630,7 +1630,7 @@ TAG_MAP = { POS: VERB, "Mood": "Imp", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Neg", "VerbForm": "Fin", }, @@ -1638,7 +1638,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Reflex": "Yes", "VerbForm": "Fin", @@ -1647,7 +1647,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Pos", "VerbForm": "Fin", }, @@ -1655,7 +1655,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Reflex": "Yes", "VerbForm": "Fin", @@ -1664,7 +1664,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Neg", "VerbForm": "Fin", }, @@ -1672,7 +1672,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Reflex": "Yes", "VerbForm": "Fin", @@ -1681,7 +1681,7 @@ TAG_MAP = { POS: VERB, "Mood": "Imp", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Pos", "VerbForm": "Fin", }, @@ -1689,7 +1689,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Pos", "VerbForm": "Fin", }, @@ -1697,7 +1697,7 @@ TAG_MAP = { POS: VERB, "Mood": "Imp", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Reflex": "Yes", "VerbForm": "Fin", @@ -1706,7 +1706,7 @@ TAG_MAP = { POS: VERB, "Mood": "Imp", "Number": "Plur", - "Person": "2", + "Person": "two", 
"Polarity": "Neg", "VerbForm": "Fin", }, @@ -1714,7 +1714,7 @@ TAG_MAP = { POS: VERB, "Mood": "Imp", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Neg", "Reflex": "Yes", "VerbForm": "Fin", @@ -1723,7 +1723,7 @@ TAG_MAP = { POS: VERB, "Mood": "Imp", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Pos", "VerbForm": "Fin", }, @@ -1731,7 +1731,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Pos", "VerbForm": "Fin", }, @@ -1739,7 +1739,7 @@ TAG_MAP = { POS: VERB, "Mood": "Imp", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Reflex": "Yes", "VerbForm": "Fin", @@ -1748,7 +1748,7 @@ TAG_MAP = { POS: VERB, "Mood": "Imp", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Neg", "VerbForm": "Fin", }, @@ -1756,21 +1756,21 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Neg", "VerbForm": "Fin", }, "Vgm-3---n--ns-": { POS: VERB, "Mood": "Cnd", - "Person": "3", + "Person": "three", "Polarity": "Pos", "VerbForm": "Fin", }, "Vgm-3---n--ys-": { POS: VERB, "Mood": "Cnd", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "VerbForm": "Fin", @@ -1778,14 +1778,14 @@ TAG_MAP = { "Vgm-3---y--ns-": { POS: VERB, "Mood": "Cnd", - "Person": "3", + "Person": "three", "Polarity": "Neg", "VerbForm": "Fin", }, "Vgm-3---y--ys-": { POS: VERB, "Mood": "Cnd", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Reflex": "Yes", "VerbForm": "Fin", @@ -1794,7 +1794,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Pos", "VerbForm": "Fin", }, @@ -1802,7 +1802,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "VerbForm": "Fin", @@ -1811,7 +1811,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Plur", - "Person": "3", + "Person": 
"three", "Polarity": "Neg", "VerbForm": "Fin", }, @@ -1819,7 +1819,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "VerbForm": "Fin", }, @@ -1827,7 +1827,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "VerbForm": "Fin", @@ -1836,7 +1836,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Neg", "VerbForm": "Fin", }, @@ -1844,7 +1844,7 @@ TAG_MAP = { POS: VERB, "Mood": "Cnd", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Reflex": "Yes", "VerbForm": "Fin", @@ -1853,7 +1853,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -1862,7 +1862,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Past", @@ -1872,7 +1872,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Tense": "Past", "VerbForm": "Fin", @@ -1881,7 +1881,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Reflex": "Yes", "Tense": "Past", @@ -1891,7 +1891,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -1900,7 +1900,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Past", @@ -1910,7 +1910,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Tense": "Past", "VerbForm": "Fin", @@ -1919,7 +1919,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", 
"Polarity": "Neg", "Reflex": "Yes", "Tense": "Past", @@ -1929,7 +1929,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -1938,7 +1938,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Past", @@ -1948,7 +1948,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Neg", "Tense": "Past", "VerbForm": "Fin", @@ -1957,7 +1957,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -1966,7 +1966,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Neg", "Tense": "Past", "VerbForm": "Fin", @@ -1974,7 +1974,7 @@ TAG_MAP = { "Vgma3---n--ni-": { POS: VERB, "Mood": "Ind", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -1982,7 +1982,7 @@ TAG_MAP = { "Vgma3---n--yi-": { POS: VERB, "Mood": "Ind", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Past", @@ -1991,7 +1991,7 @@ TAG_MAP = { "Vgma3---y--ni-": { POS: VERB, "Mood": "Ind", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Tense": "Past", "VerbForm": "Fin", @@ -1999,7 +1999,7 @@ TAG_MAP = { "Vgma3--y--ni-": { POS: VERB, "Case": "Nom", - "Person": "3", + "Person": "three", "Tense": "Past", "VerbForm": "Fin", }, @@ -2007,7 +2007,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -2016,7 +2016,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Past", @@ -2026,7 +2026,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": 
"3", + "Person": "three", "Polarity": "Neg", "Tense": "Past", "VerbForm": "Fin", @@ -2035,7 +2035,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Reflex": "Yes", "Tense": "Past", @@ -2045,7 +2045,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -2054,7 +2054,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Past", @@ -2064,7 +2064,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Past", @@ -2074,7 +2074,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Tense": "Past", "VerbForm": "Fin", @@ -2083,7 +2083,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Reflex": "Yes", "Tense": "Past", @@ -2093,7 +2093,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Tense": "Fut", "VerbForm": "Fin", @@ -2102,7 +2102,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Fut", @@ -2112,7 +2112,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Tense": "Fut", "VerbForm": "Fin", @@ -2121,7 +2121,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Tense": "Fut", "VerbForm": "Fin", @@ -2130,7 +2130,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Fut", @@ -2140,7 +2140,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", 
"Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Tense": "Fut", "VerbForm": "Fin", @@ -2149,7 +2149,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Tense": "Fut", "VerbForm": "Fin", @@ -2158,7 +2158,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Fut", @@ -2168,7 +2168,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Tense": "Fut", "VerbForm": "Fin", @@ -2177,7 +2177,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Fut", @@ -2187,7 +2187,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Neg", "Tense": "Fut", "VerbForm": "Fin", @@ -2196,7 +2196,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Neg", "Reflex": "Yes", "Tense": "Fut", @@ -2205,7 +2205,7 @@ TAG_MAP = { "Vgmf3---n--ni-": { POS: VERB, "Mood": "Ind", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Fut", "VerbForm": "Fin", @@ -2213,7 +2213,7 @@ TAG_MAP = { "Vgmf3---y--ni-": { POS: VERB, "Mood": "Ind", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Tense": "Fut", "VerbForm": "Fin", @@ -2222,7 +2222,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Fut", "VerbForm": "Fin", @@ -2231,7 +2231,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Fut", @@ -2241,7 +2241,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Tense": "Fut", "VerbForm": "Fin", @@ -2250,7 +2250,7 @@ TAG_MAP = { 
POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Fut", "VerbForm": "Fin", @@ -2259,7 +2259,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Fut", @@ -2269,7 +2269,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Tense": "Fut", "VerbForm": "Fin", @@ -2278,7 +2278,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Reflex": "Yes", "Tense": "Fut", @@ -2288,7 +2288,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Tense": "Pres", "VerbForm": "Fin", @@ -2297,7 +2297,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Pres", @@ -2307,7 +2307,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Tense": "Pres", "VerbForm": "Fin", @@ -2316,7 +2316,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Reflex": "Yes", "Tense": "Pres", @@ -2326,7 +2326,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Tense": "Pres", "VerbForm": "Fin", @@ -2335,7 +2335,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Tense": "Pres", "VerbForm": "Fin", @@ -2344,7 +2344,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Pres", @@ -2354,7 +2354,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Tense": "Pres", "VerbForm": "Fin", @@ -2363,7 
+2363,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Reflex": "Yes", "Tense": "Pres", @@ -2373,7 +2373,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Tense": "Pres", "VerbForm": "Fin", @@ -2382,7 +2382,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Pres", @@ -2392,7 +2392,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Neg", "Tense": "Pres", "VerbForm": "Fin", @@ -2401,7 +2401,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "2", + "Person": "two", "Polarity": "Neg", "Reflex": "Yes", "Tense": "Pres", @@ -2411,7 +2411,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Tense": "Pres", "VerbForm": "Fin", @@ -2420,7 +2420,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Pres", @@ -2430,7 +2430,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Neg", "Tense": "Pres", "VerbForm": "Fin", @@ -2438,7 +2438,7 @@ TAG_MAP = { "Vgmp3---n--ni-": { POS: VERB, "Mood": "Ind", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Pres", "VerbForm": "Fin", @@ -2446,7 +2446,7 @@ TAG_MAP = { "Vgmp3---n--yi-": { POS: VERB, "Mood": "Ind", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Pres", @@ -2455,7 +2455,7 @@ TAG_MAP = { "Vgmp3---y--ni-": { POS: VERB, "Mood": "Ind", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Tense": "Pres", "VerbForm": "Fin", @@ -2463,7 +2463,7 @@ TAG_MAP = { "Vgmp3---y--yi-": { POS: VERB, "Mood": "Ind", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Reflex": 
"Yes", "Tense": "Pres", @@ -2473,7 +2473,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Pres", "VerbForm": "Fin", @@ -2482,7 +2482,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Pres", @@ -2492,7 +2492,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Tense": "Pres", "VerbForm": "Fin", @@ -2501,7 +2501,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Reflex": "Yes", "Tense": "Pres", @@ -2511,7 +2511,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Pres", "VerbForm": "Fin", @@ -2520,7 +2520,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Pres", "VerbForm": "Fin", @@ -2529,7 +2529,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Pres", "VerbForm": "Fin", @@ -2538,7 +2538,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Pres", @@ -2548,7 +2548,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Tense": "Pres", "VerbForm": "Fin", @@ -2557,7 +2557,7 @@ TAG_MAP = { POS: VERB, "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Reflex": "Yes", "Tense": "Pres", @@ -2568,7 +2568,7 @@ TAG_MAP = { "Aspect": "Hab", "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -2578,7 +2578,7 @@ TAG_MAP = { "Aspect": "Hab", "Mood": "Ind", "Number": "Sing", - "Person": 
"1", + "Person": "one", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Past", @@ -2589,7 +2589,7 @@ TAG_MAP = { "Aspect": "Hab", "Mood": "Ind", "Number": "Sing", - "Person": "1", + "Person": "one", "Polarity": "Neg", "Tense": "Past", "VerbForm": "Fin", @@ -2599,7 +2599,7 @@ TAG_MAP = { "Aspect": "Hab", "Mood": "Ind", "Number": "Sing", - "Person": "2", + "Person": "two", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -2608,7 +2608,7 @@ TAG_MAP = { POS: VERB, "Aspect": "Hab", "Mood": "Ind", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -2618,7 +2618,7 @@ TAG_MAP = { "Aspect": "Hab", "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -2628,7 +2628,7 @@ TAG_MAP = { "Aspect": "Hab", "Mood": "Ind", "Number": "Plur", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Past", @@ -2639,7 +2639,7 @@ TAG_MAP = { "Aspect": "Hab", "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", @@ -2649,7 +2649,7 @@ TAG_MAP = { "Aspect": "Hab", "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Reflex": "Yes", "Tense": "Past", @@ -2660,7 +2660,7 @@ TAG_MAP = { "Aspect": "Hab", "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Neg", "Tense": "Past", "VerbForm": "Fin", @@ -2670,7 +2670,7 @@ TAG_MAP = { "Aspect": "Perf", "Mood": "Ind", "Number": "Sing", - "Person": "3", + "Person": "three", "Polarity": "Pos", "Tense": "Past", "VerbForm": "Fin", diff --git a/spacy/lang/nl/lemmatizer/__init__.py b/spacy/lang/nl/lemmatizer/__init__.py index 1e5d9aa1f..ee4eaabb3 100644 --- a/spacy/lang/nl/lemmatizer/__init__.py +++ b/spacy/lang/nl/lemmatizer/__init__.py @@ -73,7 +73,7 @@ class DutchLemmatizer(object): return [lemma[0]] except KeyError: pass - # string corresponds to key in lookup table 
+ # string corresponds to key in lookup table lookup_table = self.lookup_table looked_up_lemma = lookup_table.get(string) if looked_up_lemma and looked_up_lemma in lemma_index: @@ -103,9 +103,12 @@ class DutchLemmatizer(object): # Overrides parent method so that a lowercased version of the string is # used to search the lookup table. This is necessary because our lookup # table consists entirely of lowercase keys. - def lookup(self, string): + def lookup(self, string, orth=None): string = string.lower() - return self.lookup_table.get(string, string) + if orth is not None: + return self.lookup_table.get(orth, string) + else: + return self.lookup_table.get(string, string) def noun(self, string, morphology=None): return self(string, "noun", morphology) diff --git a/spacy/lang/ru/lemmatizer.py b/spacy/lang/ru/lemmatizer.py index 300d61c52..70120566b 100644 --- a/spacy/lang/ru/lemmatizer.py +++ b/spacy/lang/ru/lemmatizer.py @@ -73,7 +73,7 @@ class RussianLemmatizer(Lemmatizer): if ( feature in morphology and feature in analysis_morph - and morphology[feature] != analysis_morph[feature] + and morphology[feature].lower() != analysis_morph[feature].lower() ): break else: @@ -115,7 +115,7 @@ class RussianLemmatizer(Lemmatizer): def pron(self, string, morphology=None): return self(string, "pron", morphology) - def lookup(self, string): + def lookup(self, string, orth=None): analyses = self._morph.parse(string) if len(analyses) == 1: return analyses[0].normal_form diff --git a/spacy/lang/uk/lemmatizer.py b/spacy/lang/uk/lemmatizer.py index ab56c824d..d40bdf2df 100644 --- a/spacy/lang/uk/lemmatizer.py +++ b/spacy/lang/uk/lemmatizer.py @@ -70,7 +70,7 @@ class UkrainianLemmatizer(Lemmatizer): if ( feature in morphology and feature in analysis_morph - and morphology[feature] != analysis_morph[feature] + and morphology[feature].lower() != analysis_morph[feature].lower() ): break else: @@ -112,7 +112,7 @@ class UkrainianLemmatizer(Lemmatizer): def pron(self, string, 
morphology=None): return self(string, "pron", morphology) - def lookup(self, string): + def lookup(self, string, orth=None): analyses = self._morph.parse(string) if len(analyses) == 1: return analyses[0].normal_form diff --git a/spacy/language.py b/spacy/language.py index 09dd22cf2..a28f2a84e 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -20,6 +20,7 @@ from .pipeline import Tensorizer, EntityRecognizer, EntityLinker from .pipeline import SimilarityHook, TextCategorizer, Sentencizer from .pipeline import merge_noun_chunks, merge_entities, merge_subtokens from .pipeline import EntityRuler +from .pipeline import Morphologizer from .compat import izip, basestring_ from .gold import GoldParse from .scorer import Scorer @@ -38,6 +39,8 @@ from . import about class BaseDefaults(object): @classmethod def create_lemmatizer(cls, nlp=None, lookups=None): + if lookups is None: + lookups = cls.create_lookups(nlp=nlp) rules, index, exc, lookup = util.get_lemma_tables(lookups) return Lemmatizer(index, exc, rules, lookup) @@ -108,6 +111,8 @@ class BaseDefaults(object): syntax_iterators = {} resources = {} writing_system = {"direction": "ltr", "has_case": True, "has_letters": True} + single_orth_variants = [] + paired_orth_variants = [] class Language(object): @@ -128,6 +133,7 @@ class Language(object): "tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp), "tensorizer": lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg), "tagger": lambda nlp, **cfg: Tagger(nlp.vocab, **cfg), + "morphologizer": lambda nlp, **cfg: Morphologizer(nlp.vocab, **cfg), "parser": lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg), "ner": lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg), "entity_linker": lambda nlp, **cfg: EntityLinker(nlp.vocab, **cfg), @@ -251,7 +257,8 @@ class Language(object): @property def pipe_labels(self): - """Get the labels set by the pipeline components, if available. 
+ """Get the labels set by the pipeline components, if available (if + the component exposes a labels property). RETURNS (dict): Labels keyed by component name. """ @@ -442,29 +449,9 @@ class Language(object): def make_doc(self, text): return self.tokenizer(text) - def update(self, docs, golds, drop=0.0, sgd=None, losses=None, component_cfg=None): - """Update the models in the pipeline. - - docs (iterable): A batch of `Doc` objects. - golds (iterable): A batch of `GoldParse` objects. - drop (float): The droput rate. - sgd (callable): An optimizer. - losses (dict): Dictionary to update with the loss, keyed by component. - component_cfg (dict): Config parameters for specific pipeline - components, keyed by component name. - - DOCS: https://spacy.io/api/language#update - """ + def _format_docs_and_golds(self, docs, golds): + """Format golds and docs before updating models.""" expected_keys = ("words", "tags", "heads", "deps", "entities", "cats", "links") - if len(docs) != len(golds): - raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds))) - if len(docs) == 0: - return - if sgd is None: - if self._optimizer is None: - self._optimizer = create_default_optimizer(Model.ops) - sgd = self._optimizer - # Allow dict of args to GoldParse, instead of GoldParse objects. gold_objs = [] doc_objs = [] for doc, gold in zip(docs, golds): @@ -478,8 +465,32 @@ class Language(object): gold = GoldParse(doc, **gold) doc_objs.append(doc) gold_objs.append(gold) - golds = gold_objs - docs = doc_objs + + return doc_objs, gold_objs + + def update(self, docs, golds, drop=0.0, sgd=None, losses=None, component_cfg=None): + """Update the models in the pipeline. + + docs (iterable): A batch of `Doc` objects. + golds (iterable): A batch of `GoldParse` objects. + drop (float): The dropout rate. + sgd (callable): An optimizer. + losses (dict): Dictionary to update with the loss, keyed by component. 
+ component_cfg (dict): Config parameters for specific pipeline + components, keyed by component name. + + DOCS: https://spacy.io/api/language#update + """ + if len(docs) != len(golds): + raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds))) + if len(docs) == 0: + return + if sgd is None: + if self._optimizer is None: + self._optimizer = create_default_optimizer(Model.ops) + sgd = self._optimizer + # Allow dict of args to GoldParse, instead of GoldParse objects. + docs, golds = self._format_docs_and_golds(docs, golds) grads = {} def get_grads(W, dW, key=None): @@ -583,6 +594,7 @@ class Language(object): # Populate vocab else: for _, annots_brackets in get_gold_tuples(): + _ = annots_brackets.pop() for annots, _ in annots_brackets: for word in annots[1]: _ = self.vocab[word] # noqa: F841 @@ -651,7 +663,7 @@ class Language(object): DOCS: https://spacy.io/api/language#evaluate """ if scorer is None: - scorer = Scorer() + scorer = Scorer(pipeline=self.pipeline) if component_cfg is None: component_cfg = {} docs, golds = zip(*docs_golds) diff --git a/spacy/lemmatizer.py b/spacy/lemmatizer.py index f9e35f44a..26c2227a0 100644 --- a/spacy/lemmatizer.py +++ b/spacy/lemmatizer.py @@ -2,8 +2,7 @@ from __future__ import unicode_literals from collections import OrderedDict -from .symbols import POS, NOUN, VERB, ADJ, PUNCT, PROPN -from .symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos +from .symbols import NOUN, VERB, ADJ, PUNCT, PROPN class Lemmatizer(object): @@ -55,12 +54,8 @@ class Lemmatizer(object): Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely. 
""" - morphology = {} if morphology is None else morphology - others = [ - key - for key in morphology - if key not in (POS, "Number", "POS", "VerbForm", "Tense") - ] + if morphology is None: + morphology = {} if univ_pos == "noun" and morphology.get("Number") == "sing": return True elif univ_pos == "verb" and morphology.get("VerbForm") == "inf": @@ -71,18 +66,17 @@ class Lemmatizer(object): morphology.get("VerbForm") == "fin" and morphology.get("Tense") == "pres" and morphology.get("Number") is None - and not others ): return True elif univ_pos == "adj" and morphology.get("Degree") == "pos": return True - elif VerbForm_inf in morphology: + elif morphology.get("VerbForm") == "inf": return True - elif VerbForm_none in morphology: + elif morphology.get("VerbForm") == "none": return True - elif Number_sing in morphology: + elif morphology.get("Number") == "sing": return True - elif Degree_pos in morphology: + elif morphology.get("Degree") == "pos": return True else: return False @@ -99,9 +93,19 @@ class Lemmatizer(object): def punct(self, string, morphology=None): return self(string, "punct", morphology) - def lookup(self, string): - if string in self.lookup_table: - return self.lookup_table[string] + def lookup(self, string, orth=None): + """Look up a lemma in the table, if available. If no lemma is found, + the original string is returned. + + string (unicode): The original string. + orth (int): Optional hash of the string to look up. If not set, the + string will be used and hashed. + RETURNS (unicode): The lemma if the string was found, otherwise the + original string. 
+ """ + key = orth if orth is not None else string + if key in self.lookup_table: + return self.lookup_table[key] return string diff --git a/spacy/lookups.py b/spacy/lookups.py index 801b4d00d..05a60f289 100644 --- a/spacy/lookups.py +++ b/spacy/lookups.py @@ -1,11 +1,13 @@ -# coding: utf8 +# coding: utf-8 from __future__ import unicode_literals import srsly from collections import OrderedDict +from preshed.bloom import BloomFilter from .errors import Errors from .util import SimpleFrozenDict, ensure_path +from .strings import get_string_id class Lookups(object): @@ -14,16 +16,14 @@ class Lookups(object): so they can be accessed before the pipeline components are applied (e.g. in the tokenizer and lemmatizer), as well as within the pipeline components via doc.vocab.lookups. - - Important note: At the moment, this class only performs a very basic - dictionary lookup. We're planning to replace this with a more efficient - implementation. See #3971 for details. """ def __init__(self): """Initialize the Lookups object. RETURNS (Lookups): The newly created object. + + DOCS: https://spacy.io/api/lookups#init """ self._tables = OrderedDict() @@ -32,7 +32,7 @@ class Lookups(object): Lookups.has_table. name (unicode): Name of the table. - RETURNS (bool): Whether a table of that name exists. + RETURNS (bool): Whether a table of that name is in the lookups. """ return self.has_table(name) @@ -51,11 +51,12 @@ class Lookups(object): name (unicode): Unique name of table. data (dict): Optional data to add to the table. RETURNS (Table): The newly added table. + + DOCS: https://spacy.io/api/lookups#add_table """ if name in self.tables: raise ValueError(Errors.E158.format(name=name)) - table = Table(name=name) - table.update(data) + table = Table(name=name, data=data) self._tables[name] = table return table @@ -64,6 +65,8 @@ class Lookups(object): name (unicode): Name of the table. RETURNS (Table): The table. 
+ + DOCS: https://spacy.io/api/lookups#get_table """ if name not in self._tables: raise KeyError(Errors.E159.format(name=name, tables=self.tables)) @@ -72,8 +75,10 @@ class Lookups(object): def remove_table(self, name): """Remove a table. Raises an error if the table doesn't exist. - name (unicode): The name to remove. + name (unicode): Name of the table to remove. RETURNS (Table): The removed table. + + DOCS: https://spacy.io/api/lookups#remove_table """ if name not in self._tables: raise KeyError(Errors.E159.format(name=name, tables=self.tables)) @@ -84,45 +89,57 @@ class Lookups(object): name (unicode): Name of the table. RETURNS (bool): Whether a table of that name exists. + + DOCS: https://spacy.io/api/lookups#has_table """ return name in self._tables - def to_bytes(self, exclude=tuple(), **kwargs): + def to_bytes(self, **kwargs): """Serialize the lookups to a bytestring. - exclude (list): String names of serialization fields to exclude. RETURNS (bytes): The serialized Lookups. + + DOCS: https://spacy.io/api/lookups#to_bytes """ return srsly.msgpack_dumps(self._tables) - def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): + def from_bytes(self, bytes_data, **kwargs): """Load the lookups from a bytestring. - exclude (list): String names of serialization fields to exclude. - RETURNS (bytes): The loaded Lookups. + bytes_data (bytes): The data to load. + RETURNS (Lookups): The loaded Lookups. + + DOCS: https://spacy.io/api/lookups#from_bytes """ - self._tables = OrderedDict() - msg = srsly.msgpack_loads(bytes_data) - for key, value in msg.items(): - self._tables[key] = Table.from_dict(value) + for key, value in srsly.msgpack_loads(bytes_data).items(): + self._tables[key] = Table(key) + self._tables[key].update(value) return self def to_disk(self, path, **kwargs): - """Save the lookups to a directory as lookups.bin. + """Save the lookups to a directory as lookups.bin. Expects a path to a + directory, which will be created if it doesn't exist. 
path (unicode / Path): The file path. + + DOCS: https://spacy.io/api/lookups#to_disk """ if len(self._tables): path = ensure_path(path) + if not path.exists(): + path.mkdir() filepath = path / "lookups.bin" with filepath.open("wb") as file_: file_.write(self.to_bytes()) def from_disk(self, path, **kwargs): - """Load lookups from a directory containing a lookups.bin. + """Load lookups from a directory containing a lookups.bin. Will skip + loading if the file doesn't exist. - path (unicode / Path): The file path. + path (unicode / Path): The directory path. RETURNS (Lookups): The loaded lookups. + + DOCS: https://spacy.io/api/lookups#from_disk """ path = ensure_path(path) filepath = path / "lookups.bin" @@ -136,22 +153,118 @@ class Lookups(object): class Table(OrderedDict): """A table in the lookups. Subclass of builtin dict that implements a slightly more consistent and unified API. + + Includes a Bloom filter to speed up missed lookups. """ + @classmethod def from_dict(cls, data, name=None): + """Initialize a new table from a dict. + + data (dict): The dictionary. + name (unicode): Optional table name for reference. + RETURNS (Table): The newly created object. + + DOCS: https://spacy.io/api/lookups#table.from_dict + """ self = cls(name=name) self.update(data) return self - def __init__(self, name=None): + def __init__(self, name=None, data=None): """Initialize a new table. name (unicode): Optional table name for reference. + data (dict): Initial data, used to hint Bloom Filter. RETURNS (Table): The newly created object. + + DOCS: https://spacy.io/api/lookups#table.init """ OrderedDict.__init__(self) self.name = name + # Assume a default size of 1M items + self.default_size = 1e6 + size = len(data) if data and len(data) > 0 else self.default_size + self.bloom = BloomFilter.from_error_rate(size) + if data: + self.update(data) + + def __setitem__(self, key, value): + """Set new key/value pair. String keys will be hashed. + + key (unicode / int): The key to set. 
+ value: The value to set. + """ + key = get_string_id(key) + OrderedDict.__setitem__(self, key, value) + self.bloom.add(key) def set(self, key, value): - """Set new key/value pair. Same as table[key] = value.""" + """Set new key/value pair. String keys will be hashed. + Same as table[key] = value. + + key (unicode / int): The key to set. + value: The value to set. + """ self[key] = value + + def __getitem__(self, key): + """Get the value for a given key. String keys will be hashed. + + key (unicode / int): The key to get. + RETURNS: The value. + """ + key = get_string_id(key) + return OrderedDict.__getitem__(self, key) + + def get(self, key, default=None): + """Get the value for a given key. String keys will be hashed. + + key (unicode / int): The key to get. + default: The default value to return. + RETURNS: The value. + """ + key = get_string_id(key) + return OrderedDict.get(self, key, default) + + def __contains__(self, key): + """Check whether a key is in the table. String keys will be hashed. + + key (unicode / int): The key to check. + RETURNS (bool): Whether the key is in the table. + """ + key = get_string_id(key) + # This can give a false positive, so we need to check it after + if key not in self.bloom: + return False + return OrderedDict.__contains__(self, key) + + def to_bytes(self): + """Serialize table to a bytestring. + + RETURNS (bytes): The serialized table. + + DOCS: https://spacy.io/api/lookups#table.to_bytes + """ + data = [ + ("name", self.name), + ("dict", dict(self.items())), + ("bloom", self.bloom.to_bytes()), + ] + return srsly.msgpack_dumps(OrderedDict(data)) + + def from_bytes(self, bytes_data): + """Load a table from a bytestring. + + bytes_data (bytes): The data to load. + RETURNS (Table): The loaded table. 
+ + DOCS: https://spacy.io/api/lookups#table.from_bytes + """ + loaded = srsly.msgpack_loads(bytes_data) + data = loaded.get("dict", {}) + self.name = loaded["name"] + self.bloom = BloomFilter().from_bytes(loaded["bloom"]) + self.clear() + self.update(data) + return self diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index c698c8024..950a7b977 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -103,6 +103,8 @@ cdef class Matcher: *patterns (list): List of token descriptions. """ errors = {} + if on_match is not None and not hasattr(on_match, "__call__"): + raise ValueError(Errors.E171.format(arg_type=type(on_match))) for i, pattern in enumerate(patterns): if len(pattern) == 0: raise ValueError(Errors.E012.format(key=key)) @@ -162,18 +164,37 @@ cdef class Matcher: return default return (self._callbacks[key], self._patterns[key]) - def pipe(self, docs, batch_size=1000, n_threads=-1): + def pipe(self, docs, batch_size=1000, n_threads=-1, return_matches=False, + as_tuples=False): """Match a stream of documents, yielding them in turn. docs (iterable): A stream of documents. batch_size (int): Number of documents to accumulate into a working set. + return_matches (bool): Yield the match lists along with the docs, making + results (doc, matches) tuples. + as_tuples (bool): Interpret the input stream as (doc, context) tuples, + and yield (result, context) tuples out. + If both return_matches and as_tuples are True, the output will + be a sequence of ((doc, matches), context) tuples. YIELDS (Doc): Documents, in order. 
""" if n_threads != -1: deprecation_warning(Warnings.W016) - for doc in docs: - self(doc) - yield doc + + if as_tuples: + for doc, context in docs: + matches = self(doc) + if return_matches: + yield ((doc, matches), context) + else: + yield (doc, context) + else: + for doc in docs: + matches = self(doc) + if return_matches: + yield (doc, matches) + else: + yield doc def __call__(self, Doc doc): """Find all token sequences matching the supplied pattern. diff --git a/spacy/matcher/phrasematcher.pxd b/spacy/matcher/phrasematcher.pxd index 3aba1686f..753b2da74 100644 --- a/spacy/matcher/phrasematcher.pxd +++ b/spacy/matcher/phrasematcher.pxd @@ -1,5 +1,27 @@ from libcpp.vector cimport vector -from ..typedefs cimport hash_t +from cymem.cymem cimport Pool +from preshed.maps cimport key_t, MapStruct -ctypedef vector[hash_t] hash_vec +from ..attrs cimport attr_id_t +from ..tokens.doc cimport Doc +from ..vocab cimport Vocab + + +cdef class PhraseMatcher: + cdef Vocab vocab + cdef attr_id_t attr + cdef object _callbacks + cdef object _docs + cdef bint _validate + cdef MapStruct* c_map + cdef Pool mem + cdef key_t _terminal_hash + + cdef void find_matches(self, Doc doc, vector[MatchStruct] *matches) nogil + + +cdef struct MatchStruct: + key_t match_id + int start + int end diff --git a/spacy/matcher/phrasematcher.pyx b/spacy/matcher/phrasematcher.pyx index 9e8801cc1..b6c9e01d2 100644 --- a/spacy/matcher/phrasematcher.pyx +++ b/spacy/matcher/phrasematcher.pyx @@ -2,28 +2,16 @@ # cython: profile=True from __future__ import unicode_literals -from libcpp.vector cimport vector -from cymem.cymem cimport Pool -from murmurhash.mrmr cimport hash64 -from preshed.maps cimport PreshMap +from libc.stdint cimport uintptr_t -from .matcher cimport Matcher -from ..attrs cimport ORTH, POS, TAG, DEP, LEMMA, attr_id_t -from ..vocab cimport Vocab -from ..tokens.doc cimport Doc, get_token_attr -from ..typedefs cimport attr_t, hash_t +from preshed.maps cimport map_init, map_set, map_get, map_clear, 
map_iter + +from ..attrs cimport ORTH, POS, TAG, DEP, LEMMA +from ..structs cimport TokenC +from ..tokens.token cimport Token from ._schemas import TOKEN_PATTERN_SCHEMA from ..errors import Errors, Warnings, deprecation_warning, user_warning -from ..attrs import FLAG61 as U_ENT -from ..attrs import FLAG60 as B2_ENT -from ..attrs import FLAG59 as B3_ENT -from ..attrs import FLAG58 as B4_ENT -from ..attrs import FLAG43 as L2_ENT -from ..attrs import FLAG42 as L3_ENT -from ..attrs import FLAG41 as L4_ENT -from ..attrs import FLAG42 as I3_ENT -from ..attrs import FLAG41 as I4_ENT cdef class PhraseMatcher: @@ -33,18 +21,11 @@ cdef class PhraseMatcher: DOCS: https://spacy.io/api/phrasematcher USAGE: https://spacy.io/usage/rule-based-matching#phrasematcher + + Adapted from FlashText: https://github.com/vi3k6i5/flashtext + MIT License (see `LICENSE`) + Copyright (c) 2017 Vikash Singh (vikash.duliajan@gmail.com) """ - cdef Pool mem - cdef Vocab vocab - cdef Matcher matcher - cdef PreshMap phrase_ids - cdef vector[hash_vec] ent_id_matrix - cdef int max_length - cdef attr_id_t attr - cdef public object _callbacks - cdef public object _patterns - cdef public object _docs - cdef public object _validate def __init__(self, Vocab vocab, max_length=0, attr="ORTH", validate=False): """Initialize the PhraseMatcher. 
@@ -58,11 +39,17 @@ cdef class PhraseMatcher: """ if max_length != 0: deprecation_warning(Warnings.W010) - self.mem = Pool() - self.max_length = max_length self.vocab = vocab - self.matcher = Matcher(self.vocab, validate=False) - if isinstance(attr, long): + self._callbacks = {} + self._docs = {} + self._validate = validate + + self.mem = Pool() + self.c_map = self.mem.alloc(1, sizeof(MapStruct)) + self._terminal_hash = 826361138722620965 + map_init(self.mem, self.c_map, 8) + + if isinstance(attr, (int, long)): self.attr = attr else: attr = attr.upper() @@ -71,28 +58,15 @@ cdef class PhraseMatcher: if attr not in TOKEN_PATTERN_SCHEMA["items"]["properties"]: raise ValueError(Errors.E152.format(attr=attr)) self.attr = self.vocab.strings[attr] - self.phrase_ids = PreshMap() - abstract_patterns = [ - [{U_ENT: True}], - [{B2_ENT: True}, {L2_ENT: True}], - [{B3_ENT: True}, {I3_ENT: True}, {L3_ENT: True}], - [{B4_ENT: True}, {I4_ENT: True}, {I4_ENT: True, "OP": "+"}, {L4_ENT: True}], - ] - self.matcher.add("Candidate", None, *abstract_patterns) - self._callbacks = {} - self._docs = {} - self._validate = validate def __len__(self): - """Get the number of rules added to the matcher. Note that this only - returns the number of rules (identical with the number of IDs), not the - number of individual patterns. + """Get the number of match IDs added to the matcher. RETURNS (int): The number of rules. DOCS: https://spacy.io/api/phrasematcher#len """ - return len(self._docs) + return len(self._callbacks) def __contains__(self, key): """Check whether the matcher contains rules for a match ID. 
@@ -102,13 +76,79 @@ cdef class PhraseMatcher: DOCS: https://spacy.io/api/phrasematcher#contains """ - cdef hash_t ent_id = self.matcher._normalize_key(key) - return ent_id in self._callbacks + return key in self._callbacks def __reduce__(self): - data = (self.vocab, self._docs, self._callbacks) + data = (self.vocab, self._docs, self._callbacks, self.attr) return (unpickle_matcher, data, None, None) + def remove(self, key): + """Remove a rule from the matcher by match ID. A KeyError is raised if + the key does not exist. + + key (unicode): The match ID. + + DOCS: https://spacy.io/api/phrasematcher#remove + """ + if key not in self._docs: + raise KeyError(key) + cdef MapStruct* current_node + cdef MapStruct* terminal_map + cdef MapStruct* node_pointer + cdef void* result + cdef key_t terminal_key + cdef void* value + cdef int c_i = 0 + cdef vector[MapStruct*] path_nodes + cdef vector[key_t] path_keys + cdef key_t key_to_remove + for keyword in self._docs[key]: + current_node = self.c_map + for token in keyword: + result = map_get(current_node, token) + if result: + path_nodes.push_back(current_node) + path_keys.push_back(token) + current_node = result + else: + # if token is not found, break out of the loop + current_node = NULL + break + # remove the tokens from trie node if there are no other + # keywords with them + result = map_get(current_node, self._terminal_hash) + if current_node != NULL and result: + terminal_map = result + terminal_keys = [] + c_i = 0 + while map_iter(terminal_map, &c_i, &terminal_key, &value): + terminal_keys.append(self.vocab.strings[terminal_key]) + # if this is the only remaining key, remove unnecessary paths + if terminal_keys == [key]: + while not path_nodes.empty(): + node_pointer = path_nodes.back() + path_nodes.pop_back() + key_to_remove = path_keys.back() + path_keys.pop_back() + result = map_get(node_pointer, key_to_remove) + if node_pointer.filled == 1: + map_clear(node_pointer, key_to_remove) + self.mem.free(result) + else: + 
# more than one key means more than one path: + # delete the path that is no longer required and keep the others + map_clear(node_pointer, key_to_remove) + self.mem.free(result) + break + # otherwise simply remove the key + else: + result = map_get(current_node, self._terminal_hash) + if result: + map_clear(result, self.vocab.strings[key]) + + del self._callbacks[key] + del self._docs[key] + def add(self, key, on_match, *docs): """Add a match-rule to the phrase-matcher. A match-rule consists of: an ID key, an on_match callback, and one or more patterns. @@ -119,53 +159,53 @@ cdef class PhraseMatcher: DOCS: https://spacy.io/api/phrasematcher#add """ - cdef Doc doc - cdef hash_t ent_id = self.matcher._normalize_key(key) - self._callbacks[ent_id] = on_match - self._docs[ent_id] = docs - cdef int length - cdef int i - cdef hash_t phrase_hash - cdef Pool mem = Pool() + + _ = self.vocab[key] + self._callbacks[key] = on_match + self._docs.setdefault(key, set()) + + cdef MapStruct* current_node + cdef MapStruct* internal_node + cdef void* result + for doc in docs: - length = doc.length - if length == 0: + if len(doc) == 0: continue - if self.attr in (POS, TAG, LEMMA) and not doc.is_tagged: - raise ValueError(Errors.E155.format()) - if self.attr == DEP and not doc.is_parsed: - raise ValueError(Errors.E156.format()) - if self._validate and (doc.is_tagged or doc.is_parsed) \ - and self.attr not in (DEP, POS, TAG, LEMMA): - string_attr = self.vocab.strings[self.attr] - user_warning(Warnings.W012.format(key=key, attr=string_attr)) - tags = get_biluo(length) - phrase_key = mem.alloc(length, sizeof(attr_t)) - for i, tag in enumerate(tags): - attr_value = self.get_lex_value(doc, i) - lexeme = self.vocab[attr_value] - lexeme.set_flag(tag, True) - phrase_key[i] = lexeme.orth - phrase_hash = hash64(phrase_key, length * sizeof(attr_t), 0) - - if phrase_hash in self.phrase_ids: - phrase_index = self.phrase_ids[phrase_hash] - ent_id_list = self.ent_id_matrix[phrase_index] - 
ent_id_list.append(ent_id) - 
self.ent_id_matrix[phrase_index] = ent_id_list - + if isinstance(doc, Doc): + if self.attr in (POS, TAG, LEMMA) and not doc.is_tagged: + raise ValueError(Errors.E155.format()) + if self.attr == DEP and not doc.is_parsed: + raise ValueError(Errors.E156.format()) + if self._validate and (doc.is_tagged or doc.is_parsed) \ + and self.attr not in (DEP, POS, TAG, LEMMA): + string_attr = self.vocab.strings[self.attr] + user_warning(Warnings.W012.format(key=key, attr=string_attr)) + keyword = self._convert_to_array(doc) else: - ent_id_list = hash_vec(1) - ent_id_list[0] = ent_id - new_index = self.ent_id_matrix.size() - if new_index == 0: - # PreshMaps can not contain 0 as value, so storing a dummy at 0 - self.ent_id_matrix.push_back(hash_vec(0)) - new_index = 1 - self.ent_id_matrix.push_back(ent_id_list) - self.phrase_ids.set(phrase_hash, new_index) + keyword = doc + self._docs[key].add(tuple(keyword)) - def __call__(self, Doc doc): + current_node = self.c_map + for token in keyword: + if token == self._terminal_hash: + user_warning(Warnings.W021) + break + result = map_get(current_node, token) + if not result: + internal_node = self.mem.alloc(1, sizeof(MapStruct)) + map_init(self.mem, internal_node, 8) + map_set(self.mem, current_node, token, internal_node) + result = internal_node + current_node = result + result = map_get(current_node, self._terminal_hash) + if not result: + internal_node = self.mem.alloc(1, sizeof(MapStruct)) + map_init(self.mem, internal_node, 8) + map_set(self.mem, current_node, self._terminal_hash, internal_node) + result = internal_node + map_set(self.mem, result, self.vocab.strings[key], NULL) + + def __call__(self, doc): """Find all sequences matching the supplied patterns on the `Doc`. doc (Doc): The document to match over. 
@@ -176,25 +216,63 @@ cdef class PhraseMatcher: DOCS: https://spacy.io/api/phrasematcher#call """ matches = [] - if self.attr == ORTH: - match_doc = doc - else: - # If we're not matching on the ORTH, match_doc will be a Doc whose - # token.orth values are the attribute values we're matching on, - # e.g. Doc(nlp.vocab, words=[token.pos_ for token in doc]) - words = [self.get_lex_value(doc, i) for i in range(len(doc))] - match_doc = Doc(self.vocab, words=words) - for _, start, end in self.matcher(match_doc): - ent_ids = self.accept_match(match_doc, start, end) - if ent_ids is not None: - for ent_id in ent_ids: - matches.append((ent_id, start, end)) + if doc is None or len(doc) == 0: + # if doc is empty or None just return empty list + return matches + + cdef vector[MatchStruct] c_matches + self.find_matches(doc, &c_matches) + for i in range(c_matches.size()): + matches.append((c_matches[i].match_id, c_matches[i].start, c_matches[i].end)) for i, (ent_id, start, end) in enumerate(matches): on_match = self._callbacks.get(ent_id) if on_match is not None: on_match(self, doc, i, matches) return matches + cdef void find_matches(self, Doc doc, vector[MatchStruct] *matches) nogil: + cdef MapStruct* current_node = self.c_map + cdef int start = 0 + cdef int idx = 0 + cdef int idy = 0 + cdef key_t key + cdef void* value + cdef int i = 0 + cdef MatchStruct ms + cdef void* result + while idx < doc.length: + start = idx + token = Token.get_struct_attr(&doc.c[idx], self.attr) + # look for sequences from this position + result = map_get(current_node, token) + if result: + current_node = result + idy = idx + 1 + while idy < doc.length: + result = map_get(current_node, self._terminal_hash) + if result: + i = 0 + while map_iter(result, &i, &key, &value): + ms = make_matchstruct(key, start, idy) + matches.push_back(ms) + inner_token = Token.get_struct_attr(&doc.c[idy], self.attr) + result = map_get(current_node, inner_token) + if result: + current_node = result + idy += 1 + else: + break 
+ else: + # end of doc reached + result = map_get(current_node, self._terminal_hash) + if result: + i = 0 + while map_iter(result, &i, &key, &value): + ms = make_matchstruct(key, start, idy) + matches.push_back(ms) + current_node = self.c_map + idx += 1 + def pipe(self, stream, batch_size=1000, n_threads=-1, return_matches=False, as_tuples=False): """Match a stream of documents, yielding them in turn. @@ -228,53 +306,21 @@ cdef class PhraseMatcher: else: yield doc - def accept_match(self, Doc doc, int start, int end): - cdef int i, j - cdef Pool mem = Pool() - phrase_key = mem.alloc(end-start, sizeof(attr_t)) - for i, j in enumerate(range(start, end)): - phrase_key[i] = doc.c[j].lex.orth - cdef hash_t key = hash64(phrase_key, (end-start) * sizeof(attr_t), 0) - - ent_index = self.phrase_ids.get(key) - if ent_index == 0: - return None - return self.ent_id_matrix[ent_index] - - def get_lex_value(self, Doc doc, int i): - if self.attr == ORTH: - # Return the regular orth value of the lexeme - return doc.c[i].lex.orth - # Get the attribute value instead, e.g. token.pos - attr_value = get_token_attr(&doc.c[i], self.attr) - if attr_value in (0, 1): - # Value is boolean, convert to string - string_attr_value = str(attr_value) - else: - string_attr_value = self.vocab.strings[attr_value] - string_attr_name = self.vocab.strings[self.attr] - # Concatenate the attr name and value to not pollute lexeme space - # e.g. 
'POS-VERB' instead of just 'VERB', which could otherwise - # create false positive matches - return "matcher:{}-{}".format(string_attr_name, string_attr_value) + def _convert_to_array(self, Doc doc): + return [Token.get_struct_attr(&doc.c[i], self.attr) for i in range(len(doc))] -def get_biluo(length): - if length == 0: - raise ValueError(Errors.E127) - elif length == 1: - return [U_ENT] - elif length == 2: - return [B2_ENT, L2_ENT] - elif length == 3: - return [B3_ENT, I3_ENT, L3_ENT] - else: - return [B4_ENT, I4_ENT] + [I4_ENT] * (length-3) + [L4_ENT] - - -def unpickle_matcher(vocab, docs, callbacks): - matcher = PhraseMatcher(vocab) +def unpickle_matcher(vocab, docs, callbacks, attr): + matcher = PhraseMatcher(vocab, attr=attr) for key, specs in docs.items(): callback = callbacks.get(key, None) matcher.add(key, callback, *specs) return matcher + + +cdef MatchStruct make_matchstruct(key_t match_id, int start, int end) nogil: + cdef MatchStruct ms + ms.match_id = match_id + ms.start = start + ms.end = end + return ms diff --git a/spacy/morphology.pxd b/spacy/morphology.pxd index d0110b300..1a3cedf97 100644 --- a/spacy/morphology.pxd +++ b/spacy/morphology.pxd @@ -1,301 +1,41 @@ from cymem.cymem cimport Pool -from preshed.maps cimport PreshMapArray +from preshed.maps cimport PreshMap, PreshMapArray from libc.stdint cimport uint64_t +from murmurhash cimport mrmr -from .structs cimport TokenC +from .structs cimport TokenC, MorphAnalysisC from .strings cimport StringStore -from .typedefs cimport attr_t, flags_t +from .typedefs cimport hash_t, attr_t, flags_t from .parts_of_speech cimport univ_pos_t from . 
cimport symbols - -cdef struct RichTagC: - uint64_t morph - int id - univ_pos_t pos - attr_t name - - -cdef struct MorphAnalysisC: - RichTagC tag - attr_t lemma - - cdef class Morphology: cdef readonly Pool mem cdef readonly StringStore strings + cdef PreshMap tags # Keyed by hash, value is pointer to tag + cdef public object lemmatizer cdef readonly object tag_map - cdef public object n_tags - cdef public object reverse_index - cdef public object tag_names - cdef public object exc - - cdef RichTagC* rich_tags - cdef PreshMapArray _cache + cdef readonly object tag_names + cdef readonly object reverse_index + cdef readonly object exc + cdef readonly object _feat_map + cdef readonly PreshMapArray _cache + cdef readonly int n_tags + cpdef update(self, hash_t morph, features) + cdef hash_t insert(self, MorphAnalysisC tag) except 0 + cdef int assign_untagged(self, TokenC* token) except -1 - cdef int assign_tag(self, TokenC* token, tag) except -1 - cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1 - cdef int assign_feature(self, uint64_t* morph, univ_morph_t feat_id, bint value) except -1 + cdef int _assign_tag_from_exceptions(self, TokenC* token, int tag_id) except -1 -cdef enum univ_morph_t: - NIL = 0 - Animacy_anim = symbols.Animacy_anim - Animacy_inan - Animacy_hum - Animacy_nhum - Aspect_freq - Aspect_imp - Aspect_mod - Aspect_none - Aspect_perf - Case_abe - Case_abl - Case_abs - Case_acc - Case_ade - Case_all - Case_cau - Case_com - Case_dat - Case_del - Case_dis - Case_ela - Case_ess - Case_gen - Case_ill - Case_ine - Case_ins - Case_loc - Case_lat - Case_nom - Case_par - Case_sub - Case_sup - Case_tem - Case_ter - Case_tra - Case_voc - Definite_two - Definite_def - Definite_red - Definite_cons # U20 - Definite_ind - Degree_cmp - Degree_comp - Degree_none - Degree_pos - Degree_sup - Degree_abs - Degree_com - Degree_dim # du - Gender_com - Gender_fem - Gender_masc - Gender_neut - Mood_cnd - Mood_imp - Mood_ind - Mood_n - Mood_pot - Mood_sub - 
Mood_opt - Negative_neg - Negative_pos - Negative_yes - Polarity_neg # U20 - Polarity_pos # U20 - Number_com - Number_dual - Number_none - Number_plur - Number_sing - Number_ptan # bg - Number_count # bg - NumType_card - NumType_dist - NumType_frac - NumType_gen - NumType_mult - NumType_none - NumType_ord - NumType_sets - Person_one - Person_two - Person_three - Person_none - Poss_yes - PronType_advPart - PronType_art - PronType_default - PronType_dem - PronType_ind - PronType_int - PronType_neg - PronType_prs - PronType_rcp - PronType_rel - PronType_tot - PronType_clit - PronType_exc # es, ca, it, fa - Reflex_yes - Tense_fut - Tense_imp - Tense_past - Tense_pres - VerbForm_fin - VerbForm_ger - VerbForm_inf - VerbForm_none - VerbForm_part - VerbForm_partFut - VerbForm_partPast - VerbForm_partPres - VerbForm_sup - VerbForm_trans - VerbForm_conv # U20 - VerbForm_gdv # la - Voice_act - Voice_cau - Voice_pass - Voice_mid # gkc - Voice_int # hb - Abbr_yes # cz, fi, sl, U - AdpType_prep # cz, U - AdpType_post # U - AdpType_voc # cz - AdpType_comprep # cz - AdpType_circ # U - AdvType_man - AdvType_loc - AdvType_tim - AdvType_deg - AdvType_cau - AdvType_mod - AdvType_sta - AdvType_ex - AdvType_adadj - ConjType_oper # cz, U - ConjType_comp # cz, U - Connegative_yes # fi - Derivation_minen # fi - Derivation_sti # fi - Derivation_inen # fi - Derivation_lainen # fi - Derivation_ja # fi - Derivation_ton # fi - Derivation_vs # fi - Derivation_ttain # fi - Derivation_ttaa # fi - Echo_rdp # U - Echo_ech # U - Foreign_foreign # cz, fi, U - Foreign_fscript # cz, fi, U - Foreign_tscript # cz, U - Foreign_yes # sl - Gender_dat_masc # bq, U - Gender_dat_fem # bq, U - Gender_erg_masc # bq - Gender_erg_fem # bq - Gender_psor_masc # cz, sl, U - Gender_psor_fem # cz, sl, U - Gender_psor_neut # sl - Hyph_yes # cz, U - InfForm_one # fi - InfForm_two # fi - InfForm_three # fi - NameType_geo # U, cz - NameType_prs # U, cz - NameType_giv # U, cz - NameType_sur # U, cz - NameType_nat # U, cz - 
NameType_com # U, cz - NameType_pro # U, cz - NameType_oth # U, cz - NounType_com # U - NounType_prop # U - NounType_class # U - Number_abs_sing # bq, U - Number_abs_plur # bq, U - Number_dat_sing # bq, U - Number_dat_plur # bq, U - Number_erg_sing # bq, U - Number_erg_plur # bq, U - Number_psee_sing # U - Number_psee_plur # U - Number_psor_sing # cz, fi, sl, U - Number_psor_plur # cz, fi, sl, U - NumForm_digit # cz, sl, U - NumForm_roman # cz, sl, U - NumForm_word # cz, sl, U - NumValue_one # cz, U - NumValue_two # cz, U - NumValue_three # cz, U - PartForm_pres # fi - PartForm_past # fi - PartForm_agt # fi - PartForm_neg # fi - PartType_mod # U - PartType_emp # U - PartType_res # U - PartType_inf # U - PartType_vbp # U - Person_abs_one # bq, U - Person_abs_two # bq, U - Person_abs_three # bq, U - Person_dat_one # bq, U - Person_dat_two # bq, U - Person_dat_three # bq, U - Person_erg_one # bq, U - Person_erg_two # bq, U - Person_erg_three # bq, U - Person_psor_one # fi, U - Person_psor_two # fi, U - Person_psor_three # fi, U - Polite_inf # bq, U - Polite_pol # bq, U - Polite_abs_inf # bq, U - Polite_abs_pol # bq, U - Polite_erg_inf # bq, U - Polite_erg_pol # bq, U - Polite_dat_inf # bq, U - Polite_dat_pol # bq, U - Prefix_yes # U - PrepCase_npr # cz - PrepCase_pre # U - PunctSide_ini # U - PunctSide_fin # U - PunctType_peri # U - PunctType_qest # U - PunctType_excl # U - PunctType_quot # U - PunctType_brck # U - PunctType_comm # U - PunctType_colo # U - PunctType_semi # U - PunctType_dash # U - Style_arch # cz, fi, U - Style_rare # cz, fi, U - Style_poet # cz, U - Style_norm # cz, U - Style_coll # cz, U - Style_vrnc # cz, U - Style_sing # cz, U - Style_expr # cz, U - Style_derg # cz, U - Style_vulg # cz, U - Style_yes # fi, U - StyleVariant_styleShort # cz - StyleVariant_styleBound # cz, sl - VerbType_aux # U - VerbType_cop # U - VerbType_mod # U - VerbType_light # U - +cdef int check_feature(const MorphAnalysisC* tag, attr_t feature) nogil +cdef attr_t 
get_field(const MorphAnalysisC* tag, int field) nogil +cdef list list_features(const MorphAnalysisC* tag) +cdef tag_to_json(const MorphAnalysisC* tag) diff --git a/spacy/morphology.pyx b/spacy/morphology.pyx index e9de621c8..c146094a9 100644 --- a/spacy/morphology.pyx +++ b/spacy/morphology.pyx @@ -3,18 +3,83 @@ from __future__ import unicode_literals from libc.string cimport memset +import srsly +from collections import Counter +from .compat import basestring_ +from .strings import get_string_id +from . import symbols from .attrs cimport POS, IS_SPACE from .attrs import LEMMA, intify_attrs from .parts_of_speech cimport SPACE from .parts_of_speech import IDS as POS_IDS from .lexeme cimport Lexeme from .errors import Errors +from .util import ensure_path + + +cdef enum univ_field_t: + Field_POS + Field_Abbr + Field_AdpType + Field_AdvType + Field_Animacy + Field_Aspect + Field_Case + Field_ConjType + Field_Connegative + Field_Definite + Field_Degree + Field_Derivation + Field_Echo + Field_Foreign + Field_Gender + Field_Hyph + Field_InfForm + Field_Mood + Field_NameType + Field_Negative + Field_NounType + Field_Number + Field_NumForm + Field_NumType + Field_NumValue + Field_PartForm + Field_PartType + Field_Person + Field_Polarity + Field_Polite + Field_Poss + Field_Prefix + Field_PrepCase + Field_PronType + Field_PunctSide + Field_PunctType + Field_Reflex + Field_Style + Field_StyleVariant + Field_Tense + Field_Typo + Field_VerbForm + Field_VerbType + Field_Voice def _normalize_props(props): """Transform deprecated string keys to correct names.""" out = {} + props = dict(props) + for key in FIELDS: + if key in props: + value = str(props[key]).lower() + # We don't have support for disjunctive int|rel features, so + # just take the first one :( + if "|" in value: + value = value.split("|")[0] + attr = '%s_%s' % (key, value) + if attr in FEATURES: + props.pop(key) + props[attr] = True for key, value in props.items(): if key == POS: if hasattr(value, 'upper'): @@ -24,17 
+89,67 @@ def _normalize_props(props): out[key] = value elif isinstance(key, int): out[key] = value + elif value is True: + out[key] = value elif key.lower() == 'pos': out[POS] = POS_IDS[value.upper()] - else: + elif key.lower() != 'morph': out[key] = value return out +class MorphologyClassMap(object): + def __init__(self, features): + self.features = tuple(features) + self.fields = [] + self.feat2field = {} + seen_fields = set() + for feature in features: + field = feature.split("_", 1)[0] + if field not in seen_fields: + self.fields.append(field) + seen_fields.add(field) + self.feat2field[feature] = FIELDS[field] + self.id2feat = {get_string_id(name): name for name in features} + self.field2feats = {"POS": []} + self.col2info = [] + self.attr2field = dict(LOWER_FIELDS.items()) + self.feat2offset = {} + self.field2col = {} + self.field2id = dict(FIELDS.items()) + self.fieldid2field = {field_id: field for field, field_id in FIELDS.items()} + for feature in features: + field = self.fields[self.feat2field[feature]] + if field not in self.field2col: + self.field2col[field] = len(self.col2info) + if field != "POS" and field not in self.field2feats: + self.col2info.append((field, 0, "NIL")) + self.field2feats.setdefault(field, ["NIL"]) + offset = len(self.field2feats[field]) + self.field2feats[field].append(feature) + self.col2info.append((field, offset, feature)) + self.feat2offset[feature] = offset + + @property + def field_sizes(self): + return [len(self.field2feats[field]) for field in self.fields] + + def get_field_offset(self, field): + return self.field2col[field] + + cdef class Morphology: + '''Store the possible morphological analyses for a language, and index them + by hash. + + To save space on each token, tokens only know the hash of their morphological + analysis, so queries of morphological attributes are delegated + to this class. 
+ ''' def __init__(self, StringStore string_store, tag_map, lemmatizer, exc=None): self.mem = Pool() self.strings = string_store + self.tags = PreshMap() # Add special space symbol. We prefix with underscore, to make sure it # always sorts to the end. space_attrs = tag_map.get('SP', {POS: SPACE}) @@ -47,127 +162,64 @@ cdef class Morphology: self.lemmatizer = lemmatizer self.n_tags = len(tag_map) self.reverse_index = {} - - self.rich_tags = self.mem.alloc(self.n_tags+1, sizeof(RichTagC)) - for i, (tag_str, attrs) in enumerate(sorted(tag_map.items())): - self.strings.add(tag_str) - self.tag_map[tag_str] = dict(attrs) - attrs = _normalize_props(attrs) - attrs = intify_attrs(attrs, self.strings, _do_deprecated=True) - self.rich_tags[i].id = i - self.rich_tags[i].name = self.strings.add(tag_str) - self.rich_tags[i].morph = 0 - self.rich_tags[i].pos = attrs[POS] - self.reverse_index[self.rich_tags[i].name] = i - # Add a 'null' tag, which we can reference when assign morphology to - # untagged tokens. 
- self.rich_tags[self.n_tags].id = self.n_tags + self._feat_map = MorphologyClassMap(FEATURES) + self._load_from_tag_map(tag_map) self._cache = PreshMapArray(self.n_tags) self.exc = {} if exc is not None: - for (tag_str, orth_str), attrs in exc.items(): - self.add_special_case(tag_str, orth_str, attrs) + for (tag, orth), attrs in exc.items(): + attrs = _normalize_props(attrs) + self.add_special_case( + self.strings.as_string(tag), self.strings.as_string(orth), attrs) + + def _load_from_tag_map(self, tag_map): + for i, (tag_str, attrs) in enumerate(sorted(tag_map.items())): + attrs = _normalize_props(attrs) + self.add({self._feat_map.id2feat[feat] for feat in attrs + if feat in self._feat_map.id2feat}) + self.tag_map[tag_str] = dict(attrs) + self.reverse_index[self.strings.add(tag_str)] = i def __reduce__(self): return (Morphology, (self.strings, self.tag_map, self.lemmatizer, - self.exc), None, None) + self.exc), None, None) - cdef int assign_untagged(self, TokenC* token) except -1: - """Set morphological attributes on a token without a POS tag. Uses - the lemmatizer's lookup() method, which looks up the string in the - table provided by the language data as lemma_lookup (if available). + def add(self, features): + """Insert a morphological analysis in the morphology table, if not already + present. Returns the hash of the new analysis. 
""" - if token.lemma == 0: - orth_str = self.strings[token.lex.orth] - lemma = self.lemmatizer.lookup(orth_str) - token.lemma = self.strings.add(lemma) + for f in features: + if isinstance(f, basestring_): + self.strings.add(f) + string_features = features + features = intify_features(features) + cdef attr_t feature + for feature in features: + if feature != 0 and feature not in self._feat_map.id2feat: + raise ValueError(Errors.E167.format(feat=self.strings[feature], feat_id=feature)) + cdef MorphAnalysisC tag + tag = create_rich_tag(features) + cdef hash_t key = self.insert(tag) + return key - cdef int assign_tag(self, TokenC* token, tag) except -1: - if isinstance(tag, basestring): - tag = self.strings.add(tag) - if tag in self.reverse_index: - tag_id = self.reverse_index[tag] - self.assign_tag_id(token, tag_id) + def get(self, hash_t morph): + tag = self.tags.get(morph) + if tag == NULL: + return [] else: - token.tag = tag + return tag_to_json(tag) - cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1: - if tag_id > self.n_tags: - raise ValueError(Errors.E014.format(tag=tag_id)) - # TODO: It's pretty arbitrary to put this logic here. I guess the - # justification is that this is where the specific word and the tag - # interact. Still, we should have a better way to enforce this rule, or - # figure out why the statistical model fails. 
Related to Issue #220 - if Lexeme.c_check_flag(token.lex, IS_SPACE): - tag_id = self.reverse_index[self.strings.add('_SP')] - rich_tag = self.rich_tags[tag_id] - analysis = self._cache.get(tag_id, token.lex.orth) - if analysis is NULL: - analysis = self.mem.alloc(1, sizeof(MorphAnalysisC)) - tag_str = self.strings[self.rich_tags[tag_id].name] - analysis.tag = rich_tag - analysis.lemma = self.lemmatize(analysis.tag.pos, token.lex.orth, - self.tag_map.get(tag_str, {})) - - self._cache.set(tag_id, token.lex.orth, analysis) - if token.lemma == 0: - token.lemma = analysis.lemma - token.pos = analysis.tag.pos - token.tag = analysis.tag.name - token.morph = analysis.tag.morph - - cdef int assign_feature(self, uint64_t* flags, univ_morph_t flag_id, bint value) except -1: - cdef flags_t one = 1 - if value: - flags[0] |= one << flag_id - else: - flags[0] &= ~(one << flag_id) - - def add_special_case(self, unicode tag_str, unicode orth_str, attrs, - force=False): - """Add a special-case rule to the morphological analyser. Tokens whose - tag and orth match the rule will receive the specified properties. - - tag (unicode): The part-of-speech tag to key the exception. - orth (unicode): The word-form to key the exception. - """ - # TODO: Currently we've assumed that we know the number of tags -- - # RichTagC is an array, and _cache is a PreshMapArray - # This is really bad: it makes the morphology typed to the tagger - # classes, which is all wrong. 
- self.exc[(tag_str, orth_str)] = dict(attrs) - tag = self.strings.add(tag_str) - if tag not in self.reverse_index: - return - tag_id = self.reverse_index[tag] - orth = self.strings.add(orth_str) - cdef RichTagC rich_tag = self.rich_tags[tag_id] - attrs = intify_attrs(attrs, self.strings, _do_deprecated=True) - cached = self._cache.get(tag_id, orth) - if cached is NULL: - cached = self.mem.alloc(1, sizeof(MorphAnalysisC)) - elif force: - memset(cached, 0, sizeof(cached[0])) - else: - raise ValueError(Errors.E015.format(tag=tag_str, orth=orth_str)) - - cached.tag = rich_tag - # TODO: Refactor this to take arbitrary attributes. - for name_id, value_id in attrs.items(): - if name_id == LEMMA: - cached.lemma = value_id - else: - self.assign_feature(&cached.tag.morph, name_id, value_id) - if cached.lemma == 0: - cached.lemma = self.lemmatize(rich_tag.pos, orth, attrs) - self._cache.set(tag_id, orth, cached) - - def load_morph_exceptions(self, dict exc): - # Map (form, pos) to (lemma, rich tag) - for tag_str, entries in exc.items(): - for form_str, attrs in entries.items(): - self.add_special_case(tag_str, form_str, attrs) + cpdef update(self, hash_t morph, features): + """Update a morphological analysis with new feature values.""" + tag = (self.tags.get(morph))[0] + features = intify_features(features) + cdef attr_t feature + for feature in features: + field = FEATURE_FIELDS[FEATURE_NAMES[feature]] + set_feature(&tag, field, feature, 1) + morph = self.insert(tag) + return morph def lemmatize(self, const univ_pos_t univ_pos, attr_t orth, morphology): if orth not in self.strings: @@ -177,269 +229,879 @@ cdef class Morphology: return self.strings.add(py_string.lower()) cdef list lemma_strings cdef unicode lemma_string - lemma_strings = self.lemmatizer(py_string, univ_pos, morphology) + # Normalize features into a dict keyed by the field, to make life easier + # for the lemmatizer. Handles string-to-int conversion too. 
+ string_feats = {} + for key, value in morphology.items(): + if value is True: + name, value = self.strings.as_string(key).split('_', 1) + string_feats[name] = value + else: + string_feats[self.strings.as_string(key)] = self.strings.as_string(value) + lemma_strings = self.lemmatizer(py_string, univ_pos, string_feats) lemma_string = lemma_strings[0] lemma = self.strings.add(lemma_string) return lemma + def add_special_case(self, unicode tag_str, unicode orth_str, attrs, + force=False): + """Add a special-case rule to the morphological analyser. Tokens whose + tag and orth match the rule will receive the specified properties. -IDS = { - "Animacy_anim": Animacy_anim, - "Animacy_inan": Animacy_inan, - "Animacy_hum": Animacy_hum, # U20 - "Animacy_nhum": Animacy_nhum, - "Aspect_freq": Aspect_freq, - "Aspect_imp": Aspect_imp, - "Aspect_mod": Aspect_mod, - "Aspect_none": Aspect_none, - "Aspect_perf": Aspect_perf, - "Case_abe": Case_abe, - "Case_abl": Case_abl, - "Case_abs": Case_abs, - "Case_acc": Case_acc, - "Case_ade": Case_ade, - "Case_all": Case_all, - "Case_cau": Case_cau, - "Case_com": Case_com, - "Case_dat": Case_dat, - "Case_del": Case_del, - "Case_dis": Case_dis, - "Case_ela": Case_ela, - "Case_ess": Case_ess, - "Case_gen": Case_gen, - "Case_ill": Case_ill, - "Case_ine": Case_ine, - "Case_ins": Case_ins, - "Case_loc": Case_loc, - "Case_lat": Case_lat, - "Case_nom": Case_nom, - "Case_par": Case_par, - "Case_sub": Case_sub, - "Case_sup": Case_sup, - "Case_tem": Case_tem, - "Case_ter": Case_ter, - "Case_tra": Case_tra, - "Case_voc": Case_voc, - "Definite_two": Definite_two, - "Definite_def": Definite_def, - "Definite_red": Definite_red, - "Definite_cons": Definite_cons, # U20 - "Definite_ind": Definite_ind, - "Degree_cmp": Degree_cmp, - "Degree_comp": Degree_comp, - "Degree_none": Degree_none, - "Degree_pos": Degree_pos, - "Degree_sup": Degree_sup, - "Degree_abs": Degree_abs, - "Degree_com": Degree_com, - "Degree_dim ": Degree_dim, # du - "Gender_com": Gender_com, - 
"Gender_fem": Gender_fem, - "Gender_masc": Gender_masc, - "Gender_neut": Gender_neut, - "Mood_cnd": Mood_cnd, - "Mood_imp": Mood_imp, - "Mood_ind": Mood_ind, - "Mood_n": Mood_n, - "Mood_pot": Mood_pot, - "Mood_sub": Mood_sub, - "Mood_opt": Mood_opt, - "Negative_neg": Negative_neg, - "Negative_pos": Negative_pos, - "Negative_yes": Negative_yes, - "Polarity_neg": Polarity_neg, # U20 - "Polarity_pos": Polarity_pos, # U20 - "Number_com": Number_com, - "Number_dual": Number_dual, - "Number_none": Number_none, - "Number_plur": Number_plur, - "Number_sing": Number_sing, - "Number_ptan ": Number_ptan, # bg - "Number_count ": Number_count, # bg - "NumType_card": NumType_card, - "NumType_dist": NumType_dist, - "NumType_frac": NumType_frac, - "NumType_gen": NumType_gen, - "NumType_mult": NumType_mult, - "NumType_none": NumType_none, - "NumType_ord": NumType_ord, - "NumType_sets": NumType_sets, - "Person_one": Person_one, - "Person_two": Person_two, - "Person_three": Person_three, - "Person_none": Person_none, - "Poss_yes": Poss_yes, - "PronType_advPart": PronType_advPart, - "PronType_art": PronType_art, - "PronType_default": PronType_default, - "PronType_dem": PronType_dem, - "PronType_ind": PronType_ind, - "PronType_int": PronType_int, - "PronType_neg": PronType_neg, - "PronType_prs": PronType_prs, - "PronType_rcp": PronType_rcp, - "PronType_rel": PronType_rel, - "PronType_tot": PronType_tot, - "PronType_clit": PronType_clit, - "PronType_exc ": PronType_exc, # es, ca, it, fa, - "Reflex_yes": Reflex_yes, - "Tense_fut": Tense_fut, - "Tense_imp": Tense_imp, - "Tense_past": Tense_past, - "Tense_pres": Tense_pres, - "VerbForm_fin": VerbForm_fin, - "VerbForm_ger": VerbForm_ger, - "VerbForm_inf": VerbForm_inf, - "VerbForm_none": VerbForm_none, - "VerbForm_part": VerbForm_part, - "VerbForm_partFut": VerbForm_partFut, - "VerbForm_partPast": VerbForm_partPast, - "VerbForm_partPres": VerbForm_partPres, - "VerbForm_sup": VerbForm_sup, - "VerbForm_trans": VerbForm_trans, - 
"VerbForm_conv": VerbForm_conv, # U20 - "VerbForm_gdv ": VerbForm_gdv, # la, - "Voice_act": Voice_act, - "Voice_cau": Voice_cau, - "Voice_pass": Voice_pass, - "Voice_mid ": Voice_mid, # gkc, - "Voice_int ": Voice_int, # hb, - "Abbr_yes ": Abbr_yes, # cz, fi, sl, U, - "AdpType_prep ": AdpType_prep, # cz, U, - "AdpType_post ": AdpType_post, # U, - "AdpType_voc ": AdpType_voc, # cz, - "AdpType_comprep ": AdpType_comprep, # cz, - "AdpType_circ ": AdpType_circ, # U, - "AdvType_man": AdvType_man, - "AdvType_loc": AdvType_loc, - "AdvType_tim": AdvType_tim, - "AdvType_deg": AdvType_deg, - "AdvType_cau": AdvType_cau, - "AdvType_mod": AdvType_mod, - "AdvType_sta": AdvType_sta, - "AdvType_ex": AdvType_ex, - "AdvType_adadj": AdvType_adadj, - "ConjType_oper ": ConjType_oper, # cz, U, - "ConjType_comp ": ConjType_comp, # cz, U, - "Connegative_yes ": Connegative_yes, # fi, - "Derivation_minen ": Derivation_minen, # fi, - "Derivation_sti ": Derivation_sti, # fi, - "Derivation_inen ": Derivation_inen, # fi, - "Derivation_lainen ": Derivation_lainen, # fi, - "Derivation_ja ": Derivation_ja, # fi, - "Derivation_ton ": Derivation_ton, # fi, - "Derivation_vs ": Derivation_vs, # fi, - "Derivation_ttain ": Derivation_ttain, # fi, - "Derivation_ttaa ": Derivation_ttaa, # fi, - "Echo_rdp ": Echo_rdp, # U, - "Echo_ech ": Echo_ech, # U, - "Foreign_foreign ": Foreign_foreign, # cz, fi, U, - "Foreign_fscript ": Foreign_fscript, # cz, fi, U, - "Foreign_tscript ": Foreign_tscript, # cz, U, - "Foreign_yes ": Foreign_yes, # sl, - "Gender_dat_masc ": Gender_dat_masc, # bq, U, - "Gender_dat_fem ": Gender_dat_fem, # bq, U, - "Gender_erg_masc ": Gender_erg_masc, # bq, - "Gender_erg_fem ": Gender_erg_fem, # bq, - "Gender_psor_masc ": Gender_psor_masc, # cz, sl, U, - "Gender_psor_fem ": Gender_psor_fem, # cz, sl, U, - "Gender_psor_neut ": Gender_psor_neut, # sl, - "Hyph_yes ": Hyph_yes, # cz, U, - "InfForm_one ": InfForm_one, # fi, - "InfForm_two ": InfForm_two, # fi, - "InfForm_three ": InfForm_three, 
# fi, - "NameType_geo ": NameType_geo, # U, cz, - "NameType_prs ": NameType_prs, # U, cz, - "NameType_giv ": NameType_giv, # U, cz, - "NameType_sur ": NameType_sur, # U, cz, - "NameType_nat ": NameType_nat, # U, cz, - "NameType_com ": NameType_com, # U, cz, - "NameType_pro ": NameType_pro, # U, cz, - "NameType_oth ": NameType_oth, # U, cz, - "NounType_com ": NounType_com, # U, - "NounType_prop ": NounType_prop, # U, - "NounType_class ": NounType_class, # U, - "Number_abs_sing ": Number_abs_sing, # bq, U, - "Number_abs_plur ": Number_abs_plur, # bq, U, - "Number_dat_sing ": Number_dat_sing, # bq, U, - "Number_dat_plur ": Number_dat_plur, # bq, U, - "Number_erg_sing ": Number_erg_sing, # bq, U, - "Number_erg_plur ": Number_erg_plur, # bq, U, - "Number_psee_sing ": Number_psee_sing, # U, - "Number_psee_plur ": Number_psee_plur, # U, - "Number_psor_sing ": Number_psor_sing, # cz, fi, sl, U, - "Number_psor_plur ": Number_psor_plur, # cz, fi, sl, U, - "NumForm_digit ": NumForm_digit, # cz, sl, U, - "NumForm_roman ": NumForm_roman, # cz, sl, U, - "NumForm_word ": NumForm_word, # cz, sl, U, - "NumValue_one ": NumValue_one, # cz, U, - "NumValue_two ": NumValue_two, # cz, U, - "NumValue_three ": NumValue_three, # cz, U, - "PartForm_pres ": PartForm_pres, # fi, - "PartForm_past ": PartForm_past, # fi, - "PartForm_agt ": PartForm_agt, # fi, - "PartForm_neg ": PartForm_neg, # fi, - "PartType_mod ": PartType_mod, # U, - "PartType_emp ": PartType_emp, # U, - "PartType_res ": PartType_res, # U, - "PartType_inf ": PartType_inf, # U, - "PartType_vbp ": PartType_vbp, # U, - "Person_abs_one ": Person_abs_one, # bq, U, - "Person_abs_two ": Person_abs_two, # bq, U, - "Person_abs_three ": Person_abs_three, # bq, U, - "Person_dat_one ": Person_dat_one, # bq, U, - "Person_dat_two ": Person_dat_two, # bq, U, - "Person_dat_three ": Person_dat_three, # bq, U, - "Person_erg_one ": Person_erg_one, # bq, U, - "Person_erg_two ": Person_erg_two, # bq, U, - "Person_erg_three ": Person_erg_three, # 
bq, U, - "Person_psor_one ": Person_psor_one, # fi, U, - "Person_psor_two ": Person_psor_two, # fi, U, - "Person_psor_three ": Person_psor_three, # fi, U, - "Polite_inf ": Polite_inf, # bq, U, - "Polite_pol ": Polite_pol, # bq, U, - "Polite_abs_inf ": Polite_abs_inf, # bq, U, - "Polite_abs_pol ": Polite_abs_pol, # bq, U, - "Polite_erg_inf ": Polite_erg_inf, # bq, U, - "Polite_erg_pol ": Polite_erg_pol, # bq, U, - "Polite_dat_inf ": Polite_dat_inf, # bq, U, - "Polite_dat_pol ": Polite_dat_pol, # bq, U, - "Prefix_yes ": Prefix_yes, # U, - "PrepCase_npr ": PrepCase_npr, # cz, - "PrepCase_pre ": PrepCase_pre, # U, - "PunctSide_ini ": PunctSide_ini, # U, - "PunctSide_fin ": PunctSide_fin, # U, - "PunctType_peri ": PunctType_peri, # U, - "PunctType_qest ": PunctType_qest, # U, - "PunctType_excl ": PunctType_excl, # U, - "PunctType_quot ": PunctType_quot, # U, - "PunctType_brck ": PunctType_brck, # U, - "PunctType_comm ": PunctType_comm, # U, - "PunctType_colo ": PunctType_colo, # U, - "PunctType_semi ": PunctType_semi, # U, - "PunctType_dash ": PunctType_dash, # U, - "Style_arch ": Style_arch, # cz, fi, U, - "Style_rare ": Style_rare, # cz, fi, U, - "Style_poet ": Style_poet, # cz, U, - "Style_norm ": Style_norm, # cz, U, - "Style_coll ": Style_coll, # cz, U, - "Style_vrnc ": Style_vrnc, # cz, U, - "Style_sing ": Style_sing, # cz, U, - "Style_expr ": Style_expr, # cz, U, - "Style_derg ": Style_derg, # cz, U, - "Style_vulg ": Style_vulg, # cz, U, - "Style_yes ": Style_yes, # fi, U, - "StyleVariant_styleShort ": StyleVariant_styleShort, # cz, - "StyleVariant_styleBound ": StyleVariant_styleBound, # cz, sl, - "VerbType_aux ": VerbType_aux, # U, - "VerbType_cop ": VerbType_cop, # U, - "VerbType_mod ": VerbType_mod, # U, - "VerbType_light ": VerbType_light, # U, + tag (unicode): The part-of-speech tag to key the exception. + orth (unicode): The word-form to key the exception. 
+ """ + attrs = dict(attrs) + attrs = _normalize_props(attrs) + self.add({self._feat_map.id2feat[feat] for feat in attrs + if feat in self._feat_map.id2feat}) + attrs = intify_attrs(attrs, self.strings, _do_deprecated=True) + self.exc[(tag_str, self.strings.add(orth_str))] = attrs + + cdef hash_t insert(self, MorphAnalysisC tag) except 0: + cdef hash_t key = hash_tag(tag) + if self.tags.get(key) == NULL: + tag_ptr = self.mem.alloc(1, sizeof(MorphAnalysisC)) + tag_ptr[0] = tag + self.tags.set(key, tag_ptr) + return key + + cdef int assign_untagged(self, TokenC* token) except -1: + """Set morphological attributes on a token without a POS tag. Uses + the lemmatizer's lookup() method, which looks up the string in the + table provided by the language data as lemma_lookup (if available). + """ + if token.lemma == 0: + orth_str = self.strings[token.lex.orth] + lemma = self.lemmatizer.lookup(orth_str, orth=token.lex.orth) + token.lemma = self.strings.add(lemma) + + cdef int assign_tag(self, TokenC* token, tag_str) except -1: + cdef attr_t tag = self.strings.as_int(tag_str) + if tag in self.reverse_index: + tag_id = self.reverse_index[tag] + self.assign_tag_id(token, tag_id) + else: + token.tag = tag + + cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1: + if tag_id > self.n_tags: + raise ValueError(Errors.E014.format(tag=tag_id)) + # Ensure spaces get tagged as space. + # It seems pretty arbitrary to put this logic here, but there's really + # nowhere better. I guess the justification is that this is where the + # specific word and the tag interact. Still, we should have a better + # way to enforce this rule, or figure out why the statistical model fails. 
+ # Related to Issue #220 + if Lexeme.c_check_flag(token.lex, IS_SPACE): + tag_id = self.reverse_index[self.strings.add('_SP')] + tag_str = self.tag_names[tag_id] + features = dict(self.tag_map.get(tag_str, {})) + if features: + pos = self.strings.as_int(features.pop(POS)) + else: + pos = 0 + cdef attr_t lemma = self._cache.get(tag_id, token.lex.orth) + if lemma == 0: + # Ugh, self.lemmatize has opposite arg order from self.lemmatizer :( + lemma = self.lemmatize(pos, token.lex.orth, features) + self._cache.set(tag_id, token.lex.orth, lemma) + token.lemma = lemma + token.pos = pos + token.tag = self.strings[tag_str] + token.morph = self.add(features) + if (self.tag_names[tag_id], token.lex.orth) in self.exc: + self._assign_tag_from_exceptions(token, tag_id) + + cdef int _assign_tag_from_exceptions(self, TokenC* token, int tag_id) except -1: + key = (self.tag_names[tag_id], token.lex.orth) + cdef dict attrs + attrs = self.exc[key] + token.pos = attrs.get(POS, token.pos) + token.lemma = attrs.get(LEMMA, token.lemma) + + def load_morph_exceptions(self, dict exc): + # Map (form, pos) to attributes + for tag_str, entries in exc.items(): + for form_str, attrs in entries.items(): + self.add_special_case(tag_str, form_str, attrs) + + @classmethod + def create_class_map(cls): + return MorphologyClassMap(FEATURES) + + +cpdef univ_pos_t get_int_tag(pos_): + return 0 + +cpdef intify_features(features): + return {get_string_id(feature) for feature in features} + +cdef hash_t hash_tag(MorphAnalysisC tag) nogil: + return mrmr.hash64(&tag, sizeof(tag), 0) + + +cdef MorphAnalysisC create_rich_tag(features) except *: + cdef MorphAnalysisC tag + cdef attr_t feature + memset(&tag, 0, sizeof(tag)) + for feature in features: + field = FEATURE_FIELDS[FEATURE_NAMES[feature]] + set_feature(&tag, field, feature, 1) + return tag + + +cdef tag_to_json(const MorphAnalysisC* tag): + return [FEATURE_NAMES[f] for f in list_features(tag)] + + +cdef MorphAnalysisC tag_from_json(json_tag): + raise 
NotImplementedError + + +cdef list list_features(const MorphAnalysisC* tag): + output = [] + if tag.abbr != 0: + output.append(tag.abbr) + if tag.adp_type != 0: + output.append(tag.adp_type) + if tag.adv_type != 0: + output.append(tag.adv_type) + if tag.animacy != 0: + output.append(tag.animacy) + if tag.aspect != 0: + output.append(tag.aspect) + if tag.case != 0: + output.append(tag.case) + if tag.conj_type != 0: + output.append(tag.conj_type) + if tag.connegative != 0: + output.append(tag.connegative) + if tag.definite != 0: + output.append(tag.definite) + if tag.degree != 0: + output.append(tag.degree) + if tag.derivation != 0: + output.append(tag.derivation) + if tag.echo != 0: + output.append(tag.echo) + if tag.foreign != 0: + output.append(tag.foreign) + if tag.gender != 0: + output.append(tag.gender) + if tag.hyph != 0: + output.append(tag.hyph) + if tag.inf_form != 0: + output.append(tag.inf_form) + if tag.mood != 0: + output.append(tag.mood) + if tag.negative != 0: + output.append(tag.negative) + if tag.number != 0: + output.append(tag.number) + if tag.name_type != 0: + output.append(tag.name_type) + if tag.noun_type != 0: + output.append(tag.noun_type) + if tag.part_form != 0: + output.append(tag.part_form) + if tag.part_type != 0: + output.append(tag.part_type) + if tag.person != 0: + output.append(tag.person) + if tag.polite != 0: + output.append(tag.polite) + if tag.polarity != 0: + output.append(tag.polarity) + if tag.poss != 0: + output.append(tag.poss) + if tag.prefix != 0: + output.append(tag.prefix) + if tag.prep_case != 0: + output.append(tag.prep_case) + if tag.pron_type != 0: + output.append(tag.pron_type) + if tag.punct_type != 0: + output.append(tag.punct_type) + if tag.reflex != 0: + output.append(tag.reflex) + if tag.style != 0: + output.append(tag.style) + if tag.style_variant != 0: + output.append(tag.style_variant) + if tag.typo != 0: + output.append(tag.typo) + if tag.verb_form != 0: + output.append(tag.verb_form) + if tag.voice != 0: + 
output.append(tag.voice) + if tag.verb_type != 0: + output.append(tag.verb_type) + return output + + +cdef attr_t get_field(const MorphAnalysisC* tag, int field_id) nogil: + field = field_id + if field == Field_POS: + return tag.pos + if field == Field_Abbr: + return tag.abbr + elif field == Field_AdpType: + return tag.adp_type + elif field == Field_AdvType: + return tag.adv_type + elif field == Field_Animacy: + return tag.animacy + elif field == Field_Aspect: + return tag.aspect + elif field == Field_Case: + return tag.case + elif field == Field_ConjType: + return tag.conj_type + elif field == Field_Connegative: + return tag.connegative + elif field == Field_Definite: + return tag.definite + elif field == Field_Degree: + return tag.degree + elif field == Field_Derivation: + return tag.derivation + elif field == Field_Echo: + return tag.echo + elif field == Field_Foreign: + return tag.foreign + elif field == Field_Gender: + return tag.gender + elif field == Field_Hyph: + return tag.hyph + elif field == Field_InfForm: + return tag.inf_form + elif field == Field_Mood: + return tag.mood + elif field == Field_Negative: + return tag.negative + elif field == Field_Number: + return tag.number + elif field == Field_NameType: + return tag.name_type + elif field == Field_NounType: + return tag.noun_type + elif field == Field_NumForm: + return tag.num_form + elif field == Field_NumType: + return tag.num_type + elif field == Field_NumValue: + return tag.num_value + elif field == Field_PartForm: + return tag.part_form + elif field == Field_PartType: + return tag.part_type + elif field == Field_Person: + return tag.person + elif field == Field_Polite: + return tag.polite + elif field == Field_Polarity: + return tag.polarity + elif field == Field_Poss: + return tag.poss + elif field == Field_Prefix: + return tag.prefix + elif field == Field_PrepCase: + return tag.prep_case + elif field == Field_PronType: + return tag.pron_type + elif field == Field_PunctSide: + return 
tag.punct_side + elif field == Field_PunctType: + return tag.punct_type + elif field == Field_Reflex: + return tag.reflex + elif field == Field_Style: + return tag.style + elif field == Field_StyleVariant: + return tag.style_variant + elif field == Field_Tense: + return tag.tense + elif field == Field_Typo: + return tag.typo + elif field == Field_VerbForm: + return tag.verb_form + elif field == Field_Voice: + return tag.voice + elif field == Field_VerbType: + return tag.verb_type + else: + raise ValueError(Errors.E168.format(field=field_id)) + + +cdef int check_feature(const MorphAnalysisC* tag, attr_t feature) nogil: + if tag.abbr == feature: + return 1 + elif tag.adp_type == feature: + return 1 + elif tag.adv_type == feature: + return 1 + elif tag.animacy == feature: + return 1 + elif tag.aspect == feature: + return 1 + elif tag.case == feature: + return 1 + elif tag.conj_type == feature: + return 1 + elif tag.connegative == feature: + return 1 + elif tag.definite == feature: + return 1 + elif tag.degree == feature: + return 1 + elif tag.derivation == feature: + return 1 + elif tag.echo == feature: + return 1 + elif tag.foreign == feature: + return 1 + elif tag.gender == feature: + return 1 + elif tag.hyph == feature: + return 1 + elif tag.inf_form == feature: + return 1 + elif tag.mood == feature: + return 1 + elif tag.negative == feature: + return 1 + elif tag.number == feature: + return 1 + elif tag.name_type == feature: + return 1 + elif tag.noun_type == feature: + return 1 + elif tag.num_form == feature: + return 1 + elif tag.num_type == feature: + return 1 + elif tag.num_value == feature: + return 1 + elif tag.part_form == feature: + return 1 + elif tag.part_type == feature: + return 1 + elif tag.person == feature: + return 1 + elif tag.polite == feature: + return 1 + elif tag.polarity == feature: + return 1 + elif tag.poss == feature: + return 1 + elif tag.prefix == feature: + return 1 + elif tag.prep_case == feature: + return 1 + elif tag.pron_type == 
feature: + return 1 + elif tag.punct_side == feature: + return 1 + elif tag.punct_type == feature: + return 1 + elif tag.reflex == feature: + return 1 + elif tag.style == feature: + return 1 + elif tag.style_variant == feature: + return 1 + elif tag.tense == feature: + return 1 + elif tag.typo == feature: + return 1 + elif tag.verb_form == feature: + return 1 + elif tag.voice == feature: + return 1 + elif tag.verb_type == feature: + return 1 + else: + return 0 + +cdef int set_feature(MorphAnalysisC* tag, + univ_field_t field, attr_t feature, int value) except -1: + if value == True: + value_ = feature + else: + value_ = 0 + prev_value = get_field(tag, field) + if prev_value != 0 and value_ == 0 and field != Field_POS: + tag.length -= 1 + elif prev_value == 0 and value_ != 0 and field != Field_POS: + tag.length += 1 + if feature == 0: + pass + elif field == Field_POS: + tag.pos = get_string_id(FEATURE_NAMES[value_].split('_')[1]) + elif field == Field_Abbr: + tag.abbr = value_ + elif field == Field_AdpType: + tag.adp_type = value_ + elif field == Field_AdvType: + tag.adv_type = value_ + elif field == Field_Animacy: + tag.animacy = value_ + elif field == Field_Aspect: + tag.aspect = value_ + elif field == Field_Case: + tag.case = value_ + elif field == Field_ConjType: + tag.conj_type = value_ + elif field == Field_Connegative: + tag.connegative = value_ + elif field == Field_Definite: + tag.definite = value_ + elif field == Field_Degree: + tag.degree = value_ + elif field == Field_Derivation: + tag.derivation = value_ + elif field == Field_Echo: + tag.echo = value_ + elif field == Field_Foreign: + tag.foreign = value_ + elif field == Field_Gender: + tag.gender = value_ + elif field == Field_Hyph: + tag.hyph = value_ + elif field == Field_InfForm: + tag.inf_form = value_ + elif field == Field_Mood: + tag.mood = value_ + elif field == Field_Negative: + tag.negative = value_ + elif field == Field_Number: + tag.number = value_ + elif field == Field_NameType: + 
tag.name_type = value_ + elif field == Field_NounType: + tag.noun_type = value_ + elif field == Field_NumForm: + tag.num_form = value_ + elif field == Field_NumType: + tag.num_type = value_ + elif field == Field_NumValue: + tag.num_value = value_ + elif field == Field_PartForm: + tag.part_form = value_ + elif field == Field_PartType: + tag.part_type = value_ + elif field == Field_Person: + tag.person = value_ + elif field == Field_Polite: + tag.polite = value_ + elif field == Field_Polarity: + tag.polarity = value_ + elif field == Field_Poss: + tag.poss = value_ + elif field == Field_Prefix: + tag.prefix = value_ + elif field == Field_PrepCase: + tag.prep_case = value_ + elif field == Field_PronType: + tag.pron_type = value_ + elif field == Field_PunctSide: + tag.punct_side = value_ + elif field == Field_PunctType: + tag.punct_type = value_ + elif field == Field_Reflex: + tag.reflex = value_ + elif field == Field_Style: + tag.style = value_ + elif field == Field_StyleVariant: + tag.style_variant = value_ + elif field == Field_Tense: + tag.tense = value_ + elif field == Field_Typo: + tag.typo = value_ + elif field == Field_VerbForm: + tag.verb_form = value_ + elif field == Field_Voice: + tag.voice = value_ + elif field == Field_VerbType: + tag.verb_type = value_ + else: + raise ValueError(Errors.E167.format(field=FEATURE_NAMES.get(feature), field_id=feature)) + + +FIELDS = { + 'POS': Field_POS, + 'Abbr': Field_Abbr, + 'AdpType': Field_AdpType, + 'AdvType': Field_AdvType, + 'Animacy': Field_Animacy, + 'Aspect': Field_Aspect, + 'Case': Field_Case, + 'ConjType': Field_ConjType, + 'Connegative': Field_Connegative, + 'Definite': Field_Definite, + 'Degree': Field_Degree, + 'Derivation': Field_Derivation, + 'Echo': Field_Echo, + 'Foreign': Field_Foreign, + 'Gender': Field_Gender, + 'Hyph': Field_Hyph, + 'InfForm': Field_InfForm, + 'Mood': Field_Mood, + 'NameType': Field_NameType, + 'Negative': Field_Negative, + 'NounType': Field_NounType, + 'Number': Field_Number, + 
'NumForm': Field_NumForm, + 'NumType': Field_NumType, + 'NumValue': Field_NumValue, + 'PartForm': Field_PartForm, + 'PartType': Field_PartType, + 'Person': Field_Person, + 'Polite': Field_Polite, + 'Polarity': Field_Polarity, + 'Poss': Field_Poss, + 'Prefix': Field_Prefix, + 'PrepCase': Field_PrepCase, + 'PronType': Field_PronType, + 'PunctSide': Field_PunctSide, + 'PunctType': Field_PunctType, + 'Reflex': Field_Reflex, + 'Style': Field_Style, + 'StyleVariant': Field_StyleVariant, + 'Tense': Field_Tense, + 'Typo': Field_Typo, + 'VerbForm': Field_VerbForm, + 'VerbType': Field_VerbType, + 'Voice': Field_Voice, +} + +LOWER_FIELDS = { + 'pos': Field_POS, + 'abbr': Field_Abbr, + 'adp_type': Field_AdpType, + 'adv_type': Field_AdvType, + 'animacy': Field_Animacy, + 'aspect': Field_Aspect, + 'case': Field_Case, + 'conj_type': Field_ConjType, + 'connegative': Field_Connegative, + 'definite': Field_Definite, + 'degree': Field_Degree, + 'derivation': Field_Derivation, + 'echo': Field_Echo, + 'foreign': Field_Foreign, + 'gender': Field_Gender, + 'hyph': Field_Hyph, + 'inf_form': Field_InfForm, + 'mood': Field_Mood, + 'name_type': Field_NameType, + 'negative': Field_Negative, + 'noun_type': Field_NounType, + 'number': Field_Number, + 'num_form': Field_NumForm, + 'num_type': Field_NumType, + 'num_value': Field_NumValue, + 'part_form': Field_PartForm, + 'part_type': Field_PartType, + 'person': Field_Person, + 'polarity': Field_Polarity, + 'polite': Field_Polite, + 'poss': Field_Poss, + 'prefix': Field_Prefix, + 'prep_case': Field_PrepCase, + 'pron_type': Field_PronType, + 'punct_side': Field_PunctSide, + 'punct_type': Field_PunctType, + 'reflex': Field_Reflex, + 'style': Field_Style, + 'style_variant': Field_StyleVariant, + 'tense': Field_Tense, + 'typo': Field_Typo, + 'verb_form': Field_VerbForm, + 'verb_type': Field_VerbType, + 'voice': Field_Voice, } -NAMES = [key for key, value in sorted(IDS.items(), key=lambda item: item[1])] -# Unfortunate hack here, to work around problem 
with long cpdef enum -# (which is generating an enormous amount of C++ in Cython 0.24+) -# We keep the enum cdef, and just make sure the names are available to Python -locals().update(IDS) +FEATURES = [ + "POS_ADJ", + "POS_ADP", + "POS_ADV", + "POS_AUX", + "POS_CONJ", + "POS_CCONJ", + "POS_DET", + "POS_INTJ", + "POS_NOUN", + "POS_NUM", + "POS_PART", + "POS_PRON", + "POS_PROPN", + "POS_PUNCT", + "POS_SCONJ", + "POS_SYM", + "POS_VERB", + "POS_X", + "POS_EOL", + "POS_SPACE", + "Abbr_yes", + "AdpType_circ", + "AdpType_comprep", + "AdpType_prep", + "AdpType_post", + "AdpType_voc", + "AdvType_adadj", + "AdvType_cau", + "AdvType_deg", + "AdvType_ex", + "AdvType_loc", + "AdvType_man", + "AdvType_mod", + "AdvType_sta", + "AdvType_tim", + "Animacy_anim", + "Animacy_hum", + "Animacy_inan", + "Animacy_nhum", + "Aspect_hab", + "Aspect_imp", + "Aspect_iter", + "Aspect_perf", + "Aspect_prog", + "Aspect_prosp", + "Aspect_none", + "Case_abe", + "Case_abl", + "Case_abs", + "Case_acc", + "Case_ade", + "Case_all", + "Case_cau", + "Case_com", + "Case_dat", + "Case_del", + "Case_dis", + "Case_ela", + "Case_ess", + "Case_gen", + "Case_ill", + "Case_ine", + "Case_ins", + "Case_loc", + "Case_lat", + "Case_nom", + "Case_par", + "Case_sub", + "Case_sup", + "Case_tem", + "Case_ter", + "Case_tra", + "Case_voc", + "ConjType_comp", + "ConjType_oper", + "Connegative_yes", + "Definite_cons", + "Definite_def", + "Definite_ind", + "Definite_red", + "Definite_two", + "Degree_abs", + "Degree_cmp", + "Degree_comp", + "Degree_none", + "Degree_pos", + "Degree_sup", + "Degree_com", + "Degree_dim", + "Derivation_minen", + "Derivation_sti", + "Derivation_inen", + "Derivation_lainen", + "Derivation_ja", + "Derivation_ton", + "Derivation_vs", + "Derivation_ttain", + "Derivation_ttaa", + "Echo_rdp", + "Echo_ech", + "Foreign_foreign", + "Foreign_fscript", + "Foreign_tscript", + "Foreign_yes", + "Gender_com", + "Gender_fem", + "Gender_masc", + "Gender_neut", + "Gender_dat_masc", + "Gender_dat_fem", + 
"Gender_erg_masc", + "Gender_erg_fem", + "Gender_psor_masc", + "Gender_psor_fem", + "Gender_psor_neut", + "Hyph_yes", + "InfForm_one", + "InfForm_two", + "InfForm_three", + "Mood_cnd", + "Mood_imp", + "Mood_ind", + "Mood_n", + "Mood_pot", + "Mood_sub", + "Mood_opt", + "NameType_geo", + "NameType_prs", + "NameType_giv", + "NameType_sur", + "NameType_nat", + "NameType_com", + "NameType_pro", + "NameType_oth", + "Negative_neg", + "Negative_pos", + "Negative_yes", + "NounType_com", + "NounType_prop", + "NounType_class", + "Number_com", + "Number_dual", + "Number_none", + "Number_plur", + "Number_sing", + "Number_ptan", + "Number_count", + "Number_abs_sing", + "Number_abs_plur", + "Number_dat_sing", + "Number_dat_plur", + "Number_erg_sing", + "Number_erg_plur", + "Number_psee_sing", + "Number_psee_plur", + "Number_psor_sing", + "Number_psor_plur", + "NumForm_digit", + "NumForm_roman", + "NumForm_word", + "NumForm_combi", + "NumType_card", + "NumType_dist", + "NumType_frac", + "NumType_gen", + "NumType_mult", + "NumType_none", + "NumType_ord", + "NumType_sets", + "NumType_dual", + "NumValue_one", + "NumValue_two", + "NumValue_three", + "PartForm_pres", + "PartForm_past", + "PartForm_agt", + "PartForm_neg", + "PartType_mod", + "PartType_emp", + "PartType_res", + "PartType_inf", + "PartType_vbp", + "Person_one", + "Person_two", + "Person_three", + "Person_none", + "Person_abs_one", + "Person_abs_two", + "Person_abs_three", + "Person_dat_one", + "Person_dat_two", + "Person_dat_three", + "Person_erg_one", + "Person_erg_two", + "Person_erg_three", + "Person_psor_one", + "Person_psor_two", + "Person_psor_three", + "Polarity_neg", + "Polarity_pos", + "Polite_inf", + "Polite_pol", + "Polite_abs_inf", + "Polite_abs_pol", + "Polite_erg_inf", + "Polite_erg_pol", + "Polite_dat_inf", + "Polite_dat_pol", + "Poss_yes", + "Prefix_yes", + "PrepCase_npr", + "PrepCase_pre", + "PronType_advPart", + "PronType_art", + "PronType_default", + "PronType_dem", + "PronType_ind", + "PronType_int", + 
"PronType_neg", + "PronType_prs", + "PronType_rcp", + "PronType_rel", + "PronType_tot", + "PronType_clit", + "PronType_exc", + "PunctSide_ini", + "PunctSide_fin", + "PunctType_peri", + "PunctType_qest", + "PunctType_excl", + "PunctType_quot", + "PunctType_brck", + "PunctType_comm", + "PunctType_colo", + "PunctType_semi", + "PunctType_dash", + "Reflex_yes", + "Style_arch", + "Style_rare", + "Style_poet", + "Style_norm", + "Style_coll", + "Style_vrnc", + "Style_sing", + "Style_expr", + "Style_derg", + "Style_vulg", + "Style_yes", + "StyleVariant_styleShort", + "StyleVariant_styleBound", + "Tense_fut", + "Tense_imp", + "Tense_past", + "Tense_pres", + "Typo_yes", + "VerbForm_fin", + "VerbForm_ger", + "VerbForm_inf", + "VerbForm_none", + "VerbForm_part", + "VerbForm_partFut", + "VerbForm_partPast", + "VerbForm_partPres", + "VerbForm_sup", + "VerbForm_trans", + "VerbForm_conv", + "VerbForm_gdv", + "VerbType_aux", + "VerbType_cop", + "VerbType_mod", + "VerbType_light", + "Voice_act", + "Voice_cau", + "Voice_pass", + "Voice_mid", + "Voice_int", +] + +FEATURE_NAMES = {get_string_id(f): f for f in FEATURES} +FEATURE_FIELDS = {f: FIELDS[f.split('_', 1)[0]] for f in FEATURES} diff --git a/spacy/pipeline/__init__.py b/spacy/pipeline/__init__.py index 5d7b079d9..2f30fbbee 100644 --- a/spacy/pipeline/__init__.py +++ b/spacy/pipeline/__init__.py @@ -3,6 +3,7 @@ from __future__ import unicode_literals from .pipes import Tagger, DependencyParser, EntityRecognizer, EntityLinker from .pipes import TextCategorizer, Tensorizer, Pipe, Sentencizer +from .morphologizer import Morphologizer from .entityruler import EntityRuler from .hooks import SentenceSegmenter, SimilarityHook from .functions import merge_entities, merge_noun_chunks, merge_subtokens @@ -15,6 +16,7 @@ __all__ = [ "TextCategorizer", "Tensorizer", "Pipe", + "Morphologizer", "EntityRuler", "Sentencizer", "SentenceSegmenter", diff --git a/spacy/pipeline/entityruler.py b/spacy/pipeline/entityruler.py index a1d3f922e..956d67291 
100644 --- a/spacy/pipeline/entityruler.py +++ b/spacy/pipeline/entityruler.py @@ -180,21 +180,28 @@ class EntityRuler(object): DOCS: https://spacy.io/api/entityruler#add_patterns """ - for entry in patterns: - label = entry["label"] - if "id" in entry: - label = self._create_label(label, entry["id"]) - pattern = entry["pattern"] - if isinstance(pattern, basestring_): - self.phrase_patterns[label].append(self.nlp(pattern)) - elif isinstance(pattern, list): - self.token_patterns[label].append(pattern) - else: - raise ValueError(Errors.E097.format(pattern=pattern)) - for label, patterns in self.token_patterns.items(): - self.matcher.add(label, None, *patterns) - for label, patterns in self.phrase_patterns.items(): - self.phrase_matcher.add(label, None, *patterns) + # disable the nlp components after this one in case they hadn't been initialized / deserialised yet + try: + current_index = self.nlp.pipe_names.index(self.name) + subsequent_pipes = [pipe for pipe in self.nlp.pipe_names[current_index + 1:]] + except ValueError: + subsequent_pipes = [] + with self.nlp.disable_pipes(*subsequent_pipes): + for entry in patterns: + label = entry["label"] + if "id" in entry: + label = self._create_label(label, entry["id"]) + pattern = entry["pattern"] + if isinstance(pattern, basestring_): + self.phrase_patterns[label].append(self.nlp(pattern)) + elif isinstance(pattern, list): + self.token_patterns[label].append(pattern) + else: + raise ValueError(Errors.E097.format(pattern=pattern)) + for label, patterns in self.token_patterns.items(): + self.matcher.add(label, None, *patterns) + for label, patterns in self.phrase_patterns.items(): + self.phrase_matcher.add(label, None, *patterns) def _split_label(self, label): """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep diff --git a/spacy/pipeline/morphologizer.pyx b/spacy/pipeline/morphologizer.pyx new file mode 100644 index 000000000..b14e2bec7 --- /dev/null +++ b/spacy/pipeline/morphologizer.pyx @@ -0,0 
+1,164 @@ +from __future__ import unicode_literals +from collections import OrderedDict, defaultdict + +import numpy +cimport numpy as np + +from thinc.api import chain +from thinc.neural.util import to_categorical, copy_array, get_array_module +from .. import util +from .pipes import Pipe +from .._ml import Tok2Vec, build_morphologizer_model +from .._ml import link_vectors_to_models, zero_init, flatten +from .._ml import create_default_optimizer +from ..errors import Errors, TempErrors +from ..compat import basestring_ +from ..tokens.doc cimport Doc +from ..vocab cimport Vocab +from ..morphology cimport Morphology + + +class Morphologizer(Pipe): + name = 'morphologizer' + + @classmethod + def Model(cls, **cfg): + if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'): + raise ValueError(TempErrors.T008) + class_map = Morphology.create_class_map() + return build_morphologizer_model(class_map.field_sizes, **cfg) + + def __init__(self, vocab, model=True, **cfg): + self.vocab = vocab + self.model = model + self.cfg = OrderedDict(sorted(cfg.items())) + self.cfg.setdefault('cnn_maxout_pieces', 2) + self._class_map = self.vocab.morphology.create_class_map() + + @property + def labels(self): + return self.vocab.morphology.tag_names + + @property + def tok2vec(self): + if self.model in (None, True, False): + return None + else: + return chain(self.model.tok2vec, flatten) + + def __call__(self, doc): + features, tokvecs = self.predict([doc]) + self.set_annotations([doc], features, tensors=tokvecs) + return doc + + def pipe(self, stream, batch_size=128, n_threads=-1): + for docs in util.minibatch(stream, size=batch_size): + docs = list(docs) + features, tokvecs = self.predict(docs) + self.set_annotations(docs, features, tensors=tokvecs) + yield from docs + + def predict(self, docs): + if not any(len(doc) for doc in docs): + # Handle case where there are no tokens in any docs. 
+ n_labels = self.model.nO + guesses = [self.model.ops.allocate((0, n_labels)) for doc in docs] + tokvecs = self.model.ops.allocate((0, self.model.tok2vec.nO)) + return guesses, tokvecs + tokvecs = self.model.tok2vec(docs) + scores = self.model.softmax(tokvecs) + return scores, tokvecs + + def set_annotations(self, docs, batch_scores, tensors=None): + if isinstance(docs, Doc): + docs = [docs] + cdef Doc doc + cdef Vocab vocab = self.vocab + offsets = [self._class_map.get_field_offset(field) + for field in self._class_map.fields] + for i, doc in enumerate(docs): + doc_scores = batch_scores[i] + doc_guesses = scores_to_guesses(doc_scores, self.model.softmax.out_sizes) + # Convert the neuron indices into feature IDs. + doc_feat_ids = numpy.zeros((len(doc), len(self._class_map.fields)), dtype='i') + for j in range(len(doc)): + for k, offset in enumerate(offsets): + if doc_guesses[j, k] == 0: + doc_feat_ids[j, k] = 0 + else: + doc_feat_ids[j, k] = offset + doc_guesses[j, k] + # Get the set of feature names. + feats = {self._class_map.col2info[f][2] for f in doc_feat_ids[j]} + if "NIL" in feats: + feats.remove("NIL") + # Now add the analysis, and set the hash. + doc.c[j].morph = self.vocab.morphology.add(feats) + if doc[j].morph.pos != 0: + doc.c[j].pos = doc[j].morph.pos + + def update(self, docs, golds, drop=0., sgd=None, losses=None): + if losses is not None and self.name not in losses: + losses[self.name] = 0. 
+ + tag_scores, bp_tag_scores = self.model.begin_update(docs, drop=drop) + loss, d_tag_scores = self.get_loss(docs, golds, tag_scores) + bp_tag_scores(d_tag_scores, sgd=sgd) + + if losses is not None: + losses[self.name] += loss + + def get_loss(self, docs, golds, scores): + guesses = [] + for doc_scores in scores: + guesses.append(scores_to_guesses(doc_scores, self.model.softmax.out_sizes)) + guesses = self.model.ops.xp.vstack(guesses) + scores = self.model.ops.xp.vstack(scores) + if not isinstance(scores, numpy.ndarray): + scores = scores.get() + if not isinstance(guesses, numpy.ndarray): + guesses = guesses.get() + cdef int idx = 0 + # Do this on CPU, as we can't vectorize easily. + target = numpy.zeros(scores.shape, dtype='f') + field_sizes = self.model.softmax.out_sizes + for doc, gold in zip(docs, golds): + for t, features in enumerate(gold.morphology): + if features is None: + target[idx] = scores[idx] + else: + gold_fields = {} + for feature in features: + field = self._class_map.feat2field[feature] + gold_fields[field] = self._class_map.feat2offset[feature] + for field in self._class_map.fields: + field_id = self._class_map.field2id[field] + col_offset = self._class_map.field2col[field] + if field_id in gold_fields: + target[idx, col_offset + gold_fields[field_id]] = 1. + else: + target[idx, col_offset] = 1. 
+ #print(doc[t]) + #for col, info in enumerate(self._class_map.col2info): + # print(col, info, scores[idx, col], target[idx, col]) + idx += 1 + target = self.model.ops.asarray(target, dtype='f') + scores = self.model.ops.asarray(scores, dtype='f') + d_scores = scores - target + loss = (d_scores**2).sum() + d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs]) + return float(loss), d_scores + + def use_params(self, params): + with self.model.use_params(params): + yield + +def scores_to_guesses(scores, out_sizes): + xp = get_array_module(scores) + guesses = xp.zeros((scores.shape[0], len(out_sizes)), dtype='i') + offset = 0 + for i, size in enumerate(out_sizes): + slice_ = scores[:, offset : offset + size] + col_guesses = slice_.argmax(axis=1) + guesses[:, i] = col_guesses + offset += size + return guesses diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx index 190116a2e..9ac3affc9 100644 --- a/spacy/pipeline/pipes.pyx +++ b/spacy/pipeline/pipes.pyx @@ -69,7 +69,7 @@ class Pipe(object): predictions = self.predict([doc]) if isinstance(predictions, tuple) and len(predictions) == 2: scores, tensors = predictions - self.set_annotations([doc], scores, tensor=tensors) + self.set_annotations([doc], scores, tensors=tensors) else: self.set_annotations([doc], predictions) return doc @@ -90,7 +90,7 @@ class Pipe(object): predictions = self.predict(docs) if isinstance(predictions, tuple) and len(tuple) == 2: scores, tensors = predictions - self.set_annotations(docs, scores, tensor=tensors) + self.set_annotations(docs, scores, tensors=tensors) else: self.set_annotations(docs, predictions) yield from docs @@ -424,18 +424,22 @@ class Tagger(Pipe): cdef Doc doc cdef int idx = 0 cdef Vocab vocab = self.vocab + assign_morphology = self.cfg.get("set_morphology", True) for i, doc in enumerate(docs): doc_tag_ids = batch_tag_ids[i] if hasattr(doc_tag_ids, "get"): doc_tag_ids = doc_tag_ids.get() for j, tag_id in enumerate(doc_tag_ids): # Don't clobber preset 
POS tags - if doc.c[j].tag == 0 and doc.c[j].pos == 0: - # Don't clobber preset lemmas - lemma = doc.c[j].lemma - vocab.morphology.assign_tag_id(&doc.c[j], tag_id) - if lemma != 0 and lemma != doc.c[j].lex.orth: - doc.c[j].lemma = lemma + if doc.c[j].tag == 0: + if doc.c[j].pos == 0 and assign_morphology: + # Don't clobber preset lemmas + lemma = doc.c[j].lemma + vocab.morphology.assign_tag_id(&doc.c[j], tag_id) + if lemma != 0 and lemma != doc.c[j].lex.orth: + doc.c[j].lemma = lemma + else: + doc.c[j].tag = self.vocab.strings[self.labels[tag_id]] idx += 1 if tensors is not None and len(tensors): if isinstance(doc.tensor, numpy.ndarray) \ @@ -500,6 +504,7 @@ class Tagger(Pipe): orig_tag_map = dict(self.vocab.morphology.tag_map) new_tag_map = OrderedDict() for raw_text, annots_brackets in get_gold_tuples(): + _ = annots_brackets.pop() for annots, brackets in annots_brackets: ids, words, tags, heads, deps, ents = annots for tag in tags: @@ -932,11 +937,6 @@ class TextCategorizer(Pipe): def labels(self, value): self.cfg["labels"] = tuple(value) - def __call__(self, doc): - scores, tensors = self.predict([doc]) - self.set_annotations([doc], scores, tensors=tensors) - return doc - def pipe(self, stream, batch_size=128, n_threads=-1): for docs in util.minibatch(stream, size=batch_size): docs = list(docs) @@ -1017,6 +1017,10 @@ class TextCategorizer(Pipe): return 1 def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): + for raw_text, annots_brackets in get_gold_tuples(): + cats = annots_brackets.pop() + for cat in cats: + self.add_label(cat) if self.model is True: self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors") self.require_labels() diff --git a/spacy/scorer.py b/spacy/scorer.py index 4032cc4dd..9c057d0a3 100644 --- a/spacy/scorer.py +++ b/spacy/scorer.py @@ -1,7 +1,10 @@ # coding: utf8 from __future__ import division, print_function, unicode_literals +import numpy as np + from .gold import tags_to_entities, GoldParse 
+from .errors import Errors class PRFScore(object): @@ -34,10 +37,39 @@ class PRFScore(object): return 2 * ((p * r) / (p + r + 1e-100)) +class ROCAUCScore(object): + """ + An AUC ROC score. + """ + + def __init__(self): + self.golds = [] + self.cands = [] + self.saved_score = 0.0 + self.saved_score_at_len = 0 + + def score_set(self, cand, gold): + self.cands.append(cand) + self.golds.append(gold) + + @property + def score(self): + if len(self.golds) == self.saved_score_at_len: + return self.saved_score + try: + self.saved_score = _roc_auc_score(self.golds, self.cands) + # catch ValueError: Only one class present in y_true. + # ROC AUC score is not defined in that case. + except ValueError: + self.saved_score = -float("inf") + self.saved_score_at_len = len(self.golds) + return self.saved_score + + class Scorer(object): """Compute evaluation scores.""" - def __init__(self, eval_punct=False): + def __init__(self, eval_punct=False, pipeline=None): """Initialize the Scorer. eval_punct (bool): Evaluate the dependency attachments to and from @@ -54,6 +86,24 @@ class Scorer(object): self.ner = PRFScore() self.ner_per_ents = dict() self.eval_punct = eval_punct + self.textcat = None + self.textcat_per_cat = dict() + self.textcat_positive_label = None + self.textcat_multilabel = False + + if pipeline: + for name, model in pipeline: + if name == "textcat": + self.textcat_positive_label = model.cfg.get("positive_label", None) + if self.textcat_positive_label: + self.textcat = PRFScore() + if not model.cfg.get("exclusive_classes", False): + self.textcat_multilabel = True + for label in model.cfg.get("labels", []): + self.textcat_per_cat[label] = ROCAUCScore() + else: + for label in model.cfg.get("labels", []): + self.textcat_per_cat[label] = PRFScore() @property def tags_acc(self): @@ -101,10 +151,47 @@ class Scorer(object): for k, v in self.ner_per_ents.items() } + @property + def textcat_score(self): + """RETURNS (float): f-score on positive label for binary exclusive, + 
macro-averaged f-score for 3+ exclusive, + macro-averaged AUC ROC score for multilabel (-1 if undefined) + """ + if not self.textcat_multilabel: + # binary multiclass + if self.textcat_positive_label: + return self.textcat.fscore * 100 + # other multiclass + return ( + sum([score.fscore for label, score in self.textcat_per_cat.items()]) + / (len(self.textcat_per_cat) + 1e-100) + * 100 + ) + # multilabel + return max( + sum([score.score for label, score in self.textcat_per_cat.items()]) + / (len(self.textcat_per_cat) + 1e-100), + -1, + ) + + @property + def textcats_per_cat(self): + """RETURNS (dict): Scores per textcat label. + """ + if not self.textcat_multilabel: + return { + k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100} + for k, v in self.textcat_per_cat.items() + } + return { + k: {"roc_auc_score": max(v.score, -1)} + for k, v in self.textcat_per_cat.items() + } + @property def scores(self): """RETURNS (dict): All scores with keys `uas`, `las`, `ents_p`, - `ents_r`, `ents_f`, `tags_acc` and `token_acc`. + `ents_r`, `ents_f`, `tags_acc`, `token_acc`, and `textcat_score`. 
""" return { "uas": self.uas, @@ -115,6 +202,8 @@ class Scorer(object): "ents_per_type": self.ents_per_type, "tags_acc": self.tags_acc, "token_acc": self.token_acc, + "textcat_score": self.textcat_score, + "textcats_per_cat": self.textcats_per_cat, } def score(self, doc, gold, verbose=False, punct_labels=("p", "punct")): @@ -192,9 +281,301 @@ class Scorer(object): self.unlabelled.score_set( set(item[:2] for item in cand_deps), set(item[:2] for item in gold_deps) ) + if ( + len(gold.cats) > 0 + and set(self.textcat_per_cat) == set(gold.cats) + and set(gold.cats) == set(doc.cats) + ): + goldcat = max(gold.cats, key=gold.cats.get) + candcat = max(doc.cats, key=doc.cats.get) + if self.textcat_positive_label: + self.textcat.score_set( + set([self.textcat_positive_label]) & set([candcat]), + set([self.textcat_positive_label]) & set([goldcat]), + ) + for label in self.textcat_per_cat: + if self.textcat_multilabel: + self.textcat_per_cat[label].score_set( + doc.cats[label], gold.cats[label] + ) + else: + self.textcat_per_cat[label].score_set( + set([label]) & set([candcat]), set([label]) & set([goldcat]) + ) + elif len(self.textcat_per_cat) > 0: + model_labels = set(self.textcat_per_cat) + eval_labels = set(gold.cats) + raise ValueError( + Errors.E162.format(model_labels=model_labels, eval_labels=eval_labels) + ) if verbose: gold_words = [item[1] for item in gold.orig_annot] for w_id, h_id, dep in cand_deps - gold_deps: print("F", gold_words[w_id], dep, gold_words[h_id]) for w_id, h_id, dep in gold_deps - cand_deps: print("M", gold_words[w_id], dep, gold_words[h_id]) + + +############################################################################# +# +# The following implementation of roc_auc_score() is adapted from +# scikit-learn, which is distributed under the following license: +# +# New BSD License +# +# Copyright (c) 2007–2019 The scikit-learn developers. +# All rights reserved. 
+# +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# +# a. Redistributions of source code must retain the above copyright notice, +# this list of conditions and the following disclaimer. +# b. Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# c. Neither the name of the Scikit-learn Developers nor the names of +# its contributors may be used to endorse or promote products +# derived from this software without specific prior written +# permission. +# +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +# ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR +# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +# OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH +# DAMAGE. + + +def _roc_auc_score(y_true, y_score): + """Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) + from prediction scores. + + Note: this implementation is restricted to the binary classification task + + Parameters + ---------- + y_true : array, shape = [n_samples] or [n_samples, n_classes] + True binary labels or binary label indicators. + The multiclass case expects shape = [n_samples] and labels + with values in ``range(n_classes)``. 
+ + y_score : array, shape = [n_samples] or [n_samples, n_classes] + Target scores, can either be probability estimates of the positive + class, confidence values, or non-thresholded measure of decisions + (as returned by "decision_function" on some classifiers). For binary + y_true, y_score is supposed to be the score of the class with greater + label. The multiclass case expects shape = [n_samples, n_classes] + where the scores correspond to probability estimates. + + Returns + ------- + auc : float + + References + ---------- + .. [1] `Wikipedia entry for the Receiver operating characteristic + `_ + + .. [2] Fawcett T. An introduction to ROC analysis[J]. Pattern Recognition + Letters, 2006, 27(8):861-874. + + .. [3] `Analyzing a portion of the ROC curve. McClish, 1989 + `_ + """ + if len(np.unique(y_true)) != 2: + raise ValueError(Errors.E165) + fpr, tpr, _ = _roc_curve(y_true, y_score) + return _auc(fpr, tpr) + + +def _roc_curve(y_true, y_score): + """Compute Receiver operating characteristic (ROC) + + Note: this implementation is restricted to the binary classification task. + + Parameters + ---------- + + y_true : array, shape = [n_samples] + True binary labels. If labels are not either {-1, 1} or {0, 1}, then + pos_label should be explicitly given. + + y_score : array, shape = [n_samples] + Target scores, can either be probability estimates of the positive + class, confidence values, or non-thresholded measure of decisions + (as returned by "decision_function" on some classifiers). + + Returns + ------- + fpr : array, shape = [>2] + Increasing false positive rates such that element i is the false + positive rate of predictions with score >= thresholds[i]. + + tpr : array, shape = [>2] + Increasing true positive rates such that element i is the true + positive rate of predictions with score >= thresholds[i]. + + thresholds : array, shape = [n_thresholds] + Decreasing thresholds on the decision function used to compute + fpr and tpr. 
`thresholds[0]` represents no instances being predicted + and is arbitrarily set to `max(y_score) + 1`. + + Notes + ----- + Since the thresholds are sorted from low to high values, they + are reversed upon returning them to ensure they correspond to both ``fpr`` + and ``tpr``, which are sorted in reversed order during their calculation. + + References + ---------- + .. [1] `Wikipedia entry for the Receiver operating characteristic + `_ + + .. [2] Fawcett T. An introduction to ROC analysis[J]. Pattern Recognition + Letters, 2006, 27(8):861-874. + """ + fps, tps, thresholds = _binary_clf_curve(y_true, y_score) + + # Add an extra threshold position + # to make sure that the curve starts at (0, 0) + tps = np.r_[0, tps] + fps = np.r_[0, fps] + thresholds = np.r_[thresholds[0] + 1, thresholds] + + if fps[-1] <= 0: + fpr = np.repeat(np.nan, fps.shape) + else: + fpr = fps / fps[-1] + + if tps[-1] <= 0: + tpr = np.repeat(np.nan, tps.shape) + else: + tpr = tps / tps[-1] + + return fpr, tpr, thresholds + + +def _binary_clf_curve(y_true, y_score): + """Calculate true and false positives per binary classification threshold. + + Parameters + ---------- + y_true : array, shape = [n_samples] + True targets of binary classification + + y_score : array, shape = [n_samples] + Estimated probabilities or decision function + + Returns + ------- + fps : array, shape = [n_thresholds] + A count of false positives, at index i being the number of negative + samples assigned a score >= thresholds[i]. The total number of + negative samples is equal to fps[-1] (thus true negatives are given by + fps[-1] - fps). + + tps : array, shape = [n_thresholds <= len(np.unique(y_score))] + An increasing count of true positives, at index i being the number + of positive samples assigned a score >= thresholds[i]. The total + number of positive samples is equal to tps[-1] (thus false negatives + are given by tps[-1] - tps). + + thresholds : array, shape = [n_thresholds] + Decreasing score values. 
+ """ + pos_label = 1.0 + + y_true = np.ravel(y_true) + y_score = np.ravel(y_score) + + # make y_true a boolean vector + y_true = y_true == pos_label + + # sort scores and corresponding truth values + desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1] + y_score = y_score[desc_score_indices] + y_true = y_true[desc_score_indices] + weight = 1.0 + + # y_score typically has many tied values. Here we extract + # the indices associated with the distinct values. We also + # concatenate a value for the end of the curve. + distinct_value_indices = np.where(np.diff(y_score))[0] + threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1] + + # accumulate the true positives with decreasing threshold + tps = _stable_cumsum(y_true * weight)[threshold_idxs] + fps = 1 + threshold_idxs - tps + return fps, tps, y_score[threshold_idxs] + + +def _stable_cumsum(arr, axis=None, rtol=1e-05, atol=1e-08): + """Use high precision for cumsum and check that final value matches sum + + Parameters + ---------- + arr : array-like + To be cumulatively summed as flat + axis : int, optional + Axis along which the cumulative sum is computed. + The default (None) is to compute the cumsum over the flattened array. + rtol : float + Relative tolerance, see ``np.allclose`` + atol : float + Absolute tolerance, see ``np.allclose`` + """ + out = np.cumsum(arr, axis=axis, dtype=np.float64) + expected = np.sum(arr, axis=axis, dtype=np.float64) + if not np.all( + np.isclose( + out.take(-1, axis=axis), expected, rtol=rtol, atol=atol, equal_nan=True + ) + ): + raise ValueError(Errors.E163) + return out + + +def _auc(x, y): + """Compute Area Under the Curve (AUC) using the trapezoidal rule + + This is a general function, given points on a curve. For computing the + area under the ROC-curve, see :func:`roc_auc_score`. + + Parameters + ---------- + x : array, shape = [n] + x coordinates. These must be either monotonic increasing or monotonic + decreasing. 
+ y : array, shape = [n] + y coordinates. + + Returns + ------- + auc : float + """ + x = np.ravel(x) + y = np.ravel(y) + + direction = 1 + dx = np.diff(x) + if np.any(dx < 0): + if np.all(dx <= 0): + direction = -1 + else: + raise ValueError(Errors.E164.format(x)) + + area = direction * np.trapz(y, x) + if isinstance(area, np.memmap): + # Reductions such as .sum used internally in np.trapz do not return a + # scalar by default for numpy.memmap instances contrary to + # regular numpy.ndarray instances. + area = area.dtype.type(area) + return area diff --git a/spacy/strings.pyx b/spacy/strings.pyx index df86f8ac7..f3457e1a5 100644 --- a/spacy/strings.pyx +++ b/spacy/strings.pyx @@ -119,9 +119,7 @@ cdef class StringStore: return "" elif string_or_id in SYMBOLS_BY_STR: return SYMBOLS_BY_STR[string_or_id] - cdef hash_t key - if isinstance(string_or_id, unicode): key = hash_string(string_or_id) return key @@ -139,6 +137,20 @@ cdef class StringStore: else: return decode_Utf8Str(utf8str) + def as_int(self, key): + """If key is an int, return it; otherwise, get the int value.""" + if not isinstance(key, basestring): + return key + else: + return self[key] + + def as_string(self, key): + """If key is a string, return it; otherwise, get the string value.""" + if isinstance(key, basestring): + return key + else: + return self[key] + def add(self, string): """Add a string to the StringStore. 
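
Reviewer's aside on the `_roc_curve` / `_binary_clf_curve` / `_auc` helpers added to the scorer in this patch: the following is a minimal pure-Python sketch of the same computation, assuming binary labels encoded as 0 and 1 with 1 as the positive class. The name `roc_auc_sketch` is hypothetical and not part of the patch; the patch itself uses NumPy and raises `Errors.E165` rather than a plain message.

```python
def roc_auc_sketch(y_true, y_score):
    """Binary ROC AUC via the trapezoidal rule (illustrative only)."""
    if set(y_true) != {0, 1}:
        # The patch raises ValueError(Errors.E165) in the equivalent situation.
        raise ValueError("need both classes, labelled 0 and 1")
    # Sort by decreasing score, mirroring _binary_clf_curve.
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Start the curve at (0, 0), then emit one point per distinct threshold.
    points = [(0.0, 0.0)]
    tps = 0
    fps = 0
    for i, (score, label) in enumerate(pairs):
        tps += label
        fps += 1 - label
        last_of_tie = i == len(pairs) - 1 or pairs[i + 1][0] != score
        if last_of_tie:
            points.append((fps / float(n_neg), tps / float(n_pos)))
    # Trapezoidal rule over consecutive (fpr, tpr) points, as in _auc.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

For labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]` the sketch returns 0.75. Tied scores collapse into a single curve point, which is why the patch takes `np.diff(y_score)` to find distinct thresholds before accumulating counts.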
diff --git a/spacy/structs.pxd b/spacy/structs.pxd index 6c643b4cd..468277f6b 100644 --- a/spacy/structs.pxd +++ b/spacy/structs.pxd @@ -78,6 +78,54 @@ cdef struct TokenC: hash_t ent_id +cdef struct MorphAnalysisC: + univ_pos_t pos + int length + + attr_t abbr + attr_t adp_type + attr_t adv_type + attr_t animacy + attr_t aspect + attr_t case + attr_t conj_type + attr_t connegative + attr_t definite + attr_t degree + attr_t derivation + attr_t echo + attr_t foreign + attr_t gender + attr_t hyph + attr_t inf_form + attr_t mood + attr_t negative + attr_t number + attr_t name_type + attr_t noun_type + attr_t num_form + attr_t num_type + attr_t num_value + attr_t part_form + attr_t part_type + attr_t person + attr_t polite + attr_t polarity + attr_t poss + attr_t prefix + attr_t prep_case + attr_t pron_type + attr_t punct_side + attr_t punct_type + attr_t reflex + attr_t style + attr_t style_variant + attr_t tense + attr_t typo + attr_t verb_form + attr_t voice + attr_t verb_type + # Internal struct, for storage and disambiguation of entities. 
cdef struct KBEntryC: diff --git a/spacy/syntax/arc_eager.pyx b/spacy/syntax/arc_eager.pyx index eb39124ce..5a7355061 100644 --- a/spacy/syntax/arc_eager.pyx +++ b/spacy/syntax/arc_eager.pyx @@ -342,6 +342,7 @@ cdef class ArcEager(TransitionSystem): actions[RIGHT][label] = 1 actions[REDUCE][label] = 1 for raw_text, sents in kwargs.get('gold_parses', []): + _ = sents.pop() for (ids, words, tags, heads, labels, iob), ctnts in sents: heads, labels = nonproj.projectivize(heads, labels) for child, head, label in zip(ids, heads, labels): diff --git a/spacy/syntax/ner.pyx b/spacy/syntax/ner.pyx index 767e4c2e0..3bd096463 100644 --- a/spacy/syntax/ner.pyx +++ b/spacy/syntax/ner.pyx @@ -66,12 +66,14 @@ cdef class BiluoPushDown(TransitionSystem): UNIT: Counter(), OUT: Counter() } - actions[OUT][''] = 1 + actions[OUT][''] = 1 # Represents a token predicted to be outside of any entity + actions[UNIT][''] = 1 # Represents a token prohibited to be in an entity for entity_type in kwargs.get('entity_types', []): for action in (BEGIN, IN, LAST, UNIT): actions[action][entity_type] = 1 moves = ('M', 'B', 'I', 'L', 'U') for raw_text, sents in kwargs.get('gold_parses', []): + _ = sents.pop() for (ids, words, tags, heads, labels, biluo), _ in sents: for i, ner_tag in enumerate(biluo): if ner_tag != 'O' and ner_tag != '-': @@ -161,8 +163,7 @@ cdef class BiluoPushDown(TransitionSystem): for i in range(self.n_moves): if self.c[i].move == move and self.c[i].label == label: return self.c[i] - else: - raise KeyError(Errors.E022.format(name=name)) + raise KeyError(Errors.E022.format(name=name)) cdef Transition init_transition(self, int clas, int move, attr_t label) except *: # TODO: Apparent Cython bug here when we try to use the Transition() @@ -266,7 +267,7 @@ cdef class Begin: return False elif label == 0: return False - elif preset_ent_iob == 1 or preset_ent_iob == 2: + elif preset_ent_iob == 1: # Ensure we don't clobber preset entities. 
If no entity preset, # ent_iob is 0 return False @@ -282,8 +283,8 @@ cdef class Begin: # Otherwise, force acceptance, even if we're across a sentence # boundary or the token is whitespace. return True - elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3: - # If the next word is B or O, we can't B now + elif st.B_(1).ent_iob == 3: + # If the next word is B, we can't B now return False elif st.B_(1).sent_start == 1: # Don't allow entities to extend across sentence boundaries @@ -326,6 +327,7 @@ cdef class In: @staticmethod cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef int preset_ent_iob = st.B_(0).ent_iob + cdef attr_t preset_ent_label = st.B_(0).ent_type if label == 0: return False elif st.E_(0).ent_type != label: @@ -335,13 +337,22 @@ cdef class In: elif st.B(1) == -1: # If we're at the end, we can't I. return False - elif preset_ent_iob == 2: - return False elif preset_ent_iob == 3: return False - elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3: - # If we know the next word is B or O, we can't be I (must be L) + elif st.B_(1).ent_iob == 3: + # If we know the next word is B, we can't be I (must be L) return False + elif preset_ent_iob == 1: + if st.B_(1).ent_iob in (0, 2): + # if next preset is missing or O, this can't be I (must be L) + return False + elif label != preset_ent_label: + # If label isn't right, reject + return False + else: + # Otherwise, force acceptance, even if we're across a sentence + # boundary or the token is whitespace. 
+ return True elif st.B(1) != -1 and st.B_(1).sent_start == 1: # Don't allow entities to extend across sentence boundaries return False @@ -387,17 +398,24 @@ cdef class In: else: return 1 - cdef class Last: @staticmethod cdef bint is_valid(const StateC* st, attr_t label) nogil: + cdef int preset_ent_iob = st.B_(0).ent_iob + cdef attr_t preset_ent_label = st.B_(0).ent_type if label == 0: return False elif not st.entity_is_open(): return False - elif st.B_(0).ent_iob == 1 and st.B_(1).ent_iob != 1: + elif preset_ent_iob == 1 and st.B_(1).ent_iob != 1: # If a preset entity has I followed by not-I, is L - return True + if label != preset_ent_label: + # If label isn't right, reject + return False + else: + # Otherwise, force acceptance, even if we're across a sentence + # boundary or the token is whitespace. + return True elif st.E_(0).ent_type != label: return False elif st.B_(1).ent_iob == 1: @@ -450,12 +468,13 @@ cdef class Unit: cdef int preset_ent_iob = st.B_(0).ent_iob cdef attr_t preset_ent_label = st.B_(0).ent_type if label == 0: - return False + # this is only allowed if it's a preset blocked annotation + if preset_ent_label == 0 and preset_ent_iob == 3: + return True + else: + return False elif st.entity_is_open(): return False - elif preset_ent_iob == 2: - # Don't clobber preset O - return False elif st.B_(1).ent_iob == 1: # If next token is In, we can't be Unit -- must be Begin return False diff --git a/spacy/syntax/nn_parser.pyx b/spacy/syntax/nn_parser.pyx index 85f7b5bb9..aeb4a5306 100644 --- a/spacy/syntax/nn_parser.pyx +++ b/spacy/syntax/nn_parser.pyx @@ -135,7 +135,9 @@ cdef class Parser: names = [] for i in range(self.moves.n_moves): name = self.moves.move_name(self.moves.c[i].move, self.moves.c[i].label) - names.append(name) + # Explicitly removing the internal "U-" token used for blocking entities + if name != "U-": + names.append(name) return names nr_feature = 8 @@ -161,10 +163,16 @@ cdef class Parser: added = self.moves.add_action(action, label) 
if added: resized = True - if resized and "nr_class" in self.cfg: + if resized: + self._resize() + + def _resize(self): + if "nr_class" in self.cfg: self.cfg["nr_class"] = self.moves.n_moves - if self.model not in (True, False, None) and resized: + if self.model not in (True, False, None): self.model.resize_output(self.moves.n_moves) + if self._rehearsal_model not in (True, False, None): + self._rehearsal_model.resize_output(self.moves.n_moves) def add_multitask_objective(self, target): # Defined in subclasses, to avoid circular import @@ -235,7 +243,9 @@ cdef class Parser: if isinstance(docs, Doc): docs = [docs] if not any(len(doc) for doc in docs): - return self.moves.init_batch(docs) + result = self.moves.init_batch(docs) + self._resize() + return result if beam_width < 2: return self.greedy_parse(docs, drop=drop) else: @@ -249,7 +259,7 @@ cdef class Parser: # This is pretty dirty, but the NER can resize itself in init_batch, # if labels are missing. We therefore have to check whether we need to # expand our model output. - self.model.resize_output(self.moves.n_moves) + self._resize() model = self.model(docs) weights = get_c_weights(model) for state in batch: @@ -269,7 +279,7 @@ cdef class Parser: # This is pretty dirty, but the NER can resize itself in init_batch, # if labels are missing. We therefore have to check whether we need to # expand our model output. - self.model.resize_output(self.moves.n_moves) + self._resize() model = self.model(docs) token_ids = numpy.zeros((len(docs) * beam_width, self.nr_feature), dtype='i', order='C') @@ -443,8 +453,7 @@ cdef class Parser: # This is pretty dirty, but the NER can resize itself in init_batch, # if labels are missing. We therefore have to check whether we need to # expand our model output. 
- self.model.resize_output(self.moves.n_moves) - self._rehearsal_model.resize_output(self.moves.n_moves) + self._resize() # Prepare the stepwise model, and get the callback for finishing the batch tutor, _ = self._rehearsal_model.begin_update(docs, drop=0.0) model, finish_update = self.model.begin_update(docs, drop=0.0) @@ -585,6 +594,7 @@ cdef class Parser: doc_sample = [] gold_sample = [] for raw_text, annots_brackets in islice(get_gold_tuples(), 1000): + _ = annots_brackets.pop() for annots, brackets in annots_brackets: ids, words, tags, heads, deps, ents = annots doc_sample.append(Doc(self.vocab, words=words)) diff --git a/spacy/syntax/transition_system.pyx b/spacy/syntax/transition_system.pyx index 523cd6699..58b3a6993 100644 --- a/spacy/syntax/transition_system.pyx +++ b/spacy/syntax/transition_system.pyx @@ -63,6 +63,13 @@ cdef class TransitionSystem: cdef Doc doc beams = [] cdef int offset = 0 + + # Doc objects might contain labels that we need to register actions for. We need to check for that + # *before* we create any Beam objects, because the Beam object needs the correct number of + # actions. It's sort of dumb, but the best way is to just call init_batch() -- that triggers the additions, + # and it doesn't matter that we create and discard the state objects. 
+ self.init_batch(docs) + for doc in docs: beam = Beam(self.n_moves, beam_width, min_density=beam_density) beam.initialize(self.init_beam_state, doc.length, doc.c) @@ -96,8 +103,7 @@ cdef class TransitionSystem: def apply_transition(self, StateClass state, name): if not self.is_valid(state, name): - raise ValueError( - "Cannot apply transition {name}: invalid for the current state.".format(name=name)) + raise ValueError(Errors.E170.format(name=name)) action = self.lookup_transition(name) action.do(state.c, action.label) diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index c88f3314e..0763af32b 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -185,6 +185,12 @@ def ru_tokenizer(): return get_lang_class("ru").Defaults.create_tokenizer() +@pytest.fixture +def ru_lemmatizer(): + pytest.importorskip("pymorphy2") + return get_lang_class("ru").Defaults.create_lemmatizer() + + @pytest.fixture(scope="session") def sr_tokenizer(): return get_lang_class("sr").Defaults.create_tokenizer() diff --git a/spacy/tests/doc/test_add_entities.py b/spacy/tests/doc/test_add_entities.py index 433541c48..6c69e699a 100644 --- a/spacy/tests/doc/test_add_entities.py +++ b/spacy/tests/doc/test_add_entities.py @@ -1,12 +1,12 @@ # coding: utf-8 from __future__ import unicode_literals -from ...pipeline import EntityRecognizer -from ..util import get_doc -from ...tokens import Span - +from spacy.pipeline import EntityRecognizer +from spacy.tokens import Span import pytest +from ..util import get_doc + def test_doc_add_entities_set_ents_iob(en_vocab): text = ["This", "is", "a", "lion"] @@ -16,10 +16,23 @@ def test_doc_add_entities_set_ents_iob(en_vocab): ner(doc) assert len(list(doc.ents)) == 0 assert [w.ent_iob_ for w in doc] == (["O"] * len(doc)) + doc.ents = [(doc.vocab.strings["ANIMAL"], 3, 4)] - assert [w.ent_iob_ for w in doc] == ["", "", "", "B"] + assert [w.ent_iob_ for w in doc] == ["O", "O", "O", "B"] + doc.ents = [(doc.vocab.strings["WORD"], 0, 2)] - 
assert [w.ent_iob_ for w in doc] == ["B", "I", "", ""] + assert [w.ent_iob_ for w in doc] == ["B", "I", "O", "O"] + + +def test_ents_reset(en_vocab): + text = ["This", "is", "a", "lion"] + doc = get_doc(en_vocab, text) + ner = EntityRecognizer(en_vocab) + ner.begin_training([]) + ner(doc) + assert [t.ent_iob_ for t in doc] == (["O"] * len(doc)) + doc.ents = list(doc.ents) + assert [t.ent_iob_ for t in doc] == (["O"] * len(doc)) def test_add_overlapping_entities(en_vocab): diff --git a/spacy/tests/doc/test_creation.py b/spacy/tests/doc/test_creation.py index ce42b39b9..b222f6bf0 100644 --- a/spacy/tests/doc/test_creation.py +++ b/spacy/tests/doc/test_creation.py @@ -5,11 +5,13 @@ import pytest from spacy.vocab import Vocab from spacy.tokens import Doc from spacy.lemmatizer import Lemmatizer +from spacy.lookups import Table @pytest.fixture def lemmatizer(): - return Lemmatizer(lookup={"dogs": "dog", "boxen": "box", "mice": "mouse"}) + lookup = Table(data={"dogs": "dog", "boxen": "box", "mice": "mouse"}) + return Lemmatizer(lookup=lookup) @pytest.fixture diff --git a/spacy/tests/doc/test_morphanalysis.py b/spacy/tests/doc/test_morphanalysis.py new file mode 100644 index 000000000..5d570af53 --- /dev/null +++ b/spacy/tests/doc/test_morphanalysis.py @@ -0,0 +1,33 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +@pytest.fixture +def i_has(en_tokenizer): + doc = en_tokenizer("I has") + doc[0].tag_ = "PRP" + doc[1].tag_ = "VBZ" + return doc + + +def test_token_morph_id(i_has): + assert i_has[0].morph.id + assert i_has[1].morph.id != 0 + assert i_has[0].morph.id != i_has[1].morph.id + + +def test_morph_props(i_has): + assert i_has[0].morph.pron_type == i_has.vocab.strings["PronType_prs"] + assert i_has[0].morph.pron_type_ == "PronType_prs" + assert i_has[1].morph.pron_type == 0 + + +def test_morph_iter(i_has): + assert list(i_has[0].morph) == ["PronType_prs"] + assert list(i_has[1].morph) == ["Number_sing", "Person_three", "VerbForm_fin"] + 
+ + +def test_morph_get(i_has): + assert i_has[0].morph.get("pron_type") == "PronType_prs" diff --git a/spacy/tests/lang/ja/test_tokenizer.py b/spacy/tests/lang/ja/test_tokenizer.py index c95e7bc40..ad8bfaa00 100644 --- a/spacy/tests/lang/ja/test_tokenizer.py +++ b/spacy/tests/lang/ja/test_tokenizer.py @@ -47,3 +47,10 @@ def test_ja_tokenizer_tags(ja_tokenizer, text, expected_tags): def test_ja_tokenizer_pos(ja_tokenizer, text, expected_pos): pos = [token.pos_ for token in ja_tokenizer(text)] assert pos == expected_pos + + +def test_extra_spaces(ja_tokenizer): + # note: three spaces after "I" + tokens = ja_tokenizer("I   like cheese.") + assert tokens[1].orth_ == " " + assert tokens[2].orth_ == " " diff --git a/spacy/tests/lang/lt/test_lemmatizer.py b/spacy/tests/lang/lt/test_lemmatizer.py index 9b2969849..f7408fc16 100644 --- a/spacy/tests/lang/lt/test_lemmatizer.py +++ b/spacy/tests/lang/lt/test_lemmatizer.py @@ -17,4 +17,4 @@ TEST_CASES = [ @pytest.mark.parametrize("tokens,lemmas", TEST_CASES) def test_lt_lemmatizer(lt_lemmatizer, tokens, lemmas): - assert lemmas == [lt_lemmatizer.lookup(token) for token in tokens] + assert lemmas == [lt_lemmatizer.lookup_table.get(token, token) for token in tokens] diff --git a/spacy/tests/lang/ru/test_lemmatizer.py b/spacy/tests/lang/ru/test_lemmatizer.py index b92dfa29c..b228fded8 100644 --- a/spacy/tests/lang/ru/test_lemmatizer.py +++ b/spacy/tests/lang/ru/test_lemmatizer.py @@ -2,17 +2,10 @@ from __future__ import unicode_literals import pytest -from spacy.lang.ru import Russian from ...util import get_doc -@pytest.fixture -def ru_lemmatizer(): - pytest.importorskip("pymorphy2") - return Russian.Defaults.create_lemmatizer() - - def test_ru_doc_lemmatization(ru_tokenizer): words = ["мама", "мыла", "раму"] tags = [ diff --git a/spacy/tests/lang/sr/test_еxceptions.py b/spacy/tests/lang/sr/test_exceptions.py similarity index 100% rename from spacy/tests/lang/sr/test_еxceptions.py rename to spacy/tests/lang/sr/test_exceptions.py
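
Reviewer's aside on the `lt` lemmatizer test above, which now reads lemmas through `lookup_table.get(token, token)` instead of `lookup(token)`: the sketch below shows the fallback behaviour that assertion depends on. A plain `dict` stands in for the lemmatizer's lookup table, which is an assumption made so the snippet runs without spaCy installed; `lookup_lemmas` is a hypothetical helper, not part of the patch.

```python
def lookup_lemmas(table, tokens):
    """Return the stored lemma for each token, or the token itself if none is stored."""
    return [table.get(token, token) for token in tokens]


# Stand-in for the lookup table; the entries echo the fixture data used in
# spacy/tests/doc/test_creation.py.
table = {"dogs": "dog", "boxen": "box", "mice": "mouse"}
lemmas = lookup_lemmas(table, ["dogs", "mice", "cheese"])
```

Passing the token as the default to `get` keeps unknown words unchanged rather than raising `KeyError`, so `"cheese"` comes back as `"cheese"`.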
diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index df35a1be2..0d640e1a2 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -410,3 +410,11 @@ def test_matcher_schema_token_attributes(en_vocab, pattern, text): assert len(matcher) == 1 matches = matcher(doc) assert len(matches) == 1 + + +def test_matcher_valid_callback(en_vocab): + """Test that on_match can only be None or callable.""" + matcher = Matcher(en_vocab) + with pytest.raises(ValueError): + matcher.add("TEST", [], [{"TEXT": "test"}]) + matcher(Doc(en_vocab, words=["test"])) diff --git a/spacy/tests/matcher/test_phrase_matcher.py b/spacy/tests/matcher/test_phrase_matcher.py index b82f9a058..486cbb984 100644 --- a/spacy/tests/matcher/test_phrase_matcher.py +++ b/spacy/tests/matcher/test_phrase_matcher.py @@ -8,10 +8,31 @@ from ..util import get_doc def test_matcher_phrase_matcher(en_vocab): - doc = Doc(en_vocab, words=["Google", "Now"]) - matcher = PhraseMatcher(en_vocab) - matcher.add("COMPANY", None, doc) doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) + # intermediate phrase + pattern = Doc(en_vocab, words=["Google", "Now"]) + matcher = PhraseMatcher(en_vocab) + matcher.add("COMPANY", None, pattern) + assert len(matcher(doc)) == 1 + # initial token + pattern = Doc(en_vocab, words=["I"]) + matcher = PhraseMatcher(en_vocab) + matcher.add("I", None, pattern) + assert len(matcher(doc)) == 1 + # initial phrase + pattern = Doc(en_vocab, words=["I", "like"]) + matcher = PhraseMatcher(en_vocab) + matcher.add("ILIKE", None, pattern) + assert len(matcher(doc)) == 1 + # final token + pattern = Doc(en_vocab, words=["best"]) + matcher = PhraseMatcher(en_vocab) + matcher.add("BEST", None, pattern) + assert len(matcher(doc)) == 1 + # final phrase + pattern = Doc(en_vocab, words=["Now", "best"]) + matcher = PhraseMatcher(en_vocab) + matcher.add("NOWBEST", None, pattern) assert len(matcher(doc)) == 1 @@ 
-31,6 +52,68 @@ def test_phrase_matcher_contains(en_vocab): assert "TEST2" not in matcher +def test_phrase_matcher_repeated_add(en_vocab): + matcher = PhraseMatcher(en_vocab) + # match ID only gets added once + matcher.add("TEST", None, Doc(en_vocab, words=["like"])) + matcher.add("TEST", None, Doc(en_vocab, words=["like"])) + matcher.add("TEST", None, Doc(en_vocab, words=["like"])) + matcher.add("TEST", None, Doc(en_vocab, words=["like"])) + doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) + assert "TEST" in matcher + assert "TEST2" not in matcher + assert len(matcher(doc)) == 1 + + +def test_phrase_matcher_remove(en_vocab): + matcher = PhraseMatcher(en_vocab) + matcher.add("TEST1", None, Doc(en_vocab, words=["like"])) + matcher.add("TEST2", None, Doc(en_vocab, words=["best"])) + doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) + assert "TEST1" in matcher + assert "TEST2" in matcher + assert "TEST3" not in matcher + assert len(matcher(doc)) == 2 + matcher.remove("TEST1") + assert "TEST1" not in matcher + assert "TEST2" in matcher + assert "TEST3" not in matcher + assert len(matcher(doc)) == 1 + matcher.remove("TEST2") + assert "TEST1" not in matcher + assert "TEST2" not in matcher + assert "TEST3" not in matcher + assert len(matcher(doc)) == 0 + with pytest.raises(KeyError): + matcher.remove("TEST3") + assert "TEST1" not in matcher + assert "TEST2" not in matcher + assert "TEST3" not in matcher + assert len(matcher(doc)) == 0 + + +def test_phrase_matcher_overlapping_with_remove(en_vocab): + matcher = PhraseMatcher(en_vocab) + matcher.add("TEST", None, Doc(en_vocab, words=["like"])) + # TEST2 is added alongside TEST + matcher.add("TEST2", None, Doc(en_vocab, words=["like"])) + doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) + assert "TEST" in matcher + assert len(matcher) == 2 + assert len(matcher(doc)) == 2 + # removing TEST does not remove the entry for TEST2 + matcher.remove("TEST") + assert "TEST" not in 
matcher + assert len(matcher) == 1 + assert len(matcher(doc)) == 1 + assert matcher(doc)[0][0] == en_vocab.strings["TEST2"] + # removing TEST2 removes all + matcher.remove("TEST2") + assert "TEST2" not in matcher + assert len(matcher) == 0 + assert len(matcher(doc)) == 0 + + def test_phrase_matcher_string_attrs(en_vocab): words1 = ["I", "like", "cats"] pos1 = ["PRON", "VERB", "NOUN"] diff --git a/spacy/tests/morphology/__init__.py b/spacy/tests/morphology/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/morphology/test_morph_features.py b/spacy/tests/morphology/test_morph_features.py new file mode 100644 index 000000000..4b8f0d754 --- /dev/null +++ b/spacy/tests/morphology/test_morph_features.py @@ -0,0 +1,48 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.morphology import Morphology +from spacy.strings import StringStore, get_string_id +from spacy.lemmatizer import Lemmatizer + + +@pytest.fixture +def morphology(): + return Morphology(StringStore(), {}, Lemmatizer()) + + +def test_init(morphology): + pass + + +def test_add_morphology_with_string_names(morphology): + morphology.add({"Case_gen", "Number_sing"}) + + +def test_add_morphology_with_int_ids(morphology): + morphology.add({get_string_id("Case_gen"), get_string_id("Number_sing")}) + + +def test_add_morphology_with_mix_strings_and_ints(morphology): + morphology.add({get_string_id("PunctSide_ini"), "VerbType_aux"}) + + +def test_morphology_tags_hash_distinctly(morphology): + tag1 = morphology.add({"PunctSide_ini", "VerbType_aux"}) + tag2 = morphology.add({"Case_gen", "Number_sing"}) + assert tag1 != tag2 + + +def test_morphology_tags_hash_independent_of_order(morphology): + tag1 = morphology.add({"Case_gen", "Number_sing"}) + tag2 = morphology.add({"Number_sing", "Case_gen"}) + assert tag1 == tag2 + + +def test_update_morphology_tag(morphology): + tag1 = morphology.add({"Case_gen"}) + tag2 = morphology.update(tag1, {"Number_sing"}) 
+ assert tag1 != tag2 + tag3 = morphology.add({"Number_sing", "Case_gen"}) + assert tag2 == tag3 diff --git a/spacy/tests/parser/test_ner.py b/spacy/tests/parser/test_ner.py index 43c00a963..4dc7542ed 100644 --- a/spacy/tests/parser/test_ner.py +++ b/spacy/tests/parser/test_ner.py @@ -2,7 +2,9 @@ from __future__ import unicode_literals import pytest -from spacy.pipeline import EntityRecognizer +from spacy.lang.en import English + +from spacy.pipeline import EntityRecognizer, EntityRuler from spacy.vocab import Vocab from spacy.syntax.ner import BiluoPushDown from spacy.gold import GoldParse @@ -80,14 +82,190 @@ def test_get_oracle_moves_negative_O(tsys, vocab): assert names -def test_doc_add_entities_set_ents_iob(en_vocab): - doc = Doc(en_vocab, words=["This", "is", "a", "lion"]) - ner = EntityRecognizer(en_vocab) - ner.begin_training([]) - ner(doc) - assert len(list(doc.ents)) == 0 - assert [w.ent_iob_ for w in doc] == (["O"] * len(doc)) - doc.ents = [(doc.vocab.strings["ANIMAL"], 3, 4)] - assert [w.ent_iob_ for w in doc] == ["", "", "", "B"] - doc.ents = [(doc.vocab.strings["WORD"], 0, 2)] - assert [w.ent_iob_ for w in doc] == ["B", "I", "", ""] +def test_oracle_moves_missing_B(en_vocab): + words = ["B", "52", "Bomber"] + biluo_tags = [None, None, "L-PRODUCT"] + + doc = Doc(en_vocab, words=words) + gold = GoldParse(doc, words=words, entities=biluo_tags) + + moves = BiluoPushDown(en_vocab.strings) + move_types = ("M", "B", "I", "L", "U", "O") + for tag in biluo_tags: + if tag is None: + continue + elif tag == "O": + moves.add_action(move_types.index("O"), "") + else: + action, label = tag.split("-") + moves.add_action(move_types.index("B"), label) + moves.add_action(move_types.index("I"), label) + moves.add_action(move_types.index("L"), label) + moves.add_action(move_types.index("U"), label) + moves.preprocess_gold(gold) + seq = moves.get_oracle_sequence(doc, gold) + + +def test_oracle_moves_whitespace(en_vocab): + words = ["production", "\n", "of", "Northrop", 
"\n", "Corp.", "\n", "'s", "radar"] + biluo_tags = ["O", "O", "O", "B-ORG", None, "I-ORG", "L-ORG", "O", "O"] + + doc = Doc(en_vocab, words=words) + gold = GoldParse(doc, words=words, entities=biluo_tags) + + moves = BiluoPushDown(en_vocab.strings) + move_types = ("M", "B", "I", "L", "U", "O") + for tag in biluo_tags: + if tag is None: + continue + elif tag == "O": + moves.add_action(move_types.index("O"), "") + else: + action, label = tag.split("-") + moves.add_action(move_types.index(action), label) + moves.preprocess_gold(gold) + moves.get_oracle_sequence(doc, gold) + + +def test_accept_blocked_token(): + """Test succesful blocking of tokens to be in an entity.""" + # 1. test normal behaviour + nlp1 = English() + doc1 = nlp1("I live in New York") + ner1 = EntityRecognizer(doc1.vocab) + assert [token.ent_iob_ for token in doc1] == ["", "", "", "", ""] + assert [token.ent_type_ for token in doc1] == ["", "", "", "", ""] + + # Add the OUT action + ner1.moves.add_action(5, "") + ner1.add_label("GPE") + # Get into the state just before "New" + state1 = ner1.moves.init_batch([doc1])[0] + ner1.moves.apply_transition(state1, "O") + ner1.moves.apply_transition(state1, "O") + ner1.moves.apply_transition(state1, "O") + # Check that B-GPE is valid. + assert ner1.moves.is_valid(state1, "B-GPE") + + # 2. test blocking behaviour + nlp2 = English() + doc2 = nlp2("I live in New York") + ner2 = EntityRecognizer(doc2.vocab) + + # set "New York" to a blocked entity + doc2.ents = [(0, 3, 5)] + assert [token.ent_iob_ for token in doc2] == ["", "", "", "B", "B"] + assert [token.ent_type_ for token in doc2] == ["", "", "", "", ""] + + # Check that B-GPE is now invalid. 
+ ner2.moves.add_action(4, "") + ner2.moves.add_action(5, "") + ner2.add_label("GPE") + state2 = ner2.moves.init_batch([doc2])[0] + ner2.moves.apply_transition(state2, "O") + ner2.moves.apply_transition(state2, "O") + ner2.moves.apply_transition(state2, "O") + # we can only use U- for "New" + assert not ner2.moves.is_valid(state2, "B-GPE") + assert ner2.moves.is_valid(state2, "U-") + ner2.moves.apply_transition(state2, "U-") + # we can only use U- for "York" + assert not ner2.moves.is_valid(state2, "B-GPE") + assert ner2.moves.is_valid(state2, "U-") + + +def test_overwrite_token(): + nlp = English() + ner1 = nlp.create_pipe("ner") + nlp.add_pipe(ner1, name="ner") + nlp.begin_training() + + # The untrained NER will predict O for each token + doc = nlp("I live in New York") + assert [token.ent_iob_ for token in doc] == ["O", "O", "O", "O", "O"] + assert [token.ent_type_ for token in doc] == ["", "", "", "", ""] + + # Check that a new ner can overwrite O + ner2 = EntityRecognizer(doc.vocab) + ner2.moves.add_action(5, "") + ner2.add_label("GPE") + state = ner2.moves.init_batch([doc])[0] + assert ner2.moves.is_valid(state, "B-GPE") + assert ner2.moves.is_valid(state, "U-GPE") + ner2.moves.apply_transition(state, "B-GPE") + assert ner2.moves.is_valid(state, "I-GPE") + assert ner2.moves.is_valid(state, "L-GPE") + + +def test_ruler_before_ner(): + """ Test that an NER works after an entity_ruler: the second can add annotations """ + nlp = English() + + # 1 : Entity Ruler - should set "this" to B and everything else to empty + ruler = EntityRuler(nlp) + patterns = [{"label": "THING", "pattern": "This"}] + ruler.add_patterns(patterns) + nlp.add_pipe(ruler) + + # 2: untrained NER - should set everything else to O + untrained_ner = nlp.create_pipe("ner") + untrained_ner.add_label("MY_LABEL") + nlp.add_pipe(untrained_ner) + nlp.begin_training() + + doc = nlp("This is Antti Korhonen speaking in Finland") + expected_iobs = ["B", "O", "O", "O", "O", "O", "O"] + expected_types = 
["THING", "", "", "", "", "", ""] + assert [token.ent_iob_ for token in doc] == expected_iobs + assert [token.ent_type_ for token in doc] == expected_types + + +def test_ner_before_ruler(): + """ Test that an entity_ruler works after an NER: the second can overwrite O annotations """ + nlp = English() + + # 1: untrained NER - should set everything to O + untrained_ner = nlp.create_pipe("ner") + untrained_ner.add_label("MY_LABEL") + nlp.add_pipe(untrained_ner, name="uner") + nlp.begin_training() + + # 2 : Entity Ruler - should set "this" to B and keep everything else O + ruler = EntityRuler(nlp) + patterns = [{"label": "THING", "pattern": "This"}] + ruler.add_patterns(patterns) + nlp.add_pipe(ruler) + + doc = nlp("This is Antti Korhonen speaking in Finland") + expected_iobs = ["B", "O", "O", "O", "O", "O", "O"] + expected_types = ["THING", "", "", "", "", "", ""] + assert [token.ent_iob_ for token in doc] == expected_iobs + assert [token.ent_type_ for token in doc] == expected_types + + +def test_block_ner(): + """ Test functionality for blocking tokens so they can't be in a named entity """ + # block "Antti L Korhonen" from being a named entity + nlp = English() + nlp.add_pipe(BlockerComponent1(2, 5)) + untrained_ner = nlp.create_pipe("ner") + untrained_ner.add_label("MY_LABEL") + nlp.add_pipe(untrained_ner, name="uner") + nlp.begin_training() + doc = nlp("This is Antti L Korhonen speaking in Finland") + expected_iobs = ["O", "O", "B", "B", "B", "O", "O", "O"] + expected_types = ["", "", "", "", "", "", "", ""] + assert [token.ent_iob_ for token in doc] == expected_iobs + assert [token.ent_type_ for token in doc] == expected_types + + +class BlockerComponent1(object): + name = "my_blocker" + + def __init__(self, start, end): + self.start = start + self.end = end + + def __call__(self, doc): + doc.ents = [(0, self.start, self.end)] + return doc diff --git a/spacy/tests/regression/test_issue1-1000.py b/spacy/tests/regression/test_issue1-1000.py index 
febf2b5b3..b3f347765 100644 --- a/spacy/tests/regression/test_issue1-1000.py +++ b/spacy/tests/regression/test_issue1-1000.py @@ -426,7 +426,7 @@ def test_issue957(en_tokenizer): def test_issue999(train_data): """Test that adding entities and resuming training works passably OK. There are two issues here: - 1) We have to readd labels. This isn't very nice. + 1) We have to read labels. This isn't very nice. 2) There's no way to set the learning rate for the weight update, so we end up out-of-scale, causing it to learn too fast. """ diff --git a/spacy/tests/regression/test_issue1501-2000.py b/spacy/tests/regression/test_issue1501-2000.py index 24f725ab8..520090bb4 100644 --- a/spacy/tests/regression/test_issue1501-2000.py +++ b/spacy/tests/regression/test_issue1501-2000.py @@ -187,7 +187,7 @@ def test_issue1799(): def test_issue1807(): """Test vocab.set_vector also adds the word to the vocab.""" - vocab = Vocab() + vocab = Vocab(vectors_name="test_issue1807") assert "hello" not in vocab vocab.set_vector("hello", numpy.ones((50,), dtype="f")) assert "hello" in vocab diff --git a/spacy/tests/regression/test_issue2501-3000.py b/spacy/tests/regression/test_issue2501-3000.py index cf29c2535..a0b1e2aac 100644 --- a/spacy/tests/regression/test_issue2501-3000.py +++ b/spacy/tests/regression/test_issue2501-3000.py @@ -184,7 +184,7 @@ def test_issue2833(en_vocab): def test_issue2871(): """Test that vectors recover the correct key for spaCy reserved words.""" words = ["dog", "cat", "SUFFIX"] - vocab = Vocab() + vocab = Vocab(vectors_name="test_issue2871") vocab.vectors.resize(shape=(3, 10)) vector_data = numpy.zeros((3, 10), dtype="f") for word in words: diff --git a/spacy/tests/regression/test_issue3001-3500.py b/spacy/tests/regression/test_issue3001-3500.py index 3b0c2f1ed..c430678d3 100644 --- a/spacy/tests/regression/test_issue3001-3500.py +++ b/spacy/tests/regression/test_issue3001-3500.py @@ -30,20 +30,20 @@ def test_issue3002(): def test_issue3009(en_vocab): """Test 
problem with matcher quantifiers""" patterns = [ - [{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"POS": "ADP"}], + [{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"TAG": "IN"}], [ {"LEMMA": "have"}, {"IS_ASCII": True, "IS_PUNCT": False, "OP": "*"}, {"LOWER": "to"}, {"LOWER": "do"}, - {"POS": "ADP"}, + {"TAG": "IN"}, ], [ {"LEMMA": "have"}, {"IS_ASCII": True, "IS_PUNCT": False, "OP": "?"}, {"LOWER": "to"}, {"LOWER": "do"}, - {"POS": "ADP"}, + {"TAG": "IN"}, ], ] words = ["also", "has", "to", "do", "with"] diff --git a/spacy/tests/regression/test_issue4042.py b/spacy/tests/regression/test_issue4042.py new file mode 100644 index 000000000..500be9f2a --- /dev/null +++ b/spacy/tests/regression/test_issue4042.py @@ -0,0 +1,82 @@ +# coding: utf8 +from __future__ import unicode_literals + +import spacy +from spacy.pipeline import EntityRecognizer, EntityRuler +from spacy.lang.en import English +from spacy.tokens import Span +from spacy.util import ensure_path + +from ..util import make_tempdir + + +def test_issue4042(): + """Test that serialization of an EntityRuler before NER works fine.""" + nlp = English() + + # add ner pipe + ner = nlp.create_pipe("ner") + ner.add_label("SOME_LABEL") + nlp.add_pipe(ner) + nlp.begin_training() + + # Add entity ruler + ruler = EntityRuler(nlp) + patterns = [ + {"label": "MY_ORG", "pattern": "Apple"}, + {"label": "MY_GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]}, + ] + ruler.add_patterns(patterns) + nlp.add_pipe(ruler, before="ner") # works fine with "after" + doc1 = nlp("What do you think about Apple ?") + assert doc1.ents[0].label_ == "MY_ORG" + + with make_tempdir() as d: + output_dir = ensure_path(d) + if not output_dir.exists(): + output_dir.mkdir() + nlp.to_disk(output_dir) + + nlp2 = spacy.load(output_dir) + doc2 = nlp2("What do you think about Apple ?") + assert doc2.ents[0].label_ == "MY_ORG" + + +def test_issue4042_bug2(): + """ + Test that serialization of an NER works fine when new labels were 
added. + This is the second bug of two bugs underlying the issue 4042. + """ + nlp1 = English() + vocab = nlp1.vocab + + # add ner pipe + ner1 = nlp1.create_pipe("ner") + ner1.add_label("SOME_LABEL") + nlp1.add_pipe(ner1) + nlp1.begin_training() + + # add a new label to the doc + doc1 = nlp1("What do you think about Apple ?") + assert len(ner1.labels) == 1 + assert "SOME_LABEL" in ner1.labels + apple_ent = Span(doc1, 5, 6, label="MY_ORG") + doc1.ents = list(doc1.ents) + [apple_ent] + + # reapply the NER - at this point it should resize itself + ner1(doc1) + assert len(ner1.labels) == 2 + assert "SOME_LABEL" in ner1.labels + assert "MY_ORG" in ner1.labels + + with make_tempdir() as d: + # assert IO goes fine + output_dir = ensure_path(d) + if not output_dir.exists(): + output_dir.mkdir() + ner1.to_disk(output_dir) + + nlp2 = English(vocab) + ner2 = EntityRecognizer(vocab) + ner2.from_disk(output_dir) + assert len(ner2.labels) == 2 diff --git a/spacy/tests/regression/test_issue4054.py b/spacy/tests/regression/test_issue4054.py index 2c9d73751..cc84cebf8 100644 --- a/spacy/tests/regression/test_issue4054.py +++ b/spacy/tests/regression/test_issue4054.py @@ -2,12 +2,12 @@ from __future__ import unicode_literals from spacy.vocab import Vocab - import spacy from spacy.lang.en import English -from spacy.tests.util import make_tempdir from spacy.util import ensure_path +from ..util import make_tempdir + def test_issue4054(en_vocab): """Test that a new blank model can be made with a vocab from file, diff --git a/spacy/tests/regression/test_issue4267.py b/spacy/tests/regression/test_issue4267.py new file mode 100644 index 000000000..5fc61e142 --- /dev/null +++ b/spacy/tests/regression/test_issue4267.py @@ -0,0 +1,42 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest + +import spacy + +from spacy.lang.en import English +from spacy.pipeline import EntityRuler +from spacy.tokens import Span + + +def test_issue4267(): + """ Test that running an 
entity_ruler after ner gives consistent results""" + nlp = English() + ner = nlp.create_pipe("ner") + ner.add_label("PEOPLE") + nlp.add_pipe(ner) + nlp.begin_training() + + assert "ner" in nlp.pipe_names + + # assert that we have correct IOB annotations + doc1 = nlp("hi") + assert doc1.is_nered + for token in doc1: + assert token.ent_iob == 2 + + # add entity ruler and run again + ruler = EntityRuler(nlp) + patterns = [{"label": "SOFTWARE", "pattern": "spacy"}] + + ruler.add_patterns(patterns) + nlp.add_pipe(ruler) + assert "entity_ruler" in nlp.pipe_names + assert "ner" in nlp.pipe_names + + # assert that we still have correct IOB annotations + doc2 = nlp("hi") + assert doc2.is_nered + for token in doc2: + assert token.ent_iob == 2 diff --git a/spacy/tests/regression/test_issue4278.py b/spacy/tests/regression/test_issue4278.py index 4c85d15c4..cb09340ff 100644 --- a/spacy/tests/regression/test_issue4278.py +++ b/spacy/tests/regression/test_issue4278.py @@ -13,7 +13,7 @@ class DummyPipe(Pipe): def predict(self, docs): return ([1, 2, 3], [4, 5, 6]) - def set_annotations(self, docs, scores, tensor=None): + def set_annotations(self, docs, scores, tensors=None): return docs diff --git a/spacy/tests/regression/test_issue4313.py b/spacy/tests/regression/test_issue4313.py new file mode 100644 index 000000000..c68f745a7 --- /dev/null +++ b/spacy/tests/regression/test_issue4313.py @@ -0,0 +1,39 @@ +# coding: utf8 +from __future__ import unicode_literals + +from collections import defaultdict + +from spacy.pipeline import EntityRecognizer + +from spacy.lang.en import English +from spacy.tokens import Span + + +def test_issue4313(): + """ This should not crash or exit with some strange error code """ + beam_width = 16 + beam_density = 0.0001 + nlp = English() + ner = EntityRecognizer(nlp.vocab) + ner.add_label("SOME_LABEL") + ner.begin_training([]) + nlp.add_pipe(ner) + + # add a new label to the doc + doc = nlp("What do you think about Apple ?") + assert len(ner.labels) == 1 
+ assert "SOME_LABEL" in ner.labels + apple_ent = Span(doc, 5, 6, label="MY_ORG") + doc.ents = list(doc.ents) + [apple_ent] + + # ensure the beam_parse still works with the new label + docs = [doc] + beams = nlp.entity.beam_parse( + docs, beam_width=beam_width, beam_density=beam_density + ) + + for doc, beam in zip(docs, beams): + entity_scores = defaultdict(float) + for score, ents in nlp.entity.moves.get_beam_parses(beam): + for start, end, label in ents: + entity_scores[(start, end, label)] += score diff --git a/spacy/tests/serialize/test_serialize_kb.py b/spacy/tests/serialize/test_serialize_kb.py index 67fd9f0d4..b19c11864 100644 --- a/spacy/tests/serialize/test_serialize_kb.py +++ b/spacy/tests/serialize/test_serialize_kb.py @@ -1,11 +1,11 @@ # coding: utf-8 from __future__ import unicode_literals -from ..util import make_tempdir -from ...util import ensure_path - +from spacy.util import ensure_path from spacy.kb import KnowledgeBase +from ..util import make_tempdir + def test_serialize_kb_disk(en_vocab): # baseline assertions diff --git a/spacy/tests/test_displacy.py b/spacy/tests/test_displacy.py index 5e99d261a..2d1f1bd8f 100644 --- a/spacy/tests/test_displacy.py +++ b/spacy/tests/test_displacy.py @@ -32,7 +32,7 @@ def test_displacy_parse_deps(en_vocab): assert isinstance(deps, dict) assert deps["words"] == [ {"text": "This", "tag": "DET"}, - {"text": "is", "tag": "VERB"}, + {"text": "is", "tag": "AUX"}, {"text": "a", "tag": "DET"}, {"text": "sentence", "tag": "NOUN"}, ] diff --git a/spacy/tests/test_gold.py b/spacy/tests/test_gold.py index 860540be2..4f79c4463 100644 --- a/spacy/tests/test_gold.py +++ b/spacy/tests/test_gold.py @@ -3,8 +3,12 @@ from __future__ import unicode_literals from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags from spacy.gold import spans_from_biluo_tags, GoldParse +from spacy.gold import GoldCorpus, docs_to_json +from spacy.lang.en import English from spacy.tokens import Doc +from .util import make_tempdir 
import pytest +import srsly def test_gold_biluo_U(en_vocab): @@ -81,3 +85,28 @@ def test_gold_ner_missing_tags(en_tokenizer): doc = en_tokenizer("I flew to Silicon Valley via London.") biluo_tags = [None, "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] gold = GoldParse(doc, entities=biluo_tags) # noqa: F841 + + +def test_roundtrip_docs_to_json(): + text = "I flew to Silicon Valley via London." + cats = {"TRAVEL": 1.0, "BAKING": 0.0} + nlp = English() + doc = nlp(text) + doc.cats = cats + doc[0].is_sent_start = True + for i in range(1, len(doc)): + doc[i].is_sent_start = False + + with make_tempdir() as tmpdir: + json_file = tmpdir / "roundtrip.json" + srsly.write_json(json_file, [docs_to_json(doc)]) + goldcorpus = GoldCorpus(str(json_file), str(json_file)) + + reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp)) + + assert len(doc) == goldcorpus.count_train() + assert text == reloaded_doc.text + assert "TRAVEL" in goldparse.cats + assert "BAKING" in goldparse.cats + assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] + assert cats["BAKING"] == goldparse.cats["BAKING"] diff --git a/spacy/tests/test_scorer.py b/spacy/tests/test_scorer.py index a747d3adb..9cc4f75b2 100644 --- a/spacy/tests/test_scorer.py +++ b/spacy/tests/test_scorer.py @@ -1,9 +1,12 @@ # coding: utf-8 from __future__ import unicode_literals +from numpy.testing import assert_almost_equal, assert_array_almost_equal +import pytest from pytest import approx from spacy.gold import GoldParse -from spacy.scorer import Scorer +from spacy.scorer import Scorer, ROCAUCScore +from spacy.scorer import _roc_auc_score, _roc_curve from .util import get_doc test_ner_cardinal = [ @@ -66,3 +69,73 @@ def test_ner_per_type(en_vocab): assert results["ents_per_type"]["ORG"]["p"] == 50 assert results["ents_per_type"]["ORG"]["r"] == 100 assert results["ents_per_type"]["ORG"]["f"] == approx(66.66666) + + +def test_roc_auc_score(): + # Binary classification, toy tests from scikit-learn test suite + y_true = [0, 1] + y_score 
= [0, 1] + tpr, fpr, _ = _roc_curve(y_true, y_score) + roc_auc = _roc_auc_score(y_true, y_score) + assert_array_almost_equal(tpr, [0, 0, 1]) + assert_array_almost_equal(fpr, [0, 1, 1]) + assert_almost_equal(roc_auc, 1.0) + + y_true = [0, 1] + y_score = [1, 0] + tpr, fpr, _ = _roc_curve(y_true, y_score) + roc_auc = _roc_auc_score(y_true, y_score) + assert_array_almost_equal(tpr, [0, 1, 1]) + assert_array_almost_equal(fpr, [0, 0, 1]) + assert_almost_equal(roc_auc, 0.0) + + y_true = [1, 0] + y_score = [1, 1] + tpr, fpr, _ = _roc_curve(y_true, y_score) + roc_auc = _roc_auc_score(y_true, y_score) + assert_array_almost_equal(tpr, [0, 1]) + assert_array_almost_equal(fpr, [0, 1]) + assert_almost_equal(roc_auc, 0.5) + + y_true = [1, 0] + y_score = [1, 0] + tpr, fpr, _ = _roc_curve(y_true, y_score) + roc_auc = _roc_auc_score(y_true, y_score) + assert_array_almost_equal(tpr, [0, 0, 1]) + assert_array_almost_equal(fpr, [0, 1, 1]) + assert_almost_equal(roc_auc, 1.0) + + y_true = [1, 0] + y_score = [0.5, 0.5] + tpr, fpr, _ = _roc_curve(y_true, y_score) + roc_auc = _roc_auc_score(y_true, y_score) + assert_array_almost_equal(tpr, [0, 1]) + assert_array_almost_equal(fpr, [0, 1]) + assert_almost_equal(roc_auc, 0.5) + + # same result as above with ROCAUCScore wrapper + score = ROCAUCScore() + score.score_set(0.5, 1) + score.score_set(0.5, 0) + assert_almost_equal(score.score, 0.5) + + # check that errors are raised in undefined cases and score is -inf + y_true = [0, 0] + y_score = [0.25, 0.75] + with pytest.raises(ValueError): + _roc_auc_score(y_true, y_score) + + score = ROCAUCScore() + score.score_set(0.25, 0) + score.score_set(0.75, 0) + assert score.score == -float("inf") + + y_true = [1, 1] + y_score = [0.25, 0.75] + with pytest.raises(ValueError): + _roc_auc_score(y_true, y_score) + + score = ROCAUCScore() + score.score_set(0.25, 1) + score.score_set(0.75, 1) + assert score.score == -float("inf") diff --git a/spacy/tests/vocab_vectors/test_lookups.py 
b/spacy/tests/vocab_vectors/test_lookups.py index 16ffe83fc..f78dd33c4 100644 --- a/spacy/tests/vocab_vectors/test_lookups.py +++ b/spacy/tests/vocab_vectors/test_lookups.py @@ -2,7 +2,8 @@ from __future__ import unicode_literals import pytest -from spacy.lookups import Lookups +from spacy.lookups import Lookups, Table +from spacy.strings import get_string_id from spacy.vocab import Vocab from ..util import make_tempdir @@ -19,9 +20,9 @@ def test_lookups_api(): table = lookups.get_table(table_name) assert table.name == table_name assert len(table) == 2 - assert table.get("hello") == "world" - table.set("a", "b") - assert table.get("a") == "b" + assert table["hello"] == "world" + table["a"] = "b" + assert table["a"] == "b" table = lookups.get_table(table_name) assert len(table) == 3 with pytest.raises(KeyError): @@ -36,8 +37,44 @@ def test_lookups_api(): lookups.get_table(table_name) -# This fails on Python 3.5 -@pytest.mark.xfail +def test_table_api(): + table = Table(name="table") + assert table.name == "table" + assert len(table) == 0 + assert "abc" not in table + data = {"foo": "bar", "hello": "world"} + table = Table(name="table", data=data) + assert len(table) == len(data) + assert "foo" in table + assert get_string_id("foo") in table + assert table["foo"] == "bar" + assert table[get_string_id("foo")] == "bar" + assert table.get("foo") == "bar" + assert table.get("abc") is None + table["abc"] = 123 + assert table["abc"] == 123 + assert table[get_string_id("abc")] == 123 + table.set("def", 456) + assert table["def"] == 456 + assert table[get_string_id("def")] == 456 + + +def test_table_api_to_from_bytes(): + data = {"foo": "bar", "hello": "world", "abc": 123} + table = Table(name="table", data=data) + table_bytes = table.to_bytes() + new_table = Table().from_bytes(table_bytes) + assert new_table.name == "table" + assert len(new_table) == 3 + assert new_table["foo"] == "bar" + assert new_table[get_string_id("foo")] == "bar" + new_table2 = Table(data={"def": 
456}) + new_table2.from_bytes(table_bytes) + assert len(new_table2) == 3 + assert "def" not in new_table2 + + +@pytest.mark.skip(reason="This fails on Python 3.5") def test_lookups_to_from_bytes(): lookups = Lookups() lookups.add_table("table1", {"foo": "bar", "hello": "world"}) @@ -50,15 +87,14 @@ def test_lookups_to_from_bytes(): assert "table2" in new_lookups table1 = new_lookups.get_table("table1") assert len(table1) == 2 - assert table1.get("foo") == "bar" + assert table1["foo"] == "bar" table2 = new_lookups.get_table("table2") assert len(table2) == 3 - assert table2.get("b") == 2 + assert table2["b"] == 2 assert new_lookups.to_bytes() == lookups_bytes -# This fails on Python 3.5 -@pytest.mark.xfail +@pytest.mark.skip(reason="This fails on Python 3.5") def test_lookups_to_from_disk(): lookups = Lookups() lookups.add_table("table1", {"foo": "bar", "hello": "world"}) @@ -72,14 +108,13 @@ def test_lookups_to_from_disk(): assert "table2" in new_lookups table1 = new_lookups.get_table("table1") assert len(table1) == 2 - assert table1.get("foo") == "bar" + assert table1["foo"] == "bar" table2 = new_lookups.get_table("table2") assert len(table2) == 3 - assert table2.get("b") == 2 + assert table2["b"] == 2 -# This fails on Python 3.5 -@pytest.mark.xfail +@pytest.mark.skip(reason="This fails on Python 3.5") def test_lookups_to_from_bytes_via_vocab(): table_name = "test" vocab = Vocab() @@ -93,12 +128,11 @@ def test_lookups_to_from_bytes_via_vocab(): assert table_name in new_vocab.lookups table = new_vocab.lookups.get_table(table_name) assert len(table) == 2 - assert table.get("hello") == "world" + assert table["hello"] == "world" assert new_vocab.to_bytes() == vocab_bytes -# This fails on Python 3.5 -@pytest.mark.xfail +@pytest.mark.skip(reason="This fails on Python 3.5") def test_lookups_to_from_disk_via_vocab(): table_name = "test" vocab = Vocab() @@ -113,4 +147,4 @@ def test_lookups_to_from_disk_via_vocab(): assert table_name in new_vocab.lookups table = 
new_vocab.lookups.get_table(table_name) assert len(table) == 2 - assert table.get("hello") == "world" + assert table["hello"] == "world" diff --git a/spacy/tests/vocab_vectors/test_vectors.py b/spacy/tests/vocab_vectors/test_vectors.py index 2a828de9c..4226bca3b 100644 --- a/spacy/tests/vocab_vectors/test_vectors.py +++ b/spacy/tests/vocab_vectors/test_vectors.py @@ -259,7 +259,7 @@ def test_vectors_doc_doc_similarity(vocab, text1, text2): def test_vocab_add_vector(): - vocab = Vocab() + vocab = Vocab(vectors_name="test_vocab_add_vector") data = numpy.ndarray((5, 3), dtype="f") data[0] = 1.0 data[1] = 2.0 @@ -272,7 +272,7 @@ def test_vocab_add_vector(): def test_vocab_prune_vectors(): - vocab = Vocab() + vocab = Vocab(vectors_name="test_vocab_prune_vectors") _ = vocab["cat"] # noqa: F841 _ = vocab["dog"] # noqa: F841 _ = vocab["kitten"] # noqa: F841 diff --git a/spacy/tokens/__init__.py b/spacy/tokens/__init__.py index 5722d45bc..536ec8349 100644 --- a/spacy/tokens/__init__.py +++ b/spacy/tokens/__init__.py @@ -4,5 +4,6 @@ from __future__ import unicode_literals from .doc import Doc from .token import Token from .span import Span +from ._serialize import DocBin -__all__ = ["Doc", "Token", "Span"] +__all__ = ["Doc", "Token", "Span", "DocBin"] diff --git a/spacy/tokens/_retokenize.pyx b/spacy/tokens/_retokenize.pyx index 741be7e6a..5b0747fa0 100644 --- a/spacy/tokens/_retokenize.pyx +++ b/spacy/tokens/_retokenize.pyx @@ -146,11 +146,12 @@ def _merge(Doc doc, merges): syntactic root of the span. RETURNS (Token): The first newly merged token. 
""" - cdef int i, merge_index, start, end, token_index + cdef int i, merge_index, start, end, token_index, current_span_index, current_offset, offset, span_index cdef Span span cdef const LexemeC* lex cdef TokenC* token cdef Pool mem = Pool() + cdef int merged_iob = 0 tokens = mem.alloc(len(merges), sizeof(TokenC)) spans = [] diff --git a/spacy/tokens/_serialize.py b/spacy/tokens/_serialize.py index 41f524839..634d7450a 100644 --- a/spacy/tokens/_serialize.py +++ b/spacy/tokens/_serialize.py @@ -8,36 +8,77 @@ from thinc.neural.ops import NumpyOps from ..compat import copy_reg from ..tokens import Doc -from ..attrs import SPACY, ORTH +from ..attrs import SPACY, ORTH, intify_attrs +from ..errors import Errors -class DocBox(object): - """Serialize analyses from a collection of doc objects.""" +class DocBin(object): + """Pack Doc objects for binary serialization. + + The DocBin class lets you efficiently serialize the information from a + collection of Doc objects. You can control which information is serialized + by passing a list of attribute IDs, and optionally also specify whether the + user data is serialized. The DocBin is faster and produces smaller data + sizes than pickle, and allows you to deserialize without executing arbitrary + Python code. + + The serialization format is gzipped msgpack, where the msgpack object has + the following structure: + + { + "attrs": List[uint64], # e.g. [TAG, HEAD, ENT_IOB, ENT_TYPE] + "tokens": bytes, # Serialized numpy uint64 array with the token data + "spaces": bytes, # Serialized numpy boolean array with spaces data + "lengths": bytes, # Serialized numpy int32 array with the doc lengths + "strings": List[unicode] # List of unique strings in the token data + } + + Strings for the words, tags, labels etc are represented by 64-bit hashes in + the token data, and every string that occurs at least once is passed via the + strings object. 
This means the storage is more efficient if you pack more + documents together, because you have less duplication in the strings. + + A notable downside to this format is that you can't easily extract just one + document from the DocBin. + """ def __init__(self, attrs=None, store_user_data=False): - """Create a DocBox object, to hold serialized annotations. + """Create a DocBin object to hold serialized annotations. attrs (list): List of attributes to serialize. 'orth' and 'spacy' are always serialized, so they're not required. Defaults to None. + store_user_data (bool): Whether to include the `Doc.user_data`. + RETURNS (DocBin): The newly constructed object. + + DOCS: https://spacy.io/api/docbin#init """ attrs = attrs or [] - # Ensure ORTH is always attrs[0] + attrs = sorted(intify_attrs(attrs)) self.attrs = [attr for attr in attrs if attr != ORTH and attr != SPACY] - self.attrs.insert(0, ORTH) + self.attrs.insert(0, ORTH) # Ensure ORTH is always attrs[0] self.tokens = [] self.spaces = [] self.user_data = [] self.strings = set() self.store_user_data = store_user_data + def __len__(self): + """RETURNS: The number of Doc objects added to the DocBin.""" + return len(self.tokens) + def add(self, doc): - """Add a doc's annotations to the DocBox for serialization.""" + """Add a Doc's annotations to the DocBin for serialization. + + doc (Doc): The Doc object to add. 
+ + DOCS: https://spacy.io/api/docbin#add + """ array = doc.to_array(self.attrs) if len(array.shape) == 1: array = array.reshape((array.shape[0], 1)) self.tokens.append(array) spaces = doc.to_array(SPACY) - assert array.shape[0] == spaces.shape[0] + assert array.shape[0] == spaces.shape[0] # this should never happen spaces = spaces.reshape((spaces.shape[0], 1)) self.spaces.append(numpy.asarray(spaces, dtype=bool)) self.strings.update(w.text for w in doc) @@ -45,7 +86,13 @@ class DocBox(object): self.user_data.append(srsly.msgpack_dumps(doc.user_data)) def get_docs(self, vocab): - """Recover Doc objects from the annotations, using the given vocab.""" + """Recover Doc objects from the annotations, using the given vocab. + + vocab (Vocab): The shared vocab. + YIELDS (Doc): The Doc objects. + + DOCS: https://spacy.io/api/docbin#get_docs + """ for string in self.strings: vocab[string] orth_col = self.attrs.index(ORTH) @@ -60,8 +107,16 @@ class DocBox(object): yield doc def merge(self, other): - """Extend the annotations of this DocBox with the annotations from another.""" - assert self.attrs == other.attrs + """Extend the annotations of this DocBin with the annotations from + another. Will raise an error if the pre-defined attrs of the two + DocBins don't match. + + other (DocBin): The DocBin to merge into the current bin. + + DOCS: https://spacy.io/api/docbin#merge + """ + if self.attrs != other.attrs: + raise ValueError(Errors.E166.format(current=self.attrs, other=other.attrs)) self.tokens.extend(other.tokens) self.spaces.extend(other.spaces) self.strings.update(other.strings) @@ -69,9 +124,14 @@ class DocBox(object): self.user_data.extend(other.user_data) def to_bytes(self): - """Serialize the DocBox's annotations into a byte string.""" + """Serialize the DocBin's annotations to a bytestring. + + RETURNS (bytes): The serialized DocBin. 
+ + DOCS: https://spacy.io/api/docbin#to_bytes + """ for tokens in self.tokens: - assert len(tokens.shape) == 2, tokens.shape + assert len(tokens.shape) == 2, tokens.shape # this should never happen lengths = [len(tokens) for tokens in self.tokens] msg = { "attrs": self.attrs, @@ -84,9 +144,15 @@ class DocBox(object): msg["user_data"] = self.user_data return gzip.compress(srsly.msgpack_dumps(msg)) - def from_bytes(self, string): - """Deserialize the DocBox's annotations from a byte string.""" - msg = srsly.msgpack_loads(gzip.decompress(string)) + def from_bytes(self, bytes_data): + """Deserialize the DocBin's annotations from a bytestring. + + bytes_data (bytes): The data to load from. + RETURNS (DocBin): The loaded DocBin. + + DOCS: https://spacy.io/api/docbin#from_bytes + """ + msg = srsly.msgpack_loads(gzip.decompress(bytes_data)) self.attrs = msg["attrs"] self.strings = set(msg["strings"]) lengths = numpy.fromstring(msg["lengths"], dtype="int32") @@ -100,35 +166,35 @@ class DocBox(object): if self.store_user_data and "user_data" in msg: self.user_data = list(msg["user_data"]) for tokens in self.tokens: - assert len(tokens.shape) == 2, tokens.shape + assert len(tokens.shape) == 2, tokens.shape # this should never happen return self -def merge_boxes(boxes): +def merge_bins(bins): merged = None - for byte_string in boxes: + for byte_string in bins: if byte_string is not None: - box = DocBox(store_user_data=True).from_bytes(byte_string) + doc_bin = DocBin(store_user_data=True).from_bytes(byte_string) if merged is None: - merged = box + merged = doc_bin else: - merged.merge(box) + merged.merge(doc_bin) if merged is not None: return merged.to_bytes() else: return b"" -def pickle_box(box): - return (unpickle_box, (box.to_bytes(),)) +def pickle_bin(doc_bin): + return (unpickle_bin, (doc_bin.to_bytes(),)) -def unpickle_box(byte_string): - return DocBox().from_bytes(byte_string) +def unpickle_bin(byte_string): + return DocBin().from_bytes(byte_string) 
-copy_reg.pickle(DocBox, pickle_box, unpickle_box) +copy_reg.pickle(DocBin, pickle_bin, unpickle_bin) # Compatibility, as we had named it this previously. -Binder = DocBox +Binder = DocBin -__all__ = ["DocBox"] +__all__ = ["DocBin"] diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index e863b0807..80a808bae 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -256,7 +256,7 @@ cdef class Doc: def is_nered(self): """Check if the document has named entities set. Will return True if *any* of the tokens has a named entity tag set (even if the others are - uknown values). + unknown values). """ if len(self) == 0: return True @@ -525,13 +525,11 @@ cdef class Doc: def __set__(self, ents): # TODO: - # 1. Allow negative matches - # 2. Ensure pre-set NERs are not over-written during statistical - # prediction - # 3. Test basic data-driven ORTH gazetteer - # 4. Test more nuanced date and currency regex + # 1. Test basic data-driven ORTH gazetteer + # 2. Test more nuanced date and currency regex tokens_in_ents = {} cdef attr_t entity_type + cdef attr_t kb_id cdef int ent_start, ent_end for ent_info in ents: entity_type, kb_id, ent_start, ent_end = get_entity_info(ent_info) @@ -545,27 +543,31 @@ cdef class Doc: tokens_in_ents[token_index] = (ent_start, ent_end, entity_type, kb_id) cdef int i for i in range(self.length): - self.c[i].ent_type = 0 - self.c[i].ent_kb_id = 0 - self.c[i].ent_iob = 0 # Means missing. 
- cdef attr_t ent_type - cdef int start, end - for ent_info in ents: - ent_type, ent_kb_id, start, end = get_entity_info(ent_info) - if ent_type is None or ent_type < 0: - # Mark as O - for i in range(start, end): - self.c[i].ent_type = 0 - self.c[i].ent_kb_id = 0 - self.c[i].ent_iob = 2 - else: - # Mark (inside) as I - for i in range(start, end): - self.c[i].ent_type = ent_type - self.c[i].ent_kb_id = ent_kb_id - self.c[i].ent_iob = 1 - # Set start as B - self.c[start].ent_iob = 3 + # default values + entity_type = 0 + kb_id = 0 + + # Set ent_iob to Missing (0) by default unless this token was nered before + ent_iob = 0 + if self.c[i].ent_iob != 0: + ent_iob = 2 + + # overwrite if the token was part of a specified entity + if i in tokens_in_ents.keys(): + ent_start, ent_end, entity_type, kb_id = tokens_in_ents[i] + if entity_type is None or entity_type <= 0: + # Blocking this token from being overwritten by downstream NER + ent_iob = 3 + elif ent_start == i: + # Marking the start of an entity + ent_iob = 3 + else: + # Marking the inside of an entity + ent_iob = 1 + + self.c[i].ent_type = entity_type + self.c[i].ent_kb_id = kb_id + self.c[i].ent_iob = ent_iob @property def noun_chunks(self): @@ -1089,6 +1091,37 @@ cdef class Doc: data["_"][attr] = value return data + def to_utf8_array(self, int nr_char=-1): + """Encode word strings to utf8, and export to a fixed-width array + of characters. Characters are placed into the array in the order: + 0, -1, 1, -2, etc + For example, if the array is sliced array[:, :8], the array will + contain the first 4 characters and last 4 characters of each word --- + with the middle characters clipped out. The value 255 is used as a pad + value. 
+ """ + byte_strings = [token.orth_.encode('utf8') for token in self] + if nr_char == -1: + nr_char = max(len(bs) for bs in byte_strings) + cdef np.ndarray output = numpy.zeros((len(byte_strings), nr_char), dtype='uint8') + output.fill(255) + cdef int i, j, start_idx, end_idx + cdef bytes byte_string + cdef unsigned char utf8_char + for i, byte_string in enumerate(byte_strings): + j = 0 + start_idx = 0 + end_idx = len(byte_string) - 1 + while j < nr_char and start_idx <= end_idx: + output[i, j] = byte_string[start_idx] + start_idx += 1 + j += 1 + if j < nr_char and start_idx <= end_idx: + output[i, j] = byte_string[end_idx] + end_idx -= 1 + j += 1 + return output + cdef int token_by_start(const TokenC* tokens, int length, int start_char) except -2: cdef int i diff --git a/spacy/tokens/morphanalysis.pxd b/spacy/tokens/morphanalysis.pxd new file mode 100644 index 000000000..22844454a --- /dev/null +++ b/spacy/tokens/morphanalysis.pxd @@ -0,0 +1,9 @@ +from ..vocab cimport Vocab +from ..typedefs cimport hash_t +from ..structs cimport MorphAnalysisC + + +cdef class MorphAnalysis: + cdef readonly Vocab vocab + cdef hash_t key + cdef MorphAnalysisC c diff --git a/spacy/tokens/morphanalysis.pyx b/spacy/tokens/morphanalysis.pyx new file mode 100644 index 000000000..e09870741 --- /dev/null +++ b/spacy/tokens/morphanalysis.pyx @@ -0,0 +1,423 @@ +from libc.string cimport memset + +from ..vocab cimport Vocab +from ..typedefs cimport hash_t, attr_t +from ..morphology cimport list_features, check_feature, get_field, tag_to_json + +from ..strings import get_string_id + + +cdef class MorphAnalysis: + """Control access to morphological features for a token.""" + def __init__(self, Vocab vocab, features=tuple()): + self.vocab = vocab + self.key = self.vocab.morphology.add(features) + analysis = self.vocab.morphology.tags.get(self.key) + if analysis is not NULL: + self.c = analysis[0] + else: + memset(&self.c, 0, sizeof(self.c)) + + @classmethod + def from_id(cls, Vocab vocab, hash_t 
key): + """Create a morphological analysis from a given ID.""" + cdef MorphAnalysis morph = MorphAnalysis.__new__(MorphAnalysis, vocab) + morph.vocab = vocab + morph.key = key + analysis = vocab.morphology.tags.get(key) + if analysis is not NULL: + morph.c = analysis[0] + else: + memset(&morph.c, 0, sizeof(morph.c)) + return morph + + def __contains__(self, feature): + """Test whether the morphological analysis contains some feature.""" + cdef attr_t feat_id = get_string_id(feature) + return check_feature(&self.c, feat_id) + + def __iter__(self): + """Iterate over the features in the analysis.""" + cdef attr_t feature + for feature in list_features(&self.c): + yield self.vocab.strings[feature] + + def __len__(self): + """The number of features in the analysis.""" + return self.c.length + + def __str__(self): + return self.to_json() + + def __repr__(self): + return self.to_json() + + def __hash__(self): + return self.key + + def get(self, unicode field): + """Retrieve a feature by field.""" + cdef int field_id = self.vocab.morphology._feat_map.attr2field[field] + return self.vocab.strings[get_field(&self.c, field_id)] + + def to_json(self): + """Produce a json serializable representation, which will be a list of + strings. 
+ """ + return tag_to_json(&self.c) + + @property + def is_base_form(self): + raise NotImplementedError + + @property + def pos(self): + return self.c.pos + + @property + def pos_(self): + return self.vocab.strings[self.c.pos] + + property id: + def __get__(self): + return self.key + + property abbr: + def __get__(self): + return self.c.abbr + + property adp_type: + def __get__(self): + return self.c.adp_type + + property adv_type: + def __get__(self): + return self.c.adv_type + + property animacy: + def __get__(self): + return self.c.animacy + + property aspect: + def __get__(self): + return self.c.aspect + + property case: + def __get__(self): + return self.c.case + + property conj_type: + def __get__(self): + return self.c.conj_type + + property connegative: + def __get__(self): + return self.c.connegative + + property definite: + def __get__(self): + return self.c.definite + + property degree: + def __get__(self): + return self.c.degree + + property derivation: + def __get__(self): + return self.c.derivation + + property echo: + def __get__(self): + return self.c.echo + + property foreign: + def __get__(self): + return self.c.foreign + + property gender: + def __get__(self): + return self.c.gender + + property hyph: + def __get__(self): + return self.c.hyph + + property inf_form: + def __get__(self): + return self.c.inf_form + + property mood: + def __get__(self): + return self.c.mood + + property name_type: + def __get__(self): + return self.c.name_type + + property negative: + def __get__(self): + return self.c.negative + + property noun_type: + def __get__(self): + return self.c.noun_type + + property number: + def __get__(self): + return self.c.number + + property num_form: + def __get__(self): + return self.c.num_form + + property num_type: + def __get__(self): + return self.c.num_type + + property num_value: + def __get__(self): + return self.c.num_value + + property part_form: + def __get__(self): + return self.c.part_form + + property part_type: + def 
__get__(self): + return self.c.part_type + + property person: + def __get__(self): + return self.c.person + + property polite: + def __get__(self): + return self.c.polite + + property polarity: + def __get__(self): + return self.c.polarity + + property poss: + def __get__(self): + return self.c.poss + + property prefix: + def __get__(self): + return self.c.prefix + + property prep_case: + def __get__(self): + return self.c.prep_case + + property pron_type: + def __get__(self): + return self.c.pron_type + + property punct_side: + def __get__(self): + return self.c.punct_side + + property punct_type: + def __get__(self): + return self.c.punct_type + + property reflex: + def __get__(self): + return self.c.reflex + + property style: + def __get__(self): + return self.c.style + + property style_variant: + def __get__(self): + return self.c.style_variant + + property tense: + def __get__(self): + return self.c.tense + + property typo: + def __get__(self): + return self.c.typo + + property verb_form: + def __get__(self): + return self.c.verb_form + + property voice: + def __get__(self): + return self.c.voice + + property verb_type: + def __get__(self): + return self.c.verb_type + + property abbr_: + def __get__(self): + return self.vocab.strings[self.c.abbr] + + property adp_type_: + def __get__(self): + return self.vocab.strings[self.c.adp_type] + + property adv_type_: + def __get__(self): + return self.vocab.strings[self.c.adv_type] + + property animacy_: + def __get__(self): + return self.vocab.strings[self.c.animacy] + + property aspect_: + def __get__(self): + return self.vocab.strings[self.c.aspect] + + property case_: + def __get__(self): + return self.vocab.strings[self.c.case] + + property conj_type_: + def __get__(self): + return self.vocab.strings[self.c.conj_type] + + property connegative_: + def __get__(self): + return self.vocab.strings[self.c.connegative] + + property definite_: + def __get__(self): + return self.vocab.strings[self.c.definite] + + property 
degree_: + def __get__(self): + return self.vocab.strings[self.c.degree] + + property derivation_: + def __get__(self): + return self.vocab.strings[self.c.derivation] + + property echo_: + def __get__(self): + return self.vocab.strings[self.c.echo] + + property foreign_: + def __get__(self): + return self.vocab.strings[self.c.foreign] + + property gender_: + def __get__(self): + return self.vocab.strings[self.c.gender] + + property hyph_: + def __get__(self): + return self.vocab.strings[self.c.hyph] + + property inf_form_: + def __get__(self): + return self.vocab.strings[self.c.inf_form] + + property name_type_: + def __get__(self): + return self.vocab.strings[self.c.name_type] + + property negative_: + def __get__(self): + return self.vocab.strings[self.c.negative] + + property mood_: + def __get__(self): + return self.vocab.strings[self.c.mood] + + property number_: + def __get__(self): + return self.vocab.strings[self.c.number] + + property num_form_: + def __get__(self): + return self.vocab.strings[self.c.num_form] + + property num_type_: + def __get__(self): + return self.vocab.strings[self.c.num_type] + + property num_value_: + def __get__(self): + return self.vocab.strings[self.c.num_value] + + property part_form_: + def __get__(self): + return self.vocab.strings[self.c.part_form] + + property part_type_: + def __get__(self): + return self.vocab.strings[self.c.part_type] + + property person_: + def __get__(self): + return self.vocab.strings[self.c.person] + + property polite_: + def __get__(self): + return self.vocab.strings[self.c.polite] + + property polarity_: + def __get__(self): + return self.vocab.strings[self.c.polarity] + + property poss_: + def __get__(self): + return self.vocab.strings[self.c.poss] + + property prefix_: + def __get__(self): + return self.vocab.strings[self.c.prefix] + + property prep_case_: + def __get__(self): + return self.vocab.strings[self.c.prep_case] + + property pron_type_: + def __get__(self): + return 
self.vocab.strings[self.c.pron_type] + + property punct_side_: + def __get__(self): + return self.vocab.strings[self.c.punct_side] + + property punct_type_: + def __get__(self): + return self.vocab.strings[self.c.punct_type] + + property reflex_: + def __get__(self): + return self.vocab.strings[self.c.reflex] + + property style_: + def __get__(self): + return self.vocab.strings[self.c.style] + + property style_variant_: + def __get__(self): + return self.vocab.strings[self.c.style_variant] + + property tense_: + def __get__(self): + return self.vocab.strings[self.c.tense] + + property typo_: + def __get__(self): + return self.vocab.strings[self.c.typo] + + property verb_form_: + def __get__(self): + return self.vocab.strings[self.c.verb_form] + + property voice_: + def __get__(self): + return self.vocab.strings[self.c.voice] + + property verb_type_: + def __get__(self): + return self.vocab.strings[self.c.verb_type] diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx index 07c6f1c99..8b15a4223 100644 --- a/spacy/tokens/token.pyx +++ b/spacy/tokens/token.pyx @@ -26,6 +26,7 @@ from .. 
import util from ..compat import is_config from ..errors import Errors, Warnings, user_warning, models_warning from .underscore import Underscore, get_ext_args +from .morphanalysis cimport MorphAnalysis cdef class Token: @@ -218,6 +219,10 @@ cdef class Token: xp = get_array_module(vector) return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)) + @property + def morph(self): + return MorphAnalysis.from_id(self.vocab, self.c.morph) + @property def lex_id(self): """RETURNS (int): Sequential ID of the token's lexical type.""" @@ -330,7 +335,7 @@ cdef class Token: """ def __get__(self): if self.c.lemma == 0: - lemma_ = self.vocab.morphology.lemmatizer.lookup(self.orth_) + lemma_ = self.vocab.morphology.lemmatizer.lookup(self.orth_, orth=self.orth) return self.vocab.strings[lemma_] else: return self.c.lemma @@ -749,7 +754,8 @@ cdef class Token: def ent_iob_(self): """IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, - and "" means no entity tag is set. + and "" means no entity tag is set. "B" with an empty ent_type + means that the token is blocked from further processing by NER. RETURNS (unicode): IOB code of named entity tag. 
""" @@ -857,7 +863,7 @@ cdef class Token: """ def __get__(self): if self.c.lemma == 0: - return self.vocab.morphology.lemmatizer.lookup(self.orth_) + return self.vocab.morphology.lemmatizer.lookup(self.orth_, orth=self.orth) else: return self.vocab.strings[self.c.lemma] diff --git a/spacy/util.py b/spacy/util.py index e88d66452..dbe965392 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -136,7 +136,7 @@ def load_language_data(path): def get_module_path(module): if not hasattr(module, "__module__"): - raise ValueError("Can't find module {}".format(repr(module))) + raise ValueError(Errors.E169.format(module=repr(module))) return Path(sys.modules[module.__module__].__file__).parent diff --git a/spacy/vectors.pyx b/spacy/vectors.pyx index 2cb5b077f..3c238fe2d 100644 --- a/spacy/vectors.pyx +++ b/spacy/vectors.pyx @@ -63,7 +63,7 @@ cdef class Vectors: shape (tuple): Size of the table, as (# entries, # columns) data (numpy.ndarray): The vector data. keys (iterable): A sequence of keys, aligned with the data. - name (string): A name to identify the vectors table. + name (unicode): A name to identify the vectors table. RETURNS (Vectors): The newly created object. DOCS: https://spacy.io/api/vectors#init diff --git a/spacy/vocab.pyx b/spacy/vocab.pyx index 7e360d409..62c1791b9 100644 --- a/spacy/vocab.pyx +++ b/spacy/vocab.pyx @@ -18,10 +18,10 @@ from .structs cimport SerializedLexemeC from .compat import copy_reg, basestring_ from .errors import Errors from .lemmatizer import Lemmatizer -from .lookups import Lookups from .attrs import intify_attrs, NORM from .vectors import Vectors from ._ml import link_vectors_to_models +from .lookups import Lookups from . 
import util @@ -33,7 +33,8 @@ cdef class Vocab: DOCS: https://spacy.io/api/vocab """ def __init__(self, lex_attr_getters=None, tag_map=None, lemmatizer=None, - strings=tuple(), lookups=None, oov_prob=-20., **deprecated_kwargs): + strings=tuple(), lookups=None, oov_prob=-20., vectors_name=None, + **deprecated_kwargs): """Create the vocabulary. lex_attr_getters (dict): A dictionary mapping attribute IDs to @@ -44,6 +45,7 @@ cdef class Vocab: strings (StringStore): StringStore that maps strings to integers, and vice versa. lookups (Lookups): Container for large lookup tables and dictionaries. + vectors_name (unicode): Optional name to identify the vectors table. RETURNS (Vocab): The newly constructed object. """ lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {} @@ -62,7 +64,7 @@ cdef class Vocab: _ = self[string] self.lex_attr_getters = lex_attr_getters self.morphology = Morphology(self.strings, tag_map, lemmatizer) - self.vectors = Vectors() + self.vectors = Vectors(name=vectors_name) self.lookups = lookups @property @@ -318,7 +320,7 @@ cdef class Vocab: keys = xp.asarray([key for (prob, i, key) in priority], dtype="uint64") keep = xp.ascontiguousarray(self.vectors.data[indices[:nr_row]]) toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]]) - self.vectors = Vectors(data=keep, keys=keys) + self.vectors = Vectors(data=keep, keys=keys, name=self.vectors.name) syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size) remap = {} for i, key in enumerate(keys[nr_row:]): diff --git a/website/README.md b/website/README.md index be817225d..a02d5a151 100644 --- a/website/README.md +++ b/website/README.md @@ -309,7 +309,7 @@ indented block as plain text and preserve whitespace. 
### Using spaCy import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp(u"This is a sentence.") +doc = nlp("This is a sentence.") for token in doc: print(token.text, token.pos_) ``` @@ -335,9 +335,9 @@ from spacy.matcher import Matcher nlp = spacy.load('en_core_web_sm') matcher = Matcher(nlp.vocab) -pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}] -matcher.add('HelloWorld', None, pattern) -doc = nlp(u'Hello, world! Hello world!') +pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}] +matcher.add("HelloWorld", None, pattern) +doc = nlp("Hello, world! Hello world!") matches = matcher(doc) ``` @@ -360,7 +360,7 @@ interactive widget defaults to a regular code block. ### {executable="true"} import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp(u"This is a sentence.") +doc = nlp("This is a sentence.") for token in doc: print(token.text, token.pos_) ``` @@ -457,7 +457,8 @@ sit amet dignissim justo congue. ## Setup and installation {#setup} Before running the setup, make sure your versions of -[Node](https://nodejs.org/en/) and [npm](https://www.npmjs.com/) are up to date. Node v10.15 or later is required. +[Node](https://nodejs.org/en/) and [npm](https://www.npmjs.com/) are up to date. +Node v10.15 or later is required. 
```bash # Clone the repository diff --git a/website/docs/api/annotation.md b/website/docs/api/annotation.md index 7f7b46260..fb8b67c1e 100644 --- a/website/docs/api/annotation.md +++ b/website/docs/api/annotation.md @@ -16,7 +16,7 @@ menu: > ```python > from spacy.lang.en import English > nlp = English() -> tokens = nlp(u"Some\\nspaces and\\ttab characters") +> tokens = nlp("Some\\nspaces and\\ttab characters") > tokens_text = [t.text for t in tokens] > assert tokens_text == ["Some", "\\n", "spaces", " ", "and", "\\t", "tab", "characters"] > ``` @@ -80,8 +80,8 @@ training corpus and can be defined in the respective language data's -spaCy also maps all language-specific part-of-speech tags to a small, fixed set -of word type tags following the +spaCy maps all language-specific part-of-speech tags to a small, fixed set of +word type tags following the [Universal Dependencies scheme](http://universaldependencies.org/u/pos/). The universal tags don't code for any morphological features and only cover the word type. They're available as the [`Token.pos`](/api/token#attributes) and @@ -552,6 +552,10 @@ spaCy's JSON format, you can use the "last": int, # index of last token "label": string # phrase label }] + }], + "cats": [{ # new in v2.2: categories for text classifier + "label": string, # text category label + "value": float / bool # label applies (1.0/true) or not (0.0/false) }] }] }] diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index c5e77dc0d..7b20b76de 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -8,6 +8,7 @@ menu: - ['Info', 'info'] - ['Validate', 'validate'] - ['Convert', 'convert'] + - ['Debug data', 'debug-data'] - ['Train', 'train'] - ['Pretrain', 'pretrain'] - ['Init Model', 'init-model'] @@ -22,11 +23,11 @@ type `spacy --help`. ## Download {#download} Download [models](/usage/models) for spaCy. 
The downloader finds the -best-matching compatible version, uses pip to download the model as a package -and automatically creates a [shortcut link](/usage/models#usage) to load the -model by name. Direct downloads don't perform any compatibility checks and -require the model name to be specified with its version (e.g. -`en_core_web_sm-2.0.0`). +best-matching compatible version, uses `pip install` to download the model as a +package and creates a [shortcut link](/usage/models#usage) if the model was +downloaded via a shortcut. Direct downloads don't perform any compatibility +checks and require the model name to be specified with its version (e.g. +`en_core_web_sm-2.2.0`). > #### Downloading best practices > @@ -39,16 +40,16 @@ require the model name to be specified with its version (e.g. > also allow you to add it as a versioned package dependency to your project. ```bash -$ python -m spacy download [model] [--direct] +$ python -m spacy download [model] [--direct] [pip args] ``` -| Argument | Type | Description | -| ---------------------------------- | ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `model` | positional | Model name or shortcut (`en`, `de`, `en_core_web_sm`). | -| `--direct`, `-d` | flag | Force direct download of exact model version. | -| other 2.1 | - | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory. | -| `--help`, `-h` | flag | Show help message and available arguments. | -| **CREATES** | directory, symlink | The installed model package in your `site-packages` directory and a shortcut link as a symlink in `spacy/data`. 
| +| Argument | Type | Description | +| ------------------------------------- | ------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `model` | positional | Model name or shortcut (`en`, `de`, `en_core_web_sm`). | +| `--direct`, `-d` | flag | Force direct download of exact model version. | +| pip args 2.1 | - | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. | +| `--help`, `-h` | flag | Show help message and available arguments. | +| **CREATES** | directory, symlink | The installed model package in your `site-packages` directory and a shortcut link as a symlink in `spacy/data` if installed via shortcut. | ## Link {#link} @@ -181,6 +182,166 @@ All output files generated by this command are compatible with | `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). | | `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). | +## Debug data {#debug-data new="2.2"} + +Analyze, debug and validate your training and development data, get useful +stats, and find problems like invalid entity annotations, cyclic dependencies, +low data labels and more. 
+ +```bash +$ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pipeline] [--ignore-warnings] [--verbose] [--no-format] +``` + +| Argument | Type | Description | +| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------- | +| `lang` | positional | Model language. | +| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. | +| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. | +| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. | +| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. | +| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. | +| `--verbose`, `-V` | flag | Print additional information and explanations. | +| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. | + + + +``` +=========================== Data format validation =========================== +✔ Corpus is loadable + +=============================== Training stats =============================== +Training pipeline: tagger, parser, ner +Starting with blank model 'en' +18127 training docs +2939 evaluation docs +⚠ 34 training examples also in evaluation data + +============================== Vocab & Vectors ============================== +ℹ 2083156 total words in the data (56962 unique) +⚠ 13020 misaligned tokens in the training data +⚠ 2423 misaligned tokens in the dev data +10 most common words: 'the' (98429), ',' (91756), '.' 
(87073), 'to' (50058), +'of' (49559), 'and' (44416), 'a' (34010), 'in' (31424), 'that' (22792), 'is' +(18952) +ℹ No word vectors present in the model + +========================== Named Entity Recognition ========================== +ℹ 18 new labels, 0 existing labels +528978 missing values (tokens with '-' label) +New: 'ORG' (23860), 'PERSON' (21395), 'GPE' (21193), 'DATE' (18080), 'CARDINAL' +(10490), 'NORP' (9033), 'MONEY' (5164), 'PERCENT' (3761), 'ORDINAL' (2122), +'LOC' (2113), 'TIME' (1616), 'WORK_OF_ART' (1229), 'QUANTITY' (1150), 'FAC' +(1134), 'EVENT' (974), 'PRODUCT' (935), 'LAW' (444), 'LANGUAGE' (338) +✔ Good amount of examples for all labels +✔ Examples without occurences available for all labels +✔ No entities consisting of or starting/ending with whitespace + +=========================== Part-of-speech Tagging =========================== +ℹ 49 labels in data (57 labels in tag map) +'NN' (266331), 'IN' (227365), 'DT' (185600), 'NNP' (164404), 'JJ' (119830), +'NNS' (110957), '.' (101482), ',' (92476), 'RB' (90090), 'PRP' (90081), 'VB' +(74538), 'VBD' (68199), 'CC' (62862), 'VBZ' (50712), 'VBP' (43420), 'VBN' +(42193), 'CD' (40326), 'VBG' (34764), 'TO' (31085), 'MD' (25863), 'PRP$' +(23335), 'HYPH' (13833), 'POS' (13427), 'UH' (13322), 'WP' (10423), 'WDT' +(9850), 'RP' (8230), 'WRB' (8201), ':' (8168), '''' (7392), '``' (6984), 'NNPS' +(5817), 'JJR' (5689), '$' (3710), 'EX' (3465), 'JJS' (3118), 'RBR' (2872), +'-RRB-' (2825), '-LRB-' (2788), 'PDT' (2078), 'XX' (1316), 'RBS' (1142), 'FW' +(794), 'NFP' (557), 'SYM' (440), 'WP$' (294), 'LS' (293), 'ADD' (191), 'AFX' +(24) +✔ All labels present in tag map for language 'en' + +============================= Dependency Parsing ============================= +ℹ Found 111703 sentences with an average length of 18.6 words. 
+ℹ Found 2251 nonprojective train sentences +ℹ Found 303 nonprojective dev sentences +ℹ 47 labels in train data +ℹ 211 labels in projectivized train data +'punct' (236796), 'prep' (188853), 'pobj' (182533), 'det' (172674), 'nsubj' +(169481), 'compound' (116142), 'ROOT' (111697), 'amod' (107945), 'dobj' (93540), +'aux' (86802), 'advmod' (86197), 'cc' (62679), 'conj' (59575), 'poss' (36449), +'ccomp' (36343), 'advcl' (29017), 'mark' (27990), 'nummod' (24582), 'relcl' +(21359), 'xcomp' (21081), 'attr' (18347), 'npadvmod' (17740), 'acomp' (17204), +'auxpass' (15639), 'appos' (15368), 'neg' (15266), 'nsubjpass' (13922), 'case' +(13408), 'acl' (12574), 'pcomp' (10340), 'nmod' (9736), 'intj' (9285), 'prt' +(8196), 'quantmod' (7403), 'dep' (4300), 'dative' (4091), 'agent' (3908), 'expl' +(3456), 'parataxis' (3099), 'oprd' (2326), 'predet' (1946), 'csubj' (1494), +'subtok' (1147), 'preconj' (692), 'meta' (469), 'csubjpass' (64), 'iobj' (1) +⚠ Low number of examples for label 'iobj' (1) +⚠ Low number of examples for 130 labels in the projectivized dependency +trees used for training. You may want to projectivize labels such as punct +before training in order to improve parser performance. 
+⚠ Projectivized labels with low numbers of examples: appos||attr: 12 +advmod||dobj: 13 prep||ccomp: 12 nsubjpass||ccomp: 15 pcomp||prep: 14 +amod||dobj: 9 attr||xcomp: 14 nmod||nsubj: 17 prep||advcl: 2 prep||prep: 5 +nsubj||conj: 12 advcl||advmod: 18 ccomp||advmod: 11 ccomp||pcomp: 5 acl||pobj: +10 npadvmod||acomp: 7 dobj||pcomp: 14 nsubjpass||pcomp: 1 nmod||pobj: 8 +amod||attr: 6 nmod||dobj: 12 aux||conj: 1 neg||conj: 1 dative||xcomp: 11 +pobj||dative: 3 xcomp||acomp: 19 advcl||pobj: 2 nsubj||advcl: 2 csubj||ccomp: 1 +advcl||acl: 1 relcl||nmod: 2 dobj||advcl: 10 advmod||advcl: 3 nmod||nsubjpass: 6 +amod||pobj: 5 cc||neg: 1 attr||ccomp: 16 advcl||xcomp: 3 nmod||attr: 4 +advcl||nsubjpass: 5 advcl||ccomp: 4 ccomp||conj: 1 punct||acl: 1 meta||acl: 1 +parataxis||acl: 1 prep||acl: 1 amod||nsubj: 7 ccomp||ccomp: 3 acomp||xcomp: 5 +dobj||acl: 5 prep||oprd: 6 advmod||acl: 2 dative||advcl: 1 pobj||agent: 5 +xcomp||amod: 1 dep||advcl: 1 prep||amod: 8 relcl||compound: 1 advcl||csubj: 3 +npadvmod||conj: 2 npadvmod||xcomp: 4 advmod||nsubj: 3 ccomp||amod: 7 +advcl||conj: 1 nmod||conj: 2 advmod||nsubjpass: 2 dep||xcomp: 2 appos||ccomp: 1 +advmod||dep: 1 advmod||advmod: 5 aux||xcomp: 8 dep||advmod: 1 dative||ccomp: 2 +prep||dep: 1 conj||conj: 1 dep||ccomp: 4 cc||ROOT: 1 prep||ROOT: 1 nsubj||pcomp: +3 advmod||prep: 2 relcl||dative: 1 acl||conj: 1 advcl||attr: 4 prep||npadvmod: 1 +nsubjpass||xcomp: 1 neg||advmod: 1 xcomp||oprd: 1 advcl||advcl: 1 dobj||dep: 3 +nsubjpass||parataxis: 1 attr||pcomp: 1 ccomp||parataxis: 1 advmod||attr: 1 +nmod||oprd: 1 appos||nmod: 2 advmod||relcl: 1 appos||npadvmod: 1 appos||conj: 1 +prep||expl: 1 nsubjpass||conj: 1 punct||pobj: 1 cc||pobj: 1 conj||pobj: 1 +punct||conj: 1 ccomp||dep: 1 oprd||xcomp: 3 ccomp||xcomp: 1 ccomp||nsubj: 1 +nmod||dep: 1 xcomp||ccomp: 1 acomp||advcl: 1 intj||advmod: 1 advmod||acomp: 2 +relcl||oprd: 1 advmod||prt: 1 advmod||pobj: 1 appos||nummod: 1 relcl||npadvmod: +3 mark||advcl: 1 aux||ccomp: 1 amod||nsubjpass: 1 
npadvmod||advmod: 1 conj||dep: +1 nummod||pobj: 1 amod||npadvmod: 1 intj||pobj: 1 nummod||npadvmod: 1 +xcomp||xcomp: 1 aux||dep: 1 advcl||relcl: 1 +⚠ The following labels were found only in the train data: xcomp||amod, +advcl||relcl, prep||nsubjpass, acl||nsubj, nsubjpass||conj, xcomp||oprd, +advmod||conj, advmod||advmod, iobj, advmod||nsubjpass, dobj||conj, ccomp||amod, +meta||acl, xcomp||xcomp, prep||attr, prep||ccomp, advcl||acomp, acl||dobj, +advcl||advcl, pobj||agent, prep||advcl, nsubjpass||xcomp, prep||dep, +acomp||xcomp, aux||ccomp, ccomp||dep, conj||dep, relcl||compound, +nsubjpass||ccomp, nmod||dobj, advmod||advcl, advmod||acl, dobj||advcl, +dative||xcomp, prep||nsubj, ccomp||ccomp, nsubj||ccomp, xcomp||acomp, +prep||acomp, dep||advmod, acl||pobj, appos||dobj, npadvmod||acomp, cc||ROOT, +relcl||nsubj, nmod||pobj, acl||nsubjpass, ccomp||advmod, pcomp||prep, +amod||dobj, advmod||attr, advcl||csubj, appos||attr, dobj||pcomp, prep||ROOT, +relcl||pobj, advmod||pobj, amod||nsubj, ccomp||xcomp, prep||oprd, +npadvmod||advmod, appos||nummod, advcl||pobj, neg||advmod, acl||attr, +appos||nsubjpass, csubj||ccomp, amod||nsubjpass, intj||pobj, dep||advcl, +cc||neg, xcomp||ccomp, dative||ccomp, nmod||oprd, pobj||dative, prep||dobj, +dep||ccomp, relcl||attr, ccomp||nsubj, advcl||xcomp, nmod||dep, advcl||advmod, +ccomp||conj, pobj||prep, advmod||acomp, advmod||relcl, attr||pcomp, +ccomp||parataxis, oprd||xcomp, intj||advmod, nmod||nsubjpass, prep||npadvmod, +parataxis||acl, prep||pobj, advcl||dobj, amod||pobj, prep||acl, conj||pobj, +advmod||dep, punct||pobj, ccomp||acomp, acomp||advcl, nummod||npadvmod, +dobj||dep, npadvmod||xcomp, advcl||conj, relcl||npadvmod, punct||acl, +relcl||dobj, dobj||xcomp, nsubjpass||parataxis, dative||advcl, relcl||nmod, +advcl||ccomp, appos||npadvmod, ccomp||pcomp, prep||amod, mark||advcl, +prep||advmod, prep||xcomp, appos||nsubj, attr||ccomp, advmod||prt, dobj||ccomp, +aux||conj, advcl||nsubj, conj||conj, advmod||ccomp, advcl||nsubjpass, 
+attr||xcomp, nmod||conj, npadvmod||conj, relcl||dative, prep||expl, +nsubjpass||pcomp, advmod||xcomp, advmod||dobj, appos||pobj, nsubj||conj, +relcl||nsubjpass, advcl||attr, appos||ccomp, advmod||prep, prep||conj, +nmod||attr, punct||conj, neg||conj, dep||xcomp, aux||xcomp, dobj||acl, +nummod||pobj, amod||npadvmod, nsubj||pcomp, advcl||acl, appos||nmod, +relcl||oprd, prep||prep, cc||pobj, nmod||nsubj, amod||attr, aux||dep, +appos||conj, advmod||nsubj, nsubj||advcl, acl||conj +To train a parser, your data should include at least 20 instances of each label. +⚠ Multiple root labels (ROOT, nsubj, aux, npadvmod, prep) found in +training data. spaCy's parser uses a single root label ROOT so this distinction +will not be available. + +================================== Summary ================================== +✔ 5 checks passed +⚠ 8 warnings +``` + + + ## Train {#train} Train a model. Expects data in spaCy's @@ -200,36 +361,41 @@ will only train the tagger and parser. ```bash $ python -m spacy train [lang] [output_path] [train_path] [dev_path] -[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping] [--n-examples] [--use-gpu] -[--version] [--meta-path] [--init-tok2vec] [--parser-multitasks] -[--entity-multitasks] [--gold-preproc] [--noise-level] [--learn-tokens] -[--verbose] +[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping] +[--n-examples] [--use-gpu] [--version] [--meta-path] [--init-tok2vec] +[--parser-multitasks] [--entity-multitasks] [--gold-preproc] [--noise-level] +[--orth-variant-level] [--learn-tokens] [--textcat-arch] [--textcat-multilabel] +[--textcat-positive-label] [--verbose] ``` -| Argument | Type | Description | -| ----------------------------------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `lang` | positional | Model language. 
| -| `output_path` | positional | Directory to store model in. Will be created if it doesn't exist. | -| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. | -| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. | -| `--base-model`, `-b` 2.1 | option | Optional name of base model to update. Can be any loadable spaCy model. | -| `--pipeline`, `-p` 2.1 | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. | -| `--vectors`, `-v` | option | Model to load vectors from. | -| `--n-iter`, `-n` | option | Number of iterations (default: `30`). | -| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. | -| `--n-examples`, `-ns` | option | Number of examples to use (defaults to `0` for all examples). | -| `--use-gpu`, `-g` | option | Whether to use GPU. Can be either `0`, `1` or `-1`. | -| `--version`, `-V` | option | Model version. Will be written out to the model's `meta.json` after training. | -| `--meta-path`, `-m` 2 | option | Optional path to model [`meta.json`](/usage/training#models-generating). All relevant properties like `lang`, `pipeline` and `spacy_version` will be overwritten. | -| `--init-tok2vec`, `-t2v` 2.1 | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. | -| `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` | -| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` | -| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. | -| `--gold-preproc`, `-G` | flag | Use gold preprocessing. | -| `--learn-tokens`, `-T` | flag | Make parser learn gold-standard tokenization by merging ] subtokens. 
Typically used for languages like Chinese. | -| `--verbose`, `-VV` 2.0.13 | flag | Show more detailed messages during training. | -| `--help`, `-h` | flag | Show help message and available arguments. | -| **CREATES** | model, pickle | A spaCy model on each epoch. | +| Argument | Type | Description | +| --------------------------------------------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `lang` | positional | Model language. | +| `output_path` | positional | Directory to store model in. Will be created if it doesn't exist. | +| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. | +| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. | +| `--base-model`, `-b` 2.1 | option | Optional name of base model to update. Can be any loadable spaCy model. | +| `--pipeline`, `-p` 2.1 | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. | +| `--vectors`, `-v` | option | Model to load vectors from. | +| `--n-iter`, `-n` | option | Number of iterations (default: `30`). | +| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. | +| `--n-examples`, `-ns` | option | Number of examples to use (defaults to `0` for all examples). | +| `--use-gpu`, `-g` | option | Whether to use GPU. Can be either `0`, `1` or `-1`. | +| `--version`, `-V` | option | Model version. Will be written out to the model's `meta.json` after training. | +| `--meta-path`, `-m` 2 | option | Optional path to model [`meta.json`](/usage/training#models-generating). All relevant properties like `lang`, `pipeline` and `spacy_version` will be overwritten. 
| +| `--init-tok2vec`, `-t2v` 2.1 | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. | +| `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` | +| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` | +| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. | +| `--orth-variant-level`, `-ovl` 2.2 | option | Float indicating the orthography variation for data augmentation (e.g. `0.3` for making 30% of occurrences of some tokens subject to replacement). | +| `--gold-preproc`, `-G` | flag | Use gold preprocessing. | +| `--learn-tokens`, `-T` | flag | Make parser learn gold-standard tokenization by merging subtokens. Typically used for languages like Chinese. | +| `--textcat-multilabel`, `-TML` 2.2 | flag | Text classification classes aren't mutually exclusive (multilabel). | +| `--textcat-arch`, `-ta` 2.2 | option | Text classification model architecture. Defaults to `"bow"`. | +| `--textcat-positive-label`, `-tpl` 2.2 | option | Text classification positive label for binary classes with two labels. | +| `--verbose`, `-VV` 2.0.13 | flag | Show more detailed messages during training. | +| `--help`, `-h` | flag | Show help message and available arguments. | +| **CREATES** | model, pickle | A spaCy model on each epoch. | ### Environment variables for hyperparameters {#train-hyperparams new="2"} @@ -374,6 +540,7 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc] | `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) with lexical attributes. | | `--vectors-loc`, `-v` | option | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. 
File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. | | `--prune-vectors`, `-V` | flag | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. | +| `--vectors-name`, `-vn` | option | Name to assign to the word vectors in the `meta.json`, e.g. `en_core_web_md.vectors`. | | **CREATES** | model | A spaCy model containing the vocab and vectors. | ## Evaluate {#evaluate new="2"} diff --git a/website/docs/api/cython-classes.md b/website/docs/api/cython-classes.md index 4d188d90f..77d6fdd10 100644 --- a/website/docs/api/cython-classes.md +++ b/website/docs/api/cython-classes.md @@ -45,9 +45,9 @@ Append a token to the `Doc`. The token can be provided as a > from spacy.vocab cimport Vocab > > doc = Doc(Vocab()) -> lexeme = doc.vocab.get(u'hello') +> lexeme = doc.vocab.get("hello") > doc.push_back(lexeme, True) -> assert doc.text == u'hello ' +> assert doc.text == "hello " > ``` | Name | Type | Description | @@ -164,7 +164,7 @@ vocabulary. > #### Example > > ```python -> lexeme = vocab.get(vocab.mem, u'hello') +> lexeme = vocab.get(vocab.mem, "hello") > ``` | Name | Type | Description | diff --git a/website/docs/api/cython-structs.md b/website/docs/api/cython-structs.md index 0e427a8d5..935bce25d 100644 --- a/website/docs/api/cython-structs.md +++ b/website/docs/api/cython-structs.md @@ -88,7 +88,7 @@ Find a token in a `TokenC*` array by the offset of its first character. > from spacy.tokens.doc cimport Doc, token_by_start > from spacy.vocab cimport Vocab > -> doc = Doc(Vocab(), words=[u'hello', u'world']) +> doc = Doc(Vocab(), words=["hello", "world"]) > assert token_by_start(doc.c, doc.length, 6) == 1 > assert token_by_start(doc.c, doc.length, 4) == -1 > ``` @@ -110,7 +110,7 @@ Find a token in a `TokenC*` array by the offset of its final character. 
> from spacy.tokens.doc cimport Doc, token_by_end > from spacy.vocab cimport Vocab > -> doc = Doc(Vocab(), words=[u'hello', u'world']) +> doc = Doc(Vocab(), words=["hello", "world"]) > assert token_by_end(doc.c, doc.length, 5) == 0 > assert token_by_end(doc.c, doc.length, 1) == -1 > ``` @@ -134,7 +134,7 @@ attribute, in order to make the parse tree navigation consistent. > from spacy.tokens.doc cimport Doc, set_children_from_heads > from spacy.vocab cimport Vocab > -> doc = Doc(Vocab(), words=[u'Baileys', u'from', u'a', u'shoe']) +> doc = Doc(Vocab(), words=["Baileys", "from", "a", "shoe"]) > doc.c[0].head = 0 > doc.c[1].head = 0 > doc.c[2].head = 3 diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md index 58acc4425..df0df3e38 100644 --- a/website/docs/api/dependencyparser.md +++ b/website/docs/api/dependencyparser.md @@ -58,7 +58,7 @@ and all pipeline components are applied to the `Doc` in order. Both > > ```python > parser = DependencyParser(nlp.vocab) -> doc = nlp(u"This is a sentence.") +> doc = nlp("This is a sentence.") > # This usually happens under the hood > processed = parser(doc) > ``` diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md index 431d3a092..ad684f51e 100644 --- a/website/docs/api/doc.md +++ b/website/docs/api/doc.md @@ -20,11 +20,11 @@ Construct a `Doc` object. The most common way to get a `Doc` object is via the > > ```python > # Construction 1 -> doc = nlp(u"Some text") +> doc = nlp("Some text") > > # Construction 2 > from spacy.tokens import Doc -> words = [u"hello", u"world", u"!"] +> words = ["hello", "world", "!"] > spaces = [True, False, False] > doc = Doc(nlp.vocab, words=words, spaces=spaces) > ``` @@ -45,7 +45,7 @@ Negative indexing is supported, and follows the usual Python semantics, i.e. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > assert doc[0].text == "Give" > assert doc[-1].text == "." 
> span = doc[1:3] @@ -76,8 +76,8 @@ Iterate over `Token` objects, from which the annotations can be easily accessed. > #### Example > > ```python -> doc = nlp(u'Give it back') -> assert [t.text for t in doc] == [u'Give', u'it', u'back'] +> doc = nlp("Give it back") +> assert [t.text for t in doc] == ["Give", "it", "back"] > ``` This is the main way of accessing [`Token`](/api/token) objects, which are the @@ -96,7 +96,7 @@ Get the number of tokens in the document. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > assert len(doc) == 7 > ``` @@ -114,9 +114,9 @@ details, see the documentation on > > ```python > from spacy.tokens import Doc -> city_getter = lambda doc: any(city in doc.text for city in ('New York', 'Paris', 'Berlin')) -> Doc.set_extension('has_city', getter=city_getter) -> doc = nlp(u'I like New York') +> city_getter = lambda doc: any(city in doc.text for city in ("New York", "Paris", "Berlin")) +> Doc.set_extension("has_city", getter=city_getter) +> doc = nlp("I like New York") > assert doc._.has_city > ``` @@ -192,8 +192,8 @@ the character indices don't map to a valid span. > #### Example > > ```python -> doc = nlp(u"I like New York") -> span = doc.char_span(7, 15, label=u"GPE") +> doc = nlp("I like New York") +> span = doc.char_span(7, 15, label="GPE") > assert span.text == "New York" > ``` @@ -213,8 +213,8 @@ using an average of word vectors. > #### Example > > ```python -> apples = nlp(u"I like apples") -> oranges = nlp(u"I like oranges") +> apples = nlp("I like apples") +> oranges = nlp("I like oranges") > apples_oranges = apples.similarity(oranges) > oranges_apples = oranges.similarity(apples) > assert apples_oranges == oranges_apples @@ -235,7 +235,7 @@ attribute ID. 
> > ```python > from spacy.attrs import ORTH -> doc = nlp(u"apple apple orange banana") +> doc = nlp("apple apple orange banana") > assert doc.count_by(ORTH) == {7024L: 1, 119552L: 1, 2087L: 2} > doc.to_array([ORTH]) > # array([[11880], [11880], [7561], [12800]]) @@ -255,7 +255,7 @@ ancestor is found, e.g. if span excludes a necessary ancestor. > #### Example > > ```python -> doc = nlp(u"This is a test") +> doc = nlp("This is a test") > matrix = doc.get_lca_matrix() > # array([[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 2, 3], [1, 1, 3, 3]], dtype=int32) > ``` @@ -274,7 +274,7 @@ They'll be added to an `"_"` key in the data, e.g. `"_": {"foo": "bar"}`. > #### Example > > ```python -> doc = nlp(u"Hello") +> doc = nlp("Hello") > json_doc = doc.to_json() > ``` > @@ -342,7 +342,7 @@ array of attributes. > ```python > from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA > from spacy.tokens import Doc -> doc = nlp(u"Hello world!") +> doc = nlp("Hello world!") > np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA]) > doc2 = Doc(doc.vocab, words=[t.text for t in doc]) > doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array) @@ -396,7 +396,7 @@ Serialize, i.e. export the document contents to a binary string. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > doc_bytes = doc.to_bytes() > ``` @@ -413,10 +413,9 @@ Deserialize, i.e. import the document contents from a binary string. > > ```python > from spacy.tokens import Doc -> text = u"Give it back! He pleaded." -> doc = nlp(text) -> bytes = doc.to_bytes() -> doc2 = Doc(doc.vocab).from_bytes(bytes) +> doc = nlp("Give it back! He pleaded.") +> doc_bytes = doc.to_bytes() +> doc2 = Doc(doc.vocab).from_bytes(doc_bytes) > assert doc.text == doc2.text > ``` @@ -457,9 +456,9 @@ dictionary mapping attribute names to values as the `"_"` key. 
> #### Example > > ```python -> doc = nlp(u"I like David Bowie") +> doc = nlp("I like David Bowie") > with doc.retokenize() as retokenizer: -> attrs = {"LEMMA": u"David Bowie"} +> attrs = {"LEMMA": "David Bowie"} > retokenizer.merge(doc[2:4], attrs=attrs) > ``` @@ -489,7 +488,7 @@ underlying lexeme (if they're context-independent lexical attributes like > #### Example > > ```python -> doc = nlp(u"I live in NewYork") +> doc = nlp("I live in NewYork") > with doc.retokenize() as retokenizer: > heads = [(doc[3], 1), doc[2]] > attrs = {"POS": ["PROPN", "PROPN"], @@ -521,9 +520,9 @@ and end token boundaries, the document remains unchanged. > #### Example > > ```python -> doc = nlp(u"Los Angeles start.") +> doc = nlp("Los Angeles start.") > doc.merge(0, len("Los Angeles"), "NNP", "Los Angeles", "GPE") -> assert [t.text for t in doc] == [u"Los Angeles", u"start", u"."] +> assert [t.text for t in doc] == ["Los Angeles", "start", "."] > ``` | Name | Type | Description | @@ -541,11 +540,11 @@ objects, if the entity recognizer has been applied. > #### Example > > ```python -> doc = nlp(u"Mr. Best flew to New York on Saturday morning.") +> doc = nlp("Mr. Best flew to New York on Saturday morning.") > ents = list(doc.ents) > assert ents[0].label == 346 -> assert ents[0].label_ == u"PERSON" -> assert ents[0].text == u"Mr. Best" +> assert ents[0].label_ == "PERSON" +> assert ents[0].text == "Mr. Best" > ``` | Name | Type | Description | @@ -563,10 +562,10 @@ relative clauses. > #### Example > > ```python -> doc = nlp(u"A phrase with another phrase occurs.") +> doc = nlp("A phrase with another phrase occurs.") > chunks = list(doc.noun_chunks) -> assert chunks[0].text == u"A phrase" -> assert chunks[1].text == u"another phrase" +> assert chunks[0].text == "A phrase" +> assert chunks[1].text == "another phrase" > ``` | Name | Type | Description | @@ -583,10 +582,10 @@ will be unavailable. > #### Example > > ```python -> doc = nlp(u"This is a sentence. 
Here's another...") +> doc = nlp("This is a sentence. Here's another...") > sents = list(doc.sents) > assert len(sents) == 2 -> assert [s.root.text for s in sents] == [u"is", u"'s"] +> assert [s.root.text for s in sents] == ["is", "'s"] > ``` | Name | Type | Description | @@ -600,7 +599,7 @@ A boolean value indicating whether a word vector is associated with the object. > #### Example > > ```python -> doc = nlp(u"I like apples") +> doc = nlp("I like apples") > assert doc.has_vector > ``` @@ -616,8 +615,8 @@ vectors. > #### Example > > ```python -> doc = nlp(u"I like apples") -> assert doc.vector.dtype == 'float32' +> doc = nlp("I like apples") +> assert doc.vector.dtype == "float32" > assert doc.vector.shape == (300,) > ``` @@ -632,8 +631,8 @@ The L2 norm of the document's vector representation. > #### Example > > ```python -> doc1 = nlp(u"I like apples") -> doc2 = nlp(u"I like oranges") +> doc1 = nlp("I like apples") +> doc2 = nlp("I like oranges") > doc1.vector_norm # 4.54232424414368 > doc2.vector_norm # 3.304373298575751 > assert doc1.vector_norm != doc2.vector_norm diff --git a/website/docs/api/docbin.md b/website/docs/api/docbin.md new file mode 100644 index 000000000..a4525906e --- /dev/null +++ b/website/docs/api/docbin.md @@ -0,0 +1,149 @@ +--- +title: DocBin +tag: class +new: 2.2 +teaser: Pack Doc objects for binary serialization +source: spacy/tokens/_serialize.py +--- + +The `DocBin` class lets you efficiently serialize the information from a +collection of `Doc` objects. You can control which information is serialized by +passing a list of attribute IDs, and optionally also specify whether the user +data is serialized. The `DocBin` is faster and produces smaller data sizes than +pickle, and allows you to deserialize without executing arbitrary Python code. A +notable downside to this format is that you can't easily extract just one +document from the `DocBin`. 
The serialization format is gzipped msgpack, where +the msgpack object has the following structure: + +```python +### msgpack object structure +{ + "attrs": List[uint64], # e.g. [TAG, HEAD, ENT_IOB, ENT_TYPE] + "tokens": bytes, # Serialized numpy uint64 array with the token data + "spaces": bytes, # Serialized numpy boolean array with spaces data + "lengths": bytes, # Serialized numpy int32 array with the doc lengths + "strings": List[unicode] # List of unique strings in the token data +} +``` + +Strings for the words, tags, labels etc. are represented by 64-bit hashes in the +token data, and every string that occurs at least once is passed via the strings +object. This means the storage is more efficient if you pack more documents +together, because you have less duplication in the strings. For usage examples, +see the docs on [serializing `Doc` objects](/usage/saving-loading#docs). + +## DocBin.\_\_init\_\_ {#init tag="method"} + +Create a `DocBin` object to hold serialized annotations. + +> #### Example +> +> ```python +> from spacy.tokens import DocBin +> doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"]) +> ``` + +| Argument | Type | Description | +| ----------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `attrs` | list | List of attributes to serialize. `orth` (hash of token text) and `spacy` (whether the token is followed by whitespace) are always serialized, so they're not required. Defaults to `None`. | +| `store_user_data` | bool | Whether to include the `Doc.user_data`. Defaults to `False`. | +| **RETURNS** | `DocBin` | The newly constructed object. | + +## DocBin.\_\_len\_\_ {#len tag="method"} + +Get the number of `Doc` objects that were added to the `DocBin`. 
+ +> #### Example +> +> ```python +> doc_bin = DocBin(attrs=["LEMMA"]) +> doc = nlp("This is a document to serialize.") +> doc_bin.add(doc) +> assert len(doc_bin) == 1 +> ``` + +| Argument | Type | Description | +| ----------- | ---- | ------------------------------------------- | +| **RETURNS** | int | The number of `Doc`s added to the `DocBin`. | + +## DocBin.add {#add tag="method"} + +Add a `Doc`'s annotations to the `DocBin` for serialization. + +> #### Example +> +> ```python +> doc_bin = DocBin(attrs=["LEMMA"]) +> doc = nlp("This is a document to serialize.") +> doc_bin.add(doc) +> ``` + +| Argument | Type | Description | +| -------- | ----- | ------------------------ | +| `doc` | `Doc` | The `Doc` object to add. | + +## DocBin.get_docs {#get_docs tag="method"} + +Recover `Doc` objects from the annotations, using the given vocab. + +> #### Example +> +> ```python +> docs = list(doc_bin.get_docs(nlp.vocab)) +> ``` + +| Argument | Type | Description | +| ---------- | ------- | ------------------ | +| `vocab` | `Vocab` | The shared vocab. | +| **YIELDS** | `Doc` | The `Doc` objects. | + +## DocBin.merge {#merge tag="method"} + +Extend the annotations of this `DocBin` with the annotations from another. Will +raise an error if the pre-defined attrs of the two `DocBin`s don't match. + +> #### Example +> +> ```python +> doc_bin1 = DocBin(attrs=["LEMMA", "POS"]) +> doc_bin1.add(nlp("Hello world")) +> doc_bin2 = DocBin(attrs=["LEMMA", "POS"]) +> doc_bin2.add(nlp("This is a sentence")) +> doc_bin1.merge(doc_bin2) +> assert len(doc_bin1) == 2 +> ``` + +| Argument | Type | Description | +| -------- | -------- | ------------------------------------------- | +| `other` | `DocBin` | The `DocBin` to merge into the current bin. | + +## DocBin.to_bytes {#to_bytes tag="method"} + +Serialize the `DocBin`'s annotations to a bytestring. 
+ +> #### Example +> +> ```python +> doc_bin = DocBin(attrs=["DEP", "HEAD"]) +> doc_bin_bytes = doc_bin.to_bytes() +> ``` + +| Argument | Type | Description | +| ----------- | ----- | ------------------------ | +| **RETURNS** | bytes | The serialized `DocBin`. | + +## DocBin.from_bytes {#from_bytes tag="method"} + +Deserialize the `DocBin`'s annotations from a bytestring. + +> #### Example +> +> ```python +> doc_bin_bytes = doc_bin.to_bytes() +> new_doc_bin = DocBin().from_bytes(doc_bin_bytes) +> ``` + +| Argument | Type | Description | +| ------------ | -------- | ---------------------- | +| `bytes_data` | bytes | The data to load from. | +| **RETURNS** | `DocBin` | The loaded `DocBin`. | diff --git a/website/docs/api/entitylinker.md b/website/docs/api/entitylinker.md new file mode 100644 index 000000000..88131761f --- /dev/null +++ b/website/docs/api/entitylinker.md @@ -0,0 +1,300 @@ +--- +title: EntityLinker +teaser: + Functionality to disambiguate a named entity in text to a unique knowledge + base identifier. +tag: class +source: spacy/pipeline/pipes.pyx +new: 2.2 +--- + +This class is a subclass of `Pipe` and follows the same API. The pipeline +component is available in the [processing pipeline](/usage/processing-pipelines) +via the ID `"entity_linker"`. + +## EntityLinker.Model {#model tag="classmethod"} + +Initialize a model for the pipe. The model should implement the +`thinc.neural.Model` API, and should contain a field `tok2vec` that contains the +context encoder. Wrappers are under development for most major machine learning +libraries. + +| Name | Type | Description | +| ----------- | ------ | ------------------------------------- | +| `**kwargs` | - | Parameters for initializing the model | +| **RETURNS** | object | The initialized model. | + +## EntityLinker.\_\_init\_\_ {#init tag="method"} + +Create a new pipeline instance. 
In your application, you would normally use a +shortcut for this and instantiate the component using its string name and +[`nlp.create_pipe`](/api/language#create_pipe). + +> #### Example +> +> ```python +> # Construction via create_pipe +> entity_linker = nlp.create_pipe("entity_linker") +> +> # Construction from class +> from spacy.pipeline import EntityLinker +> entity_linker = EntityLinker(nlp.vocab) +> entity_linker.from_disk("/path/to/model") +> ``` + +| Name | Type | Description | +| -------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | `Vocab` | The shared vocabulary. | +| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. | +| `hidden_width` | int | Width of the hidden layer of the entity linking model. Defaults to `128`. | +| `incl_prior` | bool | Whether or not to include prior probabilities in the model. Defaults to `True`. | +| `incl_context` | bool | Whether or not to include the local context in the model (if not, only the prior probabilities are used). Defaults to `True`. | +| **RETURNS** | `EntityLinker` | The newly constructed object. | + +## EntityLinker.\_\_call\_\_ {#call tag="method"} + +Apply the pipe to one document. The document is modified in place, and returned. +This usually happens under the hood when the `nlp` object is called on a text +and all pipeline components are applied to the `Doc` in order. Both +[`__call__`](/api/entitylinker#call) and [`pipe`](/api/entitylinker#pipe) +delegate to the [`predict`](/api/entitylinker#predict) and +[`set_annotations`](/api/entitylinker#set_annotations) methods. 
+ +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> doc = nlp("This is a sentence.") +> # This usually happens under the hood +> processed = entity_linker(doc) +> ``` + +| Name | Type | Description | +| ----------- | ----- | ------------------------ | +| `doc` | `Doc` | The document to process. | +| **RETURNS** | `Doc` | The processed document. | + +## EntityLinker.pipe {#pipe tag="method"} + +Apply the pipe to a stream of documents. This usually happens under the hood +when the `nlp` object is called on a text and all pipeline components are +applied to the `Doc` in order. Both [`__call__`](/api/entitylinker#call) and +[`pipe`](/api/entitylinker#pipe) delegate to the +[`predict`](/api/entitylinker#predict) and +[`set_annotations`](/api/entitylinker#set_annotations) methods. + +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> for doc in entity_linker.pipe(docs, batch_size=50): +> pass +> ``` + +| Name | Type | Description | +| ------------ | -------- | ------------------------------------------------------ | +| `stream` | iterable | A stream of documents. | +| `batch_size` | int | The number of texts to buffer. Defaults to `128`. | +| **YIELDS** | `Doc` | Processed documents in the order of the original text. | + +## EntityLinker.predict {#predict tag="method"} + +Apply the pipeline's model to a batch of docs, without modifying them. + +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> kb_ids, tensors = entity_linker.predict([doc1, doc2]) +> ``` + +| Name | Type | Description | +| ----------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `docs` | iterable | The documents to predict. 
| +| **RETURNS** | tuple | A `(kb_ids, tensors)` tuple where `kb_ids` are the model's predicted KB identifiers for the entities in the `docs`, and `tensors` are the token representations used to predict these identifiers. | + +## EntityLinker.set_annotations {#set_annotations tag="method"} + +Modify a batch of documents, using pre-computed entity IDs for a list of named +entities. + +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> kb_ids, tensors = entity_linker.predict([doc1, doc2]) +> entity_linker.set_annotations([doc1, doc2], kb_ids, tensors) +> ``` + +| Name | Type | Description | +| --------- | -------- | ------------------------------------------------------------------------------------------------- | +| `docs` | iterable | The documents to modify. | +| `kb_ids` | iterable | The knowledge base identifiers for the entities in the docs, predicted by `EntityLinker.predict`. | +| `tensors` | iterable | The token representations used to predict the identifiers. | + +## EntityLinker.update {#update tag="method"} + +Learn from a batch of documents and gold-standard information, updating both the +pipe's entity linking model and context encoder. Delegates to +[`predict`](/api/entitylinker#predict) and +[`get_loss`](/api/entitylinker#get_loss). + +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> losses = {} +> optimizer = nlp.begin_training() +> entity_linker.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer) +> ``` + +| Name | Type | Description | +| -------- | -------- | ------------------------------------------------------------------------------------------------------- | +| `docs` | iterable | A batch of documents to learn from. | +| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | +| `drop` | float | The dropout rate, used both for the EL model and the context encoder. | +| `sgd` | callable | The optimizer for the EL model. 
Should take two arguments `weights` and `gradient`, and an optional ID. | +| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. | + +## EntityLinker.get_loss {#get_loss tag="method"} + +Find the loss and gradient of loss for the entities in a batch of documents and +their predicted scores. + +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> kb_ids, tensors = entity_linker.predict(docs) +> loss, d_loss = entity_linker.get_loss(docs, [gold1, gold2], kb_ids, tensors) +> ``` + +| Name | Type | Description | +| ----------- | -------- | ------------------------------------------------------------ | +| `docs` | iterable | The batch of documents. | +| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | +| `kb_ids` | iterable | KB identifiers representing the model's predictions. | +| `tensors` | iterable | The token representations used to predict the identifiers | +| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. | + +## EntityLinker.set_kb {#set_kb tag="method"} + +Define the knowledge base (KB) used for disambiguating named entities to KB +identifiers. + +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> entity_linker.set_kb(kb) +> ``` + +| Name | Type | Description | +| ---- | --------------- | ------------------------------- | +| `kb` | `KnowledgeBase` | The [`KnowledgeBase`](/api/kb). | + +## EntityLinker.begin_training {#begin_training tag="method"} + +Initialize the pipe for training, using data examples if available. If no model +has been initialized yet, the model is added. Before calling this method, a +knowledge base should have been defined with +[`set_kb`](/api/entitylinker#set_kb). 
+ +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> entity_linker.set_kb(kb) +> nlp.add_pipe(entity_linker, last=True) +> optimizer = entity_linker.begin_training(pipeline=nlp.pipeline) +> ``` + +| Name | Type | Description | +| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. | +| `pipeline` | list | Optional list of pipeline components that this component is part of. | +| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`EntityLinker`](/api/entitylinker#create_optimizer) if not set. | +| **RETURNS** | callable | An optimizer. | + +## EntityLinker.create_optimizer {#create_optimizer tag="method"} + +Create an optimizer for the pipeline component. + +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> optimizer = entity_linker.create_optimizer() +> ``` + +| Name | Type | Description | +| ----------- | -------- | -------------- | +| **RETURNS** | callable | The optimizer. | + +## EntityLinker.use_params {#use_params tag="method, contextmanager"} + +Modify the pipe's EL model, to use the given parameter values. + +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> with entity_linker.use_params(optimizer.averages): +> entity_linker.to_disk("/best_model") +> ``` + +| Name | Type | Description | +| -------- | ---- | ---------------------------------------------------------------------------------------------------------- | +| `params` | dict | The parameter values to use in the model. At the end of the context, the original parameters are restored. 
| + +## EntityLinker.to_disk {#to_disk tag="method"} + +Serialize the pipe to disk. + +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> entity_linker.to_disk("/path/to/entity_linker") +> ``` + +| Name | Type | Description | +| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | +| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | +| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | + +## EntityLinker.from_disk {#from_disk tag="method"} + +Load the pipe from disk. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> entity_linker = EntityLinker(nlp.vocab) +> entity_linker.from_disk("/path/to/entity_linker") +> ``` + +| Name | Type | Description | +| ----------- | ---------------- | -------------------------------------------------------------------------- | +| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | +| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | +| **RETURNS** | `EntityLinker` | The modified `EntityLinker` object. | + +## Serialization fields {#serialization-fields} + +During serialization, spaCy will export several data fields used to restore +different aspects of the object. If needed, you can exclude them from +serialization by passing in the string names via the `exclude` argument. + +> #### Example +> +> ```python +> data = entity_linker.to_disk("/path", exclude=["vocab"]) +> ``` + +| Name | Description | +| ------- | -------------------------------------------------------------- | +| `vocab` | The shared [`Vocab`](/api/vocab). | +| `cfg` | The config file. You usually don't want to exclude this. | +| `model` | The binary model data. 
You usually don't want to exclude this. | +| `kb` | The knowledge base. You usually don't want to exclude this. | diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md index 7279a7f77..9a2766c07 100644 --- a/website/docs/api/entityrecognizer.md +++ b/website/docs/api/entityrecognizer.md @@ -58,7 +58,7 @@ and all pipeline components are applied to the `Doc` in order. Both > > ```python > ner = EntityRecognizer(nlp.vocab) -> doc = nlp(u"This is a sentence.") +> doc = nlp("This is a sentence.") > # This usually happens under the hood > processed = ner(doc) > ``` @@ -99,7 +99,7 @@ Apply the pipeline's model to a batch of docs, without modifying them. > > ```python > ner = EntityRecognizer(nlp.vocab) -> scores = ner.predict([doc1, doc2]) +> scores, tensors = ner.predict([doc1, doc2]) > ``` | Name | Type | Description | @@ -115,14 +115,15 @@ Modify a batch of documents, using pre-computed scores. > > ```python > ner = EntityRecognizer(nlp.vocab) -> scores = ner.predict([doc1, doc2]) -> ner.set_annotations([doc1, doc2], scores) +> scores, tensors = ner.predict([doc1, doc2]) +> ner.set_annotations([doc1, doc2], scores, tensors) > ``` -| Name | Type | Description | -| -------- | -------- | ---------------------------------------------------------- | -| `docs` | iterable | The documents to modify. | -| `scores` | - | The scores to set, produced by `EntityRecognizer.predict`. | +| Name | Type | Description | +| --------- | -------- | ---------------------------------------------------------- | +| `docs` | iterable | The documents to modify. | +| `scores` | - | The scores to set, produced by `EntityRecognizer.predict`. | +| `tensors` | iterable | The token representations used to predict the scores. | ## EntityRecognizer.update {#update tag="method"} @@ -210,13 +211,13 @@ Modify the pipe's model, to use the given parameter values. 
> > ```python > ner = EntityRecognizer(nlp.vocab) -> with ner.use_params(): +> with ner.use_params(optimizer.averages): > ner.to_disk("/best_model") > ``` | Name | Type | Description | | -------- | ---- | ---------------------------------------------------------------------------------------------------------- | -| `params` | - | The parameter values to use in the model. At the end of the context, the original parameters are restored. | +| `params` | dict | The parameter values to use in the model. At the end of the context, the original parameters are restored. | ## EntityRecognizer.add_label {#add_label tag="method"} diff --git a/website/docs/api/entityruler.md b/website/docs/api/entityruler.md index 006ba90e6..5b93fceac 100644 --- a/website/docs/api/entityruler.md +++ b/website/docs/api/entityruler.md @@ -10,7 +10,9 @@ token-based rules or exact phrase matches. It can be combined with the statistical [`EntityRecognizer`](/api/entityrecognizer) to boost accuracy, or used on its own to implement a purely rule-based entity recognition system. After initialization, the component is typically added to the processing -pipeline using [`nlp.add_pipe`](/api/language#add_pipe). +pipeline using [`nlp.add_pipe`](/api/language#add_pipe). For usage examples, see +the docs on +[rule-based entity recognition](/usage/rule-based-matching#entityruler). ## EntityRuler.\_\_init\_\_ {#init tag="method"} diff --git a/website/docs/api/goldparse.md b/website/docs/api/goldparse.md index 5a2d8a110..2dd24316f 100644 --- a/website/docs/api/goldparse.md +++ b/website/docs/api/goldparse.md @@ -23,6 +23,7 @@ gradient for those labels will be zero. | `deps` | iterable | A sequence of strings, representing the syntactic relation types. | | `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. 
| | `cats` | dict | Labels for text classification. Each key in the dictionary may be a string or an int, or a `(start_char, end_char, label)` tuple, indicating that the label is applied to only part of the document (usually a sentence). | +| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either 1.0 (positive) or 0.0 (negative). | | **RETURNS** | `GoldParse` | The newly constructed object. | ## GoldParse.\_\_len\_\_ {#len tag="method"} @@ -43,16 +44,17 @@ Whether the provided syntactic annotations form a projective dependency tree. ## Attributes {#attributes} -| Name | Type | Description | -| --------------------------------- | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `words` | list | The words. | -| `tags` | list | The part-of-speech tag annotations. | -| `heads` | list | The syntactic head annotations. | -| `labels` | list | The syntactic relation-type annotations. | -| `ner` | list | The named entity annotations as BILUO tags. | -| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. | -| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. | -| `cats` 2 | list | Entries in the list should be either a label, or a `(start, end, label)` triple. The tuple form is used for categories applied to spans of the document. | +| Name | Type | Description | +| ------------------------------------ | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `words` | list | The words. | +| `tags` | list | The part-of-speech tag annotations. | +| `heads` | list | The syntactic head annotations. 
| +| `labels` | list | The syntactic relation-type annotations. | +| `ner` | list | The named entity annotations as BILUO tags. | +| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. | +| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. | +| `cats` 2 | list | Entries in the list should be either a label, or a `(start, end, label)` triple. The tuple form is used for categories applied to spans of the document. | +| `links` 2.2 | dict | Keys in the dictionary are `(start_char, end_char)` tuples, and the values are dictionaries with `kb_id:value` entries. | ## Utilities {#util} @@ -67,7 +69,7 @@ Convert a list of Doc objects into the > ```python > from spacy.gold import docs_to_json > -> doc = nlp(u"I like London") +> doc = nlp("I like London") > json_data = docs_to_json([doc]) > ``` @@ -148,7 +150,7 @@ single-token entity. > ```python > from spacy.gold import biluo_tags_from_offsets > -> doc = nlp(u"I like London.") +> doc = nlp("I like London.") > entities = [(7, 13, "LOC")] > tags = biluo_tags_from_offsets(doc, entities) > assert tags == ["O", "O", "U-LOC", "O"] @@ -170,7 +172,7 @@ entity offsets. > ```python > from spacy.gold import offsets_from_biluo_tags > -> doc = nlp(u"I like London.") +> doc = nlp("I like London.") > tags = ["O", "O", "U-LOC", "O"] > entities = offsets_from_biluo_tags(doc, tags) > assert entities == [(7, 13, "LOC")] @@ -193,7 +195,7 @@ token-based tags, e.g. to overwrite the `doc.ents`. 
> ```python > from spacy.gold import spans_from_biluo_tags > -> doc = nlp(u"I like London.") +> doc = nlp("I like London.") > tags = ["O", "O", "U-LOC", "O"] > doc.ents = spans_from_biluo_tags(doc, tags) > ``` diff --git a/website/docs/api/kb.md b/website/docs/api/kb.md new file mode 100644 index 000000000..639ababb6 --- /dev/null +++ b/website/docs/api/kb.md @@ -0,0 +1,268 @@ +--- +title: KnowledgeBase +teaser: A storage class for entities and aliases of a specific knowledge base (ontology) +tag: class +source: spacy/kb.pyx +new: 2.2 +--- + +The `KnowledgeBase` object provides a method to generate [`Candidate`](/api/kb/#candidate_init) +objects, which are plausible external identifiers given a certain textual mention. +Each such `Candidate` holds information from the relevant KB entities, +such as its frequency in text and possible aliases. +Each entity in the knowledge base also has a pre-trained entity vector of a fixed size. + +## KnowledgeBase.\_\_init\_\_ {#init tag="method"} + +Create the knowledge base. + +> #### Example +> +> ```python +> from spacy.kb import KnowledgeBase +> vocab = nlp.vocab +> kb = KnowledgeBase(vocab=vocab, entity_vector_length=64) +> ``` + +| Name | Type | Description | +| ----------------------- | ---------------- | ----------------------------------------- | +| `vocab` | `Vocab` | A `Vocab` object. | +| `entity_vector_length` | int | Length of the fixed-size entity vectors. | +| **RETURNS** | `KnowledgeBase` | The newly constructed object. | + + +## KnowledgeBase.entity_vector_length {#entity_vector_length tag="property"} + +The length of the fixed-size entity vectors in the knowledge base. + +| Name | Type | Description | +| ----------- | ---- | ----------------------------------------- | +| **RETURNS** | int | Length of the fixed-size entity vectors. 
| + +## KnowledgeBase.add_entity {#add_entity tag="method"} + +Add an entity to the knowledge base, specifying its corpus frequency +and entity vector, which should be of length [`entity_vector_length`](/api/kb#entity_vector_length). + +> #### Example +> +> ```python +> kb.add_entity(entity="Q42", freq=32, entity_vector=vector1) +> kb.add_entity(entity="Q463035", freq=111, entity_vector=vector2) +> ``` + +| Name | Type | Description | +| --------------- | ------------- | ------------------------------------------------- | +| `entity` | unicode | The unique entity identifier | +| `freq` | float | The frequency of the entity in a typical corpus | +| `entity_vector` | vector | The pre-trained vector of the entity | + +## KnowledgeBase.set_entities {#set_entities tag="method"} + +Define the full list of entities in the knowledge base, specifying the corpus frequency +and entity vector for each entity. + +> #### Example +> +> ```python +> kb.set_entities(entity_list=["Q42", "Q463035"], freq_list=[32, 111], vector_list=[vector1, vector2]) +> ``` + +| Name | Type | Description | +| ------------- | ------------- | ------------------------------------------------- | +| `entity_list` | iterable | List of unique entity identifiers | +| `freq_list` | iterable | List of entity frequencies | +| `vector_list` | iterable | List of entity vectors | + +## KnowledgeBase.add_alias {#add_alias tag="method"} + +Add an alias or mention to the knowledge base, specifying its potential KB identifiers +and their prior probabilities. The entity identifiers should refer to entities previously +added with [`add_entity`](/api/kb#add_entity) or [`set_entities`](/api/kb#set_entities). +The sum of the prior probabilities should not exceed 1. 
+ +> #### Example +> +> ```python +> kb.add_alias(alias="Douglas", entities=["Q42", "Q463035"], probabilities=[0.6, 0.3]) +> ``` + +| Name | Type | Description | +| -------------- | ------------- | -------------------------------------------------- | +| `alias` | unicode | The textual mention or alias | +| `entities` | iterable | The potential entities that the alias may refer to | +| `probabilities`| iterable | The prior probabilities of each entity | + +## KnowledgeBase.\_\_len\_\_ {#len tag="method"} + +Get the total number of entities in the knowledge base. + +> #### Example +> +> ```python +> total_entities = len(kb) +> ``` + +| Name | Type | Description | +| ----------- | ---- | --------------------------------------------- | +| **RETURNS** | int | The number of entities in the knowledge base. | + +## KnowledgeBase.get_entity_strings {#get_entity_strings tag="method"} + +Get a list of all entity IDs in the knowledge base. + +> #### Example +> +> ```python +> all_entities = kb.get_entity_strings() +> ``` + +| Name | Type | Description | +| ----------- | ---- | --------------------------------------------- | +| **RETURNS** | list | The list of entities in the knowledge base. | + +## KnowledgeBase.get_size_aliases {#get_size_aliases tag="method"} + +Get the total number of aliases in the knowledge base. + +> #### Example +> +> ```python +> total_aliases = kb.get_size_aliases() +> ``` + +| Name | Type | Description | +| ----------- | ---- | --------------------------------------------- | +| **RETURNS** | int | The number of aliases in the knowledge base. | + +## KnowledgeBase.get_alias_strings {#get_alias_strings tag="method"} + +Get a list of all aliases in the knowledge base. + +> #### Example +> +> ```python +> all_aliases = kb.get_alias_strings() +> ``` + +| Name | Type | Description | +| ----------- | ---- | --------------------------------------------- | +| **RETURNS** | list | The list of aliases in the knowledge base. 
| + +## KnowledgeBase.get_candidates {#get_candidates tag="method"} + +Given a certain textual mention as input, retrieve a list of candidate entities +of type [`Candidate`](/api/kb/#candidate_init). + +> #### Example +> +> ```python +> candidates = kb.get_candidates("Douglas") +> ``` + +| Name | Type | Description | +| ------------- | ------------- | -------------------------------------------------- | +| `alias` | unicode | The textual mention or alias | +| **RETURNS** | iterable | The list of relevant `Candidate` objects | + +## KnowledgeBase.get_vector {#get_vector tag="method"} + +Given a certain entity ID, retrieve its pre-trained entity vector. + +> #### Example +> +> ```python +> vector = kb.get_vector("Q42") +> ``` + +| Name | Type | Description | +| ------------- | ------------- | -------------------------------------------------- | +| `entity` | unicode | The entity ID | +| **RETURNS** | vector | The entity vector | + +## KnowledgeBase.get_prior_prob {#get_prior_prob tag="method"} + +Given a certain entity ID and a certain textual mention, retrieve +the prior probability of the fact that the mention links to the entity ID. + +> #### Example +> +> ```python +> probability = kb.get_prior_prob("Q42", "Douglas") +> ``` + +| Name | Type | Description | +| ------------- | ------------- | --------------------------------------------------------------- | +| `entity` | unicode | The entity ID | +| `alias` | unicode | The textual mention or alias | +| **RETURNS** | float | The prior probability of the `alias` referring to the `entity` | + +## KnowledgeBase.dump {#dump tag="method"} + +Save the current state of the knowledge base to a directory. 
+ +> #### Example +> +> ```python +> kb.dump(loc) +> ``` + +| Name | Type | Description | +| ------------- | ---------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `loc` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | + +## KnowledgeBase.load_bulk {#load_bulk tag="method"} + +Restore the state of the knowledge base from a given directory. Note that the [`Vocab`](/api/vocab) +should also be the same as the one used to create the KB. + +> #### Example +> +> ```python +> from spacy.kb import KnowledgeBase +> from spacy.vocab import Vocab +> vocab = Vocab().from_disk("/path/to/vocab") +> kb = KnowledgeBase(vocab=vocab, entity_vector_length=64) +> kb.load_bulk("/path/to/kb") +> ``` + + +| Name | Type | Description | +| ----------- | ---------------- | ----------------------------------------------------------------------------------------- | +| `loc` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | +| **RETURNS** | `KnowledgeBase` | The modified `KnowledgeBase` object. | + + +## Candidate.\_\_init\_\_ {#candidate_init tag="method"} + +Construct a `Candidate` object. Usually this constructor is not called directly, +but instead these objects are returned by the [`get_candidates`](/api/kb#get_candidates) method +of a `KnowledgeBase`. + +> #### Example +> +> ```python +> from spacy.kb import Candidate +> candidate = Candidate(kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob) +> ``` + +| Name | Type | Description | +| ------------- | --------------- | -------------------------------------------------------------- | +| `kb` | `KnowledgeBase` | The knowledge base that defined this candidate. | +| `entity_hash` | int | The hash of the entity's KB ID. | +| `entity_freq` | float | The entity frequency as recorded in the KB. 
| +| `alias_hash` | int | The hash of the textual mention or alias. | +| `prior_prob` | float | The prior probability of the `alias` referring to the `entity` | +| **RETURNS** | `Candidate` | The newly constructed object. | + +## Candidate attributes {#candidate_attributes} + +| Name | Type | Description | +| ---------------------- | ------------ | ------------------------------------------------------------------ | +| `entity` | int | The entity's unique KB identifier | +| `entity_` | unicode | The entity's unique KB identifier | +| `alias` | int | The alias or textual mention | +| `alias_` | unicode | The alias or textual mention | +| `prior_prob` | long | The prior probability of the `alias` referring to the `entity` | +| `entity_freq` | long | The frequency of the entity in a typical corpus | +| `entity_vector` | vector | The pre-trained vector of the entity | diff --git a/website/docs/api/language.md b/website/docs/api/language.md index 3fcdeb195..c44339ff5 100644 --- a/website/docs/api/language.md +++ b/website/docs/api/language.md @@ -45,7 +45,7 @@ contain arbitrary whitespace. Alignment into the original string is preserved. > #### Example > > ```python -> doc = nlp(u"An example sentence. Another sentence.") +> doc = nlp("An example sentence. Another sentence.") > assert (doc[0].text, doc[0].head.tag_) == ("An", "NN") > ``` @@ -61,8 +61,8 @@ Pipeline components to prevent from being loaded can now be added as a list to `disable`, instead of specifying one keyword argument per component. ```diff -- doc = nlp(u"I don't want parsed", parse=False) -+ doc = nlp(u"I don't want parsed", disable=["parser"]) +- doc = nlp("I don't want parsed", parse=False) ++ doc = nlp("I don't want parsed", disable=["parser"]) ``` @@ -86,7 +86,7 @@ multiprocessing. 
> #### Example > > ```python -> texts = [u"One document.", u"...", u"Lots of documents"] +> texts = ["One document.", "...", "Lots of documents"] > for doc in nlp.pipe(texts, batch_size=50): > assert doc.is_parsed > ``` @@ -140,6 +140,7 @@ Evaluate a model's pipeline components. | `batch_size` | int | The batch size to use. | | `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. | | `component_cfg` 2.1 | dict | Config parameters for specific pipeline components, keyed by component name. | +| **RETURNS** | Scorer | The scorer containing the evaluation scores. | ## Language.begin_training {#begin_training tag="method"} @@ -443,15 +444,16 @@ per component. ## Attributes {#attributes} -| Name | Type | Description | -| --------------------------------------- | ------------------ | ----------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | A container for the lexical types. | -| `tokenizer` | `Tokenizer` | The tokenizer. | -| `make_doc` | `lambda text: Doc` | Create a `Doc` object from unicode text. | -| `pipeline` | list | List of `(name, component)` tuples describing the current processing pipeline, in order. | -| `pipe_names` 2 | list | List of pipeline component names, in order. | -| `meta` | dict | Custom meta data for the Language class. If a model is loaded, contains meta data of the model. | -| `path` 2 | `Path` | Path to the model data directory, if a model is loaded. Otherwise `None`. | +| Name | Type | Description | +| ------------------------------------------ | ----------- | ----------------------------------------------------------------------------------------------- | +| `vocab` | `Vocab` | A container for the lexical types. | +| `tokenizer` | `Tokenizer` | The tokenizer. | +| `make_doc` | `callable` | Callable that takes a unicode text and returns a `Doc`. 
| +| `pipeline` | list | List of `(name, component)` tuples describing the current processing pipeline, in order. | +| `pipe_names` 2 | list | List of pipeline component names, in order. | +| `pipe_labels` 2.2 | dict | List of labels set by the pipeline components, if available, keyed by component name. | +| `meta` | dict | Custom meta data for the Language class. If a model is loaded, contains meta data of the model. | +| `path` 2 | `Path` | Path to the model data directory, if a model is loaded. Otherwise `None`. | ## Class attributes {#class-attributes} diff --git a/website/docs/api/lemmatizer.md b/website/docs/api/lemmatizer.md index 7bc2691e5..805e96b0f 100644 --- a/website/docs/api/lemmatizer.md +++ b/website/docs/api/lemmatizer.md @@ -35,10 +35,10 @@ Lemmatize a string. > > ```python > from spacy.lemmatizer import Lemmatizer -> from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES -> lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES) -> lemmas = lemmatizer(u"ducks", u"NOUN") -> assert lemmas == [u"duck"] +> rules = {"noun": [["s", ""]]} +> lemmatizer = Lemmatizer(index={}, exceptions={}, rules=rules) +> lemmas = lemmatizer("ducks", "NOUN") +> assert lemmas == ["duck"] > ``` | Name | Type | Description | @@ -52,21 +52,22 @@ Lemmatize a string. Look up a lemma in the lookup table, if available. If no lemma is found, the original string is returned. Languages can provide a -[lookup table](/usage/adding-languages#lemmatizer) via the `lemma_lookup` -variable, set on the individual `Language` class. +[lookup table](/usage/adding-languages#lemmatizer) via the `resources`, set on +the individual `Language` class. 
> #### Example > > ```python -> lookup = {u"going": u"go"} +> lookup = {"going": "go"} > lemmatizer = Lemmatizer(lookup=lookup) -> assert lemmatizer.lookup(u"going") == u"go" +> assert lemmatizer.lookup("going") == "go" > ``` -| Name | Type | Description | -| ----------- | ------- | ----------------------------------------------------------------- | -| `string` | unicode | The string to look up. | -| **RETURNS** | unicode | The lemma if the string was found, otherwise the original string. | +| Name | Type | Description | +| ----------- | ------- | ----------------------------------------------------------------------------------------------------------- | +| `string` | unicode | The string to look up. | +| `orth` | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`. | +| **RETURNS** | unicode | The lemma if the string was found, otherwise the original string. | ## Lemmatizer.is_base_form {#is_base_form tag="method"} diff --git a/website/docs/api/lexeme.md b/website/docs/api/lexeme.md index 018dc72d8..398b71708 100644 --- a/website/docs/api/lexeme.md +++ b/website/docs/api/lexeme.md @@ -27,7 +27,7 @@ Change the value of a boolean flag. > > ```python > COOL_FLAG = nlp.vocab.add_flag(lambda text: False) -> nlp.vocab[u'spaCy'].set_flag(COOL_FLAG, True) +> nlp.vocab["spaCy"].set_flag(COOL_FLAG, True) > ``` | Name | Type | Description | @@ -42,9 +42,9 @@ Check the value of a boolean flag. > #### Example > > ```python -> is_my_library = lambda text: text in [u"spaCy", u"Thinc"] +> is_my_library = lambda text: text in ["spaCy", "Thinc"] > MY_LIBRARY = nlp.vocab.add_flag(is_my_library) -> assert nlp.vocab[u"spaCy"].check_flag(MY_LIBRARY) == True +> assert nlp.vocab["spaCy"].check_flag(MY_LIBRARY) == True > ``` | Name | Type | Description | @@ -59,8 +59,8 @@ Compute a semantic similarity estimate. Defaults to cosine over vectors. 
> #### Example > > ```python -> apple = nlp.vocab[u"apple"] -> orange = nlp.vocab[u"orange"] +> apple = nlp.vocab["apple"] +> orange = nlp.vocab["orange"] > apple_orange = apple.similarity(orange) > orange_apple = orange.similarity(apple) > assert apple_orange == orange_apple @@ -78,7 +78,7 @@ A boolean value indicating whether a word vector is associated with the lexeme. > #### Example > > ```python -> apple = nlp.vocab[u"apple"] +> apple = nlp.vocab["apple"] > assert apple.has_vector > ``` @@ -93,7 +93,7 @@ A real-valued meaning representation. > #### Example > > ```python -> apple = nlp.vocab[u"apple"] +> apple = nlp.vocab["apple"] > assert apple.vector.dtype == "float32" > assert apple.vector.shape == (300,) > ``` @@ -109,8 +109,8 @@ The L2 norm of the lexeme's vector representation. > #### Example > > ```python -> apple = nlp.vocab[u"apple"] -> pasta = nlp.vocab[u"pasta"] +> apple = nlp.vocab["apple"] +> pasta = nlp.vocab["pasta"] > apple.vector_norm # 7.1346845626831055 > pasta.vector_norm # 7.759851932525635 > assert apple.vector_norm != pasta.vector_norm diff --git a/website/docs/api/lookups.md b/website/docs/api/lookups.md new file mode 100644 index 000000000..9878546ea --- /dev/null +++ b/website/docs/api/lookups.md @@ -0,0 +1,318 @@ +--- +title: Lookups +teaser: A container for large lookup tables and dictionaries +tag: class +source: spacy/lookups.py +new: 2.2 +--- + +This class allows convenient access to large lookup tables and dictionaries, +e.g. lemmatization data or tokenizer exception lists, using Bloom filters. +Lookups are available via the [`Vocab`](/api/vocab) as `vocab.lookups`, so they +can be accessed before the pipeline components are applied (e.g. in the +tokenizer and lemmatizer), as well as within the pipeline components via +`doc.vocab.lookups`. + +## Lookups.\_\_init\_\_ {#init tag="method"} + +Create a `Lookups` object. 
+ +> #### Example +> +> ```python +> from spacy.lookups import Lookups +> lookups = Lookups() +> ``` + +| Name | Type | Description | +| ----------- | --------- | ----------------------------- | +| **RETURNS** | `Lookups` | The newly constructed object. | + +## Lookups.\_\_len\_\_ {#len tag="method"} + +Get the current number of tables in the lookups. + +> #### Example +> +> ```python +> lookups = Lookups() +> assert len(lookups) == 0 +> ``` + +| Name | Type | Description | +| ----------- | ---- | ------------------------------------ | +| **RETURNS** | int | The number of tables in the lookups. | + +## Lookups.\_\_contains\_\_ {#contains tag="method"} + +Check if the lookups contain a table of a given name. Delegates to +[`Lookups.has_table`](/api/lookups#has_table). + +> #### Example +> +> ```python +> lookups = Lookups() +> lookups.add_table("some_table") +> assert "some_table" in lookups +> ``` + +| Name | Type | Description | +| ----------- | ------- | ----------------------------------------------- | +| `name` | unicode | Name of the table. | +| **RETURNS** | bool | Whether a table of that name is in the lookups. | + +## Lookups.tables {#tables tag="property"} + +Get the names of all tables in the lookups. + +> #### Example +> +> ```python +> lookups = Lookups() +> lookups.add_table("some_table") +> assert lookups.tables == ["some_table"] +> ``` + +| Name | Type | Description | +| ----------- | ---- | ----------------------------------- | +| **RETURNS** | list | Names of the tables in the lookups. | + +## Lookups.add_table {#add_table tag="method"} + +Add a new table with optional data to the lookups. Raises an error if the table +already exists. + +> #### Example +> +> ```python +> lookups = Lookups() +> lookups.add_table("some_table", {"foo": "bar"}) +> ``` + +| Name | Type | Description | +| ----------- | ----------------------------- | ---------------------------------- | +| `name` | unicode | Unique name of the table. 
| +| `data` | dict | Optional data to add to the table. | +| **RETURNS** | [`Table`](/api/lookups#table) | The newly added table. | + +## Lookups.get_table {#get_table tag="method"} + +Get a table from the lookups. Raises an error if the table doesn't exist. + +> #### Example +> +> ```python +> lookups = Lookups() +> lookups.add_table("some_table", {"foo": "bar"}) +> table = lookups.get_table("some_table") +> assert table["foo"] == "bar" +> ``` + +| Name | Type | Description | +| ----------- | ----------------------------- | ------------------ | +| `name` | unicode | Name of the table. | +| **RETURNS** | [`Table`](/api/lookups#table) | The table. | + +## Lookups.remove_table {#remove_table tag="method"} + +Remove a table from the lookups. Raises an error if the table doesn't exist. + +> #### Example +> +> ```python +> lookups = Lookups() +> lookups.add_table("some_table") +> removed_table = lookups.remove_table("some_table") +> assert "some_table" not in lookups +> ``` + +| Name | Type | Description | +| ----------- | ----------------------------- | ---------------------------- | +| `name` | unicode | Name of the table to remove. | +| **RETURNS** | [`Table`](/api/lookups#table) | The removed table. | + +## Lookups.has_table {#has_table tag="method"} + +Check if the lookups contain a table of a given name. Equivalent to +[`Lookups.__contains__`](/api/lookups#contains). + +> #### Example +> +> ```python +> lookups = Lookups() +> lookups.add_table("some_table") +> assert lookups.has_table("some_table") +> ``` + +| Name | Type | Description | +| ----------- | ------- | ----------------------------------------------- | +| `name` | unicode | Name of the table. | +| **RETURNS** | bool | Whether a table of that name is in the lookups. | + +## Lookups.to_bytes {#to_bytes tag="method"} + +Serialize the lookups to a bytestring. 
+ +> #### Example +> +> ```python +> lookup_bytes = lookups.to_bytes() +> ``` + +| Name | Type | Description | +| ----------- | ----- | ----------------------- | +| **RETURNS** | bytes | The serialized lookups. | + +## Lookups.from_bytes {#from_bytes tag="method"} + +Load the lookups from a bytestring. + +> #### Example +> +> ```python +> lookup_bytes = lookups.to_bytes() +> lookups = Lookups() +> lookups.from_bytes(lookup_bytes) +> ``` + +| Name | Type | Description | +| ------------ | --------- | ---------------------- | +| `bytes_data` | bytes | The data to load from. | +| **RETURNS** | `Lookups` | The loaded lookups. | + +## Lookups.to_disk {#to_disk tag="method"} + +Save the lookups to a directory as `lookups.bin`. Expects a path to a directory, +which will be created if it doesn't exist. + +> #### Example +> +> ```python +> lookups.to_disk("/path/to/lookups") +> ``` + +| Name | Type | Description | +| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | +| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | + +## Lookups.from_disk {#from_disk tag="method"} + +Load lookups from a directory containing a `lookups.bin`. Will skip loading if +the file doesn't exist. + +> #### Example +> +> ```python +> from spacy.lookups import Lookups +> lookups = Lookups() +> lookups.from_disk("/path/to/lookups") +> ``` + +| Name | Type | Description | +| ----------- | ---------------- | -------------------------------------------------------------------------- | +| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | +| **RETURNS** | `Lookups` | The loaded lookups. | + +## Table {#table tag="class, ordereddict"} + +A table in the lookups. 
Subclass of `OrderedDict` that implements a slightly +more consistent and unified API and includes a Bloom filter to speed up missed +lookups. Supports **all other methods and attributes** of `OrderedDict` / +`dict`, and the customized methods listed here. Methods that get or set keys +accept both integers and strings (which will be hashed before being added to the +table). + +### Table.\_\_init\_\_ {#table.init tag="method"} + +Initialize a new table. + +> #### Example +> +> ```python +> from spacy.lookups import Table +> data = {"foo": "bar", "baz": 100} +> table = Table(name="some_table", data=data) +> assert "foo" in table +> assert table["foo"] == "bar" +> ``` + +| Name | Type | Description | +| ----------- | ------- | ---------------------------------- | +| `name` | unicode | Optional table name for reference. | +| **RETURNS** | `Table` | The newly constructed object. | + +### Table.from_dict {#table.from_dict tag="classmethod"} + +Initialize a new table from a dict. + +> #### Example +> +> ```python +> from spacy.lookups import Table +> data = {"foo": "bar", "baz": 100} +> table = Table.from_dict(data, name="some_table") +> ``` + +| Name | Type | Description | +| ----------- | ------- | ---------------------------------- | +| `data` | dict | The dictionary. | +| `name` | unicode | Optional table name for reference. | +| **RETURNS** | `Table` | The newly constructed object. | + +### Table.set {#table.set tag="method"} + +Set a new key / value pair. String keys will be hashed. Same as +`table[key] = value`. + +> #### Example +> +> ```python +> from spacy.lookups import Table +> table = Table() +> table.set("foo", "bar") +> assert table["foo"] == "bar" +> ``` + +| Name | Type | Description | +| ------- | ------------- | ----------- | +| `key` | unicode / int | The key. | +| `value` | - | The value. | + +### Table.to_bytes {#table.to_bytes tag="method"} + +Serialize the table to a bytestring. 
+ +> #### Example +> +> ```python +> table_bytes = table.to_bytes() +> ``` + +| Name | Type | Description | +| ----------- | ----- | --------------------- | +| **RETURNS** | bytes | The serialized table. | + +### Table.from_bytes {#table.from_bytes tag="method"} + +Load a table from a bytestring. + +> #### Example +> +> ```python +> table_bytes = table.to_bytes() +> table = Table() +> table.from_bytes(table_bytes) +> ``` + +| Name | Type | Description | +| ------------ | ------- | ----------------- | +| `bytes_data` | bytes | The data to load. | +| **RETURNS** | `Table` | The loaded table. | + +### Attributes {#table-attributes} + +| Name | Type | Description | +| -------------- | --------------------------- | ----------------------------------------------------- | +| `name` | unicode | Table name. | +| `default_size` | int | Default size of bloom filters if no data is provided. | +| `bloom` | `preshed.bloom.BloomFilter` | The bloom filters. | diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md index fb0ba1617..84d9ed888 100644 --- a/website/docs/api/matcher.md +++ b/website/docs/api/matcher.md @@ -50,7 +50,7 @@ Find all token sequences matching the supplied patterns on the `Doc`. > matcher = Matcher(nlp.vocab) > pattern = [{"LOWER": "hello"}, {"LOWER": "world"}] > matcher.add("HelloWorld", None, pattern) -> doc = nlp(u'hello world!') +> doc = nlp("hello world!") > matches = matcher(doc) > ``` @@ -147,7 +147,7 @@ overwritten. 
> matcher = Matcher(nlp.vocab) > matcher.add("HelloWorld", on_match, [{"LOWER": "hello"}, {"LOWER": "world"}]) > matcher.add("GoogleMaps", on_match, [{"ORTH": "Google"}, {"ORTH": "Maps"}]) -> doc = nlp(u"HELLO WORLD on Google Maps.") +> doc = nlp("HELLO WORLD on Google Maps.") > matches = matcher(doc) > ``` diff --git a/website/docs/api/phrasematcher.md b/website/docs/api/phrasematcher.md index c61fa575d..40b8d6c1a 100644 --- a/website/docs/api/phrasematcher.md +++ b/website/docs/api/phrasematcher.md @@ -59,8 +59,8 @@ Find all token sequences matching the supplied patterns on the `Doc`. > from spacy.matcher import PhraseMatcher > > matcher = PhraseMatcher(nlp.vocab) -> matcher.add("OBAMA", None, nlp(u"Barack Obama")) -> doc = nlp(u"Barack Obama lifts America one last time in emotional farewell") +> matcher.add("OBAMA", None, nlp("Barack Obama")) +> doc = nlp("Barack Obama lifts America one last time in emotional farewell") > matches = matcher(doc) > ``` @@ -99,7 +99,7 @@ patterns. > ```python > matcher = PhraseMatcher(nlp.vocab) > assert len(matcher) == 0 -> matcher.add("OBAMA", None, nlp(u"Barack Obama")) +> matcher.add("OBAMA", None, nlp("Barack Obama")) > assert len(matcher) == 1 > ``` @@ -116,7 +116,7 @@ Check whether the matcher contains rules for a match ID. > ```python > matcher = PhraseMatcher(nlp.vocab) > assert "OBAMA" not in matcher -> matcher.add("OBAMA", None, nlp(u"Barack Obama")) +> matcher.add("OBAMA", None, nlp("Barack Obama")) > assert "OBAMA" in matcher > ``` @@ -140,10 +140,10 @@ overwritten. 
> print('Matched!', matches) > > matcher = PhraseMatcher(nlp.vocab) -> matcher.add("OBAMA", on_match, nlp(u"Barack Obama")) -> matcher.add("HEALTH", on_match, nlp(u"health care reform"), -> nlp(u"healthcare reform")) -> doc = nlp(u"Barack Obama urges Congress to find courage to defend his healthcare reforms") +> matcher.add("OBAMA", on_match, nlp("Barack Obama")) +> matcher.add("HEALTH", on_match, nlp("health care reform"), +> nlp("healthcare reform")) +> doc = nlp("Barack Obama urges Congress to find courage to defend his healthcare reforms") > matches = matcher(doc) > ``` @@ -152,3 +152,22 @@ overwritten. | `match_id` | unicode | An ID for the thing you're matching. | | `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | | `*docs` | list | `Doc` objects of the phrases to match. | + +## PhraseMatcher.remove {#remove tag="method" new="2.2"} + +Remove a rule from the matcher by match ID. A `KeyError` is raised if the key +does not exist. + +> #### Example +> +> ```python +> matcher = PhraseMatcher(nlp.vocab) +> matcher.add("OBAMA", None, nlp("Barack Obama")) +> assert "OBAMA" in matcher +> matcher.remove("OBAMA") +> assert "OBAMA" not in matcher +> ``` + +| Name | Type | Description | +| ----- | ------- | ------------------------- | +| `key` | unicode | The ID of the match rule. | diff --git a/website/docs/api/pipeline-functions.md b/website/docs/api/pipeline-functions.md index 63b3cd164..6e2b473b1 100644 --- a/website/docs/api/pipeline-functions.md +++ b/website/docs/api/pipeline-functions.md @@ -17,13 +17,13 @@ the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe). 
> #### Example > > ```python -> texts = [t.text for t in nlp(u"I have a blue car")] +> texts = [t.text for t in nlp("I have a blue car")] > assert texts == ["I", "have", "a", "blue", "car"] > > merge_nps = nlp.create_pipe("merge_noun_chunks") > nlp.add_pipe(merge_nps) > -> texts = [t.text for t in nlp(u"I have a blue car")] +> texts = [t.text for t in nlp("I have a blue car")] > assert texts == ["I", "have", "a blue car"] > ``` @@ -50,13 +50,13 @@ the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe). > #### Example > > ```python -> texts = [t.text for t in nlp(u"I like David Bowie")] +> texts = [t.text for t in nlp("I like David Bowie")] > assert texts == ["I", "like", "David", "Bowie"] > > merge_ents = nlp.create_pipe("merge_entities") > nlp.add_pipe(merge_ents) > -> texts = [t.text for t in nlp(u"I like David Bowie")] +> texts = [t.text for t in nlp("I like David Bowie")] > assert texts == ["I", "like", "David Bowie"] > ``` diff --git a/website/docs/api/scorer.md b/website/docs/api/scorer.md index 2af4ec0ce..35348217b 100644 --- a/website/docs/api/scorer.md +++ b/website/docs/api/scorer.md @@ -46,14 +46,16 @@ Update the evaluation scores from a single [`Doc`](/api/doc) / ## Properties -| Name | Type | Description | -| ---------------------------------------------- | ----- | ------------------------------------------------------------------------------------------------------------- | -| `token_acc` | float | Tokenization accuracy. | -| `tags_acc` | float | Part-of-speech tag accuracy (fine grained tags, i.e. `Token.tag`). | -| `uas` | float | Unlabelled dependency score. | -| `las` | float | Labelled dependency score. | -| `ents_p` | float | Named entity accuracy (precision). | -| `ents_r` | float | Named entity accuracy (recall). | -| `ents_f` | float | Named entity accuracy (F-score). | -| `ents_per_type` 2.1.5 | dict | Scores per entity label. Keyed by label, mapped to a dict of `p`, `r` and `f` scores. 
| -| `scores` | dict | All scores with keys `uas`, `las`, `ents_p`, `ents_r`, `ents_f`, `ents_per_type`, `tags_acc` and `token_acc`. | +| Name | Type | Description | +| ----------------------------------------------- | ----- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `token_acc` | float | Tokenization accuracy. | +| `tags_acc` | float | Part-of-speech tag accuracy (fine grained tags, i.e. `Token.tag`). | +| `uas` | float | Unlabelled dependency score. | +| `las` | float | Labelled dependency score. | +| `ents_p` | float | Named entity accuracy (precision). | +| `ents_r` | float | Named entity accuracy (recall). | +| `ents_f` | float | Named entity accuracy (F-score). | +| `ents_per_type` 2.1.5 | dict | Scores per entity label. Keyed by label, mapped to a dict of `p`, `r` and `f` scores. | +| `textcat_score` 2.2 | float | F-score on positive label for binary exclusive, macro-averaged F-score for 3+ exclusive, macro-averaged AUC ROC score for multilabel (`-1` if undefined). | +| `textcats_per_cat` 2.2 | dict | Scores per textcat label, keyed by label. | +| `scores` | dict | All scores, keyed by type. | diff --git a/website/docs/api/sentencizer.md b/website/docs/api/sentencizer.md index 26d205c24..237cd6a8a 100644 --- a/website/docs/api/sentencizer.md +++ b/website/docs/api/sentencizer.md @@ -59,7 +59,7 @@ the component has been added to the pipeline using > nlp = English() > sentencizer = nlp.create_pipe("sentencizer") > nlp.add_pipe(sentencizer) -> doc = nlp(u"This is a sentence. This is another sentence.") +> doc = nlp("This is a sentence. This is another sentence.") > assert list(doc.sents) == 2 > ``` diff --git a/website/docs/api/span.md b/website/docs/api/span.md index c807c7bbf..64b77b89d 100644 --- a/website/docs/api/span.md +++ b/website/docs/api/span.md @@ -13,19 +13,20 @@ Create a Span object from the slice `doc[start : end]`. 
> #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > span = doc[1:4] -> assert [t.text for t in span] == [u"it", u"back", u"!"] +> assert [t.text for t in span] == ["it", "back", "!"] > ``` -| Name | Type | Description | -| ----------- | ---------------------------------------- | ----------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The parent document. | -| `start` | int | The index of the first token of the span. | -| `end` | int | The index of the first token after the span. | -| `label` | int / unicode | A label to attach to the span, e.g. for named entities. As of v2.1, the label can also be a unicode string. | -| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. | -| **RETURNS** | `Span` | The newly constructed object. | +| Name | Type | Description | +| ----------- | ---------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | +| `doc` | `Doc` | The parent document. | +| `start` | int | The index of the first token of the span. | +| `end` | int | The index of the first token after the span. | +| `label` | int / unicode | A label to attach to the span, e.g. for named entities. As of v2.1, the label can also be a unicode string. | +| `kb_id` | int / unicode | A knowledge base ID to attach to the span, e.g. for named entities. The ID can be an integer or a unicode string. | +| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. | +| **RETURNS** | `Span` | The newly constructed object. | ## Span.\_\_getitem\_\_ {#getitem tag="method"} @@ -34,7 +35,7 @@ Get a `Token` object. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! 
He pleaded.") > span = doc[1:4] > assert span[1].text == "back" > ``` @@ -49,9 +50,9 @@ Get a `Span` object. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > span = doc[1:4] -> assert span[1:3].text == u"back!" +> assert span[1:3].text == "back!" > ``` | Name | Type | Description | @@ -66,9 +67,9 @@ Iterate over `Token` objects. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > span = doc[1:4] -> assert [t.text for t in span] == [u"it", u"back", u"!"] +> assert [t.text for t in span] == ["it", "back", "!"] > ``` | Name | Type | Description | @@ -82,7 +83,7 @@ Get the number of tokens in the span. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > span = doc[1:4] > assert len(span) == 3 > ``` @@ -101,9 +102,9 @@ For details, see the documentation on > > ```python > from spacy.tokens import Span -> city_getter = lambda span: any(city in span.text for city in (u"New York", u"Paris", u"Berlin")) +> city_getter = lambda span: any(city in span.text for city in ("New York", "Paris", "Berlin")) > Span.set_extension("has_city", getter=city_getter) -> doc = nlp(u"I like New York in Autumn") +> doc = nlp("I like New York in Autumn") > assert doc[1:4]._.has_city > ``` @@ -179,7 +180,7 @@ using an average of word vectors. > #### Example > > ```python -> doc = nlp(u"green apples and red oranges") +> doc = nlp("green apples and red oranges") > green_apples = doc[:2] > red_oranges = doc[3:] > apples_oranges = green_apples.similarity(red_oranges) @@ -201,7 +202,7 @@ ancestor is found, e.g. if span excludes a necessary ancestor. 
> #### Example > > ```python -> doc = nlp(u"I like New York in Autumn") +> doc = nlp("I like New York in Autumn") > span = doc[1:4] > matrix = span.get_lca_matrix() > # array([[0, 0, 0], [0, 1, 2], [0, 2, 2]], dtype=int32) @@ -221,7 +222,7 @@ shape `(N, M)`, where `N` is the length of the document. The values will be > > ```python > from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > span = doc[2:3] > # All strings mapped to integers, for easy export to numpy > np_array = span.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA]) @@ -247,11 +248,11 @@ Retokenize the document, such that the span is merged into a single token. > #### Example > > ```python -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > span = doc[2:4] > span.merge() > assert len(doc) == 6 -> assert doc[2].text == u"New York" +> assert doc[2].text == "New York" > ``` | Name | Type | Description | @@ -267,12 +268,12 @@ if the entity recognizer has been applied. > #### Example > > ```python -> doc = nlp(u"Mr. Best flew to New York on Saturday morning.") +> doc = nlp("Mr. Best flew to New York on Saturday morning.") > span = doc[0:6] > ents = list(span.ents) > assert ents[0].label == 346 > assert ents[0].label_ == "PERSON" -> assert ents[0].text == u"Mr. Best" +> assert ents[0].text == "Mr. Best" > ``` | Name | Type | Description | @@ -286,10 +287,10 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data. > #### Example > > ```python -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > span = doc[2:4] > doc2 = span.as_doc() -> assert doc2.text == u"New York" +> assert doc2.text == "New York" > ``` | Name | Type | Description | @@ -306,12 +307,12 @@ taken. 
> #### Example > > ```python -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > i, like, new, york, in_, autumn, dot = range(len(doc)) -> assert doc[new].head.text == u"York" -> assert doc[york].head.text == u"like" +> assert doc[new].head.text == "York" +> assert doc[york].head.text == "like" > new_york = doc[new:york+1] -> assert new_york.root.text == u"York" +> assert new_york.root.text == "York" > ``` | Name | Type | Description | @@ -325,9 +326,9 @@ A tuple of tokens coordinated to `span.root`. > #### Example > > ```python -> doc = nlp(u"I like apples and oranges") +> doc = nlp("I like apples and oranges") > apples_conjuncts = doc[2:3].conjuncts -> assert [t.text for t in apples_conjuncts] == [u"oranges"] +> assert [t.text for t in apples_conjuncts] == ["oranges"] > ``` | Name | Type | Description | @@ -341,9 +342,9 @@ Tokens that are to the left of the span, whose heads are within the span. > #### Example > > ```python -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > lefts = [t.text for t in doc[3:7].lefts] -> assert lefts == [u"New"] +> assert lefts == ["New"] > ``` | Name | Type | Description | @@ -357,9 +358,9 @@ Tokens that are to the right of the span, whose heads are within the span. > #### Example > > ```python -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > rights = [t.text for t in doc[2:4].rights] -> assert rights == [u"in"] +> assert rights == ["in"] > ``` | Name | Type | Description | @@ -374,7 +375,7 @@ the span. > #### Example > > ```python -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > assert doc[3:7].n_lefts == 1 > ``` @@ -390,7 +391,7 @@ the span. > #### Example > > ```python -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > assert doc[2:4].n_rights == 1 > ``` @@ -405,9 +406,9 @@ Tokens within the span and tokens which descend from them. 
> #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > subtree = [t.text for t in doc[:3].subtree] -> assert subtree == [u"Give", u"it", u"back", u"!"] +> assert subtree == ["Give", "it", "back", "!"] > ``` | Name | Type | Description | @@ -421,7 +422,7 @@ A boolean value indicating whether a word vector is associated with the object. > #### Example > > ```python -> doc = nlp(u"I like apples") +> doc = nlp("I like apples") > assert doc[1:].has_vector > ``` @@ -437,7 +438,7 @@ vectors. > #### Example > > ```python -> doc = nlp(u"I like apples") +> doc = nlp("I like apples") > assert doc[1:].vector.dtype == "float32" > assert doc[1:].vector.shape == (300,) > ``` @@ -453,7 +454,7 @@ The L2 norm of the span's vector representation. > #### Example > > ```python -> doc = nlp(u"I like apples") +> doc = nlp("I like apples") > doc[1:].vector_norm # 4.800883928527915 > doc[2:].vector_norm # 6.895897646384268 > assert doc[1:].vector_norm != doc[2:].vector_norm @@ -478,9 +479,11 @@ The L2 norm of the span's vector representation. | `text_with_ws` | unicode | The text content of the span with a trailing whitespace character if the last token has one. | | `orth` | int | ID of the verbatim text content. | | `orth_` | unicode | Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes. | -| `label` | int | The span's label. | +| `label` | int | The hash value of the span's label. | | `label_` | unicode | The span's label. | | `lemma_` | unicode | The span's lemma. | +| `kb_id` | int | The hash value of the knowledge base ID referred to by the span. | +| `kb_id_` | unicode | The knowledge base ID referred to by the span. | | `ent_id` | int | The hash value of the named entity the token is an instance of. | | `ent_id_` | unicode | The string ID of the named entity the token is an instance of. 
| | `sentiment` | float | A scalar value indicating the positivity or negativity of the span. | diff --git a/website/docs/api/stringstore.md b/website/docs/api/stringstore.md index 40d27a62a..268f19125 100644 --- a/website/docs/api/stringstore.md +++ b/website/docs/api/stringstore.md @@ -16,7 +16,7 @@ Create the `StringStore`. > > ```python > from spacy.strings import StringStore -> stringstore = StringStore([u"apple", u"orange"]) +> stringstore = StringStore(["apple", "orange"]) > ``` | Name | Type | Description | @@ -31,7 +31,7 @@ Get the number of strings in the store. > #### Example > > ```python -> stringstore = StringStore([u"apple", u"orange"]) +> stringstore = StringStore(["apple", "orange"]) > assert len(stringstore) == 2 > ``` @@ -46,10 +46,10 @@ Retrieve a string from a given hash, or vice versa. > #### Example > > ```python -> stringstore = StringStore([u"apple", u"orange"]) -> apple_hash = stringstore[u"apple"] +> stringstore = StringStore(["apple", "orange"]) +> apple_hash = stringstore["apple"] > assert apple_hash == 8566208034543834098 -> assert stringstore[apple_hash] == u"apple" +> assert stringstore[apple_hash] == "apple" > ``` | Name | Type | Description | @@ -64,9 +64,9 @@ Check whether a string is in the store. > #### Example > > ```python -> stringstore = StringStore([u"apple", u"orange"]) -> assert u"apple" in stringstore -> assert not u"cherry" in stringstore +> stringstore = StringStore(["apple", "orange"]) +> assert "apple" in stringstore +> assert not "cherry" in stringstore > ``` | Name | Type | Description | @@ -82,9 +82,9 @@ store will always include an empty string `''` at position `0`. 
> #### Example > > ```python -> stringstore = StringStore([u"apple", u"orange"]) +> stringstore = StringStore(["apple", "orange"]) > all_strings = [s for s in stringstore] -> assert all_strings == [u"apple", u"orange"] +> assert all_strings == ["apple", "orange"] > ``` | Name | Type | Description | @@ -98,12 +98,12 @@ Add a string to the `StringStore`. > #### Example > > ```python -> stringstore = StringStore([u"apple", u"orange"]) -> banana_hash = stringstore.add(u"banana") +> stringstore = StringStore(["apple", "orange"]) +> banana_hash = stringstore.add("banana") > assert len(stringstore) == 3 > assert banana_hash == 2525716904149915114 -> assert stringstore[banana_hash] == u"banana" -> assert stringstore[u"banana"] == banana_hash +> assert stringstore[banana_hash] == "banana" +> assert stringstore["banana"] == banana_hash > ``` | Name | Type | Description | @@ -182,7 +182,7 @@ Get a 64-bit hash for a given string. > > ```python > from spacy.strings import hash_string -> assert hash_string(u"apple") == 8566208034543834098 +> assert hash_string("apple") == 8566208034543834098 > ``` | Name | Type | Description | diff --git a/website/docs/api/tagger.md b/website/docs/api/tagger.md index a1d921b41..bd3382f89 100644 --- a/website/docs/api/tagger.md +++ b/website/docs/api/tagger.md @@ -57,7 +57,7 @@ and all pipeline components are applied to the `Doc` in order. Both > > ```python > tagger = Tagger(nlp.vocab) -> doc = nlp(u"This is a sentence.") +> doc = nlp("This is a sentence.") > # This usually happens under the hood > processed = tagger(doc) > ``` @@ -97,7 +97,7 @@ Apply the pipeline's model to a batch of docs, without modifying them. > > ```python > tagger = Tagger(nlp.vocab) -> scores = tagger.predict([doc1, doc2]) +> scores, tensors = tagger.predict([doc1, doc2]) > ``` | Name | Type | Description | @@ -113,14 +113,15 @@ Modify a batch of documents, using pre-computed scores. 
> > ```python > tagger = Tagger(nlp.vocab) -> scores = tagger.predict([doc1, doc2]) -> tagger.set_annotations([doc1, doc2], scores) +> scores, tensors = tagger.predict([doc1, doc2]) +> tagger.set_annotations([doc1, doc2], scores, tensors) > ``` -| Name | Type | Description | -| -------- | -------- | ------------------------------------------------ | -| `docs` | iterable | The documents to modify. | -| `scores` | - | The scores to set, produced by `Tagger.predict`. | +| Name | Type | Description | +| --------- | -------- | ----------------------------------------------------- | +| `docs` | iterable | The documents to modify. | +| `scores` | - | The scores to set, produced by `Tagger.predict`. | +| `tensors` | iterable | The token representations used to predict the scores. | ## Tagger.update {#update tag="method"} diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md index 310122b9c..1a0280265 100644 --- a/website/docs/api/textcategorizer.md +++ b/website/docs/api/textcategorizer.md @@ -75,7 +75,7 @@ delegate to the [`predict`](/api/textcategorizer#predict) and > > ```python > textcat = TextCategorizer(nlp.vocab) -> doc = nlp(u"This is a sentence.") +> doc = nlp("This is a sentence.") > # This usually happens under the hood > processed = textcat(doc) > ``` @@ -116,7 +116,7 @@ Apply the pipeline's model to a batch of docs, without modifying them. > > ```python > textcat = TextCategorizer(nlp.vocab) -> scores = textcat.predict([doc1, doc2]) +> scores, tensors = textcat.predict([doc1, doc2]) > ``` | Name | Type | Description | @@ -132,14 +132,15 @@ Modify a batch of documents, using pre-computed scores. 
> > ```python > textcat = TextCategorizer(nlp.vocab) -> scores = textcat.predict([doc1, doc2]) -> textcat.set_annotations([doc1, doc2], scores) +> scores, tensors = textcat.predict([doc1, doc2]) +> textcat.set_annotations([doc1, doc2], scores, tensors) > ``` -| Name | Type | Description | -| -------- | -------- | --------------------------------------------------------- | -| `docs` | iterable | The documents to modify. | -| `scores` | - | The scores to set, produced by `TextCategorizer.predict`. | +| Name | Type | Description | +| --------- | -------- | --------------------------------------------------------- | +| `docs` | iterable | The documents to modify. | +| `scores` | - | The scores to set, produced by `TextCategorizer.predict`. | +| `tensors` | iterable | The token representations used to predict the scores. | ## TextCategorizer.update {#update tag="method"} @@ -227,13 +228,13 @@ Modify the pipe's model, to use the given parameter values. > > ```python > textcat = TextCategorizer(nlp.vocab) -> with textcat.use_params(): +> with textcat.use_params(optimizer.averages): > textcat.to_disk("/best_model") > ``` | Name | Type | Description | | -------- | ---- | ---------------------------------------------------------------------------------------------------------- | -| `params` | - | The parameter values to use in the model. At the end of the context, the original parameters are restored. | +| `params` | dict | The parameter values to use in the model. At the end of the context, the original parameters are restored. | ## TextCategorizer.add_label {#add_label tag="method"} diff --git a/website/docs/api/token.md b/website/docs/api/token.md index 24816b401..8d7ee5928 100644 --- a/website/docs/api/token.md +++ b/website/docs/api/token.md @@ -12,9 +12,9 @@ Construct a `Token` object. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! 
He pleaded.") > token = doc[0] -> assert token.text == u"Give" +> assert token.text == "Give" > ``` | Name | Type | Description | @@ -31,7 +31,7 @@ The number of unicode characters in the token, i.e. `token.text`. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > token = doc[0] > assert len(token) == 4 > ``` @@ -50,9 +50,9 @@ For details, see the documentation on > > ```python > from spacy.tokens import Token -> fruit_getter = lambda token: token.text in (u"apple", u"pear", u"banana") +> fruit_getter = lambda token: token.text in ("apple", "pear", "banana") > Token.set_extension("is_fruit", getter=fruit_getter) -> doc = nlp(u"I have an apple") +> doc = nlp("I have an apple") > assert doc[3]._.is_fruit > ``` @@ -128,7 +128,7 @@ Check the value of a boolean flag. > > ```python > from spacy.attrs import IS_TITLE -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > token = doc[0] > assert token.check_flag(IS_TITLE) == True > ``` @@ -145,7 +145,7 @@ Compute a semantic similarity estimate. Defaults to cosine over vectors. > #### Example > > ```python -> apples, _, oranges = nlp(u"apples and oranges") +> apples, _, oranges = nlp("apples and oranges") > apples_oranges = apples.similarity(oranges) > oranges_apples = oranges.similarity(apples) > assert apples_oranges == oranges_apples @@ -163,9 +163,9 @@ Get a neighboring token. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > give_nbor = doc[0].nbor() -> assert give_nbor.text == u"it" +> assert give_nbor.text == "it" > ``` | Name | Type | Description | @@ -181,7 +181,7 @@ dependency tree. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > give = doc[0] > it = doc[1] > assert give.is_ancestor(it) @@ -199,11 +199,11 @@ The rightmost token of this token's syntactic descendants. 
> #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > it_ancestors = doc[1].ancestors -> assert [t.text for t in it_ancestors] == [u"Give"] +> assert [t.text for t in it_ancestors] == ["Give"] > he_ancestors = doc[4].ancestors -> assert [t.text for t in he_ancestors] == [u"pleaded"] +> assert [t.text for t in he_ancestors] == ["pleaded"] > ``` | Name | Type | Description | @@ -217,9 +217,9 @@ A tuple of coordinated tokens, not including the token itself. > #### Example > > ```python -> doc = nlp(u"I like apples and oranges") +> doc = nlp("I like apples and oranges") > apples_conjuncts = doc[2].conjuncts -> assert [t.text for t in apples_conjuncts] == [u"oranges"] +> assert [t.text for t in apples_conjuncts] == ["oranges"] > ``` | Name | Type | Description | @@ -233,9 +233,9 @@ A sequence of the token's immediate syntactic children. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > give_children = doc[0].children -> assert [t.text for t in give_children] == [u"it", u"back", u"!"] +> assert [t.text for t in give_children] == ["it", "back", "!"] > ``` | Name | Type | Description | @@ -249,9 +249,9 @@ The leftward immediate children of the word, in the syntactic dependency parse. > #### Example > > ```python -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > lefts = [t.text for t in doc[3].lefts] -> assert lefts == [u'New'] +> assert lefts == ["New"] > ``` | Name | Type | Description | @@ -265,9 +265,9 @@ The rightward immediate children of the word, in the syntactic dependency parse. > #### Example > > ```python -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > rights = [t.text for t in doc[3].rights] -> assert rights == [u"in"] +> assert rights == ["in"] > ``` | Name | Type | Description | @@ -282,7 +282,7 @@ dependency parse. 
> #### Example > > ```python -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > assert doc[3].n_lefts == 1 > ``` @@ -298,7 +298,7 @@ dependency parse. > #### Example > > ```python -> doc = nlp(u"I like New York in Autumn.") +> doc = nlp("I like New York in Autumn.") > assert doc[3].n_rights == 1 > ``` @@ -313,9 +313,9 @@ A sequence containing the token and all the token's syntactic descendants. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > give_subtree = doc[0].subtree -> assert [t.text for t in give_subtree] == [u"Give", u"it", u"back", u"!"] +> assert [t.text for t in give_subtree] == ["Give", "it", "back", "!"] > ``` | Name | Type | Description | @@ -330,7 +330,7 @@ unknown. Defaults to `True` for the first token in the `Doc`. > #### Example > > ```python -> doc = nlp(u"Give it back! He pleaded.") +> doc = nlp("Give it back! He pleaded.") > assert doc[4].is_sent_start > assert not doc[5].is_sent_start > ``` @@ -361,7 +361,7 @@ A boolean value indicating whether a word vector is associated with the token. > #### Example > > ```python -> doc = nlp(u"I like apples") +> doc = nlp("I like apples") > apples = doc[2] > assert apples.has_vector > ``` @@ -377,7 +377,7 @@ A real-valued meaning representation. > #### Example > > ```python -> doc = nlp(u"I like apples") +> doc = nlp("I like apples") > apples = doc[2] > assert apples.vector.dtype == "float32" > assert apples.vector.shape == (300,) @@ -394,7 +394,7 @@ The L2 norm of the token's vector representation. > #### Example > > ```python -> doc = nlp(u"I like apples and pasta") +> doc = nlp("I like apples and pasta") > apples = doc[2] > pasta = doc[4] > apples.vector_norm # 6.89589786529541 @@ -425,8 +425,10 @@ The L2 norm of the token's vector representation. | `i` | int | The index of the token within the parent document. | | `ent_type` | int | Named entity type. | | `ent_type_` | unicode | Named entity type. 
| -| `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. | | +| `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. | | `ent_iob_` | unicode | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. | +| `ent_kb_id` 2.2 | int | Knowledge base ID that refers to the named entity this token is a part of, if any. | +| `ent_kb_id_` 2.2 | unicode | Knowledge base ID that refers to the named entity this token is a part of, if any. | | `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | | `ent_id_` | unicode | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | | `lemma` | int | Base form of the token, with no inflectional suffixes. | diff --git a/website/docs/api/tokenizer.md b/website/docs/api/tokenizer.md index ce1ba9a21..d6ab73f14 100644 --- a/website/docs/api/tokenizer.md +++ b/website/docs/api/tokenizer.md @@ -5,7 +5,9 @@ tag: class source: spacy/tokenizer.pyx --- -Segment text, and create `Doc` objects with the discovered segment boundaries. For a deeper understanding, see the docs on [how spaCy's tokenizer works](/usage/linguistic-features#how-tokenizer-works). +Segment text, and create `Doc` objects with the discovered segment boundaries. +For a deeper understanding, see the docs on +[how spaCy's tokenizer works](/usage/linguistic-features#how-tokenizer-works). ## Tokenizer.\_\_init\_\_ {#init tag="method"} @@ -49,7 +51,7 @@ Tokenize a string. 
> #### Example > > ```python -> tokens = tokenizer(u"This is a sentence") +> tokens = tokenizer("This is a sentence") > assert len(tokens) == 4 > ``` @@ -65,7 +67,7 @@ Tokenize a stream of texts. > #### Example > > ```python -> texts = [u"One document.", u"...", u"Lots of documents"] +> texts = ["One document.", "...", "Lots of documents"] > for doc in tokenizer.pipe(texts, batch_size=50): > pass > ``` @@ -109,14 +111,15 @@ if no suffix rules match. Add a special-case tokenization rule. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guide on -[adding languages](/usage/adding-languages#tokenizer-exceptions) and [linguistic features](/usage/linguistic-features#special-cases) for more -details and examples. +[adding languages](/usage/adding-languages#tokenizer-exceptions) and +[linguistic features](/usage/linguistic-features#special-cases) for more details +and examples. > #### Example > > ```python -> from spacy.attrs import ORTH, LEMMA -> case = [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}] +> from spacy.attrs import ORTH, NORM +> case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}] > tokenizer.add_special_case("don't", case) > ``` diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 9d166a5c5..50ba0e3d9 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -112,10 +112,10 @@ list of available terms, see > #### Example > > ```python -> spacy.explain(u"NORP") +> spacy.explain("NORP") > # Nationalities or religious or political groups > -> doc = nlp(u"Hello world") +> doc = nlp("Hello world") > for word in doc: > print(word.text, word.tag_, spacy.explain(word.tag_)) > # Hello UH interjection @@ -181,8 +181,8 @@ browser. Will run a simple web server. 
> import spacy > from spacy import displacy > nlp = spacy.load("en_core_web_sm") -> doc1 = nlp(u"This is a sentence.") -> doc2 = nlp(u"This is another sentence.") +> doc1 = nlp("This is a sentence.") +> doc2 = nlp("This is another sentence.") > displacy.serve([doc1, doc2], style="dep") > ``` @@ -192,7 +192,7 @@ browser. Will run a simple web server. | `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` | | `page` | bool | Render markup as full HTML page. | `True` | | `minify` | bool | Minify HTML markup. | `False` | -| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` | +| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` | | `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` | | `port` | int | Port to serve visualization. | `5000` | | `host` | unicode | Host to serve visualization. | `'0.0.0.0'` | @@ -207,7 +207,7 @@ Render a dependency parse tree or named entity visualization. > import spacy > from spacy import displacy > nlp = spacy.load("en_core_web_sm") -> doc = nlp(u"This is a sentence.") +> doc = nlp("This is a sentence.") > html = displacy.render(doc, style="dep") > ``` @@ -218,7 +218,7 @@ Render a dependency parse tree or named entity visualization. | `page` | bool | Render markup as full HTML page. | `False` | | `minify` | bool | Minify HTML markup. | `False` | | `jupyter` | bool | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None`. | `None` | -| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` | +| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` | | `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. 
[See here](/usage/visualizers#manual-usage) for formats and examples. | `False` | | **RETURNS** | unicode | Rendered HTML markup. | @@ -262,15 +262,18 @@ If a setting is not present in the options, the default value will be used. > displacy.serve(doc, style="ent", options=options) > ``` -| Name | Type | Description | Default | -| -------- | ---- | ------------------------------------------------------------------------------------- | ------- | -| `ents` | list | Entity types to highlight (`None` for all types). | `None` | -| `colors` | dict | Color overrides. Entity types in uppercase should be mapped to color names or values. | `{}` | +| Name | Type | Description | Default | +| --------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------ | +| `ents` | list | Entity types to highlight (`None` for all types). | `None` | +| `colors` | dict | Color overrides. Entity types in uppercase should be mapped to color names or values. | `{}` | +| `template` 2.2 | unicode | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. | see [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) | By default, displaCy comes with colors for all [entity types supported by spaCy](/api/annotation#named-entities). If you're using custom entity types, you can use the `colors` setting to add your own -colors for them. +colors for them. Your application or model package can also expose a +[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy) +to add custom labels and their colors automatically. ## Utility functions {#util source="spacy/util.py"} @@ -511,9 +514,9 @@ an error if key doesn't match `ORTH` values. 
> > ```python > BASE = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]} -> NEW = {"a.": [{ORTH: "a.", LEMMA: "all"}]} +> NEW = {"a.": [{ORTH: "a.", NORM: "all"}]} > exceptions = util.update_exc(BASE, NEW) -> # {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]} +> # {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]} > ``` | Name | Type | Description | @@ -648,11 +651,11 @@ for batching. Larger `bufsize` means less bias. > shuffled = itershuffle(values) > ``` -| Name | Type | Description | -| ---------- | -------- | ------------------------------------- | -| `iterable` | iterable | Iterator to shuffle. | -| `bufsize` | int | Items to hold back (default: 1000). | -| **YIELDS** | iterable | The shuffled iterator. | +| Name | Type | Description | +| ---------- | -------- | ----------------------------------- | +| `iterable` | iterable | Iterator to shuffle. | +| `bufsize` | int | Items to hold back (default: 1000). | +| **YIELDS** | iterable | The shuffled iterator. | ### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"} diff --git a/website/docs/api/vectors.md b/website/docs/api/vectors.md index c04085091..ae62d8cfc 100644 --- a/website/docs/api/vectors.md +++ b/website/docs/api/vectors.md @@ -26,7 +26,7 @@ you can add vectors to later. > empty_vectors = Vectors(shape=(10000, 300)) > > data = numpy.zeros((3, 300), dtype='f') -> keys = [u"cat", u"dog", u"rat"] +> keys = ["cat", "dog", "rat"] > vectors = Vectors(data=data, keys=keys) > ``` @@ -35,6 +35,7 @@ you can add vectors to later. | `data` | `ndarray[ndim=1, dtype='float32']` | The vector data. | | `keys` | iterable | A sequence of keys aligned with the data. | | `shape` | tuple | Size of the table as `(n_entries, n_columns)`, the number of entries and number of columns. Not required if you're initializing the object with `data` and `keys`. | +| `name` | unicode | A name to identify the vectors table. | | **RETURNS** | `Vectors` | The newly created object. 
| ## Vectors.\_\_getitem\_\_ {#getitem tag="method"} @@ -45,9 +46,9 @@ raised. > #### Example > > ```python -> cat_id = nlp.vocab.strings[u"cat"] +> cat_id = nlp.vocab.strings["cat"] > cat_vector = nlp.vocab.vectors[cat_id] -> assert cat_vector == nlp.vocab[u"cat"].vector +> assert cat_vector == nlp.vocab["cat"].vector > ``` | Name | Type | Description | @@ -62,7 +63,7 @@ Set a vector for the given key. > #### Example > > ```python -> cat_id = nlp.vocab.strings[u"cat"] +> cat_id = nlp.vocab.strings["cat"] > vector = numpy.random.uniform(-1, 1, (300,)) > nlp.vocab.vectors[cat_id] = vector > ``` @@ -109,7 +110,7 @@ Check whether a key has been mapped to a vector entry in the table. > #### Example > > ```python -> cat_id = nlp.vocab.strings[u"cat"] +> cat_id = nlp.vocab.strings["cat"] > nlp.vectors.add(cat_id, numpy.random.uniform(-1, 1, (300,))) > assert cat_id in vectors > ``` @@ -132,9 +133,9 @@ mapping separately. If you need to manage the strings, you should use the > > ```python > vector = numpy.random.uniform(-1, 1, (300,)) -> cat_id = nlp.vocab.strings[u"cat"] +> cat_id = nlp.vocab.strings["cat"] > nlp.vocab.vectors.add(cat_id, vector=vector) -> nlp.vocab.vectors.add(u"dog", row=0) +> nlp.vocab.vectors.add("dog", row=0) > ``` | Name | Type | Description | @@ -218,8 +219,8 @@ Look up one or more keys by row, or vice versa. > #### Example > > ```python -> row = nlp.vocab.vectors.find(key=u"cat") -> rows = nlp.vocab.vectors.find(keys=[u"cat", u"dog"]) +> row = nlp.vocab.vectors.find(key="cat") +> rows = nlp.vocab.vectors.find(keys=["cat", "dog"]) > key = nlp.vocab.vectors.find(row=256) > keys = nlp.vocab.vectors.find(rows=[18, 256, 985]) > ``` @@ -241,7 +242,7 @@ vector table. 
> > ```python > vectors = Vectors(shape(1, 300)) -> vectors.add(u"cat", numpy.random.uniform(-1, 1, (300,))) +> vectors.add("cat", numpy.random.uniform(-1, 1, (300,))) > rows, dims = vectors.shape > assert rows == 1 > assert dims == 300 @@ -276,7 +277,7 @@ If a table is full, it can be resized using > > ```python > vectors = Vectors(shape=(1, 300)) -> vectors.add(u"cat", numpy.random.uniform(-1, 1, (300,))) +> vectors.add("cat", numpy.random.uniform(-1, 1, (300,))) > assert vectors.is_full > ``` diff --git a/website/docs/api/vocab.md b/website/docs/api/vocab.md index cd21a91d6..ea0c2d219 100644 --- a/website/docs/api/vocab.md +++ b/website/docs/api/vocab.md @@ -18,16 +18,17 @@ Create the vocabulary. > > ```python > from spacy.vocab import Vocab -> vocab = Vocab(strings=[u"hello", u"world"]) +> vocab = Vocab(strings=["hello", "world"]) > ``` -| Name | Type | Description | -| ------------------ | -------------------- | ------------------------------------------------------------------------------------------------------------------ | -| `lex_attr_getters` | dict | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. | -| `tag_map` | dict | A dictionary mapping fine-grained tags to coarse-grained parts-of-speech, and optionally morphological attributes. | -| `lemmatizer` | object | A lemmatizer. Defaults to `None`. | -| `strings` | `StringStore` / list | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. | -| **RETURNS** | `Vocab` | The newly constructed object. | +| Name | Type | Description | +| ------------------------------------------- | -------------------- | ------------------------------------------------------------------------------------------------------------------ | +| `lex_attr_getters` | dict | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. 
| +| `tag_map` | dict | A dictionary mapping fine-grained tags to coarse-grained parts-of-speech, and optionally morphological attributes. | +| `lemmatizer` | object | A lemmatizer. Defaults to `None`. | +| `strings` | `StringStore` / list | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. | +| `vectors_name` 2.2 | unicode | A name to identify the vectors table. | +| **RETURNS** | `Vocab` | The newly constructed object. | ## Vocab.\_\_len\_\_ {#len tag="method"} @@ -36,7 +37,7 @@ Get the current number of lexemes in the vocabulary. > #### Example > > ```python -> doc = nlp(u"This is a sentence.") +> doc = nlp("This is a sentence.") > assert len(nlp.vocab) > 0 > ``` @@ -52,8 +53,8 @@ unicode string is given, a new lexeme is created and stored. > #### Example > > ```python -> apple = nlp.vocab.strings[u"apple"] -> assert nlp.vocab[apple] == nlp.vocab[u"apple"] +> apple = nlp.vocab.strings["apple"] +> assert nlp.vocab[apple] == nlp.vocab["apple"] > ``` | Name | Type | Description | @@ -84,8 +85,8 @@ given string, you need to look it up in > #### Example > > ```python -> apple = nlp.vocab.strings[u"apple"] -> oov = nlp.vocab.strings[u"dskfodkfos"] +> apple = nlp.vocab.strings["apple"] +> oov = nlp.vocab.strings["dskfodkfos"] > assert apple in nlp.vocab > assert oov not in nlp.vocab > ``` @@ -106,11 +107,11 @@ using `token.check_flag(flag_id)`. > > ```python > def is_my_product(text): -> products = [u"spaCy", u"Thinc", u"displaCy"] +> products = ["spaCy", "Thinc", "displaCy"] > return text in products > > MY_PRODUCT = nlp.vocab.add_flag(is_my_product) -> doc = nlp(u"I like spaCy") +> doc = nlp("I like spaCy") > assert doc[2].check_flag(MY_PRODUCT) == True > ``` @@ -170,7 +171,7 @@ or hash value. If no vectors data is loaded, a `ValueError` is raised. 
> #### Example > > ```python -> nlp.vocab.get_vector(u"apple") +> nlp.vocab.get_vector("apple") > ``` | Name | Type | Description | @@ -186,7 +187,7 @@ or hash value. > #### Example > > ```python -> nlp.vocab.set_vector(u"apple", array([...])) +> nlp.vocab.set_vector("apple", array([...])) > ``` | Name | Type | Description | @@ -202,8 +203,8 @@ Words can be looked up by string or hash value. > #### Example > > ```python -> if nlp.vocab.has_vector(u"apple"): -> vector = nlp.vocab.get_vector(u"apple") +> if nlp.vocab.has_vector("apple"): +> vector = nlp.vocab.get_vector("apple") > ``` | Name | Type | Description | @@ -282,9 +283,9 @@ Load state from a binary string. > #### Example > > ```python -> apple_id = nlp.vocab.strings[u"apple"] +> apple_id = nlp.vocab.strings["apple"] > assert type(apple_id) == int -> PERSON = nlp.vocab.strings[u"PERSON"] +> PERSON = nlp.vocab.strings["PERSON"] > assert type(PERSON) == int > ``` @@ -293,6 +294,7 @@ Load state from a binary string. | `strings` | `StringStore` | A table managing the string-to-int mapping. | | `vectors` 2 | `Vectors` | A table associating word IDs to word vectors. | | `vectors_length` | int | Number of dimensions for each word vector. | +| `lookups` | `Lookups` | The available lookup tables in this vocab. | | `writing_system` 2.1 | dict | A dict with information about the language's writing system. | ## Serialization fields {#serialization-fields} @@ -313,3 +315,4 @@ serialization by passing in the string names via the `exclude` argument. | `strings` | The strings in the [`StringStore`](/api/stringstore). | | `lexemes` | The lexeme data. | | `vectors` | The word vectors, if available. | +| `lookups` | The lookup tables, if available. | diff --git a/website/docs/images/displacy-ent-snek.html b/website/docs/images/displacy-ent-snek.html new file mode 100644 index 000000000..1e4920fb5 --- /dev/null +++ b/website/docs/images/displacy-ent-snek.html @@ -0,0 +1,18 @@ +
+ 🌱🌿 🐍 SNEK ____ 🌳🌲 ____ 👨‍🌾 HUMAN 🏘️ +
diff --git a/website/docs/usage/101/_named-entities.md b/website/docs/usage/101/_named-entities.md index 54db6dbe8..1ecaf9fe7 100644 --- a/website/docs/usage/101/_named-entities.md +++ b/website/docs/usage/101/_named-entities.md @@ -12,7 +12,7 @@ Named entities are available as the `ents` property of a `Doc`: import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion") +doc = nlp("Apple is looking at buying U.K. startup for $1 billion") for ent in doc.ents: print(ent.text, ent.start_char, ent.end_char, ent.label_) @@ -21,7 +21,7 @@ for ent in doc.ents: > - **Text:** The original entity text. > - **Start:** Index of start of entity in the `Doc`. > - **End:** Index of end of entity in the `Doc`. -> - **LabeL:** Entity label, i.e. type. +> - **Label:** Entity label, i.e. type. | Text | Start | End | Label | Description | | ----------- | :---: | :-: | ------- | ---------------------------------------------------- | diff --git a/website/docs/usage/101/_pipelines.md b/website/docs/usage/101/_pipelines.md index 68308a381..d33ea45fd 100644 --- a/website/docs/usage/101/_pipelines.md +++ b/website/docs/usage/101/_pipelines.md @@ -12,14 +12,14 @@ passed on to the next component. > - **Creates:** Objects, attributes and properties modified and set by the > component. -| Name | Component | Creates | Description | -| ------------- | ------------------------------------------------------------------ | ----------------------------------------------------------- | ------------------------------------------------ | -| **tokenizer** | [`Tokenizer`](/api/tokenizer) | `Doc` | Segment text into tokens. | -| **tagger** | [`Tagger`](/api/tagger) | `Doc[i].tag` | Assign part-of-speech tags. | -| **parser** | [`DependencyParser`](/api/dependencyparser) | `Doc[i].head`, `Doc[i].dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels. 
| -| **ner** | [`EntityRecognizer`](/api/entityrecognizer) | `Doc.ents`, `Doc[i].ent_iob`, `Doc[i].ent_type` | Detect and label named entities. | -| **textcat** | [`TextCategorizer`](/api/textcategorizer) | `Doc.cats` | Assign document labels. | -| ... | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx` | Assign custom attributes, methods or properties. | +| Name | Component | Creates | Description | +| ----------------- | ------------------------------------------------------------------ | ----------------------------------------------------------- | ------------------------------------------------ | +| **tokenizer** | [`Tokenizer`](/api/tokenizer) | `Doc` | Segment text into tokens. | +| **tagger** | [`Tagger`](/api/tagger) | `Doc[i].tag` | Assign part-of-speech tags. | +| **parser** | [`DependencyParser`](/api/dependencyparser) | `Doc[i].head`, `Doc[i].dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels. | +| **ner** | [`EntityRecognizer`](/api/entityrecognizer) | `Doc.ents`, `Doc[i].ent_iob`, `Doc[i].ent_type` | Detect and label named entities. | +| **textcat** | [`TextCategorizer`](/api/textcategorizer) | `Doc.cats` | Assign document labels. | +| ... | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx` | Assign custom attributes, methods or properties. | The processing pipeline always **depends on the statistical model** and its capabilities. For example, a pipeline can only include an entity recognizer @@ -49,6 +49,10 @@ them, its dependency predictions may be different. Similarly, it matters if you add the [`EntityRuler`](/api/entityruler) before or after the statistical entity recognizer: if it's added before, the entity recognizer will take the existing entities into account when making predictions. 
+The [`EntityLinker`](/api/entitylinker), which resolves named entities to +knowledge base IDs, should be preceded by +a pipeline component that recognizes entities such as the +[`EntityRecognizer`](/api/entityrecognizer).
diff --git a/website/docs/usage/101/_pos-deps.md b/website/docs/usage/101/_pos-deps.md index d86ee123d..9d04d6ffc 100644 --- a/website/docs/usage/101/_pos-deps.md +++ b/website/docs/usage/101/_pos-deps.md @@ -15,8 +15,8 @@ need to add an underscore `_` to its name: ### {executable="true"} import spacy -nlp = spacy.load('en_core_web_sm') -doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion') +nlp = spacy.load("en_core_web_sm") +doc = nlp("Apple is looking at buying U.K. startup for $1 billion") for token in doc: print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, @@ -45,7 +45,7 @@ for token in doc: | for | for | `ADP` | `IN` | `prep` | `xxx` | `True` | `True` | | \$ | \$ | `SYM` | `$` | `quantmod` | `$` | `False` | `False` | | 1 | 1 | `NUM` | `CD` | `compound` | `d` | `False` | `False` | -| billion | billion | `NUM` | `CD` | `probj` | `xxxx` | `True` | `False` | +| billion | billion | `NUM` | `CD` | `pobj` | `xxxx` | `True` | `False` | > #### Tip: Understanding tags and labels > diff --git a/website/docs/usage/101/_serialization.md b/website/docs/usage/101/_serialization.md index 828b796b3..01a9c39d1 100644 --- a/website/docs/usage/101/_serialization.md +++ b/website/docs/usage/101/_serialization.md @@ -13,9 +13,9 @@ file or a byte string. This process is called serialization. spaCy comes with > object to and from disk, but it's also used for distributed computing, e.g. > with > [PySpark](https://spark.apache.org/docs/0.9.0/python-programming-guide.html) -> or [Dask](http://dask.pydata.org/en/latest/). When you unpickle an object, -> you're agreeing to execute whatever code it contains. It's like calling -> `eval()` on a string – so don't unpickle objects from untrusted sources. +> or [Dask](https://dask.org). When you unpickle an object, you're agreeing to +> execute whatever code it contains. It's like calling `eval()` on a string – so +> don't unpickle objects from untrusted sources. All container classes, i.e. 
[`Language`](/api/language) (`nlp`), [`Doc`](/api/doc), [`Vocab`](/api/vocab) and [`StringStore`](/api/stringstore) diff --git a/website/docs/usage/101/_tokenization.md b/website/docs/usage/101/_tokenization.md index e5f3d3080..764f1e62a 100644 --- a/website/docs/usage/101/_tokenization.md +++ b/website/docs/usage/101/_tokenization.md @@ -9,7 +9,7 @@ tokens, and we can iterate over them: import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion") +doc = nlp("Apple is looking at buying U.K. startup for $1 billion") for token in doc: print(token.text) ``` diff --git a/website/docs/usage/101/_training.md b/website/docs/usage/101/_training.md index 61e047748..baf3a1891 100644 --- a/website/docs/usage/101/_training.md +++ b/website/docs/usage/101/_training.md @@ -20,7 +20,7 @@ difference, the more significant the gradient and the updates to our model. ![The training process](../../images/training.svg) When training a model, we don't just want it to memorize our examples – we want -it to come up with theory that can be **generalized across other examples**. +it to come up with a theory that can be **generalized across other examples**. After all, we don't just want the model to learn that this one instance of "Amazon" right here is a company – we want it to learn that "Amazon", in contexts _like this_, is most likely a company. That's why the training data diff --git a/website/docs/usage/101/_vectors-similarity.md b/website/docs/usage/101/_vectors-similarity.md index 2001d1481..73c35950f 100644 --- a/website/docs/usage/101/_vectors-similarity.md +++ b/website/docs/usage/101/_vectors-similarity.md @@ -48,8 +48,8 @@ norm, which can be used to normalize vectors. 
### {executable="true"} import spacy -nlp = spacy.load('en_core_web_md') -tokens = nlp(u'dog cat banana afskfsd') +nlp = spacy.load("en_core_web_md") +tokens = nlp("dog cat banana afskfsd") for token in tokens: print(token.text, token.has_vector, token.vector_norm, token.is_oov) @@ -88,8 +88,8 @@ definition of similarity. ### {executable="true"} import spacy -nlp = spacy.load('en_core_web_md') # make sure to use larger model! -tokens = nlp(u'dog cat banana') +nlp = spacy.load("en_core_web_md") # make sure to use larger model! +tokens = nlp("dog cat banana") for token1 in tokens: for token2 in tokens: diff --git a/website/docs/usage/adding-languages.md b/website/docs/usage/adding-languages.md index 374d948b2..94d75ea31 100644 --- a/website/docs/usage/adding-languages.md +++ b/website/docs/usage/adding-languages.md @@ -71,21 +71,19 @@ from the global rules. Others, like the tokenizer and norm exceptions, are very specific and will make a big difference to spaCy's performance on the particular language and training a language model. -| Variable | Type | Description | -| ----------------------------------------- | ----- | ---------------------------------------------------------------------------------------------------------- | -| `STOP_WORDS` | set | Individual words. | -| `TOKENIZER_EXCEPTIONS` | dict | Keyed by strings mapped to list of one dict per token with token attributes. | -| `TOKEN_MATCH` | regex | Regexes to match complex tokens, e.g. URLs. | -| `NORM_EXCEPTIONS` | dict | Keyed by strings, mapped to their norms. | -| `TOKENIZER_PREFIXES` | list | Strings or regexes, usually not customized. | -| `TOKENIZER_SUFFIXES` | list | Strings or regexes, usually not customized. | -| `TOKENIZER_INFIXES` | list | Strings or regexes, usually not customized. | -| `LEX_ATTRS` | dict | Attribute ID mapped to function. | -| `SYNTAX_ITERATORS` | dict | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. 
| -| `LOOKUP` | dict | Keyed by strings mapping to their lemma. | -| `LEMMA_RULES`, `LEMMA_INDEX`, `LEMMA_EXC` | dict | Lemmatization rules, keyed by part of speech. | -| `TAG_MAP` | dict | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. | -| `MORPH_RULES` | dict | Keyed by strings mapped to a dict of their morphological features. | +| Variable | Type | Description | +| ---------------------- | ----- | ---------------------------------------------------------------------------------------------------------- | +| `STOP_WORDS` | set | Individual words. | +| `TOKENIZER_EXCEPTIONS` | dict | Keyed by strings mapped to list of one dict per token with token attributes. | +| `TOKEN_MATCH` | regex | Regexes to match complex tokens, e.g. URLs. | +| `NORM_EXCEPTIONS` | dict | Keyed by strings, mapped to their norms. | +| `TOKENIZER_PREFIXES` | list | Strings or regexes, usually not customized. | +| `TOKENIZER_SUFFIXES` | list | Strings or regexes, usually not customized. | +| `TOKENIZER_INFIXES` | list | Strings or regexes, usually not customized. | +| `LEX_ATTRS` | dict | Attribute ID mapped to function. | +| `SYNTAX_ITERATORS` | dict | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. | +| `TAG_MAP` | dict | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. | +| `MORPH_RULES` | dict | Keyed by strings mapped to a dict of their morphological features. | > #### Should I ever update the global data? > @@ -213,9 +211,7 @@ spaCy's [tokenization algorithm](/usage/linguistic-features#how-tokenizer-works) lets you deal with whitespace-delimited chunks separately. This makes it easy to define special-case rules, without worrying about how they interact with the rest of the tokenizer. Whenever the key string is matched, the special-case rule -is applied, giving the defined sequence of tokens. 
You can also attach -attributes to the subtokens, covered by your special case, such as the subtokens -`LEMMA` or `TAG`. +is applied, giving the defined sequence of tokens. Tokenizer exceptions can be added in the following format: @@ -223,8 +219,8 @@ Tokenizer exceptions can be added in the following format: ### tokenizer_exceptions.py (excerpt) TOKENIZER_EXCEPTIONS = { "don't": [ - {ORTH: "do", LEMMA: "do"}, - {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}] + {ORTH: "do"}, + {ORTH: "n't", NORM: "not"}] } ``` @@ -233,41 +229,12 @@ TOKENIZER_EXCEPTIONS = { If an exception consists of more than one token, the `ORTH` values combined always need to **match the original string**. The way the original string is split up can be pretty arbitrary sometimes – for example `"gonna"` is split into -`"gon"` (lemma "go") and `"na"` (lemma "to"). Because of how the tokenizer +`"gon"` (norm "going") and `"na"` (norm "to"). Because of how the tokenizer works, it's currently not possible to split single-letter strings into multiple tokens. -Unambiguous abbreviations, like month names or locations in English, should be -added to exceptions with a lemma assigned, for example -`{ORTH: "Jan.", LEMMA: "January"}`. Since the exceptions are added in Python, -you can use custom logic to generate them more efficiently and make your data -less verbose. How you do this ultimately depends on the language. Here's an -example of how exceptions for time formats like "1a.m." and "1am" are generated -in the English -[`tokenizer_exceptions.py`](https://github.com/explosion/spaCy/tree/master/spacy/en/lang/tokenizer_exceptions.py): - -```python -### tokenizer_exceptions.py (excerpt) -# use short, internal variable for readability -_exc = {} - -for h in range(1, 12 + 1): - for period in ["a.m.", "am"]: - # always keep an eye on string interpolation! 
- _exc["%d%s" % (h, period)] = [ - {ORTH: "%d" % h}, - {ORTH: period, LEMMA: "a.m."}] - for period in ["p.m.", "pm"]: - _exc["%d%s" % (h, period)] = [ - {ORTH: "%d" % h}, - {ORTH: period, LEMMA: "p.m."}] - -# only declare this at the bottom -TOKENIZER_EXCEPTIONS = _exc -``` - > #### Generating tokenizer exceptions > > Keep in mind that generating exceptions only makes sense if there's a clearly @@ -275,7 +242,8 @@ TOKENIZER_EXCEPTIONS = _exc > This is not always the case – in Spanish for instance, infinitive or > imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In > cases like this, spaCy shouldn't be generating exceptions for _all verbs_. -> Instead, this will be handled at a later stage during lemmatization. +> Instead, this will be handled at a later stage after part-of-speech tagging +> and lemmatization. When adding the tokenizer exceptions to the `Defaults`, you can use the [`update_exc`](/api/top-level#util.update_exc) helper function to merge them @@ -292,33 +260,23 @@ custom one. from ...util import update_exc BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]} -TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]} +TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", NORM: "all"}]} tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) -# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]} +# {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]} ``` - - -Unlike verbs and common nouns, there's no clear base form of a personal pronoun. -Should the lemma of "me" be "I", or should we normalize person as well, giving -"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`, -which is used as the lemma for all personal pronouns. - - - ### Norm exceptions {#norm-exceptions new="2"} -In addition to `ORTH` or `LEMMA`, tokenizer exceptions can also set a `NORM` -attribute. This is useful to specify a normalized version of the token – for -example, the norm of "n't" is "not". 
By default, a token's norm equals its -lowercase text. If the lowercase spelling of a word exists, norms should always -be in lowercase. +In addition to `ORTH`, tokenizer exceptions can also set a `NORM` attribute. +This is useful to specify a normalized version of the token – for example, the +norm of "n't" is "not". By default, a token's norm equals its lowercase text. If +the lowercase spelling of a word exists, norms should always be in lowercase. > #### Norms vs. lemmas > > ```python -> doc = nlp(u"I'm gonna realise") +> doc = nlp("I'm gonna realise") > norms = [token.norm_ for token in doc] > lemmas = [token.lemma_ for token in doc] > assert norms == ["i", "am", "going", "to", "realize"] @@ -438,10 +396,10 @@ iterators: > #### Noun chunks example > > ```python -> doc = nlp(u"A phrase with another phrase occurs.") +> doc = nlp("A phrase with another phrase occurs.") > chunks = list(doc.noun_chunks) -> assert chunks[0].text == u"A phrase" -> assert chunks[1].text == u"another phrase" +> assert chunks[0].text == "A phrase" +> assert chunks[1].text == "another phrase" > ``` | Language | Code | Source | @@ -458,27 +416,50 @@ the quickest and easiest way to get started. The data is stored in a dictionary mapping a string to its lemma. To determine a token's lemma, spaCy simply looks it up in the table. 
Here's an example from the Spanish language data: -```python -### lang/es/lemmatizer.py (excerpt) -LOOKUP = { - "aba": "abar", - "ababa": "abar", - "ababais": "abar", - "ababan": "abar", - "ababanes": "ababán", - "ababas": "abar", - "ababoles": "ababol", - "ababábites": "ababábite" +```json +### lang/es/lemma_lookup.json (excerpt) +{ + "aba": "abar", + "ababa": "abar", + "ababais": "abar", + "ababan": "abar", + "ababanes": "ababán", + "ababas": "abar", + "ababoles": "ababol", + "ababábites": "ababábite" } ``` -To provide a lookup lemmatizer for your language, import the lookup table and -add it to the `Language` class as `lemma_lookup`: +#### Adding JSON resources {#lemmatizer-resources new="2.2"} + +As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the +new [`Lookups`](/api/lookups) class. This allows easier access to the data, +serialization with the models and file compression on disk (so your spaCy +installation is smaller). Resource files can be provided via the `resources` +attribute on the custom language subclass. All paths are relative to the +language data directory, i.e. the directory the language's `__init__.py` is in. ```python -lemma_lookup = LOOKUP +resources = { + "lemma_lookup": "lemmatizer/lemma_lookup.json", + "lemma_rules": "lemmatizer/lemma_rules.json", + "lemma_index": "lemmatizer/lemma_index.json", + "lemma_exc": "lemmatizer/lemma_exc.json", +} ``` +> #### Lookups example +> +> ```python +> table = nlp.vocab.lookups.get_table("my_table") +> value = table.get("some_key") +> ``` + +If your language needs other large dictionaries and resources, you can also add +those files here. The data will become available via a [`Lookups`](/api/lookups) +table in `nlp.vocab.lookups`, and you'll be able to access it from the tokenizer +or a custom pipeline component (via `doc.vocab.lookups`). 
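
The lookup approach described in this hunk (a JSON object mapping each string to its lemma, consulted by key) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not spaCy's implementation: `load_lookup_table` and `lookup_lemma` are hypothetical helper names, the table is a shortened copy of the Spanish excerpt above, and returning the input string for unknown words is an assumption of the sketch.

```python
import json

# Shortened copy of the Spanish lookup excerpt shown in the diff, as JSON text.
LEMMA_LOOKUP_JSON = """
{
    "aba": "abar",
    "ababa": "abar",
    "ababanes": "ababán",
    "ababoles": "ababol"
}
"""


def load_lookup_table(json_text):
    """Parse a JSON object mapping surface forms to lemmas (hypothetical helper)."""
    table = json.loads(json_text)
    if not isinstance(table, dict):
        raise ValueError("lookup table must be a JSON object")
    return table


def lookup_lemma(table, word):
    """Return the lemma listed for `word`, or `word` itself if it is not listed.

    Falling back to the input string is an assumption of this sketch.
    """
    return table.get(word, word)


table = load_lookup_table(LEMMA_LOOKUP_JSON)
lemmas = [lookup_lemma(table, w) for w in ["ababa", "ababanes", "gato"]]
print(lemmas)
```

In an actual spaCy v2.2 setup, the table would instead come from the file registered under `resources` and be read through `nlp.vocab.lookups.get_table(...)`, as in the "Lookups example" above.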
+ ### Tag map {#tag-map} Most treebanks define a custom part-of-speech tag scheme, striking a balance diff --git a/website/docs/usage/facts-figures.md b/website/docs/usage/facts-figures.md index a3683b668..40b39d871 100644 --- a/website/docs/usage/facts-figures.md +++ b/website/docs/usage/facts-figures.md @@ -26,7 +26,7 @@ Here's a quick comparison of the functionalities offered by spaCy, | Sentence segmentation | ✅ | ✅ | ✅ | | Dependency parsing | ✅ | ❌ | ✅ | | Entity recognition | ✅ | ✅ | ✅ | -| Entity linking | ❌ | ❌ | ❌ | +| Entity linking | ✅ | ❌ | ❌ | | Coreference resolution | ❌ | ❌ | ✅ | ### When should I use what? {#comparison-usage} diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md index 1ffd0de0d..1d6c0574c 100644 --- a/website/docs/usage/index.md +++ b/website/docs/usage/index.md @@ -392,7 +392,7 @@ from is called `spacy`. So, when using spaCy, never call anything else `spacy`. ```python -doc = nlp(u"They are") +doc = nlp("They are") print(doc[0].lemma_) # -PRON- ``` diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index 66ad816f5..4128fa73f 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -69,7 +69,6 @@ of the two. The system works as follows: morphological information, without consulting the context of the token. The lemmatizer also accepts list-based exception files, acquired from [WordNet](https://wordnet.princeton.edu/). - ## Dependency Parsing {#dependency-parse model="parser"} @@ -93,7 +92,7 @@ get the noun chunks in a document, simply iterate over import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers") +doc = nlp("Autonomous cars shift insurance liability toward manufacturers") for chunk in doc.noun_chunks: print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text) @@ -124,7 +123,7 @@ get the string value with `.dep_`. 
import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers") +doc = nlp("Autonomous cars shift insurance liability toward manufacturers") for token in doc: print(token.text, token.dep_, token.head.text, token.head.pos_, [child for child in token.children]) @@ -161,7 +160,7 @@ import spacy from spacy.symbols import nsubj, VERB nlp = spacy.load("en_core_web_sm") -doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers") +doc = nlp("Autonomous cars shift insurance liability toward manufacturers") # Finding a verb with a subject from below — good verbs = set() @@ -204,7 +203,7 @@ children. import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp(u"bright red apples on the tree") +doc = nlp("bright red apples on the tree") print([token.text for token in doc[2].lefts]) # ['bright', 'red'] print([token.text for token in doc[2].rights]) # ['on'] print(doc[2].n_lefts) # 2 @@ -216,7 +215,7 @@ print(doc[2].n_rights) # 1 import spacy nlp = spacy.load("de_core_news_sm") -doc = nlp(u"schöne rote Äpfel auf dem Baum") +doc = nlp("schöne rote Äpfel auf dem Baum") print([token.text for token in doc[2].lefts]) # ['schöne', 'rote'] print([token.text for token in doc[2].rights]) # ['auf'] ``` @@ -240,7 +239,7 @@ sequence of tokens. You can walk up the tree with the import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp(u"Credit and mortgage account holders must submit their requests") +doc = nlp("Credit and mortgage account holders must submit their requests") root = [token for token in doc if token.head == token][0] subject = list(root.lefts)[0] @@ -270,7 +269,7 @@ end-point of a range, don't forget to `+1`! 
import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp(u"Credit and mortgage account holders must submit their requests") +doc = nlp("Credit and mortgage account holders must submit their requests") span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1] with doc.retokenize() as retokenizer: retokenizer.merge(span) @@ -311,7 +310,7 @@ import spacy from spacy import displacy nlp = spacy.load("en_core_web_sm") -doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers") +doc = nlp("Autonomous cars shift insurance liability toward manufacturers") # Since this is an interactive Jupyter environment, we can use displacy.render here displacy.render(doc, style='dep') ``` @@ -336,7 +335,7 @@ the `nlp` object. ```python nlp = spacy.load("en_core_web_sm", disable=["parser"]) nlp = English().from_disk("/model", disable=["parser"]) -doc = nlp(u"I don't want parsed", disable=["parser"]) +doc = nlp("I don't want parsed", disable=["parser"]) ``` @@ -350,10 +349,10 @@ Language class via [`from_disk`](/api/language#from_disk). ```diff + nlp = spacy.load("en_core_web_sm", disable=["parser"]) -+ doc = nlp(u"I don't want parsed", disable=["parser"]) ++ doc = nlp("I don't want parsed", disable=["parser"]) - nlp = spacy.load("en_core_web_sm", parser=False) -- doc = nlp(u"I don't want parsed", parse=False) +- doc = nlp("I don't want parsed", parse=False) ``` @@ -398,7 +397,7 @@ on a token, it will return an empty string. 
import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp(u"San Francisco considers banning sidewalk delivery robots") +doc = nlp("San Francisco considers banning sidewalk delivery robots") # document level ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] @@ -407,8 +406,8 @@ print(ents) # token level ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_] ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_] -print(ent_san) # [u'San', u'B', u'GPE'] -print(ent_francisco) # [u'Francisco', u'I', u'GPE'] +print(ent_san) # ['San', 'B', 'GPE'] +print(ent_francisco) # ['Francisco', 'I', 'GPE'] ``` | Text | ent_iob | ent_iob\_ | ent_type\_ | Description | @@ -435,18 +434,17 @@ import spacy from spacy.tokens import Span nlp = spacy.load("en_core_web_sm") -doc = nlp(u"FB is hiring a new Vice President of global policy") +doc = nlp("FB is hiring a new Vice President of global policy") ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] print('Before', ents) # the model didn't recognise "FB" as an entity :( -ORG = doc.vocab.strings[u"ORG"] # get hash value of entity label -fb_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity +fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity doc.ents = list(doc.ents) + [fb_ent] ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] print('After', ents) -# [(u'FB', 0, 2, 'ORG')] 🎉 +# [('FB', 0, 2, 'ORG')] 🎉 ``` Keep in mind that you need to create a `Span` with the start and end index of @@ -468,13 +466,13 @@ import spacy from spacy.attrs import ENT_IOB, ENT_TYPE nlp = spacy.load("en_core_web_sm") -doc = nlp.make_doc(u"London is a big city in the United Kingdom.") +doc = nlp.make_doc("London is a big city in the United Kingdom.") print("Before", doc.ents) # [] header = [ENT_IOB, ENT_TYPE] attr_array = numpy.zeros((len(doc), len(header))) attr_array[0, 0] = 3 # B -attr_array[0, 1] = doc.vocab.strings[u"GPE"] +attr_array[0, 1] = 
doc.vocab.strings["GPE"] doc.from_array(header, attr_array) print("After", doc.ents) # [London] ``` @@ -533,8 +531,8 @@ train_data = [ ``` ```python -doc = Doc(nlp.vocab, [u"rats", u"make", u"good", u"pets"]) -gold = GoldParse(doc, entities=[u"U-ANIMAL", u"O", u"O", u"O"]) +doc = Doc(nlp.vocab, ["rats", "make", "good", "pets"]) +gold = GoldParse(doc, entities=["U-ANIMAL", "O", "O", "O"]) ``` @@ -565,7 +563,7 @@ For more details and examples, see the import spacy from spacy import displacy -text = u"When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously." +text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously." nlp = spacy.load("en_core_web_sm") doc = nlp(text) @@ -576,6 +574,52 @@ import DisplacyEntHtml from 'images/displacy-ent2.html'